TY  - JOUR
T1  - Characterizing regulatory sequence features that discriminate between overlapping annotation labels
JF  - bioRxiv
DO  - 10.1101/100511
SP  - 100511
AU  - Akshay Kakumanu
AU  - Silvia Velasco
AU  - Esteban Mazzoni
AU  - Shaun Mahony
Y1  - 2017/01/01
UR  - http://biorxiv.org/content/early/2017/01/15/100511.abstract
N2  - Genomic loci with regulatory potential can be identified and annotated with various labels. For example, sites may be annotated as being bound or unbound by a transcription factor (TF) under particular cellular conditions, or as being proximal or distal to known transcription start sites. Given such a collection of labeled genomic sites, it is natural to ask what sequence features are associated with each annotation label. However, discovering such label-specific sequence features is often confounded by uneven overlaps between annotation labels. In order to meet this challenge, we developed SeqUnwinder, a principled approach to deconvolving interpretable discriminative sequence features associated with overlapping annotation labels. We demonstrate the novel analysis abilities of SeqUnwinder using three examples. Firstly, we show SeqUnwinder’s ability to unravel sequence features associated with the dynamic binding behavior of TFs during motor neuron programming from features associated with chromatin state in the initial embryonic stem cells. Secondly, we demonstrate that multi-condition TF binding sites are typically characterized by better quality instances of the TF’s cognate binding motifs. Finally, we demonstrate the scalability of SeqUnwinder to discover cell-specific sequence features from over one hundred thousand genomic loci that display DNase I hypersensitivity in one or more ENCODE cell lines. Availability: https://github.com/seqcode/sequnwinder
ER  - 