ABSTRACT
Cellular identity relies on cell type-specific gene expression profiles controlled by cis-regulatory elements (CREs), such as promoters, enhancers and anchors of chromatin interactions. CREs are unevenly distributed across the genome, giving rise to distinct subsets such as individual CREs and Clusters Of cis-Regulatory Elements (COREs), also known as super-enhancers. Identifying COREs is challenging because technical and biological features introduce variability in the distribution of distances between CREs within a given dataset. To address this issue, we developed a new unsupervised machine learning approach termed Clustering of genomic REgions Analysis Method (CREAM). We demonstrate that COREs identified by CREAM are predictive of cell identity, consist of CREs strongly bound by master transcription factors according to ChIP-seq signal intensity, and are proximal to highly expressed genes. We further show that COREs identified by CREAM are preferentially found near genes essential for cell growth. Overall, CREAM improves on the state of the art in identifying COREs with biological function. CREAM is available as an open source R package (https://CRAN.R-project.org/package=CREAM) to identify COREs from cis-regulatory annotation datasets from any biological sample.