ABSTRACT
Cellular identity relies on cell type-specific gene expression profiles controlled by cis-regulatory elements (CREs), such as promoters, enhancers and anchors of chromatin interactions. CREs are unevenly distributed across the genome, giving rise to distinct subsets such as individual CREs and Clusters Of cis-Regulatory Elements (COREs), also known as super-enhancers. Identifying COREs is a challenge due to technical and biological features that entail variability in the distribution of distances between CREs within a given dataset. To address this issue, we developed a new unsupervised machine learning approach termed Clustering of genomic REgions Analysis Method (CREAM). We demonstrate that COREs identified by CREAM are predictive of cell identity, consist of CREs strongly bound by master transcription factors according to ChIP-seq signal intensity, and are proximal to highly expressed genes. We further show that COREs identified by CREAM are preferentially found near genes essential for cell growth. Overall, CREAM offers an improved method compared to the state-of-the-art to identify COREs of biological function. CREAM is available as an open source R package (https://CRAN.R-project.org/package=CREAM) to identify COREs from cis-regulatory annotation datasets from any biological sample.
BACKGROUND
Over 98% of the human genome consists of sequences lying outside of gene coding regions that harbor functional features, including cis-regulatory elements (CREs), important in defining cellular identity [1]. CREs such as enhancers, promoters and anchors of chromatin interactions, are predicted to cover 20-40% of the noncoding genomic landscape [2]. CREs define cell type identity by establishing lineage-specific gene expression profiles [3–5]. Current methods to annotate CREs in any given biological sample include ChIP-seq for histone modifications (e.g., H3K27ac, H3K4me3, H3K4me1) [3,4,6], for chromatin binding protein (e.g., MED1, P300) [6,7] or through chromatin accessibility assays (e.g., DNase-seq, ATAC-seq) [8,9].
Clusters Of cis-Regulatory Elements (COREs) were recently introduced as a subset of CREs based on different parameters including close proximity to each other [7,10–12]. COREs are significantly associated with cell identity and are bound with higher intensity by transcription factors than individual CREs [7,11,13]. Furthermore, inherited risk-associated loci preferentially map to COREs from disease-related cell types [10,14–16]. Finally, COREs found in cancer cells lie proximal to oncogenic driver genes [17–19]. Together, these features showcase the utility of classifying CREs into individual CREs versus COREs.
Recent work assessed the role of COREs as a collection of individual CREs proximal to each other as opposed to a community of synergizing CREs [20–22]. Partial redundancy between effect of individual CREs versus a super-enhancer/CORE on regulating expression of genes in embryonic stem cells was observed [20] as well as low synergy between the individual CREs within COREs [23]. Whether COREs provide an added value over individual CREs to gene expression is still debated. Conclusions may be confounded by the simplistic approach commonly used to identify COREs. For instance, the distance between CREs is a critical feature that distinguishes COREs from individual CREs. Available methods to identify COREs dismiss the variability in the distribution of distances between CREs that stems from technical and biological features unique to each CRE dataset. Instead, arbitrary thresholds are considered including 1) a fixed stitching distance limit between CREs (such as 12.5 [7] or 20 [12] kilobases) to report them within a CORE, 2) a fixed cutoff in the ChIP-seq signal intensity from the assay used to identify CREs to separate COREs from individual CREs [7], or 3) reporting an individual CRE with high signal intensity as a CORE [7,24]. To address these limitations, we developed a new methodology termed CREAM (Clustering of genomic REgions Analysis Method) (Fig. 1). CREAM is an unsupervised machine learning approach that takes into account the distribution of distances between CREs in a given biological sample.
Benchmarking CREAM against Rank Ordering of Super-Enhancers (ROSE) [11], the current standard method to call COREs, we demonstrate that CREAM identifies COREs predictive of cell identity, proximal to highly expressed genes and associated with high intensity transcription factor binding. We further demonstrate the utility of COREs identified by CREAM as chromatin regions associated with genes essential for the growth of cancer cells.
RESULTS
We compared the number and width of the COREs identified by CREAM and ROSE using GM12878 and K562 cell lines. We focused on these cell lines because of their extensive characterization by the ENCODE project for DNA-protein interactions, namely ChIP-seq profiles for over 80 transcription factors. This provides a unique opportunity to assess the biological relevance of COREs identified in each cell line. CREAM identified a total of 1,694 and 4,968 COREs in GM12878 and K562 cell lines, respectively, based on their DNase-seq defined CREs. These COREs account for 14.6% and 17.2% of all CREs reported by the DNase-seq profiles from these cells. In contrast, ROSE identifies 2,490 and 2,527 COREs in GM12878 and K562, respectively. These account for 31% and 30% of the CREs detected in GM12878 and K562 cell lines. To determine if CREAM identifies new COREs or simply subdivides those reported by ROSE, we assessed the exclusivity of the COREs identified by CREAM and ROSE in GM12878 or K562 cell lines. The CREAM-identified COREs have 85% and 49% shared genomic regions with ROSE-identified COREs in GM12878 and K562 cell lines, respectively (Fig. 2A). However, shared genomic regions between ROSE and CREAM-identified COREs account only for 14% and 8% of the total identified COREs by ROSE in GM12878 and K562 cell lines, respectively. Hence, while many COREs are identified by both methods, CREAM and ROSE differ sufficiently that a number of COREs are uniquely identified by each method. Moreover, ROSE-identified COREs occupy significantly larger genomic regions (average 138 kb width) than those identified by CREAM (average 5kb width) (Fig. 2B). We therefore compared COREs identified by CREAM and ROSE according to a series of biological features previously shown to discriminate COREs from individual CREs.
DNase I hypersensitive signal is elevated within COREs
COREs are reported to associate with higher levels of binding for a wide range of chromatin binding proteins [11]. Our results show that COREs identified by CREAM have 2- to 5-fold higher average DNase I hypersensitivity signal per base pair (bp) compared to individual CREs (Fig. 2C). This is in contrast to COREs identified by ROSE, which show DNase I hypersensitivity equivalent to that of individual CREs in K562 cells and less than a 1.5-fold increase in GM12878 cells (Fig. 2C). The distinct behavior of CREAM- and ROSE-identified COREs could be due to their size difference, which translates into more base pairs free of CREs (CRE-free gaps) in ROSE-identified COREs (102 Mbp and 208 Mbp in GM12878 and K562 cells, respectively) compared to CREAM-identified COREs (14.7 Mbp and 18.7 Mbp in GM12878 and K562 cells, respectively) (Fig. 2D). This stems from the permissive <12.5 kb distance criterion between CREs in ROSE. In contrast, CREAM uses a learned maximum distance limit between CREs, resulting in smaller average maximum distances (<1.7 kb in GM12878 and <1 kb in K562 cells for their respective DNase-seq delineated CREs).
CREAM-identified COREs are proximal to highly expressed genes
In agreement with previous reports [7,11], COREs identified by ROSE are proximal to genes expressed at higher levels than those near individual CREs (Fig. 2E). This also applies to COREs identified by CREAM in both GM12878 (>4-fold difference) and K562 cell lines (>2.5-fold difference) (Fig. 2E). Notably, genes proximal to CREAM-identified COREs have 1.5-fold higher expression levels than genes proximal to ROSE-called COREs (p < 0.001) (Fig. 2E). We further assessed expression of genes in proximity of COREs specific to CREAM and ROSE. Expression of genes in proximity of CREAM-specific COREs was significantly higher than that of genes in proximity of ROSE-specific COREs in both GM12878 and K562 cell lines (p < 0.001) (Supplementary Fig. 1). Moreover, comparing the percentage of COREs that overlap with promoters, exons, introns, and intergenic regions reveals very similar distributions for COREs identified by CREAM or ROSE in GM12878 and K562 cell lines (Supplementary Fig. 2). Taken together, our results show that COREs identified by CREAM share similarities with those identified by ROSE in terms of genomic distribution but are associated with stronger differences in gene expression compared to individual CREs.
CREAM identifies COREs bound by master transcription factors
Transcription factors bind to CREs to establish cell type-specific gene expression patterns [29,30]. COREs were previously found to associate with strong transcription factor binding intensity based on ChIP-seq signal [11]. Hence, we assessed transcription factor binding intensities within COREs using the extensive characterization of transcription factor binding profiles performed by the ENCODE project in GM12878 and K562 cells [31]. We find that more than 25% of transcription factors show significantly higher binding intensity over CREAM-identified COREs compared to individual CREs in both GM12878 and K562 cell lines (FC > 2; FDR < 0.001) (Fig. 3A). In contrast, less than 15% of all transcription factors bind with higher intensity in ROSE-identified COREs compared to individual CREs in both GM12878 and K562 cell lines (FC > 2; FDR < 0.001) (Fig. 3A).
The difference in transcription factor binding intensity at CREAM- versus ROSE-identified COREs is showcased by the master transcription factors TCF3 and EBF1 [25] in GM12878 cells and GABP and CREB1 [26–28] in K562 cells. Indeed, over a 2-fold difference in binding intensity of TCF3 and EBF1 is observed for CREAM-identified COREs compared to individual CREs in GM12878 cells (Fig. 3B). This is exemplified over the CORE proximal to the ZFAT gene in GM12878 cells (Fig. 3C). Similarly, over a 3-fold difference in GABP and CREB1 binding intensity is observed over COREs compared to individual CREs in K562 cells (Fig. 3B) and exemplified at the 7q36 locus harboring a series of COREs bound strongly by GABP and CREB1 in K562 cells (Fig. 3C).
The binding intensity of transcription factors over COREs was calculated as the average ChIP-seq signal within each CORE. We assessed whether the difference in enrichment of transcription factor binding intensity between CREAM- and ROSE-identified COREs is merely due to the difference in their burden of CRE-free gaps. To this end, we calculated the transcription factor binding intensity excluding the CRE-free gaps within ROSE-identified COREs (Supplementary Fig. 3). More than 25% of the transcription factors still have significantly higher binding intensity within CREAM-identified COREs compared to the signal over the CREs in COREs identified by ROSE (FC > 2; FDR < 0.001).
Generalizability of CORE identification across cell and tissue types
ROSE-identified COREs were reported to discriminate cell types [7]. We therefore assessed the predictive value of CREAM-identified COREs to discriminate cellular identity. Running CREAM on the DNase-seq defined CREs from 102 cell lines provided by the ENCODE project [31] reveals between 1,022 and 7,597 COREs per cell line (Fig. 4A). The number of COREs correlates with the total number of CREs identified in each cell line (Fig. 4B). However, the average width of COREs across cell lines shows low correlation with the total number of CREs (|Spearman's ρ| < 0.25; Fig. 4C). Hence, CORE widths are specific to each biological sample irrespective of the total number of CREs.
To test whether COREs can discriminate cells with respect to their tissue source and their malignant status, we clustered the ENCODE cell lines based on their CREAM- and ROSE-identified COREs (Fig. 4D). Predicting each cell line based on its nearest neighbor, we could classify tissue source with high accuracy using CREAM-identified COREs (Matthews correlation coefficient [MCC] of 0.90 for tissues with ≥ 5 cell lines; Fig. 4E). In contrast, ROSE-identified COREs yielded substantially lower predictive value (MCC of 0.74; Fig. 4E). Similarly, CREAM-identified COREs were more discriminative of non-malignant versus malignant cell lines than ROSE-identified COREs (MCC of 0.80 versus 0.67 for CREAM and ROSE, respectively; Fig. 4F).
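For readers unfamiliar with the metric, the following is a minimal sketch of the Matthews correlation coefficient for a binary task such as malignant versus non-malignant. The function name and the confusion-matrix counts are hypothetical illustrations, not values from this study, and the multi-class tissue classification would need a generalized form that is not shown here.

```python
import math

def matthews_cc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention: MCC is reported as 0 when a row or column is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator

# Hypothetical counts, for illustration only.
print(round(matthews_cc(tp=45, tn=40, fp=10, fn=5), 3))
```

An MCC of 1 indicates perfect prediction, 0 is no better than chance, and -1 is total disagreement.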
CREAM-identified COREs are proximal to essential genes
COREs are reported to lie in proximity to genes essential for self-renewal and pluripotency of stem cells [32]. A CRISPR/Cas9 gene essentiality screen was recently reported in K562 cells by Wang et al. (2015) [33]. Merging these genomic screening data with CORE identification from K562 cells reveals a significant enrichment of genes essential for growth proximal to CREAM-identified COREs (FDR < 1e-4; Fig. 5A). BCR is the top essential gene in proximity of CREAM-identified COREs. The oncogenic BCR-ABL gene fusion plays an essential role in the pathogenesis of Chronic Myelogenous Leukemia, which is the tumor of origin of the K562 cell line [34].
In contrast, genes proximal to individual CREs or ROSE-identified COREs are not enriched with essential genes (CRE: FDR=0.26; ROSE-identified CORE: FDR=0.92; Fig. 5B). Moreover, expression of genes essential for growth in K562 proximal to CREAM-identified COREs is significantly higher than expression of the essential genes associated with individual CREs or ROSE-identified COREs (p < 0.001; Fig. 5C). Hence, CREAM identifies COREs associated with essential genes in the K562 cell line. To further assess the specificity of the association between COREs and essential genes, we extended our analysis to essentiality scores from other model cell lines tested by Wang et al. (2015) [33]. Essentiality scores of genes proximal to K562 CREAM-identified COREs were significantly less negative in KBM-7, Jiyoye, and Raji cell lines than in K562 cells (FDR<0.001; Fig. 5D). This supports the cell type-specific nature of COREs and their association with essential genes.
CONCLUSIONS
The state-of-the-art approach for CORE calling (ROSE) dismisses the variability in the distribution of distances between CREs, which is unique to each CRE dataset. It is therefore limited by a fixed threshold of 12.5 to 20 kilobases in distance between CREs within COREs, and by a fixed cutoff in the ChIP-seq signal intensity from the assay used to identify CREs to separate COREs from individual CREs [11,12]. To overcome these limitations, we developed CREAM as an unsupervised machine learning method providing a systematic approach for identifying COREs.
Here, we show that CREAM identifies COREs that (i) have higher transcription factor binding intensity than individual CREs, (ii) are associated with the identity of normal and cancer cell lines, and (iii) have a significantly higher probability of lying near genes essential for cell growth compared to the rest of the epigenetic landscape. Hence, CREAM can open a new avenue of research for personalized therapeutic identification in the clinical cancer setting. Taken together, our results show that CREAM can be used to further characterize the cis-regulatory landscapes of cells.
METHODS
CREAM
CREAM uses genome-wide maps of cis-regulatory elements (CREs) in the tissue or cell type of interest, such as those generated from chromatin-based assays including DNase-seq, ATAC-seq or ChIP-seq. CREs can be identified from these profiles by peak calling tools such as MACS [35]. The called individual CREs are then used as input to CREAM. Hence, CREAM does not need the signal intensity files (BAM, FASTQ) as input. CREAM considers the proximity of the CREs within each sample to adjust the parameters for inclusion of CREs into a CORE in the following steps (Fig. 1):
Step 1: Clustering of individual CREs throughout genome
CREAM initially groups neighboring individual CREs throughout the genome. Each group (or cluster) can contain a different number of individual CREs. CREAM then categorizes the clusters by the number of CREs they contain. We define the Order (O) of a cluster as the number of CREs it contains. In the next steps, CREAM identifies the maximum allowed distance between individual CREs for COREs of a given O.
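The grouping step can be illustrated with a minimal Python sketch. It assumes that a cluster of Order O is any run of O consecutive CREs on one chromosome and that its window size is the largest gap between neighboring CREs in the run. The function name `window_sizes` and the toy coordinates are hypothetical and are not taken from the CREAM package.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) of one CRE on a single chromosome

def window_sizes(cres: List[Interval], order: int) -> List[int]:
    """Window size of every run of `order` consecutive CREs.

    Assumption: the window size of a run is the largest gap between
    neighboring CREs inside that run.
    """
    if order < 2:
        raise ValueError("order must be at least 2")
    cres = sorted(cres)
    # Gap between each CRE and the next one (0 if they touch or overlap).
    gaps = [max(0, nxt[0] - cur[1]) for cur, nxt in zip(cres, cres[1:])]
    # A run of `order` CREs spans `order - 1` consecutive gaps.
    span = order - 1
    return [max(gaps[i:i + span]) for i in range(len(gaps) - span + 1)]

toy_cres = [(100, 200), (250, 300), (1300, 1400), (1450, 1500)]
print(window_sizes(toy_cres, 2))  # [50, 1000, 50]
print(window_sizes(toy_cres, 3))  # [1000, 1000]
```

Collecting these window sizes per Order gives the distributions on which the following steps operate.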
Step 2: Maximum window size identification
We define the maximum window size (MWS) as the maximum allowed distance between neighboring individual CREs within a CORE. For each Order, CREAM builds a distribution of window sizes (WS), where the window size of a cluster is the maximum distance between neighboring individual CREs in that cluster, over all clusters of that Order within the genome. MWS is then derived from Q1(log(WS)) and IQ(log(WS)), the first quartile and the interquartile range of the distribution of log-transformed window sizes (Fig. 1).
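Because the exact expression for MWS is not reproduced in the text here, the sketch below only illustrates one plausible way of combining the two named quantities. It assumes a Tukey-style lower fence on the log-transformed window sizes, Q1 minus k times IQ, with a hypothetical multiplier `k` defaulting to 1.5. The function name and the toy values are likewise illustrative assumptions.

```python
import math
import statistics
from typing import Sequence

def max_window_size(window_sizes: Sequence[float], k: float = 1.5) -> float:
    """Assumed MWS: exp(Q1(log WS) - k * IQ(log WS)).

    Both the form of the fence and the multiplier `k` are assumptions
    made for illustration.
    """
    logs = [math.log(ws) for ws in window_sizes if ws > 0]
    q1, _, q3 = statistics.quantiles(logs, n=4)  # quartiles of log(WS)
    iq = q3 - q1                                  # interquartile range
    return math.exp(q1 - k * iq)

toy_ws = [100, 200, 400, 800, 1600]
print(max_window_size(toy_ws))        # stringent threshold
print(max_window_size(toy_ws, k=0))   # equals the first quartile of WS
```

Working in log space keeps the threshold positive and reduces the influence of the long right tail that distance distributions typically show.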
Step 3: Maximum Order identification
After determining MWS for each Order of COREs, CREAM identifies the maximum O (Omax) for the given sample. As the O of COREs increases, the individual CREs should be allowed to lie further from each other as a result of the gain of information within the clusters. Hence, starting from COREs of O=2, O is increased up to a plateau at which an increase in O does not result in an increase in MWS. This threshold is considered the maximum O (Omax) for COREs within the given sample.
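The plateau search can be sketched as a simple loop. This assumes that MWS has already been computed for each Order and that Omax is the last Order at which MWS still increased; whether the plateau Order itself is included is an interpretation, and the names and numbers below are hypothetical.

```python
from typing import Dict

def max_order(mws_by_order: Dict[int, float]) -> int:
    """Largest Order reached before MWS stops increasing, starting from O=2."""
    order = 2
    while (order + 1 in mws_by_order
           and mws_by_order[order + 1] > mws_by_order[order]):
        order += 1
    return order

toy_mws = {2: 300.0, 3: 450.0, 4: 520.0, 5: 520.0, 6: 510.0}
print(max_order(toy_mws))  # 4
```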
Step 4: CORE calling
CREAM identifies COREs starting from Omax and proceeding down to O=2. For each O, it calls clusters with a window size smaller than MWS as COREs. As a result, many COREs of lower O are contained within COREs of higher O. The remaining low-O COREs, for example those of O=2 or 3, therefore have individual CREs separated by distances close to MWS (Fig. 1). These clusters may have been called as COREs only because the initial distribution from which MWS was derived is dominated by clusters of the same O that are contained within COREs of higher O. Hence, CREAM eliminates these low-O COREs as follows.
Step 5: Minimum Order identification
COREs that contain individual CREs separated by distances close to MWS can be called as COREs because of the high skewness in the initial distribution of window sizes. To avoid reporting these COREs, CREAM filters out the clusters of low Order (O < Omin) that do not follow the monotonic increase of the maximum distance between individual CREs with O (Fig. 1).
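Steps 4 and 5 can be illustrated together with a minimal greedy sketch. It assumes that each CRE is assigned to at most one CORE, that higher Orders take precedence, and that Omin is supplied by the caller rather than derived from the monotonicity criterion. The function `call_cores` and its arguments are hypothetical and do not mirror the CREAM package interface.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]

def call_cores(cres: List[Interval], mws_by_order: Dict[int, float],
               o_max: int, o_min: int = 2) -> List[Tuple[int, int, int]]:
    """Call COREs from o_max down to o_min; returns (start, end, order)."""
    if o_min < 2:
        raise ValueError("o_min must be at least 2")
    cres = sorted(cres)
    gaps = [max(0, b[0] - a[1]) for a, b in zip(cres, cres[1:])]
    used = [False] * len(cres)  # CREs already absorbed by a higher-order CORE
    cores = []
    for order in range(o_max, o_min - 1, -1):
        for i in range(len(cres) - order + 1):
            members = range(i, i + order)
            if any(used[j] for j in members):
                continue
            # Window size of this run: its largest internal gap.
            if max(gaps[i:i + order - 1]) < mws_by_order[order]:
                for j in members:
                    used[j] = True
                cores.append((cres[i][0], cres[i + order - 1][1], order))
    return sorted(cores)

toy_cres = [(0, 100), (150, 250), (300, 400),
            (5000, 5100), (5200, 5300), (9000, 9100)]
toy_mws = {2: 150.0, 3: 100.0}
print(call_cores(toy_cres, toy_mws, o_max=3))
# [(0, 400, 3), (5000, 5300, 2)]
```

Raising `o_min` to 3 in this toy example drops the two-CRE cluster, which mimics the effect of the minimum Order filter.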
ROSE
ROSE clusters neighboring individual CREs in a given sample if they are separated by less than 12.5 kb. It subsequently computes the signal overlapping the clusters and sorts the clusters by their signal intensity. It then stratifies the clusters based on the inflection point in the sorted clusters and calls the clusters with signal intensity higher than the inflection point as super-enhancers (or COREs). This method is comprehensively explained in Whyte et al. (2013) [11]. We ran ROSE using the default parameters.
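The fixed-distance stitching described for ROSE can be sketched as follows. This is an illustrative reimplementation of the stitching idea only, not the ROSE code itself; the signal ranking and cutoff are not shown, and the function name and toy coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]

def stitch(cres: List[Interval], max_gap: int = 12_500) -> List[Interval]:
    """Merge neighboring CREs separated by less than `max_gap` base pairs."""
    stitched: List[Interval] = []
    for start, end in sorted(cres):
        if stitched and start - stitched[-1][1] < max_gap:
            prev_start, prev_end = stitched[-1]
            stitched[-1] = (prev_start, max(prev_end, end))
        else:
            stitched.append((start, end))
    return stitched

toy_cres = [(0, 1_000), (5_000, 6_000), (30_000, 31_000)]
print(stitch(toy_cres))  # [(0, 6000), (30000, 31000)]
```

Unlike the learned MWS, `max_gap` here is the same for every sample and every cluster size, which is the limitation discussed in the Background.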
Genomic overlap of COREs
Bedtools (version 2.23.0) is used to identify unique and shared genomic coverage between CREAM and ROSE-identified COREs.
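The quantity computed with Bedtools can be illustrated in plain Python. The sketch below assumes intervals on a single chromosome and measures shared coverage in base pairs; the function names and coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]

def merge(intervals: List[Interval]) -> List[Interval]:
    """Collapse overlapping or touching intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def shared_bp(a: List[Interval], b: List[Interval]) -> int:
    """Number of base pairs covered by both interval sets."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total

cream_cores = [(100, 200), (500, 600)]
rose_cores = [(150, 550)]
coverage = sum(end - start for start, end in merge(cream_cores))
print(shared_bp(cream_cores, rose_cores) / coverage)  # 0.5
```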
Comparison of COREs identified by CREAM and ROSE and single enhancers
First, signals (either DNase I hypersensitivity or ChIP-seq) over the identified COREs (or individual CREs) and their 1 kb flanking regions were extracted from the BAM files. Each CORE (or individual CRE) was then divided into 100 bins of equal size. Each left and right flanking region was also divided into 100 bins of equal size. Hence, a total of 300 bins was obtained for each CORE plus its flanking regions. We then scaled the signal in these regions to the library size of the mapped reads. Finally, a Savitzky-Golay filter was applied to remove high-frequency noise from the data while preserving the original shape of the data [36,37].
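The binning step can be sketched as follows, assuming a per-base signal vector for one region and taking the mean signal per bin. The function name is hypothetical, and the input is a toy vector rather than signal extracted from a BAM file.

```python
from typing import List, Sequence

def bin_signal(signal: Sequence[float], n_bins: int = 100) -> List[float]:
    """Average a per-base signal into `n_bins` bins of near-equal width."""
    n = len(signal)
    if n < n_bins:
        raise ValueError("region is shorter than the number of bins")
    # Integer bin edges; every bin holds at least one position.
    edges = [i * n // n_bins for i in range(n_bins + 1)]
    return [sum(signal[a:b]) / (b - a) for a, b in zip(edges, edges[1:])]

toy_signal = list(range(200))  # 200 bp region, two positions per bin
binned = bin_signal(toy_signal)
print(len(binned), binned[0], binned[-1])  # 100 0.5 198.5
```

Applying the same function to the left flank, the region and the right flank yields the 300-bin profile. Smoothing could then be done with, for instance, `scipy.signal.savgol_filter`; the window length and polynomial order are not stated in the text and would have to be chosen.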
Association with genes
A gene is considered associated with a CRE or a CORE if the two are found within a ±100 kb window of each other.
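A minimal sketch of this association rule follows. It assumes that the window is applied by extending the CRE or CORE by 100 kb on each side and testing for overlap with the gene body; the text does not say whether the gene body or the transcription start site is used, so this is one interpretation. Gene names and coordinates are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]

def associated_genes(region: Interval, genes: Dict[str, Interval],
                     window: int = 100_000) -> List[str]:
    """Genes overlapping `region` extended by `window` bp on both sides."""
    lo, hi = region[0] - window, region[1] + window
    return [name for name, (g_start, g_end) in genes.items()
            if g_start <= hi and g_end >= lo]

toy_genes = {
    "geneA": (1_050_000, 1_060_000),  # 45 kb downstream: associated
    "geneB": (1_200_000, 1_210_000),  # 195 kb downstream: not associated
    "geneC": (890_000, 899_000),      # 101 kb upstream: not associated
}
print(associated_genes((1_000_000, 1_005_000), toy_genes))  # ['geneA']
```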
Gene expression
RNA sequencing profiles of GM12878 and K562 cell lines, available in the ENCODE database [31], are used to quantify expression of genes in proximity of individual CREs and COREs.
Transcription factor binding enrichment
Bedgraph files of ChIP-seq profiles of transcription factors are overlapped with the identified COREs and individual CREs in GM12878 and K562 using bedtools (version 2.23.0). The resulting signals are summed over all the individual CREs or COREs and then normalized to the total genomic coverage of individual CREs or COREs, respectively. These normalized transcription factor binding intensities are used for comparing transcription factor binding intensity in individual CREs and COREs (Fig. 3).
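The normalization can be sketched in a few lines, assuming bedgraph records of the form (start, end, value) on one chromosome and non-overlapping regions. The function name and toy values are hypothetical, and the quadratic loop is written for clarity rather than speed.

```python
from typing import List, Tuple

def normalized_intensity(bedgraph: List[Tuple[int, int, float]],
                         regions: List[Tuple[int, int]]) -> float:
    """Summed signal over `regions` divided by their total coverage in bp."""
    total_signal = 0.0
    for r_start, r_end in regions:
        for b_start, b_end, value in bedgraph:
            overlap = min(r_end, b_end) - max(r_start, b_start)
            if overlap > 0:
                total_signal += value * overlap
    coverage = sum(end - start for start, end in regions)
    return total_signal / coverage

toy_bedgraph = [(0, 100, 2.0), (100, 200, 4.0), (200, 400, 1.0)]
core_signal = normalized_intensity(toy_bedgraph, [(50, 150)])
cre_signal = normalized_intensity(toy_bedgraph, [(200, 300)])
print(core_signal, cre_signal, core_signal / cre_signal)  # 3.0 1.0 3.0
```

The ratio in the last line corresponds to the fold change (FC) between COREs and individual CREs for one transcription factor.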
Sample similarity
Similarity between two samples in ENCODE is quantified by the Jaccard index for the commonality of their identified COREs throughout the genome. This Jaccard index is then used as the similarity statistic in a 1-nearest-neighbor classification approach. We assess the performance of the classification using leave-one-out cross-validation. In this classification scheme, the phenotype of the closest sample to a held-out sample is assigned as its phenotype.
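The classification scheme can be sketched as below. The sketch assumes that each sample's COREs have been reduced to a set of genomic bin identifiers so that the Jaccard index becomes a set operation; the bin representation, sample names and labels are hypothetical simplifications of the genome-wide comparison.

```python
from typing import Dict, Set

def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard index of two sets of CORE-covered genomic bins."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def loocv_1nn(samples: Dict[str, Set[int]],
              labels: Dict[str, str]) -> Dict[str, str]:
    """Leave-one-out 1-nearest-neighbor prediction for every sample."""
    predictions = {}
    for name, cores in samples.items():
        # The held-out sample takes the label of its most similar neighbor.
        nearest = max((other for other in samples if other != name),
                      key=lambda other: jaccard(cores, samples[other]))
        predictions[name] = labels[nearest]
    return predictions

toy_samples = {
    "blood1": {1, 2, 3, 4}, "blood2": {2, 3, 4, 5},
    "liver1": {10, 11, 12}, "liver2": {11, 12, 13},
}
toy_labels = {"blood1": "blood", "blood2": "blood",
              "liver1": "liver", "liver2": "liver"}
print(loocv_1nn(toy_samples, toy_labels))
```

Comparing the predictions with the true labels gives the confusion matrix from which a performance measure such as the MCC is computed.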
Association with essential genes
The number of genes that lie within ±100 kb of COREs and are essential in K562 is identified [33]. This number is then compared with the number of essential genes among genes randomly selected from those included in the essentiality screen, over 10,000 permutations. This comparison is used to derive the FDR and z-score for the significance of enrichment of essential genes among genes within ±100 kb of COREs identified for the K562 cell line.
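A permutation test of this kind can be sketched with the standard library. The sketch assumes that each permutation draws a random gene set of the same size as the CORE-proximal set from the screened genes, and it reports a z-score and an empirical p-value; converting p-values to FDR would require a multiple-testing correction that is not shown. All names and the toy gene universe are hypothetical.

```python
import random
import statistics
from typing import Iterable, Set, Tuple

def essential_enrichment(core_genes: Iterable[str], essential: Set[str],
                         screened: Iterable[str], n_perm: int = 10_000,
                         seed: int = 0) -> Tuple[float, float]:
    """z-score and empirical p-value for enrichment of essential genes."""
    rng = random.Random(seed)
    pool = sorted(set(screened))
    core = set(core_genes) & set(pool)  # keep only genes present in the screen
    observed = len(core & essential)
    # Null distribution: essential-gene counts in random same-size gene sets.
    null = [len(essential.intersection(rng.sample(pool, len(core))))
            for _ in range(n_perm)]
    mean, sd = statistics.fmean(null), statistics.pstdev(null)
    if sd == 0:
        raise ValueError("null distribution has zero variance")
    z = (observed - mean) / sd
    p = (1 + sum(count >= observed for count in null)) / (n_perm + 1)
    return z, p

toy_screened = [f"gene{i}" for i in range(1000)]
toy_essential = set(toy_screened[:100])                   # 10% essential
toy_core_genes = toy_screened[:30] + toy_screened[500:520]  # 30 of 50 essential
z_score, p_value = essential_enrichment(toy_core_genes, toy_essential,
                                        toy_screened)
print(z_score > 0, p_value < 0.01)  # True True
```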
Pathway enrichment analysis
ConsensusPathDB is used to perform pathway enrichment analysis [38]. Protein complex-based gene sets are used as query gene sets.
Research Reproducibility
CREAM is now available as an open source R package (https://CRAN.R-project.org/package=CREAM).
DECLARATIONS
Author contributions
S.A.M.T. developed CREAM. S.A.M.T. and V.K. prepared CREAM R package. S.A.M.T. and P.M. performed the analysis and interpreted the results. S.A.M.T., P.M., B.H-K, and M.L. conceived the design of the study. S.A.M.T., P.M., B.H-K, and M.L. wrote the manuscript. B.H-K and M.L. supervised the study.
Funding
This study was conducted with the support of the Terry Fox Research Institute, Canadian Cancer Research Society and the Ontario Institute for Cancer Research through funding provided by the Government of Ontario. We acknowledge the Princess Margaret Bioinformatics group for providing the infrastructure assisting us with analysis presented here. This work was supported by the Princess Margaret Cancer Foundation (M.L. and B.H.K.). M.L. holds an Investigator Award from the Ontario Institute for Cancer Research, a CIHR New Investigator Award and a Movember Rising Star Award from Prostate Cancer Canada and is proudly funded by the Movember Foundation (grant #RS2014-04). S.A.M.T was supported by Connaught International Scholarships for Doctoral Students. P. M. was supported by the Canadian Institutes of Health Research Scholarship for Doctoral Students. B.H.K is supported by the Gattuso-Slaight Personalized Cancer Medicine Fund at Princess Margaret Cancer Centre and the Canadian Institutes of Health Research.
Competing financial interests
The authors declare no competing financial interests.
Acknowledgment
DNase I sequencing profile of HeLa cell line is used in this research. Henrietta Lacks, and the HeLa cell line that was established from her tumor cells without her knowledge or consent in 1951, have made significant contributions to scientific progress and advances in human health. We are grateful to Henrietta Lacks, now deceased, and to her surviving family members for their contributions to biomedical research. We also acknowledge the ENCODE Consortium and the ENCODE production laboratories that generated the data sets provided by the ENCODE Data Coordination Center used in this manuscript.