Abstract
Background Super-enhancers are clusters of active enhancers densely occupied by the Mediators, transcription factors and chromatin regulators, control expression of cell identity and disease associated genes. Current studies demonstrated the possibility of multiple factors with important roles in super-enhancer formation; however, a systematic analysis to asses the relative contribution of chromatin and sequence features of super-enhancers and their constituents remain unclear. In addition, a predictive model that integrates various types of data to predict super-enhancers has not been established.
Results Here, we integrated diverse types of genomic and epigenomic datasets to identify key signatures of super-enhancers and their constituents and to investigate their relative contribution. Through computational modelling, we found that Cdk8, Cdk9 and Smad3 as new key features of super-enhancers along with many known. Comprehensive analysis of these features in embryonic stem cells and pro-B cells revealed their role in the super-enhancer formation and cellular identity. Further, we observed significant correlation and combinatorial predictive ability among many cofactors at the constituents of super-enhancers. By utilizing these features, we developed computational models which can accurately predict super-enhancers and their constituents. We validated these models using cross-validation and also independent datasets in four human cell-types.
Conclusions Our analysis of these features and prediction models can serve as a resource to further characterize and understand the formation of super-enhancers. Taken together, our results also suggest a possible cooperative and synergistic interactions of numerous factors at super-enhancers and their constituents. We have made available our analysis pipeline as an open-source tool with a command line interface at https://github.com/asntech/improse.
Background
Enhancers are cis-regulatory regions in the DNA that not only augment the transcription of associated genes but also play a key role in cell-type-specific gene expression [1, 2]. A myriad of transcription factors (TFs) bind to enhancers and regulate gene expression by recruiting coactivators and RNA polymerase II (RNA Pol II) to target genes [3–7]. A typical mammalian cell is estimated to have thousands of active enhancers, a number which rises to roughly one million in the human genome [1, 8]. It has been more than three decades since the first enhancer was discovered [9], but our understanding of the mechanisms by which enhancers regulate gene expression is still limited. However, development of methods such as chromatin immunoprecipitation followed by sequencing (ChIP-seq) and DNase I hypersensitivity followed by sequencing (DNase-seq), have helped to discover and characterize enhancers at genome scale. Many factors have been associated with enhancer activity, including mono methylation of histone H3 at lysine 4 (H3K4me1), acetylation of histone H3 at lysine 27 (H3K27ac), binding of the coactivator proteins p300 and CBP, and DNase I hypersensitivity [8, 10, 11]. By exploiting these factors and other genomic features, many computational approaches have been developed to predict enhancers genome-wide [4, 12].
Mediator, a transcriptional coactivator, forms a complex with Cohesin to create cell-type-specific DNA loops and by facilitating enhancer-bound transcription factors, to recruit RNA Pol II to the promoters of target genes [13, 14]. In embryonic stem cells (ESC), the pluripotency transcription factors Oct4, Sox2 and Nanog (OSN) are known to have 100% enhancer activity (25/25) [15]. By using ChIP-seq data for Oct4, Sox2 and Nanog in ESC, 10,227 co-bound regions have been identified and classified into super-enhancers (SEs) and typical enhancers (TEs) by using ChIP-seq signal for Mediator subunit Med1 [16]. Super-enhancers form clusters of active enhancers, are cell-type specific, associated with key cell identity genes, and linked to many biological processes which define the cell identity [16]. These super-enhancers are densely loaded with the Mediator complex, master transcription factors and chromatin regulators [16–19]. Many disease-and trait-associated single nucleotide polymorphisms (SNPs) have been found in these regions [18]. Super-enhancers differ from typical enhancers in terms of size, ChIP-seq density of various cofactors, DNA motif content, DNA methylation level, enhancer RNA (eRNA) abundance, ability to activate transcription and sensitivity to perturbation [16–18, 20–22]. Further, studies have found super-enhancers in multiple cancers and demonstrated their importance in cellular-identity and disease and emphasized their use as potential biomarkers [17, 18, 23–25]. Other parallel studies demonstrated nearly similar patterns by using different approaches and termed them ‘stretch enhancers’ [26] and ‘enhancer clusters’ [27].
Since the discovery of super-enhancers, the research community used ChIP-seq data for different factors to differentiate super-enhancers from typical enhancers in different cell-types. ChIP-seq data for Med1 optimally differentiated super-enhancers and typical enhancers by comparing it with enhancer marks, including H3K27ac, H3K4me1 and DNase I hypersensitivity [16]. BRD4, a member of the BET protein family, was also used to distinguish super-enhancers from typical enhancers as it is highly correlated with MED1 [17]. H3K27ac was extensively used to create a catalogue of super-enhancers across 86 different human cell-types and tissues due to its availability [18]. Other studies used the coactivator protein P300 to define super-enhancers [28, 29]. However, the knowledge about these factors’ ability to define a set of super-enhancers in a particular cell-type and their relative and combinatorial importance remains limited. Master transcription factors which might form the super-enhancer domains are largely unknown for most of the cell-types, while performing ChIP-seq for the Mediator complex is difficult and costly. Current studies demonstrated the possibility of multiple cofactors with important roles in super-enhancer formation; however, a predictive model that integrates various types of data to predict super-enhancers and their constituents (enhancers within a super-enhancer) has not been established. In addition, the degree to which the sequence-specific features of constituents by itself explains the differences between super-enhancers and typical enhancers remains unknown.
Herein, to identify key features of super-enhancers and to investigate their relative contribution to predict super-enhancers, we integrated diverse types of publicly available datasets, including ChIP-seq data for histone modifications, chromatin regulators and transcription factors, DNase I hypersensitive sites and genomic data. Using correlation analysis and computational modelling, we found that Med1, Med12, H3K27ac, Brd4, Cdk8, Cdk9, p300 and Smad3 were significantly correlated and had a higher predictive importance. By utilizing these features, we developed imPROSE (integrated methods for prediction of super-enhancers) to predict super-enhancers and their constituents from a list of enhancers. We implemented and compared six different state-of-the art learning models and validated them using 10-fold stratified cross validation as well as by independent datasets in four human cell-types. imPROSE trained on Smad3 and H3K27ac data in mESC predicts more cell-type-specific super-enhancers in pro-B cells as compared with H3K27ac-based method (ROSE). We also performed a genome-wide analysis of Cdk8, Cdk9 and Smad3 binding in mESC and pro-B cells to analyse and assess their relative importance in defining super-enhancers. By using ChIP-seq data, RNA-seq based gene expression data, Gene Ontology (GO) and motif analyses we found that these factors differentiate super-enhancers from typical enhancers. Our prediction model and derived features can be further used as a platform to precisely define super-enhancers in other cell-types.
Results
Analysis of features including chromatin and transcription factors
Studies have shown that super-enhancers are occupied by various cofactors, chromatin regulators, histone modifications, RNA polymerase II and transcriptions factors [16, 18]. An understanding of the occupancy of these factors at the constituents of super-enhancers and typical enhancers is lacking. We extensively analysed 32 publicly available ChIP-seq and DNase-seq datasets to unveil their association with the constituents of super-enhancers and typical enhancers in mouse embryonic stem cells (mESC). We found that most of these factors, which are enriched in super-enhancers, were also highly enriched in the constituents of super-enhancers relative to typical enhancers (Fig. 1a; Figure S1 in Additional file 1). It is understandable to see that Oct4, Sox2 and Nanog were nearly equally enriched across the constituents of super-enhancers and typical enhancers because these constituents are defined by intersecting OSN co-bound regions.
Through correlation analysis, we found that most of the factors were highly correlated at the constituents of super-enhancers compared to typical enhancers (Fig. 1b). This suggests a possible combinatorial interplay among these cofactors at super-enhancers and that could be the reason that super-enhancers are more active and sensitive to perturbation as compared to typical enhancers. Interestingly, features with similar lineage/functionality were clustered together. For example, histone modifications (H3K27ac and H3K4me3), Mediator complex subunits (Med1, Med12 and Cdk8), the pluripotency genes of ESC (Oct4, Sox2, Nanog) were clustered together. It was particularly interesting to observe that Smad3 clustered together with the co-activator protein p300/CBP. Previous studies have shown that p300/CBP interacts with Smad3 [30, 31].
Analysis of sequence specific features
Super-enhancers differ from typical enhancers in terms of size, ChIP-seq density of various cofactors, TF content, ability to activate transcription, and sensitivity to perturbation [16–18]. But, to what extent constituents of super-enhancers differ from the constituents of typical enhancers in terms of sequence composition remains unknown. To gain insights into their biological functions, we sought to identify DNA sequence signatures of constituent enhancers. We tested GC-content, repeat fraction, size, and phastCons across the constituents of super-enhancers and typical enhancers in mouse ESC and proB cells.
Previous studies have shown that GC-rich regions have distinct features including frequent TF binding [32], active conformation [33] and nucleosome formation [34]. We found that constituents of super-enhancers are significantly more GC-rich than the constituents of typical enhancers (p-value < 2.2e-16, Wilcoxon rank sum test) (Fig. 1c). This suggests that GC content has an important role in super-enhancers formation and it can be a defining feature to distinguish them from typical enhancers. Enhancers, larger than 3 kb have been shown to be cell-type-specific and are known as stretch enhancers [26]. We checked the size (bp) of constituents and found that constituents of super-enhancers are significantly larger than the constituents of typical enhancers (p-value < 2.2e-16, Wilcoxon rank sum test) (Fig. 1d). A previous study showed that majority of super-enhancers do overlap with stretch enhancers [35]. Taken together, these results suggests that super-enhancers are actually clusters of stretch enhancers.
Enhancers are hardly conserved across mammalian genomes and evolved recently from ancestral DNA exaptation, rather than lineage-specific expansions of repeat elements [36]. We did not find any significant difference in conservation at constituents of super-enhancers and typical enhancers in mESC (p-value = 0.6285, Wilcoxon rank sum test), but in pro-B cells the conservation score was statistically significant (p-value < 1.7e-4, Wilcoxon rank sum test) (Fig. 1e). Similarly, there was no significant difference in repeat fraction at constituents of both super-enhancers and typical enhancers in mESC (p-value = 0.0202, Wilcoxon rank sum test) and pro-B cells (p-value = 0.8976, Wilcoxon rank sum test) (Fig. 1f).
Feature-ranking revealed previously known and new features of super-enhancers
With the increasing discovery of factors associated with super-enhancers, the determination of their relative importance in defining super-enhancers is important. Hence, we ranked chromatin and transcription factors to find a minimal optimal subset, which can be used to optimally distinguish super-enhancers from typical enhancers. We used a random-forest based approach, Boruta [37] to assess the importance of each feature by ranking them based on their predictive importance (Fig. 2a) (Methods). We also used an out-of-bag approach to calculate the relative importance of each feature and achieved almost identical results (Figure S5 in Additional file 1).
After ranking chromatin features, we found Brd4, H3K27ac, Cdk8, Cdk9, Med12 and p300 as the six most important factors with Brd4 and H3K27ac as the top two most informative factors (Fig. 2a). It was particularly interesting to observe that Cdk8 and Cdk9 were ranked as the third and fourth most informative features, respectively. Cdk9, a subunit of the positive transcription elongation factor b (P-TEFb) has been found in enhancers and promoters of active genes along with the Mediator coactivator [17]. Cdk8, a subunit of Mediator complex, positively regulates precise steps in the assembly of transcriptional elongation, including the recruitment of P-TEFb and Brd4 [38].
Previous studies have shown that five ESC transcription factors (Sox2, Oct4, Nanog, Esrrb and Klf4) and other TFs (Smad3, Stat3, Tcf3, Nr5a2, Prdm14 and Tcfcp2l1) were enriched in super-enhancers [16, 18]. As these transcription factors are specific for ESC biology, we ranked them separately from other cofactors to find their relative importance in ESC. Surprisingly, Smad3 turn to be the most informative among other transcription factors including Klf4 and Esrrb which were previously described as key defining features of super-enhancers [16] (Fig. 2b). It will be interesting to further understand the importance of Cdk8, Cdk9 and Smad3 in the formation of super-enhancers.
We also ranked the chromatin and transcription factors together, not surprisingly Med1 was ranked as most informative feature followed by H3K27ac, Brd4, Cdk8, Cdk9. We found that Smad3 was ranked higher than p300 and also ESC specific master TFs (Fig. 2c). To test this in more differentiated cell types, we have ranked the several factors, including H3K27ac, H3K4me1, H3K4me3, p300, Smad3, PU.1, Foxo1, Ebf1, Pol2 and DNaseI in pro-B cell. Interestingly, we found that Smad3 turn to be the most informative feature followed by PU.1, p300 and H3K27ac (Fig. 2d). This shows that Smad3 might be more informative feature to distinguish super-enhancers in differentiated cells.
Comparison of six different state-of-the-art machine learning models
We compared six different state-of-the-art, supervised, machine learning models, including Random Forest, linear SVM, k-NN, AdaBoost, Naive Bayes and Decision Tree. We used chromatin, transcription factors and sequence-specific features to train these models individually and evaluated their performance using 10-fold cross validation. The parameters used for each model can be found in the methods section.
Using all chromatin, transcription factors and sequence-specific features, Random Forest performed well with AUC=0.98, while Naive Bayes performed poorly with AUC=0.85 (Fig. 3b,Figurin Additional file 1). Linear SVM and k-NN performed equally with AUC=0.95, while Decision
Tree achieved AUC=0.91. We found that AdaBoost performed almost equally with AUC=0.96 as compared to Random Forest with AUC=0.98, but we achieved stable precision and recall for Random Forest.
We further compared these models, using individual features as well as different types of features and found feature-type-specifics for each model, (S2 in Additional file 1). For example, linear SVM performed poorly on DNA sequence-specific features with AUC=0.69, while Random Forest and AdaBoost performed best with AUC=0.81.
Random Forest performed optimally across different combinations of features and data sampling approaches. Hence, we chose Random Forest for further analysis due to its performance and flexibility, though any of these models can be used to predict super-enhancers to some extend depending on the type of features used. This comparative analysis of models using various types of features, provides a guide to select a model based on the type of features available.
Prediction using chromatin and transcription factors
Random Forests are ensemble and non-parametric models, which run efficiently on large datasets without over-fitting. Here, we developed imPROSE, a Random Forest based model, to predict super-enhancers and their constituents. A detailed workflow of imPROSE is illustrated in (Fig. 3a). We investigated six state-of-the-art machine learning models, including Random Forest, linear SVM, k-NN, AdaBoost, Naive Bayes and Decision Tree. In order to train the models, we used normalized ChIP-seq profiles for all chromatin, cofactors and transcription factors, and other genomic features at the constituents of super-enhancers and typical enhancers defined by Med1 signal at OSN (Oct4, Sox2 and Nanog) co-bound sites. To avoid over-fitting, we used a hybrid-sampling approach to balance the training and test data (Additional file 1). The prediction accuracy was assessed using 10-fold stratified cross-validation (Methods).
We investigated the individual predictive power of features using all the six models in Additional file 1). The AUC (Area Under the Curve) scores reported in this manuscript are based on the Random Forest model, unless stated otherwise. The features which were ranked as more important predictors in (Fig. 2a, b), achieved higher AUC and PRC (Precision Recall Curve) scores. Among the top ranked chromatin features, including H3K27ac, Brd4, Cdk8, Cdk9, Med12 and p300, we observed Brd4 performed slightly better than H3K27ac, with AUC=0.85 and 0.84, respectively. The model achieved AUC=0.84 for Cdk8, 0.83 for Cdk9, 0.83 for Med12 and 0.76 for p300 (Fig. 3c). After combining the top three chromatin features (H3K27ac, Brd4 and Cdk8) the model performed well with AUC=0.93. This shows that the combination of Brd4, H3K27ac and Cdk8 is more effective at predicting super-enhancers than either alone. Among the top ranked transcription factors, including Smad3, Esrrb, Klf4, Tcfcp2l1, Nr5f2a and Stat3, we found that Smad3 achieved the highest predictive power with AUC=0.73, followed by Esrrb with AUC=0.69 and Klf4 with AUC=0.65 (Fig. 3d). The model achieved AUC=0.65, 0.65, 0.64 for Tcfcp2l1, Nr5f2a and Stat3, respectively. After combining the top three ranked transcription factors, including Smad3, Esrrb and Klf4, the model performed better with AUC=0.84 than either alone.
This shows that combinatorial information of cofactors is more effective at predicting super-enhancers than the individual features alone. We performed the combinatorial analysis based on feature types in the following section. We also noticed that the top ranked features, including Cdk8, Cdk9, p300, CBP, Smad3 and Brd4 are highly correlated with Med1, at the constituents of super-enhancers and typical enhancers (Fig. 1b).
Prediction using sequence-specific features
We used genomic features, including conservation score (phastCons), GC content and repeat fraction, and investigated their individual and combinatorial predictive power. The model achieved AUC=0.58 for phastCons, AUC=0.63 for GC content and AUC=0.64 for repeat fraction (Fig. 3e). By combining GC content and phastCons the model achieved AUC=0.71 and by combining GC content and repeat fraction it achieved AUC=0.76. By using three of the sequence-specific features, including GC content, phastCons and repeat fraction together, the model performed significantly higher with AUC=0.81. This shows that only genomic features could be enough to predict super-enhancers where high throughput sequencing data is not available.
Further, we used the DNA motifs for the 11 transcription factors, including Oct4, Sox2, Nanog, Esrrb, Klf4, Tcfcp2l1, Prdm14, Nr5a2, Smad3, Stat3 and Tcf3 to train the model. Using only the motifs information, the model achieved AUC=0.72. By using the ChIP-seq signal for these TFs, it achieved AUC=0.93 (Fig. 3g).
We also tested prediction accuracy using a sequence specific k-mer based approach[39]. We achieved AUC=0.74 for stitched sequences of super-enhancers and typical enhancers and AUC=0.75 for the constituents of super-enhancers and typical enhancers (Figures S2G, S2H in Additional file 1).
Combinatorial predictive ability of chromatin and genomic features
Previous studies have shown a combinatorial interplay between multiple histone modifications, transcriptions factors and DNA motifs, and this can be functionally informative [40, 41]. We noticed significant correlation among many chromatin regulators and TFs at the constituents of super-enhancers. This suggests the existence of a combinatorial relationship among these factors, which might dictate an accurate explanation for their role in super-enhancer formation. This combinatorial information could also be more predictive than the individual information of each factor.
We therefore investigated the combinatorial predictive power of chromatin, TFs and sequence-specific features. We tested 22 different combinations of these features and reported precision, recall, F1-score, AUC, and PRC (Fig. 3f). By training the model with the top six chromatin features, including Brd4, H3K27ac, Cdk8, Cdk9, Med12 and p300, it achieved AUC=0.95. By training the model with top two features (Brd4 and H3K27ac), it achieved AUC=0.91. By further adding the sequence specific features, the predictive ability of model greatly increased with AUC=0.95. By combining histone modifications, including H3K27ac, H3K4me1 and H3K4me3, with sequence specific features, the model achieved AUC=0.94. By training the model on known enhancer features data, including H3K4me1, H3K27ac, p300 and DNaseI, it achieved AUC=0.92. It is well known that Mediator forms a complex with Cohesin to create cell-type-specific DNA loops, and it facilitates enhancer-bound transcription factors to recruit RNA Pol II to the promoters of target genes [13, 14]. Hence, when we tested the combinatorial predictive power of Mediator sub-unit Med12, Cohesin sub-unit Smc1 and RNA Pol II, the model achieved AUC=0.91. Further, by adding the sequence specific features, the model performance considerably improved to AUC=0.95.
Furthermore, we tested the predictive ability of features that were grouped based on their type and functionality, and found that transcription factors, histone modifications and chromatin regulators have higher predictive power with AUC=0.93, 0.92 and 0.92, respectively. For models trained on Mediator complex, Cohesin, coactivators and Pol II data, achieved AUC=0.89, 0.79, 0.83 and 0.81, respectively (Fig. 3g). It was particularly interesting to see that a model trained on genomic features, including GC content, phastCons and repeat fraction, achieved AUC=0.81. We also examined the combinatorial importance of these grouped features based on their functionality and type (Figure S2E in Additional file 1). The increase in the model AUC is statistically significant (p-value = 0.001, Wilcoxon rank sum statistic) (Figure S6 in Additional file 1.
Our analysis shows that the combinatorial information greatly increased the predictive power of the models. Further, the sequence-specific features alone are reliable predictors, and the addition of sequence-specific features to other features greatly enhanced their predictive power.
Model validation using independent datasets
The above described models performed well on mESC data using 10-fold cross validation as a validation strategy. To further validate, we used independent datasets, which were not seen by the models during the training. We used publicly available data in four human tumor cell-types, including B-cell lymphoma (P493-6), multiple myeloma (MM1.S), small cell lung carcinoma (H2171) and glioblastoma (U87). We chose these cell-types because ChIP-seq data for MED1 and the two top-ranked chromatin features, BRD4 and H3K27ac, were publicly available (Table S2 in Additional file 1). Initially, we used H3K27ac ChIP-seq peaks 2 kb upstream and downstream of the transcription start site (TSS), to define constituent enhancers and ranked those constituents based on MED1 ChIP-seq signal to define super-enhancers as described in [16–18].
First, we trained the model on ESC data and tested it on each of the four human cell-type data and achieved AUC=0.92, 0.90, 0.90 and 0.86 for P493-6, MM1.S and U87 cells, respectively (Fig. 3h). We also checked the classification accuracy after combining five cell-types data by training the model on four cell-types data and testing it on one of the remaining cell-type data and repeated this for all the combinations (Figure S2H in Additional file 1). We achieved the highest AUC=0.95 for the model tested on P493-6 cell-type data and the lowest AUC=0.88 for the model tested U87 cell-type data (Figure S2F in Additional file 1). The classification measures, including precision, recall, F1-score and AUC for each tested cell-type after training the model on remaining four cell-types can be found in (Table S3 in Additional file 1).
We next trained the model on one genome data and tested on another. We used four human cell-types (P493-6, Plasma cell, H2171 and U87), and one mouse cell-type (mESC) data. The model trained on mouse cell-type data tested it on human cell-type data, achieved AUC=0.90 (Fig. 3i). The model trained on human cell-type data and tested on mouse cell-type data, achieved AUC=0.85.
We also tested whether a model trained on constituent data can predict the stitched regions and vice versa. The model, trained with H3K27ac data, accurately predicted the stitched super-enhancers with AUC=0.92 (Fig. 3j). The same model when trained on stitched data and tested on constituent data, performed poorly with AUC=0.68.
Genome-wide profiles of Cdk8, Cdk9 and Smad3 at super-enhancers
Through the ranking of chromatin and transcription factors, we found that Cdk8, Cdk9 and Smad3 were important features along many known signatures of super-enhancers, including H3K27ac, Brd4, Med12 and p300 which are well characterized at super-enhancers [16–18, 28]. However, the genome-wide profiles of Cdk8, Cdk9 and Smad3 are not well characterized at super-enhancers. Hence, we investigated the genome-wide profiles of Cdk8, Cdk9 and Smad3 at super-enhancers, identified by using Med1 in mESC. We found that, ChIP-seq binding sites of Cdk8, Cdk9 and Smad3 were highly co-localized with Med1, Brd4, H3K27ac, p300 and DNaseI (Fig. 4a, b). Like Med1, the ChIP-seq density for Cdk8, Cdk9 and Smad3 is exceptionally higher at super-enhancers compared to typical enhancers (Fig. 4c). The ChIP-seq density of Med1, Cdk8, Cdk9 and Smad3 at super-enhancers is significantly higher compared to typical enhancers (p-value < 2.2e-16, Wilcoxon rank sum test) (Fig. 4d). Similarly, Med1, Cdk8, Cdk9 and Smad3 are also enriched at ESC super-enhancers identified using Med1 (Fig. 4f). The ChIP-seq binding for Med1 and master TFs, including Sox2, Oct4 and Nanog, is exceptionally higher and forms clusters at super-enhancers, which are associated with cell-type-specific genes [16]. We found that the ChIP-seq binding sites for Cdk8, Cdk9 and Smad3 also form clusters at super-enhancer regions and associated with cell-type-specific genes. For example, the super-enhancer (mSE_00038) is associated with ESC pluripotency gene Sox2 (Fig. 3e). In another example, the super-enhancer (mSE_00085) is associated with the ESC pluripotency gene Nanog and the super-enhancer (mSE_00084) is associated with Dppa3 (developmental pluripotency associated 3) gene, which plays a key role in cell division and maintenance of cell pluripotency (Figure S4A in Additional file 1).
Furthermore, we calculated the Pearson’s correlation of Med1 with Cdk8, Cdk9 and Smad3 and found a high and significant correlation (p-value < 2.2e-16) (Fig. 4g). The correlation between Med1/Cdk8 (Pearson’s r = 0.90, p-value < 2.2e-16), Med1/Cdk9 (Pearson’s r = 0.86, p-value < 2.2e-16), and Med1/Smad3 (Pearson’s r = 0.76, p-value < 2.2e-16). Previous studies have used the ChIP-seq binding sites of the co-activator protein p300 to find enhancers [4, 10]. We found that Smad3 is significantly correlated with p300 (Pearson’s r = 0.85, p-value < 2.2e-16) (Figure 4g).
Identification and characterization of super-enhancers by using Cdk8, Cdk9 and Smad3 in mESC
Since we found that Cdk8, Cdk9 and Smad3 are highly correlated with Med1 and co-occupy super-enhancers genome-wide (Fig. 4), we investigated the importance of Cdk8, Cdk9 and Smad3 in super-enhancer formation and also compared them with super-enhancers identified by Med1. We used ChIP-seq data and RNA-seq data to identify and characterize super-enhancers by using Cdk8, Cdk9 and Smad3 in mESC. We found 400, 494 and 435 super-enhancers by using Cdk8, Cdk9 and Smad3, respectively (Fig. 5a). A list of all the super-enhancers and typical enhancers can be found in (Additional file 2). Further, Cdk8, Cdk9 and Smad3 successfully identified 88%, 84% and 73% of the Med1 super-enhancers, respectively (Fig. 5a). After Med1, we can see more clear distinction of super-enhancers and typical enhancers by using Cdk8 as compared with H3K27ac (Figure 5b). The majority of super-enhancers identified using Cdk8, Cdk9 and Smad3 do overlap with super-enhancers identified using Med1, which is 66% of the super-enhancers identified using Med1 (Fig. 5c).
The ChIP-seq density at super-enhancers, identified using Cdk8, Cdk9 and Smad3, is significantly higher compared with typical enhancers (p-value < 2.2e-16, Wilcoxon rank sum test) (Fig. 5d). Our analysis showed that Cdk8 could separate most of the super-enhancers (88%) defined using Mediator complex (Med1) ChIP-seq signal.
In ESC, the DNA motifs Klf4 and Esrrb were particularly enriched at the constituents of super-enhancers, compared to typical enhancers [16]. Hence, we tested the frequency of these two motifs at the constituents of super-enhancers and typical enhancers; we defined using Cdk8, Cdk9 and Smad3. We found that the frequency of binding motifs Klf4 and Esrrb is significantly higher at constituents of super-enhancers than typical enhancers (p-value < 2.2e-16, Wilcoxon rank sum test) (Figure S4G in Additional file 1). Further, when we compared the frequency of known ESC specific motifs (Oct4, Sox2, Nanog, Esrrb and Klf4) at the constituents of super-enhancers and typical enhancers defined by Med1, Cdk8, Cdk9 and Smad3. We found a higher frequency of these motifs at super-enhancers defined by Cdk9, Cdk8 and Smad3 as compared with Med1 (Fig. 5e). Further, the frequency of these motifs was slightly higher at the typical enhancers identified by Med1 as compared with Cdk8, Cdk9 and Smad3.
The genes associated with super-enhancers are significantly expressed, compared to genes associated with typical enhancers [16–18]. To test this we associated genes with the super-enhancers and typical enhancers as described in [16, 18]. We found that 65% of the Med1 super-enhancers associated genes were also associated with super-enhancers identified by Cdk8, Cdk9 and Smad3 (Fig. 5f). Further, these genes associated with super-enhancers were significantly expressed, compared to genes associated with typical enhancers (p-value < 2.2e-16, Wilcoxon rank sum test) (Fig. 5g).
Super-enhancers known to be highly enriched for cell-type-specific master regulators and these regulators should have a higher rank. Hence, we checked the rank of super-enhancers, associated with the key cell identity genes, including Oct4, Sox2, Nanog, Esrrb and Klf4 in ESC, and ranked the factors based on the average rank of the super-enhancers that are associated with these genes (Fig. 5h). These genes were selected due to their important roles in the pluripotency and reprogramming of ESC biology [42–44]. The rankings for Med1 super-enhancers were downloaded from [18]. We found that Smad3 achieved a highest rank followed by Med1, Cdk9 and Cdk8. The Smad3 achieved a higher rank for Oct4, compared with other genes. This might be due to the fact that Smad3 co-bonded with the master transcription factor Oct4 genome-wide in ESC [45]. Further, we found almost similar ChIP-seq patterns for factors including Med1, H3K27ac, Brd4, Cdk8, Cdk9 and Smad3 at super-enhancers regions defined by all three factors (Cdk8, Cd9 and Smad3) and are associated with cell-type-specific genes, including Nanog and Dppa3 (Fig. 5i). Like Med1, the genes associated with super-enhancers ranked by Cdk8, Cdk9 and Smad3 were enriched with cell-type-specific GO terms, supporting the notion that super-enhancers regulate cellular identity genes (Figure S7a in Additional file 1). Taken together, our results indicate the role of Cdk8, Cdk9 and Smad3 in defining and formation of super-enhancers.
Identification and characterization of super-enhancers by using Smad3 in pro-B cells
It was particularly interesting to see Smad3 ranked the most informative among transcription factors, including Oct4, Sox2, Nanog, Esrrb, Klf4, Tcfcp2l1, Prdm14, Nr5a2, Stat3 and Tcf3 in mESC (Fig. 2b). In ESC, the highly ranked super-enhancers identified using Smad3 were associated with ES cell-identity genes, including Oct4, Sox2, Nanog, Klf4 and Esrrb compared to Med1 super-enhancers (Fig. 5f). A previous study showed that Smad3 co-occupies the master transcription factors genome-wide[45]. We also observed Smad3 turn to be the highly ranked feature when we ranked Smad3 and several other factors in pro-B cells (Fig. 2d)
Hence, we argued that Smad3 could be used to define super-enhancers instead of Med1. We already showed the ability of Smad3 in defining super-enhancers in ESC. To test this in more differentiated cells, we identified and characterized super-enhancers in pro-B cells using Smad3. We compared the super-enhancers identified using Smad3 with previously identified super-enhancers that use Med1 in pro-B cells [16]. The ChIP-seq density of Smad3 super-enhancers is exceptionally higher compared to typical enhancers (Fig. 6a). Further, Smad3 have strong binding along with Med1 and PU.1 at a super-enhancer(mSE_00293) which is associated with Foxo1 gene (Fig. 6b).
By using Smad3, we identified 694 super-enhancers and among these, 65% were identified by Med1 (Figure S4f in Additional file 1) and with Smad3 we can see a more clear distinction of super-enhancers and typical enhancers as compared with H3K27ac (Fig. 6c). The ChIP-seq density at super-enhancers identified using Smad3 is significantly higher, compared to typical enhancers (p-value < 2.2e-16, Wilcoxon rank sum test) (Fig. 6d). The genes associated with Smad3 super-enhancers are significantly expressed, compared with typical enhancers (p-value < 2.2e-16, Wilcoxon rank sum test) (Fig. 6e). The GO terms for super-enhancers ranked by Smad3 are highly enriched and cell-type-specific, compared to Med1 (Figure S7b in Additional file 1).
Further, to test the functional importance of super-enhancers identified only by Smad3 or by Med1, we performed GO analysis on these subset of super-enhancers. Interestingly, the super-enhancers identified by Smad3 but not by Med1 turn to be highly enriched for cell-type-specific GO terms such as immune cell development and immune system development. While super-enhancers identified by Med1 but not by Smad3 low enriched for cell-type-specific GO terms (Fig. 6f). These results, taken together with previous studies, demonstrate the importance of Smad3 in super-enhancer formation and Smad3 could be used to define super-enhancers.
imPROSE predicts cell-type-specific super-enhancers
To further test the ability of imPROSE, we compared it with the most commonly used H3K27ac based method ROSE [16]. We have demonstrated above that H3K27ac was ranked as the most informative feature in mESC and Smad3 was ranked as the most informative feature in pro-B cells. Hence, we trained imPROSE using H3K27ac and Smad3 data in mESC and predicted super-enhancers in pro-B cells. Our model predicted ~2,000 super-enhancers in pro-B cells, but when we used ROSE with H3K27ac data, we found 934 super-enhancers (Additional files 3 and 4). To assess the functional importance of super-enhancers identified by these two methods, we performed GO enrichment analysis using GREAT tool [46]. We have listed the top 20 GO terms enriched at super-enhancers Interestingly, we found several GO terms pertinent to the specific biological functions in pro-B cells are highly enriched at the super-enhancers identified by imPROSE and low enriched at super-enhancers identified by ROSE. For example, immune system process, hematopoietic or lymphoid organ development and leukocyte differentiation are significantly enriched for genes associated with super-enhancers predicted by imPROSE (Fig. 7a). Further, cell-type-specific GO terms such as regulation of B cell activation and regulation of B cell proliferation were highly enriched for typical enhancers defined by ROSE as compared with imPROSE (Figure S8 in Additional file 1). This shows that several super-enhancers were possibly labelled as typical enhancers by ROSE.
To further assess the ability of imPROSE, we carried out motif analysis at the constituents of super-enhancers identified by both ROSE and imPROSE methods. We used TRAP tool [47] to find motifs and reported the top ranked motifs. We found that most of the motifs were highly enriched at super-enhancers identified by imPROSE (Fig. 7b). More interestingly, motifs such as PU.1 and Ebf1 were significantly enriched at SEs predicted by imPROSE, which have previously been shown to play important role in the control of B cell identity [48]. Therefore, taken together these results suggest that the integration of multiple factors appears to be a better way to predict super-enhancers than the the current H3K27ac based approach (ROSE).
Discussion
Super-enhancers regulate expression of key genes that are critical for cellular identity, thus, alterations at these regions can lead to several disorders. Hence, exploring these cis-regulatory elements and their features to uncover their molecular mechanisms will help us in designing better, precise, and personalized drugs.
In this study, we first presented a systematic approach to rank and access the importance of different features of super-enhancers. We investigated different features including histone modifications, chromatin regulators, transcription factors, DNA hypersensitive sites and DNA sequence motifs in mESC. We also analysed sequence-specific features including GC content, conversation score and repeat fraction in mESC and pro-B cells. We found new features including Cdk8, Cdk9, Smad3 and GC content as key features of super-enhancers along with many known features, which make super-enhancers distinct from typical enhancers. Further, we developed imPROSE, a supervised machine-learning model, which can accurately predict super-enhancers and their constituents. imPROSE trained on one cell-type data can predict super-enhancers and their constituents in other cell-type. imPROSE trained on only H3K27ac and Smad3 data in mESC can predict super-enhancers in pro-B cells, which are more cell-type specific compared to super-enhancers defined by the current H3K27ac based approach (ROSE).
Among the chromatin features we found that Brd4, H3K27ac, Cdk8, Cdk9, Med12 and p300 were the top six features and by assessing their individual predictive powers we achieved higher AUC for higher ranked features. Previous studies showed the importance of these highly ranked features and their role in transcriptional regulation. Through our ranking of features, H3K27ac achieved a higher ranking, compared to other histone modifications. H3K27ac was found as a mark to separate active enhancers from poised enhancers [11] - this shows that super-enhancers might be the clusters of active enhancers. The Mediator sub-units Med1 and Med12 has been known as master coordinators of cell lineage and development [16, 49]. We did not include Mediator sub-unit Med1 in our training data because a Med1 signal was used to define super-enhancers [16]. Bromodomain-containing protein 4 (Brd4), a member of the BET protein family which functions as an epigenetic reader and transcriptional regulator that binds acetylated lysines in histones [50], was ranked as the second highest important feature. Brd4 has been associated with anti-pause enhancers (A-PEs) which regulate the RNA Polymerase II (Pol II) promoter-proximal pause release [51, 52]. Brd4 regulates the positive transcription elongation factor b (P-TEFb) to allow Pol II phosphorylation and the subsequent elongation of target genes [52, 53]. In ESC it specifically governs the transcriptional elongation by occupying super-enhancers and by recruiting Mediator and Cyclin dependent kinase 9 (Cdk9) to these super-enhancers [54]. Cdk9, a sub-unit of P-TEFb, has been found at enhancers and promoters of active genes along with the Mediator coactivator [17]. Cyclin-dependent kinase 8 (Cdk8), a subunit of Mediator complex, positively regulates precise steps in the assembly of transcriptional elongation, including the recruitment of P-TEFb and BRD4 [38]. During the preparing of this manuscript, a very recent study has demonstrated that Cdk8 regulates the key genes associated with super-enhancers in acute myeloid leukaemia (AML) cells [55]. We identified and characterized super-enhancers in mESC by using Cdk8 and Cdk9, to further validate their importance in super-enhancer formation and cell identity.
Among the transcriptions factors, we found that Smad3, Esrrb, Klf4, Tcfcp2l1, Nr5f2a and Stat3 were the top ranked features and by assessing their individual predictive powers we achieved higher AUC for higher ranked features. It was particularly interesting to see that Smad3 was ranked as the best feature among the transcription factors including Esrrb and Klf4. We know that Smad3 is a target of the TGF-β signaling pathway, and studies have shown that Smad3 is recruited to enhancers formed by master transcription factors [45]. We found significant correlation between Smad3 and coactivators p300/CBP at super-enhancers and previous studies have shown that p300/CBP interacts with Smad3 [30, 31]. The evidence for the enrichment of Smad3 at super-enhancers shows how the transforming growth factor beta (TGF-β) signaling pathway can converge on key genes that control ES cell identity. A very recent study validates our findings by showing that super-enhancers provide a platform for signalling pathways, including TGF-β to regulate genes that control cell identity during development and tumorigenesis [56]. To validate further, we identified and characterized super-enhancers using Smad3 in mESC and pro-B cells. By integrating ChIP-seq and RNA-seq data we showed the importance of Smad3 in super-enhancer formation and cell identity.
By investigating sequence-specific features, we found that the constituents of super-enhancers were significantly GC-rich. The GC-richness of a genomic region is associated with several distinctive features that can affect the cis-regulatory potential of a sequence [32, 33]. GC-rich and AT-rich chromatin domains are marked by distinct patterns of histone modifications. GC-rich chromatin domains tend to occur in a more active conformation and histone deacetylase activity represses this propensity throughout the genome [33]. Also GC content and nucleosome occupancy are positively correlated [32] and GC-rich sequences promote nucleosome formation [34]. Transcription factors tend to bind GC-rich regions in the genome, regardless of the distance and orientation [32]. This suggests that there is a role for the GC content in the formation of super-enhancers, which control the cell-type-specific gene expression.
Enhancers function due to cooperative and synergistic interplay of different coactivators and transcription factors [57]. A recent study showed that multiple enhancer variants cooperatively contribute to altered expression of their gene targets [58]. It is not well understood whether constituents of super-enhancers work synergistically or additively. The constituents of super-enhancers make frequent physical contacts with one another [59] and extensive cooperative binding of transcription factors have been found at super-enhancers [60]. A study in ESC demonstrated the functional importance of super-enhancer constituents [56]. Further, two recent studies have suggested additive and functional hierarchy among the constituents of α-globin and Wap super-enhancer locus, respectively [61, 62]. But, a very recent study argues that it is still need to be determined whether the constituents of a super-enhancer functions synergistically or additively [63]. Through computational modelling and correlation analysis, we noticed a combinatorial relationship between chromatin regulators and transcription factors at the constituents of super-enhancers. This advanced our current understanding of the determinants of super-enhancers and led us to hypothesize that these combinatorial patterns may be involved in mediating super-enhancers. Further, the significant correlation of many cofactors at the constituents of super-enhancers suggests cooperative and synergistic interactions. These results, taken together with previous studies suggest a cooperative and synergistic interplay of between constituents of super-enhancers. More sophisticated experiments are needed to validate the functional importance of constituents within a super-enhancer, by utilizing the latest CRISPR-Cas9 system.
Conclusions
We integrated diverse types of genomic and epigenomics datasets to predict super-enhancers and their constituents in a cell-type-specific manner. We investigated the relative importance of each feature in predicting super-enhancers, and also their combinatorial predictive power. We demonstrated that the model trained on one cell-type can be used to predict super-enhancers in other cell-types, and also performed better than the current H3K27ac-based approach. More importantly, we found Cdk8, Cdk9 and Smad3 as new signatures, which can be used to define super-enhancers where Mediator or master transcription data is not available. Taken together with previous studies, our results suggest a possible cooperative and synergistic interactions of numerous factors at super-enhancers. Our feature analysis and prediction models can serve as a resource to further characterize and understand the formation of super-enhancers.
Methods
Data description
We downloaded 32 publicly available ChIP-seq and DNase-seq datasets in mouse embryonic stem cells (mESC) from Gene Expression Ominibus (GEO). These include four histone modifications: H3K27ac, H3K4me1, H3K4me3 and H3K9me3; DNA hypersensitive site (DNaseI); RNA polymerase II (Pol II); transcriptional co-activating proteins (p300, CBP); P-TFEb subunit (Cdk9); sub-units of Mediator complex (Med1, Med12, Cdk8); other chromatin regulators (Brg1, Brd4, Chd7); Cohesin (Smc1, Nipbl); subunits of Lsd1-NuRD complex (Lsd1, Mi2b) and 11 transcription factors (Oct4, Sox2, Nanog, Esrrb, Klf4, Tcfcp2l1, Prdm14, Nr5a2, Smad3, Stat3 and Tcf3). A detailed list of all datasets used in this study is provided in (Table S4 in Additional file 1).
To validate the model using independent data, we downloaded ChIP-seq datasets for (MED1, BRD4 and H3K27ac) in four human tumor cell-types, including B-cell lymphoma (P493-6), Multiple myeloma (MM1.S), Small cell lung carcinoma (H2171) and Glioblastoma (U87), which were not seen by the model during training. A detailed list of ChIP-seq datasets used to validate the model is provided in (Table S5 in Additional file 1).
To predict super-enhancers and perform features ranking in pro-B cells, we used ChIP-seq data for Med1, PU.1, Foxo1, Smad3, Ebf1, p300, H3K27ac, H3K4me1, H3K4me3 and Pol2 and also DNase-seq (Table S6 in Additional file 1).
We also obtained processed RNA-seq based gene expression data (RPKM) from [64] and [16] for mESC and pro-B cells, respectively. We downloaded super-enhancer regions in mESC and pro-B cells identified using Med1 ChIP-seq occupancy from dbSUPER [65]. We also used other genomic features including GC content, conservation score (phastCons) and repeat fraction downloaded from the UCSC table browser [66].
Data pre-processing and feature extraction
Initially, ChIP-seq reads were aligned to mouse genome-build mm9 using bowtie [67] (Version 0.12.9) with parameters (-k 1, -m 1,-n 2, -e 70,–best). We calculated read densities for 30 ChIP-seq datasets at the constituents of super-enhancers (646) and typical enhancers (9981) and normalized it as described in [16, 68]. Briefly, for each constituent region, reads were extended by 200bp and the density of reads per base pair was calculated using bamToGFF (https://github.com/BradnerLab/pipeline). Next, these densities were normalized in units of reads per million mapped reads per base pair (rpm/bp) with background subtraction. We used the similar approach for DNase-seq data but without background subtraction.
The data for model validation was aligned to hg19 as described above. We used MACS (Model-based Analysis of ChIP-Seq) [69] (Version 1.4.2) to perform the peak calling and to find ChIP-seq-enriched regions over background. We used a p-value (10-9) as the enrichment threshold. To generate wiggle files, we used MACS with parameter -w -S --space=50.
For DNA sequence motif data, we collected DNA binding motif information (PWM) from the transfac professional database version 2014 [70] for all the 11 transcription factors (Fig. 2a). We computed the binding affinity score for the constituents of super-enhancer and typical enhancer sequences using the Transcription factor Affinity Prediction (TRAP) [47] using individual TF’s position weight matrix (PWM).
Data sampling
There are two commonly used sampling approaches, over-sampling and under-sampling. In over-sampling we increased the size of the minority class while in under-sampling we through away the samples from the majority class to balance the data. The current data we are dealing with is highly imbalanced. We used Weka implementation of SMOTE [71] to perform over-sampling with parameters (nearest neighbours=5, random seed=1 and oversampling percentage=500). We used a hybrid approach by first applying over-sampling using SMOTE on the minority class and then under-sampling on the majority class. We performed under-sampling by randomly selecting a subset of size similar to the minority class. In this study we used hybrid-sampling approach to perform analysis because it performed better (Additional file 1).
Feature-ranking
To find an optimal feature subset, we used two Random Forest based approaches. First, we used Boruta algorithm [37] to rank important features. Briefly, it finds important features by measuring the relevance of each original feature with respect to a reference attribute using Random Forest. Second, we used Random Forest’s out-of-bag approach to calculate the relative importance of each feature. Briefly, this approach takes one feature out and measures its relative importance and contribution to the model.
We divided the features into two groups with the aim to achieve two different goals. The 11 transcription factors in mouse embryonic stem cells, were used to rank the transcription factors and explore their importance in super-enhancer prediction. The other 20 chromatin features including, histone modifications (HMs), RNA polymerase II (Pol II), transcriptional co-activating proteins, chromatin regulators (CRs), Cohesin, and sub-units of Lsd1-NuRD complex, were used to develop a general super-enhancer prediction model which can be used further to predict super-enhancers in other cell-types. We also ranked these features together in mESC and pro-B cells.
Training data
We downloaded 10627 loci of constituents of super-enhancers and typical enhancers in mESC defined based of Med1 ChIP-seq signal [16]. Among these 646 were constituents of super-enhancers and 9981 were typical enhancers. The median size of enhancer constituents is 703bp and super-enhancer constituents is 862bp. After performing hybrid-sampling we have 10,336 instances of data and among these 50% (5,168) are constituents of super-enhancers, and 50% (5,168) are constituents of typical enhancers. We considered constituents of super-enhancers as positive class and constituents of typical enhancers as negative class. In total, we have 45 features, including 20 chromatin features, 11 transcription factors, 11 DNA motifs and three sequence-specific features. We excluded Med1 from our training data because super-enhancers were defined based on Med1 ChIP-seq signal [16]. Not surprisingly, we achieved best classification results by using Med1 as a feature.
Prediction models
We investigated six state-of-art supervised approaches including: Random Forest [72], Support Vector Machines (SVM) [73], K-Nearest Neighbor (k-NN) [74], AdaBoost [75], Decision Tree and Naive Bayes. For all the analysis, we used scikit-learn (version 0.14.1), a Python library for machine learning [76]. We used LibSVM [77] with a linear kernel and regularization parameter C = 1.0. We used Random Forest with the number of trees=20. We calculated an out-of-bag error to find the optimal number of trees to use for Random Forest (Figure S2C in Additional file 1). For other models we used default parameters set in the scikit-learn library.
K-mer based prediction
We used a sequence specific enhancer prediction method (Kmer-SVM) [39] to classify constituents of typical enhancers and super-enhancers. We used the default settings with k-mer size=5, spectrum kernel, regularization parameter (C) = 1.0. We used 5-fold cross validation for model validation.
Performance evaluation
We used 10-fold stratified cross-validation (CV) to validate models, which makes the folds by preserving the percentage of samples for each class. Stratified CV is generally consider a better scheme than standard CV in terms of bias and variance [78]. To evaluate the performance of the models, we reported precision, recall, F1-score, area under the ROC curve (AUC) and the precision-recall curve (PRC). The receiver-operating characteristic (ROC) is a graphical representation of true positive rate (sensitivity) v/s false positive rate (1-sensitivity). The true positive rate is also known as sensitivity or recall. The false positive rate is also known as (1-sensitivity). F1 score is an accuracy measure, which considers both the precision and the recall of the test to compute the score. The mathematical representation of these measures is as follows:
Precision = true positive/(true positive + false positive)
Recall = true positive/(true positive + false negative)
F1 score = 2 [(precision x recall)/(precision + recall)]
We also tested if the increase in model AUC is statistically significant by using permutation test (1000 runs). The p-value is calculated using Wilcoxon rank sum statistic.
Super-enhancer identification
We used ROSE (Rank Ordering of Super-Enhancers) with parameters (stitching distance=12.5 kb, TSS exclusive zone= +/- 2 kb) to define super-enhancers as described in [16]. We used the ChIP-seq peaks for H3K27ac as enhancer constituents and MED1 signal to rank them.
Assigning genes to super-enhancers and typical enhancers
We assigned genes to super-enhancers and typical enhancers using a proximity rule as descried in [16, 18]. It is known that enhancers tend to loop and communicate with target genes [7], and most of these enhancer-promoter interactions occur within a distance of ~50kb [79]. This approach identified a large proportion of true enhancer/promoter interactions in ESC [80]. Hence, we assigned all transcriptionally active genes to super-enhancers and typical enhancers within a 50kb window.
Motif analysis
We used FIMO (Find Individual Motif Occurrences) version 4.10.0 [81] for motif analysis with p-value < 10-4 with a custom library of TRANSFAC motifs including (Oct4: M01124, Sox2: M01272, Nanog: M01123, Esrrb: M01589, Klf4: M01588). The number of occurrences of each motif were counted for the constituents of super-enhancer and typical enhancer regions. Motif analysis in Fig. 7b was performed with TRAP tool using TRANSFAC vertebrates motif library, mouse promoters as the background, and Benjamini-Hochberg as the correction [47]. A DNA sequence (FASTA format) was extracted from mm9 genome as input for FIMO and TRAP.
Gene ontology analysis
We performed the gene ontology analysis using Genomic Regions Enrichment of Annotations Tool (GREAT) web tool (version 3.0.0) [46] with whole-genome as background and default parameters. We reported the top Gene Ontology (GO) terms with the lowest p-value.
Visualization and statistical analysis
We generated box plots using R programming language by extended the whiskers to 1.5x the interquartile range. The P-values were calculated based on Wilcoxon signed-rank test for box plots, by using wilcox.test function in R. We used ngs.plot [82], to generate heat maps and normalized binding profiles at the constituents of super-enhancers and typical enhancers and their flanking 3kb regions (for example, Fig. 1a, 4a).
imPROSE availability
To foster the reproducible research, we developed our analysis pipeline as an open-source Python package with a command line interface and made it freely available for academic use at https://github.com/asntech/improse. A detailed documentation can be found at http://improsedoc.readthedocs.io/.
Additional files
Additional file 1:
A PDF document contains Supplementary Figures S1–S8 and Tables S1-S3. And also the public dataset used in this study in Supplementary Table S4-S6.
Additional file 2:
An Excel spreadsheet contains a list of super-enhancers and typical enhancers and their associated genes in mESC identified by using Cdk8, Cdk9 and Smad3.
Additional file 3:
An Excel spreadsheet contains a list of super-enhancers and typical enhancers identified by using H3K27ac and Smad3 in pro-B cell.
Additional file 4:
An Excel spreadsheet contains a list of super-enhancers and typical enhancers predicted by imPROSE in pro-B cell using Smad3 and H3K27ac data.
Abbreviations
TFs: Transcription Factors; ChIP-seq: Chromatin immune precipitation followed by high-throughput sequencing; ESCs: Embryonic Stem Cells; SEs: Super-enhancers; TEs: Typical enhancers; ImPROSE: Integrated methods for prediction of super-enhancer; TSS: Transcription Start Site; SNPs: Single Nucleotide Polymorphisms; ROC: Receiver Operating Characteristic; AUC: Area Under the Curve; GEO: Gene Expression Omnibus; GO: Gene Ontology.
Competing interests
The authors declare that they have no competing interests.
Funding
This work is supported in part by the National Basic Research Program of China [2012CB316504], Hi-tech Research and Development Program of China [2012AA020401] and NSFC grant [91010016].
Authors’ contributions
AK conceived and designed the experiments. XG reviewed and approved experiment design. AK performed the experiments and analysed the data. AK wrote the manuscript and XG reviewed it. All authors read and approved the final manuscript.
Author details
1 MOE Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic and Systems Biology, TNLIST/Department of Automation, Tsinghua University, Beijing, 100084, China
2 Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0349 Oslo, Norway
3 School of Life Sciences, Tsinghua University, Beijing, 100084, China
Acknowledgements
We thank Anthony Mathelier and Masood ur Rehman Kayani for their comments and suggestions to improve the manuscript. We would like to thank Lydia Danner for proofreadingthe manuscript.
Footnotes
Author’s email address: aziz.khan{at}ncmm.uio.no; zhangxg{at}tsinghua.edu.cn
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.
- 6.
- 7.↵
- 8.↵
- 9.↵
- 10.↵
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.
- 22.↵
- 23.↵
- 24.
- 25.↵
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵