Abstract
Motivation Identifying transcription factor binding sites is the first step in pinpointing non-coding mutations that disrupt the regulatory function of transcription factors and promote disease. ChIP-seq is the most common method for identifying binding sites, but performing it on patient samples is hampered by the amount of available biological material and the cost of the experiment. Existing methods for computational prediction of regulatory elements primarily predict binding in genomic regions with sequence similarity to known transcription factor sequence preferences. This has limited efficacy since most binding sites do not resemble known transcription factor sequence motifs, and many transcription factors are not even sequence-specific.
Results We developed Virtual ChIP-seq, which predicts binding of individual transcription factors in new cell types using an artificial neural network that integrates ChIP-seq results from other cell types and chromatin accessibility data in the new cell type. Virtual ChIP-seq also uses learned associations between gene expression and transcription factor binding at specific genomic regions. This approach outperforms methods that use transcription factor sequence preferences in the form of position weight matrices, predicting binding for 31 transcription factors (Matthews correlation coefficient > 0.3).
Availability The datasets we used for training and validation are available at https://virchip.hoffmanlab.org. We have deposited in Zenodo the current version of our software (http://doi.org/10.5281/zenodo.1066928), datasets (http://doi.org/10.5281/zenodo.823297), predictions for 31 transcription factors on Roadmap Epigenomics cell types (http://doi.org/10.5281/zenodo.1243913), and predictions in Cistrome as well as ENCODE-DREAM in vivo TF Binding Site Prediction Challenge (http://doi.org/10.5281/zenodo.1209308).
1 Introduction
Transcription factor (TF) binding regulates gene expression. Each TF can harmonize expression of many genes by binding to genomic regions that regulate transcription. Cellular machinery utilizes these master regulators to regulate key cellular processes and adapt to environmental stimuli. Alteration in sequence or quantity of a given TF can impact expression of many genes. In fact, these alterations can be the primary cause of hereditary disorders, complex disease, autoimmune defects, and cancer1.
TFs bind to accessible chromatin based on weak non-covalent interactions between amino acid residues and nucleic acids. DNA’s primary structure (sequence)2, secondary structure (shape)3, and tertiary structure (conformation)4 all play roles in TF binding. Many TFs form a complex with others as well as chromatin-binding proteins and therefore bind to DNA indirectly. Some TFs also have different isoforms and undergo various post-translational modifications. In vitro assays, such as high throughput systematic evolution of ligands by exponential enrichment (HT-SELEX)5 and protein binding microarrays6, have provided a compelling understanding of context-independent TF sequence and shape preference7. Yet, for the aforementioned reasons, models trained on these in vitro data perform poorly when applied to in vivo experiments8,9. To address this challenge, we must explore how to better model DNA shape, TF-TF interactions, and context-dependent TF binding.
Chromatin immunoprecipitation and sequencing (ChIP-seq)10 and similar methods, such as ChIP-exo11 and ChIP-nexus12, can map the presence of a given TF in the genome of a biological sample. To map TFs, these assays require a minimum of 1,000,000 to 100,000,000 cells, depending on properties of the TF itself and available antibodies. Such large numbers of cells are not often available from clinical samples. Therefore, it is impossible to systematically assess TF binding in most disease systems. Assessing chromatin accessibility through transposase-accessible chromatin using sequencing (ATAC-seq)13, however, requires only hundreds or thousands of cells. One can obtain this many cells from many more clinical samples. While chromatin accessibility does not determine TF binding, several methods use this information together with knowledge of TF sequence preference, genomic conservation, and other genomic features to predict TF binding14,15,16.
Predicting TF binding with motif discovery tools within chromatin accessible regions has helped us understand the role of several TFs in various diseases. For example, He et al.17 used motif discovery tools to identify the role of OCT1 and NKX3-1 after prolonged androgen stimulation in prostate cancer. Similarly, Bailey et al.18 discovered that a known breast cancer risk single nucleotide polymorphism (SNP) upstream of ESR1 disrupts GATA3 binding and enhances expression of ESR1. We propose that using more accurate tools to predict TF binding will allow us to understand the role of TF binding in more contexts.
Previous studies have used various approaches to predict TF binding. Several methods use unsupervised approaches such as hierarchical mixture models14 or hidden Markov models15 to identify transcription factor footprints using chromatin accessibility data. These approaches use sequence motif scores to attribute footprints to different transcription factors. Convolutional neural network models can boost precision by learning sequence preferences from in vivo, rather than in vitro data20,21. Variation in sequence specificity and cooperative binding of some transcription factors prevents these methods from accurately predicting binding of all transcription factors. A more recent approach uses matrix completion to impute TF binding using a 3-mode tensor representing genomic positions, cell types, and TF binding22. This method doesn’t rely on sequence specificity, but can only predict TF binding in well-studied cell types with many ChIP-seq datasets. This means one cannot use it to predict binding in a cell type where ChIP-seq is not possible, such as limited clinical samples.
Identifying the best approach for predicting TF binding remains a challenge, because most studies use different benchmarking approaches. For example, one earlier study14 only assesses prediction on genomic regions that match the TF’s sequence motif. By excluding ChIP-seq peaks not matching the TF’s sequence motif from benchmarking, it underestimates false negative peaks and overestimates prediction accuracy. Most previous studies benchmark their predictions using the area under receiver operating characteristic curve (auROC) statistic22,23,24. When test data is imbalanced, meaning it has very different numbers of positive and negative examples, using auROC misleads evaluators25,26. Unfortunately, the TF binding status of genomic regions is highly imbalanced, making auROC alone a poor metric for evaluating TF binding prediction. Evaluation is further complicated by wildly varying prediction performance across different TFs. Recently, the ENCODE-DREAM in vivo TF Binding Site Prediction Challenge (DREAM Challenge) introduced guidelines for assessing TF binding prediction27. They recommend reporting both auROC, which assesses false negative predictions, and the area under precision-recall curve (auPR), which also assesses false positives.
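The effect of class imbalance on these two statistics can be illustrated with a small synthetic example. The sketch below is not part of the Virtual ChIP-seq software. It assumes toy scores for 990 unbound and 10 bound bins, computes auROC by pairwise comparison, and uses average precision as a common estimator of auPR.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), reverse=True)
    true_positives = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            total += true_positives / rank
    return total / true_positives


# Toy data (assumed): 990 unbound bins with scores 0..989 and 10 bound bins
# that each outrank roughly 90% or more of the unbound bins.
negative_scores = [float(i) for i in range(990)]
positive_scores = [890.5 + 10 * k for k in range(10)]
labels = [0] * len(negative_scores) + [1] * len(positive_scores)
scores = negative_scores + positive_scores

toy_auroc = auroc(labels, scores)
toy_aupr = average_precision(labels, scores)
print(f"auROC = {toy_auroc:.3f}, auPR = {toy_aupr:.3f}")
```

In this toy setting the auROC is high even though most of the top-ranked bins are false positives, which is the situation the auPR statistic exposes.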
RNA-seq allows us to obtain transcriptome data from samples with small cell counts, including patient samples. We hypothesized that we could leverage the transcriptome to better predict TF binding. Previous methods have predicted gene expression using information on active regulatory elements28,29,30. Others have predicted chromatin accessibility using gene expression data31, but they haven’t predicted TF binding using transcriptome data, as we do below.
Here, we introduce Virtual ChIP-seq, a novel method for more accurate prediction of TF binding. Virtual ChIP-seq predicts TF binding by learning from publicly available ChIP-seq experiments. Unlike Qin and Feng23, it can do this in new cell types with no existing ChIP-seq data. Virtual ChIP-seq also learns from other data such as genomic conservation, and the association of gene expression with TF binding.
Virtual ChIP-seq also accurately predicts the locations of DNA-binding proteins without known sequence preference. This would be impossible for most existing methods, which rely on sequence preference. Strictly speaking, only some of these proteins are TFs, but we usually refer to all DNA-binding proteins as TFs in this paper for ease of communication and comparison with other methods.
Virtual ChIP-seq predicted binding of 31 TFs in new cell types with a minimum Matthews correlation coefficient (MCC) of 0.3. These TFs had minimum accuracy (fraction of all predictions that were correct) of 0.99 and minimum specificity (fraction of unbound regions that were correctly predicted as unbound) of 0.99. Precision (fraction of positive predictions that were correct) ranged between 0.16 and 0.78 (Table 1). We predicted binding of these 31 TFs on 34 Roadmap Epigenomics32 cell types and provide these predictions as a track hub for community use (https://virchip.hoffmanlab.org).
2 Results
2.1 Sequence motifs are absent in most TF binding sites
2.1.1 Most ChIP-seq peaks lack the TF’s relevant sequence motif
Many computational tools predict TF binding using sequence preference data14,15. Most tools represent TF sequence preference in position weight matrix (PWM) format. PWMs encode the likelihood for presence of each nucleotide at different positions of a sequence motif. With tools such as FIMO33, we can efficiently search and rank genomic regions that match TF sequence motifs.
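As a minimal sketch of how a PWM ranks candidate sites, the example below scores every window of a short sequence with a log-odds score against a uniform background. The 4 bp matrix, the background frequency, and the function names are illustrative assumptions. This is not the FIMO implementation, which additionally converts scores to p-values.

```python
from math import log2

# Hypothetical 4 bp PWM: each column holds nucleotide probabilities.
PWM = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
]
BACKGROUND = 0.25  # assumed uniform background frequency


def log_odds(window, pwm):
    """Sum of per-position log2 ratios of motif to background probability."""
    return sum(log2(column[base] / BACKGROUND)
               for column, base in zip(pwm, window))


def scan(sequence, pwm):
    """Score every window of the sequence that is as wide as the PWM."""
    width = len(pwm)
    return [(start, log_odds(sequence[start:start + width], pwm))
            for start in range(len(sequence) - width + 1)]


hits = scan("AATGACGT", PWM)
best_start, best_score = max(hits, key=lambda hit: hit[1])
print(best_start, round(best_score, 2))
```

The highest-scoring window starts at the position of the TGAC match, and every window with a mismatch scores lower.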
One cannot determine a TF’s binding sites based solely on its sequence preference. We can identify some additional properties, such as co-binding partners, from high-throughput experiments. For other properties, such as post-translational modifications to the TF, we lack corresponding large-scale data. Therefore, we expect existing computational prediction methods to be more accurate for TFs where post-translational modifications and co-binding partners contribute less to TF binding. For TFs with more complex biology, however, we expect computational prediction methods to fail.
Using ChIP-seq data on 201 DNA-binding proteins in 54 different cell types, we investigated whether the majority of binding sites matched the sequence motif of the same TF. Among these 201 proteins, 76 lacked a sequence motif in JASPAR (Figure 1a, Supplementary Table 1). Some of these motif-free proteins, such as EZH2 and HDAC, are chromatin-binding proteins rather than true TFs. For simplicity in describing the prediction task, we refer to them as TFs nonetheless. Others are TFs without known sequence preference. For sequence-specific TFs, the fraction of peaks that match a sequence motif ranges from 4.55% (for SIX5) to 94.2% (for CTCF) with a mean of 49.4% (Figure 1b).
2.1.2 Many sequence motifs are not centrally enriched
Central enrichment measures how close a sequence motif occurs to a set of ChIP-seq peak summits. High central enrichment indicates direct TF binding19. We used CentriMo19 to measure central enrichment. We compared central enrichment between TFs with low motif occupancy (< 50% of ChIP-seq peaks contain the motif) and high motif occupancy (> 50% of peaks contain the motif; Figure 1c). TFs with low motif occupancy had weaker central enrichment (t-test; p = 0.02). For example, 30.87% of ATF3 peaks overlapped with the MA0605.1 JASPAR motif. ATF3 peaks also had lower central enrichment than MAFK peaks, which had 74.29% overlap with the MA0496.1 JASPAR motif (Figure 1d).
2.2 Model, performance, and benchmarking
2.2.1 Datasets
Virtual ChIP-seq learns from the association of gene expression and TF binding in publicly available datasets. Our method requires ChIP-seq data of each TF in as many cell types as possible, with matched RNA-seq data from the same cell types. We used ChIP-seq data (from Cistrome DB34 and ENCODE35) and RNA-seq data (from CCLE36 and ENCODE37) to assess Virtual ChIP-seq’s binding predictions for 63 DNA-binding proteins in new cell types.
In addition to benchmarking on our own held-out test cell types, we wanted to compare against the DREAM Challenge27. To do this, we also used their datasets, which include ChIP-seq data for 31 TFs. For most of these TFs, the DREAM Challenge held out test chromosomes instead of test cell types. The DREAM Challenge included ChIP-seq data for only 12 TFs in completely held-out cell types. Completely holding out cell types better fits the real-world scenarios that require binding site prediction. Using the datasets we generated, we had matched data in enough cell types to train and validate models for 9 of these 12 TFs (CTCF, E2F1, EGR1, FOXA1, GABPA, JUND, MAX, REST, and TAF1).
2.2.2 Learning from the transcriptome
Different cell types have distinct transcriptomic and epigenomic states38. Changing gene expression levels can affect patterns of TF binding and chromatin structure. We hypothesized that some gene expression changes would lead to consistent and observable changes in TF binding. As an extreme example, eliminating expression of a TF would eventually eliminate binding of that TF genome-wide. Other changes in gene expression could lead to competitive, cooperative, allosteric, and other indirect effects that would affect TF binding. To exploit this model, we identified genes with significant positive or negative correlation with TF binding at any given genomic bin. We did this for genes all over the genome, irrespective of distance from the binding site.
For each TF, we created an association matrix measuring correlation between gene expression and binding of that TF in previously collected datasets (Figure 2a-c). In this matrix, each value corresponds to the Pearson correlation between ChIP-seq binding of that TF at one genomic bin and the expression level of one gene. We used missing values when there was no significant association between gene expression and TF binding (p > 0.1).
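A single entry of such an association matrix could be computed as in the following sketch. The toy binding and expression values are invented. The exact permutation test used for the p-value is an assumed stand-in, because this section does not state which significance test was used.

```python
from itertools import permutations
from math import sqrt

ALPHA = 0.1  # significance cutoff stated in the text


def pearson(x, y):
    """Pearson correlation, or None when either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def permutation_p_value(x, y):
    """Two-sided exact permutation p-value (assumed stand-in test)."""
    observed = abs(pearson(x, y))
    shuffled = list(permutations(y))
    extreme = sum(abs(pearson(x, s)) >= observed - 1e-12 for s in shuffled)
    return extreme / len(shuffled)


def association(binding, expression, alpha=ALPHA):
    """Correlation for one (bin, gene) pair; None marks a missing value."""
    r = pearson(binding, expression)
    if r is None or permutation_p_value(binding, expression) > alpha:
        return None
    return r


# Toy data (assumed): binding status of one bin across 7 cell types and
# the expression of two genes in the same cell types.
binding = [1, 1, 1, 0, 0, 0, 0]
gene_a = [9.0, 8.5, 9.2, 1.0, 1.3, 0.8, 1.1]
gene_b = [5.0, 1.0, 4.0, 2.0, 5.5, 1.5, 4.5]

strong = association(binding, gene_a)
weak = association(binding, gene_b)
print(strong, weak)
```

Exhaustive permutation is only practical for a handful of cell types. A parametric test would be the usual choice for larger sample sizes.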
Power analysis (Methods) identified which correlations the p > 0.1 cutoff would exclude depending on the number of available cell types with matched ChIP-seq and RNA-seq data. For CTCF, which had the largest number of cell types available—21 cell types with matched ChIP-seq and RNA-seq—this cutoff provided 80% power to detect an absolute value of Pearson correlation |r| ≥ 0.52. Many TFs had only 5 cell types with matched data and the cutoff provided 80% power to detect only larger correlations, |r| ≥ 0.92.
We calculated an expression score for a TF in a new cell type using the association matrix and RNA-seq data for the new cell type, but no ChIP-seq data. The expression score is the Spearman correlation between the non-NA values for that genomic bin in the association matrix and the expression levels of those genes in the new cell type (Figure 2d, Figure 3a). We used the rank-based Spearman correlation to make the score robust against slight differences in analytical methodology used to estimate gene expression.
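The expression score for one genomic bin can be sketched as follows, assuming the association values for that bin are stored as a list with None for missing entries. The helper names and toy values are hypothetical. Spearman correlation is computed here as the Pearson correlation of average ranks.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, where tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return None if sxx == 0 or syy == 0 else sxy / sqrt(sxx * syy)


def expression_score(association_row, expression):
    """Spearman correlation over genes with a non-missing association."""
    pairs = [(a, e) for a, e in zip(association_row, expression)
             if a is not None]
    if len(pairs) < 2:
        return None
    assoc, expr = zip(*pairs)
    return pearson(average_ranks(assoc), average_ranks(expr))


# Toy data (assumed): associations of 6 genes with one bin, and the
# expression of the same genes in two new cell types.
association_row = [0.9, None, -0.8, 0.6, None, -0.7]
concordant_cell = [120.0, 3.0, 2.0, 45.0, 80.0, 10.0]
discordant_cell = [2.0, 3.0, 120.0, 10.0, 80.0, 45.0]

score_high = expression_score(association_row, concordant_cell)
score_low = expression_score(association_row, discordant_cell)
print(score_high, score_low)
```

Because only ranks enter the calculation, rescaling the expression values of the new cell type monotonically leaves the score unchanged.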
2.2.3 Learning from other predictive features
We included a number of other predictive features beyond expression score. Virtual ChIP-seq includes as input for each genomic bin the frequency of the TF’s presence in existing ChIP-seq data (Figure 3b). Since most TF binding occurs within accessible chromatin39, we also used evidence of chromatin accessibility from DNase-seq or ATAC-seq (Figure 3c).
While many intra-species genomic differences lie in the non-coding genome40, we expect some regulatory elements to be conserved among closely related species. Previous studies highlight the association of genomic conservation and TF binding in organisms as simple as yeast41 or as complex as human42. To learn from patterns of genomic conservation, we used PhastCons43,44 scores from a 7-way primate and placental mammal comparison (http://hgdownload.cse.ucsc.edu/goldenPath/hg38/phastCons7way) in our model (Figure 3d).
We used sequence motif score where available (Figure 3e). Relying only on TF sequence preference, however, would prevent accurate prediction of most true TF binding sites9 (Figure 1). For each TF, we represented sequence preference using the FIMO score of JASPAR sequence motifs of that TF or a similar TF. JASPAR has no motif for some TFs, such as EP300. Where JASPAR has more than one motif for a TF, additional motifs often represent different versions of the motif such as SREBF2 (MA0596.1) and SREBF2-var2 (MA0828.1). In some cases, the additional motif represents a preference of a cooperative TF heterodimer, such as MAX-MYC (MA0059.1). Regardless of reason, we included all of each TF’s motifs as features in its model (Supplementary Table 2).
We also investigated potential improvements by adding a couple of additional integrative features available for a limited number of TFs and cell types (Supplementary Table 2). First, we used the output of Hidden Markov model-based Identification of TF footprints (HINT)15 which identifies TF footprints within accessible chromatin. Second, we used a boolean feature indicating overlap of each genomic bin with clusters of chromatin accessibility peaks identified by CREAM45.
2.2.4 Selecting hyperparameters and training
We created an input matrix with rows corresponding to 200 bp genomic windows and columns representing the features described above. Specifically, these features included expression score (Figure 3a), previous evidence of binding of TF of interest in publicly available ChIP-seq data (Figure 3b), chromatin accessibility (Figure 3c), genomic conservation (Figure 3d), sequence motif scores (Figure 3e), HINT footprints, and CREAM peaks. We used sliding genomic bins with 50 bp shifts, where most 200 bp bins overlap six other bins. This provided a maximum resolution of 50 bp in binding prediction. This resulted in a sparse matrix with 60,620,768 rows representing each bin in the GRCh38 genome assembly46. The sparse matrix used in the main model had between 4 and 11 columns, depending on the number of available sequence motifs. When we added HINT footprints and CREAM peaks, the matrix had between 6 and 13 columns instead. We trained on an imbalanced subset of genomic regions which had TF binding or chromatin accessibility (FDR < 10⁻⁴) in any of the training cell types. To speed the process of training and evaluation, we further limited training input data to four chromosomes (chr5, chr10, chr15, and chr20). For validation, however, we used data from these same four chromosomes in completely different cell types held out from training. We evaluated the performance on all of the 9,635,407 bins in these four chromosomes (Figure 3f), not just those with prior evidence of TF binding or chromatin accessibility.
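The binning scheme can be sketched in a few lines. The example assumes 0-based half-open coordinates and drops any trailing partial window. Both are choices made for illustration, not details taken from the method. It confirms that an interior 200 bp bin overlaps six neighbors when bins shift by 50 bp.

```python
BIN_WIDTH = 200  # bp, from the text
STEP = 50        # bp, from the text


def sliding_bins(chrom_length, width=BIN_WIDTH, step=STEP):
    """Return (start, end) bins; coordinates are 0-based and half-open."""
    return [(start, start + width)
            for start in range(0, chrom_length - width + 1, step)]


def count_overlaps(bins, index):
    """Number of other bins sharing at least 1 bp with bins[index]."""
    start, end = bins[index]
    return sum(1 for i, (s, e) in enumerate(bins)
               if i != index and s < end and e > start)


bins = sliding_bins(1000)  # hypothetical 1 kb sequence
print(len(bins), count_overlaps(bins, 8), count_overlaps(bins, 0))
```

Bins at the ends of a sequence have fewer neighbors, which is why only most bins, rather than all of them, overlap six others.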
To build a generalizable classifier that performs well on new cell types with only transcriptome and chromatin accessibility data, we concatenated input matrices from 12 training cell types: A549, GM12878, HepG2, HeLa-S3, HCT-116, BJ, Jurkat, NHEK, Raji, Ishikawa, LNCaP, and T47D (Supplementary Table 3).
2.2.5 The multi-layer perceptron
The multi-layer perceptron (MLP) is a fully connected feed-forward artificial neural network47. Our MLP assumes binding at each genomic window is independent of upstream and downstream windows (Figure 3). For each TF, we trained the MLP with adaptive momentum stochastic gradient descent48 and a minibatch size of 200 samples. We used 4-fold cross validation to optimize hyperparameters including activation function (Figure 3g), number of hidden units per layer (Figure 3h), number of hidden layers (Figure 3i), and L2 regularization penalty (Figure 3j). In each cross validation fold, we iteratively trained on 3 of the 4 chromosomes (5, 10, 15, and 20) at a time, and assessed performance in the remaining chromosome. We selected the model with the highest average Matthews correlation coefficient (MCC)49 after 4-fold cross validation. MCC incorporates all four categories of a confusion matrix and assesses performance well even on imbalanced datasets50. For 23 TFs, the optimal model had 10 hidden layers; for another 23 TFs, it had 5 hidden layers; and for the final 17 TFs, it had only 2 hidden layers. For 57 TFs, the best-performing model had 100 hidden units in each layer. The optimal models of the other 6 TFs had 10–24 hidden units in each hidden layer. Different activation functions, namely sigmoid, hyperbolic tangent (tanh), or rectifier, proved optimal for different TFs (Supplementary Table 4).
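The leave-one-chromosome-out selection procedure can be outlined as below. This is a schematic sketch rather than the training code. The activation functions and layer counts follow the text, whereas the listed unit counts and L2 penalties are illustrative guesses, and `fit_and_score` is a hypothetical caller-supplied function that trains a model and returns its MCC on the held-out chromosome.

```python
from itertools import product
from statistics import mean

CHROMOSOMES = ["chr5", "chr10", "chr15", "chr20"]

# Layer counts and activations follow the text; the unit counts and
# L2 penalties are assumed values for illustration only.
GRID = {
    "activation": ["sigmoid", "tanh", "rectifier"],
    "hidden_layers": [2, 5, 10],
    "hidden_units": [10, 50, 100],
    "l2_penalty": [1e-4, 1e-2],
}


def chromosome_folds(chromosomes):
    """Yield (training chromosomes, held-out chromosome) pairs."""
    for held_out in chromosomes:
        yield [c for c in chromosomes if c != held_out], held_out


def select_model(grid, chromosomes, fit_and_score):
    """Return the setting with the highest mean MCC across the folds."""
    names = list(grid)
    best_params, best_mcc = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        average = mean(fit_and_score(params, train, test)
                       for train, test in chromosome_folds(chromosomes))
        if average > best_mcc:
            best_params, best_mcc = params, average
    return best_params, best_mcc


def toy_fit_and_score(params, train_chroms, test_chrom):
    """Stand-in scorer that returns a made-up MCC without training."""
    bonus = 0.05 if params["activation"] == "tanh" else 0.0
    return 0.3 + 0.01 * params["hidden_layers"] + bonus


folds = list(chromosome_folds(CHROMOSOMES))
best_params, best_mcc = select_model(GRID, CHROMOSOMES, toy_fit_and_score)
print(best_params, round(best_mcc, 3))
```

Replacing `toy_fit_and_score` with a function that actually fits a classifier would turn this outline into a working grid search.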
2.2.6 Virtual ChIP-seq predicts TF binding with high accuracy
We evaluated the performance of Virtual ChIP-seq in validation cell types (K562, PANC-1, MCF-7, IMR-90, H1-hESC, and primary liver cells) which we did not use in calculating the expression score, training the MLP, or optimizing hyperparameters. Before predicting in new cell types, we chose a posterior probability cutoff for use in point metrics such as accuracy and F1 score. When a TF had ChIP-seq data in more than one of the validation cell types, we chose the cutoff that maximizes MCC of that TF in H1-hESC cells. Then, we excluded H1-hESC when reporting threshold-requiring metrics. For the remaining TFs, which had ChIP-seq data in only one validation cell type, we pre-set a posterior probability cutoff of 0.4, the mode of the cutoffs for other TFs (Supplementary Table 5).
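Choosing a cutoff by maximizing MCC amounts to a sweep over candidate thresholds. The sketch below uses invented labels and posterior probabilities for ten bins and a hypothetical grid of candidate cutoffs. MCC is computed from the four confusion-matrix counts.

```python
from math import sqrt


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0 when a marginal count is empty."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denominator == 0 else (tp * tn - fp * fn) / denominator


def confusion(labels, probabilities, cutoff):
    """Counts (tp, fp, tn, fn) when calling bins with p >= cutoff bound."""
    tp = fp = tn = fn = 0
    for label, p in zip(labels, probabilities):
        if p >= cutoff:
            tp, fp = tp + (label == 1), fp + (label == 0)
        else:
            tn, fn = tn + (label == 0), fn + (label == 1)
    return tp, fp, tn, fn


def best_cutoff(labels, probabilities, candidates):
    """Candidate cutoff with the highest MCC (first one wins on ties)."""
    return max(candidates,
               key=lambda c: mcc(*confusion(labels, probabilities, c)))


# Toy data (assumed): true binding status and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
probabilities = [0.9, 0.7, 0.45, 0.5, 0.3, 0.2, 0.1, 0.1, 0.05, 0.05]
candidates = [i / 10 for i in range(1, 10)]

chosen = best_cutoff(labels, probabilities, candidates)
chosen_mcc = mcc(*confusion(labels, probabilities, chosen))
print(chosen, round(chosen_mcc, 3))
```

Because the cutoff is tuned on one cell type, metrics reported on that same cell type would be optimistic, which is consistent with excluding it from threshold-requiring metrics.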
We used area under precision-recall (auPR) curves to compare performance of Virtual ChIP-seq in validation cell types with other available methods. Virtual ChIP-seq predicts binding of 31 TFs in validation cell types with MCC > 0.3, auROC > 0.9, and 0.3 < auPR < 0.8 (Figure 4a, Table 1, Supplementary Table 6).
2.2.7 Virtual ChIP-seq correctly predicts binding sites in genomic locations not found in training data
We evaluated the performance of Virtual ChIP-seq for 63 TFs with binding in validation cell types. For 59 of these TFs, Virtual ChIP-seq predicted true TF binding in regions without conservation among placental mammals. For 44 out of 63 TFs, Virtual ChIP-seq predicted true TF binding in regions without TF binding in any of the training ChIP-seq data. From these 63 TFs, 43 are sequence-specific, and for all of these TFs, Virtual ChIP-seq predicted true binding for regions that did not match the TF’s sequence motif. For 47 TFs, Virtual ChIP-seq even correctly predicted TF binding in regions that didn’t overlap chromatin accessibility peaks (Supplementary Table 7). Most of these regions were frequently bound to the TF in publicly available ChIP-seq data. These predictions showed that the MLP learned to leverage multiple kinds of information and predict TF binding accurately, even in the absence of features required by previous generations of binding site classifiers.
2.2.8 Comparison with DREAM Challenge
DREAM Challenge rules forbid using genomic conservation or ChIP-seq data as training features. This also excludes the expression score, as creating its association matrix relies on ChIP-seq data. The challenge also required training and validation on its own provided datasets. These datasets have ChIP-seq data in only a few cell types. This restricts Virtual ChIP-seq’s approach, which leverages all publicly available datasets. The DREAM Challenge ChIP-seq datasets use only two replicates for each experiment and require that peaks have an irreproducible discovery rate (IDR)51 of less than 5%. IDR only handles experiments with exactly two replicates, but most of the public ChIP-seq experiments we used had more than two replicates (Supplementary Table 8). In these cases, we included peaks that pass a false discovery rate (FDR) threshold of 10⁻⁴ in at least two replicates.
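The replicate rule for experiments with more than two replicates can be expressed as a simple filter. In this sketch, the dictionary layout, the bin identifiers, and the use of None for a replicate without a peak are assumptions made for illustration.

```python
FDR_CUTOFF = 1e-4    # threshold stated in the text
MIN_REPLICATES = 2   # minimum number of supporting replicates


def reproducible_bins(replicate_fdrs, cutoff=FDR_CUTOFF,
                      min_replicates=MIN_REPLICATES):
    """Keep bins whose peak passes the FDR cutoff in enough replicates.

    replicate_fdrs maps a bin identifier to one FDR per replicate,
    with None where a replicate has no peak in that bin.
    """
    kept = []
    for bin_id, fdrs in replicate_fdrs.items():
        passing = sum(1 for fdr in fdrs if fdr is not None and fdr < cutoff)
        if passing >= min_replicates:
            kept.append(bin_id)
    return kept


# Toy data (assumed): three replicates of one ChIP-seq experiment.
example = {
    "chr5:1000-1200": [1e-6, 5e-5, 0.2],
    "chr5:1050-1250": [1e-6, 0.01, 0.3],
    "chr5:1100-1300": [None, 2e-5, 9e-5],
}
kept_bins = reproducible_bins(example)
print(kept_bins)
```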
The DREAM Challenge assessed participant entries by measuring performance on three validation chromosomes (chr1, chr8, and chr21), combined. To assess performance of Virtual ChIP-seq on DREAM Challenge data, we did the same. To assess performance on Cistrome data, however, we measured performance on each chromosome independently. This allowed us to examine the variance in performance among these chromosomes.
Although Virtual ChIP-seq used features not allowed in the DREAM Challenge, comparing with DREAM Challenge participants is the only sound way to show how any method including these features compares to the state of the art. Before the DREAM Challenge, TF binding prediction methods mostly reported performance measurements only in those parts of a chromosome where a method had more likelihood of success. The DREAM Challenge, like Virtual ChIP-seq, instead reports performance on the intended deployment domain of such methods: whole chromosomes. Leading DREAM Challenge methods potentially could improve their performance by including the features used by Virtual ChIP-seq. We compared Virtual ChIP-seq with DREAM Challenge results when we trained and validated on either Cistrome DB data or DREAM Challenge data.
2.2.9 Prediction accuracy varies by transcription factor
The DREAM Challenge evaluates predictions on binding of 31 TFs. The final submission round evaluates predictions for 12 TFs in held-out cell types. The datasets we used, however, allow us to predict binding of 63 TFs in new cell types. Of these TFs, 41 are unique to our dataset and do not overlap any of the DREAM Challenge TFs (Supplementary Table 9). The DREAM Challenge has data on the other 22 TFs, but the challenge evaluated only 9 of these TFs in its final round.
For CTCF, FOXA1, TAF1, and REST, Virtual ChIP-seq had a higher auPR in at least one validation cell type than any DREAM Challenge participant52,53. For EGR1 and E2F1, Virtual ChIP-seq performed better than at least one of the four top-performing methods of the challenge in one of the validation cell types (Figure 4b). DREAM Challenge and Cistrome ChIP-seq peak calls had different class imbalances, making auPR statistics not directly comparable (Supplementary Table 10). These imbalances were not always in the same direction. In FOXA1 peak calls in liver, for example, Cistrome called 0.12% of genomic bins bound to a TF, half the fraction of the DREAM Challenge (0.25%). Our predictions for FOXA1 binding in T47D and MCF-7 using Cistrome had a higher auPR than participants of DREAM Challenge for liver. The FOXA1 peak calls for these cell types also had a higher fraction of TF-bound genomic bins: 1.36% for MCF-7, and 0.39% for T47D. This opposed the smaller fraction of bins bound in Cistrome data in CTCF (in PANC-1, liver, and T47D), TAF1 (in liver, H1-hESC, K562, and T47D), and REST (in H1-hESC, K562, and PANC-1). The differences in class prevalence are both minor and in diverging directions. Because of this, they do not bias the baseline auPR of evaluation on Cistrome datasets in a particular direction when compared to evaluation on DREAM Challenge datasets.
The power of Virtual ChIP-seq to learn from the transcriptome data diminishes when fewer cell types are available, as in the DREAM Challenge data. Nonetheless, when trained on DREAM Challenge data, Virtual ChIP-seq outperformed 13/14 DREAM Challenge participants when predicting CTCF binding in PC-3 cells. When predicting CTCF binding in iPSC cells, Virtual ChIP-seq had a higher auPR than 8/14 Challenge participants. The Virtual ChIP-seq auPR for binding of REST in liver was also higher than that of 9/14 DREAM Challenge participants (Supplementary Table 11).
Virtual ChIP-seq predicted binding of 31 TFs with a median MCC > 0.3. These 31 TFs had an auPR between 0.27 and 0.84 (Table 1). Some of these TFs show high levels of consistent binding among different cell types, which makes predictions easier. The fraction of bins bound to a TF in at least half of training cell types, however, varies between 0 and 15.75% across all TFs. Even for TFs with a median auPR > 0.5 (purple in Figure 4a) the fraction of bins bound in half of training cell types varied from 0.5% in FOXA1 to 10.5% in NRF1. For some DNA-binding proteins, Virtual ChIP-seq fails to predict binding accurately (auPR < 0.3). DNA-binding proteins with low auPR and low MCC include chromatin modifiers such as KAT2B, KDM1A, and EZH2, and chromatin-binding proteins such as CHD1 and BRD4. TFs with low prediction accuracy include ATF2, CUX1, E2F1, EP300, FOSL1, FOXM1, JUN, RCOR1, RELA, RXRA, SREBF1, TCF12, TCF7L2, and ZBTB33. For some proteins, such as ATF2, EP300, EZH2, FOXM1, KAT2B, KDM1A, TCF12, and TCF7L2, in at least one validation cell type, most ChIP-seq peaks didn’t overlap with chromatin accessible regions.
2.3 The choice of input features determines prediction performance
2.3.1 The most important features
To evaluate the importance of each feature in our predictive model, we performed an ablation study on training data. First, we systematically removed features. Second, we fitted the model without these features on some of the training cell types (HeLa-S3, GM12878, HCT-116, LNCaP). Third, we evaluated performance on one held-out training cell type (HepG2; Supplementary Table 12). This ablation study did not use any of the validation cell types which we used for final evaluation of the model.
We called the effect of excluding an input feature substantive only when the average increase or decrease in auPR was at least 0.05. Excluding sequence motif, HINT, or CREAM did not substantively change performance of the model for most TFs (Figure 5). Excluding publicly available ChIP-seq data, the expression score, or both decreased performance in most TFs. Excluding expression score substantively decreased median auPR in 13/21 TFs, while excluding publicly available ChIP-seq data substantively decreased auPR in 18/21 TFs.
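The substantive-change criterion reduces to comparing the mean difference in auPR against the 0.05 threshold. The sketch below assumes that each model yields a list of auPR values, for example one per evaluation chromosome. The values shown are invented.

```python
from statistics import mean

SUBSTANTIVE_CHANGE = 0.05  # threshold stated in the text


def classify_ablation(full_aupr, ablated_aupr, threshold=SUBSTANTIVE_CHANGE):
    """Label the average auPR change caused by removing a feature."""
    change = mean(a - f for f, a in zip(full_aupr, ablated_aupr))
    if change <= -threshold:
        return "substantive decrease"
    if change >= threshold:
        return "substantive increase"
    return "no substantive change"


# Toy data (assumed): auPR of the full model and of two ablated models.
full_model = [0.60, 0.62, 0.58]
without_chip = [0.30, 0.35, 0.28]
without_motif = [0.59, 0.63, 0.57]

chip_effect = classify_ablation(full_model, without_chip)
motif_effect = classify_ablation(full_model, without_motif)
print(chip_effect, "|", motif_effect)
```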
2.3.2 Inclusion of some features has opposite effects on prediction of different TFs
Beyond the most important features—ChIP-seq and expression score—excluding other features rarely substantively decreased prediction performance (Figure 5b-c). When we excluded sequence motifs, auPR decreased substantively for ZBTB33, JUN, JUND, FOXA1, and ELF1. Excluding HINT footprints decreased auPR substantively only for CEBPB, JUN, and JUND. Excluding CREAM clusters of chromatin accessibility peaks decreased auPR substantively only for ZBTB33, ELF1, and FOXA1.
Removing certain input features actually improved prediction for some TFs (Figure 5b-c). Associations that differed between training cell types and validation cell types suggested that these input features generalize poorly. For example, CREAM clusters’ overlap with NRF1 ChIP-seq peaks was not consistent among GM12878 (7.52%), HeLa-S3 (31.8%), and HepG2 (25.78%). This represented a significant variation among these cell types (ANOVA; p = 1.9 × 10⁻⁴).
While most TF footprints (95.96%) overlapped NRF1 peaks, TF footprints constituted only a small fraction of NRF1 peaks (0.73%). NRF1 peaks overlapped a small proportion of TF footprints in training cell types GM12878 (1.14%) and HeLa-S3 (0.59%), but this proportion was significantly greater than the 0.45% overlap in HepG2 (Welch t-test; p = 0.007). In HepG2, 7.28% of YY1 peaks overlapped TF footprints, while in the training cell type GM12878 the overlap was only 1.22% (Welch t-test; p = 5 × 10⁻⁵) and in the other training cell type HCT-116 the overlap was much higher (17.92%; Welch t-test; p = 5 × 10⁻⁶). Overlap of ZBTB33 peaks with TF footprints was much smaller in HepG2 (0.49%) than in training cell types GM12878 (2.32%) and HCT-116 (5.27%; Welch t-test; p = 6 × 10⁻⁴). Features with varying and cell-specific association with TF binding complicate convergence of the MLP and may result in overfitting. As a result, the MLP achieved a higher performance on some TFs when we ablated those features.
Association of clusters of regulatory elements and TF footprints with TF binding varies among cell types. Using a CREAM feature substantively improved performance in 3/21 TFs and using a HINT feature substantively improved performance in 3/21 TFs (Figure 5b-c). In contrast, including CREAM substantively decreased performance in 1 case and including HINT did so in 4 cases. When we repeated this experiment using different training and validation cell types, clusters of regulatory elements and TF footprints increased performance for some TFs and decreased it for others, but rarely increased auPR by more than 0.05. Because of the limited upside and apparent downside, we didn’t use these two features for our final model.
2.4 Transcription factors and their targets regulate similar biological pathways
2.4.1 Gene set enrichment analysis of TF targets
To understand biological implications of transcriptome perturbation in response to TF binding, we measured how frequently each gene’s expression associated with binding of each TF. We hypothesized that if expression of a gene consistently correlates with binding of a TF, it is a potential target of that TF. Similarly, if the expression of a gene negatively correlates with binding of a TF, cellular machinery upregulated by that TF might cause net suppression of that gene’s expression.
To identify such genes, for each TF, we ranked genes by subtracting the number of genomic bins with which they are positively correlated from the number of genomic bins with which they are negatively correlated. We call this difference the association delta. For each TF, we identified the 5,000 genes with the highest variance in expression among cells with matched RNA-seq and ChIP-seq data (Figure 2a). We measured correlation of expression of each of the 5,000 genes with TF binding at every 100 bp genomic window in 4 chromosomes (chr5, chr10, chr15, and chr20). This approach identified genes that have consistent positive or negative association with TF binding (Figure 6a). We considered these genes as potential targets of each TF, and used the Gene Set Enrichment Analysis (GSEA) tool55 to identify pathways with significant enrichment in either direction (Figure 6a). Only the rank of association delta affects these results, and we presumed that there would be little difference in using all chromosomes instead of just 4. The 4-chromosome analysis for JUND had no significant rank difference from an analysis of chromosome 10 alone (Wilcoxon rank sum test p = 0.3). We only investigated Gene Ontology (GO) terms annotated to a minimum of 10 and a maximum of 500 out of a total of 17,106 GO-annotated genes.
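As an illustration only, the association delta can be sketched as follows. The sketch assumes a bins × genes association matrix in which non-significant correlations are stored as NaN; the function name `association_delta` and the toy values are hypothetical, and the sign convention follows the sentence above as written (negative count minus positive count).

```python
import numpy as np

def association_delta(assoc):
    """Per-gene association delta from a (bins x genes) association matrix.

    NaN entries mark non-significant correlations and are not counted.
    """
    assoc = np.asarray(assoc, dtype=float)
    with np.errstate(invalid="ignore"):      # comparisons with NaN evaluate to False
        n_pos = np.sum(assoc > 0, axis=0)    # bins with positive correlation
        n_neg = np.sum(assoc < 0, axis=0)    # bins with negative correlation
    return n_neg - n_pos                     # sign convention as stated in the text

# Toy matrix: 4 genomic bins (rows) x 3 genes (columns)
assoc = np.array([
    [0.9, -0.8, np.nan],
    [0.7, -0.6, 0.5],
    [np.nan, -0.9, -0.4],
    [0.8, np.nan, np.nan],
])
delta = association_delta(assoc)
ranking = np.argsort(delta, kind="stable")   # gene indices ordered by delta
```

Because only the rank of the delta enters the enrichment analysis, flipping the sign convention would reverse the ranking without changing which genes sit at the extremes.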
We identified 1,681 GO terms with significant enrichment (GSEA p < 0.001) among potential targets of at least one of the 113 TFs we investigated (Figure 6b). Only 63 of these 113 TFs had matched ChIP-seq and RNA-seq in at least 5 of the training cell types and one of the validation cell types we used for learning from the transcriptome. Each TF had potential targets with significant enrichment in a mean of 92 terms (median 76; Figure 6c). Each of the 1,681 terms had significant enrichment in potential targets of a mean of 6 TFs (median 2; Figure 6d). Furthermore, 300 of these GO terms had significant enrichment in potential targets of at least 10 TFs.
To identify TFs involved in similar biological processes, we searched for enrichment of any of the 1,681 GO terms in 113 TFs. This analysis relied on the GSEA enrichment score as a normalized test statistic. We examined the pairwise correlation between the vector of enrichment scores for each pair of TFs. These pairwise correlations constitute a symmetric correlation matrix. We hypothesized that TFs with high correlation are involved in similar biological processes.
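A minimal sketch of this step, assuming the enrichment scores are arranged as a GO terms × TFs matrix and that the correlation is Pearson's (the text does not name the coefficient). The matrix here is random toy data, not real enrichment scores.

```python
import numpy as np

rng = np.random.default_rng(0)
n_terms, n_tfs = 50, 6            # toy sizes; the analysis above has 1,681 terms and 113 TFs

# Hypothetical enrichment scores: rows are GO terms, columns are TFs
scores = rng.normal(size=(n_terms, n_tfs))
# Make TF 1 resemble TF 0 so that the pair shows a high correlation
scores[:, 1] = scores[:, 0] + 0.1 * rng.normal(size=n_terms)

# Pairwise correlation between TF columns gives a symmetric n_tfs x n_tfs matrix
corr = np.corrcoef(scores, rowvar=False)
```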
To identify groups of TFs involved in similar biological processes, we performed hierarchical clustering on the correlation matrix. We sought to identify clusters of TFs, and the best number of clusters between 2 and 10, inclusive. As a control, we generated a correlation matrix of the same dimensions from a matrix of random Gaussian values (Methods). For each matrix we repeatedly generated random subsamples and clustered them. For each subsample, we found the set of pairs of TFs with the same cluster membership. For pairs of these subsamples, we identified the Jaccard index between these sets as a measure of cluster stability104 (Methods). We then compared the increase or decrease in Jaccard indices from each number of clusters to the number of clusters one larger.
The smallest number of clusters with an increase in Jaccard index only for the correlation matrix was 6 (Figure 6e-f). We assigned names to these clusters based on their enriched biological pathways. We then examined the TFs included in those clusters. The Neural cluster (Figure 6g) includes ASCL1 [56], HSF1 [61], GATA2 [60], and PPARγ [62]. These TFs play important roles in the development of the nervous system and are implicated in neurological disorders [56, 60, 61, 62]. The top 5 GO terms enriched in the potential targets of these TFs are all related to nervous system development and function (Figure 6g). The downregulated pathways of the Motility cluster (Figure 6h) relate to cytoskeletal organization. The included TFs, CTBP1 [66], KDM5B [67], MEF2A [68], and STAT1 [69], all play a role in the epithelial-to-mesenchymal transition, which involves re-organization of the cytoskeleton. Similarly, we found that for other clusters, specific upregulated or downregulated pathways of a cluster’s targets are also regulated by many of the cluster’s TFs (Figure 6i-l, Table 2).
2.5 A compendium of TF binding predictions for 34 tissues and cell types
2.5.1 Predicting TF binding in Roadmap datasets
The Roadmap Epigenomics Project32 performed DNase-seq on 55 and RNA-seq on 39 human tissues and cell types, but not ChIP-seq of any TF. For 34 of these tissues, they produced matched DNase-seq and RNA-seq data. This makes the Roadmap data an ideal application for Virtual ChIP-seq.
We generated an annotation similar to peak calls by converting the MLP’s posterior probabilities to a presence or absence call. We made this call based on a different cutoff for each TF. We defined this cutoff as the posterior probability which maximized MCC in H1-hESC. For TFs without ChIP-seq data in H1-hESC, we used the mode of cutoffs from the other TFs (0.4). We excluded H1-hESC when reporting all performance metrics that depend on this threshold. The number of binding sites we predicted in other validation cell types and Roadmap data is similar to the number of ChIP-seq peaks in other validation cell types (Figure 7a).
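The cutoff selection can be sketched as a scan over candidate thresholds, keeping the one with the highest MCC. This is an illustrative sketch under stated assumptions: the candidate grid, the labels, and the posterior values are toy inputs, and the helper names `mcc` and `best_cutoff` are hypothetical.

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0.0 if undefined)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = float(np.sum(y_true & y_pred))
    tn = float(np.sum(~y_true & ~y_pred))
    fp = float(np.sum(~y_true & y_pred))
    fn = float(np.sum(y_true & ~y_pred))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def best_cutoff(y_true, posterior, candidates):
    """Return the candidate cutoff with the highest MCC, and that MCC."""
    posterior = np.asarray(posterior, dtype=float)
    scores = [mcc(y_true, posterior >= c) for c in candidates]
    best = int(np.argmax(scores))            # first maximum wins on ties
    return candidates[best], scores[best]

# Toy reference cell type: 1 = bin overlaps a ChIP-seq peak summit
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
posterior = [0.05, 0.10, 0.35, 0.45, 0.30, 0.60, 0.80, 0.90]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
cutoff, score = best_cutoff(y_true, posterior, candidates)
```

The chosen cutoff would then be applied unchanged to posterior probabilities from other cell types.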
Using the cutoff which maximized MCC in H1-hESC only slightly decreased performance measurements from what one could achieve with the optimal cutoff for each cell type (Figure 7b). For example, the MCC score showed a median decrease of 0.06 and F1 score showed a median decrease of 0.1.
Narrowing predictions to only those that pass the cutoff, we found that many correctly predicted binding sites in K562 lack important predictive features of TF binding (Figure 7c). For example, many of the correctly predicted binding sites of EZH2 and KAT2B are not conserved among placental mammals. Many correctly predicted binding sites for MAFK, REST, FOSL1, and CTCF don’t overlap chromatin accessibility peaks. We correctly predicted many binding sites for TCF12, RCOR1, TEAD4, CHD1, FOXM1, GABPA, and CUX1 in regions that have no binding in other cell types. In these cases, the MLP learned from other available predictive features. For example, in RCOR1, all novel correctly predicted binding sites of chromosome 5 overlapped chromatin accessibility peaks. These correct predictions also had an average genomic conservation of 0.19, which was significantly higher than that of other genomic bins (Welch t-test p = 0.006).
As a community resource, we created a public track hub (https://virchip.hoffmanlab.org) with predictions for 34 Roadmap cell types (Figure 7d). This track hub contains predictions for 31 TFs which had a median MCC > 0.3 in validation cell types (Table 1).
3 Methods
3.1 Data used for prediction
3.1.1 Overlapping genomic bins
To generate the input matrix for training and validation, we used 200 bp genomic bins with sliding 50 bp windows. We excluded any genomic bin which overlaps with ENCODE blacklist regions (https://www.encodeproject.org/files/ENCFF419RSJ/@@download/ENCFF419RSJ.bed.gz). Except where otherwise specified, we used the Genome Reference Consortium GRCh38/hg38 assembly [46].
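A minimal sketch of the binning, assuming 0-based half-open coordinates as in BED files and reading "sliding 50 bp windows" as bin starts that advance by 50 bp. The blacklist interval and the helper names are hypothetical.

```python
def sliding_bins(chrom_length, width=200, step=50):
    """Yield half-open (start, end) bins of `width` bp whose starts advance by `step` bp."""
    for start in range(0, chrom_length - width + 1, step):
        yield start, start + width

def overlaps(bin_start, bin_end, regions):
    """True if the half-open bin shares at least one base with any region."""
    return any(bin_start < r_end and r_start < bin_end for r_start, r_end in regions)

# Hypothetical blacklist interval on a toy 600 bp chromosome
blacklist = [(300, 350)]
all_bins = list(sliding_bins(600))
bins = [b for b in all_bins if not overlaps(b[0], b[1], blacklist)]
```

With these settings each base away from the chromosome ends falls in four bins, which is why a single peak summit can mark several adjacent bins as bound.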
3.1.2 Chromatin accessibility
We used Cistrome DB ATAC-seq and DNase-seq narrowPeak files for assessing chromatin accessibility (Supplementary Table 8). We mapped the signal value of peak summits to all the bins overlapping that summit. In rare cases where a genomic bin overlaps more than one summit, we used the signal value of the summit closest to the p terminus of the chromosome. When data were available from multiple experiments, we averaged signal values. Because Cistrome DB does not include raw data that one can use for DNase footprinting, we limited the analysis of HINT TF footprinting and CREAM regulatory element clustering to ENCODE DNase-seq experiments on GM12878, HCT-116, HeLa-S3, LNCaP, and HepG2.
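The summit-to-bin mapping for a single experiment can be sketched as below. The sketch assumes 0-based summit positions, bins that start at multiples of 50 bp, and that "closest to the p terminus" means the smallest coordinate; it ignores the chromosome end and omits the averaging across experiments. The function name is hypothetical.

```python
def summit_signal_per_bin(summits, width=200, step=50):
    """Map (position, signal) summits to bin starts; the smallest-coordinate summit wins."""
    signal = {}
    for pos, value in sorted(summits):               # ascending genomic position
        first = max(0, (pos - width) // step + 1)    # index of first bin containing pos
        last = pos // step                           # index of last bin containing pos
        for k in range(first, last + 1):
            signal.setdefault(k * step, value)       # keep the earlier summit if already set
    return signal

# Two toy summits; the bin starting at 100 contains both
signal = summit_signal_per_bin([(260, 8.0), (120, 5.0)])
```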
3.1.3 Genomic conservation
We used GRCh38 primate and placental mammal 7-way PhastCons genomic conservation43,44 scores from the UCSC Genome Browser105 (http://hgdownload.cse.ucsc.edu/goldenPath/hg38/phastCons7way). We assigned each bin the mean PhastCons score of the nucleotides within.
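As a sketch of the per-bin averaging, assuming a dense per-nucleotide score array with no missing values (real PhastCons tracks have gaps, which this sketch does not handle); the function name `bin_means` is hypothetical.

```python
import numpy as np

def bin_means(scores, width=200, step=50):
    """Mean score in each `width` bp bin, with bin starts every `step` bp."""
    scores = np.asarray(scores, dtype=float)
    csum = np.concatenate(([0.0], np.cumsum(scores)))      # prefix sums
    starts = np.arange(0, len(scores) - width + 1, step)
    return (csum[starts + width] - csum[starts]) / width    # window sum / window size

# Toy per-nucleotide scores for a 300 bp region
means = bin_means(np.arange(300))
```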
3.1.4 Sequence motif score
We used FIMO [33] (version 4.11.2) to search for motifs from JASPAR 2016 [106] to identify binding sites of each TF that have the sequence motif of that TF. To get a liberal set of motif matches, we used a liberal p-value threshold of 0.001 and didn’t adjust for multiple testing. If the motif for the TF didn’t exist in JASPAR, we used other motifs with the same initial 3 letters and counted any TF binding site that overlapped any of those motifs (Supplementary Table 1).
We also used FIMO and JASPAR 2016 to identify the sequence specificity of chromatin accessible regions. For this analysis, we used a false discovery rate threshold of 0.01%. We used any sequence motif matching the initial 3 letters of a TF as a predictive feature of binding for that TF. For many TFs, more than one motif matched this criterion, and we used all as independent features in the model (Supplementary Table 2).
3.1.5 ChIP-seq data
We used Cistrome DB and ENCODE ChIP-seq narrowPeak files. We only used peaks with FDR < 10⁻⁴. When multiple replicates of the same experiment existed, we only considered peaks that passed the FDR threshold in at least two replicates. We considered bound only those genomic bins overlapping peak summits. We calculated prevalence of bound bins in each chromosome as the number of bound bins divided by the total number of bins, and used it as an auPR baseline25.
3.1.6 RNA-seq data
We downloaded an ENCODE expression matrix (https://public-docs.crg.es/rguigo/encode/expressionMatrices/H.sapiens/hg19/2014_10/gencodev19_genes_with_RPKM_and_npIDR_oct2014.txt.gz)37 with RNA-seq data for each gene, measured in reads per kilobase per million mapped reads (RPKM). We retrieved similar Cancer Cell Line Encyclopedia (CCLE) RNA-seq data using PharmacoGx107. Since these data are processed differently, we limited our analysis to Ensembl gene IDs shared between the two datasets, and ranked gene expression values by cell type. The two datasets have 4 shared cell types: A549, HepG2, K562, and MCF-7. Within each of these cell types, we examined the concordance of RNA-seq data between ENCODE and CCLE after possible transformations. The concordance correlation coefficient108 of rank of RPKM (0.827) was higher than that of untransformed RPKM (0.007) or quantile-normalized RPKM (0.006; Welch t-test p = 10⁻⁶). The DREAM Challenge, however, had processed RNA-seq of all cell types uniformly, allowing us to directly use transcripts per million reads (TPM) in analysis of DREAM Challenge datasets.
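To illustrate why ranks agree better than raw values across differently processed datasets, the sketch below computes Lin's concordance correlation coefficient before and after a rank transformation. The expression values are invented, the scaling of ranks to (0, 1] is an assumption, and ties are broken by order of appearance rather than averaged.

```python
import numpy as np

def rank_transform(values):
    """Ranks 1..n (ties broken by order of appearance), scaled to (0, 1]."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks / len(values)

def concordance_cc(x, y):
    """Lin's concordance correlation coefficient using population moments."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

# Toy expression of 4 genes in one cell type, as reported by two sources
source_a = [1.0, 10.0, 100.0, 1000.0]
source_b = [0.2, 0.9, 3.0, 50.0]          # same ordering, very different scale
ccc_raw = concordance_cc(source_a, source_b)
ccc_rank = concordance_cc(rank_transform(source_a), rank_transform(source_b))
```

Unlike Pearson's coefficient, the concordance coefficient penalizes differences in location and scale, so two sources that order genes identically can still score poorly on untransformed values.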
3.1.7 Expression score
We created an expression matrix for each TF with matched ChIP-seq and RNA-seq data in N ≥ 5 training cell types with the following procedure:
1. We divided the genome into $M$ 100 bp non-overlapping genomic bins.
2. We created a non-negative $M \times N$ ChIP-seq matrix $C$ (Figure 2a). We used signal mean among replicate narrowPeak files generated by MACS2 [109] for each of $M$ bins and $N$ cell types and quantile-normalized this matrix.
3. We row-normalized $C$ to $C'$, scaling the values of each row between 0 and 1.
4. We identified the $G = 5000$ genes with the highest variance among the $N$ cell types.
5. We created an $N \times G$ expression matrix $E$ containing the row-normalized rank of expression of each of the $G = 5000$ genes in $N$ cell types (Figure 2b).
6. For each bin $i \in [1, M]$ and each gene $g \in [1, G]$, we calculated the Pearson correlation coefficient $A_{i,g}$ between the ChIP-seq data for that bin $C'_{i,:}$ and the expression ranks for that gene $E_{:,g}$ over all cell types. If the Pearson correlation was not significant ($p > 0.1$), we set $A_{i,g}$ to NA. These coefficients constitute an $M \times G$ association matrix $A$ (Figure 2c).
We performed power analysis of the Pearson correlation test using the R pwr package110.
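The construction of the association matrix can be sketched as follows. This is a small illustration under stated assumptions: it uses SciPy's `pearsonr` (the text does not say which implementation computed the correlations), omits quantile normalization, and treats bins or genes that are constant across cell types as undefined. The function names and toy matrices are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

def row_normalize(matrix):
    """Scale each row to [0, 1]; constant rows become all zeros."""
    matrix = np.asarray(matrix, dtype=float)
    low = matrix.min(axis=1, keepdims=True)
    span = matrix.max(axis=1, keepdims=True) - low
    span[span == 0] = 1.0
    return (matrix - low) / span

def association_matrix(chip, expr, alpha=0.1):
    """chip: M bins x N cell types; expr: N cell types x G genes. Returns M x G."""
    chip_norm = row_normalize(chip)
    expr = np.asarray(expr, dtype=float)
    assoc = np.full((chip_norm.shape[0], expr.shape[1]), np.nan)
    for i in range(chip_norm.shape[0]):
        if np.ptp(chip_norm[i]) == 0:
            continue                          # correlation undefined for constant bins
        for g in range(expr.shape[1]):
            if np.ptp(expr[:, g]) == 0:
                continue                      # correlation undefined for constant genes
            r, p = pearsonr(chip_norm[i], expr[:, g])
            if p <= alpha:                    # non-significant entries stay NaN (NA)
                assoc[i, g] = r
    return assoc

# Toy data: 3 bins x 6 cell types, and 6 cell types x 2 genes
chip = [[0, 1, 2, 3, 4, 5], [5, 4, 3, 2, 1, 0], [1, 1, 1, 1, 1, 1]]
expr = np.array([[0.0, 0.5], [0.2, 0.1], [0.4, 0.9], [0.6, 0.3], [0.8, 0.7], [1.0, 0.2]])
assoc = association_matrix(chip, expr)
```

At genome scale a vectorized correlation would be preferable to the double loop, which is kept here for readability.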
To predict ChIP-seq binding for a new cell type (Figure 2d), we calculated an expression score for each genomic bin in that cell type. The expression score is Spearman’s ρ for expression of the same G = 5000 genes in the new cell type with every row of the association matrix A. Each of these rows represents a single genomic bin. An expression score close to 1 indicates that genes with high expression have high values in the association matrix, and genes with low expression have low values. An expression score close to –1 indicates that genes with high or low expression have opposite values in the association matrix (Figure 2d).
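A minimal sketch of the expression score, assuming NA entries of the association matrix are dropped before computing Spearman's ρ for each bin (the text does not state how NA values are handled). The `min_genes` parameter and the toy values are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr

def expression_scores(assoc, expression, min_genes=3):
    """Spearman's rho between each association-matrix row and new-cell-type expression."""
    assoc = np.asarray(assoc, dtype=float)
    expression = np.asarray(expression, dtype=float)
    scores = np.full(assoc.shape[0], np.nan)
    for i, row in enumerate(assoc):
        keep = ~np.isnan(row)                 # genes with a significant association
        if keep.sum() < min_genes:
            continue                          # too few genes to rank
        if np.ptp(row[keep]) == 0 or np.ptp(expression[keep]) == 0:
            continue                          # rho undefined for constant input
        rho, _ = spearmanr(row[keep], expression[keep])
        scores[i] = rho
    return scores

# Toy association matrix (3 bins x 4 genes) and expression of the 4 genes
assoc = [
    [0.9, 0.5, -0.7, np.nan],
    [-0.8, -0.4, 0.6, 0.2],
    [np.nan, np.nan, 0.3, np.nan],
]
expression = [10.0, 5.0, 1.0, 7.0]
scores = expression_scores(assoc, expression)
```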
3.2 Training, optimization, and benchmarking
3.2.1 Training and optimization
For the purpose of training and validating the model on Cistrome datasets, we only used chromosomes 5, 10, 15, and 20. These 4 chromosomes constitute 481.78 Mbp (15.6% of the genome). For training only, we excluded any genomic region without chromatin accessibility signal and previous evidence of TF binding. For validation and reporting performance, we included these regions, using the totality of the 4 chromosomes. We concatenated data from training cell types (A549, GM12878, HepG2, HeLa-S3, HCT-116, BJ, Jurkat, NHEK, Raji, Ishikawa, LNCaP, and T47D; Supplementary Table 3) into the training matrix.
We used Python 2.7.13, Scikit-learn 0.18.1 [111], NumPy 1.11.0, and Pandas 0.19.2 for processing data and training classifiers.
We optimized hyperparameters of the multi-layer perceptron (MLP)47 using grid search and 4-fold cross validation. We used minibatch training with 200 genomic bins in each minibatch. We searched over several options for the activation function (Figure 3g), number of hidden units per hidden layer (Figure 3h), number of hidden layers (Figure 3i), and L2 regularization penalty (Figure 3j). In each round of 4-fold cross-validation, we trained on data of 3 chromosomes, and assessed best MCC on the remaining chromosome. We selected the set of hyperparameters yielding the highest average MCC after 4-fold cross validation.
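The search can be sketched with Scikit-learn as below. This is an illustration using the current Scikit-learn API rather than version 0.18.1, with synthetic features, a deliberately tiny hypothetical grid, and `learning_rate_init` and `max_iter` chosen only so that the toy example trains quickly. Holding out one chromosome per fold is expressed with `LeaveOneGroupOut`.

```python
import numpy as np
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)

# Synthetic stand-in for the training matrix: 100 bins from each of 4 chromosomes
chroms = np.repeat(["chr5", "chr10", "chr15", "chr20"], 100)
X = rng.normal(size=(len(chroms), 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # toy "bound" label

# Hypothetical grid over the four hyperparameter types named in the text
param_grid = {
    "activation": ["relu", "tanh"],
    "hidden_layer_sizes": [(10,), (10, 10)],      # units per layer and number of layers
    "alpha": [1e-4, 1e-2],                        # L2 regularization penalty
}

search = GridSearchCV(
    MLPClassifier(batch_size=200, learning_rate_init=0.01, max_iter=300, random_state=0),
    param_grid,
    scoring=make_scorer(matthews_corrcoef),       # select on MCC
    cv=LeaveOneGroupOut(),                        # train on 3 chromosomes, test on the 4th
)
search.fit(X, y, groups=chroms)
```

After fitting, `search.best_params_` holds the combination with the highest mean MCC across the four held-out chromosomes.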
3.2.2 Benchmarking
We used the R precrec package112 to calculate auPR and auROC. Precision-recall curves better assess a binary classifier’s performance on imbalanced test data than ROC25,50.
3.2.3 DREAM Challenge comparison
For comparison to DREAM results, we also trained and validated the Virtual ChIP-seq model on GRCh37 DREAM Challenge data. For training the model on DREAM Challenge datasets, we used the data of chr5, chr10, chr15, and chr20 of training cell types. We evaluated performance against the union of the DREAM validation chromosomes (chr1, chr8, and chr21) in validation cell types. For CTCF, we trained on all cell types except MCF-7, PC-3, and iPSC which we used for validation. For MAX, we used all cell types except liver and K562 for training. For GABPA, REST, and JUND, we used all cell types except liver for training. We compared these metrics to those of DREAM Challenge participants in the final round of cross-cell-type competition.
3.3 Clustering TFs based on enrichment of their potential targets in GO terms
To identify groups of TFs involved in similar biological processes, we performed hierarchical clustering on the correlation matrix. We sought to identify clusters of TFs, and the best number of clusters between 2 and 10, inclusive. For use in this process, we created a Gaussian random matrix of 1,681 rows and 113 columns as a control, and calculated its correlation matrix. Then, we compared cluster stability between the original correlation matrix and the control for each potential number of clusters. To do this, we subsampled 75% of each correlation matrix’s rows twice without replacement. Then, we clustered TFs in each matrix into the specified number of clusters. For both of these clusterings, we constructed the set of every pair of TFs present in the same cluster. We then calculated the Jaccard index between the first clustering’s constructed set and that of the second [104]. We repeated this subsampling and clustering process 50 times for each number of clusters. We picked the smallest number of clusters which had an increase in Jaccard index compared to the number of clusters one smaller only in the TF correlation matrix.
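The stability measure itself can be sketched in a few lines. The example covers only the comparison of two clusterings, not the subsampling or the hierarchical clustering; the TF names, cluster labels, and function names are hypothetical.

```python
from itertools import combinations

def co_clustered_pairs(labels):
    """Set of unordered TF pairs that share a cluster; labels maps TF name -> cluster id."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(labels), 2)
        if labels[a] == labels[b]
    }

def jaccard(set_a, set_b):
    """Jaccard index of two sets; defined as 1.0 when both are empty."""
    union = set_a | set_b
    if not union:
        return 1.0
    return len(set_a & set_b) / len(union)

# Cluster assignments of the same 4 TFs obtained from two subsamples
first = {"TF_A": 0, "TF_B": 0, "TF_C": 1, "TF_D": 1}
second = {"TF_A": 0, "TF_B": 0, "TF_C": 0, "TF_D": 1}
stability = jaccard(co_clustered_pairs(first), co_clustered_pairs(second))
```

Comparing sets of co-clustered pairs rather than raw labels makes the measure independent of how the cluster identifiers happen to be numbered in each run.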
3.4 TF prediction on Roadmap data
We downloaded Roadmap DNase-seq and RNA-seq data aligned to GRCh38 from the ENCODE DCC32. For each DNase-seq narrowPeak file with matched RNA-seq, we predicted binding of 31 TFs with MCC > 0.3 in validation cell types (Table 1, Supplementary Table 6, https://virchip.hoffmanlab.org).
4 Discussion
Performing functional genomics assays to assess binding of all TFs may never be possible in patient tissues. Nevertheless, computational prediction of TF binding based on sequence specificity of TFs has identified the role of many TFs in various diseases1. Scanning the genome for occurrences of each sequence motif results in a range of 200–2000 predictions/Mbp. In some cases, this is 1,000 times more frequent than experimental data from ChIP-seq peaks. Similar observations led to a futility conjecture that almost all TF binding sites predicted in this way will have no functional role113.
Nevertheless, there is more to TF binding than sequence preference. Most TFs don’t have any sequence preference9 (Figure 1), and indirect TF binding through complexes of chromatin-binding proteins complicates predictions based solely on sequence specificity. In addition to the high number of false positive motif occurrences, many ChIP-seq peaks lack the TF’s sequence motif. Therefore, relying on sequence specificity alone not only generates too many false positives, but also many false negatives. We call this latter observation the dual futility conjecture, although it differs in degree from the original. Adding additional data about cellular state allows us to move beyond both conjectures.
We can assess TF binding through ChIP-seq or its more precise variations ChIP-nexus12 or ChIP-exo11. These experiments may still not properly reflect in vivo TF binding due to technical difficulties such as non-specific or low affinity antibodies. Using publicly available ChIP-seq data produced with different protocols and reagents complicates prediction of TFs more sensitive to experimental conditions52. Variations among training and validation cell types in our datasets caused the MLP to overfit to certain input features of some TFs. More robust approaches in assessment of TF binding, such as CRISPR epitope tagging ChIP-seq (CETCh-seq)114, which doesn’t rely on specific antibodies, may provide less noisy reference data for learning and prediction of TF binding.
Virtual ChIP-seq predicted binding of 31 TFs in new cell types, using only chromatin accessibility and transcriptome data from those cell types. By learning from direct evidence of TF binding and the association of the transcriptome with TF binding at each genomic region, most use of sequence motif scores becomes redundant. As more ChIP-seq data in diverse cell types and tissues become available, our approach will allow predicting binding of more TFs with high accuracy. This is true even in the case of factors that are not sequence-specific. Although Virtual ChIP-seq uses direct evidence of TF binding at each genomic region as one of the input features, it is able to correctly predict new peaks which don’t exist in training cell types. For 39 of 41 sequence-specific TFs, Virtual ChIP-seq correctly predicted TF binding in regions without any match to sequence motifs.
The DREAM Challenge datasets provide data for training and validating machine learning models for predicting binding of 31 TFs. Our datasets, using a combination of Cistrome DB and ENCODE, allow training and validating models for predicting binding of a more extensive set of 63 TFs. Our provided predictions of binding of 31 high-confidence TFs in 34 different Roadmap tissue types will allow the research community to better investigate epigenomics of disease affecting those tissues (https://virchip.hoffmanlab.org/). In addition to providing our predictions as a resource for use by biologists, we also provide the processed datasets we use as a resource for machine learning researchers. This should accelerate the development of future methods by many groups.
Competing interests
The authors declare that they have no competing interests.
Acknowledgments
We thank Shirley X. Liu for providing us with the Cistrome DB narrowPeak files. We thank the Roadmap Epigenomics Mapping Consortium and the ENCODE Project Consortium for generating the datasets which enabled this work. We thank Sage Bionetworks-DREAM and the ENCODE-DREAM Challenge organizers for providing data and results before publication. We thank Carl Virtanen and Zhibin Lu (University Health Network High Performance Computing Centre and Bioinformatics Core) for technical assistance. This work was supported by the Canadian Cancer Society (703827 to M.M.H.), the Ontario Ministry of Training, Colleges and Universities (Ontario Graduate Scholarship to M.K.), and the University of Toronto Faculty of Medicine Frank Fletcher Memorial Fund (M.K.).
Footnotes
5 Lead contact: michael.hoffman{at}utoronto.ca
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].
- [62].
- [63].
- [64].
- [65].
- [66].
- [67].
- [68].
- [69].
- [70].
- [71].
- [72].
- [73].
- [74].
- [75].
- [76].
- [77].
- [78].
- [79].
- [80].
- [81].
- [82].
- [83].
- [84].
- [85].
- [86].
- [87].
- [88].
- [89].
- [90].
- [91].
- [92].
- [93].
- [94].
- [95].
- [96].
- [97].
- [98].
- [99].
- [100].
- [101].
- [102].
- [103].
- [104].
- [105].
- [106].
- [107].
- [108].
- [109].
- [110].
- [111].
- [112].
- [113].
- [114].