SUMMARY
Algorithms that accurately predict gene structure from primary sequence alone were transformative for annotating the human genome. Can we also predict the expression levels of genes based solely on genome sequence? Here we sought to apply deep convolutional neural networks towards this goal. Surprisingly, a model that includes only promoter sequences and features associated with mRNA stability explains 59% and 71% of variation in steady-state mRNA levels in human and mouse, respectively. This model, which we call Xpresso, more than doubles the accuracy of alternative sequence-based models, and isolates rules as predictive as models relying on ChIP-seq data. Xpresso recapitulates genome-wide patterns of transcriptional activity and predicts the influence of enhancers, heterochromatic domains, and microRNAs. Model interpretation reveals that promoter-proximal CpG dinucleotides strongly predict transcriptional activity. Looking forward, we propose the accurate prediction of cell type-specific gene expression based solely on primary sequence as a grand challenge for the field.
INTRODUCTION
Cellular function is governed in large part by the repertoire of proteins present and their relative abundances. Initial attempts to model the gene regulatory forces specifying the mammalian proteome posited a major role for translational regulation, implying that mRNA levels might be more poorly predictive of protein abundance than is often assumed1. However, subsequent reanalyses of those data have shown that as much as 84% of variation in protein levels can be explained by mRNA levels, with transcription rates contributing 73%, and mRNA degradation rates contributing 11%2. This work reinforces that view that steady-state protein abundances are highly predictable as a function of mRNA levels3.
Although quantitative models that predict protein levels from mRNA levels are available4, we lack models that can accurately predict mRNA levels. Steady-state mRNA abundance is governed by the rates of transcription and mRNA decay. For each gene, a multitude of regulatory mechanisms are carefully integrated to tune these rates and thus specify the concentrations of the corresponding mRNAs that cells of each type will produce. Key mechanisms include: i) the recruitment of an assortment of transcription factors (TFs) to a gene’s promoter region, ii) epigenetic silencing, as frequently demarcated by Polycomb-repressed domains associated with H3K27me3 histone marks5, iii) the activation of genes by enhancers, stretch enhancers6, and super-enhancers7, frequently associated with H3K27ac histone marks and the Mediator complex, iv) the degradation of mRNA through microRNA-mediated targeting8 and PUF family proteins9, and v) the stabilization of mRNA through the recruitment of HuR10. Jointly modeling these diverse aspects of gene regulation within a quantitative framework has the potential to shed light on their relative importance, to elucidate their mechanistic underpinnings, and to uncover new modes of gene regulation.
Previous attempts to model transcription and/or mRNA decay can be broadly split into those based on correlative biochemical measurements and those based on primary sequence. In the former category, there have been several attempts to model the relationship between TF binding, histone marks, and/or chromatin accessibility and gene expression (e.g. predicting the expression levels of genes based on ChIP-seq and/or DNase I hypersensitivity data)11–17. While such models can clarify the relationships between heterogeneous, experimentally-derived biochemical marks and transcription rates, their ability to deliver mechanistic insights is limited. For example, for models relying on histone marks, the temporal deposition of such marks might follow, rather than precede, the events initiating transcription, in which case the histone marks reinforce or maintain, rather than specify, a transcriptional program. For models relying on measurements of TF binding, a substantial fraction of ChIP-seq peaks lack the expected DNA binding motif, and potentially reflect artifactual binding signals originating from the predisposition of ChIP to pull down highly transcribed regions or regions of open chromatin18–20.
In the latter category, there have also been a few attempts to model transcript levels or mRNA decay rates based solely on primary sequence. For example, a model of the spatial positioning of in silico predicted TF binding sites relative to transcriptional start sites (TSSs) was able to explain 8-28% of variability in gene expression14. Models based on simple features including the GC content and lengths of different functional regions (e.g., the 5′ UTR, ORF, introns, and 3′ UTR) and ORF exon junction density21,22 explain as much as 40% of the variability in mRNA half-lives, which are in turn estimated to explain 6-15% of the variability in steady-state mRNA levels in mammalian cells1,2,22. However, the vast majority of variation in in steady-state mRNA levels has yet to be explained by sequence-based models.
To what extent is gene expression predictable directly from genome sequence? Relevant to this question, a recent study relying on a massively parallel reporter assay (MPRA) demonstrated that the transcriptional activities associated with isolated promoters can explain a majority (~54%) of endogenous promoter activity23. This result establishes a clear mechanistic link between the primary sequence of promoters and variability in gene expression levels. This in turn implies that there may exist a mathematical function which, if properly parameterized, could accurately predict mRNA expression levels based upon nothing more than genomic sequence. However, it remains unknown whether such a function is “learnable” given limited training data and highly incomplete domain-specific knowledge of the parameters governing gene regulation [e.g., biochemical parameters describing the affinity of TFs to their cognate motifs (Kd), constants describing the rates of TF binding and unbinding (Kon and Koff, respectively), potential cooperative effects from the combinatorial binding of TFs (as measured by Hill coefficients), the distance dependencies between TF binding relative to the TSS and RNA Polymerase II recruitment, and competition for binding between TFs and histones24, etc.].
Methods based upon deep learning are providing unprecedented opportunities to automatically learn relationships among heterogeneous data types in the context of incomplete biological knowledge25. Such methods often employ deep neural networks, in which multiple layers are employed hierarchically to parametrize a model which transforms a given input into a specified output. For example, deep convolutional neural networks have been used to predict the binding preferences of RNA and DNA binding proteins26, the impact of noncoding variants on the chromatin landscape27, the chromatin accessibility of a cell type from DNA sequence28, and genome-wide epigenetic measurements of a cell type from DNA sequence29.
The application of deep learning to model the various regulatory processes governing gene expression in a unified framework has great potential, and could enable the discovery of fundamental relationships between primary DNA sequence and steady-state mRNA levels that have heretofore remained elusive. To this end, we introduce Xpresso, a deep convolutional neural network that jointly models promoter sequences and features associated with mRNA stability in order to predict steady-state mRNA levels.
RESULTS
An optimized deep learning model to predict mRNA expression levels
We aspired to train a quantitative model utilizing nothing more than genomic sequence to predict mRNA expression levels. To simplify the prediction problem, we first evaluated the correlation structure of 56 human cell types in which mRNA expression levels had been collected and normalized by the Epigenomics Roadmap Consortium30. An evaluation of the pairwise Spearman correlations of mRNA expression levels among cell types revealed that most cell types were highly correlated, exhibiting an average correlation of ~0.78 between any pair of cell types (Figure S1). This suggests that the relative abundances of mRNA species are largely consistent among all cell types, justifying the initial development of a cell type-agnostic model for predicting median mRNA expression levels. We observed that median mRNA levels for chrY genes were highly variable due to the sex chromosome differences among cell types. Histone mRNAs were also undersampled and measured inaccurately because of the dependency of the underlying RNA-seq protocols on poly(A)-tails, which histones lack. We therefore excluded chrY and histone genes from our analyses.
Given the lack of deep neural networks available in this prediction context, we sought to perform a search of hyperparameters defining the architecture of a neural network that could more optimally predict gene expression levels while jointly modeling both promoter sequences and sequence-based features correlated with mRNA decay (Figure 1A). During this search, we varied several key hyperparameters defining the deep neural network, including the sequence window of the promoter to consider, the batch size for training, the number of convolutional and max pooling layers in which to feed the promoter sequence, and the number of densely connected layers preceding the final output neuron (Table 1). The mRNA decay features, which included the GC content and lengths of different functional regions (e.g., the 5′ UTR, ORF, introns, and 3′ UTR) and ORF exon junction density21,22, were not varied.
We applied two strategies which have shown promise in the context of hyperparameter searches: the simulated annealing (SA) and the Tree of Parzen estimators (TPE)31, optimization strategies that iteratively randomly sample sets of hyperparameters before converging on a set that minimizes the error rate on a validation set. We compared the performance of these methods to the best deep learning architecture defined manually, which was guided by prior knowledge that information governing transcription rate is most likely localized to sequence elements within ±1500bp promoter around a TSS, and further inspired by a deep learning architecture previously used to predict regions of chromatin accessibility from DNA sequence (Table 1)28. We observed that each optimization strategy progressively discovered better sets of hyperparameters, with the most of the improvements occurring within 200 iterations (Figure 1B). The TPE method outperformed the SA strategy, identifying a model whose validation mean squared error (MSE) was 0.401, substantially better than the MSE of 0.479 derived from the best manually defined model.
Given the stochastic nature of training deep learning models, which depend on both the initial parameter configurations and trajectory of the optimization procedure’s search, we devised a strategy to train ten independent trials using the best deep learning architecture specified by the hyperparameters discovered using TPE method. Measuring the validation MSE as a function of the training epoch for these ten trials, we observed one trial that did not converge and nine that converged to similar MSE values (Figure 1C). For our final model, we selected the parameters derived from the specific trial and epoch that minimized the validation MSE. All following results of this study report the performance derived from the best of ten trained models.
Our final model, which considered the region 7Kb upstream of the TSS to 3.5Kb downstream of the TSS, was comprised of two sequential convolutional and max pooling layers followed by two sequential fully connected layers preceding the output neuron (Table 1, Figure 1D, and Figure S2A), and consisted of 112,485 parameters in total (Figure S2A). While this large 10.5Kb sequence window was the best discovered, an evaluation of suboptimal hyperparameters suggested this expansive upstream region was not critical for good performance, as an alternative model spanning the region 1.5Kb upstream to 7.5Kb downstream of the TSS obtained a similar validation MSE (Figure S2B). Furthermore, our manually defined architecture, spanning the region 1.5Kb upstream to 1.5Kb downstream of the TSS achieved an r2 of 0.53, only 6% worse than the model discovered by TPE, indicating that a localized region around the core promoter region captured the majority of learnable information, with only a modest additional contribution gained from the consideration of surrounding regions.
To evaluate the relationship between the number of genes in the training set and the performance of the model, we subsampled the training set and evaluated the MSE and r2 on the validation and test set, respectively. We found that the greatest gain in performance occurred between 4,000 and 6,000 training examples, with performance on the validation and test sets improving only modestly between 6,000 and 16,000 training examples (Figure 2A). The best model trained on the full training set achieved an r2 of 0.59.
Performance of predictive models in human and mouse
We next sought to compare the generality and performance of our method across mammalian species. We focused on 18,377 and 21,856 genes in human and mouse, respectively, for which we could match promoter sequences and gene expression levels, and held out 1,000 genes in each species as a test set. With the aforementioned model architecture, we trained and tested a model for predicting the median mRNA expression levels of mouse genes based on mouse genomic sequence, achieving an r2 of 0.71, substantially higher than the r2 of 0.59 achieved with the human model (Figure 2B). Interested in explaining this 12% discrepancy, we compared the distribution of median mRNA expression levels between species. While over 20% of mouse genes were non-expressed (Figure 2C, displayed on the x-axis as −1 due to the addition of a pseudocount of 0.1 prior to log-transforming the RPKM values), fewer than 10% of human genes were non-expressed. In contrast, evaluating the subset of 15,348 one-to-one orthologs in human and mouse revealed a similar proportion of non-expressed genes (Figure 2C). These one-to-one orthologs showed strikingly concordant median expression levels (Figure 2D), indicating that gene expression levels of these two species have remained heavily conserved since the divergence of a common mammalian ancestor ~80 million years ago.
Of note, a subset of genes did exhibit substantial differences in expression between mouse and human. We identified a cohort of 584 mRNAs enriched by at least 10-fold in one species; of these, 176 were relatively up-regulated in the mouse and 408 in human (Figure 2D). To what extent can these species-specific differences be explained on the basis of differences in promoter sequence? We next compared our mouse and human Xpresso models to evaluate how well these models could discriminate species-specific mRNAs. A binary classifier based upon the difference in predictions from each species could indeed correctly discriminate these species-specific mRNAs better than chance expectation (AUC = 0.78, Figure 2E). This observation beckoned the question of whether the sequence–expression relationships learned in each species were of a similar nature.
To test whether the regulatory rules learned by each model could generalize across species, we retrained human and mouse-specific models using training and validation sets that were matched to have the same groups of one-to-one orthologs. We then tested the performance of these models on a held-out group of one-to-one orthologs in either the same species or the opposite species. While the best performing models were trained on the same species, each model performed only marginally worse when tested on the opposite species (i.e., within a 6% decrease in r2) (Figure 2F), demonstrating that the regulatory principles learned by the deep learning model generalize across the mammalian phylogeny. The similar r2 values between human and mouse obtained when restricting the analyses to one-to-one orthologs implies that the greater r2 previously observed in the mouse relative to human (Figure 2B) was due to the mouse model simply discriminating expressed genes from the mouse’s larger fraction of non-expressed genes. To directly test this possibility, we downsampled the proportion of non-expressed mouse genes to match that of the human, and re-trained a mouse-specific model. The r2 decreased only modestly to 0.65 on a held-out test set, suggesting that the differential proportion of non-expressed genes only partially explains the improved performance in the mouse, with other factors such as data quality potentially playing a role.
Cell type-specific models implicate a diversity of gene regulatory mechanisms
Given the generality of our deep learning framework, we next sought to build cell type-specific models. With the same hyperparameters, we trained new models to predict the expression levels of all protein-coding genes for human myelogenous leukemia cells (K562), human lymphoblastoid cells (GM12878), and mouse embryonic stem cells (mESCs) (Supplementary Table 1), cell types for which abundant functional data is available. To avoid overfitting, we developed a 10-fold cross-validation-based procedure to ensure that the prediction for any given gene resulted from its being part of a held-out sample (i.e., it was not part of the training set from which the model that predicted its expression level was built). From the residuals (i.e., the difference between actual expression levels and our predictions based solely on promoter sequence and mRNA decay rate features), we investigated whether we could observe the influence of additional gene regulatory mechanisms that were not initially considered, or incompletely accounted for, in the Xpresso model.
We first evaluated K562 cells, finding that our cell type-specific predictions correlated with observed K562 expression levels with an r2 of 0.51 (Figure 3A). This ~8% decrease in performance relative to cell type-agnostic predictions (Figure 2B) suggested that the expression levels of tissue-specific genes may be harder to predict than those of housekeeping genes, because such genes might be under the control of gene regulatory mechanisms not considered by our model. One obvious such mechanism involves enhancers, cis-acting regulatory elements that may be located hundreds of kilobases away from a TSS. For example, in K562 cells, distal enhancers have been implicated as modulating the expression of most genes of the α-globin and β-globin loci, GATA1, MYC, and others32–34. Reasoning that such genes should be consistently underestimated by our predictions, we plotted the distribution of their residuals (Figure 3A). Indeed, all of these genes were expressed much more highly than our predictions in K562 cells, reinforcing the notion that such genes are activated by regulatory mechanisms beyond promoters. Among the biggest outliers were the β-globin genes, which in some cases were expressed over four orders of magnitude more highly than predicted (Figure 3A). Collectively, the residuals corresponding to all of these known enhancer-driven genes were heavily biased towards positive values relative to all other genes (P < 10−9, one-sided Kolmogorov-Smirnov (K-S) test, Figure S3A), confirming that the model consistently under-predicted their expression levels.
To more systematically identify genes activated or silenced by non-promoter mechanisms, we developed a method to predict them genome-wide. Given that enhancers are frequently associated with large domains of H3K27Ac activity, and that silenced genes are frequently associated with heterochromatic domains marked by H3K27me3, we examined genome-wide chromatin state annotations based upon diHMM, a method that annotates such domains genome-wide in K562 and GM12878 cells using histone marks35. Although a subset of H3K27Ac-associated domains were originally called “super-enhancers”35, we find it more appropriate to refer to them as “stretch enhancers”, which are more loosely defined as clusters of enhancers spanning ≥3Kb6. In total, we identified 4,277 genes that overlap stretch enhancers, and 3,772 genes that overlap with H3K27me3-domains, ignoring those that happen to overlap with both. Consistent with our expectation, these collections of genes were significantly associated with predominantly positive and negative residuals, respectively, relative to the background distribution of other genes (P < 10−200, one-sided K-S test, Figure 3B). The same was true for GM12878 cells (Figure S3B), reinforcing the generality of this finding across cell types.
We next examined whether our residuals could be further explained by post-transcriptional gene regulatory mechanisms. Highly reproducible mRNA half-life estimates for 5,007 genes in K562 cells were previously measured using TimeLapse-seq36. We observed a positive correlation between mRNA half-lives (which were log-transformed) and our residuals (Pearson correlation = 0.28; P < 10−15). We visualized this trend by splitting the half-lives into five equally-sized bins (Figure 3C). These results show that although we included sequence-based features associated with mRNA decay rates in the model, these were insufficient to capture the full contribution of mRNA decay rates to steady-state mRNA levels.
We next turned our attention to mouse ESCs. Key stem cell identity genes include Pou5f1 (also known as Oct4), Sox2, Nanog, and a host of others whose dependence upon enhancers and super-enhancers has been experimentally validated with CRISPR-deletion experiments7,37. Similar to the other cell types, although a mESC-specific model could strongly predict mRNA expression levels in mESCs (r2 = 0.59), key stem cell identity genes harbored residuals that were strongly biased towards positive values (Figure 3D, Figure S3C), confirming that their promoter sequences and mRNA sequence features could not adequately explain their high abundance. Extending this question more systematically to 180 protein-coding genes thought to be governed by super-enhancers in mESCs7, we observed a strong enrichment for highly positive residuals in these genes relative to all other genes (P < 10−33, one-sided K-S test; Figure 3E).
In mESCs, genes associated with the Polycomb Repressive Complex (PRC), as delineated by binding to both PRC1 and PRC2 and frequently marked with H3K27me3, are thought to be associated with key developmental regulators, many of which are silenced but poised to be activated upon differentiation38. We observed that this group of genes, in contrast to the super-enhancer-associated set, exhibited a strong enrichment in highly negative residuals relative to all other genes (P < 10−52, one-sided K-S test; Figure 3E), consistent with a model in which PRC-targeted genes are actively silenced.
Mirroring our analysis from human cells, we next evaluated the relationship between mRNA half-lives in mESCs and our residuals. We obtained reproducible mRNA half-life estimates for 6,266 genes in mESCs measured using SLAM-seq39. Similar to human cell types, residuals were positively correlated with mRNA half-lives (Pearson correlation = 0.24; P < 10−15; Figure 3F).
A distinguishing feature of mESCs relative to K562 cells is that more is known about the post-transcriptional regulatory mechanisms governing mRNA half-life. In particular, microRNAs serve as strong candidates for further inquiry as they are guided by their sequence to bind and repress dozens to hundreds of mRNAs, mediating transcript degradation and thereby shortening an mRNA’s half-life. The miR-290-295 locus, essential for embryonic survival, encompasses the most highly abundant miRNAs in mESCs. Under the control of a super-enhancer, the members of this miRNA cluster are expressed in a highly cell type-specific manner and are thought to operate as key post-transcriptional regulators in ESCs7. We asked whether we could use our residuals to infer the endogenous regulatory roles of abundant miRNAs in mESCs. Defining a miRNA family as any miRNA sharing an identical seed sequence (as indicated by positions 2-8 relative to the miRNA 5′ end8), we used existing small RNA sequencing data from mESCs40 (GSE76288) to quantify miRNA family abundances for the top 10 miRNA families. In addition to the miR-290-295 family, we detected other highly abundant families including miR-17/20/93/106, miR-19, miR-25/32/92/363/367, miR-15/16/195/332/497, and miR-130/301; these miRNA families collectively comprised over 75% of the total miRNA pool in mESCs (Figure 3G).
For each of the 222 miRNA families conserved across the mammalian phylogeny, which includes 7 of the 10 top miRNA families in mESCs, we assessed whether the predicted repression of targets correlated to our residuals. We used the TargetScan7 Cumulative Weighted Context+ Score (CWCS)8 to rank predicted conserved targets for the subset of mRNAs expressed in mESCs, assigning a CWCS of zero for non-targets. The miRNA family with the most highly ranked Spearman correlation corresponded to that of the miR-291a-3p/294/295/302abd family (Figure 3H), which was also the most highly expressed miRNA family in ES cells. The sign of this correlation was consistent with expectation, as targets with more highly negative CWCSs (corresponding to predicted targets with greater confidence) had negatively shifted residual values. More generally, the 22 miRNA families comprising the highest 10% of Spearman correlations were strongly enriched for the 7 miRNA families highly abundant in mESCs (P < 10−5, Fisher’s exact test; Figure 3H). Our results thus reinforce the finding that highly abundant miRNAs mediate target suppression41 while providing an alternative, fully computational method to infer highly active endogenous miRNA families in specific cell types solely from primary sequence and gene expression data.
Collectively, our results from K562, GM12878, and mESCs demonstrate how analyses of Xpresso’s residuals can be used to explore and quantify the influence of diversity gene regulatory mechanisms. The mRNAs with the most positive residuals are highly enriched in genes associated with the activity of enhancers and super-enhancers, while those associated with the most negative residuals are are highly enriched in genes associated with the activity of pathways involved in gene silencing, such as those targeted by Polycomb Repressive Complexes and microRNAs. We anticipate that cell type-specific quantitative models for any arbitrary cell type can serve as a useful hypothesis generation engine for the characterization of active regulatory regions in the genome and key regulators such as miRNAs, including for cell types in which histone ChIP and small RNA sequencing data is limited or unavailable.
Performance of cell type-specific Xpresso models
To further characterize the ability of Xpresso to learn cell type-specific expression patterns, we evaluated the relative performance of cell type-agnostic and cell type-specific models in predicting cell type-specific mRNA levels. In all three cases considered, models trained on the cell type of origin out-performed those trained on median expression levels by about 3-5% (Figure 4A). We next compared our K562 and GM12878 Xpresso models to evaluate how well these models could discriminate cell type-specific mRNAs. We identified a cohort of 1,977 mRNAs enriched by at least 10-fold in each cell type; of these, 890 were relatively up-regulated in K562 cells and 1,087 were up-regulated in GM12878 cells (Figure 4B). A binary classifier based upon the difference in predictions from each cell type could discriminate these cell type-specific mRNAs modestly better than chance expectation (AUC = 0.65, Figure 4C).
Next, we sought to estimate the maximum possible performance for predicting gene expression from promoter sequences alone. A recently developed genome-wide MPRA measuring autonomous promoter activity in K562 cells, called Survey of Regulatory Elements or SuRE, linked 200bp to 2Kb regions of the genome to an episomally-encoded reporter in order to measure the transcriptional potential of regulatory sequences23. SuRE therefore provides an orthogonal, empirical means of assessing the regulatory information held in promoters, independent of the influence of genomic context and distal regulatory elements. Indeed, we observed that SuRE activity in the ±500bp promoter region around a TSS was highly correlated to mRNA expression levels in K562 cells (r2 = 0.53, Figure 4D), indicating that just over half of the variation in mRNA expression levels can be explained by regulatory sequences contained within promoters.
The comparable r2 achieved by SuRE measurements and the K562-specific Xpresso model in predicting K562 expression levels (r2 = 0.53 and 0.51, respectively) provided a unique opportunity to evaluate how well our model captured the experimentally measurable information regarding a promoter’s transcriptional activity. To assess the level of information shared between SuRE measurements and Xpresso predictions, we built a joint model to predict K562 levels. This model achieved an r2 of 0.61, 8-10% better than either method alone, suggesting that while both methods captured largely redundant information, each also captured additional information not incorporated in the other (Figure 4D). Nevertheless, the relatively modest increase in performance of the joint model over Xpresso alone indicates that Xpresso was able to learn the major sources of sequence-encoded information that explain mRNA expression levels.
Predictive models perform competitively with models utilizing experimental data
We next evaluated the performance of Xpresso relative to an assortment of baseline and pre-existing models that attempted to predict mRNA levels, both with and without the consideration of mRNA half-life features. For the baseline models, we attempted to predict median expression level using simple k-mer counts in the ±1500bp promoter region, the presence of predicted TF binding sites given known motifs available in the JASPAR database, or joint models considering both (Figure S4A). These models were trained using simple multiple linear regression and evaluated on the same test set as that utilized in Figure 2. Varying the k-mer size from k = 1 to 6, we found that performance plateaued at k = 4 and 5 for human and mouse, respectively, with the greatest gain in performance occurring between k = 1 and 2 in both species. Consideration of known TF binding sites in a joint model at best only marginally improved performance, though a model considering these binding sites alone performed as well as a model based on 2-mers.
All of the models benefitted from the additional consideration of half-life features. We evaluated the coefficients associated with a model considering only half-life features to assess the relative contribution of individual features. The features most strongly associated with increased steady-state mRNA abundance in both the human and mouse corresponded to ORF exon density and 5′ UTR GC content, followed by weaker associations to 5′ UTR length and ORF length. In contrast, intron length was negatively associated with mRNA abundance (Figure S4B).
Overall, although our baseline models demonstrated that models built upon simple features could perform surprisingly well, our hyperparameter-tuned Xpresso model improved upon these models by 11.2% and 11.7% in human and mouse, respectively (Figure S4A). Our 10-fold cross-validation results further verified that Xpresso performed significantly better than the best alternative k-mer-based approach in both the human and mouse (P < 10−8, paired t-test, Figure S4C).
Next, we compared our best baseline and Xpresso models to existing models described in the literature, categorizing the types of features used as input in each model into five categories: i) those using nothing more than sequence features, which included our method and two others14,42, those using MPRAs to measure promoter activity23,43–45, iii) those utilizing the binding signal of transcription factors (TFs) at promoter regions, as measured by ChIP11,13–15, iv) those utilizing the signal of histone marks such as H3K4me1, H3K4me3, H3K9me3, H3K27Ac, H3K27me3, and H3K36me3 at promoters and gene bodies, as measured by ChIP11,12,14,16,17,42, and v) those utilizing DNase hypersensitivity signal at promoters and nearby enhancers12,14,16,46 (Figure 4E). Many of these models were trained and tested on cell lines such as K562, GM12878, and mESCs, for which ChIP data is available for a multitude of histone marks and TFs. Thus, we were also able to compare the relative performance of our cell type-specific models for these same cell lines.
The only existing method which attempted to predict expression from sequence features alone14 achieved an r2 of 0.08 and 0.28 in GM12878 and mESCs, respectively. In comparison, Xpresso models tested on the same cell types achieved an r2 of 0.43 and 0.59, more than doubling the performance of sequence-only models in each (Figure 4E). MPRA-based models exhibited a wide diversity of r2 values, although the genome-wide MPRA performed in K562 cells23 performed comparably to Xpresso. Among all models examined, those utilizing multiple forms of experimental data such as TF ChIP, histone ChIP, and DNase achieved the best r2 values of 0.6212 and 0.7014 in human and mouse, respectively, marginally better than the best Xpresso models in these species for matched cell types.
Thus, we demonstrate that models utilizing nothing more than genomic sequence are capable of explaining mRNA expression levels with as much predictive power as—and often more than—analogous models trained on abundant experimental data. Our models have the advantage that they are simple to train on any arbitrary cell type, including those lacking experimental data such as ChIP and DNase. Furthermore, sequence-only models can further augment the performance of existing models that predict mRNA levels in cell types for which experimental data is already available (Figure 4A, Figure S4).
Xpresso predicts genome-wide patterns of transcriptional activity
Convolutional neural networks have recently been used to successfully predict patterns of CAGE activity and histone ChIP signal throughout the genome29. To accomplish this feat, a deep convolutional neural network was trained on genome-wide information across entire chromosomes, using 131Kb windows that collectively encompassed ~60% of the human genome29. This led us to ask whether it was possible to predict genome-wide CAGE signal using our cell type-agnostic Xpresso model, which was trained, in contrast, on 10.5Kb windows comprising only ~5% of the human genome. As a proof of concept, we fixed the half-life features to equal that of the average gene, and generated promoter activity predictions in 100bp increments along a randomly selected 800Kb region of the human genome that encompassed twenty genes, visualizing whether Xpresso could recapitulate the average pattern of CAGE activity across cell types (Figure 5). We observed that the Xpresso predictions faithfully reproduced the pattern of CAGE activity47,48 in this region. Peaks of high predicted transcriptional activity frequently corresponded to CpG islands49 and promoter regions across multiple cell types as predicted by ChromHMM50. Xpresso predicted similar expression signatures for both positive and negative DNA strands, revealing it could not distinguish the strandedness of CAGE signal. Confirming that our results generalize across species, consistent results were observed if Xpresso predictions were generated on the 700Kb syntenic locus of the mouse genome (Figure S5).
Xpresso automatically identifies the expression level-determining region of the promoter
Interested in ascertaining how our deep learning models could predict cell type-agnostic gene expression levels with high accuracy, we developed a procedure to interpret the dominant features learned and utilized by the mouse and human models. Specifically, we tested four strategies intended to map the regions of the input space of deep learning models with the greatest contribution to the final prediction: i) Gradient * input, ii) Integrated gradients, iii) DeepLIFT (with Rescale rules), and iv) ε-LRP51. Each of these methods computed “saliency scores”, which represent a decomposition of the final prediction values into their constituent individual feature importance scores for each nucleotide in the input promoter sequences. We partitioned genes into four groups, including those predicted to be approximately non-expressed and three additional terciles (predicted low, medium, or high expression). We computed mean saliency scores for each of these groups, and then computed the difference in mean scores relative to predicted non-expressed genes for each nucleotide position in the entire input window (i.e., 7Kb upstream of the TSS to 3.5Kb downstream). Averaging our results across our 10-fold cross-validated models, we discovered that the models had automatically learned to consistently rely upon local sequences from the 1Kb sequence centered upon the TSS to predict gene expression (Figure 6A, Figure S6). Sequences upstream of the TSS contributed more heavily to the predictions than those downstream, with those in the proximal promoter (i.e. within 100bp upstream of the TSS) best discerning genes predicted to have high expression from those predicted to have medium expression (Figure 6A). Thus, from only genomic sequence and expression data, the model automatically learned spatial relationships and asymmetries in the relevance of genomic sequence that are consistent with experimental measurements23.
Of note, because we used Rectified Linear Units (ReLUs) in our networks, the ε-LRP method resulted in identical results as the Gradient * input method51 (Gradient * input: Figure S6A; ε-LRP: data not shown). We also observed that the DeepLIFT method led to nearly identical results as the Integrated gradients technique (Integrated gradients: Figure S6B; DeepLIFT: data not shown), as observed previously in other contexts51. The results in the 10-fold cross-validated mouse model also emulated those of the human (Figure S6C-D), indicating that they generalize across species and do not depend upon the specific saliency scoring method used.
CpG dinucleotides are the dominant signal explaining expression levels
The ability of the model to automatically identify and heavily weight the proximal promoter in the expression prediction task naturally led to the question of which sequence motifs within this region were responsible for quantitatively defining expression level. We therefore devised a strategy to identify k-mers enriched in the genes predicted to be highly expressed (Figure 6B). The distribution of predicted gene expression levels was largely bimodal, allowing the partitioning of genes broadly into genes predicted to have low expression (class A) and high expression (class B). We extracted sub-sequences from the promoters of class B whose saliency scores were higher than the 99th percentile of those observed at the same positions in the promoters of class A. To identify enriched k-mers in this set of sub-sequences, we devised a equivalently sized negative set of sub-sequences by permuting the extracted positions in class B to control for positional sequence biases. We then used DREME52 to identify k-mers enriched in our positive set relative to our negative set. Evaluating the E-values of the top five significantly enriched k-mers, we observed the dinucleotide CpG as enriched by orders of magnitude over the second best k-mer in both human and mouse species (Figure 6B), implicating it as a dominant factor discriminating highly expressed genes from lowly expressed ones. We repeated our procedure on genes in the top half of class B relative to its lower half, and identified an even stronger enrichment of CpG dinucleotides (data not shown). These observations suggest that both the mouse and human models predominantly utilize the spatial distribution of CpG dinucleotides surrounding the proximal promoter to predict the entire continuum of gene expression levels.
The model therefore arrives at a specific prediction: CpG dinucleotides are more enriched in the proximal promoters of highly expressed genes relative to lowly expressed genes. To test this hypothesis, we evaluated the positional enrichment of all 16 possible dinucleotides around the TSS of genes in different gene expression bins relative to chance expectation. While CpGs are globally depleted, genes in higher gene expression bins preserved a greater fraction of CpGs closer to the TSS (Figure 6C). This property was true to a much lesser extent for other dinucleotides, with only AA/TT, CA/TG, and CC/GG dinucleotides being able to discriminate between highly and lowly expressed genes in both human and mouse genomes (Figure S7, Figure S8).
DISCUSSION
In this study, we demonstrate that a substantial proportion of variability across genes with respect to their steady-state mRNA expression levels is predictable from features derived solely from genomic sequence. In doing so, our work illustrates—as is the case for gene prediction—that the mathematical function linking genomic sequence to mRNA abundance is in a large part learnable without the use of additional sources of experimental data such as those derived from DNase hypersensitivity, TF ChIP, histone ChIP, or MPRAs. Consistent with dogma and recent experiments23, we find that the instructions governing transcriptional output are heavily enriched in a gene’s proximal promoter (more specifically, the ±500bp around the TSS). We establish Xpresso as an early initial attempt to confront the problem of gene expression prediction from genomic sequence alone, and anticipate that future algorithms can utilize our effort as a baseline model to improve upon at this prediction task. As deep learning approaches become increasingly sophisticated, we predict that new methods will be able to extract additional spatial relationships and long-range dependencies among motifs that are currently outside of the scope of those learnable by convolutional neural network architectures. While this manuscript was in preparation, another paper was published that similarly demonstrates the feasibility of predicting expression levels from genomic sequence53.
Our study provides a theoretical framework to further understand the fundamental question of how different modes of gene regulation contribute to steady-state abundance of mRNA. Querying the performance of the model while considering subsets of features associated with various mechanisms of gene regulation (e.g., mRNA decay and transcription rate) helped dissect their relative contributions to steady-state mRNA levels. Based on the proportion of variance explained by our model thus far, we estimate that between 57-89% of variability can be explained by transcriptional regulation, with the remaining explained by regulation of mRNA stability. These estimates are generally consistent with those derived from experimental measurements in mammalian cells2,22. Additionally, we estimate that promoter sequences alone explain ~50% of gene expression variability in humans, again surprisingly consistent with experimentally derived estimates23. Collectively, these results reveal that in silico strategies to estimate the relative influence of various modes of gene regulation can approximate those more directly relying on experimental measurement.
While our model makes substantial headway into predicting expression levels, between 40-60% of variability still remains unexplained, depending upon the cell type and species considered. We propose that the limitations of our model are also interesting in that they have the potential to inform and resolve the many layers of gene regulation that the model fails to capture. A residual analysis of highly expressed genes that the model under-predicts confirms that enhancers and super-enhancers play a measurably significant role in governing transcriptional programmes. Incorporation of the effects of enhancers in the model is complicated by the difficulty of predicting which promoter(s) any given enhancer influences, as these can be positioned hundreds of kilobases away and skip over genes, as well as the extent to which parameters such as distance modulate the level of enhancer-mediated activation. Such long-range dependencies are poorly modeled by convolutional neural networks. Although the incorporation of distal enhancers into the model has proved to be evasively difficult, the model can be used as a hypothesis generation engine to uncover additional gene regulatory mechanisms that further explain outliers. Our model provides a natural strategy to quantitatively rank candidate silenced and activated genes in different cell types in a way that prioritizes those that most heavily deviate from its predictions. We propose these rankings as a foundation to guide experimentalists interested in dissecting the layers of gene regulation that operate in their cell type of interest.
Scanning large regions of the genome with our pipeline revealed a striking association between regions of high predicted transcriptional activity and CpG islands. While CpG islands are a well-established feature of mammalian genomes that frequently demarcate promoter sequences49, our results support the idea that the spatial positioning of CpGs around the proximal promoter is intimately associated with gene expression levels. Our data therefore reinforce the findings from MPRAs that promoter regions enriched with CpG dinucleotides are functionally associated with increased gene expression levels23.
Looking forward, we envision the delineation of set of mathematical functions for each cell type that can accurately predict their mRNA expression levels from genome sequence alone as a grand challenge for the field. As shown here, this framework will allow us to quantify and characterize the mechanisms of gene regulation of which we are aware, and may draw our attention to ones that have yet to be discovered.
METHODS
Gene expression data collection and pre-processing
We retreived a matrix of normalized expression values for protein-coding mRNAs across 56 tissues and cell lines from RNA-seq data gathered and quantified by the Epigenomics Roadmap Consortium (http://egg2.wustl.edu/roadmap/data/byDataType/rna/expression/57epigenomes.RPKM.pc.gz)30.
For mouse gene expression data, we gathered all ENCODE RNA-seq datasets satisfying the following constraints: i) datasets corresponded to “polyA-selected mRNA RNA-seq” or “total RNA-seq”, ii) reads were mapped to the Mus musculus mm10 genome assembly, iii) files were “tsv” files corresponding to gene-level quantifications, iv) biosamples were not treated with “DMS”, “LPS”, or “β-estradiol”, v) files were not derived from samples with “low replicate concordance”, “low read depth”, “insufficient read depth”, or “insufficient read length”. Only samples corresponding to the first replicate of each tissue or cell line were utilized. In total, 254 mouse RNA-seq datasets passed these criteria. For cell type-specific questions in the mouse, we used gene expression data from mESCs that were computed previously15.
For each species, we computed the median expression level across all cell types for each gene, and transformed all gene expression values to reduce the right skew of the data. Quantile normalizing the samples of the mouse gene expression matrix prior to computing the median values resulted in nearly an identical set of median expression levels (i.e., 0.9996 correlation before and after quantile normalization), making this step optional.
One-to-one human-to-mouse orthologs were acquired from the Ensembl v90 BioMart54 by extracting the “Mouse gene stable ID” and “Mouse homology type” with respect to each human gene.
Collecting promoter sequences and mRNA half-life features
Human and mouse promoter CAGE peak annotations were downloaded from the FANTOM5 consortium’s UCSC data hub (http://fantom.gsc.riken.jp/5/datahub/hg38/peaks/hg38.cage_peak.bb, hg38 genome build; http://fantom.gsc.riken.jp/5/datahub/mm10/peaks/mm10.cage_peak.bb, mm10 genome build)47,48. The best peak corresponding to each promoter, labeled with the keyword “p1@”, was extracted. HUGO gene names or gene name synonyms were converted into Ensembl IDs using the Ensembl v90 BioMart, HGNC ID tables (https://www.genenames.org/cgi-bin/download), and Mouse Genome Informatics gene model tables (http://www.informatics.jax.org/downloads/reports/MGI_Gene_Model_Coord.rpt).
Gene annotations for protein coding genes were derived from Ensembl v90 (hg38 genome build)54. Only protein-coding genes were carried forward for analysis, with the following genes filtered out as sources of bias: i) genes located on chrY, whose gene expression depended upon whether their cells of origin were male or female, ii) histone genes, whose expression was mis-quantified due to their mRNAs lacking poly(A) tails, therefore being undersampled in poly(A)-selected RNA-seq libraries. Out of all transcripts corresponding to each gene, the one with the longest ORF, followed by the longest 5′ UTR, followed by the longest 3′ UTR was chosen as the representative transcript for that gene. The G/C content and lengths of each of these functional regions (i.e., 5′ UTRs, ORFs, and 3′ UTRs), intron length, and ORF exon junction density (computed as the number of exon junctions per kilobase of ORF sequence) were gathered as additional features associated with mRNA half-life21,22. All length-related features were transformed such that: to reduce the right skew, and along with gene expression levels were then z-score normalized by subtracting their respective mean values and dividing by their standard deviations. The starting coordinate of the first exon of the representative transcript was defined as that gene’s transcriptional start site (TSS). The vast majority of mRNAs possessed a dominant CAGE peak; if this was so, the TSS was re-centered to the coordinates of the CAGE peak. While considering CAGE data helped to improve our model, its use was optional as the performance of our models was only slightly worse when considering only Ensembl TSS annotations (r2 of 0.54 instead of 0.59 in the human). The ±10 kilobase sequence centered at the TSS was extracted as the putative promoter region to consider. Intermediate steps such as extracting sequences from the genome or converting between bed formats were executed with BEDTools55.
Hyperparameter optimization and model training
Matching gene expression levels to promoter sequences resulted in a total of 18,377 and 21,856 genes in human and mouse, respectively. All continuous variables were mean-centered and scaled to have unit variance, and promoter sequences were one-hot encoded into a boolean matrix. For each species, genes were then randomly partitioned into training, validation, and test sets such that the validation and test sets were allotted 1,000 genes each. We defined the objective function as the minimum mean squared error achieved on the validation set across 10 epochs of training. For the best set of hyperparameters specifying the neural network structure, we trained ten independent trials, and selected the parameters derived from the specific trial and epoch that minimized the validation MSE as our final model. For each trial, parameters were first randomly initialized by sampling from a Glorot normal distribution56. Next, the Adam optimizer was used to search for a local minima56. The search was performed for 100 epochs but cancelled if a lower validation MSE was not discovered within 7 epochs of the best model discovered so far for that trial. If a cross-validation strategy was implemented, we performed an identical strategy for each of the 10 folds of the data using the respective training and validation sets of the fold. The following software packages were required for model training and testing: Keras 2.0.856, TensorFlow 1.3.057, CUDA 8.0.61, cuDNN 5.1.10, and the Anaconda2 distribution of Python. We initialized a hyperparameter search space to specify the model architecture (Table 1), and used Hyperopt58 to search for an optimal set of hyperparameters. All models were trained on an NVIDIA Quadro P6000 GPU equipped with 24Gb of video RAM.
Note to reviewers: We intend to release a fully reproducible implementation of the above-described procedures in our Xpresso Github package during the review process.
Whole-transcriptome predictions
To predict expression levels for all annotated genes, we implemented a 10-fold cross-validation procedure. Specifically, we partitioned the dataset into 10 equally sized bins. For each fold, we i) reserved 1/10 of the data as a test set, 1000 genes as a validation set, and the remaining genes as a training set; ii) trained 10 independent models until convergence, iii) selected the model with the minimum validation mean squared error, and iv) generated a prediction on the test set. We then concatenated all of the predictions together that were derived from the best model from each of the ten folds of the data.
SuRE MPRA data
Pre-processed genome-wide, stranded SuRE MPRA data mapped to hg19 was acquired as bigwig files from GEO record GSE7870923. To compute SuRE activity at a specified promoter, we extracted the mean SuRE signal over covered bases on the correct strand as the gene, centered at 1000bp around the TSS, using utilities provided in the UCSC genome browser (“bigWigAverageOverBed-sampleAroundCenter=1000”)59. For the TSS annotations, we utilized our set of CAGE-corrected TSSs lifted over from hg38 to hg19 using liftOver59, supplemented with TSSs for genes annotated in Gencode release 27/Ensembl v90 for any missing IDs (https://www.gencodegenes.org/releases/27lift37.html)60.
Baseline models
To train baseline models for comparison (Figure S4), we merged the training and validation sets used initially for hyperparameter optimization and model training. For each gene, we first extracted the ±1500bp window centered each TSS and defined this as the promoter. For k-mer-based models, we counted the frequency of all k-mers occuring occurring in the promoter region, varying k from 1 to 5 and 6 for human and mouse, respectively. To train transcription-factor-based models, we scanned promoters using FIMO61 using positional weight matrices derived from the JASPAR 2016 Core Vertebrate set62. Default parameters were used for the search, except that the set of promoter sequences to compute a first order Markov background model for the search. For the transcription factors matched to the promoters, we populated a binary matrix with a 1 if a significant motif was detected for the promoter, and 0 otherwise. We then trained multiple linear regression models explaining median mRNA expression levels as a function of i) the collection of k-mer counts, ii) the binary matrix of JASPAR matches, or iii) both of the former. These models were trained both with and without half-life features, with the r2 evaluated on the test set.
Prediction on a genomic window
We extracted 10.5Kb sequences tiling across the 600 Kb and 700 Kb regions of the human and mouse genomes, respectively, in 100bp increments (Genome coordinates: chr1:109500000-110300000, hg19 genome build; chr3:107800000-108500000, mm10 genome build). The “bedtools makewindows” (parameters “-w 10500 -s 100”) was used to generate these windows, and “bedtools getfasta” to extract the sequences55. Predictions were then made using our cell type-agnostic Xpresso model trained upon median gene expression data, using zero values for all half-life features (with zero corresponding to the mean values as these features had been z-score normalized).
DATA ACCESS
Associated code necessary to train and test the deep learning models and reproduce each figure of the manuscript will be made available upon publication through a GitHub package.
ACKNOWLEDGMENTS
We thank J. Tome, S. Kim, and other members of the Shendure lab for critical commentary and helpful discussions. This material is based upon work supported under an NRSA NIH fellowship 5T32HL007093 (to V.A.) and NIH grants 5DP1HG007811, 5R01HG009136, and 5R01CA197139 (to J.S.). J.S. is an investigator of the Howard Hughes Medical Institute.