Abstract
Introns are found in 5’ untranslated regions (5’UTRs) for 35% of all human transcripts. These 5’UTR introns are not randomly distributed: genes that encode secreted, membrane-bound and mitochondrial proteins are less likely to have them. Curiously, transcripts lacking 5’UTR introns tend to harbor specific RNA sequence elements in their early coding regions. To model and understand the connection between coding-region sequence and 5’UTR intron status, we developed a classifier that can predict 5’UTR intron status with >80% accuracy using only sequence features in the early coding region. Thus, the classifier identifies transcripts with 5’ proximal-intron-minus-like-coding regions (“5IM” transcripts). Unexpectedly, we found that the early coding sequence features defining 5IM transcripts are widespread, appearing in 21% of all human RefSeq transcripts. The 5IM class of transcripts is enriched for non-AUG start codons, more extensive secondary structure both preceding the start codon and near the 5’ cap, greater dependence on eIF4E for translation, and association with ER-proximal ribosomes. 5IM transcripts are bound by the Exon Junction Complex (EJC) at non-canonical 5’ proximal positions. Finally, N1-methyladenosines are specifically enriched in the early coding regions of 5IM transcripts. Taken together, our analyses point to the existence of a distinct 5IM class comprising ∼20% of human transcripts. This class is defined by depletion of 5’ proximal introns, presence of specific RNA sequence features associated with low translation efficiency, N1-methyladenosines in the early coding region, and enrichment for non-canonical binding by the Exon Junction Complex.
Introduction
Approximately 35% of all human transcripts harbor introns in their 5’ untranslated regions (5’UTRs) (Cenik et al. 2010; Hong et al. 2006). Among genes with 5’UTR introns (5UIs), those annotated as “regulatory” are significantly overrepresented, while there is an under-representation of genes encoding proteins that are targeted to either the endoplasmic reticulum (ER) or mitochondria (Cenik et al. 2011). For transcripts that encode ER- and mitochondria-targeted proteins, 5UI depletion is associated with presence of specific RNA sequences (Cenik et al. 2011; Palazzo et al. 2007, 2013). Specifically, nuclear export of an otherwise inefficiently exported microinjected mRNA or cDNA transcript can be promoted by an ER-targeting signal sequence-containing region (SSCRs) or mitochondrial signal sequence coding region (MSCRs) from a gene lacking 5’ UTR introns (Cenik et al. 2011; Lee et al. 2015). However, more recent studies suggest that many SSCRs have little impact on nuclear export for RNAs transcribed in vivo (Lee et al. 2015), but rather enhance translation in a RanBP2-dependent manner (Mahadevan et al. 2013).
Among SSCR- and MSCR-containing transcripts (referred to hereafter as SSCR and MSCR transcripts), ∼75% lack 5’ UTR introns (“5UI−” transcripts) and ∼25% have them (“5UI+” transcripts). These two groups have markedly different sequence compositions at the 5’ ends of their coding sequences. 5UI− transcripts tend to have lower adenine content (Palazzo et al. 2007) and use codons with fewer uracils and adenines than 5UI+ transcripts (Cenik et al. 2011). Their signal sequences also contain leucine and arginine more often than the biochemically similar amino acids isoleucine and lysine, respectively. Leucine and arginine codons contain fewer adenine and thymine nucleotides, consistent with adenine and thymine depletion. This depletion is also associated with the presence of a specific GC-rich RNA motif in the early coding region of 5UI− transcripts (Cenik et al. 2011).
Despite some knowledge as to their early coding region features, key questions about this class of 5UI− transcripts have remained unanswered: Do the above sequence features extend beyond SSCR- and MSCR-containing transcripts to other 5UI− genes? Do 5UI− transcripts having these features share common functional or regulatory features? What binding factor(s) recognize these RNA elements? A more complete model of the relationship of early coding features and 5UI− status would begin to address these questions.
Here, to better understand the relationship between early coding region features and 5UI status, we undertook an integrative machine learning approach. We reasoned that a machine learning classifier which could identify 5UI− transcripts solely from early coding sequence would potentially provide two types of insight: First, it could systematically identify predictive features. Second, the subset of 5UI− transcripts which could be identified by the classifier might then represent a functionally distinct transcript class. Having developed such a classifier, we found that it identified ∼21% of all human transcripts as harboring coding regions characteristic of 5UI− transcripts. While many of these transcripts encode ER- and mitochondrial-targeted proteins, many others encode nuclear and cytoplasmic proteins. This class of transcripts shares characteristics such as a tendency to lack 5’ proximal introns, to contain non-canonical Exon Junction Complex binding sites, to have multiple features associated with lower intrinsic translation efficiency, and to have an increased incidence of N1- methyladenosine modification.
Results
A classifier that predicts 5UI status using only early coding sequence information
To better understand the previously-reported enigmatic relationship between certain early coding region sequences and the absence of a 5UI, we sought to model this relationship. Specifically, we used a random forest classifier (Breiman 2001) to learn the relationship between 5UI absence and a collection of 36 different nucleotide-level features extracted from the first 99 nucleotides of all human coding regions (CDS) (Figs 1A-C; Table S1; Methods). We then used all transcripts known to contain an SSCR (a total of 3743 transcripts clusters; Methods), regardless of 5UI status, as our training set. This training constraint ensured that all input nucleotide sequences were subject to similar functional constraints at the protein level. Thus, we sought to identify sequence features that differ between 5UI− and 5UI+ transcripts at the RNA level.
Our classifier assigns to each transcript a “5’UTR-intron-minus-predictor” (5IMP) score between 0 and 10, where higher scores correspond to a higher likelihood of being 5UI− (Fig 1C). Interestingly, preliminary ranking of the 5UI− transcripts by 5IMP score revealed a relationship between the position of the first intron in the coding region and the 5IMP score. 5UI− transcripts for which the first intron was more than 85 nts downstream of the start codon had the highest 5IMP scores. Furthermore, the closer the first intron was to the start codon, the lower the 5IMP score (Fig 1D). We explored this relationship further by training classifiers that increasingly excluded from the training set 5UI− transcripts according to the distance of the first intron from the 5’ end of the coding region. This revealed that classifier performance, as measured by the area under the precision recall curve (AUPRC), increased as a function of the distance from start codon to first intron distance (Methods, Fig 1E). Thus, the RNA sequence features we identified as being predictive of 5UI− transcripts are more accurately described as being predictors of transcripts without 5’-proximal introns.
To minimize the impact of transcripts that may ‘behave’ as though they were 5UI+ due to an intron early in the coding region, we eliminated 5UI− SSCR transcripts with a first intron <90 nts downstream of the start codon (Methods) and generated a new classifier. Discriminative motif features were learned independently (Methods), and performance of this new classifier was gauged using 10-fold cross validation. We assessed cross-validation performance in two ways: 1) in terms of the area under the receiver operating curve (AUC)— which can be thought of as a measure of average recall across a range of false positive rates; 2) in terms of area under the precision vs recall curve (AUPRC), which can be thought of as the average precision (fraction of predictions which are correct) across a range of recall values. Specifically, the classifier showed an AUC of 74% and AUPRC of 88% (Fig 2A, yellow curves; exceeding AUC 50% and AUPRC 71%, the performance value expected of a naïve predictor). We used this optimized classifier for all subsequent analyses.
SSCR transcripts exhibited markedly different 5IMP score distributions for the 5UI+ and 5UI− subsets (Fig 2B). The 5UI+ score distribution was unimodal with a peak at ∼2.4. In contrast, the 5UI− score distribution was bimodal with one peak at ∼3.6 and another at ∼9, suggesting the existence of at least two underlying 5UI− transcript classes. The peak at score 3.6 resembled the 5UI+ peak. Also contributing to the peak at 3.6 is the set of 5UI− transcripts harboring an intron in the first 90 nts of the CDS (55% of all 5UI− transcripts). The other distinct high-scoring 5UI− class (peak at score 9) is composed of transcripts that have specific 5UI−-predictive RNA sequence elements within the early coding region.
We next wished to evaluate whether our classifier was discriminating 5UI+ and 5UI− SSCR transcripts using signals that appear specifically in the early coding region as opposed to signals that appear broadly across the coding region. To do so, for every transcript we randomly chose 99 nts from the region downstream of the 3rd exon. The 5IMP score distributions of these ‘3’ proximal exon’ sets were identical for 5UI+ and 5UI− transcripts (Figs 2A, and 2C), confirming that the sequence features that distinguish 5UI+ and 5UI− transcripts are specific to the early coding region.
RNA elements associated with 5UI− transcripts are pervasive in the human genome
Having trained the classifier on SSCR transcripts, we wondered how well it would predict the 5UI status of other transcripts. Despite having been trained exclusively on SSCR transcripts, the classifier performed remarkably well on MSCR transcripts, achieving an AUC of 86% and AUPRC of 95% (Fig 2A, purple line; Fig 2D; as compared with 50% and 77%, respectively, expected by chance). This result suggests that RNA elements within early coding regions of 5UI− MSCR-transcripts are similar to those in 5UI− SSCR-transcripts despite distinct functional constraints at the protein level.
We next wondered whether the class of 5UI− transcripts that can be predicted on the basis of early coding region features is restricted to transcripts encoding proteins trafficked to the ER or mitochondria, or is instead a more general class of transcripts. We therefore asked whether the classifier could predict 5UI− status in transcripts that contain neither an SSCR nor an MSCR (“S−/M−” transcripts). Because unannotated SSCRs could confound this analysis, we first used SignalP 3.0 to identify S−/M− transcripts most likely to contain an unannotated SSCR (Bendtsen et al. 2004). These ‘SignalP+‘ transcripts had a 5IMP score distribution comparable to those of known SSCR and MSCR transcripts (Fig 2E), and the classifier worked well to identify the 5UI− subset of these transcripts (AUC 82% and AUPRC 95%, Fig 2A, light blue line). While 5UI+ SignalP+ transcripts had predominantly low 5IMP scores, 5UI− SignalP+ 5IMP scores were strongly skewed towards high 5IMP scores (peak at ∼9; Fig 2E). These results were consistent with the idea that SignalP+ transcripts do contain many unannotated SSCRs.
Having considered SignalP+ transcripts as well as SSCR- and MSCR-containing transcripts, we used the classifier to calculate 5IMP scores for all remaining “S−/M−/SignalP−” transcripts. Although the performance was weaker on this gene set, it was still better than expected of a naive predictor (Fig 2A, green line). 5UI+ S−/M−/SignalP− transcripts were strongly skewed toward low 5IMP scores (Fig 2F). Surprisingly, however, a significant fraction of 5UI− S−/M−/SignalP− transcripts had high 5IMP scores (∼18%). Thus, our results suggest a broad class of transcripts with early coding regions carrying sequence signals that predict the absence of a 5’proximal intron, or in other words, a class of transcripts with 5’proximal-intron-minus-like coding regions. Hereafter we refer to transcripts in this class as “5IM” transcripts.
We sought to identify what fraction of transcripts have 5IMP scores that exceed what would be expected in the absence of 5UI−-predictive early coding region signals. To establish this expectation, we used the above-described negative control set of equal-length coding sequences from outside of the early coding region. By quantifying the excess of high-scoring sequences in the real distribution relative to this control distribution, we estimate that 21% of all human transcripts are 5IM transcripts (a 5IMP score of 7.41 corresponds to a 5% False Discovery Rate; Fig 2G). The set of 5IM transcripts defined by our classifier (Table S2) includes many that do not encode ER-targeted or mitochondrial proteins. The distribution of various classes of mRNAs among the 5IM transcripts was: 38% ER-targeted (SSCR or SignalP+), 9% mitochondrial (MSCR) and 53% other classes (S−/M−/SignalP−) (Figure 2G). These results suggest that RNA-level features prevalent in the early coding regions of 5UI− SSCR and MSCR transcripts are also found in other transcript types (Figs 2F-G), and that 5IM transcripts represent a broad class.
Functional characterization of 5IM transcripts
5IM transcripts are defined by mRNA sequence features. Hence, we hypothesized that 5IM transcripts may be functionally related through shared regulatory mechanisms mediated by the presence of these common features. To this end, we collected large-scale datasets representing diverse attributes covering six broad categories (see Table S3 for a complete list): (1) Curated functional annotations -- e.g., Gene Ontology terms, annotation as a ‘housekeeping’ gene, genes subject to RNA editing; (2) RNA localization -- e.g., to dendrites, to mitochondria; (3) Protein and mRNA half-life, ribosome occupancy and features that decrease stability of one or more mRNA isoforms -- e.g., AU-rich elements (4) Sequence features associated with regulated translation -- e.g., codon optimality, secondary structure near the start codon; (5) Known interactions with RNA-binding proteins or complexes such as Staufen-1, TDP-43, or the Exon Junction Complex (EJC) (6) RNA modifications – i.e., N1-methyladenosine (m1A).
We adjusted for multiple hypotheses testing at two levels. First, we took a conservative approach (Bonferroni correction) to correct for the number of tested functional characteristics. Second, some of the functional categories were analyzed in more depth and multiple sub-hypotheses were tested within the given category. In this in-depth analysis a false discovery-based correction was adopted. Below, all reported p-values remain significant (padjusted < 0.05) after multiple hypothesis test correction.
No associations between 5IM transcripts and features in categories (1), (2) and (3) were found, other than the already-known enrichments for ER- and mitochondrial-targeted mRNAs. However, analyses for the remaining categories yielded the significant results described below.
5IM transcripts have features suggesting lower translation efficiency
Translation regulation is a major determinant of protein levels (Vogel and Marcotte 2012). To investigate potential connections between 5IM transcripts and translational regulation, we examined features associated with translation. Features found to be significant were:
(I) Secondary structures near the start codon can affect initiation rate by modulating start codon recognition (Parsyan et al. 2011). We observed a positive correlation between 5IMP score and the free energy of folding (-ΔG) of the 35 nucleotides immediately preceding the start codon (Fig 3A; Spearman rho=0.39; p < 2.2e-16). This suggests that 5IM transcripts have a greater tendency for secondary structure near the start codon, presumably making the start codon less accessible.
(II) Similarly, secondary structures near the 5’cap can modulate translation by hindering binding by the 43S-preinitiation complex to the mRNA (Babendure et al. 2006). We observed a positive correlation between 5IMP score and the free energy of folding (-ΔG) of the 5’most 35 nucleotides (Fig 3B; Spearman rho=0.18; p = 7.9e-130). This suggests that 5IM transcripts have a greater tendency for secondary structure near the 5’cap, presumably hindering binding by the 43S-preinitiation complex.
(III) eIF4E overexpression. The heterotrimeric translation initiation complex eIF4F (made up of eIF4A, eIF4E and eIF4G) is responsible for facilitating the translation of transcripts with strong 5’UTR secondary structures (Parsyan et al. 2011). The eIF4E subunit binds to the 7mGpppG ‘methyl-G’ cap and the ATP-dependent helicase eIF4A (scaffolded by eIF4G) destabilizes 5’UTR secondary structure (Marintchev et al. 2009). A previous study identified transcripts that were more actively translated under conditions that promote capdependent translation (overexpression of eIF4E) (Larsson et al. 2007). In agreement with the observation that 5IM transcripts have more secondary structure upstream of the start codon and near the 5’cap, transcripts with high 5IMP scores were more likely to be translationally, but not transcriptionally, upregulated upon eIF4E overexpression (Fig 3C; Wilcoxon Rank Sum Test p = 2.05e-22, and p = 0.28, respectively).
(IV) Non-AUG start codons. Transcripts with non-AUG start codons also have intrinsically low translation initiation efficiencies (Hinnebusch and Lorsch 2012). These mRNAs were greatly enriched among transcripts with high 5IMP scores (Fisher’s Exact Test p = 0.0003; odds ratio = 3.9) and have a median 5IMP score that is 3.57 higher than those with an AUG start (Fig 3D).
(V) Codon optimality. The efficiency of translation elongation is affected by codon optimality (Hershberg and Petrov 2008). Although some aspects of this remain controversial (Charneski and Hurst 2013; Shah et al. 2013; Zinshteyn and Gilbert 2013; Gerashchenko and Gladyshev 2015), it is clear that decoding of codons by tRNAs with different abundances can affect the translation rate under conditions of cellular stress (reviewed in Gingold and Pilpel 2011). We therefore examined the tRNA adaptation index (tAI), which correlates with copy numbers of tRNA genes matching a given codon (dos Reis et al. 2004). Specifically, we calculated the median tAI of the first 99 coding nucleotides of each transcript, and found that 5IMP score was negatively correlated with tAI (Fig S1A; Spearman Correlation rho= −0.23; p < 2e-16 median tAI and 5IMP score). This effect was restricted to the early coding regions as the negative control set of randomly chosen sequences downstream of the 3rd exon from each transcript did not exhibit a relationship between 5IMP score and codon optimality (Fig S1B-C). Thus, 5IM transcripts show reduced codon optimality in early coding regions, suggesting that 5IM transcripts have decreased translation elongation efficiency.
To more precisely determine where the codon optimality phenomenon occurs within the entire early coding region, we grouped transcripts by 5IMP score. For each group, we calculated the mean tAI at codons 2-33 (i.e., nts 4-99). Across this entire region, 5IM transcripts (5IMP >7.41; 5% FDR) had significantly lower tAI values at every codon except codons 24 and 32 (Fig 3E; Wilcoxon Rank Sum test Holm-adjusted p < 0.05 for all comparisons). To eliminate potential confounding variables, including nucleotide composition, we performed several additional control analyses (Methods); none of these altered the basic conclusion that 5IM transcripts have lower codon optimality than non-5IM transcripts across the entire early coding region.
(VI) Ribosomes per mRNA. Finally, we examined the relationship between 5IMP score and translation efficiency, as measured by the steady-state number of ribosomes per mRNA molecule. To this end, we used a large dataset of ribosome profiling and RNA-Seq experiments from human lymphoblastoid cell lines (Cenik et al. 2015). From this, we calculated the average number of ribosomes on each transcript and identified transcripts with high or low ribosome occupancy (respectively defined by occupancy at least one standard deviation above or below the mean; see Methods). 5IM transcripts were slightly but significantly depleted in the high ribosome-occupancy category (Fig 3F; Fisher’s Exact Test p = 0.0006, odds ratio = 1.3). Moreover, 5IMP scores exhibited a weak but significant negative correlation with the number of ribosomes per mRNA molecule (Spearman rho=-0.11; p = 5.98e-23).
Taken together, all of the above results reveal that 5IM transcripts have sequence features associated with lower translation efficiency, at the stages of both translation initiation and elongation.
Non-ER trafficked 5IM transcripts are enriched in ER-proximal ribosome occupancy
We next investigated the relationship between 5IMP score and the localization of translation within cells. Exploring the subcellular localization of translation at a transcriptome-scale remains a significant challenge. Yet, a recent study described proximity-specific ribosome profiling to identify mRNAs occupied by ER-proximal ribosomes in both yeast and human cells (Jan et al. 2014). In this method, ribosomes are biotinylated based on their proximity to a marker protein such as Sec61, which localizes to the ER membrane (Jan et al. 2014). For each transcript, the enrichment for biotinylated ribosome occupancy yields a measure of ERproximity of translated mRNAs.
We reanalyzed this dataset to explore the relationship between 5IMP scores and ERproximal ribosome occupancy in HEK-293 cells. As expected, transcripts that exhibit the highest enrichment for ER-proximal ribosomes were SSCR-containing transcripts and transcripts with other ER-targeting signals. Yet, we noticed a surprising positive correlation between ER-proximal ribosome occupancy and 5IM transcripts with no ER-targeting evidence (Fig 4). This relationship was true for both mitochondrial genes (Fig 4; Spearman rho=0.43; p < 2.2e-16), and genes with no evidence for either ER- or mitochondrial-targeting (Fig 4; Spearman rho=-0.36; p < 2.2e-16). These results suggest that 5IM transcripts are more likely than non-5IM transcripts to engage with ER-proximal ribosomes.
5IM transcripts are strongly enriched in non-canonical EJC occupancy sites
Shared sequence features and functional traits among 5IM transcripts causes one to wonder what common mechanisms might link 5IM sequence features to 5IM traits. For example, 5IM transcripts might share regulation by one or more RNA-binding proteins (RBPs). To investigate this idea further, we tested for enrichment of 5IM transcripts among the experimentally identified targets of 23 different RBPs (including CLIP-Seq, and variants; see Methods). Only one dataset was substantially enriched for high 5IMP scores among targets: a transcriptome-wide map of binding sites of the Exon Junction Complex (EJC) in human cells, obtained via tandem-immunoprecipitation followed by deep sequencing (RIPiT) (Singh et al. 2014, 2012). The EJC is a multi-protein complex that is stably deposited upstream of exonexon junctions as a consequence of pre-mRNA splicing (Le Hir et al. 2000). RIPiT data confirmed that canonical EJC sites (cEJC sites; sites bound by EJC core factors and appearing ∼24 nts upstream of exon-exon junctions) occupy ∼80% of all possible exon-exon junction sites and are not associated with any sequence motif. Unexpectedly, many EJC-associated footprints outside of the canonical −24 regions were observed (Fig 5A) (Singh et al. 2012). These ‘non-canonical’ EJC occupancy sites (ncEJC sites) were associated with multiple sequence motifs, three of which were similar to known recognition motifs for SR proteins that co-purified with the EJC core subunits (Singh et al. 2012). Interestingly, another motif (Fig 5B; top) that was specifically found in first exons is not known to be bound by any known RNAbinding protein (Singh et al. 2012). This motif was CG-rich, a sequence feature that also defines 5IM transcripts. This similarity presages the possibility of enrichment of first exon ncEJC sites among 5IM transcripts. Position analysis of called EJC peaks revealed that while only 9% of cEJCs reside in first exons, 19% of all ncEJCs are found there. When we investigated the relationship between 5IMP scores and ncEJCs in early coding regions, we found a striking correspondence—the median 5IMP score was highest for transcripts with the greatest number of ncEJCs (Fig 5C; Wilcoxon Rank Sum Test; p < 0.0001). When we repeated this analysis by conditioning on 5UI status, we similarly found that ncEJCs were enriched among transcripts with high 5IMP scores regardless of 5UI status (Fisher’s Exact Test, p < 3.16e-14, odds ratio > 2.3; Fig 5D). These results suggest that the striking enrichment of ncEJC peaks in early coding regions was generally applicable to all transcripts with high 5IMP scores regardless of 5UI presence.
Transcripts harboring N1-methyladenosine (m1A) have high 5IMP scores
It is increasingly clear that ribonucleotide base modifications in mRNAs are highly prevalent and can be a mechanism for post-transcriptional regulation (Frye et al. 2016). One RNA modification present towards the 5’ ends of mRNA transcripts is N1-methyladenosine (m1A) (Li et al. 2016; Dominissini et al. 2016), initially identified in total RNA and rRNAs (Dunn 1961; Hall 1963; Klootwijk and Planta 1973). Intriguingly, the position of m1A modifications has been shown to be more correlated with the position of the first intron than with transcriptional or translational start sites (Figure 2g from Dominissini et al. 2016). When the distance of m1As to each splice site in a given mRNA was calculated, the first splice site was found to be the nearest for 85% of m1 As (Dominissini et al. 2016). When 5’UTR introns were present, m1A was found to be near the first splice site regardless of the position of the start codon (Dominissini et al. 2016). Given that 5IM transcripts are also characterized by the position of the first intron, we investigated the relationship between 5IMP score and m1A RNA modification marks.
We analyzed the union of previously identified m1A modifications (Dominissini et al. 2016) across all cell types and conditions. Although there is some evidence that these marks depend on cell type and growth condition, it is difficult to be confident of the cell type and condition-dependence of any particular mark given experimental variation (see Methods). Nevertheless, we found that mRNAs with m1A modification early in the coding region (first 99 nucleotides) had substantially higher 5IMP scores than mRNAs lacking these marks (Fig 6A; Wilcoxon Rank Sum Test p = 3.4e-265), and were greatly enriched among 5IM transcripts (Fig 6A; Fisher’s Exact Test p = 1.6e-177; odds ratio = 3.8). In other words, the sequence features within the early coding region that define 5IMP transcripts also associate with m1A modification in the early coding region.
We next wondered whether 5IMP score was related to m1A modification generally, or only associated with m1A modification in the early coding region. Indeed, many of the previously identified m1A peaks were within the 5’UTRs of mRNAs (Li et al. 2016). Interestingly, 5IMP scores were only associated with m1A modification in the early coding region, and not with m1A modification in the 5’UTR (Fig 6B). This offers the intriguing possibility that the sequence features that define 5IMP transcripts are co-localized with m1A modification.
Discussion
Coordinating the expression of functionally related transcripts can be achieved by post-transcriptional processes such as splicing, RNA export, RNA localization or translation (Moore and Proudfoot 2009). Sets of mRNAs subject to a common regulatory transcriptional process can exhibit common sequence features that define them to be a class. For example, transcripts subject to regulation by particular miRNAs tend to share certain sequences in their 3’UTRs that are complementary to the regulatory miRNA (Ameres and Zamore 2013). Similarly, transcripts that share a 5’ terminal oligopyrimidine tract are coordinately regulated by mTOR and ribosomal protein S6 kinase (Meyuhas 2000). Here we quantitatively define ‘5IM’ transcripts as a class that shares common sequence elements and functional properties. We estimate the 5IM class to comprise 21% of all human transcripts.
Whereas 35% of human transcripts have one or more 5’UTR introns, the majority of 5IM transcripts have neither a 5’UTR intron nor an intron in the first 90 nts of the ORF. Other shared features of 5IM transcripts include sequence features associated with low translation initiation rates. These are: (1) a tendency for RNA secondary structure in the region immediately preceding the start codon (Fig 3A), and near the 5’cap (Fig 3B); (2) translational upregulation upon overexpression of eIF4E (Fig 3C); and (3) more frequent use of non-AUG start codons (Fig 3D). Also consistent with low intrinsic translation efficiencies, 5IM transcripts additionally tend to depend on less abundant tRNAs to decode the beginning of the open reading frame (Fig 3E).
We had previously reported that transcripts encoding proteins with ER- and mitochondrial-targeting signal sequences (SSCRs and MSCRs, respectively) are over-represented among the 65% of transcripts lacking 5’UTR introns (Cenik et al. 2011). Transcripts in this set are enriched for the sequence features detected by our 5IM classifier. By examining these enriched sequence features, we showed that the 5IM class extends beyond mRNAs encoding membrane proteins. Jan et al. recently developed a transcriptome-scale method to identify mRNAs occupied by ER-proximal ribosomes in both yeast and human cells (Jan et al. 2014). As expected, transcripts known to encode ER-trafficked proteins were highly enriched for ER-proximal ribosome occupancy. However, their data also showed many transcripts encoding non-ER trafficked proteins to also be engaged with ER-proximal ribosomes (Reid and Nicchitta 2015b). Similarly, several other studies have suggested a critical role of ER-proximal ribosomes in translating several cytoplasmic proteins (Reid and Nicchitta 2015a). Here, we found that 5IM transcripts including those that are not ER-trafficked or mitochondrial were significantly more likely to exhibit binding to ER-proximal ribosomes (Figs 4A-B).
In addition to ribosomes directly resident on the ER, an interesting possibility is the presence of a pool of peri-ER ribosomes (Jan et al. 2015; Reid and Nicchitta 2015a). Association of 5IM transcripts with such a peri-ER ribosome pool could potentially explain the observed correlation of 5IM status with binding to ER-proximal ribosomes. The ER is physically proximal to mitochondria (Rowland and Voeltz 2012), so peri-ER ribosomes may include those translating mRNAs on mitochondria (i.e., mRNAs with MSCRs) (Sylvestre et al. 2003). However, even when transcripts corresponding to ER-trafficked and mitochondrial proteins were excluded from consideration, ER-proximal ribosome enrichment and 5IMP scores were highly correlated (Fig 4A). Thus another shared feature of 5IM transcripts is their translation on or near the ER regardless of the ultimate destination of the encoded protein.
In an attempt to identify a common factor binding 5IM transcripts, we asked whether 5IM transcripts were enriched among the experimentally identified targets of 23 different RBPs. Only one RBP emerged--the exon junction complex (EJC). Specifically, we observed a dramatic enrichment of non-canonical EJC (ncEJC) binding sites within the early coding region of 5IM transcripts. Further, the CG-rich motif identified for ncEJCs in first exons is strikingly similar to the CG-rich motif enriched in the first exons of 5IM transcripts (Fig 5B). Previous work implicated RanBP2, a protein associated with the cytoplasmic face of the nuclear pore, as a binding factor for some SSCRs (Mahadevan et al. 2013). This finding suggests that nuclear pore proteins may influence EJC occupancy on these transcripts.
EJC deposition during the process of pre-mRNA splicing enables the nuclear history of an mRNA to influence post-transcriptional processes including mRNA localization, translation efficiency, and nonsense mediated decay (Chang et al. 2007; Kervestin and Jacobson 2012; Choe et al. 2014). While canonical EJC binding occurs at a fixed distance upstream of exon-exon junctions and involves direct contact between the sugar-phosphate backbone and the EJC core anchoring protein eIF4AIII, ncEJC binding sites likely reflect stable engagement between the EJC core and other mRNP proteins (e.g., SR proteins) recognizing nearby sequence motifs. Although some RBPs were identified for ncEJC motifs found in internal exons (Singh et al. 2014, 2012), to date no candidate RBP has been identified for the CG-rich ncEJC motif found in the first exon. If this motif does result from an RBP interaction, it is likely to be one or more of the ∼70 proteins that stably and specifically bind to the EJC core (Singh et al. 2012).
Finally, we observed a dramatic enrichment for m1A modifications among 5IM transcripts, with specific enrichment for m1A modifications in the early coding region. Given this striking enrichment perhaps it is perhaps not surprising that m1A containing mRNAs were also shown to have more structured 5’UTRs that are GC-rich compared to m1A lacking mRNAs (Dominissini et al. 2016). Similar to 5IM transcripts, m1A containing mRNAs were found to decorate start codons that appear in a highly structured context. While ALKBH3 has been identified as a protein that can demethylate m1A, it is currently unknown whether there are any proteins that can specifically act as “readers” of m1A. Recent studies have begun to identify such readers for other mRNA modifications such as YTHDF1, YTHDF2, WTAP and HNRNPA2B1 (Ping et al. 2014; Liu et al. 2014; Wang et al. 2014, 2015; Alarcón et al. 2015). Our study highlights a possible link between non-canonical EJC binding and m1A. Hence, our results yield the intriguing hypothesis that one or more of the ∼70 proteins that stably and specifically bind to the EJC core can function as an m1A reader. Future work involving directed experiments would be needed to test this hypothesis.
Given that 5IM transcripts are enriched for ER-targeted and mitochondrial proteins, it is plausible that the observed functional characteristics of 5IM transcripts are driven solely by SSCR and MSCR-containing transcripts. Hence, we repeated all analyses for the subclasses of 5IM transcripts (MSCR-containing, SSCR-containing, S−/M−/SignalP+, or S−/M−/SignalP−). We found the observed associations remained statistically significant and had the same direction of effect, even after eliminating SSCR- and MSCR-containing transcripts, despite the fact that all training of the 5IM classifier was performed only using SSCR transcripts. We also found that 5IMP score was equally or more strongly associated with each of the functional characteristics compared to the 5UI status. In conclusion, the molecular associations we report apply to 5IM transcripts as a whole, and are not driven solely by the subset of 5IM transcripts encoding ER- or mitochondria-targeting signal peptides, and seem to indicate shared features beyond simple lack of a 5’UTR intron.
An intriguing possibility is that 5IM transcript features associated with lower intrinsic translation efficiency may together enable greater ‘tunability’ of 5IM transcripts at the translation stage. Regulated enhancement or repression of translation, for 5IM transcripts, could allow for rapid changes in protein levels. There are analogies to this scenario in transcriptional regulation, wherein highly regulated genes often have promoters with low baseline levels that can be rapidly modulated through the action of regulatory transcription factors. As more ribosome profiling studies are published examining translational responses transcriptome-wide under multiple perturbations, conditions under which 5IM transcripts are translationally regulated may be revealed. Directed experiments will be needed to test translational features of 5IM transcripts hypothesized via this computational analysis.
Taken together, our analyses reveal the existence of a distinct ‘5IM’ class comprising 21% of human transcripts. This class is defined by depletion of 5’ proximal introns, presence of specific RNA sequence features associated with low translation efficiency, non-canonical binding by the Exon Junction Complex and an enrichment for N1-methyladenosine modification.
Materials and Methods
Datasets and Annotations
Human transcript sequences were downloaded from the NCBI human Reference Gene Collection (RefSeq) via the UCSC table browser (hg19) on Jun 25 2010 (Kent et al. 2002; Pruitt et al. 2005). Transcripts with fewer than three coding exons, and transcripts where the first 99 coding nucleotides straddle more than two exons were removed from further consideration. The criteria for exclusion of genes with fewer than three coding exons was to ensure that the analysis of downstream regions was possible for all genes that were used in our analysis of early coding regions. Specifically, the downstream regions were selected randomly from downstream of the 3rd exons. Hence, genes with fewer exons would not be able to contribute a downstream region potentially creating a skew in representation. In total there were ∼3000 genes that were removed from consideration due to this filter. Therefore, our classifier is limited in its ability to assess transcripts from these genes. However, the performance measures reported in our manuscript are robust to exclusion of these genes, in the sense that the same class of transcripts was used in both training and test datasets.
Transcripts were clustered based on sequence similarity in the first 99 coding nucleotides. Specifically, each transcript pair was aligned using BLAST with the DUST filter disabled (Altschul et al. 1990). Transcript pairs with BLAST E-values < 1e-25 were grouped into transcript clusters. In total, there were 15576 transcript clusters that were considered further. These clusters that were subsequently assigned to one of four categories: MSCR containing, SSCR-containing, S−/MSCR−SignalP+, or S−/MSCR− SignalP− as follows:
MSCR-containing transcripts were annotated using MitoCarta and other sources as described in (Cenik et al. 2011). SSCR-containing transcripts were the set of transcripts annotated to contain signal peptides in the Ensembl Gene v.58 annotations, which were downloaded through Biomart on Jun 25 2010. For transcripts without an annotated MSCR or SSCR, the first 70 amino acids were analyzed using SignalP 3.0 (Bendtsen et al. 2004). Using the eukaryotic prediction mode, transcripts were assigned to the S−/MSCR− SignalP+ category if either the Hidden Markov Model or the Artificial Neural Network classified the sequence as signal peptide containing. All remaining transcript clusters were assigned to the S−/MSCR− SignalP− category. The number of transcript clusters in each of the four categories was: 3743 SSCR, 737 MSCR, 696 S−/MSCR− SignalP+, 10400 S−/MSCR− SignalP−.
For each transcript cluster, we also constructed matched control sequences. Control sequences were derived from a single randomly chosen in-frame ‘window’ downstream of the 3rd exon from the evaluated transcripts. If an evaluated transcript had fewer than 99 nucleotides downstream of the 3rd exon, no control sequence was extracted. 5UI labels and transcript clustering for the control sequences were inherited from the evaluated transcript. The rationale for this decision is that our analysis depends on the position of the first intron; hence genes with fewer than two exons need to be excluded, as these will not have introns. We further required the matched control sequences to fall downstream of the early coding region. In the vast majority of cases the third exon fell outside the first 99 nucleotides of the coding region, making this a convenient criterion by which to choose control regions.
Sequence Features and Motif Discovery
36 sequence features were extracted from each transcript (Table S1). The sequence features included the ratio of arginines to lysines, the ratio of leucines to isoleucines, adenine content, length of the longest stretch without adenines, preference against codons that contain adenines or thymines. These features were previously found to be enriched in SSCR-containing and certain 5UI− transcripts (Cenik et al. 2011; Palazzo et al. 2007). In addition, we extracted ratios between several other amino acid pairs based on having biochemical/evolutionary similarity, i.e. having positive scores, according to the BLOSUM62 matrix (Henikoff and Henikoff 1992). To avoid extreme ratios given the relatively short sequence length, pseudo-counts were added to amino acid ratios using their respective genome-wide prevalence.
In addition, we used three published motif finding algorithms (AlignACE, DEME, and MoAN (Roth et al. 1998; Redhead and Bailey 2007; Valen et al. 2009)) to discover RNA sequence motifs enriched among 5UI− transcripts. AlignACE implements a Gibbs sampling approach and is one of the pioneering efforts in motif discovery (Roth et al. 1998). We modified the AlignACE source code to restrict motif searches to only the forward strand of the input sequences to enable RNA motif discovery. DEME and MoAN adopt discriminative approaches to motif finding by searching for motifs that are differentially enriched between two sets of sequences (Redhead and Bailey 2007; Valen et al. 2009). MoAN has the additional advantage of discovering variable length motifs, and can identify co-occurring motifs with the highest discriminative power.
In total, six motifs were discovered using the three motif finding algorithms (Table S1). Position specific scoring matrices for all motifs were used to score the first 99 - l positions in each sequence, where l is the length of the motif. We assessed the significance of each motif instance by calculating the p-value of enrichment (Fisher’s Exact Test) among 5UI− transcripts considering all transcripts with a motif instance achieving a PSSM score greater than equal to the instance under consideration. The significance score and position of the two best motif instances were used as features for the classifier (Table S1).
5IM Classifier Training and Performance Evaluation
We modified an implementation of the Random Forest classifier (Breiman 2001) to model the relationship between sequence features in the early coding region and the absence of 5’UTR introns (5UIs). This classifier discriminates transcripts with 5’proximal-intron-minus-like-coding regions and hence is named the ‘5IM’ classifier. The training set for the classifier was composed of SSCR transcripts exclusively. There were two reasons to restrict model construction to SSCR transcripts: 1) we expected the presence of specific RNA elements as a function of 5UI presence based on our previous work (Cenik et al. 2011); and 2) we wanted to restrict model building to sequences that have similar functional constraints at the protein level.
We observed that 5UI− transcripts with introns proximal to 5’ end of the coding region have sequence characteristics similar to 5UI+ transcripts (Fig 1D). To systematically characterize this relationship, we built different classifiers using training sets that excluded 5UI− transcripts with a coding region intron positioned at increasing distances from the start codon. We evaluated the performance of each classifier using 10-fold cross validation.
Given that a large number of motif discovery iterations were needed, we sought to reduce the computational burden. We isolated a subset of the training examples to be used exclusively for motif finding. Motif discovery was performed once using this set of sequences, and the same motifs were used in each fold of the cross validation for all the classifiers. Imbalances between the sizes of positive and negative training examples can lead to detrimental classification performance (Wang and Yao 2012). Hence, we balanced the training set size of 5UI− and 5UI+ transcripts by randomly sampling from the larger class. We constructed 10 sub-classifiers to reduce sampling bias, and for each test example, the prediction score from each subclassifier was summed to produce a combined score between 0 and 10. For the rest of the analyses, we used the classifier trained using 5UI− transcripts where first coding intron falls outside the first 90 coding nucleotides (Fig 1E).
We evaluated classifier performance using a 10-fold cross validation strategy for SSCRcontaining transcripts (i.e. the training set). In each fold of the cross-validation, the model was trained without any information from the held-out examples, including motif discovery. For all the other transcripts and the control sets (see above), the 5IMP scores were calculated using the classifier trained using SSCR transcripts as described above. 5IMP score distribution for the control set was used to calculate the empirical cumulative null distribution. Using this function, we determined the p-value corresponding to the 5IMP score for all transcripts. We corrected for multiple hypotheses testing using the qvalue R package (Storey 2003). Based on this analysis, we estimate that a 5IMP score of 7.41 corresponds to a 5% False Discovery Rate and suggest that 21% of all human transcripts can be considered as 5IM transcripts.
While the theoretical range of 5IMP scores is 0-10, the highest observed 5IMP is 9.855. We note that for all figures that depict 5IMP score distributions, we displayed the entire theoretical range of 5IMP scores (0-10).
Functional Characterization of 5IM Transcripts
We collected genome-scale dataset from publically available databases and from supplementary information provided in selected articles. For all analyzed datasets, we first converted all gene/transcript identifiers (IDs) into RefSeq transcript IDs using the Synergizer webserver (Berriz and Roth 2008). If a dataset was generated using a non-human species (ex. Targets identified by TDP-43 RNA immunoprecipitation in rat neuronal cells), we used Homologene release 64 (downloaded on Sep 28 2009) to identify the corresponding ortholog in humans. If at least one member of a transcript cluster was associated with a functional phenotype, we assigned the cluster to the positive set with respect to the functional phenotype. If more than one member of a cluster had the functional phenotype, we only retained one copy of the cluster unless they differed in a quantitative measurement. For example, consider two hypothetical transcripts: NM_1 and NM_2 that were clustered together and have a 5IMP score of 8.5. If NM_1 had an mRNA half-life of 2hr while NM_2’s half-life was 1hr than we split the cluster while preserving the 5IMP score for both NM_1 and NM_2.
Once the transcripts were partitioned based on the functional phenotype, we ran two statistical tests: 1) Fisher’s Exact Test for enrichment of 5IM transcripts within the functional category; 2) Wilcoxon Rank Sum Test to compare 5IMP scores between transcripts partitioned by the functional phenotype. Additionally, for datasets where a quantitative measurement was available (ex. mRNA half-life), we calculated the Spearman rank correlation between 5IMP scores and the quantitative variable. In these analyses, we assumed that the test space was the entire set of RefSeq transcripts. For all phenotypes where we observed a preliminary statistically significant result, we followed up with more detailed analyses described below.
Analysis of Features Associated with Translation
For each transcript, we predicted the propensity for secondary structure preceding the translation start site and the 5’cap. Specifically, we extracted 35 nucleotides preceding the translation start site or the first 35 nucleotides of the 5’UTR. If a 5’UTR is shorter than 35 nucleotides, the transcript was removed from the analysis. hybrid-ss-min utility (UNAFold package version 3.8) with default parameters was used to calculate the minimum folding energy (Markham and Zuker 2008).
Codon optimality was measured using the tRNA Adaptation Index (tAI), which is based on the genomic copy number of each tRNA (dos Reis et al. 2004). tAI for all human codons were downloaded from Tuller et al. 2010 Table S1. tAI profiles for the first 30 amino acids were calculated for all transcripts. Codon optimality profiles were summarized for the first 30 amino acids for each transcript or by averaging tAI at each codon.
We carried out two control experiments to test whether the association between 5IMP score and tAI could be explained by confounding variables. First, we permuted the nucleotides in the first 90 nts and observed no relationship between 5IMP score and mean tAI when these permuted sequences were used (Fig S1). Second, we selected random in-frame 99 nucleotides from 3rd exon to the end of the coding region and found no significant differences in tAI (Fig S1). These results suggest that the relationship between tAI and 5IMP score is confined to the first 30 amino acids and is not explained by simple differences in nucleotide composition.
Ribosome profiling and RNA expression data for human lymphoblastoid cells (LCLs) were downloaded from NCBI GEO database accession number: GSE65912. Translation efficiency was calculated as previously described (Cenik et al. 2015). Median translation efficiency across the different cell-types was used for each transcript.
Analysis of Proximity Specific Ribosome Profiling Data
We downloaded proximity specific ribosome profiling data for HEK 293 cells from Jan et al. 2014; Table S6. We converted UCSC gene identifiers to HGNC symbols using g:Profiler (Reimand et al. 2011). We retained all genes with an RPKM >5 in either input or pulldown and required that at least 30 reads were mapped in either of the two libraries. We used ERtargeting evidence categories “secretome”, “phobius”, “TMHMM”, “SignalP”, “signalSequence”, “signalAnchor” from (Jan et al. 2014) to annotate genes as having ER-targeting evidence. The genes that did not have any ER-targeting evidence or “mitoCarta” /“mito.GO” annotations were deemed as the set of genes with no ER-targeting or mitochondrial evidence. We calculated the log2 of the ratio between ER-proximal ribosome pulldown RPKM and control RPKM as the measure of enrichment for ER-proximal ribosome occupancy (as in Jan et al. 2014). A moving average of this ratio was calculated for genes grouped by their 5IMP score. For this calculation, we used bins of 30 mitochondrial genes or 100 genes with no evidence of ER- or mitochondrial targeting.
Analysis of Genome-wide Binding Sites of Exon Junction Complex
Dr. Gene Yeo and Gabriel Pratt generously shared uniformly processed peak calls for experiments identifying human RNA binding protein targets. These datasets include various CLIP-Seq datasets and its variants such as iCLIP. A total 49 datasets from 22 factors were analyzed. These factors were: hnRNPA1, hnRNPF, hnRNPM, hnRNPU, Ago2, hnRNPU, HuR, IGF2BP1, IGF2BP2, IGF2BP3, FMR1, eIF4AIII, PTB, IGF2BP1, Ago3, Ago4, MOV10, Fip1, CF Im68, CF Im59, CF Im25, and hnRNPA2B1. We extracted the 5IMP scores for all targets of each RBP. We calculated the Wilcoxon Rank Sum test statistic comparing the 5IMP score distribution of the targets of each RBP to all other transcripts with 5IMP scores. None of the tested RBP target sets had an adjusted p-value < 0.05 and a median difference in 5IMP score > 1 when compared to non-target transcripts.
In addition, we used RNA:protein immunoprecipitation in tandem (RIPiT) data to determine Exon Junction Complex (EJC) binding sites (Singh et al. 2014, 2012). We analyzed the common peaks from the Y14-Magoh immunoprecipitation configuration (Singh et al. 2012; Kucukural et al. 2013). Canonical EJC binding sites were defined as peaks whose weighted center were 15 to 32 nucleotides upstream of an exon-intron boundary. All remaining peaks were deemed as “non-canonical” EJC binding sites. We extracted all non-canonical peaks that overlapped the first 99 nucleotides of the coding region and restricted our analysis to transcripts that had an RPKM greater than one in the matched RNA-Seq data.
Analysis of m1A modified transcripts
We downloaded the list of RNAs observed to contain m1A from Li et al. (2016) and Dominissini et al. (2016). RefSeq transcript identifiers were converted to HGNC symbols using g:Profiler (Reimand et al. 2011). The overlap between the two datasets was determined using HGNC symbols. m1A modifications that overlap the first 99 nucleotides of the coding region were determined using bedtools (Quinlan and Hall 2010).
Li et al. (2016) identified 600 transcripts with m1A modification in normal HEK293 cells. Of these, 368 transcripts were not found to contain the m1A modification in HEK293 cells by Dominissini et al. (2016). Yet, 81% of these were found to be m1A modified in other cell types. Li et al. (2016) also analyzed m1A upon H2O2 treatment and serum starvation in HEK293 cells and identified many m1A modifications that are only found these stress conditions. However, 20% of 371 transcripts harboring stress-induced m1A modifications were found in normal HEK293 cells by Dominissini et al. (2016). Taken together, these analyses suggest that transcriptome-wide m1A maps remain incomplete. Hence, we analyzed the 5IMP scores of all mRNAs with m1A across cell types and conditions. We reported results using the Dominissini et al. dataset but the same conclusions were supported by m1A modifications from Li et al.
Author contributions
CC, MJM, and FPR designed the study. CC and HNC implemented the random forest classifier and carried out all analyses. GS and AA performed initial experiments. MPS and AFP provided critical feedback. MJM and FPR jointly supervised the project. CC, FPR and MJM wrote the manuscript with input from all authors.
Acknowledgements
We would like to thank Gene Yeo, and Gabriel Pratt for sharing peak calls for RNA binding protein target identification experiments (CLIP, PAR-CLIP, iCLIP). We thank Elif Sarinay Cenik, Başar Cenik, and Alper Küçükural for helpful discussions. F.P.R. and H.N.C. were supported by the Canada Excellence Research Chairs Program. This work was supported in part by NIH grants GM50037 (M.J.M.), HG001715 and HG004233 (F.P.R.), the Krembil Foundation, an Ontario Research Fund award, and the Avon Foundation (F.P.R.). A.F.P. was supported by a grant from the Canadian Institutes of Health Research to A.F.P. (FRN 102725).