Abstract
Cells express thousands of transcripts that show weak coding potential. Known as long non-coding RNAs (lncRNAs), they typically contain short open reading frames (ORFs) having no homology with known proteins. Recent studies show that a significant proportion of lncRNAs are translated, challenging the view that they are non-coding. These results are based on selective sequencing of ribosome-protected fragments, or ribosome profiling. The present study used ribosome profiling data from eight mouse tissues and cell types, combined with ∼330,000 synonymous and non-synonymous single nucleotide variants, to dissect the patterns of purifying selection in proteins translated from lncRNAs. Using the three-nucleotide read periodicity that characterizes actively translated regions, we identified 832 mouse translated lncRNAs. Overall, they produced 1,489 different proteins, most of them smaller than 100 amino acids. Nearly half of the ORFs showed sequence conservation in rat and/or human transcripts, and many of them are likely to encode functional micropeptides, including the recently discovered Myoregulin. For lncRNAs not conserved in rats or humans, the ORF codon usage bias distinguished between two classes, one with particularly high coding scores and evidence of purifying selection, consistent with the presence of lineage-specific functional proteins, and a second, larger, class of ORFs producing peptides with no significant purifying selection signatures. We obtained evidence that the translation of these lncRNAs depends on the chance occurrence of ORFs with a favorable codon composition. Some of these lncRNAs may be precursors of novel protein-coding genes, filling a gap in our current understanding of de novo gene birth.
Introduction
In recent years, the sequencing of transcriptomes has revealed that, in addition to classical protein-coding transcripts, the cell expresses thousands of long transcripts with weak coding potential [1–5]. Some of these transcripts, known as long non-coding RNAs (lncRNAs), have well-established roles in gene regulation; for example, Air is an Igf2r antisense lncRNA involved in silencing the paternal Igf2r allele in cis [6,7]. However, the vast majority of lncRNAs remain functionally uncharacterized. Some have nuclear roles, but most are polyadenylated and accumulate in the cytoplasm [8]. In addition, many lncRNAs are expressed at low levels and have a limited phylogenetic distribution [9,10].
In 2009, Nicholas Ingolia and co-workers published the results of a new technique to measure translation of mRNAs by deep sequencing of ribosome-protected RNA fragments, called ribosome profiling (Ribo-Seq) [11]. This technique generates millions of ribosome footprints that can be mapped to a species genome or transcriptome to assess the translation of thousands of open reading frames (ORFs) [12], including low-abundance small peptides that may be difficult to detect by standard proteomics approaches [13–15]. In ribosome profiling experiments, the three-nucleotide periodicity of the reads, resulting from the movement of the ribosome along the coding sequence, can be used to differentiate translated sequences from other possible RNA-protein complexes [13,16–19]. A growing number of studies based on this technique have found that a significant proportion of lncRNAs are translated [16,18,20–25], but the functional significance of this finding is not yet clear. Some of the translated lncRNAs may be mis-annotated protein-coding genes that encode micropeptides (<100 amino acids) which, due to their short length, have not been correctly predicted by bioinformatics algorithms [13,15,26,27]. This is likely to include some recently evolved proteins that lack homologues in other species and are even harder to detect than conserved short peptides [24].
One striking feature of the ORFs reported to be translated from lncRNAs is that, in general, they appear to have fewer selective constraints than standard proteins [18,24], raising the possibility that a large fraction of them encode proteins that are not functional, despite being translated in a stable manner. However, evidence for this hypothesis is presently lacking.
Non-synonymous and synonymous single nucleotide polymorphisms in coding sequences provide useful information to distinguish between neutrally evolving proteins and proteins undergoing purifying or negative selection. Under no selection, both kinds of variants accumulate at the same rate, whereas under purifying selection there is a deficit of non-synonymous variants [28]. The detection of selection signatures provides strong evidence of functionality, whereas non-functional proteins evolve neutrally. The present study takes advantage of existing nucleotide variation data for the house mouse to investigate the selective patterns of peptides translated by lncRNAs. Our findings provide evidence that lncRNAs are pervasively translated and that a large fraction of them produce neutrally evolving peptides. We discuss the importance of these peptides as raw material for the evolution of de novo protein-coding genes.
Results
Identification of translated sequences
We sought to identify translated open reading frames (ORFs) in a comprehensive set of long non-coding RNAs (lncRNAs) and protein-coding genes (codRNAs) from mouse, using ribosome-profiling RNA-sequencing (Ribo-Seq) data from eight different tissues and cell types (Table 1 and references therein). The samples corresponded to healthy individuals and comprised hippocampus, neural embryonic stem cells, brain, testis, neutrophils, splenic B cells, heart and skeletal muscle. In contrast to RNA sequencing (RNA-Seq) reads, which are expected to cover the complete transcript, Ribo-Seq reads are specific to regions bound by ribosomes. We mapped the RNA-Seq and Ribo-Seq reads of each experiment to a mouse transcriptome that comprised all Ensembl mouse gene annotations, including both coding genes and lncRNAs, as well as thousands of additional de novo assembled polyadenylated transcripts derived from non-annotated expressed loci (novel lncRNAs, see Methods). For the assembly of this transcriptome, we used more than 1.5 billion strand-specific RNA sequencing reads from mouse [29].
We selected all expressed transcripts (FPKM > 0.2, see Methods) and predicted all possible canonical ORFs encoding putative proteins with a length of at least 9 amino acids. For each mapped Ribo-Seq experiment, we selected the ORFs covered by at least 10 Ribo-Seq reads and examined the distribution of the predicted ribosome P-sites along the ORF using RibORF software [16] (Fig 1a). ORFs classified as translated by the program showed clear three-nucleotide periodicity and uniformity when compared to the reads for the rest of the ORFs (Fig 1b and 1c). These two biases are characteristic of regions that are being actively translated [11,20,16,13,17], and are absent from other types of protein-RNA interactions [30].
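The ORF prediction step can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: it assumes that a canonical ORF starts at an ATG and ends at the first in-frame stop codon, that only the most upstream ATG is kept for each stop codon, and that the 9-amino-acid minimum excludes the stop codon. The function name `find_canonical_orfs` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_canonical_orfs(seq, min_aa=9):
    """Return (start, end, frame) tuples for ATG...stop ORFs on the given strand.

    start is the 0-based index of the A of the ATG and end is the index just
    past the stop codon. min_aa counts amino acids, excluding the stop codon.
    Assumption: only the most upstream ATG is reported for each stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_aa = (i - start) // 3
                if n_aa >= min_aa:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF after this stop
    return orfs
```

For example, `find_canonical_orfs("CC" + "ATG" + "GCT" * 9 + "TAA" + "GG")` reports a single ORF encoding 10 amino acids, whereas an ATG followed by only seven sense codons is discarded by the length filter.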
This method identified translated ORFs in ∼20% of the loci annotated as lncRNAs and ∼90% of the coding genes (Table 1 and Fig 1d). We also identified 286 novel genes that did not overlap with annotated protein-coding genes but contained translated ORFs. A substantial fraction of the codRNAs (29.54%) showed translation of more than one non-overlapping ORF; 2,954 ORFs were located upstream or downstream of the main protein-coding ORF in the same transcript (uORFs and dORFs, respectively), while 3,951 ORFs corresponded to putative alternatively translated products. Moreover, we found that 325 lncRNAs (∼39%, including annotated and novel lncRNAs) showed evidence of polycistronic translation, producing two or more peptides.
A significant fraction of the ORFs in codRNAs were transcribed and translated in several samples, whereas lncRNAs, uORFs, and dORFs tended to be sample-specific (S1 Fig). About 75% of the translated lncRNAs encoded putative proteins shorter than 100 amino acids (small ORFs or smORFs). Overall, ORFs in lncRNAs were longer than uORFs and dORFs (median 48-52 vs. 26 amino acids, Wilcoxon test, p-value < 10−5), but shorter than the main ORF in protein-coding genes (median 381 amino acids, Wilcoxon test, p-value < 10−5). The characteristics of translated transcripts and the size of the translated products were very similar for annotated lncRNAs and for novel expressed loci (Fig 1b, 1c and 1e). Therefore, these two types of transcripts were merged into a single class (lncRNA) for most analyses.
Properties of translated lncRNAs compared to coding genes
The number of transcribed and translated ORFs varied substantially depending on the sample (Table 1, Fig 2a). The largest number of translated genes were detected in hippocampus tissue, followed by testis, embryonic stem cells and brain, both for codRNAs and lncRNAs. Similar results were obtained when we focused on ORFs translated in a single tissue (Fig 2b) or separately considered long ORFs and smORFs (S2 Fig). There were two reasons for these differences. The first reason was the number of available Ribo-Seq sequencing reads in each experiment, about three times greater in hippocampus than in other tissues, which provided increased resolution to detect translation of lowly expressed transcripts. As expected, subsampling the number of reads in the hippocampus resulted in a decrease in the number of translated ORFs detected (S2 Fig). The second reason was that, in some tissues, the pool of translated ORFs was highly skewed towards a few very abundant proteins (S3 Fig). For example, in skeletal muscle and heart the five most highly translated genes, which included myosin and titin, gathered 22.5-31.2% of the sequencing reads; this substantially reduced the number of reads available to detect other products of translation. Overall, the data suggested that the experimental translation signal was not saturated and that the true number of translated lncRNAs may be higher than was estimated here.
When we compared the translated and non-translated protein-coding genes, the former had higher expression levels and were longer than the latter (Fig 2c, Wilcoxon test, p-value < 10−5). In the case of lncRNAs, translation was positively associated with ORF length (Wilcoxon test, p-value < 10−5), but we did not detect any relationship between translation status and expression level. In general, lncRNAs were expressed at much lower levels than coding genes (Fig 2c); this is a well-known global difference between the two types of genes [9,36].
Phylogenetic conservation and codon usage bias
We next examined which fraction of the mouse lncRNAs with evidence of translation were conserved in rat and/or human transcripts. For this we employed de novo human and rat transcript assemblies of a quality similar to that used for mouse (see Methods). We searched for homologues of the putatively translated mouse ORFs in the human and rat transcripts using TBLASTN (e-value < 10−4). We found hits in human and/or rat for 41% of the mouse-translated ORFs in lncRNAs, compared to 92% for protein-coding genes (Fig 3a). This is in line with previous studies showing that lncRNAs tend to be much less conserved than protein-coding genes [10,37,38]. Codon usage bias is usually employed to predict coding sequences in conjunction with other variables such as ORF length and sequence conservation [39,40]. In the case of non-conserved smORFs, such as those translated from many of the lncRNAs, only measures based on codon usage bias can be applied. We have previously implemented [24] a metric based on the differences in dicodon (hexamer) frequencies between coding and non-coding sequences, which we have used to calculate length-independent coding scores for translated and non-translated ORFs in different species. Based on this metric, we developed a computational tool to identify ORFs with significant coding scores in any set of sequences of interest, which is available online (evolutionarygenomics.imim.es/CIPHER).
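A dicodon-based coding score of this kind can be sketched as follows. The exact formula implemented in CIPHER is not given in this section, so the code below is only an assumed log-ratio form: in-frame hexamer frequencies are estimated from a coding and a non-coding training set, and the score of an ORF is the mean log ratio over its in-frame hexamers, which makes it independent of ORF length. All function names and the pseudocount are illustrative choices.

```python
import math
from collections import Counter


def dicodon_counts(seqs):
    """Count in-frame hexamers (dicodons), stepping one codon at a time."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - 5, 3):
            counts[s[i:i + 6]] += 1
    return counts


def build_score_table(coding_seqs, noncoding_seqs, pseudocount=1.0):
    """Log ratio of coding to non-coding hexamer frequencies (assumed form)."""
    cod = dicodon_counts(coding_seqs)
    non = dicodon_counts(noncoding_seqs)
    n_hexamers = 4 ** 6
    cod_total = sum(cod.values()) + pseudocount * n_hexamers
    non_total = sum(non.values()) + pseudocount * n_hexamers
    table = {}
    for hexamer in set(cod) | set(non):
        f_cod = (cod[hexamer] + pseudocount) / cod_total
        f_non = (non[hexamer] + pseudocount) / non_total
        table[hexamer] = math.log(f_cod / f_non)
    return table


def coding_score(orf_seq, table):
    """Mean hexamer log ratio along the ORF; averaging removes length dependence.

    Hexamers absent from both training sets are scored 0 (uninformative),
    which is a simplifying choice of this sketch.
    """
    s = orf_seq.upper()
    scores = [table.get(s[i:i + 6], 0.0) for i in range(0, len(s) - 5, 3)]
    return sum(scores) / len(scores) if scores else 0.0
```

Positive scores indicate hexamer composition closer to the coding training set, negative scores closer to the non-coding set. In practice the training sets would need to be large enough for the hexamer frequencies to be reliably estimated.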
We observed a positive relationship between the RibORF translation score and the coding score produced by CIPHER, both for codRNAs and lncRNAs (Fig 2c, S4 Fig). We also found that conserved ORFs (group C) had significantly higher coding scores than non-conserved ORFs, both for codRNAs and lncRNAs (Fig 3b). We reasoned that ORFs with a very biased codon usage may correspond to functional proteins even if not conserved across species. We used CIPHER to divide the non-conserved genes into a group with high coding scores (NC-H, coding score > 0.079, above the median value for conserved coding genes) and another group with lower coding scores (NC-L, coding score ≤ 0.079). We also searched for proteomics evidence in PRIDE [41]. Using stringent criteria, we found proteomics evidence for 37 of the ORFs in the lncRNAs, with similar numbers in the different groups (11 in C, 12 in NC-H and 14 in NC-L).
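The three-way grouping described above reduces to a simple decision rule, sketched here in Python. The 0.079 threshold is taken from the text; the function name and the boolean `conserved` flag (standing for a TBLASTN hit in human and/or rat) are illustrative.

```python
HIGH_SCORE_THRESHOLD = 0.079  # coding score cut-off quoted in the text


def classify_orf(conserved, coding_score):
    """Assign an ORF to group C, NC-H or NC-L.

    conserved: True if a homologue was found in human and/or rat transcripts.
    coding_score: length-independent coding score of the ORF.
    """
    if conserved:
        return "C"
    if coding_score > HIGH_SCORE_THRESHOLD:
        return "NC-H"
    return "NC-L"
```

Note that an ORF scoring exactly 0.079 falls into NC-L, matching the "≤ 0.079" definition.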
Testing for signatures of natural selection in translated ORFs
We had detected ∼1,500 putatively translated ORFs in lncRNAs, but it was unclear if they were likely to encode functional proteins. To address this, we investigated the signatures of natural selection in the ORFs using a large collection of mouse single nucleotide polymorphisms (SNPs) for the house mouse subspecies Mus musculus castaneus [42,43]. We used the ratio between non-synonymous and synonymous SNPs to evaluate whether proteins translated from different sets of transcripts were subject to purifying selection. This method has an advantage over methods based on the ratio of non-synonymous to synonymous substitutions in that it can be applied to sequences which do not show phylogenetic conservation. This allowed us to investigate the signatures of selection in hundreds of translated ORFs from mouse lncRNAs that were not conserved in human or rat transcripts.
In the absence of selection, and considering that all codons have the same frequency and all mutations between pairs of nucleotides are equally probable, we would expect the PN/PS ratio of a sequence or set of sequences to be 2.89 [44]. However, not all codons are equally frequent in coding sequences, and the probability of mutation differs between pairs of nucleotides [45–47]. These parameters can be estimated from real data and subsequently used to compute an expected PN/PS under neutrality. The difference between the observed and expected PN/PS ratios indicates the strength of purifying selection. If the observed PN/PS normalized by the expected PN/PS is not significantly different from 1, the observed proportion of non-synonymous and synonymous SNPs is consistent with neutral evolution. If it is significantly lower than 1, there is a depletion of non-synonymous SNPs. Such a depletion is consistent with purifying selection acting at the amino acid sequence level and provides a strong argument for functionality.
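One way to compute an expected PN/PS under neutrality from codon frequencies and nucleotide mutation rates is sketched below. This is an illustration under stated assumptions, not the procedure of [44] or the code used in the study: it enumerates every single-nucleotide change in every sense codon of the standard genetic code, skips changes that create a stop codon, and weights each change by the codon frequency and the mutation rate. The function name and argument layout are hypothetical. Because of these counting choices, the value obtained with uniform inputs need not reproduce exactly the 2.89 quoted above.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def expected_pn_ps(codon_freq=None, mut_rate=None):
    """Expected non-synonymous/synonymous ratio without selection.

    codon_freq: dict codon -> relative frequency (None means uniform).
    mut_rate: dict (ref_base, alt_base) -> relative rate (None means uniform).
    Assumption: changes that create a stop codon are ignored.
    """
    nonsyn = 0.0
    syn = 0.0
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        w_codon = 1.0 if codon_freq is None else codon_freq.get(codon, 0.0)
        for pos in range(3):
            ref = codon[pos]
            for alt in BASES:
                if alt == ref:
                    continue
                mutant_aa = CODON_TABLE[codon[:pos] + alt + codon[pos + 1:]]
                if mutant_aa == "*":
                    continue  # nonsense changes are not counted
                w_mut = 1.0 if mut_rate is None else mut_rate.get((ref, alt), 0.0)
                if mutant_aa == aa:
                    syn += w_codon * w_mut
                else:
                    nonsyn += w_codon * w_mut
    return nonsyn / syn
```

Raising the relative rate of transitions lowers the expected ratio in this scheme, because transitions at third codon positions are more often synonymous than transversions; this is why an empirical mutation table matters for the neutral expectation.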
We mapped a total of 324,729 SNPs from Mus musculus castaneus to the previously defined ORFs from codRNAs and lncRNAs. For each sequence, and sequence dataset (C, NC-H and NC-L), we calculated the ratio between observed non-synonymous and synonymous SNPs (PN/PS(obs)) and divided it by the ratio expected under neutrality (PN/PS(exp)), obtaining a normalized PN/PS. The expected PN/PS was obtained using a table of nucleotide mutation frequencies in Mus musculus castaneus, which we derived from SNPs in intronic sequences (S1 and S2 Tables), and the observed codon frequencies in the sequences of interest. The values ranged from 2.31 to 2.47 for different sequence datasets (S3 Table). We used the chi-square test to determine if the sequences under analysis showed a PN/PS that deviated significantly from that expected under neutrality (Fig 3c, S4 Table).
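The normalization and test can be sketched in a few lines of Python. The text does not spell out how the chi-square test was constructed, so the version below is an assumption: the observed counts of non-synonymous and synonymous SNPs are compared with the counts expected if the same total number of SNPs were split according to the neutral PN/PS ratio (one degree of freedom). Function names are illustrative.

```python
import math


def normalized_pn_ps(pn_obs, ps_obs, expected_ratio):
    """Observed PN/PS divided by the PN/PS expected under neutrality."""
    return (pn_obs / ps_obs) / expected_ratio


def chi_square_neutrality(pn_obs, ps_obs, expected_ratio):
    """Goodness-of-fit test of observed SNP counts against the neutral split.

    Returns (chi2, p_value) for one degree of freedom.
    """
    total = pn_obs + ps_obs
    pn_exp = total * expected_ratio / (1.0 + expected_ratio)
    ps_exp = total - pn_exp
    chi2 = (pn_obs - pn_exp) ** 2 / pn_exp + (ps_obs - ps_exp) ** 2 / ps_exp
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

With hypothetical counts of 150 non-synonymous and 100 synonymous SNPs and an expected ratio of 2.4, the normalized PN/PS is 0.625 and the test rejects neutrality, whereas counts of 240 and 100 match the neutral expectation.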
We found that conserved translated ORFs, both in codRNAs and lncRNAs, had PN/PS values significantly lower than the neutral expectation (Fig 3c, chi-square test p-value < 10−5). In lncRNAs, there was an approximately 40% depletion of non-synonymous SNPs over the expected value, strongly suggesting that a sizable fraction of the lncRNAs in this group are in fact protein-coding genes that produce functional small proteins or micropeptides (smORFs). The computational identification of smORFs is especially challenging because they can randomly occur in any part of the genome [48]. Therefore, it is not surprising that some remain hidden in the vast ocean of transcripts annotated as non-coding. For instance, the recently discovered peptide Myoregulin, which is only 46 amino acids long, regulates muscle performance [49]. Another example is NoBody, which encodes a protein just 68 amino acids long and has recently been shown to interact with the mRNA decapping complex [50]. NoBody was annotated as non-coding when we initiated the study, and Myoregulin was annotated with a different non-canonical ORF, although their annotations are now fully consistent with our findings. Other examples of conserved smORFs in our set were Stannin, a mediator of neuronal cell apoptosis conserved across metazoans [51,52], and Apela, a peptide ligand that acts as an embryonic regulator and increases cardiac contractility in mouse [53,54]. The distribution of the Ribo-Seq reads in these examples is shown in Figure 4.
The group of ORFs which were not conserved across species but had high coding scores (NC-H) showed weaker purifying selection than conserved genes; however, PN/PS was significantly lower than the neutral expectation (Fig 3c, chi-square p-value = 1.6 x 10−5 for codRNAs and p-value = 0.0026 for lncRNAs). Despite the lack of detectable homologues in rat and human, this finding indicates some of the proteins in this group are probably functional.
In contrast, the normalized PN/PS in the rest of the non-conserved ORFs (NC-L) was not different from 1 and therefore consistent with neutral evolution (Fig 3c). This result was equivalent to that found for randomly selected ORFs from introns (S4 Table). Despite the lack of evidence of selection, these ORFs showed strong three-nucleotide read periodicity and uniformity (Fig 4, Fig 5a), indicating bona fide translation. The lack of selection signatures was evident both in transcripts annotated as coding RNAs and as lncRNAs. Although these cases represented a very small minority of the protein-coding genes (∼1%), they were a much larger fraction of the lncRNAs (∼40%).
The above analyses grouped the sequences into classes before computing the PN/PS ratio. In general, ORF-by-ORF analysis was not possible because the ORFs were small and contained too few SNPs. Nevertheless, a small fraction of the ORFs in lncRNAs contained 10 or more SNPs, and we computed a normalized PN/PS ratio for these cases. The results were very much in line with those obtained with the complete sequence sets and supported our previous conclusions (Fig 3d).
The group of lncRNAs producing proteins with no selection signatures included several RNAs with known non-coding functions, such as Malat1, Neat1, Jpx, and Cyrano. These genes are involved in several cellular processes: Cyrano in the regulation of embryogenesis [55], Jpx in X chromosome inactivation [56], Neat1 in the maintenance and assembly of paraspeckles [57], and Malat1 in regulating the expression of other genes [58]. Many other translated ORFs were located in transcripts with no known function. Two examples are shown in Figure 4. Due to the absence of selection signatures, one must conclude that the translation of these transcripts is probably due to promiscuous activity of the ribosome machinery. This may lead to the production of thousands of novel non-functional small proteins in different cell types and tissues.
What drives the translation of lncRNAs?
Translated ORFs in lncRNAs lacking conservation in rat and/or human and with no evidence of selection (NC-L) comprised 369 genes translating 472 different ORFs. These genes were not coding in the usual sense: we observed no signatures of selection, yet they produced proteins. The ORFs in the NC-L group of genes showed the characteristic three-nucleotide periodicity of actively translated regions (Fig 5a). In addition, the ORF frame bias was highly reproducible across tissues and the correlation coefficient was similar to that computed for conserved, well-established codRNAs (Fig 5b and S5-S7 Fig).
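The frame bias underlying three-nucleotide periodicity can be summarized as the fraction of P-sites falling in each codon position of an ORF. The sketch below is a simplified illustration and not the RibORF classifier, which uses its own scoring. It assumes that P-site positions have already been inferred from the reads and are given as 0-based transcript coordinates; the function name is hypothetical.

```python
from collections import Counter


def frame_fractions(psite_positions, orf_start, orf_end):
    """Fraction of P-sites in frames 0, 1 and 2 relative to the ORF start.

    Only P-sites inside [orf_start, orf_end) are counted. An actively
    translated ORF is expected to show a strong excess in one frame.
    """
    counts = Counter(
        (p - orf_start) % 3
        for p in psite_positions
        if orf_start <= p < orf_end
    )
    total = sum(counts.values())
    if total == 0:
        return (0.0, 0.0, 0.0)
    return tuple(counts[frame] / total for frame in range(3))
```

Comparing these per-ORF fractions between samples gives one simple way to ask whether the frame bias of a given ORF is reproducible across tissues.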
Why was translation detected in these lncRNAs but not in others? The lack of selection signatures at the protein level argued against their being mis-annotated classical protein-coding genes. We inspected the translation initiation sequence context but did not detect any significant differences between translated and non-translated ORFs. We then hypothesized that the ORF coding score could affect the “translatability” of the transcript, because codons that are abundant in coding sequences are expected to be more efficiently translated than rarer codons. Consistent with this hypothesis, we found that the translated ORFs in this group exhibited higher coding scores than the ORFs in non-translated genes (Fig 5c, Wilcoxon test, p-value < 10−5). Importantly, we obtained a similar result after controlling for gene expression level (Fig 5d, Wilcoxon test, p-value < 10−5). This is consistent with codon composition having an effect per se on ORF translation. We also detected significant differences in the expression level of translated and non-translated ORFs when controlling for coding score (Fig 5e). This may reflect better capacity to detect translation in the case of highly expressed transcripts. In contrast, although translated ORFs tend to be longer than non-translated ORFs (Fig 2c), ORF length appeared to have no effect per se on translatability (Fig 5e).
Discussion
Several studies have reported that many lncRNAs translate small proteins [13,16–18,22]. Each study detected hundreds or even thousands of lncRNAs with patterns consistent with translation. Varying criteria have been used to differentiate active translation from other signals, including three-nucleotide periodicity of the Ribo-Seq reads, high translational efficiency values (number of Ribo-Seq reads with respect to transcript abundance), and signatures of ribosome release after the STOP codon. As lncRNAs are, in general, expressed at low levels, the stringency of the method, as well as the sequencing depth, can be expected to strongly impact the number of translated lncRNAs identified.
The recent discovery that a large number of lncRNAs show ribosome profiling patterns consistent with translation has puzzled the scientific community [59]. Most lncRNAs are not conserved across mammals or vertebrates, which limits the use of substitution-based methods to infer selection. Methods based on the number of non-synonymous and synonymous nucleotide polymorphisms (PN and PS, respectively) detect selection at the population level and can be applied to both conserved and non-conserved ORFs. This analysis is well-suited for pre-defined sets of ORFs; individual coding sequences in mammals do not always contain enough polymorphisms to test for selection [60]. In a previous study using ribosome profiling experiments from several species, we found that, in general, ORFs with evidence of translation in lncRNAs have weak but significant purifying selection signatures [24]. Together with previous observations that lncRNAs tend to be lineage-specific [10] and that young proteins evolve under relaxed purifying selection [61], this finding led us to hypothesize that lncRNAs are enriched in young protein-coding genes.
The present study employed recently generated mouse, rat and human deep transcriptome sequencing, together with extensive mouse variation data and codon usage bias to investigate the patterns of selection in translated ORFs from lncRNAs. LncRNAs conserved across species are more likely to be functional than those which are not conserved. This is supported by studies measuring the sequence constraints of lncRNAs with different degrees of phylogenetic conservation [38,62]. Here we found that about 5% of the lncRNAs in databases may encode conserved functional micropeptides (smORFs). Standard proteomics techniques have important limitations for the detection of micropeptides and there is evidence that the smORFs currently annotated in databases are only a small part of the complete set [63–66]. As shown here, and in other recent studies [13,15], computational prediction of ORFs coupled with ribosome profiling is a promising new avenue to unveil many of these peptides. In our study, the majority of transcripts encoding micropeptides were not annotated as coding, emphasizing the power of using whole transcriptome analysis instead of only annotated genes to characterize the so-called smORFome. Analysis of other tissues, and case-by-case experimental validation, will no doubt lead to a sustained increase in the number of micropeptides with characterized functions.
Aside from lncRNAs which translate functional microproteins, the present study identified another large class of lncRNAs that appears to evolve neutrally and thus to translate proteins that do not perform any useful function. These ORFs can be distinguished from the rest because they were not conserved across species and did not exhibit high coding scores. As the test of neutrality was applied to the complete group, it remains possible that a few of the ORFs were under selection, but this is likely to be a very small number. An interesting observation is that the lack of selection signatures was not only observed in lncRNAs but also in coding RNAs that share the same characteristics. This blurs any differences between the two classes of genes when we focus on genes showing limited phylogenetic conservation. Overall, we detected 1,333 proteins that appeared to be translated but showed no signs of selection. This could be a gross underestimate, considering that many cell types and tissues have not yet been sampled.
Although the existence of non-functional proteins may seem counterintuitive at first, we must consider that most of these transcripts (lncRNAs and non-conserved codRNAs) tend to be expressed at low levels and so the associated energy costs of this activity may be negligible. This is in agreement with recent estimates that the cost of transcription, and even translation, in multicellular organisms is probably too small to overcome genetic drift [67]. In other words, provided the peptides are not toxic, the negative selection coefficient associated with the cost of producing them may be too low for natural selection to effectively remove them. We observed that the translation patterns of many of these peptides were similar across tissues, indicating that their translation is relatively stable and reproducible. The “neutral” translation of lncRNAs provides an answer for the conundrum of why transcripts that have been considered to be non-coding appear to be coding when viewed through the lens of ribosome profiling.
According to our results, the “neutral” translation of certain lncRNAs, but not others, may be due to the chance existence of ORFs with a more favorable codon composition. This is consistent with the observation that abundant codons enhance translation elongation [68], whereas rare codons might affect the stability of the mRNA and activate decay pathways [69]. Other researchers have hypothesized that the distinction between translated and non-translated lncRNAs may be related to the relative amount of the lncRNA in the nucleus and the cytoplasm [16]. However, we found evidence that some lncRNAs with nuclear functions, such as Malat1 and Neat1, are translated, suggesting that the cytosolic fraction of any lncRNA may be translated independently of the role or preferred location of the transcript.
In the absence of experimental evidence, the codon composition of an ORF can provide a first indication of whether the ORF will be translated or not. Differences in codon frequencies between genes reflect the specific amino acid abundance as well as the codon usage bias, and are influenced both by selection and drift [70,71]. Algorithms to predict coding sequences often use dicodon instead of codon frequencies, as the former also capture dependencies between adjacent amino acids or nucleotide triplets. We found that ORFs with very low coding scores are in general not translated. One example of this sort was the previously described de novo non-coding gene Poldi [72], which lacked any evidence of translation in the data we analyzed. The group of ORFs that had high coding scores, but lacked conservation in human or rat transcripts, had weak but significant purifying selection signatures. Although there may be different reasons why we did not detect any homologues, such as rapid evolution linked to very short protein size, or loss of the gene in different lineages, this set is probably enriched in genes that have recently evolved de novo [73,74]. For this type of gene, the annotation as coding or non-coding appears to be largely irrelevant, as the two types of genes displayed very similar features in all the analyses performed.
A growing number of protein-coding genes have been reported to have originated de novo from previously non-functional genomic regions [75–83]. These genes encode proteins with unique new sequences, which may have important roles in lineage-specific adaptations. The encoded proteins are usually small and disordered, and have been hypothesized to become longer and more complex over time [79,84,85]. Interestingly, there is recent evidence that many of these genes may have originated from lncRNAs [29,86,87]. This is also consistent with the large number of species-specific transcripts with lncRNA features identified in comparative transcriptomics studies [29,88–90]. The discovery that some non-coding RNAs are translated makes the transition from non-coding/non-functional to coding/functional more plausible, as deleterious polypeptides can be purged by selection [91], and the remaining ones tested for new functions. However, the observation that lncRNAs are translated is by itself inconclusive, as one could also argue that translated lncRNAs are simply mis-annotated functional protein-coding genes. Here we have shown that, for the bulk of translated lncRNAs, this is not the case, because many of the peptides do not show signatures of purifying selection. We propose that the evolutionarily neutral translation of lncRNAs represents the missing link between transcribed genomic regions with no coding function and the eventual birth of proteins with new functions.
In conclusion, our data support the use of ribosome profiling and conservation analysis to uncover putative new functional micropeptides. We also observed that many lncRNAs produce small proteins that lack a function; these peptides can serve as raw material for the evolution of new protein-coding genes. We found that the translated ORFs in these lncRNAs are enriched in coding-like hexamers when compared to non-translated or intronic ORFs, which implies that the sequences available for the formation of new proteins are not random but may have coding-like features from the start.
Methods
Transcriptome assembly
The polyA+ RNA-Seq from mouse comprised 18 strand-specific paired-end datasets publicly available in the Gene Expression Omnibus under accession numbers GSE69241 [29], GSE43721 [92], and GSE43520 [10]. Data corresponded to 5 brain, 2 liver, 1 heart, 3 testis, 3 ovary and 4 placenta samples. The polyA+ RNA-Seq from human comprised 8 strand-specific paired-end datasets publicly available in the Gene Expression Omnibus under accession number GSE69241 [29]. Data corresponded to 2 brain, 2 liver, 2 heart and 2 testis samples.
RNA-Seq sequencing reads were filtered by length (> 25 nucleotides) and by quality using Condetri (v.2.2) with the following settings: -hq=30 -lq=10. We retrieved genome sequences and gene annotations from Ensembl v. 75. We aligned the reads to the corresponding reference species genome with Tophat (v. 2.0.8, -N 3, -a 5 and -m 1) [93]. Multiple mapping to several locations in the genome was allowed unless otherwise stated.
We assembled the transcriptome with Stringtie [94], merging the reads from all the samples, with parameters -f 0.01 and -M 0.2. We used the species transcriptome as a guide (Ensembl v.75), including all annotated isoforms, but permitting the assembly of annotated and novel isoforms and genes (antisense, intergenic and intronic) as well. We complemented our human and mouse transcript assemblies with an additional rat transcript assembly generated in a parallel study [95]. The latter assembly was derived from RNA-seq data from 11 tissues: adrenal gland, brain, heart, kidney, liver, lung, muscle, spleen, testis, thymus, and uterus.
In mouse, we selected genes with a minimum size of 300 nucleotides and with a per-nucleotide read coverage ≥ 5 in at least one sample. This ensures a high degree of transcript completeness, as shown previously [29]. The resulting transcriptome comprised 16,679 protein-coding genes (average of 6.64 isoforms/gene); 2,580 long non-coding RNAs (average of 2.35 isoforms/gene) defined as assembled genes that overlapped annotated genes that were not annotated as protein-coding; 3,912 novel non-annotated genes (average of 1.07 isoforms/gene), and 3,467 genes overlapping pseudogenes.
Ribosome profiling data
We used 8 different data sets that included both strand-specific ribosome profiling (Ribo-Seq) and RNA-seq data, which we obtained from the Gene Expression Omnibus under accession numbers GSE51424 [31], GSE50983 [32], GSE22001 [33], GSE62134 [34], GSE72064 [35], and GSE41246. Data corresponded to brain, testis, neutrophils, splenic B cells, neural embryonic stem cells, hippocampus, heart and skeletal muscle (Table 1). Only data sets corresponding to normal samples were considered. Any replicates were merged before the analyses. For all analyses we considered only genes expressed at significant levels in at least one sample (fragments per kilobase per million mapped reads (FPKM) > 0.2).
Ribo-Seq data sets were depleted of anomalous reads (length < 26 or > 33 nt) and of small RNAs by discarding reads that mapped to annotated rRNAs and tRNAs in mouse. Next, reads were mapped to the assembled mouse genome (mm10) with Bowtie, allowing read multimapping (v. 0.12.7, -k 1 -m 20 -n 1 --best --strata --norc) and controlling for strand information.
We used the mapping of the Ribo-Seq reads to the complete set of annotated coding sequences in mouse to compute the position of the P-site, corresponding to the tRNA binding-site in the ribosome complex, for reads of different length, as in other studies [11,13,16,17]. If no P-site offset was clear for a specific length, reads with that length were not considered for subsequent analysis. Considering that the ORFs had to be extensively covered by reads to be considered translated (high uniformity), we decided to include multiple mapped reads so as not to compromise the detection of paralogous proteins (S8 Fig).
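The P-site calibration described above can be illustrated with a minimal sketch. It is not the procedure used in the cited studies: the function name `psite_offsets`, the 9 to 15 nt search window, the 0.5 dominance threshold, and the single linear coordinate system are all assumptions made for illustration. For each read length it tallies the distance between the read 5' end and the first nucleotide of an annotated start codon, and it keeps a length only when one offset clearly dominates.

```python
from collections import Counter, defaultdict

def psite_offsets(reads, start_codons, min_fraction=0.5, window=(9, 16)):
    """Estimate a P-site offset per read length (illustrative heuristic).

    reads: iterable of (five_prime_pos, read_length) in one coordinate system.
    start_codons: set of positions of the A in annotated ATG start codons.
    A read length is kept only if its most frequent offset accounts for at
    least min_fraction of the reads that span a start codon (assumed cutoff).
    """
    tallies = defaultdict(Counter)
    for pos, length in reads:
        for offset in range(window[0], window[1]):
            if pos + offset in start_codons:
                tallies[length][offset] += 1

    offsets = {}
    for length, counts in tallies.items():
        best_offset, n_best = counts.most_common(1)[0]
        if n_best / sum(counts.values()) >= min_fraction:
            offsets[length] = best_offset  # clear offset: keep this length
    return offsets

# Toy example: 28-nt reads pile up 12 nt upstream of the start codon,
# whereas 30-nt reads show no dominant offset and are dropped.
reads = [(88, 28)] * 8 + [(87, 28)] * 2 + [(88, 30), (87, 30), (86, 30), (85, 30)]
offsets = psite_offsets(reads, start_codons={100})
```

Read lengths missing from the returned dictionary correspond to the lengths that would be excluded from subsequent analysis.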
Detection of translated ORFs
We predicted all possible ORFs (ATG to TGA/TAA/TAG) with a minimum length of 30 nucleotides (9 amino acids) in transcripts expressed at FPKM > 0.2 in any sample. Next, we ran RibORF (v.0.1) [16] to obtain a set of translated ORFs per sample. This program is a support vector machine classifier and we used a score threshold of 0.7 to classify an ORF as translated, as in the original study. This cutoff is considered to be very stringent, with a false positive rate of 0.67% and a false negative rate of 2.5% [16]. With the 0.7 threshold, no translated ORFs were found in annotated small RNAs, providing additional support for our approach. We only considered ORFs with 10 or more mapped reads; the rest were classified as non-translated. The ORFs classified as translated by the program showed high three-nucleotide read periodicity and uniformity when compared to ORFs classified as non-translated (Figure 1b).
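As an illustration of the ORF prediction step, the sketch below enumerates every ATG-to-stop ORF on the forward strand of a transcript. The function name `find_orfs` is hypothetical, and the sketch assumes that the 30-nucleotide minimum includes the stop codon (which matches the 9 amino acids stated above). It does not reproduce RibORF, which is used afterwards to decide which of these candidate ORFs are translated.

```python
STOP_CODONS = {"TGA", "TAA", "TAG"}

def find_orfs(seq, min_len=30):
    """Return sorted (start, end) pairs (0-based, end exclusive) for all ORFs.

    Every in-frame ATG is paired with the next in-frame stop codon, so nested
    ORFs that share a stop codon are all reported. min_len is in nucleotides
    and is assumed to include the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        starts = []  # ATG positions seen since the last stop in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG":
                starts.append(i)
            elif codon in STOP_CODONS:
                for start in starts:
                    if i + 3 - start >= min_len:
                        orfs.append((start, i + 3))
                starts = []
    return sorted(orfs)

# A 30-nt ORF (ATG + 8 codons + stop) is kept; a 27-nt ORF is discarded.
kept = find_orfs("CC" + "ATG" + "GCT" * 8 + "TAA" + "G")
dropped = find_orfs("ATG" + "GCT" * 7 + "TAA")
nested = find_orfs("ATG" + "GCT" * 4 + "ATG" + "GCT" * 8 + "TAG")
```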
For every gene, we selected all the ORFs that were translated in any of the samples and merged overlapping ORFs in clusters represented by the longest ORF in the group, for conservation and coding score analyses. If any of the ORFs were found upstream or downstream of another longer ORF in an annotated protein-coding transcript, we defined them as upstream ORF (uORF) or downstream ORF (dORF). If a gene was not translated, we selected the longest ORF across all transcripts for comparative purposes. In translated ORFs, the ORF with the highest number of mapped Ribo-Seq reads was usually the longest ORF (75.7% for codRNAs and 84% for lncRNAs).
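The clustering of overlapping ORFs can be sketched as follows. This assumes that two ORFs overlap whenever their coordinates on the transcript intersect, regardless of reading frame, and that clusters are built by chaining such overlaps; the source does not state either detail, and `merge_overlapping_orfs` is a hypothetical name.

```python
def merge_overlapping_orfs(orfs):
    """Collapse overlapping ORFs into clusters and return the longest of each.

    orfs: list of (start, end) pairs on the same transcript, end exclusive.
    Overlap is defined on coordinates only (an assumption of this sketch).
    """
    representatives = []
    cluster_end = None
    longest = None
    for start, end in sorted(orfs):
        if cluster_end is None or start >= cluster_end:
            # No overlap with the current cluster: close it and open a new one.
            if longest is not None:
                representatives.append(longest)
            longest = (start, end)
            cluster_end = end
        else:
            cluster_end = max(cluster_end, end)
            if end - start > longest[1] - longest[0]:
                longest = (start, end)
    if longest is not None:
        representatives.append(longest)
    return representatives

clusters = merge_overlapping_orfs([(0, 45), (15, 45), (100, 160), (150, 300)])
```

In this toy input the two nested ORFs are represented by (0, 45), and the second cluster by its longer member (150, 300).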
We differentiated between genes with small ORFs (smORFs) and those with longer ORFs. In the first class, the longest ORF in the gene encoded a protein of less than 100 amino acids. We did not consider genes overlapping annotated pseudogenes and excluded smORFs in lncRNAs that showed significant sequence similarity to known protein-coding sequences, since they might be pseudogenized regions.
Sequence conservation
We searched for homologues of the mouse ORFs in the human and rat transcript assemblies using TBLASTN (limited to one strand, e-value < 10⁻⁴) [96]. The aim was to define a set of proteins which were conserved in human, rat, or both, and a set of non-conserved proteins for which homologues in the transcriptomes of these species could not be identified. An additional requirement to classify a protein as conserved was that the alignment covered at least 50 amino acids or 75% of the total ORF length. The smallest conserved protein was 19 amino acids long. Among the non-conserved ORFs we only considered proteins of 24 amino acids or longer, as homologues of shorter proteins may be difficult to detect even if they exist. For simplicity, in the analysis of the signatures of purifying selection we also discarded a small number of non-conserved ORFs that were in the same transcript as conserved ORFs (uORFs and dORFs were not taken into account here).
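The conservation rule can be written as a small predicate. This is a sketch of the stated thresholds only; the function and parameter names are hypothetical, and the parsing of TBLASTN output is left out.

```python
def is_conserved(aligned_aa, orf_aa_len, evalue,
                 max_evalue=1e-4, min_aligned=50, min_fraction=0.75):
    """Apply the e-value and alignment-coverage thresholds to one TBLASTN hit.

    aligned_aa: number of amino acids of the mouse protein covered by the hit.
    orf_aa_len: total length of the mouse protein in amino acids.
    """
    if evalue >= max_evalue:
        return False
    return aligned_aa >= min_aligned or aligned_aa >= min_fraction * orf_aa_len

def is_analysed_non_conserved(orf_aa_len, hits, min_len=24):
    """Non-conserved ORFs are analysed only if they encode >= min_len residues.

    hits: list of (aligned_aa, evalue) pairs for the same ORF.
    """
    conserved = any(is_conserved(a, orf_aa_len, e) for a, e in hits)
    return (not conserved) and orf_aa_len >= min_len
```

For example, a 19-residue protein aligned over its full length passes through the 75% criterion, whereas a 30-residue alignment within a 120-residue protein does not satisfy either criterion.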
Single nucleotide polymorphism data
Single nucleotide polymorphism (SNP) data were obtained from Harr et al. [43], and included complete genotyping information from 20 individuals of the house mouse subspecies Mus musculus castaneus. We classified SNPs in ORFs as non-synonymous (PN, amino acid altering) and synonymous (PS, not amino-acid altering). We discarded any nucleotide variants that were fixed in the population used. We calculated the PN/PS ratio in each ORF group by using the sum of PN and PS in all the sequences ((PN/PS)obs). In general, estimation of PN/PS ratios of individual sequences was not reliable due to lack of a sufficiently large number of SNPs per ORF; we only performed this calculation in cases with at least 10 SNPs in the ORF.
We calculated the expected PN/PS under neutrality ((PN/PS)exp) using the mutation frequencies between pairs of nucleotides in Mus musculus castaneus and the codon composition of the different sequences or sets of sequences under study. The mutation frequencies were estimated from SNPs in introns from the same population of mice (S1 Table). The transition to transversion ratio was 4.42, very similar to the 4.26 value obtained in early observations based on mouse-rat divergence data [97]. As a test of neutrality on the coding sequence, we used a chi-square test with one degree of freedom that compares the observed and expected PN and PS values in the sequences of interest. In the absence of selection we expect (PN/PS)obs/(PN/PS)exp to be approximately 1. Under purifying selection, this ratio will be lower than one. Positively selected mutations are rapidly fixed in the population and their effect is expected to be negligible when using SNP data.
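The classification of SNPs and the pooled counts can be sketched as below. The sketch assumes the standard genetic code, SNP positions given as 0-based offsets within the ORF, and SNPs that create stop codons counted as non-synonymous; the source does not describe how such variants were handled, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_snp(orf_seq, pos, alt):
    """Return 'N' (non-synonymous) or 'S' (synonymous) for one SNP.

    pos is the 0-based position of the SNP within the ORF; alt is the
    alternative nucleotide.
    """
    codon_start = pos - pos % 3
    ref_codon = orf_seq[codon_start:codon_start + 3]
    k = pos % 3
    alt_codon = ref_codon[:k] + alt + ref_codon[k + 1:]
    return "S" if CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon] else "N"

def pooled_pn_ps(orfs_with_snps):
    """Sum PN and PS over a group of ORFs.

    orfs_with_snps: list of (orf_seq, [(pos, alt), ...]) pairs.
    """
    pn = ps = 0
    for orf_seq, snps in orfs_with_snps:
        for pos, alt in snps:
            if classify_snp(orf_seq, pos, alt) == "N":
                pn += 1
            else:
                ps += 1
    return pn, ps

# GCT -> GCC keeps alanine (synonymous); GCT -> ACT gives threonine.
counts = pooled_pn_ps([("ATGGCTTAA", [(5, "C"), (3, "A")])])
```

Pooling the counts before taking the ratio reflects the point made above that per-ORF ratios are unreliable when an ORF contains few SNPs.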
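A minimal sketch of the neutral expectation and the chi-square test follows. It is not the deposited code: the function names are hypothetical, the mutation frequencies are passed in as a dictionary keyed by (reference, alternative) nucleotide pairs, stop codons in the sequence are skipped, and changes that create a stop codon are counted as non-synonymous. These choices are assumptions of the sketch. The p-value uses the identity that, for one degree of freedom, the chi-square survival function equals erfc(sqrt(x/2)).

```python
import math
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def expected_pn_ps(seqs, mut_freq):
    """Expected PN/PS ratio given codon composition and mutation frequencies.

    mut_freq[(ref, alt)] is the relative frequency of the ref -> alt change,
    e.g. estimated from intronic SNPs.
    """
    exp_n = exp_s = 0.0
    for seq in seqs:
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if CODON_TABLE[codon] == "*":
                continue  # skip stop codons (assumption)
            for k, ref in enumerate(codon):
                for alt in BASES:
                    if alt == ref:
                        continue
                    mutated = codon[:k] + alt + codon[k + 1:]
                    weight = mut_freq[(ref, alt)]
                    if CODON_TABLE[mutated] == CODON_TABLE[codon]:
                        exp_s += weight
                    else:
                        exp_n += weight
    return exp_n / exp_s

def neutrality_test(pn, ps, ratio_exp):
    """Chi-square test (1 d.f.) of observed PN, PS against the neutral ratio."""
    total = pn + ps
    exp_n = total * ratio_exp / (1 + ratio_exp)
    exp_s = total - exp_n
    chi2 = (pn - exp_n) ** 2 / exp_n + (ps - exp_s) ** 2 / exp_s
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# With uniform mutation frequencies, an alanine codon (GCT) has 6 possible
# non-synonymous and 3 possible synonymous changes, so the ratio is 2.
uniform = {(r, a): 1.0 for r in BASES for a in BASES if r != a}
ratio = expected_pn_ps(["GCT"], uniform)
chi2, p_value = neutrality_test(10, 10, ratio)
```

In the toy example, observing 10 non-synonymous and 10 synonymous SNPs against an expected ratio of 2 gives an observed/expected ratio of 0.5, the direction expected under purifying selection, although with so few SNPs the test is not significant.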
Analysis of proteomics data
We used the proteomics database PRIDE [41] to search for peptide matches in the proteins encoded by various gene sets. For a protein to have proteomics evidence, we required at least two distinct peptides with perfect matches that did not map to any other protein in the dataset, even when allowing for up to two mismatches. These are very stringent conditions, for which a false positive rate lower than 0.2% has been previously estimated [95].
Computation of coding scores with CIPHER
For each hexanucleotide (hexamer), we calculated the relative frequency of the hexamer in the complete set of mouse annotated coding sequences encoding experimentally validated proteins and in the ORFs of a large set of randomly selected intronic sequences [24]. Hexamer frequencies were calculated in frame, using a sliding window and 3 nucleotide steps. Subsequently, we obtained the logarithm of each hexamer frequency in coding sequences divided by the frequency in non-coding sequences. This log likelihood ratio was calculated for each possible hexamer i and termed CShexamer(i). The coding score of an ORF (CSORF) was defined as the average of the hexamer coding scores in the ORF.
The following equations were employed:
$$CS_{hexamer}(i) = \log\left(\frac{freq_{coding}(hexamer(i))}{freq_{non\text{-}coding}(hexamer(i))}\right)$$
$$CS_{ORF} = \frac{\sum_{i=1}^{n} CS_{hexamer}(i)}{n}$$
where n is the number of in-frame hexamers in the ORF.
We have developed a computational tool, CIPHER, that uses this metric to calculate the coding score of the ORFs in any set of sequences. It also predicts ORFs with a high likelihood to be translated by using an empirical calculation of p-values derived from the distribution of coding scores in ORFs from introns. Specific parameters have been derived for several eukaryotic species. The code and executable file are freely available at https://github.com/jorruior/CIPHER. The program can also be accessed at http://evolutionarygenomics.imim.es/cipher/.
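The hexamer-based coding score can be illustrated with a short sketch. It is not the CIPHER implementation: the function names are hypothetical, the natural logarithm is assumed because the base is not stated, and hexamers absent from either training set are simply skipped rather than smoothed with pseudocounts.

```python
import math
from collections import Counter

def inframe_hexamers(seq):
    """In-frame hexamers read with a sliding window and 3-nucleotide steps."""
    return [seq[i:i + 6] for i in range(0, len(seq) - 5, 3)]

def hexamer_scores(coding_seqs, noncoding_seqs):
    """Log ratio of relative hexamer frequencies, coding over non-coding."""
    coding = Counter(h for s in coding_seqs for h in inframe_hexamers(s))
    noncoding = Counter(h for s in noncoding_seqs for h in inframe_hexamers(s))
    n_coding = sum(coding.values())
    n_noncoding = sum(noncoding.values())
    return {
        h: math.log((coding[h] / n_coding) / (noncoding[h] / n_noncoding))
        for h in coding
        if h in noncoding  # unseen hexamers are skipped (assumption)
    }

def coding_score(orf_seq, scores):
    """Average hexamer score over the in-frame hexamers of an ORF."""
    values = [scores[h] for h in inframe_hexamers(orf_seq) if h in scores]
    return sum(values) / len(values) if values else 0.0

# Toy training sets: AAAAAA is twice as frequent in the "coding" set,
# CCCCCC twice as frequent in the "non-coding" set.
scores = hexamer_scores(["AAAAAAAAACCCCCC"], ["AAAAAACCCCCCCCC"])
high = coding_score("AAAAAAAAA", scores)
balanced = coding_score("AAAAAACCCCCC", scores)
```

Positive scores indicate a hexamer composition closer to that of coding sequences, and negative scores a composition closer to that of intronic ORFs.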
Using this metric, we divided the set of non-conserved genes into a group of genes with high coding score (NC-H) and a group of genes with low coding score (NC-L). Genes in the NC-H group were defined as those with a coding score over the median value of conserved coding sequences (> 0.079, with CIPHER significant at p-value < 0.025).
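The split into NC-H and NC-L, together with an empirical p-value against intronic ORFs, could look like the sketch below. The 0.079 threshold comes from the text; the function names are hypothetical, and defining the empirical p-value as the fraction of intronic ORF scores at or above the observed score is an assumption, since the exact calculation used by CIPHER is not described here.

```python
def empirical_p_value(score, intron_scores):
    """Fraction of intronic ORF coding scores that are >= the given score."""
    n_as_high = sum(1 for s in intron_scores if s >= score)
    return n_as_high / len(intron_scores)

def classify_non_conserved(score, threshold=0.079):
    """Assign a non-conserved gene to NC-H (score above threshold) or NC-L."""
    return "NC-H" if score > threshold else "NC-L"

# Toy background distribution of intronic ORF scores.
p = empirical_p_value(0.1, [-0.2, -0.1, 0.0, 0.2])
```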
Statistical data analyses
The generation of plots and statistical tests was performed with the R package [98].
DATA AVAILABILITY
Supplemental file 1 contains supplementary Tables and Figures. Supplemental file 2 contains detailed information on the translated ORFs identified in this study. Transcript assemblies, open reading frames (ORFs), and code to calculate the PN/PS expected under neutrality in mouse sequences have been deposited at figshare (http://dx.doi.org/10.6084/m9.figshare.4702375).
ACKNOWLEDGEMENTS
We thank colleagues for useful comments that helped us improve the work. We are grateful to Elaine Lilly, Ph.D., for text revision.