Abstract
Cells respond to changes in the environment by modifying the concentration of specific proteins. Paradoxically, the cellular response is usually examined by measuring variations in transcript abundance by high throughput RNA sequencing (RNA-Seq), instead of directly measuring protein concentrations. This happens because RNA-Seq-based methods provide better quantitative estimates, and more extensive gene coverage, than proteomics-based ones. However, variations in transcript abundance do not necessarily reflect changes in the corresponding protein abundance. How can we close this gap? Here we explore the use of ribosome profiling (Ribo-Seq) to perform differential gene expression analysis in a relatively well-characterized system, oxidative stress in baker’s yeast. Ribo-Seq is an RNA sequencing method that specifically targets ribosome-protected RNA fragments, and thus is expected to provide a more accurate view of changes at the protein level than classical RNA-Seq. We show that gene quantification by Ribo-Seq is indeed more highly correlated with protein abundance, as measured from mass spectrometry data, than quantification by RNA-Seq. The analysis indicates that, whereas a subset of genes involved in oxidation-reduction processes is detected by both types of data, the majority of the genes that are significant in the RNA-Seq-based analysis are not significant in the Ribo-Seq analysis, suggesting that they do not result in protein level changes. The results illustrate the advantages of Ribo-Seq over RNA-Seq for making inferences about changes in protein abundance.
Introduction
In recent years high throughput RNA sequencing (RNA-Seq) has become the method of choice for comparing gene expression changes of cells grown under different conditions (Rapaport et al., 2013). The relatively low cost of RNA-Seq, together with the availability of efficient computational methods to process information from millions of sequencing reads, has undoubtedly accelerated our understanding of gene regulation. However, a change in mRNA relative abundance does not always imply a change in the amount of the encoded protein (Schwanhäusser et al., 2011). Filling this gap in understanding is essential to discern the functional changes in the cell upon a given stimulus.
Many studies have shown that mRNA levels only partially explain protein levels in the cell (de Sousa Abreu et al., 2009; Schwanhäusser et al., 2011; Payne, 2015; Ponnala et al., 2014). In yeast, the correlation between mRNA and protein abundance is typically in the range 0.6-0.7 (de Sousa Abreu et al., 2009). In addition, the ratio between protein and mRNA levels may vary across different conditions. For instance, substantial differences in this ratio have been observed during osmotic stress in yeast (Lee et al. 2011) or after the treatment of human cells with epidermal growth factor (Tebaldi et al., 2012). This strongly suggests that measuring changes in mRNA levels may often be insufficient to identify the functional shifts taking place in the cell upon a given stimulus.
Protein quantification is often performed using whole proteome mass spectrometry-based methods (Gerber et al., 2003; Edfors et al., 2016). These methods provide a direct measurement of protein abundance but they also have limitations, especially for the detection of lowly expressed and/or small proteins (Slavoff et al., 2013). An alternative way to estimate protein levels is the sequencing of ribosome-protected mRNA fragments, or ribosome profiling (Ribo-Seq) (Ingolia et al., 2009, 2011; Aspden et al., 2014; Ruiz-Orera et al., 2014). In contrast to RNA-Seq, which measures the total amount of mRNA in the cell, Ribo-Seq only captures those mRNAs that are being actively translated. Although Ribo-Seq measures translation, which is an indirect estimate of protein abundance, it has the advantage over proteomics that virtually any mRNA can be interrogated. In addition, Ribo-Seq reads can be quantified in the same manner as RNA-Seq reads. This implies that we can use the same pipelines as for RNA-Seq to identify differentially expressed genes.
It has been proposed that alterations in the ratio between the relative number of Ribo-Seq and RNA-Seq reads mapping to a given locus, known as the translation efficiency (TE), can be used to identify putative translation activation or repression events (Ingolia, 2016). Numerous recent studies have used ribosome profiling data to study translation regulatory mechanisms (Jungfleisch et al., 2017; Yordanova et al., 2018) or to discover new translated RNA sequences (Michel et al. 2012; Aspden et al. 2014; Ingolia et al. 2014; Ruiz-Orera et al. 2014).
Here we perform differential gene expression analysis using RNA-Seq and Ribo-Seq data during oxidative stress in Saccharomyces cerevisiae, a condition that is known to trigger important regulatory changes both at the transcriptional and translational levels (Shenton et al., 2006; Gerashchenko et al., 2012). We compare the results to proteomics data obtained from the same samples. The results show that the dynamics of total mRNA and translated mRNAs are very distinct, and that most changes in the relative amount of mRNA do not appear to have any consequences at the protein level. The study opens a door for a more generalized use of Ribo-Seq data to measure changes in protein expression across conditions.
Results and Discussion
Quantification of gene expression by Ribo-Seq and RNA-Seq
We extracted ribosome-protected RNA fragments, as well as total polyadenylated RNAs, from Saccharomyces cerevisiae grown in rich medium (normal) and in H2O2-induced oxidative stress conditions (stress). We then sequenced ribosome-protected RNAs (Ribo-Seq) as well as complete polyA+ mRNAs (RNA-Seq) using a strand-specific protocol. The Ribo-Seq data corresponded to the translated mRNA fraction (translatome), whereas the RNA-Seq data corresponded to total mRNAs (transcriptome). For comparison we also estimated protein concentrations (proteome) in the two conditions by mass spectrometry (Figure 1).
After quality control of the sequencing reads we obtained 31-36 million reads for Ribo-Seq and 12-15 million reads for RNA-Seq (Supplementary Table S1). We mapped the reads to the genome and generated a table of gene counts for each of the samples. After filtering out non-expressed genes (see Methods), the table contained data for 5,419 S. cerevisiae genes. Using mass spectrometry (mass spec) we could quantify the protein products of 2,200 genes (see Methods), representing about 40% of the genes quantified by RNA-Seq.
We normalized the RNA-Seq and Ribo-Seq tables of counts by calculating counts per million (CPM) on a logarithmic scale, or log2CPM (Supplementary Figure S1). The normalized gene expression values showed a very high correlation between biological replicates, with a correlation coefficient larger than 0.99 between all pairs of Ribo-Seq or RNA-Seq replicates (Supplementary Table S2). In contrast, normalized protein abundances between pairs of proteomics replicates showed correlation coefficients between 0.83 and 0.93 (Supplementary Table S3), indicating that quantification by proteomics is less reproducible than quantification by RNA-Seq and Ribo-Seq.
Importantly, the Ribo-Seq data correlated better with the proteomics data than the RNA-Seq data did; in the first case the correlation was 0.67-0.71 and in the second one 0.46-0.62 (Figure 3). This supports the notion that Ribo-Seq provides a more accurate view of protein expression than RNA-Seq (Ingolia et al., 2009).
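The comparison with the proteomics data can only use genes quantified in both tables. The following is a minimal sketch of that matching step followed by a correlation calculation. It assumes Pearson correlation on log-scale values, since the text does not state which coefficient was used, and the gene labels and numbers are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def matched_values(expr, protein):
    """Keep only genes present in both tables and return paired lists."""
    shared = sorted(set(expr) & set(protein))
    return [expr[g] for g in shared], [protein[g] for g in shared]

# Hypothetical log-scale values, for illustration only.
ribo = {"G1": 2.0, "G2": 4.0, "G3": 6.0, "G4": 9.0}
prot = {"G1": 1.0, "G2": 2.0, "G3": 3.0}  # G4 has no protein measurement

x, y = matched_values(ribo, prot)
r = pearson(x, y)
```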
We next clustered the RNA-Seq and Ribo-Seq gene expression values using multidimensional scaling (MDS) (Borg and Groenen, 1997) (Supplementary Figure S2). Remarkably, the Ribo-Seq measurements for the two conditions (normal and stress) were more similar to each other than either of them was to the condition-matched RNA-Seq measurements, and the same was true of the RNA-Seq-based measurements. Thus, the sequencing approach employed is expected to have a strong impact on the results.
Next, we calculated the fold change (FC) gene expression difference between conditions, taking the average expression values between replicates of the same experimental condition. In agreement with the results obtained with MDS, the log2FC distribution based on the Ribo-Seq data had a lower variance than the log2FC distribution using RNA-Seq data (Figure 4). We considered the possibility that this pattern was due to the number of Ribo-Seq reads being 2-3 times larger than the number of RNA-Seq reads (Supplementary Table S1). To test for this, we subsampled the mapped reads so as to have a similar number of reads in all the RNA-Seq and Ribo-Seq samples (Supplementary Tables S4 and S5). We again observed a lower log2FC variance for Ribo-Seq than for RNA-Seq (Supplementary Figure S3), indicating that the observed variance difference has a biological origin.
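As an illustration of the fold change calculation described above, the sketch below computes a per-gene log2FC from replicate log2CPM values and the variance of the resulting distribution. Averaging replicates on the log scale and using the sample variance are assumptions made for the example, and the input values are hypothetical.

```python
import statistics

def log2_fold_changes(normal, stress):
    """normal, stress: dict gene -> list of replicate log2CPM values.

    Returns dict gene -> log2FC (stress minus normal), using the mean
    of the replicates in each condition.
    """
    return {
        gene: statistics.mean(stress[gene]) - statistics.mean(normal[gene])
        for gene in normal
        if gene in stress
    }

def log2fc_variance(log2fc):
    """Sample variance of the log2FC distribution across genes."""
    return statistics.variance(list(log2fc.values()))

# Hypothetical log2CPM values for two genes and two replicates.
normal = {"A": [5.0, 5.2], "B": [3.0, 3.0]}
stress = {"A": [6.0, 6.2], "B": [2.0, 2.0]}

fc = log2_fold_changes(normal, stress)
var = log2fc_variance(fc)
```

Comparing this variance between the RNA-Seq and Ribo-Seq tables, before and after subsampling, is one way to express the width difference between the two distributions.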
Differential gene expression analysis
We performed differential gene expression analysis, separately for Ribo-Seq and RNA-Seq data, using multivariable linear regression with the Limma package (Law et al., 2014). Limma provides a list of differentially expressed genes with the corresponding adjusted p-values. We selected genes with an adjusted p-value < 0.05 and an absolute log2FC larger than one standard deviation of the log2FC distribution; the latter corresponded to a minimum FC of 1.49 for RNA-Seq data and 1.36 for Ribo-Seq data. We used the standard deviation instead of a fixed value to accommodate the differences in the width of the log2FC distributions (Figure 4).
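The selection rule can be sketched as follows. This is not the Limma analysis itself: it only applies the two cut-offs to a table of log2FC values and adjusted p-values that is assumed to have been produced already. The use of the sample standard deviation and the example values are assumptions.

```python
import statistics

def select_de_genes(results, alpha=0.05):
    """results: list of (gene, log2fc, adjusted_p) for all tested genes.

    Returns (up, down, sd), where sd is the standard deviation of the
    log2FC distribution and 2 ** sd is the minimum linear fold change.
    """
    sd = statistics.stdev([fc for _, fc, _ in results])
    up = [g for g, fc, p in results if p < alpha and fc > sd]
    down = [g for g, fc, p in results if p < alpha and fc < -sd]
    return up, down, sd

# Hypothetical results table, for illustration only.
results = [
    ("A", 2.0, 0.01),
    ("B", -2.0, 0.01),
    ("C", 0.1, 0.01),   # significant p-value but small change
    ("D", -0.1, 0.5),
    ("E", 0.0, 0.2),
    ("F", 0.05, 0.9),
]
up, down, sd = select_de_genes(results)
min_fold_change = 2 ** sd
```

With this convention, a log2FC standard deviation of about 0.575 corresponds to the minimum FC of 1.49 reported for RNA-Seq.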
We obtained 817 up-regulated genes during oxidative stress using RNA-Seq data, compared to only 92 with Ribo-Seq data. Thus, the vast majority of the genes identified as up-regulated in stress with RNA-Seq data were not significantly up-regulated when using the Ribo-Seq data to do the same analysis. The number of down-regulated genes was 846 and 519 for RNA-Seq and Ribo-Seq, respectively. Overall, only a small fraction of the differentially expressed genes was common to both approaches (5-10%, see below).
The induction of oxidative stress by hydrogen peroxide (H2O2) results in an excess of reactive oxygen species (ROS) in the cell. This is known to activate the expression of several protein families including thioredoxins, hexokinases, and heat shock proteins (Morano et al., 2012). The set of up-regulated genes identified by both RNA-Seq and Ribo-Seq included several members of these families (e.g. HXK2, TDH1, CYC1, HSP10), consistent with transcriptional activation of genes directly involved in stress response.
Attempts to use the same pipeline to identify differentially expressed genes using the proteomics data did not yield significant results. The reproducibility of protein abundance estimates using mass spec data is not as high as the reproducibility of gene expression levels in the case of RNA sequencing data, which decreases the power of differential gene expression analysis using this kind of data (Supplementary Table S3).
Uncoupling between changes at the transcriptome and translatome levels
The correlation between RNA-Seq and Ribo-Seq gene log2FC values was quite low (0.18), supporting an important disconnect between the two kinds of data (Figure 5). We quantified the number of genes that showed a significant change in the same direction, i.e. homodirectional changes. There were 38 genes that were up-regulated during stress using both RNA-Seq and Ribo-Seq data; this is a small number but still more than double the number expected by chance (15 genes). The number of homodirectional down-regulated genes was 89, compared to 55 expected by chance. In summary, while there was a modest overlap between the stories told by RNA-Seq and Ribo-Seq data (test of proportions p-value < 1.32×10^-5), the majority of the differentially expressed genes were not concordant.
Dissecting differential regulation by functional class
To better understand the biological relevance of the above results, we investigated whether certain functional classes were significantly enriched among the sets of differentially expressed genes. We used DAVID (Huang et al., 2009) to identify significantly over-represented functional clusters (Figure 4). Only one class, ‘oxidation-reduction process’, was enriched among genes up-regulated during stress using both RNA-Seq and Ribo-Seq data. This is consistent with transcriptional activation of this set of genes upon stress, increasing the signal for both total mRNA and the translated fraction. Three other classes (‘translation’, ‘ATPase’ and ‘proteasome’) showed increased mRNA levels during stress, but this was not reflected in an increase in the translated fraction. Thus, it is likely that an important part of these transcripts is stored in a translationally inactive form during stress, for example in P-bodies or stress granules (Zid and O’Shea, 2014; Khong et al., 2017; Luo et al., 2018). In this case, an accumulation of transcripts would be detected by RNA-Seq but not by Ribo-Seq, as translation of the transcripts is impaired.
Interestingly, there were functions that only appeared when we performed differential gene expression analysis with the Ribo-Seq data: ‘cell wall’, ‘mitochondrial intermembrane space’ and ‘catalytic activity’ were enriched among up-regulated genes, whereas ‘cell cycle’ was enriched among down-regulated genes (Figure 6). As these classes are not detected by RNA-Seq, they are candidates to be regulated at the translational level only. An alternative possibility is that the storage of some transcripts in stress granules distorts the RNA-Seq patterns to such a degree that some truly up-regulated genes become undetectable with RNA-Seq; they would only be detected when examining actively translated mRNAs with Ribo-Seq.
Translation inhibition of cell cycle genes
In order to further identify possible translational regulatory events we compared the translational efficiency (TE; Ribo-Seq reads divided by RNA-Seq reads) of the different genes in the two conditions using the program RiboDiff (Zhong et al., 2017). This approach is based on the assumption that the number of Ribo-Seq reads is proportional to the amount of translated protein. We detected 470 genes that showed increased TE, and 714 genes that showed decreased TE, in oxidative stress versus normal growth conditions (adjusted p-value < 0.05; see Methods).
We reasoned that genes whose translation becomes more active during stress should have increased TE values but also be classified as up-regulated when using Ribo-Seq for differential gene expression analysis. We only found 17 genes fulfilling both conditions (3.6% of the genes with increased TE), indicating that activation of translation probably has a relatively small impact on the response to oxidative stress. In the vast majority of cases the increase in TE could be explained by a decrease in RNA-Seq signal during stress (Supplementary Table S6).
By the same token, genes whose translation is repressed during stress are expected to have decreased TE values but also be classified as down-regulated by Ribo-Seq. We found 246 such genes (34.4% of the genes with decreased TE), suggesting that this mechanism may be more prevalent. Among them there were 12 genes from the cell cycle functional category (Supplementary Table S7). The putative translational repression of these genes did not appear to be mediated by increased translation of upstream ORFs (Gerashchenko et al., 2012), as we did not detect any increase in the number of Ribo-Seq reads mapping to 5’UTR regions when compared to coding sequences in stress conditions.
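The two-condition requirement used above (a TE change plus a Ribo-Seq change in the same direction) amounts to a set intersection. A minimal sketch, with hypothetical gene labels, could look like this:

```python
def concordant_genes(te_changed, ribo_de):
    """Genes present in both sets, and the percentage of te_changed they represent.

    te_changed: genes with increased (or decreased) TE.
    ribo_de: genes up-regulated (or down-regulated) according to Ribo-Seq.
    """
    shared = set(te_changed) & set(ribo_de)
    percent = 100.0 * len(shared) / len(te_changed) if te_changed else 0.0
    return sorted(shared), percent

# Hypothetical gene sets, for illustration only.
te_decreased = {"g1", "g2", "g3", "g4"}
ribo_down = {"g2", "g4", "g9"}

shared, percent = concordant_genes(te_decreased, ribo_down)
```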
Concluding remarks
The adaptation of organisms to variations in the environmental conditions is associated with the activation or repression of the expression of particular genes. These changes are usually studied at the level of complete mRNA molecules using microarrays or next generation sequencing. However, changes in mRNA concentration do not necessarily reflect changes in their encoded protein products; rather, uncoupling between total and polysomal mRNA levels has been observed in many different conditions (Tebaldi et al., 2012; Shenton et al., 2006).
Ribo-Seq specifically targets ribosome-protected mRNAs, providing a closer view of protein expression than RNA-Seq, which captures total mRNA. Although Ribo-Seq is more labour-intensive than RNA-Seq, the protocols are being simplified and its use is rapidly growing (Reid et al. 2015; Xie et al. 2016; Liu et al. 2018; Michel et al. 2018). Here we have used Ribo-Seq data to perform differential gene expression analysis during oxidative stress, and compared the results to RNA-Seq and to proteomics data.
We have shown that gene expression levels inferred from Ribo-Seq data correlate better with protein abundance than those inferred from RNA-Seq data. Remarkably, many of the genes that are classified as differentially regulated using RNA-Seq do not show a similar effect when the Ribo-Seq data is analyzed, strongly suggesting that, for these genes, no significant changes at the protein level take place. The methodological framework we have developed here can be applied to other conditions and help advance our understanding of gene regulation.
Methods
Biological material
We grew S. cerevisiae (S288C) in 500 ml of rich medium (Tsankov et al., 2010). In order to induce oxidative stress, 30 minutes before harvesting we added diluted H2O2 to the medium for a final concentration of 1.5 mM. The cells were harvested in log growth phase (OD600 of ~0.25) via vacuum filtration and frozen with liquid nitrogen.
Ribosome profiling
In order to capture ribosome-protected mRNAs, cycloheximide was added one minute before the cells were harvested. Cycloheximide is commonly used as a protein synthesis inhibitor in order to prevent ribosome run-off and the subsequent loss of ribosome-transcript complexes. One third of each culture was used for ribosome profiling (Ribo-Seq); the rest was reserved for RNA-Seq.
Cells were lysed using the freezer/mill method (SPEX SamplePrep); after preliminary preparations, lysates were treated with RNaseI (Ambion), and subsequently with SUPERaseIn (Ambion). Monosomal fractions were collected; SDS was added to stop any possible RNase activity, then samples were flash-frozen with N2(l). Digested extracts were loaded in 7%-47% sucrose gradients. RNA was isolated from monosomal fractions using the hot acid phenol method. Ribosome-Protected Fragments (RPFs) were selected by isolating RNA fragments of 28-32 nucleotides (nt) using gel electrophoresis. The preparation of sequencing libraries for Ribo-Seq and RNA-Seq was based on a previously described protocol (Ingolia et al., 2012). Paired-end sequencing reads of 35 nucleotides (2×35 bp) were produced for Ribo-Seq and RNA-Seq on MiSeq and NextSeq platforms, respectively. The data has been deposited at NCBI Bioproject PRJNA435567 (https://www.ncbi.nlm.nih.gov/bioproject/435567).
Processing of the sequencing data
The RNA-Seq data was filtered using Trimmomatic with default parameters (version 0.36) (Bolger et al., 2014). In the Ribo-Seq data we discarded the second read of each pair, as it was redundant and of poorer quality than the first read, and then used Cutadapt (Martin, 2011) to remove the adapters and to trim five and four nucleotides from the 5’ and 3’ ends, respectively. Ribosomal RNA was depleted from the Ribo-Seq data in silico by removing all reads that mapped to annotated rRNAs. Ribo-Seq reads shorter than 25 nucleotides were not used.
After quality check and read trimming, the reads were aligned against the S. cerevisiae genome (S288C R64-2-1) using Bowtie 2 (Langmead et al., 2009). For annotation we used a previously generated S. cerevisiae transcriptome containing 6,184 annotated coding sequences plus 1,009 non-annotated assembled transcripts (see Supplementary data). SAMtools (Li et al., 2009) was used to filter out unmapped reads.
We counted the number of reads that mapped to each gene with HTSeq-count (Anders et al., 2015). We used the mode ‘intersection-strict’ to generate a table of counts from the data; the procedure removed about 5% of the reads in the case of RNA-Seq, and 8% in the case of Ribo-Seq. Only genes in which the average read count of the two replicates was larger than 10 in all conditions (normal and stress, for RNA-Seq and for Ribo-Seq) were kept. The filtered table of counts contained data for 5,419 genes.
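The expression filter described above can be sketched as follows. The nested dictionary layout and the example counts are assumptions made for the illustration; the threshold of 10 is the one stated in the text.

```python
def filter_expressed(counts, min_mean=10):
    """Keep genes whose mean replicate count exceeds min_mean in every group.

    counts: dict gene -> dict group -> list of replicate read counts,
    where a group is an (assay, condition) pair such as ("ribo", "stress").
    """
    kept = []
    for gene, groups in counts.items():
        if all(sum(reps) / len(reps) > min_mean for reps in groups.values()):
            kept.append(gene)
    return kept

# Hypothetical counts for two genes, two replicates per group.
counts = {
    "A": {("rna", "normal"): [50, 60], ("rna", "stress"): [40, 45],
          ("ribo", "normal"): [30, 20], ("ribo", "stress"): [12, 11]},
    "B": {("rna", "normal"): [50, 60], ("rna", "stress"): [40, 45],
          ("ribo", "normal"): [30, 20], ("ribo", "stress"): [8, 12]},
}
expressed = filter_expressed(counts)
```

Gene B is dropped because its mean count in one group is exactly 10, which does not exceed the threshold.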
For subsampling the number of mapped reads we used SAMtools (Li et al., 2009). We used the function ‘samtools view’ with option ‘-s 0.X’, where X is the percentage of reads that we wish to keep.
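A minimal sketch of how the subsampling fraction could be derived and passed to SAMtools is shown below. In ‘samtools view -s’, the integer part of the argument seeds the random number generator and the decimal part is the fraction of reads to keep. The file names, read totals and helper name are hypothetical, and the command is only constructed here, not executed.

```python
def subsample_command(in_bam, out_bam, total_reads, target_reads, seed=0):
    """Build a 'samtools view -s SEED.FRACTION' command as an argument list.

    The fraction is target_reads / total_reads, rounded to three decimals.
    """
    fraction = target_reads / total_reads
    digits = f"{fraction:.3f}"           # e.g. "0.333"
    if not digits.startswith("0.") or digits == "0.000":
        raise ValueError("fraction must round to a value between 0.001 and 0.999")
    seed_and_fraction = f"{seed}{digits[1:]}"   # e.g. "0.333" or "42.333"
    return ["samtools", "view", "-b", "-s", seed_and_fraction,
            "-o", out_bam, in_bam]

# Hypothetical example: reduce 36 million mapped reads to about 12 million.
cmd = subsample_command("ribo.bam", "ribo.sub.bam", 36_000_000, 12_000_000)
```

The resulting list could be passed to subprocess.run on a system where SAMtools is installed.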
Differential gene expression analysis
The table of counts was normalized to log2 Counts per Million (log2CPM), in order to account for the different number of total reads in each sample. Before performing differential gene expression analysis, we normalized the data using Trimmed Mean of M-values (TMM) as implemented in the package edgeR (Robinson et al., 2010). Finally, we applied the Limma voom method (Law et al., 2014) to identify differentially expressed genes, separately for RNA-Seq and Ribo-Seq data (adjusted p-value < 0.05 and |log2FC| > 1 SD(log2FC)).
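The log2CPM transformation can be sketched as follows. The offsets (0.5 added to each count and 1 added to the library size) are an assumption chosen to avoid taking the logarithm of zero; the TMM scaling and the voom modelling steps are not reproduced here.

```python
import math

def log2_cpm(counts, pseudocount=0.5):
    """Convert raw read counts for one sample to log2 counts per million.

    counts: dict gene -> raw read count.
    The pseudocount keeps genes with zero reads finite on the log scale.
    """
    library_size = sum(counts.values())
    return {
        gene: math.log2((c + pseudocount) / (library_size + 1.0) * 1e6)
        for gene, c in counts.items()
    }

# Hypothetical sample with a library size of 999,999 reads.
sample = {"A": 0, "B": 999_999}
normalized = log2_cpm(sample)
```

Because each sample is divided by its own library size, samples sequenced to different depths become comparable on the same scale.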
We also performed the same kind of analysis for the proteomics data. We used genes which had at least 3 unique peptides and could be quantified in all 6 replicates (1,580 genes); the procedure did not identify any significantly up or down regulated genes, using an adjusted p-value < 0.05.
Quantification of protein abundance by mass spectrometry
For our proteomics experiment, we analysed 3 replicates per condition by LC-MS/MS using a 90-min gradient in the Orbitrap Fusion Lumos. These samples were not treated with cycloheximide. As a quality control measure, BSA controls were digested in parallel and run between samples to avoid carry-over and assess the instrument performance. The peptides were searched against the SwissProt Yeast database, using the Mascot v2.5.1 search algorithm. The search was performed with the following parameters: peptide mass tolerance MS1 7 ppm and peptide mass tolerance MS2 0.5 Da; three maximum missed cleavages; trypsin digestion after K or R except KP or KR; dynamic modifications oxidation (M) and acetyl (N-term), static modification carbamidomethyl (C). Protein areas were obtained from the average area of the three most intense unique peptides per protein group. Considering the data from all 6 samples, we detected proteins from 3,336 genes. We limited our quantitative analysis to a subset of 2,200 proteins which had proteomics hits for at least 3 unique peptides; this filter eliminates noise arising from the technical challenges of quantifying lowly abundant proteins with LC-MS/MS.
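The protein area calculation can be illustrated with a short sketch. It averages the three most intense unique peptides and returns no value for proteins with fewer than three peptides, mirroring the filter used for the quantitative analysis. The function name and peptide areas are hypothetical.

```python
def protein_area(peptide_areas, top_n=3):
    """Average area of the top_n most intense unique peptides of a protein.

    Returns None when fewer than top_n peptides are available, so that
    such proteins can be excluded from quantification.
    """
    if len(peptide_areas) < top_n:
        return None
    top = sorted(peptide_areas, reverse=True)[:top_n]
    return sum(top) / top_n

# Hypothetical peptide areas for two proteins.
area_a = protein_area([10.0, 50.0, 30.0, 20.0])  # uses 50, 30 and 20
area_b = protein_area([5.0, 7.0])                # too few peptides
```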
Analysis of functional clusters
We identified significantly enriched functional clusters in differentially expressed genes using DAVID (Huang et al., 2009). The analysis was done separately for over- and under-expressed genes and for RNA-Seq and Ribo-Seq derived data. Only clusters with enrichment score ≥ 1.5 and adjusted p-value < 0.05 were retained. In each cluster we chose as representative the Gene Ontology (GO) term (Ashburner et al., 2000) with the highest number of genes inside the cluster. Figure 4 integrates the results obtained with the Ribo-Seq and the RNA-Seq data; the log10 fold enrichment of the significant GO terms is plotted.
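The cluster filtering and the choice of a representative term can be sketched as below. The input layout is hypothetical and does not correspond to the DAVID output format; the sketch also assumes that the adjusted p-value cut-off is applied to individual terms within a cluster, which the text does not specify.

```python
def representative_terms(clusters, min_score=1.5, alpha=0.05):
    """Pick one representative term per retained cluster.

    clusters: list of dicts with keys
      'score': cluster enrichment score
      'terms': list of (term_name, n_genes, adjusted_p) tuples
    A cluster is retained if its score is at least min_score and it has a
    term with adjusted_p below alpha; the representative is the significant
    term that contains the most genes.
    """
    chosen = []
    for cluster in clusters:
        if cluster["score"] < min_score:
            continue
        significant = [t for t in cluster["terms"] if t[2] < alpha]
        if not significant:
            continue
        name, _n_genes, _adj_p = max(significant, key=lambda t: t[1])
        chosen.append(name)
    return chosen

# Hypothetical clusters, for illustration only.
clusters = [
    {"score": 3.2, "terms": [("oxidation-reduction process", 40, 0.001),
                             ("oxidoreductase activity", 25, 0.004)]},
    {"score": 1.1, "terms": [("cell wall", 30, 0.01)]},       # score too low
    {"score": 2.0, "terms": [("proteasome", 12, 0.2)]},       # not significant
]
terms = representative_terms(clusters)
```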
Analysis of translational efficiency
We searched for genes with significantly increased or decreased translational efficiency (TE)(Ingolia et al., 2009) using the RiboDiff program (Zhong et al., 2017). We selected genes significant at an adjusted p-value < 0.05 and showing log2(TEstress/TEnormal) higher than 0.67 or lower than −0.67 (plus or minus the standard deviation of the distribution).
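To make the quantity being thresholded concrete, the sketch below computes log2(TEstress/TEnormal) from normalized read abundances and applies the cut-offs stated above. It is a plain ratio and does not reproduce the statistical model used by RiboDiff; the adjusted p-value is assumed to be supplied by that program, and the input values are hypothetical.

```python
import math

def log2_te_change(ribo_normal, rna_normal, ribo_stress, rna_stress):
    """log2 of the ratio between TE in stress and TE in normal conditions.

    TE is defined as Ribo-Seq abundance divided by RNA-Seq abundance.
    """
    te_normal = ribo_normal / rna_normal
    te_stress = ribo_stress / rna_stress
    return math.log2(te_stress / te_normal)

def classify_te(log2_change, adjusted_p, threshold=0.67, alpha=0.05):
    """Label a gene as 'increased', 'decreased' or 'unchanged' TE."""
    if adjusted_p >= alpha:
        return "unchanged"
    if log2_change > threshold:
        return "increased"
    if log2_change < -threshold:
        return "decreased"
    return "unchanged"

# Hypothetical gene: Ribo-Seq signal halves while RNA-Seq signal doubles.
change = log2_te_change(ribo_normal=100, rna_normal=100,
                        ribo_stress=50, rna_stress=200)
label = classify_te(change, adjusted_p=0.01)
```

The same TE decrease would also arise if only the RNA-Seq signal increased, which is why the Results additionally require a concordant change in the Ribo-Seq analysis.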
We downloaded S. cerevisiae 5’UTR sequences from the Yeast Genome Database (https://downloads.yeastgenome.org/sequence/S288C_reference/SGD_all_ORFs_5prim_e_UTRs.fsa). We selected 5’UTR sequences longer than 30 nucleotides, removed identical sequences and took the longest 5’UTR per gene when several existed. The resulting annotation file contained the genomic coordinates of the 5’UTRs of 2,424 genes. We recovered 5’UTR sequences for 5 of the 12 cell cycle-related genes that were potentially repressed at the translational level (HTL1, SPC19, CDC26, BNS1, DIB1). In none of these cases did the number of Ribo-Seq reads in the 5’UTR divided by the number of Ribo-Seq reads in the coding sequence increase in oxidative stress with respect to normal growth conditions.
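The 5’UTR selection and the read ratio comparison can be sketched as follows. The sketch assumes that removing identical sequences means keeping the first copy encountered; the record layout, gene labels and read counts are hypothetical.

```python
def select_utrs(records, min_length=30):
    """Select one 5'UTR per gene.

    records: list of (gene, sequence) pairs.
    Keeps sequences longer than min_length, skips sequences already seen,
    and retains the longest remaining sequence for each gene.
    """
    seen = set()
    best = {}
    for gene, seq in records:
        if len(seq) <= min_length or seq in seen:
            continue
        seen.add(seq)
        if gene not in best or len(seq) > len(best[gene]):
            best[gene] = seq
    return best

def utr_use_increased(normal, stress):
    """Compare 5'UTR / CDS Ribo-Seq read ratios between conditions.

    normal, stress: (utr_reads, cds_reads) tuples for one gene.
    """
    ratio_normal = normal[0] / normal[1]
    ratio_stress = stress[0] / stress[1]
    return ratio_stress > ratio_normal

# Hypothetical records, for illustration only.
records = [
    ("A", "A" * 31),
    ("A", "C" * 40),   # longer 5'UTR for the same gene
    ("B", "G" * 30),   # not longer than 30 nt
    ("C", "A" * 31),   # identical to a sequence already seen
]
utrs = select_utrs(records)
increased = utr_use_increased(normal=(10, 1000), stress=(4, 500))
```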
Acknowledgements
We acknowledge the Proteomics Unit of Center for Regulatory Genomics and Universitat Pompeu Fabra for their lab support to isolate proteins from the yeast cultures. We are also grateful to Jorge Ruiz-Orera and Robert Castelo for advice during this project. The work was funded by grants BFU2015-65235-P, BFU2015-68351-P and BFU2016-80039-R, from Ministerio de Economía e Innovación (Spanish Government) - FEDER (EU), and by grant PT17/0009/0014 from Instituto de Salud Carlos III – FEDER. We also received funding from the “Maria de Maeztu” Programme for Units of Excellence in R&D (MDM-2014-0370) and from Agència de Gestió d’Ajuts Universitaris i de Recerca Generalitat de Catalunya (AGAUR), grant numbers 2014SGR1121, 2014SGR0974 and 2017SGR01020, and a predoctoral fellowship (FI) to W.B. We also acknowledge support from the EU Erasmus Programme to T.T.
Footnotes
# Shared first co-authorship