Abstract
Transcriptional and post-transcriptional regulation shape tissue-type-specific proteomes, but their relative contributions remain contested. Estimates of the factors determining protein levels in human tissues do not distinguish between (i) the factors determining the variability between the abundances of different proteins, i.e., mean-level variability, and (ii) the factors determining the physiological variability of the same protein across different tissue types, i.e., across-tissue variability. We sought to estimate the contribution of transcript levels to these two orthogonal sources of variability, and found that mRNA levels can account for most of the mean-level variability but not necessarily for across-tissue variability. The precise quantification of the latter estimate is limited by substantial measurement noise. However, protein-to-mRNA ratios exhibit substantial across-tissue variability that is functionally concerted and reproducible across different datasets, suggesting extensive post-transcriptional regulation. These results caution against estimating protein fold-changes from mRNA fold-changes between different cell-types, and highlight the contribution of post-transcriptional regulation to shaping tissue-type-specific proteomes.
Introduction
The relative ease of measuring mRNA levels has facilitated numerous investigations of how cells regulate their gene expression across different pathological and physiological conditions (Sørlie et al, 2001; Slavov and Dawson, 2009; Spellman et al, 1998; Slavov et al, 2011, 2012; Djebali et al, 2012a). However, often the relevant biological processes depend on protein levels, and mRNA levels are merely proxies for protein levels (Alberts et al, 2014). If a gene is regulated mostly transcriptionally, its mRNA level is a good proxy for its protein level. Conversely, posttranscriptional regulation can set protein levels independently from mRNA levels, as in the cases of classical regulators of development (Kuersten and Goodwin, 2003), cell division (Hengst and Reed, 1996; Polymenis and Schmidt, 1997) and metabolism (Daran-Lapujade et al, 2007; Slavov et al, 2014). Thus understanding the relative contributions of transcriptional and post-transcriptional regulation is essential for understanding their trade-offs and the principles of biological regulation, as well as for assessing the feasibility of using mRNA levels as proxies for protein levels.
Previous studies have considered single cell-types and conditions in studying variation in absolute mRNA and protein levels genome-wide, often employing unicellular model organisms or mammalian cell cultures (Gygi et al, 1999; Smits et al, 2014; Schwanhäusser et al, 2011; Li et al, 2014; Csárdi et al, 2015; Jovanovic et al, 2015; Cheng et al, 2016). However, analyzing per-gene variation in relative mRNA and protein expression across different tissue types in a multicellular organism presents a potentially different and critical problem which cannot be properly addressed by examining only genome-scale correlations between mRNA and protein levels. Wilhelm et al (2014) and Kim et al (2014) have measured protein levels across human tissues, thus providing valuable datasets for analyzing the regulatory layers shaping tissue-type-specific proteomes. The absolute levels of proteins and mRNAs in these datasets correlate well, highlighting that highly abundant proteins have highly abundant mRNAs. Such correlations between the absolute levels of mRNA and protein mix/conflate many sources of variation, including variability between the levels of different proteins, variability within the same protein across different conditions and cell-types, and the variability due to measurement error and technological bias.
However, these different sources of variability have very different biological interpretations and implications. A major source of variability in protein and mRNA data arises from differences between the levels of mRNAs and proteins corresponding to different genes. That is, the mean levels (averaged across tissue-types) of different proteins and mRNAs vary widely. We refer to this source of variability as mean-level variability. This mean-level variability reflects the fact that some proteins, such as ribosomal proteins, are highly abundant across all profiled tissues while other proteins, such as cell cycle and signaling regulators, are orders of magnitude less abundant across all profiled conditions (Wilhelm et al, 2014). Another principal source of variability in protein levels, intuitively orthogonal to the mean-level variability, is the variability within a protein across different cell-types or physiological conditions; we refer to it as across-tissue variability. The across-tissue variability is usually much smaller in magnitude, but may be the most relevant source of variability for understanding different phenotypes across cell-types and physiological conditions.
Here, we sought to separately quantify the contributions of transcriptional and post-transcriptional regulation to the mean-level variability and to the across-tissue variability across human tissues. Our results show that much of the mean-level protein variability can be explained well by mRNA levels, while across-tissue protein variability is poorly explained by mRNA levels; much of the unexplained variance is due to measurement noise, but some of it is reproducible across datasets and thus likely reflects post-transcriptional regulation. These results add to previous results in the literature (Gygi et al, 1999; Schwanhäusser et al, 2011; Li et al, 2014; Wilhelm et al, 2014; Jovanovic et al, 2015; Csárdi et al, 2015; Smits et al, 2014) and suggest that post-transcriptional regulation is a significant contributor to shaping tissue-type-specific proteomes in human.
Results
The correlation between absolute mRNA and protein levels conflates distinct sources of variability
We start by outlining the statistical concepts underpinning the common correlational analysis and depiction (Gygi et al, 1999; Schwanhäusser et al, 2011; Wilhelm et al, 2014; Csárdi et al, 2015) of estimated absolute protein and mRNA levels as displayed in Figure 1a. The correlation between the absolute mRNA and protein levels of different genes and across different tissue-types has been used to estimate the level at which the protein levels are regulated (Wilhelm et al, 2014).
One measure reflecting the post-transcriptional regulation of a gene is its protein to mRNA ratio, which is sometimes referred to as a gene’s “translational efficiency” because it reflects, at least in part, its translational rate. Since this ratio also reflects other layers of regulation, such as protein degradation (Jovanovic et al, 2015), as well as noise, we will refer to it descriptively as the protein-to-mRNA (PTR) ratio. If the across-tissue variability of a gene is dominated by transcriptional regulation, its PTR in different tissue-types will be a gene-specific constant. Based on this idea, Wilhelm et al (2014) estimated these protein-to-mRNA ratios and suggested that the median ratio for each gene can be used to scale its tissue-specific mRNA levels and that this “scaled mRNA” accurately predicts tissue-specific protein levels.
Indeed, mRNA levels scaled by the corresponding median PTR explain a large fraction of the total protein variance (the total $R^2$ across 6104 measured proteins, Figure 1a), as previously observed (Schwanhäusser et al, 2011; Wilhelm et al, 2014). However, this total $R^2$ quantifies the fraction of the total protein variance explained by mRNA levels between genes and across tissue-types; thus, it conflates the mean-level variability with the across-tissue variability. This conflation is shown schematically in Figure 1b for a subset of 100 genes measured across 12 tissues. The across-tissue variability is captured by the variability within the regression fits and the mean-level variability is captured by the variability between the regression fits.
Such aggregation of distinct sources of variability, where different subgroups of the data show different trends, may lead to counter-intuitive results and incorrect conclusions, and is known as Simpson’s paradox or the amalgamation paradox (Blyth, 1972). To illustrate Simpson’s paradox in this context, we depicted a subset of genes for which the measured mRNA and protein levels are unrelated across tissues, while the mean-level variability still spans the full dynamic range of the data. For this subset of genes, the overall (conflated/amalgamated) correlation is large and positive, despite the fact that all within-gene trends are close to zero. This counter-intuitive result is possible because the conflated correlation is dominated by the variability with the larger dynamic range, in this case the mean-level variability. This conceptual example taken from the Wilhelm et al (2014) data demonstrates that the total $R^2$ is not necessarily informative about the across-tissue variability, i.e., the protein variance explained by scaled mRNA within a gene. Thus the conflated correlation is not generally informative about the level (transcriptional or post-transcriptional) at which across-tissue variability is regulated. This point is further illustrated in Supplementary Fig. 1 with data for all quantified genes: the correlations between scaled mRNA and measured protein levels are not informative for the correlations between the corresponding relative changes in protein and mRNA levels.
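The amalgamation effect described above can be reproduced with synthetic data. The following is a minimal sketch, not the analysis used in this study: the gene count, tissue count, dynamic range and noise scale are all illustrative assumptions, and the helper `pearson` is written here only for the example. Each simulated gene has a mean level shared by mRNA and protein, while its across-tissue deviations are drawn independently for the two.

```python
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)
n_genes, n_tissues = 100, 12  # illustrative sizes, assumed for this sketch

pooled_mrna, pooled_protein, within_gene = [], [], []
for _ in range(n_genes):
    # Gene-specific mean level on a log10 scale, spanning about 4 orders of magnitude.
    mean_level = rng.uniform(0.0, 4.0)
    # Across-tissue deviations are independent for mRNA and protein,
    # so within a gene the two are unrelated by construction.
    mrna = [mean_level + rng.gauss(0.0, 0.3) for _ in range(n_tissues)]
    protein = [mean_level + rng.gauss(0.0, 0.3) for _ in range(n_tissues)]
    within_gene.append(pearson(mrna, protein))
    pooled_mrna += mrna
    pooled_protein += protein

pooled_r = pearson(pooled_mrna, pooled_protein)
mean_within_r = sum(within_gene) / n_genes
print(f"pooled correlation: {pooled_r:.2f}")
print(f"mean within-gene correlation: {mean_within_r:.2f}")
```

In this toy setting the pooled correlation is high even though the within-gene correlations scatter around zero, because the pooled statistic is dominated by the mean-level variability.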
While across-tissue variability is smaller than mean-level variability, it is exactly the across-tissue variability that contributes to the biological identity of each tissue type. This across-tissue variability has a dynamic range of about 2 to 10 fold and is thus dwarfed by the $10^3$ to $10^4$ fold dynamic range of abundances across different proteins.
Estimates of transcriptional and post-transcriptional regulation across-tissues depend strongly on data reliability
Next, we sought to estimate the fractions of across-tissue protein variability due to transcriptional regulation and to post-transcriptional regulation. This estimate depends crucially on noise in the mRNA and protein data, from sample collection to measurement error. Both RNA-seq (Marioni et al, 2008; Consortium et al, 2014) and mass-spectrometry (Schwanhäusser et al, 2011; Peng et al, 2012) have relatively large and systematic errors in estimating absolute levels of mRNAs and proteins, i.e., the ratios between different proteins/mRNAs. These errors originate from DNA sequencing GC-biases, and variations in protein digestion and peptide ionization. However, relative quantification of the same gene across tissue-types by both methods can be much more accurate, since systematic biases are minimized when taking ratios between the intensities/counts of the same peptide/DNA-sequence measured in different tissue types (Ong et al, 2002; Blagoev et al, 2004; Consortium et al, 2014; Jovanovic et al, 2015). It is this relative quantification that is used in estimating across-tissue variability, and we start by estimating the reliability of the relative quantification across human tissues, Figure 2a-d. Reliability is defined as the fraction of the observed (empirical) variance that is due to signal. Thus reliability increases with the signal strength and decreases with the noise level.
To estimate the within-study reliability of mRNA levels, we split each dataset into two subsets, each of which contains measurements for all tissues. The levels of each mRNA were estimated from each subset and the estimates were correlated, averaging across tissues (Figure 2a). These correlations provide estimates for the reliability of each mRNA, and their median provides a global estimate for the reliability of relative RNA measurement, not taking into account noise due to sample collection and handling.
To estimate the within-study reliability of protein levels, we computed separate estimates of the relative protein levels within a dataset. For each protein, Estimate 1 was derived from 50% of the quantified peptides and Estimate 2 from the other 50%. Since much of the analytical noise related to protein digestion, chromatographic mobility and peptide ionization is peptide-specific, such non-overlapping sets of peptides provide mostly, albeit not completely, independent estimates for the relative protein levels. The correlations between the estimates for each protein (averaging across 12 tissues) are displayed as a distribution in Figure 2b.
In addition to the within-study measurement error, protein and mRNA estimates can be affected by study-dependent variables such as sample collection and data processing. To account for these factors, we estimated across-study reliability by comparing estimates for relative protein and mRNA levels derived from independent studies, Figure 2c-d. For each gene, we estimate the reliability of its mRNA levels by computing the empirical correlation between the mRNA abundance reported by ENCODE (Djebali et al, 2012b) and by Fagerberg et al (2014). The correlations in Figure 2c have a much broader distribution than the within-study correlations, indicating that much of the noise in mRNA estimates is study-dependent.
To estimate the across-study reliability of protein levels, we compared the protein levels estimated from data published by Wilhelm et al (2014) and Kim et al (2014). To quantify protein abundances, Wilhelm et al (2014) used iBAQ scores and Kim et al (2014) used spectral counts. To ensure uniform processing of the two datasets, we downloaded the raw data, analyzed them with MaxQuant using identical settings, and estimated protein abundances in each dataset using iBAQ; see Methods. The corresponding estimates for each protein were correlated to estimate their reproducibility. Again, the correlations depicted in Figure 2d have a much broader distribution compared to the within-study protein correlations in Figure 2b, indicating that, as with mRNA, the vast majority of the noise is study-dependent. As one representative estimate of the reliability of protein levels, we use the median of the across-tissue correlations from Figures 2c-d.
The across-tissue correlations and the reliability of the measurements can be used to estimate the across-tissue variability in protein levels that can be explained by mRNA levels (i.e., transcriptional regulation) as shown in Figure 2e; see Methods. As the reliabilities of the protein and the mRNA estimates decrease, the noise sensitivity of the estimated transcriptional contribution increases. Although the average across-tissue mRNA-protein correlation was only 0.29 ($R^2 = 0.08$), the data are consistent with approximately 50% of the variance being explained by transcriptional regulation and approximately 50% coming from post-transcriptional regulation. However, the low reliability of the data and the large sampling variability preclude making this estimate precise. Thus, we next considered analyses that can provide estimates for the scope of post-transcriptional regulation even when the reliability of the data is low.
Coordinated post-transcriptional regulation of functional gene sets
The low reliability of estimates across datasets limits the reliability of estimates of transcriptional and post-transcriptional regulation for individual proteins, Figure 2. Thus, we focused on estimating the post-transcriptional regulation for sets of functionally related genes as defined by the gene ontology (Consortium et al, 2004). By considering such gene sets, we may be able to average out some of the measurement noise and see regulatory trends shared by functionally related genes. Indeed, some of the noise contributing to the across-tissue variability of a gene is likely independent from the function of the gene; see Methods. Conversely, genes with similar functions are likely to be regulated similarly and thus have similar tissue-type-specific PTR ratios. Thus, we explored whether the across-tissues variability of the PTR ratios of functionally related genes reflects such tissue-type-specific and biological-function-specific post-transcriptional regulation.
Since this analysis aims to quantify across-tissue variability, we define the “relative protein to mRNA ratio” (rPTR) of a gene in a given tissue to be the PTR ratio in that tissue divided by the median PTR ratio of the gene across the other 11 tissues. We evaluated the significance of rPTR variability for a gene-set in each tissue-type by comparing the corresponding gene-set rPTR distribution to the rPTR distribution for those same genes pooled across the other tissues (Figure 3); we use the KS-test to quantify the statistical significance of differences in the rPTR distributions; see Methods. The results indicate that the genes from many GO terms have much higher rPTR in some tissues than in others. For example, the ribosomal proteins of the small subunit (40S) have high rPTR in kidney but low rPTR in stomach (Figure 3a-c).
While the strong functional enrichment of rPTR suggests functionally concerted post-transcriptional regulation, it can also reflect systematic dataset-specific measurement artifacts. To investigate this possibility, we obtained two estimates for rPTR from independent datasets: Estimate 1 is based on data from Wilhelm et al (2014) and Fagerberg et al (2014), and Estimate 2 is based on data from Kim et al (2014) and Djebali et al (2012b). These two estimates are highly reproducible for most tissues, as shown by the correlation between the median rPTR for GO terms in Figure 3d; Supplementary Fig. 2 shows the reproducibility for all tissues. The correlations between the two rPTR estimates remain strong when computed with all GO terms (not only those showing significant enrichment), as shown in Table S1, as well as when computed between the rPTRs for all genes (Table S2).
Consensus protein levels
Given the low reliability of protein estimates across studies (Figure 2), we sought to increase it by deriving consensus estimates. Indeed, by appropriately combining data from both protein studies, we can average out some of the noise, thus improving the reliability of the consensus estimates; see Methods. As expected for protein estimates with increased reliability, the consensus protein levels correlate better with mRNA levels than the corresponding protein levels estimated from either dataset alone, Figure 4. We further validate our consensus estimates against 124 protein/tissue measurements from a targeted MS study by Edfors et al (2016). We computed the mean squared errors (MSE) between the protein levels estimated from the targeted study and the other three datasets using only protein/tissue measurements quantified in all datasets, facilitating fair comparison (Table S3). The MSE is lower for the consensus dataset than for either Wilhelm et al (2014) or Kim et al (2014), and is consistent with a 10% error reduction relative to the Kim et al (2014) dataset. In addition to increased reliability, the consensus dataset has increased coverage, providing a more comprehensive quantification of protein levels across human tissues than either draft of the human proteome taken alone (Table S3).
Discussion
Highly abundant proteins have highly abundant mRNAs. This dependence is consistently observed (Jovanovic et al, 2015; Csárdi et al, 2015; Gygi et al, 1999; Smits et al, 2014; Schwanhäusser et al, 2011) and dominates the explained variance in the estimates of absolute protein levels (Figure 1 and Supplementary Fig. 1). This underscores the role of transcription for setting the full dynamic range of protein levels. In stark contrast, differences in the proteomes of distinct human tissues are poorly explained by transcriptional regulation, Figure 1. This is due to measurement noise (Figure 2) but also to post-transcriptional regulation. Indeed, large and reproducible rPTR ratios suggest that the mechanisms shaping tissue-specific proteomes involve post-transcriptional regulation, Figure 3. This result underscores the role of translational regulation and of protein degradation for mediating physiological functions within the range of protein levels consistent with life.
As with all analyses of empirical data, the results depend on the quality of the data and the estimates of their reliability. This dependence on data quality is particularly strong given that some conclusions rest on the failure of across-tissue mRNA variability to predict across-tissue protein variability. Such inference based on unaccounted-for variability is substantially weaker than directly measuring and accounting for all sources of variability. The low across-study reliability suggests that the signal is strongly contaminated by noise, especially systematic biases in sample collection and handling, and thus the data cannot accurately quantify the contributions of different regulatory mechanisms, Figure 2. Another limitation of the data is that isoforms of mRNAs and proteins are merged together, i.e., using razor proteins. This latter limitation is common to all approaches quantifying proteins and mRNAs from peptides/short-sequence reads. It stems from the limited ability of existing approaches to infer isoforms and quantify them separately.
The strong enrichment of rPTR ratios within gene sets (Figure 3) demonstrates a functionally concerted regulation at the post-transcriptional level. Some of the rPTR trends can account for fundamental physiological differences between tissue types. For example, the kidney is the most metabolically active (energy consuming) tissue among the 12 profiled tissues (Hall, 2010) and it has very high rPTR for many gene sets involved in energy production (Figure 3a). In this case, post-transcriptional regulation very likely plays a functional role in meeting the high energy demands of kidneys.
The rPTR patterns and the across-tissue correlations in Supplementary Fig. 1 indicate that the relative contributions of transcriptional and post-transcriptional regulation can vary substantially depending on the tissues compared. Thus, the level of gene regulation depends strongly on the context. For example, transcriptional regulation contributes significantly to the dynamical responses of dendritic cells (Jovanovic et al, 2015) and to the differences between kidney and prostate gland (Supplementary Fig. 1b) but less to the differences between kidney and thyroid gland (Supplementary Fig. 1a). All data, across all profiled tissues, suggest that post-transcriptional regulation contributes substantially to the across-tissue variability of protein levels. The degree of this contribution depends on the context.
Indeed, if we only increase the levels for a set of mRNAs without any other changes, the corresponding protein levels must increase proportionally, as demonstrated by gene inductions (McIsaac et al, 2011). However, the differences across cell-types are not confined only to different mRNA levels. Rather, these differences include different RNA-binding proteins, alternative untranslated regions (UTRs) with known regulatory roles in protein synthesis, specialized ribosomes (Mauro and Edelman, 2002; Mauro and Matsuda, 2015; Slavov et al, 2015; Preiss, 2016), and different protein degradation rates (Gebauer and Hentze, 2004; Rojas-Duran and Gilbert, 2012; Castello et al, 2012; Arribere and Gilbert, 2013; Katz et al, 2014). The more substantial these differences, the bigger the potential for post-transcriptional regulation. Thus cell-type differentiation and commitment may result in much more post-transcriptional regulation than observed during perturbations preserving the cellular identity. Consistent with this possibility, tissue-type-specific proteomes may be shaped by substantial post-transcriptional regulation; in contrast, cell stimulation that preserves the cell-type may elicit a strong transcriptional remodeling but weaker post-transcriptional remodeling.
Methods
Data and scaled mRNA levels
We used data from Wilhelm et al (2014); Kim et al (2014); Fagerberg et al (2014); Djebali et al (2012b) containing estimates for the mRNA levels (based on RNA-seq) and for the protein levels (based on mass-spectrometry) of N = 6104 genes measured in each of twelve different human tissues: adrenal gland, esophagus, kidney, ovary, pancreas, prostate, salivary gland, spleen, stomach, testis, thyroid gland, and uterus. For these genes, about 8% of the mRNA measurements and about 40% of the protein measurements are missing.
Let $m_{it}$ denote the log mRNA level of gene i in tissue t. Similarly, let $p_{it}$ denote the corresponding log protein level. First, we normalize the columns of the data, for both protein and mRNA, to correct for different amounts of total protein per sample. Any multiplicative factors on the raw scale correspond to additive constants on the log scale. Consequently, we normalize data from each tissue-type by minimizing the absolute differences between data from the tissue and the first tissue (arbitrarily chosen as a baseline). That is, for all t > 1, we define
$$p^{n}_{it} = p_{it} - \mu_t \quad \text{with} \quad \mu_t = \arg\min_{\mu} \sum_i \left| p_{it} - \mu - p_{i1} \right|,$$
where $p^{n}_{it}$ and $p_{it}$ represent the normalized and non-normalized protein measurements, respectively. For each t, the value of $\mu_t$ which minimizes the absolute difference is
$$\hat{\mu}_t = \operatorname{median}_i \left( p_{it} - p_{i1} \right).$$
We use the same normalization for mRNA. This normalization, which corresponds to a location shift of the log abundances for each tissue, corrects for any multiplicative differences in the raw (unlogged) mRNA or protein. We normalize these measurements by aligning the medians rather than the means, as the median is more robust to outliers.
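The median alignment described above can be sketched in a few lines. This is an illustrative sketch rather than the code used for the analysis: the function name `align_to_baseline`, the representation of missing values as `None`, and the toy matrix are all assumptions made for the example. The first column plays the role of the baseline tissue.

```python
from statistics import median

def align_to_baseline(log_levels):
    """Shift each tissue column so that its median difference to tissue 0 is zero.

    log_levels: list of per-gene rows; each row holds log abundances,
    one per tissue, with None marking a missing measurement.
    """
    n_tissues = len(log_levels[0])
    shifts = [0.0]  # the baseline tissue is left unchanged
    for t in range(1, n_tissues):
        # Use only genes observed in both tissue t and the baseline.
        diffs = [row[t] - row[0] for row in log_levels
                 if row[t] is not None and row[0] is not None]
        shifts.append(median(diffs))
    normalized = [[None if v is None else v - shifts[t]
                   for t, v in enumerate(row)]
                  for row in log_levels]
    return normalized, shifts

# Toy data: 4 genes (rows) by 3 tissues (columns), values invented for illustration.
toy = [
    [1.0, 2.1, 0.4],
    [2.0, 3.0, None],
    [3.0, 4.2, 2.5],
    [None, 5.0, 3.5],
]
normalized, shifts = align_to_baseline(toy)
print(shifts)
```

Because the shift is a per-tissue additive constant on the log scale, it leaves the relative ordering of genes within a tissue unchanged.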
After normalization, we define $r_{it} = p_{it} - m_{it}$ as the log PTR ratio of gene i in condition t. If the post-transcriptional regulation for the ith gene were not tissue-specific, then the ith PTR ratio would be independent of tissue-type and could be estimated as
$$\hat{r}_i = \operatorname{median}_t \left( r_{it} \right).$$
In such a situation the log “scaled mRNA” (or mean protein level) can be defined as
$$\bar{p}_{it} = m_{it} + \hat{r}_i.$$
On the raw scale this amounts to scaling each mRNA by its median PTR ratio and represents an estimate of the mean protein level. The residual difference between the measured log protein level and the log mean protein level, $p_{it} - \bar{p}_{it}$, which we call the log rPTR ratio, consists of both tissue-specific post-transcriptional regulation and measurement noise.
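As a small worked example of the scaling step, the sketch below computes the median log PTR ratio, the log scaled mRNA and the residuals for a single gene. The function name `scale_mrna` and the three-tissue input values are hypothetical, and the sketch assumes no missing measurements.

```python
from statistics import median

def scale_mrna(log_mrna, log_protein):
    """Return (median log PTR, log scaled mRNA, log residuals) for one gene.

    Both inputs are lists of log levels across tissues with no missing values
    (an assumption of this sketch).
    """
    log_ptr = [p - m for p, m in zip(log_protein, log_mrna)]
    gene_ptr = median(log_ptr)                   # gene-specific constant
    scaled = [m + gene_ptr for m in log_mrna]    # log "scaled mRNA"
    residual = [p - s for p, s in zip(log_protein, scaled)]
    return gene_ptr, scaled, residual

# Invented log levels for one gene in three tissues.
gene_ptr, scaled, residual = scale_mrna([1.0, 2.0, 3.0], [3.0, 4.5, 5.0])
print(gene_ptr, scaled, residual)
```

In this toy case the second tissue has a positive residual, meaning the measured protein level exceeds what the scaled mRNA predicts; in real data such a residual mixes tissue-specific regulation with measurement noise.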
Across-Tissue Correlations
For each gene i, we compute the correlation between mRNA and protein across tissues. Unlike the between-gene correlations, which are consistently large after scaling for each tissue (Figure 1a), across-tissue correlations are highly variable between genes. Although this could be in part because true mRNA/protein correlations vary significantly between genes, much of the heterogeneity can be explained by sampling variability. There are only 10 to 12 tissues in common across datasets (depending on which datasets are used) and for many genes the abundances are missing, which means that the empirical estimates of across-tissue correlation for each gene are very noisy. To find a representative estimate of the across-tissue correlation we can take the median over all genes. As an alternative, if the correlation were roughly constant between genes, we could pool information to yield a representative estimate of this across-tissue correlation. For a gene i, we compute the Fisher transformation of the within-gene correlation $\hat{\rho}_i$, i.e., $z_i = \operatorname{arctanh}(\hat{\rho}_i)$. This Fisher transformation is approximately normally distributed:
$$z_i \sim N\left( \operatorname{arctanh}(\rho), \frac{1}{N_i - 3} \right),$$
where $N_i$ is the number of observed mRNA-protein pairs for gene i (at most 11) and $\rho$ corresponds to the population correlation. We compute the maximum likelihood estimate of the Fisher-transformed population correlation by weighting each observation by its inverse variance:
$$\hat{z} = \frac{\sum_i (N_i - 3)\, z_i}{\sum_i (N_i - 3)}.$$
We then transform this estimate back to the correlation scale:
$$\hat{\rho} = \tanh(\hat{z}).$$
Depending on the datasets used, with this method we estimate the population across-tissue mRNA/protein correlation to be between 0.21 (Wilhelm et al, 2014) and 0.29 (Kim et al, 2014). This correlation cannot be used as direct evidence for the relationship between mRNA and protein levels, since both mRNA and protein datasets are unreliable due to measurement noise. This measurement noise attenuates the true correlation. Below we address this by directly estimating data reliability and correcting for noise.
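The pooling of per-gene correlations through the Fisher transformation can be sketched as follows. This is a hedged illustration, not the original implementation: the function name `pooled_correlation` and the input values are invented, and the variance of each transformed correlation is taken to be the standard approximation 1/(N − 3), so each gene is weighted by N − 3.

```python
import math

def pooled_correlation(correlations, n_pairs):
    """Inverse-variance-weighted mean of Fisher-transformed correlations,
    transformed back to the correlation scale."""
    num = den = 0.0
    for r, n in zip(correlations, n_pairs):
        if n <= 3 or abs(r) >= 1.0:
            continue  # variance 1/(n - 3) undefined, or transform infinite
        weight = n - 3  # inverse of the approximate variance 1/(n - 3)
        num += weight * math.atanh(r)
        den += weight
    if den == 0.0:
        raise ValueError("no gene has enough observations to pool")
    return math.tanh(num / den)

# Invented per-gene correlations and numbers of observed mRNA-protein pairs.
rho_hat = pooled_correlation([0.5, 0.1, -0.2, 0.6], [12, 10, 8, 11])
print(f"pooled across-tissue correlation: {rho_hat:.2f}")
```

Genes with more observed tissues contribute more to the pooled estimate, which is the main design reason for pooling on the transformed scale instead of averaging raw correlations.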
Noise Correction
Measurement noise attenuates estimates of correlations between mRNA and protein level (Franks et al, 2015). A simple way to quantify this attenuation of correlation due to measurement error is via Spearman’s correction. Spearman’s correction is based on the fact that the variance of the measured data can be decomposed into the sum of variance of the noise and the signal. If the noise and the signal are independent, this decomposition and the Spearman’s correction are exact (Csárdi et al, 2015).
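The empirical variance is the sum of the variance of the signal and the variance of the noise. To see this, write each observation as $x_i = e_i + \zeta_i$, where
$e_i$ is the expectation (signal) at the ith data point;
$\zeta_i$ is the noise at the ith data point, with $\langle \zeta \rangle = 0$;
$x_i$ is the observation at the ith data point.
If the noise is independent of the signal, the cross term vanishes and
$$\sigma_x^2 = \langle (x_i - \langle x \rangle)^2 \rangle = \langle (e_i - \langle e \rangle)^2 \rangle + 2 \langle (e_i - \langle e \rangle)\, \zeta_i \rangle + \langle \zeta_i^2 \rangle = \sigma_e^2 + \sigma_\zeta^2.$$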
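Spearman’s correction is based on estimates of the “reliability” of the measurements, which is defined as the fraction of total measured variance due to signal rather than to noise:
$$\text{Reliability}(X) = \frac{\sigma_e^2}{\sigma_e^2 + \sigma_\zeta^2},$$
where $\sigma_e^2$ and $\sigma_\zeta^2$ are the variances of the signal and of the noise, respectively.
If X and Y are noisy measurements of two quantities, we can compute the noise-corrected correlation between them as
$$\rho_{\text{corrected}} = \frac{\operatorname{Cor}(X, Y)}{\sqrt{\text{Reliability}(X)\, \text{Reliability}(Y)}}.$$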
In practice, reliabilities are not known but we can often estimate them. In this application, for both mRNA and protein we need measurements in which all steps, from sample collection to level estimation, are repeated independently. In order to estimate the mRNA reliabilities we use independent measurements from Fagerberg et al (2014) and Djebali et al (2012b). For estimating protein reliabilities we use measurements from Wilhelm et al (2014) and Kim et al (2014). Across-tissue reliabilities are computed per gene whereas within-tissue reliabilities are computed per tissue across genes. If two independent measurements have the same reliability, it can be estimated by computing the correlation between the two measurements (Spearman, 1904; Zimmerman and Williams, 1997; Csárdi et al, 2015). We estimated the approximate across-tissue protein reliability to be 0.21 and the across-tissue mRNA reliability to be 0.77. Given the estimated across-tissue mRNA/protein correlation of 0.29 (calculated using data from Kim et al (2014) and Fagerberg et al (2014)), we estimated the noise-corrected fraction of across-tissue protein variance explained by mRNA to be approximately 50%, Figure 2. Note that if both mRNA or both protein datasets share biases, then the correlation between them overstates their reliability, i.e., the estimated reliabilities will be too large, thus deflating the inferred fraction of protein variance explained by mRNA. Moreover, because the reliabilities are low, sampling variability is large, missing data are prevalent, and the mRNA/protein correlation likely varies by gene, there is uncertainty about this estimate.
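The arithmetic behind the approximately 50% figure can be checked directly from the three numbers quoted above (correlation 0.29, mRNA reliability 0.77, protein reliability 0.21). The sketch below applies Spearman's correction and squares the result; the function name `noise_corrected_r2` is introduced here only for illustration.

```python
def noise_corrected_r2(observed_r, reliability_x, reliability_y):
    """Square of the Spearman-corrected correlation: the estimated fraction
    of variance explained after removing attenuation due to noise."""
    corrected_r = observed_r / (reliability_x * reliability_y) ** 0.5
    return corrected_r ** 2

# Values quoted in the text: correlation 0.29, mRNA reliability 0.77,
# protein reliability 0.21.
fraction = noise_corrected_r2(0.29, 0.77, 0.21)
print(f"{fraction:.2f}")  # 0.29**2 / (0.77 * 0.21) is about 0.52
```

Because the protein reliability appears in the denominator and is small, modest changes in its estimate move the corrected fraction substantially, which is why the estimate is described as imprecise.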
Creating The Consensus Estimates
We use the two independent protein datasets to create a single consensus data set which is of arguably higher reliability than either dataset individually. To create this dataset, we take a weighted average of the two protein abundance datasets, by tissue. We compute the weights based on measurement reliabilities for each tissue in each of the two datasets.
Assume we have two random variables, and , corresponding to measurements on the same quantity (e.g. two independent protein measurements) with is the signal which is independent of , the measurement error for sample i. We have a third random variable corresponding to a different quantity (e.g. an mRNA measurement), that is typically positively correlated with and with the same covariance . To create the consensus data set we first compute the reliability of for both datasets.
Note that
Thus,
Similarly, . We use these facts and compute the empirical correlations between datasets to independently estimate the across gene reliabilities for each tissue from each dataset. We then Fisher weight the protein abundances based on their reliabilities. That is, for each tissue t, the consensus dataset, is
When the reliabilities of the two datasets are close, the datasets are weighted nearly equally. When one reliability dominates the other, the more reliable dataset contributes more to the aggregated dataset. We found that the full consensus dataset has a higher median per-gene correlation with mRNA (0.34) than either of the protein datasets individually, and that it agreed more closely with validation data from Edfors et al (2016) (Table S3).
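The reliability estimate described in this section can be illustrated on synthetic data. The sketch below assumes the additive model in which each protein measurement is a shared signal plus independent error, and uses the identity Rel(X1) = Cor(X1, X2) · Cor(X1, Z) / Cor(X2, Z) that follows from it. The weighting rule in `consensus` (weights proportional to the estimated reliabilities) is an illustrative assumption, not necessarily the exact weighting used for the consensus dataset; all function names and simulation parameters are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def estimate_reliabilities(x1, x2, z):
    """Rel(X1) = Cor(X1,X2) * Cor(X1,Z) / Cor(X2,Z), and symmetrically for X2."""
    r12, r1z, r2z = pearson(x1, x2), pearson(x1, z), pearson(x2, z)
    return r12 * r1z / r2z, r12 * r2z / r1z


def consensus(x1, x2, rel1, rel2):
    """Weighted average; weights proportional to reliability (an assumption)."""
    w1 = rel1 / (rel1 + rel2)
    return [w1 * a + (1.0 - w1) * b for a, b in zip(x1, x2)]


def simulate(n=50000, seed=0):
    """Synthetic data: shared signal T, two noisy copies, and a correlated Z."""
    rng = random.Random(seed)
    t = [rng.gauss(0, 1) for _ in range(n)]
    x1 = [v + rng.gauss(0, 0.8) for v in t]     # true reliability 1 / 1.64
    x2 = [v + rng.gauss(0, 1.0) for v in t]     # true reliability 1 / 2.00
    z = [0.6 * v + rng.gauss(0, 1) for v in t]  # stands in for mRNA
    return t, x1, x2, z


t, x1, x2, z = simulate()
rel1, rel2 = estimate_reliabilities(x1, x2, z)
merged = consensus(x1, x2, rel1, rel2)
print(f"estimated reliabilities: {rel1:.2f}, {rel2:.2f}")
print(f"correlation with signal: X1 {pearson(x1, t):.2f}, "
      f"X2 {pearson(x2, t):.2f}, consensus {pearson(merged, t):.2f}")
```

In this simulation the true signal is known, so the estimated reliabilities can be compared with their true values (about 0.61 and 0.50) and the consensus can be checked against the signal directly. With real data the signal is unobserved, which is why the third variable Z is needed to separate the two reliabilities.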
Functional gene set analysis
To identify tissue-specific rPTR for functional sets of genes, we analyzed the distributions of rPTR within functional gene sets using the same methodology as Slavov and Botstein (2011). We restrict our attention to functional groups in the GO ontology (Consortium et al, 2004) for which at least 10 genes were quantified by Wilhelm et al (2014). Let k index one of these approximately 1600 functional gene sets. First, for every gene in every tissue we estimate the relative PTR (rPTR), or equivalently the difference between the log measured protein level and the log estimated mean protein level:
$$\mathrm{rPTR}_{it} = \log p_{it} - \log \hat{p}_{it},$$
where $p_{it}$ is the measured protein level of gene i in tissue t and $\hat{p}_{it}$ is its estimated mean protein level.
To exclude the possibility that $\hat{p}_{it} = p_{it}$ exactly, we require that t′ ≠ t. When the estimated rPTR is larger than zero, the measured protein level in tissue t is larger than the estimated mean protein level. Likewise, when this quantity is smaller than zero, the measured protein level is smaller than expected. Measured deviations from the mean protein level are due to both measurement noise and tissue-specific PTR. To exclude the possibility that all of the variability in rPTR is due to measurement error, we conduct a full gene set analysis.
For each of the gene sets we compute a vector of these estimated log ratios, so that gene set k in tissue t comprises $R_{kt} = \{\mathrm{rPTR}_{i_1 t}, \dots, \mathrm{rPTR}_{i_{n_k} t}\}$, where $i_1$ to $i_{n_k}$ index the genes in set k and t indexes the tissue type.
Let $\mathrm{KS}(A, B)$ be the function that returns the p-value of the Kolmogorov-Smirnov test on the distributions in sets $A$ and $B$. The KS test is a test for a difference in distribution between two samples. Using this test, we identify gene sets that show systematic differences in PTR in a particular tissue (t) relative to all other tissues.
Specifically, the p-value associated with gene set k in tissue t is
$$p_{kt} = \mathrm{KS}\Big(R_{kt},\ \bigcup_{t' \neq t} R_{kt'}\Big).$$
To correct for testing multiple hypotheses, we computed the false discovery rate (FDR) for all gene sets in tissue t (Storey, 2003). In Figure 3a-c, we present only the functional groups with FDR less than 1% and report their associated p-values. Note that the test statistics for the gene sets are positively correlated since the gene sets are not disjoint, but Benjamini and Yekutieli (2001) prove that the Benjamini-Hochberg procedure applied to positively correlated test statistics is conservative. Thus, the significance of certain functional groups suggests that not all of the variability in rPTR is due to measurement noise. We also calculated rPTR using two pairs of measurements: one set of rPTR estimates was calculated using protein data from Wilhelm et al (2014) and mRNA data from Fagerberg et al (2014), and the other was calculated using protein data from Kim et al (2014) and mRNA data from Djebali et al (2012b). The rPTR of the significant sets was largely reproducible across estimates from independent datasets (Figure 3d) and across genes (Table S2). Note that when computing the per-tissue reliabilities for the construction of the consensus dataset, we found that the lung and pancreas data from Wilhelm et al (2014) were much less reliable than the corresponding data from Kim et al (2014). This could explain why the independent estimates of rPTR for these tissues were less reproducible.
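The gene set test can be sketched in a few lines of standard-library Python. The sketch rests on several assumptions that are not taken from the text: the KS p-value uses the asymptotic Kolmogorov distribution rather than an exact computation, the multiple-testing step uses the Benjamini-Hochberg adjustment as a stand-in for the FDR estimator of Storey (2003), and the function names, the input layout of `rptr_by_tissue`, and the toy values are all hypothetical.

```python
import bisect
import math


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for v in a + b:
        fa = bisect.bisect_right(a, v) / len(a)
        fb = bisect.bisect_right(b, v) / len(b)
        d = max(d, abs(fa - fb))
    return d


def ks_pvalue(a, b, terms=200):
    """Asymptotic two-sided p-value from the Kolmogorov distribution."""
    lam = math.sqrt(len(a) * len(b) / (len(a) + len(b))) * ks_statistic(a, b)
    if lam < 0.05:  # truncated series is unreliable here; p is essentially 1
        return 1.0
    s = sum((-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
            for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


def benjamini_hochberg(pvalues):
    """BH-adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


def gene_set_pvalues(rptr_by_tissue, tissue):
    """KS p-value per gene set: rPTR in `tissue` versus all other tissues pooled.

    rptr_by_tissue maps set name -> {tissue name -> list of rPTR values}.
    """
    out = {}
    for name, per_tissue in rptr_by_tissue.items():
        inside = per_tissue[tissue]
        outside = [v for t, vals in per_tissue.items() if t != tissue for v in vals]
        out[name] = ks_pvalue(inside, outside)
    return out


# Toy input: one set shifted upward in kidney, one set identical in all tissues.
toy = {
    "shifted set": {"kidney": [1.0 + 0.01 * i for i in range(30)],
                    "liver": [-1.0 + 0.01 * i for i in range(30)],
                    "lung": [-1.1 + 0.01 * i for i in range(30)]},
    "flat set": {"kidney": [0.01 * i for i in range(30)],
                 "liver": [0.01 * i for i in range(30)],
                 "lung": [0.01 * i for i in range(30)]},
}
pvals = gene_set_pvalues(toy, "kidney")
names = list(pvals)
qvals = dict(zip(names, benjamini_hochberg([pvals[n] for n in names])))
print(qvals)
```

In the toy input only the shifted set passes a 1% threshold in kidney, while the set with identical distributions across tissues does not. Real gene sets overlap, so their test statistics are positively correlated, which is the situation addressed by the Benjamini and Yekutieli (2001) result cited above.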
Acknowledgments
We thank E. Wallace, J. Schmiedel, and D. A. Drummond for discussions and constructive comments. This work was partially funded by a SPARC grant from the Broad Institute to N.S. and E.A., the Washington Research Foundation Fund for Innovation in Data-Intensive Discovery and the Moore/Sloan Data Science Environments Project at the University of Washington, and NIGMS of the NIH under Award Number DP2GM123497.