Abstract
RNA-Seq and gene expression microarrays provide comprehensive profiles of gene activity, by measuring the concentrations of tens of thousands of mRNA molecules in single assays [1, 2, 3, 4, 5]. However, a lack of accuracy and reproducibility [6, 7, 8, 9] has hindered the application of these high-throughput technologies [10, 11]. A key challenge in the data analysis is the normalization of gene expression levels, which is required to make them comparable between samples [12, 13, 14, 15, 16]. Current approaches to normalization rest on the implicit assumption that most genes are not differentially expressed [16]. Here we show that this assumption is unrealistic and likely results in a failure to detect numerous gene expression changes. We have devised a mathematical approach to normalization that makes no assumption of this sort. Using it, we have found that variation in gene expression is much greater than currently believed, and that it can be measured with available technologies. Our results also explain, at least partially, the problems encountered in transcriptomics studies. We expect this improvement in detection to help efforts to realize the full potential of gene expression profiling, especially in analyses of cellular processes involving complex modulations of gene expression, such as cell differentiation, toxic responses and cancer [17, 18, 19, 20].
Since the discovery of the structure of DNA by Watson and Crick, molecular biology has progressed increasingly quickly, with rapid advances in sequencing and related genomic technologies. Among these, microarrays and RNA-Seq have been widely adopted to obtain gene expression profiles. However, problems of reproducibility and reliability [6, 8, 9] have discouraged their use in some areas, e.g. biomedicine [21, 22, 23]. For the more mature technology of microarrays [7], issues such as probe design, cross-hybridization, non-linearities and batch effects have been identified as possible culprits, but the problems persist [8, 9].
The normalization of gene expression, which is required to set a common reference level among samples [12, 13, 14, 15, 16], is also reportedly problematic, affecting the reproducibility of both microarray [8, 24] and RNA-Seq [9, 14, 16] results. An underlying assumption of the most widely used normalization methods (such as median and quantile normalization [25] for microarrays, or RPKM [4] and TMM [26] for RNA-Seq) is that most genes are not differentially expressed [16]. This lack-of-variation assumption may seem reasonable for many applications, but it has not been confirmed. Furthermore, results obtained with other technologies, particularly qRT-PCR, suggest that it may not be valid [8, 14]. Thus, in an attempt to clarify and overcome the limitations imposed by this assumption, we have developed an approach to normalization that does not assume lack of variation. The analysis of a large gene expression dataset using this approach shows that the assumption can severely undermine the detection of variation in gene expression. We find numbers of differentially expressed genes and magnitudes of expression changes so large that they cannot be neglected in the normalization step of the data analysis.
The dataset was obtained from biological triplicates of Enchytraeus crypticus (a globally distributed soil organism used in standard ecotoxicity tests), sampled under 51 experimental conditions (42 treatments and 9 controls), involving exposure to several substances, at several concentrations and durations according to a factorial design (Extended Data Table 1). Gene expression was measured using a customized high-density oligonucleotide microarray [30], and the resulting dataset was normalized with four methods. Two of these methods are the most widely used procedures for microarrays, median (or scale) normalization and quantile normalization [25], whereas the other two, designated median condition-decomposition normalization and standard-vector condition-decomposition normalization, have been developed for this study.
With the exception of quantile normalization, all the methods used apply a multiplicative factor to the expression levels in each sample, which is equivalent to adding a number in the usual log2-scale for gene expression levels. Solving the normalization problem consists of finding these correction factors. The problem can be exactly and linearly decomposed into several sub-problems: one within-condition normalization for each experimental condition and one final between-condition normalization for the condition averages. In the within-condition normalizations, the samples (replicates) subjected to each experimental condition are normalized separately, whereas in the final between-condition normalization the average levels of all conditions are normalized together. Because the samples within each condition are replicates, there are no differentially expressed genes in any of the within-condition normalizations, so the lack-of-variation assumption only affects the final between-condition normalization. The assumption is avoided by using, in this normalization, expression levels only from no-variation genes, i.e. genes that show no evidence of differential expression under a statistical test. Both novel methods of normalization proposed here follow this condition-decomposition approach (see Supplementary Methods and Supplementary Videos 1–5).
In the median condition-decomposition normalization, all normalization steps are performed with median values, as in conventional median normalization, but only no-variation genes are included in the between-condition step. If all genes were used in this final step, the resulting total normalization factors would be exactly the same as those obtained with conventional median normalization.
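As an illustration, the scheme can be sketched in R as follows (a minimal sketch, not the published implementation; `expr`, `condition` and `no_var` are hypothetical names for the log2 expression matrix, the per-sample condition factor and the set of detected no-variation genes):

```r
# Minimal sketch of median condition-decomposition normalization.
# expr: genes x samples log2 matrix; condition: factor of sample conditions;
# no_var: logical vector marking no-variation genes (here assumed given;
# the actual method detects them iteratively with a statistical test).
median_cd_normalize <- function(expr, condition, no_var) {
  # Within-condition step: equalize sample medians inside each condition
  for (k in levels(condition)) {
    idx <- condition == k
    med <- apply(expr[, idx, drop = FALSE], 2, median)
    expr[, idx] <- sweep(expr[, idx, drop = FALSE], 2, med - mean(med))
  }
  # Between-condition step: equalize condition medians, computed over
  # no-variation genes only
  cond_means <- sapply(levels(condition), function(k)
    rowMeans(expr[, condition == k, drop = FALSE]))
  med_k <- apply(cond_means[no_var, ], 2, median)
  offset <- med_k - mean(med_k)
  for (k in seq_along(levels(condition))) {
    idx <- condition == levels(condition)[k]
    expr[, idx] <- expr[, idx] - offset[k]
  }
  expr
}
```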
For standard-vector condition-decomposition normalization, a vectorial procedure was developed to carry out each normalization step. In a properly normalized dataset, the samples of any experimental condition must be exchangeable. In mathematical terms, the expression levels of each gene can be considered as an s-dimensional vector, where s is the number of samples for the experimental condition. After standardization (mean subtraction and variance scaling), these standard vectors lie on an (s − 2)-dimensional hypersphere. The exchangeability mentioned above implies that, when the dataset is properly normalized, the distribution of standard vectors must be invariant with respect to permutations of the sample labels and must have zero expected value. These properties make it possible to obtain, under fairly general assumptions, a robust estimator of the normalization factors.
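The construction of standard vectors for one condition can be sketched in R as follows (assuming `y` holds the genes × s matrix of that condition's replicates; a clearly non-zero mean of the resulting vectors signals remaining normalization offsets):

```r
# Standard vectors of one condition's replicates: each gene's row is
# centered and scaled to unit norm, placing it on the (s - 2)-sphere
standard_vectors <- function(y) {
  res <- y - rowMeans(y)          # subtract each gene's mean across samples
  norms <- sqrt(rowSums(res^2))   # scale each gene to unit norm
  res[norms > 0, , drop = FALSE] / norms[norms > 0]
}

w <- standard_vectors(y)
colMeans(w)   # close to zero for properly normalized samples
```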
An important feature of the novel approaches to normalization proposed here (linear decomposition into normalization sub-problems per condition, and standard-vector normalization for each sub-problem) is that they do not depend on any particular aspect of the technology of gene expression microarrays or RNA-Seq. The numbers in the input data are interpreted as measured concentrations of mRNA molecules, irrespective of whether they were obtained from fluorescence intensities of hybridized cDNA (microarrays) or from counts of fragments read from mRNA sequences (RNA-Seq). Nevertheless, we consider that technology-specific within-sample corrections are still necessary and must be applied before the between-sample normalizations proposed here; examples include background correction for microarrays and gene-length normalization (RPKM) for RNA-Seq.
To further explore and compare outcomes of the normalization methods, they were also applied to a synthetic random dataset. This dataset was generated gene by gene, with means and variances identical to those of the real dataset, under the assumption that all genes were no-variation genes. In addition, normalization factors equal to those obtained from the real dataset were applied. Thus, the synthetic dataset was very similar to the real one, while complying by construction with the lack-of-variation assumption.
Figure 1 displays the results of applying the four normalization methods to the real and synthetic datasets. Each panel shows the interquartile ranges of expression levels for the 153 samples, grouped in triplicates exposed to each experimental condition. Both median (second row) and quantile normalization (third row) yielded similar outputs, for both datasets. In contrast, the condition-decomposition normalizations (fourth and fifth rows) identified marked differences, detecting much greater variation between conditions in the real dataset. Conventional median normalization makes, by design, the median of each sample the same, while quantile normalization makes the full distribution of each sample the same. Hence, if there were differences in medians or distributions of gene expression between experimental conditions, both methods would have removed them. Figures 1g,i show that such variation between conditions was present in the real dataset.
The variation between medians displayed in Figs. 1g,i may seem surprising, given routine expectations based on current methods (Figs. 1c,e). Nevertheless, this variation inevitably results from the imbalance between over- and under-expressed genes. As an illustration, let us consider a case with two experimental conditions, in which the average expression of a given gene is less than the distribution median under one condition, but greater than the median under the other. The variation of this gene alone shifts the median to the expression level of the next-ranked gene. Therefore, if the number of over-expressed genes differs from the number of under-expressed genes, and enough changes cross the median boundary, then the median will substantially differ between conditions. Only when the differential expression is balanced or small enough will the median stay the same. This argument applies equally to any other quantile of the distribution of gene expression.
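A small simulation in R makes the argument concrete (illustrative numbers only, not taken from the dataset):

```r
# Unbalanced differential expression shifts the sample median
set.seed(1)
a <- rnorm(10000, mean = 8, sd = 2)   # log2 expression, condition A
b <- a
up <- sample(10000, 2000)             # 20% of genes over-expressed,
b[up] <- b[up] + 1.5                  # with no compensating decreases
median(a)   # ~8.0
median(b)   # ~8.3: a genuine between-condition difference that median
            # normalization would remove as if it were technical bias
```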
To clarify how the condition-decomposition normalizations preserved the variation between conditions, we studied the influence of the choice of no-variation genes in the final between-condition normalization. To this end, we obtained the between-condition variation with both methods in two families of cases. In one family, no-variation genes were chosen in decreasing order of p-values from an ANOVA test. In the other family, genes were chosen at random. The first option was similar to the approach implemented to obtain the results presented in Fig. 1g–j, with the difference that there the number of genes was chosen automatically by a statistical test. As shown in Fig. 2a, for the real dataset the random choice of genes resulted in $n^{-1/2}$ decays (n being the number of chosen genes), followed by a plateau. The $n^{-1/2}$ decays reflect the standard errors of the estimators of the normalization factors. Selecting the genes by decreasing p-values, however, yielded a completely different result. Up to a certain number of genes, the variance remained essentially unchanged, but for larger numbers of genes it dropped rapidly. Figure 2a shows, therefore, that between-condition variation was removed as soon as the between-condition normalizations used genes that varied in expression level across experimental conditions. The big circles in Fig. 2a indicate the working points of the normalizations used to generate the results displayed in Figs. 1g,i. In fact, these points slightly underestimated the variation between conditions: although the statistical test for identifying no-variation genes ensured that there was no evidence of variation, inevitably the expression of some selected genes varied across conditions.
Figure 2b displays the results obtained with the synthetic dataset. There were no plateaus when no-variation genes were chosen randomly, only $n^{-1/2}$ decays, and only small differences when no-variation genes were selected by decreasing p-values. The big circles show that the algorithms selected working points with much larger numbers of genes in the synthetic dataset (Figs. 1h,j) than in the real dataset (Figs. 1g,i). The residual variation, produced by errors in the estimation of the normalization factors, was much smaller than the variation detected in the real dataset, especially for standard-vector condition-decomposition normalization. Overall, Figs. 2a,b show that the between-condition variation pictured in Figs. 1g,i is not an artifact caused by using an exceedingly small or extremely particular set of genes in the final between-condition normalization, but that this variation originated in the real dataset.
Finally, Fig. 3a shows the numbers of differentially expressed gene probes (DEGP), identified after normalizing with the four methods, for each of the 42 experimental treatments versus the corresponding control (Extended Data Table 2). Compared to conventional methods, the number of DEGP detected with the condition-decomposition normalizations was much larger under most treatments, including some whose number of DEGP was larger by more than one order of magnitude. These are statistically significant changes of gene expression, i.e. changes that cannot be explained by chance. More important is the scale of the detected variation, as illustrated by the boxplots in Fig. 3b showing absolute fold changes of DEGP detected after standard-vector condition-decomposition normalization. For all treatments, the entire interquartile range of absolute fold change is above 1.5-fold, and for more than two thirds of the treatments the median absolute fold change is greater than 2. This amount of gene expression variation cannot be neglected (cf. Extended Data Fig. 2), and warrants further research to explore its biological significance.
The lack-of-variation assumption underlying the current methods of normalization was self-fulfilling, removing variation in gene expression that was present in the real dataset. Moreover, it had negative consequences for downstream analyses, as it both removed potentially important biological information and introduced errors in the detection of gene expression. The removal of variation can be understood in terms of errors in the estimation of the normalization factors. Considering data and errors vectorially, the length of each vector equals, after centering and up to a constant factor, the standard deviation of the data or error. The addition of an error of small magnitude, compared to the data variance, has only a minor effect. However, errors of similar or greater magnitude than the data variance may, depending on the lengths and relative angles of the vectors, severely distort the observed data variance. This in turn causes spurious results in the statistical analyses. Furthermore, the angles between the data and the errors in the normalization factors (considered as vectors) are random, because the data reflect biological variation while the normalization factors respond to technical variation. If the experiment is repeated, even with exactly the same experimental settings, the errors in the normalization factors will vary randomly, causing random spurious results in the downstream analyses. This explains, at least partially, the lack of reproducibility found in transcriptomics studies, especially for the detection of small changes of gene expression, because small variations are the most likely to be distorted by errors in the estimates of the normalization factors. Accordingly, the largest differences in numbers of DEGP detected by conventional versus condition-decomposition methods (Fig. 3a) occurred consistently in the treatments with the smallest magnitudes of gene expression changes, e.g. treatments 28, 29 and 33 (Fig. 3b and Extended Data Fig. 2).
In summary, this study shows that large numbers of genes change in expression level across experimental conditions, often strongly, and too extensively to be ignored in the normalization of high-throughput data. Further, our novel approach, which avoids the prevailing lack-of-variation assumption, demonstrates that current normalization methods likely remove and distort important variation in gene expression. It also offers a means to investigate broad changes in gene expression that have remained hidden to date. We expect this to provide revealing insights into diverse biomolecular processes, particularly those involving substantial numbers of genes, such as cell differentiation, toxic responses, diseases with non-Mendelian inheritance patterns and cancer. After years of lagging behind the advances in genome sequencing, we believe that the procedures presented here will assist efforts to realize the full potential of gene expression profiling.
Acknowledgements
This work was funded by the European Union FP7 projects MODERN (Ref. 309314-2) and MARINA (Ref. 263215), by FEDER through COMPETE (Programa Operacional Factores de Competitividade), by FCT (Fundação para a Ciência e a Tecnologia) through project bio-CHIP (Ref. FCT EXPL/AAG-MAA/0180/2013), and by a PhD grant (Ref. SFRH/BD/63261/2009).
Author Contributions
S.I.L.G., M.J.B.A. and J.J.S.-F. designed the toxicity experiment. S.I.L.G. carried out the experimental work and collected the microarray data. C.P.R. designed and implemented the novel normalization methods. C.P.R. performed the statistical analyses. All the authors jointly discussed the results. C.P.R. drafted the paper, with input from all the authors. All the authors edited the final version of the paper.
Author Information
MIAME-compliant microarray data from the experiment were submitted to the Gene Expression Omnibus (GEO) at the NCBI website (platform: GPL20310; series: GSE69746, GSE69792, GSE69793 and GSE69794). Custom code that reproduces all the reported results starting from the raw microarray data is available at the GitHub repository https://github.com/carlosproca/gene-expr-norm-paper. The authors declare no competing financial interests. Correspondence and requests for materials should be addressed to Carlos P. Roca (carlosp.roca{at}urv.cat) and Janeck J. Scott-Fordsmand (jsf{at}bios.au.dk).
Methods
Test organism
The test species was Enchytraeus crypticus. Individuals were cultured in Petri dishes containing agar medium, under controlled conditions [31].
Exposure media
For copper (Cu) exposure, a natural soil collected at Hygum, Jutland, Denmark was used [31, 32]. For silver (Ag) and nickel (Ni) exposure, the natural standard soil LUFA 2.2 (LUFA Speyer, Germany) was used [31]. The exposure to ultraviolet (UV) radiation was performed in ISO reconstituted water [33].
Test chemicals
The tested Cu forms [31] included copper nitrate (Cu(NO3)2·3H2O > 99%, Sigma Aldrich), Cu nanoparticles (Cu-NPs, 20–30 nm, American Elements) and Cu nanowires (Cu-Nwires, synthesized by reduction of copper(II) nitrate with hydrazine in alkaline medium [34]).
The tested Ag forms [31] included silver nitrate (AgNO3 > 99%, Sigma Aldrich), non-coated Ag nanoparticles (Ag-NPs Non-Coated, 20–30 nm, American Elements), Polyvinylpyrrolidone (PVP)-coated Ag nanoparticles (Ag-NPs PVP-Coated, 20–30 nm, American Elements), and Ag NM300K nanoparticles (Ag NM300K, 15 nm, JRC Repository). The Ag NM300K was dispersed in 4% Polyoxyethylene Glycerol Trioleate and Polyoxyethylene (20) Sorbitan Monolaurate (Tween 20), thus the dispersant was also tested alone as a control (CTdisp).
The tested Ni forms included nickel nitrate (Ni(NO3)2 ·6H2O ≥ 98.5%, Fluka) and Ni nanoparticles (Ni-NPs, 20 nm, American Elements).
Spiking procedure
Spiking for the Cu and Ag materials was done as in previous work [31]. For the Ni materials, the Ni-NPs were added to the soil as powder, following the same procedure as for the Cu materials. Ni(NO3)2, being soluble, was added to the pre-moistened soil as an aqueous solution.
The concentrations tested were selected based on the reproduction effect concentrations EC20 and EC50 for E. crypticus, with 95% confidence intervals, as follows: Cu(NO3)2 EC20/50 = 290/360 mgCu/kg, Cu-NPs EC20/50 = 980/1760 mgCu/kg, Cu-Nwires EC20/50 = 850/1610 mgCu/kg, Cu-Field EC20/50 = 500/1400 mgCu/kg, AgNO3 EC20/50 = 45/60 mgAg/kg, Ag-NPs PVP-Coated EC20/50 = 380/550 mgAg/kg, Ag-NPs Non-Coated EC20/50 = 380/430 mgAg/kg, Ag NM300K EC20/50 = 60/170 mgAg/kg, CTdisp = 4% w/w Tween 20, Ni(NO3)2 EC20/50 = 40/60 mgNi/kg, Ni-NPs EC20/50 = 980/1760 mgNi/kg.
Four biological replicates were performed per test condition, including controls. For Cu exposure, the control condition for all the treatments consisted of soil from a control area at Hygum site, which has a Cu background concentration of 15 mg/kg [32]. For Ag exposure, two control sets were performed: CT (un-spiked LUFA soil, to be the control condition for AgNO3, Ag-NPs PVP-Coated and Ag-NPs Non-Coated treatments) and CTdisp (LUFA soil spiked with the dispersant Tween 20, to be the control condition for the Ag NM300K treatments). For Ni exposure, the control consisted of un-spiked LUFA soil.
Exposure details
In soil (i.e. for Cu, Ag and Ni), exposure followed the standard ERT [35] with the following adaptations: twenty adults with well-developed clitellum were introduced into each test vessel, containing 20 g of moist soil (control or spiked). The organisms were exposed for three and seven days under controlled conditions of photoperiod (16:8 h light:dark) and temperature (20 ± 1 °C), without food. After the exposure period, the organisms were carefully removed from the soil, rinsed in deionized water and frozen in liquid nitrogen. The samples were stored at -80 °C until analysis.
For UV exposure, the test conditions [33] were adapted for E. crypticus [36]. The exposure was performed in 24-well plates, where each well corresponded to a replicate and contained 1 ml of ISO water and five adult organisms with clitellum. The test duration was five days, at 20 ± 1 °C. The organisms were exposed to UV daily, for 15 minutes per day, at two UV intensities (280–400 nm) of 1669.25 ± 50.83 and 1804.08 ± 43.10 mW/m2, corresponding to total UV doses of 7511.6 and 8118.35 J/m2, respectively. The remaining time was spent under standard laboratory illumination (16:8 h photoperiod). UV radiation was provided by a UV lamp (Spectroline XX15F/B, Spectronics Corporation, NY, USA; peak emission at 312 nm), and a cellulose acetate sheet was coupled to the lamp to cut off UVC-range wavelengths [36]. Thirty-two replicates per test condition (including a control without UV radiation) were performed, to obtain 4 biological replicates with 40 organisms each for RNA extraction. After the exposure period, the organisms were carefully removed from the water and frozen in liquid nitrogen. The samples were stored at -80 °C until analysis.
RNA extraction, labeling and hybridization
RNA was extracted from each replicate, which contained a pool of 20 or 40 organisms, for soil and water exposures, respectively. Three biological replicates per test treatment (including controls) were used. Total RNA was extracted using the SV Total RNA Isolation System (Promega). Quantity and purity were measured spectrophotometrically (NanoDrop ND-1000 spectrophotometer), and RNA quality was checked by denaturing formaldehyde agarose gel electrophoresis.
500 ng of total RNA were amplified and labeled with Agilent Low Input Quick Amp Labeling Kit (Agilent Technologies, Palo Alto, CA, USA). Positive controls were added with the Agilent one-color RNA Spike-In Kit. Purification of the amplified and labeled cRNA was performed with RNeasy columns (Qiagen, Valencia, CA, USA).
The cRNA samples were hybridized on custom Agilent Gene Expression Microarrays (4 x 44k format), with a single-color design [30]. Hybridizations were performed using the Agilent Gene Expression Hybridization Kit, and each biological replicate was individually hybridized on one array. The arrays were hybridized at 65 °C, rotating at 10 rpm, for 17 h. Afterwards, the microarrays were washed using the Agilent Gene Expression Wash Buffer Kit and scanned with the Agilent DNA microarray scanner G2505B.
Data acquisition and analysis
Fluorescence intensity data were obtained with Agilent Feature Extraction Software v. 10.7.3.1, using the recommended protocol GE1_107_Sep09. Quality control was done by inspecting the reports on the Agilent Spike-In control probes. Background correction was provided by the Agilent Feature Extraction software. To ensure an optimal comparison between the different normalization methods, only gene probes with good signal quality (flag IsPosAndSignif = True) in all samples were employed in the analyses. This implied the selection of 18,339 gene probes from a total of 43,750. Analyses were performed with R [27] v. 3.2.0, using the R packages plotrix and RColorBrewer, and the Bioconductor [28] v. 3.1 packages genefilter and limma [29].
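The probe filter can be expressed in a few lines of R (a sketch; `is_pos_and_signif` is a hypothetical logical matrix parsed from the Feature Extraction output):

```r
# Keep only gene probes flagged IsPosAndSignif = True in every sample
keep <- rowSums(is_pos_and_signif) == ncol(is_pos_and_signif)
expr <- expr[keep, ]   # 18,339 of 43,750 probes in this dataset
```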
The synthetic data were generated gene by gene as normal variates with mean and variance equal, respectively, to the sample mean and sample variance of the real data. The applied normalization factors were those estimated from the real data with standard-vector condition-decomposition normalization.
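In R, this generation step amounts to the following (sketch; `expr` is the real log2 matrix and `a` the vector of per-sample normalization factors estimated from it):

```r
# Gene-wise normal variates with the real data's means and variances,
# plus the normalization factors estimated from the real dataset
g <- nrow(expr); s <- ncol(expr)
mu  <- rowMeans(expr)
sdv <- apply(expr, 1, sd)
synth <- matrix(rnorm(g * s, mean = mu, sd = sdv), nrow = g)  # recycles per row
synth <- sweep(synth, 2, a, `+`)
```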
Median normalization was performed by subtracting the median of each sample distribution, and then adding the overall median to preserve the global expression level. Quantile normalization was performed as implemented in the limma package.
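Both conventional normalizations reduce to a few lines of R (a sketch, assuming `expr` is the log2 expression matrix):

```r
library(limma)

# Median normalization: subtract each sample's median, then add the
# overall median to preserve the global expression level
meds <- apply(expr, 2, median)
expr_median <- sweep(expr, 2, meds) + median(expr)

# Quantile normalization, as implemented in limma
expr_quantile <- normalizeBetweenArrays(expr, method = "quantile")
```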
The two condition-decomposition normalizations proceeded in the same way: first, 51 independent within-condition normalizations using all genes; then, a final between-condition normalization, iteratively detecting no-variation genes and normalizing until convergence.
No-variation genes were identified with one-sided Kolmogorov-Smirnov (KS) tests, used as goodness-of-fit tests against the uniform distribution and carried out on the greatest p-values obtained from an ANOVA test on the complete dataset. The ANOVA test benefited from the within-condition variances already corrected by the within-condition normalizations. The KS test was rejected at α = 0.001.
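One plausible reading of this detection step, heavily simplified, is the following R sketch (using `genefilter::rowFtests` for the ANOVA; the published procedure differs in details and is iterated together with the normalization):

```r
library(genefilter)

pv <- rowFtests(expr, condition)$p.value   # ANOVA across conditions
alpha <- 0.001
ps <- sort(pv)
# Drop the smallest p-values until the remaining ones are compatible
# with uniformity; genes above the resulting threshold show no
# evidence of variation and are taken as no-variation genes
for (i in seq_along(ps)) {
  t <- if (i == 1) 0 else ps[i - 1]
  u <- (ps[i:length(ps)] - t) / (1 - t)    # null p-values above t are ~ U(0,1)
  # one-sided KS goodness-of-fit test against the uniform distribution
  if (ks.test(u, "punif", alternative = "greater")$p.value >= alpha) break
}
no_var <- pv >= ps[i]
```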
The criterion for convergence for the median condition-decomposition (c.-d.) normalizations was to require that the relative changes in the standard deviation of the normalization factors were less than 1%, or less than 10% for 10 steps in a row. In the case of standard-vector c.-d. normalizations, convergence required that numerical errors were, compared to the estimated statistical errors (see below), less than 1%, or less than 10% for 10 steps in a row. For Fig. 2, due to the very low number of gene probes in some cases, the thresholds for convergence for 10 steps in a row were increased to 80% and 50% for median c.-d. and standard-vector c.-d., respectively.
In the standard-vector c.-d. normalization, the distribution of standard vectors was trimmed in each step to remove the 1% most extreme values of variance.
Differentially expressed gene probes were identified with limma (Fig. 3, Extended Data Fig. 2) or t-tests (Extended Data Fig. 1), using in all cases an FDR threshold of 5%.
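With limma, this detection step looks as follows (a sketch; `expr_norm` is the normalized matrix and `contr` a contrast matrix pairing each treatment with its control, whose construction is omitted here):

```r
library(limma)

design <- model.matrix(~ 0 + condition)
colnames(design) <- levels(condition)
fit <- lmFit(expr_norm, design)
fit <- eBayes(contrasts.fit(fit, contr))
# DEGP for one treatment/control contrast at a 5% FDR (BH-adjusted p-values)
degp <- topTable(fit, coef = 1, number = Inf, p.value = 0.05)
```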
The reference distribution with permutation symmetry, shown in the polar plots of the probability density function in Supplementary Videos 1–3, was calculated with the 6 permutations of the empirical standard vectors. The Watson U² statistic was calculated with the two-sample test [37]. An equal number of samples for comparison was obtained by sampling the permuted standard vectors with replacement.
Mathematical Methods
In a gene expression dataset with g genes, c experimental conditions and n samples per condition, the observed expression levels of gene j in condition k can be expressed in log2-scale as

$$y_j^{(k)} = x_j^{(k)} + a^{(k)}, \qquad (1)$$

where $y_j^{(k)} \in \mathbb{R}^n$, $x_j^{(k)}$ is the vector of true gene expression levels and $a^{(k)}$ is the vector of normalization factors.

Given a sample vector $x \in \mathbb{R}^n$, the mean vector is $\bar{x} = \langle x \rangle\, \mathbf{1}$, with $\langle x \rangle = \frac{1}{n} \sum_{i=1}^{n} x_i$, and the residual vector is $\hat{x} = x - \bar{x}$. Then, (1) can be linearly decomposed into

$$\bar{y}_j^{(k)} = \bar{x}_j^{(k)} + \bar{a}^{(k)}, \qquad (2)$$

$$\hat{y}_j^{(k)} = \hat{x}_j^{(k)} + \hat{a}^{(k)}. \qquad (3)$$

Equations (3) define the within-condition normalizations for each condition k. The scalar values in (2), $\langle y_j^{(k)} \rangle = \langle x_j^{(k)} \rangle + \langle a^{(k)} \rangle$, are used to obtain the equations on condition means. Writing, for each gene j, the condition means as c-dimensional vectors $u_j = (\langle y_j^{(1)} \rangle, \ldots, \langle y_j^{(c)} \rangle)$, with $v_j$ the corresponding vector of true condition means and $b = (\langle a^{(1)} \rangle, \ldots, \langle a^{(c)} \rangle)$, the same decomposition yields

$$\bar{u}_j = \bar{v}_j + \bar{b}, \qquad (4)$$

$$\hat{u}_j = \hat{v}_j + \hat{b}. \qquad (5)$$

The between-condition normalization is defined by (5). Equations (4) reduce to a single number, which is irrelevant to the normalization. The complete solution for each condition is obtained with

$$a^{(k)} = \hat{a}^{(k)} + \hat{b}_k\, \mathbf{1}. \qquad (6)$$
The n samples of gene j in a given condition can be modeled with the random vectors $X_j, Y_j \in \mathbb{R}^n$. Again, $Y_j = X_j + a$, where a is a fixed vector of normalization factors. It can be proved, under fairly general assumptions, that the true standard vectors have zero expected value,

$$E\!\left[ \frac{\hat{X}_j}{\lVert \hat{X}_j \rVert} \right] = 0,$$

whereas the observed standard vectors verify, as long as $\hat{a} \neq 0$,

$$E\!\left[ \frac{\hat{Y}_j}{\lVert \hat{Y}_j \rVert} \right] \neq 0.$$
This motivates the following iterative procedure to solve (3) and (5) (standard-vector normalization): at each step, compute the standard vectors of the data corrected with the current estimate of the normalization factors, update the estimate using the sample mean of the standard vectors, and iterate until this mean vanishes.
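A minimal R sketch of this iteration, for one within-condition normalization, is given below (the step scaling is a simplifying assumption; the exact update rule and the trimming of extreme variances are described in the Supplementary Methods):

```r
standard_vector_normalize <- function(y, tol = 1e-8, max_iter = 1000) {
  a <- rep(0, ncol(y))                  # current normalization factors
  for (t in seq_len(max_iter)) {
    z <- sweep(y, 2, a)                 # apply current factors
    res <- z - rowMeans(z)              # center each gene across samples
    norms <- sqrt(rowSums(res^2))
    w <- res[norms > 0, , drop = FALSE] / norms[norms > 0]
    m <- colMeans(w)                    # sample mean of standard vectors
    if (sqrt(sum(m^2)) < tol) break     # vanishes at convergence
    a <- a + m * median(norms)          # move factors along the detected bias
  }
  a - mean(a)                           # drop the irrelevant global offset
}
```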
At convergence,

$$\frac{1}{g} \sum_{j=1}^{g} \frac{\hat{y}_j - \hat{a}_\infty}{\lVert \hat{y}_j - \hat{a}_\infty \rVert} = 0,$$

which implies that the estimate $\hat{a}_\infty$ recovers the normalization factors $\hat{a}$. Convergence is faster the more symmetric the empirical distribution of the standard vectors is on the unit (n − 2)-sphere. Convergence is optimal with spherically symmetric distributions, such as the Gaussian distribution, because in that case the expected value of the observed standard vectors is exactly parallel to $\hat{a}$.
Assuming no correlation between genes, an approximation of the statistical error at step t can be obtained from the standard error of the mean of the standard vectors,

$$\varepsilon_t \approx \frac{s_t}{\sqrt{g}},$$

where $s_t$ is the sample standard deviation of the standard vectors at step t.
This statistical error is compared with the numerical error to assess convergence.
See Supplementary Methods for a detailed exposition of the mathematical methods, and Supplementary Videos 1–5 for an illustration.