Variation-preserving normalization unveils blind spots in gene expression profiling
===================================================================================

* Carlos P. Roca
* Susana I. L. Gomes
* Mónica J. B. Amorim
* Janeck J. Scott-Fordsmand

## Abstract

RNA-Seq and gene expression microarrays provide comprehensive profiles of gene activity, but lack of reproducibility has hindered their application. A key challenge in the data analysis is the normalization of gene expression levels, which is currently performed following an implicit assumption that most genes are not differentially expressed. Here, we present a mathematical approach to normalization that makes no assumption of this sort. We have found that variation in gene expression is much greater than currently believed, and that it can be measured with available technologies. Our results also explain, at least partially, the problems encountered in transcriptomics studies. We expect this improvement in detection to help efforts to realize the full potential of gene expression profiling, especially in analyses of cellular processes involving complex modulations of gene expression.

Keywords
*   differential gene expression
*   gene expression microarrays
*   RNA-Seq
*   data normalization

## Introduction

Since the discovery of DNA structure by Watson and Crick, molecular biology has progressed increasingly quickly, with rapid advances in sequencing and related genomic technologies. Among these, microarrays and RNA-Seq have been widely adopted to obtain gene expression profiles, by measuring the concentration of tens of thousands of mRNA molecules in single assays (Schena et al., 1995; Lockhart et al., 1996; Duggan et al., 1999; Mortazavi et al., 2008; Wang et al., 2009). Despite their enormous potential (Golub et al., 1999; van ’t Veer et al., 2002; Ivanova et al., 2002; Chi et al., 2003), problems of reproducibility and reliability (Tan et al., 2003; Frantz, 2005; Couzin, 2006) have discouraged their use in some areas, e.g. biomedicine (Michiels et al., 2005; Weigelt and Reis-Filho, 2010; Brettingham-Moore et al., 2011). In more mature microarray technologies, issues such as probe design, cross-hybridization, non-linearities and batch effects (Draghici et al., 2006) have been identified as possible culprits, but the problems persist (Shi et al., 2006; Su et al., 2014).

The normalization of gene expression, which is required to set a common reference level among samples (Smyth and Speed, 2003; Irizarry et al., 2003; Bullard et al., 2010; Garber et al., 2011; Dillies et al., 2013), is also reportedly problematic, affecting the reproducibility of results with both microarray (Shi et al., 2006; Shippy et al., 2006) and RNA-Seq (Su et al., 2014; Bullard et al., 2010; Dillies et al., 2013). Batch effects and their influence on normalization have recently received a great deal of attention (Leek et al., 2010; Reese et al., 2013; Li et al., 2014), resulting in approaches aiming to remove unwanted technical variation caused by differences between batches of samples or by other sources of expression heterogeneity (Listgarten et al., 2010; Gagnon-Bartsch and Speed, 2012; Risso et al., 2014). A different issue, however, is the underlying assumption made by the most widely used normalization methods to date, such as median and quantile normalization (Bolstad et al., 2003) for microarrays, or RPKM (Mortazavi et al., 2008) and TMM (Robinson and Oshlack, 2010) for RNA-Seq, which posit that most genes are not differentially expressed (Dillies et al., 2013; Hicks and Irizarry, 2015). This *lack-of-variation assumption* may seem reasonable for many applications, but it has not been confirmed. Furthermore, results obtained with other technologies, particularly qRT-PCR, suggest that it may not be valid (Shi et al., 2006; Bullard et al., 2010).

Some methods have been proposed to address the issue of the lack-of-variation assumption, based on the use of spike-ins (Lovén et al., 2012), negative control probes (Wu and Aryee, 2010) or negative control genes (Gagnon-Bartsch and Speed, 2012), that is, on external or internal controls that are *known a priori* not to be differentially expressed (Lippa et al., 2010). The applicability of these methods, however, has been limited by this requirement of a priori knowledge, which is rarely available for a sufficiently large number of controls. Thus, in attempts to clarify and overcome limitations imposed by the lack-of-variation assumption, we have developed an approach to normalization that does not assume lack-of-variation and that does not require the use of spike-ins or a priori knowledge of control genes. The analysis of a large gene expression dataset using this approach shows that the assumption can severely undermine the detection of variation in gene expression. We have found that large numbers of differentially expressed genes with substantial expression changes are missed when data are normalized with methods that assume lack-of-variation.

## Results

### Datasets and Normalization Methods

The dataset was obtained from biological triplicates of *Enchytraeus crypticus* (a globally distributed soil organism used in standard ecotoxicity tests), sampled under 51 experimental conditions (42 treatments and 9 controls), involving exposure to several substances, at several concentrations and durations according to a factorial design (Supp. Table 1). Gene expression was measured using a customized high-density oligonucleotide microarray, and the resulting dataset was normalized with four methods. Two of these methods are the most widely used procedures for microarrays, median (or scale) normalization and quantile normalization (Bolstad et al., 2003), whereas the other two, designated *median condition-decomposition normalization* and *standard-vector condition-decomposition normalization*, have been developed for this study.

[Supplementary Table 1:](http://biorxiv.org/content/early/2015/12/04/021212/T1)

Supplementary Table 1: 
Experimental conditions of the toxicity experiment on *E. crypticus*, listed in the same order as they appear in each panel of Figure 1, from left to right.

With the exception of quantile normalization, all the methods used apply a multiplicative factor to the expression levels in each sample, equivalent to the addition of a number in the usual log2-scale for gene expression levels. Solving the *normalization problem* consists of finding these correction factors. The problem can be exactly and linearly decomposed into several sub-problems: one within-condition normalization for each experimental condition and one final between-condition normalization for the condition averages. In the within-condition normalizations, the samples (replicates) subjected to each experimental condition are normalized separately, whereas in the final between-condition normalization the average levels for all conditions are normalized together. Because there are no genes with differential expression in any of the within-condition normalizations, the lack-of-variation assumption only affects the final between-condition normalization. The assumption is avoided by using, in this normalization, expression levels only from *no-variation genes*, i.e. genes that show no evidence of differential expression under a statistical test. Both methods of normalization proposed here follow this condition-decomposition approach.

With median condition-decomposition normalization, all normalizations are performed with median values, as in conventional median normalization, but only no-variation genes are included in the between-condition step. Otherwise, if all genes were used in this final step, the resulting total normalization factors would be exactly the same as those obtained with conventional median normalization.
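The decomposition can be illustrated with a small Python sketch (the analyses in this study were performed with R, so this is not the authors' code). The function names, the toy data, and the choice of aligning medians on their mean are assumptions made only for this example. It shows that restricting the between-condition step to no-variation genes preserves a real two-unit (log2) change, whereas using all genes removes it and creates a spurious change elsewhere.

```python
from statistics import mean, median

def median_offsets(profiles, genes=None):
    """Log2 offsets that align the medians of `profiles` on their mean.
    `genes` optionally restricts the medians to a subset of gene indices."""
    if genes is None:
        genes = range(len(profiles[0]))
    meds = [median(p[g] for g in genes) for p in profiles]
    target = mean(meds)
    return [m - target for m in meds]

def shift(profile, offset):
    """Subtract one additive (log2-scale) offset from a whole sample."""
    return [x - offset for x in profile]

def condition_decomposition(data, between_genes=None):
    """data: {condition: [replicate profiles]}; returns normalized data."""
    # Step 1: one within-condition normalization per condition, all genes.
    within = {}
    for cond, reps in data.items():
        offs = median_offsets(reps)
        within[cond] = [shift(r, o) for r, o in zip(reps, offs)]
    # Step 2: one between-condition normalization on the condition averages,
    # optionally restricted to the (hypothetical) no-variation genes.
    conds = list(within)
    averages = [[mean(col) for col in zip(*within[c])] for c in conds]
    offs = median_offsets(averages, between_genes)
    return {c: [shift(r, o) for r in within[c]] for c, o in zip(conds, offs)}

def gene_difference(norm, g):
    """Mean log2 difference of gene g between conditions B and A."""
    return mean([r[g] for r in norm["B"]]) - mean([r[g] for r in norm["A"]])

# Toy data: genes 0-1 do not vary, genes 2-4 are up by 2 log2 units in B.
true_a = [4.0, 5.0, 6.0, 7.0, 8.0]
true_b = [4.0, 5.0, 8.0, 9.0, 10.0]
data = {
    "A": [[x + 0.0 for x in true_a], [x + 0.4 for x in true_a]],
    "B": [[x + 0.5 for x in true_b], [x + 0.9 for x in true_b]],
}
kept = condition_decomposition(data, between_genes=[0, 1])
lost = condition_decomposition(data)  # all genes in the final step
```

In this toy case `gene_difference(kept, 2)` recovers the change of 2, while `gene_difference(lost, 2)` is 0 and gene 0 acquires an artificial change of −2.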

For standard-vector condition-decomposition normalization, a vectorial procedure was developed to carry out each normalization step. The samples of any experimental condition, in a properly normalized dataset, must be *exchangeable*. In mathematical terms, the expression levels of each gene can be considered as an *s*-dimensional vector, where *s* is the number of samples for the experimental condition. After standardization (mean subtraction and variance scaling), these standard vectors are located in a (*s* − 2)-dimensional hypersphere. The exchangeability mentioned above implies that, when properly normalized, the distribution of standard vectors must be invariant with respect to permutations of the sample labels and must have zero expected value. These properties make it possible to obtain, under fairly general assumptions, a robust estimator of the normalization factors.
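A minimal Python sketch of the standardization step may help to visualize the geometry. It assumes scaling by the sample standard deviation (divisor *s* − 1), which is a choice made for this example; the gene values are invented. For triplicates (*s* = 3), every standard vector has zero sum and the same length, so all of them lie on a circle, i.e. a 1-dimensional sphere.

```python
from math import sqrt
from statistics import mean, stdev

def standard_vector(levels):
    """Mean-subtract and variance-scale the log2 levels of one gene
    across the s samples of one experimental condition."""
    m, sd = mean(levels), stdev(levels)
    return [(x - m) / sd for x in levels]

# Three hypothetical genes measured in a triplicate (s = 3).
genes = [[7.1, 7.4, 6.9], [10.2, 9.8, 10.9], [3.0, 3.1, 5.0]]
vectors = [standard_vector(g) for g in genes]

# Each standard vector sums to zero and has length sqrt(s - 1).
sums = [sum(v) for v in vectors]
lengths = [sqrt(sum(x * x for x in v)) for v in vectors]
```

The two constraints (zero sum, fixed length) remove two degrees of freedom from the *s*-dimensional vector, which is why the standard vectors occupy a sphere of dimension *s* − 2.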

To further explore and compare outcomes of the normalization methods, they were also applied to a synthetic random dataset. This dataset was generated with identical means and variances gene-by-gene to the real dataset, and with the assumption that all genes were no-variation genes. In addition, normalization factors were applied, equal to those obtained from the real dataset. Thus, the synthetic dataset was very similar to the real one, while complying by construction with the lack-of-variation assumption.

### Normalization Results

Figure 1 displays the results of applying the four normalization methods to the real and synthetic datasets. Each panel shows the interquartile ranges of expression levels for the 153 samples, grouped in triplicates exposed to each experimental condition. Median (second row) and quantile normalization (third row) yielded similar outputs for both datasets. In contrast, the condition-decomposition normalizations (fourth and fifth rows) identified marked differences, detecting much greater variation between conditions in the real dataset than in the synthetic one. Conventional median normalization makes the median of each sample the same, while quantile normalization makes the full distribution of each sample the same. Hence, if there were differences in medians or distributions of gene expression between experimental conditions, both methods would have removed them. Figures 1G,I show that such variation between conditions was present in the real dataset.

![Figure 1:](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2015/12/04/021212/F1.medium.gif)

[Figure 1:](http://biorxiv.org/content/early/2015/12/04/021212/F1)

Figure 1: 
The condition-decomposition normalizations detected a large amount of between-condition variation in the real expression data, in contrast with conventional methods. All 10 panels show interquartile ranges of expression levels of the 153 samples, grouped by the 51 experimental conditions (Ag, blue-yellow; Cu, red-cyan; Ni, green-orange; UV, purple; see Supp. Table 1). Black lines indicate medians. Rows and columns correspond to normalization methods and datasets (as labeled), respectively. In the synthetic dataset no gene was differentially expressed between any two conditions.

### Influence of no-variation genes on normalization

To clarify how the condition-decomposition normalizations preserved the variation between conditions, we studied the influence of the choice of no-variation genes in the final between-condition normalization. To this end, we obtained the between-condition variation with both methods in two families of cases. In one family, no-variation genes were chosen in decreasing order of *p*-values from an ANOVA test. In the other family, genes were chosen at random. The first option was similar to the approach implemented to obtain the results presented in Figures 1G–J, with the difference that there the number of genes was chosen automatically by a statistical test. As shown in Figure 2A, for the real dataset the random choice of genes resulted in $n^{-1/2}$ decays (*n* being the number of chosen genes), followed by a plateau. The $n^{-1/2}$ decays reflect the standard errors of the estimators of the normalization factors. Selecting the genes by decreasing *p*-values, however, yielded a completely different result. Up to a certain number of genes, the variance remained similar, but for larger numbers of genes the variance dropped rapidly. Figure 2A shows, therefore, that between-condition variation was removed as soon as the between-condition normalizations used genes that varied in expression level across experimental conditions. The big circles in Figure 2A indicate the working points of the normalizations used to generate the results displayed in Figures 1G,I. In fact, these points slightly underestimated the variation between conditions. Although the statistical test for identifying no-variation genes ensured that there was no evidence of variation, inevitably the expression of some selected genes varied across conditions.
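The inverse-square-root decay of a standard error can be checked with a short simulation. The sketch below is a simplified stand-in, not the procedure used in this study: it assumes that an offset is estimated as the mean of *n* independent, normally distributed gene levels, and it does not model the plateau seen in the real data. The function name and parameters are hypothetical.

```python
import random
from statistics import stdev

def offset_estimate_spread(n_genes, n_trials=2000, gene_sd=1.0, seed=0):
    """Standard deviation, over repeated random draws of genes, of an offset
    estimated as the mean of n_genes noisy log2 levels (true offset = 0)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        draws = [rng.gauss(0.0, gene_sd) for _ in range(n_genes)]
        estimates.append(sum(draws) / n_genes)
    return stdev(estimates)

spread_100 = offset_estimate_spread(100)
spread_400 = offset_estimate_spread(400)
# Quadrupling the number of genes should roughly halve the spread.
ratio = spread_100 / spread_400
```

With unit gene-level noise, the spread for 100 genes comes out close to 0.1 and the ratio close to 2, as expected for a standard error proportional to $n^{-1/2}$.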

![Figure 2:](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2015/12/04/021212/F2.medium.gif)

[Figure 2:](http://biorxiv.org/content/early/2015/12/04/021212/F2)

Figure 2: 
The selection of genes in the final between-condition normalization was crucial to preserve variation between conditions. The panels show the detected variation as a function of the number of gene probes used in the between-condition normalization of the real dataset (A) and synthetic dataset (B). Between-condition variation is represented as the standard deviation of the within-condition mean averages (averages of sample mean expression levels, for all samples under the condition). See Supp. Fig. 1 for within-condition median averages, with similar results. Each point in either of the panels indicates the variation obtained with one complete normalization (black circles, median condition-decomposition normalization; blue circles, standard-vector condition-decomposition normalization). Gene probes were selected in two ways: randomly (empty circles) or in decreasing order of *p*-values (filled circles). Big circles show the working points of the algorithms whose results are depicted in Figures 1G–J. Black dashed lines show references for $n^{-1/2}$ decays, with the same values in both panels.

Figure 2B displays the results obtained with the synthetic dataset. There were no plateaus when no-variation genes were chosen randomly, only $n^{-1/2}$ decays, and small differences when no-variation genes were selected by decreasing *p*-values. Big circles show that working points were selected with much larger numbers of genes in the synthetic dataset (Figs. 1H,J) than in the real dataset (Figs. 1G,I). The residual variation, produced by errors in the estimation of the normalization factors, was much smaller than the variation detected in the real dataset, especially for standard-vector condition-decomposition normalization. Overall, Figure 2 shows that the between-condition variation pictured in Figures 1G,I is not an artifact caused by using an exceedingly small or extremely particular set of genes in the final between-condition normalization, but that this variation originated from the real dataset.

### Differential Gene Expression

Finally, Figure 3A shows the numbers of differentially expressed gene probes (DEGP), identified after normalizing with the four methods, for each of the 42 experimental treatments versus the corresponding control (Supp. Table 2). Compared to conventional methods, the number of DEGP detected with the condition-decomposition normalizations was much larger under most treatments, including some whose number of DEGP was larger by more than one order of magnitude. These are statistically significant changes of gene expression, i.e. changes that cannot be explained by chance. More important is the scale of the detected variation, as illustrated by the boxplots in Figure 3C showing absolute fold changes of DEGP detected after standard-vector condition-decomposition normalization. For all treatments, the entire interquartile range of absolute fold change is above 1.5-fold, and for more than two thirds of the treatments the median absolute fold change is greater than 2. This amount of gene expression variation cannot be neglected, and warrants further research to explore its biological significance.
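For readers less used to the log2 scale, the following Python sketch shows how the 1.5-fold and 2-fold references translate into absolute differences of log2 expression levels. The function name and the example levels are invented for illustration.

```python
from math import log2

def abs_log2_fold_change(treatment_log2, control_log2):
    """Absolute fold change, expressed as a difference of log2 levels."""
    return abs(treatment_log2 - control_log2)

# Reference lines on the log2 scale.
REF_1_5_FOLD = log2(1.5)  # roughly 0.585
REF_2_FOLD = log2(2.0)    # exactly 1.0

# Hypothetical probe: log2 level 9.3 under treatment, 8.1 in the control.
example = abs_log2_fold_change(9.3, 8.1)
exceeds_2_fold = example > REF_2_FOLD
```

A difference of 1.2 log2 units, as in this example, corresponds to a fold change of about 2.3.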

[Supplementary Table 2:](http://biorxiv.org/content/early/2015/12/04/021212/T2)

Supplementary Table 2: 
Treatment vs control comparisons, listed in increasing number of differentially expressed gene probes (DEGP) obtained with standard-vector condition-decomposition normalization and limma statistical analysis. This is the same order as in Figure 3, from left to right.

![Figure 3:](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2015/12/04/021212/F3.medium.gif)

[Figure 3:](http://biorxiv.org/content/early/2015/12/04/021212/F3)

Figure 3: 
The condition-decomposition normalizations detected much larger numbers of differentially expressed gene probes (DEGP), with substantial fold changes. A: Number of DEGP for each treatment vs control comparison, obtained after applying the four normalization methods (empty black circles, median normalization; empty red triangles, quantile normalization; filled green circles, median condition-decomposition normalization; filled blue triangles, standard-vector condition-decomposition normalization). Significant differential expression was identified with the R/Bioconductor package limma (see Supp. Fig. 2 for results with t-tests). Lower panel shows boxplots of absolute values of DEGP fold changes (absolute differences of log2 expression levels), also per treatment vs control comparison, obtained with quantile normalization (B) and standard-vector condition-decomposition normalization (C). Boxplots are colored by treatment, with the same color code as in Fig. 1. In both panels comparisons are ordered according to the number of DEGP identified with standard-vector condition-decomposition normalization, increasing from left to right (Supp. Table 2). Dashed horizontal lines in the lower panel indicate references of 1.5-fold and 2-fold changes.

![Supplementary Figure 1:](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2015/12/04/021212/F4.medium.gif)

[Supplementary Figure 1:](http://biorxiv.org/content/early/2015/12/04/021212/F4)

Supplementary Figure 1: 
Representing between-condition variation as the standard deviation of the within-condition median averages (averages of sample median expression levels, for all samples under the condition) yields similar results to those obtained with within-condition mean averages (Fig. 2). The panels show the detected variation as a function of the number of gene probes used in the between-condition normalization of the real dataset (A) and synthetic dataset (B). Labeling is the same as in Fig. 2. Each point in either of the panels indicates the variation obtained with one complete normalization (black circles, median condition-decomposition normalization; blue circles, standard-vector condition-decomposition normalization). Gene probes were selected in two ways: randomly (empty circles) or in decreasing order of *p*-values (filled circles). Big circles show the working points of the algorithms whose results are depicted in Figures 1G–J. Black dashed lines show references for $n^{-1/2}$ decays, with the same values in both panels.

![Supplementary Figure 2:](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2015/12/04/021212/F5.medium.gif)

[Supplementary Figure 2:](http://biorxiv.org/content/early/2015/12/04/021212/F5)

Supplementary Figure 2: 
With t-tests, the condition-decomposition normalizations also detected much larger numbers of differentially expressed gene probes (DEGP). The figure shows the number of DEGP obtained with a statistical analysis based on t-tests instead of limma (Fig. 3A). Labeling is the same as in Fig. 3A (empty black circles, median normalization; empty red triangles, quantile normalization; filled green circles, median condition-decomposition normalization; filled blue triangles, standard-vector condition-decomposition normalization). Treatment vs control comparisons are ordered according to the number of DEGP identified with standard-vector condition-decomposition normalization, increasing from left to right. This order (not shown) was similar but not exactly the same as in Fig. 3A.

## Discussion

The variation between medians displayed in Figures 1G,I may seem surprising, given routine expectations based on current methods (Figs. 1C,E). Nevertheless, this variation inevitably results from the imbalance between over- and under-expressed genes. As an illustration, let us consider a case with two experimental conditions, in which the average expression of a given gene is less than the distribution median under one condition, but greater than the median under the other. The variation of this gene alone will change the value of the median to the expression level of the next ranked gene. Therefore, if the number of over-expressed genes is different from the number of under-expressed genes, and enough changes cross the median boundary, then the median will substantially differ between conditions. Only when the differential expression is balanced or small enough will the median stay the same. This argument applies equally to any other quantile in the distribution of gene expression. Transcriptional amplification is an extreme example of change in the distribution of expression levels (Lovén et al., 2012), which can nevertheless be properly normalized with condition-decomposition methods, and without resorting to spike-ins as long as some genes are not differentially expressed.
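The median argument can be reproduced with a toy example in Python. The eleven expression levels below are invented; the point is only to show how single crossings, imbalanced crossings and balanced crossings of the median boundary affect the median.

```python
from statistics import median

# Eleven hypothetical genes with log2 levels 1..11; the median is 6.
control = [float(x) for x in range(1, 12)]

# One gene crosses the median boundary (2 -> 8): the median moves to the
# expression level of the next ranked gene.
one_up = control[:]
one_up[1] += 6.0

# Three genes cross upwards (2, 3, 4 -> 8, 9, 10): the median moves further.
three_up = control[:]
for i in (1, 2, 3):
    three_up[i] += 6.0

# One gene crosses upwards and one downwards (2 -> 8, 10 -> 4): the
# crossings are balanced and the median stays where it was.
balanced = control[:]
balanced[1] += 6.0
balanced[9] -= 6.0

medians = {
    "control": median(control),
    "one_up": median(one_up),
    "three_up": median(three_up),
    "balanced": median(balanced),
}
```

Here the medians are 6, 7, 8 and 6, respectively, so a method that forces all sample medians to coincide would erase the shifts caused by the imbalanced cases.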

An important feature of the approaches to normalization proposed here (linear decomposition into normalization sub-problems per condition, and standard-vector normalization for each sub-problem) is that they do not depend on any particular aspect of the technology of gene expression microarrays or RNA-Seq. The numbers in the input data are interpreted as measured concentrations of mRNA molecules, in order to identify the normalization factors, irrespective of whether the concentrations were obtained from fluorescence intensities of hybridized cDNA (microarrays) or from counts of fragments read of mRNA sequences (RNA-Seq). Nevertheless, we consider that specific within-sample corrections for each technology are still necessary and must be applied *before* the between-sample normalizations proposed here. Examples include background correction for microarrays or gene-length normalization (RPKM) for RNA-Seq. Equally, methods that address the influence of biological or technical confounding factors on downstream analyses should be applied when necessary, *after* normalizing.

The lack-of-variation assumption underlying the current methods of normalization was self-fulfilling, removing variation in gene expression that was present in the real dataset. Moreover, it had negative consequences for downstream analyses, as it both removed potentially important biological information and introduced errors in the detection of gene expression. A removal of variation can be understood as errors in the estimation of normalization factors. Considering data and errors vectorially, the length of each vector equals, after centering and up to a constant factor, the standard deviation of the data or error. The addition of an error of small magnitude, compared to the data variance, would have only a minor effect. However, errors of similar or greater magnitude than the data variance may, depending on the lengths and relative angles of the vectors, severely distort the observed data variance. This will in turn cause spurious results in the statistical analyses. Furthermore, the angles between the data and the correct normalization factors (considered as vectors) are random. Data reflect biological variation, while normalization factors respond to technical variation. If the experiment is repeated, even with exactly the same experimental settings, the errors in the normalization factors will vary randomly, causing random spurious results in the downstream analyses. This explains, at least partially, the lack of reproducibility found in transcriptomics studies, especially for the detection of small changes of gene expression, because small variations are most likely to be distorted by errors in the estimates of normalization factors. Accordingly, the largest differences in numbers of DEGP detected by conventional compared to condition-decomposition methods (Fig. 3A) occurred consistently in the treatments with the smallest magnitudes of gene expression changes, e.g. treatments 28, 29 and 33 (Figs. 3B,C).
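One way to make the vector argument concrete is the law of cosines. Writing $\mathbf{d}$ for the centered data vector, $\mathbf{e}$ for the centered error in the normalization factors, and $\theta$ for the angle between them (symbols introduced here for illustration only), the squared length of the observed vector is

$$
\lVert \mathbf{d} + \mathbf{e} \rVert^{2} = \lVert \mathbf{d} \rVert^{2} + \lVert \mathbf{e} \rVert^{2} + 2\,\lVert \mathbf{d} \rVert\,\lVert \mathbf{e} \rVert \cos\theta .
$$

When $\lVert \mathbf{e} \rVert \ll \lVert \mathbf{d} \rVert$, the last two terms are small compared with the first and the observed variance stays close to the true one. When $\lVert \mathbf{e} \rVert$ is comparable to or larger than $\lVert \mathbf{d} \rVert$, the result depends strongly on $\theta$: for $\cos\theta$ near $-1$ the observed variance can almost vanish, and for $\cos\theta$ near $1$ it can be several times the true value. Because $\theta$ is random from one experiment to the next, so is the distortion.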

In summary, this study proves that large numbers of genes change in expression level (often strongly) across experimental conditions, and too extensively to ignore in the normalization of gene expression data. Further, our approach, which avoids the prevailing lack-of-variation assumption, demonstrates that current normalization methods likely remove and distort important variation in gene expression. It also offers a means to investigate this variation, which we expect to provide revealing insights about diverse biomolecular processes, particularly those involving substantial numbers of genes, such as cell differentiation, toxic responses, diseases with non-Mendelian inheritance patterns and cancer. Gene expression profiling has lagged behind the advances in genome sequencing for years; we believe that the procedures presented here will assist efforts to realize its full potential.

## Data Deposition and Code Availability

MIAME-compliant microarray data from the experiment were submitted to the Gene Expression Omnibus (GEO) at the NCBI website (platform: GPL20310; series: GSE69746, GSE69792, GSE69793 and GSE69794). Custom code that reproduces all the reported results starting from the raw microarray data is available at the GitHub repository [https://github.com/carlosproca/gene-expr-norm-paper](https://github.com/carlosproca/gene-expr-norm-paper).

## Author Contributions

S.I.L.G., M.J.B.A. and J.J.S.-F. designed the toxicity experiment. S.I.L.G. carried out the experimental work and collected the microarray data. C.P.R. designed and implemented the novel normalization methods. C.P.R. performed the statistical analyses. All the authors jointly discussed the results. C.P.R. drafted the paper, with input from all the authors. All the authors edited the final version of the paper.

## Materials and Methods

### Test Organism and Exposure Media

The test species was *Enchytraeus crypticus*. Individuals were cultured in Petri dishes containing agar medium, in controlled conditions (Gomes et al., 2015b).

For copper (Cu) exposure, a natural soil collected at Hygum, Jutland, Denmark was used (Gomes et al., 2015b; Scott-Fordsmand et al., 2000). For silver (Ag) and nickel (Ni) exposure, the natural standard soil LUFA 2.2 (LUFA Speyer, Germany) was used (Gomes et al., 2015b). The exposure to ultra-violet (UV) radiation was done in ISO reconstituted water (OECD, 2004a).

### Test Chemicals

The tested Cu forms (Gomes et al., 2015b) included copper nitrate (Cu(NO3)2·3H2O, > 99%, Sigma Aldrich), Cu nanoparticles (Cu-NPs, 20–30 nm, American Elements) and Cu nanowires (Cu-Nwires, synthesized by reduction of copper (II) nitrate with hydrazine in alkaline medium (Chang et al., 2005)).

The tested Ag forms (Gomes et al., 2015b) included silver nitrate (AgNO3, > 99%, Sigma Aldrich), non-coated Ag nanoparticles (Ag-NPs Non-Coated, 20–30 nm, American Elements), polyvinylpyrrolidone (PVP)-coated Ag nanoparticles (Ag-NPs PVP-Coated, 20–30 nm, American Elements), and Ag NM300K nanoparticles (Ag NM300K, 15 nm, JRC Repository). The Ag NM300K was dispersed in 4% polyoxyethylene glycerol trioleate and polyoxyethylene (20) sorbitan monolaurate (Tween 20); thus the dispersant was tested alone as a control (CTdisp).

The tested Ni forms included nickel nitrate (Ni(NO3)2·6H2O, ≥ 98.5%, Fluka) and Ni nanoparticles (Ni-NPs, 20 nm, American Elements).

### Spiking Procedure

Spiking for the Cu and Ag materials was done as in previous work (Gomes et al., 2015b). For the Ni materials, the Ni-NPs were added to the soil as powder, following the same procedure as for the Cu materials. NiNO3, being soluble, was added to the pre-moistened soil as aqueous dispersions.

The concentrations tested were selected based on the reproduction effect concentrations EC20 and EC50 for *E. crypticus*, within 95% confidence intervals, being: CuNO3 EC20/50 = 290/360 mgCu/kg, Cu-NPs EC20/50 = 980/1760 mgCu/kg, Cu-Nwires EC20/50 = 850/1610 mgCu/kg, Cu-Field EC20/50 = 500/1400 mgCu/kg, AgNO3 EC20/50 = 45/60 mgAg/kg, Ag-NP PVP-coated EC20/50 = 380/550 mgAg/kg, Ag-NP Non-coated EC20/50 = 380/430 mgAg/kg, Ag NM300K EC20/50 = 60/170 mgAg/kg, CTdisp = 4% w/w Tween 20, NiNO3 EC20/50 = 40/60 mgNi/kg, Ni-NPs EC20/50 = 980/1760 mgNi/kg.

Four biological replicates were performed per test condition, including controls. For Cu exposure, the control condition for all the treatments consisted of soil from a control area at Hygum site, which has a Cu background concentration of 15 mg/kg (Scott-Fordsmand et al., 2000). For Ag exposure, two control sets were performed: CT (un-spiked LUFA soil, to be the control condition for AgNO3, Ag-NPs PVP-Coated and Ag-NPs Non-Coated treatments) and CTdisp (LUFA soil spiked with the dispersant Tween 20, to be the control condition for the Ag NM300K treatments). For Ni exposure, the control consisted of un-spiked LUFA soil.

### Exposure Details

In soil (i.e. for Cu, Ag and Ni), exposure followed the standard Enchytraeid Reproduction Test (ERT; OECD, 2004b) with the following adaptations: twenty adults with well-developed clitellum were introduced in each test vessel, containing 20 g of moist soil (control or spiked). The organisms were exposed for three and seven days under controlled conditions of photoperiod (16:8 h light:dark) and temperature (20 ± 1 °C), without food. After the exposure period, the organisms were carefully removed from the soil and frozen in liquid nitrogen. The samples were stored at −80 °C until analysis.

For UV exposure, the test conditions (OECD, 2004a) were adapted for *E. crypticus* (Gomes et al., 2015a). The exposure was performed in 24-well plates, where each well corresponds to a replicate and contains 1 ml of ISO water and five adult organisms with clitellum. The test duration was five days, at 20 ± 1 °C. The organisms were exposed to UV on a daily basis, for 15 minutes per day, at two UV intensities (280–400 nm) of 1669.25 ± 50.83 and 1804.08 ± 43.10 mW/m², corresponding to total UV doses of 7511.6 and 8118.35 J/m², respectively. The remaining time was spent under standard laboratory illumination (16:8 h photoperiod). UV radiation was provided by a UV lamp (Spectroline XX15F/B, Spectronics Corporation, NY, USA, peak emission at 312 nm), and a cellulose acetate sheet was coupled to the lamp to cut off UVC-range wavelengths (Gomes et al., 2015a). Thirty-two replicates per test condition (including control without UV radiation) were performed to obtain 4 biological replicates with 40 organisms each for RNA extraction. After the exposure period, the organisms were carefully removed from the water and frozen in liquid nitrogen. The samples were stored at −80 °C until analysis.
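The reported total UV doses follow from the intensities and the exposure schedule (15 minutes per day over five days). The short Python check below reproduces the arithmetic; the function name is introduced only for this example.

```python
def total_uv_dose(irradiance_mw_per_m2, minutes_per_day, days):
    """Total dose in J/m^2 from an irradiance in mW/m^2.
    1 mW/m^2 sustained for 1 s delivers 0.001 J/m^2."""
    exposure_seconds = minutes_per_day * 60 * days
    return irradiance_mw_per_m2 / 1000.0 * exposure_seconds

# Mean intensities reported above, 15 min per day for 5 days (4500 s).
low_dose = total_uv_dose(1669.25, 15, 5)
high_dose = total_uv_dose(1804.08, 15, 5)
```

Both values agree with the reported doses of 7511.6 and 8118.35 J/m² to within rounding.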

### RNA Extraction, Labeling and Hybridization

RNA was extracted from each replicate, which contained a pool of 20 and 40 organisms, for soil and water exposure, respectively. Three biological replicates per test treatment (including controls) were used. Total RNA was extracted using SV Total RNA Isolation System (Promega). The quantity and purity were measured spectrophotometrically with a nanodrop (NanoDrop ND-1000 Spectrophotometer) and its quality checked by denaturing formaldehyde agarose gel electrophoresis.

500 ng of total RNA were amplified and labeled with Agilent Low Input Quick Amp Labeling Kit (Agilent Technologies, Palo Alto, CA, USA). Positive controls were added with the Agilent one-color RNA Spike-In Kit. Purification of the amplified and labeled cRNA was performed with RNeasy columns (Qiagen, Valencia, CA, USA).

The cRNA samples were hybridized on custom Gene Expression Agilent Microarrays (4 x 44k format), with a single-color design (Castro-Ferreira et al., 2014). Hybridizations were performed using the Agilent Gene Expression Hybridization Kit and each biological replicate was individually hybridized on one array. The arrays were hybridized at 65 °C with a rotation of 10 rpm, for 17 h. Afterwards, microarrays were washed using Agilent Gene Expression Wash Buffer Kit and scanned with the Agilent DNA microarray scanner G2505B.

### Data Acquisition and Analysis

Fluorescence intensity data was obtained with Agilent Feature Extraction Software v. 10.7.3.1, using recommended protocol GE1\_107\_Sep09. Quality control was done by inspecting the reports on the Agilent Spike-in control probes. Background correction was provided by Agilent Feature Extraction software. To ensure an optimal comparison between the different normalization methods, only gene probes with good signal quality (flag IsPosAndSignif = True) in all samples were employed in the analyses. This implied the selection of 18,339 gene probes from a total of 43,750. Analyses were performed with R (R Core Team, 2015) v. 3.2.2, using R packages plotrix and RColorBrewer, and with Bioconductor (Huber et al., 2015) v. 3.1 packages genefilter and limma (Ritchie et al., 2015).

The synthetic data was generated gene by gene as normal variates with mean and variance equal, respectively, to the sample mean and sample variance of the real data. The applied normalization factors were those detected from the real data with standard-vector condition-decomposition normalization.
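
A minimal Python sketch of this kind of generator is given below for illustration. It assumes that the mean and variance are taken per gene across all samples, and it uses toy values and hypothetical normalization factors rather than the real data or the factors detected from them.

```python
import random
from statistics import mean, stdev

def synthetic_dataset(real, factors, seed=0):
    """real: list of genes, each a list of log2 expression values (one per sample).
    factors: one normalization factor per sample, added to every gene."""
    rng = random.Random(seed)
    synthetic = []
    for gene in real:
        mu, sd = mean(gene), stdev(gene)   # sample mean and sample standard deviation
        synthetic.append([rng.gauss(mu, sd) + a for a in factors])
    return synthetic

real = [[8.1, 8.4, 7.9, 8.2], [5.0, 5.6, 5.3, 5.1]]   # toy values, not real data
factors = [0.3, -0.1, 0.0, -0.2]                       # hypothetical normalization factors
synth = synthetic_dataset(real, factors)
```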

Median normalization was performed by subtracting the median of each sample distribution, and then adding the overall median to preserve the global expression level. Quantile normalization was performed as implemented in the limma package.
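
As an illustration, median normalization can be sketched in a few lines of Python (the analyses reported here were carried out in R). The sketch assumes that the "overall median" is the median of all values in the dataset; the function name and the toy values are illustrative.

```python
from statistics import median

def median_normalize(samples):
    """samples: one list of log2 expression values per sample (same gene order)."""
    overall = median(v for s in samples for v in s)   # assumed: median of all values
    return [[v - median(s) + overall for v in s] for s in samples]

samples = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]]       # toy values
normalized = median_normalize(samples)
```

After this step every sample has the same median, so any global shift between samples, whether technical or biological, is removed.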

The two condition-decomposition normalizations proceeded in the same way: first, 51 independent within-condition normalizations using all genes; then, a final between-condition normalization, iteratively detecting no-variation genes and normalizing until convergence.

No-variation genes were identified with one-sided Kolmogorov-Smirnov tests, as goodness-of-fit tests against the uniform distribution, carried out on the greatest *p*-values obtained from an ANOVA test on the complete dataset (see below). The ANOVA test benefited from the already corrected within-condition variances, provided by the within-condition normalizations. The KS test was rejected at *α* = 0.001.

The criterion for convergence for the median condition-decomposition (CD) normalizations was to require that the relative changes in the standard deviation of the normalization factors were less than 1%, or less than 10% for 10 steps in a row. In the case of standard-vector CD normalizations, convergence required that numerical errors were, compared to the estimated statistical errors (see below), less than 1%, or less than 10% for 10 steps in a row. For Figure 2 and Supplementary Figure 1, due to the very low number of gene probes in some cases, the thresholds for convergence for 10 steps in a row were increased to 80% and 50%, respectively, for median CD and standard-vector CD normalization.
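
The stopping rule can be expressed compactly. The Python sketch below is one possible reading of the criterion, with hypothetical parameter names; `rel_changes` stands for the sequence of relative changes (or relative numerical errors) recorded at each step.

```python
def has_converged(rel_changes, strict=0.01, loose=0.10, run=10):
    """rel_changes: relative change recorded at each step so far, oldest first.
    Converged if the latest change is below `strict`, or if the last `run`
    changes are all below `loose`."""
    if not rel_changes:
        return False
    if rel_changes[-1] < strict:
        return True
    recent = rel_changes[-run:]
    return len(recent) == run and all(c < loose for c in recent)
```

The relaxed thresholds used for Figure 2 and Supplementary Figure 1 would correspond to a larger value of `loose`.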

In standard-vector CD normalization, the distribution of standard vectors was trimmed in each step to remove the 1% most extreme values of variance.

Differentially expressed gene probes were identified with limma (Fig. 3) or t-tests (Supp. Fig. 2), using in all cases a FDR threshold of 5%.

The reference distribution with permutation symmetry shown in the polar plots of the probability density function in Supplementary Movies 1–3 was calculated with the 6 permutations of the empirical standard vectors. The Watson *U*² statistic was calculated with the two-sample test (Durbin, 1973). An equal number of samples for comparison was obtained by sampling the permuted standard vectors with replacement.

### Mathematical Methods

In a gene expression dataset with *g* genes, *c* experimental conditions and *n* samples per condition, the *observed* expression levels of gene *j* in condition *k*, ![Graphic][1]</img>, can be expressed in log2-scale as ![Formula][2]</img>  where ![Graphic][3]</img> is the vector of *true* gene expression levels and **a**(k) is the vector of normalization factors.

Given a sample vector **x**, the mean vector is ![Graphic][4]</img>, and the residual vector is ![Graphic][5]</img>.

Then, (1) can be linearly decomposed into

![Formula][6]</img> ![Formula][7]</img> 
Equations (3) define the within-condition normalizations for each condition *k*. The scalar values in (2) are used to obtain the equations on condition means,

![Formula][8]</img> ![Formula][9]</img> 
The between-condition normalization is defined by (5). Equations (4) reduce to a single number, which is irrelevant to the normalization. The complete solution for each condition is obtained with ![Graphic][10]</img>.

The *n* samples of gene *j* in a given condition can be modeled with the random vectors ![Graphic][11]</img>. Again, **Y***j* = **X***j* + **a**, where **a** is a fixed vector of normalization factors. It can be proved, under fairly general assumptions, that the true standard vectors have zero expected value

![Formula][12]</img> 
whereas the observed standard vectors verify, as long as **a** ≠ 0,

![Formula][13]</img> 
This motivates the following iterative procedure to solve (3) and (5) (*standard-vector normalization*):

![Formula][14]</img> ![Formula][15]</img> ![Formula][16]</img> 
At convergence, ![Graphic][17]</img>, which implies ![Graphic][18]</img> and ![Graphic][19]</img>. Convergence is faster the more symmetric the empirical distribution of ![Graphic][20]</img> is on the unit (*n* − 2)-sphere. Convergence is optimal with spherically symmetric distributions, such as the Gaussian distribution, because in that case

![Formula][21]</img> 
Assuming no correlation between genes, an approximation of the statistical error at step *t* can be obtained with

![Formula][22]</img> 
This statistical error is compared with the numerical error to assess convergence.

See Supplementary Material for a detailed exposition of the mathematical methods, and Supplementary Movies 1–5 for an illustration.

## Supplementary Mathematical Methods

### SM1 Vectorial representation of sample data

Let *x*1*,...,xn* be the samples of *n* independent and identically distributed random variables *X*1*,...,Xn*. Let us represent the samples *x*1*,...,xn* with the ℝ*n* column vector **x** = (*x*1*,...,xn*)′, and let us denote the sample mean by ![Graphic][23]</img>.

Let us define the ℝ*n* → ℝ*n* vectorial operators mean ![Graphic][24]</img> and residual ![Graphic][25]</img>, respectively, as

![Formula][26]</img> ![Formula][27]</img> 
**1** being the all-ones column vector of dimension *n*.

Thus, any sample vector **x** ∈ ℝ*n* can be decomposed as

![Formula][28]</img> 
The mean vector ![Graphic][29]</img> contains the sample mean, while the residual vector ![Graphic][30]</img> carries the sample variation around the mean.
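
The decomposition can be checked numerically. The following Python sketch uses illustrative helper names and a toy sample vector; it rebuilds **x** from its mean and residual vectors and shows that the two parts are orthogonal.

```python
def mean_vector(x):
    """Every component equals the sample mean."""
    m = sum(x) / len(x)
    return [m] * len(x)

def residual_vector(x):
    """Deviations of each sample from the sample mean."""
    m = sum(x) / len(x)
    return [xi - m for xi in x]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

x = [2.0, 4.0, 9.0]
mx, rx = mean_vector(x), residual_vector(x)
recomposed = [a + b for a, b in zip(mx, rx)]   # mean part + residual part gives back x
print(recomposed, dot(mx, rx))                  # the dot product is zero up to rounding
```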

The vectorial operators mean (13) and residual (14) are linear.

*Proposition*. For any two sample vectors **x, y** ∈ ℝ*n* and any two numbers *α, β* ∈ ℝ,

![Formula][31]</img> ![Formula][32]</img> 
*Proof*. Let us denote **x** = (*x*1*,...,xn*)′ and **y** = (*y*1*,...,yn*)′.

![Formula][33]</img> 
An essential property of the mean and residual vectors is that they belong to subspaces that are orthogonal complements (Eaton, 2007). Hence, for any sample vector **x** ∈ ℝ*n*, the mean vector ![Graphic][34]</img> belongs to the subspace of dimension 1 spanned by the unit vector ![Graphic][35]</img>, while the residual vector ![Graphic][36]</img> belongs to the (*n* − 1)-dimensional hyperplane orthogonal to ![Graphic][37]</img>.

The lengths of the mean vector and residual vector are equal, up to a scaling factor, to the sample mean and sample standard deviation, respectively. For a set of samples *x*1*,...,xn*, where *n* ≥ 2, let us denote the sample mean as before by ![Graphic][38]</img>, and the sample variance as ![Graphic][39]</img>. Then, the lengths of the mean and residual vectors obtained from the sample vector **x** = (*x*1*,...,xn*)′ are

![Formula][40]</img> ![Formula][41]</img> 
Finally, let us define the standard vector of the sample vector **x** = (*x*1*,...,xn*)′ (*n* ≥ 2), as

![Formula][42]</img> 
whenever ![Graphic][43]</img>, or otherwise as stdvec(**x**) = **0**, where **0** is the all-zeros column vector of dimension *n*.

For a given number of samples *n*, all the non-zero standard vectors belong to the (*n* − 2)-sphere of radius ![Graphic][44]</img>, embedded in the (*n* − 1)-dimensional hyperplane perpendicular to ![Graphic][45]</img>. Besides, all the components of a standard vector are equal to the corresponding standardized samples,

![Formula][46]</img> 
For the degenerate case of having only two samples (*n* = 2), the only possible values of a non-zero standard vector are ![Graphic][47]</img>′.
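
A small Python sketch may help to fix ideas. It assumes that the standard vector is the residual vector divided by the sample standard deviation, as the identity with the standardized samples suggests; under that assumption its length is √(*n* − 1) whenever it is non-zero. The function name mirrors the notation above, and the sample values are toy numbers.

```python
import math
from statistics import stdev

def stdvec(x):
    """Residual vector rescaled by the sample standard deviation;
    the all-zeros vector when the samples show no variation."""
    n = len(x)
    m = sum(x) / n
    s = stdev(x)                      # sample standard deviation (n - 1 denominator)
    if s == 0:
        return [0.0] * n
    return [(xi - m) / s for xi in x]

v = stdvec([2.0, 4.0, 9.0])
length = math.sqrt(sum(c * c for c in v))   # sqrt(n - 1) under the stated assumption
two = stdvec([1.0, 3.0])                     # n = 2: components are -1/sqrt(2), +1/sqrt(2)
```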

### SM2 Linear decomposition of the normalization problem

Let us consider a gene expression dataset, with *g* genes and *c* experimental conditions. Each condition *k* has *sk* samples. The total number of samples is ![Graphic][48]</img>.

Let us denote the *observed* expression level of gene *j* in the sample *i* of condition *k* by ![Graphic][49]</img>. We assume that the observed level ![Graphic][50]</img> is equal, in the usual log2-scale, to the addition of the normalization factor ![Graphic][51]</img> to the *true* gene expression level ![Graphic][52]</img>,

![Formula][53]</img> 
Solving the *normalization problem* amounts to finding the normalization factors ![Graphic][54]</img> from the observed values ![Graphic][55]</img>. The normalization factors can be understood as sample-wide changes in the concentration of mRNA molecules by multiplicative factors equal to ![Graphic][56]</img>. These changes are caused by technical reasons in the assay and are independent of the biological variation in the true levels ![Graphic][57]</img>.

Let us represent the true and observed expression levels, ![Graphic][58]</img> and ![Graphic][59]</img>, of gene *j* in the samples *i* = 1, ..., *sk* of condition *k*, by the *sk*-dimensional vectors

![Formula][60]</img> ![Formula][61]</img> 
Let us also represent the unknown normalization factors of condition *k* by the *sk*-dimensional vector

![Formula][62]</img> 
From (22)–(25), the normalization problem can be written in vectorial form as

![Formula][63]</img> 
Applying the vectorial operators mean (13) and residual (14), we obtain

![Formula][64]</img> ![Formula][65]</img> 
The residual-vector equations (28) correspond to the *c* within-condition normalizations. Each within-condition normalization uses the equations (28) particular to a condition *k*, for the subset of genes ![Graphic][66]</img> that have expression level available and of enough quality in that experimental condition.

Let us denote the condition means for each gene as

![Formula][67]</img> ![Formula][68]</img> ![Formula][69]</img> 
so that

![Formula][70]</img> ![Formula][71]</img> ![Formula][72]</img> 
![Graphic][73]</img> being the all-ones column vector of dimension *sk*.

Then, the mean-vector equations (27) can be written as

![Formula][74]</img> 
so they reduce to the scalar equations

![Formula][75]</img> 
Let us define the vectors of conditions means as

![Formula][76]</img> ![Formula][77]</img> ![Formula][78]</img> 
and let us express the condition-mean equations in vectorial form as

![Formula][79]</img> 
Applying again the mean and residual operators, we obtain

![Formula][80]</img> ![Formula][81]</img> 
The residual-vector equations on condition means (42) correspond to the single between-condition normalization, in a similar way as (28) do for each of the within-condition normalizations. There is one equation (42) per gene. The only equations used in the between-condition normalization are those of the subset of genes ![Graphic][82]</img> that show no evidence of variation across experimental conditions, according to a statistical test.

Given that ![Graphic][83]</img>, the only unknown in (41) is ![Graphic][84]</img>. The meaning of ![Graphic][85]</img> is a conversion factor between the scales of the true and observed expression levels. This factor depends on the technology used to measure the expression levels, and finding it is outside the scope of the normalization problem. Therefore, without loss of generality, we assume ![Graphic][86]</img>, so

![Formula][87]</img> ![Formula][88]</img> 
The solution of the between-condition normalization, ![Graphic][89]</img>, allows us to find the mean vectors of the normalization factors ![Graphic][90]</img>, via (34), (39) and (44). The within-condition normalizations yield the residual vectors ![Graphic][91]</img>. The complete solution to the normalization problem is finally obtained, for each condition *k*, with

![Formula][92]</img> 
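
The way the pieces fit together can be illustrated with a toy Python example. It assumes that the convention adopted above fixes the average of the condition means of the normalization factors at zero; the factors themselves are hypothetical. Given that convention, the *c* within-condition residual vectors and the single between-condition residual vector suffice to rebuild every normalization factor.

```python
def mean(v):
    return sum(v) / len(v)

def residual(v):
    m = mean(v)
    return [vi - m for vi in v]

# Hypothetical normalization factors for c = 3 conditions (unequal sizes allowed).
a = [[0.40, 0.10, -0.20], [-0.30, 0.00], [0.25, -0.05, 0.10, -0.30]]

# Impose the assumed convention: the condition means of the factors average to zero.
shift = mean([mean(ak) for ak in a])
a = [[v - shift for v in ak] for ak in a]

within = [residual(ak) for ak in a]                 # c within-condition solutions
between = residual([mean(ak) for ak in a])          # one between-condition solution

# Complete solution: between-condition offset plus within-condition residual.
rebuilt = [[between[k] + r for r in within[k]] for k in range(len(a))]
```
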
Thus, the original normalization problem (26) has been divided into *c* + 1 normalization subproblems on residual vectors, stated by (28) and (42). In fact, this linear decomposition is possible for any partition of the set of *s* samples. The choice of the partition as the one defined by the experimental conditions is motivated by the need to control the biological variation among the genes used in each normalization. All the *c* + 1 normalizations face the same kind of *normalization of residuals problem*, which we define in general as follows.

**Normalization of Residuals Problem**. Let *yij* be the *i*-th observed value of feature *j*, in a dataset with *n* ≥ 2 observations for each of the *m* features. The observed values *yij* are equal to the true values *xij* plus the normalization factors *ai*, which are constant across features. In vectorial form, there are *m* equations ![Formula][93]</img>  where the vectors belong to ℝ*n* . As a consequence ![Formula][94]</img> 

Solving the *normalization of residuals problem* amounts to finding the residual vector of normalization factors ![Graphic][95]</img> from the observed residual vectors ![Graphic][96]</img>. In the within-condition normalizations, the features are gene expression levels, with one observation per sample of the corresponding experimental condition. In the between-condition normalization, the features are means of gene expression levels, with one observation per condition.

There is, however, an additional requirement imposed by the methods with which we propose to solve the between-condition normalization. We would like to consider the condition means ![Graphic][97]</img> in (36) as sample data across conditions. This only holds when all the conditions have the same number of samples. Otherwise, we balance the condition means so that they result from the same number of samples in all conditions, according to the procedure described in the following.

Let *s*∗ be the minimum number of samples across conditions, *s* ∗ = min{*s*1*,...,sc*}. Let ![Graphic][98]</img> be independent random samples (without replacement) of size *s*∗ from the set of indexes {1*,...,sk*}, with one sample per gene *j* and condition *k*. Then, the balanced condition means are defined as

![Formula][99]</img> ![Formula][100]</img> ![Formula][101]</img> 
From (22), the balanced condition means verify a relationship similar to (36),

![Formula][102]</img> 
Moreover, the average of ![Graphic][103]</img> across the sampling subsets ![Graphic][104]</img> is equal to the unknown ![Graphic][105]</img>. This implies that (51) are, on average, equivalent to (36). Hence, we use the following vectors of balanced conditions means

![Formula][106]</img> ![Formula][107]</img> 
instead of (37), (38), in order to build the condition-mean equations (40). This balancing of the condition means is only required when the experimental conditions have different numbers of samples.
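
The balancing step can be sketched in Python as follows. The function name and the toy values are illustrative; the sketch handles a single gene and draws, for each condition, a random subset of *s*∗ samples without replacement before averaging.

```python
import random

def balanced_condition_means(values_by_condition, rng):
    """values_by_condition: for one gene, a list with one list of values per condition.
    Returns one mean per condition, each computed from s* randomly chosen samples."""
    s_star = min(len(v) for v in values_by_condition)
    means = []
    for values in values_by_condition:
        subset = rng.sample(values, s_star)      # sampling without replacement
        means.append(sum(subset) / s_star)
    return means

rng = random.Random(1)
gene = [[7.9, 8.3, 8.1], [6.5, 6.9], [9.0, 9.4, 9.1, 9.3]]   # toy values
balanced = balanced_condition_means(gene, rng)
print(balanced)
```

Conditions that already have *s*∗ samples keep their ordinary mean, whereas larger conditions contribute the mean of a random subset.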

### SM3 Permutation invariance of multivariate data

Let *xij* and *yij* be, respectively, the true and observed values of a dataset with *n* observations of *m* features, as defined in the *normalization of residuals problem* above.

We have assumed that the *n* true values *x*1*j,...,xnj* of feature *j* are samples of independent and identically distributed random variables *X*1*j,...,Xnj*. These random variables can be represented with the random vector **X***j* = (*X*1*j,...,Xnj*)′, carried by the probability space (Ω, ![Graphic][108]</img>, P) and with induced space (ℝ*n*, ![Graphic][109]</img>, ℙ). Let us define the random vectors ![Graphic][110]</img> and ![Graphic][111]</img> with the vectorial operators mean (13) and residual (14), respectively,

![Formula][112]</img> ![Formula][113]</img> ![Formula][114]</img> 
![Graphic][115]</img> holds for any random vector **X***j*, as well as the other properties presented above. Let us assume that E( ||**X***j*||) < ∞ and that ![Graphic][116]</img>, which imply that ![Graphic][117]</img> has length 1 almost surely.

The standard random vector ![Graphic][118]</img> is a pivotal quantity, where the location (mean) and scale (standard deviation) of feature *j* have been removed. The probability distribution of ![Graphic][119]</img> across the remaining degrees of freedom over the unit (*n* − 2)-sphere is governed by the parametric family of the random variables *X*1*j,...,Xnj*. Moreover, the independence and identity of distribution across the *n* observations imply that the distribution of **X***j* is *exchangeable*, i.e. invariant with respect to permutations of the observation labels. As a result, ![Graphic][120]</img> is also permutation invariant, which geometrically corresponds to symmetries with respect to the *n*! permutations of the axes in the *n*-dimensional space of random vectors, projected onto the (*n* − 1)-dimensional hyperplane of residual vectors.

Residual vectors and standard vectors have been widely studied, especially in relation to elliptically symmetric distributions and linear models (Fang et al., 1990; Gupta et al., 2013), and to the invariances of probability distributions (Kallenberg, 2005). Here, we consider these vectors from the viewpoint of the problem of normalizing multivariate data, and its relationship with permutation invariance.

It is well known that, for a multivariate distribution with independent and identically distributed components, the expected value of the standard vector is zero (Eaton, 2007), given that it is so for each component. We prove this here for completeness, and to show that it is also a necessary consequence of the permutation invariance of the distribution.

*Proposition*. The expected value of any true (i.e. without normalization issues) standard vector is zero. If the *n* ≥ 2 samples of feature *j* are independent and identically distributed, then

![Formula][121]</img> 
*Proof*. Let ![Graphic][122]</img> be the set of all the permutation matrices in ℝ*n*×*n*. Then, for any ![Graphic][123]</img>, ![Graphic][124]</img> is equal in distribution to ![Graphic][125]</img>. This implies that

![Formula][126]</img> 
The only vectors that are invariant with respect to all possible permutations are those that have all components identical. Therefore, ![Graphic][127]</img>, with *α* ∈ ℝ. However, ![Graphic][128]</img>, so that ![Graphic][129]</img>. Hence ![Graphic][130]</img>

For each true random vector **X***j*, there is an observed random vector **Y***j* = **X***j* + **A**, where **A** is the random vector of normalization factors. The random vectors **X***j* and **A** are independent, representing biological and technical variation, respectively. Therefore, and without loss of generality, we assume in what follows a fixed vector of normalization factors **a**, i.e. we condition on the event {**A** = **a**}. We also assume that ![Graphic][131]</img>, which implies that ![Graphic][132]</img> has length 1 almost surely.

In contrast to the true standard vector ![Graphic][133]</img>, the observed standard vector ![Graphic][134]</img> is biased toward the direction of ![Graphic][135]</img> with the result that the expected value is not zero.

*Proposition*. If the *n* ≥ 2 samples of feature *j* are independent and identically distributed, whenever ![Graphic][136]</img>,

![Formula][137]</img> 
When *n* = 2, there is the additional requirement that ![Graphic][138]</img>. This threshold of detection only occurs for the degenerate case of *n* = 2.

*Proof*. Let us consider the projection of ![Graphic][139]</img> on ![Graphic][140]</img>, compared to the projection of ![Graphic][141]</img>.

When the vectors ![Graphic][142]</img> and ![Graphic][143]</img> are collinear,

![Formula][144]</img> 
with

![Formula][145]</img> 
This is the only case when *n* = 2. The additional requirement ensures that, for *n* = 2,

![Formula][146]</img> 
which implies

![Formula][147]</img> 
Otherwise, when *n>* 2 and the vectors ![Graphic][148]</img> and ![Graphic][149]</img> are not collinear, they lie on a plane. The vector ![Graphic][150]</img> is the diagonal of the parallelogram defined by ![Graphic][151]</img> and ![Graphic][152]</img>. Hence the angle between ![Graphic][153]</img> and ![Graphic][154]</img> is strictly less than the angle between ![Graphic][155]</img> and ![Graphic][156]</img>, so the cosine of the angle is strictly greater. Thus,

![Formula][157]</img> 
Due to the permutation symmetries in the distribution of ![Graphic][158]</img>, when *n >* 2 the vector ![Graphic][159]</img> has non-zero probability of being not collinear with ![Graphic][160]</img>, i.e. ![Graphic][161]</img>.

Therefore,

![Formula][162]</img> 
which again implies

![Formula][163]</img> 
Finally,

![Formula][164]</img> 
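
Both propositions can be illustrated with a small Monte Carlo experiment. The Python sketch below uses Gaussian toy data and hypothetical normalization factors, and it scales residual vectors to unit length, since only their direction matters for this check. The average direction of the true vectors is close to zero, whereas that of the observed vectors has a clearly positive projection on the residual vector of the normalization factors.

```python
import math
import random

def unit_residual(x):
    """Residual vector scaled to unit length (direction of the standard vector)."""
    m = sum(x) / len(x)
    r = [xi - m for xi in x]
    norm = math.sqrt(sum(ri * ri for ri in r))
    return [ri / norm for ri in r]

def mean_direction(vectors):
    n = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(n)]

rng = random.Random(0)
n, m = 4, 5000
a = [1.0, -1.0, 0.5, -0.5]                       # hypothetical normalization factors
true = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
observed = [[xi + ai for xi, ai in zip(x, a)] for x in true]

mean_true = mean_direction([unit_residual(x) for x in true])
mean_obs = mean_direction([unit_residual(y) for y in observed])

a_res = unit_residual(a)                          # direction of the residual of a
projection = sum(u * v for u, v in zip(mean_obs, a_res))
print(mean_true, projection)
```
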
As a consequence, the *normalization of residuals problem* may be restated as the problem of finding the normalization factors ![Graphic][165]</img> from the observed vectors ![Graphic][166]</img>, such that the standard vectors ![Graphic][167]</img> are invariant against permutations of the observation labels or, equivalently, such that the standard vectors ![Graphic][168]</img> have zero mean. The following property provides an approach to the solution.

*Proposition*. Whenever ![Graphic][169]</img>, the component of the expected value of ![Graphic][170]</img> parallel to ![Graphic][171]</img> verifies

![Formula][172]</img> 
As in (58), when *n* = 2 we also assume that ![Graphic][173]</img>.

*Proof*. The first inequality holds from the previous proof. Concerning the second inequality, let us consider

![Formula][174]</img> 
We need to prove that the first term on the RHS has negative expected value. Let us decompose this term into the positive and negative parts, ![Formula][175]</img>  where *X*+ = max(*X,* 0) and *X*− = − min(*X,* 0).

Because ![Graphic][176]</img>

![Formula][177]</img> 
These inequalities are identities when ![Graphic][178]</img> is of opposite sign to ( ·)±, or when ![Graphic][179]</img>. Because of the permutation symmetries of ![Graphic][180]</img>, it follows that ![Graphic][181]</img>, which implies

![Formula][182]</img> 
and hence

![Formula][183]</img> 
For any permutation matrix ![Graphic][184]</img>,

![Formula][185]</img> 
so that

![Formula][186]</img> 
which together with

![Formula][187]</img> 
implies, as in (57), that

![Formula][188]</img> 
Therefore,

![Formula][189]</img> 
Back to the initial expected values, it follows that

![Formula][190]</img> 
which implies

![Formula][191]</img> 
The Gaussian multivariate distribution, among others, has spherical symmetry besides permutation symmetry. For parametric families with spherical symmetry, the true standard vector ![Graphic][192]</img> has uniform distribution over the (*n*−2)-sphere. As a result, the components of ![Graphic][193]</img> perpendicular to ![Graphic][194]</img> are antisymmetric with respect to the direction of ![Graphic][195]</img>, so that they cancel out in expectation. That is, for parametric families with spherical symmetry, and as long as ![Graphic][196]</img>,

![Formula][197]</img> 

### SM4 Standard-vector normalization

The properties (59), (60) suggest the use of

![Formula][198]</img> 
to approximate the unknown residual vector of normalization factors ![Graphic][199]</img>. The following iterative method implements this approach to solve the *normalization of residuals problem*.

Let us define the following recursive sequence, where each step *t* comprises *m* vectors ![Graphic][200]</img> (*j* ∈{1*,...,m*}) and one vector ![Graphic][201]</img>,

![Formula][202]</img> ![Formula][203]</img> ![Formula][204]</img> 
We assume that ![Graphic][205]</img>, for all *j* ∈{1*,...,m*} and all *t* ≥ 0. Nonetheless, an implementation of this algorithm benefits from trimming out a small fraction (e.g. 1%) of the features with the smallest ![Graphic][206]</img> in (64), in order to avoid numerical singularities.

Let us write ![Graphic][207]</img> as a function of the unknowns ![Graphic][208]</img> and ![Graphic][209]</img>. For any *t* ≥ 1,

![Formula][210]</img> ![Formula][211]</img> ![Formula][212]</img> ![Formula][213]</img> ![Formula][214]</img> ![Formula][215]</img> 
Note that (70) is also valid for *t* = 0.

Let us also define the vectors ![Graphic][216]</img>, for *t* ≥ 0, which describe the vector of normalization factors still to be removed at step *t*,

![Formula][217]</img> 
so that, by (70), for *t* ≥ 0,

![Formula][218]</img> 
Therefore, the recursive sequence (62)–(64) faces a new, weaker *normalization of residuals problem* at each step *t*, with true residual vectors ![Graphic][219]</img>, observed residual vectors ![Graphic][220]</img> and unknown normalization factors ![Graphic][221]</img>. The step *t* results in the estimation of normalization factors ![Graphic][222]</img>, which are removed from ![Graphic][223]</img>, generating the next step. At the beginning, ![Graphic][224]</img>.

At convergence, ![Graphic][225]</img>. Equations (57), (58), (64) imply that, in such a case, ![Graphic][226]</img> and ![Graphic][227]</img>. Convergence is optimal when the parametric family of the *m* features has spherical symmetry, Gaussian being the most prominent case. Otherwise, the more uniform the distribution of standard vectors ![Graphic][228]</img> is on the (*n* − 2)-sphere, the faster the sequence (62)–(64) converges. See examples of convergence in Supplementary Movies 1–3.
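
The iteration can be sketched in Python as follows. This is a minimal illustration, not the implementation used for the results. It assumes that the estimate at each step is the sum of the unit-length residual directions divided by the sum of the inverse residual lengths, omits the trimming of features mentioned above, and runs on Gaussian toy data with hypothetical normalization factors.

```python
import math
import random

def residual(x):
    m = sum(x) / len(x)
    return [xi - m for xi in x]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def standard_vector_normalization(observed, tol=1e-10, max_steps=200):
    """observed: m feature vectors of length n. Returns the estimated residual
    vector of normalization factors, accumulated over the steps."""
    n = len(observed[0])
    y = [residual(v) for v in observed]            # step 0: observed residual vectors
    total = [0.0] * n
    for _ in range(max_steps):
        lengths = [norm(v) for v in y]
        weight = sum(1.0 / l for l in lengths)
        # Assumed update: sum of unit-length directions over sum of inverse lengths.
        b = [sum(v[i] / l for v, l in zip(y, lengths)) / weight for i in range(n)]
        total = [t + bi for t, bi in zip(total, b)]
        y = [[vi - bi for vi, bi in zip(v, b)] for v in y]   # remove the estimate
        if norm(b) < tol:
            break
    return total

rng = random.Random(0)
a = [0.8, -0.5, 0.1, -0.4, 0.0]                   # hypothetical factors (mean zero)
true = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(2000)]
observed = [[xi + ai for xi, ai in zip(x, a)] for x in true]
estimate = standard_vector_normalization(observed)
print(estimate)
```

With spherically symmetric toy data such as these, the accumulated estimate approaches the residual vector of the hypothetical factors, up to the statistical error set by the number of features.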

### SM5 Identification of non-differentially expressed genes

Let us consider a gene expression dataset, with *g* genes and *c* experimental conditions. Each condition *k* has *sk* samples. The total number of samples is ![Graphic][229]</img>. Let us assume that *c* ≥ 2 and that *sk* ≥ 2, for all conditions *k* ∈{1*,...,c*}. Let us also assume that, among the *g* genes, there is a fraction *π* of non-differentially expressed (non-DE) genes, with 0 ≤ *π* ≤ 1, while the remaining fraction 1 − *π* comprises the differentially expressed (DE) genes (Storey and Tibshirani, 2003).

Let us consider the usual ANOVA test comparing average expression levels across conditions, gene-by-gene. Under the null hypothesis of a non-differentially expressed gene, the corresponding *F*-statistic follows the *F*-distribution with *c* − 1 and *s* − *c* degrees of freedom. The test of this hypothesis yields a *p*-value *pj* for each gene *j* ∈{1*,...,g*}. The obtained *p*-values *pj* follow a probability distribution *F* that can be considered as the mixture of two probability distributions, *F*0 and *F*1, for the non-DE genes and the DE genes, respectively (Storey, 2003). The *p*-values of the fraction *π* of non-DE genes follow the uniform distribution on the interval [0, 1], ![Formula][230]</img>  while those of the fraction 1 − *π* of DE genes follow a distribution that verifies, for any *p* ∈ (0, 1), ![Formula][231]</img>  and the mixture distribution is ![Formula][232]</img> 

Let us further assume that there exists a *p*∗, with 0 < *p*∗ < 1, such that *F*1(*p*) = 1 for every *p* ≥ *p*∗. In other words, all DE genes have a *p*-value *pj* from the ANOVA test such that *pj* ≤ *p*∗, while only some genes among the non-DE genes have *p*-values with *pj* > *p*∗. This implies that the mixture distribution of *p*-values is uniform on the interval [*p*∗, 1],

![Formula][233]</img> ![Formula][234]</img> 
On the other hand, for any set of *n* samples *x*(1) ≤ *x*(2) ≤ . . . ≤ *x*(*n*) obtained from *n* independent and identically distributed uniform random variables on the interval [*a, b*], all the distances between consecutive ordered samples (including boundaries), *x*(1) − *a, x*(2) − *x*(1)*,...,x*(*n*) − *x*(*n*−1)*, b* − *x*(*n*), obey the same distribution (Feller, 1971). Then, it can be seen that, for any *j* such that 2 ≤ *j* ≤ *n* − 1, the two subsets of samples *x*(1)*,...,x*(*j*−1) and *x*(*j*+1)*,...,x*(*n*) follow uniform distributions on the intervals [*a, x*(*j*)] and [*x*(*j*)*,b*], respectively.

Based on these facts, to identify non-DE genes we propose finding the minimum *p*(*j*), from the ordered sequence of *p*-values *p*(1) ≤ *p*(2) ≤ . . . ≤ *p*(*g*), such that a goodness-of-fit test for the uniform distribution on the interval [*p*(*j*), 1], performed on *p*(*j*+1)*,...,p*(*g*), is not rejected. As a result, the genes corresponding to the *p*-values *p*(*j*)*,p*(*j*+1)*,...,p*(*g*) are considered as non-DE genes.

Given the concavity of *F*(*p*), the goodness-of-fit test used is the one-sided Kolmogorov-Smirnov test on positive deviations of the empirical distribution function.
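
The search for the threshold can be sketched in Python as follows. This is a simplified illustration rather than the procedure used for the results. It scans the ordered *p*-values one by one, and it replaces the exact one-sided Kolmogorov-Smirnov *p*-value with the asymptotic approximation exp(−2*n*D²), which is an assumption of this sketch. The mixture of *p*-values is simulated, with function names chosen for the example.

```python
import math
import random

def one_sided_ks_pvalue(u):
    """u: sorted values rescaled to [0, 1]. Tests for positive deviations of the
    empirical distribution function above the uniform one, using the asymptotic
    approximation P(D+ > d) ~ exp(-2 n d^2)."""
    n = len(u)
    d_plus = max((i + 1) / n - ui for i, ui in enumerate(u))
    return math.exp(-2.0 * n * d_plus * d_plus)

def non_de_threshold(pvalues, alpha=0.001):
    """Return (j, sorted p-values), where p(j) is the smallest ordered p-value such
    that p(j+1), ..., p(g) are compatible with a uniform distribution on [p(j), 1]."""
    p = sorted(pvalues)
    for j in range(len(p) - 1):
        lo = p[j]
        if lo >= 1.0:
            break
        rescaled = [(v - lo) / (1.0 - lo) for v in p[j + 1:]]
        if one_sided_ks_pvalue(rescaled) >= alpha:   # uniformity not rejected
            return j, p
    return len(p) - 1, p

rng = random.Random(0)
pvals = [rng.random() for _ in range(700)]             # simulated non-DE: uniform on [0, 1]
pvals += [0.01 * rng.random() for _ in range(300)]     # simulated DE: concentrated near zero
j, p_sorted = non_de_threshold(pvals)
n_non_de = len(p_sorted) - j
print(p_sorted[j], n_non_de)
```

The genes whose *p*-values lie at or above `p_sorted[j]` would be the ones passed on to the between-condition normalization.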

See Supplementary Movies 4–5 for examples of this approach to identifying non-differentially expressed genes.

## Acknowledgements

This work was funded by the European Union FP7 projects MODERN (Ref. 309314-2) (C.P.R., J.J.S.-F.) and MARINA (Ref. 263215) (J.J.S.-F.), by FEDER through COMPETE (Programa Operacional Factores de Competitividade) and FCT (Fundação para a Ciência e Tecnologia) through project bio-CHIP (Ref. FCT EXPL/AAG-MAA/0180/2013) (S.I.L.G., M.J.B.A.), and by a post-doctoral grant (Ref. SFRH/BPD/95775/2013) (S.I.L.G.).

*   Received June 18, 2015.
*   Accepted December 4, 2015.


*   © 2015, Posted by Cold Spring Harbor Laboratory

The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission.

## References

1.  Bolstad, B. M., Irizarry, R. A., Astrand, M. and Speed, T. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–193.
    

2.  Brettingham-Moore, K. H., Duong, C. P., Heriot, A. G., Thomas, R. J. S. and Phillips, W. A. (2011). Using gene expression profiling to predict response and prognosis in gastrointestinal cancers-the promise and the perils. Ann Surg Oncol 18, 1484–1491.
    
    
3.  Bullard, J. H., Purdom, E., Hansen, K. D. and Dudoit, S. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11, 94.
    

4.  Castro-Ferreira, M. P., de Boer, T. E., Colbourne, J. K., Vooijs, R., van Gestel, C. A. M., van Straalen, N. M., Soares, A. M. V. M., Amorim, M. J. B. and Roelofs, D. (2014). Transcriptome assembly and microarray construction for Enchytraeus crypticus, a model oligochaete to assess stress response mechanisms derived from soil conditions. BMC Genomics 15, 302.
    
    
5.  Chang, Y., Lye, M. L. and Zeng, H. C. (2005). Large-scale synthesis of high-quality ultralong copper nanowires. Langmuir 21, 3746–3748.
    

6.  Chi, J.-T., Chang, H. Y., Haraldsen, G., Jahnsen, F. L., Troyanskaya, O. G., Chang, D. S., Wang, Z., Rockson, S. G., van de Rijn, M., Botstein, D. et al. (2003). Endothelial cell diversity revealed by global expression profiling. Proc Natl Acad Sci USA 100, 10623–10628.
    

7.  Couzin, J. (2006). Genomics. Microarray data reproduced, but some concerns remain. Science 313, 1559.
    

8.  Dillies, M.-A., Rau, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant, N., Keime, C., Marot, G., Castel, D., Estelle, J. et al. (2013). A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform 14, 671–683.
    

9.  Draghici, S., Khatri, P., Eklund, A. C. and Szallasi, Z. (2006). Reliability and reproducibility issues in DNA microarray measurements. Trends Genet 22, 101–109.
    

10. Duggan, D. J., Bittner, M., Chen, Y., Meltzer, P. and Trent, J. M. (1999). Expression profiling using cDNA microarrays. Nat Genet 21, 10–14.
    

11. Durbin, J. (1973). Distribution Theory for Tests Based on the Sample Distribution Function. Society for Industrial and Applied Mathematics, Philadelphia.
    
    
12. Eaton, M. L. (2007). Multivariate Statistics: A Vector Space Approach. Institute of Mathematical Statistics, Beachwood, Ohio.
    
    
13. Fang, K., Kotz, S. and Ng, K. W. (1990). Symmetric Multivariate and Related Distributions. Chapman and Hall, New York.
    
    
14. Feller, W. (1971). An Introduction to Probability Theory and Its Applications, vol. 2, 2nd edition. Wiley, New York.
    
    
15. Frantz, S. (2005). An array of problems. Nat Rev Drug Discov 4, 362–363.
    

16. Gagnon-Bartsch, J. A. and Speed, T. P. (2012). Using control genes to correct for unwanted variation in microarray data. Biostatistics 13, 539–552.
    

17. Garber, M., Grabherr, M. G., Guttman, M. and Trapnell, C. (2011). Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods 8, 469–477.
    

18. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A. et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537.
    

19. Gomes, S. I. L., Caputo, G., Pinna, N., Scott-Fordsmand, J. J. and Amorim, M. J. B. (2015a). Effect of 10 different TiO2 and ZrO2 (nano)materials on the soil invertebrate *Enchytraeus crypticus*. Environ Toxicol Chem doi:10.1002/etc.3080.
    

20. Gomes, S. I. L., Scott-Fordsmand, J. J. and Amorim, M. J. B. (2015b). Cellular energy allocation to assess the impact of nanomaterials on soil invertebrates (Enchytraeids): The effect of Cu and Ag. Int J Environ Res Public Health 12, 6858–6878.
    

21. Gupta, A. K., Varga, T. and Bodnar, T. (2013). Elliptically Contoured Models in Statistics and Portfolio Theory. Springer, New York.
    
    
22. Hicks, S. C. and Irizarry, R. A. (2015). quantro: a data-driven approach to guide the choice of an appropriate normalization method. Genome Biol 16, 117.
    

23. Huber, W., Carey, V. J., Gentleman, R., Anders, S., Carlson, M., Carvalho, B. S., Bravo, H. C., Davis, S., Gatto, L., Girke, T. et al. (2015). Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods 12, 115–121.
    

24. Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U. and Speed, T. P. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–264.
    

25. Ivanova, N. B., Dimos, J. T., Schaniel, C., Hackney, J. A., Moore, K. A. and Lemischka, I. R. (2002). A stem cell molecular signature. Science 298, 601–604.
    

26. Kallenberg, O. (2005). Probabilistic Symmetries and Invariance Principles. Springer, New York.
    
    
27. Leek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead, B., Johnson, W. E., Geman, D., Baggerly, K. and Irizarry, R. A. (2010). Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 11, 733–739.
    

28. Leek, J. T. and Storey, J. D. (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet 3, 1724–1735.
    

29. Li, S., Labaj, P. P., Zumbo, P., Sykacek, P., Shi, W., Shi, L., Phan, J., Wu, P.-Y., Wang, M., Wang, C., Thierry-Mieg, D., Thierry-Mieg, J., Kreil, D. P. and Mason, C. E. (2014). Detecting and correcting systematic variation in large-scale RNA sequencing data. Nat Biotechnol 32, 888–895.
    

30. Lippa, K. A., Duewer, D. L., Salit, M. L., Game, L. and Causton, H. C. (2010). Exploring the use of internal and external controls for assessing microarray technical performance. BMC Res Notes 3, 349.
    

31. Listgarten, J., Kadie, C., Schadt, E. E. and Heckerman, D. (2010). Correction for hidden confounders in the genetic analysis of gene expression. Proc Natl Acad Sci USA 107, 16465–16470.
    

32. Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H. et al. (1996). Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol 14, 1675–1680.
    

33. Lovén, J., Orlando, D. A., Sigova, A. A., Lin, C. Y., Rahl, P. B., Burge, C. B., Levens, D. L., Lee, T. I. and Young, R. A. (2012). Revisiting global gene expression analysis. Cell 151, 476–482.
    

34. Michiels, S., Koscielny, S. and Hill, C. (2005). Prediction of cancer outcome with microarrays: A multiple random validation strategy. Lancet 365, 488–492.
    

35. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5, 621–628.
    

36. OECD (2004a). Guidelines for the Testing of Chemicals No. 202. Daphnia sp. Acute Immobilization Test. Organisation for Economic Co-operation and Development, Paris.
    
    
37. OECD (2004b). Guidelines for the Testing of Chemicals No. 220. Enchytraeid Reproduction Test. Organisation for Economic Co-operation and Development, Paris.
    
    
38. R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. URL: [http://www.R-project.org/](http://www.R-project.org/).
    
    
39. Reese, S. E., Archer, K. J., Therneau, T. M., Atkinson, E. J., Vachon, C. M., de Andrade, M., Kocher, J.-P. A. and Eckel-Passow, J. E. (2013). A new statistic for identifying batch effects in high-throughput genomic data that uses guided principal component analysis. Bioinformatics 29, 2877–2883.
    

40. Risso, D., Ngai, J., Speed, T. P. and Dudoit, S. (2014). Normalization of RNA-seq data using factor analysis of control genes or samples. Nat Biotechnol 32, 896–902.
    

41. Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W. and Smyth, G. K. (2015). *limma* powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43, e47.
    

42. Robinson, M. D. and Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11, R25.
    

43. Schena, M., Shalon, D., Davis, R. W. and Brown, P. O. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470.
    

44. Scott-Fordsmand, J. J., Krogh, P. H. and Weeks, J. M. (2000). Responses of *Folsomia fimetaria* (Collembola: Isotomidae) to copper under different soil copper contamination histories in relation to risk assessment. Environ Toxicol Chem 19, 1297–1303.
    
    
45. Shi, L., Reid, L. H., Jones, W. D., Shippy, R., Warrington, J. A., Baker, S. C., Collins, P. J., de Longueville, F., Kawasaki, E. S., Lee, K. Y. et al. (2006). The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol 24, 1151–1161.
    

46. Shippy, R., Fulmer-Smentek, S., Jensen, R. V., Jones, W. D., Wolber, P. K., Johnson, C. D., Pine, P. S., Boysen, C., Guo, X., Chudin, E. et al. (2006). Using RNA sample titrations to assess microarray platform performance and normalization techniques. Nat Biotechnol 24, 1123–1131.
    

47. Smyth, G. K. and Speed, T. (2003). Normalization of cDNA microarray data. Methods 31, 265–273.
    

48. Stegle, O., Parts, L., Durbin, R. and Winn, J. (2010). A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLoS Comput Biol 6, e1000770.
    

49. Storey, J. D. (2003). The positive false discovery rate: a Bayesian interpretation and the q-value. Ann Stat 31, 2013–2035.
    

50. Storey, J. D. and Tibshirani, R. (2003). Statistical significance for genomewide studies. Proc Natl Acad Sci USA 100, 9440–9445.
    

51. Su, Z., Labaj, P. P., Li, S., Thierry-Mieg, J., Thierry-Mieg, D., Shi, W., Wang, C., Schroth, G. P., Setterquist, R. A., Thompson, J. F. et al. (2014). A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol 32, 903–914.
    

52. Tan, P. K., Downey, T. J., Spitznagel, E. L., Xu, P., Fu, D., Dimitrov, D. S., Lempicki, R. A., Raaka, B. M. and Cam, M. C. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Res 31, 5676–5684.
    

53. van ’t Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A. M., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T. et al. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536.
    

54. Wang, Z., Gerstein, M. and Snyder, M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10, 57–63.
    

55. Weigelt, B. and Reis-Filho, J. S. (2010). Molecular profiling currently offers no more than tumour morphology and basic immunohistochemistry. Breast Cancer Res 12 Suppl. 4, S5.
    
    
56. Wu, Z. and Aryee, M. J. (2010). Subset Quantile Normalization Using Negative Control Features. J Comput Biol 17, 1385–1395.
    

 [1]: /embed/inline-graphic-1.gif
 [2]: /embed/graphic-4.gif
 [3]: /embed/inline-graphic-2.gif
 [4]: /embed/inline-graphic-3.gif
 [5]: /embed/inline-graphic-4.gif
 [6]: /embed/graphic-5.gif
 [7]: /embed/graphic-6.gif
 [8]: /embed/graphic-7.gif
 [9]: /embed/graphic-8.gif
 [10]: /embed/inline-graphic-5.gif
 [11]: /embed/inline-graphic-6.gif
 [12]: /embed/graphic-9.gif
 [13]: /embed/graphic-10.gif
 [14]: /embed/graphic-11.gif
 [15]: /embed/graphic-12.gif
 [16]: /embed/graphic-13.gif
 [17]: /embed/inline-graphic-7.gif
 [18]: /embed/inline-graphic-8.gif
 [19]: /embed/inline-graphic-9.gif
 [20]: /embed/inline-graphic-10.gif
 [21]: /embed/graphic-14.gif
 [22]: /embed/graphic-15.gif
 [23]: /embed/inline-graphic-11.gif
 [24]: /embed/inline-graphic-12.gif
 [25]: /embed/inline-graphic-13.gif
 [26]: /embed/graphic-20.gif
 [27]: /embed/graphic-21.gif
 [28]: /embed/graphic-22.gif
 [29]: /embed/inline-graphic-14.gif
 [30]: /embed/inline-graphic-15.gif
 [31]: /embed/graphic-23.gif
 [32]: /embed/graphic-24.gif
 [33]: /embed/graphic-25.gif
 [34]: /embed/inline-graphic-16.gif
 [35]: /embed/inline-graphic-17.gif
 [36]: /embed/inline-graphic-18.gif
 [37]: /embed/inline-graphic-19.gif
 [38]: /embed/inline-graphic-20.gif
 [39]: /embed/inline-graphic-21.gif
 [40]: /embed/graphic-26.gif
 [41]: /embed/graphic-27.gif
 [42]: /embed/graphic-28.gif
 [43]: /embed/inline-graphic-22.gif
 [44]: /embed/inline-graphic-23.gif
 [45]: /embed/inline-graphic-24.gif
 [46]: /embed/graphic-29.gif
 [47]: /embed/inline-graphic-25.gif
 [48]: /embed/inline-graphic-26.gif
 [49]: /embed/inline-graphic-27.gif
 [50]: /embed/inline-graphic-28.gif
 [51]: /embed/inline-graphic-29.gif
 [52]: /embed/inline-graphic-30.gif
 [53]: /embed/graphic-30.gif
 [54]: /embed/inline-graphic-31.gif
 [55]: /embed/inline-graphic-32.gif
 [56]: /embed/inline-graphic-33.gif
 [57]: /embed/inline-graphic-34.gif
 [58]: /embed/inline-graphic-35.gif
 [59]: /embed/inline-graphic-36.gif
 [60]: /embed/graphic-31.gif
 [61]: /embed/graphic-32.gif
 [62]: /embed/graphic-33.gif
 [63]: /embed/graphic-34.gif
 [64]: /embed/graphic-35.gif
 [65]: /embed/graphic-36.gif
 [66]: /embed/inline-graphic-37.gif
 [67]: /embed/graphic-37.gif
 [68]: /embed/graphic-38.gif
 [69]: /embed/graphic-39.gif
 [70]: /embed/graphic-40.gif
 [71]: /embed/graphic-41.gif
 [72]: /embed/graphic-42.gif
 [73]: /embed/inline-graphic-38.gif
 [74]: /embed/graphic-43.gif
 [75]: /embed/graphic-44.gif
 [76]: /embed/graphic-45.gif
 [77]: /embed/graphic-46.gif
 [78]: /embed/graphic-47.gif
 [79]: /embed/graphic-48.gif
 [80]: /embed/graphic-49.gif
 [81]: /embed/graphic-50.gif
 [82]: /embed/inline-graphic-39.gif
 [83]: /embed/inline-graphic-40.gif
 [84]: /embed/inline-graphic-41.gif
 [85]: /embed/inline-graphic-42.gif
 [86]: /embed/inline-graphic-43.gif
 [87]: /embed/graphic-51.gif
 [88]: /embed/graphic-52.gif
 [89]: /embed/inline-graphic-44.gif
 [90]: /embed/inline-graphic-45.gif
 [91]: /embed/inline-graphic-46.gif
 [92]: /embed/graphic-53.gif
 [93]: /embed/graphic-54.gif
 [94]: /embed/graphic-55.gif
 [95]: /embed/inline-graphic-47.gif
 [96]: /embed/inline-graphic-48.gif
 [97]: /embed/inline-graphic-49.gif
 [98]: /embed/inline-graphic-50.gif
 [99]: /embed/graphic-56.gif
 [100]: /embed/graphic-57.gif
 [101]: /embed/graphic-58.gif
 [102]: /embed/graphic-59.gif
 [103]: /embed/inline-graphic-51.gif
 [104]: /embed/inline-graphic-52.gif
 [105]: /embed/inline-graphic-53.gif
 [106]: /embed/graphic-60.gif
 [107]: /embed/graphic-61.gif
 [108]: /embed/inline-graphic-54.gif
 [109]: /embed/inline-graphic-55.gif
 [110]: /embed/inline-graphic-56.gif
 [111]: /embed/inline-graphic-57.gif
 [112]: /embed/graphic-62.gif
 [113]: /embed/graphic-63.gif
 [114]: /embed/graphic-64.gif
 [115]: /embed/inline-graphic-58.gif
 [116]: /embed/inline-graphic-59.gif
 [117]: /embed/inline-graphic-60.gif
 [118]: /embed/inline-graphic-61.gif
 [119]: /embed/inline-graphic-62.gif
 [120]: /embed/inline-graphic-63.gif
 [121]: /embed/graphic-65.gif
 [122]: /embed/inline-graphic-64.gif
 [123]: /embed/inline-graphic-65.gif
 [124]: /embed/inline-graphic-66.gif
 [125]: /embed/inline-graphic-67.gif
 [126]: /embed/graphic-66.gif
 [127]: /embed/inline-graphic-68.gif
 [128]: /embed/inline-graphic-69.gif
 [129]: /embed/inline-graphic-70.gif
 [130]: /embed/inline-graphic-71.gif
 [131]: /embed/inline-graphic-72.gif
 [132]: /embed/inline-graphic-73.gif
 [133]: /embed/inline-graphic-74.gif
 [134]: /embed/inline-graphic-75.gif
 [135]: /embed/inline-graphic-76.gif
 [136]: /embed/inline-graphic-77.gif
 [137]: /embed/graphic-67.gif
 [138]: /embed/inline-graphic-78.gif
 [139]: /embed/inline-graphic-79.gif
 [140]: /embed/inline-graphic-80.gif
 [141]: /embed/inline-graphic-81.gif
 [142]: /embed/inline-graphic-82.gif
 [143]: /embed/inline-graphic-83.gif
 [144]: /embed/graphic-68.gif
 [145]: /embed/graphic-69.gif
 [146]: /embed/graphic-70.gif
 [147]: /embed/graphic-71.gif
 [148]: /embed/inline-graphic-84.gif
 [149]: /embed/inline-graphic-85.gif
 [150]: /embed/inline-graphic-86.gif
 [151]: /embed/inline-graphic-87.gif
 [152]: /embed/inline-graphic-88.gif
 [153]: /embed/inline-graphic-89.gif
 [154]: /embed/inline-graphic-90.gif
 [155]: /embed/inline-graphic-91.gif
 [156]: /embed/inline-graphic-92.gif
 [157]: /embed/graphic-72.gif
 [158]: /embed/inline-graphic-93.gif
 [159]: /embed/inline-graphic-94.gif
 [160]: /embed/inline-graphic-95.gif
 [161]: /embed/inline-graphic-96.gif
 [162]: /embed/graphic-73.gif
 [163]: /embed/graphic-74.gif
 [164]: /embed/graphic-75.gif
 [165]: /embed/inline-graphic-97.gif
 [166]: /embed/inline-graphic-98.gif
 [167]: /embed/inline-graphic-99.gif
 [168]: /embed/inline-graphic-100.gif
 [169]: /embed/inline-graphic-101.gif
 [170]: /embed/inline-graphic-102.gif
 [171]: /embed/inline-graphic-103.gif
 [172]: /embed/graphic-76.gif
 [173]: /embed/inline-graphic-104.gif
 [174]: /embed/graphic-77.gif
 [175]: /embed/graphic-78.gif
 [176]: /embed/inline-graphic-105.gif
 [177]: /embed/graphic-79.gif
 [178]: /embed/inline-graphic-106.gif
 [179]: /embed/inline-graphic-107.gif
 [180]: /embed/inline-graphic-108.gif
 [181]: /embed/inline-graphic-109.gif
 [182]: /embed/graphic-80.gif
 [183]: /embed/graphic-81.gif
 [184]: /embed/inline-graphic-110.gif
 [185]: /embed/graphic-82.gif
 [186]: /embed/graphic-83.gif
 [187]: /embed/graphic-84.gif
 [188]: /embed/graphic-85.gif
 [189]: /embed/graphic-86.gif
 [190]: /embed/graphic-87.gif
 [191]: /embed/graphic-88.gif
 [192]: /embed/inline-graphic-111.gif
 [193]: /embed/inline-graphic-112.gif
 [194]: /embed/inline-graphic-113.gif
 [195]: /embed/inline-graphic-114.gif
 [196]: /embed/inline-graphic-115.gif
 [197]: /embed/graphic-89.gif
 [198]: /embed/graphic-90.gif
 [199]: /embed/inline-graphic-116.gif
 [200]: /embed/inline-graphic-117.gif
 [201]: /embed/inline-graphic-118.gif
 [202]: /embed/graphic-91.gif
 [203]: /embed/graphic-92.gif
 [204]: /embed/graphic-93.gif
 [205]: /embed/inline-graphic-119.gif
 [206]: /embed/inline-graphic-120.gif
 [207]: /embed/inline-graphic-121.gif
 [208]: /embed/inline-graphic-122.gif
 [209]: /embed/inline-graphic-123.gif
 [210]: /embed/graphic-94.gif
 [211]: /embed/graphic-95.gif
 [212]: /embed/graphic-96.gif
 [213]: /embed/graphic-97.gif
 [214]: /embed/graphic-98.gif
 [215]: /embed/graphic-99.gif
 [216]: /embed/inline-graphic-124.gif
 [217]: /embed/graphic-100.gif
 [218]: /embed/graphic-101.gif
 [219]: /embed/inline-graphic-125.gif
 [220]: /embed/inline-graphic-126.gif
 [221]: /embed/inline-graphic-127.gif
 [222]: /embed/inline-graphic-128.gif
 [223]: /embed/inline-graphic-129.gif
 [224]: /embed/inline-graphic-130.gif
 [225]: /embed/inline-graphic-131.gif
 [226]: /embed/inline-graphic-132.gif
 [227]: /embed/inline-graphic-133.gif
 [228]: /embed/inline-graphic-134.gif
 [229]: /embed/inline-graphic-135.gif
 [230]: /embed/graphic-102.gif
 [231]: /embed/graphic-103.gif
 [232]: /embed/graphic-104.gif
 [233]: /embed/graphic-105.gif
 [234]: /embed/graphic-106.gif