Abstract
RNA-Seq and gene expression microarrays provide comprehensive profiles of gene activity, but lack of reproducibility has hindered their application. A key challenge in the data analysis is the normalization of gene expression levels, which is currently performed following an implicit assumption that most genes are not differentially expressed. Here, we present a mathematical approach to normalization that makes no assumption of this sort. We have found that variation in gene expression is much greater than currently believed, and that it can be measured with available technologies. Our results also explain, at least partially, the problems encountered in transcriptomics studies. We expect this improvement in detection to help efforts to realize the full potential of gene expression profiling, especially in analyses of cellular processes involving complex modulations of gene expression.
Introduction
Since the discovery of DNA structure by Watson and Crick, molecular biology has progressed increasingly quickly, with rapid advances in sequencing and related genomic technologies. Among these, microarrays and RNA-Seq have been widely adopted to obtain gene expression profiles, by measuring the concentration of tens of thousands of mRNA molecules in single assays (Schena et al., 1995; Lockhart et al., 1996; Duggan et al., 1999; Mortazavi et al., 2008; Wang et al., 2009). Despite their enormous potential (Golub et al., 1999; van ’t Veer et al., 2002; Ivanova et al., 2002; Chi et al., 2003), problems of reproducibility and reliability (Tan et al., 2003; Frantz, 2005; Couzin, 2006) have discouraged their use in some areas, e.g. biomedicine (Michiels et al., 2005; Weigelt and Reis-Filho, 2010; Brettingham-Moore et al., 2011). In more mature microarray technologies, issues such as probe design, cross-hybridization, non-linearities and batch effects (Draghici et al., 2006) have been identified as possible culprits, but the problems persist (Shi et al., 2006; Su et al., 2014).
The normalization of gene expression, which is required to set a common reference level among samples (Smyth and Speed, 2003; Irizarry et al., 2003; Bullard et al., 2010; Garber et al., 2011; Dillies et al., 2013), is also reportedly problematic, affecting the reproducibility of results with both microarray (Shi et al., 2006; Shippy et al., 2006) and RNA-Seq (Su et al., 2014; Bullard et al., 2010; Dillies et al., 2013). Batch effects and their influence on normalization have recently received a great deal of attention (Leek et al., 2010; Reese et al., 2013; Li et al., 2014), resulting in approaches aiming to remove unwanted technical variation caused by differences between batches of samples or by other sources of expression heterogeneity (Listgarten et al., 2010; Gagnon-Bartsch and Speed, 2012; Risso et al., 2014). A different issue, however, is the underlying assumption made by the most widely used normalization methods to date, such as median and quantile normalization (Bolstad et al., 2003) for microarrays, or RPKM (Mortazavi et al., 2008) and TMM (Robinson and Oshlack, 2010) for RNA-Seq, which posit that most genes are not differentially expressed (Dillies et al., 2013; Hicks and Irizarry, 2015). This lack-of-variation assumption may seem reasonable for many applications, but it has not been confirmed. Furthermore, results obtained with other technologies, particularly qRT-PCR, suggest that it may not be valid (Shi et al., 2006; Bullard et al., 2010).
Some methods have been proposed to address the issue of the lack-of-variation assumption, based on the use of spike-ins (Lovén et al., 2012), negative control probes (Wu and Aryee, 2010) or negative control genes (Gagnon-Bartsch and Speed, 2012), that is, on external or internal controls that are known a priori not to be differentially expressed (Lippa et al., 2010). The applicability of these methods, however, has been limited by this requirement of a priori knowledge, which is rarely available for a sufficiently large number of controls. Thus, in attempts to clarify and overcome limitations imposed by the lack-of-variation assumption, we have developed an approach to normalization that does not assume lack-of-variation and that does not require the use of spike-ins or a priori knowledge of control genes. The analysis of a large gene expression dataset using this approach shows that the assumption can severely undermine the detection of variation in gene expression. We have found that large numbers of differentially expressed genes with substantial expression changes are missed when data are normalized with methods that assume lack-of-variation.
Results
Datasets and Normalization Methods
The dataset was obtained from biological triplicates of Enchytraeus crypticus (a globally distributed soil organism used in standard ecotoxicity tests), sampled under 51 experimental conditions (42 treatments and 9 controls), involving exposure to several substances, at several concentrations and durations according to a factorial design (Supp. Table 1). Gene expression was measured using a customized high-density oligonucleotide microarray, and the resulting dataset was normalized with four methods. Two of these methods are the most widely used procedures for microarrays, median (or scale) normalization and quantile normalization (Bolstad et al., 2003), whereas the other two, designated median condition-decomposition normalization and standard-vector condition-decomposition normalization, have been developed for this study.
With the exception of quantile normalization, all the methods used apply a multiplicative factor to the expression levels in each sample, equivalent to the addition of a constant in the usual log2-scale for gene expression levels. Solving the normalization problem consists of finding these correction factors. The problem can be exactly and linearly decomposed into several sub-problems: one within-condition normalization for each experimental condition and one final between-condition normalization for the condition averages. In the within-condition normalizations, the samples (replicates) subjected to each experimental condition are normalized separately, whereas in the final between-condition normalization the average levels for all conditions are normalized together. Because replicates within a condition involve no differential expression, the lack-of-variation assumption only affects the final between-condition normalization. The assumption is avoided by using, in this normalization, expression levels only from no-variation genes, i.e. genes that show no evidence of differential expression under a statistical test. Both methods of normalization proposed here follow this condition-decomposition approach.
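The exactness of the condition decomposition can be illustrated with a short Python sketch (the paper's own code is in R; the additive log2-scale factor values below are hypothetical):

```python
from statistics import mean

# Hypothetical additive normalization factors (log2 scale) for two
# conditions with three replicate samples each.
factors = {"control": [0.12, -0.05, 0.20], "treated": [-0.30, 0.08, 0.14]}

within = {}   # within-condition part: factor minus its condition mean
between = {}  # between-condition part: the condition mean itself
for cond, f in factors.items():
    m = mean(f)
    within[cond] = [x - m for x in f]
    between[cond] = m

# The decomposition is exact: each total factor is recovered as the sum
# of its within-condition residual and its condition mean.
for cond, f in factors.items():
    for orig, w in zip(f, within[cond]):
        assert abs(orig - (w + between[cond])) < 1e-12
```

Because mean subtraction is linear, the two sub-problems can be solved independently and recombined without loss.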
With median condition-decomposition normalization, all normalization steps equalize medians, as in conventional median normalization, but only no-variation genes are used in the between-condition step. This restriction is essential: if all genes were used in this final step, the resulting total normalization factors would be exactly the same as those obtained with conventional median normalization.
For standard-vector condition-decomposition normalization, a vectorial procedure was developed to carry out each normalization step. The samples of any experimental condition, in a properly normalized dataset, must be exchangeable. In mathematical terms, the expression levels of each gene can be considered as an s-dimensional vector, where s is the number of samples for the experimental condition. After standardization (mean subtraction and variance scaling), these standard vectors are located on a (s − 2)-dimensional hypersphere. The exchangeability mentioned above implies that, when properly normalized, the distribution of standard vectors must be invariant with respect to permutations of the sample labels and must have zero expected value. These properties make it possible to obtain, under fairly general assumptions, a robust estimator of the normalization factors.
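The zero-expected-value property can be sketched in Python with simulated data (illustrative only; the gene count, replicate count and offset below are assumptions, not values from the study):

```python
import random
from statistics import mean

random.seed(1)

def standard_vector(x):
    """Mean-center a replicate vector and scale it to unit length."""
    m = mean(x)
    r = [v - m for v in x]
    norm = sum(v * v for v in r) ** 0.5
    return [v / norm for v in r]

# Simulated expression levels for 2000 genes over s = 3 replicate samples.
genes = [[random.gauss(8.0, 1.0) for _ in range(3)] for _ in range(2000)]

# Properly normalized data: the mean standard vector is close to zero.
balanced = [standard_vector(g) for g in genes]
mean_balanced = [mean(col) for col in zip(*balanced)]

# Un-normalized data: adding an offset to sample 0 only pulls the mean
# standard vector towards that sample's axis, revealing the offset.
shifted = [standard_vector([g[0] + 0.5, g[1], g[2]]) for g in genes]
mean_shifted = [mean(col) for col in zip(*shifted)]

assert all(abs(v) < 0.08 for v in mean_balanced)
assert mean_shifted[0] > 0.1  # clear bias along the shifted sample
```

The direction and size of the mean standard vector thus carry information about the missing normalization factors, which is the basis of the estimator.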
To further explore and compare the outcomes of the normalization methods, they were also applied to a synthetic random dataset. This dataset was generated with gene-by-gene means and variances identical to those of the real dataset, under the assumption that all genes were no-variation genes. In addition, normalization factors equal to those obtained from the real dataset were applied. Thus, the synthetic dataset was very similar to the real one, while complying by construction with the lack-of-variation assumption.
Normalization Results
Figure 1 displays the results of applying the four normalization methods to the real and synthetic datasets. Each panel shows the interquartile ranges of expression levels for the 153 samples, grouped in triplicates exposed to each experimental condition. Both median (second row) and quantile normalization (third row) yielded similar outputs for both datasets. In contrast, the condition-decomposition normalizations (fourth and fifth rows) detected much greater variation between conditions in the real dataset than in the synthetic one. This contrast follows from the construction of the conventional methods: median normalization makes the median of each sample the same, while quantile normalization makes the full distribution of each sample the same. Hence, if there were differences in medians or distributions of gene expression between experimental conditions, both methods would have removed them. Figures 1G,I show that such variation between conditions was present in the real dataset.
Influence of No-Variation Genes on Normalization
To clarify how the condition-decomposition normalizations preserved the variation between conditions, we studied the influence of the choice of no-variation genes in the final between-condition normalization. To this end, we obtained the between-condition variation with both methods in two families of cases. In one family, no-variation genes were chosen in decreasing order of p-values from an ANOVA test. In the other family, genes were chosen at random. The first option was similar to the approach implemented to obtain the results presented in Figures 1G–J, with the difference that there the number of genes was chosen automatically by a statistical test. As shown in Figure 2A, for the real dataset the random choice of genes resulted in n^(−1/2) decays (n being the number of chosen genes), followed by a plateau. The n^(−1/2) decays reflect the standard errors of the estimators of the normalization factors. Selecting the genes by decreasing p-values, however, yielded a completely different result. Up to a certain number of genes, the variance remained at a similar level, but for larger numbers of genes the variance dropped rapidly. Figure 2A shows, therefore, that between-condition variation was removed as soon as the between-condition normalizations used genes that varied in expression level across experimental conditions. The big circles in Figure 2A indicate the working points of the normalizations used to generate the results displayed in Figures 1G,I. In fact, these points slightly underestimated the variation between conditions. Although the statistical test for identifying no-variation genes ensured that there was no evidence of variation, inevitably the expression of some selected genes varied across conditions.
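The n^(−1/2) decay is the usual behavior of the standard error of a mean estimator, as a quick stdlib Python simulation illustrates (simulated unit-variance values, not the study's data):

```python
import random
from statistics import mean, pstdev

random.seed(3)

def se_of_mean(n, trials=2000):
    """Empirical standard error of the mean of n unit-variance variates."""
    means = [mean(random.gauss(0.0, 1.0) for _ in range(n))
             for _ in range(trials)]
    return pstdev(means)

# Averaging over four times as many genes should halve the standard
# error: the expected ratio is sqrt(100 / 25) = 2.
ratio = se_of_mean(25) / se_of_mean(100)
assert 1.6 < ratio < 2.4
```

This is why, with randomly chosen genes, the estimation error of the normalization factors keeps shrinking as more genes are added, until the plateau of genuine between-condition variation is reached.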
Figure 2B displays the results obtained with the synthetic dataset. There were no plateaus when no-variation genes were chosen randomly, only n^(−1/2) decays, and small differences when no-variation genes were selected by decreasing p-values. Big circles show that working points were selected with much larger numbers of genes in the synthetic dataset (Figs. 1H,J) than in the real dataset (Figs. 1G,I). The residual variation, produced by errors in the estimation of the normalization factors, was much smaller than the variation detected in the real dataset, especially for standard-vector condition-decomposition normalization. Overall, Figure 2 shows that the between-condition variation pictured in Figures 1G,I is not an artifact caused by using an exceedingly small or extremely particular set of genes in the final between-condition normalization, but that this variation originated from the real dataset.
Differential Gene Expression
Finally, Figure 3A shows the numbers of differentially expressed gene probes (DEGP), identified after normalizing with the four methods, for each of the 42 experimental treatments versus the corresponding control (Supp. Table 2). Compared to conventional methods, the number of DEGP detected with the condition-decomposition normalizations was much larger under most treatments, including some whose number of DEGP was larger by more than one order of magnitude. These are statistically significant changes of gene expression, i.e. changes that cannot be explained by chance. More important is the scale of the detected variation, as illustrated by the boxplots in Figure 3C showing absolute fold changes of DEGP detected after standard-vector condition-decomposition normalization. For all treatments, the entire interquartile range of absolute fold change is above 1.5-fold, and for more than two thirds of the treatments the median absolute fold change is greater than 2. This amount of gene expression variation cannot be neglected, and warrants further research to explore its biological significance.
Discussion
The variation between medians displayed in Figures 1G,I may seem surprising, given routine expectations based on current methods (Figs. 1C,E). Nevertheless, this variation inevitably results from the imbalance between over- and under-expressed genes. As an illustration, let us consider a case with two experimental conditions, in which the average expression of a given gene is less than the distribution median under one condition, but greater than the median under the other. The variation of this gene alone will change the value of the median to the expression level of the next-ranked gene. Therefore, if the number of over-expressed genes is different from the number of under-expressed genes, and enough changes cross the median boundary, then the median will substantially differ between conditions. Only when the differential expression is balanced or small enough will the median stay the same. This argument applies equally to any other quantile of the distribution of gene expression. Transcriptional amplification is an extreme example of change in the distribution of expression levels (Lovén et al., 2012), which can nevertheless be properly normalized with condition-decomposition methods, without resorting to spike-ins, as long as some genes are not differentially expressed.
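A toy numeric example (invented log2 levels) makes the argument concrete: unbalanced over-expression, with one gene crossing the median boundary, shifts the median upwards, and median normalization would subtract precisely this biological difference.

```python
from statistics import median

# Toy log2 expression levels for ten genes under a control condition.
control = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

# Treatment: the five highest-expressed genes are over-expressed and one
# of them crosses the median boundary; no gene is under-expressed, so
# the change is unbalanced.
treated = [1.0, 2.0, 3.0, 4.0, 7.5, 8.0, 9.0, 10.0, 11.0, 12.0]

assert median(control) == 5.5
assert median(treated) == 7.75  # the median itself has moved
```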
An important feature of the approaches to normalization proposed here (linear decomposition into normalization sub-problems per condition, and standard-vector normalization for each sub-problem) is that they do not depend on any particular aspect of the technology of gene expression microarrays or RNA-Seq. The numbers in the input data are interpreted as measured concentrations of mRNA molecules in order to identify the normalization factors, irrespective of whether the concentrations were obtained from fluorescence intensities of hybridized cDNA (microarrays) or from counts of sequenced mRNA fragments (RNA-Seq). Nevertheless, we consider that specific within-sample corrections for each technology are still necessary, and must be applied before the between-sample normalizations proposed here. Examples include background correction for microarrays or gene-length normalization (RPKM) for RNA-Seq. Equally, methods that address the influence of biological or technical confounding factors on downstream analyses can be applied when necessary, after normalizing.
The lack-of-variation assumption underlying the current methods of normalization was self-fulfilling, removing variation in gene expression that was present in the real dataset. Moreover, it had negative consequences for downstream analyses, as it both removed potentially important biological information and introduced errors in the detection of gene expression. Such removal of variation can be understood as the effect of errors in the estimation of normalization factors. Considering data and errors vectorially, the length of each vector equals, after centering and up to a constant factor, the standard deviation of the data or error. The addition of an error of small magnitude, compared to the data variance, has only a minor effect. However, errors of magnitude similar to or greater than the data variance may, depending on the lengths and relative angles of the vectors, severely distort the observed data variance. This will in turn cause spurious results in the statistical analyses. Furthermore, the angles between the data and the correct normalization factors (considered as vectors) are random: data reflect biological variation, while normalization factors respond to technical variation. If the experiment is repeated, even with exactly the same experimental settings, the errors in the normalization factors will vary randomly, causing random spurious results in the downstream analyses. This explains, at least partially, the lack of reproducibility found in transcriptomics studies, especially for the detection of small changes of gene expression, because small variations are the most likely to be distorted by errors in the estimates of normalization factors. Accordingly, the largest differences in numbers of DEGP detected by conventional compared to condition-decomposition methods (Fig. 3A) occurred consistently in the treatments with the smallest magnitudes of gene expression changes, e.g. treatments 28, 29 and 33 (Figs. 3B,C).
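The dependence on error magnitude can be illustrated with a small Python simulation (illustrative parameters; the data and error scales are assumptions):

```python
import random
from math import sqrt

random.seed(5)

def sd(v):
    """Population standard deviation of a list of numbers."""
    m = sum(v) / len(v)
    return sqrt(sum((x - m) ** 2 for x in v) / len(v))

# "Data": biological variation of a gene across 200 conditions.
data = [random.gauss(0.0, 1.0) for _ in range(200)]

# Normalization errors act as an added vector. A small error barely
# changes the observed variation; an error of magnitude comparable to
# the data variation distorts it substantially.
small_err = [x + random.gauss(0.0, 0.1) for x in data]
large_err = [x + random.gauss(0.0, 1.0) for x in data]

assert abs(sd(small_err) - sd(data)) < 0.1
assert abs(sd(large_err) - sd(data)) > 0.2
```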
In summary, this study shows that large numbers of genes change in expression level across experimental conditions, often strongly, and too extensively to be ignored in the normalization of gene expression data. Further, our approach, which avoids the prevailing lack-of-variation assumption, demonstrates that current normalization methods likely remove and distort important variation in gene expression. It also offers a means to investigate this variation, which we expect to provide revealing insights about diverse biomolecular processes, particularly those involving substantial numbers of genes, such as cell differentiation, toxic responses, diseases with non-Mendelian inheritance patterns and cancer. After years of lagging behind the advances in genome sequencing, we believe that the procedures presented here will assist efforts to realize the full potential of gene expression profiling.
Data Deposition and Code Availability
MIAME-compliant microarray data from the experiment were submitted to the Gene Expression Omnibus (GEO) at the NCBI website (platform: GPL20310; series: GSE69746, GSE69792, GSE69793 and GSE69794). Custom code that reproduces all the reported results starting from the raw microarray data is available at the GitHub repository https://github.com/carlosproca/gene-expr-norm-paper.
Author Contributions
S.I.L.G., M.J.B.A. and J.J.S.-F. designed the toxicity experiment. S.I.L.G. carried out the experimental work and collected the microarray data. C.P.R. designed and implemented the novel normalization methods. C.P.R. performed the statistical analyses. All the authors jointly discussed the results. C.P.R. drafted the paper, with input from all the authors. All the authors edited the final version of the paper.
Materials and Methods
Test Organism and Exposure Media
The test species was Enchytraeus crypticus. Individuals were cultured in Petri dishes containing agar medium, in controlled conditions (Gomes et al., 2015b).
For copper (Cu) exposure, a natural soil collected at Hygum, Jutland, Denmark was used (Gomes et al., 2015b; Scott-Fordsmand et al., 2000). For silver (Ag) and nickel (Ni) exposure, the natural standard soil LUFA 2.2 (LUFA Speyer, Germany) was used (Gomes et al., 2015b). The exposure to ultra-violet (UV) radiation was done in ISO reconstituted water (OECD, 2004a).
Test Chemicals
The tested Cu forms (Gomes et al., 2015b) included copper nitrate (Cu(NO3)2·3H2O, > 99%, Sigma Aldrich), Cu nanoparticles (Cu-NPs, 20–30 nm, American Elements) and Cu nanowires (Cu-Nwires, synthesized by reduction of copper(II) nitrate with hydrazine in alkaline medium (Chang et al., 2005)).
The tested Ag forms (Gomes et al., 2015b) included silver nitrate (AgNO3, > 99%, Sigma Aldrich), non-coated Ag nanoparticles (Ag-NPs Non-Coated, 20–30 nm, American Elements),
Polyvinylpyrrolidone (PVP)-coated Ag nanoparticles (Ag-NPs PVP-Coated, 20–30 nm, American Elements), and Ag NM300K nanoparticles (Ag NM300K, 15 nm, JRC Repository). The Ag NM300K was dispersed in 4% Polyoxyethylene Glycerol Trioleate and Polyoxyethylene (20) Sorbitan monolaurate (Tween 20); hence the dispersant alone was also tested as a control (CTdisp).
The tested Ni forms included nickel nitrate (Ni(NO3)2·6H2O, ≥ 98.5%, Fluka) and Ni nanoparticles (Ni-NPs, 20 nm, American Elements).
Spiking Procedure
Spiking for the Cu and Ag materials was done as in previous work (Gomes et al., 2015b). For the Ni materials, the Ni-NPs were added to the soil as powder, following the same procedure as for the Cu materials. Ni(NO3)2, being soluble, was added to the pre-moistened soil in aqueous solution.
The concentrations tested were selected based on the reproduction effect concentrations EC20 and EC50 for E. crypticus, with 95% confidence intervals, as follows: CuNO3 EC20/50 = 290/360 mgCu/kg, Cu-NPs EC20/50 = 980/1760 mgCu/kg, Cu-Nwires EC20/50 = 850/1610 mgCu/kg, Cu-Field EC20/50 = 500/1400 mgCu/kg, AgNO3 EC20/50 = 45/60 mgAg/kg, Ag-NP PVP-coated EC20/50 = 380/550 mgAg/kg, Ag-NP Non-coated EC20/50 = 380/430 mgAg/kg, Ag NM300K EC20/50 = 60/170 mgAg/kg, CTdisp = 4% w/w Tween 20, NiNO3 EC20/50 = 40/60 mgNi/kg, Ni-NPs EC20/50 = 980/1760 mgNi/kg.
Four biological replicates were performed per test condition, including controls. For Cu exposure, the control condition for all the treatments consisted of soil from a control area at Hygum site, which has a Cu background concentration of 15 mg/kg (Scott-Fordsmand et al., 2000). For Ag exposure, two control sets were performed: CT (un-spiked LUFA soil, to be the control condition for AgNO3, Ag-NPs PVP-Coated and Ag-NPs Non-Coated treatments) and CTdisp (LUFA soil spiked with the dispersant Tween 20, to be the control condition for the Ag NM300K treatments). For Ni exposure, the control consisted of un-spiked LUFA soil.
Exposure Details
In soil (i.e. for Cu, Ag and Ni), exposure followed the standard ERT (OECD, 2004b) with the following adaptations: twenty adults with well-developed clitellum were introduced in each test vessel, containing 20 g of moist soil (control or spiked). The organisms were exposed for three and seven days under controlled conditions of photoperiod (16:8 h light:dark) and temperature (20 ± 1 °C), without food. After the exposure period, the organisms were carefully removed from the soil and frozen in liquid nitrogen. The samples were stored at −80 °C until analysis.
For UV exposure, the test conditions (OECD, 2004a) were adapted for E. crypticus (Gomes et al., 2015a). The exposure was performed in 24-well plates, where each well corresponded to a replicate and contained 1 ml of ISO water and five adult organisms with clitellum. The test duration was five days, at 20 ± 1 °C. The organisms were exposed to UV on a daily basis, for 15 minutes per day, at two UV intensities (280–400 nm) of 1669.25 ± 50.83 and 1804.08 ± 43.10 mW/m2, corresponding to total UV doses of 7511.6 and 8118.35 J/m2, respectively. The remaining time was spent under standard laboratory illumination (16:8 h photoperiod). UV radiation was provided by a UV lamp (Spectroline XX15F/B, Spectronics Corporation, NY, USA, peak emission at 312 nm), and a cellulose acetate sheet was coupled to the lamp to cut off UVC-range wavelengths (Gomes et al., 2015a). Thirty-two replicates per test condition (including a control without UV radiation) were performed, to obtain 4 biological replicates with 40 organisms each for RNA extraction. After the exposure period, the organisms were carefully removed from the water and frozen in liquid nitrogen. The samples were stored at −80 °C until analysis.
RNA Extraction, Labeling and Hybridization
RNA was extracted from each replicate, which contained a pool of 20 organisms (soil exposures) or 40 organisms (water exposure). Three biological replicates per test treatment (including controls) were used. Total RNA was extracted using the SV Total RNA Isolation System (Promega). RNA quantity and purity were measured spectrophotometrically (NanoDrop ND-1000 Spectrophotometer), and RNA quality was checked by denaturing formaldehyde agarose gel electrophoresis.
500 ng of total RNA were amplified and labeled with Agilent Low Input Quick Amp Labeling Kit (Agilent Technologies, Palo Alto, CA, USA). Positive controls were added with the Agilent one-color RNA Spike-In Kit. Purification of the amplified and labeled cRNA was performed with RNeasy columns (Qiagen, Valencia, CA, USA).
The cRNA samples were hybridized on custom Gene Expression Agilent Microarrays (4 x 44k format), with a single-color design (Castro-Ferreira et al., 2014). Hybridizations were performed using the Agilent Gene Expression Hybridization Kit, and each biological replicate was individually hybridized on one array. The arrays were hybridized at 65 °C, with rotation at 10 rpm, during 17 h. Afterwards, microarrays were washed using the Agilent Gene Expression Wash Buffer Kit and scanned with the Agilent DNA microarray scanner G2505B.
Data Acquisition and Analysis
Fluorescence intensity data were obtained with Agilent Feature Extraction Software v. 10.7.3.1, using the recommended protocol GE1_107_Sep09. Quality control was done by inspecting the reports on the Agilent Spike-in control probes. Background correction was provided by the Agilent Feature Extraction software. To ensure an optimal comparison between the different normalization methods, only gene probes with good signal quality (flag IsPosAndSignif = True) in all samples were employed in the analyses. This implied the selection of 18,339 gene probes from a total of 43,750. Analyses were performed with R (R Core Team, 2015) v. 3.2.2, using the R packages plotrix and RColorBrewer, and with the Bioconductor (Huber et al., 2015) v. 3.1 packages genefilter and limma (Ritchie et al., 2015).
The synthetic data were generated gene by gene as normal variates with mean and variance equal, respectively, to the sample mean and sample variance of the real data. The applied normalization factors were those detected from the real data with standard-vector condition-decomposition normalization.
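One way to enforce the gene-by-gene match exactly is to rescale the generated variates, sketched here in Python with hypothetical data (the paper's code is in R, and its exact generation procedure may differ):

```python
import random
from statistics import mean, pstdev

random.seed(42)

# Hypothetical "real" data: four genes (rows) over six samples (columns).
real = [[random.gauss(5.0 + g, 0.5) for _ in range(6)] for g in range(4)]

def synthesize(row):
    """Normal variates rescaled so that the synthetic row matches the
    real row's sample mean and standard deviation exactly."""
    m, s = mean(row), pstdev(row)
    z = [random.gauss(0.0, 1.0) for _ in row]
    zm, zs = mean(z), pstdev(z)
    return [m + s * (v - zm) / zs for v in z]

synthetic = [synthesize(row) for row in real]

for r, s in zip(real, synthetic):
    assert abs(mean(r) - mean(s)) < 1e-9
    assert abs(pstdev(r) - pstdev(s)) < 1e-9
```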
Median normalization was performed by subtracting the median of each sample distribution, and then adding the overall median to preserve the global expression level. Quantile normalization was performed as implemented in the limma package.
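The median normalization step can be sketched as follows (Python illustration with hypothetical log2 values; the paper's implementation is in R):

```python
from statistics import median

def median_normalize(samples):
    """Subtract each sample's median, then add back the overall median
    so that the global expression level is preserved (log2 scale)."""
    overall = median([x for s in samples for x in s])
    return [[x - median(s) + overall for x in s] for s in samples]

# Hypothetical log2 expression levels for three samples of five genes.
samples = [[2.0, 4.0, 6.0, 8.0, 10.0],
           [3.0, 5.0, 7.0, 9.0, 11.0],
           [1.0, 3.0, 5.0, 7.0, 9.0]]

normalized = median_normalize(samples)
# After normalization every sample has the same median.
meds = [median(s) for s in normalized]
assert meds == [6.0, 6.0, 6.0]
```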
The two condition-decomposition normalizations proceeded in the same way: first, 51 independent within-condition normalizations using all genes; then, a final between-condition normalization, iteratively detecting no-variation genes and normalizing until convergence.
No-variation genes were identified with one-sided Kolmogorov-Smirnov tests, as goodness-of-fit tests against the uniform distribution, carried out on the greatest p-values obtained from an ANOVA test on the complete dataset (see below). The ANOVA test benefited from the already corrected within-condition variances, provided by the within-condition normalizations. The KS test was rejected at α = 0.001.
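The uniformity check on p-values can be sketched with a stdlib implementation of the one-sided KS statistic (a simplified illustration; the paper's exact selection procedure and thresholds differ in detail):

```python
def ks_uniform_stat(pvalues):
    """One-sided Kolmogorov-Smirnov statistic D+ for a goodness-of-fit
    test of the sample against the uniform distribution on [0, 1]."""
    p = sorted(pvalues)
    n = len(p)
    return max((i + 1) / n - v for i, v in enumerate(p))

# Roughly uniform p-values (as expected for no-variation genes) give a
# small statistic; p-values skewed towards zero (evidence of
# differential expression) give a large one.
uniform_like = [(i + 0.5) / 100 for i in range(100)]
skewed = [((i + 0.5) / 100) ** 3 for i in range(100)]

assert ks_uniform_stat(uniform_like) < 0.05
assert ks_uniform_stat(skewed) > 0.3
```

Comparing the statistic with the appropriate critical value at α = 0.001 then decides whether the tested set of genes is compatible with no variation.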
The criterion for convergence for the median condition-decomposition (CD) normalizations was to require that the relative changes in the standard deviation of the normalization factors were less than 1%, or less than 10% for 10 steps in a row. In the case of standard-vector CD normalizations, convergence required that numerical errors were, compared to the estimated statistical errors (see below), less than 1%, or less than 10% for 10 steps in a row. For Figure 2 and Supplementary Figure 1, due to the very low number of gene probes in some cases, the thresholds for convergence for 10 steps in a row were increased to 80% and 50%, respectively, for median CD and standard-vector CD normalization.
In standard-vector CD normalization, the distribution of standard vectors was trimmed in each step to remove the 1% most extreme values of variance.
Differentially expressed gene probes were identified with limma (Fig. 3) or t-tests (Supp. Fig. 2), using in all cases an FDR threshold of 5%.
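FDR control at a given threshold can be illustrated with a stdlib sketch of the Benjamini-Hochberg step-up procedure, which underlies the usual FDR adjustment (the p-values below are invented):

```python
def benjamini_hochberg(pvalues, fdr=0.05):
    """Return the indices of hypotheses rejected at the given FDR
    threshold using the Benjamini-Hochberg step-up procedure."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    k = 0  # largest rank whose p-value passes its step-up threshold
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * fdr / n:
            k = rank
    return sorted(order[:k])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.5, 0.9]
assert benjamini_hochberg(pvals) == [0, 1]
```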
The reference distribution with permutation symmetry shown in the polar plots of the probability density function in Supplementary Movies 1–3 was calculated with the 6 permutations of the empirical standard vectors. The Watson U² statistic was calculated with the two-sample test (Durbin, 1973). An equal number of samples for comparison was obtained by sampling the permuted standard vectors with replacement.
Mathematical Methods
In a gene expression dataset with g genes, c experimental conditions and n samples per condition, the observed expression levels of gene j in condition k, $y_j^{(k)} \in \mathbb{R}^n$, can be expressed in log2-scale as
$$y_j^{(k)} = x_j^{(k)} + a^{(k)}, \qquad (1)$$
where $x_j^{(k)}$ is the vector of true gene expression levels and $a^{(k)}$ is the vector of normalization factors.
Given a sample vector $x = (x_1, \dots, x_n)$, the mean vector is $\bar{x} = \big(\tfrac{1}{n}\sum_{i=1}^{n} x_i\big)\,\mathbf{1}$, and the residual vector is $\tilde{x} = x - \bar{x}$.
Then, (1) can be linearly decomposed into
$$\bar{y}_j^{(k)} = \bar{x}_j^{(k)} + \bar{a}^{(k)}, \qquad (2)$$
$$\tilde{y}_j^{(k)} = \tilde{x}_j^{(k)} + \tilde{a}^{(k)}. \qquad (3)$$
Equations (3) define the within-condition normalizations for each condition k. The scalar values in (2) (the common components of the mean vectors), collected for each gene j into c-dimensional vectors of condition means $m_j^y$, $m_j^x$ and $m^a$, are used to obtain the equations on condition means,
$$\bar{m}_j^y = \bar{m}_j^x + \bar{m}^a, \qquad (4)$$
$$\tilde{m}_j^y = \tilde{m}_j^x + \tilde{m}^a. \qquad (5)$$
The between-condition normalization is defined by (5). Equations (4) reduce to a single number, which is irrelevant to the normalization. The complete solution for each condition is obtained with $a^{(k)} = \tilde{a}^{(k)} + \tilde{m}^{a\,(k)}\,\mathbf{1}$.
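The exactness and linearity of the mean/residual decomposition can be checked numerically (Python sketch with invented values):

```python
from statistics import mean

def bar(x):
    """Mean vector: every component equal to the mean of x."""
    m = mean(x)
    return [m] * len(x)

def tilde(x):
    """Residual vector: x minus its mean vector."""
    m = mean(x)
    return [v - m for v in x]

# Hypothetical true levels x and normalization factors a for one gene
# in one condition with n = 4 samples; y = x + a as in equation (1).
x = [5.1, 4.8, 5.4, 5.0]
a = [0.3, -0.1, 0.2, 0.0]
y = [xi + ai for xi, ai in zip(x, a)]

# The decomposition is linear and exact: bar(y) = bar(x) + bar(a),
# tilde(y) = tilde(x) + tilde(a), and bar + tilde recovers y.
for yi, bi, ti in zip(y, bar(y), tilde(y)):
    assert abs(yi - (bi + ti)) < 1e-12
for ty, tx, ta in zip(tilde(y), tilde(x), tilde(a)):
    assert abs(ty - (tx + ta)) < 1e-12
```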
The n samples of gene j in a given condition can be modeled with the random vectors $Y_j = (Y_{j1}, \dots, Y_{jn})$. Again, $Y_j = X_j + a$, where a is a fixed vector of normalization factors. Writing $u(x) = \tilde{x}/\lVert \tilde{x} \rVert$ for the standard vector of a sample vector x, it can be proved, under fairly general assumptions, that the true standard vectors have zero expected value,
$$\mathrm{E}\big[u(X_j)\big] = 0,$$
whereas the observed standard vectors verify, as long as $\tilde{a} \neq 0$,
$$\mathrm{E}\big[u(Y_j)\big] \neq 0.$$
This motivates the following iterative procedure to solve (3) and (5) (standard-vector normalization): starting from $\hat{a}_0 = 0$, at each step t the current estimate is corrected with the mean of the standard vectors of the partially normalized data,
$$\hat{a}_{t+1} = \hat{a}_t + c_t\,\frac{1}{g}\sum_{j=1}^{g} u\big(y_j - \hat{a}_t\big),$$
where $c_t > 0$ is a scale factor.
At convergence, $\frac{1}{g}\sum_{j} u(y_j - \hat{a}) = 0$, which implies $\mathrm{E}\big[u(Y_j - \hat{a})\big] \approx 0$ and $\hat{a} \approx a$. Convergence is faster the more symmetric the empirical distribution of $u(y_j - \hat{a}_t)$ is on the unit (n − 2)-sphere. Convergence is optimal with spherically symmetric distributions, such as the Gaussian distribution, because in that case the expected value of the mean standard vector is parallel to the remaining normalization error $\tilde{a} - \hat{a}_t$.
Assuming no correlation between genes, an approximation of the statistical error at step t can be obtained with the standard error of the mean of the standard vectors,
$$\epsilon_t = \frac{c_t}{\sqrt{g}}\,\Big\lVert \mathrm{sd}_j\big[u(y_j - \hat{a}_t)\big] \Big\rVert,$$
where $\mathrm{sd}_j$ denotes the componentwise standard deviation over genes.
This statistical error is compared with the numerical error to assess convergence.
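A schematic Python sketch of such an iteration, for a single within-condition normalization, is given below (the update rule, step size and simulated data are illustrative assumptions, not the paper's R implementation):

```python
import random
from statistics import mean

random.seed(7)

def standard_vector(x):
    """Mean-center a replicate vector and scale it to unit length."""
    m = mean(x)
    r = [v - m for v in x]
    norm = sum(v * v for v in r) ** 0.5
    return [v / norm for v in r]

# Simulated condition: 1000 genes, n = 3 replicates, with an unknown
# additive offset per sample (the normalization factors to recover).
true_a = [0.4, -0.2, -0.2]
genes = [[random.gauss(7.0, 1.0) + ai for ai in true_a]
         for _ in range(1000)]

# Illustrative fixed-point iteration: shift each sample by the current
# estimate, average the standard vectors, and use that mean (times an
# assumed step size of 0.5) as the correction.
est = [0.0, 0.0, 0.0]
for _ in range(50):
    u = [standard_vector([g[i] - est[i] for i in range(3)]) for g in genes]
    grad = [mean(col) for col in zip(*u)]
    est = [e + 0.5 * d for e, d in zip(est, grad)]

# The estimate recovers the centered offsets (true_a already sums to 0).
for e, t in zip(est, true_a):
    assert abs(e - t) < 0.15
```

The residual discrepancy between the estimate and the true offsets reflects the statistical error discussed above, which shrinks as the number of genes grows.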
See Supplementary Material for a detailed exposition of the mathematical methods, and Supplementary Movies 1–5 for an illustration.
Acknowledgements
This work was funded by the European Union FP7 projects MODERN (Ref. 309314-2) (C.P.R., J.J.S.-F.) and MARINA (Ref. 263215) (J.J.S.-F.), by FEDER through COMPETE (Programa Operacional Factores de Competitividade) and FCT (Fundação para a Ciência e Tecnologia) through project bio-CHIP (Ref. FCT EXPL/AAG-MAA/0180/2013) (S.I.L.G., M.J.B.A.), and by a post-doctoral grant (Ref. SFRH/BPD/95775/2013) (S.I.L.G.).