ABSTRACT
Analysis of single-cell RNA-seq data is challenging due to technical variability, high noise levels and massive sample sizes. Here, we describe a normalization technique that substantially reduces technical variability and improves the quality of downstream analyses. We also introduce a nonparametric method for detecting differentially expressed genes that scales to > 1,000 cells and is both more accurate and ~10 times faster than existing parametric approaches.
MAIN TEXT
Single-cell RNA sequencing (scRNA-seq) technologies have revolutionized functional genomics by providing a window into heterogeneity within complex cellular ensembles.1 However, substantial challenges remain in analyzing the resulting data. Single-cell transcriptomes are distorted by technical biases such as RNA degradation during cell isolation and processing, variable reagent amounts, presence of cellular debris and PCR amplification bias.2,3 Moreover, due to the small number of molecules under investigation (< 5 transcripts per moderately expressed gene), single-cell expression estimates are inherently noisy,2 even in the absence of technical variability. Thus, the two most basic components of transcriptomic data analysis, normalization and differential expression (DE) analysis, are uniquely challenging for scRNA-seq. In addition, the introduction of new, high-throughput technologies such as inDrop/Drop-seq4 has created a pressing need for algorithms that can scale beyond a thousand cells.
The most widely used RNA-seq normalization methods, which include Fragments Per Kilobase per Million reads5 (FPKM), size factor6 (DESeq algorithm) and Trimmed Mean of M-values7 (TMM; edgeR algorithm), were originally developed for bulk RNA-seq data. These methods assume that the number of reads for a given transcript is directly proportional to its expression level, after adjusting for technical covariates. If this assumption were true, the distribution of expression estimates (counts, FPKMs) would have a consistent shape across samples of the same type, and this is indeed frequently observed in bulk data (Fig. 1a). However, we found that the shape of the scRNA-seq expression distribution was highly variable from cell to cell (Fig. 1b,c), even in the absence of cell cycle variability (Fig. 1d). Thus, the proportionality assumption may not be appropriate for scRNA-seq normalization.
Strategies originally devised for bulk RNA-seq are also commonly employed for identifying DE genes in scRNA-seq data. In particular, previous single-cell studies assumed that gene expression variation followed a well-defined parametric distribution.6,8,9 Parametric tests were therefore used to estimate the statistical significance of differential expression. However, it is well known that when the number of samples is large, as is typical of single-cell data, nonparametric tests have very similar statistical power.6 In fact, when the true distribution violates parametric assumptions, parametric methods are actually less powerful.10 Parametric DE algorithms are also typically slower, due to greater computational complexity. Parametric tests are therefore not ideally suited to data from many hundreds or thousands of single cells.4 Thus, it is important to develop nonparametric methods for single-cell DE analysis.
We propose a two-pronged approach to address the above problems. Because it does not require the expression distribution to have a specific form, it can be applied to any single-cell omics dataset. 1) For expression normalization, we use a variant of quantile (Q) normalization.11 One drawback of Q normalization is that it does not address variability in the number of detected genes. Consequently, if a cell type A has fewer detected genes than another cell type B, many lowly expressed genes will remain undetected in the former, resulting in artifactual DE calls. Scaling-based normalization methods are also susceptible to this artifact. Our approach, which we term pseudocounted Quantile (pQ) normalization, homogenizes the expression of all genes below a fixed rank in each cell (Supplementary Material). 2) For DE analysis, we introduce a novel nonparametric statistic that combines rank-based and mean-based measures of expression divergence.
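The exact pQ procedure is defined in the Supplementary Material, which is not reproduced here. The following Python sketch is only one plausible reading of the description above: it applies plain quantile normalization and then raises every gene ranked below a fixed rank to a single shared value. The function names, the `top_k` parameter, the tie handling (ties broken by gene index) and the choice of the shared value are all illustrative assumptions, not the interface or behavior of the authors' R package.

```python
def quantile_normalize(cells):
    """Plain quantile normalization.

    cells: list of per-cell expression lists, one value per gene,
           all lists the same length.
    """
    n_genes = len(cells[0])
    # Sort each cell from highest to lowest expression, then average
    # across cells rank by rank to get the shared reference distribution.
    sorted_cells = [sorted(c, reverse=True) for c in cells]
    reference = [
        sum(s[rank] for s in sorted_cells) / len(cells)
        for rank in range(n_genes)
    ]
    normalized = []
    for c in cells:
        # Gene indices ordered from highest to lowest expression.
        # sorted() is stable, so ties are broken by gene index (assumption).
        order = sorted(range(n_genes), key=lambda g: -c[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]
        normalized.append(out)
    return normalized


def pq_normalize(cells, top_k):
    """Assumed pQ variant: genes ranked below top_k share one floor value."""
    q = quantile_normalize(cells)
    # After quantile normalization every cell holds the same set of values,
    # so the floor can be read from any cell.
    floor = sorted(q[0], reverse=True)[top_k - 1]
    return [[max(v, floor) for v in cell] for cell in q]


if __name__ == "__main__":
    toy = [[5, 0, 3, 1], [2, 8, 0, 0]]  # two cells, four genes
    print(pq_normalize(toy, top_k=2))
```

In this reading, a cell with few detected genes and a cell with many detected genes end up with identical values for all genes outside the top `top_k`, so differences in detection rate alone cannot produce a difference in normalized expression.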
Methods for scRNA-seq normalization have not previously been systematically evaluated. For comparative benchmarking, we accessed scRNA-seq data from four previous studies that also performed bulk RNA-seq on the same samples12–15 (Supplementary Table 1). In addition to pQ normalization, we tested seven existing normalization methods: Q, total read count16 (TC), median16 (Med), DESeq, TMM, FPKM and raw read count (RC) quantified using HTSeq17 (Count; Supplementary Fig. 1). Expression was quantified as normalized read counts in all cases, except in the case of FPKM normalization. Performance was evaluated based on three criteria: 1) proportion of expression variance attributable to technical bias, 2) accuracy of DE calls and 3) transcriptomic separability of cell types.
It has recently been reported that a substantial proportion of scRNA-seq expression variability is explained by a single technical factor: the number of detected genes (b) in each cell.18 A good normalization method should minimize the impact of this technical bias. We therefore calculated the post-normalization expression variance explained by b, and used it as a measure of normalization efficacy (Supplementary Material). We performed this analysis on five published scRNA-seq datasets, namely Avraham, Darmanis, Patel, Shalek and Trapnell (Supplementary Table 1). In addition, we generated a sixth scRNA-seq dataset consisting of 333 single cells from mouse brain, isolated at three developmental time points from three brain regions (Supplementary Material). The latter dataset provides a rigorous in vivo benchmark representing multiple co-existing cell types processed in multiple batches. On all six datasets, including mouse brain, pQ normalization was the most effective in minimizing technical bias, closely followed by conventional Q normalization (Fig. 1e).
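The precise definition of "variance explained by b" is given in the Supplementary Material. As a rough illustration only, the Python sketch below assumes a simple measure: for each gene, fit a straight line of normalized expression against b across cells, and report the explained sum of squares pooled over genes as a fraction of the total. The function names and the zero detection threshold are hypothetical choices made for this example.

```python
def detected_genes(raw_cells, threshold=0.0):
    """b for each cell: number of genes with raw expression above threshold."""
    return [sum(1 for v in cell if v > threshold) for cell in raw_cells]


def variance_explained_by(norm_cells, b):
    """Fraction of total expression variance explained by a linear fit on b.

    norm_cells: list of per-cell normalized expression lists (one value per gene).
    b:          one number per cell (here, the number of detected genes).
    """
    n_cells = len(norm_cells)
    n_genes = len(norm_cells[0])
    mean_b = sum(b) / n_cells
    ss_b = sum((x - mean_b) ** 2 for x in b)

    total = 0.0      # total sum of squares, pooled over genes
    explained = 0.0  # sum of squares captured by the per-gene linear fits
    for g in range(n_genes):
        y = [cell[g] for cell in norm_cells]
        mean_y = sum(y) / n_cells
        ss_y = sum((v - mean_y) ** 2 for v in y)
        cov = sum((x - mean_b) * (v - mean_y) for x, v in zip(b, y))
        total += ss_y
        if ss_b > 0:
            # Explained sum of squares of a simple linear regression of y on b.
            explained += cov * cov / ss_b
    return explained / total if total > 0 else 0.0


if __name__ == "__main__":
    raw = [[1, 0, 0], [2, 1, 0], [3, 1, 1]]   # three cells, three genes
    b = detected_genes(raw)                    # computed on raw data
    norm = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
    print(b, variance_explained_by(norm, b))
```

Under this assumed measure, a lower value after normalization indicates that less of the remaining expression variability tracks the number of detected genes.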
An accurate normalization method should also improve the quality of subsequent DE gene analysis. For each of the seven count-based normalization methods, we therefore examined the accuracy of single-cell DE gene calls (we tested the performance of FPKM normalization only for the Shalek and Trapnell datasets). A single-cell DE gene call was counted as a true positive if the gene was also present in the matched bulk-transcriptome DE list. If not, it was counted as a false positive. The Wilcoxon rank sum test19 was used to call single-cell and bulk DE genes. This analysis was performed on the four datasets that contained matched single-cell and bulk profiles (Avraham, Patel, Shalek, Trapnell). On all four datasets, pQ normalization maximized the accuracy of single-cell DE analysis (Fig. 1f; Supplementary Fig. 2). TC and Q were second and third, respectively, based on average rank (Supplementary Table 2).
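The true-positive and false-positive rule described above can be written down in a few lines. The Python sketch below only counts calls against a bulk DE list and reports the fraction of single-cell calls confirmed by bulk; the summary metric actually plotted in Fig. 1f is not specified in this text, so the precision value here is an assumption, as are the function name and the example gene symbols.

```python
def score_de_calls(single_cell_de, bulk_de):
    """Score single-cell DE calls against a matched bulk DE gene list.

    A single-cell call is a true positive if the gene is also in the
    bulk list, and a false positive otherwise.
    """
    sc = set(single_cell_de)
    bulk = set(bulk_de)
    true_pos = len(sc & bulk)
    false_pos = len(sc - bulk)
    # Fraction of single-cell calls supported by bulk (assumed summary metric).
    precision = true_pos / len(sc) if sc else 0.0
    return true_pos, false_pos, precision


if __name__ == "__main__":
    # Hypothetical gene lists, for illustration only.
    sc_calls = ["Gfap", "Aqp4", "Snap25"]
    bulk_calls = ["Gfap", "Aqp4", "Olig2"]
    print(score_de_calls(sc_calls, bulk_calls))
```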
In order to devise a nonparametric test for single-cell DE analysis, we first considered the Wilcoxon rank sum test.19 One potential limitation we anticipated was that this rank-based test ignored the magnitude of expression deviation between the two groups of cells. We therefore also calculated the expression-difference statistic D from the NOISeqBIO algorithm20,21 (Supplementary Material). The D-statistic exploits the fact that, under the central limit theorem, the distribution of the mean expression value in each cell population is approximately Gaussian. It is thus broadly applicable, regardless of the parametric form of the underlying single-cell expression distributions. It is also robust since outliers are removed by the median-based pQ-normalization step. We then used Fisher’s method to combine p-values from the two statistical tests into a single, composite p-value for differential expression.
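For two p-values, Fisher's method has a simple closed form. The statistic X = -2(ln p1 + ln p2) is referred to a chi-square distribution with 4 degrees of freedom, whose upper tail is exp(-X/2)(1 + X/2); substituting gives a combined p-value of p1·p2·(1 - ln(p1·p2)). The Python sketch below implements only this combination step, treating the two inputs as independent, as Fisher's method does. The function name is hypothetical, and how the Wilcoxon and D-statistic p-values are themselves obtained is left to the Supplementary Material.

```python
import math


def fisher_combine_two(p_rank, p_mean):
    """Combine two p-values with Fisher's method (chi-square, 4 d.f.).

    p_rank: p-value from a rank-based test (e.g. Wilcoxon rank sum).
    p_mean: p-value from a mean-based test (e.g. a D-type statistic).
    """
    for p in (p_rank, p_mean):
        if not 0.0 < p <= 1.0:
            raise ValueError("p-values must lie in (0, 1]")
    product = p_rank * p_mean
    # Upper tail of chi-square(4) at X = -2 * ln(product):
    # exp(-X/2) * (1 + X/2) = product * (1 - ln(product)).
    return product * (1.0 - math.log(product))


if __name__ == "__main__":
    print(fisher_combine_two(0.05, 0.05))
    print(fisher_combine_two(0.001, 0.6))
```

A gene that is only moderately significant under each test separately can receive a smaller combined p-value, which is the sense in which the composite statistic draws on both rank-based and mean-based evidence.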
The DE analysis method described above, which we named NOnparametric Differential Expression for Single-cells (NODES), was compared to three methods originally designed for bulk-sample DE analysis (DESeq2,22 edgeR8 and NOISeqBIO10), one recently developed algorithm for single-cell DE analysis (scde9) and also the Wilcoxon rank sum test (the D-statistic, by itself, was not found to be effective). As above, we applied these methods to the Avraham, Patel, Shalek and Trapnell datasets and used the corresponding bulk-sample DE gene calls for benchmarking. To facilitate uniform comparison, all DE algorithms were provided with the same TMM-normalized expression matrix as input. On all four datasets, NODES generated the most accurate DE calls (Fig. 2a, Supplementary Fig. 6) and the Wilcoxon test was on average second-best (Supplementary Table 3). The same was true when the DE algorithms were allowed to use their own default normalization strategies. We also tested the DE analysis algorithms on 10 simulated datasets generated by resampling expression values from a genuine scRNA-seq dataset (Supplementary Material). Again, NODES clearly outperformed the other methods (Supplementary Fig. 7). As a resource, we therefore used NODES to define marker genes uniquely expressed in each of the five major mouse brain cell types (Fig. 2b; Supplementary File).
Given the high level of noise in scRNA-seq data, large numbers of cells are increasingly being profiled in a single experiment in order to achieve sufficient statistical power for downstream analyses. Scalability is thus an essential feature of scRNA-seq analysis algorithms. We used a collection of 1,596 scRNA-seq datasets4 (inDrop/Drop-seq protocol) to measure execution time for the five tested DE analysis algorithms (Supplementary Material). Notably, the parametric methods (scde, DESeq2 and edgeR) could not easily scale beyond 200-400 cells, either due to memory limitations or excessive run time (Fig. 2c). In contrast, the nonparametric methods produced results relatively quickly, even on datasets as large as ~1,600 cells.
The above results demonstrate the value of pQ normalization in reducing technical variability and improving the quality of downstream expression analysis. They also highlight the superior accuracy of NODES in calling DE genes, relative to existing methods. One significant advantage of the nonparametric methods introduced in this study is that they can straightforwardly be applied to scRNA-seq data from any experimental protocol, since they make no assumptions about the shape of the distribution. The nonparametric DE approach also provides a transformative reduction in computational complexity and execution time, which will be crucial for analyzing the massive single-cell datasets generated by inDrop/Drop-seq and other high-throughput single-cell technologies.
Software
An R package implementing pQ and NODES is available at https://goo.gl/Ndx07M.
Author Contributions
DS, SP and BL conceived the study. DS and SP designed the statistical methods and developed the analysis strategies. DS implemented the methods and performed the analyses. NAR performed the single-cell RNA-seq experiments with assistance from ML. DS and SP wrote the manuscript with help from BL and NAR. All authors read and approved the manuscript.
Acknowledgments
This work is supported by grant #SPF2012/003 from the Agency for Science, Technology and Research (A*STAR), Singapore.