TY - JOUR
T1 - Normalization of Single Cell RNA Sequencing Data Using both Control and Target Genes
JF - bioRxiv
DO - 10.1101/045070
SP - 045070
AU - Mengjie Chen
AU - Xiang Zhou
Y1 - 2016/01/01
UR - http://biorxiv.org/content/early/2016/03/21/045070.abstract
N2 - Single cell RNA sequencing (scRNAseq) is becoming increasingly popular for unbiased, high-resolution transcriptome analysis of heterogeneous cell populations. Despite its many advantages, scRNAseq, like any other genomic sequencing technique, is susceptible to confounding effects. Controlling for confounding effects in scRNAseq data is thus a crucial step for proper data normalization and accurate downstream analysis. Several recent methodological studies have demonstrated the use of control genes to control for confounding effects in scRNAseq studies; the control genes are used to infer the confounding effects, which are then used to normalize the target genes of primary interest. However, these methods can be suboptimal because they ignore the rich information contained in the target genes. Here, we develop an alternative statistical method, which we refer to as scPLS, for more accurate inference of confounding effects. Our method is based on partial least squares and models control and target genes jointly to better infer and control for confounding effects. To accompany our method, we develop a novel expectation maximization algorithm for scalable inference. Our algorithm is an order of magnitude faster than standard ones, making scPLS applicable to hundreds of cells and hundreds of thousands of genes. With extensive simulations and comparisons with other methods, we demonstrate the effectiveness of scPLS.
We apply scPLS to three scRNAseq data sets to further illustrate its benefits in removing technical confounding effects as well as cell cycle effects. Author Summary: Data normalization is crucial for accurate estimation of gene expression levels and successful downstream analysis in single cell RNA sequencing (scRNAseq) studies. We present a novel statistical method that solves a key challenge in data normalization for scRNAseq: controlling for hidden confounding factors (e.g., batch effects and cell cycle effects) and removing unwanted variation. Compared to some recent methods that use a small set of control genes to infer and control for confounding effects, we propose instead modeling both control and non-control genes jointly. Through extensive simulations and case studies, we demonstrate that joint modeling enables much more accurate data normalization than previous approaches.
ER -