ABSTRACT
Single-cell RNA sequencing (scRNA-seq) is a promising technique for determining the states of individual cells and classifying novel cell subtypes. Computationally, processing scRNA-seq data is a daunting challenge because the data are noisy, very large, and high-dimensional. A common compromise in current scRNA-seq analysis is to omit genes with low expression, which leads to inaccurate gene counts. In this paper, we introduce a broadly applicable, data-driven gene expression recovery framework, referred to as the self-consistent expression recovery machine (SERM), to impute missing gene expression. Using deep learning, SERM first learns from a subset of the noisy gene expression data to estimate the underlying data distribution. SERM then recovers the overall gene expression data by imposing self-consistency on the gene expression matrix, thus ensuring that the expression levels are similarly distributed in different parts of the matrix. We show that SERM significantly improves the accuracy of gene imputation, with at least a 100-fold increase in computational efficiency compared with state-of-the-art techniques. SERM thus promises to provide an urgently needed computational solution for rapid and accurate recovery of big genomic expression data. SERM is available as a web-based computational tool (https://www.analyxus.com/compute/serm), and its source code can be found at https://github.com/xinglab-ai/self-consistent-expression-recovery-machine.
Competing Interest Statement
The authors have declared no competing interest.