Subpopulation identification for single-cell RNA-sequencing data using functional data analysis

Kyungmin Ahn; Hironobu Fujiwara

doi:10.1101/760413

Abstract

Motivation In single-cell RNA-sequencing (scRNA-seq) analysis, a number of statistical tools from multivariate data analysis (MDA) have been developed to help analyze gene expression data. The MDA approach typically treats genes as discrete genomic units and ignores the dependency between the data components. In this paper, we propose a functional data analysis (FDA) approach to scRNA-seq data in which each cell is treated as a single function, so that the order of the data components is fixed and cannot be permuted. To mitigate the large number of dropouts (zero or near-zero values) and to reduce the high dimensionality of the data, we first perform a principal component analysis (PCA) and take the principal components (PCs) as the amplitude of the function. For the phase component, we propose two criteria: either we use the PCs in the order given by PCA, or we sort the PCs according to genetic spatial information. For the latter, we embed the spatial information of genes by aligning the genomic gene locations to be the phase of the function. These two approaches allow us to apply FDA clustering methods to scRNA-seq analysis.
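As a rough illustration of the first criterion only (PCs used in the order given by PCA), the following minimal sketch turns a cells-by-genes count matrix into one discretized curve per cell. The function name `cells_to_functions`, the log transform, the number of components, and the synthetic Poisson counts are all assumptions made for illustration; they are not taken from the paper, and the spatial-ordering criterion is not shown.

```python
import numpy as np

def cells_to_functions(counts, n_components=10):
    """Hypothetical helper: one discretized curve per cell from PC scores."""
    # Log-transform the counts (an assumed preprocessing step, not from the paper).
    X = np.log1p(np.asarray(counts, dtype=float))
    # Center each gene so that the SVD below is equivalent to PCA.
    X = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    k = min(n_components, S.size)
    # PC scores: row i holds the amplitude values of cell i's function.
    amplitude = U[:, :k] * S[:k]
    # Phase under the first criterion: the PC index 1..k, in PCA order.
    phase = np.arange(1, k + 1)
    return phase, amplitude

# Synthetic toy data: 50 cells, 200 genes (not real scRNA-seq data).
rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=1.0, size=(50, 200))
phase, amplitude = cells_to_functions(toy_counts, n_components=10)
print(phase.shape, amplitude.shape)
```

Each row of `amplitude`, read against `phase`, is then a curve that could be passed to a functional clustering routine; which routine to use is left open here.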

Results To demonstrate the robustness of our method, we apply several existing FDA clustering algorithms to gene expression data and compare their accuracy in classifying cell types with that of conventional MDA clustering methods. The FDA clustering algorithms achieve superior accuracy on simulated data as well as on real human and mouse scRNA-seq data.