Statistical Significance of Cluster Membership

Neo Christopher Chung

doi:10.1101/248633

Abstract

Clustering is routinely applied to modern high-dimensional data, including gene expression measurements from microarrays and RNA-seq. By iteratively estimating cluster centers and assigning memberships according to pre-defined criteria, clustering algorithms classify genes or samples to help ascertain molecular processes or subtypes. For example, the cluster membership assignments of unlabeled single cells from massively parallel RNA-seq experiments are used as the cell identities. However, how can we evaluate whether cluster memberships are correctly assigned? To this end, we introduce jackstraw methods for unsupervised classification that rigorously test the assignment of data features to their clusters. By learning the uncertainty in clustering noisy data, the proposed jackstraw methods can identify statistically significant features that truly make up the corresponding clusters. Simulation studies using K-means clustering confirm the accuracy of the proposed significance measures. We then consider mRNA abundances of 5981 Saccharomyces cerevisiae genes over the cell cycle. After the proposed jackstraw methods are applied for K = 6 clusters, we estimate and use posterior inclusion probabilities (PIP) to select and visualize the canonical features for their clusters. We also investigate single cell RNA-seq (scRNA-seq) data from a mixture of Jurkat and 293T cell lines, where individual cell identities are unknown. The jackstraw methods evaluate the cluster membership assignments of 3381 unlabeled single cells, so that the majority of multiplets are identified in an unsupervised manner. When clustering is employed in high-dimensional data analysis, the proposed tests enable rigorous evaluation of membership assignments, which readily improves feature selection and visualization.

Software: The jackstraw package for R is available at https://github.com/ncchung/jackstraw.
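To make the idea of testing cluster membership concrete, the following is a minimal, illustrative sketch in plain Python. It is not the jackstraw R package and does not reproduce its API. The abstract does not spell out the algorithm, so the scheme here is an assumption: a small number s of rows is permuted, the data are re-clustered, and the association between each permuted row and its assigned center is collected as a null statistic. The squared correlation is used as a simple stand-in test statistic, and all function names (r_squared, kmeans, jackstraw_kmeans, make_toy_data) are hypothetical.

```python
import random

def r_squared(x, y):
    """Squared Pearson correlation, a stand-in association statistic."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(rows, centers):
    """Index of the nearest center for every row."""
    return [min(range(len(centers)), key=lambda c: sq_dist(r, centers[c]))
            for r in rows]

def kmeans(rows, k, rng, n_iter=20):
    """Plain Lloyd's algorithm with random initial centers."""
    centers = [list(r) for r in rng.sample(rows, k)]
    for _ in range(n_iter):
        labels = assign(rows, centers)
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign(rows, centers), centers

def jackstraw_kmeans(rows, k, s=1, B=100, seed=0):
    """Return (labels, p-values) for the membership of each row (assumed scheme)."""
    rng = random.Random(seed)
    labels, centers = kmeans(rows, k, rng)
    observed = [r_squared(r, centers[l]) for r, l in zip(rows, labels)]
    null = []
    for _ in range(B):
        data = [list(r) for r in rows]
        picked = rng.sample(range(len(rows)), s)
        for i in picked:
            rng.shuffle(data[i])  # break the row's relation to any cluster
        null_labels, null_centers = kmeans(data, k, rng)
        null.extend(r_squared(data[i], null_centers[null_labels[i]])
                    for i in picked)
    # Empirical p-values against the pooled null; +1 keeps them strictly positive.
    pvals = [(1 + sum(v >= o for v in null)) / (1 + len(null)) for o in observed]
    return labels, pvals

def make_toy_data(seed=0, n_cols=12, n_per_group=10):
    """Two structured groups of rows (rising, falling) plus pure-noise rows."""
    rng = random.Random(seed)
    up = [j - (n_cols - 1) / 2 for j in range(n_cols)]
    down = [-v for v in up]
    rows = []
    for pattern in (up, down):
        rows += [[v + rng.gauss(0, 1) for v in pattern] for _ in range(n_per_group)]
    rows += [[rng.gauss(0, 1) for _ in range(n_cols)] for _ in range(n_per_group)]
    return rows

if __name__ == "__main__":
    rows = make_toy_data(seed=1)
    labels, pvals = jackstraw_kmeans(rows, k=2, s=2, B=100, seed=2)
    for i, (l, p) in enumerate(zip(labels, pvals)):
        print(i, l, round(p, 3))
```

In this toy setting the first 20 rows follow a trend and the last 10 are pure noise, so one would expect smaller p-values for the structured rows. Because each permuted row still takes part in the re-clustering, the null statistics reflect the overfitting that arises when centers are estimated from the same data they are tested against.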

Abbreviations
scRNA-seq: single cell RNA sequencing
PCA: principal component analysis
PIP: posterior inclusion probability
FDR: false discovery rate

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.