Abstract
Single-cell analysis techniques are frequently used to characterize and detect cell populations in biological samples. The latest advances in multicolor flow cytometry and mass cytometry enable measurement of dozens of parameters per cell at rates of thousands of cells per second. This has led to bottlenecks in data analysis, since traditional gating techniques are insufficient for these large, high-dimensional data sets. To address this, major efforts have been made to develop automated analysis methods. A key step in automated analysis is the use of clustering methods to detect high-dimensional clusters representing cell populations. Here, we have performed an up-to-date, comprehensive, extensible comparison of clustering methods for detecting cell populations in high-dimensional flow and mass cytometry data, including several new methods that were not yet available during previous comparisons. We evaluated clustering methods using four publicly available data sets containing major immune cell populations as well as specific rare populations, with population identities available from manual gating. The comparisons revealed that FlowSOM (with optional meta-clustering but without automatic selection of the number of clusters) performed well across all data sets, and had among the fastest runtimes. We recommend FlowSOM as a first choice for analyzing new data sets. In particular, the fast runtimes enable interactive, exploratory analyses on a standard laptop. Several methods were sensitive to random starts when detecting rare cell populations, indicating that multiple random starts should be used in these cases. Our results provide a guide for researchers deciding between clustering methods for analyzing data sets from high-dimensional flow and mass cytometry experiments.
R scripts to reproduce all analyses are available from GitHub (https://github.com/lmweber/cytometry-clustering-comparison), and preprocessed data files are available from FlowRepository (FR-FCM-ZZPH), allowing our comparisons to be extended to include new clustering methods and reference data sets.
Author Summary
Detecting cell types or populations is an important part of many biological experiments and medical diagnostic procedures. For example, in cancer diagnostics, tumor subtypes may be identified by the presence of certain cell types, while in vaccine research, the success of a vaccine may be inferred from the frequency of activated immune cells. Flow cytometry and mass cytometry are technologies used to identify cell populations by measuring the expression of characteristic proteins on cell surfaces or within cells. With the latest advances, dozens of proteins can be measured for each cell, precisely characterizing the diversity of cell types present. However, this leads to challenges in data analysis. Recently, significant efforts have been made to develop automated analysis methods, many of which rely on clustering methods to detect groups of similar cells. Here, we have compared the performance of available clustering methods, using several publicly available data sets as benchmarks. The results provide guidance to researchers trying to decide which method to use when analyzing data sets. By helping researchers choose appropriate data analysis methods, this work will improve confidence in reported experimental results, and ultimately contribute to broader adoption of these state-of-the-art technologies for applications in biology and medicine.