ABSTRACT
Dimensionality reduction (DR) is an indispensable analytic component for many areas of single cell RNA sequencing (scRNAseq) data analysis. Proper DR can allow for effective noise removal and facilitate many downstream analyses that include cell clustering and lineage reconstruction. Unfortunately, despite the critical importance of DR in scRNAseq analysis and the vast number of DR methods developed for scRNAseq studies, however, few comprehensive comparison studies have been performed thus far to evaluate the effectiveness of different DR methods in scRNAseq. Here, we aim to fill this critical knowledge gap by providing a comprehensive comparative evaluation of a variety of commonly used DR methods for scRNAseq studies. Specifically, we compared 11 different DR methods on 28 publicly available scRNAseq data sets that cover a range of sequencing techniques and sample sizes. We evaluate the performance of different DR methods both for cell clustering and for lineage reconstruction in terms of their accuracy and robustness. We evaluate the computational scalability of different DR methods by recording their computational cost. Based on the comprehensive evaluation results, we provide important guidelines for choosing DR methods in scRNAseq data analysis. We also provide all analysis scripted used in the present study at www.xzlab.org/reproduce.html. Together, we hope that our results will serve as an important practical reference for practitioners to choose DR methods in the field of scRNAseq analysis.
INTRODUCTION
Single-cell RNA sequencing (scRNAseq) is a rapidly growing and widely applying technology [1–3]. By measuring gene expression at single cell level, scRNAseq provides an unprecedented opportunity to investigate the cellular heterogeneity of complex tissues [4–8]. However, despite the popularity of scRNAseq, analyzing scRNAseq data remains a challenging task. Specifically, due to the low capture efficiency and low sequencing depth per cell in scRNAseq data, gene expression measurements obtained from scRNAseq are noisy: collected scRNAseq gene measurements are often in the form of low expression counts paired with an excessive number of zeros known as dropouts [9]. Subsequently, dimensionality reduction (DR) methods that transform the original high-dimensional noisy expression matrix into a low-dimensional subspace with enriched signals become an important data processing step for scRNAseq analysis [10]. Proper DR can allow for effective noise removal, facilitate data visualization, and enable efficient and effective downstream analysis of scRNAseq [11].
DR is indispensable for many types of scRNAseq analysis. Because of the importance of DR in scRNAseq analysis, many DR methods have been developed and routinely used in many scRNAseq software tools that include, but not limited to, cell clustering tools [12, 13] and lineage reconstruction tools [14]. Indeed, most commonly used scRNAseq clustering methods rely on DR as the first analytic step [15]. For example, Seurat applies clustering algorithms directly on a low dimensional space inferred from principal component analysis (PCA) [16]. CIDR improves clustering by improving PCA through imputation [17]. SC3 combines different ways of PCA for consensus clustering [18]. Besides PCA, other DR techniques are also commonly used for cell clustering. For example, nonnegative matrix factorization (NMF) is used in SOUP [19]. Diffusion map is used in destiny [20]. Multidimensional scaling (MSD) is used in ascend [21]. Variational inference autoencoder is used in scVI [22]. In addition to cell clustering, most cell lineage reconstruction and developmental trajectory inference algorithms also rely on DR [14]. For example, TSCAN builds cell lineages using minimum spanning tree based on a low dimensional PCA space [23]. Waterfall performs k-means clustering in the PCA space to eventually produce linear trajectories [24]. SLICER uses locally linear embedding (LLE) to project the set of cells into a lower dimension space for reconstructing complex cellular trajectories [25]. Monocle employs independent components analysis (ICA) for DR before building the trajectory [26]. Wishbone combines PCA and diffusion maps to allow for bifurcation trajectories [27].
Besides the generic DR methods mentioned in the above paragraph, many DR methods have also been developed recently that are specifically targeted for modeling scRNAseq data. These scRNAseq specific DR methods can account for either the count nature of scRNAseq data and/or the dropout events commonly encountered in scRNAseq studies. For example, ZIFA relies on a zero-inflation normal model to model dropout events [28]. pCMF models both dropout events and the mean-variance dependence resulting from the count nature of scRNAseq data [29]. ZINB-WaVE incorporates additional gene-level and sample-level covariates for more accurate DR [30]. Finally, several deep learning-based DR methods have recently been developed to enable scalable and effective computation in large-scale scRNAseq data; for example, data that are collected by 10X Genomics techniques [31] and/or from large consortium studies such as Human Cell Atlas (HCA) [32, 33]. Common deep learning-based DR methods for scRNAseq include Dhaka [34], scScope [35], VASC [36], and DCA [37], to name a few.
With all these different DR methods for scRNAseq data analysis, one naturally wonders which DR method one would prefer for different types of scRNAseq analysis. Unfortunately, despite the popularity of scRNAseq technique, the critical importance of DR in scRNAseq analysis, and the vast number of DR methods developed for scRNAseq studies, few comprehensive comparison studies have been performed to evaluate the effectiveness of different DR methods for practical applications. Here, we aim to fill this critical knowledge gap by providing a comprehensive comparative evaluation of a variety of commonly used DR methods for scRNAseq studies. Specifically, we compared 11 different DR methods on 28 publicly available scRNAseq data sets that cover a range of sequencing techniques and sample sizes. We evaluate the performance of different DR methods both for cell clustering and for lineage reconstruction in terms of their accuracy and robustness using different metrics. We also evaluate the computational scalability of different DR methods by recording their computational time. Together, we hope our results can serve as an important guideline for practitioners to choose DR methods in the field of scRNAseq analysis.
RESULTS
Our evaluated the performance of 11 DR methods (Table 1; Figure S1) on 28 publicly available scRNAseq data sets (Table S1). Details of these data sets are provided in Methods and Materials. Briefly, these data sets cover a wide variety of sequencing techniques that include Smart-Seq2 (7 data sets), 10X genomics (6 data sets), Fluidigm C1 (4 data sets), Smart-Seq (5 data sets), inDrop (1 data set), SMARTer (3 data sets) and others (2 data sets). In addition, these data cover a range of sample sizes from a couple hundred cells to a few thousand cells. In each data set, we evaluated the effectiveness of different DR methods for one of the two important tasks: cell clustering and lineage inference. In addition, we recorded computation time of different DR methods. An overview of the comparison workflow is shown in Figure 1. All data and analysis scripts for reproducing the results in the paper is available at www.xzlab.org/reproduce.html.
Cell clustering
We first evaluated the performance of different DR methods for cell clustering applications. To do so, we obtained 14 publicly available scRNAseq data sets and simulated two additional scRNAseq data sets using Splatter package (Table S1). Each of the 14 real scRNAseq data sets contains known cell clustering information while each of the two simulated data sets contains 4 or 8 known cell types. For each DR method and each data set, we applied DR to extract a fixed number of low-dimensional components (e.g. these are the principal components in the case of PCA). We varied the number of low-dimensional components to examine their influence on downstream analysis. In particular, for a data with less than or equal to 300 cells, we varied the number of low dimensional components to be either 2, 6, 14, or 20. For a data with more than 300 cells, we varied the number of low dimensional components to be either 0.5%, 1%, 2%, or 3% of the total number of cells. We then applied either the hierarchical clustering algorithm or the k-means clustering algorithm to obtain cluster labels. We used both normalized mutual information (NMI) and adjusted rand index (ARI) values for comparing the true cell labels and inferred cell labels obtained by clustering methods based on the low dimensional components.
The evaluation results on DR methods based on clustering analysis using the k-means clustering algorithm are summarized in Figure 2 (for NMI criterion) and Figure S2 (for ARI criterion). Because the results based on either of the two criteria are similar, we will mainly explain the results based on the NMI criteria in Figure 2. For easy visualization, we also display the results averaged across data sets in Figure S3. A few patterns are noticeable. First, as one would expect, clustering accuracy depends on the number of low-dimensional components that are used for clustering. Specifically, accuracy is relatively low when the number of low-dimensional components is very small (e.g. 2 or 0.5%) and generally increases with the number of included components. In addition, accuracy usually saturates once a sufficient number of components is included, though the saturation number of components can vary across data sets and across methods. For example, the average NMI across all data sets and across all methods are 0.59, 0.67, 0.68 and 0.69 for increasingly large number of components, respectively. Second, when conditional on using a low number of components, scRNAseq specific DR method ZINB-WaVE and scRNAseq non-specific DR method ICA often outperform the other methods. For example, with the lowest number of components, the average NMI across all data sets for ICA and ZINB-WaVE is 0.77 and 0.76, respectively. The performance of ICA and ZINB-WaVE is followed by Diffusion Map (0.71), ZIFA (0.69), PCA (0.68), FA (0.68), NMF (0.59) and DCA (0.57). While the remaining three methods, Poisson NMF (0.42), pCMF (0.41) and scScope (0.26), do not fare well with a low number of components. The good performance of ZINB-WaVE with a low number of components presumably is because its direct modeling of dropout events and raw counts of scRNAseq data. The good performance of ICA with a low number of components presumably is because it extracts low-dimensional components using a non-linear transformation. Third, with increasing number of low-dimensional components, generic methods such as FA, ICA and PCA are often comparable with scRNAseq specific methods such as ZINB-WaVE, and in many cases can slightly outperform ZINB-WaVE. For example, with the highest number of low-dimensional components, the average NMI across all data sets for FA, ICA, PCA, ZINB-WaVE and Diffusion Map are 0.85, 0.84, 0.83, 0.83 and 0.80, respectively. Their performance is followed by ZIFA (0.79), NMF (0.73), and DCA (0.69). The same three methods, pCMF (0.55), Poisson NMF (0.31), and scScope (0.31), again do not fare well with a large number of low-dimensional components. The comparable results of generic DR methods with scRNAseq specific DR methods with a high number of low-dimensional components are also consistent some of the previous observations; for example, the original ZINB-WaVE paper observed that PCA generally can yield comparable results with scRNAseq specific DR methods in real data [30].
Besides the k-means clustering algorithm, we also used the hierarchical clustering algorithm to evaluate the performance of different DR methods (Figures S4-S6). In this comparison, we had to exclude one DR method, scScope, as hierarchical clustering does not work on the extracted low-dimensional components from scScope. Consistent with the k-means clustering results, we found that the clustering accuracy measured by hierarchical clustering is relatively low when the number of low-dimensional components is very small (e.g. 2 or 0.5%), but generally increases with the number of included components. In addition, consistent with the k-means clustering results, we found that generic DR methods often yield results comparable to or better than scRNAseq specific DR methods. In particular, two generic DR methods, FA and NMF, outperform various other DR methods across a range of settings. For example, when the number of low-dimensional components is moderate (6 or 1%), both FA and NMF achieve an average NMI value of 0.80 across data sets (Figures S6). Their performance is followed by PCA (0.72), Poisson NMF (0.71), ZINB-WaVE (0.71), Diffusion Map (0.70), ICA (0.69), ZIFA (0.68), pCMF (0.65), and DCA (0.63). We note, however, that the clustering results obtained by hierarchical clustering are often slightly worse than that obtained by k-means clustering across settings (e.g. Figure S3 vs Figure S6), consistent with the fact that many scRNAseq clustering methods use k-means as a key ingredient [18, 24].
While some DR methods (e.g. Poisson NMF, ZINB-WaVE, pCMF and DCA) directly model count data, many DR methods (e.g. DR methods, PCA, ICA, FA, NMF, and Diffusion Map) require normalized data. The performance of DR methods that use normalized data may depend on how data are normalized. Therefore, we investigated how different normalization approaches impact on the performance of the aforementioned five DR methods that use normalized data. We examined two alternative data transformation approaches, log2 CPM (count per million) and z-score, in addition to the log2 count we used in the previous results (transformation details are provided in Methods and Materials). The evaluation results are summarized in Figures S7-S10 and are generally insensitive to the transformation approach deployed. For example, with the k-means clustering algorithm, when the number of low-dimensional components is small (1%), PCA achieves an NMI value of 0.82, 0.82 and 0.81, for log2 count transformation, log2 CPM transformation, and z-score transformation, respectively (Figures S3A, S8A, and S10A). Similar results hold for the hierarchical clustering algorithm (Figures S3B, S8B, and S10B). Therefore, different data transformation approaches do not appear to substantially influence the performance of DR methods.
Finally, we also investigated the stability and robustness of different DR methods. To do so, we randomly split the Kumar data into two subsets with an equal number of cells for each cell type in the two subsets. We applied each DR method to the two subsets and measured the clustering performance in each subset separately. We repeated the procedure 10 times to capture the potential stochasticity during the data split. We visualize the clustering performance of different DR methods in the two subsets separately. Such visualization allows us to check the effectiveness of DR methods with respective to reduced sample size in the subset, as well as the stability/variability of DR methods across different split replicates (Figure S11). The results show that four of the DR methods, PCA, ICA, FA, and ZINB-WaVE, often achieve both accurate clustering performance and highly stable and consistent results across the subsets. The accurate and stable performance of both ICA and ZINB-WaVE is notable even with a relatively small number of low-dimensional components. For example, with very small number of low-dimensional components, both ICA and ZINB-WaVE achieve an average NMI value of 0.98 across the two subsets, with virtually no performance variability across data splits (Figure S11).
Overall, the results suggest that, in terms of downstream clustering analysis accuracy and stability, ICA is preferable across a range of data sets examined here. In addition, scRNAseq specific DR methods such as ZINB-WaVE is also preferable if one is interested in extracting a small number of low-dimensional components, while generic methods such as PCA or FA are also preferred when one is interested in extracting a large number of low-dimensional components.
Trajectory inference
We next evaluated the performance of different DR methods for lineage inference applications (details in Methods and Materials). To do so, we obtained 14 publicly available scRNAseq data sets, each of which contains known lineage information (Table S2). The known lineage in all these data are linear, without bifurcation or multifurcation patterns. For each data set, we applied one DR method at a time to extract a fixed number of low-dimensional components. In the process, we varied the number of low-dimensional components from 2, 6, 14 to 20 to examine their influence for downstream analysis. With the extracted low-dimensional components, we first used the hierarchical clustering algorithm or the k-means clustering algorithm to obtain cell type labels, where the number of cell types in the clustering was set to be the known truth. Afterwards, we supplied the low-dimensional components and cell type labels to the software Slingshot [38] to infer the lineage. Following [38], we evaluated the performance of DR methods by Kendall correlation coefficient that compares the true lineage and inferred lineage obtained based on the low-dimensional components. In this comparison, we also excluded one DR method, scScope, which is not compatible with Slingshot. The lineage inference results for the remaining DR methods are summarized in Figures 3 and S12-17.
Different from the clustering results where accuracy generally increases with increasing number of included low-dimensional components, the lineage tracing results do not show a clear increasing pattern with respect to the number of low-dimensional components, especially when we used k-means clustering as the initial step (Figures 3 and Figure S13A). For example, the average Kendall correlation across all data sets and across all methods are 0.37, 0.37, 0.38 and 0.37 for increasingly large number of components, respectively. When we use hierarchical clustering algorithm as the initial step, the lineage tracing results in the case of a small number of low-dimensional components are slightly inferior as compared to the results obtained using a large number of low-dimensional components (Figures S12 and Figure S13B). However, we do note that the lineage tracing results obtained using k-means are better than that obtained using hierarchical clustering as the initial step. Therefore, k-means clustering algorithm is recommend as the initial step for lineage inference and a small number of low-dimensional components there is often sufficient for accurate results. When conducting lineage inference based on a low number of components with k-means, we found that four DR methods, PCA, FA, NMF and ZINB-WaVE, all perform well for lineage inference (Figure S13A). These methods also worked reasonably well for the previous cell clustering analysis. For example, with the lowest number of components, the average Kendall correlation across data sets for PCA, FA, NMF, and ZINB-WaVE, are 0.43, 0.42, 0.42, and 0.41, respectively. Their performance is followed by ICA (0.38), ZIFA (0.37), and Diffusion Map (0.35). While DCA (0.31), pCMF (0.29), and Poisson NMF (0.29) do not fare well. These four methods (PCA, FA, NMF and ZINB-WaVE), with the only exception of NMF, are also among the best performers for lineage inference across varying number of low-dimension components.
For methods that require normalized data, we further examined the influence of different data transformation approaches on their performance (Figures S14-S15). Like in the clustering comparison, we found that different transformations do not influence the performance results for most DR methods in lineage inference. For example, with the k-means clustering algorithm as the initial step, when the number of low-dimensional components is small, ICA achieves a Kendall correlation of 0.38, 0.36 and 0.38, for log2 count transformation, log2 CPM transformation, and z-score transformation, respectively (Figures S13A, S15A, and S17A). Similar results hold for the hierarchical clustering algorithm (Figures S13B, S15B, and S17B). However, some notable exceptions exist. For example, with log2 CPM transformation but not the other transformations, the performance of Diffusion Map increases with increasing number of included components when k-means clustering was used as the initial step: the average Kendal correlation across different low-dimensional components are 0.37, 0.42, 0.44, and 0.47, respectively (Figures S14 and S15A). As another example, with z-score transformation but not with the other transformations, FA achieves the highest performance among all DR methods across different number of low-dimensional components.
We also investigated the stability and robustness of different DR methods by data split on the Hayashi data. We applied each DR method to the two subsets and measured the lineage inference performance in the two subsets separately. We again visualize the clustering performance of different DR methods in the two subsets, separately. Such visualization allows us to check the effectiveness of DR methods with respective to reduced sample size in the subset, as well as the stability/variability of DR methods across different split replicates (Figure S18). The results show that three of the DR methods, FA, Diffusion Map, and ZINB-WaVE, often achieve both accurate performance and highly stable and consistent results across the subsets. The accurate and stable performance of these is notable even with a relatively small number of low-dimensional components. For example, with very small number of low-dimensional components, FA, Diffusion Map and ZINB-WaVE achieve Kendall correlation of 0.75, 0.77, and 0.77 averaged across the two subsets, respectively, and again with virtually no performance variability across data splits (Figure S18).
Overall, the results suggest that, in terms of downstream lineage inference accuracy and stability, the scRNAseq non-specific DR method FA is preferable cross a range of data sets examined here. In addition, scRNAseq specific DR method ZINB-WaVE and the scRNAseq non-specific DR method NMF are also preferable if one is interested in extracting a small number of low-dimensional components. scRNAseq specific DR method Diffusion Map may also be preferable if one is interested in extracting a large number of low-dimensional components.
Computation time
We recorded and compared computing time for different DR methods on simulated data sets. Here, we also examined how computation time for different DR methods varies with respect to the number of low-dimensional components extracted (Figure 4A) as well as with respect to the number of cells contained in the data (Figure 4B). Overall, the computational cost of three methods, ZINB-WaVE, ZIFA, and pCMF, is substantially heavier than the remaining methods. Their computation time increase substantially with both increasingly large number of low-dimensional components and increasingly large number of cells in the data. Specifically, when the sample size equals 500 and the desired number of low dimensional components equals 22, the computing time for ZINB-WaVE, ZIFA, and pCMF to analyze 10,000 genes are 2.15, 1.33, and 1.95 hours, respectively (Figure 4A). When the sample size increases to 10,000, the computing time for ZINB-WaVE, ZIFA, and pCMF increases to 12.49, 20.50, and 15.95 hours, respectively (Figure 4B). Similarly, when the number of low-dimensional components increases to 52, the computing time for ZINB-WaVE, ZIFA, and pCMF increases to 4.56, 4.27, and 4.62 hours, respectively. Besides these three methods, the computing cost of both ICA and Poisson NMF can also increase noticeably with increasingly large number of low-dimensional components. The computing cost of ICA, but to a lesser extent of Poisson NFM, also increases substantially with increasingly large number of cells. In contrast, PCA, FA, Diffusion Map, and the two deep learning-based methods (DCA, and scScope) are computationally efficient. In particular, the computation time for these five methods are stable and do not show substantial dependence on the sample size or the number of low-dimensional components. Certainly, we expect that the computation time of all DR methods will further increase as the sample size of the scRNAseq data sets increases in magnitude. Overall, in terms of computing time, PCA, FA, Diffusion Map, DCA, and scScope are preferable.
Practical guidelines
In summary, our comparison analysis shows that different DR methods can have different merits for different tasks. Subsequently, it is not straightforward to identify a single DR method that strives the best in all data sets and for all downstream analyses. Instead, we provide a relatively comprehensive practical guideline for choosing DR methods in scRNAseq analysis in Figure 5. Our guideline is based on the accuracy and effectiveness of DR method in terms of the downstream analysis, the robustness and stability of DR method in terms of replicability and consistency across data splits, as well as computational scalability for large scRNAseq data sets. Briefly, for cell clustering analysis, PCA, ICA, FA and ZINB-WaVE are recommended for small data where computation is not a concern. In contrast, PCA, ICA, FA are recommended for large data where computation is a concern. For lineage inference analysis, FA, NMF and ZINB-WaVE are all recommended for small data. In contrast, a subset of these methods, FA is also recommended for large scRNAseq data. In addition, for very large scRNAseq data sets (e.g. >100,000 samples), DCA perhaps is the only feasible approach with reasonable performance for both downstream analyses. Finally, beside these general recommendations, we note that some methods have additional features that are desirable for practitioners. For example, ZINB-WaVE can include sample-level and gene-level covariates, thus allowing us to easily control for batch effects or size factors. We provide our detailed recommendations in Figure 5.
DISCUSSION
We have presented a comprehensive comparison of different dimensionality reduction methods for scRNAseq analysis based on two important downstream applications: cell clustering and trajectory inference. We hope the summary of these state-of-the-art DR methods, the detailed comparison results, and the recommendations and guidelines for choosing DR methods can assist researchers in the analysis of their own scRNAseq data.
We have primarily focused on evaluating DR methods based on two downstream applications. We did not, however, examine the performance of DR methods for data visualization purposes. Data visualization aims to project scRNAseq data into a two- or three-dimensional subspace for visualizing cell clustering results. For example, both tSNE [39] and UMAP [40] are commonly applied data visualization tools. Different from cell clustering, which often rely on a relatively large number of low-dimensional components, data visualization focuses on using only the top two or three low-dimensional components. Subsequently, DR methods for data visualization may not fare well for cell clustering, and DR methods for cell clustering may not fare well for visualization [41]. Besides data visualization, we note that DR methods are also used for many other analytic tasks in scRNAseq studies. For example, factor models for DR is an important modeling part for multiple scRNAseq data sets alignment [16], for integrative analysis of multiple omics data sets [42, 43], as well as for deconvoluting bulk RNAseq data using cell type specific gene expression measurements from scRNAseq [44, 45]. In addition, cell classification in scRNAseq also relies on a low-dimensional structure inferred from original scRNAseq through DR [46, 47]. Therefore, the comparative results obtained from the present study can provide important insights into these different scRNAseq analytic tasks. In addition, investigating the performance of DR methods in these different scRNAseq downstream analyses is an important future research direction.
We mostly focused on evaluating feature extraction methods for DR. Another important category of DR method is the feature selection method, which aims to select a subset of features/genes directly from the original feature space. The feature section methods rely on different criteria to select important genes and are also commonly used in the preprocessing step of scRNAseq data analysis [48]. For example, M3Drop relies on dropout events in scRNAseq data to identify informative genes [49]. Seurat uses gene expression variance to select highly variable genes [16]. Evaluating the benefits of different methods and criteria for selecting informative genes for different downstream tasks is another important future direction.
Due to the heavy computing cost of several DR methods, we unfortunately have to limit our comparisons to scRNAseq data sets with sample sizes less than 10,000 cells. With the advance of scRNAseq technologies and with the increase collaborations across scientific groups, new consortium projects such as the Human Cell Atlas (HCA) will generate scRNAseq data sets that contain millions of cells [32]. The large data at this scale poses critical computational and statistical challenges to many current DR methods. Many existing DR methods, in particular those that require the computation and memory storage of a covariance or distance matrix among cells, will no longer be applicable there. Therefore, new algorithmic innovations and new efficient computational approximations will likely be needed to scale many of the existing DR methods to millions of cells.
METHODS AND MATERIALS
scRNAseq data sets
We obtained a total of 28 scRNAseq data sets from public domains for benchmarking DR methods. All data sets were retrieved from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/) or the 10X genomics website (https://support.10xgenomics.com/single-cell-gene-expression/datasets). These data sets cover a wide variety of sequencing techniques that include Smart-Seq2 (7 data sets), 10X genomics (6 data sets), Fluidigm C1 (4 data sets), Smart-Seq (5 data sets), SMARTer (3 data sets), inDrop (1 data set) and others (2 data sets). In addition, these data cover a range of sample sizes from a couple hundred cells to several thousand cells measured in either human (16 data sets) or mouse (12 data sets). In each data set, we evaluated the effectiveness of different DR methods for one of the two important downstream analysis tasks: cell clustering and lineage inference. In particular, 14 data sets were used for cell clustering evaluation while another 14 data sets were used for lineage inference evaluation. For cell clustering, 10 of the 14 data sets were obtained by mixing cells from different cell types either pre-determined by fluorescence activated cell sorting (FACS) or cultured on different conditions. Therefore, these 10 studies contain the true cell type labels for all cells. The remaining 4 data sets contain cell labels that were determined in the original study and we simply treated them as truth though we do acknowledge that such “true” clustering information may not be accurate. For lineage inference, 4 of the 14 data sets were obtained by mixing cells from different cell types pre-determined by FACS. These different cell types are at different developmental stages of a single linear lineage; thus these 4 studies contain the true lineage information for all cells. The remaining 10 data sets contain cells that were collected at multiple time points during the development process. For these data, we simply treated cells at these different time points as part of a single linear lineage, though we do acknowledge that different cells collected at the same time point may represent different developmental trajectories from an early time point if the cells at the early time are heterogeneous. In either case, the true lineage in all these 14 data sets are treated as linear, without any bifurcation or multifurcation patterns.
A detailed list of the selected scRNAseq datasets with corresponding data features is provided in Tables S1-S2. In each of the above 28 data sets, we removed genes that are expressed in less than five cells. For methods modeling normalized data, we transformed the raw counts data into continuous data with the normalize function implemented in scater (R package v1.12.0). We then applied log2 transformation on the normalized counts by adding one to avoid log transforming zero values. We simply term this normalization as log2 count transformation, though we do acknowledge that such transformation does take into account of cell size factor etc. through the scater software. In addition to log2 count transformation, we also explore the utility of two additional data transformation: log2 CPM transformation and z-score transformation. In the log2 CPM transformation, we first computed counts per million reads (CPM) and then performed log2 transformation on the resulted CPM value by adding a constant of one to avoid log transformation of zero quantities. In the z-score transformation, for each gene in turn, we standardized CPM values to achieve a mean of zero and variance of one across cells using Seurat package (v2.3).
Besides the above 28 real scRNAseq data sets, we also simulated 2 additional scRNAseq data sets for cell clustering evaluation. In the simulations, we used all 94 cells from one cell type (v6.5 mouse 2i+LIF) in the Kumar data as input. We simulated scRNAseq data with 500 cells and a known number of cell types, which were set to be either 4 or 8, using the Splatter package v1.2.0. All parameters used in the Splatter (e.g., mean rate, shape, dropout rate, etc.) were set to be approximately those estimated from the real data. In the case of 4 cell types, we set the group parameter in Splatter as 4. We set the percentage of cells in each group as 0.1, 0.15, 0.5 and 0.25, respectively. We set the proportion of the differentially expressed genes in each group as 0.02, 0.03, 0.05 and 0.1, respectively. In the case of 8 cell types, we set group/cell type parameter as 8. We set the percentage of cells in each group as 0.12, 0.08, 0.1, 0.05, 0.3, 0.1, 0.2 and 0.05, respectively. We set the proportion of the differentially expressed genes in each group as 0.03, 0.03, 0.03, 0.1, 0.05, 0.07, 0.08, and 0.1, respectively.
Compared dimensionality reduction methods
DR methods aim to transform an originally high-dimensional feature space into a low-dimensional representation with a much-reduced number of components. These components are in the form of a linear or non-linear combination of the original features (known as feature extraction DR methods) and in the extreme case are themselves a subset of the original features (known as feature selection DR methods). In the present study, we have collected and compiled a list of 11 popular and widely used DR methods in the field of scRNAseq analysis. These DR methods include factor analysis (FA; R package psych, v1.8.12), principal component analysis (PCA; R package stats, v3.6.0), independent component analysis (ICA; R package ica, v1.0.2), Diffusion Map (Diffusion Map; R package destiny, v2.14.0), nonnegative matrix factorization (NMF; R package NNLM, v1.0.0), Kullback-Leibler divergence-based NMF (Poisson NMF; R package NNLM, v1.0.0), zero-inflated factor analysis (ZIFA; Python package ZIFA), zero-inflated negative binomial based wanted variation extraction (ZINB-WaVE; R package zinbwave, v1.6.0), probabilistic count matrix factorization (pCMF; R package pCMF, v1.0.0), deep count autoencoder network (DCA; Python package dca), and a scalable deep-learning-based approach (scScope; Python package scscope). An overview of these 11 DR methods with their corresponding modeling characteristics is provided in Table 1.
Assess the performance of dimensionality reduction methods
We evaluated the performance of DR methods by evaluating how effective the low-dimensional components extracted from DR methods are for downstream analysis. We evaluated either of the two commonly applied downstream analysis, clustering analysis and lineage reconstruction analysis, in the 30 data sets described above. In the analysis, we varied the number of low-dimensional components extracted from these DR methods. Specifically, for cell clustering data sets, in a data with less than or equal to 300 cells, we varied the number of low dimensional components to be either 2, 6, 14, or 20. In a data with more than 300 cells, we varied the number of low dimensional components to be either 0.5%, 1%, 2%, or 3% of the total number of cells. For lineage inference data sets, we varied the number of low dimensional components to be either 2, 6, 14, or 20 for all data sets, since common lineage inference methods prefer a relatively small number of components.
For clustering analysis, after DR with these DR methods, we used two different clustering methods, the hierarchical clustering (R function hclust; stats v3.5.3) and k-means clustering (R function kmeans; stats v3.6.0), to perform clustering on the reduced feature space. The k-means clustering is a key ingredient of commonly applied scRNAseq clustering methods such as SC3 [18] and Waterfall [24] while the hierarchical clustering is a key ingredient of commonly applied scRNAseq clustering methods such as CIDR [17] and CHETAH [50]. In both these clustering methods, we set the number of clusters k to be the known number of cell types in the data. We compared the cell clusters inferred using the low dimensional components to the true cell cluster and evaluated clustering accuracy by two criteria: the adjusted rand index (ARI) [51] and the normalized mutual information (NMI) [52]. The ARI and NMI are defined as: where P = (p1,p1, ···, pn)T denotes the inferred cell-type cluster labels from clustering analysis while T = (t1, t1, ···, tn)T denotes the known true cell-type labels for n samples in the data; l and s enumerate the clusters, with l,s = 1, ···, k; nl = Σl I(pi = l) is the number of cells that belong to cluster l in the inferred cluster labeling, with I(·) being an indicator function; ns = Σs I (ti = s) is the number of cells that belong to cluster s in the true cluster labeling; and nls = Σl,s (pi = l)I(ti = s) is the number of times where the i’th cell belongs to the cluster l in the inferred cluster labeling and belongs to the cluster s in the true cluster labeling; note that nls effectively measures the number of cells that are in common between P and T; is the mutual information between two cluster labels; and is the entropy function for true cell-type labeling. We used the compare function in the igraph R package (v1.0.0) to compute both ARI and NMI criteria. For each data set, we repeated the above procedure five times and report the averaged results to avoid the influence of the stochasticity embedded in some DR methods and/or the clustering algorithm.
For trajectory inference, after DR with these DR methods, we used Slingshot [38] (R package, v1.2.0), which is the recommend lineage inference method based on a recent comparative study [14]. The Slingshot software takes two input data: the low-dimensional components extracted from DR methods and a vector of cluster labels predicted by clustering algorithms. For the later, we used either k-means or hierarchical clustering algorithm on the extracted low-dimensional components to obtain cluster labels. After obtaining the two types of input through the slingshot function, we used the getLineages function to fit a minimum spanning tree (MST) to identify lineage. The final output from Slingshot is an object of class SlingshotDataSet that contains the inferred lineage information. We follow the original Slingshot paper [38] to evaluate the accuracy of the inferred lineage using the Kendall rank correlation coefficient. To do so, for each data, we first ranked genes based on their position on the true lineage. We ordered all m genes based on this rank order and denoted the corresponding rank in ascending order for these genes as {x1, ···, xm}, where xi ≤ xi+1. Note that the true lineage is linear without any bifurcation or multifurcation patterns, while the inferred lineage may contain multiple ending points in addition to the single starting point. Therefore, for each inferred lineage, we examined one trajectory at a time, where each trajectory consists of the starting point and one of the ending points. In each trajectory, we ranked genes in order based on their position in the trajectory. We denote the corresponding rank order in the inferred trajectory for all m genes as {yl, ···, ym}, where we set yl as missing if l’th gene is not included in the inferred trajectory. For each pair of non-missing genes, we labeled the gene pair (i, j) as a concordant pair if their relative rank in the inferred lineage are consistent with their relative rank in the true lineage; that is, either (xi ≥ xj & yi ≥ yj) or (xi < xi & yi < yj). Otherwise, we labeled the gene pair (i, j) as discordant. We denoted C as the number of concordant pairs, D as the number of discordant pairs, and U as the total number of non-missing genes. The Kendell correlation coefficient is then computed as
Afterwards, we obtained the maximum absolute τ over all these trajectories as the final Kendall correlation score to evaluate the similarity between the inferred lineage and the true lineage. For each data set, we repeated the above procedure five times and report the averaged results to avoid the influence of the stochasticity embedded in some DR methods and/or the lineage inference algorithm.
Finally, we investigated the stability and robustness of different DR methods in both cell clustering or lineage inference applications through data splitting. Here, we focused on two representative scRNAseq data sets, the Kumar data set for cell clustering and the Hayashi data set for lineage inference. For each data, we randomly split the data into two subsets with an equal number of cells in each cell type in the two subsets. We repeated the split procedure 10 times to capture the potential stochasticity during the data split. In each split replicate, we applied different DR methods to analyze each subset separately. We used k-means clustering algorithm to infer the clustering labels in each subset. We used NMI to measure cell clustering accuracy and used Kendall correlation to measure lineage inference accuracy.
FUNDING
This study was supported by the National Institutes of Health (NIH) Grants R01HG009124 and R01GM126553, and the National Science Foundation (NSF) Grant DMS1712933. This project has also been made possible in part by grant number 2018-181314 from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation. S.S. was supported by NIH Grant R01HD088558 (PI Tung), Top International University Visiting Program for Outstanding Young Scholars of Northwestern Polytechnical University, the Fundamental Research Funds for the Central Universities (Grant No. 3102017OQD098), and the Natural Science Foundation of Shaanxi Province (Grant No. 2019JQ127). J.Z. was supported by NIH Grant U01HL137182 (PI Kang).
AVAILABILITY OF DATA AND MATERIALS
All source code and data sets used in our experiments have been deposited at www.xzlab.org/reproduce.html.
AUTHORS’ CONTRIBUTIONS
XZ conceived the idea and provided funding support. SS and XZ designed the experiments. SS adapted software, performed simulations and analyzed real data. JQ and YM collected data sets and helped interpreting results. SS and XZ wrote the manuscript with input from all other authors.
ETHICS APPROVAL AND CONSENT TO PARTICIPATE
No ethnical approval was required for this study. All utilized public data sets were generated by other organizations that obtained ethical approval.
CONSENT FOR PUBLICATION
Not applicable.
COMPETING INTERESTS
The authors declare that they have no competing interests.