Abstract
Using single-cell RNA-seq (scRNA-seq), the full transcriptome of individual cells can be acquired, enabling a quantitative cell-type characterisation based on expression profiles. Due to the large variability in gene expression, assigning cells into groups based on the transcriptome remains challenging. We present Single-Cell Consensus Clustering (SC3), a tool for unsupervised clustering of scRNA-seq data. SC3 achieves high accuracy and robustness by consistently integrating different clustering solutions through a consensus approach. Tests on nine published datasets show that SC3 outperforms 4 existing methods, while remaining scalable for large datasets, as shown by the analysis of a dataset containing 44,808 cells. Moreover, an interactive graphical implementation makes SC3 accessible to a wide audience of users, and SC3 also aids biological interpretation by identifying marker genes, differentially expressed genes and outlier cells. We illustrate the capabilities of SC3 by characterising newly obtained transcriptomes from subclones of neoplastic cells collected from patients.
Introduction
With the recent advent of single cell RNA-seq (scRNA-seq) technology, researchers are now able to quantify the entire transcriptome of individual cells, opening up a wide spectrum of biological applications. One key application of scRNA-seq is the ability to determine cell types based on their transcriptome profile alone1–3. The diversity of cell-types is a fundamental property of higher eukaryotes. Traditionally, cell type was defined based on shape and morphological properties, but tools from molecular biology have enabled researchers to categorise cells based on surface markers4,5. However, morphology or a small number of marker proteins are not sufficient to characterise complex cellular phenotypes. scRNA-seq opens up the possibility to group cells based on their genome-wide transcriptome profiles, which is likely to provide a better representation of the cellular phenotype. Indeed, several studies have already used scRNA-seq to identify novel cell-types1–3, demonstrating its potential to unravel and catalogue the full diversity of cells in the human body.
A full characterisation of the transcriptional landscape of individual cells holds an enormous potential, both for basic biology and clinical applications. An important medical application is cancer, which has long been known to be a heterogeneous disease with multiple subclones coexisting within the same tumor. Until recently, tumor heterogeneity has mainly been assessed at the DNA level by genome sequencing6–8. The use of scRNA-seq makes it possible to characterise the transcriptional landscape of the different subclones while utilizing very small number of cells. A better understanding of the transcriptome in the different subclones could thus yield important insights about drug resistance, thereby informing the development of novel therapies. However, due to the large variability in gene expression, identifying subclones from patient transcriptomes remains challenging9,10.
Mathematically, the problem of de novo identification of cell-type from data may be seen as an unsupervised clustering problem, i.e., how to separate cells into non-overlapping groups without a priori information as to the number of groups or group labels. However, the lack of training data and reliable benchmarks for validation renders this unsupervised clustering a hard problem. For scRNA-seq the challenge is further compounded by technical errors and biases that remain incompletely understood, a high degree of biological variability in gene expression11. and the high dimensionality of the transcriptome.
Although scRNA-seq is a relatively new technology, several groups have developed custom clustering methods for single cell data2, 1 2–16. Yet, these clustering methods have various shortcomings: (i) they have not been thoroughly benchmarked against one another or against standard algorithms; (ii) it is not clear how they can be scaled to large datasets; (iii) there is no interactive, user-friendly implementation that includes support to facilitate the biological interpretation of the clusters; and (iv) the number of clusters, k, has to be fixed a priori by the user and there is no support to explore different hierarchies of clusters. The last point is particularly relevant when studying complex biological tissues as previous studies have found biologically meaningful cell populations at several levels of granularity14,17,18.
We present a novel interactive clustering tool for scRNA-seq data, SC3 (Single Cell Consensus Clustering) a user-friendly R-package with a graphical interface. SC3 obtains robust results by combining several well-established techniques using a consensus clustering approach. SC3 has several features to facilitate the evaluation of the clustering quality, and to aid the user in determining the appropriate number of clusters. We demonstrate the performance of SC3 by applying it to nine published datasets and thoroughly benchmarking the method against four other methods. Furthermore, we showcase the scalability of SC3 to very large datasets by analysing a dataset with ~45,000 cells15. Crucially, in addition to providing cell clusters, SC3 facilitates their biological interpretation by identifying marker genes, differentially expressed genes, outlier cells, and by providing an integrated link to Gene Ontology analysis. We apply SC3 to the first single cell RNA-Seq data from hematopoietic stem cells isolated directly from patients with myeloproliferative neoplasm. Myeloproliferative neoplasms are a heterogeneous disease, with each patient typically harbouring multiple neoplastic subclones that coexist for long periods of time19. Therefore, transcriptome data from patients are typically hard to interpret due to its inherent heterogeneity within these cells. Here, we identify clusters corresponding to different subclones within two patients with different mutational landscapes, and we characterise the differences between their expression profiles.
Results
Consensus clustering as a robust methodology
The output of a scRNA-seq experiment is typically represented as an expression matrix consisting of g rows (corresponding to genes or transcripts) and N columns (corresponding to cells). SC3 takes this expression matrix as its input. The SC3 algorithm consists of several steps: a cell filter, a gene filter, distance calculations, spectral transformations combined with k-means clustering, followed by the consensus step (Fig. 1a, Methods). Note, in particular, that the distance calculation reflects a change of coordinate space, as we go from the expression matrix (s × N) to a cell-to-cell matrix (N × N). The spectral transformation is a nonlinear step whereby the first S eigenvectors of the distance matrix are retained for clustering. Each of the above steps requires the specification of a number of parameters, e.g. different metrics to calculate distances between the cells, or the particular type of nonlinear transformation. However, choosing optimal parameter values is difficult. To avoid this problem, SC3 utilizes a parallelisation approach, whereby a significant subset of the parameter space is evaluated simultaneously to obtain a set of optimized clusterings. Instead of trying to pick the optimal solution from this set, we combine all the different clustering outcomes into a consensus matrix that summarises how often each pair of cells is located in the same cluster. The final result provided by SC3 is determined by complete-linkage hierarchical clustering of the consensus matrix into k groups. Using this approach, we can leverage the power of a plenitude of well-established methods, while additionally gaining robustness against the particularities of any single method.
We first validated SC3 using eight publicly available scRNA-Seq datasets9,14,17,18,20–23. ranging from N = 80 to N = 3,005 single cells (Fig. 1b). For each dataset, we used the clustering provided by the authors of the original study as reference labels (i.e., the ground truth) against which we compared the clusters obtained by SC3. To quantify the similarity between clusterings, we used the Adjusted Rand Index (ARI, see Methods) which ranges from 1, when the clusterings are identical, to 0 when the similarity is what one would expect by chance. A wide variety of methods and parameters to include within our consensus approach was evaluated by first analysing three of the eight datasets (Fig. 1c). We found that clustering performance was largely unaffected by the cell filter, and for different choices of gene filter, distance metrics and spectral transformations, the ARI values were overall similar (Fig. 1c, S1-S3). In contrast, we found that the quality of the outcome as measured by the ARI was most sensitive to the number of eigenvectors, d, retained after the spectral transformation: the ARI is low for small values of d, and then increases to its maximum before dropping close to 0 as d approaches N/2. In particular, we find that the best clusterings were achieved when d is between 4-7% of the number of cells, N (Methods). We hypothesise that this range represents an optimal dimensionality reduction such that the technical noise is minimized for our type of data. It is likely that this range will no longer be optimal for other data, e.g. a small yeast genome or a sample with much more shallow coverage. We thus use this range of parameters within our consensus approach to narrow down the parameter space to be explored. Strikingly, consensus clustering24 of the solutions obtained in this range further improves the ARI (Fig. S4).
To benchmark SC3, we considered four other methods: tSNE25 followed by k-means clustering (a method similar to those used by Grün et al1 and Macosko et al15), pcaReduce13, SNN-Cliq12 and SINCERA16. SC3 performs better than the four tested methods across all datasets, with the exception of the Pollen1 dataset (Fig. 2). In contrast to the other methods that rely on different initializations (pcaReduce and tSNE+kmeans), SC3 is highly stable, and except for the Usoskin and Zeisel data, a single dominant solution is obtained. Furthermore, the higher ARI attained by SC3 comes at a moderate computational cost (Fig. S6): the run time for N = 1,000 is ~10 min.
SC3 can be scaled to large datasets
The main computational bottleneck of SC3 is the calculation of the eigenvectors of the distance matrix (which scales as O(N3))26. This scaling makes it impractical to use the version of SC3 described above for datasets with >1,000 cells. To apply SC3 to larger datasets, we have implemented a hybrid approach that combines unsupervised and supervised methodologies. When the number of cells is large (N>1,000), SC3 selects a small subset of cells uniformly at random, and obtains clusters from this subset as described above. Subsequently, the inferred labels are used to train a support vector machine (SVM, Methods), which is employed to assign labels to the remaining cells. Training the SVM typically takes only a few minutes, thus allowing SC3 to be applied to very large datasets.
To test the SVM in isolation, we used the validation datasets from Fig. 1c with the authors’ reference labels to train the SVM. Our results demonstrate that using <20% of the cells for training it is possible to accurately predict the labels of the remaining cells (Fig. 3a). For the situation where we first assign labels using the unsupervised clustering described above, we find that it is still possible to achieve an ARI>0.8 with only 20% of the training cells for the larger datasets (Fig. 3b). Overall, the performance of the SVM improves with higher N, and for the larger datasets (Klein and Zeisel), an ARI of 0.8 can be achieved with <10% of the cells used as a training set.
Using this approach, we are able to analyse a large Drop-Seq dataset with N = 44,808 cells and a = 39 clusters15. For this dataset, we found that an ARI of 0.8 can be achieved by the SVM with only 3% of the cells used for training. Interestingly, in the case of this dataset SC3 produces a result that is drastically different from the authors’. Closer inspection of our results shows that SC3 does not identify cluster 24 (labelled as “Rods”) with 29,400 cells15. This implies that the original cluster 24 may be more heterogeneous than suggested by the authors.
SC3 features an interactive and user-friendly interface
To increase its usability, SC3 features a graphical user interface displayed through a web browser, thus minimizing the need for bioinformatics expertise. The user is only required to provide the input expression matrix and a range for the number of clusters, k. SC3 will then calculate the possible clusterings for this range. Since the clustering is performed during startup, the user can thus explore different choices of k in real time. The outcome is presented graphically as a consensus matrix to facilitate visual assessment of the clustering quality (Fig. 4a). The elements of the consensus matrix represent a score, normalised between 0 and 1, indicating the fraction of SC3 parameter combinations that assigned the two cells to the same cluster. By considering the consensus scores within and between clusters, one may quickly assess the clustering quality by visual inspection. Furthermore, to aid the user in the selection of k, SC3 also calculates the silhouette index27, a measure of how tightly grouped the cells in the cluster are (Fig. S10).
SC3 assists with biological interpretation
A key aspect to evaluate the quality of the clustering, which cannot be captured by traditional mathematical consistency criteria, is the biological interpretation of the clusters. To help the user characterise the biology of the clusters, SC3 identifies differentially expressed genes, marker genes, and outlier cells. By definition, differentially expressed genes differ between two or more clusters. To detect such genes, SC3 employs the Kruskal-Wallis28 test (Methods) and reports the differentially expressed genes across all clusters, sorted by p-value. In contrast, marker genes are highly expressed in only one of the clusters and are selected based on their ability to distinguish one cluster from all the remaining ones (Fig. 4b). To select marker genes, SC3 uses a binary classifier based on the gene expression to distinguish one cluster from all the others. For each gene, a receiver operator characteristic (ROC) curve is calculated and the area under the ROC curve is used to rank the genes. The area under the ROC curve provides a quantitative measure of how well the gene can distinguish one cluster from the rest. The most significant candidates are then reported as marker genes (Methods). Cell outliers are also identified by SC3 through the calculation of a score for each cell using the Minimum Covariance Determinant29. Cells that fit well into their clusters receive an outlier score of 0, whereas high values indicate that the cell should be considered an outlier (Fig. 4c). The outlier score helps to identify cells that could correspond to, e.g., rare cell-types or technical artefacts. In addition, SC3 facilitates obtaining a gene ontology analysis for each cluster by directly exporting the list of marker genes to WebGestalt30. All results from SC3 can be saved to text files for further downstream analyses.
To illustrate the above features, we analysed the Deng dataset tracing embryonic developmental stages, including zygote, 2-cell, 4-cell, 8-cell, 16-cell and blastomere. Based on the silhouette index, SC3 recommends a clustering into either k=2 groups or k=10 groups (Fig. S10). The solution for k=2 identifies one cluster with the blastocyst and one with the remaining cells. The most stable result for k=10 is shown in Fig. 4a, and our clusters largely agree with the known sampling timepoints. However, our results suggest that the difference between the 8 and 16 cell stages is quite small. The latter stages of development are labelled “early”, “mid” and “late” blastomeres, although it is well known that these stages consist primarily of trophoblasts and inner cell mass. Interestingly, SC3 suggests that the mid-blastocyst stage could be split into two groups, which most likely correspond to trophoblasts and the inner cell mass (clusters 3 and 4). This conclusion is supported by the fact that Sox2 and Tdgf1 (inner cell mass markers) are listed as marker genes in cluster 431 (Table S2). In total, we identified ~3000 marker genes (Table S2), many of which had been previously reported as specific to the different developmental stages32–36. Furthermore, the analysis revealed several genes specific to each developmental stage which had previously not been reported (Table S2). Importantly, using the reference labels reported by the authors22, we identified nine cells with high outlier scores (green cells in Fig. 4c) which were prepared using the Smart-Seq2 protocol instead of the Smart-Seq protocol12, 22, thus demonstrating the ability of SC3 to identify technical outliers.
SC3 characterises myeloproliferative neoplasm subclones
Myeloproliferative neoplasms, a group of diseases characterised by the overproduction of terminally differentiated cells of the myeloid lineage, reflect an early stage of tumorigenesis where multiple subclones are known to coexist in the same patient19. From exome sequencing data, we previously identified TET2 and JAK2V61F as the only driver mutations in a large patient cohort 37. In this paper, we use two patients out of this cohort which harbour TET2 and JAK2V617F mutations in their stem cell compartment. Haematopoietic stem cells (HSCs) are thought to be the cell of origin in MPNs. However, little is known about the transcriptional consequences of driver mutations on the stem cell in HSCs. We obtained scRNA-seq data from HSC from two different patients (Methods). For patient 1 (N = 51), SC3 suggested that k = 3 provides the best clustering, revealing three clusters of similar size (Fig. S14). For patient 2 (N = 89) SC3 indicated k=1, suggesting that one single cluster per patient might best reflect the underlying transcriptional changes.
Since known driver mutations in our patients are the TET2 and JAK2V617F loci38 we hypothesized that the different clusters correspond to different combinations of mutations within different clones of each patient. Unfortunately, the coverage of the JAK2V617F and TET2 loci was insufficient to reliably determine the genotype of each cell directly from the scRNA-seq data. Instead the genotype composition for each HSC clone was determined by growing individual hematopoietic stem cells into granulocyte/macrophage colonies, followed by Sanger sequencing of the TET2 and JAK2V617F loci (Fig. 5a). In agreement with the clustering defined by SC3, patient 1 (k=3) was found to harbor three different subclones: (i) cells with both TET2 and JAK2V617F mutations, (ii) cells with a TET2 mutation and (iii) wild-type cells (Fig. 5b). Strikingly, the SC3-clusters contain 22%, 29% and 49% of the cells, in excellent agreement with the proportions of each genotype found in the patient, namely 20%, 30% and 50%. Thus, we hypothesize that cluster 1 corresponds to the double mutant, cluster 2 corresponds to cells with only a TET2 mutation, and cluster 3 corresponds to wild-type cells. The HSC compartment of patients 2 was 100% mutant for TET2 and JAK2V617F, which was consistent with clustering of k=1 suggested by SC3 (Fig. S16 and S19).
Three additional lines of evidence support the assumption that SC3 can help to define subclonal composition. Firstly, we analysed data from patient 2, with a dominant double mutant clone, together with all cells from patient 1, harbouring three different clones. SC3 clustering again suggested k = 3 (Fig. 5b, S17). Most importantly, all of the putative double mutant cells from patient 1 were grouped with the cells from patient 2. SC3 also reported 33 marker genes for the putative TET2 mutant and 202 marker genes for the putative double mutant clone (Fig. 5c, Table S4). Secondly, we used microarray data from BFU-E colonies available for patient 137 where the genotype of each clone was linked to a specific transcriptional signature. When comparing differentially expressed genes for the double mutant clone from BFU-E colonies and the marker genes obtained from the pooled putative TET2/JAK2 mutant clone, we found 13 genes in common. This overlap was significant (p-value=0.048, hypergeometric test) and further strengthens our assumption that the double mutant clone identified through clustering by SC3 indeed harbours both mutations. This result demonstrates that SC3 is capable of finding clusters defined by mutations rather than by patient batch. By contrast, none of the other clustering methods was able to clearly identify clusters which could readily be related to the genotype information (Fig. S15, Methods).
Lastly, we performed pathway analysis using marker genes. KEGG Pathway analysis (Methods) of the latter group revealed that one of the most upregulated categories was “Chemokine signalling” (Table S5) with Wikipathway analysis (Methods and Table S6) highlighting interleukin and EPO receptor signalling pathways as being enriched. Therefore, ligands involved in JAK/STAT pathway activation are highly enriched in our marker genes for the putative double mutant cluster. For the putative TET2 only mutant subclones, none of the above pathways were specifically misregulated. Instead, we hypothesized that since TET2 is involved in DNA de-methylation there would be a global impact on the transcriptome. Loss of TET enzymes has been reported to impact on the variability in gene expression in mouse embryos37. Comparing the genome-wide distribution of the normalized variances revealed that the putative TET2 mutants have more variable transcriptomes than putative wild-type cells (Mann-Whitney test p-value <2.2e-16, Fig. S18). In summary, we can, for the first time, confer genotype information from patient data and infer pathways and putative disease genes important for disease maintenance for HSCs.
Discussion
We have presented SC3, an interactive tool for unsupervised clustering of scRNAseq data. By comparing to several other methods, we demonstrate that SC3 provides a highly accurate clustering for published datasets. For large datasets, SC3 employs a hybrid approach, which makes it possible to scale the method to very large experiments, e.g. Drop-seq15. Importantly, SC3 features a graphical user interface, making it interactive and userfriendly. SC3 also aids biological interpretation by identifying differentially expressed genes, marker genes, and outlier cells. The ability of SC3 to identify marker genes and differentially expressed genes is important when analysing complex scRNAseq datasets. Traditional methods for differential expression analysis39,40 are impractical since a total of k(k-1)/2 comparisons are required, as only two groups at a time are compared. Each comparison provides a list of genes, and extracting the information directly provided by SC3 would require a large amount of postprocessing.
A major challenge when developing unsupervised clustering algorithms for scRNAseq is the lack of good mathematical models that can be used to generate realistic, synthetic surrogate datasets to benchmark the methods. Instead, we must rely on published datasets where the labels have been provided by the original authors. For some of the datasets (Ting, Patel and Klein), the labels are likely to be accurate since they correspond to cells taken from different tissues, patients or time-points. For the other datasets, however, the labelling was based on a combination of the authors’ clustering methods and their biological knowledge. In these latter cases, the labelling is less reliable, and we cannot be certain that the original clustering represents a meaningful ground truth. Reassuringly, the experimental consistency in the original labels correlates well with the observed accuracy of SC3; for the datasets where the labels are almost certainly correct (Ting, Patel and Klein) SC3 obtains almost perfect ARIs close to 1, whereas for the remaining, less certain datasets, SC3 reaches good ARIs of ~0.8.
Another central problem of unsupervised clustering relates to the choice of the number of clusters, k. We investigated whether the internal measures of clustering quality, such as the Dunn41 and Davies-Bouldin42 indexes, can be used to choose k, but we did not find any correlation between them and the external indexes (Rand, ARI and Jaccard43) (Fig. S5). In addition, for many datasets (e.g. Pollen, Usoskin and Zeisel) there are at least two hierarchies present, and this is likely to be the case for most samples from complex tissues. Rather than identifying a single k, SC3 allows the user to interactively explore different options.
Although significant progress has been made on understanding the mutations leading to cancer19. much less is known about the differences between subclones within the tumour at the transcriptional level. We applied SC3 to scRNA-seq data from two patients diagnosed with myeloproliferative neoplasms. We found strong evidence in support of the hypothesis that the clusters revealed by SC3 directly correspond to subclones identified by independent experiments37. Moreover, we used the marker genes identified by SC3 to provide a biological characterisation of the different subclones (Fig. 5b, (c). Our results demonstrate that it is possible to identify subclones using scRNA-seq, and that the analysis of the transcriptome can provide significant insights into the functional consequences of different mutations. We aim to investigate the role of the identified marker genes in future studies to further demonstrate the value of the characterisation of the transcriptome of individual subclones.
As sequencing costs decrease, larger scRNA-seq datasets will become increasingly common, furthering their potential to advance our understanding of biology. An exciting aspect of scRNA-seq is the possibility to address fundamental questions that were previously inaccessible, e.g. de novo identification of cell-types. However, the current lack of computational methods for analysing scRNA-seq has made it difficult to exploit fully the information contained in such datasets. We have shown that SC3 is a versatile, accurate and user-friendly tool, which will facilitate the analysis of complex scRNA-seq datasets. We believe that SC3 can provide experimentalists with a hands-on tool that will help extract novel biological insights from such rich datasets.
Methods
Validation datasets
All validation datasets (Fig. 1c), except the Pollen dataset, were acquired from the accessions provided in the original publications. The Pollen dataset17 was acquired from personal communication with Alex A Pollen.
SC3 clustering
SC3 takes as input an expression matrix M where columns correspond to cells and rows correspond to genes. Each element may correspond either to the number of transcripts present or number of reads depending on the details of the experimental protocol. By default SC3 does not carry out any form of normalization or correction for batch effects. SC3 is based on seven elementary steps. The parameters in each of these steps can be easily adjusted by the user, but are set to sensible default values, determined via cross-validation on the Deng, Pollen and Usoskin datasets.
1. Cell Filter
The cell filter is an optional feature that should be used if the quality of data is low, i.e. if one suspects that some of the cells may be technical outliers with poor coverage. The cell filter removes cells, which contain less than a specified number of expressed genes (genes with measured non-zero expression level). The default in SC3 for the minimum number of expressed genes is set to 2,000.
2. Gene Filter
The gene filter removes genes that are either expressed or absent (expression value is less than 2) in at least 94% of cells. The motivation for the gene filter is that ubiquitous and rare genes most often are not informative for the clustering.
3. Log-transformation
For further analysis the filtered expression matrix M is log-transformed after adding a pseudo-count of 1: M’ = log2(M + 1).
4. Distance calculations
Distance between the cells, i.e. columns, in M’ are calculated using the Euclidean, Pearson and Spearman metrics to construct distance matrices.
5. Transformations
All distance matrices are transformed using either principal component analysis (PCA), multidimensional scaling (MDS) or by calculating the eigenvectors of the associated graph Laplacian (L = I - D1/2AD−1/2. where I is the identity matrix, A is the distance matrix and D is the degree matrix of A). The columns of the resulting matrices are then sorted in descending order by their corresponding eigenvalues. The MDS method was not included in SC3 because of poor performance (Fig. S1-3).
6. k-means
k-means clustering is performed on the first d eigenvectors of the transformed distance matrices (Fig. 1a) by using the default kmeans() R function with the Hartigan and Wong algorithm44. The maximum number of iterations was set to 109 and the number of starts was set to 1,000.
7. Consensus clustering
SC3 computes a consensus matrix using the Cluster-based Similarity Partitioning Algorithm (CSPA)24. For each clustering result a binary similarity matrix is constructed from the corresponding cell labels: if two cells belong to the same cluster, their similarity is 1, otherwise the similarity is 0 (Fig. 1a). A consensus matrix is calculated by averaging all similarity matrices. This can be split in 2 steps:
7a. Consensus over the d range
SC3 calculates the consensus matrix over a range of S from 4% to 7% of N. Consensus over the d range is performed for each combination of distance measures and transformations (Fig. 1a). To reduce computational time, if the length of the d range (D on Fig. 1a) is more than 15, a random subset of 15 values selected uniformly from the d range is used. Note that consensus over the d range does not provide a single solution and further averaging is required.
7b. Consensus over the parameter set
Consensus over the parameter set takes multiple results of the consensus over the d range and includes additional averaging of similarity matrices over the distance measures and transformations. As a final result SC3 uses consensus averaging over all six combinations of distances and transformations by default.
7c. Consensus clustering
The resulting consensus matrix (after consensus over the parameter set) is clustered using hierarchical clustering with complete agglomeration and the clusters are inferred at the k level of hierarchy, where k is defined by a user (Fig. 1a).
Fig. S4 shows how the quality and the stability of clustering improves after consensus clustering.
Adjusted Rand Index
If cell-labels are available (e.g. from a published dataset) the Adjusted Rand Index (ARI)45 can be used to calculate similarity between the SC3 clustering and the published clustering. Since the reference labels are known for all validation datasets, ARI is used for all comparisons throughout the paper. We also investigated the correlation between external (Rand index and Jaccard index43 S and internal (Dunn index41 and Davies-Bouldin index42) (Fig. S5) measures of clustering. An external index evaluates the clustering based on an external reference, whereas an internal index does not require any additional information. Even though external indexes were in good agreement with each other, there was little or no correlation between the external and the internal indexes.
Benchmarking
Before benchmarking we applied the Gene Filter to all the datasets. For tSNE+k-means, SNN-Cliq and pcaReduce we additionally applied Log-transformation as described above (M’ = log2(M + 1)). For SINCERA we used the original z-score normalisation16 instead of the Log-transformation. For tSNE the Rtsne() function of the Rtsne R package was used with the default parameters.
Support Vector Machines (SVM)
Results shown in Fig. 3 were obtained using the following steps. When a dataset contains more than 1,000 cells, we randomly select and cluster 1,000 cells. Next, a support vector machine (SVM)46 model with a linear kernel (this kernel provided the best clustering predictions, results from using other kernels - polynomial, radial and sigmoid - are shown in Figs. S7-S9) is constructed based on the obtained clustering. We used the svm function of the e1071 R-package with default parameters. The cluster IDs for the remaining cells are then predicted by the SVM model.
Based on the results presented in Fig. 3, we set the default of SC3 so that when N>1,000 only 20% of the cells are used and when N>5,000 only 1,000 cells are used for clustering before training the SVM.
Biological insights
Identification of differential expression
Differential expression is calculated using the non-parametric Kruskal-Wallis test28, an extension of the Mann-Whitney test for the scenario when there are more than two groups. A significant p-value indicates that gene expression in at least one cluster stochastically dominates one other cluster. SC3 provides a list of all differentially expressed genes with p-values<0.01, corrected for multiple testing (using the default “holm” method of p.adjust() R function) and plots gene expression profiles of the 50 most significant differentially expressed genes. Note that the calculation of differential expression after clustering can introduce a bias in the distribution of p-values, and thus we advise to use the p-values for ranking the genes only.
Identification of marker genes
For each gene a binary classifier is constructed based on the mean cluster expression values. The area under the receiver operating characteristic (ROC) curve is used to quantify the accuracy of the prediction. A p-value is assigned to each gene by using the Wilcoxon signed rank test comparing gene ranks in the cluster with the highest mean expression with all others (p-values are adjusted by using the default “holm” method of p.adjust() R function). The genes with the area under the ROC curve (AUROC) >0.85 and with the p-value<0.01 are defined as marker genes. The AUROC threshold corresponds to the 99% quantile of the AUROC distributions obtained from 100 random permutations of cluster labels for all datasets (Table S1 and Fig. S11). SC3 provides a visualization of the gene expression profiles for the top 10 marker genes of each obtained cluster.
Cell outlier detection
Outlier cells in each cluster are detected by first reducing the dimensionality of the cluster using the robust method for PCA47. Second, robust distances are calculated using the minimum covariance determinant (MCD)29. We then used a threshold based on the 99.99% quantile of the chi-squared distribution to define outliers. Finally, we define an outlier score as the differences between the square root of the robust distance and the square root of the 99.99% quantile of the chi-squared distribution. The outlier score is plotted as a barplot (Figs. 4d).
Patients
All patients provided written informed consent. Diagnoses were made in accordance with the guidelines of the British Committee for Standards in Haematology.
Isolation of haematopoietic stem and progenitor cells
Cell populations were derived from peripheral blood enriched for haematopoietic stem and progenitor cells (CD34+, CD38-, CD45RA-, CD90+), hereafter referred to as HSCs. For single cell cultures, individual HSCs were sorted into 96-well plates (Fig. S12) and grown in a cytokine cocktail designed to promote progenitor expansion as previously described48. For scRNA-seq studies, single HSCs were directly sorted into lysis buffer as described in Picelli et al49.
Determination of mutation load
Colonies of granulocyte/macrophage composition were picked and DNA isolated for Sanger sequencing for JAK2V617F and TET2 mutations as previously described by Ortmann et al37.
Single cell RNA-Sequencing
Single HSCs were sorted into 96-well plates and cDNA generated as described previously49. The Nextera XT library making kit was used for library generation as described by Picelli et al49.
Processing of scRNA-seq data from HSCs
96 single cell samples per patient with 2 sequencing lanes per sample were sequenced yielding a variable number of reads (mean = 2,180,357, std dev = 1,342,541). FastQC50 was used to assess the sequence quality. Foreign sequences from the Nextera Transposase agent were discovered and subsequently removed with Trimmomatic51. The reads were trimmed to 90 bases to remove low quality positions before being mapped with TopHat52 to the Ensembl53 reference genome version GRCh38.77 augmented with the spike-in controls downloaded from the ERCC consortium54. Counts of uniquely mapped reads in each protein coding gene and each ERCC spike-in were calculated using SeqMonk (http://www.bioinformatics.bbsrc.ac.uk/proiects/seqmonk) and were used for further downstream analysis. Quality control of the cells contained two steps: 1. filtering of cells based on the number of expressed genes; 2. filtering of cells based on the ratio of the total number of ERCC spike-in reads to the total number of reads in protein coding genes. Filtering threshold were manually chosen by visual exploration of the quality control features (Fig. S13). After filtering, 51 and 89 cells were retained from patient 1 and patient 2, correspondingly. The expression values in each dataset were then normalised by first using a size-factor normalisation (from DESeq2 package55) to account for sequencing depth variability. Secondly, to account for technical variability, a normalisation based on ERCC spike-ins was performed using the RUVSeq package56 (RUVg() function with parameter k = 1). For combined patient data, normalisation steps were performed after pooling the cells. The resulting filtered and normalised datasets were clustered by SC3.
Clustering of patient scRNA-seq data by SC3
We clustered scRNA-seq data from patient 1 and patient 2 separately as well as a combined dataset containing data from patient 1 + patient 2. For patient 1 the best clustering was achieved for k = 3 (Fig. S14). Data from patient 2 was homogeneous and we were unable to identify more than one meaningful cluster (Fig. S15). For the combined dataset for patient 1 + patient 2 the best values of k were 2 and 3 (Fig. S17). In both cases all of the cells from cluster 1 in patient 1 were grouped with the cells from patient 2. For k = 3 clusters 1 and 3 of patient 1 were also resolved.
Comparison of clustering of patient 1 scRNA-seq data
Among the tools (SC3, SNN-Cliq and SINCERA) that can predict k, SC3 was the only one predicting 3 clusters for the patient 1 dataset (Fig. S14). SNN-Cliq and SINCERA predicted 2 and 1 clusters, respectively. We compared SC3 clustering of the patient 1 data for k = 2, 3, 4 and 5 to the other methods by evaluating the clustering and its stability (Fig. S15). The stability was defined by running each stochastic method 100 times, then taking one of the solutions and comparing it to the other 99 solutions using the ARI. The stability was defined as “number of times the ARI equals 1” + 1. The stability of the deterministic methods (SNN-Cliq and SINCERA) was set to 100. SC3 identified the same 11 cells of cluster 1 of the patient 1 (red cells on Fig. S15a) at three levels of granularity (k = 3, 4 and 5) with a stability of 100 (Fig. S15b). SINCERA found a similar cluster when k was set to 4 or 5. tSNE+kmeans was able to identify 3 clusters similar to the ones identified by SC3, but the perplexity parameter had to be adjusted manually (perplexity = 16) and the solutions were unstable (stability = 1). pcaReduce was unable to find the 3 clusters and had a stability of less than 20 (exact value depends on the choice of random seed).
Identification of differentially expressed genes from microarray data
The microarray data of patient 1 (EE50) was obtained from Array Express accession number E-MTAB-3086.7. One replicate (2B) was identified as an outlier and removed. The limma R package57 was used to identify 932 differentially expressed genes between WT and TET2/JAK2V617F double mutant using an adjusted (by false discovery rate) p-value threshold of 0.1.
Marker genes analysis for patients
For both patients, to increase the number of marker genes, the AUROC threshold was set to 0.7 instead of the default value of 0.85 and the 0.1 p-value threshold was chosen.
Pathway enrichment analysis
We utilized WebGestalt30 web tool to find pathway enrichments in obtained set of marker genes. By default we used the KEGG Analysis and Wikipathways Analysis options with a p-value of 0.1.
Contributions
MH conceived the study; VYK, MH, MS, MB and TA contributed to the computational framework; KK and TC performed the experiments for the patient data; KNN helped with the analysis of embryonic mouse data; MB, WR, ARG and MH supervised the research; VYK and MH led the writing of the manuscript with input from the other authors.
Accession Numbers
scRNA-seq data for patient 1 and 2 is available from GEO accession GSE79102.
Software availability
SC3 is available as a R package at http://bioconductor.org/packages/SC3/.
Funding
Work in the Green lab is supported by Bloodwise (grant ref. 13003), the Wellcome Trust (grant ref. 104710/Z/14/Z), the Medical Research Council, the Kay Kendall Leukaemia Fund, the Cambridge NIHR Biomedical Research Center, the Cambridge Experimental Cancer Medicine Centre, the Leukemia and Lymphoma Society of America (grant ref. 07037), and a core support grant from the Wellcome Trust and MRC to the Wellcome Trust-Medical Research Council Cambridge Stem Cell Institute.
Acknowledgements
We would like to thank Borislav Vangelov, Jean-Charles Delvenne and Renaud Lambiotte for fruitful discussions and their help with computational methods. We would also like to thank David Flores Santa Cruz, Danai Dimitropolou and Jacob Grinfeld for technical assistance with experiments. We thank Ignacio Vasquez-Garcia, David Harmin, Michal Kosicki, Daniel Ramsköld and Meri Huch for helpful comments on the manuscript.