Abstract
Accurate and comprehensive extraction of information from high-dimensional single cell datasets necessitates faithful visualizations to assess biological populations. A state-of-the-art algorithm for non-linear dimension reduction, t-SNE, requires multiple heuristics and fails to produce clear representations of datasets when millions of cells are projected. We developed opt-SNE, an automated toolkit for optimal t-SNE parameter selection that utilizes Kullback-Liebler divergence evaluation in real time to tailor the early exaggeration and overall number of gradient descent iterations in a dataset-specific manner. The precise calibration of early exaggeration together with opt-SNE adjustment of gradient descent learning rate dramatically improves computation time and enables high-quality visualization of large cytometry and transcriptomics datasets, overcoming limitations of analysis tools with hard-coded parameters that often produce poorly resolved or misleading maps of fluorescent and mass cytometry data. In summary, opt-SNE enables optimal data resolution in t-SNE space and more precise data interpretation.
Visual exploration of high-dimensional data is imperative for the comprehensive analysis of single cell datasets. Fluorescence, mass and sequencing-based cytometric data analysis requires tools that are able to reveal the combinations of proteomic and/or transcriptomic markers that define complex and diverse cell phenotypes in a mixed population. While traditional biaxial data presentation via expert-driven gating is still the standard analysis method for cytometry data to date, with the advent of the modern multi-parameter era an analysis tool that can accurately and comprehensively visualize multi-dimensional data is direly needed to relieve the current cytometry data-processing bottleneck.
To date, multiple dimensionality reduction techniques have been applied to cytometry data with variable success. Linear methods, such as principal component analysis (PCA) which generates a low-dimensional representation of data with a linear mapping matrix, are mostly unsuitable for cytometry data visualization as such techniques cannot faithfully present the non-linear relationships in the data. t-Distributed Stochastic Neighbor Embedding (t-SNE) is a state-of-the-art dimensionality reduction algorithm for non-linear data representation that creates a low-dimensional distribution, or a ‘map’, of high-dimensional data 1, 2. Conspicuous groupings of datapoints, or ‘islands’, correspond to observations that are similar in the original high-dimensional space and help to visualize the general structure and heterogeneity of a dataset. t-SNE was developed as a machine learning technique for a broad range of data types and has been adopted 2 for single-cell applications. When t-SNE embeds single cell data, the islands represent cells with similar phenotypes, as defined by a cytometric or genomic signature, thereby allowing to reveal biological data structure and to surface important differences between samples and/or subject groups3.
In addition, t-SNE maps are used to categorize single cell data into relevant biological populations for downstream quantification, achievable through expert-guided filtering (‘gating’)4 or unsupervised clustering of the map 5–7. Cytometry clustering algorithms that directly interrogate high-dimensional data, such as FlowSOM 8 and PhenoGraph 9, are often used in conjunction with t-SNE maps to present annotated clusters to the viewer.
A limitation of t-SNE in its current form is its inability to scale to datasets with large numbers of observations 7, 10. This restrains t-SNE’s utility for cytometry datasets that often include millions of observations (‘events’) routinely collected for phenotypic analysis. Unlike PCA, t-SNE learns the embedding non-parametrically, and hence new pieces of data cannot validly be added to an existing analysis, necessitating the whole dataset to be analyzed within one computation. When the full dataset is comprised of multiple samples, each representing a subject in a large cohort or an independent experimental condition, retaining statistically significant representation of small subpopulations in each sample requires inflating the dataset size 11. However, even within a single measurement, subsampling the data risks preventing rare subsets from being identified. These limitations cannot be overcome via application of the currently understood best practices for t-SNE use. Not only are large datasets computationally expensive to analyze, but also the resulting t-SNE maps provide poor visualization and incomprehensive representation of high-dimensional data. As a result, upon finding initial poor t-SNE visualizations from large datasets, researchers often resort to either subsampling their data to the very limit of detection of rare populations 12 or to exporting specific populations from their dataset, thus compromising the ‘unbiased’ data analysis approach13, 14.
Although t-SNE has been widely adopted by the scientific community, to our knowledge no rigorous theoretical or empirical testing of t-SNE for cytometry applications has been performed. In 2013, Amir et al first reported the use of t-SNE (or viSNE, as it was renamed 3) on mass cytometry data; since then, t-SNE has been implemented in the majority of commercial and open-source platforms for cytometry analysis (FlowJo, Cytobank, FCSExpress, cytofkit, etc). In most implementations, few or no adjustments were made to the Barnes-Hut t-SNE algorithm for the requirements of cytometry datasets; the default and hard-coded parameter settings that were originally tested and optimized with non-cytometry datasets like CIFAR (image dataset) or MNIST (handwritten digits) are retained in these cytometry programs. Developing rigorous methodology to release the full potential of t-SNE for single cell data comprehension is the primary motivation for this work.
In this study, we first assessed the behavior of t-SNE computation with routinely used settings that match common best practices, and then iteratively modified parameters of embedding to identify conditions that ensure optimal visualization. As a result of this work, we propose a new method to automatically find optimal t-SNE parameters via fine-tuning of the early exaggeration stage of t-SNE embedding in real time. We call our approach opt-SNE, for optimal t-SNE. We find that our adjustments can tremendously shorten the number of iterations required to obtain visualizations of large cytometry datasets with superior quality. Our approach also eliminates the need for trial-and-error runs intended to empirically find the optimal selection of t-SNE parameters, potentially saving hundreds of hours of computation time per research project.
Materials and methods
Primary samples
Flow18parameter and Flow20M data were collected as described 18 with minor modifications of the flow cytometry reagent list. Study protocols were approved by the institutional review boards and all subjects provided written informed consent. The committee that approved the research protocol was the Boston University Institutional Review Board, IRB# H-33095.
Data pre-processing
Singlet events from several data recordings were digitally concatenated and a randomly subsampled file of 1,000,000 mass 15 or flow cytometry 16 events was created and used for analyses of the mass41parameter and flow18parameter datasets. All observations from 27 recordings of flow cytometry data were concatenated to generate the flow20M dataset. All observations from 104 recordings of mass cytometry data 7 were concatenated to generate the van Unen et al dataset.
All flow cytometry data were compensated with acquisition-defined compensation matrices. Prior to t-SNE analysis, all cytometry data were transformed using asinh (all mass cytometry data and flowl4parameter data) or biexponential (flow20M) transformation. Light scatter parameters were log-transformed.
Data analysis
A desktop C++ Barnes-Hut implementation of t-SNE for Mac OS was used for most t-SNE analyses 2. All datasets were embedded in 2D space. Original code was edited to allow user input for early exaggeration stop iteration, perplexity, total number of iterations, early exaggeration factor value, and learning rate value. KLD value and t-SNE coordinates were reported during each iteration or as frequently as requested. To allow generation of visually comparable t-SNE maps, the same random seed value was used and all experiments were repeated with several values of random seed. We did not observe noticeable differences in reported results between runs initiated with different seed values. For cross-validation and to benchmark against standard platforms, we utilized cloud-based Cytobank 19, FlowJo V10.3–10.5 and FlowJo V9.9.6. Cytobank and FlowJo platforms were used to generate FCS files and graphical outputs from tabular data. FIt-SNE (Fast Fourier Transform-accelerated Interpolation-based t-SNE 20 plugin for FlowJo was used as indicated. Logs of t-SNE runs were batch-processed with VBA scripts and analyzed with GraphPad Prism 7. Expert-guided (manual) analysis of cell populations was performed in FlowJo 10.3–10.5 as described previously for specific datasets 15, 18 or as explained below and used for map annotations as cluster classifiers.
For scRNAseq analysis, we used SeqGeq 1.3 package. We used PCA projections provided in 10X Genomics dataset to calculate the t-SNE embedding and annotated it using marker genes for major cell types. Louvain cluster classification was adopted from the SCANPY data analysis study 17
The quality of the embeddings was assessed by a nearest neighbor (NN) classifier strategy as described in previous reports on t-SNE accuracy evaluation 1, 2, 10. Briefly, for each observation, the k nearest neighbors (by Euclidean distance) were calculated using the 2D coordinates of the t-SNE map and the class assigned by expert gating was compared to the most common class of its k neighborhood. The rate of correct matches was tallied and represented as the overall nearest neighbor accuracy. The accuracy was also calculated on a per-class basis; different values for k (1,10,20,30,40,50) were reported.
The standard t-SNE configuration for cytometry applications
As described in detail in van der Maaten 2014, t-SNEcomputes low-dimensional coordinates of high-dimensional data resulting in similar and dissimilar data points in the raw data space placed proximally and at a distance, respectively, in the dimensionally reduced map. This map placement is achieved via t-SNE modeling the probabilities as a Gaussian distribution around each data point in the high-dimensional space and modeling the target distribution of pairwise similarities in the lower-dimensional space using Cauchy distribution (Student t-distribution with 1 degree of freedom). Then, the Kullback-Liebler Divergence (KLD) between the distributions is iteratively minimized via gradient descent. The gradient computation is essentially an N-body simulation problem with attractive forces (approximated to nearest neighbors using vantage-point trees) pulling similar points together and repulsive forces (approximated at each iteration using the Barnes-Hut algorithm) pushing dissimilar points apart.
An important part of t-SNE gradient descent computation is the “early exaggeration” (EE) that was proposed by van der Maaten and Hinton (2008) to battle the “overcrowding” artifact of embedding. With EE, all probabilities modeling distances in high-dimensional space are multiplied by a factor (early exaggeration factor, EEF, or a) for the duration of the first (typically 250, or 25% of the total number of) iterations. EE coerces data to form tight and widely separated clusters in the map and is considered to enable the map to find a better global structure.
Multiple software platforms incorporate t-SNE algorithm specifically for analysis of cytometry data, including commonly used cytometry analysis desktop packages (FlowJo, FCS Express), and the cloud-based analysis platform Cytobank. Also, implementations of t-SNE are available as open-source packages in popular programming languages such as R (rtsne) and Python (sci-kit learn). Most of implementations wrap or re-write original C++ code of Barnes-Hut t-SNE (van der Maaten, 2014) and produce comparable analysis results upon direct comparison. Here, we customized the standard t-SNE C++ code to implement the parameter adjustments described in this work and published this customization as an open source solution to enable the research community to use this optimized t-SNE algorithm. Also, the equivalent adjustments of t-SNE have been made available as a cloud application from Omiq and integrated into FlowJo and SeqGeq programs.
Results
The standard t-SNE configuration fails to visualize large datasets
The t-SNE algorithm can be guided by a set of parameters that finely adjust multiple aspects of the t-SNE run 21. However, cytometry data analysis software often locks or severely restrains the tunability of those parameters, likely to provide a simplified, ‘one-size-fits-all’ solution for t-SNE use in the software packages. Although each software platform has a unique combination of possible adjustments, most allow changes to both the number of iterations and to the perplexity (a soft measure for the number of nearest neighbors considered for each data point).
The datasets used throughout this study include at least 1 million datapoints of fluorescent or mass cytometry data and are therefore considerably larger than the smaller (< 5×105) datasets previously reported in benchmark studies of cytometry algorithmic tools 22. Cytometry datasets larger than approximately 5×105 events are generally observed to produce poor quality t-SNE maps and are therefore usually subsampled prior to analysis.
Empirically, cytometrists have observed that increasing the number of iterations of t-SNE computation results in better quality maps. We hypothesized that the resolution of t-SNE maps created from higher event counts could be dramatically improved via fine-tuning of t-SNE parameters. We first directly tested the relationship between iteration number and map quality by running two datasets (mass41parameter and flow18parameter, as described in Table 1) at the default 1000 iterations per run and with an “extended” 3000 iteration computation (Fig. 1A, B). To aid visualization in Fig 1 and the following figures maps are overlaid with color-coded populations derived from expert-driven (‘manual’) gating of the same dataset which serves as a ground-truth basis for data classification. As expected, 1000-iteration runs produced maps with poor visualization (overall 1-NN accuracy of embedding was 65% and as low as 18% for certain populations). Specifically, massive overlaps and random fragmentation of populations were observed. In contrast, the 3000-iteration runs resulted in maps with defined “islands” comprised of clearly isolated populations and no random fragmentations (overall 1-NN accuracy of embedding 96%; see Suppl. Table 1 for the detailed results of accuracy evaluations). Therefore, these findings are in agreement with the concept of a higher number of iterations resulting in higher quality t-SNE maps.
KLD plateau phase resolves global cluster structure in t-SNE visualization
In order to determine the cause of the difference in cluster resolution between the “default” and “extended” t-SNE runs, we examined the behavior of KLD over the duration of t-SNE embeddings (Fig 1C). As expected, the KLD value was inflated during the EE since EEF is factored into gradient and KLD value calculation1. The EE is applied over 250 iterations in the “default” (red line, Fig. 1C, 1000 iterations) t-SNE configuration and 750 iterations in the “extended” run with 3000 iterations (black line Fig. 1C, 3000 iterations) since most platforms have EE scaled to 25% of total iteration number.
Notably, the KLD did not immediately minimize at the start of the EE in both the “default” and “extended” t-SNE runs; instead, the graph of KLD over time is a plateau that is followed by a curve that captures the incremental decrease of KLD, indicating the gradient descent. In the “default” run, the plateau was interrupted when the EE was stopped and KLD dropped, then continued with a non-exaggerated value of KLD.
According to van der Maaten and Hinton, EE was introduced as a “trick” to improve resolution of the global structure of the data visualization that would not otherwise converge to separated clusters. As the suboptimal quality of the 1000-iteration t-SNE maps shown in Fig. 1A, B demonstrates poor global structure resolution, we hypothesized that by increasing the total iterations an analyst who uses t-SNE via conventional cytometry analysis platforms may inadvertently increase the number of EE iterations and this specific alteration may be the cause of the improvement in map visualization. To test this hypothesis, we compared multiple 2000- or 3000-iteration runs that differed in timing of the EE stop (Fig. 2A) by plotting the embedding at the iteration both when the EE stops and at later iterations, thus assessing the effects of EE and our perturbations on both mass cytometry (Fig. 2A) and flow cytometry (Fig. 2B) data visualization.
We found notable differences in map quality between the shorter and longer EE runs. Although the map after EE200/total3000 iterations appears visually superior than EE250/total1000 (Fig 1A, B) and could be considered a successful visualization, ground-truth labeling indicated that cluster fragmentation is present in both maps. When cluster fragments were plotted on a biaxial plot against parameters that were used in the t-SNE dimension reduction, we were not able to identify parameters that immediately contributed to their fragmentation (Suppl. Fig. 1).
Conversely, tight clusters that form at the end of the plateau remain mostly unchanged as long as the EE is being applied to the computation (Fig 2 A, B). The KLD minimization in that case could be explained by the gradual shrinking of the 2D space (data not shown). Once EE is removed, the attractive forces within each cluster are weakened and the local structure of the data is fully resolved within each cluster. Overall, these observations suggest that the EE stage of the gradient descent is essential for data clustering while the non-exaggerated descent results in resolution of local structures.
Real-time monitoring of the KLD plateau results in optimal quality of t-SNE maps
We have demonstrated that when the EE is too short, cell clusters continue to be resolved simultaneously with the local structure of each cluster being unfolded, leading to fragmented, overlapped or deformed “islands” in the resulting map (Fig. 2). Due to these results, we constructed an equation to find optimal EE timing. Specifically, we tracked the relative rate of KLD change (KLDRCN = 100% *((KLDN-1 – KLDN)/ KLDN-1) where N is the iteration number) and identified the local maximum of KLDRC (maxKLDRC) (Fig. 3A). Since KLD is computed at each iteration, the maxKLDRC ‘sensor’ can be easily added to the algorithm programmatically and would stop EE at the next iteration past maxKLDRC. For the mass41parameter dataset of 1M datapoints, the maxKLDRC was detected at iteration 705 (Fig. 3A). Next, we ran t-SNE with an EE stop at iteration 706 and sampled map development at 706, 1.5×706, 2×706, 2.5×706 and 3×706 iterations (Fig 3B). As expected, at the maxKLDRC iteration the map contained the primordial clusters only; it was well shaped at 2 x max KLDRC and there was no visible improvement in map quality past that step and the visualization was very similar to the EE750/3000 map at Fig. 1A. When compared to the map created with ‘default’ settings of EE taking 25% of the run yet computed with the same number of iterations (1410), the maxKLDRC-triggered t-SNE produced visually and KLD-wise superior results within the similar computation time, and it also eliminated extensive trial-and-error calibration of t-SNE parameters (Fig. 3C). We propose a conservative approach to finalize the embedding automatically when (KLDN-1 – KLDN) < KLDN/10,000. Alternatively, t-SNE projection output can be evaluated in real time to justify the termination of embedding.
Moderate adjustments of EE factor and perplexity do not impact visualization
Once we found EE to be crucial for map optimization, we next examined if the value of the EE factor a can also be tuned to improve the results of t-SNE. We made a user-accessible in our C++ t-SNE code since it is hard-coded in the original Barnes-Hut C++ t-SNE implementation and all aforementioned results were obtained with default value of α = 12. We chose the parameters defined above (α = 12, EE = 706 iterations, 1410 iterations total) as our baseline for comparison since they provided optimal balance of map quality versus computation time. First, we tested how the optimization would proceed without EE (α = 1). We expected the run to fail or produce extremely crowded results as explained in the original t-SNE report 1; however, we did not see much overlap in cluster positioning, probably due to the substantial number of map iterations run (Fig 4B). Nevertheless, the resulting map showed a lot of fragmentation proving to be an extreme case of an interrupted plateau phase. Even when run for as many as 3000 iterations, the fragmentation could not be remedied, again demonstrating the necessity of EE.
As expected, higher values of α lead to much higher KLD during EE, however, the KLD values were similar at 2000 iterations when α varied between 4 and 60 (Fig. 4A). Larger α prolonged the plateau phase and became detrimental for KLD values when over 100. Visually α = 200 results in a distorted map with smaller populations lost. Therefore, we suggest that for cytometry applications the α parameter may remain unchanged and set to 12, as suggested in van der Maaten 2014, or reverted to α = 4, as originally proposed in van der Maaten and Hinton (2008) since per our results any value between 4 and 20 leads to comparable outcomes.
Increased perplexity has been proposed to be an intuitively beneficial method for visualization improvement since it translates to a larger number of considered nearest neighbors and hence a more accurate approximation of attractive forces, while decreased perplexity can completely fail the visualization 21. KLD values for runs with varying perplexity cannot be directly compared since the KLD value is related to perplexity; however, KLD records over time do not show that increased perplexity results in faster resolution of clusters (Fig 4C) or cleaner data visualization (Fig 4D). However, while changing a does not affect t-SNE computation time, perplexity is linearly related to the time and memory required to create the embedding (data not shown). Although we and others have found some benefits of perplexity increases to map quality in otherwise suboptimal t-SNE runs, optimizing the EE step as described above and further in this work does not leave much space for improvement with perplexity tuning (Fig 4E).
Learning step size is a key parameter to ensure t-SNE visualization of large datasets
The step size in t-SNE gradient descent is updated at each iteration per Jacobs adaptive learning rate scheme 23. This method increases the learning rate in directions in which the gradient is stable. A conservative initial value of 200 is hard-coded into most platforms. We hypothesized that larger datasets may stay longer on KLD plateau due to the number of iterations it takes to build up a sufficient learning rate step size. To evaluate this possibility, we titrated the step size h while observing the KLD with fixed EE=1000 iterations in Mass41parameter dataset. In agreement with our hypothesis, h = 25 and h = 50 runs failed to resolve from KLD plateau within 1000 iterations of EE (Fig. 5A) and h = 200 finished the plateau in ~700 iterations as previously shown. With further increases in h, we found that not only are progressively fewer iterations required to complete the plateau, but also that the final KLD of the maps scored at lower values. KLD is directly related to the quality of visualization since it reflects the faithfulness of representation of high-dimensional data in t-SNE space; therefore, lower KLD values indicate superior visualization quality.
We continued to see improvement in plateau duration and KLD values with higher h values up to h ~ 64,000, a value that is drastically far from the “default” h = 200 setting (note that in most platforms, h is restricted to ranges below 3,000) (Fig. 5B). At h ~ 256,000 we observed irregular peaks in KLD graph indicating that the prescribed step size rendered gradient descent ineffective. However, using lower values of h we were able to converge the map with lowest KLD values at a fraction of time when limiting the EE step to 200 and even 100 iterations despite the 106 size of the dataset (Fig. 5C). Visual inspection of the embedded maps over the range of h values agrees with KLD values (Fig 5D).
In a recent publication, Linderman and Steinerberger 24 prove that in general t-SNE embedding will not converge if a product of EE factor a and of fixed learning rate step size h is larger than the number of datapoints n (i.e. if αh > n). Since we employ an adaptive learning rate, our selection of initial h value is more forgiving; however, in our experiments we found the optimal settings of h to be close to h = n / 12 for computations where a = 12. Therefore, we propose to initiate the gradient descent with h = n / α to create optimal t-SNE visualization (Fig 5E).
Opt-SNE allows optimal embedding of massive datasets
We implemented all proposed techniques including: (1) a dataset-specific automated early exaggeration step controlled by the KLDRC sensor, (2) an optimal learning rate step size, and (3) a KLDRC-driven embedding termination into a single workflow labeled ‘opt-SNE’, for ‘optimized t-Stochastic neighbor embedding’. To test the performance of opt-SNE, we used a 20.1 × 106 event fluorescent cytometry dataset concatenated from two independent batches of PBMC samples (N=27) stained with a variation of the OMIP-037 fluorescent cytometry panel 18 that allows detailed assessment of naïve and memory CD4+ and CD8+ T cells, NK cells and γδ T cells (Fig. 6A). The embedding completed in 770 iterations with only 73 iterations required to pass the EE step (Suppl. Fig. 3) and resulted in clear separation of cell clusters as evaluated by cell type annotation (Fig 6C). The majority of clusters appear to be populated by cells from all subjects with the exceptions of several populations that contained sample-unique debris features (Fig 6B, dashed arrows), confirming an absence of batch effects. A detailed breakdown of identified populations is presented in Fig 6C that shows subsets of CD4+ and CD8+ T cells, NK cells, γδ T cells, B cells, and monocytes. Importantly, B cell and monocyte lineage markers were detected together with a viability dye in a dump channel in this panel and therefore cannot be gated accurately via traditional biaxial plot analysis. However, opt-SNE identified them in high-dimensional space through the combination of several surface labels and light scatter characteristics, and each was successfully clustered into populations that were minimally mixed with dead cells. Use of the standard t-SNE algorithm completely failed to reveal the structure of the multi-million event flow cytometry dataset, even with several thousands of iterations (Fig 6D).
In order to test the suitability of opt-SNE for applications beyond flow and mass cytometry, we analyzed a 1.3×106 cell single-cell RNA-seq dataset of mouse embryonic brain cells published by 10X Genomics. We used pre-calculated PCA projections included in the dataset to generate opt-SNE maps that we compared with 10X standard t-SNE embedding (Fig 6E). 10X used EE = 1000/total 4000 iterations of standard t-SNE while we used opt-SNE settings with h = 97,959, EE = 66/total 885 iterations (Suppl. Fig. 2). Non-immune single cell transcriptomics data are more difficult to interpret with ground-truth classes since much fewer scRNA-seq markers can be easily interpreted for population identification. Therefore, we utilized both single gene classification and classification through the Louvain algorithm clustering using the Scanpy Python package 17 to annotate the data. While opt-SNE and t-SNE both capture the macro-structure illustrated by gene overlays, several Louvain clusters (3, 18, 19, 25, 28) appear partially or completely overlapped by other clusters in standard t-SNE but not in the opt-SNE embedding. Therefore, opt-SNE allowed equivalent or superior resolution of single cell transcriptomics data as with standard t-SNE but with ~5x smaller iteration time (885 iterations of opt-SNE vs 4000 iterations of standard t-SNE).
HSNE (hierarchical SNE) is a t-SNE adaptation that was recently reported to facilitate analysis of large datasets by constructing a hierarchy of embeddings that can be explored from the overview of ‘landmark’ populations up to a single-cell level or resolution 7. We applied opt-SNE to the 5.2 million point dataset that was reported in HSNE analysis of mass cytometry data, and compared opt-SNE visualization to the full resolution level of HSNE embedding (Fig. 6G, Suppl. Fig. 3). In the CD4+ subset, opt-SNE visualization revealed two groups of CD4+CD28- cells likely representing terminally differentiated memory CD4+ T cells 25 with different levels of CCR7 expression that may reflect distinct differentiation states of the two populations. While HSNE allowed identification of CD4+CD28-CCR7-cells, it was unable to visualize CD4+CD28-CCR7+ cells as a single cluster and projected these cells sparsely in the CD4+ island (Fig.6I, left). Also, both algorithms projected heterogenous expression of CD56 in CD4+CD28-CCR7-cells, but HSNE embedding did not resolve separate CD56+ and CD56- clusters within CD4+CD28-CCR7-cells, loosely mapping them to the poles of the single round cluster (Fig. 6I, right). On the contrary, opt-SNE embedding visualized both the CD56+ and CD56- clusters and disparate CD45RA expression within each cluster, revealing distinct phenotypes for the control and diseased subject groups (Fig. 6H). Also, opt-SNE demonstrated that the CD56+CD45RA-cells in the cluster originate from several subjects with Crohn’s disease (Fig. 6H). In summary, these results confirm that opt-SNE embedding provides superior visualization quality for complex cytometry data.
Discussion
Visual exploration of data drives hypothesis formation and/or serendipitous discoveries; therefore, t-SNE is an extremely valuable tool for data comprehension. It is often used to facilitate data perception when hypothesis generation is automated by robust computational methods 8, 26. Comparison of t-SNE embeddings from multiple experimental conditions, timepoints, or subjects is invaluable to visualize sample-to-sample differences including disease hallmarks and longitudinal observations 27. t-SNE is also valuable for quality assessment of data, when abnormal clustering could be traced back to sample preparation, data acquisition and preprocessing artifacts 28. Therefore, batch embedding of multiple experimental points is essential for sample comparison and can only be enabled when t-SNE accommodates large datasets.
t-SNE was first introduced in cytometry research as a tool to visualize CyTOF data, since fluorescence-based high-parameter datasets were less common at that time. With recent advances in instrumentation and reagent availability, flow cytometry datasets with >20 parameters are quickly becoming prevalent and even standard in the field29–31, yet the proper data assessment tools are lacking for general use. DNA-barcoded antibodies have been recently utilized to allow simultaneous protein-epitope and transcriptome measurements in single cells 32 thus expanding the repertoire of traditional cytometry methods that could employ t-SNE as a staple method of data visualization and presentation.
One approach to large dataset t-SNE embedding is to model visualization with a subset of datapoints curated 7, 33 or randomly selected (as implemented in cytometry data analysis platforms such as FCS Express and Gemstone) from the nearest neighbor graph. Such techniques may fail to project extremely rare datapoints, as demonstrated by CD4+CD28-CCR7+ cells that were identified with opt-SNE, but not HSNE in van Unen et al dataset (Fig. 6). This T cell subset was significantly less abundant in subjects with severe inflammatory conditions including Crohn’s disease as compared to controls (Suppl. Fig. 3), marking this population as extremely rare. On the other hand, the CD4+CD28-CCR7-, while also low in frequency in most PBMC samples (<1% of all CD4+ cells), was likely successfully clustered by HSNE because two of the 11 Crohn disease subjects in the dataset showed an unusually high frequency of this population (40.7% and 31.0% of all CD4+ cells), allowing it to be well represented in the kNN graph of the full dataset.
Several attempts to successfully apply t-SNE-like methods to massive datasets have been recently reported including aforenoted HSNE 7, 33, 34 LargeVis 10 and net-SNE 35. However, these improved methods, when applied to large datasets, often require/benefit from considerable computational resources; for instance, the LargeVis study was performed on a 512Gb RAM, 32 core station. However, routine data analysis in a typical research laboratory should be possible with less available resources. We have not explicitly focused on computational efficiency in this work since we benchmarked the algorithm against itself with no specific emphasis on shortening computation time, which occurred due to the fewer number of iterations required to complete the data embedding with our adjustments. However, we have addressed several aspects of computation efficiency. For public use, we released an opt-SNE modification of multicore t-SNE C++ implementation (See Data and Software Availability) since the original Barnes-Hut t-SNE does not employ multi-threading. FIt-SNE, a recently published alternative to Barnes-Hut t-SNE that uses fast Fourier transform for much faster computation of repulsive forces approximation 20, renders opt-SNE even more feasible on personal computers; when we combined our automated opt-SNE setup with FIt-SNE approximation in the van Unen dataset analysis (Fig. 6G-H), it was completed in about two hours on a 16Gb RAM personal notebook with 2 cores. Notably, we have not observed differences in embedding quality between embeddings generated with fast Fourier transform versus Barnes-Hut approximations when opt-SNE was used to control the visualization (Suppl.Fig.4 and data not shown). All other analyses performed in this manuscript were performed using Barnes-Hut approximation on personal computers with the exception of the 20M embedding that required ~60Gb RAM at its peak and was run for several days on a multicore workstation using SeqGeq implementation of opt-SNE. Therefore, we expect opt-SNE to be applicable for existing or future adaptations of t-SNE even if alternative methods of computation are utilized 20, 36 provided that they retain the core principles of t-SNE embedding. A promising approach that may be integrated with opt-SNE is the smart EE adjustment implemented in A-tSNE (approximated t-SNE)34 algorithm where EE is removed gradually and on a per-point basis. Cytosplore, a novel software platform that includes HSNE and A-tSNE, allows the analyst to interactively initiate the local refinement of the map, resulting a significant improvement in computation time7.
Similar to other types of biological data, the structure of cytometry data is difficult to project due to its mixed nature, often comprised of cluster-like, manifold-like and/or hierarchical components 28, 37. In this paper we propose multiple techniques that are essential for optimal t-SNE data projection and are all germane to the fine-tuning of the early exaggeration stage of t-SNE embedding. EE facilitates cluster formation on a 2D plane 1 and serves as a necessary compromise that allows clusters to escape the crowding effect. We designed an efficient measure to ensure that the cluster-like global structure of the data is fully revealed during the EE stage by monitoring the KLD output of the embedding in real time. As indicated by the lower KLD values of opt-SNE embedding where EE was limited to fewer iterations (Fig. 5E), prolonged amplification of the attractive forces that drive tight cluster formation in EE may be detrimental for the manifold-like local data structure represented by signal distributions that are continuous with background signal. These ‘continuum expression’ molecules include the proteins used to define classic immune cell subsets, as well as those linked with activation and/or exhaustion and markers indicating disease phenotypes38. Conversely, the non-exaggerated stage of t-SNE allows local data structures to be revealed. 1. Therefore, cytometry data analysis would be missing valuable information if we limited t-SNE applicability to finding only well separated clusters, especially since other techniques would perform that task better and faster. However, some workflows call for t-SNE pre-processing to facilitate extraction of cluster features from multidimensional data 5, 39. In those cases, it may be helpful to adapt the opt-SNE toolkit to terminate the embedding calculation immediately at the EE stop iteration and re-assess the high-dimensional structure within each cluster. Alternatively, a ‘late exaggeration approach’ 20 can be cautiously applied to create very tight clusters, although in our experience this approach was only marginally beneficial for global structure representation but detrimental for local structure (data not shown).
It is advisable to note that certain data structures, such as cluster hierarchy, cannot be revealed with t-SNE 40, 41. t-SNE accessibility in cytometry analysis software lead to its not infrequent misuse with cytometry data, evident when the cluster-like structure is not prominent in the map. Therefore, the features identified from t-SNE embedding in its current form should be verified with alternative methods when possible for confirmational purposes. Im et al also suggest that if a continuous manifold structure exists in the data, large perplexity values may cause artificial breaks (overclustering) 41. The perplexity values commonly used in cytometry analysis are on the lower end of the suggested range for efficient clustering, as it is often advised to scale the number of nearest neighbors to the average cluster size 42; however, if computationally feasible, higher perplexity values might facilitate feature preservation for markers whose expression is not bimodally distributed.
In summary, we believe that opt-SNE is a powerful optimization toolkit that removes major limitations of t-SNE use for cytometric datasets and could thus potentiate novel data-driven findings in single cell research.
Data and software Availability
All code including the open source multicore t-SNE C++ implementation, usage instructions, flow18parameters and mass41parameters datasets, and the cloud version of opt-SNE are available at http://www.omiq.ai/opt-SNE (the C++ code with a Python wrapper is available from https://github.com/omiq-ai/Multicore-opt-SNE). Van Unen et al dataset is available at http://flowrepository.org/id/FR-FCM-ZYRM. 10X Genomics 1.3M scRNA-seq dataset is available at https://support.10xgenomics.com/single-cell-gene-expression/datasets
To facilitate availability to flow cytometry and scRNA-seq data analysts, opt-SNE has been incorporated into FlowJo version ≥ 10.5.2 and SeqGeq version ≥ 1.4. At the time of this publication, this option was considered experimental and therefore hidden, but users can enable it by adding <DRPlatform showAutoLearning="1” /> to the FlowJo10.prefs (or SeqGeq.prefs) XML file. It will be fully integrated and available by default in future releases of FlowJo and SeqGeq.
Author contributions
ACB has conceived the study. ACB and COC developed the opt-SNE method. COC developed implementation in C++ and RH and JS developed implementation in FlowJo and SeqGeq. ACB has analyzed flow and mass cytometry datasets. RH and JS have analyzed the scRNA-seq dataset. RA and JSC provided conceptual input. ACB wrote the manuscript. All authors discussed the results and commented on the manuscript.
Competing interests
COC is a founder of Omiq, Inc. RH and JS are employees of FlowJo, LLC.
Acknowledgements
The authors would like to thank Yvan Saeys, Gary Kazantsev, Jonathan Irish, Allison Irvine and Katherine Drake for helpful discussions, and Geoff Kraker for technical assistance. We thank Sean Bendall from Stanford University and Vincent van Unen from Leiden University Medical Center for sharing their data, and Brian Tilton and Riley Pihl from BUSM Flow Cytometry Core Facility for assistance with data collection at BUSM. Anna C. Belkina is an ISAC (International Society for Advancement of Cytometry) SRL Emerging Leader 2015-2019 and thanks the ISAC organization and members for continuous support and encouragement. Josef Spidlen is an ISAC Marylou Ingram Scholar.