Abstract
Cytometry analysis has expanded considerably in recent years with the increase in the maximum number of parameters that can be acquired in a single experiment. In response to this technological advance, there has been an increased effort to develop computational methodologies for handling high-dimensional data acquired by flow or mass cytometry. Despite the success of numerous algorithms and published packages to replicate and outperform traditional manual analysis, widespread adoption of these techniques has yet to be realised in the field of cytometry. Here we present CytoPy, a Python framework for automated analysis of high dimensional cytometry data that integrates a document-based database for a data-centric and iterative analytical environment. The capability of supervised classification algorithms in CytoPy to identify cell subsets was successfully confirmed by using the FlowCAP-I competition data. The applicability of the complete analytical pipeline to real world datasets was validated by immunophenotyping the local inflammatory infiltrate in individuals with and without acute bacterial infection. CytoPy is open-source and licensed under the MIT license. Source code is available online at https://github.com/burtonrj/CytoPy, and software documentation can be found at https://cytopy.readthedocs.io/.
1. Introduction
Cytometry data analysis has undergone a paradigm shift in response to the growing number of parameters that can be observed in any one experiment. As the field evolves, the traditional method of manual gating by sub-setting single cell data into populations and encircling data points in hand-drawn polygons in two-dimensional space has proven laborious, subjective, and difficult to standardise. In response to these shortcomings, a cross-disciplinary effort has given birth to a new approach often termed ‘cytometry bioinformatics’, to leverage complex computer algorithms and machine learning to automate analysis and improve the investigator’s ability to extract meaning from high dimensional data.
Where cytometry is used for data acquisition, the typical objective is to discern differences between groups of subjects or experimental conditions, or to identify a phenotype that correlates with an experimental or clinical endpoint. To this end, a computational approach to analysis of cytometry data can take one of two strategies: to separate single cell data into groups or classifications, which then form the variables (often descriptive statistics of the obtained groups) the investigator uses to test their hypothesis, or directly model the acquired distribution of single cell data with respect to a chosen endpoint. Classification strategies can be further subdivided: autonomous gating replicates traditional gating through the use of algorithms to cluster data in two-dimensional space (flowDensity (1), OpenCyto (2)); high-dimensional clustering groups cells according to their individual phenotypes (FlowSOM (3), PhenoGraph (4), Xshift (5), SPADE (6)); and supervised classification where training on an example of manually gated data produces a classifier capable of distinguishing cell populations (FlowLearn (7) and DeepCyTof (8)). Modelling strategies have been successfully adopted in applications such as ACCENSE (9), CellCNN (10), and CytoDX (11) although this approach requires pooling of sample data and is therefore sensitive to batch effects.
In addition, various pieces of software have been developed for data handling, transformation, normalisation and cleaning (e.g. flowCore, flowIO, flowUtils, flowTrans, reFlow, flowAI), visualisation (e.g. ggCyto, t-SNE, UMAP, PHATE), and pipelines for specific applications (e.g. Citrus, MetaCyto, flowType/RchyOptimyx). To date, there are over 30 different contributions to automated analysis (12; 13; 14; 15). However, there is no widespread adoption of these methods as yet, nor is there a consensus on how to adopt such techniques, with much of the analysis pipeline left to the individual investigator to establish. This inconsistency results in projects amassing collections of custom scripts and data management that are not standardised or centralised, which not only makes reproducing results difficult but also makes for a daunting landscape for newcomers to the field.
We here introduce ‘CytoPy’, a novel analysis framework that aims to mend these issues whilst granting access to state-of-the-art machine learning algorithms and techniques widely adopted in cytometry bioinformatics. CytoPy is developed and maintained in the Python programming language, which prides itself on readability and is becoming the language of choice amongst the open source data science community (16). CytoPy introduces a central data source for all single cell data, clinical/experimental metadata and analysis results, and provides a ‘low code’ interface that is both powerful and beginner friendly.
We demonstrate the capability of supervised classification techniques housed within CytoPy on the Flow Cytometry: Critical Assessment of Population Identification Methods (FlowCAP) data (17), which has been created for comparing the performance of automated analytical techniques for flow cytometry data. As the FlowCAP data underwent extensive pre-processing prior to their publication and hence do not reflect the challenges encountered with primary data generated by individual users, an in-house dataset of local immune cells in samples collected from patients undergoing peritoneal dialysis and who presented with and without acute bacterial infection was generated to demonstrate the applicability of CytoPy as a complete analytical pipeline for complex and unprocessed data. These data reflect the challenges presented by an observational study of complex clinical specimens collected over extended periods of time, often years. We believe that CytoPy provides a powerful and user-friendly framework to interrogate high dimensional data originating from investigations that utilise flow cytometry or mass cytometry for data acquisition, and has the potential to facilitate automated data analysis in a multitude of experimental and clinical contexts.
2. Design and Implementation
2.1. Building a framework that is data-centric
Reliable data management is a cornerstone of successful analysis because it improves reproducibility and collaboration. A typical cytometry project consists of many Flow Cytometry Standard (FCS) files, clinical or experimental metadata, and additional information generated throughout the analysis (e.g. gating, clustering results, cell classification, sample specific metadata). A further complication is that any analysis is not static but an iterative process. We therefore deemed it necessary to anchor a robust database at the centre of our software. In CytoPy, projects are instantiated and housed within this database, which serves as a single dynamic data repository that is then accessed continuously throughout the subsequent analysis. For the architecture of this database we chose a document-orientated database, MongoDB (18), where data are stored in JSON-like documents in a tree structure. Document-based databases carry many advantages, including simplified design, a dynamic structure (i.e. database fields are not ‘fixed’ and the schema can therefore accommodate unforeseen future requirements) and ease of horizontal scaling, thereby improving integration into web applications and collaboration. In this respect, CytoPy depends upon MongoDB being deployed either locally or via a cloud service, and upon MongoEngine, a Document-Object Mapper based on the PyMongo driver (19).
2.2. Framework overview
An overview of the CytoPy framework is given in Figure 1 including a recommended pathway for analysis, although individual elements of CytoPy can be used independently. CytoPy follows an object-orientated design with a document-object mapper for both commitment to, and collection from, the underlying database. The user interacts with the database using an interface of several CytoPy classes, each designed for one or more tasks. CytoPy is algorithm agnostic, meaning new autonomous gating, supervised classification, clustering or dimensionality reduction algorithms can be introduced to this infrastructure and applied to cytometric data using one of the appropriate classes. CytoPy makes extensive use of the Scikit-learn (20) and SciPy (21) ecosystems and follows the naming conventions commonly used in this space. Throughout an analysis, whenever single cell data are retrieved from the database, they are stored in memory as Pandas DataFrames that are accessible for custom scripting at any stage.
Following the steps in Figure 1, a typical analysis in CytoPy would be performed as follows (functions are shown in italics and class names are shown in italics and title-case).
Single cell data are generated and exported from the flow/mass cytometer in FCS 2.0 or 3.0 format. Experimental and clinical metadata are collected in tabular format either as a Microsoft Excel document or a CSV file, with the only requirement being that metadata be in ‘tidy’ format (22).
A Project is defined and populated with the single cell data and accompanying metadata. Each subject (e.g. a patient, a cell line, or an animal) has a Subject document containing metadata that are dynamic and have no restriction on the data stored within, and that are associated to one or several FileGroup documents. Each FileGroup document is representative of one or more FCS files associated to a single biological sample collected from the subject. The single cell data from the multiple files in a FileGroup are saved to disk as a Hierarchical Data Format (HDF) file. The FileGroup is linked to this file on disk but the data can be migrated to any drive of the user’s choosing. The FileGroup and its associated HDF file contain all single cell data, gated populations, clusters and meta-information that pertain to a single sample. This also includes any isotype or Fluorescence-Minus-One (FMO) controls. Compensation is applied to single cell data at the point of entry using either an embedded spillover matrix or a provided CSV file. The FileGroup is associated to an Experiment, containing all samples collected under one particular set of staining conditions. There must always be a Panel document associated to an Experiment. For this, the investigator must provide a ‘panel design’ in the form of a simple Excel document (see CytoPy documentation https://cytopy.readthedocs.io/). CytoPy then uses regular expressions to match FCS metadata such as channel names to the expected panel and offers error handling for when discrepancies arise.
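The matching of FCS channel names to an expected panel can be illustrated with a minimal sketch. The panel contents, the patterns and the function name match_channels below are hypothetical and chosen for illustration only; they are not CytoPy's own panel format or API.

```python
import re

# Hypothetical panel design: canonical marker name -> regular expression that
# should match the (often inconsistently typed) names found in FCS metadata.
PANEL = {
    "CD3": r"^\s*cd[\s\-_]*3\s*$",
    "CD4": r"^\s*cd[\s\-_]*4\s*$",
    "CD8": r"^\s*cd[\s\-_]*8\s*$",
}


def match_channels(fcs_names, panel=PANEL):
    """Map raw FCS channel names onto the expected panel.

    Raises ValueError if a raw name does not match exactly one panel entry,
    or if a panel entry is left unmatched or is matched more than once.
    """
    mapping = {}
    for raw in fcs_names:
        hits = [marker for marker, pattern in panel.items()
                if re.match(pattern, raw, flags=re.IGNORECASE)]
        if len(hits) != 1:
            raise ValueError(f"Channel {raw!r} matched {len(hits)} panel entries")
        mapping[raw] = hits[0]
    assigned = list(mapping.values())
    missing = sorted(set(panel) - set(assigned))
    duplicated = sorted({m for m in assigned if assigned.count(m) > 1})
    if missing or duplicated:
        raise ValueError(f"Panel mismatch: missing={missing}, duplicated={duplicated}")
    return mapping


# Inconsistent spellings are mapped onto the canonical marker names.
standardised = match_channels(["cd3", "CD-4", " Cd_8 "])
```

Raising an error rather than silently skipping an unmatched channel reflects the idea that discrepancies between a file and its panel should be surfaced to the investigator at the point of data entry.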
The common approach to cytometry data analysis is expert driven gating in sequential two-dimensional space. Attempts to automate gating in the past have focused on the application of a single methodology (1; 23; 24) or an assortment of selected algorithms (2). CytoPy attempts to expand on the latter approach with an open design, allowing the use of any unsupervised learning algorithm supported in Scikit-Learn or libraries that follow the Scikit-Learn template. Algorithms that generate a label for each data-point can be handled using the PolygonGate object, converting the output into polygon gates by calculating the convex envelope of each cluster. Probabilistic models such as Gaussian Mixture Models are handled by the EllipseGate object, where elliptical gates are generated by drawing a confidence interval around the model’s components. Finally, we have designed a density based algorithm implemented in the ThresholdGate object, separating data in one or two-dimensional space based on properties of the estimated probability density function. In a traditional analysis, gates are applied in sequence to derive the cell population of interest. CytoPy replicates this with autonomous gates using the GatingStrategy class. Using this class, autonomous gates are defined once and then applied to subsequent samples. In each instance, algorithms will ‘fit’ the new data they encounter (generating a data-driven strategy to gating) and then locate the expected populations, annotating accordingly.
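The conversion of cluster labels into polygon gates can be sketched as follows. This is an illustration of the general idea using Scikit-Learn and SciPy directly on synthetic data, not the PolygonGate implementation itself; the choice of K-means and all variable names are assumptions.

```python
import numpy as np
from scipy.spatial import ConvexHull
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic, well separated "populations" on a two-dimensional plot.
data = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(200, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(200, 2)),
])

# Any algorithm that returns one label per event could be used here.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)

# The convex hull of each cluster gives the vertices of a polygon gate.
polygons = {}
for label in np.unique(labels):
    points = data[labels == label]
    hull = ConvexHull(points)
    polygons[int(label)] = points[hull.vertices]
```

Each entry of polygons holds the ordered vertices of one gate, which could then be drawn on a bi-axial plot or stored alongside the population it defines.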
The variation that separates biological specimens can originate from genuine biological differences but can also arise from technical variation introduced by pipetting errors, changes in instruments, variation in experimental practice, or other uncontrollable experimental conditions. CytoPy offers the ‘variance’ module, containing utilities for visualising the univariate and multivariate deviations between biological specimens. If the data and study design allow, the user can consider pooling data and modelling the distribution of single cell data directly. In some circumstances the user will have to consider the contribution of batch effects, and in such a case the investigator can use the SimilarityMatrix class in CytoPy to group subjects according to their statistical distance from one another. Representative data from each group can be used as training data for a supervised classification algorithm used to annotate the entire dataset without the need for manual gating or complex autonomous gating strategies.
Multiple strategies can be employed to classify cells based on a common phenotype. Strategies such as autonomous gating and supervised classification are biased by the training data provided (and the gating strategy used to label those data) whereas high-dimensional clustering is an unsupervised method that groups cell populations according to their phenotype but can be difficult to critique. CytoPy offers both supervised classification through the CellClassifier class and high dimensional clustering through the Clustering class, so that variables can be generated from either or both strategies. Importantly, the results of either strategy can be committed to the database and then visually interpreted using a class called Explorer. The Explorer class also facilitates exploratory data analysis with interactive plots of embedded space using multiple dimensionality reduction techniques.
Once cells have been classified, the user can test their hypothesis. The single cell data are summarised into a ‘feature space’, summary statistics that describe the cell populations. This generates a large number of variables, many of which will be either uninformative or redundant. Filter and wrapper methods are applied to perform feature selection, finding only those variables important for predicting a clinical/experimental endpoint. In addition, there are multiple methods available for visualising extracted features, thereby allowing the investigator to quickly determine whether certain patterns exist in the dataset.
3. Results
3.1. Autonomous gating can standardise the cleaning of single cell data for rapid analysis
To validate CytoPy we demonstrate its use on the characterisation of immune cells in peritoneal drain fluid and whole blood of peritoneal dialysis (PD) patients with and without acute bacterial infection. We chose this dataset based on a wealth of previous experience in the field (25; 26; 27), the clinical relevance of acute peritonitis in those patients (28), and because of the technical challenges presented by the sample type. Samples were collected between 2017 and 2019 and stained with a comprehensive panel of monoclonal antibodies to identify T lymphocytes, monocytes, dendritic cells, eosinophils and neutrophils as the major constituents of peritoneal immune cells, together with activation and differentiation markers on those populations (Supplementary Tables S2 and S3).
Cytometry data are highly variable and surface marker expression must often be identified amongst a backdrop of cellular debris and staining artefacts. This is particularly relevant when studying complex samples such as local specimens taken from the site of acute infection. In the case of individuals receiving PD, bacterial infection leads to the influx of billions of inflammatory cells, predominantly neutrophils, into the peritoneal cavity within a few hours (27). Traditionally the variability this introduces into analysis is handled by laborious and subjective manual gating. We generated a computational framework that grants access to numerous algorithms for the purpose of data-driven autonomous gating. The user can design a sequence of autonomous gates, save this sequence to a GatingStrategy, and then apply the sequence to subsequent data. This is exemplified in Figure 2, showing the identification of T lymphocytes from local immune cells in the peritoneal effluent of PD patients. This example utilises density-driven threshold gates, where a threshold is determined based on properties of the Probability Density Function as estimated using Gaussian Kernel Density Estimation, and mixture models shown as elliptical gates. CytoPy hides the daunting complexity of this process behind a low-code interface (Supplementary Figure S1), in an attempt to make cytometry bioinformatics more approachable for newcomers to the field. Many popular algorithms are accessible through this interface, including Birch, mini-batch K means and mixture models (Supplementary Figure S2).
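A density-driven threshold of this kind can be sketched in one dimension as follows. The rule used here (placing the threshold at the density minimum between the two highest peaks of a Gaussian KDE) is one plausible reading of "properties of the Probability Density Function" and is an assumption; the rules applied by ThresholdGate may differ.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Synthetic one-dimensional marker with a negative and a positive population.
x = np.concatenate([rng.normal(0.0, 0.5, 2000), rng.normal(4.0, 0.5, 2000)])

# Estimate the probability density function on a regular grid.
grid = np.linspace(x.min(), x.max(), 512)
density = gaussian_kde(x)(grid)

# Locate the two highest peaks, then take the lowest point between them.
peaks, _ = find_peaks(density)
left, right = np.sort(peaks[np.argsort(density[peaks])[-2:]])
threshold = grid[left + np.argmin(density[left:right + 1])]

# Events above the threshold form the "positive" population.
positive = x > threshold
```

Because the threshold is re-estimated from the density of each new sample, the gate adapts to shifts in staining intensity rather than relying on a fixed cut-off.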
Autonomous gates are capable of replicating manual gates but are often dependent on the choice of hyperparameters and their optimal values may differ between data. Despite this, we have found their application is suitable for identification of large populations requiring simple gating strategies. Figure 3 shows the performance of autonomous gates for identifying common cell populations in peritoneal effluent by comparing their correlation with the same population identified manually by an expert analyst. It is evident that autonomous gates fail when a population is rare and/or displays poor separation from cells with similar phenotypes. The same observations were made when gating for T cell subsets, with good performance for CD4+ and CD8+ T cells, but poor performance for minor populations such as mucosal-associated invariant T (MAIT) cells and γδ T cells (data not shown).
3.2. CytoPy provides accurate cell classification using supervised machine learning algorithms
Unsupervised methods employed by autonomous gates may fail to generalise and struggle to reliably annotate rare cell populations, populations that significantly deviate between biological specimens, or populations that may be almost absent in some individuals yet abundant in others. Further to this, autonomous gates are only exposed to one or two dimensions of the n-dimensional feature space (that is, the vector of intensity values for all measured fluorochromes) rather than exploiting all available variables. The nature of cytometry data lends itself well to supervised classification, given that a typical biological sample yields hundreds of thousands of events but current technologies are limited to measuring up to a maximum of approximately 40 variables for each cell, resulting in an abundance of observations. We therefore hypothesised that a supervised classifier, trained on one or more annotated examples and then exposed to all available variables in the data, would result in increased performance. After consulting the literature we found that others had employed such techniques (8; 7) but no robust framework for their application exists.
CytoPy offers the CellClassifier class as a blueprint for supervised classification in a cytometry framework. Through this class CytoPy exposes the popular machine learning libraries Scikit-Learn (20), XGBoost (29) and Keras (30) to the task of annotating cytometry data. The CellClassifier class follows the conventions of Scikit-Learn by providing a familiar application programming interface (API) and the apparatus for any classification algorithm to be integrated into the CytoPy framework. In this study, we have chosen to demonstrate the following: XGBoost, a Feed-Forward Neural Network, Linear Discriminant Analysis, Support Vector Machines and K-Nearest Neighbours. The choice of algorithms to include in this analysis was based on prior experience with classification tasks (31), examples in the literature of supervised classifiers in this domain (8; 17; 23), and the relevance of including classifiers from multiple families (32). In order to validate the CellClassifier class and test the performance of each algorithm, we utilised the FlowCAP-I classification challenge (see Supplementary Methods). These data were chosen because of prior publication and their use for validation of algorithms applied to cytometry analysis (17; 8). As shown in Table 1, when judged by weighted F1 score we found acceptable performance for each algorithm applied (defined as an F1 score greater than 0.9) while XGBoost gave the best performance, and was therefore deemed the method of choice for the remainder of this study.
3.3. CytoPy provides visual and quantitative tools for evaluating inter-sample variation, assisting the choice of suitable training data
The FlowCAP-I data are suitable for validation of a method but are heavily pre-processed and simplified compared to data encountered in a complex observational study such as the peritonitis data introduced earlier. Studies designed to collect clinical specimens over several months or even years introduce unavoidable complications, including but not limited to deviations in experimental conditions, changes in instrument setup, and variation introduced by batch changes of staining monoclonal antibodies. The technical variation combined with the biological variation observed for each specimen can make it difficult to determine their comparability and choose representative training data for a supervised learning approach to cytometry data annotation.
CytoPy provides the variance module to visualise and quantify this variation, to assist the user in the choice of adequate training material. As an anchoring point, the user should choose a suitable “reference sample” to be used when comparing the observable variation amongst all samples in an experiment. A reference sample can be identified using the calculate_ref_sample function. Following the method presented by Li et al. (8), CytoPy performs a pairwise computation of the Euclidean norm of the difference between the samples’ covariance matrices, and selects the sample with the smallest average distance to all other samples as reference. This reference sample can then be used for univariate comparison of each channel using the marker_variance function (Figure 4A) or multivariate comparison using a dimensionality reduction technique such as Principal Component Analysis (PCA), achieved with the dim_reduction_grid function (Figure 4B).
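The selection of a reference sample can be sketched with NumPy alone. The function name choose_reference and the synthetic samples are hypothetical; the sketch assumes that the distance between two samples is the Euclidean (Frobenius) norm of the difference between their covariance matrices, and it is not the calculate_ref_sample implementation.

```python
import numpy as np


def choose_reference(samples):
    """Return the index of the sample whose covariance matrix is, on average,
    closest to those of all other samples, plus the pairwise distance matrix."""
    if len(samples) < 2:
        raise ValueError("At least two samples are required")
    covs = [np.cov(s, rowvar=False) for s in samples]
    n = len(covs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # np.linalg.norm on a matrix defaults to the Frobenius norm.
            dist[i, j] = dist[j, i] = np.linalg.norm(covs[i] - covs[j])
    mean_dist = dist.sum(axis=1) / (n - 1)
    return int(np.argmin(mean_dist)), dist


# Synthetic samples (events x channels) that differ only in their spread.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
samples = [base * scale for scale in (3.0, 1.0, 1.2, 1.1, 1.3)]
reference_index, distances = choose_reference(samples)
```

In this toy example the sample with the intermediate spread (index 2) is selected, because it sits closest to the remaining samples on average while the strongly deviating first sample is avoided.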
In Figure 4, the reference sample is shown in blue and compared to randomly selected samples shown in red; ten such samples are depicted to ease visual interpretation but there is no limit to the number of comparisons that can be made in a single plot. While Figure 4A shows the degree of inter-sample variance for individual fluorochromes and highlights abnormalities in a single channel, Figure 4B shows the same ten randomly selected samples, individually plotted to overlay the reference sample, thus illustrating the multivariate drift of a sample compared to the chosen reference. This allows for identification of samples that are explicit outliers and gives a general sense of the inter-sample variance in the complete immunological landscape measured.
The approach illustrated in Figure 4 defines methods that are helpful for visually critiquing the quality of the dataset and that can identify anomalies that should be addressed by changing technical procedures in data acquisition. To proceed with classifying cells into known phenotypical subsets we must take into account this variation. This is achieved in traditional manual gating by laboriously adjusting gates on a per-sample basis, with considerable variation depending on the investigator. For automated classification by supervised methods, we instead choose our training data in such a way that inter-sample variation is accounted for. CytoPy provides the SimilarityMatrix class and the output is shown for each sample type in Figure 5. Unlike the visualisation techniques depicted in Figure 4, the SimilarityMatrix quantifies the inter-sample variation by computing a pairwise statistical distance for each possible combination of samples. In brief, the joint probability density function (PDF) of the n-dimensional feature space is estimated using kernel density estimation (KDE; multiple implementations are available but CytoPy defaults to a fast convolution based technique (33)). The bandwidth for KDE can be given as a floating point number used for every KDE computation, estimated for each sample using a normal approximation (e.g. Silverman’s method), or optimised for each sample by cross-validation. The latter is the preferred and default method, whereby the optimal bandwidth is chosen by grid-search hyperparameter tuning using cross-validation; the optimal bandwidth being the one which maximises the total log probability density under the model. If the number of cells observed is small, the number of dimensions is large, or the user lacks computational resources, they can opt to perform dimension reduction prior to the KDE. The statistical distance for each pair of PDFs is calculated to generate a matrix of distances.
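Bandwidth selection by cross-validated grid search can be sketched with Scikit-Learn. The KernelDensity estimator is used here as a stand-in for the convolution based KDE that CytoPy defaults to, and the bandwidth grid and synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(2)
# Synthetic single cell data: 300 events measured in two channels.
cells = rng.normal(size=(300, 2))

# With no explicit scoring, GridSearchCV uses KernelDensity.score, i.e. the
# total log-likelihood of the held-out events under the fitted model.
search = GridSearchCV(
    KernelDensity(kernel="gaussian"),
    param_grid={"bandwidth": np.logspace(-1.5, 0.5, 12)},
    cv=5,
)
search.fit(cells)

best_bandwidth = search.best_params_["bandwidth"]
kde = search.best_estimator_
# Log density of each event under the model refitted on all data.
log_density = kde.score_samples(cells)
```

A grid search of this kind is considerably more expensive than a rule-of-thumb bandwidth, which is one reason dimension reduction prior to the KDE can be attractive when resources are limited.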
The statistical distance shown in Figure 5 is the square root of the Jensen-Shannon divergence (the default choice for this function), given by:
$$ D_{JS}(p, q) = \sqrt{\frac{KL(p \parallel m) + KL(q \parallel m)}{2}} $$
Where m is the pointwise mean of the left probability vector p (PDF of the first sample) and the right probability vector q (PDF of the second sample). KL is the Kullback-Leibler divergence. The Jensen-Shannon distance returns a value between 0 and 1, where 0 indicates that the distributions p and q are equivalent, and 1 that they are highly dissimilar (34; 35). Any statistical distance (a function taking two probability vectors and outputting a metric distance) can be used, but by default the Jensen-Shannon distance is applied, chosen for its properties of symmetry and finite output (35; 36). The SimilarityMatrix outputs a heatmap where the colour of each cell corresponds to the Jensen-Shannon distance of the x, y axis pair that overlaps on the given cell. The axes of the heatmap are clustered using single linkage clustering. Clustering on the pairwise Jensen-Shannon distance reveals groups of samples that are similar in the distribution of their single cell subsets in high dimensional space. Classification of cell populations in these groups can be performed independently per group but with the same objective of identifying phenotypically distinct cell populations. For each group, the investigator chooses training data (a uniform sample of cells from each member of the group) using the calculate_ref_sample function, or can choose to sample multiple members of a group with the create_ref_sample function to generate a new FileGroup containing cells from many specimens. Suitable training data are then annotated for the cell phenotypes of interest (e.g. for T lymphocytes this might be CD4+ and CD8+ T cell subsets) using the gating infrastructure discussed in section 3.1. Once annotated, a classifier is trained using the labelled reference and subsequently predicts the cell populations for the remaining members of the group. This approach accounts for the inter-sample variation, and therefore improves the classifier’s ability to generalise.
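The Jensen-Shannon distance and the resulting matrix of pairwise distances can be sketched with NumPy. The function names are hypothetical, the probability vectors are assumed to be PDFs evaluated on a shared grid, and a base 2 logarithm is assumed so that the distance reaches exactly 1 for distributions with no overlap.

```python
import numpy as np


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2 logarithm)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)  # pointwise mean of the two probability vectors

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing to the sum
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(divergence, 0.0)))  # guard against rounding


def distance_matrix(pdfs):
    """Symmetric matrix of pairwise Jensen-Shannon distances."""
    n = len(pdfs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = js_distance(pdfs[i], pdfs[j])
    return dist


# Toy probability vectors standing in for discretised sample PDFs.
pdfs = [[0.5, 0.5, 0.0], [0.4, 0.6, 0.0], [0.0, 0.0, 1.0]]
matrix = distance_matrix(pdfs)
```

Here the first two vectors are close to one another (a small distance) whereas the third shares no probability mass with either, giving a distance of 1.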
3.4. Supervised classification algorithms can reliably identify cell subsets in complex sample types whilst providing tools to inspect and diagnose anomalies
In Figure 5, biological samples were clustered on pairwise Jensen-Shannon distances to reveal groups of samples of relatively high similarity; clustering results are shown as a dendrogram on the axis of each two-dimensional heatmap matrix. Groups are derived by cutting the dendrogram at a level that was heuristically chosen through visual inspection of the dendrogram. This process was repeated for each sample type and set of staining conditions to generate the groups shown in Figure 6A where each group was treated independently during supervised classification.
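Deriving groups by cutting a single linkage dendrogram can be sketched with SciPy. The distance values and the cut height of 0.3 are hypothetical; in practice the cut would be chosen by inspecting the dendrogram, as described above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical pairwise distances between five samples
# (symmetric with a zero diagonal).
distances = np.array([
    [0.00, 0.05, 0.08, 0.60, 0.65],
    [0.05, 0.00, 0.06, 0.62, 0.63],
    [0.08, 0.06, 0.00, 0.58, 0.61],
    [0.60, 0.62, 0.58, 0.00, 0.07],
    [0.65, 0.63, 0.61, 0.07, 0.00],
])

# linkage expects the condensed (upper triangular) form of the matrix.
tree = linkage(squareform(distances), method="single")

# Cut the dendrogram at a chosen height to obtain one group label per sample.
groups = fcluster(tree, t=0.3, criterion="distance")
```

With these values the first three samples form one group and the last two another; each group would then be given its own training data and classifier.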
Figure 6A shows the performance of XGBoost classification of all leukocyte subsets in peritoneal drain fluid and more detailed subsets of the T cell compartment in peritoneal drain fluid and in PBMCs from whole blood. Performance is given as the weighted F1 score, a metric that captures the harmonic mean of precision and sensitivity, and is weighted by class support (the number of true instances for each label), which provides a value between 0 and 1, where 1 is the best possible score. This metric was captured by monitoring the performance of XGBoost on five randomly chosen validation samples from each classification group of each experimental condition and/or sample type. The validation samples were labelled by manual gating. Performance was best for PBMCs from whole blood where the weighted F1 score on average was above 0.95. Performance was worst for identifying leukocyte subsets in peritoneal effluent, which reflects the complex nature of the sample type and the diversity of cell subsets we intend to describe. The situation for T cell subsets classified from drain fluid was more complicated. For groups 2 and 3 performance was optimal (average weighted F1 score ≥ 0.95) yet for group 1 there was one significant outlier; one validation sample gave a weighted F1 score of 0.6, outside the interquartile range for this group. Of note, CytoPy provides functionality to easily visualise and explore the results of CellClassifier objects. For the particular outlier mentioned, Figures 6B and 6C show detailed results of the classification of T cell subsets. Figure 6B is a heatmap representation of a confusion matrix, which is generated when the user passes a value of True to the confusion matrix argument of the validation method of CellClassifier. The confusion matrix in Figure 6B shows ‘predicted labels’ versus the ‘true label’; the ground truth being the results of manual gates.
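The weighted F1 score described above can be written out explicitly. The following is a minimal NumPy sketch with hypothetical labels, intended to make the weighting by class support concrete rather than to reproduce the code used in this study.

```python
import numpy as np


def weighted_f1(y_true, y_pred):
    """Per-class F1 score averaged with weights equal to class support."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    total = 0.0
    for label in np.unique(y_true):
        tp = np.sum((y_true == label) & (y_pred == label))
        fp = np.sum((y_true != label) & (y_pred == label))
        fn = np.sum((y_true == label) & (y_pred != label))
        denominator = 2 * tp + fp + fn
        # F1 is the harmonic mean of precision and sensitivity.
        f1 = 2 * tp / denominator if denominator else 0.0
        support = tp + fn  # number of true instances of this label
        total += f1 * support
    return float(total / len(y_true))


# Hypothetical manual (true) and predicted labels for four events.
manual = ["CD4", "CD4", "CD4", "CD8"]
predicted = ["CD4", "CD4", "CD8", "CD8"]
score = weighted_f1(manual, predicted)
```

Because each class contributes in proportion to its support, a rare population that is classified poorly lowers the weighted score far less than an abundant one, which is worth bearing in mind when interpreting scores for samples containing rare subsets.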
The values shown in the confusion matrix were normalised across each row (true label) meaning the values on the diagonal were equivalent to the accuracy for each class. The confusion matrix revealed that although this sample scored poorly in terms of weighted F1 score, the classification accuracy was greater than 95% for all but two classes: γδ T cells and unclassified cells, i.e. those that would not fall into any ‘gate’. 52% of cells that had been classed as γδ T cells by the manual gate in this particular sample were instead left unclassified and a large majority of unclassified cells from manual gating were classified into other categories by the XGBoost algorithm. The inclusion of unclassified cells into one or more other subsets was least concerning as it likely reflected the subjective nature of manual gating; the close fit of a gate to its chosen population being one common subjective property of manual gates. The classification of γδ T cells was of greater concern, as this is a T cell subset that is relatively rare in many individuals and hence challenging to assess, yet of significant importance especially in Gram-negative infections (37).
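Row normalisation of a confusion matrix can be sketched as follows. The function name and labels are hypothetical; the point of the sketch is that, after dividing each row by its total, the diagonal gives the fraction of each manually gated class that the classifier recovered.

```python
import numpy as np


def row_normalised_confusion(y_true, y_pred, labels):
    """Confusion matrix (rows: true label, columns: predicted label) in which
    every non-empty row sums to one."""
    index = {label: i for i, label in enumerate(labels)}
    counts = np.zeros((len(labels), len(labels)))
    for true_label, predicted_label in zip(y_true, y_pred):
        counts[index[true_label], index[predicted_label]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no true instances are left as zeros rather than dividing by 0.
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)


# Hypothetical manual (true) and predicted labels for six events.
labels = ["CD4", "gd"]
manual = ["CD4", "CD4", "CD4", "CD4", "gd", "gd"]
predicted = ["CD4", "CD4", "CD4", "gd", "gd", "CD4"]
matrix = row_normalised_confusion(manual, predicted, labels)
```

In this toy example three quarters of the manually gated CD4 events and half of the "gd" events are recovered, which is read directly from the diagonal of matrix.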
The CellClassifier of CytoPy converts its classification results to population data that are associated back to the FileGroup. This makes comparison of supervised classification to the results of manual gating, semi-autonomous gating or clustering analysis straightforward. For example, the back_gating method of GatingStrategy allows the investigator to plot the results of multiple methods on familiar bi-axial plots for comparison. As an illustration, Figure 6C shows the interrogation of data likely to represent an outlier in the analysis. Overlaid is the result of the XGBoost classification for Vδ2+ γδ T lymphocytes (red points) and the manual gate for the same subset (yellow line). Vδ2+ γδ T cells were unusually sparse in this particular patient sample, which explains the poor classification performance in this instance. Of note, upon visual inspection the XGBoost algorithm was as well suited to identifying rare cell types as manual gating, and classification of γδ T cells was performed correctly by the XGBoost algorithm in all other samples (additional examples shown in Supplementary Figures S3 and S4).
3.5. Unbiased cell classification by high dimensional clustering
Although supervised classification provides us with one methodology for identifying cell subsets, it is biased by the gating strategy used in labelling training data. In recent years, numerous clustering algorithms have been proposed for high-dimensional clustering of single cell data. Two popular solutions are PhenoGraph (4) and FlowSOM (3; 38), both of which are available in CytoPy through the Clustering class. As with the CellClassifier class, Clustering is agnostic to the clustering algorithm of choice. Semi-automated gating, XGBoost classification, and PhenoGraph clustering are comparable in their identification of major cell subsets (Supplementary Figure S5), but using methods in unison (i.e. XGBoost classification and PhenoGraph clusters) provides many benefits and is encouraged in the CytoPy framework; high dimensional clustering offers the opportunity for exploratory data analysis, and obtained clusters can be contrasted with populations identified from supervised classification to improve the confidence of reported results.
Exploratory data analysis in CytoPy is facilitated by the Explorer class, which encapsulates the single cell data of one or multiple patients after clustering and supervised classification have been performed, and houses the data within a Pandas DataFrame. Operations can be performed on the DataFrame independently, allowing custom scripting, but the Explorer class carries many utility functions that are designed for exploratory data analysis. Examples include methods for associating metadata to clusters (e.g. the patient phenotype), dimensionality reduction techniques, and interactive plotting tools.
Clustering is performed on a per-sample basis, but to explore the immune landscape of the entire cohort, a consensus must be found such that similar clusters between patients can be grouped. This consensus gives rise to comparisons in cell abundance and phenotype between clinical phenotypes. To achieve this, CytoPy uses meta-clustering. In brief, each subject is independently min-max normalised, and the centroid of each cluster calculated. The centroids of clusters for each subject are then merged to form a dataframe that describes the clustering results of all subjects. Finally, a clustering algorithm of choice is applied to this dataframe (see Supplementary Methods). As an example of the successful utilisation of PhenoGraph, Figure 7A shows the results of meta-clustering for total leukocytes in the peritoneal drain fluid of individuals receiving PD. The Uniform Manifold Approximation and Projection (UMAP) (39) plot shows all clusters (solid filled circles) from all patients displayed in two-dimensional space. The colour of a cluster corresponds to the associated meta-cluster while the size of a cluster represents the proportion of cells within the cluster (relative to the total CD45+ single immune cells in each individual patient). The nature of the UMAP plot is such that clusters of similar phenotype are arranged closer to one another. However, CytoPy allows the use of any dimensionality reduction technique (e.g. PCA, Isomap, PHATE (40) etc.), depending on the preference of the investigator and the specific question to be addressed. Meta-clusters are manually labelled according to their phenotype, as displayed in the heatmap of Figure 7A. Clusters can be colour-coded using any desired metadata. For instance, given an instance of Explorer named explorer, one could associate the clinical phenotype of a patient to their clusters using the following single line of code:
For each patient in this example, the database is queried for the variable named ‘peritonitis’ (as in “does this patient have acute peritonitis?”) and the result is used to populate the Pandas DataFrame stored in the explorer object. The UMAP plot is then repeated, colour-coded according to the metadata, as shown in Figure 7B. The distribution of clusters of different clinical phenotypes in the UMAP plot reveals changes in the immunological response. Subsets of cell compartments (e.g. ‘Monocytes_0’, ‘Monocytes_1’ etc.) can be consolidated and the proportion of cells within these consolidated groups (as percentage of all CD45+ immune cells) is shown in the boxplots of Figure 7B.
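The metadata lookup described above amounts to a per-patient join between cluster records and patient records. The following is a minimal sketch of that idea in plain Python and does not use the CytoPy API: the dictionary patient_metadata stands in for the database, the list cluster_rows stands in for the DataFrame held by the Explorer object, and all names and values are hypothetical.

```python
# Hypothetical stand-in for the document database: one record per patient.
patient_metadata = {
    "patient_01": {"peritonitis": True},
    "patient_02": {"peritonitis": False},
}

# Hypothetical stand-in for the cluster table: one row per cluster.
cluster_rows = [
    {"patient_id": "patient_01", "cluster_id": 0, "meta_cluster": "Neutrophils"},
    {"patient_id": "patient_01", "cluster_id": 1, "meta_cluster": "Monocytes_0"},
    {"patient_id": "patient_02", "cluster_id": 0, "meta_cluster": "Monocytes_1"},
    {"patient_id": "patient_03", "cluster_id": 0, "meta_cluster": "T cells"},
]


def attach_metadata(rows, metadata, variable):
    """Return copies of rows with `variable` looked up per patient (None if absent)."""
    return [
        {**row, variable: metadata.get(row["patient_id"], {}).get(variable)}
        for row in rows
    ]


annotated = attach_metadata(cluster_rows, patient_metadata, "peritonitis")
```

Each cluster row now carries the clinical phenotype of its patient and can be colour-coded accordingly; patients without a record for the requested variable receive None rather than raising an error, which is a design choice of this sketch.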
Applying this cluster analysis to a cohort of PD patients, CytoPy found that acute bacterial peritonitis resulted in a dramatic shift in the composition of local immune cells, with a significant increase in the proportion of neutrophils and a parallel drop in the relative proportion of monocytes/macrophages, dendritic cells (DCs), B cells and T cells. These findings corresponded well with previous studies showing a significant influx of inflammatory cells into the peritoneal cavity on the first day of presenting with acute symptoms, compared to stable individuals in the absence of peritoneal inflammation (25; 26; 27; 41).
Figure 8 shows the same set of analytical techniques applied to the local T cell populations in individuals receiving PD. Figure 8A shows a UMAP plot of clusters, coloured according to their associated meta-cluster and revealing clean separation not only of CD4+ and CD8+ T cells as the major T cell populations but also of unconventional T cell populations such as Vα7.2+ CD161+ MAIT cells and Vδ2+ γδ T cells. Figure 8B shows the same clusters as in Figure 8A but now colour-coded by the metadata regarding the presence or absence of bacterial infection. The differences in T cell subsets between stable controls and those with acute peritonitis were subtle and, due to the small size of this cohort, not statistically significant. Of note, CytoPy allows the composition of the T cell compartment to be explored in even more detail, as illustrated for the CD8+ T cell subset (Figure 8C). Here, PhenoGraph was capable of discerning distinct memory and effector subsets based on the expression of the surface markers CD45RA, CD27 and CCR7 (Figure 8C), further validating CytoPy as a reliable method for exploring changes in immune response in large flow cytometry data.
3.6. Feature extraction and feature selection reveal variables that differentiate the immune response during acute peritonitis compared to stable controls
Following cell classification by both biased and unbiased methodology, the immunological landscape of the observed subjects can be summarised in CytoPy into a ‘feature matrix’. This includes the relative abundance of populations as identified by supervised classification and clusters produced by techniques such as PhenoGraph. There will be significant overlap here, and the user may therefore choose to generate a consensus between the results of supervised classification and clustering by taking an average of the two methods. Supervised classification is more robust towards underlying batch effects but biased by the gating strategy imposed upon the training data, whereas clustering is unbiased but not stable to batch effects. By combining both methods the investigator can overcome the limitations that they present individually.
The methods described are implemented in the feature extraction module of CytoPy. Once a feature matrix has been generated, dimensionality reduction techniques can be employed to reveal immediately whether subjects separate in accordance with the experimental or clinical endpoint of interest. Figure 9A shows a PCA plot where peritonitis patients and stable controls clearly separated across two components, as expected from earlier studies by us (25; 27) and from the analysis shown in Figure 7.
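As a rough illustration of the consensus step, the sketch below averages the relative abundance reported for each population by the two methods. It is not the CytoPy implementation: the function name and the toy proportions are hypothetical, and keeping a population unchanged when only one method reports it is an assumption made for this example.

```python
def consensus_proportions(classified, clustered):
    """Average per-population proportions from two methods.

    Populations reported by only one method keep that single value
    (an assumption of this sketch).
    """
    consensus = {}
    for name in sorted(set(classified) | set(clustered)):
        values = [d[name] for d in (classified, clustered) if name in d]
        consensus[name] = sum(values) / len(values)
    return consensus


# Toy proportions of CD45+ cells for one subject (hypothetical values).
xgboost_result = {"Neutrophils": 0.60, "Monocytes": 0.25}
phenograph_result = {"Neutrophils": 0.70, "Monocytes": 0.15, "DCs": 0.05}
features = consensus_proportions(xgboost_result, phenograph_result)
```

Repeating this for every subject yields one row of the feature matrix per subject, with one column per consolidated population.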
Filtering techniques can be employed within CytoPy to remove variables of low variance or identify high multicollinearity (Supplementary Figure 2). This is often necessary to remove redundant variables. The immunological pattern that differentiates a clinical state or experimental end-point can then be visualised in a radial plot as shown in Figure 9B. In this example, cell populations are marked on the axis and the internal value is the proportion of cells relative to their respective parent, after consolidating the results of both PhenoGraph clustering and XGBoost classification. Figure 9B confirms the observations made in the exploratory data analysis of clustering results (Figures 7 and 8): although subtle differences exist in the T cell compartments, it is the stark differences in the proportion of myeloid cells that differentiate those with peritonitis from stable controls. Where further feature selection is necessary, CytoPy offers embedded methods in the form of L1-regularised linear models, where variables can be selected according to whether their coefficient remains non-zero as the regularisation parameter decreases (Supplementary Figure S6).
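The two filtering steps mentioned above, removing low-variance variables and flagging collinear pairs, can be sketched with the Python standard library alone. This is an illustration under stated assumptions and not the CytoPy feature extraction module: the function names, the thresholds min_variance and max_abs_corr, and the toy feature matrix are all hypothetical.

```python
from itertools import combinations
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def filter_features(features, min_variance=1e-4, max_abs_corr=0.95):
    """Drop near-constant features, then list highly correlated pairs among the rest."""
    kept = {k: v for k, v in features.items() if pvariance(v) >= min_variance}
    collinear = [
        (a, b)
        for a, b in combinations(sorted(kept), 2)
        if abs(pearson(kept[a], kept[b])) > max_abs_corr
    ]
    return kept, collinear


# Toy feature matrix: one list of proportions per population, one entry per subject.
feature_matrix = {
    "Neutrophils": [0.1, 0.2, 0.7, 0.8],
    "Monocytes": [0.8, 0.7, 0.2, 0.1],
    "Eosinophils": [0.01, 0.01, 0.01, 0.01],
    "T cells": [0.05, 0.08, 0.04, 0.07],
}
kept, collinear = filter_features(feature_matrix)
```

Here the constant eosinophil column is dropped, and neutrophils and monocytes are flagged as a collinear pair because their proportions mirror one another; the investigator would then decide which of the pair to retain.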
4. Availability and Future Directions
CytoPy represents a framework for the analysis of cytometry data that facilitates automated analysis whilst introducing robust data management and an iterative analytical environment. The present study shows the ability of CytoPy to characterise the FlowCAP-I dataset with high precision and identified XGBoost as the optimal classification algorithm for gating with supervised methods. To demonstrate the capabilities of CytoPy on real-world data, we chose to analyse samples from patients with and without acute peritonitis, taking advantage of our extensive experience with this type of sample over more than a decade. Initially acquiring such samples on a four colour BD FACSCalibur flow cytometer with two lasers and simple FSC/SSC settings (42), we later utilised an eight colour BD FACSCanto with three lasers and FSC/SSC area/height channels (24; 34; 35), and now in the present study took advantage of a 16 colour BD LSR Fortessa with four lasers and FSC/SSC area, height, width, and time, thus illustrating the technological advance in the field but also the increasing complexity of the data acquired. The exquisite and elegant performance of CytoPy confirmed a striking increase in total neutrophils at the site of infection and a parallel decrease in the proportion of monocytes/macrophages, dendritic cells and T cells, in agreement with previous findings (26; 27), thereby validating the utility of CytoPy.
We know of no other framework to date that offers the flexibility provided by CytoPy whilst also providing a low-code API for easy application. Existing platforms include the likes of CytoBank (43) which - whilst boasting ease of use and a diverse offering of existing cytometry clustering algorithms - is a proprietary product that would not allow for collaboration and expansion of methodologies through the open-source bioinformatics community. In contrast, CytoPy will remain open-source to encourage bioinformaticians and developers to expand on existing technologies. Alternative open-source frameworks include OpenCyto (2) and Cytofkit (44). OpenCyto is focused on the application of autonomous gates, whereas Cytofkit exposes popular high-dimensional clustering algorithms in an easy interface. Despite their successful application in the literature, neither framework provides the diversity offered in CytoPy, where autonomous gates, supervised classification, and high dimensional clustering can not only be performed in one analytical pipeline, but their outputs visualised and contrasted with one another. This results in increased confidence in our findings as we can assess the conformity of multiple methodologies applied to the same data. Additionally, these existing frameworks do not provide the robust data management that CytoPy offers, with each analytical process being captured in a central database.
We have chosen to develop and maintain CytoPy in Python, a programming language with growing popularity in the bioscience domain. The application of popular Python deep learning frameworks such as TensorFlow (45) and Keras (30) offers potential for the autonomous analysis of cytometry data (8; 10; 38; 51). It is our intention to incorporate these methodologies in a future release. The agnostic object orientated design of CytoPy facilitates such additional implementations in a straight-forward manner. It is this agnostic design and the introduction of a document-based database as central repository for cytometry analysis that sets CytoPy apart from alternative solutions.
In addition to providing a new data-centric framework for applying existing methods of single cell classification and clustering, CytoPy offers novel tools to aid the analytical pipeline. In this study, we highlight the difficulties presented in complex cytometry data and demonstrate autonomous methods that improve the efficiency of pre-processing. We show how CytoPy can visualise and quantify the inter-sample variation, helping to account for batch-effects. Prior attempts to mitigate or remove batch-effects have either been tied to the application of gates in two-dimensional space (46; 47), involved manipulation of the input space in such a way that biological signals could be lost or distorted (48; 49), or required some technical intervention during data acquisition (50). Here, we introduce an alternative strategy: instead of removing batch-effect by transforming or aligning the data, we propose that a statistical measure be used to group data and that supervised classification be performed on each group individually. However, we appreciate the impact that a reliable method for mitigating or removing batch effect prior to analysis might have, and we are open to the integration of data normalisation or transformation methodologies that would achieve this, while ensuring that they fit the data-centric design of CytoPy.
As high-dimensional cytometry analysis continues to grow in popularity there will be increasing demand for an analytical framework that is friendly for those who are new to programming, provides a database that directly relates metadata to single cell data, and scales in a fashion that encourages collaboration and expansion. CytoPy meets all these criteria whilst remaining open-source and freely available on GitHub (https://github.com/burtonrj/CytoPy). In future releases we wish to open CytoPy up to a wider audience through the integration of a graphical user interface, and we hope to expand the capabilities of CytoPy by continuing to support new methodologies in the cytometry domain. Those wishing to collaborate with us or extend our software capabilities should consult the documentation (https://cytopy.readthedocs.io/) and make a pull request on our GitHub repository.
5. Supplementary Methods
5.1. FlowCAP
To assess the ability of CytoPy to classify cells we used the datasets provided in the Flow Cytometry: Critical Assessment of Population Identification Methods (FlowCAP) challenge (21), where the challenge is to accurately separate cells into subsets based on single cell phenotype. The FlowCAP-I data consist of four human studies (graft-versus-host disease, diffuse large B-cell lymphoma, symptomatic West Nile virus infection, and normal donors) and one mouse study (hematopoietic stem cell transplant). Data were labelled and pre-processed (removal of debris and dead material, with fluorescence compensation applied) at source by the laboratory responsible for acquiring the original data. Here, classifiers were trained on 25% of data and classification performance tested on the remaining 75%. Performance was reported as the average of weighted F1 scores across all five datasets, where the F1 score for data with |C| possible classes is given as:
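For orientation, a weighted F1 score is commonly computed as the per-class F1 (the harmonic mean of precision and recall) averaged over classes with weights proportional to the number of true members of each class. The sketch below assumes this support-weighted definition, which matches the average='weighted' option of scikit-learn's f1_score; whether exactly this weighting was used here is an assumption, and the function name and toy labels are illustrative.

```python
from collections import Counter


def weighted_f1(true_labels, predicted_labels):
    """Per-class F1 averaged with weights proportional to each class's true count."""
    support = Counter(true_labels)
    total = len(true_labels)
    score = 0.0
    for cls, n_true in support.items():
        pairs = zip(true_labels, predicted_labels)
        true_positive = sum(t == cls and p == cls for t, p in pairs)
        n_predicted = sum(p == cls for p in predicted_labels)
        precision = true_positive / n_predicted if n_predicted else 0.0
        recall = true_positive / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += (n_true / total) * f1
    return score


# Toy labels (hypothetical): three cells of class "a" and one of class "b".
truth = ["a", "a", "a", "b"]
predicted = ["a", "a", "b", "b"]
score = weighted_f1(truth, predicted)
```

Because each class contributes in proportion to its size, abundant populations dominate the score, which is why a sample can score poorly overall while most individual classes are recovered well, or vice versa.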
Run time was determined as the number of seconds elapsed for training and classification, as an average across every sample classified. Five supervised machine learning algorithms, housed within CytoPy, were compared without hyperparameter tuning:
Feed-forward neural network with three layers of size 12, 6 and 3 nodes, L2 penalty of 1×10^-4, ReLU activation function on the hidden layers, softmax activation function on the outermost layer, and categorical cross-entropy as the loss function; implemented in Keras v2.3.
XGBoost with default hyperparameters; implemented in XGBoost v0.9.
Linear Discriminant Analysis with singular value decomposition with no shrinkage and number of components equal to min(n classes - 1, n features); implemented in Scikit-Learn v0.22.
K-Nearest Neighbours with 5 neighbours and the ‘ball tree’ algorithm used to compute nearest neighbours for classification; implemented in Scikit-Learn v0.22.
Support Vector Machine with radial basis function kernel; implemented in Scikit-Learn v0.22.
In each instance, data were standardised by removing the mean and scaling to unit variance: z = (x - u) / s, where x is the original value, u is the mean and s the standard deviation.
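Standardisation of a single feature can be sketched as follows. The sketch assumes the population standard deviation (the convention used by Scikit-Learn's StandardScaler) and maps a constant feature to zeros; the function name and example values are illustrative only.

```python
from statistics import fmean, pstdev


def standardise(values):
    """Remove the mean and scale to unit variance (population standard deviation)."""
    u = fmean(values)
    s = pstdev(values)
    if s == 0:
        # A constant feature carries no information; centre it at zero.
        return [0.0 for _ in values]
    return [(x - u) / s for x in values]


# Toy fluorescence intensities for one channel (hypothetical values).
z = standardise([1.0, 2.0, 3.0, 4.0, 5.0])
```

After this transformation every feature has zero mean and unit variance, so that distance-based classifiers such as K-Nearest Neighbours and the Support Vector Machine are not dominated by channels with a larger numeric range.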
5.2. Patients
The study cohort comprised 37 adult individuals receiving peritoneal dialysis (PD) who were admitted between October 2016 and October 2018 to the University Hospital of Wales, Cardiff, on day 1 of acute peritonitis, before commencing antibiotic treatment (34.6% female; median age 68 years, range 22-91 years). Twenty age- and gender-matched individuals receiving PD and with no previous infections for at least 3 months served as stable, non-infected controls (35.0% female; median age 69.5 years, range 28-93 years). Subjects known to be positive for HIV or hepatitis C virus were excluded. Clinical diagnosis of acute peritonitis was based on the presence of abdominal pain and cloudy peritoneal effluent with >100 white blood cells/mm3. According to the microbiological analysis of the effluent by the routine Microbiology Laboratory, Public Health Wales, episodes of peritonitis were defined as infections caused by Gram-positive or Gram-negative organisms. Cases of fungal infection and negative or unclear culture results were excluded from this analysis. Basic patient demographics can be found in the Supplementary Methods and a summary of the bacterial culture results for patients with peritonitis is shown in Supplementary Table S1. All methods were carried out in accordance with relevant guidelines and regulations, and written informed consent was obtained from all subjects. Recruitment of PD patients was approved by the South East Wales Local Ethics Committee under reference number 04WSE04/27, and conducted according to the principles expressed in the Declaration of Helsinki. The study was registered on the UK Clinical Research Network Study Portfolio under reference number #11838 “Patient immune responses to infection in Peritoneal Dialysis” (PERIT-PD).
5.3. Flow cytometry
Peritoneal leukocytes were harvested from overnight dwell effluents and processed as described previously (27; 41); samples were treated with DNase (Sigma; 1:2,500 dilution) when excessive debris was visually apparent. Leukocyte populations in total effluent were stained using monoclonal antibodies against CD1c, CD3, CD14, CD15, CD16, CD19, CD45, CD116, HLA-DR and Siglec-8 (Supplementary Table S2) and identified as CD45+ immune cells, CD3+ T cells, CD19+ B cells, CD15-CD14+ monocytes/macrophages, CD15+ neutrophils, CD15-CD14+/-CD1c+ dendritic cells, and CD15-Siglec-8+ eosinophils. T cell subsets in peripheral blood mononuclear cells (PBMCs) and in peritoneal effluent were stained after Ficoll-Paque (Fisher Scientific) separation of blood and peritoneal leukocytes, respectively, using monoclonal antibodies against CD3, CD4, CD8, CD161, TCR-Vα7.2, TCR-Vδ2, TCR-pan-γδ, CD45RA, CCR7 and CD27 (Supplementary Table S3). Cell acquisition by flow cytometry was performed using a 16 colour BD LSR Fortessa cell analyser (BD Biosciences). Live single cells were gated based on side and forward scatter area/height and live/dead staining (fixable Aqua; Invitrogen).
5.4. Meta-clustering
Meta-clustering was performed to find a consensus amongst the individual clustering results of many individual samples. Each sample was independently normalised; that is, each feature was scaled: xnorm = (x - min(x)) / (max(x) - min(x)),
where x is the original value for a given feature, min(x) and max(x) are the minimum and maximum of that feature within the sample, and xnorm is its value scaled between zero and one. Once each sample was individually normalised, the clusters from each sample were extracted and their centroid calculated; by default this was given as the median of their feature vector but other definitions of centre can be used (e.g. mean, geometric mean etc.). Cluster centroids were annotated as to which sample they originated from and their original cluster ID and then concatenated into a single dataframe. This dataframe was then used as the input to a clustering algorithm of the user’s choosing.
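The preparation of the meta-clustering input described above can be sketched as follows. This is a minimal illustration in plain Python and not the CytoPy implementation: the function names, the toy data and the use of a list of dictionaries in place of a dataframe are assumptions, and the final clustering step is left to an algorithm of the user's choosing.

```python
from statistics import median


def min_max_normalise(cells):
    """Scale each feature (column) of one sample to the range [0, 1]."""
    columns = list(zip(*cells))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(cell, lows, highs)]
        for cell in cells
    ]


def cluster_centroids(sample_id, cells, cluster_ids):
    """Normalise one sample, then return one median centroid per cluster."""
    normalised = min_max_normalise(cells)
    rows = []
    for cid in sorted(set(cluster_ids)):
        members = [c for c, k in zip(normalised, cluster_ids) if k == cid]
        centroid = [median(col) for col in zip(*members)]
        rows.append({"sample_id": sample_id, "cluster_id": cid, "centroid": centroid})
    return rows


# Toy data (hypothetical): per sample, a list of cells and their cluster IDs.
samples = {
    "patient_01": ([[0.0, 10.0], [2.0, 20.0], [8.0, 30.0], [10.0, 50.0]], [0, 0, 1, 1]),
    "patient_02": ([[100.0, 1.0], [300.0, 3.0], [500.0, 5.0]], [0, 1, 1]),
}

# Concatenate the annotated centroids of all samples into one table.
meta_input = [
    row
    for sample_id, (cells, cluster_ids) in samples.items()
    for row in cluster_centroids(sample_id, cells, cluster_ids)
]
```

Normalising each sample independently places the centroids of different samples on a common zero-to-one scale, even when raw intensities differ by orders of magnitude between acquisitions, as in the two toy samples here.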
9. Supplementary data
ACKNOWLEDGMENTS
We are grateful to all peritoneal dialysis patients for participating in this study, and to the clinicians and nurses for their cooperation. We also thank Sarah Baker, Chantal Colmont, Donald Fraser, Alexander Greenshields-Watson, Ann Kift-Morgan, Kristin Ladell, Oliwia Michalak and John Pulford for their help and advice. This research received support from the Wales Kidney Research Unit (WKRU), UK Clinical Research Network (UKCRN) Study Portfolio, Medical Research Council (MRC) grant MR/N023145/1, the Welsh European Funding Office’s Accelerate programme, and a School of Medicine PhD Studentship (to R.J.B.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Footnotes
Clarification of text, including revisions to introduction, results section 3.1, 3.2, and 3.3, and discussion. Inclusion of additional figure 3 and supplementary figures.