Abstract
Patient stratification or disease subtyping is crucial for precision medicine and personalized treatment of complex diseases. The increasing availability of high-throughput molecular data provides a great opportunity for patient stratification. In particular, many clustering methods have been employed to tackle this problem in a purely data-driven manner. Yet, existing methods leveraging high-throughput molecular data often suffer from various limitations, e.g., noise, data heterogeneity, high dimensionality, or poor interpretability. Here we introduce an Entropy-based Consensus Clustering (ECC) method that overcomes these limitations altogether. Our ECC method employs an entropy-based utility function to fuse many basic partitions into a consensus one that agrees with the basic ones as much as possible. Maximizing the utility function in ECC has a more meaningful interpretation than any other consensus clustering method. Moreover, we exactly map the complex utility maximization problem to the classic K-means clustering problem with a modified distance function, which can then be solved efficiently with linear time and space complexity. Our ECC method can also naturally integrate multiple molecular data types measured from the same set of subjects, and easily handle missing values without any imputation. We applied ECC to both synthetic and real data, including 35 cancer gene expression benchmark datasets and 13 cancer types with four molecular data types from The Cancer Genome Atlas. We found that ECC shows superior performance against existing clustering methods. Our results clearly demonstrate the power of ECC in clinically relevant patient stratification.
Introduction
High-throughput technologies, such as next-generation sequencing, have enabled us to rapidly accumulate a wealth of various molecular data types, including genome, transcriptome, proteome, and epigenome (1-3). Those massive genomics studies offer us great opportunities to characterize human pathologies and disease subtypes, identify driver genes and pathways, and nominate drug targets for precision medicine (4, 5). In particular, development of novel computational approaches for patient stratification leveraging high-throughput molecular data would significantly facilitate precision medicine and personalized treatment, which target discrete molecular subclasses of complex diseases with specific genetic or epigenetic profiles (4).
Clustering, an unsupervised exploratory analysis, has been widely used for patient stratification or disease subtyping (6). However, traditional clustering algorithms, such as K-means, hierarchical clustering, and spectral clustering, suffer from the noise, data heterogeneity and high dimensionality associated with high-throughput molecular data (7, 8). Ensemble clustering (a.k.a. consensus clustering) merges a set of individually generated basic partitions into a final consensus partition that maximally agrees with the basic ones (9). This significantly helps us generate more robust clustering results, find bizarre clusters, better handle noise, outliers and sample variations, and integrate solutions from multiple distributed data sources (9). However, existing consensus clustering algorithms based on the co-association matrix (10) are computationally expensive and require a large storage space, preventing them from handling high-throughput molecular data. Moreover, their interpretation of the consensus partition is often obscure.
Results
Methodology overview of ECC
Here we introduce a novel consensus clustering method, i.e., Entropy-based Consensus Clustering (ECC), for patient stratification. Consider an n×m matrix of molecular data of n subjects (or experiments, conditions, samples; corresponding to n rows) and m features (such as mRNAs; corresponding to m columns). Each subject can be represented by a point in the m-dimensional feature space, with different shapes representing the different clusters the subjects belong to (Fig. 1A). There are three steps in the ECC pipeline. Step-1: We generate r basic partitions using K-means clustering with the parameter K (i.e., the number of clusters) randomly chosen between 2 and an upper bound (see Fig. 1B) (11). Hereafter we call this basic partition generation strategy Random Parameter Selection (RPS). Note that in this step we can use any basic clustering method; here we choose K-means for its simplicity and high efficiency. In this work we choose r = 100 and find that a larger r does not significantly improve the result. Step-2: We derive a binary matrix from each basic partition via 1-of-K coding, where K is the cluster number in this basic partition, and exactly one element in each row is 1 while the others are 0. We concatenate all those binary matrices into one large binary matrix (Fig. 1C). Step-3: We employ an entropy-based utility function to guide the fusion of all r basic partitions into a consensus one (Fig. 1D). This is achieved by conducting K-means clustering on the binary matrix with a modified distance function and a user-defined cluster number K.
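To make Steps 1 and 2 concrete, the following is a minimal NumPy sketch (not the authors' MATLAB implementation): a plain Lloyd's K-means stands in for the basic clustering method, and the helper names and the upper bound `k_max` on K are our own illustrative choices.

```python
import numpy as np

def kmeans_labels(X, k, n_iter=50, rng=None):
    """Plain Lloyd's K-means; returns a label vector of length n."""
    if rng is None:
        rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def basic_partitions_rps(X, r=100, k_max=10, seed=0):
    """Step 1 (RPS): r basic partitions, each with K drawn uniformly from [2, k_max]."""
    rng = np.random.default_rng(seed)
    return [kmeans_labels(X, int(rng.integers(2, k_max + 1)), rng=rng) for _ in range(r)]

def binary_matrix(partitions):
    """Step 2: concatenate the 1-of-K codings of all basic partitions into one binary matrix B."""
    blocks = []
    for labels in partitions:
        k = labels.max() + 1
        blocks.append(np.eye(k, dtype=int)[labels])  # one row per subject, exactly one 1 per row
    return np.hstack(blocks)

# toy data: two well-separated groups of subjects
X = np.vstack([np.random.default_rng(1).normal(0, 0.1, (10, 5)),
               np.random.default_rng(2).normal(5, 0.1, (10, 5))])
parts = basic_partitions_rps(X, r=10, k_max=4)
B = binary_matrix(parts)
```

Each row of B then contains exactly r ones, one per basic partition, which is what Step 3 operates on.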
Our ECC method has three key features. First, it solves the consensus clustering problem via explicit utility maximization, which yields a more meaningful interpretation than other consensus clustering methods. Here the utility function quantifies the similarity between each of the r basic partitions and the consensus one. Maximizing the utility function requires us to find a single consensus partition that agrees with the basic ones as much as possible. Second, we uncover a remarkable equivalence between an entropy-based utility function and a K-means distance function, so that the complex utility maximization problem can be efficiently solved by the classic K-means method with a modified distance function (see Supplementary Materials Sec. I.A). Consequently, both the time and space complexity of ECC are linear in n (see Supplementary Materials Sec. I.B). This dramatically improves the efficiency of ECC in real-world applications (11). Finally, ECC can naturally integrate multiple molecular data types measured from the same set of subjects, and easily handle missing values without any imputation (see Materials and Methods). This significantly increases the power of ECC in clinically relevant patient stratification.
To demonstrate that ECC indeed outperforms existing clustering methods, we compared the performance of ECC with five traditional clustering methods: Agglomerative Hierarchical Clustering with Average Linkage (AL), Single-Linkage (SL) and Complete-Linkage (CL), K-means Clustering (KM), and Spectral Clustering (SC); and two state-of-the-art consensus clustering methods: the Link-based Cluster Ensemble (LCE) and Approximate SimRank-based (ASRS) methods (12).
Evaluation using synthetic data
We first applied all those clustering methods to synthetic gene expression datasets with built-in cluster structure generated through a well-established dynamical gene regulation model (see Supplementary Materials Sec. II) (13). For fair comparison, we used two external indices of clustering validity: Rn (Normalized Rand Index) and NMI (Normalized Mutual Information) to objectively evaluate the performance of different clustering methods (14). Both Rn and NMI are positive cluster validity indices that estimate the quality of clustering results with respect to the underlying cluster structure of the data (see Materials and Methods). We found that ECC generally outperforms other methods in terms of its robustness against noise (see Supplementary Materials Sec. IV.A).
Evaluation using benchmark cancer gene expression data
We then evaluated ECC and other clustering methods on 35 widely used benchmark cancer gene expression datasets (15) (Supplementary Materials Sec. III.A). A detailed description of the 35 datasets is provided in Supplementary Table S2. Fig. 2A shows the clustering performance of different algorithms measured by NMI. We found that for most datasets, the three consensus clustering methods (LCE, ASRS, and ECC) are superior to the five traditional clustering methods. Moreover, our ECC method outperforms the alternatives by a large margin on several datasets, such as dataset-5 (Armstrong-2002-v2), dataset-9 (Yeoh-2001-v1), dataset-10 (Chowdary-2006) and dataset-13 (Golub-1999-v1). Although LCE and ASRS yield reasonable performance on several datasets, they suffer from low robustness. For example, ASRS achieves 100% accuracy on dataset-23 (Nutt-2003-v3), but it yields results even worse than random assignment on dataset-9 (Chen-2002). We emphasize that, for unsupervised tasks, robustness is much more important than raw performance in practice when dealing with highly heterogeneous molecular data types (such as mRNA expression) (16). Different from LCE and ASRS, ECC fuses the basic partitions by explicit utility maximization, which ensures a highly meaningful interpretation and high stability for the final consensus partition (Supplementary Materials Sec. I). To compare the overall performance of these clustering methods over the 35 benchmark datasets, we proposed an average performance score (see Materials and Methods) and found that ECC shows significant advantages over all other methods in terms of this score. We notice that there are four specific datasets (Gordon-2002, Khan-2001, Ramaswamy-2001 and Shipp-2002) for which all clustering methods yield very poor performance, most likely due to the presence of irrelevant or noisy features. We note that this difficulty cannot be easily resolved by any existing clustering method.
Yet, it can be alleviated by a basic partition generation strategy complementary to RPS, namely the Random Feature Selection (RFS) strategy, within the framework of ECC. To achieve that, we generate different sub-datasets by randomly selecting a certain percentage of features (e.g., mRNAs) and then apply a traditional clustering method (e.g., K-means) to those sub-datasets to obtain basic partitions. Indeed, we find that for these four datasets, the performance of RFS exceeds that of RPS at all sampling ratios. This indicates that RFS helps us avoid noisy and irrelevant mRNA expressions (see Supplementary Materials Sec. IV for details).
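The RFS strategy can be sketched as follows; the minimal K-means helper, the function names, and all parameter defaults are illustrative choices, not part of the published pipeline.

```python
import numpy as np

def _kmeans(X, k, n_iter=30, rng=None):
    # minimal Lloyd's K-means used as the base clusterer (any method would do)
    if rng is None:
        rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(2).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels

def basic_partitions_rfs(X, r=20, k=3, ratio=0.5, seed=0):
    """RFS: each basic partition is computed on a random subset of the features,
    so irrelevant or noisy features contaminate only some of the partitions."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    parts = []
    for _ in range(r):
        cols = rng.choice(m, size=max(2, int(ratio * m)), replace=False)
        parts.append(_kmeans(X[:, cols], k, rng=rng))
    return parts

X = np.random.default_rng(0).normal(size=(30, 40))
parts = basic_partitions_rfs(X, r=5, k=3)
```

The resulting basic partitions are fused exactly as in the RPS case; only the way each basic partition is generated changes.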
In addition, ECC has substantial merits in terms of computational cost. Fig. 2B shows the execution time (on a logarithmic scale) of the three consensus clustering methods (LCE, ASRS and ECC). The time complexity of ECC is O(InKr), where I is the number of iterations, n is the number of subjects, K is the number of clusters and r is the number of basic partitions. The space complexity of ECC is O(nr). For LCE and ASRS, the space complexities are both O(n²), and the time complexities are O(n² log n) and O(n³), respectively. Naturally, ECC is more suitable for high-throughput molecular data analysis (Supplementary Materials Sec. IV). For example, on dataset-34 (Yeoh-2002-v1), ECC is 115 times and 1,600 times faster than LCE and ASRS, respectively.
Translational applications of ECC
The availability of massive and various molecular data types generated from large-scale and well-characterized cohorts across multiple cancer types provides an unprecedented opportunity for patient stratification. Here we demonstrate the translational applications of ECC based on 13 major cancer types from The Cancer Genome Atlas (TCGA) project with sufficient sample sizes and clinical profiles for four molecular data types: mRNA expression (RNA-seq V2), microRNA (miRNA) expression, protein expression, and somatic copy number alterations (SCNAs), as shown in Supplementary Table S3. For fair comparison, we collected the empirical number of clusters (subtypes) for the 13 TCGA cancer types from previous studies. Then we applied survival analysis to evaluate the performance of different clustering methods in terms of −log10(P), with P the log-rank test P-value (Supplementary Materials Sec. V and Supplementary Tables S7-10).
For each molecular data type (Protein, miRNA, mRNA, and SCNA), we calculated the clustering performance of ECC against other clustering methods across the 13 TCGA cancer types. We found that ECC outperformed other methods (in terms of the number of significant survival analysis results across the 13 TCGA cancer types, as highlighted in dotted red rectangles in Fig. 3A-D) for any single molecular data type.
By integrating the four different molecular data types, ECC generated significant clusters (cancer subtypes) for all 13 TCGA cancer types (P < 0.05, log-rank test, Table 1). Note that traditional clustering methods and existing consensus clustering methods cannot easily integrate multiple molecular data types, due to the presence of missing values for certain molecular data types in some subjects. Yet, ECC naturally resolves this issue by utility fusion, where missing values in the basic partitions provide no utility for the final fusion (Supplementary Fig. S9). Moreover, by integrating multiple molecular data types, ECC is effectively more robust to noise present in the data (partially because it has more data types from which to generate basic partitions). For example, in the case of uterine corpus endometrial carcinoma (UCEC), ECC cannot yield significant clusters using any of the four molecular data types alone (Fig. 4A; P > 0.05, Supplementary Tables S7-10). Yet, by integrating multiple molecular data types (pan-omics), ECC yielded 4 significant clusters (Fig. 4B) with distinct patient survival curves (P = 0.0043, Fig. 4C). In addition, the subtypes identified by ECC via integrating the four molecular data types were closely associated with the histologically defined clinical subtypes of UCEC (Fig. 4D). For instance, subtype 2, which contains the most aggressive uterine tumors, shows poorer survival than subtype 1, which contains the less aggressive uterine tumors. Similar trends are also observed in ovarian serous cystadenocarcinoma (OV, P = 7.79×10−4) and prostate adenocarcinoma (PRAD, P = 5.27×10−4) (see Supplementary Table S11). Since TCGA clinical information may not be complete or rigorously annotated, future efforts assessing the clinical utility of different subtypes on additional patient cohorts with more carefully annotated clinical variables are needed.
Discussion
In sum, we show that ECC offers significant advantages over other clustering methods for patient stratification in terms of cluster validity, execution time and space complexity, and robustness. We demonstrate that ECC with the RFS strategy can alleviate the detrimental effect of irrelevant and noisy features. Moreover, ECC displays superior performance on pan-omics data, which integrates multiple molecular data types, compared with any single molecular data type. We anticipate that integrating more types of both molecular and clinical data, such as somatic mutations, DNA methylation, functional genomic data generated from CRISPR/Cas9 (17), proteogenomics (18), radiomics (19), and electronic medical records (20), will further improve patient stratification. Altogether, our ECC method paves the way to a much more refined representation and understanding of various molecular data types, facilitating the development of precision medicine.
Methods
Consensus clustering
Here we introduce the basic ideas of consensus clustering in the context of omics data (e.g., gene expression) analysis. Consensus clustering was originally developed for fusing several existing partitions into a robust one (1), and has recently been applied to gene expression data analysis (2, 3). For example, the link-based cluster ensemble (LCE) method first summarizes several basic partitions into a co-association matrix (which measures how often two instances occur in the same cluster); then modifies the zero entries in the co-association matrix with the distance derived from the original data; and finally conducts spectral clustering to obtain the consensus partition (2). As a variant of the LCE method, the Approximate SimRank-based (ASRS) method employs a very similar idea with a slightly different modification of the zero entries in the co-association matrix (3).
Different from the existing consensus clustering methods (LCE and ASRS), our ECC method employs an entropy-based utility function to guide the fusion of all the basic partitions into a consensus one, which has a more meaningful interpretation than existing consensus clustering methods. Let X denote a gene expression dataset with n subjects and m genes. A partition of X into K crisp clusters can be represented either as a collection of K subsets of instances C = {C_k | k = 1,…,K}, with ∪_{k=1}^{K} C_k = X and C_k ∩ C_{k′} = ∅ for k ≠ k′, or as a label vector π = (L_π(x_1),…,L_π(x_n))^T, where L_π maps x_l to some label in {1,2,…,K}, 1 ≤ l ≤ n. Suppose we have r basic partitions {π^{(1)}, π^{(2)},…, π^{(r)}} generated by some traditional clustering method (e.g., K-means), with K_v clusters in π^{(v)}, 1 ≤ v ≤ r. The goal of consensus clustering is to find a consensus partition π by solving the following optimization problem:

max_π Σ_{v=1}^{r} U(π, π^{(v)}),   (1)

where U is a utility function measuring the similarity between each basic partition and the consensus one at the partition level. In other words, we expect to find an optimal partition that agrees with the basic ones as much as possible. Different utility functions measure the similarity of two partitions in different aspects, rendering different objective functions for consensus clustering. In this work, an entropy-based utility function is employed for its fast convergence and high quality (4).
Entropy-based Utility Function
The core of ECC is to fuse the basic partitions into a consensus one based on an entropy-based utility function, which makes the consensus clustering algorithm highly efficient and robust. As formulated in Eq. (1), a utility function is defined on two partitions π and π^{(v)} to measure their similarity at the partition level. We can employ the following contingency table to calculate the entropy-based utility function.
Here we have two partitions π and π^{(v)}, which contain K and K_v clusters, respectively. We assume π is the ground truth, and π^{(v)} is the clustering result generated by a specific clustering algorithm. Let n_{ij} denote the number of objects shared by cluster C_i in π and cluster C_j in π^{(v)}. Define n_{i+} = Σ_{j=1}^{K_v} n_{ij} and n_{+j} = Σ_{i=1}^{K} n_{ij}, and let p_{ij} = n_{ij}/n, p_{i+} = n_{i+}/n and p_{+j} = n_{+j}/n, for 1 ≤ i ≤ K, 1 ≤ j ≤ K_v.
Based on the contingency table, for π and π^{(v)} we define the discrete distributions P_i^{(v)} = (p_{i1}/p_{i+}, …, p_{iK_v}/p_{i+}), 1 ≤ i ≤ K, and P^{(v)} = (p_{+1}, …, p_{+K_v}). Then we have the following definition.
Definition (U_H). An entropy-based utility function U_H is defined as

U_H(π, π^{(v)}) = Σ_{i=1}^{K} p_{i+} H(P_i^{(v)}) − H(P^{(v)}),

where H denotes the Shannon entropy.
Since the Shannon entropy is a concave function, according to Jensen's inequality we can prove that Σ_{i=1}^{K} p_{i+} H(P_i^{(v)}) ≤ H(P^{(v)}), rendering U_H ≤ 0. A larger U_H indicates higher utility, i.e., greater similarity between the two partitions. Note that U_H is asymmetric, with U_H(π, π^{(v)}) ≠ U_H(π^{(v)}, π) if π ≠ π^{(v)}.
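For concreteness, U_H can be computed directly from the contingency table of two label vectors. This is an illustrative NumPy sketch (using natural logarithms and the form U_H = Σ_i p_{i+} H(P_i^{(v)}) − H(P^{(v)}), as assumed here); the function names are ours.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def u_h(pi, pi_v):
    """Entropy-based utility U_H(pi, pi_v) computed from the contingency table."""
    pi, pi_v = np.asarray(pi), np.asarray(pi_v)
    n = len(pi)
    rows, cols = np.unique(pi), np.unique(pi_v)
    # contingency counts n_ij
    cont = np.array([[np.sum((pi == i) & (pi_v == j)) for j in cols] for i in rows])
    p_row = cont.sum(axis=1) / n  # p_{i+}
    p_col = cont.sum(axis=0) / n  # marginal distribution P^(v)
    cond = sum(p_row[i] * entropy(cont[i] / cont[i].sum())
               for i in range(len(rows)) if cont[i].sum() > 0)
    return cond - entropy(p_col)

a = [0, 0, 0, 1, 1, 1, 2, 2]
b = [0, 0, 1, 1, 1, 2, 2, 2]
print(u_h(a, b) <= 1e-12)  # U_H is never positive
```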
Entropy-based Consensus Clustering
Although designing a utility function is crucial, optimizing it efficiently is another challenge. Thanks to the general K-means-based consensus clustering framework (4), which has substantial advantages in terms of efficiency, we can transform the optimization problem in Eq. (1) into a modified K-means clustering problem as follows.
Let B = (b_1,…,b_n)^T be a binary matrix derived from the r basic partitions {π^{(1)}, π^{(2)},…, π^{(r)}}, with b_l = (b_l^{(1)},…, b_l^{(r)}), where b_l^{(v)} is the 1-of-K_v coding of the label L_{π^{(v)}}(x_l), i.e., b_{l,j}^{(v)} = 1 if L_{π^{(v)}}(x_l) = j and b_{l,j}^{(v)} = 0 otherwise. Apparently, B is a binary matrix with Σ_{j=1}^{K_v} b_{l,j}^{(v)} = 1 for every l and v. For Entropy-based Consensus Clustering, K-means clustering is conducted directly on B with the following modified distance function:

f(b_l, m_k) = Σ_{v=1}^{r} D_KL(b_l^{(v)} ‖ m_k^{(v)}),

where m_k = (1/|C_k|) Σ_{x_l ∈ C_k} b_l is the centroid of cluster C_k, and D_KL(b_l^{(v)} ‖ m_k^{(v)}) = Σ_{j=1}^{K_v} b_{l,j}^{(v)} log(b_{l,j}^{(v)} / m_{k,j}^{(v)}) is the KL-divergence from b_l^{(v)} to m_k^{(v)}.
By this means, the complex consensus clustering problem can be exactly mapped to a classic K-means clustering with a modified distance function, which has roughly linear time complexity and guaranteed convergence. The exactness of the mapping can be rigorously proved (see Supplementary Materials Sec. I.A for details). This mapping makes ECC very practical for large-scale molecular data analysis. Indeed, only r entries are non-zero in each row of B, which reduces the time complexity to O(IKnr), where I is the number of iterations.
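The consensus step then amounts to K-means on B with a KL-based distance. The following is a self-contained sketch in plain NumPy (not the authors' MATLAB implementation); it assumes the distance is the summed KL-divergence from each one-hot block to the corresponding centroid block, and adds an `eps` smoothing term for numerical safety that the formal derivation does not need.

```python
import numpy as np

def ecc(B, k, n_iter=30, eps=1e-10, seed=0):
    """K-means on the concatenated binary matrix B with a KL-divergence-style
    distance. Since each row of B is a concatenation of one-hot blocks,
    D_KL(b || m) reduces to -sum(b * log(m)) up to a constant."""
    rng = np.random.default_rng(seed)
    n = len(B)
    labels = rng.integers(0, k, size=n)
    for _ in range(n_iter):
        # centroid update: mean of the binary rows in each cluster
        M = np.vstack([B[labels == j].mean(0) if np.any(labels == j)
                       else np.full(B.shape[1], eps) for j in range(k)])
        # assignment: minimize the KL-based distance -b . log(m)
        d = -(B @ np.log(M + eps).T)
        new = d.argmin(1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels

# two basic partitions that agree on splitting six subjects into two groups
B = np.array([[1, 0, 1, 0]] * 3 + [[0, 1, 0, 1]] * 3)
labels = ecc(B, 2)
```

Subjects with identical rows of B always receive the same consensus label, since the distance depends only on the row.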
Handling missing values
Missing values are quite common in practice due to issues in data collection or device failure, especially for the pan-omics data of a large population (in computer science, this kind of data is called multi-view data). Typically there are two ways to handle missing values. One is to simply remove the instances (i.e., subjects) that have missing values in any single molecular data type (or any single view). Apparently, this is wasteful, because those instances (subjects) might have values for many other views (molecular data types). The other is to replace the missing values by default or average values, which may harm the original data structure and degrade the clustering performance. We can naturally resolve this challenging issue within the framework of ECC. In particular, we consider that missing values, which lead to missing labels in the basic partitions, provide no utility for the consensus fusion. If a basic partition has missing labels, we call it an incomplete basic partition (IBP). For an IBP, we directly set the corresponding b_l^{(v)} to an all-zero vector, which is then not involved in the distance calculation or the centroid update: the sum over v in the modified distance function is restricted to those basic partitions in which x_l has a label, and each centroid block m_k^{(v)} is computed only over the instances in C_k whose labels in π^{(v)} are observed.
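A sketch of this idea follows, assuming missing views are encoded as all-zero one-hot blocks that contribute nothing to the KL-based distance; the block bookkeeping and all names are illustrative.

```python
import numpy as np

def ecc_missing(B_miss, block_sizes, k, n_iter=30, eps=1e-10, seed=0):
    """Consensus with incomplete basic partitions (IBPs): a subject with no
    label in partition v has an all-zero one-hot block, which contributes
    zero to the distance -b . log(m). Centroid blocks are normalized over
    the observed subjects only."""
    rng = np.random.default_rng(seed)
    n = len(B_miss)
    edges = np.cumsum([0] + list(block_sizes))  # column ranges of each block
    labels = rng.integers(0, k, size=n)
    for _ in range(n_iter):
        M = np.full((k, B_miss.shape[1]), eps)
        for j in range(k):
            rows = B_miss[labels == j]
            if len(rows) == 0:
                continue
            for v in range(len(block_sizes)):
                blk = rows[:, edges[v]:edges[v + 1]]
                obs = blk.sum(1) > 0  # subjects observed in view v
                if obs.any():
                    M[j, edges[v]:edges[v + 1]] = blk[obs].mean(0) + eps
        new = (-(B_miss @ np.log(M).T)).argmin(1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels

# two 2-cluster basic partitions; subject 3 has no label in the second one
B_miss = np.array([[1, 0, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 1, 0, 0]])
labels_miss = ecc_missing(B_miss, [2, 2], 2)
```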
Datasets
In this work, we use 110 synthetic datasets to systematically evaluate the performance of ECC. The 110 synthetic datasets are generated by a well-established dynamical gene regulation model (5):

dx_i/dt = m_i f_i(y) − λ_i^mRNA x_i,
dy_i/dt = r_i x_i − λ_i^Prot y_i,

where m_i is the maximum transcription rate, r_i is the translation rate, λ_i^mRNA and λ_i^Prot are the mRNA and protein degradation rates, and x and y are vectors of mRNA and protein concentration levels, respectively. f_i(·) computes the relative activation of gene i. The topology of the gene regulatory network is encoded in the activation functions.
Among the 110 synthetic datasets, 55 of them are based on an Erdős–Rényi random network with 500 nodes (genes), and the other 55 are based on a human transcriptional regulation network of 2,723 genes (6). Each dataset contains 200 subjects divided evenly into 4 benchmark clusters (50 subjects per cluster), where each group has a specific set of knocked-out genes. A more detailed description of the synthetic datasets can be found in Supplementary Materials Sec. II.
Besides the 110 synthetic datasets, 35 widely used cancer gene expression benchmark datasets (7) are employed to test the cluster validity of ECC. Also, 13 cancer types with four molecular data types from TCGA with survival information are used for practical evaluation of ECC (Supplementary Table S3).
Evaluation metrics
Since the true labels for the synthetic and benchmark datasets are available, we can apply external measurements to objectively evaluate the performance of different clustering algorithms. Although there are many external measurements, some of them are biased. According to Wu et al. (12), two normalized external measurements, NMI and Rn, are unbiased and hence can be chosen for proper evaluation of clustering performance. Both can easily be calculated from the contingency table.
Normalized Mutual Information (NMI) measures the mutual information between the resulting cluster labels and the ground-truth labels, normalized so that NMI ranges from 0 to 1. Mathematically, it is defined as:

NMI(π, π′) = I(π, π′) / √(H(π) H(π′)),

where I denotes the mutual information and H the Shannon entropy.
The Normalized Rand Index, denoted as Rn, measures the similarity between two partitions in a statistical way, normalizing the Rand index (RI) by its expected value under random label assignment:

Rn = (RI − E[RI]) / (max(RI) − E[RI]).
Note that both NMI and Rn are positive measurements, i.e. a better partition has a larger NMI or Rn value. Although Rn is normalized, it can still be negative, which means that the partition is even worse than random label assignment.
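Both indices can be computed from the contingency table; the sketch below assumes the √ normalization for NMI and the adjusted-Rand form for Rn, which are common conventions but should be checked against the reference implementation.

```python
import numpy as np
from math import comb, log, sqrt

def _contingency(a, b):
    ra, rb = np.unique(a), np.unique(b)
    return np.array([[np.sum((a == i) & (b == j)) for j in rb] for i in ra])

def nmi(a, b):
    """NMI with sqrt normalization (one common convention)."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    c = _contingency(a, b)
    pi, pj = c.sum(1) / n, c.sum(0) / n
    mi = sum(c[i, j] / n * log(c[i, j] / n / (pi[i] * pj[j]))
             for i in range(c.shape[0]) for j in range(c.shape[1]) if c[i, j] > 0)
    h = lambda p: -sum(x * log(x) for x in p if x > 0)
    return mi / sqrt(h(pi) * h(pj)) if h(pi) > 0 and h(pj) > 0 else 1.0

def rn(a, b):
    """Normalized (adjusted) Rand index computed from the contingency table."""
    c = _contingency(np.asarray(a), np.asarray(b))
    n = int(c.sum())
    s = sum(comb(int(x), 2) for x in c.ravel())
    sa = sum(comb(int(x), 2) for x in c.sum(1))
    sb = sum(comb(int(x), 2) for x in c.sum(0))
    exp = sa * sb / comb(n, 2)
    return (s - exp) / (0.5 * (sa + sb) - exp)

truth = [0, 0, 0, 1, 1, 1]
print(round(nmi(truth, truth), 6), round(rn(truth, truth), 6))  # perfect agreement gives 1.0 1.0
```

As noted above, Rn can go below zero for partitions worse than random assignment, while NMI stays in [0, 1].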
To compare the overall performance of those clustering algorithms over the 35 benchmark cancer expression datasets, we propose an average performance score as follows:

score(A_i) = (1/d) Σ_{j=1}^{d} V(D_j, A_i),

where V(D_j, A_i) denotes the performance (i.e., Rn or NMI) of algorithm A_i on dataset D_j and d is the total number of benchmark datasets.
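Assuming the score is the plain per-algorithm mean of V(D_j, A_i) over the d datasets (a simplifying assumption on our part), the computation is a one-liner:

```python
import numpy as np

# V[j, i]: performance (Rn or NMI) of algorithm i on dataset j (toy values)
V = np.array([[0.9, 0.7],
              [0.6, 0.8],
              [0.3, 0.2]])
avg_score = V.mean(axis=0)  # one average performance score per algorithm
print(avg_score)
```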
Code availability
The MATLAB code is freely available at http://scholar.harvard.edu/yyl/ecc.
Contributions
Y.-Y.L. and Y.F. conceived the project. Y.-Y.L. designed the research. H.L. developed the code to perform ECC clustering and conducted all the cluster analyses. H.F. and R.Z developed the code to generate synthetic gene expression data. R.Z. collected cancer gene expression benchmark datasets. F.C. collected TCGA datasets and interpreted results. All authors analyzed the results. F.C., H.L., R.Z. and Y.-Y.L. wrote the manuscript. H.F. edited the manuscript.
Author Information
The authors declare no competing financial interests. Correspondence and requests for materials should be addressed to F.Y. (yunfu{at}ece.neu.edu) or Y.-Y.L.(yyl{at}channing.harvard.edu).
II. SYNTHETIC DATASETS
The 110 synthetic gene expression datasets are generated through a well-known dynamical gene regulation model [1]. To be self-contained, we summarize all the variables used in this section in Table SII.
A. Dynamical Gene Regulation Model
This model is based on a gene regulatory network represented by a digraph G(V,E). Here V is the set of nodes (genes) and E is the set of directed edges (gene regulations). Both transcription and translation are modeled using a standard thermodynamic approach [1]. For each node v_i, i = 1,2,…,n, the changing rate of mRNA concentration F_i^mRNA and the changing rate of protein concentration F_i^Prot are described by a set of coupled ordinary differential equations (ODEs):

F_i^mRNA = dx_i/dt = m_i f_i(y) − λ_i^mRNA x_i,   (S1a)
F_i^Prot = dy_i/dt = r_i x_i − λ_i^Prot y_i,   (S1b)
where mi is the maximum transcription rate, ri is the translation rate, λimRNA and λiProt are the mRNA and protein degradation rates, x ∈ Rn and y ∈ Rn are vectors of mRNA and protein concentration levels, respectively. fi(·) is the activation function of gene i, which is between 0 (gene i is turned off) and 1(gene i is maximally activated) given the protein concentrations y. The network topology is encoded in the activation functions. A more detailed definition of fi(·) is as follows.
We use a standard thermodynamics-based approach to model gene regulation [1]. The basic assumption is that binding of transcription factors (TFs) to cis-regulatory sites on the DNA is in quasi-equilibrium because it is orders of magnitude faster than transcription and translation. In the simplest case, gene i is regulated by a single TF (e.g., TF_j); then its promoter has only two states: either TF_j is bound (state S1) or unbound (state S0). The probability P(S1) that gene i is in state S1 at a certain time instant is given by the fractional saturation:

P(S1) = y_j^{h_ij} / (k_ij^{h_ij} + y_j^{h_ij}),
where y_j is the concentration of TF_j, k_ij is the dissociation constant, and h_ij the Hill coefficient. The bound TF activates or represses the expression of the gene. In state S0, the relative activation is α_0; in state S1, the relative activation is α_1. Given P(S1) and its complement P(S0) = 1 − P(S1), we can derive the function f_i(y_j), which computes the mean activation of gene i as a function of the TF concentration y_j:

f_i(y_j) = α_0 P(S0) + α_1 P(S1).
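A small sketch of the fractional saturation and the resulting single-regulator activation; the Hill-type form P(S1) = y^h / (k^h + y^h) is the assumption here, and all function names are ours.

```python
def p_bound(y, k, h):
    """Fractional saturation P(S1): probability that the TF is bound."""
    return y**h / (k**h + y**h)

def activation_single(y, k, h, a0, a1):
    """Mean relative activation for one regulator: a0*P(S0) + a1*P(S1)."""
    p1 = p_bound(y, k, h)
    return a0 * (1 - p1) + a1 * p1

# for a pure activator (a0=0, a1=1), activation rises from 0 toward 1
print(activation_single(0.0, k=1.0, h=2, a0=0.0, a1=1.0))  # 0.0
```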
We can also consider the case where gene i has two regulatory inputs. The resulting expression would be:

f_i(y_1, y_2) = Σ_{t=0}^{3} α_t P(S_t),
where ρ is the cooperativity factor, and the α_t are the relative activations when none of the TFs (α_0), only the first (α_1), only the second (α_2), or both TFs (α_3) are bound.
This approach can be used for an arbitrary number of regulatory inputs. If a gene is regulated by N TFs, it will have 2^N states: each of the TFs can be bound or unbound. Thus, the function for N regulators would be:

f_i(y) = Σ_{t=0}^{2^N − 1} α_t P(S_t).
Based on thermodynamics, we can compute the probability P(St) for every state t.
Assume now we have the gene regulatory network G(V, E) and the detailed dynamical model parameters. The initial values of xi and yi are generated randomly from the interval [0.001, 1]. Then we can compute the time evolution of x and y until the system reaches a steady state.
B. Noise
The integration of the coupled ODEs (Eq.S1a and Eq.S1b) results in noiseless mRNA and protein concentration levels. However, both molecular and measurement noise in gene expressions are unavoidable in practice. In living cells, molecular noise originates from thermal fluctuations and stochastic processes such as transcription and translation. Moreover, measurement noise of gene expression depends on the experimental technology used to monitor the gene expression level.
1. Molecular Noise
Both F_i^mRNA and F_i^Prot can be written in the following form:

F(X_t) = V(X_t) − D(X_t),
where V(X_t) is the production term and D(X_t) is the degradation term of the mRNA or protein. To model molecular noise in the transcription and translation processes, we can use the following chemical Langevin equation (CLE):

dX_t/dt = V(X_t) − D(X_t) + c_1 [√V(X_t) η_v − √D(X_t) η_d],
where η_v and η_d are independent Gaussian white-noise processes, and c_1 is a constant controlling the amplitude of the molecular noise.
To solve the CLE, we use the Stratonovich scheme and the Milstein method. The Stratonovich scheme is a technique for stochastic integration, similar in spirit to the Itô integral. Suppose we have a stochastic differential equation:

dX_t = f(X_t) dt + g(X_t) ∘ dW_t,
where X_t is the random variable, f is the drift coefficient, g is the diffusion coefficient, and W_t is the Wiener process. Integrating both sides yields:

X_t = X_0 + ∫_0^t f(X_s) ds + ∫_0^t g(X_s) ∘ dW_s,   (S10)
where ∘ is used here to distinguish the Stratonovich integral from the Itô integral.
We compute the integrals in Eq. S10 using the Milstein method. The Milstein method has two versions, namely a normal one and a derivative-free one. For simplicity, we use the derivative-free Milstein method:

X_{n+1} = X_n + f(X_n) h + g(X_n) ΔW_n + (1/(2√h)) [g(X̂_n) − g(X_n)] ΔW_n²,

where

X̂_n = X_n + f(X_n) h + g(X_n) √h,

h is the integration step size, and ΔW_n is the Brownian increment over the step.
Then we update X until it reaches a steady state.
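The update scheme can be sketched as follows; the Stratonovich-style derivative-free correction term (using ΔW² rather than ΔW² − h) is an assumption here, and the toy drift and diffusion functions are ours.

```python
import numpy as np

def milstein_df_step(x, f, g, h, dw):
    """One derivative-free Milstein step: the derivative g'(x) is replaced by
    a finite difference through the supporting value x_hat."""
    x_hat = x + f(x) * h + g(x) * np.sqrt(h)
    return (x + f(x) * h + g(x) * dw
            + (g(x_hat) - g(x)) * dw**2 / (2 * np.sqrt(h)))

def integrate(x0, f, g, h=1e-3, n_steps=1000, seed=0):
    """Integrate the SDE over n_steps steps with Gaussian increments."""
    rng = np.random.default_rng(seed)
    x = x0
    for _ in range(n_steps):
        x = milstein_df_step(x, f, g, h, rng.normal(0, np.sqrt(h)))
    return x

# toy test: linear degradation with small multiplicative noise decays toward 0
x_end = integrate(1.0, f=lambda x: -x, g=lambda x: 0.05 * x)
print(0.0 < x_end < 1.0)
```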
2. Measurement Noise
The measurement noise depends on the technology used to monitor the gene expression level and hence is modeled independently of the molecular noise. In this work, we model the measurement noise as follows:

X̃_t = X_t + c_2 η_ω,

where X_t is computed from Eq. S1, η_ω represents an independent Gaussian white-noise process, and c_2 is a constant quantifying the measurement noise level.
C. Synthetic Disease Subtypes
In order to generate synthetic disease subtypes, we first generate a random digraph G to represent the gene regulatory network. Considering the exponential time complexity of calculating the activation function, we set the average in-degree kin = 2 in the digraph G. Regarding the model parameters in Eq.S1, we choose mi and ri randomly from the interval [0, 1]. The Hill coefficient hij is sampled from a Gaussian distribution N(2,4) bounded in the interval [1, 10]. An example of the gene regulatory network of 20 genes is shown in Fig. S1a. The time evolution of mRNA concentration levels xi(t) without noise, with only molecular noise, and with both molecular and measurement noise, are shown in Fig. S1b, Fig. S1c, Fig. S1d, respectively. This serves as the baseline model.
To simulate gene expression data for different disease subtypes, we assume that different disease subtypes are associated with different sets of genes that are knocked out. For those knock-out genes, we set their transcription rates mi to be zero. Suppose in total we have N subjects divided into four groups (G0, G1, G2, G3) evenly: subjects in G0 are generated from the baseline model, while subjects in G1, G2 and G3 have different sets of knock-out genes (see Fig. S2). For each group (disease subtype), different subjects are simulated from different initial conditions of x and y.
III. REAL-WORLD DATASETS
We analysed 35 benchmark cancer gene expression datasets [2] with label (i.e., cluster structure) information to fully evaluate the performance of ECC, as well as 13 real cancer datasets with survival information available for practical evaluation. Some key characteristics of these datasets are summarized in Table SIII and Table SIV.
A. Benchmark Gene Expression Datasets
For the 35 benchmark datasets in Table SIII, the numbers of subjects vary from 22 to 248, the numbers of genes vary from 85 to 4,553, and the numbers of clusters vary from 2 to 14. Some datasets are from the same source. For example, datasets 2 and 3 (Alizadeh-2000-v2, Alizadeh-2000-v3) share the same gene expression data, but with different cluster numbers; datasets 4 and 5 (Armstrong-2002-v1, Armstrong-2002-v2) have the same number of subjects, but different dimensions of gene expression; dataset-14 (Golub-1999-v2) splits one cluster in dataset-15 (Golub-1999-v1) into two; dataset-32 (Tomlins-2006-v2) has one more cluster than dataset-33 (Tomlins-2006-v2).
B. Molecular data from TCGA
Furthermore, we also analysed molecular data for 13 major cancer types from The Cancer Genome Atlas (TCGA, https://tcga-data.nci.nih.gov/tcga/, date: 4/16/2016) project with survival information available. These cancer types include bladder urothelial carcinoma (BLCA), breast invasive carcinoma (BRCA), colon adenocarcinoma (COAD), head and neck squamous cell carcinoma (HNSC), kidney renal clear cell carcinoma (KIRC), acute myeloid leukemia (LAML), brain lower grade glioma (LGG), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), ovarian serous cystadenocarcinoma (OV), prostate adenocarcinoma (PRAD), skin cutaneous melanoma (SKCM), thyroid carcinoma (THCA), and uterine corpus endometrial carcinoma (UCEC). Each dataset contains four different types of molecular data: protein expression, microRNA (miRNA) expression, mRNA expression (RNA-seq V2) and somatic copy number alterations (SCNAs). Note that the number of subjects varies across data types.
IV. ADDITIONAL NUMERICAL RESULTS
In this section, we provide additional numerical results on ECC, covering its evaluation on synthetic datasets, the impact of the number of basic partitions, and the choice of basic partition generation strategy.
A. Performance of different clustering algorithms on synthetic data
Figure S3 and Figure S4 show the performance of different clustering algorithms on the 110 synthetic datasets with different settings. We find that for most datasets ECC achieves better performance than both traditional clustering and previous ensemble clustering methods. In the four cases where ECC is not the best, it trails the best method by only a small margin.
B. Performance of different clustering algorithms on benchmark data sets
Table SV and Table SVI show the performance of different clustering algorithms in terms of Rn and NMI.
C. Impact of the basic partition number
To fully uncover the properties of ECC for practical use, we thoroughly explore several factors that affect its performance. Fig. S5 shows the performance of ECC as a function of the number of basic partitions r. Generally speaking, the performance goes up and its variance becomes smaller as r grows; the number of basic partitions thus determines the stability of ECC. We find that r = 100 is large enough to obtain a robust consensus partition.
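The effect of the number of basic partitions r on stability can be illustrated with a small experiment. The sketch below uses a simple co-association consensus with average-linkage clustering as a stand-in for ECC's entropy-based fusion (ECC itself is not reimplemented here); the synthetic dataset, the cluster-number range [2, 7] for the basic partitions, and the linkage choice are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=0)
rng = np.random.default_rng(0)

def consensus(X, r, k=3):
    """Fuse r K-means basic partitions via a co-association matrix
    (a simple stand-in for the entropy-based fusion in ECC)."""
    n = len(X)
    co = np.zeros((n, n))
    for _ in range(r):
        labels = KMeans(n_clusters=int(rng.integers(2, 8)), n_init=2,
                        random_state=int(rng.integers(1 << 31))).fit_predict(X)
        co += labels[:, None] == labels[None, :]   # 1 if a pair is co-clustered
    D = 1.0 - co / r                               # co-association distance
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=k, criterion="maxclust")

for r in (5, 100):
    runs = [consensus(X, r) for _ in range(2)]
    print(r, adjusted_rand_score(runs[0], runs[1]))  # run-to-run agreement
```

In this toy setting, the agreement between two independent consensus runs tends to increase with r, mirroring the variance reduction observed in Fig. S5.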
D. Comparison of different basic partition generation strategies
So far we have employed the Random Parameter Selection (RPS) strategy to generate basic partitions, running K-means with different cluster numbers. A complementary strategy is the so-called Random Feature Selection (RFS) strategy. In RFS, we generate a sub-dataset by randomly selecting a certain percentage of features (e.g., gene expression levels), and run K-means with a fixed cluster number on it to obtain each basic partition. Fig. S6 compares the performance of these two generation strategies. For four datasets (15, 16, 26 and 28), RFS outperforms RPS at all sampling ratios. For example, the improvements on dataset-15 (Gordon-2002) exceed 80% and 40% in terms of Rn and NMI, respectively. This indicates that only a subset of features (genes) reflects the true cluster structure, while the rest are irrelevant or noisy. Finding discriminative features (genes) is a very challenging task, especially in unsupervised scenarios. Here we employ the simple RFS strategy and fuse the resulting basic partitions to obtain promising results. Taking efficiency into account, a 10% sampling ratio is sufficient for a satisfactory partition. Fig. S7 compares the results derived from RPS and RFS (with a 10% sampling ratio). We find that RFS is indeed a complementary basic partition generation strategy to RPS.
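The two generation strategies can be sketched as follows. The cluster-number range for RPS, the fixed cluster number and 10% sampling ratio for RFS, and the synthetic dataset are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

rng = np.random.default_rng(1)

def rps_partitions(X, r=100, k_max=8):
    """Random Parameter Selection: K-means with a random cluster number per run."""
    return [KMeans(n_clusters=int(rng.integers(2, k_max + 1)), n_init=2,
                   random_state=int(rng.integers(1 << 31))).fit_predict(X)
            for _ in range(r)]

def rfs_partitions(X, r=100, k=3, ratio=0.1):
    """Random Feature Selection: fixed cluster number, random feature subset per run."""
    d = X.shape[1]
    parts = []
    for _ in range(r):
        cols = rng.choice(d, size=max(1, int(ratio * d)), replace=False)
        parts.append(KMeans(n_clusters=k, n_init=2,
                            random_state=int(rng.integers(1 << 31))).fit_predict(X[:, cols]))
    return parts

X, _ = make_blobs(n_samples=60, n_features=50, centers=3, random_state=0)
bps = rps_partitions(X, r=10)            # RPS basic partitions
fps = rfs_partitions(X, r=10, ratio=0.1) # RFS basic partitions at 10% sampling
```

Either list of basic partitions can then be fed to the consensus step; in practice the two strategies can also be mixed.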
E. Performance with missing values
We validate the performance of ECC in the presence of missing values, which result in incomplete basic partitions (IBPs). Given a dataset, we randomly remove a certain fraction of instances and run the K-means clustering algorithm on the remaining instances with the user-defined cluster number. The removed instances are assigned label 0 in the incomplete basic partition. We repeat the above process 100 times to obtain 100 IBPs and employ ECC to fuse them into a consensus partition. Fig. S9 shows the performance of ECC with different missing ratios on 4 datasets. We find that ECC still provides a high-quality and robust consensus partition even with a high missing ratio.
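The construction of incomplete basic partitions can be sketched as below; the 30% missing ratio, cluster number and random data are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

def incomplete_basic_partition(X, k=3, missing_ratio=0.3):
    """Cluster a random subset of instances; removed instances get label 0,
    so kept instances carry cluster labels 1..k."""
    n = len(X)
    keep = rng.random(n) >= missing_ratio        # ~70% of instances are kept
    labels = np.zeros(n, dtype=int)              # 0 marks a missing instance
    labels[keep] = KMeans(n_clusters=k, n_init=5,
                          random_state=int(rng.integers(1 << 31))).fit_predict(X[keep]) + 1
    return labels

X = rng.normal(size=(100, 10))
ibps = [incomplete_basic_partition(X) for _ in range(100)]   # 100 IBPs, as in the text
```

The 0 labels simply flag instances that carry no information in that basic partition; a consensus method such as ECC can then skip these entries instead of imputing them.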
V. SURVIVAL ANALYSIS
For real-world molecular data without label information (e.g., the 13 TCGA cancer types analysed in this work), we can employ survival analysis to evaluate the performance of different clustering methods. Survival analysis considers the expected duration of time until one or more events happen, such as death, disease occurrence, disease recurrence, recovery, or another experience of interest [S5]. The duration is measured from the beginning of an observation period (such as surgery or the start of treatment) to an event, the end of the study, or loss of contact or withdrawal from the study. A subject who does not experience an event during the observation time is said to be censored: nothing is observed or known about that subject after the time of censoring, and a censored subject may or may not experience an event after the end of the observation period.
A. Log-rank test
The log-rank test is a hypothesis test that compares the survival distributions of two or more groups. The null hypothesis is that every group has the same survival function. The expected number of events at each event time in each group is computed from the number of subjects at risk in that group at that time. The log-rank test determines whether the observed number of events in each group differs significantly from the expected number; the formal test is based on a chi-squared statistic. When comparing k groups, the log-rank statistic follows a chi-squared distribution with k - 1 degrees of freedom (one degree of freedom for two groups), from which the p-value is calculated. A p-value smaller than 0.05 typically indicates that the groups differ significantly in survival.
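For two groups, the test can be sketched directly from its definition: at each distinct event time, compare the observed number of events in group 1 against its expectation under the null, using the hypergeometric variance. The toy survival times below are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def logrank_test(time, event, group):
    """Two-group log-rank test; event=1 marks an observed event, 0 a censored subject,
    and group is coded 0/1."""
    time, event, group = map(np.asarray, (time, event, group))
    event_times = np.unique(time[event == 1])
    O1 = E1 = V = 0.0
    for t in event_times:
        at_risk = time >= t
        n = at_risk.sum()                                   # subjects at risk at t
        n1 = (at_risk & (group == 1)).sum()                 # ... of which in group 1
        d = ((time == t) & (event == 1)).sum()              # events at t (both groups)
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        O1 += d1                                            # observed events, group 1
        E1 += d * n1 / n                                    # expected under the null
        if n > 1:                                           # hypergeometric variance
            V += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    stat = (O1 - E1) ** 2 / V
    return stat, chi2.sf(stat, df=1)                        # df = 1 for two groups

stat, p = logrank_test([1, 2, 3, 4, 5, 10, 11, 12, 13, 14],
                       [1] * 10, [1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
print(stat, p)   # group 1 dies earlier, so p is small
```

With all five group-1 subjects failing before any group-0 subject, the statistic is large and the p-value falls well below 0.05.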
B. Survival analysis on real-world data
Tables SVII-SX display the log-rank p-values for survival analysis of the 13 TCGA major cancer types using different molecular data types (mRNA expression, microRNA expression, protein expression, and somatic copy number alterations (SCNAs)) and different clustering methods. The p-values smaller than 0.05 are displayed in bold face. We found that for each single molecular data type our ECC method yields more significant p-values than the other clustering methods. Table SXI displays the log-rank p-values for survival analysis of the 13 TCGA major cancer types using pan-omics data (i.e., integrating mRNA expression, microRNA expression, protein expression, and SCNAs) and our ECC method. (Note that the competing clustering methods cannot handle missing values or incomplete basic partitions, hence we only show the ECC results on the pan-omics data.) We find that for each cancer type the p-value is smaller than 0.05, suggesting that by integrating four different molecular data types ECC generates significant subtypes for all 13 cancer types.
Acknowledgments
We thank Rulla Tamimi and Edwin Silverman for valuable discussions. This work is supported in part by the John Templeton Foundation (Award number 51977), the National Academy of Sciences-Grainger Foundation Frontiers of Engineering Award (2000006959), NSF CNS Award (1314484), ONR award (N00014-12-1-1028), ONR Young Investigator Award (N00014-14-1-0484), and U.S. Army Research Office Young Investigator Award (W911NF-14-1-0218).
References
- [S1].
- [S2].
- [S3].
- [S4].
- [S5].
- [S6].
- [S7].
- [S8].
- [S9].
- [S10].
- [S11].
- [S12].
- [S13].
- [S14].
- [S15].
- [S16].
- [S17].
- [S18].