Abstract
Cancer is a disease often characterized by the presence of multiple genomic alterations, which trigger altered transcriptional patterns and gene expression, which in turn sustain the processes of tumorigenesis, tumor progression and tumor maintenance. The links between genomic alterations and gene expression profiles can be utilized as the basis to build specific molecular tumorigenic relationships. In this study we perform pan-cancer predictions of the presence of single somatic mutations and copy number variations using machine learning approaches on gene expression profiles. We show that gene expression can be used to predict genomic alterations in every tumor type, where some alterations are more predictable than others. We propose gene aggregation as a tool to improve the accuracy of alteration prediction models from gene expression profiles. Ultimately, we show how this principle can be beneficial in intrinsically noisy datasets, such as those based on single cell sequencing.
Author Summary
In this article we show that transcript abundance can be used to predict the presence or absence of the majority of genomic alterations present in human cancer. We also show how these predictions can be improved by aggregating genes into small networks to counteract the effects of transcript measurement noise.
Introduction
Cancer is a molecular disease occurring when a cell or group of cells acquires uncontrolled proliferative behavior, conferred by a multitude of deregulations in specific pathways [1]. As is implied by such a broad definition, cancer is a highly heterogeneous disease, showing remarkably different molecular, histological, genetic and clinical properties, even when comparing tumors originating from the same tissue [2]. Many cancers are characterized by the presence of single nucleotide or short indel mutations and/or copy number alterations, which appear somatically at the early stages of oncogenesis and can drive tumor progression [3]. Cancers can be broadly divided into two classes: the M class, where point mutations are prevalent, and the C class, where copy number variations (CNVs) are more numerous and are often associated with TP53 mutations. Tumor class is associated with anatomic location. Most ovarian cancers, for example, belong to the C class, while most colorectal cancers belong to the M class, although many exceptions do exist [4].
The Cancer Genome Atlas (TCGA) project [5] has recently undertaken a major effort to collect vast amounts of information on thousands of distinct tumor samples. The TCGA data collection, commonly referred to as the “Pan-cancer” dataset, provided the scientific community with an avalanche of data on DNA alterations, gene expression, methylation status and protein abundances among others, with the critical mass necessary to identify rarer driver tumorigenesis effects in many types of cancers [6–8]. By combining all 33 TCGA datasets, Bailey and colleagues [9] recently outlined a pan-cancer map of which mutations can be drivers for the progression of cancer.
The availability of thousands of samples measuring many different variables in cancer has allowed scientists to generate statistical models of relationships between different molecular species. A pan-cancer correlation network between coding genes and long noncoding RNAs, for example, sheds light on the function of non-coding parts of the transcriptome [10]. More recently, mutations in transcription factors (TFs) have been linked to altered gene expression and phosphoprotein levels in 12 TCGA tumor type datasets [11]. Network approaches have been applied to identify clusters of coexpressed genes shared by multiple cancer types [12]. Several studies have sought to characterize the relationships between genomic status and expression levels in cancer, trying to identify commonalities across different cancer types [13,14]. In particular, Alvarez and colleagues [15] have postulated that the effect of genomic alterations in cancer can be more readily assessed by aggregating gene expression profiles into transcriptional networks, rather than by profiles taken separately.
While the association between genomic events and gene expression is proven in several scenarios, it remains to be seen if it can be assessed in scenarios where fully quantitative readouts are unavailable, such as low coverage samples. One of these scenarios is Single Cell Sequencing [16], often carried out in experiments where thousands of mutations are generated via a system of pooled CRISPR-Cas9 knockouts [17].
To our knowledge there is no study trying to identify relationships between all genomic alteration events (somatic mutations/indels and CNVs) and global gene expression across cancers. In this study, we use 24 TCGA tumor datasets to investigate whether gene expression can be used to predict the presence of specific genomic alterations in several cancer tissue contexts. To this end, we leverage the current availability of a vast family of machine learning algorithms [18]. We investigate whether some gene alterations can be better modelled than others, and whether using grouped gene expression profiles as aggregated variables can effectively identify specific genomic alterations. Finally, we test whether the prediction of mutations and CNVs can be carried out in intrinsically noisy single cell RNA-Seq (scRNA-Seq) transcriptomics datasets.
Results
Collection of Pan-Cancer Dataset
We downloaded the most recent version of the TCGA datasets available on Firehose (v2016_01_28), encompassing mutational, CNV and gene expression data. Using TSNE clustering on gene expression data (9642 samples), we observed how different tumor types cluster separately from each other (Figure 1A). However, two tumour types segregate into two subgroups: breast cancer, which subdivides into a major luminal cluster and a smaller (in terms of samples collected) basal cluster [19]; and esophageal carcinoma, which roughly subdivides into adenocarcinomas and squamous cell carcinomas [20].
We then aggregated the single nucleotide and short indel somatic mutation data from the same samples for which we had collected gene expression. As is widely known, TP53 is the most mutated gene in human cancer (Figure 1B), followed by PIK3CA, SYNE1 and KRAS. As shown before [4], some tumor types are characterized by a high presence of somatic mutations. In particular, colorectal cancer, mesothelioma and esophageal cancer carry at least one of these events in almost 100% of the samples in the TCGA dataset. In the figure, we filtered out commonly known non-driver mutations [21], such as those happening in long genes like TTN and OBSCN, but we kept them in all following analyses for the sake of completeness. A representation of all mutated genes, including blacklisted ones, is available in Figure S1. Some tumors are characterized by the prevalence of a mutation in a specific gene, such as the serine/threonine kinase-coding BRAF in thyroid carcinoma [22] or IDH1, encoding isocitrate dehydrogenase 1, in low grade glioma [23].
Finally, we obtained readouts of CNV status for all TCGA samples. CNVs can have different extensions in terms of nucleotides affected and can sometimes encompass entire chromosomes [24] and the thousands of genes therein. In order to limit the number of variables to a more meaningful subset, we assigned a CNV profile to every gene, and kept only those whose CNV profiles are positively and significantly correlated with their transcript abundance profiles [25]. We defined these events as functional CNVs (fCNVs). In order to make fCNV variables comparable to the mutational ones, we defined a cut-off for presence or absence by using the log2(CNV) threshold of 0.5, which roughly corresponds to at least one copy gain for amplifications, and at least one copy loss for deletions (see Materials and Methods). We then reported their abundance in the pan-cancer dataset, distinguishing between amplifications (Figure 1C) and deletions (Figure 1D). As previously shown [4], virtually all ovarian cancer samples are characterized by at least one CNV event. Among the most amplified genes, we find the oncogenes SOX2 [26], EGFR [27] and MDM2 [28], and also a non-coding gene, PVT1, the most amplified gene in breast cancer, with proven but as-of-yet uncharacterized proto-oncogenic effects [29,30]. Amongst the most deleted genes (Figure 1D) we observe well known tumor-suppressor genes, such as CDKN2A [31,32] and PTEN [33,34].
Modelling Cancer Alterations with gene expression
After collecting all the expression and genomic alteration data from TCGA, we set out to generate models able to predict the presence or absence of each event by virtue of gene expression data in the contexts of all collected tumor types.
We tested several modelling algorithms for classification using the aggregator platform for machine learning caret [18] in the bladder cancer mutational dataset [35]. We observed that all models provide better-than-random predictions for the majority of mutational events, in terms of area under the ROC curve (AUROC) (Figure 2) [36]. We chose the top-scoring algorithm in this test, the Gradient Boost Modelling algorithm (gbm), a tree-based boosting model [37], due to its robustness and speed of implementation.
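The AUROC used to compare the models can be illustrated with a short, self-contained sketch. It is not the code used in this study (which relied on caret in R); it assumes binary labels (1 for alteration present, 0 for absent) and one predicted score per sample, and uses the standard rank-sum identity for the area under the ROC curve. The function name `auroc` is hypothetical.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: iterable of 0/1 (1 = alteration present); scores: predicted scores.
    Illustrative helper only, not part of any library.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the block of tied scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        # Tied scores share the average of their 1-based ranks.
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)
```

A value of 0.5 corresponds to a random predictor and 1.0 to a perfect separation of altered and unaltered samples.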
We calculated gbm models for all tumour types with at least 100 samples with co-measured expression and CNV or mutations, which included 24 of the 33 TCGA tumor types. Models were built for genomic events observed in no less than 5% and no more than 95% of the patients in the dataset, and in at least 10 samples. Our results show that in all tumour types, a machine learning algorithm based on gene expression is consistently better than a random predictor (AUROC line at 0.5) at correctly classifying tumour samples for the presence or absence of specific genomic alteration events (Figure 3 and Supplementary Table S1). In particular, TP53 mutations are well modelled in many of these tumor types, being the best predicted mutational event in both acute myeloid leukemia and low grade glioma. We could also model the presence of a copy loss of TP53 in sarcoma, which can be predicted with an accuracy of 70%. Ovarian and pancreatic cancer datasets presented exceptional cases, in that each contained such high TP53 mutation rates (close to 95%) [38,39] that our algorithms could not distinguish sufficient differences within each dataset to train a model. KRAS-targeting events are also well modelled, specifically in colon, lung and stomach cancer, and cervical squamous carcinoma [40]. We noted a tendency where models for more frequent CNV events yielded a greater predictive power (Figure S2), a tendency not observed for somatic mutation models. We then tested whether known tumor-related genes, such as those curated by the Cancer Gene Census [41], are better modelled than the rest of the genome. There is no difference in mutation and amplification results, but for deletion events, oncogenes yield weaker models (Wilcoxon Test, p=0.0037) and tumor suppressor genes yield generally stronger models (p=0.00050).
This is in agreement with the central paradigm of cancer, where a tumor suppressor gene deletion can be one of the driving events of tumorigenesis and tumor progression [42]. On the other hand, deletion of tumour-promoting oncogenes is generally unfavourable for tumor progression, and so, generally speaking it should be present only as a passenger event, unlikely to determine global gene expression and tumor fate.
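The event-frequency criterion used to decide which alterations are modelled (between 5% and 95% of patients, and at least 10 altered samples) can be written as a small filter. The following is a minimal sketch assuming one boolean call per sample; the function name and the inclusive treatment of the boundaries are assumptions, not taken from the original analysis code.

```python
def is_modelable(event_calls, min_fraction=0.05, max_fraction=0.95, min_samples=10):
    """Return True if an alteration is frequent enough (and not too frequent) to model.

    event_calls: one boolean per sample (True = alteration present).
    Thresholds follow the text; inclusive boundaries are an assumption.
    """
    n_total = len(event_calls)
    if n_total == 0:
        return False
    n_altered = sum(1 for call in event_calls if call)
    fraction = n_altered / n_total
    return min_fraction <= fraction <= max_fraction and n_altered >= min_samples
```

Under this rule an event present in 96% of a cohort is skipped, which matches the situation described for TP53 mutations in ovarian and pancreatic cancer.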
Modelling specific alterations with noise addition
In order to understand whether cancer-related genomic alterations can be modelled by gene expression in scenarios with lower signal-to-noise ratio, we artificially perturbed the TCGA gene expression dataset via the addition of Gaussian noise, and then proceeded to build models to predict the presence of TP53 mutations in breast cancer, the largest dataset in TCGA by number of samples.
As expected, the addition of random Gaussian noise to the gene expression matrix has a detrimental effect on the amount of information left for modelling the presence of TP53 somatic mutations (Figure 4A).
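The perturbation itself is simple to sketch. The snippet below assumes the expression matrix is a list of per-gene rows of normalized values and that zero-mean noise with a single standard deviation (the “Gaussian Noise Level”) is added independently to every entry; the function name and the seeding mechanism are illustrative assumptions.

```python
import random


def add_gaussian_noise(expression, noise_sd, seed=None):
    """Return a copy of the expression matrix with zero-mean Gaussian noise added.

    expression: list of rows (genes), each a list of per-sample values.
    noise_sd: standard deviation of the noise (the "Gaussian Noise Level").
    """
    rng = random.Random(seed)  # local generator so runs are reproducible
    return [
        [value + rng.gauss(0.0, noise_sd) for value in row]
        for row in expression
    ]
```

Repeating the call with different seeds gives independent noisy replicates of the same dataset, which can then be passed to the same modelling procedure.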
We then decided to test several permutations of noise addition on the same breast cancer expression data, by each time aggregating genes into networks defined a priori in the same context, using a Tukey Biweight Robust Average method [43] on Weighted Gene Correlation Network Analysis (WGCNA) clusters [44] and the VIPER algorithm [15] on ARACNe-AP networks [45]. It is important to note that WGCNA clusters are completely non-overlapping and yield generally a lower number of aggregated variables than VIPER clusters, which are groups of genes possibly shared by other transcription factor clusters and that collectively yield the global expression of a transcription factor target set (dubbed as a proxy for “TF activity” in the original VIPER manuscript [15]).
Our results show that gene expression, VIPER activity and WGCNA clusters yield very similar models for predicting TP53 mutations in breast cancer (Figure S4). The amount of information contained in the input variables is therefore comparable. Adding noise to the input expression matrix, however, and then aggregating the resulting noise-burdened genes into VIPER or WGCNA clusters (see Materials and Methods), provides robustness to the models (Figure 4B). Similar results with higher variances (possibly due to the smaller size of the datasets) can be observed for EGFR amplifications in glioblastoma (Figure S5) and lung squamous carcinoma (Figure S6), for PVT1 amplifications in ovarian cancer (Figure S7) and for PTEN deletions in sarcoma (Figure S8). In all these examples, however, the performance of the simple WGCNA/Tukey aggregation is closer to (if not worse than) that of simple gene expression.
An alternative way to reduce the information content from an NGS gene expression dataset is to reduce the number of read counts from each sample. This operation reflects either a low coverage bulk RNA-Seq experiment or an experiment arising from Single-Cell sequencing [46]. In particular, single-cell RNA-Seq (scRNA-Seq) is characterized by the dropout phenomenon [47] wherein genes expressed in the cells are sometimes not detected at all. In order to simulate such scenarios, we down-sampled each RNA-Seq gene count profile from the largest TCGA dataset (Breast Cancer) to a target aligned read number using a beta distribution, which allows for reduction coupled with random complete gene dropouts (Figure 5A). We then modelled again the presence of TP53 mutations using gene expression (Figure 5B). We found that models based on standard unaggregated gene expression experience an accuracy drop at around 30M reads, while aggregating genes using VIPER (but not with WGCNA) allows for better-than-random accuracies even at 3M reads, confirming the benefits of gene aggregation in low coverage RNA-Seq, as previously found e.g. for sample clustering [48].
Mutation prediction in single-cell data
We set out to detect if mutations can be modelled from gene expression data in single-cell RNA-Seq contexts. In order to do so, we used the original CROP-Seq dataset [17], where multiple gene knock-outs were carried out via CRISPR/Cas9 in Jurkat cells and the presence of the deletion was measured alongside gene expression in a single cell manner.
We built models based on 8 knock-out subsets targeting the following genes: JUNB, JUND, LAT, NFAT5, NFKB1, NFKB2, NR4A1 and PTPN11, all with at least 35 single cells carrying the single knock-outs (vs. 420 control wild-type single cells). Our analysis shows that gene aggregation in TF-centered coexpression groups using ARACNe/VIPER can be beneficial in predicting mutation presence, by virtue of showing the probability of carrying the mutation in mutated samples vs. control samples (Figure 6).
Discussion
In this paper, we tested a framework to investigate the complex relationships between genetic events and transcriptional deregulation through machine learning approaches. We demonstrated as a generalized proof-of-principle that genomic alterations can be modeled by gene expression across several human cancers through several machine learning algorithms, and specifically that a gradient boost modeling approach seems optimal for the task. In the process, we generated a collection of models for each genomic alteration in each cancer context, showing that the best predicted alterations are not necessarily targeting known oncogenes or tumor suppressors. Interestingly, we show how the aggregation of gene expression profiles in groups of coexpressed genes, via the ARACNe/VIPER or WGCNA methods, makes the models more robust and more resistant to perturbations such as Gaussian noise or artificial downsampling. Finally, we have shown how the same aggregation principle can have beneficial effects in predicting the presence of mutations in intrinsically noisy scenarios, like single cell RNA-Seq. At the same time, we have shown how modeling can be carried out in single-alteration contexts, implicitly overcoming the potential bias of cancer samples, where in fact multiple genomic alterations can and do coexist.
The performance of gene aggregation methods has been tested before for sample clustering in RNA-Seq read reduction scenarios [15,48], but never in this specific task nor in a pan-cancer context. As a principle, the usage of robust averages of pre-defined co-expressed genes can be applied in any context where reliability of gene expression data is necessary, from differential expression to pathway enrichment analyses. The notion that relationships between genomic alterations and gene expression profiles can be robustly modelled across different cancer scenarios, as well as in single-cell and noisy contexts, can have important repercussions in diagnostics, where theoretically a single quantitative expression experiment can be used to predict the presence or absence of a mutation.
Materials and Methods
Data processing
We obtained raw expression counts, mutation and CNV raw data from TCGA using the Firehose portal (gdac.broadinstitute.org). Raw counts were normalized using Variance Stabilizing Transformation as described before [49]. Somatic mutations not changing the amino acid sequence of the protein product were discarded. We flagged genes blacklisted by the MutSig project [21], such as TTN, ORs and MUCs, as false positives, and removed them from further analysis (except the most mutated in the pan-cancer dataset, shown in Figure S1). CNV tracks were associated with the targeted gene using the GenomicRanges R package [50]. Gene-centered CNVs were then associated with the expression profile of the gene itself. CNV tracks with a Spearman correlation coefficient above 0.5 were deemed “functional CNVs” and used in the rest of the analysis. Samples with more than 0.5% of the genes in the genome somatically amplified, deleted or mutated were deemed “hypermodified” and their total number is shown in the bottom bars of Figure 1.
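The functional CNV criterion can be sketched as follows, assuming each gene has a CNV profile and an expression profile measured over the same samples. Only the correlation coefficient cut-off of 0.5 is checked here; the significance test mentioned in the Results is omitted, and all function names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def is_functional_cnv(cnv_profile, expression_profile, min_rho=0.5):
    """True if the gene's CNV profile tracks its own expression (rho above 0.5)."""
    return spearman(cnv_profile, expression_profile) > min_rho
```

Genes whose copy number changes do not translate into expression changes fail this test and are excluded from the CNV models.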
Clustering analysis was carried out on the TCGA tumor samples using the expression profiles of 1172 Transcription Factors defined by Gene Ontology terms “transcription factor activity, sequence-specific DNA binding” (GO:0003700) and “nuclear location” (GO:0005634) [51].
The dataset expression profiles were visualized after TSNE transformation [52] with 1000 iterations using a 2D kernel density estimate for coloring different tumor types [53]. Oncogenes and Tumor Suppressor genes were obtained from the COSMIC Cancer Gene Census in October 2018 [41].
Modeling
We used the R caret package [18] as the platform to run all our predictive models in a standardized and reproducible way. Binary classifiers were built to predict the presence/absence of mutation, amplification and deletion events. The CNV value provided by TCGA corresponds to log2(tumor coverage) – genomic median coverage. The threshold for amplification/deletion presence was set to 0.5.
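To make the thresholding concrete: for a diploid baseline in a pure tumor sample, a single copy gain gives log2(3/2) ≈ 0.585 and a single copy loss gives log2(1/2) = −1, so both fall beyond a cut-off of 0.5 in absolute value. The sketch below converts log2 CNV values into binary amplification and deletion calls; the symmetric negative threshold for deletions and the inclusive comparisons are assumptions, since the text only states the value 0.5.

```python
def call_cnv_events(log2_cnv, threshold=0.5):
    """Binary amplification/deletion calls from per-sample log2 CNV values.

    Assumes amplification when value >= threshold and deletion when
    value <= -threshold (the symmetric cut-off is an assumption).
    """
    amplified = [value >= threshold for value in log2_cnv]
    deleted = [value <= -threshold for value in log2_cnv]
    return amplified, deleted
```

The two boolean vectors can then serve as the class labels of two separate binary classifiers, one for amplification and one for deletion of the same gene.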
Data partitioning was performed once for each tumor type, with 75% of the samples used for training and 25% for test purposes. Training was performed using 10-fold Cross Validation. Recursive Feature Elimination was carried out by the default caret implementation on the 10,000 highest variance gene expression tracks. The algorithms used (and R packages implementing them) were:
Bayesian Generalized Linear Model (bayesglm)
Tree Models from Genetic Algorithms (evtree)
Gradient Boost Modeling (gbm)
Generalized Linear Model (glm)
k-Nearest Neighbors (kknn)
Linear Discriminant Analysis (lda)
Neural Networks (mxnet)
Neural Networks with Feature Extraction (pcaNNet)
Random Forest (rf)
Linear Support Vector Machine (svmLinear)
Radial Support Vector Machine (svmRadial)
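The partitioning scheme described above (a single 75%/25% split followed by 10-fold cross validation on the training portion) can be sketched on sample indices alone. This is an illustration rather than a reproduction of the caret behavior: the split here is a plain random shuffle without stratification by class, and the function names are hypothetical.

```python
import random


def split_train_test(n_samples, train_fraction=0.75, seed=0):
    """Randomly split sample indices into training and test sets (unstratified)."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = int(round(n_samples * train_fraction))
    return indices[:n_train], indices[n_train:]


def k_fold_indices(indices, k=10):
    """Return k (training, validation) index pairs covering `indices` once each."""
    folds = [indices[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        pairs.append((training, validation))
    return pairs
```

Each model is then fitted k times on the training folds and evaluated on the held-out fold, while the 25% test set stays untouched until the final AUROC is computed.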
In order to reduce information from the gene expression profiles, we adopted two strategies. The first, shown e.g. in Figure 4B, adds random Gaussian noise to the expression tracks, with a variable standard deviation (indicated as “Gaussian Noise Level”). Each model after noise addition was run 100 times to allow for various data partitions. The second strategy (Figure 5) reduced the number of reads mapped to each gene in order to obtain expression samples with decreased total gene counts. In order to do so, we applied to each gene in each sample a downsampling factor sampled from a beta distribution:
$$p(x;\alpha,\beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}$$
where B is the Beta function, acting as a normalization constant, x is the raw gene expression count in a particular sample, α is the first shape parameter and β the second shape parameter. In order to reduce the total sample coverage to the desired level, β is set to 0.1 and α is set to:
where f is the desired number of reads and r is the total number of reads in the sample. A real case example of this beta distribution is shown in Figure S9.
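The read reduction strategy can be illustrated with the sketch below. Because the expression for α is not reproduced in this text, the code assumes α = β·f/(r − f), which makes the mean of the beta distribution, α/(α + β), equal to the target fraction f/r; this is an assumption made for illustration and may differ from the formula actually used. With β = 0.1 the resulting distribution is strongly U-shaped, so most factors fall near 0 or 1, which produces complete gene dropouts. The function name is hypothetical.

```python
import random


def downsample_counts(counts, target_reads, beta=0.1, seed=None):
    """Scale each gene count by a factor drawn from Beta(alpha, beta).

    counts: raw read counts per gene for one sample (total r).
    target_reads: desired total number of reads f (0 < f < r).
    ASSUMPTION: alpha = beta * f / (r - f), so the mean factor equals f / r.
    """
    rng = random.Random(seed)
    total_reads = sum(counts)
    if not 0 < target_reads < total_reads:
        raise ValueError("target_reads must be between 0 and the sample total")
    alpha = beta * target_reads / (total_reads - target_reads)
    return [int(round(count * rng.betavariate(alpha, beta))) for count in counts]
```

The expected total of the downsampled profile is close to the target f, while individual genes are either largely retained or dropped, mimicking the sparsity of low coverage and single-cell data.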
Aggregation algorithms
We used ARACNe-AP [45] to generate TF-centered networks on each of the VST-normalized TCGA expression datasets. TFs were selected via Gene Ontology as described before, with the p-value threshold for each network edge set to $10^{-8}$. ARACNe networks were then used to obtain an aggregated value of TF activity for each sample using the VIPER algorithm [15], which reports the collective gene expression level changes of each TF-centered network vs. the mean expression of each gene in the dataset. Only TF networks with at least 10 genes (excluding the TF) were included.
WGCNA clusters of genes were constructed using the WGCNA package [44] with default parameters and minimum network size set to 10. To obtain a robust average expression value for each WGCNA cluster in each sample we used Tukey’s Biweight function as implemented by the R affy package [54].
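The robust averaging step can be sketched as a one-step Tukey biweight estimate applied to the expression values of all genes in a cluster within one sample. The tuning constant c = 5 and the small epsilon of 0.0001 are assumed defaults for this illustration, and the helper names and data layout (a dictionary of per-gene expression lists) are hypothetical.

```python
from statistics import median


def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight average: values far from the median get zero weight."""
    center = median(values)
    mad = median([abs(v - center) for v in values])  # median absolute deviation
    weights = []
    for v in values:
        u = (v - center) / (c * mad + epsilon)
        weights.append((1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def aggregate_clusters(expression_by_gene, clusters):
    """Collapse genes into cluster-level profiles, one robust average per sample.

    expression_by_gene: dict gene -> list of per-sample expression values.
    clusters: dict cluster name -> list of member genes.
    """
    aggregated = {}
    for name, genes in clusters.items():
        profiles = [expression_by_gene[g] for g in genes]
        aggregated[name] = [tukey_biweight(list(column)) for column in zip(*profiles)]
    return aggregated
```

Because outlying genes are down-weighted or ignored, a single noisy or dropped-out gene has limited influence on the cluster-level value.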
Single Cell dataset
CROP-Seq raw expression counts were obtained from the Datlinger dataset (available on Gene Expression Omnibus, entry GSE92872). Samples mapping to wild-type control cells and to the most represented knock-out genes (JUNB, JUND, LAT, NFAT5, NFKB1, NFKB2, NR4A1 and PTPN11) were selected. Variance Stabilizing Transformation was applied using a blinded experimental design. Gradient boost modelling was applied to each model as described in the previous paragraph, and probabilities of carrying the knock-out for samples in the test set are shown, grouped for wild-type and knock-out samples. In this particular case, 10 data partitioning rounds were performed, in order to increase the exploration space of the model performance.
Methods Availability
All code used to generate the analysis and the figures of this paper is available in the online materials.
Acknowledgments
We thank Dr. Marco Russo and Jordan Pflugh Kraft for the fruitful discussions.
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].