ABSTRACTS
Despite large-scale efforts to systematically map the cancer genome, little is known about how the interplay of genetic and epigenetic alternations shapes the architecture of the tumor’s transcriptome. With the goal of constructing a system-level view of the deregulated pathways in cancer cells, we systematically investigated the functional organization of the transcriptomes of 10 tumor types using data sets generated by The Cancer Genome Atlas project (TCGA). Our analysis indicates that the human cancer transcriptome is organized into well-conserved modules of co-expressed genes. In particular, our analysis identified a set of conserved gene modules with distinct cancer hallmark themes involving cell cycle regulation, angiogenesis, innate and adaptive immune response, differentiation, metabolism and regulation of protein phosphorylation. We applied a network inference approach to nominate candidate drivers of these conserved gene modules. The predicted drivers have consistent cancer-relevant functions related to the specific hallmarks and are enriched with cancer consensus genes and significantly mutated genes. We showed genetic alternations of TP53 and other cell cycle drivers have major downstream transcriptional impact on cell cycle regulation. Collectively, our analysis provided global views of convergent transcriptome architecture of human cancer. The result of our analysis can serve as a foundation to link diverse genomic alternations to common transcriptomic features in human cancer.
INTRODUCTION
The evolutionary history of a cancer cell has documented the accumulation of random ‘passenger’ mutations along with a few rate-limiting ‘driver’ events which confer fitness benefit to the cell. The cumulative effect of acquired driver events eventually set cancerous cells into uncontrolled state of proliferation, immortality, angiogenesis, cell death, invasion and metastasis. However, how the genetic and epigenetic effects propagate through intricate gene and biochemical circuits to cause deregulated cellular states within a cancer cell is still largely unknown. A promising approach is to decompose the oncogenic state of tumor cells directly in terms of the activity of sets of conserved biological features that has been described as the hallmarks of cancers1.
The comprehensive multi-dimensional genomic data provided by TCGA datasets across many tumor types offers an unprecedented opportunity to deepen our understanding of cancer biology and the discovery of new therapeutic and diagnostic targets. Gene expression analyses using unsupervised clustering approaches have been widely used in cancer research. Although informative, these methods analyzed gene level information and are prone to intrinsic noise caused by various biological and experimental factors. During the last decade, studies2-5 showed that gene modules consisting of co-regulated or co-expressed genes can be considered as higher-order building blocks of the global transcriptomic network and have been successfully applied to detect gene-disease association by improving the statistical power beyond the level of individual genes. Systematic data-driven approaches6-10 have also been developed to identify co-expressed gene modules in order to elucidate patterns of transcriptome organization across tissues and species.
We adopt the data-driven strategy to investigate the co-expression networks across 10 TCGA human cancer data sets including glioblastoma multiforme (GBM), ovarian carcinoma (OV), colon and rectal cancer (COADREAD), lung squamous cell carcinoma (LUSC), breast invasive carcinoma (BRCA), head and neck squamous cell carcinoma (HNSC), kidney renal clear cell carcinoma (KIRC), bladder urothelial carcinoma (BLCA), uterine corpus endometrioid carcinoma (UCEC) (Supplementary Table 1). Our analysis pipeline has four components: 1) construction of the co-expression network for each cancer data set and decomposition of the network into Co-expressed Gene Modules (CGMs) (Figure 1a); 2) identification of ‘conserved’ (or overlapping) CGMs across multiple tumor types (Figure 1b); 3) extraction of the core genes of conserved CGMs and classification of CGMs by cancer hallmark themes (Figure 1c); and 4) nomination of putative drivers of CGMs based on analysis of the human protein interaction network (Figure 1d).
Overall, we identified a set of conserved cancer CGMs across multiple tissue lineages. These conserved cancer CGMs are enriched with distinct gene ontology and cancer hallmark themes. Predicted module drivers have consistent cellular functions with their corresponding hallmarks and are enriched with cancer consensus genes and significantly mutated genes.
RESULTS
Co-expression network analysis reveals conservation of gene modules
We followed a previously described method9 and constructed gene co-expression networks on the basis of Pearson correlation between the RNA-seq expression profiles for the 10 TCGA cancer data sets (Supplementary Methods). For each cancer data set, co-expressed gene modules (CGMs) were identified in two steps: First, we performed unsupervised hierarchical clustering of the gene expression profiles using the topological overlap8,11 as a pair-wise distance measure; and then applied the dynamic tree cut method12 to decompose the dendrogram at different height and merge the clusters to form stable CGMs iteratively (Supplementary Methods). We defined 12 CGMs in GBM, 5 in OV, 10 in COADREAD, 8 in LUSC, 7 in BRCA, 9 in HNSC, 11 in BLCA, 8 in KIRC, and 9 in UCEC. In total, 79 cancer CGMs were identified by this analysis.
Next we tested whether these CGMs can be validated using data from other experimental platforms and in independent data sets. We first compared the TCGA GBM expression data (syn1446212) measured by RNA-seq and by Affymetrix microarrays. We applied the same procedures to both data sets and were able to detect the majority of the CGMs defined using Affymetrix microarray when using RNA-seq data, suggesting that CGMs are reproducible across technical platforms (Supplementary Figure 1). We performed a similar comparison using 2 independently published GBM expression data sets (GSE4271, GSE13041) and reassuringly, as above, the majority of CGMs detected in the TCGA data were also detected in the external data sets demonstrating the robustness of the CGMs (Supplementary Figure 1).
To determine whether CGMs from different cancer data sets were composed of overlapping genes, we enumerated all pair-wise CGMs (n=2,756) from any two different cancer data sets and identified 509 pairs of CGMs with significant mutual overlap (Bonferroni corrected hypergeometric test, p < 0.05, Supplementary Materials). The list of overlapping pairs of CGMs is provided in Supplementary Table 2. Surprisingly, 63 out of 79 (82%) CGMs showed significant overlap with CGMs detected in at least two different tumor types (Figures 2 and 3). Taken together, our analysis suggests that the human cancer transcriptome is organized into modules of co-expressed genes that are conserved across tumor types.
Hierarchical organization of co-expressed gene modules
To better understand their global organization, we sought to define the relationships between these CGMs through the construction of a module-module network. The resulting network has 79 nodes, each represents one CGM, and 509 edges between significantly overlapping CGMs. Many module nodes appear to be closely connected on the resulting network (Figure 3b). To visualize the multi-level relationship among the CGMs, we transformed the module-module network into a hierarchical tree (Figure 3a) based on the random walk distance13. We partitioned the hierarchical tree into subtrees or module groups based on network modularity14 (modularity threshold=0.37, Supplementary Materials), a graph structure measure reflecting the enrichment of edges among nodes within a group (or cluster) relative to the nodes between the groups.
As shown in Figure 3b, our analysis revealed 11 groups of CGMs comprised of the 63 CGMs with high degree of mutual overlap (Figures 2 and 3). The 11 CGM groups have a median of 5 CGMs representing at least 3 different tumor types in each group. This implies that these CGM groups capture common or ‘conserved’ cellular processes across cancer types. The 11 CGM groups are shown as distinct subtrees on the transformed hierarchical tree of CGMs (Figure 3a). Notably, the red, turquoise, and orange CGMs fall within one branch delineated by the green square on Figure 3a which mostly involves immune response. The blue, purple, and magenta modules fall within another branch (enclosed by the blue square on Figure 3a) which primarily involves angiogenesis, epithelial-mesenchymal transformation (EMT) and muscle development. The hierarchical tree of identified gene modules suggests convergent functional organization of cellular programs of tumor independent of their tissue lineage.
To test whether genes belonging to each of the CGM groups are indeed co-expressed across their tumor types, we calculated the Pearson correlation for all gene pairs between their expression levels in the relevant tumor types. We confirmed the gene-gene correlation within the overlapping CGMs is significantly higher than expected from a random gene set of the same size (Kolmogorov-Smirnov test p < 2.6 × 10−16, Figure 3c).
Conserved expression modules are enriched with cancer hallmarks
To assess possible biological functions of the CGMs, we calculated gene ontology enrichment for individual CGM (Figure 4, Supplementary Materials). We also extract core module genes (Supplementary Materials) for each of the 11 CGM group and calculate its functional enrichment using DAVID tools (http://david.niaid.nih.gov/david/). We classified the CGMs into 7 functional categories based on their specific enrichment of gene ontology15 and cancer hallmarks (Supplementary Table 3): (i) cell cycle (FDR< 2.2 × 10−51, black modules of Figure 2-4); (ii) innate immune response (FDR< 7 × 10−14, red modules of Figure 2-4); (iii) adaptive immune response (FDR< 5.6 × 10−109, turquoise and orange modules of Figure 2-4); (iv) differentiation (FDR< 9 × 10−10, brown modules of Figure 2-4); (v) angiogenesis/EMT (FDR< 2 × 10−7, blue, purple, and magenta modules of Figure 2-4); (vi) metabolism (FDR<0.1 green and pink modules of Figure 2-4); (vii) phosphorylation (FDR<0.3, yellow modules). We referred to these classified CGMs as “hallmark gene modules” and highlighted their core module genes below. We provide the full list of core module genes in Supplementary Table 4.
The cell cycle modules (class i, black modules of Figure 2-4) are enriched with genes involving chromosome instability (i.e. CENPA, BUB3, NUF216-18) and DNA replication (i.e. CDC45, MCM1019). Other notable genes in this module include transcription factors MYBL2, FOXM1, E2F2 and mitotic regulators such as CCNB2, CDK1, AURKB, CDC25A and CDC25C. The innate immune modules (class ii, red modules of Figure 2-4) include multiple INF-signaling and virus-response pathway genes including IFI27, IFI6, IRF9 and several MHC class I genes including HLA-B, HLA-H, HLA-L. Mutations and somatic copy number alternations (SCNAs)(doi:10.7303/syn300013) in several HLA gene family members affect about 10% of total TCGA cancer patients (Supplementary Figure 2). The adaptive immune modules (class iii, turquoise and orange modules of Figure 2-4) are present in all 10 TCGA cancer data sets for this analysis. The core adaptive immune genes include multiple inflammatory response genes CD163, CPVL, FPR3 and members of MHC class II genes including HLA-DOA, HLA-DPA1, HLA-DPB1.
The angiogenesis/EMT modules (class iv, blue modules of Figure 2-4) are present across all 10 TCGA data sets in this study. Core genes of the angiogenesis/EMT modules include FN1, MMP2, TWIST1, and WNT5A that are involved with cell adhesion, embryonic development, angiogenesis and TGF-beta mediated signaling1. Somatic mutations and SCNAs in members of angiogenesis and TGF-beta pathway are shown to affect 37% of human cancers (Supplementary Figure 2). The differentiation modules (class v, brown modules of Figure 2-4) appear to be specific to squamous carcinoma including LUSC, HNSC and BLCA. The core genes of the differentiation module contains multiple tissue differentiation genes, for example, SOX2, TP63, KRT5, KRT16, HOXA10 and WNT signaling pathway genes WNT4 and WNT7B20,21. SOX2 is reported previously as lineage-survival proto-oncogene with essential roles in the differentiation and development of normal squamous cell22. SOX2 is located on 3q26, a frequent amplified region in HNSC and LUSC23. The mutual exclusivity between SOX2 and TP63 mutations (Supplementary Figure 2) suggests a convergent mechanism driving the differentiation transcriptional programs among a subset of squamous carcinoma.
Functions of imputed module drivers reflect hallmark themes
We next sought to nominate candidate drivers of the conserved CGMs. Previous studies suggest that disease genes in close proximity on the protein interaction network (PIN) will likely be functionally related24,25. We compiled a joint human PIN dataset from various resources including HPRD26, NCI-PID27, Cancer Cell Map from (http://cancer.cellmap.org/cellmap/), and REACTOME28 (Supplementary Material). For each of the CGM groups, we applied a network-based driver inference algorithm (NetSig, Supplementary Materials) to prioritize putative module drivers based on their proximity to the core module genes on the PIN. Figure 5 shows examples of protein subnetworks comprised of the direct protein interactions between the top candidate module drivers (red node, Figure 5) and the most connected core module genes (Supplementary Materials).
Our analysis identified a total of 318 candidate module drivers (doi:10.7303/syn1960770). Many of the top candidate drivers identified by NetSig have consistent oncogenic roles relevant to the functional classes of their CGMs. For example, top candidate drivers (red nodes, Figure 5a) of the cell cycle modules (black modules, Figure 5) include TP53, BRCA1, CDKN1A, CDK2, CDC6, PCNA and BUB1A, all of which have well-known roles in the regulation of cell cycle and genomic stability1. Notably, non-synonymous mutations and foal SCNAs of TP53 and other cell cycle drivers are present in 55% of the human cancers (Figure 6d). This highlights the role of DNA repair and genomic instability as the key enabling hallmarks of human cancer. Tumor patients carrying genetic alternations of TP53 and other cell cycle drivers tend to have increased number of SCNA events (especially deletions) (Figure 6b), which is in agreement with the observation made independently by Zack et al. Submitted. To assess the downstream transcriptional impact of the genetic alternations to the cell cycle drivers, we defined the pathway activity of a gene set in terms of the coefficient of the first principle component of the expression matrix of the pre-defined gene set29. Then we calculated the pathway activity for the cell cycle modules and the knowledge-based KEGG cell cycle pathway30. The activity profile of the cell cycle modules and the KEGG cell cycle pathway (Figure 6e) are concordant with each other and both highly up-regulated among tumor patients with genetic changes of cell cycle drivers (student t-test, p < 2.2 × 10−16). In addition, the cell cycle module genes are expressed at much higher level than adjacent non-tumor tissues across 6 TCGA tumor types with such comparison available (Figure 6f).
The protein subnetworks of the angiogenesis/EMT modules (blue and purple modules, Figure 5) share several common candidate drivers including CTNNB1, SMAD4, SRC, and ITGB1, which have known regulatory and signaling roles in TCF-beta signaling, angiogenesis and EMT1,31,32. Other notable candidate drivers in the subnetworks include RAC1, EP300, EGFR which have been shown to be frequently mutated in human cancer (doi:10.7303/syn1715784.2). The top predicted drivers of the closely related module group (magenta modules, Figure 5f and Figure 3) include several muscle development genes including DMD, DMN, SGCD and SRC among others.
The top predicted drivers of the innate immune modules (red modules, Figure 5g) include STAT2 and RNASEL among others. STAT2 is reported to mediate the innate immunity and host defense mechanism via the Type I Interferon receptor 33. RNASEL contributes in many ways to the interferon anti-virus system34. The top candidate drivers of the adaptive immune modules (turquoise and orange module, Figure 5) include LCK, CD4, CBL, PTPN11 (also known as SHP2) with important roles in diverse immune response through the antigen recognition by the T cell receptor (TCR)35. The top predicted drivers of the phosphorylation modules (yellow modules, Figure 5c) are linked to members of several major signaling pathways including AR, ESR1, PIK3R1, HSP90AA1, JAK1, STAT3, SMAD41.
The top predicted drivers of the metabolism modules (green modules, Figure 5b) include multiple members of the cohesin complex SMC1A, SMC1B, SMC3, RAD21, STAG1 and STAG2. Cohesin complex are frequently mutated in acute myeloid leukemia (AML)36 and defects in cohesin complex may underlie chromosomal instability of human cancers37,38. Other notable candidates of the subnetwork include H2AFX, HIST3H3, RHOA, MAPK8 and MAPK14. H2AFX is frequently mutated in the Pan-Cancer data by MutSig algorithm39 (doi:10.7303/syn1715784.2).
We next asked whether the module drivers are enriched with cancer genes targeted by genomic alternations. We compared the predicted module drivers with the mutation drivers identified by the MutSig algorithm in the Pan-Cancer data (doi:10.7303/syn1715784.2). There are 29 out of 318 predicted modules drivers (Supplementary Table 3) overlapping with mutation drivers (hypergeometric test p < 8 × 10−26). There are 41 out of 318 module drivers (Supplementary Table 5) overlapping with the Cancer Consensus Genes (CCG)4038 (hypergeometric test p < 1 × 10−20). Overall, the result suggests that the predicted module drivers likely capture aspects of the biology and shed new light on the functional context of the cancer drivers discovered via mutation analysis.
DISCUSSION
To our knowledge, this is the first extensive investigation of human cancer transcriptome to date. Our study defined the global transcriptome architecture of human cancer. We identified 11 groups of CGMs conserved across multiple tumor types. These conserved CGMs display distinct gene ontologies and cancer hallmark themes. The highly conserved CGMs across tumor types suggest the existence of selective force shaping the cancer transcriptome. Such conserved CGMs can be viewed as ‘building blocks’ of the cancer transcriptiome and be used as surrogate for the common cellular functions of human cancer. We noticed these conserved CGMs are significantly correlated with a large collection of published gene signatures (data not shown) although they are derived directly from the data and not dependent on any prior knowledge. As an example, we showed that the activity profile of the cell cycle modules is highly concordant with the knowledge-based KEGG cell cycle pathway (Figure 6e).
As a starting point to understand the common genetic and regulatory mechanism of these conserved CGMs, we integrated cancer transcriptome and interactome to nominated drivers for the conserved CGMs. We showed the predicted module drivers recapitulate known tumor biology and have consistent cancer-relevant functions with the hallmark themes of the conserved CGMs. Our analysis support that the cell cycle regulation and chromatin maintenance are the key system hallmarks of human cancer. Genetic alternations of TP53 and other cell cycle drivers have major downstream transcriptional impact on the cell cycle regulation from our Pan-Cancer analysis. We anticipate future studies to elucidate mechanism leading to the deregulation of the hallmark gene modules would have direct implications for the design of new drug targets and contribute to novel pathway-focused therapeutic paradigm.
METHODS SUMMARY
The 10 TCGA Pan-Cancer data sets used for our study are available from Sage Bionetworks (doi:10.7303/syn300013). The RNAseq molecular profiling was based on IlluminaHiSeq_RNAseqV2. The RNAseq data preprocessing was carried out using the SeqWare Pipeline project, version 0.7.0. The normalized level 3 RNAseq expression was based on RSEM41. Significantly recurrent SCNAs were identified using GISTIC 2.042,43 on relative total copy-number data after correction for sample purity. Significantly mutated genes were identified using MutSig39. Co-expression network analysis was based on WGCNA package44.
Author Contributions
LZ conceived the study and performed the analysis and wrote the paper.
ACKNOWLEDGEMENTS
The authors would like to acknowledge TCGA and Sage Bionetworks for the collaborative atmosphere to generate and make the PanCan12 data freeze public available and the support of all institutions involved. We would like to thank Gad Getz, Lynda Chin, Chip Stewart for the commenting on the preliminary draft of the manuscript. We would like to thank our Broad Institute colleagues Gordon Saksena, Mike Lawrence, Andrew Cherniack and members of the Broad GDAC team for their contribution to the generation of the PanCan12 data freeze during construction of the study.