Abstract
Risk stratification model for lung cancer with gene expression profile is of great interest. Instead of the previously reported models based on individual prognostic genes, we aimed to develop a novel system-level risk stratification model for lung adenocarcinoma based on gene coexpression network. Using multiple microarray datasets obtained from lung adenocarcinoma, gene coexpression network analysis was performed to identify survival-related network modules. Representative genes of these network modules were selected and then, risk stratification model was constructed exploiting deep learning algorithm. The model was validated in two independent test cohorts. Survival analysis using univariate and multivariate Cox regression was performed using the output of the model to evaluate whether the model could predict patients’ overall survival independent of clinicopathological variables. Five network modules were significantly associated with patients’ survival. Considering prognostic significance and representativeness, genes of the two survival-related modules were selected for input data of the risk stratification model. The output of the model was significantly associated with patients’ overall survival in the two independent test sets as well as training set (p < 0.00001, p < 0.0001 and p = 0.02 for training set, test set 1 and 2, respectively). In multivariate analyses, the model was associated with patients’ prognosis independent of other clinical and pathological features. Our study presents a new perspective on incorporating gene coexpression networks into the gene expression signature, and the clinical application of deep learning in genomic data science for prognosis prediction.
1. Introduction
Recently, risk stratification based on gene expression profiles is of major biomedical interest in lung cancer research (1–6). Previous studies developed risk stratification models that mostly focused on individual prognostic genes. However, these studies have not fully considered the nature of biological networks and their systematic properties. Since it is more evident that biological processes are derived from numerous interactions between many cellular components, gene network analysis could provide valuable information about cancer pathogenesis and therapeutic interventions (7). Among the various biological networks, gene co-expression network has some strengths: not relying on prior information about genes, avoiding biologically wrong assumptions about independence of gene expression levels, and alleviating multiple testing problems (8).
Lung cancer – mainly, non-small cell lung cancer – is one of the most common cancers and is the leading cause of cancer-related death worldwide (9, 10). Currently, TNM staging system is a universal guideline for prognosis prediction and treatment decision. However, heterogeneous molecular features of lung cancer require diverse adjuvant treatment options and lead to different prognosis even in the same stage (11). Hence, there has been a constant need for developing better risk stratification models to predict accurate prognosis and to improve cancer-related survival.
The main objectives of this study were 1) to identify survival-related gene co-expression network modules, 2) and to propose a deep learning (DL)-based risk stratification model reflecting survival-related network modules. Using public microarray datasets from the Gene Expression Omnibus (GEO), we identified survival-related network modules of lung adenocarcinoma. Subsequently, we constructed DL-based prognostic score using representative genes of survival-related network modules and it showed great prognostic property in all cohorts.
2. Materials and Methods
2.1. Gene expression data and preprocessing
Microarray data sets were searched from the National Center for Biotechnology Information GEO database (https://www.ncbi.nlm.nih.gov/geo/) (12) using keywords ‘lung cancer’, ‘lung adenocarcinoma’, or ‘adenocarcinoma’. We searched for studies analyzed by single platform (Affymetrix HG-U133A Plus 2.0) in order to obtain high proportion of overlapping genes. In total, eleven microarray data sets were included and raw gene expression data were downloaded from the GEO data repository for preprocessing step (13–21). We selected two microarray data sets with survival information (accession number GSE31210 (20) and GSE30219 (21)) as independent test sets, and the others (13–19) with or without survival information as the training set. For those microarray data sets containing multiple histologic types of lung cancer, only the samples from adenocarcinoma were extracted. Detailed information of data source used in this study can be found in Supplementary Table 1. All available clinico-pathological variables (age, sex, smoking status, stage, and molecular subtypes) and survival information (survival status and duration) were compiled from each microarray data sets using ‘GEOquery’ package (22) (Table 1).
We generated the training set by assembling nine microarray data sets through stepwise preprocessing method described below. The raw gene expression data from microarray data sets were called and normalized using robust multichip average method using the ‘affy’(23) package. On a study-by-study basis, we removed invalid and duplicated probe sets by ‘featureFilter’ function in ‘genefilter’ package (24) and mapped array probe sets for the respective gene symbols. In addition, to remove poor quality probes, we filtered out probe sets with low expression level (signal intensity < log2(100) in at least 5% of samples within at least one study) and low variability (interquartile range < 0.5). As we combined microarray data from different studies, we performed additional normalization using Combat algorithm (25) in order to eliminate potential batch effects. Lastly, we detected the outliers by calculating the inter-array correlation based on Pearson’s correlation coefficient for all samples, and removed them. As a result, the training set contained 4615 probe sets from 510 lung adenocarcinoma samples including 273 samples with available survival information.
The raw gene expression data of both test sets were called and normalized as the same method with the training set. One outlier sample was removed from the test set 2; consequently, the test set 1 and 2 included 226 and 84 lung adenocarcinoma samples respectively.
2.2. Weighted gene co-expression network construction from the training set
We used weighted gene co-expression network analysis (WGCNA) package (26, 27) to build a weighted gene co-expression network from the training set. We created a correlation matrix on the basis of Pearson’s correlation coefficient for all pair-wise genes across all samples. The power –the key parameter for weighted network– was selected to optimize both the scale-free topology and sufficient node connectivity and we chose a threshold of 6 in this study (Supplementary Fig. 1). The correlation matrix was transformed into adjacency matrix (matrix of connection strength) using the power function, and pair-wise topological overlap (TO) between genes were calculated. We identified network modules using hierarchical clustering method with TO dissimilarity as the distance measure. The modules were detected using dynamic tree cut algorithm (28) in WGCNA package, defining height cutoff value of 0.99, deep split as 2, and minimum module size cutoff value of 30. Genes that were not assigned to any module were classified to color gray (Fig. 1).
2.3. Identification and validation of survival-related network modules
For each module, we summarized the module expression profile by one representative gene, module eigengene (ME), which is the first principal component of the expression matrix of the corresponding module. We used ME as the representative of each module to evaluate association with overall survival (OS). The survival-related network modules were identified using Cox regression analysis in the training set. For validation, the same genes included in the network construction were extracted from each test set, and assigned to the modules identified in the previous step. ME was calculated based on the expression profile of each test set, and the association between ME and OS was evaluated using Cox regression analysis to see whether the modules identified from the training set are also associated with OS in each test set. The modules with uncorrected p-value under 0.05 were regarded as significant survival-related network modules.
2.4. Functional annotation and network visualization of survival-related network modules
The enrichment of the gene ontology terms in each module were evaluated based on the hypergeometric test using ‘clusterProfiler’(29) package. The gene ontology biological process terms at false discovery rate under < 0.05 in each survival-related module were regarded as significantly enriched terms. The network of two common survival-related network modules (red and turquoise) was visualized with Cytoscape Software 3.4.0 (30).
2.5. Representative genes selection for risk stratification model construction
Representative genes of the survival-related network modules were selected to construct risk stratification model. Degree of representativeness of genes in each module was calculated by gene module membership (GMM), a correlation coefficient between gene expression profile and module eigengene. Additionally, the relationship between GMM and prognostic significance (p-value) of an individual gene was tested. Prognostic significance of gene was measured by univariate Cox regression analysis for overall survival. Pearson correlation analysis was performed between GMM and prognostic significance for every gene. We selected top 10 genes according to the GMM from the modules which showed significant correlation between GMM and prognostic significance. Accordingly, expression levels of the selected genes in the same network module were highly correlated to each other, and they could be also highly associated with prognosis because of strong correlation between GMM and prognostic significance. The expression levels of selected genes were used for risk stratification model based on DL.
2.6. DL-based risk stratification model
Expression profiles of representative genes were used for the input of the deep learning because they were expected to preserve co-expression patterns and to reflect the systematic properties of survival-related network modules. Convolutional neural network (CNN) was specifically used to extract gene expression patterns of modules. It finally produced gene network prognostic score (NetScore).
DL framework was based on a nonlinear proportional hazard model, which assumed hazard function (λ), a product of a time-dependent baseline hazard function (λ0) and a risk function determined by covariates:
λ(x, t) = λ0(t) × eh(x). Conventional Cox model for the risk stratification using multiple covariates (x1,x2, … xn) estimates the risk function h(x) by a combination of linear functions.
DL-based risk stratification modeling also adopts proportional hazard model, however, replaces linear risk function with the output of neural network (31). We designed a simple CNN to estimate risk function, hθ(x). Firstly, 1-dimensional convolutional filters were applied. Filter size was same as the input length, 10. Thus, the number of the output of the first layer was same as the number of convolutional filters. Genes in different modules were inputted as different channels. We set the number of filters were 24. The outputs of convolutional layer were hierarchically connected to three fully-connected (FC) layers. Each FC layer had 24 nodes except final output layer. For FC layers, a dropout function was applied to reduce overfitting and learn more robust features. This function randomly drops the connections with predefined probability. We set the probability as 0.5. The final output of CNN, hθ(x), was a single node.
The CNN model was trained by the RMSprop algorithm (32). The model was optimized to minimize the loss function, negative log partial likelihood.
Ei=1 represents that the event has occurred in individual i at event time Ti. j ∈ R(Ti) represents that another patient j is still at risk of the event at time Ti.
Our framework was trained by initial learning rate with 1x10-4 and took 500 epochs for the training. The CNN was implemented using a deep learning library, Keras (ver. 1.0.4) with the Theano (ver. 0.8.2) backend (33).
Parameters related to training of the neural network including number of layers, nodes, training epoch and learning rate were determined by 5-fold cross-validation. Training set was randomly divided into 5 subsets. At each step, a single subset was left for testing and other four subsets were used for training. The performance of the model was measured by Harrell’s C-index of the final output score of the model (34). The optimal parameters were selected according to the maximum average C-index across the 5-fold of the loop. The predictive value of NetScore was independently validated in two test sets. C-index for each test set was also evaluated.
2.7. Comparison of predictability between DL-based model and conventional Cox proportional hazard model
Expression level of all selected genes was fitted into multivariate Cox regression model and the predictive value of the Cox model was evaluated by C-index as in DL-based model. C-index of Cox model was measured by 5-fold cross validation in the training set, and it was calculated in two test sets. C-index of Cox model was compared with that of DL-based model in each cohort (35).
2.8. Survival analysis using NetScore in all cohorts
Prognostic property of NetScore as continuous variable was evaluated by univariate Cox analysis. To define risk groups, NetScore was dichotomized using the median value in each cohort. Kaplan-Meier method was used to assess survival rates according to the risk groups and survival rate differences were assessed with the log-rank test. Additionally, independent prognostic value of NetScore was assessed by multivariate and subgroup analysis. Multivariate Cox analysis was performed using clinical and pathological variables as well as NetScore. Subgroups were divided on the basis of clinical and pathological features, and univariate Cox analysis of NetScore was performed in each subgroup.
3. Results
3.1. Gene co-expression network modules from the training set
We aimed at developing a risk stratification model based on gene co-expression networks (Fig. 1A). The networks were constructed from the training set which consists of microarray data of 510 lung adenocarcinoma samples. The clinico-pathological features of all samples from the training set are detailed in Table 1. Using WGCNA, 23 co-expression network modules were identified from the training set (Fig. 1B, Supplementary Data 1). The relationship between modules is visualized with hierarchical clustering dendrogram and heatmap of the corresponding ME (Supplementary Fig. 2).
3.2. Identification of survival-related modules from the training set and validation in test sets
Total five modules were significantly associated with OS (Fig. 1C): red (p < 0.0001), turquoise (p = 0.018), magenta (p = 0.029), black (p = 0.043), and lightgreen (p = 0.044). To validate the survival-related modules, we conducted survival analysis in two independent test sets (GSE31210 as test set 1 and GSE30219 as test set 2; n=226 and 84, respectively). Consequently, turquoise (p = 0.0005), lightgreen (p = 0.019), red (p = 0.030) modules in test set 1, and turquoise (p = 0.011) and red (p = 0.049) modules in test set 2 were significantly associated with OS (Fig. 1D).
The networks of two common survival-related network modules (red and turquoise) are presented in Fig. 2A and 2B. The significantly enriched gene ontology terms of the red module included ‘organic acid catabolic process’, ‘carboxylic acid catabolic process’, ‘small molecule catabolic process’, and the turquoise module included ‘DNA strand elongation involved in DNA replication’, ‘mitotic cell cycle phase transition’, ‘DNA-dependent DNA replication’ (Supplementary Table 2, Supplementary Data 2).
3.3. DL-based risk stratification model using representative genes of survival-related module
By measuring the correlation between gene significance for OS (p-value) and GMM in each survival-related module, we identified two modules demonstrating high correlation with statistical significance (r = 0.53, p < 1x10-19 and r = 0.35, p < 1 x 10-26 for red and turquoise module, respectively; Fig. 2C). Based on the strong correlation, we could assume that the genes with high representativeness measured by GMM has high significance for OS and are the most important elements of the module; therefore, we selected top 10 genes according to GMM from the red and turquoise modules for the DL-based risk stratification model construction (Supplementary Fig. 3).
The expression profiles of selected 20 genes were used as input data of the risk stratification model (Fig. 3A). NetScore, the final output of our model, was significantly associated with OS in the training and two test sets (Fig. 3B) (p < 0.00001, p < 0.0001 and p = 0.02 for training set, test set 1 and 2, respectively). Subjects were divided into two groups, high- and low-risk groups, according to the median value of NetScore in each cohort. The high-risk group was significantly associated with OS in the training set (p < 0.0001; Fig. 3C) and in test set 1 (p < 0.0001; Fig. 3D). A trend of the association was also shown in test set 2 (p = 0.054; Fig. 3E).
3.4. NetScore as an independent predictive factor for prognosis
Cox multivariate analysis revealed that the risk group was associated with OS independent of stage as well as other clinico-pathological features in the training set and test set 1 (Table 2). The independent predictive factors for OS in Cox multivariate analysis were the risk group (p = 0.001) and T-stage 3 (p = 0.030) in training set, and the risk group (p = 0.01) and EGFR mutation status (p = 0.005) in test set 1. In the test set 2, there was no feature significantly associated with OS in univariate Cox analysis, though the high risk group showed a trend of unfavorable prognosis (p = 0.06). We also evaluated the prognostic value of NetScore in subgroups divided by clinical and pathological features. In the training set, the high risk group was significantly associated with poor prognosis in subgroups regardless of age and T-stage. In all subgroups, a trend of close relationship between the risk group and OS was found except never-smoking subgroup (Fig. 4A, Supplementary Fig. 4). According to subgroup analysis in test set 1, the risk group was closely associated with OS in male, old-aged, ever/never smokers, stage IA/IB, EGFR positive and all negative mutation subgroups (Fig. 4B, Supplementary Fig. 5). A trend of association between the risk group and OS was also revealed in each subgroup of test set 2, regardless of clinical features including sex, age and T stage (Fig. 4C, Supplementary Fig. 6).
4. Discussion
In this study, we developed a risk stratification model for lung adenocarcinoma based on gene co-expression networks and deep learning. Survival-related network modules were identified in multiple cohorts and representative genes of these modules were selected for risk stratification modeling. The model constructed by deep CNN reflects gene expression patterns of survival-related network modules and it provides prognostic score, NetScore. The NetScore was significantly associated with OS in all cohorts and also an independent predictor for OS from clinico-pathological variables.
The model based on survival-related network modules can provide more robust risk stratification compared with models focusing on statistical combination of individual prognostic genes which have been proposed in the previous studies (1–6). In spite of their promising results, several models failed to validate in independent samples of other study (4). Furthermore, there were few overlapping significant prognostic genes in the previous models. A meta-analysis of published gene expression data revealed that few genes were associated with survival of lung adenocarcinoma (36). The result of few significant prognostic genes in large samples implied the limitation of usage of individual genes for risk stratification. Besides, selection of individual significant genes has a substantial problem of multiple statistical testing (37). Instead of these previous approaches, systemic approach integrating gene interaction as well as individual genes would be a breakthrough for robust risk stratification modeling because variation patterns of their expression levels can be associated with prognosis.
Functional annotation elucidated the role of specific gene network in lung adenocarcinoma pathophysiology. The red module was functionally associated with catabolic process of organic acid and carboxylic acid. As cancer cells rely on aerobic glycolysis and facilitate the metabolic process of amino acid, nucleotides and lipid for rapid proliferation, genes related to fatty-acid and amino acid metabolism could reflect progression of cancer cell (38). The turquoise module was related to DNA replication and cell-cycle. A previous gene co-expression analysis also revealed that cell-cycle related genes were closely associated with lung cancer prognosis (39). Furthermore, overexpression of cyclins has repeatedly been associated with poor prognosis in lung cancer (40). Our result also emphasizes the importance of cell-cycle genes in lung cancer prognosis.
Recently, DL has dramatically improved data analysis in genomics and imaging fields (41, 42). The main contribution of DL for our risk stratification model is firstly applying convolutional neural network to gene expression data. It was used for extracting multiple gene expression patterns by applying convolutional filters. Another contribution is to solve regression problems of survival data by using a specialized loss function (31). We compared predictive accuracy of DL-based model and conventional Cox proportional hazard model obtained from the expression level of selected 20 genes. Predictability of the DL-based model was significantly higher than that of the Cox model in test set 1 (C-index= 0.709±0.042 and 0.608±0.046, respectively; p = 0.004). It was also higher in the training set and test set 2 though the difference did not reach statistical significance (Supplementary Fig. 7). To our knowledge, NetScore is the first study that apply deep convolutional neural network to high-dimensional gene expression data for predicting prognosis. By applying this novel approach to various genomic data, risk stratification and survival prediction could be improved compared with conventional Cox model.
NetScore was trained by various samples with different clinico-pathological characteristics. We found NetScore was associated with sex, smoking status, stage and molecular subtypes (Supplementary Fig. 8). Briefly, a trend of high NetScore was found in male, smokers, late stage and KRAS mutation positive samples. Nonetheless, NetScore was significantly associated with OS independent of clinico-pathological variables according to multivariate and subgroup analyses. Of note, NetScore was significant predictor in early stage subgroups (stage IA/IB). This finding could be important because the new risk stratification could identify patients who might need adjuvant chemotherapy. For example, a recent clinical trial using 15-gene signature based on individual prognostic genes showed successful selection of patients with stage IB and II NSCLC who would most likely benefit from adjuvant chemotherapy (43). In the future, as a new prognostic biomarker based on gene network, the usefulness of NetScore should be tested whether it could affect clinical decision, and compared with the previous prognostic models using individual genes.
5. Conclusion
We developed a risk stratification model for lung adenocarcinoma using gene co-expression network. A future extension of our work would be to apply this approach to the co-expression networks of other cancer types. In terms of technical improvement, modification of DL architecture and selection process of representative genes could improve the prediction accuracy. Finally, we expected that a prospectively designed clinical trial with well-controlled clinico-pathological variables would help find clinical application of our new risk stratification model.
Competing Interests
The authors declare no competing financial interests.
Acknowledgements
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.