Abstract
Metabolomics has great potential in the development of new biomarkers in cancer. In this study, metabolomics and gene expression data from breast cancer tumor samples were analyzed, using (1) probabilistic graphical models to define associations using quantitative data without other a priori information; and (2) Flux Balance Analysis and flux activities to characterize differences in metabolic pathways. A metabolite network was built through the use of probabilistic graphical models. Interestingly, the metabolites were organized into metabolic pathways in this network, so it was possible to establish differences between breast cancer subtypes at the metabolic pathway level. Additionally, the lipid metabolism node had prognostic value. A second network associating gene expression with metabolites was built. Associations were established between the biological functions of genes and the metabolites included in each node. A third network combined flux activities from Flux Balance Analysis and metabolomics data, showing coherence between the metabolic pathways of the flux activities and the metabolites in each branch. In this study, probabilistic graphical models proved valuable for the functional analysis of metabolomics data, allowing new hypotheses to be proposed in metabolomics and associating metabolomics data with patient clinical outcomes.
Author summary
Metabolomics is a promising technique to describe new biomarkers in cancer. In this study we proposed computational methods to manage this type of data and associate it with gene expression data. We also employed a metabolic computational model to compare predictions from this model with metabolomics measurements. Finally, we built predictors of relapse based on the integration of those high-dimensional data in breast cancer patients.
Introduction
Breast cancer is one of the most common malignancies, with 266,120 estimated new cases and 40,920 estimated deaths in the United States in 2018 [1]. In clinical practice, the expression of hormonal receptors and HER2 allows the classification of this disease into three groups: hormonal receptor-positive (ER+), HER2+ and triple negative (TNBC).
Metabolomics is the most recent of the -omics disciplines. It consists of measuring the entire set of metabolites present in a biological sample [2]. The most common techniques in metabolomics experiments are mass spectrometry-related methods, which are based on the mass-to-charge ratio of each metabolite or its fragments [3]. Metabolomics is a promising tool for the development of new biomarkers [4].
We used two different methods to merge metabolomics and gene expression data in breast cancer. In previous studies, we used probabilistic graphical models (PGMs) to study differences between breast tumor subtypes and to characterize muscle-invasive bladder cancer at a functional level using proteomics data [5-7]. Flux Balance Analysis (FBA), in turn, is a method that has been widely used to study biochemical networks [8]. FBA predicts the growth rate or the rate of production of a given metabolite [9], and it has previously been used to characterize breast cancer cell responses against drugs targeting metabolism [10]. In this study, flux activities were proposed as a feasible method to compare flux patterns in metabolic pathways.
In the present study, metabolomics and gene expression data from 67 fresh tissue samples [11] were analyzed through PGMs and FBA. Our aim was to find associations between metabolomics and gene expression data.
Results
Patient characteristics
The data used in this study are from the previous work of Terunuma et al. [11]. A total of 67 paired normal and tumor fresh tissue samples from patients with breast cancer were studied. We only selected samples from tumor tissues for the present analyses.
This cohort included 67 patients, 33 ER+ and 34 ER− (of which 14 were TNBC). The median follow-up was 50 months, and 31 deaths had occurred during this time. No significant differences regarding overall survival were observed between patients with ER+ or ER− tumors. Patient characteristics are shown in Table 1.
Analysis of metabolomics data
An overall survival predictor using metabolomics data was built. This signature included five metabolites: glutamine, 2-hydroxypalmitate, deoxycarnitine, butyrylcarnitine and glycerophosphorylcholine (p-value = 0.003, hazard ratio [HR] = 0.342, cut-off = 50:50) (Fig 1). A multivariate analysis showed that the predictor provided additional prognostic information to that of the clinical data (S1 Table).
Metabolomics data, including 237 metabolites, were analyzed through PGM. The resulting network was built assigning a main metabolic pathway to each node using IMPaLA. IMPaLA is a tool that allows ontology analyses based on metabolic pathways instead of genes. Strikingly, this network had a functional structure, grouping the metabolites into metabolic pathways. The network had five nodes, each with a different overrepresented metabolic pathway (Fig 2).
The activity of each node was calculated as previously described [6, 7, 10, 12]. Significant differences were found between ER+ and ER− tumors in lipid metabolism and purine metabolism (p<0.05) (S1 Fig).
The lipid metabolism node had prognostic value in this cohort (p-value = 0.045, HR = 0.479, cut-off = 50:50) (Fig 3). Differences remained when stratified by the expression of hormonal receptors. However, a multivariate analysis did not show that the predictor supplied additional prognostic information to that of the clinical data (S2 Table).
Analyses combining gene expression with metabolomics data
A network combining metabolomics and gene expression data was built. Although most metabolites were grouped in the same node, some were integrated into gene nodes (Fig 4).
This combined network was then functionally characterized. The resulting network had eleven functional nodes and a twelfth node that grouped the metabolites (Fig 4).
Once the main functions were assigned, a literature review was performed to study the relationship between metabolites included in the gene nodes and the main function of each node. A relationship with functional nodes had been previously described for 4 of 20 metabolites: succinate, cytidine, histamine and 1,2-propanediol. The relationships between metabolites and their node function are shown in Table 2.
Flux Balance Analysis and flux activities
FBA and flux activities were calculated as previously described [10]. No significant differences were found in the tumor growth rate between ER+ and ER− tumors (S2 Fig).
Flux activities showed significant differences between ER+ and ER− in glycerophospholipid metabolism, phosphatidyl inositol metabolism, urea cycle, propanoate metabolism, pyrimidine catabolism and reactive oxygen species (ROS) detoxification (S3 Fig).
A predictor for overall survival was built with flux activities of glutamate metabolism and alanine and aspartate metabolism (p-value = 0.024, HR = 0.411, cut-off = 50:50) (Fig 5). A multivariate analysis showed that the predictor provided prognostic information independent from clinical data (S3 Table).
PGM analysis combining flux activities with metabolomics data
Using flux activities and metabolomics data, a new network was built. Interestingly, this network combined both types of data; however, flux activities appeared at the periphery of the network (Fig 6).
The resulting network was split into several branches to study the relationship of the metabolites to the flux activities included in each branch (Fig 6). Coherence between both types of data was shown, associating flux activities and metabolites related to these flux activities in the same branch. For instance, branch 1 includes glycolysis flux activity and three metabolites previously related to glycolysis (S4 Table). Regarding vitamins and cofactors, it was not possible to make comparisons because the IMPaLA label for this category is “Vitamin and co-factor metabolism”, whereas Recon2 labels differentiate between the various vitamins, labeling them as “Vitamin B6 metabolism”, “Vitamin A metabolism”, etc.
Discussion
Metabolomics is attracting considerable interest as a technique for finding new biomarkers in cancer. In this study, a new analytical workflow for the management and study of metabolomics data was proposed. This workflow allowed global metabolic characterization, beyond analyses based on individual metabolites.
Genomics and metabolomics data from Terunuma et al. have previously been used by The Cancer Genome Atlas Consortium to correlate gene expression data with metabolomics data [11, 13]. Based on this dataset, we applied PGMs for the first time to metabolomics data from tumor samples, and also to metabolomics data combined with gene expression data and flux activities, with the aim of confirming known associations and finding new ones.
First, we evaluated whether metabolomics data were related to overall survival in patients with breast cancer. An overall survival predictive signature was built that included the expression values of glutamine, deoxycarnitine, butyrylcarnitine, glycerophosphorylcholine and 2-hydroxypalmitate [14]. The first three of these metabolites have previously been related to survival in breast cancer [15, 16]. However, to our knowledge, this is the first report associating 2-hydroxypalmitate with cancer survival. Additionally, in the previous study by Terunuma et al., 2-hydroxyglutarate was associated with a poor prognosis in patients with breast cancer [11]. 2-hydroxyglutarate is a glutamine intermediate in the tricarboxylic acid cycle, involved in the conversion of glutamine into lactate, a process known as glutaminolysis [14]. These results highlight the relevance of glutamine metabolism in breast cancer prognosis.
A metabolite network using metabolomics data was built using PGM. IMPaLA assigned a dominant metabolic function to each resulting node. In previous studies, we demonstrated that PGMs are useful for functionally characterizing gene or protein networks [6, 7, 12].
However, to our knowledge, this is the first time a PGM has been applied to metabolomics data from tumor samples. Just as observed for genes and proteins, metabolites were grouped into metabolic pathways, allowing the characterization of differences in metabolic pathways between ER+ and ER− tumors. For example, both lipid metabolism and purine metabolism node activities were higher in ER− tumors. ER− tumors usually overexpress genes related to lipid metabolism [17]. Moreover, the activity of the lipid metabolism node had prognostic value. No relationship between purine metabolism and breast cancer has previously been defined.
On the other hand, the network combining gene expression data and metabolomics data grouped most of the metabolites into an isolated node. Yet, some metabolites were included in gene nodes. We found that four of the twenty metabolites showed a previously reported relationship with the main function of the gene node in which they were included. Succinate and cytidine were located in the immune response node. Succinate acts as an inflammation activation signal, inducing IL-1β cytokine production through hypoxia-inducible factor 1 [18]. In addition, succinate increases dendritic cell capability to act as antigen-presenting cells, prompting an adaptive immune response [19]. Regarding cytidine, Wachowska et al. described that 5-aza-2’-deoxycytidine modulates the levels of major histocompatibility complex class I molecules in tumor cells, induces P1A antigen and has immunomodulatory activity when combined with photodynamic therapy [20].
Both histamine and 1,2-propanediol appeared to be related to the angiogenesis node. Histamine is known to promote angiogenesis through vascular endothelial growth factor [21]. On the other hand, sulfoquinovosyl acylpropanediol, a 1,2-propanediol derivative, inhibits angiogenesis in murine models with pulmonary carcinoma [22].
FBA was used to model metabolism using gene expression data. Although FBA-predicted biomass did not show significant differences between ER+ and ER− tumors, differences in flux activities were shown between both subtypes. Some of these activities were also related to prognosis. One of these flux activities is “Glutamine metabolism”, which agrees with the results obtained from the metabolomics data, in which glutamine was included in the metabolite signature capable of predicting overall survival. With the aim of associating metabolomics and FBA results, flux activities and metabolomics data were combined to form a new network. As opposed to gene and metabolite data, metabolomics data and flux activities combined well in the network. Interestingly, flux activities are dead-end nodes, perhaps because they are by definition a final summary of each pathway. IMPaLA assigned a main metabolic pathway to the resulting branches; thus, it was possible to know how many metabolites were related to the flux activity in each branch. In most cases with available information, there was coherence between the metabolites included in the branch and its flux activity. This supports FBA and flux activities, both based on gene expression, as a method of simulating metabolism.
Our study has some limitations. The limited number of samples leads us to consider the results as preliminary, and validation in an independent cohort is needed. Additionally, our results are difficult to place in the current clinical landscape, given that tumors in the original series had not been assessed for HER2 expression. On the other hand, evolving techniques currently allow the detection of more metabolites, which would permit a more thorough analysis.
In conclusion, PGMs proved useful for the functional analysis of metabolomics data, both alone and in combination with flux or gene expression data. Therefore, PGMs are proposed as a method to generate new hypotheses in the metabolomics field. We also found that it is possible to associate metabolomics data with clinical outcomes and to build prognostic signatures based on metabolomics data.
Materials and methods
Patients included in the study
Metabolomics and gene expression data from 67 fresh tumor tissue samples originally analyzed by Terunuma et al. [11] were included in this study.
Preprocessing of gene expression and metabolomics data
Metabolomics data were log2-transformed. As a quality criterion, data were filtered to include only metabolites with detectable measurements in at least 75% of the samples. Missing values were imputed from a normal distribution using Perseus software [23]. After quality control, 237 metabolites were considered for subsequent analyses.
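The preprocessing steps above can be sketched in a few lines of Python. This is only an illustration, not the Perseus implementation: the function name `preprocess`, the per-sample imputation mode, and the `downshift` and `width` parameters (set here to 1.8 and 0.3) are assumptions, since the imputation settings are not reported in this section.

```python
import math
import random
import statistics


def preprocess(raw, min_detect_fraction=0.75, downshift=1.8, width=0.3, seed=0):
    """raw maps metabolite -> list of intensities, one per sample (None = not detected).

    Hypothetical helper: log2-transform, filter by detection rate, then impute
    missing values from a down-shifted, narrowed normal distribution per sample.
    The downshift/width values are assumptions, not settings taken from the study.
    """
    rng = random.Random(seed)
    # log2-transform detected values and keep None for missing ones
    logged = {m: [math.log2(v) if v is not None else None for v in vals]
              for m, vals in raw.items()}
    # keep metabolites detected in at least the required fraction of samples
    kept = {m: vals for m, vals in logged.items()
            if sum(v is not None for v in vals) / len(vals) >= min_detect_fraction}
    n_samples = len(next(iter(kept.values()))) if kept else 0
    for j in range(n_samples):
        observed = [vals[j] for vals in kept.values() if vals[j] is not None]
        if len(observed) < 2:
            continue  # not enough data to estimate a spread for this sample
        mu = statistics.mean(observed)
        sd = statistics.stdev(observed)
        for vals in kept.values():
            if vals[j] is None:
                # draw from a normal distribution shifted toward low intensities
                vals[j] = rng.gauss(mu - downshift * sd, width * sd)
    return kept
```

Shifting the imputation distribution downward reflects the common assumption that undetected metabolites lie near the detection limit; other imputation strategies would fit the same filtering step.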
In terms of gene expression data, the 2000 most variable genes, i.e., those genes with the highest standard deviation, were chosen to build the PGM.
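As a minimal sketch of this selection step, the following Python function ranks genes by the standard deviation of their expression across samples and keeps the top ones. The function name `most_variable` and the dictionary-based input layout are illustrative assumptions.

```python
import statistics


def most_variable(expression, n_top=2000):
    """expression maps gene -> list of expression values, one per sample.

    Returns the n_top gene identifiers with the highest standard deviation,
    ordered from most to least variable.
    """
    sd = {gene: statistics.stdev(values) for gene, values in expression.items()}
    ranked = sorted(sd, key=sd.get, reverse=True)
    return ranked[:n_top]
```

For example, `most_variable({"g1": [1, 1, 1], "g2": [0, 5, 10], "g3": [1, 2, 3]}, n_top=2)` keeps `g2` and `g3` and discards the constant gene `g1`.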
Probabilistic graphical models and gene ontology analyses
As previously described [6, 7, 10, 12], PGMs compatible with high-dimensional data were built, using correlations as associative criteria. The grapHD package [24] and R v3.2.5 [25] were employed to build the network. A majority function was assigned to each node using gene ontology analyses. In the case of genes, gene ontology analyses were performed using the DAVID web tool with “Homo sapiens” as background and GOTERM, KEGG and Biocarta selected as categories [26]. In the case of metabolites, the Integrated Molecular Pathway Level Analysis (IMPaLA) web tool was used [27].
Node activities were calculated, as previously described [6, 7, 10, 12], as the mean of the expression/quantity of genes/metabolites of each node that are related to the main node function/metabolic pathway.
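The node activity definition above can be illustrated with a short Python sketch. The function name `node_activity` and the data layout (one list of per-sample values per gene or metabolite) are assumptions made for illustration only.

```python
from statistics import mean


def node_activity(data, node_members, function_members):
    """Per-sample activity of one node.

    data maps gene/metabolite -> list of values, one per sample.
    node_members lists the features assigned to the node.
    function_members is the set of features annotated with the node's
    main function or metabolic pathway.
    """
    # only node members related to the main function contribute
    selected = [m for m in node_members if m in function_members]
    if not selected:
        raise ValueError("no node member is annotated with the main function")
    n_samples = len(data[selected[0]])
    return [mean(data[m][j] for m in selected) for j in range(n_samples)]
```

With this layout, a node containing `a`, `b` and `c`, of which only `a` and `b` belong to the main pathway, yields for each sample the mean of `a` and `b`; `c` is ignored.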
Flux Balance Analysis and flux activities
FBA was performed on the human metabolic reconstruction Recon2 [28], using the COBRA Toolbox [29] available for MATLAB. The biomass reaction proposed in Recon2 was used as the objective function. Gene-Protein-Reaction rules were solved as described in previous studies [7, 10], using a modification of the Barker et al. algorithm [30], and the resulting values were incorporated into the model by the E-flux method [31].
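To illustrate how expression data can constrain a metabolic model, the following Python sketch evaluates Gene-Protein-Reaction rules and converts the results into flux bounds. It is a simplified stand-in, not the modified Barker et al. algorithm used in this study: treating "and" as the minimum and "or" as the sum of gene expression values, normalizing by the maximum, and the function names `gpr_value` and `eflux_bounds` are all assumptions.

```python
def gpr_value(rule, expr):
    """Evaluate a rule given as a gene ID or as ("and" | "or", [sub-rules])."""
    if isinstance(rule, str):
        return expr[rule]
    op, parts = rule
    values = [gpr_value(part, expr) for part in parts]
    # assumption: complexes ("and") are limited by the scarcest subunit,
    # isoenzymes ("or") add up their contributions
    return min(values) if op == "and" else sum(values)


def eflux_bounds(rules, expr, reversible=frozenset()):
    """Map reaction -> (lower bound, upper bound) scaled to the range [-1, 1]."""
    raw = {rxn: gpr_value(rule, expr) for rxn, rule in rules.items()}
    top = max(raw.values())
    bounds = {}
    for rxn, value in raw.items():
        upper = value / top
        # reversible reactions get a symmetric lower bound
        lower = -upper if rxn in reversible else 0.0
        bounds[rxn] = (lower, upper)
    return bounds
```

The bounds produced this way would then replace the default bounds of the corresponding reactions before the biomass objective is optimized.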
Flux activities were previously proposed as a measurement to compare differences at the flux pathway level [10]. Briefly, they were calculated as the sum of the fluxes of the reactions included in each pathway defined in Recon2.
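The flux activity calculation can be sketched as follows. The function name `flux_activities` and the reaction identifiers in the example are hypothetical; in practice, the fluxes would come from the FBA solution and the pathway labels from the Recon2 annotation.

```python
from collections import defaultdict


def flux_activities(fluxes, pathway_of):
    """Sum reaction fluxes per pathway.

    fluxes maps reaction -> flux value from an FBA solution.
    pathway_of maps reaction -> pathway label.
    """
    totals = defaultdict(float)
    for reaction, flux in fluxes.items():
        totals[pathway_of[reaction]] += flux
    return dict(totals)
```

For instance, with fluxes `{"R1": 2.0, "R2": 1.5, "R3": 0.5}` and `R1` and `R2` labeled "Glycolysis", the glycolysis flux activity is 3.5.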
Statistical analyses
The statistical analyses were performed with GraphPad Prism v6, and the network analyses were performed using Cytoscape software [32]. Predictor signatures were built with the BRB Array Tool from Dr. Richard Simon’s team. All p-values are two-sided and are considered statistically significant under 0.05.
Funding statement
This study was supported by Instituto de Salud Carlos III, Spanish Economy and Competitiveness Ministry, Spain and co-funded by the FEDER program, “Una forma de hacer Europa” (PI15/01310). LT-F is supported by the Spanish Economy and Competitiveness Ministry (DI-15-07614). GP-V is supported by Conserjería de Educación, Juventud y Deporte of Comunidad de Madrid (IND2017/BMD7783). The funders had no role in the study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author contributions
All the authors have directly participated in the preparation of this manuscript and have approved the final version submitted. JMA, MD-A, HN and PM contributed the directed graphical models. LT-F, AG-P, GP-V, and AZ-M performed the statistical analyses, the graphical model interpretation and the ontology analyses. LT-F, AG-P, JAFV, PZ and EE conceived of the study and participated in its design and interpretation. MD-A and LT-F performed the Flux Balance Analysis. LT-F drafted the manuscript. AG-P, JAFV, and EE supported the manuscript drafting. AG-P and JAFV coordinated the study.
Competing interests
JAFV, EE and AG-P are shareholders in Biomedica Molecular Medicine SL. LT-F and GP-V are employees of Biomedica Molecular Medicine SL. The other authors declare no competing interests.