Abstract
The evolutionary nature of cancer relates directly to a renewed focus on the voluminous NGS (next generation sequencing) data, aiming at the identification of explanatory models of how the (epi)genomic events are choreographed in cancer initiation and development. However, despite the increasing availability of multiple additional -omics data, this quest has been frustrated by various theoretical and technical hurdles, mostly related to the dramatic heterogeneity and temporality of the disease. In this paper, we build on our recent work on the “selectivity” relation among driver mutations in cancer progression and investigate its applicability to the modeling problem – both at the population and individual levels. On one hand, we devise an optimal, versatile and modular pipeline to extract ensemble-level progression models from cross-sectional sequenced cancer genomes. The pipeline combines state-of-the-art techniques for sample stratification, driver selection, identification of fitness-equivalent exclusive alterations and progression model inference. We demonstrate this pipeline’s ability to reproduce much of the current knowledge on colorectal cancer progression, as well as to suggest novel experimentally verifiable hypotheses. On the other hand, we prove that our framework can be applied, mutatis mutandis, in reconstructing the evolutionary history of cancer clones in single patients, as illustrated by an example with multiple biopsy data from clear cell renal carcinomas.
Introduction
Since the late seventies, evolutionary dynamics, with its interplay between variation and selection, has progressively provided the widely-accepted paradigm for the interpretation of cancer emergence and development [1–3]. Random alterations of an organism’s (epi)genome can sometimes confer a functional selective advantage to certain cells, in terms of adaptability and ability to survive and proliferate. Since the consequent clonal expansions are naturally constrained by the availability of resources (metabolites, oxygen, etc.), further mutations in the emerging heterogeneous tumor populations are necessary to provide additional fitness of different kinds that allow survival and proliferation in the unstable microenvironment. Such further advantageous mutations will eventually allow some of their sub-clones to outgrow the competing cells, thus enhancing the tumor’s heterogeneity as well as its ability to overcome future limitations imposed by the rapidly exhausting resources. Competition, predation, parasitism and cooperation are indeed often observed in co-existing cancer clones [4].
In the well-known vision of Hanahan and Weinberg [5,6], the phenotypic stages that characterize this multistep evolutionary process are called hallmarks. These can be acquired by cancer cells in many possible alternative ways, as a result of a complex biological interplay at several spatio-temporal scales that is still only partially deciphered [7]. In this framework, we distinguish alterations driving the hallmark acquisition process (i.e., drivers) by activating oncogenes or inactivating tumor suppressor genes, from those that are transferred to sub-clones without increasing their fitness (i.e., passengers) [8]. Driver identification is a modern challenge of cancer biology, as distinct cancer types exhibit very different combinations of drivers, some cancers display mutations in hundreds of genes [9], and the majority of drivers are mutated at low frequencies (“long tail” distribution), which prevents their detection by examining recurrence at the population level [10]. One can also use the evolutionary models to characterize what may be called anti-hallmarks – the phenotypes that are made possible by the variational processes, but are rarely found to be selected [11]. For instance, certain collections of driver mutations, whose individual members are often present in the patient genomes, are never seen jointly. These anti-hallmarks point to tumors’ vulnerabilities and, thus, to novel targets for therapeutic interventions.
Cancer clones harbour distinct types of “alterations”. The somatic ones involve either few nucleotides or larger chromosomal regions, and are usually catalogued as mutations – i.e., Single Nucleotide Variants (SNVs) and Structural Variants (SVs) at multiple scales (insertions, deletions, inversions, translocations) – of which only some are detectable as Copy Number Alterations (CNAs), which appear to be most prevalent in many tumor types [12]. Epigenetic alterations, such as DNA methylation and chromatin reorganization, also play a key role in the process [13]. The overall picture is confounded by factors such as genetic instability [14], aneuploidy and tumor-microenvironment interplay [15], the latter involving stromal and immune-system cells with strong influence on the final effect of mutations [16]. Furthermore, spatial organization and tissue specificity play an essential role in tumor progression as well [17].
In this scenario, genomic alterations are related to the phenotypic properties of tumor cells via the structure and dynamics of functional pathways, in a process which has been only partially characterized [18–21]. In general, as there exist many equivalent ways to disrupt signaling and regulatory pathways, many mutations can provide equivalent fitness to cancer cells, leading to alternative routes to selective advantage across a population of tumors [22]. Practically, if multiple genes are equally functional for the same biological process, when any of those is altered the selection pressure on the others is diminished or even nullified [23]. Such genes, e.g., apc/ctnnb1 in colorectal cancer [24], therefore show a trend of exclusivity across a cohort – with few cases of co-occurring alterations. The same applies when disruptive alterations hit the same gene, e.g., PTEN’s mutations and deletions in prostate cancer [25].
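The exclusivity trend described above can be illustrated with a minimal sketch. It assumes a 0/1 samples-by-alterations matrix and simply counts how many samples carry at least one alteration of a group, and how many of those carry exactly one. The function name and the toy values are hypothetical, and this descriptive summary is not the statistical test used by tools such as MUTEX or MEMO.

```python
def exclusivity_summary(matrix, columns):
    """Summarise how exclusive a group of alterations is across samples.

    matrix  : list of rows, one per sample, with 0/1 entries
    columns : indices of the alterations forming the candidate group
    """
    n = len(matrix)
    hits = [sum(row[c] for c in columns) for row in matrix]
    covered = sum(1 for h in hits if h >= 1)    # samples with any alteration
    exclusive = sum(1 for h in hits if h == 1)  # samples with exactly one
    return {
        "coverage": covered / n,
        "exclusive_fraction": exclusive / covered if covered else 0.0,
    }


# Toy data (illustrative only): columns are two alterations of one pathway.
toy = [[1, 0], [1, 0], [0, 1], [1, 1], [0, 0]]
summary = exclusivity_summary(toy, [0, 1])
```

A group with high coverage and an exclusive fraction close to 1 is a candidate set of fitness-equivalent alterations, which a dedicated statistical test would then need to confirm.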
An immediate consequence of this state of affairs is the dramatic heterogeneity and temporality of cancer, both at the inter-tumor and at the intra-tumor levels [26]. The former manifests itself as different patients with the same cancer type displaying few common alterations. This led to the development of techniques to stratify tumors into subtypes with different genomic signatures, prognoses and response to therapy [27]. The latter refers to the noteworthy genotypic and phenotypic variability among the cancer cells within a single neoplastic lesion, characterized by the coexistence of more than one cancer clone with distinct evolutionary histories [28].
Cancer heterogeneity poses a serious problem from the diagnostic and therapeutic perspective as, for instance, it is now acknowledged that a single biopsy might not be representative of other parts of the tumor, hindering the design of effective treatment strategies [4]. Therefore, the quest for an extensive etiology of cancer heterogeneity and for the identification of cancer evolutionary trajectories is nowadays central to cancer research, and attempts to exploit the massive amount of sequencing data available through public projects such as The Cancer Genome Atlas (TCGA) [29].
Such projects involve an increasing number of cross-sectional (epi)genomic profiles collected via single biopsies of patients affected by various cancer types, which might be used to extract trends of cancer evolution across a population of samples. Higher resolution data such as multiple samples collected from the same tumor [28], as well as single-cell sequencing data [30], might be complementarily used to face the same problem within a specific patient. However, either the lack of public data or problems of accuracy and reliability currently prevent a straightforward application [31].
These different perspectives lead to different mathematical formulations of the problem of inferring a cancer progression model from genomic data, which we shall examine at length in this paper [32]. Indeed, such models can focus either on trends characteristic of a population, i.e., ensemble-level, or on clonal progression in a single patient. In general, both problems deal with understanding the temporal ordering of somatic alterations accumulating during cancer evolution, but use orthogonal perspectives and different input data – see Figure 1.
Ensemble-level cancer evolution
It may seem desirable to extract a probabilistic graphical model (PGM) explaining the statistical trend of accumulation of somatic alterations in a population of n cross-sectional samples collected from patients affected by a specific cancer. To make this problem independent of the experimental conditions in which tumors are gathered, we only consider the list of alterations detected per sample – thus, as 0/1 Bernoulli variables.
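As a minimal sketch of this encoding, assume each sample is given as the set of alterations detected in it. The input can then be arranged as a samples-by-events 0/1 matrix. The function name, sample identifiers and toy events below are made up for illustration and are not taken from any dataset.

```python
def to_binary_matrix(samples):
    """Encode per-sample alteration sets as a 0/1 matrix.

    `samples` maps a (hypothetical) sample id to the set of alterations
    detected in it. Returns (sample_ids, events, matrix), where
    matrix[i][j] is 1 if events[j] was detected in sample_ids[i].
    """
    sample_ids = sorted(samples)
    events = sorted({e for detected in samples.values() for e in detected})
    matrix = [[1 if e in samples[s] else 0 for e in events] for s in sample_ids]
    return sample_ids, events, matrix


# Toy cohort; values are illustrative only.
cohort = {
    "s1": {"apc", "kras"},
    "s2": {"apc"},
    "s3": {"apc", "kras", "tp53"},
    "s4": set(),
}
ids, events, M = to_binary_matrix(cohort)
```

Each column of `M` is then one Bernoulli variable observed across the cohort, independently of how the alteration was measured.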
Much of the difficulty lies in estimating the true and unknown trends of selective advantage among genomic alterations in the data, from such observations. This hurdle is not insurmountable if we constrain the scope to only those alterations that are persistent across tumor evolution in all sub-clonal populations, since this yields a consistent model of a temporal ordering of mutations. Therefore, epigenetic and transcriptomic states, such as hyper- and hypo-methylations or over- and under-expression, could only be used provided that they are persistent throughout tumor development [34].
Historically, the linear colorectal progression by Vogelstein is an instance of a solution to the cancer progression modeling problem [35]. That approach was later generalized to accommodate tree-models of branched evolution [36–39] and, later, further generalized to the inference of directed acyclic graph (DAG) models by Beerenwinkel and others [40–42]. We contributed to this research program with two related algorithms: CAncer PRogression Extraction with Single Edges (CAPRESE, [43]) and CAncer PRogression Inference (CAPRI, [33]), which are currently implemented in TRONCO (TRanslational ONCOlogy), an open source R package available in standard repositories [44]. Both techniques rely on Suppes’ theory of probabilistic causation to define estimators of selective advantage [45], are robust to the presence of noise in the data and perform well even with limited sample sizes. The former algorithm exploits shrinkage-like statistics to extract a tree model of progression, the latter combines bootstrap and maximum likelihood estimation with regularization to extract general directed acyclic graphs that capture branched, independent and confluent evolution. Both algorithms represent the current state-of-the-art to approach this problem, as they outperform others in speed, scale and predictive accuracy.
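To make the role of probabilistic causation more concrete, the sketch below checks the two basic conditions in Suppes’ sense on a 0/1 matrix, under the assumption that, with cross-sectional data, temporal priority of an event a over an event b is estimated from marginal frequencies, P(a) > P(b), and probability raising is P(b | a) > P(b | not a). This is only an illustration of the underlying idea: the actual CAPRESE and CAPRI estimators add shrinkage, bootstrap and regularization, and the function name and toy values are hypothetical.

```python
def suppes_conditions(matrix, a, b):
    """Check temporal priority and probability raising for a -> b.

    matrix : list of rows (samples) with 0/1 entries
    a, b   : column indices of the candidate upstream/downstream events
    """
    n = len(matrix)
    n_a = sum(row[a] for row in matrix)
    n_b = sum(row[b] for row in matrix)
    n_ab = sum(row[a] and row[b] for row in matrix)
    if n_a == 0 or n_a == n:
        raise ValueError("conditional probabilities are undefined for event a")
    p_b_given_a = n_ab / n_a
    p_b_given_not_a = (n_b - n_ab) / (n - n_a)
    return {
        "temporal_priority": n_a / n > n_b / n,          # P(a) > P(b)
        "probability_raising": p_b_given_a > p_b_given_not_a,
    }


# Toy data (illustrative only): column 0 is the candidate earlier event.
toy = [[1, 1], [1, 1], [1, 0], [1, 0], [0, 0], [0, 0]]
forward = suppes_conditions(toy, 0, 1)
backward = suppes_conditions(toy, 1, 0)
```

Probability raising is symmetric in the two events, so in this toy example it holds in both directions; the temporal priority condition is what orients the relation.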
Clonal architecture in individual patients
At the time of this writing, technical and economical limitations of single-cell sequencing prevent a straightforward application of phylogeny inference algorithms to the reconstruction of the clonal evolutionary history of genomic alterations within a single tumor [46,47]. Conversely, samples of cells collected from a single bulk tumor do not define an isogenic lineage [48] and most likely contain a large number of cells belonging to a collection of sub-clones resulting from the complex evolutionary history of the tumor, where the prevalence of a particular clone in time and its spatial distribution reflect its growth and proliferative fitness. To overcome hurdles such as this, many recent efforts have aimed at inferring the clonal signatures and prevalence in individual patients from sequencing data [28,49].
The majority of attempts employ different strategies, usually based on Bayesian inference, to relate allelic imbalance to cellular prevalence, and benefit from multiple samples per patient, taken across time or space. In particular, most tools process a set of read counts from a high-coverage sequencing experiment to estimate the Variant Allele Frequency (VAF). Some of them are based on the VAF analysis of specific SNVs [50,51]. Recent algorithms attempt to minimize the error between the observed and inferred mutation frequencies with distinct optimization procedures [52–54]. Other approaches explicitly support short-read data and different types of data, such as CNAs, SNVs and B-allele fractions [55]. Distinct techniques, instead, use genome-wide segmented read-depth information to determine mixtures of subclonal CNA profiles [56,57], while others use a generative approach to deconvolve sequencing data into clonal architectures [58]. Clearly, any of these approaches gains precision from high-coverage sequencing data, since high read counts yield high-confidence estimates of allele frequency.
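The link between coverage and confidence can be sketched with a naive binomial model of read sampling, in which the VAF is the fraction of reads supporting the variant and its standard error shrinks with the square root of the depth. This is an assumption made for illustration only: it ignores sequencing error, copy number and sample purity, which the cited tools model explicitly. Function names and read counts are hypothetical.

```python
import math


def variant_allele_frequency(ref_reads, alt_reads):
    """Fraction of reads at a position that support the variant allele."""
    depth = ref_reads + alt_reads
    if depth == 0:
        raise ValueError("no reads cover this position")
    return alt_reads / depth


def binomial_standard_error(vaf, depth):
    """Standard error of a VAF estimate under naive binomial sampling."""
    return math.sqrt(vaf * (1.0 - vaf) / depth)


# Same VAF observed at 40x and at 400x coverage (illustrative counts).
vaf = variant_allele_frequency(30, 10)
se_low_depth = binomial_standard_error(vaf, 40)
se_high_depth = binomial_standard_error(variant_allele_frequency(300, 100), 400)
```

With a tenfold increase in depth, the standard error under this model drops by a factor of about 3.2, that is, the square root of 10.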
Results
Here, we report on the design, development and evaluation of an optimal, versatile and modular pipeline which exploits state-of-the-art tools to extract ensemble-level cancer progression models from cross-sectional data. We also show its application in interpreting colorectal cancer data which, because of its high levels of heterogeneity, may be thought of as one of the most challenging case studies. Finally, we show that, in general, tools to detect cancer evolution at the ensemble-level can be effective even on single-patient data.
A pipeline to infer ensemble-level progression models
We have devised a customizable pipeline to infer ensemble-level cancer progression models from cross-sectional data with CAPRI [33]. To increase the statistical quality of its predictions, our pipeline pre-processes data to diminish the confounding effects of inter- and intra-tumor heterogeneity. At a high level, we shall thus identify: (i) biologically meaningful subtypes with similar molecular profiles via tumor stratification, (ii) the set of driver alterations and (iii) the groups of fitness-equivalent (i.e., exclusive) alterations.
Thus, this pipeline, which is briefly sketched in Figure 1 and detailed as Online Methods, is similar in spirit to those implemented by consortia such as TCGA to analyze huge populations of cancer samples [59,61]. One of the main novelties of our approach, which is only possible by the specific features of hypothesis-testing provided by CAPRI [33], is the exploitation of groups of exclusive alterations as a proxy to detect fitness-equivalent routes of cancer progression. Thus, CAPRI may be thought of as an ideal tool for efficient and theoretically-grounded investigations in population-based studies on cancer genomics.
Our approach allows one to produce a progression model for virtually every cancer subtype identified in the input cohort, which shall be characteristic of the population trends of cancer initiation and progression. In the following, we empirically characterize the efficacy of our approach in processing colorectal cancer data from the TCGA project [59], demonstrating that we were able to re-discover most of the existing body of knowledge about colorectal tumor progression or to propose further experimentally verifiable hypotheses.
Evolution in a population of MSI/MSS colorectal tumors with CAPRI
It is common knowledge that colorectal cancer (CRC) is a heterogeneous disease comprising different molecular entities. Since similar tumors are most likely to behave in a similar way, grouping tumors with homogeneous characteristics may be useful to define personalized therapies. Indeed, it is currently accepted that colon tumors can be classified according to their global genomic status into two main types: microsatellite instable tumors (MSI), further classified as high or low, and microsatellite stable (MSS) tumors (also known as tumors with chromosomal instability). This taxonomy plays a significant role in determining pathologic, clinical and biological characteristics of CRC tumors [62]. Thus, MSS tumors are characterized by changes in chromosomal copy number and show worse prognosis [63,64]. On the contrary, the less common MSI tumors (about 15% of sporadic CRC) are characterized by the accumulation of a high number of mutations and show predominance in females, proximal colonic localization, poor differentiation, tumor-infiltrating lymphocytes and a better prognosis [65]. In addition, MSS and MSI tumors exhibit different responses to chemotherapeutic agents [66,67]. Regarding molecular progression, it is also well established that each subtype arises from a distinctive molecular mechanism. While MSS tumors generally follow the classical adenoma-to-carcinoma progression (sequential apc-kras-tp53 mutations) described in the seminal work by Vogelstein and Fearon [68], MSI tumors result from the inactivation of DNA mismatch repair genes like mlh-1 [65].
We instantiated the pipeline discussed as Online Methods to process MSI-HIGH – hereby shortly denoted as MSI – and MSS colorectal tumors collected from The Cancer Genome Atlas project “Human Colon and Rectal Cancer” (COADREAD, [59]) – see Supplementary Figure S1. Details on the implementation are available as Supplementary Material, as well as source code to replicate this study. COADREAD has enough samples to implement a training/test statistical validation of our findings – see Supplementary Table S1 and Supplementary Figure S2. In brief, we split subtypes by the microsatellite status of each tumor, and select somatic mutations and focal CNAs in 33 driver genes manually annotated to 5 pathways in [59] – wnt, raf, tgf-β, pi3k and P53. Groups of exclusive alterations were scanned by MUTEX [23] (Supplementary Table S2), and fetched from [59], which used the MEMO [60] tool; groups were used to create CAPRI’s formulas, see Supplementary Table S3. Data for MSI tumors are shown in Figure 2; those for MSS tumors are shown in Supplementary Figures S3 and S4. CAPRI was run, on each subtype, by selecting recurrent alterations from the pool of 33 pathway genes and using both AIC and BIC regularizers.
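One way to picture how a group of exclusive alterations can enter the inference is as a derived 0/1 variable, obtained by combining the columns of the group with a Boolean formula. The sketch below assumes two such combinations, a soft one (“or”, at least one alteration) and a hard one (“xor”, read here as exactly one alteration). The function name, the modes and the toy values are assumptions for illustration; the formulas actually used in the study are those listed in Supplementary Table S3.

```python
def formula_column(matrix, columns, mode="or"):
    """Derive one 0/1 variable from a group of alteration columns.

    mode="or"  : 1 if at least one alteration of the group is present
    mode="xor" : 1 if exactly one alteration of the group is present
                 (an assumed reading of hard exclusivity)
    """
    derived = []
    for row in matrix:
        hits = sum(row[c] for c in columns)
        if mode == "or":
            derived.append(1 if hits >= 1 else 0)
        elif mode == "xor":
            derived.append(1 if hits == 1 else 0)
        else:
            raise ValueError(f"unknown mode: {mode!r}")
    return derived


# Toy data (illustrative only): three alterations forming one group.
toy = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 0]]
soft = formula_column(toy, [0, 1, 2], mode="or")
hard = formula_column(toy, [0, 1, 2], mode="xor")
```

The derived column can then be treated like any other event when relations of selective advantage are tested.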
The model inferred for MSS tumors is in Figure 3, the model for MSI-HIGH ones is in Figure 4. Each edge in the graph mirrors selective advantage among the upstream and downstream nodes, as estimated by CAPRI from training datasets (statistics: p < 0.05, 100 non-parametric bootstraps); only the minimum number of edges is selected to maximize the likelihood of data (see Online Methods). As statistical validation of these models, we mark those relations that display significant p-values in the test datasets, and rank them if they contribute (or otherwise) to max-likelihood. For some edges it is not possible to provide a validation, as some upstream or downstream event may be missing in the test dataset, while other edges do not show statistical evidence in the test datasets.
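The idea of a non-parametric bootstrap can be sketched as follows: samples are drawn with replacement from the cohort, a relation is re-assessed on each resampled dataset, and the fraction of resamples supporting it is reported. The sketch uses a deliberately simplified criterion (marginal-frequency ordering plus probability raising) as a stand-in for CAPRI’s actual statistics, which are described in the Online Methods; function names, the seed and the toy data are hypothetical.

```python
import random


def relation_holds(matrix, a, b):
    """Simplified criterion: P(a) > P(b) and P(b | a) > P(b | not a)."""
    n = len(matrix)
    n_a = sum(row[a] for row in matrix)
    n_b = sum(row[b] for row in matrix)
    n_ab = sum(row[a] and row[b] for row in matrix)
    if n_a == 0 or n_a == n:
        return False  # conditionals undefined: count as no support
    return n_a > n_b and n_ab / n_a > (n_b - n_ab) / (n - n_a)


def bootstrap_confidence(matrix, a, b, n_boot=100, seed=0):
    """Fraction of resampled datasets in which the relation a -> b holds."""
    rng = random.Random(seed)
    n = len(matrix)
    support = 0
    for _ in range(n_boot):
        # Draw n samples (rows) with replacement.
        resample = [matrix[rng.randrange(n)] for _ in range(n)]
        if relation_holds(resample, a, b):
            support += 1
    return support / n_boot


# Toy cohort (illustrative only): column 0 upstream of column 1.
toy = [[1, 1]] * 20 + [[1, 0]] * 10 + [[0, 0]] * 10
confidence = bootstrap_confidence(toy, 0, 1)
```

A relation that survives in most resamples is less likely to be an artifact of a few individual samples.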
Interpretation of the models
Our models capture the well-known features distinguishing MSS and MSI tumors, e.g., apc-kras-tp53 primary events and chromosomal aberrations in MSS, versus braf mutations in MSI, which lacks chromosomal alterations. Of all 33 driver genes, 15 are common to both models – e.g., apc, braf, kras, nras, tp53 and fam123b among others (mapped to pathways like wnt, mapk, apoptosis or activation of T-cell lymphocytes), although in different relationships (position in the model), whereas new (previously un-implicated) genes stood out from our analysis and deserve further research.
MSS (Microsatellite Stable). In agreement with the known literature, we identify kras, tp53 and apc as primary and pten as late events in the carcinogenesis, as well as nras and kras determining two independent evolution branches, the former being “selected by” tp53 mutations, i.e. being a downstream event in the model, the latter “selecting for” pik3ca mutations. The leftmost portion of the model links many wnt genes, in agreement with the observation that multiple concurrent lesions affecting such pathway confer selective advantage. In this respect, our model predicts multiple routes for the selection of alterations in the sox9 gene, a transcription factor known to be active in colon mucosa [69]. Its mutations are indeed selected by apc/ctnnb1 alterations or by fbxw7, an early mutated gene that both directly, and in a redundant way via ctnnb1, relates to sox9. The sox family of transcription factors has emerged as a modulator of canonical wnt/β-catenin signaling in many disease contexts, with evidence that multiple sox proteins physically interact with β-catenin and modulate the transcription of wnt-target genes, as well as evidence that wnt regulates sox expression, resulting in feedback regulatory loops that fine-tune cellular responses to β-catenin/tcf activity [70]. Also interestingly, fbxw7 has been previously reported to be involved in the malignant transformation from adenoma to carcinoma [71], and it was recently shown that SCF(Fbw7), a ubiquitin ligase complex that contains such gene, targets several oncogenic proteins including sox9 for degradation [72]; this relation has high confidence also in the test dataset. The rightmost part of the model involves genes from other pathways, and outlines the relation between kras and the pi3k pathway.
We indeed find, consistently in the training and test datasets, selection of pik3ca mutations by kras ones, as well as selection of the whole MEMO module, which is responsible for the activation of the pi3k pathway [59]. smad proteins relate either to kras or braf genes, and fam123b and tcf7l2 converge in dkk2 or dkk4, which is interesting as these four genes are implicated in the wnt signalling pathway. It is also worth pointing out that the model predicts a selection trend among sox9/arid1a and atm/fam123b; however, given that within these couples the events have very similar frequencies, it is not possible to confidently assess the direction of the selectivity relations, which, in fact, are found to be reversed in the test dataset.
MSI (Microsatellite Instable). In agreement with the current literature, braf is the most commonly mutated gene in MSI tumors [73]. CAPRI predicted convergent evolution of tumors harbouring fbxw7 or apc mutations towards deletions of the nras gene, as well as selection of smad2 or smad4 mutations by fam123b mutations, for these tumors. The role of the pi3k pathway again seems relevant to all MSI tumors. Indeed, a relation among apc and pik3ca mutations was inferred with a high confidence in both training and test datasets, consistent with recent experimental evidence pointing at a synergistic role of these mutations, which co-occur in the majority of human colorectal cancers [74]. Similarly, we consistently find a selection trend among apc and the whole MEMO module. Interestingly, both mutations in apc and erbb3 select for kras mutations, which might point to interesting therapeutic implications (see Discussion). In contrast, mutations in braf mostly select for mutations in acvr1b, a receptor that once activated phosphorylates smad proteins. It forms a receptor complex with acvr2a, a gene mutated in these tumors that selects for tcf7l2 mutations. Tumors harbouring tp53 mutations are those selected by mutations in axin2, a gene implicated in the wnt signalling pathway and related to instable gastric cancer development [75]. Inactivating mutations in this gene are important, as they provide serrated adenomas with a mutator phenotype in the MSI tumorigenic pathway [76]. Thus, our results reinforce its putative role as a driver gene in these tumors.
By comparing these models we can find similarity in the prediction of a potential new early event for CRC formation, fbxw7, as other authors have recently described [71]. This tumor suppressor is frequently inactivated in human cancers, yet the molecular mechanism by which it exerts its antitumor activity remains unexplained [77], and our models provide a new hypothesis in this respect. We also note that genes involved in these models exhibit distinctive functional features, suggesting that each one imparts alterations in different pathways in the early stages of carcinogenesis.
Private alterations of these tumors denote potentially different progression mechanisms. Mutations or CNAs specific to MSS tumors involve intracellular genes like ctnnb1 or pten. In contrast, private MSI mutations appear in membrane receptors such as acvr1b, acvr2a, erbb3, lrp5, tgfbr1 and tgfbr2, as well as in secreted proteins like igf2. This suggests that MSI tumors need to disturb cell-cell and/or cell-microenvironment communication to grow, as their lesions accumulate in private pathways like cytokine-cytokine receptor, endocytosis and tgf-β signalling pathway. On the other hand, genes specific to MSS tumors are implicated in p53, mTOR, sodium transport and inositol phosphate metabolism.
Inference of patient-specific clonal evolution with CAPRESE
We also discovered that the CAPRESE [43] algorithm can be used to successfully reconstruct the clonal architecture in individual patients, an instance of tree-phylogeny of Figure 1. This result is indicative of the power of the selective advantage scores à-la-Suppes [45], even outside the scope of cross-sectional data. We performed our analysis on data from Gerlinger et al., who have recently used multi-region targeted exome sequencing (> 70x coverage) to resolve the genetic architecture and evolutionary histories of ten clear cell renal carcinomas [49].
Besides quantification of intra-tumor heterogeneity, their work found that loss of the 3p arm and alterations of the Von Hippel-Lindau tumor suppressor gene vhl are the only events ubiquitous among their patients. In Figure 5 we show the clonal evolution estimated for one of those patients, RMH004, computed with CAPRESE (shrinkage coefficient λ = 0.5, time < 1 sec) from the Bernoulli 0/1 profiles provided in Supplementary Table 3 and Figure 4 of [49], with non-parametric bootstrap confidence (time < 6 sec). This model may be compared to the one inferred by processing the region-specific VAF with a max-mini optimization of most parsimonious evolutionary trees [78], and performing selection-by-consensus when multiple optimal solutions exist – Supplementary Figure 9 in [49]. CAPRESE requires no arbitrarily defined curation criteria to select the optimal tree, as it constructively searches for a solution which, in this case, is analogous in suggesting parallel evolution of subclones via deregulation of the swi/snf chromatin-remodeling complex – i.e., as may be noted from multiple clones with distinct PBRM1 mutations. Finally, the approach in [78] also estimates the number of non-synonymous mutations acquired on a certain edge of the tree. While our model is silent about this, it is very likely due to the limitations imposed by the lower resolution and small sample size of the data – 9 events from 8 regions, and not the VAFs for all alleles.
Single-cell synthetic data
We estimate the efficiency of our approach on single-cell sequencing data, as if they were collected from patient RMH004 (synthetic data generated from the clonal phylogeny architecture of Figure 5). To mimic the poor reliability of this technology, we apply to each sampled cell a noise model which accounts for false positives and negatives in the calls of its genomic alterations. Performance is measured as the fraction of true-positive and negative ancestry relations inferred among cells (precision and recall), as a function of the number of sequenced cells and noise level. Results indicate a very good performance even with a very small number of cells and reasonable noise levels, hinting at a promising application with this technology. Complete details for synthetic data generation and further performance measures are provided as Supplementary Material.
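A minimal sketch of this kind of experiment is given below, under the assumption that noise acts independently on each call of a cell’s 0/1 profile: a true 1 is dropped with a false-negative rate and a true 0 is turned into 1 with a false-positive rate. Precision and recall are then computed on sets of ancestry relations. Function names, rates and relation labels are hypothetical; the actual generative procedure is the one described in the Supplementary Material.

```python
import random


def add_noise(profile, fp_rate, fn_rate, rng):
    """Flip calls of a 0/1 profile with given false positive/negative rates."""
    noisy = []
    for call in profile:
        if call == 1:
            noisy.append(0 if rng.random() < fn_rate else 1)  # false negative
        else:
            noisy.append(1 if rng.random() < fp_rate else 0)  # false positive
    return noisy


def precision_recall(true_relations, inferred_relations):
    """Precision and recall of inferred (ancestor, descendant) pairs."""
    tp = len(true_relations & inferred_relations)
    precision = tp / len(inferred_relations) if inferred_relations else 0.0
    recall = tp / len(true_relations) if true_relations else 0.0
    return precision, recall


rng = random.Random(0)
clean = [1, 1, 0, 0]
unchanged = add_noise(clean, 0.0, 0.0, rng)  # no noise: profile preserved
flipped = add_noise(clean, 1.0, 1.0, rng)    # extreme noise: every call flips

# Abstract labels, not real events.
truth = {("A", "B"), ("B", "C")}
inferred = {("A", "B"), ("A", "C")}
precision, recall = precision_recall(truth, inferred)
```

Sweeping the two rates and the number of sampled cells, and averaging precision and recall over repetitions, gives performance curves of the kind summarised above.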
Discussion
In this paper, we have continued our exploration of the nature of somatic evolution in cancer, but with an emphasis on colorectal cancer and jointly with epidemiologists who study the disease. The nature of the proposed model of somatic evolution in cancer not only supports the heterogeneity and temporality seen in tumor populations, but also suggests a selectivity/causality relation that can be used in analyzing (epi)genomic data and exploited in therapy design. We have shown in this paper that our approach can be effective in extracting evolutionary trajectories for cancer progression both at the level of populations and individual patients. In the former case we have set up a pipeline to minimize the confounding effects imputable to tumor heterogeneity, and we have applied it to a highly-heterogeneous cancer such as colorectal. In the latter we have shown how our techniques can be readily applied to reconstruct clonal phylogeny from multi-sample data, with an application to clear cell renal carcinoma.
The emphasis of this work is on the population-level inference of cancer progression. Our pipeline has been able to infer the role of many known events in colorectal cancer progression, and sheds light on the roles of new players such as fbxw7, sox9 or axin2, which deserve further investigation. In colon carcinogenesis, although each model identifies characteristic early mutations suggesting different initiation events, both models appear to be “converging” in common pathways and functions such as wnt or mapk. However, each progression model recapitulates private functions, related to microenvironment communication in the case of MSI tumors and to intracellular signalling in the case of MSS tumors.
Our models might have implications also for treatment strategies. For instance, some of the relations that we observed in our models might point to cancer hallmarks to be exploited for therapy design. As an example, the interesting relation between sox9 and fbxw7 in microsatellite stable tumors, interpreted together with genes such as tp53, might point to a DNA fragmentation and cell cycle arrest hallmark as these genes are sensitive for cell-cycle regulation – via the p53 protein – and for degradation and senescence, via sox9. This would also be supported by other cancer studies since transcription factor sox9 seems to play an important role in colon cancer development [79].
Personalized treatment strategies might also benefit from our analyses. In fact, kras status is currently used as a predictive biomarker for the selection of CRC patients susceptible to be treated with anti-egfr targeted therapy [80]. However, resistance in kras wild-type tumors has been observed that could be caused by mutated genes in the same pathway other than kras [81]. Models could then be useful to detect these alternative mutated genes, like nras or erbb family, which are characteristic of the population of observed tumors.
Remarkably, we could prove the effectiveness of our approach in inferring the clonal evolutionary history of single cancer patients as well, by showing a successful application to multiple-biopsy data on clear cell renal carcinoma. We also demonstrated that, in case of single-cell synthetic data generated by sampling a real clonal phylogenetic architecture, our inference techniques provide an excellent performance with a very limited number of samples and also in presence of a certain level of experimental noise. Even if further investigations on this topic are underway, these preliminary results point at the efficiency of our algorithmic framework in inferring the clonal architecture of single cancer patients, especially in anticipation of the expected increasing availability and reliability of single-cell sequencing data.
Authors contribution
The pipeline was discovered and realized by MA’s Bioinformatics lab at University of Milan-Bicocca, within a project led and supervised by GC. GC, AG and DR designed the pipeline, GC, DR and LDS coded and executed it. Data gathering and models interpretation was done by GC, LDS, DR, AG together with VM and RSP. GM, MA, VM and BM provided overall organizational guidance and discussion. GC, AG and RSP wrote the original draft of the paper, which all authors reviewed and revised in the final form. This work follows up on an earlier project started by BM and involving BIMIB, subsequently including ICO after the 2014 School on Cancer, Systems and Complexity (CSAC).
Financial support
MA, GM, GC, AG, DR acknowledge Regione Lombardia (Italy) for the research projects RetroNet through the ASTIL Program [12-4-5148000-40]; U.A 053 and Network Enabled Drug Design project [ID14546A Rif SAL-7], Fondo Accordi Istituzionali 2009. BM acknowledges funding by the NSF grants CCF-0836649 and CCF-0926166. VM and RSP acknowledge the Instituto de Salud Carlos III supported by The European Regional Development Fund (ERDF) grants PI11–01439, PIE13/00022, the Spanish Association Against Cancer (AECC) Scientific Foundation, and the Catalan Government DURSI, grant 2014SGR647.
Online Methods
Our cancer bioinformatics pipeline is versatile and can be easily customized for multiple purposes. Below, we review how its features may be selected according to the specific research goals, input data, and cancer type.
A general pipeline to infer ensemble-level progression models
For each of n tumors (n patients) we assume relevant (epi)genetic data to be available. We do not put constraints on data gathering and selection, leaving the user to decide the appropriate “resolution” of the input mutational data. For instance, one might decide whether somatic mutations should be classified by type, or aggregated. Or, one might decide to lift focal CNAs to the wider resolution of cytobands or full arms. These choices depend on data and on the overall understanding of such alterations and their functional effects for the cancer under study, and no single all-encompassing rationale may be provided.
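The choice of input resolution can be illustrated with a minimal sketch, assuming the raw data are available as (sample, gene, mutation type) records. The function name `aggregate_events`, the record layout and the toy records are hypothetical and serve only to show how the same data yield different binary matrices depending on whether mutation types are aggregated.

```python
from collections import defaultdict

def aggregate_events(records, samples, keep_type=False):
    """Build a binary samples x events matrix from (sample, gene, type) records.

    keep_type=False merges all mutation types of a gene into a single event;
    keep_type=True keeps one event per (gene, type) pair.
    """
    altered = defaultdict(set)
    for sample, gene, mutation_type in records:
        event = f"{gene}:{mutation_type}" if keep_type else gene
        altered[sample].add(event)
    # Column order is the sorted list of all observed events.
    events = sorted({e for found in altered.values() for e in found})
    matrix = [[1 if e in altered[s] else 0 for e in events] for s in samples]
    return events, matrix

# Toy records (illustrative only, not real data).
records = [
    ("P1", "KRAS", "missense"),
    ("P1", "APC", "nonsense"),
    ("P2", "APC", "frameshift"),
    ("P3", "KRAS", "missense"),
]
samples = ["P1", "P2", "P3", "P4"]  # P4 carries none of the listed events

events_by_gene, matrix_by_gene = aggregate_events(records, samples)
events_by_type, matrix_by_type = aggregate_events(records, samples, keep_type=True)
```

With aggregation the two APC mutations collapse into one column, whereas keeping types separate produces three columns from the same records.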
Step 1: Reducing inter-tumor heterogeneity by cohort subtyping
We might wish to identify cancer subtypes in the heterogeneous mixture of input samples. In some cases the classification can benefit from clinical biomarkers, such as evidence of certain cell types [82], but in most cases we will have to rely on multiple clustering approaches at once, see, e.g., [59,61].
Many common approaches cluster expression profiles [83], often relying on non-negative matrix factorization techniques [84] or earlier approaches such as k-means, Gaussian mixtures or hierarchical/spectral clustering (see the review in [85]). For glioblastoma and breast cancer, for instance, mRNA expression subtypes correlate well with clinical phenotypes [86–88]. However, this is not always the case: in colorectal cancer, for example, such clusters do not match survival and chemotherapy response [86]. Clustering of full exome mutation profiles or smaller panels of genes might be an alternative, as was shown for ovarian, uterine and lung cancers [89, 90].
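As an illustration of the simplest of the clustering approaches mentioned above, the following is a minimal sketch of k-means (Lloyd's algorithm) on numeric sample profiles. It is not the procedure used in this work: the function names, the toy profiles and the choice of passing explicit initial centroids (to keep the example deterministic) are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equally long profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(profiles, initial_centroids, max_iter=100):
    """Assign each profile to its nearest centroid until labels stop changing."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in profiles
        ]
        if new_labels == labels:
            break  # converged: assignments did not change
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(profiles, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy two-feature profiles for four samples (illustrative values only).
profiles = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centroids = kmeans(profiles, initial_centroids=[[0.0, 0.0], [10.0, 10.0]])
```

In practice one would run several initializations and compare the resulting partitions against clinical annotations before accepting a subtype assignment.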
Step 2: selection of driver events
In subtype detection, with more alterations available it becomes easier to find similarities across n samples, as feature selection gains precision. In progression inference, instead, one wishes to focus on m ≪ n driver alterations, which also ensures an appropriate statistical ratio between sample size (n) and problem dimension (m).
Multiple tools filter out driver from passenger mutations. MutSigCV identifies drivers mutated more frequently than the background mutation rate [91]. OncodriveFM avoids such estimation but looks for functional mutations [92]. OncodriveCLUST scans mutations clustering in small regions of the protein sequence [93]. MuSiC uses multiple types of clinical data to establish correlations among mutation sites, genes and pathways [94]. Some other tools search for driver CNAs that affect protein expression [95]. All these approaches use different statistics to estimate signs of positive selection, and we suggest using them in an orchestrated way, as done in some platforms [96]. Notice that driver genes will likely differ across subtypes, mimicking the different molecular properties of each group of samples.
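The idea of flagging genes mutated more often than a background rate can be sketched with a one-sided binomial test. This is a deliberate simplification and not the method implemented by MutSigCV or by any of the tools cited above, which model gene-specific and patient-specific covariates. The function names, the gene labels, the single shared background rate and the threshold `alpha` are all assumptions for illustration.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

def candidate_drivers(mutated_counts, n_samples, background_rate, alpha=0.01):
    """Return genes whose mutation count exceeds the background expectation.

    mutated_counts maps a gene to the number of samples in which it is mutated.
    A single background_rate per gene and sample is assumed, which real tools
    do not do; no multiple-testing correction is applied in this sketch.
    """
    selected = []
    for gene, count in sorted(mutated_counts.items()):
        p_value = binomial_upper_tail(count, n_samples, background_rate)
        if p_value < alpha:
            selected.append(gene)
    return selected

# Hypothetical counts over 100 samples with a 2% background rate.
counts = {"GENE_A": 30, "GENE_B": 2}
drivers = candidate_drivers(counts, n_samples=100, background_rate=0.02)
```

A gene mutated in 2 of 100 samples is compatible with a 2% background rate, while one mutated in 30 of 100 samples is not, so only the latter is retained.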
Step 3: fitness equivalence of exclusive alterations
When working at the ensemble-level, identification of “groups of equivalent but alternative” mutually exclusive alterations is crucial, prior to progression inference [33]. A plethora of tools can be used: greedy approaches [97, 98] or their optimizations, such as MEMO, which constrains the search space with network priors [60]. This strategy is further improved in MUTEX, which scans mutations and focal CNAs for genes with a common downstream effect in a curated signalling network, and selects only those genes that significantly contribute to the exclusivity pattern [23]. Other tools, instead, employ advanced statistics or generative approaches without priors [99–104].
In the fitness equivalent groups, we distinguish between hard and soft exclusivity, the former assuming strict exclusivity among events, with random errors accounting for possible overlaps, the latter admitting co-occurrences [23]. CAPRI is the only algorithm where relations among groups of genes can be input as “testable hypotheses” via logical Boolean formulas. In this case, we can use logical connectives such as ⊕ (the logical “xor”) as a proxy for hard-exclusivity, and ⋁ (the logical “disjunction”) as a proxy for soft-exclusivity (see footnote 3). For example, these can be used to test whether colorectal tumors “start” prevalently from β-catenin deregulation, i.e., APC ⋁ CTNNB1, and if they further progress exclusively (⊕) through KRAS or NRAS alterations. In general, as this testing feature leaves the inference unbiased (see [33]), arbitrary hypotheses on significantly mutated subnetworks could be considered as well [105,106].
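How such formulas become additional columns of the input matrix can be sketched as follows, assuming each alteration is a 0/1 column over the same samples. The helper names and the toy columns are hypothetical; the sketch only covers formulas with two operands and does not reproduce how TRONCO encodes hypotheses internally.

```python
def soft_exclusivity(x, y):
    """Logical disjunction: 1 if at least one of the two alterations is present."""
    return [a | b for a, b in zip(x, y)]

def hard_exclusivity(x, y):
    """Logical xor: 1 if exactly one of the two alterations is present."""
    return [a ^ b for a, b in zip(x, y)]

def co_occurrence_fraction(x, y):
    """Fraction of altered samples carrying both alterations."""
    altered = sum(1 for a, b in zip(x, y) if a or b)
    both = sum(1 for a, b in zip(x, y) if a and b)
    return both / altered if altered else 0.0

def append_formulas(matrix, formula_columns):
    """Extend an n x m matrix with k formula columns, giving n x (m + k)."""
    return [row + [col[i] for col in formula_columns] for i, row in enumerate(matrix)]

# Toy 0/1 columns over six samples (illustrative values only).
apc    = [1, 1, 0, 0, 1, 1]
ctnnb1 = [0, 0, 1, 0, 0, 1]
kras   = [1, 0, 0, 1, 0, 0]
nras   = [0, 1, 0, 0, 0, 0]

soft = soft_exclusivity(apc, ctnnb1)   # APC OR CTNNB1
hard = hard_exclusivity(kras, nras)    # KRAS XOR NRAS
matrix = [list(row) for row in zip(apc, ctnnb1, kras, nras)]
extended = append_formulas(matrix, [soft, hard])
```

The co-occurrence fraction gives a quick check that a disjunction is a reasonable model of soft exclusivity: it should stay small if most samples show an exclusivity trend.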
Step 4: progression inference and confidence estimation
Finally, we use CAPRI to reconstruct cancer progression models of each identified molecular subtype, provided that there exists a reasonable list of driver events and of groups of fitness-equivalent exclusive alterations.
CAPRI’s input is a binary n × (m + k) matrix M with n samples, m driver alteration events (Bernoulli 0/1 variables) and k testable formulas. CAPRI first scans M pairwise to identify a set $\mathcal{S}$ of plausible selective advantage relations, which it then reduces to the most relevant ones, $\hat{\mathcal{S}} \subset \mathcal{S}$.
Construction of $\mathcal{S}$ depends on the number of non-parametric bootstrap iterations and on the confidence p-values for estimating selective advantage among input events x and y. CAPRI postulates that “x selects for y” if it estimates that “x is earlier than y” and that “x’s presence increases the probability of observing y” [45]. These conditions are implemented with the inequalities
$$p(x) > p(y) \qquad \text{and} \qquad p(y \mid x) > p(y \mid \neg x),$$
for which we get p-values by Mann-Whitney U testing. Here, p(·) is an empirical marginal probability, p(·|·) is a conditional, and ¬x is the negation of x.
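The two conditions, temporal priority and probability raising, can be checked on a pair of binary columns with a short sketch. Note the substitution: instead of the Mann-Whitney U testing named in the text, this example simply reports the fraction of non-parametric bootstrap resamples in which both inequalities hold. Function names, the seed, the number of iterations and the toy columns are assumptions for illustration and do not reflect CAPRI's implementation.

```python
import random

def marginal(col):
    """Empirical marginal probability p(x) of a 0/1 column."""
    return sum(col) / len(col)

def conditional(y, x, given):
    """Empirical p(y | x == given); None if no sample has x == given."""
    selected = [yi for xi, yi in zip(x, y) if xi == given]
    return sum(selected) / len(selected) if selected else None

def selects_for(x, y):
    """True if p(x) > p(y) and p(y | x) > p(y | not x) on the given columns."""
    p_y_x = conditional(y, x, 1)
    p_y_not_x = conditional(y, x, 0)
    if p_y_x is None or p_y_not_x is None:
        return False
    return marginal(x) > marginal(y) and p_y_x > p_y_not_x

def bootstrap_support(x, y, iterations=200, seed=0):
    """Fraction of resampled datasets in which x selects for y."""
    rng = random.Random(seed)
    n = len(x)
    hits = 0
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        hits += selects_for([x[i] for i in idx], [y[i] for i in idx])
    return hits / iterations

# Toy columns: y is only observed in samples that also carry x.
x = [1, 1, 1, 1, 0, 0, 0, 0]
y = [1, 1, 0, 0, 0, 0, 0, 0]
forward = selects_for(x, y)
backward = selects_for(y, x)
support = bootstrap_support(x, y)
```

On these columns the relation holds from x to y but not in the opposite direction, since y is rarer than x and therefore fails the temporal priority condition.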
Optimization of the set of plausible selective advantage relations $\mathcal{S}$ is central to our tolerance to false positives and negatives in the reduced set $\hat{\mathcal{S}}$. CAPRI’s implementation in TRONCO [44] selects $\hat{\mathcal{S}}$ from the relations in $\mathcal{S}$ by optimizing the score with regularization
$$\hat{\mathcal{S}} = \arg\min_{\mathcal{S}' \subseteq \mathcal{S}} \left[ -2 \log \mathcal{L}(M \mid \mathcal{S}') + \theta \, \dim(\mathcal{S}') \right],$$
where $\mathcal{L}(\cdot)$ is the model likelihood and $\dim(\cdot)$ is the number of model parameters; the estimated optimal solution is $\hat{\mathcal{S}}$.
Different values of θ lead to different tolerance to errors in $\hat{\mathcal{S}}$, the Akaike Information Criterion (AIC) being for θ = 2, the Bayesian Information Criterion (BIC) for θ = log(n). Both scores are approximately correct; AIC is more prone to overfitting, but it is also likely to provide good predictions from data and is better when false negatives are more misleading than false positives. BIC is more prone to underfitting, thus it is more parsimonious and better in the opposite cases. As often done, we suggest combining both approaches and distinguishing which relations are selected by BIC and which by AIC.
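The effect of the penalty θ can be illustrated on a single candidate relation x → y, comparing a model in which y depends on x with one in which it does not. This is a toy sketch under a Bernoulli likelihood with maximum-likelihood parameters; the function names and the example columns are assumptions, and the actual search over subsets of relations performed by TRONCO is not reproduced.

```python
from math import log

def bernoulli_loglik(values):
    """Maximized log-likelihood of a 0/1 sample under a single Bernoulli rate."""
    n, k = len(values), sum(values)
    if k == 0 or k == n:
        return 0.0  # degenerate (or empty) sample: the fitted rate is 0 or 1
    p = k / n
    return k * log(p) + (n - k) * log(1 - p)

def penalized_score(loglik, dim, theta):
    """-2 log-likelihood plus theta times the number of parameters (lower is better)."""
    return -2.0 * loglik + theta * dim

def edge_is_kept(x, y, theta):
    """Compare y modeled alone (1 parameter) with y given x (2 parameters)."""
    without_edge = penalized_score(bernoulli_loglik(y), 1, theta)
    y_given_x = [yi for xi, yi in zip(x, y) if xi == 1]
    y_given_not_x = [yi for xi, yi in zip(x, y) if xi == 0]
    loglik = bernoulli_loglik(y_given_x) + bernoulli_loglik(y_given_not_x)
    with_edge = penalized_score(loglik, 2, theta)
    return with_edge < without_edge

n = 20
theta_aic, theta_bic = 2.0, log(n)

# y strongly depends on x: the edge improves the score under both penalties.
x_dep = [1] * 10 + [0] * 10
y_dep = [1] * 9 + [0] * 11
# y is independent of x: the extra parameter is not worth its penalty.
x_ind = [1, 0] * 10
y_ind = [1, 1, 0, 0] * 5

kept_dependent = (edge_is_kept(x_dep, y_dep, theta_aic),
                  edge_is_kept(x_dep, y_dep, theta_bic))
kept_independent = (edge_is_kept(x_ind, y_ind, theta_aic),
                    edge_is_kept(x_ind, y_ind, theta_bic))
```

Since log(n) exceeds 2 whenever n > 7, the BIC penalty per parameter is larger than the AIC one for realistic cohort sizes, which is why relations kept by BIC tend to be a subset of those kept by AIC.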
Model confidence can be estimated with non-parametric, parametric or statistical bootstrap [107]. These procedures re-sample datasets to assign a confidence to every selective advantage relation and to the overall model. Bootstrapped datasets are randomly generated by re-shuffling data and seed (non-parametric), just seed (statistical) or by sampling from the model (parametric). CAPRI’s other statistics include hypergeometric tests to assess how significant the overlap between pairs of alterations is.
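A hypergeometric test on the overlap between two alterations can be sketched as below. The sketch computes an upper-tail probability, that is, the chance of seeing at least the observed number of co-altered samples if the two alterations were distributed independently; whether CAPRI reports this tail or another variant is not stated here, so the choice is an assumption, as are the function name and the toy numbers.

```python
from math import comb

def hypergeom_upper_tail(overlap, n, a, b):
    """P(X >= overlap) when b samples are drawn from n, of which a carry the first alteration.

    n: total number of samples
    a: samples with the first alteration
    b: samples with the second alteration
    overlap: samples carrying both
    """
    total = comb(n, b)
    upper = min(a, b)
    favourable = sum(comb(a, i) * comb(n - a, b - i) for i in range(overlap, upper + 1))
    return favourable / total

# Toy example: 4 samples, each alteration present in 2, both present in the same 2.
p_full_overlap = hypergeom_upper_tail(overlap=2, n=4, a=2, b=2)
# An overlap of zero or more is certain, so the tail probability is 1.
p_any_overlap = hypergeom_upper_tail(overlap=0, n=10, a=3, b=4)
```

A small upper-tail probability indicates more co-occurrence than expected by chance; the complementary lower tail would instead highlight a trend toward mutual exclusivity.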
Footnotes
† Co-senior authors.
1 We mention that much attention has recently been cast on newly discovered cancer genes affecting global processes that are apparently not directly related to cancer development, such as cell signaling, chromatin and epigenomic regulation, RNA splicing, protein homeostasis, metabolism and lineage maturation [10].
2 We remark that in-vitro and in-vivo experiments could provide an optimal validation for the newly suggested selective advantage relations and hypotheses, yet this is out of the scope of the current work.
3 Logical disjunction of a set of operands is true if and only if one or more of its operands is true. For this reason, if we use it as a model of soft-exclusivity, we should also check that the majority of observations indeed shows an exclusivity trend, meaning that only few cases of co-occurring alterations are observed.