Abstract
We present SLAPenrich, a statistical approach implemented in an R package to search for pathways enriched for genomic alterations in large datasets. SLAPenrich mines for sample-population level enrichments, accounting for mutation rates, gene exonic lengths, and mutational mutual exclusivity. We show that SLAPenrich detects in lung adenocarcinoma pathways known to be typically altered, therapeutic targets, and associations with clinicopathological features. Finally, we explore with SLAPenrich the landscape of pathways contributing to the acquisition of the cancer hallmarks in large cohorts of patients across 10 cancer types, highlighting potential novel cancer driver genes and networks.
Background
The swift progression of next-generation sequencing technologies is enabling a fast and affordable production of an extraordinary amount of genome sequences. Cancer research is particularly benefiting from these advances, and comprehensive catalogues of somatic mutations involved in carcinogenesis, tumour progression and response to therapy are becoming increasingly available and ready to be exploited for the identification of new diagnostic, prognostic and therapeutic markers [1, 2, 3, 4]. Exploration of the genomic makeup of multiple cancer types has highlighted that driver somatic mutations typically involve few genes altered at high frequency and a long tail of more genes mutated at very low frequency [5, 6], with a tendency for both sets of genes to code for proteins involved into a limited number of biological processes [7]. As a consequence, a reasonable approach is to consider these alterations by grouping them based on prior knowledge of the cellular mechanisms and biological pathways where the products of the mutated genes operate [8]. This facilitates the identification of the spectrum of possible alterations underpinning a common cancer hallmark [9, 10], which can include genes that would be otherwise not considered on the basis of their (low) alteration frequency alone. Additionally, a pathway-based description reduces the dimensionality of large geonomic datasets involving thousands of altered genes into a sensibly smaller set of altered mechanisms that are more interpretable, and possibly actionable in a pharmacological or experimental way [11].
Existing methods
Several computational methods have been designed to perform pathway analysis on genomic data, aiming at prioritizing sets of genomically altered genes whose products operate in the same cellular process or functional network. All the approaches proposed so far toward this aim can be classified into two main classes [8].
The first class of approaches aims at identifying pathways whose composing genes are significantly over-represented in the set of altered genes across all the samples of a dataset compared against the background set of all studied genes. Generally, a Fisher’s exact test is used to calculate the statistical significance of this over-representation. Many tools exist and are routinely used to perform this analysis [12, 13, 14], sometimes incorporating additional features, such as accounting for inter-gene dependencies and signal correlations [15]. More sophisticated methods take into account the differences in pathway mutation probabilities expected by random chance [16], variations in gene lengths, and combine p-values from tests on individual samples into multiple-sample p-values [17].
The second class of approaches aims at identifying novel pathways by mapping alteration patterns on large protein networks. The combinatorial properties occurring among the alterations are then analyzed and used to define cost functions, for example based on the tendency of a group of genes to be mutated in a mutual exclusive manner. On the basis of these cost functions, optimal sub-networks are identified and interpreted as novel cancer driver pathways [18, 19, 20]. However, at the moment there is no consensual way to rigorously define a mathematical metric for mutual exclusivity and compute its statistical significance, and a number of interpretations exist [18, 19, 21, 22, 23].
SLAPenrich
Here we present SLAPenrich (Sample Level Analysis of Pathway alteration Enrichments): a computational method implemented into an R package to perform pathway analyses at the population level. We propose this tool as a mean to characterize in an easily interpretable way sparse somatic mutations detected in heterogeneous cancer sample populations, which share traits of interest and are subjected to strong selective pressure, leading to combinatorial patterns. SLAPenrich does not require somatic mutations in a pathway to be statistically enriched among those detected in each sample nor the merged (or aggregated) set of mutations in the population. Relying on the mutual exclusivity principle [24], SLAPenrich assumes that a single mutation involving a node of a pathway in an individual sample can be sufficient to deregulate the activity of the pathway, providing selective growth advantages. Hence, SLAPenrich belongs roughly to the first category described above, although it shares the mutual exclusivity consideration with the methods in the second.
SLAPenrich includes a visualization/report framework allowing an easy exploration of the outputted enriched pathways across the analyzed samples, in a way that highlights their mutual exclusivity mutation trends, and a module for the identification of core-components genes, shared by related enriched pathways.
We illustrate SLAPenrich in a case-study where we analyzed somatic mutations across lung adenocarcinoma (LUAD) patient samples from a public dataset [25]. In addition, we used SLAPenrich to characterise across 4,415 patients covering 10 cancer types the contributions of the underlying somatic mutations to the acquisition of the canonical cancer hallmarks [9, 10]. This novel ‘hallmark heterogeneity analysis’ has the potential to provide a data-driven evaluation of each hallmark across disease types, to explore how these hallmarks are achieved via different pathways, and to identify novel putative cancer driver genes and networks.
Results
Summary of the approach and analytical features
SLAPenrich assumes that an individual point mutation in a gene belonging to a given pathway can be sufficient to deregulate the activity of that pathway in the sample under consideration. More precisely, after modeling the probability of a genomic alteration in at least one member of a given pathway across the individual samples, a collective statistical test is performed against the null hypothesis that the number of samples with at least one alteration in that pathway is close to that expected by random chance, therefore no association exists between the analyzed population and the pathway under consideration. An additional advantage of modeling probabilities of at least an individual mutation in a given pathway, instead of modeling the probability of the actual number of mutated genes, is that this prevent signal saturations due to hypermutator samples.
The input to SLAPenrich is a collection of samples accounting for the mutational status of a set of genes, such as a cohort of human cancer genomes. This is modeled as a dataset where each sample consists of a somatic mutation profile indicating the status (pointmutated or wild-type) of a list of genes (Figure 1A). For a given biological pathway P, each sample is considered as an individual Bernoulli trial that is successful when that sample harbors somatic mutations in at least one of the genes belonging to the pathway under consideration (Figure 1B). An additional advantage of modeling probabilities of at least an individual mutation in a given pathway (instead of, for example, the probability of the actual number of mutated genes) is that this prevent signal saturations due to hypermutated samples.
The probability of success of each of these trials is computed by either (i) a general hypergeometric model accounting for the mutation burden of the sample under consideration, the size of the gene background population and the number of genes in the pathway under consideration, or (ii) a more refined modeling of the likelihood of observing point mutations in a given pathway, accounting for the total exonic block lengths of the genes in that pathway (Figure 1AB) and the estimated (or actual) mutation rate of the sample under consideration [26]. In addition, more sophisticated methods, accounting for example for gene sequence compositions, trinucleotide rates, and other covariates (such as expression, chromatin state, etc) can be used, through user-defined functions that can be easily integrated in SLAPenrich.
Once these probabilities have been computed, the expected number of samples in the population harboring at least one somatic mutation in P can be estimated, and its probability distribution modeled analytically. Based on this, a pathway alteration score can be computed observing the deviance of the number of samples harboring somatic mutations in P from its expectation, and its statistical significance quantified analytically (Figure 1C). Finally, the resulting statistically enriched pathways can be further filtered by looking at the tendency of their composing genes to be mutated in a mutually exclusive fashion across all the analyzed samples, as an additional evidence of positive selection [27, 18, 19]. A formal description of the statistical framework underlying SLAPenrich is provided in the Methods.
To visualize enriched pathways SLAPenrich makes use of presence/absence matrices visualised as binary heatmaps where columns indicate samples, rows indicate genes harboring at least one somatic mutation in at least one sample of the analyzed dataset, and colors indicate the absence or the presence of somatic mutations (respectively) in a given gene/sample combination. To emphasize mutual exclusivity trends among the row-wise mutation patterns, rows and columns of these heatmaps are sorted with a heuristic method (detailed in the Methods) that minimizes the superposition of mutated samples column-wisely, thus the overlaps of the mutation patterns across the rows (an example is provided in Figure 2A and described in the next section). To finally summarize the results, an analysis of the enriched-pathway core-component genes is performed. The aim of this final analysis is to visualize in the same heatmap enriched path-ways that share a frequently mutated sub-set of genes (the core-component) that is supposed to lead the pathway enrichments, together with a membership matrix specifying to which enriched pathway each core-component gene belongs to (an example is provided in Figure 2B, as described in the next section). This allows to filter out from the results those pathways that are not directly relevant to the disease under consideration, in a supervised way. A final feature of the package is the identification of pathways that are differentially enriched (thus frequently altered) across two sub-populations of samples of the same input dataset. SLAPenrich is implemented as an R package (available at https://github.com/francescojm/SLAPenrich, submission to Bioconductor is in progress). It includes different collections of pathway gene sets from multiple public available sources [28], together with all the data objects needed to run the analysis, but can be used with any user-defined collection of gene-sets. An overview of the exposed functions of the R package is provided in additional file 7.
A first case study
To show the ability of SLAPenrich to recover pathways known to be frequently altered in a given disease, and the usefulness of its differential pathway enrichment analysis, we have re-analysed a published dataset encompassing somatic mutations found in 188 lung adenocarcinoma (LUAD) patients, studied in [25].
A SLAPenrich analysis of this dataset, using the entire collection of pathways downloaded from KEGG [29] (post-processed for redundancy removal as described in the Methods), resulted into 48 significantly enriched pathways, at a FDR < 5% and a mutual exclusive coverage (EC) > 50% (additional file 1). Among these, we found pathways whose deregulation is known to be involved in lung cancer, such as tight junction (alteration score (AS) = 0.37, EC = 89%) [30] (Figure 2A), gap junction (AS = 0.45, EC = 75%) [31], and several pathways found with other methods [25, 17], such as for example focal adhesion (AS = 0.06, EC = 84%), ERBB signaling pathway (AS = 0.27, EC = 69%), dorsoventral axis formation (AS = 0.42, EC = 55%). Additionally, we found a number of path-ways not reported in the aforementioned studies but recently proposed as potential targets for lung cancer therapy such as GNRH signaling pathway (AS = 0.45, EC = 87%) [32], WNT signaling pathway (AS = 0.29, EC = 74%) [33], and VEGF signaling pathway (AS = 0.33, EC = 80%) [34].
Considering the clinical information of the samples in the analyzed dataset we stratified the corresponding patients based on their smoking status (never-smoker and current-smokers) and their bronchioalveolar carcinoma type (mucinous and non-mucinous), and performed a differential SLAPenrich analysis contrasting the variant profiles of the obtained sub-populations, using the far larger publicly available collection of pathway gene sets from Pathway Commons [28], postprocessed for redundancy removal as described in the Methods. Outcomes from the first analysis, comparing never-smoker vs. current-smokers, are reported in the additional file 2 and summarized in Figure 2C. In total we found 147 differentially enriched pathways (enriched at FDR < 5% in at least one of the two populations). Ranking these pathways according to their differential enrichment score, in decreasing order (Figure 2C) highlights, consistently with previously reported findings, in the current-smokers population a prominent enrichment of alterations in the RAS/RAF/MEK signaling cascade [35], telomerase activity [36], NOXA and PUMA signaling [37]. On the other hand, in the never-smoker population we observed prominent enrichments in EGFR signaling and EGFR-dependent endothelin signaling pathways [38].
When contrasting mucinous vs. non-mucinous BAC types (additional file 3 and additional figure 1), we again observed correct associations between the mucinous BAC type and pathway alteration enrichments in the RAS/RAF/MEK signaling cascade [39], signaling by leptin [40], PI3K and MTOR signaling pathways [41], and inflammation related pathways such as CXCR3 and GM-CSF mediated signaling. Whereas for the non-mucinous BAC type population prominent enrichments were observed in pathways involving EGFR signaling [42]. The presented analyses and results are fully detailed in the vignette of the SLAPenrich package (see Code Availability).
SLAPenrich analyses across different cancer types highlights the heterogeneity of cancer hallmark acquisition
SLAPenrich can be used to systematically analyze large cohorts of cancer genomes providing a data-driven exploration of mutated pathways that can be easily compared across cancer types. Additionally, the format of the results allows a wide range of novel investigations at a higher level of abstraction.
As an illustration of this, we performed individual SLAPenrich analyses on 10 different genomic datasets containing somatic point mutations, preprocessed as described in [43], from 4,415 patients across 10 different cancer types, from public available studies, in particular The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC). These samples (see methods) comprise breast invasive carcinoma (BRCA, 1,132 samples), colon and rectum adenocarcinoma (COREAD, 489), glioblastoma multiforme (GBM, 365), head and neck squamous cell carcinoma (HNSC, 375), kidney renal clear cell carcinoma (KIRC, 417), lung adenocarcinoma (LUAD, 388), ovarian serous cystadenocarcinoma (OV, 316), prostate adenocarcinoma (PRAD, 242), skin cutaneous melanoma (SKCM, 369), and thyroid carcinoma (THCA, 322). We observed a weak correlation (R = 0.53, p = 0.11) between the number of enriched pathways across the different analyses and the number of available samples in the analysed dataset (additional Figure 3A), but a down-sampled analysis showed that our results are not broadly confounded by the sample sizes (see methods and additional Figure 3B). Results from all these individual SLAPenrich analyses are contained in the additional file 4.
We investigated how our pathway enrichments capture known tissue specific cancer driver genes. To this aim, we used a list of high-confidence and tissue-specific cancer driver genes [43, 44] (from now high-confidence Cancer Genes, HCGs, assembled as described in the Methods). We observed that the majority of the HCGs was contained in at least one SLAPenriched pathway, across the 10 different tissues analyses (median percentage = 63.5, range = 88.5%, for BRCA, to 28.7% for SKCM) (additional Figure 3C).
Interestingly, we found that the number of SLAPenriched pathways per cancer type (median = 130, range = 55 for PRAD, to 200 for BRCA and COREAD) was independent from the average number of mutated genes per sample across cancer types (median = 46, range from 15 for THCA to 388 for SKCM) with a Pearson correlation R = 0.16 (p = 0.65), Figure 3A, as well as from the number of high confidence cancer driver genes (as predicted in [44], median = 100, range from 33 for THCA to 251 for SKCM, Figure 3B). Particularly, THCA has the lowest average number of mutations per sample (15.03), but there are 4 tissues with a lower number of pathways mutated. In contrast, SKCM has the highest average number of point mutations per sample (387.63), but the number of affected pathways is less than half of those of BRCA and GBM (82 enrichments against an average of 191), which have on average less than 100 mutations per sample (Figure 3). GBM, OV, KIRC, PRAD and BRCA are relatively homogeneous with respect to the average number of somatic mutations per sample (mean = 41.03, from 34.76 for KIRC to 45.95 for PRAD) but when looking at the number of enriched pathways for this set of cancer types we can clearly distinguish two separate groups (Figure 3). The first group includes BRCA and GBM that seem to have a more heterogeneous sets of processes impacted by somatic mutations (average number of SLAPenriched pathways = 191) with respect to the second group (63 SLAPenriched pathways on average). These results suggest that there is a large heterogeneity in the number of processes deregulated in different cancer types that is independent of the mutational burden. This might be also indicative of different subtypes with dependencies on different pathways (and at least for BRCA this is expected) but could be also biased by the composition of the analysed cohorts being representative of a selected subtype only.
Subsequently, we reasoned that since the main role of cancer driver alterations is to enable cells to achieve a series of phenotypic traits called the ‘cancer hallmarks’ [9, 10], that can be linked to gene mutations [45], it would be informative to group the pathways according to the hallmark they are associated with. Towards this end, through a computer aided manual curation (see Methods and additional file 5) we were able to map 374 gene-sets (from the most recent release of pathway commons [28]) to 10 cancer hallmarks [9, 10] (additional Figure 2AB), for a total number of 3,915 genes (included in at least one gene set associated to at least one hallmark; additional file 6). The vast majority (99%, 369 sets) of the considered pathway gene-sets were mapped on two hallmarks at most, and 298 of them (80%) was mapped onto one single hallmark (additional Figure 2C). Regarding the individual genes contained in at least one pathway gene-set, about half (49%) were associated with a single hallmark, 22% with two, 12% with three, and 7% with four(additional Figure 2D). Finally, as shown in additional Figure 2E, the overlaps between the considered pathway gene-sets was minimal (74% of all the possible pair-wise Jaccard indexes was equal to 0 and 99% < 0.2). In summary, our manual curation produced a non-redundant matching in terms of both pathways- and genes-hallmarks associations.
Mapping pathway enrichments into canonical cancer hallmarks through this curation allowed us to explore how different cancer types might acquire the same hallmark by selectively altering different pathways. examples are provided in Figure 4, and additional Figure 4. Heatmaps in these figures (one per each hallmark) show different level of enrichments of pathways associated to the same hallmark across different tissues.
We investigated at what extent the identified enriched pathways were dominated by somatic mutations in the high-confidence cancer genes (HCGs) [44], across cancer types. To this aim, for each pathway P enriched in a given cancer type T, we computed an HCG-dominance score as the ratio between the number of T samples with mutations in HCGs belonging to P and the total number of T samples with mutations in any of the gene belonging to P. Results of this analysis are shown in additional figures 5 and 6. We observed a median of 15% of pathway enrichments, across hallmarks, with an HCG-dominance score < 50%, thus not led by somatic mutations in HCGs (range from 9% for Deregulating Cellular Energetics to 21% for Genome Instability and Mutation). Additionally, a median of 3% of pathway enrichments had a null HCG-dominance, thus not involved somatic mutations in HCGs (range from 0.25% for Evading Growth Suppression to 15% for Avoiding Immune Destruction). Across all the hallmarks, the cancer type with the lowest median HCG-dominance was KIRC (33%), whereas that with the highest was THCA (91%).
Patterns and well defined clusters can be clearly distinguished in the heatmaps of Figure 4. Patterns and well defined clusters can be clearly distinguished. As an example, in Figure 4 the heatmap related to the Genome Instability and mutation hallmark shows that BRCA, OV, GBM, LUAD and HNSC might achieve this hallmark by selectively altering a group of pathways related to homologous recombination deficiency, whose prevalence in BRCA and OV is established [46]. This deficiency has been therapeutically exploited recently and translated into a clinical success thanks to the introduction of PARP inhibition as a very selective therapeutic option for these two cancer types[47]. Pathways preferentially altered in BRCA, OV, GBM, LUAD and HNSC include G2/M DNA Damage Checkpoint // Processing Of DNA Double Strand Break Ends, TP53 Regulates Transcription Of DNA Repair Genes, and other signaling networks related to BRCA1/2 and its associated RING Domain 1 (BARD1). Conversely, the Androgen receptor pathway, known to regulate the growth of glioblastoma multiforme (GBM) in men [48] is also exclusively and preferentially altered in this cancer type. The acquisition of the Genome Instability and mutation hallmark seems to be dominated in COREAD by alterations in the HDR Through Single Strand Annealing (SSA), Resolution Of D Loop Structures Through Synthesis Dependent Strand Annealing (SDSA), Homologous DNA Pairing And Strand Exchange and other pathways more specifically linked to a microsatellite instability led hypermutator phenotype, known to be prevalent in this cancer type [49]. Finally, the heatmap for Genome Instability and Mutation shows nearly no enriched pathways for SKCM. This is consistent with the high burden of mutations observed in melanoma originating from cell extrinsic processes such as UV light exposure [50]. The maintenance of genomic integrity is guarded by a network of damage sensors, signal transducers, and mediators, and it is regulated through changes in gene expression. Recent studies show that miRNAs play a crucial role in the response to UV radiation in skin cells [51]. Our analysis strikingly detects MiRNAs Involved In DNA Damage Response as the unique pathway associated to Genome instability and mutation enriched in SKCM. This suggests that mutations in this pathway, involving ATM (as top frequently mutated gene, and known to induce miRNA biogenesis following DNA damage [52]), impair the ability of melanocytes to properly respond to insult from UV light and may have a significant role in the tumourigenesis of melanoma.
The Avoiding Immune destruction heatmap (Figure 4) highlights a large number of pathways selectively enriched in COREAD, wheareas very few pathways associated to this hallmark are enriched in the other analysed cancer types. This could explain why immunotherapies, such as PD-1 inhibition, have a relatively low response rate in COREAD when compared to, for example, non-small cell lung cancer [53], melanoma [54] or renal-cell carcinoma [55]. In fact, response to PD-1 inhibition in COREAD is limited to tumours with mismatch-repair deficiency, perhaps due to their high rate of neoantigen creation [56]. In the context of COREAD, the Tumor-promoting inflammation heatmap (Figure 4) also highlights several pathways predominantly and very specificically altered in this cancer type. Chronic inflammation is a proven risk factor for COREAD and studies in animal models have shown a dependency between inflammation, tumor progression and chemotherapy resistance [57]. Indeed, a number of clinical trials evaluating the utility of inflammatory and cytokine-modulatory therapies are currently underway in colorectal cancer [58, 59]. Interestingly, according to our analysis this hallmark is acquired by SKCM by exclusively preferentially altering IRF3 related pathways.
Several other examples would be worthy of mention. For example, the detection of the Warburg effect pathway contributing to the acquisition of the Deregulating cellular energetics hallmark in GBM only (Figure 4). The Warburg effect is a unique bioenergetic state of aerobic glycolysis, whose reversion has been recently proposed as an effective way to decrease GBM cell proliferation [60]. Additionally, the pathway Formation of senescence associated heterochromatin, associated to the Enabling replicative immortality hallmark is enriched in multiple cancer types. Genomic alterations in this pathway have not been linked to cancer so far. More interestingly the enrichment of this pathway, across cancer types, is not driven by any established cancer gene.
Finally, we quantified the diversity of pathways used to achieve each hallmark in a given cancer type, via a cumulative heterogeneity score (CHS) computed as the proportion of the pathways associated to that hallmark that are enriched. The larger this score the more a given cancer type relies on altering a large number of pathways in order to achieve the considered hallmark. A larger heterogeneity of pathways, in turn, could point to the exploitment of more evolutionary trajectories (reflected by selecting genomic alterations in a large number of associated pathways). As a consequence, the larger this score the higher might be the evolutionary fitness of that hallmark for the cancer type under consideration. Joining the CHSs of all the hallmarks resulting from the analysis of a given cancer type, gives its hallmark heterogeneity signature. The hallmark heterogeneity signatures of all the 10 analysed cancer types are reported in Figure 5. Results show consistency with the established predominance of certain hallmarks in determined cancer types, such as for example a high CHS for Genome instability and mutation in BRCA and OV [61], for Tumour-promoting inflammation and Avoiding immune-destruction in COREAD [62]. Lastly, and as expected for Sustaining proliferative-signaling and Enabling replicative immortality, the key hallmarks in cancer initiation [9], high CHSs are observed across the majority of the analysed cancer types.
Taken together, these results show the potential of SLAPenrich to perform systematic landscape analyses of large cohorts of cancer genomes. In this case this is very effective in highlighting commonalities and differences in the acquisition of the cancer hallmarks across tissue types, confirming several known relations between cancer types, and pinpointing preferentially altered pathways and hallmark acquisitions.
Hallmark heterogeneity analysis points at potential novel cancer driver genes and networks
To investigate the potential of SLAPenrich in identifying novel cancer driver genes and networks we performed a final analysis (from now the filtered analysis) after removing all the variants involving, for each considered cancer type, the corresponding HCGs. Results of this exercise (Figure 6 and additional figure 7), showed that the majority of the enrichments identified in the original analyses (on the unfiltered genomic datasets) were actually led by alterations in the HCGs (consistent with their condition of high reliable cancer genes). The average ratio of retained enrichments in the filtered analyses across cancer types (true positive rates (TPRs) in Figure 6 and additional figure 7) was 21%, (range from 2.1% for GBM to 56.2% for COREAD). However, several pathway enrichments (some of which did not include any HCGs) were still detected in the filtered analysis and, most importantly, the corresponding hallmark heterogeneity signatures were largely conserved between the filtered and unfiltered analyses for most of the cancer types, with coincident top fitting hallmarks and significantly high over-all correlations (Figure 6, additional figure 7). If the hallmark signatures from the original unfiltered analyses are faithful representations of the mutational landscape of the analysed cancer types and the filtered analyses still detect this landscape despite removal of known drivers, then the filtered analyses might have uncovered novel cancer drivers in the long tail of infrequently mutated genes. In fact, these new gene modules are typically composed by groups of functionally interconnected and very lowly frequently mutated genes (Figure 7).
An example is represented by the pathway Activation Of Matrix Metalloproteinases associated with the Invasion and metastasis hallmark and highly enriched in the filtered analyses of COREAD (FDR = 0.002%), SKCM (0.09%) (Figure 7A), LUAD (0.93%), and HNSC (3.1%). The activation of the matrix metalloproteases is an essential event to enable the migration of malignant cells and metastasis in solid tumors [63]. Although this is a hallmark acquired late in the evolution of cancer, according to our analysis this pathway is still detectable as significantly enriched. As a consequence, looking at the somatic mutations of its composing genes (of which only Matrix Metallopep-tidase 2 MMP2 has been reported as harboring cancer driving alterations in LUAD [44]) might reveal novel key components of this pathway leading to metastatic transitions. Interestingly, among these, one of the top frequently mutated genes (across all the 4 mentioned cancer types) is Plasminogen (PLG), whose role in the evolution of migratory and invasive cell phenotype is established [64]. Furthermore, blockade of PLG with monoclonal antibodies, DNA-based vaccination or silencing through small interfering RNAs has been recently proposed to counteract cancer invasion and metastasis [65]. The remaining altered component of this pathway is mostly made of a network of very lowly frequently mutated (and in a high mutual exclusive manner) other metalloproteinases.
Another similar example is given by the IL 6 Type Cytokine Receptor Ligand Interactions pathway significantly enriched in the filtered analysis of SKCM (FDR = 4.6%) and associated with the Tumour-promoting inflammation hallmark (Figure 7B). IL-6-type cytokines have been observed to modulate cell growth of several cell types, including melanoma [66]. Increased IL-6 blood levels in melanoma patients correlate with disease progression and lower response to chemotherapy [67]. Importantly, studies proposed OSMR, a IL6-type of cytokine receptor, to play a role in the prevention of melanoma progression [68], and as a novel potential target in other cancer types [69]. Consistently with these findings, OSMR is the member of this pathway with the largest number of mutations in the SKCM cohort (Figure 7B), complemented by a large number of other lowly frequently mutated genes (most of which are interleukins).
In the context of melanoma, we observed other two pathways highly enriched in the filtered analysis: PDGF receptor signaling network (FDR = 2.7%) (Figure 7C) and Neurophilin Interactions with VEGF And VEGFR (0.21%)(Figure 7D), both associated with the Inducing angiogenesis hallmark. Mutations in all the components of these two pathways are not common in SKCM and have not been highlighted in any genomic study so far. The first of these two pathway enrichments is characterised by patterns of highly mutual exclusive somatic mutations in Platelet-derived growth factor (PDGF) genes, and corresponding receptors: a network that has been recently proposed as an autocrine endogenous mechanism involved in melanoma proliferation control [70].
A final example is given by the enriched pathway Regulating the activity of RAC1 (associated with the Activating Invasion and Metastasis hallmark) in COREAD (Figure 7E). The Ras-Related C3 Botulinum Toxin Substrate 1 (RAC1) gene is a member of the Rho family of GTPases, whose activity is key for cell motility [71]. Previous in vitro and in vivo studies in prostate cancer demonstrated a marked increase in RAC1 activity in cell migration and invasion, and that RAC1 inhibition immediately stopped these processes [72, 73]. However, although the role of RAC1 in enabling metastasis has already been suggested, the mechanisms underlying such aberrant behaviour are poorly understood, and our findings could be used as a starting point for further investigations [74].
Another interesting case is the high level of mutual exclusivity observed in the mutation patterns involving members of the TP53 network, highly enriched in the filtered analysis of SKCM, encompassing TP63, TP73, TNSF10, MYC and SUMD1 (Figure 7F). Whereas alterations in some nodes of this network are known to be an alternative to p53 repression, conferring chemoresistance and poor prognosis [75], dissecting the functional relations between them is still widely considered a formidable challenge [76]. Our results point out alternative players worthy to be looked at in this network (particularly, among the top frequently altered, TNSF10).
Taken together, these results suggest that SLAPenrich can help to identify potential novel cancer driver genes and cancer driver networks composed by lowly frequently mutated genes.
Discussion
In this paper we present SLAPenrich, a statistical framework implemented in an open source R package to identify pathway enrichment in genomic datasets at the population level. SLAPenrich does not seek for pathways whose alterations are enriched at the individual sample level nor at the global level, i.e. considering the union of all the genes altered in at least one sample. Instead, it assumes that an individual mutation involving a given pathway in a given sample might be sufficient to deregulate the activity of that pathway in that sample and it allows enriched pathways to be mutated in a mutual exclusive manner across samples.
The SLAPenrich package includes (i) fully tunable functions where statistical significance criteria and alternative models, can be defined by the user; (ii) a visualization and reporting framework, and (iii) accessory functions for data management and gene identifier curation and cross-matching. Worthy of note is that many different tools provide the possibility of visualizing a mutual-exclusivity sorted sets of somatic mutations and other genomic alterations from publicly available or user defined datasets via a browser accessible software suite (e.g. GiTools [77] and cBioPortal [78]) or as a result of combinatorial pattern analysis (such as MEMo [79] and Dendrix [19]). However, none of these tools offer this feature as a mean to visualise an arbitrarily defined data matrix and, to our knowledge, there is no publicly available R implementation for this.
Beyond extraction of pathway enrichments, as illustrated with our case study on a lung cancer data set, SLAPenrich can also be used to perform large-scale comparative analysis of cancer mutational landscapes. We used this feature to perform a characterization of the mutational status of cancer hallmarks. The obtained results provided a first data-driven landscape of hallmark acquisitions through the preferential alteration of heterogenous sets of pathways across cancer types, confirming established hallmark predominancies, and detecting peculiar pattern of altered pathways for certain cancer types. Finally, by using the identified hallmark signatures as the gold-standard signal, we re-performed SLAPenrich analyses after the removal of variants in established cancer genes and highlighted genes that contributed to those hallmarks as potential novel cancer driver genes and networks.
A number of possible limitations could hamper deriving definitive conclusions from this paper, such as the use of only mutations, the possibility that some of the analysed cohorts of patients are representative only of well-defined disease subtypes, or the limitation of our knowledge of pathways. Nevertheless, we provide the community with a useful tool for the analysis of large genomic datasets, whose produced results (as in our hallmark analysis presented here) could open a wide range of novel in-silico investigations.
Conclusions
SLAPenrich should be of wide usability for the functional characterization of sparse genomic data from heterogeneous populations sharing common traits and subjected to strong selective pressure. As an example of its applicability we have studied the large cohorts of publicly available cancer genomes patient data that is publicly available in the TCGA. However, SLAPenrich could be of great utility in other scenarios such as the characterizion of genomic data generated upon chemical mutagenesis to identify somatic mutations involved in acquired drug resistance [Brammeld et al, under revision http://dx.doi.org/10.1101/066555]. More generally, SLAPenrich can be used to characterize at the pathway level any type of biological dataset that can be modeled as a presence/absence matrix, where genes are on the rows and samples are on the columns.
Methods
Formal description of the SLAPenrich statistical framework
Let us consider the list of all the genes G = {g1, g2, …, gn}, whose somatic mutational status has been determined across a population of samples S = {s1, s2, …, sm}, and a function f: G × S → {0, 1} defined as
Given the set of all the genes whose products belong to the same pathway P, we aim at assessing if there is a statistically significant tendency for the samples in S to carry mutations in P. Importantly, we do not require the genes in P to be significantly enriched in those that are altered in any individual sample nor in the sub-set of G composed by all the genes harbouring at least one somatic mutation in at least one sample. In what follows P will be used to indicate the pathway under consideration as well as the corresponding set of genes, interchangeably. We assume that P is altered in sample si if ∃g ∈ G such that g ∈ P and f (g, si) = 1, i.e. at least one gene in the pathway P is altered in the i-th sample (Figure 1B). To quantify how likely it is to observe at least one gene belonging to P altered in sample si, we introduce the variable Xi = |{g ∈ G: g ∈ P and f (g, si) = 1}|, accounting for the number of genes in P altered in sample si. Under the assumption of both a gene-wise and samplewise statistical independence, the probability of Xi assuming a value greater or equal than 1 is given by: where N is the size of the gene background-population, k is the number of genes in P, ni is the total number of genes g such that f (g, si) = 1, i.e. the total number of genes harbouring an alteration in sample si, and H is the probability mass function of a hypergeometric distribution:
To take into account the impact of the exonic lengths λ(g) of the genes (g) on the estimation of the alteration probability of the pathway they are part of P, it is possible to redefine the pi probabilities (of observing at least one genes in the pathway P altered in sample where N′ = Σ g∈G λ(g), with G the gene background-population, i.e. the sum of all the exonic content block lengths of all the genes; k′ = Σ g∈P λ(g) is the sum of the exonic block length of all the genes in the pathway P; n′ is the total number of individual point mutations involving genes belonging to P in sample si, and H is defined as in equation 3, but with parameters x, N′, k′, and n′i. Similarly, the pi probabilities can be modeled accounting for the total exonic block lengths of all the genes belonging to P and the expected/observed background mutation rate [26], as follows: where k′ is defined as for equation 4 and ρ is the background mutation rate, which can be estimated from the input dataset directly or set to established estimated values (such as 10−6/nucleotide)[26].
If considering the event “the pathway P is altered in sample si” as the outcome of a single test in a set of Bernoulli trials {i} (with i = 1, …, M) (one for each sample in S), then each pi can be interpreted as the success probability of the i − th trial. By definition, summing these probabilities across all the elements of S (all the trials) gives the expected number of successes E(P), i.e. the expected number of samples harbouring a mutation in at least one gene belonging to P:
On the other hand, if we consider a function ϕ on the domain of the X variables, defined as ϕ(X) = 1−δ(X), where δ(X) is the Dirac delta function (assuming null value for every X ≠ 0), i.e. ϕ(X) = {1 if X > 0, and 0 otherwise}, then summing the ϕ(Xi) across all the samples in S, gives the observed number of samples harbouring a mutation in at least one gene belonging to P:
A pathway alteration index, quantifying the deviance of O(P) from its expectation, and thus how surprising is to find so many samples with alterations in the pathway P, can be then quantified as:
To assess the significance of such deviance, let us note that the probability of the event O(P) = y, with y ≤ M, i.e. the probability of observing exactly y samples harbouring alterations in the pathway P, distributes as a Poisson binomial B (a discrete probability distribution modeling the sum of a set of {i} independent Bernoulli trials where the success probabilities pi are not identical (with i = 1, …, M). In our case, the i-th Bernoulli trial accounts for the event “the pathway P is altered in the sample si” and its success probability of success is given by the {pi} introduced above (and computed with one amongst 2, 4, or 5). The parameters of such B distribution are then the probabilities π = {pi}, and its mean is given by Equation 6. The probability of the event O(P) = y can be then written as where Fy is the set of all the possible subsets of y elements that can be selected from the trial 1, 2, …, M (for example, if M = 3, then F2 = {{1, 2}, {1, 3}, {2, 3}}, and Ac is the complement of A, i.e. {1, 2, …, M}\A. Therefore a p-value can be computed against the null hypothesis that O(P) is drawn from a Poisson binomial distribution parametrised through the vector of probabilities π. Such p-value can be derived for an observation O(P) = z, with z ≤ M, as (Figure 1C):
Finally, p-values resulting from testing all the pathways in the considered collection are corrected for multiple hypothesis testing with a user-selected method among (in decreasing order of stringency) Bonferroni, Benjamini-Hochberg, and Storey-Tibshirani [80].
Pathway gene sets collection and pre-processing
To highlight the versatility of SLAPenrich and guarantee results’ comparability with respect to previously published studies, we have conducted the analyses described in the Results section using different collections of pathway gene sets, all included (as R objects) in our software package.
For the case study analysis on the LUAD dataset we downloaded the whole collection of KEGG [29] pathway gene sets from MsigDB [81], encompassing 189 gene sets for a total number of 5,224 genes included in at least one set.
The following differential enrichment analyses and the hallmark signature analyses were performed on a larger collection of pathway gene sets from the Path-way Commons data portal (v8, 2016/04) [28] (http://www.pathwaycommons.org/archives/PC2/v4-201311/). This contained an initial catalogue of 2,794 gene sets (one for each pathway) that were assembled from multiple public available resources, such as Reactome [82], Panther [83], HumanCyc [84], pid [85], smpdb [86], KEGG [29], ctd [87], inoh [88], wikipathways [89], netpath [90], and mirtarbase [91], and covering 15,281 unique genes.
From this pathway collection, those gene sets containing less than 4 or more than 1,000 genes, were discarded. Additionally, in order to remove redundancies, those gene sets (i) corresponding to the same pathway across different resources or (ii) with a large overlap (Jaccard index (J) > 0.8, as detailed below) were merged together by intersecting them. The gene sets resulting from this compression were then added to the collection (with a joint pathway label) and those participating in at least one of these merging were discarded. Finally, gene names were updated to their most recent HGCN [92] approved symbols (this updating procedure is also executed by a dedicate function in of the SLAPenrich package, by default on each genomic datasets prior the analysis). The whole process yielded a final collection of 1,911 pathway gene sets, for a total number of 1,138 genes assigned to at least one gene set.
Given two gene sets P1 and P2 the corresponding J (P1, P2) is defined as:
Curation of a pathway/hallmark map
We implemented a simple routine (included in the SLAPenrich R package) that assigns to each of the 10 canonical cancer hallmarks a subset of the pathways in a given collection. To this aim this routine searches for determined keywords (typically processes or cellular components) known to be associated to each hallmark in the name of the pathway (such as for example: ‘DNA repair’ or ‘DNA damage’ for the Genome instability and mutations hallmark) or for key nodes in the set of included genes or key word in their name prefix (such as for example ‘TGF’, ’SMAD’, and ‘IFN’ for Tumour-promoting inflammation. The full list of keywords used in this analysis are reported in the additional file 5. Results of this data curation are reported in the additional file 6.
Mutual exclusivity filter
After correcting the p-values yielded by testing all the pathways in a given collection, the enriched pathways can be additionally filtered based on a mutual exclusivity criterion, as a further evidence of positive selection. To this aim, for a given enriched pathway P, an exclusive coverage score C(P) is computed as where O(P) is the number of samples in which at least one gene belonging to the pathway P is mutated, and O′(P) is the number of samples in which exactly one gene belonging to the pathway gene-set P is mutated. All the pathways P such that C(P) is at least equal to a chosen value pass this final filter.
Heuristic mutual exclusivity sorting and pathway visualization
The set of somatic mutations of a cancer genomic dataset can be easily modeled as a binary (or Boolean) matrix, whose entries can assume only two possible values, i.e. {0, 1}. In this case, the columns indicate samples, its rows indicate genes (or vice-versa) and a non-zero entry the presence of a somatic mutations in a given gene/sample combination. In a binary matrix, a run is a sequence of consecutive non-zero entries. Reordering rows and columns in a way that the number of runs on the rows and the column-wise marginal totals are minimized is an effective way to highlight patterns of mutual exclusivity among the runs of different rows (i.e. the genes of the considered sub-set). This is an NP-hard problem [93] here referred as the ‘mutual-exclusivity sorting’. In SLAPenrich a heuristic implementation of the mutual-exclusivity sorting is provided in a dedicated R function used by the internal visualization routines, although this function is also available and usable on any user defined binary matrix. Here, for simplicity we will describe an execution of this heuristic applied to a binary matrix summarizing a genomic dataset (with genes on the rows, samples on the columns, and binary entries specifying the status of a gene in a given sample).
In the initial step of the algorithm all the samples and all the genes in the input matrix are declared as uncovered and an empty vector is initialized: this is the set of covered genes G. Then the algorithm proceeds through a series of iterations until the sets of uncovered genes and uncovered samples are both empty. In each of these iterations a best in class gene is identified. This is the uncovered gene with the maximal exclusive coverage, which is defined as the number of uncovered samples in which this gene is mutated minus the number of samples in which at least another uncovered gene is mutated. Finally, the identified best in class gene is removed from the set of the uncovered genes, it is attached to G, and the set of samples in which it is mutated are removed from the set of the uncovered samples. After these iterations have been executed, an empty vector of samples L is initialized, all the samples of the dataset are labeled again as uncovered, an empty vector of samples S is initialized. Then for each of the best in class gene g (in the same order as they appear in G) and until there are uncovered samples, the uncovered samples in which g is mutated are sorted according to the exclusive coverage of g across them (in decreasing ordered), they are labeled as covered samples and attached in the resulting order to L. To obtain the final mutual-exclusivity sorting of the initial dataset, the corresponding inputted binary matrix is rearranged by permuting the genes/rows in the same order as they appear in G and the samples/columns in the same order as they appear in L.
Identification and visualization of enriched pathway core-components
To identify shared core-components across significantly enriched pathways, the set of enriched pathways and their composing genes are modeled as a bipartite network, in which nodes in the first set correspond to enriched pathways and nodes in the second set to genes belonging to at least one of the enriched pathways. Finally a pathway node is connected with an edge to each of its composing gene nodes. The resulting bipartite network is then mined for communities, i.e. groups of densely interconnected nodes, by using a fast community detection algorithm based on a greedy strategy [94]. The resulting communities are finally visualized as independent heatmaps where nodes in the first set (pathways) are on the columns, nodes in the second set (genes) are on the rows and a not-empty cell in position i, j indicates that the i-th gene belongs to the j-th pathway (Figure 2 B).
Differential pathway enrichment analysis
Similarly to differential gene expression analysis, the two sub-populations to be contrasted are defined through a contrast matrix. Then individual SLAPenrichment analyses are performed on these two populations, yielding two sets of results. The pathways that are significantly enriched in at least one of the two analyses (according to a user defined false discovery rate (FDR) threshold) are then selected and, for each of them, a differential enrichment score is computed as: where A and B are the two contrasted sub-populations (respectively, positive and negative) and FDRA and FDRB are the two SLAPenrichment FDRs obtained in the two corresponding individual analyses. Graphic routines included in our package allow a pathway level visualization of the inputted alterations across the two contrasted population, on the domain of the differentially enriched pathways as well as heatmaps and barplots of the differential enrichment scores (see an example in Figure 2C).
LUAD case study analysis
Annotations of somatic variants identified in a cohort of 188 lung adenocarcinoma patients, and associated clinical information, were downloaded from http://genome.wustl.edu/pub/supplemental/tsp_nature_2008/ (files: supplementary table 2.tsv and supplementary table 15.tsv, respectively). The variants annotations were converted into a genomic event matrix (EM) with altered genes on the rows, patient sample identifiers on the columns, and generic i, j entries specifying the number of observed point mutations hosted by the i-th gene in the j-th patient. A SLAPenrich analysis on the resulting dataset was performed using the SLAPE.analyse function with default values for all the parameters (including a Bernoulli model [26] for the individual pathway alteration probabilities across all the samples, and the choice of the set of all the altered genes in the dataset as background population), and a pathway gene sets collection from KEGG [29] (embedded in the package as R data object: SLAPE.20160211 MSigDB KEGG hugoUpdated).
Hallmark heterogeneity signature analysis: genomic datasets and high-confidence cancer genes
Tissue specific catalogues of genomic variants for 10 different cancer types (breast invasive carcinoma, colon and rectum adenocarcinoma, glioblastoma multiforme, head and neck squamous cell carcinoma, kidney renal clear cell carcinoma, lung adenocarcinoma, ovarian serous cystadenocarcinoma, prostate adenocarcinoma, skin cutaneous melanoma, and thyroid carcinoma) were downloaded from the GDSC1000 data portal described in [43] (http://www.cancerrxgene.org/gdsc1000/). This resource (available at http://www.cancerrxgene.org/gdsc1000/) encompasses variants from sequencing of 6,815 tumor normal sample pairs derived from 48 different sequencing studies [44] and reannotated using a pipeline consistent with the COSMIC database [95] (Vagrent: https://zenodo.org/record/16732#.VbeVY2RViko). Lists of tissue specific high-confidence cancer genes [44] were downloaded from the same data portal (http://www.cancerrxgene.org/gdsc1000/GDSC1000_WebResources//Data/suppData/TableS2A.xlsx). These were identified by combining complementary signals of positive selection detected through different state of the art methods [96, 97] and further filtered as described in [43] (http://www.cell.com/cms/attachment/2062367827/2064170160/mmc1.pdf)
Hallmark heterogeneity signature analysis: Individual SLAPenrich analysis parameters
All the individual SLAPenrich analyses were performed using the SLAPE.analyse function of the SLAPenrich R package (https://github.com/francescojm/SLAPenrich) using a Bernoulli model for the individual pathway alteration probabilities across all the samples, the set of all the genes in the dataset under consideration as background population, selecting pathways with at least one gene point mutated in at least 5% of the samples and at least 2 different genes with at least one point mutation across the whole dataset, and and a pathway gene sets collection downloaded from pathway commons[28], postprocessed for redundancy reduction as explained in the previous sections, and embedded in the SLAPE package as R data object:
PATHCOM HUMAN nonredundant intersection hugoUpdated.RData.
A pathway in this collection was considered significantly enriched, and used in the follow-up computation of the hallmark cumulative heterogeneity score, if the SLAPenrichment false discovery rate (FDR) was less than 5% and its mutual exclusive coverage (EC) was greater than 50%.
Down-sampling analyses
To investigate how differences in sample size might bias the SLAPenrichment results due to a potential tendency for larger datasets to produce larger number of SLAPenriched pathways, down-sampled SLAPenrich analyses were conducted for the 5 datasets with more than 350 samples (for BRCA, COREAD, GBM, HNSC, LUAD). Particularly, for n ∈ {800, 400, 250} for BRCA and n = 250 for the other cancer types, 50 different SLAPenrich analyses were performed on n samples randomly selected from the genomic dataset of the cancer type under consideration, with the parameter specifications described in the previous section. The average number of enriched pathways (FDR < 5% and EC > 50%) across the 50 analysis was observed.
Hallmark signature analysis: signature quantification
For a given cancer type C and a given hallmark H a cumulative heterogeneity score (CHS) was quantified as the ratio of the pathways associated to H in the SLAPenrich analysis of the C variants.
The CDS scores for all the 10 hallmark composed the hallmark signature of C.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability of data and material
R code and data-objects are available at: https://github.com/francescojm/SLAPenrich. Pre-processed data sources are specified in the Methods.
Funding
OpenTargets funds JSR (Projects OpenTargets15 and OpenTargets16).
Competing interests
The authors declare that they have no competing interests.
Author’s contributions
FI designed the statistical framework underlying SLAPenrich, conceived the hallmark heterogeneity analysis, and designed the other heuristic algorithms, conceived the visualization framework, implemented the R package, and wrote the manuscript; LGA contributed to the implementation of the visualization functions, tested and contributed to implementing the R package, curated data, and contributed to manuscript writing and revising; JB contributed to testing the R package, interpreted results and findings, contributed to manuscript writing and revising; IM contributed to the design of the validation analyses, read and edited the manuscript; DRW contributed to the design of the statistical framework and supervised its mathematical formalization; UM contributed to the interpretation of results; JSR supervised the study and contributed to the manuscript writing and revising.
Additional Files
Additional figure 1 — Differential pathway enrichment analysis results: non-mucinous vs mucinous bronchioloalveolar LUAD patients
Differential SLAPenrich analysis obtained contrasting two sub-pulations of LUAD patients based on their bronchioloalveolar type (non-mucinous vs. mucinous).
Additional figure 2 — Manually curated mapping between genes, pathways and hallmarks (properties):
(A) Heatmap with cancer hallmarks on the rows, pathways gene sets on the columns. A coloured bar in position (i, j) indicates that the j-th pathway is associated with the i-th hallmark; bar diagram on the right shows the number of pathways associated with each hallmark. (B) Heatmap with cancer hallmarks on the rows and genes on the columns. A coloured bar in position (i, j) indicates that the j-th gene is contained in at least one pathway associated with the i-th hallmark (thus associated with the i-th hallmark); bar diagram on the right shows the number of genes associated with each hallmark. (C) Number of associated hallmarks per pathways: the majority of the pathways is associated with 1 hallmark. (D) Number of associated hallmarks per gene: the majority of the genes is associated with less than 3 hallmarks. (E) Distribution of Jaccard similarity scores (quantifying the extent of pair-wise overlaps) computed between pairs of pathway gene sets.
Additional figure 3 — Enriched pathways versus sample size, downsampled analyses, and covered known cancer genes:
(A) Number of significantly enriched pathway at the population versus the number of samples available in the analysed cohorts, across cancer type. (B) Number of significantly enriched pathway at the population level across 5 different cancer types (with more than 350 samples), indicated by different colors, and down-sampled trials. In each of this trials, for each cancer type and 50 different iterations, a set of n samples is randomly selected and a SLAPenrich analysis is performed on this sub-set of data. Average number of SLAPenriched pathway (and standard deviations) are reported. n = 800, 400, and 250 for BRCA and n = 250 for the other four cancer type. For four of the tested tissues there is no tendency for increased number of samples to produce more SLAPenriched pathways. A mild dependency trend is observable for BRCA only, with a continuously increasing average number of enriched pathways as a function of sample size up to 800 samples, that plateaus above this size, with a very similar number of enriched pathways when analysing 1,132 samples or across 50 analysis on 800 pathways. (C) Each bar quantifies the ratio of high-confidence cancer genes (as predicted in [44]) contained in at least one pathway enriched at the population level (covered pathways), across cancer types. Different contained colored bars indicate the ratio of the genes included in covered pathways associated to different hallmarks, one colored bar per hallmark. The white bar at the top indicate the ratio of genes included in covered pathways associated to multiple hallmarks.
Additional figure 4 — Hallmark heterogeneity across cancer types:
Heatmaps showing pathways enrichments at the population level across cancer types for individual hallmarks. Color intensities correspond to the enrichment significance. Cancer types and pathways are clustered using a correlation metric. See also figure 4.
Additional figure 5 — Impact of known cancer genes’ mutations on the results:
Heatmaps showing, for each enriched-pathway/cancer-type, the ratio between the number samples harbouring mutations in known cancer genes belonging to the pathway under consideration and the total number of samples harbouring mutations in any gene belonging to the pathway under consideration. See also figure 4.
Additional figure 6 — Impact of known cancer genes’ mutations on the results:
Heatmaps showing, for each enriched-pathway/cancer-type, the ratio between the number samples harbouring mutations in known cancer genes belonging to the pathway under consideration and the total number of samples harbouring mutations in any gene belonging to the pathway under consideration. See also figure 4.
Additional figure 7 — Hallmark signature analysis to discover new cancer driver networks:
In each row, first circle plots shows pathway enrichments at the population levels when considering all the somatic variants (bars on the external circle) and when considering only variants not involving known high-confidence cancer driver genes; second circle plot shows similarly a comparison between the hallmark signatures resulting from SLAPenrich analysis including (bars on the external circle) or excluding (bars on the internal circle) the variants involving known high-confidence cancer genes. The bar plot shows a comparison, in terms of true-positive-rate (TPR) and positive-predicted-value (PPV), of the SLAPenriched pathways across the two analysis and, finally, the scatter plots on the right shows a comparison between the resulting hallmark signatures.
Additional file 1 — KEGG pathways enriched in the LUAD case study dataset
Additional file 2 — Differential enrichment analysis comparing Smokers vs. Non-Smokers: Results
Additional file 3 — Differential enrichment analysis comparing Mucinus vs. Non-Mucinus BAC types: Results
Additional file 4 — SLAPenrich analysis results across 10 different cancer types
Additional file 5 — List of keywords used to curate the mapping between genes, pathways and hallmarks
Additional file 6 — Manually curated mapping between genes, pathways and hallmarks
Additional file 7 — Overview of the exposed function of the SLAPenrich R package
Acknowledgements
We would like to thank Jorge Buendia, Mathew Garnett, and Annalisa “Lilla” Mupo for a number of insightful discussions, David Tamborero and Nuria Lopez-Bigas for critical feedback on the manuscript.
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.↵
- 10.↵
- 11.↵
- 12.
- 13.
- 14.
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.↵
- 87.↵
- 88.↵
- 89.↵
- 90.↵
- 91.↵
- 92.↵
- 93.↵
- 94.↵
- 95.↵
- 96.↵
- 97.↵