Summary
Continued evolution in cancers gives rise to intra-tumour heterogeneity (ITH), which is a major mechanism of therapeutic resistance and therefore an important clinical challenge. However, the extent, origin and drivers of ITH across cancer types are poorly understood. Here, we extensively characterise ITH across 2,778 cancer whole genome sequences from 36 cancer types. We demonstrate that nearly all tumours (95.1%) with sufficient sequencing depth contain evidence of recent subclonal expansions and most cancer types show clear signs of positive selection in both clonal and subclonal protein coding variants. We find distinctive subclonal patterns of driver gene mutations, fusions, structural variation and copy-number alterations across cancer types. Dynamic, tumour-type specific changes of mutational processes between subclonal expansions shape differences between clonal and subclonal events. Our results underline the importance of ITH and its drivers in tumour evolution and provide an unprecedented pan-cancer resource of extensively annotated subclonal events, laying a foundation for future cancer genomic studies.
Introduction
Cancers accumulate somatic mutations as they evolve1,2. Some of these are driver mutations that convey fitness advantages to their host cells and can lead to clonal expansions3-6. Late clonal expansions or incomplete selective sweeps result in distinct cellular populations and manifest as intra-tumour heterogeneity (ITH)1. Clonal mutations are shared by all cancer cells whereas subclonal mutations are present only in some.
ITH represents an important clinical challenge, as it provides genetic variation fuelling cancer progression and can lead to the emergence of therapeutic resistance7-9. Subclonal drug resistance and associated driver mutations are common10-15. ITH can impact precision medicine trial design16, predict progression17, and can be directly prognostic. For example, ITH of copy number aberrations (CNAs) is associated with increased risk of relapse in non-small cell lung cancer18, head and neck cancer19,20 and glioblastoma multiforme21.
ITH can be characterised from massively parallel sequencing data10,11,22-24, as the cells comprising a clonal expansion share a unique set of driver and passenger mutations that occurred in the expansion-initiating cell. Each mutation within this shared set is present in the same proportion of tumour cells (known as cancer cell fraction, CCF), which may be estimated by adjusting the mutation allele frequencies for local copy number changes and sample purity. Subsequent clustering of mutations based on their CCF (see Dentro et al.25 for a recent review) yields a sample’s ‘subclonal architecture’, i.e. estimates of the number of tumour cell populations in the sequenced sample, the CCF of each population, and assignments of mutations to each population (subclone).
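The adjustment of allele frequencies described above can be illustrated with a minimal sketch. It assumes the commonly used relationship in which the expected allele frequency equals the tumour-derived mutant copies divided by all copies of the locus in the sample. The function name, the `multiplicity` argument (mutant copies per mutated cell) and the default normal copy number of 2 are illustrative assumptions, not the implementation of any individual method used in this study.

```python
def cancer_cell_fraction(vaf, purity, tumour_cn, multiplicity=1, normal_cn=2):
    """Estimate the cancer cell fraction (CCF) of a mutation.

    Illustrative sketch only: assumes a single clonal copy number state
    at the locus and a known multiplicity.
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    if multiplicity < 1:
        raise ValueError("multiplicity must be at least 1")
    # Average number of copies of the locus per cell in the mixed sample
    # (tumour cells at tumour_cn, admixed normal cells at normal_cn).
    total_copies = purity * tumour_cn + (1 - purity) * normal_cn
    # Mutant copies per tumour cell, divided by the copies each mutated
    # cell carries, gives the fraction of tumour cells with the mutation.
    return vaf * total_copies / (purity * multiplicity)
```

In a diploid region of a sample with 80% purity, for example, an allele frequency of 0.4 corresponds to a clonal mutation (CCF of 1) under these assumptions, whereas an allele frequency of 0.2 corresponds to a CCF of 0.5.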
Previous pan-cancer efforts used these principles to characterise subclonal events, but have been limited to exomes, which restricts the number and resolution of somatic mutation calls and ignores structural variation26. Two recent studies using pan-cancer data from The Cancer Genome Atlas found that actionable driver mutations are often subclonal11, and that ITH has broad prognostic value26. Williams and colleagues27 proposed that neutral evolutionary dynamics are responsible for the observed ITH in a large proportion of cancers, although the test statistics developed there have been shown to poorly discriminate neutral evolution from selection28,29. To date, ITH remains poorly characterised across cancer types, and there is substantial uncertainty concerning the selective pressures operating on subclonal populations.
Recent studies have used multi-region exome or targeted sequencing to characterise ITH in detail in specific cancer types18,30. Due to the ‘illusion of clonality’31, variants found to be clonal in one sample may be subclonal in other samples, and single-sample analyses may therefore underestimate the amount of ITH. Importantly, however, any mutation detected as subclonal in a single sample will remain subclonal when additional samples are assayed. Therefore, by assaying single cancer samples, a robust lower limit of ITH can be established.
Here, we assess ITH, its origin and drivers, and its role in tumour development, across 2,778 tumours from 36 histologically distinct cancer types. Our study is built on the International Cancer Genome Consortium’s Pan-Cancer Analysis of Whole Genomes (PCAWG) initiative, which represents the largest dataset of cancer whole-genome sequences to date32. Whole-genome sequencing data provides 1-2 orders of magnitude more point mutations, greater resolution to detect CNAs and the ability to call structural variants (SVs). Combined, these greatly increase the breadth and depth of our ITH analyses. Building on the high-quality consensus calls generated by the PCAWG consortium, we find pervasive ITH across cancer types. In addition, we observe clear signs of positive selection in detected subclones, we identify subclonal driver mutations in known cancer genes and find changes in mutational signature activity across cancer types, which combined provide detailed insight into tumour evolutionary dynamics.
Results
Consensus-based characterisation of intra-tumour heterogeneity in 2,778 cancers
We set out to paint detailed portraits of ITH across cancer types, including SNVs, indels, SVs and CNAs, as well as subclonal drivers, subclonal selection, and mutational signatures. We leveraged the PCAWG initiative dataset, encompassing 2,778 whole-genome sequences across 36 distinct histological cancer types32.
We applied an ensemble of six state-of-the-art copy number callers and 11 subclonal reconstruction methods and developed approaches to integrate their calls into a high-confidence consensus (Fig. 1a, Supplementary Methods). As previous studies report high sensitivity of subclonal reconstruction methods to the quality of copy number calls26, we devised a robust consensus approach to copy number calling. In addition to breakpoints called by the six CNA callers, we incorporated SVs into our consensus call set, improving sensitivity and obtaining breakpoints with base-pair resolution (Supplementary Methods). Consensus purity and ploidy were determined, and correlate strongly with a recent cross-omics analysis of tumour purity33 (Supplementary Fig. 1). We identify samples that have undergone whole-genome duplication, as they separate from other samples when comparing tumour ploidy and the extent of loss of heterozygosity (Fig. 1b, Supplementary Methods). These samples exhibit synchronous chromosomal gains (see our companion paper34), further validating the purity and ploidy estimates. Consensus copy number calls were assigned ‘tiers’, based on the level of agreement between different callers. On average, we reached consensus on 93% of the genome (Fig. 1c, Supplementary Methods).
Consensus copy number profiles, SNVs and purity estimates served as input to 11 subclonal SNV-clustering methods, the results of which were combined into a single reconstruction for each tumour. We validated three consensus approaches on two independent simulated datasets and assessed their robustness on the real data. Consensus performance was comparable to the best individual methods on both simulated datasets and the top-performing individual methods also displayed high similarity scores (Fig. 1d, Supplementary Methods). In contrast, on the real data, the highest similarities were observed only between consensus methods (Fig. 1d). Using one simulated dataset with 965 samples, we evaluated the performances of consensus approaches over all 2,035 possible combinations of 11 individual methods, and observed that the most robust performance, when the best callers are not known a priori, is achieved in having all 11 callers combined (Supplementary Methods). Hence, we used the output of one of our consensus methods as the basis for our global assignment strategy (Supplementary Methods), obtaining the number of detectable subclonal expansions, the fraction of subclonal SNVs, indels, SVs and CNAs, as well as the assignment of SNVs, indels and SVs to subclones.
Portraits of intra-tumour heterogeneity across cancer types
We find pervasive ITH across all 36 cancer types. Subclonal expansions are evident in 95.1% of the 1,868 tumour samples for which our analysis is powered to detect subclones with CCF > 30% (Fig. 2, Supplementary Fig. 2, Supplementary Methods). Importantly, these estimates, based on single sample reconstruction, provide a lower bound of the number of subclonal mutations and the true proportion of cancers with ITH is likely to be even higher. In contrast to nearly all primary tumour samples, only half of melanoma metastases had detectable subclones (96.7% of 1,801 vs 51% of 67 samples). Surprisingly, metastases of other cancer types all contained detectable subclones (100%, n = 42). Similar to primary tumours, melanoma recurrence samples show a high degree of ITH (Fig. 2, Supplementary Fig. 3a). An approach orthogonal to clustering of SNVs confirmed that clonal melanoma metastases contain significantly less subclonal signal (p-value = 8.4×10⁻⁵, Supplementary Fig. 3b, Supplementary Methods).
The patterns of ITH across SNVs, indels, SVs and CNAs paint a characteristic portrait for each histological cancer type (Fig. 2). While some cancer types have limited ITH across these different types of somatic variants (e.g. lung cancers, squamous cell carcinomas and liposarcomas), others show abundant ITH in some somatic variant types, but nearly none in others (e.g. kidney cancers and pancreatic neuroendocrine tumours show high subclonal burden across somatic variant types, except CNAs) (Fig. 2). We noticed an anti-correlation between the number of SNVs and the average fraction of subclonal SNVs across cancer types (Fig. 2), yet this relation does not hold on the level of individual tumours (Supplementary Methods). The proportions of subclonal indels and SNVs are strongly correlated (R² = 0.89). SVs follow a similar trend (R² = 0.64 with SNVs), except for lung squamous cell carcinoma and kidney papillary carcinoma, which show higher fractions of subclonal SVs than SNVs (Fig. 2, Supplementary Fig. 4). In contrast, the average proportions of subclonal large-scale CNAs and SNVs are only weakly correlated (R² = 0.33).
These findings highlight the high prevalence of ITH across cancer types. Nearly all primary tumours, irrespective of cancer type, have undergone recent subclonal expansions giving rise to detectable subclonal populations. In addition, we find that the average proportions of subclonal SNVs, indels, SVs and CNAs are highly variable across cancer types. These observations accentuate different ITH portraits, suggesting distinct evolutionary narratives of each histological cancer type. Further, among the primary tumours of each cancer type, we find substantial diversity in the fraction of subclonal mutations.
The landscape of subclonal driver mutations
We leveraged the comprehensive whole-genome view of driver events in these cancer genomes35 to gain insight into clonal vs. subclonal drivers. Out of 4,211 high-confidence driver mutations in 360 genes, we find 699 subclonal ones (SNVs and indels) across 196 genes (Fig. 3a). However, 74% of samples with at least one subclone (1,499 / 2,038), and 79% of all detected subclones (2,148 / 2,724), contain no identified subclonal driver SNVs or indels. In contrast, only 29% of samples (770 / 2,658) lack identified clonal driver SNVs or indels.
Overall, the landscape of subclonal driver mutations indicates that specific genes are recurrently hit in subclones across cancer types (Fig. 3a). For example, the PTEN tumour suppressor is commonly found subclonally mutated in both pancreatic and stomach adenocarcinomas. Interestingly, mutations in some driver genes that are exclusively clonal in most cancer types, are predominantly subclonal in others. For example, we find subclonal driver mutations in TP53 in CLL and thyroid cancers; PIK3CA in pilocytic astrocytomas, melanomas and prostate cancers; and KRAS in pilocytic astrocytomas and cervical cancers.
Several tumour types have higher average numbers of subclonal known drivers per sample, suggesting greater subclonal diversity (Fig. 3a). Gene set analysis (Supplementary Methods) revealed enrichment of subclonal mutations in genes responsible for chromatin regulation and transcriptional activity, suggesting an important role in later cancer progression. Indeed, we found that ARID1A, PBRM1, KMT2C/D and SETD2 were highly enriched for subclonal driver mutations. Splicing factor SF3B1 was also often subclonally mutated, and tumour suppressor SMAD4 was subclonally aberrated in breast and pancreatic neuroendocrine tumours.
To assess the potential impact of ITH on clinical decisions, we identified actionable subclonal driver mutations. We reasoned that targeting mutations that are not present in all tumour cells will likely result in ineffective treatment. Restricting our analysis to genes and mutations for which inhibitors are available35, we find that 11.7% of tumours with sufficient coverage harbour an identified subclonal driver that is clinically actionable (Fig. 3b). Among them, 5.1% of tumours show targetable driver mutations only in subclones, while the remaining 6.6% show both subclonal and clonal targetable drivers. When considering only tumours with at least one actionable event, we find that 20.7% of tumours contain at least one subclonal actionable driver, of which about half (9.1% of tumours) show only subclonal actionable events. As our results represent lower bound estimates of the subclonality at the level of the whole tumour, this suggests that targeted therapy would yield an incomplete response in at least 20% of cases. These results highlight the importance of assessing clonality of targeted mutations.
Subclones contain driver mutations that are under positive selection
Selective pressures acting on the coding regions of cancer genomes can be quantified using the dN/dS ratio, which compares the rates of non-synonymous and synonymous mutations36. A dN/dS ratio larger than 1 indicates positive selection, while smaller ratios characterise negative selection, and dN/dS ≈ 1 points towards neutral evolutionary dynamics (or, theoretically, approximately equal amounts of positive and negative selection).
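The interpretation of the dN/dS ratio can be made concrete with a deliberately simplified sketch. It normalises the observed counts of non-synonymous and synonymous mutations by the number of sites at which each class could occur, and takes the ratio of the two rates. The function names, the fixed site counts and the `tolerance` band around 1 are illustrative assumptions; the models used in the cited work36,37 additionally account for mutational context and should not be confused with this toy calculation.

```python
def dnds_ratio(n_nonsyn, n_syn, nonsyn_sites, syn_sites):
    """Ratio of non-synonymous to synonymous mutation rates per site."""
    if n_syn == 0:
        raise ValueError("dN/dS is undefined without synonymous mutations")
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    dn = n_nonsyn / nonsyn_sites  # non-synonymous mutations per site
    ds = n_syn / syn_sites        # synonymous mutations per site
    return dn / ds


def interpret_dnds(ratio, tolerance=0.1):
    """Map a point estimate to the three regimes described in the text.

    `tolerance` is an arbitrary illustrative band; in practice the
    confidence interval of the estimate is compared against 1.
    """
    if ratio > 1 + tolerance:
        return "positive selection"
    if ratio < 1 - tolerance:
        return "negative selection"
    return "approximately neutral"
```

For example, 60 non-synonymous mutations over 300 possible sites and 10 synonymous mutations over 100 possible sites give a ratio of 2, which this sketch labels as positive selection.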
Previously, dN/dS > 1, i.e., evidence of positive selection, has been shown for cancer driver genes37. When analysing clonal mutations in our dataset, we confirm this signature of selection within a set of 566 well-established driver genes (Supplementary Methods). When specifically assaying our consensus subclonal mutations for the same set of drivers, we observed a dN/dS > 1 for nonsense, missense and splice-site SNVs (Fig. 3c). This indicates that driver mutations, rather than neutral evolutionary dynamics27, frequently shape subclonal expansions. This is further supported by the identification of dN/dS ratios > 1 in subclonal mutations of tumours reportedly shaped by neutral evolutionary dynamics27 (Fig. 3d). The 95% confidence intervals of dN/dS for subclonal mutations lay above 1 only in a subset of cancer types (Fig. 3e), in large part due to power limitations: cancer types with no mutation types showing dN/dS > 1 in subclonal mutations also had significantly lower numbers of samples available (p-value = 1.2×10⁻³, Mann-Whitney U test).
SV clonality reveals how rearrangements influence tumour development and progression
Having established the presence of many subclonal driver SNVs, and a broad correlation between the proportions of subclonal SNVs and SVs, we then sought to examine patterns of subclonality among candidate SV driver mutations.
We defined an SV to be a candidate driver if it was associated with significantly recurrent breakpoints (see companion paper38) at non-fragile sites. All other SVs were deemed passengers (Supplementary Methods).
We found substantial variation in the clonality of driver SVs across cancer types, implying cancer type-specific roles for SVs in tumour development and/or progression (Fig. 4a-c). Nearly half of the samples (45%; 575 / 1,273) and all of the 28 cancer types analysed contained subclonal driver SVs. However, in nine of these cancer types, including B-cell non-Hodgkin lymphoma and melanoma, more than 75% of the candidate driver SVs were clonal, suggesting a role for SV drivers in tumour initiation but not progression. In four of these nine, the driver SVs were significantly more clonal than the passenger SVs (Fig. 4c, p < 0.05, difference of weighted medians, permutation testing), suggesting that the acquisition of early driver SVs was not caused by general genomic instability, nor did genomic instability cause the acquisition of further SV drivers. Similarly, driver SVs were also significantly more clonal than passengers in four of the remaining 19 cancer types. In contrast, pancreatic neuroendocrine cancers and leiomyosarcomas had just over 50% of their driver SVs appearing subclonally, suggesting initiation in these cancer types was potentially driven by non-SV mutations, with subsequent ITH driven by SVs. In line with this, these two types have a relatively low number of subclonal SNV drivers (Fig. 3a) and no significant difference between the clonality of driver and passenger SVs. The remaining tumour types showed substantial evidence for both clonal and subclonal SV drivers, suggesting SVs can drive tumour initiation and progression in these cancer types.
Despite differences in clonality of driver SVs among cancer types, all loci containing recurrent breakpoints showed both clonal and subclonal breaks (Supplementary Fig. 5a), suggesting the same SVs can drive either tumour initiation or progression in different cancer types. Nonetheless, certain loci showed a preference for clonal or subclonal SVs (Fig. 4d, q-value < 0.05, rank-based permutation test). Candidate drivers targeted by predominantly clonal SVs included PTPRB, KIAA0125 (mainly in lymphomas, Supplementary Fig. 5b), CDKN2A/B, TERT, MAP3K11, CCND1, and KCNU1. Predominantly subclonal targets included a gene-poor region on chromosome 4, and another region on chromosome 13 containing RB1, in agreement with previous studies linking RB1 loss to tumour progression in liver39, liposarcoma40,41, and breast cancer42.
To further understand how clonality impacts gain-of-function driver SVs across cancer types, we specifically focused on previously known and curated oncogenic driver fusion SVs (as described in COSMIC curated fusions [http://cancer.sanger.ac.uk/cosmic/fusion]43). We compared the clonality of fusions in this curated list of drivers with other unknown or out-of-frame fusion events, as well as with the overall pattern of SV clonality in studied samples. Known driver fusions were more likely to be clonal (p-value = 0.0284, Fisher’s exact test, Fig. 4e) with some recurrent fusions appearing exclusively clonal or highly enriched for clonal events (CCDC6-RET, BRAF-KIAA1549, ERG-TMPRSS2), pointing to a model where gain-of-function SVs tend to appear early rather than late during tumour development.
Complex phylogenies among subclones revealed by whole genome sequencing
Whole-genome sequencing provides us with an opportunity to explore and reconstruct additional patterns of subclonal structure by performing mutation-to-mutation read phasing to assess evolutionary relationships of subclonal lineages (Fig. 5a,b). Two subclones can be either linearly related to each other (parent-child relationship), or have a common ancestor, but develop on branching lineages (sibling subclones). Establishing evolutionary relationships between subclones is challenging on single-sample sequencing data due to the limited resolution to separate subclones and the uncertainties in their CCF estimates. We can however examine pairs of SNVs in WGS data that are covered by the same read pairs to reconstruct this relationship. Specifically, in haploid regions, if two SNVs are found in multiple non-overlapping read pairs, then they cannot belong to the same cell, suggesting a branching sibling lineage. In our series, we find that, of 84 tumours with sufficient mutation pairs and power, 42 (50%) show such in-trans SNV pairs in haploid regions (Supplementary Methods), suggesting that in at least 50% of tumours, branching subclonal lineages can be detected (Fig. 5a).
Similarly, in-cis SNV pairs (on the same allele) support collinear subclones: when an SNV occurs only on a subset of the read pairs that support another SNV, the two SNVs are in a parent-child relationship and thus indicate two successive subclonal expansions (Supplementary Methods). Using pairs of mutations confidently assigned to the same cluster, we find evidence that 44% (86 of 196) of tumours carry such in-cis SNV pairs, suggesting that we can further subdivide CCF clusters into multiple collinear lineages (Fig. 5b). These analyses illustrate frequent complex patterns of multiple subclonal expansions, exposed by whole-genome sequencing.
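The phasing logic for a single pair of SNVs can be sketched as follows. Each read pair spanning both positions is summarised as a tuple recording whether it carries the variant allele of SNV A and of SNV B. The function name, the labels and the `min_reads` threshold are illustrative assumptions; the criteria actually applied, including the restriction to haploid regions for in-trans pairs, are described in the Supplementary Methods.

```python
from collections import Counter


def classify_snv_pair(read_pairs, min_reads=2):
    """Classify two phased SNVs from read pairs covering both positions.

    `read_pairs` is an iterable of (has_a, has_b) booleans, one per read
    pair spanning both SNV positions. `min_reads` is a hypothetical
    support threshold guarding against sequencing errors.
    """
    counts = Counter(read_pairs)
    both = counts[(True, True)]
    only_a = counts[(True, False)]
    only_b = counts[(False, True)]

    # Never observed together, each seen on its own reads: in a haploid
    # region the two SNVs cannot share a cell (branching lineages).
    if both == 0 and only_a >= min_reads and only_b >= min_reads:
        return "in-trans (branching)"
    # B is seen only on a subset of the reads carrying A: B arose in a
    # descendant of the lineage carrying A (parent-child).
    if both >= min_reads and only_a >= min_reads and only_b == 0:
        return "in-cis (A parent of B)"
    if both >= min_reads and only_b >= min_reads and only_a == 0:
        return "in-cis (B parent of A)"
    return "uninformative"
```

Pairs in which both SNVs always co-occur, or in which support is too low, are returned as uninformative because they do not distinguish between the two evolutionary relationships.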
We further corrected the number of mutations in subclones detected by mutation clustering, by accounting for a detection bias introduced by somatic variant calling23. Specifically, in subclones with lower CCFs, some proportion of SNVs will be missed, causing an underestimation of the number of associated mutations and an overestimation of their subclones’ CCFs (somewhat akin to the “winner’s curse”). The larger number of SNVs revealed by WGS permits us to characterize and correct for these biases. We developed two methods to do this, validated them on simulated data (Supplementary Methods, Supplementary Fig. 6) and combined them to correct the number of SNVs and the CCF of each subclone. We estimate that, on average, 14% of SNVs in detectable subclones are below the somatic caller detection limits (Fig. 5c,d), while in subclones with CCF < 30%, on average 21% of SNVs are missed. We therefore extrapolate that approximately 14% of subclonal drivers in detected subclones would be missed due to the limitations of mutation calling in this series.
Patterns of subclonal mutation signature activity changes across cancers
Mutational processes can differ in their activity between clonal and subclonal lineages11. To explore the subclonal dynamics of mutational signatures in detail, we examined subclonal mutations for changes in signature activity. We reasoned that when, for example, a mutational process is activated during tumour growth or specific subclonal expansion, only the post-expansion mutations will carry the corresponding mutational signature. Such signature activity change points can therefore be identified in SNVs that are rank-ordered by their CCF estimates44 (Supplementary Methods). Of the 2,488 samples with sufficient SNVs to perform this analysis, 1,897 (76.1%) had an activity change of at least 6% in at least one signature (a conservative threshold established via permutation and bootstrapping analyses, Supplementary Methods). We detect an average of 1.76 mutational signature activity transitions per sample.
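The thresholding step can be illustrated with a minimal sketch. It assumes that SNVs have already been rank-ordered by CCF and grouped into consecutive bins, and that signature activities (fractions of mutations attributed to each signature) have been estimated per bin by some upstream procedure. The function name and input layout are hypothetical, and the sketch only flags signatures whose activity range reaches the 6% threshold; it does not reproduce the change point detection of the cited method44.

```python
def changing_signatures(activities_by_bin, threshold=0.06):
    """Return signatures whose activity range across CCF bins meets a threshold.

    `activities_by_bin` is a list of dicts (one per CCF-ordered bin)
    mapping signature name to activity as a fraction between 0 and 1.
    A signature absent from a bin is treated as having zero activity.
    """
    signatures = set()
    for activities in activities_by_bin:
        signatures.update(activities)

    changes = {}
    for signature in signatures:
        values = [activities.get(signature, 0.0) for activities in activities_by_bin]
        change = max(values) - min(values)
        if change >= threshold:
            changes[signature] = change
    return changes
```

Because the activities within a bin sum to 1, an increase in one signature is necessarily accompanied by a decrease in at least one other, so flagged signatures typically appear together.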
Overall, mutational signature activity is remarkably stable. The most often changing signature (Signature 7, UV-light exposure) is variable in approximately 60% of the cases in which it is active (Fig. 6a). Across the dataset, we find that lifestyle-associated mutational signatures (Signatures 4, tobacco smoking, and 7, UV light exposure), and Signatures 9 (Pol η activity on AID lesions) and 12 (aetiology unknown) decrease in activity from clonal to subclonal in over half the tumours in which these signatures are active. When only considering pairs of signatures that change in the same tumour, we see that 6 out of the top 10 pairs involve Signature 5 (aetiology unknown but hypothesised to reflect lower-fidelity DNA repair pathways45). Such changes are often anti-correlated, suggesting that one of the mutational processes is changing at the proportional expense of others. Evaluating signature trajectories per cancer type (Fig. 6a), we observe a gradually changing picture. In melanoma metastases, Signature 7 always decreases and Signature 5 increases. In contrast, in head-and-neck cancers, most signature activity changes go both up and down in similar, relatively low proportions of tumours. On average, signature activity changes are modest in size, with the maximum average exposure change recorded in CLL (29%, Signature 9). Some changes are observed across many cancer types - e.g., Signatures 5 and 40, of unknown aetiology - while others are found in only one or a few cancer types. For example, in hepatocellular carcinomas, we observe an increase in Signature 35 and a decrease in 12 (both aetiology unknown), and in oesophageal adenocarcinomas, we see an increase in Signature 3 (double-strand break-repair) and a decrease in 17 (aetiology unknown).
Average signature activity change across cancers of the same type is often monotonic along CCF (Fig. 6b, Supplementary Fig. 7). CLLs and lung adenocarcinomas initially see a sharp change in signature activity when transitioning from clonal to subclonal mutations, but activity of the signatures appears to remain stable within subclonal mutations (Fig. 6b). In contrast, oesophageal adenocarcinomas show a steady decrease in Signature 17 activity, whilst thyroid adenocarcinomas often show a continuing increase in the activity of Signatures 2 and 13 (APOBEC). These patterns are consistent at a single sample level, for example in individual CLL tumours (Fig. 6c).
Mutation signature activity changes mark subclonal boundaries
We next compared the mutational signature change points (shifts in activity) with detected subclones and reasoned that these would correspond well if the emergence of subclones is associated with changes in mutational process activity. In such a scenario, we expect that the signature change points coincide with the CCF boundaries between subclones, assuming that clustering partitioned the SNVs accurately. In accordance with previous studies that highlight changes in signature activity between clonal and subclonal mutations11,18, we find that between 36% and 53% of clone–subclone boundaries and between 43% and 59% of subclone-subclone boundaries coincide with a region of activity change (Fig. 6d, Supplementary Methods). This not only validates our clustering approach, but also demonstrates that subclonal expansions are often associated with changes in signature activity. It further suggests that increased ITH would correspond to greater activity change. Indeed, the samples with the largest changes in activity tend to be the most heterogeneous (Fig. 6e). Conversely, 49% of changes per sample are not within our window of subclonal boundaries (Fig. 6f), suggesting that some detected CCF clusters represent multiple subclonal lineages, which could not be separated by single-sample clustering (Supplementary Methods).
Discussion
We have painted detailed portraits of ITH and subclonal selection for 36 cancer types, using SNVs, indels, SVs, CNAs, driver mutations and mutational signatures, leveraging the largest set of whole-genome sequenced tumour samples compiled to date. Remarkably, although these single-region-based results provide only a lower bound estimate of ITH, we detected subclonal tumour cell populations in 96.7% of 1,801 primary tumours. Individual subclones in the same tumour frequently exhibited differential activity of mutational signatures, implying that successive waves of subclonal expansion can act as witnesses of temporally and spatially changing mutational processes. We extensively characterised the clonality of SNVs, indels, SVs, and CNAs. For SNVs, we identified patterns of subclonal driver mutations in known cancer genes across 36 tumour types and average rates of subclonal driver events per tumour10,11,14,18. Analysis of dN/dS ratios revealed clear signs of positive selection across the detected subclones and across cancer types. Indels showed clonality patterns highly correlated with SNVs. For SVs, we analysed both candidate driver and passenger events, revealing different models of how SVs influence tumour initiation and progression. Clonality estimates from CNAs suggest a complementary role of chromosomal instability and mutagenic processes in driving subclonal expansions.
Evaluation of dN/dS ratios revealed that tumours classified as evolving neutrally according to the approach described by Williams et al.27, contain subclones under positive selection, as previously reported29. Although our analyses do not exclude the possibility that a small fraction of tumours evolve under very weak or no selection, they show that selection is widespread across cancer types, with few exceptions. Recent methodological advances to test the neutral model based on explicit tumour growth models have emerged and could shed further light on the evolutionary dynamics of individual tumours through single46 and multiple47 tumour biopsies.
Our findings thus support and extend Nowell’s model of clonal evolution1: as neoplastic cells proliferate under chromosomal and genetic instability, some of their daughter cells acquire mutations that convey further selective advantages, allowing them to become precursors for new subclonal lineages. Here, we have demonstrated that this process is ongoing up to and beyond diagnosis, in virtually all tumours and cancer types.
Our observations highlight a considerable gap in knowledge about the drivers of subclonal expansions. Specifically, only 21% of the 2,724 detected subclones have a currently known SNV or indel driver mutation. Thus, late tumour development is either driven largely by different mechanisms – copy number alterations, genomic rearrangements18,48 or epigenetic alterations – or most late driver mutations remain to be discovered. In support of the latter, our companion study34 finds that late driver mutations occur in a more diverse set of genes than early drivers. For now, the landscape of subclonal drivers remains largely unexplored due to limited resolution and statistical power to detect recurrence of subclonal drivers. Nonetheless, each tumour type has its own characteristic patterns of subclonal SNVs, indels, SVs and CNAs, revealing distinct evolutionary narratives. Tumour evolution does not end with the last complete clonal expansion, and it is therefore important to account for ITH and its drivers in clinical studies.
We show that regions of recurrent rearrangements, harbouring likely driver SVs, also exhibit subclonal rearrangements. This suggests that improved annotations must be sought for both SVs and SNVs, in order to comprehensively catalogue the drivers of subclonal expansion. By combining analysis of SV clonality with improved annotations of candidate SV drivers38, we highlight tumour types that would benefit from further characterisation of subclonal SV drivers, such as pancreatic neuroendocrine cancers and leiomyosarcomas.
These observations have a number of promising clinical implications. For example, there was subclonal enrichment of SVs causing RB1 loss across multiple cancer types, expanding on the known behaviour of RB1 mutations in breast cancer42. These SVs could be linked to known resistance mechanisms to emerging treatments (e.g. CDK4/6 inhibitors in breast42 and bladder49 cancer). If profiled in a resistance setting, they may provide a pathway to second-line administration of cytotoxic therapies such as cisplatin or ionizing radiation, which show improved efficacy in tumours harbouring RB1 loss50.
Our results show rich subclonal architectures, with both linear and branching evolution in many cancers. This suggests that driver mutations either reinforce or compete with each other depending on the background in which they arise, in an evolutionary regime called clonal interference5. Given the pivotal role that positive selection plays in the evolution of cancer, further work is needed to characterise the full spectrum of cancer subclones and understand their fitness distribution. Meanwhile, results in controlled laboratory evolution can shed light on adaptive dynamics in the presence of genetic heterogeneity51,52. As the fitness distribution ultimately defines the rules for the evolutionary dynamics that ensue, future work should incorporate integrative analyses of clonal genotype and fitness to build a unified view of the selective constraints on cancer genomes.
Our study builds upon a wealth of data of cancer whole-genome sequences generated under the auspices of the International Cancer Genome Consortium and The Cancer Genome Atlas, allowing detailed characterisation of ITH from single tumour samples across 36 cancer types. It builds on consensus reconstructions of CNAs and subclones from 6 and 11 individual methods, respectively. In establishing these reconstructions, we found that each method makes errors that are corrected by the consensus. Our consensus-building tools and techniques thus provide a set of best practices for future analyses of tumour whole genome sequencing data. As multi-region sequencing strategies are better powered to infer detailed ITH compared to single-sample studies18,47,53, future detailed pan-cancer analyses of ITH would greatly benefit from multi-region whole-genome sequencing approaches.
Methods summary
Consensus copy number analysis
As the basis for our subclonal architecture reconstruction, we needed a confident copy number profile for each sample. To this end, we applied six copy number analysis methods (ABSOLUTE, ACEseq, Battenberg, cloneHD, JaBbA and Sclust) and combined their results into a robust consensus (see Supplementary Methods for details). In brief, each individual method segments the genome into regions with constant copy number, then calculates the copy number of both alleles for each region. Some of the methods further distinguish between clonal and subclonal copy number states, i.e. a mixture of two or more copy number states within a region. Disagreement between methods mostly stems from either differences in the segmentation step or uncertainty about whole genome duplication (WGD) status. Both issues were resolved using our consensus strategy.
To identify a set of consensus breakpoints, we combined the breakpoints reported by the six methods with the consensus structural variants (SVs). If a hotspot of copy number breakpoints could be explained by an SV, we removed the copy number breakpoints in favour of the base-pair resolution SV. The remaining hotspots were merged into consensus calls to complement the SV-based breakpoints. This combined breakpoint set was then used as input to all methods in a second pass, where methods were required to strictly adhere to the provided breakpoints.
Allele-specific copy number states were resolved by assessing agreement between outputs of the individual callers. A consensus purity for each sample was obtained by combining the estimates of the copy number methods with the results of the subclonal architecture reconstruction methods that infer purity using only SNVs.
Each copy number segment of the consensus output was rated with a star-ranking representing confidence.
To create a subclonal copy number consensus, we used three of the copy number methods that predicted subclonal states for segments and reported the subclonal state if at least two methods were in agreement.
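The agreement rule for subclonal copy number can be illustrated with a minimal sketch. It assumes that "agreement" means an exact match of the reported subclonal state; the actual matching criterion is described in the Supplementary Methods, and the function name and the tuple representation of a state are hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def subclonal_consensus(calls: Sequence[Optional[Hashable]],
                        min_agree: int = 2) -> Optional[Hashable]:
    """Return the subclonal state reported by at least `min_agree` methods.

    `calls` holds one entry per method for a single segment. None means
    that the method reported no subclonal state for that segment.
    Assumption: two methods agree only if their states are identical.
    """
    counts = Counter(c for c in calls if c is not None)
    if not counts:
        return None
    state, n = counts.most_common(1)[0]
    return state if n >= min_agree else None


# Hypothetical encoding: a pair of (major, minor) states mixed in the segment.
state_a = ((2, 1), (1, 1))
state_b = ((3, 1), (2, 1))
agreeing_calls = [state_a, state_a, None]
conflicting_calls = [state_a, state_b, None]
```

With three methods and a threshold of two, at most one state can reach the threshold, so the result is unambiguous.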
Consensus subclonal architecture clustering
We applied 11 subclonal reconstruction methods (BayClone-C, Ccube, CliP, cloneHD, CTPsingle, DPClust, PhylogicNDT, PhyloWGS, PyClone, Sclust, SVclone). Most were developed or further optimised during this study. Their outputs were combined into a robust consensus subclonal architecture (see Supplementary Methods for details). During this procedure, we used the PCAWG consensus SNVs and indels [Synapse ID syn7118450] and SVs [syn7596712].
The procedure to create consensus architectures consisted of three phases: a run of the 11 callers on a subset of SNVs that reside on high-confidence copy number calls, merging of the output of the callers into a consensus, and finally assignment of all SNVs, indels and SVs.
Each of the 11 subclonal reconstruction callers outputs the number of mutation clusters per tumour, the number of mutations in each cluster, and the clusters’ proportion of (tumour) cells (cancer cell fraction, CCF). These data were used as input to three orthogonal approaches to create a consensus: WeMe, CSR and CICC. The results reported in this paper are from the WeMe consensus method, but all three developed methods lead to similar results, and were used to validate each other (Supplementary Methods).
The consensus subclonal architecture was compared to the individual methods on two independent simulation sets, one of 500 samples for training and one of 965 samples for validation, and on the real PCAWG samples to evaluate robustness. The metrics by which methods were scored take into account the fraction of clonal mutations, the number of mutational clusters and the root mean square error (RMSE) of mutation assignments. To calculate the overall performance of a method, ranks of the three metrics were averaged per sample.
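The per-sample rank averaging can be sketched as follows. This is an illustration only: it assumes that each of the three metrics has been expressed as an error for which lower is better, and that ties share the mean rank. The function names and example values are hypothetical.

```python
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """Rank values in ascending order (1 = lowest error); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def overall_rank(errors: Dict[str, List[float]]) -> Dict[str, float]:
    """Average the per-metric ranks of each method for one sample.

    `errors` maps a method name to its list of per-metric errors.
    """
    methods = list(errors)
    n_metrics = len(errors[methods[0]])
    per_metric = [average_ranks([errors[m][k] for m in methods])
                  for k in range(n_metrics)]
    return {m: sum(r[i] for r in per_metric) / n_metrics
            for i, m in enumerate(methods)}


# Hypothetical errors: [clonal fraction error, cluster count error, RMSE]
sample_errors = {
    "method_A": [0.10, 1.0, 0.20],
    "method_B": [0.05, 0.0, 0.30],
    "method_C": [0.20, 2.0, 0.10],
}
scores = overall_rank(sample_errors)
```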
Across the two simulated datasets, the scores of the individual methods were variable, whereas the consensus methods were consistently among the best across the range of simulated number of subclones, tumour purity, tumour ploidy and sequencing depth. The highest similarities were observed among the consensus and the best individual methods in the simulation sets, and among the consensus methods in real data, suggesting stability of the consensus in the real set. Increasing the number of individual methods input to the consensus consistently improved performance and the highest performance was obtained for the consensus run on the full 11 individual methods, suggesting that each individual method has its own strengths that are successfully integrated by the consensus approaches (Supplementary Methods).
All SNVs, indels and SVs were assigned to the clusters that were determined by the consensus subclonal architecture using MutationTimer34. Each mutation cluster is modelled by a beta-binomial and probabilities for each mutation belonging to each cluster are calculated.
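The assignment step can be illustrated with a minimal sketch of beta-binomial cluster membership probabilities. It does not reproduce the MutationTimer implementation: the overdispersion parameter `rho`, the cluster weights and all function names are assumptions made for illustration, and the conversion from CCF to expected VAF uses the standard purity and copy number relation.

```python
import math
from typing import List


def expected_vaf(ccf: float, purity: float, total_cn: int,
                 multiplicity: int = 1, normal_cn: int = 2) -> float:
    """Expected VAF of a mutation at a given CCF (standard relation)."""
    return purity * ccf * multiplicity / (
        purity * total_cn + (1 - purity) * normal_cn)


def log_betabinom_pmf(k: int, n: int, a: float, b: float) -> float:
    """Log probability of k successes in n trials under a beta-binomial(a, b)."""
    def lbeta(x: float, y: float) -> float:
        return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return log_choose + lbeta(k + a, n - k + b) - lbeta(a, b)


def cluster_probabilities(alt_reads: int, depth: int,
                          cluster_vafs: List[float],
                          cluster_weights: List[float],
                          rho: float = 0.01) -> List[float]:
    """Posterior probability that a mutation belongs to each cluster.

    Each cluster is a beta-binomial with mean `cluster_vafs[c]` (strictly
    between 0 and 1) and overdispersion `rho` (hypothetical value).
    """
    logs = []
    for p, w in zip(cluster_vafs, cluster_weights):
        a = p * (1 - rho) / rho
        b = (1 - p) * (1 - rho) / rho
        logs.append(math.log(w) + log_betabinom_pmf(alt_reads, depth, a, b))
    m = max(logs)  # subtract the maximum for numerical stability
    unnorm = [math.exp(x - m) for x in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical tumour: purity 1.0, diploid; clonal cluster and a CCF 0.3 subclone.
vafs = [expected_vaf(1.0, 1.0, 2), expected_vaf(0.3, 1.0, 2)]
probs = cluster_probabilities(alt_reads=25, depth=50,
                              cluster_vafs=vafs, cluster_weights=[0.7, 0.3])
```

In this toy case a mutation observed in 25 of 50 reads is assigned mostly to the clonal cluster, since its observed VAF matches the clonal expectation of 0.5.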
Not only did this process result in the final consensus subclonal architecture, it also timed mutations relative to copy number gains (Supplementary Methods).
SV clonality analysis
Due to the difficulty in determining SV VAFs from short read sequence data, and subsequent CCF point estimation54, we elected to explore patterns of putative driver SV clonality using subclonal probabilities, allowing us to account for uncertainty in our observations of SV clonality (Supplementary Methods). After excluding unpowered samples, highly mutated samples, and cancer types with fewer than ten powered samples (Supplementary Methods), we analysed 125,920 consensus SVs from 1,517 samples, across 28 cancer types. SVs were divided into candidate driver SVs and candidate passenger SVs using annotations from a companion paper38. SVs were considered candidate drivers if they were annotated as having significantly recurrent breakpoints (SRBs) at non-fragile sites, and candidate passenger SVs otherwise (Supplementary Methods).
Subclonal probabilities of driver and passenger SVs across tumour types were summarised using weighted medians and interquartile ranges (Supplementary Methods). Any tumour type with an interquartile range exceeding a subclonal probability of 0.5 was considered as having evidence of subclonal SVs. Permutation testing was used to determine significant differences in the weighted medians between driver and passenger SVs (Supplementary Methods). To test if any genomic loci were enriched for clonal or subclonal SVs across cancer types, we employed a GSEA-like55 rank-based permutation test (Supplementary Methods).
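A weighted median and interquartile range can be computed as in the sketch below. The weighting scheme used in the study is described in the Supplementary Methods and is not reproduced here. The weights, the quantile definition (the smallest value at which the cumulative weight reaches the target fraction) and the reading of the 0.5 criterion as "the upper quartile lies above 0.5" are all assumptions.

```python
from typing import List, Tuple


def weighted_quantile(values: List[float], weights: List[float],
                      q: float) -> float:
    """Smallest value whose cumulative weight reaches fraction q of the total."""
    if not values:
        raise ValueError("values must not be empty")
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= q * total:
            return value
    return pairs[-1][0]


def weighted_summary(values: List[float],
                     weights: List[float]) -> Tuple[float, float, float]:
    """Return (lower quartile, median, upper quartile) under the given weights."""
    return (weighted_quantile(values, weights, 0.25),
            weighted_quantile(values, weights, 0.50),
            weighted_quantile(values, weights, 0.75))


def has_subclonal_evidence(values: List[float], weights: List[float],
                           cutoff: float = 0.5) -> bool:
    """Assumed reading of the criterion: the interquartile range extends above the cutoff."""
    return weighted_quantile(values, weights, 0.75) > cutoff


# Hypothetical subclonal probabilities of SVs in one tumour type.
sv_probs = [0.1, 0.4, 0.6, 0.9]
equal_weights = [1.0, 1.0, 1.0, 1.0]
```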
“Winner’s curse” correction
Because somatic mutation callers require a minimum number of supporting reads, the reported CCF values and cluster sizes will be biased in samples with low purity and/or small subclones. As variants observed in fewer reads are more likely to be missed by somatic mutation callers, rare subclones will show lower apparent mutation numbers and higher apparent CCF values. We refer to this effect as the “Winner’s curse”. To adjust mutational clusters both in size and in CCF, we developed two methods, PhylogicCorrectBias and SpoilSport. Results from both methods were integrated to produce a consensus correction, and our correction approach was validated on simulated data (Supplementary Methods).
Mutation signatures trajectory analysis
Given the mutational signatures obtained from PCAWG [syn8366024], we used TrackSig44 to fit the evolutionary trajectories of signature activities. Mutations were ordered by their approximate relative temporal order in the tumour, by calculating a pseudo-time ordering using CCF and copy number. Time-ordered mutations were subsequently binned to create time points on a pseudo-timeline to which signature trajectories can be mapped.
At each time point, mutations were classified into 96 classes based on their trinucleotide context and a mixture of multinomial distributions was fitted, each component describing the distribution of one active signature. Derived mixture component coefficients correspond to mutation signature activity values, reflecting the proportion of mutations in a sample that were generated by a mutational process. By applying this approach to every time point along a sample’s evolutionary timeline, a trajectory showing the activity of signatures over time was obtained.
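Fitting mixture coefficients for a fixed set of signatures can be sketched with a simple expectation-maximisation loop. This is not the TrackSig implementation. It is a generic maximum likelihood fit of a mixture of multinomials with known components, and the toy example uses four mutation classes rather than 96 for brevity. All names and values are hypothetical.

```python
from typing import List


def fit_signature_activities(counts: List[float],
                             signatures: List[List[float]],
                             n_iter: int = 200) -> List[float]:
    """Estimate mixture coefficients (activities) for fixed signatures by EM.

    counts[j] is the number of mutations in class j at one time point.
    signatures[s][j] is the probability of class j under signature s.
    """
    n_sig = len(signatures)
    activities = [1.0 / n_sig] * n_sig
    for _ in range(n_iter):
        updated = [0.0] * n_sig
        for j, x in enumerate(counts):
            if x == 0:
                continue
            denom = sum(activities[s] * signatures[s][j] for s in range(n_sig))
            if denom == 0:
                continue  # class cannot be generated by any active signature
            # E-step: share the x mutations among signatures by responsibility.
            for s in range(n_sig):
                updated[s] += x * activities[s] * signatures[s][j] / denom
        # M-step: activities are the normalised expected mutation counts.
        norm = sum(updated)
        activities = [u / norm for u in updated]
    return activities


# Toy example: two signatures over four classes, mixed at 25% and 75%.
toy_signatures = [[0.7, 0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.1, 0.7]]
toy_counts = [250.0, 100.0, 100.0, 550.0]
toy_activities = fit_signature_activities(toy_counts, toy_signatures)
```

Repeating the fit for each bin of time-ordered mutations yields one activity vector per time point, which together form the trajectory.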
We applied likelihood maximisation and the Bayesian Information Criterion to simulations to establish the optimal threshold at which signature activity changes can be detected. This threshold was determined to be 6%. Subsequently, a pair of adjacent mutation bins was marked as constituting a change in activity if the absolute difference in activity between the bins of at least one signature was greater than the threshold.
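The thresholding rule can be written as a short sketch that follows the description literally. The function name and the trajectory representation (one list of signature activities per bin, expressed as fractions) are assumptions.

```python
from typing import List


def activity_change_bins(trajectory: List[List[float]],
                         threshold: float = 0.06) -> List[int]:
    """Return indices i where bins i-1 and i differ by more than `threshold`
    in the activity of at least one signature."""
    changes = []
    for i in range(1, len(trajectory)):
        previous, current = trajectory[i - 1], trajectory[i]
        if any(abs(c - p) > threshold for p, c in zip(previous, current)):
            changes.append(i)
    return changes


# Hypothetical trajectory of two signatures across four bins.
toy_trajectory = [[0.50, 0.50],
                  [0.52, 0.48],
                  [0.40, 0.60],
                  [0.41, 0.59]]
toy_changes = activity_change_bins(toy_trajectory)
```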
Signature trajectories were mapped to our subclonal reconstruction architectures by dividing the CCF space according to the proportion of mutations per time point belonging to a mutation cluster determined by the consensus reconstruction. By comparing distances in pseudo-time between trajectory change points and cluster boundaries, change points were classified as “supporting” a boundary if they were no more than three bins apart.
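The classification of change points against cluster boundaries can be sketched as follows, assuming that both are expressed as bin indices on the same pseudo-timeline. The function name and example values are hypothetical.

```python
from typing import List


def supporting_change_points(change_bins: List[int],
                             boundary_bins: List[int],
                             max_distance: int = 3) -> List[int]:
    """Return the change points that lie within `max_distance` bins of
    at least one cluster boundary."""
    return [cp for cp in change_bins
            if any(abs(cp - b) <= max_distance for b in boundary_bins)]


# Hypothetical change points and cluster boundaries (bin indices).
toy_supporting = supporting_change_points([2, 10, 20], [4, 15])
```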
Acknowledgements
This work was supported by the Francis Crick Institute, which receives its core funding from Cancer Research UK (FC001202), the UK Medical Research Council (FC001202), and the Wellcome Trust (FC001202). This project was enabled through access to the MRC eMedLab Medical Bioinformatics infrastructure, supported by the Medical Research Council (grant number MR/L016311/1). MT and JD are postdoctoral fellows supported by the European Union’s Horizon 2020 research and innovation program (Marie Sklodowska-Curie Grant Agreement No. 747852-SIOMICS and 703594-DECODE). JD is a postdoctoral fellow of the Research Foundation – Flanders (FWO). IVG was supported by a Wellcome Trust PhD fellowship (grant number WT097678). SM is funded by a Vanier Canada Graduate Scholarship. SCS is supported by the NSERC Discovery Frontiers Project, “The Cancer Genome Collaboratory” and by NIH GM108308. DJA is supported by Cancer Research UK. FM, GM and KeY would like to acknowledge the support of the University of Cambridge, Cancer Research UK and Hutchison Whampoa Limited. GM, KeY and FM are funded by CRUK core grants C14303/A17197 and A19274. SSe and YJ are supported by NIH R01 CA132897. HZ is supported by grant NIMH086633 and an endowed Bao-Shan Jing Professorship in Diagnostic Imaging. PTS is supported by U24CA210957 and 1U24CA143799. WW is supported by the U.S. National Cancer Institute (1R01 CA183793 and P30 CA016672). DCW is funded by the Li Ka Shing foundation. PVL is a Winton Group Leader in recognition of the Winton Charitable Foundation’s support towards the establishment of The Francis Crick Institute. We gratefully acknowledge Nicholas McGranahan and Charles Swanton for valuable comments on our manuscript.