Summary
Sequencing of whole tumor genomes holds the promise of revealing functional somatic regulatory mutations, such as those described in the TERT promoter. Recurrent promoter mutations have been identified in many additional genes and appear to be particularly common in melanoma, but convincing functional data such as influence on gene expression has been more elusive. Here, we show that frequently recurring promoter mutations in melanoma occur almost exclusively at cytosines flanked by a distinct sequence signature, TTCCG, with TERT as a notable exception. In active, but not inactive, promoters, mutation frequencies for cytosines at the 5’ end of this ETS-like motif were considerably higher than expected based on a UV trinucleotide mutational signature. Additional analyses solidify this pattern as an extended context-specific mutational signature that mediates an exceptional position-specific vulnerability to UV mutagenesis, arguing against positive selection. We further use ultra-sensitive amplicon sequencing to demonstrate that cell cultures exposed to UV light quickly develop subclonal mutations specifically in affected positions. Our findings have implications for the interpretation of somatic mutations in regulatory regions, and underscore the importance of genomic context and extended sequence patterns to accurately describe mutational signatures in cancer.
Main text
A major challenge in cancer genomics is the separation of functional somatic driver mutations from non-functional passengers. This problem is relevant not only in coding regions, but also in the context of non-coding regulatory regions such as promoters, where putative driver mutations are now mappable with relative ease using whole genome sequencing1,2. One important indicator of driver function is recurrence across independent tumors, which can be suggestive of positive selection. However, proper interpretation of recurrent mutations requires a detailed understanding of how somatic mutations occur in the absence of selection pressures. Somatic mutations are not uniformly distributed across tumor genomes, and regional variations in mutation rates have been associated with differences in transcriptional activity, replication timing as well as chromatin accessibility and modification3–5. Impaired nucleotide excision repair (NER) has been shown to contribute to increased local mutation density in promoter regions and protein binding sites6,7. Additionally, analyses of mutational processes and their sequence signatures have shown the importance of the immediate sequence context for local mutation rates8. Still, our understanding of mutational heterogeneity is incomplete, and it is not clear to what extent such effects can explain recurrent somatic mutations in promoter regions, which are suggested by some studies to be particularly frequent in melanoma despite several other cancer types approaching melanoma in terms of total mutation load9,10.
To characterize somatic promoter mutations in melanoma, we analyzed the sequence context of recurrently mutated individual genomic positions occurring within +/- 500 bp of annotated transcription start sites (TSSs), based on 38 melanomas subjected to whole genome sequencing by the Cancer Genome Atlas10,11. Strikingly, of 17 highly recurrent promoter mutations (recurring in at least 5/38 of tumors, 13%), 14 conformed to an identical 6 bp sequence signature (Table 1, Fig. 1a). Importantly, the only exceptions were the previously described TERT promoter mutations at chr5:1,295,228, 1,295,242 and 1,295,25012,13 (Table 1, Fig. 1b). The recurrent mutations occurred at cytosines positioned at the 5’ end or one base upstream of the motif CTTCCG (Fig. 1c), and were normally C>T or CC>TT transitions (Table 1). Similar to most mutations in melanoma they were thus C>T changes in a dipyrimidine context, compatible with UV-induced damage through cyclobutane pyrimidine dimer (CPD) or 6-4 photoproduct formation8,14. Out of 15 additional positions recurrently mutated in 4/38 tumors, 13 conformed to the same pattern, while the remaining two showed related sequence contexts (Table 1). Many less recurrent sites also showed the same pattern(Supplementary Table 1). The signature described here matches the consensus binding sequence of ETS family transcription factors (TFs)15, and the results are consistent with recent reports showing that ETS promoter sites are often recurrently mutated in melanoma9 and that such mutations preferably occur at cytosines upstream of the core TTCC sequence16. Thus, while recurrent promoter mutations are common in melanoma, they consistently adhere to a distinct sequence signature, which may argue against positive selection as a major causative factor.
The recurrently mutated positions were next investigated in additional cancer cohorts, first by confirming them in an independent melanoma dataset17 (Supplementary Table 2). We found that the identified hotspot positions were often mutated also in cutaneous squamous cell carcinoma (cSCC)18 (Supplementary Table 3) as well as in sun-exposed skin18,19, albeit at lower variant frequencies (Supplementary Fig. 1, Supplementary Table 4). Additionally, one of the mutations, upstream of DPH3, was recently described as highly recurrent in basal cell skin carcinoma20. However, we did not detect mutations in these positions in 13 non-UV-exposed cancer types (Supplementary Table 5). The hotspots are thus present in UV-exposed samples of diverse cellular origins, but in contrast to the TERT promoter mutations they are completely absent in non-UV-exposed cancers. This further supports that recurrent mutations at the 5’ end of CTTCCG elements are due to elevated susceptibility to UV-induced mutagenesis in these positions.
Next, we considered additional properties that could support or argue against a functional role for the recurrent mutations. We first noted a general lack of known cancer-related genes among the affected promoters, with TERT as one of few exceptions (Table 1 and Supplementary Table 1, indicated in blue). Secondly, the recurrent promoter mutations were not associated with differential expression of the nearby genes (Table 1 and Supplementary Table 1). This is in agreement with earlier investigations of some of these mutations, which gave no conclusive evidence regarding influence on gene expression9,16,20, although it should be noted that significant association is lacking also for TERT in this relatively small cohort10. Lastly, we found that when comparing different tumors there was a strong positive correlation between the total number of the established hotspot positions that were mutated and the genome-wide mutation load, both in melanoma (Fig. 2a; Spearman’s r 0.88, P = 2.8e-13) and in cSCC (Supplementary Table 3; r = 0.78, P = 0.026). This is again compatible with a passive model involving elevated mutation probability in the affected positions. Importantly, this contrasted sharply with most of the major driver mutations in melanoma, which were detected also in tumors with lower mutation load (Fig. 2b, Supplementary Table 3). These different findings further reinforce the CTTCCG motif as a strong mutational signature in melanoma.
We next investigated whether the observed signature would be relevant also outside of promoter regions. As expected, numerous mutations occurred in CTTCCG sequences across the genome, but notably we found that recurrent mutations involving this motif were always located close to actively transcribed TSSs (Fig. 3abc). We further compared the frequencies of mutations occurring at cytosines in the context of the motif to all possible trinucleotide contexts, an established way of describing mutational signatures in cancer8. As expected, on a genome-wide scale, the mutation probability for cytosines in CTTCCG-related contexts was only marginally higher compared to corresponding trinucleotide contexts (Fig. 4a). However, close to TSSs, the signature conferred a striking elevation in mutation probability compared to related trinucleotides, in particular for cytosines at the 5’ end of the motif and most notably near highly expressed genes (Fig. 4b-d). Recurrent promoter mutations in melanoma thus conform to a distinct sequence signature manifested only in the context active promoters, suggesting that a specific binding partner is required for the element to confer elevated mutation probability.
CTTCCG elements have in various individual promoters been shown to be bound by ETS factors such as ETS1, GABPA and ELF121, ELK422, and E4TF123. This suggests that the recurrently mutated CTTCCG elements could be substrates for ETS TFs. As expected, matches to CTTCCG in the JASPAR database of TF binding motifs were mainly ETS-related (Supplementary Table 6). Notably, recurrently mutated CTTCCG sites were evolutionarily conserved to a larger degree than non-recurrently mutated but otherwise similar control sites, further supporting that they constitute functional ETS binding sites (Supplementary Fig. 2). This was corroborated by analysis of top recurrent CTTCCG sites in relation to ENCODE ChIP-seq data for 161 TFs, which showed that the strongest and most consistent signals were for ETS factors (GABPA and ELF1) (Supplementary Fig. 3).
The distribution of mutations across tumor genomes is shaped both by mutagenic and DNA repair processes. Binding of TFs to DNA can increase local mutation rates by impairing NER, and strong increases have been observed in predicted sites for several ETS factors6,7. It is also established that contacts between DNA and proteins can modulate DNA damage patterns by altering conditions for UV photoproduct formation24–27. In upstream regions of XPC -/- cSCC tumors lacking global NER, we found that the CTTCCG signature still conferred strongly elevated mutation probabilities compared to relevant trinucleotide contexts (Supplementary Fig. 4), although to a lesser extent than in melanomas with functional NER (Fig. 4). Transcription-coupled NER (TC-NER) may still be active in XPC -/- tumors, and the signature could thus theoretically arise due to inhibition of TC-NER at CTTCCG elements. However, the recurrently mutated positions were typically positioned upstream of the TSSs (Fig. 1) and should not be subjected to this process. Additionally, TC-NER is strand-specific14 while the mutations occurred independently of strand orientation relative to the downstream gene (Supplementary Fig. 4). The signature described here is thus unlikely explained by impaired NER alone and may instead arise due to inhibition of other repair processes or due to favorable conditions for UV lesion formation at the 5’ end of ETS-bound CTTCCG elements.
Finally, we sought to experimentally test our proposed model that the observed promoter hotspots are due to localized vulnerability to mutagenesis by UV light. We subjected human melanoma cells and keratinocytes to daily UV doses for a period of 5 or 10 weeks (Fig. 5a) and used an ultrasensititive error-correcting amplicon sequencing protocol, SiMSen-Seq28, to assay two of the observed promoter hotpots for mutations: RPL13A, the most frequently mutated site in the tumor data (Table 1), and DPH310,20. Between 36k and 82k error-corrected reads (>20x oversampling) were obtained for each of 16 different conditions (Fig. 5b-c). Strikingly, subclonal mutations appeared specifically in expected positions at both time points and in both cell lines at a frequency reaching up to 2.9% of fragments (RPL13A, 10 weeks of exposure), while being absent in non-exposed control cells (Fig. 5d-e). As predicted by the tumor data, mutations occurred primarily at cytosines upstream of the TTCCG motif, with lower-frequency mutations occurring also in the central cytosines. Few mutations were observed outside of the TTCCG context despite presence of many cytosines in theoretically vulnerable configurations in the two amplicons (Fig. 5d-e, underscored). These results further reinforce that recurrent mutation hotspots in promoters in melanoma arise due to an exceptional vulnerability to UV mutagenesis in these positions.
In summary, we demonstrate that recurrent promoter mutations are common in melanoma, but also that they adhere to a distinct sequence signature in a strikingly consistent manner, arguing against positive selection as a major driving force. This model is supported by several additional observations, including lack of cancer-relevant genes, lack of obvious effects on gene expression, presence of the signature exclusively in UV-exposed samples of diverse cellular origins, and strong positive correlation between genome-wide mutation load and mutations in the affected positions. Crucially, exposing cells to UV light under controlled conditions efficiently induces mutations specifically in affected sites. These results point to limitations in conventional genome-wide derived trinucleotide models of mutational signatures, and imply that extended sequence patterns as well as genomic context should be taken into account to improve interpretation of somatic mutations in regulatory DNA.
Methods
Mapping of somatic mutations
Whole-genome sequencing data for 38 skin cutaneous melanoma (SKCM) metastases was obtained from the Cancer Genome Atlas (TCGA) together with matching RNA-seq data. Mutations were called using samtools29 (command mpileup with default settings and additional options -q1 and –B) and VarScan30 (command somatic using the default minimum variant frequency of 0.20, minimum normal coverage of 8 reads, minimum tumor coverage of 6 reads and the additional option –strand-filter 1). Mutations where the variant base was detected in the matching normal were not considered for analysis. The resulting set of mutations was further processed by removing mutations overlapping germline variants included in the NCBI dbSNP database, Build 146. The genomic annotation used was GENCODE31 release 17, mapped to GRCh37. The TSS of a gene was defined as the 5’most annotated transcription start. Somatic mutation status for known driver genes was obtained from the cBioPortal32,33.
RNA-seq data processing
RNA-seq data was analyzed with resepect to the GENCODE31 (v17) annotation using HTSeq-count (http://www-huber.embl.de/users/anders/HTSeq) as previously described34. Differential gene expression between tumors with and without mutations in promoter regions was evaluated using the two-sided Wilcoxon rank sum test, avoiding assumptions about distribution or directionality.
Analyzed genomic regions
The SKCM tumors were analyzed across the whole genome or in regions close to TSS, in which case only mutations less than 500 bp upstream or downstream of TSS were included. For the analysis of regions close to TSS the genes were divided in three tiers of equal size based on the mean gene expression level across the 38 SKCM tumors.
Mutation probability calculation
The February 2009 assembly of the human genome (hg19/GRCh37) was downloaded from the UCSC Genome Bioinformatics site. Sequence motif and trinucleotide frequencies were obtained using the tool fuzznuc included in the software suite EMBOSS35. The mutation probability was calculated as the total number of observed mutations in a given sequence context across all tumors divided by the number of instances of this sequence multiplied by the number of tumors.
Evolutionary conservation data
The evolutionary conservation of genome regions was evaluated using phastCons scores36 from multiple alignments of 100 vertebrate species retrieved from the UCSC genome browser. The analyzed regions were 30 bases upstream and downstream of the motif CTTCCG located less than 500 bp from TSS.
ChIP-seq data
Binding of transcription factors at NCTTCCGN sites was evaluated using normalized scores for ChIP-seq peaks from 161 transcription factors in 91 cell types (ENCODE track wgEncodeRegTfbsClusteredV3) obtained from the UCSC genome browser.
Analysis of whole genome sequencing data from UV-exposed skin
Whole genome sequencing data from sun-exposed skin, eye-lid epidermis, was obtained from Martincorena et al., 201519. Samtools29 (command mpileup with a minimum mapping quality of 60, a minimum base quality of 30 and additional option –B) was used to process the data and VarScan30 (command mpileup2snp counting all variants present in at least one read, with minimum coverage of one read and the additional strand filter option disabled) was used for mutation calling.
Analysis of whole genome sequencing data from cSCC tumors
Whole genome sequencing data from 8 cSCC tumors and matching peritumoral skin samples was obtained from Durinck et al., 201137. Whole genome sequencing data from cSCC tumors and matching peritumoral skin from 5 patients with germline DNA repair deficiency due to homozygous frameshift mutations (C940del-1) in the XPC gene was obtained from Zheng etal., 201418. Samtools29 (command mpileup with a minimum mapping quality of 30, a minimum base quality of 30 and additional option –B) was used to process the data and (command mpileup2snp counting all variants present in at least one read, with minimum coverage of two reads and the additional strand filter option disabled) was used for mutation calling. For the mutation probability analysis of cSCC tumors with NER deficiency an additional filter was applied to only consider mutations with a total coverage of at least 10 reads and a variant frequency of at least 0.2. The functional impact of mutations in driver genes was evaluated using PROVEAN38 and SIFT39. Non-synonymous mutations that were considered deleterious by PROVEAN or damaging by SIFT were counted as driver mutations.
Cell lines and UV treatments
A375 melanoma cells were a gift from Joydeep Bradbury and HaCaT keratinocyte cells were a gift from Maria Ericsson. Cells were grown in DMEM + 10% FCS + gentamycin (A375) or pen/strep (HaCaT) (Thermo Scientific). Cells were treated in DMEM in 10 cm plates without lids with 36 J/m2 UVC 254 nm (equivalent to 6 hour daily dose at 0.1J/m2/min40, CL-1000 UV crosslinker, UVP), 5 days a week for 10 weeks. Cells were split when confluent and reseeded at 1:5. Cells were frozen at -20°C.
DNA purification
DNA was extracted based on Tornaletti and Pfeifer 41. Briefly, cell pellets were lysed in 0.5 ml of 20 mM Tris-HCl (pH 8.0), 20 mM NaCl, 20mM EDTA, 1% (w/v) sodium dodecyl sulfate, 600 mg/ml of proteinase K, and 0.5 ml of 150 mM NaCl, 10 mM EDTA. The solution was incubated for two hours at 37°C. DNA was extracted twice with phenol-chloroform and once with chloroform and precipitated by adding 0.1 vol. 3 M sodium acetate (pH 5.2), and 2.5 volumes of ethanol. The pellets were washed with 75% ethanol and briefly air-dried. DNA was dissolved in 10 mM Tris-HCl (pH 7.6), 1 mM EDTA (TE buffer) (all from Sigma Aldrich). DNA was treated with RNAse for 1 hr at 37°C and phenol-chloroform extracted and ethanol precipitated before dissolving in TE buffer.
Ultrasensitive mutation analysis
To detect and quantify mutations we applied SiMSen-Seq (Simple, Multiplexed, PCR-based barcoding of DNA for Sensitive mutation detection using Sequencing) as described28. Briefly, barcoding of 150 ng DNA was performed in 10 µL using 1x Phusion HF Buffer, 0.1U Phusion II High-Fidelity polymerase, 200 µM dNTPs (all Thermo Fisher Scientific), 40 nM of each primer (PAGE-purified, Integrated DNA Technologies) and 0.5M L-Carnitine inner salt (Sigma Aldrich). Barcode primer sequences are shown in Supplementary Table 8. The temperature profile was 98 °C for 3 min followed by three cycles of amplification (98 °C for 10 sec, 62 °C for 6 min and 72 °C for 30 sec), 65 °C for 15 min and 95 °C for 15 min. The reaction was terminated by adding 20 µL TE buffer, pH 8.0 (Invitrogen, Thermo Fisher Scientific) containing 30 ng/µL protease from Streptomyces griseus (Sigma Aldrich) at the beginning of the 65 °C incubation step. Next, 10 µL of the diluted barcoded PCR products were amplified in a 40 µL using 1x Q5 Hot Start High-Fidelity Master Mix (New England BioLabs) and 400 nM of each sequencing adapter primer. Adapter primers are shown in Supplementary Table 8. The temperature profile was 95 °C for 3 min followed by 40 cycles of amplification (98 °C for 10 sec, 80 °C for 1 sec, 72 °C for 30 sec and 76 °C for 30 sec, with a ramp rate of 0.2 °C/sec). The 40 µL PCR products were then purified using Agencourt AMPure XP beads (Beckman-Coulter) according to the manufacturers’ instructions using a bead to sample ratio of 1. The purified product was eluted in 20 µL TE buffer, pH 8.0. Library concentration and quality was assessed using a Fragment Analyzer (Advanced Analytical). Final libraries were pooled to equal molarity in Buffer EB (10 mM Tris-HCl, pH 8.5, Qiagen) containing 0.1% TWEEN 20 (Sigma Aldrich).
Sequencing was performed on an Illumina NextSeq 500 instrument at Tataa Biocenter (Gothenburg, Sweden) using 150 bp single-end reads. Raw FastQ files were subsequently processed as described28 using Debarcer Version 0.3.0 (https://github.com/oicr-gsi/debarcer). Sequence reads with the same barcode were grouped into families for each amplicon. Barcode families with at least 20 reads, where ≥ 90 % of the reads were identical, were required to compute consensus reads. FastQ files were deposited in the Sequence Read Archive (SRA).
Supplementary figures
Supplementary tables
Acknowledgements
The results published here are in whole or part based upon data generated by The Cancer Genome Atlas pilot project established by the NCI and NHGRI. Information about TCGA and the investigators and institutions who constitute the TCGA research network can be found at http://cancergenome.nih.gov. We are most grateful to the patients, investigators, clinicians, technical personnel, and funding bodies who contributed to TCGA, thereby making this study possible. E. L. was supported by the Knut and Alice Wallenberg Foundation, the Swedish Foundation for Strategic Research, the Swedish Medical Research Council, the Swedish Cancer Society, the Åke Wiberg foundation, and the Lars Erik Lundberg Foundation for Research and Education. A. S. was supported by Sahlgrenska Academy-ALF, the Swedish Childhood Cancer Foundation, the Swedish Cancer Society, and the Wallenberg Centre for Molecular and Translational Medicine. Computations were in part performed on resources provided by SNIC through Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) under project b2012108.
Author contributions
J.F and E.L. conceived the study; J.F performed bioinformatics analyses; J.F and E.L. wrote the paper; K.E. performed cell culture and UV irradiation experiments; S.F. and A.S performed ultrasensitive amplicon sequencing.
Competing financial interests
The applied SiMSen-Seq approach is patent pending (A.S.). The other authors declare no competing financial interests or other conflict of interest.