ABSTRACT
Somatic mutations are the driving force of cancer genome evolution1. The rate of somatic mutations varies greatly across the genome owing to chromatin organization, DNA accessibility and replication timing2-5. However, whether other variables, such as DNA-binding proteins, influence the mutation rate locally is unknown. Here we demonstrate that the rate of somatic mutations in melanoma tumors is highly increased at active transcription factor binding sites (TFBS) and nucleosome-embedded DNA, compared to their flanking regions. Using recently available excision-repair sequencing (XR-seq) data6, we show that the higher mutation rate at these sites is caused by a decrease in the levels of nucleotide excision repair (NER) activity. Therefore, our work demonstrates that DNA-bound proteins interfere with the NER machinery, which results in an increased rate of mutations at their binding sites. This finding has important implications for our understanding of mutational and DNA repair processes and for the identification of cancer driver mutations.
The accumulation of somatic mutations in cells results from the interplay of mutagenic processes, both internal and exogenous, and mechanisms of DNA repair. Recent efforts to sequence the whole genome of tumor samples from different tumor types7,8 have shed light on this interplay. On the one hand, mutational signatures associated with various tumorigenic mechanisms have been identified across cancer types9; on the other, genomic features such as chromatin organization, DNA accessibility, and DNA replication timing2-5 have been associated with the variation of somatic mutation rates at the megabase scale. Two recent studies proposed a causal relationship between the accessibility of chromosomal areas to the DNA repair machinery and their mutational burden. Supek and Lehner, 201510 pointed to variable repair of DNA mismatches as the basis of the megabase-scale variation of somatic mutation rates across the human genome. Polak et al. 20144 attributed lower somatic mutation rates at DNase-I hypersensitive sites (DHS) in cell lines and primary tumors than at their flanking regions and the rest of the genome to higher accessibility to the global genome repair machinery. Similarly, nucleosome occupancy has been linked to regional mutation rate variation between the nucleosome-bound DNA and linker regions11-13, while two recent studies found a relation between transcription factor binding sites (TFBS) and nucleotide substitution rates. Reijns et al. 201514 detected increased levels of nucleotide substitutions around TFBS in the yeast genome, which were attributed to DNA-binding proteins acting as partial barriers to the polymerase-delta-mediated displacement of polymerase-alpha-synthesized DNA. Katainen et al. 201515 found that CTCF/cohesin-binding sites are frequently mutated in colorectal tumors and in a small subset of tumors of other cancer types, and suggested that these mutations are probably caused by challenged DNA replication under aberrant conditions.
To elucidate the impact of DNA-binding proteins on DNA repair, we analyzed the somatic mutation rate at TFBS in the genomes of 38 primary melanoma samples sequenced by TCGA16. We found that the mutation rate was approximately five times higher in active TFBS, i.e., those overlapping DHS (Fig. 1a) than in their flanking regions (P < 2.2 × 10−6, chi-square test). We determined that this elevated mutation rate could not be explained by the sequence context (Fig. 1a), and that it did not occur at inactive TFBS (Fig. 1a and Extended Data Fig. 1), indicating that it is directly related to the protein bound to DNA. Furthermore, this enrichment for mutations appeared at the active binding sites of most transcription factors (TFs) (Fig. 1b, Extended Data Fig. 2 and Supplementary Table 1); the signal was discernible in most analyzed melanoma samples (Fig. 1c and Supplementary Table 2), and it increased with genome-wide mutation rate.
Most somatic mutations in melanocytes are caused by exposure to ultraviolet (UV) radiation9. UV radiation causes specific DNA lesions or DNA photoproducts, namely cyclobutane pyrimidine dimers (CPDs) and (6-4) pyrimidine-pyrimidone photoproducts ((6-4)PPs), at the sites of dipyrimidines17. As expected, C>T (G>A) mutations predominated over other nucleotide changes in melanomas (Fig. 1d), both within TFBS and at their flanks. The elevated mutation rate at TFBS could be explained by either faulty DNA repair or a higher probability of UV-induced lesions18,19 at protein-bound DNA.
Next, we focused on active TFBS in regions distal from transcription start sites, and again found an increased mutation rate at binding sites, flanked by periodic peaks of mutation rate observed at a distance of ~146 bp, which coincides well with the length of DNA wrapped in a nucleosome. When we superimposed the nucleosome positioning signals from ENCODE20 and these mutation rate peaks, we verified that their positions matched perfectly (Fig. 2a). Furthermore, we found that the peak of mutation rate observed at the center of DHS regions occurred exclusively at TFBS located within promoter regions (DHS-Promoters-TFBS), and was absent from DHS-noPromoter-noTFBS (Fig. 2b). This corroborated that, whatever the process causing the increase in mutation rate, it required proteins to be bound to the DNA.
We then asked whether the cause of the higher mutation rate in TFBS and nucleosomes was the reduced accessibility of the protein-bound DNA to the NER machinery. Non-repaired nucleotides would be bypassed by polymerases carrying out translesion DNA synthesis, thus resulting in mutations21. To test this, we assembled nucleotide-resolution maps of the NER activity on the two products of UV-induced DNA damage, CPDs and (6-4)PPs, generated by Hu et al., 2015 using XR-seq in irradiated skin fibroblasts6. In XR-seq, the excised ∼30-mer around the site of damage generated during nucleotide excision repair is isolated and subjected to high-throughput sequencing. When we analyzed the genome-wide signal of this NER map, we found a strong decrease in the amount of CPD and (6-4)PP repair at the center of TFBS (Fig. 3a and Extended Data Fig. 3), compared to their flanking regions. The decrease was apparent both in wild-type cells (NHF1) and in CS-B mutant cell lines, which lack transcription-coupled repair6 (Fig. 3a and Extended Data Fig. 3), and it appeared at the binding sites of individual transcription factors (Extended Data Fig. 4). Moreover, we found that the level of DNA excision repair (and the mutation rate) at TFBS correlated with the strength of their binding (Fig. 3b and Extended Data Fig. 5). We concluded from these observations that the higher mutation rate observed at active TFBS is caused by a decrease of the NER activity.
A previous study related higher DNA repair activity at DHS than outside DHS to greater accessibility to the repair machinery4. By specifically deconvoluting the signal of mutation rate within DHS, our work goes a step further to show that bound TFs at the center of DHS actually hinder DNA repair. This interplay of greater NER at DHS and lower NER at TF-bound sites in their center results in a volcano-shaped pattern of NER activity around the TFBS, with a strong depletion exactly at its center flanked by two mountains in the DHS area around it (Fig. 3). The volcano shape is more pronounced at distal TFBS, those that occur distant from transcription start sites (TSS) (Fig. 3a), which may be explained by the presence of shorter regions of open chromatin surrounded by compacted DNA. Moreover, a periodicity in NER activity is observable for the first nucleosomes around TFBS (Fig. 3a), which matches the previously noted periodic variation of the mutation rate. Also consistent with the mutation rate pattern, the signal of decreased NER activity is clearer at the center of DHS-Promoters-TFBS, exactly at the position of the TFBS (Extended Data Fig. 6c). These results demonstrate that repair activity in DHS regions is in general higher than in non-DHS regions, supporting previous observations4; however, this activity is specifically impaired at sites with bound transcription factors.
NER consists of two pathways: global repair, which targets lesions in a genome-wide manner, and transcription-coupled repair, which recognizes lesions within transcribed regions17. These pathways differ in the initial steps of damage recognition, although they share the core components that excise damaged regions. To discern the effect of DNA-bound TFs on transcription-coupled NER, we focused on transcribed regions centered at TFBS at least 200 bp downstream of TSS, and plotted together mutation rate and XR-seq data in XP-C cells, which only have transcription-coupled repair6. Mutation rate is also increased at the center of transcribed TFBS, and the volcano shape of repair rate in XP-C cells is apparent for TFs bound to either the template or the non-template strand (Extended Data Fig. 7). This result demonstrates that the decrease in NER caused by bound TFs results from impairment of both NER pathways.
NER recognizes and repairs other DNA lesions besides those induced by UV light, such as DNA adducts induced by smoking-related carcinogens (e.g. benzo[a]pyrene diol epoxide)22. We therefore hypothesized that the conclusion we had drawn from the observations made in melanomas could be extended to other tumor types. We observed higher mutation rates at TFBS in lung adenocarcinomas and lung squamous cell carcinomas, in particular for C>A variants, which correspond to the mutations caused by tobacco smoking9 (Extended Data Fig. 8). In contrast, no increase of the mutation rate in TFBS is observed in colon adenocarcinomas, where NER activity is not expected to play a major role in the mutational process.
Two previous studies have described abnormal mutation rates in connection with a group of DNA-bound TFs in yeast14 and CTCF/cohesin sites in a subset of colorectal tumors15. However, in contrast to our results, in neither of these studies were the peaks of mutation rate attributed to impairment of NER resulting from bound proteins. In the former, the higher mutation rate at specific TFBS was related to polymerase-delta-mediated displacement of polymerase-alpha-synthesized DNA during replication. In the latter, the higher mutation rate at CTCF/cohesin sites in a subset of colorectal tumors was attributed to challenged DNA replication under aberrant conditions.
In summary, our results demonstrate that the accessibility of the DNA to the NER machinery directly determines the distribution of mutational density at the nucleotide scale. The increased repair in freely accessible, nucleosome-free DNA around TFBS and the decline in repair efficiency exactly at TFBS produce a lower mutation rate in the periphery of DHS sites and a higher mutation rate at their center (Fig. 4). Moreover, periodic signals of higher mutation rate and lower NER in closed chromatin regions coincide with nucleosome occupancy, suggesting that nucleosomes produce the same type of impairment to NER.
These findings have strong implications for our basic understanding of the mechanisms of DNA repair in human cells, as well as for the study of tumor evolution and cancer-associated somatic mutations. They indicate that most mutations in TFBS accumulate due to faulty repair at these sites. Therefore, methods designed to identify potential somatic driver mutations in non-coding regions, which typically exploit the mutational patterns of genomic elements, must construct models of the background mutation rate that accurately take into account the increased mutation density at TFBS due to faulty repair.
Methods
Mutation data
Whole-genome somatic mutations of 38 skin cutaneous melanomas (SKCM), 46 lung adenocarcinomas (LUAD), 45 lung squamous cell carcinomas (LUSC), and 42 colorectal adenocarcinomas (CRC) identified by TCGA were obtained from Fredriksson et al., 201416. As suggested by the authors of that paper, we considered in our analyses only single nucleotide substitutions with a minimum variant frequency of 0.2 and which do not overlap dbSNP entries (v138). The total number of mutations of each cancer type passing these thresholds is listed in Extended Data Table 1. We separated CRC samples into two groups: hypermutated (with mutations of the DNA polymerase epsilon (POL-E) gene; n = 8 samples) and hypomutated (the rest; n = 34 samples).
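As an illustration only, the filtering criteria described above can be sketched in Python. The record layout (Mutation), the function name and the toy data are hypothetical and are not taken from the original pipeline; we also assume that "minimum variant frequency of 0.2" means a frequency greater than or equal to 0.2.

```python
from typing import NamedTuple


class Mutation(NamedTuple):
    """Hypothetical record for one somatic call (field names are assumptions)."""
    chrom: str
    pos: int             # genomic position
    ref: str
    alt: str
    variant_freq: float  # fraction of reads supporting the variant


def filter_mutations(mutations, dbsnp_sites, min_freq=0.2):
    """Keep single-nucleotide substitutions with variant frequency >= min_freq
    whose position is not listed in dbsnp_sites (a set of (chrom, pos))."""
    kept = []
    for m in mutations:
        is_snv = len(m.ref) == 1 and len(m.alt) == 1 and m.ref != m.alt
        known = (m.chrom, m.pos) in dbsnp_sites
        if is_snv and m.variant_freq >= min_freq and not known:
            kept.append(m)
    return kept


# Toy input: only the first call passes all three criteria.
calls = [
    Mutation("chr1", 100, "C", "T", 0.35),
    Mutation("chr1", 200, "C", "T", 0.10),   # below the frequency threshold
    Mutation("chr2", 300, "G", "A", 0.50),   # overlaps a dbSNP entry
    Mutation("chr2", 400, "AT", "A", 0.60),  # not a single-nucleotide substitution
]
dbsnp = {("chr2", 300)}
filtered = filter_mutations(calls, dbsnp)
```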
Genomic elements
The genomic coordinates of transcription factor binding sites (TFBS), i.e., TF motif matches under ChIP-seq peak regions, were obtained from ENCODE20. These comprised the binding sites of 109 transcription factors (TF) as used in Khurana et al., 201323. We also obtained from ENCODE predicted binding sites of 52 transcription factors which are not supported by ChIP-seq peaks (termed unbound TFBS). In addition, we obtained the binding sites of 32 TFs used in Reijns et al., 201514. We treated the latter as an independent data set and, following the authors of the original paper14, we clustered the TFBS into quartiles according to the binding strength or occupancy of the TFs at their sites, quantified through ChIP-seq read coverage.
As promoters, we considered the DNA sequences up to 2.5 kb upstream of transcription start sites (TSS) of all protein-coding genes in GENCODE24 (v19). Promoter regions overlapping coding sequences (CDS) or untranslated regions (UTRs) were excluded. We classified TFBS as either proximal (i.e., overlapping these upstream promoters) or distal (i.e., located in intergenic regions, with no annotated TSS, as per GENCODE v19, within 5 kb on either side). A third group of TFBS was composed of those located downstream of TSS (between +200 bp and +500 bp) that do not overlap the upstream 2.5 kb promoter regions, i.e., TFBS in transcribed regions.
All TFBS overlapping DNase I hypersensitive sites (DHS) identified by the Epigenome Roadmap project25 in primary cell types most closely matching the cell of origin of each tumor type (see below) were considered active. We considered only DHS sites identified by the Hotspot algorithm (narrowPeaks at FDR 1%), which are typically 150 nt long. For each cancer type, the matching primary cell type was selected based on the recent study by Polak et al., 20155 (Extended Data Table 1). We chose the DHS from primary cell types (from the Epigenome Roadmap project) instead of cell lines (from ENCODE), because the chromatin features of the cell of origin of a tumor have been shown to correlate better with its mutation profile than those of matched cancer cell lines5. However, we selected the TFBS detected by ENCODE in cell lines (see above) due to the lack of TF binding site annotations in primary cells analyzed by the Epigenome Roadmap project25.
We then classified the TFBS in the samples of each tumor type as active or inactive based on their overlap, or lack thereof, with DHS regions (minimum 1 bp) of the matched primary cell type. Unbound TFBS (see above), which do not overlap with TF peaks or DHS regions, were considered as inactive TFBS and used as a negative control to compare with the active TFBS (in Extended Data Fig. 1). All genomic coordinates of TFBS used in this study as part of any aforementioned category are available at http://bg.upf.edu/tfbs.
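The active/inactive labelling by minimum 1 bp overlap can be illustrated with a short plain-Python sketch. This is a stand-in for the interval operations, not the BEDTools commands actually used; the function names, the half-open coordinate convention and the toy intervals are assumptions made for the example.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least 1 bp."""
    return a_start < b_end and b_start < a_end


def classify_tfbs(tfbs, dhs):
    """Label each TFBS 'active' if it overlaps any DHS on the same chromosome,
    otherwise 'inactive'.

    tfbs: dict mapping a site name to (chrom, start, end)
    dhs:  list of (chrom, start, end)
    """
    labels = {}
    for name, (chrom, start, end) in tfbs.items():
        hit = any(c == chrom and overlaps(start, end, s, e) for c, s, e in dhs)
        labels[name] = "active" if hit else "inactive"
    return labels


# Toy data: site_a overlaps the DHS by exactly 1 bp; the others do not overlap.
toy_tfbs = {
    "site_a": ("chr1", 100, 115),
    "site_b": ("chr1", 500, 515),
    "site_c": ("chr2", 100, 115),
}
toy_dhs = [("chr1", 114, 264)]
labels = classify_tfbs(toy_tfbs, toy_dhs)
```

The linear scan over all DHS is adequate for a toy example; for genome-scale data a sorted-interval or tree-based approach would be needed.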
Mutation rate estimation
In order to compare the mutation rate in TFBS to their neighboring regions, we considered flanking stretches of 1000 nucleotides at both sides of the TFBS mid-point. To exclude regions that could bias the mutation rate analyses, prior to mapping the somatic mutations to these selected 2001-nt windows, we filtered out any regions overlapping a) coding sequences, and b) regions flagged in the UCSC Browser as blacklisted (Duke and DAC), which are often misaligned to sites in the reference assembly, or as having low unique mappability of sequencing reads (“CRG Alignability 36' Track”26, score < 1) (http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability). In addition, regions that overlap other TFBS within flanking regions (immediately upstream or downstream of the TFBS) were excluded. The resulting filtered windows of each TFBS were then aligned (taking as reference the TFBS centers), and the mutation rate of every column i within the window was calculated as the total number of mutations mapped to nucleotides in column i divided by the total number of nucleotides observed in column i (after filtering). We computed this mutation rate for each TF separately, as well as globally for all TFs. In the latter case, prior to the calculation, we removed any repeated chromosomal positions (from different TFs) observed in a column.
In the case of the analysis centered on DHS, we considered flanking stretches of 1000 nucleotides at both sides of the DHS peak center and followed the same steps mentioned above to filter mutations and to compute the mutation rate.
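The column-wise mutation rate described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the data structures (a dictionary of mutation counts per position and a set of filtered-out positions) and all names are hypothetical, and the toy example uses a flank of 2 nt instead of 1000.

```python
def column_mutation_rates(centers, mutation_counts, excluded, flank=1000):
    """Mutation rate per aligned column of windows centered on each site.

    centers:         list of (chrom, midpoint) for TFBS or DHS centers
    mutation_counts: dict mapping (chrom, pos) -> number of mutations observed
    excluded:        set of (chrom, pos) removed by the filters
    Returns a list of 2*flank+1 rates (None where a column has no nucleotides).
    A chromosomal position is counted at most once per column, mirroring the
    removal of repeated positions in the global (all-TF) analysis.
    """
    width = 2 * flank + 1
    muts = [0] * width
    nts = [0] * width
    seen = [set() for _ in range(width)]
    for chrom, mid in centers:
        for i in range(width):
            key = (chrom, mid - flank + i)
            if key in excluded or key in seen[i]:
                continue
            seen[i].add(key)
            nts[i] += 1
            muts[i] += mutation_counts.get(key, 0)
    return [m / n if n else None for m, n in zip(muts, nts)]


# Toy example with three sites and 5-nt windows (flank=2).
toy_centers = [("chr1", 10), ("chr1", 50), ("chr2", 10)]
toy_mutations = {("chr1", 10): 1, ("chr1", 50): 1, ("chr2", 9): 1}
toy_excluded = {("chr2", 10)}  # e.g. a blacklisted position
rates = column_mutation_rates(toy_centers, toy_mutations, toy_excluded, flank=2)
```

In the toy example the central column contains two usable nucleotides and two mutations, while the column one position upstream contains three nucleotides and one mutation.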
Background mutation rate estimation
In order to check if the mutation rate observed at each position was expected due to the local sequence context, we randomly introduced into each window the same number of mutations observed in it, following the probability of occurrence of each mutation according to its tri-nucleotide context. We computed the probability of occurrence of all possible 96 tri-nucleotide changes in each cancer type based on the total number of observed mutations in all its samples. We also computed separate probabilities of occurrence of all 96 tri-nucleotide changes in active and inactive TFBS from the mutations observed in each category. The mutation rate of each randomly generated set of changes was computed for each column as explained above. This procedure was repeated 1000 times to compute the mean random mutation rate of every column in the motif.
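A simplified sketch of this randomization is given below. It is not the original code and makes several assumptions: each window is represented by its sequence and its observed mutation count, the 96 tri-nucleotide change probabilities are collapsed to a single hypothetical weight per tri-nucleotide context (context_prob), strand collapsing is ignored, and positions are sampled with replacement.

```python
import random


def simulate_window(seq, n_mut, context_prob, rng):
    """Place n_mut mutations on the interior positions of seq, weighting each
    position by the probability assigned to its tri-nucleotide context."""
    positions = list(range(1, len(seq) - 1))
    weights = [context_prob.get(seq[p - 1:p + 2], 0.0) for p in positions]
    counts = [0] * len(seq)
    if n_mut and sum(weights) > 0:
        for p in rng.choices(positions, weights=weights, k=n_mut):
            counts[p] += 1
    return counts


def mean_random_rate(windows, context_prob, n_iter=1000, seed=0):
    """Mean simulated mutation rate per column over n_iter randomizations.

    windows: list of (sequence, observed_mutation_count); all sequences must
             have the same length and be aligned on their centers.
    """
    rng = random.Random(seed)
    width = len(windows[0][0])
    totals = [0.0] * width
    for _ in range(n_iter):
        col = [0] * width
        for seq, n_mut in windows:
            for i, c in enumerate(simulate_window(seq, n_mut, context_prob, rng)):
                col[i] += c
        for i in range(width):
            totals[i] += col[i] / len(windows)
    return [t / n_iter for t in totals]


# Toy example: only the central tri-nucleotides (TCG, TCA) have non-zero
# weight, so every simulated mutation lands in the central column.
toy_windows = [("ATCGA", 2), ("GTCAA", 1)]
toy_prob = {"TCG": 0.7, "TCA": 0.3}
expected_rates = mean_random_rate(toy_windows, toy_prob, n_iter=50)
```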
Enrichment analysis
To determine whether TFBS are enriched for mutations compared to the immediate flanking region, we compared the ratio of the total number of mutations to the total number of nucleotide positions within the TFBS region (-15 to 15 nt) with that of the flanking region (16 to 1000 nt on either side) using a chi-squared test. We performed this test for all transcription factors and for each individual tumor, and corrected the resulting p-values for multiple testing using the Benjamini-Hochberg procedure28. In addition, we computed the fold change of mutation rates through the expected frequencies obtained from chi-squared tests. Both the fold change and the adjusted p-values are shown in Figure 1b-c.
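For illustration, the enrichment test can be sketched with the Python standard library alone. The sketch below assumes a 2×2 table (mutated versus non-mutated positions, TFBS versus flank), a Pearson chi-squared statistic without continuity correction, and a fold change defined as observed over expected mutations in the TFBS region; these are one possible reading of the description above, and the function names and counts are hypothetical.

```python
import math


def chi2_2x2(mut_in, n_in, mut_out, n_out):
    """Pearson chi-squared test (1 d.f., no continuity correction) comparing
    mutated vs. non-mutated positions inside a TFBS region and in its flanks.
    Returns (chi2, p_value, fold_change), where fold_change is the observed
    over expected number of mutations inside the TFBS region."""
    table = [[mut_in, n_in - mut_in], [mut_out, n_out - mut_out]]
    total = n_in + n_out
    row = [n_in, n_out]
    col = [mut_in + mut_out, total - mut_in - mut_out]
    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            expected = row[r] * col[c] / total
            chi2 += (table[r][c] - expected) ** 2 / expected
    # Survival function of the chi-squared distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    fold_change = mut_in / (row[0] * col[0] / total)
    return chi2, p_value, fold_change


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy counts: 10 mutations in 100 TFBS positions vs. 10 in 900 flank positions.
chi2, p, fold = chi2_2x2(10, 100, 10, 900)
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```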
Nucleotide excision repair data
The genome-wide maps of nucleotide excision repair of two types of UV-induced damage, cyclobutane pyrimidine dimers (CPD) and (6-4) pyrimidine-pyrimidone photoproducts ((6-4)PP), available for three different cell lines –i) wild-type NHF1 skin fibroblasts, ii) XP-C mutants, lacking the global repair mechanism, and iii) CS-B mutants lacking transcription-coupled repair– were obtained from Hu et al., 20156. The dataset contains normalized read counts for fixed steps of 25bp across the genome, for the forward and reverse strands separately. We kept these for our analyses and also generated strand independent data as the average of normalized read counts from both strands for every nucleotide position. These average read counts were mapped to the TFBS centered windows (2001bp), filtered and aligned to the TFBS mid-point as described above. We computed the average repair rate for each column i of these windows as the total number of average read counts mapped to the nucleotides in the column i divided by the total number of nucleotides in the column i, as described above for the mutation rate.
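The strand-independent repair signal can be illustrated with the following sketch. The bin indexing is an assumption (bin k is taken to cover 0-based positions k×25 to k×25+24, with missing bins treated as zero), as are the function name and the toy values; the column-wise averaging over aligned windows then proceeds as for the mutation rate.

```python
def strand_averaged_signal(fwd_bins, rev_bins, start, end, step=25):
    """Per-nucleotide average of forward- and reverse-strand normalized read
    counts over the half-open interval [start, end).

    fwd_bins, rev_bins: dicts mapping a bin index to its normalized read count,
                        where bin k is assumed to cover [k*step, (k+1)*step).
    """
    signal = []
    for pos in range(start, end):
        b = pos // step
        fwd = fwd_bins.get(b, 0.0)
        rev = rev_bins.get(b, 0.0)
        signal.append((fwd + rev) / 2)
    return signal


# Toy example spanning the boundary between bin 0 and bin 1.
toy_fwd = {0: 2.0, 1: 4.0}
toy_rev = {0: 0.0, 1: 2.0}
repair = strand_averaged_signal(toy_fwd, toy_rev, 23, 27)
```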
Nucleosome signals
Genome-wide nucleosome positioning signals (density graph) of ENCODE cell line GM12878 (lymphoblastoid cell line) were downloaded via the UCSC genome browser (http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeSydhNsome/). We then mapped them to the TFBS centered windows, and similar to mutation and repair rates, we computed the average signal per column i of the window as the sum of signal values mapped to the nucleotides in column i divided by the total number of nucleotides in column i.
Computational and statistical tools
BEDTools utilities29 were used to carry out operations such as extensions or overlaps in the various analyses of genomic features (TFBS/DHS), as well as to map somatic mutations to genomic features. All curve fittings shown in figures (best-fit spline) were performed using the smooth.spline function from R30 (v3.0). The auto-correlation was performed using the acf function from the statsmodels Python package (http://statsmodels.sourceforge.net/).
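To make the auto-correlation step concrete, the sketch below computes a sample autocorrelation function with the standard library. It is a stand-in written for illustration, using the common biased estimator (lagged covariance divided by the lag-0 sum of squares), and is not the statsmodels code itself; the function name and the toy signal are hypothetical.

```python
def autocorrelation(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag, computed as the lagged
    sum of products of mean-centered values divided by the lag-0 sum."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    denom = sum(v * v for v in centered)
    acf = []
    for lag in range(max_lag + 1):
        num = sum(centered[t] * centered[t + lag] for t in range(n - lag))
        acf.append(num / denom)
    return acf


# Toy periodic signal with period 4: the autocorrelation peaks again at lag 4
# and is most negative at lag 2 (half a period).
toy_signal = [1, 0, -1, 0] * 5
r = autocorrelation(toy_signal, 8)
```

Applied to a column-wise mutation rate profile, a secondary peak of this function at a given lag would indicate periodicity with that spacing.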
ACKNOWLEDGEMENTS
We acknowledge funding from the Spanish Ministry of Economy and Competitiveness (grant number SAF2012–36199), the Marató de TV3 Foundation, and the Spanish National Institute of Bioinformatics (INB). R.S. is supported by an EMBO Long-Term Fellowship (ALTF 568–2014) co-funded by the European Commission (EMBOCOFUND2012, GA-2012–600394) with support from Marie Curie Actions. A.G.-P. is supported by a Ramón y Cajal contract.