Abstract
Introduction Many transcription factors initiate transcription only in specific sequence contexts, providing the means for sequence specificity of transcriptional control. A four-letter DNA alphabet only partially describes the possible diversity of nucleobases a transcription factor might encounter. For instance, cytosine is often present in a covalently modified form: 5-methylcytosine (5mC). 5mC can be successively oxidized to 5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC), and 5-carboxylcytosine (5caC). Just as transcription factors distinguish one unmodified nucleobase from another, some have been shown to distinguish unmodified bases from these covalently modified bases. Modification-sensitive transcription factors provide a mechanism by which widespread changes in DNA methylation and hydroxymethylation can dramatically shift active gene expression programs.
Methods To understand the effect of modified nucleobases on gene regulation, we developed methods to discover motifs and identify transcription factor binding sites in DNA with covalent modifications. Our models expand the standard A/C/G/T alphabet, adding m (5mC), h (5hmC), f (5fC), and c (5caC). We additionally add symbols to encode guanine complementary to these modified cytosine nucleobases, as well as symbols to represent states of ambiguous modification. We adapted the well-established position weight matrix model of transcription factor binding affinity to an expanded alphabet. We developed a program, Cytomod, to create a modified sequence. We also enhanced the MEME Suite to be able to handle custom alphabets. These versions permit users to specify new alphabets, anticipating future alphabet expansions.
Results We created an expanded-alphabet sequence using whole-genome maps of 5mC and 5hmC in naive ex vivo mouse T cells. Using this sequence and ChIP-seq data from Mouse ENCODE and others, we identified modification-sensitive cis-regulatory modules. We elucidated various known methylation binding preferences, including the preference of ZFP57 and C/EBPβ for methylated motifs and the preference of c-Myc for unmethylated E-box motifs. We demonstrated that our method is robust to parameter perturbations, with transcription factors’ sensitivities for methylated and hydroxymethylated DNA broadly conserved across a range of modified base calling thresholds. Hypothesis testing across different threshold values was used to determine cutoffs most suitable for further analyses. Using these known binding preferences to tune model parameters enables discovery of novel modified motifs.
Discussion Hypothesis testing of motif central enrichment provides a natural means of differentially assessing modified versus unmodified binding affinity, without most of the limitations of a de novo analysis. This approach can be readily extended to other DNA modifications, provided genome-wide single-base resolution data is available. As more high-resolution epigenomic data becomes available, we expect this method to continue to yield insights into altered transcription factor binding affinities across a variety of modifications.
Introduction
Different cell types have widely varied gene expression, despite sharing the same genomic sequence. Epigenomic factors, including modifications to DNA, RNA, and proteins, modulate gene expression and contribute to the cellular regulatory program. Covalent cytosine modifications have an important regulatory role across a number of eukaryotic species, including both mice and humans.1 The most well-studied cytosine modification involves the addition of a methyl group to the 5-carbon of cytosine, creating 5-methylcytosine (5mC). Modified cytosine nucleobases do not substantively disrupt the overall structure of the DNA double helix, permitting transcription and replication to occur. However, these modifications alter various properties of the double helix, including the conformation of both the major and minor grooves.2 They can also lead to steric hindrance of transcription factor DNA binding domains, relative to the typical interactions of specific DNA motifs with their cognate binding sites.3,4
The demethylation cascade as functional genomic elements
5mC is the first of four modified cytosine nucleobases that are involved in the demethylation of 5mC back to its unmodified form. This demethylation cascade occurs via successive oxidation of 5mC to 5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC), and 5-carboxylcytosine (5caC; Figure 1).5,6 These oxidations are mediated by ten-eleven translocation (TET) enzymes.7
5mC has long been known to be involved in a diverse set of regulatory roles.8,9 5hmC is increasingly being implicated in regulatory processes,10 and is now known to be a stable epigenetic modification,11 with structural rationale for its reduced propensity of TET-mediated oxidation.12 Far less is known about 5fC and 5caC, largely due to their considerably lesser abundance. They are far less abundant than 5hmC, itself around an order of magnitude less abundant than 5mC. The abundance of these modifications varies by cell type, with greater abundance observed in mouse embryonic stem cells (mESCs),13–15 in which nearly 3% of cytosine bases were methylated,5 while 0.055% were hydroxymethylated in a different mESC sample.16 There have been only a few investigations into the genome-wide distributions and roles of 5fC (such as poised enhancers) and 5caC.15,17,18 They are often regarded as mere intermediates of the demethylation cascade, largely due to their generally being two to three orders of magnitude less abundant than 5hmC, and capable of triggering a strong DNA damage response.1,5,16 In mESCs, 5fC was found to account for 0.0014% of cytosine bases,16 while 5caC accounted for a mere 0.000335%.5 While it is by no means certain that they play a distinctive regulatory role across multiple tissue types, converging lines of evidence suggest that they too can be important modulators of gene expression.10 5fC alters the conformation of the DNA double helix19 and is known to be stable in mESCs, not merely a demethylation intermediate.20
All of these modifications are (by far) most frequent at CpG dinucleotides, but non-CpG 5mC nucleobases are known to exist in non-negligible quantities, particularly within mESCs.21,22 Mapping of these modifications is complicated by several additional sources of biochemical complexity: strand biases,13 concomitant modifications, and hemi-methylation.23
Modified nucleobases can substantially alter transcription factor recognition
Many transcription factors prefer specific motifs, enabling the sequence specificity of transcriptional control.24 The position weight matrix (PWM) model allows for the computational identification of transcription factor binding sites, by characterizing a transcription factor’s position-specific preference over the DNA alphabet.25 Just as transcription factors distinguish one unmodified nucleobase from another, some transcription factors are known to distinguish between unmodified and modified bases. Despite these covalent modifications not altering base-pairing, they protrude into the major and minor grooves of DNA, and impact other aspects of DNA conformation. These changes can result in altered protein recognition.2
In particular, transcription factors often bind to novel motifs that differ from the unmodified core consensus sequences. MeCP2, one of many non-sequence-specific methyl-CpG binding proteins, has been shown to bind to 5hmC.26 However, the role of non-sequence-specific modified nucleobase binding is limited to specific protein families.
It is more informative, but also more challenging, to elucidate and characterize sequence-specific motifs. In 2013, Hu et al.3 demonstrated that central CpG-methylated motifs have strong binding activity for certain transcription factors. Using protein binding microarrays, they showed that these motifs are often very different from the unmethylated sequences that those transcription factors usually bind. A few transcription factors have well-characterized modification preferences. These preferences can serve as a means of verifying a predictive framework; a working model is expected to be able to robustly yield the known preferences. Therefore, an understanding of known modification-sensitivities informs the design of such a model. Since Hu et al.’s3 work, other transcription factors have been shown to have methylation-sensitivity27 and an instance of 5caC increasing binding affinity was found.28 Both C/EBPα and C/EBPβ have increased binding activity when the central CpG of their canonical octamer (consensus: TTGC|GCAA) is methylated, formylated, or carboxylated, with both strands contributing to increase the effect and hemi-modification still demonstrating a reduced effect.29 5hmC was found to inhibit binding of C/EBPβ, but not C/EBPα.29 c-Myc is a basic helix-loop-helix (bHLH) family transcription factor, which has been demonstrated to have a strong preference for unmethylated E-box motifs, often preferring the CACGTG hexamer.30,31 It is one of many bHLH transcription factors that demonstrate such a preference.32–36
Recently, Quenneville et al.37 demonstrated that ZFP57 has a preference for methylated motifs, specifically for the completely centrally-methylated TGCCGC(R) heptamer (red indicates methylation on the positive strand and blue on the negative strand). This was subsequently confirmed and extended by Strogantsev et al.38, who additionally found that ZFP57 motifs with a final guanine residue as the core binding site are often concomitantly methylated at that second CpG site. This preference was also confirmed with crystallography and in solution with fluorescence polarization analyses, by Liu et al.39, who additionally demonstrated that ZFP57 has successively decreasing affinity for the oxidized forms of 5mC. Xu et al.40 recently applied a random forest to predict binding of transcription factors by combining genomic and methylation data. They did not attempt to predict the preference of factors for methylated DNA, but rather developed software to use profiles of 5mC or 5hmC bases to improve predictions of in vivo transcription factor binding events.
Stable modification-induced changes to DNA shape, and the existence of modification-sensitive transcription factors, motivate the development of a computational framework to elucidate and characterize altered motifs. We describe here methods to analyze covalent DNA modifications and their effects on transcription factor binding sites, by introducing an expanded epigenetic alphabet. We introduce Cytomod, software to integrate DNA modification information into a single genomic sequence, and we detail the use of extensions to the MEME Suite41 to analyze 5mC and 5hmC transcription factor binding site sensitivities.
Methods
An expanded epigenetic alphabet
To analyze DNA modifications’ effects upon transcription factor binding, we developed a model of genome sequence that expands the standard A/C/G/T alphabet. Our model adds the symbols m (5mC), h (5hmC), f (5fC), and c (5caC). This allows us to more easily adapt existing computational methods that work on a discrete alphabet to work with epigenetic cytosine modification data.
Each symbol represents a base pair in addition to a single nucleotide, implicitly encoding a complementarity relation. Accordingly, we add four symbols to represent G when paired with modified C: 1 (G:5mC), 2 (G:5hmC), 3 (G:5fC), and 4 (G:5caC) (Table 1). This ensures that complementation remains a lossless operation. It also captures the fact that the properties of the base pairing of a guanine to a modified residue are altered by the presence of the modification.2 We number these symbols in the same order in which the ten-eleven translocation (TET) enzyme acts on 5-methylcytosine and its oxidized derivatives (Figure 1).6
Many cytosine modification-detection assays only yield incomplete information of a cytosine’s modification state. For example, conventional bisulfite sequencing alone determines if cytosine bases are modified to either 5mC or 5hmC, but cannot resolve between those two modifications.6 Even with sufficient sequencing to disambiguate all modifications, statistical methods are needed to infer each modification from the data, resulting in additional uncertainty. To capture common instances of modification state uncertainty, we also introduce ambiguity codes: z/9 for a cytosine of (completely) unknown modification state, y/8 for a cytosine known to be neither hydroxymethylated nor methylated, x/7 for a hydroxymethylated or methylated base, and w/6 for formylated or carboxylated bases (Table 2). These codes are analogous to those defined by the Nomenclature Committee of the International Union of Biochemistry already in common usage, such as for unknown purines or pyrimidines (R or Y, respectively).42,43
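To illustrate why the added guanine symbols keep complementation lossless, the following minimal Python sketch encodes the complement relation for the expanded alphabet and the ambiguity codes. It assumes that, in each pair such as z/9, the letter denotes the cytosine-strand symbol and the digit its guanine complement, by analogy with m/1. The names `COMPLEMENT` and `reverse_complement` are hypothetical and are not taken from Cytomod or the MEME Suite.

```python
# Complement relation for the expanded alphabet (illustrative sketch only).
COMPLEMENT = {
    "A": "T", "C": "G",
    "m": "1", "h": "2", "f": "3", "c": "4",  # modified C -> G paired with it
    "z": "9", "y": "8", "x": "7", "w": "6",  # ambiguity codes (assumed pairing)
}
# Make the relation symmetric, so that complementing twice is the identity.
COMPLEMENT.update({v: k for k, v in list(COMPLEMENT.items())})


def reverse_complement(seq):
    """Reverse-complement a sequence over the expanded alphabet."""
    return "".join(COMPLEMENT[symbol] for symbol in reversed(seq))


# A fully methylated CpG (m1) is its own reverse complement.
print(reverse_complement("TGCm1"))
```

Because every symbol has exactly one partner, reverse-complementing twice recovers the original sequence, including its strand-specific modification state.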
Creation of an expanded-alphabet genome sequence
Like most epigenomic data, the abundance and distribution of cytosine modifications is cell-type specific. Therefore, modified genomes need to be constructed for a particular cell-type and downstream analyses cannot necessarily be expected to generalize. Accordingly, we first need to construct a modified genome that pertains to the organism, assembly, and tissue type we wish to analyze. This modified genome uses the described expanded alphabet to encode cytosine modification state, using calls from single-base resolution modification data.
To do this, we created a Python program called Cytomod. It loads an unmodified assembly and then alters it using provided modification data. It relies upon Genomedata44 and NumPy45 to load and iterate over genome sequence data. Cytomod can take the intersection or union of replicates pertaining to a single modification type. It also allows one to provide a single replicate of each type, and potentially to run it multiple times to produce multiple independent replicates of modified genomes. It permits flagging of ambiguous input data, such as when only conventional bisulfite sequencing data is available and therefore the only modified bases are x/7. Cytomod additionally produces browser extensible data (BED) tracks for each cytosine modification, for viewing in the UCSC46 (Figure 2) or Ensembl genome browsers.47
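The core substitution step can be sketched as follows. This is not Cytomod's implementation (which, as described above, works with Genomedata and NumPy); it is a self-contained illustration in which the function name `apply_modifications` and the representation of calls as a dictionary from 0-based position to symbol are assumptions made for the example.

```python
C_SYMBOLS = set("mhfczyxw")   # symbols that may replace a cytosine
G_SYMBOLS = set("12349876")   # symbols that may replace a guanine


def apply_modifications(sequence, calls):
    """Overlay modification calls onto an unmodified sequence.

    `calls` maps a 0-based position to an expanded-alphabet symbol.
    A ValueError is raised if a symbol does not match the reference base.
    """
    bases = list(sequence)
    for pos, symbol in calls.items():
        ref = bases[pos].upper()
        if symbol in C_SYMBOLS and ref != "C":
            raise ValueError(f"{symbol!r} at {pos} requires C, found {ref!r}")
        if symbol in G_SYMBOLS and ref != "G":
            raise ValueError(f"{symbol!r} at {pos} requires G, found {ref!r}")
        bases[pos] = symbol
    return "".join(bases)


# Fully methylate the CpG in ACGT: m on this strand, 1 opposite a 5mC.
print(apply_modifications("ACGT", {1: "m", 2: "1"}))
```

Checking each call against the reference base guards against coordinate or strand mismatches between the assembly and the modification data.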
We used conventional and oxidative whole-genome bisulfite sequencing data generated for naive CD4+ T cells, extracted from the spleens of C57BL/6J mice, aged 6–8 weeks. A fraction enriched in CD4+ T cells was first obtained by depletion of non-CD4+ T cells by magnetic labelling and then fluorescence-activated cell sorting was used to get the CD4+, CD62L+, CD44low, and CD25‒ naive pool of T cells. This data was generated by the Ferguson-Smith and Adams labs at the University of Cambridge and the Sanger Institute for the BLUEPRINT project, as a part of Sjöberg et al.51.
We refer to the combination of both conventional and oxidative whole-genome bisulfite sequencing as (ox)WGBS. We analyzed biological replicates separately, 2 of each sex. Unaligned, paired-end, BAM files output from the sequencer were subjected to a standardized internal quality check pipeline. We used MethPipe52 (development version, commit 3655360) to process the data. All random chromosomes were excluded, after alignment. We selected Bismark53 for alignment, which has been demonstrated to work well.54,55 The processing pipeline is as follows: sort the unaligned raw BAM files in name order (using SamBamba,56 version 0.5.4); convert the files to FASTQ, splitting each paired-end (via version 2.23.0 of BEDTools57 bamtofastq); align the FASTQ files to NCBIm37/mm9 using Bismark53 (version 0.14.3), which uses Bowtie58 (version 2.2.4), in the default directional mode for a stranded library; sort the output aligned files by position (again via SamBamba sort); index sorted, aligned, BAMs (via version 1.2 of SAMtools59 index); convert the processed BAM files into the format required by MethPipe, using to-mr; merge sequencing lanes (via direct concatenation of to-mr output files) for each specimen (biological replicate), for each sex, and each of WGBS and oxWGBS; sort the output as described in MethPipe’s documentation (by position and then by strand); remove duplicates using MethPipe’s duplicate-remover; run MethPipe’s methcounts program; and finally run MLML,60 which combines the conventional and oxidative bisulfite sequencing data to yield consistent estimations of cytosine modification state.
We then create modified genomes from the MLML outputs. MLML outputs maximum-likelihood estimates of the levels of 5mC, 5hmC, and C, which are between 0 and 1. These estimates are computed directly or via expectation maximization.60 It outputs an indicator of the number of conflicts, which is an estimate of methylation or hydroxymethylation levels falling outside of the confidence interval computed from the input coverage and level. This value is 0, 1, or 2 in our case, since we have two inputs per run (WGBS and oxWGBS). An abundance of conflicts can indicate the presence of non-random error.60 We assign z/9 to all loci with any conflicts, regarding those loci as having unknown modification state. Our analysis pipeline accounts for cytosine modifications occurring in any genomic context, and additionally maintains the data’s strandedness, allowing analyses of hemi-modification. We created modified genomes using a grid search, in increments of 0.01, for a threshold t, for the levels of 5mC (m) and 5hmC (h), as described in Figure 3.
We use half of the threshold value for assignment to x/7, since we consider that consistent with the use of the full threshold value to call a specific modification (since if t is sufficient to call 5mC or 5hmC alone, m + h ≥ t should be sufficient to call x/7).
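The authoritative calling rule is the one given in Figure 3, which is not reproduced in the text. The sketch below is only one reading of the surrounding description: loci with any conflicts become z/9, a level at or above t calls the specific modification, and m + h ≥ t calls the ambiguous x/7. The function name `call_symbol`, its signature, and the handling of the case where both levels reach t (treated here as ambiguous) are assumptions made for illustration.

```python
def call_symbol(m, h, conflicts, t, base="C"):
    """Assign an expanded-alphabet symbol from MLML-style levels.

    m, h      -- estimated 5mC and 5hmC levels, each in [0, 1]
    conflicts -- number of conflicts reported for the locus
    t         -- modified base calling threshold
    base      -- reference base on the reported strand ("C" or "G")
    """
    symbols = {"C": ("m", "h", "x", "z", "C"),
               "G": ("1", "2", "7", "9", "G")}
    meth, hydroxy, either, unknown, unmodified = symbols[base]
    if conflicts > 0:
        return unknown            # any conflict: unknown modification state
    if m >= t and h >= t:
        return either             # assumption: both pass, so stay ambiguous
    if m >= t:
        return meth
    if h >= t:
        return hydroxy
    if m + h >= t:
        return either             # modified, but 5mC vs 5hmC unresolved
    return unmodified


print(call_symbol(0.9, 0.05, 0, 0.7), call_symbol(0.4, 0.4, 0, 0.7))
```

Running such a function once per threshold in the grid yields one modified genome per value of t.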
We additionally analyzed base frequencies for each modification, both overall (Figure 4) and per genomic cytosine. These frequencies are computed genome-wide, for putative promoter regions and enhancer regions. To estimate promoter regions, we used GENCODE Release M1 (the last GENCODE version annotating NCBIm37/mm9), for the primary genome annotation, using a 2kbp region upstream of the first transcription start site for each “known” GENCODE transcript. We create enhancer regions from a seven-state ChromHMM segmentation for ES-Bruce4.61 We used segmentation state 3, “K4m1”, which is highly enriched for H3K4me1.
Detection of altered transcription factor binding in modified genomic contexts
Next, we performed transcription factor binding site motif discovery, enrichment and modified-unmodified comparisons. Here, we use mouse assembly NCBIm37/mm9 for all analyses, since we wanted to be able to make use of all Mouse ENCODE61 ChIP-seq data without re-alignment or lift-over.1 We updated the MEME Suite41 to work with custom alphabets, such as our expanded epigenomic alphabet. We incorporated these modifications into MEME Suite version 4.11.0.
We characterize modified transcription factor binding sites using MEME-ChIP.63 It allows us to rapidly assess the main software outputs we are interested in: Multiple EM (Expectation Maximization) for Motif Elicitation (MEME)64 and Discriminative Regular Expression Motif Elicitation (DREME),65 both for de novo motif elucidation; CentriMo,66,67 for the assessment of motif centrality; SpaMo,68 to assess Spaced Motifs (which is especially relevant for multi-partite motifs); and Find Individual Motif Occurrences (FIMO).69
CentriMo66 is our main focus for the analysis of our results. It permits inference of the direct DNA binding affinity of motifs, by assessing a motif’s local enrichment. In our case, we scan peak centres with PWMs, for the best match per region. The PWMs used are generated from MEME-ChIP, by loading the JASPAR 201470,71 core vertebrates database, in addition to any elucidated de novo motifs from MEME or DREME. The number of sequences at each position of the central peaks is counted and normalized to estimate probabilities of central enrichment. These are smoothed and plotted. A one-tailed binomial test is used to assess the significance of central enrichment.
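The idea behind the one-tailed binomial test of central enrichment can be sketched in a few lines. This is a deliberately simplified illustration, not CentriMo's algorithm: it omits score thresholds, smoothing, and multiple-testing adjustment, and assumes a uniform null over possible site positions. The function names and the convention that sites are given as 0-based motif start positions are hypothetical.

```python
from math import comb


def binomial_upper_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p) ** (trials - k)
               for k in range(successes, trials + 1))


def central_enrichment_p(best_sites, seq_len, motif_len, window):
    """One-tailed p-value that best sites fall in a central window.

    best_sites -- best-match start position in each sequence (0-based)
    window     -- width of the central region, in positions
    """
    n_positions = seq_len - motif_len + 1
    centre = (n_positions - 1) / 2
    central = [pos for pos in range(n_positions)
               if abs(pos - centre) <= window / 2]
    p_null = len(central) / n_positions      # uniform-placement null
    observed = sum(1 for site in best_sites
                   if abs(site - centre) <= window / 2)
    return binomial_upper_tail(observed, len(best_sites), p_null)


# Twenty sequences whose best site is exactly central are highly enriched.
print(central_enrichment_p([47] * 20, seq_len=100, motif_len=6, window=10))
```

A motif that is bound directly is expected to concentrate near peak centres and hence to give a small p-value, whereas sites scattered uniformly give a p-value near 1.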
If low complexity sequences are not masked out first, MEME-ChIP63 can yield repetitive motifs. Existing masking algorithms are not designed to work with modified genomes, and we accordingly mask the assembly, prior to modification with Cytomod. This masking is only for downstream motif analyses. The unmasked modified genome output by Cytomod is always used for base frequency and distribution analyses. We use Tandem Repeat Finder (TRF)72 (version 4.07b) to mask low complexity sequences and TRF masked genomes are always used with MEME-ChIP. We used the following parameters: 2 7 7 80 10 50 500 -h -m -ngs, taken from the TRF parameter optimization results of Frith et al.73.
We ran MEME-ChIP, using the published protocol for the command-line analysis of ChIP-seq data,74 against Cytomod genome sequences for regions pertaining to chromatin immunoprecipitation-sequencing (ChIP-seq) peaks from transcription factors of interest. We employ positive controls, in two opposite directions, to assess the validity of our results. We use c-Myc as the positive control for an unmethylated binding preference.30,31 ChIP-seq data for c-Myc was used from both a stringent streptavidin-based genome-wide approach with biotin-tagged Myc in mESCs from Krepelova et al.75 (GEO: GSM1171648), as well as murine erythroleukemia and CH12.LXMyc Mouse ENCODE samples (ENCFF001YJE and ENCFF001YHU). Conversely, both ZFP57 and C/EBPβ are used as positive controls for methylated binding preferences.29,37–39 For C/EBPβ, we used Mouse ENCODE ChIP-seq data, conducted upon C2C12 cells (ENCFF001XUT) or myocytes differentiated from those cells (ENCFF001XUR and ENCFF001XUS). We used one replicate of ZFP57 peaks provided by Quenneville et al.37. We constructed a ZFP57 BED file using BEDTools57 (version 2.17.0) to subtract the control influenza hemagglutinin (HA) ChIP-seq (GEO: GSM773065) from the target (HA-tagged ZFP57: GEO: GSM773066). Only target regions with no overlap with any features implicated by the control file were retained, yielding 11 231 of 22 031 features.
We also used ZFP57 ChIP-seq data from Strogantsev et al.38 (GEO: GSE55382), consisting of 40 bp single-end reads from reciprocal F1 hybrid Cast/EiJ × C57BL/6J mESCs (BC8: sequenced C57BL/6J mother × Cast father and CB9: sequenced Cast mother × C57BL/6J father). We are not interested in allele-specificity and need the data to correspond to the assembly we are using. We re-processed the data, aligning it to NCBIm37/mm9, in a similar manner to some of the Mouse ENCODE datasets, to maximize consistency for future Mouse ENCODE analyses. We obtained raw FASTQs using SRA Toolkit’s fastq-dump; aligned them via Bowtie58 (version 1.1.0; bowtie -v 2 -k 11 -m 10 -t --best --strata); sorted and indexed the BAM files (using Sambamba56); and called peaks, using the input as the negative enrichment set, via MACS 2,76 with increased stringency (q = 0.00001), with parameters: -q 0.00001 -f BAM -g mm. This parameter list omits the previously explained target and control information, and parameters to set the output’s base name and directory. This resulted in 90 478 BC8 and 56 142 CB9 peaks.
We used the ChIPQC77 Bioconductor78 package to assess the ChIP-seq data quality. We used the two control and two target runs for each of BC8 and CB9. We then used ChIPQC(samples, consensus=TRUE, bCount=TRUE, summits=250, annotation="mm9", blacklist="mm9-blacklist.bed.gz", chromosomes=chromosomes). We set the chromosomes list to all the fully-aligned mouse chromosomes, excepting chrM. A blacklist of regions is used to filter out regions that appear uniquely mappable, but have been empirically found to show artificially elevated signal in short-read functional genomics data. We took the blacklist from the NCBIm37/mm9 ENCODE blacklist website (https://sites.google.com/site/anshulkundaje/projects/blacklists).79 The fraction of reads in peaks (FRiP) was 13.7% and 9.12% for the BC8 and CB9 data respectively. We additionally performed peak calling at the default q = 0.05, which resulted in many more peaks (197 610 BC8 and 360 932 CB9 peaks) and respective FRiP values of 27.6% and 19.74%. The CB9 sample had a lesser fraction of reads in (overlapping) blacklisted regions (RiBL). At the default peak calling stringency, BC8 had 29.7% RiBL, while CB9 had only 8.38%.
We additionally analyzed three ZFP57 ChIP-seq replicates (100 bp paired-end reads) pertaining to mESCs in pure C57BL/6J mice.80 Each replicate is paired with an identically-conducted ChIP-seq in a corresponding sample, for which ZFP57 is not expressed (ZFP57-null controls). The same protocol as for the hybrid ZFP57 data was used, except that we used the ZFP57-null ChIP-seq data as the negative set for peak calling instead of the input, and Bowtie was run in paired-end mode (using -1 and -2). We additionally omitted the Bowtie arguments --best --strata, which do not work in paired-end mode, and added -y --maxbts 800; the latter matches the value that --best sets, instead of the default threshold of 125. We also set MACS to paired-end mode (via -f BAMPE). However, this resulted in very few peaks when processed with the same peak-calling stringency as the hybrid data (at most 1812 peaks) and FRiP values under 2%. Even when we used the default stringency threshold, there were at most 4496 peaks, with FRiP values of around 4.5%. Nonetheless, we still observed the expected preference for methylated motifs (Figure S1).
To directly compare various modifications of motifs to their cognate unmodified sequences, we adopted a hypothesis testing approach. Motifs of interest can be derived from a de novo result that merits further investigation, but are often formed from prior expectations of motif binding preferences from the literature, such as for c-Myc, ZFP57, and C/EBPβ. For every unmodified motif of interest, we can partially or fully change the base at a given motif position to some modified base (Table 3).
To directly compare modified hypotheses to their cognate unmodified sequences robustly, we try to minimize as many confounds as possible.
We fix the CentriMo central region width (via --minreg 99 --maxreg 100). We also compensate for the substantial difference in the background frequencies of modified versus unmodified bases. Otherwise, vastly lower modified base frequencies can yield higher probability and sharper CentriMo peaks, since when CentriMo scans with its “log-odds” matrix, it computes scores for nucleobase b with background frequency f(b) as $\log \left( \Pr(b) / f(b) \right)$, where $\Pr(b)$ is the motif’s probability of b at the scanned position.
To compensate for this, we ensure that any motif pairs being compared have the same length and similar relative entropies. To do this, we use a larger motif pseudo-count (via --motif-pseudo <count>) for modified motifs. We compute the appropriate pseudo-count, as described below, and provide it to iupac2meme. We set CentriMo’s pseudo-count to 0, since we have already applied the appropriate pseudo-count to the motif.
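The relative entropy (or Kullback-Leibler divergence), $D_{RE}$, of a motif m of length |m|, with respect to a background model b over the alphabet A, of size |A|, is81

$$D_{RE} = \sum_{i=0}^{|m|-1} \sum_{j=0}^{|A|-1} m_{ij} \log_2 \left( \frac{m_{ij}}{b_j} \right) \tag{1}$$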
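The relative entropy of a motif against a background can be computed directly from its standard definition. The sketch below does so for motifs represented as a list of per-position frequency dictionaries; this representation, the function name `relative_entropy`, and the background frequencies are illustrative assumptions, not values from our data. It shows why a modified motif scores much higher than its unmodified counterpart when modified bases are rare in the background.

```python
import math


def relative_entropy(motif, background):
    """Relative entropy (in bits) of a motif with respect to a background.

    motif      -- list of dicts, one per position, mapping symbol -> frequency
    background -- dict mapping symbol -> background frequency
    """
    total = 0.0
    for column in motif:
        for symbol, freq in column.items():
            if freq > 0:  # terms with zero frequency contribute nothing
                total += freq * math.log2(freq / background[symbol])
    return total


# Hypothetical background in which modified symbols are rare.
background = {"A": 0.29, "C": 0.20, "G": 0.20, "T": 0.29, "m": 0.01, "1": 0.01}
unmodified_cpg = [{"C": 1.0}, {"G": 1.0}]
methylated_cpg = [{"m": 1.0}, {"1": 1.0}]

print(relative_entropy(unmodified_cpg, background))
print(relative_entropy(methylated_cpg, background))
```

The gap between the two values is what the larger pseudo-count for modified motifs is intended to close.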
For each position, i, in the motif, the MEME Suite adds the pseudo-count parameter, α, times the background frequency for a given base, j, at the position: $m'_{ij} = m_{ij} + \alpha b_j$.
Accordingly, to equalize the relative entropies, we need only substitute the pseudo-count-adjusted frequency for each mij in Equation 1 and then isolate α. If we proceed in this fashion, however, our pseudo-count would depend upon the motif frequency at each position and the background of each base in the motif. Instead, we can make a number of simplifying assumptions that apply in this particular case. First, the unmodified and modified motifs we are comparing differ only in the bases being modified, which in this case, are only C or G nucleobases, with a motif frequency of 1. Additionally, we set the pseudo-count of the unmodified motif to a constant 0.1 (CentriMo’s default). Thus, the pseudo-count to use for a single modified base is the value α, obtained by solving, for provided modified base background frequency bm and unmodified base frequency bu:
However, Equation 2 only accounts for a single modification, on a single strand. For complete modification, we also need to consider the potentially different background frequency of the modified bases’ complement. Thus for a single complete modification, with modified positions m1 and m2 and corresponding unmodified positions u1 and u2, modified base background frequencies bm1, bm2 and unmodified base frequencies bu1, bu2, we obtain
We numerically solve for α in Equation 3 for each modified hypothesis, using fsolve from SciPy.82 Finally, we may have multiple modified positions. We always either hemi-modify or completely modify all modified positions, so the pseudo-count to use is the product of the number of modified positions and the α value from Equation 3.
The pseudo-count obtained in this fashion does not exactly equalize the two motifs’ relative entropies, since we do not account for the effect that the altered pseudo-count has upon all the other positions of the motif.
We then perform hypothesis testing for an unmodified motif and all possible 5mC/5hmC modifications of all CpGs for known modification-sensitive motifs for c-Myc, ZFP57, and C/EBPβ. These modifications consist of the six possible combinations for methylation and hydroxymethylation at a CpG, where a CpG is not permitted to be both hemi-methylated and hemi-hydroxymethylated. These six combinations are: mG, C1, m1, hG, C2, and h2. For c-Myc, the unmodified motif from which modified hypotheses were constructed is the standard E-box: CACGTG. For ZFP57, we tested the known binding motif, as both a hexamer (TGCCGC) and as extended heptamers (TGCCGCR and TGCCGCG).37,38 We additionally tested motifs that we found to occur frequently in our de novo analyses, C(C/A)TGm1(C/T)(A). We encoded this motif as the hexamer MTGCGY and heptamers, with one additional base for each side: CMTGCGY and MTGCGYA. This encoding permitted direct comparisons to the other known ZFP57-binding motifs of the same length. Finally, for C/EBPβ we tested the modifications of two octamers: its known binding motif (TTGCGCAA) and the chimeric C/EBP|CRE motif (TTGCGTCA).29 These motifs were then assessed for their centrality within their respective ChIP-seq datasets, using CentriMo. We then compute the ratio of CentriMo central enrichment p-values, adjusted for multiple testing,66 for each modified/unmodified motif pair. For numerical precision, we compute this ratio as the difference of their log values returned by CentriMo. This determines if the motif prefers a modified (positive) or unmodified (negative) binding site.
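The six CpG modification combinations listed above can be enumerated mechanically for a motif containing a CpG. The following sketch handles a single CpG at a given index; the function name `modified_hypotheses` is hypothetical, and motifs with several CpGs would need the per-CpG choices combined according to the rules described in the text.

```python
# The six combinations of methylation and hydroxymethylation at one CpG.
CPG_MODIFICATIONS = ["mG", "C1", "m1", "hG", "C2", "h2"]


def modified_hypotheses(motif, cpg_index):
    """Return the six modified versions of `motif` for the CpG at `cpg_index`."""
    if motif[cpg_index:cpg_index + 2] != "CG":
        raise ValueError(f"no CpG at index {cpg_index} of {motif!r}")
    prefix, suffix = motif[:cpg_index], motif[cpg_index + 2:]
    return [prefix + modification + suffix for modification in CPG_MODIFICATIONS]


# Modified hypotheses for the c-Myc E-box.
print(modified_hypotheses("CACGTG", 2))
```

Each returned string can then be paired with the unmodified motif for the CentriMo comparison.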
We conducted hypothesis testing across all four replicates of WGBS and oxWGBS data, for a grid search of modified base calling thresholds. These thresholds are based upon the levels output by MLML.60 We interpret these values as our degree of confidence for a modification occurring at a given locus. We conducted our grid search from 0.01–0.99 inclusive, at 0.01 increments. Finally, the ratios of CentriMo p-values are plotted across the different thresholds, using Python libraries Seaborn83 and Pandas.84,85
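As a small practical note on the grid itself, the thresholds from 0.01 to 0.99 can be generated from integers to avoid the drift that repeated floating-point addition of 0.01 would introduce. This is a sketch of one way to build the grid; the variable name `thresholds` is illustrative.

```python
# 99 thresholds: 0.01, 0.02, ..., 0.99, each computed from an integer
# numerator so that no rounding error accumulates across the grid.
thresholds = [i / 100 for i in range(1, 100)]

print(len(thresholds), thresholds[0], thresholds[-1])
```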
Results
We created an expanded-alphabet sequence using oxidative (ox) and conventional whole-genome bisulfite sequencing (WGBS) maps of 5mC and 5hmC for naive ex vivo mouse CD4+ T cells.51 We generated individual modified genomes across four replicates of (ox)WGBS data and for a variety of modified base calling thresholds. We used these modified genome sequences as the basis for the extraction of genomic regions implicated by ChIP-seq data for particular transcription factors.
The modification abundances obtained were as expected, with respect to the absolute abundance of nucleobases, including their modifications genome-wide, within promoter regions, and within enhancer regions (Figure 4). Genome-wide, at a 0.7 threshold, for the female 15-16 specimen, we find that 2.5% of cytosine residues are methylated, and that 5hmC abundance is 3.5–8.0% of 5mC abundance, depending upon the inclusion of ambiguous bases. These frequencies are consistent with previous results in other cell types.5,14,16
Additionally, 5hmC comprises 0.17% of cytosine or guanine bases genome-wide vs. 0.20% within enhancer regions. If ambiguous 5mC/5hmC (x/7) bases are included, this difference increases to 0.39% vs. 0.45%. These results are consistent with greater 5hmC abundance within enhancer regions.86–89
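Abundances of this kind amount to counting symbols in the expanded-alphabet sequence. The sketch below is a minimal illustration in which the function name and the toy sequence are invented. The denominator includes only the cytosine- and guanine-derived symbols that appear in this section (C, G, m, 1, h, 2, x, 7), which is an assumption, since the full alphabet contains further symbols.

```python
from collections import Counter

# Assumption: only the cytosine- or guanine-derived symbols named in this
# section are counted in the denominator.
C_OR_G_DERIVED = frozenset("CGm1h2x7")


def symbol_fraction(sequence, numerator_symbols):
    """Fraction of cytosine- or guanine-derived symbols in `numerator_symbols`."""
    counts = Counter(sequence)
    total = sum(counts[symbol] for symbol in C_OR_G_DERIVED)
    if total == 0:
        return 0.0
    return sum(counts[symbol] for symbol in numerator_symbols) / total


toy_sequence = "ACGTm1h2CGCGx7AA"                  # invented example
strict_5hmc = symbol_fraction(toy_sequence, "h2")
with_ambiguous = symbol_fraction(toy_sequence, "h2x7")
```

Including the ambiguous x/7 symbols in the numerator can only raise the fraction, which mirrors the direction of the change reported above.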
Hypothesis testing reveals altered modified transcription factor binding preferences
We conducted hypothesis testing across three transcription factors for which we can predict their expected methylation or hydroxymethylation sensitivities from the literature. Two of the tested transcription factors are expected to prefer methylated DNA: ZFP5738 and C/EBPβ,29 and one is known to prefer unmethylated DNA: c-Myc.30,31 Additionally, C/EBPβ is known to have reduced affinity for hydroxymethylated DNA.29
We tested known unmodified transcription factor binding motifs against all possible 5mC and 5hmC modifications thereof, at all CpG dinucleotides. For each modified motif, we assessed its expected DNA binding affinity using its adjusted CentriMo central enrichment p-value.66 We conducted the same test for the unmodified version of the motif, comparing their p-values as a ratio, using the difference of their log transformed values. Positive values for this difference represent a preference for the modified motif, while negative values represent the converse.
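The comparison of log-transformed p-values can be sketched as follows. The function names are hypothetical, and the order of subtraction is our reading of the sign convention stated above: the unmodified log p-value minus the modified one, so that a more significant (more negative) modified log p-value gives a positive difference. Natural logarithms are assumed.

```python
import math


def log_p_difference(log_p_modified, log_p_unmodified):
    """Difference of log p-values for a modified/unmodified motif pair.

    Positive when the modified motif is more centrally enriched
    (assumed sign convention, see the accompanying text).
    """
    return log_p_unmodified - log_p_modified


def preferred_form(difference):
    """Name the form that the sign of the difference favours."""
    if difference > 0:
        return "modified"
    if difference < 0:
        return "unmodified"
    return "neither"


# Invented log p-values: both p-values underflow to zero as floats,
# so their ratio could not be formed directly.
difference = log_p_difference(log_p_modified=-1000.0, log_p_unmodified=-800.0)
underflows = math.exp(-1000.0) == 0.0
```

Working with the logarithms directly avoids forming a ratio of two p-values that are each too small to represent as floating-point numbers.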
We find that the expected transcription factor binding preferences hold across all four (ox)WGBS replicates and for all investigated modified nucleobase calling thresholds (from 0.01–0.99 inclusive, at 0.01 increments, representing modification confidence; Figure 5). Our observation that all c-Myc log p-value differences are below zero implies that modified c-Myc motifs are disfavoured compared with their unmodified E-box motifs. For the ZFP57 sample shown, all modifications are favoured compared to their unmodified counterparts. One of the modified motifs with the greatest increase in predicted binding affinity is TGCm1m1, a motif that Strogantsev et al.38 often found. Two methylated motifs had the greatest increase in predicted binding affinity for C/EBPβ: TTGmGCAA and TTGC1TCA. The same results are obtained for multiple different ChIP-seq replicates for these transcription factors (Figure S1). These results are robust in the face of perturbations, including peak calling stringency (Figure S2).
In addition to ZFP57 displaying a strong preference for methylated DNA, hydroxymethylated CpGs had a substantially lesser increase in binding affinity than methylated motifs (Figure 5), but still greater than the completely unmethylated motif. This recapitulates Liu et al.’s39 in vitro finding that ZFP57 has the greatest binding affinity for motifs containing 5mC, followed by 5hmC, and then by unmodified cytosine.
Elucidation of dichotomous binding preferences—C/EBPβ
C/EBPβ is of particular interest because of its dichotomous binding preferences for 5mC versus 5hmC.29 Our method is able to recapitulate this preference, across all replicates of (ox)WGBS and ChIP-seq data, with methylated motif pairs generally having positive ratios, whereas hydroxymethylated motif ratios are negative (Figure S3). One exceptional case is for a positive strand, hemi-methylated, motif (TTGmGTCA), which is often disfavoured compared to the unmodified motif. This motif is not the consensus C/EBPβ motif, but rather the chimeric C/EBP|CRE octamer. While Sayeed et al.29 demonstrated that this chimeric transcription factor had a more modest preference toward its methylated DNA motif, we would still have expected a weak preference for this motif, over its unmodified counterpart, as opposed to the unmodified motif preference observed. Additionally, we find hemi-methylation to have greater enrichment than complete methylation, which contradicts their finding of both strands contributing to increase the effect.29 This may be due to technical issues with hemi-methylation in our modified sequence and requires further investigation.
Suitable thresholds for de novo and downstream analyses
The grid search for transcription factor binding thresholds at 0.01 increments allowed us to determine suitable thresholds (0.3 and 0.7) for further investigation (Figure S1). Overall, this grid search demonstrates the suitability of a wide range of thresholds, likely useful for assessing future datasets. De novo analyses of C/EBPβ confirmed the preference for methylated DNA, with methylated motifs having much greater central enrichment than their unmethylated counterparts, at both the 0.3 (Figure 6) and 0.7 thresholds (Figure 7).
Despite robust findings with hypothesis testing across almost the entire range of possible thresholds, we were often unable to detect the expected binding preferences in a de novo context. The c-Myc and, to a lesser extent, ZFP57 CentriMo runs, in a non-hypothesis testing context, did not demonstrate substantial enrichment or depletion with respect to modified vs. unmodified motifs. An example of this is shown in Figure S4 for c-Myc. We consider potential explanations for this in the Discussion.
Discussion
We have added expanded alphabet capabilities to the widely-used MEME Suite,41 a set of software tools for the sequence-based analysis of motifs. This included extending several of its core tools, including MEME,64 DREME,65 and CentriMo,66 used in a unified pipeline via MEME-ChIP.63 We undertook further extension of all downstream analysis tools and pipelines, and most of the MEME Suite41 can now be used with arbitrary alphabets. We have processed maps of cytosine modifications in ex vivo mouse T-cells to yield genome sequences that use the expanded alphabet. We then used the extended software on the modified genome sequences in regions defined by ChIP-seq data to confirm previously known transcription factor binding preferences using our expanded-alphabet models.
Hypothesis testing, with equal central region widths and relative entropies, leads to more interpretable results than the standard CentriMo analyses, in that it permits a direct comparison of centrality p-values. We often observed the expected pattern in many replicates of conventional CentriMo runs with de novo motifs, such as with C/EBPβ (Figure 6 and Figure 7) and ZFP57 (Figure S5). However, there were instances in which the expected motif binding preference was not obvious from de novo CentriMo analyses, such as for c-Myc (Figure S4) and other ZFP57 CentriMo results, despite the hypothesis testing robustly corroborating its expected preference for unmethylated DNA (Figure S1).
We suspect that the inability of de novo analyses to elucidate modified binding preferences is primarily due to such analyses not having any means of integrating modified and unmodified motifs. Our de novo analyses are also unable to compensate for the large differences in modified versus unmodified background frequencies. De novo elucidation involves some form of optimization or heuristic selection of sites, and is an inherently variable process. Modified motifs have particular characteristics that differ from most unmodified motifs. Most notably, they are necessarily different from the overall and likely local sequence backgrounds, as a result of the low frequency of modifications. Conversely, an unmodified genome sequence has a comparably uniform nucleobase background, and unmodified motifs are usually found within local sequence of highly similar properties to the motifs themselves.90 Accordingly, modified motifs can get lost within a background of irrelevant unmodified motifs or no comparable sets of motifs may be found, without specifically accounting for these confounds. Also, modified motifs that a de novo analysis finds might not be comparable to any unmodified counterpart. This could occur due to their being of substantially different lengths, often being shorter. It is also difficult to compare motifs having sequence properties that often indicate a poor-quality motif, such as repetitious motifs, or off-target motifs, such as zingers.91 Hypothesis testing, with relative entropy normalization, can be used to mitigate these concerns.
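The effect of low modification frequencies on how far a motif stands apart from its background can be illustrated with a relative entropy calculation. The sketch below is not the normalization procedure used in this work. The function name and the background frequencies are invented for illustration, and only a single motif column is considered.

```python
import math


def column_relative_entropy(column, background):
    """Relative entropy, in bits, of one motif column against a background.

    `column` and `background` map symbols to probabilities.
    """
    return sum(
        probability * math.log2(probability / background[symbol])
        for symbol, probability in column.items()
        if probability > 0
    )


# Invented background in which modified symbols are rare.
background = {"A": 0.29, "C": 0.20, "G": 0.20, "T": 0.29, "m": 0.01, "1": 0.01}

unmodified_column = {"C": 1.0}   # position that is always unmodified cytosine
methylated_column = {"m": 1.0}   # position that is always 5mC

unmodified_bits = column_relative_entropy(unmodified_column, background)
methylated_bits = column_relative_entropy(methylated_column, background)
```

Under this invented background, an invariant 5mC position carries about 6.6 bits, against about 2.3 bits for an invariant cytosine. A modified motif and its unmodified counterpart therefore differ in relative entropy even when they are otherwise identical, which is one reason a direct comparison needs normalization.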
This method is robust in the face of parameter perturbations. In particular, changes in the modified base calling threshold, across a broad range, consistently led to the same expected results, across three transcription factors and a number of ChIP-seq and bisulfite sequencing replicates (Figure S1). Furthermore, modification of peak calling stringency for a set of ZFP57 datasets did not negatively impact our detection of its affinity for methylated DNA (Figure S2). The consistency of our controls provides confidence in the ability of this method to detect and accurately characterize the effect of modified DNA on transcription factor binding. This is instrumental in applying this method to a diverse array of ChIP-seq data, towards the elucidation of novel binding preferences.
There is an inherent trade-off between a lower threshold, yielding more modified loci but potentially introducing false positives, and a higher one, which may be too stringent to detect modified base binding preferences. We selected a lower threshold of 0.3, based primarily on the observation of increased variance and decreased apparent preference for unmethylated DNA for c-Myc below this threshold, across multiple replicates (Figure S1). We also selected an upper threshold of 0.7, based primarily on the rapid decrease in relative affinity for methylated over unmethylated motifs in ZFP57 (Figure 5) and, to a lesser extent, C/EBPβ (Figure S3).
We found that there is often an enrichment for hemi-modified, as opposed to completely-modified binding sites. Motifs with hemi-(hydroxy)methylation were often more centrally enriched than those with complete modification of a central CpG dinucleotide (Figure 6 and Figure 7). This is surprising, because numerous in vitro experiments have demonstrated that for transcription factors preferring modified DNA, each modification is often additive, resulting in completely modified motifs having greatest affinity.29,39 It is possible that the hemi-(hydroxy)methylation events we detect are the result of asymmetric binding affinities for 5mC (5hmC). ZFP57, for example, is known to have asymmetric recognition of 5mC, with the negative strand methylation being more important than the positive strand methylation with respect to the TGCCGC motif.39 Further work is needed to determine if this is due to technical artifacts (either at the level of the bisulfite sequencing data or the methods used) or if this reflects an actual biological preference.
There are few high-quality single-base resolution datasets of 5hmC, 5fC, and 5caC. We had previously attempted analyses using modification data from assays like MeDIP92 that did not employ single-base resolution methods.14 We found that without single-base resolution, it was difficult to create a discrete genome sequence with a reasonable abundance of the modification under consideration without biasing the sequence, thereby making downstream analyses of transcription factor binding uninformative. It is essential to have single-base resolution data for any modifications that one wishes to analyze. Additionally, many datasets which do meet these criteria use some form of reduced representation approach, in which CpGs are enriched, allowing for much cheaper sequencing, while still capturing many DNA modifications. The use of reduced representation bisulfite sequencing data can lead to confounding factors, due to the nonuniform distribution of methylated sites surveyed. We accordingly recommend that enrichment approaches be avoided for use with these methods, at least until these confounds are better addressed.
The ChIP-seq data we used was not from the same cell type as the (ox)WGBS data. While transcription factor binding models created from one cell type are often assumed to be consistently useful across different cell types, in some cases, they are not. Nonetheless, we consistently observed the expected preferences in transcription factor binding for the expected modification affinities, across multiple ChIP-seq replicates, often in different cell types.
The MEME Suite’s new custom alphabet capability permits further downstream analyses of modified motifs. For example, one can find individual motif occurrences with FIMO69 or conduct pathway analyses with Gene Ontology for MOtifs (GOMO).93 Alternatively, FIMO results can be used for pathway analyses via GREAT,94 and downstream pathway analysis tools, such as Enrichment Map,95,96 can be used for further interpretation of the results. This permits inference of implicated genomic regions and biological pathways, which can then be subjected to further analysis.
This approach can be readily extended to other DNA modifications, since we designed all of our software with this in mind. A number of DNA modifications can now be detected at high resolution, with many known to occur endogenously across diverse organisms,1 such as 5-hydroxymethyluracil (5hmU), 5-formyluracil (5fU), 8-oxoguanine (8-oxoG), and 6-methyladenine (6mA).97–99 We provide recommendations in Appendix A for the nomenclature of these modified nucleobases, among others.
We provide a framework to readily apply motif analyses on sequences containing DNA modifications. Consistent reproduction of known transcription factor binding affinities suggests that these methods produce biologically meaningful results and can predict the modification sensitivity of other transcription factors. We intend to apply these methods to analyze all Mouse ENCODE factors toward the identification of novel epigenetic binding preferences.
Acknowledgements
We thank William Stafford Noble and Charles E. Grant for useful discussions and contributions to the MEME Suite. We thank Andrew D. Smith, Meng Zhou, Ben Decato, and Egor Dolzhenko for their work on MethPipe52, 60 and for actively providing support. We thank Michael Waskom for his visualization work on the Seaborn83 Python package and for actively providing support. We thank Carl Virtanen and Zhibin Lu for technical assistance.
This research was enabled by support provided by: Globus,100,101 Compute Canada (specifically, WestGrid, SHARCNET, and SciNet102), and the Princess Margaret Computational Biology Resource Centre.
This work was supported by the Canadian Cancer Society (703827 to M.M.H.), the Natural Sciences and Engineering Research Council of Canada (RGPIN-2015-03948 to M.M.H. and an Alexander Graham Bell Canada Graduate Scholarship to C.V.), the Ontario Ministry of Training, Colleges and Universities (Ontario Graduate Scholarship to C.V.), the Ontario Institute for Cancer Research through funding provided by the Government of Ontario (CSC-FR-UHN to John E. Dick), the University of Toronto McLaughlin Centre (MC-2015-16 to M.M.H.), and the Princess Margaret Cancer Foundation.
Appendix A Recommendations for modified nucleobase nomenclature
Interest in different covalent DNA modifications and improvements in sequencing technologies are expected to create a greater need for computational analyses of modified sequence data. In order to encourage standardization, we recommend symbols for various modified nucleobases (Table S1). We use lower case letters (a–z) for specific nucleobase forms and numerals (0–9) to specify complements without any information loss. The list is not comprehensive, but may provide guidance for those who need to select symbols for these bases. This list also reserves specific symbols in an attempt to reduce contradictory definitions. All upper-case letters of the Latin alphabet are considered to be reserved for allocation by IUPAC, in addition to those already specified.43 We use lower-case letters, which may have different meanings in upper-case, since there are insufficient unassigned letters to restrict ourselves to those. Many applications, including the current implementation of the MEME Suite, only support the Latin letters and numerals.
For any abbreviations of covalently modified nucleobases, we recommend that they be referred to as <position><modification><base>, where <position> is the position of the modified atom, <modification> is the modification, and <base> is the nucleobase being modified, such as “5mC”. In particular, we recommend that no punctuation be used to demarcate the number of the atom from its modification and that the numeral always appear before the base being modified. For example, others have occasionally abbreviated 6-methyladenine as m6A, but we recommend the use of 6mA instead when the modification occurs in DNA rather than in RNA. This distinction in the nomenclature of DNA vs. RNA modifications can be seen in a recent review by Chen et al.105.
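The recommended abbreviation form can be checked mechanically. The sketch below is one possible reading of the recommendation, not a formal grammar. The pattern and function name are hypothetical, and we assume that the modification is written in lower-case letters and that the base is one of A, C, G, T, or U.

```python
import re

# <position><modification><base>, with no punctuation between the parts.
ABBREVIATION = re.compile(r"^(\d+)([a-z]+)([ACGTU])$")


def parse_abbreviation(text):
    """Split an abbreviation into (position, modification, base).

    Returns None when `text` does not follow the recommended form.
    """
    match = ABBREVIATION.match(text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2), match.group(3)


recommended = parse_abbreviation("6mA")   # DNA form recommended above
rna_style = parse_abbreviation("m6A")     # modification before the numeral
```

A pattern this strict also rejects hyphenated forms such as 8-oxoG, so any practical validator would need to decide how to treat established abbreviations of that kind.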
The core symbols for cytosine modifications (Table 1) have been incorporated into Table S1. While we specified a set of ambiguity codes for our usage in this work (Table 2), we do not recommend general definitions. Instead, we suggest that the latter portions of the lower-case Latin alphabet and numerals be reserved for this purpose. This increases the likelihood that sufficient symbols will be available within the alphanumeric alphabet for a variety of use-cases. As implemented in this work, we recommend that ambiguity codes be assigned starting from the end of their character set, beginning with the most equivocal ambiguity code.
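Assigning ambiguity codes from the end of each character set can be sketched as follows. The function name and the class names are placeholders, not the ambiguity classes defined in this work. We assume that each ambiguity class receives one lower-case letter and one numeral for its complement, both taken from the end of their respective sets.

```python
import string


def assign_ambiguity_codes(classes_most_equivocal_first):
    """Map each ambiguity class to a (letter, numeral) pair.

    Letters are taken from 'z' downwards and numerals from '9' downwards,
    so the most equivocal class receives the last symbols of each set.
    At most ten classes can be assigned, since there are ten numerals.
    """
    letters = reversed(string.ascii_lowercase)
    numerals = reversed(string.digits)
    return {
        name: (letter, numeral)
        for name, letter, numeral in zip(
            classes_most_equivocal_first, letters, numerals
        )
    }


# Placeholder class names, ordered from most to least equivocal.
codes = assign_ambiguity_codes(["class_1", "class_2", "class_3"])
```

Assigning from the end leaves the start of the lower-case alphabet and the low numerals free for specific nucleobase forms and their complements.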
Footnotes
1 Specifically, we used the Mus musculus Illumina iGenome62 packaging of the UCSC mm9 genome. This genome excludes all alternative haplotypes as well as all unreliably ordered, but chromosome-associated, sequences (the so-called “random” chromosomes). This was ideal for downstream analyses, but not sufficient for aligning data ourselves. This is because exclusion of these additional pseudo-chromosomes might deleteriously impact alignments, by resulting in the inclusion of spuriously unique reads. Therefore, the full UCSC mm9 build is used when aligning to a reference sequence.