Abstract
Nucleosome repeat length (NRL) defines the average distance between adjacent nucleosomes. When calculated for specific genomic regions, NRL reflects the local nucleosome ordering and characterises its changes during developmental processes. The architectural protein CTCF provides one of the strongest nucleosome positioning signals, setting a decreased NRL for ~20 nucleosomes in its vicinity (thus affecting up to 10% of the mouse genome). We show that upon differentiation of mouse embryonic stem cells (ESCs) to neural progenitor cells and mouse embryonic fibroblasts, a subset of common CTCF sites preserved in all three cell types keeps small NRL despite genome-wide NRL increase. This suggests that differential CTCF binding not only affects 3D genome organisation but also defines genomic regions with conserved nucleosome arrangement. Our analysis revealed that NRL decrease near CTCF is correlated with CTCF affinity for DNA binding. Stronger CTCF binding is linked to increased probability to form chromatin loops and more efficient recruitment of chromatin remodellers. We show that the effect of individual remodellers on decreasing the NRL near CTCF is increasing in the order Brg1≤Chd4<Chd6<Chd1≤Chd2≤EP400≤Chd8<Snf2h.
Introduction
Nucleosomes are positioned along the genome in a non-random way (Baldi, 2019; Lai and Pugh, 2017; Teif and Clarkson, 2019), which is critical for determining the DNA accessibility and genome organisation (Maeshima et al., 2019). A classical parameter characterising the nucleosome spacing is the nucleosome repeat length (NRL), defined as the average distance between the centres of adjacent nucleosomes. NRL can be defined genome-wide, locally for an individual genomic region or for a set of regions. The local NRL is particularly important, since it reflects different structures of chromatin fibers (Bascom et al., 2017; Bass et al., 2019; Nikitina et al., 2017; Risca et al., 2017; Routh et al., 2008).
Ever since the discovery of the nucleosome (Kornberg, 1974; Olins and Olins, 1974) there have been many attempts to compare NRLs of different genomic regions (De Ambrosis et al., 1987; Gottesfeld and Melton, 1978; Lohr et al., 1977) and it has been established that genome-wide NRL changes during cell differentiation (van Holde, 1989; Weintraub, 1978). Recent sequencing-based investigations showed that active regions such as promoters, enhancers and actively transcribed genes usually have shorter NRLs while heterochromatin is characterised by longer NRLs (Baldi et al., 2018; Chereji et al., 2018; Sun et al., 2001; Valouev et al., 2011). Studies performed in yeast linked NRL changes at transcription start sites (TSS) to a number of specific molecular mechanisms, down to individual chromatin remodellers responsible for increasing/decreasing NRL (Celona et al., 2011; Hennig et al., 2012; Kubik et al., 2019; Mobius et al., 2013; Ocampo et al., 2016; Zhang et al., 2011). However, in higher eukaryotes regulatory regions are very heterogeneous, and although several recent attempts have been made (de Dieuleveult et al., 2016; Giles et al., 2019), it is difficult to come up with a set of definitive remodeller rules determining their effect on NRL. For example, ubiquitous heterogeneity and asymmetry of nucleosome distributions around subsets of different TF binding sites have been noted (Kundaje et al., 2012).
A particularly important nucleosome positioning signal is provided by CTCF, an architectural protein that maintains 3D genome architecture (Merkenschlager and Nora, 2016; Nora et al., 2017; Rao et al., 2017) and can organise up to 20 nucleosomes in its vicinity (Fu et al., 2008) (Fig. 1A). CTCF has hundreds of thousands of potential binding sites in the mouse genome. Usually ~30,000-60,000 CTCF sites are bound in a given cell type, which translates to about 1 million affected nucleosomes (Chen et al., 2012; Shen et al., 2012; Wang et al., 2012; Wiehle et al., 2019).
We previously showed that in mouse embryonic stem cells (ESCs), NRL near CTCF is about 10 bp smaller than genome-wide NRL (Teif et al., 2014; Teif et al., 2012). Our analysis demonstrated that purely statistical positioning of nucleosomes near CTCF boundaries would result in a longer NRL than observed experimentally, and the effects of strong nucleosome-positioning DNA sequences, while compatible with the observed NRL, are limited to a small number of CTCF sites (Beshnova et al., 2014). A very recent study has investigated the effect of Snf2h and Brg1 remodellers on NRL in ESCs, suggesting Snf2h as the primary player (Barisic et al., 2019). However, other factors may be at play as well. Thus, the question of what determines the NRL near CTCF remains open, as does the question of the functional consequences of such small NRLs. Here we address both these problems in a systematic manner using all available datasets in ESCs.
Results
Location of genomic region with respect to CTCF sites has profound effect on its apparent NRL
Our analysis is based on the “phasogram” type of NRL calculation introduced previously (Teif et al., 2012; Valouev et al., 2011). The idea of this method is to consider all mapped nucleosome reads within the genomic region of interest and calculate the distribution of the distances between nucleosome dyads. This distribution typically shows peaks corresponding to the prevailing distance between two nearest neighbour nucleosomes followed by the distances between next neighbours. The slope of the line resulting from the linear fit of the positions of the peaks then gives the NRL. To perform bulk calculations of NRLs for many genomic subsets of interest we developed software NRLcalc, which loads the phasograms computed in NucTools and performs linear fitting to calculate NRL (see Methods).
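The final step of the phasogram method, the linear fit through the peak positions, can be sketched as follows. This is a minimal illustration in Python, not the NRLcalc implementation; the function name `nrl_from_peaks` and the example peak positions are hypothetical, and the peaks are assumed to have been identified already and numbered consecutively from 1.

```python
def nrl_from_peaks(peak_positions):
    """Least-squares fit of peak position (bp) against peak number.

    Returns (slope, intercept); the slope is the NRL estimate in bp.
    Assumes peaks are consecutive (nearest neighbour, next neighbour, ...).
    """
    n = len(peak_positions)
    if n < 2:
        raise ValueError("need at least two peaks for a linear fit")
    peak_numbers = range(1, n + 1)
    mean_x = sum(peak_numbers) / n
    mean_y = sum(peak_positions) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(peak_numbers, peak_positions))
    sxx = sum((x - mean_x) ** 2 for x in peak_numbers)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical peak positions (bp); these are not values from this study.
example_peaks = [190, 372, 561, 745, 930]
nrl, offset = nrl_from_peaks(example_peaks)
print(f"NRL estimate: {nrl:.1f} bp")
```

Using the slope rather than the position of the first peak makes the estimate less sensitive to a constant offset shared by all peaks, which ends up in the intercept.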
We first noticed that NRL near CTCF depends critically on the distance of the region of NRL calculation to the binding site summit (Fig. 1B). While the phasograms for regions [100, 2000] and [250, 1000], both of which exclude the CTCF site, are quite similar to each other, a region that includes CTCF [-500, 500] is characterised by a very different phasogram. However, the latter phasogram is an artefact of the interference of two “waves” of distances between nucleosomes: one wave corresponds to the distances between nucleosomes located on the same side of CTCF, and the second wave corresponds to distances between nucleosomes located on different sides of CTCF. The superposition of these two waves results in the appearance of additional peaks shown by arrows in Fig. 1B. A linear fit through all the peaks given by the interference of these two waves gives NRL=155 bp, but this value does not reflect the real prevailing distance between nucleosomes (Fig. 1C). We thus selected the region [100, 2000] for the following calculations. Below, all NRLs refer to regions [100, 2000] near the summit of the TF binding site, unless specified otherwise. Once the region location with respect to the CTCF site is fixed, the phasograms are not significantly affected by the choice of the nucleosome positioning dataset (Fig. S1). In the following calculations we used the high-coverage MNase-seq and chemical mapping datasets from (Voong et al., 2016).
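The origin of the additional peaks can be illustrated with an idealised toy model. The sketch below places nucleosomes on a perfectly regular array on both sides of a site at position 0 and lists the two sets of dyad-to-dyad distances separately. All numbers (an NRL of 180 bp and a first dyad 120 bp from the site) are assumptions chosen for illustration only, and the function name is hypothetical.

```python
def regular_array_distances(nrl, first_dyad_offset, n_per_side):
    """Distances in a toy model with dyads at +/-(first_dyad_offset + k * nrl).

    Returns (same_side, cross_site): sorted unique dyad-to-dyad distances
    between nucleosomes on the same side of the site, and between
    nucleosomes on opposite sides of it.
    """
    right = [first_dyad_offset + k * nrl for k in range(n_per_side)]
    left = [-position for position in right]
    same_side = sorted({b - a for a in right for b in right if b > a})
    cross_site = sorted({r - l for r in right for l in left})
    return same_side, cross_site


# Assumed toy parameters, not measurements.
same_side, cross_site = regular_array_distances(
    nrl=180, first_dyad_offset=120, n_per_side=4)
merged = sorted(set(same_side) | set(cross_site))
spacings = [b - a for a, b in zip(merged[:6], merged[1:6])]
print("same-side peaks:", same_side)
print("cross-site peaks:", cross_site[:3])
print("mean spacing of merged peaks:", sum(spacings) / len(spacings))
```

In this toy model the same-side distances are exact multiples of the NRL, whereas the cross-site distances are shifted by twice the distance from the site to the first dyad. Treating the merged set as a single series of peaks therefore yields a spacing smaller than the true NRL, which is the kind of artefact described above.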
In order to investigate the effect of CTCF on NRLs near binding sites of other proteins, we calculated NRLs near binding sites of 18 stemness-related TFs whose binding has been experimentally determined in ESCs using ChIP-seq (Fig. 1D and Fig. S2). The latter analysis revealed that the proximity to CTCF binding sites changes all of these NRLs. When we filtered out TF binding sites that overlap with CTCF, the NRLs for each individual TF increased (Fig. 1D). On the other hand, TF binding sites that overlap with CTCF had significantly smaller NRLs (Fig. S2). Thus, CTCF’s effects on NRL are unique, which warrants focusing on CTCF alone for the rest of our study.
The stronger CTCF binds to DNA the smaller is NRL near its binding sites
In order to investigate the effect of CTCF on NRL, we split CTCF sites into quintiles based on the height of their ChIP-seq peaks reported previously (Shen et al., 2012). Comparison of CTCF quintiles in terms of the distribution of nucleosome dyad-to-dyad distances determined by chemical mapping revealed that stronger CTCF binding is associated with smaller NRLs. NRL profiles also changed from one dominant peak in the case of weak CTCF binding to several pronounced peaks in the case of the strongest CTCF binding quintile (Fig. 2A and Fig. S3). The calculation of the “classical” NRLs based on MNase-seq data showed a smooth decrease of NRL as the strength of CTCF binding increased (Fig. 2B). We confirmed that this relation is determined by the strength of CTCF binding per se by repeating this calculation for all computationally predicted CTCF sites in the mouse genome, which were split into quintiles based on the similarity of their motifs to the canonical CTCF motif (Fig. 2B).
Using the same procedure we have also calculated the NRL as a function of the binding strength for all TFs in the mouse genome whose position weight matrices are available in JASPAR2018 (Khan et al., 2018). This analysis revealed that for proteins other than CTCF, NRL was not a smooth function of their binding strength (see Fig. S4). Thus, CTCF is a unique protein that shows anticorrelation between the strength of its DNA binding and NRL.
Common CTCF sites preserve local nucleosome organisation during ESC differentiation
We then set out to determine the functional consequences of the NRL decrease near CTCF. We investigated the change of NRL near CTCF upon differentiation of ESCs to neural progenitor cells (NPCs) and mouse embryonic fibroblasts (MEFs). We first noted that the stronger CTCF binds to DNA, the higher the probability that this site will remain bound upon differentiation (Fig. 2C). This suggests that the strength of CTCF binding can act as the major factor determining which CTCF sites are retained and which are lost upon differentiation (and thus how the 3D structure of the genome will change). In relation to NRL, we showed that NRL near bound CTCF on average increases as the cell differentiates (Fig. 2D and S5). Importantly, common CTCF sites resisted this NRL change, suggesting that CTCF retention upon differentiation at common sites preserves both 3D structure and nucleosome patterns at these loci.
What determines the NRL decrease near CTCF?
In order to define the physical mechanisms of NRL decrease near CTCF one has to consider a number of genomic features and molecular factors that potentially can account for the NRL decrease near CTCF:
Our previous observations suggested that the strength of CTCF binding is related to the surrounding GC and CpG content (Pavlaki et al., 2018; Wiehle et al., 2019). Our new calculations performed here show that the strength of CTCF binding is indeed correlated with GC content around CTCF sites (Fig. 3A), as well as with the probability that a given site is located in a CpG island (Fig. 3B). Therefore, one potential hypothesis to check is whether CTCF site location inside vs outside CpG islands has an effect on NRL.
Small NRL near CTCF could be simply because CTCF sites are in active regions (promoters or enhancers) which have smaller NRL in comparison with genome-average based on previous studies (Baldi et al., 2018; Valouev et al., 2011). Our analysis performed here demonstrated that there is a positive correlation between the strength of CTCF binding and the probability that it is inside a promoter region (Fig. 3C).
The NRL could depend on whether a given CTCF site forms a boundary of topologically associated domains (TADs) or enhancer-promoter loops. Our analysis using recently published coordinates of TADs and chromatin loops in ESCs (Bonev et al., 2017) showed that there is a positive correlation between the strength of CTCF binding and the probability that it forms a boundary of TADs and even higher correlation for the boundaries of loops (Fig. 3C).
Nucleosome arrangement could be determined by a specific chromatin remodeller interacting with CTCF. We have processed all available remodeller ChIP-seq datasets in ESCs and plotted the percentage of CTCF sites overlapping with remodeller ChIP-seq peaks (Fig. 3D). This analysis showed that the stronger CTCF binds, the higher the probability that a given CTCF binding site overlaps with remodellers. A particularly large percentage of CTCF sites overlaps with peaks of remodellers Chd4, EP400, Chd8 and Brg1.
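The overlap statistic used for the last of these hypotheses can be sketched as below. This is a simplified illustration only: intervals are assumed to be half-open (chromosome, start, end) tuples, the function names and coordinates are hypothetical, and a genome-scale analysis would normally rely on an interval tool such as BEDTools rather than this quadratic loop.

```python
def intervals_overlap(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def percent_overlapping(sites, peaks):
    """Percentage of sites that overlap at least one peak."""
    if not sites:
        return 0.0
    hits = sum(
        any(intervals_overlap(site, peak) for peak in peaks)
        for site in sites
    )
    return 100.0 * hits / len(sites)


# Hypothetical coordinates for illustration only.
ctcf_sites = [("chr1", 100, 120), ("chr1", 500, 520),
              ("chr2", 100, 120), ("chr2", 900, 920)]
remodeller_peaks = [("chr1", 110, 300), ("chr2", 905, 910)]
print(percent_overlapping(ctcf_sites, remodeller_peaks))
```

Applying the same function to each CTCF binding-strength quintile separately would give one percentage per quintile and remodeller.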
We set out to check all four hypotheses formulated above (Fig. 4). CTCF site location inside boundaries of loops or TADs was indeed associated with NRL decrease, which was even more pronounced in CpG islands. We have also derived a systematic set of rules for remodeller effects on NRL near CTCF, with Brg1 having no detectable effect (based on two independent Brg1 datasets), and Snf2h having the largest effect. Overall, the effect of remodellers increases in the order Brg1≤Chd4<Chd6<Chd1≤Chd2≤EP400≤Chd8<Snf2h.
Discussion
We developed a new methodology for quantitative investigations of local NRL changes, and its application revealed a number of interesting observations:
First, we found that NRL critically depends on the distance of the selected genomic region to the summit of the CTCF site. We showed that the CTCF site needs to be excluded from the genomic region for robust NRL calculations; otherwise the apparent NRL is unrealistically small. We checked that this artefact at least does not affect NRL calculations near TSS (Figure S6), but previous NRL calculations for CTCF-containing regions may need to be re-evaluated.
Second, we found that the NRL decrease near CTCF is correlated with CTCF-DNA binding affinity. This result goes significantly beyond previous observations that stronger CTCF binding is associated with more regular nucleosome ordering near its binding site (Owens et al., 2019; Vainshtein et al., 2017) and may have direct functional implications. Strikingly, the NRL decrease as a function of CTCF binding affinity spans a large interval from 193 bp for weak CTCF-like DNA motifs down to 178 bp for the strongest sites bound in ESCs. None of the other DNA-binding proteins showed such behaviour. This uniqueness of CTCF can be explained by the large variability of its binding affinity through different combinations of its 11 zinc fingers, which allows the creation of a “CTCF code” (Lobanenkov and Zentner, 2018; Nichols and Corces, 2015).
Third, our calculations showed that the strength of CTCF binding acts as a good predictor of a given CTCF site being preserved upon cell differentiation (which may be used as a foundation for the CTCF code determining its differential binding as the cell progresses along the Waddington-type pathways). Importantly, a subclass of common CTCF sites preserved upon cell differentiation tends to keep a small NRL, while genome-wide NRL increases. A previous study reported a related distinction of common versus non-common CTCF sites based on the distance between the two nucleosomes downstream and upstream of CTCF (Snyder et al., 2016). The preservation of NRL for common CTCF sites may give rise to a new effect where differential CTCF binding defines extended regions which do not change (or change minimally) their nucleosome positioning.
Fourth, we systematised the contributions to NRL decrease determined by each of 8 chromatin remodellers that have been profiled in ESCs (Fig. 4B). Our analysis suggests that Snf2h has a major role in this phenomenon, consistent with previous studies of Snf2h knockout in HeLa cells (Wiechens et al., 2016) and ESCs (Barisic et al., 2019). Consistent with the latter study, we found that Brg1 has no detectable effect on NRL near CTCF, although it may still be involved in nucleosome positioning near TAD boundaries (Barutcu et al., 2017). Our investigation also identified Chd8 and EP400 as two novel major players. Previous studies indeed showed that Chd8 physically interacts with CTCF and knockdown of Chd8 abolishes the insulator activity of CTCF sites required for IGF2 imprinting (Ishihara et al., 2006). Thus, our work revealed a systematic set of remodeller effects on NRL near CTCF and provided the basis for future quantitative investigations of local NRL variations during development.
Materials and Methods
Experimental datasets
Nucleosome positioning and transcription factor binding datasets were obtained from the Gene Expression Omnibus (GEO), Short Read Archive (SRA) and the ENCODE web site as detailed in Table ST1. NRL calculations near CTCF in ESCs were performed using the MNase-seq dataset from (Voong et al., 2016). NRL calculations near 19 stemness-related proteins in ESCs shown in Figure 1D and S1 were performed using the chemical mapping dataset from (Voong et al., 2016). NRL calculations in NPCs and MEFs were based on the MNase-seq datasets from (Teif et al., 2012). MNase-assisted H3 ChIP-seq from (Wiehle et al., 2019) was used for demonstrative purposes in the phasogram calculation in Figure 1C. Coordinates of genomic features and experimental maps of transcription factor and remodeller binding in ESCs were obtained from published sources as detailed in Table S1. The coordinates of loops and TADs described in (Bonev et al., 2017) were provided by the authors in a BED file aligned to the mm10 mouse genome and were converted to mm9 using liftOver (UCSC Genome Browser).
Data pre-processing
For nucleosome positioning, raw sequencing data were aligned to the mouse mm9 genome using Bowtie allowing up to 2 mismatches. For all other datasets we used processed files with genomic coordinates downloaded from the corresponding database as detailed in Table ST1. Where required, coordinates were converted from mm10 to mm9 since the majority of the datasets were in mm9.
Basic data processing
TF binding sites were extended from the centre of the site to the region [100, 2000]. In order to find all nucleosomal DNA fragments inside each genomic region of interest, the BED files containing the coordinates of nucleosomes processed using the NucTools pipeline (Vainshtein et al., 2017) were intersected with the corresponding genomic regions of interest using BEDTools (Quinlan, 2014). Average nucleosome occupancy profiles were calculated using NucTools. The phasograms were calculated using NucTools as detailed below.
Binding site prediction
Computationally predicted TF binding sites were determined via scanning the mouse genome with position frequency matrices (PFMs) from the JASPAR2018 database (Khan et al., 2018) using R packages TFBSTools (Tan and Lenhard, 2016) and GenomicRanges (Lawrence et al., 2013). A similarity threshold of 80% was used for all TFs in order to get at least several thousand putative binding sites.
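The idea of a relative similarity threshold can be sketched as follows. This is a plain Python illustration, not the TFBSTools implementation: it assumes that the 80% threshold is applied to the PWM score rescaled between the minimal and maximal possible scores, it scans only the forward strand, and the pseudocount, uniform background, toy matrix and function names are all assumptions made for the example.

```python
import math

BASES = "ACGT"


def pfm_to_pwm(pfm, pseudocount=0.8, background=0.25):
    """Convert counts {base: [count per position]} to log2-odds columns."""
    length = len(pfm["A"])
    pwm = []
    for j in range(length):
        total = sum(pfm[base][j] for base in BASES)
        column = {}
        for base in BASES:
            prob = (pfm[base][j] + pseudocount * background) / (total + pseudocount)
            column[base] = math.log2(prob / background)
        pwm.append(column)
    return pwm


def scan(sequence, pwm, min_relative_score=0.8):
    """Return (position, relative_score) for windows passing the threshold."""
    min_score = sum(min(column.values()) for column in pwm)
    max_score = sum(max(column.values()) for column in pwm)
    hits = []
    for pos in range(len(sequence) - len(pwm) + 1):
        window = sequence[pos:pos + len(pwm)]
        if any(base not in BASES for base in window):
            continue  # skip windows containing N or other symbols
        score = sum(column[base] for column, base in zip(pwm, window))
        relative = (score - min_score) / (max_score - min_score)
        if relative >= min_relative_score:
            hits.append((pos, relative))
    return hits


# Toy 3-bp motif "ACG"; not a JASPAR matrix.
toy_pfm = {"A": [10, 0, 0], "C": [0, 10, 0], "G": [0, 0, 10], "T": [0, 0, 0]}
print(scan("TTACGTTACGAA", pfm_to_pwm(toy_pfm)))
```

Lowering `min_relative_score` returns more putative sites at the cost of more weak matches, which is the trade-off behind choosing a single threshold that yields at least several thousand sites for every TF.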
Stratification of TF-DNA binding affinity
In the case of experimentally determined binding sites of CTCF we stratified these into five equally sized quintiles according to the ChIP-seq peak height determined via peak calling performed in the original publication (Shen et al., 2012). In the case of the predicted TF sites, we used the TRAP algorithm (Roider et al., 2007) to predict the affinity of TF for each site. The same operation as described above was performed on these sites, with the sites arranged into quintiles according to the TRAP score.
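The stratification step can be sketched as below, under the assumption that each site carries a single numeric score (a ChIP-seq peak height or a TRAP score). The function name is hypothetical, ties are broken by input order, and when the number of sites is not divisible by five the groups differ in size by at most one.

```python
def assign_quintiles(scores):
    """Return a quintile label (1 = weakest, 5 = strongest) for each site."""
    n = len(scores)
    # Indices of sites ordered from lowest to highest score (stable sort).
    order = sorted(range(n), key=lambda i: scores[i])
    quintiles = [0] * n
    for rank, site_index in enumerate(order):
        # Map ranks 0..n-1 onto five groups labelled 1..5.
        quintiles[site_index] = rank * 5 // n + 1
    return quintiles


# Hypothetical peak heights for ten sites.
heights = [5, 1, 9, 3, 7, 2, 8, 4, 6, 10]
print(assign_quintiles(heights))
```

The NRL is then calculated separately for the sites of each quintile.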
Phasogram calculation
The “phasograms” representing the histograms of dyad-to-dyad or start-to-start distances were calculated with the NucTools script nucleosome_repeat_length.pl. When paired-end MNase-seq was used, dyad-to-dyad distances were calculated using the center of each read as described previously (Vainshtein et al., 2017). When chemical mapping data was used, this procedure was modified to use the start-to-start distances instead, because in the chemical mapping method the DNA cuts happen at the dyad locations, so the DNA fragments span from dyad to dyad.
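For readers unfamiliar with this calculation, the logic can be sketched in a few lines of Python. This is a simplified stand-in for illustration, not the code of nucleosome_repeat_length.pl: fragments are assumed to be (start, end) pairs on a single chromosome, the function names are hypothetical, and the 1000 bp maximum distance is an arbitrary choice.

```python
def dyads_from_fragments(fragments):
    """Approximate dyads as midpoints of paired-end fragments (start, end)."""
    return sorted((start + end) // 2 for start, end in fragments)


def phasogram(positions, max_distance=1000):
    """Histogram of pairwise distances between positions.

    Returns a list in which element d is the number of pairs separated by
    exactly d bp, for d from 1 to max_distance (element 0 is unused).
    """
    positions = sorted(positions)
    counts = [0] * (max_distance + 1)
    for i, left in enumerate(positions):
        for right in positions[i + 1:]:
            distance = right - left
            if distance > max_distance:
                break  # positions are sorted, so later ones are farther away
            if distance > 0:
                counts[distance] += 1
    return counts


# Three hypothetical 146-bp fragments spaced 186 bp apart.
fragments = [(0, 146), (186, 332), (372, 518)]
counts = phasogram(dyads_from_fragments(fragments), max_distance=500)
print(counts[186], counts[372])
```

For chemical mapping data, the fragment start coordinates would be passed to `phasogram` directly in place of the midpoints.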
Automated NRL determination from phasograms
Picking peak positions manually proved cumbersome for large numbers of phasograms. To circumvent this problem, an interactive applet called NRLcalc was developed based on the Shiny R framework (http://shiny.rstudio.com) to allow one to interactively annotate each phasogram such that the NRL could be calculated conveniently. The app allows one to select a smoothing window size to minimise noise in the phasograms. A smoothing window of 20 bp was used in our calculations. The app also provides Next and Back buttons to allow the user to go through many phasograms, as well as an intuitive user interface to load and save data.
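The smoothing and peak-picking steps can be sketched as follows. This is a minimal Python illustration and not the NRLcalc code, which is an R Shiny app: a flat moving average is assumed as the smoothing method, and the function names, the 100 bp minimum peak separation and the synthetic signal are hypothetical.

```python
def moving_average(values, window=20):
    """Flat moving average over `window` points, truncated at the edges.

    The window for index i covers i - window // 2 up to
    i - window // 2 + window - 1.
    """
    n = len(values)
    smoothed = []
    for i in range(n):
        start = i - window // 2
        lo = max(0, start)
        hi = min(n, start + window)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def local_maxima(values, min_separation=100):
    """Indices holding the largest value within +/- min_separation // 2."""
    half = min_separation // 2
    peaks = []
    for i in range(1, len(values) - 1):
        neighbourhood = values[max(0, i - half):i + half + 1]
        is_top = values[i] == max(neighbourhood)
        is_flat = max(neighbourhood) == min(neighbourhood)
        if is_top and not is_flat:
            # On a plateau, keep only the first index.
            if not peaks or i - peaks[-1] > half:
                peaks.append(i)
    return peaks


# Synthetic phasogram with triangular bumps at 186, 372 and 558 bp.
signal = [sum(max(0, 30 - abs(d - centre)) for centre in (186, 372, 558))
          for d in range(700)]
print(local_maxima(moving_average(signal, window=20)))
```

The resulting peak positions are then passed to a linear fit of peak position against peak number to obtain the NRL.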
Competing interests
No competing interests declared.
Funding
This work was funded by the Wellcome Trust grants 200733/Z/16/Z and 211967/Z/18/Z and by the Frontrunner program of the University of Essex.
Data availability
Our software is available at https://github.com/chrisclarkson/NRLcalc
Acknowledgements
We thank Boyan Bonev and Giacomo Cavalli for providing the coordinates of chromatin loops and TADs, Stuart Newman for the computer cluster support and Yevhen Vainshtein for the NucTools support and fruitful discussions.