Abstract
In vivo, the human genome folds into a characteristic ensemble of three-dimensional structures. The mechanism driving the folding process remains unknown. We report a theoretical model for chromatin (Minimal Chromatin Model) that explains the folding of interphase chromosomes and generates chromosome conformations consistent with experimental data. The energy landscape of the model was derived by using the maximum entropy principle and relies on two experimentally derived inputs: a classification of loci into chromatin types and a catalog of the positions of chromatin loops. First, we trained our energy function using the Hi-C contact map of chromosome 10 from human GM12878 lymphoblastoid cells. Then we used the model to perform molecular dynamics simulations producing an ensemble of 3D structures for all GM12878 autosomes. Finally, we used these 3D structures to generate contact maps. We found that simulated contact maps closely agree with experimental results for all GM12878 autosomes.
The ensemble of structures resulting from these simulations exhibited unknotted chromosomes, phase separation of chromatin types, and a tendency for open chromatin to lie at the periphery of chromosome territories.
One Sentence Summary: We report a model for chromatin that explains and accurately reproduces the three-dimensional structure of chromosomes in interphase.
Main Text:
Chromatin comprises a highly flexible polymer composed of nucleosomes – DNA wrapped around histone proteins – connected to one another by a linker region of 20-50 base pairs (bp). Hundreds of associated structural and regulatory proteins interact with the genetic material, coordinating the way chromatin folds to fit inside the nucleus of eukaryotic cells.
The resulting ensemble of partially organized structures brings sections of DNA separated by a great genomic distance into close spatial proximity, and plays an important role in controlling gene transcription (1, 2). Although some of the features of this ensemble can be explained using simple polymer physics (3-5), there is now ample evidence that specific biochemical interactions play a crucial role (6-8). Understanding the interplay between biochemistry, genome architecture, and transcriptional regulation is a major outstanding challenge.
For over two decades, molecular biology techniques that combine chromatin fragmentation and proximity ligation have given us quantitative information about how chromatin is organized in vivo (5, 9-11). In recent years, Hi-C experiments have made it possible to measure the frequency of contact between all pairs of genomic loci using a single experiment.
Here, we explore a physical model by which local interactions between genomic loci can lead to the conformations of human chromosomes in interphase. Specifically, we propose a theoretical energy landscape model for chromatin folding, designated the Minimal Chromatin Model (MiChroM), which uses the maximum entropy principle (12, 13) in combination with a minimal number of assumptions in order to model the structural consequences of the aforementioned biochemical interactions. Importantly, MiChroM can be used to model biochemical interactions even though the identity of the interacting biomolecules is unknown. MiChroM suggests a mechanism that is sufficient to explain chromatin organization and can be used to generate ensembles of 3D structures describing whole genomes. As we will show, contact maps generated in silico from these ensembles of structures reproduce in detail the maps from Hi-C.
The first assumption made in MiChroM is that the genome is partitioned into intervals of a handful of types, such that each type of interval is marked by characteristic histone modifications and interacts with a characteristic combination of nuclear proteins. As a result, when two segments of chromatin come into contact, the effective free energy change due to this contact depends, to first order, on the chromatin type of each segment (see also Jost et al. (14)).
This assumption is supported by both biochemical and structural data. For instance, five distinct types of chromatin have been found in Drosophila cells based on the binding patterns of nuclear proteins (15). Further, analysis of original Hi-C maps (5) suggested that human chromatin is partitioned into two compartments, A and B, each associated with distinct long-range contact patterns. More recently, Rao et al. (8) used kilobase-resolution Hi-C experiments to show that the human genome can be further partitioned into six subcompartments (A1, A2, B1, B2, B3, and B4), each correlated with particular histone marks and associated with a particular pattern of long-range contacts. A similar partitioning of the genome was also observed in mouse (8, 16) and Drosophila (17, 18). Both the boundaries of these genomic intervals and their chromatin types may change along with changes in cell state (8). The close association between interval types and long-range contact patterns suggests that intervals of the same type segregate together in the nucleus.
The second assumption made in MiChroM is that certain pairs of genomic “anchor” loci tend to form loops. This tendency is encoded in the model as a change in the effective free energy of a chromatin configuration when the two anchor loci are in contact. This assumption is well supported by historical literature (7), and has been further confirmed by recent high-resolution Hi-C maps of the human genome, where loops are visible as peaks in the contact probability map (8). Most loops are associated with convergent pairs of CCCTC binding factor (CTCF)-binding motifs, which have been proposed to help orchestrate loop formation via extrusion (19). MiChroM, however, makes no assumption about the particular mechanism of loop formation.
Finally, MiChroM assumes that every time a pair of loci comes into contact there is a gain or loss of effective free energy, γ(d), that depends only on the genomic distance, d, between them. This “ideal chromosome” term models the local structure of chromatin in the absence of compartmentalization or looping (13), and is translationally invariant along the sequence by construction. The form of the ideal chromosome potential is supported by the widespread evidence that chromatin can behave like a liquid crystal (20, 21), and is consistent with the popular notion of the existence of a higher order fiber in chromatin (22) while remaining more general.
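To build a physical model for chromatin, we use the maximum entropy principle to convert the above three assumptions into an information theoretical energy function. The effective energy that maximizes the information theoretic entropy takes the following form (see SI):
$$U_{\mathrm{MiChroM}}(\vec{r}) = U_{\mathrm{HP}}(\vec{r}) + \sum_{k \geq l} \alpha_{kl} \sum_{i \in k,\, j \in l} f(r_{ij}) + \chi \sum_{(i,j) \in \mathrm{loops}} f(r_{ij}) + \sum_{d} \gamma(d) \sum_{i} f(r_{i,i+d})$$
and includes, respectively, the potential energy, $U_{\mathrm{HP}}$, characterizing a generic homopolymer, the interactions between chromatin types (assumption #1), the interactions between loop anchors (assumption #2), and the translationally invariant compaction term (assumption #3). Here $f(r_{ij})$ denotes the probability of contact between loci $i$ and $j$ separated by a spatial distance $r_{ij}$, $\alpha_{kl}$ is the interaction energy between chromatin types $k$ and $l$, $\chi$ is the interaction energy between loop anchors, and $\gamma(d)$ is the ideal chromosome term introduced above.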
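As an illustration, the three data-driven terms of such an energy function can be sketched in a few lines of Python. The sketch below is not the implementation used in this work: the sigmoidal contact function, its parameters `mu` and `rc`, the function names, and all numerical values are assumptions chosen for readability, and the homopolymer term is omitted.

```python
import math


def contact_probability(r, mu=3.0, rc=1.5):
    """Smooth switch from ~1 (loci touching) to ~0 (loci far apart).

    The tanh form and the values of mu and rc are assumptions.
    """
    return 0.5 * (1.0 + math.tanh(mu * (rc - r)))


def data_driven_energy(coords, types, loops, alpha, chi, gamma):
    """Sum the type-type, loop, and genomic-distance terms over all pairs.

    coords: list of (x, y, z) monomer positions
    types:  list of chromatin type labels, one per monomer
    loops:  iterable of (i, j) loop-anchor index pairs
    alpha:  dict mapping a sorted pair of type labels to an energy
    chi:    energy assigned to a loop-anchor contact
    gamma:  function of the genomic distance d = j - i
    """
    loop_set = {tuple(sorted(pair)) for pair in loops}
    energy = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            f = contact_probability(math.dist(coords[i], coords[j]))
            type_pair = tuple(sorted((types[i], types[j])))
            energy += alpha[type_pair] * f   # assumption 1: chromatin types
            if (i, j) in loop_set:
                energy += chi * f            # assumption 2: loop anchors
            energy += gamma(j - i) * f       # assumption 3: ideal chromosome
    return energy


# Toy example with hypothetical parameter values.
toy_energy = data_driven_energy(
    coords=[(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    types=["A", "B"],
    loops=[(0, 1)],
    alpha={("A", "B"): -0.2},
    chi=-1.0,
    gamma=lambda d: -0.1,
)
```

Evaluating the pairwise sum directly costs O(N^2) per configuration, which is acceptable for a sketch but would call for neighbor lists or vectorization for a polymer with thousands of monomers.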
This potential function contains 27 parameters that must be provided in order to fully specify the model. Once the potential function is fully specified, it is possible to perform molecular dynamics simulations of chromatin using as input the classification of loci into chromatin types and the location of loops. This procedure is directly analogous to the simulation of protein folding using amino acid sequence and disulfide bond positions as the only input.
Determining the optimal value for these 27 parameters requires a training data set. In this case, we iteratively adjusted the parameter set in order to reproduce data extracted from a Hi-C contact map of chromosome 10 generated using GM12878 cells (8). To do so, we modeled human chromosome 10, which is approximately 136 Mbp long, as a polymer containing 2712 monomers, each representing 50 kb of DNA. We used the annotations generated by Rao et al. to assign each monomer a chromatin type, as well as to specify the positions of loops between pairs of monomers. In each iteration, we combined these polymer specifications with the current parameter set in order to generate an ensemble of structures. We then used this ensemble to generate a simulated map of pairwise inter-monomer contact frequencies, and compared this contact map to the one obtained by Rao et al. experimentally in order to choose the next set of parameters (see SI).
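The step that converts an ensemble of structures into a simulated contact map can be sketched as follows. This is a minimal illustration that assumes a hard distance cutoff as the contact criterion; the function name `simulated_contact_map`, the `cutoff` value, and the toy coordinates are hypothetical and may differ from the procedure detailed in the SI.

```python
import math


def simulated_contact_map(ensemble, cutoff=1.5):
    """Return the fraction of structures in which each monomer pair is in contact.

    ensemble: list of structures, each a list of (x, y, z) monomer positions
    cutoff:   distance below which two monomers count as in contact (assumed)
    """
    n = len(ensemble[0])
    counts = [[0] * n for _ in range(n)]
    for coords in ensemble:
        for i in range(n):
            for j in range(i + 1, n):
                if math.dist(coords[i], coords[j]) <= cutoff:
                    counts[i][j] += 1
                    counts[j][i] += 1  # keep the map symmetric
    n_structures = len(ensemble)
    return [[c / n_structures for c in row] for row in counts]


# Toy ensemble: two structures of a three-monomer chain.
toy_ensemble = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
]
toy_map = simulated_contact_map(toy_ensemble)
```

In the toy ensemble, monomers 0 and 1 touch only in the first structure and monomers 1 and 2 only in the second, so both pairs receive a contact frequency of 0.5.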
The simulated contact maps obtained using the final set of parameters correspond closely to the experimental contact maps obtained for chromosome 10 (Pearson’s r = 0.95). This correspondence goes beyond the visually obvious “checkerboard” pattern in the simulated contact map (Figure 1). In general, all features larger than 300-400 kb in the experimental contact map (i.e., features that are about an order of magnitude larger than the size of an individual monomer in our simulations) appear to be accurately recapitulated by the MiChroM model. Notably, the power law scaling relationship between the probability of forming contacts and genomic distance, often used to justify the non-equilibrium fractal globule model, is also reproduced with great accuracy by this equilibrium model (Figure 1E).
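Both comparisons mentioned above, the Pearson correlation between two contact maps and the decay of contact probability with genomic distance, can be computed with a short script. The sketch below is illustrative only: restricting the correlation to monomer pairs with i < j, the function names, and the toy matrices are assumptions, not a description of the analysis pipeline used here.

```python
import math


def upper_triangle(cmap):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(cmap)
    return [cmap[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson_r(map_a, map_b):
    """Pearson correlation between two contact maps over pairs with i < j."""
    x, y = upper_triangle(map_a), upper_triangle(map_b)
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def contact_probability_vs_distance(cmap):
    """Average contact frequency along each diagonal, keyed by genomic distance d."""
    n = len(cmap)
    return {
        d: sum(cmap[i][i + d] for i in range(n - d)) / (n - d)
        for d in range(1, n)
    }


# Toy maps: the second is a linear rescaling of the first.
toy_experimental = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
toy_simulated = [[2.0 * v + 0.1 for v in row] for row in toy_experimental]

r_value = pearson_r(toy_experimental, toy_simulated)
scaling = contact_probability_vs_distance(toy_experimental)
```

Plotting the values in `scaling` against d on log-log axes is one way to inspect a power law relationship; a straight line would indicate power law decay.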
Next, we applied the MiChroM model to the remaining GM12878 autosomes by combining the potential function with the experimentally derived monomer type and loop annotations. When each chromosome is simulated separately, the resulting intrachromosomal contact map closely corresponds to the experimental contact map in every case. Notably, the correspondence for autosomes that were not used to train the potential function was typically as close (Pearson’s r = 0.95) as the correspondence for chromosome 10 (see Figure 2, Figures S2-S47, and SI).
When we examined the ensemble of 3D structures for each individual chromosome, we observed that each chromosome formed a compact chromosome territory. We also observed the phase separation of chromatin types within this territory, leading to subvolumes comprising only a single type of genomic interval (Figure 3A). Usually, only a single subvolume formed for each subcompartment, although in some cases we observed multiple subvolumes of a single type. Similarly, we see that highly expressed genes (as measured by RNA-Seq (23)) tend to occupy spatial subvolumes, which is expected given that highly expressed genes lie predominantly in the A compartment. Overall, these findings are consistent with the notion that different types of intervals co-localize in distinct spatial compartments. Interestingly, the A compartment tends to be less densely packed and to lie at the periphery of the chromosome territory. These observations are consistent with the findings of prior studies using both microscopy and Hi-C (8, 24, 25). Notably, a control model composed of a simple self-avoiding homopolymer chain failed to exhibit any of these results, and instead recapitulated the expected properties for an equilibrium globule (Figures 3A and 3B).
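The tendency of a chromatin type to lie at the periphery of a territory can be quantified, for example, by comparing the mean distance of each type from the center of mass of the structure. The sketch below shows one hypothetical way to do this for a single structure; the function name, the type labels, and the toy coordinates are illustrative assumptions, not the analysis performed here.

```python
import math
from collections import defaultdict


def mean_radial_position_by_type(coords, types):
    """Mean distance from the structure's center of mass for each chromatin type.

    coords: list of (x, y, z) monomer positions (equal monomer masses assumed)
    types:  list of chromatin type labels, one per monomer
    """
    n = len(coords)
    center = tuple(sum(point[k] for point in coords) / n for k in range(3))
    radii = defaultdict(list)
    for point, chrom_type in zip(coords, types):
        radii[chrom_type].append(math.dist(point, center))
    return {t: sum(values) / len(values) for t, values in radii.items()}


# Toy structure: two "B1" monomers at the center, two "A1" monomers outside.
toy_radii = mean_radial_position_by_type(
    coords=[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (-2.0, 0.0, 0.0)],
    types=["B1", "B1", "A1", "A1"],
)
```

Averaging this quantity over all structures in an ensemble would give a per-type radial profile; a larger mean radius for a type indicates that it tends to sit closer to the periphery.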
It is commonly assumed that one essential feature of chromosomes is the absence of knots, as one might suppose that a highly knotted structure could create obstacles to the transcription process. We studied the extent of knotting in the ensemble of chromosome structures sampled from the optimized energy landscape and from the homopolymer potential. In order to quantify knotting in a particular conformation of the chromosome, we used two different knot invariants: the Alexander polynomial and the minimal rope length required to generate a topologically equivalent knot (13, 26). Both measures show that the configurations produced by MiChroM are largely devoid of knots. In contrast, the homopolymer control system tended to form extraordinarily complex knots (Figure 3C). This topological feature is a direct result of inferring the energy landscape from the three physical assumptions explained above. Remarkably, the simple equilibrium mechanism underlying MiChroM produces ensembles of structures that are devoid of knots.
Finally, we used MiChroM to jointly simulate chromosomes 17 and 18 (Figure S1). This allowed us to explore whether the MiChroM potential function, which was trained using a single intrachromosomal contact map for chromosome 10, could successfully reproduce genome architecture at a larger scale. The resulting intrachromosomal contact maps are essentially the same as those simulated in isolation (Pearson’s r = 0.96). The phenomenon of phase separation of chromatin types now extends to both chromosomes creating larger regions of space occupied by one single type. Spatial confinement introduces artifacts in the frequency of interchromosomal contacts; therefore, the interchromosomal contact map from simulation shows somewhat increased probabilities with respect to Hi-C. Even with the biased intensity, the two-chromosome map shows a correct pattern of interchromosomal interactions.
When we examined the 3D ensemble, we found that, despite the extensive contacts between the chromosomes, the chromosomes were not entangled with one another (Figure S1B); instead, we observed the formation of non-overlapping chromosome territories. This last result highlights the fact that MiChroM can successfully recapitulate features of the nucleus as a whole.
The Minimal Chromatin Model assumes that chromosomes fold under the action of a cloud of proteins that bind with different selectivity to different sections of chromatin, and offers a simple strategy for recapitulating the energy landscape created by such interactions. This energy landscape brings about transient contacts rather than permanent ones, which is consistent with the fact that most of the experimentally observed contacts between two genetic loci only occur in a small fraction of cells at a given time (5, 27). Contacts associated with loop formation tend to be more frequent; accordingly, our optimization algorithm assigns them a larger free energy gain upon formation. In humans, we find that six types of chromatin are sufficient to reproduce the arrangement of interphase DNA in vivo. The fact that our model can be reliably transferred from one chromosome to the rest suggests the plausibility of the proposed energetic mechanism, even if the underlying biochemical details remain unclear at the present time.
As shown, MiChroM is able to explain and reproduce the results of DNA proximity ligation experiments. Nevertheless, caution must be applied in the interpretation of these results. Hi-C experiments are performed using millions of cells at once, and report only a population average. We know little about what happens in individual cells at specific moments in time. For instance, a typical cell population interrogated by Hi-C may contain entirely separate subpopulations, as well as fluctuating or even oscillating configurations. Because MiChroM is trained on population-averaged data, such heterogeneity would be lost in the model.
The classification of loci into chromatin types and the position of chromatin loops, which are inputs of our model, are strongly associated with epigenetic features (histone modifications and bound CTCF motifs in convergent orientation) that can be directly and inexpensively assayed by ChIP-Seq. Exploiting these associations along with MiChroM opens up the possibility of predicting in silico the 3D structure of whole genomes starting from 1-dimensional genomics data, which are often already publicly available.
Acknowledgments:
We thank Ryan R. Cheng, Davit Potoyan and Lena Simine for many useful discussions, and Erica J. Di Pierro for help in editing the manuscript.
This work was supported by the Center for Theoretical Biological Physics sponsored by the National Science Foundation (grants PHY-1427654 and NSF-MCB-1214457) and by the Cancer Prevention and Research Institute of Texas (CPRIT – grant R1110). Michele Di Pierro was also supported by the Welch Foundation (grant C-1792).