Abstract
Adapting a well-established formalism in polymer physics, we develop a minimalist approach to infer three-dimensional (3D) folding of chromatin from Hi-C data. The 3D chromosome structures generated from our heterogeneous loop model (HLM) are used to visualize chromosome organizations that can substantiate the measurements from FISH, ChIA-PET, and RNA-Seq signals. We demonstrate the utility of HLM with several case studies. Specifically, the HLM-generated chromosome structures, which reproduce the spatial distribution of topologically associated domains (TADs) from FISH measurement, show the phase segregation between two types of TADs explicitly. We discuss the origin of cell-type dependent gene expression level by modeling the chromatin globules of α-globin and SOX2 gene loci for two different cell lines. We also use HLM to discuss how the chromatin folding and gene expression level of Pax6 loci, associated with mouse neural development, are modulated by interactions with two enhancers. Finally, HLM-generated structures of chromosome 19 of mouse embryonic stem cells (mESCs), based on single-cell Hi-C data collected over each cell cycle phase, visualize changes in chromosome conformation along the cell cycle. Given a contact frequency map between chromatin loci supplied from Hi-C, HLM is a computationally efficient and versatile modeling tool to generate chromosome structures, which can complement the interpretation of other experimental data.
INTRODUCTION
Recent advances in chromosome conformation capture techniques combined with parallel sequencing1–5 and fluorescence imaging microscopies have ushered in a new era of chromosome research over the past decade. Along with post-translational histone modifications, which have led to the conceptualization of epigenomes6, the critical findings from fluorescence imaging and Hi-C data, that the spatial organization of chromatin varies with the tissue or cell types7, 8, cell cycle4, and pathological states9–11, have brought a new dimension to our understanding of genome functions.
Among others, maps of genome-wide contact frequencies, quantified by Hi-C data, offer unprecedented opportunities to infer 3D chromosome structures in cell nuclei12–22. In a nutshell, Hi-C provides the contact frequencies of genomic loci pairs based on the statistics of PCR-amplified DNA fragments digested from formaldehyde cross-linked cells1, 2. One can interpret Hi-C as measuring the population-sampled contact probability between a pair of genomic loci, say i and j, pij. A proper mathematical mapping of pij to the spatial distance rij is of critical importance for interpreting fluorescence imaging data23, 24 in comparison with Hi-C data.
The advent of fluorescence in situ hybridization (FISH) followed by C-based techniques has motivated much effort to capture the principles underlying the three-dimensional (3D) folding of chromosomes. This has led to the development of a series of polymer-based models over the decades, which include the “multiloop subcompartment model,”25, 26 “random loop model” (RLM),27–29 “strings and binders switch” model12, 15, 30 and its derivative17, 31, 32, “loop extrusion model,”13–15, 33 “minimal chromatin model,”34 and more recently “chromosome copolymer model.”22 Among them, while applicability is limited to the associated spatio-temporal scale of the model being considered, some were developed by keeping a specific molecular mechanism in mind or by incorporating “one-dimensional” information of epigenetic modification and/or DNA accessibility along genomic loci as input to a heteropolymer model22, 32, 35. On the other hand, partly sacrificing the model simplicity, others were developed solely for the purpose of reconstructing more precise 3D chromatin structures from Hi-C20, 36–38 and other experiments39.
As the cell imaging data over different cell types is rapidly growing, comparative study of chromosome conformations has become imperative. In the abovementioned models, however, a physically sound mapping of pij from Hi-C to the spatial distance rij (see review40) is still lacking, and the computational cost is still high. To this end, here we develop a minimalist model that allows us to generate chromatin conformations from Hi-C data in a most efficient way and to study the structural characteristics of chromosome at a length scale of interest corresponding to the resolution of the given data. In order to achieve such a goal in a most simplifying manner, one could learn much from the literature on generic polymer problems, such as the collapse transition of an isolated polymer chain or macromolecular networks with increasing number of internal bonds41–44, and polymer conformation and dynamics inside confinement45, 46.
Pushing the polymer physics idea to its extreme, we propose a minimalist approach, termed the heterogeneous loop model (HLM), that allows us to build 3D structures of chromosomes from Hi-C data. HLM adapts the random loop model (RLM) which was originally developed based on a randomly crosslinked polymer chain27, 28, 49. In RLM, which represents chromosome conformation in terms of the sum of harmonic potentials, pairwise contact probabilities are expressed analytically in terms of a few model parameters. Here, without sacrificing the mathematical tractability and simplicity of the RLM, we extend the RLM to HLM by allowing the loop interactions to be non-uniform and heterogeneous, such that the resulting loop interactions can best represent a given Hi-C data.
In this study, we apply HLM to various regions of human and mouse genomes that span 1 – 100 Mb at 5 – 500 kb resolution, and generate the corresponding conformational ensemble of chromosomes. We demonstrate the utilities of HLM by comparing the structural information extracted from HLM-generated chromosome ensemble with those implicated from the measurements from FISH23, 24, 28, chromatin interaction analysis by paired-end tag sequencing (ChIA-PET)50, 51, and previous modeling studies28, 32, 37, 52, 53. Through multiple examples this study will demonstrate that HLM is an excellent approach to infer 3D structures from Hi-C data.
RESULTS
HLM is effectively a multi-block copolymer model in which monomer-monomer interactions (loops) are harmonically restrained with varying interaction strengths (kij) (Methods and SI). Mapping the pairwise contact probabilities pij from Hi-C to the model parameters kij is the essence of HLM. By incorporating a standard Lennard-Jones non-bonded potential slightly below the θ-condition, which takes into account the short-range excluded volume interaction between monomers as well as global thermodynamic driving force that induces microphase separation between different monomer types, HLM allows us to generate a conformational ensemble of chromosome structures that reproduces a contact probability matrix that displays close resemblance to the original input Hi-C data. We used HLM to model various genomic regions (see Table I). HLM-generated chromosome conformations were used to interpret the currently available experimental results.
Spatial distribution of TADs inferred from HLM in comparison with FISH measurement
Intra-chromosomal distances between TADs in human IMR90 cells, measured by Wang et al. through a multiplexed FISH method23, have been used as a benchmark for different models38. To show the utility of HLM, we model 34 Mb genomic region on chr21 of IMR90 cells, which contains 33 labeled TADs (Table S1 provides the genomic positions of these TADs).
First, the contact probability matrix constructed from HLM-generated structures captures the characteristic checkerboard pattern of the heatmap of Hi-C data; the mean contact probability PHLM(s) of HLM is consistent with PHi-C(s) calculated from Hi-C over all length scales including the wiggly pattern at large s (Figs. 1A and 1B).
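The mean contact probability P(s) compared above can be obtained by averaging a contact probability matrix along each off-diagonal. The following is a minimal sketch of that averaging; the function name `mean_contact_probability` and the toy matrix are illustrative assumptions, not the code used in this study.

```python
import numpy as np

def mean_contact_probability(p):
    """Average p[i, j] over all pairs with |i - j| = s, for s = 1 .. N-1."""
    p = np.asarray(p, dtype=float)
    n = p.shape[0]
    # np.diagonal(p, offset=s) returns the elements p[i, i+s]
    return np.array([np.diagonal(p, offset=s).mean() for s in range(1, n)])

# Toy symmetric map whose contacts decay as 1/s (illustrative values only)
n = 6
idx = np.arange(n)
sep = np.abs(idx[:, None] - idx[None, :])
toy = np.where(sep > 0, 1.0 / np.maximum(sep, 1), 1.0)
P_s = mean_contact_probability(toy)   # P_s[0] is s = 1, P_s[1] is s = 2, ...
```

Applying the same function to the Hi-C matrix and to the model matrix gives two curves that can be overlaid on a log-log plot.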
The heatmap calculated for inter-TAD distances using the HLM-generated conformational ensemble (lower diagonal part of Fig. 1C) can directly be compared with the FISH measurement (upper diagonal part). The square block pattern along the diagonal axis of the heatmap indicates that 4–5 adjacent TADs constitute an aggregate, reminiscent of meta-TAD30, and the patterns in the off-diagonal part (highlighted by the magenta boxes) suggest long range clustering of TADs. The error of the inter-TAD distance heatmap relative to FISH is 0.184, which is comparable to the value of GEM model38 and better than others (see Fig. 4D in Ref.38). A principal component analysis of this matrix (top left part of the matrix in Fig. 1C) divides TADs into A/B types23. Aligning the geometric centers of HLM-generated A- and B-type TADs parallel to the x-axis highlights a polarized organization of A- and B-type TADs (see Fig. 1D)23.
The intrachain end-to-end distance displays a scale-dependent scaling relationship with the genomic distance s, $r(s) \sim s^{\nu}$ (Fig. 1E). In qualitative agreement with the FISH measurement23, there is a crossover around s = 7 Mb, such that ν ≈ 1/3 for s < 7 Mb and ν ≈ 0.21 for s > 7 Mb.
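A scaling exponent of this kind is commonly estimated as the slope of log r versus log s on either side of the crossover. The sketch below does this on synthetic data built with the two exponents quoted above; the fitting routine, the data, and the fixed crossover at 7 Mb are assumptions for illustration and do not reproduce the fitting procedure behind Fig. 1E.

```python
import numpy as np

def scaling_exponent(s, r):
    """Least-squares slope of log r versus log s."""
    slope, _intercept = np.polyfit(np.log(s), np.log(r), 1)
    return slope

# Synthetic r(s): exponent 1/3 below the crossover, 0.21 above it
s = np.linspace(0.5, 30.0, 60)            # genomic distance in Mb
s_cross = 7.0
r = np.where(s < s_cross,
             s ** (1.0 / 3.0),
             s_cross ** (1.0 / 3.0) * (s / s_cross) ** 0.21)

nu_short = scaling_exponent(s[s < s_cross], r[s < s_cross])
nu_long = scaling_exponent(s[s > s_cross], r[s > s_cross])
```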
We explore the relationship between contact probability pij and the corresponding distance rij of two loci. It is expected that the looping probability of a polymer is inversely proportional to the volume of space (V) explored by the two loci, $P_{\mathrm{loop}} \sim 1/V$. Since the volume V scales with the spatial separation (R) between the two loci in d dimensions as $V \sim R^{d}$, it follows that54–56
$$P_{\mathrm{loop}} \sim R^{-(d+g)} \qquad (1)$$
The correlation hole exponent g is g = 0 for a Gaussian chain57. According to the Flory theorem58–61, the ideal chain statistics is a good approximation for a chain in polymer melts or for a subchain in a fully equilibrated globule. Since d = 3 for 3D, we expect $P_{\mathrm{loop}} \sim R^{-3}$, or equivalently $r_{ij} \sim p_{ij}^{-1/3}$ (see also Fig. S1B). In fact, this scaling relation is observed for the data points generated by HLM for rij < 1 μm (Fig. 1F). Although Wang et al., who combined Hi-C and FISH data, reported a scaling relation of for the entire range, it is not clear whether the relation can straightforwardly be extended to the range of rij < 1 μm where the data point from their measurement might be less accurate. According to the HLM-generated data a more proper scaling should be for rij < 1 μm and for rij > 1 μm.
Next, to demonstrate another analysis on FISH measurement, we applied HLM to the q-arm of chr11 in IMR90 cells, whose intrachain pairwise distances between genomic loci had been measured with FISH28, 64 (see Table S2 for the position of FISH probes in the genome and in the model). The model produces the contact probability matrix with a Pearson correlation (PC) of 0.98 relative to Hi-C data (see Figs. S3A, S3B, and SI for discussion of PC in comparison with an alternative method). HLM enables us to calculate the spatial distances between specific pairs of loci (Fig. S3C), with a mean relative error of 0.189 (with respect to FISH data). The HLM-generated structural ensemble also indicates that compared to the gene-poor and transcriptionally inactive anti-ridge domain, the transcriptionally active ridge domain is less compact, less spherical, and has a rougher domain surface (Figs. S3D-F), all of which are in agreement with the FISH experiment64. Modeling another 30 Mb region on chr1 of IMR90 cells leads to similar results (Fig. S4 and Table S3).
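The two figures of merit quoted above, the Pearson correlation between contact maps and the mean relative error of distances, can be computed as in the following sketch. Which matrix elements enter the correlation is not specified in the main text, so using all pairs with i < j is an assumption, as are the function names and the toy inputs.

```python
import numpy as np

def upper_triangle(m, k=1):
    """Flatten the elements above the main diagonal (pairs with i < j)."""
    m = np.asarray(m, dtype=float)
    i, j = np.triu_indices(m.shape[0], k=k)
    return m[i, j]

def pearson_between_maps(p_model, p_hic):
    """Pearson correlation between two contact maps over pairs i < j."""
    return np.corrcoef(upper_triangle(p_model), upper_triangle(p_hic))[0, 1]

def mean_relative_error(d_model, d_ref):
    """Mean of |d_model - d_ref| / d_ref over the measured pairs."""
    d_model = np.asarray(d_model, dtype=float)
    d_ref = np.asarray(d_ref, dtype=float)
    return np.mean(np.abs(d_model - d_ref) / d_ref)

# Toy inputs: a random symmetric map and a linearly rescaled copy of it
rng = np.random.default_rng(1)
a = rng.random((5, 5))
p_hic = (a + a.T) / 2.0
p_model = 0.5 * p_hic + 0.1
pc = pearson_between_maps(p_model, p_hic)            # linear rescaling gives PC = 1
err = mean_relative_error([1.1, 1.8], [1.0, 2.0])    # two toy distances
```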
Visualization of chromatin globules
α-globin gene
Cis-regulatory elements generally mediate the transcription of neighboring genes within a range smaller than 1 Mb65. The α-globin gene domain, a 500 kb-genomic region known as ENm008 located at the left telomere of human chr16, has previously been studied to decipher the relationship between chromatin structure and transcription activity37, 52, 53. RNA-seq data62, 66, 67 indicate that the α-globin genes (including ζ-, μ-, α2-, α1- and θ-globin genes) are expressed in K562 cell lines, but silenced in GM12878 (tracks on the left side of the Hi-C heatmaps in Fig. 2A). According to 3C/5C measurements52, 68, the α-globin gene forms long-range looping interactions with multiple regulatory elements upon gene activation. Among them, of particular interest is one of the DNase I-hypersensitive sites (DHS), HS40, located at ~ 70 kb upstream of the α1 gene.
The HLM-generated structural ensemble at 5 kb resolution for ENm008 of two cell lines (K562 and GM12878) suggests that the contact probability P(s) decreases slightly faster in K562 than in GM12878 cells at large s (Fig. 2B). The α-globin domains of K562 and GM12878 cell lines visualized with FISH52 indicate that K562 is less compact than GM12878, which is confirmed straightforwardly by the compactness calculated using the HLM-generated structures (Fig. 2C). Compared with GM12878 cells, the α-globin domain in K562 cells adopts a less spherical shape (Fig. 2D)52, 53.
Next, we examined the changes in the distances between the α1-globin gene and other loci upon activation of the gene. Even though the whole domain in K562 cells is relatively more expanded, HS40 is closer to the α1 gene in K562 than in GM12878 cells (Fig. 2E), which is consistent with the expectation based on the higher contact enrichment between HS40 and α1 gene observed in K562 by 3C/5C measurements (e.g., Fig. 2 in Ref.52). Through inter-cell line comparison between K562 and GM12878 for the rest of the region using distance distribution to the α-globin gene locus, we identified a group of loci other than HS40 that are significantly closer to α-globin genes in K562 cells (Mann-Whitney U test, p < 1 × 10−5). Their genomic positions are marked using red sticks in Fig. 2E. According to the independent ChIA-PET experiments50, 51 designed to capture the chromatin loop interactions mediated by specific protein factors, the structural variation associated with α-globin genes is mainly orchestrated by Pol II (see Table S4). HLM captures 83% of Pol II-mediated chromatin loops specific to K562 cells (Fig. 2F).
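The locus-by-locus comparison described above amounts to testing, for each locus, whether its distance to the α-globin gene is distributed at smaller values in one cell line than in the other. A minimal sketch with SciPy's `mannwhitneyu` follows; the synthetic distance samples and the one-sided alternative are assumptions, since the text does not state whether a one- or two-sided test was used.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic distance samples for one locus (arbitrary units, illustrative only)
rng = np.random.default_rng(0)
d_k562 = rng.normal(loc=1.0, scale=0.2, size=500)
d_gm12878 = rng.normal(loc=1.5, scale=0.2, size=500)

# One-sided test: are distances in K562 stochastically smaller than in GM12878?
stat, p_value = mannwhitneyu(d_k562, d_gm12878, alternative="less")
is_closer = p_value < 1e-5    # significance threshold quoted in the text
```

Repeating the test for every locus in the region and keeping those that pass the threshold gives the set of candidate K562-specific contacts.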
Taken together, HLM captures both the tissue-specific variation in the global packing of the α-globin gene domain, and variation in the structure of gene locus. The multiple K562-specific interactions, substantiated by HLM, suggest that a cooperative action of multiple regulatory elements including HS40 is responsible for the activation of α-globin genes37. HLM-generated conformations indeed confirm the notion of chromatin globule proposed in Ref.52.
SOX2 gene
As another example of transcription-dependent chromatin folding, we studied the human SOX2 gene locus, which encodes a transcription factor involved in the regulation of embryonic development. The SOX2 gene is transcribed in human embryonic stem cells (hESCs), but not in umbilical vein endothelial cells (HUVECs) (Fig. 3A). To compare the results from HLM with a recent modeling study32, we measured the distances between SOX2 gene and two possible regulatory elements located at regions ~800 kb upstream (US) and ~650 kb downstream (DS). Whereas both elements are closer to the SOX2 locus in transcriptionally active hESCs than in inactive HUVECs, the chromatin fiber is less compact in hESCs (Fig. 3D, see also the snapshots in Figs. 3E and F). HLM-generated structures demonstrate the dependence of chromatin folding on the transcription level at SOX2 gene loci, and this trend comports well with the prediction made in Ref.32 that also employed polymer model simulation.
Chromatin interactions at complex genomic loci
The efficacy of HLM was further tested for the genomic locus of the Pax6 gene, which is involved in the development of mouse neural tissues. Flanked by two neighboring genes (Pax6os1 and Elp4), the expression level of Pax6 gene is considered to be regulated by multiple long-range elements, including two regulatory regions located at ~50 kb upstream (URR) and ~95 kb downstream (DRR) (Fig. 4A). The DRR contains several DNase I-hypersensitive sites and the SIMO enhancer, which was identified in transgenic reporter gene studies of developing mouse embryos71, 72. Another cis-regulatory element PE3 within URR has recently been identified from mouse pancreatic β-cells (β-TC3)70.
A study combining Capture-C, FISH and simulations32 has reported a non-trivial correlation between the expression level of Pax6 gene and the spatial separation from Pax6 gene to URR and DRR. Among the three types of mouse cells (β-TC3, MV+ and RAG cells) studied in Ref.32, Pax6 gene maintained the largest separation from DRR in the β-TC3 cells that displayed the highest expression level of Pax6. Therefore, it was suggested32 that the enhancer at DRR is not involved in upregulation of Pax6 in β-TC3 cells, or that some unclear upregulation mechanisms that do not require the spatial proximity to enhancers are responsible for the activity of Pax6 gene.
To study the origin of complex interplay between Pax6 gene and neighboring genetic elements, we applied HLM to the same genomic region of five different mouse cell types whose Hi-C data are currently available: (i) embryonic stem cells (mESCs), (ii) neural progenitors (NPCs), (iii) cortical neurons (CNs), (iv) ncx_NPC, and (v) ncx_CN, where the prefix “ncx_” indicates that the cells are directly purified from the developing mouse embryonic neocortex in vivo. Each cell type displays distinct transcriptional activity patterns of Pax6 and its neighboring genes48 (Fig. 4A). According to the FPKM scores from RNA-seq analysis (Fig. 4B), the five cell types display Pax6 activity in the following order: ncx_NPC > NPC > CN > ES > ncx_CN.
The contact probabilities calculated from our HLM-generated conformations reasonably reproduce the Hi-C data at a resolution of 8 kb48 (see Table I and Fig. S5). The Hi-C contact profiles of three genomic loci (URR, Pax6, and DRR) with other genomic regions (histograms in Fig. 4C) are well captured by HLM-generated conformations (lines in Fig. 4C). Compared with the distance of the Pax6 gene promoter (P) to the upstream enhancer (UE), Pax6 gene activity is better correlated with the distance to the downstream enhancer (DE) (see Fig. 4D); the closer the promoter is to the DE, the higher the Pax6 gene activity. The highest Pax6 gene activity is seen in ncx_NPC. Notice that the most enriched Hi-C contacts between Pax6 and DRR are indeed found in ncx_NPC, which is marked with a red star in Fig. 4C. We note that our finding on contacts between Pax6 and DRR is in contrast to that based on β-TC3 cells (see Fig. 2A in Ref.32). This, however, underscores that the mechanism or the chromatin conformations responsible for the Pax6 gene activity depend strongly on the cell type: at least, the mechanism of Pax6 gene regulation in ncx_NPC cells differs clearly from that in β-TC3 cells.
Next, given that Hi-C data is obtained from a collection of millions of cells, heterogeneity of chromatin conformations is inevitable in analyses, which has indeed been highlighted in Ref.32. To characterize the heterogeneity in the HLM-generated conformational ensembles, we classified each chromatin structure into five groups based on the separations between the Pax6 gene promoter (P) and two enhancers (UE and DE) (Fig. 4E). To visualize the conformational diversity, we randomly selected 200 structures and characterized them by the promoter-enhancer distances (Fig. 4F). Except for the “gray” group where all three separations are large, the population of conformational ensemble consists mainly of the “black” group (P is close to DE but not to UE), and the “purple” group (P is close to UE but not to DE), which are suspected to be responsible for high expression level of Pax6 gene. Consistent with our analysis on the ensemble-averaged distance to enhancers for different cells (Fig. 4D), the proportion of the “black” group shows a decreasing trend as Pax6 becomes less active (Fig. 4E), suggesting a more important role of DE than UE in regulating Pax6 gene for the five cell lines.
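The grouping of conformations by promoter-enhancer separations can be sketched as a simple rule on three distances. The text names only three of the five groups and does not state the distance cutoff, so in the sketch below the cutoff is a hypothetical parameter and the two unnamed groups are lumped together as "other"; this is not the classification code used for Fig. 4E.

```python
from collections import Counter

def classify(d_p_ue, d_p_de, d_ue_de, cutoff):
    """Assign one conformation to a group from its three pairwise separations.

    d_p_ue: promoter to upstream enhancer, d_p_de: promoter to downstream
    enhancer, d_ue_de: enhancer to enhancer. `cutoff` is a hypothetical
    threshold separating "close" from "far".
    """
    close_ue = d_p_ue < cutoff
    close_de = d_p_de < cutoff
    close_enhancers = d_ue_de < cutoff
    if not (close_ue or close_de or close_enhancers):
        return "gray"      # all three separations are large
    if close_de and not close_ue:
        return "black"     # P close to DE but not to UE
    if close_ue and not close_de:
        return "purple"    # P close to UE but not to DE
    return "other"         # groups not spelled out in the text

# Toy ensemble of (d_p_ue, d_p_de, d_ue_de) triplets in arbitrary units
ensemble = [(5.0, 5.0, 5.0), (5.0, 0.5, 5.0), (0.5, 5.0, 5.0), (5.0, 0.4, 4.0)]
counts = Counter(classify(*d, cutoff=1.0) for d in ensemble)
proportions = {group: c / len(ensemble) for group, c in counts.items()}
```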
While an indirect upregulation of Pax6 gene by DRR as seen in β-TC3 cells32 cannot entirely be ruled out, the correlation of gene activity level with the spatial proximity of Pax6 gene to DRR is clearly demonstrated, at least, across the five cell lines that we studied using HLM. The mechanism of indirect upregulation and the mechanism of cell type-dependent choices deserve further study.
Chromosome in different phases of cell cycle
Most Hi-C data are obtained over a population of ‘unphased’ cells. Here, we employ HLM to model the global architecture of chromosome at different phases of cell cycle during the interphase, based on single-cell Hi-C4. Accumulating the data from tens to hundreds of binary contact matrices of single cells into an input contact matrix, we built a 500 kb-resolution model of the chromosome for the post-M, early-S, mid-S, late-S/G2, and pre-M phases of chr19 in mESC (above the diagonal in Fig. 5A). The contact matrices computed using HLM (below the diagonal in Fig. 5A) display reasonable correlation with the original Hi-C data (Pearson correlation, PC > 0.9) except for the post-M phase (PC = 0.77); unlike other phases, the lower PC value at the post-M phase, whose matrix is characterized by a uniform and featureless pattern, is due to the smaller number of sampled cells (Nc).
The local compactness of the chromosome conformation was quantified in terms of the average volume occupied by a single monomer based on the Voronoi tessellation (Fig. 5B). After mitosis, the chromosome continues to expand until the late-S/G2 phase. The gyration radius also captures this trend (Fig. 5B), except that the model has the largest value of rg in the post-M phase. A partial condensation of the chain (decreases in both the volume per monomer and rg) is observed before entering the pre-M phase. This decondensation-condensation cycle is also captured with the asphericity of structures generated from HLM (Fig. 5C), which decreases dramatically from the post-M to G1 phases and then increases gradually after the G1 phase. The same conclusion can be drawn from the probability density of pairwise distance between monomers (see Fig. S6).
DISCUSSION
HLM is similar to previous polymer models of chromatin, which also convert information of spatial proximity into harmonic restraints between monomers25, 73, 74. In order to demonstrate that the choice of energy potential in HLM is optimal over other alternatives, we examined HLM and its three variants on a 10 Mb genomic region on chr5 of GM12878 cells (Fig. S7). Unlike the HLM which faithfully reproduced the domain edges of enriched contacts observed by Hi-C (highlighted by cyan boxes in Fig. S7A), which was regarded as a distinct feature of loop extrusion14, two alternative copolymer models, which retain uniform strength of loop interaction, could not properly reproduce the diagonal-block patterns of Hi-C data (Fig. S7B and C). In a homopolymer model, where χ−,−, χ−,+, and χ+,+ are all set to 1 (see Methods), the long-range checkerboard pattern was not reproduced (Fig. S7D). The Pearson correlation of contact probabilities contrasted between Hi-C and other models at different genomic separations shows that HLM outperforms others (Fig. S7E).
As shown for different chromosomes, cell types, and species with a flexible choice of model resolution, one of the greatest advantages of HLM is its versatile application. While all of the output conformations exhibit great variability (see discussions in SI, Fig. S8, and Fig. 4F), the population-sampled contact map faithfully reproduces the input Hi-C data. For a given Hi-C dataset, the two sets of model parameters {kij} and {χti, tj} can be determined in a few minutes using a personal computer without any manual intervention (Table I).
In summary, we demonstrated that HLM is a computationally efficient approach with which to investigate the genome function. The conformational ensemble generated by HLM shows that depending on the chromatin states, different types of chromatin domains have different compactness and shapes, and spatial phase separation between domains takes place in human genome. The inter-cell line comparison of human α-globin and SOX2 loci shows that while the sub-megabase gene domain becomes less compact upon gene activation, the most critical regulatory element comes closer to the gene, and that its expression is likely affected by many other elements. The activity of Pax6 gene in a complex genetic environment is mostly modulated by the proximity between Pax6 promoter and the downstream enhancer, while the distance to the upstream regulatory element shows non-monotonic variations with its activity for the cell types we studied. HLM was also used to visualize the cell cycle dynamics of chromosome organization based on single-cell Hi-C. Although HLM is not designed based on assumptions of molecular mechanisms of genome organization, the principle of transcription regulation can be inferred from the changes of chromatin conformations. With Hi-C data being accumulated, HLM would be of great use to provide complementary structural information, which is not easily accessible to current experiments.
METHODS
Description of HLM
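The full energy potential of HLM, UHLM(r), consists of two parts: a harmonic term that restrains the chain backbone and the loops, and a non-bonded term Unb(r) (Eq. 2).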
In what follows, we delineate the first and second terms of Eq. 2 (see SI for technical details).
First, decomposed into two parts, describes the harmonic constraints on a chain of N monomers27, where successive monomers along the backbone and non-successive monomers forming loops are both harmonically restrained. In the second line, is written in a compact form with and K representing the Kirchhoff matrix. K can be built from the interaction strength matrix that takes kij as its matrix element. The interaction strengths ought to be non-negative (kij ≥ 0) for all pairs of the i-th and j-th monomers. In HLM, if kij ≠ 0, then the i-th and j-th monomers have the potential to form a (chromatin) loop. After removing the translational degrees of freedom by setting on Eq. 3, we obtain the probability density of pairwise distance as27 where and σij is the covariance between the positions of the i-th and j-th monomers, which can be obtained from the inverse of the K-matrix as
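The chain of steps described here (interaction strengths, Kirchhoff matrix, covariance from the matrix inverse, Gaussian distance distribution) can be illustrated numerically. The sketch below assumes the convention U/kBT = (1/2) Σ kij |ri − rj|² with kij in units of kBT/a², removes the translational freedom by pinning monomer 0 at the origin, and defines σij as the variance of one Cartesian component of ri − rj; prefactors in the original equations may differ, and all function names are illustrative.

```python
import numpy as np

def kirchhoff_matrix(k):
    """Graph Laplacian: K[i, j] = -k[i, j] for i != j, K[i, i] = sum_j k[i, j]."""
    K = -np.asarray(k, dtype=float)
    np.fill_diagonal(K, 0.0)
    np.fill_diagonal(K, -K.sum(axis=1))
    return K

def pair_variances(K):
    """sigma[i, j]: variance of one Cartesian component of r_i - r_j."""
    n = K.shape[0]
    cov = np.zeros((n, n))
    cov[1:, 1:] = np.linalg.inv(K[1:, 1:])   # monomer 0 pinned at the origin
    diag = np.diag(cov)
    return diag[:, None] + diag[None, :] - 2.0 * cov

def distance_density(r, sigma):
    """Distribution of |r_i - r_j| for an isotropic 3D Gaussian."""
    gamma = 1.0 / (2.0 * sigma)
    return 4.0 * np.pi * r**2 * (gamma / np.pi) ** 1.5 * np.exp(-gamma * r**2)

# Five-monomer Rouse chain with unit springs, then one added loop between 0 and 4
n = 5
k = np.zeros((n, n))
for i in range(n - 1):
    k[i, i + 1] = k[i + 1, i] = 1.0
sigma_chain = pair_variances(kirchhoff_matrix(k))
k[0, 4] = k[4, 0] = 1.0
sigma_loop = pair_variances(kirchhoff_matrix(k))   # the loop pulls the ends together
```

In this toy system the added loop reduces the end-to-end variance from 4 to 0.8, which is the effective resistance of the corresponding resistor network and a convenient check of the matrix algebra.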
One can obtain the contact probability pij by integrating the pairwise distance distribution P(rij) (Eq. 4) up to a certain capture radius (rc)75, 76, $p_{ij} = \int_0^{r_c} P(r_{ij})\,dr_{ij}$, which gives where . Therefore, a one-to-one analytical mapping between pij and kij follows from the precise mappings between pij and σij from Eqs. 7 and 5, and between σij and kij from Eq. 6.
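For a Gaussian distance distribution the integral up to the capture radius has a closed form in terms of the error function, and the resulting relation can be inverted numerically. The sketch below assumes the parameterization P(r) = 4π r² (γ/π)^{3/2} exp(−γ r²), with γ as the width parameter; the symbol γ, the function names, and the bisection-based inversion are illustrative choices rather than the implementation used in this work.

```python
import math

def contact_probability(gamma, r_c=1.0):
    """Integral of 4*pi*r^2*(gamma/pi)^1.5*exp(-gamma*r^2) from 0 to r_c."""
    x = math.sqrt(gamma) * r_c
    return math.erf(x) - 2.0 * x * math.exp(-x * x) / math.sqrt(math.pi)

def gamma_from_contact_probability(p, r_c=1.0, tol=1e-12):
    """Invert contact_probability by bisection; valid for 0 < p < 1."""
    lo, hi = 0.0, 1.0
    while contact_probability(hi, r_c) < p:   # expand the bracket if needed
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if contact_probability(mid, r_c) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

gamma = gamma_from_contact_probability(0.3)   # width parameter for p_ij = 0.3
p_back = contact_probability(gamma)           # round trip back to p_ij
```

Because the contact probability increases monotonically with γ, the inversion is unique, which is what makes a one-to-one mapping between contact probabilities and the model parameters possible.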
Although it is tempting to directly use the mathematical relation between pij and kij to obtain from Hi-C data, there is an unavoidable numerical issue (see SI Text and Figs. S9–S11 for details). In practice, we calculate -matrix that approximates by selecting only the significant contacts in . More specifically, we evaluate the significance of contact probability pij by calculating zij, which is defined as (see the matrix elements in the upper diagonal part of Fig. 6B): where is the mean contact probability for monomer pairs separated by the arc length s along the contour. The greater the value of zij, the more significant the contact is deemed to be. We then select the top 2N (i, j) pairs ranked in terms of the values of zij (> 1) (the matrix elements in the lower diagonal part of Fig. 6B). For these 2N pairs whose contact probability pij is given in , the precise value of (or equivalently can be determined using Eq. 7. Then, starting from a Rouse chain configuration as an initial input, we add non-successive bonds with varying interaction strengths (0 ≤ kij ≤ 10 kBT/a2) until we minimize the objective function so as to determine the optimal values of . Here the weight factor ωij, which is used to normalize the statistical bias from chromatin loops of different sizes, is defined as where is the number of loops of size s. A gradient-based, bound-constrained optimizer (the L-BFGS-B method in the SciPy package) was used to determine the optimal parameters . A fully convergent solution of -matrix (Fig. 6C) could be obtained within a few minutes when N was not too large (≤ 200). This -matrix determining process, termed constrained optimization, faithfully reproduces the original matrix with a relative error smaller than 5% (see also Figs. S10-S12).
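The bound-constrained fit of loop strengths can be illustrated on a toy system with SciPy's `minimize` and `method="L-BFGS-B"`. The sketch below starts from a Rouse chain, adds two candidate loops, and recovers their strengths from target pair variances. It makes several simplifying assumptions: the convention U/kBT = (1/2) Σ kij |ri − rj|², an unweighted squared-error objective (the weights ωij of the text are omitted), finite-difference gradients, and hypothetical function names.

```python
import numpy as np
from scipy.optimize import minimize

def pair_variance(n, bonds, i, j):
    """Variance of one Cartesian component of r_i - r_j in a Gaussian network.

    `bonds` maps a monomer pair (a, b) to its spring constant.
    """
    K = np.zeros((n, n))
    for (a, b), k in bonds.items():
        K[a, a] += k
        K[b, b] += k
        K[a, b] -= k
        K[b, a] -= k
    cov = np.zeros((n, n))
    cov[1:, 1:] = np.linalg.inv(K[1:, 1:])   # monomer 0 pinned at the origin
    return cov[i, i] + cov[j, j] - 2.0 * cov[i, j]

n = 5
backbone = {(i, i + 1): 1.0 for i in range(n - 1)}   # Rouse chain as the start
loops = [(0, 4), (1, 3)]                             # candidate loop pairs
k_true = np.array([1.0, 0.5])                        # strengths to be recovered

def model_variances(k_loops):
    bonds = dict(backbone)
    bonds.update({pair: k for pair, k in zip(loops, k_loops)})
    return np.array([pair_variance(n, bonds, i, j) for i, j in loops])

target = model_variances(k_true)   # stands in for values derived from Hi-C

def objective(k_loops):
    return np.sum((model_variances(k_loops) - target) ** 2)

result = minimize(objective, x0=np.zeros(len(loops)), method="L-BFGS-B",
                  bounds=[(0.0, 10.0)] * len(loops),
                  options={"ftol": 1e-12, "gtol": 1e-8})
k_fit = result.x
```

The bounds passed to the optimizer enforce the same range of interaction strengths, 0 to 10 in units of kBT/a², as stated above.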
After obtaining (Fig. 6C), and hence , we added a non-bonded interaction term Unb(r), defined for all i and j pairs, to the full energy potential UHLM(r) (Eq. 2): where uLJ(r) is the Lennard-Jones potential truncated for r ≥ rc where rc = 5a/2 with ε = 0.45 kBT,
If ε = εθ (= 0.34 kBT) with χti, tj = 1, then Unb(r) leads to the θ-solvent condition for an infinitely long chain, putting the second virial coefficient to zero, i.e., . We chose ε (= 0.45 kBT) slightly greater than εθ and assigned a loci-pair-type-dependent prefactor χti, tj. Each monomer i is assigned a type t, either “−” or “+”, based on the sign of the first principal component of (see the track on top of Fig. 6B). The value of the prefactor χti, tj (> 0), which depends on the types of the two loci i and j (titj = ++, −−, or −+), is evaluated by averaging zij over all the monomer pairs of the corresponding types, such that χp, q = 〈zij〉ti=p, tj=q. The values of χti, tj are thus determined from the given Hi-C data. For the case shown in Fig. 6, we obtain χ−,− = 1.18, χ−,+ = 0.79, and χ+,+ = 1.19. According to the Flory-Huggins theory57, the condition leads to microphase separation between + and − type loci, which indeed is realized and reflected in the characteristic checkerboard pattern of Hi-C data. It should be noted, however, that the classification of type −/+ is not necessarily identical to the A/B compartment of chromatin. Whereas A/B compartments are genome-wide characteristics usually defined based on Hi-C data at low (Mb) resolutions2, 3, the monomers in HLM can always be classified into types −/+ regardless of the resolution of the model.
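The type assignment and the type-dependent prefactors can be sketched as follows. The matrix whose first principal component sets the monomer types is not recoverable from the text, so the sketch assumes the Pearson correlation matrix of the rows of the significance matrix z; it also assumes an unshifted Lennard-Jones form that is simply set to zero beyond the cutoff, and that each pair contributes χ times uLJ to the non-bonded energy. All function names and the toy z matrix are illustrative.

```python
import numpy as np

EPSILON = 0.45   # in kBT, value quoted in the text
R_CUT = 2.5      # cutoff 5a/2, in units of the monomer size a

def assign_types(z):
    """Label monomers '+' or '-' by the sign of the leading eigenvector."""
    corr = np.corrcoef(z)                    # assumption: PCA on row correlations
    _eigvals, eigvecs = np.linalg.eigh(corr)
    pc1 = eigvecs[:, -1]                     # eigh sorts eigenvalues ascending
    return np.where(pc1 >= 0, "+", "-")

def chi_key(a, b):
    """Dictionary key for a pair of types: '--', '-+' or '++'."""
    return a + b if a == b else "-+"

def chi_prefactors(z, types):
    """Average z[i, j] over pairs i < j of each type combination."""
    i, j = np.triu_indices(len(types), k=1)
    chi = {}
    for a, b in [("-", "-"), ("-", "+"), ("+", "+")]:
        mask = ((types[i] == a) & (types[j] == b)) | ((types[i] == b) & (types[j] == a))
        chi[chi_key(a, b)] = z[i, j][mask].mean()
    return chi

def u_lj(r, epsilon=EPSILON, a=1.0, r_cut=R_CUT):
    """Lennard-Jones potential set to zero for r >= r_cut (assumed unshifted)."""
    if r >= r_cut:
        return 0.0
    x6 = (a / r) ** 6
    return 4.0 * epsilon * (x6 * x6 - x6)

def pair_energy(r, type_i, type_j, chi):
    """Non-bonded energy of one pair: chi prefactor times u_LJ (assumed form)."""
    return chi[chi_key(type_i, type_j)] * u_lj(r)

# Toy significance matrix: enriched within a type (1.2), depleted across (0.8)
t = np.array([1, 1, -1, -1, 1, 1, -1, -1])
z = 1.0 + 0.2 * np.outer(t, t)
types = assign_types(z)
chi = chi_prefactors(z, types)
energy = pair_energy(2.0 ** (1.0 / 6.0), types[0], types[2], chi)
```

The overall sign of a principal component is arbitrary, so which group is called "+" carries no meaning by itself; only the partition into two groups does.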
Finally, we sampled 3D chromosome structures using molecular dynamics simulation implementing the full energy potential UHLM(r) and calculated the contact probability matrix based on HLM-generated conformational ensemble. In the specific example demonstrated for the Hi-C data of 10 Mb-genomic region of chr5 in GM12878 cell line (Fig. 6), (Fig. 6E) obtained from HLM-generated chromosome conformations (Fig. 6D, see also the clustering analysis which highlights the conformational variability of chromosomes in SI text and Fig. S8) displays a notable resemblance to the input (Fig. 6A) (Pearson correlation of 0.96; Spearman correlation of 0.92). Despite the simplicity of HLM potential (Eq. 2), the similarity between and , as well as the chromosome conformations ensemble generated during the procedure is remarkable.
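Once an ensemble of conformations has been sampled, the model contact probability matrix is the fraction of conformations in which each pair of monomers lies within the capture radius. A minimal sketch of that step follows; the array layout (conformations × monomers × 3), the function name, and the toy ensemble are assumptions for illustration.

```python
import numpy as np

def contact_matrix(ensemble, r_c):
    """Fraction of conformations with |r_i - r_j| < r_c for every pair (i, j).

    ensemble: array of shape (M, N, 3) holding M conformations of N monomers.
    """
    ensemble = np.asarray(ensemble, dtype=float)
    diff = ensemble[:, :, None, :] - ensemble[:, None, :, :]   # (M, N, N, 3)
    dist = np.linalg.norm(diff, axis=-1)                        # (M, N, N)
    return (dist < r_c).mean(axis=0)

# Toy ensemble: a straight and a bent three-monomer chain
ensemble = [
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]],
]
p_model = contact_matrix(ensemble, r_c=1.5)
```

The intermediate array grows as M × N², so for long chains or large ensembles the conformations would need to be processed in batches.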
Structure characterization
We quantified the structural feature of HLM-generated chromosome ensemble, by means of several quantities:
The compactness of a (sub-)chain of length N is quantified in terms of , where rg is the gyration radius of the (sub-)chain.
The asphericity (A) is calculated by where λi(i = 1, 2, 3) are the three eigenvalues of the moment of inertia tensor, and is their mean77, 78. A = 0 for a sphere, and A > 0 for a non-spherical shape.
The roughness of the surface of a (sub-)chain was evaluated using the Voronoi diagram79 that tessellates the 3D space occupied by the chain. An upper bound for the volume of each monomer was set using a dodecahedron with a diameter of 2a. The Voronoi diagram provides a well-defined volume V and surface area S of the (sub-)chain. Since the surface area of a perfect sphere with the volume V is $S_0 = (36\pi V^2)^{1/3}$, we quantified the surface roughness using S/S0 ≥ 1.
To visualize an ensemble of structures with considerable variability, we first divided the chain into a few segments (domains). Next, the distribution of the distances between the geometric centers of these domains was computed based on the ensemble of structures. Several configurations of chromosomes were then randomly selected from the most populated state (in terms of interdomain distances), aligned, and rendered.
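The shape descriptors listed above can be sketched for a single conformation as follows. The asphericity normalization used here, (3/2) Σ(λi − λ̄)² / (Σ λi)² with λi the eigenvalues of the gyration tensor, is one common convention that gives 0 for a sphere and 1 for a rod; it is an assumption and may differ from the expression used in this work. The Voronoi construction of V and S is not reproduced; the roughness helper simply takes V and S as inputs.

```python
import numpy as np

def gyration_radius(coords):
    """Root-mean-square distance of the monomers from their geometric center."""
    coords = np.asarray(coords, dtype=float)
    centered = coords - coords.mean(axis=0)
    return np.sqrt((centered ** 2).sum(axis=1).mean())

def asphericity(coords):
    """0 for a spherically symmetric arrangement, 1 for a straight rod."""
    coords = np.asarray(coords, dtype=float)
    centered = coords - coords.mean(axis=0)
    tensor = centered.T @ centered / len(coords)   # 3 x 3 gyration tensor
    lam = np.linalg.eigvalsh(tensor)
    return 1.5 * ((lam - lam.mean()) ** 2).sum() / lam.sum() ** 2

def surface_roughness(surface_area, volume):
    """S / S0, where S0 is the area of a sphere with the same volume."""
    s0 = (36.0 * np.pi * volume ** 2) ** (1.0 / 3.0)
    return surface_area / s0

rod = [[float(x), 0.0, 0.0] for x in range(5)]
octahedron = [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]]
a_rod = asphericity(rod)
a_oct = asphericity(octahedron)
rg_oct = gyration_radius(octahedron)
rough_sphere = surface_roughness(4.0 * np.pi, 4.0 * np.pi / 3.0)   # unit sphere
```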
ACKNOWLEDGMENTS
We thank the Center for Advanced Computation in KIAS for providing computing resources. CH acknowledges a partial support from the National Research Foundation of Korea (NRF-2018R1A2B3001690).
Footnotes
a) hyeoncb@kias.re.kr