Abstract
Chromatin interactions play important roles in enhancer-promoter interactions (EPI) and in regulating the transcription of genes. CTCF and cohesin proteins are located at the anchors of chromatin interactions and form their loop structures. CTCF has an insulator function that restricts the activity of enhancers to within the loops. The DNA binding sequences of CTCF show an orientation bias at chromatin interaction anchors: forward-reverse (FR) orientation is frequently observed. DNA binding sequences of CTCF were found in open chromatin regions at about 40% - 80% of chromatin interaction anchors in Hi-C and in situ Hi-C experimental data. Although about seventy thousand chromatin interactions were identified by Hi-C at 50kb resolution, about twenty million chromatin interactions were recently identified by HiChIP at 5kb resolution. It has been reported that long-range chromatin interactions tend to include less CTCF at their anchors. It is still unclear which proteins are associated with chromatin interactions.
To find DNA binding motif sequences of transcription factors (TF), such as CTCF, and repeat DNA sequences that affect the interaction between enhancers and promoters of genes and their expression, I first predicted TF bound in enhancers and promoters using DNA motif sequences of TF and experimental data of open chromatin regions in monocytes and other cell types, which were obtained from public and commercial databases. Second, transcriptional target genes of each TF were predicted based on enhancer-promoter association (EPA). EPA was shortened at the genomic locations of FR or reverse-forward (RF) orientation of the DNA motif sequence of a TF, which were assumed to be located at chromatin interaction anchors and to act as insulator sites like CTCF. Then, the expression levels of the transcriptional target genes predicted based on the EPA were compared with those predicted from promoters only.
In total, 369 DNA motifs with biased orientation (232 FR and 178 RF; the reverse complement sequences of some DNA motifs were also registered in the databases, so the total was smaller than the sum of FR and RF) significantly affected the expression level of putative transcriptional target genes in CD14+ monocytes of four people in common. The same analysis was conducted in CD4+ T cells of four people. DNA motif sequences of CTCF, cohesin and other transcription factors involved in chromatin interactions were found to have a biased orientation. Transposon sequences, which are known to be involved in insulators and enhancers, also showed a biased orientation. DNA motif sequences with biased orientation tended to be co-localized in the same open chromatin regions. Moreover, for 36 – 95% of the FR and RF orientations of DNA motif sequences, EPI predicted from EPA shortened at the genomic locations of the biased orientation of the DNA motif sequence overlapped with chromatin interaction data (Hi-C and HiChIP) significantly more than EPI predicted from the other types of EPAs.
Background
Chromatin interactions play important roles in enhancer-promoter interactions (EPI) and in regulating the transcription of genes. CTCF and cohesin proteins are located at the anchors of chromatin interactions and form their loop structures. CTCF has an insulator function that restricts the activity of enhancers to within the loops (Fig. 1A). The DNA binding sequences of CTCF show an orientation bias at chromatin interaction anchors: forward-reverse (FR) orientation is frequently observed (de Wit et al. 2015; Guo et al. 2015). About 40% - 80% of chromatin interaction anchors in Hi-C and in situ Hi-C experiments include DNA binding motif sequences of CTCF. Although about seventy thousand chromatin interactions were identified by Hi-C at 50kb resolution (Javierre et al. 2016), about twenty million chromatin interactions were recently identified by HiChIP at 5kb resolution (Mumbach et al. 2017). However, it has been reported that long-range chromatin interactions tend to include less CTCF at their anchors (Jeong et al. 2017). Other DNA binding proteins such as ZNF143, YY1, and SMARCA4 (BRG1) have been found to be associated with chromatin interactions and EPI (Bailey et al. 2015; Barutcu et al. 2016; Weintraub et al. 2017). CTCF, cohesin, ZNF143, YY1 and SMARCA4 also have biological functions other than chromatin interactions and EPI. The DNA binding motif sequences of these transcription factors (TF) are found in open chromatin regions near transcriptional start sites (TSS) as well as at chromatin interaction anchors.
DNA binding motif sequence of ZNF143 was enriched at both chromatin interaction anchors. ZNF143’s correlation with the CTCF-cohesin cluster relies on its weakest binding sites, found primarily at distal regulatory elements defined by the ‘CTCF-rich’ chromatin state. The strongest ZNF143-binding sites map to promoters bound by RNA polymerase II (POL2) and other promoter-associated factors, such as the TATA-binding protein (TBP) and the TBP-associated protein, together forming a ‘promoter’ cluster (Bailey et al. 2015).
The DNA binding motif sequence of YY1 does not seem to be enriched at both chromatin interaction anchors (Z-score < 2), whereas the DNA binding motif sequence of ZNF143 is significantly enriched (Z-score > 7; Bailey et al. 2015 Figure 2a). In the analysis of YY1, to identify a protein factor that might contribute to EPI, Ji et al. (2015) performed chromatin immunoprecipitation with mass spectrometry (ChIP-MS), using antibodies directed toward histones with modifications characteristic of enhancer and promoter chromatin (H3K27ac and H3K4me3, respectively). Of 26 transcription factors that occupy both enhancers and promoters, four are essential based on a CRISPR cell-essentiality screen and two (CTCF, YY1) are expressed in >90% of tissues examined (Weintraub et al. 2017). These analyses started from histone modifications that mark enhancers and promoters rather than from chromatin interactions. Other protein factors associated with chromatin interactions may be found in other studies.
As computational approaches, machine-learning analyses to predict chromatin interactions have been proposed (Schreiber et al. 2017; Zhang et al. 2017). However, they were not intended to find DNA motif sequences of TF affecting chromatin interactions, EPI, and the expression level of transcriptional target genes, which were examined in this study.
DNA binding proteins involved in chromatin interactions are expected to affect the transcription of genes in the loops formed by the chromatin interactions. In my previous analysis, the expression level of human putative transcriptional target genes differed according to the criteria of enhancer-promoter association (EPA) (Fig. 1B; Osato 2018). EPI were predicted based on EPA shortened at the genomic locations of FR orientation of CTCF binding sites, and transcriptional target genes of each TF bound in enhancers and promoters were predicted based on the EPI. Among three types of EPAs, this EPA affected the expression levels of putative transcriptional target genes the most, compared with the expression levels of transcriptional target genes predicted from promoters only (Fig. 2). The expression levels tended to be increased in monocytes and CD4+ T cells, implying that enhancers activated the transcription of genes, and decreased in ES and iPS cells, implying that enhancers repressed the transcription of genes. These analyses suggested that enhancers affected the transcription of genes significantly when EPI were predicted properly. Like CTCF, other DNA binding proteins involved in chromatin interactions may be located at chromatin interaction anchors with a pair of DNA binding motif sequences in a biased orientation, affecting the expression level of putative transcriptional target genes in the loops formed by the chromatin interactions. As an experimental issue in the analysis of chromatin interactions, chromatin interaction data vary with the experimental technique, the depth of DNA sequencing, and even between replicates of the same cell type. Chromatin interaction data may not be saturated enough to cover all chromatin interactions.
Based on these characteristics of DNA binding proteins associated with chromatin interactions, and to avoid the experimental issues of chromatin interaction analyses, I searched here for DNA motif sequences of TF and repeat DNA sequences that affect EPI and the expression level of putative transcriptional target genes in CD14+ monocytes and CD4+ T cells of four people and in other cell types, without using chromatin interaction data. Then, the putative EPI were compared with chromatin interaction data.
Results
Search for biased orientation of DNA motif sequences
Transcription factor binding sites (TFBS) were predicted using open chromatin regions and DNA motif sequences of transcription factors (TF) collected from various databases and journal papers (see Methods). Transcriptional target genes were predicted using TFBS in promoters and enhancer-promoter association (EPA) shortened at the genomic locations of the DNA binding motif sequence of a TF acting as an insulator, such as CTCF and cohesin (RAD21 and SMC3) (Fig. 1B). To find DNA motif sequences of TF acting as insulators, other than CTCF and cohesin, and repeat DNA sequences affecting the expression level of genes, EPI were predicted based on EPA shortened at the genomic locations of the DNA motif sequence of a TF or a repeat DNA sequence, and transcriptional target genes of each TF bound in enhancers and promoters were predicted based on the EPI. The expression levels of the putative transcriptional target genes were compared with those predicted from promoters using a two-sided Mann-Whitney U test (p-value < 0.05) (Fig. 2A). The number of TF showing a significant difference in the expression level of their putative transcriptional target genes was counted. To examine whether the orientation of a DNA motif sequence, which is assumed to act as an insulator site and shorten EPA, affected the number of TF showing a significant difference in the expression level of their putative transcriptional target genes, the number of such TF was compared among forward-reverse (FR), reverse-forward (RF), and any orientation (i.e. without considering orientation) of the DNA motif sequence shortening EPA, using a chi-square test (p-value < 0.05). To avoid missing DNA motif sequences with relatively weak statistical significance because of multiple testing correction, the above analyses were conducted independently using monocytes of four people, and DNA motif sequences found in monocytes of all four people in common were selected.
In total, 369 DNA binding motif sequences of TF with biased orientation (232 FR and 178 RF) were found in monocytes of four people in common, whereas only seven DNA binding motif sequences with any orientation were found in monocytes of four people in common (Fig. 2B; Table 1; Supplemental Table S1). The DNA motif sequences with FR orientation included CTCF, cohesin (RAD21 and SMC3), ZNF143 and YY1, which are associated with chromatin interactions and EPI. SMARCA4 (BRG1) is associated with topologically associating domains (TAD), a higher-order chromatin organization. The DNA binding motif sequence of SMARCA4 was not registered in the databases used in this study. However, the DNA motif sequences with FR orientation included SMARCC2, a member of the same SWI/SNF family of proteins as SMARCA4.
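As an illustration only, the two selection steps described above (a two-sided Mann-Whitney U test per TF, followed by keeping the DNA motif sequences that are significant in all four donors) can be sketched in Python as follows. This is a minimal sketch and not the code used in this study: the function names, the expression values, and the motif labels are hypothetical, and the p-value uses the normal approximation with tie correction rather than an exact test.

```python
import math
from statistics import NormalDist


def mann_whitney_u_two_sided(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    return u, 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def motifs_in_all_donors(significant_by_donor):
    """Keep only the motifs that were significant in every donor."""
    return set.intersection(*(set(m) for m in significant_by_donor.values()))


# Hypothetical expression levels of target genes of one TF.
promoter_only = [1.2, 0.8, 1.5, 0.9, 1.1, 1.0]
epa_based = [2.4, 1.9, 2.8, 2.2, 1.7, 2.6]
u_stat, p_value = mann_whitney_u_two_sided(epa_based, promoter_only)

# Hypothetical per-donor lists of significant motif orientations.
significant_by_donor = {
    "donor1": {"CTCF_FR", "RAD21_RF", "YY1_FR", "ZNF143_FR"},
    "donor2": {"CTCF_FR", "RAD21_RF", "YY1_FR"},
    "donor3": {"CTCF_FR", "RAD21_RF", "ZNF143_FR"},
    "donor4": {"CTCF_FR", "RAD21_RF", "YY1_FR"},
}
common_motifs = motifs_in_all_donors(significant_by_donor)
```

In this toy input only the two motif orientations shared by all four donors remain, which mirrors the "in common" criterion used for the counts reported below.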
The same analysis was conducted using DNase-seq data of CD4+ T cells of four people. Total 376 of biased (203 FR and 213 RF) orientation of DNA binding motif sequences of TF were found in T cells of four people in common, whereas only seven any orientation (i.e. without considering orientation) of DNA binding motif sequences were found in T cells of four people in common (Supplemental Fig. S1 and Supplemental Table S2). Biased orientation of DNA motif sequences in T cells included CTCF, cohesin (RAD21 and SMC3), and SMARC. Among 369, 73 of biased orientation of DNA binding motif sequences of TF were found in both monocytes and T cells in common (Supplemental Table S5). For each orientation, 46 FR and 34 RF orientation of DNA binding motif sequences of TF were found in both monocytes and T cells in common. Without considering the difference of orientation of DNA binding motif sequences, 113 of biased orientation of DNA binding motif sequences of TF were found in both monocytes and T cells. As a reason for the increase of the number (113) from 73, a TF or an isoform of the same TF may bind to a different DNA binding motif sequence according to cell types and/or in the same cell type. About 50% or more of alternative splicing isoforms are differently expressed among tissues, indicating that most alternative splicing is subject to tissue-specific regulation (Wang et al. 2008) (Chen and Manley 2009) (Das et al. 2007). The same TF has several DNA binding motif sequences and in some cases one of the motif sequences is almost the same as the reverse complement sequence of another motif sequence of the same TF. For example, RAD21 had both FR and RF orientation of DNA motif sequences, but the number of the FR orientation of DNA motif sequence was relatively small in the genome, and the RF orientation of DNA motif sequence was frequently observed and co-localized with CTCF. 
I previously found that a complex of TF would bind to a slightly different DNA binding motif sequence from the combination of DNA binding motif sequences of TF composing the complex in C. elegans (Tabuchi et al. 2011). From another viewpoint of this study, the expression level of putative transcription target genes of some TF would be different, depending on the genomic locations (enhancers or promoters) of DNA binding motif sequences of the TF in monocytes and T cells of four people.
Moreover, the same analyses were performed in monocytes and T cells using open chromatin regions overlapped with H3K27ac histone modification marks, which are known as enhancer and promoter marks. In this study, H3K27ac histone modification marks were used in the analysis of EPI but not in the analysis of TF acting as insulators like CTCF and cohesin, since new biased orientation of DNA motif sequences were found under this criterion. When H3K27ac histone modification marks were also used in the analysis of TF acting as insulators like CTCF and cohesin, the number of biased orientation of DNA motif sequences decreased. In total, 233 DNA binding motif sequences of TF with biased orientation (179 FR and 70 RF) were found in monocytes of four people in common, whereas only two DNA binding motif sequences with any orientation were found (Supplemental Table S3). Though the number of biased orientation of DNA motif sequences was reduced, CTCF, RAD21, SMC3, ZNF143, and YY1 were found. For T cells using H3K27ac histone modification marks, 291 DNA binding motif sequences of TF with biased orientation (173 FR and 143 RF) were found in T cells of four people in common, whereas only 10 DNA binding motif sequences with any orientation were found (Supplemental Table S4). Though the number of biased orientation of DNA motif sequences was reduced, CTCF, RAD21, SMC3, and YY1 were found. The scores of CTCF, RAD21, and SMC3 increased compared with the result in T cells without H3K27ac histone modification marks, and they ranked in the top four. The biased orientation of DNA motif sequences included JUNDM2 (JDP2), which is involved in histone-chaperone activity, promotion of nucleosome assembly, and inhibition of histone acetylation (Jin et al. 2006). JDP2 forms a homodimer or a heterodimer with various TF (https://en.wikipedia.org/wiki/Jun_dimerization_protein).
In summary, combining the results with and without H3K27ac histone modification marks, 433 DNA motif sequences with biased orientation (306 FR and 178 RF) were found in monocytes of four people in common, and 499 (285 FR and 278 RF) were found in T cells of four people in common. The total for monocytes and T cells was 773 DNA motif sequences with biased orientation (513 FR and 413 RF). The biased orientation of DNA motif sequences found in both monocytes and T cells are listed in Supplemental Table S5.
To examine whether the biased orientation of DNA binding motif sequences of TF was observed in other cell types, the same analyses were conducted in other cell types. However, for these cell types, experimental data of only one sample were available in the ENCODE database, so the analyses of DNA motif sequences were performed by comparison with the result in monocytes of four people. Among the biased orientation of DNA binding motif sequences found in monocytes, 61, 135, 95, and 108 DNA binding motif sequences were also observed in H1-hESC, iPS, HUVEC and MCF-7 cells, respectively, including CTCF and cohesin (RAD21 and SMC3) (Table 2; Supplemental Table S6). The scores of DNA binding motif sequences were the highest in monocytes, and the other cell types showed lower scores. The results of the analysis of DNA motif sequences in CD20+ B cells and macrophages did not include CTCF and cohesin, because these analyses can be applied only to cells in which the expression level of putative transcriptional target genes of each TF shows a significant difference between promoters and EPA shortened at the genomic locations of a DNA motif sequence acting as an insulator site. Some experimental data of a cell did not show a significant difference between promoters and the EPA (Osato 2018).
Repeat DNA sequences were also examined in place of DNA binding motif sequences of TF. The expression levels of transcriptional target genes of each TF predicted based on EPA shortened at the genomic locations of a repeat DNA sequence were compared with those predicted from promoters. Three repeat DNA sequences with RF orientation showed a significant difference in the expression level of putative transcriptional target genes in monocytes of four people in common (Table 3). Among them, the LTR16C repeat DNA sequence was observed in iPS and H1-hESC with sufficient statistical significance considering multiple tests (p-value < 10⁻⁷). As in CD14+ monocytes, biased orientation of repeat DNA sequences was examined in CD4+ T cells. Three FR and two RF orientations of repeat DNA sequences showed a significant difference in the expression level of putative transcriptional target genes in T cells of four people in common (Supplemental Table S7). MIRb and MIR3 were also found in the analysis using open chromatin regions overlapped with H3K27ac histone modification marks, which are enhancer and promoter marks. MIR and other transposon sequences are known to act as insulators and enhancers (Bejerano et al. 2006; Rebollo et al. 2012; de Souza et al. 2013; Jjingo et al. 2014; Wang et al. 2015).
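The relation between the totals and the FR and RF counts reported above can be checked with simple arithmetic. The sketch below assumes that each total is the number of distinct DNA motif sequences found in the FR list, the RF list, or both, so that FR + RF - total gives the number of motifs counted in both orientations. This interpretation is an assumption made for illustration, and the variable names are hypothetical.

```python
# (FR, RF, total) as reported in the text for each analysis.
reported_counts = {
    "monocytes": (232, 178, 369),
    "T cells": (203, 213, 376),
    "monocytes, H3K27ac": (179, 70, 233),
    "T cells, H3K27ac": (173, 143, 291),
    "monocytes, combined": (306, 178, 433),
    "T cells, combined": (285, 278, 499),
}


def motifs_in_both_orientations(fr, rf, total):
    """Inclusion-exclusion: motifs listed under both FR and RF."""
    return fr + rf - total


in_both = {
    name: motifs_in_both_orientations(*counts)
    for name, counts in reported_counts.items()
}
for name, n in in_both.items():
    print(f"{name}: {n} motifs counted in both FR and RF")
```

Under this assumption, for example, 41 of the 369 DNA motif sequences in monocytes would be counted under both FR and RF orientation.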
Co-location of biased orientation of DNA motif sequences
To examine the association among the 369 biased (FR and RF) orientation of DNA binding motif sequences of TF, co-location of the DNA binding motif sequences in open chromatin regions was analyzed in monocytes. The number of open chromatin regions containing the same pair of DNA binding motif sequences was counted, and pairs that were enriched with statistical significance (chi-square test, p-value < 1.0 × 10⁻¹⁰) were listed (Table 4; Supplemental Table S8). Open chromatin regions overlapped with the histone modification of enhancer and promoter marks (H3K27ac) (26,095 regions in total) showed a larger number of enriched pairs of DNA motifs than all open chromatin regions (Table 4; Supplemental Table S8). H3K27ac is known to be enriched at chromatin interaction anchors (Phanstiel et al. 2017). As already known, CTCF was found with cohesin such as RAD21 and SMC3 (Table 4). The top 30 pairs of FR and RF orientations of DNA motifs co-occupying the same open chromatin regions are shown (Table 4). The total number of pairs of DNA motifs was 428, consisting of 120 unique DNA motifs, when the pair of DNA motifs was observed in more than 80% of the open chromatin regions with the DNA motifs. Biased orientation of DNA binding motif sequences of TF tended to be co-localized in the same open chromatin regions.
To examine the association among the 376 biased orientation of DNA binding motif sequences of TF in CD4+ T cells, co-location of the DNA binding motif sequences in open chromatin regions was analyzed. The top 30 pairs of FR and RF orientations of DNA motifs co-occupying the same open chromatin regions are shown (Supplemental Table S9). The total number of pairs of DNA motifs was 99, consisting of 72 unique DNA motifs, when the pair of DNA motifs was observed in more than 80% of the open chromatin regions with the DNA motifs (chi-square test, p-value < 1.0 × 10⁻¹⁰). Among them, 11 pairs of DNA motif sequences in T cells, including a pair of CTCF, SMC3, and RAD21, were also found in monocytes (Supplemental Table S9).
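The counting step of the co-location analysis can be illustrated with a short Python sketch. It counts, for each pair of DNA motifs, the open chromatin regions that contain both, and keeps the pairs whose co-occurrence exceeds 80% of the regions containing the motifs. The choice of denominator (the more frequent of the two motifs) is an assumption made here, the chi-square enrichment test is omitted, and the function name and input regions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def colocated_pairs(region_motifs, min_fraction=0.8):
    """Return motif pairs found together in more than min_fraction of regions.

    region_motifs maps a region identifier to the motifs found in it.
    The fraction is taken relative to the more frequent motif of the pair
    (an assumption of this sketch).
    """
    motif_counts = Counter()
    pair_counts = Counter()
    for motifs in region_motifs.values():
        unique = sorted(set(motifs))
        motif_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    selected = {}
    for (a, b), n_pair in pair_counts.items():
        fraction = n_pair / max(motif_counts[a], motif_counts[b])
        if fraction > min_fraction:
            selected[(a, b)] = fraction
    return selected


# Hypothetical open chromatin regions and the motifs they contain.
region_motifs = {
    "region1": ["CTCF", "RAD21"],
    "region2": ["CTCF", "RAD21"],
    "region3": ["CTCF", "RAD21", "YY1"],
    "region4": ["CTCF", "RAD21"],
    "region5": ["CTCF", "RAD21"],
    "region6": ["YY1"],
}
pairs = colocated_pairs(region_motifs)
```

With this toy input only the CTCF and RAD21 pair passes the threshold, because the two motifs occur together in every region that contains either of them.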
Comparison with chromatin interaction data
To examine whether the biased orientation of DNA motif sequences is associated with chromatin interactions, enhancer-promoter interactions (EPI) predicted based on enhancer-promoter associations (EPA) were compared with chromatin interaction data (Hi-C). Due to the resolution of Hi-C experimental data used in this study (50kb), EPI were adjusted to 50kb resolution. EPI were predicted based on three types of EPA: (i) EPA shortened at the genomic locations of FR or RF orientation of DNA motif sequence of a TF, (ii) EPA shortened at the genomic locations of DNA motif sequence of a TF acting as insulator sites such as CTCF and cohesin (RAD21, SMC3) without considering their orientation, and (iii) EPA without being shortened by the genomic locations of a DNA motif sequence. EPA (i) showed a significantly higher ratio of EPI overlapped with chromatin interactions (Hi-C) using DNA binding motif sequences of CTCF and cohesin (RAD21 and SMC3) than the other two types of EPAs (n = 4, binomial test, two-sided) (Supplemental Fig. S2). Total 58 biased orientation (38 FR and 22 RF) of DNA motif sequences including CTCF, cohesin, and YY1 showed a significantly higher ratio of EPI overlapped with Hi-C chromatin interactions (a cutoff score of CHiCAGO tool > 1) than the other types of EPAs in monocytes (Supplemental Table S10). When comparing EPI predicted based on only EPA (i) and (iii) with chromatin interactions, total 215 biased orientation (130 FR and 102 RF) of DNA motif sequences showed a significantly higher ratio of EPI predicted based on EPA (i) overlapped with the chromatin interactions than EPI predicted based on EPA (iii) (Supplemental material 2). The difference between EPI predicted based on EPA (i) and (ii) seemed to be difficult to distinguish using the chromatin interaction data and the statistical test in some cases. 
However, as for the difference between EPI predicted based on EPA (i) and (iii), a larger number of biased orientation of DNA motif sequences was found to be correlated with chromatin interaction data. Chromatin interaction data were obtained from different samples from DNase-seq, open chromatin regions, so individual differences seemed to be large from the results of this analysis. Since, for some DNA motif sequences of transcription factors, the number of EPI overlapped with chromatin interactions was small, if higher resolution of chromatin interaction data (such as HiChIP, in situ DNase Hi-C, and in situ Hi-C data, or a tool to improve the resolution such as HiCPlus) is available, the number of EPI overlapped with chromatin interactions would be increased and the difference of the numbers among three types of EPA would be larger and more significant (Rao et al. 2014; Ramani et al. 2016; Mumbach et al. 2017; Zhang et al. 2018).
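To make the three types of EPA concrete, the following Python sketch shortens a candidate EPA window around a TSS at the nearest insulator-like motif sites. It is a simplified illustration under stated assumptions, not the procedure in Methods: FR is taken to mean a forward (+) site on the left of the TSS and a reverse (-) site on the right, RF the opposite, and the function name, coordinates, and strands are hypothetical.

```python
def shorten_epa(tss, window_start, window_end, sites, orientation="any"):
    """Shorten an EPA window at the nearest motif sites around a TSS.

    sites is a list of (position, strand) tuples with strand "+" or "-".
    orientation "FR" keeps "+" sites on the left and "-" sites on the right,
    "RF" keeps the opposite, and "any" ignores the strand.
    """
    required = {"any": (None, None), "FR": ("+", "-"), "RF": ("-", "+")}
    left_strand, right_strand = required[orientation]
    left = [pos for pos, strand in sites
            if window_start <= pos < tss and left_strand in (None, strand)]
    right = [pos for pos, strand in sites
             if tss < pos <= window_end and right_strand in (None, strand)]
    start = max(left) if left else window_start
    end = min(right) if right else window_end
    return start, end


# Hypothetical motif sites around a TSS at position 100,000.
sites = [(40_000, "+"), (70_000, "-"), (130_000, "+"), (160_000, "-")]
epa_fr = shorten_epa(100_000, 0, 200_000, sites, "FR")        # type (i), FR
epa_rf = shorten_epa(100_000, 0, 200_000, sites, "RF")        # type (i), RF
epa_any = shorten_epa(100_000, 0, 200_000, sites, "any")      # type (ii)
epa_unshortened = shorten_epa(100_000, 0, 200_000, [], "any")  # type (iii)
```

In this toy example the FR setting yields a wider window than the orientation-free setting, which shows how the orientation rule can change the enhancers assigned to a gene and therefore the predicted EPI.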
After the analysis of CD14+ monocytes, to utilize HiChIP chromatin interaction data in CD4+ T cells, the same analysis as for CD14+ monocytes was performed using DNase-seq data of four donors in CD4+ T cells (Mumbach et al. 2017). EPI predicted based on EPA were compared with each of three replicates (B2T1, B2T2, and B3T1) of HiChIP chromatin interaction data in CD4+ T cells. The resolutions of HiChIP chromatin interaction data and EPI were adjusted to 5kb. EPI were predicted based on the three types of EPA in the same way as in CD14+ monocytes, using the top 60% of all transcripts (genes) by expression level, excluding transcripts not expressed in T cells. The criteria of the analysis were determined so that known DNA motif sequences involved in chromatin interactions, such as CTCF and cohesin, were included in the result, and the result was consistent with that using Hi-C chromatin interaction data. EPA (i) showed the highest ratio of EPI overlapped with chromatin interactions (HiChIP) using DNA binding motif sequences of CTCF and cohesin (RAD21 and SMC3), compared with the other two types of EPA (ii) and (iii) (n = 4, binomial test, two-sided, 95% confidence interval) (Fig. 3). In total, 136 DNA motif sequences with biased orientation (70 FR and 73 RF), which included CTCF, cohesin (RAD21 and SMC3), and SMARC in three replicates (B2T1, B2T2, and B3T1) and ZNF143 in two replicates (B2T2 and B3T1), showed a significantly higher ratio of EPI overlapped with HiChIP chromatin interactions (more than 1,000 counts for each interaction) than the other types of EPAs in T cells (Table 5). When comparing EPI predicted based on only EPA (i) and (iii) with the chromatin interactions, 356 DNA motif sequences with biased orientation (194 FR and 200 RF) showed a significantly higher ratio of EPI predicted based on EPA (i) overlapped with the chromatin interactions than EPI predicted based on EPA (iii) (Table 5; Supplemental material 2).
As expected, the number of EPI overlapped with chromatin interactions (HiChIP) was increased, compared with Hi-C chromatin interactions. Most of biased orientation of DNA motif sequences (95%) were found to be correlated with chromatin interactions, when comparing EPI predicted based on EPA (i) and (iii) with HiChIP chromatin interactions.
Moreover, to examine the enhancer activity of EPI, the distribution of the expression level of putative target genes of EPI was compared between EPI overlapped with HiChIP chromatin interactions and EPI not overlapped with them. Though the target genes of EPI were selected from the top 60% of all transcripts (genes) by expression level, excluding transcripts not expressed in T cells, target genes of EPI overlapped with chromatin interactions showed a significantly higher expression level than those of EPI not overlapped with them, suggesting that EPI overlapped with chromatin interactions activated the expression of target genes in T cells. Almost all (99.9%) FR and RF orientations of DNA motifs showed a significantly higher expression level of putative target genes of EPI overlapped with chromatin interactions than of EPI not overlapped. In the tables comparing EPI with HiChIP chromatin interactions in Supplemental material 2, a biased orientation of DNA motif was marked with ‘1’ when it showed a significantly higher expression level of putative target genes of EPI overlapped with chromatin interactions than of EPI not overlapped, with ‘-1’ when it showed a significantly lower expression level (which was not observed in this analysis), and with ‘0’ when there was no significant difference in expression level.
If biased orientation of DNA motif sequences of TF found in both monocytes and T cells are biologically meaningful, these may match the result of the analysis of HiChIP data. Among 376 FR and RF orientations of biased orientation of DNA motifs of TF in T cells, 136 (36%) were biased orientation of DNA motifs in the analysis of HiChIP data for three types of EPA (i) (ii) and (iii). Among 73 FR and RF orientations of DNA motifs of TF found in both monocytes and T cells, 31 (42%) were biased orientation of DNA motifs in the analysis of HiChIP data, which was significantly higher ratio than all 376 biased orientation of DNA motifs in T cell, and included CTCF, RAD21 and SMC3 (p-value < 0.015, binomial test, two-sided, 95% confidence interval) (Supplemental Table S5 and Table 5). However, this may not imply that all 376 biased orientation of DNA motifs included false-positive predictions, and may be due to the limitation of resolution of the HiChIP data (5kb) or the small number of DNA binding sites of a TF in genome sequences. Then, among 113 FR and RF orientations of DNA motifs of TF found in both monocytes and T cells without considering the difference of orientation (FR or RF) of DNA binding motifs, 42 (37%) were biased orientation of DNA motifs in the analysis of HiChIP data, which was not significantly higher ratio. This implied that the difference of orientation of DNA motifs was important to predict EPI in comparison with HiChIP data.
Though the ratios of EPI overlapped with chromatin interactions were increased by using more chromatin interaction data, including chromatin interactions with lower scores and counts (a cutoff score of CHiCAGO tool > 1 for Hi-C and more than 1,000 counts for HiChIP), the ratios showed the same tendency among the three types of EPAs. The ratio of EPI overlapped with Hi-C chromatin interactions was increased using H3K27ac marks in both monocytes and T cells. The ratio of EPI overlapped with HiChIP chromatin interactions was also increased using H3K27ac marks. Chromatin interaction data were obtained from samples different from those used for DNase-seq (open chromatin regions) in CD4+ T cells, so individual differences seemed to be large in the results of this analysis, and Mumbach et al. (2017) suggested that individual differences of chromatin interactions were larger than those of open chromatin regions. ATAC-seq data (open chromatin regions) in CD4+ T cells were available in that paper; however, when ATAC-seq data were used, the result of the analysis of biased orientation of DNA motif sequences differed from that with DNase-seq data and did not include part of CTCF and cohesin. Thus, DNase-seq data collected from the ENCODE and Blueprint projects were employed in this study.
Discussion
To find DNA motif sequences of transcription factors (TF) and repeat DNA sequences affecting the expression level of human putative transcriptional target genes, the DNA motif sequences were searched from open chromatin regions of monocytes of four people. Total 369 biased [232 forward-reverse (FR) and 178 reverse-forward (RF)] orientation of DNA motif sequences of TF were found in monocytes of four people in common, whereas only seven any orientation (i.e. without considering orientation) of DNA motif sequence of TF was found to affect the expression level of putative transcriptional target genes, suggesting that enhancer-promoter association (EPA) shortened at the genomic locations of FR or RF orientation of the DNA motif sequence of a TF or a repeat DNA sequence is an important character for the prediction of enhancer-promoter interactions (EPI) and the transcriptional regulation of genes.
When DNA motif sequences were searched from monocytes of one person, a larger number of biased orientation of DNA motif sequences affecting the expression level of human putative transcriptional target genes were found. When the number of donors, from which experimental data were obtained, was increased, the number of DNA motif sequences found in all people in common decreased and in some cases, known transcription factors involved in chromatin interactions such as CTCF and cohesin (RAD21 and SMC3) were not identified by statistical tests. This would be caused by individual difference of the same cell type, low quality of experimental data, and experimental errors. Moreover, though FR orientation of DNA binding motif sequences of CTCF and cohesin is frequently observed at chromatin interaction anchors, the percentage of FR orientation is not 100, and other orientations of the DNA binding motif sequences are also observed. Though DNA binding motif sequences of CTCF and cohesin are found in various open chromatin regions, DNA binding motif sequences of some TF would be observed less frequently in open chromatin regions. The analyses of experimental data of a number of people would avoid missing relatively weak statistical significance of DNA motif sequences of TF in experimental data of each person by multiple testing correction of thousands of statistical tests. A DNA motif sequence was found with p-value < 0.05 in experimental data of one person and the DNA motif sequence found in the same cell type of four people in common would have p-value < 0.054 = 6.25 × 10−6. Actually, DNA motif sequences with p-value slightly less than 0.05 in monocytes of one person were observed in monocytes of four people in common.
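The arithmetic behind the four-donor criterion can be written out as a short Python check. It treats the four per-donor tests as independent, which is a simplifying assumption, and compares the resulting joint threshold with a Bonferroni-corrected threshold for a hypothetical number of tests. The value of 5,000 tests is chosen only to illustrate "thousands of statistical tests" and is not taken from this study.

```python
import math

alpha = 0.05   # per-donor significance threshold used in the text
n_donors = 4   # monocytes (or T cells) of four people

# Joint threshold if the four per-donor tests are treated as independent.
joint_threshold = alpha ** n_donors

# Hypothetical number of statistical tests, for illustration only.
n_tests = 5000
bonferroni_threshold = alpha / n_tests

print(f"joint threshold:      {joint_threshold:.3g}")
print(f"Bonferroni threshold: {bonferroni_threshold:.3g}")
print("joint threshold is stricter:", joint_threshold < bonferroni_threshold)
```

Under these assumptions, requiring p-value < 0.05 in all four donors is at least as strict as a single Bonferroni-corrected test over 5,000 tests, which is the reasoning behind selecting DNA motif sequences found in four people in common.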
EPI were compared with chromatin interactions (Hi-C) in monocytes. EPAs shortened at the genomic locations of DNA binding motif sequences of CTCF and cohesin (RAD21 and SMC3) showed a significant difference in the ratios of EPI overlapped with chromatin interactions among the three types of EPAs (see Methods). When open chromatin regions overlapped with ChIP-seq peaks of a histone modification marking enhancers (H3K27ac) were used, the ratio of EPI not overlapped with Hi-C was reduced. Phanstiel et al. (2017) also reported an especially strong enrichment for loops with H3K27 acetylation peaks at both ends (Fisher’s Exact Test, p = 1.4 × 10^-27). However, the total number of EPI overlapped with chromatin interactions was also reduced when H3K27ac peaks were used, so more chromatin interaction data would be needed to obtain reliable results in this analysis. As a limitation of the experimental data, the data for chromatin interactions and open chromatin regions came from different samples and donors, so individual differences would exist in the data. Moreover, the resolution of the chromatin interaction data used in monocytes was about 50kb, and thus the number of chromatin interactions was relatively small (72,284 at 50kb resolution with a CHiCAGO score cutoff > 1 and 16,501 with a CHiCAGO score cutoff > 5). EPI predicted based on EPA shortened at the genomic locations of DNA binding motif sequences of TF that were found in various open chromatin regions, such as CTCF and cohesin (RAD21 and SMC3), tended to overlap with a larger number of chromatin interactions than those of TF less frequently observed in open chromatin regions. Therefore, to examine the difference in the numbers of EPI overlapped with chromatin interactions among the three types of EPAs, the number of chromatin interactions should be large enough.
As HiChIP chromatin interaction data were available for CD4+ T cells, biased orientations of DNA motif sequences of TF were examined in T cells using DNase-seq data of four people. The resolutions of chromatin interactions and EPI were adjusted to 5kb by fragmentation of genome sequences. In monocytes, the resolution of Hi-C chromatin interaction data was converted by extending the anchor regions of chromatin interactions to 50kb in length and merging chromatin interactions that overlapped with each other. Fragmentation of genome sequences may affect the classification of chromatin interactions whose anchors are located near the border of a fragment, but, unlike merging, it would not decrease the number of chromatin interactions. The number of HiChIP chromatin interactions at 5kb resolution was 19,926,360 in total, 666,149 for interactions with more than 1,000 counts each, and 78,209 for interactions with more than 6,000 counts each. As expected, the number of EPI overlapped with chromatin interactions increased, and 36 – 95% of biased orientations of DNA motif sequences of TF showed statistical significance for EPI predicted based on EPA shortened at the genomic locations of the DNA motif sequence, compared with the other types of EPAs or EPA not shortened. False positive predictions of EPI would be decreased by using H3K27ac marks and other features. The ratio of EPI overlapped with Hi-C chromatin interactions increased when H3K27ac marks were used in both monocytes and T cells. The ratio of EPI overlapped with HiChIP chromatin interactions also increased when H3K27ac marks were used. However, the number of biased orientations of DNA motif sequences showing a higher ratio of EPI overlapped with HiChIP chromatin interactions than the other types of EPAs decreased when H3K27ac marks were used (Supplemental material 2).
When forming a homodimer or heterodimer with another TF, a TF may bind to genomic DNA with a specific orientation of its DNA binding sequence (Fig. 4). Thus, TF forming heterodimers may also be found by the analysis of biased orientation of DNA motif sequences of TF. If the DNA binding motif sequence of only one member of a pair of TF was found in an EPA, the EPA was shortened at one side, namely at the genomic location of the DNA binding motif sequence of that member, and transcriptional target genes were predicted using the EPA shortened at that side. In this analysis, one member of either a heterodimer or a homodimer of TF can be used to examine the effect on the expression level of transcriptional target genes predicted based on the EPA shortened at one side. For heterodimers, biased orientation of DNA motif sequences may also be found in forward-forward or reverse-reverse orientation.
Some DNA binding sites of TF predicted using DNA binding motif sequences of TF changed depending on the parameters of the FIMO tool, particularly the background frequencies of A, T, G and C nucleotides in genome sequences and the p-value threshold. Repeat DNA sequences also affected the result of the prediction. Without repeat masking, the number of DNA motifs of TF found with any orientation increased. However, the p-values of these DNA motifs were relatively high and close to the threshold, so these DNA motifs seemed to be false positives. To decrease false-positive and false-negative predictions of DNA binding sites of TF, improve the prediction of biased orientation of DNA motifs, and obtain a robust result, there may be room to explore more suitable parameters and methods, such as a stricter p-value threshold, using genomic regions conserved among species, masking some exons encoding mRNA, removing DNA motifs highly affected by parameter changes (low information content), changing the nucleotide frequency parameter according to genomic regions, and considering epigenetic modifications (DNA methylation and histone modifications). For the analysis of EPI, instead of using all DNA motif sequences of TF in databases, selecting DNA motif sequences of TF indicating enhancer activity in a cell type using my method would reduce the effect of TF not associated with enhancer activity (Osato 2018).
It has been reported that CTCF and cohesin-binding sites are frequently mutated in cancer (Katainen et al. 2015). Some biased orientation of DNA motif sequences would be associated with chromatin interactions and might be associated with diseases including cancer.
The analyses in this study revealed novel characteristics of DNA binding motif sequences of TF and repeat DNA sequences, which can be used to analyze TF involved in chromatin interactions, insulator function, and the formation of homodimers, heterodimers or complexes with other TF that affect the transcriptional regulation of genes.
Methods
Search for biased orientation of DNA motif sequences
To examine transcriptional regulatory target genes of transcription factors (TF), bed files of hg38 of Blueprint DNase-seq data for CD14+ monocytes of four donors (EGAD00001002286; Donor ID: C0010K, C0011I, C001UY, C005PS) were obtained from Blueprint project web site (http://dcc.blueprint-epigenome.eu/#/home), and the bed files of hg38 were converted into those of hg19 using Batch Coordinate Conversion (liftOver) web site (https://genome.ucsc.edu/util.html). Bed files of hg19 of ENCODE H1-hESC (GSM816632; UCSC Accession: wgEncodeEH000556), iPSC (GSM816642; UCSC Accession: wgEncodeEH001110), HUVEC (GSM1014528; UCSC Accession: wgEncodeEH002460), and MCF-7 (GSM816627; UCSC Accession: wgEncodeEH000579) were obtained from the ENCODE websites (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUwDgf/; http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUwDnase/).
As high-resolution chromatin interaction data obtained using HiChIP became available for CD4+ T cells, DNase-seq data of four donors were obtained from a public database and the Blueprint project web site to perform the same analysis in CD4+ T cells as in CD14+ monocytes. DNase-seq data of only one donor were available in the Blueprint project for CD4+ T cells (Donor ID: S008H1), and DNase-seq data of three other donors were obtained from the NCBI Gene Expression Omnibus (GEO) database. Because the peak calling of the DNase-seq data available in the GEO database differed from that of the other DNase-seq data in the ENCODE and Blueprint projects, where peaks of 150bp length were usually predicted using HotSpot (John et al. 2011), FASTQ files of the DNase-seq data were downloaded from the NCBI Sequence Read Archive (SRA) database (SRR097566, SRR097618, and SRR171574). Read sequences in the FASTQ files were aligned to the hg19 version of the human genome reference using BWA (Li and Durbin 2009), and the SAM files generated by BWA were converted into BAM files, sorted, and indexed using Samtools (Li et al. 2009). Peaks of the DNase-seq data were predicted using HotSpot-4.1.1.
To identify transcription factor binding sites (TFBS) from the DNase-seq data, TRANSFAC (2019.1), JASPAR (2018), UniPROBE (2018), BEEML-PBM, high-throughput SELEX, Human Protein-DNA Interactome, transcription factor binding sequences of ENCODE ChIP-seq data, and HOCOMOCO version 9 and 11 were used to predict insulator sites (Wingender et al. 1996; Newburger and Bulyk 2009; Portales-Casamar et al. 2010; Xie et al. 2010; Zhao and Stormo 2011; Jolma et al. 2013; Kheradpour and Kellis 2014; Kulakovskiy et al. 2018). TRANSFAC (2011.1), JASPAR (2012), UniPROBE (2012), BEEML-PBM, high-throughput SELEX, and Human Protein-DNA Interactome were used to analyze enhancer-promoter interactions, since these data were sufficient to identify biased orientation of DNA motif sequences of TF with less computational time, while reducing the number of DNA motif sequences of TF found with any orientation. Position weight matrices of transcription factor binding sequences were transformed into TRANSFAC matrices and then into MEME matrices using in-house scripts and transfac2meme in MEME suite (Bailey et al. 2009). Transcription factor binding sequences of TF derived from vertebrates were used for further analyses. Transcription factor binding sequences were searched in each narrow peak of DNase-seq data in repeat-masked hg19 genome sequences using FIMO with a p-value threshold of 10^-5 and the background frequencies of A, T, G and C nucleotides in repeat-masked hg19 genome sequences (Grant et al. 2011). Repeat-masked hg19 genome sequences were downloaded from UCSC genome browser (http://genome.ucsc.edu/, http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.masked.gz).
TF corresponding to transcription factor binding sequences were identified computationally by comparing their names with the gene symbols of HGNC (HUGO Gene Nomenclature Committee)-approved gene nomenclature and 31,848 UCSC known canonical transcripts (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/knownCanonical.txt.gz), as transcription factor binding sequences were not linked to transcript IDs such as UCSC, RefSeq, and Ensembl transcripts.
Target genes of a TF were assigned when its TFBS was found in DNase-seq narrow peaks in promoter or extended regions for enhancer-promoter association of genes (EPA). Promoter and extended regions were defined as follows. Promoter regions were those within a distance of ±5 kb from transcriptional start sites (TSS). Extended regions were defined as per the following association rule, which is the same as that defined in Figure 3A of a previous study (McLean et al. 2010): the single nearest gene association rule, which extends the regulatory domain to the midpoint between the TSS of the gene and that of the nearest gene upstream and downstream without the limitation of extension length. Extended regions for EPA were shortened at the genomic locations of the DNA binding sites of a TF that were the closest to a transcriptional start site, and transcriptional target genes were predicted from the shortened enhancer regions using TFBS. Furthermore, promoter and extended regions for EPA were shortened at the genomic locations of forward–reverse (FR) orientation of DNA binding sites of a TF. When DNA binding sites in forward or reverse orientation were located consecutively several times in genome sequences, the outermost forward–reverse orientation of DNA binding sites was selected. The genomic positions of genes were identified using the ‘knownGene.txt.gz’ file from the UCSC bioinformatics site (Karolchik et al. 2014). The file ‘knownCanonical.txt.gz’ was also used to choose representative transcripts among alternative forms when assigning promoter and extended regions for EPA. From the list of transcription factor binding sequences and transcriptional target genes, redundant transcription factor binding sequences were removed by comparing the target genes of a transcription factor binding sequence and its corresponding TF; if identical, one of the transcription factor binding sequences was used.
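The single nearest gene association rule and the shortening of an extended region at the closest DNA binding sites can be sketched as follows. This is a minimal illustration and not the code used in this study: the function names, the use of a single chromosome, and the omission of strand, promoter windows and motif orientation are all simplifying assumptions.

```python
from bisect import bisect_left, bisect_right

def nearest_gene_domains(tss_positions, chrom_length):
    """Assign each TSS a regulatory domain that extends to the midpoint
    between it and the neighbouring TSS on each side (one chromosome)."""
    tss_sorted = sorted(tss_positions)
    domains = {}
    for i, tss in enumerate(tss_sorted):
        # The first and last genes extend to the chromosome ends (assumption).
        left = 0 if i == 0 else (tss_sorted[i - 1] + tss) // 2
        right = (chrom_length if i == len(tss_sorted) - 1
                 else (tss + tss_sorted[i + 1]) // 2)
        domains[tss] = (left, right)
    return domains

def shorten_domain(tss, domain, site_positions):
    """Cut the domain at the binding sites closest to the TSS on each side.
    Orientation (FR or RF) is ignored in this sketch."""
    start, end = domain
    sites = sorted(site_positions)
    i = bisect_left(sites, tss)    # sites[i - 1] is the closest site left of the TSS
    if i > 0 and sites[i - 1] > start:
        start = sites[i - 1]
    j = bisect_right(sites, tss)   # sites[j] is the closest site right of the TSS
    if j < len(sites) and sites[j] < end:
        end = sites[j]
    return (start, end)

domains = nearest_gene_domains([1000, 5000, 9000], chrom_length=12000)
shortened = shorten_domain(5000, domains[5000], [2000, 3500, 4200, 6100, 6800, 8000])
```

In this toy example the gene with TSS at 5000 first receives the domain (3000, 7000), which is then shortened to (4200, 6100) by the two sites nearest to the TSS.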
When the number of transcriptional target genes predicted from a transcription factor binding sequence was less than five, the transcription factor binding sequence was omitted.
Repeat DNA sequences were searched from the hg19 version of the human reference genome using RepeatMasker (Smit, AFA & Green, P RepeatMasker at http://www.repeatmasker.org) and RepBase RepeatMasker Edition (http://www.girinst.org).
For gene expression data, RNA-seq reads mapped onto human hg19 genome sequences were obtained, including ENCODE long RNA-seq reads with poly-A of H1-hESC, iPSC, HUVEC, and MCF-7 (GSM26284, GSM958733, GSM2344099, GSM2344100, GSM958734, and GSM765388), and UCSF-UBC human reference epigenome mapping project RNA-seq reads with poly-A of naive CD4+ T cells (GSM669617). Two replicates were present for H1-hESC, iPSC, HUVEC, and MCF-7, and a single one for CD4+ T cells. FPKMs of the RNA-seq data were calculated using RSeQC (Wang et al. 2012). For monocytes, Blueprint RNA-seq FPKM data (‘C0010KB1.transcript_quantification.rsem_grape2_crg.GRCh38.20150622.results’, ‘C0011IB1.transcript_quantification.rsem_grape2_crg.GRCh38.20150622.results’, ‘C001UYB4.transcript_quantification.rsem_grape2_crg.GRCh38.20150622.results’, and ‘C005PS12.transcript_quantification.rsem_grape2_crg.GRCh38.20150622.results’) were downloaded from Blueprint DCC portal (http://dcc.blueprint-epigenome.eu/#/files). Based on FPKM, UCSC transcripts with top 50% expression level of all the transcripts excluding transcripts not expressed were selected in each cell type.
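The selection of transcripts in the top 50% of expression can be illustrated with a short sketch. It assumes that "not expressed" means an FPKM of zero and that ties are broken by input order; neither detail is specified in the text, and the function name and data are hypothetical.

```python
def top_expressed(fpkm_by_transcript, fraction=0.5):
    """Return the IDs of transcripts in the top `fraction` by FPKM,
    after excluding transcripts with FPKM of zero (assumed 'not expressed')."""
    expressed = {t: v for t, v in fpkm_by_transcript.items() if v > 0}
    ranked = sorted(expressed, key=expressed.get, reverse=True)
    keep = int(len(ranked) * fraction)
    return set(ranked[:keep])

# Hypothetical FPKM values for five transcripts.
fpkm = {"tx_a": 0.0, "tx_b": 1.0, "tx_c": 5.0, "tx_d": 2.0, "tx_e": 10.0}
selected = top_expressed(fpkm)
```

Here four transcripts are expressed, so the two with the highest FPKM (tx_e and tx_c) are kept.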
The expression level of transcriptional target genes predicted based on EPA shortened at the genomic locations of the DNA motif sequence of a TF or a repeat DNA sequence was compared with the expression level of transcriptional target genes predicted from promoters only. For each DNA motif sequence shortening EPA, transcriptional target genes were predicted using about 3,000 – 5,000 DNA binding motif sequences of TF, and the distribution of expression level of putative transcriptional target genes of each TF was compared between EPA and promoters only using a two-sided Mann-Whitney test (p-value < 0.05). The number of TF showing a significant difference in the expression level of putative transcriptional target genes between EPA and promoter was compared among forward-reverse (FR), reverse-forward (RF), and any orientation (i.e. without considering orientation) of a DNA motif sequence shortening EPA using a chi-square test (p-value < 0.05). When a DNA motif sequence of a TF or a repeat DNA sequence shortening EPA showed a significant difference in the expression level of putative transcriptional target genes among FR, RF, or any orientation in common in monocytes of four people, the DNA motif sequence was listed.
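The per-TF comparison of expression levels can be illustrated with a dependency-free sketch of a two-sided Mann-Whitney test. It uses the normal approximation with tie correction and no continuity correction, which is an assumption; the text does not state which implementation was used, and a statistics library such as SciPy (`scipy.stats.mannwhitneyu`) would normally be preferred. The expression values below are hypothetical.

```python
import math

def mann_whitney_two_sided(x, y):
    """Return (U statistic of x, two-sided p-value) using the normal
    approximation with tie correction and no continuity correction."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2      # mean of the 1-based ranks i+1 .. j
        t = j - i                       # size of this group of tied values
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical expression levels of target genes predicted from EPA and from promoters only.
epa_targets = [12.1, 8.4, 15.0, 9.9, 20.3, 11.2]
promoter_targets = [3.2, 5.1, 4.4, 6.0, 2.8, 7.5]
u_stat, p_value = mann_whitney_two_sided(epa_targets, promoter_targets)
significant = p_value < 0.05
```

The number of TF for which `significant` is true would then be tabulated for FR, RF and any orientation before the chi-square comparison described above.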
Though forward-reverse orientation of DNA binding motif sequences of CTCF and cohesin is frequently observed at chromatin interaction anchors, the percentage of forward-reverse orientation is not 100%, and other orientations of the DNA binding motif sequences are also observed. Though DNA binding motif sequences of CTCF and cohesin are found in various open chromatin regions, DNA binding motif sequences of some transcription factors would be observed less frequently in open chromatin regions. Analyzing experimental data from several people would avoid missing DNA motif sequences whose statistical significance is relatively weak in the data of each person and would otherwise be lost to multiple testing correction across thousands of statistical tests. If a DNA motif sequence is found with p-value < 0.05 in the experimental data of one person, the same DNA motif sequence found in common in the same cell type of four people would have p-value < 0.05^4 = 6.25 × 10^-6, assuming that the four tests are independent.
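The bound on the combined p-value can be written out explicitly. The expression below is a sketch that treats the four per-donor tests as independent, which is an assumption; here p_i denotes the p-value obtained for donor i and p_common is a symbol introduced only for this illustration.

```latex
p_{\mathrm{common}} = \prod_{i=1}^{4} p_i < 0.05^{4} = 6.25 \times 10^{-6},
\qquad \text{where } p_i < 0.05 \text{ for } i = 1, \dots, 4 .
```

If the data of the donors are correlated, for example because the same motif sites are open in all donors, the actual combined significance would be weaker than this bound.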
Co-location of biased orientation of DNA motif sequences
Co-location of biased orientation of DNA binding motif sequences of TF was examined. The number of open chromatin regions with the same pair of DNA binding motif sequences was counted, and when the pair of DNA binding motif sequences were enriched with statistical significance (chi-square test, p-value < 1.0 × 10^-10), they were listed. For histone modification of an enhancer mark (H3K27ac), bed files of hg38 of Blueprint ChIP-seq data for CD14+ monocytes (EGAD00001001179) and CD4+ T cells (Donor ID: S000RD) were obtained from Blueprint web site (http://dcc.blueprint-epigenome.eu/#/home), and the bed files of hg38 were converted into those of hg19 using Batch Coordinate Conversion (liftOver) web site (https://genome.ucsc.edu/util.html). Networks of co-locations of biased orientation of DNA motif sequences were plotted using Cytoscape v3.71 with yFiles Layout Algorithms for Cytoscape (Shannon et al. 2003).
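The counting step for co-located motif pairs can be sketched as follows. The region identifiers and motif names are hypothetical, and the chi-square enrichment test applied afterwards is not reproduced here.

```python
from collections import Counter
from itertools import combinations

def count_motif_pairs(motifs_by_region):
    """Count, for every unordered pair of motifs, the number of open
    chromatin regions in which both motifs occur."""
    counts = Counter()
    for motifs in motifs_by_region.values():
        # A motif found several times in one region is counted once per region.
        for pair in combinations(sorted(set(motifs)), 2):
            counts[pair] += 1
    return counts

# Hypothetical motif hits in three open chromatin regions.
regions = {
    "region_1": ["CTCF", "RAD21", "SMC3"],
    "region_2": ["CTCF", "RAD21", "CTCF"],
    "region_3": ["CTCF"],
}
pair_counts = count_motif_pairs(regions)
```

Sorting the motif names before forming pairs makes ("CTCF", "RAD21") and ("RAD21", "CTCF") the same key, so each pair is counted once per region.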
Comparison with chromatin interaction data
For comparison of EPA in monocytes with chromatin interactions, the ‘PCHiC_peak_matrix_cutoff0.txt.gz’ file was downloaded from the ‘Promoter Capture Hi-C in 17 human primary blood cell types’ web site (https://osf.io/u8tzp/files/), and chromatin interactions for monocytes with CHiCAGO scores > 1 and > 5 were extracted from the file (Javierre et al. 2016). Hi-C chromatin interaction data of CD4+ T cells (Naive CD4+ T cells, nCD4) were obtained in the same way as for monocytes.
Enhancer-promoter interactions (EPI) were predicted using three types of EPAs in monocytes: (i) EPA shortened at the genomic locations of FR or RF orientation of DNA motif sequence of a TF, (ii) EPA shortened at the genomic locations of any orientation (i.e. without considering orientation) of DNA motif sequence of a TF, and (iii) EPA without being shortened by a DNA motif sequence. EPI predicted using the three types of EPAs in common were removed. First, EPI predicted based on EPA (i) were compared with chromatin interactions (Hi-C). The resolution of chromatin interaction data used in this study was 50kb, so EPI were adjusted to 50kb before their comparison. The number and ratio of EPI overlapped with chromatin interactions were counted. Second, EPI were predicted based on EPA (ii), and EPI predicted based on EPA (i) were removed from the EPI. The number and ratio of EPI overlapped with chromatin interactions were counted. Third, EPI were predicted based on EPA (iii), and EPI predicted based on EPA (i) and (ii) were removed from the EPI. The number and ratio of EPI overlapped with chromatin interactions were counted. The number and ratio of the EPI were compared two times between EPA (i) and (iii), and EPA (i) and (ii) (binomial distribution, p-value < 0.025 for each test, two-sided, 95% confidence interval).
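The stepwise removal of shared EPI and the counting of overlaps can be sketched as follows. This is an illustration under simplifying assumptions: each EPI is a hypothetical (enhancer position, promoter position) pair on a single chromosome, chromatin interactions are already given as pairs of 50kb bin indices, and the binomial test is not reproduced.

```python
def exclusive_epi_sets(epi_i, epi_ii, epi_iii):
    """Split EPI predicted from EPA types (i), (ii) and (iii) into
    non-overlapping sets, following the order of removal in the text."""
    common = epi_i & epi_ii & epi_iii       # predicted by all three types
    only_i = epi_i - common
    only_ii = epi_ii - epi_i                # EPI of type (i) removed
    only_iii = epi_iii - epi_i - epi_ii     # EPI of types (i) and (ii) removed
    return only_i, only_ii, only_iii

def overlap_ratio(epi, interaction_bins, resolution=50_000):
    """Return (number, ratio) of EPI whose two ends fall in the same pair
    of bins as a chromatin interaction."""
    epi_bins = {tuple(sorted((e // resolution, p // resolution))) for e, p in epi}
    if not epi_bins:
        return 0, 0.0
    hits = len(epi_bins & interaction_bins)
    return hits, hits / len(epi_bins)

# Hypothetical EPI as (enhancer position, promoter position).
epi_i = {(10_000, 120_000), (10_000, 300_000)}
epi_ii = {(10_000, 120_000), (10_000, 400_000)}
epi_iii = epi_i | epi_ii | {(10_000, 900_000)}
only_i, only_ii, only_iii = exclusive_epi_sets(epi_i, epi_ii, epi_iii)

interactions = {(0, 6)}                     # one interaction between bins 0 and 6
hits_i = overlap_ratio(only_i, interactions)
hits_iii = overlap_ratio(only_iii, interactions)
```

The resulting counts and ratios for types (i) and (iii), and for types (i) and (ii), are the quantities that the two comparisons described above would take as input.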
For comparison of EPA with chromatin interactions (HiChIP) in CD4+ T cells, ‘GSM2705049_Naive_HiChIP_H3K27ac_B2T1_allValidPairs.txt’, ‘GSM2705050_Naive_HiChIP_H3K27ac_B2T2_allValidPairs.txt’, and ‘GSM2705051_Naive_HiChIP_H3K27ac_B3T1_allValidPairs.txt’ files were downloaded from Gene Expression Omnibus (GEO) database (GSM2705049, GSM2705050 and GSM2705051). The resolutions of chromatin interaction data and EPI were adjusted to 5kb before their comparison. Chromatin interactions with more than 6,000 and 1,000 counts for each interaction were used in this study.
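Adjusting valid read pairs to 5kb resolution and applying a count cutoff can be sketched as follows. The input is assumed to be an iterable of (chromosome, position, chromosome, position) tuples already parsed from the files, and "counts for each interaction" is assumed to mean the number of valid pairs per pair of 5kb fragments; both points are assumptions rather than details given in the text.

```python
from collections import Counter

def bin_interactions(valid_pairs, resolution=5000, min_count=1000):
    """Aggregate read pairs into pairs of fixed-size bins and keep the
    bin pairs supported by more than `min_count` read pairs."""
    counts = Counter()
    for chrom_a, pos_a, chrom_b, pos_b in valid_pairs:
        end_a = (chrom_a, pos_a // resolution)
        end_b = (chrom_b, pos_b // resolution)
        counts[tuple(sorted((end_a, end_b)))] += 1   # orientation-independent key
    return {pair: n for pair, n in counts.items() if n > min_count}

# Hypothetical read pairs; a cutoff of 1 is used only to keep the example small.
pairs = [
    ("chr1", 1200, "chr1", 52000),
    ("chr1", 54999, "chr1", 4900),
    ("chr1", 1200, "chr1", 70000),
]
kept = bin_interactions(pairs, resolution=5000, min_count=1)
```

Fixed-size binning keeps every interaction, in contrast to merging overlapping anchors, at the cost of splitting anchors that lie close to a bin border.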
Putative target genes for the analysis of EPI were selected from transcripts in the top 50% of expression level among all transcripts, excluding transcripts not expressed, in monocytes, and from transcripts in the top 60% of expression level in CD4+ T cells. The expression level of putative target genes of EPI overlapped with HiChIP chromatin interactions was compared with that of EPI not overlapped with them. For each FR or RF orientation of a DNA motif, EPI were predicted based on EPA, and the overlap of EPI with chromatin interactions was examined. When a putative transcriptional target gene of a TF in an enhancer was found both in EPI overlapped with a chromatin interaction and in EPI not overlapped with one, the target gene was removed. The distribution of expression level of putative target genes was compared using a two-sided Mann-Whitney test (p-value < 0.05).
Acknowledgements
The supercomputing resource was provided by Human Genome Center of the Institute of Medical Science at the University of Tokyo. Computations were partially performed on the NIG supercomputer at ROIS National Institute of Genetics. Publication charges for this article were funded by JSPS KAKENHI Grant Number 16K00387. This research was partially supported by the Platform Project for Supporting in Drug Discovery and Life Science Research (Platform for Dynamic Approaches to Living System) from Japan Agency for Medical Research and Development (AMED). This research was partially supported by Development of Fundamental Technologies for Diagnosis and Therapy Based upon Epigenome Analysis from Japan Agency for Medical Research and Development (AMED). This work was partially supported by JST CREST Grant Number JPMJCR15G1, Japan.