Summary
The human genome has more than 800 C2H2 Zinc Finger-containing genes, and many of them are composed of long tandem arrays of zinc fingers. Current Zinc Finger Protein (ZFP) motif prediction models assume longer finger arrays correspond to longer DNA-binding motifs and higher specificity. However, recent experimental efforts to identify ZFP binding sites in vivo contradict this assumption with many having short reported motifs. Using Zinc Finger Y (ZFY), which has 13 ZFs, we quantitatively characterize its DNA binding specificity with several complementary methods, including Affinity-seq, HT-SELEX, Spec-seq and fluorescence anisotropy. Besides the previously identified core motif GGCCT recognized by fingers 12-13, we find a novel secondary irregular motif recognized by accessory fingers. Via high-throughput energy measurements and two-color anisotropy, we establish that this secondary motif contributes to binding and recognition in a non-independent manner, increasing overall affinity only in the presence of the core recognition site. Through additional experimental and iterative computational analysis of CTCF and ZNF343, we further establish that this non-independent recognition between core and secondary motifs could be a general mechanism for tandem zinc finger proteins. These results establish that better motif discovery methods that consider the intrinsic properties of tandem zinc fingers including irregular motif structure, variable spacing and non-independent recognition are essential to improve prediction of ZFP recognition, occupancies, and effects on downstream gene expression in vivo.
Introduction
The zinc finger domain was first described in the TFIIIA protein of Xenopus which contains an array of 9 zinc fingers involved in the recognition of RNA Polymerase III promoters [1, 2]. With the advent of whole genome sequencing it has been discovered that ZFPs occur in all eukaryotic species but have expanded enormously in the vertebrate lineage [3–5] where they are the most abundant class of transcription factors (TFs) [6–8]. While ZFPs can have other roles, such as binding to RNA and in protein-protein interactions [9–11], it is generally thought that most of them function as TFs [12, 13].
The most common form of ZFPs are those with a C2H2 motif in which two cysteine and two histidine residues coordinate a zinc atom in a beta-beta-alpha protein structure [14, 15]. Some proteins contain only one C2H2 domain, but it is much more common for TFs to contain multiple C2H2 domains in an array, or sometimes in multiple, separated arrays [12]. The total number of C2H2 domains has also greatly expanded in vertebrates, with some ZFPs containing more than 30 (Figure 1A-E) [13].
The initial structure determination of a ZFP bound to DNA showed that each finger could interact with 3-4 base pairs with adjacent fingers binding to adjacent DNA triplets [15]. That structural model and information about the binding sites for several ZFPs led to simple models of DNA recognition which could be used to design ZFPs to recognize specific sequences [14, 16–20]. Those early models only provided predictions of the preferred binding site sequence, whereas more comprehensive predictions of the specificity of the ZFPs is available in probabilistic models [21–23]. Based largely on the development of zinc finger nucleases for genome editing, specificities have been determined for large collections of short zinc finger proteins [12, 24–26] and from those improved methods for motif prediction have been developed [27–29]. But all models assume that each finger of the ZFP interacts with 3bp of DNA, concatenated together to form the complete binding site of length 3n for a ZFP with n fingers. While the majority of ZFPs with many fingers have unknown specificity, recent advances have led to the identification of motifs for many more [30, 31]. For some ZFPs with only a few ZFs shown in Fig. 1F, the observed motifs are longer than expected probably due to binding in vivo with another factor[9, 10, 32], but it is evident that the motifs are usually much shorter than expected based on the number of fingers.
There are several examples of well-studied ZFPs with long arrays of zinc fingers, including CTCF [33], Gli1 [34], and PRDM9 [35, 36], in which significant portions of their DNA binding domains are apparently not engaged in motif recognition but still evolutionarily conserved. This “many fingers but short motif” paradox could be explained in several ways. It is possible that the additional fingers may contribute to DNA binding specificity, but in ways that are missed by current methods for motif discovery. It may be that the additional fingers have roles other than DNA binding, such as being involved in protein-protein or protein-RNA interactions in vivo. It may also be that the additional fingers contribute to binding affinity but not specificity, for example by contributing favorable binding energy through interactions with the DNA backbone. Of course, these alternative functions for the additional zinc fingers are not mutually exclusive, and different proteins may employ different functions.
One notable case is Zinc Finger Y (ZFY). As the only zinc finger gene on human Y chromosome, ZFY was initially perceived as the putative regulator for sex determination until the discovery of the SRY gene [37, 38]. In mice, its close homologs mZFY1 and mZFY2 (Figure 2A) are required for the meiotic sex chromosome inactivation (MSCI) process [39]. ZFY consists of an N-terminal activation domain followed by thirteen tandem fingers, therefore motif prediction models suggest it should have a ~39bp long consensus binding site (Fig. 2B). Taylor-Harris et al [40] performed SELEX on mZFY1 and found it preferentially binds to sequences containing GGCCT and Grants et al [41] showed that fingers 11-13 were sufficient to recognize a motif of RGGCCT, both of which are consistent with the model from zinc finger motif prediction algorithms (Fig. 2B) [29]. Since ZFY genes share close homology with their X-chromosome counterpart ZFX, we list all the contact residues of ZFY and ZFX for both human and mouse versions in Fig. 2A. ZFY and ZFX have the same contact residues composition in Finger 12 and 13, so not surprisingly previous PBM [42] and Chip-seq [43] experiments of ZFX produced the same 5 base-pair motif (Fig. 2C, D).
We employ four complementary experimental approaches to determine the specificity of ZFY, and we use different truncated versions of the protein to assess the contribution of different fingers in DNA binding affinity. We find that the additional fingers do contribute to specificity, with a weak, irregular motif not consistent with the motif prediction programs. We also find that the additional fingers contribute significantly to binding affinity only in the presence of the core site, increasing the half-life of the complex considerably, which could be important for fulfilling ZFY’s in vivo biological functions [39]. Similar experiments with CTCF show the recognition of an upstream motif depends on the presence of core motif with variable spacing. Given this, it is more effective to use a sequential or iterative motif discovery method to analyze and extract more motif information of tandem ZFPs, as we show for ZNF343 from high-throughput genomic data.
Results
Short motifs are obtained for hZFY and mZFY1 using Affinity-seq
Affinity-seq is a method for in vitro selection of fragmented genomic DNA followed by MEME motif analysis [35], which can determine all potential binding sites of a given TF over the entire genome sequence. We performed Affinity-seq on human ZFY and mouse ZFY1. For hZFY and mZFY1, we obtained 90,084 and 50,170 peaks at p value < 0.01, respectively, from which we found motifs very similar to those previously reported (Fig. 2E-F) and no secondary motifs were reported by MEME.
High-throughput SELEX (HT-SELEX) identifies an extended consensus site and Spec-seq confirms this irregular secondary motif by accessory fingers of ZFY
Spec-seq is a high-throughput sequencing-based method to quantitatively characterize the energetic landscape of TF-DNA interactions with resolution down to 0.1kBT [44, 45]. However, one practical limitation of Spec-seq is that one can only obtain the relative binding energy of a few thousands of variants in one assay, so having prior knowledge about the TF consensus site is preferable. To test whether ZFY has any extended motif beyond GGCCT, we adopted High-throughput SELEX (HT-SELEX) first to infer an extended consensus site, similar to our previous approach of following SELEX with a more quantitative analysis [46], then performed Spec-seq. We chose randomized dsDNA libraries with prefixed GGCCT in the flanking region for HT-SELEX [47, 48] analysis by full-length human ZFY (Fig. 3A). After two rounds of bound DNA selection by EMSA separation and amplification, we sequenced the enriched DNA pool and found the most enriched site to be GGCCTAGGCGTTG. We then fixed that extended consensus site, extended the randomized dsDNA region, and redid the SELEX assay. This returned the most enriched site of GGCCTAGGCGTTATTTT (Fig. 3A).
Given this SELEX-enriched site, we constructed tandemly non-overlapping dsDNA libraries for test runs and noticed that GGCCTAGTCGTTTTTG has slightly higher affinity than the SELEX-enriched site, thus it was chosen as the reference site for Spec-seq library design [44, 49]. The four dsDNA libraries (Rand 9, 10, 11, 12), each with four randomized positions (a total of 1,021 sequences), were designed as in Fig. 3B. Spec-seq experiments were performed with mZFY1 (F7-F13), hZFY-full (F1-F13), and truncated versions of hZFY containing different subset of ZFs (F11-F13, F9-F13, F7-F13, F5-F13, F1-F11). Consistent with our previous work, we observed ~0.2kT measurement variation for individual sequence between multiple replicate runs, so we draw conclusions only with energy deviations above 0.2kT.
From the Spec-seq results, energy logos were created from the reference site and all single base variants. Figure 3C shows the logos for hZFY(F11-F13), hZFY(F7-F13) and mZFY(F7-F13). The logos for the longer proteins are essentially equivalent to the F7-F13 versions, indicating that those fingers are sufficient for the determining the specificity of the proteins. Comparing those logos, it is clear that Finger 11 can only recognize a 2nt long TA motif in positions 7-8, as opposed to the predicted 3nt TCA motif for positions 7-9 in Fig. 2B. The remaining GTN motif in positions 9-11 is likely recognized by Finger 10, which is missing in our (F11-F13) construct. While this kind of compressed or irregular motif has been reported before, this has only been for cases with a known crystal structure [34] or irregular linkers between fingers [50, 51].
Binding energy data by Spec-seq shows the dependence of accessory motif on the core site
The energetic landscape of ZFY-DNA interactions can be visualized by mapping the assayed binding sites according to their energy values and highlighting their mismatches to the reference site with red color (Fig. 3D). For hZFY(F7-F13), a few variants from the RAND11 library, covering the CGTT portion of the reference sequence, have slightly higher binding affinity (lower energy) than the reference. But most of the variants of the RAND11 and RNAD12 libraries have somewhat lower affinity, up to about 1kT higher binding energy. For the RAND10 library, covering the TAGT portion of the reference sequence, most variants have between 1-2.5 kT higher energy, and some of them have increases beyond 2.5kT. All of the variants from RAND9, covering the core region of the motif, have binding energies >2.5kT above the reference. These results confirm the existence of an irregular extended motif containing ----TAGT--TTT---. The greatly reduced affinity of those core variants explains why it is observed by methods like ChiP-seq or PBM but the weaker extended motif is not. To determine why Affinity-seq didn’t detect this extended motif, we did a correlation analysis between the occurrences of all GGCCTNNN sites near Affinity-seq peak summits and their corresponding relative binding affinity by Spec-seq. For mZFY1, the most abundant site, GGCCTAGT, does have the highest binding affinity (Fig. 3E), nonetheless the overall correlation coefficient is not particularly high (r2=0.224), probably because significant portions of Affinity-seq peaks (58,541/90,084 for hZFY and 24,801/50,170 for mZFY1) consist of multiple GGCC sites (at least 2) and therefore are not solely determined by the binding affinity of individual ones. That makes the enrichment too weak to be detected by MEME although it can be observed with quantitative binding assays.
To further explore the contribution of the extended motif, we included two extra DNA libraries R9N and R9NN (Fig 3B) and re-ran Spec-seq assay using multiple differently truncated versions of ZFY proteins. Library R9N randomized the core positions, 3-6, in the context of non-preferred bases in positions 8-12. Library R9NN also contains non-preferred bases in positions 15-18. If we symbolically divide the whole binding site to three regions, i.e., core site, extended motif part I, and part II, then many binding sites can be classified and given a short acronym depending on the sequence feature, either specific (S) or non-specific (N). For example, the acronym S-N-S refers to the site that has the specific (preferred) sequence in positions 3-6 and 15-18, but the non-specific sequence in positions 8-12. Figure 3F summarizes our measurement results for some characteristic sites. In a conventional motif prediction model, each finger recognizes its own motif and operates in independent, additive fashion. Under this additive model, the energy gap between a site with all specific subsequences (S-S-S) and sites with one or two non-specific subsequences (S-N-S and S-N-N) should be the same as the difference between N-S-S and the sites N-N-S and N-N-N. As expected, when the core site is present (S-x-x), the two other regions increase overall affinity additively (2.1kT for middle region, 0.7kT for last region, and 2.7kT for their sum for hZFY(F7-F13)). However, in the absence of the core (N-x-x), there is essentially no difference in binding specificity with or without those two regions, suggesting that the proper recognition of adjacent secondary motif depends on the presence of the core GGCC site. Note that this analysis does not allow us to determine relative affinities for the same sites between different proteins, rather each protein’s binding energy is determined relative to the reference sequence.
Two-color fluorescence anisotropy reveals multi-state DNA binding for ZFY
Besides sequencing, we sought an alternative quantitative method to validate and further investigate differences in ZFY-DNA binding affinities. We recently developed an orthogonal method, two-color competitive fluorescence anisotropy (2C-CFA, Fig. 4A)[45], to quantify the relative binding affinity of a protein of interest to different DNA sequences. Two-color competitive anisotropy overcomes the noise inherent to conventional single-color anisotropy by measuring anisotropy values of reference and competitor probes in the same protein and buffer environment simultaneously. Jantz and Berg [52] demonstrated that a difference in anisotropy is detectable only if the fluorophore is proximal to the actual TF binding site within a few nucleotides, which makes it an ideal tool to probe the local properties of TF binding on DNA.
We designed five representative ZFY competitor probes, each with FAM labeled on either the left or right side, as in Fig. 4B. As expected, titration of increasing amounts of ZFY(F7-F13) protein into the “SSS vs. SSS” binding reactions yielded consistent linear correspondence between FAM- and TAMRA- probes on either side, establishing that the placement of attached fluorophores has no effect on binding. For “SNS vs. SSS”, “NSS vs. SSS”, and “NNN vs. SSS” experiments on both sides, we obtained the same energy values matching the Spec-seq results well (Fig. 4E). Clearly for the NSS and NNN cases, because of lack of a core GGCC site to form a stable complex, both sequences showed almost the same affinities that were at least 3kT worse than the SSS reference site (Fig. 4C, D). However, interestingly, while the “FAM-SSN vs. TAMRA-SSS” experiment detected ~0.7kT energy difference, the reciprocal “SSN-FAM vs. SSS-TAMRA” case reported a significantly larger value (1.3kT). This discrepancy simply cannot be explained by any one-step, all-or-none DNA-binding model, and one plausible explanation is the existence of some intermediate protein-DNA complex state in which only part of the DNA probe is bound by ZFY. Given this, we proposed a multi-state, DNA binding and motif recognition model to explain our observation (Fig. 4F). Assuming for each DNA site, there is some intermediate state that only the core and its closely adjacent regions (positions 1-12) are bound by ZFY, denoted as state 2, 5, 8 respectively in Fig. 4F, then due to the local sensing property of anisotropy, for “FAM-SSN vs. TAMRA-SSS” experiment, we are detecting the binding occupancy difference between ensemble state 2+3 and ensemble state 5+6, whereas for “SSN-FAM vs. SSS-TAMRA” run, we are only monitoring the difference between state 3 and 6. Since state 2 and 5 are in the same energy level, overall smaller binding occupancy difference is observed by fluorophores on the left side than ones on the right side. On the other hand, for “SNS vs. SSS” runs, state 5 and state 8 already carry different sequences and energy values, so we wouldn’t get such asymmetric results. Additionally, we carefully performed conventional single-color anisotropy titrated by increasing ZFY(F7-F13) and observed its dissociation constant in nanomolar range (Fig. S4), which corresponds to at least 16 kT binding energy, thus the ~4 kT specific binding energy only contributes to small part of the overall protein-DNA interactions. Combining Spec-seq and anisotropy data, we are able to construct the energy landscapes for multiple sites including some intermediate states (Fig. 4G). The ~2 kT difference between fully occupied SSS and partially occupied SSS is derived using “FAM-SSS vs. TAMRA-SS” data with detail in Supplemental Info. Thermodynamically, many protein-DNA complex conformations should exist at room temperature, but most of them fall in the non-specific range. It is usually referred as multi valent binding in other areas [53], but has not been explored for ZFPs before.
ZFY accessory fingers incrementally increase the stability of the protein-DNA complex
Besides binding energy measurement, we directly measured the intrinsic dissociation rate for various ZFY truncation constructs interacting with different DNA sequences using fluorescence anisotropy (Fig. 5A). Each experiment assessed binding to one of the four versions of hZFY containing different subsets of fingers (F9-F13, F7-F13, F5-F13 and F1-F13 (full-length)) interacting with one of the four dsDNA probes of different lengths and sequences in the presence of a long consensus unlabeled ‘competitor’ probe. After reactions reach equilibrium, a 200-fold excess of the unlabeled competitor DNA is added and the anisotropy values of the FAM-DNA probes are monitored over time to visualize dissociation processes (Fig. 5B and 5C). The changes in mean lifetime for full-length hZFY to the different sequences are consistent with the Spec-seq measurement, for example SSS and SSN have 0.7kT energy difference measured by Spec-seq, corresponding to 1,176s and 656s of mean life time respectively. This means that binding affinity differences between different forms of protein-DNA complex are primarily driven by difference in dissociation rates (Fig. 5C, E). The full length ZFY protein displayed the longest mean lifetime of about 50mins, with each additional truncation of the protein, the mean lifetime are reduced down to about 15 minutes for fingers 9-13 (Fig. 5B and 5D). When the shorter SSS probe is bound by hZFY-full, the mean lifetime also drops to below 20mins (Fig. 5C and 5D). These results show that the additional fingers of hZFY contribute to binding affinity by stabilizing the complex, even fingers that do not appear to contribute to specificity.
CTCF fingers 10-11 recognize an upstream motif conditionally with variable spacing to the core sequence
Besides ZFY, do other tandem ZFPs recognize their motif in such a non-independent way? The CTCF insulator protein is composed of 11 tandem zinc fingers that have identical contact residues in humans and mice. Previous ChIP-chip and ChIP-seq work [31, 54, 55] identified a 14nt core motif CCnnnAGGGGGCGC, recognized by fingers 7 to 3, as in Fig. 6A, B. Recent work [56] using ChIP-seq, ChIP-exo, and DNaseI-seq combined with site-directed mutagenesis reported an extra “upstream motif” and “downstream motif” with a variable 5-6 nt distance to the core sequence (Fig. 6C, downstream motif not included). According to their analysis, among 48,137 detected ChIP-seq peaks, only 31,474 peaks contain the core sequences alone with no flanking motifs, and ~ 6,000 peaks have the upstream motif 5 or 6nt away from the core sites. We hypothesized that this flanking upstream motif may also represent a ‘conditional motif’ that depends on the proper recognition of CTCF to its core site and can enhance the overall protein-DNA complex binding.
To test this, we designed 4 Spec-seq libraries (R1, R2, R3 and R2L (Fig. 6D)) to assay the contributions of this upstream motif CTCF’s specificity by testing the effects of the presence or absence of this motif at 5 and 6 nt spacings in the presence and absence of variations within the core at positions 2 and 6 (Y=C or T; M = A or C). Consistent with previous results, we identify an optimal upstream motif with preferred sequence TGCAGTACCC and a preferred spacing of 6 nt from the core motif (Fig. 6E, F). Fig. 6H shows that when the core sequence is intact (state C), the upstream motif can further enhance the DNA binding by ~1.7kBT (U5C vs. N5C, U6C vs. N6C), but for mutated core sites, the energy differences are negligible (U5N vs. N5N, U6N vs. N6N). We also did Spec-seq on truncated mouse CTCF(F1-F9) and found no effect of the upstream motif, proving that fingers 10 and 11 are necessary for the upstream recognition.
Next, we asked whether the presence of the secondary motif could be detected by Affinity-seq. We profiled binding of the cloned mouse CTCF ZF array (which is identical to the human ZF array) to mouse genomic DNA. In total, we detected 192,670 CTCF peaks in the mouse genome at p<0.01. Analysis of all data with MEME reveals only the core motif. Because previous work reported that the secondary motif was found in only a fraction of the ChIP-exo data [56], we subdivided the 35,370 top activity peaks into 249 bins of decreasing peak intensity and ran MEME on each bin separately. Interestingly, the top 60 bins containing 4518 peaks showed strong match to the core motif only. The secondary motif was found predominantly in peaks with moderate peak intensities and of those, 4856 peaks, or 13.85%, showed secondary motif presence. These results suggest that secondary ‘conditional motifs’ may provide a method by which cells can compensate for core mutations to maintain overall high levels of binding.
Iterative analysis of Chip-exo data with prefixed core site reveals extended motif of human ZNF343
Our specificity analysis for ZFY and CTCF suggests many long ZFPs that are currently found to have short motifs may actually have extended motifs underrepresented in current motif discovery program. For instance, human ZNF343 is a 12-finger long KRAB ZFP, yet RCADE analysis of published ChIP-exo data reveal only a 6-nt long consensus site (GAAGCG) [30, 57], as in Fig. 7C. The motif prediction model (Fig. 7B) suggests this hexamer motif is most likely recognized by fingers 2-1. With this prior knowledge, we identified 3,237 GAAGCG sites within the reported 4,532 Chip-exo peaks. Using these sites as the “anchor point”, we then aligned, mapped, and counted all the ChIP-exo reads near these core sites (Fig. 7F). There is a significant reverse peak signal at position 7, confirming that this hexamer is indeed recognized by the N-terminal fingers 1-2, but for the forward ChIP-exo signals, the peaks are quite broad and ambiguous. Ideally, the ChIP-exo read count at each site should be proportional to the binding occupancy of that site; if the in vivo ZNF343 protein concentration is low enough, reads should further be proportional to the binding affinity. Based on these assumptions, we calculated the negative logarithmic ratio of ChIP-exo read counts to yield relative binding energies for data regression and motif analysis (Fig. 7D). This analysis revealed an extended motif in positions – 8 to 0 that is very similar to the published result of HT-SELEX [58] (Fig 7E). We also group each binding site based on its number of mismatches to a putative consensus sequence GCCNNGGTGAAGCG and count the Chip-exo reads distribution for each group (Fig. 7G). The consensus site has the highest Chip-exo signal, consistent with the case for the longer motif.
Discussion
To tackle this “long fingers but short motifs” paradox, we must first ask, is it true that long ZFPs really have short motifs? There are some technical reasons that hinder long motif discovery. First, most existing techniques like PBM, ChIP-seq, and Affinity-seq have limited resolving power for weak binding sites. As suggested by one reviewer, for ChIP-seq work on human genomic DNA, each 15-mer will show up only once on average, which makes the discovery of longer motifs increasingly difficult. Second, the irregular nature of extended motifs like those seen for ZFY, where each finger does not appear to interact with the ‘expected’ three bases, can also inhibit motif detection when searches are guided by motif prediction methods. The CTCF upstream motif, which is outside the border of the predicted motif, was recently shown to be conferred by some irregular structural configuration [59]. A visual comparison between the predicted motifs and ChIP-seq results for 131 human ZFPs with reported motifs suggests that such irregular motifs may be quite common: for example, ZNF140, ZNF324, and ZNF449 all likely contain some irregular motifs with fewer than three base-pairs for some internal fingers (Supplemental Figure S3). Lastly, for ZFY and CTCF, the secondary motif is harder to detect when it is conditioned on the presence of primary motif. In the CTCF case, the upstream motif has significantly smaller energy amplitude (±0.6kT/position) than the core motif (±2kT/position), and the variable spacing between the two motifs also makes it harder for conventional motif programs to discover the upstream one.
Given those reasons, we think better motif discovery algorithms should take into account these intrinsic properties of long ZFPs, including irregular motif structure, variable spacing and non-independent recognition between sub-motifs. Because the non-additive dependence of secondary motif on the core site, it is computationally more effective to identify the core motif first and use it as anchor site to analyze flanking regions to extract more motif information for many other ZFPs, as we did for ZNF343. Current ZFP motif prediction and ChiP-seq motif discovery programs all assume independent specificity contribution from each finger, thus we expect some iterative motif discovery program in the future on high-quality data will yield more specificity information.
If some long ZFPs indeed have short motifs compared to their finger numbers, what would be the biological functions of those extra fingers? Very likely many of them are engaged in protein-protein and protein-RNA interactions. Alternatively, we speculate that the tandem array ZF architecture can modulate the TF-DNA complex stability, so that once the TF is bound to its regulatory elements, it can remain there long enough to fulfill its activation or repression function. Only very limited work has been done on this topic previously [60–62]. Here we used fluorescence anisotropy to monitor the in vitro dissociation process of ZFY protein-DNA complex over time, and found that those accessory fingers (F1-F11) incrementally increase the half-lives of the protein-DNA complex, which could be important for fulfilling ZFY’s in vivo biological functions [39]. Alternatively, these secondary motifs make it possible to maintain binding in the face of evolutionary changes to the core motif. When we look at the overall correlation between in vitro binding predictions and in vivo occupancies, the correlations just really aren’t all that satisfactory. At least in some cases, this could be because we’re predicting binding based on the consensus site – but cells have a lot of ways to ‘tune’ binding. Here, we are showing ZFPs can increase overall affinity after a hit to the main binding site by just adding an additional upstream site.
Experimental and Data analysis Procedures
Data census for C2H2-containing proteins in model organisms
All raw protein sequences data were downloaded from InterPro database (IPR013087). For each model organisms, Cys-X2,4-Cys-X12-His-X3,4,5-His was used to search and match individual finger in each protein sequence, and only protein sequences with unique name were included for final census (e.g. PRDM9 was only counted once for human). Processed data was deposited to Mendeley Data.
Construction and expression of recombinant proteins
The coding sequences for human ZFY (Uniprot P08048:408-768) and mouse ZFY1 (Unirprot P10925:390-782) were codon optimized for E. coli expression and synthesized by IDT gBlock service, whereas mouse CTCF (Uniprot Q61164-1:241-583) was cloned from mouse cDNA libraries. After In-Fusion cloning into NEB DHFR control vector with N-terminal hisSUMO tag, the expression and purification procedures were essentially the same as our previous work [45] except that extra heparin column purification is used for anisotropy experiment. The plasmid of hisSUMO-mCTCF(F1-F9) was reused from our previous work is readily available from Addgene(#102859). All constructs, including truncated versions, are listed in Supplemental Table S1.
Affinity-seq procedures
Affinity-seq was essentially done as in [36] with minor adjustments. A ZF array of the protein of interest is amplified then cloned into a universal Affinity-seq vector by recombineering. The resulting construct expresses a fused protein containing 6HisHALO–the 412-511 aa fragment of PRDM9– ZF array of interest. The fused protein is expressed in Rosetta 2 cells at 15°C for 24 h, and partially purified by ion exchange chromatography on SP-sepharose. The purified protein is mixed with genomic DNA sheared to ~200 bp on a Covaris ultrasonicator, and allowed to bind overnight. The protein-DNA complexes are then isolated on HisPur Ni-NTA Resin (Thermo Scientific) preincubated with partially purified prep of the empty tag to reduce the background. DNA is then eluted and used to prepare genomic libraries using TruSeq ChIP Library Prep Kit (Illumina). The libraries are sequenced on HiSeq2500 or NextSeq platform ensuring ~50 mln. reads per library. Data are analyzed using a custom pipeline as described before [36]. Motif analysis was done using MEME software package.
HT-SELEX procedures
For the first round of HT-SELEX, ~200ng dsDNA libraries containing randomized sequence CAGGCCTNNNNNNNN was used for EMSA shift with hisSUMO-hZFY titrated from low to high concentration. Each time only the lane containing lowest amount protein was chosen and the bound portion of DNA (no more than 20% of total DNA) was cut, amplified for next round of SELEX selection enrichment. Since in the first round of HT-SELEX, the most enriched site turned out to be CAGGCCTAGGCGTTG, the DNA library was redesigned as CAGGCCTAGGCGTNNNNNNNN for further HT-SELEX by EMSA separation. Again, each time no more than 20% of total DNA should be in bound state for selection and enrichment analysis.
Spec-seq procedures
The experimental procedures are essentially the same as our previous work, with all binding reactions set up at 1X NEBuffer 4, room temperature, all EMSA performed at 9% Tris-glycine gel, cold room, 200V, 30mins. For ZFY, given the putative consensus site, we designed tandemly randomized dsDNA libraries Rand1-8 for pilot run, and it turned out GGCCTAGTCGTTTTTG has slightly higher affinity than the SELEX-enriched site, so we redesigned dsDNA libraries Rand9-12 including 9N, 9NN to cover some partially non-specific sites listed in Table 1 (main text). We noticed that for some ZFY proteins, particularly ZFY(F11-F13), when the protein concentration is too high, the shifted DNA fragments easily to form protein oligomers or aggregate near the EMSA well, so generally low concentration of protein (<100uM) was used and only monomer ZFY-DNA complex was cut for Spec-seq analysis. By default, all energy matrices are derived by data regression of the binding energy of reference site and all its single variants. For ZFY and CTCF upstream motif, the reference sites are GGCCTAGTCGTTTTTG and TGCAGTNCCN respectively.
Two-color fluorescence anisotropy
All assays are set at 1X NEBuffer 4, room temperature, 40nM FAM- or TAMRA-labeled DNA probes. Nickel and heparin column purified hisSUMO-ZFY (F7-F13) was used for anisotropy experiment. Experimental procedures are same as described in previous work[45].
Dissociation kinetics assay by fluorescence anisotropy
All binding assays are all set in 1X NEBuffer 4, 37° C with 30nM FAM-DNA probe, and in this condition the basal value for FAM-DNA probe without protein is ~15mA. With saturating concentration of ZFY protein added, the anisotropy values can go above 100mA, but in our cases, we titrated low volume of protein (<4% v/v) and the initial values in equilibrium state are slightly above 40mA, therefore in such low occupancy state (<20%), the DNA probe is more likely bound by the protein in assumed specific conformations. After we injected highly concentrated unlabeled competitor DNA (500pM/uL X 2uL) into 100uL binding reactions, the molar ratio between FAM probe and competitor DNA should go below 1:200, after that we measured the anisotropy values with 20s or 40s time interval for up to 90mins.
To measure the intrinsic dissociation rate of protein-DNA complex, we did titration experiments first with different molar ratio of unlabeled competitor DNA into the binding reactions, as in Fig. S1. When the competition ratio is below 1:100, the observed dissociation rate reaches some plateau and cannot increase further, thus we assume it is appropriate to use 1:200 competitor ratio curve to estimate the intrinsic dissociation rate or mean lifetime of the protein-DNA complex.
After setting up the binding reaction for at least 20mins, we assumed the equilibrium state is reached. Slightly to our surprise, with no competitor DNA added, we still observed the anisotropy value slowly decrease over time, probably owing to the steady inactivation or degradation of ZFY protein at 37 °C. To exclude the possibility that the measured dissociation rates are differentially biased by different protein inactivation rates, we measured this inactivation process alone for different proteins, and they all showed very similar inactivation rates, which are significantly slower than our observed dissociation rates (Fig. S2).
To estimate the dissociation rate koff or mean lifetime τ, we fit our data using exponential decay model with following equation:
The base value is usually in the range 15 to 17, and the range parameter depends on the first measured anisotropy value in each experiment, which should not affect the mean lifetime. Each experiment was repeated at least three times to calculate the mean values and standard deviations (as in Table S2). All experiments were performed using TECAN Safire2 instrument, 490nm excitation/525nm emission wavelength.
Motif analysis of human ZNF343 using published Chip-exo data
For human ZNF343, to derive the extended motif beyond the previously identified core motif GAAGCA, we searched this hexamer and found 3,237 intact binding sites within all those 4,532 binding peaks called by Trono lab (NCBI GEO GSE78099). Using this hexamer as the prefixed core (position 1-6), all flanking sequences within ±40bp ranges were extracted and aligned accordingly (from −39 to 46). For each particular site, all those raw Chip-exo reads (NCBI SRA SRX2512772) falling in its neighboring range were mapped based on the their starting positions, either in forward or reverse direction, thus the total Chip-exo reads for this site was calculated as the forward exo reads count (position −39 to +5) plus the reverse exo reads count (position +2 to 46). Ideally the Chip-exo reads count for each site should be proportional to the binding probability of that site, or approximately its binding affinity if the in vivo ZNF343 protein occupancy is low enough, therefore the negative logarithmic ratio of its Chip-exo reads was used as the relative binding energy for data regression and motif analysis. All processed data including binding site reads and box plot analysis was deposited to Mendeley Data.
Author Contributions
Z.Z. designed and performed the HT-SELEX, Spec-seq and anisotropy dissociation experiments in G.D.S. lab, performed two-color anisotropy in P.F. lab, wrote the first draft of manuscript. P.M.P., T.B., and M.W. performed and analyzed the Affinity-seq experiments. G.D.S. and P.F. supervised and revised this work.
Acknowledgement and Funding
We thank Dr. Rafael Casellas for providing the CTCF PWM data for motif comparison purposes. We also thank an anonymous reviewer for helpful suggestions to improve the presentation. This work was supported by NIH grants HG000249 (GDS) and GM078452 (PMP).