Abstract
SIGNIFICANCE Many bacterial and archaeal species encode CRISPR-Cas immunity systems that protect against invasion by foreign DNA. In the Escherichia coli CRISPR-Cas system, a protein complex, Cascade, binds 61 nt CRISPR RNAs (crRNAs). The Cascade-crRNA complex is directed to invading DNA molecules through base-pairing between the crRNA and target DNA. This leads to recruitment of the Cas3 nuclease that destroys the invading DNA molecule, and promotes acquisition of new immunity elements. We show that Cascade-crRNA binding to DNA is highly promiscuous in vivo. Consequently, endogenous E. coli crRNAs direct Cascade binding to >100 chromosomal locations. In contrast, target degradation and acquisition of new immunity elements requires highly specific association of Cascade-crRNA with DNA, limiting CRISPR-Cas function to the intended targets.
ABSTRACT In CRISPR immunity systems, short CRISPR RNAs (crRNAs) are bound by CRISPR-associated (Cas) proteins, and these complexes target invading nucleic acid molecules for degradation in a process known as interference. In Type I CRISPR systems, the Cas protein complex that binds DNA is known as Cascade. Association of Cascade with target DNA can also lead to acquisition of new immunity elements, in a process known as priming. The sequence determinants for protospacer binding and interference have been well characterized for Type II CRISPR systems such as the Cas9 system of Streptococcus pyogenes. In contrast, relatively little is known about the requirements for Cascade-DNA binding, interference, and priming in Type I systems. Here, we use genome-scale approaches to assess the specificity determinants for Cascade-DNA interaction, interference, and priming in vivo for the Type I-E system of Escherichia coli. Remarkably, as few as 5 bp of crRNA-DNA base-pairing are sufficient for association of Cascade with a DNA target. Consequently, a single crRNA promotes Cascade association with numerous off-target sites, and the endogenous E. coli crRNAs direct Cascade binding to >100 chromosomal sites. In contrast to the low specificity of Cascade-DNA interactions, >18 bp are required for both interference and priming. Hence, Cascade binding to sub-optimal, off-target sites is inert. Our data support a model in which initial Cascade association with DNA targets requires little sequence complementarity at the crRNA 5′ end, whereas recruitment and/or activation of the Cas3 nuclease, a prerequisite for interference and priming, requires extensive base-pairing.
INTRODUCTION
Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)-Cas (CRISPR-associated) systems are adaptive immune systems found in approximately 40% of bacteria and 90% of archaea (1). CRISPR-Cas systems are characterized by the presence of CRISPR arrays and Cas proteins. CRISPR arrays are genomic loci that consist of short repetitive sequences (“repeats”), interspaced with short sequences of viral or plasmid origin (“spacers”) (2–7). Spacers are acquired during a process known as “adaptation”, in which a complex of Cas1 and Cas2 integrates invading DNA into a CRISPR array, effectively immunizing the organism from future assault by the invader (8). In the archetypal Type I-E CRISPR system of Escherichia coli, immunity occurs by a process known as “interference”. During interference, a CRISPR array is transcribed, and Cas6e processes the transcript into individual, 61 nt CRISPR RNAs (crRNAs) that each include a single, 32 nt spacer sequence flanked by partial repeat sequences (9, 10). Individual crRNAs are then incorporated into Cascade, a protein complex composed of five different Cas proteins (Cse1-Cse2₂-Cas7₆-Cas5-Cas6e) (9, 11). Cascade-crRNA complexes bind to target DNA sequences known as “protospacers” that are complementary to the crRNA spacer, and are immediately adjacent to a short DNA sequence known as a “Protospacer Adjacent Motif” (PAM). The crRNA bound by Cascade forms an R-loop with one strand of the target DNA, which in turn leads to recruitment of the Cas3 nuclease, DNA cleavage, and elimination of the invader (12–21).
For Type I CRISPR systems, adaptation can occur by two mechanisms: “naïve” and “primed”. Naïve adaptation requires only two Cas proteins: Cas1 and Cas2 (8). Primed adaptation, by contrast, requires all Cas proteins and an existing crRNA (22, 23). The molecular details of primed adaptation are not well understood. Priming requires association of Cascade with a target DNA molecule, and the newly acquired spacers correspond to locations on the same DNA molecule as the protospacer (22–25). Some Type I CRISPR systems acquire spacers preferentially in one direction relative to the targeted protospacer, with the majority of spacers in either direction coming from the same DNA strand (22–27). In E. coli, priming is primarily unidirectional, and has been proposed to involve translocation of Cas3 away from the Cascade-bound protospacer (22, 25).
There are conflicting reports on the relationship between interference and priming. Initially, it was proposed that priming occurs only when Cascade-protospacer interactions are sub-optimal and cannot lead to interference, e.g. with a sub-optimal PAM, or with mismatches in the PAM-proximal region of the spacer/protospacer known as the “seed” (22, 24, 28–30). However, more recent studies have shown that at least some protospacers can lead to both interference and priming, indicating that the requirements for interference and priming overlap (31–33).
Prior to interference or priming, the Cascade-crRNA complex must bind to the target protospacer. This requires an interaction between Cse1 and the PAM, and base-pairing between the crRNA and protospacer DNA (13, 16, 19). PAM recognition is required for both Cascade binding and later activation of Cas3 (14, 18, 19, 34). Changes to the optimal PAM weaken Cascade binding to a protospacer (35). Nonetheless, some sub-optimal PAMs are sufficient for interference, albeit with lower efficiency than the optimal PAM (30). Sequences within the crRNA spacer are also required for initial binding of Cascade to a protospacer; mutations in positions 1-5 and 7-8 of the protospacer (the “seed sequence”) reduce the affinity of Cascade for the protospacer (28, 36).
The precise sequence determinants for Cascade binding, interference and priming are unclear. Moreover, association of Cascade with protospacer DNA has not previously been studied in an in vivo context. Here, we use ChIP-seq to perform the first in vivo assessment of Cascade binding to its DNA targets. Our data show that base-pairing between the crRNA and protospacer with as few as 5 nt in the seed region, coupled with an optimal PAM, is often sufficient for Cascade binding. Hence, crRNAs, including those transcribed from the native E. coli CRISPR loci, drive off-target binding at hundreds of chromosomal sites. If Cascade binding to DNA was sufficient for interference or priming, these off-target binding events would be catastrophic for the bacterium. However, we show that near-complete base-pairing between the crRNA and protospacer is required for efficient interference and priming. Thus, under native conditions, the Cascade-crRNA complex samples potential DNA target sites, but limits nuclease activity to protospacers that meet a higher specificity threshold that would only be expected of on-target sites.
RESULTS
An AAG PAM and seed matches are sufficient for Cascade binding to DNA target sites in vivo
Previous studies of Cascade association with protospacer DNA have been in vitro, using purified Cascade and crRNA. To determine the in vivo target specificity of E. coli Cascade, we used ChIP-seq to map the association of Cse1-FLAG3 and FLAG3-Cas5 (FLAG-tagged strains retain CRISPR function; Figure S1) across the E. coli chromosome in Δcas3 (interference-deficient) cells constitutively expressing all other cas genes, and each of two crRNAs that target either the lacZ promoter or the araB promoter (both targets are chromosomal; Figure S2A-B). ChIP-seq data for Cse1 and Cas5 were highly correlated (R² values of 0.93-0.99 for lacZ-targeting cells, and 0.99 for araB-targeting cells), consistent with Cse1 and Cas5 always binding DNA together in the context of Cascade. We detected association of Cascade with many genomic loci for each of the two spacers tested (Figure 1A+B; Table S1). In all cases, the genomic region with strongest Cascade association was the on-target site at lacZ or araB. Off-target binding events occurred with <20% of the ChIP signal of on-target binding. To determine the sequence requirements for off-target Cascade binding with each of the two crRNAs used, we searched for enriched sequence motifs in the Cascade-bound regions, excluding the on-target site (Table S2). For both the lacZ and araB spacers, the most enriched sequence motif we identified was a close match to an AAG PAM, followed by 5 nt of sequence complementarity at the start of the seed region (Figure 1C-D; cf. Figure S2A-B). In some cases, we observed Cascade binding events associated with non-AAG PAMs; however, these sites were more weakly bound, and/or had matches in the seed region beyond position 5. We conclude that as few as 5 bp in the seed region, together with an AAG PAM, are sufficient for Cascade binding, with additional base-pairing in the seed region increasing binding and/or overcoming the need for an AAG PAM.
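The kind of sequence described here (an AAG PAM followed by 5 nt matching the start of the seed) can be illustrated with a minimal Python sketch that scans a DNA sequence for candidate sites on both strands. This is not the motif-discovery procedure used to generate Figure 1. The function names, the toy sequences, and the convention that the PAM is written immediately 5′ of the spacer-matching sequence on the same strand are assumptions made for illustration only.

```python
# Illustrative sketch only: names and sequences are hypothetical.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_candidate_sites(genome, spacer, pam="AAG", seed_len=5):
    """List (strand, start) for every PAM + seed match.

    Assumes the PAM lies immediately 5' of the sequence that matches
    the first `seed_len` nt of the spacer. `start` is the 0-based
    forward-strand coordinate of the leftmost base of the match.
    """
    motif = pam + spacer[:seed_len]
    hits = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        start = seq.find(motif)
        while start != -1:
            if strand == "+":
                hits.append((strand, start))
            else:
                # Convert reverse-complement coordinates to forward ones.
                hits.append((strand, len(genome) - start - len(motif)))
            start = seq.find(motif, start + 1)
    return hits


toy_genome = "TTAAGCTGGATT"   # hypothetical sequence
toy_spacer = "CTGGACCCCCCC"   # hypothetical spacer; the seed starts CTGGA
print(find_candidate_sites(toy_genome, toy_spacer))
```

A plain string search like this finds only exact PAM-plus-seed matches. It does not capture the weaker sites with non-AAG PAMs or longer seed matches noted above.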
Extensive off-target Cascade binding driven by endogenous spacers
We identified several sites of Cascade binding that were shared between cells targeting lacZ and cells targeting araB. These bound regions were not associated with sequences matching the seed regions of either crRNA. We reasoned that these off-target binding events may be due to Cascade association with the endogenous E. coli crRNAs. To test this hypothesis, we performed ChIP-seq of Cse1-FLAG3, as described above, for cells expressing only the endogenous CRISPR RNAs from their native loci. Thus, we identified 188 binding sites for Cascade (Figure 2A; Table S1). These sites were associated with four enriched sequence motifs, with each motif corresponding to an AAG PAM and 5-10 nt matching the seed region of a crRNA from the CRISPR-I array (spacers #1, #3, #4, and #8; Figure 2B; Figure S2C; Table S2). The strongest binding events were associated with spacer #8 of CRISPR-I (Figure 2B; Figure S2C). To confirm that Cascade binding events were due to association with endogenous crRNAs, we repeated the ChIP-seq experiment in cells lacking the CRISPR-I array and cells lacking the CRISPR-II array. Deletion of CRISPR-II had little effect on the profile of Cascade binding (Figure 2C; Table S1). In contrast, deletion of CRISPR-I resulted in loss of Cascade binding to almost all sites bound in wild-type cells (Figure 2D; Table S1). Instead, low-level binding of Cascade was observed at a small number of sites that were associated with a weakly enriched sequence motif corresponding to a perfect PAM and 8 nt matching the seed region of spacer #2 of CRISPR-II (Figure S2D + S3; Table S2).
CRISPR-I spacer #8 is the major determinant of off-target Cascade binding in cells expressing endogenous crRNAs
Our data suggested that the majority of Cascade binding associated with endogenous crRNAs is due to CRISPR-I, and that the dominant spacer from CRISPR-I is spacer #8 (“sp8”). To confirm this, we measured Cascade binding by ChIP-seq in cells lacking CRISPR-I but expressing a plasmid-encoded sp8 crRNA. Most of the Cascade binding sites we observed were identical to those seen in cells expressing both CRISPR arrays, or cells expressing only CRISPR-I (Figure 3A; Table S1), and corresponded to regions containing strong matches to sp8 (orange dots in Figure 3A correspond to regions containing a match to the sp8 motif shown in Figure 2B). As expected, and unlike for cells expressing CRISPR-I, we detected only a single strongly enriched sequence motif (Figure S4A; Table S2). This motif, as expected, corresponds to an AAG PAM and 9 nt matching the seed region of sp8 (Figure S2C). We also detected a weakly enriched sequence motif (Figure S4B; Table S2) that corresponds to an AAG PAM and the 11 nt immediately downstream of the second repeat on the plasmid encoding the sp8 crRNA. This is likely due to formation of a non-canonical crRNA that consists of the sequence between the second repeat and the transcription terminator (Figure S2E). A transcription terminator hairpin has previously been shown to function analogously to repeat sequence in the E. coli crRNAs (37).
The most enriched Cascade target region in cells with CRISPR-I, and cells expressing sp8 crRNA, was inside the yggX gene. We identified a sequence in this region with an AAG PAM and matches to positions 1-5 and 7-10 of sp8 (Figure 3B). We used targeted ChIP-qPCR to measure Cascade binding to this site in cells lacking CRISPR-I but expressing plasmid-encoded sp8. We compared binding of Cascade to yggX in wild-type cells, and cells where the putative protospacer was mutated in the region predicted to bind the sp8 crRNA seed. As expected, we observed greatly reduced Cascade binding at the mutated site relative to the wild-type site. Similarly, we observed greatly reduced Cascade binding at the wild-type site when we expressed a mutant sp8 with changes in the seed region (Figure 3C). However, when we combined the mutant spacer with the mutant protospacer, base-pairing potential was restored, and we observed wild-type levels of Cascade binding (Figure 3C). We conclude that sp8 is the major determinant for off-target Cascade binding in cells expressing endogenous crRNAs.
Off-target Cascade binding events do not affect local gene expression
Cascade binding events can lead to transcription repression by preventing initiating RNA polymerase binding to a promoter, or acting as a roadblock to elongating RNA polymerase within a transcription unit (38, 39). To determine if off-target events driven by endogenous spacers affect local gene expression, we measured global RNA levels using RNA-seq in Δcas3 cells with other cas genes constitutively expressed, with either intact CRISPR arrays or a ΔCRISPR-I deletion. We detected few differences in RNA levels between the two strains, and none of the differences correspond to genes within 1 kb of a Cascade binding site identified by ChIP-seq. We conclude that off-target binding by a Cas3-deficient complex does not impact local gene expression.
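The proximity check described above (whether any differentially expressed gene lies within 1 kb of a Cascade binding site) can be sketched as follows. This is a minimal illustration, not the analysis pipeline used in this study. The gene names, coordinates, and the choice to measure distance from the gene boundaries to a single peak position are assumptions.

```python
# Illustrative sketch only: gene names and coordinates are hypothetical.
def genes_near_sites(genes, sites, window=1000):
    """Return names of genes lying within `window` bp of any binding site.

    genes: list of (name, start, end) tuples with start <= end.
    sites: list of binding-site positions (e.g. ChIP-seq peak centers).
    """
    near = []
    for name, start, end in genes:
        # A site counts as "near" if it falls inside the gene or within
        # `window` bp of either gene boundary.
        if any(start - window <= site <= end + window for site in sites):
            near.append(name)
    return near


changed_genes = [("geneA", 100, 500), ("geneB", 5000, 6000)]
peak_positions = [1400, 9000]
print(genes_near_sites(changed_genes, peak_positions))
```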
Off-target Cascade binding is not associated with interference
Previous studies have suggested that extensive mismatches at the 3′ end of the spacer/protospacer prevent interference (12, 18). To determine whether off-target Cascade binding events lead to interference, we constructed a ΔyggX Δcas3 strain expressing all other cas genes, with both CRISPR arrays intact. We introduced a plasmid with the off-target protospacer from yggX that is an imperfect match to sp8, or an equivalent plasmid with a protospacer that is a perfect match to sp8. We transformed each of these strains with a plasmid expressing cas3, or an equivalent empty vector, simultaneously selecting for retention of the protospacer-containing plasmid. We reasoned that the number of viable transformants with the cas3-containing plasmid would be low for cells where interference caused loss of the protospacer-containing plasmid, since these cells would be killed by the antibiotic selection. In contrast, the number of viable transformants with the empty vector should be high in all cases. Thus, we measured the relative level of interference for each of the two protospacers. As expected, the protospacer that perfectly matches sp8 resulted in highly efficient interference, whereas the protospacer with the native yggX sequence (i.e. imperfect match to sp8) resulted in no detectable interference (Figure 4A). We conclude that off-target Cascade binding events do not cause interference.
Off-target Cascade binding is not associated with priming
The molecular determinants for priming have not been well studied. However, protospacers with multiple mismatches to a crRNA can still result in priming (24), and a recent study suggested that binding of Cascade to a protospacer with extensive mismatches, including in the seed, is sufficient to cause priming (12). To test whether off-target Cascade binding is sufficient for priming, we used the strains described above that contained a plasmid with a protospacer that is either an imperfect or a perfect match to sp8. We then introduced a plasmid with an inducible copy of cas3, under non-inducing conditions, to avoid interference. Following induction of cas3 expression, we harvested cells and PCR-amplified the 5′ end of the CRISPR-II array to determine whether new spacers had been acquired because of priming. We observed robust primed spacer acquisition for the protospacer with a perfect match to sp8, but no detectable spacer acquisition for the off-target protospacer with an imperfect match to sp8 (Figure 4B). We conclude that off-target Cascade binding events do not cause priming.
Strong Cascade binding to protospacers with extensive mismatches at the crRNA 3′ end
To further delineate the protospacer sequence requirements for Cascade binding, interference and priming, we constructed 13 variants of a protospacer that matches sp8. We selected sp8 because it elicits robust Cascade binding, interference, and priming (Figures 3 + 4). The protospacer variants (Figure 5A) included those with (variant i) complete sequence complementarity and an optimal, AAG PAM; (variants ii - iii) non-optimal PAMs: CCG, which is expected to completely abolish Cascade binding (35), and ATT, a sub-optimal sequence previously shown to cause priming but not detectable interference (40); (variants iv - viii) two or three mismatches in the first three positions of the seed; and (variants ix – xiii) stretches of ≥6 nt mismatches at various positions within the protospacer.
We pooled cells containing each of the protospacer variants. We used ChIP of Cse1-FLAG3 in Δcas3 cells to measure association of Cascade with all protospacers within the pool (see Methods). As expected, the protospacer with a CCG PAM (variant ii) had far less Cascade association than did the optimal protospacer (variant i) (Figure 5A). We presume that the level of ChIP signal for the protospacer with the CCG PAM (variant ii) represents the background of this experiment. The protospacer with a sub-optimal, ATT PAM (iii), showed reduced Cascade binding relative to the optimal protospacer (variant i), but was well above the experimental background (Figure 5A). Similarly, mismatches in the seed region (variants iv - viii) resulted in partial or complete loss of Cascade association, depending on the specific sequence mismatch (Figure 5A). Our data for PAM and seed mutants are consistent with earlier studies showing that these sequences are important for Cascade binding (21, 28, 35, 36).
Mismatches in the protospacer from positions 1-6 (variants xi and xii) or 7-20 (variant xiii) abolished Cascade binding (Figure 5A). This is consistent with the observation from our ChIP-seq data that sequence matches in positions 1-8 appear to be required for Cascade binding to off-target sites using sp8 (Figure 2B + S4A). Strikingly, mismatches across positions 25-32 (variant ix) or positions 19-32 (variant x) did not reduce Cascade association relative to the optimal protospacer (variant i) (Figure 5A). In fact, these protospacer variants showed a modest increase in Cse1 association relative to the optimal protospacer (variant i; Figure 5A), suggesting conformational differences in the Cascade-DNA complex when the 3′ end of the crRNA is mismatched with the protospacer.
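To make the position numbering used for the protospacer variants concrete, the following minimal Python sketch introduces a block of mismatches across a range of positions, with position 1 taken as the PAM-proximal base. The source does not state which substitutions were used to create mismatches, so swapping each base for its complement is an assumption, as are the function name and the placeholder 32 nt sequence (which is not the sp8 protospacer).

```python
# Illustrative sketch only: the substitution rule and sequence are assumptions.
SWAP = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_block(protospacer, first, last):
    """Return a copy with positions first..last (1-based, inclusive) changed.

    Each base in the block is swapped for its complement, so every
    position in the block differs from the original sequence.
    """
    if not 1 <= first <= last <= len(protospacer):
        raise ValueError("positions must lie within the protospacer")
    block = "".join(SWAP[base] for base in protospacer[first - 1:last])
    return protospacer[:first - 1] + block + protospacer[last:]


placeholder = "ACGT" * 8                                 # 32 nt placeholder
three_prime_8 = mismatch_block(placeholder, 25, 32)      # cf. variant ix
three_prime_14 = mismatch_block(placeholder, 19, 32)     # cf. variant x
print(three_prime_8)
print(three_prime_14)
```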
Near-complete crRNA-protospacer base-pairing is required for priming and interference
We next determined which of the protospacer variants lead to interference. Using a modification of a previously described assay (see Methods) (24, 40), we measured the level of interference with a plasmid target for each of the 13 protospacers, using Δcas1 cells that cannot acquire new spacers; primed spacer acquisition cannot contribute to the level of interference in these cells. As expected, the optimal protospacer (i) was associated with robust levels of interference, whereas protospacer variants that do not bind Cascade (variants ii, iv, v, xi, xii, and xiii; Figure 5A) were not associated with detectable interference (Figure 5B). Protospacers with PAM and seed variants that showed reduced but not abolished Cascade binding (variants iii, vi, vii, and viii; Figure 5A) were associated with a range of interference levels that correlate well with the level of Cascade binding. However, the ability of protospacers to cause interference did not always correlate with the level of Cascade association. Specifically, we detected no interference for either of the protospacer variants with mismatches only at the 3′ end (variants ix and x; Figure 5B), even though these protospacers bind Cascade at least as well as the optimal protospacer (Figure 5A).
Previous studies have proposed that some protospacers with sub-optimal PAMs or mismatches in the seed region are not subject to detectable interference, but are subject to priming (12, 22, 24, 40). We determined whether the 13 protospacer variants caused priming in a plasmid context. Specifically, we introduced an inducible copy of cas3 into cells containing each of the protospacers on a high-copy plasmid. We then induced expression of cas3, and PCR-amplified the CRISPR-II array to determine whether new spacers had been added. We observed robust primed spacer acquisition for all protospacers associated with interference (variants i, iii, vii, and viii; Figure 5C). By contrast, we observed no spacer acquisition for protospacers that do not bind Cascade (variants ii, iv, v, xi, xii, and xiii; Figure 5C). Strikingly, we observed primed spacer acquisition for two protospacers that were not associated with detectable interference (Figure 5C). One of these protospacers (variant vi) has the seed mismatch with the lowest level of Cascade binding that is above the experimental background (Figure 5A). The other protospacer has mismatches across positions 25-32 (variant ix). Thus, for these protospacers, we detected robust Cascade binding and priming but we were unable to detect interference. For the protospacer with mismatches across positions 19-32 (variant x), we detected no priming. Thus, for this protospacer, we detected robust Cascade binding, but no priming or interference.
DISCUSSION
Base-pairing in the seed region together with an AAG PAM is sufficient for Cascade to bind DNA
Relatively little is known about the sequence determinants for Cascade-DNA binding, interference, and priming. Moreover, no previous studies have measured Cascade binding to protospacer DNA in vivo. Our ChIP data indicate that an AAG PAM and as little as 5 nucleotides of base-pairing at the start of the seed region are sufficient for E. coli Cascade to bind DNA targets. The sequence requirements for protospacer binding in Type II systems are similarly relaxed (41–43). The affinity of Cascade for a protospacer increases as the extent of base-pairing increases, but maximal affinity occurs with no more than an 18 bp match at the 5′ end (Figure 5A).
AAG is the optimal PAM in E. coli
Two previous studies proposed that AAG, GAG, TAG, AGG, and ATG are optimal PAMs in E. coli (24, 44), while another study suggested that AAG, ATG and GAG PAMs were associated with moderately higher affinity Cascade binding than an AGG PAM (35). Our data clearly indicate that AAG is the optimal PAM for off-target sites, with most off-target Cascade binding events being associated with an AAG PAM. Specifically, 65% of Cascade binding sites associated with a detectable motif have an AAG PAM for the crRNAs targeting lacZ and araB, and the plasmid-encoded sp8 crRNA. Moreover, off-target Cascade binding events with higher enrichment scores, suggestive of higher Cascade affinity, were more likely to be associated with an AAG PAM than Cascade binding events with lower enrichment scores (76% vs 61% for the top 20% and bottom 80% of bound regions, respectively, after sorting by Cse1 enrichment level). We hypothesize that the dependence on the PAM for Cascade binding is increased in situations where base-pairing only occurs in the seed region. According to this model, complete or near-complete base-pairing between the crRNA and protospacer would weaken the requirement for an optimal PAM, obscuring differences in PAM affinity. This would explain why previous studies suggested that there are at least three optimal PAMs (24, 35, 44).
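The comparison of AAG PAM frequency between strongly and weakly enriched sites can be expressed as a short calculation. The sketch below sorts bound regions by enrichment, splits them into the top 20% and the remaining 80%, and reports the fraction with an AAG PAM in each group. It is a minimal illustration with invented input values. The function name, the input layout, and the rounding used to place the cut-off are assumptions, not details taken from this study.

```python
# Illustrative sketch only: the example sites below are invented.
def aag_fraction_by_rank(sites, top_fraction=0.2, pam="AAG"):
    """Return (top_fraction_with_pam, rest_fraction_with_pam).

    sites: list of (enrichment, pam_sequence) tuples, one per bound region.
    A group with no members yields None instead of a fraction.
    """
    ranked = sorted(sites, key=lambda site: site[0], reverse=True)
    cut = max(1, round(len(ranked) * top_fraction))
    top, rest = ranked[:cut], ranked[cut:]

    def fraction(group):
        if not group:
            return None
        return sum(1 for _, site_pam in group if site_pam == pam) / len(group)

    return fraction(top), fraction(rest)


example_sites = [(10, "AAG"), (8, "ATG"), (6, "AAG"), (4, "AAG"), (2, "GAG")]
print(aag_fraction_by_rank(example_sites))
```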
Defining the crRNA seed
The seed region of a crRNA has been previously defined as positions 1-5 and 7-8, with position 1 being immediately adjacent to the PAM (28). However, our data suggest that the length of the seed varies between crRNAs, since we observed off-target binding with some crRNAs that requires base-pairing in positions 1-5, whereas off-target binding for other crRNAs requires base-pairing up to position 9 (Figures 1-2, S3-S4). We propose that the crRNA sequence determines the length of the seed, and that this reflects the initial binding mode, prior to extended base-pair formation. Every 6th position of the crRNA is flipped out in the Cascade-crRNA complex, and hence does not contribute to base-pairing (16, 45, 46). Consistent with this, the importance of position 6 for off-target binding is substantially less than that of positions 1-5 (Figures 1-2, S3-S4). Nonetheless, off-target protospacers had a sequence match to the crRNA at position 6 far more frequently than expected by chance (45% for the crRNAs targeting lacZ and araB, and the plasmid-encoded sp8 crRNA; Binomial Test p-value = 2.4e-10). We hypothesize that the initial binding of Cascade to a protospacer includes base-pairing interactions at position 6, but that the complex rapidly transitions to a conformation in which the 6th position is flipped out of the helix. Our data are consistent with an in vitro study of another Type I-E system, where position 6 was also shown to contribute to off-target Cascade binding (47). The apparent requirement for a sequence match at position 6 is not consistent across all crRNAs we tested, suggesting that the pathway towards stable seed base-pairing differs in a sequence-dependent manner.
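The statement that position 6 matches occur more often than expected by chance rests on a binomial test. A minimal sketch of such a test, using only the Python standard library, is given below. The number of sites tested is not stated in this section, so the counts in the usage line are hypothetical and the result is not expected to reproduce the reported p-value. The chance match rate of 0.25 (equal base frequencies) and the one-sided upper-tail form of the test are also assumptions.

```python
# Illustrative sketch only: counts and the 0.25 chance rate are assumptions.
from math import comb


def binomial_upper_tail(successes, trials, p_chance=0.25):
    """Return P(X >= successes) for X ~ Binomial(trials, p_chance)."""
    return sum(
        comb(trials, k) * p_chance ** k * (1 - p_chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical example: 45 of 100 off-target sites match at position 6.
print(binomial_upper_tail(45, 100))
```

With a chance rate of 0.25, a match frequency of 45% becomes increasingly unlikely as the number of sites grows, which is why the number of sites tested matters as much as the percentage itself.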
Interference and priming require near-complete R-loop formation
Although binding of Cascade to a DNA target requires relatively little sequence identity, our data indicate that robust interference and priming require at least 18-25 bp, beginning in the seed region. This is consistent with in vitro data showing that near-complete R-loop formation is required to license Cas3 activity (12, 18). Thus, although Cascade binds DNA promiscuously, functional binding occurs with high specificity. Our data support a previously proposed model in which complete R-loop formation triggers a conformational change in Cascade at the 3′ end of the spacer, which is then transmitted, presumably through Cse2 to PAM-associated Cse1, 5′ to the spacer (18, 48). This change in Cse1 conformation then recruits Cas3, and/or activates the nuclease activity of Cas3, as suggested by a recent structural study (48). In support of this model, we detected higher ChIP signal for Cascade bound to protospacers without complete R-loop formation than those with complete R-loop formation (Figure 5A), suggesting that the conformation of Cascade with respect to the DNA changes upon R-loop completion, moderately decreasing ChIP efficiency.
Evidence that interference and priming are obligately coupled processes
Priming was initially proposed to be an alternative pathway to interference, with optimal PAM/seed sequences leading to interference, and sub-optimal sequences leading to priming (12, 17, 22, 24, 40, 49). However, primed spacer acquisition has been observed in situations where interference occurs, suggesting that priming and interference can be coupled processes (Figure 5, variants i, iii, vii, and viii) (23, 31–33). While these data show that priming and interference can occur at the same time at a population level, they do not necessarily indicate that individual priming and interference events are coupled. Moreover, while it has been proposed that interference and priming are obligately coupled (50), this has not been tested, and there are many examples where primed spacer acquisition has been observed in the absence of detectable interference (12, 22, 24, 31, 40, 49). Our data show that protospacers with seed sequence mismatches can cause detectable priming but not detectable interference when the protospacer is present on a multi-copy plasmid (Figure 5). Strikingly, for protospacers with seed mismatches, the levels of interference and priming correlate well with the level of Cascade binding (Figure 5). We detected primed spacer acquisition but not interference for the weakest-bound seed variant that has above-background levels of Cascade binding (Figure 5, variant vi). This is consistent with the expectation that primed spacer acquisition is a more sensitive readout of Cascade/Cas3 function since (i) it is an irreversible process, and (ii) it does not require destruction of all copies of the plasmid. Our data are consistent with a model in which low levels of interference are undetectable when plasmid replication outpaces plasmid degradation (50). We also observed primed spacer acquisition in the absence of detectable interference for a protospacer with mismatches across positions 25-32 (Figure 5, variant ix). 
We propose that this degree of mismatch at the 3′ end of the crRNA greatly reduces, but does not abolish, the isomerization of Cascade into the “active” state that recruits/activates Cas3.
Extensive, inert, off-target binding of Cascade
Cascade has many off-target binding sites due to its ability to bind DNA with low sequence-specificity. Consequently, the endogenous crRNAs transcribed from the bacterial genome result in extensive off-target binding, even in the absence of an on-target site. Since off-target binding does not involve complete R-loop formation, it has no deleterious effects on genome integrity. We also observed no impact on transcription associated with any of the off-target binding events, despite the fact that targeted Cascade binding is known to repress transcription by occluding promoters or acting as a roadblock for elongating RNA polymerase (38, 39). Transcription repression by Cascade is considerably weaker when targeting within a transcribed region (i.e. acting as a roadblock) (38). Given that the location of off-target Cascade binding sites is essentially random with respect to genome organization, and that genes make up ~90% of the E. coli genome, off-target Cascade binding is expected to be primarily intragenic. This may partly explain the lack of transcriptional impact. Moreover, a recent study showed that the level of repression by Cascade occlusion of a promoter is greatly reduced with as few as 6 bases mismatched at the 3′ end of the spacer/protospacer (51), suggesting that even intergenic off-target Cascade binding sites would be transcriptionally inert. We propose that incomplete R-loop formation results in an unstable Cascade-DNA complex with a relatively high rate of dissociation, such that it cannot compete effectively with initiating or elongating RNA polymerase. Consistent with this model, stable association of Cascade with DNA in vitro has been shown to require near-complete R-loop formation (20). We conclude that Type I CRISPR systems have evolved to tolerate off-target binding driven by the endogenous crRNAs, and are only functional at on-target sites.
Given the length of crRNA spacers in Type I systems, there is no expectation of complete or near-complete spacer-protospacer base-pairing by chance. It is important to note that self-targeting by Type I CRISPR systems has been described previously, but these would be considered “on-target” events, likely caused by acquisition of spacers from the chromosome. As expected for spacers with perfect sequence complementarity, these self-targeting crRNAs are typically functional in gene regulation and interference (52–54).
Not all crRNAs are created equal
The E. coli genome encodes at least 19 crRNAs, yet our data suggest that only four crRNAs contribute to off-target binding of Cascade. All four of these crRNAs are encoded in the CRISPR-I array, and the majority of off-target binding is driven by just one, sp8. The lack of off-target binding driven by CRISPR-II crRNAs is likely due to weak transcription of this array, which is repressed by H-NS (55). In contrast, the CRISPR-I array is likely co-transcribed with the upstream cas genes, which are strongly transcribed in the strain used in this study. The preference for specific spacers within CRISPR-I cannot be explained by differences in expression levels, since the crRNAs are transcribed as a single RNA. Rather, biases in spacer usage are more likely due to differential assembly of specific crRNAs into Cascade. Consistent with this, a previous study surveyed crRNAs associated with Cascade. Spacers #2, #4 and #8 represented 68% of the Cascade-associated crRNAs (9). The cause of this bias is unclear, but may in part be due to differences in RNA secondary structure between spacers, which could impact the efficiency of RNA processing by Cas6e. Consistent with this, RNA secondary structure of repeat sequences, and associated processing by Cas6, has been shown to be impacted by spacer sequences in the Type I-D system of Synechocystis sp. PCC 6803 (56). Nonetheless, it is likely that other factors influence the level of off-target binding, since the relative association of crRNAs for spacers #2, #4 and #8 with Cascade is likely to be similar (9), but sp8 drives a disproportionately high level of off-target binding.
METHODS
Strains and plasmids
All strains, plasmids, oligonucleotides and purchased, chemically synthesized dsDNA fragments are listed in Table S3. All strains are derivatives of MG1655 (57). CB386 has been previously described (38). CB386 contains a chloramphenicol resistance cassette in place of cas3. We removed this cassette using Flp recombinase, expressed from plasmid pCP20 (58), to generate strain AMD536. Epitope-tagged strains AMD543 and AMD554 (Cse1-FLAG3 and FLAG3-Cas5, respectively) were generated using the previously described FRUIT method of recombineering (59). Cse1 was C-terminally tagged in AMD543 by inserting a FLAG3 tag immediately upstream of codon 495 using oligonucleotides JW6364 and JW6365. Tagging of Cse1 resulted in an 8 amino acid C-terminal truncation. We predicted based on phylogenetic comparisons and on structural data (46) that this truncation would not impact the function of Cse1. Cas5 was N-terminally tagged in AMD554 by inserting FLAG3 using oligonucleotides JW6272 and JW6273. LC060 was generated using (i) FRUIT (59) with oligonucleotides JW7537-JW7540 to delete the CRISPR-II locus, (ii) P1 transduction of the CB386 (Δcas3 Pcse1)::(cat::PJ23199) region, (iii) FRUIT (59) to C-terminally tag Cse1 with FLAG3 (as described above for AMD543), and (iv) pCP20-expressed Flp recombinase (58) to remove the cat cassette. LC074 is a derivative of AMD536 in which the CRISPR-I array was deleted using FRUIT (59) with oligonucleotides JW7529 and JW7530 and a synthesized dsDNA fragment (gBlock 14148263; Integrated DNA Technologies). LC077 is a derivative of LC074 in which Cse1 was C-terminally tagged with FLAG3 (as described above for AMD543). AMD566 is a derivative of AMD536 in which Cse1 was C-terminally tagged with FLAG3 (as described above for AMD543). LC099 is a derivative of AMD566 in which the off-target binding site for Cascade in yggX was mutated using FRUIT (59) with oligonucleotides JW7635-8.
LC103 is a derivative of AMD536 in which the yggX gene was replaced with a kanamycin resistance cassette using P1 transduction from the Keio Collection ΔyggX::kanR strain (60). LC106 is a derivative of LC103 with an unmarked, scar-free deletion of cas1 made using FRUIT with oligonucleotides JW7898-JW7901.
Plasmids that express crRNAs targeting the lacZ promoter (pCB380) and the araB promoter (pCB381) have been described previously (38). All other crRNA-expressing plasmids are derivatives of pAMD179. pAMD179 was constructed by amplifying a DNA fragment from plasmid pAMD172 (Integrated DNA Technologies) using oligonucleotides JW6421 and JW6513. This DNA fragment was cloned into pBAD24 (61) cut with NheI and HindIII (NEB) using the In-Fusion method (Clontech). The inserted fragment contains two repeats from the CRISPR-I array, separated by a stuffer fragment containing XhoI and SacII restriction sites, and an intrinsic transcription terminator downstream of the second repeat. To clone individual spacers, pairs of oligonucleotides were annealed, extended, and inserted using In-Fusion (Clontech) into the XhoI and SacII sites of pAMD179 to generate pLC008 (with oligonucleotides JW6518 and JW7911), pLC010 (with oligonucleotides JW6518 and JW7912), and pAMD189 (with oligonucleotides JW7598 and JW7693).
pLC021 and pLC022 are derivatives of pBAD24 (61) containing a protospacer matching the off-target Cascade binding site in yggX (pLC021) or a protospacer with a perfect match to sp8 (pLC022). These plasmids were constructed by annealing and extending pairs of oligonucleotides (JW7913 and JW7914 for pLC021, and JW7924 and JW7925 for pLC022), and cloning the resultant DNA fragments into the EcoRV and SphI sites of pBAD24. pAMD191 is a derivative of pBAD33 (61) that expresses cas3 under arabinose control. To construct pAMD191, cas3 was amplified by colony PCR using oligonucleotides JW7736 and JW7738. The PCR product was cloned into the SacI and HindIII sites of pBAD33 using In-Fusion (Clontech). All protospacers described in Figure 5 are cloned into plasmid pLC020, the “pre-protospacer plasmid”, which is a derivative of pBAD24 (61). pLC020 was generated by cloning the ~500 bp region upstream of E. coli thyA (amplified by colony PCR using oligonucleotides JW8040 and JW8128) and the ~500 bp region downstream of E. coli thyA (amplified by colony PCR using oligonucleotides JW8042 and JW8043) into the EcoRI site of pBAD24 using In-Fusion (Clontech), simultaneously generating a new EcoRI site between the upstream and downstream regions of thyA. The thyA gene was then amplified by colony PCR using a universal forward primer (oligonucleotide JW8129) and each of 13 reverse primers (oligonucleotides JW8130, JW8139, JW8145, JW8169, JW8499-JW8502, and JW8675-JW8679) containing the 13 protospacer variants described in Figure 5. The resulting PCR products were cloned into the EcoRI site of the pBAD24 derivative using In-Fusion (Clontech) to generate plasmids pLC023-pLC035 (see Table S3 for details).
ChIP-qPCR
For all ChIP-qPCR and ChIP-seq experiments, cells were grown overnight in LB, subcultured in LB supplemented with 0.2% arabinose and 100 μg/mL ampicillin (for experiments where a crRNA was expressed from a plasmid) at 37 °C with aeration to an OD600 of ~0.6. AMD566 and LC099 with either pLC008 or pLC010 were used for ChIP-qPCR. ChIP-qPCR was performed as described previously (62), except that 2 μL anti-FLAG M2 monoclonal antibody (Sigma) and 1 μL anti-σ54 monoclonal antibody (NeoClone) were included simultaneously in the immunoprecipitation step. qPCR was performed using oligonucleotides JW7490-1 (amplifies the off-target site in yggX) and JW7922-3 (amplifies the region upstream of hypA). Since σ54 is known not to bind within yggX (63), we were able to normalize binding of Cse1 within yggX to the binding of σ54 upstream of hypA.
ChIP-seq
Strains AMD543, AMD554, LC060, LC077, AMD543/AMD554 with pCB380/pCB381, and strain LC074 with pLC008, were used for ChIP-seq of Cse1-FLAG3 and FLAG3-Cas5. Cells were grown and processed as described for ChIP-qPCR. ChIP-seq was performed in duplicate, following a previously described protocol (64) using 2 μL anti-FLAG M2 monoclonal antibody (Sigma). Sequencing was performed on an Illumina HiSeq 2000 Instrument (Next-Generation Sequencing and Expression Analysis Core, State University of New York at Buffalo) or an Illumina NextSeq Instrument (Wadsworth Center Applied Genomic Technologies Core). ChIP-seq data analysis was performed as previously described (65), with reads mapped to the updated MG1655 E. coli genome (accession code U00096.3). Relative sequence coverage values were calculated by dividing the sequence read coverage at a given genomic location by (total number of sequence reads in the run/100,000). Values plotted in Figures 1A-B, 2A and 2D are the maximum values in 1 kbp regions across the genome. R² values comparing ChIP-seq datasets were calculated by comparing read coverage at peak centers for all peaks identified for the analyzed datasets. Read coverage at peak centers was determined using a custom Python script. Sequence motifs were identified using MEME (version 4.12.0) (66) with default parameters.
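The coverage normalization and 1 kbp summarization described above can be sketched as follows. This is a minimal illustration, not the script used in the study. The function names are hypothetical, and the use of consecutive, non-overlapping windows is an assumption, since the text does not specify how the 1 kbp regions were defined.

```python
def relative_coverage(coverage, total_reads, scale=100_000):
    """Divide per-position read coverage by (total reads in the run / scale)."""
    factor = total_reads / scale
    return [c / factor for c in coverage]


def window_maxima(values, window=1000):
    """Return the maximum value in each consecutive window.

    Non-overlapping windows are an assumption made for this sketch.
    The final window may be shorter than the others.
    """
    return [max(values[i:i + window]) for i in range(0, len(values), window)]


# Toy example: five genomic positions from a run of 200,000 reads,
# summarized in windows of two positions for readability.
normalized = relative_coverage([10, 20, 4, 8, 6], total_reads=200_000)
summary = window_maxima(normalized, window=2)
```

Scaling by the total read count makes values from runs of different sequencing depth comparable before they are plotted or compared at peak centers.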
RNA-seq
RNA-seq was performed in duplicate with strains AMD536 and LC074. Cells were grown overnight in LB, subcultured in LB supplemented with 0.2% arabinose at 37 °C with aeration to an OD600 of ~0.6. RNA was purified using a modified hot phenol method, as previously described (67). Purified RNA was treated with 2 μL DNase (TURBO DNA-free kit; Life Technologies) for 45 minutes at 37 °C, followed by phenol extraction and ethanol precipitation. The RiboZero kit (Epicentre) was used to remove rRNA, and strand-specific cDNA libraries were created using the ScriptSeq 2.0 kit (Epicentre). Sequencing was performed using an Illumina NextSeq Instrument (Wadsworth Center Applied Genomic Technologies Core). Differential RNA expression analysis was performed using Rockhopper (version 2.03) with default parameters (68). Differences in RNA levels were considered statistically significant for genes with q-values ≤ 0.01.
Plasmid transformation efficiency assay
LC103 was transformed with either pLC021 or pLC022. These strains were then transformed with either empty pBAD33 or pAMD191 (expresses cas3), and cells were plated on M9 medium supplemented with 0.2% glycerol, 0.2% arabinose, 100 μg/mL ampicillin and 30 μg/mL chloramphenicol at 37 °C. After overnight growth, colonies were counted, and the ratio of pAMD191-transformed cells to pBAD33-transformed cells was calculated for each of the two strains.
PCR to assess primed spacer acquisition
Primed spacer acquisition was assessed for AMD536 with pAMD191 and either pLC021 or pLC022 (Figure 4B), LC103 with pAMD191 and each of pLC023-pLC035 (Figure 5C), and AMD543/AMD544 with pAMD191 and pAMD189 (expresses a self-targeting crRNA; Figure S1). Cells were grown overnight in LB supplemented with 100 μg/mL ampicillin and 30 μg/mL chloramphenicol at 37°C with aeration, and sub-cultured the next day in LB supplemented with chloramphenicol and 0.2% arabinose at 37°C with aeration for six hours. Cells were pelleted from 1 mL of culture by centrifugation, and cell pellets were frozen at −20°C. PCRs were then performed on the cell pellets, amplifying the CRISPR-II array using oligonucleotides JW7818 and JW7819. PCR products were visualized on acrylamide gels.
Sequence analysis of protospacers from a pooled ChIP library
LC099 with each of the 13 protospacer variant plasmids (pLC023-pLC035) was grown overnight in LB supplemented with 100 μg/mL ampicillin. 10 mL subcultures were grown in LB supplemented with 100 μg/mL ampicillin and 0.2% arabinose at 37°C with aeration to an OD600 of ~0.6. 3 mL from each culture was combined. ChIP was performed on the mixed cultures using 2 μL M2 anti-FLAG monoclonal antibody (Sigma), as previously described (62). A Zymo PCR Clean and Concentrate kit was used to purify ChIP and input DNA. A 50 μL FailSafe (Epicentre) PCR reaction using FailSafe PCR 2X PreMix “C” and 5.48 ng of ChIP DNA was performed following the manufacturer’s instructions, using oligonucleotide JW8567 and each of oligonucleotides JW8537, JW8556, JW8557, JW8558, JW8559, JW8561, JW8562, JW8563, JW8564, and JW8565 (these incorporate different Illumina indices). PCR products were purified and concentrated using 0.8X Ampure Beads (Beckman Coulter Life Sciences) and sequenced on an Illumina MiSeq Instrument (Wadsworth Center Applied Genomic Technologies Core). Sequence reads were mapped to each of the 13 protospacer variants using a custom Python script. For each protospacer, relative abundance in the input and ChIP samples was normalized to the total number of sequence reads. Values for normalized protospacer abundance were further normalized to values from the input sample. Protospacer abundance values are reported relative to those for the optimal protospacer (variant i in Figure 5).
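The three normalization steps described above can be sketched as follows. This is an illustrative reconstruction, not the custom script used in the study. The function name and variant labels are hypothetical, and the sketch assumes that "total sequence reads" means the sum of reads mapped to the 13 variants in each sample and that every variant has a non-zero input count.

```python
def relative_chip_enrichment(chip_counts, input_counts, reference="i"):
    """Compute ChIP enrichment per protospacer, relative to a reference variant.

    chip_counts and input_counts map variant labels to mapped read counts.
    Assumes per-sample totals are the sums of the mapped counts and that
    all input counts are non-zero.
    """
    chip_total = sum(chip_counts.values())
    input_total = sum(input_counts.values())
    enrichment = {}
    for variant, chip_reads in chip_counts.items():
        chip_fraction = chip_reads / chip_total                # step 1: per-sample total
        input_fraction = input_counts[variant] / input_total
        enrichment[variant] = chip_fraction / input_fraction   # step 2: input sample
    reference_value = enrichment[reference]
    # Step 3: report values relative to the reference (optimal) protospacer.
    return {v: e / reference_value for v, e in enrichment.items()}


# Toy counts for two variants.
example = relative_chip_enrichment({"i": 100, "ii": 10}, {"i": 50, "ii": 50})
```

Because the final values are ratios to the reference variant, the choice of per-sample total cancels out of the result; only the per-variant ChIP-to-input ratios matter.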
Measuring interference for a pooled protospacer library
Overnight cultures of LC106 strains with each of the 13 protospacer plasmids (pLC023-pLC035) were grown in LB with 100 μg/mL ampicillin and 30 μg/mL kanamycin. All 13 cultures were combined to make a single subculture by adding 7.7 μL of each strain to a 10 mL culture. Electrocompetent cells were made and transformed with either empty pBAD33 or pAMD191 (pBAD33-cas3). Transformants were plated onto M9 agar supplemented with 0.2% glycerol, 0.2% arabinose, and 30 μg/mL chloramphenicol, and grown overnight at 37°C. Cells were scraped off plates, washed in LB, and protospacers were PCR amplified from cell pellets with oligonucleotide JW8567 and each of oligonucleotides JW8537, JW8558, JW8559, JW8562, JW8563, and JW8566 (these incorporate different Illumina indices). PCR products were purified and concentrated with 0.8X Ampure Beads (Beckman Coulter Life Sciences), and sequenced using an Illumina MiSeq Instrument (Wadsworth Center Applied Genomic Technologies Core). Sequence reads were mapped to each of the 13 protospacer variants using a custom Python script. Individual protospacer abundances were compared between Cas3-expressing cells and cells containing empty pBAD33. Protospacer abundances were normalized to those for the protospacer with a CCG PAM (variant ii in Figure 5).
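The comparison described above can be sketched as a ratio of ratios. This is a minimal illustration under stated assumptions, not the custom script used in the study. The function name and variant labels are hypothetical, and all counts in the empty-vector sample are assumed to be non-zero.

```python
def relative_retention(cas3_counts, empty_counts, reference="ii"):
    """Compare protospacer abundance with and without Cas3 expression.

    For each variant, the read count from Cas3-expressing cells is divided
    by the count from empty-vector cells, and the result is expressed
    relative to the reference variant (here, the CCG PAM protospacer).
    Values below 1 indicate depletion relative to the reference.
    """
    ratios = {v: cas3_counts[v] / empty_counts[v] for v in cas3_counts}
    reference_ratio = ratios[reference]
    return {v: r / reference_ratio for v, r in ratios.items()}


# Toy counts for three variants.
retention = relative_retention(
    {"i": 5, "ii": 100, "iii": 50},
    {"i": 100, "ii": 100, "iii": 100},
)
```

Normalizing each sample to the reference variant first and then comparing the two samples gives the same values, so the order of the two steps does not affect the result.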
ACKNOWLEDGEMENTS
We thank Chase Beisel for sharing strains and plasmids. We thank the Wadsworth Center Applied Genomic Technologies Core Facility and the University at Buffalo Genomics and Bioinformatics Core Facility for Sanger and MiSeq sequencing. We thank the Wadsworth Center Media and Tissue Culture and Glassware Core Facilities. We thank Todd Gray, Keith Derbyshire, and Shailab Shrestha for helpful discussions. This study was supported by NIH Grant AI126416 (to J.T.W.) and a University at Albany, SUNY, RNA Fellowship (to L.A.C.).