Abstract
Type I CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats)-Cas (CRISPR associated) loci provide prokaryotes with a nucleic-acid-based adaptive immunity against foreign DNA1. Immunity involves “adaptation,” the integration of ~30-bp DNA fragments into the CRISPR array as “spacer” sequences, and “interference,” the targeted degradation of DNA containing a “protospacer” sequence mediated by a complex containing a spacer-derived CRISPR RNA (crRNA)1–4. Specificity for targeting interference to protospacers, but not spacers, occurs through recognition of a 3-bp protospacer adjacent motif (PAM)5 by the crRNA-containing complex6. Interference-driven DNA degradation of protospacer-containing DNA can be coupled with “primed adaptation,” ill which spacers are acquired from DNA surrounding the targeted protospacer in a bidirectional, orientation-dependent manner2,3,7. Here we construct a robust in vivo model for primed adaptation consisting of an Escherichia coli type I-E CRISPR-Cas “self-targeting” locus encoding a crRNA that targets a chromosomal protospacer. We develop a strand-specific, high-throughput-sequencing method for analysis of DNA fragments, “FragSeq,” and use this method to detect short fragments derived from DNA surrounding the targeted protospacer. The detected fragments have sequences matching spacers acquired during primed adaptation, contain ~3- to 4-nt overhangs derived from excision of genomic DNA within a PAM, are generated in a bidirectional, orientation-dependent manner relative to the targeted protospacer, require the functional integrity of machinery for interference and adaptation to accumulate, and function as spacer precursors when exogenously introduced into cells by transformation. DNA fragments with a similar structure accumulate in cells undergoing primed adaptation in a type I-F CRISPR-Cas self-targeting system. We propose the DNA fragments detected in this work are products of universal steps of spacer precursor processing in type I CRISPR-Cas systems.
CRISPR interference in the E. coli type I-E system is performed by the Cascade complex, composed of a crRNA and several Cas proteins1,8,9. Initial binding of Cascade to a protospacer results in formation of an R-loop containing an RNA-DNA heteroduplex formed between the crRNA and “target” strand, and extrusion of single-stranded DNA derived from the “nontarget” strand6,8,10–14. Cas3, a single-stranded nuclease and 3’-5’ helicase, is recruited to the Cascade-protospacer complex and cleaves the nontarget strand to initiate unwinding and degradation of the targeted DNA11,13,15. In vitro, Cas3 can translocate on DNA as a component of a larger complex that includes Cascade and the key proteins of CRISPR adaptation, Cas1 and Cas216.
CRISPR adaptation in the E. coli I-E system is mediated by a Casl-Cas2 complex that can facilitate spacer acquisition in the absence of interference, a process termed “naive adaptation”14,17–19. In primed CRISPR adaptation, interference-driven DNA degradation initiated at a “priming protospacer (PPS),” is coupled with acquisition of spacers from DNA in the PPS region2,3,20. One hallmark of primed adaptation is that nearly all PPS-region sequences from which spacers are acquired contain a consensus 5’-AAG-3’/3’-TTC-5’ PAM (PAMAAG)2,3,20. A second hallmark of primed adaptation is that spacer acquisition occurs in a bidirectional, orientation-dependent manner relative to the PAM of the PPS. In particular, the non-transcribed strand of spacers acquired from the PAM-proximal region (upstream) or PAM-distal region (downstream) is derived from the nontarget strand or target strand, respectively21.
Available in vivo models of primed adaptation that contain a plasmid-borne PPS or phage-borne PPS are limited due to difficulties in detecting bidirectional spacer acquisition or by high rates of cell lysis2,3,21. Therefore, to overcome limitations of these systems we constructed a derivative of E. coli K12 with a type I-E CRISPR-Cas locus containing a spacer, SpyihN encoding a crRNA targeting a chromosomal protospacer in the non-essential gene yihN (Fig. 1a; Supplementary Table 1). Induction of cas gene expression in “self-targeting” cells leads to inhibition of cell growth accompanied by an increase in cell length (Fig. 1b). Furthermore, analysis of chromosomal DNA by high-throughput sequencing shows induction of cas gene expression causes a dramatic loss of −300 kb of chromosomal DNA in the PPS region (Fig. 1c, Extended Data Fig. 1a-1b. Supplementary Table 2). Loss of PPS-region DNA is also observed in cells containing a catalytically inactive Cas 1 variant (CaslH208A)22 but is not observed in cells containing a nuclease-deficient Cas3 variant (Cas3H74A)11 or cells in which SpyihN is replaced by a spacer targeting M13 phage (SpM13)10 (Extended Data Fig. 1a. Supplementary Table 3). Similar results are obtained using methods for analysis of doublestranded or single-stranded DNA (Extended Data Fig. 1b, Supplementary Table 2), indicating that interference-driven degradation of both the target and nontarget strands occurs in the selftargeting strain. The results establish that induction of cas gene expression results in interference-driven degradation of PPS-region DNA in the type I-E CRISPR-Cas self-targeting system.
To determine whether interference-driven degradation of PPS-region DNA is coupled with spacer acquisition from PPS-region sequences we analyzed CRISPR arrays by PCR (Fig. 1d). Results indicate ~20% of arrays acquire a spacer in cells in which cas gene expression is induced, while no spacer acquisition is detected in cells in which cas gene expression is not induced (Fig. 1d). Furthermore, no spacer acquisition is detected in cells in which Sp-yith is replaced by SpM13 (Fig. 1d), indicating that spacer acquisition requires interference-driven degradation of PPS-region DNA. Iiigh-throughput sequencing analysis of amplicons derived from arrays that have acquired a spacer indicate the self-targeting system exhibits the defining hallmarks of primed adaptation. In particular, >95% of spacers are acquired from a PAMAAG-containing “protospacer” in the PPS region and, furthermore, spacer acquisition occurs in a bidirectional, orientation-dependent manner characteristi c of the E. coll I-E system21 (Fig. 1e, Supplementary Table 4–5). We conclude that the type I-E CRISPR-Cas self-targeting strain provides a robust in vivo model system for primed adaptation.
It has been proposed that interference-driven DNA degradation produces fragments that serve as spacer precursors in primed adaptation3,23. To test this model, we developed a method for strand-specific, high-throughput-sequencing of DNA fragments, “FragSeq,” and applied this method to identify products of degradation in self-targeting cells undergoing primed adaptation (Fig. 2a, Extended Data Fig. 2–4, Supplementary Tables 6–12 and Methods). Results show accumulation of fragments derived from PPS-region DNA in wild-type cells but not in cells containing inactive variants of Cas 1 or Cas3, or cells in which SpyihN is replaced by SpM13 (Fig. 2a, Extended Data Fig. 3a, Supplementary Table 7). Thus, accumulation of PPS-region-derived fragments in cells undergoing primed adaptation requires the functional integrity of both interference and adaptation.
Analysis of length distributions of the PPS-region-derived fragments indicate they are produced in a bidirectional, orientation-dependent manner reminiscent of spacer acquisition (Fig. 2b). The most abundant nontarget-strand fragments (FragNT) and target-strand fragments (FragT) emanating from the PAM-proximal region of the PPS (upstream) are 32- to 34-nt and 36- to 38-nt, respectively, and the most abundant FragNT and Frag1 emanating from the PAM-distal region of the PPS (downstream) are 36- to 38-nt and 32- to 34-nt, respectively (Fig. 2b). In addition, the relative abundance of complementary 32- to 34-nt and 36- to 38-nt fragments show a positive correlation (Pearson correlation coefficient 0.48, Supplemental Table 11), suggesting the fragments identified by FragSeq represent individual strands of double-stranded DNA products having lengths similar to that of spacers (30-bp). Alignments of the chromosomal sequences associated with the 5’ or 3’ ends of complementary fragments reveals the presence of a consensus 5’-AAG-3’/3’-TTC-5’ PAM derived from sequences associated with the 5’-ends of 32- to 34-nt fragments and the 3’-ends of 36- to 38-nt fragments (Fig. 2c, Supplementary Tables 9,10). Thus, the results of FragSeq suggest cells undergoing primed adaptation accumulate 33- or 34-bp double-stranded DNA fragments containing a 3’-end, 4- or 3-nt overhang derived from excision of a PAM-containing sequence (Fig. 2c). Furthermore, the relative abundance of these fragments and spacers acquired during primed adaptation that have an identical sequence shows a positive correlation (Pearson correlation coefficient 0.5-0.6, Supplemental Table 12). Accordingly, the results strongly suggest the fragments accumulating in cells undergoing primed adaptation are products of an intermediate step between protospacer selection and spacer integration.
To directly test whether the PPS-region-derived fragments detected by FragSeq serve as substrates for spacer integration we performed a “prespacer efficiency assay24” (Fig. 3a). We tested synthetic mimics corresponding to the most abundant PPS-region-derived fragments (Fig. 3b, Supplementary Table 13). Results show that 33- or 34-bp synthetic mimics containing a 3’-end, 4- or 3-nt overhang on the PAM-derived end, respectively, and a blunt PAM-distal end were integrated into arrays with an efficiency similar to a control fragment containing a consensus PAMAAG (~10% prespacer efficiency; Fig. 3b, Supplementary Tables 14, 15). In addition, the synthetic mimics and PAMAAG-containing control fragment were integrated in a “direct” orientation with the G:C of the PAM positioned adjacent to the first repeat in the array (Fig. 3, Supplementary Table 15). Introduction of a 5’-end, 1-nt overhang on the PAM-distal end reduced prespacer efficiency by ~45-fold, to a value similar to that of a 33-bp control fragment without a PAM (Fig. 3b, Supplementary Table 15). The results establish that PPS-region-derived fragments containing a 3’-end overhang on the PAM-derived end and blunt PAM-distal end function as efficient spacer precursors.
In prior work we developed an E. coli strain that provides a model system for studies of self-targeting by the type I-F CRISPR-Cas system from Pseudomonas aeruginosa25 (Extended Data Fig. 5a). Compared with the orientation bias in spacer acquisition observed in type I-E systems, orientation bias in type I-F systems is reversed. In particular, the nontranscribed strand of spacers acquired from the PAM-proximal region of the PPS (upstream) or PAM-distal region of the PPS (downstream) are derived from the target strand or nontarget strand, respectively in type I-F. To determine whether spacer precursors could be detected in the type I-F system we performed FragSeq analysis in cells undergoing primed adaptation (Extended Data Fig. 5b, Supplementary Tables 17–21). Similar to the type I-E system, we detect accumulation of spacer-sized double-stranded DNA fragments containing a 3’-end. 5-nt overhang on the PAM-derived end and a blunt PAM-distal end (Extended Data Fig. 5b). Thus, in spite of exhibiting opposite orientation bias in spacer acquisition, primed adaptation in type I-E and type I-F systems involves generation of spacer precursors with a similar structure (Extended Data Fig. 5c).
In summary, we have identified spacer precursors produced as products of an intermediate step (or steps) between protospacer selection and spacer integration for type I-E and type I-F CRISPR-Cas systems. Accumulation of spacer precursors in the type I-E system requires the functional integrity of components of interference and adaptation (Fig. 4) indicating that protospacer selection involves coordination between the interference machinery and adaptation machinery (Fig. 4a). Strikingly, spacer precursors detected during primed adaptation in both type I-E and type I-F systems share an asymmetrical structure characterized by a 3’-end overhang on the PAM-derived end. Thus, we propose that spacer precursors detected in this work are products generated during universal steps of prespacer processing that occur in all type I CRISPR-Cas systems. We further propose that the asymmetrical structure of the spacer precursors detected in this work is a key determinant of the sequential integration of prespacers into the CRISPR array (Fig. 4b). In addition, the FragSeq method reported in this work should be applicable, essentially without modification, to identify spacer precursors that form in vivo in any CRISPR-Cas system.
Contributions
A. A. S, I. O. V., B. E. N., K. S. and E. Semenova designed the experiments.
A. A. S., E. Savitskaya, K. A. D., I. O. V., I. F., N. M., A. M., A. S. and E. Semenova performed the experiments.
A. A. S. and E. Savitskaya analyzed high-throughput sequencing data.
A. A. S., B. E. N., K. S. and E. Semenova wrote the manuscript.
Competing Interests
The authors declare no competing interests
Methods
Bacterial strains and plasmids
The E. coli strains used in this study are listed in Supplementary Table 1. Red recombinase-mediated gene-replacement technique was used to obtain strains KD403, KD518 and KD75329.
Plasmid pCDF-Ec.cas1/2 was constructed as described previously4. Plasmids pCas and pCsy were described earlier25.
Growth conditions during CRISPR interference and primed adaptation assays
For analysis of CRISPR-mediated self-targeting by the type I-E system, overnight culture of KD403 strain grown at 37°C in LB medium was diluted 100-fold into 10 ml fresh LB and incubated at 37°C until OD600 reached 0.3. Culture was divided into two portions, cas genes inducers, IPTG and L-(+)-arabinose, were added at 1 mM concentration to one portion, and cultures with and without inducers were incubated at 37°C for 7 hr. At various time points postinduction, cells were plated with serial dilutions on 1.5% LB agar plates for counting colony forming units (CFU) or were monitored using fluorescent microscopy.
In assays using strains KD403, KD518, KD753 and KD263 that were followed by sequencing of total genomic DNA, short DNA fragments or newly acquired spacers, similar conditions of culture growth and cas genes induction were applied, except that overnight cultures were diluted 100-fold in 100 ml LB and grown at 30°C. Five hours postinduction, 10 ml of cells were pelleted by centrifugation at 3000 × g for 5 min at 4°C, washed with 10 ml of PBS, pelleted by centrifugation at 3000 × g for 5 min at 4°C and resuspended in 1 ml of PBS. Cells were divided into 125 μl aliquots and stored at −70°C before they were used for DNA isolation.
For analysis of short DNA fragments generated during self-targeting by the type I-F system, cultures of strain KD675 transformed with plasmids pCas and pCsy were grown at 37°C in LB supplemented with 100 μg/ml ampicillin and 50 μg/ml spectinomycin. Overnight cultures were diluted 200-fold into 10 ml of LB without antibiotics, grown at 37°C until OD600 reached 0.3 and supplemented with 1mM IPTG and 1mM L-(+)-arabinose. Cells were harvested 24 hr postinduction and prepared for DNA isolation as described above for strains KD403, KD518, KD753 and KD263.
Fluorescence microscopy
Cultures grown with or without induction of cas gene expression were analyzed using a LIVE/DEAD viability kit (Thermo Scientific) at 5 hr after induction. Viable cells in each culture were detected by addition of 20 μM SYT09, green fluorescent dye that can penetrate through intact cell membranes. Non-viable cells in each culture were detected by addition of 20 μM propidium iodide dye, which cannot enter viable cells. Sample chambers were made using a microscope slide (Menzel-Gläser) with two strips on the upper and lower edges formed by double-sided sticky tape (Scotch TM). To obtain a flat substrate required for high-quality visualization of bacteria, a 1.5% agarose solution was placed between tape strips and covered with another microscopic slide. After solidification of the agarose the upper slide was removed and several agarose pads were formed. 1 μl of each cell suspension (with and without induction) was placed on an agarose pad. The microscopic chamber was sealed using coverslip (24 × 24 mm, Menzel-Gläser).
Fluorescence microscopy was performed using Zeiss AxioImager.Z1 upright microscope. Fluorescence signals in green (living cells) and red (dead cells) fluorescent channels were detected using Zeiss Filter Set 10 and Semrock mCherry-40LP filter set respectively. Fluorescent images of self-targeting cells were obtained using Cascade II:1024 back-illuminated EMCCD camera (Photometries). Microscope was controlled using AxioVision Microscopy Software (Zeiss). All image analysis was performed using ImageJ (Fiji) with ObjectJ plugin used for measurements of cell length30.
High-throughput sequencing of total genomic DNA
Total genomic DNA was purified by GeneJET Genomic DNA Purification Kit (ThermoFisher Scientifc). Sequencing libraries were prepared either by NEBNext® Ultra™ II DNA Library Prep Kit for Illumina (NEB) or by Accel-NGS® 1S Plus DNA Library Kit (Swift Biosciences) and sequenced on a NextSeq 500 platform.
Raw reads were analyzed in R with ShortRead and Biostrings packages31. Bases with quality < 20 were substituted with N and the reads were mapped to the KD403 reference genome using Unipro UGENE platform32. Bowtie2 was used as a tool for alignment with end-to-end alignment mode and 1 mismatch allowed33. The BAM files were analyzed by Rsamtools package and reads with the MAPQ score equal to 42 were selected and used for downstream coverage analysis34. Mean coverage over nonoverlapping 1 kb bins was calculated and normalized to the total coverage (the sum of means).
High-throughput sequencing of spacers acquired during primed adaptation
Cell lysates were prepared by resuspending cells in water and heating at 95°C for 5 min. Cell debris was removed from lysates by centrifugation at 16 × g for 1 min. For analysis of spacer acquisition in strains KD263 and KD403 lysates were used in PCR reactions containing primers LDR-F2 (ATGCTTTAAGAACAAATGTATACTTTTAG) and Ec.min_R (CGAAGGCGTCTTGATGGGTTTG) (25 cycles, Ta = 52°C). Reaction products were separated by agarose gel electrophoresis (Fig. 1d). To obtain amplicons derived from extended CRISPR arrays in strain KD403 PCR reactions were performed using primers LDR-F2 (ATGCTTTAAGAACAAATGTATACTTTTAG) and autoSP2_R (AATAGCGAACAACAAGGTCGGTTG) (30 cycles, Ta=52°C). Reaction products were separated by agarose gel electrophoresis and the amplicon derived from the extended array was purified from the gel using a GeneJET Extraction Kit (ThermoFisher Scientifc) and sequenced on a NextSeq 500 system.
Bioinformatic analysis was performed in R using ShortRead and Biostrings packages31. Bases with quality < 20 were substituted with N and spacer sequences were extracted from the reads containing two or more CRISPR repeats. Spacers of length 33 bp were mapped to the KD403 genome to identify 33-bp “protospacer” sequences with 0 mismatches. Spacers that aligned to one position in the chromosome were used to determine protospacer distribution along the genome. Spacers arising from protospacers due to potential ‘slippage’ or ‘flippage’ were removed from analysis35 (Supplementary Tables 4, 5).
Prespacer efficiency assay
Competent cells were prepared from a culture of BL21-AI cells containing plasmid pCDF-Ec.cas1/2 grown in the presence of 13 mM arabinose and 1 mM IPTG to induce expression of cas1 and cas2. Complementary oligonucleotides (Supplementary Table 13) were introduced into cells by electroporation as described previously24. Lysates from cell cultures were generated and used in PCR reactions containing a primer complementary to the leader sequence (GGTAGATTGTGACTGGCTTAAAAAATC) and a primer complementary to the preexisting spacer in the array (GTTTGAGCGATGATATTTGTGCTC), respectively. Amplicons corresponding to extended and nonextended CRISPR arrays were isolated using GeneJET PCR Purification Kit (ThermoFisher Scientifc) and sequenced on a NextSeq 500 platform. Bioinformatic analysis was performed in R using ShortRead and Biostrings packages31. Bases with quality < 20 were substituted with N and reads containing the leader-repeat junction were further analyzed (sequence spanning the last 21 nucleotides of the leader and the whole repeat were searched for with 6 mismatches allowed). The reads containing two repeats were considered expanded. Newly acquired spacers were extracted from the expanded reads and mapped to the genome, plasmid and transforming oligonucleotide sequence with 2 mismatches allowed. 33 bp oligo-derived spacers that were cut between AA and G before integration were considered as properly processed. For simplicity only properly processed oligo-derived spacers inserted into the CRISPR array in direct (GCCCAATTTACTACTCGTTCTGGTGTTTCTCGT) or reverse (ACGAGAAACACCAGAACGAGTAGTAAATTGGGC) orientation were included into analysis.
Isolation of DNA fragments generated in vivo
Total genomic DNA was isolated from cultures of strains KD403, KD518, KD753, KD263 and KD675 by collecting 1.25 ml of cell suspensions by centrifugation, resuspending cells in 125 μl of PBS, adding 2 ml lysis buffer (0.6% SDS, 12 μg/ml proteinase K in lx TE buffer) and incubating at 55°C for 1 hr. Two milliliters of phenol:chloroform:isoamyl alcohol (25:24:1) (pH 8) were added to the lysate, the solution was gently mixed, and the aqueous and organic phases separated by centrifugation at 7000 × g for 10 min at room temperature. The upper aqueous phase containing total genomic DNA was collected and the residual phenol was removed by the addition of 2 ml of chloroform:isoamyl alcohol (24:1). The solution was gently mixed, centrifuged at 7000 × g for 10 min at room temperature, the upper DNA-containing fraction was transferred into fresh tube, 0.2 M NaCl, 15 μg/ml of Glycoblue (Invitrogen) and two volumes of cold 100% ethanol was added, and the solution was incubated at −80°C overnight. Precipitated DNA was recovered by centrifugation at 21000 × g for 30 min at 4°C. Pellets were washed twice with 80% ethanol, resuspended in 200 μl of 1x TE buffer, and treated with 1 mg/ml RNase A at 37°C for 30 min to remove residual RNA. DNA was isolated by phenol:chloroform:isoamyl alcohol extraction and ethanol precipitation as described above.
DNA fragments < 700 bp in length were isolated from 9 μg of total genomic DNA using a Select-a-Size DNA Clean & Concentrator kit (Zymo Research) according to manufacturer’s recommendations. To ensure the binding of fragments <50 bp to the column filter, the volume of 100% ethanol added to the fraction prior to on-filter purification was increased from 290 μl to 600 μl. DNA fragments were eluted with 2 × 50 μl of elution buffer, pooled aid purified by ethanol precipitation. 100 μl DNA was mixed with 10 μl of 3 M NaOAc (O.lxV), 1 μl of 10 mg/ml glycogen (0.01xV), and 330 μl of 100% ethanol, vortexed, and incubated overnight at −80°C. DNA was recovered by centrifugation at 21000 × g for 30 min at 4°C. Pellets were washed 3 times with 80% cold ethanol, air dried for —5 min, and resuspended in 5 μl of nuclease-free water.
High-throughput sequencing of DNA fragments: FragSeq
The DNA oligo i116 that served as a 3’ adapter was adenylated using 5’ DNA Adenylation Kit (NEB), purified by ethanol precipitation as above and diluted to 10 μM with nuclease-free water.
DNA fragments < 700 bp (in 5 μl water) were heat-denatured at 95°C for 5 min, cooled to 65°C, and mixed with 0.5 μM adenylated oligo i116, 1x NEBuffer 1, 5 mM MnC¾, and 10 pmol of thermostable 5’ App DNA/RNA ligase (NEB) in 10-μl reaction volume. The mixture was incubated at 65°C for 1 hr, heated at 90°C for 3 min, and cooled to 4°C on ice. Ligated products were combined with lx T4 RNA ligase buffer, 12% PEG 8000, 10 mM DTT, 60 μg/ml BSA, and 10 U of T4 RNA ligase 1 (NEB) in a 25-μl reaction volume. The reaction was incubated at 16°C for 16 hr, 25 μl of 2x loading dye was added, and products were separated by electrophoresis on 10% 7 M urea slab gels (equilibrated and run in lx TBE buffer). The gel was stained with SYBR Gold nucleic acid gel stain, bands were visualized on a UV transilluminator, and products of —40 to —500 nt were excised from the gel and recovered as described in Vvedenskaya et al.36. The excised gel slice was crushed, 400 μl of 0.3 M NaCl in 1x TE buffer was added, and the mixture incubated at 70°C for 10 min. The eluate was collected using a Spin-X column. After the first elution step the elution procedure was repeated, eluates were pooled, and DNA was isolated by ethanol precipitation and resuspended in 15 μl of nuclease-free water.
Next, the 3’ adapter-ligated DNA fragments were adenylated using 5’ DNA Adenylation Kit (NEB) in a 20-μl reaction following the manufacturer’s recommendations. Nuclease-free water was added to 100 μl, DNA fragments were purified by ethanol precipitation and resuspended in 5 μl of nuclease-free water. The two-step ligation procedure described above was repeated using 5 μl of adenylated 3’-ligated DNA fragments, 0.5 μM of barcoded oligos i 112, i113, i114 or i115 that served as 5’ adapters (barcodes were used as internal controls; Supplementary Table 22), 10 pmol of thermostable 5’ App DNA/RNA ligase at the first ligation step, and 10 U of T4 RNA ligase 1 at the second ligation step. Reactions were stopped by addition of 25 μl of 2x loading dye, and products were separated by electrophoresis on 10% 7 M urea slab gels (equilibrated and run in lx TBE buffer). DNA products of ~70 to ~500 nt in size were excised and eluted from the gel as described above, isolated by ethanol precipitation, and resuspended in 20 μl of nuclease-free water.
To amplify DNA, 2 to 8 μl of adapter-ligated DNA fragments were added to a mixture containing lx Phusion HF reaction buffer, 0.2 inM dNTPs, 0.25 μM Illumina RP1 primer, 0.25 μM Illumina index primer and 0.02 U/μl Phusion HF polymerase in a 30-μl reaction (Supplementary Table 23). PCR was performed with an initial denaturation step of 30 sec at 98°C, amplification for 15 cycles (denaturation for 10 sec at 98°C, annealing for 20 sec at 62°C and extension for 15 sec at 72°C), and a final extension for 5 min at 72°C Amplicons were isolated by electrophoresis using a non-denaturing 10% slab gel (equilibrated and run in lx TBE). The gel was stained with SYBR Gold nucleic acid gel stain and species of—150 to —300 bp were excised. DNA products were eluted from the gel with 600 μl of 0.3 M NaCl in 1xTE buffer at 37°C for 3 hr, purified by ethanol precipitation, and resuspended in 25 μl of nuclease-free water. Barcoded libraries were sequenced on Illumina NextSeq 500 platform in high output mode.
Bioinformatic analysis was performed in R using ShortRead and Biostrings packages31. Bases with quality < 20 were substituted with N. After adapter trimming, all reads were compared to each other to reveal clusters of overamplified reads containing the same insert and combination of unique molecular identifiers conjugated to adapters. For each cluster a consensus sequence was extracted and used together with non-overamplified reads for further alignment to KD403 reference genome with 2 mismatches allowed. Only reads with a length 16 to 100 nt uniquely aligned to the genome were further analyzed (Extended Data Figure 4). Logos were generated using ggseqlogo package37.
Code availability
All codes used in this study are available to editors and reviewers upon request from the corresponding author. The code will be made publicly available at publication.
Data availability
Raw sequencing data obtained in this study are available to editors and reviewers upon request from the corresponding author. The data will be made publicly available at publication.
Acknowledgments
The work was carried out using scientific equipment of the Center of Shared Usage «The analytical center of nano- and biotechnologies of SPbPU and Next Generation Sequencing equipment of The Waksman Genomics Core Facility. This work was supported by NIH grant GM10407 (KS), NIH grant GM118059 (BEN), and Russian Science Foundation grant 14-14-00988 (KS).