Abstract
Cell-free expression systems enable rapid prototyping of genetic programs in vitro. However, current throughput of cell-free measurements is often limited by the use of single-channel reporter assays. Here, we describe DNA Regulatory element Analysis by cell-Free Transcription and Sequencing (DRAFTS), a rapid and robust in vitro approach for multiplexed measurement of transcriptional activities from thousands of regulatory sequences in a single reaction. We employed this method in active cell lysates developed from ten diverse bacterial species. Interspecies analysis of transcriptional profiles from >1,000 diverse regulatory sequences revealed functional differences in gene expression that could be predictively modeled. Finally, we constructed and examined the transcriptional capacities of dual-species “hybrid” cell lysates that can simultaneously harness gene expression properties of multiple organisms. We expect that this cell-free multiplex transcriptional measurement approach will improve genetic circuit prototyping in new bacterial chassis for synthetic biology.
Introduction
The cell envelope is a key physical barrier that compartmentalizes cellular functions and insulates biological systems from the external environment. However, this barrier also limits direct access to the genome and other cellular components, which can greatly stymie genetic engineering efforts and biomolecular characterizations. Cell-free expression systems have long been established as a simplified in vitro approach to overcome these challenges by maintaining cellular processes in the absence of an intact cell membrane1. Cell-free systems, made up of either individually reconstituted cellular components2-4 or cell lysates5,6, can support a variety of catalytic reactions in vitro when supplied with energy sources, cofactors and ions7-9. These cell-free approaches have facilitated the development of therapeutically useful natural products10,11 and biologics with nonstandard amino acids12 or chemical moieties otherwise challenging to synthesize13. They have also been used to generate viable phage particles directly from purified phage DNA5,14,15 and detect pathogens in emerging diagnostics applications16,17. Recently, cell-free transcription-translation (TXTL)18,19 has increasingly gained popularity as a prototyping platform for synthetic biology20,21. By shortcutting time-consuming cloning and transformation steps, TXTL reactions can directly yield proteins and metabolites straight from DNA that encode biosynthetic genes, operons, or pathways. TXTL has also been useful for characterizing synthetic gene circuits with multi-layered components and dynamic behaviors20,22,23. However, the correspondence between in vitro measurements and actual in vivo conditions in live cells remains poorly understood8,24,25.
Traditional approaches to track RNA or protein levels in TXTL reactions rely on fluorescent reporters that have limited sensitivity and multiplexing capacity. In contrast, deep sequencing offers a vastly improved approach to characterize thousands of genetic designs or variants simultaneously, which can be produced from pooled DNA synthesis or mutagenesis methods. Such massively parallel reporter assays have enabled detailed studies of governing regulatory features of gene expression, such as 5’ untranslated region (UTR) structure and codon usage, directly in cell populations26-28. Cell-free studies have also used similar strategies to study transcription from purified RNA polymerases of phage29 and E. coli30. Large-scale and rapid analysis of genetic components can thus greatly shorten the design-build-test cycle for synthetic biology31.
Recently, a number of TXTL systems have been developed from diverse bacterial species32-35. These TXTL systems allow better design and testing of genetic circuits and metabolic pathways in new microbes with unique biochemical capabilities35,36. Cell-free methods have the potential to more rapidly advance the development of non-model strains that lack characterized genetic parts (i.e. promoters, ribosome binding sites) into useful chassis for industrial synthetic biology. Additionally, the use of diverse TXTLs opens the door for systematic assessment of compatibility between circuit components and various chasses. However, the utility of these in vitro approaches is dependent on their correspondence to in vivo conditions, which are not well-established.
Here, we describe a new deep sequencing-based multiplex strategy to rapidly measure the activities of thousands of regulatory sequences in a single cell lysate reaction, called DNA Regulatory element Analysis by cell-Free Transcription and Sequencing (DRAFTS). We first outline a simple pipeline to rapidly develop and characterize cell-free expression systems in new microbes, which we applied to generate active cell lysates from diverse and industrially useful bacteria. Applying DRAFTS to libraries of natural DNA regulatory sequences, we systematically compared transcriptional measurements made in vitro in TXTL versus in vivo in cell populations. We then analyzed transcriptional patterns across ten diverse bacteria to identify common and diverging transcriptional capacities between different species as well as in “hybrid” cell lysates made from multiple species. This study provides a guiding roadmap to generate, optimize, and analyze new cell-free expression systems using massively parallel reporter assays to expand the toolbox for genetic prototyping for synthetic biology and biomanufacturing applications.
Multiplex quantification of 5’ regulatory sequence libraries using DRAFTS
Thanks to recent advances in DNA synthesis, tens to hundreds of thousands of genetic designs can now be cheaply and quickly built in parallel27,37. However, these pooled libraries cannot be easily characterized using traditional TXTL reporter assays (e.g. fluorescence), which can often only support single-channel of measurement. As a result, individual genetic circuits need to be separately generated and tested, which adds significant hands-on burden and limits throughput even in 96-well, 384-well or microfluidic formats35,38. On the other hand, deep sequencing methods are ideally suited for quantitative measurements of nucleic acid species in pooled reactions. A key constraint of multiplex measurements in pooled reactions, however, is that individual species must not cross-interact with one another. Given these considerations, we designed DRAFTS for quantitative transcriptional analyses of DNA regulatory sequences in pooled TXTL reactions by next-generation sequencing (Figure 1a).
To test whether multiplex transcriptional measurements can be carried out in cell-free expression systems, we first performed pooled E. coli TXTL reactions using libraries of DNA regulatory parts (ranging from 234 to 29,250 members) derived from our previous study of natural regulatory sequences37 (Supplementary Data Table 1). From each TXTL reaction, we performed targeted RNA-seq on specific mRNAs, which were enriched by reverse transcription and 3’-end cDNA ligation of sequencing adaptors (Suppl. Fig. S1). The relative RNA abundance of each construct was normalized to its relative DNA abundance, which was determined by DNA-seq of the input regulatory library, to yield the final quantitative measurement of transcription (see Methods). Across different regulatory libraries, transcription levels spanned >5 orders of magnitude and were highly reproducible using the same and separately prepared cell lysates on different days (Suppl. Fig. S2a, Figure 1b). Similar to previous reports37, transcriptional values were consistent across libraries using alternate N-terminal barcodes and downstream genes (sfGFP or mCherry) (Suppl. Fig. S2b,c). We observed that relative transcription levels remained constant across different TXTL reaction times (0.5, 1, 2 or 4 hours) as well as input DNA concentrations (0.025, 0.25, 2.5, or 25 nM) (Suppl. Fig. S2d,e).
To compare TXTL transcription levels with in vivo values, we transformed each library into the same E. coli BL21 strain that was used for cell lysate preparation. We observed that the distribution of DNA libraries was much more uniform in vitro than in vivo, which is likely due to the decoupling of gene expression from cell fitness in vitro and can improve multiplexed measurements (Figure 1c). More importantly, both transcription levels and transcription start site positions (TSSs) were highly concordant between in vitro and in vivo measurements (Figure 1d,e, Suppl. Fig. S3a-d). Lastly, the sequences upstream of TSSs were enriched for sigma70 binding sites, indicating an accurate identification of mRNA 5’ positions and enabling compositional analysis of promoter sequences (Suppl. Fig. S3e,f). Together, these results demonstrate that in vitro TXTL reactions faithfully recapitulate in vivo transcription conditions, enabling accurate, more uniform, and multiplex characterization of thousands of regulatory components in a single pooled TXTL reaction.
A simple pipeline for preparing and optimizing diverse microbial lysates
Given the robustness of multiplex transcriptional measurements in E. coli TXTL reactions, we sought to generalize this approach by first generating and optimizing cell-free expression systems from other bacterial species. Recently, several studies have reported TXTL reactions from several non-E. coli bacterial species33-36. To further expand upon these advances, we established a streamlined experimental pipeline to construct, test and optimize cell lysates from diverse species (Methods)18,19. In short, cells are grown in rich media, lysed by sonication, incubated in a run-off reaction, and dialyzed (Figure 2a)39. We used the expression of Broccoli, an RNA fluorescence aptamer, to quantify the transcriptional activity of cell lysates40. When bound to the fluorophore DFHBI-1T, Broccoli yields a fluorescence signal that is linearly proportional to its mRNA concentration. We systematically optimized reaction buffer conditions by supplementing with different levels of Mg-glutamate and K-glutamate. The most transcriptionally active buffer conditions were used for all subsequent in vitro studies (Figure 2b).
In total, we implemented our cell lysate preparation and optimization pipeline on 10 phylogenetically and ecologically diverse bacterial species including seven Proteobacteria (E. coli, E. fergusonii, S. enterica, P. agglomerans, K. oxytoca, P. putida, V. natriegens), two Firmicutes (B. subtilis, L. lactis) and one Actinobacteria (C. glutamicum). E. fergusonii and P. agglomerans strains are novel isolates from the murine gut and agricultural waste respectively, while other strains have been previously described. Each strain was first grown in their optimal growth conditions and cell lysates were subsequently prepared using the same general protocol (Suppl. Methods). Final cell-free reaction conditions were individually optimized for each species (Figure 2b and Methods) and quantified to determine transcriptional yields (Suppl. Figure 4a). In the E. coli lysate, we found that the expression of the Broccoli reporter peaked after ~4 hours and then decreased thereafter, consistent with previous reports24. In contrast, P. agglomerans, V. natriegens, C. glutamicum and L. lactis showed a prolonged Broccoli signal over time with no observed decrease over 8 hours. These differences could arise as a result of different stability dynamics of the DNA template or resulting mRNA molecules as well as alternative energy recycling processes for sustaining transcription. While transcription requires a few key proteins and cofactors, translation needs many more components with a far greater degree of coordination between them to function properly. Thus, for cell-free lysate reactions, in vitro translation is expected to require significantly more tuning and optimization than in vitro transcription. Nonetheless, we quantitatively characterized the translational potential of our 10 cell lysates using an eGFP reporter plasmid and detected measurable levels of protein expression in most lysates except for L. lactis (Suppl. Fig S6b). These results demonstrate that our cell lysate preparation methods are able to generate active cell-free expression reactions from diverse bacteria with relative ease.
To improve the flexibility of our cell-free expression systems, we explored the possible use of uncloned PCR product inputs by attempting to sequester the RecBCD complex, which normally degrades linear DNA templates. Two strategies have been used to enhance the stability of linear DNA involving co-incubation with either GamS from λ-phage or DNA fragments that contain Chi sites5,41. To further protect linear DNA templates from RecBCD and other nucleases, we introduced Ku proteins from the non-homologous end-joining (NHEJ) pathway of Mycobacterium tuberculosis, which protects damaged DNA by binding to DNA, into our cell-free reactions (Suppl. Fig. S5a)42. Like GamS and Chi-site DNA, Ku also protected linear DNA templates and enhanced their gene expression in E. coli cell-free reactions. Sufficient flanking regions (as little as 100 bp) enabled binding to linear DNA ends by Ku to confer resistance against nuclease degradation (Supp. Fig. S5b,c). We then tested Ku in our other cell-free expression systems and found that it improved expression of linear DNA templates in a broad range of species, in contrast to GamS and Chi-site strategies, which worked only in closely related species of E. coli and S. enterica (Suppl. Fig. S6). Together, these results provide a foundation for further optimizations of transcription and translation efficiency in cell-free expression systems and demonstrate their potential for rapid and flexible characterization of genetic designs in new bacterial species.
Large-scale multiplex transcriptional characterizations in diverse bacteria
Understanding the regulatory capacity of microbes is key for building reliable genetic circuits that experience fewer failures. In previous work, we found that the activity of a regulatory element can vary in different species37. A major roadblock in identifying these species-selective regulatory properties is the low efficiency of DNA transformation in many bacteria, which limits multiplex characterization in them. To address this challenge, we first tested a small 234-member regulatory library (RS234)37 using DRAFTS in seven species to obtain multiplex transcriptional data. These results were compared with those obtained by transforming the library into each species and measuring in vivo transcription levels. Encouragingly, in vitro and in vivo transcription levels were highly correlated between cell lysates and cell libraries (Pearson’s r between 0.71 and 0.9, Suppl. Figure S7), which demonstrates the utility of DRAFTS for accurate multiplex transcriptional measurements in diverse bacteria.
To further scale the throughput of DRAFTS, we mined diverse antibiotic resistance and virulence genes to generate a 1,383-member library of natural regulatory sequences (RS1383) (Suppl. Fig. S1b and Methods). Since antibiotic resistance and virulence genes are often mobilized between microbes, we hypothesized that their regulatory regions may exhibit interesting host-range properties. Accordingly, we tested this library in the 10 cell lysates using DRAFTS (Supplementary Data Table 2). By using the same RS1383 library in in vitro reactions, we can also minimize any confounding contextual effects of different plasmid backbones, which are otherwise needed for in vivo characterizations in different species43. While the RS1383 library was relatively uniform following in vitro reactions in most lysates, a group of sequences (~13%) was significantly depleted in the L. lactis lysate. Motif analysis revealed a CCNGG motif in these sequences, which corresponded to a known recognition site of a restriction enzyme (ScrFIR) present in the L. lactis genome (Suppl. Figure S8)44. Therefore, the depletion of these sequences was likely the result of restriction cleavage in L. lactis lysates, highlighting the potential interference by bacterial defense systems to confound cell-free studies. We thus removed these sequences from future analyses as they artificially inflated transcription activity calculations.
Many regulatory sequences were transcriptionally active in cell lysates, spanning several orders of magnitude in expression (Suppl. Fig. 9a). Hierarchical clustering revealed distinct groups of activity levels across phylogenetically diverse bacteria (Figure 3a,b). For each regulatory element, we performed pairwise comparisons of their transcription levels between species (Figure 3c, Suppl. Fig. S9b). Pearson correlations of these pairwise comparisons showed varying levels of transcriptional concordance, which when clustered further revealed distinct gram-negative and gram-positive groups (Figure 3c). Interestingly, more phylogenetically related species shared a more similar transcription profile (Pearson’s r=0.73, Figure 3d). Principal component analysis on the RS1383 transcription profiles also revealed two distinct groups for gram-negative and gram-positive species (Suppl. Fig. S9c). We did not find distinct patterns of expression across different classes of antibiotic resistance genes across lysates (Suppl. Figure S10). However, regulatory sequences derived from Firmicutes, which have AT-rich genomes, tended to be more active overall compared to those from Proteobacteria and other phyla that have higher genomic GC-content (Figure 3e), highlighting the importance of regulatory sequence composition. Together, these results provide a higher resolution delineation of the transcriptional capacity and functional similarity between phylogenetically diverse species than that has been described previously.
Finally, we characterized an even larger library of 7,003 regulatory sequences (RS7003)37 in the 10 cell lysates (Supplementary Data Table 3) to demonstrate the utility of DRAFTS in mechanistic modeling of gene expression. We first examined the relationship between 8 regulatory sequence features known to influence promoter activity (i.e. -10/-35 motif strength and interaction45, spacer sequence, TSS region composition, mRNA 5’-end stability, GC%, Methods) and the measured in vitro expression levels (Figure 4a). In line with previous results37, the most predictive features for transcriptional activity across species were the strength of the -10 and -35 motifs of sigma70 (positively correlated) and the GC content of the regulatory sequence (negatively correlated). We combined all 8 features to generate a linear regression model of transcriptional activity for each species, using 10% of the dataset for bootstrapped training (Methods). These transcriptional models were able to explain up to 50% of the variance in some species (Figure 4b,c). Our findings highlight the utility of applying cell-free expression approaches to diverse bacteria for gene regulation characterization and modeling to deepen mechanistic studies28,45-47.
Building synthetic hybrid cell lysates
The diversity of metabolisms and cellular physiologies in the microbial kingdom provides a rich basis to generate cell lysates with unique capacities. Synthetic amalgamation of cell lysates may further extend or broaden capabilities that are otherwise uniquely confined to individual species. With an expanded biochemical repertoire, modified lysates could, in theory, express genes and pathways beyond their native regulatory or metabolic capacities. Furthermore, they could also mimic synthetic bacterial co-cultures, but in an in vitro setting that does not require intercellular transport, a limitation in many cross-feeding studies48. To test this hypothesis, we synthetically mixed cell lysates from different species into “hybrid lysates” (Figure 5a). We first constructed 7 hybrid lysates with equal proportions of two species and examined their ability to produce mRNA and proteins. Interestingly, most hybrid lysates achieved comparable levels of mRNA and protein expression to their single-species counterparts. This result suggests that machineries for gene regulation, transcription, and protein synthesis from evolutionarily distinct species generally do not negatively interfere with one another (Suppl. Fig. S11, S12).
To better delineate the transcriptional properties of hybrid lysates, we then applied DRAFTS using the RS7003 library to examine the transcriptional capacity of E. coli-B. subtilis (Ec:Bs) or E. coli-C. glutamicum (Ec:Cg) hybrid lysates generated with different mixing ratios (4:1, 1:1, 1:4). Transcriptional profiles from hybrid lysates were compared to one another as well as non-mixed single species (Figure 5b,c, Supplementary Data Table 4). Interestingly, equally mixed lysates exhibited transcriptional profiles that were more similar to lysates from both species individually. Tuning the ratio of species in the hybrid lysate resulted in transcriptional profiles more aligned with the majority species (e.g. 1:4 or 4:1) while still retaining some similarity to the minority species. The ability to augment different cell lysates (e.g. Proteobacteria, Firmicutes, Actinobacteria) with properties from functionally different species by simply mixing cell lysates in different ratios and combinations will provide a powerful general approach for in vitro synthetic biology. We expect that synthetic hybrid lysates with even higher combinations of species could further broaden the repertoire of biochemical diversity, additively or even multiplicatively.
Discussion
Cell-free expression systems enable rapid prototyping of natural or synthetic genetic circuits by eliminating time-consuming transformation and cell growth steps. In this study, we developed, optimized, and characterized cell-free lysates for 10 diverse bacterial species from three phyla (Proteobacteria, Firmicutes, Actinobacteria), many of which have never been utilized in cell-free expression reactions. Using DRAFTS in these lysates, we further demonstrated the accurate multiplexed characterization of thousands of DNA regulatory sequences in vitro. This deep sequencing-based approach can quantify transcriptional activities across >5 orders of magnitude and simultaneously determine transcriptional start sites. In vitro transcriptional measurements faithfully recapitulated in vivo expression levels, thus avoiding laborious construction of multiple species-specific libraries and transformation into some challenging or recalcitrant species.
We observed increased regulatory dissimilarity between phylogenetically more distant species, suggesting a functional evolutionary divergence of transcriptional regulation across the natural microbial biome49. Differences between microbial species could be exploited to build more refined regulatory circuits that function in specific organisms or multiple targeted species37. Using these high-throughput datasets, simple linear models of transcriptional activation could be easily trained to predict promoter strengths in a wide array of organisms. Our identification of a strain-specific restriction enzyme recognition site that affected the stability of DNA templates suggests potential routes for strain optimization in future cell-free studies using diverse bacteria. Advances in cell lysate preparation, especially for organisms requiring anaerobic or fastidious growth conditions, will extend in vitro expression systems to new arenas to better understand and improve microbial physiology. Hybrid cell lysates generated by mixing different species could yield in vitro systems with new biochemical properties. These synthetic in vitro hybrids constitute a step towards harnessing the diverse properties of natural and engineered bacteria into a single in vitro reaction, with possible applications in microbiome prototyping, biochemical manufacturing, and the development of artificial minimal cells48,50.
While this study focused on characterization of promoters that are constitutively activated, our approach can extend to inducible regulatory systems including riboswitches51-54. In these cases, additional regulatory proteins could be co-expressed with a library of regulatory DNA variants in the same in vitro reaction and their ligand-response dynamics could be monitored over time. Such approaches could also expedite the characterization of promoters controlled by multiple regulators47 and facilitate their integration into complex regulatory networks51,55. Transcriptional profiles of entire genetic circuits could also be characterized in TXTL reactions using whole-transcriptome library preparation methods, which have been employed in vivo to aid debugging of individual components and dynamic circuit behaviors56. Sequencing-based quantification methods for translation such as Ribo-seq57,58 could enable in vitro multiplex characterization for protein synthesis in different bacteria. Finally, we expect multiplexed sequencing and quantification approaches to also improve analysis of fungal59, plant60, insect61, mammalian62,63 and other cell-free expression systems derived from eukaryotic organisms to study and engineer increasingly complex layers of gene regulation using synthetic biology.
Methods
Library construction
Regulatory sequences were mined from databases of antibiotic resistance64 and virulence genes65 (Suppl. Fig. S1b). Oligo library design, synthesis, and cloning were performed as previously described37. Initial libraries were transformed into E. coli MegaX DH10B cells (Invitrogen) and subsequently passaged twice 1:25 in LB+50 μg/mL carbenicillin. Plasmid DNA was extracted (Zymo Midiprep) and column purified once more (PureLink, Invitrogen) to avoid carryover of RNase A into cell-free expression systems. Final concentrations of 10-25 nM were suitable for downstream experiments.
Lysate preparation
Strain information, media, buffer recipes and growth conditions used for each species can be found in Supplementary Materials. E. fergusonii and P. agglomerans strains were isolated from mouse feces and hemp feedstock (supplied by Ecovative) respectively. Lysates were generated using similar methods used to produce E. coli TXTL systems18,19. In general, single colonies of each species were inoculated into 4 mL liquid media and grown 6-8 hours. 100-500 μL of the culture was transferred to 50 mL of the same liquid media and grown overnight. 3.3 mL from the overnight culture was transferred to two flasks containing 330 mL growth medium in 1 L baffled flasks (1:100 dilution) and grown to OD600 1.2-1.6. Flasks were rapidly chilled on ice and cells were centrifuged at 4,300xg for 10 minutes at 4°C. Pellets were washed three times with 50 mL S30A buffer in 50 mL centrifuge tubes. The mass of the final pellet was measured and 0.8 mL (1.2 mL for L. lactis) S30A buffer was added per gram of pellet mass and then transferred to Eppendorf tubes in 300 uL aliquots. The aliquoted resuspensions were sonicated on ice using a Qsonica125 sonicator with 3.2 mm probe at 35% amplitude for 4 rounds of 30 seconds, with 30-second breaks. Lysates were clarified by centrifuging at 12,000xg for 10 minutes at 4°C. Supernatant was removed and transferred to 2 mL microtubes with screw cap opened. Run-off reactions were performed by incubating clarified lysates at 30°C or 37°C (whichever temperature used for cell growth), 250 rpm for 60-80 minutes. Samples were clarified by centrifuging at 12,000xg for 10 minutes at 4°C. Supernatant was loaded in dialysis cassette (Slide-A-Lyzer 10k MWCO, Thermo Scientific) and dialyzed in S30B buffer for 2-3 hours at 4°C. Samples were clarified once more by centrifuging at 12,000xg for 10 minutes at 4°C, then aliquoted and stored at-80°C.
Transcriptional Optimization
Transcriptional optimization was performed separately for each species. Cell lysates were combined with amino acids, PEG, and energy buffer as previously described19, with additional Mg-glutamate and K-glutamate (both from Sigma Aldrich) at concentrations of 0, 3, 6 or 9 mM and 0, 30, 60, or 90 mM respectively (total 16 combinations), in a skirted white 96-well PCR plate (Bio-rad). A plasmid construct containing strong broad-host range promoter (Gen_18145) identified by our previous study37, F30-Broccoli, and B0015 terminator was used as a DNA template, and nuclease-free water was used as a negative control. DNA template (2 uL) and 10 mM DFHBI-1T (0.5 uL, Tocris Bioscience) was added to each well immediately before time course measurements. Fluorescence was tracked for 8 hours using a Synergy H1 plate reader (BioTek) at 30°C using excitation and emission wavelengths of 482 and 505 nm respectively. The concentrations of Mg-glutamate and K-glutamate yielding the highest fluorescence peak was used for future experiments, typically between 1-4 hours.
Library Transcriptional Measurements
Library measurements were carried out in 10 μL reactions containing 7.5 μL cell-free mixture and 2.5 μL DNA at a final concentration of 10-25 nM. Reactions were incubated at 30°C for 30 min. Each reaction was then split and 2 μL was used for amplification of DNA library and total RNA was extracted from the remaining 5 uL using a Zymo RNA Clean and Concentrator-5 kit, with the remaining volume saved as backups. For in vivo measurements, library cultures were grown until OD600 ~0.2. RNA was then extracted using RNAsnap66. For in vivo libraries, cells were lysed using PrepGem bacteria kit (Zygem) for amplification of input DNA library sequences. Unless otherwise stated, RNA sequencing library was prepared by reverse transcription and common adaptor ligation at 3’-end of the cDNA.
Gene-specific reverse transcription was carried out for all RNA samples as follows:
10 μL Total RNA sample (up to 5 μg)
1 μL Reverse transcription primer (20 μM)
1 μL 10 mM dNTPs (Invitrogen)
2.5 μL Nuclease free water
Components were incubated at 65 °C for 5 minutes and chilled on ice for 1 minute. The following were then added to the reaction:
4 μL 5x RT buffer
0.5 μL RNase inhibitor (Ribolock Thermo Scientific)
1 μL Maxima reverse transcriptase (Thermo Scientific)
The reverse transcription reaction was carried out as follows on a 96-well thermocycler (BioRad):
42°C for 90 minutes
50°C for 2 minutes
42°C for 2 minutes
Repeat the two steps above for 9 cycles
85°C for 5 minutes
4°C hold
The completed reaction was incubated with 1 μL RNase H at 37°C for 30 minutes.
Then the common adaptor ligation was carried out as follows:
5 μL cDNA
2 μL Adaptor (40 μM)
Components were incubated at 65 °C for 5 minutes and chilled on ice for 1 minute. The following were then added to the reaction:
2 μL T4 RNA ligase buffer
0.8 μL DMSO
0.2 μL 100 mM ATP
8.5 μL 50% PEG8000
1.5 μL T4 RNA ligase (New England Biolabs)
Reactions were incubated overnight at 22°C for overnight.
Illumina indexes and adaptors were added to both cDNA and input library DNA using a two-step amplification process. All primer sequences are listed in Supplementary Materials. Amplifications were performed using the following PCR reaction mixture and cycling:
10 μL 2x Q5 Hot Start Mastermix (New England Biolabs)
0.2 μL 10x SYBR Green I (Invitrogen)
1 μL Forward primer (10 uM)
1 μL Reverse primer (10 uM)
1 μL 1:10 dilution of library DNA template
6.8 μL Nuclease free water (Ambion)
5 min. 98 °C Initial Denaturation
10s. 98 °C Denaturation
20s. 62 °C Annealing
30s. 72 °C Extension
Go to step 2. Until end of exponential amplification
2m. 72 °C Final Extension
PCR was performed using a CFX96 Touch Real-Time PCR machine (Bio-Rad). Amplification was stopped as soon as exponential phase ceased, typically around 20 cycles. Samples were diluted 1:100 and amplified again using the same protocol using indexing primers for 8-10 cycles. Samples were then co-purified and examined on a 2% agarose gel to verify correct band sizes of ~ 200 bp for cDNA and 350 bp for input plasmid DNA libraries.
Read processing and data analysis
After merging paired-end reads using BBmerge67 and filtering out mismatched reads, barcode mapping of reads to designed library constructs and expression level calculations were performed by normalizing each construct’s individual RNA abundance by its DNA abundance based on merged counts from two technical replicates, as done in previous studies27,37. The log10 transformed expression levels were converted to Z-scores for further comparative analysis across samples.
Determination of transcription start sites and 5’-end mRNA structure stability
To identify the TSSs of regulatory sequences, alignment of the merged RNA reads with the reference sequence was performed after trimming common adaptor sequence with two random bases at the end. Then we processed TSS calls for each regulatory sequence by an algorithm using kmeans function of scikit-learn package in Python. The algorithm starts with a seed of 16 clusters, then the number of clusters is reduced if two clusters are found within 10 bp of each other or if the cluster contains less than 1% of all reads. Primary TSS was defined as TSS in which >70% of TSS calls lie within the cluster with >200 counts. Other secondary TSSs defined as all the TSSs in which >10% of TSS calls lie. Free energy of 5’-end mRNA structure was computed using the NUPACK package (http://nupack.org)68. The 5’-end of mRNA was defined by first 50 bp downstream of TSS of each regulatory sequence. Only promoters with primary TSSs were used for this analysis.
Motif discovery and scoring
The conserved -10 and -35 motifs of regulatory sequences were identified within the 45 bp preceding primary TSSs using MEME software analysis69. Spacer sequences between -35 and -10 regions of promoters were extracted during motif scanning and were given penalty scores for suboptimal sizes deviating up to +/- 2 bp from the optimal 17 bp. We also examined motifs found within the 8 bp of downstream sequences from primary TSS as this region is important for transcription initiation30,70. Unless otherwise stated, top 10% highly active promoters were selected for motif analysis. The motif function in the Biopython package was used to scan the aligned sequences by MEME to obtain motif position weight matrix and match scores for each promoter in the regulatory sequence library.
Statistical methods
All library measurements were performed in duplicate for both RNA and DNA and the Pearson correlation coefficients between most of the biological replicates were >0.9. For most analyses, a cutoff of 15 DNA reads was set to minimize noise from low abundance constructs. All statistics were performed using commonly used Python libraries such as Numpy, Scipy, Pandas, and Seaborn. Linear modeling was performed using the LinearRegression function in scikit-learn package, by randomly splitting the data in 10% and 90% for the training and test sets, respectively, unless otherwise stated.
Author contributions
S.S.Y., N.I.J., V.N. and H.H.W. developed the initial concept. A.L.G., C.R., S.P.C., R.M., M.R. and D.G. provided key biological reagents, strains and code. S.S.Y., N.I.J., and J.P. performed experiments and analyzed the results under the supervision of H.H.W.; N.I.J., S.S.Y. and H.H.W. wrote the mansucript with input from all authors.
Competing financial interests
The authors declare no competiting financial interests.
Methods
Production and purification of Ku in E. coli
Ku gene from Mycobacterium tuberculosis was synthesized (IDT) and cloned into pET28c with riboJ (BBa_K1679038) and RBS (BBa_B0034) to yield pET28c-Ku. The pET28c-Ku plasmid was introduced to E. coli BL21(DE3). Ku was produced in the strain by 0.2 mM IPTG (isopropyl-b-D-thiogalactopyranoside) induction at 18°C for 16 hours. Soluble lysate of the culture was prepared by sonication on ice using a Qsonica125 sonicator with 3.2 mm probe at 40% amplitude for 50 rounds of 30 seconds, with 30 second breaks and clarification by centrifuging at 12,000xg for 10 minutes at 4°C. Recombinant Ku protein in the soluble lysate was purified using Nickel-NTA-agarose. Ku was recovered in the 300 mM imidazole eluates. Eluted Ku protein was dialyzed in PBS for 2-3 hours at 4°C, and stored at -20°C.
Template-switching adaptor ligation
Data in Figure S7 was prepared by Template-switching, an alternative method for addition of common adaptor sequences to cDNA using the following reverse transcription reaction:
5 μL Total RNA sample (up to 5 μg)
1 μL Reverse transcription primer (20 μM)
1 μL 10 mM dNTPs (Invitrogen)
6.5 μL Nuclease free water
Components were incubated at 65 °C for 5 minutes and chilled on ice for 1 minute. The following were then added to the reaction:
4 μL 5x RT buffer
0.5 μL RNase inhibitor (Ribolock Thermo Scientific)
1 μL Maxima reverse transcriptase (Thermo Scientific)
1 μL Template switching oligo (20 μM)
The reverse transcription reaction was carried out as follows on a 96-well thermocycler (BioRad):
42°C for 90 minutes
50°C for 2 minutes
42°C for 2 minutes
Repeat the two steps above for 9 cycles
85°C for 5 minutes
4°C hold
The completed reaction was incubated with 1 μL RNase H at 37°C for 30 minutes.
Acknowledgements
We thank members of the Wang lab for advice and comments on the manuscript. H.H.W. acknowledges funding support from NSF (MCB-1453219), NIH/NIGMS (U01GM110714-01A1, 1R01AI132403-01), DARPA (HR0011-17-C-0068), DoD ONR (N00014-15-1-2704), the Sloan Foundation (FR-2015-65795). V.N. acknowledges funding support from DoD ONR (N00014-13-1-0074) and Human Frontier Science Program (RGP0032/2015). S.S.Y. thanks support from Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Education (NRF-2017R1A6A3A03003401). N.I.J. is supported by an NSF Graduate Research Fellowship (DGE-1644869). We also thank K.J.Jeong (KAIST, Daejeon, Korea) for providing pCES208 plasmid.