Abstract
Efficient and highly organized transcription initiation and termination is fundamental to an organism’s ability to survive, proliferate, and quickly respond to its environment. Over the last decade, our simplistic outlook of bacterial transcriptional regulation and architecture has evolved to include stimulus-responsive regulation by untranslated RNA and the formation of alternate transcriptional units. In this study, we map the transcriptional landscape of the bacterial pathogen Streptococcus pneumoniae by applying a combination of high-throughput RNA-sequencing techniques. Our study reveals a complex transcriptome wherein environment-responsive alternate transcriptional units are observed within operons, stemming from internal transcription start sites (TSSs) and transcription termination sites (TTSs), suggesting that more fine-tuning of regulation occurs than previously thought. Additionally, we identify many putative cis-regulatory RNA elements and riboswitches within 5’-untranslated regions (5’-UTR) of genes. By integrating TSSs and TTSs with independently collected RNA-Seq datasets from a variety of conditions, we establish the response of these regulators to changes in growth conditions and validate several of them. Furthermore, to determine the importance of ribo-regulation by 5’-UTR elements for in vivo virulence, we show that the pyrR regulatory element is essential for survival, successful colonization and infection in mice, suggesting that such RNA elements are potential drug targets. Importantly, we show that our approach of combining high-throughput sequencing with in vivo experiments can not only reconstruct a global understanding of regulation, but also pave the way for discovery of compounds that target (ribo-) regulators to mitigate virulence and antibiotic resistance.
Introduction
The transcriptional architecture of bacterial genomes is far more complex than originally proposed. The classical model of an operon describes a group of genes under the control of a regulatory protein where transcription results in a polycistronic mRNA with a single transcription start site (TSS) and a single transcription terminator site (TTS) [1]. However, many individual examples have established that the same operon may encode alternative transcriptional units under varying environmental conditions [2, 3]. Furthermore, advancements in sequencing technology that enable highly accurate mapping of TSSs and TTSs on a genome-wide level have demonstrated that the number of TSSs and TTSs can significantly exceed the number of operons [4]. Thus it seems likely that the bacterial transcriptional landscape, or the genome-wide map of all possible transcriptional units, is shaped by an operon architecture that encodes many TSSs and TTSs within single operons, thus significantly increasing complexity with the objective of enabling diverse transcriptional outcomes [5, 6].
To achieve a complex landscape of alternative transcriptional units, transcriptional regulation occurs on multiple levels. In addition to the many protein activators and repressors that control transcription initiation, there are also many non-coding RNAs (ncRNAs), including both small ncRNAs (sRNAs) and highly structured portions of mRNAs that play essential roles as regulatory elements controlling metabolism, stress-responses, and virulence [7, 8]. Trans-acting small RNAs (sRNAs) allow selective degradation or translation of specific mRNAs [9], and cis-acting mRNA structures, such as riboswitches, interact with small molecules, metal ions, and protein ligands to affect expression of downstream genes by regulating transcription attenuation or translation inhibition [10]. RNA regulation has been shown to play a key role in shaping the transcriptional landscape of a wide range of pathogenic bacteria including Staphylococcus aureus, Listeria monocytogenes, Helicobacter pylori, and several strains of Streptococci [6,11–19]. Several RNA regulators have been validated and associated with pathogenicity and virulence [20, 21], and could be used as highly specific druggable targets [22, 23]; however, only a select set of regulators has been targeted to date [24, 25].
Streptococcus pneumoniae is a major causative agent of otitis media, meningitis, pneumonia, and bacteremia. It causes 1.2 million cases of drug-resistant infections in the US annually and results in ∼1 million deaths per year worldwide [26–28]. While high-resolution transcriptional mapping data are available for other Streptococcus species, these studies have shown limited experimental validation [17], or have focused primarily on the role of sRNAs in virulence [29]. Additionally, previous studies of the S. pneumoniae transcriptome have demonstrated the presence of ncRNA regulators and assessed their roles in infection and competence, however, these studies also largely focused on sRNAs [13,15,30]. Thus, while potentially incredibly valuable, a high-resolution validated map of the genome-wide transcriptional landscape for S. pneumoniae is not available.
Here a comprehensive characterization of the S. pneumoniae TIGR4 transcriptional landscape is created using RNA-Seq [31], 5’ end-Seq [19], and term-seq (3’-end sequencing) [32]. We obtain a global transcript coverage map, identifying all TSSs, and all TTSs, which highlights a highly complex S. pneumoniae transcriptional landscape including many operons with multiple TSSs and TTSs. Furthermore, we demonstrate how TSS and TTS mapping under one set of conditions can be leveraged to analyze independently obtained RNA-Seq data collected under a variety of conditions, and we experimentally validate this approach with several cis-acting RNA regulators. Finally, we demonstrate that the functionality provided by the RNA cis-regulator pyrR is critical for S. pneumoniae in vivo in a mouse infection model. Importantly, our work demonstrates how a variety of high-throughput sequencing efforts can be combined to map out a comprehensive transcriptional landscape for a bacterial pathogen as well as identify potentially druggable ncRNA targets.
Results
Streptococcus pneumoniae has a complex transcriptional landscape
To characterize the transcriptional landscape of Streptococcus pneumoniae TIGR4 (T4), we first determined transcript boundaries by mapping transcription start (TSSs) and termination sites (TTSs) from 5’ and 3’ end sequencing reads obtained from exponential growth (Fig 1A and Fig 1B). For the 2341 annotated genes in T4, a total of 1597 TSSs and 1330 TTSs were identified, as well as 236 antisense terminators suggesting extensive antisense regulation (Fig 2, S1 and S2 Tables). RNA-Seq based clustering of genes with Rockhopper [33] detected 773 single gene operons and grouped 1512 genes into 474 multi-gene operons (S3 Table, and Fig 3). We classified operons into five categories based on the number of internal TSSs and TTSs (Fig 2). The majority of S. pneumoniae genes are independent transcriptional units with a single TSS and TTS (simple operons) (Fig 3A). Traditional operons with multiple genes and a single TSS and TTS make up 5% of the operons, while multiTSS, and multiTTS operons make up 4% and 3% of transcriptional units respectively. However, complex operons (most of which consist of two genes) with multiple TSSs and TTSs are the second largest category comprising 26% of all operons (Fig 3A and Fig 3B). Most complex operons are defined by a secondary internal TSS and TTS, however there are several significantly more complex examples where the operon contains multiple TSSs and TTSs (Fig 3B), indicating a highly intricate system of possible transcripts.
Since our data revealed many operons with complex structure, we sought to corroborate specific examples using additional data sources. One complex operon we identified, which is also present in existing databases of operon structure [5, 34], consists of 9 genes (SP1018-SP1026) with six internal TSSs and eight TTSs (Fig 4A). In addition, this operon displays unequal and complex gene expression patterns when independently collected RNA-Seq data from diverse media conditions is mapped to the transcript. In poor growth medium (MCDM) the operon can be split into two parts based on expression, where the last five genes in the operon (SP1022-1026) are expressed significantly higher than the first four genes (SP1018-1021), while in rich medium (SDMM) the read depth across the operon is similar. This observation corroborates the role of the internal regulatory mechanisms for maintaining differences in gene expression between different growth conditions.
A second validation of our data and analysis approach derives largely from existing low-throughput experiments. The mal regulon is a multiple operon system under the control of a single protein, malR (SP2112), which downregulates regulon expression at the malM (SP2107) promoter. Our data show that the mal regulon includes operons belonging to three different categories: a traditional operon (malAR/SP2111-2112), a multiTTS operon (malMP/SP2106-2107) and a complex operon (malXCD/SP2108-2110). From the RNA-Seq coverage maps it is clear that the three operons can be differentially controlled and expressed in rich vs poor media (Fig 4B). Furthermore, the TSS and TTS identified by our analysis reveal features that have been previously described in lower throughput assays [24]. Thus, although our data may highlight many examples of complex transcriptional architecture, these examples are verifiable through the incorporation of additional RNA-Seq data, and where applicable are consistent with low-throughput studies done in the past.
Genome-wide identification and pan-genome wide conservation of regulatory RNAs
To identify RNA regulators that act through premature transcription termination, we compiled 3’ end sequencing reads upstream of translational start sites, allowing a minimum 5’-untranslated region (UTR) length of 70 bases. We detected 565 such early TTS sites that represent putative regulatory elements (represented in black (TSS) and orange (early TTS) in the inner band of Fig 2). By screening these regulatory elements against 380 published S. pneumoniae strains [35], covering a large part of the pan-genome, we found that 20 candidates (∼4%) were conserved across all genomes, 171 (45%) were identified in at least 350 genomes, 68 (∼12%) candidates were identified in fewer than half of the genomes (S1A Fig), while a single candidate, identified upstream of a transposase, was found only in T4. Interestingly, 415 (73%) candidates were found as single copies within a genome, while the others had varying copy numbers ranging from 2 to 29. Evolutionary distance of each candidate cluster was estimated using MEGA-CC [36], which reveals that each cluster is made of highly similar, if not identical, sequences (S1B Fig).
The candidate RNAs were compared to previously identified S. pneumoniae small RNAs and to homologs of characterized structured RNA families (including a variety of riboswitches and other cis-regulatory structured RNAs). A total of 111 candidates overlap with previously identified S. pneumoniae small RNAs (sRNAs), 86 out of the 88 intergenic sRNAs described in [13] and 32, with an additional 24 within 150 nucleotides, out of 89 described by [15] (S5 Table).
To identify homologs of characterized RNA families in T4 we used Infernal (an RNA specific homology search tool, [37]) to search the genome, which detected 51 of the 565 candidates, and 3 out of the 6 regulatory elements experimentally validated in this work (S4 Table), indicating the existence of many novel regulatory elements in T4. However, the coordinates identified by Infernal do not always match those identified by our methodology (S4 Table). For example, the 23S-methyl ncRNA identified by Infernal is found in T4 between coordinates 466473 and 466568, and the regulatory element experimentally identified is found between coordinates 466469 and 466579. All the Infernal identified ncRNAs overlapping with the candidates identified here are listed in S4 Table. On the other hand, in certain cases like the L1 regulator, the coordinates do not overlap at all. This could be due to the existence of condition-specific secondary TSSs that were not picked up in our growth conditions. Despite these exceptions, the majority of the Infernal identified ncRNA families show complete overlap with the candidates.
Leveraging RNA-Seq data collected under various conditions enables identification and validation of environment-responsive RNA regulators
We reasoned that since we mapped RNA-regulators by means of 5’end sequencing, we would be able to associate these regulators with specific growth conditions using environment dependent RNA-Seq data. To confirm the biological relevance of an RNA regulator and associate it with a specific condition one would expect to see a change in the 5’ UTR coverage relative to the accompanying gene. For instance, if a regulator forms an early terminator the RNA-Seq coverage in the 5’UTR is relatively high, while the coverage in the controlled gene would be much lower. Alternatively, if the environment relieves the formation of the early terminator the coverage across the 5’UTR and gene would become less skewed. To determine the applicability of this assumption we leveraged independently collected RNA-Seq data sampled under different nutrient conditions, including rich and poor media, and nutrient depletion conditions where a single nutrient was removed from the environment. RNA-Seq data were mapped to each putative regulatory region and coverage was calculated and averaged across the length of the 5’ UTR regulatory element and the downstream gene. From our list of candidate 5’-UTR regulatory elements, 128 showed a more than two-fold change in read-through between rich and poor media, with the majority showing an increase in read-through in poor media. Five regulators from this list were validated by qRT-PCR (Fig 5A-B, S2 Fig), confirming that RNA-Seq data can indeed be used to identify conditions to which an RNA regulator responds.
Importantly, two of the validated regulators are known as the thiamine pyrophosphate (TPP) and flavin mononucleotide (FMN) riboswitches. In many bacteria the TPP riboswitch binds thiamine pyrophosphate and regulates thiamine biosynthesis and transport [38]. Similarly, the FMN riboswitch regulates biosynthesis and transport of riboflavin by binding to FMN [39]. While we validated that these riboswitches respond to poor media by increasing expression of their respective genes (Fig 5A-B) we suspected that this was due to depletion of each specific ligand in the poor media. Indeed, when poor media is supplemented either with thiamine or riboflavin, expression of the TPP or FMN controlled gene (SP0716 and SP0178 respectively) decreases by more than 3-fold, suggesting that the observed differences between rich and poor media can be attributed to the activity of these riboswitches.
In an attempt to validate the feasibility of directly associating RNA-regulators with a highly specific change in the environment we performed RNA-Seq in the presence and absence of uracil. One specific regulatory element that is sensitive to uracil is the pyrR RNA element, which in many bacteria regulates de novo pyrimidine nucleotide biosynthesis through a transcription attenuation mechanism mediated by the PyrR regulatory protein [40, 41]. In the presence of the co-regulator UMP, PyrR binds to the 5’ UTR of the pyr mRNA transcript (the pyrR RNA element) and disrupts the anti-terminator stem-loop thereby promoting the formation of a factor-independent transcription terminator resulting in reduced expression of downstream genes [40] (Fig 6A). In contrast, the co-regulator 5-phosphoribosyl-1-pyrophosphate (PRPP) antagonizes the action of UMP on termination by binding to the PyrR protein when UMP concentration is low [42] (Fig 6A). For S. pneumoniae, our data confirm that pyrR RNA elements are present in the 5’ UTR of two pyr operons (SP1278-1276; SP0701-0702), and the uracil transporter (SP1286). Furthermore, in response to the absence of uracil the coverage across the two genes directly adjacent to the regulators (SP1278 and SP0701) and over the entire two operons increases drastically (Fig 6C), demonstrating that the regulatory elements effectively turn the genes/operons on, which we confirmed by qRT-PCR (Fig 6D, S3 Fig). Thus, while term-seq can be used to map novel regulatory RNA candidates on a genome-wide scale, RNA-Seq data can be leveraged, even in retrospect, to identify the environmental conditions to which a regulator responds.
The pyr operon is regulated through the secondary structure of the 5’ RNA leader-region, is essential for in vitro growth and in vivo virulence and can be directly manipulated
To further investigate the importance of the pyrR regulatory RNA element in growth, three different mutants were constructed that variably affect the 5’ RNA secondary structure (Fig 7A): 1) mutation M1 interferes with the binding of PyrR to the pyr mRNA; 2) mutation M2 renders the regulatory element in an “always on” state by destabilizing the rho-independent terminator stem-loop structure that is formed in the presence of UMP; 3) M3 locks the terminator and creates an “always off” state (Fig 7A). Wild type and mutant strains were cultured in the presence or absence of uracil and the effect of the mutations on expression of SP1278 was assessed with qRT-PCR (Fig 7B). As expected, expression in the wild type decreased (9.5-fold) in the presence of uracil, confirming the repressive effect of exogenous pyrimidine (Fig 7B) [43]. M1, which should be insensitive to the presence of PyrR and its co-regulator UMP (Fig 7A), is indeed unresponsive to the presence of uracil (Fig 7B). M2 triggers constitutive expression of the pyr operon (Fig 7B) and M3 has a ∼5-fold reduction in expression compared to the wild type regardless of the presence of uracil (Fig 7B).
Previously we showed that the pyrimidine synthesis pathway in S. pneumoniae is partially regulated by a two-component system (SP2192-2193) and that genes in this pathway are important for growth [44]. To determine the importance of a functional pyrR regulatory RNA element in growth, we performed growth experiments with mutants M1, M2 and M3 in the absence and presence of uracil. These data suggest that a functional pyrR does not appear to be absolutely necessary. For instance, while M1 may have a slight growth defect when cultured in the absence of uracil, M2 has no growth defect in the presence or absence of uracil (Fig 8A-B). Although both mutations result in constitutive expression of the pyr operon, mutation M1 leads to higher expression (Fig 7B) indicating that overexpression of the pyr genes may result in accumulation of end products that are detrimental to the cell. Alternatively, the M2 pyrR RNA element can still bind excess UMP-bound PyrR (as its PyrR binding domain is intact) thus reducing the effective concentration of UMP in the cell and thereby potential accumulation associated side-effects. Importantly, M3 has a severe growth defect compared to wild type in the absence of uracil (Fig 8A), which can be partially rescued upon addition of uracil (Fig 8B). This suggests that while a constitutive off-state is detrimental for the bacterium in the absence of uracil a constitutive on-state can be overcome, indicating that efficient transcriptional control may not be essential.
To determine whether we can manipulate the manner in which the pyrR RNA element affects growth, we determined growth in the presence of 5-Fluoroorotic acid (5-FOA), a pyrimidine analog. 5-FOA is converted into 5-Fluorouracil (5-FU), a potent inhibitor of thymidylate synthetase, whose activity is essential for DNA replication and repair [45]. Additionally, 5-FU competes with UMP for interacting with the PyrR protein [46]. 5-FU can thus work as a decoy, signaling that UMP is present in the cell, triggering the formation of a terminator and reducing expression of the pyr operon. The wild type strain displayed a severe growth defect in the presence of 5-FOA (Fig 8C&D), while M1 (which should not interact with PyrR and should thus be largely insensitive to the presence of 5-FOA) displayed a much smaller growth defect (Fig 8C&D). In addition, M2, which constitutively overexpresses the pyr operon, is also less sensitive to 5-FOA than wild type (Fig 8C&D). Thus, the mutations we introduced into the pyrR RNA element affect the secondary structure in the manner that we intended, and can have far reaching regulatory and fitness effects. Importantly, it shows that a drug targeted against the secondary structure can directly manipulate and severely hamper growth.
A remaining key question is the importance of RNA regulatory elements in colonization and the induction of disease. Somewhat surprisingly our in vitro growth curves suggest that constitutive expression of the pyr operon (M1) and constitutive overexpression (M2) are not all that detrimental to growth, indicating that efficient regulation is not critical. To assess the effect of loss of regulation on bacterial fitness in vivo, the pyrR RNA element mutants were tested in 1x1 competition assays (mutant vs. wild type) in a mouse infection model (Fig 9). While fitness for all three mutants is similar to the unmodified strain in vitro in the presence of uracil, M1 and M3 are unable to colonize and survive in the mouse nasopharynx, or infect and survive in the lung and transition and survive in the blood (Fig 9A&C). M2 has less of a defect in vivo, but still has a significantly diminished ability to infect and survive in the lung (Fig 9B). These results indicate that efficient regulation of the pyr operon in vivo is critical for growth and survival of S. pneumoniae within the host. While we had previously shown that genes in the pyr operon are important in vivo [44], the regulatory findings in this project take our understanding a step further and, importantly, in combination with the findings that 5-FOA can efficiently interact with the RNA regulatory element, suggest that it is feasible to modulate in vivo fitness and thereby virulence by targeting such regulatory elements.
Towards comprehensive transcriptional landscape reconstructions and highly targeted regulatory RNA element inhibitors
With the advent of deep sequencing technologies, our understanding of prokaryotic transcriptional dynamics is rapidly advancing [47] and underlining that bacterial transcriptomes are not as simple as previously thought. Analysis of the S. pneumoniae TIGR4 transcriptome using three different sequencing techniques (RNA-Seq, term-seq, and 5’end-Seq) has led to a comprehensive mapping of its transcriptional landscape. Besides identifying 1597 TSSs and 1330 sense and 236 antisense TTSs, we uncovered a complex operon structure, which has also been found in E. coli [4]. Importantly, such complexity likely allows for environment-dependent modulation of gene expression producing variable transcripts in response to varying conditions, which we illustrated here through analyses of a 9-gene complex operon and the mal regulon (Fig 4). Additionally, similar environment-dependent versatile operon behavior has been observed in E. coli [4] and to a lesser extent in Mycoplasma pneumoniae [48]. This means that our understanding is shifting dramatically and it is thus becoming clear that operons in bacteria should be seen as adaptable structures that can significantly increase the regulatory capacity of the transcriptome by responding to environmental changes in a highly specific manner.
Another central aspect of our approach is the identification of putative 5’-UTR structured regulatory elements. Riboswitches and other untranslated regulatory elements (binding sites for small regulatory RNAs) are important bacterial RNA elements that are thought to regulate up to 2% of bacterial genes [7, 8]. However, the discovery of new regulators is difficult when relying solely on computational methodology and sequence conservation [49]. Here we show that through term-seq [32] it is possible to identify such RNA elements on a genome-wide scale and by combining it with RNA-Seq performed in different conditions transcriptional phenotypes can be directly linked to the RNA element. This strategy thus makes it possible to screen for regulatory RNA elements in retrospect by making use of already existing or newly generated RNA-Seq data.
Importantly, besides the ability to reconstruct an organism’s intricate transcriptional landscape we show that there is also a direct application of our multi-sequencing approach, namely the ability to inhibit operons and/or pathways with specific chemicals or drugs that target the RNA regulatory element. We show that this is possible for the pyrR RNA element, a regulatory element that is important for pneumococcal growth and virulence, which means that this regulatory element could be a potential antimicrobial drug target. This idea is further strengthened by the fact that S. pneumoniae displays a growth defect in the presence of 5-FOA, which directly relates to misregulation of pyrR RNA, confirming its druggable potential.
We believe that the presented multi-omics sequencing strategy brings a global understanding of regulation in S. pneumoniae significantly closer, and because the approach is easily transferable to other species, it will enable species-wide comparisons for conservation of operon structure and regulatory elements. In addition, such detailed regulatory understanding creates new regulatory control tools for synthetic biology purposes. Moreover, the combination with for instance in vivo experiments shows that it is a realistic goal to design or select specific compounds that target ribo-regulators in order to mitigate virulence or antibiotic resistance.
Methods
Culture conditions and sample collection
For RNAtag-Seq, term-seq and 5’end-Seq library preparation, Streptococcus pneumoniae TIGR4 (T4) was cultured in rich media (SDMM) to mid-log phase (OD600 = 0.4). Cultures were diluted to an OD600 of 0.05 in fresh media and grown for one doubling (T0). At 0 min (T0) and after 30 min of growth (T30), 10 ml of culture was harvested by means of centrifugation (4000 rpm, 7 min at 4°C) followed by flash freezing in a dry-ice ethanol bath and storage at -80°C until RNA extraction. Sample collection was performed in four biological replicates and total RNA was isolated using an RNeasy Mini kit (Qiagen). For qRT-PCR analyses, T4 was cultured in SDMM to mid-log phase (OD600 = 0.4) and after centrifugation cultures were washed with 1X PBS and diluted to an OD600 of 0.003 in appropriate media. Cultures were harvested at mid-log followed by RNA extraction as described above.
5’end-Seq library preparation
5’end-Seq libraries were generated by dividing the total RNA into 5’ polyphosphate treated (Processed) and untreated (Non-Processed) samples that were subsequently processed and sequenced according to protocols described in Wurtzel et al., 2012 and [50] with few modifications. See supplemental methods for a detailed protocol.
RNA-Seq library preparation
RNA-Seq libraries were generated by using the RNAtag-Seq protocol [31, 51]. Briefly, 400 ng RNA was fragmented in FastAP buffer, DNase-treated with Turbo DNase, 5’-dephosphorylated using FastAP. Barcoded RNA adapters were then ligated to the 3’ terminus, samples from different conditions were pooled and ribosomal RNA was depleted using the Ribo-zero rRNA removal kit. Illumina cDNA sequencing libraries were generated by first-strand cDNA synthesis, 3’ linker ligation and PCR with 17 cycles. The final concentration and size distribution were determined with the Qubit dsDNA BR Assay kit and the dsDNA D1000 Tapestation kit, respectively.
term-seq library preparation
term-seq libraries were generated as previously described [32] with few modifications. 2 µg total RNA was depleted of genomic DNA using Turbo DNase, 5’ dephosphorylated, ligated to barcoded RNA adapters at the 3’ terminus and fragmented in fragmentation buffer. Barcoded and fragmented RNA from different conditions were pooled and ribosomal RNA was depleted using Ribo-zero. cDNA libraries were generated by first strand cDNA synthesis and RNA template was degraded as mentioned in the 5’end-Seq library preparation. Second 3’ linker was ligated and PCR amplified for 17 cycles. All four library preparations (RNAtag-Seq, term-seq, 5’end-Seq processed and 5’end-Seq unprocessed) were pooled according to the method of preparation and sequenced at high depth (8.5 million reads/sample) on an Illumina NextSeq500.
Read processing and mapping
The sequencing reads from the 5’ end-Seq sequencing, 3’ end sequencing (term-seq), and RNAtag-Seq were processed and mapped to the S. pneumoniae TIGR4 (NC_003028.3) genome using the in-house developed Aerobio pipeline. Aerobio runs the processing and mapping in two phases. Phase 0 employs bcl2fastq to convert BCL to fastq files, quality control and de-multiplexing and compilation of the reads based on the sample conditions. Phase 1 maps the de-multiplexed reads against the genome, under default parameters, using Bowtie2 [52] and streams the output to SAMtools [53] to generate sorted and indexed BAM files for each sample.
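The Phase 1 step described above (Bowtie2 mapping streamed into SAMtools for sorting and indexing) can be illustrated with a minimal sketch. This is not the Aerobio code; the function name and file names are hypothetical, and only the standard `bowtie2 -x/-U`, `samtools sort -o` and `samtools index` invocations are assumed.

```python
import shlex


def build_mapping_pipeline(index_prefix, fastq_path, bam_path):
    """Build a shell pipeline: Bowtie2 (default parameters) -> sorted, indexed BAM.

    Illustrative only; the real pipeline is the in-house Aerobio workflow.
    """
    # Bowtie2 writes SAM to stdout when no -S option is given.
    bowtie2 = ["bowtie2", "-x", index_prefix, "-U", fastq_path]
    # samtools sort reads the SAM stream from stdin ("-") and writes a sorted BAM.
    sort = ["samtools", "sort", "-o", bam_path, "-"]
    # samtools index creates the .bai index next to the sorted BAM.
    index = ["samtools", "index", bam_path]

    def quote(cmd):
        return " ".join(shlex.quote(part) for part in cmd)

    return f"{quote(bowtie2)} | {quote(sort)} && {quote(index)}"


if __name__ == "__main__":
    # Hypothetical index prefix and file names.
    print(build_mapping_pipeline("tigr4", "sample1.fastq", "sample1.bam"))
```

The function only assembles the command string, so it can be inspected or logged before being handed to a shell.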
In silico prediction of transcription start sites (TSSs) and transcription termination sites (TTSs)
Perl code from [32] was adapted to estimate the number of reads mapped at each nucleotide from the 5’ end, 3’ end, and RNA sequencing runs. With the nucleotide level coverage data calculated from the 5’ end-Seq, regions up to 500 nucleotides upstream of the translational start sites described in the annotated TIGR4 genome (NC_003028.3) were scanned for mapped reads with a minimum coverage of 2 and a Processed/non-Processed ratio of 1 as in [32]. When multiple putative TSSs were identified in a 5’ UTR, the one with the highest Processed/non-Processed ratio was assigned as the TSS for the downstream gene. Similar to the identification of the TSSs, TTSs were identified by scanning up to 150 nucleotides downstream of the translational stop site for mapped 3’ end reads with a minimum coverage of 2 in at least 4 replicates, out of the 12 total datasets. The position with the highest coverage was considered the most likely TTS for a gene.
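The selection rules above can be sketched in a few lines of Python. This is an illustrative re-statement, not the adapted Perl code from [32]. It assumes per-position read-end counts stored in dictionaries, applies the coverage threshold to the Processed library, reads "a ratio of 1" as "at least 1", and uses a pseudo-count of 1 when the non-Processed count is zero; all of these are assumptions, as are the function names.

```python
def call_tss(processed, unprocessed, start_codon, strand="+",
             window=500, min_cov=2, min_ratio=1.0):
    """Return the upstream position with the highest Processed/non-Processed ratio.

    processed / unprocessed: dict mapping genome position -> 5' end read count.
    """
    step = -1 if strand == "+" else 1  # "upstream" depends on the strand
    best_pos, best_ratio = None, 0.0
    for offset in range(1, window + 1):
        pos = start_codon + step * offset
        cov = processed.get(pos, 0)
        if cov < min_cov:
            continue
        # Pseudo-count of 1 avoids division by zero (assumption).
        ratio = cov / max(unprocessed.get(pos, 0), 1)
        if ratio >= min_ratio and ratio > best_ratio:
            best_pos, best_ratio = pos, ratio
    return best_pos


def call_tts(replicate_cov, stop_codon, strand="+",
             window=150, min_cov=2, min_reps=4):
    """Return the downstream position with the highest summed 3' end coverage.

    replicate_cov: list of dicts (one per dataset), position -> 3' end read count.
    A position qualifies only if at least min_reps datasets reach min_cov.
    """
    step = 1 if strand == "+" else -1
    best_pos, best_total = None, 0
    for offset in range(1, window + 1):
        pos = stop_codon + step * offset
        counts = [rep.get(pos, 0) for rep in replicate_cov]
        if sum(c >= min_cov for c in counts) < min_reps:
            continue
        total = sum(counts)
        if total > best_total:
            best_pos, best_total = pos, total
    return best_pos
```

Summing coverage across datasets to rank TTS positions is one possible reading of "highest coverage"; ranking by the mean or by a single pooled dataset would give the same winner whenever the datasets agree.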
Identification of transcript boundaries and operon structures in the genome
BAM files of the mapped RNAtag-Seq reads were analyzed using Rockhopper [33] to predict transcript boundaries and group genes into operons. Predicted operons were compared with the genome based predictions listed in the Database of Prokaryotic Operons [5,34,54], and complexity in the operon structure was characterized by surveying the number of internal TSSs and TTSs similar to [4].
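The five operon categories used in the Results can be expressed as a small decision function. The sketch below assumes that each operon has already been summarized as a gene count plus the total number of TSSs and TTSs assigned to it, and that single-gene units with extra TSSs or TTSs fall into the multiTSS, multiTTS or complex classes; that last point is an assumption, since the text does not state it explicitly.

```python
from collections import Counter


def classify_operon(n_genes, n_tss, n_tts):
    """Assign an operon to one of five categories from its TSS/TTS counts."""
    if n_tss > 1 and n_tts > 1:
        return "complex"
    if n_tss > 1:
        return "multiTSS"
    if n_tts > 1:
        return "multiTTS"
    # Single TSS and single TTS: split by gene count.
    return "simple" if n_genes == 1 else "traditional"


if __name__ == "__main__":
    # Hypothetical (n_genes, n_tss, n_tts) summaries, not values from the study.
    operons = [(1, 1, 1), (3, 1, 1), (2, 2, 1), (2, 1, 3), (2, 2, 2)]
    print(Counter(classify_operon(*o) for o in operons))
```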
Identification of candidate regulatory elements
Once the TSSs were identified, 5’-UTR regions with a length of at least 70 nucleotides were scanned for mapped 3’ end sequencing reads with a minimum coverage of 2 to identify putative early terminators. 5’-UTR regions with a predicted early TTS were binned as candidate regulatory elements. The nucleotide sequence for each candidate element was obtained and folded using RNAFold [55]. Secondary structures and free energy values were compiled for each candidate. Putative candidates were compared to known bacterial non-coding RNAs described in Rfam [56, 57] that were identified in the genome by the cmsearch function of Infernal 1.1 [37].
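As a rough illustration of the early-terminator scan, the sketch below checks the 5’-UTR length and then collects positions inside the UTR with sufficient 3’ end coverage. It handles plus-strand genes only and takes a dictionary of position to 3’ end read count; the function name and data layout are hypothetical.

```python
def find_early_terminators(tss, start_codon, three_prime_cov,
                           min_utr_len=70, min_cov=2):
    """Return 5'-UTR positions (plus strand) with 3' end coverage >= min_cov.

    tss, start_codon: genome coordinates with tss < start_codon.
    three_prime_cov: dict mapping genome position -> 3' end read count.
    """
    utr_len = start_codon - tss
    if utr_len < min_utr_len:
        return []  # UTR too short to be considered
    return [pos for pos in range(tss + 1, start_codon)
            if three_prime_cov.get(pos, 0) >= min_cov]
```

A UTR with a non-empty result would be binned as a candidate regulatory element; minus-strand genes would need the coordinate comparison mirrored.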
The response of candidate regulatory elements to different media conditions was assessed by calculating the RNA-Seq coverage in both the regulatory element and the regulated gene. Read-through was calculated for each of the candidates as described previously [32]. Briefly, read-through is the ratio (expressed as a percentage) of the average coverage across the gene to that of the 5’-UTR identified here. The greater the read-through, the higher the expression of the gene with respect to the 5’-UTR. That is, if the regulator reduced the expression of the gene, read-through would be small. If the regulator turned on gene expression in response to certain conditions, the read-through would be large.
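The read-through definition can be written out as a short calculation. The sketch assumes per-nucleotide coverage stored as a 0-based list and half-open (start, end) intervals for the 5’-UTR and the gene; these conventions and the function names are illustrative choices, not those of [32].

```python
from statistics import mean


def read_through(coverage, utr, gene):
    """Read-through (%) = mean gene coverage / mean 5'-UTR coverage * 100.

    coverage: list of per-nucleotide read depth (0-based).
    utr, gene: (start, end) half-open intervals into `coverage`.
    Returns None when the 5'-UTR has no coverage.
    """
    utr_cov = mean(coverage[utr[0]:utr[1]])
    gene_cov = mean(coverage[gene[0]:gene[1]])
    if utr_cov == 0:
        return None
    return 100.0 * gene_cov / utr_cov


def read_through_fold_change(rt_reference, rt_condition):
    """Fold change in read-through between two conditions."""
    return rt_condition / rt_reference


if __name__ == "__main__":
    # Toy coverage tracks: 10-nt UTR followed by a 20-nt gene (made-up numbers).
    rich = [100] * 10 + [20] * 20
    poor = [100] * 10 + [80] * 20
    rt_rich = read_through(rich, (0, 10), (10, 30))
    rt_poor = read_through(poor, (0, 10), (10, 30))
    print(rt_rich, rt_poor, read_through_fold_change(rt_rich, rt_poor))
```

In this toy case the terminator is largely relieved in the second condition, so read-through rises and the fold change exceeds the two-fold threshold used in the Results.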
Conservation of the candidate regulatory elements in Streptococcus pneumoniae
A local BLAST [58] database was generated with the genomes of 30 S. pneumoniae strains available in Refseq 77 [59] and 350 strains from [35]. Each of the candidate regulators identified in the genome of TIGR4 was BLASTed against this database, and hits in the other genomes were extracted and aligned using MAFFT version 7 [60]. The degree of conservation across the 380 genomes was determined by surveying each candidate cluster after filtering to remove sequences that were shorter than 70% of the query length or had e-values greater than 1 × 10⁻⁴. The candidates were also screened for overlap with previously published small RNAs identified in S. pneumoniae [13, 15].
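The hit filtering and per-genome counting can be sketched as follows. The input layout (tuples of candidate, genome, hit length and e-value) is an assumption made for illustration; it does not correspond to a specific BLAST output format, and the function name is hypothetical.

```python
from collections import defaultdict


def conservation(hits, query_len, min_frac=0.7, max_evalue=1e-4):
    """Count retained hits per candidate and genome.

    hits: iterable of (candidate, genome, hit_length, evalue) tuples.
    query_len: dict mapping candidate -> query length in nucleotides.
    A hit is dropped if it is shorter than min_frac of the query
    or its e-value exceeds max_evalue.
    """
    copies = defaultdict(lambda: defaultdict(int))
    for cand, genome, hit_len, evalue in hits:
        if hit_len < min_frac * query_len[cand] or evalue > max_evalue:
            continue
        copies[cand][genome] += 1
    return {cand: dict(per_genome) for cand, per_genome in copies.items()}
```

The number of keys under a candidate gives the number of genomes in which it is conserved, and the values give its copy number in each genome.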
Expression analysis using qPCR
RNA was isolated from cultures using the Qiagen RNeasy kit (Qiagen). DNase treated RNA was used to generate cDNA with iScript reverse transcriptase supermix for RT-qPCR (BioRad). Quantitative PCR was performed using a Bio-Rad MyiQ. Each sample was normalized against the 50S ribosomal gene SP2204 and was measured in biological replicates and technical triplicates. No-reverse transcriptase and no-template controls were included for all samples.
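The text does not state how relative expression was computed from the normalized values. As a sketch only, the widely used 2^(-ΔΔCt) calculation is shown below, assuming roughly 100% amplification efficiency and a single reference gene; the function name and Ct values are hypothetical.

```python
def relative_expression(ct_target_test, ct_ref_test,
                        ct_target_ctrl, ct_ref_ctrl):
    """Fold change by the 2^(-ddCt) method (assumed here, not stated in the text).

    Each Ct is the mean over technical replicates for one sample.
    """
    d_ct_test = ct_target_test - ct_ref_test   # normalize to reference gene
    d_ct_ctrl = ct_target_ctrl - ct_ref_ctrl
    dd_ct = d_ct_test - d_ct_ctrl              # compare test vs. control condition
    return 2.0 ** (-dd_ct)


if __name__ == "__main__":
    # Made-up Ct values: target crosses threshold 3 cycles earlier in the test condition.
    print(relative_expression(22.0, 15.0, 25.0, 15.0))
```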
pyrR RNA mutant growth assays
Wild type and pyrR RNA mutants of T4 were grown for 2 hours and diluted to an OD600 of 0.015 in fresh media, with varying concentrations of uracil and/or 5-FOA. Growth assays were performed in 96-well plates for 16 hours by taking OD600 measurements every half hour using a Tecan Infinite Pro plate reader (Tecan). Growth assays were performed no less than two times.
In vivo pyrR mutant fitness determination
1 x 1 competition experiments were performed with pyrR RNA mutants (M1 to M3) that were competed against the wild-type strain, after which bacterial fitness was calculated as previously described [44] with a few modifications. Lung removal and homogenization (in 10 mL 1X PBS), blood collection (100 µL) and nasopharynx lavage (with 1 mL 1X PBS) were performed on all animals 24 hours post infection, with the exception of animals infected with pyrR M3, which due to the large fitness defect were harvested at 6 hours post infection.
Author contributions
MMM and TvO devised the study. MMM, TvO, IW and ZZ designed the experiments, IW and ZZ generated the sequencing data, IW and AH performed in vitro experiments and AH performed in vivo experiments. IW, ZZ and NR performed RNA-Seq data analysis, NR performed term-seq and 5’ end-Seq data analyses and IW, NR, MMM and TvO wrote the manuscript.
Supplemental figure 1. Distribution and conservation of the 565 putative regulatory candidates across 380 S. pneumoniae genomes. A. Frequency distribution of the candidates across the surveyed genomes. B. Conservation of the candidates as a measure of the mean p-Distance within each candidate cluster.
Supplemental figure 2. Validation of the regulatory activities of three putative 5’-UTR regulatory candidates in different nutrient conditions. The relative expression and average RNA-Seq coverage of SP1356 (A), SP0240 (B) and SP1951 (C) increases in poor media (MCDM) compared to rich media (SDMM), potentially compensating for the depletion of the specific ligand.
Acknowledgements
We would like to thank Jon Anthony for sequence data processing on the Aerobio platform, Charles S. Hoffman for the generous gift of 5-FOA that we used in this study, and Daniel Dar for discussion and helpful suggestions. The sequencing datasets generated during the current study are available in the Sequence Read Archive (SRP136114). This work is supported by NIH grant R01GM115931 to MMM and TvO, and R01AI110724 and U01AI124302 to TvO.