Abstract
The ability of Mycobacterium tuberculosis to infect, proliferate, and survive during long periods in the human lungs largely depends on the rigorous control of gene expression. Transcriptome-wide analyses are key to understanding gene regulation on a global scale. Here, we combine 5’-end-directed libraries with RNAseq expression libraries to gain insight into the transcriptome organization and post-transcriptional mRNA cleavage landscape in mycobacteria during log phase growth and under hypoxia, a physiologically relevant stress condition. Using the model organism Mycobacterium smegmatis, we identified 6,090 transcription start sites (TSSs) with high confidence during log phase growth, of which 67% were categorized as primary TSSs for annotated genes, and the remaining were classified as internal, antisense or orphan, according to their genomic context. Interestingly, over 25% of the RNA transcripts lack a leader sequence, and of the coding sequences that do have leaders, 53% lack a strong consensus Shine-Dalgarno site. This indicates that like M. tuberculosis, M. smegmatis can initiate translation through multiple mechanisms. Our approach also allowed us to identify over 3,000 RNA cleavage sites, which occur at a novel sequence motif. The cleavage sites show a positional bias toward mRNA regulatory regions, highlighting the importance of post-transcriptional regulation in gene expression. We show that in low oxygen, a condition associated with the host environment during infection, mycobacteria change their transcriptomic profiles and endonucleolytic RNA cleavage is markedly reduced, suggesting a mechanistic explanation for previous reports of increased mRNA half-lives in response to stress. In addition, a number of TSSs were triggered in hypoxia, 56 of which contain the binding motif for the sigma factor SigF in their promoter regions. This suggests that SigF makes direct contributions to transcriptomic remodeling in hypoxia-challenged mycobacteria. 
Our results show that M. smegmatis and M. tuberculosis share a large number of similarities at the transcriptomic level, suggesting that similar regulatory mechanisms govern both species.
Introduction
Tuberculosis is a disease of global concern caused by Mycobacterium tuberculosis (Mtb). This pathogen has the ability to infect the human lungs and survive there for long periods, often by entering into non-growing states. During infection, Mtb must overcome a variety of stressful conditions, including nutrient starvation, low pH, oxygen deprivation and the presence of reactive oxygen species. Consequently, the association of Mtb with its host and the adaptation to the surrounding environment requires rigorous control of gene expression.
As the slow growth rate and pathogenicity of Mtb present logistical challenges in the laboratory, many aspects of its biology have been studied in other mycobacterial species. One of the most widely used mycobacterial models is Mycobacterium smegmatis, a non-pathogenic fast-growing bacterium that shares substantial genomic similarity with Mtb. A PubMed search for “Mycobacterium smegmatis” returns 3,907 publications, reflecting the sizable body of published work involving this model organism. While there are marked differences between the genomes, such as the highly represented PE/PPE-like gene category and other virulence factors present in Mtb and poorly represented or absent in M. smegmatis, these organisms have at least 2,117 orthologous genes (Prasanna & Mehra, 2013), making M. smegmatis a viable model to address questions about the fundamental biology of mycobacteria. Indeed, studies using M. smegmatis have revealed key insights into relevant aspects of Mtb biology including the Sec and ESX secretion systems involved in transport of virulence factors (Coros et al., 2008, Rigel et al., 2009), bacterial survival during anaerobic dormancy (Dick et al., 1998, Bagchi et al., 2002, Trauner et al., 2012, Pecsi et al., 2014) and the changes induced during nutrient starvation (Elharar et al., 2014, Wu et al., 2016, Hayashi et al., 2018). However, the similarities and differences between M. smegmatis and M. tuberculosis at the level of transcriptomic organization have not been comprehensively reported.
Identification of transcription start sites (TSSs) is an essential step towards understanding how bacteria organize their transcriptomes and respond to changing environments. Genome-wide TSS mapping studies have been used to elucidate the general transcriptomic features in many bacterial species, leading to the identification of promoters, characterization of 5’ untranslated regions (5’ UTRs), identification of RNA regulatory elements and transcriptional changes in different environmental conditions (examples include Albrecht et al., 2009, Mitschke et al., 2011, Cortes et al., 2013, Schlüter et al., 2013, Dinan et al., 2014, Ramachandran et al., 2014, Sass et al., 2015, Shell et al., 2015, Thomason et al., 2015, Berger et al., 2016, Čuklina et al., 2016, D’arrigo et al., 2016, Heidrich et al., 2017, Li et al., 2017). To date, two main studies have reported the transcriptomic landscape in Mtb during exponential growth and carbon starvation (Cortes et al., 2013, Shell et al., 2015). These complementary studies revealed that, unlike most bacteria, a substantial percentage (~25%) of the transcripts are leaderless, lacking a 5’ UTR and consequently a Shine-Dalgarno ribosome-binding site. In addition, a number of previously unannotated ORFs encoding putative small proteins were found (Shell et al., 2015), showing that the transcriptional landscape can be more complex than predicted by automated genome annotation pipelines. Thus, TSS mapping is a powerful tool to gain insight into transcriptomic organization and identify novel genes. Less is known about the characteristics of the M. smegmatis transcriptome. A recent study reported a number of M. smegmatis TSSs in normal growth conditions (Li et al. 2017). However, this work was limited to identification of primary gene-associated TSSs and lacked an analysis of internal and antisense TSSs, as well as characterization of promoter regions and other relevant transcriptomic features.
In addition, Potgieter and collaborators (Potgieter et al., 2016) validated a large number of annotated ORFs using proteomics and were able to identify 63 previously unannotated leaderless ORFs.
To achieve a deeper characterization of the M. smegmatis transcriptional landscape, we combined 5’-end-mapping and RNAseq expression profiling under two different growth conditions. Here we present an exhaustive analysis of the M. smegmatis transcriptome during exponential growth and hypoxia. Unlike most transcriptome-wide TSS analyses, our approach allowed us to study not only the transcriptome organization in different conditions, but also the frequency and distribution of RNA cleavage sites on a genome-wide scale. Whereas regulation at the transcriptional level is assumed to be the main mechanism that modulates gene expression in bacteria, post-transcriptional regulation is a key step in the control of gene expression and has been implicated in the response to host conditions and virulence in various bacterial pathogens (Kulesekara et al., 2006, Mraheil et al., 2011, Heroven et al., 2012, Jurėnaitė et al., 2013, Schifano et al., 2013, Holmqvist et al., 2016). Some regulatory mechanisms including small non-coding RNAs, RNases, Toxin-Antitoxin (TA) systems, RNA-binding proteins, and riboswitches have been described in mycobacteria (Fields & Switzer, 2007, Warner et al., 2007, Sala et al., 2008, DiChiara et al., 2010, McKenzie et al., 2012, Winther et al., 2016, and others), emphasizing the importance of post-transcriptional regulation. Here we show that RNA cleavage decreases during adaptation to hypoxia, suggesting that RNA cleavage may be a refinement mechanism contributing to the regulation of gene expression in harsh conditions.
Materials and Methods
Strains and growth conditions used in this study
M. smegmatis strain mc2155 was grown in Middlebrook 7H9 supplemented with ADC (Albumin Dextrose Catalase, final concentrations 5 g/L bovine serum albumin fraction V, 2 g/L dextrose, 0.85 g/L sodium chloride, and 3 mg/L catalase), 0.2% glycerol and 0.05% Tween 80. For the exponential phase experiment (Dataset 1), 50 ml conical tubes containing 5 ml of 7H9 were inoculated with M. smegmatis to an initial OD of 0.01. Cultures were grown at 37°C and 250 rpm. Once cultures reached an OD of 0.7 – 0.8, they were frozen in liquid nitrogen and stored at -80°C until RNA purification. For hypoxia experiments (Dataset 2), the Wayne model (Wayne & Hayes, 1996) was implemented. Briefly, 60 ml serum bottles (Wheaton) were inoculated with 36.5 ml of M. smegmatis culture at an initial OD of 0.01. The bottles were sealed with rubber caps (Wheaton, W224100-181 Stopper, 20mm) and aluminum caps (Wheaton, 20 mm aluminum seal) and cultures were grown at 37 °C and 125 rpm to generate hypoxic conditions. Samples were taken at an early hypoxia stage (15 hours) and at a late hypoxia stage (24 hours). These time points were determined experimentally from growth curve experiments (see Figure S1). 15 ml of each culture were sampled and frozen immediately in liquid nitrogen until RNA extraction.
RNA extraction
RNA was extracted as follows: frozen cultures stored at -80°C were thawed on ice and centrifuged at 4,000 rpm for 5 min at 4 °C. The pellets were resuspended in 1 ml Trizol (Life Technologies) and placed in tubes containing Lysing Matrix B (MP Bio). Cells were lysed by bead-beating twice for 40 sec at 9 m/sec in a FastPrep 5G instrument (MP Bio). 300 μl chloroform was added and samples were centrifuged for 15 minutes at 4,000 rpm at 4°C. The aqueous phase was collected and RNA was purified using the Direct-Zol RNA miniprep kit (Zymo) according to the manufacturer’s instructions. Samples were then treated with DNase Turbo (Ambion) for one hour and purified with an RNA Clean & Concentrator-25 kit (Zymo) according to the manufacturer’s instructions. RNA integrity was checked on 1% agarose gels and concentrations were determined using a Nanodrop instrument. Prior to library construction, 5 μg RNA was used for rRNA depletion using the Ribo-Zero rRNA Removal Kit (Illumina) according to the manufacturer’s instructions.
Construction of 5’-end-mapping libraries
After rRNA depletion, RNA samples from each biological replicate were split in three, in order to generate two 5’-end differentially treated libraries and one RNAseq expression library (next section). RNA for Library 1 (“converted” library) was treated either with RNA 5’ pyrophosphohydrolase RPPH (NEB) (exponential phase experiment, Dataset 1), or with 5’ polyphosphatase (Epicentre) (hypoxia experiment, Dataset 2), in order to remove the native 5’ triphosphates of primary transcripts, whereas RNA for Library 2 (“non-converted” library) was subjected to mock treatment. Thus, the converted libraries capture both 5’ triphosphates (converted to monophosphates) and native 5’ monophosphate transcripts, while non-converted libraries capture only native 5’ monophosphates (see scheme in Figure S2.A). Library construction was performed as described by Shell et al. (Shell et al., 2015). A detailed scheme showing the workflow of 5’-end library construction, the primers and adapters used in each step, and modifications to the protocol is shown in Figure S2.B.
Construction of RNAseq expression libraries
One third of each rRNA-depleted RNA sample was used to construct RNAseq expression libraries. KAPA stranded RNA-Seq library preparation kit and NEBNext Ultra RNA library prep kit for Illumina (NEB) were used for Dataset 1 and Dataset 2, respectively, according to manufacturer’s instructions. The following major modifications were introduced into the protocols: i) For RNA fragmentation, in order to obtain fragments around 300 nt, RNA was mixed with the corresponding buffer and placed at 85°C for 6 minutes (Dataset 1), or at 94°C for 12 minutes (Dataset 2). ii) For library amplification, 10 or 19-23 PCR cycles were used for Dataset 1 and Dataset 2, respectively. The number of cycles was chosen according to the amount of cDNA obtained for each sample. After purification, DNA concentration was measured in a Qubit instrument before sequencing.
Libraries sequencing and quality assessment
For 5’-end-mapping libraries from Dataset 1, Illumina MiSeq paired-end sequencing producing 100 nt reads was used. For 5’ end directed libraries from Dataset 2 as well as for all expression libraries, Illumina HiSeq 2000 paired-end sequencing producing 50 nt reads was used. Sequencing was performed at the UMass Medical School Deep Sequencing Core Facility. Quality of the generated fastq files was checked using FastQC.
Identification of 5’ ends and discrimination between transcription start sites (TSSs) and cleavage sites (CSs)
Paired-end reads generated from 5’-end-directed libraries were mapped to the M. smegmatis mc2155 NC_008596 reference genome. In order to reduce noise from the imprecision of transcriptional initiation, only the coordinate with the highest coverage in each 5 nt window was used for downstream analyses. For read filtering, different criteria were used for the 2 datasets according to the library depth and quality (see Figure S3). In order to discriminate between TSSs and CSs, the ratio of the coverage in converted/non-converted libraries for each detected 5’ end was calculated. To focus our analyses on the 5’ ends that are relatively abundant in their local genomic context, we employed a filter based on the ratio of 5’ end coverage to expression library coverage in the preceding 100 nt. 5’ ends for which this ratio was ≤0.05 were removed. After this filter, 15,720 5’ ends remained and were further analyzed using Gaussian mixture modeling to differentiate TSSs from CSs with high confidence in Dataset 1 (Figure 1A). For this analysis, we used the iterative expectation maximization (EM) algorithm in the mixtools package (Benaglia et al., 2009) for R (version 1.1.0) to fit the mixture distributions.
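The classification step above can be illustrated with a minimal sketch. It is not the mixtools implementation used in this study: it is a plain-Python stand-in that fits a two-component one-dimensional Gaussian mixture by EM. The use of log2-transformed converted/non-converted ratios, the median-split initialization, the floor on the standard deviation, and the toy values are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev

def normal_pdf(x, mu, sd):
    """Density of a normal distribution with mean mu and standard deviation sd."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def fit_two_component_mixture(values, n_iter=100, min_sd=0.05):
    """Fit a two-component 1-D Gaussian mixture by EM.

    Initialization splits the sorted values at the midpoint (an assumption,
    not the mixtools default). Returns (weights, means, sds).
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    groups = [ordered[:half], ordered[half:]]
    means = [mean(g) for g in groups]
    sds = [max(pstdev(g), min_sd) for g in groups]
    weights = [len(g) / len(ordered) for g in groups]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each value
        resp = []
        for v in values:
            dens = [weights[k] * normal_pdf(v, means[k], sds[k]) for k in (0, 1)]
            total = dens[0] + dens[1] + 1e-300  # guard against underflow
            resp.append((dens[0] / total, dens[1] / total))
        # M-step: re-estimate weights, means and standard deviations
        for k in (0, 1):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            sds[k] = max(math.sqrt(var), min_sd)
    return weights, means, sds

def tss_probability(x, weights, means, sds):
    """Posterior probability that x belongs to the higher-ratio (TSS-like) component."""
    hi = 0 if means[0] > means[1] else 1
    dens = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in (0, 1)]
    return dens[hi] / (dens[0] + dens[1] + 1e-300)

# Toy log2(converted / non-converted) ratios: six CS-like and six TSS-like 5' ends
log_ratios = [-0.4, -0.2, 0.0, 0.1, 0.3, -0.1, 3.6, 4.0, 4.3, 3.9, 4.5, 4.1]
weights, means, sds = fit_two_component_mixture(log_ratios)
high_confidence_tss = [r for r in log_ratios
                       if tss_probability(r, weights, means, sds) >= 0.95]
```

In this sketch, 5’ ends whose posterior probability for the high-ratio component is at least 0.95 would be kept as high-confidence TSSs, mirroring the threshold reported in the Results.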
Analysis of expression libraries
Reads were aligned to the NC_008596 reference genome using the Burrows-Wheeler Aligner (Li & Durbin, 2009). For comparison of gene expression levels according to presence or absence of Shine-Dalgarno sequences, RPKMs were calculated for all genes. The DESeq2 pipeline was used to evaluate the changes in gene expression in hypoxia (Love et al., 2014).
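For reference, RPKM (reads per kilobase of transcript per million mapped reads) can be computed as in the short sketch below. The function name and the example numbers are hypothetical; the source does not specify how the read counts were obtained.

```python
def rpkm(read_count, gene_length_nt, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    if gene_length_nt <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    return read_count * 1e9 / (gene_length_nt * total_mapped_reads)

# Hypothetical gene: 500 reads on a 1,000-nt gene in a library of 1,000,000 mapped reads
example_rpkm = rpkm(500, 1000, 1_000_000)
```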
Transcription start sites categorization
For analysis in Figure 1D, TSSs were classified as follows: those coordinates located ≤ 500 bp upstream from an annotated gene were considered to be primary TSSs (pTSS). Coordinates located within an annotated gene were classified as internal (iTSS) or N-associated internal TSSs (N-iTSSs) if they were located within the first 25% of the annotated coding sequence. N-iTSSs were considered for reannotation as pTSSs only if their associated gene lacked a pTSS. TSSs located on the antisense strand of a coding sequence, 5’ UTR, or 3’ UTR were considered as antisense (aTSS). 5’ UTR boundaries were assigned after assignment of pTSSs to genes annotated in the reference genome NC_008596. When a gene had more than one pTSS, the longest of the possible 5’ UTRs was used for assignment of aTSSs. In the case of genes for which we did not identify a pTSS, we considered a hypothetical leader sequence of 50 bp for assignment of aTSSs. For assignment of aTSSs in 3’ UTRs, we arbitrarily considered a sequence of 50 bp downstream of the stop codon of the gene to be the 3’ UTR. Finally, TSSs not belonging to any of the above-mentioned categories were classified as orphan (oTSSs).
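The categorization rules above can be sketched as follows. This is a simplified illustration, not the pipeline used in the study: it applies the hypothetical 50 bp leader and 50 bp 3’ UTR to every gene when testing for antisense TSSs (the study used mapped 5’ UTRs where available), and the priority order among non-primary categories is an assumption. The `Gene` type, function name and coordinates are hypothetical.

```python
from typing import List, NamedTuple

class Gene(NamedTuple):
    start: int   # leftmost coordinate, 1-based inclusive
    end: int     # rightmost coordinate, 1-based inclusive
    strand: str  # "+" or "-"

# pTSS wins when a TSS meets several criteria; the rest of the order is assumed
PRIORITY = ["pTSS", "N-iTSS", "iTSS", "aTSS", "oTSS"]

def classify_tss(pos: int, strand: str, genes: List[Gene],
                 max_leader: int = 500, flank: int = 50) -> str:
    """Return the highest-priority category for a TSS at (pos, strand)."""
    found = {"oTSS"}
    for g in genes:
        length = g.end - g.start + 1
        if strand == g.strand:
            # Signed distance from the start codon along the direction of transcription
            offset = pos - g.start if strand == "+" else g.end - pos
            if -max_leader <= offset <= 0:
                found.add("pTSS")
            elif 0 < offset < length:
                found.add("N-iTSS" if offset / length <= 0.25 else "iTSS")
        elif g.start - flank <= pos <= g.end + flank:
            # Opposite strand, within the gene or its assumed 50 bp UTRs
            found.add("aTSS")
    return min(found, key=PRIORITY.index)

# Hypothetical annotation with one gene on each strand
genes = [Gene(1000, 1999, "+"), Gene(3000, 3999, "-")]
labels = [classify_tss(p, s, genes) for p, s in
          [(950, "+"), (1100, "+"), (1800, "+"), (1500, "-"), (4100, "-"), (6000, "+")]]
```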
Operon prediction
Adjacent genes with the same orientation were considered to be co-transcribed if there were at least 5 spanning reads between the upstream and the downstream gene in at least one of the replicates in the expression libraries from Dataset 1. After this filtering, a downstream gene was excluded from the operon if: 1) it had a TSS ≤ 500 bp upstream of the annotated start codon on the same strand, and/or 2) had a TSS within the first 25% of the gene on the same strand, and/or 3) the upstream gene had a TSS within the last 50-100% of the coding sequence. Finally, the operon was assigned only if the first gene had a primary TSS with a confidence ≥ 95% according to the Gaussian mixture modeling.
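A minimal sketch of how these rules could be chained into operon calls is given below. The `GeneEvidence` record and its field names are hypothetical, the genes are assumed to be listed in transcription order on a single strand, and criterion 3 is interpreted here as a TSS in the second half of the upstream coding sequence, which is an assumption.

```python
from typing import List, NamedTuple

class GeneEvidence(NamedTuple):
    name: str
    has_confident_ptss: bool         # pTSS with >= 95% confidence
    has_upstream_tss: bool           # any TSS <= 500 bp upstream of the start codon
    has_tss_first_quarter: bool      # TSS within the first 25% of the gene
    has_tss_second_half: bool        # TSS within the second half of the gene (assumed reading)
    spanning_reads_to_next: List[int]  # per-replicate reads spanning to the next gene

def predict_operons(genes: List[GeneEvidence], min_reads: int = 5) -> List[List[str]]:
    """Group adjacent same-strand genes into operons of two or more genes."""
    operons, current = [], []
    for i, g in enumerate(genes):
        if current:
            prev = genes[i - 1]
            linked = (any(n >= min_reads for n in prev.spanning_reads_to_next)
                      and not g.has_upstream_tss
                      and not g.has_tss_first_quarter
                      and not prev.has_tss_second_half)
            if linked:
                current.append(g.name)
                continue
            if len(current) >= 2:
                operons.append(current)
            current = []
        # An operon can only start at a gene with a high-confidence pTSS
        if g.has_confident_ptss:
            current = [g.name]
    if len(current) >= 2:
        operons.append(current)
    return operons

# Hypothetical five-gene stretch
toy_genes = [
    GeneEvidence("geneA", True, True, False, False, [7, 2, 0]),
    GeneEvidence("geneB", False, False, False, False, [1, 6, 3]),
    GeneEvidence("geneC", False, False, False, False, [0, 1, 2]),
    GeneEvidence("geneD", True, True, False, False, [9, 9, 9]),
    GeneEvidence("geneE", False, False, True, False, []),
]
toy_operons = predict_operons(toy_genes)
```

Here geneD does not join the first operon because too few reads span the geneC-geneD junction, and geneE is excluded because it carries its own TSS in the first quarter of the gene.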
Cleavage sites categorization
For CS categorization in Figure 4D, we established stringent criteria in order to determine the frequency of CSs in each location category relative to the amount of the genome comprising that location category. For 3’ UTR regions, we considered only CSs that were located between 2 convergent genes. To assess frequency relative to the whole genome, we considered the sum of all regions located between two convergent genes. For 5’ UTRs we considered all CSs located between 2 divergent genes, and the sum of all leader lengths for genes having a pTSS whose upstream gene is on the opposite strand (divergent) determined in this study was used for assessing relative frequency. For 5’ ends corresponding to cleavages between co-transcribed genes we used the operon structures determined in this study, and the sum of all their intergenic regions was used for assessing relative frequency. Finally, for CSs located within coding sequences all genes were considered, as all of them produced reads in the expression libraries. The sum of all coding sequences in the NC_008596 genome was used for assessing relative frequency, after subtracting overlapping regions to avoid redundancy.
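The normalization described above amounts to dividing the number of CSs in a category by the non-redundant length of that category. The sketch below shows one way to do this; the per-kilobase unit, the function names and the example intervals and counts are assumptions for illustration and are not values from this study.

```python
def nonredundant_length(intervals):
    """Total length covered by 1-based inclusive intervals, counting overlaps once."""
    total, current_end = 0, 0
    for start, end in sorted(intervals):
        if end <= current_end:
            continue  # fully contained in a region already counted
        total += end - max(start, current_end + 1) + 1
        current_end = end
    return total

def sites_per_kb(n_sites, region_length_nt):
    """Cleavage sites per kilobase of the given location category."""
    if region_length_nt <= 0:
        raise ValueError("region length must be positive")
    return 1000.0 * n_sites / region_length_nt

# Hypothetical coding sequences, two of which overlap by 50 nt
toy_cds = [(1, 100), (51, 150), (201, 250)]
toy_cds_length = nonredundant_length(toy_cds)   # 150 + 50 nt
toy_frequency = sites_per_kb(4, toy_cds_length)
```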
Results
1. Mapping, annotation and categorization of transcription start sites
In order to study the transcriptome structure of M. smegmatis, RNAs from triplicate cultures in exponential phase were used to construct 5’ end mapping libraries (Dataset 1, D1) according to our previously published methodologies (Shell et al., 2015, Shell et al., 2015) with minor modifications. Briefly, our approach relies on comparison of adapter ligation frequency in a dephosphorylated (converted) library and an untreated (non-converted) library for each sample. The converted libraries capture both 5’ triphosphate and native 5’ monophosphate-bearing transcripts, while the non-converted libraries capture only native 5’ monophosphate-bearing transcripts (Figure S2). Thus, assessing the ratios of read counts in the converted/non-converted libraries permits discrimination between 5’ triphosphate ends (primary transcripts from transcription start sites) and 5’ monophosphate ends (cleavage sites). By employing a Gaussian mixture modeling analysis (Figure 1A) we were able to identify 5,552 TSSs in M. smegmatis with an observed probability of being a TSS ≥0.95 (high confidence TSSs, Table S1). A second filtering method allowed us to obtain 222 additional TSSs from Dataset 1 (Figure S3). A total of 5,774 TSSs were therefore obtained from Dataset 1. In addition, data from separate libraries constructed as controls for the hypoxia experiment (Dataset 2, D2) in Section 8 were also included in this analysis to obtain TSSs. After noise filtering (Figure S3), 4,736 TSSs from D2 were identified. The union of the two datasets yielded a total of 6,090 non-redundant high confidence TSSs, of which 4,420 were detected in both datasets (Figure S4, Table S1).
Although not all 5’ ends could be classified with the Gaussian mixture modeling, we were able to assign 57% of the 5’ ends in Dataset 1 to one of the two 5’ end populations with high confidence (5,552 TSSs and 3,344 CSs). To validate the reliability of the Gaussian mixture modeling used to classify 5’ ends, we performed two additional analyses. First, according to previous findings in Mtb (Cortes et al., 2013) and other well studied bacteria (Sass et al., 2015, Berger et al., 2016, Čuklina et al., 2016, D’arrigo et al., 2016), we anticipated that TSSs should be enriched for the presence of the ANNNT -10 promoter consensus motif in the region upstream. Evaluation of the presence of appropriately-spaced ANNNT sequences revealed that 5’ ends with higher probabilities of being TSSs are enriched for this motif, whereas those 5’ ends with low probabilities of being TSSs (and thus high probabilities of being CSs) have ANNNT frequencies similar to that of the M. smegmatis genome as a whole (Figure 1B). Secondly, we predicted that TSSs should show enrichment for A and G nts at the +1 position, given the reported preference for bacterial RNA polymerases to initiate transcription with these nts (Lewis & Adhya, 2004, Mendoza-Vargas et al., 2009, Mitschke et al., 2011, Cortes et al., 2013, Shell et al., 2015, Thomason et al., 2015, Berger et al., 2016). Thus, we analyzed the base enrichment in the +1 position for the 5’ ends according to the p-value in the Gaussian mixture modeling (Figure 1C). These results show a clear increase in the percentage of G and A bases in the position +1 as the probability of being a TSS increases, while the percentage of sequences having a C at +1 increases as the probability of being a TSS decreases. These two analyses show marked differences in the sequence contexts of TSSs and CSs and further validate the method used for categorization of 5’ ends.
To study the genome architecture of M. smegmatis, the 6,090 TSSs were categorized according to their genomic context (Figure 1D and 1E, Table S1). TSSs located ≤500 nt upstream of an annotated gene start codon in the M. smegmatis NC_008596 reference genome were classified as primary TSSs (pTSS). TSSs within annotated genes on the sense strand were denoted as internal (iTSS). When an iTSS was located in the first quarter of an annotated gene, it was sub-classified as N-terminal associated TSS (N-iTSS), and was further examined to determine if it should be considered a primary TSS (see below). TSSs located on the antisense strand either within a gene or within a 5’ UTR or 3’ UTR were grouped as antisense TSSs (aTSSs). Finally, TSSs located in non-coding regions that did not meet the criteria for any of the above categories were classified as orphan (oTSSs). When a pTSS also met the criteria for classification in another category, it was considered to be pTSS for the purposes of downstream analyses. A total of 4,054 distinct TSSs met the criteria to be classified as pTSSs for genes transcribed in exponential phase. These pTSSs were assigned to 3,043 downstream genes, representing 44% of the total annotated genes (Table S2). This number is lower than the total number of genes expressed in exponential phase, in large part due to the existence of polycistronic transcripts (see operon prediction below). Interestingly, 706 (23%) of these genes have at least two pTSSs and 209 (7%) have three or more, indicating that transcription initiation from multiple promoters is common in M. smegmatis.
A total of 995 iTSSs (excluding the iTSSs that were also classified as a pTSS of a downstream gene, see Figure S5 for classification workflow) were identified in 804 (12%) of the annotated genes, indicating that transcription initiation within coding sequences is common in M. smegmatis. iTSSs are often considered to be pTSSs of downstream genes, to be spurious events yielding truncated transcripts, or to be consequences of incorrect gene start annotations. However, there is evidence supporting the hypothesis that iTSSs are functional and highly conserved among closely related bacteria (Shao et al., 2014), highlighting their potential importance in gene expression.
We were also able to detect antisense transcription in 12.5% of the M. smegmatis genes. Antisense transcription plays a role in modulation of gene expression by controlling transcription, RNA stability, and translation (Morita et al., 2005, Kawano et al., 2007, Andre et al., 2008, Fozo et al., 2008, Giangrossi et al., 2010) and has been found to occur at different rates across bacterial genera, ranging from 1.3% of genes in Staphylococcus aureus to up to 46% of genes in Helicobacter pylori (Beaume et al., 2010, Sharma et al., 2010). Of the 1,006 aTSSs identified here (excluding those that were primarily classified as pTSSs), 881 are within coding sequences, 120 are within 5’ UTRs and 72 are located within 3’ UTRs (note that some aTSSs are simultaneously classified in more than one of these three subcategories, Figure S6). While we expect that many of the detected antisense transcripts have biological functions, it is difficult to differentiate antisense RNAs with regulatory functions from transcriptional noise. In this regard, Lloréns-Rico and collaborators (Llorens-Rico & Cano, 2016) reported that most of the antisense transcripts detected using transcriptomic approaches are a consequence of transcriptional noise, arising at spurious promoters throughout the genome. To investigate the potential significance of the M. smegmatis aTSSs, we assessed the relative impact of each aTSS on local antisense expression levels by comparing the read depth upstream and downstream of each aTSS in our RNAseq expression libraries. We found 318 aTSSs for which expression coverage was ≥10-fold higher in the 100 nt window downstream of the TSS compared to the 100 nt window upstream (Table S3). Based on the magnitude of the expression occurring at these aTSSs, we postulate that they could represent the 5’ ends of candidate functional antisense transcripts rather than simply products of spurious transcription. However, further work is needed to test this hypothesis.
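The upstream/downstream comparison can be sketched as below. This is an illustrative reconstruction rather than the code used for Table S3: the per-base coverage list (0-based, strand-specific), the inclusion of the TSS position in the downstream window, and the pseudocount added to avoid division by zero are all assumptions.

```python
def window_mean(coverage, start, stop):
    """Mean coverage over the half-open index range [start, stop), clipped to the array."""
    start = max(start, 0)
    stop = min(stop, len(coverage))
    window = coverage[start:stop]
    return sum(window) / len(window) if window else 0.0

def downstream_upstream_ratio(coverage, pos, strand, window=100, pseudocount=1.0):
    """Ratio of mean coverage downstream vs upstream of a TSS at 0-based index pos."""
    if strand == "+":
        upstream = window_mean(coverage, pos - window, pos)
        downstream = window_mean(coverage, pos, pos + window)
    else:
        # On the minus strand, transcription proceeds toward lower coordinates
        upstream = window_mean(coverage, pos + 1, pos + 1 + window)
        downstream = window_mean(coverage, pos - window + 1, pos + 1)
    return (downstream + pseudocount) / (upstream + pseudocount)

# Toy coverage tracks with a sharp step at the TSS
plus_cov = [1] * 200 + [50] * 200
minus_cov = [50] * 200 + [1] * 200
flat_cov = [5] * 400
ratio_plus = downstream_upstream_ratio(plus_cov, 200, "+")
ratio_minus = downstream_upstream_ratio(minus_cov, 199, "-")
ratio_flat = downstream_upstream_ratio(flat_cov, 200, "+")
```

Under these assumptions, an aTSS would be flagged as a candidate when the ratio is at least 10, as for the two stepped toy tracks but not the flat one.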
Finally, 78 oTSSs were detected across the M. smegmatis genome. These TSSs may be the 5’ ends of non-coding RNAs or mRNAs encoding previously unannotated ORFs.
Out of the 995 iTSSs identified, 457 were located within the first quarter of an annotated gene (N-iTSSs). In cases where there was no pTSS, we considered the possibility that the start codon of the gene was misannotated and the N-iTSS was in fact the primary TSS. Although we do not discount the possibility that functional proteins can be produced when internal transcription initiation occurs far downstream of the annotated start codon, we only considered N-iTSSs as candidates for gene start reannotation when there was a start codon (ATG, GTG or TTG) in-frame with the annotated gene in the first 30% of the annotated sequence. In this way, we suggest re-annotations of the start codons of 213 coding sequences (see Table S4). These N-iTSSs were considered to be pTSSs (N-iTSSs → pTSSs) for all further analyses described in this work.
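The start codon criterion can be illustrated with a short sketch. It assumes the coding sequence is supplied 5’ to 3’ on the sense strand and that the first in-frame start codon at or downstream of the N-iTSS is the one proposed; the source does not state which codon was chosen when several qualify, so that choice and the function name are assumptions.

```python
START_CODONS = {"ATG", "GTG", "TTG"}

def find_reannotated_start(cds, tss_offset, max_fraction=0.30):
    """Return the 0-based offset of the first in-frame start codon at or after
    tss_offset that lies within the first max_fraction of the CDS, else None."""
    limit = max_fraction * len(cds)
    first = tss_offset + (-tss_offset) % 3  # next in-frame position >= tss_offset
    for i in range(first, len(cds) - 2, 3):
        if i >= limit:
            break
        if cds[i:i + 3] in START_CODONS:
            return i
    return None

# Toy 108-nt coding sequence with an in-frame GTG at offset 12
toy_cds = "ATG" + "CCC" * 3 + "GTG" + "CCC" * 30 + "TAA"
candidate = find_reannotated_start(toy_cds, 5)     # N-iTSS at offset 5
no_candidate = find_reannotated_start(toy_cds, 13)  # N-iTSS past the GTG
```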
2. Operon prediction
To predict operon structure, we combined 5’ end libraries and RNAseq expression data. We considered two or more genes to be co-transcribed if (1) they had spanning reads that overlapped both the upstream and downstream gene in the expression libraries, (2) at least one TSS was detected in the 5’ end-directed libraries for the first gene of the operon, and (3) the downstream gene(s) lacked pTSSs and iTSSs (for more detail, see Materials and Methods). Thus, we were able to identify and annotate 294 operons with high confidence across the M. smegmatis genome (Table S5). These operons are between 2 and 4 genes in length and comprise a total of 638 genes. Our operon prediction methodology has some limitations. For example, operons not expressed in exponential growth phase could not be detected in our study. Furthermore, internal promoters within operons can exist, leading to either monocistronic transcripts or suboperons (Guell et al., 2009, Paletta & Ohman, 2012, Skliarova et al., 2012). We limited our operon predictions to genes that appear to be exclusively co-transcribed, excluding those cases in which an internal gene in an operon can be alternatively transcribed from an assigned pTSS. Finally, our analysis did not capture operons in which the first gene lacked a high-confidence pTSS. Despite these limitations, our approach allowed us to successfully identify new operons as well as previously described operons. Previously reported operons that were captured by our predictions included the furA-katG (MSMEG_6383-MSMEG_6384) operon involved in oxidative stress response (Milano et al., 2001), the vapB-vapC (MSMEG_1283-MSMEG_1284) Toxin–Antitoxin module (Robson et al., 2009) operon, and the ClpP1-ClpP2 (MSMEG_4672-MSMEG_4673) operon involved in protein degradation (Raju et al., 2012).
3. Characterization of M. smegmatis promoters reveals features conserved in M. tuberculosis
Most bacterial promoters have two highly conserved regions, the -10 and the -35, that interact with RNA polymerase via sigma factors. However, it was reported that the -10 region is necessary and sufficient for transcription initiation by the housekeeping sigma factor SigA in mycobacteria, and no SigA -35 consensus motifs were identified in previous studies (Cortes et al., 2013, Newton-Foot & Gey van Pittius, 2013, Zhu et al., 2017). To characterize the core promoter motifs in M. smegmatis on a global scale, we analyzed the 50 bp upstream of the TSSs. We found that 4,833 of 6,090 promoters analyzed (79%) have an ANNNT motif located between positions -6 and -13 upstream of the TSSs (Figure 2A). In addition, 63% of the promoters with ANNNT motifs have a thymidine preceding this sequence (TANNNT). This motif is similar to that previously described in a transcriptome-wide analysis for Mtb (Cortes et al., 2013) and for most bacterial promoters that are recognized by the σ70 housekeeping sigma factor (Ramachandran et al., 2014, Sass et al., 2015, Berger et al., 2016, Čuklina et al., 2016, D’arrigo et al., 2016). However, no apparent bias towards specific bases in the NNN region was detected in our study or in Mtb, while in other bacteria such as E. coli, S. enterica, B. cenocepacia, P. putida, and B. subtilis an A/T preference was observed in this region (Jarmer et al., 2001, Ramachandran et al., 2014, Sass et al., 2015, Berger et al., 2016, D’arrigo et al., 2016). We were unable to detect a consensus motif in the -35 region either using the MEME server (Bailey et al., 2015) or by manually assessing the possible base-enrichment in the -35 region. Analysis of the sequences in the immediate vicinity of TSSs revealed that G and A are the most frequent bases at the +1 position, and C is considerably more abundant at -1 (Figure 2B).
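A search for the -10 motif of the kind described above could be sketched as follows. The sketch assumes that the upstream sequence is given 5’ to 3’ with its last base at position -1, and that the whole five-base ANNNT motif must lie within positions -13 to -6; the exact spacing rule used in the study is not specified beyond that window, so this interpretation and the function name are assumptions.

```python
def find_minus10(upstream, first=-13, last=-6):
    """Look for ANNNT fully contained in positions first..last (negative, relative
    to the TSS at +1). Returns (position_of_A, preceded_by_T) or None."""
    n = len(upstream)
    for a_pos in range(first, last - 4 + 1):  # A may sit at -13, -12, -11 or -10
        i = n + a_pos                          # index of the candidate A
        if i < 0:
            continue
        if upstream[i] == "A" and upstream[i + 4] == "T":
            preceded_by_t = i > 0 and upstream[i - 1] == "T"
            return a_pos, preceded_by_t
    return None

# Toy 50-nt upstream regions
with_tannnt = "G" * 38 + "TAGCCT" + "G" * 6   # TANNNT with A at -11
with_annnt = "G" * 39 + "AGCCT" + "G" * 6     # ANNNT with A at -11, no preceding T
without_motif = "G" * 50
hits = [find_minus10(s) for s in (with_tannnt, with_annnt, without_motif)]
```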
Interestingly, we identified several alternative motifs in the -10 promoter regions of transcripts lacking the ANNNT motif (Figure 2A). One of these, (G/C)NN(G/C)NN(G/C), is likely the signature of M. smegmatis’ codon bias in the regions upstream of iTSSs. The other three sequences are candidate binding sites for alternative sigma factors, which are known to be important in regulation of transcription under diverse environmental conditions. However, the identified consensus sequences differ substantially from those previously described in mycobacteria (Raman et al., 2001, Raman et al., 2004, Sun et al., 2004, Lee et al., 2008, Lee et al., 2008, Song et al., 2008, Veyrier et al., 2008, Humpel et al., 2010, Gaudion et al., 2013). The TSSs having these sigma factor motifs and the associated genes are listed in Table S6. We next examined the relationship between promoter sequence and promoter strength, as estimated by the read depths in the 5’ end converted libraries. As shown in Figure 2C, the expression levels of transcripts with ANNNT -10 motifs are on average substantially higher than those lacking this sequence. In addition, promoters with the full TANNNT motif are associated with more highly abundant transcripts compared to those having a VANNNT sequence, where V is G, A or C. These results implicate TANNNT as the preferred -10 sequence for the housekeeping sigma factor, SigA, in M. smegmatis. As shown in Figure 2C, expression levels of transcripts having the motif 2 in Figure 2A were significantly increased when compared to those lacking the ANNNT motif.
4. Leaderless transcription is a prominent feature of the M. smegmatis transcriptome
5’ UTRs play important roles in post-transcriptional regulation and translation, as they may contain regulatory sequences that can affect mRNA stability and/or translation efficiency. Whereas in most bacteria 5’-UTR bearing (“leadered”) transcripts predominate, this is not the case for Mtb, in which nearly one quarter of the transcripts have been reported to be leaderless (Cortes et al., 2013, Shell et al., 2015). To investigate this feature in M. smegmatis, we analyzed the 5’ UTR lengths of all genes that had at least one pTSS. We found that for 24% of the transcripts the TSS coincides with the translation start site or produces a leader length ≤5 nt, resulting in leaderless transcripts (Figure 3A). A total of 1,099 genes (including those re-annotated in section 1) have leaderless transcripts, and 155 of those (14%) are also transcribed as leadered mRNAs from separate promoters. For leadered transcripts, the median 5’ UTR length was 69 nt. Interestingly, 15% of the leaders are > 200 nt, suggesting that these sequences may contain regulatory elements. We then sought to compare the leader lengths of M. smegmatis genes with the leader lengths of their homologs in Mtb. For this analysis we used two independent Mtb pTSS mapping datasets obtained from Cortes et al., 2013 and Shell et al., 2015 (Figure 3B). To avoid ambiguities, we used only genes that had a single pTSS in both species. Our results show a statistically significant correlation of leader lengths between species, suggesting that similar genes conserve their transcript features and consequently may have similar regulatory mechanisms. Additionally, comparison of leaderless transcription in M. smegmatis and Mtb revealed that 62% or 73% of the genes that are only transcribed as leaderless in M. smegmatis also lack a 5’ UTR in Mtb, according to Cortes et al., 2013 or Shell et al., 2015, respectively (Table S7).
We next assessed whether leaderless transcripts are associated with particular gene categories, and found that the distribution across categories was uneven (Figure 3C). The three categories “DNA metabolism,” “Amino acid biosynthesis,” and “Biosynthesis of cofactors, prosthetic groups and carriers” were significantly enriched in leaderless transcripts (p-value < 0.05, hypergeometric test), while “Signal transduction,” “Transcription,” and “Transport and binding proteins” appear to have fewer leaderless transcripts.
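For readers unfamiliar with the hypergeometric test, the enrichment p-value for a single category can be sketched in Python as the upper tail of the hypergeometric distribution. The function name and the counts in the example are hypothetical and serve only to illustrate the calculation; they are not values from this study.

```python
from math import comb


def hypergeom_enrichment_p(k, K, n, N):
    """Upper-tail hypergeometric probability P(X >= k).

    N: total number of genes considered
    K: genes belonging to the category of interest
    n: genes with leaderless transcripts
    k: leaderless genes that belong to the category
    """
    upper = min(K, n)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical toy counts: 10 genes, 5 in the category, 5 leaderless,
# all 5 leaderless genes falling in the category.
p = hypergeom_enrichment_p(k=5, K=5, n=5, N=10)
print(p)  # equals 1/252
```

When many categories are tested, a multiple-testing correction would normally be applied to the resulting p-values; the sketch above does not include one.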
We next evaluated the presence of the Shine-Dalgarno ribosome-binding site (SD) upstream of leadered coding sequences. For this analysis, we considered those leaders containing at least one of the three tetramers AGGA, GGAG or GAGG (core sequence AGGAGG) in the region -6 to -17 relative to the start codon to possess canonical SD motifs. We found that only 47% of leadered coding sequences had these canonical SD sequences. Thus, considering also the leaderless RNAs, a large number of transcripts lack canonical SD sequences, suggesting that translation initiation can occur through multiple mechanisms in M. smegmatis. We further compared the relative expression levels of leaderless and leadered coding sequences subdivided by SD status. Genes expressed as both leadered and leaderless transcripts were excluded from this analysis. We found that on average, expression levels were significantly higher for those genes with canonical SD sequences than for those with leaders but lacking this motif and for those that were leaderless (Figure 3D). Together, these data suggest that genes that are more efficiently translated also have higher transcript levels. Similar findings were made in Mtb, where proteomic analyses showed increased protein levels for genes with SD sequences compared to those lacking this motif (Cortes et al., 2013).
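The leader and SD criteria stated above (leader length ≤5 nt for leaderless transcripts; AGGA, GGAG or GAGG within positions -17 to -6 for a canonical SD) can be expressed as a short Python sketch. The function name, the category labels, and the requirement that a tetramer lie wholly inside the window are assumptions for the example, not a description of the scripts used here.

```python
SD_TETRAMERS = ("AGGA", "GGAG", "GAGG")


def classify_transcript(leader):
    """Classify a transcript from its 5' UTR sequence.

    `leader` is the 5'->3' sequence from the TSS to the base immediately
    upstream of the start codon (position -1); an empty string means the
    TSS coincides with the start codon.
    """
    seq = leader.upper()
    if len(seq) <= 5:
        return "leaderless"
    # Positions -17..-6 relative to the start codon (12 nt, fewer if the
    # leader is short). Assumption: the tetramer must lie fully inside.
    window = seq[-17:-5]
    if any(tetramer in window for tetramer in SD_TETRAMERS):
        return "leadered_SD"
    return "leadered_no_SD"


# Hypothetical leaders.
print(classify_transcript(""))                               # leaderless
print(classify_transcript("C" * 10 + "AGGAGG" + "C" * 7))    # leadered_SD
print(classify_transcript("C" * 30))                         # leadered_no_SD
```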
5. Identification of novel leaderless ORFs in the M. smegmatis genome
As GTG or ATG codons are sufficient to initiate leaderless translation in mycobacteria (Shell et al., 2015, Potgieter et al., 2016), we used this feature to look for unannotated ORFs in the M. smegmatis NC_008596 reference genome. Using 1,579 TSSs that remained after pTSS assignment and gene reannotation using N-iTSSs (see Figure S5) we identified a total of 66 leaderless ORFs encoding putative proteins longer than 30 amino acids, 5 of which were previously identified (Shell et al., 2015). 83% of these ORFs were predicted in other annotations of the M. smegmatis mc2155 or MKD8 genome (NC_018289.1, (Gray et al., 2013)), while 10 of the remaining ORFs showed homology to genes annotated in other mycobacterial species and Helobdella robusta and two ORFs did not show homology to any known protein. These results show that automatic annotation of genomes can be incomplete and highlight the utility of transcriptomic analysis for genome (re)annotation. Detailed information on these novel putative ORFs is provided in Table S8.
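The search logic for candidate leaderless ORFs can be illustrated with a minimal Python sketch: a 5’ end is a candidate if the sequence beginning at the TSS starts with ATG or GTG and remains open in frame for more than 30 codons. The function name and toy sequences are hypothetical, and the sketch assumes that the TSS coincides exactly with the first base of the start codon and that the input is already oriented on the transcribed strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def leaderless_orf_length(seq_from_tss, min_aa=30):
    """Return the ORF length in codons (stop codon excluded) if the sequence
    starting at the TSS begins with ATG or GTG and encodes more than
    `min_aa` amino acids before an in-frame stop; otherwise return None.
    """
    seq = seq_from_tss.upper()
    if seq[:3] not in ("ATG", "GTG"):
        return None
    for i in range(3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            n_codons = i // 3  # start codon plus all codons before the stop
            return n_codons if n_codons > min_aa else None
    return None  # no in-frame stop codon within the supplied sequence


# Hypothetical toy sequences.
long_orf = "GTG" + "GCC" * 30 + "TAA"    # 31 codons before the stop
short_orf = "ATG" + "GCC" * 10 + "TGA"   # 11 codons, below the threshold
print(leaderless_orf_length(long_orf))
print(leaderless_orf_length(short_orf))
```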
6. Endonucleolytic RNA cleavage occurs at a distinct sequence motif and is common in mRNA regulatory regions
As our methodology allows us to precisely map RNA cleavage sites in addition to TSSs, we sought to analyze the presence and distribution of cleavage sites in the M. smegmatis transcriptome. mRNA processing plays a crucial role in regulation of gene expression, as it is involved in mRNA maturation, stability and degradation (Arraiano et al., 2010). Mixture modeling identified 3,344 CSs with a posterior probability ≥0.9 (high confidence CSs) (Figure 1A, Table S9). To determine the sequence context of the CSs, we used the regions flanking the 5’ ends to generate a sequence logo (Figure 4A). There was a strong preference for a cytosine in the +1 position (present in more than 90% of the CSs) (Figure 4B), suggesting that it may be structurally important for RNase recognition and/or catalysis.
Cleaved 5’ ends can represent either degradation intermediates or transcripts that undergo functional processing/maturation. In an attempt to investigate CS function, we classified them according to their locations within mRNA transcripts (Figure 4C, Table S9). We found that, after normalizing to the proportion of the expressed transcriptome that is comprised by each location category, cleaved 5’ ends are more abundant within 5’ UTRs and intergenic regions of operons than within coding sequences and 3’ UTRs (Figure 4D). Stringent criteria were used in these analyses to avoid undesired bias (Figure 4C and Materials and Methods). While one would expect the CSs associated with mRNA turnover to be evenly distributed throughout the transcript, enrichment of CSs within the 5’ UTRs as well as between two co-transcribed genes may be indicative of cleavages associated with processing and maturation. Alternatively, these regions may be more susceptible to RNases due to lack of associated ribosomes. Here we predicted with high confidence that at least 101 genes have one or more CSs in their 5’ UTRs (Table S10).
We detected cleaved 5’ ends within the coding sequences of 18% of M. smegmatis genes, ranging from 1 to over 40 sites per gene. We analyzed the distribution of CSs within coding sequences (Figure S7), taking into consideration the genomic context of the genes. When analyzing the distribution of CSs within the coding sequences of genes whose downstream gene has the same orientation, we observed an increase in CS frequency in the region near the stop codon (Figure S7.A). However, when only coding sequences having a downstream gene on the opposite strand (convergent) were considered, the distribution of CSs through the coding sequences was significantly different (p-value <0.0001, Kolmogorov-Smirnov D test) with the CSs more evenly distributed throughout the coding sequence (Figure S7.B). This suggests that the cleavage bias towards the end of the genes observed in Figure S7.A may be due to the fact that many of these CSs are actually occurring in the 5’ UTRs of the downstream genes. In cases where the TSS of a given gene occurs within the coding sequence of the preceding gene, a CS may map to both the coding sequence of the upstream gene and the 5’ UTR of the downstream gene. In these cases, we cannot determine in which of the two transcripts the cleavage occurred. However, cleavages may also occur in polycistronic transcripts. We therefore assessed the distributions of CSs in the operons predicted above. The distribution of CSs in genes co-transcribed with a downstream gene showed a slight increase towards the last part of the gene (Figure S7.C). This may reflect cases in which polycistronic transcripts are cleaved near the 3’ end of an upstream gene, as has been reported for the furA-katG operon, in which a cleavage near the stop codon of furA was described (Milano et al., 2001, Sala et al., 2008, Taverniti et al., 2011). The furA-katG cleavage was identified in our dataset, located 1 nt downstream of the previously reported position. 
A similar enrichment of CSs towards stop codons was also observed in a recent genome-wide RNA cleavage analysis in Salmonella enterica (Chao et al., 2017), although in this case the high frequency of cleavage may be also attributed to the U preference of RNase E in this organism, which is highly abundant in these regions.
7. Prediction of additional TSSs and CSs based on sequence context
The sequence contexts of TSSs (Figure 2B) and CSs (Figure 4A) were markedly different, as G and A were highly preferred in the TSS +1 position whereas C was highly preferred in the CS +1 position, and TSSs were associated with a strong overrepresentation of ANNNT -10 sites while CSs were not. These sequence-context differences not only provide validation of our methodology for distinguishing TSSs from CSs, as discussed above, but also provide a means for making improved predictions of the nature of 5’ ends that could not be categorized with high confidence based on their converted/non-converted library coverage alone. Taking advantage of these differences, we sought to obtain a list of additional putative TSSs and CSs. Thus, of the 5’ ends that were not classified with high confidence by mixture modeling, we selected those that had an appropriately positioned ANNNT motif upstream and a G or an A in the +1 position and classified them as TSSs with medium confidence (Table S11). In the same way, 5’ ends with a C in the +1 position and lacking the ANNNT motif in the region upstream were designated as medium confidence CSs (Table S12). In this way, we were able to obtain 576 and 4,838 medium confidence TSSs and CSs, respectively. Although we are aware of the limitations of these predictions, these lists of medium confidence 5’ ends provide a resource that may be useful for guiding further studies. 5’ ends that did not meet the criteria for high or medium confidence TSSs or CSs are reported in Table S13.
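The decision rule described above can be summarized in a brief Python sketch. Here "appropriately positioned" is taken to mean that the ANNNT pentamer lies within positions -13 to -6, which is an assumption based on the window reported for high-confidence TSSs; the function name and labels are likewise hypothetical.

```python
import re

ANNNT = re.compile(r"A[ACGT]{3}T")


def classify_5prime_end(upstream, plus1):
    """Assign a medium-confidence label to a 5' end from sequence context.

    `upstream` is the 5'->3' sequence ending at position -1 and `plus1`
    is the base at the 5' end itself (+1).
    """
    window = upstream.upper()[-13:-5]  # positions -13..-6 (assumed window)
    has_minus10 = ANNNT.search(window) is not None
    base = plus1.upper()
    if has_minus10 and base in ("A", "G"):
        return "TSS_medium"
    if not has_minus10 and base == "C":
        return "CS_medium"
    return "unclassified"


# Hypothetical examples.
promoter_like = "GGCCGGC" + "TATAAT" + "GCCGCCG"
print(classify_5prime_end(promoter_like, "G"))  # TSS_medium
print(classify_5prime_end("G" * 20, "C"))       # CS_medium
print(classify_5prime_end("G" * 20, "G"))       # unclassified
```

Ends that satisfy neither rule, such as a C at +1 downstream of an ANNNT motif, remain unclassified under this rule.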
8. The transcriptional landscape changes in response to oxygen limitation
We sought to study the global changes occurring at the transcriptomic level in oxygen limitation employing the Wayne model (Wayne & Hayes, 1996) with some modifications (see Materials and Methods). Two timepoints were experimentally determined in order to evaluate transcriptomic changes during the transition into hypoxia (Figure S1). A different enzyme was used for conversion of 5’ triphosphates to 5’ monophosphates in these 5’-end libraries, and it appeared to be less effective than the enzyme used for the 5’ end libraries in dataset 1. As a consequence, our ability to distinguish TSSs from CSs de novo in these datasets was limited. However, we were able to assess changes in abundance of the 5’ ends classified as high-confidence TSSs or CSs in Dataset 1, as well as identify a limited number of additional TSSs and CSs with high confidence (Figure S4, Table S1). Corresponding RNAseq expression libraries revealed that, as expected, a large number of genes were up and downregulated in response to oxygen limitation (Figure S8, Table S14). We next investigated the transcriptional changes in hypoxia by assessing the relative abundance of TSSs in these conditions. We found 318 high-confidence TSSs whose abundance varied substantially between exponential phase and hypoxia (Table S15). A robust correlation was observed between the pTSS peak height in the 5’-end-directed libraries and RNA levels in the expression libraries for hypoxia (Figure S9). In an attempt to identify promoter motifs induced in hypoxia, we analyzed the upstream regions of those TSSs whose abundance increased (fold change ≥2, adjusted p-value ≤0.05). Interestingly, we detected a conserved GGGTA motif in the -10 region of 56 promoters induced in hypoxia using MEME (Figure 5A, Table S15). This motif was reported as the binding site for alternative sigma factor SigF (Rodrigue et al., 2007, Hartkoorn et al., 2010, Humpel et al., 2010). 
Additionally, the extended -35 and -10 SigF motif was found in 44 of the 56 promoter sequences (Figure 5A, Table S15). SigF was shown to be induced in hypoxia at the transcript level in Mtb (Iona et al., 2016) and highly induced at the protein level under anaerobic conditions using the Wayne model in M. bovis BCG strain (Michele et al., 1999) (Galagan et al., 2013). In M. smegmatis, SigF was shown to play a role under oxidative stress, heat shock, low pH and stationary phase (Gebhard et al., 2008, Humpel et al., 2010, Singh et al., 2015) and sigF RNA levels were detected in exponential phase at a nearly comparable level to sigA (Singh & Singh, 2008). Here, we did not detect significant changes in expression of the sigF gene in hypoxia at the transcript level. However, this is consistent with reported data showing that sigF transcript levels remain unchanged under stress conditions in M. smegmatis (Gebhard et al., 2008), as it was postulated that SigF is post-transcriptionally modulated via an anti-sigma factor rather than through sigF transcription activation (Beaucher et al., 2002). We noted that, in the case of TSSs whose abundance was reduced in hypoxia, almost all of the promoters contain the -10 ANNNT σ70 binding motif. We then examined the presence of the SigF motif in the regions upstream of 5’ ends that were not classified as high confidence TSSs. We speculate that 5’ ends associated with this motif may be potential TSSs triggered by hypoxia. We found 96 additional putative TSSs that were (1) overrepresented in hypoxia and (2) associated with appropriately-spaced SigF motifs (Table S16). Three of the hypoxia-induced genes with SigF motifs have homologous genes induced in hypoxia in Mtb (Park et al., 2003, Rustad et al., 2008).
It is well known that under anaerobic conditions mycobacteria induce the DosR regulon, a set of genes implicated in stress tolerance (Rosenkrands et al., 2002, O’Toole et al., 2003, Park et al., 2003, Roberts et al., 2004, Rustad et al., 2008, Honaker et al., 2009, Leistikow et al., 2010). The DosR transcriptional regulator was highly upregulated at both hypoxic timepoints in the expression libraries (13 and 18-fold at 15 and 24 hours, respectively). Thus, we hypothesized that the DosR binding motif should be present in a number of regions upstream of the TSSs that were upregulated in hypoxia. Analysis of the 200 bp upstream of the TSSs using the CentriMo tool for local motif enrichment analysis (Bailey & Machanick, 2012) allowed us to detect putative DosR motifs in 13 or 53 promoters, depending on whether a stringent (GGGACTTNNGNCCCT) or a weak (RRGNCYWNNGNMM) consensus sequence was used as input (Lun et al., 2009, Berney et al., 2014, Gomes et al., 2014) (Table S15). At least two of the 13 genes downstream of these TSSs were previously reported to have DosR motifs by Berney and collaborators (Berney et al., 2014) and the RegPrecise Database (Novichkov et al., 2013), and two others are homologs of genes in the Mtb DosR regulon that were not previously described in M. smegmatis as regulated by DosR (Table S15).
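To make the two DosR consensus sequences concrete, the following Python sketch expands IUPAC degenerate codes into a regular expression and reports exact matches in an upstream region. This is a simplified stand-in and not a reimplementation of CentriMo, which evaluates positional motif enrichment. The function names and the toy sequence are hypothetical, and only the forward strand is searched.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "M": "[AC]", "K": "[GT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_pattern(consensus):
    """Translate an IUPAC consensus into a regex string."""
    return "".join(IUPAC[code] for code in consensus.upper())


def find_sites(upstream, consensus):
    """Return 0-based start indices of all (possibly overlapping) exact
    matches of the consensus on the forward strand of `upstream`."""
    # A lookahead makes overlapping matches visible to finditer.
    pattern = re.compile("(?=" + consensus_to_pattern(consensus) + ")")
    return [m.start() for m in pattern.finditer(upstream.upper())]


STRINGENT = "GGGACTTNNGNCCCT"
WEAK = "RRGNCYWNNGNMM"

# Hypothetical upstream region containing one stringent site at index 4.
toy = "AAAA" + "GGGACTTAAGTCCCT" + "AAAA"
print(find_sites(toy, STRINGENT))
print(find_sites(toy, WEAK))
```

Because the weak consensus is more degenerate, it matches every site found by the stringent consensus plus additional ones, which is in line with the larger number of promoters reported for it above.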
We then used CentriMo to search for DosR motifs in the regions upstream of 5’ ends that were not classified as high confidence TSSs, given that TSSs derived from hypoxia-specific promoters may have been absent from Dataset 1. We found 36 putative TSSs associated with 20 different genes (Table S17), of which 11 have been shown to have DosR binding motifs (Berney et al., 2014). Five of these are homologs of genes in the Mtb DosR regulon.
9. M. smegmatis decreases RNA cleavage under oxygen limitation
There is evidence that mRNA is broadly stabilized under hypoxia and other stress conditions (Rustad et al., 2013, Ignatov et al., 2015). Thus, we anticipated that RNA cleavage should be reduced under hypoxia as a strategy to stabilize transcripts. We compared the relative abundance of each high confidence CS in stress and in exponential phase (Figure 5B) and found that RNA cleavage is significantly reduced on a global scale at both 15 h and 24 h of hypoxia (Figure 5C). In contrast, the relative abundance of TSSs did not decrease in these conditions, indicating that the reduction in CSs is not an artefact of improper normalization (Figure 5B). When the ratios of CS abundance in hypoxia/normal growth were analyzed for individual genes, we observed the same behavior (Figure S10). These results indicate that the number of cleavage events per gene decreases during adaptation to hypoxia, which could contribute to the reported increases in half-life (Rustad et al., 2013).
Discussion
In the past years, genome-wide transcriptome studies have been widely used to elucidate the genome architecture and modulation of transcription in different bacterial species (Albrecht et al., 2009, Mendoza-Vargas et al., 2009, Mitschke et al., 2011, Cortes et al., 2013, Schlüter et al., 2013, Dinan et al., 2014, Ramachandran et al., 2014, Innocenti et al., 2015, Sass et al., 2015, Thomason et al., 2015, Berger et al., 2016, Čuklina et al., 2016, D’arrigo et al., 2016, Heidrich et al., 2017, Li et al., 2017, Zhukova et al., 2017). Here we combined 5’-end-directed libraries and RNAseq expression libraries to shed light on the transcriptional and post-transcriptional landscape of M. smegmatis in different physiological conditions.
The implementation of two differentially treated 5’-end libraries followed by Gaussian mixture modeling analysis allowed us to simultaneously map and classify 5’ ends resulting from nucleolytic cleavage and those resulting from primary transcription with high confidence. We were able to classify 57% of the 5’ ends in Dataset 1 with high confidence. In addition, we elaborated a list of medium confidence TSSs and CSs (Tables S11 and S12). These lists constitute a valuable resource for the research community.
Analysis of TSS mapping data allowed us to identify over 4,000 primary TSSs and to study the transcript features in M. smegmatis. The high proportion of leaderless transcripts, the lack of a consensus SD sequence in half of the leadered transcripts, and the absence of a conserved -35 consensus sequence indicate that the transcription-translation machineries are relatively robust in M. smegmatis. The robustness of transcription and translation are features shared with Mtb, where 25% of the transcripts lack a leader sequence (Cortes et al., 2013, Shell et al., 2015). In addition, high abundances of transcripts lacking 5’ UTRs have been reported in other bacteria including Corynebacterium diphtheriae, Leptospira interrogans, Borrelia burgdorferi, and Deinococcus deserti, the latter having 60% leaderless transcripts (de Groot et al., 2014, Adams et al., 2017, Zhukova et al., 2017, Wittchen et al., 2018). Considering the high proportion of leaderless transcripts and the large number of leadered transcripts that lack an SD sequence (53%), it follows that a substantial number of transcripts are translated without canonical interactions between the mRNA and the anti-Shine-Dalgarno sequence, suggesting that M. smegmatis has versatile mechanisms for initiating translation. A computational prediction showed that the presence of SD sequences can vary widely among prokaryotes, ranging from 11% in Mycoplasma to 91% in Firmicutes (Chang et al., 2006). Cortes et al. (Cortes et al., 2013) reported that 55% of the genes transcribed with a 5’ UTR in Mtb lack the SD motif. These similarities between M. smegmatis and M. tuberculosis, along with the correlation of leader lengths for homologous genes between species shown in Figure 3B, provide further evidence that many transcriptomic features are conserved between mycobacterial genomes. These data support the idea that in many cases, similar mechanisms govern modulation of gene expression in both species.
In an attempt to understand the role of RNA cleavage in mycobacteria, we identified and classified over 3,000 CSs throughout the M. smegmatis transcriptome, presenting the first report of an RNA cleavage map in mycobacteria. The most striking feature of the CSs was a cytidine in the +1 position, which was true in over 90% of the cases. While the RNases involved in global RNA decay in mycobacteria have not yet been elucidated, some studies have implicated RNase E as a major player in RNA processing and decay (Kovacs et al., 2005, Zeller et al., 2007, Csanadi et al., 2009, Taverniti et al., 2011), given its central role in other bacteria such as E. coli and its essentiality for survival in both M. smegmatis and Mtb (Sassetti et al., 2003, Sassetti & Rubin, 2003, Griffin et al., 2011, Taverniti et al., 2011, DeJesus et al., 2017). It is therefore possible that mycobacterial RNase E, or other endonucleases with dominant roles, favor cytidine in the +1 position. Interestingly, the sequence context of cleavage found here is different from that described for E. coli, for which the consensus sequence is (A/G)N↓AU (Mackie, 2013), or for S. enterica, in which a marked preference for uridine at the +2 position and AU-rich sequences are important for RNase E cleavage (Chao et al., 2017).
RNA cleavage is required for maturation of some mRNAs (Li & Deutscher, 1996, Condon et al., 2001, Gutgsell & Jain, 2010, Moores et al., 2017). Therefore, the observation that CSs are enriched in 5’ UTRs and intergenic regions suggests that processing may play roles in RNA maturation, stability, and translation for some transcripts in M. smegmatis. A high abundance of processing sites around the translation start site was also observed in P. aeruginosa and S. enterica in transcriptome-wide studies (Chao et al., 2017, Gill et al., 2018), suggesting that 5’ UTR cleavage may be a widespread post-transcriptional mechanism for modulating gene expression in bacteria.
Regulation of RNA decay and processing plays a crucial role in adaptation to environmental changes. We present evidence showing that RNA cleavage is markedly reduced in conditions that result in growth cessation. It was previously demonstrated that in low oxygen concentrations mycobacteria reduce their RNA levels (Ignatov et al., 2015) and mRNA half-life is strikingly increased (Rustad et al., 2013), likely as a mechanism to maintain adequate transcript levels in the cell without the energy expenditures that continuous transcription would require. While several factors are involved in the regulation of transcript abundance and stability, the observation that cleavage events are markedly reduced in these conditions pinpoints this mechanism as a potential way to control RNA stability under stress. In agreement with this hypothesis, RNase E was modestly but significantly decreased at the transcript level in early and late hypoxia (fold change = 0.63 and 0.56, respectively, adjusted p-value <0.05), suggesting that reducing RNase E abundance in the cell may be a strategy to increase transcript half-life. Further study is needed to better understand the relationship between transcript processing and RNA decay in normoxic growth as well as stress conditions.
Hypoxic stress conditions were also characterized by major changes in the TSSs. 5’-end-mapping libraries revealed that over 300 TSSs varied substantially when cultures were limited in oxygen. We found that 56 transcripts triggered in hypoxia contain the SigF promoter binding motif, indicating that this sigma factor plays a substantial role in the M. smegmatis hypoxia response. While previous work revealed increased expression of SigF itself in hypoxia in Mtb (Galagan et al., 2013, Iona et al., 2016, Yang et al., 2018), this is the first report demonstrating the direct impact of SigF on specific promoters in hypoxic conditions in mycobacteria. Further work is needed to better understand the functional consequences of SigF activation in both organisms in response to hypoxia.
Acknowledgements
This work was supported by NSF CAREER award 1652756 to SSS. We are grateful to Zheyang Wu and Thomas Ioerger for helpful advice on the mixture modeling and Michael Chase for helpful advice on other aspects of our data analysis pipeline. We thank members of the Shell lab for technical assistance and helpful discussions. All Illumina sequencing was performed by the UMass Medical School Deep Sequencing Core.