Abstract
The majority of bacterial genomes have high coding efficiencies, but some genomes of intracellular bacteria have low gene density. The genome of the endosymbiont Sodalis glossinidius contains almost 50% pseudogenes, which carry mutations that putatively silence them at the genomic level. We have applied multiple omic strategies, combining Illumina and Pacific Biosciences Single-Molecule Real-Time DNA sequencing and annotation, stranded RNA sequencing, and proteome analysis, to better understand the transcriptional and translational landscape of Sodalis pseudogenes and potential mechanisms for their control. Between 53% and 74% of the Sodalis transcriptome remains active in cell-free culture. Mean sense transcription from coding domain sequences (CDS) is four times greater than that from pseudogenes. Comparative genomic analysis of six Illumina-sequenced Sodalis isolates from different host Glossina species shows pseudogenes make up ~40% of the 2,729 genes in the core genome, suggesting that they are stable and/or Sodalis is a recent introduction across the Glossina genus as a facultative symbiont. These data shed further light on the importance of transcriptional and translational control in deciphering host-microbe interactions, and demonstrate that pseudogenes are more complex than a simple degrading DNA sequence. The combination of genomics, transcriptomics and proteomics gives a multidimensional perspective for studying prokaryotic genomes with a view to elucidating evolutionary adaptation to novel environmental niches.
Importance
Bacterial genes are generally ~1 kb in length and organized efficiently (i.e. with few gaps between genes or operons), and few open reading frames (ORFs) lack any predicted function. Intracellular bacteria have been removed from the extracellular selection pressures acting on pathways of declining importance to fitness, and these bacteria therefore tend to delete redundant genes in favour of smaller functional repertoires, maintaining genome efficiency. In the genomes of endosymbionts with a recent evolutionary relationship with their host, however, this process of genome reduction is not complete. Genes and pathways may be at an intermediate stage, undergoing mutation linked to reduced selection and to the small populations that are vertically transmitted from mother to offspring in their hosts, resulting in an increased abundance of pseudogenes and reduced coding capacity. A greater knowledge of the genomic architecture of persistent pseudogenes, with respect to their DNA structure, mRNA transcription and even putative translation to protein products, will lead to a better understanding of the evolutionary trajectory of endosymbiont genomes, many of which have important roles in arthropod ecology.
Introduction
The genomes of intracellular parasites and endosymbiotic bacteria evolve under conditions that are fundamentally different to those of free-living organisms (1). In many arthropod systems, bacteria can provide nutrients that are otherwise scarce to their host (such as B-vitamins absent from blood meals, or essential amino acids absent from plant sap), in exchange for host provision of protection, nutrition and mechanisms for vertical or horizontal transmission (2, 3). Obligate intracellular symbionts are maintained by the host and have evolved strategies that ensure their vertical transmission to the next generation of hosts. Ultimately this intracellular lifestyle, small population size and strict vertical transmission can result in extremely reduced genomes (1, 3–6). The general theory and process of this extreme genome reduction have been well studied using genomic data for intracellular bacteria, including endosymbionts such as Buchnera in aphids (7, 8) and Wigglesworthia in tsetse flies (9). However, gene loss is not limited to obligate intracellular pathogens and symbionts with strict vertical transmission; it is also observed in free-living bacteria and facultative symbionts (10).
One of the most important mechanisms for gene loss is pseudogenisation, resulting from the accumulation of nonsense mutations in protein coding sequences (1). These mutations putatively silence the gene at the genomic level, resulting in theoretically non-functional genes/proteins (11). Prokaryotic pseudogenes generally exist at levels of approximately 1% to 5% (12). Comparative genomic analysis between closely related strains suggests that pseudogenes are often associated with reduced selective pressure on redundant gene sets, allowing mutations to accumulate and inactivate genes. This has been observed as Salmonella changes host range or utilizes a new environment (13). The low level of pseudogenes in most bacteria suggests that they are removed rapidly from the genomes due to strong selection for genome efficiency (11). There are, however, examples among the intracellular pathogens and endosymbionts of high levels of pseudogene presence, reducing coding capacity down towards 50% in Sodalis glossinidius (14) and Mycobacterium leprae (15). Likewise, pseudogenes can persist for long periods: the mean half-life of Buchnera aphidicola pseudogenes has been estimated to be 24 million years (16). Pseudogenes have been well studied in the context of comparative genomics to understand how gene loss has shaped bacterial genomes (17), but whether they continue to contribute to the genetic capabilities of the bacterium has seldom been assessed (18). It could, for instance, be suggested that if pseudogene-derived transcription retains some form of cis/trans regulatory function, then this could select for pseudogene retention in the genome (19). It is also clear that under some circumstances, specifically where polymerase infidelity corrects for a frameshift within homopolymeric tracts at the transcriptional level, pseudogenes can still produce functional proteins that contribute to the fitness of the bacterium (20).
In this study we aim to understand the importance of pseudogenes in bacterial genome evolution in a model of a degrading bacterial genome, that of Sodalis glossinidius. Sodalis is a facultative intracellular, secondary endosymbiont of the tsetse fly (Diptera: Glossina). The variable frequency of Sodalis in natural populations suggests that Sodalis is not an obligatory component of the tsetse microbiome (21); however, the occurrence of Sodalis in natural populations has been linked to an increased capacity of tsetse to vector African trypanosomes (22). Interestingly, Sodalis has a relatively large genome for a facultatively intracellular endosymbiont (~4 Mbp), and two genome annotations suggest that pseudogene levels are between 29% (14) and 38% (23) of the total gene content. Additionally, by combining the latest high-throughput sequencing and proteomics methods, we hope to shed light on potential post-transcriptional regulatory mechanisms that may be mitigating any potential deleterious effects. At the RNA level, riboswitches (24) or small RNAs (sRNAs), short 50-300 bp transcripts that act through imperfect base-pairing interactions, have been shown to regulate genes in this manner (25). DNA methylation could also serve as a mechanism by which to control transcription and/or translation (26, 27).
Sodalis glossinidius represents an ideal system in which to test hypotheses surrounding pseudogene functionality and their evolution, as the organism maintains an unusually reduced coding capacity, yet remains amenable to cell culture allowing for sufficient DNA, RNA and peptides to be extracted for poly-omic analyses. First: Assuming genes with nonsense mutations are non-functional and therefore costly to the cell, pseudogenes should be evolving rapidly and be removed from the genome. Secondly: If pseudogene transcription or translation is deleterious, pseudogenes should be transcriptionally and translationally silent. Thirdly: Given hypothesis two, we can expect there to be genetic mechanisms to silence pseudogenes, and we will be able to identify genetic and transcriptional features that determine pseudogene status using a combination of genomic, expression and proteomic analysis. To that end, we tested these hypotheses by: 1. Establishing pseudogene content and evolution using pan-genome data; 2. Evaluating genome-wide methylation data and negative strand expression to elucidate potential expression control mechanisms; and 3. Correlating mRNA and protein expression levels to understand functional control of pseudogenes.
Results
Sodalis Genomics
To provide an updated, accurate reference for transcriptome mapping of the S. glossinidius isolate used in this study, we sequenced de novo a Sodalis glossinidius isolate (SgGMMB4, from the host Glossina morsitans morsitans) using Pacific Biosciences sequencing. A single SMRTcell produced a total of 48,519 reads, with a mean length of 10,290 bp and a mean read score of 0.85. The chromosome was assembled into a single 4.1 Mbp contig, along with one copy of each of the three plasmid sequences: pSG1 (90,747 bp), pSG2 (38,394 bp) and pSG3 (10,640 bp). Previously published annotations from the GMM4 isolate (14, 23, 28) were used, alongside a manually curated PROKKA-generated annotation of the PacBio-sequenced isolate, to generate a new annotation of our Sodalis sequence (29). The overall mean genome GC content is 54.4 %, and the pseudogenes and CDS have a similar GC content of ~55.5 %. The revised annotation presented contains 3,336 putative CDS and 2,286 putative pseudogenes. In addition, 43 putative riboswitch domains, belonging to seven families, were identified (Supplementary Table 2). A Prodigal (v. 2.6.2) (30) gene prediction, including a Ribosome Binding Site (RBS) identification scan, suggested a prevalence of standard methionine-coding ATG start codons (78 %), which decreases in the case of pseudogenes (67 %). GTG start codons are the next most common (14 % of all genes; 21 % of pseudogenes), followed by TTG (8 % overall; 12 % of pseudogenes). 61 % of all genes (48 % of pseudogenes) have an RBS predicted within 5-10 bp of the start codon, and 26 % of all genes (38 % of pseudogenes) have no discernible RBS. A MUMmer plot (31) of plasmid pSG1 reveals a 6,371 bp tandem repeat encompassing the inactive Type IV secretion system operon that is not present in the original Sodalis sequencing and annotation experiments, perhaps due to a collapsed repeat missed by first-generation sequencing assembly.
One example of specific gene degradation is the S. glossinidius type III secretion and motility system. Three Sodalis Symbiosis Regions (SSR1-3) have been implicated in establishing or maintaining symbiosis (32). Our study suggests that one Type III island – SSR3 is the most functionally intact. Despite much of SSR2 being degraded, the HilA regulator (originally annotated as a pseudogene) remains intact, and active, suggesting an important trans-acting function for this gene, and further emphasising its functional status (Supplementary Data 3). FlgMN in SSR1 appears to be of importance to the organism, due to this operon also remaining intact, in comparison to the rest of the degraded SSR1 (Supplementary Data 5).
Sodalis Transcriptomics
To ascertain whether pseudogenes are being transcribed, or if their transcription is being regulated throughout growth, stranded RNA-sequencing was performed on three-replicates in three conditions across a bacterial growth experiment in cell-free media (Early Log Phase, ELP; Late Log Phase, LLP; Late Stationary Phase, LSP).
Figure 1A shows boxplots of mean sense and antisense transcription (log(transcripts per million + 1)) for each condition. From the density plots of sense transcription (Figure 1B) and of overall transcription (Figure 1C), it can be seen that there is a clear signal of no transcription from putatively inactive genes (log(TPM + 1) = 0). Other studies have used transcripts per million (TPM) thresholds of ≥1 (33) or ≥10 (34) as an indicator of activity, and both thresholds are displayed on each of these figures. For this study, we have defined putatively active genes as those with TPM ≥ 1 (an arbitrary threshold) in all three biological replicates in at least one condition. Genes with TPM ≥ 10 in all three biological replicates in at least one condition are additionally described as being active. 53% of all combined genes and pseudogenes, 3,087 (2,237 CDS; 850 pseudogenes), exhibit active sense transcription (TPM ≥ 10) in at least one condition, with an additional 1,191 (547 CDS; 644 pseudogenes) putatively active (TPM ≥ 1; a total of 73.7% of all genes and pseudogenes). Additionally, 1,088 genes (703 CDS; 385 pseudogenes) showed active antisense transcription, with an additional 993 genes (629 CDS; 364 pseudogenes) exhibiting putatively active antisense transcription according to the same rules (Figure 1D).
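The activity rule described above can be sketched in a few lines of Python. This is a minimal illustration, assuming per-gene TPM values are available as a dictionary keyed by growth condition; the function name, the data layout and the example values are hypothetical and are not taken from the study's pipeline.

```python
from typing import Dict, List


def classify_activity(tpm: Dict[str, List[float]]) -> str:
    """Classify one gene from replicate TPM values keyed by growth condition.

    "active": TPM >= 10 in every replicate of at least one condition.
    "putatively active": TPM >= 1 in every replicate of at least one condition.
    Otherwise "inactive".
    """
    def some_condition_passes(threshold: float) -> bool:
        # A condition passes only if it has replicates and all of them meet the threshold.
        return any(
            bool(reps) and all(value >= threshold for value in reps)
            for reps in tpm.values()
        )

    if some_condition_passes(10.0):
        return "active"
    if some_condition_passes(1.0):
        return "putatively active"
    return "inactive"


# Hypothetical gene: strongly expressed in early log phase only.
example = {
    "ELP": [12.1, 15.0, 11.3],
    "LLP": [0.4, 2.0, 1.1],
    "LSP": [0.0, 0.0, 0.2],
}
print(classify_activity(example))  # active
```

Requiring every replicate to pass within a single condition, rather than pooling replicates across conditions, is what keeps a single noisy replicate from marking a gene as active.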
Across all conditions, mean sense CDS expression (TPM=245.33) is significantly greater than that of pseudogenes (TPM = 60.96; Mann-Whitney U Test; W = 46797000, p-value < 2.2e-16). Mean antisense CDS expression (TPM=24.74) is also significantly greater than that for pseudogenes (TPM=13.64; W = 37651000, p-value < 2.2e-16).
Differential expression analysis using EdgeR (implemented in DEGUST), suggests, of the actively transcribed genes, 938 CDS and 219 pseudogenes are being differentially expressed between either LLP or LSP when compared to ELP growth (FDR ≤ 0.05). 219 CDS and 106 pseudogenes showed differential antisense expression using the same rules. Figure 2 displays volcano plots of log Fold Change against negative log FDR for Late Log Phase (Figure 2A), and for Late Stationary Phase (Figure 2B), versus Early Log Phase. Each shows that some pseudogenes are highly likely to be differentially expressed between conditions.
SgGMMB4 Proteome
To assess whether transcription may relate to translation we performed proteomic analysis from pooled bacterial cells from all conditions in the cell-free growth experiment. The PROKKA gene models and the 6-frame translations comprised a total of 5,625 and 6,769 proteins respectively. With our MS/MS search on the combined search database, we identified 1,503 attributable to the PROKKA annotation. In two instances, an alternative start codon in the annotation (GTG/Val) was manually changed to ATG to encode a Methionine residue, in order to match. It should be noted that these identified proteins are the representative proteins from each protein group as reported by ProteoAnnotator. ProteoAnnotator reports one representative protein from each protein ambiguity group (consisting of one or more proteins) formed due to sharing the same set or subset of peptide identifications. This strategy avoids double counting of proteins with no independent evidence. We further computed the Exponentially Modified Protein Abundance Index (emPAI) value as a semi-quantitative measure of protein abundance for the representative proteins using the mzidLibrary (Supplementary Figure 1) (35). We identified 34 pseudogene annotations from previous annotations that corresponded with the presence of a protein product in our data (Supplementary Data 2 and 3).
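As a reminder of what the emPAI value represents, the commonly used definition is emPAI = 10^(N_observed / N_observable) − 1, where N_observed is the number of peptides identified for a protein and N_observable is the number of peptides that could theoretically be observed. The sketch below assumes that definition; the exact peptide-counting rules applied by mzidLibrary are not reproduced, and the function name is hypothetical.

```python
def empai(n_observed: int, n_observable: int) -> float:
    """Exponentially modified protein abundance index for one protein."""
    if n_observable <= 0:
        raise ValueError("n_observable must be a positive integer")
    if n_observed < 0:
        raise ValueError("n_observed cannot be negative")
    # PAI is the observed/observable ratio; emPAI rescales it exponentially.
    return 10 ** (n_observed / n_observable) - 1


print(empai(0, 5))  # 0.0 (no peptides observed)
print(empai(2, 2))  # 9.0 (every observable peptide observed)
```

Because the index depends only on peptide counts, it is semi-quantitative: it ranks proteins by approximate abundance but is not a substitute for label-based or intensity-based quantification.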
SgGMMB4 Methylome and Codon Usage
To assess whether methylation patterns differed between intact CDS and pseudogenes, we used the ability of PacBio SMRT sequencing to detect epigenetic modifications (for example 6mA, 4mC or 5mC) by comparing the sequencing profiles, specifically the interpulse duration, between native DNA and PCR-amplified DNA (36). 24,869 of 29,832 (83.4%) 5′-GATC-3′ motifs in the SgGMMB4 chromosome are predicted to be adenine-methylated (6mA). No other epigenetic modifications or underlying motifs were detected. CDS display a significantly higher frequency of methylation (4.68 per gene) than pseudogenes (2.54 per pseudogene; Kruskal-Wallis chi-squared = 382.61, df = 2, p-value < 2.2e-16). Mean CDS length is greater than mean pseudogene length (714 bp vs 415 bp). Although CDS and pseudogenes do not significantly differ in their underlying mean GC content (CDS = 55.48%, pseudogenes = 55.71%), there are fewer 5′-GATC-3′ sites within pseudogenes (mean 3 per pseudogene) than in coding sequences (mean 5.6 per CDS; Kruskal-Wallis chi-squared = 428.82, df = 3, p-value < 2.2e-16; Supplementary Data 6). There is an increased frequency of both GAT (increase of 2.7 codons per 1,000) and ATC (increase of 2.2 codons per 1,000) codons in CDS vs. pseudogenes (Supplementary Figure 2).
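The motif bookkeeping behind these figures can be sketched as follows. This is a minimal illustration, assuming a plain DNA string as input; the function names are hypothetical, and the detection of methylation itself (from interpulse duration) is not modelled. Because GATC is its own reverse complement, counting sites on one strand also locates the sites on the other.

```python
def count_motif(sequence: str, motif: str = "GATC") -> int:
    """Count occurrences of motif in sequence, allowing overlapping matches."""
    sequence = sequence.upper()
    motif = motif.upper()
    width = len(motif)
    return sum(
        1
        for start in range(len(sequence) - width + 1)
        if sequence[start:start + width] == motif
    )


def percent(part: int, whole: int) -> float:
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


print(count_motif("ttGATCaaGATCgg"))       # 2
# Methylated fraction of chromosomal GATC motifs, using the counts reported in the text.
print(round(percent(24869, 29832), 1))     # 83.4
```

Dividing a gene's motif count by its length gives a per-kb site density, which is one way to check whether the lower methylation count in pseudogenes simply reflects their shorter length.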
Comparative Genomics to other Sodalis isolates
Comparing SNP rates and pseudogene carriage between multiple genomes of Sodalis isolated from different tsetse hosts could reveal whether pseudogenes are being deleted at different rates, or if there is relaxed selective pressure acting on pseudogenes. Low SNP rates and stable pseudogene carriage, indicated by high numbers of pseudogenes contributing to the Sodalis core genome, would be consistent with a recent association of Sodalis with the tsetse hosts and would imply that pseudogenes may not be under relaxed selective pressure. ROARY pan-genome analysis of the six Illumina-sequenced isolates compared to our PacBio-sequenced annotation assigned 3,183 CDS and 2,301 pseudogenes to either the core genome (all seven genomes), soft core (2 to 6 genomes) or cloud (one genome). ROARY suggests that there are 2,729 core CDS, 358 soft core CDS, and 184 cloud CDS. 1,796 pseudogenes are assigned to the core genome, which represents ~40% of the overall core genome, despite the phylogenetic distance between hosts. There are an additional 280 soft core and 137 cloud pseudogenes (Figure 2B). Core single-nucleotide polymorphisms (SNPs) are not targeted towards pseudogenes: core SNP analysis using SNIPPY suggests there are 474 core SNP loci in pseudogenes, 814 in intact CDS and 540 in intergenic regions.
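The pan-genome categories used above, and the ~40% figure, can be reproduced with a short sketch. The category boundaries below follow the definitions given in the text (all seven genomes, 2 to 6 genomes, one genome) rather than any tool's default thresholds; the function name is hypothetical.

```python
def pan_genome_class(n_present: int, n_genomes: int = 7) -> str:
    """Assign a gene cluster to a pan-genome category from the number of
    genomes in which it was found."""
    if not 1 <= n_present <= n_genomes:
        raise ValueError("n_present must be between 1 and n_genomes")
    if n_present == n_genomes:
        return "core"
    if n_present >= 2:
        return "soft core"
    return "cloud"


# Counts reported in the text for the core genome.
core_cds = 2729
core_pseudogenes = 1796
pseudogene_share = 100.0 * core_pseudogenes / (core_cds + core_pseudogenes)

print(pan_genome_class(7), pan_genome_class(4), pan_genome_class(1))
print(round(pseudogene_share, 1))  # 39.7
```

Note that the percentage is taken over core CDS plus core pseudogenes (4,525 clusters), which is how 1,796 pseudogenes come to roughly 40% of the core genome.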
Functional corrections of pseudogenes could be important for their ecological contribution or evolutionary trajectory. To test for RNA-based (functional) corrections of genomic SNPs at the transcriptional level, a SNIPPY analysis of the RNA-seq data was performed. A single base insertion in a C(8) homopolymeric tract corrects a frameshift in a putative proline/betaine transporter (base position 869745, PROKKA_01173-01174/SG0498). Neither of the proP2/3 ORFs has been annotated as a pseudogene in this analysis, although no peptide has been detected for either ORF. A second insertion was found to correct for a frameshift in a putative transposase (C(9)>C(10); PROKKA_02124/SG0946). Finally, a single base insertion (G>GG) was found close to the SlyA_2 transcriptional regulator (PROKKA_03264/SG1443) that putatively either alters the region immediately upstream of the 5′ end or extends and alters the first 22 amino acids of the annotated gene. A BLASTP search of the sequence suggests the current annotation is very similar to SlyA/MarR orthologues in other endosymbionts (query coverage ≥ 98%; e ≤ 4e-75). Each of these variants was identified only in the RNA-seq data, and in neither the Illumina nor the PacBio DNA sequencing data.
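To illustrate how a single-base insertion in a homopolymeric tract can restore a reading frame, the sketch below compares a toy "genomic" sequence carrying a C(8) tract with a toy "transcript" carrying C(9). The sequences are invented for illustration and are not the Sodalis proP sequence; the helper name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons_before_stop(sequence: str) -> int:
    """Count in-frame codons from position 0 up to (not including) the first
    stop codon; return the number of complete codons if no stop is found."""
    sequence = sequence.upper()
    for start in range(0, len(sequence) - 2, 3):
        if sequence[start:start + 3] in STOP_CODONS:
            return start // 3
    return len(sequence) // 3


# Hypothetical flanking sequence around a poly-C tract.
FLANK_5 = "ATG"
FLANK_3 = "GTGACCAAGTAA"

genomic = FLANK_5 + "C" * 8 + FLANK_3      # C(8): frame shifts into a premature TGA
transcript = FLANK_5 + "C" * 9 + FLANK_3   # C(9): frame restored up to the final TAA

print(codons_before_stop(genomic))     # 4
print(codons_before_stop(transcript))  # 7
```

In the toy genomic sequence the eight-base tract pushes the downstream codons out of frame so that a TGA is read after four codons, whereas the nine-base tract in the transcript restores the full seven-codon frame.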
Discussion
It seems logical to consider pseudogenes as potentially maintaining a function until their association with transcriptional processes has been silenced. This is particularly pertinent in the case of Secondary (S) symbionts with high proportions of pseudogenes, like Sodalis, which are presumed to be evolving towards an obligate association with their host. In secondary symbiosis, the current opinion is that degeneration can largely be attributed to small, vertically transmitted, populations with little diversity reducing the ability of the organism to purge deleterious mutations (i.e. leading to the generation and persistence of pseudogenes) (1, 37, 38).
The primary goal of this study was to establish whether bacterial pseudogenes remain active despite genomic degradation, using Sodalis as a model given the number of putatively inactivated, functionless genes persisting in its genome. We sought to combine DNA sequencing, stranded RNA-sequencing and proteomic analysis to fully describe the Sodalis transcriptional and translational landscape with a view to better understand the evolution and functional control of bacterial pseudogenes and the process of endosymbiont genome degradation.
Pseudogenes harbour residual sense and antisense transcription
We have shown that bacterial pseudogenes can be both actively transcribed and dynamically regulated during growth. This is in line with previous work using tiling arrays in Mycoplasma pneumoniae, wherein frequent antisense and non-coding transcripts were identified in a degrading bacterial genome (39). Pseudogene-derived transcripts (such as antisense small RNAs derived from pseudogenes) could act as regulators for orthologues elsewhere in the genome (40), and such an association may reduce the selective pressure towards their deletion. One of the most differentially expressed pseudogenes without a hypothetical annotation, for instance, is trehalase, an enzyme involved in the breakdown of trehalose, a sugar commonly found in insect haemolymph (41). Among the possible explanations for the differential expression of trehalase in cell-free culture (which lacks trehalose as a constituent) is residual global control of metabolic processes beyond single sugars. Given the importance of this sugar in insect systems, it would be interesting to further test this pseudogene for residual function.
Defining which associations constitute significant function for which positive pressure ensures their persistence is difficult: associations with promoters, transcription factors or cis/trans-acting transcriptional regulators could all select for pseudogene retention in the genome, and reduce the likelihood of full deletion. An increasing number of bacterial small RNAs have been identified through transcriptomic analyses, including in Streptococcus (42) and Borrelia (43). Pseudogene-derived antisense RNA may be involved in the complex interactions between genome, transcriptome and proteome: sRNAs have been implicated in gene regulation of multiple target genes through processes such as translational inhibition and activation, or transcript stability (44). Hfq, a crucial chaperone involved in bacterial sRNA processing, is maintained in Sodalis (PROKKA_00878). Therefore, further studies into the roles of Sodalis sRNAs – including pseudogene-derived sRNAs and the role of Hfq or other chaperones – will be critical for full understanding of the complexity of gene regulation in this degrading bacterial genome.
Pseudogenes are difficult to define
By using stranded RNA sequencing, and comparing transcription between Sodalis CDS and pseudogenes, we have shown that intact CDS show significantly greater mean levels of transcription than pseudogenes, but there remains a proportion of CDS with little or no expression, any of which could be non-functional and mis-annotated. It should be noted that this experiment relied on cell-free culture; CDS may therefore not be expressed due to their functional redundancy in cell-free culture, and follow-up experiments may be required in further media types, or in insect-cell co-culture, to fully ascertain the Sodalis transcriptional repertoire. Equally, there remain pseudogenes with residual activity, contrary to the classical definition of a pseudogene, and it is therefore clear that problems remain with the identification and annotation of pseudogenes. We and others have identified novel genes, including genes potentially important in regulating flagellum and/or Type III secretion machinery (HilA) or in quorum sensing (SlyA/MarR) (45). Defining pseudogenes using any individual genomic assay is difficult: ORFs may be shortened by frameshift mutations yet retain functional domains and appropriate transcriptional architecture, for instance. Coding sequences are generally characterized following a set of canonical rules of gene structure: the presence of an open reading frame (ORF), a promoter and ribosomal binding site (RBS), a methionine (or, occasionally, alternative) start codon and a stop codon. Pseudogenes are predicted wherever such rules break down: in the case of S. glossinidius, pseudogenes were predicted where <50 % of the functional homologue remained intact.
Although studies at the single-cell level in E. coli (46), and in some conditions at the population level in Clostridium (47), suggest that levels of mRNA and protein can remain uncorrelated and be regulated independently of one another, our data suggest that a balance may exist between mRNA transcript and protein abundance, as a semi-quantitative measure of peptide abundance correlates with sense expression. It is likely that each tier of control (i.e. at the DNA, transcription and translation levels) may act on another: for instance, sRNA may impact mRNA levels, or protein interactions may regulate transcription. There remains a range of bacterial transcriptional processes still to be fully characterized, including 5′-UTRs, alternative promoters and alternative transcriptional start and stop sites. Further experiments using techniques such as terminal endonuclease linked RNA-seq, which has been employed in similar experiments in Salmonella enterica serovar Typhimurium (48), would shed further light on the transcriptional landscape of this bacterium.
Transcriptional- and post-transcriptional pseudogene control mechanisms remain to be ascertained
Given its dual role in mismatch repair and the regulation of gene expression, Dam-mediated methylation of 5′-GATC-3′ motifs in bacterial genomes represents a potentially important factor to investigate. While Pacific Biosciences sequencing allowed for the examination of methylation status by comparing modified to unmodified DNA, the potential role methylation might play in pseudogene control remains difficult to ascertain: pseudogenes displayed a significantly decreased rate of 6mA methylation when compared to CDS, probably due to the tendency for pseudogenes to have fewer 5′-GATC-3′ methylation motifs at the genetic level (because pseudogenes are smaller than CDS). Dam-mediated methylation is predicted to post-transcriptionally regulate gene expression by altering the affinity of proteins for DNA, such as at the origin of replication (oriC) (49). In S. enterica serovar Typhimurium, adenine methylation has been implicated in regulating quorum sensing derived virulence factors, and as such Dam inhibitors or Dam-silenced pathogens have been studied for their antimicrobial or vaccine potential, respectively (50). Adenine methylation has also been implicated in protecting symbionts from heat stress (51).
Pseudogene abundance is stable between Sodalis genomes
Given that we expect Sodalis to be routinely undergoing population bottlenecks through vertical transmission in its host, we could expect genetic drift to be acting on genes under little selective pressure, increasing SNP and/or pseudogenisation rates, or even driving their deletion. As accessory genomes diverge prior to SNPs arising in the core genome (52), examining the Sodalis pan-genome derived from S. glossinidius isolates from multiple tsetse hosts enabled us to examine pseudogene stability. The high number of pseudogenes in the Sodalis core genome suggests pseudogenes are stable across Sodalis strains infecting different tsetse species, in line with the suggestion that Sodalis shares an evolutionarily recent association with its tsetse host. The persistence of pseudogenes implies that the maintenance of function of degraded genes may outweigh any deleterious effects, or that there exists a mechanism by which such deleterious effects are mitigated. Kuo and Ochman have previously suggested that Salmonella pseudogenes may lack sufficient negative pressure for deletion (11). In models of cyanobacterial genomes, increased resource levels and decreased mortality have been suggested to select for slower reproduction and streamlined genomes (53). Experimental evolution experiments in Methylobacterium have shown that accessory gene deletion confers a direct fitness benefit under selective environments, rather than the associated benefit of the reduced fitness costs of maintaining a shorter genome in its own right (54).
Another interesting aspect of pseudogene transcription is the potential for mRNA transcripts to correct for non-synonymous mutations in the DNA sequence. We detected one instance of this in our analysis. The proline/betaine transporter (PROKKA_01174/SG0498) contains a frameshift but has never been annotated as a pseudogene in any of the available annotations. In this analysis, a peptide was not detected for either of the open reading frames associated with the proP_1/2 gene, however RNA sequencing detected a single base insertion inside a homopolymeric tract that corrects for the DNA frameshift. Short read RNA sequencing may be limited in its ability to resolve such corrections given the presence of 5 proline/betaine transporter coding ORFs in the SgGMMB4 genome (all other proP ORFs are annotated as pseudogenes). As this study relied on RNA-shearing based library preparation, it would be interesting to follow up this study with full-length third-generation RNA sequencing (i.e. using Pacific Biosciences cDNA (Isoseq) or direct RNA sequencing using Oxford Nanopore technologies) to elucidate the sequences of full-length mRNA and pre-mRNA, which would further enhance our knowledge as to how pseudogenes continue to contribute to overall transcription and its control despite ongoing genomic degradation.
Conclusions
It is well established that pseudogenes are ubiquitous within the Sodalis genome. The persistence of pseudogenes across phylogenetically distinct strains implies that either S. glossinidius is only recently associated with tsetse flies in evolutionary terms, or that pseudogene content was established prior to its association with tsetse (or upon its association with its last common ancestor). We have revealed that whilst transitioning from a free-living to symbiotic status, Sodalis pseudogenes are often transcribed, but at a significantly lower level than intact CDS. Some pseudogenes even remain under active transcriptional control, exhibiting differential expression throughout growth; however, proteomic analysis suggests they ultimately do not contribute to the protein content of the cell. The lack of expression from some intact CDS and pseudogenes underpins the difficulty in pseudogene identification, especially in cell-free culture where the correct conditions for their expression may be lacking. That a combination of sense and antisense transcription of pseudogenes persists implies a role for pseudogene transcription in control mechanisms (e.g. cis/trans small RNA transcriptional control), and such transcription could even be playing a role in wide-reaching mechanisms such as host-symbiont or symbiont-symbiont interaction. Given the proximity of Sodalis to medically important parasites and other bacteria within the tsetse host, further study of these mechanisms is of interest for identifying novel therapeutic interventions.
Materials and Methods
DNA Sequencing
For PacBio sequencing: Sodalis glossinidius strain GMMB4 (SgGMMB4) was isolated from Glossina morsitans morsitans (Westwood) from the Langford derived long-term colony maintained at the University of Edinburgh in 2005. Six further S. glossinidius isolates were cultured from lab-based tsetse for Illumina sequencing: GP1 and GPP4 were isolated from Glossina palpalis palpalis; GAA from G. austeni; GF4 from G. fuscipes; GM1 and GMM4 from G. morsitans as previously described (55).
Bacteria were recovered from -80 °C storage by incubation at 25 °C on Columbia agar plates supplemented with 10 % defibrinated horse blood (TCS Biosciences) in microaerophilic conditions (~5-12% CO2; CampyGen, Oxoid, UK). An individual colony was picked and grown to late stationary phase in cell-free culture at 25 °C in Schneider's Insect Medium (Sigma, UK) supplemented with 10 % Fetal Calf Serum (Life Technologies, UK). High molecular weight whole genomic DNA (gDNA) was extracted from the subsequent bacterial pellet using the Zymo Research Universal gDNA extraction kit (SgGMMB4) or the Qiagen DNeasy kit (other isolates) according to the manufacturer's instructions.
SgGMMB4 gDNA was sequenced on the Pacific Biosciences RS-II instrument (PacBio) at the Centre for Genomic Research at the University of Liverpool on a single SMRTcell using P6-C4 chemistry with no prior size selection. Reads were assembled and contigs polished using HGAP.3, resulting in a polished assembly consisting of a single chromosomal contig and nine further contigs. Comparison of the sequence to the available reference using MUMMER (56) and ACT (57) suggested that two contigs were chimeras derived from pSG2. A further five contigs were repetitive phage-derived sequences. The chromosome was subsequently manually edited to begin at the start of the dnaA gene. The putative protein-coding, ncRNA, and tRNA gene sequences were annotated using PROKKA (v. 1.10) (58). Pseudogenes in this study were initially annotated conservatively: an ORF was called a pseudogene where a PROKKA-defined ORF overlapped with a Belda-annotated pseudogene, or where it was predicted to be a pseudogene by PROKKA (58). In the latter case, PROKKA predicted pseudogenes based on identical annotations in sequential open reading frames (except for hypothetical protein annotations). Sequences matching possible riboswitch domains were predicted using the Denison Riboswitch Detector online webserver (Supplementary Data 4) (59). Additionally, the two available annotations for SgGMM4 (14, 23) were transferred to the PacBio SgGMMB4 scaffold using the RATT software package (29), for comparison and pseudogene prediction. An additional S. glossinidius sample (isolated from Glossina palpalis) was sequenced and assembled in the same manner on two SMRTcells with P6-C4 chemistry.
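The sequential-ORF rule described above can be illustrated with a minimal sketch. This is not the PROKKA implementation or the authors' script; it only assumes that the annotation is available as an ordered list of (locus tag, product) pairs, and the function name and locus tags are hypothetical.

```python
from typing import List, Tuple

def flag_split_orfs(orfs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return pairs of neighbouring locus tags that share a product annotation.

    `orfs` holds (locus_tag, product) tuples in genome order. Neighbouring
    ORFs with the same product are treated as candidate fragments of one
    disrupted gene; "hypothetical protein" matches are ignored.
    """
    candidates = []
    for (tag_a, prod_a), (tag_b, prod_b) in zip(orfs, orfs[1:]):
        a = prod_a.strip().lower()
        b = prod_b.strip().lower()
        if a == b and a != "hypothetical protein":
            candidates.append((tag_a, tag_b))
    return candidates

# Hypothetical annotation table used purely for illustration.
example = [
    ("ORF_0001", "DNA polymerase III subunit beta"),
    ("ORF_0002", "putative transporter"),
    ("ORF_0003", "putative transporter"),
    ("ORF_0004", "hypothetical protein"),
    ("ORF_0005", "hypothetical protein"),
]
print(flag_split_orfs(example))  # [('ORF_0002', 'ORF_0003')]
```

Only the adjacent pair with a shared, informative product is flagged; the two neighbouring hypothetical proteins are skipped, mirroring the exception stated in the text.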
Sequencing libraries for the six further isolates of Sodalis glossinidius from multiple tsetse species (GAA; GF4; GM1; GMM4; GP1; GPP4) were prepared using a TruSeq library preparation kit (according to the manufacturer’s instructions) and sequenced on a single lane of an Illumina HiSeq (High-output run; 2x100 bp paired-end reads) at the Centre for Genomic Research. The Illumina HiSeq data from the six further S. glossinidius isolates were initially processed using CASAVA 1.8 to produce FASTQ files. FASTQ data files were trimmed for the presence of Illumina adapter sequences using Cutadapt (v1.2.1) with the -O 3 option (60). The reads were further trimmed using Sickle (v1.200) (https://github.com/najoshi/sickle) with a minimum window quality score of 20. Data were assembled de novo with SPADES using default parameters and annotated using PROKKA as previously described. PROKKA-derived GFF annotations were processed through the ROARY pan-genome package to ascertain core and accessory genome coverage (61). Reads were mapped to the SgGMMB4 PacBio reference, and core SNP phylogenies derived using the SNIPPY package (https://github.com/tseemann/snippy).
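As a rough illustration of how the per-isolate steps above chain together, the following sketch builds (but does not run) one command line per tool. Only the -O 3 overlap and the quality threshold of 20 come from the text; the adapter sequence, file names, output directories and the helper name are assumptions made for illustration, and option names should be checked against the installed tool versions.

```python
import shlex
from typing import List

# Assumption: a commonly used Illumina adapter prefix; the text does not state the sequence.
ADAPTER = "AGATCGGAAGAGC"

def illumina_commands(sample: str, ref_fasta: str) -> List[List[str]]:
    """Build one argument list per processing step for a single isolate."""
    r1, r2 = f"{sample}_R1.fastq", f"{sample}_R2.fastq"          # hypothetical file names
    c1, c2 = f"{sample}_R1.cut.fastq", f"{sample}_R2.cut.fastq"
    t1, t2 = f"{sample}_R1.trim.fastq", f"{sample}_R2.trim.fastq"
    return [
        # Adapter trimming, one file at a time, minimum overlap of 3 bases.
        ["cutadapt", "-a", ADAPTER, "-O", "3", "-o", c1, r1],
        ["cutadapt", "-a", ADAPTER, "-O", "3", "-o", c2, r2],
        # Sliding-window quality trimming with a threshold of 20.
        ["sickle", "pe", "-t", "sanger", "-q", "20", "-f", c1, "-r", c2,
         "-o", t1, "-p", t2, "-s", f"{sample}_singles.fastq"],
        # De novo assembly with default parameters.
        ["spades.py", "-1", t1, "-2", t2, "-o", f"{sample}_spades"],
        # Annotation of the assembled contigs.
        ["prokka", "--outdir", f"{sample}_prokka", "--prefix", sample,
         f"{sample}_spades/contigs.fasta"],
        # Read mapping and SNP calling against the PacBio reference.
        ["snippy", "--outdir", f"{sample}_snippy", "--ref", ref_fasta,
         "--R1", t1, "--R2", t2],
    ]

def roary_command(gff_files: List[str], outdir: str = "roary_out") -> List[str]:
    """Pan-genome step run once over the GFF files from all isolates."""
    return ["roary", "-f", outdir, *gff_files]

for cmd in illumina_commands("GP1", "SgGMMB4.fasta"):
    print(shlex.join(cmd))
```

Building argument lists rather than shell strings keeps each step inspectable before execution; the lists could be passed to `subprocess.run` once the paths and options have been verified.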
Methylome sequencing
In addition to gDNA sequencing and assembly, the PacBio RSII instrument can detect epigenetic modification either in silico or by comparing native DNA to a PCR control. To that end, a Whole Genome Amplified (WGA) control was generated from SgGMMB4 as follows: 1 µg gDNA was split into three equal reactions and Whole Genome Amplified using the Qiagen Repli-g Turbo kit according to the manufacturer's instructions. The reactions were then pooled and cleaned using a 2:1 ratio of a homemade SPRI bead cleanup system analogous to Ampure XP beads (62). The WGA control was sequenced using one SMRTcell in the same way as described previously. Comparison to the native DNA was performed using the Motifs and Modifications module within the SMRT® Analysis Server with a mapping quality cutoff set at QV70, and the resulting modifications and motifs file was filtered for those with a quality ≥ Q50 (P<0.0001).
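For readers relating the quality cutoff to a probability, the sketch below converts between quality values and p-values, assuming the modification scores are Phred-scaled (QV = -10 log10 P). Under that assumption, Q50 corresponds to P = 1e-5, which is stricter than the quoted bound of P<0.0001 (the latter would correspond to Q40). The function names are illustrative.

```python
import math

def phred_to_p(qv: float) -> float:
    """Convert a Phred-scaled quality value to a probability."""
    return 10 ** (-qv / 10)

def p_to_phred(p: float) -> float:
    """Convert a probability to a Phred-scaled quality value."""
    return -10 * math.log10(p)

print(f"{phred_to_p(50):.0e}")     # 1e-05
print(round(p_to_phred(0.0001)))   # 40
```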
RNA sequencing
Individual 10 mL cell-free liquid cultures, as described above, were set up in quintuplicate for seventeen time-points at six-hourly intervals. At each time point, the contents of the culture flasks were transferred to 15 mL Falcon tubes, an optical density (600 nm) measurement was taken, and the cultures were then centrifuged for 10 min at 10 ºC. The bacterial pellet was immediately resuspended in 1 mL Trizol reagent (Life Technologies) and total RNA was extracted using Zymo Research DirectZol columns. RNA cleanups were performed using a 2:1 ratio of SPRI beads as described previously. Three timepoints, representing Early Log (12 hours), Late Log (72 hours) and Late Stationary phase (108 hours) according to the OD measurements (data not shown), were selected and DNase I treated using a Life Technologies DNAfree Turbo kit according to the manufacturer's instructions.
Ribosomal RNA (rRNA) was depleted using a Ribo-Zero™ bacterial (low-input) rRNA Removal Kit (Epicentre), and individually barcoded, strand-specific Illumina cDNA libraries were prepared using a NEBNext® Ultra™ RNA Library Prep Kit for Illumina. Sequence data were generated using one Illumina MiSeq run with v2 chemistry, generating 250 bp paired-end reads. All RNA and cDNA cleanups were performed using SPRI beads as described previously. All raw Illumina sequence FASTQ files were trimmed for the presence of adapter sequences using Cutadapt version 1.2 with the option -O 3 (60) and quality-trimmed using Sickle version 1.200 (63) with a minimum window quality score of 20. Any reads shorter than 10 bp after trimming were removed. Quality scores for all sequences were assessed using FASTQC v0.9.2 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc). RNA-seq reads were aligned to the PacBio RSII-generated SgGMMB4 scaffold using Bowtie2 v. 2.1.0 (64). The resulting SAM files were converted to BAM and sorted using samtools v.0.1.18-r580 (65). For transcript-based annotation, reads were counted against the PROKKA/PacBio annotation using HTSeq version 0.5.3p9 with the --stranded option, to count in both the sense and antisense directions, and with the intersection-nonempty mode (66). EdgeR analysis was implemented using the DEGUST web package (http://degust.erc.monash.edu), which outputs counts per million, logFC and differential expression statistics (Supplementary Data 1). EdgeR counts per million were transformed to TPM according to Wagner et al. (2012) (67). Further statistical analysis and figure plotting were implemented using R version 3.32. Operon structure was predicted from the RNA-seq data using Rockhopper with the default settings (68). The combined RNA-seq sequencing reads were fed through the SNIPPY pipeline (as previously described) to identify potentially correcting SNPs.
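The CPM-to-TPM transformation can be sketched as follows. This is a minimal illustration, not the authors' R code: it assumes per-gene CPM values and gene lengths in base pairs for a single sample, and relies on the fact that the library-size factor in CPM is constant within a sample and therefore cancels, so that TPM_i = (CPM_i / L_i) / Σ_j (CPM_j / L_j) × 10^6. Gene names and values are hypothetical.

```python
from typing import Dict

def cpm_to_tpm(cpm: Dict[str, float], length_bp: Dict[str, int]) -> Dict[str, float]:
    """Convert counts per million to transcripts per million for one sample."""
    # Length-normalise each gene, then rescale so the values sum to one million.
    rate = {gene: cpm[gene] / length_bp[gene] for gene in cpm}
    total = sum(rate.values())
    if total == 0:
        raise ValueError("all CPM values are zero; TPM is undefined")
    return {gene: r / total * 1e6 for gene, r in rate.items()}

# Hypothetical values: equal CPM, but geneB is twice as long as geneA.
cpm = {"geneA": 100.0, "geneB": 100.0}
lengths = {"geneA": 1000, "geneB": 2000}
tpm = cpm_to_tpm(cpm, lengths)
print({gene: round(value) for gene, value in tpm.items()})
```

With equal CPM, the shorter gene receives twice the TPM of the longer one, which is the length correction that makes sense-strand expression from CDS and pseudogenes of different sizes comparable.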
Proteomics
Liquid cultures were grown as previously described for early-mid-log and late-stationary growth phase S. glossinidius GMMB4. PBS-washed pellets were suspended in 250 µl of 25 mM ammonium bicarbonate and sonicated, using a Sonics Vibra Cell (Sonics and Materials Inc., Newton, U.S.A.) and 630-0422 probe (250 µl to 10 mL), for a total of 120 joules. The sample was then analysed for protein content, and 50 µg was added to 0.05% RapiGest™ (Waters, Manchester) in 25 mM ammonium bicarbonate and shaken at 550 rpm for 10 min at 80 °C. The sample was then reduced (addition of 10 µl of 60 mM DTT and incubation at 60 °C for 10 minutes) and alkylated (addition of 10 µl of 180 mM iodoacetamide and incubation at room temperature for 30 minutes in the dark). Trypsin (Promega U.K. Ltd., Southampton, proteomics grade) was reconstituted in 50 mM acetic acid to a concentration of 0.2 µg/µl and 10 µl was added to the sample, followed by overnight incubation at 37 °C. The digestion was terminated and RapiGest™ removed by acidification (1 µl of TFA and incubation at 37 °C for 45 min) and centrifugation (15,000 x g for 15 min). To check for complete digestion, each sample was analysed pre- and post-acidification by SDS-PAGE.
For LC-MS/MS analysis, a 2 µl (1 µg) injection was analysed using an Ultimate 3000 RSLC™ nano system (Thermo Scientific, Hemel Hempstead) coupled to a QExactiveHF™ mass spectrometer (Thermo Scientific). The sample was loaded onto the trapping column (Thermo Scientific, PepMap100, C18, 300 μm x 5 mm), using partial loop injection, for seven minutes at a flow rate of 4 µl/min with 0.1 % (v/v) formic acid (FA). The sample was resolved on the analytical column (Easy-Spray C18 75 µm x 500 mm 2 µm column) using a gradient of 97 % A (0.1 % FA) 3 % B (99.9 % ACN 0.1 % FA) to 70 % A 30 % B over 120 minutes at a flow rate of 300 nl/min. The data-dependent program used for data acquisition consisted of a 60,000 resolution full-scan MS scan (AGC set to 3e6 ions with a maximum fill time of 100 ms), after which the 18 most abundant peaks were selected for MS/MS using a 30,000 resolution scan (AGC set to 1e5 ions with a maximum fill time of 45 ms) with an ion selection window of 1.2 m/z and a normalised collision energy of 28. To avoid repeated selection of peptides for MS/MS, the program used a 30 second dynamic exclusion window.
Protein identification for the MS/MS dataset was performed using the open-source software tool ProteoAnnotator (69). ProteoAnnotator provides an automated pipeline for the interconnected computational steps required for inferring statistically robust identifications. The tool produces a variety of output files compliant with the data standards developed by the Proteomics Standards Initiative (70). Mass spectra in the form of an MGF (Mascot Generic Format) file were provided as input to the tool, along with the search criteria and protein database described below. The search parameters for the MS/MS dataset were a fixed modification of carbamidomethylation of cysteine and a variable modification of oxidation of methionine. A single missed trypsin cleavage was allowed. The product tolerance was set to ±0.5 Da and the precursor tolerance was set to 10 ppm. The protein search database comprised the gene models predicted by PROKKA, as previously described, plus a six-frame translation of the SgGMMB4 genome. Six-frame translated sequences with length less than eight were excluded from the search database. Decoy sequences were added to the database with a true:decoy ratio of 1:1 to create a final protein database for performing the MS/MS search. For the post-processing of results, we applied a threshold of 5 % for both peptide-level and protein-group-level FDRs as described in Ghali et al. (2004).
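The six-frame component of the search database can be sketched as below. This is an illustrative reconstruction rather than the ProteoAnnotator procedure: it assumes that translations are split at stop codons, that the length cutoff of eight is counted in amino acids, and that the standard codon assignments apply; all function names are hypothetical.

```python
from typing import List, Tuple

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard codon assignments, built in the conventional TCAG order.
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES), AMINO))

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq: str) -> str:
    """Translate codon by codon; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame_peptides(genome: str, min_len: int = 8) -> List[Tuple[str, str]]:
    """Return (frame, peptide) pairs for stop-to-stop segments of at least min_len residues."""
    genome = genome.upper()
    peptides = []
    for sign, strand in (("+", genome), ("-", reverse_complement(genome))):
        for offset in range(3):
            for segment in translate(strand[offset:]).split("*"):
                if len(segment) >= min_len:
                    peptides.append((f"{sign}{offset + 1}", segment))
    return peptides

# Toy sequence: ATG, nine alanine codons, then a stop codon.
toy = "ATG" + "GCT" * 9 + "TAA"
for frame, peptide in six_frame_peptides(toy):
    print(frame, peptide)
```

Target sequences produced this way would then be combined with the PROKKA gene models and an equal number of decoy sequences before the search; the decoy construction is not specified in the text and is not shown here.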
Nucleotide sequence database identifiers
The Pacific Biosciences assembly and annotation have been submitted to the European Nucleotide Archive under accessions LN854557-LN854560. Illumina sequence reads for the six additional Sodalis isolates are available under Project accession PRJEB9474 (accessions ERR2036891-ERR2036896). RNAseq data are available under project PRJEB20150.
The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE (71) partner repository with the dataset identifier PXD007068.
Acknowledgements
The Pacific Biosciences RS-II instrument purchase was funded by the BBSRC under grant BB/L014777/1, and the research was funded by project grant BB/J017698/1 to ACD. Pisut Pongchaikul was supported by a Mahidol studentship. All sequencing was carried out at the Centre for Genomic Research at the University of Liverpool, UK. Thanks also to Dr. Chandan Pal, and the Hinton and Kröger research groups for their support and useful discussions.