Abstract
Most urinary tract infections (UTIs) are caused by uropathogenic Escherichia coli (UPEC), which depend on an extracellular organelle (Type 1 pili) for adherence to bladder cells during infection. Type 1 pilus expression is partially regulated by inversion of a piece of DNA referred to as fimS, which contains the promoter for the fim operon encoding Type 1 pili. fimS inversion is regulated by up to five recombinases collectively known as Fim recombinases. These Fim recombinases are currently known to regulate two other switches: the ipuS and hyxS switches. A long-standing question has been whether the Fim recombinases regulate the inversion of other switches, perhaps to coordinate expression for adhesion or virulence. We answered this question using whole genome sequencing with a newly developed algorithm (Structural Variation detection using Relative Entropy, SVRE) for calling structural variations using paired-end short read sequencing. SVRE identified all of the previously known switches, refining the specificity of which recombinases act at which switches. Strikingly, we found no new inversions that were mediated by the Fim recombinases. We conclude that the Fim recombinases are each highly specific for a small number of switches. We hypothesize that the unlinked Fim recombinases have been recruited to regulate fimS, and fimS only, as a secondary locus; this further implies that regulation of Type 1 pilus expression (and its role in gastrointestinal and/or genitourinary colonization) is important enough, on its own, to influence the evolution and maintenance of multiple additional genes within the accessory genome of E. coli.
Importance
UTIs are a common ailment that affects more than half of all women during their lifetime. The leading cause of UTIs is UPEC, which rely on Type 1 pili to colonize and persist within the bladder during infection. The regulation of Type 1 pili is remarkable for an epigenetic mechanism in which a section of DNA containing a promoter is inverted. The inversion mechanism relies on what are thought to be dedicated recombinase genes; however, the full repertoire for these recombinases is not known. We show here that there are no additional targets beyond those already identified for the recombinases in the entire genome of two UPEC strains, arguing that Type 1 pilus expression itself is the driving evolutionary force for the presence of these recombinase genes. This further suggests that targeting the Type 1 pilus is a rational alternative non-antibiotic strategy for the treatment of UTI.
Introduction
Uropathogenic Escherichia coli (UPEC) are the primary cause of urinary tract infections (UTIs) (1, 2), which are estimated to affect more than half of all women during their lifetime (3). The total annual cost of community-acquired and nosocomial UTIs in the United States was estimated to be $2 billion in 1995 (3). Although UTIs have traditionally been effectively treated with antibiotics, in some patients UTIs recur despite apparently appropriate antibiotic therapy and sterilization of the urine (4). Furthermore, UTIs are the first or second most common indication for antibiotic therapy (5, 6), making them a major contributor to rising antibiotic resistance rates (7). Therefore, substantial effort has been devoted to studying the molecular mechanisms by which UPEC cause UTI in the service of developing alternative preventive and therapeutic strategies (2, 8-11).
One of the major successes in UTI research has been the recognition of the importance of Type 1 pili for causing UTI (12-14). Type 1 pili, encoded by the fim operon, are hair-like, multiprotein structures that extend from the outer membrane and terminate in the adhesin protein FimH (15-17). FimH binds to mannose residues on glycosylated bladder surface proteins such as uroplakin protein UPIa (18) and α3β1 integrin heterodimers (19). Adhesion to the bladder epithelium can lead to internalization of the bacteria into host cells and formation of intracellular bacterial communities (IBCs) (20-23). Bacteria in IBCs are protected from the immune response and antibiotic treatment, and can later escape from the host cells to cause recurrent infection (24, 25). Therefore, Type 1 pili directly contribute both to the initiation of infection and to intracellular persistence. Several new strategies have focused on blocking the function of Type 1 pili by small molecule inhibition or vaccination (26, 27).
The pilus structural proteins (including the FimH adhesin) and the chaperone-usher proteins that mediate pilus biogenesis are encoded within the fimAICDFGH operon (15, 16). Regulation of Type 1 pili expression centers on the epigenetic alteration of the fim operon promoter, which is located within the invertible fim switch fimS (28, 29). When fimS is in the ON orientation, the promoter is positioned to transcribe the fim genes and Type 1 pili may be synthesized. In contrast, when the fimS promoter is in the OFF orientation, bacteria do not produce Type 1 pili.
Switching of fimS from one state to another is regulated by recombinases which bind to inverted repeat (IR) sequences that flank the switch. Two recombinases, FimB and FimE, are encoded by genes that are genetically linked to the fim operon and fimS switch (30). Other known recombinases acting at fimS include the genetically unlinked IpuA and FimX (30-32). Interestingly, both the linked and unlinked Fim recombinases are also able to mediate the inversion of other switches. The hyxS switch is inverted by FimX (33), while ipuS was shown to be inverted by FimE, FimX, IpuA, and IpuB (but not FimB) (34). Like fimS, inversion of hyxS and ipuS appears to regulate downstream gene expression, but the full importance of these genes in pathogenesis is still not clear.
An open question in the field has been whether the Fim recombinases are utilized in the regulation of other, still unknown, switches, and whether such switches may be related to pathogenesis. To search for novel invertible elements, we developed an algorithm named Structural Variation detection using Relative Entropy (SVRE) to detect genomic structural variations (SVs) in whole genome sequencing data. We applied SVRE to uropathogenic strains overexpressing each Fim recombinase. In addition to the known inversions at fimS, hyxS, and ipuS, SVRE detected several SVs that were recombinase-independent. Importantly, no new invertible switches were found. This indicates that fimS is inverted by several recombinases that regulate little else, and suggests that tuning of Type 1 pilus expression is of strong evolutionary importance.
Results
Development of SVRE
Invertible sequences like fimS are one class of SV, which also includes deletions, duplications, translocations, and more complex rearrangements. Several programs have been developed to call SVs from whole genome sequencing data. One primary strategy for SV detection is to identify paired-end reads with unusual mapping patterns. Generation of DNA libraries for next-generation sequencing typically includes a size selection step that restricts the physical size of the DNA fragments that are carried forward for sequencing. When mapped to an ideal reference genome, the distance between paired-end reads should reflect this length. Additionally, the reads should map to opposite strands of the genome. Paired-end reads with an appropriate mapping distance and read orientation are termed “concordant” reads. In contrast, in the presence of an SV in the input DNA relative to the reference genome, paired-end reads associated with the SV map at a distance or orientation that differs from this expectation; these reads are called “discordant” reads.
We developed SVRE, an algorithm that detects SVs by analyzing the distribution of mapping distances in segments of the genome. When reads span an SV, the local mapping distances for these reads should follow a different distribution based on the type of SV; the difference in distribution is generated by discordant reads. In the case of an invertible element like fimS, the genomic material used for sequencing may contain a mixture of both orientations (Figure 1A). Reads derived from the invertible element will map to the reference genome differently depending on the orientation of the element. If the orientation is the same as the reference, the reads will align with the expected mapping distance to opposite strands (the gray arrows in Figure 1A). However, if the orientation is reversed, the paired-end reads will map to the same strand and with a mapping distance different from that selected during library preparation (the orange arrows in Figure 1A). When paired-end reads map to the same strand, SVRE assigns them a negative mapping distance. Therefore, a hallmark of inversions is a local mapping distribution that skews towards negative values.
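The sign convention described above can be illustrated with a minimal Python sketch. This is not the SVRE implementation (SVRE is written in Perl); the `ReadPair` record, its field names, and the choice to measure distance between the leftmost mapped coordinates of the two reads are assumptions made only for illustration.

```python
from typing import NamedTuple


class ReadPair(NamedTuple):
    """Hypothetical minimal record for one mapped read pair."""
    pos1: int       # leftmost mapped coordinate of read 1
    pos2: int       # leftmost mapped coordinate of read 2
    reverse1: bool  # True if read 1 maps to the reverse strand
    reverse2: bool  # True if read 2 maps to the reverse strand


def signed_mapping_distance(pair: ReadPair) -> int:
    """Return the mapping distance, negated when both reads share a strand."""
    distance = abs(pair.pos2 - pair.pos1)
    same_strand = pair.reverse1 == pair.reverse2
    return -distance if same_strand else distance


pairs = [
    ReadPair(1000, 1300, False, True),   # opposite strands: reference orientation
    ReadPair(2000, 2450, False, False),  # same strand: consistent with an inversion
]
distances = [signed_mapping_distance(p) for p in pairs]
```

In this toy input the first pair receives a positive distance and the second a negative one, so a window enriched for pairs like the second would show the skew towards negative values described above.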
SVRE compares the local mapping distribution of each genome segment to the global distribution, which includes the mapping distances of all paired-end reads genome-wide. The comparison of local and global mapping distributions is made using relative entropy, a measure of the divergence between two distributions that is derived from information theory (35). By using relative entropy, SVRE improves on existing SV detection software by providing a more general theoretical foundation for detecting anomalous insert length distributions (as opposed to assuming a normal distribution), resulting in improved signal-to-noise ratio and accuracy. Full theoretical and algorithmic details for SVRE can be found in the Methods and Supplemental Information.
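For reference, the standard textbook definition of relative entropy between a local distribution and the global distribution can be written as follows. The symbols are introduced here for illustration only: $P_w(d)$ denotes the fraction of read pairs in window $w$ whose (binned) signed mapping distance is $d$, and $Q(d)$ denotes the corresponding genome-wide fraction. The direction of the comparison (local relative to global), the binning, and the base of the logarithm are assumptions; the exact form used by SVRE is given in the Supplemental Information.

```latex
D(P_w \,\|\, Q) \;=\; \sum_{d} P_w(d) \, \log \frac{P_w(d)}{Q(d)}
```

This quantity is zero when the local and global distributions are identical and grows as the local distribution departs from the global one, which is why windows overlapping an SV are expected to stand out.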
Application of SVRE to discover SVs in UTI89
SVRE was applied to the uropathogenic strain UTI89 carrying a pBAD33-based plasmid providing arabinose-inducible overexpression of fimB or fimX, both of which bias the fimS switch towards the ON orientation (a similar strategy to that used in (33)). In contrast, the UTI89 reference genome has the fimS switch in the OFF orientation; therefore, induction of fimB or fimX should result in a structural variation (inversion) at fimS relative to the published reference sequence. Indeed, with overexpression of either recombinase, windows associated with the fim switch showed a local mapping distance distribution that differed from the global distribution (Figure 1B). The difference in the distributions can be primarily attributed to the negative mapping distances observed around the fim switch due to paired reads mapping to the same strand, indicative of an inversion. The distribution in flanking windows not associated with fimS was similar to the global distribution and these windows were not predicted by SVRE to contain an SV (Figure 1B).
The SVRE algorithm assigns a Relative Information Criterion (RIC) score (i.e. relative entropy) to each window. The RIC score peaks for the fimS-associated windows were distinct and well above the genomic background (Figure 2A-B). In addition to the fimS peak, there was a distinct peak at hyxS in the FimX sample but not the FimB sample. The detection of the fimS and hyxS peaks with recombinase overexpression demonstrated the ability of SVRE to find known SVs.
In addition to the fim and hyx switches, other genomic locations exhibited distinct peaks in RIC scores. Both samples shared a RIC score peak that corresponded to the ara locus (labeled “ara” in Figures 2A and B), which is an artefact originating from the use of pBAD plasmids. The remaining peaks included two cases of inversions occurring within prophage (labeled “phg inv” in Figures 2A and B), as well as one inversion occurring in an area containing three asparagine tRNA genes (labeled “asn” in Figures 2A and B). These inversions were predicted to occur in both the FimB and FimX samples. Both samples also shared a prediction of prophage duplication (labeled “dup”), with two additional cases of duplication and deletion of prophage (labeled “dup/del”) found only in the FimX sample. Using PCR, each of these SVs was validated in the fimB and fimX overexpressing strains, but each was also found to occur in control cells not overexpressing any recombinases (Figure S1), indicating that these SVs do not appear to be regulated by Fim recombinases. In addition, one of the prophage-associated inversions occurred in the vicinity of a predicted prophage-encoded invertase that is homologous to other phage systems that have been shown to regulate linked prophage promoters (36). The lack of novel invertible elements regulated by FimB and FimX confirms that these recombinases are specific to fimS (FimB and FimX) and hyxS (FimX).
Discovery and validation of structural variations in CFT073
The pyelonephritis isolate CFT073 encodes two recombinases (IpuA and IpuB) and one known invertible switch (ipuS) that are not found in UTI89 (31). Although IpuB was not able to regulate fimS, IpuA was shown to be capable of regulating the fim switch both in vitro and in vivo, adding another layer to Type 1 pili regulation (31). The ipuS switch is located between ipuA and ipuR, and was shown to be inverted by IpuA, IpuB, FimX, and FimE, but not FimB (34).
The CFT073 allele for each of these recombinases (in cases where they differed from UTI89) was cloned into pBAD33. CFT073 cells carrying each of these plasmids were sequenced and analyzed with SVRE (Figure 3). As expected, a peak for hyxS was detected for CFT073/pBAD-fimX cells (Figure 3F), but not for any of the other samples. Distinct peaks for fimS were observed for the FimB, FimE, IpuB, and FimX samples (Figure 3B, C, E, F). There were distinct ipuS peaks with expression of any of the recombinases (Figure 3B-F). Similar to the UTI89 samples, other peaks were observed that were unrelated to Fim recombinase activity, some of which were present in the empty vector sample (Figure 3A). These included the ara operon artefact (“ara” in Figure 3), a false-positive peak associated with mismapping to ambiguous bases in rrnD (“rib”), and phage deletions and duplications (“phg”). The phage SVs were found to occur regardless of Fim recombinase expression (Figure S2). Again, as in UTI89, there was no detection of novel invertible elements regulated by the Fim recombinases.
Effects of recombinase overexpression on ipuS inversion and expression of neighboring genes
We observed an ipuS peak in the pBAD-fimB sample (Figure 3B) despite previous data suggesting that FimB is not able to invert ipuS (34). To investigate this further, ipuS in the ON and OFF orientation was cloned onto a pUC19 backbone. The plasmid sequences confirmed the seven-nucleotide IRs that were observed previously (Figure 4A) (34). Each recombinase was expressed in the MDS42 strain background (chosen due to its lack of endogenous recombinases) in the presence of the ipuS-OFF or ipuS-ON plasmids (Figure 4B). FimB was capable of inverting ipuS, but it had the lowest efficiency of all the recombinases (Figure 4B). The ability of FimB to invert ipuS was confirmed in CFT073 (Figure 4C). Overall, IpuB and FimE exhibited the greatest efficiency in OFF to ON inversion, whereas IpuA was most efficient at ON to OFF inversion (Figure 4B-C). These data demonstrate that all of the recombinases, including FimB, are capable of facilitating the inversion of ipuS, further validating the accuracy of the SVRE predictions.
It was previously demonstrated that the orientation of the ipuS switch can regulate expression of ipuR and upaE (34). It has also been hypothesized that IpuA may regulate expression of the D-serine utilization locus (37). To delineate the genes that are affected by ipuS inversion, RT-qPCR was used to quantify relative expression of several genes in CFT073 cells overexpressing IpuA or IpuB (Figure 4D). No significant change of expression was observed for dsdC or dsdX, indicating that neither IpuA, IpuB, nor the orientation of ipuS affects expression of the D-serine utilization locus. In contrast, expression of ipuR was increased by ~1600-fold with IpuB overexpression, and ~34-fold with IpuA overexpression (Figure 4D); this correlates with the orientation of the ipuS promoter switch. The increase in upaE expression was also significant but less dramatic (~33-fold with IpuB overexpression). Together, these data suggest that ipuS inversion only affects the expression of ipuR and upaE, and clarify that dsdC and dsdX transcription are not controlled by ipuS.
Discussion
The fimS switch is a well-studied example of epigenetic regulation by DNA inversion (29, 38, 39). A single bacterium can give rise to two populations which differ only in the orientation of the fimS switch, and individual bacteria can convert between these two populations. The inversion of this switch was first noted to be controlled by two linked recombinases, FimB and FimE (30); in general, fimS inversion is described as stochastic, though regulation of the recombinases and several other proteins which bind to regions in the fimS switch can influence the bias (15, 38). Therefore, Type 1 pilus expression exhibits phase variation (stochastic inversion) that is responsive to environmental conditions (regulation of bias). With the sequencing of the genomes of several UPEC strains, most notably CFT073 (40) and UTI89 (41), genes encoding additional recombinases with homology to FimB and FimE were discovered (31, 32). These recombinases, like FimB and FimE, were found to regulate inversion of promoter elements genetically linked to the respective recombinase gene. Interestingly, these recombinases also have activity at fimS, providing potentially additional layers of regulation for Type 1 pilus expression (31, 32). Importantly, the inverted repeats for these known switches do not always share obvious sequence similarity (see below), implying that a simple search for similar inverted sequences in the genome is not a viable strategy for discovering other invertible switches. 
The discovery of these unlinked recombinases, therefore, raises several salient questions: (i) do the fim-linked FimB and FimE recombinases also have other inversion targets in the genome; (ii) what is the full suite of targets for all of the Fim recombinases; (iii) what is the consequence of coordinating inversion of multiple promoters with the same recombinases; (iv) are the other non-fim promoters important for Type 1 pilus expression or function; (v) what additional control of Type 1 pilus expression, if any, is gained by using an unlinked recombinase instead of or in addition to regulating FimB and FimE; (vi) is the regulation of the fimS switch important for the evolution or maintenance of the unlinked recombinases, particularly since they are not conserved in all E. coli (and thought to be on at least partially mobile elements). We have used whole genome sequencing, combined with overexpression of individual recombinases, to answer the first two of these questions. We found that the Fim recombinases are very specific, and at least for CFT073 and UTI89, there are no other inversion targets for any of the recombinases aside from those already known. This therefore limits the complexity of questions (iii) and (iv) above, while further shedding light on question (vi) regarding the importance of Type 1 pili and their regulation in E. coli.
Positive verification of a new inversion locus is relatively straightforward once the locus is known, and two recent studies have used whole genome sequencing (with Illumina and PacBio data) to achieve accurate quantification of fimS inversion percentages under different conditions (42, 43). However, to truly establish the specificity of the Fim recombinases, a strong negative predictive value is required when analyzing whole genome sequencing data (alternatively, a low noise level). With SVRE, we have improved the analysis of insert lengths from paired-end short read sequencing data, leading to both sensitive and specific detection of inversions throughout the genome. The key analytical contribution of SVRE is to apply a theoretically optimal measure of differences in distributions (from an information theory perspective) that can then be related to the underlying structure of the genome. More explicitly, currently popular second-generation sequencing technology generates paired-end reads; the reads within each pair are separated by a certain distance, determined by the library preparation. Importantly, the distribution of distances should not depend on the DNA sequence itself (or location on the genome). Therefore, we can use a comparison of local versus global insert length distributions to identify when the genome structure does not match our expectation. This type of analysis is also referred to as anomaly detection, in which relative entropy is a commonly used technique (44). Many other SV detection programs use the same underlying idea, in which anomalous insert lengths are equated to variation in the genome structure, but they make the assumption that the insert length distribution is normal (45, 46).
Our use of relative entropy in SVRE therefore brings several key advantages: (i) generality to any distribution of insert lengths (which may change depending on how library preparation and size selection are done); (ii) elimination of parameters required to tune the program (such as specifying the expected mean and variance of the assumed normal distribution); (iii) utilization of information contained in “concordant” reads that are within the bulk of the expected distribution (these are still used in the calculation of relative entropy); and (iv) removal of the need for a cutoff for number of “discordant” reads.
From a practical point of view, we find that SVRE produces generally low background signals for most of the genome, from which known SVs clearly stand out (Figure 2A and 2B, between 3.5-4.5 Mbp). To assess the value of using information theory to analyze insert length distributions, we reanalyzed our sequencing data with five other commonly used programs: GASVPro (47), SVDetect (46), Pindel (48), breseq (49), and DELLY (45) (Figure S3). In general, DELLY showed the greatest agreement with SVRE, while GASVPro had the least overlap. Some of these algorithms, such as GASVPro and Pindel, produced many more predictions than SVRE, and required applying a cutoff to allele depth in order to reduce the calls to a manageable number. A clear advantage of SVRE is that it enables a simple visualization of the relative entropy (Figures 2 and 3), in addition to providing a list of SV predictions. The connection between DNA structure and relative entropy provides a natural priority ranking for validation and study of individual SVs. Use of SVRE on UTI89 and CFT073 thus allowed us to identify all previously known targets of the Fim recombinases as invertible sequences in the genome. We also identified several SVs that were unrelated to the Fim recombinases. Finally, the good signal-to-noise ratio provides confidence that under the conditions tested, we indeed found no additional invertible elements in the entire genome.
Among the previously identified inversion loci, we found that ipuS could be inverted by FimB, both in its native context in the CFT073 chromosome (Figure 3) and when the ipuS switch was inserted into a plasmid (Figure 4). In contrast, the original work identifying ipuS concluded that FimB was not capable of inverting ipuS (34). We did find that, of the five Fim recombinases, FimB inverted ipuS in either direction with the lowest efficiency (Figure 4B-C), making its effects more difficult to detect. Combined with differences in the chosen promoters to drive FimB expression, this possibly accounts for the discrepancy between the two studies. Our results also confirm that ipuS orientation regulates expression of ipuR and upaE, while clarifying that the dsd operon is not regulated by ipuS (Figure 4D). Interestingly, FimE strongly drove inversion from OFF to ON in the MDS42 background (Figure 4B) but not in the CFT073 background (Figure 4C). Of note, while traditionally FimE was thought to only mediate inversion in the ON to OFF direction, FimE has been noted to mediate OFF to ON inversion in some conditions in different strains (42, 50). Therefore, these FimE results could be due to the allele of FimE or other strain-dependent differences.
It is remarkable that Type 1 pilus expression is regulated by five Fim recombinases that regulate little else. The convergence at fimS suggests a potentially intricate coordination to control Type 1 pili expression; presumably this facilitates optimal host colonization or adhesion in some other evolutionarily relevant environment. The genetic context for these recombinases may provide some hints as to how fimS regulation by both “core” and “accessory” recombinases has evolved. FimB and FimE are considered to be core recombinases since they are encoded adjacent to fimS and are present in nearly all E. coli strains (51). In contrast, the accessory recombinases FimX, IpuA, and IpuB are encoded at distal locations on two different pathogenicity islands. FimX is encoded adjacent to hyxS, while IpuA and IpuB are encoded adjacent to ipuS. Therefore, it seems likely that the original role of FimX was to regulate hyxS, while IpuA and IpuB originally regulated ipuS. We speculate that once UPEC acquired the pathogenicity islands housing these recombinases, the recombinases were co-opted to regulate fimS in addition to their cognate switch, and that this additional layer of regulation has given UPEC some sort of advantage. This idea is supported by the observation that fimX is enriched in UPEC strains (83.2%) compared to commensals (36%) (51). However, ipuA and ipuB are found at low levels in roughly equal proportions among UPEC (23.7%) and commensals (15%) alike (51). How these three switches, whose IRs differ in length and sequence, could be regulated by multiple recombinases is still not clear and an area for further investigation. FimB and FimE have been shown to bind to fimS at the IRs at half sites that overlap and flank the IRs (52). Therefore, one would hypothesize that the IRs and their surrounding sequence would be quite similar. There is some alignment observed between ipuS and fimS, and ipuS and hyxS (34). 
However, the alignment between fimS and hyxS is poor, despite the fact that FimX is able to facilitate recombination at both switches (31-33). It thus remains an open question how the Fim recombinases recognize these IRs with apparently dissimilar sequences.
The fact that additional recombinases have been recruited to regulate fimS does imply that proper Type 1 pilus expression is important to the evolutionary success of UPEC. This notion is consistent with the observation of positive selection on the FimH adhesin, which results in tuning the conformational flexibility of the protein, leading to modulation of the dynamics of binding to the surface of bladder epithelial cells (53-57). Of note, proper regulation may in some cases include downregulation of Type 1 pili expression at appropriate times, which is also supported by the regulatory mutations seen in EHEC (to lock the fimS switch in the OFF orientation) (58), the widespread inactivation of fimB in the ST131 E. coli lineage via an insertion sequence (42), and the strong positive selection on fimA (thought to be due to immune evasion) (59). Downregulation may also explain the finding of low Type 1 pilus expression in bacteria in the urine of some human UTI patients (60-62), though variation in the interaction between different hosts and pathogens during infection is another possibility (63). Here, we have provided additional data that argue that Type 1 pili are important to the success of E. coli, and particularly UPEC, suggesting that current efforts to target Type 1 pilus function to prevent and treat UTI represent a rational anti-virulence strategy.
Materials and Methods
Bacterial strains
All strains utilized in this study are listed in Table S1. Creation of knockout strains was done using lambda red recombination (64) with 50 bp flanking sequences as described before (65). Primers used for recombination are listed in Table S2.
Preparation of sequencing data
Overnight cultures were diluted 1:100 into LB broth containing chloramphenicol (20 μg/mL) and were incubated with shaking at 25° C for 24 h, then diluted 1:1000 into fresh media supplemented with chloramphenicol and arabinose (0.5%) and incubated for another 24 h. After the 48 h growth period, genomic DNA was extracted and prepared for Illumina sequencing. For UTI89, the library was prepared using standard techniques including shearing, end-repair, size selection, PCR, and purification with AMPure XP beads; sequencing was performed on an Illumina HiSeq 2000 machine as paired-end reads of 76 bp. The CFT073 libraries were made using the Illumina TruSeq DNA Library Prep Kit v2 and were sequenced on the Illumina MiSeq as paired-end reads of 150 bp.
Development of SVRE
We developed SVRE to improve on existing strategies used in SV detection, particularly those which make use of insert length distributions. When mapped to a perfect reference (i.e. not containing an SV), paired reads will map on opposite strands and at a distance determined by the insert size of the sequencing library, which is usually intentionally controlled during library preparation. Paired reads that map in this way are referred to as “concordant” pairs, while those that do not are “discordant”. One immediate strategy is to focus on discordant reads; clusters of discordant reads mapping to a particular region of the genome are then identified as a potential SV. However, distinguishing between these two classes is not always trivial, and appropriate cutoffs for how many discordant reads should be required to support a true SV are difficult to determine a priori. Programs such as GASVPro (47), SVDetect (46), DELLY (45), VariationHunter (66), BreakDancer (67), and the read distribution module of LUMPY (68) define concordant reads as those whose mapping distances fall within a chosen range based on the expected mapping distance and the standard deviation. In other words, library preparation is assumed to generate a roughly normal distribution of read insert lengths. Another drawback to this approach is that concordant reads are discarded and any information that concordant reads could supply for predicting SVs (such as differences in their length distribution) is lost.
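The cutoff-based strategy described above can be sketched in Python as follows. This is a generic illustration of the idea and not the code of any of the cited programs; the function name, the parameter names, and the example values (expected mean of 300 bp, standard deviation of 30 bp, cutoff of three standard deviations) are all hypothetical.

```python
def classify_pair(distance, expected_mean, expected_sd, k=3.0):
    """Label one signed mapping distance under a normal-distribution assumption.

    A pair is called concordant when its distance lies within k standard
    deviations of the expected mean; everything else is called discordant.
    expected_mean, expected_sd, and k are tuning parameters the user must supply.
    """
    low = expected_mean - k * expected_sd
    high = expected_mean + k * expected_sd
    return "concordant" if low <= distance <= high else "discordant"


# Toy signed mapping distances: two typical pairs, one very distant pair,
# and one same-strand pair (negative distance).
example_distances = [295, 310, 5000, -450]
labels = [classify_pair(d, expected_mean=300, expected_sd=30) for d in example_distances]
```

The sketch makes the two drawbacks noted above concrete: the result depends on user-supplied parameters, and pairs labeled concordant contribute nothing further to the analysis.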
Another strategy that avoids this concordant/discordant differentiation considers the overall distribution of mapping distances. By looking at histograms of mapping distances, changes from the expected distribution can be detected by a number of methods including statistical tests (χ², K-S test, t-test, Z-test, etc.) or by using classification algorithms (such as support vector machines). Existing algorithms that utilize this distribution comparison strategy include SVM2 (69) and MoDIL (70).
SVRE also uses a distribution comparison strategy. We choose the global insert length distribution as an empirical null model; implicitly, we are assuming that SVs are rare overall and therefore have a minimal global effect on the insert length distribution. We then compare the distribution of a local window to this global distribution using relative entropy (also known as Kullback-Leibler divergence, relative information content, or information divergence/gain). In information theory, relative entropy is a measure of the divergence between two “information” distributions (35). This is strongly related to concepts about signal encoding and compression, in which entropy is known to define an optimal theoretical lower limit for compressed or encoded message size.
With respect to SV detection, to the extent that information is carried within insert length distributions, we suggest that relative entropy is a potentially optimal statistic for quantifying how different a local distribution is from the global null distribution, though we have not formally proven this.
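The distribution comparison described above can be illustrated with a minimal Python sketch. It is not the SVRE implementation (which is written in Perl and described in the Supplemental Information). The non-overlapping windows, the window size, the bin size, the pseudocount used to avoid empty bins, the base-2 logarithm, and the assignment of each pair to a window by a single position are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def histogram(distances, bin_size=50):
    """Count signed mapping distances per bin; floor division keeps negatives apart."""
    return Counter(d // bin_size for d in distances)


def relative_entropy(local, global_, pseudocount=0.5):
    """Kullback-Leibler divergence D(local || global), in bits, with smoothing."""
    bins = set(local) | set(global_)
    n_local = sum(local.values()) + pseudocount * len(bins)
    n_global = sum(global_.values()) + pseudocount * len(bins)
    score = 0.0
    for b in bins:
        p = (local[b] + pseudocount) / n_local      # smoothed local frequency
        q = (global_[b] + pseudocount) / n_global   # smoothed global frequency
        score += p * math.log2(p / q)
    return score


def score_windows(pairs, window_size=1000, bin_size=50):
    """pairs: iterable of (position, signed_distance); returns {window_index: score}."""
    pairs = list(pairs)
    global_hist = histogram((d for _, d in pairs), bin_size)
    by_window = {}
    for pos, d in pairs:
        by_window.setdefault(pos // window_size, []).append(d)
    return {w: relative_entropy(histogram(ds, bin_size), global_hist)
            for w, ds in sorted(by_window.items())}


# Toy data: four windows of 20 pairs each; in window 3 every other pair maps
# to the same strand (negative distance), mimicking a mixed-orientation switch.
toy_pairs = []
for w in range(4):
    for i in range(20):
        d = 290 + 10 * (i % 3)
        if w == 3 and i % 2 == 0:
            d = -d
        toy_pairs.append((w * 1000 + i * 10, d))

scores = score_windows(toy_pairs)
```

In this toy example the window containing same-strand pairs receives the highest score, while the three unaffected windows share the same low score. The global histogram is built from all pairs, including those in the affected window, which mirrors the assumption that SVs are rare enough to have little effect on the global distribution.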
Details about the implementation of SVRE can be found in the Supplemental Information. SVRE was written in Perl and is available for download at https://github.com/swainechen/svre.
Structural variation prediction with other software
GASVPro version 1.2 (47), SVDetect version 0.8b (46), Pindel version 0.2.5b9 (48), breseq version 0.33.1 (49), and DELLY version 0.7.8 (45) were run according to the instructions provided by the developers. FASTQ files were used as the input for breseq, whereas the other programs required sorted, paired-end BAM files, which were produced using BWA-MEM (71) and SAMtools (72). Any additional pre- and post-processing steps, as well as analysis of the output, were performed ad hoc with Python.
PCR to confirm structural variations
The primers utilized to validate predicted SVs are listed in Table S2 and were designed according to the specific SV type as outlined in Fig S1A-C. PCR was performed with cells grown for 48 h at 25° C with passaging at 24 h and cells grown for 7 h at 37° C. The cells were grown in LB with arabinose to induce expression of recombinases. PCR was performed with cells from a freshly grown culture or with gDNA isolated from the culture.
Cloning
The vectors pSLC-372 and pSLC-373 contain the ipuS switch in the OFF or ON position, respectively, cloned into the BamHI and SacI sites of pUC19. To obtain ipuS DNA in both orientations, ipuS was amplified from CFT073/pBAD-ipuA cells induced with arabinose. Plasmids encoding Fim recombinases were made by amplifying the recombinase from the genomic DNA of either UTI89 or CFT073 and cloning it into the SacI and XbaI sites of pBAD33. The same FimB plasmid was used for both strains given that the fimB sequence is identical in the two genomes. These plasmids, along with the primers used for making them, are listed in Table S3.
Quantification of ipuS orientation
Overnight cultures were diluted 1:100 into 2 mL of LB supplemented with chloramphenicol (20 μg/mL) and arabinose (0.5%) and grown shaking for 7 h at 37° C. A PCR was then performed to amplify across the ipuS switch using primers cwr175 and cwr178 to amplify from the genome, or primers M13F and M13R to amplify from the plasmids pSLC-372 and pSLC-373 (Table S2). The resulting product was digested with PacI, which has only one site in the PCR product that is located within ipuS. This digestion reaction results in two bands that differ in size depending on the orientation of the switch. The digest reactions were run on a 2% gel, imaged, and the densities of one OFF orientation band and one ON orientation band were quantified using ImageJ FIJI. The total density of the two bands was set to 100% and the percent of ON versus OFF was then calculated.
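The normalization in the final step can be written as a short calculation. The sketch below assumes that the two band densities have already been measured (for example, exported from ImageJ); the function name `percent_on` and its arguments are hypothetical.

```python
def percent_on(on_density, off_density):
    """Percent of the switch in the ON orientation.

    The summed density of the ON and OFF bands is set to 100%.
    """
    total = on_density + off_density
    if total <= 0:
        raise ValueError("summed band density must be positive")
    return 100.0 * on_density / total
```

The percent OFF is then 100 minus the returned value.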
RT-qPCR
Overnight cultures of CFT073 carrying pBAD33, pBAD-ipuA, or pBAD-ipuB, were subcultured 1:100 into 10 mL of LB with chloramphenicol (20 μg/mL) in a 100 mL flask and were grown with shaking for 3 h at 37° C. Arabinose was then added to a final concentration of 0.5%, and the cells were allowed to incubate for another hour, at which point 0.5 mL of culture was added to 1 mL of RNAprotect Bacteria Reagent and the cells were lysed using proteinase K and lysozyme. RNA was isolated using the RNeasy Mini Kit, and DNA was removed with DNase I digestion. The SuperScript II RT kit was used to make cDNA. For each sample, a control reaction was run that lacked reverse transcriptase to check for DNA contamination during the qPCR reactions.
Primers employed in the qPCR reaction are listed in Table S2. A control lacking cDNA was included for each pair of primers, in addition to the reactions with and without reverse transcriptase for each sample. The KAPA SYBR FAST qPCR Master Mix was used along with 0.5 μM of each primer and ROX Low. The reactions were run on the ViiA 7 Real-Time PCR System with the following program: 95° C for 3 minutes followed by 40 cycles of 95° C for 3 seconds and 60° C for 20 seconds. The data were analyzed using the ΔΔCt method with 16S acting as the reference gene and the pBAD33 sample as the reference sample. Differences between sample ΔCt values were tested using an unpaired, two-tailed t-test.
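For reference, the ΔΔCt calculation can be sketched as below. Here ΔCt is the target Ct minus the 16S Ct for each replicate, and ΔΔCt is the mean sample ΔCt minus the mean ΔCt of the pBAD33 reference sample. The conversion to a fold change as 2^(−ΔΔCt) assumes approximately 100% amplification efficiency, which is an assumption of this sketch rather than a statement from the analysis above; the function and argument names are hypothetical.

```python
from statistics import mean


def delta_ct(target_cts, reference_cts):
    """Per-replicate dCt: target gene Ct minus reference gene (16S) Ct."""
    return [t - r for t, r in zip(target_cts, reference_cts)]


def ddct_fold_change(sample_target, sample_ref, control_target, control_ref):
    """Return (ddCt, fold change) for a sample relative to a control.

    Fold change is computed as 2 ** (-ddCt), which assumes roughly
    100% amplification efficiency for both primer pairs.
    """
    d_sample = mean(delta_ct(sample_target, sample_ref))
    d_control = mean(delta_ct(control_target, control_ref))
    ddct = d_sample - d_control
    return ddct, 2.0 ** (-ddct)
```

For example, a sample whose mean ΔCt is two cycles lower than that of the control has a ΔΔCt of −2, corresponding to a four-fold increase under the stated assumption.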
Supplemental
Figure S1. Confirmation of novel structural variations in UTI89. A PCR strategy was employed that was specific to each SV type. (A) For inversions, two sets of primers were used. One set produces a band when the invertible element is in the orientation found on the reference genome. In contrast, the other set produces a band if there is an inversion event. (B) Deletions were detected by using distant primer sets that only produce a band if the intervening sequence is deleted, bringing the priming sites closer together. (C) Duplications were detected using outward facing primer pairs that produce a band only if a tandem duplication event occurs. (D-I) For each SV, the leftmost coordinates of significant windows called by SVRE are represented by red (UTI89/pBAD-fimB) and blue (UTI89/pBAD-fimE) lines. The primers used to confirm the predicted SVs are depicted on the schematic of the neighboring genes, and the gels that resulted from the use of those primers are shown below. (D-F) Confirmation of inversions at (D) 0.9 Mb, (E) 2.1 Mb, and (F) 2.9 Mb was performed in UTI89 (“Ctrl”), UTI89/pBAD33 (“EV”), and UTI89/pBAD-fimX (“fimX”) cells. The linked phage invertase pin is highlighted in (A). (G-I) Confirmation of (G) a prophage deletion at 1.6 Mb and of prophage duplication and deletions at (H) 1.2 Mb and (I) 5.0 Mb. The PCR was performed using WT UTI89 as well as UTI89ΔfimBΔfimEΔfimX (“ΔBEX”).
Figure S2. Confirmation of novel structural variations in CFT073. For each SV, the leftmost coordinates of significant windows called by SVRE are represented by red (pBAD-fimB), black (pBAD-fimE), orange (pBAD-ipuA), green (pBAD-ipuB), and blue (pBAD-fimX) lines. The primers used to confirm the predicted SVs are depicted on the schematic of the neighboring genes, and the gels that resulted from the use of those primers are shown below. Confirmation of the SVs was performed in CFT073 carrying either pBAD33 (“EV”) or plasmids encoding the various recombinases. Detection of duplication and deletion of a phage at (A) 0.9 Mb and (B) 1.3 Mb.
Figure S3. Comparison of SVRE calls to those of other SV prediction programs. SV predictions for (A) UTI89 and (B) CFT073 are listed in the first column of each table. Whether an SV was detected in a given sample by a program is indicated by a filled box following the color code indicated in the legend.
Table S1. Strains utilized in this work. The table lists the strains used in this work. If the strain was part of a previous publication, the appropriate reference is given.
Table S2. Primers used for strain creation, SV validation, and RT-qPCR. The table lists primer sets used to detect SVs, create knockout mutant strains, and measure gene expression.
Table S3. Plasmids utilized in this work. For each plasmid that was used in this work, either a reference is given or the primers that were used in the creation of the plasmid are listed.
Supplemental Information. Implementation of SVRE. A description of how the SVRE program is implemented, including how relative entropy is calculated.
Acknowledgments
This work was supported by the National Research Foundation, Singapore (NRF-RF2010-10 to S.L.C.); the National Medical Research Council, Ministry of Health, Singapore grant numbers NMRC/CIRG/1357/2013, NMRC/CIRG/1358/2013, and NMRC/OFIRG/0009/2016; and the Genome Institute of Singapore (GIS) / Agency for Science, Technology, and Research (A*STAR). Experiments were performed by CWR, LTL, BP, SR, and CYC. The SVRE algorithm was developed by RS and SLC. The manuscript was written by CWR, RS, BP, and SLC.