ABSTRACT
Spontaneous mutations are ultimately essential for evolutionary change and are also the root cause of nearly all disease. However, until recently, both biological and technical barriers have prevented detailed analyses of mutation profiles, constraining our understanding of the mutation process to a few model organisms and leaving major gaps in our understanding of the role of genome content and structure on mutation. Here, we present a genome-wide view of the molecular mutation spectrum in Burkholderia cenocepacia, a clinically relevant pathogen with high %GC content and multiple chromosomes. We find that B. cenocepacia has low genome-wide mutation rates with insertion-deletion mutations biased towards deletions, consistent with the idea that deletion pressure reduces prokaryotic genome sizes. Unlike previously assayed organisms, B. cenocepacia exhibits a GC-mutation bias, which suggests that at least some genomes with high GC content may be driven to this point by unusual base-substitution mutation pressure. Notably, we also observed variation in both the rates and spectra of mutations among chromosomes, and a significant elevation of G:C>T:A transversions in late-replicating regions. Thus, although some patterns of mutation appear to be highly conserved across cellular life, others vary between species and even between chromosomes of the same species, potentially influencing the evolution of nucleotide composition and genome architecture.
INTRODUCTION
As the ultimate source of genetic variation, mutation is implicit in every aspect of genetics and evolution. However, as a result of the genetic burden imposed by deleterious mutations, remarkably low mutation rates have evolved across all of life, making detection of these rare events technologically challenging and accurate measures of mutation rates and spectra exceedingly difficult (Kibota and Lynch 1996; Lynch and Walsh 1998; Sniegowski et al. 2000; Lynch 2011; Fijalkowska et al. 2012; Zhu et al. 2014). Consequently, most estimates of mutational properties have been derived indirectly using comparative genomics at putatively neutral sites (Graur and Li 2000; Wielgoss et al. 2011) or by extrapolation from small reporter-construct studies (Drake 1991). Both of these methods are subject to potentially significant biases, as many putatively neutral sites are subject to selection and mutation rates can vary substantially among different genomic regions (Lynch 2007).
To avoid the potential biases of these earlier methods, pairing classic mutation accumulation (MA) with whole-genome sequencing (WGS) has become the preferred method for obtaining direct measures of mutation rates and spectra (Lynch et al. 2008; Denver et al. 2009; Ossowski et al. 2010; Lee et al. 2012; Sung, Ackerman, et al. 2012; Sung, Tucker, et al. 2012; Heilbron et al. 2014). Using this strategy, a single clonal ancestor is used to initiate several replicate lineages that are passaged through repeated single-cell bottlenecks for several thousand generations. The complete genomes of each evolved lineage are then sequenced and compared with the other lines to identify de novo mutations that occurred over the course of the experiment. The bottlenecking regime minimizes the ability of natural selection to eliminate deleterious mutations, and the parallel sequencing provides a large enough body of information to yield a nearly unbiased picture of the natural mutation spectrum of the study organism (Lynch et al. 2008).
The MA-WGS method has now been used to examine mutational processes in several model eukaryotic and prokaryotic species, yielding a number of apparently generalizable conclusions about mutation rates and spectra. For example, a negative scaling between base-substitution mutation rates and both effective population size (Ne) and the amount of coding DNA supports the hypothesis that the refinement of replication fidelity that can be achieved by selection is determined by the power of random genetic drift among phylogenetic lineages (Lynch 2011; Sung, Ackerman, et al. 2012). This “drift-barrier hypothesis” therefore predicts that organisms with very large population sizes such as some bacteria should have evolved very low mutation rates (Lee et al. 2012; Sung, Ackerman, et al. 2012; Foster et al. 2013).
Universal transition and G:C>A:T biases have also been observed in all MA studies to date (Lind and Andersson 2008; Lynch et al. 2008; Denver et al. 2009; Ossowski et al. 2010; Lee et al. 2012; Sung, Ackerman, et al. 2012; Sung, Tucker, et al. 2012), corroborating previous findings using indirect methods (Hershberg and Petrov 2010; Hildebrand et al. 2010). However, several additional characteristics of mutation spectra vary among species (Lynch et al. 2008; Denver et al. 2009; Ossowski et al. 2010; Lee et al. 2012; Sung, Tucker, et al. 2012; Sung, Ackerman, et al. 2012), and examining the role of genome architecture, size, and lifestyle in producing these idiosyncrasies will require a considerably larger number of detailed MA studies. Among bacterial species that have been subjected to mutational studies, genomes with high GC content are particularly sparse and no studies have been conducted on bacteria with multiple chromosomes, a genome architecture of many important bacterial species (e.g., Vibrio, Brucella, Burkholderia).
The Burkholderia cepacia complex is a diverse group of bacteria with important clinical implications for patients with cystic fibrosis (CF), in whom it can form persistent lung infections and highly resistant biofilms (Coenye et al. 2004; Mahenthiralingam et al. 2005; Traverse et al. 2013). Burkholderia cenocepacia is the most threatening pathogenic member of this complex in CF patients, and is renowned for its rapid diversification following infection (Coenye et al. 2004; Zlosnik et al. 2011). The core genome of B. cenocepacia HI2424 has a high %GC content (66.8%) and harbors three chromosomes, each containing rDNA operons (LiPuma et al. 2002), although the third chromosome can be eliminated under certain conditions (Agnoli et al. 2012). The primary chromosome (Chr1) is ∼3.48 Mb and contains 3253 genes; the secondary chromosome (Chr2) is ∼3.00 Mb and contains 2709 genes; and the tertiary chromosome (Chr3) is ∼1.06 Mb and contains 929 genes. In addition, B. cenocepacia HI2424 contains a 0.164 Mb plasmid, which contains 159 genes and has a lower GC content than the core genome (62.0%). Although the GC content is consistent across the three core chromosomes, the proportion of coding DNA declines from Chr1 to Chr3, while the evolutionary rate of genes increases (Cooper et al. 2010; Morrow and Cooper 2012). Whether this variation in evolutionary rate is driven by variation in non-adaptive processes like mutation bias or variation in the relative strength of purifying selection remains a largely unanswered question in the evolution of bacteria with multiple chromosomes.
Here, we applied whole-genome sequencing to 47 MA lineages derived from B. cenocepacia HI2424 that were evolved in the near absence of natural selection for over 5550 generations each. We identified a total of 282 mutations spanning all three replicons and the plasmid, enabling a unique perspective on inter-chromosomal variation in both mutation rate and spectra, in a bacterium with the highest %GC content studied with MA-WGS to date.
RESULTS
A classic mutation-accumulation experiment was carried out for 217 days with 75 independent lineages all derived from the same ancestral colony of B. cenocepacia HI2424 (LiPuma et al. 2002) using a daily serial transfer regime in which a single colony from each line was re-streaked onto a fresh plate. Measurements of generations incurred each day were taken monthly and varied from 26.2 ± 0.12 to 24.9 ± 0.14 (mean ± 95% CI of highest and lowest measurements, respectively) (Figure S1), resulting in an average of 5554 generations per line over the course of the MA experiment. Thus, across the 47 lines whose complete genomes were sequenced, we were able to visualize the natural mutation spectrum of B. cenocepacia HI2424 over 261,038 generations of mutation accumulation.
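The generation bookkeeping in this paragraph can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are ours, and the per-day bounds are simply the highest and lowest monthly measurements quoted in the text, not per-line data.

```python
# Generation bookkeeping for the MA experiment (all values quoted from the text).
DAYS = 217                          # duration of the experiment in days
GEN_PER_DAY_RANGE = (24.9, 26.2)    # lowest and highest monthly measurements
MEAN_GEN_PER_LINE = 5554            # reported average generations per line
SEQUENCED_LINES = 47                # lines with complete genome sequences

# Bounds on generations per line if every day ran at the lowest or highest rate.
low_bound = DAYS * GEN_PER_DAY_RANGE[0]
high_bound = DAYS * GEN_PER_DAY_RANGE[1]

# Total generations of mutation accumulation surveyed across sequenced lines.
total_generations = SEQUENCED_LINES * MEAN_GEN_PER_LINE

print(f"per-line bounds: {low_bound:.0f} to {high_bound:.0f} generations")
print(f"total generations surveyed: {total_generations}")
```

The reported per-line average falls between the two bounds, and the product of lines and generations matches the total stated above.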
Whole-genome sequencing was performed using the 151-bp paired-end Illumina HiSeq platform to an average depth of ∼50x. Mutations were identified using a consensus approach that leverages the parallel sequencing of nearly isogenic lineages to verify the ancestral consensus base at each site in the reference genome, then compare that to base calls of individual lineages. This approach allows us to minimize false-positive identifications while missing few true mutations, as evidenced by previous studies that have verified mutations called by this method through conventional sequencing (Sung, Ackerman, et al. 2012; Sung, Tucker, et al. 2012). From the comparative sequence data, we identified 245 base-substitutional (bps) changes, 33 short-insertion/deletion (indel) mutations (with sizes in the range of 1 to 145 base pairs), and four plasmid-loss events spanning the entire genome (Figure 1, Table S1, S2). With means of 5.21 bps and 0.70 indel mutations per line, the distribution of bps and indels across individual lines did not differ significantly from a Poisson distribution (bps: χ2 = 1.81, p = 0.99; indels: χ2 = 0.48, p = 0.92), indicating that mutation rates did not vary over the course of the MA experiment.
Mutation-accumulation experiments rely on the basic principle that when the effective population size (Ne) is sufficiently reduced, the efficiency of selection is minimized to the point at which all mutations become fixed by genetic drift with equal probability (Kibota and Lynch 1996). Ne in this mutation accumulation study was calculated to be ∼12.86, using the harmonic mean of the population size over 24 hours of colony growth (Hall et al. 2008). Thus, only mutations conferring effects of s > 0.078 will be subject to the biases of natural selection (Lynch et al. 2008), which is expected to be a very small fraction of mutations (Kibota and Lynch 1996; Elena et al. 1998; Zeyl and DeVisser 2001).
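To make the link between Ne and the selection threshold concrete, the following sketch computes a harmonic-mean population size and the corresponding threshold. It rests on two assumptions that are ours rather than the paper's: the colony is idealized as doubling every generation from a single cell, and the threshold is taken as 1/Ne, which matches the reported values. The actual calculation follows Hall et al. (2008) and may differ in detail.

```python
def harmonic_mean_doubling(generations: int) -> float:
    """Harmonic mean of N_t = 2**t for t = 0..generations (idealized colony growth)."""
    sizes = [2 ** t for t in range(generations + 1)]
    return len(sizes) / sum(1.0 / n for n in sizes)


def selection_threshold(ne: float) -> float:
    """Selection coefficient above which selection outweighs drift (assumed 1/Ne)."""
    return 1.0 / ne


REPORTED_NE = 12.86                        # value reported in the text
idealized_ne = harmonic_mean_doubling(25)  # ASSUMPTION: about 25 doublings per day
threshold = selection_threshold(REPORTED_NE)

print(f"idealized Ne for 25 doublings: {idealized_ne:.2f}")
print(f"threshold for reported Ne: s > {threshold:.3f}")
```

Because the harmonic mean is dominated by the smallest population sizes, Ne stays near half the number of daily generations even though a colony ends the day with tens of millions of cells.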
We tested for selection in our observed mutation spectra using the ratio of synonymous substitutions per synonymous site to non-synonymous substitutions per non-synonymous site. Given the codon usage and conditional mutation rates of B. cenocepacia HI2424, 27.8% of coding substitutions are expected to be synonymous. The observed percentage of synonymous substitutions (25.5%) did not differ significantly from this null-expectation (χ2 = 0.54, df = 1, p = 0.46). Although both base-substitutions (χ2 = 4.20, df = 1, p = 0.04) and indels (χ2 = 21.3, df = 1, p < 0.0001) were biased to non-coding DNA, evidence exists that mismatch repair preferentially repairs damage in coding regions, which can create artificial signatures of selection in MA experiments (Lee et al. 2012). Thus, our overall observations are consistent with the fact that MA experiments induce limited selection on the mutation spectra, at least as far as base substitutions are concerned.
Low base-substitution and indel mutation rates
The preceding results imply that base-substitution and indel mutation rates for B. cenocepacia are 1.33 (0.008) × 10^−10/bp/generation and 1.68 (0.003) × 10^−11/bp/generation (SEM), respectively. Based on the 7.70 Mb genome size, these per-base mutation rates correspond to a genome-wide base-substitution mutation rate of only 0.0010/genome/generation, and an indel mutation rate of only 0.00013/genome/generation. Although the ∼1:3 ratio of synonymous to non-synonymous substitutions is consistent with negligible influence of selection on base-substitution mutations in this study, too few indels occurred to evaluate a signature of selection, although their scarcity could reflect some selective loss of genotypes with loss-of-function mutations (Heilbron et al. 2014; Zhu et al. 2014). Moreover, although we applied PINDEL to identify indels of any size based on the aberrant mapping of paired-end reads, intermediate and large indels cannot be identified with our multi-aligner consensus method using short-read aligners, making them more difficult to accurately assign than base-substitutions or short indels. Thus, because some indel mutations may have been either purged by selection or overlooked by our analysis, our estimate of the indel rate should be considered a lower limit.
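The conversion from per-base to per-genome rates is a single multiplication, sketched below. The helper `per_site_rate` is a hypothetical illustration of how a per-base rate is obtained from a mutation count; the number of sites analyzed per line is not given in this section, so it is not called with real data here.

```python
GENOME_SIZE_BP = 7.70e6       # genome size quoted in the text
BPS_RATE_PER_BP = 1.33e-10    # base substitutions per bp per generation
INDEL_RATE_PER_BP = 1.68e-11  # indels per bp per generation


def per_site_rate(mutations: int, sites: float, generations: float) -> float:
    """Mutations per site per generation (hypothetical helper, not the authors' code)."""
    return mutations / (sites * generations)


# Per-genome rates are the per-base rates scaled by genome size.
bps_per_genome = BPS_RATE_PER_BP * GENOME_SIZE_BP
indel_per_genome = INDEL_RATE_PER_BP * GENOME_SIZE_BP

print(f"base substitutions per genome per generation: {bps_per_genome:.4f}")
print(f"indels per genome per generation: {indel_per_genome:.5f}")
```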
Base-substitution mutations are not GC>AT biased
One of the central motivations for studying the molecular mutation spectrum of B. cenocepacia was its high GC content (66.8%). A universal mutation bias in the direction of AT has been observed in all other wild-type species studied by MA, and has also been inferred in comparative analyses of several bacterial species, including Burkholderia pseudomallei (Lynch et al. 2008; Denver et al. 2009; Hershberg and Petrov 2010; Hildebrand et al. 2010; Ossowski et al. 2010; Lee et al. 2012; Sung, Tucker, et al. 2012). If this G:C>A:T mutation bias extends to B. cenocepacia, biased gene conversion or selection in the direction of GC content would have to occur (Lynch et al. 2008; Duret and Galtier 2009; Raghavan et al. 2012; Zhu et al. 2014).
In comparing the relative mutation rates of G:C>A:T transitions and G:C>T:A transversions with those of A:T>G:C transitions and A:T>C:G transversions, corrected for the ratio of G:C to A:T sites analyzed in this study, we found no mutational bias in the A:T direction. Rather, substitutions in the G:C direction were 17% more frequent than mutations in the A:T direction per base pair, although the rates were not significantly different (χ2 = 0.91, df = 1, p = 0.33). The lack of mutational bias in the A:T direction can largely be attributed to A:T>C:G transversions occurring at significantly higher rates than any other transversion type, most notably the G:C>T:A transversions (χ2 = 8.68, df = 1, p = 0.0032). However, A:T>G:C transitions also occurred at nearly the same rate as G:C>A:T transitions, the latter of which have been the most commonly observed substitution in other studies, putatively due to deamination of cytosine or 5-methyl-cytosine (Figure 2) (Lee et al. 2012; Sung, Tucker, et al. 2012; Zhu et al. 2014).
Using the ratio of the conditional rate of mutation in the G:C direction to that in the A:T direction (x), the expected GC content under mutation-drift equilibrium is x/(1+x) = 0.539 ± 0.043 (SEM). Therefore, it is clear that the observed mutation bias is not sufficient to drive the overall GC content of 66.8%. Either the B. cenocepacia genome is still moving towards mutation-drift equilibrium, or GC-biased gene conversion and/or natural selection are responsible for the observed %GC content (Lynch et al. 2008; Duret and Galtier 2009; Raghavan et al. 2012; Zhu et al. 2014).
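The equilibrium expectation can be reproduced directly from the formula in the text. The only assumption in this sketch is the value x = 1.17, which we infer from the statement that substitutions in the G:C direction were 17% more frequent per base pair than those in the A:T direction.

```python
def equilibrium_gc(x: float) -> float:
    """Expected GC fraction at mutation-drift equilibrium, x / (1 + x).

    x is the ratio of the conditional mutation rate in the G:C direction
    to that in the A:T direction.
    """
    return x / (1.0 + x)


X_RATIO = 1.17        # ASSUMPTION: inferred from the 17% excess quoted in the text
OBSERVED_GC = 0.668   # observed genome GC content

expected_gc = equilibrium_gc(X_RATIO)
print(f"expected GC at equilibrium: {expected_gc:.3f}")
print(f"shortfall relative to observed GC: {OBSERVED_GC - expected_gc:.3f}")
```

With no bias (x = 1) the function returns 0.5, so the modest excess in the G:C direction shifts the expectation only a few percentage points above parity, well short of the observed composition.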
Deletion bias favors genome-size reduction and AT composition
Although our lower bound estimates of the insertion and deletion mutation rates are both ∼15-fold lower than the base-substitution mutation rate, many indels affect more than one base. Specifically, the 17 deletions observed in this study result in the deletion of a total of 376 bases, while the 16 insertions result in a gain of 121 bases. Therefore, the number of bases that are impacted by indels in this study is still more than twice the number impacted by bps, indicating that indels may still play a central role in the genome evolution of B. cenocepacia if they are not purged by natural selection.
As noted above, of the 33 short indels observed in this study, 17 were deletions and 16 were insertions, suggesting that small-scale insertions and deletions occur with similar probability in B. cenocepacia. However, the average size of deletions was larger than the average size of insertions, leading to experiment-wide deletion and insertion rates of 1.97 (0.86) × 10^−10 and 6.11 (1.90) × 10^−11/bp/generation (SEM), respectively. Thus, there is a net deletion rate of 1.36 (5.95) × 10^−10/bp/generation (Table 1). Although no indels >150 bp were observed in this study, examining the depth of coverage of the B. cenocepacia HI2424 plasmid relative to the rest of the genome revealed that the plasmid was lost at a rate of 1.53 × 10−5 per cell division, while gains in plasmid copy number were not observed (Table 1).
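The net deletion rate and the plasmid-loss rate follow from simple arithmetic on the reported values, as sketched below. We assume one opportunity for plasmid loss per generation per line, which reproduces the reported figure; the variable names are ours.

```python
# Values quoted from the text.
DELETION_RATE = 1.97e-10    # bases deleted per bp per generation
INSERTION_RATE = 6.11e-11   # bases inserted per bp per generation
BASES_DELETED = 376         # total bases removed by the 17 deletions
BASES_INSERTED = 121        # total bases added by the 16 insertions
PLASMID_LOSSES = 4          # plasmid-loss events observed
TOTAL_GENERATIONS = 261038  # 47 lines x 5554 generations

# Net loss of sequence per base pair per generation.
net_deletion_rate = DELETION_RATE - INSERTION_RATE

# Net number of bases lost across all sequenced lines.
net_bases_lost = BASES_DELETED - BASES_INSERTED

# ASSUMPTION: one opportunity for plasmid loss per generation per line.
plasmid_loss_rate = PLASMID_LOSSES / TOTAL_GENERATIONS

print(f"net deletion rate: {net_deletion_rate:.2e} per bp per generation")
print(f"net bases lost: {net_bases_lost}")
print(f"plasmid loss rate: {plasmid_loss_rate:.2e} per cell division")
```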
The base composition of deletions was also biased, with GC bases being deleted significantly more than expected based on the genome content (χ2 = 30.4, df = 1, p < 0.0001). In contrast, no detectable bias was observed towards insertions of GC over AT bases (χ2 = 1.20, df = 1, p = 0.27) (Table 1). Thus, indels in B. cenocepacia are expected to reduce genome-wide GC content, further supporting the implied need for other population-genetic processes favoring GC content (Lynch et al. 2008; Duret and Galtier 2009; Raghavan et al. 2012; Zhu et al. 2014). Overall, the greater number of bases that were deleted than inserted in this study suggests that the natural indel spectrum of B. cenocepacia causes both genome-size reduction and increased AT content.
Non-uniform chromosomal distribution of mutations
Another major goal of this study was to investigate whether mutation rates and spectra vary among chromosomes and chromosomal regions. The three core chromosomes of B. cenocepacia vary in size and content but are sufficiently large to have each accumulated a considerable number of mutations in this study (Morrow and Cooper 2012). Chromosome 1 (chr1) is the largest chromosome (both in size and in gene count), with more essential and highly expressed genes than either chromosome 2 (chr2) or 3 (chr3) (Figure S4). Expression and number of essential genes are second highest on chr2 and lowest on chr3 (Cooper et al. 2010; Morrow and Cooper 2012). In contrast, average non-synonymous and synonymous variation among orthologs shared by multiple strains of B. cenocepacia, as well as fixed variation among Burkholderia species (dN and dS), are highest on chr3 and lowest on chr1 (Figure S4) (Cooper et al. 2010; Morrow and Cooper 2012).
The base-substitution mutation rates of the three core chromosomes differ significantly based on a chi-square proportions test, where the null expectation was that the number of substitutions would be proportional to the number of sites covered on each chromosome (χ2 = 6.77, df = 2, p = 0.034) (Figure 3A). Specifically, base-substitution mutation rates are highest on chr1, and lowest on chr2, which is the opposite of observed evolutionary rates on these chromosomes (Figure S4) (Cooper et al. 2010). There was moderate variation in the ratio of GC to AT base-pairs covered on each chromosome, and because AT bases experience slightly higher mutation rates overall than GC bases in B. cenocepacia (Figure 2), we performed a second chi-squared test to determine whether the inter-chromosomal variation in substitution rates could be due to variation in nucleotide content. Here, the null expectation for the frequency of base-substitutions expected on each chromosome was calculated by taking the product of the number of GC bases covered across all lines, the number of generations incurred per line, and the overall GC substitution rate across the genome. The resultant product was then added to the product of the same calculation for AT substitutions to obtain the total expected number of substitutions on each chromosome, given both their size and nucleotide content. The differences in the base-substitution mutation rates of the three core chromosomes remained significant when this test was performed (χ2 = 6.88, df = 2, p = 0.032), indicating that the inter-chromosomal heterogeneity in base-substitution mutation rates cannot be explained by variation in nucleotide content.
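The composition-corrected null expectation described above can be sketched as follows. Every number in this example is hypothetical and chosen only to make the code runnable; the per-chromosome site counts, conditional rates, and observed counts are not given in this section. The function names are ours, not the authors'.

```python
def expected_count(gc_sites: float, at_sites: float, generations: float,
                   gc_rate: float, at_rate: float) -> float:
    """Expected substitutions on one chromosome under the composition-aware null.

    gc_sites and at_sites are the GC and AT bases covered, summed across all lines.
    """
    return generations * (gc_sites * gc_rate + at_sites * at_rate)


def chi_square(observed, expected) -> float:
    """Pearson chi-square statistic for paired observed and expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# HYPOTHETICAL inputs for illustration only (not data from the study).
GENERATIONS = 5554
GC_RATE, AT_RATE = 1.2e-10, 1.5e-10
covered = {  # chromosome: (GC sites, AT sites) summed across lines
    "chr1": (1.0e8, 0.5e8),
    "chr2": (0.9e8, 0.45e8),
    "chr3": (0.3e8, 0.15e8),
}
observed = {"chr1": 120, "chr2": 80, "chr3": 30}

expected = {
    name: expected_count(gc, at, GENERATIONS, GC_RATE, AT_RATE)
    for name, (gc, at) in covered.items()
}
statistic = chi_square(observed.values(), expected.values())
print(f"chi-square = {statistic:.2f} with df = {len(observed) - 1}")
```

Separating GC and AT sites in the expectation means that a chromosome richer in AT bases is allotted more expected substitutions, so any remaining excess cannot be attributed to composition alone.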
The conditional base-substitution mutation spectra were also significantly different in all pairwise chi-squared proportions tests between chromosomes (chr1/chr2: χ2=14.3, df=5, p=0.014; chr1/chr3: χ2=17.0, df=5, p=0.004; chr2/chr3: χ2=13.4, df=5, p=0.020) (Figure 3C). These comparisons further illustrate that the significant variation in conditional base-substitution mutation rates is mostly driven by a few types of substitutions that occur at higher conditional rates on particular chromosomes. Specifically, although their individual differences were not quite statistically significant, G:C>T:A transversions seem to occur at the highest rate on chr3 (χ2 = 5.94, df = 2, p = 0.051) and A:T>C:G transversions occur at the highest rate on chr1 (χ2 = 5.67, df = 2, p = 0.059) (Figure 3B; Figure 4A).
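The p-values reported for these tests can be recovered from the χ2 statistics and degrees of freedom alone. The sketch below uses only the Python standard library and a standard recurrence for the chi-square upper-tail probability; it is a consistency check on the reported numbers, not the authors' analysis code.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X >= x) for a chi-square variable with df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    z = x / 2.0
    if df % 2 == 1:
        q, a = math.erfc(math.sqrt(z)), 0.5   # closed form for df = 1
    else:
        q, a = math.exp(-z), 1.0              # closed form for df = 2
    # Recurrence: Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while 2 * a < df:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


# (statistic, df, reported p) as quoted in the text.
reported = [
    (14.3, 5, 0.014),  # chr1 vs chr2 spectra
    (17.0, 5, 0.004),  # chr1 vs chr3 spectra
    (13.4, 5, 0.020),  # chr2 vs chr3 spectra
    (5.94, 2, 0.051),  # G:C>T:A transversions among chromosomes
    (5.67, 2, 0.059),  # A:T>C:G transversions among chromosomes
]
for stat, df, p_reported in reported:
    print(f"chi2 = {stat}, df = {df}: p = {chi2_sf(stat, df):.4f} (reported {p_reported})")
```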
Unlike base-substitution mutation rates, neither the deletion nor the insertion mutation rate varied significantly among chromosomes (Deletions: χ2=3.81, df=2, p=0.15; Insertions: χ2=0.64, df=2, p=0.73) (Figure 3B; Figure 4B). No indels were observed on the 0.16 Mb plasmid, but as noted above, four plasmid loss events were observed. The latter events involve the loss of 157 genes, and are expected to have phenotypic consequences. The relative rarity of indels observed in this study limits our ability to analyze their intra-chromosomal biases in great detail, but the repeated occurrence of indels within microsatellites (57.6% of all indels) suggests that replication slippage is a common cause of indels in the B. cenocepacia genome (Figure 4B).
DISCUSSION
Despite their relevance to both evolutionary theory and human health, the extent to which generalizations about mutation rates and spectra are conserved across organisms remains unclear. Because of their diverse genome content, bacterial genomes are particularly amenable to studying these issues (Lynch 2007). In measuring the rate and molecular spectrum of mutations in the high-GC, multi-replicon genome of B. cenocepacia, we have corroborated some prior findings of MA studies in model organisms, but also demonstrated idiosyncrasies in the B. cenocepacia spectrum that may extend to other organisms with high %GC content and/or with multiple chromosomes. Specifically, B. cenocepacia has a low mutation rate and an indel spectrum consistent with a universal deletion bias in prokaryotes (Mira et al. 2001). However, the lack of G:C>A:T bias is inconsistent with all previous findings in mismatch-repair proficient organisms (Lynch et al. 2008; Denver et al. 2009; Hershberg and Petrov 2010; Hildebrand et al. 2010; Ossowski et al. 2010; Lee et al. 2012; Sung, Tucker, et al. 2012).
Bacterial genomes are also advantageous study subjects for their relatively ordered patterns of replication initiated at only one origin per chromosome. In genomes with multiple chromosomes, the origins apparently fire at different times to maintain termination synchrony, causing smaller chromosomes to be replicated later (Rasmussen et al. 2007; Cooper et al. 2010). With this model in mind, it becomes noteworthy that both mutation rates and spectra differed significantly among chromosomes in this multi-replicon genome and in a manner suggesting greater oxidative damage or more inefficient repair in late replicated regions.
As a member of a species complex with broad ecological and clinical significance, B. cenocepacia is a taxon with rich genomic resources that enable comparisons between the de novo mutations reported here and extant sequence diversity. With 7050 genes, B. cenocepacia HI2424 has a large amount of coding DNA (GE) (6.8 × 10^6 base pairs), and a high average nucleotide heterozygosity at silent-sites (πs) (6.57 × 10^−2) relative to other strains (Watterson 1975; Mahenthiralingam et al. 2005). By combining this πs measurement and the base-substitution rate from this study, we estimate that the Ne of B. cenocepacia is approximately 247 × 10^6, which is in the upper echelon among species whose Ne has been derived in this manner (Figure S5).
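The Ne estimate can be reproduced from the two quantities named above. The sketch assumes the standard neutral expectation for a haploid organism, πs = 2·Ne·μ; the text does not state the formula explicitly, but this relationship recovers the reported value.

```python
PI_S = 6.57e-2   # silent-site nucleotide heterozygosity quoted in the text
MU = 1.33e-10    # base-substitution rate per bp per generation from this study


def effective_population_size(pi_s: float, mu: float) -> float:
    """Ne under the ASSUMED haploid neutral expectation pi_s = 2 * Ne * mu."""
    return pi_s / (2.0 * mu)


ne = effective_population_size(PI_S, MU)
print(f"estimated Ne: {ne:.3e}")
```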
Under the drift-barrier hypothesis, high target size for functional DNA and high Ne increase the ability of natural selection to reduce mutation rates (Lynch 2010; Lynch 2011; Sung, Ackerman, et al. 2012). Thus, given the large proteome and Ne of B. cenocepacia, it is unsurprising that B. cenocepacia has relatively low base-substitution and indel mutation rates when compared to other organisms (Sung, Ackerman, et al. 2012). However, the low substitution and indel mutation rates observed in this study need not imply limited genetic diversity among species of the Burkholderia cepacia complex. Rather, because of their high Ne and evidently frequent lateral genetic transfer, species of the Burkholderia cepacia complex are remarkably diverse (Baldwin et al. 2005; Pearson et al. 2009), demonstrating that low mutation rates need not imply low levels of genetic diversity.
Because mutations provide the raw material for evolutionary processes, the mutational spectrum of B. cenocepacia has important implications for its genome evolution, which possibly extend to other GC-rich or multi-replicon genomes. Despite similar rates of insertion and deletion events, deletions were larger than insertions, and plasmids were lost relatively frequently, which together support the model that bacterial genomes are subject to a deletion bias (Mira et al. 2001; Kuo and Ochman 2009). Ultimately, this dynamic has the potential to drive the irreversible loss of previously essential genes during prolonged colonization of a host and may enable host dependence to form more rapidly in prokaryotic organisms than in eukaryotes, which do not have a strong deletion bias (Denver et al. 2004; Kuo and Ochman 2009; Dyall et al. 2014).
The lack of AT mutation bias observed in B. cenocepacia has not been seen previously in non-mutator MA lineages of any kind (Lind and Andersson 2008; Lynch et al. 2008; Denver et al. 2009; Ossowski et al. 2010; Lee et al. 2012; Sung, Ackerman, et al. 2012; Sung, Tucker, et al. 2012). Interestingly, the lack of AT mutation bias appears to be primarily caused by a substantial elevation of the A:T>C:G mutation rate relative to all other transversion types on chromosome 1 (Figure 2; Figure 3C). A decreased ratio of G:C>A:T to A:T>G:C transition mutations relative to that seen in other bacteria was also observed (Lee et al. 2012; Sung, Tucker, et al. 2012). In principle, a decreased rate of G:C>A:T transition mutation could be achieved by an increased abundance of uracil-DNA-glycosylases (UDGs), which remove uracils from DNA following cytosine deamination (Pearl 2000), or by a lack of cytosine methyltransferases, which methylate the C-5 carbon of cytosines and expose them to increased rates of cytosine deamination (Kahramanoglou et al. 2012). However, B. cenocepacia HI2424 does not appear to have an exceptionally high number of UDGs, and it does contain an obvious cytosine methyltransferase homolog, suggesting that active methylation of cytosines does occur in B. cenocepacia. Extending these methods to more genomes with high GC content will be required to determine whether a lack of AT mutation bias is a common feature of GC-rich genomes.
Perhaps the most important finding from this study is that both mutation rates and spectra vary significantly among the three autonomously replicating chromosomes that make up the B. cenocepacia genome (Figure 3). The possibility that mutation rates vary among genome regions has been demonstrated several times using reporter genes and comparative methods (Hudson et al. 2002; Mira and Ochman 2002; Hawk et al. 2005; Cooper et al. 2010; Lang and Murray 2011; Agier and Fischer 2012; Morrow and Cooper 2012), and also directly in a more recent study that used similar methods to those described here (Foster et al. 2013). Although comparative evidence demonstrates that evolutionary rates in multi-chromosome bacteria increase on secondary chromosomes (Cooper et al. 2010; Morrow and Cooper 2012), differences among taxa can be a consequence of biases at the level of mutation and/or selection. Our data demonstrate that base-substitution mutation rates vary significantly among chromosomes, but not in the direction predicted by comparative studies (Cooper et al. 2010). Specifically, we find that base-substitution mutation rates are highest on the primary chromosome (Figure 3A,B), where evolutionary rates are lowest. Thus, purifying selection must be substantially stronger on the primary chromosome to offset the effect of an elevated mutation rate.
The spectra of base-substitutions also differed significantly among chromosomes, with two types of transversions occurring much more frequently on only one of the three replicons. While A:T>C:G transversions are more than twice as likely to occur on the primary chromosome as elsewhere, G:C>T:A transversions are more than twice as likely to occur on the third chromosome (Figure 3C). The G:C>T:A transversions are a particularly interesting class of substitutions because they can arise through oxidative damage (Michaels et al. 1992; Lee et al. 2012) and may be elevated late in the cell cycle when intracellular levels of reactive oxygen species are high. Models of replication timing in another multi-chromosome bacterium, Vibrio cholerae, have demonstrated that smaller secondary chromosomes initiate replication later in the cell cycle (Rasmussen et al. 2007). While not all mutations arise during replication, late replicating regions have been associated with higher transversion rates in prokaryotes and multicellular eukaryotes (Mira and Ochman 2002; Stamatoyannopoulos et al. 2009; Chen et al. 2010), and specifically with G:C>T:A transversions in several species comparisons (Mira and Ochman 2002). Thus, because late-replicated regions on the larger primary and secondary chromosomes are expected to replicate concordantly with those on the tertiary chromosome (Rasmussen et al. 2007), we would not only expect elevated rates of these mutations on the tertiary chromosome, but also on the later replicated regions of the primary and secondary chromosomes.
We tested this prediction by measuring the overall rates of G:C>T:A transversions in the early replicated regions on chr1 and chr2 (prior to chr3 initiation), and comparing them to the late replicated regions on chr1 and chr2 (following chr3 initiation), as well as the rates on chr3. Although the low number of total G:C>T:A transversions observed in this study prevents us from statistically distinguishing conditional G:C>T:A transversion rates between late and early replicated regions of chr1 and chr2, the conditional G:C>T:A transversion rate is higher in late than early replicated regions of chr1 and chr2 (Figure S6), which is remarkable considering that early replicated genes on chr1 and chr2 are expressed more, which has been shown to induce G:C>T:A transversions independent of replication (Klapacz and Bhagwat 2002; Kim and Jinks-Robertson 2012; Alexander et al. 2013). Thus, we suggest that late replicating DNA, particularly in divided genomes, is inherently predisposed to increased rates of G:C>T:A transversions, possibly due to increased exposure to oxidative damage or variation in DNA-repair mechanisms, although the transversion type responsible for these increases may vary between species (Mira and Ochman 2002).
The mechanism underlying the increased A:T>C:G transversion mutation rate on the primary chromosome is less clear, but a decreased rate of A:T>C:G transversions in a late replicating reporter relative to that on an intermediate replicating reporter has been demonstrated previously in Salmonella enterica (Hudson et al. 2002). Thus, it is possible that this form of transversion is reduced in late replicating DNA, or that it is primarily caused by other forms of mutagenesis (Klapacz and Bhagwat 2002), although transcriptional mutagenesis is unlikely as A:T>C:G transversions occur relatively frequently in non-coding DNA relative to other substitution types (Figure S7).
In summary, this study has demonstrated that the GC-rich genome of B. cenocepacia has a relatively low mutation rate, with a mutation spectrum biased toward deletion and G:C production. Moreover, both the rate and types of base-substitution mutations that occur most frequently vary by chromosome, likely related to replication dynamics, the cell cycle, and transcription (Klapacz and Bhagwat 2002; Cooper et al. 2010; Merrikh et al. 2012). Although this study represents an essential first step in broadening our understanding of mutation rates and spectra beyond that of model organisms, determining whether the observed mutational traits are common to all GC-rich genomes with multiple replicons, or are merely species-specific idiosyncrasies, will require a more thorough investigation across a more diverse collection of GC-rich and multi-replicon bacterial genomes. Ultimately, by better understanding the core mutational processes that generate the raw variation on which evolution acts, we can aspire to develop true species-specific null-hypotheses for molecular evolution, and by extension, enable more accurate analyses of the role of all evolutionary forces in driving genome evolution.
MATERIALS AND METHODS
Mutation accumulation
Seventy-five independent lineages were founded by single cells derived from a single colony of Burkholderia cenocepacia HI2424, a soil isolate that had only previously been passaged in the laboratory during isolation (Coenye and LiPuma 2003). Independent lineages were then serially propagated every 24 hours onto fresh high-nutrient Tryptic Soy Agar (TSA) plates (30 g/L Tryptic Soy Broth (TSB) Powder, 15 g/L Agar). Two lineages were maintained on each plate at 37°C, and the isolated colony closest to the base of each plate half was chosen for daily re-streaking. Backups were maintained in a 4°C walk-in refrigerator in case of line extinction or experimental error, but were never used in any of the sequenced lineages. Following 217 days of MA, frozen stocks of all lineages were prepared by growing a final colony per isolate in 5 ml TSB (30 g/L TSB) overnight at 37°C, and freezing in 8% DMSO at -80°C.
Daily generation times were estimated each month by placing a single representative colony from each line in 2 ml of Phosphate Buffered Saline (80 g/L NaCl, 2 g/L KCl, 14.4 g/L Na2HPO4 • 2H2O, 2.4 g/L KH2PO4), serially diluting to 10⁻³, and spread plating 100 µl on TSA. By counting the colonies on the resultant TSA plate, we calculated the number of viable cells in a single colony and thus the number of generations between each transfer. The average generation time across all lines was then calculated and used as the daily generation time for that month. These generation-time measurements were used to evaluate potential effects of declining colony size over the course of the MA experiment as a result of mutational load, a decline that was indeed observed (Figure S1). Final generation numbers per line were estimated as the sum of monthly generation estimates, which were derived by multiplying the number of generations per day in that month by the number of days between measurements (Figure S1).
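The arithmetic behind these generation estimates can be illustrated with a minimal Python sketch. It assumes that each colony grows from a single founding cell by binary fission, so that the number of generations per transfer is log2 of the viable cell count; this relationship, the function names, and the example plate count are assumptions made for illustration and are not stated explicitly above.

```python
import math


def viable_cells_per_colony(plate_count, dilution_factor=1e-3,
                            plated_volume_ml=0.1, resuspension_volume_ml=2.0):
    """Estimate viable cells in one colony from a spread-plate count.

    Defaults follow the protocol in the text: colony resuspended in 2 ml,
    diluted to 10^-3, and 100 ul (0.1 ml) plated.
    """
    cells_per_ml = plate_count / (dilution_factor * plated_volume_ml)
    return cells_per_ml * resuspension_volume_ml


def generations_per_transfer(cells_in_colony):
    """Assumption: one founder cell and binary fission, so N = 2**g."""
    return math.log2(cells_in_colony)


def total_generations(monthly_estimates):
    """Sum (generations per day) x (days) over all measurement intervals."""
    return sum(gens_per_day * days for gens_per_day, days in monthly_estimates)


# Hypothetical example: 50 colonies counted on the dilution plate.
cells = viable_cells_per_colony(50)
gens = generations_per_transfer(cells)
```

With this layout, a line's final generation number is simply `total_generations` applied to its list of monthly (generations per day, days) pairs.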
DNA extraction and sequencing
Genomic DNA was extracted from 1 ml of overnight culture inoculated from 47 frozen derivatives of MA lines using the Wizard Genomic DNA Purification Kit (Promega Inc.). Concentration and purity were analyzed using a Thermo Scientific Nanodrop 2000c (Thermo Scientific Inc.) and a 1% Agarose gel with a Quick-Load 2-log ladder (New England BioLabs Inc.). Following library preparation, sequencing was performed using the 151-bp paired-end Illumina HiSeq platform at the University of New Hampshire Hubbard Center for Genomic Studies with an average fragment size between paired-end reads of ∼386 bp. Sequenced lineages were then individually mapped to the reference genome of Burkholderia cenocepacia HI2424 (LiPuma et al. 2002), with both the Burrows-Wheeler Aligner (BWA) (Li and Durbin 2009) and Novoalign (www.novocraft.com), producing an average sequence depth of ∼50×.
Molecular analysis and mutation identification
To identify base-substitution mutations, the SAM alignment files that were produced by each reference aligner were first converted to mpileup format using samtools (Li et al. 2009). Forward and reverse read alignments were then produced for each position in each line using in-house Perl scripts. Next, a three-step process was used to detect polymorphisms.
First, a base for each individual line was called if a site was covered by at least two forward and two reverse reads, and at least 80% of those reads identified the same base. Second, an ancestral consensus was called as the base with the highest support among reads across all lines, as long as there were at least three lines with sufficient coverage to identify a base. Lastly, at sites where both an individual line base and ancestral consensus were identified, individual line bases were compared to the ancestral base, and if they were different, a putative base-substitution mutation was identified. Putative base-substitution mutations were identified as true substitutions if both aligners independently identified the mutation.
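The three-step procedure above can be sketched in Python as follows. This is an illustrative reimplementation, not the in-house scripts used in this study: the data layout (per-strand dictionaries of base counts), the function names, and the interpretation of "sufficient coverage" as "a line base could be called" are all assumptions.

```python
from collections import Counter


def call_line_base(fwd, rev, min_strand=2, min_frac=0.8):
    """Step 1: call a base for one line at one site.

    fwd, rev: dicts mapping base -> read count on each strand.
    Requires >= 2 forward and >= 2 reverse reads, with >= 80% of all
    reads supporting the same base. Returns the base or None.
    """
    if sum(fwd.values()) < min_strand or sum(rev.values()) < min_strand:
        return None
    total = Counter(fwd) + Counter(rev)
    base, count = total.most_common(1)[0]
    return base if count / sum(total.values()) >= min_frac else None


def call_ancestral_base(sites, min_lines=3):
    """Step 2: consensus across all lines at one site.

    sites: list of (fwd, rev) pairs, one per line. The consensus is the
    base with the most reads pooled over lines, provided at least three
    lines yield a base call (assumed meaning of 'sufficient coverage').
    """
    pooled = Counter()
    callable_lines = 0
    for fwd, rev in sites:
        pooled += Counter(fwd) + Counter(rev)
        if call_line_base(fwd, rev) is not None:
            callable_lines += 1
    if callable_lines < min_lines:
        return None
    return pooled.most_common(1)[0][0]


def putative_substitutions(sites):
    """Step 3: report lines whose called base differs from the consensus."""
    ancestral = call_ancestral_base(sites)
    if ancestral is None:
        return {}
    calls = {}
    for line_index, (fwd, rev) in enumerate(sites):
        base = call_line_base(fwd, rev)
        if base is not None and base != ancestral:
            calls[line_index] = (ancestral, base)
    return calls


def confirmed_substitutions(calls_aligner_a, calls_aligner_b):
    """Keep only calls made identically from both aligners' output."""
    return {k: v for k, v in calls_aligner_a.items()
            if calls_aligner_b.get(k) == v}
```

Running `putative_substitutions` once per aligner and intersecting the results with `confirmed_substitutions` mirrors the requirement that both aligners independently identify each mutation.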
Although the above criteria for identifying individual line bases and overall consensus bases are relatively lenient given our coverage of ∼50× for individual lines and ∼2200× across all lines, both the coverage and support for all substitutions that were called dramatically exceeded those criteria, demonstrating that we were not simply obtaining false positives in regions of lower coverage (Table S1). Furthermore, these same methods have been used to identify base-substitution mutations in both Escherichia coli and Bacillus subtilis MA lines, where 19 of 19 and 69 of 69 base-substitution mutations called were confirmed by conventional sequencing, respectively (Lee et al. 2012; Sung, Ackerman, et al. 2012). Thus, these criteria are unlikely to result in false positives, while allowing us to cover the majority of the B. cenocepacia genome and reduce false negatives.
For insertion-deletion mutations (indels), inherent difficulties with gaps and repeat elements can reduce agreement in the alignment of single reads using short-read alignment algorithms, even in the case of true indels. Thus, putative indels were first extracted from both BWA and Novoalign at all sites where at least two forward and two reverse reads covered an indel, and 30% of those reads identified the exact same indel (size and motif). Next, the alignment output was additionally passed through the pattern-growth algorithm PINDEL to verify putative indels from the alignment and identify larger indels using paired-end information (Ye et al. 2009). Here, a total of twenty reads, including at least six forward and six reverse reads, were required to extract a putative indel. Putative indels were only kept as true indels for further analysis if: a) they were independently identified by both alignment algorithms and PINDEL, and at least 50% of the full-coverage reads (>25 bases on both sides of the indel) from the initial alignment identified the mutation; b) they were identified only by BWA and Novoalign, and at least 80% of the good-coverage reads from the initial alignment identified the mutation; or c) they were larger indels that were only identified by the stricter requirements of PINDEL.
Unlike base-substitution mutations, many reads that cover an indel mutation may fail to identify the mutation because they lack sufficient coverage on both sides of the mutation to anchor the read to the reference genome, particularly when the indels occur at simple sequence repeats. Therefore, applying the initially lenient filter to extract putative indels is justified to identify all potential indels. By then focusing only on the good-coverage reads and applying an independent paired-end indel identifier (PINDEL), we can filter out indels that are more likely to be false positives, while keeping only the high-concordance indels supported by multiple algorithms. Although there remains more uncertainty with indel calls than with base-substitution mutations, we are confident that we have obtained an accurate picture of the naturally occurring indels from this study because of the high concordance across algorithms and reads (Table S2; Figure S2), and the fact that no indels were called independently by more than 2 lines (Figure S3). A complete list of the indels identified in this study, along with the algorithms that identified them, their coverage, and concordance across well-covered reads can be found in Table S2.
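The retention rules (a) to (c) for putative indels can be expressed as a small decision function. The Python sketch below is illustrative only: the record layout and names are hypothetical, "good-coverage" reads are assumed to be the same as "full-coverage" reads (>25 bases on both sides of the indel), and candidates found by PINDEL plus only one aligner, a case not addressed in the text, are conservatively rejected.

```python
from dataclasses import dataclass


@dataclass
class IndelCandidate:
    """Hypothetical record for one putative indel in one line."""
    in_bwa: bool
    in_novoalign: bool
    in_pindel: bool
    supporting_full_reads: int  # full-coverage reads that show the indel
    total_full_reads: int       # all full-coverage reads spanning the site


def keep_indel(c):
    """Return True if the candidate passes rule (a), (b), or (c)."""
    if c.total_full_reads > 0:
        frac = c.supporting_full_reads / c.total_full_reads
    else:
        frac = 0.0
    both_aligners = c.in_bwa and c.in_novoalign

    if both_aligners and c.in_pindel:
        return frac >= 0.5  # rule (a): both aligners and PINDEL, >= 50%
    if both_aligners:
        return frac >= 0.8  # rule (b): both aligners only, >= 80%
    # Rule (c): PINDEL-only calls; PINDEL's stricter read thresholds are
    # assumed to have been applied upstream. Mixed cases are rejected
    # (an assumption, since the text does not cover them).
    return c.in_pindel and not (c.in_bwa or c.in_novoalign)
```

The two concordance thresholds are kept as explicit literals so that the less stringent requirement for indels confirmed by all three algorithms is easy to see.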
Mutation-rate analysis
Once a complete set of mutations had been identified in each lineage, we calculated the substitution and indel mutation rates for each line using the equation μ = m/nT, where μ represents the mutation rate (μbs for base-substitution mutations, μindel for indels), m represents the number of mutations observed, n represents the number of sites that had sufficient depth and consensus to analyze, and T represents the total generations over the course of the MA study for an individual line. The standard error of the mutation rate for each line was measured as described previously with the equation SEx = √(μ/nT) (Denver et al. 2004; Denver et al. 2009).
The final μbs and μindel for B. cenocepacia were calculated by taking the average μ of all sequenced lineages, and the total standard error was calculated as the standard deviation of the mutation rates across all lines (s) divided by the square root of the number of lines analyzed (N): SEpooled = s/√N. Specific base-substitution mutation rates were further divided into conditional rates for each substitution type using the equation μbs = m/nT, where m is the number of substitutions of a particular type, and n is the number of ancestral bases that can lead to each substitution with sufficient depth and consensus to analyze.
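These rate calculations can be written out as a short Python sketch. The function names and example inputs are hypothetical, and the use of the sample standard deviation (n - 1 denominator) for s is an assumption, as the text does not specify which estimator was used.

```python
import math
import statistics


def mutation_rate(m, n, T):
    """Per-line rate: mu = m / (n * T).

    m: mutations observed; n: analyzable sites; T: generations.
    For a conditional rate, m counts one substitution type and n counts
    only the ancestral bases that can give rise to that type.
    """
    return m / (n * T)


def line_standard_error(mu, n, T):
    """Per-line standard error: sqrt(mu / (n * T))."""
    return math.sqrt(mu / (n * T))


def pooled_rate_and_se(line_rates):
    """Mean rate across lines and SE_pooled = s / sqrt(N).

    Assumption: s is the sample standard deviation.
    """
    mean_rate = statistics.mean(line_rates)
    s = statistics.stdev(line_rates)
    return mean_rate, s / math.sqrt(len(line_rates))


# Hypothetical example: 5 mutations over 7,000,000 sites and 5,000 generations.
mu = mutation_rate(5, 7_000_000, 5_000)
se = line_standard_error(mu, 7_000_000, 5_000)
```

Because the per-line and conditional rates share the same form, a single `mutation_rate` function serves both; only the definitions of m and n change.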
Calculation of GE, πs, and Ne
Effective genome size (GE) was determined as the total coding bases in the B. cenocepacia genome. Silent site diversity (πs) was derived using a survey of 200 B. cenocepacia strains across 7 loci (atpD, gltB, gyrB, recA, lepA, phaC, trpB), which were concatenated and aligned using BIGSdb (Jolley and Maiden 2010), and analyzed using DnaSP (Librado and Rozas 2009). Using the value of μbs obtained in this study, Ne was estimated by dividing the value of πs by 2μbs (πs = 2Neμbs) (Kimura 1983).
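For concreteness, the estimate of effective population size follows directly from rearranging the relationship between silent-site diversity and the base-substitution rate. The Python sketch below uses a hypothetical function name and placeholder input values that are not results from this study.

```python
def effective_population_size(pi_s, mu_bs):
    """Estimate Ne by rearranging pi_s = 2 * Ne * mu_bs.

    pi_s: silent-site diversity; mu_bs: base-substitution rate per site
    per generation. Both inputs must be supplied by the caller.
    """
    if mu_bs <= 0:
        raise ValueError("mu_bs must be positive")
    return pi_s / (2.0 * mu_bs)


# Placeholder values for illustration only.
ne = effective_population_size(pi_s=0.05, mu_bs=1e-10)
```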
ACKNOWLEDGMENTS
We thank Kenny Flynn for helpful discussion and Brian VanDam for technical support. This work was supported by the Multidisciplinary University Research Initiative Award from the US Army Research Office (W911NF-09-1-0444 to ML, P. Foster, H. Tang, and S. Finkel); and the National Science Foundation Career Award (DEB-0845851 to VSC).