Abstract
Background The genus Burkholderia consists of species that occupy remarkably diverse ecological niches. Its best known members are important pathogens, B. mallei and B. pseudomallei, which cause glanders and melioidosis, respectively. Burkholderia genomes are unusual due to their multichromosomal organization.
Results We performed pan-genome analysis of 127 Burkholderia strains. The pan-genome is open, with saturation expected at between 86,000 and 88,000 genes. The reconstructed rearrangements indicate a strong avoidance of intra-replichore inversions that is likely caused by selection against the transfer of large groups of genes between the leading and the lagging strands. Translocated genes also tend to retain their position in the leading or the lagging strand, and this selection is stronger for large syntenies.
We detected parallel inversions in the second chromosomes of seven B. pseudomallei. Breakpoints of these inversions are formed by genes encoding components of multidrug resistance complex. The membrane components of this system are exposed to the host’s immune system, and hence these inversions may be linked to a phase variation mechanism. We identified 197 genes evolving under positive selection. We found seventeen genes evolving under positive selection on individual branches; most of the positive selection periods map to the branches that are ancestral to species clades. This might indicate rapid adaptation to new ecological niches during species formation.
Conclusions This study demonstrates the power of integrated analysis of pan-genomes, chromosome rearrangements, and selection regimes. Non-random inversion patterns indicate selective pressure; inversions are particularly frequent in the recent pathogen B. mallei and, together with periods of positive selection at other branches, may indicate adaptation to new niches. One such adaptation could be a possible phase variation mechanism in B. pseudomallei.
Background
The first evidence of multiple chromosomes in bacteria came from studies on Rhodobacter sphaeroides (Suwanto and Kaplan, 1989). Known bacteria with multiple chromosomes belong to the Chloroflexi (Kiss et al., 2010), Cyanobacteria (Welsh et al., 2008), Deinococcus-Thermus (White et al., 1999), Firmicutes (Wegmann et al., 2014), Proteobacteria (Holden et al., 2004), and Spirochaetes (Ren et al., 2003) phyla. The organization of these genomes varies. There could be linear and circular chromosomes as in Agrobacterium tumefaciens (Allardet-Servent et al., 1993) or several circular chromosomes as in Brucella spp. (Michaux et al., 1993) or Burkholderia spp. (Lessie et al., 1996). Species belonging to one genus may have different numbers of chromosomes; for example, Burkholderia cepacia has three chromosomes (Lessie et al., 1996), while Burkholderia pseudomallei has two (Holden et al., 2004).
In bacteria with multiple chromosomes, the majority of genes necessary for the basic life processes usually are located on one (primary) chromosome. Other (secondary) chromosomes contain few essential genes and are mainly composed of niche-specific genes (Egan, Fogel, and Waldor, 2005). An exception is the two circular chromosomes of Rhodobacter sphaeroides, which share responsibilities for fundamental cell processes (Mackenzie et al., 2001). Usually genes on a secondary chromosome evolve faster than genes on a primary chromosome (Cooper et al., 2010). Moreover, secondary chromosomes may serve as evolutionary test beds, so that genes from secondary chromosomes provide conditional benefits in particular environments (Cooper et al., 2010). Secondary chromosomes usually evolve from plasmids (Egan, Fogel, and Waldor, 2005). The plasmids may carry genes encoding traits beneficial for the organism’s survival, for example, resistance to antibiotics or to heavy metals. Usually the plasmid size is relatively small, but in some cases plasmids are comparable in size to the chromosome, like the megaplasmid in Rhizobium tropici (Geniaux et al., 1995).
The distinction between plasmids and chromosomes is not clearly defined, a fundamental criterion being that a chromosome must harbor genes essential for viability. In addition, chromosomes differ from plasmids in the replication process. The chromosomal replication is restricted to a particular phase of the cell cycle and the origins may be initiated only once per cycle (Boye, Løbner-Olesen, and Skarstad, 2000). In contrast, the plasmid replication is not linked to the cell cycle (Leonard and Helmstetter, 1988) and may be initiated several times per cycle (Solar et al., 1998). However, some replicons carry essential genes but have plasmid-like replication systems (Egan, Fogel, and Waldor, 2005), and it is not obvious how to classify them. Recently, the term “chromid” has been proposed for such replicons (Harrison et al., 2010).
We analyzed bacteria from the genus Burkholderia. Their genomes comprise two or three chromosomes. The genus is ecologically diverse (Coenye and Vandamme, 2003); for example, B. mallei and B. pseudomallei are pathogens causing glanders and melioidosis, respectively, in humans and animals (Howe, Sampath, and Spotnitz, 1971); B. glumae is a pathogen of rice (Ham, Melanson, and Rush, 2011); B. xenovorans is an effective degrader of polychlorinated biphenyl, used for biodegradation of pollutants (Goris et al., 2004); B. phytofirmans is a plant-beneficial endophyte that may trigger disease resistance in the host plant (Frommel, Nowak, and Lazarovits, 1991).
By definition, the pan-genome of a genus or species is the set of all genes found in at least one strain (Tettelin et al., 2005). The core-genome is the set of genes shared by all strains. The pan-genome size of 56 Burkholderia genomes has been estimated to exceed 40,000 genes with no sign of saturation upon addition of more strains, and the core-genome is approximately 1,000 genes (Ussery et al., 2009). A separate analysis of 37 complete B. pseudomallei genomes did not show saturation either (Spring-Pearson et al., 2015). The genomes of B. mallei demonstrate low genetic diversity in comparison to B. pseudomallei (Ussery et al., 2009). The core-genome of B. mallei is smaller than that of B. pseudomallei, while the variable gene sets are larger for B. mallei (Losada et al., 2010). B. thailandensis, also belonging to the pseudomallei group, adds many genes to the pan-genome of B. mallei and B. pseudomallei but does not influence the core-genome (Ussery et al., 2009).
Several examples of gene translocations between chromosomes in Burkholderia are known, e.g., the translocation between the first and the third chromosomes in B. cenocepacia AU 1054, affecting many essential genes (Guo et al., 2010). Following interchromosomal translocation, genes change their expression level and substitution rate, dependent on the direction of the translocation (Morrow and Cooper, 2012).
The analysis of gene gains and losses shows that about 60% of gene families in the Burkholderia genus have experienced horizontal gene transfer. More than 7,000 candidate donors belong to the Proteobacteria phylum (Zhu et al., 2011). Gene gains and losses impact the pathogenicity of species. The loss of a T3SS-encoding fragment in B. mallei ATCC 23344, compared to B. mallei SAVP1, is responsible for the difference in the virulence between these strains (Schutzer et al., 2008). Another example is the loss of the L-arabinose assimilation operon by pathogens B. mallei and B. pseudomallei in comparison with an avirulent strain B. thailandensis. Introducing the L-arabinose assimilation operon in a B. pseudomallei strain made it less virulent (Moore et al., 2004). Hence, although the mechanism is not clear, there must be a link between the presence of this operon and virulence. Gene loss also influences the adaptability of an organism. The genomic reduction of B. mallei following its divergence from B. pseudomallei likely resulted in its inability to live outside the host (Losada et al., 2010; Godoy et al., 2003). The acquisition of the atrazine degradation and nitrotoluene degradation pathways by B. glumae PG1, compared to B. glumae LMG 2196 and B. glumae BGR1, could result from an adaptation, since these toxic agents are used in the farming industry as a herbicide and a pesticide, respectively (Lee et al., 2016).
B. pseudomallei is known to have a high rate of homologous recombination relative to the mutation rate (Cheng et al., 2008). A study of 106 isolates of B. pseudomallei revealed that at least 78% of the core-genome of the reference strain K96243 is covered by recombination events, comparable to Streptococcus pneumoniae, a highly recombinogenic species (Didelot et al., 2012). Moreover, recombination is more common between members of the same genomic clade, which might be a consequence of the clade members sharing restriction-modification systems (Nandi et al., 2015).
Genome rearrangements such as duplications, deletions, and inversions also play important roles in the bacterial evolution, as they alter the chromosome organization and gene expression in ways impossible through point mutations. DNA rearrangements may be constrained. Chromosomal rearrangements often happen via recombination between repeated sequences, such as insertion (IS) elements (Raeside et al., 2014) and rRNA operons (Huang et al., 2008). Selection has been argued to preserve the size symmetry of the two replichores of a circular chromosome between the origin and the terminus of replication (Eisen et al., 2000). Reconstruction of the history of genome rearrangements provided a base for a new class of phylogeny reconstruction algorithms (Alekseyev and Pevzner, 2009; Hu, Lin, and Tang, 2014).
Genomic analyses of the first sequenced B. pseudomallei strains revealed that their chromosomes are largely collinear except for several inversions (Challacombe et al., 2014; Nandi et al., 2010). One of them was observed in two strains from distinct geographic origins, suggesting that the inversions may have occurred independently (Nandi et al., 2010). Whole-genome comparisons of clonal primary and relapse B. pseudomallei isolates revealed an inversion in the relapse isolate relative to the primary isolate and other complete B. pseudomallei genomes (Hayden et al., 2012).
In comparison to B. pseudomallei, B. mallei genomes harbor numerous IS elements that most likely have mediated the higher rate of rearrangements (Nierman et al., 2004). In particular, IS elements of the type IS407A have undergone a significant expansion in all sequenced B. mallei strains, accounting for 76% of all IS elements, and chromosomes were dramatically and extensively rearranged by recombination across these elements (Losada et al., 2010). Both chromosomes of B. pseudomallei and B. thailandensis have been shown to be highly syntenic between the two species. Only several large-scale inversions have been identified, and translocations between chromosomes have not been observed. Breakpoints flanking these inversions contain genes involved in DNA recombination such as transposases, phage integrases, and recombinases (Yu et al., 2006).
Here, we performed a pan-genome analysis for 127 complete Burkholderia strains, reconstructed the history of rearrangements such as interchromosome translocations, inversions, deletions/insertions, and gene gain/loss events, and identified genes evolving under positive selection.
Methods
Available (as of 1 September 2016) complete genome sequences of 127 Burkholderia strains (Suppl. Table S1) were selected for analysis. The genomes were taken from the NCBI Genome database (NCBI, 2017).
Construction of orthologs
We constructed orthologous groups using Proteinortho V5.13 with the default parameters (Lechner et al., 2011).
Estimation of the pan-genome and core-genome size
To predict the number of genes in the Burkholderia pan-genome and core-genome, we used the binomial mixture model (Snipen, Almoy, and Ussery, 2009) and the Chao lower bound (Chao, 1987) implemented in the R-package Micropan (Snipen and Liland, 2015). To select the model better fitting the distribution of genes by the number of strains in which they are present, we used the Akaike information criterion with correction for a finite sample size (Akaike, 1974; Hurvich and Tsai, 1989).
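As a rough illustration of the Chao lower bound, the sketch below computes it from a gene presence/absence table. It assumes the standard form of the estimator, N_obs + y1^2 / (2 y2), where y1 and y2 are the numbers of orthologous groups found in exactly one and exactly two genomes. The function name and the toy data are hypothetical, and this is not the Micropan implementation.

```python
from collections import Counter

def chao_lower_bound(presence):
    """presence: dict mapping orthologous group -> set of strains carrying it."""
    # Count how many groups occur in exactly 1, 2, 3, ... genomes.
    counts = Counter(len(strains) for strains in presence.values() if strains)
    n_obs = sum(counts.values())
    y1, y2 = counts[1], counts[2]
    if y2 == 0:
        raise ValueError("estimator undefined without groups seen in exactly two genomes")
    return n_obs + y1 ** 2 / (2 * y2)

# Toy table: three strains (A, B, C) and six orthologous groups.
toy = {
    "og1": {"A", "B", "C"},
    "og2": {"A"},
    "og3": {"B"},
    "og4": {"A", "C"},
    "og5": {"C"},
    "og6": {"B", "C"},
}
estimate = chao_lower_bound(toy)
```

With three singleton groups and two groups shared by exactly two strains, the toy table gives 6 + 9/4 = 8.25 as the lower bound.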
Phylogenetic trees
Phylogenetic trees were visualized by FigTree v1.4.2 (http://tree.bio.ed.ac.uk/software/figtree/).
Trees based on nucleotide alignments
We performed codon alignment for each of the 2117 orthologous groups using Mafft version v7.123b (Katoh and Standley, 2013) and Guidance v2.01 (Penn et al., 2010). Four orthologous groups containing sequences with score below 0.8 were excluded from further analysis. Poorly aligned residues (guidance score below 0.8) were masked. The resulting sequences were concatenated and the tree was constructed with RAxML v8.2.9 (Stamatakis, 2014) using the GTR+Gamma model with 100 bootstrap runs.
Trees based on protein alignments
We used 1046 orthologous protein-coding genes from 127 genomes. We used Mafft v7.273 (Katoh and Standley, 2013) in the linsi mode to align genes belonging to one orthologous group. Concatenated protein-coding sequences were used to construct the tree. We used PhyML (Guindon et al., 2010) with the JTT model and discrete gamma with four categories and approximate Bayes branch supports.
Trees based on gene content
The gene content tree was constructed using the pairwise distance matrix, where Strain_i is the set of orthologs belonging to a given strain i, ignoring paralogs.
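A set-based distance of this kind can be sketched as follows. The Jaccard distance used here, 1 − |Strain_i ∩ Strain_j| / |Strain_i ∪ Strain_j|, is an assumption chosen because it is a common measure for gene content; the exact formula is not stated in this text, and all names and data below are hypothetical.

```python
def gene_content_distance(strain_i, strain_j):
    """Jaccard distance between two sets of orthologous groups (assumed form)."""
    union = strain_i | strain_j
    if not union:
        return 0.0
    return 1.0 - len(strain_i & strain_j) / len(union)

def distance_matrix(strains):
    """Return strain names (sorted) and the symmetric pairwise distance matrix."""
    names = sorted(strains)
    matrix = [[gene_content_distance(strains[a], strains[b]) for b in names] for a in names]
    return names, matrix

# Toy input: each strain is the set of orthologous groups it carries.
strains = {
    "s1": {"og1", "og2", "og3"},
    "s2": {"og2", "og3", "og4"},
    "s3": {"og5"},
}
names, matrix = distance_matrix(strains)
```

Representing each strain as a set discards copy number, which matches the instruction to ignore paralogs.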
Trees based on gene order
Trees based on gene order were built using the MLGO software (Maximum Likelihood for Gene-Order Analysis) with default parameters (Hu, Lin, and Tang, 2014).
Synteny blocks and rearrangements history
Synteny blocks for closely related strains were constructed using the Sibelia software (Minkin et al., 2013) with the minimal length of blocks being 5000 bp. We filtered out blocks observed in any single genome more than once. Synteny blocks for distant strains were constructed using the Drimm-Synteny program (Pham and Pevzner, 2010) based on locations of universal genes. The rearrangements histories for given topologies were constructed using the MGRA v2.2 server (Avdeyev et al., 2016).
Calculation of inversion positions
The origins and terminators of replication were determined by analysis of GC-skew plots with Ori-Finder (Gao and Zhang, 2008) and an ad hoc Python script. Statistical significance of over-representation of inter-replichore inversions was calculated as the probability of a given number of inter-replichore inversions in the set of inversions with the given lengths. The probability of occurrence of the origin or the terminator of replication within the inversion was calculated as the ratio of the inversion length to the replichore length.
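One possible reading of this calculation is sketched below. Each inversion is assumed to span the origin or terminus independently with probability equal to its length divided by the replichore length, so the number of inter-replichore inversions follows a Poisson-binomial distribution, and the significance is its upper tail. The independence assumption, the function name, and the toy numbers are illustrative and not taken from the study.

```python
def inter_replichore_tail(inversion_lengths, replichore_length, observed):
    """P(at least `observed` inversions span the origin or terminus) under the null."""
    probs = [min(1.0, length / replichore_length) for length in inversion_lengths]
    # dist[k] = probability that exactly k of the inversions processed so far span ori/ter.
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)
            nxt[k + 1] += mass * p
        dist = nxt
    return sum(dist[observed:])

# Toy case: two 50 kb inversions on a 1 Mb replichore, both observed to be inter-replichore.
lengths = [50_000, 50_000]
p_value = inter_replichore_tail(lengths, replichore_length=1_000_000, observed=2)
```

In the toy case each inversion has a 5% chance of spanning the origin or terminus, so observing both as inter-replichore has probability 0.05 × 0.05 = 0.0025.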
History of interchromosomal translocations
To reconstruct translocations between chromosomes, we ordered universal single-copy orthologs and assigned a vector of ortholog presence to each strain. A component of this vector was the chromosome (1, 2, 3) harboring the ortholog in the strain. Then we subjected the obtained alignment of vectors to PAML 4.6 (Yang, 1997) for ancestral reconstruction with default parameters, except model = REV(GTR) and RateAncestor = 2.
Gene acquisition and loss
We used GLOOME (Cohen et al., 2010) for the gain/loss analysis under the non-stationary evolutionary model with a variable gain/loss ratio. Other parameters were set based on character counts directly from the phyletic pattern.
Gene annotation
To assign GO terms to genes we used Interproscan (Jones et al., 2014). A GO term was assigned to an OG, if it was assigned to at least 90% of genes in this OG. To determine overrepresented functional categories we used topGO v.3.6 package for R (Alexa and Rahnenfuhrer, 2016). Clusters of Orthologous Groups were predicted using eggNOG v4.5 database (Huerta-Cepas et al., 2016). Protein subcellular localization was predicted using PSORTb v3.02 web server (Yu et al., 2017).
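The 90% rule for transferring GO terms from genes to an orthologous group can be sketched as below. The data structure (one set of GO terms per gene) and the function name are assumptions made for illustration.

```python
from collections import Counter

def og_go_terms(gene_terms, threshold=0.9):
    """gene_terms: list with one set of GO terms per gene in the orthologous group."""
    if not gene_terms:
        return set()
    # Each gene contributes at most one count per term because its terms form a set.
    counts = Counter(term for terms in gene_terms for term in terms)
    needed = threshold * len(gene_terms)
    return {term for term, n in counts.items() if n >= needed}

# Toy group of ten genes: one term is present in all genes, another in only eight.
og = [{"GO:0005975", "GO:0016787"}] * 8 + [{"GO:0005975"}] * 2
assigned = og_go_terms(og)
```

Here only GO:0005975 is assigned to the group, because GO:0016787 is annotated in 80% of its genes.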
Detection of positive selection
We applied codon models for positive selection to OGs common for the B. mallei, B. pseudomallei, B. thailandensis, B. oklahomensis clade. Given the low number of substitutions, it is usually not possible to reliably reconstruct a phylogenetic tree topology based on individual genes. On the other hand, given the high recombination rate, it is quite likely that gene evolutionary histories are slightly different between OGs. To overcome these issues we first used statistical binning (Mirarab et al., 2014) to group genes with similar histories, and then applied a conservative approach to detect positive selection based on multiple tree topologies.
The procedure was implemented as follows. First we constructed a phylogenetic tree for every gene using RAxML with the GTR+Gamma model and maximum likelihood with 100 bootstrap replicates. Genes with unexpectedly long branch lengths were filtered out (the maximum branch length > 0.1 or the sum of branch lengths > 0.3). Statistical binning was performed at the bootstrap incompatibility threshold of 95. For each of 25 obtained clusters we created a tree with bootstrap support using the concatenated sequence of OGs belonging to the cluster.
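The branch-length filter described above can be written compactly. In this minimal sketch the branch lengths are assumed to be already extracted from each gene tree into a list; the gene names and values are hypothetical.

```python
def keep_gene(branch_lengths, max_single=0.1, max_total=0.3):
    """True when no branch exceeds max_single and the total does not exceed max_total."""
    return max(branch_lengths) <= max_single and sum(branch_lengths) <= max_total

gene_trees = {
    "ogA": [0.01, 0.02, 0.005],
    "ogB": [0.15, 0.01, 0.01],         # one branch longer than 0.1
    "ogC": [0.09, 0.09, 0.09, 0.09],   # sum of branch lengths above 0.3
}
kept = sorted(name for name, lengths in gene_trees.items() if keep_gene(lengths))
```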
We used two different methods to detect positive selection. The M8 vs M8a comparison allows for gene-wide identification of positive selection (Yang, 2007), while the branch-site model accounts for positive selection on a specific branch (Zhang et al., 2005). Each test was performed six times using different trees: the maximum likelihood tree and five random bootstrap trees. We used the minimum value of the LRT (likelihood ratio test) statistic to avoid false identification of positive selection which could be caused by an incorrect tree topology.
For the branch-site model we tested each internal branch as a foreground branch one by one; we did not test terminal branches to avoid false positives caused by sequencing errors. Results of the branch-site tests were aggregated only in case of bipartition compatibility. We considered only bipartitions that were present in at least three tests, and for these we also computed the minimum value of the LRT statistic. The test results were mapped back to the species tree based on bipartition compatibility.
In both cases we used the chi-square distribution with one degree of freedom for the LRT to compute the p-value. Finally, we computed the q-value, while all LRT values equal to zero were excluded from the test. We set the q-value threshold to 0.1.
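The conservative testing scheme can be sketched as follows: take the minimum LRT statistic over the tested topologies, convert it to a p-value with the one-degree-of-freedom chi-square distribution, drop zero statistics, and correct for multiple testing. The Benjamini-Hochberg adjustment below is a stand-in for the q-value computation, whose exact method is not specified here; the LRT values are hypothetical.

```python
import math

def lrt_p_value(lrt):
    """Upper tail of the chi-square distribution with one degree of freedom."""
    return math.erfc(math.sqrt(lrt / 2.0))

def conservative_lrt(lrt_per_tree):
    """Minimum LRT statistic over the tested tree topologies."""
    return min(lrt_per_tree)

def benjamini_hochberg(p_values):
    """BH-adjusted p-values, used here as a stand-in for q-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p-value down
        i = order[rank - 1]
        running = min(running, p_values[i] * m / rank)
        adjusted[i] = running
    return adjusted

# Hypothetical LRT statistics for three orthologous groups on three topologies each.
lrts = {"og1": [12.1, 10.8, 11.5], "og2": [0.0, 0.4, 0.2], "og3": [4.2, 3.9, 5.0]}
stats = {og: conservative_lrt(values) for og, values in lrts.items()}
tested = {og: s for og, s in stats.items() if s > 0}   # zero LRT values are excluded
names = sorted(tested)
q = dict(zip(names, benjamini_hochberg([lrt_p_value(tested[n]) for n in names])))
significant = [n for n in names if q[n] < 0.1]
```

Taking the minimum over topologies means a gene is reported only if every tested tree supports positive selection, which trades power for robustness to topology errors.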
Correlations
To estimate dependencies between various parameters such as expression level, localization in the first/second chromosome, localization on the leading/lagging strand, we used linear models (lm function, R v3.3.2). Additional parameters such as sum of branch lengths, alignment length and GC-content were included as they can affect the power of the method (Drummond et al., 2005). The parameters were transformed to have a bell-shaped distribution if possible: log(x + 1) for the expression levels, log(x + 10^(-6)) for the LRT statistic, and log(x) for the alignment length, sum of branch lengths, standard deviation of GC-content, and ω0 (negative selection). Continuous variables were centered at zero and scaled so that the standard deviation was equal to one. This makes the linear model coefficients directly comparable. Outliers were identified on the residual plots and excluded from the model; the residual plots did not indicate abnormalities. For the linear models, we included potential confounding variables in the model, and kept only significant ones for the final linear model.
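The transformation and scaling steps can be illustrated with a short sketch. It uses Python rather than R for illustration, assumes the sample standard deviation for scaling, and works on hypothetical values.

```python
import math
import statistics

def standardize(values):
    """Center at zero and scale to unit (sample) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

expression = [0.0, 3.0, 15.0, 120.0]   # hypothetical expression levels
lrt = [0.0, 0.5, 4.2, 11.0]            # hypothetical LRT statistics

# log(x + 1) for expression and log(x + 10^-6) for the LRT statistic, then scaling.
expr_z = standardize([math.log(x + 1.0) for x in expression])
lrt_z = standardize([math.log(x + 1e-6) for x in lrt])
```

The small offsets keep the logarithm defined for zero values, and after scaling each predictor has mean zero and unit standard deviation, so regression coefficients can be compared directly.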
Results and discussion
Pan-genome and core-genome analysis
Analysis of orthology yielded 757,526 non-trivial orthologous groups and 21,740 orphans, that is, genes observed in only one genome (some of them could result from mis-annotation). The core-genome size dependence on the number of analyzed strains is shown in Fig. 1a. The number of universal genes that are present in all strains saturates at about 1,050. The pan-genome size for all strains is 48,000 genes with no signs of saturation, showing that the gene diversity of the Burkholderia species has not been captured yet (Fig. 1b). Based on these data, the binomial mixture model (Snipen, Almoy, and Ussery, 2009) predicts that, as more genomes are sequenced, the Burkholderia core-genome will converge to 457 genes and the pan-genome to 86,845 genes. The number of new genes decreases with each new genome n at the rate N(n) = 2557 n^(-0.56), confirming that the pan-genome is indeed open (Fig. S1a). Each new genome adds about 171 genes to the pan-genome. The Chao lower bound estimate (Chao, 1987) of the pan-genome size is 88,080. These results are consistent with the reported pan-genome size of 56 Burkholderia strains (Ussery et al., 2009).
Suppl. Fig. S2 and S3 show the core- and pan-genome size dependencies for B. pseudomallei and B. mallei, respectively. Their pan-genomes also have not reached saturation (N(n) = 788 n^(-0.53) for B. pseudomallei and N(n) = 867 n^(-0.87) for B. mallei) (Suppl. Fig. S1b,c). These results are also consistent with the reported pan-genome size of 37 B. pseudomallei strains (Spring-Pearson et al., 2015).
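As a quick numerical check of the fitted decay, the sketch below evaluates N(n) = 2557 n^(-0.56) at n = 127. The function name is hypothetical and the coefficients are the rounded values quoted in the text.

```python
def new_genes(n, scale=2557.0, exponent=-0.56):
    """Expected number of new genes contributed by the n-th genome: scale * n**exponent."""
    return scale * n ** exponent

at_127 = new_genes(127)
```

The value comes out at roughly 170 genes, close to the figure of about 171 quoted above; the small gap is presumably due to rounding of the fitted coefficients. Because the exponent lies between -1 and 0, the cumulative sum of N(n) grows without bound as n increases, which is what an open pan-genome means in this model.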
The distribution of genes by the number of strains in which they are present has a typical U-shape form (Fig. 2), with numerous unique and universal genes and fewer periphery genes. We approximated this distribution with the sum of three exponents (Makarova et al., 2007) and the sum of two power functions (Gordienko, Kazanov, and Gelfand, 2013), and applied the method of the least squares with the Akaike information criterion (AIC) (Hurvich and Tsai, 1989) to define which of these functions better fits the data. Approximation by the sum of three exponents recapitulates the U-shape slightly better. This is consistent with the analysis of the Streptococcus pan-genome (Shelyakin et al., 2018), in which the sum of three exponents also provides a better fit.
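Model comparison of this kind can be sketched with the small-sample AIC for a least-squares fit, AICc = n ln(RSS/n) + 2k + 2k(k+1)/(n − k − 1), which holds up to an additive constant under Gaussian errors. The residual sums of squares and the parameter counts below (six for three exponents, four for two power functions) are assumptions for illustration, not values from the study.

```python
import math

def aicc_least_squares(rss, n_points, n_params):
    """AICc for a least-squares fit with Gaussian errors (up to an additive constant)."""
    aic = n_points * math.log(rss / n_points) + 2 * n_params
    correction = 2 * n_params * (n_params + 1) / (n_points - n_params - 1)
    return aic + correction

# Hypothetical residual sums of squares for the two candidate functions.
n_points = 127
aicc_exp = aicc_least_squares(rss=1.8e6, n_points=n_points, n_params=6)   # three exponents
aicc_pow = aicc_least_squares(rss=2.4e6, n_points=n_points, n_params=4)   # two power functions
preferred = "three exponents" if aicc_exp < aicc_pow else "two power functions"
```

The model with the lower AICc is preferred; the penalty terms ensure that the extra parameters of the three-exponent model must be paid for by a sufficiently smaller residual.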
Phylogenetic reconstruction
The phylogenetic tree (hereinafter “the basic tree”) and the gene content tree are largely consistent as the trees have the same clades with one major exception (Suppl. Fig. S4). In the gene content tree B. mallei and B. pseudomallei form two distinct clusters, whereas in the basic tree monophyletic B. mallei are nested within paraphyletic B. pseudomallei. This discrepancy could be due to the lifestyles of B. mallei and B. pseudomallei, as both species are pathogens of animals and possess specific sets of genes. Thus even if universal genes in some pseudomallei strains are closer to the orthologous genes in mallei than to genes in other pseudomallei strains, these species will be distant on the gene content tree due to species-specific genes.
Although the trees are composed of the same clades, we observed numerous contradictions in strain positions. These contradictions are likely caused by clade-specific patterns of recombination and accessory gene exchange (Nandi et al., 2015).
Gains and losses of genes along the phylogenetic tree (Suppl. Fig. S5a) were assessed, excluding plasmid genes. We observed that the Burkholderia species have experienced numerous gains and losses, which could explain their ecological diversity. In particular, a separate analysis of the B. pseudomallei group (Suppl. Fig. S5b) yielded considerable gene loss in the B. mallei clade. The genome reduction among the B. mallei strains is likely associated with the loss of genes redundant for obligate pathogens (Losada et al., 2010).
Rearrangements of universal single-copy genes
To analyze inter-chromosomal translocations, we considered single-copy universal genes (hereinafter “core genes”) and analyzed their distribution among the chromosomes (Table 1). The majority of such genes belong to the first chromosome, ten-fold fewer genes are in the second chromosome, and they are almost absent in the third chromosome, the only exception resulting from a large translocation from the first to the third chromosome in B. cenocepacia AU 1054 (Guo et al., 2010).
The genomes of B. cenocepacia 895, B. cepacia strain LO6, and B. contaminans MS14 were not included in the rearrangement analysis due to likely artifacts of the genome assembly (See Suppl. Fig. S6).
Reconstruction of translocations of 1046 core genes between the chromosomes yielded 210 events (Fig. 3). Thirty-eight events were reconstructed separately for B. mallei and B. pseudomallei. There was no statistically significant overrepresentation of GO categories in the set of translocated genes.
Six genes have been translocated independently two or more times on different tree branches. They encode Aldo/keto reductase (IPR020471), HTH-type transcriptional regulator ArgP (IPR017685), Gamma-glutamyltranspeptidase (IPR000101), Acid phosphatase AcpA (IPR017768), Tryptophan synthase beta subunit-like PLP-dependent enzyme (IPR036052), and TonB-dependent receptor.
The reconstructed common ancestor of Burkholderia has 965 universal single-copy genes in the first chromosome, and 81, in the second chromosome.
We analyzed intra-chromosomal rearrangements that involve the core genes using only one representative strain from clades with closely related species. Core genes were grouped into 87 synteny blocks that contained two or more core genes in the same order in all analyzed genomes. The rearrangements history yielded no parallel events except parallel translocations between chromosomes described above. There was no correlation between the number of rearrangements and the average mutation rates of the core genes (data not shown), which could also be explained by ecological diversity of strains.
Intra-species rearrangements
For clades with closely-related strains such as the B. thailandensis, B. pseudomallei, B. mallei, and B. cepacia groups we reconstructed the history of genome rearrangements using synteny blocks based on nucleotide alignments of chromosomes.
B. mallei clade
For fifteen B. mallei strains and two B. pseudomallei used as outgroups, we constructed 104 common synteny blocks in both chromosomes. Only one block of length 40 kb, which includes 24 universal genes, was translocated in the B. mallei clade. This block is surrounded by IS elements and rRNAs, which may indicate that this translocation resulted from recombination between chromosomes.
This indicates that in these strains translocations between chromosomes are rare in comparison to within-chromosome rearrangements. Fixing the tree to the basic one, we reconstructed 88 inversions in the first chromosomes and 27 inversions in the second ones (Fig. 4b). The reconstruction yields nine parallel events in the first chromosomes and three, in the second ones. The boundaries of the inversions are formed by repeated sequences (transposases).
To check whether the contradictions between the tree topology and the inversion history were caused by homologous recombination, we constructed trees based on genes involved in these events. For all inverted sequences, strains do not change their position in the tree (data not shown). Therefore, we suppose that parallel events were caused by active intragenome recombination coupled with a limited number of repeated elements.
We applied maximum likelihood optimization methods to obtain a topology based on the universal gene order. The optimized topology (Suppl. Fig. S7a) yielded a comparable number of parallel inversions, demonstrating that the latter were not an artifact arising from an incorrect phylogeny. We observed a correlation between the inversion rate and the mutation rates in the core genes (Spearman test, ρ = 0.8, p-value = 10^(-7)) (Fig. 5).
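For readers unfamiliar with the Spearman test, the coefficient is the Pearson correlation of the ranks. The sketch below is a plain-Python illustration with tie handling by average ranks; the per-branch counts and rates are hypothetical and are not the data behind Fig. 5.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

inversions = [1, 3, 2, 8, 5]                           # hypothetical inversion counts
substitutions = [0.001, 0.004, 0.002, 0.010, 0.003]    # hypothetical substitution rates
rho = spearman_rho(inversions, substitutions)
```

In this toy data set only two ranks differ between the series, each by one position, giving ρ = 1 − 6·2/(5·24) = 0.9.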
B. pseudomallei clade
The gene order in 51 strains of B. pseudomallei turned out to be significantly more stable than that in B. mallei, as only three inversions were reconstructed in the first chromosomes, and five, in the second chromosomes (Fig. 4a). Moreover, the average coverage of chromosomes by synteny blocks was more than 90% for the first and 80% for the second chromosomes, revealing a stable order and gene content. Two blocks of length about 20-25 kb are swapped in B. pseudomallei K42, which is likely to be an assembly artifact.
Inversions in the second chromosomes with length about 1.3 Mb have the same boundaries for all seven strains despite the fact that they are located at distant branches of the phylogenetic tree (Fig. 4c). Breakpoints of these inversions are formed by six genes encoding (1,2) Rhamnosyltransferase type 1 A,B; (3) drug resistance transporter (EmrB/QacA subfamily); (4) rhamnosyltransferase type II and (5,6) the components of the RND efflux system, outer membrane lipoproteins nodT and emrA.
B. thailandensis clade
For 15 strains of B. thailandensis we constructed 56 synteny blocks in both chromosomes. Two strains of B. oklahomensis and one B. pseudomallei were used as outgroups. The average coverage by blocks was 75% for the first, and 50% for the second chromosomes. Fixing the tree topology to the basic tree, we reconstructed 18 inversions and 265 insertion/deletion events (Fig. 6). B. thailandensis has a higher rate of inversions and deletions than B. oklahomensis and B. pseudomallei.
The reconstruction yields two parallel events in the first chromosomes and one, in the second ones. The boundaries of these inversions are formed by repeated sequences (transposases). For all inverted sequences, strains do not change their position in the trees based on sequence similarities of genes involved in these events (data not shown).
The topology of the phylogenetic tree based on the order of synteny blocks (Suppl. Fig. S7b) is largely consistent with the basic tree, the only exception being the changed position of B. thailandensis E254 caused by the parallel inversions.
Two non-universal, non-trivial translocated synteny blocks were found. One is a block with length 38 kb in the first chromosome in B. pseudomallei, the second chromosome in B. oklahomensis, and absent in the B. thailandensis genomes. This block comprises genes linked with the amino acid metabolism. The second block is a parallel phage insertion with length 9 kb in the first chromosome of B. oklahomensis strain EO147 and in the second chromosome of B. thailandensis 2003015869.
B. cepacia group
For 27 strains of the cepacia group, the average coverage of chromosomes by synteny blocks was 50% for the first, 30% for the second, and less than 10% for the third chromosome. This agrees with the preferred location of universal genes discussed above. Hereafter, the third chromosomes are not considered due to their low conservation. Fixing the tree to the basic one, we reconstructed 17 inversions and 574 insertion/deletion events. The topology of the phylogenetic tree based on the order of synteny blocks (Suppl. Fig. S7c) is not consistent with the basic tree, and most deep nodes have low bootstrap support, which may be explained by numerous parallel gain/loss events.
Only one parallel inversion of length 530 kb was found in the first chromosome of B. cenocepacia AU 1054 and B. cenocepacia J2315, with the inversion breakpoints formed by the 16S-23S rRNA locus. In order to distinguish between truly parallel events and homologous recombination between these strains, we constructed a tree based on proteins encoded by genes from the inverted fragment. B. cenocepacia AU 1054 and B. cenocepacia J2315 did not change their position in the tree, and in particular, did not cluster together (data not shown). Hence, this block was not subject to homologous recombination between these strains.
Two non-universal synteny blocks were found on different chromosomes in different strains. One block of length 8.5 kb is located on the first chromosome of B. cenocepacia MC0-3 and on the second chromosome of B. cepacia ATCC 25416. The cassette contains five genes that belong to the iron uptake pathway, and an AraC family protein. Some parts of this cassette were found in other Burkholderia species (Fig. 7a).
Another block with length 5.5 kb was found only in 17 of 30 strains belonging to the cepacia group (Fig. 7b). The cassette contains four genes forming the acetyl-CoA carboxylase complex, glycoside hydrolase (GO:0005975 carbohydrate metabolic process), and a LysR family protein. This synteny block is found in all B. mallei, B. pseudomallei, B. oklahomensis, B. glumae, B. gladioli and is absent in B. thailandensis and other strains. Its presence in different chromosomes and differences between the tree of this cassette (Suppl. Fig. S8) and the basic tree indicate that this cassette is spreading horizontally.
Selection on rearrangement positions
In many bacteria, within-replichore inversions, that is inversions with endpoints in the same replichore, have been shown to be relatively rare and significantly shorter than inter-replichore inversions (Darling, Miklós, and Ragan, 2008; Repar and Warnecke, 2017). The pattern of inversions reconstructed on both chromosomes in B. mallei is consistent with both of these observations.
Inter-replichore inversions are overrepresented on the first (p-value < 10^-33) and on the second (p-value < 10^-30) chromosomes. The lengths of inter-replichore inversions have a wide distribution up to the full replichore size (Fig. 8b), whereas the observed within-replichore inversions mainly do not exceed 15% of the replichore length. We observed only two longer inversions, both in B. mallei FMH23344. These inversions overlap with each other and may be explained by a single translocation event. This strong avoidance of within-replichore inversions is probably caused by selection against gene movement between the leading and the lagging strands (Zhang and Gao, 2017).
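The distinction between the two inversion classes can be illustrated with a minimal sketch. It assumes a circular chromosome with known origin and terminus coordinates and calls an inversion inter-replichore when its two endpoints lie in different replichores. The function names and the example coordinates are hypothetical and are not taken from the authors' scripts.

```python
def replichore(pos, ori, ter, length):
    """Return 'right' if pos lies on the arc running from ori to ter, else 'left'."""
    d_pos = (pos - ori) % length   # distance from the origin along the circle
    d_ter = (ter - ori) % length   # length of the 'right' replichore
    return "right" if d_pos < d_ter else "left"


def classify_inversion(start, end, ori, ter, length):
    """Classify an inversion by the replichores of its two endpoints."""
    same = replichore(start, ori, ter, length) == replichore(end, ori, ter, length)
    return "within-replichore" if same else "inter-replichore"


def relative_length(start, end, ori, ter, length):
    """Length of a within-replichore inversion as a fraction of its replichore.

    Only meaningful when both endpoints lie in the same replichore.
    """
    d_start = (start - ori) % length
    d_end = (end - ori) % length
    d_ter = (ter - ori) % length
    replichore_len = d_ter if d_start < d_ter else length - d_ter
    return abs(d_end - d_start) / replichore_len


# Hypothetical chromosome of 1000 units with ori at 0 and ter at 500.
print(classify_inversion(100, 200, ori=0, ter=500, length=1000))
print(classify_inversion(400, 600, ori=0, ter=500, length=1000))
```

Because inverting a segment of a circular chromosome is equivalent to inverting its complement, only the replichore membership of the two endpoints matters for the classification.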
The reconstruction of translocations also revealed that genes tend to retain their position on the leading or lagging strand (two-sided binomial test, p-value=0.03, Fig. 8a). Moreover, all blocks longer than three genes retain their position. We have not observed any difference in the level of purifying selection between genes translocated from the leading and lagging strands.
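As an illustration of the kind of test mentioned above, the following sketch computes an exact two-sided binomial p-value under the null hypothesis that a translocated block keeps or changes its strand with equal probability. The null probability of 0.5 and the counts in the example are assumptions made for illustration; the actual counts are those shown in Fig. 8a.

```python
from math import comb


def binomial_two_sided(k, n, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed count k under Binomial(n, p).
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= threshold))


# Hypothetical counts: 14 of 20 translocated blocks kept their strand.
print(binomial_two_sided(14, 20))
```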
Positive selection on core genes
A total of 1842 single-copy genes common to the B. oklahomensis, B. thailandensis, B. pseudomallei, B. mallei clade were tested to identify genes evolving under positive selection. We detected 197 genes evolving under positive selection using the M8 model (Suppl. Table S3). No GO categories were significantly overrepresented, but we observed overrepresentation of outer membrane proteins (permutation test, p-value=0.03), consistent with observations in other bacterial species (Cao et al., 2017; Xu, Chen, and Zhou, 2011).
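A permutation test for overrepresentation of a gene category can be sketched as follows. This is one plausible formulation, not necessarily the one used here: gene sets of the same size as the positively selected set are drawn at random from all tested genes, and the p-value is the fraction of draws containing at least as many category members as observed. The numbers of outer membrane genes in the example are hypothetical.

```python
import random


def permutation_pvalue(is_category, n_selected, observed, n_perm=10000, seed=0):
    """One-sided permutation p-value for overrepresentation.

    is_category: list of 0/1 flags, one per tested gene.
    n_selected:  size of the positively selected gene set.
    observed:    number of category members in the real selected set.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(is_category, n_selected)  # random gene set, no replacement
        if sum(sample) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical input: 100 outer membrane genes among 1842 tested genes,
# 18 of which fall among the 197 positively selected genes.
flags = [1] * 100 + [0] * 1742
print(permutation_pvalue(flags, n_selected=197, observed=18, n_perm=2000))
```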
To identify branch-specific positive selection, we used the branch-site test. In total, we identified seventeen events (Table 2), twelve of which we successfully mapped to the basic tree (Fig. 9). In the remaining five cases (flagellar hook protein FlgE, porin related exported protein, penicillin-binding protein, phosphoenolpyruvate-protein kinase and cytidylate kinase) the detected branches (bipartitions) of the gene trees were incompatible with the basic tree, and thus could not be mapped to it.
Outer membrane proteins such as the flagellar hook protein FlgE, the porin-related exported protein, and the OmpA family protein can serve as targets for the immune response. Moreover, OmpA is known to be associated with virulence, being involved in the adhesion and invasion of host cells, induction of cell death, serum and antimicrobial resistance, and immune evasion (Sousa et al., 2012). Error-prone DNA polymerase has a lower replication accuracy and, thus, a higher mutation rate. Positive selection on this polymerase might be a result of adaptation to a new lifestyle. Bacterial transcription factors are known to enable rapid adaptation to environmental conditions, which might explain strong positive selection on the LysR-family transcriptional regulator.
The majority of genes evolving under positive selection have been identified in the longest branches; accordingly, the fraction of events is higher in these branches. This might indicate rapid adaptation to new ecological niches during species formation. However, the branch-site test for positive selection is more powerful on longer branches, and the position of a branch in the tree might affect the power (Yang and Reis, 2011). Hence, the overrepresentation of positive selection events can be related to the power of the method, and does not necessarily indicate a higher number of genes affected by positive selection on these branches.
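Branch-site tests are commonly evaluated as a likelihood ratio test between a null model, in which the foreground branch is not allowed to have sites with ω > 1, and an alternative model that allows them. The sketch below assumes the frequently used chi-square approximation with one degree of freedom; the text above does not state which null distribution or multiple-testing correction was applied, and the log-likelihood values in the example are hypothetical.

```python
from math import erfc, sqrt


def chi2_sf_1df(x):
    """Survival function of the chi-square distribution with 1 degree of freedom."""
    return erfc(sqrt(x / 2.0)) if x > 0 else 1.0


def branch_site_lrt(lnl_null, lnl_alt):
    """Return the LRT statistic 2*(lnL_alt - lnL_null) and its p-value."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))  # guard against tiny negative values
    return stat, chi2_sf_1df(stat)


# Hypothetical log-likelihoods for one gene and one foreground branch.
stat, p = branch_site_lrt(lnl_null=-5234.8, lnl_alt=-5229.1)
print(stat, p)
```

With many genes and branches tested, the resulting p-values would normally be corrected for multiple testing before events are reported.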
We have used linear modeling to identify determinants affecting purifying selection (Table 3). The strongest observed correlation is that highly expressed genes tend to evolve under stronger purifying selection, which is also consistent with previous observations (Cooper et al., 2010). The expression levels in our dataset are higher for the first chromosome (Table 4), which is consistent with observations on other multi-chromosome bacterial species (Dryselius et al., 2008).
Longer genes tend to experience stronger purifying selection, which is consistent with the previously shown negative correlation between the dN/dS value and the median length of protein-coding genes in a variety of species (Novichkov et al., 2009). A similar result was obtained for eukaryotes (Kryuchkova-Mostacci and Robinson-Rechavi, 2015). However, this observation could also be explained by the greater power to detect strong negative selection in longer genes, similar to the increase in power when detecting positive selection in longer genes (Yang and Reis, 2011).
Conclusions
The Burkholderia pan-genome is open with the saturation to be reached between 86,000 and 88,000 genes. The core-genome of the strains considered here is about 1,050 genes and the predicted core-genome size is 460 genes. The tree based on the alignment of universal genes and the gene content tree show some differences, caused by excessive gene gains and losses at some branches, most notably, gene loss in B. mallei following a drastic change of the lifestyle. These losses have likely been caused by a high rate of intragenomic recombination, which has also resulted in the plasticity of the gene order in chromosomes in this branch. The rearrangement rates differ dramatically in the Burkholderia species, possibly reflecting the history of adaptation to different ecological niches. Young pathogens such as Y. pestis, Shigella spp., and B. mallei are known to have a particularly high rate and variety of mobile elements, which may be explained by fast evolution under changed selection pressure in new conditions, bottlenecks in the population history, and weaker selection against repetitive elements due to the decreased effective population size (Mira, Pushker, and Rodriguez-Valera, 2006). An accumulation of IS elements in B. mallei is most likely responsible for frequent genome rearrangements (Nierman et al., 2004). We showed a correlation between the inversion rate and the mutation rate in the core genes during its evolution.
The reconstructed rearrangements indicate strong avoidance of intra-replichore inversions, which is likely caused by selection against transfer of large groups of genes between the leading and the lagging strands. Inter-replichore inversions are strongly overrepresented. Moreover, the lengths of inter-replichore inversions have a wide distribution and they may be very long, whereas the observed intra-replichore inversions rarely exceed 15% of the replichore length. This result is consistent with the inversion pattern in other bacterial species (Darling, Miklós, and Ragan, 2008), which may be explained, in particular, by over-representation of highly expressed genes on the leading strand (Price and Arkin, 2005). In addition, translocated genes tend to retain their position in the leading or the lagging strand, and this selection is stronger for large syntenies.
Gene cassettes spreading horizontally have been found in the B. cepacia group on different chromosomes. The first one comprises iron uptake genes found in only two B. cepacia strains. These genes are known to form a pathogenicity island highly conserved in various Enterobacteriaceae (Lesic and Carniel, 2005). The second cassette contains genes from the fatty acid pathway.
We detected parallel inversions in the second chromosomes of seven B. pseudomallei. Breakpoints of these inversions are formed by genes encoding components of multidrug resistance complex. The membrane components of this system are exposed to the host’s immune system, and hence these inversions may be linked to a phase variation mechanism. Similar parallel inversions involving paralogous genes encoding membrane proteins PhtD have been observed in Streptococcus pneumoniae (Shelyakin et al., 2018).
We identified 197 genes evolving under positive selection. We also identified seventeen genes evolving under branch-specific positive selection. Most of the positive selection periods map to the branches that are ancestral to species clades. This might indicate a rapid adaptation to new ecological niches during species formation or simply result from the increased power of the methods used on long branches.
Availability of data and materials
The datasets supporting the conclusions of this article and used ad hoc scripts are available via the link https://github.com/OlgaBochkaryova/burkholderiagenomics.git.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
MSG conceived the study; OOB, EVM and MSG designed the study; EVM, OOB and IID developed the methods and analyzed the data; EVM, OOB and IID wrote the manuscript; MSG reviewed the paper. All authors read and approved the final version of the manuscript.
Funding
The study was supported by the Russian Science Foundation under grant 18-14-00358. Analysis of gene selection was supported by the Russian Foundation of Basic Research (grant 16-54-21004) and Swiss National Science Foundation (grant number IZLRZ3_163872) and performed in part at the Vital-IT center for high-performance computing of the Swiss Institute of Bioinformatics.
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Additional Files
Additional file Fig. S1 — Number of new genes added to pangenome with addition of each genome. (a) Burkholderia spp, (b) B. pseudomallei and (c) B. mallei.
Additional file Fig. S2 — Core-genome (a) and pangenome (b) size of B. pseudomallei strains.
Additional file Fig. S3 — Core-genome (a) and pangenome (b) size of B. mallei strains.
Additional file Fig. S4 — Comparison of the topologies of phylogenetic trees based on the protein sequence similarity of single-copy universal genes and gene content.
Additional file Fig. S5 — Gene flow during Burkholderia evolution. Red and blue numbers are the numbers of gained and lost genes on a given branch.
Additional file Fig. S6 — Whole-genome alignments of cepacia strains that were not included in the rearrangement analysis due to likely artifacts of the genome assembly. (a) Burkholderia sp. 383 and B. cepacia strain LO6 (b) Burkholderia sp. 383 and B. contaminans strain MS14, (c) Burkholderia sp. 383 and B. cenocepacia strain 895, (d) B. cepacia strain LO6 and B. cenocepacia strain 895.
Additional file Fig. S7 — Tanglegrams showing differences between tree topology based on the protein sequence similarity of single-copy universal genes and tree topology based on synteny blocks arrangement. (a) B. mallei clade; (b) B. thailandensis clade; (c) B. cepacia group.
Additional file Fig. S8 — Phylogenetic tree constructed for the genes from the gene cassette transferred horizontally.
Additional file Fig. S9 — Phylogenetic species tree showing detected events of positive selection.
Additional file Table S1 — List of analyzed Burkholderia strains.
Additional file Table S2 — Genes evolving under positive selection.
Additional file Table S3 — Linear models (a) of average ω (negative selection, estimated using M8); (b) expression level (Lazar Adler et al., 2016).
Additional file Table S4 — Chromosomal localization of universal orthologs.
Acknowledgements
Analysis of parallel inversions was performed by Alisa Rodionova at the Summer School of Molecular and Theoretical Biology (Barcelona, 2016), supported by the Zimin Foundation.