Abstract
Klebsiella pneumoniae (Kp) has emerged as an important cause of two distinct public health threats: multidrug resistant (MDR) healthcare-associated infections1 and community-acquired invasive infections, particularly pyogenic liver abscess2. The majority of MDR hospital outbreaks are caused by a subset of Kp clones with a high prevalence of acquired antimicrobial resistance (AMR) genes, while the majority of community-acquired invasive infections are caused by ‘hypervirulent’ clones that rarely harbour acquired AMR genes but have high prevalence of key virulence loci3–5. Worryingly, the last few years have seen increasing reports of convergence of MDR and the key virulence genes within individual Kp strains6, but it is not yet clear whether these represent a transient phenomenon or a significant ongoing threat. Here we perform comparative genomic analyses for 28 distinct Kp clones, including 6 hypervirulent and 8 MDR, to better understand their evolutionary histories and the risks of convergence. We show that MDR clones are highly diverse with frequent chromosomal recombination and gene content variability that far exceeds that of the hypervirulent clones. Consequently, we predict a much greater risk of virulence gene acquisition by MDR Kp clones than of resistance gene acquisition by hypervirulent clones.
Kp MDR evolution is largely driven through acquisition of AMR genes on diverse mobilisable plasmids7, which are particularly prevalent among the subset of clones that have become globally disseminated and frequently cause hospital outbreaks8; e.g. clonal group (CG) 258, which is implicated in global spread of the Kp carbapenemases9. Kp pathogenicity is driven by a wide array of interacting factors10–12 that are present in all strains, including the type III fimbriae (mrk) and the surface polysaccharides (capsule and lipopolysaccharide (LPS))12, 13 which exhibit antigenic variation between strains. The majority of hypervirulent Kp, distinguished clinically as causing invasive infections even outside the hospital setting14, are associated with just two4, 15 of the >130 predicted capsular serotypes16, K1 and K2, that are considered particularly antiphagocytic and serum resistant15, 17. Hypervirulent Kp are also associated with high prevalence of several other key virulence factors: the rmpA/rmpA2 genes that upregulate capsule expression to generate hypermucoidy18, 19; the colibactin genotoxin that induces eukaryotic cell death and promotes invasion to the blood from the intestines20, 21; and the yersiniabactin, aerobactin and salmochelin siderophores that promote survival in the blood by enhancing iron sequestration11,22–24.
Yersiniabactin synthesis is encoded by the ybt locus, which is usually mobilised by an integrative, conjugative element known as ICEKp. It is present in ∼40% of the general Kp population and seems to be frequently acquired and lost from MDR clones25. Fourteen distinct ybt+ICEKp variants are recognised, one of which also carries the colibactin synthesis locus (clb)25. In contrast, the salmochelin (iro), aerobactin (iuc) and rmpA/rmpA2 loci are usually co-located on a virulence plasmid26, 27. These loci are much less common in the Kp population (<10% prevalence each) and until recently were rarely reported among MDR strains33,5.
The reasons for the apparent separation of MDR and hypervirulence are unclear but there are growing reports of convergence from both directions, i.e. hypervirulent strains gaining MDR plasmids28–33 and MDR strains gaining a virulence plasmid plus/minus an ICEKp6,30,34. Most such reports are sporadic, but in 2017 Gu and colleagues described a fatal outbreak of MDR, carbapenem-resistant Kp belonging to CG258 that had acquired ICEKp in the chromosome plus iuc and rmpA2 on a virulence plasmid6. The report fuelled growing fears of an impending public health disaster in which highly virulent MDR strains may be able to spread in the community, causing dangerous infections that are extremely difficult to treat35. However, there remain significant knowledge gaps about Kp evolution that limit our ability to understand the severity of this public health threat, and to evaluate the relative risks of convergence events.
Here we report a comparative analysis of genome evolutionary dynamics between MDR and hypervirulent clones, leveraging a curated collection of 2265 Kp genomes to identify 28 common clones with at least 10 genomes in each (total 1092 genomes in 28 clones, 10–266 genomes per clone, see Supplementary Tables 1,2 and Figure 1a,b). First we explored the distribution of key virulence loci and resistance determinants for 10 distinct antimicrobial classes across the total population (Methods). The presence of the virulence plasmid (with or without ICEKp) was negatively associated with the presence of acquired AMR genes: the majority (n=77/88) of genomes harbouring the virulence plasmid contained 0–1 acquired AMR genes, while the distributions were much broader for genomes without the virulence plasmid plus/minus ICEKp (median 1, interquartile range (IQR) 0-9, p < 1×10−10 for both pairwise Wilcoxon Rank Sum tests). Interestingly, among the genomes without the virulence plasmid the distribution of acquired AMR genes was slightly shifted towards higher numbers in genomes harbouring ICEKp (median 1 vs 1, IQR 0-8 vs 0-10, p < 1×10−8; see Supplementary Figure 1). Next we calculated the proportion of genomes harbouring the virulence and AMR loci within each of the 28 common clones (Figure 1c). Hierarchical clustering of the virulence locus data defined a group of 6 hypervirulent clones including the previously described CG23, CG86 and CG65 (ref. 3), each of which harboured the virulence plasmid-associated genes iuc, iro and/or rmpA/rmpA2 at high frequency (31–100%). Hierarchical clustering of the AMR data defined a group of 8 MDR clones, including the outbreak-associated CG258, CG15 and CG147 (ref. 8), each with a high frequency (≥56%) of genomes encoding acquired resistance determinants for ≥3 drug classes (in addition to ampicillin to which all Kp share intrinsic resistance via the chromosomally encoded SHV-1 beta-lactamase).
As expected, AMR genes were rare among the hypervirulent clones, with the exception of CG25 in which 11 of 16 genomes harboured ≥4 acquired AMR genes. The iuc, iro and rmpA/rmpA2 loci were rare (<12% frequency) among the MDR clones and those not assigned to either group (‘unassigned’ clones); however, the ybt locus was frequently identified across the spectrum of clones as has been reported previously25.
We used Gubbins36 to identify putative chromosomal recombination imports within each clone and calculated r/m (the ratio of single nucleotide variants introduced by homologous recombination relative to those introduced by substitution mutations), which ranged from 0.02 to 25.50 (Figure 2a, Supplementary Table 1). With the exception of CG25, the hypervirulent clones generally exhibited lower r/m values (median 1.15), while the MDR clones trended towards higher values (median 5.47), although the differences were not statistically significant (Kruskal-Wallis test, p = 0.07). Recombination events were not evenly distributed across chromosomes: in 19/28 clones ≥50% of the chromosome was not subject to any recombination events, while the maximum recombination load in each clone ranged from a mean of 1.1 to 47.6 events (Figure 2b, Supplementary Figure 2).
In many cases there was a major peak defining a recombination hot-spot at the capsule (K) and adjacent LPS antigen (O) biosynthesis loci (see e.g. CG258 in Figure 2b, and Supplementary Figure 2). Among the 17 clones with ≥1 detectable recombination hotspot (arbitrarily defined as mean recombination count of ≥5 per base calculated over non-overlapping 1000 bp windows), the galF K locus gene was ranked among the top 2% recombination counts in 16 clones (Figure 2c and Supplementary Figure 2). Consistent with these findings, 20 clones were associated with ≥3 distinct K loci and 11 clones were also associated with ≥3 O loci (Figure 2d and Supplementary Figures 3, 4). Together these data indicate that the capsule and LPS are subject to strong diversifying selection. However, this was not the case for the hypervirulent clones, which were associated with low K and O locus diversity: five out of the six had just one K and one O locus type (either KL1 or KL2, plus O1/O2v1 or O1/O2v2) and showed no evidence of recombination events affecting galF (Figure 2c,d Supplementary Figures 3, 4).
The key selective drivers for K/O locus diversity are not known, but the mammalian immune system is unlikely to play a major role since Kp live ubiquitously in the environment and are opportunistic rather than obligate human pathogens37. Instead, phage and/or protist predation are likely candidates38. Numerous capsule-specific Kp phage have been reported39, 40 and ecological modelling supports a key role for phage-induced selective pressures in maintaining surface polysaccharide diversity in free-living bacteria41. The relative lack of diversity among the hypervirulent clones may suggest that they are not subject to the same selective pressures, perhaps indicating some sort of ecological segregation. This possibility is intriguing and could explain the separation of hypervirulence and MDR, by limiting opportunities for horizontal gene transfer between MDR and hypervirulent clones. Isolates representing both clone types have been identified among diverse host-associated niches31–42,43 but it is not possible to determine any particular ecological preference due to the lack of systematic sampling efforts to date. An alternative explanation is that the hypervirulent clones are subject to some sort of mechanistic limitation for chromosomal recombination that in turn limits surface polysaccharide diversity and the acquisition of other chromosomally encoded accessory genes, as have recently been shown to be frequently acquired by CG258 strains44. If so, we may also expect a general trend towards lower gene content diversity in the hypervirulent clones.
To assess overall gene content diversity we conducted a pan-genome analysis using Roary45. Jaccard gene content distances were generally lower for genome pairs within hypervirulent clones than the MDR or unassigned clones, suggesting the former have less diverse pan-genomes (p < 1×10−15 for each pairwise Wilcoxon Rank Sum test, Figure 3a). Supporting this trend, the hypervirulent clones were associated with comparatively shallow pan-genome accumulation curves (Figure 3b). In order to quantify the differences in these curves we fitted the pan-genome model proposed by Tettelin and colleagues46, and derived an alpha value for each clone (Supplementary Table 1), whereby values <1 indicate an open pan-genome and >1 indicate a closed pan-genome. Consistent with previous data showing extensive gene content diversity within the Kp species5, all but two clones had alpha values below 1. The exceptions were hypervirulent CG380 (alpha = 1.06) and MDR CG231 (alpha = 1.16). There was a general trend towards higher alpha (i.e. less open) among the hypervirulent clones (median alpha = 0.77, IQR 0.70–0.89), although the difference was only statistically significant in comparison to the unassigned clones (median alpha = 0.52, IQR 0.51–0.54, p = 0.005) and not the MDR clones (median alpha = 0.62, IQR 0.59–0.69, p = 0.23, Figure 3b).
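The alpha values discussed above can be illustrated with a minimal sketch, assuming the power-law (Heaps' law) form n(N) = κ·N^(−α) for the number of new genes observed when the N-th genome is added. The function names and the single-ordering log-log regression are illustrative assumptions only; the values reported in this study were obtained with the micropan R package, whose fitting procedure may differ in detail.

```python
import math


def new_genes_per_genome(genomes):
    """Count previously unseen genes for each genome after the first.

    genomes: list of sets of gene identifiers, in the order they are added.
    """
    seen = set(genomes[0])
    counts = []
    for genes in genomes[1:]:
        counts.append(len(genes - seen))
        seen |= genes
    return counts


def fit_alpha(new_counts):
    """Least-squares fit of log(n) = log(kappa) - alpha * log(N).

    new_counts[i] is the number of new genes when genome N = i + 2 is added.
    Zero counts are skipped because their logarithm is undefined.
    """
    xs, ys = [], []
    for n_genomes, n_new in enumerate(new_counts, start=2):
        if n_new > 0:
            xs.append(math.log(n_genomes))
            ys.append(math.log(n_new))
    if len(xs) < 2:
        raise ValueError("need at least two non-zero counts to fit a slope")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope  # alpha < 1 suggests an open pan-genome, > 1 a closed one
```

In practice the genome order affects the counts, so an estimate of this kind would normally be averaged over many random orderings.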
It is well known that large groups of accessory genes can be linked on the same mobile element (e.g. large conjugative MDR plasmids that are common in the MDR clones, or the virulence plasmids characteristic of the hypervirulent clones), so a single gain or loss event may have a large effect on gene-based measures such as pairwise Jaccard distances and accumulation curves. Hence we used a principal component analysis (PCA) to generate a metric that is less sensitive to the correlation structure in the gene content data (see Methods). The PCA transformed the accessory gene content matrix comprising 1092 genomes vs 39375 genes into coordinates in a 40-dimensional space. These 40 axes captured >60% of the variation in accessory gene content and were used to calculate the Euclidean distance of each genome to its clone centroid. The resulting distributions of distances provided further support that the MDR and unassigned clones display greater gene content variation than hypervirulent clones (Figure 3c; p < 1×10−15, 2 d.f., Kruskal-Wallis test; p < 1×10−15 for each pairwise Wilcoxon Rank Sum test) and suggest this is associated with a greater frequency of horizontal gene transfer events rather than a similar number of events introducing larger changes in gene content. In addition, the putative ancestry of accessory genes (see Methods) was more diverse among MDR and unassigned clones than the hypervirulent clones, supporting that the latter are subject to a more limited range of partners for horizontal gene transfer (Wilcoxon Rank Sum tests: hypervirulent vs MDR, p = 0.0027; hypervirulent vs unassigned, p = 1×10−4; MDR vs unassigned, p = 0.38; Figure 3d, Supplementary Table 1).
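The centroid-distance metric described above can be sketched as follows, assuming that principal component coordinates have already been computed for every genome (the PCA itself is not reproduced here). The function name and input layout are hypothetical and are not taken from the analysis code used in this study.

```python
import math
from collections import defaultdict


def centroid_distances(coords, clones):
    """Euclidean distance of each genome to the centroid of its own clone.

    coords: list of equal-length coordinate lists (one per genome).
    clones: list of clone labels, in the same order as coords.
    """
    by_clone = defaultdict(list)
    for vec, clone in zip(coords, clones):
        by_clone[clone].append(vec)
    # The centroid is the vector of mean coordinate values for the clone.
    centroids = {
        clone: [sum(axis) / len(vecs) for axis in zip(*vecs)]
        for clone, vecs in by_clone.items()
    }
    return [math.dist(vec, centroids[clone]) for vec, clone in zip(coords, clones)]
```

Because the distances are taken in PCA space rather than over raw gene counts, a single large mobile element moves a genome along few axes instead of inflating the distance by every gene it carries.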
To further explore differences in common sources of accessory gene diversity, we assessed phage and plasmid diversity. For each genome we summed the length of genomic regions identified as phage by VirSorter47 (range 0–221 kbp, Supplementary Table 2) and used a PCA of phage-associated gene content to calculate distance to clone centroids as for the total pan-genome (Figure 4a,b and Supplementary Figure 6). Hypervirulent clones showed similar phage load and diversity to the unassigned clones (Figure 4a,b; Wilcoxon Rank Sum tests, hypervirulent vs unassigned: load, p = 0.15; diversity, p = 0.15), whereas MDR clones were generally associated with higher load and diversity than both the hypervirulent and unassigned groups (p < 1×10−15 for each pairwise comparison). Although these analyses were dependent on the quality and breadth of the underlying viral sequence database which may be subject to species bias, we have no reason to expect that this would be skewed with respect to MDR over hypervirulent clones. Hence it is clear that Kp, and in particular the MDR clones, are subject to frequent attack by diverse phage.
Unfortunately it is not possible to reliably identify plasmid sequences from draft genome assemblies48. Instead we used plasmid replicon and relaxase (mob) typing as indicators of plasmid load and diversity. Each genome contained 0–12 of 69 uniquely distributed replicon markers and 0–23 mob-positive assembly contigs (detected by screening against the PlasmidFinder49 database and mob PSI-BLAST50, 51, respectively; Supplementary Table 2). MDR and unassigned genomes harboured a greater number of replicon markers than hypervirulent genomes, largely driven by low replicon loads in CG23 that was overrepresented among the hypervirulent genomes (Figure 4c,d, Supplementary Figure 7; Wilcoxon Rank Sum tests: hypervirulent vs unassigned, p = 1.5×10−4; MDR vs unassigned, p < 1×10−15; MDR vs hypervirulent, p < 1×10−15). There were no significant differences between the hypervirulent and unassigned groups for counts of mob-positive contigs per genome, and comparatively small differences between the MDR and unassigned or hypervirulent groups (Wilcoxon Rank Sum test: MDR vs unassigned, p = 1×10−6; MDR vs hypervirulent, p <1×10−6, Supplementary Figure 7).
Comparison of effective Shannon’s diversity of replicon profiles indicated that the hypervirulent clones harbour less plasmid diversity than either of the unassigned or MDR clones (not driven solely by CG23, see Figure 4d, Supplementary Figure 7, Wilcoxon Rank Sum tests: hypervirulent vs MDR, p = 0.0013; hypervirulent vs unassigned, p = 0.0015; MDR vs unassigned, p = 0.44). Similar trends were seen for effective Shannon’s diversity of mob types but the differences were not statistically significant after Bonferroni correction for multiple testing (n=3 tests), a finding that is not surprising given that far fewer mob types have been defined and that only ∼48% of completely sequenced Enterobacteriaceae plasmids deposited in GenBank could be mob typed (whereas ∼83% could be replicon typed)50.
While these data are subject to the biases of the underlying databases within which clinically relevant (MDR and virulence) plasmids are overrepresented, they are also consistent with the findings above regarding overall pan-genome diversity. These data imply that MDR clones frequently acquire and lose plasmids, consistent with the high plasmid diversity reported previously for ST25852, 53 and several others7, and with data from recent investigations of Kp circulating in hospitals which showed that individual plasmids transferred frequently between clones54, 55. In contrast, the hypervirulent clones were associated with comparatively low plasmid diversity, mirrored by generally narrow plasmid load distributions. Taken together these data imply that hypervirulent clones acquire novel plasmids infrequently but can stably maintain them. For example, we recently estimated that the virulence plasmid, which by definition is highly prevalent in these clones, has been maintained for >100 years in CG2331. In addition, laboratory passage experiments have shown that hypervirulent strains can maintain MDR plasmids introduced in vitro29, and we showed that a horse-associated subclade of CG23 has maintained a single MDR plasmid for at least 20 years31.
The combination of infrequent plasmid acquisitions and limited chromosomal recombination suggests that hypervirulent clones may be subject to particular constraints on DNA uptake and/or integration. One possible explanation is that these Kp clones possess enhanced defences against incoming DNA such as CRISPR/Cas or restriction-modification (R-M) systems. However, our genome data reveal no significant differences in either system (see Supplementary Text, Supplementary Figures 8–11). Alternatively, the key virulence determinants themselves, or other proteins encoded on the virulence plasmid, may play a role. Two variants of the virulence plasmid predominate among hypervirulent clones and share limited homology aside from the iuc, iro and rmpA loci27. It seems unlikely that a siderophore system would influence DNA uptake; however, it is conceivable that upregulation of capsule expression by rmpA18, 56 may play a role by exacerbating the inhibitory effect of the capsule.
Capsule expression has been associated with a comparative reduction in Kp transformation frequency in vitro57 and in a natural Streptococcus pneumoniae population58. Additionally, the capsule is known to conceal the LPS59 which, together with the OmpA porins, is considered a key target site for attachment of conjugative pili during the initial phases of mate-pair formation60, 61. Hence we speculate that overexpression of the capsule in hypervirulent clones may result in a reduction of DNA uptake. Given that capsule types differ substantially in their thickness and polysaccharide composition18, 62, it is also likely that their influence on DNA uptake is type dependent. The K2 capsule, which is associated with five of the six hypervirulent clones investigated here, is considered among the thicker capsule types18 and thus may have a comparatively greater influence. We used our genome data to test this hypothesis by comparing the genomic diversity of KL2 and non-KL2 genomes within MDR CG15, the only clone with sufficient KL2 and non-KL2 genomes for comparison. CG15-KL2 genomes formed a deep-branching monophyletic subclade consistent with long-term maintenance of KL2 for an estimated 34 years (see Supplementary Text and Supplementary Figures 12 and 13). This KL2 subclade showed a comparatively low rate of recombination (r/m 0.58 vs 6.75) and more limited gene content diversity than the rest of the clone (p = 0.0004, Supplementary Text and Supplementary Figure 12). However, gene-content diversity in the KL2 subclade of CG15 was higher than that of the hypervirulent clones (Supplementary Figure 12), perhaps due to the absence of the rmpA capsule upregulator.
Thus the genome data support our hypothesis and should motivate future laboratory studies of this phenomenon; a task that will not be trivial given the low efficiency of in vitro transformation for wild-type Kp strains63, the challenge of identifying suitable selective markers for distinguishing MDR strains, and the sensitivity of conjugation efficiencies to laboratory growth conditions60. If confirmed, this would imply that hypervirulent clones are evolutionarily constrained by a key determinant of the hypervirulent phenotype, and as such are self-limited in their ability to adapt to antimicrobial pressure.
Regardless of the mechanisms, our data clearly show that hypervirulent Kp clones are less diverse than their MDR counterparts, and suggest that the rate of virulence plasmid acquisition by MDR clones will far exceed the rate of MDR plasmid acquisition by hypervirulent clones. This is particularly worrying from a hospital infection control perspective since many of the MDR clones investigated here appear well adapted to transmission and colonisation in the human population, and are frequent causes of hospital outbreaks7, 8. Given the mounting evidence that MDR clones can carry multiple plasmids at limited fitness cost64–66 and frequently exchange plasmids with other bacteria54, 55, it seems these MDR clones may also be the perfect hosts for consolidation and onwards dissemination of MDR and virulence determinants. The greatest concern is that these determinants will be consolidated onto a single mobile genetic element; indeed mosaic Kp plasmids carrying AMR genes plus iuc and rmpA2 have already been reported in an MDR Kp clone30, and Escherichia coli plasmids bearing iuc, rmpA and AMR genes have been detected in Kp27. Whether these convergent strains and plasmids are fit and disseminating is not known. Recent experience with convergent carbapenem-resistant CG258 in China – which retrospective surveillance studies showed was already widely disseminated at the time of the outbreak report6, 67 – highlights the ease with which deadly strains can circulate unnoticed. As reports of convergent Kp strains continue to increase, the need for global genomic surveillance encompassing clone, AMR and virulence locus information68 is clearly greater than ever.
Methods
Genome collection and clone definition
We collected and curated 2265 Kp genomes, comprising 647 genomes sequenced and published previously by our group5, 15,16,69 plus 1623 publicly available genomes53, 70–76 as described previously16. Genomes were assigned to chromosomal multi-locus sequence types (MLST, as below), and a single representative of each sequence type (ST, n=509) was selected for initial phylogenetics to define clones for further analysis. Sequence reads were mapped to the NTUH-K2044 reference chromosome (accession: NC_012731) using Bowtie v277 and single nucleotide variants were identified with SAMtools v1.3.178 as implemented in the RedDog pipeline (https://github.com/katholt/RedDog). Where genomes were available only as de novo assemblies, sequence reads were simulated using SAMtools wgsim78 (n=852 genomes, for each of which 2 million x 100bp PE reads were simulated without errors). Allele calls were filtered to exclude sites that did not meet the following quality criteria: unambiguous consensus base calls, phred quality ≥30, depth ≥5 reads but <2-fold mean read depth, no evidence of strand bias. Subsequently, we generated a variable site alignment by concatenating nucleotides at core genome positions, i.e. at positions for which ≥95% genomes contained a base-call with phred quality ≥20. The resulting alignment of 192,433 variable sites was used to infer a maximum likelihood phylogeny with FastTree v2.1.979 (gamma distribution of rate heterogeneity among sites, Figure 1a). Genomes were clustered into 259 phylogenetic lineages (clones) using patristic distance (distance threshold = 0.04). This identified 29 clones (clonal groups, CGs) that were each represented by ≥10 isolates from at least three different countries. One of these (CG82) was subsequently excluded because it uniquely included only historical isolate genomes (dated 1932–1949 or unknown). The remaining 28 clones (totalling 1092 genomes) were subjected to comparative analysis in this study. 
We refer to each as CGX, where X is the predominant ST in the clone, as per the convention for Kp.
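The allele-call and core-site criteria listed above can be expressed as a short sketch. The record layout, field names and function names below are assumptions made for illustration; they are not the data structures used by the RedDog pipeline, and the strand-bias test is represented only by a pre-computed flag.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class AlleleCall:
    base: str          # consensus base call at the site
    phred: float       # phred-scaled quality of the call
    depth: int         # number of reads covering the site
    strand_bias: bool  # hypothetical flag: True if strand bias was detected


def passes_filters(call: AlleleCall, mean_depth: float) -> bool:
    """Apply the stated criteria: unambiguous base, phred >= 30,
    depth >= 5 but below twice the mean read depth, no strand bias."""
    return (
        call.base in {"A", "C", "G", "T"}
        and call.phred >= 30
        and 5 <= call.depth < 2 * mean_depth
        and not call.strand_bias
    )


def is_core_site(phreds: Sequence[Optional[float]],
                 min_phred: float = 20, min_fraction: float = 0.95) -> bool:
    """A site is core if >= 95% of genomes have a base call with phred >= 20.

    phreds holds one value per genome, with None where no call was made.
    """
    called = sum(1 for q in phreds if q is not None and q >= min_phred)
    return called >= min_fraction * len(phreds)
```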
For each clone of interest, reads were mapped to a completed chromosomal reference genome belonging to that clone (see Supplementary Table 1 and below), and variant calling and phylogenetic inference were performed as above. Phylogenies were manually inspected alongside genome source information to identify and de-duplicate clusters of closely related genomes from the same patient and/or known hospital outbreaks. Additional random sub-sampling was applied to CG258, which was otherwise drastically overrepresented in the collection (>700 genomes subsampled to 266 genomes). The final set of clones and genomes used for analyses are listed in Supplementary Tables 1 and 2, respectively.
Note that for the initial investigations of virulence and AMR determinant distributions in the broader Kp population (shown in Supplementary Figure 1) we considered an independent subset of the original curated genome collection (n=1124). This subset was described previously16 and was considered more representative of the population diversity because known outbreaks and overrepresented sequence types were subsampled.
Clonal reference genome selection
Reference genomes for each clone were identified among publicly available completed Kp chromosome sequences for each ST represented in the clones of interest. Where there was no suitable publicly available reference genome we selected a representative isolate from our collection, for which Illumina data were available, and generated additional long read sequence data for completion of the genome through hybrid genome assembly (details below). The exception was CG380 for which no suitable reference was publicly available and for which we did not have access to any isolates in our collection. As such, we generated a pseudo-chromosomal reference by scaffolding the de novo assembly contigs for genome SRR209867576 (the CG380 genome with the lowest number of contigs) to the most closely related completed genome in our initial phylogeny (NCTC9136, available via the NCTC3000 genomes project website: http://www.sanger.ac.uk/resources/downloads/bacteria/nctc/). Contigs were scaffolded using Abacas80 and manually inspected using ACT81 (contig coverage ≥20%).
Long read sequencing and hybrid genome assembly
Novel completed reference genomes were generated for 10 clones (CG25, CG29, CG36, CG253, CG43, CG45, CG152, CG230, CG231, CG661) for which Illumina data were available5, 82. Novel long read data were generated on the Pacific Biosciences platform for two isolates (CG231 strain MSB1_8A, and CG29 strain INF206), and on an Oxford Nanopore Technologies MinION device for the remaining isolates as described previously83. Long read sequence data were combined with the existing short read Illumina data to generate complete hybrid assemblies with Unicycler84. The final completed assemblies were deposited in GenBank (accessions listed in Supplementary Table 1) and are available in Figshare (see below).
MLST, virulence and resistance gene screening
Chromosomal MLST, AMR and virulence genes were detected with SRST285 (or Kleborate, available at https://github.com/katholt/Kleborate, for typing assemblies when no sequence reads were available). Sequence reads were assembled de novo using SPAdes v386, and Kaptive v0.5.116, 87 was used to determine K and O locus types from assemblies. K and O locus diversities were calculated using the R package Vegan v2.4.388. The indices were converted to effective values to enable direct comparison between clones using the formula described previously89; effective Shannon diversity = exp(Shannon diversity).
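As a worked illustration of the conversion above, the following minimal sketch computes Shannon diversity with natural logarithms and exponentiates it to give the effective value. The function names are hypothetical and the locus labels in the usage note are toy inputs; the study itself used the R package Vegan.

```python
import math
from collections import Counter


def shannon(labels):
    """Shannon diversity H = -sum(p_i * ln p_i) over the observed types."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def effective_shannon(labels):
    """Effective diversity = exp(H): the number of equally common types
    that would give the same Shannon diversity."""
    return math.exp(shannon(labels))
```

For example, a clone in which every genome carries the same K locus has an effective diversity of 1, whereas a clone split evenly between two K loci has an effective diversity of 2, which makes values directly comparable between clones.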
Recombination detection
Recombination analysis was performed independently for each clone: single nucleotide variants were identified by mapping and variant calling against the clonal reference genome as above, and a pseudo-chromosomal alignment was used as input for Gubbins v2.0.036, with the weighted Robinson-Foulds convergence method and RAxML90 phylogeny inference. The Gubbins output files were used to calculate r/m and mean recombination counts per base, calculated over non-overlapping 1000 bp windows (relative to each clone-specific reference chromosome).
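The two summary statistics described above can be sketched as follows, assuming that a per-base recombination count vector and the two single nucleotide variant totals have already been extracted from the Gubbins output (the parsing step is not shown, and the function names are hypothetical). The hotspot threshold of a mean count of ≥5 per base follows the arbitrary definition used in the main text.

```python
def window_means(per_base_counts, window=1000):
    """Mean recombination count per base over non-overlapping windows.

    A final partial window is averaged over its actual length.
    """
    means = []
    for start in range(0, len(per_base_counts), window):
        chunk = per_base_counts[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


def hotspot_windows(means, threshold=5.0):
    """Indices of windows whose mean count meets the hotspot threshold."""
    return [i for i, m in enumerate(means) if m >= threshold]


def r_over_m(snvs_recombination, snvs_mutation):
    """Ratio of variants introduced by recombination to those introduced
    by substitution mutation."""
    if snvs_mutation == 0:
        raise ValueError("r/m is undefined when no mutation SNVs are present")
    return snvs_recombination / snvs_mutation
```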
Pan-genome, plasmid and phage analyses
The SPAdes-derived genome assemblies were annotated with Prokka v1.1191 and subjected to a pan-genome analysis with Roary v3.6.045 (BLASTp identity ≥95%, no splitting of ‘paralogs’). The resulting gene content matrix, comprising 1092 genomes vs 39375 genes (after excluding 1070 core genes present in ≥95% genomes), was used to calculate pairwise Jaccard distances and as input for PCA with the Adegenet R package v2.0.192. Coordinates for the top 40 principal components (PC) were extracted, capturing 61.1% of the variation in the data. We calculated the Euclidean distance from each genome to its clone centroid (the vector of mean coordinate values for that clone), and compared the distributions of distances across clones. Pan-genome accumulation curves were visualised using the R package Vegan v2.4.388 and alpha values were calculated using R package micropan v1.1.293.
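The pairwise Jaccard distance used here can be illustrated with a minimal sketch in which each genome is represented as a set of accessory gene identifiers. The function names and the convention of returning 0 for two empty sets are assumptions made for illustration.

```python
from itertools import combinations


def jaccard_distance(genes_a, genes_b):
    """1 - |A intersect B| / |A union B| for two sets of gene identifiers."""
    union = genes_a | genes_b
    if not union:
        return 0.0  # assumption: two empty gene sets are treated as identical
    return 1 - len(genes_a & genes_b) / len(union)


def pairwise_distances(genomes):
    """All pairwise distances within one clone.

    genomes: dict mapping genome name to its set of gene identifiers.
    Returns a dict keyed by (name_1, name_2) tuples.
    """
    return {
        (a, b): jaccard_distance(genomes[a], genomes[b])
        for a, b in combinations(sorted(genomes), 2)
    }
```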
Accessory genes were identified as those present in <95% of all 1092 genomes belonging to the 28 clones analysed. For each genome assembly within a clone, the accessory gene sequences were extracted and concatenated into a single multi-fasta file (one per clone) that was used as input for ancestral assignment by Kraken v0.10.6, run with the miniKraken database94. For each clone the proportions of accessory genes assigned to distinct genera were used to calculate Shannon’s diversity indices using the R package Vegan v2.4.388.
Phage were identified from genome assemblies using VirSorter v1.0.347, with the highest confidence threshold. The resulting output includes a set of putative phage sequences in fasta format, in which we identified open reading frames (ORFs) using Prokka v1.1191. The resulting ORF sequences were clustered into non-redundant phage gene sequences using CD-HIT-EST v4.6.195 (identity ≥95%). BLASTn was used to tabulate the presence/absence of each of the resulting phage genes within the putative phage sequences identified by VirSorter in each genome (identity ≥95%, coverage ≥95%). The resulting gene content matrix was used as input for PCA and centroid distance calculations as described above for all accessory genes (25 PCs were used, capturing 60.0% of the variation in the data).
Plasmid replicons defined in the PlasmidFinder database49 were identified from read data using SRST285 and from assemblies using BLASTn (identity ≥80%, coverage ≥50% for both methods). Identically distributed replicons were collapsed into a single entry to minimise the influence of multi-replicon plasmids. Plasmid mob types were identified by PSI-BLAST as previously described50, 51. Effective Shannon’s diversities were calculated for each clone based on the replicon and mob presence/absence matrices, using the R package Vegan v2.4.388 as described above.
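The collapsing of identically distributed replicons can be sketched as a grouping of markers by their presence/absence pattern across genomes. The function name, the joined label format and the example patterns in the usage note are illustrative assumptions.

```python
from collections import defaultdict


def collapse_identical(presence):
    """Merge markers that share exactly the same presence/absence pattern.

    presence: dict mapping marker name to a sequence of 0/1 values, one per
    genome, with genomes in the same order for every marker.
    Returns a dict mapping a joined label to the shared pattern.
    """
    by_pattern = defaultdict(list)
    for name, pattern in presence.items():
        by_pattern[tuple(pattern)].append(name)
    return {"/".join(sorted(names)): pattern
            for pattern, names in by_pattern.items()}
```

For instance, two replicon markers that always co-occur (as expected when they sit on the same multi-replicon plasmid) are reduced to a single entry, so that plasmid is counted once rather than twice in load and diversity calculations.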
CRISPR/Cas and restriction-modification systems
CRISPR arrays were identified from genome assemblies using the CRISPR Recognition Tool v1.296 and genomes with >3 putative arrays were investigated manually to check for spurious identifications and/or identifications of single arrays split over multiple assembly contigs. Nucleotide sequences for the previously described Kp cas genes97 were extracted from the NTUH-K2044 reference chromosome (accession: NC_012731.1). Genomes were screened for novel cas genes by HMM domain search using the domain profiles developed by Burstein and colleagues98 (HMMER v3.1b2, bit score ≥200; ref. 99). A representative set of putative novel cas genes was extracted from the genome of isolate INF256 (read accession: ERR1008719, genome assembly available in Figshare) and tBLASTx was used to detect the presence of NTUH-K2044- and INF256-like cas genes among all genomes (identity ≥85%, coverage ≥25%). Note that the INF256 cas genes were subsequently found to be highly similar to those of strain Kp52.145 reported during the course of this study100.
Putative restriction enzymes (REases) were identified from genome assemblies by HMM domain search using the domain profiles developed previously [101, 102], with parameters as above. CD-HIT v4.6.1 [95] was used to cluster the predicted amino acid sequences of these REases such that distinct clusters represented enzymes that are thought to recognise distinct methylated nucleotide motifs [101]; i.e. using amino acid identity thresholds of 80% for type I and type III REases and 55% for type II REases. To date no suitable threshold has been determined for type IV REases, so in order to include type IV enzymes in our analyses we used the more conservative 55% identity threshold. Type IIC REase sequences cannot be aligned [101] and were therefore excluded from the analysis. Nucleotide sequences for a single representative of each REase cluster were used to search all assemblies by BLASTn (identity ≥80%, coverage ≥90%). Only the single best hit was recorded for each region of each genome. In order to assess the DNA recipient potential of each genome, we also used BLASTn to screen a broader sample of the Kp population, comprising the 1124 representative non-redundant genomes from our curated collection as described above and previously [16], plus a further 598 diverse Kp genomes published during the course of this project [103–106]. A putative donor-recipient pairing was considered compatible if the complete set of REases in the recipient genome was also present in the donor genome (we assume that genomes positive for an REase also carry the corresponding methyltransferase).
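The compatibility rule amounts to a subset test on REase cluster sets, which can be sketched as follows. The function names and the cluster labels in the usage example are hypothetical and are not taken from the study's code or data.

```python
def is_compatible(donor_reases, recipient_reases):
    """True if every REase cluster in the recipient is also in the donor."""
    return set(recipient_reases) <= set(donor_reases)

def compatible_pairs(rease_by_genome):
    """List all ordered (donor, recipient) pairs that pass the subset test.

    rease_by_genome: dict mapping genome name to its set of REase cluster IDs.
    A genome is never paired with itself.
    """
    names = sorted(rease_by_genome)
    return [(donor, recipient)
            for donor in names
            for recipient in names
            if donor != recipient
            and is_compatible(rease_by_genome[donor], rease_by_genome[recipient])]
```

For example, `compatible_pairs({"A": {"I_1", "II_3"}, "B": {"I_1"}, "C": set()})` returns `[("A", "B"), ("A", "C"), ("B", "C")]`. Under this rule a genome with no detected REases is a compatible recipient for every donor, and the relationship is directional rather than symmetric.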
Investigating the impact of the K2 capsule on CG15 recombination and pan-genome diversity
K loci were overlaid onto the CG15 recombination-free maximum likelihood phylogeny, revealing that the KL2 locus was restricted to one of two major subclades (shaded blue and grey in Supplementary Figure 12). Recombination dynamics and pan-genome diversity were investigated separately for each of these subclades, using the methods described above. We used BEAST2 [107] to estimate the time to most recent common ancestor (tMRCA) of the KL2 subclade, using as input the recombination-free single nucleotide variant alignment generated by Gubbins (2967 bp). The final analysis included 21 genomes (those for which years of collection were not known were excluded, see Supplementary Table 2) and was completed as described previously [31]. Temporal structure was confirmed by date-randomisation tests, which showed that the evolutionary rate derived from the true data did not overlap those derived from any of 20 independent randomisations (Supplementary Figure 13).
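The logic of a date-randomisation test can be sketched as below. This is an illustrative outline only: it assumes that each BEAST2 run is summarised by a (lower, upper) rate interval such as a 95% HPD, which is not specified here, and the function names are hypothetical. The BEAST2 runs themselves are outside the scope of the sketch.

```python
import random

def randomise_dates(tip_dates, n_replicates=20, seed=0):
    """Shuffle collection years among tips to create randomised data sets.

    tip_dates: dict mapping genome name to year of collection.
    Returns a list of dicts with the same tips and the same years, permuted.
    """
    rng = random.Random(seed)
    tips = sorted(tip_dates)
    years = [tip_dates[tip] for tip in tips]
    replicates = []
    for _ in range(n_replicates):
        shuffled = years[:]
        rng.shuffle(shuffled)
        replicates.append(dict(zip(tips, shuffled)))
    return replicates

def has_temporal_signal(true_interval, randomised_intervals):
    """True if the true rate interval overlaps none of the randomised ones."""
    lo, hi = true_interval
    return all(hi < r_lo or lo > r_hi for r_lo, r_hi in randomised_intervals)
```

Each randomised data set would be analysed under the same model as the true data, and the resulting rate intervals passed to `has_temporal_signal` together with the interval from the true data.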
Data availability
Individual accession numbers and genotyping data for all genomes included in comparative analyses are listed in Supplementary Table 2. Reference genome accessions are listed in Supplementary Table 1. The following are available in Figshare (https://doi.org/10.26188/5b8cb880dcffc): reference genomes; study genome assemblies and annotations; pseudogenome alignments (Gubbins inputs); Gubbins per-branch statistics and recombination prediction output files; Python script to calculate mean recombination events per base; pan-genome gene content matrix; Python code for Euclidean distance calculation; accessory gene ancestor matrix; representative phage gene sequences; phage gene content matrix; representative REase gene nucleotide sequences; and CG15-KL2 BEAST XML input files.
Author contributions
LMJ, CLG, MMCL and AJ provided isolates, performed DNA extractions and/or generated sequence data. KLW, RRW, RM, AT and SD analysed the data. KLW and KEH designed the study and wrote the manuscript. All authors contributed to data interpretation, read and commented on the manuscript.
Competing interests
The authors declare no conflicts of interest.
Acknowledgements
This work was supported by a Viertel Foundation of Australia Senior Medical Research Fellowship to KEH, the Bill and Melinda Gates Foundation, Seattle (OPP1175797), and the University of Melbourne.