ABSTRACT
Escherichia coli is a Gram-negative bacterial species with both great biological diversity and important clinical relevance. To study its population structure on both world-wide and genome-wide scales, we phylogenetically scrutinise 104 high-quality complete genomes from diverse human and animal hosts, of which 45 are new additions to the collection; most of them cluster into two major clades: Vig (vigorous) and Slu (sluggish). The two clades differ not only in physiological features but also in genome content and sequence variation. Limited recombination and horizontal gene transfer separate the two clades, as opposed to extensive intra-clade gene flow that functionally homogenizes even commensal and pathogenic strains. The two genetically isolated clades should be recognized as two subspecies, each independently representing a continuum of phenotypes ranging from commensal to pathogenic. Additionally, the frequent intra-clade recombination events, often involving large fragments of over 5 kb, indicate the possibility of a highly efficient gene transfer mechanism that depends on inheritance. The molecular mechanisms that constitute such a recombination barrier between the subspecies deserve further exploration and investigation among broader microbial taxa.
IMPORTANCE The concept of bacterial species has been debated for decades. The question has become more important today as human microbiomes and their health relevance are being studied extensively. Human microbiomes, in which thousands of bacterial species cohabit, need to be deciphered in minute detail, down to the species and subspecies level. In this study, we scrutinize the population genomics of E. coli and define two subspecies that are distinct from each other in physiology, ecology, and clinical features. In contrast to extensive genetic recombination within subspecies, limited genetic flux between subspecies leads to their phenotypic distinctions and separate evolutionary paths. We provide a key example illustrating that the divergence of a species into two subspecies depends on recombination efficiency; when recombination efficiency becomes a barrier, the species appears to split into two. The E. coli scenario and its molecular mechanisms deserve further exploration in a broader range of microbial taxa.
INTRODUCTION
Escherichia coli, best known as a ubiquitous member of the normal intestinal microflora of humans and other warm-blooded animals, is a Gram-negative bacterium of the family Enterobacteriaceae. E. coli normally persists as a harmless commensal in the mucous layer of the cecum and colon, whereas some variants have evolved to adopt a pathogenic lifestyle that causes different disease pathologies, including pandemic and lethal episodes (1). Depending on the site of infection, pathogenic E. coli strains are divided into intestinal pathogenic E. coli (IPEC) and extraintestinal pathogenic E. coli (ExPEC), which propagate successfully intra- and extra-intestinally, respectively. E. coli is a highly versatile species that survives in diverse ecological habitats, such as the sludgy environments of lake and river banks and tidal zones, as well as human and animal intestines (2). Its environmental adaptation and lifestyle alteration, together with its amenability to experimental manipulation, make the species an excellent model for studying commensalism-pathogenicity and genotype-phenotype relationships.
Phylogenetic analysis reveals that E. coli exhibits complex within-species sequence diversity that hinders strain classification, although various typing methods, including ribotyping, MLST (multi-locus sequence typing), phylotyping, and whole-genome phylogeny, have been applied. Albeit complicated, the population of E. coli is largely acknowledged as clonal, and four major phylotypes, A, B1, B2, and D, have been identified (3), which differ mainly in habitat and lifestyle (4, 5). Phylotypes are loosely associated with phenotypic characteristics, such as antibiotic resistance and growth rate (3), and also correlate with pathotype, as ExPEC strains normally belong to B2 and D (6), whereas IPEC strains belong to A, B1, and D (7). However, recent genome-wide sequencing studies have revealed that dispensable (variable) genes account for a large part of genome plasticity and contribute to the biological diversity of the species. Many virulence genes, including those for the most lethal Shiga toxins (Stx) and carbapenem hydrolases, are subject to frequent horizontal gene transfer (HGT), through which distinct pathogenic and resistance phenotypes are acquired (8). Extensive HGT therefore disrupts the connection between phenotypes and the mainstream phylogeny. Meanwhile, homologous recombination in the E. coli core genome has been found to be more frequent than previously realized, which also obscures phylogeny and leads to either divergent or convergent characters (9). These observations on gene flow challenge the clonality of the species population and raise concern about the emergence of disastrous “superbugs” if genes for both lethal toxins and high-level resistance are recombined into a new strain (10, 11).
To understand the population structure and genetic diversity within the species E. coli, we use 104 high-quality complete genome sequences from a diverse geographic and host range; among them, 45 with animal-host information are first released from our own sequencing efforts, and the rest, explicitly from human hosts, are from public databases. Our analysis supports the view that E. coli is predominantly clonal: except for a few strains in intermediate minor clades, most strains cluster into two major clades. Each clade has successfully adapted to its own ecological niche, and the two clades are distinct in physiological features, pathogenicity, genome content, and homologous sequences; we propose that they deserve a permanent distinction as two subspecies. Further HGT analysis reveals that recombination is very extensive within subspecies, which homogenizes strains into a continuum of genome possibilities, but rather limited across subspecies. The barrier to genetic exchange between the two subspecies maintains the clonal character of the species and drives the subspecies onto separate evolutionary paths.
RESULTS AND DISCUSSION
We contribute a unique fraction of complete genome sequences for a dataset used for phylogenetic analysis
Our dataset for in-depth analysis is composed of 104 complete genomes, which include 59 human-host complete genomes from public databases and 45 newly added complete genomes of animal-host isolates. The latter set is selected from a world-wide collection of 202 E. coli strains from animal hosts, which includes strains with broad diversity in geography, climate, and host range, as well as several ECOR (Escherichia coli collection of reference) strains (12). Our effort includes identifying the MLST (multi-locus sequence typing) sequences for constructing an MLST-based phylogeny. The MLST-based phylogeny reveals a complex partition among animal and human hosts (Figure 1A). In the end, 45 animal-host isolates are selected for genome sequence finishing to represent diverse origins and genetic heterogeneity in terms of host diet, geographic distribution (Figure 1B), and MLST clusters. Their full genome sequences reveal chromosomal organizational features similar to those of human-host isolates, such as a uniform G+C content of 50.58%. Their genome sizes range from 4.25 to 4.94 Mb, and they are predicted to encode 4,728 genes on average, slightly fewer than pathogenic strains but similar to commensals.
To construct the genome-based phylogeny, 7 draft genomes of strains isolated from the wild environment (13), which are phylogenetically distant from host-related strains, are included as outgroups. The core genome shared by all 111 genomes is composed of 1,095 genes, collectively 1.05 Mb in length. The maximum-likelihood phylogeny of the core genome indicates an ancestral position for the environmental strains (Figure 2), and the host-related strains, regardless of human or animal origin, are mingled together. The majority of the host-related strains cluster into two major clades, containing 56 and 31 strains, respectively. The remaining 17 strains, which hold positions closer to the ancestral environmental strains, are split into several minor clades. Overall, this clustering pattern is largely congruent with the phylotypes as previously reported (9, 14).
The two major clades are physiologically distinct and adapted to their respective ecological niches
The two major clades of host-related strains also differ in host, climate (5), and pathogenicity (Table S1). According to their distinct characters, such as movement and growth, we name them the Vig (Vigorous) and Slu (Sluggish) clades for convenience. The Vig clade, composed of strains of phylotypes A and B1, is characterized by strains from carnivore hosts and a tropical geographic distribution; all E. coli strains that have led to human pandemic infections belong to this clade. Although some Vig strains also survive in herbivores and cold areas, and many strains are commensals, it appears that the Vig strains prefer warmer temperatures and richer amino acid nutrients and are able to propagate rapidly under ideal conditions, while still adapting to a wide range of ecological niches. In contrast, the Slu strains, which fall in phylotype B2, are found only in herbivores and omnivores and in colder climates. Besides living as commensals, Slu strains occasionally cause extra-intestinal and antibiotic-resistant infections that are rarely seen among the Vig strains. These ecological features suggest that the Slu strains are better adapted to lower temperatures and poorer amino acid nutrition and therefore exhibit slower growth. This adaptation of the Slu clade (mainly B2 strains) to low temperature agrees with the report of a large population investigation (15) and, together with their tolerance of poor nutrition, facilitates transient periods of epizoism and migration to extra-intestinal habitats. Meanwhile, their slow growth increases survival under stresses, such as antibiotics, and offers better opportunities to gain antibiotic-resistance genes or functional mutations.
To confirm some of the distinct features between the Vig and Slu clades, such as optimal temperature and amino acid preference, we design a few straightforward physiological experiments with resuscitated strains in our collection (23 Vig and 15 Slu strains). First, we measure growth rates at various temperatures and find that the Vig strains grow significantly faster than the Slu strains at 37°C and 41°C, which resemble the intestinal environment of warm-blooded animals (Figure 3A). However, at the lower temperatures of 27°C and 32°C, the differences are not significant. These results help explain the higher prevalence of the Vig strains and the preference of the Slu strains for colder climates. Second, we compare survival rates after heat shock at 50°C; the faster-growing Vig strains are more vulnerable and show poorer survival than the Slu strains after 20 and 30 minutes of heat stress (Figure 3B), which is congruent with the growth-rate measurements. Third, we measure the mobility of these strains in response to a chemoattractant amino acid, phenylalanine, in a custom-designed microfluidic device. At each time point, more cells of the Vig strains reach the destination pool under the same chemoattraction, indicating a faster response to amino acids and higher mobility than the Slu strains (Figure 3C). This ability allows Vig strains to approach amino acid nutrients rapidly and is consistent with their habitation in carnivores. Finally, we calculate strain growth rates on various carbohydrate sources relative to glucose. All E. coli strains grow much more rapidly on monosaccharides and disaccharides than on polysaccharides. However, compared with the Slu strains, the Vig strains grow slightly faster on monosaccharides and disaccharides but a little slower on polysaccharides (Figure 3D).
Although the difference is not significant, possibly due to the small sample size, it seems that the Slu strains may be better adapted to herbivore hosts, where polysaccharides are more abundant. Taken together, the physiological experiments indicate that the two clades have diverged from each other in phenotypic features in adaptation to distinct ecological niches, which also leads to distinct clinical features.
Genomic distinctions between the two clades correspond to their phenotypes
The distinct physiological and ecological features of the Vig and Slu clades lead to the speculation that the two clades should also be distinct in genome content and homologous sequences, especially in genes related to metabolism, energy, and mobility. To test this, we formulate a parameter, the pairwise genome content distance, for all strains and find that the Vig and Slu strains are clearly separated in a neighbor-joining tree derived from the distance matrix (Figure 4A); the result indicates that the two clades differ in a significant number of genes. We subsequently apply the two-sample Kolmogorov-Smirnov test to all dispensable genes and identify 279 and 336 orthologues that are significantly enriched in the Vig and Slu clades, respectively (Table S2). These genes represent genome content that is characteristic of the two clades; when they are annotated for function according to the COG database, over half of them, as expected, fall in the metabolism categories. However, although it differs between the Vig and Slu clades, the distribution (presence/absence) of these characteristic genes does not vary much with host dietary preference (Figure 4B). For metabolizing carbohydrates and lipids, the Vig and Slu clades each contain a diverse set of characteristic genes. For genes related to mobility and to purine and amino acid metabolism, the Vig strains are apparently richer than the Slu strains, which accords with their adaptation to carnivore intestines (Figure 4B).
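The pairwise genome content distance can be illustrated with a minimal sketch. The exact formula of the distance is not given in the text, so the Jaccard distance on gene presence/absence is used here purely as an assumption; the function names and the toy strain and gene labels are hypothetical.

```python
from itertools import combinations


def content_distance(genes_a, genes_b):
    """Jaccard distance between two gene sets (assumed metric, for illustration)."""
    union = genes_a | genes_b
    if not union:
        return 0.0  # two empty genomes are treated as identical
    return 1.0 - len(genes_a & genes_b) / len(union)


def distance_matrix(gene_content):
    """Map every unordered strain pair (sorted by name) to its content distance."""
    return {
        (a, b): content_distance(gene_content[a], gene_content[b])
        for a, b in combinations(sorted(gene_content), 2)
    }


# Hypothetical toy data: each strain is the set of orthologue clusters it carries.
toy_content = {
    "vig_1": {"g1", "g2", "g3"},
    "vig_2": {"g1", "g2", "g4"},
    "slu_1": {"g1", "g5", "g6"},
}
matrix = distance_matrix(toy_content)
for pair, dist in matrix.items():
    print(pair, round(dist, 3))
```

A matrix of this kind is the input that a neighbor-joining implementation would take; any other presence/absence distance could be substituted without changing the rest of the workflow.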
Next, we search for all virulence genes among the dispensable genes. In both clades, unlike metabolic genes, the content of virulence genes varies greatly among strains. Pathogenic strains have many more virulence genes than commensals, including all kinds of toxin and iron-uptake genes. However, the two clades differ in their virulence gene content: the Vig pathogens are rich in T3SS and other secretion systems, whereas the Slu pathogens have more adhesion and invasion genes that facilitate extra-intestinal infections (Figure 4C). We also scrutinize beta-lactamase genes and the notoriously lethal Shiga-like toxins (Stx), which often lead to pandemic infections when transferred to E. coli strains (16). These genes are usually carried on plasmids but can be inserted into the chromosome by mobile elements under strong selection (17). Although both clades contain strains with narrow-spectrum beta-lactamases such as TEM-1 and OXA-1 (18), the extended-spectrum beta-lactamases (ESBLs) are rarely found among Vig strains, whereas Stx shows up exclusively in Vig strains (Figure 4D). These differences in genome content between the two clades help explain their phenotypic and pathogenic characteristics.
In addition to gene content, we further investigate sequence variation between homologs that are shared by the two clades. First, the Vig and Slu clades are phylogenetically distinct in their core genome, and from the orthologous pairs we identify 126 and 227 non-synonymous polymorphic sites, which reside in 97 and 168 genes and are specific to the Vig and Slu clades, respectively (Table S3). These polymorphic sites, found exclusively in one clade, are named lineage-associated variations, or LAVs. The ratios of non-synonymous to synonymous polymorphism (dN/dS) of these LAV-containing genes average 0.02 and are all below 0.4, suggesting that they are under strong purifying selection. We extract all metabolic genes in carbohydrate and amino acid pathways from the core genes and concatenate them to construct a maximal-linkage tree. Clearly, the genes of Vig and Slu strains cluster separately (Figure 5A), which likely contributes to their metabolic characteristics. We also investigate homologous sequences of the highly prevalent dispensable genes (orthologues present in over 80% of the strains of the two clades) and calculate pairwise gene distances between orthologues based on amino acid sequences. The result shows much larger distances between inter-clade pairs than between intra-clade pairs (Figure 5B), indicating that each clade uses its own preferred alleles for highly shared dispensable genes. Finally, we check the difference in GC content between the two clades and unexpectedly find that the GC% of the Vig strains is slightly, but significantly, higher than that of the Slu strains in the core genes, the dispensable genes, and the whole genome (Figure 5C). We speculate that the underlying reason is adaptation of the Vig clade to its higher-temperature habitat. In any case, the difference in GC content has a global effect on the nucleic acid composition of all genes and can be an independent factor driving genome evolution (19).
Taken together, the above differences in genome content and sequence diversity between the two clades may explain in part their phenotypic and physiological differences and the molecular mechanisms of adaptation to their distinct ecological niches.
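As an illustration of the GC% comparison described above, the following sketch computes GC content for a nucleotide sequence. The text does not say how ambiguous bases were treated; excluding them from the denominator is an assumption, and the function name and example sequences are hypothetical.

```python
def gc_percent(sequence):
    """Percentage of G and C among unambiguous A/C/G/T bases.

    Ambiguous symbols such as N are ignored (an assumption, not a
    detail stated in the text).
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100.0 * gc / total


# Hypothetical toy sequences standing in for concatenated core genes.
toy_core = {
    "vig_1": "ATGCGCGCTA",
    "slu_1": "ATGCATATTA",
}
gc_by_strain = {strain: gc_percent(seq) for strain, seq in toy_core.items()}
print(gc_by_strain)
```

Applying the same function separately to core genes, dispensable genes, and whole genomes would give the three per-strain values that are compared between the clades.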
Barrier to inter-clade recombination separates the clades genetically into subspecies
E. coli is generally clonal (20, 21), and the Vig and Slu strains fit such a framework, diverging from each other through the accumulation of independent mutations and limited mutual exchange of genetic material. However, in previous studies, the E. coli recombination rate has been estimated to be comparable to the mutation rate, with a ratio of r/m ≈ 0.9 (9, 22, 23). Theoretically, such a recombination rate is able to confound the clonal framework and intermingle strains into an unstructured population. The question is how the species keeps its clonal structure while sustaining a relatively high recombination rate.
To scrutinize the clonal status, we first evaluate the general recombination rate based on the concept of homoplasy, that is, polymorphisms shared by two or more strains but not present in their common ancestor (22, 24). Practically, homoplasic polymorphisms can only be inferred for the core genome, where recombination is mainly introduced through homologous sequences. Our results indicate that the r/m of the E. coli core genome is about 0.5, much lower than previously estimated (22) (Table 1). The reason is that dispensable genes are more subject to HGT and lead to an over-estimated r/m ratio when they are counted in a core genome. The large sample size of our current study results in a minimal definition of the core genome that excludes almost all dispensable genes and thus lowers the r/m ratio. However, the r/m ratios of the Vig and Slu clades are 0.949 and 0.745, respectively, much higher than that of the entire species (Table 1); the deviation indicates that there are more within-clade than between-clade recombination events, as proposed in a previous study (9). The high rate of within-clade homologous recombination is further confirmed by an analysis with a Bayesian method, BRATNextGen, which has been used for analysing homologous recombination events between and within clades on the basis of a specified degree of sequence divergence (25, 26). The result shows that strains share more recombinant fragments within clades than across clades (Figure 6A). Both analyses demonstrate that E. coli strains rarely exchange genetic material with distant relatives, whereas closely related strains recombine more frequently, thus reconciling the apparent conflict between the clonal structure and the high overall recombination rate of the species.
The above estimates of recombination rate apply to the core genome, which accounts for less than one-fourth of the average E. coli genome. However, the recombination rate varies among chromosomal regions, and genes differ in their likelihood of being transferred: the dispensable genome contains more mobile elements and is more prone to horizontal transfer (27). To obtain a complete understanding of genetic exchange in terms of the whole genome and all strains of the species, we scan for recombination events across the entire genome length; recombination plays critical roles in transferring and shuffling dispensable genes, which shape genome content in significant ways (28). In general, recombination events between closely related strains are not easy to identify because of the weak signals of recombined sequences (29, 30). Based on sequence alignment, we first identify all near-identical fragments for each genome pair as candidate recombinant fragments (31). Since recombination often inserts genes into different locations, we use the complete genome sequences to compare the synteny (linear order) of the candidate recombinant fragments and define non-syntenic fragments as the results of true recombination events. Our result shows that intra-clade genome pairs exchange more genome content (nearly 10x) than inter-clade pairs; for intra-clade pairs, the total length of recombinant fragments approaches one half of the genome in some cases and is never less than 300 kb, which is even larger than the upper limit for inter-clade pairs (Figure 6B). The extensive intra-clade genetic exchange in the dispensable genome strengthens the clonal structure that is defined by the core genome. Furthermore, we find that the intensity of genetic exchange between commensal and pathogenic strains, which can include virulence genes, depends on their clade status: it is higher in within-clade events (Figure 6B).
We further correlate recombination frequency (here, the number of recombinant fragments per genome pair) with the phylogenetic distance between paired genomes. When plotted against core genome distance (from which the core genome phylogeny has been inferred) for both Vig and Slu strains, the regression curves of recombination frequency show a reverse-S shape; that is, there is a rapid decline over the transition between intra- and inter-clade genome pairs (Figure 6C). Strains of the same clade recombine more frequently, often with nearly one hundred recombinant fragments per genome, and show a decrease in the number of recombinant fragments when very closely related because larger fragments overlap. In fact, very close sister strains can exchange almost half of their genome. The transfer of chromosomal fragments over 100 kb or even 1 Mb has been experimentally validated in other species (32, 33). However, such efficient genetic exchange, with large fragments and high frequency, is not seen between clades, where fragments rarely exceed 5 kb (Figure 6D). The result implies that closely related strains (within clade) may have a unique, highly efficient molecular mechanism for recombination, whereas genetic exchange between remote relatives (across clades) is confined to less efficient routes.
Population structure of host-related strains and relationship between phylotype and phenotype
E. coli is a well-studied Gram-negative bacterial species because of its importance in both clinical practice and biological research. However, biologically meaningful classification of strains or populations based on genotypes, phenotypes, and other features is still a difficult task. Our study starts from a phylogeny based on high-quality whole genomes and, through detailed genomic analysis and experimentation, ends with the definition of two major clades, Vig and Slu. The two clades are distinct in many aspects, and their clade-centric characteristics explain many ecological and clinical features. E. coli infections fall into two categories with different clinical outcomes and treatment strategies: 1) acute intestinal infections of various severities and 2) opportunistic infections, often at extraintestinal loci and often resistant to common antibiotics. The two types of infection appear to correlate with the different clinical features of the two clades and their traditional phylotypes (34, 35). Genomic analysis reveals that the genetic structure underlying the physiological, ecological, and clinical traits of Vig and Slu strains is a set of characteristic genes and sequence variations, especially in genes involved in metabolic pathways, mobility, and toxic or resistant phenotypes. The separation of the two clades is caused by limited between-subspecies genetic exchange or recombination, similar to scenarios found among other species of bacteria (36-38), archaea (39), and eukaryotes (40). Among the characteristic genes, those for the extremely virulent Shiga-like toxin and the most notorious antibiotic-resistance genes, the ESBLs, are partitioned into the Vig and Slu clades with very little overlap. Although these genes are often carried by plasmids and are ready to transfer among strains, strains carrying both have been reported to be rather rare (41), which also supports the between-clade recombination barrier.
Therefore, Vig and Slu should be regarded as two subspecies, since the recombination barrier has genetically separated them and made them clearly divergent in all aspects of physiology, ecology, and clinical significance. In clinical practice, identifying the subspecies of a pathogenic strain will provide informative guidance for treating the infection it causes. On the other hand, the genetic boundary between commensal and pathogenic strains is not very clear. Although pathogenic strains bear many more toxin genes, a commensal strain can exchange genetic material with, and acquire enough virulence genes from, its close pathogenic sisters and then become pathogenic under appropriate host conditions. Intensive recombination readily alters virulence gene content and thus blurs the line between commensal and pathogen, making them genetically indistinguishable in clinical practice (6).
Since species are “lineages evolving separately from other lineages” (42), the genetic diversification and geographic separation of the E. coli subspecies, represented by the Vig and Slu clades, demonstrate an early process of microbial speciation, whose mechanisms and processes have been debated for decades (43, 44). Recently, technological innovation, especially the invention of next-generation sequencing (NGS) technology, coupled with the emerging discipline of population genomics, has been providing unprecedented tools and opportunities for interrogating the molecular details of many ongoing evolutionary processes among natural microbial populations, especially those with healthcare applications. The emergence of the two E. coli subspecies appears to have been initiated from distinct genetic units with functional relevance, which are incubated and frequently traded among closely related, structurally comparable, and geographically cohabiting strains through mutually beneficial mechanisms. Our recombination analysis reveals that the within-subspecies recombination rate is much higher than the between-subspecies rate, and such a diversifying process eventually drives subspecies or their populations to keep evolving into nascent species (45). In our data, recombination has overall effects on both the core and dispensable portions of the genome and results in hundreds of characteristic genes as well as lineage-associated variations, or LAVs; these genetic and functional elements form a complex background for the species and its populations to evolve under natural selection. Certainly, physical barriers that interrupt recombination accelerate the process of speciation (43). In the case of the two E. coli subspecies, both are widely spread and co-inhabiting, for example as commensals in the intestines of both humans and other omnivores, and our study and observations do not support the hypothesis of geographic isolation.
Therefore, other types of physical barriers, such as CRISPR-Cas systems (46), restriction-modification systems (47), DNA uptake signal sequences (48), and incompatible transfer mechanisms due to pili (49), which have been reported in some species, may play roles in E. coli sub-speciation or speciation but remain to be elucidated. Our results highlight the rate difference between intra-subspecies and inter-subspecies recombination as a barrier, possibly due to lower functional benefit. Some high-efficiency mechanisms, such as distributive conjugal transfer, have been reported in a species of Mycobacteria; this mechanism is able to transfer fragments of over 100 kb at one time but loses function when such genetic exchange happens between remote relatives (50, 51). On the other hand, low-efficiency mechanisms, such as phages (52) or transposons (53), usually transfer short fragments at lower frequency (54) but may still work across subspecies because of their broader host range. The mechanisms underlying the cross-subspecies recombination barrier deserve further exploration and should be investigated in a broad range of microbial taxa for a better understanding of microbial population structure dynamics and speciation.
CONCLUSION
Based on an unprecedented dataset, we thoroughly studied the population structure and dynamics of E. coli, including genetic diversity, habitat divergence, physiological features, recombination rate, and gene flow. We defined two E. coli subpopulations among a large number of isolates and suggest that they are distinct subspecies that have evolved characteristic gene content and sequence variation, which lead to distinct physiological, ecological, and clinical characteristics. There is an apparent barrier to recombination between the two subspecies, which drives their genetic diversification. Although the underlying mechanism still needs further demonstration, novel molecular mechanisms differentiating intra- and inter-subspecies genetic exchange may exist. The discovery of such mechanisms and their confirmation in a broad range of microbial taxa will deepen our understanding of the process of bacterial speciation.
MATERIALS AND METHODS
Strains and MLST typing
A world-wide collection of 202 Escherichia coli strains from vertebrate hosts (12) was kindly provided by Professor Shulin Liu (Genomics Research Center, Harbin Medical University, Harbin, China; Department of Microbiology and Infectious Diseases, University of Calgary, Calgary, Canada). Genomic DNA was extracted using a Qiagen DNeasy kit (Qiagen) from 172 successfully resuscitated isolates. PCR and Sanger sequencing were performed for the 16S rDNA and MLST genes (seven housekeeping genes: adk, fumC, gyrB, icd, mdh, purA, and recA; see details at http://mlst.ucc.ie/) of these strains. Five strains whose 16S rDNA sequence showed <97% identity with the E. coli reference or had a best hit to another species, and 18 strains whose MLST alleles could not be identified by this method, were then removed, leaving 149 animal-host strains for further analysis. We further included 59 complete genomes from human hosts deposited in NCBI (ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/) as of July 2013 and seven draft genomes published in a study of environmental strains (13) from ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria_DRAFT/, and identified their MLST loci in silico with BLAST. Together, we aligned the seven MLST fragments of all 213 strains with ClustalW and concatenated them to infer a maximum-likelihood phylogeny using RAxML with the GTR model and 1,000 bootstraps.
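The concatenation of per-locus alignments described above can be sketched as follows. This is only an illustration under assumptions: it takes alignments that are already aligned (for example, ClustalW output parsed into dictionaries), and the function name, data layout, and toy sequences are hypothetical rather than the scripts actually used.

```python
MLST_LOCI = ["adk", "fumC", "gyrB", "icd", "mdh", "purA", "recA"]


def concatenate_alignments(alignments, loci=MLST_LOCI):
    """Join per-locus alignments into one sequence per strain.

    alignments: {locus: {strain: aligned_sequence}}; every locus must
    cover the same strains, and sequences within a locus must have
    equal length (i.e. they are already aligned).
    """
    strains = set(alignments[loci[0]])
    for locus in loci:
        seqs = alignments[locus]
        if set(seqs) != strains:
            raise ValueError(f"strain set differs at locus {locus}")
        if len({len(s) for s in seqs.values()}) != 1:
            raise ValueError(f"sequences are not aligned at locus {locus}")
    return {
        strain: "".join(alignments[locus][strain] for locus in loci)
        for strain in sorted(strains)
    }


# Hypothetical toy alignment with two loci only.
toy = {
    "adk": {"s1": "ATG-C", "s2": "ATGGC"},
    "recA": {"s1": "TTA", "s2": "TTG"},
}
supermatrix = concatenate_alignments(toy, loci=["adk", "recA"])
print(supermatrix)
```

Keeping the locus order fixed for all strains ensures that each column of the concatenated alignment still compares homologous positions.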
Genome sequencing, annotation and phylogeny inference
Genomic DNA libraries of 45 selected representative strains were prepared using a NEBNext DNA Library Prep Kit and sequenced on a HiSeq 2000 with a 2 × 100 bp run at the Beijing Institute of Genomics (Beijing, P.R. China). The raw reads, with an average coverage of 150×, were quality filtered and trimmed using SolexaQA (-p 0.01) and assembled with SOAPdenovo (S) (55). The assemblies were then scaffolded into circular genomes with GAAP (http://gaap.big.ac.cn), which is based on the synteny of core genes on a genomic scale and assists the assembly of high-quality genomes (27). For these genomes, protein-coding genes were predicted using GeneMark.hmm with a pre-trained model and annotated using BLAST against the COG database. Gene sequences were mapped to metabolic pathways using BLAST against KEGG GENES for KEGG Orthology assignment with the KEGG automatic annotation server. Orthologous genes were identified with the pan-genome analysis pipeline (PGAP) (56), with identity and coverage of pairwise genes set at 0.7. Orthologues common to all 111 strains (including 45 animal-host, 59 human-host, and seven environmental strains) were identified as core genes, and the others (orthologues shared by only a portion of strains) as dispensable genes. The 1,095 single-copy core genes were concatenated into a core genome and used to construct the phylogenetic tree. We aligned the protein sequences of each core gene using ClustalW and traced them back to nucleic acid sequences. The maximum-likelihood phylogeny was inferred from SNPs in the core genome using RAxML under the GTR+G model with 1,000 bootstraps. Phylogenetic trees were viewed and modified using FigTree and EvolView (57). Phylotypes of all strains were identified in silico according to the presence of three phylotype-specific genes or fragments, as previously described (58).
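The partition of orthologue clusters into core and dispensable genes follows directly from the definition above and can be sketched as below. The input layout (copy number per strain for each cluster) and all names are hypothetical; this is not the PGAP output format or the pipeline's own code.

```python
def classify_orthologues(clusters, strains):
    """Split orthologue clusters into core, single-copy core, and dispensable.

    clusters: {cluster_id: {strain: copy_number}}
    strains:  iterable of all strain names in the dataset
    A cluster is core when every strain carries at least one copy, and
    single-copy core when every strain carries exactly one copy.
    """
    all_strains = set(strains)
    core, single_copy_core, dispensable = [], [], []
    for cluster_id, copies in clusters.items():
        present = {s for s, n in copies.items() if n > 0}
        if present >= all_strains:
            core.append(cluster_id)
            if all(copies.get(s, 0) == 1 for s in all_strains):
                single_copy_core.append(cluster_id)
        else:
            dispensable.append(cluster_id)
    return core, single_copy_core, dispensable


# Hypothetical toy clusters over three strains.
toy_clusters = {
    "c1": {"s1": 1, "s2": 1, "s3": 1},  # single-copy core
    "c2": {"s1": 2, "s2": 1, "s3": 1},  # core, but duplicated in s1
    "c3": {"s1": 1, "s2": 1},           # absent from s3: dispensable
}
core, single_copy_core, dispensable = classify_orthologues(
    toy_clusters, ["s1", "s2", "s3"]
)
print(core, single_copy_core, dispensable)
```

Because a cluster must be present in every strain to count as core, adding more strains can only shrink the core set, which is consistent with the minimal core genome obtained from the 111 genomes.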
Measurement of growth rate, survival rate, and mobility
Strains were resuscitated and cultured in pre-filtered LB broth. Unless otherwise specified, each strain was normalized by cultivation at 17°C in LB overnight with shaking at 200 rpm, and OD600 was measured for each culture in a 96-well plate (triplicates for each strain with a blank control) on an Infinite® 200 PRO Microplate Reader (Tecan) every half hour.
Growth rate at diverse temperatures and carbon sources
Overnight cultures were diluted 1:50 with LB broth and incubated at 27, 32, 37, and 41°C, or transferred to medium supplemented with glucose, starch, glycogen, heparin, cellulose, fructose, lactose, or maltose as the sole carbon source. OD600 was monitored until the cultures reached 0.6.
Survival rate under heat stress
Overnight cultures of each test strain were adjusted to OD600 = 0.1. Immediately afterwards, 50 μl of inoculum was diluted 1:4 with LB broth and incubated at 50°C for 0, 10, 20, or 30 min. OD600 was then monitored until the cultures reached 0.1. The number of living cells in each initial aliquot relative to a negative control was calculated as e^(−Δt), where Δt is the time to OD600 = 0.1. The survival rate was defined as the ratio of the live cell number after heat shock to that of the negative control.
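The survival calculation can be sketched in Python under explicit assumptions that the text does not state: cultures grow exponentially at a common rate after heat shock, Δt is read as the extra time a heat-shocked culture needs to reach OD600 = 0.1 compared with the negative control, and time is expressed in units of the inverse growth rate (so the default `growth_rate` of 1.0 reduces the expression to e^(−Δt)). The function name and parameters are hypothetical.

```python
import math


def relative_live_cells(
    t_treated: float, t_control: float, growth_rate: float = 1.0
) -> float:
    """Ratio of live cells in a heat-shocked aliquot to the negative control.

    Assumes exponential growth N(t) = N0 * exp(growth_rate * t), so that a
    culture starting with fewer live cells reaches the OD600 threshold later.
    The delay (t_treated - t_control) plays the role of the time difference
    in the exponential formula quoted in the text.
    """
    delay = t_treated - t_control
    return math.exp(-growth_rate * delay)


# Hypothetical example: the treated culture reaches the threshold one time
# unit later than the control, giving a survival rate of exp(-1).
example_survival = relative_live_cells(t_treated=4.0, t_control=3.0)
```

A culture that reaches the threshold at the same time as the control returns a ratio of 1, that is, no detectable loss of viability.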
Mobility and response to chemoattractive amino acids
Each strain was evaluated on a custom-designed PDMS microfluidic device as previously described (59). Briefly, the device had two pools, a sink pool and a source pool, connected by a 600 μm channel. At the end of the channel adjacent to the source pool, a thin layer of precast gel prevented cells from entering the source pool; a 400 μm² observation channel in front of the gel was used for cell counting. Normalized cells from single clones were resuspended in an amino acid-free basal medium at OD600 = 0.1. Immediately, 30 μl of cells and 30 μl of chemotaxis solution (100 μM phenylalanine in basal medium) were injected into the sink and source pools, respectively. Cells that reached the observation channel were observed and auto-tracked using a Nikon Ti-E inverted microscope system every 5 min for 1 h. The cell count in the observation channel in each image, which serves as an indicator of cell mobility in response to the chemoattractive amino acid, was measured automatically using ImageJ software.
Genome content similarity, gene distribution, and homologue distance between Vig and Slu
The similarity between paired genomes was calculated on the basis of the Bray–Curtis dissimilarity index. The dissimilarity index d is calculated as 1 − [2 * Sij/(Si + Sj)], where Sij is the number of dispensable genes shared by strains i and j, and Si and Sj are the numbers of dispensable genes in strains i and j, respectively. Pairwise dissimilarity indices were assembled into a distance matrix, which was used to construct a neighbour-joining tree, and genome similarity (1 − d) was used to produce a heatmap. We then applied a two-sample Kolmogorov–Smirnov test to each dispensable orthologue to test whether its presence was evenly distributed between Vig and Slu; when p < 0.01, the orthologue was considered significantly enriched in Vig or Slu. Antibiotic-resistance genes were identified using BLAST against the CBMAR database (http://14.139.227.92/mkumar/lactamasedb) with an E-value cutoff of 10^−5, retaining the best hit. Similarly, Shiga-like toxin genes were identified by BLAST against the corresponding protein sequences from E. coli O157:H7 (lcl|AB015056.1_prot_BAA88123.1_1 for the A subunit and lcl|AB015056.1_prot_BAA88124.1_2 for the B subunit) downloaded from the NCBI database. To calculate pairwise protein sequence distances between homologues, the sequences of each high-prevalence dispensable orthologue (present in >80% of strains) were first aligned with ClustalW, and pairwise distances between the genes of that orthologue from the various strains of one species were then calculated using Protdist in the PHYLIP package.
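The dissimilarity index defined above is straightforward to compute from presence/absence data. The following Python sketch is illustrative only: the function names, the representation of each strain as a set of dispensable-gene identifiers, and the toy data are assumptions, and the handling of two empty gene sets is a convention chosen here rather than one stated in the text.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def bray_curtis(genes_i: Set[str], genes_j: Set[str]) -> float:
    """Dissimilarity d = 1 - 2*Sij / (Si + Sj) on dispensable-gene presence."""
    s_ij = len(genes_i & genes_j)           # genes shared by strains i and j
    total = len(genes_i) + len(genes_j)     # Si + Sj
    if total == 0:
        return 0.0  # assumption: two empty gene sets are treated as identical
    return 1.0 - 2.0 * s_ij / total


def pairwise_dissimilarity(
    gene_sets: Dict[str, Set[str]]
) -> Dict[Tuple[str, str], float]:
    """Dissimilarity for every unordered strain pair (names sorted)."""
    return {
        (a, b): bray_curtis(gene_sets[a], gene_sets[b])
        for a, b in combinations(sorted(gene_sets), 2)
    }


# Hypothetical toy data: dispensable genes carried by three strains.
toy_gene_sets = {
    "strain1": {"g1", "g2", "g3"},
    "strain2": {"g2", "g3", "g4"},
    "strain3": {"g5"},
}
distances = pairwise_dissimilarity(toy_gene_sets)
# strain1 vs strain2: Sij = 2, Si + Sj = 6, so d = 1 - 4/6 = 1/3
```

The resulting values can be arranged into a distance matrix for neighbour-joining, and 1 − d gives the similarity plotted in a heatmap.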
Recombination inference
We calculated r/m statistics for the core genome of E. coli using PHI, a computational method that recognizes homoplasic sites of amino acid sequences in 1-kb overlapping windows. We used the default parameters throughout and counted homoplasic sites in windows with p < 0.01 as polymorphic sites caused by recombination (r) and the remaining sites as those caused by mutation (m). We also inferred recombinant fragments in the core genome with BratNextGen, a Bayesian method (25). The strain clusters were defined a priori on the basis of a PSA tree. Following the default procedure, alpha was set at 3.58. One hundred iterations were performed, until the model parameters converged, and the significance of the inferred recombinant fragments (p < 0.05) was assessed using 100 replicate permutations of sites in the genome.
We used the programme gmos (Genome MOsaic Structure) to compute local alignments between paired query and subject genomes and to reconstruct the query mosaic structure from recombinant fragments longer than 600 bp (31). Although such nearly identical fragments most likely result from very recent recombination, they could also have been vertically inherited under strong selection, and were therefore regarded only as candidate recombinant fragments. For each genome pair, we additionally identified fragments that had changed their genomic location as parsimonious recombination fragments. To do so, we numbered the candidate fragments consecutively according to their order in the subject genome and recorded their relative order in the query genome as a new sequence. In this sequence, the longest increasing or decreasing subsequence was taken to represent fragments that retained their original locations, and the remaining fragments were recognized as parsimonious recombination fragments.
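The relocation step described above can be sketched in Python with a standard O(n log n) longest-increasing-subsequence routine. This is a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, fragment numbers are assumed to be unique integers, ties between equally long subsequences are broken arbitrarily, and the circularity of the genome is ignored.

```python
from bisect import bisect_left
from typing import List


def longest_increasing_subsequence(seq: List[int]) -> List[int]:
    """Return one longest strictly increasing subsequence of seq."""
    tails: List[int] = []      # tails[k]: smallest tail of a run of length k+1
    tails_idx: List[int] = []  # position in seq of each tail value
    prev: List[int] = [-1] * len(seq)  # back-pointers for reconstruction
    for i, value in enumerate(seq):
        k = bisect_left(tails, value)
        if k > 0:
            prev[i] = tails_idx[k - 1]
        if k == len(tails):
            tails.append(value)
            tails_idx.append(i)
        else:
            tails[k] = value
            tails_idx[k] = i
    result: List[int] = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        result.append(seq[i])
        i = prev[i]
    return result[::-1]


def relocated_fragments(query_order: List[int]) -> List[int]:
    """Fragments outside the longest increasing or decreasing subsequence.

    query_order lists fragment numbers (assigned by subject-genome order)
    in the order in which they occur along the query genome.
    """
    increasing = longest_increasing_subsequence(query_order)
    # A decreasing run corresponds to an inverted but otherwise collinear block.
    decreasing = [
        -v for v in longest_increasing_subsequence([-v for v in query_order])
    ]
    in_place = set(increasing if len(increasing) >= len(decreasing) else decreasing)
    return [frag for frag in query_order if frag not in in_place]


# Hypothetical example: fragment 5 sits between fragments 2 and 3 in the query.
moved = relocated_fragments([1, 2, 5, 3, 4, 6])
```

In this toy case the collinear backbone is fragments 1, 2, 3, 4, and 6, so only fragment 5 is reported as having changed location.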
DECLARATION
Availability of data and materials:
The final 45 complete genomes contributed by our laboratory have been deposited in GenBank under the accession numbers CP012758–CP012800, CP012806, and CP012807.
Competing interests
The authors declare no competing interests.
Funding
This work was supported by the National Scientific Foundation of China [31470180, 31471237, and 30971610].
Author contribution
J.Y. and Y.K. conceived the project and led the writing; S.L. provided the strains; L.Y., Z.H., X.J., R.F., Y.Y., and C.L. compiled the data; and C.F., Z.G., X.J., X.G., Q.M., Y.Z., J.W., J.X., and S.H. analysed the data. All authors contributed to the writing and/or intellectual development of the manuscript.
ADDITIONAL FILES
Table S1. E. coli strains used in this study
Table S2. Characteristic genes enriched in clades Vig and Slu
Table S3. List of LAV-containing genes of clades Vig and Slu
Acknowledgement
We thank W. Chen for providing feedback on the manuscript.
Footnotes
† The first five authors should be regarded as joint first authors.
LIST OF ABBREVIATIONS
- dN/dS: ratio of non-synonymous to synonymous polymorphisms
- ECOR: Escherichia coli collection of reference
- ESBLs: extended-spectrum beta-lactamases
- ExPEC: extraintestinal pathogenic E. coli
- HGT: horizontal gene transfer
- IPEC: intestinal pathogenic E. coli
- LAVs: lineage-associated variations
- MLST: multi-locus sequence typing
- NGS: next-generation sequencing
- r/m: ratio of polymorphisms caused by recombination to those caused by mutation
- Slu: Sluggish clade
- Stx: Shiga toxins
- Vig: Vigorous clade