Abstract
Individual bacterial lineages stably persist for years in the human gut microbiome1–3. However, the potential of these lineages to adapt during colonization of healthy people is not well understood2,4. Here, we assess evolution within individual microbiomes by sequencing the genomes of 602 Bacteroides fragilis isolates cultured from 12 healthy subjects. We find that B. fragilis within-subject populations contain substantial de novo nucleotide and mobile element diversity, which preserve years of within-person evolutionary history. This evolutionary history contains signatures of within-person adaptation to both subject-specific and common selective forces, including parallel mutations in sixteen genes. These sixteen genes are involved in cell-envelope biosynthesis and polysaccharide utilization, as well as yet under-characterized pathways. Notably, one of these genes has been shown to be critical for B. fragilis colonization in mice5, indicating that key genes have not already been optimized for survival in vivo. This lack of optimization, given historical signatures of purifying selection in these genes, suggests that varying selective forces with discordant solutions act upon B. fragilis in vivo. Remarkably, in one subject, two B. fragilis sublineages coexisted at a stable relative frequency over a 1.5-year period despite rapid adaptive dynamics within one of the sublineages. This stable coexistence suggests that competing selective forces can lead to B. fragilis niche-differentiation even within a single person. We conclude that B. fragilis adapts rapidly within the microbiomes of individual healthy people, providing a new route for the discovery of key genes in the microbiome and implications for microbiome stability and manipulation.
Main Text
Billions of de novo mutations are generated daily within each person’s gut microbiome6–9 (Table 1). It is unknown if any of these mutations confer a strong adaptive benefit to the bacteria in which they emerge or, in contrast, all available mutations are deleterious or neutral. While some bacterial pathogens are known to adapt within individual infections10–14, investigations into healthy carriage of commensals have not revealed similar signals of within-person adaptive mutations4,15. These observations raise the possibility that millions of years of commensal evolution within mammalian digestive systems16,17 has exhausted all strongly beneficial point mutations. This hypothesis is echoed by signals of long-term purifying selection in the gut microbiome2,18. However, gut microbiomes are heterogeneous and individualized environments that may vary over time1,14,19, and it is possible that new mutations may still drive rapid adaptation of commensal species within individual people.
Should adaptive mutations arise and be detectable within individual microbiomes, they are likely to indicate genes and pathways critical for long-term bacterial persistence in the human body11,13,20,21. The selective forces on these pathways might be common or person-specific, and their identification could guide microbiome-targeted therapies, including the selection and engineering of therapeutic bacteria for long-term colonization. To date, characterization of within-person evolution in the gut microbiome has been limited1,2,4,22, as it is difficult to distinguish de novo mutations from variants in homologous regions shared by co-colonizing bacteria using metagenomics alone. Culture-based approaches, which enable single-cell level whole-genome comparisons, have been limited to a small number of isolates. Further, it is often implicitly assumed that identifying within-person adaptation requires longitudinal sampling. However, if gut commensals diversify during their colonization within an individual, as is the case for bacterial pathogens12,13,23, co-existing genotypes can enable the inference of within-person evolution without long time-series.
To begin assessing the degree to which gut commensals evolve and diversify during colonization, we used a culture-dependent approach and focused on Bacteroides fragilis, a prevalent and abundant commensal in the large intestine of healthy people24. We surveyed intra-species diversity within 12 healthy subjects (ages 22-37; Supplementary Table 1), sequencing the genomes of 602 B. fragilis isolates from 30 fecal samples. These fecal samples included longitudinal samples from 7 subjects spanning up to 2 years and single samples from 5 subjects (Supplementary Table 2). None of these isolates were enterotoxigenic25 (Methods).
First, using a reference-based approach, we found that isolate genomes from different subjects differed by more than 10,000 single nucleotide polymorphisms (SNPs), while genomes from the same subject differed by fewer than 100 SNPs (with the exception of one isolate; Extended Data Fig. 1). We concluded that each subject was dominated by a unique lineage, consistent with previous investigations of within-host B. fragilis diversity5,24,26. We refer to each major lineage by its host ID (e.g. L01 for Subject 01’s lineage).
The SNP diversity was substantial within many lineages, enabling us to infer several years of within-person evolution. For each lineage, to discover variants in genomic regions not present in the reference, we assembled a draft genome using reads from all isolates, identified polymorphisms via alignment of short reads, and constructed a parsimony phylogeny (Methods, Fig. 1a, Extended Data Fig. 2–4). Between 8 and 182 de novo SNPs were identified per lineage (Fig. 1b). To estimate the age of the B. fragilis diversity within each subject at initial sampling, we calculated the average mutational distance of each population to its most recent common ancestor (dMRCA). To convert dMRCA to approximate units of time (tMRCA), we estimated the rate at which B. fragilis accumulates SNPs in the human gut by comparing SNP contents across longitudinal samples from the same subject (molecular clock; Fig. 1c, Extended Data Fig. 5a-h; Methods). Given our molecular clock estimate of ∼0.9 SNPs/genome/year, 11 of 12 lineages had values of tMRCA between ∼1.1-10 years (Fig. 1d) at the initial sampling. Due to the low acquisition rates of Bacteroidetes strains in healthy adult microbiomes1,27,28, we hypothesize that these within-subject populations emerged from a single cell within each subject. However, it is possible that some of the B. fragilis diversity within each person was inherited from a colonization event carrying multiple genotypes.
One outlier, L08, had a significantly higher value of dMRCA at initial sampling (38.9, P<0.001, Grubbs’ test, Fig. 1d). This excess of mutations was due exclusively to an increase in a single type of mutation within one major sublineage (GC to TA transversions, P<0.001, Chi-square test), strongly suggesting a hypermutation phenotype (Fig. 1e-f, Extended Data Fig. 5p). Hypermutation, an accelerated mutation rate usually due to a defect in DNA repair, is associated with adaptation and its emergence is commonly observed in laboratory experiments and during pathogenic infections23,29–32. The dMRCA of non-hypermutator sublineages from L08 was compatible with within-person diversification (9.9 SNPs/genome/year), and the topology of the rooted phylogeny was also consistent with the emergence of a hypermutation phenotype within this subject (Fig. 1e). This is the first evidence of the co-existence of hypermutator and normal lineages within a healthy human, though commensal E. coli isolates with hypermutation phenotypes have been isolated before33.
Interestingly, each lineage’s tMRCA at initial sampling was less than its subject’s age (22-37 years), suggesting that these lineages colonized their subjects later in life, that adaptive or neutral sweeps purged diversity, or both. To determine if sweeps occur during colonization, we looked for mutations that fixed over time. We observed sweeps within 3 of the 7 lineages with longitudinal samples, one of which was associated with a significant decrease in dMRCA (L04; P <0.001; Wilcoxon rank-sum test; Extended Data Fig. 5–6). Thus, sweeps appear to be common during colonization, and B. fragilis lineages likely resided longer in their hosts than suggested by tMRCA at initial sampling.
We next assessed the contribution of horizontal evolution by identifying within-lineage mobile element differences (MEDs). We defined MEDs as DNA sequences with multi-modal coverage across isolates within a lineage (Methods). We found MEDs in 11 of the 12 lineages (Fig. 1b). These mobile elements include putative plasmids, integrative conjugative elements (ICEs), and prophages (Supplementary Table 3). We examined each MED’s distribution across its lineage’s phylogeny and used parsimony to categorize it as a gain or loss event. We inferred 10 MEDs gained, 12 lost, and 17 ambiguous loci in ∼50 cumulative years of evolution (using tMRCAs at initial samplings). This provided lower-bound estimates of ∼0.05 gain/genome/year and ∼0.04 loss/genome/year. We further estimated that MEDs change the B. fragilis genome by at least ∼1.3 kbp gain/genome/year and ∼1.9 kbp loss/genome/year. Thus, while gain and loss events are more rare than SNPs, they contribute more to nucleotide variation during B. fragilis evolution.
We reasoned that if these mobile elements were transferred from other species in the same microbiomes, we would observe evidence in metagenomes from the same stool communities. In particular, a transferred region should have increased coverage in the metagenome compared to the rest of the B. fragilis genome, owing to its presence in other species. We leveraged stool metagenomes available from 8 subjects, scanning for genomic regions with high relative coverage and high identity (>3X and >99.98%, respectively, Methods). We found evidence of one inter-species MED transfer within Subject 04 (38X relative coverage in the metagenomes; Methods; Fig. 2a-b). This MED, a putative prophage, was absent from all isolates at Day 0 yet present in 68% of isolates at Day 329. This combination of longitudinal genomic and metagenomic evidence strongly suggests that this prophage was acquired by B. fragilis during the sampling period.
This same approach enabled us to identify inter-species transfers even when the genomic regions were present in all B. fragilis isolates of a given lineage (for 3 lineages; Supplementary Table 4; Fig. 2c). We confirmed one candidate, a putative integrative conjugative element (ICE) in Subject 01 containing a type VI secretion system34 (T6SS), by culturing and sequencing 94 isolates of other Bacteroides species. This ICE was present in all isolates of 3 species (n=82) and contained only 4 SNPs among these species, suggesting recent transfer (Fig. 2d, Extended Data Fig. 7, Methods). T6SSs mediate inter-bacterial competition and have been shown to be shared by members of the same microbiome24,35. The prevalence of this ICE in this subject suggests it confers a strong selective advantage to its recipient species. In general, however, there are limited statistical tools for distinguishing adaptation from neutral evolution for mobile element changes.
To assess if positive selection was a significant driver of within-person B. fragilis evolution, we examined the identity of observed SNPs. We searched for parallel evolution, a hallmark of positive selection in which similar changes emerge independently, focusing specifically on parallel evolution occurring within a person. We identified 16 genes mutated in parallel within a single subject, a significant deviation from a neutral model (P<0.001, Fig. 3a, Extended Data Fig. 8a-f; Methods). These genes were significantly enriched for nonsynonymous mutations, as reflected by dN/dS, the normalized ratio of nonsynonymous to synonymous mutations, indicating that mutations in these genes were indeed adaptive (dN/dS = 6.03, CI = (1.57, 51.3)). In contrast, genes mutated only once within a subject did not show such an enrichment (dN/dS = 0.93, CI = (0.74, 1.16); Fig. 3b).
Genes under parallel evolution reveal challenges to B. fragilis survival in vivo. The 16 genes include 5 involved in cell envelope biosynthesis, a dehydratase implicated in amino-acid metabolism, and 4 with unclear biological roles (Fig. 3c). The remaining 6 genes all encode for homologs of SusC or SusD, a large group of outer-membrane polysaccharide importers (Supplementary Table 5). A typical B. fragilis lineage has 75 SusC/SusD pairs and their substrates are thought to be mainly complex yet unknown polysaccharides36,37. SusC proteins form homodimeric β-barrels capped with SusD lids38, and the observed mutations were enriched at the interface between the barrel and lid (Fig. 3d-e; Methods). Notably, one of these susC homologs (BF3581) has been shown to be critical for B. fragilis colonization in mice and its locus has been designated as commensal colonization factor (ccf)5. Its essentiality is thought to be related to binding to host-derived polysaccharides5, and, therefore, mutations altering Sus proteins might reflect pressures to utilize host or diet-derived polysaccharides37. Alternatively, the presence of Sus proteins in the outer membrane and their co-occurrence on this list with genes involved in cell envelope synthesis (Fig. 3c, 3f) hints that selection on these genes might be driven by the pressure to evade the immune system39 or phage predation40.
While our results show that single amino acid changes in key genes of B. fragilis confer rapid adaptive advantages within individual people, these same genes show signatures of purifying selection across lineages separated by thousands of years (Fig. 3g; Methods). Some of the mutated residues driving this adaptation are even highly conserved across species (>25% residues; Supplementary Table 5). The discrepancy in signals between timescales implies that the selective forces acting on these genes are not constant and raises the possibility that adaptive mutations occurring in vivo may incur collateral fitness costs in the context of other selective forces41,42. This notion of competing selective forces is echoed by the well-described invertible promoters of B. fragilis, which enable rapid alternation between different outer-membrane presentations43,44. Interestingly, the invertible promoters control the same major pathways that we identified as undergoing positive selection (capsule synthesis and polysaccharide importers)43,45. The non-constant selective forces driving these inversions and mutations might be specific to some people or lineages, recently introduced into the human population, present only at particular times (e.g. during early stages of colonization), or coexisting within individual people (Fig. 3h). We found evidence of both subject-specific and other selective forces. Three Sus genes (BF1802, BF1803, and BF3581) were each mutated multiple times within a subject (P < 0.003 for each, Fisher’s exact test), yet never in other subjects. In contrast, five genes under selection were mutated in multiple lineages, with two genes even acquiring mutations at the same amino-acid residue in different lineages (BF1708 and BF2755; Fig. 3c). Remarkably, a BF2755 mutation (Q100P) found polymorphic in 3 subjects was also in the ancestor of L12 and two publicly available genomes (Extended Data Fig. 8g), suggesting a common and strong selective pressure on this amino acid.
Could competing selective forces create multiple coexisting niches for B. fragilis even within the same individual? We noticed that the two lineages with the largest dMRCA at initial sampling (L01 and L08) had long-branched, co-existing sublineages that might reflect niche-differentiation (Extended Data Fig. 2, 4a). We closely examined L01’s evolutionary history over a 537-day period, during which the relative abundance of B. fragilis did not substantially change, using 206 stool metagenomes (Extended Data Fig. 9a). We tracked 21 abundant SNPs whose evolutionary relationships were previously identified from isolate genomes and inferred the population dynamics of their corresponding sublineages (Fig. 4a-c; Methods). The relative ratio of the two major sublineages (SLs), SL1 and SL2, which diverged ∼8 years prior to initial sampling, remained stable across the 1.5-year period (Fig. 4c; Extended Data Fig. 9b). SL1 showed multiple signatures of rapid adaptation during this period, including mutations in genes under selection, competition of mutations through clonal interference (e.g. between SL1-a and SL1-b, and within SL1-a), and a rapid sweep involving two SNPs related to Sus genes (0 to ∼70% in <300 days; SL1-a-1; Fig. 4c-d). The continued coexistence of SL1 and SL2 despite a sweep within SL1 is particularly striking and suggests frequency-dependent selection or occupation of distinct, perhaps spatially segregated, niches46–49. The fact that 11 of 12 intragenic mutations separating these sublineages are amino-acid changing furthers the notion that they are functionally distinct. Therefore, it is likely that B. fragilis niche-differentiation can occur within a single person.
We present here the first description of rapid within-person adaptation for a bacterial species whose native niche is the human intestine, as well as the most time-resolved description of bacterial within-person evolutionary dynamics to date. Within the gut microbiome of individual people, B. fragilis acquires adaptive point mutations in key genes, including polysaccharide importers and capsule synthesis genes, under the pressure of natural selection. This adaptation can be strikingly fast; near-daily tracking of one donor’s B. fragilis population revealed de novo mutations that rose from 0 to 70% frequency in less than a year (Fig. 4b). Continuing adaptation suggests there is no single optimal B. fragilis sequence for survival in the human microbiome and points to competing selective forces. Should rapid within-person adaptation be a common feature of gut commensals, as it is for many opportunistic pathogens of the cystic fibrosis lung23,47,50, it may have far-reaching implications for the microbiome field. Adaptation to the unique combination of selective forces present within each person may partially explain the observed stability of individual lineages in the microbiome1 and necessitate a personalized approach for microbiome manipulations. De novo mutation may need to be considered as a possible driver of ecological dynamics and inter-personal and temporal differences in community composition. Culture-based evolutionary approaches therefore provide both fundamental insights into the dynamics of human microbiomes and a powerful discovery route for genes and pathways critical to bacterial survival within the microbiome.
Author contributions
S.Z., T.D.L., and E.J.A. designed the study; S.Z. performed B. fragilis experiments; M.P. and M.G. performed experiments for other Bacteroides; S.M.G., R.J.X., and E.J.A. coordinated acquisition of metagenomic data; S.Z. and T.D.L. analyzed the data; S.Z., T.D.L., and E.J.A. wrote the manuscript with input from all authors.
Competing financial interests
Eric Alm is a co-founder and shareholder of Finch Therapeutics, a company that specializes in microbiome-targeted therapeutics.
Methods
Study cohort and sample collection
Stool samples were obtained from OpenBiome, a non-profit stool bank, under a protocol approved by the institutional review boards at MIT and the Broad Institute. All 12 subjects were healthy people screened by OpenBiome to minimize the potential for carrying pathogens and had ages between 22 and 37 years and body-mass indexes between 19.5 and 26.2 at initial sampling. Subjects were de-identified before receipt of samples. Supplementary Table 1 contains detailed information about each subject.
OpenBiome received and processed fresh stool donations within 6 hours of generation. Most samples were homogenized in a buffer containing 12.5% glycerol and 0.9% sodium chloride by mass (relative ratio of buffer to stool was either 10:1 or 2.5:1 volume/mass). Some samples were homogenized in proprietary buffers (1:1 volume/mass). Homogenized samples were passed through a 330-micron filter and stored at −80°C. Subjects 01-07 had multiple samples from which B. fragilis was selectively cultured, with time-series spanning 31 to 709 days. For Subjects 08-12, only one sample was selectively cultured for B. fragilis. Metagenomic sequencing was performed on stool samples from 8 of the 12 subjects (319 stool samples in total). Detailed information about samples used for isolation, including handling conditions prior to sample receipt, is in Supplementary Table 2 and information about samples used for metagenomic sequencing is in Supplementary Table 6.
Library construction and Illumina sequencing
Samples were serially diluted in phosphate-buffered saline (PBS) and cultured for B. fragilis on Bacteroides Bile Esculin plates (BD 221836) in an anaerobic environment. Single colonies suspected of being B. fragilis based on colony morphology were re-suspended in 50μL of PBS with 0.1% L-cysteine. For future characterization, 15μL of the re-suspension was mixed with 15μL of 50% glycerol and stored at −80°C. DNA was extracted from the remaining 35μL using the PureLink Pro 96 genomic purification kit, following the manufacturer’s instructions. Genomic DNA libraries were constructed and barcoded using a modified version of the Illumina Nextera protocol51 (Library Prep. 1). Libraries from one sample (S01-0259, Day 709) were prepared by the BioMicroCenter at MIT using a different protocol, with lower input DNA and a final Pippin size-selection step (Library Prep. 2). Genomic libraries were sequenced either on the Illumina HiSeq platform with paired-end 100-bp reads or on the Illumina NextSeq platform with paired-end 75-bp reads by the Broad Institute Genomics Platform (Supplementary Table 2). Only isolates with average coverage of greater than 10 reads across the B. fragilis genome were included for analysis.
Identification of major lineages and SNPs
To estimate the distance between isolates across subjects and identify major lineages, we aligned all short reads to a publicly available reference genome NCTC9343 (NCBI accession: CR626927.1) and identified SNPs. Reads were first trimmed and filtered using Cutadapt52 and Sickle53 (pe -f 20 -r 50), and aligned using Bowtie2 (Alignment parameters: -X 2000 --no-mixed --very-sensitive --n-ceil 0,0.01 --un-conc). Isolates for which more than 70% of reads aligned to the reference were included as being B. fragilis. From all subjects, 14 isolates were discarded (1 isolate from subject 10 and 13 isolates from subject 06), all of which had fewer than 5% of reads aligning to NCTC9343, suggesting other species. Candidate SNPs were identified using SAMtools54 and filtered using custom filters modified from previous work23. In particular, genomic positions were considered to be candidate SNP positions if at least one pair of isolates was discordant on the called base and both members of the pair had: FQ scores (produced by SAMtools; lower values indicate more agreement between reads) less than −60, at least 7 reads that aligned to each of the forward strand and reverse strand, and a major allele frequency of at least 90%. If the median coverage across samples at a candidate position was less than 10 reads or if 33% or more of the isolates failed to meet filters described above, this position was discarded. For each SNP position identified, a nucleotide call was assigned to each isolate using the major allele call across reads for that isolate at that position. If fewer than 7 reads aligned to either forward or reverse strand of a position in an isolate, or the major allele frequency was smaller than 90%, an ambiguous call was assigned to the isolate at that SNP position.
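The per-isolate calling rule described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in this work: the function name `call_base` and the input layout (per-strand read counts keyed by base) are hypothetical, and only the thresholds (7 reads per strand, 90% major allele frequency) are taken from the text. The FQ-score and cross-isolate filters are not reproduced here.

```python
def call_base(fwd_counts, rev_counts, min_strand_reads=7, min_maf=0.90):
    """Return the major allele for one isolate at one position, or 'N' if ambiguous.

    fwd_counts / rev_counts: dicts mapping base -> read count on each strand
    (a hypothetical input layout chosen for this sketch).
    """
    fwd_total = sum(fwd_counts.values())
    rev_total = sum(rev_counts.values())
    # Fewer than 7 reads on either strand gives an ambiguous call.
    if fwd_total < min_strand_reads or rev_total < min_strand_reads:
        return "N"
    # Pool counts from both strands to find the major allele.
    totals = {}
    for counts in (fwd_counts, rev_counts):
        for base, n in counts.items():
            totals[base] = totals.get(base, 0) + n
    major = max(totals, key=totals.get)
    maf = totals[major] / (fwd_total + rev_total)
    # A major allele frequency below 90% gives an ambiguous call.
    return major if maf >= min_maf else "N"
```

Returning an explicit ambiguity symbol, rather than dropping the position, keeps every isolate aligned to the same list of SNP positions for downstream distance and phylogeny steps.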
We generated a neighbor-joining tree from the concatenated list of variable positions from conserved genomic regions present in all B. fragilis isolates from all subjects. When computing the distance between each pair of isolates, we only used variable positions that had unambiguous nucleotide calls from both isolates. This tree showed 12 major clades corresponding to the 12 subjects and one minor clade containing a single isolate from Subject 10 (Extended Data Fig. 1a). Within each major clade, all isolates differed from one another by fewer than 100 SNPs. We therefore operationally defined a lineage as a set of isolates that differ by fewer than 100 SNPs and refer to specific genotypes within a lineage as sublineages. All lineages differed by over 10,000 mutations (Extended Data Fig. 1b); given the molecular clock estimated by this work, this represents at least thousands of years of evolutionary distance.
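As a rough sketch of the distance computation described above, the following counts differences between two isolates while skipping positions where either call is ambiguous. The function names and the use of `"N"` as the ambiguity symbol are assumptions made for illustration. The 100-SNP cutoff is the operational lineage threshold stated in the text, applied here to a single pair, which simplifies the set-based definition.

```python
def pairwise_snp_distance(calls_a, calls_b, ambiguous="N"):
    """Count positions where two isolates differ, skipping ambiguous calls."""
    if len(calls_a) != len(calls_b):
        raise ValueError("isolates must be called at the same positions")
    return sum(
        1
        for a, b in zip(calls_a, calls_b)
        if a != ambiguous and b != ambiguous and a != b
    )


def same_lineage(calls_a, calls_b, max_snps=100):
    """Pairwise version of the operational definition: fewer than 100 SNPs apart."""
    return pairwise_snp_distance(calls_a, calls_b) < max_snps
```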
De novo assemblies of lineage genomes and within-lineage SNP identification
To enable us both to detect variants within genes carried only in a subset of lineages and to detect gains and losses of genomic regions that are specific to single lineages, we created a pan-genome for each major lineage. For each major lineage, we concatenated reads (trimmed and filtered) from all isolates and used this concatenated file as the input for de novo genome assembly via Spades v3.10.0 (parameter: --careful)55. To limit the memory required for assembly, we used 0.25 million pairs of reads from each isolate (∼7x coverage). Isolates prepared by Library Prep. 2, as well as a few isolates with apparent cross contamination (genome assemblies built only using reads from single isolates were larger than 6MB; B. fragilis genome assembly sizes range from 4.8 to 5.3 MB), were excluded in building assemblies. Isolates not used to build the genome assembly are indicated as such in the metadata associated with the uploaded raw data (see Data availability). Statistics of these genome assemblies are in Supplementary Table 1. Genome assemblies were annotated using Prokka v1.1156. Lineage pan-genomes successfully assembled regions present in only a single isolate (e.g. Extended Data Figure 2, 3c, 3e) and enabled detection of mutations that would have been missed by comparison to a single reference (Extended Data Fig. 3c vs Extended Data Fig. 3g). A genome assembly of the minor lineage from Subject 10 was built using all reads from this isolate.
Within-lineage mutations were identified by alignment of short reads to the corresponding lineage genome assembly, using the same parameters as described in the previous section. For lineage 10, the major allele frequency filter was set to 95%. Candidate positions in MEDs were also discarded (see below for information on MED identification). Detailed information on intra-subject SNPs from the 12 subjects is listed in Supplementary Tables 7-18.
The gene content across the 12 major lineage genomes and the NCTC9343 reference differed by 10%-20% (using the Szymkiewicz-Simpson similarity coefficient and taking gene length into account; Supplementary Table 19).
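The Szymkiewicz-Simpson (overlap) coefficient is the size of the intersection of two sets divided by the size of the smaller set. One plausible length-weighted form is sketched below. How gene length was incorporated is not specified in the text, so the weighting here (summing base pairs, and using the shorter copy for shared genes) is an assumption, as are the function name and the gene-family identifiers used as keys.

```python
def overlap_coefficient(genes_a, genes_b):
    """Length-weighted Szymkiewicz-Simpson coefficient between two genomes.

    genes_a / genes_b: dicts mapping a gene-family ID -> gene length in bp
    (hypothetical input layout for this sketch).
    """
    shared = genes_a.keys() & genes_b.keys()
    # Assumption: a shared gene contributes the length of its shorter copy.
    shared_len = sum(min(genes_a[g], genes_b[g]) for g in shared)
    smaller_total = min(sum(genes_a.values()), sum(genes_b.values()))
    if smaller_total == 0:
        raise ValueError("both genomes must contain at least one gene")
    return shared_len / smaller_total
```

Under this definition, a gene-content difference would be reported as one minus the returned similarity.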
Toxin detection
We compared the genome assemblies of the 12 major lineages and 1 minor lineage to the Virulence Factors Database, which contains >2400 virulence factors25, via BLAST using a threshold bit score of 200. We found only two hits to the database: Cps4J in L11 and ospC4 in L01. Neither hit was a toxin previously characterized for B. fragilis. In contrast, this method identified 171 hits to known B. fragilis-related toxins from 30 out of 88 B. fragilis genomes from the National Center for Biotechnology Information (NCBI).
Phylogeny of isolates from each B. fragilis lineage and identification of ancestral alleles
We used parsimony to reconstruct the evolutionary relationship between isolates from the same lineage. For each major lineage, a phylogeny of all isolates was built using a list of concatenated intra-subject SNPs and the closest lineage as an outgroup. We used the dnapars program, a parsimony tree builder from PHYLIP v3.69 to infer the phylogeny57. When parsimony could not resolve which allele was more likely to be ancestral, we inferred the ancestral allele to be the majority nucleotide at this genomic position across all other lineages with this genomic region. If a region was unique to a lineage, we assigned the ancestral allele that minimized the average mutational distances to the most recent common ancestor (dMRCA) for all isolates (3 cases).
dMRCA of each B. fragilis major lineage, molecular clock, and tMRCA
To calculate dMRCA for each subject at each time point, we counted the number of positions at which the called allele was different than the ancestral allele for each isolate, assessing only SNP positions that were polymorphic among isolates from the particular time point, and averaged the results.
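The dMRCA calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the code used in this work: the function name, the string-per-isolate input layout, and the decision to ignore ambiguous (`"N"`) calls when testing for polymorphism and when counting derived alleles are all choices made for this sketch.

```python
def dmrca(isolate_calls, ancestral, ambiguous="N"):
    """Mean number of derived alleles per isolate at one time point.

    isolate_calls: list of equal-length strings, one per isolate.
    ancestral: string of inferred ancestral alleles at the same positions.
    """
    if not isolate_calls:
        raise ValueError("need at least one isolate")
    # Keep only positions that are polymorphic among these isolates.
    polymorphic = [
        i
        for i in range(len(ancestral))
        if len({c[i] for c in isolate_calls if c[i] != ambiguous}) > 1
    ]
    # Count derived (non-ancestral) alleles per isolate at those positions.
    per_isolate = [
        sum(1 for i in polymorphic if c[i] != ambiguous and c[i] != ancestral[i])
        for c in isolate_calls
    ]
    return sum(per_isolate) / len(per_isolate)
```

Restricting to polymorphic positions means that mutations already fixed in every sampled isolate do not contribute, which is why a sweep can lower dMRCA.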
For each lineage with multiple time points, we computed the average number of new SNPs brought in per isolate from a later time point compared to the collection of SNPs identified at the initial time point. We then used linear regression to estimate the rate of evolution. The slope of the regression is our estimation of the evolutionary rate (Fig. 1c). Additional analysis approaches gave similar values of the molecular clock (Extended Data Fig. 5a-h).
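A minimal sketch of the slope estimate follows, using ordinary least squares with an intercept. Whether the regression in this work included an intercept or pooled lineages in a particular way is not stated here, so those details are assumptions. The helper `tmrca` simply applies the division of dMRCA by the clock rate described in the text.

```python
def molecular_clock(days, new_snps_per_isolate):
    """Ordinary least-squares slope, in SNPs/genome/year.

    days: sampling times in days since the initial sample.
    new_snps_per_isolate: mean new SNPs per isolate at each sampling time.
    """
    years = [d / 365.0 for d in days]
    n = len(years)
    if n < 2:
        raise ValueError("need at least two time points")
    mean_x = sum(years) / n
    mean_y = sum(new_snps_per_isolate) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("time points must not all be identical")
    sxy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(years, new_snps_per_isolate)
    )
    return sxy / sxx


def tmrca(dmrca_value, clock_rate):
    """Convert a mutational distance (SNPs/genome) to years."""
    return dmrca_value / clock_rate
```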
Each tMRCA was calculated by dividing dMRCA by the estimated molecular clock (Fig. 1d). We stress that tMRCA is not an estimate of time to colonization, but simply an estimate of the age of the coexisting diversity, as sweeps can purge diversity. While potential systematic false negative and false positive SNPs may have impacted tMRCA values, these sources of error would have had a similar impact on our molecular clock estimation, as SNP-calling was consistent throughout. Other possible sources of error in estimating tMRCA include incorrect designation of ancestral versus derived allele and undersampling of the population, though collector curves for dMRCA indicate that sampling was usually sufficient (Extended Data Fig. 6a-l). Interestingly, collector curves for the number of de novo SNPs reflect that the number of SNPs identified did not saturate (Extended Data Fig. 6m-x).
Mutation spectrum of hypermutator sublineage
SNPs were categorized into 6 types, based on the chemical nature of the single nucleotide changes (Fig 1f). For L08, we computed the frequency of each type separately for the hypermutator sublineage and non-hypermutator sublineages (Fig. 1f, purple and yellow bars). For the remaining lineages (L01-L07 and L09-L12), we computed the mutation spectrum for each lineage and then computed the mean and standard deviation of each of the 6 types (Fig. 1f, gray bars). The mutation spectrum was significantly different between the hypermutator sublineage and the non-hypermutator sublineages (Chi-square test, P<0.001), as well as the mean across the other 11 lineages (Chi-square test, P<0.001). No significant difference was found between the 11 other lineages and the non-hypermutator sublineages from L08 (Chi-square test, P=0.4).
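The six mutation types arise because a change on one strand is indistinguishable from the complementary change on the other strand. The sketch below collapses each SNP onto the pyrimidine strand before counting. The function names and type labels are chosen for illustration, and the chi-square comparison between spectra is not reproduced here.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Keyed by the change as read from the pyrimidine (C or T) strand.
TYPE_NAMES = {
    ("C", "A"): "GC>TA",
    ("C", "G"): "GC>CG",
    ("C", "T"): "GC>AT",
    ("T", "A"): "AT>TA",
    ("T", "C"): "AT>GC",
    ("T", "G"): "AT>CG",
}


def mutation_type(ancestral, derived):
    """Classify a single-nucleotide change into one of the 6 types."""
    if ancestral == derived:
        raise ValueError("ancestral and derived alleles are identical")
    # Collapse purine-strand changes onto the complementary strand.
    if ancestral in "AG":
        ancestral, derived = COMPLEMENT[ancestral], COMPLEMENT[derived]
    return TYPE_NAMES[(ancestral, derived)]


def mutation_spectrum(snps):
    """Fraction of SNPs in each of the 6 types; snps is a list of (anc, der)."""
    counts = Counter(mutation_type(a, d) for a, d in snps)
    total = sum(counts.values())
    return {name: counts.get(name, 0) / total for name in TYPE_NAMES.values()}
```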
When excluding the GC-TA type of mutation from the analysis, we found no significant difference between the non-hypermutator sublineage in L08 and the 11 other lineages (Extended Data Fig. 5p, P=0.11, Chi-square test), suggesting that the hypermutation phenotype was exclusively due to an increase in GC-TA mutations.
Identification of Mobile element differences (MEDs)
We aligned short reads to the assembled genome of each major lineage as above and identified candidate regions that were at least 500nt in length, had low relative coverage (< 0.2X) at every nucleotide in at least one isolate, and had >0.9X coverage at every nucleotide in at least one isolate. For L01, we excluded isolates from the last time point, as these isolates’ genomic libraries were prepared differently than the other isolates and therefore had a different genome-wide coverage pattern.
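One way to express the candidate-region criteria is to scan every ordered pair of isolates for stretches in which the first isolate is below 0.2X and the second is above 0.9X at each nucleotide. The sketch below takes that approach. It is an assumption about how the criteria could be implemented, not a description of the method used in this work. The function names and input layout are hypothetical, and intervals found from different isolate pairs are reported separately rather than merged.

```python
def runs(mask, min_len):
    """Return half-open (start, end) intervals of consecutive True values
    that are at least min_len long."""
    out, start = [], None
    for i, flag in enumerate(mask):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                out.append((start, i))
            start = None
    if start is not None and len(mask) - start >= min_len:
        out.append((start, len(mask)))
    return out


def candidate_meds(rel_cov, min_len=500, absent=0.2, present=0.9):
    """rel_cov: dict mapping isolate name -> list of relative coverage values,
    one per nucleotide of the lineage assembly."""
    found = set()
    for lo_name, lo in rel_cov.items():
        for hi_name, hi in rel_cov.items():
            if lo_name == hi_name:
                continue
            # Position qualifies if absent in one isolate and present in another.
            mask = [a < absent and b > present for a, b in zip(lo, hi)]
            found.update(runs(mask, min_len))
    return sorted(found)
```

The pairwise scan is quadratic in the number of isolates, which is acceptable for a sketch but would likely need a more efficient formulation for hundreds of isolates.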
To account for the fact that single mobile elements could have been separated into multiple pieces in the genome assembly, we grouped regions suspected to emerge from the same event. We clustered sequences that had identical presence/absence patterns across all isolates, where presence was defined by >0.4X average relative coverage over the region. On 3 occasions, we noticed regions that had the same presence/absence pattern but had different coverage distribution across isolates, suggesting they came from distinct mobile elements. In these cases, we separated these clusters of sequence regions into clusters with consistent coverage distribution patterns. Detailed information of all MEDs is in Supplementary Table 3.
MED gain and loss rates
We used parsimony to infer whether a MED was a gain or loss event. For each MED, we inferred events on the phylogenetic tree generated from whole genome data. If a single change of one type (e.g. gain) could explain the distribution, but more events were required for the other type (e.g. loss), the MED was categorized as the former type (Supplementary Table 3; Fig. 1b). Seventeen MEDs were classified as unknown because either multiple gain or multiple loss events were required to explain the distribution (e.g. MED01-2), or both a single gain event and a single loss event were consistent with the distribution. Interestingly, one putative MED from L11 appeared to have been lost many times among isolates during culture (Extended Data Fig. 4d, f). To estimate lower bounds for the rates at which gain and loss events change B. fragilis genomes, we weighted each observed MED j by its frequency within lineage i (f_ij). We then divided the weighted sum of events by the total time of diversification, estimated by the sum of tMRCA at initial sampling. The following equation was used for gain and loss events, separately:
$$\text{rate} = \frac{\sum_i \sum_j f_{ij}}{\sum_i \mathrm{tMRCA}_i}$$
To estimate the absolute contribution of gain and loss events to the size of B. fragilis genomes, we accounted for the length of each MED (L_ij):
$$\text{rate}_{\text{length}} = \frac{\sum_i \sum_j f_{ij} L_{ij}}{\sum_i \mathrm{tMRCA}_i}$$
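A minimal Python sketch of the two estimates described above follows, assuming that each MED is weighted by its within-lineage frequency and, for the size estimate, also by its length. The data layout, field names, and example values are hypothetical, and the function would be applied to gain events and loss events separately.

```python
def med_rates(lineages):
    """Frequency-weighted MED event rate and length-weighted rate.

    lineages: list of dicts, each with
        'tmrca'  : tMRCA of the lineage at initial sampling (e.g. in years)
        'meds'   : list of (frequency, length_bp) tuples for one event type
    Returns (events per unit time, base pairs per unit time).
    """
    total_time = sum(lin["tmrca"] for lin in lineages)
    if total_time <= 0:
        raise ValueError("total diversification time must be positive")
    weighted_events = sum(f for lin in lineages for f, _ in lin["meds"])
    weighted_bp = sum(f * length for lin in lineages for f, length in lin["meds"])
    return weighted_events / total_time, weighted_bp / total_time


# Hypothetical input: two lineages with made-up frequencies and lengths.
example = [
    {"tmrca": 2.0, "meds": [(0.5, 1000)]},
    {"tmrca": 3.0, "meds": [(1.0, 2000), (0.5, 4000)]},
]
event_rate, bp_rate = med_rates(example)
```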
Metagenomic library construction and Illumina sequencing
Genomic DNA was extracted from stool samples for metagenomic sequencing by the Microbial Omics Core at the Broad Institute using MoBio PowerSoil kits (Qiagen 12955-4) according to the manufacturer’s instructions. Genomic DNA libraries were constructed and barcoded by the Broad Technology Labs from 100-250pg of DNA using the Nextera XT DNA Library Preparation kit (Illumina) according to the manufacturer’s recommended protocol, with reaction volumes scaled accordingly. Pooled libraries were sequenced on the HiSeq platform with paired-end 100bp reads by the Broad Technology Labs.
Inter-species mobile element transfer
For each lineage, we scanned the assembled genome for regions with high average relative coverage (>3X) when aligning metagenomic reads to the lineage genome assembly. The coverage of metagenomic reads over the B. fragilis assembly varied by as much as 1,000-fold due to reads from homologous regions of different species. Therefore, to normalize against the true expected coverage of the B. fragilis genome, we divided observed coverage at each position by the mean coverage across positions between the 30th and 70th percentiles (the median was not precise given the low coverage in some samples). To identify recent transfer events, we searched the genome for candidate regions >5,000 nucleotides in length in which the consensus genome from metagenomes was <0.02% different from the consensus genome from isolates of the same subject. We found 14 candidate regions in 3 lineages. Only two candidate regions overlapped with MEDs, both of which were in Subject 04 (representing one MED). Information about these candidate regions is listed in Supplementary Table 4.
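The normalization described above can be sketched in Python as below. The rank-based definition of the 30th and 70th percentile cut points is an assumption of this sketch, as are the function name and the example values; the original analysis may have defined the percentiles differently.

```python
def normalize_coverage(coverage, lower_pct=30, upper_pct=70):
    """Divide per-position coverage by the mean of the mid-ranked positions.

    coverage: list of per-position read depths for one sample.
    The baseline is the mean of values ranked between the lower and upper
    percentiles, which is less sensitive than the overall mean to positions
    inflated by reads from homologous regions of other species.
    """
    ranked = sorted(coverage)
    n = len(ranked)
    lo = n * lower_pct // 100
    hi = n * upper_pct // 100
    core = ranked[lo:hi]
    if not core:
        raise ValueError("too few positions to define the percentile window")
    baseline = sum(core) / len(core)
    if baseline == 0:
        raise ValueError("baseline coverage is zero")
    return [c / baseline for c in coverage]


# Hypothetical depths: one position inflated by cross-species reads.
depths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
relative = normalize_coverage(depths)
# Positions with relative coverage > 3 would be flagged as high-coverage.
high = [i for i, r in enumerate(relative) if r > 3]
```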
We identified two genomic regions (31 kb and 62 kb, respectively) that were candidates for inter-species mobile element transfer in Subject 01. These two regions contained distinct ORFs homologous to conserved genes from type 6 secretion system of genomic architecture 2 (Extended Data Fig. 7c), consistent with a single transfer event. This transfer event was inferred to be an integrative conjugative element (ICE) because it contains the tra genes associated with integrative conjugative elements and a tRNA gene at one edge of a transfer region (Supplementary Table 4). To test if the putative ICE was indeed transferred between species, we cultured and sequenced the genomes of 94 Bacteroides isolates from this subject. We examined 53 Bacteroides vulgatus isolates (43 isolates from one B. vulgatus lineage, 10 isolates from a different B. vulgatus lineage, Extended Data Fig. 7a, b), 25 Bacteroides ovatus isolates, 4 Bacteroides xylanisolvens isolates, 10 Bacteroides stercoris isolates and 2 Bacteroides salyersiae isolates. We sequenced these isolates as described for B. fragilis and aligned reads to the mobile element candidates, using the same parameters as for B. fragilis. Strikingly, both genomic regions were present (average coverage >10 reads) in all B. ovatus, B. xylanisolvens, and B. vulgatus isolates profiled, but absent in all isolates of the other two species. The perfect co-occurrence of these two genomic regions further supports that they were from a single transfer event.
Parallel evolution
We counted a gene as under parallel evolution if, in at least one subject, the gene had multiple independent SNPs and more than 1 SNP per 2,000 bp (to account for the fact that long genes are more likely to be mutated multiple times by chance). Cases in which two SNPs in the same gene always occurred together in the same isolates were not counted as parallel evolution (one case from L04). To identify nucleotide positions that mutated multiple independent times within a person, we leveraged the parsimony phylogenies described above. We inferred the genotypes of all internal nodes using the parsimony assumption and counted the number of mutation events. This method identified 3 nucleotides that were mutated multiple times within an individual (Extended Data Figs. 2, 4). To determine whether the number of genes under parallel evolution represented a significant departure from what would be expected under a neutral model, we performed 1,000 simulations for each subject in which we randomly shuffled the mutations found across the lineage genome assembly and calculated how many genes showed a signature of within-person parallel evolution (Fig. 3a). To compare genes from different assemblies, coding sequences identified by Prokka from all lineages were clustered using CD-HIT with at least 98% identity and 90% coverage58. Detailed information for each gene under parallel evolution is in Supplementary Table 5 and gene clusters are listed in Supplementary Table 19. Simulations performed for metrics of cross-subject parallel evolution did not yield additional signatures of adaptive evolution (Extended Data Fig. 8).
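The gene-level criterion and the neutral simulation described above might be sketched in Python as follows. Placing each mutation uniformly at random along the assembly is an assumption about how the shuffling was done, and the gene coordinates, function names, and seed are hypothetical. The exclusion of SNP pairs that always co-occur in the same isolates is not modeled.

```python
import random


def is_parallel(n_snps, gene_len_bp, min_snps=2, bp_per_snp=2000):
    """True if a gene has multiple SNPs and more than 1 SNP per bp_per_snp bp."""
    # Integer comparison avoids floating-point ties at exactly 1 SNP per 2,000 bp.
    return n_snps >= min_snps and n_snps * bp_per_snp > gene_len_bp


def simulate_null(genes, genome_len, n_mutations, n_sims=1000, seed=0):
    """Number of genes passing the criterion in each random reshuffling.

    genes: list of (start, end) half-open, non-overlapping coordinates.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_sims):
        counts = [0] * len(genes)
        for _ in range(n_mutations):
            pos = rng.randrange(genome_len)
            for g, (start, end) in enumerate(genes):
                if start <= pos < end:
                    counts[g] += 1
                    break
        hits = sum(is_parallel(c, end - start)
                   for c, (start, end) in zip(counts, genes))
        results.append(hits)
    return results


def empirical_p(null_counts, observed):
    """Fraction of simulations with at least as many genes as observed."""
    return sum(c >= observed for c in null_counts) / len(null_counts)
```

A call such as `empirical_p(simulate_null(genes, genome_len, n_mutations), observed)` would then give the fraction of simulations that match or exceed the observed number of genes under parallel evolution.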
dN/dS
Mutations were categorized as synonymous (S) or non-synonymous (N) based on open-reading frame annotations created by Prokka56. To calculate dN/dS for sets of de novo mutations that emerged within subjects (Fig. 3b, first two categories), we normalized the observed N/S ratios by the expected N/S ratios. For any given set of SNPs, we calculated the expected N/S for these SNPs, accounting for both (1) the different probabilities of acquiring nonsynonymous mutations for different types of mutations and (2) the codon compositions of the genes in which these SNPs occurred. This method is similar to our previous approach23, but accounts for different codon composition between genes. 95% confidence intervals were calculated using binomial sampling.
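The normalization of observed N/S by expected N/S can be illustrated with a simplified Python sketch. It is not the authors' procedure: here the expected ratio is computed over every possible single-base change in one coding sequence, optionally weighted by a user-supplied table of relative mutation probabilities, whereas the method above conditions on the observed set of SNPs. Function names and the weighting scheme are assumptions, and confidence intervals are omitted.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (the same codon-to-amino-acid mapping
# applies to bacterial translation table 11).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def expected_n_over_s(cds, weights=None):
    """Expected N/S over all single-base changes in a coding sequence.

    weights: optional dict mapping (ref, alt) -> relative probability of that
             change; all changes are weighted equally if omitted.
    """
    cds = cds.upper()
    n = s = 0.0
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon not in CODON_TABLE:
            continue  # skip codons containing ambiguous bases
        for pos, alt in product(range(3), BASES):
            ref = codon[pos]
            if alt == ref:
                continue
            w = 1.0 if weights is None else weights.get((ref, alt), 0.0)
            mutant = codon[:pos] + alt + codon[pos + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                s += w
            else:
                n += w
    if s == 0:
        raise ValueError("no synonymous opportunities in this sequence")
    return n / s


def dn_ds(n_obs, s_obs, cds, weights=None):
    """Observed N/S divided by the expected N/S for the given sequence."""
    if s_obs == 0:
        raise ValueError("observed synonymous count is zero")
    return (n_obs / s_obs) / expected_n_over_s(cds, weights)
```

For a single glycine codon GGG, for example, the three third-position changes are synonymous and the six others are not, so the unweighted expected N/S is 2.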
To compute dN/dS for mutations across lineages (Fig. 3b, third category), we leveraged publicly available sequences. We downloaded fastq files of 55 publicly available B. fragilis isolate sequencing runs. We then identified mutations across these genomes and the 12 major lineages from this study (one isolate per lineage) using the same approach and parameters described above (Identification of major lineages and SNPs). The NCTC9343 genome was used as reference and ancestor. Expected N/S ratio was calculated with the same method described above, using all the SNPs identified across lineages.
To compute dN/dS for cross-lineage mutations in individual genes (Fig. 3g), we normalized the observed N/S with expected N/S of the particular genes. Expected N/S ratio was calculated with the same method described above, using only cross-lineage SNPs identified within the particular genes. For 3 genes not present in the NCTC9343 genome, we used the de novo assemblies to recruit reads from the publicly available sequences. No cross-lineage SNPs were identified in these 3 genes and dN/dS was not reported for these genes.
Annotation of genes under selection
To discover homologs of the sixteen genes under within-person parallel evolution, we used blastp to search against the RefSeq database, excluding proteins from B. fragilis genomes. Top hits with 3-4 letter gene names were searched against the B. fragilis genome to confirm whether they are true orthologs. We used the organisms from which these gene names were initially described to avoid false propagation of misannotation. We also used PaperBLAST to aid in identifying candidate gene names59. Cellular localizations were predicted using CELLO.
Conservation scores for each mutated residue were predicted using the Consurf web service60. For each gene, we used blastp to find homologs from the RefSeq database (first 100 hits; sequence similarity from 35% to 95%; query coverage > 80%). A multiple sequence alignment (MSA) was created using Clustal Omega from the EMBL-EBI web service {ref} (default parameters). We then used each MSA to generate a conservation score at each amino-acid residue using Consurf (default parameters). Detailed information is in Supplementary Table 5.
SusC and SusD protein structures and interface residues
Available crystal structures of a SusC homolog (BT1763) from Bacteroides thetaiotaomicron38 and BF1802 from B. fragilis NCTC_934361 were used to visualize the mutations observed in Sus genes under parallel evolution. We aligned the five B. fragilis SusC proteins under parallel evolution and BT1763 using Clustal Omega from the EMBL-EBI web service62 (default parameters). For all non-synonymous mutations, we identified their aligned positions on the BT1763 crystal structure. Two amino acid residues aligned to the first 211 amino-acid region, which encodes a plug domain and is not available in the crystal structure of BT176338. Eight non-synonymous mutations from Sus genes under parallel evolution are marked in red in Fig. 3d and Fig. 3e, using PyMol software63.
To test if the mutated residues were enriched at the interface between SusC and SusD, we used the PDBePISA web service64 (default parameters) to classify residues on the BT1763 crystal structure as in contact or not in contact with the SusD homolog. Of 806 residues, 119 were inferred to be interface residues. Among the 8 residues that were mutated in parallel, 4 of them were predicted to be interface residues in both programs, a significant enrichment (P=0.02, Fisher exact test). A similar result was obtained using the PyMol function InterfaceResidues (cutoff=1.0; P=0.02, Fisher exact test).
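The enrichment test described above can be reproduced approximately with the standard library alone. The sketch below computes a one-sided Fisher exact P value as a hypergeometric tail probability from the counts given in the text (806 residues, 119 interface residues, 8 mutated residues, 4 of them at the interface). Whether the original test was one-sided or two-sided is not stated, so the sidedness here is an assumption, as is the function name.

```python
from math import comb


def enrichment_p(total, n_interface, n_mutated, n_mutated_interface):
    """One-sided Fisher exact P value: P(X >= observed) under the
    hypergeometric distribution, where X is the number of mutated residues
    that fall on the interface by chance.
    """
    upper = min(n_interface, n_mutated)
    tail = sum(
        comb(n_interface, k) * comb(total - n_interface, n_mutated - k)
        for k in range(n_mutated_interface, upper + 1)
    )
    return tail / comb(total, n_mutated)


# Counts taken from the text above.
p_value = enrichment_p(total=806, n_interface=119, n_mutated=8,
                       n_mutated_interface=4)
```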
Enrichment of membrane proteins
For all genes from the 12 major lineage genome assemblies, we used CELLO65 to predict the cellular localization. Genes were considered to be membrane-related if they were annotated as inner membrane, periplasmic, or outer membrane. To compare our observation to the null expectation, we performed simulations. For each of the sixteen genes, we randomly selected one gene from the genome assembly of the lineage in which parallel evolution was identified. If a gene had parallel mutations in multiple lineages, we randomly chose one of the lineages. The cellular localization of n SNPs was assigned based on the CELLO prediction of this randomly picked gene, where n is the number of SNPs the original gene had across lineages. The proportion of SNPs from membrane-related genes was inferred using all sixteen such randomly picked genes (repeat genes not allowed). This procedure was repeated 1,000 times to draw a null distribution of the proportion of membrane-related SNPs. We calculated that in the sixteen genes under selection, 79% of the SNPs are from membrane-related genes, a significant deviation from the null distribution (P<0.001, Fig. 3f).
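A minimal Python sketch of the null simulation described above is given below. The input layout is hypothetical: each gene under parallel evolution is represented by the lineages in which it was identified and its SNP count, and each assembly by a list of booleans marking membrane-related genes. The sketch assumes each assembly contains many more genes than are drawn, so that sampling without repeats always succeeds.

```python
import random


def null_membrane_fractions(observed, assemblies, n_sims=1000, seed=0):
    """Null distribution of the fraction of SNPs in membrane-related genes.

    observed: list of (lineages, n_snps), one entry per gene under parallel
              evolution; lineages is a list of lineage names.
    assemblies: dict mapping lineage name -> list of bool, True where the
                gene at that index is predicted to be membrane-related.
    """
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_sims):
        picked = set()
        membrane_snps = total_snps = 0
        for lineages, n_snps in observed:
            # Redraw until an unused gene is found (repeat genes not allowed).
            while True:
                lineage = rng.choice(lineages)
                idx = rng.randrange(len(assemblies[lineage]))
                if (lineage, idx) not in picked:
                    picked.add((lineage, idx))
                    break
            total_snps += n_snps
            if assemblies[lineage][idx]:
                membrane_snps += n_snps
        fractions.append(membrane_snps / total_snps)
    return fractions


def empirical_p(null_fractions, observed_fraction):
    """Fraction of simulations at least as extreme as the observation."""
    return sum(f >= observed_fraction for f in null_fractions) / len(null_fractions)
```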
Signatures of subject-specific adaptation
Fisher’s exact test was used to test for subject-specific adaptation, comparing the number of SNPs in a tested gene within a particular lineage, the number of SNPs in other genes within this lineage, the number of SNPs in this gene from all other lineages combined, and the number of SNPs in other genes from all other lineages combined. We tested 10 genes that were present in multiple subjects but mutated only in one subject. The p-values for BF1802, BF3581, and BF1803 were all less than 0.005, suggesting person-specific adaptation.
Mutation dynamics
Metagenomic reads from Subject 01, acquired as described above, were aligned to the assembled genome of L01 using the same parameters described for aligning isolate reads. We tracked the frequency of each SNP found in 4 or more isolates from L01; SNPs found in fewer isolates were not abundant in the metagenomes. For each of the 21 SNPs that met this threshold, we calculated the frequency of reads at each position that agreed with the mutant (derived) allele. As the sequencing depth was limited and B. fragilis represented only ∼5% of reads on average (Extended Data Fig. 9a), not every SNP was covered at every time point. For each SNP, we visualized its dynamics by using time points with non-zero read counts and smoothing the trajectory using the Savitzky-Golay method with a span of 25 and degree of 0 (Fig. 4b).
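A Savitzky-Golay filter of degree 0 fits a constant within each window, which is equivalent to a centered moving average. The Python sketch below illustrates the frequency calculation and this smoothing step under two assumptions that are not taken from the original analysis: windows are truncated at the ends of the trajectory, and smoothing is applied over the index of retained time points rather than over elapsed days. Function names and example values are hypothetical.

```python
def derived_allele_frequencies(derived_reads, total_reads):
    """Per-time-point derived allele frequency, skipping uncovered time points.

    Returns (indices of retained time points, frequencies at those points).
    """
    kept, freqs = [], []
    for i, (d, t) in enumerate(zip(derived_reads, total_reads)):
        if t > 0:
            kept.append(i)
            freqs.append(d / t)
    return kept, freqs


def moving_average(values, span=25):
    """Centered moving average (degree-0 Savitzky-Golay smoothing).

    Windows are truncated at the edges of the series.
    """
    half = span // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical read counts at five time points; the third has no coverage.
kept, freqs = derived_allele_frequencies([0, 1, 0, 3, 4], [4, 4, 0, 4, 4])
trajectory = moving_average(freqs, span=3)
```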
To plot a schematic of the population dynamics of different sublineages (Fig. 4c), we averaged the frequencies of SNPs that were shared by a particular sublineage to estimate the relative abundance of this sublineage. To fill the time points where no stool community was sampled, we generated a continuous relative abundance trajectory for each sublineage using Fourier curve fitting (Matlab model fourier8). To visualize parent and child sublineages separately, we subtracted the sum of the relative abundances of the child sublineages from the relative abundance of their parent sublineage. When the combined relative abundance of child sublineages exceeded that of their parent sublineage, we set the frequency of the parent sublineage to 0. After Day 180, we manually set the frequency of the SL1 parent genotype to zero, and reduced discontinuities caused by this assignment by an additional Fourier curve fitting step (Matlab parameter: fourier8). The imputed relative frequencies were then renormalized so that they sum to 1.
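The parent and child bookkeeping at a single time point can be sketched in Python as follows. The sublineage names other than SL1 and the abundance values are hypothetical, and the Fourier curve fitting and the manual adjustment after Day 180 are not reproduced.

```python
def exclusive_abundances(abundance, children):
    """Convert inclusive sublineage abundances to non-overlapping ones.

    abundance: dict mapping sublineage name -> relative abundance, where a
               parent's value includes all of its descendants.
    children: dict mapping parent name -> list of its direct child names.
    Returns abundances renormalized to sum to 1.
    """
    exclusive = dict(abundance)
    for parent, kids in children.items():
        remainder = abundance[parent] - sum(abundance[k] for k in kids)
        # Set the parent to 0 when its children jointly exceed it.
        exclusive[parent] = max(0.0, remainder)
    total = sum(exclusive.values())
    if total == 0:
        raise ValueError("all abundances are zero")
    return {name: value / total for name, value in exclusive.items()}


# Hypothetical time point: SL1 has two child sublineages.
example = {"SL1": 0.6, "SL1-a": 0.2, "SL1-b": 0.1, "SL2": 0.4}
result = exclusive_abundances(example, {"SL1": ["SL1-a", "SL1-b"]})
```

Because the subtraction uses the inclusive abundances of direct children only, nested sublineages are handled without double counting.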
We also examined L03’s dynamics during colonization using 74 metagenomes collected over 144 days (Extended Data Fig. 9c-f). The same methods were used as described above, with the exception that mutations in ≥3 isolates could be tracked, owing to the higher relative abundance of B. fragilis in Subject 03. This schematic shows the expansion of one SNP alongside SNPs that decrease in frequency over time.
Data availability
Data is in the process of being uploaded to public servers. FASTQ files for the 602 B. fragilis isolates, with adaptors removed and filtered for quality, will be uploaded to the SRA. BAM files of the 352 metagenomes aligned to B. fragilis lineage assemblies will also be available on the SRA. Lineage assemblies with annotations will be uploaded to NCBI.
Code availability
Commented custom MATLAB code will be uploaded to GitHub prior to publication.
Supplementary Tables
There are 19 Supplementary Tables uploaded in a single .xlsx file.
Supplementary Table 1: Subject information and per-lineage statistics
Supplementary Table 2: Stool samples used for culturing single-colony isolates
Supplementary Table 3: Mobile element difference (MED) information
Supplementary Table 4: Candidate inter-species transfers
Supplementary Table 5: Genes under selection in vivo
Supplementary Table 6: Stool samples used for metagenomic sequencing and alignment results
Supplementary Table 7: de novo SNPs within L01
Supplementary Table 8: de novo SNPs within L02
Supplementary Table 9: de novo SNPs within L03
Supplementary Table 10: de novo SNPs within L04
Supplementary Table 11: de novo SNPs within L05
Supplementary Table 12: de novo SNPs within L06
Supplementary Table 13: de novo SNPs within L07
Supplementary Table 14: de novo SNPs within L08
Supplementary Table 15: de novo SNPs within L09
Supplementary Table 16: de novo SNPs within L10
Supplementary Table 17: de novo SNPs within L11
Supplementary Table 18: de novo SNPs within L12
Supplementary Table 19: Clustering of gene homologs from different lineages
Acknowledgements
We thank OpenBiome for providing stool samples, and Hera Vlamakis, Paige Swanson, Timothy Arthur, Julian Avila Pacheco, and Xiaofang Jiang for their assistance in obtaining samples and data. We are grateful to the BioMicroCenter at MIT and Microbial Omics Core at the Broad Institute for their assistance with library preparation and sequencing, Sean Kearney, Kathryn Kauffman, and Nadine Fornelos Martins for experimental assistance, and Vicki Mountain, Katya Frois-Moniz, and Shandrina Burns for administrative assistance. We thank members of the Alm lab for helpful discussions and Kevin Roelofs, Xiaoqian Yu, and Zhenrun Zhang for comments on the manuscript. This work was funded by a grant from the Broad Institute. T.D.L. acknowledges support from Boehringer Ingelheim.