Abstract
Samples of bacteria collected over a period of time are attractive for several reasons, including the ability to estimate the molecular clock rate and to detect fluctuations in allele frequencies over time. However, longitudinal datasets are occasionally used in analyses that assume samples were collected contemporaneously. Using both simulations and genomic data from Neisseria gonorrhoeae, Streptococcus mutans, Campylobacter jejuni, and Helicobacter pylori, we show that longitudinal samples (spanning more than a decade in real data) may suffer from considerable bias that inflates estimates of recombination and the number of rare mutations in a sample of genomic sequences. While longitudinal data are frequently accounted for using the serial coalescent, many studies that contain genomic data collected across many years use other programs or metrics, such as Tajima’s D, that are sensitive to these sampling biases. Notably, longitudinal samples from a population of constant size may exhibit evidence of exponential growth. We suggest that population genomic studies of bacteria should routinely account for temporal diversity in samples or provide evidence that longitudinal sampling bias does not affect conclusions.
Introduction
Evolutionary analysis of bacterial genomes provides insights into the origins of diversity and is increasingly used to inform control measures for infectious pathogens. Many analyses based on simple population genetic models, such as coalescent or diffusion theory (Kingman 1982; Kimura 1964), assume individuals are sampled contemporaneously (i.e. from the same generation). This assumption is reasonable for data from eukaryotes, which typically have longer generation times such that samples across years may only differ by a few generations. However, bacteria have shorter generations such that a sample collected across years could violate this assumption, and the consequences for the inference of evolutionary parameters from such data have not been extensively studied. An important exception is the serial coalescent (implemented in BEAST; Drummond et al. 2002) which can account for such differences in sampling times and has been used extensively to reconstruct the history of microbes. In many cases longitudinal samples may be purposefully collected to estimate mutation rates from ‘measurably evolving populations’ (Drummond et al. 2003). Nonetheless, serial coalescent methods do not currently allow for inference of selection or complex demographic scenarios and so researchers may choose other methods with different features to analyze population genetic data, such as the popular algorithms that fit models to data using the mutation site-frequency spectrum (SFS; e.g. Excoffier et al. 2013, Gutenkunst et al. 2009). Coalescent methods may also not allow homologous recombination, which is quite common in many bacteria and may be quantified using patterns of linkage disequilibrium or phylogenetic congruence (Smith et al. 1993; Suerbaum et al. 1998; Feil et al. 2001).
We use simulations to show that longitudinal samples have an excess of rare mutations compared to contemporaneous data and can exhibit more evidence of recombination. These patterns can be seen in real genomic datasets, which we illustrate with previously published samples of N. gonorrhoeae, S. mutans, C. jejuni, and H. pylori. Our results suggest that, at least for some bacterial species, longitudinal samples spanning ~10 years have biased summary statistic values that can result in misleading demographic inference or between-species comparisons of recombination rates (e.g. Smith et al. 1993; Feil et al. 2001), especially if species differ in generation time, population size, or sample composition. Thus, researchers doing analyses sensitive to the shape of the genealogy should either account for different sampling times or provide sufficient evidence that longitudinal sampling biases do not affect conclusions.
Results and Discussion
Longitudinal samples have distinct genealogies
The process of coalescence describes the underlying genealogy of a sample of sequences, which dictates the patterns of genetic diversity we observe. Sequences collected from different generations (i.e. a longitudinal sample) cannot coalesce until their ancestral lineages are simultaneously present. Until this occurs (going backwards in time; the period denoted Tsamp in Figure 1), particular lineages may mutate and recombine but cannot coalesce, distorting genealogical structure. These genealogical distortions are negligible if the evolutionary time (in generations) separating longitudinal samples (Tsamp) is small in comparison to the mean time to pairwise coalescence (Tcoal) in which mutation and recombination occur in contemporaneous samples (Depaulis et al. 2009). However, any factors that decrease Tcoal, such as smaller effective population sizes from transmission bottlenecks, or increase Tsamp, such as shorter generation times (in years, the timescale on which samples are collected), may cause longitudinal samples to differ markedly from contemporaneous ones.
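The scale of this effect can be illustrated with a back-of-the-envelope calculation for a single pair of sequences. This is only a sketch under assumptions not stated above: a neutral haploid Wright-Fisher population of constant size N (so that Tcoal = N generations), a per-generation mutation rate \mu for the locus, and two sequences sampled Tsamp generations apart. Going backwards in time, the more recent lineage accumulates mutations for Tsamp generations before coalescence is possible, after which the pair behaves like a contemporaneous sample. Writing \pi for the number of pairwise differences:

```latex
\begin{align}
E[\pi_{\text{contemp}}] &= 2\mu T_{\text{coal}}, \\
E[\pi_{\text{long}}] &= \mu\left(T_{\text{samp}} + 2T_{\text{coal}}\right), \\
\frac{E[\pi_{\text{long}}]}{E[\pi_{\text{contemp}}]} &= 1 + \frac{T_{\text{samp}}}{2T_{\text{coal}}}.
\end{align}
```

Under these assumptions, a pair separated by Tsamp = 0.5N generations is expected to differ at 25% more sites than a contemporaneous pair, whereas the inflation vanishes as Tsamp/Tcoal approaches zero. In this two-sequence case the additional mutations fall on the branch leading to the more recent sequence.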
We simulated three different sampling schemes in which we longitudinally sampled sequences through time in different ways (Figure 2A). For each sampling scheme, we varied the timespan between the first and last sample, using either 0.2N or 0.5N generations, where N is the population size. Compared to a contemporaneous sample, longitudinal samples had SFS with an excess of rare single-nucleotide polymorphisms (SNPs; Figure 2B; Depaulis et al. 2009) and exhibited more evidence of historical recombination as measured by pairwise phylogenetic compatibility (PC) of mutations (Wilson 1965; Figure 2C). The time spanning sample collection affected observed patterns of genetic variation much more than the particular sampling schemes used here, although the asymmetric sample (scheme 3) with more lineages from certain time points was slightly less biased (Figure 2). Other unexplored sample structures, such as those with a vast majority of lineages coming from a similar generation, would likely look more like contemporaneous samples.
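For readers unfamiliar with the two summaries, the following minimal Python sketch shows one way they could be computed from a binary alignment (rows are sequences, columns are biallelic sites, 1 marks the derived allele). It assumes that PC is the fraction of pairs of segregating sites that pass the four-gamete test; the function names and the toy alignment are hypothetical, and the exact definition used for Figure 2C (for example, which sites are included) may differ.

```python
from itertools import combinations


def site_frequency_spectrum(haplotypes):
    """Unfolded SFS: entry k counts sites whose derived allele (1) is carried by k sequences."""
    n = len(haplotypes)
    sfs = [0] * (n + 1)
    for site in zip(*haplotypes):  # iterate over alignment columns
        sfs[sum(site)] += 1
    return sfs


def sites_compatible(site_a, site_b):
    """Four-gamete test: compatible if fewer than four allele combinations are observed."""
    return len(set(zip(site_a, site_b))) < 4


def pairwise_compatibility(haplotypes):
    """Fraction of pairs of segregating sites that pass the four-gamete test."""
    n = len(haplotypes)
    segregating = [s for s in zip(*haplotypes) if 0 < sum(s) < n]
    pairs = list(combinations(segregating, 2))
    if not pairs:
        return float("nan")  # undefined with fewer than two segregating sites
    return sum(sites_compatible(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy alignment: 4 sequences (rows) by 3 biallelic sites (columns).
toy = [
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
    [1, 1, 0],
]
toy_sfs = site_frequency_spectrum(toy)
toy_pc = pairwise_compatibility(toy)
```

In the toy alignment all three sites are doubletons, and only one of the three site pairs is compatible, so lower PC values correspond to more evidence of historical recombination.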
SFS for nonsynonymous sites under purifying selection were also more skewed by sampling bias, but overall levels of purifying selection as measured by ratios of nonsynonymous to synonymous diversity were roughly similar across all samples (Figure S1).
Bacterial genetic samples may span relevant evolutionary timescales
To illustrate the potential for bacteria to have genealogies distorted by longitudinal sampling, we used genomic datasets from four bacterial pathogens sampled over time: N. gonorrhoeae (Grad et al. 2016; spanning 13 years), S. mutans (Cornejo et al. 2013; spanning 27 years), C. jejuni (Sheppard et al. 2013; Yahara et al. 2014; spanning 11 years), and H. pylori (Blanchard et al. 2013; spanning 11 years). We selected these datasets because they not only contained longitudinal samples spanning more than a decade from a restricted geographic location but also had enough samples from a given year for meaningful comparison (i.e. greater than or equal to 10). Population structure may also frequently create biases in evolutionary inference, but this has been studied elsewhere (e.g. Lapierre et al. 2016), so we attempted to minimize the effects of structure by only looking at samples within a city (San Diego, California, USA for N. gonorrhoeae or Cleveland, Ohio, USA for H. pylori), country (United Kingdom for S. mutans), or sequence type (ST45 for C. jejuni).
For three of the four species examined, a sufficient number of generations separate longitudinal samples (Tsamp) relative to the mean time to pairwise coalescence (Tcoal, which scales with population size) to distort genealogies and create measurable differences between summary statistics: serial samples exhibit more evidence of historical recombination (although only slightly for C. jejuni) and have skewed SFS with an excess of rare mutations (Figure 3B). SFS for larger samples, which are unavoidably longitudinal, are either similar to or slightly more skewed than those of subsamples (Table S1). These summary statistic biases could thus affect within-species estimation of recombination rates (e.g. Takuno et al. 2012) or between-species comparisons (e.g. Smith et al. 1993; Feil et al. 2001) if sampling dates are not accounted for, especially if datasets differ in sampling timespans or species have different generation times or effective population sizes (e.g. Suerbaum et al. 1998, which compares bacteria with a eukaryote). These biases could also create problems for any application that relies on the shape of the SFS, such as the calculation of Tajima’s D (as in Touchon et al. 2014) or programs that use the SFS to fit demographic models to data, such as δaδi, prfreq, or fastsimcoal2 (used in Cornejo et al. 2013; Pepperell et al. 2013; Montano et al. 2015, respectively). However, these biases could be favorable for other applications such as genome-wide association studies since longitudinal samples would exhibit less linkage than contemporaneous ones.
Summaries for longitudinal and contemporaneous H. pylori samples appear quite similar (Figure 3). Potential explanations include larger population sizes, longer generation times, or other factors such as population structure.
To illustrate, we used fastsimcoal2 (Excoffier et al. 2013) to fit two demographic models, either constant population size or exponential growth, to all samples. In agreement with the sample SFS in Figure 3B, model fits to longitudinal samples either had grossly inflated signals of population growth for N. gonorrhoeae or, for S. mutans, provided evidence for growth when contemporaneous samples were better explained by a model of constant size (Table 1, Figure S2). The 8-fold population growth estimated from the longitudinal S. mutans sample is similar to the 5-fold growth reported in Cornejo et al. 2013. Thus, while these results are not definitive, they still provide evidence that longitudinal sampling bias may have contributed to some, if not all, of the signal of growth in S. mutans. Masking rare variants (e.g. singletons and doubletons) does not ameliorate these biases, according to growth model fits to simulated datasets with larger sample sizes (n=50, Table S2).
Conclusions
Samples collected over time are common in the growing literature on the population genomics of bacteria. This reflects analyses of samples already collected and in freezers, but also a deliberate strategy to examine the way populations change over time. We have found that longitudinal sample schemes can produce erroneous signals of population growth and exaggerated rates of recombination if sample dates are ignored. This happens for intuitive reasons illustrated in Figure 1; the longer sampling period provides more generations for mutation or recombination to occur, skewing the SFS and the total amount of observed recombination. This can generate wholly artificial signals of population growth.
While our results urge caution in interpreting evolutionary analyses when collection dates are not accounted for, this problem is not expected to affect species with small Tsamp (longer generations or near-contemporaneous samples) and/or large Tcoal (large population sizes), such as non-pathogenic bacteria that have less population structure and do not experience frequent population bottlenecks from limited transmission. However, bacterial genomic samples frequently span more than a decade and may have significant biases. We thus suggest that sampling dates and proof that analyses do not suffer from longitudinal sampling bias should be routinely provided in evolutionary genetic studies of bacteria.
Methods
Bacterial Genomic Data
N. gonorrhoeae data were kindly provided by Yonatan Grad (Grad et al. 2016), C. jejuni data were provided by Samuel Sheppard (Sheppard et al. 2013; Yahara et al. 2014), and we downloaded S. mutans data used in Cornejo et al. 2013 and H. pylori data reported in Blanchard et al. 2013 from NCBI. We annotated de novo assemblies with PROKKA (Seemann 2014), using an amino acid file from a reference genome (FA1090 for N. gonorrhoeae, UA159 for S. mutans, CjeNCTC11168 for C. jejuni, and HPY26695 for H. pylori). We then identified core and accessory genes with ROARY (Page et al. 2015) and used only core genes that were also present in the reference genome for analysis of PC and the SFS. We inferred position information between polymorphic sites using the relative positions of genes in the reference genome, not from a reference-based DNA alignment. We calculated PC (Wilson 1965) and Tajima’s D (Tajima 1989) from core gene alignments using custom Perl scripts. All contemporaneous and serial samples used in this study may be found in Table S3.
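As an illustration of why Tajima’s D is sensitive to an excess of rare mutations, the statistic can be computed directly from an SFS. The sketch below follows the standard formula from Tajima (1989); it is written in Python for illustration and is not the custom Perl script used in this study. The function name and the two example spectra are hypothetical.

```python
import math


def tajimas_d(sfs):
    """Tajima's D from an SFS where sfs[k] = number of sites with k derived copies."""
    n = len(sfs) - 1                      # sample size
    seg_sites = sum(sfs[1:n])             # S: sites with 1..n-1 derived copies
    if n < 4 or seg_sites == 0:
        return float("nan")               # variance term is zero for n < 4
    # Mean number of pairwise differences (pi), summed over sites.
    pi = sum(
        count * 2 * k * (n - k) for k, count in enumerate(sfs) if 0 < k < n
    ) / (n * (n - 1))
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


# Hypothetical spectra for n = 10 sequences and 10 segregating sites each.
singletons_only = [0, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0]
intermediate_only = [0, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0]
d_rare = tajimas_d(singletons_only)
d_intermediate = tajimas_d(intermediate_only)
```

A spectrum dominated by singletons yields a negative D and one dominated by intermediate-frequency variants yields a positive D, so any sampling artifact that adds rare mutations pushes D downward.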
Simulations of longitudinal datasets
We designed a forward-in-time simulator of a Wright-Fisher population in C++. We used this program to simulate a population of N=1000 haploid genomes for a burn-in time of 10N generations until the population reached mutation-drift balance. After this time, we took either a contemporaneous or a longitudinal sample of n=50 genomes. Longitudinal samples spanned either 200 (0.2N) or 500 (0.5N) generations between the first and last sample, and we used three different sampling schemes (Figure 2A). For scheme one, we sampled 1 genome every 4 or 10 generations until n=50, such that samples were collected across ~0.2N or ~0.5N generations, respectively. Likewise, for scheme two, we sampled 10 genomes every 50 or 125 generations until n=50. Lastly, for scheme three we took five samples of size 18, 14, 10, 6, and 2 that were separated by 50 or 125 generations. We note that while real bacterial population sizes are likely much larger than the size simulated here, our results scale to populations of arbitrary size as long as time is measured in N generations. For the results in Figure 2, we simulated 50 kb fragments and fewer repetitions (20), but for the analyses of purifying selection in Figure S1, we simulated 10 kb fragments and more repetitions (1000).
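The three sampling schemes can be written down as explicit schedules, which makes the timespans easy to check. The Python sketch below is illustrative only and is not the C++ simulator; the function name is hypothetical, and the temporal ordering of the five sample sizes in scheme three is an assumption, since only the sizes and their spacing are given above.

```python
def sampling_schedule(scheme, span):
    """Return (generations_after_first_sample, genomes_sampled) pairs totalling n = 50.

    span is the nominal timespan in generations (200 or 500 for N = 1000).
    """
    if scheme == 1:                       # 1 genome every span/50 generations
        step = span // 50                 # 4 or 10 generations
        return [(i * step, 1) for i in range(50)]
    if scheme == 2:                       # 10 genomes at each of 5 time points
        step = span // 4                  # 50 or 125 generations
        return [(i * step, 10) for i in range(5)]
    if scheme == 3:                       # asymmetric sample sizes
        step = span // 4
        sizes = [18, 14, 10, 6, 2]        # ordering in time is an assumption
        return [(i * step, size) for i, size in enumerate(sizes)]
    raise ValueError("scheme must be 1, 2, or 3")


schedules = {
    (scheme, span): sampling_schedule(scheme, span)
    for scheme in (1, 2, 3)
    for span in (200, 500)
}
```

Under this reading, scheme one spans 196 or 490 generations between the first and last genome (hence ~0.2N and ~0.5N), while schemes two and three span exactly 200 or 500 generations.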
Demographic model fitting
We fit both a model of constant population size and a model of exponential growth to the SFS of fourfold degenerate sites using fastsimcoal2 (Excoffier et al. 2013), which uses an expectation-maximization algorithm to search parameter space. For each sample SFS, we ran fastsimcoal2 50 times. We chose the run that produced the highest likelihood for model selection, and we used the Akaike information criterion (AIC) to evaluate which model had the higher probability of being correct given the candidate set of models (Figure S2). To explore the effect of masking rare variants for parameter estimation, we fit exponential growth models to simulated datasets using either the full SFS (default) or a minimum mutation count of three (by including the “-C 3” flag in the fastsimcoal2 command; Table S2).
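The model comparison step can be sketched as follows. AIC is computed from the maximized log-likelihood and the number of free parameters, and Akaike weights give the relative probability of each model within the candidate set. The Python code is a minimal illustration, not the procedure used to produce Figure S2: the function names, the log-likelihood values, and the parameter counts are hypothetical. AIC is defined on natural-log likelihoods, so a likelihood reported in log10 units would first need to be converted.

```python
import math


def log10_to_ln(log10_likelihood):
    """Convert a log10 likelihood to a natural-log likelihood."""
    return log10_likelihood * math.log(10)


def aic(ln_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * ln_likelihood


def akaike_weights(aic_values):
    """Relative probability of each model given the candidate set."""
    best = min(aic_values)
    relative = [math.exp(-0.5 * (value - best)) for value in aic_values]
    total = sum(relative)
    return [r / total for r in relative]


# Hypothetical best-run results: (natural-log likelihood, free parameters).
constant_size = (-1000.0, 1)
exponential_growth = (-995.0, 2)
aic_values = [aic(*constant_size), aic(*exponential_growth)]
weights = akaike_weights(aic_values)
```

With these made-up numbers the growth model has the lower AIC (1994 versus 2002) and receives most of the weight. Because the extra parameter is penalized, a growth model is preferred only when it improves the log-likelihood by more than one unit per added parameter.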
Acknowledgements
We would like to thank Marc Lipsitch, Hsiao-Han Chang, and Omar Cornejo for useful discussion. We would also like to thank Michael Stanhope for providing the sampling dates of the S. mutans isolates.