Abstract
The advent of next generation sequencing technologies has made whole-genome and whole-population sampling possible, even for eukaryotes with large genomes. With this development, experimental evolution studies can be designed to observe molecular evolution “in-action” via Evolve-and-Resequence (E&R) experiments. Among other applications, E&R studies can be used to locate the genes and variants responsible for genetic adaptation. Existing literature on time-series data analysis often assumes large population size, accurate allele frequency estimates, and wide time spans. These assumptions do not hold in many E&R studies.
In this article, we propose a method, Composition of Likelihoods for Evolve-And-Resequence experiments (Clear), to identify signatures of selection in small-population E&R experiments. Clear takes whole-genome sequencing of pools of individuals (pool-seq) as input, and properly addresses heterogeneous ascertainment bias resulting from uneven coverage. Clear also provides unbiased estimates of model parameters, including population size, selection strength, and dominance, while being computationally efficient. Extensive simulations show that Clear achieves higher power in detecting and localizing selection over a wide range of parameters, and is robust to variation in coverage. We applied the Clear statistic to multiple E&R experiments, including data from a study of D. melanogaster adaptation to alternating temperatures and a study of outcrossing yeast populations, and identified multiple regions under selection with genome-wide significance.
1 Introduction
Natural selection is a key force in evolution, and a mechanism by which populations can adapt to external ‘selection’ pressure. Examples of adaptation abound in the natural world [22], including classic examples such as lactose tolerance in Northern Europeans [9] and human adaptation to high altitudes [55, 69], as well as drug resistance in pests [15], HIV [24], cancer [27, 70], the malarial parasite [3, 44], and others [56]. In these examples, understanding the genetic basis of adaptation can provide valuable information, underscoring the importance of the problem.
Experimental evolution refers to the study of evolutionary processes of a model organism in a controlled [7, 10, 28, 37, 38, 46, 47] or natural [5, 8, 16, 17, 41, 50, 68] environment. Recent advances in whole-genome sequencing have enabled us to sequence populations at a reasonable cost, even for large genomes. Perhaps more important for experimental evolution studies, we can now evolve and resequence (E&R) multiple replicates of a population to obtain longitudinal time-series data, in order to investigate the dynamics of evolution at the molecular level. Although constraints such as small population sizes, limited timescales, and oversimplified laboratory environments may limit the interpretation of E&R results, these studies are increasingly being used to test a wide range of hypotheses [34] and have been shown to be more predictive than static data analysis [12, 18, 52]. In particular, longitudinal E&R data are being used to estimate model parameters including population size [33, 49, 60, 64, 65, 67], strength of selection [11, 29, 30, 40, 43, 57, 60], allele age [40], recombination rate [60], mutation rate [6, 60], and quantitative trait loci [4], and for tests of neutrality hypotheses [8, 13, 23, 60].
While many E&R study designs are being used [6, 53], we restrict our attention to adaptive evolution due to standing variation in fixed-size populations. This regime has been considered earlier, typically with D. melanogaster as the model organism of choice, to identify adaptive genes in longevity and aging [13, 51] (600 generations), courtship song [63] (100 generations), hypoxia tolerance [71] (200 generations), adaptation to new laboratory environments [26, 46] (59 generations), egg size [32] (40 generations), C virus resistance [42] (20 generations), and dark-fly [31] (49 generations).
The task of identifying selection signatures can be addressed at different levels of specificity. At the coarsest level, identification could simply refer to deciding whether some genomic region (or a gene) is under selection or not. In the following, we refer to this task as detection. In contrast, the task of site-identification corresponds to the process of finding the favored mutation/allele at the nucleotide level. Finally, estimation of model parameters, such as the strength of selection and dominance at the site, can provide a comprehensive description of the selection process.
In the effort to analyze E&R selection experiments, many authors chose to adapt existing tests, originally designed for static data, pairwise comparisons (two time points), and single replicates, to perform a genome-wide scan. For instance, Zhu et al. [71] used the ratio of the estimated population sizes of case and control populations to compute a test statistic for each genomic region. Burke et al. [13] applied the Fisher exact test to the last observation of data on case and control populations. Orozco-terWengel et al. [46] used the Cochran-Mantel-Haenszel (CMH) test [1] to detect SNPs whose read counts change consistently across all replicates of two-time-point data. Turner et al. [63] proposed the diffStat statistic to test whether the change in allele frequencies of two populations deviates from the distribution of change in allele frequencies of two drifting populations. Bergland et al. [8] calculated Fst between populations over time to quantify their differentiation from the ancestral population (two-time-point data) as well as from geographically distinct populations. Jha et al. [32] computed the test statistic of a generalized linear mixed model directly from read counts.
Alternatively, direct methods have been developed that analyze time-series data by taking a likelihood approach and estimating population genetics parameters. Bollback et al. [11] proposed a Hidden Markov Model (HMM) to estimate the selection coefficient s and population size by using a diffusion approximation to the continuous Wright-Fisher Markov process. Steinrücken and Song [57] proposed a general diploid selection model which takes into account the dominance of the favored allele and approximates the likelihood analytically. Mathieson and McVean [43] adapted HMMs to structured populations and estimated parameters using an Expectation Maximization (EM) procedure on discretized allele frequencies. Feder et al. [23] modeled increments in allele frequency with a Brownian motion process, and proposed the Frequency Increment Test (FIT). More recently, Topa et al. [62] proposed a Gaussian Process (GP) for modeling single-locus time-series pool-seq data. Terhorst et al. [60] extended GP to compute the joint likelihood of multiple loci under null and alternative hypotheses. Recently, Schraiber et al. [54] proposed a Bayesian framework to estimate parameters using Markov chain Monte Carlo sampling.
While existing methods have been successfully applied in their respective settings, they make assumptions that may not hold in E&R studies. First, they assume that the underlying population size is large, so that it is reasonable to model the dynamics of allele frequencies using continuous-state models. Second, a number of existing methods were originally designed for wide time spans, such as ancient DNA studies. Finally, they assume that input data come in the form of unbiased allele frequencies, which may not be valid for shotgun sequencing experiments.
Here, we consider a Hidden Markov Model (HMM), similar to those of Williamson et al. [67] and Bollback et al. [11], but under a “small-population-size” regime. Specifically, we use a discrete state (frequency) model. We show that for small population sizes, discrete models can compute the likelihood exactly, which improves statistical performance, especially for short-time-span experiments. Additionally, we add another level of sampling noise to the traditional HMM, allowing for heterogeneous ascertainment bias due to uneven coverage among variants. We show that for a wide range of parameters, Clear provides higher power for detecting selection, estimates model parameters consistently, and localizes the favored allele more accurately compared to state-of-the-art methods, while being computationally efficient.
2 Materials and Methods
Consider a panmictic diploid population with a fixed size of N individuals. Let ν = {νt}t∈𝒯 be the frequencies of the derived allele at generations t ∈ 𝒯 for a given variant, where at generations 𝒯 = {τi : 0 ≤ τ0 < τ1 < … < τT} samples of n individuals are chosen for pooled sequencing. The experiment is replicated R times. We denote the allele frequencies of the R replicates by the set {ν}R. To identify the genes and variants that are responding to selection pressure, we use the following procedure:
(i) Estimating population size. The procedure starts by estimating the effective population size, N̂, under the assumption that much of the genome is evolving neutrally.
(ii) Estimating selection parameters. For each polymorphic site, selection and dominance parameters s, h are estimated so as to maximize the likelihood of the time-series data, given N̂.
(iii) Computing likelihood statistics. For each variant, a log-odds ratio of the likelihood of selection model (s > 0) to the likelihood of neutral evolution/drift model is computed. Likelihood ratios in a genomic region are combined to compute the Clear statistic for the region.
(iv) Hypothesis testing. An empirical null distribution of the Clear statistic is calculated using genome-wide drift simulations, and used to compute p-values and thresholds for a specified FDR. We perform single locus hypothesis testing within selected regions to identify significant variants and report genes that intersect with the selected variants.
These steps are described in detail below.
2.1 Estimating Population Size
Methods for estimating population sizes from temporal neutral evolution data have been developed [2, 11, 33, 60, 67]. Here, we aim to extend these models to explicitly model the sampling noise that arises in pool-seq data. Specifically, we model the variation in sequence coverage over different locations, and the noise due to sequencing only a subset of the individuals in the population. In addition, many existing methods [11, 23, 60, 62] are designed for large populations, and model frequency as a continuous quantity. We show that smooth approximations may be inadequate for the small populations, low starting frequencies, and sparse sampling (in time) that are typical in experimental evolution (see Results, Fig 3A-C, and Fig 2). To this end, we model the Wright-Fisher Markov process for generating pool-seq data (Fig S1) via a discrete HMM (Fig 1-B). We start by computing a likelihood function for the population size given neutral pool-seq data.
Likelihood for Neutral Model
We model the allele frequency counts 2Nνt as being sampled from a Binomial distribution. Specifically, the initial frequency ν0 is drawn from π, and 2Nνt+1 | νt ∼ Binomial(2N, νt), where π is the global distribution of allele frequencies in the base population. Here, we simply assume that π is the site frequency spectrum of a fixed-size neutral population (Fig S2). Note that π may depend on the demographic history of the founder lines.
To estimate the frequency after τ transitions, it is enough to specify the 2N × 2N transition matrix P(τ), where P(τ)[i,j] denotes the probability of a change in allele frequency from i/2N to j/2N in τ generations. Under neutrality, the single-generation transition probabilities are Binomial, P[i,j] = C(2N, j) (i/2N)^j (1 − i/2N)^(2N−j), and P(τ) = P^τ.
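As an illustration, the neutral transition matrix and its τ-step power can be assembled directly from Binomial probability mass functions. The sketch below is a minimal Python implementation, not the Clear code itself; the function name `neutral_transition` is ours, and states are indexed over all 2N+1 possible allele counts, including the absorbing boundaries.

```python
import numpy as np
from scipy.stats import binom

def neutral_transition(N, tau=1):
    """Neutral Wright-Fisher transition matrix raised to tau generations.

    Row i is the Binomial(2N, i/2N) pmf over all possible derived-allele
    counts j, i.e. P[i, j] = C(2N, j) (i/2N)^j (1 - i/2N)^(2N - j).
    """
    counts = np.arange(2 * N + 1)
    freqs = counts / (2.0 * N)
    # Broadcasting: rows index the current count i, columns the next count j.
    P = binom.pmf(counts[None, :], 2 * N, freqs[:, None])
    return np.linalg.matrix_power(P, tau)
```

States 0 and 2N are absorbing (loss and fixation), so their rows are unit vectors; every row remains a probability distribution after taking matrix powers.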
Furthermore, in an E&R experiment, n ≤ N individuals are randomly selected for sequencing. The sampled allele frequencies, {yt}t∈𝒯, are also Binomially distributed: 2n·yt | νt ∼ Binomial(2n, νt).
We introduce the 2N × 2n sampling matrix Y, where Y[i, j] stores the probability that the sample allele frequency is j/2n given that the true allele frequency is i/2N.
We denote the pool-seq data for that variant as {xt = ⟨ct, dt⟩}t∈𝒯, where dt and ct represent the coverage and the read count of the derived allele, respectively. Let {λt}t∈𝒯 be the sequencing coverage at different generations. Then, the observed data are sampled according to dt ∼ Poisson(λt) and ct | dt, yt ∼ Binomial(dt, yt).
The emission probability for an observed tuple xt = ⟨dt, ct⟩, given the population state νt = i/2N, marginalizes over the unobserved sample frequency: Pr(xt | νt = i/2N) = Σj Y[i,j] · C(dt, ct) (j/2n)^ct (1 − j/2n)^(dt−ct).
For 1 ≤ t ≤ T, 1 ≤ j ≤ 2N, let αt,j denote the probability of emitting x1, x2, …, xt and reaching state j at τt. Then, αt can be computed using the forward procedure [19], αt,j = (Σi αt−1,i P(δt)[i,j]) · Pr(xt | νt = j/2N), where δt = τt − τt−1. The joint likelihood of the observed data from R independent observations is the product of the per-replicate likelihoods, Pr({x}R | N) = Πr Σj α(r)T,j, where x = {xt}t∈𝒯. The graphical model and the generative process by which data are generated are depicted in Fig 1-B and Fig S1, respectively.
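The forward recursion, together with the sampling and emission steps, can be sketched as follows. This is an illustrative re-implementation under our own naming (`sampling_matrix`, `emission`, `forward_loglik`), not the released Clear code; it assumes the transition matrices between consecutive sampling times are supplied precomputed.

```python
import numpy as np
from scipy.stats import binom

def sampling_matrix(N, n):
    # Y[i, j]: P(sample count j | population count i) = Binom(j; 2n, i/2N).
    i = np.arange(2 * N + 1)[:, None] / (2.0 * N)
    j = np.arange(2 * n + 1)[None, :]
    return binom.pmf(j, 2 * n, i)

def emission(Y, n, d, c):
    # P(<d, c> | population state i) = sum_j Y[i, j] * Binom(c; d, j/2n).
    read_pmf = binom.pmf(c, d, np.arange(2 * n + 1) / (2.0 * n))
    return Y @ read_pmf

def forward_loglik(obs, P_steps, Y, n, pi):
    """obs: list of (depth, derived-read count) per sampling time;
    P_steps: precomputed transition matrices between consecutive samples;
    pi: distribution over the 2N+1 initial allele-count states."""
    alpha = pi * emission(Y, n, *obs[0])
    for (d, c), P in zip(obs[1:], P_steps):
        alpha = (alpha @ P) * emission(Y, n, d, c)
    return np.log(alpha.sum())
```

The per-replicate log-likelihoods from `forward_loglik` are summed over replicates, since replicates evolve independently given N.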
Finally, the last step is to compute an estimate N̂ that maximizes the likelihood of all M variants in the whole genome. Let x(r)i denote the time-series data of the i-th variant in replicate r. Then, N̂ = arg maxN Σi Σr log Pr(x(r)i | N).
2.2 Estimating Selection Parameters
Likelihood for Selection Model
Assume that the site is evolving under selection constraints s ∈ ℝ, h ∈ ℝ+, where s and h denote the selection strength and dominance parameters, respectively. By definition, the relative fitness values of genotypes 0|0, 0|1 and 1|1 are given by w00 = 1, w01 = 1 + hs and w11 = 1 + s. Then, νt+, the frequency at time τt + 1 (one generation ahead), can be estimated using νt+ = (w11·νt² + w01·νt(1 − νt)) / (w11·νt² + 2w01·νt(1 − νt) + w00·(1 − νt)²).
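The one-generation update is a short deterministic formula; a minimal sketch, with our own function name and assuming the fitness parameterization above:

```python
def one_generation(nu, s, h):
    """Deterministic one-generation update of the derived-allele frequency
    under selection with fitnesses w00 = 1, w01 = 1 + h*s, w11 = 1 + s."""
    w00, w01, w11 = 1.0, 1.0 + h * s, 1.0 + s
    # Mean fitness of the population at frequency nu.
    w_bar = w11 * nu**2 + 2 * w01 * nu * (1 - nu) + w00 * (1 - nu) ** 2
    return (w11 * nu**2 + w01 * nu * (1 - nu)) / w_bar
```

With s = 0 the update is the identity, and the boundary frequencies 0 and 1 are fixed points, as required of a selection-only update.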
The machinery for computing the likelihood of the selection parameters is identical to that for population size, except for the transition matrices. Hence, here we only describe the definition of the transition matrix Qs,h of the selection model. Let Q(τ)s,h[i,j] denote the probability of transition from i/2N to j/2N in τ generations; then the single-generation entries are Binomial around the post-selection frequency, Qs,h[i,j] = C(2N, j) (νi+)^j (1 − νi+)^(2N−j), where νi+ is the one-generation update of i/2N (See [20], Pg. 24, Eqn. 1.58-1.59).
The maximum likelihood estimates are given by ŝ, ĥ = arg maxs,h Πr Pr(x(r) | N̂, s, h).
Using grid search, we first estimate N (Eq. 8), and subsequently we estimate the parameters s, h (Eq. 12, Fig S3). By broadcasting and vectorizing the grid-search operations across all variants, a genome scan over millions of polymorphisms can be done in significantly less time than iterating a numerical optimization routine for each variant (see Results and Fig 4).
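A grid search of this kind can be sketched generically. The helper below is illustrative only (the actual scan precomputes transition matrices per grid point and vectorizes with numba); it assumes a likelihood function that is already vectorized over variants.

```python
import numpy as np

def grid_search(loglik_fn, data, s_grid, h_grid):
    """Evaluate loglik_fn(data, s, h) on a (s, h) grid and return the
    per-variant argmax; loglik_fn must return one value per variant."""
    scores = np.stack([[loglik_fn(data, s, h) for h in h_grid]
                       for s in s_grid])          # (|s|, |h|, n_variants)
    flat = scores.reshape(len(s_grid) * len(h_grid), -1)
    best = flat.argmax(axis=0)                    # flat grid index per variant
    s_hat = np.asarray(s_grid)[best // len(h_grid)]
    h_hat = np.asarray(h_grid)[best % len(h_grid)]
    return s_hat, h_hat, flat.max(axis=0)
```

Because all variants share the same grid of transition matrices, the expensive matrix computations are done once per grid point rather than once per variant.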
2.3 Empirical Likelihood Ratio Statistics
The likelihood-ratio statistic for testing directional selection, computed for each variant, is given by H = log( Pr(x | s̄, h = 0.5, N̂) / Pr(x | s = 0, N̂) ), where s̄ = arg maxs Pr(x | s, h = 0.5, N̂). Similarly, we can define a test statistic for testing whether selection is dominant, D = log( Pr(x | ŝ, ĥ, N̂) / Pr(x | s̄, h = 0.5, N̂) ).
While extending the single-locus WF model to multiple linked loci can improve the power of the model [60], it is computationally and statistically expensive to compute the exact likelihood. In addition, computing the linked-loci joint likelihood requires haplotype-resolved data, which pool-seq does not provide. Here, similar to Nielsen et al. [45], we calculate a composite likelihood ratio score for a genomic region, 𝓗 = (1/|L|) Σℓ∈L Hℓ, where L is a collection of segregating sites and Hℓ is the likelihood ratio score for each variant ℓ in L. The optimal value of the hyper-parameter L depends upon a number of factors, including the initial frequency of the favored allele, recombination rates, linkage of the favored allele to neighboring variants, population size, coverage, and time since the onset of selection (duration of the experiment). In S1 Text, we provide a heuristic to compute a reasonable value of L, based on experimental data.
We work with a normalized value of 𝓗, given by 𝓗* = (𝓗 − μ𝒞)/σ𝒞, where μ𝒞 and σ𝒞 are the mean and standard deviation of 𝓗 values in a large region 𝒞. We found different chromosomes to have different distributions of 𝓗 values, and therefore decided to use single chromosomes as 𝒞.
2.4 Hypothesis Testing
Single-Locus tests
Under neutrality, log-likelihood ratios can be approximated by a χ² distribution [66], and p-values can be computed directly. However, Feder et al. [23] showed that when the number of independent samples (replicates) is small, the χ² distribution is a crude approximation to the true null distribution and results in more false positives. Following their suggestion, we first compute the empirical null distribution using simulations with the estimated population size (See Fig S1). The empirical null distribution of the statistic H is used to compute p-values as the fraction of null values that exceed the test score. Finally, we use Storey and Tibshirani’s method [59] to control the False Discovery Rate in multiple testing.
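Computing empirical p-values from a simulated null distribution amounts to a rank computation; a sketch with a hypothetical helper, using a standard +1 pseudo-count so that p-values are never exactly zero:

```python
import numpy as np

def empirical_pvalues(stats, null_stats):
    """p-value = fraction of simulated null values >= observed statistic.

    Sorting the null once lets each lookup be a binary search, which
    matters when millions of variants are tested against a large null."""
    null_sorted = np.sort(null_stats)
    # Count of null values >= each observed statistic.
    ge = len(null_sorted) - np.searchsorted(null_sorted, stats, side='left')
    # +1 pseudo-count keeps p-values strictly positive.
    return (ge + 1) / (len(null_sorted) + 1)
```

The resulting p-values can then be passed to any FDR procedure, such as Storey and Tibshirani's q-value method mentioned above.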
Composite likelihood tests
Similar to single-locus tests, we compute the null distribution of the 𝓗* statistic using whole-genome simulations with the estimated population size, and subsequently compute FDR. The simulations for generating the null distribution of 𝓗* are described next.
2.5 Simulations
We use the same simulation procedure for two purposes. First, we use them to test the power of Clear against other methods in small genomic windows. Second, we use the simulations to generate the distribution of null values for the statistic to compute empirical p-values. We mainly chose parameters that are relevant to D. melanogaster experimental evolution [35]. See also Fig 1-A for illustration.
I. Creating initial founder-line haplotypes. Using msms [21], we created neutral populations of F founding haplotypes with the command $./msms <F> 1 -t <2μWNe> -r <2rNeW> <W>, where F = 200 is the number of founder lines, Ne = 10^6 is the effective founder population size, r = 2 × 10^-8 is the recombination rate, and μ = 2 × 10^-9 is the mutation rate. The window size W is used to compute θ = 2μNeW and ρ = 2rNeW. We chose W = 50Kbp for simulating individual windows for performance evaluations, and W = 20Mbp for simulating D. melanogaster chromosomes for p-value computations.
II. Creating initial diploid population. An initial set of F = 200 haplotypes was created in step I, and each was duplicated to create F homozygous diploid individuals, simulating the generation of inbred lines. N diploid individuals were then generated by sampling with replacement from the F individuals.
III. Forward Simulation. We used forward simulations for evolving populations under selection. We consider selection regimes in which the favored allele is chosen from standing variation (not de novo mutations). Given the initial diploid population, the position of the site under selection, selection strength s, number of replicates R = 3, recombination rate r = 2 × 10^-8, and sampling times 𝒯 = {0, 10, 20, 30, 40, 50}, simuPop [48] was used to perform forward simulation and compute allele frequencies for all of the R replicates. For hard-sweep (respectively, soft-sweep) simulations we randomly chose a site with initial frequency ν0 = 0.005 (respectively, ν0 = 0.1) to be the favored allele. For generating the null distribution with drift for p-value computations, we used this procedure with s = 0.
IV. Sequencing Simulation. Given allele frequency trajectories we sampled depth of each site in each replicate identically and independently from Poisson(λ), where λ ∈ {30,100,300} is the coverage for the experiment. Once depth d is drawn for the site with frequency ν, the number of reads c carrying the derived allele are sampled according to Binomial(d,ν). For experiments with finite depth the tuple <c, d> is the input data for each site.
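Step IV is straightforward to sketch; the helper below is illustrative (our own name) and vectorizes the Poisson depth and Binomial read-count draws over sites:

```python
import numpy as np

def simulate_pool_seq(freqs, coverage, rng=None):
    """Simulate pool-seq read counts from true allele frequencies.

    Per site: depth d ~ Poisson(coverage), then derived-allele reads
    c ~ Binomial(d, freq). Returns (c, d) arrays matching freqs' shape."""
    rng = np.random.default_rng(rng)
    freqs = np.asarray(freqs, dtype=float)
    d = rng.poisson(coverage, size=freqs.shape)
    c = rng.binomial(d, freqs)
    return c, d
```

Applied to the allele-frequency trajectories from step III, this produces the ⟨c, d⟩ tuples that serve as input to the method for each site, replicate, and time point.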
3 Results
Modeling Allele Frequency Trajectories in Small Populations
We first tested the goodness of fit of the discrete versus continuous models in modeling allele frequency trajectories, under general E&R parameters. For this purpose, we conducted 100K simulations with two time samples 𝒯 = {0, τ}, where τ ∈ {1, 10, 100} is the parameter controlling the density of sampling in time. In addition, we repeated simulations for different values of starting frequency ν0 ∈ {0.005, 0.1} (i.e., hard and soft sweeps) and selection strength s ∈ {0, 0.1} (i.e., neutral and selection). Then, given initial frequency ν0, we computed the expected distribution of the frequency of the next sample ντ under the two models to make a comparison. Fig 2A-F shows that Brownian motion (the continuous model) is inadequate when ν0 is far from 0.5, or when sampling times are sparse (τ > 1). If the favored allele arises from standing variation in a neutral population, it is unlikely to have a frequency close to 0.5, and the starting frequencies are usually much smaller (see Fig S2). Moreover, in typical D. melanogaster experiments, sampling is sparse; often, the experiment is designed so that 10 ≤ τ ≤ 100 [26, 35, 46, 71].
In contrast to the Brownian motion approximation, discrete Markov chain predictions (Eq. 11) are highly consistent with empirical data for a wide range of simulation parameters (Fig 2A-M). Moreover, the discrete Markov chain can be modified to model the case when the allele is under selection.
Detection Power
We compared the performance of Clear against other methods for detecting selection. For each method, we calculated detection power as the percentage of true positives identified with a false-positive rate ≤ 0.05. For each configuration (specified by values of the selection coefficient s, starting allele frequency ν0, and coverage λ), the power of each method is evaluated over 2000 distinct simulations, half of which model neutral evolution and the rest positive selection.
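Power at a fixed false-positive rate can be computed by thresholding at the appropriate quantile of the neutral score distribution; a sketch with a hypothetical helper:

```python
import numpy as np

def power_at_fpr(neutral_scores, selected_scores, fpr=0.05):
    """Fraction of selection simulations whose score exceeds the
    threshold giving the requested false-positive rate on neutral runs."""
    # The (1 - fpr) quantile of neutral scores is the detection threshold.
    thresh = np.quantile(neutral_scores, 1 - fpr)
    return np.mean(np.asarray(selected_scores) > thresh)
```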
We compared the power of Clear with the Gaussian process (GP) [60], FIT [23], and CMH [1] statistics. FIT and GP convert read counts to allele frequencies prior to computing the test statistic. Clear shows the highest power in all cases, and its power stays relatively high even for low coverage (Fig 3 and Table S1). In particular, the difference in performance between Clear and other methods is pronounced when the starting frequency is low. The advantage of Clear stems from the fact that a favored allele with low starting frequency might be missed by low-coverage sequencing; in this case, incorporating the signal from linked sites becomes increasingly important. We note that methods using only two time points, such as CMH, do relatively well for high selection values and high coverage. However, the use of time-series data can increase detection power in low-coverage experiments or when the starting frequency is low. Moreover, time-series data provide means for estimating the selection parameters s, h (see below). Finally, as Clear is robust to changes in coverage, our results (Fig 3B,C) suggest that taking many samples with lower coverage is preferable to sparse sampling with higher coverage.
Site-identification
In general, localizing the favored variant using pool-seq data is a nontrivial task due to extensive linkage disequilibrium [61]. To measure performance, we sorted variants by their H scores and computed the rank of the favored allele for each method. For each setting of ν0 and s, we conducted 1000 simulations and computed the rank of the favored mutation in each simulation. The cumulative distribution of the rank of the favored allele over the 1000 simulations for each setting (Fig 5) shows that Clear outperforms the other statistics.
An interesting observation is the contrast between site-identification and detection [39, 61]. When selection strength is high, detection is easier (Fig 3A-F), but site-identification is harder, due to the high LD between flanking variants and the favored allele (Fig 5A-F). Moreover, site-identification becomes more difficult when the initial frequency of the favored allele is low, since at the onset of selection, LD between the favored allele and its nearby variants is high. For example, when coverage λ = 100 and selection coefficient s = 0.1, the detection power is 75% for a hard sweep, but 100% for a soft sweep (Fig 3B-E). In contrast, the favored site was ranked at the top in 14% of hard-sweep cases, compared to 95% of soft-sweep simulations.
Estimating Parameters
Clear estimates the effective population size N̂ and selection parameters ŝ and ĥ as a byproduct of the hypothesis testing. We computed the bias of the selection estimate (s − ŝ) and dominance estimate (h − ĥ) for Clear and GP over 1000 simulations in each setting. The distribution of the error (bias) for 100× coverage is presented in Fig 6 for different configurations. Fig S4 and Fig S5 provide the distribution of estimation errors for 30× and 300× coverage, respectively. For hard sweeps, Clear provides estimates of s with lower variance of bias (Fig 6A). For soft sweeps, GP and Clear both provide unbiased estimates of s with low variance (Fig 6B). Fig 6C-D shows that Clear provides unbiased estimates of h as well, when h ∈ {0, 0.5, 1, 2} and s = 0.1. We also tested whether Clear provides unbiased estimates of N, by estimating population size on 1000 simulations with N ∈ {200, 600, 1000}. As shown in Fig 7A-C, the maximum likelihood is attained at the true value of the parameter.
Running Time
As Clear does not compute the exact likelihood of a region (i.e., does not explicitly model linkage between sites), the complexity of scanning a genome is linear in the number of polymorphisms. Calculating the score of each variant requires 𝒪(TRN³) computation for 𝓗. However, most of the operations can be vectorized across all replicates to reduce the effective running time per variant. We conducted 1000 simulations and measured running times for computing the site statistics H, FIT, CMH, and GP with different numbers of linked loci. Our analysis reveals (Fig 4) that Clear is orders of magnitude faster than GP, and comparable to FIT. While slower than CMH in time per variant, the actual running times are comparable after vectorization and broadcasting over variants (see below).
These times can have practical consequences. For instance, to run GP in the single-locus mode on the entire pool-seq data of the D. melanogaster genome from a small sample (≈1.6M variant sites), it would take 1444 CPU-hours (≈1 CPU-month). In contrast, after vectorizing and broadcasting operations over all variants using the numba package, Clear took 75 minutes to perform a scan, including precomputation, while the fastest method, CMH, took 17 minutes.
3.1 Analysis of a D. melanogaster Adaptation to Alternating Temperatures
We applied Clear to the data from a study of D. melanogaster adaptation to alternating temperatures [26, 46], in which 3 replicate samples were chosen from a population of D. melanogaster maintained for 59 generations under alternating 12-hour cycles of hot stressful (28°C) and non-stressful (18°C) temperatures, and sequenced. In this dataset, sequencing coverage differs across replicates and generations (see S2 Fig of [60]), which makes variant depths highly heterogeneous (Fig S8).
We first filtered out heterochromatic, centromeric, and telomeric regions [25], as well as variants with a collective coverage of more than 1500 across all 13 populations: three replicates at the base population, two replicates at generation 15, one replicate at generation 23, one replicate at generation 27, three replicates at generation 37, and three replicates at generation 59. After filtering, we ended up with 1,605,714 variants.
Next, we estimated a genome-wide population size of N̂ = 250 (Fig 7-E), which is consistent with previous studies [33, 46]. The likelihood curves of Clear are sharper around the optimum compared to those of Bollback et al.’s [11] method (see Supplementary Fig. 1 in [46]). Also, chromosomes 3L and 3R appear to have smaller population sizes (Fig 7-D), with N̂ = 200 and 150, respectively. Others have made similar observations on this data. In particular, Jonas et al. [33] showed that the chromosome-wise population size varies even more when it is computed for each replicate separately (see Table 1 in [33]). For instance, N̂ is 131 for chromosome 3R replicate 1, while it is 328 for chromosome X replicate 2.
While it would be ideal to compute the Clear statistic for each replicate and chromosome separately, computing empirical p-values and significant regions becomes computationally intensive, as the empirical null distribution of each replicate and each chromosome needs to be computed. Hence, we use a single genome-wide estimate N̂ = 250 in all analyses, but we normalize the statistic 𝓗* separately for each chromosome.
We use a heuristic calculation (See S1 Text) to choose the sliding window size L as the distance at which the LD between the favored mutation and a site L/2 bp away remains strong. For D. melanogaster parameters, we obtained L = 30Kbp. We computed the normalized test statistic 𝓗* on sliding windows of size 30Kbp and step size 5Kbp over the genome (See Fig 8-A).
The empirical null distribution of 𝓗* was estimated by creating 100 whole-genome simulations (400K statistic values) as described in Section 2.5. Then, the p-value of the test statistic in each region of the experimental data was calculated as the fraction of null statistic values greater than or equal to the test statistic (see Fig S9). After correcting for multiple testing, we identified 5 contiguous intervals (Fig 8) satisfying FDR ≤ 0.05 and covering 2,829 polymorphic sites. We further performed single-locus hypothesis testing on the 2,829 sites to identify 174 individual variants with FDR ≤ 0.01 (Fig 8-B).
The final set of 174 variants falls within 32 genes (Table S3), including many serine protease inhibitors (serpins) and other genes involved in endocytosis. Recycling of synaptic vesicles is seen to be blocked at high temperature in temperature-sensitive Drosophila mutants [36]. This is also supported by GO enrichment analysis, where a single GO term, ‘inhibition of proteolysis’, is found to be enriched (corrected p-value: 0.0041). To test for dominant selection, we computed the D statistic on simulated neutral and experimental data, and computed p-values accordingly. After correcting for multiple testing, 96 variants were discovered with FDR ≤ 0.01 (Fig S10).
3.2 Analysis of Outcrossing Yeast Populations
We also applied Clear to 12 replicate samples of outcrossing yeast populations [14], where samples are taken at generations 𝒯 = {0, 180, 360, 540}. We observed significant variation in the genome-wide site frequency spectrum of certain populations over different time points for some replicates (Fig S11). The variation does not have an easily identifiable cause. Therefore, we focused the analysis on seven replicates, r ∈ {3, 7, 8, 9, 10, 11, 12}, whose genome-wide site frequency spectra remained consistent over the time range (Fig S12).
We estimated the population size to be N̂ = 2000 haplotypes, and computed ŝ, ĥ, and the H statistic accordingly. To compute p-values, we created 1M single-locus neutral simulations matching the experimental data’s initial frequencies and coverage. Setting the FDR cutoff to 0.05, only 18 and 16 variants show significant signal for directional and dominant selection, respectively (Fig S10). The selected variants for directional selection cluster in two regions, which match 2 of the 5 regions (regions C and E in Fig. 2-a in [14]) identified by Burke et al. in their preliminary analysis.
4 Discussion
We developed a computational tool, Clear, that can detect regions and variants under selection in E&R experiments. Using extensive simulations, we show that Clear outperforms existing methods in detecting selection, locating the favored allele, and estimating model parameters. Also, while being computationally efficient, Clear provides means for estimating population size and for hypothesis testing.
Many factors, such as small population size, finite coverage, linkage disequilibrium, finite sampling for sequencing, duration of the experiment, and the small number of replicates, can limit the power of tools for analyzing E&R data. Here, through discrete modeling, Clear estimates population size and provides unbiased estimates of s and h. It adjusts for the heterogeneous coverage of pool-seq data, and exploits the presence of linkage within a region to compute a composite likelihood ratio statistic.
It should be noted that, even though we described Clear for small fixed-size populations, the statistic can be adjusted for other scenarios, including changing population sizes when the demography is known. For large populations, transitions can be computed on sparse data structures, as for large N the transition matrices become increasingly sparse. Alternatively, frequencies can be binned to reduce dimensionality.
The comparison of hard and soft sweep scenarios showed that the initial frequency of the favored allele can have a nontrivial effect on the statistical power for identifying selection. Interestingly, while it is easier to detect a region undergoing strong selection, it is harder to locate the favored allele in that region.
There are many directions to improve the analyses presented here. In particular, we plan to focus our attention on other organisms with more complex life cycles, experiments with variable population sizes, and longer sampling time spans. As evolve-and-resequence experiments continue to grow, deeper insights into adaptation will go hand in hand with improved computational analysis.
Software and Data Availability
The source code and running scripts for Clear are publicly available at https://github.com/airanmehr/clear.
The D. melanogaster data were originally published in [26, 46]. The dataset of the D. melanogaster study, up to generation 37, was obtained from the Dryad Digital Repository (http://datadryad.org) under accession DOI: 10.5061/dryad.60k68. Generation 59 of the D. melanogaster study was accessed from the European Nucleotide Archive (http://www.ebi.ac.uk/ena/) under project accession number PRJEB6340. The dataset from the experimental evolution of yeast populations [14] was downloaded from http://wfitch.bio.uci.edu/~tdlong/PapersRawData/BurkeYeast.gz (last accessed 01/24/2017). UCSC browser tracks for the D. melanogaster and yeast data analyses are provided in Suppl. Data 1 and 2, respectively.
Conflict of interest
VB is a co-founder, has an equity interest, and receives income from Digital Proteomics, LLC (DP). The terms of this arrangement have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies. DP was not involved in the research presented here.
5 S1 Text: Choosing Window Size
In genome-wide scans for detecting selection, we apply the Clear statistic on sliding windows of length L bp, averaging the single-locus statistic values within each window to obtain the composite statistic. While the statistic is robust to variation in window size, choosing a very large window, over which LD has decayed, will weaken the composite signal, and choosing a small window will reduce the power of composite likelihoods. Here, we use a systematic calculation to choose L as the distance at which the LD between the favored mutation and a site L/2 bp away remains strong.
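The sliding-window averaging described above can be sketched as follows. The step size and the handling of empty windows are our own assumptions for the illustration; Clear's implementation may differ.

```python
import numpy as np

def composite_scan(positions, site_stats, L=30_000, step=10_000):
    """Average single-locus statistics within sliding windows of length L bp,
    advancing by `step` bp; returns (window_start, mean_statistic) pairs."""
    positions = np.asarray(positions)
    site_stats = np.asarray(site_stats, dtype=float)
    out = []
    for start in np.arange(0, positions.max() + 1, step):
        in_win = (positions >= start) & (positions < start + L)
        # empty windows score 0 by convention in this sketch
        out.append((start, site_stats[in_win].mean() if in_win.any() else 0.0))
    return out
```

Peaks of the resulting window-level statistic then mark candidate regions under selection.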
Consider a segregating site l bp away from the favored allele in a selective sweep, and let ρτ be the LD between the favored allele and that site τ generations after the onset of selection. Following Eqs. 30-31 in [58], ρτ can be expressed in terms of the initial LD ρ0, a ‘decay factor’ ατ = e−rτl due to recombination, and a ‘growth factor’ βτ due to selection, where K(τ) = 2ντ(1 − ντ) is the heterozygosity at the selected site and r is the recombination rate (crossovers/bp/generation). Under typical parameter settings, linkage to the favored allele is expected to increase after the onset of selection and then decrease due to crossover events (see Fig S13-A). Since ρ0 is unknown in pool-seq E&R experiments, we compute the value of l at which the ratio ρτ/ρ0 remains large.
In E&R scenarios, we let τ be the time of the last sampling. For a given s, we then compute the smallest such window size L over all possible starting frequencies, where ν̂τ depends on the initial frequency ν0 and the selection strength s (Eq. 9).
We used the D. melanogaster dataset parameters, N = 250, r = 2 × 10−8, and τ = 59, to compute the optimal window size for values of Ns ranging from weak to strong selection: Ns ∈ {20, 100, 200, 500}, i.e., s ∈ {0.08, 0.4, 0.8, 2}. We set L = 30 Kbp (see Fig S13-B) to provide good resolution for detecting weak selection.
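The ingredients of this calculation can be sketched numerically. Below, the deterministic logistic sweep trajectory stands in for Eq. 9 (an assumption of this sketch, since Eq. 9 is not reproduced here), and the decay factor ατ = e−rτl is evaluated at a site L/2 bp from the sweep for the parameters quoted above.

```python
import numpy as np

def nu_tau(nu0, s, tau):
    """Deterministic logistic sweep trajectory (standing in for Eq. 9):
    nu_t = nu0 * e^{s t} / (1 + nu0 * (e^{s t} - 1))."""
    g = np.exp(s * tau)
    return nu0 * g / (1.0 + nu0 * (g - 1.0))

def decay_factor(r, tau, l):
    """alpha_tau = exp(-r * tau * l): LD decay due to recombination."""
    return np.exp(-r * tau * l)

# D. melanogaster-like parameters from the text
r, tau = 2e-8, 59
for L in (10_000, 30_000, 100_000):
    alpha = decay_factor(r, tau, L / 2)  # site L/2 bp from the favored allele
    print(L, round(alpha, 4))
```

With r = 2 × 10−8 and τ = 59 the decay over these distances is mild, so the choice of L is driven mainly by the growth factor and the density of informative sites rather than by recombination alone.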
Acknowledgments
AI, AA, and VB were supported by grants from the NIH (1R01GM114362) and the NSF (DBI-1458557 and IIS-1318386). CS was supported by the European Research Council grant ArchAdapt.