Abstract
A composite likelihood ratio test implemented in the program SweepFinder is a commonly used method for scanning a genome for recent selective sweeps. SweepFinder uses information on the spatial pattern of the site frequency spectrum (SFS) around the selected locus. To avoid confounding effects of background selection and variation in the mutation process along the genome, the method is typically applied only to sites that are variable within species. However, the power to detect and localize selective sweeps can be greatly improved if invariable sites are also included in the analysis. In the spirit of a Hudson-Kreitman-Aguadé test, we suggest to add fixed differences relative to an outgroup to account for variation in mutation rate, thereby facilitating more robust and powerful analyses. We also develop a method for including background selection modeled as a local reduction in the effective population size. Using simulations we show that these advances lead to a gain in power while maintaining robustness to mutation rate variation. Furthermore, the new method also provides more precise localization of the causative mutation than methods using the spatial pattern of segregating sites alone.
Introduction
Rapid advances in sequencing technology during the past few years have facilitated studies using genome-wide molecular data for detecting signatures of selective sweeps (Akey et al. 2002; Carlson et al. 2005; Wang et al. 2006; Voight et al. 2006; Kelley et al. 2006; Kimura et al. 2007; Williamson et al. 2007; Tang et al. 2007; Sabeti et al. 2007; Xia et al. 2009; Qanbari et al. 2012; Ramey et al. 2013; Long et al. 2013; Chávez-Galarza et al. 2013; Huber et al. 2014), and a large number of computational methods have been developed for this purpose (e.g., Fu & Li 1993; Kim & Stephan 2002; Sabeti et al. 2002, 2007; Kim & Nielsen 2004; Nielsen et al. 2005; Voight et al. 2006; Jensen et al. 2007; Boitard et al. 2009; Chen et al. 2010; Pavlidis et al. 2010, 2013; Li 2011). The various methods differ in the assumptions that they make about the selective sweep. For example, the extended haplotype test and its derivatives are powerful in cases where the beneficial mutation has not yet reached fixation in the population (Sabeti et al. 2002, 2007; Voight et al. 2006). Methods based on measures of population subdivision rest on the assumption that in geographically structured populations a selective sweep has a locally confined effect on genetic diversity, which increases population differentiation at the position of the sweep (Akey et al. 2002; Chen et al. 2010). More recently, statistics have been developed specifically for the detection of soft sweeps, i.e. a pattern caused by multiple haplotypes sweeping to high frequencies (Ferrer-Admetlla et al. 2014; Garud et al. 2015).
In the present study, we are solely concerned with the model of a classical hard selective sweep in a single population, and we assume that the beneficial mutation has reached fixation not too long ago. The methods usually applied in this scenario aim to detect deviations in the shape of the site frequency spectrum (SFS), which can be quantified with simple summary statistics like Tajima’s D or Fay and Wu’s H. In addition, more powerful statistics have been developed that explicitly model the effect of a selective sweep on the SFS in a likelihood ratio framework (Kim & Stephan 2002; Nielsen et al. 2005). Kim and Stephan (2002) proposed a composite likelihood ratio statistic based on calculating the product of marginal likelihood functions for all sites on a chromosome under models with and without a selective sweep at a particular position, and under the assumption of a panmictic population of constant size. The resulting composite likelihood ratio is then computed for each position of interest to evaluate the evidence for a sweep at those positions. This method, therefore, does not only incorporate information regarding the SFS, but does so in a way that uses the spatial distribution of segregating alleles of different frequencies. The null distribution of the test statistic is approximated using simulations. An extension to this test was proposed by Nielsen et al. (2005). In this method, the overall genomic SFS is used as the neutral, or background, model instead of using the standard neutral model as the null. The distribution of the SFS under the alternative hypothesis of selection is derived by considering the way a selective sweep would modify the observed background distribution of allele frequencies. This leads to a computationally fast method, facilitating genome-wide analyses. Nielsen et al. (2005) also argued that the use of the overall genomic SFS to represent the neutral case leads to increased robustness, and showed that the method was robust to a two-epoch growth model and an isolation-migration model with population growth in both populations, with parameters estimated from human single nucleotide polymorphism (SNP) data (Marth et al. 2004). Since then, it has become clear that while this method may be more robust than some previous SFS-based approaches, it can produce a high proportion of false positives if there has been a strong recent bottleneck in population size, but a standard neutral model is used to calculate critical values (Jensen et al. 2005; Pavlidis et al. 2008).
If invariable sites are included in the analysis, then both the methods of Kim and Stephan (2002) and Nielsen et al. (2005) may be sensitive to assumptions regarding selective constraint and mutation rates. A region with strongly reduced levels of variation due to selective constraint or reduced mutation rate may be misinterpreted as a region that has experienced a recent selective sweep (Nielsen et al., 2005; Boitard et al., 2009; Pavlidis et al., 2010). For these reasons, Nielsen et al. (2005) proposed to use only polymorphic sites, an option that became incorporated as default in both Sweepfinder (Nielsen et al. 2005) and SweeD (Pavlidis et al. 2013).
Background selection can also lead to locally reduced levels of neutral variation (Charlesworth et al., 1993, 1995; Hudson and Kaplan, 1994, 1995; Nordborg et al., 1996; Charlesworth, 2012; Cutter and Payseur, 2013) and cannot be ignored for the study of neutral polymorphisms in many cases (Williford & Comeron 2010; Cutter & Payseur 2013; Messer & Petrov 2013). Cutter and Payseur (2013) argue that the inevitability and prevalence of deleterious mutations necessitates the incorporation of background selection in the null model when identifying positive selection. There is a well-developed mathematical framework for quantifying the strength of background selection given the genome-wide mutation rate, recombination rate, position of functional elements and distribution of fitness effects (Hudson & Kaplan 1995; Nordborg et al. 1996). As data sets and methods for estimating the effect of background selection for each position in the genome are becoming available (McVicker et al. 2009; Comeron 2014), the objective of developing methods for detecting positive selection that can take background selection into account is becoming tenable.
Here, we explore the potential for improving the composite likelihood ratio test of SweepFinder (Nielsen et al. 2005) by either including invariant sites that differ with respect to an outgroup (i.e., fixed differences), or all invariant sites, in addition to polymorphic sites. When only including fixed differences, the method incorporates the information typically represented in a Hudson-Kreitman-Aguadé (HKA) test (Hudson et al. 1987), but adds the information from the spatial distribution of allele frequencies. We show that this approach is robust to variation in mutation rate across the genome, and develop an approach for incorporating estimates of the strength of background selection into the SweepFinder framework. Using the reduction in diversity relative to divergence as a necessary hallmark of a selective sweep in our model also helps to reduce false positives, e.g. in the case of a recent population bottleneck. Finally, we compare results of both the old and the new version of the likelihood ratio test applied to human genetic data.
Materials and Methods
Including invariant sites into the SweepFinder framework
Starting with n aligned DNA sequences, each of length L, we wish to determine whether a selective sweep has occurred at some defined position along the sequence. Based on results of Durrett and Schweinsberg (2004), Nielsen et al. (2005) derived an approximate formula for pk*, the probability of observing k derived alleles, k ∈{1,2,…,n- 1}, in a sample of size n, immediately after a selective sweep, for a site at a particular distance (d) from the selected mutation. For each k, pk* is a function of d, the background allele frequency distribution p=(p1,p2,…,pn-1), and the parameter α=r ln(2 Ne)/s. Here, r is the per-base per-generation recombination rate, s is the selection coefficient, and Ne is the effective population size. The parameter pk is the expected proportion of sites, not affected by the sweep, in which the derived allele has a frequency of k/n in the sample. The vector p is commonly estimated as the observed SFS from the whole genome, under the assumption that only a small and therefore negligible proportion of positions are affected by selection. The parameter α quantifies the relative influence of recombination and selection, with small values of α indicating strong sweeps.
The equations in Nielsen et al. (2005) allow for the incorporation of invariant sites that may or may not be fixed differences relative to an outgroup, using p=(p0, p1,…,pn) as the definition of p, and with the modification that the upper limit of the sum in equation (5) of Nielsen et al. (2005) is n and not n-1. The quantity pk* is a function of the probability of a lineage escaping a selective sweep, Pe=1-exp(-αd), where d is the distance between the polymorphic site and the sweep location. Our new version of SweepFinder allows defining distances between sites as genetic distance. This is achieved by allowing d to be defined by a recombination map rather than by physical distance as in the previous version. As in Nielsen et al. (2005), we then define the composite likelihood ratio statistic CLR = 2[log(CLsweep) - log(CLbackground)], where CLsweep is the composite likelihood maximized over alpha, and CLbackground is the composite likelihood calculated under the assumption of α = ∞. This is a composite likelihood ratio, and not a full likelihood ratio, because sites in the genome are not independent, but correlated due to linkage disequilibrium. One thing to notice, about which there has existed some confusion in the literature, is that this approach is not window based but in theory incorporates information from all SNPs in the genome to inform the CLR calculated for a single point in the genome. However, for computational efficiency SweepFinder uses a cut-off for distances from the focal SNP to include in the calculation. As distances become large, the contribution to the likelihood ratio approaches zero. The value used for the cut-off in SweepFinder is αd = 12, corresponding to a probability of a lineage escaping a sweep of 0.999994. Furthermore, SweepFinder calculates probabilities on a grid of recombination distances and uses a smooth interpolation to approximate probabilities for a particular point.
The effect of including invariant sites on the SFS is illustrated in Figure 1. In a region close to the site of the selective sweep, variability is reduced because almost all the probability mass is concentrated on fixed alleles. Notice also, that as the mutation rate scales all categories proportionally, a change in the mutation rate will not change the SFS defined on {1, 2,…,n} (Figure 1a and 1b). This statement does not hold true when invariant sites that do not differ from the outgoup (Figure 1c) are incorporated.
Accounting for background selection
A B-value (B), is the factor by which the effective population size is expected to be reduced due to background selection, i.e. Ne* = NeB, where Ne and Ne* are the effective population sizes with and without background selection, respectively (Charlesworth 2012). We will assume that a reasonable estimate of the ‘B-value map’, the value of B for each site in the genome, is available (see for example McVicker et al., 2009, for humans). We note that this assumption limits the use of our method to organisms for which such estimates have been obtained. We also note that we only model the main effect of background selection: the well-known reduction in effective population size. However, background selection can also affect the distribution of allele frequencies (Charlesworth et al. 1993, 1995; Hudson & Kaplan 1994; Zeng & Charlesworth 2011; Lohmueller et al. 2011a; Nicolaisen & Desai 2013), an effect that is ignored here.
Based on the B-value map, the expected site frequency spectrum can be adjusted simply by multiplying all categories in the spectrum, except for the zero and the n (fixed differences) category, by B, i.e., by setting for 1 < k < n-1, as the expected diversity reduction is proportional to B. The n category can be adjusted as described in the next section. The zero category can be obtained by standardization, i.e. . If the zero category is not included in the analysis, all included categories will have to be standardized to ensure that the frequencies sum to 1. The calculation of the CLR then proceeds as in Nielsen et al. (2005).
Effect of background selection on number of fixed differences
We assume the availability of a sample of n chromosomes and a single chromosome from an outgroup species, which split from the ingroup species g generations ago. Fixed differences are defined as sites with an allele that is invariant within the ingroup sample, but different from the allele at the orthologous position of the outgroup chromosome (Figure 2). The expected number of fixed differences, K, in the sample is then E[K] = μ(2Tanc − Tin), where Tin is the time to the most recent common ancestor in the ingroup sample, Tanc is the divergence time between ingroup and outgroup, and μ is the per-generation mutation rate. We further assume a standard neutral coalescent model with populations of constant sizes Ne,in and Ne,anc for the ingroup population, and ancestral population, respectively (Figure 2), and that the split time, g, is so large that we can assume Pr(Tin > g) ≈ 0. Then E[Tin] = 4Ne,in(1-1/n), where n is the sample size of ingroup sequences, and E[Tanc] = g+2Ne,anc. Then, under an infinite sites model and the relative number of fixed differences with and without background selection is which reduces to for Ne,anc =Ne,in = Ne. In the limit of large split times (g → ∞), E[K(B)]/E[K] ≈ 1, and the effect of background selection on fixed differences can generally be ignored if g >> Ne/n. In our new version of SweepFinder, if the B-value map is included for sweep detection, estimates of Ne,in, Ne,anc, and g have to be provided to the software.
Constant size and bottleneck simulations
Simulations were performed under the model described in Figure 2 assuming L = 100kb and n = 30, using msms (Ewing & Hermisson 2010).
We set the split time, g, between ingroup and outgroup to 20 coalescent time units (2Ne generations), resulting in a neutral divergence of 0.1. The scaled mutation rate θ = 4Neμ per site was set to 0.005 and the population scaled recombination rate per site, 4Ner, to 0.02. Those parameters where chosen to be comparable to the ones in (Nielsen et al. 2005). One chromosome was sampled from the outgroup species to classify invariant sites into sites that differ or do not differ to the outgroup. To analyze the effect of reduced mutation rate in a genomic region compared to the background, we varied the mutation rate between 0.1 and 0.9 times the mutation rate in other regions. Further, we simulated two demographic scenarios, a constant size and a bottleneck population. In simulations with selection, the selected mutation was introduced in the population at specified times (0.01, 0.02, 0.04, 0.06, 0.08, 0.16, 0.24,…, 1.2), at a frequency of 1/(2Ne with a population scaled selection coefficient of 2Nes = 200. We only kept simulations in which the mutation did not get lost (-SFC option in msms).
For the bottleneck simulations, we varied onset (0.004, 0.04 and 0.4), strength (0.05, 0.1 and 0.5) and duration (0.08 and 0.4), and explored all possible combinations of those parameters. To compare different bottleneck scenarios, θ was scaled depending on the bottleneck parameters to keep SNP density constant for all simulations (on average ∼1850 SNPs per simulation). This was achieved by calculating a scaling factor (f) using the formula of Marth et al. (2004) and the approach described in DeGiorgio et al. (2014). The recombination rate was scaled to be 4fNer to keep the mutation over recombination rate ratio comparable to the constant size simulations. The split time was also adjusted to g/f.
For the simulations with selective sweeps, we used 200 replicates for each parameter setting and sweep start time, and assumed Ne = 10,000. For calculation of the false positive rate, we conducted 4,000 neutral simulations under each bottleneck condition. For power calculations we generally assumed that the correct demographic model was known and used to identify critical values for the test, while for investigations of robustness we used the standard neutral model to estimate critical values. In all cases, the background site frequency spectrum was estimated using 1,000 neutral simulations. Note that in our analyses, the significance level is set so that 5% of all simulated 100kb regions are expected to contain at least one outlier, i.e. it is an experiment-wise significance level based on our simulated sequence length.
Simulation of background selection
Background selection was simulated with the forward simulation software SFS_CODE (Hernandez 2008). To reduce the computational burden we simulated relatively small populations of Ne = 250 (Hernandez 2008). We used n = 15 and assumed constant population sizes with neutral and deleterious mutation rate of θ = 0.0025 per bp, g/(4Ne) = 2, 4Ner = 0.15, and L = 100kb. We further assumed a selection coefficient of 2Nes = -50, reducing the neutral diversity by background selection by 40%. In the middle of the sequence (from 37.5kb to 62.5kb) we introduced a 100-fold reduction in recombination rate, which led to a local increase in the effect of background selection and an 80% reduction in SNP density (see Figure S1). This reduction in recombination rate mimics a selective sweep by locally reducing diversity through the effect of background selection (Figure S1). While the effect of background selection is more likely to act on a megabase scale (McVicker et al. 2009), we simulate strong background selection in a small segment of simulated sequence to keep the data sets small reducing the computational burden of the simulations. However, the difference in scale should not affect the generality of our conclusions.
To simulate selective sweeps in conjunction with background selection, a single positively selected mutation was introduced into the population 0.02 coalescence time units (2Ne) in the past in the middle of the sequence, with a selection coefficient of 2Nes = 2000, or 0.1 coalescence time units in the past with a selection coefficient of 2Nes = 200. Whenever the mutation was lost from the simulation, the output was discarded and the simulation was repeated. For simulations without background selection, we set the deleterious mutation rate to zero. The composite likelihood ratio was calculated using a grid of 40 points for each simulated data set. The neutral simulations described above were used as background site frequency spectrum. For the HKA test we used non-overlapping windows of length 5kb.
Analysis of human data
We used data from nine unrelated European individuals sequenced by Complete Genomics (Drmanac et al. 2010). Data and filtering steps were the same as in (DeGiorgio et al. 2014). We found that in low complexity regions around the centromeres and elsewhere in the genome, diversity drops to low levels while divergence to chimpanzee stays constant or even increases relative to other regions. CRG100 values, a measure of local alignability (Derrien et al. 2012), were downloaded from the UCSC Genome Browser at http://genome.ucsc.edu/. Low complexity regions are highly correlated with low values of CRG100 values, and increased levels of missing data. Therefore we only retained SNPs and fixed differences with a CRG value of 1 and full sample size. We also excluded windows with average CRG value of less than 0.9, in 100kb windows moving by 50kb.
We obtained recombination rates between pairs of sites from the sex-averaged pedigree-based human recombination map from deCODE Genetics (Kong et al. 2010). For the sweep scan, we calculated a composite likelihood ratio at grid points with 1kb spacing. We ran both standard SweepFinder, using only polymorphic sites (CLR1), and our new method using polymorphic sites, fixed differences relative to chimpanzees and the B-value map from McVicker et al (2009) (CLR2B). We assume an effective population size of humans and the human-chimpanzee ancestor population of 10,000 and 99,000, respectively, and a split time of 240,000 generations (McVicker et al. 2009). To look for overlaps with previous sweep scans, we use the supplementary table from (Akey 2009), compiling SFS based scans (Carlson et al. 2005; Kelley et al. 2006; Williamson et al. 2007), LD based scans (Wang et al. 2006; Voight et al. 2006; Kimura et al. 2007; Tang et al. 2007; Sabeti et al. 2007), and one FST based scan (Akey 2009).
Results
Including diversity as a sweep signal increases power and precision
We compare the power and accuracy of the CLR test when including only variable sites (CLR1), variable sites and fixed differences (CLR2), and all sites (CLR3), in the calculation of the composite likelihood ratio. CLR1 is the CLR that is calculated by current sweep detection software (Nielsen et al. 2005; Pavlidis et al. 2013). We start with a simple scenario of a constant population size with no background selection, and an advantageous mutation in the middle of the sequence, with selection strength of 2Nes = 200 and varying start times (see Methods).
The power drops quickly with the age of the selected mutation, and approaches zero for sweeps that start more than 0.5 coalescence time units (2Ne generations) in the past (Figure 3a). The root-mean-square error (RMSE) of the estimated location of the sweep also increases for older sweeps (Figure 3b). At an age of 0.5 coalescent time units, localization using the CLR1 statistic is not better than picking a site at random. In contrast, CLR2 and CLR3 still have power until 0.8 time units in the past. Furthermore, for sweeps that start 0.2 coalescence time units in the past, there is an almost 40% increase in power. In summary, both power and accuracy to localize the selected allele vastly increase when including fixed sites and there is little difference between including all sites (CLR3) and fixed differences (CLR2).
Including only fixed differences maintains robustness against mutation rate variation
We investigated the effect of varying mutation rates on the inference of sweeps. To this end we use two sets of simulations in 100 kb windows: one set with a population mutation rate of 0.005 and another set of simulations with reduced mutation rates relative to the first set. The likelihood ratio is then calculated using the first set of simulations as the background SFS when calculating the CLR for the second set (see Materials and Methods). The power is estimated by simulating a third set of simulations, with similarly varying mutation rates as in the second set, but with a beneficial mutations with selection coefficient 2Nes = 200 arising at 0.08 coalescence units in the past. The selected site is placed in the middle of the simulated region. In both cases, the null distribution of the test statistic is obtained using simulations with a constant high mutation rate of 0.005 and no selective sweeps.
If all sites are used for inference (CLR3), the power is close to 1 irrespective of the mutation rate. However, the false positive rate increases rapidly with the reduction in the mutation rate, so that at a 60% reduction already half the signals are false positives and at a reduction of 40%, almost all of the signals from the neutral simulations are false positives (Figure 4). This explains the apparently constant power. In contrast to CLR3, the power of both CLR1 and CLR2 reduces with the reduction in mutation rate (Figure 4). For CLR1, this reduction in power is due to the reduced SNP density. The power for CLR1 is only 80% to begin with and drops to 55% at a reduction in mutation rate by 50%. CLR2 is performing much better: the power to detect a sweep is still at 80% with a mutation rate reduction of 50%. The false positive rate for both CLR1 and CLR2 stays at or below the expected 5% level. In fact, the tests become extremely conservative when a mutation rate that is too high is used to obtain the null distribution of the composite likelihood ratio. This is because the distribution of the composite likelihood ratio is not invariant to the number of SNPs included in the analysis. Including many more SNPs for generating the null distribution (as a consequence of a higher mutation rate) than used in the analyses of the data, will result in a conservative test.
Robustness to population bottlenecks
We simulated several bottleneck scenarios, varying onset, duration and strength of the bottleneck (Figure 5a), and calculated the false positive rates for the three sweep statistics (CLR1-3). Critical values for a 5% significance level were obtained from simulations with a constant size population. For each bottleneck scenario, we adjusted mutation rate, recombination rate and split time to the outgroup, so that the expected number of SNPs as well as divergence to the outgroup are comparable for all bottleneck models and for the constant size model (see Methods). This is equivalent to adjusting mutation rate and recombination rate in the simulations used to obtain critical values to match the observed data.
In a scenario with recent (onset = 0.004 or 0.04) and strong (strength = 0.05) or intermediate (strength = 0.1) bottlenecks, this generates a large proportion of false positives (>87%) if the population size is assumed to be constant (Figure 5b). The proportion of false positives is smaller if the bottleneck is old, as most lineages coalesce before or during the bottleneck. This is true for all three CLR statistics. However, by including invariant sites or fixed differences into the CLR framework, we increase robustness to bottlenecks whenever the chance of surviving the bottleneck is relatively small, e.g. when the bottleneck is strong (5%), or intermediate (10%) and has a long duration (0.2).
As a specific example, European humans are assumed to have experienced a bottleneck during colonization of Europe. Estimated bottleneck parameters (Lohmueller et al. 2011b) indicate a relatively recent, short, but strong bottleneck (onset = 0.014, duration = 0.005, strength = 0.05). Simulating data under this scenario results in a proportion of false positives of 0.21 for CRL1 and CLR2 and 0.24 for CLR3, suggesting that constant population size is not a suitable demographic model for calculating significance thresholds for any of the three CLR tests.
False positives due to background selection are prevented by including a B-value map
A strong reduction in diversity relative to divergence in regions of the genome can be caused not only by selective sweeps, but also by the effects of deleterious mutations on linked neutral variation, i.e. background selection. We adapted SweepFinder to enable the inclusion of genome-wide estimates of this effect, the B-value map, to account for this type of variation. To evaluate the method, we simulated a genomic region with increased background selection, i.e. a local reduction in diversity due to background selection (Figure S1, Figure 6 and Methods). The background SFS used to calculate the CLR statistics was based on, otherwise identical, neutral simulations. To evaluate power in the presence of background selection, we simulated data with both background selection and a recently completed selective sweep located in the middle of the sequence (Figure 6). The nominal false positive rate, which is used to determine the nominal significance level, was estimated from neutral simulations without background selection.
The HKA test and the uncorrected CLR2 and CLR3 cannot distinguish background selection from selective sweeps, as is evident from the nearly 100% false positives under our strong background selection scenario (Figure 7a). If only polymorphic sites (CLR1) are used, the test does not suffer from an elevated level of false positives, indicating that CLR2 and CLR3 mainly pick up on the diversity reduction. However, if the diversity reduction due to background selection is factored in using a B-value map, the statistics return to the desired behavior in that the FPR corresponds to the nominal significance level, while maintaining increased power as compared to CLR1.
Analysis of a human genetic variation dataset
We screened the data from nine unrelated European individuals sequenced by Complete Genomics (Drmanac et al. 2010) for selective sweeps to prove the utility of our improvements to SweepFinder. We compare the composite likelihood ratio across the whole genome, calculated using only polymorphic sites (CLR1), with our new approach by including fixed differences to chimpanzees into the calculation (CLR2). To account for varying diversity across the genome due to background selection we also incorporate the B-value map from McVicker et al. (2009) into the calculation of CLR2, henceforth referred to as CLR2B.
Due to the complex human demography and the added complication of background selection, we do not calculate critical values, but report the 0.2% most extreme regions in Table 1. This approach has previously been used in other selection scans (e.g., Voight et al. 2006) under the argument that it is an outlier approach, although we notice that no formal testing has been done here or in Voight et al. (2006) to determine the degree to which the most extreme values indeed are outlying with respect to some parametric distribution.
The strongest sweep signal is on chromosome 4, 33.6 Mbp, a region without any annotated genes. The closest gene, ARAP2, is 2.15 Mbp downstream from the CLR2 peak. This sweep region has a B-value close to one, and a strong reduction in diversity relative to divergence. The peak in CLR1 shows that this region is characterized by a sweep-like site frequency spectrum. This region was also listed as a candidate region in LD-based (Wang et al. 2006; Voight et al. 2006; Kimura et al. 2007; Sabeti et al. 2007) and SFS- based sweep scans (Carlson et al. 2005; Kelley et al. 2006; Williamson et al. 2007).
The gene with the strongest CLR2B signal is KIAA1217, which was suggested to affect lumbar disc herniation susceptibility (Karasugi et al. 2009). The gene is also an outlier for haplotype-based sweep statistic for detecting incomplete soft or hard sweeps, in an African population (Ferrer-Admetlla et al. 2014). This may suggest that the variant is fixed, or at very high frequency in Europe, but still polymorphic in Africa. Another gene in one of the outlier regions, HERC2, is known to modulate iris color and blonde hair (Wilde et al. 2014). This candidate has previously been identified in a screen for population-specific sweeps using XP-CLR (Chen et al. 2010). Analyses of ancient DNA suggests that strong selection has been operating on HERC2 in western Eurasia during the past 5000 years (Wilde et al. 2014).
About half of our outlier regions in Table 1 overlap with at least one candidate region of previous sweep scans in humans (Akey 2009), and most of them are also outlier regarding CLR1. However, there are some notable exceptions: One example is the sweep region on chromosome 7, at 72.6 Mbp, with the genes BCL7B, FZD9 and BAZ1B. This region has a small CLR2B percentile rank of 0.0005, but a much larger CLR1 percentile rank (0.071), and is not listed in Akey (2009).
In conclusion, we show that CLR2B shows enrichment for previously detected candidates, but also identifies novel sweep signals. These previously undetected sweeps are likely to be enriched for sweeps that started between 0.2 and 0.8 Ne generations ago and thus escaped detection with LD, FST or SFS based methods.
Discussion
We evaluated the performance of a composite likelihood ratio test for detecting selection sweeps (Nielsen et al. 2005) when including fixed differences in the likelihood ratio in addition to SFS information, using extensive simulations. We show that there can be a marked increase in power as well as a reduction in false positive rate for a number of different scenarios in several different models of mutation rate variation, population bottlenecks and background selection. We also show that estimates of the strength of background selection can be included into the framework to prevent false positives in regions with strong, long-term background selection. By applying the method to human genetic data, we detect novel regions that are not identified as candidate regions with the standard SweepFinder approach.
Using invariant sites increases power and robustness
Given that both diversity and divergence change proportionally with mutation rate, we integrate variation in mutation rates by including a measure of divergence to an outgroup species. More specifically, we include sites that are not polymorphic within the species under investigation, but differs from an outgroup sequence, i.e. inferred fixed differences. If the SweepFinder CLR is calculated including all sites, ignoring outgroup information, variation in mutation rates can create false positives (Figure 4). However, if only fixed differences are added to the SFS, the power, but not the false positive rate, increases.
Furthermore, including invariant sites can increase robustness to certain bottleneck scenarios if the bottleneck is of intermediate to high strength, but not too recent (Boitard et al. 2009; Pavlidis et al. 2010). However, like many other methods for detecting selective sweeps (Barton 1998; Jensen et al. 2005; Voight et al. 2006; Boitard et al. 2009; Pavlidis et al. 2010; Crisci et al. 2013) the CLR test can suffer from a disturbingly high false positive rate in the presence of recent bottlenecks in population size. The use of an empirically derived demographic background SFS does not eliminate the sensitivity to demographic assumptions, because the CLR does not model the correlation in coalescence times along the sequence correctly irrespective of the demographic model. A bottleneck will force many lineages to coalesce in a short amount of time. If the duration of the bottleneck is such that at least some lineages escape the bottleneck in most regions, the few regions in which all lineages coalesce during the bottleneck may very much resemble regions that have been affected by a selective sweep. Realistic demographic models should be used if assigning p-values to individual sweeps.
Background selection as a null model for sweep detection
What is often neglected in previous discussions of diversity-based sweep detection methods is variation in diversity across the genome that is not caused by variation in mutation rate (or conservation level), but by variation in background selection, i.e. by the effect of deleterious mutations on linked neutral variation (Charlesworth et al. 1993; Hudson & Kaplan 1995; Charlesworth 2012; Cutter & Payseur 2013). A locally increased level of background selection will lead to a reduction in diversity similar to that expected after a selective sweep.
As datasets and methods for estimating the effect of background selection for each position in the genome are becoming available (McVicker et al. 2009), the objective of developing methods for detecting positive selection that can take background selection into account is becoming tenable. We present the first such method by including a map of predicted B-values in the calculation of the CLR. McVicker et al. (2009) provide such a B-value map for humans by defining functional elements based on mammalian sequence conservation, and fitting parameters to phylogenetic data. Therefore, reductions in neutral diversity in regions of the human data do not influence the local estimation of B. Our approach considers a local reduction in diversity as evidence for a selective sweep only if it is not also predicted by a local drop in B-values, i.e. background selection is our evolutionary null model (Cutter & Payseur 2013). We simulated background selection levels typical for humans (McVicker et al. 2009), and by accounting for background selection, we could effectively prevent false positives without loosing power. If one does not account for background selection, the proportion of false positives is large and similar to that of a HKA test (Figure 7a).
Application to human data
Finally, by applying our method to human genetic variation data, we show that the new method detects novel regions that were not identified as candidates using the standard SweepFinder approach. Based on our simulations, we would expect those regions to be enriched for old selective sweeps that started between 0.2 and 0.8 Ne generations ago, a time range where the power of other SFS-based, FST and LD based methods is low (Sabeti et al. 2006). Interestingly, the strongest signal we find, which has been missed by most previous scans, is near KIAA1217, a gene affecting lumbar disc herniation susceptibility. We speculate that the selection in this region may possibly be related to changes in human muscular-skeletal function subsequent to the evolution of erect bipedal walk. Increased risk of lumbar disc herniation is a likely consequence of bipedal walk. We may still be evolving to optimize muscular–skeleton functions after this recent, radical change in skeletal structure and function.
Data Accessibility
We did not generate any new data set for this study.
Author Contributions
CDH, IH and RN conceived the study question and the experimental design. CDH implemented the simulations. CDH and MD adapted the software SweepFinder to account for background selection. CDH and MD analyzed the simulations and the human genetic variation data. CDH, MD, IH and RN wrote the paper.
Acknowledgments
We gratefully acknowledge Melissa Hubisz for her help with the original SweepFinder software. This work was supported by the Austrian Science Fund (Vienna Graduate School of Population Genetics, FWF W1225) to C.D.H.