Abstract
Statistical methods for identifying adaptive mutations from population-genetic data face several obstacles: assessing the significance of genomic outliers, integrating correlated measures of selection into one analytic framework, and distinguishing adaptive variants from hitchhiking neutral variants. Here, we introduce SWIF(r), a probabilistic method that detects selective sweeps by learning the distributions of multiple selection statistics under different evolutionary scenarios and calculating the posterior probability of a sweep at each genomic site. SWIF(r) is trained using simulations from a user-specified demographic model and explicitly models the joint distributions of selection statistics, thereby increasing its power to both identify regions undergoing sweeps and localize adaptive mutations. Using array and exome data from 45 ‡Khomani San hunter-gatherers of southern Africa, we identify an enrichment of adaptive signals in genes associated with metabolism and obesity. SWIF(r) provides a transparent probabilistic framework for localizing beneficial mutations that is extensible to a variety of evolutionary scenarios.
Introduction
Adaptive mutations that spread rapidly through a population, via processes known as selective sweeps, leave distinctive signatures on genomes. These genomic signatures fall into three categories: differentiation among populations, long shared haplotype blocks, and changes in the site frequency spectrum (SFS). Statistics that are commonly used to detect genomic signatures of selective sweeps include FST1 for measuring population differentiation, iHS2 for identifying shared haplotypes, and Tajima’s D3 for detecting deviations from the neutral SFS. Some approaches, like SweepFinder4 and SweeD5, integrate information across sites by modeling changes to the SFS. Often, statistical scans for adaptive mutations proceed by choosing a particular genomic signature and a corresponding statistic, obtaining the statistic’s empirical distribution across loci in a genome-wide dataset, and focusing on loci that fall past an arbitrary but conservative threshold2,6–11.
Recently, there has been increased focus on developing composite methods for identifying selective sweeps, which combine multiple statistics into a single framework12–19; we refer to the statistics that are aggregated in composite methods such as these as “component statistics”. Most composite methods draw upon machine learning approaches like support vector machines12,15, deep learning19, boosting14,17, or random forest classification18 in order to identify genomic windows containing selective sweeps. These windows vary in size from 20kb to 200kb, often identifying candidate sweep regions containing many genes18,19. One method, the Composite of Multiple Signals or “CMS”13,16, uses component statistics that can be computed site-by-site in pursuit of localizing adaptive variants within genomic windows, but the output from this method cannot be interpreted without comparison to a genome-wide distribution. In addition, CMS must rely on imputation or other methods of compensation when component statistics are undefined, a complication that typically does not arise when using window-based component statistics. In a subset of populations from the 1000 Genomes Project, we found that more than half of variant sites had at least one undefined component statistic (Supplementary Table 1); iHS was frequently undefined because it requires a minor allele frequency of 5% to be computed2, and along with XP-EHH7, cannot be calculated near the ends of chromosomes or sequenced regions. This poses a particular problem when scanning for complete sweeps, defined here as sweeps in which the beneficial allele has fixed in the population of interest.
Here we introduce a Bayesian classification framework for detecting and localizing adaptive mutations in population-genomic data called SWIF(r) (SWeep Inference Framework (controlling for correlation)). SWIF(r) has three major features that enable genome-wide characterization of adaptive mutations: first, SWIF(r) computes the per-site probability of selective sweep, which is immediately interpretable and does not require comparison with a genome-wide distribution; second, no imputation or compensation mechanisms are necessary in the case of undefined component statistics; and third, we explicitly learn pairwise joint distributions of selection statistics, which gives substantial gains in power to both identify regions containing selective sweeps and localize adaptive variants. Existing composite methods for selection scans have subsets of these features, but SWIF(r) combines all three in a unified statistical framework. Our approach incorporates the demographic history of populations of interest, while being robust to misspecification of that history, and is also agnostic to the frequency of the adaptive allele, identifying both complete and incomplete selective sweeps in a population of interest. We assess SWIF(r)’s performance in simulations against state-of-the-art univariate and composite methods for identifying genomic targets of selective sweeps, and we confirm that we can localize known adaptive mutations in the human genome using data from the 1000 Genomes Project. We then apply SWIF(r) to identify novel adaptive variants in genomic data from the ‡Khomani San, an understudied hunter-gatherer KhoeSan population in southern Africa, representing the most basal human population divergence. Open-source software for training and running SWIF(r) is freely available at https://github.com/ramachandran-lab/SWIFr.
Results
We first describe the theoretical framework of SWIF(r) and compare SWIF(r) to existing sweep-detection methods using simulated data. We also validate the ability of our method to localize known adaptive mutations in data from the 1000 Genomes. To enable the application of SWIF(r) to single-nucleotide polymorphism (SNP) array datasets from diverse populations, we implement an algorithm for modeling ascertainment bias in simulated data used for training. To illustrate the power of our approach, we apply SWIF(r) to SNP array data from the ‡Khomani San of southern Africa. We show that genes bearing “SWIF(r) signals” — which we define as genomic loci at which SWIF(r) reports a posterior sweep probability greater than 50% — are associated with metabolism and obesity, and we use exome data to show that these genes contain multiple candidate adaptive mutations. Note that we train SWIF(r) on simulations of hard sweeps (Online Methods); our focus here is not on the relative roles of various modes of selection in shaping observed human genomic variation (for recent treatments on this question see18,20–23), although we note that SWIF(r) is extensible to multi-class classification, and could be used in future applications to explore multiple modes of selection. In this study, our focus is on localizing genomic sites of adaptive mutations that have spread through populations of interest via hard sweeps.
Implementation of SWIF(r)
SWIF(r) draws on Bayesian inference and machine learning to localize the genomic site of a selective sweep based on probabilities that incorporate dependencies among component statistics. Unlike genomic outlier approaches, the output of SWIF(r) can be interpreted directly for each genomic site: given a set of n component statistics for a site, SWIF(r) calculates the probability that the site is neutrally evolving or, alternatively, is the site of a selective sweep. We will refer to these two classes as “neutral” and “adaptive” respectively, and these posterior probabilities can be computed as follows:
$$P(\text{adaptive} \mid s_1, \ldots, s_n) = \frac{\pi \, P(s_1, \ldots, s_n \mid \text{adaptive})}{\pi \, P(s_1, \ldots, s_n \mid \text{adaptive}) + (1 - \pi) \, P(s_1, \ldots, s_n \mid \text{neutral})}, \qquad P(\text{neutral} \mid s_1, \ldots, s_n) = 1 - P(\text{adaptive} \mid s_1, \ldots, s_n) \tag{1}$$
where s1, …, sn represent observed values for n component statistics such as iHS and FST, and π is the prior probability of a sweep, which may be altered to reflect different genomic contexts. If a component statistic is undefined at a site, it is simply left out of Equation 1, and does not need to be imputed. The data for learning the likelihood terms, P(s1, …, sn|adaptive) and P(s1, …, sn|neutral), come from calculating component statistics on simulated haplotypes from a demographic model with and without simulated selective sweeps comprising a range of selection coefficients and present-day allele frequencies (Online Methods). We note that this general framework is similar to that used by Grossman et al.13 for CMS, which assumes that the component statistics are independent, and computes the product of posterior probabilities P(adaptive|si) for each statistic si. SWIF(r) strikes a balance between computational tractability and model accuracy by learning joint distributions of pairs of component statistics, thereby relaxing this strict independence assumption.
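The two-class form of Bayes' rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the SWIF(r) implementation: the function name `sweep_posterior` and its arguments are hypothetical, and the two likelihood values are assumed to have been evaluated already from distributions learned in simulation.
```python
def sweep_posterior(lik_adaptive, lik_neutral, prior):
    """Posterior probability of the 'adaptive' class for one site.

    lik_adaptive: P(s_1, ..., s_n | adaptive) evaluated at the observed statistics
    lik_neutral:  P(s_1, ..., s_n | neutral) evaluated at the same statistics
    prior:        prior sweep probability (pi in the text)
    All three names are illustrative assumptions, not part of SWIF(r)'s API.
    """
    numerator = prior * lik_adaptive
    denominator = numerator + (1.0 - prior) * lik_neutral
    if denominator == 0.0:
        raise ValueError("both class likelihoods are zero at this site")
    return numerator / denominator


if __name__ == "__main__":
    # The same likelihoods give very different posteriors under different priors.
    for pi in (0.5, 0.01, 0.0001):
        print(pi, sweep_posterior(0.9, 0.1, pi))
```
Because the prior enters only through the products in the numerator and denominator, lowering π pulls the posterior toward the neutral class without changing the learned likelihoods.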
We base SWIF(r) on a machine learning classification framework called an Averaged One-Dependence Estimator (AODE)24, which is built from multiple One-Dependence Estimators (ODEs), each of which conditions on a different component statistic in order to compute a posterior sweep probability at a given site. An ODE conditioning on Sj assumes that all other component statistics are conditionally independent of one another, given the class (neutral or adaptive) and the value of Sj. As shown in Equation 2, this assumption effectively reduces the dimensionality of the likelihood terms P(s1,…, sn|class) in Equation 1. The AODE then reduces variance by averaging all possible ODEs to produce a posterior probability that incorporates all pairwise joint probability distributions (Equation 3; Online Methods).
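Assumption made by ODEj (One-Dependence Estimator conditioning on Sj):
$$P(s_1, \ldots, s_n \mid \text{class}) = P(s_j \mid \text{class}) \prod_{i \neq j} P(s_i \mid s_j, \text{class}) \tag{2}$$
SWIF(r):
$$P(\text{adaptive} \mid s_1, \ldots, s_n) = \frac{\pi \sum_{j=1}^{n} P(s_j \mid \text{adaptive}) \prod_{i \neq j} P(s_i \mid s_j, \text{adaptive})}{\pi \sum_{j=1}^{n} P(s_j \mid \text{adaptive}) \prod_{i \neq j} P(s_i \mid s_j, \text{adaptive}) + (1 - \pi) \sum_{j=1}^{n} P(s_j \mid \text{neutral}) \prod_{i \neq j} P(s_i \mid s_j, \text{neutral})} \tag{3}$$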
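The averaging over One-Dependence Estimators can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' code: the class-conditional densities are hypothetical unit-variance Gaussians with a linear dependence between pairs of statistics, and every function name (`normal_pdf`, `toy_class_model`, `aode_class_score`, `aode_posterior`) is invented for this example. Undefined statistics are passed as `None` and are simply left out of the sums and products.
```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and s.d. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def toy_class_model(means, slope):
    """Hypothetical class model: Gaussian marginals P(s_j | class) and
    linear-Gaussian conditionals P(s_i | s_j, class). Illustration only."""
    marginal = {k: (lambda s, m=m: normal_pdf(s, m, 1.0)) for k, m in means.items()}
    conditional = {}
    for i in means:
        for j in means:
            if i != j:
                conditional[(i, j)] = (
                    lambda si, sj, mi=means[i], mj=means[j]:
                        normal_pdf(si, mi + slope * (sj - mj), 1.0)
                )
    return marginal, conditional


def aode_class_score(stats, marginal, conditional):
    """Sum over ODEs of P(s_j | class) * prod_{i != j} P(s_i | s_j, class),
    using only the statistics that are defined at this site."""
    defined = [k for k, v in stats.items() if v is not None]
    total = 0.0
    for j in defined:
        term = marginal[j](stats[j])
        for i in defined:
            if i != j:
                term *= conditional[(i, j)](stats[i], stats[j])
        total += term
    return total


def aode_posterior(stats, models, prior):
    """Posterior sweep probability from the averaged ODEs. The 1/n averaging
    factor is the same for both classes and cancels in the ratio."""
    if all(v is None for v in stats.values()):
        return prior  # nothing observed: the posterior equals the prior
    adaptive = prior * aode_class_score(stats, *models["adaptive"])
    neutral = (1.0 - prior) * aode_class_score(stats, *models["neutral"])
    return adaptive / (adaptive + neutral)


if __name__ == "__main__":
    models = {
        "adaptive": toy_class_model({"fst": 2.0, "ihs": 1.5}, slope=0.5),
        "neutral": toy_class_model({"fst": 0.0, "ihs": 0.0}, slope=0.2),
    }
    print(aode_posterior({"fst": 1.8, "ihs": 1.2}, models, prior=0.01))
    print(aode_posterior({"fst": 1.8, "ihs": None}, models, prior=0.01))
```
Setting `slope=0` in both toy models makes the conditionals ignore the statistic they condition on, in which case the posterior reduces to the one obtained by treating the statistics as independent.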
Calibration of posterior probabilities calculated by SWIF(r)
A desirable property of probabilities, like those calculated by SWIF(r), is that they be well calibrated: in this context, for the variant positions where the posterior probability reported by SWIF(r) is around 60%, approximately 60% of those sites should contain an adaptive mutation, and approximately 40% should be neutral. We implemented a smoothed isotonic regression scheme to calibrate the probabilities calculated by SWIF(r) (Online Methods). Briefly, when applying SWIF(r) to a given dataset, we calculate the empirical frequencies of neutral and sweep variants that are assigned posterior probabilities between 0 and 1 in simulation, and use isotonic regression25 to map the posterior probabilities to their corresponding empirical sweep frequencies (Supplementary Figure 1, Supplementary Figure 2). We then impose a smoothing function that prevents multiple posterior probabilities from being mapped to the same calibrated value (Supplementary Figure 3, Supplementary Figure 1E, Supplementary Figure 2E). This calibration procedure relies on the relative makeup of the training set; a classifier that is calibrated for a training set made up of neutral and sweep variants in equal parts would not be well-calibrated for a training set in which sweep variants only make up 1% of the whole. For each application of SWIF(r) in this study, we calibrated SWIF(r) for a specific training set makeup (Online Methods; see also Supplementary Figure 1 and Supplementary Figure 2).
The calibrated probabilities reported by SWIF(r) can be interpreted directly as the probability that a site contains an adaptive mutation, or fed into a straightforward classification scheme by way of a probability threshold; in this study, we classify sites with a posterior probability above 50% as adaptive SWIF(r) signals. The classifier may be tuned by altering either this threshold or the prior sweep probability π (Supplementary Figure 4).
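The core of the calibration idea, mapping raw posteriors to the empirical sweep frequency observed in simulation for an assumed class composition, can be sketched with a weighted pool-adjacent-violators fit. This is an illustrative assumption about one way to do it: the function names are hypothetical, each simulated site is used directly rather than binned, ties in the posterior are not treated specially, and the smoothing step described above is omitted.
```python
def weighted_pava(values, weights):
    """Pool-adjacent-violators: weighted least-squares fit of a
    nondecreasing sequence to `values` (already sorted by predictor)."""
    blocks = []  # each block is [weighted mean, total weight, number of points]
    for value, weight in zip(values, weights):
        blocks.append([value, weight, 1])
        # Merge backwards while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


def calibrate(posteriors, labels, sweep_fraction):
    """Return (sorted posteriors, calibrated values) for simulated sites.

    labels: 1 for simulated sweep sites, 0 for simulated neutral sites
    sweep_fraction: assumed share of sweep sites in the target composition
                    (for example 0.0005 for 0.05% sweep variants)
    """
    n_sweep = sum(labels)
    n_neutral = len(labels) - n_sweep
    w_sweep = sweep_fraction / n_sweep
    w_neutral = (1.0 - sweep_fraction) / n_neutral
    order = sorted(range(len(posteriors)), key=lambda i: posteriors[i])
    xs = [posteriors[i] for i in order]
    ys = [float(labels[i]) for i in order]
    ws = [w_sweep if labels[i] else w_neutral for i in order]
    return xs, weighted_pava(ys, ws)


if __name__ == "__main__":
    raw = [0.1, 0.6, 0.7, 0.9]
    truth = [0, 1, 0, 1]
    print(calibrate(raw, truth, sweep_fraction=0.5))
    print(calibrate(raw, truth, sweep_fraction=0.1))
```
Running the two calls in the example shows the point made in the text: the same raw posteriors map to lower calibrated values when sweep variants are assumed to be rarer.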
Performance of SWIF(r) using simulated data
We implemented SWIF(r) using the following component statistics, which can each be calculated site-by-site in a genomic dataset: FST1, XP-EHH7 (altered as in Wagh et al.26; Supplementary Note), iHS2, and difference in derived allele frequency (ΔDAF). Training simulations used the demographic model of Europeans, West Africans, and East Asians inferred by Schaffner et al.27, and simulated selective sweeps within each of those populations (Online Methods). We compared SWIF(r)’s performance against each component statistic, SweepFinder4, composite method CMS13 (altered by excluding ΔiHH because of non-normality; Supplementary Note, Supplementary Figure 5), and window-based sweep-detection methods evoNet19 (using the same component statistics as SWIF(r)), and evolBoosting14. We also evaluated the robustness of SWIF(r) to both demographic model misspecification and background selection.
In Figure 1A and Supplementary Figure 6, we evaluate the ability of SWIF(r) to localize the site of an adaptive mutation against that of its component statistics, the composite method CMS, and SweepFinder. The performance of each component statistic varies with different sweep parameters: for example, iHS is most powerful for identifying adaptive mutations that have not yet risen to high frequency within the population of interest, while XP-EHH and ΔDAF are more effective for those that have (Supplementary Figure 7). This underscores the advantage of composite methods for detecting selective sweeps when the parameters of the sweep are unknown12–19. Aggregating over many different sweep parameters, SWIF(r) outperforms each component statistic, as well as CMS and SweepFinder, improving the tradeoff between the false positive rate (fraction of neutral variants incorrectly classified as adaptive) and true positive rate (fraction of adaptive mutations that are correctly classified as such) (Figure 1, Supplementary Figure 6). SWIF(r) also outperforms CMS in distinguishing adaptive mutations from linked neutral variation (Supplementary Figure 8). The performance of SWIF(r) is particularly striking for incomplete sweeps: for example, in Figure 1B, SWIF(r) achieves up to a 50% reduction in the false positive rate relative to CMS for adaptive mutations that have only swept through 20% of the population at the time of sampling (see also Supplementary Figure 6A-C, noting that SweepFinder was designed to identify complete sweeps in a population of interest). For the same incomplete sweep simulations summarized in Figure 1B, Figure 1C shows the performance of each of the individual ODEs (Equation 2); in this particular evolutionary scenario, conditioning on FST or ΔDAF results in the best performance. However, the best-performing ODE changes based on the parameters of the selective sweep (Figure 1D).
By averaging across all ODEs, SWIF(r) is robust to variable performance of ODEs in the absence of prior knowledge of the true sweep parameters (Figure 1A-C).
While few composite methods for sweep detection operate site-by-site, there are a handful of machine-learning composite approaches that identify genomic windows containing adaptive mutations14,17–19. In order to compare SWIF(r) against such methods, we had to alter SWIF(r) to calculate window-based sweep probabilities; there are many potential ways to do this that may be differentially powerful, and here we chose simply to use the highest probability assigned to any variant within a given genomic window as the probability for that window. We compared window-based SWIF(r) to two state-of-the-art composite window-based methods: evolBoosting14, which combines 120 statistics using boosted logistic regression, and evoNet19, which was developed to jointly infer demography and selection using a deep learning framework (Online Methods). Since both SWIF(r) and evoNet are frameworks that are designed to incorporate any set of statistics, we implemented evoNet to use the same statistics as SWIF(r). When comparing SWIF(r) with evolBoosting, we used 40kb windows following Lin et al.14, and show that SWIF(r) outperforms evolBoosting across a range of sweep parameter values (Supplementary Figure 9). For comparison with evoNet, we used 100kb windows following Sheehan et al.19 (Supplementary Figure 10). We find that SWIF(r) performs similarly to or better than evoNet in this implementation, although we note that this analysis likely downplays the strengths of both methods, as each method has been altered from its original design to enable a direct comparison.
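The window-level summary described above, taking the highest per-site probability within each window, can be sketched as follows. The function name and the use of non-overlapping windows aligned to multiples of the window size are assumptions made for this illustration; the text does not specify how window boundaries were placed.
```python
def window_probabilities(positions, probabilities, window_size):
    """Map each non-overlapping window (keyed by its start coordinate)
    to the maximum per-site sweep probability observed inside it."""
    windows = {}
    for pos, prob in zip(positions, probabilities):
        start = (pos // window_size) * window_size
        windows[start] = max(windows.get(start, 0.0), prob)
    return windows


if __name__ == "__main__":
    # Four variants summarized in 40kb windows.
    sites = [10, 39_999, 40_000, 95_000]
    probs = [0.1, 0.7, 0.2, 0.9]
    print(window_probabilities(sites, probs, 40_000))
```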
While ROC curves are informative for illustrating the performance of different sweep detection methods, it is important to note that the genome has far more neutral variants than adaptive mutations. Therefore, more relevant performance comparisons can be made by illustrating the predicted false discovery rate (FDR) for a given true positive rate using Power-FDR curves. These curves depend on the composition of the training set, since the false discovery rate rises as the proportion of adaptive variants in the training set decreases. In Figure 1E and Supplementary Figure 6F, we plot Power-FDR curves for all methods using the same training set composition we use for calibration of SWIF(r) for application to the ‡Khomani San dataset (99.95% neutral variants and 0.05% adaptive variants). Curves for other training set compositions can be found in Supplementary Figure 11. Power-FDR comparisons between SWIF(r) and evolBoosting and evoNet can be found in Supplementary Figure 9F and Supplementary Figure 10F assuming that 1% of windows contain a sweep. SWIF(r) performs well relative to its component statistics, CMS, and SweepFinder; however, these analyses illustrate the inherent difficulty of site-by-site detection of adaptive mutations. Because there are so many more neutral variants than adaptive variants in the genome, even a small false positive rate can result in a substantial false discovery rate. Window-based methods, including window-based SWIF(r), may appear to have lower false discovery rates (since there are many fewer windows than variants, and thus fewer opportunities for false positives to arise; see Supplementary Figure 9 and Supplementary Figure 10), but this comes at the cost of a longer list of putative SNP targets, since each classified window contains a large number of individual variants.
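The dependence of the false discovery rate on class composition follows directly from the definitions of the true and false positive rates. The short sketch below makes the arithmetic explicit; the function name is hypothetical and the rates in the example are illustrative numbers, not results from this study.
```python
def predicted_fdr(tpr, fpr, sweep_fraction):
    """Expected share of positive calls that are neutral, given the
    true positive rate, false positive rate, and the fraction of
    sites that are truly adaptive."""
    true_positives = tpr * sweep_fraction
    false_positives = fpr * (1.0 - sweep_fraction)
    if true_positives + false_positives == 0.0:
        return 0.0  # no positive calls at all
    return false_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Illustrative operating point: TPR 50%, FPR 0.1%.
    for fraction in (0.5, 0.01, 0.0005):
        print(fraction, predicted_fdr(0.5, 0.001, fraction))
```
With these illustrative rates, a composition of 0.05% adaptive variants gives a predicted FDR of about 0.80, whereas an equal mix of the two classes gives about 0.002, even though the ROC operating point is identical.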
Robustness of SWIF(r) to demographic model misspecification
We first assessed the sensitivity of SWIF(r) to the demographic model used in training simulations using two very different demographic models of West Africans, Europeans, and East Asians, from Schaffner et al.27 (“Schaffner model”) and Gronau et al.28 (“Gronau model”). These demographies differ in multiple evolutionary parameter estimates: the Schaffner model includes an ancient population expansion and post-divergence bottlenecks, features not included in the Gronau model. Divergence times differ almost two-fold in some cases between the two models, with the Yoruban/Eurasian split at 47kya in the Gronau model and 88kya in the Schaffner model. Furthermore, the Schaffner model allows migration between East Asian, European, and West African populations, while the Gronau model does not. Effective population sizes also differ dramatically, in some cases over six-fold (see Supplementary Figure 12 for a full comparison).
We trained SWIF(r) using simulations from the Gronau model (including simulation of ascertainment bias; Online Methods), and tested SWIF(r) on simulated haplotypes drawn from the Schaffner model (with and without selective sweeps). As shown in Supplementary Figure 13, even with dramatic demographic differences and thinning to simulate ascertainment bias, SWIF(r) is quite robust to this misspecification.
Both the Schaffner model and the Gronau model, as implemented, do not include very recent population expansion, so we implemented a third demographic model from Gravel et al.29 that includes exponential population growth within the last 23,000 years. Since cosi cannot simulate selective sweeps overlapping with demographic changes, we only simulated sweeps beginning 5kya, allowing for 18,000 years of exponential expansion (Online Methods). In Supplementary Figure 13, we show that SWIF(r) is also robust to this recent population expansion.
Robustness of SWIF(r) to background selection
We also assessed the sensitivity of SWIF(r) to background selection by generating a set of training simulations containing neutral regions, selective sweeps, and exonic regions, using forward simulator slim30 (Online Methods). We trained SWIF(r) on both neutral and sweep simulations, and tested the ability of SWIF(r) to distinguish between exonic and sweep sites, relative to its ability to distinguish between neutral and sweep sites. We find that SWIF(r) is fully robust in this scenario, meaning that we would not expect background selection in genic regions to result in false positive sweep signals (Supplementary Figure 14). These results are aligned with those of Enard et al.31, who have shown that background selection has little to no effect on haplotype-based statistics iHS and XP-EHH (and in fact makes iHS more conservative).
SWIF(r) correctly localizes canonical adaptive mutations in humans
For application to data from phase 1 of the 1000 Genomes Project, we used training simulations from the Schaffner demographic model27, calibrated SWIF(r) for a training set composed of 0.01% sweep variants and 99.99% neutral variants (Supplementary Figure 1), and applied it to SNP array data from West African (YRI), East Asian (CHB and JPT), and European (CEU) populations. SWIF(r) reports high sweep probabilities at multiple SNPs within known and suspected selective sweep loci in each of these populations (Supplementary Table 2, Supplementary Table 3). Figure 2 illustrates the ability of SWIF(r) to localize sites of adaptive mutations within genomic regions containing canonical sweeps. Adaptive SNPs have been determined via functional experiments in SLC24A532, DARC33, and HERC234; we find that modeling the dependency structure among component statistics within SWIF(r) enables statistical localization of these experimentally identified adaptive mutations (Figures 2A,C,D). Methods that treat component statistics as independent, as CMS does, cannot localize these experimentally identified adaptive SNPs (Supplementary Figure 15, Supplementary Figure 16). In CHB and JPT, SWIF(r) recovers a strong adaptive signal in the vicinity of EDAR, offering new hypotheses for targets of selection in this genomic region. Whole-genome results with gene annotations can be found in Supplementary Figure 17, Supplementary Figure 18, Supplementary Figure 19 and Supplementary Table 2 (see also Supplementary Figure 20). False discovery estimates can be found in Supplementary Table 4. Of the 126 genes across these populations with SWIF(r) signals (i.e. at least one variant within the gene has posterior hard sweep probability greater than 50%), 63% were identified in at least one positive selection scan conducted in humans (Supplementary Table 3).
Adaptive loci in the ‡Khomani San are enriched for metabolism- and obesity-related genes
We applied SWIF(r) to samples from the ‡Khomani San, a formerly hunter-gatherer KhoeSan population of the Kalahari desert in southern Africa, using Illumina SNP array data for 670,987 phased autosomal sites genotyped in 45 individuals45 (Supplementary Note). The KhoeSan have likely occupied southern Africa for ∼100,000 years, and maintain the largest long-term Ne of any human population46,47, a feature that facilitates adaptive evolution. We trained SWIF(r) on simulations from the Gronau demographic model28 (Supplementary Figure 21, Supplementary Table 5, and Online Methods), and implemented an ascertainment modeling scheme to produce a training dataset with population-level site frequency spectra similar to the observed array data. Briefly, for each simulated haplotype, SNPs were subsampled to match the empirical three-dimensional unfolded SFS for YRI, CEU, and CHB+JPT individuals in the 1000 Genomes Project on the chips used to genotype the ‡Khomani San (Online Methods, Supplementary Figure 22, Supplementary Figure 23). We calibrated SWIF(r) for this dataset based on a training set composed of 0.05% sweep variants and 99.95% neutral variants (Supplementary Figure 2). After applying SWIF(r) to SNP data, we then examined whether genomic regions identified by SWIF(r) contain annotated functional mutations identified in high-coverage exome data from the same 45 individuals48 (Supplementary Note).
SWIF(r) identifies a number of genomic regions bearing signatures of selective sweeps in the ‡Khomani San, driven by extreme values in multiple component statistics that together produce a posterior sweep probability greater than 50% (Figure 3A,B; see also Supplementary Figure 24, Supplementary Figure 25, and Supplementary Table 6). These signals comprise 108 SNPs, of which 94 are distributed across 80 genes, and the remaining 14 are intergenic, defined as genomic variants that do not land within 50kb of an annotated gene (Supplementary Table 7). We observe an abundance of SWIF(r) signals within the Major Histocompatibility Complex (MHC), a region of immunity genes for which studies have indicated ongoing selection in many populations49–52, including an iHS outlier scan in the ‡Khomani San53. We show in Supplementary Figure 26 that the SWIF(r) signals in this region are not qualitatively different from the SWIF(r) signals we see throughout the genome, despite the fact that balancing selection is typically thought to be the primary mode of selection in the MHC54.
We tested for a common functional or phenotypic basis among the 80 genes bearing SWIF(r) signals by conducting a gene ontology enrichment analysis across public databases with Enrichr55. We find that these genes are significantly enriched for dbGaP categories related to adiponectin, body mass index, and triglyceride phenotypes (Figure 3C, Table 1). Specifically, SNPs in genes related to adiponectin (ADIPOQ, PEPD, DUT, and ASTN2) have among the highest posterior sweep probabilities (all ≥ 75%). SWIF(r) also identified SNPs within three other genes (PDGFRA, SIDT2, and PHACTR3) that have previously been associated with obesity and metabolism phenotypes (Figure 3B, Table 1). Some of the genes highlighted in Figure 3B are also involved in muscle-based phenotypes (Supplementary Note), but here we focus on the substantial evidence supporting the association of the highlighted genes with obesity and metabolism phenotypes in prior GWA and functional studies (Table 1).
One variant that SWIF(r) identifies, rs6444174, has a calibrated sweep probability of 90%, driven by extreme values at this SNP in FST, XP-EHH, and ΔDAF (Figure 3A; empirical p-values 4.4 × 10−4, 4.0 × 10−4, 5.5 × 10−4 respectively). This variant lies in ADIPOQ, which is expressed predominantly in adipose tissue56, and codes for adiponectin, a regulator of glucose and fatty acid metabolism. In a study of associations between ADIPOQ variants and adiponectin levels and obesity phenotypes in 2968 African American participants, rs6444174 was found to be associated with serum adiponectin levels in female participants (p = 6.15 × 10−5), and with body mass index in all normal-weight participants (p = 3.66 × 10−4). The allele at high frequency in the ‡Khomani individuals studied here corresponds to decreased adiponectin levels and increased BMI, respectively57.
Exome-based support for targets of selection identified by SWIF(r)
This SWIF(r) scan was performed using SNP array data ascertained from primarily Eurasian polymorphisms, a common feature of commercial SNP array platforms. Thus, the observed SWIF(r) signals are likely tagging haplotypes common in the ‡Khomani San, and may not themselves be causal polymorphisms. We examined high-coverage exome data48 within each gene to identify putatively functional mutations near the sites identified by SWIF(r) (see Supplementary Table 8 for full results). This allows us to identify variants not captured on SNP array platforms, including variants that are unique to the ‡Khomani San. We note that we did not include the MHC genes in this exome analysis, because of potential issues with mapping and phasing of exome sequence data in the MHC region. In ADIPOQ, we identify a missense mutation, rs13716447, for which the nearest SNP that is present on the SNP array is rs6444174 (less than 1kb away); rs6444174 has a calibrated SWIF(r) sweep probability of 90%, the highest in ADIPOQ (Figure 4). The missense T allele at rs13716447 is at high frequency in the ‡Khomani San relative to all other populations sequenced in the 1000 Genomes Project (27% vs. <0.5%; Figure 4). Furthermore, in the Simons Genome Diversity Project (SGDP), whose samples are drawn from 130 diverse and globally distributed human populations, only four copies of the missense allele at rs13716447 are found: two copies in a ‡Khomani San individual, and one copy each in a Namibian San individual and a Ju|’hoansi San individual. This SNP defines the two major haplogroups within the ADIPOQ gene in a median-joining haplotype network for the gene region (Supplementary Figure 27), providing some support for selection at this SNP.
Two other genes highlighted in Figure 3B harbor promising polymorphisms that may be related to the underlying causal haplotypes. In PEPD, we identify a novel polymorphism at 10% frequency in the ‡Khomani (chr19:33882361) which is a missense mutation approximately 42kb from the SNP identified by SWIF(r). We also identify a missense mutation in the first exon of PHACTR3 at 38% frequency in this sample, which is at < 2% frequency in other global populations including other Africans sequenced as part of the 1000 Genomes Project. Because the SNP array density is low, we expect that SWIF(r) signals in this population may in many cases be somewhat removed from the causal variants that these signals tag. We note that intronic variants in both PEPD and PHACTR3 have been identified as cis-eQTLs that affect RNAseq expression in adipose tissue, in two independent northern European cohorts58.
For some of the genes identified by SWIF(r), exome data either was not generated, or did not reveal nearby functional polymorphisms with differential allele frequencies between the ‡Khomani San and other worldwide populations. One such gene is RASSF8, previously annotated as under positive selection in the Namibian and ‡Khomani San populations relative to western Africans using XP-EHH59; in our SNP array analysis, we detect a cluster of four SNPs in RASSF8 within 70kb of each other, each with SWIF(r) sweep probability >98%. RASSF8 is present in the BMI, Triglycerides, Lipids, and Cholesterol dbGaP categories (Figure 3C), yet functional mutations underlying this SWIF(r) signal remain elusive.
Discussion
In this paper, we have presented both a new method for selective sweep detection, SWIF(r), and new insight into adaptive evolution in the ‡Khomani San. Not only does SWIF(r) outperform existing SNP-based component statistics and composite methods when detecting both complete and incomplete sweeps in simulation (Figure 1), it also localizes experimentally validated adaptive mutations using genomic data alone (Figure 2). SWIF(r) accounts for the confounding effect of neutral population histories when detecting sweeps by generating training simulations based on a demographic model, and we find it is robust to misspecification of the demographic model underlying the testing data (Supplementary Figure 13). We outline an algorithm for modeling SNP ascertainment in training simulations, thereby enabling the application of SWIF(r) to selection scans using genotype array data from diverse understudied populations like the ‡Khomani San (Online Methods). While some of the component statistics we use here may be fairly robust to ascertainment bias, this algorithm also enables future use of component statistics that are more vulnerable to ascertainment bias, such as SFS statistics, within SWIF(r)’s framework. When analyzing genotype and exome data from 45 ‡Khomani San individuals, we find that SWIF(r) signals tagging functional variants are enriched in genes associated with metabolism and obesity (Figure 3, Table 1, Figure 4).
Composite classification frameworks such as SWIF(r) quantitatively ground a common qualitative approach used in scans for adaptive sweeps based on summary statistics: evidence for selection at a locus is considered stronger when extreme values are observed for more than one statistic (Figure 3A). Furthermore, machine-learning approaches like SWIF(r) that incorporate joint distributions of selection statistics can detect sweep events that individual univariate statistics cannot (Supplementary Figure 28). SWIF(r) additionally reports calibrated probabilities assessing evidence for selective sweeps site-by-site, resulting in a transparent probabilistic framework for localizing adaptive mutations. These features allow for the localization of specific adaptive variants rather than adaptive regions, and minimize bias arising from undefined component statistics. While approaches such as Approximate Bayesian Computation can exploit higher-dimensional correlations in order to distinguish between selective sweep modes at candidate loci100, this comes at the cost of genome-scale tractability, and can be vulnerable to the curse of dimensionality19. The AODE framework allows us to transparently calculate probabilities without the need for imputation of undefined statistics, and our priors are made explicit, allowing for clearer interpretation. Future applications of SWIF(r) can easily incorporate new site-based summary statistics as they are developed (Supplementary Figure 29), and can assign variable site-specific prior sweep probabilities according to genomic annotations: for example, one could assign a smaller prior for synonymous variants relative to non-synonymous variants, or a higher prior in regulatory regions relative to intergenic regions31.
In order for the class probabilities reported by SWIF(r) to be practically interpretable, we calibrated SWIF(r), such that k% of variants with a posterior sweep probability of k% are indeed sweep variants. We have implemented a calibration scheme based on isotonic regression for SWIF(r) that maps the posterior sweep probabilities to their empirical sweep proportions in simulated data (Supplementary Figure 1, Supplementary Figure 2, Supplementary Figure 3), but importantly, this calibration relies on the composition of the training set used. While for some classifiers, the proportions of classes are known, or can be reliably estimated (e.g. see Durand et al.101 and Scheet and Stephens102), the proportion of sites throughout the human genome that are adaptive is unknown. For calibrating SWIF(r), we chose training sets made up overwhelmingly of neutral variants; while our calibration of SWIF(r) always preserves the rank order of posterior probabilities (Supplementary Figure 3), the specific choice of training set makeup can have a dramatic effect on the calibration. Therefore, a direct interpretation of the posterior probabilities reported by SWIF(r), or any other classifier that calculates probabilities, must incorporate knowledge of the scenarios used for training and calibration.
One caveat for interpretation of the SWIF(r) results presented here is that we train SWIF(r) on hard selective sweeps. In simulation, we find that SWIF(r) is also sensitive to sweeps from standing variation with a low initial frequency (Supplementary Figure 30); indeed, the sweep in West Africans in the gene DARC, for which SWIF(r) calculates a high sweep probability (Figure 2A), has recently been shown to have originated from standing variation in the ancestral population103. Given the multiple metabolism- and obesity-related sweep targets identified by SWIF(r) in the ‡Khomani San (Figure 3), we also suspect that some putative adaptive mutations identified by SWIF(r) may be components of polygenic adaptation.
Genes with SWIF(r) signals in our high-throughput genomic scan for selective sweeps in the ‡Khomani San have been independently identified in multiple GWA studies and functional experiments as associated with metabolism- and obesity-related phenotypes (Table 1). One way to interpret this signal is through the lens of the “thrifty gene” hypothesis, which posits that ready fat storage was positively selected for in hunter-gatherer populations due to the survival advantage it conferred in unreliable food cycles104. The hypothesis further states that modern disease phenotypes such as type 2 diabetes and obesity are the consequence of a radical shift in diet from ancestral environments and forager subsistence strategies to a contemporary environment with abundant food in the form of simple sugars, starches, and high fat, though this is a subject of much debate105,106. Although most indigenous Khoe and San groups of the Kalahari are classically considered small and thin, populations such as the Khoekhoe cow/goat pastoralists are characterized by steatopygia (i.e. extensive fat accumulation along the buttocks and thighs in women), as notoriously described by early European explorers and anthropologists107–109. While the thrifty gene hypothesis would predict an increase in metabolic pathology for these individuals, studies have shown that accumulated subcutaneous gluteofemoral fat, found in patients exhibiting steatopygia110,111, is protective against diabetes and other metabolic disorders112,113. The mutations and genes identified by our SWIF(r) scan, such as ADIPOQ, are natural targets for functional assays to determine the origins and consequences of subcutaneous versus visceral fat; future studies could merge such assays with phenotypic data on diabetes and metabolic syndromes in KhoeSan groups to gain new insight into the “obesity-mortality paradox”114.
In this selection scan, we also see an abundance of SWIF(r) signals in the MHC region involved in immunity. It is possible that this signal reflects balancing selection, which is the mode of selection canonically thought to be occurring within this region54. Indeed, it has been shown that the signatures of balancing selection and incomplete or recurrent sweeps may be similar to signatures of positive selection115,116. We note, however, that other studies using different methodologies have detected signatures of directional selection in the MHC in human populations117,118, and others have noted that fluctuating directional selection is a possible mechanism for pathogen-mediated selection in this region119.
The probabilistic framework of SWIF(r) suggests two natural extensions for future applications. First, while the use of SNP-based component statistics enabled us to localize adaptive mutations, SWIF(r) could easily incorporate region-based component statistics, including composite likelihood approaches like XP-CLR120, and SFS-based measures3,121 in order to help detect older selection events, for which haplotype-based statistics are less powerful. Second, future studies can exploit the flexibility and interpretability of SWIF(r) to conduct multi-class classification. Supplementary Figure 31 illustrates a preliminary extension of SWIF(r) that classifies sweeps based on the start time of positive selection. Recent methods have attempted multi-class sweep classification using hierarchical binary classification or other machine learning approaches14,18,22, but without the benefits of a transparent probabilistic framework in which priors are made explicit. Using the probabilistic framework of SWIF(r), future studies could determine the mode of adaptive evolution at genomic sites, including background selection or sweeps from standing variation or recurrent mutation21,22, or infer the timing or selective strength of an adaptive event100. Thus, SWIF(r) offers a technical advance in genome-wide sweep detection that can yield new insight into the modes and roles of selection in shaping population-genomic diversity.
Online Methods
Simulation of haplotypes for 1000 Genomes analysis
Simulations based on the demographic model of African, Asian, and European populations outlined in Schaffner et al.27 were carried out with the following alterations necessary for allowing the simulation of recent selective sweeps (within the last 30ky): no modern population growth (within the last 30 generations), and migration ending 500 generations following the Asian/European split instead of continuing to the present. We carried out 100 simulations of 1Mb regions from this neutral demographic model using cosi, which resulted in ∼400,000 neutral training points. We generated a new recombination map for each simulation with the recosim package within cosi, using a hierarchical recombination model that assumes a regional rate drawn from the observed distribution of rates in the deCODE genetic map122, and then randomly generates recombination hotspots with randomly drawn local rates27. For each simulation, we generated 120 1Mb-long haplotypes from each of the three populations.
Selective sweeps continued until the time of sampling, and were simulated for a range of sweep parameters: start time ranging over [5, 10, 15, 20, 25, 30kya], final allele frequency ranging over [0.2, 0.4, 0.6, 0.8, 1.0], and population of origin ranging over [African, Asian, European]. Note that these sweeps cover a range of incomplete as well as complete sweeps in a population of interest. The selection coefficients for each parameter set are fully determined by the effective population size, sweep start time, and final allele frequency, and are displayed in Supplementary Table 5. We calculated these selection coefficients using Equation 4 for complete sweeps (by which we mean sweeps where the beneficial mutation has reached fixation in the population of interest), and Equation 5 for incomplete sweeps, where t1 is the sweep start time and t2 is the sweep end time, both measured in generations from the present, Ne is the effective population size, and ϕ is the present-day frequency of the beneficial allele, 0 < ϕ ≤ 1 (ref. 123). The range of selection coefficients corresponds to a range of α = 2Ns of ∼100-4500. For each set of sweep parameters, we carried out 100 simulations with the adaptive allele located halfway along the 1Mb region, for a total of 9,000 sweep training points.
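The size of the sweep training set follows directly from the parameter grid described above. The following is a minimal sketch that enumerates the grid; the variable names are illustrative and are not taken from the SWIF(r) code base, and it assumes each sweep simulation contributes one training point (the adaptive site).

```python
from itertools import product

# Sweep parameters as listed in the text; the names are illustrative.
START_TIMES_KYA = [5, 10, 15, 20, 25, 30]
FINAL_FREQUENCIES = [0.2, 0.4, 0.6, 0.8, 1.0]
POPULATIONS = ["African", "Asian", "European"]
SIMS_PER_PARAMETER_SET = 100

# Every combination of start time, final frequency, and population of origin.
parameter_sets = list(product(START_TIMES_KYA, FINAL_FREQUENCIES, POPULATIONS))

# One training point (the adaptive site) per sweep simulation.
n_sweep_training_points = len(parameter_sets) * SIMS_PER_PARAMETER_SET

# A sweep is complete when the final frequency is 1.0, and incomplete otherwise.
n_complete_sets = sum(1 for _, freq, _ in parameter_sets if freq == 1.0)

print(len(parameter_sets), n_sweep_training_points, n_complete_sets)
```

Of the 90 parameter sets, 18 describe complete sweeps and the remaining 72 describe incomplete sweeps.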
Simulation of haplotypes for ‡Khomani San analysis
To train our classifier to identify selective sweeps in the ‡Khomani San, we used the demographic model inferred by Gronau et al.28 based on six diploid whole-genome sequences, one from each of six populations: European, Yoruban, Han Chinese, Korean, Bantu, and San. We used the inferred population sizes, coalescent times, and migration rates reported by Gronau et al.28, which are calibrated based on a 6.5 million year human-chimpanzee divergence, the presence of migration between the Yoruban and San populations, and 25 years/generation to construct a demographic model for the Han Chinese, European, Yoruban, and San, shown in Supplementary Figure 21.
Because Uren et al.45 showed that the ‡Khomani San have experienced recent gene flow from both Western Africa and Europe, we replaced migration in the Gronau model with two pulses of recent migration: one pulse from the Yoruban population with migration rate 0.179 at 7 generations ago, and one pulse from the European population with migration rate 0.227 at 14 generations ago. We found that these rates resulted in present-day admixture levels that matched those found in Uren et al.45 (Supplementary Note).
Using cosi27, we simulated 1Mb genomic regions, comprising both neutral and sweep scenarios as described earlier, with sample sizes matching the number of individuals in the filtered 1000 Genomes dataset, and the number of San individuals in our study.
Because the ‡Khomani have been isolated for so long, we included additional sweep scenarios, with sweeps beginning and ending between 30 and 60kya. We called these “old sweeps” and trained the classifiers on three classes: neutral, “old sweeps”, and “recent sweeps” (those occurring within the last 30ky). Since we found that the classifiers did not have enough power to reliably distinguish between old and recent sweeps (Supplementary Figure 31), in applications to data, we only considered the total probability of a sweep, given by the sum of the posterior probabilities for each sweep class.
Simulation of haplotypes including recent population growth
To test SWIF(r)’s robustness to misspecification of recent population growth, we implemented a set of simulations using a demographic model from Gravel et al.29 that estimates recent exponential population expansions with rates of 0.38% for Europe and 0.48% for East Asia over the last 23,000 years. Since the simulation software cosi27 cannot simulate sweeps and population-level changes simultaneously, we allowed the expansion to last from 23,000 years ago to 5,000 years ago, to allow for sweeps beginning 5,000 years ago. We also included the migration rates inferred by Gravel et al.29 between Europe, East Asia, and Africa, and between Africa and the ancestral population of Europe and East Asia. As in other analyses, we simulated selective sweeps spanning a range of present-day allele frequencies from 20% to 100%.
Simulation of background selection
For evaluating SWIF(r)’s robustness to background selection, we generated 3 sets of simulations of 1Mb each using forward simulator slim30: neutral regions, regions with a hard sweep, and genic regions. For genic regions, we followed Messer and Petrov21 to simulate gene structure: each simulation had one gene with 8 exons of 150bp each, separated by introns of 1.5kb, and flanked by a 550bp 5′UTR and a 250bp 3′UTR. Within exons and UTRs, 75% of sites were assumed to be functional. Mutations were assumed to be codominant, and fitness effects across different sites were assumed to be additive. Functional sites were divided into 40% “strongly deleterious” sites with selection coefficient −0.1, and 60% “weakly deleterious” sites with selection coefficients between −0.01 and −0.0001. The mutation rate was set at 2.5 × 10^−8 per site per generation, and the recombination rate at 10^−8. Note that for testing the robustness of SWIF(r), we only considered sites from these simulations that landed in exons or UTRs.
For all three sets of simulations, we simulated two populations with Ne = 5000, which split from each other 40,000 years ago. For sweep simulations, we drew selection coefficients for the beneficial allele from an exponential distribution with mean 0.03, and sweeps began 10,000 years ago. Since forward simulations are much more computationally intensive than coalescent simulations, we rescaled our parameters by a factor of 10 (10 times larger for mutation rate, recombination rate, and selection coefficients, 10 times smaller for population sizes and 10 times shorter for all times) to make the simulations feasible30.
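The rescaling described above can be written down compactly. The sketch below is illustrative only: `ForwardSimParams` and `rescale` are hypothetical names, not part of slim or of the SWIF(r) code, and times are kept in years exactly as quoted in the text.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ForwardSimParams:
    """Hypothetical container for the forward-simulation parameters named in the text."""
    pop_size: float              # Ne of each population
    mutation_rate: float         # per site per generation
    recombination_rate: float    # per site per generation
    mean_selection_coeff: float  # mean of the exponential distribution for s
    split_time_years: float
    sweep_start_years: float

def rescale(p: ForwardSimParams, factor: float = 10.0) -> ForwardSimParams:
    """Scale rates and selection up, and population sizes and times down, by `factor`."""
    return replace(
        p,
        pop_size=p.pop_size / factor,
        mutation_rate=p.mutation_rate * factor,
        recombination_rate=p.recombination_rate * factor,
        mean_selection_coeff=p.mean_selection_coeff * factor,
        split_time_years=p.split_time_years / factor,
        sweep_start_years=p.sweep_start_years / factor,
    )

unscaled = ForwardSimParams(5000, 2.5e-8, 1e-8, 0.03, 40_000, 10_000)
scaled = rescale(unscaled)
print(scaled)
```

Population-scaled products such as Ne × μ, Ne × r, and Ne × s are unchanged by this transformation, which is the usual rationale for rescaling forward simulations.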
Implementation of the Classifiers
For ease of comparison, we built SWIF(r) using the same statistics that comprise the Composite of Multiple Signals (CMS)13,16: the fixation index (FST), cross-population extended haplotype homozygosity (XP-EHH) (adapted for improved performance on incomplete sweeps; Supplementary Note), the integrated haplotype score (iHS), and change in derived allele frequency (ΔDAF). ΔiHH was excluded in applications to real data because of non-normality (Supplementary Note, Supplementary Figure 5).
Implementation of SWIF(r)
To avoid over-fitting the joint distributions modeled in the AODE framework (Equation 3), we fit Gaussian mixture models with full covariance matrices (i.e. containing nonzero off-diagonal entries) to the joint probability distributions of each pair of statistics Si and Sj within each scenario C (neutral or sweep), P(Si = si, Sj = sj|C), with the number of components ranging between three and five based on Bayesian Information Criterion (BIC) curves. Joint probabilities were learned using sites for which both component statistics were defined. We used the python package scikit-learn124 to compute the BIC curves and fit the mixture models. These mixture models capture the salient features of each pairwise joint distribution, as illustrated in the example in Supplementary Figure 32. Given the smoothed joint distribution learned for a pair of statistics (Si, Sj), we calculate the conditional probability distributions Sj|si as one-dimensional Gaussian mixtures:

$$S_j \mid s_i, C \;\sim\; \sum_k \tilde{w}^{(k)}(s_i)\, N\!\left(\mu_j^{(k)} + \rho^{(k)}\,\frac{\sigma_j^{(k)}}{\sigma_i^{(k)}}\left(s_i - \mu_i^{(k)}\right),\; \left(1 - \left(\rho^{(k)}\right)^2\right)\left(\sigma_j^{(k)}\right)^2\right), \qquad \tilde{w}^{(k)}(s_i) = \frac{w^{(k)}\, N\!\left(s_i;\, \mu_i^{(k)}, (\sigma_i^{(k)})^2\right)}{\sum_l w^{(l)}\, N\!\left(s_i;\, \mu_i^{(l)}, (\sigma_i^{(l)})^2\right)}$$

where N(μ, σ²) denotes the normal distribution with mean μ and variance σ² (and N(s; μ, σ²) its density evaluated at s), k indexes the components in the joint Gaussian mixture, w(k) is the weight assigned to each component, μi and μj are the components of the joint mean, and σi, σj, and ρ are taken from the joint covariance matrix Σ.
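The conditioning step can be illustrated with a short sketch. It uses the standard result for conditioning a bivariate Gaussian mixture on one coordinate, with component weights renormalized by how well each component explains the observed si; whether this matches the authors' implementation in every detail is an assumption. The function names and toy parameters are hypothetical. In practice the per-component parameters would come from the fitted mixture model (for example the weights, means, and covariances exposed by scikit-learn's `GaussianMixture`).

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def conditional_mixture(components, s_i):
    """Return the 1-D mixture for S_j given S_i = s_i.

    `components` is a list of dicts with keys w, mu_i, mu_j, sigma_i, sigma_j, rho,
    one dict per bivariate Gaussian component. The result is a list of
    (weight, mean, variance) triples whose weights sum to 1.
    """
    # Reweight each component by how well it explains the observed s_i.
    raw = [c["w"] * normal_pdf(s_i, c["mu_i"], c["sigma_i"] ** 2) for c in components]
    total = sum(raw)
    out = []
    for c, r in zip(components, raw):
        mean = c["mu_j"] + c["rho"] * c["sigma_j"] / c["sigma_i"] * (s_i - c["mu_i"])
        var = (1.0 - c["rho"] ** 2) * c["sigma_j"] ** 2
        out.append((r / total, mean, var))
    return out

def conditional_density(components, s_j, s_i):
    """Evaluate p(S_j = s_j | S_i = s_i) under the conditional mixture."""
    return sum(w * normal_pdf(s_j, m, v) for w, m, v in conditional_mixture(components, s_i))

# Toy two-component mixture; the numbers are made up for illustration.
toy = [
    {"w": 0.7, "mu_i": 0.0, "mu_j": 0.0, "sigma_i": 1.0, "sigma_j": 1.0, "rho": 0.5},
    {"w": 0.3, "mu_i": 2.0, "mu_j": 3.0, "sigma_i": 1.5, "sigma_j": 0.8, "rho": -0.2},
]
print(conditional_mixture(toy, s_i=1.0))
```

For very extreme values of si all component densities can underflow to zero, so a production implementation would work with log-densities.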
We find that SWIF(r) loses little power to identify sites with adaptive mutations when one of the component statistics is undefined, but each component statistic differentially influences the power of SWIF(r) (Supplementary Figure 33).
Calibration of SWIF(r) probabilities
There are a few techniques for calibrating probabilities returned by a binary classifier so that of all of the data points that are given a k% probability of belonging to class A by the classifier, k% of those are indeed drawn from class A, and (100 – k)% are drawn from class B25,125. Isotonic regression (IR) is a popular method because it makes no assumptions about the mapping function beyond requiring that it be monotonically increasing126. In the case of SWIF(r) probabilities, IR calibration works by grouping sweep and neutral variants from a training dataset into posterior probability bins, and mapping each bin to the empirical proportion of variants in the bin that are sweep variants. We used 10 bins for calibration, because we found that using more bins increased the risk of overfitting. Overfitting can be mitigated by performing more simulations, but in our case, even with 1000 neutral simulations of 1Mb each, mid to high posterior probabilities were extremely rare at neutral variants. For sweep site localization in data from the 1000 Genomes Project, we calibrated SWIF(r) based on a training dataset composed of 99.99% simulated neutral variants and 0.01% simulated sweep variants, and for application to the ‡Khomani San SNP array data, we calibrated SWIF(r) based on a dataset composed of 99.95% simulated neutral variants and 0.05% simulated sweep variants (Supplementary Figure 1, Supplementary Figure 2, Supplementary Figure 3). The slightly larger fraction of sweep simulations in the ‡Khomani San training set relative to the 1000 Genomes training set allowed for more sensitivity to older sweeps, and accounted for the sparser SNP density of this dataset. In both datasets, we restricted the simulated sweep variants to those with present-day allele frequencies over 50%, since we have the most power in this realm (Supplementary Figure 6), and wanted to avoid overcorrection of strong signals.
A downside to IR is that by its nature, it maps a range of input values to the same output value, which removes some information about which probabilities are larger than others. We implemented a “smoothed” isotonic regression for calibration that interpolates the piecewise constant mapping function learned by IR (Supplementary Figure 3). In practice, we find that both methods of calibration produce equally well-calibrated classifiers; that is, after either method, the data points in our simulated dataset that have a calibrated posterior sweep probability of k% are made up of approximately k% sweep simulations and (100 – k)% neutral simulations (Supplementary Figure 1, Supplementary Figure 2). Unlike IR alone, however, smoothed isotonic regression has the advantage of preserving strict monotonicity of posterior probabilities.
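A minimal sketch of the two calibration variants follows, under stated assumptions: equal-width bins on [0, 1], pooling of adjacent bins to enforce a non-decreasing mapping, and linear interpolation between bin midpoints as the smoothing step. The exact binning and interpolation used in SWIF(r) are not specified here, so these choices and all function names are hypothetical.

```python
def _mean(block):
    """Sweep proportion in a block [sweeps, total, n_bins]; None if the block is empty."""
    sweeps, total, _ = block
    return sweeps / total if total else None

def _violates(prev, cur):
    """True if two neighbouring blocks must be pooled (empty, or decreasing)."""
    m_prev, m_cur = _mean(prev), _mean(cur)
    return m_prev is None or m_cur is None or m_prev > m_cur

def fit_binned_calibration(probs, is_sweep, n_bins=10):
    """Map each equal-width probability bin to its empirical sweep proportion,
    pooling adjacent bins until the mapping is non-decreasing."""
    counts, sweeps = [0] * n_bins, [0] * n_bins
    for p, y in zip(probs, is_sweep):
        b = min(int(p * n_bins), n_bins - 1)
        counts[b] += 1
        sweeps[b] += int(y)
    blocks = []  # each block: [sweep count, total count, number of bins covered]
    for s, c in zip(sweeps, counts):
        blocks.append([s, c, 1])
        while len(blocks) > 1 and _violates(blocks[-2], blocks[-1]):
            s2, c2, k2 = blocks.pop()
            blocks[-1][0] += s2
            blocks[-1][1] += c2
            blocks[-1][2] += k2
    values = []
    for blk in blocks:
        m = _mean(blk)
        values.extend([m if m is not None else 0.0] * blk[2])
    return values

def calibrate_step(p, values):
    """Piecewise-constant calibration: look up the value of the bin containing p."""
    n_bins = len(values)
    return values[min(int(p * n_bins), n_bins - 1)]

def calibrate_smoothed(p, values):
    """Assumed smoothing scheme: linear interpolation between bin midpoints."""
    n_bins = len(values)
    x = p * n_bins - 0.5  # position in bin units, relative to the first midpoint
    if x <= 0:
        return values[0]
    if x >= n_bins - 1:
        return values[-1]
    lo = int(x)
    return values[lo] + (x - lo) * (values[lo + 1] - values[lo])

# Toy training data: posterior probabilities and true labels (1 = sweep).
probs = [0.05, 0.08, 0.15, 0.45, 0.55, 0.62, 0.91, 0.95, 0.97, 0.99]
is_sweep = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
values = fit_binned_calibration(probs, is_sweep)
print(values, calibrate_step(0.62, values), calibrate_smoothed(0.5, values))
```

In this sketch the interpolated mapping is non-decreasing, and it is strictly increasing only between bins whose calibrated values differ; ties between neighbouring bins remain flat.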
Implementation of CMS
We implemented CMS following the algorithm described in Grossman et al.13 and personal communication with the authors. Based on the simulations and component statistics described above, CMS is computed as the product of individual posterior distributions:

$$\mathrm{CMS} = \prod_i \frac{P(s_i \mid \mathrm{sweep})\,\pi}{P(s_i \mid \mathrm{sweep})\,\pi + P(s_i \mid \mathrm{neutral})\,(1 - \pi)}$$

where π, the prior probability of a sweep, is 10^−6.
When one or more component statistics are undefined at a locus, CMS is not well-defined. If statistics are simply left out of the product, this artificially inflates the reported score. Some compensation is thus required to avoid such a bias, which is not discussed by Grossman et al. We implemented a conservative compensation scheme: if statistic Si is undefined at a locus, we set its value to the mean of the distribution for that statistic learned from neutral simulations.
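The compensation scheme can be sketched as follows. This assumes the product-of-posteriors form of CMS from Grossman et al.; `cms_score` and the toy Gaussian likelihoods are hypothetical stand-ins for the binned per-statistic distributions learned from simulations.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def cms_score(stats, sweep_pdfs, neutral_pdfs, neutral_means, prior=1e-6):
    """Product over statistics of the per-statistic posterior sweep probability.

    `stats` maps statistic name to its value, or None when undefined at the locus.
    """
    score = 1.0
    for name, value in stats.items():
        if value is None:
            # Conservative compensation: substitute the neutral mean.
            value = neutral_means[name]
        p_sweep = sweep_pdfs[name](value)
        p_neutral = neutral_pdfs[name](value)
        score *= (p_sweep * prior) / (p_sweep * prior + p_neutral * (1.0 - prior))
    return score

# Toy likelihoods; the means and standard deviations are made up for illustration.
sweep_pdfs = {"FST": lambda x: normal_pdf(x, 2.0, 1.0), "iHS": lambda x: normal_pdf(x, 1.5, 1.0)}
neutral_pdfs = {"FST": lambda x: normal_pdf(x, 0.0, 1.0), "iHS": lambda x: normal_pdf(x, 0.0, 1.0)}
neutral_means = {"FST": 0.0, "iHS": 0.0}

with_gap = cms_score({"FST": 2.5, "iHS": None}, sweep_pdfs, neutral_pdfs, neutral_means)
filled = cms_score({"FST": 2.5, "iHS": 0.0}, sweep_pdfs, neutral_pdfs, neutral_means)
print(with_gap, filled)
```

Because each factor is at most 1, dropping an undefined statistic from the product can only raise the score; substituting the neutral mean keeps a factor in place that pulls the score toward the neutral class.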
For the purpose of evaluating CMS using YRI, CEU, and CHB+JPT samples from the 1000 Genomes, we use CMS Viewer (https://pubs.broadinstitute.org/mpg/cmsviewer/; use date: 04/26/2016), an interactive tool designed by Grossman et al.16 for visualizing genome-wide CMS scores.
Implementation of window-based methods
We implemented evolBoosting using the R package released by Lin et al.14 (http://www.picb.ac.cn/evolgen/softwares/) using default settings. We trained and tested evolBoosting on the simulations of YRI, CEU, and CHB+JPT described above, including all sweep durations and present-day allele frequencies, splitting the simulations in two equally sized groups for training and testing. We used the middle 40kb of each 1Mb simulation, and generated window-based SWIF(r) probabilities by taking the maximum posterior sweep probability for all SNPs within the 40kb window.
For evoNet19 we used the same simulations as above, but using the central 100kb windows of each 1Mb simulation (following Sheehan et al.19). The software released by Sheehan et al.19 is generalized, so that any component statistics may be used for implementing the deep learning framework. Therefore, we implemented evoNet using the component statistics we use here for SWIF(r): FST, XP-EHH, iHS and ΔDAF. evoNet exploits the signatures at three distances from the beneficial allele (0-10kb, 10-30kb, and 30-50kb away), so we used the mean values for each component statistic in each of these windows, for a total of 12 component statistics. For this comparison, we generated window-based SWIF(r) probabilities by taking the maximum posterior sweep probability for all SNPs within the 100kb window.
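The two window-level summaries used in these comparisons can be sketched as below. The helper names are hypothetical, and the treatment of window boundaries (inclusive window edges, half-open distance bins) is an assumption.

```python
from statistics import mean

# Distance bins from the central site, in base pairs, as described for evoNet.
DISTANCE_BINS_BP = [(0, 10_000), (10_000, 30_000), (30_000, 50_000)]

def window_probability(positions, probabilities, center, window_bp):
    """Maximum posterior sweep probability among SNPs within a window centred on `center`."""
    half = window_bp / 2
    in_window = [p for pos, p in zip(positions, probabilities) if abs(pos - center) <= half]
    return max(in_window) if in_window else None

def binned_means(positions, values, center):
    """Mean of one component statistic in each distance bin (None for an empty bin)."""
    out = []
    for lo, hi in DISTANCE_BINS_BP:
        vals = [v for pos, v in zip(positions, values)
                if v is not None and lo <= abs(pos - center) < hi]
        out.append(mean(vals) if vals else None)
    return out

# Toy SNP positions along a 1Mb simulation and their per-site values.
positions = [480_000, 495_000, 500_000, 505_000, 530_000, 560_000]
site_probs = [0.1, 0.2, 0.9, 0.3, 0.05, 0.99]

print(window_probability(positions, site_probs, 500_000, 40_000))   # evolBoosting comparison
print(window_probability(positions, site_probs, 500_000, 100_000))  # evoNet comparison
print(binned_means(positions, site_probs, 500_000))
```

Applying `binned_means` to each of the four component statistics yields the 12 features described above.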
ROC analysis
To generate the ROC curves for CMS, SweepFinder, and the component statistics (Figure 1A-C), we varied the threshold for classifying a mutation as adaptive in order to cover the range from ∼0% false positive rate to ∼100% true positive rate. For SWIF(r), and the ODEs, we varied the prior π, and sites with scores greater than 0.5 were classified as adaptive (Supplementary Figure 4). To generate Figure 1D, we partitioned all simulations by present-day frequency of the adaptive mutation and sweep start time. For each pair of these parameters, we approximated the area under the ROC curves (AUROC) by summing the areas of the trapezoids defined by each pair of neighboring points in the ROC plane, then identified the summary statistics with the highest and second-highest AUROC. ROC curves for window-based SWIF(r), evoNet, and evolBoosting, were generated in much the same way, except that we varied the threshold for classifying a window as containing an adaptive variant.
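The trapezoidal AUROC approximation can be sketched in a few lines. This illustrates the threshold-varying case; for SWIF(r) itself the points on the curve were obtained by varying the prior π instead. The function names and toy scores are hypothetical.

```python
def roc_points(scores, labels, thresholds):
    """(false positive rate, true positive rate) at each classification threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if y and s > t)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s > t)
        points.append((fp / n_neg, tp / n_pos))
    return sorted(points)

def auroc_trapezoid(points):
    """Sum the areas of the trapezoids defined by neighbouring points in the ROC plane."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Toy scores: label 1 marks an adaptive site, label 0 a neutral site.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
thresholds = [1.0, 0.85, 0.75, 0.65, 0.5, 0.35, 0.0]

points = roc_points(scores, labels, thresholds)
print(auroc_trapezoid(points))
```

In this toy example the result is 8/9, which matches the fraction of (adaptive, neutral) pairs in which the adaptive site receives the higher score.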
Ascertainment modeling
For our selection scan in the ‡Khomani San population, we use genotype data from two SNP arrays45; the ascertainment bias of these arrays means that the simulated haplotypes we generate from the four populations (‡Khomani San, YRI, CEU, CHB+JPT) for training SWIF(r) differ dramatically from the observed data for these populations at the sites genotyped on the arrays. To account for this, we implemented an ascertainment-modeling algorithm that prunes sites from simulated haplotypes in order to provide SWIF(r) with simulations for training that match the site frequency spectrum (SFS) of the observed data as closely as possible. The key to this algorithm is to define regions of joint SFS space that are similar in terms of representation on the SNP arrays (e.g. SNPs with low derived allele frequency in all populations are fairly common, while SNPs that are highly differentiated across multiple populations are relatively rare). Defining these “equivalence classes” (hereafter referred to as “SFS regions”) in joint SFS space allows us to learn the density of SNPs from each SFS region along the SNP arrays, and then to thin simulations in order to re-create those densities. This first requires smoothing of the joint SFS to account for sparsity. The full algorithm is as follows:
1. Learn the empirical 3D SFS for YRI, CEU, and CHB+JPT individuals in the 1000 Genomes Project, restricted to SNPs present in the overlap between the Illumina OmniExpress and OmniExpressPlus platforms (Supplementary Note). This results in a three-dimensional array of SNP counts for each triplet of derived allele frequencies (DAFYRI, DAFCEU, DAFCHB+JPT). For this dataset, given 87 YRI individuals, 82 CEU individuals, and 186 CHB and JPT individuals, the dimensions of this three-dimensional array are 175 × 165 × 373 (2n + 1 in each dimension for n individuals).
2. To account for sparseness in the empirical 3D SFS, subdivide each axis into 40 evenly-spaced bins to create a new 40 × 40 × 40 array where each entry is the average SNP count within that 3-dimensional bin; this array approximates the original empirical 3D SFS. Use the one-dimensional histogram of average SNP counts across all 40^3 bins to define five intervals that span the range of counts, then assign each bin to its interval (Supplementary Figure 34). Groups of bins belonging to the same interval will be hereafter referred to as “SFS regions.” We note that we choose a 40 × 40 × 40 array for smoothing because it resulted in SFS regions with well-defined boundaries in 3-dimensional space (Supplementary Figure 34); these dimensions may need to be altered for other datasets to achieve well-defined boundaries as in Supplementary Figure 34.
In most SFS regions, the SNP counts in the 3-dimensional SFS are relatively invariant; however, in the SFS region with the highest SNP counts (the region in red in Supplementary Figure 34, corresponding predominantly to SNPs with low derived allele frequency in all populations), there is a wide range of SNP counts (this is analogous to the higher variability in counts of low-frequency variants in the 1-dimensional site frequency spectrum relative to that of medium- and high-frequency variants). To account for this increased variability, apply a similar procedure as above: subdivide each bin in the highest SFS region by 2 in each dimension (resulting in 8 sub-bins), re-learn the average SNP count within each sub-bin, use a histogram of average SNP counts across sub-bins to again define five intervals, and assign each sub-bin to its interval, thereby defining an additional set of SFS regions that gives better resolution in higher-density areas.
3. For each 1Mb block along the SNP array, count the number of SNPs that fall in each SFS region, based on the observed derived allele frequencies at each SNP for YRI, CEU, and CHB+JPT. This provides a measure of SNP density (counts per Mb) for each SFS region. Applying this over a sliding window of 1Mb across the entire SNP array results in a distribution of densities for each SFS region.
4. Within each 1Mb block of simulated sequence data, assign each simulated SNP to its SFS region. For each SFS region, draw a value from the distribution of SNP densities learned in step 3, then randomly down-sample the number of simulated SNPs that fall in that region to match this value. In the rare case in which downsampling is not possible for a given SFS region (i.e. there are fewer simulated SNPs in that region than the value drawn from the distribution of densities), retain all simulated SNPs that belong to the SFS region.
5. For training the classifiers, restrict the simulated ‡Khomani San genotype data (as well as the simulated data from the 1000 Genomes populations) to the downsampled set of SNPs.
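The density-learning and thinning steps of the algorithm above can be sketched as follows. This is a simplified illustration rather than the SWIF(r) implementation: SFS-region labels are taken as given, non-overlapping 1Mb blocks stand in for the sliding window, simulated SNPs in regions never observed on the array are dropped (an assumption), and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

def sfs_bin(dafs, n_bins=40):
    """Map a triplet of derived allele frequencies to its bin in the smoothed 3D SFS."""
    return tuple(min(int(d * n_bins), n_bins - 1) for d in dafs)

def learn_density_distributions(array_blocks):
    """array_blocks: list of 1Mb blocks, each a list of SFS-region labels (one per array SNP).
    Returns {region: [SNP count in block 1, SNP count in block 2, ...]}."""
    regions = {r for block in array_blocks for r in block}
    dist = {r: [] for r in regions}
    for block in array_blocks:
        counts = Counter(block)
        for r in regions:
            dist[r].append(counts.get(r, 0))
    return dist

def thin_simulation(sim_snps, dist, rng):
    """sim_snps: list of (snp_id, region) pairs for one simulated 1Mb block.
    Returns the ids retained after matching a drawn density in each SFS region."""
    by_region = defaultdict(list)
    for snp_id, region in sim_snps:
        by_region[region].append(snp_id)
    kept = []
    for region, ids in by_region.items():
        target = rng.choice(dist[region]) if region in dist else 0
        if target >= len(ids):
            kept.extend(ids)  # too few simulated SNPs: retain them all
        else:
            kept.extend(rng.sample(ids, target))
    return sorted(kept)

# Toy data: two array blocks and one simulated block with two SFS regions.
array_blocks = [["low", "low", "low", "high"], ["low", "low", "high", "high"]]
dist = learn_density_distributions(array_blocks)
sim_snps = [(0, "low"), (1, "low"), (2, "low"), (3, "low"), (4, "low"), (5, "high")]
kept = thin_simulation(sim_snps, dist, random.Random(0))
print(dist, kept)
```

Passing an explicit `random.Random` instance keeps the thinning reproducible across runs.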
Software and Data availability
SWIF(r) repository: https://github.com/ramachandran-lab/SWIFr; selscan repository: https://github.com/szpiech/selscan; 1000 Genomes phase 1 data: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/.
‡Khomani San genotype data were first described by Uren et al.45, and ‡Khomani San exome data were first described by Martin et al.48. Queries regarding access to ‡Khomani San data analyzed here should be sent to the South African San Council for research and ethics review by contacting both Leana Snyders (leanacloete{at}ymail.com) and admin{at}sasi.org.za.
Author Contributions
L.A.S. and S. Ramachandran conceived the study and L.A.S. implemented the methods. Sequence data from the ‡Khomani San were generated and processed by E.G.A. and B.M.H., and L.A.S., E.G.A., B.M.H., and S. Ramachandran contributed to analysis of SWIF(r) results. A.P.F. annotated SWIF(r) results and contributed to comparison of SWIF(r) with existing methods. S. Rong contributed simulations of different modes of selection. L.A.S., E.G.A., B.M.H., and S. Ramachandran wrote the manuscript.
Competing Financial Interests
The authors declare no competing financial interests.
Supplementary Note
cosi demographic parameter file for 1000 Genomes dataset
length 1000000
mutation_rate 1.5e-8
recomb_file <filename>
gene_conversion_rate 4.5e-9
pop_define 1 european
pop_define 2 asian
pop_define 3 african
#initial sizes and sample sizes
pop_size 1 7700
sample_size 1 120
pop_size 2 7700
sample_size 2 120
pop_size 3 24000
sample_size 3 120
#Migration start
pop_event migration_rate "afr to eur migration" 3 1 1505 .000032
pop_event migration_rate "eur to afr migration" 1 3 1504 .000032
pop_event migration_rate "afr to as migration" 3 2 1503 .000008
pop_event migration_rate "as to afr migration" 2 3 1502 .000008
#Migration end
pop_event migration_rate "afr to eur migration" 3 1 1996 0
pop_event migration_rate "eur to afr migration" 1 3 1995 0
pop_event migration_rate "afr to as migration" 3 2 1994 0
pop_event migration_rate "as to afr migration" 2 3 1993 0
#Recent Bottlenecks:
pop_event bottleneck "african bottleneck" 3 1997 .008
pop_event bottleneck "asian bottleneck" 2 1998 .067
pop_event bottleneck "european bottleneck" 1 1999 .02
#Population splits:
pop_event split "asian and european split" 1 2 2000
pop_event split "out of Africa" 3 1 3500
#Out-of-africa bottleneck
pop_event bottleneck "OoA bottleneck" 1 3499 .085
#Ancestral expansion
cosi demographic parameter file for ‡Khomani dataset
length 1000000
mutation_rate 1.5e-8
recomb_file <filename>
gene_conversion_rate 4.5e-9
pop_define 1 european #E
pop_define 2 han_chinese #H
pop_define 3 yoruban #Y
pop_define 4 khomani #K
#initial sizes and sample sizes
pop_size 1 9700
sample_size 1 164
pop_size 2 5800
sample_size 2 372
pop_size 3 17800
sample_size 3 174
pop_size 4 21000
sample_size 4 90
#population size changes
pop_event change_size "H" 3 1240 3500
pop_event change_size "HE" 2 1441 1200
pop_event change_size "HEY" 3 1881 11500
pop_event change_size "HEYK" 4 5241 8700
#migration to Khomani
pop_event migration_rate "YRI to Khomani migration" 3 4 14 0.227
pop_event migration_rate "YRI to Khomani migration end" 3 4 15 0
pop_event migration_rate "CEU to Khomani migration" 1 4 6 0.179
pop_event migration_rate "CEU to Khomani migration end" 1 4 7 0
#population splits:
pop_event split "CE" 2 1 1440
pop_event split "CEY" 3 2 1880
pop_event split "CEYK" 4 3 5240
Selection Statistic Calculations
For each segregating site within the neutral simulations, and for the adaptive site in sweep simulations, we computed component selection statistics (FST, XP-EHH, iHS, ΔiHH, ΔDAF) with respect to a population of interest (the population undergoing a sweep in sweep simulations, and a population chosen uniformly at random in the case of neutral simulations), ignoring sites for which the derived allele frequency in the population of interest was zero. FST for each of the pairwise population comparisons involving the population of interest was computed as in Weir et al.32, and then averaged together. iHS was computed with selscan33. ΔiHH was calculated as defined in Grossman et al.15: ΔiHH = |iHHancestral − iHHderived|, where iHH is the integrated haplotype homozygosity defined in Voight et al.4. ΔDAF was also calculated as defined in Grossman et al.15:

$$\Delta\mathrm{DAF} = \mathrm{DAF}_1 - \frac{\mathrm{DAF}_2 + \mathrm{DAF}_3}{2}$$

where DAF1 is the derived allele frequency in the population of interest, and DAF2 and DAF3 are the derived allele frequencies in the other two populations. XP-EHH was computed with a minor alteration of selscan in which EHH is computed as in Wagh et al.34 (see “Adaptation of selscan for better performance on incomplete sweeps”), and XP-EHH values were also normalized with population-specific mean and standard deviation, learned from neutral simulations, to correct for inherent biases based on linkage disequilibrium structure.
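As a small worked example of the two difference statistics for the three-population case, the sketch below assumes that ΔDAF is the focal population's derived allele frequency minus the mean over the two other populations, following Grossman et al.; the function names are hypothetical.

```python
def delta_ihh(ihh_ancestral, ihh_derived):
    """Absolute difference in integrated haplotype homozygosity between alleles."""
    return abs(ihh_ancestral - ihh_derived)

def delta_daf(daf_focal, daf_others):
    """Focal derived allele frequency minus the mean over the two other populations (assumed form)."""
    return daf_focal - sum(daf_others) / len(daf_others)

# Toy values for a site at high derived frequency in the population of interest.
print(delta_ihh(0.3, 0.7))
print(delta_daf(0.8, [0.2, 0.4]))
```

With these toy values, ΔiHH is 0.4 and ΔDAF is 0.5, up to floating-point rounding.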
Small adjustments were required for computing these component statistics using the ‡Khomani San simulations and genotype array data23. In these analyses, there are three outgroups (western Africa, Europe, and eastern Asia) instead of two: XP-EHH was defined as the maximum XP-EHH value across the three comparisons; and ΔDAF was defined as
Results for each component statistic were normalized within 1MB regions, with iHS and ΔiHH being normalized first within frequency bins as in Voight et al.4. For the CMS, we learned one-dimensional probability distributions for each (scenario, component statistic) pair in 60 evenly spaced bins, with minimum and maximum values below, chosen to encompass the full range of values observed across all neutral and sweep simulations:
When computing statistics for either 1000 Genomes or ‡Khomani San genotype data, component statistics were normalized genome-wide, following Grossman et al.14.
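The normalization and binning steps described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the number of frequency bins, and the `lo`/`hi` range passed to the histogram are hypothetical placeholders, not the values used in the study.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_in_frequency_bins(scores, dafs, n_freq_bins=20):
    """Z-score each score against sites whose derived allele frequency falls in the same bin."""
    groups = defaultdict(list)
    for idx, daf in enumerate(dafs):
        # Clamp so that a frequency of exactly 1.0 lands in the last bin.
        groups[min(int(daf * n_freq_bins), n_freq_bins - 1)].append(idx)
    normalized = [0.0] * len(scores)
    for members in groups.values():
        values = [scores[i] for i in members]
        mu, sd = mean(values), pstdev(values)
        for i in members:
            # A bin with no spread carries no information; report 0 rather than divide by zero.
            normalized[i] = (scores[i] - mu) / sd if sd > 0 else 0.0
    return normalized


def binned_distribution(values, lo, hi, n_bins=60):
    """Estimate a one-dimensional distribution as probabilities over evenly spaced bins on [lo, hi]."""
    if not values:
        raise ValueError("need at least one value")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Values at or beyond the range edges are assigned to the outermost bins.
        counts[max(0, min(int((v - lo) / width), n_bins - 1))] += 1
    return [c / len(values) for c in counts]
```

One such histogram would be learned per (scenario, component statistic) pair, for example one for iHS under the neutral scenario and one for iHS under the sweep scenario.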
Adaptation of selscan33 for better performance on incomplete sweeps
XP-EHH is defined as $\log\left(\frac{iHH_1}{iHH_2}\right)$, where $iHH_1$ and $iHH_2$ are “integrated haplotype homozygosities” for the population of interest and a reference population, respectively. iHH is computed as the integral under the EHH curve, where EHH at a distance x from the core SNP is canonically computed as follows:
$$EHH(x) = \frac{\sum_{i=1}^{G} \binom{n_i}{2}}{\binom{N}{2}},$$
where N is the total number of chromosomes, G is the number of distinct haplotypes, and $n_i$ is the number of chromosomes of distinct type i. In the case of incomplete sweeps from a de novo mutation, this definition somewhat counterintuitively leads to negative values of XP-EHH. This is because the numerator has an upper bound of $\binom{N_a}{2} + \binom{N_A}{2}$, where $N_a$ and $N_A$ are the numbers of chromosomes that carry each of the two possible alleles at the core SNP. In the reference population, there is a larger upper bound of $\binom{N}{2}$, leading to a larger value of iHH for the reference population than for the population with the adaptive mutation.
Following Wagh et al.34, we instead define EHH as follows:
$$EHH(x) = \frac{\sum_{i=1}^{G} \binom{n_i}{2}}{\binom{N_a}{2} + \binom{N_A}{2}}.$$
When XP-EHH is defined this way, it is far more powerful for detecting sweeps, and does not return negative values for incomplete sweeps. We modified selscan’s source code to implement this change. These modifications can be found at https://github.com/lasugden/selscan (original software at https://github.com/szpiech/selscan).
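The difference between the two EHH definitions can be illustrated with a small Python sketch. It is not taken from selscan: the function names and the string encoding of haplotypes (core allele at index 0) are hypothetical, and the second function assumes that the modified EHH divides by the allele-specific upper bound on the numerator, which should be checked against the modified source code.

```python
from collections import Counter
from math import comb


def ehh_canonical(haplotypes):
    """Pairs of identical haplotypes divided by all pairs of chromosomes."""
    counts = Counter(haplotypes)
    return sum(comb(n, 2) for n in counts.values()) / comb(len(haplotypes), 2)


def ehh_allele_normalized(haplotypes, core_index=0):
    """Same numerator, divided by the pairs that share a core allele (assumed form)."""
    counts = Counter(haplotypes)
    core_counts = Counter(h[core_index] for h in haplotypes)
    # Assumes at least one core allele is carried by two or more chromosomes.
    denominator = sum(comb(n, 2) for n in core_counts.values())
    return sum(comb(n, 2) for n in counts.values()) / denominator


# Incomplete sweep: 4 derived ("1") and 6 ancestral ("0") chromosomes,
# each class fully homozygous over the extended haplotype.
sweep_population = ["1AA"] * 4 + ["0AA"] * 6
```

For `sweep_population`, `ehh_canonical` returns 21/45 (about 0.47) even though haplotypes are identical within each allele class, whereas `ehh_allele_normalized` returns 1.0. A reference population that is monomorphic at the core SNP and equally homozygous would score 1.0 under both definitions, which is why the canonical definition can make the swept population look less homozygous than the reference.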
Removal of ΔiHH from analyses of real data
While other statistics are distributed approximately normally after normalization, ΔiHH maintains a very long tail (1-2 orders of magnitude longer than the tails of other statistics), which is exacerbated by normalization. This problem is compounded by the fact that ΔiHH is an absolute value, leading to high scores even when the ancestral haplotype is longer than the derived haplotype. In Supplementary Figure 5, we show the distribution of ΔiHH and iHS values for sites with SWIF(r) sweep probability ≥ 50% in the 1000 Genomes dataset using SWIF(r) with ΔiHH included as a component statistic. Note the scale of both statistics; while iHS ranges from −4 to 4, ΔiHH ranges from 0 to 100 (in simulations, the probability mass for ΔiHH lies mostly below 5, and entirely below 14). The plot shows that ΔiHH can be extremely high, even for values of iHS that are positive and thus provide evidence against selective sweeps. Furthermore, the more negative iHS values correspond to more moderate ΔiHH values, and not the most extreme ones, thus leading to a large number of false positives. For these reasons, we removed ΔiHH as a component statistic for analysis of simulations and genotype data from the 1000 Genomes and the ‡Khomani San.
Processing of 1000 Genomes Data
We performed a genome-wide scan for selective sweeps in four populations (YRI, CEU, CHB, JPT) using phase 1 of the 1000 Genomes Project (May 2011 release), with CHB and JPT grouped together and representing East Asia. We filtered the samples to omit children from parent-child pairs and trios (NA07048, NA10847, and NA10851 from CEU and NA19129 from YRI), and we only analyzed single-nucleotide variants. We also removed loci that were monomorphic within the filtered set of unrelated individuals across the four populations analyzed. We used ancestral allele information provided by the 1000 Genomes Project.
Calculation of migration rates from YRI and CEU to ‡Khomani San
Uren et al.23 use the software package Tracts35 to infer the magnitude of genetic contributions to present-day ‡Khomani San individuals from three source populations: KhoeSan (data presented in Uren et al.23 and Schuster et al.36), LWK (1000 Genomes), and CEU (1000 Genomes). In our simulations, we use YRI as a proxy population for LWK (i.e. migration rates learned in Uren et al. from LWK are implemented in our demographic model as migration rates from YRI), as both are Bantu-speaking populations, which all originate from west central Africa. Therefore both populations provide an appropriate source for the Bantu ancestry in the ‡Khomani San. The migration rates reported by the Tracts software represent the proportion of the target population that is replaced at a given generation by a source population, starting in this case 14 generations ago. Prior to 14 generations ago, we assume that the ‡Khomani San population is made up of 100% KhoeSan ancestry, and we converted the Tracts migration rates into migration rates from only two source populations (YRI and CEU) into the third (‡Khomani San). We achieved this by going through an intermediate step in which we calculated a matrix of ancestry proportions at each generation. The table below shows the Tracts output from Uren et al.23 (migration rates mS, mY, and mC at each generation from 14 generations ago to present). We then calculate the ancestry proportions at each generation, initializing at generation 14 (mY and mS at generation 14 add to 1, indicating that the ancestry proportions at this time point are equal to the migration rates), and working backwards. If we define $m_S^{(i)}$, $m_Y^{(i)}$, and $m_C^{(i)}$ to be the migration rates at generation i in this 3-source population model, and $a_S^{(i)}$, $a_Y^{(i)}$, and $a_C^{(i)}$ to be the ancestry proportions at generation i, then we can calculate these by the following recursive formulas, where $1 - m_S^{(i)} - m_Y^{(i)} - m_C^{(i)}$ represents the fraction of the population not being replaced at the given generation:
$$a_X^{(14)} = m_X^{(14)}, \qquad a_X^{(i)} = m_X^{(i)} + \left(1 - m_S^{(i)} - m_Y^{(i)} - m_C^{(i)}\right) a_X^{(i+1)} \quad \text{for } i < 14 \text{ and } X \in \{S, Y, C\}.$$
The entries in the following table for aS, aY, and aC are calculated using these formulas. Finally, we need to convert these ancestry proportions into migration rates for a model with two population sources (YRI and CEU). For simplicity, we assume a one-generation migration pulse from YRI at generation 14, and a one-generation migration pulse from CEU at generation 7. We note that in order to achieve a present-day CEU ancestry proportion of 0.179, the CEU migration rate must be 0.179 (the migration rate is again defined as the fraction of the population being replaced by the source population). Finally, to achieve a present-day YRI ancestry proportion of 0.186, the migration rate from YRI must be 0.227, since (1 – 0.179) × 0.227 = 0.186 (this accounts for the proportion of YRI ancestry that gets replaced by CEU ancestry in generation 7).
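The conversion described above can be sketched in Python. The recursion below assumes that, at each generation, the ancestry from a source equals its migration rate plus the unreplaced fraction of the population times that source's ancestry in the preceding (older) generation. The function names are hypothetical, and the Tracts migration rates themselves are not reproduced here.

```python
def ancestry_proportions(rates_oldest_first):
    """Ancestry proportions (a_S, a_Y, a_C) per generation from per-generation
    migration rates (m_S, m_Y, m_C), ordered from the oldest generation to the present."""
    ancestry = tuple(rates_oldest_first[0])
    if abs(sum(ancestry) - 1.0) > 1e-9:
        raise ValueError("migration rates in the oldest generation must sum to 1")
    history = [ancestry]
    for rates in rates_oldest_first[1:]:
        kept = 1.0 - sum(rates)  # fraction of the population not replaced this generation
        ancestry = tuple(m + kept * a for m, a in zip(rates, ancestry))
        history.append(ancestry)
    return history


def two_pulse_rates(target_yri, target_ceu):
    """Single-generation pulse rates (YRI first, CEU later) that reproduce the
    target present-day ancestry proportions."""
    m_ceu = target_ceu                     # the later pulse is not diluted afterwards
    m_yri = target_yri / (1.0 - m_ceu)     # the earlier pulse is diluted by the CEU pulse
    return m_yri, m_ceu
```

With the present-day proportions quoted in the text, `two_pulse_rates(0.186, 0.179)` gives a YRI rate that rounds to 0.227 and a CEU rate of 0.179, matching the values used in the demographic model.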
Generation of exome and array data in San population
‡Khomani San individuals were sampled in 2006 in Upington, South Africa and neighboring villages. Institutional Review Board (IRB) approval for assessment of genetic diversity and ancestry inference was obtained from Stanford University [Protocol 13829] and Stony Brook University [Protocol 727494-5]. Still-living individuals were re-consented in 2011 (IRB approval from Stanford University and Stellenbosch University, South Africa). ‡Khomani N|u-speaking individuals, local community leaders, traditional leaders, non-profit organizations and a legal counselor were all consulted about the project aims before DNA collection commenced. All individuals initially orally consented to participate in the project in the presence of a witness fluent in the native language, and were re-consented with written consent. DNA was collected via saliva (Oragene kits). Ancestry and genotyping details of individuals included here can be found in Uren et al.23.
90 KhoeSan DNA samples were captured with 3 exome platforms: 74 samples on an Agilent SureSelect Human All Exon V2 44Mb, 8 samples on an Agilent SureSelect Human All Exon 50Mb, and 8 samples on an Agilent SureSelect Human All Exon V4+UTRs 71Mb (Martin et al.29). Illumina short-read sequencing data were jointly processed according to the best practice pipeline of the 1000 Genomes Project16. Reads were aligned to the hg19 reference genome using bwa-mem 0.7.1037. The resultant BAM files were then sorted and marked for duplicate reads using the Picard v1.92 toolkit (http://broadinstitute.github.io/picard/). The following programs were then run with GATKv3.2.238: RealignerTargetCreator, IndelRealigner, BaseRecalibrator, PrintReads, HaplotypeCaller, GenotypeGVCFs, VariantRecalibrator, and ApplyRecalibration. During the HaplotypeCaller step, we filtered reads to include only the Agilent capture regions ± 100 bp of padding.
For phasing, exome data were merged with Illumina SNP arrays for each of 87 individuals to improve accuracy by providing a broader SNP scaffold. After merging and filtering to 5% genotyping missingness using vcftools, 759,586 SNPs remained. Data were phased using a two-step phasing process as follows: first, 25 related individuals, consisting of 3 trios and 8 duos, were used to create a family reference panel; after phasing unrelated individuals using default protocols in SHAPEIT239, pedigree information was used via duo-trio phasing with duoHMM40 to inform the phasing of the unrelated individuals in a second step, improving phasing accuracy overall by correcting phase switch errors.
‡Khomani San individuals were genotyped on two SNP array platforms: the Illumina OmniExpress and OmniExpressPlus chips. Only SNPs shared between these two platforms23 were retained for investigation of adaptive sites, in order to avoid allele frequency biases related to platform choice; there is broad overlap between these arrays, with the OmniExpressPlus SNP array containing an additional 250k sites. 86 individuals were genotyped and were phased using only pedigree information. This pedigree-phased dataset provides the highest possible SNP density for these individuals and, after quality control filtering, included just under 650k SNPs.
For the analyses presented here, we selected 45 unrelated ‡Khomani San individuals with low rates of recent European and Bantu-derived admixture from the phased datasets. These individuals were identified by running ADMIXTURE v1.241 on a joint dataset of Illumina SNP arrays consisting of diverse populations of individuals23 and selecting those KhoeSan individuals with >90% KhoeSan ancestry at K = 6. Six was determined to be the K value that best fit the data, using both cross-validation procedures and cluster appearance.
SWIF(r) signals associated with muscle-based phenotypes
Both MYH15 and TTN have been associated with obesity and metabolism phenotypes (Table 1). Additionally, both genes encode striated muscle proteins. MYH15 has been associated with coronary heart disease42, and TTN mutations have been associated with cardiomyopathy43, with RNAseq data indicating that expression is highest in heart tissue44. Associations for MYH15 and TTN with the obesity and metabolism phenotype may be a consequence of these other functions and associations. Exome support for selection acting within these genes is given below; allele frequencies for populations other than the ‡Khomani San refer to frequencies found in the 1000 Genomes phase 3 dataset16.
TTN (titin): We find 6 missense mutations within 50kb of the SWIF(r) signal; in itself, this may not be unusual given the exon richness in this gene. However, many of the mutations have a population frequency of approximately 50% in the ‡Khomani San, while being absent or having much lower frequencies in other worldwide populations. For example, the derived asparagine-to-isoleucine mutation (rs11900987) lies at a site that is conserved amongst mammals30 and is at <1% frequency in other human populations, but is present at 48% in our San sample. Other mutations segregating at similar frequency within 50kb suggest that a high-frequency haplotype is under adaptive evolution in the ‡Khomani San.
MYH15 (myosin heavy chain 15): Our exome analysis found two missense mutations in MYH15 with large allele frequency deviations. The G allele of rs9868484, which lies ∼4kb from the SWIF(r) signal, is at a frequency of 71% in our sample, relative to a maximum frequency of 40% elsewhere in Africa. The T allele of rs1078456, ∼50kb from the SWIF(r) signal, is at a frequency of 22%, relative to a maximum of 4% worldwide. In addition, a splice region variant ∼38kb from the SWIF(r) signal, rs113330737, has a derived allele frequency of 46% in the ‡Khomani sample relative to a maximum of 1% worldwide.
Acknowledgments
We thank Dean Bobo, Barbara Engelhardt, Chris Gignoux, David Guertin, Erik Sudderth, Zachary Szpiech, Jeremy Mumford, Lorin Crawford, Paul Norman, Sara Mathieson, and the Ramachandran Lab for helpful discussions; we also thank Shari Grossman, Ilya Shlyakhter, and Pardis Sabeti for discussing details regarding the implementation of CMS. We are grateful to Ryan Hernandez for multiple discussions regarding analysis of exome sequences and false discovery estimates, as well as to three reviewers for multiple suggestions that improved the manuscript. This research was supported by the Pew Charitable Trusts (S Ramachandran is a Pew Scholar in the Biomedical Sciences), and US National Institutes of Health (NIH) grant R01GM118652 (to S Ramachandran) and COBRE award P20GM109035. S Ramachandran is an Alfred P. Sloan Research Fellow and also acknowledges support from National Science Foundation (NSF) CAREER Award DBI-1452622. S Rong is supported by an NSF Graduate Research Fellowship. APF was supported by an REU supplement to NSF CAREER Award DBI-1452622. EGA was supported by NIH grant K12-GM-102778.