Abstract
Methods to identify signatures of selective sweeps in population genomics data have been actively developed, but mostly do not identify the specific mutation favored by the selective sweep. We present a method, iSAFE, that uses a statistic derived solely from population genetics signals to pinpoint the favored mutation even when the signature of selection extends to 5Mbp. iSAFE was tested extensively on simulated data and in human populations from the 1000 Genomes Project, at 22 loci with previously characterized selective sweeps. For 14 of the 22 loci, iSAFE ranked the previously characterized candidate mutation among the 13 highest scoring (out of ∼21,000 variants). Three loci did not show a strong signal. For the remaining loci, iSAFE identified previously unreported mutations as being favored. In these regions, all of which involve pigmentation related genes, iSAFE identified identical selected mutations in multiple non-African populations, suggesting an out-of-Africa onset of selection. The iSAFE software can be downloaded from https://github.com/alek0991/iSAFE.
Introduction
Genetic data from diverse human populations have revealed a multitude of genomic regions believed to be evolving under positive selection. We consider a regime where a single favored mutation increases in frequency in response to a selective pressure. The favored mutation either exists as standing variation at the onset of selection pressure, or arises de novo after the onset. Neutral mutations on the same lineage as the favored mutation hitchhike (are co-inherited) with it and increase in frequency, leading to a loss of genetic diversity.
Methods for detecting genomic regions under selection from population genetic data exploit a variety of genomic signatures. Allele frequency based methods analyze the distortion in the site frequency spectrum; Linkage Disequilibrium (LD) based methods use extended homozygosity in haplotypes; population differentiation based methods use differences in allele frequency between populations; and finally, composite methods combine multiple test scores to improve the resolution1,2. Recently, a lack of rare (singleton) mutations has been used to detect very recent selection3. The signature of a selective sweep can be captured even when standing variation or multiple de novo mutations create a ‘soft’ sweep of distinct haplotypes carrying the favored mutation. Together with the advent of deep sequencing, these methods have identified multiple regions believed to be under selection in humans and other organisms, and provide a window into genetic adaptation and evolution.
In contrast, little work has been done to identify the favored mutation in a selective sweep. Grossman et al.4 note that different selection signals identify overlapping but different regions, and a composite of multiple signals (CMS) can localize the site of the favored mutation. An alternative strategy is to use functional information to annotate SNPs and rank them in order of their functional relevance. However, the signal of selection is often spread over a large region, up to 1–2 Mbp on either side 5, and the high LD makes it difficult to pinpoint the favored mutation. Here, we propose a method, iSAFE (integrated Selection of Allele Favored by Evolution), that exploits coalescent based signals in ‘shoulders’ 5 of the selective sweep (genomic regions proximal to the region under selection, but carrying the selection signal) to rank all mutations within a large (5Mb) region based on their contribution to the selection signal. iSAFE requires that the broad region under selection is identified using existing methods, but does not depend on knowledge of the specific phenotype under selection, and does not rely on functional annotations of mutations, or knowledge of demography.
Results
iSAFE uses a 2-step procedure to identify the favored variant, given a large region (5Mb) under selection. In the first step, it finds the best candidate mutations in small (low recombination) windows. In the second step, it combines the evidence across windows to give an iSAFE-score to all variants in the large region. It considers only biallelic sites, taking as input a binary SNP matrix with each row corresponding to a haplotype h and each column to a site e. Entries in the matrix correspond to the allelic state, with 0 denoting the ancestral allele, and 1 denoting the derived allele.
A haplotype ‘contains/carries a mutation e’ if it has the derived allele at site e. Recently, we devised the Haplotype Allele Frequency (HAF) score to capture the dynamics of a selective sweep6. The HAF score for a haplotype h (HAF(h)) is the sum of the derived allele counts of mutations in h (Fig. 1A and online methods). It has been shown that, when h is a carrier of the favored allele, HAF(h) increases with the frequency of the favored mutation (Eq. S9), in contrast to HAF scores of non-carriers (Eq. S10), and this can be used to separate carrier haplotypes from non-carriers without knowing the favored mutation6.
Denote two haplotypes as ‘distinct’ if they have different HAF-scores. For any mutation e, let fe denote the mutation frequency, or the fraction of haplotypes carrying the mutation. Let κ(e) (Fig. 1B) denote the fraction of distinct haplotypes that carry mutation e:
$$\kappa(e) = \frac{\left|\{\mathrm{HAF}(h) : h \text{ carries } e\}\right|}{\left|\{\mathrm{HAF}(h) : h \text{ in the sample}\}\right|}.$$
Similarly, let ϕ(e) denote the normalized sum of HAF-scores of all haplotypes carrying the mutation e:
$$\phi(e) = \frac{\sum_{h \text{ carries } e} \mathrm{HAF}(h)}{\sum_{h} \mathrm{HAF}(h)}.$$
We observe empirically that in a region evolving according to a neutral Wright-Fisher model, κ(e) and ϕ(e) are both estimators of fe. Moreover, empirical results suggest that the expected value of ϕ(e) - κ(e) is 0, and its variance is proportional to fe(1 - fe). Based on these observations, we define the SAFE-score of mutation e as
$$\mathrm{SAFE}(e) = \frac{\phi(e) - \kappa(e)}{\sqrt{f_e (1 - f_e)}}.$$
Empirically, SAFE(e) behaves like a Gaussian random variable with mean 0 under neutrality (Fig. S2), and it can be used to test departure from neutrality. However, its real power appears during positive selection, when SAFE-scores change in a dramatic, but predictable manner (Fig. 1B-E). Assuming a no-recombination scenario (only for visual exposition), label mutations as ‘non-carrier’ if they are carried only by haplotypes not carrying the favored allele. The remaining mutations can be labeled as ‘ancestral’, if they arise before the favored mutation, or ‘descendant’, if they arise after (Fig. 1C). When each mutation is represented as a point in a 2-dimensional plot of ϕ, κ values, these classes form separate clusters (Fig. 1D,E). The selective sweep reduces the number of distinct haplotypes carrying the favored mutation (lower κ), leaving non-carrier mutations with an increased fraction of distinct haplotypes (higher κ). On the other hand, increased HAF-scores in carrier haplotypes reduce the proportion of total HAF-score contributed by non-carrier haplotypes (lower ϕ). In contrast, the favored mutation has a high positive value of ϕ - κ due to high HAF-scores for carriers (higher ϕ), and the reduced number of distinct haplotypes among its descendants (lower κ). As we go up to ancestral mutations, the number of non-carrier haplotype descendants increases, and κ grows faster than ϕ. As we go down to descendant mutations, there is a reduction in the already small number of distinct haplotypes. However, ϕ decreases sharply, reducing ϕ - κ (see Fig. 1B,C,E). Thus, we expect that the mutation with the highest SAFE-score is a strong candidate for the favored mutation.
We performed extensive simulations to test SAFE on samples evolving neutrally and under positive selection. We varied one parameter in each run (see online methods, ‘Simulation Experiments’; default values in parentheses), including window size (L = 50kbp), number of individual haplotypes (n = 200) chosen from a larger effective population size (N = 20K), scaled selection coefficient (Ns = 500), and initial and final favored mutation frequencies (v0 = 1/N, and v). Only a few tests have been developed to identify or localize the favored mutation: Composite of Multiple Signals (CMS)4, and Selection detection by Conditional Coalescent Tree (SCCT)7. CMS combines statistics from different selection tests, including the integrated Haplotype Score (iHS)8, so as to localize the signal. In order to develop a unified probabilistic model, CMS expects control populations as input, as well as demographic models, and cannot be used directly on data based solely on coalescent simulations. Therefore, we compared SAFE against iHS and SCCT to obtain a baseline comparison here. The median SAFE rank of the favored mutation in a 50kbp region was 1 out of ∼250 variants (left panel of Fig. 1F), and the favored mutation was in the top 5 in 91% of simulations. In comparison, the median ranks of iHS and SCCT were 6 and 3, respectively. Although SCCT was better than iHS at pinpointing the favored site in a small window (50kbp), iHS performed better than SCCT in larger regions (5Mbp) (Fig. S13). The comparisons to CMS using simulated models of human demography are described later.
While standing variation, v0 > 1/N, generally weakens the selection signal, the performance of SAFE remains relatively robust to variation in v0. The median SAFE rank of the favored allele is at most 3 out of ∼250 variants in all cases except when v0 ≥ 1000/N (Figure S4). Similarly, the performance is robust to selection pressure, with only a slight degradation at weak selection (Ns = 50) (Fig. S5) where the median rank goes to 9 (3.5%-ile), while for Ns ≥ 200 the median rank is at most 2. As expected, the performance improves with increasing sample size (Fig. S6). We also tested SAFE on a model of European demography and observed similar results (Fig. S7). These tests used L = 50kbp, chosen so as to minimize the effects of recombination.
Next, we tested SAFE with increasing window sizes, and observed that while the median rank of the favored mutation increases with increasing window size, the percentile rank improves up to 80kbp and then degrades to 3%-ile around 1Mbp (Fig. 2A, and S8). The deterioration for larger windows is likely due to most haplotypes becoming unique, and κ losing its utility in pinpointing the favored mutation. However, the selective sweep signal is known to extend to large, linked regions, as far as 1Mbp on either side of the favored allele. These ‘shoulders’ of selective sweeps are helpful in identifying the region under selection, but make it harder to pinpoint the favored mutation. We further refined our method to exploit the signal from shoulders.
For larger regions, we considered a set of 50% overlapping windows of fixed size (300 SNPs). For each window, we applied SAFE and chose the mutation with the highest SAFE-score. Let S1 denote the set of selected mutations. Mutations in S1 are likely to contain either the favored mutation itself or mutations linked to it. For mutation e in window w, let Ψe,w′ denote the larger of the SAFE-score of e, when e is ‘inserted’ into window w′, and 0 (Fig. 2C). As different windows have different genealogies due to recombination, Ψe,w′ is relatively high when e is the favored mutation and the genealogies of w, w′ are identical or very similar, but not otherwise. In contrast, the SAFE-score of a non-favored mutation e is relatively low when inserted in other windows (Fig. 2D; see online methods). Let W denote the set of all windows, and define the weight of a window w as
$$\alpha(w) = \frac{\sum_{e \in S_1} \Psi_{e,w}}{\sum_{w' \in W} \sum_{e \in S_1} \Psi_{e,w'}}.$$
Windows that contain the favored mutation and those sharing its genealogy are expected to have high α values. We defined the iSAFE-score for all mutations e (including those not in S1) as:
$$\mathrm{iSAFE}(e) = \sum_{w \in W} \Psi_{e,w}\, \alpha(w).$$
We tested the power of iSAFE to identify the favored mutation in varying window sizes and observed consistently high performance as the window size was increased from 250kbp all the way to 5Mbp (Fig. 2B). The median rank remained between 3 and 5 up to 5Mbp, and the performance remained robust to a large range of parameter choices including both hard and soft sweep scenarios, selection pressure and favored mutation locations (Fig. S11, S12). iSAFE greatly improved upon iHS and SCCT, placing the favored mutation within top 20 in 88% of the cases, in contrast to iHS (39%), and SCCT (34%), for an ongoing selective sweep with fixed population size (Fig. S13).
iSAFE-scores are not based upon likelihood computations, and the distribution of scores depends upon largely unknown factors including demography, time since onset of selection, selection coefficient, and other parameters. Nevertheless, they can be used to rank order the mutations. Additionally, iSAFE scores are normalized and can be compared across samples. We found distinct differences in performance below a score threshold of 0.1. The median rank of the favored mutation is 4 when the peak iSAFE-score exceeds 0.1, versus a median rank of 10 along with a longer tail when the peak iSAFE-score is below 0.1 (Fig. S14). Empirically computed p-values (online methods) on iSAFE indicate good performance when p-value < 1e-4 (Fig. S15).
Not surprisingly, iSAFE performance deteriorates when the favored mutation is fixed, or near fixation (v > 0.9 in Fig. S16). To handle this special case, we include individuals from non-target populations. For a mutation, define the Maximum Difference in Derived Allele Frequency score (MDDAF) as the difference MDDAF = DT - min(DNT), where DT is the derived allele frequency in the target population and min(DNT) is the minimum derived allele frequency over all non-target populations. Simulations of human population demography under neutral evolution (Fig. S20) show P(MDDAF > 0.78 | DT > 0.9) = 0.001 (see Fig. S22). Therefore, when we observe the rare event of high frequency mutations in the target population (DT > 0.9) with MDDAF > 0.78, we add random outgroup samples to the data to constitute 10% of the data (online methods). In testing on the phase 3 of 1000 Genomes Project (1000GP) data, we chose outgroup samples from non-target 1000GP populations. The addition of outgroup samples using the MDDAF criterion was tested in extensive simulations. While the performance did not change for v < 0.9, it dramatically improved for high frequencies, including when the favored mutation was fixed in the target population (Fig. S16). In testing on models of human demography, we also compared against CMS. While CMS showed excellent performance in localizing the favored mutation, iSAFE scoring greatly improved the ranking. For example, iSAFE ranked the favored mutation within the top 20 in 94% of the simulations of a 5Mbp region (Fig. 3, S17), in contrast to CMS which had a top 20 ranking in 35% of cases.
In testing instances of previously characterized sweeps in 1000GP data, we note that performance is difficult to characterize due to many complicating factors. Multiple sweeps could be occurring in response to different selection events, including background selection in the same region; polygenic selection may also dilute the selection signal at any one locus. Moreover, the favored mutation is well-characterized in only a few instances. We looked for genes/regions that showed the signature of a selective sweep in one of the 1000GP sub-populations, and had additional evidence pointing to the favored mutation. We identified 22 genes with some evidence, but only 8 ‘well characterized’ cases with additional support for the favored mutation (Supp. Table S1).
We used iSAFE to rank all variants (∼21,000) in a 5Mbp region surrounding the gene. Among the 8 well characterized cases (Fig. 3C), iSAFE ranked the candidate mutation first in 5 cases: SLC24A5, LCT, EDAR, ACKR1, TLR1; and it assigned ranks 2 (ABCC1), 4 (HBB), and 13 (G6PD) in the others. In almost all cases, we observed high iSAFE-scores (≥0.1).
We checked to see if the other 14 regions under selection showed a strong iSAFE signal. In 3 of the 14 regions (FUT2, F12, ASPM; Fig. S30), we only observed weak signals, and did not make a prediction (peak iSAFE < 0.027), although we do see a strong iSAFE peak 1.3Mbp away from the ASPM gene (Fig. S30D). In other regions, iSAFE ranked the candidate mutations as 1 in the SLC45A2/MATP (CEU), MC1R (CHB+JPT), and ATXN2-SH2B3 (GBR) genes (Fig. 3D), and 7, 8, and 12 in PSCA (YRI), ADH1B (CHB+JPT), and PCDH15 (CHB+JPT) genes, respectively. In each case, the iSAFE-scores were high with the exception of PSCA (peak iSAFE = 0.04, online methods).
The other 5 putative selected regions are interesting in that the top-ranked iSAFE mutations had high scores, but were distinct from the reported candidate mutations (Fig. 3D). Many of these genes are involved in pigmentation, determining skin, eye, and hair color. For example, the Tyrosinase (TYR) gene, encoding an enzyme involved in the first step of melanin production, is considered to be under positive selection with a nonsynonymous mutation rs1042602 as a candidate favored variant9. A second intronic variant, rs10831496, in GRM5, 396kbp upstream of TYR, has been shown to have a strong association with skin color 10. In contrast, iSAFE ranks mutation rs672144 at the top. Interestingly, this variant was the top ranked mutation not only in CEU (iSAFE = 0.48, p-val ≪ 1.3e-8), but also in EUR, EAS, AMR, and SAS (iSAFE > 0.5, p-val ≪ 1.3e-8; Fig. S23). The result is consistent with the signal of selection being observed in all populations except AFR. It may not have been previously reported because it is near fixation in all populations of 1000GP except for AFR (Fig. S23H). We plotted the haplotypes carrying rs672144 and found that two distinct haplotypes carry the mutation, both remaining at high frequency and maintained across a large stretch of the region, suggestive of a soft sweep with standing variation (Fig. 4). A similar analysis was applied to the genes TRPV6, KITLG, and OCA2-HERC2 (Fig. 3D), where in each case the top iSAFE mutations were identical across all non-African populations (online methods) and supported an out-of-Africa onset of selection. In the one remaining gene (CYP1A2/CSK; Fig. 3D), the top ranked iSAFE mutation rs2470893 was previously found significant in a genome wide association study11, and was tightly linked to the candidate mutation. To summarize, iSAFE analysis ranked the candidate mutation among the top 13 in 14 of the 22 loci, did not show a strong signal in 3, and identified plausible alternatives in the remaining 5.
Discussion
The identification of the favored allele in a selective sweep is a long-standing computational problem in population genomics. Our results suggest that statistics obtained from the coalescent structure of a region under a selective sweep can indeed pinpoint the favored mutation. iSAFE was designed to work in regimes where the selection strength is high, and there is a single favored mutation. However, its performance remains robust to a range of simulation parameters, including a wide range of initial frequencies (standing variation), and the frequency of the favored mutation at the time of sampling. iSAFE is not highly parametrized. While most results in the paper are presented on human populations, iSAFE can be easily extended to other populations, with additional demographic simulation or empirical calculations required to recalibrate p-values.
An important challenge was that regions undergoing a selective sweep also present a signal far away from the favored mutation, making it harder to pinpoint the favored mutation. We observe that when a true favored mutation is inserted into a shoulder region, it gets higher SAFE-scores on average, in contrast to the insertion of a hitchhiking mutation. The iSAFE technique uses this idea to exploit the shoulders and rank mutations according to the weighted sum of their SAFE-scores in all windows.
We also use a cross-population technique in a limited manner by using the frequency differential of mutations in high frequency scenarios to get representative non-carrier haplotypes in the sample, and show its power in identifying nearly fixed favored mutations. We do assume a model with a single, favored variant, and future work could contribute to identify multiple interacting loci favored by selection. Finally, we use only population based methods, and future work will seek to integrate these techniques with a functional analysis of mutations.
Online Methods
1 The iSAFE statistic
1.1 iSAFE: Input, Output and Overview
Consider a sample of phased haplotypes in a genomic region. We assume that all sites are biallelic and polymorphic in the sample. Thus, our input is in the form of a binary SNP matrix with each row corresponding to a haplotype and each column to a mutation, and entries corresponding to the allelic state, with 0 denoting the ancestral allele, and 1 denoting the derived allele. The output is a non-negative iSAFE-score for each mutation, with the highest score corresponding to the favored mutation.
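The input assumptions above (binary entries, every site polymorphic in the sample) can be checked before scoring. The following is a minimal sketch, not part of the iSAFE software; the function name `filter_polymorphic` and the toy matrix are hypothetical and introduced only for illustration.

```python
import numpy as np

def filter_polymorphic(snp_matrix):
    """Keep only columns (sites) at which both alleles occur in the sample.

    Rows are haplotypes, columns are sites; 0 = ancestral, 1 = derived.
    Returns the filtered matrix and the boolean column mask that was applied.
    """
    m = np.asarray(snp_matrix)
    if not np.isin(m, (0, 1)).all():
        raise ValueError("matrix must contain only 0 (ancestral) and 1 (derived)")
    derived_counts = m.sum(axis=0)
    # A site is polymorphic if some, but not all, haplotypes carry the derived allele.
    keep = (derived_counts > 0) & (derived_counts < m.shape[0])
    return m[:, keep], keep

# Toy sample: 4 haplotypes x 4 sites (hypothetical data).
toy = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
])
filtered, keep = filter_polymorphic(toy)
```

In this toy sample only the third site survives, because the first is fixed for the derived allele and the second and fourth carry no derived allele.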
At a high level, iSAFE uses a 2-step procedure to identify the favored variant, given a large region (5Mb) under selection. In the first step, it finds the best candidate mutations in small (low recombination) windows. In the second step, it combines the evidence across windows to give an iSAFE-score to all variants in the large region.
1.2 The Haplotype Allele Frequency (HAF-)score
The HAF score for haplotype h is the sum of the derived allele counts of the mutations on h. Define the SNP matrix M such that Mh,e = 1 if haplotype h carries the derived allele of SNP e, and 0 otherwise. The Haplotype Allele Frequency (HAF) score of haplotype h is defined in Ronen et al. (2015)6 as:
$$\mathrm{HAF}(h) = \sum_{e} M_{h,e} \sum_{h'} M_{h',e} = \sum_{h'} \sum_{e} M_{h,e} M_{h',e},$$
where $\sum_{h'} M_{h',e}$ is the derived allele count for SNP e, and $\sum_{e} M_{h,e} M_{h',e}$ is the number of shared derived alleles (mutations) between haplotypes h and h′ (see Fig. 1A).
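As an illustration of the HAF-score definition, the sketch below computes it for a small binary SNP matrix with NumPy. It is a minimal example rather than the iSAFE implementation; `haf_scores` and the toy matrix are hypothetical names and data.

```python
import numpy as np

def haf_scores(snp_matrix):
    """HAF score of each haplotype (row): sum of derived allele counts of the
    mutations that the haplotype carries."""
    m = np.asarray(snp_matrix, dtype=np.int64)
    derived_counts = m.sum(axis=0)   # derived allele count of each site
    return m @ derived_counts        # sum the counts over the sites carried by each row

# Toy sample: 4 haplotypes x 3 sites (hypothetical data).
toy = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
])
haf = haf_scores(toy)

# Equivalent view: number of derived alleles shared with every haplotype
# in the sample (including the haplotype itself).
haf_pairwise = (toy @ toy.T).sum(axis=1)
```

Both computations give the same vector, reflecting the two equivalent readings of the score (allele counts summed over carried sites, or shared derived alleles summed over haplotypes).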
1.3 SAFE: Selection of Allele Favored by Evolution.
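For each SNP e, define ϕ as:
$$\phi(e) = \frac{\sum_{h} M_{h,e}\, \mathrm{HAF}(h)}{\sum_{h} \mathrm{HAF}(h)}.$$
In other words, ϕ is the sum of HAF scores of carriers of the derived allele e, divided by the sum of HAF scores of all haplotypes in the sample.
Similarly, for each SNP e, we define κ as:
$$\kappa(e) = \frac{\left|\{M_{h,e}\, \mathrm{HAF}(h) : h\} \setminus \{0\}\right|}{\left|\{\mathrm{HAF}(h) : h\}\right|},$$
implying that κ is the fraction of distinct non-zero values in HAF scores of SNP e carriers. κ is closely related, but not identical, to the fraction of all distinct haplotypes that carry the mutation e.
We use ϕ and κ to define the SAFE score of a SNP e as:
$$\mathrm{SAFE}(e) = \frac{\phi(e) - \kappa(e)}{\sqrt{f_e (1 - f_e)}},$$
where fe is the derived allele frequency of SNP e.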
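The quantities ϕ, κ, and the SAFE score can be sketched for a single window as follows. This is an illustrative reading of the definitions in this section, not the released iSAFE code: it assumes SAFE(e) = (ϕ(e) - κ(e)) / sqrt(fe(1 - fe)), that κ is normalized by the number of distinct HAF values in the whole sample, and that all sites are polymorphic. The function name and toy data are hypothetical.

```python
import numpy as np

def safe_scores(snp_matrix):
    """Return (phi, kappa, safe) for every site of a binary SNP matrix.

    Assumes rows are haplotypes, columns are polymorphic biallelic sites.
    """
    m = np.asarray(snp_matrix, dtype=np.int64)
    n = m.shape[0]
    counts = m.sum(axis=0)
    freq = counts / n                          # derived allele frequency f_e
    haf = m @ counts                           # HAF score per haplotype
    phi = (haf @ m) / haf.sum()                # share of total HAF held by carriers
    n_distinct_all = np.unique(haf).size       # distinct HAF values in the sample
    kappa = np.empty(m.shape[1])
    for e in range(m.shape[1]):
        carrier_haf = haf[m[:, e] == 1]
        # distinct non-zero HAF values among carriers of e
        kappa[e] = np.unique(carrier_haf[carrier_haf > 0]).size / n_distinct_all
    safe = (phi - kappa) / np.sqrt(freq * (1 - freq))
    return phi, kappa, safe

# Toy sample: 4 haplotypes x 3 sites (hypothetical data).
toy = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
])
phi, kappa, safe = safe_scores(toy)
best_site = int(np.argmax(safe))
```

In the toy sample the second site has the largest SAFE score: its two carriers hold most of the total HAF score (high ϕ) while sharing a single HAF value (low κ).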
To explain the behavior of the SAFE-score in pin-pointing the favored mutation, we describe a collection of theoretical and empirical observations that can be summarized as follows:
Under neutrality, ϕ(e) and κ(e) are (biased) estimators of fe.
λf (1 – f) is a biased estimator for variance of (ϕ – κ), where λ is a positive constant.
The two points above allow the use of SAFE-score as a statistic that empirically follows a Gaussian distribution with mean 0 under neutrality.
For a population evolving under selection, ϕ and κ move in opposite directions. Specifically, for the favored mutation e, ϕ(e) increases, while κ(e) decreases. The SAFE-score tends to be maximized for the favored mutation e.
We elaborate on these points below.
1.3.1 Behavior of ϕ, κ under neutrality, constant population size
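Consider a sample of size n selected from a population evolving neutrally according to the Wright-Fisher model (constant population size, random mating, discrete generations, no recombination), with scaled mutation rate θ. Let ξi be the number of sites with derived allele count i. Each such site contributes i to the HAF score of each of its i carriers, so from Ronen et al. 6 the mean of the HAF scores of all n haplotypes in the sample is
$$\overline{\mathrm{HAF}} = \frac{1}{n} \sum_{h} \mathrm{HAF}(h) = \frac{1}{n} \sum_{i=1}^{n-1} i^2\, \xi_i.$$
Under the coalescent model, Eq. (22) of Fu (1995)12 shows that 𝔼[ξi] = θ/i for all 1 ≤ i ≤ n - 1. By averaging over all haplotypes in all genealogies, the expected HAF score is computed as
$$\mathbb{E}[\mathrm{HAF}] = \frac{1}{n} \sum_{i=1}^{n-1} i^2\, \mathbb{E}[\xi_i] = \frac{\theta}{n} \sum_{i=1}^{n-1} i.$$
Thus, the expected HAF score is
$$\mathbb{E}[\mathrm{HAF}] = \frac{\theta (n-1)}{2} \approx \frac{\theta n}{2}.$$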
Since every haplotype has the same expected HAF score, the expected fraction of the total HAF-score held by fn randomly chosen haplotypes is approximately f. A mutation e with derived allele frequency f also has fn descendants (carriers). However, to compute the sum of their HAF-scores, we must consider a random coalescent process with the condition that carriers coalesce to a common ancestor before any carrier coalesces with a non-carrier.
This is harder, even though conditional coalescent processes have been studied extensively (e.g., Wiuf and Donnelly13). Empirical analysis on neutral coalescent simulations conditioned on the mutation e having fn carriers reveals that (Fig. S1)
$$\mathbb{E}[\phi(e)] \approx f.$$
While κ has not been studied previously, it is closely related to the fraction of distinct haplotypes in the sample. Empirically, for a mutation e with fn descendants, we observe that (Fig. S1)
$$\mathbb{E}[\kappa(e)] \approx f,$$
and, for all e (Fig. S2),
$$\mathbb{E}[\phi(e) - \kappa(e)] \approx 0.$$
1.3.2 Distribution of SAFE-scores in a neutrally evolving population
The discussion above suggests that E(SAFE(e)) = 0 for all derived alleles e. Additionally, empirical observations suggest that λf (1-f) is a biased estimator for variance of (ϕ-κ), where λ is a positive constant. We observed empirically that the distribution of the SAFE score of derived alleles in a neutrally evolving population is therefore approximated by a Gaussian distribution with mean 0 and unknown variance λ (see Fig. S2).
1.3.3 Behavior of ϕ, κ, SAFE in a population under selection, constant population size
The dynamics of the HAF-score for a haplotype carrying the favored mutation in an ongoing selective sweep were analyzed earlier 6. The score increases up to fixation of the favored allele, and then decreases dramatically.
Formally, let HAFcar (respectively, HAFnon) denote the HAF score of a random haplotype carrier of the favored allele (respectively, a non-carrier) when a fraction f of the n sampled haplotypes carry the favored allele. In S1 Text of Ronen et al. (2015) 6, we derive the expected values of HAFcar and HAFnon under strong selection (Ns ≫ 1) and no recombination (Eqs. S9 and S10): the expected HAFcar increases with f, in contrast to the expected HAFnon.
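Because of the separation between carriers and non-carriers, the HAF-scores can be used to predict the carriers of ongoing selective sweeps without knowledge of the favored allele 6. Moreover, for the favored allele e with fn descendants, in a hard selective sweep that is not very close to fixation, the definition of ϕ lets us approximate ϕ(e) as
$$\phi(e) \approx \frac{f\, \mathbb{E}[\mathrm{HAF}^{\mathrm{car}}]}{f\, \mathbb{E}[\mathrm{HAF}^{\mathrm{car}}] + (1 - f)\, \mathbb{E}[\mathrm{HAF}^{\mathrm{non}}]}.$$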
For a population undergoing a positive natural selection with favored mutation e, ϕ(e) overestimates the favored allele frequency f (Fig. 1B,E and Eq. S11). On the other hand, κ(e) underestimates f (Fig. 1B,E). Therefore, we expect the distribution of (ϕ - κ) for the favored allele to be skewed in positive direction.
The SAFE score performs very well in separating the favored variant within a small window (see Fig. S4-S10), but the performance decays in larger windows (Fig. S8). In larger windows most of the haplotypes become unique, so κ estimates f correctly even for the favored mutation of a selective sweep, whereas the SAFE score relies on κ underestimating f for the favored mutation. Consequently, the estimator κ is no longer useful for pinpointing the favored mutation.
1.4 Illustration of iSAFE: integrated SAFE for large regions
We devise the iSAFE-score by extending the SAFE score to boost the performance in larger windows. We apply the SAFE score, as a kernel, on overlapping sliding windows. Define S as the set of all SNPs, and W as the set of all sliding windows. Let S1 ⊆ S denote the subset of mutations that had the highest SAFE-score in their respective windows. For mutation e ∈ S, and window w ∈ W, let Ψe,w denote the SAFE-score of e when e is ‘inserted’ into window w if that score is positive, and 0 otherwise. Fig. S3 provides a cartoon illustration of windows w1, w2, w3 and mutations ⋆, ▴, and ▪, where ⋆ denotes the favored mutation and is located in w2.
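The overlapping sliding windows can be indexed as in the sketch below, which uses the 300-SNP windows with 50% overlap described in the Results. The helper name `sliding_windows` is hypothetical, and the handling of the final partial stretch (adding one last full-size window aligned to the end of the region) is an assumption made for this illustration; the text does not specify it.

```python
def sliding_windows(num_snps, window=300, step=150):
    """Return (start, end) SNP-index pairs, end exclusive, covering all SNPs.

    window=300 and step=150 give 50% overlapping windows of 300 SNPs.
    """
    if num_snps <= window:
        return [(0, num_snps)]
    starts = list(range(0, num_snps - window + 1, step))
    # Assumption: if the regular grid leaves SNPs uncovered at the right edge,
    # add one final full-size window aligned to the end of the region.
    if starts[-1] + window < num_snps:
        starts.append(num_snps - window)
    return [(s, s + window) for s in starts]

windows = sliding_windows(1000)
```

With 1,000 SNPs this yields six windows, each sharing half of its SNPs with its neighbour except for the final edge window.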
We note the following:
Ψ⋆,w2 is high for the favored mutation ⋆. However, Ψ▴,w1 and Ψ▪,w3 may be high even for hitchhiking mutations (▴, ▪) due to the genealogies of w1 and w3. Thus the SAFE-score by itself may not be a reliable predictor over a large region containing multiple windows.
When a non-favored mutation is inserted in a window with a different genealogy, it is not likely to have a high SAFE-score. When ⋆ and ▴ are inserted into window w3, Ψ⋆,w3 > Ψ▴,w3 because ⋆ separates carriers from non-carriers and has high values for ϕ(⋆) and low values for κ(⋆). On the other hand, κ(▴) is higher because its descendants include non-carriers, which are typically distinct haplotypes. Similarly, Ψ⋆,w1 > Ψ▪,w1 because ϕ(▪) is lower in w1. In other words, the weighted sum of Ψ⋆,w over all windows w is likely to dominate that of other mutations.
Similarly, the window containing the favored mutation (w2) has the appropriate genealogy, and is likely to give a high score to multiple candidate mutations.
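Based on these considerations, we define the score α of window w ∈ W as:
$$\alpha(w) = \frac{\sum_{e \in S_1} \Psi_{e,w}}{\sum_{w' \in W} \sum_{e \in S_1} \Psi_{e,w'}}.$$
The window with the highest weight is the one that gives the highest SAFE-scores to the candidate mutations in S1 when they are inserted into it. Finally, we define the score iSAFE of mutation e ∈ S as:
$$\mathrm{iSAFE}(e) = \sum_{w \in W} \Psi_{e,w}\, \alpha(w),$$
where the mutation with the highest score is one that gives high scores when inserted into high weight windows.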
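The window weighting and the final score can be sketched end to end as below. This is a simplified illustration under stated assumptions, not the released iSAFE software: ‘inserting’ e into window w is taken to mean scoring e's carriers with the HAF scores computed from w alone; the window weight is taken as α(w) = Σ_{e∈S1} Ψe,w / Σ_{w′} Σ_{e∈S1} Ψe,w′; and iSAFE(e) = Σ_w Ψe,w α(w). All function names and the random demo matrix are hypothetical, and all sites are assumed polymorphic.

```python
import numpy as np

def sliding_windows(num_snps, window, step):
    """(start, end) index pairs, end exclusive; edge handling is an assumption."""
    if num_snps <= window:
        return [(0, num_snps)]
    starts = list(range(0, num_snps - window + 1, step))
    if starts[-1] + window < num_snps:
        starts.append(num_snps - window)
    return [(s, s + window) for s in starts]

def safe_given_haf(m, haf):
    """SAFE-score of every column of m, using the supplied HAF scores."""
    freq = m.sum(axis=0) / m.shape[0]
    phi = (haf @ m) / haf.sum()
    n_distinct = np.unique(haf).size
    kappa = np.array([
        np.unique(haf[(m[:, e] == 1) & (haf > 0)]).size
        for e in range(m.shape[1])
    ]) / n_distinct
    return (phi - kappa) / np.sqrt(freq * (1 - freq))

def isafe_scores(snp_matrix, window=300, step=150):
    """Return (iSAFE score per SNP, weight per window, sorted indices of S1)."""
    m = np.asarray(snp_matrix, dtype=np.int64)
    windows = sliding_windows(m.shape[1], window, step)
    raw = np.empty((len(windows), m.shape[1]))
    top = []
    for w, (a, b) in enumerate(windows):
        haf_w = m[:, a:b] @ m[:, a:b].sum(axis=0)      # HAF from window w only
        raw[w] = safe_given_haf(m, haf_w)              # every SNP inserted into w
        top.append(a + int(np.argmax(raw[w, a:b])))    # best SNP located in w
    psi = np.clip(raw, 0.0, None)                      # Psi = max(SAFE, 0)
    s1 = sorted(set(top))
    mass = psi[:, s1].sum(axis=1)
    total = mass.sum()
    # Fall back to uniform weights if no candidate has a positive score anywhere.
    alpha = mass / total if total > 0 else np.full(len(windows), 1.0 / len(windows))
    return alpha @ psi, alpha, s1

# Small random demo (hypothetical data), restricted to polymorphic sites.
rng = np.random.default_rng(0)
demo = rng.integers(0, 2, size=(20, 40))
demo = demo[:, (demo.sum(axis=0) > 0) & (demo.sum(axis=0) < 20)]
scores, alpha, s1 = isafe_scores(demo, window=10, step=5)
```

Because Ψ is clipped at zero and the weights sum to one, the resulting scores are non-negative, in line with the description of the output in Section 1.1. The explicit loops keep the sketch readable; they are not meant to be efficient on 5Mbp regions.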
1.5 MDDAF: Maximum Difference in Derived Allele Frequency
We have shown that iSAFE is successful in pinpointing the favored variant in an ongoing selective sweep. When the favored mutation is near fixation (v > 0.9), iSAFE performance decays, and when the favored variant is fixed (v = 1), iSAFE cannot detect the favored mutation because it is no longer a variant (Fig. 3A). To pinpoint the favored mutation in a fixed selective sweep, we add random samples from non-target populations (outgroup) to the target population to constitute 10% of the sample.
To minimize the noise added to the data with random outgroup samples, we devise a simple method to decide whether to use outgroups or not. Our score is motivated by the work of Grossman et al. (2010) 4, who introduced the ΔDAF score of a mutation as
$$\Delta\mathrm{DAF} = D_T - \overline{D}_{NT},$$
where DT is the derived allele frequency in the target population and $\overline{D}_{NT}$ is the average derived allele frequency in non-target populations. As it is possible that some of the non-target populations are also under selection, choosing the average derived allele frequency may lower ΔDAF, and weaken the signal of selection. Instead we define the Maximum Difference in Derived Allele Frequency (MDDAF) score as:
$$\mathrm{MDDAF} = D_T - \min(D_{NT}),$$
where DT is the derived allele frequency in the target population and min(DNT) is the minimum derived allele frequency over all non-target populations.
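The difference between averaging and taking the minimum over non-target populations can be seen in a short sketch, taking ΔDAF as DT minus the mean non-target frequency and MDDAF as DT minus the minimum non-target frequency. The function names and the allele frequencies are hypothetical, chosen only to illustrate the two scores.

```python
import numpy as np

def delta_daf(target_daf, non_target_dafs):
    """Target frequency minus the MEAN derived allele frequency of non-target populations."""
    target = np.asarray(target_daf, dtype=float)
    others = np.asarray(non_target_dafs, dtype=float)   # shape: (populations, sites)
    return target - others.mean(axis=0)

def mddaf(target_daf, non_target_dafs):
    """Target frequency minus the MINIMUM derived allele frequency of non-target populations."""
    target = np.asarray(target_daf, dtype=float)
    others = np.asarray(non_target_dafs, dtype=float)
    return target - others.min(axis=0)

# Hypothetical derived allele frequencies at three sites.
target = [0.95, 0.92, 0.50]
non_target = [
    [0.10, 0.80, 0.40],   # non-target population 1
    [0.60, 0.85, 0.10],   # non-target population 2 (site 1 may also be under selection)
]
d = delta_daf(target, non_target)
m = mddaf(target, non_target)
```

At the first site, the elevated frequency in the second non-target population pulls ΔDAF down to 0.60, while MDDAF stays at 0.85.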
1.6 Adding Outgroup Samples
Simulation of human population demography under neutral evolution (Fig. S20), shows P (MDDAF > 0.78|DT > 0.9) = 0.001 (Fig. S22) making it a rare event to have high MDDAF score even when the frequency is high in the Target population. Therefore, when there is a high frequency mutation (DT > 0.9) with MDDAF > 0.78 in the target population, we add random outgroup samples to the data to constitute 10% of the data. For analysis on real data, where we looked at 1000GP populations, we randomly selected outgroup samples from non-target populations of 1000GP.
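The decision rule and the 10% outgroup fraction can be written down as a minimal sketch. The thresholds 0.9 and 0.78 and the 10% fraction come from the text; the function names, the random row sampling, and the placeholder matrices are assumptions made for illustration.

```python
import numpy as np

def needs_outgroup(target_daf, mddaf_scores, daf_cutoff=0.9, mddaf_cutoff=0.78):
    """True if any site has D_T > daf_cutoff together with MDDAF > mddaf_cutoff."""
    target = np.asarray(target_daf, dtype=float)
    scores = np.asarray(mddaf_scores, dtype=float)
    return bool(np.any((target > daf_cutoff) & (scores > mddaf_cutoff)))

def outgroup_size(n_target, fraction=0.10):
    """Number k of outgroup haplotypes such that k / (n_target + k) is about `fraction`."""
    return round(fraction * n_target / (1 - fraction))

def add_outgroup(target_matrix, outgroup_matrix, fraction=0.10, seed=None):
    """Append randomly chosen outgroup haplotypes (rows) to the target matrix."""
    rng = np.random.default_rng(seed)
    k = min(outgroup_size(target_matrix.shape[0], fraction), outgroup_matrix.shape[0])
    rows = rng.choice(outgroup_matrix.shape[0], size=k, replace=False)
    return np.vstack([target_matrix, outgroup_matrix[rows]])

# Hypothetical example: 200 target haplotypes and 50 available outgroup haplotypes.
target_haps = np.zeros((200, 3), dtype=int)
outgroup_haps = np.ones((50, 3), dtype=int)
combined = add_outgroup(target_haps, outgroup_haps, seed=1)
```

For 200 target haplotypes this gives 22 outgroup haplotypes, so the outgroup makes up 22 of 222 haplotypes, or about 10% of the combined sample.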
In Fig. S16, we compared the performance of iSAFE with and without the option of using outgroup samples. We simulated 5Mbp of the human genome based on the human demography model described in Fig. S20. Selection begins at a random time, with the distribution given in Fig. S21, after the out-of-Africa event, in the lineage of the EUR population (the target population). When the onset of selection is before the split of EUR and EAS (> 23kya), both EUR and EAS are under selection. When the outgroup option is enabled, we use the MDDAF criterion to decide whether to add random outgroup samples. If so, we add a random subset of individuals from EAS+AFR to constitute 10% of the data (200 haplotypes from EUR and 22 from EAS+AFR).
The performance of iSAFE for sweeps with v < 0.9 did not change with or without the outgroup sample option (Fig. 3A). When the frequency of the favored mutation is near fixation (v > 0.9), the outgroup sample option is helpful and increases the performance of iSAFE. When the sweep is fixed (v = 1), iSAFE is no longer capable of detecting the favored mutation without outgroup samples because the favored mutation is no longer a variant in the target population. However, with the outgroup sample option, iSAFE can successfully pinpoint the favored mutation even in a fixed selective sweep (see Fig. 3A).
2 Simulation Experiments
2.1 Default simulation parameters
Neutral and sweep samples were generated using the simulator msms 14. By default, simulated populations are haploid with a sample size of n = 200 haplotypes from a larger effective population of N = 20000 haplotypes, each of length L, with default value 50kbp for SAFE and 5Mbp for iSAFE. For human populations, a mutation rate of approximately µ = 2.5 × 10^-8 mutations per bp per generation15,16, and a recombination rate of approximately r = 1.25 × 10^-8 per bp per generation 17 have been proposed. For SAFE simulations, we used a scaled mutation rate θ = 2µN = 1 mutations per kbp per generation and a scaled recombination rate ρ = 2rN = 0.5 crossovers per kbp per meiosis to approximate human rates. The rates were scaled linearly by L. In the case of positive selection, the default scaled selection strength of the favored allele was set to Ns = 500, with the favored mutation located at a random position uniformly distributed on the range [1, L]. The default value for the favored mutation starting frequency is v0 = 1/N (hard sweep), and the frequency of the favored mutation (v) at the time of sampling is a random value uniformly distributed on the range [0.1, 0.9]. We used the default parameters for all simulations unless otherwise stated.
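The scaled rates follow from simple arithmetic on the default parameters, as the sketch below reproduces. The variable and function names are hypothetical; the values of N, µ, r, and Ns are those stated above, and the per-locus scaling assumes the rates scale linearly with L as described.

```python
N = 20_000        # effective population size (haplotypes)
MU = 2.5e-8       # mutation rate per bp per generation
R = 1.25e-8       # recombination rate per bp per generation
NS = 500          # default scaled selection strength, N * s

theta_per_kbp = 2 * MU * N * 1_000   # scaled mutation rate per kbp
rho_per_kbp = 2 * R * N * 1_000      # scaled recombination rate per kbp
s = NS / N                           # per-generation selection coefficient implied by Ns

def scaled_rates(length_bp):
    """Scaled mutation and recombination rates for a locus of `length_bp` base pairs."""
    return 2 * MU * N * length_bp, 2 * R * N * length_bp

theta_safe, rho_safe = scaled_rates(50_000)        # 50kbp SAFE window
theta_isafe, rho_isafe = scaled_rates(5_000_000)   # 5Mbp iSAFE region
```

With these defaults the 50kbp window has a scaled mutation rate of about 50 and a scaled recombination rate of about 25, the 5Mbp region has about 5000 and 2500, and Ns = 500 corresponds to s = 0.025.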
2.2 A model of human demography
We simulated the demography of the AFR, EUR, and EAS populations with the parameters shown in Fig. S20, based on a popular demographic model of human populations18. In the case of positive selection, the selection coefficient was set to s = 0.05 and the starting favored allele frequency to v0 = 0.001. The time of onset of selection was chosen at random (using the distribution in Fig. S21) after the out-of-Africa event, in the lineage of the EUR population (the target population). When the onset of selection precedes the split of EUR and EAS (> 23kya), both EUR and EAS are under selection.
3 Human Population Datasets
We downloaded the phased haplotypes of the 1000 Genomes Project (Phase 3; GRCh37) dataset from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/. The Ancestral Alleles dataset (GRCh37) was downloaded from http://ftp.ensembl.org/pub/release-75/fasta/ancestral-alleles/. Physical positions were converted into genetic positions using the genetic map in ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20110106-recombination-hotspots/.
4 Computing Selection Statistics
4.1 Computing iHS scores
We used the selscan 19 (v1.1.0a) software, available at https://github.com/szpiech/selscan, with default settings to calculate the raw iHS 8 score. Next, we normalized the iHS score by estimating the distribution of raw iHS scores on 1,000 neutral simulations with the same simulation parameters. The iHS scores were always computed on a 5Mb window. When comparing results with SAFE on a 50kbp window, we used the corresponding iHS scores in the identical 50kbp region surrounding the favored variant (Fig. 1,S4). For 5Mb windows (Fig. S13), we compared iHS against iSAFE using the scores of all variants in the window.
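The normalization step can be sketched as a standardization of each raw score against the neutral distribution. The text does not say whether the neutral scores were binned by allele frequency; the sketch below assumes frequency binning (a common convention for iHS), and the function names and the default of 20 bins are hypothetical choices, not taken from selscan or from this work.

```python
from collections import defaultdict
from statistics import mean, pstdev


def fit_neutral_bins(neutral_scores, neutral_freqs, n_bins=20):
    """Mean and standard deviation of raw neutral scores per frequency bin.

    Assumption: scores are grouped by derived allele frequency before
    standardization; the source only states that neutral simulations
    were used to estimate the distribution.
    """
    by_bin = defaultdict(list)
    for score, freq in zip(neutral_scores, neutral_freqs):
        by_bin[min(int(freq * n_bins), n_bins - 1)].append(score)
    return {b: (mean(v), pstdev(v)) for b, v in by_bin.items() if len(v) > 1}


def normalize_ihs(raw_score, freq, bins, n_bins=20):
    """Standardize one raw score; returns None if its bin has no usable
    neutral estimate (missing bin or zero variance)."""
    b = min(int(freq * n_bins), n_bins - 1)
    if b not in bins or bins[b][1] == 0:
        return None
    mu, sd = bins[b]
    return (raw_score - mu) / sd
```

In use, `fit_neutral_bins` would be called once on the pooled raw scores from the neutral simulations, and `normalize_ihs` would then be applied to each variant in the sweep simulation.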
4.2 Computing SCCT scores
We used the SCCT (v1.1) software available at https://github.com/wavefancy/scct, provided by Wang et al. 7, with flanking SNPs size 300, and frequency interval 0.01.
4.3 Computing CMS scores
CMS requires a control population as well as a demographic model in addition to the target population under selection. All CMS comparisons on simulated data were performed using a model of human demography18, described in Fig. S20, with a random onset of selection (Fig. S21). We used the CMS (v2.0) software available at https://github.com/broadinstitute/cms, disabling CMS’ default allele frequency filter in order to allow a more direct comparison with iSAFE SNP rankings.
5 Empirical p-val computation
We applied iSAFE to neutrally evolving simulated populations with window size 5Mbp, based on the European demography shown in Fig. S20. A p-value was calculated from the empirical distribution of iSAFE scores on these simulated populations. We limited the number of samples to ∼74,800,000 for efficiency, which allows a p-value as low as 1.34e-8, corresponding to an iSAFE score of 0.304. Scores higher than this cut-off are considered to have p-value < 1.34e-8.
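The empirical p-value described above is the fraction of neutral scores at least as large as the observed score, with a floor of one over the number of neutral samples (1/74,800,000 ≈ 1.34e-8). A minimal sketch follows; the function name and the returned flag are hypothetical, and the neutral scores are assumed to be available as a sorted list.

```python
from bisect import bisect_left


def empirical_pvalue(score, sorted_null):
    """Empirical p-value of `score` against an ascending-sorted null sample.

    Returns (p, is_upper_bound). When no null score reaches `score`, the
    p-value cannot be resolved below 1/n, so 1/n is returned and flagged
    as an upper bound (reported as "p < 1/n").
    """
    n = len(sorted_null)
    n_ge = n - bisect_left(sorted_null, score)  # null scores >= score
    if n_ge == 0:
        return 1.0 / n, True
    return n_ge / n, False


if __name__ == "__main__":
    null = sorted([0.01, 0.02, 0.05, 0.10])
    print(empirical_pvalue(0.05, null))  # two of four null scores are >= 0.05
    print(empirical_pvalue(0.30, null))  # beyond the null range: bound of 1/n
```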
6 Results on selective sweeps in human populations
6.1 Well characterized selective sweeps
We examined 8 well characterized selective sweeps, each with a strong candidate mutation. These genes are LCT, SLC24A5, TLR1, EDAR, ACKR1/DARC, ABCC11, HBB, and G6PD 4,20,21,22,23,24. iSAFE results for these genes are summarized in Fig. 3 and Table S1.
We also examined 14 other regions reported to be under selection with one or more candidate favored mutations9,25,26,4,27.
6.2 Pigmentation genes
SLC45A2/MATP
This region is involved in human pigmentation pathways and is a target of a selective sweep in European populations9. A nonsynonymous mutation, rs16891982, is associated with light skin pigmentation and is believed to be the favored variant4,9. This mutation is also ranked first by iSAFE out of ∼21,000 mutations (5Mbp) in the CEU population, with a significant score (see Fig. 3N, iSAFE=0.32, p-val<1.3e-8). This mutation is almost fixed in Europeans; its frequency in AFR, EAS, SAS, AMR, and EUR is 0.04, 0.01, 0.06, 0.45, and 0.94, respectively.
MC1R
The MC1R gene is implicated in many skin color phenotypes, including red hair, fair skin, freckles, poor tanning response, and higher risk of skin cancer. It is a target of positive selection in East Asian populations, with a non-synonymous mutation (rs885479) suggested as a candidate favored mutation25. This mutation is ranked first by iSAFE in CHB+JPT (see Fig. 3P, iSAFE =0.24, p-val = 1.4e-6) out of ∼16,000 mutations (2.8Mbp). The putative selected region is 300kbp away from the telomere of chromosome 16.
GRM5-TYR
The Tyrosinase (TYR) gene, which encodes an enzyme involved in the first step of melanin production, lies in a large region under selection. A nonsynonymous mutation, rs1042602, in the TYR gene is reported as a candidate favored variant9. A second, intronic variant, rs10831496, in the GRM5 gene, 396kbp upstream of TYR, has been shown to have a strong association with skin color 10.
In contrast, iSAFE ranks mutation rs672144 as the top candidate for the favored variant in this region, out of ∼22,000 mutations (5Mbp). This variant was the top-ranked mutation not only in CEU (iSAFE = 0.48, p-val≪1.3e-8) but also in EUR, EAS, AMR, and SAS (see Fig. 3Q and Fig. S23). The signal of selection is strong in all populations (iSAFE >0.5, p-val≪1.3e-8 for all) except AFR, which does not show a signal of selection in this region. The variant may not have been reported earlier because it is near fixation in all populations of 1000GP except AFR (f = 0.27), as seen in Fig. S23G. We plotted the haplotypes carrying rs672144 and found (Fig. 4) that two distinct haplotypes carry the mutation, both with high frequencies maintained across a large stretch of the region, suggestive of a soft sweep with standing variation.
The previously suggested candidates rs1042602 and rs10831496 are fully linked to rs672144 (Fig. S24), but not to each other. The EUR haplotypes can be partitioned into 4 clusters (Fig. S24). Each of the 4 clusters shows high homozygosity, suggestive of selection. However, rs1042602 can only explain the sweep in clusters C1+C2, and rs10831496 can only explain C1+C3. Only rs672144 explains all 4 clusters, providing a simpler explanation of selection in this region. GTEx eQTL analysis on the TYR gene for the tissue ‘Skin - Sun Exposed (Lower leg)’ showed p-value 0.61 for rs1042602, p-value 0.15 for rs10831496, and p-value 0.08 for rs672144. While the p-value for rs672144 does not reach significance, due to sample size issues, it is suggestive of a regulatory function for the mutation.
OCA2-HERC2
This region is suggested as a target of selection in Europeans4,28,9, and several mutations in this region are associated with hair, eye, and skin pigmentation. For example, rs12913832 is considered to be the main determinant of iris pigmentation (brown/blue) and is also associated with skin and hair pigmentation and the propensity to tan9. rs1667394 is also linked to blond hair and blue eyes 28. Several other mutations, many of them fully linked (rs4778138, rs4778241, rs7495174, rs1129038, rs916977), are also associated with blue eyes28. This region is also suggested to be a target of selection in East Asia, with rs1800414 suggested as a candidate for light skin pigmentation in that population. We applied iSAFE to this region in all 1000GP super-populations.
iSAFE selected a single variant, rs1448484 in OCA2, as the favored variant with high confidence (p-val<1.34e-8 for EUR, EAS, AMR and p-val=2.13e-6 for SAS) in all 1000GP populations (EUR, EAS, SAS, AMR) except AFR, which showed no signal of selection in this region (see Fig. S25 and Fig. 3P). This variant is close to fixation in all populations except AFR, where v = 20% (see Fig. S25F). The iSAFE result, along with the frequency pattern of the top-ranked variant, suggests an out-of-Africa selection on this region, probably on light skin color. The other candidate variants are all ranked high and are tightly linked with the top-ranked variant (Table S2).
KITLG
This genomic region has been linked to skin pigmentation29 in European and East Asian populations, and shows a strong signature of a selective sweep on regulatory regions surrounding the gene in all non-African populations 25, with a candidate variant, rs642742, that is associated with skin pigmentation29.
iSAFE analysis identified the same mutations gaining the top rank in multiple populations (Fig. S26). The top-ranked mutations in the EUR, SAS, EAS, and AMR populations are shown in Table S3. The top-ranked mutation in the EUR and CEU populations (rs405647) was ranked 1, 2, and 3 in AMR, SAS, and EAS, respectively, and is tightly linked to rs642742 (D′ = 0.92). Mutation rs661114 is ranked 2 in EUR, 5 in CEU, 6 in SAS, and 20 in AMR, and lies in a region with H3K27 acetylation, which is associated with enhanced expression.
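The linkage values quoted between pairs of variants in this section are read here as Lewontin's D′, the coefficient of linkage disequilibrium D normalized by its maximum attainable magnitude given the allele frequencies. The sketch below shows how D′ can be computed from phased haplotypes; it is an illustration with a hypothetical function name and a 0/1 allele encoding, not the code used to produce the reported values.

```python
def d_prime(site_a, site_b):
    """Absolute Lewontin's D' between two biallelic sites.

    site_a, site_b : equal-length sequences of 0/1 alleles, one entry per
                     phased haplotype (1 = derived allele).
    Returns None when either site is monomorphic (D' is undefined).
    """
    n = len(site_a)
    p_a = sum(site_a) / n
    p_b = sum(site_b) / n
    p_ab = sum(1 for a, b in zip(site_a, site_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return None
    return abs(d) / d_max
```

A value of 1 means that at most three of the four possible two-site haplotypes are observed, which is the sense in which two variants are described as fully linked.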
TRPV6
This region has been reported as a target of selection in the CEU population26. TRPV6 is involved in calcium absorption. It has been suggested that “Individuals with lighter skin pigmentation might have produced too much 1,25-dihydroxyvitamin D, resulting in an increased intestinal Ca2+ absorption. Thus, to reduce the risk of absorptive hypercalciuria with kidney stones, the derived haplotype would have spread only among individuals with lighter skin pigmentation”30. iSAFE suggests 10 strongly linked mutations along a 9kbp region located 84kbp downstream of TRPV6 (see Fig. S28). These mutations are ranked in the top 10 in all non-African populations (Table S5). There is no signal of selection in this region in AFR. The pattern of selection in this region across global populations, along with the confidence and consistency of the iSAFE results in all non-African populations, is consistent with an out-of-Africa selection on this region, with the favored mutation being near fixation in all non-African populations (Fig. S27).
6.3 Population specific selection: East Asian
PCDH15
This gene plays a role in the development of inner-ear hair cells and in maintaining retinal photoreceptors. It is reported to be under selection in East Asians, and a nonsynonymous mutation, rs4935502, is proposed to be the favored variant4. This mutation is ranked 12 by iSAFE in CHB+JPT (see Fig. S30A, iSAFE =0.45, p-val<1.34e-8). All top mutations are highly linked.
ADH1B
“The ADH1B gene encodes one of three subunits of the Alcohol dehydrogenase (ADH1) protein, a major enzyme in the alcohol degradation pathway that catalyzes the oxidization of alcohols into aldehydes.” This region is a target of positive selection in the East Asian population26. A non-synonymous mutation in this gene is associated with alcohol dependence 31. We tested this gene in the CHB+JPT populations. The iSAFE rank of the candidate mutation (rs1229984), in 2Mbp around the ADH1B gene, is 8 (see Fig. S30B). The top-ranked mutation (rs3811801) lies 5kbp upstream of the candidate mutation rs1229984 and is highly linked to it (D′ = 0.99). The second-ranked mutation (rs284787) is a 3′-UTR variant of ADH7 that has been shown to be associated with Upper Aerodigestive Tract Cancers in a Japanese Population 32.
6.4 Population specific selection: UK
The UK Biobank data were recently investigated for regions under selection. Regions were reported as targets of recent selection based on an analysis of the structure of the UK Biobank and ancient Eurasians27. We applied iSAFE to the GBR (British in England and Scotland) population in 1000GP to check whether the favored mutations could be confirmed.
ATXN2-SH2B3
Galinsky et al. proposed a nonsynonymous mutation (rs3184504) that is associated with blood pressure33 as a candidate. We tested this region in the GBR population of 1000GP. This candidate mutation is jointly ranked first with two other mutations, rs7137828 and rs7310615 (see Fig. 3O, iSAFE = 0.27, p-val=1.6e-7). rs7137828 is an intronic mutation in ATXN2 that is associated with Primary Open Angle Glaucoma, a leading cause of blindness worldwide 34. The other first-ranked mutation (rs7310615) is associated with blood expression levels of SH2B335. Surprisingly, all of the top 10 mutations ranked by iSAFE have a known association with a phenotype (Table S4) and are highly linked (Fig. S29).
CYP1A2/CSK
We tested a 5Mbp region around these genes in the GBR population of 1000GP. The mutation rs1378942 proposed in ref. 27, with frequency 0.69 in the GBR population, is ranked 89 by iSAFE (iSAFE = 0.13, p-val=7.0e-5). The top-ranked mutation rs2470893 (Fig. 3U, iSAFE = 0.16, p-val=2.7e-5) lies between CYP1A1 and CYP1A2, has frequency 0.40 in GBR, and is associated with caffeine metabolism 11. rs2470893 and rs1378942 are in strong LD (D′ = 0.91).
FUT2
The signal of selection in 5Mbp around this region in the GBR population is very weak (Fig. S30E), with peak iSAFE = 0.026, p-val=0.009. There is a very weak peak in 400kbp around the FUT2 gene (chr:49077276-49475876). The stop-gained mutation rs601338, proposed as a candidate mutation in ref. 27, is ranked 4 (p-val=0.1).
F12
The signal of selection in 5Mbp around this region in the GBR population is very weak (Fig. S30F, peak iSAFE = 0.027, p-val=0.008). The proposed mutation rs2545801 has a very weak signal (p-val=0.2).
6.5 Other genes
PSCA
This gene has been reported as a target of selection in the YRI population 26. A 5′UTR mutation, rs2294008, which is associated with urinary bladder and gastric cancers 36,37, has been proposed as a candidate favored mutation in this region. The iSAFE signal in 5Mbp around this gene in the YRI population is weak (see Fig. S30C, peak iSAFE = 0.04, p-val=2.4e-3). The proposed mutation rs2294008 is ranked 7 in the 5Mbp region surrounding the gene. Its local rank in 400kbp around the gene is joint-first with 8 other mutations, including rs2976392, which is also associated with diffuse-type gastric cancer 37. The other mutations are rs2978979, rs2920279, rs2978980, rs2920282, rs2294010, rs2717562, and rs2978982. These 9 mutations are fully linked in the YRI population in a 20kbp region that covers PSCA from its upstream regulatory region to its downstream region (chr8:143757286-143776668, GRCh37/hg19).
ASPM
This gene is reported to be a target of weak selection in the GBR population26. The signal in 2Mbp around this gene is very weak (see Fig. S30D, peak iSAFE = 0.025, p-val=0.01). The proposed mutation rs41310927 has a very weak signal (p-val=0.4). However, we do see a strong iSAFE signal 1.3Mbp away from the ASPM gene.
Acknowledgments
This research was supported in part by grants from the NSF (IIS-1318386 and DBI-1458557), and from the NIH (R01GM114362).