Abstract
Comparisons of DNA from archaic and modern humans show that these groups interbred, and in some cases received an evolutionary advantage from doing so. This process - adaptive introgression - may lead to a faster rate of adaptation than is predicted from models with mutation and selection alone. Within the last couple of years, a series of studies have identified regions of the genome that are likely examples of adaptive introgression. In many cases, once a region was ascertained as being introgressed, commonly used statistics based on both haplotype and allele frequency information were employed to test for positive selection. Introgression by itself, however, changes both the haplotype structure and the distribution of allele frequencies, thus confounding traditional tests for detecting positive selection. Therefore, patterns generated by introgression alone may lead to false inferences of positive selection. Here we explore models involving both introgression and positive selection to investigate the behavior of various statistics under adaptive introgression. In particular, we find that the number and allelic frequencies of sites that are uniquely shared between archaic humans and specific present-day populations are particularly useful for detecting adaptive introgression. We then examine the 1000 Genomes dataset to identify regions that were likely subject to adaptive introgression and discuss some of the most promising candidate genes located in these regions.
1 Introduction
There is a growing body of evidence supporting the idea that certain modern human populations admixed with archaic groups of humans after expanding out of Africa. In particular, non-African populations have 1-2% Neanderthal ancestry [1, 2], while Melanesians and East Asians have 3% and 0.2% ancestry, respectively, from Denisovans [3, 4, 2].
Recently, it has become possible to identify the fragments of the human genome that were introgressed and survive in present-day individuals [5, 6, 2, 7]. Researchers have also detected which of these introgressed regions are present at high frequencies in some present-day non-Africans but not others. Some of these regions are likely to have undergone positive selection in those populations after they were introgressed, a phenomenon known as adaptive introgression (AI). One particularly striking example of AI is the gene EPAS1 [8] which confers a selective advantage in Tibetans by making them less prone to hypoxia at high altitudes [9, 10, 11, 12, 13, 14, 15, 16]. The selected Tibetan haplotype is known to have been introduced in the human gene pool by Denisovans or a population closely related to them [17, 18].
In this study, we use simulations to assess the power to detect AI using different summary statistics that do not require the introgressed fragments to be identified a priori. Some of these are inspired by the signatures observed in EPAS1, which contains an elevated number of sites with alleles uniquely shared between the Denisovan genome and Tibetans. We then apply these statistics to real human genomic data from phase 3 of the 1000 Genomes Project [19], to detect AI in human populations, and find candidate genes. While these statistics are sensitive to adaptive introgression, they may also be sensitive to other phenomena that generate genomic patterns similar to those generated by AI, like ancestral population structure and incomplete lineage sorting. These processes, however, should not generate long regions of the genome where haplotypes from the source and the recipient population are highly similar. To assess whether the candidates we found are truly generated by AI, we explored the haplotype structure of some of the most promising candidates, and used a probabilistic method [20] that infers introgressed segments along the genome by looking at the spatial arrangement of SNPs that are consistent with introgression. This allows us to verify that the candidate regions contain introgressed haplotypes at high frequencies: a hallmark of AI.
2 Methods
2.1 Summary statistics sensitive to adaptive introgression
Several statistics have been previously deployed to detect AI events (reviewed in Racimo et al. [21]). We briefly describe these below, as well as three new statistics tailored specifically to find this signal (Table 1). One of the simplest approaches consists of applying the D statistic [1, 22] locally over windows of the genome. The D statistic was originally applied to compare a single human genome against another human genome, so as to detect excess shared ancestry between one of the genomes and a genome from an outgroup population. Application of this statistic comparing non-Africans and Africans served as one of the pieces of evidence in support of Neanderthal admixture into non-Africans. However, it can also be computed from large panels of multiple individuals instead of single genomes. This form of the D statistic has been applied locally over windows of the genome to detect regions of excess shared ancestry between an admixed population and a source population [23, 24].
The D statistic, however, can be confounded by local patterns of diversity, as regions of low diversity may artificially inflate the statistic even when a region was not adaptively introgressed. To correct for this, Martin et al. [25] developed a similar statistic called fD which is less sensitive to differences in diversity along the genome. Both of these statistics exploit the excess relatedness between the admixed and the source population.
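As an illustration of the panel-based form of D, the following minimal sketch computes the statistic over one window from per-site derived allele frequencies. It assumes the widely used frequency-based ABBA-BABA formulation with an outgroup fixed for the ancestral allele, which is not necessarily the exact implementation used in this study; the function and argument names are hypothetical, and fD is not shown.

```python
def d_statistic(p_nonadmixed, p_admixed, p_source):
    """Frequency-based D over the sites of one window.

    Each argument is a list of derived allele frequencies, one entry per site.
    Assumption: the outgroup carries the ancestral allele at every site.
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2, p3 in zip(p_nonadmixed, p_admixed, p_source):
        abba = (1 - p1) * p2 * p3  # derived allele shared by admixed panel and source
        baba = p1 * (1 - p2) * p3  # derived allele shared by non-admixed panel and source
        numerator += abba - baba
        denominator += abba + baba
    # Undefined when no site is informative in this window.
    return numerator / denominator if denominator > 0 else float("nan")
```

Positive values indicate excess allele sharing between the admixed panel and the source. A window with few informative sites has a small denominator, which is one way low diversity can inflate the statistic.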
AI is also expected to increase linkage disequilibrium (LD), as an introgressed fragment that rises in frequency in the population will have several closely linked loci that together will be segregating at different frequencies than they were in the recipient population before admixture. Thus, two well-known statistics that are informative about the amount of LD in a region - D′ and r2 - could also be informative about adaptive introgression. To apply them over regions of the genome, we can take the average of each of the two statistics over all SNP pairs in a window. In the section below, we calculate these statistics in two ways: a) using the introgressed population only (D′[intro] and r2[intro]), and b) using the combination of the introgressed and the non-introgressed populations (D′[comb] and r2[comb]).
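The window-level averaging described above can be sketched as follows. This is a minimal illustration that assumes phased haplotypes coded as 0/1, skips monomorphic sites, and averages the absolute value of D′ (the sign convention is an assumption). The function name is hypothetical.

```python
from itertools import combinations


def mean_ld(haplotypes):
    """Return (mean r^2, mean |D'|) over all pairs of polymorphic sites.

    haplotypes: list of equal-length lists of 0/1 alleles, one per haplotype.
    """
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    freq = [sum(h[j] for h in haplotypes) / n for j in range(n_sites)]
    # Monomorphic sites carry no LD information and would divide by zero.
    polymorphic = [j for j in range(n_sites) if 0 < freq[j] < 1]
    r2_values, dprime_values = [], []
    for i, j in combinations(polymorphic, 2):
        p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
        d = p_ab - freq[i] * freq[j]
        r2_values.append(
            d * d / (freq[i] * (1 - freq[i]) * freq[j] * (1 - freq[j]))
        )
        # D' rescales D by its maximum attainable magnitude given the frequencies.
        if d >= 0:
            d_max = min(freq[i] * (1 - freq[j]), (1 - freq[i]) * freq[j])
        else:
            d_max = min(freq[i] * freq[j], (1 - freq[i]) * (1 - freq[j]))
        dprime_values.append(abs(d) / d_max)
    if not r2_values:
        return float("nan"), float("nan")
    return (sum(r2_values) / len(r2_values),
            sum(dprime_values) / len(dprime_values))
```

Passing only the haplotypes of the introgressed panel corresponds to the [intro] versions, and passing the pooled haplotypes of both panels corresponds to the [comb] versions.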
We also introduce three new statistics that one would expect, a priori, to be particularly effective at identifying windows of the genome that are likely to have undergone adaptive introgression. First, in a region under adaptive introgression, one would expect the divergence between an individual from the source population and an admixed individual to be smaller than the divergence between an individual from the source population and a non-admixed individual. Thus, one could take the ratio of these two divergences over windows of the genome. One can then take the average of this ratio over all individuals in the admixed and non-admixed panels. This average should be larger if the introgressed haplotype is present in a large number of individuals of the admixed population. We call this statistic RD.
Second, for a window of arbitrary size, let UA,B,C(w,x,y) be defined as the number of sites where a sample C from an archaic source population (which could be as small as a single diploid individual) has a particular allele at frequency y, and that allele is at a frequency smaller than w in a sample A of a population but larger than x in a sample B of another population (Figure 1). In other words, we are looking for sites that contain alleles shared between an archaic human genome and a test population, but absent or at very low frequencies in an outgroup (usually non-admixed) population. Below, we denote panels A, B and C as the “outgroup”, “target” and “bait” panels, respectively. For example, suppose we are looking for Neanderthal adaptive introgression in the Han Chinese (CHB). In that case, we can consider CHB as our target panel, and use Africans as the outgroup panel and a single Neanderthal genome as the bait. If UAFR,CHB,Nea (1%, 20%, 100%) = 4 in a window of the genome, that means there are 4 sites in that window where the Neanderthal genome is homozygous for a particular allele and that allele is present at a frequency smaller than 1% in Africans but larger than 20% in Han Chinese. In other words, there are 4 sites that are uniquely shared at more than 20% frequency between Han Chinese and Neanderthal, but not with Africans.
This statistic can be further generalized if we have samples from two different archaic populations (for example, a Neanderthal genome and a Denisova genome). In that case, we can define UA,B,C,D (w, x, y, z) as the number of sites where the archaic sample C has a particular allele at frequency y and the archaic sample D has that allele at frequency z, while the same allele is at a frequency smaller than w in an outgroup panel A and larger than x in a target panel B (Figure S1). For example, if we were interested in looking for Neanderthal-specific AI, we could set y = 100% and z = 0%, to find alleles uniquely shared with Neanderthal, but not Denisova. If we were interested in archaic alleles shared with both Neanderthal and Denisova, we could set y = 100% and z = 100%.
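The counting rule behind U can be sketched in a few lines. This is a minimal illustration that assumes per-site frequencies of the same tracked allele in each panel, given as fractions between 0 and 1; the function and argument names are hypothetical.

```python
def u_statistic(out_freq, target_freq, bait_freq, w, x, y,
                bait2_freq=None, z=None):
    """Count sites with outgroup frequency < w, target frequency > x,
    and bait frequency exactly y (and second-bait frequency exactly z,
    if a second archaic sample is supplied).

    All frequency lists are per-site and refer to the same allele.
    """
    count = 0
    for i in range(len(out_freq)):
        if out_freq[i] < w and target_freq[i] > x and bait_freq[i] == y:
            if bait2_freq is None or bait2_freq[i] == z:
                count += 1
    return count
```

For a single diploid archaic genome the bait frequency can only be 0, 0.5 or 1, so the exact comparison against y is safe. With the example above, `u_statistic(afr, chb, nea, 0.01, 0.20, 1.0)` would count the sites entering UAFR,CHB,Nea(1%, 20%, 100%).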
Another statistic that we found to be useful for finding AI events is Q95A,B,C(w, y), and is here defined as the 95th percentile of derived frequencies in an admixed sample B of all SNPs that have a derived allele frequency y in the archaic sample C, but where the derived allele is at a frequency smaller than w in a sample A of a non-admixed population (Figure 1). For example, Q95AFR,CHB,Nea(1%, 100%) = 0.65 means that if one computes the 95% quantile of all the Han Chinese derived allele frequencies of SNPs where the Neanderthal genome is homozygous derived and the derived allele has frequency smaller than 1% in Africans, that quantile will be equal to 0.65. As before, we can generalize this statistic if we have a sample D from a second archaic population. Then, Q95A,B,C,D(w, y, z) is the 95th percentile of derived frequencies in the sample B of all SNPs that have a derived allele frequency y in the archaic sample C and derived allele frequency z in the archaic sample D, but where the derived allele is at a frequency smaller than w in the sample A (Figure S1).
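A minimal sketch of the quantile statistic follows. It assumes per-site derived allele frequencies as fractions and uses linear interpolation between order statistics to obtain the percentile. The interpolation rule is an assumption, as the text does not specify one, and the function name is hypothetical.

```python
def q95(out_freq, target_freq, bait_freq, w, y, q=0.95):
    """q-th quantile of target-panel derived allele frequencies over sites
    where the outgroup frequency is < w and the bait frequency equals y."""
    selected = sorted(
        t for o, t, b in zip(out_freq, target_freq, bait_freq)
        if o < w and b == y
    )
    if not selected:
        return float("nan")  # no qualifying site in this window
    # Linear interpolation between the two closest order statistics.
    position = q * (len(selected) - 1)
    lower = int(position)
    upper = min(lower + 1, len(selected) - 1)
    fraction = position - lower
    return selected[lower] + fraction * (selected[upper] - selected[lower])
```

A second archaic sample can be handled by adding one more equality condition to the site filter, in the same way as for U.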
In the section below, we evaluate the sensitivity and specificity of all these statistics using simulations. We also evaluate the effect of adaptive introgression on a common statistic that is indicative of population variation - expected heterozygosity (Het), as this statistic was previously found to be affected by archaic introgression in a serial founder model of human history [26]. We measured Het as the average of 2*p*(1-p) over all sites in a window, where p is the sample derived allele frequency in the introgressed population.
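The heterozygosity measure defined above is simple enough to state directly in code. The sketch below assumes a list of sample derived allele frequencies, one per site in the window; the function name is hypothetical.

```python
def expected_heterozygosity(derived_freqs):
    """Average of 2*p*(1-p) over all sites in a window."""
    if not derived_freqs:
        return float("nan")  # empty window
    return sum(2 * p * (1 - p) for p in derived_freqs) / len(derived_freqs)
```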
2.2 Simulations
None of these statistics have been explicitly vetted under scenarios of AI so far, though the performance of D and fD has been previously evaluated for detecting local introgression [25]. Therefore, we aimed to test how each of the statistics described above performed in detecting AI. We began by simulating a three-population tree in SLiM [27] with constant Ne = 10,000, mutation rate equal to 1.5 × 10^-8 per bp per generation, recombination rate equal to 10^-8 per bp per generation, and split times emulating the African-Eurasian and Neanderthal-modern human split times (4,000 and 16,000 generations ago, respectively). We allowed for admixture between the most distantly diverged population and one of the closely related sister populations, at different rates: 2%, 10% and 25% (Figure 2.A). This is meant to represent Neanderthal admixture into Eurasians, with Africans as the non-admixed population. Under each of the three admixture rate scenarios, we simulated regions that were evolving neutrally, regions where the central SNP was under weak positive additive selection (s = 0.01) and regions with a central SNP under strong selection (s = 0.1).
We also tested how the statistics perform at detecting adaptive introgression when the alternative model is not a neutral introgression model, but a neutral model with ancestral structure (Figure 2.G). We followed a model described in Huerta-Sanchez et al. (2014) and simulated a scenario in which an African population splits from archaic humans before Eurasians, but is allowed to exchange migrants with them. Afterwards, we split Eurasians and archaic humans. At that point, we stop the previous migration and only allow for migration between the Eurasian and African populations until the present, at double the previous rate. This is meant to generate loci where Eurasians and archaic humans share a more recent common ancestor with each other than with Africans, but because of ancient shared ancestry, not recent introgression. We simulated 3 scenarios, in which we set the per-generation ancient (recent) migration rate to be 0.01 (0.02), 0.001 (0.002) and 0.0001 (0.0002). We call these the strong-, medium-, and weak-migration scenarios, respectively. The stronger the migration, the weaker the ancestral structure, as archaic-shared segments in Eurasians will tend to be removed by migration with Africans.
2.3 Plotting haplotype structure
The Haplostrips software (Marnetto et al. in prep.) was used to produce plots of haplotypes at candidate regions for AI. This software displays each SNP within a predefined region as a column, while each row represents a phased haplotype: the result is a heatmap. Each haplotype is labeled with a color that corresponds to the 1000 Genomes panel of its carrier individual. The haplotypes were first hierarchically clustered via the single agglomerative method based on Manhattan distances, using the stats library in R. The resulting dendrogram of haplotypes was then re-ordered by decreasing similarity to a reference sequence constructed so that it contains all the derived alleles found in the archaic genome (Altai Neanderthal or Denisova). The reordering is performed using the minimum distance method, so that haplotypes with more derived alleles shared with the archaic population are at the top of the plot. Derived alleles are represented as black spots and ancestral alleles are represented as white spots. Variant positions were filtered out when the site in the archaic genome had mapping quality less than 30 or genotype quality less than 40, or if the minor allele had a population frequency smaller than 5% in each of the present-day human populations included in the plot.
2.4 Hidden Markov Model
As haplotypes could look archaic simply because of ancestral structure or incomplete lineage sorting, we used a Hidden Markov Model (HMM) described in ref. [20] (which assumes an exponential distribution of admixture tract lengths [28, 29]), in order to verify that our candidate regions truly had archaic introgressed segments. This procedure also allowed us to confirm which of the archaic genomes was closest to the original source of introgression, as using a distant archaic source as input (for example, the Denisova genome when the true source is closest to the Neanderthal genome) produced shorter or less frequent inferred segments in the HMM output than when using the closer source genome.
The HMM we used requires us to specify a prior for the admixture rate. We tried two priors: 2% and 50%. The first was chosen because it is consistent with the genome-wide admixture rate for Neanderthals into Eurasians. The second, larger, value was chosen because each candidate region should a priori have a larger probability of being admixed, as they were found using statistics that are indicative of admixture in the first place. We observe almost no differences in the number of haplotypes inferred using either value. However, the larger prior leads to longer and less fragmented introgressed chunks, as the HMM is less likely to transition into a non-introgressed state between two introgressed states, so all figures we show below were obtained using a 50% admixture prior. The admixture time was set to 1,900 generations ago and the recombination rate was set to 2.3 × 10^-8 per bp per generation. A tract was called as introgressed if the posterior probability for introgression was higher than 90%.
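The final thresholding step can be illustrated as follows. This sketch covers only the calling of tracts from per-SNP posterior probabilities, not the HMM of ref. [20] itself. It assumes that a tract is a maximal run of consecutive SNPs whose posterior exceeds the cutoff. The function name and input layout are hypothetical.

```python
def call_tracts(positions, posteriors, threshold=0.9):
    """Return (start, end) coordinates of maximal runs of consecutive SNPs
    whose posterior probability of introgression exceeds `threshold`.

    positions: sorted SNP coordinates along one haplotype.
    posteriors: posterior probability of the introgressed state at each SNP.
    """
    tracts = []
    start = None
    previous = None
    for pos, post in zip(positions, posteriors):
        if post > threshold:
            if start is None:
                start = pos  # open a new tract
            previous = pos
        elif start is not None:
            tracts.append((start, previous))  # close the current tract
            start = None
    if start is not None:
        tracts.append((start, previous))  # tract runs to the last SNP
    return tracts
```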
2.5 Testing for enrichment in genic regions
To test whether uniquely shared archaic alleles at high frequencies were enriched in genic regions of the genome, we looked at archaic alleles at high frequency in any of the non-African panels that were also at low frequency (< 1%) in Africans. As background, we used all archaic alleles that were at any frequency larger than 0 in the same non-African populations, and that were also at low frequency in Africans. We then tested whether the high-frequency archaic alleles tended to occur in genic regions more often than expected.
SNPs in introgressed blocks will tend to cluster together and have similar allele frequencies, which could cause a spurious enrichment signal. To correct for the fact that SNPs at similar allele frequencies will cluster together (as they will tend to co-occur in the same haplotypes), we performed linkage disequilibrium (LD) pruning using two methods. In one (called “LD-1”), we downloaded the approximately independent European LD blocks published in ref. [30]. For each set of high frequency derived sites, we randomly sampled one SNP from each block. In a different approach (called “LD-2”), for each set of high frequency derived sites, we subsampled SNPs such that each SNP was at least 200 kb apart from each other. We then tested these two types of LD-pruned SNP sets against 1000 SNP sets of equal length that were also LD-pruned and that were obtained randomizing frequencies and collecting SNPs in the same ways as described above.
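The distance-based thinning used in the LD-2 approach can be sketched as below. The text does not state how SNPs were chosen subject to the 200 kb spacing, so this illustration assumes a greedy left-to-right pass along each chromosome. The function name and the (chromosome, position) input layout are hypothetical.

```python
def thin_by_distance(snps, min_distance=200_000):
    """Keep SNPs so that retained SNPs on the same chromosome are at least
    `min_distance` bp apart (greedy scan in coordinate order).

    snps: iterable of (chromosome, position) tuples.
    """
    kept = []
    last_kept = {}  # chromosome -> position of the last retained SNP
    for chrom, pos in sorted(snps):
        if chrom not in last_kept or pos - last_kept[chrom] >= min_distance:
            kept.append((chrom, pos))
            last_kept[chrom] = pos
    return kept
```

A randomized variant (for example, shuffling the starting SNP) would satisfy the same spacing constraint and may be closer to a subsampling procedure. The greedy scan is used here only because it is deterministic.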
3 Results
3.1 Simulations
3.1.1 Statistics based on shared allele configurations
We tested the performance of the statistics described above under scenarios of adaptive introgression. Figures S2, S3 and S4 show the distribution of statistics that rely on patterns of shared allele configurations between source and introgressed populations (Het, D, fD, UA,B,C, Q95A,B,C and RD), for different choices of the selection coefficient s, and under 2%, 10% and 25% admixture rates, respectively. For Q95A,B,C(w, 100%) and UA,B,C(w, x, 100%), we tested different choices of w (1%, 10%) and x (0%, 20%, 50% and 80%). Some statistics, like Q95A,B,C(1%, 100%) and fD show strong separation between the selection regimes. For example, with an admixture rate of 2%, Q95A,B,C(1%, 100%) has 100% sensitivity at a specificity of 99%, for both s = 0.1 and s = 0.01.
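The notion of sensitivity at a fixed specificity can be made concrete with a short sketch. It assumes two lists of simulated values of one statistic (neutral introgression and adaptive introgression) and that larger values indicate AI. The way the cutoff is derived from the neutral simulations is an assumption, and the function name is hypothetical.

```python
def sensitivity_at_specificity(neutral, selected, specificity=0.99):
    """Fraction of selected simulations exceeding a cutoff chosen so that
    at most (1 - specificity) of neutral simulations exceed it."""
    ranked = sorted(neutral, reverse=True)
    # Number of neutral simulations allowed above the cutoff; the small
    # epsilon guards against floating-point round-off in the product.
    max_false_positives = int((1 - specificity) * len(ranked) + 1e-9)
    if max_false_positives >= len(ranked):
        threshold = float("-inf")
    else:
        threshold = ranked[max_false_positives]
    return sum(v > threshold for v in selected) / len(selected)
```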
Other statistics are not as effective, however. For example, UA,B,C(1%, 0%, 100%) shows some power when the admixture rate is low (2%), but almost no power when the admixture rate is high (25%). This is because setting the test population archaic allele frequency minimum threshold at x = 0% means that any site with some archaic allele in the test panel will be counted, regardless of the allele frequency, so long as the archaic allele is at low frequency in the outgroup panel. At high admixture rates, low- and medium-frequency archaic alleles would naturally occur under neutrality, so they would not be informative about AI.
3.1.2 LD-based statistics
In turn, Figure S5 shows the distribution of LD-based statistics under different selection and admixture rate regimes. Note that while D′[intro], D′[comb] and r2[comb] are generally increased by adaptive introgression, this is not the case with r2[intro] under strong selection and admixture regimes. This is because r2 will tend to decrease if the minor allele frequency is very small, which will occur if this frequency is only measured in the population undergoing adaptive introgression. In general, these statistics are not as powerful for detecting AI as allele configuration statistics like U or Q95.
3.1.3 Receiver operating characteristic curves
In Figures 3 and S6, we plot receiver operating characteristic (ROC) curves of all these statistics, for various selection and admixture regimes. In general, Q95A,B,C(1%, 100%), Q95A,B,C(10%, 100%) and fD are very powerful statistics for detecting AI. The number of uniquely shared sites UA,B,C(w, x, y) is also powerful, so long as the frequency cutoff in the target panel (x) is large. Additionally, for different choices of x, using w = 1% yields a more powerful statistic than using w = 10%.
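For reference, ROC curves of this kind can be built from two sets of simulated values of a statistic by sweeping a cutoff. The sketch below assumes that larger values indicate AI and summarizes the curve by its area using the trapezoid rule. The function names are hypothetical and this is not necessarily how the figures were produced.

```python
def roc_points(neutral, selected):
    """Return (false positive rate, true positive rate) pairs obtained by
    calling a simulation positive when its statistic is >= each cutoff."""
    cutoffs = sorted(set(neutral) | set(selected), reverse=True)
    points = [(0.0, 0.0)]
    for cutoff in cutoffs:
        fpr = sum(v >= cutoff for v in neutral) / len(neutral)
        tpr = sum(v >= cutoff for v in selected) / len(selected)
        points.append((fpr, tpr))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

An area close to 1 corresponds to a statistic that separates the two regimes almost perfectly, while an area of 0.5 corresponds to no discriminatory power.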
3.1.4 Joint distributions
We were also interested in the joint distribution of pairs of these statistics. Figure S7 shows the joint distribution of Q95A,B,C(1%, 100%) in the y-axis and four other statistics (RD, Het, D and fD) in the x-axis, under different admixture and selection regimes. One can observe, for example, that while Q95A,B,C(1%, 100%) increases with increasing selection intensity and admixture rates, Het increases with increasing admixture rates, but decreases with increasing selection intensity. Thus, under AI the two forces cancel each other out, and we obtain a similar value of Het as under neutrality. Furthermore, the joint distributions of Q95A,B,C(1%, 100%) and fD or RD show particularly good separation among the different AI scenarios.
Another joint distribution that is especially good at separating different AI regimes is the combination of Q95A,B,C(w, 100%) and UA,B,C(w, x, 100%). In Figure 4, we show this joint distribution, for different choices of w (1%, 10%) and x (20%, 50%). Here, with increasing intensity of selection and admixture, the number of uniquely shared sites and the quantile statistic increase, but the quantile statistic tends to only reach high values when selection is strong, even if admixture rates are low.
3.1.5 Alternative demographic scenarios
We evaluated the performance of our statistics under various alternative demographic scenarios. First, we simulated a 5X bottleneck occurring in population B 1,600 generations before the admixture event, and lasting 200 generations, to observe its effects on the power of the statistics for detecting AI (Figure 2.B). Though we observe a reduction in power - most evident in the heterozygosity statistics - none of the statistics are very strongly affected by this event (Figure S8). We also simulated a bottleneck of equal size but occurring after the admixture event - starting 1,400 generations ago, and lasting 200 generations (Figure 2.C). In this case, the sensitivity of all the statistics is strongly reduced when the admixture rate is low (Figure S9). For example, when looking at the raw values of the UA,B,C and Q95A,B,C statistics, we observe that for low admixture rates the distribution under selection has more overlap with the distribution under neutrality, which explains the low power (Figures S10, S11). Additionally, UA,B,C seems to display more elevated values under neutrality than in the constant population size model. However, the relative performance of each statistic with respect to all the others does not appear to change much (Figure S9).
We next explored a model where the introgressed haplotype was not immediately adaptive in the Eurasian population, but instead underwent an intermediate period of neutral drift, before it becomes advantageous (Figure 2.D). In such a situation, our power to detect AI is reduced, for all statistics (Figure S12). This is particularly an issue when the admixture rate is low, as in those cases the starting frequency of the selected allele in the Eurasian population is low, so it is more likely to drift to extinction during the neutral period, before it can become advantageous.
We also evaluated the performance of our statistics under selective scenarios that did not involve adaptive introgression, to check which of them were sensitive to these models and which were not. Under a model of selection from de novo mutation (SDN, Figure 2.E), in which a single mutation appears in the receiving population after the admixture event, the heterozygosity and linkage disequilibrium statistics (r2[intro] and D′[intro]) are the most sensitive ones (Figure S13). This is expected, given that classical selective sweeps are known to strongly affect patterns of heterozygosity and linkage disequilibrium in the neighborhood of the selected site [31, 32, 33]. Since all other statistics have very poor sensitivity to detect SDN, we expect to be able to distinguish signatures generated from SDN and AI.
We also simulated a model of selection from standing variation (Figure 2.F), by randomly selecting 20% of haplotypes within the introgressing population to be advantageous, after the introgression event had already occurred. In this case, all statistics perform poorly, especially when admixture is low. Interestingly, when admixture is high (Figure S14), Q95A,B,C(1%, 100%) and UA,B,C(1%, 0%, 100%) are the best performing statistics. This is likely because some of the haplotypes that are randomly chosen to be selected also happen to be ancestrally polymorphic and present in the archaic humans.
When we set ancestral structure to be our null model, we observe different behaviors depending on the strength of the migration rates. When the migration rates are strong (Figure S15), we have excellent power to detect AI with several statistics, including Q95A,B,C(1%, 100%), D, fD, RD and UA,B,C(1%, 50%, 100%). When the rates are of medium strength (Figure S16), the power is slightly reduced, but the same statistics are the ones that perform best. When the migration rates are weak - meaning ancestral structure is very strong - Q95A,B,C(1%, 100%) loses power, and the best-performing statistics are RD, D and fD (Figure S17). We note, though, that the genome-wide D observed under this last ancestral structure model (D = 0.24) is much more extreme than the genome-wide D observed empirically between any Eurasian population and Neanderthals or Denisovans, suggesting that if there was ancestral structure between archaic and modern humans, it was likely not of this magnitude.
3.2 Global features of uniquely shared archaic alleles
Before identifying candidate genes for adaptive introgression, we investigated the frequency and number of uniquely shared sites at the genome-wide level. Specifically, we wanted to know whether human populations varied in the number of sites with uniquely shared archaic alleles, and whether they also varied in the frequency distribution of these alleles. Therefore, we computed UA,B,Nea,Den(1%, x, y, z) and Q95A,B,Nea,Den(1%, y, z) for different choices of x, y and z. We used each of the non-African panels in the 1000 Genomes Project phase 3 data [19] as the “test” panel (B), and chose the outgroup panel (A) to be the combination of all African populations (YRI, LWK, GWD, MSL, ESN), excluding admixed African-Americans. When setting x = 0% (i.e. not imposing a frequency cutoff in the target panel B), South Asians as a target population show the largest number of archaic alleles (Figure 5.A). However, East Asians have a larger number of high-frequency uniquely shared archaic alleles than Europeans and South Asians, for both x = 20% and x = 50% (Figure 5.B-C). Population-specific D-statistics (using YRI as the non-admixed population) also follow this trend (Figure S18) and we observe this pattern when looking only at the X chromosome as well (Figure S19). These results hold in comparisons with both archaic human genomes, but we observe a stronger signal when looking at Neanderthal-specific shared alleles. To correct for the fact that some panels have more segregating sites than others (and may therefore have more archaic-like segregating sites), we also scaled the number of uniquely shared sites by the total number of segregating sites per population panel (Figure 5.D-F), and we see in general the same patterns, with the exception of a Peruvian panel, which we discuss further below. We also observe similar patterns when calculating Q95A,B,Nea,Den(1%, y, z) genome-wide (Figure S20).
The elevation in UA,B,Nea,Den and Q95A,B,Nea,Den in East Asians may result from higher levels of archaic ancestry in East Asians than in Europeans [34], and agrees with studies indicating that more than one pulse of admixture likely occurred in East Asians [35, 36].
Surprisingly, the Peruvians (PEL) harbor a larger number of high-frequency mutations of archaic origin than any other single population, especially when using Neanderthals as bait (Figures 5.B-C, S19). It is unclear whether this signal is due to increased drift or selection in this population. Skoglund et al. [37] argue via simulations that if one analyzes a population with high amounts of recent genetic drift and excludes SNPs where the minor allele is at low frequency, some statistics that are meant to detect archaic ancestry - like D - may be artificially inflated. Our filtering procedure to select uniquely shared archaic alleles necessarily excludes sites where the archaic allele is at low frequency in the target panel, and the PEL panel comes from a population with a history of low effective population sizes (high drift) relative to other non-Africans [19], which could explain this pattern. This could also explain why the effect is not seen when x = 0% (Figure 5.A), or when computing D-statistics (Figure S18), both of which include sites with low-frequency alleles in their computation. Additionally, scaling the uniquely shared sites by the total number of segregating sites per population panel mitigates (but does not completely erase) this pattern. After scaling, PEL shows levels of archaic allele sharing within the range of the East Asian populations at x = 20% (Figure 5.E), but is still the panel with the largest number of archaic sites at x = 50% (Figure 5.F).
Additionally, we plotted the values of UAFR,X,Nea,Den(1%, x, y, z) and Q95AFR,X,Nea,Den(1%, y, z) jointly for each population X, under different target-panel frequency cutoffs x. When x = 0%, there is a generally inversely proportional relationship between the two scores (Figure S21), but this becomes a directly proportional relationship when x = 20% (Figure 6) or x = 50% (Figure S25). Here, we also clearly observe that PEL is an extreme population with respect to both the number and frequency of archaic shared derived alleles, and that East Asian and American populations have more high-frequency archaic shared alleles than Europeans.
We checked via simulations if the observed excess of high frequency archaic derived mutations in Americans and especially Peruvians could be caused by genetic drift, as a consequence of the bottleneck that occurred in the ancestors of Native Americans as they crossed Beringia. We observe that if the introgressed population B undergoes a bottleneck, this can lead to larger values of UA,B,C(w, x, y) for large values of x (Figures S10, S11, S22). Indeed, population structure analyses of the 1000 Genomes samples suggest that Peruvians have the largest amount of Native American ancestry [19] and show a bottleneck with a lack of recent population growth, which could explain this pattern. We also observe an increase in the variance of the distribution of U and Q95 in the presence of a bottleneck, especially when long and severe (Figures S23, S24).
3.3 Candidate regions for adaptive introgression
To identify adaptively introgressed regions of the genome, we computed UA,B,C,D(w, x,y, z) and Q95A,B,C,D(w, y, z) in 40kb non-overlapping windows along the genome, using the low-coverage sequencing data from phase 3 of the 1000 Genomes Project [19]. We used this window size because the mean length of introgressed haplotypes found in ref. [2] was 44,078 bp (Supplementary Information 13), and 40kb is well over the length needed to reject incomplete lineage sorting [17]. Our motivation was to find regions under AI in a particular panel B, using panel A as a non-introgressed out-group (generally Africans, unless otherwise stated). We used the high-coverage Altai Neanderthal genome [2] as bait panel C and the high-coverage Denisova genome [4] as bait panel D. We deployed these statistics in three ways: a) to look for Neanderthal-specific AI, we set y = 100% and z = 0%; b) to look for Denisova-specific AI, we set y = 0% and z = 100%; c) to look for AI matching both of the archaic genomes, we set y = 100% and z = 100% (Figure S1, Table S3). To try to determine the adaptive pressure behind the putative AI event, we obtained all the CCDS-verified genes located inside each window [38].
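The windowed scan can be illustrated with a short sketch for the U statistic. It assumes one chromosome at a time, 0-based window starts at multiples of the window size, and per-site frequencies given as fractions. The function name and the tuple layout of the input are hypothetical.

```python
from collections import defaultdict


def windowed_u(sites, w, x, y, z, window=40_000):
    """Count qualifying sites per non-overlapping window.

    sites: iterable of (position, outgroup_freq, target_freq,
           neanderthal_freq, denisova_freq) for one chromosome.
    Returns {window_start: count} for windows with at least one such site.
    """
    counts = defaultdict(int)
    for pos, a, b, c, d in sites:
        if a < w and b > x and c == y and d == z:
            counts[(pos // window) * window] += 1
    return dict(counts)
```

Under this layout, the three search modes described above correspond to (y, z) = (1.0, 0.0) for Neanderthal-specific alleles, (0.0, 1.0) for Denisova-specific alleles and (1.0, 1.0) for alleles matching both archaic genomes.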
For guidance as to how high a value of U and Q95 we would expect under neutrality, we used the simulations from Figure 2 to obtain 95% empirical quantiles of the distribution of these scores under neutrality. Tables S1 and S2 show the 95% quantiles for these two statistics under various models of adaptive introgression and ancestral structure, for different choices of parameter values (see Methods Section). When examining our candidates for AI below, we focused on windows whose values for UA,B,Nea,Den(w, x, y, z) and Q95A,B,Nea,Den(w, y, z) were both in the 99.9% quantile of their respective genome-wide distributions, and also verified that these values would be statistically significant at the 5% level under a simple model of neutral admixture.
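The joint-outlier filter on the genome-wide distributions can be illustrated as follows. This is a sketch under stated assumptions: the per-window values of U and Q95 are taken as two equal-length lists, the quantile uses a nearest-rank convention, and a window is retained when it is at or above the cutoff for both statistics. The function names `empirical_quantile` and `joint_outliers` are hypothetical, and the check against neutral simulations is not shown.

```python
import math
from typing import List, Sequence


def empirical_quantile(values: Sequence[float], prob: float) -> float:
    """Nearest-rank empirical quantile of a non-empty sequence."""
    ordered = sorted(values)
    # round() guards against floating-point noise before taking the ceiling
    rank = max(1, math.ceil(round(prob * len(ordered), 9)))
    return ordered[rank - 1]


def joint_outliers(
    u_values: Sequence[float], q95_values: Sequence[float], prob: float = 0.999
) -> List[int]:
    """Indices of windows at or above the `prob` quantile for both U and Q95."""
    u_cut = empirical_quantile(u_values, prob)
    q_cut = empirical_quantile(q95_values, prob)
    return [
        i for i, (u, q) in enumerate(zip(u_values, q95_values))
        if u >= u_cut and q >= q_cut
    ]
```

Because U is an integer count with many ties near zero, a real scan would need to decide how to treat windows that tie with the cutoff. The inclusive comparison used here is only one possible choice.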
We also calculated D and fD along the same windows (using Africans as the non-admixed population), and saw good agreement with the new statistics presented here (Table S3). Finally, we validated the regions most likely to have been adaptively introgressed by searching for archaic tracts of introgression within them that were at high frequency, using a Hidden Markov Model (see below).
3.3.1 Continental populations
When focusing on adaptive introgression in continental populations, we first looked for uniquely shared archaic alleles specific to Europeans that were absent or almost absent (< 1% frequency) in Africans and East Asians. Conversely, we also looked for uniquely shared archaic alleles in East Asians, which were absent or almost absent in Africans and Europeans. In this continental survey, we ignored Latin American populations as they have high amounts of European and African ancestry, which could confound our analyses. Figure 7 shows the number of sites with uniquely shared alleles for increasing frequency cutoffs in the introgressed population, and for different types of archaic alleles (Neanderthal-specific, Denisova-specific or common to both archaic humans). In other words, we calculated UAFR,EUR,Nea,Den(1%, x, y, z) and UAFR,EAS,Nea,Den(1%, x, y, z) for different values of x (0%, 20%, 50% and 80%) and different choices of y and z, depending on which type of archaic alleles we were looking for. We observe that the regions in the extreme of the distributions for x = 50% corresponded very well to genes that had been previously found to be candidates for adaptive introgression from archaic humans in these populations, using more complex probabilistic methods [6, 5] or gene-centric approaches [39]. These include BNC2 (involved in skin pigmentation [40, 41]), POU2F3 (involved in skin keratinocyte differentiation [42, 43]), HYAL2 (involved in the response to UV radiation on human keratinocytes [44]), SIPA1L2 (involved in neuronal signaling [45]) and CHMP1A (a regulator of cerebellar development [46]). To be more rigorous in our search for adaptive introgression, we looked at the joint distribution of the U statistic and the Q95 statistic for the same choices of w, y and z, and then selected the regions that were in the 99.9% quantiles of the distributions of both statistics (Figures 8, S26, S27). We find that the strongest candidates here are BNC2, POU2F3, SIPA1L2 and the HYAL2 region.
We also scanned for regions of the genome where South Asians (SAS) had uniquely shared archaic alleles at high frequency, which were absent or almost absent in Europeans, East Asians and Africans. In this case, we focused on x = 20% because we found that x = 50% left us with no candidate regions. Among the candidate regions sharing a large number of high-frequency Neanderthal alleles in South Asians, we find genes ASTN2, SFMBT1, MUSTN1 and MAML2 (Figure S28). ASTN2 is involved in neuronal migration [47] and is associated with schizophrenia [48, 49]. SFMBT1 is involved in myogenesis [50] and is associated with hydrocephalus [51]. MUSTN1 plays a role in the regeneration of the musculoskeletal system [52]. Finally, MAML2 codes for a signaling protein [53, 54], and is associated with cutaneous carcinoma [55] and lacrimal gland cancer [56].
3.3.2 Eurasia
We then looked for AI in all Eurasians (EUA=EUR+SAS+EAS, ignoring American populations) using Africans as the non-admixed population (AFR, ignoring admixed African-Americans). Figure 8 shows the extreme outlier regions that are in the 99.9% quantiles for both UAFR,EUA,Nea,Den(1%, 20%, y, z) and Q95AFR,EUA,Nea,Den(1%, y, z), while Figure S29 shows the entire distribution. We focused on x = 20% because we found that x = 50% left us with almost no candidate regions. In this case, the region with by far the largest number of uniquely shared archaic alleles is the one containing genes OAS1 and OAS3, involved in innate immunity [57, 58, 59, 60]. This region was previously identified as a candidate for AI from Neanderthals in non-Africans [61]. Another region that we recover and was previously identified as a candidate for AI is the one containing genes TLR1 and TLR6 [62, 63]. These genes are also involved in innate immunity and have been shown to be under positive selection in some non-African populations [64, 65].
Interestingly, we find that a very strong candidate region in Eurasia contains genes TBX15 and WARS2. This region has been associated with a variety of traits, including adipose tissue differentiation [66], body fat distribution [67, 68, 69, 70], hair pigmentation [71], facial morphology [72, 73], ear morphology [74], stature [73] and skeletal development [75, 73]. It was previously identified as being under positive selection in Greenlanders [76], and it shows particularly striking signatures of adaptive introgression, so we devote a separate study to its analysis [77].
3.3.3 Population-specific signals of adaptive introgression
To identify population-specific signals of AI, we looked for archaic alleles at high frequency in a particular non-African panel X, which were also at less than 1% frequency in all other non-African and African panels, excluding X (Table S3). This is a very restrictive requirement, and indeed, we only find a few windows in a single panel (PEL) with archaic alleles at more than 20% frequency, at sites where the archaic allele is at less than 1% frequency in all other panels. One of the regions with the largest number of uniquely shared Neanderthal sites in PEL contains gene CHD2, which codes for a DNA helicase [78] involved in myogenesis (UniProtKB by similarity), and that is associated with epilepsy [79, 80].
3.3.4 Shared signals among populations
In the previous section, we focused on regions where archaic alleles were uniquely at high frequencies in particular populations, but at low frequencies in all other populations. This precludes us from detecting AI regions that are shared across more than one non-African population. To address this, we conditioned on observing the archaic allele at less than 1% frequency in a non-admixed outgroup panel composed of all the African panels (YRI, LWK, GWD, MSL, ESN), excluding African-Americans, and then looked for archaic alleles at high frequency in particular non-African populations. Unlike the previous section, we did not condition on the archaic allele being at low frequency in other non-African populations as well. The whole joint distributions of U and Q95 for this choice of parameters for each non-African panel are shown in figs. S30 to S48, while regions in the 99.9% quantile for both statistics are shown in Figure 8.
Here, we recapitulate many of the findings from our Eurasian and continental-specific analyses above, like TLR1/TLR6, BNC2, OAS1/OAS3, POU2F3, LIPA and TBX15/WARS2 (Figure 8). For example, just as we found that POU2F3 was an extreme region in the East Asian (EAS) continental panel, we separately find that almost all populations composing that panel (CHB, KHV, CHS, CDX, JPT) have archaic alleles in that region at disproportionately high frequency, relative to their frequency in Africans. Additionally though, we can learn things we would not have detected at the continental level. For example, the Bengali from Bangladesh (BEB) - a South Asian population - also have archaic alleles at very high frequencies in this region.
We detected several genes that appear to show signatures of AI across various populations (Figure 8). One of the most extreme examples is a 120 kb region containing the LARS gene, with 76 uniquely shared Neanderthal alleles at < 1% frequency in Africans and > 50% frequency in Peruvians, which are also at > 20% frequency in Mexicans. LARS codes for a leucyl-tRNA synthetase [81], and is associated with liver failure syndrome [82]. Additionally, a region containing gene ZFHX3 displays an elevated number of uniquely shared Neanderthal sites in PEL, and we also observe this when looking more broadly at East Asians (EAS) and - based on the patterns of inferred introgressed tracts (see below) - in various American (AMR) populations as well. ZFHX3 is involved in the inhibition of estrogen receptor-mediated transcription [83] and has been associated with prostate cancer [84].
We also find several Neanderthal-specific uniquely shared sites in American panels (PEL, CLM, MXL) in a region previously identified as harboring a risk haplotype for type 2 diabetes (chr17:6880001-6960000) [85]. This is consistent with previous findings suggesting the risk haplotype was introgressed from Neanderthals and is specifically present at high frequencies in Latin Americans [85]. The region contains gene SLC16A11, whose expression is known to alter lipid metabolism [85]. We also find that the genes FAP/IFIH1 have signals consistent with AI, particularly in PEL. This region has been previously associated with type 1 diabetes [86, 87]. A previous analysis of this region has suggested that the divergent haplotypes in it resulted from ancestral structure or balancing selection in Africa, followed by local episodes of positive selection in Europe, Asia and the Americas [88]. A more recent analysis has found this as a region of archaic AI in Melanesians as well [7].
Another interesting candidate region contains two genes involved in lipid metabolism: LIPA and CH25H. We find a 40 kb region with 11 uniquely shared Denisovan alleles that are at low (< 1%) frequency in Africans and at very high (> 50%) frequency in various South and East Asian populations (JPT, KHV, CHB, CHS, CDX and BEB). The Q95 and D statistics in this region are also high across all of these populations, and we also find this region to have extreme values of these statistics in our broader Eurasian scan. The LIPA gene codes for a lipase [89] and is associated with cholesterol ester storage disease [90] and Wolman disease [91]. In turn, the CH25H gene codes for a membrane hydroxylase involved in the metabolism of cholesterol [92] and associated with Alzheimer’s disease [93] and antiviral activity [94].
Finally, we find a region harboring between 3 and 10 uniquely shared Neanderthal alleles (depending on the panel used) in various non-African populations. This region was identified earlier by ref. [5] and contains genes PPDPF, PTK6 and HELZ2. PPDPF codes for a probable regulator of pancreas development (UniProtKB by similarity). PTK6 codes for an epithelial signal transducer [95] and HELZ2 codes for a helicase that works as a transcriptional coactivator for nuclear receptors [96, 97].
3.4 The X chromosome
Previous studies have observed lower levels of archaic introgression in the X chromosome relative to the autosomes [5, 6]. Here, we observe a similar trend: compared to the autosomes, the X chromosome contains a smaller number of windows with sites that are uniquely shared with archaic humans (Figure 7). For example, for w = 1% and x = 20%, we observe that, in Europeans, 0.4% of all windows in the autosomes have at least one uniquely shared site with Neanderthals or Denisovans, while only 0.05% of all windows in the X chromosome have at least one uniquely shared site (P = 4.985 × 10^-4, chi-squared test assuming independence between windows). The same pattern is observed in East Asians (P = 1.852 × 10^-8).
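A comparison of this kind amounts to a chi-squared test on a 2x2 table (autosomes versus X chromosome, windows with versus without a uniquely shared site). The sketch below is only illustrative: it implements the Pearson statistic with one degree of freedom and no continuity correction, which may differ from the exact variant used here, and the window counts in the usage lines are made up rather than taken from the data. The function name `chi2_2x2` is hypothetical.

```python
import math
from typing import Tuple


def chi2_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-squared test (1 df, no continuity correction) for the table [[a, b], [c, d]].

    Assumes every row and column total is positive.
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Survival function of a chi-squared variable with 1 df: erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: (windows with a shared site, windows without) per compartment.
autosome_with, autosome_without = 280, 69_720
x_with, x_without = 2, 3_798
chi2, p = chi2_2x2(autosome_with, autosome_without, x_with, x_without)
print(f"chi2 = {chi2:.2f}, P = {p:.3g}")
```

The independence assumption between windows stated in the text is what licenses treating each window as one observation in the table.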
Nevertheless, we do identify some regions in the X chromosome exhibiting high values for both UA,B,C,D(w, x, y, z) and Q95A,B,C,D(w,y, z). For example, a region containing gene DHRSX contains a uniquely shared site where a Neanderthal allele is at < 1% frequency in Africans, but at > 50% frequency in a British panel (GBR). Another region contains gene DMD and harbors two uniquely shared sites where two archaic (Denisovan/Neanderthal) alleles are also at low (< 1%) frequency in Africans but at > 50% frequency in Peruvians. DHRSX codes for an oxidoreductase enzyme [98], while DMD is a well-known gene because mutations in it cause muscular dystrophy [99], and was also previously identified as having signatures of archaic introgression in non-Africans [100].
3.5 Introgressed haplotypes in candidate loci
We inspected the haplotype patterns of candidate loci with support in favor of AI. We displayed the haplotypes for selected populations at seven regions: POU2F3 (Figure 9.A), BNC2 (Figure 9.B), OAS1 (Figure 9.C), LARS (Figure 9.D), FAP/IFIH1 (Figure 9.E), LIPA (Figure 9.F) and SLC16A11 (Figure S49.C). We included continental populations that show a large number of uniquely shared archaic alleles, and included YRI as a representative African population. We then ordered the haplotypes by similarity to the closest archaic genome (Altai Neanderthal or Denisova) (Figure 9). As can be observed, all these regions tend to show sharp distinctions between the putatively introgressed haplotypes and the non-introgressed ones. This is also evident when looking at the cumulative number of differences of each haplotype to the closest archaic haplotype, where we see a sharp rise in the number of differences, indicating strong differentiation between the two sets of haplotypes. Additionally, the YRI haplotypes tend to predominantly belong to the non-introgressed group, as expected.
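The ordering of haplotypes by similarity to an archaic sequence, and the "sharp rise" in the number of differences, can be sketched as below. This is a simplified illustration: haplotypes are encoded as equal-length strings of 0/1 alleles, the archaic genome is represented by a single haplotype string (an assumption, since archaic genomes are diploid), and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def n_differences(hap: str, archaic: str) -> int:
    """Number of sites at which a haplotype differs from the archaic sequence."""
    if len(hap) != len(archaic):
        raise ValueError("haplotype and archaic sequence must have the same length")
    return sum(1 for h, a in zip(hap, archaic) if h != a)


def order_by_archaic_similarity(haps: Sequence[str], archaic: str) -> List[Tuple[int, str]]:
    """Return (differences, haplotype) pairs sorted from most to least archaic-like."""
    pairs = [(n_differences(h, archaic), h) for h in haps]
    return sorted(pairs, key=lambda pair: pair[0])


def largest_jump(sorted_diffs: Sequence[int]) -> Tuple[int, int]:
    """Locate the sharpest rise in differences among at least two sorted values.

    Returns (index of the first haplotype after the jump, size of the jump).
    """
    jumps = [sorted_diffs[i] - sorted_diffs[i - 1] for i in range(1, len(sorted_diffs))]
    best = max(range(len(jumps)), key=jumps.__getitem__)
    return best + 1, jumps[best]


haps = ["11111", "11110", "00000", "00001", "11111"]  # toy haplotypes
ordered = order_by_archaic_similarity(haps, "11111")
diffs = [d for d, _ in ordered]
split, jump = largest_jump(diffs)
print(diffs, split, jump)
```

In this toy example the haplotypes before `split` play the role of the putatively introgressed group, and the fraction `split / len(haps)` corresponds to the kind of haplotype frequency quoted for individual loci.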
3.5.1 Consequences of relaxing the outgroup frequency cutoff
When using a more lenient cutoff for the outgroup panel (10% maximum frequency, rather than 1%), we find a few genes that display values of the U statistic that are suggestive of AI, and that have been previously found to be under strong positive selection in particular human populations [101, 102]. The most striking examples are TYRP1 in EUR (using EAS+AFR as outgroup) and OCA2 in EAS (using EUR+AFR as outgroup)(Table S3). Both of these genes are involved in pigmentation. We caution, however, that the reason why they carry archaic alleles at high frequency may simply be because their respective selective sweeps pushed an allele that was segregating in both archaic and modern humans to high frequency in modern humans, but not necessarily via introgression.
In fact, TYRP1 only stands out as an extreme region for the number of archaic shared alleles in EUR when using the lenient 10% cutoff, but not when using the more stringent 1% cutoff. When looking at these SNPs in more detail, we find that their allele frequency in Africans (∼ 20%) is even higher than in East Asians (∼ 1%), largely reflecting population differentiation across Eurasia due to positive selection [102], rather than adaptive introgression. When exploring the haplotype structure of this gene (Figure S49.B), we find one haplotype that shows similarities to archaic humans but is at low frequency. In the combined YRI+EUR panel, just 6.37% of all haplotypes have 36 or fewer differences to the Neanderthal genome, and this number is roughly the point of transition between the archaic-like and the non-archaic-like haplotypes (Figure S49.B). There is a second - more frequent - haplotype that is more distinct from archaic humans but present at high frequency in Europeans. The uniquely shared sites obtained using the lenient (< 10%) allele frequency outgroup cutoff are tagging both haplotypes together, rather than just the highly differentiated archaic-like haplotype.
OCA2 has several sites with uniquely shared alleles in EAS (AFR+EUR as outgroup) when using the lenient 10% cutoff, but only a few (2) shared archaic sites when using the < 1% outgroup frequency cutoff. When exploring the haplotype structure of this gene, we fail to find a clear-cut differentiation between putatively introgressed and non-introgressed haplotypes, so the evidence for adaptive introgression in this region is also weak. A close inspection of its haplotype structure shows that OCA2 does not show a large number of differences between the haplotypes that are closer to and those that are distant from the archaic humans (Figure S49.A).
Finally, using the lenient outgroup cutoff of < 10% and a target cutoff of > 20%, we find the gene with the highest number of uniquely shared sites among all the populations and cutoffs we tested: MUC19. This region is rather impressive in containing 115 sites where the archaic alleles are shared between the Mexican panel (MXL) and the Denisovan genome at more than 20% frequency, when using all populations that are not MXL as the outgroup. However, the actual proportion of individuals that contain a Denisova-like haplotype (though highly differentiated from the rest of present-day human haplotypes) is very small. Only 11.86% of haplotypes in the combined YRI+AMR panel show 69 or fewer differences to the closest archaic genome (Denisova), and the next closest haplotype has 134 differences (Figure S49.D).
Overall, a finer investigation of these three cases suggests that using a lenient outgroup frequency cutoff may lead to misleading inferences. Nevertheless, the haplotype structure of these genes and their relationship to their archaic human counterparts are quite unusual. It remains to be determined whether these patterns could be caused by either positive selection or introgression alone, or whether a combination of these or other demographic forces is required to explain them.
3.6 Inferred introgressed tracts
We used an HMM [20] to verify that the strongest candidate regions effectively contained archaic segments of a length that would be consistent with introgression after the divergence between archaic and modern humans. For each region, we used the closest archaic genome (Altai Neanderthal or Denisova) as the putative source of introgression. We then plotted the inferred segments in non-African continental populations for genes with strong evidence for AI. Among these, genes with Neanderthal as the closest source (figs. S50 to S57) include: POU2F3 (EAS, SAS), BNC2 (EUR), OAS1 (Eurasians), LARS (AMR), FAP/IFIH1 (PEL), CHD2 (PEL), TLR1-6 (EAS) and ZFHX3 (PEL). Genes with Denisova as the closest source (figs. S58 and S59) include: LIPA (EAS, SAS, AMR) and MUSTN1 (SAS).
3.7 Testing for enrichment in genic regions
We aimed to test whether uniquely shared archaic alleles at high frequencies were enriched in genic regions of the genome. SNPs in introgressed blocks will tend to cluster together and have similar allele frequencies, which could cause a spurious enrichment signal. Therefore, we performed two types of LD pruning, which we described in the Methods section.
Regardless of which LD method we used, we find no significant enrichment in genic regions for high-frequency (> 50%) Neanderthal alleles (LD-1 P=0.352, LD-2 P=0.161) or Denisovan alleles (LD-1 P=0.348, LD-2 P=0.192). Similarly, we find no enrichment for medium-to-high-frequency (> 20%) Neanderthal alleles (LD-1 P=0.553, LD-2 P=0.874) or Denisovan alleles (LD-1 P=0.838, LD-2 P=0.44).
4 Discussion
Here, we carried out one of the first investigations into the joint dynamics of archaic introgression and positive selection, to develop statistics that are informative of AI. We find that one of the most powerful ways to detect AI is to look at both the number and allele frequency of mutations that are uniquely shared between the introgressed and the archaic populations. Such mutations should be abundant and at high frequencies in the introgressed population if AI occurred. In particular, we identified two novel summaries of the data that capture this pattern quite well: the statistics Q95 and U. These statistics can recover loci under AI and are easy to compute from genomic data, as they do not require phasing.
We have also studied the general landscape of archaic alleles and their frequencies in present-day human populations. While scanning the present-day human genomes from phase 3 of the 1000 Genomes Project [19] using these and other summary statistics, we were able to recapitulate previous AI findings (like the TLR [62, 63] and OAS regions [61]) as well as identify new candidate regions for AI in Eurasia (like the LIPA gene and the FAP/IFIH1 region). These mostly include genes involved in lipid metabolism, pigmentation and innate immunity, as observed in previous studies [5, 6, 103]. Phenotypic changes in these systems may have allowed archaic humans to survive in Eurasia during the Pleistocene, and may have been passed on to present-day human populations during their expansion out of Africa.
When using more lenient definitions of what we consider to be “uniquely shared archaic alleles” we find sites containing these alleles in genes that have been previously found to be under positive selection (like OCA2 and TYRP1) but not necessarily under adaptive introgression. While these do not show as strong signatures of adaptive introgression as genes like BNC2 and POU2F3, their curious haplotype patterns and their relationship to archaic genomes warrant further exploration.
We tested whether uniquely shared archaic alleles at high frequencies in non-Africans were significantly more likely to be found in genic regions, relative to all shared archaic alleles, but did not find a significant enrichment. Though this suggests archaic haplotypes subject to AI may not be preferentially found near or inside genes, it may also be a product of a lack of power, or of the fact that not all uniquely shared archaic alleles may be truly introgressed. As mentioned before, some of these alleles may be present due to incomplete lineage sorting, which could add noise to the test signal. A more rigorous - and possibly more powerful - test could involve testing whether HMM-inferred introgressed archaic segments at high frequency tend to be found in genic regions, relative to all inferred introgressed archaic segments, while controlling for features like the length of introgressed segments and the sensitivity of the HMM to different regions of the genome. However, in this study, we did not pursue this line of research further.
In this study, we have mostly focused on positive selection for archaic alleles. One should remember, though, that a larger proportion of introgressed genetic material was likely maladaptive to modern humans, and therefore selected against. Indeed, two recent studies have shown that negative selection on archaic haplotypes may have reduced the initial proportion of archaic material present in modern humans immediately after the hybridization event(s) [104, 105].
Another caveat is that some regions of the genome display patterns that could be consistent with multiple introgression events, followed by positive selection on one or more distinct archaic haplotypes [62]. In this study, we have simply focused on models with a single pulse of admixture, and have not considered complex scenarios with multiple sources of introgression. Additionally, the currently limited availability of high-coverage archaic human genomes may prevent us from detecting AI events for which the source may not have been closely related to the sequenced Denisovan or Altai Neanderthal genomes. This may include other Neanderthal or Denisovan subpopulations, or other (as yet unsampled) archaic groups that may have lived in Africa and Eurasia.
It is also worth noting that positive selection for archaic haplotypes may be due to heterosis, rather than adaptation to particular environments [104]. That is, archaic alleles may not have been intrinsically beneficial, but simply protective against deleterious recessive modern human alleles, and therefore selected after their introduction into the modern human gene pool. The degree of dominance of deleterious alleles in humans remains elusive, so it is unclear how applicable this model would be to archaic admixture in humans.
Although many of the statistics we introduced in this study have their drawbacks - notably their dependence on simulations to assess significance - they highlight a characteristic signature left by AI in present-day human genomes. Future avenues of research could involve developing ways to incorporate uniquely shared sites into a robust test of selection that specifically targets regions under AI. For example, one could think about modifying statistics based on local between-population differentiation, like PBS [9], so that they are only sensitive to allele frequency differences at sites that show signatures of archaic introgression.
Finally, while this study has largely focused on human AI, several other species also show suggestive signatures of AI [106]. Assessing the extent and prevalence of AI and uniquely shared sites in other biological systems could provide new insights into their biology and evolutionary history. This may also serve to better understand how populations of organisms respond to introgression events, and to derive general principles about the interplay between admixture and natural selection.
5 Acknowledgments
We are grateful to Montgomery Slatkin, Rasmus Nielsen, Fergal P. Casey, Kirk Lohmueller and Amy Ko for helpful advice and comments. E.H.S. is supported by UC Merced start-up funds. D.M. is supported by a University of Torino PhD Scholarship. F.R. is supported by N.I.H. grant R01HG003229 to Rasmus Nielsen.
Footnotes
Email address: fernandoracimo{at}gmail.com (Fernando Racimo)
6 References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].
- [62].
- [63].
- [64].
- [65].
- [66].
- [67].
- [68].
- [69].
- [70].
- [71].
- [72].
- [73].
- [74].
- [75].
- [76].
- [77].
- [78].
- [79].
- [80].
- [81].
- [82].
- [83].
- [84].
- [85].
- [86].
- [87].
- [88].
- [89].
- [90].
- [91].
- [92].
- [93].
- [94].
- [95].
- [96].
- [97].
- [98].
- [99].
- [100].
- [101].
- [102].
- [103].
- [104].
- [105].
- [106].
- [107].
- [108].
- [109].