Abstract
The origin and fate of new mutations within species is the fundamental process underlying evolution. However, while previous efforts have been focused on characterizing the presence, frequency, and phenotypic impact of genetic variation, the evolutionary histories of most variants are largely unexplored. We have developed a non-parametric approach for estimating the date of origin of genetic variants that can be applied to large-scale genomic variation data sets. We demonstrate the accuracy and robustness of the approach through simulation and apply it to over 16 million single nucleotide polymorphisms (SNPs) from two publicly available human genomic diversity resources. We characterize the differential relationship between variant frequency and age in different geographical regions and demonstrate the value of allele age in interpreting variants of known functional and selective importance. Finally, we use allele age estimates to power a rapid approach for inferring the genealogical history of a single genome or a group of individuals.
Introduction
Each generation, a human genome acquires an average of about 70 single nucleotide changes through mutation in the germ-line of its parents [1, 2]. Yet while, at a global scale, many millions of new variants are generated each year, the vast majority are lost rapidly through genetic drift and purifying selection. Consequently, even though the majority of variants themselves are extremely rare, the majority of genetic differences between genomes result from variants found at global frequencies of 1% or more [3], which may have appeared thousands of generations ago. Genome sequencing studies [3–7] have catalogued the vast majority of common variation (estimated to be about 10 million variants [8]) and, at least within coding regions and particular ancestries, to date, more than 660 million variants genome-wide have been reported [9], many of them at extremely low frequency [5, 10–13].
Despite the importance of genetic variation in influencing quantitative traits and risk for disease, as well as providing the raw material on which natural selection can act, relatively little attention has been paid to inferring the evolutionary history of the variants themselves, with notable exceptions of evolutionary importance, particularly those affecting geographically varying traits such as skin pigmentation, diet, and immunity [14–16]. Rather, attention has focused on the indirect use of genetic variation to detect population structure [17–23], identify related samples [24–26], and estimate parameters of models of human demographic history [27–30]. The evolutionary history of classes of variants contributing to polygenic adaptation (for example those affecting height [14, 31–33], though see [34]) or causing potential loss of gene function [35] has received attention, though rarely at the level of specific variants. Previous work on rare variants has identified ancestral connections between individuals and populations [3, 36–38] and demonstrated evidence for explosive population growth [10–13]. Nevertheless, to date, no comprehensive effort has been made to infer the age, place of origin, or pattern of spread for the vast majority of variants.
Method
We have developed an integrated framework for estimating the age of genetic variants, that is, the point in time at which the mutation arose that is ancestral to all chromosomes carrying the allele observed at a single locus in sample data (Figure 1A). Our approach, which we refer to as the genealogical estimation of variant age (GEVA), borrows from coalescent modeling [27, 30, 39, 40], but makes no assumptions about the demographic or selective processes that influence genealogical history, or about relatedness among sampled individuals. Instead, we learn about the age of a mutation from the distribution of the time to the most recent common ancestor (TMRCA) between pairs of chromosomes. A copy of the piece of the ancestral chromosome on which the mutation occurred is still present today in the individuals carrying the mutant allele. Over time, additional mutations have accumulated along the inherited sequence (haplotype), and its length has been broken down by recombination during meiosis in each generation. We estimate this ancestral segment in a targeted manner, using a simple hidden Markov model (HMM), constructed empirically from sequencing data to provide robustness to realistic rates of data error (Figure 1B). By measuring the impact of mutation and recombination on the segments shared between pairs of haplotypes, we infer the TMRCA distribution using probabilistic models to accommodate the stochastic nature of the mutation and recombination processes (Figure 1C). Moreover, we make full use of the information available in whole genome sequencing data by comparing both pairs of chromosomes that carry the mutation (concordant pairs) and pairs in which one chromosome carries the mutation and the other carries the ancestral allele (discordant pairs).
Information from hundreds or thousands of haplotype pairs is then combined within a composite likelihood framework to obtain an approximate posterior distribution on the time of the ancestral mutation (Figure 1D). One benefit of our method is that we can increase the number of pairwise TMRCA inferences incrementally to update allele age estimates, or to combine information across many data sets to improve the genealogical resolution from a wider distribution of independently sampled chromosomes. We additionally use a heuristic method for rejecting outlier pairs to improve robustness to low rates of data error and recurrent mutation. Full details are given in the Supplementary Text.
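The way concordant and discordant pairs jointly bound the age of a mutation can be illustrated with a minimal Python sketch. It is not the GEVA implementation (full details are in the Supplementary Text and the linked source code). It assumes a mutation clock only, a flat prior on each pairwise TMRCA, an illustrative mutation rate of 1.2e-8 per site per generation, and it reports the mode of the composite score rather than a full posterior. The function names, the time grid, and the toy (differences, segment length) inputs are hypothetical.

```python
import math

MU = 1.2e-8  # assumed mutation rate per site per generation (illustrative)

def tmrca_cdf(n_diff, seg_len, t, mu=MU):
    """P(TMRCA <= t) for one haplotype pair under a mutation clock.

    Assumption: with a flat prior, n_diff pairwise differences on a shared
    segment of seg_len sites give a Gamma(n_diff + 1, rate=2*mu*seg_len)
    posterior on the TMRCA. For integer shape the CDF has a closed form.
    """
    x = 2.0 * mu * seg_len * t
    term, total = 1.0, 1.0
    for n in range(1, n_diff + 1):
        term *= x / n
        total += term
    return 1.0 - math.exp(-x) * total

def composite_age_estimate(concordant, discordant, grid):
    """Return the grid time that maximises the composite log-score.

    Concordant pairs coalesce after the mutation arose (TMRCA <= age);
    discordant pairs coalesce before it (TMRCA > age).
    """
    floor = 1e-300  # avoids log(0) at the edges of the grid
    best_t, best_score = None, -math.inf
    for t in grid:
        score = 0.0
        for n_diff, seg_len in concordant:
            score += math.log(max(tmrca_cdf(n_diff, seg_len, t), floor))
        for n_diff, seg_len in discordant:
            score += math.log(max(1.0 - tmrca_cdf(n_diff, seg_len, t), floor))
        if score > best_score:
            best_t, best_score = t, score
    return best_t

# Toy input: (pairwise differences, shared segment length in sites).
concordant = [(2, 200_000), (1, 150_000)]
discordant = [(5, 100_000), (7, 120_000)]
estimate = composite_age_estimate(concordant, discordant, range(10, 20_001, 10))
print("composite mode (generations):", estimate)
```

Because each pair contributes one additive term, further pairs (or pairs from another data set) can be added to the score incrementally, which mirrors the updating property described above.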
Simulation study
To validate our approach, we performed coalescent simulations [41] under different demographic models. Using a standard coalescent model with constant mutation and recombination rates, we found low bias (relative error, ∊ = 0.268; see Supplementary Text) for allele age estimates and high correlation between true and inferred age (Spearman’s ρ = 0.953; Figure 2A, Supplementary Figure 1A). We also compared our approach for estimating pairwise TMRCA posteriors to that obtained from the computationally more demanding pairwise sequentially Markovian coalescent (PSMC) methodology [27] (Supplementary Figure 1B). This estimates a demographic model (over a grid of time intervals) for each pair separately and can, for every position in the genome, return the inferred posterior distribution on the TMRCA. We found that PSMC-based estimations perform similarly well (ρ = 0.952), though the time discretization of PSMC increases bias (∊ = 0.530) and, in particular, leads to overestimation of the age for the youngest variants. We note that PSMC was not designed strictly for this purpose, hence is not optimized for estimating allele age. Under a complex demographic model that recapitulates the human expansion out of Africa [42] and with empirical and variable recombination rates, GEVA maintained a similarly high level of accuracy (∊ = 0.198, ρ = 0.937; Figure 2B; Supplementary Figure 2). In this situation, although the PSMC methodology is expected to model the demographic history for pairs of individuals better, the time discretization still leads to worse performance overall, with the addition of a substantial computational cost. We next introduced realistic data complications (see Supplementary Text), including genotype error calibrated using data from the 1000 Genomes Project [3] compared to data from the Illumina Platinum Genomes Project [43], as well as errors arising through in silico haplotype phasing (Figure 2B). 
We found that GEVA estimates remain largely unbiased and strongly correlated with true age after the inclusion of data error (∊ = 0.346, ρ = 0.925; Supplementary Figure 3) and after phasing (∊ = 0.430, ρ = 0.921; Supplementary Figure 4). We found that the PSMC-based approach continued to show higher bias and reduced correlation at the same set of variants, both after error (∊ = 1.042, ρ = 0.882) and after phasing (∊ = 1.009, ρ = 0.880). Reduced data quality resulting from sequencing or phasing errors leads to an underestimation of haplotype lengths for variants that are relatively young, for which we overestimate TMRCA and, hence, allele age (particularly for alleles younger than approximately 100 generations).
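For readers who wish to reproduce the rank correlation between true and inferred ages, a minimal Python sketch of Spearman's ρ follows. The relative error measure is defined in the Supplementary Text and is not reproduced here. The function names and the toy ages are hypothetical and are not taken from the simulations.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Toy ages in generations (made up for illustration).
true_age = [50, 200, 800, 3000, 12000]
inferred_age = [90, 1000, 180, 2500, 15000]
rho = spearman_rho(true_age, inferred_age)
print("Spearman's rho:", rho)
```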
Age of selected variants
To evaluate the performance of GEVA on empirical data, we considered three loci. First, we considered variants affecting the well-studied lactase persistence (LP) trait, where numerous approaches, including the use of archaeological data, genetic data, and a biological understanding of the functional and evolutionary impact of previously associated variants, have resulted in consensus expectations for the age. The LCT gene encodes the lactase enzyme, but is regulated by variants in an intron of the neighboring MCM6 gene. We estimated the age of the derived T allele of the rs182549 variant (G/A-22018), which is at a frequency of approximately 50% in European populations and which forms part of a haplotype associated with LP [44]. We estimate the variant to be 696 generations old (Figure 3A, Supplementary Figure 5); approximately 14,000 to 21,000 years ago, depending on assumptions about generation time in humans [45, 46]. Our estimate is based on data from two different sources, the 1000 Genomes Project (TGP) [3] and the Simons Genetic Diversity Project (SGDP) [4], which, when estimated separately, give very similar ages (696 and 699 generations respectively). We obtained a similar age estimate of 693 generations for the derived A allele of rs4988235 (C/T-13910; Supplementary Figure 6), which is also strongly associated with LP and in near perfect association with rs182549; though we note that there is evidence for multiple origins of the variant [47]. Previous estimates of the age of these variants range between 2,200 and 21,000 years [48], putting our estimate on the higher end of this range. Multiple sources of information suggest that these variants only achieved high frequency in European populations within the last 10,000 years (<400 generations) [49].
Our results therefore suggest that the mutation conferring the strongly selected phenotype (estimated to have a selection coefficient of up to 15% in European and up to 19% in Scandinavian populations [49]) was present for hundreds of generations before its rapid sweep through the population.
We next considered the protein-coding missense variant rs3827760 in the EDAR gene, where the derived G allele (Val370Ala substitution) is found at high frequency (>80%) in East Asian and American populations, and which is associated with sweat, facial and body morphology, and hair phenotypes [50–52]. We estimated the variant to be 1,462 generations old (approximately 30,000 to 45,000 years; Figure 3B, Supplementary Figure 7), again with strong concordance between TGP (1,513 generations) and SGDP (1,350 generations). Our estimate is consistent with previous estimates and limited evidence from ancient DNA studies [15, 53]. Our results further suggest that the variant rapidly rose in frequency following its origin through mutation, which is consistent with previous findings of strong positive selection of this variant in East Asia [54]. Of the 430,568 variants estimated to have arisen between 1,300 and 1,500 generations ago (within SGDP, see below), only 3,052 variants have reached a global frequency higher than 30%, and only 423 variants a frequency higher than 80% within East Asian populations, demonstrating how unusual such a rapid rise in frequency is.
Finally, we considered the variant rs80194531, where the derived allele causes an Asn78Thr substitution in the ZEB1 gene. The variant is reported as pathogenic for corneal dystrophy [55], but is present at 6% in African ancestry samples within TGP. We estimated the age of the variant to be 5,892 generations old (110,000 to 180,000 years), again with consistency between TGP and SGDP (5,879 and 5,905 generations respectively; Figure 3C, Supplementary Figure 8). Such an ancient age seems inconsistent with the reported dominant pathogenic effect [55]. Moreover, of the 1,142,335 variants found at comparable frequencies (5–7%) in African ancestry individuals within SGDP, 54% were estimated to be younger, suggesting that this variant is in no way unusual.
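The year ranges quoted for the three loci follow from multiplying the age in generations by an assumed generation time. The short Python sketch below makes that arithmetic explicit. The bounds of 20 and 30 years per generation are an assumption inferred from the ranges quoted above, and the function name is hypothetical.

```python
def generations_to_years(generations, years_per_generation=(20, 30)):
    """Convert an age in generations to a (low, high) range in years.

    The default bounds of 20 and 30 years per generation are an assumption
    chosen to match the ranges quoted in the text.
    """
    low, high = years_per_generation
    return generations * low, generations * high

# Point estimates in generations, as reported in the text.
estimates = {"rs182549 (LCT/MCM6)": 696, "rs3827760 (EDAR)": 1462, "rs80194531 (ZEB1)": 5892}
for variant, generations in estimates.items():
    low, high = generations_to_years(generations)
    print(f"{variant}: {generations} generations = {low:,} to {high:,} years")
```

The ranges given in the text are these products after rounding.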
Distribution of allele age in the human genome
We next sought to characterize the age distribution of genetic variation across the human genome, by applying GEVA to more than 16 million variants identified in TGP or SGDP, referred to as the atlas of variant age, after confirming that estimates of allele age obtained from the two sources agreed (Spearman’s ρ = 0.871; Figure 4, Supplementary Figure 9). We find substantial variation in the relationship between variant frequency and age depending on the population on which frequency is measured and the geographical distribution of the variant (Figure 5A). Variants in African ancestry groups are typically older than in other groups, and also have the greatest variance in age, for a given frequency. For example, variants below 0.5% (within a given ancestry group) have a median age of around 600 generations in African ancestry groups, 350 generations in East Asian ancestry groups, and 400 generations in European ancestry groups. The age distribution of variants restricted to a particular ancestry group (or, conversely, shared between them) indicates the degree of connection between populations (Supplementary Figure 10). For example, there are many variants up to 10,000 generations old (0.2–0.3 million years) that are restricted to African ancestry groups, yet are observed at frequencies up to 10%, but the oldest variants in this frequency range that are restricted to East Asian ancestry groups are typically under 1,000 generations (20,000–30,000 years) old. Variants restricted to American ancestry groups are typically under 750 generations old (15,000 to 22,500 years), consistent with existing knowledge about the settlement of the Americas via the Bering land bridge that connected Asia and North America during the last glacial maximum around 15,000 to 23,000 years ago [56–58]. We note, however, that recent admixture and the sampling strategies of the different data sets [59, 60] can have a strong impact on age distributions. 
For example, variants at high frequency within American populations, but which are nevertheless restricted to just American and African populations, are considerably younger (on average), than lower frequency variants (within American populations) with the same geographical restriction (Supplementary Figure 10). These variants likely arose recently within Africa and entered American populations through admixture, rising to high frequency through population bottlenecks [61].
Such heterogeneity in the relationship between frequency and age, coupled with heterogeneous and unknown sampling strategies, complicates the use of frequency as a means of assessing variants for potential pathogenicity during the interpretation of individual genomes. The atlas of variant age potentially offers a more direct approach for screening variants, given the high probability of elimination of non-recessive deleterious variants within a few generations [62]. To assess the value of allele age in the interpretation of potentially pathogenic variants we estimated the ages of >70,000 variants in TGP annotated by the Ensembl Variant Effect Predictor [63], by PolyPhen-2 [64] and SIFT [65], as damaging or deleterious (Figure 5B). Of the variants analyzed, 50% of damaging (PolyPhen-2) and 49% of deleterious variants (SIFT) are estimated to have arisen within the last 500 generations (10,000 to 15,000 years), compared to 41% of benign and 42% of tolerated variants (Supplementary Figure 11). Compared with control sets of variants (those annotated as benign by PolyPhen-2 and tolerated by SIFT and matched for allele frequency within the focal ancestry group), variants annotated as damaging or deleterious have a notable dearth of older variants (>1,000 generations) for a given frequency, consistent with theoretical expectations and previous results [36, 66, 67]. These results suggest that old alleles can largely be excluded from consideration of pathology (though recent origin is not evidence in favour of pathogenicity).
Ancestry sharing
Finally, we investigated the extent to which patterns of sharing of variants of different ages could power approaches for learning about genealogical history. Previous work has highlighted the descriptive value of genetic variants in identifying individuals with recent common ancestry and patterns of demographic isolation and migration [23, 68–72], though has also highlighted the challenges of interpreting the output of approaches such as PCA [73, 74]. Conversely, numerous model-based approaches have been developed that use patterns of variant and haplotype sharing to infer underlying demographic parameters [27–30, 75–79], though these typically make strong simplifying assumptions about the space of possible histories. Patterns of sharing of variants of different ages provide a non-parametric approach for combining descriptive and inferential approaches by learning about connections between individuals and groups of people over time. Specifically, for any two haploid genomes we can estimate the fraction that has reached a common ancestor at a given point in time (the cumulative coalescent function, CCF) through the fraction of variants of that age or less that are shared. More generally, we use dynamic programming to estimate a maximum likelihood CCF between any pair or group of individuals (Figure 6A; see Supplementary Text), though noting that uncertainty in variant age estimates and haplotyping error will tend to cause over-smoothing of coalescent profiles.
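The simple sharing-based estimate of the CCF described above can be sketched in a few lines of Python. This is an illustration only and not the maximum likelihood dynamic programming procedure. It assumes that each variant carried by at least one of the two haplotypes has a point estimate of age and a flag recording whether both haplotypes carry it. The function name and the toy data are hypothetical.

```python
from bisect import bisect_right

def empirical_ccf(variants, time_grid):
    """Fraction of variants of age <= t that are shared, for each t in time_grid.

    variants: list of (age_in_generations, shared) tuples, where shared is
    True when both haplotypes carry the derived allele.
    Returns None for grid points with no variant of that age or less.
    """
    ordered = sorted(variants, key=lambda v: v[0])
    ages = [age for age, _ in ordered]
    cumulative_shared = [0]
    for _, shared in ordered:
        cumulative_shared.append(cumulative_shared[-1] + (1 if shared else 0))
    ccf = []
    for t in time_grid:
        n = bisect_right(ages, t)  # number of variants with age <= t
        ccf.append(cumulative_shared[n] / n if n else None)
    return ccf

# Toy data: young variants are private, older variants are shared.
variants = [(50, False), (120, False), (400, True), (900, True), (3000, True)]
time_grid = [100, 500, 1000, 5000]
ccf = empirical_ccf(variants, time_grid)
print(list(zip(time_grid, ccf)))
```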
To illustrate the value of this non-parametric approach in describing the history of individuals and groups we first considered the coalescent history between a single individual of American (Puerto Rican) ancestry from TGP (Individual ID: HG00733) and all others in the TGP sample, using GEVA age estimates for variants on Chromosome 20 (Figure 6B, Supplementary Video 1). As a positive control, we included the parents of HG00733 (HG00732 and HG00731), who reach a CCF of near one in the most recent epoch (though note that the parents were used for haplotype phasing, which estimates transmitted haplotypes, hence the CCF reaching one, rather than the expected one-half). Within the first 100 generations, we see additional coalescence with the untransmitted parental chromosomes and other individuals from the Puerto Rican sample. The earliest common ancestry outside Puerto Rico is seen with a Colombian individual at around 90 generations ago (maternal side) and with a Peruvian individual at around 100 generations ago (paternal side). Coalescence with individuals sampled from outside the Americas occurs further back in time (>100 generations ago), initially with European individuals in a period around 300–600 generations ago, then uniformly with non-African individuals around 1,000–4,000 generations ago, and more strongly with African individuals around 6,000–10,000 generations ago. Because of the impact of data errors on rare variants discussed above, the absolute timings of the early events are likely substantially overestimated, though we expect the relative ordering of events to be robust.
The CCFs to all other members of a reference panel (averaged across all chromosomes in both haploid genomes) provide an overview of the genealogical relationships for a target individual. As an example, we inferred the CCF profiles of a Siberian Eskimo to all other individuals in SGDP (Figure 6C), showing common ancestry to other Central Asian and Siberian individuals within a few hundred generations, substantial common ancestry with American individuals before 1,000 generations, and typically more recent common ancestry with East Asians than West Eurasians than Africans. Notably, relatively little additional coalescence is seen during the period from c. 2,000 to c. 10,000 generations ago, which is a pattern shared among non-African individuals and agrees with previous findings of a period of reduced coalescence, peaking 100,000–200,000 years ago [30].
The CCF can also be represented as a coalescent intensity function (CIF; see Supplementary Text), which measures the rate of change of common ancestry over time (Figure 6C, middle panel), analogous (for a pair of individuals) to the effective population size, Ne, in population genetics modeling. The CIF reveals additional structure, for example around a 3,000 to 20,000 generation period; those parts of the Siberian Eskimo’s genome that have not yet coalesced with other genomes sampled from the same ancestry group have a very low CIF, while the CIF to the African-ancestry samples (which have had very little coalescence until this point) is relatively high (though note the absolute rate remains very low over this period). Over time, the maximum CIF for the target individual across all others in the sample fluctuates between an Ne equivalent of one to two thousand until approximately 1,000 generations ago, before climbing to an Ne equivalent of 10^5 and then decreasing. Note that the Ne equivalent from the maximum CIF will tend to be lower than parametric estimates that assume exchangeability among individuals sampled from the same location.
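To make the link between a CCF, a coalescent intensity, and an Ne equivalent concrete, the Python sketch below uses a finite-difference hazard. This is an assumed form chosen for illustration; the definition actually used is given in the Supplementary Text and may differ. The sketch assumes that the intensity in an interval is the increase in the CCF divided by the interval length and by the fraction not yet coalesced, and that a pair of haploid genomes in a diploid population of size Ne coalesces at rate 1/(2 Ne) per generation. All function names are hypothetical.

```python
import math

def coalescent_intensity(times, ccf):
    """Finite-difference coalescence hazard for each interval of the grid.

    Assumed form: (CCF[i+1] - CCF[i]) / (dt * (1 - CCF[i])).
    Returns None where nothing remains to coalesce.
    """
    intensity = []
    for i in range(len(times) - 1):
        dt = times[i + 1] - times[i]
        remaining = 1.0 - ccf[i]
        if remaining <= 0.0:
            intensity.append(None)
        else:
            intensity.append((ccf[i + 1] - ccf[i]) / (dt * remaining))
    return intensity

def ne_equivalent(intensity):
    """Ne equivalent under the assumed pairwise rate of 1 / (2 * Ne)."""
    if intensity is None:
        return None
    return math.inf if intensity <= 0.0 else 1.0 / (2.0 * intensity)

# Check on a synthetic CCF from a constant population of N = 10,000:
# CCF(t) = 1 - exp(-t / (2N)), so the recovered Ne should be close to N.
N = 10_000
times = list(range(0, 1001, 100))
ccf = [1.0 - math.exp(-t / (2.0 * N)) for t in times]
cif = coalescent_intensity(times, ccf)
ne = [ne_equivalent(v) for v in cif]
print([round(v) for v in ne])
```

On this synthetic input the recovered values sit slightly above N because the hazard is averaged over finite intervals.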
More generally, patterns of allele sharing over time can be used across the entire cohort to summarize genealogical history. We estimated the pairwise CIFs for the 130 population groups defined in SGDP, after aggregating the CCFs across chromosomes and samples (see Supplementary Text), and show their ancestral relationships at different time periods (Figure 7; Supplementary Video 2). These reveal how the rates and structure of coalescence have changed over time, with the most recent epoch (around 200 generations) dominated by coalescence within each sampling group, but also identifying recent connections between groups, such as between southern Siberian and north-east Asians (Figure 7A; note that some populations, such as the Chaplin Eskimo, Balochi and Samaritans, show strong within-group coalescence prior to this point and by 800 generations are coalescing primarily with related populations). The epoch around 800 generations ago (Figure 7B) is dominated by structure broadly corresponding to the continental level, though some southern African populations (notably the Khomani San, Ju’hoan North, and the Mbuti) remain isolated up to around 1,500 generations ago (30,000–45,000 years), which overlaps with previous findings [80]. By this date, there is very little remaining structure among European populations, but many additional inter-continental connections are now identified. For example, we see a north-to-south gradient of decreasing coalescence between American populations and Siberian or East Asian populations. 
In particular, we identify strong coalescence of all American ancestry individuals with Siberian Eskimos, Aleutian Islanders, and Tlingit people in a period between 500 and 1,000 generations ago (Supplementary Video 2), and very little structure among American, Siberian, and East Asian populations as a whole prior to around 1,000 generations ago, which agrees with previous results regarding the human migration into the Americas, extended isolation, and subsequent dispersal across the continent [58]. By 4,000 generations ago, we see high levels of coalescence between non-African and African populations (Figure 7C), and essentially no structure in the epoch around 20,000 generations ago (Figure 7D). The Khomani San and Ju’hoan North remain largely isolated from other populations (apart from each other) until c. 5,000 generations ago.
The maximum CIF profiles (Figure 7E), which provide a non-parametric equivalent to the effective population size (Ne) in population genetics modelling, highlight several features, including differences among modern ancestry groups in the intensity of coalescence within the last 1,000 generations (particularly intense for American and Oceanic populations); a major period of intense coalescence among all non-African ancestry individuals 1,000-2,000 generations ago, following the migration of modern humans out of Africa [81, 82]; a weaker, but still marked increase in coalescent intensity for African ancestry samples around 2,000 generations ago; and an older reduction in coalescent intensity, peaking around 5,000 to 8,000 generations ago, potentially driven by ancient population structure within Africa and (for non-African populations) possible admixture with archaic lineages [83–88]. We find minor quantitative, but not qualitative, differences among chromosomes (Supplementary Figure 12).
Discussion
We have demonstrated how allele age estimates can provide insight into a range of problems in statistical and population genetics. However, there are several important assumptions and limitations of the approach. First, a key assumption is that of a single origin for each allele. Given the size of the human population and the mutation rate, it is likely that every allele has arisen multiple times over evolutionary history. Nevertheless, unless the mutation rate is extremely high, it is still probable that most individuals with the allele carry it through common ancestry. Moreover, multiple origins can potentially be identified through the presence of the allele on multiple haplotype backgrounds, as has, for example, been seen for the rs4988235 allele at LCT [47, 89, 90] (though we note that [90] conclude that the allele of variant rs4988235 was brought into African populations through historic gene flow, possibly through the Roman Empire), the O blood group [91], or alleles in the Human Leukocyte Antigen (HLA) region [92, 93]. A variant lying in a region with high rates of non-crossover (gene conversion) may similarly be found on multiple haplotype backgrounds [94]. However, for genomes with very high mutation rates, such as HIV-1 [95], recurrence is sufficiently high to make estimates of allele age meaningless. In addition, while we have shown GEVA to be robust to realistic levels of sequencing and haplotype phasing error, the actual structures of error found in reference data sources, such as TGP [96], have additional complexity whose effect is unknown.
Our approach also assumes a known and time-invariant rate of recombination. For most species, only indirect estimates of the per generation recombination rate are available and, in humans [97] and mice [98], there is evidence for evolution in the fine-scale location of recombination hotspots through changes in the binding preferences of PRDM9. However, because broad-scale recombination rates evolve at a much lower rate than hotspot location [99], and because our approach for detecting recombination events is driven largely by the presence of recombinant haplotypes, we expect GEVA to be relatively robust for recent variants. Older variants may be more affected, but for such variants most information comes from the mutation clock, which is likely to have been more stable over time.
An atlas of allele ages has multiple applications beyond statistical and population genetics. For example, recent variants provide a natural index when searching for related samples in population-scale data sets. Moreover, as demonstrated here, it is possible to combine information from multiple, potentially even distributed data sets, by estimating coalescent time distributions for pairs of concordant and discordant haplotypes in each data resource separately, or to update age estimates by the inclusion of additional samples. Future extensions to infer location of origin or the ancestral haplotype, integrating the growing wealth of genome data from ancient samples, will be an important step towards reconstructing the ancestral history of the entire human species.
Data sources and code availability
Estimation of allele age and shared ancestry was conducted on publicly available data sets: the 1000 Genomes Project (TGP) [3] and the Simons Genetic Diversity Project (SGDP) [4]. We used phased haplotype data of Chromosomes 1-22 from the final release TGP panel (Phase 3; GRCh37), available for 2,504 individuals from 26 populations worldwide (five continental population groups). Additional data were available from TGP for 31 related individuals, which we included in our shared ancestry analysis. We used phased haplotype data of Chromosomes 1-22 from the publicly available SGDP panel (PS2; GRCh37), consisting of 278 individuals from 130 populations worldwide (seven continental population groups). Recombination rates were determined for each chromosome using the genetic maps available from the International HapMap Project (Phase 2; GRCh37) [100]. Genotype data from the Illumina Platinum Genomes Project [43] (GRCh37; Chromosomes 1-22) were used as a reference to measure genotype error in a matched subsample from TGP. We used information from the Ensembl database (human assembly GRCh37; release 92 version 20180221) to determine the ancestral and derived allelic states for variants in both TGP and SGDP panels, as predicted through multi-species alignments in the Ensembl EPO pipeline.
Data availability
Atlas of variant age for the human genome: (temporary link) https://www.dropbox.com/sh/hkrrj7sopmvkjrx/AAAQFBwdhBUTR0xUvm8-72Lka?dl=0
Shared ancestry in TGP and SGDP: (temporary link) https://www.dropbox.com/sh/h60yjoznqgvhre3/AAA56rAj0wZPGj9T0Ui-8l06a?dl=0
Source code availability
GEVA: https://github.com/pkalbers/geva
CCF: https://github.com/pkalbers/ccf
We modified the original source code of MSMC2 to optimize the performance of the PSMC algorithm in our simulation analysis: https://github.com/pkalbers/msmc2
Acknowledgements
Funded by the Wellcome Trust (100956/Z/13/Z to GM, 099685/Z/12/Z to PKA) and the Li Ka Shing Foundation (to GM). We thank members of the McVean group for comments and discussion.