Age distributions of rare lineages reveal recent demographic history and selection

Alexander Platt; Jody Hey

doi:10.1101/265900

ABSTRACT

The age of an allele of a given frequency offers insight into both its function and origin, and the distribution of ages of alleles in a population conveys significant information about its history. The rarer the allele the more likely it is to reveal functional biological insight and the more recent the historical revelation. By measuring the length of the haplotype shared between an individual carrying a rare variant and its closest relative not carrying the variant we are able to approximate the age of the variant and can apply this method even when only a single copy of a variant has been sampled in a population. Applying this technique to rare variants in a large population sample from the United Kingdom, we identify historical migration from West Africa approximately 400 years ago, evidence of direct selection against novel protein-altering rare variants in individual biological pathways, continued negative frequency dependent selection on protein-altering variants in olfactory transducers and the innate immune system, and map the impact of background selection on the most recent portions of the sample genealogy.

Introduction

The age of an allele can reveal much about how it came to its observed frequency. In particular an allele that is younger than expected given its frequency is likely to have been under directional selection. This effect is not surprising for evolutionarily favored alleles, but it also applies to the case of harmful alleles¹. Such alleles are often quickly lost from a population, but those that persist and turn up in a sample of genomes are likely to be young. In this context “harmful” refers to Darwinian fitness, but it is probable that many alleles with negative impacts on health would also be under negative selection^2,3. Thus, if we estimate allele age, we can combine this with other information such as allele frequency, geographical distribution, and functional annotation to improve predictions of an allele’s effect on human health.

Alternatively, an allele that is older than expected given its frequency is also a candidate for having an interesting history. Alleles that are maintained in the population by balancing selection can be much older than their frequency alone would suggest^4,5. Admixture can also be a source of alleles that are older than they appear to be on the basis of allele frequency alone⁶.

In developing an estimator of allele age, we considered several constraints. The estimator should not be a function of allele frequency, it should not be a function of the demographic history of the sampled population, and it should not be highly sensitive to the details of the data pipeline. We also wish to be able to study the age of the very rarest alleles, including those that appear only once in a sample (singletons). This last criterion rules out methods that are based on the similarity of different copies of an allele at flanking sequences^7,8. Such methods can inform on the time at which different copies of an allele last shared a common ancestry, which is a useful proxy for allele age⁸, but they cannot be applied to singletons.

We turned instead to an approach that utilizes comparisons between a chromosome carrying a focal allele and other chromosomes not carrying that allele. Consider an alignment of two chromosomes, one with and one without a derived allele at a SNP, and consider that the genealogy or gene tree at the site of the allele will coalesce when the two chromosomes most recently shared a common ancestor for the base position of the SNP. Then a search that extends along the chromosome to the right or the left will reveal a distance to the first flanking mismatch between the chromosomes. This mismatch will have been caused by a mutation on one of the two gene tree edges, or by a variant introduced by a recombination event on one of those edges. This distance can be modeled as a Poisson process with a rate parameter that is the product of the time of common ancestry and the sum of recombination and mutation rates per base position per generation^9–11. By extension if we compare the chromosome with the allele with all other chromosomes not carrying that allele, we can identify that which shows the longest flanking region without a mismatch, what we call the maximal shared haplotype or msh. This will be a function of first coalescence time, t_c, between the chromosome with the allele and all chromosomes not carrying the allele.
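The search for the maximal shared haplotype described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, haplotypes are assumed to be equal-length strings with one character per base position, lengths are counted in positions rather than genetic distance, and the 5’ and 3’ directions are assumed to be maximized separately over the non-carrier chromosomes.

```python
def shared_length(focal, other, site, step):
    """Count consecutive matching positions starting next to `site`,
    moving in direction `step` (+1 or -1), until the first mismatch
    or the end of the sequence."""
    length = 0
    i = site + step
    while 0 <= i < len(focal) and focal[i] == other[i]:
        length += 1
        i += step
    return length


def maximal_shared_haplotype(haplotypes, focal_index, site):
    """Return (left_msh, right_msh) for the focal chromosome at `site`.

    Only chromosomes that do NOT carry the focal allele are compared,
    so the function also works when the allele is a singleton.
    """
    focal = haplotypes[focal_index]
    non_carriers = [
        h for j, h in enumerate(haplotypes)
        if j != focal_index and h[site] != focal[site]
    ]
    if not non_carriers:
        raise ValueError("no chromosome without the focal allele")
    left = max(shared_length(focal, h, site, -1) for h in non_carriers)
    right = max(shared_length(focal, h, site, +1) for h in non_carriers)
    return left, right
```

With real data the lengths would be measured in base pairs or genetic map units between observed variant sites; the toy representation here only shows the logic of the comparison.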

We developed a coalescent model for the probability density of t_c as a function of msh. For a singleton, the model provides a likelihood estimator of t_c that uses both msh values (in the 5’ and 3’ directions from the focal base). For alleles that occur more than once, it can be used as a composite likelihood estimator of the time of the first coalescent event ancestral to the edge upon which the mutation causing that allele occurred. The bias of the estimator is low for rare alleles, including singletons, and it works similarly well when using sequenced chromosomes with known or estimated phase, despite the presence of switch errors in the latter (figure S1 and table S1). Having age estimates for singleton alleles also provides a means for estimating the phase of singletons when estimating chromosome haplotypes from short-read data (figure S3). Though the method assumes a demographic model for the length distribution of the sister to the edge carrying the mutation, the resulting estimator is quite insensitive to this model (figure S2).
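To give a feel for how two msh values translate into a time estimate, the sketch below uses a deliberately simplified stand-in for the model. It assumes, following the rate described earlier, that the distance to the first mismatch in each direction is exponentially distributed with rate t_c × (μ + ρ), and that the two directions are independent. It ignores the coalescent prior on t_c and the fact that msh is a maximum over many comparisons, so it is not the estimator developed here. The function name and the default per-base rates are illustrative assumptions.

```python
def naive_tc_mle(msh_5prime_bp, msh_3prime_bp, mu=1.2e-8, rho=1.0e-8):
    """Maximum-likelihood t_c under a simplified two-exponential model.

    With lengths l5 and l3 each ~ Exponential(rate = t * (mu + rho)),
    the likelihood is [t (mu + rho)]^2 * exp(-t (mu + rho) (l5 + l3)),
    which is maximized at t = 2 / ((mu + rho) * (l5 + l3)).

    mu and rho are per-base, per-generation mutation and recombination
    rates; the defaults are placeholder values, not taken from the paper.
    """
    total = msh_5prime_bp + msh_3prime_bp
    if total <= 0:
        raise ValueError("combined shared haplotype length must be positive")
    return 2.0 / ((mu + rho) * total)
```

Under these assumptions longer shared haplotypes imply younger coalescence times, which is the qualitative relationship the full model exploits.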

Table S1.

Performance characteristics of t_c estimator. Properties of estimated t_c values compared to true simulated t_c values. a) attributes of variants found k times in the sample. b) simulations of either constant-sized populations or one with recent exponential growth. c) haplotype phasing taken directly from the simulation (known) or statistically inferred from simulated diploid genotypes. d) root mean squared log error of estimator. e) average signed log error. f) Pearson’s r statistic.
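The three summary statistics named in this caption can be computed as in the following sketch. The function name is hypothetical, and two choices are assumptions rather than details given here: logarithms are taken in base 10, and Pearson’s r is computed on the log-transformed values.

```python
import math


def estimator_metrics(true_tc, estimated_tc):
    """Return (root mean squared log error, mean signed log error, Pearson r).

    Errors are log10(estimate) - log10(truth); Pearson's r is computed
    between the log10-transformed true and estimated values.
    """
    x = [math.log10(t) for t in true_tc]
    y = [math.log10(e) for e in estimated_tc]
    n = len(x)
    errors = [yi - xi for xi, yi in zip(x, y)]

    rmsle = math.sqrt(sum(d * d for d in errors) / n)
    signed = sum(errors) / n

    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    r = cov / math.sqrt(var_x * var_y)
    return rmsle, signed, r
```

An estimator that is consistently tenfold too old would give a signed log error of 1 with a correlation of 1, which shows why bias and correlation are reported separately.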

Figure S1.

Estimator performance. Density plots for estimation of t_c values as a function of true t_c values in simulated data. Panels (a) and (b) are simulated constant-sized populations of 10,000. Panels (c) and (d) are populations that have grown exponentially from 10,000 to 500,000 over the last 120 generations. Panels (a) and (c) are evaluated on true known haplotypes. Panels (b) and (d) use statistically phased haplotypes from simulated genotypes. Densities of singleton variants are indicated in blue, variants sampled five times in orange, 10 in green, and 25 in red.

Figure S2.

Estimator robustness to demographic model assumptions. Distribution of estimated t_c values for simulated singleton variants estimated assuming a constant-sized population of 10,000 (black) and 500,000 (red) as a function of the value estimated assuming the correct generative model of exponential population growth from 10,000 to 500,000 over the last 120 generations.

Figure S3.

Phasing accuracy of singleton variants as a function of the ratio of haplotype lengths under a variety of demographic models: samples of 1,000 chromosomes from constant population sizes of 10,000 and 1,000,000, and populations of 500,000 that have been growing exponentially at rates of 0.01%, 0.03%, and 0.1% per generation.

For identifying harmful alleles, the t_c estimator will become increasingly useful as sample sizes grow, because the rarer an allele is, the more strongly its frequency class is enriched for alleles under directional selection. Our focal data set, the UK10K genome sequencing study of over 7,200 haplotypes (over 3,600 individuals), has singleton alleles at an estimated population frequency less than 0.00014. We first asked if the distribution of estimated t_c values is consistent with previously estimated demographic histories for the UK, and then turned to questions of the direct selective effects of rare alleles, and finally to a study of indirect selective effects (background selection).

The probability that any two haplotypes share a common ancestor in a particular generation depends on the size and structure of the population. In small populations there are fewer ancestors overall and therefore a greater opportunity to share one than in a larger population, and in subdivided populations two individuals from different subpopulations are less likely to share a common ancestor than individuals from the same subpopulation. These kinds of processes do not depend on the function (or lack thereof) of any particular genetic locus and are simply properties of the movements of individuals. Therefore, they simultaneously influence the ages of coalescence events all across the genome. Any given demographic model of the history of a population implies a particular distribution of coalescence times for neutral loci everywhere in the genome. To assess recently published demographic histories of Europe and the United Kingdom, we compare their predicted distributions of times of coalescence (t_c) with those estimated at loci associated with rare variants in the UK10k population sample. We also use this distribution to fit additional parameters explicitly modeling potential recent immigration from Africa to the United Kingdom.
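The dependence of coalescence probability on population size can be made concrete with the standard Wright-Fisher approximation, in which two lineages in a diploid population of size N share a parent in a given generation with probability 1/(2N). That formula and the function name below are assumptions brought in for illustration; they are textbook results rather than part of the model fitted here.

```python
def prob_coalesced_by(generations, diploid_sizes):
    """Probability that two lineages have coalesced within `generations`.

    diploid_sizes[g] is the diploid population size g generations before
    sampling. Each generation the pair coalesces with probability 1/(2N).
    """
    prob_not_yet = 1.0
    for g in range(generations):
        prob_not_yet *= 1.0 - 1.0 / (2.0 * diploid_sizes[g])
    return 1.0 - prob_not_yet


# Two constant-size scenarios over the most recent 120 generations.
small = prob_coalesced_by(120, [10_000] * 120)
large = prob_coalesced_by(120, [500_000] * 120)
```

Because the same size history applies to every neutral locus, a change in `diploid_sizes` shifts the expected t_c distribution genome-wide, in contrast to the locus-specific effects of selection.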

Natural selection, by contrast, acts to perturb the t_c distribution heterogeneously across the genome. Natural selection can act directly on individual genetic variants to alter the expected length of the lineage on which they are found compared to neutral variants of the same frequencies. While neutral variants may persist at low frequencies for extended periods, or return to low frequencies after drifting to higher frequencies, deleterious rare variants are being steadily removed from the population and are prevented from drifting to higher frequencies or sojourning there. This leads to a younger age distribution for rare deleterious variants than for neutral variants. Negative frequency-dependent selection, where rare alleles confer an advantage over common ones, on the other hand, preserves rare variants (thus preventing their loss) and makes common variants rarer. Both of these phenomena will contribute to an older t_c distribution for rare variants under negative frequency-dependent selection than for rare neutral variants. Previous work seeking to identify genetic variants undergoing current or recent selection has focused on relatively common variants (e.g.^12,13). These can show signals of strong positive selection where new or very rare variants have quickly risen to high frequencies, and directional selection acting on common variants contributing to highly polygenic traits. Using t_c estimates we can see the direct influence of natural selection as it influences rare variants.

Results

Neutral variation and demographic history

Age distribution of rare variation in United Kingdom

We calculated t_c estimates for 21,992,410 of the rarest variants in the UK10k¹⁴ whole-genome population sequencing sample, which has been filtered to remove close relatives and individuals of non-European ancestry. The distribution of estimates revealed a dramatic excess of variation that is both old and rare, well beyond what is predicted by previous models of UK or European human history. Figure 1 shows the means and standard deviations of the distributions of log-transformed t_c estimates for variants found 2, 3, 4, 5, 10, and 25 times in the UK10k sample of 3,621 individuals (7,242 haplotypes), and compares them with predictions from five published models of UK and European demographic histories^15–19 as well as new models with additional admixture events from an African population or diverged archaic human group. For all of the lowest frequency classes, the observed data contain variants that are far too old to have been generated by the published models (all of which returned mean simulated t_c values considerably smaller than the mean of the estimates from the UK10K singletons). The models proposed by Gutenkunst et al.¹⁵ and Gravel et al.¹⁶, the two published models with migration between Africa and Europe, return standard deviations similar to that of the observed data but with insufficient old alleles to substantially raise the predicted mean. For variants as common as those found 25 times (a frequency of 0.35%), all models fit the observed distribution reasonably well.
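The frequencies quoted for each class follow directly from the sample size of 7,242 haplotypes. The short check below, with a hypothetical helper name, reproduces the two figures used in the text.

```python
N_HAPLOTYPES = 7242  # 3,621 diploid individuals in the UK10k sample


def sample_frequency(copies, n_haplotypes=N_HAPLOTYPES):
    """Sample frequency of a variant observed `copies` times."""
    return copies / n_haplotypes


singleton_freq = sample_frequency(1)       # about 0.000138, below 0.00014
k25_percent = 100 * sample_frequency(25)   # about 0.345, quoted as 0.35%
```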

Figure 1.

Distributions of log(t_c) values. Variants of different frequencies in the UK10k data (k values) represented as mean ± one standard deviation of the log-transformed t_c values. The observed distribution is marked in black with other colors indicating expectations under proposed demographic histories of Britain and Europe.

Admixture from archaic humans will have introduced old alleles, and some of these are expected to appear at the lowest frequencies in the UK10K sample. However, we found that admixture with archaic humans does not introduce sufficiently rare alleles in the numbers necessary to explain the discrepancy. While the alleles introduced by such admixture are old, few of the introduced alleles end up in the relevant frequency classes at the time of sampling. When we turned to extant human populations as potential sources of old, very low frequency variants, we found an excellent fit with models that include recent admixture from African populations.

Figure 2 shows the fit of a series of models based on that by Gazave et al.¹⁷ with the addition of a recent migration event from an un-sampled African population to an ancestral UK population from which it separated 2,000 generations ago. The best fit model is one with 1.2% admixture 21 generations ago. This is a model in which 10% of rare UK10k variants predate the human expansion out of Africa (see extended data figure S5); have been segregating at moderate frequency within Africa; and were recently introduced to the UK population from Africa through migration.

Figure S4.

10,000 coalescent simulations comparing the msh of a singleton variant with the approximation of the msh considering only events on the external branch immediately ancestral to the singleton variant and its first sister branch. Each sample is 100 chromosomes drawn from a constant-sized diploid population of 1e6.

Figure S5.

Expected distributions of t_c values by age and population. Simulation results for the expected distribution of t_c values colored by the population in which the coalescent event took place. Demographic models are (a) Gazave et al. and (b) African admixture (21gen).

Figure 2.

African admixture parameter optimization. Root mean squared log error between observed t_c values and those simulated over a grid of proposed values for timing and magnitude of African gene flow into the ancestors of the UK10k population.

Distribution of old rare UK variation across geography

If migration from Africa is responsible for substantially altering the age distribution of rare UK alleles, we expect to find a substantial proportion of these rare British alleles at higher frequency in African populations than in the UK (or other European populations). We assessed all of the UK10k singleton variants, and variants found 25 times, for their presence and frequency within the populations of the 1000 Genomes Project phase 3 dataset²⁰. Figure 3 shows that very rare UK10K alleles are typically at their rarest in Great Britain and present at higher frequencies in African populations than elsewhere in the world. In contrast, variants observed 25 times within the UK10k data show a very different pattern, as they are rarely found outside of Europe, but are found at relatively high frequency when they do occur. This is the pattern expected of alleles that have primarily not been recently introduced by migration, but rather have persisted both inside and outside of Africa since their origins before the population divergence at the time of the out-of-Africa expansion.

Figure 3.

Geographic distribution of UK10k variants. Panels (a) and (b) reflect variants that are singletons in the UK10k data. Panels (c) and (d) reflect variants found 25 times in the UK10k data. Panels (a) and (c) describe the proportion of UK10k variants found in each population sample in 1000 Genomes Project data. Panels (b) and (d) describe the average frequency in the 1000 Genomes Project data of UK10k variants. 95% confidence intervals are all less than 0.4% of the bar heights for panels (a) and (b) and less than 4% of the bar heights for panels (c) and (d)

Extended data figure S6 shows that UK10k variants that are found in African populations in the 1000 Genomes Project data have considerably older estimated t_c values than those that are not.

Figure S6.

Age distribution of African variants. Distributions of t_c values for UK10k singleton variants (a) and variants found 25 times (b) grouped by presence or absence in any West African population of the 1000 Genomes Project.

Distribution of old rare variation across UK individuals

While figure 2 indicates that the data support models with 1.2% admixture, the overall distribution of rare allele ages is consistent with a range of models that vary in the timing of the admixture event. We further refined the estimate of the time of admixture by explicitly modeling the distribution of the quantity of introgressed alleles observed among the individuals in the UK10K sample. In a random-mating model with admixture occurring over a short period of time, the introgressed alleles will come to be spread fairly evenly in the population over a small number of generations. Figure 4 shows that the distribution of the number of doubleton alleles with ages predating the out-of-Africa expansion shows considerable clustering across individuals, with 13% of individuals harboring 40% of all such variants. This clumping is suggestive of recent introgression, but could also be consistent with older introgression where the decay of clustering is slowed by non-random mating. As shown in figure 4, the observed distribution is well fit by a broad range of models of assortative mating, all implying a time of admixture 11-14 generations before sampling. The model ‘African admixture (14gen)’ in figure 1 represents the expected distribution of log(t_c) values for a 1.2% admixture event 14 generations before sampling.
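The clustering summary used here, the cumulative share of old doubletons carried by individuals ranked from most to fewest, can be computed as in this sketch. The function names and the per-individual counts in the usage note are hypothetical; only the ranking-and-accumulating logic is taken from the description above.

```python
def cumulative_share(counts):
    """Return [(fraction_of_individuals, fraction_of_variants), ...]
    with individuals ordered from most to fewest variants carried."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    curve = []
    running = 0
    for rank, count in enumerate(ordered, start=1):
        running += count
        curve.append((rank / len(ordered), running / total))
    return curve


def share_held_by_top(counts, fraction):
    """Fraction of all variants carried by the top `fraction` of individuals."""
    ordered = sorted(counts, reverse=True)
    k = round(fraction * len(ordered))
    return sum(ordered[:k]) / sum(ordered)
```

For example, with made-up counts `[8, 1, 1, 0, 0, 0, 0, 0, 0, 0]`, `share_held_by_top(counts, 0.1)` reports that the top 10% of individuals carry 80% of the variants. Under random mating the curve would approach the diagonal within a few generations, so a strongly bowed curve points to recent or assortatively maintained ancestry.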

Figure 4.

Cumulative proportion of doubleton alleles more than 2,500 generations old carried by individuals ordered from most to fewest old doubleton variants. Observed distribution is marked in dotted black. Colored lines are simulations of 1.2% admixture at different historic time points and assuming (a) random mating or (b) assortative mating with a threshold of 22% and intensity of 63%. Panel (c) shows the minimum (across ages) root mean square error between the observed cumulative proportion of old doubletons and that predicted under a range of models of non-random mating. Dark bars at the top and left indicate poor fit of random mating models. Light bar across the middle indicates similar support for models with a wide range of ancestry thresholds.

If African populations are indeed the source of many of the old singletons in the UK, then individuals carrying large numbers of old singletons should show higher sharing of non-singleton alleles with African populations. We tested this prediction with an ABBA-BABA test^21,22, comparing UK10K genomes that have either high or low counts of singletons with estimated ages older than 3,000 generations. For each of the UK10K genomes the D statistic was determined using a random CEPH genome (representative of Europe) and a random Yoruban genome (representative of Africa) from the 1000 Genomes Project dataset²⁰. The two distributions of D statistics show some overlap (extended data figure S7). However, the UK10K genomes with low old-singleton counts each had mean D values (overall mean of −0.3615) lower than each of the high old-singleton count genomes (overall mean of −0.3500), consistent with closer proximity to Yoruban genomes for the high singleton count UK10K genomes. A t test of the two groups of mean D values was highly significant (t = 5.9589, d.f. = 8, p = 0.0003).
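For readers unfamiliar with the ABBA-BABA test, the sketch below computes the D statistic in its usual form, D = (nABBA − nBABA) / (nABBA + nBABA), from four aligned sequences. The function name is hypothetical, and the assignment of the UK10K, CEPH, and Yoruban genomes to the P1, P2, and P3 positions, as well as the choice of outgroup, is not specified here, so the arguments are left generic.

```python
def d_statistic(p1, p2, p3, outgroup):
    """Patterson's D from four aligned sequences of equal length.

    The outgroup allele is treated as ancestral ("A"); any other allele
    is derived ("B"). Sites are counted as:
      ABBA: p1 ancestral, p2 and p3 share the derived allele
      BABA: p2 ancestral, p1 and p3 share the derived allele
    All other site patterns are ignored.
    """
    abba = baba = 0
    for a1, a2, a3, anc in zip(p1, p2, p3, outgroup):
        if a3 == anc:
            continue  # p3 must carry the derived allele
        if a1 == anc and a2 == a3:
            abba += 1
        elif a2 == anc and a1 == a3:
            baba += 1
    if abba + baba == 0:
        raise ValueError("no informative ABBA or BABA sites")
    return (abba - baba) / (abba + baba)
```

A value of D near zero indicates that P1 and P2 share derived alleles with P3 equally often, while a shift in D between groups of genomes, as reported above, indicates differential allele sharing with the P3 population.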

Figure S7.

Histogram of D statistics calculated for UK10k individuals with the greatest (orange) and least (blue) number of singleton variants with estimated t_c values > 3,000 generations.