Abstract
Here we present the first genome-wide statistical test for recessive selection. This test uses explicitly non-equilibrium demographic differences between populations to infer the mode of selection. By analyzing the transient response to a population bottleneck and subsequent re-expansion, we qualitatively distinguish between alleles under additive and recessive selection. We analyze the response of the average number of deleterious mutations per haploid individual and describe the time dependence of this quantity. We introduce a statistic, BR, to compare the number of mutations in different populations and detail its functional dependence on the strength of selection and the intensity of the population bottleneck. This test can be used to detect the predominant mode of selection at the genome-wide or regional level, as well as among a sufficiently large set of medically or functionally relevant alleles.
1 Introduction
In diploid organisms, selection on an allele, or a group of alleles, can be categorized as additive, dominant or recessive, or as part of a more general epistatic network. A large body of existing work is devoted to statistical methods to detect and quantify selection using DNA sequencing data, including comparative genomics and the sequencing of population samples [1,2,3]. However, much less progress has been made toward identifying the predominant mode of selection as additive, recessive or dominant. The genetics of model organisms and of human disease provides plenty of anecdotal evidence in favor of the importance of dominance [4]. Although genome-wide association studies suggest that alleles of small effect involved in human complex traits frequently act additively, estimation of genetic variance components from large pedigrees suggests a substantial role for dominance in a number of human quantitative traits [5]. Alleles of large effect involved in human Mendelian diseases, as well as spontaneous and induced mutations in model organisms such as mouse, zebrafish, or Drosophila, are frequently recessive [6]. In spite of these observations, the role of dominance in population genetic variation and evolution remains unexplored, and no formal statistical framework to test for the dominance coefficient is currently available.
Using a combination of theoretical analysis and computer simulations, we demonstrate that recessive selection can be qualitatively distinguished from additive selection in populations that experienced a population bottleneck and subsequent re-expansion. Previous studies of non-additive variation in the presence of a bottleneck lack a complete description of the dynamics after re-expansion [7,8,9,11,3], or focus on epistatic interactions rather than recessive selection [12,13,14,15,16,17], with the notable exception of a recent independently conducted complementary analysis found in [18]. Contrary to naive expectation, the number of deleterious recessive alleles per haploid genome is transiently reduced after a population bottleneck, while the number of additively or dominantly acting alleles is increased. In spite of a well-documented increase in frequency of some recessively acting variants in founder populations, the average number of recessive alleles carried by an individual is reduced as a consequence of the bottleneck. With the growing availability of DNA sequencing data in multiple populations, these results demonstrate the potential to directly evaluate the role of dominance, either on a whole genome level, or in specific categories of genes.
Population bottlenecks are a common feature in the history of many human populations. For example, the “Out of Africa” bottleneck involved ancestors of many present-day human populations. Numerous recent bottlenecks affected, among others, well studied populations of Finland and Iceland. More generally, bottlenecks followed by expansions are standard features in the recent evolution of most domesticated organisms. We suggest that complex demographic history may assist rather than complicate statistical inference of selection in population genetics. Here we use the distinct demographic histories of two subpopulations to identify the type of selection dominating the dynamics, and show that the average number of mutations per individual, 〈x〉, is dependent on the mode of selection. We introduce a measure BR (the “burden ratio” defined below) that provides a simple statistical test for any set of polymorphic alleles in the population, where BR < 1 corresponds to predominantly additive selection and BR > 1 to predominantly recessive selection, as shown in Figure 1. This test is not restricted to the simplified demographic model presented in this paper, but rather provides a quite generic qualitative test for the predominance of recessive selection in comparison between two populations, one of which experienced a bottleneck event.
2 Model
We work with a simple demography described by an ancestral population of N0 individuals that splits into two subpopulations, one with population size N0 equal to the initial population size (“equilibrium”), and one with reduced bottleneck population size NB (“founded”). The latter population persists at this size for TB generations before instantaneously re-expanding to the initial population size N0, as shown in Figure 1. Time t is measured after the re-expansion from the bottleneck, as we are interested in the dynamics during this period. Quantities measured in the equilibrium population, and equivalently prior to the split, are denoted with a subscript “0”. We consider only deleterious mutations with average selective effect of magnitude s > 0, such that s represents the strength of deleterious selection. Extensions of this analysis to a full distribution of selective effects can be found in the SI. The initial population is in steady state with 2N0Ud deleterious alleles introduced into the population at a mutation rate Ud per haploid individual per generation. In a steady state equilibrium, the site frequency spectrum (SFS) of polymorphic alleles is given by Kimura [19].
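The demography described above can be written down as a small schedule function. This is only an illustrative sketch: the function and parameter names (`founded_population_size`, `n0`, `nb`, `tb`) are hypothetical and do not come from the simulator described in the SI.

```python
def founded_population_size(generation, n0, nb, tb):
    """Diploid size of the founded population.

    `generation` is counted from the split (hypothetical convention):
    the bottleneck lasts `tb` generations at size `nb`, then the
    population re-expands instantaneously to `n0`.
    """
    if generation < 0:
        raise ValueError("generation must be non-negative")
    return nb if generation < tb else n0


def time_since_reexpansion(generation, tb):
    """The time variable t of the text, measured from the end of the bottleneck."""
    return generation - tb
```

The equilibrium population simply keeps size `n0` throughout, so it needs no schedule.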
Here h > 0 is the dominance coefficient for deleterious mutations, where h = 1/2 corresponds to a purely additive set of alleles, and h = 0 corresponds to the purely recessive case. For the present analysis, we primarily focus on these two limits, contrasting their effects on the genetic diversity. The solution represents a mutation-selection-drift balance in which new mutations are exactly compensated for by the purging of currently polymorphic alleles due to selection and extinction due to stochastic drift. In this way, an approximately static number of polymorphic alleles exists in the population at any given time.
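To make the role of h concrete, the following sketch computes the deterministic one-generation change in the frequency of a deleterious allele. It assumes the common fitness scheme 1, 1 - hs, 1 - s for zero, one and two copies of the allele; this parameterization is an assumption, since the exact convention behind Equation (1) is not shown here. The function name is hypothetical.

```python
def next_frequency(x, s, h):
    """Deterministic allele frequency after one generation of selection.

    Assumed genotype fitnesses: 1 (no copies), 1 - h*s (one copy),
    1 - s (two copies). Drift and mutation are ignored.
    """
    hom = x * x * (1.0 - s)                      # deleterious homozygotes
    het = 2.0 * x * (1.0 - x) * (1.0 - h * s)    # heterozygotes
    wild = (1.0 - x) ** 2                        # wild-type homozygotes
    mean_fitness = hom + het + wild
    # Homozygotes contribute two copies, heterozygotes one.
    return (hom + 0.5 * het) / mean_fitness
```

Under this scheme the change is approximately -(s/2)x for h = 1/2 and -s x² for h = 0 at low frequency, which is why rare recessive alleles are largely sheltered from selection.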
3 Results
We follow the expected number of mutations per chromosome in the population, which is simply the first moment of the SFS.
When multiplied by s, this is the effective “mutation load” of each individual in the additive case, but in the case of purely recessive selection this is not proportional to the fitness, as selection acts only on homozygotes. We refer to this statistic generally as the “mutation burden” to avoid assumption of any given mode of selection. Comparison between the mutation burden in the equilibrium and founded populations in the form of the “burden ratio”, BR, provides a test for recessive alleles.
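As a minimal sketch of how these quantities could be computed from data, the code below takes per-site derived allele frequencies in each population. The orientation of the ratio (equilibrium over founded) is inferred from the stated convention that BR > 1 indicates recessive selection; the paper's own defining equation is not reproduced here, and the function names are hypothetical.

```python
def mutation_burden(frequencies):
    """Expected number of deleterious alleles per haploid genome.

    This is the first moment of the SFS, i.e. the sum of the per-site
    allele frequencies over the chosen set of sites.
    """
    return sum(frequencies)


def burden_ratio(equilibrium_freqs, founded_freqs):
    """Burden ratio BR, assumed here to be <x>_equilibrium / <x>_founded.

    Both lists should cover the same set of sites, including sites that
    are lost or fixed in one of the two populations.
    """
    founded = mutation_burden(founded_freqs)
    if founded == 0:
        raise ValueError("founded population carries no mutations at these sites")
    return mutation_burden(equilibrium_freqs) / founded
```

For example, `burden_ratio([0.1, 0.2, 0.3], [0.1, 0.1, 0.2])` is about 1.5, the direction expected for predominantly recessive variation.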
To gain intuition for this qualitative difference, we work to quantitatively understand the population dynamics in a simple demography, first for purely additive selection, and then for purely recessive selection for comparison.
3.1 Additive selection and response to a bottleneck
The initial site frequency spectrum for purely additive alleles is given by Equation (1) with h = 1/2.
Here . When , the SFS rapidly decays as x → 1 simplifying the functional form[20]. We approximately compute the initial mutation burden as follows.
Now we deviate from equilibrium by reducing the population size to 2NB chromosomes, representing a population bottleneck. The effect that a bottleneck has on the site frequency spectrum is twofold: a fraction of alleles are removed from the population due to increased random drift, and the mean of the remaining alleles occurs at higher frequency. The dynamics of the distribution ϕ(x, t) during such a change in demography can be computed from Kolmogorov’s forward equation, as detailed in the SI. The first moment of the distribution, the mutation burden, follows the temporal dynamics derived from summing the Kolmogorov equation over all alleles in the genome, and takes the following form.
The burden of additive mutations is not directly affected by drift, as the drift term vanishes from the dynamics of the first moment, however the dependence on the second moment introduces an indirect dependence on drift. In the strong selection regime, in the limit where , extinction of some alleles is exactly compensated for by an increase in frequency of other alleles. This is true in the equilibrium distribution prior to the bottleneck when , where and . During the bottleneck, the mutation burden 〈x〉 monotonically increases; the second moment 〈x2〉 increases, as well, reaching a maximum value in the case of a long bottleneck where it scales as . Provided , the second moment is guaranteed to be subdominant to the first moment, simplifying the dynamics as follows.
For a bottleneck of duration TB, this equation admits solutions of the form,
After plugging in the equilibrium initial value 〈x〉0, we find that the time dependence drops out completely, demonstrating that the population remains in mutation-selection balance throughout the bottleneck. After instantaneous re-expansion to the initial population size, the dynamics of the distribution ϕ(x) are completely analogous to those inside the bottleneck in this limit, such that the mutation burden never deviates during the demographic perturbation.
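The strong-selection argument can be sketched explicitly. The following is a reconstruction under stated assumptions, not the paper's own displayed equations: genotype fitnesses 1, 1 - s/2, 1 - s for the additive case, leading order in s, and moments defined as sums over segregating sites.

```latex
% First-moment dynamics: mutational input minus additive selection (assumed convention)
\frac{d\langle x\rangle}{dt} = U_d - \frac{s}{2}\left(\langle x\rangle - \langle x^2\rangle\right)

% Dropping the subdominant second moment
\frac{d\langle x\rangle}{dt} \approx U_d - \frac{s}{2}\langle x\rangle
\quad\Longrightarrow\quad
\langle x(t)\rangle = \frac{2U_d}{s} + \left(\langle x(0)\rangle - \frac{2U_d}{s}\right)e^{-st/2}

% Starting from mutation-selection balance, the transient term vanishes
\langle x(0)\rangle = \frac{2U_d}{s}
\quad\Longrightarrow\quad
\langle x(t)\rangle = \frac{2U_d}{s}\ \text{ for all } t
```

Under these assumptions the burden stays constant because drift does not appear in the first-moment equation and selection remains efficient throughout.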
In the opposite limit of completely relaxed selection during the bottleneck, the dynamics of the mutation burden are completely driven by the influx of new mutations.
The net effect of this accumulation over the course of the bottleneck is simply the integral of this quantity. For a bottleneck with duration TB generations, the net effect of mutation accumulation due to relaxed selection is given simply by the following expression.
Additionally, one can show that the second non-central moment gains an analogous contribution in addition to the net effect of drift.
Here we have expressed the second moment as a function of the bottleneck intensity IB = TB/2NB. Immediately after re-expansion from the bottleneck, selection is again efficient, such that the dynamics are completely described by Equation (6). Although the second moment is increased due to relaxed selection during the bottleneck, we find that this increase is negligible in comparison to the direct accumulation in the first moment provided IB ≪ 1. As a result, the primary effect of the bottleneck in this limit is to accrue new mutations that are subsequently purged when selection is again efficient in the re-expanded population. The dynamics for the two limiting cases can be summarized as follows.
We note that at all times in both limiting cases, and asymptotically decays to the equilibrium frequency on a timescale given by the strength of selection of the accumulated deleterious mutations. In the case of an instantaneous bottleneck, we find that the mutation burden is only slightly shifted even if selection is fully relaxed, resulting in effectively no observable change in either limit. Our statistical measure, the burden ratio BR, in the additive case can be written approximately as follows.
As we will see in the following sections, recessive selection results in a depleted mutation burden with corresponding values BR > 1, providing a contrast to the additive scenario and justifying our use of this statistic as a test for recessivity.
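The relaxed-selection limit for additive alleles can be illustrated numerically. This sketch assumes the convention in which the additive equilibrium burden is 2Ud/s and the excess decays at rate s/2 after re-expansion; these rates, the function names, and the orientation of BR (equilibrium over founded) are assumptions made for illustration rather than the paper's exact expressions.

```python
import math


def additive_burden_after_bottleneck(t, u_d, s, t_b):
    """Founded-population burden t generations after re-expansion.

    Assumes selection was fully relaxed for t_b generations, so the
    burden gained u_d * t_b, and that this excess is then purged
    exponentially at the assumed rate s / 2.
    """
    equilibrium = 2.0 * u_d / s
    excess = u_d * t_b
    return equilibrium + excess * math.exp(-0.5 * s * t)


def additive_burden_ratio(t, u_d, s, t_b):
    """BR(t) = equilibrium burden / founded burden in this limit."""
    equilibrium = 2.0 * u_d / s
    return equilibrium / additive_burden_after_bottleneck(t, u_d, s, t_b)
```

In this sketch BR starts below one and relaxes monotonically back to one, in line with the additive signature described in the text.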
3.2 Recessive selection and dynamics of the mutation burden
Prior to the bottleneck, the initial site frequency spectrum for alleles under recessive selection is given by the h = 0 limit of Equation (1).
At low frequencies the spectrum decays more slowly than in the additive case, representing alleles protected from recessive selection by existing primarily in heterozygous form. In contrast, at high frequencies the spectrum decays faster than the additive exponential decay, falling off as a Gaussian in x.
3.2.1 Instantaneous population bottlenecks
First, we restrict our analysis to an instantaneous bottleneck with intensity IB = 1/2NB, as this provides insight into the non-equilibrium response of the frequency spectrum to a downsampling event. Later, we extend our analysis to finite bottlenecks that persist for TB generations, with total intensity IB = TB/2NB. We represent the increase in drift due to a single generation bottleneck by downsampling. During this time step, NB diploid individuals are chosen at random from the initial larger population of N0 individuals.
Binomial sampling gives the distribution ϕB of deleterious alleles with frequency x = k/2NB. There is a loss of allelic variation due to the bottleneck, corresponding to the k = 0 term in Equation (13).
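The effect of a single downsampling step on one allele can be checked exactly. The sketch below assumes that 2NB chromosomes are drawn binomially from an effectively infinite pool at frequency x, which is an idealization of the sampling step; the function name is hypothetical.

```python
from math import comb


def downsampled_moments(x, nb):
    """Exact moments of the allele frequency after sampling 2*nb chromosomes.

    Returns (mean, second_moment, loss_probability), where the loss
    probability is the k = 0 term of the binomial distribution.
    """
    n = 2 * nb
    probs = [comb(n, k) * x**k * (1.0 - x) ** (n - k) for k in range(n + 1)]
    mean = sum(p * k / n for k, p in enumerate(probs))
    second = sum(p * (k / n) ** 2 for k, p in enumerate(probs))
    return mean, second, probs[0]
```

The mean is unchanged, while the second moment grows from x² to x² + x(1 - x)/(2NB). This is the sense in which a bottleneck leaves the burden untouched at first but widens the frequency spectrum.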
Re-expansion is modeled as up-sampling the distribution ϕB(x) from NB to N0 diploid individuals, which has negligible effect on the first and second moments of the distribution. As a result of drift to higher frequencies during the bottleneck, much of the existing variation appears in homozygous form immediately after the increase in population size. These individuals are rapidly selected out of the population, driving high frequency alleles to lower frequencies on a very short time scale. Since drift is once again suppressed, selection becomes far more efficient, particularly for alleles of large selective effect.
The time evolution of ϕ after the bottleneck is given by the forward Kolmogorov equation for recessive selection (see SI). The mutation burden follows the time dependence,
Here we suppress a selection term proportional to 〈x³〉, in analogy to the additive case. Since recessive selection depends quadratically, rather than linearly, on the allele frequency, the increased variance of the distribution drives the motion of the mutation burden. Alleles at sufficiently high frequency appear in homozygous form and are rapidly pushed down to lower frequencies. This happens on a time scale of order s^(−1/2) and effectively reduces the variance, slowing the decrease in the mutation burden 〈x〉. New mutations introduced during this period slowly drift to appreciable frequencies, replacing those lost in the bottleneck. This process is drift controlled, rather than selection controlled, and thus occurs on a time scale of O(2N0) generations. As a result, the mutation burden quickly decreases due to selection immediately after the bottleneck until it slows to a stop, and then gradually increases as the population accumulates new mutations and re-equilibrates.
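The initial decrease of the burden can be made explicit with a short sketch. It is a reconstruction under stated assumptions rather than the paper's own equations: genotype fitnesses 1, 1, 1 - s for the recessive case, leading order in s, moments defined as sums over segregating sites, and a single-generation binomial bottleneck of 2N_B chromosomes.

```latex
% First-moment dynamics under purely recessive selection (assumed convention)
\frac{d\langle x\rangle}{dt} = U_d - s\left(\langle x^2\rangle - \langle x^3\rangle\right)
\approx U_d - s\langle x^2\rangle

% Binomial downsampling conserves the first moment and inflates the second
\langle x\rangle_B = \langle x\rangle_0, \qquad
\langle x^2\rangle_B = \langle x^2\rangle_0 + \frac{\langle x\rangle_0 - \langle x^2\rangle_0}{2N_B}

% Using the equilibrium balance U_d \approx s\langle x^2\rangle_0, the slope just after re-expansion is negative
\left.\frac{d\langle x\rangle}{dt}\right|_{t=0}
\approx -\frac{s}{2N_B}\left(\langle x\rangle_0 - \langle x^2\rangle_0\right) < 0
```

In this approximation the size of the initial drop is set by the excess second moment created by drift, not by any change in the burden itself.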
A minimum in the mutation burden 〈x(t)〉founded occurs when the time derivative vanishes. This corresponds to a characteristic time scale associated with the selective effect s, where our statistical test is maximized. Since this time scale is shorter than the time scale of drift, we can imagine rescaling time by the effective population size 2N0 and then working in the perturbative regime t/2N0 ≪ 1. This allows us to Taylor expand near the re-expansion time t = 0 to understand the motion of the mutation burden at times soon after the bottleneck.
To understand the time dependence of 〈x2〉, specifically the time derivative, we analyze the higher moments in the same fashion as employed for the first moment in Equation (14). All relevant moments are computed in the SI and we note sufficient convergence to validate this expansion. This allows for the re-expression of Equation (15) to second order in t in terms of the first three moments of the site frequency spectrum immediately after re-expansion. The moments of the post-bottleneck initial distribution can be written in terms of the initial equilibrium distribution using the integral form given in Equation (13). Details of this calculation appear in the SI. In the strong selection limit 2N0s ≫ 1 these initial equilibrium moments are readily approximated by standard convolutions of a polynomial with a Gaussian. Suppressing subdominant contributions in the limit , we find the following approximation to the trajectory of the mutation burden immediately after the bottleneck re-expands.
Concentrating on this second order expansion in t, we find that the curve first drops from its initial value, quickly reaches a minimum, and is then brought back up by the positive second order term. The location of the minimum is easily found to have the following parameter dependence.
The second derivative is positive at this extremum, implying a local minimum. Plugging tmin into our expression for 〈x(t)〉 in the limit N0s ≫ 1, we find the following minimum value for the average number of recessive deleterious mutations per genome following a bottleneck.
We note that is the approximate mutation burden for the equilibrium distribution in the 2N0s ≫ 1 limit, allowing us to simply write the extreme value of the BR statistic as follows.
We find the following dependence on time in immediate response to a population bottleneck.
This expansion is only valid in the small time limit where the quadratic term is subdominant, such that all values are positive. Long before this simple quadratic expression becomes negative, higher order contributions become relevant and dominate. As seen in simulations described in the following section, for recessive deleterious mutations, the burden ratio remains positive at all times.
This precise result applies strictly in the limit of a strong, single generation bottleneck, where N0 ≫ NB. Additionally, the technique used to compute integral expressions required the strong selection limit 2N0s ≫ 1. Analysis of higher order contributions to the trajectory is made substantially easier by restricting to the limit , which happens to be biologically reasonable, for example, in human populations where most examples of founding events are on the order of N0 ~ 10^4 and NB ~ 10^3 (see further discussion in the SI on general dominance coefficients). Despite these analytic restrictions in parameter space, our simulations described below indicate that the signature of BR > 1 is ubiquitous for populations under predominantly recessive selection.
3.2.2 Extended population bottlenecks
We argue that for the case of relatively low intensity bottlenecks, where intensity is defined as IB = TB/2NB ≪ 1, we can approximately express the magnitude of BR using a simple substitution 1/2NB → TB/2NB. This is equivalent to the claim that for low intensity bottlenecks, the BR statistic depends only on the ratio of the bottleneck time to the bottleneck population size, and any explicit dependence on TB occurs in subdominant contributions. This intuition is confirmed by simulations described below, where we show that the accuracy of our analytic approximation breaks down as IB → 1 and the intensity becomes non-perturbative. For short bottlenecks with IB < 1/10, the approximation of an instantaneous single generation sampling event remains sufficiently accurate, even for strong selective coefficients s ~ 0.1. Under this trivially extended instantaneous approximation, BR(t) can be written in terms of the intensity of a short bottleneck as follows.
The BR of maximum effect has a magnitude given approximately by,
To illustrate the behavior described by the above analytics, we present a time series of recessive simulations with curves representing various selection coefficients in Figure 2. The time dependence of the BR statistic is plotted to demonstrate the simulated population’s response to a founder event. Crucially, we find that the peak BR values vary in both magnitude and time as a function of s, consistent with our analytic understanding and intuition.
3.3 Transient response and time of observation determine detectable selection coefficients
Thus far, we have detailed the dynamic response of a set of alleles in a population, all with selective effect s, to a demographic perturbation in the form of a bottleneck. Notably, for recessive selection, a peak response occurs in the BR statistic at some time tmin after re-expansion. In general, both the magnitude of BR(tmin) and the time of the peak itself depend sensitively on the selection coefficient. In a real population, a distribution of mutations with different selective effects will be present, many of which may be simultaneously polymorphic. Since alleles of different selective effect respond to the bottleneck on different time scales, one can ask what selective effect is most likely to be observed at a given time. For example, very strong selection has the tendency to peak and subsequently re-equilibrate immediately after the bottleneck, such that observation of alleles with large s is substantially more difficult at later times. On the other hand, alleles under relatively weak selection have a peak effect at very late times, such that at the time of data collection a statistically significant response may not yet have occurred.
We would like to understand the transient behavior of the burden ratio BR(t), as well as the value of the selection coefficient s for which BR is largest at a given time. When comparing to population data, one has little control over the demographic history, and thus it becomes important to understand the selective coefficient that dominates at the time of observation. According to the time dependent expression in Equation (21), we expect the effect to decrease quite rapidly for very large s. However, the peak occurs quite early in the case of larger s values, allowing the mutation burden to equilibrate over a longer period of time between the peak and observation, returning BR to values close to 1. This tells us that the equilibration process is what reduces the magnitude of BR for large s. In the case of very recent bottlenecks, the large s values dominate, but for later times of observation, this signal has partially equilibrated, potentially allowing a smaller s value to dominate the statistic. At a given time of observation tobs, one can represent BR(s, tobs) as a function of various selection coefficients s. Figure 3 represents BR(s) for a fixed tobs for various dominance coefficients h. We concentrate here on recessive variation with h = 0, but note that a crossover occurs at some value hc where additive and recessive effects offset each other in the BR statistic (detailed in SI). Based on our analytics, we expect the peak to shift from extreme high s values at early times to extreme low s values at late times, eventually dissolving into neutrality. We take the s derivative of Equation (21) to find the maximum at tobs.
One can easily show that the second derivative evaluated at this point is negative, confirming that this is a maximum. This result matches our intuition: maximum s values of BR(s, t) are found at high s for early times, smax(t → 0) ≫ 1, and at low s for late times, smax(t → ∞) ≪ 1. This is qualitatively observed in our simulations by comparing the relative values of BR(s) as a function of time.
As the effect is transient, we can define a relaxation time trelax corresponding to the vanishing of any response to the bottleneck. This is given by determining when smax is dominated by effectively neutral variation at roughly smax ∼ 1/2N0. After this time, BR(s, t) cannot be differentiated from one for any s.
We note that the return to equilibrium happens on a time scale faster than random drift, even for the weakest selective effects, thus validating our perturbative approximations using t/2N0 ≪ 1. Higher order time dependence in Equation (21) may substantially correct this estimate, but we feel that the presentation of this methodology is conceptually important and provides a greater understanding of the transient dynamics of population response to bottlenecks. As it is relevant to human populations, we note that if both populations expand exponentially after the bottleneck, the effect may persist long beyond trelax. This is explored analytically in the SI and in simulations in an accompanying paper [21].
4 Comparison of analytic results to simulation
We checked our analytic results using a forward time population simulator, described in detail in the SI. Given the ubiquity and analytic simplicity of the exponential decay in the additive scenario, we focus here on our predictions for recessive variation. We compare analytic expressions of BR(tmin) at the peak response given in Equation (22) for various selection coefficients. We simulated a wide range of bottleneck parameters to test the limitations of our theoretical understanding. In Figure 4, we demonstrate the accuracy of our analytic results by plotting the ratio of the simulated values of BR(tmax, s, IB) to our analytic predictions BR(tmax, s, IB) as presented in Equation (22). We arrange our simulated data by bottleneck intensity IB, as we expect the instantaneous bottleneck approximation to break down as intensity is increased due to longer bottleneck duration TB ≫ 1. As plotted, complete agreement between simulated data and analytic predictions is represented by a flat line at 1. As expected, we find deviations as we approach the limitations of our perturbative approximation roughly around TB ∼ 2NB/10, when IB ∼ 0.1. Below these higher intensities, we find quite good agreement for all parameter sets, with well below 10% error even at IB = 0.05.
5 Discussion
The increase in prevalence of recessive phenotypes following population bottlenecks has long attracted the interest of geneticists [7,22]. Theoretical analysis of allele frequency dynamics in a population expanding after a bottleneck suggested that the frequency of an individual allele may rise due to increased drift [22,23,24]. Here, we focus on a more general question of the collective dynamics of recessively acting genetic variation. Surprisingly, our analysis suggests that the number of recessively acting variants per haploid genome is reduced in response to a bottleneck and subsequent re-expansion. Generally, we have demonstrated that the frequency spectra of recessive deleterious polymorphisms behave distinctly from additively acting variation following a population bottleneck and subsequent re-expansion. The response of additive variation depends crucially on the average number of deleterious alleles, and on the number of generations for which selection is relaxed during the bottleneck. In contrast, the dynamics of recessive variation crucially depend on the width of the site frequency spectrum, rather than the average number of mutations per individual, such that the accumulation of deleterious mutations can respond strongly even to a single generation bottleneck. Importantly, the temporal dynamics of the accumulation of deleterious alleles depend qualitatively on the dominance coefficient and quantitatively on the selection coefficient. The qualitative dependence on the dominance coefficient allows for a robust statistical test for recessivity. If the variation is additive, the number of deleterious variants per haploid genome is larger in a bottlenecked population than in a corresponding equilibrium population. If the variation acts recessively, this number is smaller. The selection coefficient determines the timing of the response to a bottleneck.
By explicitly analyzing the non-equilibrium response to a bottleneck, we have demonstrated a technique for using potentially confounding demographic features to probe the underlying population genetic forces. In realistic populations, for example in modern humans, substantial work has been done to identify and understand the recent demographic history of geographically disparate populations [25,26,27,28,29,30,31,32,33,34]. In the case of the “Out of Africa” event, a historically substantiated and believable demographic model can be used to model the difference between African and European populations since their divergence. Comparison between populations that have and have not undergone a bottleneck can be used to infer plausible selection and dominance coefficients. In an accompanying paper [21], we specialize this analysis using a realistic demographic model to attempt to bound the selection and dominance coefficients in modern human populations. Parameterizing only by the duration of the bottleneck TB, along with s and h, one can show that a substantial fraction of this three dimensional space is disallowed by the observation of even a single bottleneck.
Although the net number of recessive deleterious mutations is reduced as a consequence of a founder event and subsequent re-expansion, the fitness of individuals carrying these alleles is not increased, but rather decreased; selection acts only at homozygous sites and the number of homozygotes is known to increase after a population bottleneck. However, the number of heterozygous deleterious sites, or the average carrier frequency for associated alleles, is suppressed, such that the mating of individuals from disparate bottlenecked populations may result in a decreased incidence of recessive phenotypes in such mixed lineages. In studies of model organisms, this may have applications when comparing laboratory populations founded from a few wild type individuals to their corresponding natural population.
In principle, the results of this study are applicable to the analysis of specific groups of genes and pathways. Sufficiently large subsets of alleles that are medically relevant may be analyzed in humans to identify the mode of selection for candidate variants of recessive diseases. For model organisms with a significant density of deleterious alleles, it may be possible to create a dominance map of the genome.
In sum, the non-equilibrium dynamics induced by demographic events is an essential, and indeed insightful, feature of most realistic populations. Population bottlenecks, abundant in laboratory populations and in natural species, have the potential to provide a novel perspective on the role of dominance in genetic variation.
6 Acknowledgements
The authors would like to thank Benjamin Good, Alexey Kondrashov, Nick Patterson, Jonathan Pritchard, and Guy Sella for particularly useful discussions. DJB and SS were generously supported by NIH grants R01 MH101244 and R01 GM078598. RD was supported by a CIHR Banting fellowship. DR is grateful for support from NIH grant R01 GM100233.