Abstract
Population genomic datasets collected over the past decade have spurred interest in developing methods that can utilize massive numbers of loci for inference of demographic and selective histories of populations. The allele frequency spectrum (AFS) provides a convenient framework for such analysis and accordingly much attention has been paid to predicting theoretical expectations of the AFS under a number of different models. However, to date, exact solutions for the joint AFS of two or more populations under models of migration and divergence have not been found. Here we present a novel Markov chain representation of the coalescent on the state space of the joint AFS that allows for rapid, exact calculation of the joint AFS under generalized isolation with migration (IM) models. In turn, we show how our Markov chain method, in the context of composite likelihood estimation, can be used for accurate inference of parameters of the IM model using SNP data. Lastly, we apply our method to recent whole genome datasets from Drosophila melanogaster.
INTRODUCTION
The explosion in availability of genome sequence data brings with it the promise that longstanding questions in evolutionary biology might now be answered. In particular, understanding the forces at work when populations begin to diverge from one another is crucial to our understanding of the process of speciation. Population genomic sampling of multiple individuals from closely related populations provides our clearest view of the evolutionary forces at work during divergence; however, it remains a challenge to determine how best to analyze such massive datasets in a population genetic framework (Sousa and Hey 2013).
A popular model for population divergence is the so-called isolation with migration (IM) model (Wakeley 1996; Nielsen and Wakeley 2001; Hey and Machado 2003), in which a single ancestral population splits into two daughter populations at a given time and the daughter populations then have some degree of geneflow between them. IM models are a convenient framework for statistical estimation of population genetic parameters, as the models described by various parameter combinations exist along a continuum from pure isolation after divergence to panmixia among daughter populations. More complex models of divergence, for instance secondary contact after isolation or geneflow that stops after a certain period of time, are also readily modeled in the IM framework. As a result numerous methods are now available for estimation of IM parameters.
Generally there exist two classes of methodology for the estimation of IM model parameters: genealogical samplers, which aim to accurately compute the probability of a population sample under the assumption of no recombination within a given locus (e.g. IMa2; Hey and Nielsen 2007; Hey and Nielsen 2004), and methods which make use of the joint allele frequency spectrum (AFS) and assume free recombination between SNPs (e.g. ∂a∂i; Gutenkunst et al. 2009). While genealogical samplers yield maximum likelihood or Bayesian estimates of population parameters, they become somewhat unwieldy for use with genome-scale data, due to the assumption of no recombination. Thus with the enormous increase in population genomic data from both model and non-model systems, much recent effort has been devoted to AFS based approaches that rely upon composite likelihood estimation (Gutenkunst et al. 2009; Naduvilezhath et al. 2011; Lukić and Hey 2012; Excoffier et al. 2013).
Estimation methods based on the joint AFS between populations rely upon calculating the probability of an observed AFS given the vector of parameters that describe the population history. The method for calculation of this expected AFS is thus central, and varies between competing methods. For instance, Gutenkunst et al. (2009) took the approach of numerically solving a diffusion approximation to the population allele frequency spectrum, whereas more recent methods of demographic inference rely upon coalescent simulation to estimate the expected sampled AFS (Naduvilezhath et al. 2011; Excoffier et al. 2013). While both of these approaches have been shown to be reliable for demographic inferences under many parameterizations, both are approximate and may contain error to various degrees across parameter space.
Here, we introduce a method for exact calculation of the joint AFS under generalized two-population IM models. Our method uses a coalescent Markov chain approach that is defined on the state space of the AFS itself. Using this newly defined state space, in combination with the rich mathematical toolbox of Markov chains, we are able to compute the expected AFS of a given IM model for moderate sample sizes. We compare our coalescent Markov chain calculations of the AFS to diffusion approximations and that obtained via simulation. Further, using simulation we show how our approach can be used for accurate inference of demographic parameters. Lastly we apply our software package implementing the method, IM_CLAM, to population genomic data from Drosophila melanogaster.
MODEL
Here we present a strategy for exact calculation of the joint AFS under the IM model, and the subsequent inference of its associated parameters, that relies upon both discrete time and continuous time Markov chains (DTMC and CTMC respectively). In outline our approach involves first enumerating the complete state space associated with a given configuration of samples from two populations (i.e. sample sizes), followed by construction of a transition matrix to be used for a DTMC (or the analogous CTMC), and finally through the use of standard Markov chain techniques, the calculation of the implied joint AFS. For reasons that will become clear below, we begin by describing how one would calculate the exact joint AFS from a two population island model, before moving on to the full-blown IM model.
A Markov chain on the state space of the joint AFS
The first step in our approach requires the complete enumeration of the state space associated with our Markov chains given a sample configuration. The state space we describe is on the space of the allele frequency spectrum. That is to say that each state of our model implies a unique contribution to the joint AFS of the model in question. To track the allele frequency contribution implied by each state we will track the number of descendant leaf lineages in each population that each gene copy present is ancestral to. We will need to track this quantity independently for each population to deal with migration. To introduce our state space consider a sample that consists of one allele from population 1 and one allele from population 2, and let n1 and n2 be the sample sizes such that n1 = n2 = 1 (Figure 1). Although this is a trivially small case, it is adequate for accurately describing the form of the state space. Our initial state (i.e. the configuration at the time of sampling), call it A0, looks like the following
$$A_0 = \left\{ \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix}, \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} \right\}$$
where the left and right matrices represent the state in populations 1 and 2 respectively, and the entry at i, j represents the number of gene copies ancestral to i sampled alleles in population 1 and j sampled alleles in population 2. By convention these state space matrices are zero indexed, and there will never be a non-zero value at the position (0, 0) as the model does not track lineages that are not ancestral to the sample. The initial state A0 indicates that there is a single allele in population 1 that is ancestral to one of the sampled gene copies from population 1 and a single allele in population 2 that is ancestral to one of the sampled gene copies from population 2. Moving back in time in Figure 1 the first event is a migration event from population 1 to population 2.
Thus in state A1 the matrix representing population 2 now has two alleles, one of which is ancestral to a single allele in population 1 and the other of which is ancestral to a single allele in population 2. Further notice that the left hand matrix, representing population 1, is empty. Finally the two alleles coalesce to find the MRCA in population 2, as indicated in state A3.
To enumerate the complete state space associated with a given sample configuration (n1, n2), we use a recursive approach that considers all possible coalescent and migration moves among present gene copies to exhaustively find all possible states, including MRCA states that will represent the absorbing states of our Markov chain. Note that in this two population island model only two absorbing states are possible: the MRCA could be found in population 1 or it could be found in population 2. In the case of n1 = n2 = 1 as shown in Figure 1, there are a total of 6 possible states; however, the number of states grows extremely quickly with increasing sample size (see Appendix). For instance when n1 = n2 = 2 there are 46 possible states, and when n1 = n2 = 3 there are 268 states. Figure 2 shows how the state space grows with sample size, and while growth is sub-exponential it clearly explodes for larger samples.
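The enumeration described above can be sketched in a few lines of Python. This is an illustrative stand-in rather than the IM_CLAM code: it explores the state space breadth-first instead of recursively, and the lineage encoding `(pop, i, j)` (current population, number of descendant samples from population 1, number from population 2) and the function name `enumerate_states` are assumptions made for the example.

```python
from collections import deque
from itertools import combinations

def enumerate_states(n1, n2):
    """Return every state reachable from the sample (n1, n2) in a two-population island model.

    A state is a sorted tuple of lineages; a lineage is (pop, i, j) with pop in {1, 2}.
    """
    start = tuple(sorted([(1, 1, 0)] * n1 + [(2, 0, 1)] * n2))
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if len(state) == 1:
            continue  # a single lineage is the MRCA: absorbing, no further moves
        successors = set()
        # Migration: one lineage moves to the other population.
        for k, (pop, i, j) in enumerate(state):
            moved = state[:k] + ((3 - pop, i, j),) + state[k + 1:]
            successors.add(tuple(sorted(moved)))
        # Coalescence: two lineages in the same population merge.
        for a, b in combinations(range(len(state)), 2):
            (pa, ia, ja), (pb, ib, jb) = state[a], state[b]
            if pa != pb:
                continue
            rest = [lin for k, lin in enumerate(state) if k not in (a, b)]
            rest.append((pa, ia + ib, ja + jb))
            successors.add(tuple(sorted(rest)))
        for nxt in successors:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

if __name__ == "__main__":
    for n in (1, 2):
        states = enumerate_states(n, n)
        absorbing = [s for s in states if len(s) == 1]
        print(n, len(states), len(absorbing))
```

Under this encoding the counts for n1 = n2 = 1 and n1 = n2 = 2 agree with the 6 and 46 states quoted in the text, with two absorbing MRCA states in each case.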
Markov chain transition matrix
Having defined the state space we next consider the form of the transition matrix associated with the DTMC. Transitions between states in our coalescent Markov chain depend both on parameters of the model (e.g. population sizes, migration rates) and on the combinatoric probability involved in the chain move. For instance let ni be the number of active lineages in population i within a state of the chain, let x represent the multiplicity of a specific lineage at the current state, and y be the number of gene copies involved in the move to the next state. Further let Ni be the population size of population i, and the coalescent rate be Ci = ni(ni − 1)/(4Ni). Then the probability of a coalescent event in population 1 that moves the chain from state Ai at time t to Aj at time t + 1 would be where the product in the numerator of the first term is over each lineage involved in the move (maximum of two different terms). Here the first term represents the combinatorics involved in the move and the second is the parameter-dependent term corresponding to the type of move (either coalescent or migration).
In the case of a migration event from population i to j the terms of the transition matrix take the form where Mi is the migration rate from population i scaled by effective population size such that Mi = 4Nim where m is the fraction of the focal population made up of migrant individuals each generation.
Turning our attention back to the case of n1 = n2 = 1, whose complete state space is given in the appendix, the transition matrix associated with the DTMC, call it P, would be where now each matrix entry Pij is scaled such that each row sums to one, i.e. ∑j Pij = 1. The P matrix also implies an analogous CTMC transition matrix, call it Z, whose rows are constrained such that ∑j Zij = 0. With these transition matrices in hand we now turn attention to computing the AFS of the island (or IM) model.
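The relationship between the two matrices can be illustrated with a small sketch. It assumes that P is the jump chain obtained by dividing each off-diagonal rate by the total exit rate of its row, which is one natural reading of the row scaling described above. The rates in the usage example are arbitrary placeholders, not the combinatorial and parameter-dependent terms of the model, and the function name is hypothetical.

```python
def generator_and_jump_chain(rates, n_states):
    """Build a CTMC generator Z and its row-normalised jump chain P.

    rates[(i, j)] is the rate of moving from state i to state j (i != j).
    """
    Z = [[0.0] * n_states for _ in range(n_states)]
    for (i, j), rate in rates.items():
        Z[i][j] = rate
    P = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        total = sum(Z[i])          # diagonal is still zero here, so this is the exit rate
        Z[i][i] = -total           # each row of Z sums to zero
        if total == 0.0:
            P[i][i] = 1.0          # absorbing state: the chain stays put
        else:
            for j in range(n_states):
                if j != i:
                    P[i][j] = Z[i][j] / total   # each row of P sums to one
    return P, Z

if __name__ == "__main__":
    # Toy chain: states 0 and 1 are transient, state 2 is absorbing.
    P, Z = generator_and_jump_chain({(0, 1): 1.0, (0, 2): 3.0, (1, 0): 2.0}, 3)
    for row in P:
        print(row)
    for row in Z:
        print(row)
```

Keeping both matrices is useful because P gives the order in which states are visited while the diagonal of Z gives how long each visit lasts.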
Calculating the AFS
As noted above, each state implies an associated contribution to the allele frequency spectrum. Let F represent the joint AFS from a two population sample. F will be a matrix with n1 + 1 rows and n2 + 1 columns, where n1 and n2 are the sample sizes from populations 1 and 2 respectively. Entries of F, Fij, will be the number of SNPs sampled with i derived alleles in population 1 and j derived alleles in population 2. To map a given state Ai to its contribution to F, we need only ask how long the system stays in that state and then add that amount of time to each of the cells of F that correspond to non-zero entries in the left and right hand matrices of state Ai.
We can use the tools of Markov chains to then perform the two calculations needed to exactly calculate the AFS under a given model: 1) calculate the expected number of times each state is visited before absorption (i.e. reaching the MRCA), and 2) calculate the expected length of time the chain is in each state to compute the AFS. The latter calculation is simply the mean of the exponentially distributed wait time under the coalescent with migration, which itself is a function of the number of gene copies active in a given state, population sizes, and migration rates.
Calculating the expected number of visits to each state is more involved. We can rearrange our transition matrix P into what is called “canonical form”. We assume that P has r absorbing states and t transient states, such that
$$P = \begin{pmatrix} Q & R \\ 0 & I_r \end{pmatrix}$$
where Q is a t×t matrix, R is a t×r matrix, and Ir is the identity matrix of rank r (Kemeny and Snell 1976). Using this factorization we can next compute the fundamental matrix of our Markov chain, N, by using the relationship
$$N = (I_t - Q)^{-1}$$
where the entries Nij represent the expected number of visits to state j given the chain started at state i, and It is a rank t identity matrix. It is important to note that this calculation will thus require the inversion of a potentially very large matrix, complicating our implementation. For the calculation of the island model however, we are only interested in one row of N, as the starting state is known with certainty (i.e. the observed sample), so this is readily solved. Also, note that N gives us the expected number of visits to each state by the DTMC until absorption (i.e. the MRCA). For the island model this describes the complete stochastic process, as in that case we are dealing with a time-homogeneous process. For models with changes in population size or populations splitting we will need to consider different “phases” of the demographic history separately, as the transition rates through the system, or indeed even the state space of the system, will change moving back in time.
Returning for a moment to the island model then, having calculated N we are ready to compute the expected AFS. As we said before, the expected AFS will simply be the sum of the products of the number of visits to each state and the length of time spent in each state. For the island model in the case where n1 = n2 = 1 there will be 6 terms in the summation to find F, one for each state.
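As a worked illustration of this sum for n1 = n2 = 1, the sketch below computes the single needed row of N and converts visits into time. Several things here are assumptions made for the example: rates are passed directly as per-unit-time coalescence rates (`c1`, `c2`) and backward-in-time migration rates (`m12`, `m21`) rather than being derived from Ni and Mi, both migration rates are taken to be positive, only the four transient states are summed (the two MRCA states are treated as contributing nothing to polymorphism), and F is returned as expected branch length, which is proportional to the expected SNP count.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[k][n] / M[k][k] for k in range(n)]

def island_afs_two_samples(c1, c2, m12, m21):
    # Transient states: (location of the population-1 sample, location of the population-2 sample).
    states = [(1, 2), (2, 2), (1, 1), (2, 1)]
    index = {s: k for k, s in enumerate(states)}
    mig = {1: m12, 2: m21}     # rate at which a lineage leaves population 1 / 2
    coal = {1: c1, 2: c2}      # pairwise coalescence rate within population 1 / 2
    t = len(states)
    Q = [[0.0] * t for _ in range(t)]
    total_rate = []
    for (a, b), k in index.items():
        total = mig[a] + mig[b] + (coal[a] if a == b else 0.0)
        Q[k][index[(3 - a, b)]] = mig[a] / total   # first lineage migrates
        Q[k][index[(a, 3 - b)]] = mig[b] / total   # second lineage migrates
        total_rate.append(total)
    # One row of N = (I - Q)^-1 for the known start state:
    # x (I - Q) = e_start  is the same as  (I - Q)^T x = e_start.
    A = [[(1.0 if i == j else 0.0) - Q[j][i] for j in range(t)] for i in range(t)]
    visits = solve(A, [1.0, 0.0, 0.0, 0.0])
    # Expected time in a state = expected visits x mean holding time.
    time_in_state = [v / r for v, r in zip(visits, total_rate)]
    F = [[0.0, 0.0], [0.0, 0.0]]
    for tau in time_in_state:
        F[1][0] += tau   # lineage ancestral to the population-1 sample only
        F[0][1] += tau   # lineage ancestral to the population-2 sample only
    return F

if __name__ == "__main__":
    print(island_afs_two_samples(c1=1.0, c2=1.0, m12=0.5, m21=0.5))
```

In the symmetric case with coalescence rate c and per-lineage migration rate m, the two non-zero cells should equal 1/(2m) + 2/c, the expected coalescence time of two lineages sampled from different demes, which gives a convenient check.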
Isolation with Migration
To calculate the AFS for the IM model, we calculate the contributions to the AFS from two sources: that of the island model phase of the model prior to divergence (looking back in time), and the contribution to the AFS from the single, ancestral population (see Figure 3). The contribution to the AFS from the island model portion, call it FI, can be computed by first calculating the total AFS from the island model from time zero to absorption, Ftot, and then subtracting off the portion of the AFS contributed from the population divergence time, Tdiv, until absorption (e.g. Wakeley and Hey 1997). Let the vector π(t) be the probability of being in each state of our Markov chain at time t. We need to calculate π(Tdiv) both to find FI and to figure out where our system begins the single population phase of the IM model. We use a CTMC representation of our same transition matrix from the island model (denoted Z) to compute π(Tdiv) using the matrix exponential such that
$$\pi(T_{div}) = \pi(0)\, e^{Z T_{div}}$$
where π(0) places probability one on the initial (sampled) state.
With π(Tdiv) in hand, we then can use the fundamental matrix of the island model, N, to compute the number of visits to each state conditional on starting in each state at Tdiv with probability π(Tdiv) as Ng = π(Tdiv)N, where Ng is subscripted g in reference to the fact that these represent “ghost visits,” unseen in the actual IM model. FI then can simply be calculated as FI = Ftot − Fg, where Fg is the AFS implied by Ng.
Once we have the contribution to the AFS from the island phase, FI, there is only one portion remaining: the contribution to the AFS from the single population, ancestral phase, call it FA (Figure 3). To compute this we map the state space of the island model onto a reduced state space of a single population model, use that mapping to fold π(Tdiv) onto the reduced state space, and then compute a new DTMC transition matrix for the single population phase, changing population size as necessary and removing migration. With the new transition matrix we can compute the fundamental matrix for the ancestral phase, NA, and from that its contribution to the AFS, FA. Finally the AFS for the complete IM model, FIM, is equal to the sum of the AFS contributions from the two phases such that FIM = FI + FA.
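The two-phase calculation can be traced end to end for n1 = n2 = 1 with the sketch below. It is a toy under stated assumptions, not the IM_CLAM implementation: rates (`c1`, `c2`, ancestral `cA`, `m12`, `m21`) are supplied directly in per-unit-time form, both migration rates must be positive, only the transient block of Z is exponentiated, and a plain scaling-and-squaring Taylor series stands in for the Krylov method used in practice. For this sample size every transient island state folds onto the same ancestral state (two lineages in one population), so the fold reduces to summing π(Tdiv).

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[k][n] / M[k][k] for k in range(n)]

def mat_mul(A, B):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]

def mat_exp(A, squarings=12, terms=16):
    """exp(A) via scaling and squaring with a truncated Taylor series."""
    n = len(A)
    eye = [[float(i == j) for j in range(n)] for i in range(n)]
    S = [[x / 2 ** squarings for x in row] for row in A]
    result, term = eye, eye
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, S)]
        result = [[r + s for r, s in zip(rr, sr)] for rr, sr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def im_afs_two_samples(c1, c2, cA, m12, m21, t_div):
    states = [(1, 2), (2, 2), (1, 1), (2, 1)]      # locations of the two sampled lineages
    index = {s: k for k, s in enumerate(states)}
    mig, coal = {1: m12, 2: m21}, {1: c1, 2: c2}
    t = len(states)
    Z = [[0.0] * t for _ in range(t)]              # transient block of the CTMC generator
    for (a, b), k in index.items():
        Z[k][index[(3 - a, b)]] = mig[a]
        Z[k][index[(a, 3 - b)]] = mig[b]
        Z[k][k] = -(mig[a] + mig[b] + (coal[a] if a == b else 0.0))
    rate = [-Z[k][k] for k in range(t)]
    Q = [[0.0 if i == j else Z[i][j] / rate[i] for j in range(t)] for i in range(t)]
    I_minus_Q_T = [[float(i == j) - Q[j][i] for j in range(t)] for i in range(t)]

    def total_time(start):
        visits = solve(I_minus_Q_T, start)         # start vector times N
        return sum(v / r for v, r in zip(visits, rate))

    pi0 = [1.0, 0.0, 0.0, 0.0]
    expZ = mat_exp([[z * t_div for z in row] for row in Z])
    pi_div = [sum(pi0[i] * expZ[i][j] for i in range(t)) for j in range(t)]
    f_island = total_time(pi0) - total_time(pi_div)   # F_I = F_tot - F_g
    f_anc = sum(pi_div) / cA                          # F_A: fold, then wait 1/cA to coalesce
    f = f_island + f_anc
    return [[0.0, f], [f, 0.0]]

if __name__ == "__main__":
    print(im_afs_two_samples(1.0, 1.0, 2.0, 0.5, 0.5, t_div=0.25))
```

Two limiting cases give a quick check on the logic: with a divergence time of zero only the ancestral phase contributes (1/cA), and with a very old divergence time the result approaches the pure island model value.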
IMPLEMENTATION
Our strategy for computing the AFS from the IM model relies upon taking the inverse of two large, sparse matrices, corresponding to functions of the transition matrix from the DTMC, and exponentiating one matrix. Such calculations are extremely expensive computationally, so in our implementation of this method we have used parallel, scalable algorithms wherever possible. Our software package, IM_CLAM, performs these calculations with help from two open source packages, the CSPARSE library (Davis 2006) and the PETSc package (Balay et al. 1997; Balay et al. 2015a; Balay et al. 2015b). In particular we use PETSc to distribute all sparse matrix calculations across a parallel compute environment that uses MPI. For matrix inversion, we compute the inverse matrix row by row using a direct solver from CSPARSE and distribute those solves across cores. The matrix exponential is calculated using the Krylov subspace method as implemented in the SLEPc add-on to the PETSc package (Hernandez et al. 2005). IM_CLAM and its associated open source code are available for download from GitHub (https://github.com/kern-lab/im_clam).
APPLICATION TO DROSOPHILA MELANOGASTER DATA
We apply our method to recent whole genome sequencing projects from Drosophila melanogaster in which two population samples, one from North America (North Carolina) (Mackay et al. 2012) and a second from Africa (Zambia) (Lack et al. 2015), have been sequenced to good depth. We obtained aligned datasets from the Drosophila Genome Nexus resource (v1.0; Lack et al. 2015), and subsequently filtered from those alignments regions that showed strong identity-by-descent (IBD) and admixture using scripts provided with the alignments. This yielded sample sizes of n = 197 genomes from African lines and n = 205 genomes from N. American lines. The joint AFS was then constructed, using alignments to D. simulans and D. yakuba to determine the derived and ancestral allele at a given SNP. Tri-allelic positions were ignored. In an effort to sample the AFS from regions of the genome that should be less likely to be affected by linked selection, we only examined intergenic regions that were at least 5kb away from genes, and that did not contain simple repeats, repeat masked regions, annotated transcription factor binding sites, or annotated regulatory elements. This yielded 5530 regions of the genome with a total length of 4.43Mb. From this we constructed a joint AFS that we then downsampled to a smaller size (n = 6 African and n = 6 N. American alleles) to allow for calculation using IM_CLAM.
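The text does not state which downsampling procedure was used. One common choice is hypergeometric projection, in which each SNP's derived-allele count is redistributed over the counts it could show in a random subsample; the sketch below assumes that approach, and its function names are hypothetical.

```python
from math import comb

def projection_weights(n_from, n_to):
    """W[i][k] = P(k derived alleles among n_to drawn without replacement | i derived among n_from)."""
    denom = comb(n_from, n_to)
    return [[comb(i, k) * comb(n_from - i, n_to - k) / denom
             for k in range(n_to + 1)]
            for i in range(n_from + 1)]

def downsample_joint_afs(F, n1_to, n2_to):
    """Project a joint AFS of shape (n1 + 1, n2 + 1) down to (n1_to + 1, n2_to + 1)."""
    n1_from, n2_from = len(F) - 1, len(F[0]) - 1
    W1 = projection_weights(n1_from, n1_to)
    W2 = projection_weights(n2_from, n2_to)
    out = [[0.0] * (n2_to + 1) for _ in range(n1_to + 1)]
    for i, row in enumerate(F):
        for j, count in enumerate(row):
            if not count:
                continue
            for k in range(n1_to + 1):
                for l in range(n2_to + 1):
                    out[k][l] += count * W1[i][k] * W2[j][l]
    return out

if __name__ == "__main__":
    # Toy spectrum for n1 = n2 = 2: ten SNPs fixed-derived in population 1 only,
    # four SNPs at one copy in each population.
    F = [[0, 0, 0],
         [0, 4, 0],
         [10, 0, 0]]
    for row in downsample_joint_afs(F, 1, 1):
        print(row)
```

Projection preserves the total number of sites, but some SNPs become monomorphic in the subsample and land in the (0, 0) or (n1, n2) corner cells, which are typically excluded before fitting.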
RESULTS
Simulation
We first set out to compare the expected AFS calculated with IM_CLAM versus that calculated from coalescent simulations. As our calculations result in the exact AFS, we were interested in comparing the convergence of the simulated AFS to the true AFS as a function of the number of simulations. In Figure 4 we show the mean percentage error of the AFS computed from simulating a given number of independent genealogies, shorthanded on the axis label as number of SNPs. In Figure 4 the AFS was computed using n1 = n2 = 6, a symmetric migration of rate m12 = m21 = 1.0, and a divergence time of tdiv = 0.25. As the number of simulated genealogies increases the mean percentage error between the simulated AFS and that calculated by IM_CLAM drops quickly. However after 10⁶ simulations the amount of Monte Carlo error plateaus at approximately 0.3% and then decays very slowly even after 10⁹ simulations. Thus brute force simulation of the AFS seems ill advised for IM models, as it will be computationally quite expensive to converge to the correct distribution of allele frequencies, although approximately correct calculation could be done with considerably fewer simulations.
We next turned our attention to comparing our exact AFS to that computed by the popular software package ∂a∂i (Gutenkunst et al. 2009). ∂a∂i uses diffusion approximations to model the joint AFS among two populations and thus itself may be susceptible to a certain amount of error for given parameterizations. We compared our exact AFS to that generated from ∂a∂i under a range of migration rates, m = {0, 1, 5, 10}, having fixed population sizes to 1.0 and tdiv = 0.5. Figure 5 shows the element-wise percentage error of the ∂a∂i approximation in this comparison. ∂a∂i harbors an appreciable amount of error under these parameters, particularly at the corners of the matrix, which represent fixed differences among populations. Thus while ∂a∂i has been shown to be accurate for use in inference, we can see here that the expected AFS produced using the diffusion approximation still strays from the true value.
As a result of this discrepancy we set out to compare the accuracy of inference using IM_CLAM in comparison to ∂a∂i. Our goal here is not to perform an exhaustive comparison between methods, as IM_CLAM is much more limited in scope than ∂a∂i; rather, we wish to show that our method has utility for parameter inference as well. For this we generated 100 replicate simulated AFS draws using coalescent simulations in such a manner as to simulate a large number of independent SNPs. Again we set n1 = n2 = 6 and a symmetric migration of rate m12 = m21 = 1.0, but here we used a divergence time of tdiv = 0.1. We set a low per locus θ, θ = 0.001, and generated 10⁶ genealogies. This yielded approximately 3.54 × 10⁵ SNPs per simulated AFS sample. With these simulated datasets we then set out to infer the parameters of the IM model. Figure 6 shows boxplots of parameter estimates for both IM_CLAM and ∂a∂i. In general both methods are accurate for this parameterization; however, it can be seen that a minority of optimizations using ∂a∂i yielded outlier parameter estimates (note y-axis on Figure 6). It is worth considering that both methods use the BFGS (Broyden-Fletcher-Goldfarb-Shanno; Press 1985) algorithm for optimization, set with the same stopping criterion and bounds on the parameter space explored, thus failed optimization alone seems an unlikely explanation. Indeed similar behavior for ∂a∂i was observed in an earlier report (Naduvilezhath et al. 2011), although the range of parameters considered in that paper was quite wide.
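The objective being optimized in such an inference can be sketched as follows. The text does not spell out the likelihood, so this assumes the multinomial composite log-likelihood that is common for AFS-based methods: SNPs are treated as independent draws from the expected AFS normalised over polymorphic cells, and the constant multinomial coefficient is dropped. The function name and the choice to exclude the two monomorphic corner cells are assumptions of the example.

```python
from math import log

def composite_log_likelihood(observed, expected):
    """Multinomial composite log-likelihood of observed SNP counts given an expected joint AFS.

    Both arguments are (n1 + 1) x (n2 + 1) nested lists. Expected cells must be
    positive wherever the observed count is positive.
    """
    n1, n2 = len(observed) - 1, len(observed[0]) - 1
    monomorphic = {(0, 0), (n1, n2)}
    cells = [(i, j) for i in range(n1 + 1) for j in range(n2 + 1)
             if (i, j) not in monomorphic]
    total = sum(expected[i][j] for i, j in cells)
    return sum(observed[i][j] * log(expected[i][j] / total)
               for i, j in cells if observed[i][j] > 0)

if __name__ == "__main__":
    observed = [[0, 30, 5],
                [28, 12, 3],
                [6, 2, 0]]
    good_fit = [[0.0, 3.0, 0.5],
                [2.8, 1.2, 0.3],
                [0.6, 0.2, 0.0]]
    flat = [[0.0, 1.0, 1.0],
            [1.0, 1.0, 1.0],
            [1.0, 1.0, 0.0]]
    print(composite_log_likelihood(observed, good_fit))
    print(composite_log_likelihood(observed, flat))
```

An optimizer such as BFGS would then search over model parameters, recomputing the expected AFS at each step and maximizing this quantity; because only the shape of the expected AFS enters, an expected spectrum proportional to the data scores highest.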
Application to Drosophila melanogaster data
The demographic history of Drosophila melanogaster in many ways mirrors that of human populations. Drosophila melanogaster is commonly thought to have had its origins in sub-Saharan Africa, and to have spread out of Africa approximately 10,000-20,000 years ago (Lachaise et al. 1988; David and Capy 1988; Begun and Aquadro 1993; Li and Stephan 2006). D. melanogaster seems to have first migrated to Europe and Asia via the Middle East, presumably as a human commensal, and only much later did it arrive in North America (Lachaise et al. 1988). Here we model the demography of North American and African populations, thus the divergence time we will capture will be from the initial split between African and out-of-African lineages. Using a downsampled joint AFS from these population samples we estimated IM model parameters using IM_CLAM. Point estimates of the population sizes, migration rates, divergence time, and the optimized model likelihood are given in Table 1. These estimates are scaled in the number of individuals for population sizes and the number of years for divergence time by assuming a mutation rate per base per generation of u = 5.49 × 10⁻⁹ and 15 generations per year. Multiple runs of IM_CLAM on this dataset from different starting points yielded similar point estimates and likelihoods.
Our estimates show that while current day Zambian effective population size has grown to ∼3.2x larger than the ancestral population size, the North American population is quite a bit smaller, undoubtedly due to the strong and potentially repeated bottlenecks it has experienced in its history. Moreover our estimates indicate a good deal of continued gene flow between Africa and North America in both directions. Finally our estimate of the date of divergence between the lineage leading to North America and that leading to Africa was 16,980 years ago, well in line with previous estimates. Figure 7 shows the close correspondence between the inferred SFS and that predicted from our estimation.
DISCUSSION
Population genetic inference of demographic history has become an increasingly important goal for modern genomics, as the impact of demography on patterns of genetic variation is now appreciated to directly impair our ability to identify causative disease variation via linkage (e.g. Rogers 2014) as well as shape the genetic architecture of phenotypic variation within populations (Lohmueller 2014; Simons et al. 2014). Moreover, our understanding of human prehistory has been revolutionized in recent years through demographic inference using population genetic data (e.g. Botigué et al. 2013; Ralph and Coop 2013; Raghavan et al. 2015; Poznik et al. 2016). While that is so, methods that efficiently utilize whole genome information for inferring rich demographic histories, particularly multiple population histories, still lag behind the huge availability of data (Sousa and Hey 2013). Accordingly, much recent effort has focused on using the joint allele frequency spectrum of samples drawn from multiple populations as a way to summarize genome-wide data for demographic inference (Gutenkunst et al. 2009; Naduvilezhath et al. 2011; Lukić et al. 2011; Lukić and Hey 2012; Excoffier et al. 2013; Kamm et al. 2015).
In this study we present a novel method for calculating the exact joint site frequency spectrum expected from two population Isolation with Migration models. Our method relies upon a Markov chain representation of the coalescent, in which the state space of the chain is the joint AFS at a given point in time. Through the use of this state space, in conjunction with standard Markov chain techniques, we are able to numerically calculate our expected AFS. Our method stands in contrast to other popular techniques that either use diffusion approximations (Gutenkunst et al. 2009; Lukić et al. 2011) or direct Monte Carlo simulation (Excoffier et al. 2013) to estimate the expected AFS under a given parameterization. Indeed, as we have shown, estimation of the AFS via diffusion or Monte Carlo simulation can lead to persistent error and in some cases numerical instability (see Kamm et al. 2015). While we here use a Markov chain approach to calculate the exact AFS under generalized IM models, a recent, elegant paper by Kamm et al. (2015) presented analytic solutions and associated algorithms for computing the exact AFS for multiple population models with arbitrary population size histories but without continuous migration.
We have implemented our approach in a software package called IM_CLAM that allows for inference of generalized IM models using genome-wide joint AFS data by computing the exact AFS. As we have shown above with simulated data, IM_CLAM is quite accurate in its inference of population parameters. Application of IM_CLAM to population genomic data from Drosophila melanogaster sampled from North America and Africa recovers point estimates of population sizes and divergence time that are well in line with earlier estimates based on much smaller datasets (Li and Stephan 2006); we show that the North American populations are smaller than both current day African populations and the ancestral population from which both lineages are drawn, with African effective population size roughly 6.7x larger than North American effective population size. Moreover we find that the African population has experienced considerable growth since the divergence of the two populations, as it is now 3.2x larger than the ancestral size, while the North American population has yet to recover from the bottlenecking associated with its establishment. Indeed, larger sample sizes might show evidence of stronger population growth than what we have found here. Our estimated divergence time of 16,980 years ago is again consistent with earlier estimates based on small numbers of loci (Li and Stephan 2006; Duchen et al. 2013). Finally we estimate that there is considerable geneflow in both directions between African and North American populations, with the rate of migration to Africa being approximately 1.5x higher than in the other direction. Approximate Bayesian model selection by Duchen et al. (2013) supported a model of admixture between European and African populations in the founding of the North American population over symmetric migration models. While this is so, there is also considerable evidence for strong geneflow from North America to African populations (Pool et al. 2012).
Taken on balance, although we are not modeling an admixture event directly here, our migration rate estimates seem reasonable and probably reflect time averaged geneflow between these populations in accordance with their complex history.
While the ability to compute the exact AFS under generalized IM models using our Markov chain approach is an advance, there are many shortcomings to our methodology. Perhaps most challenging is the fact that the state space of our Markov chain grows nearly exponentially in sample size (Figure 2). This means that our approach is only computationally feasible for smaller sample sizes, as in the current state space the transition matrix associated with larger sample sizes will be too large to represent in memory, even when sparse matrix representations are used as we have done here. While this is so, the state space of the Markov chain could potentially be reduced in size by exploiting lumpability among states (cf. Andersen et al. 2014). Even at moderate sizes the computational costs of the matrix inversion and exponentiation needed by our method are still high, thus IM_CLAM needs tens or hundreds of CPUs for optimization runs to complete within hours rather than days. Currently our implementation is quite limited in that it only handles the two population IM case with constant population sizes, and from that model produces only point estimates. Estimation of confidence intervals via the Godambe information matrix, which has been utilized recently to provide appropriate interval estimation under composite likelihood (Coffman et al. 2016), should also be possible in this setting.
Despite the computational difficulties associated with the Markov chain approach described here, our method has opened a new avenue in calculating the likelihoods associated with AFS data and might be amenable to other population genetic problems. For instance, in the model presented above we consider the two dimensions of the state space matrices to represent different populations. It is simple to conceive of these dimensions as instead representing two separate loci, with recombination acting to make transitions among the numbers of alleles that are ancestral at one or both loci. In this way we have been able to write down a Markov chain that enables calculation of the two-locus allele frequency spectrum, which itself might be useful for estimation of demographic parameters and recombination rates.
ACKNOWLEDGMENTS
ADK and JH were supported by NIH R01GM078204. We thank Dan Schrider for suggesting the name IM_CLAM, and Yun Song for helpful conversations about this effort.
APPENDIX
The complete state space for a sample of configuration n1 = n2 = 1 is given below. The ordering of states shown is arbitrary but identical to the one used in the example Markov chain transition matrix in the Model section of the paper.