ABSTRACT
As a chromosome is sliced into pieces by recombination after entering an admixed population, ancestral tracks of chromosomes are shortened with the pasting of generations. The length distribution of ancestral tracks reflects information of recombination and thus can be used to infer the histories of admixed populations. Previous studies have shown that inference based on ancestral tracks is powerful in recovering the histories of admixed populations. However, population histories are always complex, and previous studies only deduced the length distribution of ancestral tracks under very simple admixture models. The deduction of length distribution of ancestral tracks under a more general model will greatly elevate the power in inferring population histories. Here we first deduced the length distribution of ancestral tracks under a general model in an admixed population, and proposed general principles in parameter estimation and model selection with the length distribution. Next, we focused on studying the length distribution of ancestral tracks and its applications under three typical admixture models, which were all special cases of our general model. Extensive simulations showed that the length distribution of ancestral tracks was well predicted by our theoretical models. We further developed a new method based on the length distribution of ancestral tracks and good performance was observed when it was applied in inferring population histories under the three typical models. Notably, our method was insensitive to demographic history, sample size and threshold to discard short tracks. Finally, we applied our method in African Americans and Mexicans from the HapMap dataset, and several South Asian populations from the Human Genome Diversity Project dataset. The results showed that the histories of African Americans and Mexicans matched the historical records well, and the population admixture history of South Asians was very complex and could be traced back to around 100 generations ago.
INTRODUCTION
Population admixture is a common phenomenon in human populations when previously isolated populations start to contact and interact with each other, accompanied by population migration, rising and falling of empires, trading of goods and services, and so on (Hellenthalet al. 2014). The history of population admixture does not fade with time, but leaves a great deal of information in the genomes of individuals from admixed populations. Population history in admixed populations thus can be recovered by utilizing the information in the genome, such as break points of recombination (Xuet al. 2008), admixture linkage disequilibrium (Ald) (Pattersonet al. 2012; Lohet al. 2013; Pickrellet al. 2014) and the length of ancestral tracks (ancestral tracks) (Pool and Nielsen 2009; Pugachet al. 2011; Gravel 2012; Jinet al. 2012; Hellenthalet al. 2014; Jinet al. 2014).
The information of ancestral tracks was first used by Pool and Nielsen (which they called migration tracts) to infer the history of populations, and they also deduced the distribution of ancestral tracks under a hybrid isolation (HI, or a one pulse admixture) model with a small migration rate (Pool and Nielsen 2009). A subsequent study of Pugach et al. performed wavelet transform on the ancestral tracks in admixed populations to obtain the dominant frequency of ancestral tracks and compared it to those obtained from extensive simulations, to estimate the admixture time, still under the simple HI model (Pugachet al. 2011). Then, the study of Jin et al. explored admixture dynamics by comparing the empirical and simulated distribution of ancestral tracks under three typical two-way admixture models; i.e. HI model, gradual admixture (GA) model, and continuous gene flow (CGF) model (Jinet al. 2012). Jin et al. later deduced the theoretical distributions of ancestral tracks under HI and GA models (Jinet al. 2014). Gravel extended these studies to multiple ancestral populations and discrete migrations. However, he failed to explicitly deduce the theoretical distribution of ancestral tracks under a general situation (Gravel 2012).
Here we proposed a general model that can cover all the scenarios of an admixed population with an arbitrary number of ancestral populations and (or) arbitrary number of admixture events. In this study, we first described the general admixture model and deduced a general formula for the theoretical distribution of ancestral tracks. With this distribution, we can use maximum likelihood estimation (MLE) to estimate model parameters, and the Akaike information criterion (AIC) (Akaike 1998) or the likelihood ratio test (LRT) (Wilks 1938) to select an optimal model from candidates for the given data. We next demonstrated that the three aforementioned admixture models, namely HI, GA and CGF models in previous studies (Jinet al. 2012), are all special cases under our general model. Then, under these three models, we developed a method called AdmixInfer to estimate the admixture proportion and admixture time, and simultaneously selected the optimal model according to the principles of AIC. We further conducted extensive simulation studies to demonstrate the accuracy of the theoretical distribution of ancestral tracks under the general model, and the effectiveness of our method to estimate the parameters and select an optimal model. Finally we applied our method to African Americans and Mexicans from the HapMap phase III dataset (International Hapmapet al. 2010), and several South Asian populations from the Human Genome Diversity Project (HGDP) (Liet al. 2008) dataset.
GENERAL MODEL
Length Distribution of Ancestral Tracks
In our general model, population admixture is accomplished by receiving gene flow(s) from ancestral populations either continuously or discontinuously. We model this process generation by generation, in which, if the admixed population does not receive further gene flow(s) in a particular generation, we set the strength of gene flow(s) to 0. Specifically, given an admixed population started T generations ago, with K ancestral populations, let mi(t) denotes the ancestry proportion from the ith ancestral population that entered at the t generation (Figure 1), then the general model is only determined by a K × T matrix M = {mi (t)}1≤i≤k, 1≤t≤T, which satisfies three properties: (a) mi(t) ≥ 0; (b) and (c), if 2 ≤ t ≤ T.
Let I(t) be the ancestry proportion of the admixed population at generation t inherited from the previous generation, and then we get I(t) as
Denote Hi(t) as the total ancestry proportion of the ith ancestral population in the admixed population at t generation, then
Recursively, we can get
Define si(t) as the survival proportion of the ancestral tracks at generation T from the ith ancestral population that entered at generation t, then
Generally, we assume that the ancestral populations are homogeneous, and recombination among segments from the same ancestry does change the length of the tracks, but it is not “observable” to us, thus the length of tracks seems unchanged, and only these recombination events among different ancestries produce “observable” changes. (Figure S1) Here we explicitly take these “unobservable” changes into consideration and adjust the recombination rate accordingly as following. Define the recombination among tracks from different ancestral populations as effective recombination, let, ui(t) be the effective recombination rate for tracks from the ith ancestral population that entered at generation t, then
If the end is ignored, a chromosome from the ith ancestral population that entered at generation t is expected to be split into ui(t) pieces per unit length (unit in Morgan). Then the contribution of ancestral tracks from ith ancestral population that entered at generation t to the admixed population is proportional to ui(t)si(t). Let Xi be the length of ancestral tracks of the ith ancestral population at generation T, and fi(x; M) be the distribution of Xi, then
Due to the limited accuracy in local ancestry inference, only those relatively long tracks are reliable (Pool and Nielsen 2009; Gravel 2012). Therefore, we are also interested in the conditional length distribution of ancestral tracks longer than a specific threshold C,
With the length distribution of ancestral tracks (Formula (5)), we can easily deduce the expectation and variance of Xi as and
Parameter Estimation and Model Selection
As the parameters of admixture events are fully determined by the matrix M, once M was accurately estimated, we could fully recover the history of the admixed population. MLE can be used to obtain the estimation of M. By utilizing the ancestral tracks inferred from the data, a likelihood function can be computed with the length distributions of ancestral tracks. The log-likelihood function of ancestral tracks from ith ancestral population Li(M) has the form where {xij}1≤j≤ni are the observed length of tracks from the ith ancestral population in the admixed population. Then the log-likelihood function of ancestral tracks of the admixed population L(M) is
Then the estimator of M is where M satisfies the properties in the above subsection.
However, with the increase in the number of parameters, it is complex and time-consuming to find the optimal solution and too many parameters could lead to over-fitting. In a real situation, we can propose several candidate models with prior knowledge in which the number of parameters is dramatically reduced, thus the problem is simplified to estimating parameters for each candidate model and selecting the most suitable one. If we have obtained the parameters of the candidate models, we can compare the models in a pair-wise fashion by using either AIC or LRT. Here, the two models are regarded to be ‘nested’ if one of the models constitutes a special case of the other (Lewis F et al. 2011). When the two competing models are nested, we use LRT to select the model; otherwise we use AIC.
THREE TYPICAL MODELS
The Distribution of Ancestral Tracks under a HI, GA, and CGF Model
In this subsection, we demonstrate that, with the length distribution of the general model, we can easily deduce the length distributions of ancestral tracks under three typical models aforementioned in previous studies (Jinet al. 2012). By restricting the number of ancestral populations to be two in the general model, if only one pulse of gene flow is allowed, the model turns into a HI model; if extra equal gene flows from both ancestral populations are allowed, the model turns into a GA model; if extra equal gene flows from only one of the ancestral populations are allowed, the model turns into a CGF model (Figure S2). Thus these three models are all special cases of our general model. There are only two parameters in each of the three models: the admixture proportion m and the admixture time T. Easily we can obtain the distribution of the ancestral tracks of these three models from Formula (5). The detailed calculation is in Supporting Information.
For simplicity, we define the ancestral population with the minor ancestry contribution as population 1, and the corresponding proportion is m. For the HI model (Figure S2 [A]), the ancestry proportions from population 1 and population 2 at generation t are
Then the length distribution of ancestral tracks from population 1 is
We can also get the expectation and variance of the length of the ancestral tracks from Formula (7) and Formula (8),
Substituting m with 1-m in Formulas (9), (10), and (11), we can obtain the length distribution, expectation, and variance of ancestral tracks from population 2, respectively. These two distributions are identical to the ones in previous studies (Gravel 2012; Jinet al. 2014).
For the GA model (Figure S2 [B]), the ancestry proportions from population 1 and population 2 at generation t are
Then the length distribution of ancestral tracks from population 1 is where . The expectation and variance of the ancestral tracks are
Substituting m with 1 − m in Formulas (12), (13), and (14), we can get the length distribution, expectation and variance of ancestral tracks from population 2, respectively.
For the CGF model (Figure S2 [C]), the ancestral population that contributes only one pulse of gene flow is treated as a gene flow recipient and the one that contributes continuously as gene flow donor. Here, we divide the CGF model into two sub-models. If population 1 is a gene flow recipient, we denote it as a CGFR model; otherwise we denote it as a CGFD model.
In the case of a CGFR model, the ancestry proportions from population 1 and population 2 at t generation are where α = 1–m1/T. Then the length distributions of ancestral tracks from the two ancestral populations are
The expectations and variances of the ancestral tracks are
In the case of a CGFD model, we just replace m for 1-m in Formula (15) and (16), and obtain the length distribution of ancestral tracks from population 2 and population 1, respectively.
Parameter Estimation and Model Selection under HI, GA, and CGF Models
As discussed above, there are only two parameters m and T in the HI, GA and CGF models. As for m, with the inferred ancestral tracks in an admixed population, we divide the total length of tracks from population 1 by the total length of tracks, and obtain an estimator of the admixture proportion. Interestingly, from the expectation of the ancestral tracks from two ancestral populations in the HI, GA and CGF models, we find that the expectation ratio between population 1 and population 2 relies only on m, thus we provide an alternative way to obtain the estimator ,
Generally, if there are only two ancestral populations, Formula (17) always holds whatever the admixture model is. The proof is in Supporting Information.
As for admixture time T, the estimation relies on the model assumed. Different models give different estimations of T so that we first need to assume a model. Here we regard the HI, GA, CGFR and CGFD models as the candidate models, and use MLE to estimate the admixture time T and AIC to select the optimal model as following: First, by utilizing the inferred ancestral tracks, we calculate the admixture proportion and determine population 1; secondly, with the estimator , maximizing the likelihood under model assumption HI, GA, CGFR, and CGFD, we obtain the maximum likelihoods Lmax (HI), Lmax (GA), Lmax (CGFR) and Lmax (CGFD) and corresponding optimal times . Because each pair of these models is not nested, thus here we use AIC to select the optimal model. The value AIC can be calculated by the formula where k is the number of parameters and Lmax is the maximized value of the likelihood function. The number of parameter of these models are the same, thus at the end of the comparison, we find that the problem is equivalent to finding the model with the highest likelihood. Thus, the model with the highest likelihood is chosen as the optimal model, and the corresponding parameters as the final results. These routines are implemented in our AdmixInfer. We also apply the bootstrapping technique in AdmixInfer and give a bootstrap estimation and confidence interval (CI) of the admixture time.
MATERIALS AND METHODS
General Settings of Simulation Studies
Simulation studies were performed to evaluate the correctness of the length distribution of ancestral tracks under the general model, and the performance of AdmixInfer under three typical models. The following settings remained the same under all situations simulated if with no further modification: The population size of the admixed population was simply set to 5,000 and remained constant in our simulations and the length of simulated chromosome was 3.0 Morgan, which approximated the length of chromosome 1 of the human genome. We simulated one chromosome each time and a pair of chromosomes represented an “individual.” At the end of simulation, 400 “individuals” (genome length of an “individual” approximated 1/10 of the length of a human individual) were sampled and the ancestral tracks were directly recorded.
Evaluate the Theoretical Distribution under the General Model
To test the accuracy of the length distribution of ancestral tracks under the general model, we simulated several general representative cases with three or four ancestral populations and one or more waves of admixture. Simulation 1A was to simulate one pulse admixture with three ancestral populations: admixture started 50 generations ago with proportions 10%, 30% and 60%, respectively. Simulation 1B was to simulate discrete multiple-waves admixture with three ancestral populations: the admixture started 50 generations ago with initial proportions 10%, 40% and 50%, respectively, and population 1 contributed extra 10% ancestries each 10 generations later. Simulation 1C was to simulate continuous multiple-waves admixture with three ancestral populations: the admixture started 50 generations ago with initial proportions 10%, 30% and 60%, respectively; population 1 contributed an extra 0.2% ancestries for each generation afterwards. And simulation 1D was to simulate multiple-waves admixture with four ancestral populations that arrived at different times: populations 1 and 2 firstly admixed 50 generations ago with proportions 40% and 60%, population 3 entered 40 generations ago with proportion 20%, and population 4 entered 30 generations ago with proportion 10%. The simulations were repeated 5 times, and the ancestral tracks in the admixed population were recorded and the length distribution was compared to the theoretical distribution. Detailed parameters for the simulations were provided in Table S1.
Evaluate the Performance of AdmixInfer
Then we focused on evaluating the performance of AdmixInfer under three typical models; i.e. HI, GA and CGF. The proportions of admixture varied from 10% to 50% in steps of 10% for the symmetric admixture models (HI and GA) and varied from 10% to 90% in steps of 10% for the asymmetric admixture model (CGF). We set the admixture time as 5, 10, 20, 30,…, 200 generations. The ancestral tracks in admixed populations were also recorded as previous simulations. Each simulation here was repeated 10 times and, in total, 4,200 simulations were carried out under HI, GA and CGF models. AdmixInfer was applied to the simulated data with the default settings; the results were recorded and summarized.
In real situations, we could only accurately infer the ancestral tracks longer than a specific threshold due to methods’ limitations in local ancestry inferences. To make our method more feasible to real cases, we also evaluated the robustness of our method under different thresholds ranging from 0 centi-Morgan (cM) to 2 cM in step of 0.1 cM, with the dataset simulated in previous evaluations.
We also evaluated the performance of AdmixInfer with different sample sizes. We simulated populations starting with the admixture of 50 and 100 generations ago under HI, GA, and CGF models, with admixture proportions 30%:70%. At the end of the simulation, 10, 20, 50, 100, 200, and 500 “individuals” were sampled, corresponding to 1, 2, 5, 10, 20, and 50 human samples. AdmixInfer was applied to the simulated dataset without discarding short tracks.
Finally, we tested the performance of our method with data simulated by real populations and inferred ancestral tracks. Simulations were carried out with real populations YRI and CEU as parental populations under different models (30% YRI ancestry and 70% CEU ancestry) with admixture time 10, 20, 50 and 100 generations. Here we simulated with the data of chromosome 1 and sampled 25 “individuals” at the end of the simulation, and each simulation was repeated 10 times. Then the local ancestry of the simulated populations was inferred by HAPMIX (Priceet al. 2009). With the derived ancestral tracks, AdmixInfer was used to select the optimal model and estimate generations accordingly with the tracks longer than 1 cM.
Apply to Real Datasets
We applied our method to some real datasets. First, the histories of African Americans and Mexicans are relatively clear, thus they can be used to test the performance of our method. The data of African Americans, Mexicans and reference populations CEU and YRI were obtained from HapMap project phase III (International Hapmapet al. 2010), and the reference populations that represented American Indian ancestry were obtained from HGDP dataset. Then we also applied our method to several HGDP populations from South Asia (Liet al. 2008), which showed evidence of population admixture from previous studies (Pattersonet al. 2012; Hellenthalet al. 2014). Haplotype phasing was performed by Shapeit 2 (Delaneauet al. 2012). Local ancestry was inferred by Hapmix (Priceet al. 2009). According to the prior knowledge, the generations settings in Hapmix inference were 10, 20 and 50 for African Americans, Mexicans and South Asian populations, respectively (Hellenthalet al. 2014). AdmixInfer was used to select the optimal model and admixture time accordingly with the tracks longer than 1 cM. We also performed bootstrapping 100 times to obtain confidence of model selection, and calculated the 95% confidence intervals of the generations inferred.
RESULTS
Theoretical and Simulated Distributions of Ancestral Tracks Match Well
With the length distributions of ancestral tracks under the general model, we could easily sketch the curves of theoretical length distributions of ancestral tracks under a given model (Figure 2, solid line). By putting the theoretical and simulated length distributions of ancestral tracks together, we clearly observed that the theoretical and simulated distributions of ancestral tracks matched well, for all the situations simulated and all the replicates (Figure 2). It showed that the theoretical length distribution of ancestral tracks for the general model, which we deduced, was reasonable and accurate.
AdmixInfer Performs well in Parameters Estimation and Model Selection
With the extensively simulated data, we could systematically evaluate the performance of our method in parameters estimation and model selection. For the simulated admixed populations under the HI model or CGFD model, our method could always distinguish the right model in all our simulations; for the CGFR model, our method could distinguish the right model with accuracy of 97.0%; and for GA model, our method could distinguish the right model with accuracy of 93.0% (Table 1). Moreover, the specificity of our method was over 97% for all the situations simulated. The sensitivity of the GA and CGFR models and the specificity of the HI and CGFD models under different admixture proportions and different admixture times were shown in Figure S3. We found that the simulations in which our method could not distinguish the right ones were mostly observed in these simulations with very recent admixture times and small admixture proportions (Figure S3 and Table S2-S6). We also found that almost all CGFR models were only wrongly distinguished as HI models, and GA models as CGFD models (Table S6). This is also reasonable, because the CGFR model is close to the HI model, so that CGFR model is more likely to be distinguished as a HI model. The same reason applies for that the GA model being wrongly distinguished as the CGFD model.
Note that there were only two parameters m and T for the three typical models. Our method also performed well in estimating parameters m and T for the three typical models. Regarding the admixture proportion m, estimations were very close to the pre-settings in simulations, and the small deviation was due to random drift in simulation with finite population sizes and only a subset of individuals being sampled at the end of the simulation. Time estimations for wrongly distinguished models were meaningless, and thus should be discarded. Results showed that our method can estimate admixture times with high accuracy (Figure 3, Figure S4-S7, and Table S2-S5). For HI and CGFD models, results showed high consistency with the time simulated, while a slight underestimation occurred for GA and CGFR models. For all these models that were simulated, the time estimated for a small proportion was less accurate than that for a larger proportion. We defined the relative errors between true admixture times and inferred admixture times as where T is the true admixture time and is the estimation of the admixture time. Under the situation of a certain admixture proportion and a certain model, we defined the average relative error on different values of admixture time as
We found that when the admixture proportion was 0.1, the relative errors of CGFR and CGFD were 6.43% and 5.89%, respectively. For the other cases, the relative errors were all less than 4% (Table 2). In conclusion, no matter the model selection or parameters estimation, our method performed well.
Robustness for Different Thresholds
To test the robustness of our method for different thresholds, we tested our method under different thresholds varying from 0 cM to 2 cM in steps of 0.1 cM. The results showed that our method was robust to thresholds, except the GA model with a larger time (Figure 4). When a larger threshold is taken, less information is kept for ancient admixture events. Although keeping all the information to estimate admixture times is better, we must balance the trade-off between information and accuracy, because the accuracy of local ancestry inference is not so good for short ancestral tracks due to method limitations. Take HI for an example, the probability p of ancestral tracks larger than a specific threshold C is Therefore, with an increase in threshold C, less information is kept. Here, we provided a reference table of the probability p under different admixture times and proportions (Table S7). For example, when T = 200, m = 0.1, if we set the threshold C as 2cM, the probability that the tracks exceed C is only 2.7%.
Small Sample Sizes also Give Good Estimations
To test the performance of AdmixInfer with different sample sizes, we evaluated the models with 10, 20, 50, 100, 200, and 500 “individuals” (corresponding to 1, 2, 5, 10, 20, and 50 human samples). Results showed that AdmixInfer was insensitive to sample sizes. Even with only one human sample, it could distinguish the right model and estimate the admixture time with high accuracy (Figure 5). However, considering the accuracy of local ancestry, short tracks were usually discarded. The information kept for extremely small sample sizes might not be sufficient to give a clear picture of the history of a population. Therefore, relatively larger sample sizes were recommended.
Error Analysis
When we use our method to infer the history of a real admixed population, there are two kinds of errors that may influence the accuracy of inference. The first kind of error is caused by the assumptions in deducing the length distribution of ancestral tracks. In the derivation, for simplicity, we ignored the end of the chromosome and the drift. For this kind of error, we have used the simulation data to demonstrate that the accuracy of the inference was neglectable (Figure 3 and Table 1). The second kind of error is caused by the local ancestry inference. In our study, local ancestral tracks are inferred by HAPMIX software. Here we used the simulation data with ancestry populations YRI and CEU to analyze the influence of this kind of error. For all the cases we simulated, only the HI model with an admixture time 100 generations was wrongly distinguished as a GA model (Table S8). We also found that for the case of large admixture times, the error of local ancestry inference will cause underestimation of admixture time. When the admixture time is large, the ancestral tracks will be short. However, the method of inferring ancestral tracks cannot effectively determine short tracks. Thus, it will influence the accuracy of our method in inferring admixture times and model selections.
Real Data Analysis
Our method for parameters estimation and model selection under three typical models implemented in AdmixInfer was applied to a real dataset. First, African American was inferred as a GA model and the admixture time was 12 generations ago (Table 3). When 29 years per generation was assumed according to previous investigation (Fenner 2005), it was about 350 years ago, which was consistent with the recorded history that most African ancestors arrived in America (via slave trading) after the seventeenth century. The slave trade continued until the nineteenth century and after that, many African people settled down in America. Gene flows from Africa and Europe would have continuously contributed to the African American gene pool and thus the GA model matched the recorded history well. Similarly, the model for Mexicans was inferred as a GA model and the admixture time was 18 generations ago (522 years before present), which was also consistent with the time of the exploration of the new world and the arrival of Europeans. The GA model indicates continuous contact and admixture of Europeans and American Indians.
Finally, we studied the admixture histories of several HGDP populations from South Asia. Previous studies have shown that the populations from South Asia have admixed ancestries mainly from Europe and East Asia (Liet al. 2008; Hellenthalet al. 2014). Regarding the admixture proportions, our estimations based on AdmixInfer (Table 3) were consistent with previous estimations from the three-population test (Pattersonet al. 2012). Regarding the admixture model and time, the populations Balochi, Brahui, Burusho, Kalash, Pathan and Sindhi from Pakistan in South Asia were all inferred to as CGF (Sardinian as donor), which indicated extra gene flows from European ancestry after initial admixture. The initial admixture times estimated ranged from 107 to 96 generations ago. When 29 years per generation was assumed according to previous investigations (Fenner 2005), these estimations (1103BC-784BC) coincided with the migration of Indo-Aryan speaking people into the Indian subcontinent. Extra gene flows from European ancestry might be contributed to by the rising of empires during the following centuries. In the case of the populations of Hazara and Uygur, they not only showed very similar admixture proportions, but also showed the same admixture model and very close admixture times; i.e., 67 and 70 generations ago, respectively. The Hazara population mainly settled in Afghanistan and Pakistan, while the Uygur population mainly settled in West China, and both populations are connected along the Silk Road. It was feasible to receive continuous gene flows from both European and East Asian ancestries. These similarities also indicated a possible close relationship or shared histories between these two populations.
In summary, our method showed good performance in inferring the admixture history of African Americans and Mexicans. The admixture scenarios in South Asian populations were more complex than expected. However, with our method, the analysis could shed light on the mysterious histories of these populations.
DISCUSSION
In this work, we proposed a general model to describe the admixture history with multiple ancestral source populations and multiple-wave admixtures. We showed the length distribution of ancestral tracks and some of its useful properties under this general model. We thus provided a theoretical framework to study population admixture history. With the general framework, we focused on studying three special cases of the admixture models (HI, GA, and CGF) and developed a method to estimate the admixture proportion, admixture time and determine the optimal model simultaneously. Our simulations showed that the theoretical distribution of ancestral tracks was consistent with our theoretical prediction, and our method was precise and efficient in inferring population history under three typical models.
In the efforts of model selection, we found that the simulations in which our method was not able to determine the correct model, were mostly those cases with recent admixture times and minor admixture proportions. The possible reason for incorrect determination was that we ignored the chromosome ends in deducing the theoretical length distribution. When the admixture proportion and times were small, the chromosomes without “observable” recombination were over-represented in the ancestral tracks. (Figure S8). Our further simulations showed that when the chromosome length increased, the accuracy of our method was enhanced. Furthermore, we note that the length distributions of ancestral tracks have no relationship with the population size. Thus, the change of population size does not affect the time estimation. Simulations under different demographic models also supported it (Figure S9).
The efficiency of our method could also be influenced by the validity of the local ancestry inference. To improve the reliability of the inference, we suggest using the ancestral tracks longer than a certain threshold C. However, when the threshold became large, some ancient admixture information disappeared rapidly. In principle, if short ancestral tracks could be precisely detected, our method is promising in recovering even more ancient admixture history, such as the admixture between modern human and Neanderthals (PrÜferet al. 2014; Sankararamanet al. 2014).
Though we proposed a general framework and relevant principles to infer the population history under the general model, finding optimal estimation for parameters is a challenging work with high dimensionality. Currently, our method implemented in AdmixInfer is focusing on the three typical models. For the real admixed populations, the admixture history is always complex, such as discrete multiple-waves admixture. Under such circumstances, the length distribution of ancestral tracks under the general model is still broadly useful and applicable. Therefore, based on this framework, to infer more complicated admixture history is a problem to be solved in the future.
ACKNOWLEDGEMENTS
This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (CAS) (XDB13040100), by National Natural Science Foundation of China (NSFC) grants (91331204 and 31171218), 973 Project (2011CB808000), the Fundamental Research Funds for the Central Universities (2011JBZ019), the National Excellent Doctoral Dissertation Foundation of PR China (FANEDD 201312), National Center for Mathematics and Interdisciplinary Sciences of CAS, and the Key Laboratory of Random Complex Structures and Data Science, CAS (2008DP173182). S.X. also gratefully acknowledges the support of the National Program for Top-notch Young Innovative Talents of The “Wanren Jihua” Project.