AN ACCURATE GENETIC CLOCK ========================= * D. H. Hamilton ## Abstract Molecular clocks give *“Time to most recent common ancestor” TMRCA* of genetic trees. By Watson-Galton17 most lineages terminate, with a few overrepresented *singular lineages* generated by W. Hamilton’s “kin selection”13. Applying current methods to this non-uniform branching produces greatly exaggerated *TMRCA.* We introduce an inhomogenous stochastic process which detects *singular lineages* by asymmetries, whose *reduction* gives true *TMRCA.* This implies a new method for computing mutation rates. Despite low rates similar to mitosis data, *reduction* implies younger *TMRCA,* with smaller errors. We establish accuracy by a comparison across a wide range of time, indeed this is only clock giving consistent results for both short and long term times. In particular we show that the dominant European y-haplotypes R1a1a & R1b1a2, expand from c3700BC, not reaching Anatolia before c3300BC. While this contradicts current clocks which date R1b1a2 to either the Neolithic Near East4 or Paleo-Europe20, our dates support recent genetic analysis of ancient skeletons by Reich23. The genetic clock, computing *TMRCA* by measuring genetic mutations, was conceived by Emile Zuckerkandl and Linus Pauling 32,33 on empirical grounds. However work on neutral mutations by Motto Kimura16 gave a theoretical basis and formula. While our theory applies to general molecular evolution, we focus on the Y-chromosome with DYS regions (**D**NA **Y**-chromosome **S**egments) counting the “short tandem repeat” (STR) number of nucleotides of a micro satellite. In fact one uses many DYS sites, marked by *j* = 1, *…N*, each individual *i,* 1 = 1, *‥n,* has STR number *x**i,j*. The Y-chromosome is passed unchanged from father to son, except for mutations *x**i*, *j* → *x**i*, *j* *±* 1 occurring at rate ![Formula][1] The fundamental assumption is that the sample population has a single patriarch at time *t* = *TMRCA*(generations). Now suppose the present (sample) population has mode *m**j* at DYS *j*. This is taken to be the STR value of the original patriarch. A calculation shows the present population with variance ![Graphic][2]. Then averaging over the markers gives *TMRCA* = Σ*j* *V**j* / (*n* Σ*j* *μ**j*). This variance method and its variations we call KAPZ after its originators. In practise problems soon arose. Mutation rates could be computed from mitosis, but sample sizes are too small to give great accuracy. Using these a KAPZ due to Zhivotovsky31,31 was applied to R1b1a2 by Myres20 giving L23*(Turkey) giving 9000BC, *σ* = 2000. Mutation rates could also be estimated from large family groups with genealogy data. However there are significant discrepancies in rates between different family groups. Also these “pedigree” rates are much larger than those from mitosis. A similar phenomena for the mitochondrial clock suggested high short term rates and lower long term rates14,15. So very low long term rates of .00069 were suggested31 for the Y-clock. We show this is unnecessary. Another problem is that KAPZ is for large populations whereas ancient populations were small and modern samples can be tiny, e.g. *n <* 20. This led to the introduction of Bayesian methods such as BATWING27, which considers all possible genealogical trees giving the present sample data, then searches for the tree of maximum likehood. But the BATWING *TMRCA* is often greater than KAPZ, e.g. for the Cinnioglu8 study of Anatolian DNA both methods were applied to the same data and mutation rates. For R1b1a2 the KAPZ had *T M RCA* = 9800*BC* compared with 18, 000*BC* for BATWING. Balaresque4 used BATWING to give an origin for R1b1a2 in Neolithic Anatolia c. 6000BC, but their statistics was disputed by Bushby29. All of this was contradicted by Reich22 who found R1b1a2 in skeletons c 3300BC from Yamnaya cemeteries. ## Singular Lineages A fundamental problem is that present populations have highly overrepresented branches we call *singular lineages*. A well known example is the SNP L21 which is a branch of R1b1a2. Individuals identified as L21 are often excluded from R1b1a2 analysis because they skew the results. Such a singular lineage causes the variance to be much greater, even though the original *TMRCA* remains unchanged, see figure 1. For Bayesian methods such lineages are very unlikely giving an even greater apparent *TMRCA*. However one cannot deal with singular branches by excluding them. For one thing, our method will show that 50% of DYS show evidence of singular side branches, i.e. more than a SD from expected. Excluding them would also remove some of the oldest branches and produce a *TMRCA* which is too young. Now these singular lineages are very (mathematically) unlikely to arise from the stochastic system which is the mathematical basis of KAPZ (or the equivalent Monte-Carlo process modeling BATWING). We believe that the standard stochastic process is perturbed by other improbable events, which are then amplified by biological processes. ![Figure 1:](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2015/05/18/019414/F1.medium.gif) [Figure 1:](http://biorxiv.org/content/early/2015/05/18/019414/F1) Figure 1: A singular lineage increases variance and apparent TMRCA: First, the Watson-Galton Process17 implies lineages almost certainly die out. Conversely, the “kin selection” of W.D. Hamilton13, shows kin co-operation gives genetic advantages. Consider three examples with well developed DNA projects. Group A of the Hamiltons has approximately 100, 000 descended from a Walter Fitzgilbert c 1300AD. Group A of the Macdonalds has about 700, 000 descendants from Somerfeld c1100AD, and Group A of the O’Niall has over 6 million descendants from Niall of the Seven Hostages, c300AD. These are elite groups with all the social advantages. One sees lines of chieftains, often polygamous. Our model has many extinct twigs with a few successful branches, whereas current models assume a uniform “star radiation”, see below ![Figure2](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2015/05/18/019414/F2.medium.gif) [Figure2](http://biorxiv.org/content/early/2015/05/18/019414/F2) ## Reduction of Singular Lineages Modelling singular lineages requires a new stochastic system where instead of a single patriarch we imagine many “virtual patriarchs”, each originating at tme *t**k* ago. Each of these giving a proportion 0 ≤*ρ**k* ≤1 of the present population. So we now have an inhomogenous expansion. Furthermore the symmetric model for mutations has to be changed to ![Formula][3] We introduce asymmetric mutations and show how to compute it. Asymmetry will play a very important role in detecting singular lineages. This inhomogenous asymmetric system is mathematically equivalent to a mixed population. Computing its solution is an “inverse problem”. Unfortunately inversion is un-stable for such systems, also there is no unique solution. However it turns out that, up to a standard deviation SD, most DYS markers show at most one singular branch which is found from asymmetries in the distribution. These singular branches are then *reduced* revealing the original lineage. We then compute a branching time *t**j* for each marker *j*. The effect of reduction is dramatic, see Figure 3. Now the nonuniform branching process causes the *t**j* to be ran-domly distributed so their mean is not the TMRCA. Large errors in mutation rates means one cannot simply take the max *t**j* to be the *TMRCA*. Instead stochastic simulations of the branching process, using robust statistics to avoid outliers, find the most likely *TMRCA*, see Supplementary Material 1 (SM1) for full mathematical details. These methods also imply a new way of computing mutation rates, see SM2. Previously, there were methods based on mitosis data or pedigree studies of family DNA projects (which gave quite different rates). We begin with 8 very large SNP projects from FTDNA using 37 markers, of course with unknown *TMRCA* and find mutation rates as the fixed points of a stochastic process. These take about 3 iterates to converge. After we discard markers with mutation *SD >* 33% we are left with 29 markers. We find the mutation rates are close to those obtained from mitosis and nearly 1/3 the values obtained by pedigree. Despite the fact that our mutation rates are lower than most studies, *reduction of singular lineages* produces more recent *TMRCA* than current models. ## Examples Our clock is the only one with across the board consistent results: View this table: [Table 1:](http://biorxiv.org/content/early/2015/05/18/019414/T1) Table 1: *TMRCA* for Medieval groups. Archeological finds convinced Marija Gimbutas11 to attribute Proto Indo-European (PIE) to the Yamnaya Culture c 3500BC of the Russian Steppes, see Anthony2. This is consistent with mainstream linguistic theory, some even wrote of linguistic DNA. But actual genetics was ignored because this contradicts current genetic clocks. Now the dominant European y-haplotypes are R1b1a2 & R1a1a (which like other y-haplotypes is marked by a unique single nucleotide polymorphism (SNP) mutation). Table 2 shows the expansion times of c3700BC, similar for regions Russia, Poland, Germany and Scandinavia. The times are so close only Scandinavia is significantly later. This data is from FTDNA projects for region X only using individuals with named ancestor from These independent results agree within the standard deviation (SD), with dates matching the Corded Ware Culture, a semi-nomadic people with wagons and horses who expanded west from the Urkraine *c*3000*BC*. This is consistent with the oldest R1b1a2, R1a1a skeletons being from the Yamnaya Culture23. View this table: [Table 2:](http://biorxiv.org/content/early/2015/05/18/019414/T2) Table 2: R1b1a2, R1a1a independent comparison An interesting intermediate step occurs between the medieval and eneolithic. The mythical Irish Chronicles relate that the O’Niall descend directly from the first Gaelic High Kings, which tradition dated c1300-1600BC. The O’Niall have the unique mutation M222 which is a branch of the haplotype L21. For L21, *n* = 1029, we compute *TMRCA* = 1600*BC* and SD *σ* = 320. These are dates for proto Celtic, i.e. what archeologists call the pre Urnfelder Cultures, c. 1300-1600BC, see SM5. Furthermore L21 is in turn a branch of haplotype P312 which we date to 2300BC. This date suggests the Bell Beaker Culture of Western Europe. Indeed the only known23 Bell Beaker genome is P312 with 14*C* date 2300BC. Our method requires large data sets and many markers which means we have to rely on data from FTDNA, finding 29 useable markers out of standard 37 they use. In fact many researchers4 have used FTDNA data. We think our method of reduction with robust statistics solves any problems with this data. To test this we compared our results with R1a1a1 data obtained from Underhill26 with *n* = 974(which involved excluding his four M420 individuals and others with missing markers), and 15 useable markers. The result was 2550*BC, σ* = 400, within the CI of our R1a1a results. Table 5 shows the results of extensive simulations using random subsets of our FTDNA data, for 29, 15 and 7 markers. For the same 15 markers as the Underhill26 the different FTDNA data gives very similar 3300*BC, σ* = 840 for R1a1a, verifying the correctness of using FTDNA data. However once you get down to 7 markers the confidence interval becomes large, e.g. R1a1a gives 3400*BC, σ* = 1500. Also it becomes difficult to deal with outliers. An example with few markers is R1b1a2 data of Balaresque4. Our method (this time with 7 useable markers) gave SD *>* 30%, see Table 6. Now Balaresque4 used the Bayesian method BATWING29 to suggest a Neolithic origin in Ana-tolia. With the same Cinnioglu8 data our method gives for Turkish R1b1a2 (*n* = 75) a *TMRCA* = 5300*BC, σ* = 3100, i.e. anytime from the Ice Age to the Iron Age. Fortunately, once again, we find good data from FTDNA: the Armenian DNA project, see Table 3. By tradition the Armenians entered Anatolia from the Balkans *c*1000*BC* so they might not seem a good example of ancient Anatolian DNA. But some 100 generations of genetic diffusion has resulted in an Armenian distribution of Haplotypes J, G, R1b1a2 closely matching that of all Anatolians, therefore representive of typical Anatolian DNA. We see that Anatolian R1b1a2 arrived after c3300BC, ruling out the Neolithic expansion c6000BC. When dealing with regional haplotypes, e.g. R1b1a2 in Anatolia, the *TMRCA* is only a upper bound for the arrival times, for the genetic spread may be carried by movements of whole peoples from some other region. View this table: [Table3](http://biorxiv.org/content/early/2015/05/18/019414/T3) Observe that our *TMRCA* for Armenian G2a2b (formerly G2a3) and J2 show them to be the first Neolithic farmers from Anatolia, i.e. older than 7000*BC*. In Table 4 we compared J2, G2a2b for all of Western Europe (non-Armenian data). Our dates show J2 was expanding at the end of the Ice Age. Modern J2 is still concentrated in the fertile crescent, but also in disconnected regions across the Mediterranean. The old genetic model predicted a continuous wave of Neolithic farmers settling Europe. But you cannot have a continuous maritime settlement: it must be *leap-frog*. Also repeated resettlement from the Eastern Mediterranean has mixed ancient J2 populations, and our method gives the oldest date. On the other hand G2a2b shows exactly the dates expected from a continuous wave of Neolithic farmers across Central Europe, consistent with Neolithic skeletons showing G2a2b (e.g. the famous Iceman). View this table: [Table4](http://biorxiv.org/content/early/2015/05/18/019414/T4) ## Discussion History, archeology, evolutionary biology, not to mention epidemics (e.g. dating HIV), forensic criminology and genealogy are just some of the applications of molecular clocks. Unfortunately current clocks have been found to give only “ballpark” estimates. Our method is the only one giving accurate time, at least for the human y-chromosome verified over the period 500 - 15, 000*ybp*. Our methods should also give accurate times for mitochondrial and other clocks. Many geneticists thought natural selection makes mutation rates too variable to be useful. The problem is confusion between the actual biochemical process giving mutations and superimposed processes like kin selection producing apparently greater rates. Notice that the SD for our mutation rates is on average 14% which is much smaller than the actual previous rates. We believe this small SD proves the reality of neutral mutation rates of Moto Kimura16. While our method is accurate for “big data”, applications to genetics, forensics, genealogy require the *TMRCA* between just two individuals, or between two species. Now for this “2-body problem” we cannot determine what singular lineages the branching has been through: with mutations either exaggerated or suppressed. Thus previous methods for small samples are at best unreliable. It is an important problem to find what accuracy is possible for small samples. In checking accuracy we ran into the question of the origins of PIE. Although there are genes for language there is certainly none for any Indo-European language. Thus inferences have to be indirect. Marija Gimbutas saw patterns in symbolism and burial rituals suggesting the Yamnaya Culture was the cradle of Proto Indo-European. Also their physiology was robustly Europeanoid unlike the gracile skeletons of Neolithic Europe, but this could be nutrition and not genetic. So it was an open question whether the spread of this robust type into Western Europe in the late Neolithic marked an influx of Steppe nomads or a revolution in diet. Reich23 observed all 6 skeletons from Yamnaya sites, c 3300BC by 14*C* dating, are either R1a1b1 and R1a1a. But that method could not date the origin of R1a1b1 and R1a1a. Our *TMRCA* shows both these haplotypes expanding at essentially the same time c3700BC. This, together with our later date for Anatolia, implies that R1b1a2 and R1a1a must have originated in the Yamnaya Culture, c 3700BC. Furthermore, considering the correlation of haplotypes R1b1a2 and R1a1a with Indo-European languages (i.e. all countries with R1b1a2 & R1a1a frequency *>* 50% speak Indo-European), this provides powerful evidence for the origin of Proto Indo-European. ## Supplementary Material 1: Mathematical Genetics Sophisticated Mathematical theory has been developed for genetics, see 1,2,5. However all of these assume the present distribution is entirely the result of the stochastic process. We emphasize the role of extraneous forces like kin-selection which operates on too big a scale and rarely enough with results that cannot be subsumed into the mutation rates. So our method does not follow from any of these previous theories nor is it just applying a known statistics package. Instead we return to basic principles. ### Fundamental Solutions The Y-chromosome has DYS marked by *j* = 1, *…N*, where one can count the STR number *x**j*. Consider the probability *P**j,k* (at time *t* generations) that at marker *j* we have *x**j* = *k*. This satisfies the homogenous stochastic system ![Formula][4] This homogenous system gives a uniform expansion from a single patriarch. The system is essentially the model of Wehrhahn9 who had *μ**j,-*1 = *μ**j,*1. We introduce asymmetric mutations with total rate ![Formula][5] About 50% of DYS markers show asymmetric mutations, i.e. *μ**j,-*1 *≠ μ**j,*1. The fundamental solution comes from the generator function ![Formula][6] with complex variable *z*, and normalized initial condition *x**j* = 0 or *P**j,*(0) = 1: ![Formula][7] Then *G* can be expanded in powers of *z* to give *P**j,k*(*t*). Now for the simplest asymmetric case, with only one step mutations, we have *G*(*z, t*) = ![Formula][8] so using the Hyperbolic Bessel Function of Order *k ≥* 0, see Olver 7 ![Formula][9] we see that the homogenous system has fundamental solution ![Formula][10] From this we obtain the second moment: ![Formula][11] Also from the fundamental solution we find, independently of time ![Formula][12] which we call the *asymmetric ratio*. It will be repeatedly used. Of course the actual initial value is not *x**j* = 0 but was usually taken to be the mode *m**j* which was assumed to be the value for original patriarch. Assuming symmetry, i.e. *μ**j,-*1 = *μ**j,*1, the TMRCA is: ![Formula][13] From the present distribution of data we use the frequency ![Formula][14] One problem with the KAPZ formula is that higher frequencies *f* (*j, k*), *|k|* = 2, 3*…* are overrepresented in the actual data. This is because the probability of a spontaneous two step mutation is much higher then the product of two one step mutations. So instead we use the frequency to solve the transcendental equation for the unknown *t* ![Formula][15] This nonlinear equation is easily solved via mathematical software such as MATHEMATICA (I used version 9 running on a boosted 2014 iMac which has accurate hyperbolic Bessel functions. Earlier versions on older iMacs gave inaccuracies so one had to compile one’s own functions). Using this formula resolves some other problems with the KAPZ method, e.g. *μ**j,-*1 ≠ *μ**j,*1 gives an extra quadratic term which if ignored causes large errors. ### Heterogeneous diffusion equation However the main problem is singularities in the stochastic process. For a uniform stochastic process, 1 *- P**j,*(*t*) *∼* 1 *- f* (*j,* 0) is the probability of some mutation. So the expected variance is *f* (*j,* 0)(1*-f* (*j,* 0)). Thus if the actual data variance *V**j* *>> f* (*j,* 0)(1 *-f* (*j,* 0)) we are not uniform. Now a sublineage of very high fertility increases variance, giving apparently greater *TMRCA* although it is unchanged. One finds similar results for Bayesian methods. The correct approach to nonuniformity assumes at times *t**i* (generations ago) a certain proportion 0 *≤ ρ**i* *≤* 1 of the present population originated from a “virtual patriarch” with an initial STR value *m**i*. The resulting system : ![Formula][16] i.e. *dρ* are atoms of weight *ρ**i* with STR value *m**i* occurring at time *t**i*. As the system is linear and isotropic the solution is a combination of fundamental solutions *P* of the homogenous system. Thus the present distribution *f* (*j, k*) is ![Formula][17] This allows us to consider populations mixed by having singular lineages from overfertile patriarchs, or by actual immigration from the outside. The inverse problem seeks to find singularities from present data. Unfortunately inversion is ill posed for such systems like the heat equation. This instability produces poor accuracy. Furthermore there is no unique solution, e.g.the present distribution could have been created yesterday. However we find that *∼* 50% of the DYS markers show no significant difference from the uniform expansion of a single patriarch, i.e. the data variance *V**j* is close to the expected variance *f* (*j,* 0)(1 *-f* (*j,* 0)). The other markers show at most one significant side branch, i.e. there is an original branch starting at time *t**j,* with STR *m* and a second one with STR *m*1 = *m* *±* 1 at time *t**j,*1 *< t**j,* with significant 0 *< ρ*1 *< ρ*. ### Reduction We locate these singular lineages by looking for asymmetries in the distribution. For a uniform flow from a single patriarch the frequency of STR value *k* is given by *f* (*j, k*) *∼ P**j,k*(*t*). The asymmetric ratio: ![Formula][18] is completely independent of time *t*. Therefore if say ![Formula][19] we have a singular lineage at *k* = +1. Thus the excess at *k* = +1 is ![Formula][20] To first order approximation then frequency *f* (*j,* +2) is due to this singularity at *j* = +1 which therefore gave a contribution ![Formula][21] to *k* = 0. Thus removing the effect of the singularity at *k* = +1 leads to new frequencies ![Formula][22] These of course are no longer normalized so we rescale to obtain the renormalized frequency *F* (*j, k*), e.g. ![Formula][23] which will be used to compute the expansion time for marker *j*. There are similar formulae if the singularity was at *k* = −1. This is illustrated in figure 3. However there is sampling error both in the frequencies and the *μ**j,*1, μ*j,-*1. So we bootstrap taking into account these uncertainties, running the computation thousands of times. Generally we find the branch singularity is always one of *k* = 0, +1, *-*1 with no SD. In a few cases the singularity may seem to wander between *k* = 0, +1, *-*1. So in the case of a wandering singularity we obtain a distribution over *k* = 0, +1, *-*1 with a mean and SD. In these cases we find the singularity is relatively small and does not make much difference to the final result. However to have a stable method we do not throw out these wandering singularities but in the algorithm use the mean to average between *k* = 0 and *k* = *±*1, e.g. if the mean is *k* = 0 then we use the original unreduced frequency. Notice that we assume at most one side branch. In theory there could be many and solving for these produce even better approximations to the present data. In fact you could get perfect matching but find the atoms were created yesterday! The thing is that while many markers show significant deviation from a uniform flow from a single patriarch, after we have carried out reduction for one possible side branch we find no significant difference from a uniform flow, i.e. the difference is within the SD. This is of course an approximation, the next level beyond Zuckerkandl and Pauling, but given the noise in the data perhaps the best we can do. Later we further reduce the effect of outliers by using robust statistics. Reducing the singular lineages increases the frequency *f* (*j,* 0) of the mode and decreases the computed *TMRCA*. But as the method of reducing singularities does not respect higher frequencies *f* (*j, k*) it follows the KAPZ formula cannot be used and instead we use the probability of no mutations, i.e. solve ![Formula][24] This is done for each DYS marker *j*, giving expansion times *t*1,…*t**N* for each marker, with computed CI. (An extra fixed source of error is the uncertainty in the mutation rates which we deal with later). We find the reduction of singularities makes striking difference to the *t**j* of the effected markers, often a reduction of *∼* 50% for *TMRCA*. Now the existence of side branches implies that the main branch could itself have been the side branch for an earlier branch that did not survive. Thus we do not expect the expansion times *t*1,…*t**N* for each marker to be essentially equal., i.e they are not within the SD of each other. Indeed we see that the distribution of the times *t**j* for different markers are almost certainly not randomly arranged about a single *TRMCA T* but distributed from *T* to the present. This is seen whether you use reduction or not, or our mutation rates or not. (For a given population one could scale mutation rates to get equal *t**j*, but then applying these adhoc mutation rates to other populations does not yield the same values). The spread out distribution of surviving branches is another verification of our theory of many extinctions, few survivors. The distribution of the times *t**j* for different markers we call the branching distribution, which is now discussed. ### The Branching Distribution The times *t**j* for different markers are sorted from the youngest to the oldest, forming a sequence ![Graphic][25]. The generation of these branches is by an unknown probability distribution *dτ* over [0, *T*]. We model *dτ* by assuming a surviving lineage is generated at random with probability *β*Δ*t* in time period [*t, t* + Δ*t*], multiplied by the probability that the branching hasn’t already occurred. The constant *β* averages fertility and extinction rates, the chance of a new lineage surviving. As *β → ∞* we get current theory where all lineages originate from a single patriarch at time *T*. Simulations with the data show that *β* varies in the range 1 to *∞*. We make no a priori estimate of *β*, unlike Bayesian methods where an overall fertility rate is a predetermined parameter. Instead our stochastic simulation will find the most likely *β, T* in each case. Assuming independence, then the generation of branches follows the well known exponential distribution: ![Formula][26] Notice this implies a finite probability that some markers have essentially zero mutations. This is actually seen in examples. Both the Hamilton Gp A and Macdonald Gp A have number of individuals *n >* 100. For the time scale of *>* 700 years we do not expect there is more than one marker out of 33 which shows absolutely no mutations from the mode. In fact in both cases there are 8 markers where all *n* individuals have exactly the same STR value. Estimating the parameter *T* for an exponential distribution is a well known problem of statistics. Kendall proved the best estimate for *T* would be max *t**j*. Unfortunately there is also considerable error *λ**j*% for the mutation rates *μ**j*. Later we give a method for reducing this error, even so we find the SD in the range 10% - 30% which gives corresponding range in error for each *t**j*. We understand that the *t**j* are being generated by the distribution *dτ* but superimposed on this is a further uncertainty due to mutation rates etc. In particular the largest *t**j* may be wildly inaccurate. Also we found that simply taking the average consistently underestimates the *TMRCA* by a wide margin. Assuming the mutation rates have normal distribution with mean *μ**j* and variance ![Graphic][27], the *t**j* have SD *t**j**λ**j*. Thus the actual data for ![Graphic][28] has probability density function for *s >* 0 ![Formula][29] The variance *ν* depends on two sources. First from the uncertainty in mutation rates, for each marker we get variance ![Graphic][30], giving total ![Formula][31] However a small sample also has inherent error from sampling. We are measuring the probability that there is a mutation. This is binomial with probability ![Formula][32] Hence for sample size *n* there is variance *H**j*(1 - *H**j*)*/n*, so the variance in time due to this is scaled by the derivative giving: ![Formula][33] The function ![Graphic][34] has actually to be computed as an inverse function depending on *H**j*. Therefore the total variance averaged over all *N* markers is *ν* = *ν*1 + *ν*2. Although for large samples (*n >* 1000) the second term is insignificant it does effect the results once you get to *n* = 100. In our algorithm the branching distribution is used to generate large numbers of random branching times so as to bootstrap error estimates. In turns out much faster to compile the distribution function as a table which can be repeatedly called on. ### Robust Statistics: Estimating *T* Inaccurate large values of ![Graphic][35] are mitigated by using “robust” statistics with quintiles instead of means/variances. Using FTDNA data we began with 37 markers. However the 4 markers of DYS464 are unordered and cannot be used. Also we find that markers DYS 19/394, 385b, 459b, CDYb have errors *>* 33% in mutation rates so are not used. (These are some of the most popular ones in the literature!). So usually we have *N* = 29 markers and take “quintiles” ![Graphic][36]. This means that tail end data is not discarded but kept as the information there are 8 values of ![Graphic][37], which effectively deals with outliers. Bootstrap methods give the confidence interval CI for each quintile. Thus we wish to find the best estimate of *T* given *θ** (and CI). This well known statistical problem was investigated by Stochastic Simulations (SS). We also tried Maximum Likehood Methods which gave similar results but with larger CI. Monte-Carlo Methods are used to produce very large numbers (*∼*107) of *T, β* with corresponding Distribution. These randomly generate ordered times (*s*1*…s*29) for which we take the quintiles *θ* = (*s*9, *s*12, *s*15, *s*18, *s*21). We filter by requiring that *θ* close to the data *θ**, i.e. ∥θ* - θ ∥ < 2*SD*. This gives a stochastic neighborhood *U* of θ* typically containing *>* 105 sets of data but with *T* is known for each *θ ∈ U*. Thus we can construct a quasilinear estimator: ![Formula][38] and use least squares over *U* to find constants (*q*1, *q*2, *q*3, *q*4, *q*5) minimizing ![Formula][39] The (*q*1, *q*2, *q*3, *q*4, *q*5) are computed in MATHEMATICA. We then test them by applying the QL to all of *U*, unsurprisingly ![Formula][40] What is important is that we find the uncertainty in the SS itself. Actually this depends on the data and is calculated in each case but for our examples we find ![Formula][41] Finally we apply the quasilinear estimator to the experimental data ![Formula][42] to obtain our best estimate of *T*. Application of *QL* computes the SD for our data, giving part of the overall SD. This must be combined with the SD coming from the uncertainty in the SS. Overall we find that our method has SD *∼*10%, this includes variances from our data, mutation rates and uncertainty in the SS. We also tested with 15 and 7 markers. Here one must use “quintiles” ![Graphic][43],![Graphic][44] respectively with all the loss of accuracy that implies. See Table 6, 7 for comparisons using 29, 15, 7 markers on same data. ![Figure 8](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2015/05/18/019414/F3.medium.gif) [Figure 8](http://biorxiv.org/content/early/2015/05/18/019414/F3) Figure 8 #### References 1. [1].Durrett, R. Probability Models for DNA sequence Evolution, Springer(2002) 2. [2].Epstein, C. and Mazzeo, R. Degenerate Diffusion Operators Arising from Population Biology, Princeton(2013) 3. [3].Kimura, M. Evolutionary rate at the molecular level. Nature 624 – 626 (17 February 1968); doi:10.1038/217624a0 [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1038/217624a0&link_type=DOI) [PubMed](http://biorxiv.org/lookup/external-ref?access_num=5637732&link_type=MED&atom=%2Fbiorxiv%2Fearly%2F2015%2F05%2F18%2F019414.atom) [Web of Science](http://biorxiv.org/lookup/external-ref?access_num=A1968A646300008&link_type=ISI) 4. [4].Kendall, D. G. (1975). The Genealogy of Genealogy Branching Processes before (and after) 1873. Bulletin of the London Mathematical Society 7 (3): 225–253. [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1112/blms/7.3.225&link_type=DOI) 5. [5].1. Nielsen, R. (Editor) Statistical Methods in Molecular Evolution, Springer (2005). 6. [6].Ohta, T., and Kimura, M. The model of mutation appropriate to estimate the number of electrophoretically detectable alleles in a genetic population. Genet. Res. (1973)22: 201–204. [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1017/S0016672300012994&link_type=DOI) [PubMed](http://biorxiv.org/lookup/external-ref?access_num=4777279&link_type=MED&atom=%2Fbiorxiv%2Fearly%2F2015%2F05%2F18%2F019414.atom) [Web of Science](http://biorxiv.org/lookup/external-ref?access_num=A1973S206000008&link_type=ISI) 7. [7].1. M. Abramowitz Olver, F. W. J. *Bessel functions of integer order* pp. 355–434 Handbook of Mathematical Functions (1964) edited by M. Abramowitz, National Bureau of Standards, Washington. 8. [8].Tucker, H. A Graduate Course in Probability, Academic Press(1967), preprinted Dover (1995). 9. [9].Walsh, B. Estimating the Time to the Most Recent Common Ancestor for the Y chromosome or Mitochondrial DNA for a Pair of Individuals, Genetics 158: 897–912(2001) [Abstract/FREE Full Text](http://biorxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6ODoiZ2VuZXRpY3MiO3M6NToicmVzaWQiO3M6OToiMTU4LzIvODk3IjtzOjQ6ImF0b20iO3M6Mzc6Ii9iaW9yeGl2L2Vhcmx5LzIwMTUvMDUvMTgvMDE5NDE0LmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 10. [10].Wehrhahn, C. F. The evolution of selectively similar electrophoretically detectable alleles in finite natural populations, Genetics 80(1975) 375–394. [Abstract/FREE Full Text](http://biorxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6ODoiZ2VuZXRpY3MiO3M6NToicmVzaWQiO3M6ODoiODAvMi8zNzUiO3M6NDoiYXRvbSI7czozNzoiL2Jpb3J4aXYvZWFybHkvMjAxNS8wNS8xOC8wMTk0MTQuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) ## Supplementary Material 2: Accurate Mutation rates Any genetic clock depends on reasonably accurate mutation rates. The mitosis method looks for mutations in sperm samples. Forensics uses father-son studies. However typical rates of *μ* = .002 would require nearly 50, 000 pairs to get an SD of 10%. Small samples have meant large errors. The pedigree approach is to study large family groups with well developed DNA/genealogy data. So inverting the KAPZ formula would yield accurate rates. However, *singular lineages* makes this problematic. Genealogical data might give mutation rates much greater than the biochemical rates because kin selection etc tend to exaggerate the apparent mutation rate. An inspection of 10 different sources finds mutation rates claiming SD *∼*10% yet they differ from each other by up to 100%. We describe a new method. To compute our rates we apply our theory to the large DNA projects for the SNP M222, L21, P312, U106, R1b1a2, I1, R1a1a. This avoids dealing with populations such as family DNA projects which are self selecting, i.e only those with the correct surname which neglects distant branches. Also we have very large samples, our average *n >* 1000. Greater accuracy should come from more generations and individuals. The problem is that we do not know their *TMRCA*. ### Asymmetric Mutation However before computing mutation rates we must consider asymmetric mutations, i.e. the left and right mutation rates *μ**j,-*1 ≠ μ*j,*1. For a uniform stochastic process we again use the asymmetric ratio ![Formula][45] to define the *asymmetric constant A**j* ∈ [0, 1] for marker *j*. For example *A**j* = 0.5 is complete symmetry. Of course singularities will effect this ratio, however these only occur *<* 50% of markers. Thus for each marker, SNP we compute this ratio. We find the SD for each SNP is relatively small while the difference between SNP can be large. However for each marker, using 8 SNP enables outliers to be easily removed leaving allowing us to use simple linear regression: i.e. average of the *A**j* over the remaining SNP groups. We see that asymmetry is a real effect: 50% of the *A**j* are more than two SD from symmetry *A**j* = 0.5. Observe this is significant. The total second moment is ![Formula][46] So using all our 33 DYS markers with our *μ**j*, we compute constants ![Formula][47] The KAPZ formula gives variance *V* = *μt* compared to the corrected formula *μt* + *τt*2. The uncorrected KAPZ gives an overestimate *>* 400% for *>* 200 generations. This effect can be nullified by using the mean instead of the mode, variance instead of the second moment, however failing to do so gives a large error. Furthermore other methods which assume symmetric mutations will also be inaccurate. Having estimates on the asymmetry is essential to our method because we find singular lineages by looking for asymmetry in the data. Any such anomaly needs to be significantly greater than the natural asymmetry. ### Mutation Rates as a fixed Point Next we compute mutation rates using 8 very large SNP groups. First, using the asymmetric constants we find singular lineages and reduce their effect. We take account of the error in the *A**j* by a bootstrap technique, which gives the variance for each frequency *f* (*j,* 0). For a given SNP *k* if markers *j* started their expansion at the same time TMRCA *T**k* we could calculate mutation rates *μ**j* via ![Formula][48] or rather average the 8 different *μ**j* we would obtain. However because of branching caused by extinction of lineages the different markers do not originate at the same time but at different times *t**j*. In this case we expect these *t**j* to be randomly distributed about the log mean over a middle set of times *t**j*. So, for each SNP group *k* = 1, *‥*8 define mean time *T**k*, not the TMRCA but the mean log mean over a middle set of markers, which is less. We find that this is very stable. So for a fixed marker *j* the data *τ**k,j* = *t**j* - *T**k* should be randomly distributed about zero over the different SNP *k* = 1, *‥,* 8. However the wrong choose of *μ**j* would give a bias. In fact this is what we see if the mutation rates *μ**j* = .002 were chosen. In appendix graphs show the *τ**k,j*, *k* = 1, *‥*8 bunched around a nonzero point. Thus we try to find *μ**j* so that the *τ**k,j*, *k* = 1, 2, *‥*8 has mean zero. However the *τ**k,j*, *k* = 1, 2, *‥*8 depend nonlinearly on the rates *μ**j*, as does the mean *T**k*, *k* = 1, *‥*8. We find this nonlinear regression problem is solved by an iterative scheme which starts with any reasonable set of DNA rates, finding any reasonable choice iterates to the same final answer. So choose *μ**j* = .002 to begin. Suppose at some stage we have apparent mutation rates *μ**j*. Then, for each SNP, and each marker we solve equation (1) to obtain the apparent *t**j*. For each SNP *k* = 1, *‥*8 we compute the mean log time *T**k*. At the next step we get new rates ![Graphic][49] from ![Formula][50] Averaging ![Graphic][51], *k* = 1, *‥*8 we get our next set of *μ**j* of mutation rates. However this method would be effected by a marker showing a singular lineage. Fortunately these are few in number and by comparison between the different SNP we remove the outliers. We then repeat the process, computing *T**k* again with the new rates, and another set of mutation rates. So we have an iterative process. One problem is that the iterates could tend to decrease to zero or increase to *∞*, as we are only calculatin),g relative rates. To prevent this we renormalize after each iteration so the total Σ*μ**j* is constant. We found the iterative scheme quickly converges to a fixed set of mutation rates, unique up to a constant factor. The CI is computed by bootstrap parametrized by the uncertainties in data and the asymmetric constants. In figure we show the distribution of *τ**k,*1, *k* = 1.2, *‥*8 before and after the first iteration. ### The generation factor *γ* This method does not give absolute mutation rates but *relative* mutation rates *μ**j**γ*, where *γ* is universal time scale constant. To find *γ* we apply our method to compute the *T* = *T M RCA* of three famous DNA projects and choose *γ* so the scaled *T /γ* best fits the historical record. We choose the DNA projects for the O’Niall(M222), Gp A of Macdonald (R1a1a) and Gp A of the Hamiltons (I1). These are large groups with characteristic DNA and fairly accurate times of origin. Of course finding one constant *γ* from three projects is inherently more accurate than using one project to find 33 different mutation rates. Actually assuming a generation of 27*years* these three projects yield *γ* = 1 with about 5% error, i.e. there is no actual need for this correction. This is a constant error (like uncalibrated 14*C* dating). Thus *γ* is related to the length of a generation. Most researchers use 25*yrs* for *t >* 500*ybp* and 27*yrs* for *t <* 500*ybp*. Balaresque and al used 30*yrs* based on Finer who sees a 30*yr* generation for modern hunter-gatherers. (Although for most of the time R1b1a2 were subsistence farmers and not hunter gatherers.) At first glance our theory allows any nominal generation as it really doesn’t matter, being included in the *γ* factor which we compute in years not generations. Actually its not as simple as that. While our three DNA projects being post 1000AD elites have a 27*yr* generation the problem is what to do for *t >* 2000*ybp*. Now 25*y* may be appropriate for subsistence farmers but we found that singular lineages of the elite have exaggerated effect so 27 years seems appropriate. Mutation Rates: Hamilton vs Mitosis and pedigree View this table: [Table5](http://biorxiv.org/content/early/2015/05/18/019414/T5) Asymmetric rates View this table: [Table6](http://biorxiv.org/content/early/2015/05/18/019414/T6) The log distribution of *τ*k,1, *k* = 1.2, *‥*8 before iteration at marker *j* = 1, ie DYS 393, but after reduction 1(*μ**j* = .002). The SNP are colored: ![Figure4](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2015/05/18/019414/F4.medium.gif) [Figure4](http://biorxiv.org/content/early/2015/05/18/019414/F4) After just one iterate we get ![Figure5](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2015/05/18/019414/F5.medium.gif) [Figure5](http://biorxiv.org/content/early/2015/05/18/019414/F5) So 5 of our *τ**k,*1, *k* = 1.2, *‥*8 bunch around zero, outliers are U106 and I1. The iterative scheme converges to stable values very fast, 7 iterates is enough. #### References 1. [1].Burgarella, C et al. Mutation rate estimates for 110 Y-chromosome STRs combining population and father-son pair data, European Journal of Human Genetics (2011) 19, 70–75; doi:10.1038/ejhg.2010.154 [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1038/ejhg.2010.154&link_type=DOI) [PubMed](http://biorxiv.org/lookup/external-ref?access_num=20823913&link_type=MED&atom=%2Fbiorxiv%2Fearly%2F2015%2F05%2F18%2F019414.atom) 2. [2]. George B. J. Busby et al. The peopling of Europe and the cautionary tale of Y chromosome lineage R-M269, Proc Biol Sci. 2012 Mar 7; 279(1730):884–92. doi: 10.1098/rspb.2011.1044 [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1098/rspb.2011.1044&link_type=DOI) [PubMed](http://biorxiv.org/lookup/external-ref?access_num=21865258&link_type=MED&atom=%2Fbiorxiv%2Fearly%2F2015%2F05%2F18%2F019414.atom) 3. [3].Fenner, J.N. Cross-Cultural Estimation of the Human Generation Interval for Use in Genetics-Based Population Divergence Studies, American Journal of Physical Anthropology,(2005) 128:415–428. [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1002/ajpa.20188&link_type=DOI) [PubMed](http://biorxiv.org/lookup/external-ref?access_num=15795887&link_type=MED&atom=%2Fbiorxiv%2Fearly%2F2015%2F05%2F18%2F019414.atom) [Web of Science](http://biorxiv.org/lookup/external-ref?access_num=000232239800016&link_type=ISI) 4. [4].Ho S.Y.W, Phillips M.J, Cooper A., Drummond A.J. Time dependency of molecular rate estimates and systematic overestimation of recent divergence times. Molecular Biology & Evolution 22 (7): 1561–1568(2005). [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1093/molbev/msi145&link_type=DOI) [PubMed](http://biorxiv.org/lookup/external-ref?access_num=15814826&link_type=MED&atom=%2Fbiorxiv%2Fearly%2F2015%2F05%2F18%2F019414.atom) [Web of Science](http://biorxiv.org/lookup/external-ref?access_num=000229757400007&link_type=ISI) 5. [5].Kimura, M. Evolutionary rate at the molecular level. Nature 624 – 626 (1968); doi:10.1038/217624a0 [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1038/217624a0&link_type=DOI) [PubMed](http://biorxiv.org/lookup/external-ref?access_num=5637732&link_type=MED&atom=%2Fbiorxiv%2Fearly%2F2015%2F05%2F18%2F019414.atom) [Web of Science](http://biorxiv.org/lookup/external-ref?access_num=A1968A646300008&link_type=ISI) 6. [6].Zhivotovsky LA, Underhill PA, Cinnioglu C et al. The effective mutation rate at Y chromosome short tandem repeats, with application to human population-divergence time, Am J Hum Genet (2004) 74: 50–61. [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1086/380911&link_type=DOI) [PubMed](http://biorxiv.org/lookup/external-ref?access_num=14691732&link_type=MED&atom=%2Fbiorxiv%2Fearly%2F2015%2F05%2F18%2F019414.atom) [Web of Science](http://biorxiv.org/lookup/external-ref?access_num=000187723500005&link_type=ISI) ## Supplementary Material 3: Reduction of Singular Lineages vs KAPZ We compare results for our method with KAPZ, for the same data, 29 markers and our mutation rates First we compare for groups with medieval expansions View this table: [Table7](http://biorxiv.org/content/early/2015/05/18/019414/T7) Next we compare SNP G2a2b, R1b1a2, R1a1a, I1, L21, U106, J2, P312: View this table: [Table8](http://biorxiv.org/content/early/2015/05/18/019414/T8) RSL and KAPZ will give similar results if there is a fast expansion and thus insignificant singular lineages and branching. Actually this is to be expected sometimes, i.e. it is not surprising that the results using RSL and KAPZ for O’Niall, R1a1a, U106 are very similar. However in other cases the KAPZ results are about 30% too old. In the case of the Hamiltons and Macdonalds absurdly so. For R1b1a2 it gives an early Neolithic age, compared with eneolithic for R1a1a, yet these have been dated to the same Yamanya times. The KAPZ dates for L21 “Celtic” is nearly 2000 years before Urnfelder Culture. Of course one might try to “improve” KAPZ by increasing the mutation rates by 33% so the KAPZ times are decreased by 25%. Then the medieval dates look reasonable but we find 3100BC for G2a2b which is too late. For R1a1a we would get 2300BC which is not only too late but significantly different from the 3600BC for R1b1a2. Also G2a2b would be predated by R1b1a2 even though the latter has never been found in Neolithic sites of Europe. Getting consistent results across the span of history was a problem of previous clocks. ## Footnotes * 1 The calculations and figures for all 33 markers is shown in SM * Received May 15, 2015. * Accepted May 18, 2015. * © 2015, Posted by Cold Spring Harbor Laboratory The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission. ## References 1. [1].Ammerman, A. J. and Cavalli-Sforza, L. L. The Neolithic Transition and the Genetics of Populations in Europe, Princeton Univ. Press, Princeton(1984). 2. [2].Anthony D.W., The Horse, the Wheel and Language, Princeton Univ. Press, Princeton(2007). 3. [3].Arredi B., Poloni E., Tyler-Smith C. The Peopling of Europe. Anthropological Genetics: Theory, Methods and Applications. Cambridge University Press: Cambridge(2010). 4. [4].Balaresque P. et al. A predominantly Neolithic Origin for European paternal Lineages, PLoS Biol (2010); 8: e1000285. [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1371/journal.pbio.1000285&link_type=DOI) [PubMed](http://biorxiv.org/lookup/external-ref?access_num=20087410&link_type=MED&atom=%2Fbiorxiv%2Fearly%2F2015%2F05%2F18%2F019414.atom) 5. [5].Burgarella, C et al. Mutation rate estimates for 110 Y-chromosome STRs combining population and father-son pair data, European Journal of Human Genetics (2011) 19, 70–75; doi:10.1038/ejhg.2010.154 [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1038/ejhg.2010.154&link_type=DOI) [PubMed](http://biorxiv.org/lookup/external-ref?access_num=20823913&link_type=MED&atom=%2Fbiorxiv%2Fearly%2F2015%2F05%2F18%2F019414.atom) 6. [6]. George B. J. Busby et al. The peopling of Europe and the cautionary tale of Y chromosome lineage R-M269, Proc Biol Sci. 2012 Mar 7; 279(1730):884–92. doi: 10.1098/rspb.2011.1044 [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1098/rspb.2011.1044&link_type=DOI) [PubMed](http://biorxiv.org/lookup/external-ref?access_num=21865258&link_type=MED&atom=%2Fbiorxiv%2Fearly%2F2015%2F05%2F18%2F019414.atom) 7. [7].Chiaroni J, Underhill PA, Cavalli-Sforza ll Y chromosome diversity, human expansion, drift, and cultural evolution, Proc Natl Acad Sci USA (2009); 106: 20174–20179. [Abstract/FREE Full Text](http://biorxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NDoicG5hcyI7czo1OiJyZXNpZCI7czoxMjoiMTA2LzQ4LzIwMTc0IjtzOjQ6ImF0b20iO3M6Mzc6Ii9iaW9yeGl2L2Vhcmx5LzIwMTUvMDUvMTgvMDE5NDE0LmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 8. [8].Cinnioglu C, King R, Kivisild T, Kalfoglu E, Atasoy S, et al. Excavating Y-chromosome haplotype strata in Anatolia. Hum Genet. (2004);114:127–148. [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1007/s00439-003-1031-4&link_type=DOI) [PubMed](http://biorxiv.org/lookup/external-ref?access_num=14586639&link_type=MED&atom=%2Fbiorxiv%2Fearly%2F2015%2F05%2F18%2F019414.atom) [Web of Science](http://biorxiv.org/lookup/external-ref?access_num=000186954300001&link_type=ISI) 9. [9].Edmonds C., Lillie A., Cavalli-Sforza L.L. Mutations arising in the wave front of an expanding population, Proc Natl Acad Sci USA (2004); 101: 975–979. [Abstract/FREE Full Text](http://biorxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NDoicG5hcyI7czo1OiJyZXNpZCI7czo5OiIxMDEvNC85NzUiO3M6NDoiYXRvbSI7czozNzoiL2Jpb3J4aXYvZWFybHkvMjAxNS8wNS8xOC8wMTk0MTQuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 10. [10].Fenner, J.N. Cross-Cultural Estimation of the Human Generation Interval for Use in Genetics-Based Population Divergence Studies, American Journal of Physical Anthropology,(2005) 128:415–428. [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1002/ajpa.20188&link_type=DOI) [PubMed](http://biorxiv.org/lookup/external-ref?access_num=15795887&link_type=MED&atom=%2Fbiorxiv%2Fearly%2F2015%2F05%2F18%2F019414.atom) [Web of Science](http://biorxiv.org/lookup/external-ref?access_num=000232239800016&link_type=ISI) 11. [11].Gimbutas M. The Prehistory of Eastern Europe, Part 1, (1956) Cambridge:American School of Prehistoric Research #20. 12. [12].Goldstein D. B, Linares A. R, Cavalli-Sforza L. L, Feldman M. W. An evaluation of genetic distances for use with microsatellite loci Genetics(1995);139:463–471. [Abstract/FREE Full Text](http://biorxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6ODoiZ2VuZXRpY3MiO3M6NToicmVzaWQiO3M6OToiMTM5LzEvNDYzIjtzOjQ6ImF0b20iO3M6Mzc6Ii9iaW9yeGl2L2Vhcmx5LzIwMTUvMDUvMTgvMDE5NDE0LmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 13. [13].Hamilton, W.D. The genetical evolution of social behaviour. I, II. Journal of Theoretical Biology 7 (1) (1964): 1–52. [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1016/0022-5193(64)90038-4&link_type=DOI) [PubMed](http://biorxiv.org/lookup/external-ref?access_num=5875341&link_type=MED&atom=%2Fbiorxiv%2Fearly%2F2015%2F05%2F18%2F019414.atom) [Web of Science](http://biorxiv.org/lookup/external-ref?access_num=A19647207B00008&link_type=ISI) 14. [14].Ho S.Y.W, Phillips M.J, Cooper A., Drummond A.J. Time dependency of molecular rate estimates and systematic overestimation of recent divergence times. Molecular Biology & Evolution 22 (7): 1561–1568(2005). [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1093/molbev/msi145&link_type=DOI) [PubMed](http://biorxiv.org/lookup/external-ref?access_num=15814826&link_type=MED&atom=%2Fbiorxiv%2Fearly%2F2015%2F05%2F18%2F019414.atom) [Web of Science](http://biorxiv.org/lookup/external-ref?access_num=000229757400007&link_type=ISI) 15. [15].Hunt, J.S., Bermingham, E., and Ricklefs, R.E. Molecular systematics and biogeography of Antillean thrashers, tremblers, and mockingbirds (Aves: Mimidae)”. Auk 118 (1): 35 – 55 doi:10.1642/0004-8038(2001)118 [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1642/0004-8038(2001)118&link_type=DOI) 16. [15].Lee, E.J. et al. Emerging Genetic Patterns of the European Neolithic: Perspectives From a Late Neolithic Bell Beaker Burial Site in Germany, (2012) American J. of Physical Anthropology 148 : 571–579 17. [16].Kimura, M. Evolutionary rate at the molecular level. Nature 624 – 626 (1968); doi:10.1038/217624a0 [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1038/217624a0&link_type=DOI) [PubMed](http://biorxiv.org/lookup/external-ref?access_num=5637732&link_type=MED&atom=%2Fbiorxiv%2Fearly%2F2015%2F05%2F18%2F019414.atom) [Web of Science](http://biorxiv.org/lookup/external-ref?access_num=A1968A646300008&link_type=ISI) 18. [17].Kendall, D. G. The Genealogy of Genealogy Branching Processes before (and after) 1873. (1975) Bulletin of the London Mathematical Society 7 (3): 225–253. [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1112/blms/7.3.225&link_type=DOI) 19. [18].Mallory, J. P. In Search of the Indo-Europeans: Language, Archaeology and Myth(1989) London: Thames & Hudson. 20. [19].Menozzi P, Piazza A, Cavalli-Sforza L. L. Synthetic maps of human gene frequencies in Europeans. Science(1978) 201:786–792 [Abstract/FREE Full Text](http://biorxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6Mzoic2NpIjtzOjU6InJlc2lkIjtzOjEyOiIyMDEvNDM1OC83ODYiO3M6NDoiYXRvbSI7czozNzoiL2Jpb3J4aXYvZWFybHkvMjAxNS8wNS8xOC8wMTk0MTQuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 21. [20].Myre, N. et al. A major Y-chromosome haplogroup R1b Holocene era founder effect in Central and Western Europe, European Journal of Human Genetics (2010), 1–7, 1018-4813/10 22. [21].Ohta, T., and Kimura, M. The model of mutation appropriate to estimate the number of electrophoretically detectable alleles in a genetic population. Genet. Res. (1973)22: 201–204. [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1017/S0016672300012994&link_type=DOI) [PubMed](http://biorxiv.org/lookup/external-ref?access_num=4777279&link_type=MED&atom=%2Fbiorxiv%2Fearly%2F2015%2F05%2F18%2F019414.atom) [Web of Science](http://biorxiv.org/lookup/external-ref?access_num=A1973S206000008&link_type=ISI) 23. [22].1. M. Abramowitz Olver, F. W. J. *Bessel functions of integer order* pp. 355–434 Handbook of Mathematical Functions (1964)edited by M. Abramowitz National Bureau of Standards, Washington. 24. [23].Reich, D, et al. Massive migration from the steppe was a source for Indo-European languages in Europe, Nature (2015) doi:10.1038/nature14317 [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1038/nature14317&link_type=DOI) [PubMed](http://biorxiv.org/lookup/external-ref?access_num=25731166&link_type=MED&atom=%2Fbiorxiv%2Fearly%2F2015%2F05%2F18%2F019414.atom) 25. [24].Renfrew, A.C., Archaeology and Language:The Puzzle of Indo-European Origins, (1987)London: Pimlico. 26. [25].Rootsi S. et al. Distinguishing the co-ancestries of haplogroup G Y-chromosomes in the populations of Europe and the Caucasus, Eur J Hum Genet. 2012 Dec; 20(12):1275–82. doi: 10.1038/ejhg.2012.86. [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1038/ejhg.2012.86&link_type=DOI) [PubMed](http://biorxiv.org/lookup/external-ref?access_num=22588667&link_type=MED&atom=%2Fbiorxiv%2Fearly%2F2015%2F05%2F18%2F019414.atom) 27. [26]. Peter A Underhill et al., The phylogenetic and geographic structure of Y-chromosome haplogroup R1a, European Journal of Human Genetics (2015) 23, 124–131; doi:10.1038/ejhg.2014.50; published online 26 March 2014 [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1038/ejhg.2014.50&link_type=DOI) [PubMed](http://biorxiv.org/lookup/external-ref?access_num=24667786&link_type=MED&atom=%2Fbiorxiv%2Fearly%2F2015%2F05%2F18%2F019414.atom) 28. [27].Walsh, B. Estimating the Time to the Most Recent Common Ancestor for the Y chromosome or Mitochondrial DNA for a Pair of Individuals, Genetics 158: 897–912(2001) [Abstract/FREE Full Text](http://biorxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6ODoiZ2VuZXRpY3MiO3M6NToicmVzaWQiO3M6OToiMTU4LzIvODk3IjtzOjQ6ImF0b20iO3M6Mzc6Ii9iaW9yeGl2L2Vhcmx5LzIwMTUvMDUvMTgvMDE5NDE0LmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 29. [28].Wehrhahn, C. F. The evolution of selectively similar electrophoretically detectable alleles in finite natural populations, Genetics 80(1975) 375–394. [Abstract/FREE Full Text](http://biorxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6ODoiZ2VuZXRpY3MiO3M6NToicmVzaWQiO3M6ODoiODAvMi8zNzUiO3M6NDoiYXRvbSI7czozNzoiL2Jpb3J4aXYvZWFybHkvMjAxNS8wNS8xOC8wMTk0MTQuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 30. [29].Wilson I. J, Weale M. E, Balding D. J. Inferences from DNA data: population histories, evolutionary processes and forensic match probabilities. J Roy Statist Soc A. (2003);166:1–33. 31. [30].Zhivotovsky, L.A. Estimating Divergence Time with the Use of Microsatellite Genetic Distances, Oxford J. Life Sciences etc 18, 700–709. 32. [31].Zhivotovsky LA, Underhill PA, Cinnioglu C et al. The effective mutation rate at Y chromosome short tandem repeats, with application to human population-divergence time, Am J Hum Genet (2004) 74: 50–61. [CrossRef](http://biorxiv.org/lookup/external-ref?access_num=10.1086/380911&link_type=DOI) [PubMed](http://biorxiv.org/lookup/external-ref?access_num=14691732&link_type=MED&atom=%2Fbiorxiv%2Fearly%2F2015%2F05%2F18%2F019414.atom) [Web of Science](http://biorxiv.org/lookup/external-ref?access_num=000187723500005&link_type=ISI) 33. [32].1. Kasha, M. and 2. Pullman, B Zuckerkandl, E. and Pauling, L.B. (1962) *Molecular disease, evolution, and genic heterogeneity*. In Kasha, M. and Pullman, B (editors). Horizons in Biochemistry. Academic Press, New York. pp. 189–225. 34. [33].1. Bryson, V. & and 2. Vogel, H.J. Zuckerkandl, E. and Pauling, L.B. (1965) *Evolutionary divergence and convergence in proteins*. In Bryson, V. & and Vogel, H.J. (editors). Evolving Genes and Proteins. Academic Press, New York. pp. 97–166. [1]: /embed/graphic-1.gif [2]: /embed/inline-graphic-1.gif [3]: /embed/graphic-4.gif [4]: /embed/graphic-9.gif [5]: /embed/graphic-10.gif [6]: /embed/graphic-11.gif [7]: /embed/graphic-12.gif [8]: /embed/graphic-13.gif [9]: /embed/graphic-14.gif [10]: /embed/graphic-15.gif [11]: /embed/graphic-16.gif [12]: /embed/graphic-17.gif [13]: /embed/graphic-18.gif [14]: /embed/graphic-19.gif [15]: /embed/graphic-20.gif [16]: /embed/graphic-21.gif [17]: /embed/graphic-22.gif [18]: /embed/graphic-23.gif [19]: /embed/graphic-24.gif [20]: /embed/graphic-25.gif [21]: /embed/graphic-26.gif [22]: /embed/graphic-27.gif [23]: /embed/graphic-28.gif [24]: /embed/graphic-29.gif [25]: /embed/inline-graphic-2.gif [26]: /embed/graphic-30.gif [27]: /embed/inline-graphic-3.gif [28]: /embed/inline-graphic-4.gif [29]: /embed/graphic-31.gif [30]: /embed/inline-graphic-5.gif [31]: /embed/graphic-32.gif [32]: /embed/graphic-33.gif [33]: /embed/graphic-34.gif [34]: /embed/inline-graphic-6.gif [35]: /embed/inline-graphic-7.gif [36]: /embed/inline-graphic-8.gif [37]: /embed/inline-graphic-9.gif [38]: /embed/graphic-35.gif [39]: /embed/graphic-36.gif [40]: /embed/graphic-37.gif [41]: /embed/graphic-38.gif [42]: /embed/graphic-39.gif [43]: /embed/inline-graphic-10.gif [44]: /embed/inline-graphic-11.gif [45]: /embed/graphic-41.gif [46]: /embed/graphic-42.gif [47]: /embed/graphic-43.gif [48]: /embed/graphic-44.gif [49]: /embed/inline-graphic-12.gif [50]: /embed/graphic-45.gif [51]: /embed/inline-graphic-13.gif