Abstract
Molecular clocks give the “time to most recent common ancestor” (TMRCA) of genetic trees. By the Watson-Galton process17 most lineages terminate, with a few overrepresented singular lineages generated by W. Hamilton’s “kin selection”13. Applying current methods to this non-uniform branching produces greatly exaggerated TMRCA. We introduce an inhomogeneous stochastic process which detects singular lineages by asymmetries, whose reduction gives the true TMRCA. This implies a new method for computing mutation rates. Despite low rates similar to mitosis data, reduction implies younger TMRCA, with smaller errors. We establish accuracy by a comparison across a wide range of times; indeed this is the only clock giving consistent results for both short and long time scales. In particular we show that the dominant European y-haplotypes R1a1a & R1b1a2 expand from c3700BC, not reaching Anatolia before c3300BC. While this contradicts current clocks which date R1b1a2 to either the Neolithic Near East4 or Paleo-Europe20, our dates support recent genetic analysis of ancient skeletons by Reich23.
The genetic clock, computing TMRCA by measuring genetic mutations, was conceived by Emile Zuckerkandl and Linus Pauling32,33 on empirical grounds. However, work on neutral mutations by Motoo Kimura16 gave a theoretical basis and formula. While our theory applies to general molecular evolution, we focus on the Y-chromosome with DYS regions (DNA Y-chromosome Segments), counting the “short tandem repeat” (STR) number of nucleotides of a microsatellite. In fact one uses many DYS sites, marked by j = 1, …, N; each individual i, i = 1, …, n, has STR number xi,j. The Y-chromosome is passed unchanged from father to son, except for mutations xi,j → xi,j ± 1 occurring at rate μj per generation.
The fundamental assumption is that the sample population has a single patriarch at time t = TMRCA (in generations). Now suppose the present (sample) population has mode mj at DYS j. This is taken to be the STR value of the original patriarch. A calculation shows the present population has total squared deviation Vj = Σi (xi,j − mj)² about this mode (n times the variance), with expected value n μj t. Then averaging over the markers gives TMRCA = Σj Vj / (n Σj μj). This variance method and its variations we call KAPZ after its originators.
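As a minimal sketch of the variance method, the following Python function evaluates TMRCA = Σj Vj / (n Σj μj), assuming Vj is the summed squared deviation of marker j from its mode. The name `kapz_tmrca` and the input layout (one list of STR counts per individual) are illustrative assumptions, not the software used for the results reported here.

```python
from collections import Counter

def kapz_tmrca(samples, rates):
    """Estimate TMRCA in generations from STR repeat counts.

    samples: list of individuals, each a list of STR counts (one per marker).
    rates:   per-marker mutation rates per generation, in the same marker order.
    """
    n = len(samples)
    total_v = 0.0
    for j in range(len(rates)):
        column = [person[j] for person in samples]
        # The mode is taken to be the STR value of the original patriarch.
        mode = Counter(column).most_common(1)[0][0]
        # Assumed definition: V_j is the summed squared deviation from the mode.
        total_v += sum((x - mode) ** 2 for x in column)
    return total_v / (n * sum(rates))
```

The estimate is in generations; converting to calendar years requires a choice of generation length.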
In practice problems soon arose. Mutation rates could be computed from mitosis, but sample sizes are too small to give great accuracy. Using these rates, a KAPZ variant due to Zhivotovsky31,31 was applied to R1b1a2 by Myres20, dating L23* (Turkey) to 9000BC, σ = 2000.
Mutation rates could also be estimated from large family groups with genealogy data. However there are significant discrepancies in rates between different family groups. Also these “pedigree” rates are much larger than those from mitosis. A similar phenomenon for the mitochondrial clock suggested high short term rates and lower long term rates14,15. So very low long term rates of .00069 were suggested31 for the Y-clock. We show this is unnecessary.
Another problem is that KAPZ is for large populations whereas ancient populations were small and modern samples can be tiny, e.g. n < 20. This led to the introduction of Bayesian methods such as BATWING27, which considers all possible genealogical trees giving the present sample data, then searches for the tree of maximum likelihood. But the BATWING TMRCA is often greater than KAPZ, e.g. for the Cinnioglu8 study of Anatolian DNA both methods were applied to the same data and mutation rates. For R1b1a2 the KAPZ had TMRCA = 9800BC compared with 18,000BC for BATWING. Balaresque4 used BATWING to give an origin for R1b1a2 in Neolithic Anatolia c. 6000BC, but their statistics were disputed by Bushby29. All of this was contradicted by Reich22 who found R1b1a2 in skeletons c 3300BC from Yamnaya cemeteries.
Singular Lineages
A fundamental problem is that present populations have highly overrepresented branches we call singular lineages. A well known example is the SNP L21 which is a branch of R1b1a2. Individuals identified as L21 are often excluded from R1b1a2 analysis because they skew the results. Such a singular lineage causes the variance to be much greater, even though the original TMRCA remains unchanged, see figure 1. For Bayesian methods such lineages are very unlikely giving an even greater apparent TMRCA. However one cannot deal with singular branches by excluding them. For one thing, our method will show that 50% of DYS show evidence of singular side branches, i.e. more than a SD from expected. Excluding them would also remove some of the oldest branches and produce a TMRCA which is too young. Now these singular lineages are very (mathematically) unlikely to arise from the stochastic system which is the mathematical basis of KAPZ (or the equivalent Monte-Carlo process modeling BATWING). We believe that the standard stochastic process is perturbed by other improbable events, which are then amplified by biological processes.
First, the Watson-Galton Process17 implies lineages almost certainly die out. Conversely, the “kin selection” of W.D. Hamilton13 shows kin co-operation gives genetic advantages. Consider three examples with well developed DNA projects. Group A of the Hamiltons has approximately 100,000 descendants from a Walter Fitzgilbert c 1300AD. Group A of the Macdonalds has about 700,000 descendants from Somerled c1100AD, and Group A of the O’Niall has over 6 million descendants from Niall of the Seven Hostages, c300AD. These are elite groups with all the social advantages. One sees lines of chieftains, often polygamous. Our model has many extinct twigs with a few successful branches, whereas current models assume a uniform “star radiation”, see below.
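The claim that most lineages terminate can be illustrated with the textbook extinction calculation for a branching process. The sketch below assumes, purely for illustration, that each man has a Poisson-distributed number of sons; the extinction probability is then the smallest fixed point q of q = exp(λ(q − 1)), where λ is the mean number of sons. The function name and the Poisson assumption are ours and are not taken from the model of this paper.

```python
import math

def extinction_probability(mean_sons, iterations=10000):
    """Extinction probability of a male line with Poisson(mean_sons) sons.

    Iterates q <- exp(mean_sons * (q - 1)) starting from q = 0, which
    converges to the smallest fixed point of the offspring generating
    function (assumption: Poisson offspring distribution).
    """
    q = 0.0
    for _ in range(iterations):
        q = math.exp(mean_sons * (q - 1.0))
    return q
```

Under this assumption a line with a mean of at most one son per generation dies out with probability one, and even a line averaging 1.5 sons dies out in a substantial fraction of cases.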
Reduction of Singular Lineages
Modelling singular lineages requires a new stochastic system where instead of a single patriarch we imagine many “virtual patriarchs”, each originating at time tk ago and each giving a proportion 0 ≤ ρk ≤ 1 of the present population. So we now have an inhomogeneous expansion. Furthermore the symmetric model for mutations has to be changed to an asymmetric one, with different rates for the mutations xi,j → xi,j − 1 and xi,j → xi,j + 1.
We introduce asymmetric mutations and show how to compute them. Asymmetry will play a very important role in detecting singular lineages. This inhomogeneous asymmetric system is mathematically equivalent to a mixed population. Computing its solution is an “inverse problem”. Unfortunately inversion is unstable for such systems, and there is no unique solution. However it turns out that, up to a standard deviation SD, most DYS markers show at most one singular branch, which is found from asymmetries in the distribution. These singular branches are then reduced, revealing the original lineage. We then compute a branching time tj for each marker j. The effect of reduction is dramatic, see Figure 3. Now the nonuniform branching process causes the tj to be randomly distributed so their mean is not the TMRCA. Large errors in mutation rates mean one cannot simply take the max tj to be the TMRCA. Instead stochastic simulations of the branching process, using robust statistics to avoid outliers, find the most likely TMRCA, see Supplementary Material 1 (SM1) for full mathematical details.
These methods also imply a new way of computing mutation rates, see SM2. Previously, there were methods based on mitosis data or pedigree studies of family DNA projects (which gave quite different rates). We begin with 8 very large SNP projects from FTDNA using 37 markers, of course with unknown TMRCA, and find mutation rates as the fixed points of a stochastic process. These take about 3 iterates to converge. After we discard markers with mutation SD > 33% we are left with 29 markers. We find the mutation rates are close to those obtained from mitosis and nearly 1/3 the values obtained by pedigree. Despite the fact that our mutation rates are lower than those of most studies, reduction of singular lineages produces more recent TMRCA than current models.
Examples
Our clock is the only one with across the board consistent results:
Archeological finds convinced Marija Gimbutas11 to attribute Proto Indo-European (PIE) to the Yamnaya Culture c 3500BC of the Russian Steppes, see Anthony2. This is consistent with mainstream linguistic theory; some even wrote of linguistic DNA. But actual genetics was ignored because this contradicts current genetic clocks. Now the dominant European y-haplotypes are R1b1a2 & R1a1a (which like other y-haplotypes are each marked by a unique single nucleotide polymorphism (SNP) mutation). Table 2 shows the expansion times of c3700BC, similar for regions Russia, Poland, Germany and Scandinavia. The times are so close only Scandinavia is significantly later. This data is from FTDNA projects for region X, using only individuals with a named ancestor from X. These independent results agree within the standard deviation (SD), with dates matching the Corded Ware Culture, a semi-nomadic people with wagons and horses who expanded west from the Ukraine c3000BC. This is consistent with the oldest R1b1a2, R1a1a skeletons being from the Yamnaya Culture23.
An interesting intermediate step occurs between the medieval and eneolithic. The mythical Irish Chronicles relate that the O’Niall descend directly from the first Gaelic High Kings, which tradition dated c1300-1600BC. The O’Niall have the unique mutation M222 which is a branch of the haplotype L21. For L21, n = 1029, we compute TMRCA = 1600BC and SD σ = 320. These are dates for proto Celtic, i.e. what archeologists call the pre Urnfelder Cultures, c. 1300-1600BC, see SM5. Furthermore L21 is in turn a branch of haplotype P312 which we date to 2300BC. This date suggests the Bell Beaker Culture of Western Europe. Indeed the only known23 Bell Beaker genome is P312 with 14C date 2300BC.
Our method requires large data sets and many markers, which means we have to rely on data from FTDNA, finding 29 useable markers out of the standard 37 they use. In fact many researchers4 have used FTDNA data. We think our method of reduction with robust statistics solves any problems with this data. To test this we compared our results with R1a1a1 data obtained from Underhill26 with n = 974 (which involved excluding his four M420 individuals and others with missing markers), and 15 useable markers. The result was 2550BC, σ = 400, within the CI of our R1a1a results. Table 5 shows the results of extensive simulations using random subsets of our FTDNA data, for 29, 15 and 7 markers. For the same 15 markers as the Underhill26 data, the different FTDNA data gives a very similar 3300BC, σ = 840 for R1a1a, verifying the correctness of using FTDNA data. However once you get down to 7 markers the confidence interval becomes large, e.g. R1a1a gives 3400BC, σ = 1500. Also it becomes difficult to deal with outliers.
An example with few markers is the R1b1a2 data of Balaresque4. Our method (this time with 7 useable markers) gave SD > 30%, see Table 6. Now Balaresque4 used the Bayesian method BATWING29 to suggest a Neolithic origin in Anatolia. With the same Cinnioglu8 data our method gives for Turkish R1b1a2 (n = 75) a TMRCA = 5300BC, σ = 3100, i.e. anytime from the Ice Age to the Iron Age. Fortunately, once again, we find good data from FTDNA: the Armenian DNA project, see Table 3. By tradition the Armenians entered Anatolia from the Balkans c1000BC so they might not seem a good example of ancient Anatolian DNA. But some 100 generations of genetic diffusion has resulted in an Armenian distribution of Haplotypes J, G, R1b1a2 closely matching that of all Anatolians, and therefore representative of typical Anatolian DNA. We see that Anatolian R1b1a2 arrived after c3300BC, ruling out the Neolithic expansion c6000BC. When dealing with regional haplotypes, e.g. R1b1a2 in Anatolia, the TMRCA is only an upper bound for the arrival times, for the genetic spread may be carried by movements of whole peoples from some other region.
Observe that our TMRCA for Armenian G2a2b (formerly G2a3) and J2 show them to be the first Neolithic farmers from Anatolia, i.e. older than 7000BC. In Table 4 we compared J2, G2a2b for all of Western Europe (non-Armenian data). Our dates show J2 was expanding at the end of the Ice Age. Modern J2 is still concentrated in the fertile crescent, but also in disconnected regions across the Mediterranean. The old genetic model predicted a continuous wave of Neolithic farmers settling Europe. But you cannot have a continuous maritime settlement: it must be leap-frog. Also repeated resettlement from the Eastern Mediterranean has mixed ancient J2 populations, and our method gives the oldest date. On the other hand G2a2b shows exactly the dates expected from a continuous wave of Neolithic farmers across Central Europe, consistent with Neolithic skeletons showing G2a2b (e.g. the famous Iceman).
Discussion
History, archeology, evolutionary biology, not to mention epidemics (e.g. dating HIV), forensic criminology and genealogy are just some of the applications of molecular clocks. Unfortunately current clocks have been found to give only “ballpark” estimates. Our method is the only one giving accurate time, at least for the human y-chromosome verified over the period 500 - 15,000ybp. Our methods should also give accurate times for mitochondrial and other clocks.
Many geneticists thought natural selection makes mutation rates too variable to be useful. The problem is confusion between the actual biochemical process giving mutations and superimposed processes like kin selection producing apparently greater rates. Notice that the SD for our mutation rates is on average 14%, which is much smaller than that of previous rates. We believe this small SD proves the reality of the neutral mutation rates of Motoo Kimura16.
While our method is accurate for “big data”, applications to genetics, forensics, genealogy require the TMRCA between just two individuals, or between two species. Now for this “2-body problem” we cannot determine what singular lineages the branching has been through: with mutations either exaggerated or suppressed. Thus previous methods for small samples are at best unreliable. It is an important problem to find what accuracy is possible for small samples.
In checking accuracy we ran into the question of the origins of PIE. Although there are genes for language there is certainly none for any Indo-European language. Thus inferences have to be indirect. Marija Gimbutas saw patterns in symbolism and burial rituals suggesting the Yamnaya Culture was the cradle of Proto Indo-European. Also their physiology was robustly Europeanoid unlike the gracile skeletons of Neolithic Europe, but this could be nutrition and not genetic. So it was an open question whether the spread of this robust type into Western Europe in the late Neolithic marked an influx of Steppe nomads or a revolution in diet.
Reich23 observed all 6 skeletons from Yamnaya sites, c 3300BC by 14C dating, are either R1b1a2 or R1a1a. But that method could not date the origin of R1b1a2 and R1a1a. Our TMRCA shows both these haplotypes expanding at essentially the same time c3700BC. This, together with our later date for Anatolia, implies that R1b1a2 and R1a1a must have originated in the Yamnaya Culture, c 3700BC. Furthermore, considering the correlation of haplotypes R1b1a2 and R1a1a with Indo-European languages (i.e. all countries with R1b1a2 & R1a1a frequency > 50% speak Indo-European), this provides powerful evidence for the origin of Proto Indo-European.
Supplementary Material 2: Accurate Mutation rates
Any genetic clock depends on reasonably accurate mutation rates. The mitosis method looks for mutations in sperm samples. Forensics uses father-son studies. However typical rates of μ = .002 would require nearly 50,000 pairs to get an SD of 10%. Small samples have meant large errors. The pedigree approach is to study large family groups with well developed DNA/genealogy data. So inverting the KAPZ formula would yield accurate rates. However, singular lineages make this problematic. Genealogical data might give mutation rates much greater than the biochemical rates because kin selection etc. tend to exaggerate the apparent mutation rate. An inspection of 10 different sources finds mutation rates claiming SD ∼10% yet they differ from each other by up to 100%. We describe a new method.
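The figure of nearly 50,000 father-son pairs can be checked with a short calculation. The sketch below assumes each pair independently shows a mutation with probability μ, so the observed count is binomial and the relative SD of the estimated rate is sqrt((1 − μ) / (N μ)). The function name `pairs_needed` is an illustrative choice.

```python
def pairs_needed(mu, rel_sd):
    """Number of father-son pairs needed to estimate rate mu to relative SD rel_sd.

    Assumes independent pairs, each mutating with probability mu (binomial
    counts), so that rel_sd**2 = (1 - mu) / (N * mu).
    """
    return (1.0 - mu) / (mu * rel_sd ** 2)
```

With μ = .002 and a target relative SD of 10% this gives about 49,900 pairs for a single marker.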
To compute our rates we apply our theory to the large DNA projects for the SNP M222, L21, P312, U106, R1b1a2, I1, R1a1a. This avoids dealing with populations such as family DNA projects which are self selecting, i.e. only those with the correct surname, which neglects distant branches. Also we have very large samples, our average n > 1000. Greater accuracy should come from more generations and individuals. The problem is that we do not know their TMRCA.
Asymmetric Mutation
However before computing mutation rates we must consider asymmetric mutations, i.e. the left and right mutation rates μj,−1 ≠ μj,1. For a uniform stochastic process we again use the asymmetric ratio to define the asymmetric constant Aj ∈ [0, 1] for marker j. For example Aj = 0.5 is complete symmetry. Of course singularities will affect this ratio; however these occur in < 50% of markers. Thus for each marker and each SNP we compute this ratio. We find the SD for each SNP is relatively small while the difference between SNP can be large. However for each marker, using 8 SNP enables outliers to be easily removed, allowing us to use simple linear regression, i.e. the average of the Aj over the remaining SNP groups. We see that asymmetry is a real effect: 50% of the Aj are more than two SD from the symmetric value Aj = 0.5.
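The procedure of computing an asymmetric ratio per SNP group and averaging after outlier removal might be sketched as follows. The exact definition of the ratio is not reproduced in this section, so the code assumes Aj is the fraction of non-modal individuals lying above the mode, and it drops groups further than a fixed distance from the median. Both the definition and the threshold `max_dev` are illustrative assumptions.

```python
from collections import Counter
from statistics import median

def asymmetry_ratio(values):
    """Assumed definition: share of non-modal STR values that lie above the mode."""
    mode = Counter(values).most_common(1)[0][0]
    up = sum(1 for x in values if x > mode)
    down = sum(1 for x in values if x < mode)
    if up + down == 0:
        return 0.5  # no mutations observed: treat as symmetric
    return up / (up + down)

def mean_asymmetry(ratios_by_snp, max_dev=0.2):
    """Average the per-SNP ratios after dropping outliers far from the median.

    max_dev is a hypothetical cut-off, not a value taken from the paper.
    """
    centre = median(ratios_by_snp)
    kept = [a for a in ratios_by_snp if abs(a - centre) <= max_dev]
    return sum(kept) / len(kept)
```

A value of 0.5 corresponds to complete symmetry, matching the convention used for Aj above.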
Observe this is significant. The total second moment is
So using all our 33 DYS markers with our μj, we compute constants
The KAPZ formula gives variance V = μt compared to the corrected formula μt + τt². The uncorrected KAPZ gives an overestimate > 400% for > 200 generations. This effect can be nullified by using the mean instead of the mode and the variance instead of the second moment; however, failing to do so gives a large error. Furthermore other methods which assume symmetric mutations will also be inaccurate. Having estimates of the asymmetry is essential to our method because we find singular lineages by looking for asymmetry in the data. Any such anomaly needs to be significantly greater than the natural asymmetry.
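To see why asymmetry adds a term quadratic in t, consider a single marker mutating upward at rate Aμ and downward at rate (1 − A)μ. This simplified model is our assumption: the constants actually computed by the authors are not reproduced here. After t generations the displacement from the founder value has variance μt and mean (2A − 1)μt, so the second moment about the founder value is μt + ((2A − 1)μ)² t². The sketch evaluates this and the resulting inflation of a time estimate that assumes V = μt.

```python
def second_moment(mu, a, t):
    """Second moment about the founder value after t generations.

    Assumed model: up-rate a*mu, down-rate (1-a)*mu, independent mutations.
    Variance is mu*t; the drift term (2a-1)*mu*t enters squared.
    """
    drift = (2.0 * a - 1.0) * mu
    return mu * t + (drift * t) ** 2

def kapz_inflation(mu, a, t):
    """Ratio of the time inferred from V = mu*t to the true time t."""
    return second_moment(mu, a, t) / (mu * t)
```

In this toy model the inflation is 1 + (2A − 1)² μt, so it vanishes at A = 0.5 and grows with the age of the lineage. Measuring spread about the mean rather than the founder value removes the drift term, consistent with the remark above.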
Mutation Rates as a fixed Point
Next we compute mutation rates using 8 very large SNP groups. First, using the asymmetric constants we find singular lineages and reduce their effect. We take account of the error in the Aj by a bootstrap technique, which gives the variance for each frequency f (j, 0). For a given SNP k, if all markers j started their expansion at the same time, the TMRCA Tk, we could calculate the mutation rates μj directly from Tk, or rather average the 8 different μj we would obtain. However because of branching caused by extinction of lineages the different markers do not originate at the same time but at different times tj. In this case we expect these tj to be randomly distributed about the log mean over a middle set of times tj. So, for each SNP group k = 1, …, 8 define the mean time Tk, not the TMRCA but the log mean over a middle set of markers, which is less. We find that this is very stable. So for a fixed marker j the data τk,j = tj − Tk should be randomly distributed about zero over the different SNP k = 1, …, 8. However the wrong choice of μj would give a bias. In fact this is what we see if the mutation rates μj = .002 were chosen. In the appendix, graphs show the τk,j, k = 1, …, 8, bunched around a nonzero point. Thus we try to find μj so that the τk,j, k = 1, 2, …, 8, have mean zero. However the τk,j depend nonlinearly on the rates μj, as do the means Tk, k = 1, …, 8.
We find this nonlinear regression problem is solved by an iterative scheme which starts with any reasonable set of DNA rates; any reasonable choice iterates to the same final answer. So choose μj = .002 to begin. Suppose at some stage we have apparent mutation rates μj. Then, for each SNP and each marker, we solve equation (1) to obtain the apparent tj. For each SNP k = 1, …, 8 we compute the mean log time Tk. At the next step we get new rates for each SNP k.
Averaging these over k = 1, …, 8 we get our next set of mutation rates μj. However this method would be affected by a marker showing a singular lineage. Fortunately these are few in number and by comparison between the different SNP we remove the outliers. We then repeat the process, computing Tk again with the new rates, and another set of mutation rates. So we have an iterative process.
One problem is that the iterates could tend to decrease to zero or increase to ∞, as we are only calculating relative rates. To prevent this we renormalize after each iteration so the total Σμj is constant. We found the iterative scheme quickly converges to a fixed set of mutation rates, unique up to a constant factor. The CI is computed by bootstrap parametrized by the uncertainties in data and the asymmetric constants. In the figure we show the distribution of τk,1, k = 1, 2, …, 8, before and after the first iteration.
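The structure of this fixed-point iteration can be sketched in a few lines. The sketch is a deliberate simplification and makes several assumptions that are ours: the apparent time of marker j in SNP group k is taken as V[k][j] / μj, the mean time Tk is the log mean over all markers (not a middle set), the per-group update is V[k][j] / Tk, and no reduction, outlier removal or bootstrap is performed. The function name and the input `variances` are hypothetical.

```python
import math

def iterate_rates(variances, start_rate=0.002, iterations=10):
    """Fixed-point iteration for relative mutation rates (simplified sketch).

    variances[k][j]: positive variance of marker j in SNP group k
    (assumed already reduced). Returns one rate per marker, with the
    total held equal to start_rate * number_of_markers.
    """
    n_groups = len(variances)
    n_markers = len(variances[0])
    rates = [start_rate] * n_markers
    total = sum(rates)  # kept constant by renormalisation
    for _ in range(iterations):
        # Apparent times t_kj = V_kj / mu_j and their log mean T_k per group.
        mean_times = []
        for row in variances:
            logs = [math.log(v / mu) for v, mu in zip(row, rates)]
            mean_times.append(math.exp(sum(logs) / n_markers))
        # Assumed update: average over groups of V_kj / T_k.
        new_rates = [
            sum(variances[k][j] / mean_times[k] for k in range(n_groups)) / n_groups
            for j in range(n_markers)
        ]
        # Renormalise so the rates cannot drift to zero or infinity.
        scale = total / sum(new_rates)
        rates = [r * scale for r in new_rates]
    return rates
```

On noise-free synthetic data generated as rate times age, this sketch recovers the rates up to the fixed total, which mirrors the statement that the result is unique only up to a constant factor.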
The generation factor γ
This method does not give absolute mutation rates but relative mutation rates μjγ, where γ is a universal time scale constant. To find γ we apply our method to compute the T = TMRCA of three famous DNA projects and choose γ so the scaled T/γ best fits the historical record. We choose the DNA projects for the O’Niall (M222), Group A of Macdonald (R1a1a) and Group A of the Hamiltons (I1). These are large groups with characteristic DNA and fairly accurate times of origin. Of course finding one constant γ from three projects is inherently more accurate than using one project to find 33 different mutation rates. Actually, assuming a generation of 27 years these three projects yield γ = 1 with about 5% error, i.e. there is no actual need for this correction. This is a constant error (like uncalibrated 14C dating).
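One simple way to choose γ so that T/γ best fits a set of historical ages is a least-squares fit in log space, which gives the geometric mean of the ratios of computed to historical ages. This particular fitting criterion is an assumption made for illustration; the text does not specify which criterion was used.

```python
import math

def fit_gamma(computed_ages, historical_ages):
    """Scale factor gamma minimising sum of (log(T/gamma) - log(H))**2.

    computed_ages:   TMRCA values T from the clock (years before present).
    historical_ages: documented ages H of the same groups (same units).
    """
    log_ratios = [math.log(t / h) for t, h in zip(computed_ages, historical_ages)]
    return math.exp(sum(log_ratios) / len(log_ratios))
```

If the computed ages already agree with the record, the fit returns γ = 1 and no rescaling is needed.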
Thus γ is related to the length of a generation. Most researchers use 25yrs for t > 500ybp and 27yrs for t < 500ybp. Balaresque et al. used 30yrs based on Finer, who sees a 30yr generation for modern hunter-gatherers. (Although for most of the time R1b1a2 were subsistence farmers and not hunter-gatherers.) At first glance our theory allows any nominal generation as it really doesn’t matter, being included in the γ factor which we compute in years not generations. Actually it’s not as simple as that. While our three DNA projects, being post 1000AD elites, have a 27yr generation, the problem is what to do for t > 2000ybp. Now 25yrs may be appropriate for subsistence farmers but we found that singular lineages of the elite have an exaggerated effect, so 27 years seems appropriate.
Mutation Rates: Hamilton vs Mitosis and pedigree
Asymmetric rates
Figure: the log distribution of τk,1, k = 1, 2, …, 8, before iteration at marker j = 1, i.e. DYS 393, but after reduction1 (μj = .002), with the SNP groups distinguished by color.
Figure: the same distribution after just one iterate. Here 5 of our τk,1, k = 1, 2, …, 8, bunch around zero; the outliers are U106 and I1.
The iterative scheme converges to stable values very fast; 7 iterates are enough.
Supplementary Material 3: Reduction of Singular Lineages vs KAPZ
We compare results for our method (reduction of singular lineages, RSL) with KAPZ, for the same data, 29 markers and our mutation rates.
First we compare for groups with medieval expansions:
Next we compare SNP G2a2b, R1b1a2, R1a1a, I1, L21, U106, J2, P312:
RSL and KAPZ will give similar results if there is a fast expansion and thus insignificant singular lineages and branching. Actually this is to be expected sometimes, i.e. it is not surprising that the results using RSL and KAPZ for O’Niall, R1a1a, U106 are very similar.
However in other cases the KAPZ results are about 30% too old. In the case of the Hamiltons and Macdonalds absurdly so. For R1b1a2 it gives an early Neolithic age, compared with eneolithic for R1a1a, yet these have been dated to the same Yamnaya times. The KAPZ date for L21 “Celtic” is nearly 2000 years before the Urnfelder Culture.
Of course one might try to “improve” KAPZ by increasing the mutation rates by 33% so the KAPZ times are decreased by 25%. Then the medieval dates look reasonable but we find 3100BC for G2a2b which is too late. For R1a1a we would get 2300BC which is not only too late but significantly different from the 3600BC for R1b1a2. Also G2a2b would be predated by R1b1a2 even though the latter has never been found in Neolithic sites of Europe. Getting consistent results across the span of history was a problem of previous clocks.
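The arithmetic behind this rescaling follows from the KAPZ time being inversely proportional to the mutation rates. A minimal check, with an illustrative function name:

```python
def time_reduction(rate_increase):
    """Fractional decrease of a KAPZ time when all rates grow by rate_increase.

    Since the KAPZ time is proportional to 1 / (sum of rates), multiplying
    every rate by (1 + rate_increase) divides the time by the same factor.
    """
    return 1.0 - 1.0 / (1.0 + rate_increase)
```

An increase of 33% in all rates shortens every KAPZ time by about 25%, uniformly across groups, which is why it cannot repair dates that are wrong by different proportions.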
Footnotes
1 The calculations and figures for all 33 markers are shown in SM.