Abstract
Many population genetic activities, ranging from evolutionary studies to association mapping to forensic identification, rely on appropriate estimates of population structure or relatedness. All applications require recognition that quantities with an underlying meaning of allelic identity by descent are not defined in an absolute sense, but instead are made “relative to” some set of alleles other than the target set. The early Weir and Cockerham FST estimate made explicit that the reference set of alleles was across independent populations. Standard kinship estimates have an implicit assumption that pairs of individuals in a study sample, other than the target pair, are unrelated, whereas other estimates assume alleles within individuals are not identical by descent. However, populations lose independence when there is migration between them, and when individuals in a study are related it is difficult to see how they can also be non-inbred. We have therefore re-cast our treatments of population structure, relatedness and inbreeding to make explicit that the parameters of interest involve differences of probabilities of identity by descent in the target and the reference sets of alleles and so can be negative. We take the reference set to be for the population from which study individuals have been sampled. We provide simple moment estimates of these parameters, phrased in terms of allele matching within and between individuals for relatedness and inbreeding, or within and between populations for population structure. A multi-level hierarchy of alleles within individuals, alleles between individuals within populations, and alleles between populations allows a unified treatment of relatedness and population structure. Our new estimates appear to be sensitive to rare or private variants, to give indications of the effects of natural selection, and to be appropriate for use in association studies.
Introduction
We offer here a unified treatment of relatedness and population structure with an underlying frame-work of alleles being identical by descent, ibd. We follow Thompson (2013) in regarding ibd for a set of alleles as being relative to some other, reference, set: “There is no absolute measure of ibd: ibd is always relative to some reference population.” In other words, ibd implies a reference point and ibd status there is often implicitly assumed to be zero.
A function of ibd of particular interest to us is FST, which we will show below depends on ibd of pairs of alleles within populations relative to that for pairs of alleles from different populations. The uses of estimates of this quantity are widespread, and here we note a recent discussion by McTavish and Hillis (2015) of the effects of SNP ascertainment, SNP array vs whole-genome sequencing, on inferences about population history. These authors used “pairwise FST for all pairs of populations using Weir and Cockerham’s method.” We suggest that a more informative analysis may result from our population-specific FST estimates (Weir and Hill, 2002; Weir et al., 2005; Browning and Weir, 2010). Several authors (e.g. Balding and Nichols, 1995; Shriver et al., 2004; Beaumont and Balding, 2004; Gaggiotti and Foll, 2010) have discussed the advantages of working with population-specific FST values instead of single values for a set of populations or of values for each pair of populations. We show below that the usual global FST measure can be regarded as an unweighted average of population-specific values, and because it is an average it collapses the variation detectable among populations that can indicate the effects of past selection (Weir et al., 2005). The usual measure can otherwise diminish signals of population history and this diminution has become more pronounced as genetic marker data have become richer and real differences among populations have become more evident. As Astle and Balding (2009) noted “population structure and [cryptic] relatedness are different aspects of a single confounder: the unobserved pedigree defining the (often distant) relationships among the study subjects.” A similar point was made by Kang et al. (2010): “The presence of related individuals within a study sample results in sample structure, a term that encompasses population stratification and hidden relatedness.” Our goal is to provide a unified approach to charactering population structure and individual relatedness and inbreeding, both in terms of the underlying parameters and of the methods of estimation.
A consideration of “genetic sampling” (Weir, 1996) makes it clear that population mean ibd for alleles in a single population, or ibd for alleles in a single individual, at one point in time cannot be estimated only from data for that population or that individual as there would be no information about variation over replications of the descent paths from past to present. We might regard multiple loci as providing replication of the genetic sampling process or we might collect data from multiple populations. An exception is when allele frequencies and ibd status in the reference population are assumed known, as is implied for standard methods for estimating relatedness and inbreeding (e.g. Ritland, 1996; Wang, 2014; Purcell et al., 2006; Yang et al., 2011). If, instead, these methods make use of frequencies from a sample of individuals they are providing estimates of the inbreeding or coancestry ibd measures relative to those measures for individuals in the whole sample. This point was also made by Yu et al. (2006) who spoke of “adjusting the probability of identity by state between two individuals with the average probability of identity by state between random individuals” in order to address identity by descent. Other relatedness estimation methods that do not use allele frequencies (e.g. KING-robust, Manichaikul et al., 2010) are estimating ibd between individuals (coancestry) relative to that within individuals (inbreeding: assumed zero for KING-robust).
For both population structure and relatedness we propose the use of allelic matching proportions within and between individuals or populations in order to characterize ibd for an individual or a population relative to a reference set of ibd values. We use allele matching rather than heterozygosity (Nei, 1973) or components of variance (Weir and Cockerham, 1984: hereafter WC84) although the distinction is more semantic than real. Our present treatment also differs from that in WC84 by using unweighted averages of statistics over populations instead of the weighted averages that were more appropriate for the WC84 model of independent populations.
The size of current genetic studies requires computationally feasible methods for estimating relatedness between all pairs of individuals, potentially 5 billion pairs for the TOPMed project (http://www.nhlbiwgs.org). The scale of the task may well rule out maximum likelihood approaches (e.g. Thompson, 1975; Ritland, 1996; Milligan, 2003) and Bayesian methods (e.g. Gaggiotti and Foll, 2010). Moment estimates seem still to be relevant and will be presented here.
Materials and Methods
Parameter Values
We write the probability of a pair of alleles being ibd as θ without specifying the reference set of alleles. Subscripts will identify the populations or individuals from which the alleles are drawn. Subscript W will indicate an average over sets of alleles within individuals or populations, and subscript B the average over pairs of distinct individuals or populations.
Pairs of Individuals
The coancestry coefficient θXY for individuals X, Y is the probability an allele taken randomly from X is ibd to one taken randomly from Y. If individual A is ancestral to both X and Y, and if there are n individuals in the pedigree path joining X to Y through A, then where FA is the inbreeding coefficient of A and the sum is over all ancestors A and all paths joining X to Y through A. The coancestry of X with itself is θXX = (1 + FX)/2.
Pairs of Populations
For populations i, i′ the quantity θii′ is the probability of ibd for an allele from i and one from i′. The two populations may be the same, i = i′. The simplest evolutionary scenario is for finite populations of constant size, subject only to genetic drift. For two populations of sizes N1, N2 with a common ancestral population t discrete generations in the past, the current within-population values θ11, θ22 and the between-population value θ12 are
The between-population ibd probability θ12(t) at present is the same as it was, θ12(0), in the common ancestral population. As we do throughout this discussion, we introduce quantities β that measure ibd for a target pair of alleles relative to that in a convenient comparison set, here alleles between the pair of current populations (also, in this case, the common ancestral population):
By considering ibd relative to that between populations, we avoid having to know the value in the ancestral population or even having to specify that ancestral population. The two population-specific values βii(t) differ if the two populations have different sizes. We often wish to work with the averages θW (t) = [θ11(t) + θ22(t)]/2, θB(t) = θ12(t) and we write βW (t) = [β11(t) + β22(t)]/2: providing N1, N2, t are all large, and Nh is the harmonic mean of the population sizes. From now on we will regard βW as the parametric form of FST, and Reynolds et al. (1983) showed that FST serves as a measure of population distance under the pure drift model. The formulation FST = (θW − θB)/(1 − θB) makes explicit that FST is a measure of ibd within populations relative to ibd between pairs of populations.
Drift, Mutation and Migration
Non-trivial equilibria for two populations drifting apart are obtained when there is mutation, and we illustrate some aspects of our population-specific approach by considering the case of two populations exchanging alleles each generation when there is infinite-alleles mutation. The transition equations for θ1, θ2, θ12, extending those given by Maruyama (1970), are: where , the mutation rate is µ and population i: i = 1, 2 receives a fraction mi of its alleles each generation from population i′: i′ ≠ i. A consequence of these equations is that θ11(t) + θ22(t) ≥ 2θ12(t), or that θW ≥ θB and so βW = FST is positive. However, it is not necessary that each of θ11, θ22 exceeds θ12 and in Figure 1, second row, we show that mutation leads to equilibrium values of θii different from 1, and in Figure 1, third row that migration can lead to cases where θ11 > θ12 > θ22. In the absence of migration, mutation drives θ12 and β12 to zero, so that θii = βii are both positive. In Figure 2 we show the region in the space of N1, m1 values where β11 ≤ 0 ≤ β22 for fixed N2, m2, µ. Averaging over the two βii’s to work with FST hides this potential difference in sign of the βii’s.
Actual vs Predicted
θ The probabilities of ibd calculated from path-counting methods for pedigrees of individuals or from transition equations for populations can be regarded as the expected values, over evolutionary replicates, of the actual identity status of a pair of alleles. We have previously discussed the variation of actual identity about the predicted value (Hill and Weir, 2011, 2012). The variance of the actual ibd measure for two alleles is Δ − θ (Cockerham and Weir, 1983), where Δ is the joint probability of ibd for each of two pairs of alleles. The coefficient of variation of the actual coancestry for two individuals is greater than 1 for individuals with predicted coancestry θ less than 0.125 and it increases as the degree of relationship decreases. The implication of this is that, for a particular pair of populations or individuals, estimated values may not match those expected from pedigrees or transition equations. Evaluation of estimation procedures should, therefore, be performed over many replicate pairs.
Inbred Populations
The discussion so far has implicitly assumed Hardy-Weinberg equilibrium within populations and no need to differentiate pairs of alleles within individuals from those between individuals in the same population. We can relax that assumption. Two alleles taken at random from population i may be from the same or different individuals. If the sampling was without replacement, the ibd probability for two alleles from one individual is the inbreeding coefficient F for that individual, whereas sampling with replacement from one individual has ibd probability (1 + F)/2. We defer a more extensive discussion to a subsequent publication.
Estimation
Allele Frequencies
It is allelic identity in state (ibs) that can be observed, rather than identity by descent (ibd) and we now consider how ibs data can provide ibd estimates. We start by considering the frequencies of the various alleles at a locus.
We can distinguish three classes of allele frequency. The sample frequency of allele u in a sample of alleles taken from population i provides an estimate of the actual allele frequency in that population. These actual frequencies, in turn, vary about the population frequency pu, where variation refers to the values of the actual frequencies in evolutionary replicates of population i.
For ni randomly sampled alleles, where niu are of type u, has a binomial distribution . The mean of from this statistical sampling process is and the variance is . The distribution of the actual frequency is not known in general, but there is a class of evolutionary models (e.g. Balding and Nichols, 1995) that provide the Beta distribution:
The mean for this genetic sampling process is and the variance is pu(1 − pu)θi. We keep these first two genetic-sampling moments although we do not invoke the beta distribution.
The total mean and variance follow from and the complete set of first and second moments were given by Weir and Hill (2002):
These moments refer to expectations over both repeated samples from the same populations (statis-tical sampling) and over replications of the populations themselves (genetic sampling). From now on we will drop the T subscript but all expectations are total. WC84 set all within-population θii to a common value θ, and all between-population θii′ to zero. Note the assumption that all populations have the same expected allele frequencies pu, although they have different actual frequencies .
Allelic Matching
We find intuitive appeal in working with proportions of pairs of alleles that are ibs. If the sample of ni alleles from population i has niu copies of allele type u, then the matching (allele sharing) proportion for pairs of alleles drawn without replacement from population i is where
For sampling with replacement the sample-size corrections are not necessary . The average over samples from r populations is . The allele-pair matching proportion between populations i and i′ is and these have an average over pairs of samples from r populations of .
Population Structure Estimates
From Equations 1, the matching proportions have expectations where . Averaging over populations or pairs of populations:
The expectations lead immediately to simple method-of-moment estimates for any number of sampled populations, any number of alleles sampled per population, and any numbers of alleles per locus:
To the extent that the expectation of a ratio is the expectation of ratios, Equations 1,2 show that each is unbiased for the corresponding β:
Note that the pairwise estimates sum to zero by construction. Although it is not possible to find estimates for each θ when the sampled populations have correlated sample allele frequencies, it is possible to rank the ’s, and these are likely to have the same ranking as the θ’s. We now show how this approach also gives estimates of individual inbreeding coefficients and individual-pair coancestry coefficients.
Relatedness Estimates
Suppose now we have a series of individuals i; i = 1, 2,… r and we sample two alleles with replacement from an individual or one allele randomly from each of two individuals. Each individual is regarded as a population with its own actual allele frequencies. Setting the sample sizes to 2 in the matching proportions for one individual and to 1 each for pairs of individuals in the previous section leads to estimates of θii = (1 + Fi)/2 or θii′ for individual i or individuals i, i′. The expectations of these estimates are [(1 + Fi)/2 − θB]/(1 − θB) and (θii′ − θB)/(1 − θB), respectively, where θB is the average of all θii′, i ≠ i′. Inbreeding and coancestry are estimated relative to the average coancestry of all pairs of individuals in the study. Yang at al. (2010) also discuss estimates relative to the study population, and say “Estimates of relationships are always relative to an arbitrary base population in which the average relationship is zero. We use the individuals in the sample as the base so that the average relationship between all pairs of individuals is 0 and the average relationship of an individual with him- or herself is 1.” Although our estimates of pairwise relationship sum to zero, we retain the unknown value θB in their expectations. We cannot estimate θB and we may prefer to report estimates relative to those for the least related pairs as described below in Equation 6.
It is customary (e.g. Yang et al., 2011) use allelic dosage to express relatedness estimates or other analyses (Patterson et al., 2006). Writing the number of copies of allele u carried by individual i as xiu, the U-allele versions of these standard estimates are where i, i′ may be the same or different. If the allele frequencies pu are known, these estimates are unbiased for θii′. For biallelic SNPs, there is no need to sum over alleles, and the u subscripts can be dropped.
Our coancestry estimates have the same functional form as those for population structure, but they may best be compared to the standard estimates by expressing allelic matching proportions in terms of allelic dosages. Noting that individual matching proportions are 1 and 0 for homozygotes and heterozygotes, respectively, and that matching proportions for pairs of individuals are 1 when they are the same homozygote, 0.5 when they are the same heterozygote or one is homozygous and the other heterozygous with one allele shared with the first, and 0 when they have no shared alleles:
In particular, for SNPs, where xi are the dosages for, say, the reference allele. Our relatedness and inbreeding estimates are where
Storey and Ochoa (accompanying papers) have equivalent estimates. Their expressions are a little different because their reference is for all pairs of alleles in a sample, including those within individuals, whereas ours is for pairs of alleles in different individuals. Astle and Balding (2009, equation 2.3) gave similar estimates although, in effect, they set θB, the average coancestry of all pairs of individuals in a sample, to zero.
Combining Over Loci
Single-locus analyses do not provide meaningful results, and combining estimates over loci l has often been considered in the literature. In a parallel discussion of weighting over alleles u at a single locus, Ritland (1996) considered weights wu chosen to minimize variance.
Two extreme weights are wl = 1 and . The first may be called “unweighted” and the second “weighted”. In an obvious notation
Note the parallel to averaging over alleles in Equations 3. Bhatia et al. (2013) refer to the first estimate as the “average of ratios” and the second as the “ratio of averages.” WC84 advocated the second, with justification given in the Appendix to that paper, as did Bhatia et al.
The unweighted estimate is unbiased for all allele frequencies but is susceptible to the effects of rare variants, when can be very small. Rare variants may have little effect on the weighted average , and the variance of the estimate is seen in simulations to be less than for the unweighted average, but it is unbiased only if every locus has the same value of the ibd probabilities. A more extensive discussion was given in the Appendix of WC84 for population structure, and by Ritland (1996) for inbreeding and relatedness.
Private Alleles
Current sequence-based studies are revealing large numbers of low-frequency variants, including those found in only one population. These private alleles were identified by Slatkin (1985) and Mathieson and McVean (2012) as being of particular interest. They are very frequent in the 1000 genomes project data (The 1000 Genomes Project Consortium, 2010). If x1 is the sample count of an allele observed only in population 1 of r populations the sample matching proportions are So the β estimates are
The estimate of FST for a private allele is its own-population sample frequency, but the population-specific value for its own population ranges from approximately −r + 1 when x1 = 1 to 1 when x1 = n1. This amplifies the comment “populations can display spatial structure in rare variants, even when Wright’s fixation index FST is low” of Mathieson and McVean (2012). A population with many private alleles at low to intermediate frequencies will thus likely have a negative , and how negative will depend on how many populations have been sampled. Note that this implies must be allowed to go negative, whereas Bayesian estimators of population specific FST are forced to belong to [0, 1], although this assumption can be relaxed (Ritland, 1996).
RESULTS
Population Structure
We have conducted a series of simulations to evaluate the performance of our FST estimates, and we have looked at 1000 Genomes SNP data to explore the role of rare variants on the estimates. Some of the simulations were conducted with sim.genot.metapop.t available in the hierfstat package (Goudet, 2005). The migration model we have used allows for a matrix of migration rates between each pair of populations, and the mutation model allows for multiple alleles at a locus. The notation for a two-population model was given above.
Model 1. Same Migration Rates, Different Population Sizes
We considered two populations, with sizes N1 = 100, N2 = 1, 000 and migration rates m1 = m2 = 0.01. The mutation rate was µ = 10−6. After 400 generations, the β’s have values β1 = 0.156, β2 = −0.037 and β12 = 0.059. We simulated 50 individuals from each population under this scenario, with 1,000 loci and up to 20 alleles per locus. From the resulting allelic data we obtained estimates, and 95% confidence intervals by bootstrapping over loci. The results are shown in Table 1. The predicted values are contained in the confidence intervals, and there are negative values for both the parametric and the estimated value of β2. Note that we cannot estimate β12 with data from two populations.
Model 2. Continent-Island Model
In this scenario we have an infinite continent supplying a proportion m = 0.01 of the alleles independently to populations 1 and 2, still with sizes N1 = 100, N2 = 1, 000. There is no migration between the two populations, so θ12 = 0. The predicted values and estimated values after 400 generations are shown in Table 1.
Model 3. Migrant-pool Island Model
In this scenario, each population contributes to a migrant pool, from which migrant alleles are drawn. Among the migrant alleles in the case of two populations, half of the “migrant alleles” will in fact be resident alleles if the gametic pool is composed of the same proportion of alleles from each island, independent of its size. With otherwise the same parameter values, the predicted values and our estimates after 400 generations are shown in Table 1.
Model 4. Different Population Sizes, Different Migration Rates
We return to the two-populations model described above, but now with N1 = 10, 000, N2 = 100 and different migration rates m1 = 0.01, m2 = 0. Predicted values after 1,000 generations are shown in Figure 3, and our estimates in Table 1.
The results in Table 1 show generally good behavior of our β estimates. In Figure 3 we show the estimates for 10 different time points (independent replicates) for Model 4. As time increased, the number of polymorphic loci decreased. In generations 600, 800, 1000 the numbers of polymorphic loci had dropped from 1,000 to 712, 349 and 151 respectively and the quality of the estimates decreased: higher bias and higher variance.
Rare Alleles
For r populations with total sample size nT, and with x1 copies of an allele private to population 1, the total count for this alleles is xT = x1 and assuming similar sample sizes for each sample. In Figure 4 we display as a function of allele frequencies for SNPs located on chromosome 2 in the 1000 Genomes project. Individuals were grouped by regions (Africa, Europe, South Asia, East Asia and the Americas). The drawn line corresponds to βW = 5pT. The initial linear segment corresponds to alleles that are present in one continent only. βW’s start departing from this line for allele counts larger than 80, or equivalently, for worldwide frequencies larger than ≈ 0.01, given the sampled chromosome number of 2, 426.
When a new allele appears, it will be present in one population only. We expect most if not all rare alleles to be private alleles, and thus the expected values for FST (βW) for these rare alleles are their (sub-population) frequencies. When βW starts departing from the allele frequency, it implies that some scattering has been happening. In species with a lot of migration, this will happen at low frequencies, whereas the species that are more sedentary should show a one to one relation between sub-population allele frequencies and βW for a larger range of their site frequency spectrum.
In Buckleton et al. (2016) we gave population-specific FST estimates for a set of 446 populations, using published data for 24 microsatellite loci collected for forensic purposes. We showed in that paper how the choice of a reference set of populations can affect results. For a set of African populations, the average within-population matching proportion was and the average between-population-pair averages were within the African region and for all pairs of populations. There is a larger FST for the set of African populations with Africa as a reference set than there is with the world as a reference set. The opposite was found for a collection of Inuit populations: the average within-population matching proportion was whereas the average between-population-pair matching proportions were for pairs within the Inuit group and for all pairs in the study: so FST is less with Inuit as a reference set than with the world as a reference set .
Inbreeding and Relatedness
To check on the validity of our estimators of individual inbreeding and coancestry coefficients, we simulated data for a range of 11 coancestries: (i/32: i = 0, 1, 2,…, 10). Using the ms software (Hudson, 2002), we generated data from an island model with two populations exchanging N m = 1 migrant per generation. We simulated 5,000 independent loci, read either as haplotypes (5,000) or as SNPs (approximately 80,000 polymorphic sites for the founders). We then chose 20 individuals from one of these populations and let them mate at random, without selfing. We did not assign or consider sex for these 20 founders. The number of offspring per mating was Poisson with mean of five. These offspring were than allowed to mate at random, without selfing, to produce families of size Poisson with mean three. By keeping records of all matings we could generate the pedigree-based inbreeding and coancestry values for all 135 individuals: founder, their offspring and their grand-offspring. The pedigree-based coancestries for all 9,045 pairs of individuals are shown in Figure 5, although we note (Hill and Weir, 2011) that the actual values have variation about expected or pedigree values. We used the same pedigree to simulate another data set, where this time 10 founders were coming from the first population and the other 10 from the second population, thus creating admixture among the children and grand-children.
The left hand plot of Figure 6 reflects the summing to zero by construction of the coancestries, whereas the pedigree coancestries are necessarily non-negative. The right hand plot shows a “correction” of the estimates: we took the set of smallest values in the left hand plot to represent the unrelated (relative to the assumed-unrelated) founders. If we write as the average value in this distribution then our corrected values are
The corrected estimates are clearly close to the pedigree values. However, we are not sure if it is necessary, in general, to undertake this correction process. Whether or not it is applied, the values are still relative to those among all pairs of individuals in a study sample. In general, we will not have any individuals identified for which it is justified to assume zero relatedness or zero inbreeding, and we note the comment by Thompson (2013) “in most populations IBD within individuals is at least as great as IBD between.”
The distributions of estimates in Figure 7 are tightly clustered around 11 values, corresponding to the 11 distinct pedigree values i/16, i = 0, 1, 2 … 11. A contrasting result is shown in Figure 8, for the CGTA estimates, calculated as weighted averages over loci (in the sense of equation 3 by taking the ratio of two sums over loci).
There is a current tendency in genome wide association studies (GWAS) to restrict the SNPs used in relatedness estimation to having a minor allele frequency (MAF) above some threshold. For example, the KING manual (http://people.virginia.edu/∼wc9c/KING/manual.html) lists a parameter minMAF to specify the minimum minor allele frequency to select SNPs for relationship inference in homogeneous populations. The thought is that lesser frequencies give rise to biased values, but that is not likely the case if “ratio of averages” estimates are used. To illustrate the effect of MAF filtering, we applied four different thresholds for our simulated data and we show the means and standard deviations for estimates for each of nine pedigree values in Table 2. The estimates are the corrected values – i.e. relative to an assigned value of zero for the least-related class. There is clear evidence for the merits of retaining all SNPs, both in terms of bias and variance.
We continued a comparison of values by applying the estimates described by Wang (2014) and computed using the related R package (Pew et al., 2015), listed in Table 3. Additionally, related offers maximum likelihood estimators, derived by Milligan (2003) and Wang and Santure (2007). They are not computed here, because they require substantial computing time, which rules them out for genomic data.
In Figure 9 we display box plots of coancestry estimates for eight alternative estimates, displayed according to 9 pedigree values. The β estimates are not corrected, yet have good bias and variance properties compared to other estimates.
In Figure 10 we compare our β estimates with those from GCTA for admixed individuals with two ancestral populations. We used the same pedigree as in the section above, but took as founders 10 individuals from each of the two populations. Coancestries were calculated for all pairs of individuals in the pedigree. Figure 10 illustrates the accuracy of our β estimate compared with GCTA using the coancestries among founders. The ’s for pairs of founders from the same population are tightly distributed around 0.015, while ’s for pairs of individuals one from each population are tightly distributed around −0.11. The distribution for the same two categories for the GCTA estimators is wider, in particular for pairs of individuals originating from the same population.
DISCUSSION
A Unified Approach
Although there has been general recognition that family and evolutionary relatedness are just two ends of a continuum, we are not aware of previous estimates of population structure quantities such as FST or individual-pair coancestries that rest on this common framework. We have presented estimates that apply equally well to populations and individuals. While their statistical properties remain to be fully explored, it is reassuring to see how well they performed in the few simulations presented here.
Although individual-specific inbreeding coefficients and individual-pair-specific coancestry coefficients are used routinely in association studies, we have not seen widespread adoption of population-specific FST values in evolutionary studies. We have shown here, theoretically and empirically, that these values can differ substantially among populations. This may simply reflect population size and migration rate differences, but different values may also provide signatures of natural selection.
There is also general understanding that identity by descent is a relative concept, rather than an absolute concept. This understanding has not led to an apparent recognition that the usual estimates of inbreeding and kinship are not unbiased for expected or pedigree values. Replacing population allele frequencies by sample values leads to bias in the usual estimates, regardless of sample size. As the allele frequencies enter GCTA estimates, for example, as squares the expected values of the estimates depend on the variances these frequencies. These, in turn, depend on the parameters being estimated.
We also stress that all allelic variants, whatever their frequencies, need to be included in the estimation of population structure and inbreeding or relatedness. The estimates certainly depend on the allele frequencies, and restricting the range of frequencies used may reveal features of interest, but the underlying ibd parameters do not depend on the frequencies. Exclusion of some alleles based on their frequencies will lead to biased estimates of the parameters.
Previous Estimates
Weir and Cockerham Estimates of FST
The FST estimate of WC84 has been widely adopted and it performs well for the model stated in that paper: data from a series of independent populations with equivalent histories. In the present notation, WC84 assumed θii = θ, θii′ = 0 for all populations i and all i′ ≠ i. The estimate was designed to be unbiased for any number of populations, any sample sizes and any number of alleles per locus. The analysis was a weighted one over popu-lation: the average allele frequencies for a study had sample size weights, . Although our β estimates do not make explicit mention of allele frequencies, there is implicit use of sample frequencies that are unweighted averages over individuals or populations.
Weighting over populations has been discussed by Tukey (1957) and Robertson (1962). Those authors were concerned with bias and variance and they used the language of variance components, within and between populations. For allele u these components were given as (1 − θ)pu (1 − pu) and θpu (1 − pu), respectively, by WC84. Tukey said “In practice, we select two quadratic func-tions by some scheme involving intuition, find how their average values are expressed linearly in terms of the variance components, and then form two linear combinations of the original quadratics whose average values are the variance components. These linear combinations are then our esti-mates. Much flexibility is possible.” The estimates of WC84, Weir and Hill (2002) and Bhatia et al. (2013) all have this structure although ratios of linear combinations are taken to remove the allele frequency parameters. Tukey went on to say that the weights wi = ni (in the present notation) “gives the customary analyses, which treat observations as important and columns [i.e. populations] as unimportant.” Further, “the choice wi = 1 … treat the columns as important. This [unweighted] approach is appropriate when the column variance component is large compared with the within variance component.” Robertson (1962) also pointed to sample-size weights for small between-population variance components and equal weights for large values. Bhatia et al. (2013) were concerned with unequal FST values so their use of equal weights is consistent with Turkey’s statements. Their work provides simple averages of the different FST’s as opposed to averages weighted by sample sizes. For unequal FST’s and unequal sample sizes, Weir and Hill (2002) said “the usual moment estimate [with sample-size weights] is of a complex function [of the FST’s].” In our current model of unequal θi’s and non-zero θii′’s we agree that unweighted analyses (population weights of 1) are appropriate, and that is what we have used in this paper. We note that Tukey’s “flexibility” in the choice of moment estimators, phrased in terms of weights, does not arise with maximum likelihood approaches. If sample allele frequencies are taken to be approximately normally distributed then REML methods give appropriate and unique estimates.
What are the consequences of using the WC84 estimates when the current model is more appropriate? We can show that the expected value of the Weir and Cockerham estimate is
This expression uses three functions of sample sizes: and nc . The two weighted averages are and . The quantity Q is . For equal sample sizes, ni = n, or for equal values of FST, . Under these circumstances and we find the WC84 estimator performs well unless the θii ‘s and /or the ni’s are quite different. We stress though that it is (θW – θB)/(1 – θB) being estimated.
Nei Estimates of FST
Although we have phrased estimates in terms of matching proportions, we note that they are the complements of “heterozygosities” . Our approach uses , the average population-pair allele matching, whereas most previous treatments, from Nei (1973) onwards, use total heterozygosities where is the average sample allele frequency over populations: . From Equations 1, the variance of is
For large sample sizes and Nei’s GST quantity and its expectation, in our notation, are which reduce to and as r becomes large. Otherwise, the expectation of GST depends on the number r of populations. This expectation is bounded above by one, contrary to the claim of Bhatia et al. (2013). Bounds on FST, when that is defined as , were given by Jakobsson et al. (2013).
Nei and Chesser (1983) and Nei (1987) modified Nei’s earlier approach to remove the effects of the number of populations. Jost (2008) pointed out that GST does not provide a good measure of differentiation among populations, where differentiation reflects the collection of allele frequencies , or their sample values . We regard θ’s as indicators of evolutionary history, rather than of allele frequencies, and we interpret them as probabilities of pairs of alleles being identical by descent. Jost introduced D = (HB − HW)/(1 − HW) or D = (θW − θB)/θW as a measure of differentiation among populations. For the two-population drift scenario without mutation D, unlike βW, does not have a simple dependence on time and so does not serve as a measure of evolutionary distance.
CGTA Estimates of Relatedness
The expressions in Equation 3 provide unbiased estimates of θii = (1 + Fi)/2 and θii′, i ≠ i′ when the allele frequencies are known. When study sample allele frequencies are used the expectations of these expressions, for one locus and large samples, are where . The extent of bias depends on how different the average coancestry of a target individual with all other study individuals is from the average coancestry of all pairs of study individuals. We stress that these estimates are not unbiased for θii′.
Association Mapping
One of our motivations for seeking a unified characterization of population structure and relatedness is that both phenomena affect samples used in association mapping. Many analyses, such as those in GCTA, use mixed linear models with an estimated Genetic Relatedness Matrix A being used in the formulation of the variance-covariance matrix for trait values of the study individuals. For a trait with additive genetic variance and no other genetic variance components, the variance matrix for individuals includes the term and A has diagonal elements (1 + Fi)/2 and off-diagonal elements θii′. We suggest that these be estimated by and , to accommodate any (hidden) relatedness and inbreeding among study subjects. We are less sure about the common practice of also using principal components of A as fixed effects to accommodate population structure, especially when A uses ’s, and we see the need for further investigation.
Population History
We also see the need for further exploration of the role of population-specific FST estimates in evolutionary genetic studies, given the generally unrecognized prevalence of negative expected values for populations with correlated allele frequencies shown in Figure 1 and the relationship of estimates with the site-frequency spectrum suggested in Figure 4.
Conclusion
We have presented moment estimators for the probabilities that pairs of alleles, taken from individuals or from populations, are identical by descent relative to the ibd probabilities for alleles from all pairs of individuals or populations in a study. By identifying the reference set of alleles as those in the current study we allow for negative values for population structure or relatedness parameters and their estimates. Alleles may have smaller ibd probabilities within some populations than between all pairs of populations in a study, for example. Some pairs of individuals in a study may be less related than the average for all pairs. Our estimates are phrased in terms of the proportions of pairs of alleles, within and between populations or individuals, that are of the same type (ibs).
For sets of populations, we advocate the use of population-specific FST values as these more accurately reflect population history. For sets of individuals, our estimates seem to behave at least as well as those given previously. We note that our estimates have the same logical basis, and algebraic expressions, for populations and for individuals. The chief novelty of our approach is in allowing for allele frequencies to be correlated among populations when characterizing population structure, and correlated among all individuals when characterizing individual-pair relatedness.
Acknowledgments
This work was supported in part by grants GM 075091 and GM 099568 from the US National Institutes of Health and by grant IZK0Z3 157867 from the Swiss National Science Foundation.