A unified characterization of population structure and relatedness

Bruce S. Weir; Jérôme Goudet

doi:10.1101/088260

Abstract

Many population genetic activities, ranging from evolutionary studies to association mapping to forensic identification, rely on appropriate estimates of population structure or relatedness. All applications require recognition that quantities with an underlying meaning of allelic identity by descent are not defined in an absolute sense, but instead are made “relative to” some set of alleles other than the target set. The early Weir and Cockerham F_ST estimate made explicit that the reference set of alleles was across independent populations. Standard kinship estimates have an implicit assumption that pairs of individuals in a study sample, other than the target pair, are unrelated, whereas other estimates assume alleles within individuals are not identical by descent. However, populations lose independence when there is migration between them, and when individuals in a study are related it is difficult to see how they can also be non-inbred. We have therefore re-cast our treatments of population structure, relatedness and inbreeding to make explicit that the parameters of interest involve differences of probabilities of identity by descent in the target and the reference sets of alleles and so can be negative. We take the reference set to be for the population from which study individuals have been sampled. We provide simple moment estimates of these parameters, phrased in terms of allele matching within and between individuals for relatedness and inbreeding, or within and between populations for population structure. A multi-level hierarchy of alleles within individuals, alleles between individuals within populations, and alleles between populations allows a unified treatment of relatedness and population structure. Our new estimates appear to be sensitive to rare or private variants, to give indications of the effects of natural selection, and to be appropriate for use in association studies.

Introduction

We oﬀer here a unified treatment of relatedness and population structure with an underlying frame-work of alleles being identical by descent, ibd. We follow Thompson (2013) in regarding ibd for a set of alleles as being relative to some other, reference, set: “There is no absolute measure of ibd: ibd is always relative to some reference population.” In other words, ibd implies a reference point and ibd status there is often implicitly assumed to be zero.

A function of ibd of particular interest to us is F_ST, which we will show below depends on ibd of pairs of alleles within populations relative to that for pairs of alleles from diﬀerent populations. The uses of estimates of this quantity are widespread, and here we note a recent discussion by McTavish and Hillis (2015) of the eﬀects of SNP ascertainment, SNP array vs whole-genome sequencing, on inferences about population history. These authors used “pairwise F_ST for all pairs of populations using Weir and Cockerham’s method.” We suggest that a more informative analysis may result from our population-specific F_ST estimates (Weir and Hill, 2002; Weir et al., 2005; Browning and Weir, 2010). Several authors (e.g. Balding and Nichols, 1995; Shriver et al., 2004; Beaumont and Balding, 2004; Gaggiotti and Foll, 2010) have discussed the advantages of working with population-specific F_ST values instead of single values for a set of populations or of values for each pair of populations. We show below that the usual global F_ST measure can be regarded as an unweighted average of population-specific values, and because it is an average it collapses the variation detectable among populations that can indicate the eﬀects of past selection (Weir et al., 2005). The usual measure can otherwise diminish signals of population history and this diminution has become more pronounced as genetic marker data have become richer and real diﬀerences among populations have become more evident. As Astle and Balding (2009) noted “population structure and [cryptic] relatedness are diﬀerent aspects of a single confounder: the unobserved pedigree defining the (often distant) relationships among the study subjects.” A similar point was made by Kang et al. (2010): “The presence of related individuals within a study sample results in sample structure, a term that encompasses population stratification and hidden relatedness.” Our goal is to provide a unified approach to charactering population structure and individual relatedness and inbreeding, both in terms of the underlying parameters and of the methods of estimation.

A consideration of “genetic sampling” (Weir, 1996) makes it clear that population mean ibd for alleles in a single population, or ibd for alleles in a single individual, at one point in time cannot be estimated only from data for that population or that individual as there would be no information about variation over replications of the descent paths from past to present. We might regard multiple loci as providing replication of the genetic sampling process or we might collect data from multiple populations. An exception is when allele frequencies and ibd status in the reference population are assumed known, as is implied for standard methods for estimating relatedness and inbreeding (e.g. Ritland, 1996; Wang, 2014; Purcell et al., 2006; Yang et al., 2011). If, instead, these methods make use of frequencies from a sample of individuals they are providing estimates of the inbreeding or coancestry ibd measures relative to those measures for individuals in the whole sample. This point was also made by Yu et al. (2006) who spoke of “adjusting the probability of identity by state between two individuals with the average probability of identity by state between random individuals” in order to address identity by descent. Other relatedness estimation methods that do not use allele frequencies (e.g. KING-robust, Manichaikul et al., 2010) are estimating ibd between individuals (coancestry) relative to that within individuals (inbreeding: assumed zero for KING-robust).

For both population structure and relatedness we propose the use of allelic matching proportions within and between individuals or populations in order to characterize ibd for an individual or a population relative to a reference set of ibd values. We use allele matching rather than heterozygosity (Nei, 1973) or components of variance (Weir and Cockerham, 1984: hereafter WC84) although the distinction is more semantic than real. Our present treatment also diﬀers from that in WC84 by using unweighted averages of statistics over populations instead of the weighted averages that were more appropriate for the WC84 model of independent populations.

The size of current genetic studies requires computationally feasible methods for estimating relatedness between all pairs of individuals, potentially 5 billion pairs for the TOPMed project (http://www.nhlbiwgs.org). The scale of the task may well rule out maximum likelihood approaches (e.g. Thompson, 1975; Ritland, 1996; Milligan, 2003) and Bayesian methods (e.g. Gaggiotti and Foll, 2010). Moment estimates seem still to be relevant and will be presented here.

Materials and Methods

Parameter Values

We write the probability of a pair of alleles being ibd as θ without specifying the reference set of alleles. Subscripts will identify the populations or individuals from which the alleles are drawn. Subscript W will indicate an average over sets of alleles within individuals or populations, and subscript B the average over pairs of distinct individuals or populations.

Pairs of Individuals

The coancestry coeﬃcient θ_XY for individuals X, Y is the probability an allele taken randomly from X is ibd to one taken randomly from Y. If individual A is ancestral to both X and Y, and if there are n individuals in the pedigree path joining X to Y through A, then where F_A is the inbreeding coeﬃcient of A and the sum is over all ancestors A and all paths joining X to Y through A. The coancestry of X with itself is θ_XX = (1 + F_X)/2.

Pairs of Populations

For populations i, i^′ the quantity θ_ii′ is the probability of ibd for an allele from i and one from i^′. The two populations may be the same, i = i^′. The simplest evolutionary scenario is for finite populations of constant size, subject only to genetic drift. For two populations of sizes N₁, N₂ with a common ancestral population t discrete generations in the past, the current within-population values θ₁₁, θ₂₂ and the between-population value θ₁₂ are

The between-population ibd probability θ₁₂(t) at present is the same as it was, θ₁₂(0), in the common ancestral population. As we do throughout this discussion, we introduce quantities β that measure ibd for a target pair of alleles relative to that in a convenient comparison set, here alleles between the pair of current populations (also, in this case, the common ancestral population):

By considering ibd relative to that between populations, we avoid having to know the value in the ancestral population or even having to specify that ancestral population. The two population-specific values β_ii(t) diﬀer if the two populations have diﬀerent sizes. We often wish to work with the averages θ_W (t) = [θ₁₁(t) + θ₂₂(t)]/2, θ_B(t) = θ₁₂(t) and we write β_W (t) = [β₁₁(t) + β₂₂(t)]/2: providing N₁, N₂, t are all large, and N_h is the harmonic mean of the population sizes. From now on we will regard β_W as the parametric form of F_ST, and Reynolds et al. (1983) showed that F_ST serves as a measure of population distance under the pure drift model. The formulation F_ST = (θ_W − θ_B)/(1 − θ_B) makes explicit that F_ST is a measure of ibd within populations relative to ibd between pairs of populations.

Drift, Mutation and Migration

Non-trivial equilibria for two populations drifting apart are obtained when there is mutation, and we illustrate some aspects of our population-specific approach by considering the case of two populations exchanging alleles each generation when there is infinite-alleles mutation. The transition equations for θ₁, θ₂, θ₁₂, extending those given by Maruyama (1970), are: where , the mutation rate is µ and population i: i = 1, 2 receives a fraction m_i of its alleles each generation from population i^′: i^′ ≠ i. A consequence of these equations is that θ₁₁(t) + θ₂₂(t) ≥ 2θ₁₂(t), or that θ_W ≥ θ_B and so β_W = F_ST is positive. However, it is not necessary that each of θ₁₁, θ₂₂ exceeds θ₁₂ and in Figure 1, second row, we show that mutation leads to equilibrium values of θ_ii diﬀerent from 1, and in Figure 1, third row that migration can lead to cases where θ₁₁ > θ₁₂ > θ₂₂. In the absence of migration, mutation drives θ₁₂ and β₁₂ to zero, so that θ_ii = β_ii are both positive. In Figure 2 we show the region in the space of N₁, m₁ values where β₁₁ ≤ 0 ≤ β₂₂ for fixed N₂, m₂, µ. Averaging over the two β_ii’s to work with F_ST hides this potential diﬀerence in sign of the β_ii’s.

Figure 1.

Eﬀects of Drift, Mutation and Migration. For all panels, N₁ = 10000, N₂ = 100 First row: drift only (no mutation nor migration). θ₁, θ₂’s and β’s tend to 1,θ₁₂ = 0.000. Second row: Drift and Mutation µ = 10⁻³, m₁ = m₂ = 0. θ’s and β’s have positive limits less than 1. At equilibrium, θ₁ = 0.024, θ₂ = 0.714, θ₁₂ = 0.000, β₁ = 0.024, β₂ = 0.714, β_W = 0.369. Third Row: Drift. Mutation and Migration. µ = 10⁻³, m₁ = 10⁻², m₂ = 0. θ’s positive and less than 1, β_W is positive but β_ii’s may be negative. At equilibrium, θ₁ = 0.543, θ₂ = 0.714, θ₁₂ = 0.596, β₁ = −0.131, β₂ = 0.292, β_W = 0.080.

Figure 2.

The region above and to the right of the red line has equilibrium values of θ₁ ≤ θ₁₂ ≤ θ₂, i.e. β₁ ≤ 0 ≤ β₂. In that region a pair of alleles within population 1 has a smaller probability of ibd than does an allele from population 1 with an allele from population 2.

Actual vs Predicted

θ The probabilities of ibd calculated from path-counting methods for pedigrees of individuals or from transition equations for populations can be regarded as the expected values, over evolutionary replicates, of the actual identity status of a pair of alleles. We have previously discussed the variation of actual identity about the predicted value (Hill and Weir, 2011, 2012). The variance of the actual ibd measure for two alleles is Δ − θ (Cockerham and Weir, 1983), where Δ is the joint probability of ibd for each of two pairs of alleles. The coefficient of variation of the actual coancestry for two individuals is greater than 1 for individuals with predicted coancestry θ less than 0.125 and it increases as the degree of relationship decreases. The implication of this is that, for a particular pair of populations or individuals, estimated values may not match those expected from pedigrees or transition equations. Evaluation of estimation procedures should, therefore, be performed over many replicate pairs.

Inbred Populations

The discussion so far has implicitly assumed Hardy-Weinberg equilibrium within populations and no need to diﬀerentiate pairs of alleles within individuals from those between individuals in the same population. We can relax that assumption. Two alleles taken at random from population i may be from the same or diﬀerent individuals. If the sampling was without replacement, the ibd probability for two alleles from one individual is the inbreeding coefficient F for that individual, whereas sampling with replacement from one individual has ibd probability (1 + F)/2. We defer a more extensive discussion to a subsequent publication.

Estimation

Allele Frequencies

It is allelic identity in state (ibs) that can be observed, rather than identity by descent (ibd) and we now consider how ibs data can provide ibd estimates. We start by considering the frequencies of the various alleles at a locus.

We can distinguish three classes of allele frequency. The sample frequency of allele u in a sample of alleles taken from population i provides an estimate of the actual allele frequency in that population. These actual frequencies, in turn, vary about the population frequency p_u, where variation refers to the values of the actual frequencies in evolutionary replicates of population i.

For n_i randomly sampled alleles, where n_iu are of type u, has a binomial distribution . The mean of from this statistical sampling process is and the variance is . The distribution of the actual frequency is not known in general, but there is a class of evolutionary models (e.g. Balding and Nichols, 1995) that provide the Beta distribution:

The mean for this genetic sampling process is and the variance is p_u(1 − p_u)θ_i. We keep these first two genetic-sampling moments although we do not invoke the beta distribution.

The total mean and variance follow from and the complete set of first and second moments were given by Weir and Hill (2002):

These moments refer to expectations over both repeated samples from the same populations (statis-tical sampling) and over replications of the populations themselves (genetic sampling). From now on we will drop the T subscript but all expectations are total. WC84 set all within-population θ_ii to a common value θ, and all between-population θ_ii′ to zero. Note the assumption that all populations have the same expected allele frequencies p_u, although they have diﬀerent actual frequencies .

Allelic Matching

We find intuitive appeal in working with proportions of pairs of alleles that are ibs. If the sample of n_i alleles from population i has n_iu copies of allele type u, then the matching (allele sharing) proportion for pairs of alleles drawn without replacement from population i is where

For sampling with replacement the sample-size corrections are not necessary . The average over samples from r populations is . The allele-pair matching proportion between populations i and i′ is and these have an average over pairs of samples from r populations of .

Population Structure Estimates

From Equations 1, the matching proportions have expectations where . Averaging over populations or pairs of populations:

The expectations lead immediately to simple method-of-moment estimates for any number of sampled populations, any number of alleles sampled per population, and any numbers of alleles per locus:

To the extent that the expectation of a ratio is the expectation of ratios, Equations 1,2 show that each is unbiased for the corresponding β:

Note that the pairwise estimates sum to zero by construction. Although it is not possible to find estimates for each θ when the sampled populations have correlated sample allele frequencies, it is possible to rank the ’s, and these are likely to have the same ranking as the θ’s. We now show how this approach also gives estimates of individual inbreeding coefficients and individual-pair coancestry coefficients.

Relatedness Estimates

Suppose now we have a series of individuals i; i = 1, 2,… r and we sample two alleles with replacement from an individual or one allele randomly from each of two individuals. Each individual is regarded as a population with its own actual allele frequencies. Setting the sample sizes to 2 in the matching proportions for one individual and to 1 each for pairs of individuals in the previous section leads to estimates of θ_ii = (1 + F_i)/2 or θ_ii′ for individual i or individuals i, i′. The expectations of these estimates are [(1 + F_i)/2 − θ_B]/(1 − θ_B) and (θ_ii′ − θ_B)/(1 − θ_B), respectively, where θ_B is the average of all θ_ii′, i ≠ i′. Inbreeding and coancestry are estimated relative to the average coancestry of all pairs of individuals in the study. Yang at al. (2010) also discuss estimates relative to the study population, and say “Estimates of relationships are always relative to an arbitrary base population in which the average relationship is zero. We use the individuals in the sample as the base so that the average relationship between all pairs of individuals is 0 and the average relationship of an individual with him- or herself is 1.” Although our estimates of pairwise relationship sum to zero, we retain the unknown value θ_B in their expectations. We cannot estimate θ_B and we may prefer to report estimates relative to those for the least related pairs as described below in Equation 6.

It is customary (e.g. Yang et al., 2011) use allelic dosage to express relatedness estimates or other analyses (Patterson et al., 2006). Writing the number of copies of allele u carried by individual i as x_iu, the U-allele versions of these standard estimates are where i, i^′ may be the same or diﬀerent. If the allele frequencies p_u are known, these estimates are unbiased for θ_ii′. For biallelic SNPs, there is no need to sum over alleles, and the u subscripts can be dropped.

Our coancestry estimates have the same functional form as those for population structure, but they may best be compared to the standard estimates by expressing allelic matching proportions in terms of allelic dosages. Noting that individual matching proportions are 1 and 0 for homozygotes and heterozygotes, respectively, and that matching proportions for pairs of individuals are 1 when they are the same homozygote, 0.5 when they are the same heterozygote or one is homozygous and the other heterozygous with one allele shared with the first, and 0 when they have no shared alleles:

In particular, for SNPs, where x_i are the dosages for, say, the reference allele. Our relatedness and inbreeding estimates are where

Storey and Ochoa (accompanying papers) have equivalent estimates. Their expressions are a little diﬀerent because their reference is for all pairs of alleles in a sample, including those within individuals, whereas ours is for pairs of alleles in diﬀerent individuals. Astle and Balding (2009, equation 2.3) gave similar estimates although, in eﬀect, they set θ_B, the average coancestry of all pairs of individuals in a sample, to zero.

Combining Over Loci

Single-locus analyses do not provide meaningful results, and combining estimates over loci l has often been considered in the literature. In a parallel discussion of weighting over alleles u at a single locus, Ritland (1996) considered weights w_u chosen to minimize variance.

Two extreme weights are w_l = 1 and . The first may be called “unweighted” and the second “weighted”. In an obvious notation

Note the parallel to averaging over alleles in Equations 3. Bhatia et al. (2013) refer to the first estimate as the “average of ratios” and the second as the “ratio of averages.” WC84 advocated the second, with justification given in the Appendix to that paper, as did Bhatia et al.

The unweighted estimate is unbiased for all allele frequencies but is susceptible to the eﬀects of rare variants, when can be very small. Rare variants may have little eﬀect on the weighted average , and the variance of the estimate is seen in simulations to be less than for the unweighted average, but it is unbiased only if every locus has the same value of the ibd probabilities. A more extensive discussion was given in the Appendix of WC84 for population structure, and by Ritland (1996) for inbreeding and relatedness.

Private Alleles

Current sequence-based studies are revealing large numbers of low-frequency variants, including those found in only one population. These private alleles were identified by Slatkin (1985) and Mathieson and McVean (2012) as being of particular interest. They are very frequent in the 1000 genomes project data (The 1000 Genomes Project Consortium, 2010). If x₁ is the sample count of an allele observed only in population 1 of r populations the sample matching proportions are So the β estimates are

The estimate of F_ST for a private allele is its own-population sample frequency, but the population-specific value for its own population ranges from approximately −r + 1 when x₁ = 1 to 1 when x₁ = n₁. This amplifies the comment “populations can display spatial structure in rare variants, even when Wright’s fixation index F_ST is low” of Mathieson and McVean (2012). A population with many private alleles at low to intermediate frequencies will thus likely have a negative , and how negative will depend on how many populations have been sampled. Note that this implies must be allowed to go negative, whereas Bayesian estimators of population specific F_ST are forced to belong to [0, 1], although this assumption can be relaxed (Ritland, 1996).

RESULTS

Population Structure

We have conducted a series of simulations to evaluate the performance of our F_ST estimates, and we have looked at 1000 Genomes SNP data to explore the role of rare variants on the estimates. Some of the simulations were conducted with sim.genot.metapop.t available in the hierfstat package (Goudet, 2005). The migration model we have used allows for a matrix of migration rates between each pair of populations, and the mutation model allows for multiple alleles at a locus. The notation for a two-population model was given above.

Model 1. Same Migration Rates, Diﬀerent Population Sizes

We considered two populations, with sizes N₁ = 100, N₂ = 1, 000 and migration rates m₁ = m₂ = 0.01. The mutation rate was µ = 10⁻⁶. After 400 generations, the β’s have values β₁ = 0.156, β₂ = −0.037 and β₁₂ = 0.059. We simulated 50 individuals from each population under this scenario, with 1,000 loci and up to 20 alleles per locus. From the resulting allelic data we obtained estimates, and 95% confidence intervals by bootstrapping over loci. The results are shown in Table 1. The predicted values are contained in the confidence intervals, and there are negative values for both the parametric and the estimated value of β₂. Note that we cannot estimate β₁₂ with data from two populations.

View this table:

Table 1.

Predicted and estimated β’s for two populations

Model 2. Continent-Island Model

In this scenario we have an infinite continent supplying a proportion m = 0.01 of the alleles independently to populations 1 and 2, still with sizes N₁ = 100, N₂ = 1, 000. There is no migration between the two populations, so θ₁₂ = 0. The predicted values and estimated values after 400 generations are shown in Table 1.

Model 3. Migrant-pool Island Model

In this scenario, each population contributes to a migrant pool, from which migrant alleles are drawn. Among the migrant alleles in the case of two populations, half of the “migrant alleles” will in fact be resident alleles if the gametic pool is composed of the same proportion of alleles from each island, independent of its size. With otherwise the same parameter values, the predicted values and our estimates after 400 generations are shown in Table 1.

Model 4. Diﬀerent Population Sizes, Diﬀerent Migration Rates

We return to the two-populations model described above, but now with N₁ = 10, 000, N₂ = 100 and diﬀerent migration rates m₁ = 0.01, m₂ = 0. Predicted values after 1,000 generations are shown in Figure 3, and our estimates in Table 1.

Figure 3.

Estimated β’s for independent simulations at diﬀerent times, showing increase in bias and variances as the number of polymorphic loci decreases.

The results in Table 1 show generally good behavior of our β estimates. In Figure 3 we show the estimates for 10 diﬀerent time points (independent replicates) for Model 4. As time increased, the number of polymorphic loci decreased. In generations 600, 800, 1000 the numbers of polymorphic loci had dropped from 1,000 to 712, 349 and 151 respectively and the quality of the estimates decreased: higher bias and higher variance.

Rare Alleles

For r populations with total sample size n_T, and with x₁ copies of an allele private to population 1, the total count for this alleles is x_T = x₁ and assuming similar sample sizes for each sample. In Figure 4 we display as a function of allele frequencies for SNPs located on chromosome 2 in the 1000 Genomes project. Individuals were grouped by regions (Africa, Europe, South Asia, East Asia and the Americas). The drawn line corresponds to β_W = 5p_T. The initial linear segment corresponds to alleles that are present in one continent only. β_W’s start departing from this line for allele counts larger than 80, or equivalently, for worldwide frequencies larger than ≈ 0.01, given the sampled chromosome number of 2, 426.

Figure 4.

β_W as a function of allele frequencies (n_u /n_T) for SNPs located on chromosome 2. Data from the 1000 genomes project, individuals were grouped by regions (Africa, Europe, South Asia, East Asia and Americas). The drawn line corresponds to 5n_u /n_T. The initial linear segment corresponds to alleles that are present in one continent only. β_W s start departing from this line for allele counts larger than 80, or equivalently, for worldwide frequencies larger than ≈ 0.01, given the sampled chromosome number of 2426.

When a new allele appears, it will be present in one population only. We expect most if not all rare alleles to be private alleles, and thus the expected values for F_ST (β_W) for these rare alleles are their (sub-population) frequencies. When β_W starts departing from the allele frequency, it implies that some scattering has been happening. In species with a lot of migration, this will happen at low frequencies, whereas the species that are more sedentary should show a one to one relation between sub-population allele frequencies and β_W for a larger range of their site frequency spectrum.

In Buckleton et al. (2016) we gave population-specific F_ST estimates for a set of 446 populations, using published data for 24 microsatellite loci collected for forensic purposes. We showed in that paper how the choice of a reference set of populations can aﬀect results. For a set of African populations, the average within-population matching proportion was and the average between-population-pair averages were within the African region and for all pairs of populations. There is a larger F_ST for the set of African populations with Africa as a reference set than there is with the world as a reference set. The opposite was found for a collection of Inuit populations: the average within-population matching proportion was whereas the average between-population-pair matching proportions were for pairs within the Inuit group and for all pairs in the study: so F_ST is less with Inuit as a reference set than with the world as a reference set .

Inbreeding and Relatedness

To check on the validity of our estimators of individual inbreeding and coancestry coefficients, we simulated data for a range of 11 coancestries: (i/32: i = 0, 1, 2,…, 10). Using the ms software (Hudson, 2002), we generated data from an island model with two populations exchanging N m = 1 migrant per generation. We simulated 5,000 independent loci, read either as haplotypes (5,000) or as SNPs (approximately 80,000 polymorphic sites for the founders). We then chose 20 individuals from one of these populations and let them mate at random, without selfing. We did not assign or consider sex for these 20 founders. The number of oﬀspring per mating was Poisson with mean of five. These oﬀspring were than allowed to mate at random, without selfing, to produce families of size Poisson with mean three. By keeping records of all matings we could generate the pedigree-based inbreeding and coancestry values for all 135 individuals: founder, their oﬀspring and their grand-oﬀspring. The pedigree-based coancestries for all 9,045 pairs of individuals are shown in Figure 5, although we note (Hill and Weir, 2011) that the actual values have variation about expected or pedigree values. We used the same pedigree to simulate another data set, where this time 10 founders were coming from the first population and the other 10 from the second population, thus creating admixture among the children and grand-children.

Figure 5.

Pedigree-based inbreeding and coancestry coefficients for simulated data.

The left hand plot of Figure 6 reflects the summing to zero by construction of the coancestries, whereas the pedigree coancestries are necessarily non-negative. The right hand plot shows a “correction” of the estimates: we took the set of smallest values in the left hand plot to represent the unrelated (relative to the assumed-unrelated) founders. If we write as the average value in this distribution then our corrected values are

Figure 6.

Comparison of estimated and pedigree coancestries. Uncorrected estimates on left, corrected estimates on right. Correction procedure described in text.

The corrected estimates are clearly close to the pedigree values. However, we are not sure if it is necessary, in general, to undertake this correction process. Whether or not it is applied, the values are still relative to those among all pairs of individuals in a study sample. In general, we will not have any individuals identified for which it is justified to assume zero relatedness or zero inbreeding, and we note the comment by Thompson (2013) “in most populations IBD within individuals is at least as great as IBD between.”

The distributions of estimates in Figure 7 are tightly clustered around 11 values, corresponding to the 11 distinct pedigree values i/16, i = 0, 1, 2 … 11. A contrasting result is shown in Figure 8, for the CGTA estimates, calculated as weighted averages over loci (in the sense of equation 3 by taking the ratio of two sums over loci).

Figure 7.

Comparison of β and CGTA coancestry estimates, when founders are drawn from a single population.

Figure 8.

Comparison of β and CGTA coancestry estimates, when founders are drawn from two populations.

There is a current tendency in genome wide association studies (GWAS) to restrict the SNPs used in relatedness estimation to having a minor allele frequency (MAF) above some threshold. For example, the KING manual (http://people.virginia.edu/∼wc9c/KING/manual.html) lists a parameter minMAF to specify the minimum minor allele frequency to select SNPs for relationship inference in homogeneous populations. The thought is that lesser frequencies give rise to biased values, but that is not likely the case if “ratio of averages” estimates are used. To illustrate the eﬀect of MAF filtering, we applied four diﬀerent thresholds for our simulated data and we show the means and standard deviations for estimates for each of nine pedigree values in Table 2. The estimates are the corrected values – i.e. relative to an assigned value of zero for the least-related class. There is clear evidence for the merits of retaining all SNPs, both in terms of bias and variance.

View this table:

Table 2.

Eﬀects of filtering to L SNPs on coancestry estimate means (and standard deviations ×100).

We continued a comparison of values by applying the estimates described by Wang (2014) and computed using the related R package (Pew et al., 2015), listed in Table 3. Additionally, related oﬀers maximum likelihood estimators, derived by Milligan (2003) and Wang and Santure (2007). They are not computed here, because they require substantial computing time, which rules them out for genomic data.

View this table:

Table 3.

Other estimates of relatedness.

In Figure 9 we display box plots of coancestry estimates for eight alternative estimates, displayed according to 9 pedigree values. The β estimates are not corrected, yet have good bias and variance properties compared to other estimates.

Figure 9.

Boxplots of coancestry estimates for eight alternative estimates, displayed according to nine pedigree values. The β estimates are not corrected, yet have good bias and variance properties.

In Figure 10 we compare our β estimates with those from GCTA for admixed individuals with two ancestral populations. We used the same pedigree as in the section above, but took as founders 10 individuals from each of the two populations. Coancestries were calculated for all pairs of individuals in the pedigree. Figure 10 illustrates the accuracy of our β estimate compared with GCTA using the coancestries among founders. The ’s for pairs of founders from the same population are tightly distributed around 0.015, while ’s for pairs of individuals one from each population are tightly distributed around −0.11. The distribution for the same two categories for the GCTA estimators is wider, in particular for pairs of individuals originating from the same population.

Fig 10.

Boxplots of coancestry estimates β and GCTA when the founders come from two populations. Coancestries were estimated for all the individuals in the pedigree, but only those between founders are shown. For each panel, the left boxplot is for pairs of founders from the same population, while the right boxplot is for pairs when the two members come from diﬀerent populations.

DISCUSSION

A Unified Approach

Although there has been general recognition that family and evolutionary relatedness are just two ends of a continuum, we are not aware of previous estimates of population structure quantities such as F_ST or individual-pair coancestries that rest on this common framework. We have presented estimates that apply equally well to populations and individuals. While their statistical properties remain to be fully explored, it is reassuring to see how well they performed in the few simulations presented here.

Although individual-specific inbreeding coefficients and individual-pair-specific coancestry coefficients are used routinely in association studies, we have not seen widespread adoption of population-specific F_ST values in evolutionary studies. We have shown here, theoretically and empirically, that these values can diﬀer substantially among populations. This may simply reflect population size and migration rate diﬀerences, but diﬀerent values may also provide signatures of natural selection.

There is also general understanding that identity by descent is a relative concept, rather than an absolute concept. This understanding has not led to an apparent recognition that the usual estimates of inbreeding and kinship are not unbiased for expected or pedigree values. Replacing population allele frequencies by sample values leads to bias in the usual estimates, regardless of sample size. As the allele frequencies enter GCTA estimates, for example, as squares the expected values of the estimates depend on the variances these frequencies. These, in turn, depend on the parameters being estimated.

We also stress that all allelic variants, whatever their frequencies, need to be included in the estimation of population structure and inbreeding or relatedness. The estimates certainly depend on the allele frequencies, and restricting the range of frequencies used may reveal features of interest, but the underlying ibd parameters do not depend on the frequencies. Exclusion of some alleles based on their frequencies will lead to biased estimates of the parameters.

Previous Estimates

Weir and Cockerham Estimates of F_ST

The F_ST estimate of WC84 has been widely adopted and it performs well for the model stated in that paper: data from a series of independent populations with equivalent histories. In the present notation, WC84 assumed θ_ii = θ, θ_ii′ = 0 for all populations i and all i′ ≠ i. The estimate was designed to be unbiased for any number of populations, any sample sizes and any number of alleles per locus. The analysis was a weighted one over popu-lation: the average allele frequencies for a study had sample size weights, . Although our β estimates do not make explicit mention of allele frequencies, there is implicit use of sample frequencies that are unweighted averages over individuals or populations.

Weighting over populations has been discussed by Tukey (1957) and Robertson (1962). Those authors were concerned with bias and variance and they used the language of variance components, within and between populations. For allele u these components were given as (1 − θ)p_u (1 − p_u) and θp_u (1 − p_u), respectively, by WC84. Tukey said “In practice, we select two quadratic func-tions by some scheme involving intuition, find how their average values are expressed linearly in terms of the variance components, and then form two linear combinations of the original quadratics whose average values are the variance components. These linear combinations are then our esti-mates. Much flexibility is possible.” The estimates of WC84, Weir and Hill (2002) and Bhatia et al. (2013) all have this structure although ratios of linear combinations are taken to remove the allele frequency parameters. Tukey went on to say that the weights w_i = n_i (in the present notation) “gives the customary analyses, which treat observations as important and columns [i.e. populations] as unimportant.” Further, “the choice w_i = 1 … treat the columns as important. This [unweighted] approach is appropriate when the column variance component is large compared with the within variance component.” Robertson (1962) also pointed to sample-size weights for small between-population variance components and equal weights for large values. Bhatia et al. (2013) were concerned with unequal F_ST values so their use of equal weights is consistent with Turkey’s statements. Their work provides simple averages of the diﬀerent F_ST’s as opposed to averages weighted by sample sizes. For unequal F_ST’s and unequal sample sizes, Weir and Hill (2002) said “the usual moment estimate [with sample-size weights] is of a complex function [of the F_ST’s].” In our current model of unequal θ_i’s and non-zero θ_ii′’s we agree that unweighted analyses (population weights of 1) are appropriate, and that is what we have used in this paper. We note that Tukey’s “flexibility” in the choice of moment estimators, phrased in terms of weights, does not arise with maximum likelihood approaches. If sample allele frequencies are taken to be approximately normally distributed then REML methods give appropriate and unique estimates.

What are the consequences of using the WC84 estimates when the current model is more appropriate? We can show that the expected value of the Weir and Cockerham estimate is

This expression uses three functions of sample sizes: and n_c . The two weighted averages are and . The quantity Q is . For equal sample sizes, n_i = n, or for equal values of F_ST, . Under these circumstances and we find the WC84 estimator performs well unless the θ_ii ‘s and /or the n_i’s are quite different. We stress though that it is (θ_W – θ_B)/(1 – θ_B) being estimated.

Nei Estimates of F_ST

Although we have phrased estimates in terms of matching proportions, we note that they are the complements of “heterozygosities” . Our approach uses , the average population-pair allele matching, whereas most previous treatments, from Nei (1973) onwards, use total heterozygosities where is the average sample allele frequency over populations: . From Equations 1, the variance of is

For large sample sizes and Nei’s G_ST quantity and its expectation, in our notation, are which reduce to and as r becomes large. Otherwise, the expectation of G_ST depends on the number r of populations. This expectation is bounded above by one, contrary to the claim of Bhatia et al. (2013). Bounds on F_ST, when that is defined as , were given by Jakobsson et al. (2013).

Nei and Chesser (1983) and Nei (1987) modified Nei’s earlier approach to remove the eﬀects of the number of populations. Jost (2008) pointed out that G_ST does not provide a good measure of diﬀerentiation among populations, where diﬀerentiation reflects the collection of allele frequencies , or their sample values . We regard θ’s as indicators of evolutionary history, rather than of allele frequencies, and we interpret them as probabilities of pairs of alleles being identical by descent. Jost introduced D = (H_B − H_W)/(1 − H_W) or D = (θ_W − θ_B)/θ_W as a measure of diﬀerentiation among populations. For the two-population drift scenario without mutation D, unlike β_W, does not have a simple dependence on time and so does not serve as a measure of evolutionary distance.

CGTA Estimates of Relatedness

The expressions in Equation 3 provide unbiased estimates of θ_ii = (1 + F_i)/2 and θ_ii′, i ≠ i′ when the allele frequencies are known. When study sample allele frequencies are used the expectations of these expressions, for one locus and large samples, are where . The extent of bias depends on how different the average coancestry of a target individual with all other study individuals is from the average coancestry of all pairs of study individuals. We stress that these estimates are not unbiased for θ_ii′.

Association Mapping

One of our motivations for seeking a unified characterization of population structure and relatedness is that both phenomena aﬀect samples used in association mapping. Many analyses, such as those in GCTA, use mixed linear models with an estimated Genetic Relatedness Matrix A being used in the formulation of the variance-covariance matrix for trait values of the study individuals. For a trait with additive genetic variance and no other genetic variance components, the variance matrix for individuals includes the term and A has diagonal elements (1 + F_i)/2 and oﬀ-diagonal elements θ_ii′. We suggest that these be estimated by and , to accommodate any (hidden) relatedness and inbreeding among study subjects. We are less sure about the common practice of also using principal components of A as fixed eﬀects to accommodate population structure, especially when A uses ’s, and we see the need for further investigation.

Population History

We also see the need for further exploration of the role of population-specific F_ST estimates in evolutionary genetic studies, given the generally unrecognized prevalence of negative expected values for populations with correlated allele frequencies shown in Figure 1 and the relationship of estimates with the site-frequency spectrum suggested in Figure 4.

Conclusion

We have presented moment estimators for the probabilities that pairs of alleles, taken from individuals or from populations, are identical by descent relative to the ibd probabilities for alleles from all pairs of individuals or populations in a study. By identifying the reference set of alleles as those in the current study we allow for negative values for population structure or relatedness parameters and their estimates. Alleles may have smaller ibd probabilities within some populations than between all pairs of populations in a study, for example. Some pairs of individuals in a study may be less related than the average for all pairs. Our estimates are phrased in terms of the proportions of pairs of alleles, within and between populations or individuals, that are of the same type (ibs).

For sets of populations, we advocate the use of population-specific F_ST values as these more accurately reflect population history. For sets of individuals, our estimates seem to behave at least as well as those given previously. We note that our estimates have the same logical basis, and algebraic expressions, for populations and for individuals. The chief novelty of our approach is in allowing for allele frequencies to be correlated among populations when characterizing population structure, and correlated among all individuals when characterizing individual-pair relatedness.

Acknowledgments

This work was supported in part by grants GM 075091 and GM 099568 from the US National Institutes of Health and by grant IZK0Z3 157867 from the Swiss National Science Foundation.

Literature Cited

↵
Astle W, Balding DJ. 2009. Population structure and cryptic relatedness in genetic association studies. Statistical Science 24:451-471.
OpenUrl CrossRef
↵
Balding DJ, Nichols RA. 1995. A method for quantifying diﬀerentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96:3–12.
OpenUrl CrossRef PubMed Web of Science
↵
Beaumont MA, Balding DJ. 2004. Identifying adaptive genetic divergence among populations from genome scans. Molecular Ecology 13:969-980.
OpenUrl CrossRef PubMed Web of Science
↵
Bhatia G, Patterson N, Sankararaman S, Price AL. 2013. Estimating and interpreting F_ST: The impact of rare variants. Genome Research 23:1514-1521.
OpenUrl Abstract/FREE Full Text
↵
Browning SR, Weir BS. 2010. Population structure with localized haplotype clusters. Genetics 185:1337-l1344.
OpenUrl Abstract/FREE Full Text
↵
Buckleton JS, Curran JM, Goudet J, Taylor D, Thiery A, Weir BS. 2016. Population-specific F_ST values: A worldwide survey. Forensic Science International: Genetics 23:91-100.
OpenUrl
↵
Cockerham CC, Weir BS. 1983. Variance of actual inbreeding. Theoretical Population Biology 23:85-109.
OpenUrl CrossRef PubMed Web of Science
Foll M, Gaggiotti O. 2006. Identifying the environmental factors that determine the genetic structure of populations. Genetics 174:875–891.
OpenUrl Abstract/FREE Full Text
↵
Gaggiotti OE, Foll M. 2010. Quantifying population structure using the F-model. Molecular Ecology Resources 10:821–830.
OpenUrl
↵
Goudet J. 2005. hierfstat, a package for R to compute and test hierarchical F-statistics. Molecular Ecology Notes 5:184-186.
OpenUrl CrossRef Web of Science
↵
Hill WG, Weir BS. 2011. Variation in actual relationship as a consequence of Mendelian sampling and linkage. Genetics Research 93:47-74.
OpenUrl
↵
Hill WG, Weir BS. 2012. Variation in actual relationship among descendants of inbred individuals. Genetics Research 94:267-274.
OpenUrl
↵
Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model. Bioinformatics 18:337-8
OpenUrl CrossRef PubMed Web of Science
↵
Jakobsson M, Edge MD, Rosenberg NA. 2013. The relationship between F_ST and the frequency of the most frequent allele. Genetics 193:515–528.
OpenUrl Abstract/FREE Full Text
↵
Jost L. 2008. G(ST) and its relatives do not measure diﬀerentiation. Molecular Ecology 17:4015–4026.
OpenUrl CrossRef PubMed Web of Science
↵
Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, Sabatti C, Eskin E. 2010. Variance component model to account for sample structure in genome-wide association studies. Nature Genetics 42:348-354.
OpenUrl CrossRef PubMed Web of Science
Li CC, Weeks DE, Chakravarti, A. 1993. Similarity of DNA fingerprints due to chance and relat-edness. Human Heredity 43: 45-52.
OpenUrl CrossRef PubMed Web of Science
Lynch M. 1988. Estimation of relatedness by DNA fingerprinting. Molecular Biology and Evolution 5: 584-599.
OpenUrl PubMed Web of Science
Lynch M, Ritland K. 1999. Estimation of pairwise relatedness with molecular markers. Genetics 152: 1753-1766.
OpenUrl Abstract/FREE Full Text
↵
Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sal M, Chen W-M. 2010. Robust relationship inference in genome-wide association studies. Bioinformatics 26:2867-2873.
OpenUrl CrossRef PubMed Web of Science
↵
Mathieson I, McVean G. 2012. Diﬀerential confounding of rare and common variants in spatially structured populations. Nature Genetics 44:243-248.
OpenUrl CrossRef PubMed
↵
Maruyama T. 1970. Eﬀective number of alleles in a subdivided population. Theoretical Population Biology 1:27-306.
OpenUrl
↵
McTavish EJ, Hillis DM. 2015. How do SNP ascertainment schemes and population demographics aﬀect inferences about population history? BMC Genomics 16:266–278.
OpenUrl CrossRef PubMed
↵
Milligan BG. 2003. Maximum-likelihood estimation of relatedness. Genetics 163:1153-1167.
OpenUrl Abstract/FREE Full Text
↵
Nei M. 1973. Analysis of gene diversity in subdivided populations. Proceedings of the National Academy of Sciences, USA 70:3321–3323.
OpenUrl Abstract/FREE Full Text
↵
Nei M. 1987. Molecular Evolutionary Genetics. Columbia University Press, New York.
↵
Nei M, Chesser RK. 1983. Estimation of fixation indices and gene diversities. Annals of Human Genetics 47:253-259.
OpenUrl CrossRef PubMed Web of Science
↵
Patterson N, Price AL, Reich D. 2006. Population structure and eigenanalysis. PLoS Genetics 2:2074–2093.
OpenUrl
↵
Pew J, Muir PH, Wang J, Frasier TR. 2015. related: an R package for analysing pairwise relatedness from codominant molecular markers. Mol Ecol Resour 15: 557-561
OpenUrl CrossRef PubMed
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC. 2007. PLINK: A toll set for while-genome association and population-based linkage analysis. American Journal of Human Genetics 81:559–575.
OpenUrl CrossRef PubMed
Queller DC, Goodnight KF. 1989. Estimating relatedness using molecular markers. Evolution 43: 258-275.
OpenUrl CrossRef Web of Science
↵
Reynolds J, Weir BS, Cockerham CC. 1983. Estimation of the coancestry coefficient: basis for a short-term genetic distance. Genetics 105:767–779.
OpenUrl Abstract/FREE Full Text
↵
Ritland K. 1996. Estimators for pairwise relatedness and individual inbreeding coefficients. Genet-ical Research 67:175–185.
OpenUrl CrossRef Web of Science
↵
Robertson A. 1962. Weighting in the estimation of variance components in the unbalanced single classification. Biometrics 18:3-17.
OpenUrl
↵
Shriver MD, Kennedy GC, Parra EJ, Lawson HA, Sonpa V, Huang J, Akey JM, Jones KW. 2004. The genomic distribution of population substructure in four populations using 8,525 autosomal SNPs. Human Genomics 41:274-286.
OpenUrl
↵
Slatkin M. 1985. Rare alleles as indicators or gene flow. Evolution 39:53-65.
OpenUrl CrossRef Web of Science
↵
The 1000 Genomes Project Consortium. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061-1073.
OpenUrl CrossRef PubMed Web of Science
↵
Thompson EA. 1975. Estimation of pairwise relationships. Annals of Human Genetics 39:173-188.
OpenUrl CrossRef PubMed Web of Science
↵
Thompson EA. 2013. Identity by descent: variation in meiosis, across genomes, and in populations. Genetics 194:301–326.
OpenUrl Abstract/FREE Full Text
↵
Tukey JW. 1957. Variances of variance components: II. The unbalanced single classification. Annals of Mathematical Statistics 28: 43-56.
OpenUrl
Wang J. 2002. An estimator for pairwise relatedness using molecular markers. Genetics 160: 1203-1215.
OpenUrl Abstract/FREE Full Text
↵
Wang J. 2014. marker-based estimates of relatedness and inbreeding coefficients: an assessment of current methods. Journal of Evolutionary Biology 27:518–530.
OpenUrl CrossRef
Wang J, Santure AW. 2009. Parentage and sibship inference from multilocus genotype data under polygamy. Genetics 181:1579-1594
OpenUrl Abstract/FREE Full Text
↵
Weir BS. 1996. Genetic Data Analysis II. Sinauer, Sunderland, MA.
↵
Weir BS, Cardon LR, Anderson AD, Nielsen DM, Hill WG. 2005. Measures of human population structure show heterogeneity among genomic regions. Genome Research 15:1468-1476.
OpenUrl Abstract/FREE Full Text
↵
Weir BS, Cockerham CC. 1984. Estimating F-statistics for the analysis of population structure. Evolution 38:1358-1370.
OpenUrl CrossRef PubMed Web of Science
↵
Weir BS, Hill WG. 2002. Estimating F-statistics. Annual Review of Genetics 36:721-750.
OpenUrl CrossRef PubMed Web of Science
↵
Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, Goddard ME, Visscher PM. 2010. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics 42:565-569.
OpenUrl CrossRef PubMed Web of Science
↵
Yang J, Lee SH, Goddard ME, Visscher PM. 2011. GCTA: A tool for genome-wide complex trait analysis. American Journal of Human Genetics 88:76–82.
OpenUrl CrossRef PubMed
↵
Yu J, Pressoir G, Briggs WR, et al. 2006. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature Genetics 38:203-208.
OpenUrl CrossRef PubMed Web of Science

View the discussion thread.

Posted November 17, 2016.

Download PDF

Citation Tools

Subject Area

Genetics

Subject Areas

All Articles

Animal Behavior and Cognition (5214)
Biochemistry (11745)
Bioengineering (8751)
Bioinformatics (29195)
Biophysics (14971)
Cancer Biology (12095)
Cell Biology (17411)
Clinical Trials (138)
Developmental Biology (9421)
Ecology (14179)
Epidemiology (2067)
Evolutionary Biology (18306)
Genetics (12245)
Genomics (16802)
Immunology (11867)
Microbiology (28083)
Molecular Biology (11592)
Neuroscience (60965)
Paleontology (451)
Pathology (1870)
Pharmacology and Toxicology (3238)
Physiology (4959)
Plant Biology (10427)
Scientific Communication and Education (1683)
Synthetic Biology (2885)
Systems Biology (7339)
Zoology (1651)

[1] ↵
Astle W, Balding DJ. 2009. Population structure and cryptic relatedness in genetic association studies. Statistical Science 24:451-471.
OpenUrl CrossRef

[2] ↵
Balding DJ, Nichols RA. 1995. A method for quantifying diﬀerentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96:3–12.
OpenUrl CrossRef PubMed Web of Science

[3] ↵
Beaumont MA, Balding DJ. 2004. Identifying adaptive genetic divergence among populations from genome scans. Molecular Ecology 13:969-980.
OpenUrl CrossRef PubMed Web of Science

[4] ↵
Bhatia G, Patterson N, Sankararaman S, Price AL. 2013. Estimating and interpreting F_ST: The impact of rare variants. Genome Research 23:1514-1521.
OpenUrl Abstract/FREE Full Text

[5] ↵
Browning SR, Weir BS. 2010. Population structure with localized haplotype clusters. Genetics 185:1337-l1344.
OpenUrl Abstract/FREE Full Text

[6] ↵
Buckleton JS, Curran JM, Goudet J, Taylor D, Thiery A, Weir BS. 2016. Population-specific F_ST values: A worldwide survey. Forensic Science International: Genetics 23:91-100.
OpenUrl

[7] ↵
Cockerham CC, Weir BS. 1983. Variance of actual inbreeding. Theoretical Population Biology 23:85-109.
OpenUrl CrossRef PubMed Web of Science

[8] Foll M, Gaggiotti O. 2006. Identifying the environmental factors that determine the genetic structure of populations. Genetics 174:875–891.
OpenUrl Abstract/FREE Full Text

[9] ↵
Gaggiotti OE, Foll M. 2010. Quantifying population structure using the F-model. Molecular Ecology Resources 10:821–830.
OpenUrl

[10] ↵
Goudet J. 2005. hierfstat, a package for R to compute and test hierarchical F-statistics. Molecular Ecology Notes 5:184-186.
OpenUrl CrossRef Web of Science

[11] ↵
Hill WG, Weir BS. 2011. Variation in actual relationship as a consequence of Mendelian sampling and linkage. Genetics Research 93:47-74.
OpenUrl

[12] ↵
Hill WG, Weir BS. 2012. Variation in actual relationship among descendants of inbred individuals. Genetics Research 94:267-274.
OpenUrl

[13] ↵
Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model. Bioinformatics 18:337-8
OpenUrl CrossRef PubMed Web of Science

[14] ↵
Jakobsson M, Edge MD, Rosenberg NA. 2013. The relationship between F_ST and the frequency of the most frequent allele. Genetics 193:515–528.
OpenUrl Abstract/FREE Full Text

[15] ↵
Jost L. 2008. G(ST) and its relatives do not measure diﬀerentiation. Molecular Ecology 17:4015–4026.
OpenUrl CrossRef PubMed Web of Science

[16] ↵
Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, Sabatti C, Eskin E. 2010. Variance component model to account for sample structure in genome-wide association studies. Nature Genetics 42:348-354.
OpenUrl CrossRef PubMed Web of Science

[17] Li CC, Weeks DE, Chakravarti, A. 1993. Similarity of DNA fingerprints due to chance and relat-edness. Human Heredity 43: 45-52.
OpenUrl CrossRef PubMed Web of Science

[18] Lynch M. 1988. Estimation of relatedness by DNA fingerprinting. Molecular Biology and Evolution 5: 584-599.
OpenUrl PubMed Web of Science

[19] Lynch M, Ritland K. 1999. Estimation of pairwise relatedness with molecular markers. Genetics 152: 1753-1766.
OpenUrl Abstract/FREE Full Text

[20] ↵
Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sal M, Chen W-M. 2010. Robust relationship inference in genome-wide association studies. Bioinformatics 26:2867-2873.
OpenUrl CrossRef PubMed Web of Science

[21] ↵
Mathieson I, McVean G. 2012. Diﬀerential confounding of rare and common variants in spatially structured populations. Nature Genetics 44:243-248.
OpenUrl CrossRef PubMed

[22] ↵
Maruyama T. 1970. Eﬀective number of alleles in a subdivided population. Theoretical Population Biology 1:27-306.
OpenUrl

[23] ↵
McTavish EJ, Hillis DM. 2015. How do SNP ascertainment schemes and population demographics aﬀect inferences about population history? BMC Genomics 16:266–278.
OpenUrl CrossRef PubMed

[24] ↵
Milligan BG. 2003. Maximum-likelihood estimation of relatedness. Genetics 163:1153-1167.
OpenUrl Abstract/FREE Full Text

[25] ↵
Nei M. 1973. Analysis of gene diversity in subdivided populations. Proceedings of the National Academy of Sciences, USA 70:3321–3323.
OpenUrl Abstract/FREE Full Text

[26] ↵
Nei M. 1987. Molecular Evolutionary Genetics. Columbia University Press, New York.

[27] ↵
Nei M, Chesser RK. 1983. Estimation of fixation indices and gene diversities. Annals of Human Genetics 47:253-259.
OpenUrl CrossRef PubMed Web of Science

[28] ↵
Patterson N, Price AL, Reich D. 2006. Population structure and eigenanalysis. PLoS Genetics 2:2074–2093.
OpenUrl

[29] ↵
Pew J, Muir PH, Wang J, Frasier TR. 2015. related: an R package for analysing pairwise relatedness from codominant molecular markers. Mol Ecol Resour 15: 557-561
OpenUrl CrossRef PubMed

[30] Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC. 2007. PLINK: A toll set for while-genome association and population-based linkage analysis. American Journal of Human Genetics 81:559–575.
OpenUrl CrossRef PubMed

[31] Queller DC, Goodnight KF. 1989. Estimating relatedness using molecular markers. Evolution 43: 258-275.
OpenUrl CrossRef Web of Science

[32] ↵
Reynolds J, Weir BS, Cockerham CC. 1983. Estimation of the coancestry coefficient: basis for a short-term genetic distance. Genetics 105:767–779.
OpenUrl Abstract/FREE Full Text

[33] ↵
Ritland K. 1996. Estimators for pairwise relatedness and individual inbreeding coefficients. Genet-ical Research 67:175–185.
OpenUrl CrossRef Web of Science

[34] ↵
Robertson A. 1962. Weighting in the estimation of variance components in the unbalanced single classification. Biometrics 18:3-17.
OpenUrl

[35] ↵
Shriver MD, Kennedy GC, Parra EJ, Lawson HA, Sonpa V, Huang J, Akey JM, Jones KW. 2004. The genomic distribution of population substructure in four populations using 8,525 autosomal SNPs. Human Genomics 41:274-286.
OpenUrl

[36] ↵
Slatkin M. 1985. Rare alleles as indicators or gene flow. Evolution 39:53-65.
OpenUrl CrossRef Web of Science

[37] ↵
The 1000 Genomes Project Consortium. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061-1073.
OpenUrl CrossRef PubMed Web of Science

[38] ↵
Thompson EA. 1975. Estimation of pairwise relationships. Annals of Human Genetics 39:173-188.
OpenUrl CrossRef PubMed Web of Science

[39] ↵
Thompson EA. 2013. Identity by descent: variation in meiosis, across genomes, and in populations. Genetics 194:301–326.
OpenUrl Abstract/FREE Full Text

[40] ↵
Tukey JW. 1957. Variances of variance components: II. The unbalanced single classification. Annals of Mathematical Statistics 28: 43-56.
OpenUrl

[41] Wang J. 2002. An estimator for pairwise relatedness using molecular markers. Genetics 160: 1203-1215.
OpenUrl Abstract/FREE Full Text

[42] ↵
Wang J. 2014. marker-based estimates of relatedness and inbreeding coefficients: an assessment of current methods. Journal of Evolutionary Biology 27:518–530.
OpenUrl CrossRef

[43] Wang J, Santure AW. 2009. Parentage and sibship inference from multilocus genotype data under polygamy. Genetics 181:1579-1594
OpenUrl Abstract/FREE Full Text

[44] ↵
Weir BS. 1996. Genetic Data Analysis II. Sinauer, Sunderland, MA.

[45] ↵
Weir BS, Cardon LR, Anderson AD, Nielsen DM, Hill WG. 2005. Measures of human population structure show heterogeneity among genomic regions. Genome Research 15:1468-1476.
OpenUrl Abstract/FREE Full Text

[46] ↵
Weir BS, Cockerham CC. 1984. Estimating F-statistics for the analysis of population structure. Evolution 38:1358-1370.
OpenUrl CrossRef PubMed Web of Science

[47] ↵
Weir BS, Hill WG. 2002. Estimating F-statistics. Annual Review of Genetics 36:721-750.
OpenUrl CrossRef PubMed Web of Science

[48] ↵
Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, Goddard ME, Visscher PM. 2010. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics 42:565-569.
OpenUrl CrossRef PubMed Web of Science

[49] ↵
Yang J, Lee SH, Goddard ME, Visscher PM. 2011. GCTA: A tool for genome-wide complex trait analysis. American Journal of Human Genetics 88:76–82.
OpenUrl CrossRef PubMed

[50] ↵
Yu J, Pressoir G, Briggs WR, et al. 2006. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature Genetics 38:203-208.
OpenUrl CrossRef PubMed Web of Science

A unified characterization of population structure and relatedness

Abstract

Introduction

Materials and Methods

Parameter Values

Pairs of Individuals

Pairs of Populations

Drift, Mutation and Migration

Actual vs Predicted

Inbred Populations

Estimation

Allele Frequencies

Allelic Matching

Population Structure Estimates

Relatedness Estimates

Combining Over Loci

Private Alleles

RESULTS

Population Structure

Model 1. Same Migration Rates, Diﬀerent Population Sizes

Model 2. Continent-Island Model

Model 3. Migrant-pool Island Model

Model 4. Diﬀerent Population Sizes, Diﬀerent Migration Rates

Rare Alleles

Inbreeding and Relatedness

DISCUSSION

A Unified Approach

Previous Estimates

Weir and Cockerham Estimates of FST

Nei Estimates of FST

CGTA Estimates of Relatedness

Association Mapping

Population History

Conclusion

Acknowledgments

Literature Cited

Citation Manager Formats

Subject Area

Weir and Cockerham Estimates of F_ST

Nei Estimates of F_ST