Abstract
FST is a fundamental measure of genetic differentiation and population structure, currently defined for subdivided populations. FST in practice typically assumes independent, non-overlapping subpopulations, which all split simultaneously from their last common ancestral population so that genetic drift in each subpopulation is probabilistically independent of the other subpopulations. We introduce a generalized FST definition for arbitrary population structures, where individuals may be related in arbitrary ways, allowing for arbitrary probabilistic dependence among individuals. Our definitions are built on identity-by-descent (IBD) probabilities that relate individuals through inbreeding and kinship coefficients. We generalize FST as the mean inbreeding coefficient of the individuals’ local populations relative to their last common ancestral population. We show that the generalized definition agrees with Wright’s original and the independent subpopulation definitions as special cases. We define a novel coancestry model based on “individual-specific allele frequencies” and prove that its parameters correspond to probabilistic kinship coefficients. Lastly, we extend the Pritchard-Stephens-Donnelly admixture model in the context of our coancestry model and calculate its FST. To motivate this work, we include a summary of analyses we have carried out in follow-up papers, where our new approach has been applied to simulations and global human data, showcasing the complexity of human population structure, demonstrating our success in estimating kinship and FST, and the shortcomings of existing approaches. The probabilistic framework we introduce here provides a theoretical foundation that extends FST in terms of inbreeding and kinship coefficients to arbitrary population structures, paving the way for new estimators and novel analyses.
Note: This article is Part I of a two-part manuscript. We refer to the two parts in the text as Part I and Part II, respectively.
Part I: Alejandro Ochoa and John D. Storey. “FST and kinship for arbitrary population structures I: Generalized definitions”. bioRxiv (10.1101/083915) (2019). https://doi.org/10.1101/083915. First published 2016-10-27.
Part II: Alejandro Ochoa and John D. Storey. “FST and kinship for arbitrary population structures II: Method of moments estimators”. bioRxiv (10.1101/083923) (2019). https://doi.org/10.1101/083923. First published 2016-10-27.
1 Introduction
A population of mating organisms is structured if its individuals do not mate randomly, which results in an increase in mean homozygosity over the population compared to that of a randomly mating population [3, 4]. FST is a parameter that measures population structure [5, 6], which is typically understood through homozygosity. An unstructured population has FST = 0 and genotypes at each locus have Hardy-Weinberg proportions. At the other extreme, a fully differentiated population has FST = 1 and every subpopulation at every locus is homozygous for some allele. In addition to measuring population differentiation, FST is also used to model DNA profile matching uncertainty in forensics [7–13] and to identify loci under selection [14–21]. Current FST definitions assume a population that is partitioned or subdivided into discrete, non-overlapping subpopulations [5, 6, 22–24]. Many FST estimators further assume that subpopulations have evolved independently from the most recent common ancestor (MRCA) population [21–24], which occurs only if every subpopulation split from the MRCA population at the same time (Fig. 1A, Fig. 2A). However, populations such as humans are not naturally subdivided [11, 25–27] (Fig. 1B); thus, arbitrarily imposed subdivisions may yield correlated subpopulations that no longer satisfy the independent subpopulations model assumed by existing FST estimators (Fig. 2B). In this work, we build a generalized FST definition applicable to arbitrary population structures, including arbitrary evolutionary dependencies.
Natural populations are often structured due to population size differences and the constraints of distance and geography [31]. For example, the genetic population structure of humans shows evidence of population bottlenecks migrating out of Africa [32–40] as well as numerous admixture events [41–45]. Notably, human populations display genetic similarity that decays smoothly with geographic distance, rather than taking on discrete values as would be expected for independent subpopulations [11, 27, 35, 37–39] (Fig. 1B). Current FST definitions do not apply to these complex population structures.
FST is known by many names, including fixation index [6] and coancestry coefficient [23, 46]. FST is also alternatively defined in terms of the variance of subpopulation allele frequencies [6], variance components [47], correlations [22], and genetic distance [46]. Our generalized FST is defined using inbreeding coefficients, like Wright’s FST. There is also a diversity of summary statistics that measure locus-specific differentiation, such as GST, $G'_{ST}$, and D, which are functions of observed allele frequencies, and which approximate FST under certain conditions [48–53]. We consider FST as a genome-wide evolutionary parameter given by relatedness, which modulates the random drift of allele frequencies across loci but does not depend on these frequencies, mutation rates, or other locus-specific features. We review these previous FST definitions in greater detail in Supplementary Information, Section S1. The focus of our work is to generalize and accurately estimate the genome-wide FST in individuals with arbitrary relatedness; it does not presently concern locus-specific FST estimation or the identification of loci under selection.
The developments in this paper have led to improved estimates of FST and kinship in Part II [2]. We have also applied these new probabilistic quantities and estimators to data from the Human Origins and 1000 Genomes Project datasets in ref. [54]. To motivate the generalized definitions we present in this work, in Section 2 we provide an overview of simulation results demonstrating the accuracy of the estimators (from Part II) and findings from analyzing the Human Origins and 1000 Genomes Project datasets (from ref. [54]). These results establish that a generalized definition of FST in terms of kinship and inbreeding for arbitrary population structures is needed.
In Section 3 we formally define kinship and inbreeding coefficients, which measure how individuals are related, quantify population structure, and are the foundation of our work. We then generalize FST in terms of individual parameters (namely, inbreeding coefficients), and in analogy to Wright’s FIS, model local inbreeding on an individual basis. Our FST applies to arbitrary population structures, generalizing previous FST definitions restricted to subdivided populations.
In Section 4 we show a connection between the coalescent and kinship, inbreeding and generalized FST. This provides a generalization of a previous result showing the relationship between the coalescent and the classic FST defined on subdivided populations. In Section 5 we define a coancestry model that parametrizes the correlations of “individual-specific allele frequencies” (IAFs), a recent tool that also accommodates individual-specific relationships [55, 56]. Our model is related to previous models between populations [23, 57]. We prove that our coancestry parameters correspond to kinship coefficients, thereby preserving their probabilistic interpretations, and we relate these parameters to FST.
Lastly, in Section 6 we provide a novel FST analysis for admixed individuals by applying our coancestry model from Section 5 to the widely-used Pritchard-Stephens-Donnelly (PSD) admixture model, in which individuals derive their ancestry from several ancestral subpopulations with individual-specific admixture proportions [58–60]. We analyze an extension of the PSD model [55, 61–64] that generates allele frequencies from the Balding-Nichols distribution [7], and propose a more complete coancestry model for the ancestral subpopulations. We derive equations relating FST to the model parameters of PSD and its extensions. These results enable us to use an admixture simulation without independent subpopulations to benchmark kinship and FST estimators in Section 6 of Part II.
Our generalized definitions permit the analysis of FST and kinship estimators under arbitrary population structures, and pave the way forward to new estimation approaches, which are the focus of our following work in this series (Part II).
2 Motivating analyses
The results presented here lead to a deeper understanding of the limitations of existing FST, kinship, and inbreeding estimators. Specifically, the assumptions underlying existing estimators are too restrictive and do not align with the properties of human populations that have been revealed through recent studies. In Part II, we theoretically calculate and then numerically verify the complex biases that existing estimators exhibit when the population structure and relatedness violate the assumptions of non-overlapping and independently evolving subpopulations. This then leads to new estimators of FST, kinship, and inbreeding proposed in Part II. In ref. [54], we applied the estimators from Part II to data from the Human Origins study and 1000 Genomes Project (TGP). There, the analysis of these seminal datasets shows that the theory, methods, and simulations from Part I and Part II hold on real data. Although the results summarized in this section involve details presented in full in Part II and ref. [54], it may be useful to the reader to see the ultimate consequences of the theory presented in the current paper, Part I.
In Part II, we carried out simulations in two scenarios. The first scenario approximately satisfies the assumptions of the existing (Weir-Cockerham) estimator of FST. The second scenario is an admixture model (described in Section 6), which reflects the characteristics we have observed in real data where there are no well-defined independent subpopulations. Fig. 3, columns A and B, show the results of these simulations. Both the existing and proposed estimators do well in the first scenario (Fig. 3A), where the population is divided into non-overlapping subpopulations that have independently evolved from a common ancestral population. However, in the second scenario (Fig. 3B), where these assumptions are violated, the existing estimators show notable downward bias. Our theoretical results determine exactly what this bias is for both kinship and FST.
In ref. [54], we then analyzed data from the Human Origins [28–30] and TGP studies [65], both of which consist of individuals sampled from a global distribution of ancestries. For the TGP data, we specifically limited our analysis to Hispanics. Our novel kinship estimates calculated on these data reveal a complex population structure in the global human population (Fig. 3C) and in Hispanics in particular (Fig. 3D). Since there are no independent subpopulations in the human data, existing kinship and FST estimates in these data will also be downwardly biased, which can be seen in the bottom two rows of Fig. 3C-D. In contrast, our more accurate novel FST estimates measure greater differentiation than has been previously reported (Fig. 3C-D, second and fourth rows). A deeper analysis of our calculations reveals a clear connection between our estimated kinship structure (but not existing estimates) and the global human migrations under the African Origins model [54]. Our results suggest that common population genetic analyses on real human data will greatly benefit from our improved kinship and FST estimation framework.
3 Generalized definitions in terms of individuals
Now that we have established the need for a more flexible population structure model that does not assume independent subpopulations, we shall introduce here novel definitions required for this goal. First we review the formal definitions of kinship and inbreeding coefficients. Then we define a “local” population for every individual, which allows us to distinguish “structural” inbreeding due to the population structure from the “local” inbreeding that applies to individuals with closely-related parents. We then introduce our generalized FST definition as the mean structural inbreeding coefficient, and show that this definition equals the previous FST definition for independent subpopulations. We also generalize previous formulas for changing the reference ancestral population for kinship and inbreeding coefficients. Lastly, we review the connection between kinship coefficients and the covariance of genotypes.
3.1 Overview of data and model parameters
Table 1 summarizes the notation used in this work. Our models assume that genotypes at every locus evolve neutrally—by random drift only, in the absence of recent mutation and selection. Thus, only the population structure shapes the covariance structure of genotypes.
Let xij be observed biallelic genotypes for locus i ∈ {1,…, m} and diploid individual j ∈ {1,…, n}. Given a chosen reference allele at each locus, genotypes are encoded as the number of reference alleles: xij = 2 is homozygous for the reference allele, xij = 0 is homozygous for the alternative allele, and xij = 1 is heterozygous. We focus on biallelic loci since they vastly outnumber other types of genetic variants in humans. Note that a multiallelic model, which would require additional notation, could follow in analogy to previous FST work for populations [23].
We assume the existence of a panmictic ancestral population T for all individuals under consideration. T is generally not required to be the MRCA population, so many choices of T are possible. Note that T is a collection of organisms ancestral to a given set of individual organisms, shared by all loci, and it is not assumed that the alleles at a given locus coalesce in T. Two alleles are said to be “identical by descent” (IBD) if they originate from a single ancestor organism that lived more recently than the given ancestral population T [4, 6, 66]. In other words, relationships that precede T in time do not count as IBD, while relationships since T count toward IBD probabilities. Every locus i is assumed to have been polymorphic in T, with an ancestral reference allele frequency $p_i^T$, and no new mutations have occurred since then.
The inbreeding coefficient of individual j relative to T, $f_j^T$, is defined as the probability that the two alleles of any random locus of j are IBD when the ancestral population is T [67]. Therefore, $f_j^T$ measures the amount of relatedness within an individual, or the extent of dependence between its alleles at each locus. Similarly, the kinship coefficient of individuals j and k relative to T, $\varphi_{jk}^T$, is defined as the probability that two alleles at any random locus, each picked at random from each of the two individuals, are IBD when the ancestral population is T [5]. $\varphi_{jk}^T$ measures the amount of relatedness between individuals, or the extent of dependence across their alleles at each locus. Note that children j of parents (k, l) have an expected $f_j^T$ of $\varphi_{kl}^T$ [5]. Both $f_j^T$ and $\varphi_{jk}^T$ combine relatedness due to the population structure with recent or “local” relatedness, such as that of family members [68]. The values of $f_j^T$ and $\varphi_{jk}^T$ are functions of the chosen ancestral population T, which determines the level of relatedness that is treated as unrelated [4, 66]. Thus, $f_j^T$ and $\varphi_{jk}^T$ increase if T is an earlier rather than a more recent population. The expression “$f_j^T$ relative to T” refers to the value of $f_j^T$ when T is chosen as the reference ancestral population [6, 66]. The mean of these coefficients is positive in a structured population [67], and it also increases slowly over time in finite panmictic populations due to genetic drift [69].
Given an ancestral population T (not necessarily the MRCA population in this context) and an unstructured subpopulation S that evolved from T, Malécot defined FST as the mean $f_j^T$ over the individuals in S relative to T [5], which we denote by $f_S^T$. When S is itself structured, Wright defined three coefficients that connect T, S and individuals I in S [6]: FIT (“total inbreeding”) is the mean $f_j^T$ of individuals (I) relative to T; FIS (“local inbreeding”) is the mean $f_j^S$ of individuals (I) relative to S, which Wright did not consider to be part of the population structure; lastly, FST (“structural inbreeding”) is the mean $f_j^T$ relative to T that would result if individuals in S mated randomly (and which equals our $f_S^T$). The special case FIS = 0 gives FST = FIT [6]. See Supplementary Information, Section S1.1 for a more detailed review of these definitions. Wright created the distinction between FST and FIT with animal breeding in mind, since mating systems for artificial selection could cause the local inbreeding (FIS) and therefore also FIT to be large at times, but FST measures the more relevant mean inbreeding that results after random mating resumes in the strain [67]. However, in large, natural populations FIS is small so FST ≈ FIT in these cases. The FST definition has been extended to a set of disjoint subpopulations, where it is the average FST of each subpopulation from the last common ancestral population [23, 24].
In practice, the ancestral population T is usually not identified explicitly, which obscures its role in estimating kinship and FST. Here we clarify this important matter. Every population of mating organisms can be modeled as descending from a panmictic ancestral population T (whether real or a mathematical construct) that at every locus contained the pool of ancestral alleles that modern individuals inherited. By default, the recommended choice of T is the MRCA population of the individuals in the sample [22–24, 66, 70]. For example, if all individuals are drawn from one effectively panmictic population, then this population is the MRCA. In a pedigree with unrelated founders, the MRCA population consists of these founders [6, 31]. In a population structure defined by a tree, the MRCA population is the root node at which the first split occurs (Fig. 2). The choice of T sets the minimum possible value of $\varphi_{jk}^T$: a pair of unrelated individuals drawn from T have $\varphi_{jk}^T = 0$, and an individual from T (with unrelated parents by definition) has $f_j^T = 0$ [71]. Thus, assuming that such unrelated pairs are present in a sample, the set of $\varphi_{jk}^T$ values is in terms of the MRCA population T if and only if $\min_{j \ne k} \varphi_{jk}^T = 0$. If $\min_{j \ne k} \varphi_{jk}^T > 0$, then T is more ancestral than the MRCA population. Estimates with $\min_{j \ne k} \varphi_{jk}^T < 0$ (impossible if $\varphi_{jk}^T$ is a probability) have an implicit T that is more recent than the MRCA population and cannot be interpreted biologically. For humans, if we ignore the limited Neanderthal and Denisovan introgressions [42, 43], the MRCA population is the real population estimated to have existed in Africa ≈100-200 thousand years ago [32, 33, 40], which first split into the ancestral southern African KhoeSan population (who speak unique “click languages”) and the rest of humans [32, 33, 37, 38, 40].
3.2 Local populations
Our generalized FST definition depends on the notion of a local population. Our formulation includes as special cases the independent subpopulations and admixture models, and its generality is in line with recent efforts to model population structure on a fine scale [72, 73], through continuous spatial models [27, 74–76], or in a manner that makes minimal assumptions [56].
We define the local population Lj of an individual j as the MRCA population of j. In the simplest case, if j’s parents belong to the same panmictic subpopulation S, then S = Lj. However, if j’s parents belong to different subpopulations, then Lj is modeled as an admixed population (see example below). More broadly, Lj is the most recent panmictic population from which individual j drew its alleles and relative to which its inbreeding coefficient can be meaningfully defined. We define the “local” inbreeding coefficient of j to be $f_j^{L_j}$, and j is said to be locally outbred if $f_j^{L_j} = 0$.
For any population T ancestral to Lj, the parameter trio $(f_j^T, f_j^{L_j}, f_{L_j}^T)$ are individual-level analogs of Wright’s trio (FIT, FIS, FST) defined for a subdivided population [6], with Lj playing the role of S. Moreover, just like Wright’s coefficients satisfy
$$(1 - F_{IT}) = (1 - F_{IS})(1 - F_{ST}), \tag{1}$$
our individual-level parameters satisfy
$$(1 - f_j^T) = (1 - f_j^{L_j})(1 - f_{L_j}^T), \tag{2}$$
since the probability of the absence of IBD of j relative to T (which is $1 - f_j^T$) equals the product of the independent probabilities of absence of IBD at two levels: of j relative to Lj (which is $1 - f_j^{L_j}$), and of Lj relative to T (which is $1 - f_{L_j}^T$). Note that an individual j is locally outbred if and only if $f_j^T = f_{L_j}^T$.
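As a small numerical illustration of this two-level relation, the sketch below combines a local inbreeding coefficient (j relative to Lj) with a structural one (Lj relative to T) by multiplying the probabilities of absence of IBD at each level. The function name and the input values are hypothetical and chosen only for illustration.

```python
def total_inbreeding(f_local: float, f_struct: float) -> float:
    """Total inbreeding of j relative to T from its two levels.

    f_local  : inbreeding of j relative to its local population L_j
    f_struct : inbreeding of L_j relative to the ancestral population T
    Uses (1 - f_total) = (1 - f_local) * (1 - f_struct).
    """
    for f in (f_local, f_struct):
        if not 0.0 <= f <= 1.0:
            raise ValueError("IBD probabilities must lie in [0, 1]")
    return 1.0 - (1.0 - f_local) * (1.0 - f_struct)


# Hypothetical individual: local inbreeding 1/16, structural inbreeding 0.1.
print(round(total_inbreeding(f_local=0.0625, f_struct=0.1), 6))  # 0.15625

# A locally outbred individual (f_local = 0) has total inbreeding equal to
# the structural inbreeding, up to floating-point rounding.
print(total_inbreeding(f_local=0.0, f_struct=0.1))
```

Because the relation is multiplicative in the non-IBD probabilities, the same function can in principle be applied repeatedly to chain more than two levels.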
Similarly, we define the jointly local population Ljk of the pair of individuals j and k as the MRCA population of j and k. Hence, Ljk is ancestral to both Lj and Lk (Fig. 2B). We define the “local” kinship coefficient to be $\varphi_{jk}^{L_{jk}}$, and j and k are said to be locally unrelated if $\varphi_{jk}^{L_{jk}} = 0$. Since the expected inbreeding coefficient of an individual is the kinship of its parents [5], it follows that locally-unrelated parents have locally-outbred offspring.
Consider an individual j in an admixture model, deriving alleles from two distinct subpopulations A and B with proportions qjA and qjB = 1 − qjA. Then Lj is modeled as a population that at locus i has a reference allele frequency of $q_{jA} p_i^A + q_{jB} p_i^B$, where $p_i^A$ and $p_i^B$ are the allele frequencies in A and B, respectively. Considering a pair of individuals (j, k) and varying their admixture proportions, their jointly local population at one extreme is Ljk = Lj = Lk if and only if qjA = qkA (in other words, these individuals have the same local population if and only if their admixture proportions are the same); at the other extreme Ljk is the MRCA population of A and B if and only if qjA = 1 and qkA = 0 or vice versa (in other words, these individuals have the most distant jointly local population if and only if they are not admixed and belong to opposite subpopulations).
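The two-subpopulation example can be sketched in a few lines. Here the local population's reference allele frequency at one locus is taken to be the admixture-weighted mean of the frequencies in A and B, and two individuals are treated as sharing a local population exactly when their admixture proportions agree. Function names, the tolerance, and the numeric values are illustrative assumptions.

```python
def local_allele_frequency(q_a: float, p_a: float, p_b: float) -> float:
    """Reference allele frequency of the admixed local population L_j at one locus.

    q_a : proportion of ancestry from subpopulation A (B contributes 1 - q_a)
    p_a : allele frequency in A at this locus
    p_b : allele frequency in B at this locus
    """
    if not 0.0 <= q_a <= 1.0:
        raise ValueError("admixture proportion must lie in [0, 1]")
    return q_a * p_a + (1.0 - q_a) * p_b


def same_local_population(q_ja: float, q_ka: float, tol: float = 1e-12) -> bool:
    """Two individuals share a local population iff their proportions match."""
    return abs(q_ja - q_ka) <= tol


# Hypothetical allele frequencies in A and B at one locus.
p_a, p_b = 0.2, 0.6
print(local_allele_frequency(1.0, p_a, p_b))   # unadmixed individual from A
print(local_allele_frequency(0.25, p_a, p_b))  # individual with 25% ancestry from A
print(same_local_population(0.25, 0.25), same_local_population(1.0, 0.0))
```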
3.3 The generalized FST for arbitrary population structures
Recall the individual-level analog of Wright’s FST is $f_{L_j}^T$, which measures the inbreeding coefficient of individual j relative to T due exclusively to the population structure (Fig. 2B, Table 1 and Section 3.2). We generalize FST for a set of n individuals as
$$F_{ST} = \sum_{j=1}^n w_j f_{L_j}^T, \tag{3}$$
where the most meaningful choice of T is the MRCA population of all individuals under consideration, and $w_j$ (with $w_j > 0$ and $\sum_{j=1}^n w_j = 1$) are fixed weights for these individuals. The simplest weights are $w_j = \frac{1}{n}$ for all j. However, we allow for flexibility in the weights so that one may assign them to reflect how individuals were sampled, such as a skewed or uneven sampling scheme. For example, if there are two local populations and the first has twice as many samples as the second, then this can be counteracted by weighting every individual from the first local population half as much as every individual from the second local population. In general, individuals can be weighted inversely proportional to their local population’s sample sizes, a scheme used implicitly in the Hudson pairwise FST estimator [24] and which we iterated for a hierarchy of subdivisions in our analysis of the Human Origins dataset [54]. However, for complex population structures without discrete subpopulations and no obvious sampling biases relative to geography or other variables, we favor uniform weights over complicated weighting schemes (the admixed Hispanic individuals were weighted uniformly in [54]).
This generalized FST definition summarizes the population structure with a single value, intuitively measuring the average distance of every individual from T. Moreover, our definition contains the previous FST definition as a special case, as discussed shortly. For simplicity, we kept Wright’s traditional FST notation rather than using something that resembles our $f_{L_j}^T$ notation. A more consistent notation could be $\bar{f}_L^T$, which more clearly denotes the weighted average of $f_{L_j}^T$ across individuals. Our definition is more general because the traditional S population is replaced by a set of local populations {Lj}, which may differ for every individual.
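The weighted-mean definition and the two weighting schemes discussed above can be illustrated as follows. This is a minimal sketch: the function names are hypothetical, the structural inbreeding coefficients are made-up inputs (in practice they are unknown parameters), and the inverse-size scheme assumes discrete local population labels are available.

```python
from collections import Counter
from typing import Hashable, Sequence


def uniform_weights(n: int) -> list[float]:
    """Weights w_j = 1/n for all individuals."""
    return [1.0 / n] * n


def inverse_size_weights(labels: Sequence[Hashable]) -> list[float]:
    """Weights inversely proportional to each local population's sample size.

    Every local population receives total weight 1/K, split evenly among
    its sampled individuals.
    """
    counts = Counter(labels)
    k = len(counts)
    return [1.0 / (k * counts[lab]) for lab in labels]


def generalized_fst(f_struct: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of the structural inbreeding coefficients f_{L_j}^T."""
    if len(f_struct) != len(weights):
        raise ValueError("need one weight per individual")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")
    return sum(w * f for w, f in zip(weights, f_struct))


# Hypothetical sample: local population "L1" has twice as many samples as "L2".
labels = ["L1"] * 4 + ["L2"] * 2
f_struct = [0.1] * 4 + [0.3] * 2

print(generalized_fst(f_struct, uniform_weights(len(labels))))  # dominated by L1
print(generalized_fst(f_struct, inverse_size_weights(labels)))  # L1 and L2 count equally
```

With the inverse-size weights, each individual from the oversampled local population is weighted half as much as each individual from the other one, which matches the counterweighting described in the text.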
3.3.1 Mean heterozygosity in a structured population
Our generalized FST parametrizes the reduction in mean heterozygosity relative to the ancestral population T for arbitrary population structures, thus generalizing the familiar connection of the classical FST to allele fixation in an independently-evolving subpopulation. Here we will assume locally-outbred individuals, for which $f_j^T = f_{L_j}^T$. The expected proportion of heterozygotes Hij of an individual with inbreeding coefficient $f_j^T$ at locus i with an ancestral allele frequency $p_i^T$ is given by [67]
$$H_{ij} = 2 p_i^T (1 - p_i^T)(1 - f_j^T).$$
The weighted mean of these expected proportions of heterozygotes across individuals, $\bar{H}_i = \sum_{j=1}^n w_j H_{ij}$, is given by our generalized FST:
$$\bar{H}_i = 2 p_i^T (1 - p_i^T)(1 - F_{ST}). \tag{4}$$
Hence, individuals have Hardy-Weinberg proportions at every locus if and only if FST = 0, which in turn happens if and only if $f_{L_j}^T = 0$ for each j. In the other extreme, individuals have fully-fixated alleles at every locus ($\bar{H}_i = 0$) if and only if FST = 1, which in turn happens if and only if $f_{L_j}^T = 1$ for each j.
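The link between mean heterozygosity and the generalized FST can be checked numerically. The sketch below assumes locally-outbred individuals, so that each individual's expected heterozygosity at a locus is 2p(1 − p)(1 − f), with p the ancestral allele frequency and f the individual's structural inbreeding coefficient, and it compares the weighted mean of these values against 2p(1 − p)(1 − FST). All names and inputs are hypothetical.

```python
from typing import Sequence


def expected_heterozygosity(p_anc: float, f_total: float) -> float:
    """Expected heterozygote proportion for one individual at one locus."""
    return 2.0 * p_anc * (1.0 - p_anc) * (1.0 - f_total)


def mean_heterozygosity(p_anc: float, f_struct: Sequence[float],
                        weights: Sequence[float]) -> float:
    """Weighted mean of the individual expected heterozygosities."""
    return sum(w * expected_heterozygosity(p_anc, f)
               for w, f in zip(weights, f_struct))


# Hypothetical locus and locally-outbred individuals.
p_anc = 0.3
f_struct = [0.05, 0.10, 0.20, 0.25]
weights = [0.25] * 4

fst = sum(w * f for w, f in zip(weights, f_struct))
lhs = mean_heterozygosity(p_anc, f_struct, weights)
rhs = 2.0 * p_anc * (1.0 - p_anc) * (1.0 - fst)
print(lhs, rhs)  # the two values agree up to floating-point rounding

# Extremes: FST = 0 recovers the Hardy-Weinberg value, FST = 1 gives zero.
print(mean_heterozygosity(p_anc, [0.0] * 4, weights))
print(mean_heterozygosity(p_anc, [1.0] * 4, weights))
```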
Eq. (4) presents an apparent paradox since a given sample estimate of the heterozygosity on one side does not depend on T, while FST and $p_i^T$ on the other side vary depending on our choice of ancestral population T. In fact, both sides of Eq. (4) are constant with respect to T under our model: FST increases as T is taken to be a more distant ancestral population, but $p_i^T$ also changes so that $p_i^T (1 - p_i^T)(1 - F_{ST})$ is constant in expectation (see Supplementary Information, Section S4 for a proof of this result).
3.3.2 FST under the independent subpopulations model
Here we show that our generalized FST contains as a special case the currently-used FST definition for independent subpopulations. As discussed above, FST estimators often assume the independent subpopulations model, in which the population is divided into K non-overlapping subpopulations that evolved independently from their MRCA population T [22–24]. For simplicity, individuals are often further assumed to be locally outbred and locally unrelated. These assumptions result in the following block structure for our parameters,
$$f_j^T = f_{S_u}^T \ \text{ if } j \in S_u, \qquad \varphi_{jk}^T = \begin{cases} f_{S_u}^T & \text{if } j, k \in S_u,\ j \ne k, \\ 0 & \text{if } j \in S_u,\ k \in S_{u'},\ u \ne u', \end{cases}$$
where j, k ∈ {1,…, n} index individuals, Su, Su′ are disjoint subpopulations treated as sets containing individuals, and u, u′ ∈ {1,…, K} index these subpopulations. This population structure corresponds to a tree in which every subpopulation split from T at the same time (Fig. 2A), which is the required demographic scenario that leads to probabilistically-independent subpopulations.
The generalized FST applied to independent subpopulations agrees with the previous FST definition of the mean per-subpopulation FST [23, 24]:
$$F_{ST} = \sum_{j=1}^n w_j f_{L_j}^T = \frac{1}{K} \sum_{u=1}^K f_{S_u}^T,$$
where the weights wj are such that $\sum_{j \in S_u} w_j = \frac{1}{K}$ for each u. Note also that the Su for u ∈ {1,…, K} act as the K unique local populations, where Lj = Su whenever j ∈ Su.
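To make the independent subpopulations special case concrete, the sketch below builds the block-structured parameters for locally-outbred, locally-unrelated individuals: each individual's inbreeding equals the inbreeding of its subpopulation relative to T, pairs within a subpopulation share that value as their kinship, and pairs across subpopulations have zero kinship. It then checks that the weighted mean over individuals, with each subpopulation receiving total weight 1/K, equals the plain mean of the per-subpopulation values. Labels and values are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Sequence


def independent_subpopulations(labels: Sequence[str], f_sub: dict[str, float]):
    """Inbreeding vector and pairwise kinship under independent subpopulations.

    labels[j] names the subpopulation of individual j;
    f_sub[name] is that subpopulation's inbreeding relative to T.
    Returns (inbreeding list, kinship dict keyed by pairs (j, k) with j < k).
    """
    inbreeding = [f_sub[lab] for lab in labels]
    kinship = {}
    for j, k in combinations(range(len(labels)), 2):
        # Same subpopulation: shared structural value; otherwise independent.
        kinship[(j, k)] = f_sub[labels[j]] if labels[j] == labels[k] else 0.0
    return inbreeding, kinship


# Hypothetical sample with uneven subpopulation sizes.
labels = ["S1", "S1", "S1", "S2", "S2", "S3"]
f_sub = {"S1": 0.1, "S2": 0.2, "S3": 0.3}
inbreeding, kinship = independent_subpopulations(labels, f_sub)

# Weights such that each subpopulation has total weight 1/K.
counts = Counter(labels)
weights = [1.0 / (len(counts) * counts[lab]) for lab in labels]

fst_individuals = sum(w * f for w, f in zip(weights, inbreeding))
fst_subpops = sum(f_sub.values()) / len(f_sub)
print(fst_individuals, fst_subpops)  # equal up to floating-point rounding
print(kinship[(0, 1)], kinship[(0, 3)])  # within S1, then across S1 and S2
```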
3.4 IBD probabilities with respect to a reference ancestral population
In developing the generalized FST, we have made use of equations that relate IBD probabilities in a hierarchy. Here we generalize these equations to individual inbreeding and kinship coefficients, which allow for transformations of these probabilities under a change of reference ancestral population. Our relationships are straightforward generalizations of Wright’s equation relating FIT, FIS, and FST in Eq. (1), now more generally applicable.
Let A be a population ancestral to population B, which is in turn ancestral to population C. The inbreeding coefficients relating every pair of populations in {A, B, C} satisfy
$$(1 - f_C^A) = (1 - f_C^B)(1 - f_B^A). \tag{5}$$
A similar form applies for individual inbreeding and kinship coefficients given relative to populations A and B, respectively,
$$(1 - f_j^A) = (1 - f_j^B)(1 - f_B^A), \qquad (1 - \varphi_{jk}^A) = (1 - \varphi_{jk}^B)(1 - f_B^A),$$
which generalizes Eq. (2). All of these cases follow since the absence of IBD of C (or j, or j, k) relative to A requires independent absence of IBD at two levels: of C (or j, or j, k) relative to B, and of B relative to A. All of the above equations can be extended to a multi-level hierarchy just like Wright did for Eq. (1), by iterating at each level [6].
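A change of reference ancestral population can be sketched directly from the multiplicative relation between non-IBD probabilities: a coefficient relative to B and the inbreeding of B relative to the more ancestral A give the coefficient relative to A, and the relation can be inverted to move to the more recent reference. The function names below are hypothetical. The inverse returns a negative value whenever the coefficient relative to A is smaller than the inbreeding of B relative to A, which would not be a valid probability.

```python
def to_more_ancestral(coef_b: float, f_b_a: float) -> float:
    """Re-express a coefficient given relative to B as relative to A.

    coef_b : inbreeding or kinship coefficient relative to population B
    f_b_a  : inbreeding coefficient of B relative to its ancestor A
    Uses (1 - coef_a) = (1 - coef_b) * (1 - f_b_a).
    """
    return 1.0 - (1.0 - coef_b) * (1.0 - f_b_a)


def to_more_recent(coef_a: float, f_b_a: float) -> float:
    """Invert the relation: from relative to A back to relative to B."""
    if f_b_a >= 1.0:
        raise ValueError("f_b_a must be below 1 to invert the relation")
    return 1.0 - (1.0 - coef_a) / (1.0 - f_b_a)


# Hypothetical kinship of 0.05 relative to B, with B inbred 0.2 relative to A.
phi_a = to_more_ancestral(0.05, 0.2)
print(phi_a)                       # larger than 0.05: A is an earlier reference
print(to_more_recent(phi_a, 0.2))  # recovers roughly 0.05
```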
3.5 Genotype moments under the kinship model
In the kinship model [5, 6, 67, 77], genotypes xij are random variables with first and second moments given by
$$\mathrm{E}[x_{ij} \mid T] = 2 p_i^T, \tag{6}$$
$$\mathrm{Var}(x_{ij} \mid T) = 2 p_i^T (1 - p_i^T)(1 + f_j^T), \tag{7}$$
$$\mathrm{Cov}(x_{ij}, x_{ik} \mid T) = 4 p_i^T (1 - p_i^T) \varphi_{jk}^T. \tag{8}$$
Eq. (6) is a consequence of assuming no selection or new mutations, leaving random drift as the only evolutionary force acting on genotypes [67]. Eq. (7) shows how inbreeding modulates the genotype variance: an outbred individual relative to T has the Binomial variance of $2 p_i^T (1 - p_i^T)$ that corresponds to independently-drawn alleles; a fully inbred individual has a scaled Bernoulli variance of $4 p_i^T (1 - p_i^T)$ that corresponds to maximally correlated alleles [6]. Lastly, Eq. (8) shows how kinship modulates the correlations between individuals: unrelated individuals relative to T have uncorrelated genotypes, while $\varphi_{jk}^T = 1$ holds for the extreme of identical and fully inbred twins, which have maximally correlated genotypes [5, 77]. Hence, $f_j^T$ and $\varphi_{jk}^T$ parametrize the frequency of non-independent allele draws within and between individuals. The “self kinship”, arising from comparing Eq. (7) to the j = k case in Eq. (8), implies $\varphi_{jj}^T = \frac{1 + f_j^T}{2}$, which is a rescaled inbreeding coefficient resulting from comparing an individual with itself or its identical twin.
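The genotype mean and variance of the kinship model can be verified exactly for a single individual. The sketch below assumes the standard reading of the inbreeding coefficient: with probability f the two alleles are IBD, so they are copies of one allele drawn with frequency p, and otherwise they are two independent draws. It enumerates the resulting genotype distribution and compares its moments with 2p, 2p(1 − p)(1 + f), and the self-kinship form 4p(1 − p)(1 + f)/2. Function names and input values are illustrative.

```python
def genotype_distribution(p: float, f: float) -> dict[int, float]:
    """Probabilities of genotypes 0, 1, 2 (reference allele counts).

    With probability f the two alleles are IBD (one draw, copied);
    with probability 1 - f they are independent draws with frequency p.
    """
    q = 1.0 - p
    return {
        0: f * q + (1.0 - f) * q * q,
        1: (1.0 - f) * 2.0 * p * q,
        2: f * p + (1.0 - f) * p * p,
    }


def moments(dist: dict[int, float]) -> tuple[float, float]:
    """Mean and variance of a discrete distribution given as {value: prob}."""
    mean = sum(x * pr for x, pr in dist.items())
    var = sum((x - mean) ** 2 * pr for x, pr in dist.items())
    return mean, var


p, f = 0.3, 0.25  # hypothetical ancestral frequency and inbreeding coefficient
mean, var = moments(genotype_distribution(p, f))

print(mean, 2.0 * p)                                # first moment
print(var, 2.0 * p * (1.0 - p) * (1.0 + f))         # variance with inbreeding
print(var, 4.0 * p * (1.0 - p) * (1.0 + f) / 2.0)   # same value via self kinship
```

Setting f = 0 reproduces the Binomial variance 2p(1 − p), and f = 1 gives the scaled Bernoulli variance 4p(1 − p).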
4 Kinship and the generalized FST in terms of the coalescent
Slatkin (1991) [78] derived an expression for the classical FST (for a subdivided population) in terms of mean coalescence times,
$$F_{ST} = \frac{\bar{t}_T - \bar{t}_S}{\bar{t}_T},$$
where $\bar{t}_S$ is the mean coalescence time for alleles at a random locus within a subpopulation S, and $\bar{t}_T$ is the mean coalescence time for alleles at a random locus across subpopulations. Here we generalize this expression to encompass inbreeding and kinship coefficients, as well as the generalized FST.
In all cases that follow, we generalize $\bar{t}_T$ to denote the mean coalescence time for two alleles at a random locus drawn from the ancestral population T; in practice it corresponds to the mean coalescence time of the alleles of the two most distant individuals in the sample. The inbreeding and kinship coefficients are given by
$$f_j^T = \frac{\bar{t}_T - \bar{t}_j}{\bar{t}_T}, \qquad \varphi_{jk}^T = \frac{\bar{t}_T - \bar{t}_{jk}}{\bar{t}_T},$$
where $\bar{t}_j$ is the mean coalescence time of the two alleles of individual j at a random locus, and $\bar{t}_{jk}$ is the mean coalescence time of two alleles drawn at random from each of two individuals j and k at a random locus (see Supplementary Information, Section S2 for derivations). These mean coalescence times could be estimated as average coalescence times for a large number of neutral loci across the genome. If all individuals in the sample are locally outbred, we obtain the desired expression for the generalized FST:
$$F_{ST} = \frac{\bar{t}_T - \sum_{j=1}^n w_j \bar{t}_j}{\bar{t}_T}.$$
Therefore, the generalized FST equals the relative difference between the weighted mean coalescence times of the alleles within individuals versus the mean coalescence time between the most distantly-related individuals in the sample.
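As a rough numerical sketch of the coalescent view, the code below assumes the Slatkin-style relation in which an IBD coefficient equals the relative difference between the ancestral mean coalescence time and the mean coalescence time of the alleles in question. It converts hypothetical within-individual coalescence times into inbreeding coefficients and confirms that their weighted mean matches the relative difference computed from the weighted mean coalescence time. All names, time units, and values are made up for illustration.

```python
from typing import Sequence


def ibd_from_coalescence(t_pair: float, t_anc: float) -> float:
    """Relative difference (t_anc - t_pair) / t_anc between mean coalescence times."""
    if t_anc <= 0.0:
        raise ValueError("ancestral mean coalescence time must be positive")
    return (t_anc - t_pair) / t_anc


def fst_from_coalescence(t_within: Sequence[float], weights: Sequence[float],
                         t_anc: float) -> float:
    """Generalized FST from weighted mean within-individual coalescence times."""
    mean_within = sum(w * t for w, t in zip(weights, t_within))
    return ibd_from_coalescence(mean_within, t_anc)


# Hypothetical mean coalescence times (arbitrary units, e.g. generations).
t_anc = 10000.0
t_within = [9000.0, 8500.0, 8000.0, 7000.0]
weights = [0.25] * 4

inbreeding = [ibd_from_coalescence(t, t_anc) for t in t_within]
fst_direct = sum(w * f for w, f in zip(weights, inbreeding))
fst_times = fst_from_coalescence(t_within, weights, t_anc)
print(inbreeding)
print(fst_direct, fst_times)  # agree up to floating-point rounding
```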
5 The coancestry model for individual allele frequencies
FST and its estimators are most often studied in terms of subpopulation allele frequencies [22–24, 57]. Here we introduce a coancestry model for individuals, which is based on individual-specific allele frequencies (IAFs) [55, 56] that accommodate arbitrary population-level relationships between individuals. Some authors use the terms “coancestry” and “kinship” interchangeably [23, 70, 71]; in our framework, kinship coefficients are general IBD probabilities (following [68]), and we reserve coancestry coefficients for the IAF covariance parameters (in analogy to the work of [23]). This coancestry model is the foundation behind the extension of the PSD admixture model we present in Section 6 below, and simplifies the analysis of FST estimator bias in Section 3 of Part II.
In this section we introduce two parameters (see Table 1). First, πij ∈ [0, 1] is the IAF of individual j at locus i. Individual j draws each of its two alleles independently, with probability πij of drawing the reference allele. Allowing every locus-individual pair to have a potentially-unique allele frequency allows for arbitrary forms of population structure at the level of allele frequencies [56]. Second, $\theta_{jk}^T$ is the coancestry coefficient of individuals j and k relative to an ancestral population T, which modulates the covariance of πij and πik as shown below.
5.1 The coancestry model
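In our coancestry model, the IAFs πij have the following first and second moments, and genotypes are drawn from the IAFs as
$$\mathrm{E}[\pi_{ij} \mid T] = p_i^T, \tag{9}$$
$$\mathrm{Cov}(\pi_{ij}, \pi_{ik} \mid T) = p_i^T (1 - p_i^T) \theta_{jk}^T, \tag{10}$$
$$x_{ij} \mid \pi_{ij} \sim \mathrm{Binomial}(2, \pi_{ij}). \tag{11}$$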
Eq. (9) implies that random drift is the only force acting on the IAFs, and is analogous to Eq. (6) in the kinship model. Eq. (10) is analogous to Eqs. (7) and (8) in the kinship model, with individual coancestry coefficients playing the role of the kinship and inbreeding coefficients (for j = k), a relationship elaborated in the next section. Lastly, Eq. (11) draws the two alleles of a genotype independently from the IAF, which models locally-outbred and locally-unrelated individuals [23]. Hence, the coancestry model excludes local relationships, so it is more restrictive than the kinship model.
Our coancestry model between individuals is closely related to previous models between subpopulations [23, 57]. However, previous models allowed $\theta_{jk}^T < 0$ [23]. We require that $\theta_{jk}^T \ge 0$ for two reasons: (1) covariance is non-negative in latent structure models [79], such as population structure, and (2) it is necessary in order to relate $\theta_{jk}^T$ to IBD probabilities, as shown next.
5.2 Relationship between coancestry and kinship coefficients
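Here we show that the coancestry coefficients for IAFs, $\theta_{jk}^T$, defined above can be written in terms of the kinship and inbreeding coefficients of our more general model. We do so by matching moments. Conditional on the IAFs, genotypes in the coancestry model have a Binomial distribution, so
$$\mathrm{E}[x_{ij} \mid \pi_{ij}] = 2 \pi_{ij}, \qquad \mathrm{Var}(x_{ij} \mid \pi_{ij}) = 2 \pi_{ij} (1 - \pi_{ij}).$$
We calculate total moments by marginalizing the IAFs. The total expectation is
$$\mathrm{E}[x_{ij} \mid T] = \mathrm{E}\left[ \mathrm{E}[x_{ij} \mid \pi_{ij}] \mid T \right] = 2 p_i^T,$$
which agrees with Eq. (6) of the kinship model. The total covariance is calculated using
$$\mathrm{Cov}(x_{ij}, x_{ik} \mid T) = \mathrm{E}\left[ \mathrm{Cov}(x_{ij}, x_{ik} \mid \pi_{ij}, \pi_{ik}) \mid T \right] + \mathrm{Cov}\left( \mathrm{E}[x_{ij} \mid \pi_{ij}], \mathrm{E}[x_{ik} \mid \pi_{ik}] \mid T \right).$$
The first term is zero for j ≠ k, and for j = k it is
$$\mathrm{E}\left[ \mathrm{Var}(x_{ij} \mid \pi_{ij}) \mid T \right] = 2 \, \mathrm{E}\left[ \pi_{ij} (1 - \pi_{ij}) \mid T \right] = 2 p_i^T \left(1 - p_i^T\right) \left(1 - \theta_{jj}^T\right).$$
The second term equals $4 \, \mathrm{Cov}(\pi_{ij}, \pi_{ik} \mid T)$ for all (j, k) cases, which is given by Eq. (10). All together,
$$\mathrm{Cov}(x_{ij}, x_{ik} \mid T) = 4 p_i^T \left(1 - p_i^T\right) \theta_{jk}^T \quad \text{for } j \ne k, \qquad \mathrm{Var}(x_{ij} \mid T) = 2 p_i^T \left(1 - p_i^T\right) \left(1 + \theta_{jj}^T\right).$$
Comparing the above to Eqs. (7) and (8), we find that
$$\theta_{jk}^T = \varphi_{jk}^T \quad \text{for } j \ne k, \qquad \theta_{jj}^T = f_j^T.$$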
Therefore, our coancestry coefficients are equal to kinship coefficients, except that self-coancestries are equal to inbreeding coefficients.
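Since individuals in our IAF coancestry model are locally outbred and locally unrelated, we also have $\theta_{jj}^T = f_{L_j}^T$ (the inbreeding coefficient of the local population of individual j relative to T) and $\theta_{jk}^T = f_{L_{jk}}^T$ for j ≠ k (the corresponding coefficient for the jointly local population of j and k). Replacing these quantities in Eq. (3), we obtain the generalized FST in terms of coancestry coefficients:
$$F_{ST} = \sum_{j} w_j \theta_{jj}^T.$$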
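The moment-matching argument above can be checked exactly on a toy case. The sketch below assumes a hypothetical two-point IAF distribution (two equally likely values chosen so that the mean is p and the variance is p(1 − p)θ), enumerates Binomial(2, π) genotypes, and compares the total genotype variance with 2p(1 − p)(1 + θ). The two-point distribution and the function names are illustrative assumptions, not part of the model; any IAF distribution with the same first two moments would give the same total variance.

```python
def two_point_iaf(p, theta):
    """Toy IAF distribution: two equally likely values with mean p and
    variance p * (1 - p) * theta (an illustrative assumption)."""
    d = (p * (1.0 - p) * theta) ** 0.5
    lo, hi = p - d, p + d
    if lo < 0.0 or hi > 1.0:
        raise ValueError("theta too large for a two-point IAF at this p")
    return [(lo, 0.5), (hi, 0.5)]


def genotype_pmf(pi):
    """Binomial(2, pi) probabilities for genotypes 0, 1 and 2."""
    return [(1.0 - pi) ** 2, 2.0 * pi * (1.0 - pi), pi ** 2]


def total_genotype_moments(p, theta):
    """Exact mean and variance of a genotype after marginalizing the IAF."""
    mean = 0.0
    second_moment = 0.0
    for pi, prob in two_point_iaf(p, theta):
        for x, px in enumerate(genotype_pmf(pi)):
            mean += prob * px * x
            second_moment += prob * px * x * x
    return mean, second_moment - mean ** 2


p, theta = 0.5, 0.16
mean, var = total_genotype_moments(p, theta)
# Variance expected if the self-coancestry acts as an inbreeding coefficient.
expected_var = 2.0 * p * (1.0 - p) * (1.0 + theta)
```

In this example the enumerated variance matches `expected_var`, which illustrates why the self-coancestry plays the role of the inbreeding coefficient in the genotype variance.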
6 Coancestry and FST in admixture models
The Pritchard-Stephens-Donnelly (PSD) admixture model [58] is a well-established, tractable model of structure that is more complex than the independent subpopulations model. There are several algorithms available to estimate the PSD model parameters [58–60, 64, 80]. This model assumes the existence of several intermediate ancestral subpopulations, from which individuals draw alleles according to their admixture proportions. However, the PSD model was not developed with FST in mind; we will present a modified model that is compatible with our coancestry model. The results presented in this section are applied to evaluate kinship and FST estimators in Section 6 of Part II, where an admixed population without independent subpopulations is simulated and the true kinship and FST are known.
The PSD model is a special case of our coancestry model with the following additional parameters (see Table 1). The number of intermediate subpopulations is denoted by K. Let $p_i^{S_u}$ be the reference allele frequency at locus i in intermediate subpopulation $S_u$ (u ∈ {1,…, K}; compare to previous notation in Table 1). Lastly, $q_{ju} \in [0, 1]$ is the admixture proportion of individual j for intermediate subpopulation $S_u$. These proportions satisfy $\sum_{u=1}^{K} q_{ju} = 1$ for each j.
6.1 The PSD model with Balding-Nichols allele frequencies
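The original algorithm for fitting the PSD model [58] utilizes prior distributions for intermediate subpopulation allele frequencies and admixture proportions according to
$$p_i^{S_u} \sim \mathrm{Uniform}(0, 1), \tag{15}$$
$$(q_{j1}, \ldots, q_{jK}) \sim \mathrm{Dirichlet}(\alpha, \ldots, \alpha).$$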
Subsequent work has shown [56, 60] that the PSD model of [58] is then equivalent to forming IAFs
$$\pi_{ij} = \sum_{u=1}^{K} q_{ju} \, p_i^{S_u},$$
where genotypes are then drawn independently according to $x_{ij} \mid \pi_{ij} \sim \mathrm{Binomial}(2, \pi_{ij})$.
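The IAF construction described above can be sketched in a few lines. The example assumes the standard PSD form, in which each IAF is the admixture-weighted average of the intermediate subpopulation allele frequencies, and then draws each genotype as the sum of two independent Bernoulli draws. The function names, the matrix layout (individuals by subpopulations for `q`, loci by subpopulations for `p_sub`) and the toy values are hypothetical.

```python
import random


def admixture_iafs(q, p_sub):
    """Individual-specific allele frequencies under the PSD model.

    q     : n x K admixture proportions (each row sums to one)
    p_sub : m x K allele frequencies of the intermediate subpopulations
    Returns an m x n matrix with entry [i][j] = sum_u q[j][u] * p_sub[i][u].
    """
    for q_j in q:
        if abs(sum(q_j) - 1.0) > 1e-9:
            raise ValueError("each row of q must sum to one")
    return [
        [sum(q_ju * p_iu for q_ju, p_iu in zip(q_j, p_i)) for q_j in q]
        for p_i in p_sub
    ]


def draw_genotypes(iafs, rng):
    """Draw each genotype as Binomial(2, pi): two independent allele draws."""
    return [
        [int(rng.random() < pi) + int(rng.random() < pi) for pi in row]
        for row in iafs
    ]


# Toy example: 3 individuals, K = 2 subpopulations, 2 loci.
q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
p_sub = [[0.2, 0.8], [0.6, 0.4]]
iafs = admixture_iafs(q, p_sub)
genotypes = draw_genotypes(iafs, random.Random(0))
```

Unadmixed individuals (rows of `q` with a single 1) simply inherit the allele frequency of their subpopulation, while the admixed individual receives the average of the two.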
Here we consider an extension of this model, which we call the “BN-PSD” model, by replacing Eq. (15) with the Balding-Nichols (BN) distribution [7] to generate the allele frequencies for the intermediate subpopulations from their MRCA population T. The BN-PSD model establishes an independent subpopulations structure of the intermediate subpopulations $S_u$ as illustrated in Fig. 4. This combined model has been used to simulate structured genotypes [55, 62, 63], and is the target of some inference algorithms [61, 64]. The BN distribution is the following reparametrized Beta distribution,
$$\mathrm{BN}(p, F) = \mathrm{Beta}\left( \frac{1 - F}{F} \, p, \; \frac{1 - F}{F} \, (1 - p) \right),$$
where p is the ancestral allele frequency and F is the inbreeding coefficient [7]. The resulting allele frequencies $p^* \sim \mathrm{BN}(p, F)$ fit into our coancestry model, since E[p*] = p and Var(p*) = p(1 − p)F.
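The stated moments of the BN distribution follow directly from the Beta moments, which the following sketch verifies for one parameter choice. It assumes the usual Balding-Nichols parametrization, Beta(νp, ν(1 − p)) with ν = (1 − F)/F; the helper names are hypothetical, and `random.betavariate` from the Python standard library is used only to show how a draw could be made.

```python
import random


def bn_beta_params(p, f):
    """Beta shape parameters for BN(p, f), assuming nu = (1 - f) / f."""
    if not (0.0 < p < 1.0 and 0.0 < f < 1.0):
        raise ValueError("p and f must lie strictly between 0 and 1")
    nu = (1.0 - f) / f
    return nu * p, nu * (1.0 - p)


def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    s = a + b
    return a / s, a * b / (s * s * (s + 1.0))


def draw_bn(p, f, rng):
    """Draw one subpopulation allele frequency from BN(p, f)."""
    a, b = bn_beta_params(p, f)
    return rng.betavariate(a, b)


p, f = 0.3, 0.1
a, b = bn_beta_params(p, f)
mean, var = beta_mean_var(a, b)  # should equal p and p * (1 - p) * f
sample = draw_bn(p, f, random.Random(1))
```

The variance identity holds because a + b + 1 = ν + 1 = 1/F, so the Beta variance p(1 − p)/(ν + 1) reduces to p(1 − p)F.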
In BN-PSD, the allele frequencies at each locus i for intermediate subpopulation $S_u$ are drawn independently from
$$p_i^{S_u} \mid T \sim \mathrm{BN}\left( p_i^T, f_{S_u}^T \right),$$
where $p_i^T$ is the ancestral allele frequency and $f_{S_u}^T$ is the inbreeding coefficient of $S_u$ relative to T (compare to notation in Table 1).
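We calculate the coancestry parameters of this model by matching moments conditional on the admixture proportions $Q = (q_{ju})$. Writing $p_i^{S_u}$ for the allele frequency of intermediate subpopulation $S_u$ at locus i and $f_{S_u}^T$ for the inbreeding coefficient of $S_u$ relative to T, we calculate the expectation as
$$\mathrm{E}[\pi_{ij} \mid T] = \sum_{u=1}^{K} q_{ju} \, \mathrm{E}\left[ p_i^{S_u} \mid T \right] = p_i^T,$$
and, since the intermediate subpopulations are independent, the IAF covariance is
$$\mathrm{Cov}(\pi_{ij}, \pi_{ik} \mid T) = \sum_{u=1}^{K} q_{ju} q_{ku} \, \mathrm{Var}\left( p_i^{S_u} \mid T \right) = p_i^T \left(1 - p_i^T\right) \sum_{u=1}^{K} q_{ju} q_{ku} f_{S_u}^T.$$
By matching these to Eq. (10), we arrive at coancestry coefficients and FST of
$$\theta_{jk}^T = \sum_{u=1}^{K} q_{ju} q_{ku} f_{S_u}^T, \qquad F_{ST} = \sum_{j} w_j \sum_{u=1}^{K} q_{ju}^2 f_{S_u}^T.$$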
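A small numerical sketch may help make the BN-PSD result concrete. It assumes that, with independent intermediate subpopulations, the coancestry of individuals j and k is the sum over subpopulations of q_ju q_ku times the subpopulation inbreeding coefficient, and that FST is the weighted mean of the self-coancestries. The function names, the uniform default weights and the toy inputs are hypothetical.

```python
def bnpsd_coancestry(q, f_sub):
    """Coancestry matrix under BN-PSD with independent subpopulations.

    q     : n x K admixture proportions
    f_sub : length-K inbreeding coefficients of the subpopulations
    Entry [j][k] = sum_u q[j][u] * q[k][u] * f_sub[u].
    """
    n, k_sub = len(q), len(f_sub)
    return [
        [sum(q[j][u] * q[k][u] * f_sub[u] for u in range(k_sub))
         for k in range(n)]
        for j in range(n)
    ]


def fst_from_coancestry(theta, weights=None):
    """Weighted mean of self-coancestries (uniform weights if omitted)."""
    n = len(theta)
    if weights is None:
        weights = [1.0 / n] * n
    return sum(w * theta[j][j] for j, w in enumerate(weights))


# Toy example: two unadmixed individuals and one evenly admixed individual.
q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
f_sub = [0.1, 0.3]
theta = bnpsd_coancestry(q, f_sub)
fst = fst_from_coancestry(theta)
```

Note that the evenly admixed individual has a lower self-coancestry than the average of the two subpopulation inbreeding coefficients, because the proportions enter quadratically.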
6.2 The BN-PSD model with full coancestry
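The BN-PSD model contains a restriction that the K intermediate subpopulations are independent. Suppose instead that the intermediate subpopulation allele frequencies satisfy our more general coancestry model:
$$\mathrm{E}\left[ p_i^{S_u} \mid T \right] = p_i^T, \qquad \mathrm{Cov}\left( p_i^{S_u}, p_i^{S_v} \mid T \right) = p_i^T \left(1 - p_i^T\right) \theta_{uv}^T,$$
where $\theta_{uv}^T$ is the coancestry of the intermediate subpopulations $S_u$ and $S_v$. Note that the previous BN-PSD model satisfies $\theta_{uu}^T = f_{S_u}^T$ and $\theta_{uv}^T = 0$ for u ≠ v. Repeating our calculations assuming our full coancestry setting, individual coancestry coefficients and FST are given by
$$\theta_{jk}^T = \sum_{u=1}^{K} \sum_{v=1}^{K} q_{ju} q_{kv} \theta_{uv}^T, \tag{18}$$
$$F_{ST} = \sum_{j} w_j \sum_{u=1}^{K} \sum_{v=1}^{K} q_{ju} q_{jv} \theta_{uv}^T. \tag{19}$$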
Therefore, all coancestry coefficients of the intermediate subpopulations influence the individual coancestry coefficients and the overall FST. The form of the individual coancestry $\theta_{jk}^T$ above has a simple probabilistic interpretation: the probability of IBD at random loci between individuals j and k is the sum, over each pair of subpopulations u and v, of the probability of the pairing ($q_{ju} q_{kv}$) times the probability of IBD between these subpopulations (their coancestry, $\theta_{uv}^T$). Note that Eq. (18) was derived independently for a related model [81], but the value of FST for a set of admixed individuals, which we provide in Eq. (19), had not been described before to the best of our knowledge.
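The probabilistic interpretation above corresponds to a double sum over pairs of intermediate subpopulations, which the sketch below implements. It assumes the individual coancestry is the sum over (u, v) of q_ju q_kv times the subpopulation coancestry, with FST as the weighted mean of self-coancestries; the function names and the example coancestry matrix are hypothetical. A diagonal subpopulation matrix is included to illustrate the reduction to the independent-subpopulation (BN-PSD) case.

```python
def admixed_coancestry(q, theta_sub):
    """Individual coancestry under admixture with correlated subpopulations.

    q         : n x K admixture proportions
    theta_sub : K x K coancestry matrix of the intermediate subpopulations
    Entry [j][k] = sum_u sum_v q[j][u] * q[k][v] * theta_sub[u][v].
    """
    n, k_sub = len(q), len(theta_sub)
    return [
        [sum(q[j][u] * q[k][v] * theta_sub[u][v]
             for u in range(k_sub) for v in range(k_sub))
         for k in range(n)]
        for j in range(n)
    ]


def admixed_fst(q, theta_sub, weights=None):
    """Weighted mean of individual self-coancestries (uniform by default)."""
    theta = admixed_coancestry(q, theta_sub)
    n = len(theta)
    if weights is None:
        weights = [1.0 / n] * n
    return sum(w * theta[j][j] for j, w in enumerate(weights))


q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
# Correlated subpopulations: the off-diagonal coancestry is non-zero.
theta_full = [[0.1, 0.05], [0.05, 0.3]]
# Independent subpopulations: only the diagonal is non-zero.
theta_diag = [[0.1, 0.0], [0.0, 0.3]]

theta_ind = admixed_coancestry(q, theta_full)
fst_full = admixed_fst(q, theta_full)
fst_diag = admixed_fst(q, theta_diag)
```

With these toy inputs, shared ancestry between the subpopulations raises both the coancestry between the two unadmixed individuals and the overall FST relative to the independent case.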
7 Discussion
We presented a generalized FST definition corresponding to a weighted mean of individual-specific inbreeding coefficients. Compared to previous FST definitions, ours is applicable to arbitrary population structures, and in particular does not require the existence of non-overlapping subpopulations.
We considered two closely-related population structure models with individual-level resolution: the kinship model for genotypes, and our new coancestry model for IAFs (individual-specific allele frequencies). The kinship model is the most general, applicable to the genotypes in arbitrary sets of individuals. Our IAF model requires a local form of Hardy-Weinberg equilibrium, and it does not model locally-related or locally-inbred individuals. Nevertheless, IAFs arise in many applications, including admixture models [59], estimation of local kinship [55], genome-wide association studies [82], and the logistic factor analysis [56]. We prove that kinship coefficients, which control genotype covariance, also control IAF covariance under our coancestry model.
We also calculated FST for admixture models. To achieve this, we framed the PSD (Pritchard-Stephens-Donnelly) admixture model as a special case of our IAF coancestry model, and studied extensions where the intermediate subpopulations are more structured. FST was previously studied in an admixture model under Nei’s FST definition for one locus, where FST in the admixed population is given by a ratio involving admixture proportions and intermediate subpopulation allele frequencies [52]. On the other hand, our FST is an IBD probability shared by all loci and independent of allele frequencies. Under our framework, the FST of an admixed individual is a sum of products, which is quadratic in the admixture proportions and linear in the coancestry coefficients of the intermediate subpopulations. In the future, inference algorithms for our admixture model with fully-correlated intermediate subpopulations could yield improved results, including coancestry and FST estimates.
Our probabilistic model reconnects FST [21, 23, 24] to inbreeding and kinship coefficients [68, 70, 83, 84], all quantities of great interest in population genetics, but which are currently studied in isolation. The main reason for this isolation is that FST estimation assumes the independent subpopulations model, in which kinship coefficients are uninteresting. However, the study of the generalized FST in arbitrary population structures requires the consideration of arbitrary kinship coefficients [68]. Our work lays the foundation necessary to study estimation of the generalized FST, which is the focus of our next publication in this series (Part II).
Acknowledgments
This research was supported in part by NIH grant R01 HG006448.