Abstract
FST is a fundamental measure of genetic differentiation and population structure currently defined for subdivided populations. FST in practice typically assumes the “island model”, where subpopulations have evolved independently from their last common ancestral population. In this work, we generalize the FST definition to arbitrary population structures, where individuals may be related in arbitrary ways. Our definitions are built on identity-by-descent (IBD) probabilities that relate individuals through inbreeding and kinship coefficients. We generalize FST as the mean inbreeding coefficient of the individuals’ local populations relative to their last common ancestral population. This FST naturally yields a useful pairwise FST between individuals. We show that our generalized definition agrees with Wright’s original and the island model definitions as special cases. We define a novel coancestry model based on “individual-specific allele frequencies” and prove that its parameters correspond to probabilistic kinship coefficients. Lastly, we study and extend the Pritchard-Stephens-Donnelly admixture model in the context of our coancestry model and calculate its FST. Our probabilistic framework provides a theoretical foundation that extends FST in terms of inbreeding and kinship coefficients to arbitrary population structures.
1 Introduction
A population is structured if its individuals do not mate randomly, in particular, if homozygosity differs from what is expected when individuals mate randomly [1]. FST is a parameter that measures population structure [2, 3], which is best understood through homozygosity. FST = 0 for an unstructured population, in which genotypes have Hardy-Weinberg proportions. At the other extreme, FST = 1 for a fully differentiated population, in which every subpopulation is homozygous for some allele. Current FST definitions assume a population that is partitioned or subdivided into discrete, non-overlapping subpopulations [2–6]. Many FST estimators further assume an “island model”, in which subpopulations evolved independently from the last common ancestral population [4–6] (Fig. 1A, Fig. 2A). However, populations such as humans are not necessarily naturally subdivided; thus, arbitrarily imposed subdivisions may yield correlated subpopulations [7] (Fig. 1B, Fig. 2B). In this work, we build a generalized FST definition applicable to arbitrary population structures, including arbitrary evolutionary dependencies.
Natural populations are often structured due to evolutionary forces, population size differences and the constraints of distance and geography [10]. The human genetic population structure, in particular, has been shaped by geography [11], population bottlenecks [12], and numerous admixture events [9, 13, 14]. Notably, human populations display genetic similarity that decays smoothly with geographic distance, rather than with discrete jumps as would be expected for island models [7] (Fig. 1B). Current FST definitions do not apply to these complex population structures.
Population structure can be quantified by the inbreeding and kinship coefficients, which measure how individuals are related. The inbreeding coefficient f is the probability that the two alleles of an individual, at a random locus, were inherited from a single ancestor, also called “identical by descent” (IBD) [15]. The mean f is positive in a structured population [15], and it also increases slowly over time in finite panmictic populations, an effect known as genetic drift [16]. The kinship coefficient φ is the probability that two random alleles, one from each individual, at a random locus are IBD [2]. Both f and φ combine relatedness due to the population structure with recent or “local” relatedness, such as that of family members [17]. The values of f, φ are relative to an ancestral population, where relationships that predate this population are treated as random [18]. Thus, f and φ increase if the reference ancestral population is an earlier rather than a more recent population.
Given an unstructured subpopulation S, Malécot defined FST as the mean f in S relative to an ancestral population T [2]. When S is itself structured, Wright defined three coefficients that connect T, S and individuals I in S [3]: FIT (“total f”) is the mean f in I relative to T; FIS (“local f”) is the mean f in I relative to S, which Wright did not consider to be part of the population structure; lastly, FST (“structural f”) is the mean f relative to T that would result if individuals in S mated randomly. Wright distinguished these quantities in cattle, where FIS can be excessive [15]; however, FIS ought to be small in large, natural populations. The special case FIS = 0 gives FST = FIT [3]. The FST definition has been extended to a set of disjoint populations, where it is the average FST of each population from the last common ancestral population [5, 6].
FST is known by many names (for example, fixation index [3], coancestry coefficient [5, 19]), and alternative definitions (in terms of the variance of subpopulation allele frequencies [3], variance components [20], correlations [4], and genetic distance [19]). Our generalized FST, like Wright’s FST, is defined using inbreeding coefficients. There is also a diversity of measures of differentiation that are specialized for a single multiallelic locus, such as GST, GʹST, and D, which are functions of observed allele frequencies, and which relate to FST under certain conditions [21–25]. We consider FST as a genome-wide measure of genetic drift given by the relatedness of individuals, which does not depend on allele frequencies or other locus-specific features.
In our work, we generalize FST in terms of individual inbreeding coefficients, and exclude local inbreeding on an individual basis. Our FST applies to arbitrary population structures, generalizing previous FST definitions restricted to subdivided populations. We also generalize the “pairwise FST”, a quantity often estimated between pairs of populations [6, 11, 26–30], now defined for arbitrary pairs of individuals.
We also define a coancestry model that parametrizes the correlations of “individual-specific allele frequencies” (IAFs) [31, 32], a recent tool that also accommodates arbitrary relationships between individuals. Our model is related to previous models between populations [5, 33]. We prove that our coancestry parameters correspond to kinship coefficients, thereby preserving their probabilistic interpretations, and we relate these parameters to FST.
We demonstrate our framework by providing a novel FST analysis in terms of our coancestry model of the widely used Pritchard-Stephens-Donnelly (PSD) admixture model, in which individuals derive their ancestry from intermediate populations given individual-specific admixture proportions [34–36]. We analyze an extension of the PSD model [31, 37–40] that generates intermediate allele frequencies from the Balding-Nichols distribution [8], and propose a more complete coancestry model for the intermediate populations. We derive equations relating FST to the model parameters of PSD and its extensions.
Our generalized definitions permit the analysis of FST and kinship estimators under arbitrary population structures, and pave the way forward to new approaches, which are the focus of our following work in this series [41, 42].
2 Generalized definitions in terms of individuals
2.1 Overview of data and model parameters
Let xij be observed biallelic genotypes for SNP i ∈ {1,…, m} and diploid individual j ∈ {1,…, n}. Biallelic SNPs are the most common genetic variation in humans; the multiallelic model follows in analogy to the work of [5]. Given a chosen reference allele at each SNP, genotypes are encoded as the number of reference alleles: xij = 2 is homozygous for the reference allele, xij = 0 is homozygous for the alternative allele, and xij = 1 is heterozygous. Our models assume that the genotype distribution is parametrized solely by the population structure, evolving by genetic drift in the absence of new mutations and selection.
We assume the existence of a panmictic ancestral population T. Relationships that precede T in time are considered random and do not count as IBD, while relationships since T count toward IBD probabilities. Every SNP i is assumed to have been polymorphic in T, with an ancestral reference allele frequency $p_i^T \in (0, 1)$ in T, and no new mutations have occurred since then.
The inbreeding coefficient of individual j relative to T, $f_j^T \in [0, 1]$, is defined as the probability that the two alleles of any random SNP of j are IBD [15]. Therefore, $f_j^T$ measures the amount of relatedness within an individual, or the extent of dependence between its alleles at each SNP. Similarly, the kinship coefficient of individuals j and k relative to T, $\varphi_{jk}^T \in [0, 1]$, is defined as the probability that two alleles at any random SNP, each picked at random from each of the two individuals, are IBD [2]. Thus, $\varphi_{jk}^T$ measures the amount of relatedness between individuals, or the extent of dependence across their alleles at each SNP.
For a panmictic population S that evolved from T, the inbreeding coefficient of S relative to T, $f_S^T$, equals the value $f_j^T$ shared by every individual j in S. Thus, $f_S^T$ is equivalent to Wright’s FST for a subdivided population. The random drift in allele frequencies across SNPs from T to S is parametrized by $f_S^T$ alone, combining the contribution of time and sample size history into a single value [16].
2.2 Local populations
Our generalized FST definition depends on the notion of a local population. Our formulation includes as special cases island models and admixture models, and its generality is in line with recent efforts to model population structure on a fine scale [43, 44], through continuous spatial models [7, 45–47], or in a manner that makes minimal assumptions [32]. We define the local population Lj of an individual j as the most recent ancestral population of j. In the simplest case, if j’s parents belong to the same population, then that population is Lj and j belongs to it too. However, if j’s parents belong to different populations, then Lj is an admixed population (see example below). More broadly, Lj is the most recent population from which the inbreeding coefficient of j can be meaningfully defined. We define the “local” inbreeding coefficient of j to be $f_j^{L_j}$, and j is said to be locally outbred if $f_j^{L_j} = 0$.
For any population T ancestral to Lj, the parameter trio $(f_j^T, f_j^{L_j}, f_{L_j}^T)$ are individual-level analogs of Wright’s trio (FIT, FIS, FST) defined for a subdivided population [3]. Moreover, just like Wright’s coefficients satisfy
$$(1 - F_{IT}) = (1 - F_{IS})(1 - F_{ST}), \tag{1}$$
our individual-level parameters satisfy
$$(1 - f_j^T) = (1 - f_j^{L_j})(1 - f_{L_j}^T), \tag{2}$$
since the absence of IBD of j relative to T requires independent absence of IBD at two levels: of j relative to Lj, and of Lj relative to T. Note that an individual j is locally outbred if and only if $f_j^T = f_{L_j}^T$.
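The two-level relation between local, structural, and total inbreeding, (1 − total) = (1 − local)(1 − structural), can be checked numerically. The following is a minimal sketch; the function name `total_inbreeding` and the example values are illustrative assumptions, not part of the original work.

```python
def total_inbreeding(f_local: float, f_structural: float) -> float:
    """Return f_total from 1 - f_total = (1 - f_local) * (1 - f_structural)."""
    for name, value in (("f_local", f_local), ("f_structural", f_structural)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return 1.0 - (1.0 - f_local) * (1.0 - f_structural)


# Locally outbred individual: the total equals the structural coefficient.
outbred = total_inbreeding(0.0, 0.1)

# Offspring of first cousins (local inbreeding 1/16) in the same local population.
cousin_child = total_inbreeding(1 / 16, 0.1)

print(outbred, cousin_child)
```

The first call illustrates the "locally outbred" case, in which the individual's inbreeding relative to T is entirely due to the population structure.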
Similarly, we define the jointly local population Ljk of the pair of individuals j and k as the most recent ancestral population of j and k. Hence, Ljk is ancestral to both Lj and Lk (Fig. 2B). We define the “local” kinship coefficient to be $\varphi_{jk}^{L_{jk}}$, and j and k are said to be locally unrelated if $\varphi_{jk}^{L_{jk}} = 0$. Since the inbreeding coefficient of an individual is the kinship of its parents [2], it follows that a locally-outbred individual has locally-unrelated parents.
Consider an individual j in an admixture model, deriving alleles from two populations A and B with proportions qjA and qjB = 1 − qjA. Then Lj has allele frequencies $q_{jA} p_i^A + q_{jB} p_i^B$ at each SNP i, where $p_i^A$ and $p_i^B$ are the allele frequencies in A and B, respectively. Considering a pair of individuals (j, k), the jointly local population Ljk at one extreme equals Lj = Lk if qjA = qkA; at the other extreme Ljk is the last common ancestral population T of A and B if qjA = 1 and qkA = 0 or vice versa (i.e., individuals are not admixed and belong to separate populations).
2.3 The generalized FST for arbitrary structures
Recall the individual-level analog of Wright’s FST is $f_{L_j}^T$, which measures the inbreeding coefficient of individual j relative to T due exclusively to the population structure (Fig. 2B), as discussed in the last section. We generalize FST for a set of individuals as
$$F_{ST} = \sum_{j=1}^n w_j f_{L_j}^T, \tag{3}$$
where T is the most recent ancestral population common to all individuals under consideration, and $w_j > 0$, $\sum_{j=1}^n w_j = 1$, are fixed weights for individuals. The simplest weights are $w_j = 1/n$ for all j. However, we allow for flexibility in the weights so that one may assign them to reflect how individuals were sampled, such as a skewed or uneven sampling scheme.
This generalized FST definition summarizes the population structure with a single value, intuitively measuring the average distance of our individuals from T. Moreover, our definition contains the previous FST definition as a special case, as discussed shortly. For simplicity, we kept Wright’s traditional FST notation [3] rather than using something that resembles our notation. A more consistent notation could be $\bar{f}_L^T$, which more clearly denotes the weighted average of $f_{L_j}^T$ across individuals. Our definition is more general because the traditional S population is replaced by a set of local populations {Lj}, which may differ for every individual.
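As a concrete illustration, the generalized FST is a weighted mean of the per-individual structural inbreeding coefficients (the inbreeding of each local population Lj relative to T). The sketch below assumes these coefficients are already known; the function name `generalized_fst` and the input values are hypothetical.

```python
from typing import Optional, Sequence


def generalized_fst(
    f_structural: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted mean of structural inbreeding coefficients, one per individual."""
    n = len(f_structural)
    if n == 0:
        raise ValueError("need at least one individual")
    if weights is None:
        # Simplest choice: uniform weights 1/n.
        weights = [1.0 / n] * n
    if len(weights) != n:
        raise ValueError("weights and coefficients must have the same length")
    if any(w <= 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be positive and sum to one")
    return sum(w * f for w, f in zip(weights, f_structural))


fst_uniform = generalized_fst([0.1, 0.2, 0.3])
fst_weighted = generalized_fst([0.1, 0.2, 0.3], weights=[0.5, 0.25, 0.25])
print(fst_uniform, fst_weighted)
```

Non-uniform weights, as in the second call, would be one way to down-weight individuals from an oversampled region.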
2.3.1 Mean heterozygosity in a structured population
Our generalized FST is connected to the mean heterozygosity in a structured population, a connection that illustrates its properties. Here we will assume locally outbred individuals, for which $f_j^T = f_{L_j}^T$. The expected proportion of heterozygotes Hij of an individual with inbreeding coefficient $f_j^T$ at SNP i with an ancestral allele frequency $p_i^T$ is given by [15]
$$H_{ij} = 2 p_i^T (1 - p_i^T)(1 - f_j^T).$$
The weighted mean of these expected proportions of heterozygotes across individuals, $\bar{H}_i = \sum_{j=1}^n w_j H_{ij}$, is given by our generalized FST:
$$\bar{H}_i = 2 p_i^T (1 - p_i^T)(1 - F_{ST}).$$
Hence, individuals have Hardy-Weinberg proportions, $\bar{H}_i = 2 p_i^T (1 - p_i^T)$, if and only if FST = 0, which in turn happens if and only if $f_j^T = 0$ for each j. In the other extreme, individuals have fully fixed alleles, $\bar{H}_i = 0$, if and only if FST = 1, which in turn happens if and only if $f_j^T = 1$ for each j.
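The link between mean heterozygosity and FST can be verified with a few lines of arithmetic. This sketch assumes locally outbred individuals, uses the standard expected heterozygosity 2p(1 − p)(1 − f) for an individual with inbreeding coefficient f, and picks illustrative values for the allele frequency, coefficients, and weights.

```python
def expected_heterozygosity(p: float, f: float) -> float:
    """Expected proportion of heterozygotes for ancestral frequency p and inbreeding f."""
    return 2.0 * p * (1.0 - p) * (1.0 - f)


p_ancestral = 0.3                  # illustrative ancestral allele frequency
f_individuals = [0.0, 0.1, 0.4]    # illustrative inbreeding coefficients
weights = [1.0 / 3.0] * 3          # uniform weights

# Weighted mean heterozygosity computed individual by individual.
mean_het = sum(
    w * expected_heterozygosity(p_ancestral, f)
    for w, f in zip(weights, f_individuals)
)

# The same quantity obtained from FST alone.
fst = sum(w * f for w, f in zip(weights, f_individuals))
mean_het_from_fst = expected_heterozygosity(p_ancestral, fst)

print(mean_het, mean_het_from_fst)
```

The two numbers agree because the heterozygosity is linear in the inbreeding coefficient, so averaging over individuals commutes with the formula.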
2.3.2 FST under the island model
Here we show that our generalized FST contains as a special case the currently used FST definition for a subdivided population. As discussed above, FST estimators often assume what we call the “island model,” in which the population is subdivided into K non-overlapping subpopulations that evolved independently from their last common ancestral population T [4–6]. For simplicity, individuals are often further assumed to be locally outbred and locally unrelated. These assumptions result in the following block structure for our parameters,
$$f_j^T = f_{S_u}^T \text{ if } j \in S_u, \qquad \varphi_{jk}^T = \begin{cases} f_{S_u}^T & \text{if } j, k \in S_u,\ j \neq k, \\ 0 & \text{if } j \in S_u,\ k \in S_{u'},\ u \neq u', \end{cases}$$
where Su, Suʹ are disjoint subpopulations treated as sets containing individuals. This population structure is illustrated as a tree in Fig. 2A.
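In the notation of our generalized FST, we have under the island model assumptions that
$$F_{ST} = \sum_{j=1}^n w_j f_{L_j}^T = \frac{1}{K} \sum_{u=1}^K f_{S_u}^T,$$
where the weights wj are such that $\sum_{j \in S_u} w_j = \frac{1}{K}$ for each subpopulation Su. Note also that the Su here act as the K unique local populations, where Lj = Su whenever j ∈ Su.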
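The reduction to the island model can be illustrated with unevenly sampled subpopulations. The sketch below assumes, as one natural reading of "the average FST of each population", that each subpopulation carries total weight 1/K; the helper `island_weights`, the labels, and the coefficients are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def island_weights(labels: Sequence[str]) -> List[float]:
    """Per-individual weights such that every subpopulation has total weight 1/K."""
    counts = Counter(labels)
    k = len(counts)
    return [1.0 / (k * counts[label]) for label in labels]


# Three individuals sampled from S1 and one from S2 (uneven sampling).
labels = ["S1", "S1", "S1", "S2"]
f_subpop = {"S1": 0.05, "S2": 0.25}  # inbreeding of each island relative to T

weights = island_weights(labels)

# Generalized FST: weighted mean over individuals, with L_j equal to j's island.
fst_individuals = sum(w * f_subpop[label] for w, label in zip(weights, labels))

# Island-model FST: plain average over the K subpopulations.
fst_islands = sum(f_subpop.values()) / len(f_subpop)

print(fst_individuals, fst_islands)
```

With uniform per-individual weights instead, the oversampled island S1 would dominate and the two values would differ.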
2.4 The individual-level pairwise FST
An important special case of FST is the “pairwise” FST, which is the FST of two subpopulations. When the assumption holds that individuals belong to one of the two unstructured populations, this pairwise FST can be estimated consistently [6], and is used frequently in the literature [11, 26–30]. Here we generalize this parameter to be between two individuals, and clarify its relationship to inbreeding coefficients measured relative to ancestral population T.
Let Ljk denote the last common ancestral population of the pair of individuals j and k, which we defined above as their jointly local population (Fig. 2B). We define the “individual-level pairwise FST” to be
$$F_{jk} = \frac{f_{L_j}^{L_{jk}} + f_{L_k}^{L_{jk}}}{2},$$
which is the special case of our generalized FST for two populations, T = Ljk, and equal weights $w_j = w_k = \frac{1}{2}$. Note that Lj and Lk being independent relative to Ljk enables consistent estimation of Fjk [6, 41]; the same is not generally possible for three or more individuals relative to their most recent ancestral population T.
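Given $f_{L_j}^T$ and $f_{L_{jk}}^T$ relative to some earlier ancestral population T ≠ Ljk (Fig. 2B), the desired parameters are given by
$$(1 - f_{L_j}^T) = (1 - f_{L_j}^{L_{jk}})(1 - f_{L_{jk}}^T), \tag{4}$$
which follows analogously to Eqs. (1) and (2). Solving for $f_{L_j}^{L_{jk}}$, repeating for $f_{L_k}^{L_{jk}}$, and replacing them into our individual-level pairwise FST, we obtain an equation for arbitrary T:
$$F_{jk} = \frac{\frac{1}{2}\left(f_{L_j}^T + f_{L_k}^T\right) - f_{L_{jk}}^T}{1 - f_{L_{jk}}^T}. \tag{5}$$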
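When there is no local relatedness, $f_{L_j}^T = f_j^T$ is the usual inbreeding coefficient and $f_{L_{jk}}^T = \varphi_{jk}^T$ is the usual kinship coefficient, both measuring population structure only and yielding
$$F_{jk} = \frac{\frac{1}{2}\left(f_j^T + f_k^T\right) - \varphi_{jk}^T}{1 - \varphi_{jk}^T}.$$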
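Note that the mean individual-level pairwise FST for n > 2, given by $\frac{2}{n(n-1)} \sum_{j<k} F_{jk}$, gives a lower bound for the “global” FST with uniform weights $w_j = 1/n$, since $F_{jk} \le \frac{1}{2}\left(f_{L_j}^T + f_{L_k}^T\right)$ when all coefficients are measured relative to the common T. Thus, the mean pairwise FST equals the global FST for n > 2 if and only if all individuals are independent, where $f_{L_{jk}}^T = 0$ for all j ≠ k.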
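The pairwise FST for an arbitrary reference population T, and the lower-bound property of its mean, can be sketched as follows. The formula used is F_jk = ((f_Lj + f_Lk)/2 − f_Ljk) / (1 − f_Ljk), with all coefficients relative to T; the function name `pairwise_fst` and the three-individual example values are assumptions made for illustration.

```python
from itertools import combinations


def pairwise_fst(f_j: float, f_k: float, f_jk: float) -> float:
    """Individual-level pairwise FST from coefficients measured relative to T.

    f_j, f_k: inbreeding of the local populations of j and k relative to T.
    f_jk: inbreeding of the jointly local population of j and k relative to T.
    """
    if not 0.0 <= f_jk < 1.0:
        raise ValueError("f_jk must lie in [0, 1)")
    if f_jk > min(f_j, f_k):
        # The jointly local population is ancestral to both local populations.
        raise ValueError("f_jk cannot exceed f_j or f_k")
    return ((f_j + f_k) / 2.0 - f_jk) / (1.0 - f_jk)


f_local = [0.2, 0.3, 0.4]
# Individuals 0 and 1 share some history since T; individual 2 is independent.
f_joint = {(0, 1): 0.1, (0, 2): 0.0, (1, 2): 0.0}

pairs = list(combinations(range(len(f_local)), 2))
mean_pairwise = sum(
    pairwise_fst(f_local[j], f_local[k], f_joint[(j, k)]) for j, k in pairs
) / len(pairs)
global_fst = sum(f_local) / len(f_local)  # uniform weights

print(mean_pairwise, global_fst)
```

In this example the shared history of individuals 0 and 1 makes the mean pairwise value strictly smaller than the global value.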
2.5 Shifting IBD probabilities for change of reference ancestral population
In developing the generalized FST and the individual-level pairwise FST, we have made use of equations that relate IBD probabilities in a hierarchy. Here we present more general forms of these equations, which allow for transformations of probabilities under a change of reference ancestral population. Our relationships are straightforward generalizations of Wright’s equation relating FIT, FIS, and FST in Eq. (1), now more generally applicable.
Let A be a population ancestral to population B, which is in turn ancestral to population C. The inbreeding coefficients relating every pair of populations in {A, B, C} satisfy
$$(1 - f_C^A) = (1 - f_C^B)(1 - f_B^A),$$
which generalizes Eq. (4). A similar form applies for individual inbreeding and kinship coefficients given relative to populations A and B, respectively,
$$(1 - f_j^A) = (1 - f_j^B)(1 - f_B^A), \qquad (1 - \varphi_{jk}^A) = (1 - \varphi_{jk}^B)(1 - f_B^A),$$
which generalizes Eq. (2). All of these cases follow since the absence of IBD of C (or j, or j, k) relative to A requires independent absence of IBD at two levels: of C (or j, or j, k) relative to B, and of B relative to A.
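In practice one may want to move in the opposite direction, re-expressing a coefficient given relative to the older population A in terms of the more recent population B. Solving the product relation above for the coefficient relative to B gives (x_A − f_B^A) / (1 − f_B^A). The sketch below applies this to a kinship coefficient; the function name `rebase_to_descendant` and the numbers are hypothetical.

```python
def rebase_to_descendant(coef_rel_a: float, f_b_rel_a: float) -> float:
    """Re-express an inbreeding or kinship coefficient relative to B instead of A.

    coef_rel_a: coefficient relative to the older population A.
    f_b_rel_a: inbreeding coefficient of B relative to A.
    """
    if not 0.0 <= f_b_rel_a < 1.0:
        raise ValueError("f_b_rel_a must lie in [0, 1)")
    if not f_b_rel_a <= coef_rel_a <= 1.0:
        # A coefficient relative to A includes all IBD accumulated between A and B.
        raise ValueError("coef_rel_a must lie in [f_b_rel_a, 1]")
    return (coef_rel_a - f_b_rel_a) / (1.0 - f_b_rel_a)


kinship_rel_a = 0.28
f_b_rel_a = 0.10
kinship_rel_b = rebase_to_descendant(kinship_rel_a, f_b_rel_a)

# Round trip: recombine the two levels and recover the original value.
recombined = 1.0 - (1.0 - kinship_rel_b) * (1.0 - f_b_rel_a)
print(kinship_rel_b, recombined)
```

The coefficient shrinks when the reference moves to the more recent population, in line with the earlier remark that f and φ increase for earlier reference populations.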
2.6 Genotype moments under the kinship model
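In the kinship model, genotypes xij are random variables with first and second moments given by
$$\mathrm{E}[x_{ij} | T] = 2 p_i^T, \tag{6}$$
$$\mathrm{Var}(x_{ij} | T) = 2 p_i^T (1 - p_i^T)(1 + f_j^T), \tag{7}$$
$$\mathrm{Cov}(x_{ij}, x_{ik} | T) = 4 p_i^T (1 - p_i^T) \varphi_{jk}^T. \tag{8}$$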
Eq. (6) is a consequence of assuming no selection or new mutations, leaving random drift as the only evolutionary force acting on genotypes [15]. Eq. (7) shows how inbreeding modulates the genotype variance: an outbred individual relative to T ($f_j^T = 0$) has the Binomial variance of $2 p_i^T (1 - p_i^T)$ that corresponds to independently-drawn alleles; a fully inbred individual ($f_j^T = 1$) has a scaled Bernoulli variance of $4 p_i^T (1 - p_i^T)$ that corresponds to maximally correlated alleles [3]. Lastly, Eq. (8) shows how kinship modulates the correlations between individuals: unrelated individuals relative to T ($\varphi_{jk}^T = 0$) have uncorrelated genotypes, while $\varphi_{jk}^T = 1$ holds for the extreme of identical and fully inbred twins, which have maximally correlated genotypes [2, 48]. Hence, $f_j^T$ and $\varphi_{jk}^T$ parametrize the frequency of non-independent allele draws within and between individuals. The “self kinship”, arising from comparing Eq. (7) to the j = k case in Eq. (8), implies $\varphi_{jj}^T = \frac{1}{2}(1 + f_j^T)$, which is a rescaled inbreeding coefficient resulting from comparing an individual with itself or its identical twin.
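The genotype mean 2p and variance 2p(1 − p)(1 + f) can be recovered from the IBD interpretation by enumerating the three genotypes exactly. The sketch below assumes the usual mixture reading of the inbreeding coefficient: with probability f the two alleles are copies of a single ancestral allele, and otherwise they are two independent draws from T. Function names and parameter values are illustrative.

```python
from typing import Tuple


def genotype_distribution(p: float, f: float) -> Tuple[float, float, float]:
    """Return (P(x=0), P(x=1), P(x=2)) for ancestral frequency p and inbreeding f."""
    # IBD alleles (probability f) are always homozygous; non-IBD alleles are
    # independent draws with reference allele probability p.
    p0 = f * (1.0 - p) + (1.0 - f) * (1.0 - p) ** 2
    p1 = (1.0 - f) * 2.0 * p * (1.0 - p)
    p2 = f * p + (1.0 - f) * p ** 2
    return p0, p1, p2


def mean_and_variance(dist: Tuple[float, float, float]) -> Tuple[float, float]:
    """Mean and variance of a genotype with the given probabilities for 0, 1, 2."""
    mean = sum(x * prob for x, prob in enumerate(dist))
    var = sum((x - mean) ** 2 * prob for x, prob in enumerate(dist))
    return mean, var


p, f = 0.3, 0.2
dist = genotype_distribution(p, f)
mean_x, var_x = mean_and_variance(dist)
print(mean_x, var_x)
```

Setting f = 0 gives the Binomial variance 2p(1 − p), and f = 1 gives the scaled Bernoulli variance 4p(1 − p), matching the two extremes described above.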
3 The coancestry model for individual allele frequencies
FST and its estimators are most often studied in terms of population allele frequencies [4–6, 33]. Here we introduce a coancestry model for individuals, which is based on individual-specific allele frequencies (IAFs) [31, 32] that accommodate arbitrary population-level relationships between individuals. Some authors use the terms “coancestry” and “kinship” interchangeably [5, 49, 50]; in our framework, kinship coefficients are general IBD probabilities (following [17]), and we reserve coancestry coefficients for the IAF covariance parameters (in analogy to the work of [5]).
In this section we introduce two parameters. First, πij ∈ [0,1] is the IAF of individual j at SNP i. Individual j draws its alleles independently according to probability πij. Allowing every SNP-individual pair to have a potentially unique allele frequency allows for arbitrary forms of population structure at the level of allele frequencies [32]. Second, $\theta_{jk}^T \in [0, 1]$ is the coancestry coefficient of individuals j and k relative to an ancestral population T, which modulates the covariance of πij and πik as shown below.
3.1 The coancestry model
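In our coancestry model, the IAFs πij have the following first and second moments, and genotypes are drawn from the IAFs:
$$\mathrm{E}[\pi_{ij} | T] = p_i^T, \tag{9}$$
$$\mathrm{Cov}(\pi_{ij}, \pi_{ik} | T) = p_i^T (1 - p_i^T) \theta_{jk}^T, \tag{10}$$
$$x_{ij} | \pi_{ij} \sim \text{Binomial}(2, \pi_{ij}). \tag{11}$$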
Eq. (9) implies that random drift is the only force acting on the IAFs, and is analogous to Eq. (6) in the kinship model. Eq. (10) is analogous to Eqs. (7) and (8) in the kinship model, with individual coancestry coefficients playing the role of the kinship and inbreeding coefficients (for j = k), a relationship elaborated in the next section. Lastly, Eq. (11) draws the two alleles of a genotype independently from the IAF, which models locally outbred and locally unrelated individuals [5]. Hence, the coancestry model excludes local relationships, so it is more restrictive than the kinship model.
Our coancestry model between individuals is closely related to previous models between populations [5, 33]. However, previous models allowed $\theta_{jk}^T < 0$. We require that $\theta_{jk}^T \ge 0$ for two reasons: (1) covariance is non-negative in latent structure models [51], such as population structure, and (2) it is necessary in order to relate $\theta_{jk}^T$ to IBD probabilities as shown next.
3.2 Relationship between coancestry and kinship coefficients
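Here we show that the coancestry coefficients for IAFs, $\theta_{jk}^T$, defined above can be written in terms of the kinship and inbreeding coefficients utilized in our more general model. We do so by relating our coancestry coefficients to general kinship coefficients by matching moments. Conditional on the IAFs, genotypes in the coancestry model have a Binomial distribution, so
$$\mathrm{E}[x_{ij} | \pi_{ij}] = 2 \pi_{ij}, \qquad \mathrm{Var}(x_{ij} | \pi_{ij}) = 2 \pi_{ij} (1 - \pi_{ij}).$$
We calculate total moments by marginalizing the IAFs. The total expectation is
$$\mathrm{E}[x_{ij} | T] = \mathrm{E}\left[ \mathrm{E}[x_{ij} | \pi_{ij}] \,|\, T \right] = 2 p_i^T,$$
which agrees with Eq. (6) of the kinship model. The total covariance is calculated using
$$\mathrm{Cov}(x_{ij}, x_{ik} | T) = \mathrm{E}\left[ \mathrm{Cov}(x_{ij}, x_{ik} | \pi_{ij}, \pi_{ik}) \,|\, T \right] + \mathrm{Cov}\left( \mathrm{E}[x_{ij} | \pi_{ij}], \mathrm{E}[x_{ik} | \pi_{ik}] \,|\, T \right).$$
The first term is zero for j ≠ k, since genotypes are drawn independently given the IAFs, and for j = k it is
$$\mathrm{E}\left[ 2 \pi_{ij} (1 - \pi_{ij}) \,|\, T \right] = 2 \left( \mathrm{E}[\pi_{ij} | T] - \mathrm{E}[\pi_{ij}^2 | T] \right) = 2 p_i^T (1 - p_i^T)(1 - \theta_{jj}^T).$$
The second term equals 4 Cov (πij, πik|T) for all (j, k) cases, which is given by Eq. (10). All together,
$$\mathrm{Var}(x_{ij} | T) = 2 p_i^T (1 - p_i^T)(1 + \theta_{jj}^T), \qquad \mathrm{Cov}(x_{ij}, x_{ik} | T) = 4 p_i^T (1 - p_i^T) \theta_{jk}^T \text{ for } j \neq k.$$
Comparing the above to Eqs. (7) and (8), we find that
$$\theta_{jj}^T = f_j^T, \qquad \theta_{jk}^T = \varphi_{jk}^T \text{ for } j \neq k.$$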
Therefore, our coancestry coefficients are equal to kinship coefficients, except that self coancestries are equal to inbreeding coefficients.
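The moment-matching argument can be checked on a toy example in which the IAF of one individual at one SNP follows a small discrete distribution. The sketch computes the implied self coancestry θ = Var(π) / (p(1 − p)), builds the marginal genotype distribution by mixing Binomial(2, π) draws, and compares the genotype variance with 2p(1 − p)(1 + θ). The function name and the three-point distribution are assumptions chosen only for illustration.

```python
from typing import List, Tuple


def iaf_mixture_moments(dist: List[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Return (p, theta, Var(x)) for an IAF distribution given as (probability, iaf) pairs."""
    p = sum(w * pi for w, pi in dist)
    var_pi = sum(w * (pi - p) ** 2 for w, pi in dist)
    theta = var_pi / (p * (1.0 - p))  # self coancestry implied by the IAF variance

    # Marginal genotype probabilities: Binomial(2, pi) averaged over the IAF.
    prob_het = sum(w * 2.0 * pi * (1.0 - pi) for w, pi in dist)
    prob_hom_ref = sum(w * pi ** 2 for w, pi in dist)
    mean_x = prob_het + 2.0 * prob_hom_ref
    var_x = prob_het + 4.0 * prob_hom_ref - mean_x ** 2
    return p, theta, var_x


# Hypothetical IAF distribution: probabilities sum to one.
iaf_dist = [(0.5, 0.2), (0.3, 0.5), (0.2, 0.9)]
p, theta, var_x = iaf_mixture_moments(iaf_dist)
print(p, theta, var_x, 2.0 * p * (1.0 - p) * (1.0 + theta))
```

The agreement of the last two printed values is what identifies the self coancestry with the inbreeding coefficient in the genotype variance.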
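Since individuals in our IAF coancestry model are locally outbred and unrelated, we also have $f_{L_j}^T = f_j^T = \theta_{jj}^T$ and $f_{L_{jk}}^T = \varphi_{jk}^T = \theta_{jk}^T$ for j ≠ k. Replacing these quantities in Eqs. (3) and (5), we obtain the generalized FST and pairwise FST in terms of coancestry coefficients:
$$F_{ST} = \sum_{j=1}^n w_j \theta_{jj}^T, \qquad F_{jk} = \frac{\frac{1}{2}\left(\theta_{jj}^T + \theta_{kk}^T\right) - \theta_{jk}^T}{1 - \theta_{jk}^T}.$$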
4 Coancestry and FST in admixture models
The Pritchard-Stephens-Donnelly (PSD) admixture model [34] is a well-established, tractable model of structure that is more complex than island models. There are several algorithms available to estimate the PSD model parameters [34–36, 40, 52]. This model assumes the existence of “intermediate” populations, from which individuals draw alleles according to their admixture proportions. However, the PSD model was not developed with FST in mind; we will present a modified model that is compatible with our coancestry model.
The PSD model is a special case of our coancestry model with the following additional parameters. The number of intermediate populations is denoted by K. Let $p_i^{S_u} \in [0, 1]$ be the reference allele frequency at SNP i and intermediate population Su. Lastly, qju ∈ [0, 1] is the admixture proportion of individual j for intermediate population Su. These proportions satisfy $\sum_{u=1}^K q_{ju} = 1$ for each j.
4.1 The PSD model with Balding-Nichols allele frequencies
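The original algorithm for fitting the PSD model [34] utilizes prior distributions for intermediate population allele frequencies and admixture proportions according to
$$p_i^{S_u} \sim \text{Uniform}(0, 1), \tag{16}$$
$$(q_{j1}, \ldots, q_{jK}) \sim \text{Dirichlet}(\alpha, \ldots, \alpha).$$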
It has been shown [32, 36] that their model is then equivalent to forming IAFs
$$\pi_{ij} = \sum_{u=1}^K q_{ju} p_i^{S_u},$$
where genotypes are then drawn independently according to xij ~ Binomial(2, πij).
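The IAF construction, a mixture of intermediate allele frequencies weighted by admixture proportions followed by a Binomial(2, π) genotype draw, can be sketched as follows. The helper names, the two intermediate populations, and the seed are hypothetical; the genotype draw uses two independent Bernoulli trials.

```python
import random
from typing import Sequence


def individual_allele_frequency(q_j: Sequence[float], p_i: Sequence[float]) -> float:
    """IAF of one individual at one SNP: sum over u of q_ju * p_i^{S_u}."""
    if len(q_j) != len(p_i):
        raise ValueError("need one allele frequency per intermediate population")
    if abs(sum(q_j) - 1.0) > 1e-9:
        raise ValueError("admixture proportions must sum to one")
    return sum(q * p for q, p in zip(q_j, p_i))


def draw_genotype(pi: float, rng: random.Random) -> int:
    """Draw x ~ Binomial(2, pi) as the number of reference alleles."""
    return sum(rng.random() < pi for _ in range(2))


rng = random.Random(1)            # fixed seed for reproducibility
p_intermediate = [0.2, 0.6]       # allele frequencies in S_1 and S_2 at one SNP
q_admixed = [0.25, 0.75]          # admixture proportions of one individual

pi = individual_allele_frequency(q_admixed, p_intermediate)
genotype = draw_genotype(pi, rng)
print(pi, genotype)
```

An unadmixed individual, with proportions (1, 0) or (0, 1), simply inherits the allele frequency of its single intermediate population.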
Here we consider an extension of this, which we call the “BN-PSD” model, by replacing Eq. (16) with the Balding-Nichols (BN) distribution [8] to generate the intermediate allele frequencies $p_i^{S_u}$. This combined model has been used to simulate structured genotypes [31, 38, 39], and is the target of some inference algorithms [37, 40]. The BN distribution is the following reparametrized Beta distribution,
$$\text{BN}(p, F) = \text{Beta}\left( \frac{1 - F}{F} p, \ \frac{1 - F}{F} (1 - p) \right),$$
where p is the ancestral allele frequency and F is the inbreeding coefficient [8]. The resulting allele frequencies p* fit into our coancestry model, since E[p*] = p and Var(p*) = p(1 − p)F hold.
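The stated mean and variance of the BN distribution follow from the standard Beta moments, which the sketch below evaluates for the shape parameters p(1 − F)/F and (1 − p)(1 − F)/F. The function names are hypothetical, and draws use `random.betavariate` from the Python standard library.

```python
import random
from typing import Tuple


def bn_shape(p: float, f: float) -> Tuple[float, float]:
    """Beta shape parameters of the Balding-Nichols distribution BN(p, F)."""
    if not 0.0 < p < 1.0 or not 0.0 < f < 1.0:
        raise ValueError("p and f must lie strictly between 0 and 1")
    scale = (1.0 - f) / f
    return scale * p, scale * (1.0 - p)


def beta_mean_var(a: float, b: float) -> Tuple[float, float]:
    """Mean and variance of a Beta(a, b) distribution."""
    total = a + b
    return a / total, a * b / (total ** 2 * (total + 1.0))


def draw_bn(p: float, f: float, rng: random.Random) -> float:
    """Draw one allele frequency from BN(p, F)."""
    a, b = bn_shape(p, f)
    return rng.betavariate(a, b)


p_anc, f_sub = 0.3, 0.1
mean_bn, var_bn = beta_mean_var(*bn_shape(p_anc, f_sub))
sample = draw_bn(p_anc, f_sub, random.Random(42))
print(mean_bn, var_bn, sample)
```

Because the shape parameters sum to (1 − F)/F, the Beta variance p(1 − p)/(a + b + 1) simplifies to p(1 − p)F.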
In BN-PSD, the allele frequencies $p_i^{S_u}$ are generated independently from
$$p_i^{S_u} | T \sim \text{BN}\left( p_i^T, f_{S_u}^T \right),$$
resulting in an island model structure for the intermediate populations Su.
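We calculate the coancestry parameters of this model by matching moments conditional on the admixture proportions Q = (qju). We calculate the expectation as
$$\mathrm{E}[\pi_{ij} | T] = \sum_{u=1}^K q_{ju} \mathrm{E}[p_i^{S_u} | T] = p_i^T \sum_{u=1}^K q_{ju} = p_i^T,$$
and the IAF covariance is
$$\mathrm{Cov}(\pi_{ij}, \pi_{ik} | T) = \sum_{u=1}^K q_{ju} q_{ku} \mathrm{Var}(p_i^{S_u} | T) = p_i^T (1 - p_i^T) \sum_{u=1}^K q_{ju} q_{ku} f_{S_u}^T,$$
where the cross terms between different intermediate populations vanish because their allele frequencies are independent.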
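By matching these to Eq. (10), we arrive at coancestry coefficients and FST of
$$\theta_{jk}^T = \sum_{u=1}^K q_{ju} q_{ku} f_{S_u}^T, \qquad F_{ST} = \sum_{j=1}^n w_j \sum_{u=1}^K q_{ju}^2 f_{S_u}^T.$$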
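For independent intermediate populations, the coancestry of two individuals is the sum over populations of the product of their admixture proportions times that population's inbreeding coefficient, and FST is the weighted mean of the self coancestries. The sketch below evaluates both for a small hypothetical example with two intermediate populations and uniform weights; all names and values are assumptions.

```python
from typing import List, Sequence


def bnpsd_coancestry(q_j: Sequence[float], q_k: Sequence[float], f_sub: Sequence[float]) -> float:
    """Coancestry of individuals j and k: sum over u of q_ju * q_ku * f_{S_u}."""
    return sum(a * b * f for a, b, f in zip(q_j, q_k, f_sub))


def bnpsd_fst(q: List[Sequence[float]], f_sub: Sequence[float]) -> float:
    """FST with uniform weights: mean of the self coancestries."""
    return sum(bnpsd_coancestry(q_j, q_j, f_sub) for q_j in q) / len(q)


# Rows: individuals; columns: admixture proportions for S_1 and S_2.
q_matrix = [
    [1.0, 0.0],   # unadmixed, from S_1
    [0.5, 0.5],   # evenly admixed
    [0.0, 1.0],   # unadmixed, from S_2
]
f_sub = [0.1, 0.3]  # inbreeding of S_1 and S_2 relative to T

theta_12 = bnpsd_coancestry(q_matrix[0], q_matrix[1], f_sub)
theta_13 = bnpsd_coancestry(q_matrix[0], q_matrix[2], f_sub)
theta_22 = bnpsd_coancestry(q_matrix[1], q_matrix[1], f_sub)
fst = bnpsd_fst(q_matrix, f_sub)
print(theta_12, theta_13, theta_22, fst)
```

Note that the evenly admixed individual has a lower self coancestry than the average of the two intermediate populations, since the squared proportions sum to less than one.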
4.2 The BN-PSD model with full coancestry
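The BN-PSD contains a restriction that the K intermediate populations are independent. Suppose instead that the intermediate population allele frequencies satisfy our more general coancestry model:
$$\mathrm{E}[p_i^{S_u} | T] = p_i^T, \qquad \mathrm{Cov}(p_i^{S_u}, p_i^{S_v} | T) = p_i^T (1 - p_i^T) \theta_{S_u S_v}^T,$$
where $\theta_{S_u S_v}^T$ is the coancestry of the intermediate populations Su and Sv. Note that the previous BN-PSD model satisfies $\theta_{S_u S_u}^T = f_{S_u}^T$ and $\theta_{S_u S_v}^T = 0$ for u ≠ v. Repeating our calculations assuming our full coancestry setting, individual coancestry coefficients and FST are given by
$$\theta_{jk}^T = \sum_{u=1}^K \sum_{v=1}^K q_{ju} q_{kv} \theta_{S_u S_v}^T, \qquad F_{ST} = \sum_{j=1}^n w_j \sum_{u=1}^K \sum_{v=1}^K q_{ju} q_{jv} \theta_{S_u S_v}^T.$$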
Therefore, all coancestry coefficients of the intermediate populations influence the coancestry coefficients between individuals and the overall FST. The form for $\theta_{jk}^T$ above has a simple probabilistic interpretation: the probability of IBD at random SNPs between individuals j and k corresponds to the sum for each pair of ancestries u and v of the probability of the pairing (qjuqkv) times the probability of IBD between these populations ($\theta_{S_u S_v}^T$).
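The sum over pairs of ancestries is a bilinear form in the two admixture vectors. The sketch below evaluates it for a hypothetical 2 × 2 coancestry matrix of intermediate populations and confirms that a diagonal matrix recovers the independent (island) case; the function name and all values are illustrative assumptions.

```python
from typing import Sequence


def full_coancestry(
    q_j: Sequence[float],
    q_k: Sequence[float],
    theta_sub: Sequence[Sequence[float]],
) -> float:
    """Coancestry of j and k: sum over (u, v) of q_ju * q_kv * theta_sub[u][v]."""
    return sum(
        q_j[u] * q_k[v] * theta_sub[u][v]
        for u in range(len(q_j))
        for v in range(len(q_k))
    )


# Correlated intermediate populations: the off-diagonal coancestry is non-zero.
theta_correlated = [[0.10, 0.05],
                    [0.05, 0.30]]
# Independent intermediate populations with the same self coancestries.
theta_independent = [[0.10, 0.00],
                     [0.00, 0.30]]

q_admixed = [0.5, 0.5]
self_correlated = full_coancestry(q_admixed, q_admixed, theta_correlated)
self_independent = full_coancestry(q_admixed, q_admixed, theta_independent)
print(self_correlated, self_independent)
```

The shared history of the intermediate populations raises the self coancestry of the admixed individual from 0.1 to 0.125 in this example.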
5 Discussion
We presented a generalized FST definition corresponding to a weighted mean of individual-specific inbreeding coefficients. Compared to previous FST definitions, ours is applicable to arbitrary population structures, and in particular does not require the existence of discrete subpopulations. A special case of our generalized FST is the pairwise FST of two individuals, which generalizes the pairwise FST between two populations that is part of many modern analyses [6, 11, 26–30].
We considered two closely-related population structure models with individual-level resolution: the kinship model for genotypes, and our new coancestry model for IAFs (individual-specific allele frequencies). The kinship model is the most general, applicable to the genotypes in arbitrary sets of individuals. Our IAF model requires a local form of Hardy-Weinberg equilibrium to hold, and it does not model locally related or locally inbred individuals. Nevertheless, IAFs arise in many applications, including admixture models [35], estimation of local kinship [31], genome-wide association studies [53], and the logistic factor analysis [32]. We prove that kinship coefficients, which control genotype covariance, also control IAF covariance under our coancestry model.
We also calculated FST for admixture models. To achieve this, we framed the PSD (Pritchard-Stephens-Donnelly) admixture model as a special case of our IAF coancestry model, and studied extensions where the intermediate populations are more structured. FST was previously studied in an admixture model under Nei’s FST definition for one locus, where FST in the admixed population is given by a ratio involving admixture proportions and intermediate population allele frequencies [54]. On the other hand, our FST is an IBD probability shared by all loci and independent of allele frequencies. Under our framework, the FST of an admixed individual is a sum of products, which is quadratic in the admixture proportions and linear in the coancestry coefficients of the intermediate populations. In the future, inference algorithms for our admixture model with fully correlated intermediate populations could yield improved results, including coancestry and FST estimates.
Our probabilistic model reconnects FST [5, 6] to inbreeding and kinship coefficients [17, 50, 55], all quantities of great interest in population genetics, but which are studied in increasing isolation. The main reason for this isolation is that FST estimation assumes the island model, in which kinship coefficients are uninteresting. However, study of the generalized FST in arbitrary population structures requires the consideration of arbitrary kinship coefficients [17]. Our work lays the foundation necessary to study estimation of the generalized FST, which is the focus of our next publications in this series [41, 42].
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].