Abstract
The population differentiation statistic FST, introduced by Sewall Wright, is often treated as a pairwise distance measure between populations. As was known to Wright, however, FST is not a true metric because allele frequencies exist for which it does not satisfy the triangle inequality. We prove that a stronger result holds: for biallelic markers whose allele frequencies differ across three populations, FST never satisfies the triangle inequality. We study the deviation from the triangle inequality as a function of the allele frequencies of three populations, identifying frequency vectors at which the deviation is maximal. We also examine the implications of the failure of the triangle inequality for the four-point condition for groups of four populations. Next, we examine the extent to which FST fails to satisfy the triangle inequality in genome-wide data from human populations, finding that some loci have frequencies that produce deviations near the maximum. We discuss the consequences of the theoretical results for various types of data analysis, including multidimensional scaling and inference of neighbor-joining trees from pairwise FST matrices.
1. Introduction
Introduced by Wright (1951), FST, which provides a measure of population structure for population-genetic data, is one of the most commonly used statistics in population genetics (Holsinger and Weir 2009). Pairwise FST computed between two populations is often viewed as a measure of genetic “distance” between the populations (e.g. Jorde 1985, Rosenberg et al. 2005). Indeed, FST is frequently treated as a distance in many types of analysis for representing relationships between multiple populations, such as in the distance matrices used for spatially depicting genetic variation and in inference of population trees (e.g. Pérez-Lezaun et al. 1997, Li et al. 2008).
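In the formulation of Nei (1973), FST can be written
$$F_{ST} = \frac{J_S - J_T}{1 - J_T}, \tag{1}$$
where JS is the mean homozygosity across a set of subpopulations and JT is the homozygosity of a population formed by pooling the subpopulations, assuming they have equal representation.
For the case of two subpopulations, using pki for the frequency of allele i in subpopulation k, $J_S = \frac{1}{2}\sum_{i=1}^{I}(p_{1i}^2 + p_{2i}^2)$ and $J_T = \sum_{i=1}^{I}[(p_{1i} + p_{2i})/2]^2$, where I is the total number of alleles at a locus of interest. Hence, eq. (1) reduces in this case to
$$F_{ST} = \frac{\sum_{i=1}^{I}\delta_i^2}{4 - \sum_{i=1}^{I}\sigma_i^2}, \tag{2}$$
where $\delta_i = |p_{1i} - p_{2i}|$ and $\sigma_i = p_{1i} + p_{2i}$.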
FST, eq. (2), has some of the properties required by a mathematical measure of distance: it is symmetric with respect to a change in the population labels, it is nonnegative, and it is equal to 0 if and only if two populations have the same allele frequencies (p1i = p2i for all i). Yet, FST is not a true distance metric because it does not satisfy the triangle inequality: with three populations, the sum of two of the distances can be smaller than the third distance. In fact, Sewall Wright (1978, p. 89) was aware of this fact, offering a counterexample of three populations whose allele frequencies result in values of FST that do not satisfy the inequality: a biallelic locus that is monomorphic for one allele in population 1 and monomorphic for the other allele in population 2, and has equal frequencies for the two alleles in population 3.
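As a minimal sketch of eqs. (1) and (2), the following Python computes pairwise FST from two allele-frequency vectors and evaluates it on Wright's counterexample. The function name `fst_pair`, the use of exact `Fraction` arithmetic, and the value 0 returned when both populations are fixed for the same allele are illustrative assumptions rather than part of the original formulation.

```python
from fractions import Fraction

def fst_pair(p1, p2):
    """Pairwise F_ST = (J_S - J_T) / (1 - J_T) for two allele-frequency vectors."""
    if len(p1) != len(p2):
        raise ValueError("both vectors must list the same alleles")
    # J_S: mean homozygosity of the two subpopulations.
    j_s = sum(a * a + b * b for a, b in zip(p1, p2)) / 2
    # J_T: homozygosity of the equally weighted pooled population.
    j_t = sum(((a + b) / 2) ** 2 for a, b in zip(p1, p2))
    if j_t == 1:
        # Both populations fixed for the same allele (assumed convention).
        return Fraction(0)
    return (j_s - j_t) / (1 - j_t)

half = Fraction(1, 2)
pop1 = [Fraction(1), Fraction(0)]   # monomorphic for the first allele
pop2 = [Fraction(0), Fraction(1)]   # monomorphic for the second allele
pop3 = [half, half]                 # equal frequencies of the two alleles

f12 = fst_pair(pop1, pop2)
f13 = fst_pair(pop1, pop3)
f23 = fst_pair(pop3, pop2)
print(f12, f13, f23)
```

In this example the two distances through population 3 sum to less than the direct distance between populations 1 and 2, which is the violation Wright described.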
Here, we generalize beyond Wright’s counterexample to show that not only is it possible for FST to violate the triangle inequality, but for a biallelic locus with distinct allele frequencies in three populations, FST never satisfies it. We explore the extent to which FST fails to satisfy the triangle inequality over the space of possible allele frequencies, finding that the maximal deviation from the condition specified by the triangle inequality occurs precisely in Wright’s counterexample. We also show that the failure to satisfy the triangle inequality has as a consequence a failure of the four-point condition associated with construction of evolutionary trees. To consider the context of our theoretical results in data analysis, we examine the extent to which FST fails to satisfy the triangle inequality in data from three human populations. We also examine the impact of the mathematical results in multidimensional scaling analysis and on inference of population trees by neighbor-joining.
2. The Triangle Inequality Never Holds for Biallelic Markers with Distinct Allele Frequencies
We consider a biallelic locus in three populations. We choose one allele and label its frequencies in populations 1, 2, and 3, by p1, p2, and p3, respectively. Without loss of generality, we assume 0 ⩽ p1 ⩽ p2 ⩽ p3 ⩽ 1. We can define F (pi, pj) as the value of FST measured between two populations i and j, in which the frequencies of the chosen allele are pi and pj, respectively.
Simplifying the expression for FST from eq. (2) by noting that the other allele of the biallelic locus has frequency qi = 1 − pi and qj = 1 − pj in populations i and j, respectively, FST (pi, pj), or Fij for short, can be written
$$F_{ij} = \frac{(p_i - p_j)^2}{(p_i + p_j)(2 - p_i - p_j)}. \tag{3}$$
At pi = pj = 0 and at pi = pj = 1, we define Fij to be 0. We disregard the cases of p1 = p2 = p3 = 0 and p1 = p2 = p3 = 1, as these cases do not represent polymorphic loci.
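The biallelic expression can be checked against the general two-population formula. The sketch below assumes that eq. (2) is written as the sum of squared allele-frequency differences divided by 4 minus the sum of squared allele-frequency sums; the function names are hypothetical, and exact rational arithmetic is used so that equality can be tested without rounding.

```python
from fractions import Fraction

def fst_biallelic(pi, pj):
    """Eq. (3), with F = 0 when both populations are fixed for the same allele."""
    total = pi + pj
    if total == 0 or total == 2:
        return Fraction(0)
    return (pi - pj) ** 2 / (total * (2 - total))

def fst_two_allele_vectors(pi, pj):
    """Assumed form of eq. (2) applied to (pi, 1 - pi) and (pj, 1 - pj)."""
    deltas = [pi - pj, (1 - pi) - (1 - pj)]
    sigmas = [pi + pj, (1 - pi) + (1 - pj)]
    denom = 4 - sum(s * s for s in sigmas)
    if denom == 0:
        return Fraction(0)
    return sum(d * d for d in deltas) / denom

# Compare the two expressions on a grid of exact frequencies.
grid = [Fraction(k, 10) for k in range(11)]
agree = all(fst_biallelic(a, b) == fst_two_allele_vectors(a, b)
            for a in grid for b in grid)
print(agree)
```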
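The triangle inequality holds for FST in three populations if and only if all three of the following inequalities hold:
$$F_{12} + F_{13} \geq F_{23}, \tag{4}$$
$$F_{13} + F_{23} \geq F_{12}, \tag{5}$$
$$F_{12} + F_{23} \geq F_{13}. \tag{6}$$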
We show that eqs. (4) and (5) always hold, as these statements place the largest of the three FST values, F13, on the larger side of the inequality. We also show that when p1 = p2 ⩽ p3 or p1 ⩽ p2 = p3, eq. (6) also holds, so that the triangle inequality is satisfied. However, we show that when p1 < p2 < p3, the triangle inequality fails: while eqs. (4) and (5) do hold, eq. (6) does not.
2.1. At least two of the three inequalities always hold
To show that eqs. (4) and (5) hold, it suffices to show that
$$F_{12} \leq F_{13}, \tag{7}$$
$$F_{23} \leq F_{13}. \tag{8}$$
We need only show eq. (7) to also prove eq. (8). In particular, because pi = 1 − qi and p1 ⩽ p2 ⩽ p3, q3 ⩽ q2 ⩽ q1 and FST (pi, pj) = FST (qi, qj). Hence, by switching the population labels for populations 1 and 3 in eq. (8) and using frequencies qi in place of pi, eqs. (7) and (8) are equivalent.
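To prove eqs. (4) and (5), it remains to prove eq. (7). By eq. (3), we wish to show:
$$\frac{(p_2 - p_1)^2}{(p_1 + p_2)(2 - p_1 - p_2)} \leq \frac{(p_3 - p_1)^2}{(p_1 + p_3)(2 - p_1 - p_3)}.$$
For convenience, we define
$$\sigma_{12} = p_1 + p_2, \quad \delta_{12} = p_2 - p_1, \quad \delta_{23} = p_3 - p_2.$$
By these definitions,
$$p_1 + p_3 = \sigma_{12} + \delta_{23}, \quad p_3 - p_1 = \delta_{12} + \delta_{23}.$$
Therefore, we seek to show:
$$\frac{\delta_{12}^2}{\sigma_{12}(2 - \sigma_{12})} \leq \frac{(\delta_{12} + \delta_{23})^2}{(\sigma_{12} + \delta_{23})(2 - \sigma_{12} - \delta_{23})}. \tag{9}$$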
For the cases in which σ12 = 0 or σ12 = 2, we have p1 = p2 = 0 and p1 = p2 = 1, respectively, which we defined in eq. (3) to have F12 = 0. If F12 = 0, then F12 ⩽ F13 trivially because 0 ⩽ F13. Hence, noting that δ12 = 0 if p1 = p2, inequality (9) always holds for p1 = p2 = p3.
Similarly, if σ12 + δ23 = 0, then p1 = p3 = 0, and if σ12 + δ23 = 2, then p1 = p3 = 1. It follows that p1 = p3 = p2, so the locus is not polymorphic. Consequently, the cases of σ12 + δ23 = 0 and σ12 + δ23 = 2 are excluded by our assumption that the locus is polymorphic.
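Having dealt with the cases in which denominators in eq. (9) are zero, we rearrange eq. (9) and find that eq. (9) holds if
$$\delta_{12}^2(\sigma_{12} + \delta_{23})(2 - \sigma_{12} - \delta_{23}) \leq (\delta_{12} + \delta_{23})^2\,\sigma_{12}(2 - \sigma_{12}). \tag{10}$$
Inequality (10) is equivalent to:
$$\delta_{23}\left\{2\delta_{12}\left[\sigma_{12}(2 - \sigma_{12}) - \delta_{12}(1 - \sigma_{12})\right] + \delta_{23}\left[\sigma_{12}(2 - \sigma_{12}) + \delta_{12}^2\right]\right\} \geq 0. \tag{11}$$
Because δ23 ⩾ 0, δ23 does not change the sign of the expression in eq. (11). Hence, eq. (11) holds if δ23 = 0, or if 2δ12[σ12(2 − σ12) − δ12(1 − σ12)] + δ23[σ12(2 − σ12) + δ12²] ⩾ 0. The latter inequality always holds, as δ23[σ12(2 − σ12) + δ12²] ⩾ 0, and σ12(2 − σ12) − δ12(1 − σ12) ⩾ 0, noting that σ12 ⩾ δ12, 2 − σ12 ⩾ 0, and 2 − σ12 > 1 − σ12. Therefore, eq. (11) holds, eq. (7) follows, and eqs. (4) and (5) are both true.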
2.2. If two of the three populations have identical frequencies, then eq. (6) always holds
We show that eq. (6) always holds when p1 = p2 ⩽ p3 or p1 ⩽ p2 = p3. If p1 = p2, then F23 = F13 and F12 = 0. Eq. (6) holds trivially as 0 + F23 ⩾ F23. Similarly, if p2 = p3, then F12 = F13 and F23 = 0. Eq. (6) holds trivially, as F12 + 0 ⩾ F12. Because eqs. (4)-(6) are all satisfied, the triangle inequality is satisfied for three populations if either p1 = p2 ⩽ p3 or p1 ⩽ p2 = p3.
Note that the triangle inequality is satisfied in the case that two of three points are the same for any function f on some set X that is symmetric and has the identity of indiscernibles, i.e. f (x, x) = 0 for all x ∈ X. With the added condition of nonnegativity, such a function is sometimes called a distance function or distance measure (to distinguish it from a true distance metric that also satisfies the triangle inequality).
2.3. If the three populations have distinct frequencies, then eq. (6) never holds
To show that eq. (6) does not hold when 0 ⩽ p1 < p2 < p3 ⩽ 1, we consider a function
$$\psi(p_1, p_2, p_3) = F_{ST}(p_1, p_2) + F_{ST}(p_2, p_3) - F_{ST}(p_1, p_3). \tag{12}$$
Eq. (6) holds and hence the triangle inequality holds if and only if ψ(p1, p2, p3) ⩾ 0. We proceed in several steps to show that ψ(p1, p2, p3) < 0 when 0 ⩽ p1 < p2 < p3 ⩽ 1.
We write ψ(p1, p2, p3) as a fraction with positive denominator and relate ψ(p1, p2, p3) to its numerator ω(p1, p2, p3). Because the denominator of ψ(p1, p2, p3) is always positive, ψ(p1, p2, p3) < 0 if and only if ω(p1, p2, p3) < 0.
Next, we show that as a function of any one of its variables, ω(p1, p2, p3) is a quartic function. Considering ω(p1, p2, p3) as a function of p2, two of its roots lie at p2 = p1 and p2 = p3. Therefore, we can define a quadratic function in p2, φ(p1, p2, p3):
$$\omega(p_1, p_2, p_3) = (p_1 - p_2)(p_2 - p_3)\,\varphi(p_1, p_2, p_3).$$
Because we assume p1 < p2 and p2 < p3, ω(p1, p2, p3) < 0 if and only if φ(p1, p2, p3) < 0.
We then consider the roots of φ(p1, p2, p3) as functions of p2, r1(p1, p3) and r2(p1, p3). We show that r1(p1, p3) ⩽ p1, r2(p1, p3) ⩾ p3, and φ(p1, p2, p3) < 0 for p2 ∈ (r1(p1, p3), r2(p1, p3)).
We conclude that because φ(p1, p2, p3) < 0, ω(p1, p2, p3) < 0, which implies ψ(p1, p2, p3) < 0.
Hence when 0 ⩽ p1 < p2 < p3 ⩽ 1, eq. (6) does not hold, and the triangle inequality does not hold for a biallelic marker with distinct allele frequencies in three populations.
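Before working through the proof, the claims of this section can be checked numerically on a grid of exact frequencies. The sketch below is only a sanity check under the conventions stated above (F = 0 when both populations are fixed for the same allele); the grid resolution and the function names are arbitrary choices, and a grid check is of course no substitute for the proof.

```python
from fractions import Fraction
from itertools import combinations_with_replacement

def fst(pi, pj):
    """Biallelic F_ST of eq. (3), with F = 0 at pi = pj = 0 and pi = pj = 1."""
    total = pi + pj
    if total == 0 or total == 2:
        return Fraction(0)
    return (pi - pj) ** 2 / (total * (2 - total))

def psi(p1, p2, p3):
    """Eq. (12): negative exactly when eq. (6) fails."""
    return fst(p1, p2) + fst(p2, p3) - fst(p1, p3)

def check_grid(steps=20):
    """True if every grid triple p1 <= p2 <= p3 behaves as Section 2 states."""
    grid = [Fraction(k, steps) for k in range(steps + 1)]
    # combinations_with_replacement on a sorted grid yields p1 <= p2 <= p3.
    for p1, p2, p3 in combinations_with_replacement(grid, 3):
        f12, f23, f13 = fst(p1, p2), fst(p2, p3), fst(p1, p3)
        if f12 > f13 or f23 > f13:      # F13 should be the largest value
            return False
        value = psi(p1, p2, p3)
        if p1 < p2 < p3:
            if value >= 0:              # eq. (6) should fail strictly
                return False
        elif value != 0:                # with a tie, eq. (6) holds with equality
            return False
    return True

print(check_grid())
```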
2.3.1. ψ(p1, p2, p3) < 0 if and only if ω(p1, p2, p3) < 0 for 0 ⩽ p1 < p2 < p3 ⩽ 1
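We simplify eq. (12) for ψ(p1, p2, p3) by using eq. (3):
$$\psi(p_1, p_2, p_3) = \frac{(p_1 - p_2)^2}{(p_1 + p_2)(2 - p_1 - p_2)} + \frac{(p_2 - p_3)^2}{(p_2 + p_3)(2 - p_2 - p_3)} - \frac{(p_1 - p_3)^2}{(p_1 + p_3)(2 - p_1 - p_3)}.$$
Define ω(x, y, z) as:
$$\omega(x, y, z) = (x - y)^2(y + z)(2 - y - z)(x + z)(2 - x - z) + (y - z)^2(x + y)(2 - x - y)(x + z)(2 - x - z) - (x - z)^2(x + y)(2 - x - y)(y + z)(2 - y - z).$$
If x = p1, y = p2, and z = p3, then
$$\psi(p_1, p_2, p_3) = \frac{\omega(p_1, p_2, p_3)}{(p_1 + p_2)(2 - p_1 - p_2)(p_2 + p_3)(2 - p_2 - p_3)(p_1 + p_3)(2 - p_1 - p_3)}.$$
The denominator is always positive when 0 ⩽ p1 < p2 < p3 ⩽ 1, so ψ(p1, p2, p3) < 0 if and only if ω(p1, p2, p3) < 0.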
2.3.2. ω(p1, p2, p3) < 0 if and only if φ(p1, p2, p3) < 0 for 0 ⩽ p1 < p2 < p3 ⩽ 1
Consider ω(x, y, z) with x = p1 and z = p3 fixed, so that ω(p1, y, p3) is only a function of y. Because ω is quartic in y, it has at most four distinct roots, each of which can be expressed as a function of p1 and p3.
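It is trivial to show that y = p1 and y = p3 are both roots of ω(p1, y, p3). Consequently, ω(p1, y, p3) has at most two other roots for y. Define a quadratic function φ(p1, y, p3) such that:
$$\omega(p_1, y, p_3) = (p_1 - y)(y - p_3)\,\varphi(p_1, y, p_3).$$
Performing polynomial division, we can write
$$\varphi(p_1, y, p_3) = \alpha(p_1, p_3)\,y^2 + \beta(p_1, p_3)\,y + \gamma(p_1, p_3),$$
where:
$$\alpha(p_1, p_3) = 4p_1 + 4p_3 - p_1^2 - 6p_1p_3 - p_3^2, \tag{13}$$
$$\beta(p_1, p_3) = (p_1 - p_3)^2(3p_1 + 3p_3 - 4) - 2(p_1 + p_3)(2 - p_1 - p_3)^2, \tag{14}$$
$$\gamma(p_1, p_3) = -p_1p_3\left[(p_1 + p_3)^2 - 12(p_1 + p_3) + 16 + 4p_1p_3\right]. \tag{15}$$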
For 0 ⩽ p1 < p2 < p3 ⩽ 1, p1 − p2 < 0 and p2 − p3 < 0, so ω(p1, p2, p3) < 0 if and only if φ(p1, p2, p3) < 0.
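The factorization can be illustrated with exact arithmetic. The sketch below assumes that ω is the numerator obtained by placing the three terms of ψ over the common denominator, and that φ is the cofactor left after dividing out (p1 − y)(y − p3). It checks that y = p1 and y = p3 are roots of ω and that φ behaves as a quadratic in y (its third finite difference vanishes). The helper names are hypothetical.

```python
from fractions import Fraction

def denom(u, v):
    """Denominator of F_ST(u, v) in eq. (3)."""
    return (u + v) * (2 - u - v)

def omega(x, y, z):
    """Numerator of psi(x, y, z) over the common denominator (assumed form)."""
    return ((x - y) ** 2 * denom(y, z) * denom(x, z)
            + (y - z) ** 2 * denom(x, y) * denom(x, z)
            - (x - z) ** 2 * denom(x, y) * denom(y, z))

def phi(x, y, z):
    """Cofactor of (x - y)(y - z) in omega; requires y different from x and z."""
    return omega(x, y, z) / ((x - y) * (y - z))

p1, p3 = Fraction(1, 5), Fraction(4, 5)

# y = p1 and y = p3 are roots of omega(p1, y, p3).
roots_ok = omega(p1, p1, p3) == 0 and omega(p1, p3, p3) == 0

# For a quadratic sampled at equally spaced points, the third difference is 0.
ys = [Fraction(k, 10) for k in (3, 4, 5, 6)]
values = [phi(p1, y, p3) for y in ys]
third_difference = values[3] - 3 * values[2] + 3 * values[1] - values[0]

print(roots_ok, third_difference, values)
```

All sampled values of φ lie strictly between p1 and p3 and are negative, in line with the equivalence stated above.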
2.3.3. φ(p1, p2, p3) < 0 when 0 ⩽ p1 < p2 < p3 ⩽ 1
We know that φ(p1, y, p3) has at most two roots, r1(p1, p3) and r2(p1, p3). If these two roots are distinct, then φ(p1, y, p3) > 0 or φ(p1, y, p3) < 0 for values of y between the roots. Without loss of generality, assume r1(p1, p3) ⩽ r2(p1, p3). To show that φ(p1, p2, p3) < 0 for 0 ⩽ p1 < p2 < p3 ⩽ 1, we need to show all of the following:
(1) If r1(p1, p3) < y < r2(p1, p3), then φ(p1, y, p3) < 0,
(2) r1(p1, p3) ⩽ p1,
(3) r2(p1, p3) ⩾ p3.
Note that because p1 < p3, demonstrating (2) and (3) suffices to show that the two roots r1(p1, p3) and r2(p1, p3) are distinct.
2.3.3.1. If r1(p1, p3) < y < r2(p1, p3), then φ(p1, y, p3) < 0
φ(p1, y, p3) is quadratic in y with leading coefficient α(p1, p3). If α(p1, p3) > 0 and the two roots r1(p1, p3) and r2(p1, p3) are distinct, then φ(p1, y, p3) < 0 between the roots of φ.
To show that α(p1, p3) > 0, rewrite α(p1, p3) as follows:
$$\alpha(p_1, p_3) = 2(p_1 + p_3)(2 - p_1 - p_3) + (p_1 - p_3)^2.$$
Because p1 + p3 ⩾ 0 and 2 − p1 − p3 ⩾ 0, we then have
$$\alpha(p_1, p_3) \geq (p_1 - p_3)^2,$$
from which we conclude α(p1, p3) > 0 because p1 ≠ p3.
2.3.3.2. r1(p1, p3) ⩽ p1
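Note that r1(p1, p3) ⩽ 0 implies that r1(p1, p3) ⩽ p1 because 0 ⩽ p1. We can solve the quadratic equation φ(p1, y, p3) = 0 for the value of y as a function of p1 and p3, taking the smaller root to be r1(p1, p3):
$$r_1(p_1, p_3) = \frac{-\beta - \sqrt{\beta^2 - 4\alpha\gamma}}{2\alpha}.$$
To show r1(p1, p3) ⩽ 0 ⩽ p1, because we have demonstrated that α(p1, p3) > 0, we must show
$$-\beta - \sqrt{\beta^2 - 4\alpha\gamma} \leq 0.$$
It suffices to show that γ(p1, p3) ⩽ 0.
Write $\gamma(p_1, p_3) = p_1p_3\,\tilde{\gamma}(p_1, p_3)$, where
$$\tilde{\gamma}(p_1, p_3) = -(p_1 + p_3)^2 + 12(p_1 + p_3) - 16 - 4p_1p_3.$$
The partial derivatives of $\tilde{\gamma}$ are positive for 0 ⩽ p1 < p3 ⩽ 1:
$$\frac{\partial\tilde{\gamma}}{\partial p_1} = 12 - 2p_1 - 6p_3 > 0, \quad \frac{\partial\tilde{\gamma}}{\partial p_3} = 12 - 6p_1 - 2p_3 > 0.$$
Hence, $\tilde{\gamma}$ is increasing in p1 ∈ [0, 1] and p3 ∈ [p1, 1] and is maximized at (p1, p3) = (1, 1). Because $\tilde{\gamma}(1, 1) = 0$, it follows that $\tilde{\gamma}(p_1, p_3) < 0$ for all (p1, p3) with 0 ⩽ p1 < 1 and p1 ⩽ p3 < 1.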
We conclude γ(p1, p3) ⩽ 0 and therefore r1(p1, p3) ⩽ p1.
2.3.3.3. r2(p1, p3) ⩾ p3
It suffices to show r2(p1, p3) ⩾ 1 ⩾ p3.
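Taking the positive root of the quadratic equation φ(p1, p2, p3) = 0,
$$r_2(p_1, p_3) = \frac{-\beta + \sqrt{\beta^2 - 4\alpha\gamma}}{2\alpha}.$$
Because α(p1, p3) > 0 and γ(p1, p3) ⩽ 0, and leaving off the arguments, it suffices to show α + β + γ ⩽ 0. If α + β + γ ⩽ 0 then 4α² + 4αβ + 4αγ ⩽ 0, (2α + β)² ⩽ β² − 4αγ, and, thus, $\left(-\beta + \sqrt{\beta^2 - 4\alpha\gamma}\right)/(2\alpha) \geq 1$.
Define g(p1, p3) = α + β + γ. We can simplify the condition g(p1, p3) ⩽ 0, by using eqs. (13)-(15):
$$g(p_1, p_3) = -(1 - p_1)(1 - p_3)\left[(p_1 + p_3)^2 + 4(p_1 + p_3) + 4p_1p_3\right] \leq 0.$$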
If 0 ⩽ p1 < p3 ⩽ 1, then g(p1, p3) ⩽ 0, as all of the factors in parentheses are nonnegative. Because g(p1, p3) ⩽ 0, it follows that r2(p1, p3) ⩾ p3.
We conclude that φ(p1, p2, p3) < 0 if 0 ⩽ p1 < p2 < p3 ⩽ 1. We have r1(p1, p3) ⩽ p1 and r2(p1, p3) ⩾ p3, the roots of φ(p1, y, p3) = 0 are distinct, and φ(p1, y, p3) < 0 between the roots, r1(p1, p3) and r2(p1, p3).
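The location of the roots can also be examined numerically. The sketch below recovers the three coefficients of φ(p1, y, p3) from exact evaluations at y = 2, 3, 4 (arbitrary points outside the unit interval, chosen here only for convenience) rather than from closed-form expressions, then computes the roots in floating point. The function names and the tolerance are assumptions of this illustration.

```python
from fractions import Fraction
import math

def denom(u, v):
    return (u + v) * (2 - u - v)

def omega(x, y, z):
    return ((x - y) ** 2 * denom(y, z) * denom(x, z)
            + (y - z) ** 2 * denom(x, y) * denom(x, z)
            - (x - z) ** 2 * denom(x, y) * denom(y, z))

def phi(x, y, z):
    return omega(x, y, z) / ((x - y) * (y - z))

def coefficients(p1, p3):
    """Exact alpha, beta, gamma of phi(p1, y, p3) = alpha*y^2 + beta*y + gamma."""
    f2, f3, f4 = (phi(p1, Fraction(y), p3) for y in (2, 3, 4))
    alpha = (f2 - 2 * f3 + f4) / 2      # second difference equals 2*alpha
    beta = (f3 - f2) - 5 * alpha        # f3 - f2 = 5*alpha + beta
    gamma = f2 - 4 * alpha - 2 * beta
    return alpha, beta, gamma

def roots(p1, p3):
    """Floating-point roots r1 <= r2 of phi(p1, y, p3) = 0."""
    alpha, beta, gamma = coefficients(p1, p3)
    disc = math.sqrt(beta * beta - 4 * alpha * gamma)
    return float((-beta - disc) / (2 * alpha)), float((-beta + disc) / (2 * alpha))

def roots_bracket_unit_interval(steps=10, tol=1e-9):
    """Check alpha > 0, r1 <= 0, and r2 >= 1 for all grid pairs p1 < p3."""
    grid = [Fraction(k, steps) for k in range(steps + 1)]
    for i, p1 in enumerate(grid):
        for p3 in grid[i + 1:]:
            alpha, _, _ = coefficients(p1, p3)
            r1, r2 = roots(p1, p3)
            if alpha <= 0 or r1 > tol or r2 < 1 - tol:
                return False
    return True

print(roots(Fraction(1, 5), Fraction(4, 5)))
print(roots_bracket_unit_interval())
```

Because r1 ⩽ 0 ⩽ p1 and r2 ⩾ 1 ⩾ p3 on the grid, every intermediate p2 falls between the roots, where the upward-opening quadratic is negative.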
2.3.4. Concluding the proof
Because we have shown that for 0 ⩽ p1 < p2 < p3 ⩽ 1, φ(p1, p2, p3) < 0 and φ(p1, p2, p3) < 0 implies ω(p1, p2, p3) < 0, in turn implying ψ(p1, p2, p3) < 0 for 0 ⩽ p1 < p2 < p3 ⩽ 1, eq. (6) is never satisfied, and the triangle inequality is never satisfied for biallelic markers with 0 ⩽ p1 < p2 < p3 ⩽ 1.
3. The Maximal Deviation from the Triangle Inequality Occurs at Sewall Wright’s Counterexample
3.1. Visualization of ψ(p1, p2, p3)
As shown in Section 2.3, ψ(p1, p2, p3), measuring the extent to which the triangle inequality fails to be satisfied, is always less than or equal to 0 for 0 ⩽ p1 ⩽ p2 ⩽ p3 ⩽ 1. We illustrate the value of ψ(p1, p2, p3) over the space of possible allele frequencies (p1, p2, p3) in Figure 1, holding p2 constant at each of several values and plotting ψ(p1, p2, p3) as a function of (p1, p3) over the permissible domain [0, p2] × [p2, 1].
In each plot at a fixed p2, the value of ψ appears to decrease monotonically from 0 along lines of constant p1 and along lines of constant p3 to a minimum at (p1, p3) = (0, 1). Moreover, considering all plots at different values of p2, the minimum at (p1, p3) = (0, 1) appears lowest in the case that p2 = 1/2. The plots suggest that the point at which ψ(p1, p2, p3) is the most negative, where FST fails the triangle inequality by the largest amount, is where p1, p2, and p3 are furthest apart. They suggest that the minimum of ψ(p1, p2, p3) lies at (p1, p2, p3) = (0, 1/2, 1), exactly the triplet Sewall Wright (1978, p. 89) offered in his counterexample. We next prove this to be the case.
3.2. The minimum of ψ(p1, p2, p3) is −1/3 and occurs at (p1, p2, p3) = (0, 1/2, 1)
We seek to find the minimum of ψ(p1, p2, p3), as described in eq. (12), considering all possible (p1, p2, p3) with 0 ⩽ p1 ⩽ p2 ⩽ p3 ⩽ 1. Note that we can assume 0 ⩽ p1 < p2 < p3 ⩽ 1, because if p1 = p2 ⩽ p3 or p1 ⩽ p2 = p3, then ψ(p1, p2, p3) = 0. As we showed previously, ψ(p1, p2, p3) = 0 is the maximal value of ψ, so in finding the minimum, we can assume p1, p2, and p3 are distinct.
We show that the minimum of ψ(p1, p2, p3) occurs at (p1, p2, p3) = (0, 1/2, 1). The proof proceeds in three steps:
(1) For fixed p2, p3, we show ψ(0, p2, p3) < ψ(p1, p2, p3), for all p1 with 0 < p1 < 1.
(2) For fixed p2, we show ψ(0, p2, 1) < ψ(0, p2, p3) for all p3 with 0 < p3 < 1.
(3) We show ψ(0, 1/2, 1) < ψ(0, p2, 1) for all p2 with 0 < p2 < 1/2 or 1/2 < p2 < 1.
Showing (1), (2), and (3) suffices to show that the minimum is ψ(0, 1/2, 1) = −1/3, as Step 1 shows that the minimum has p1 = 0, Step 2 shows that p3 = 1, and Step 3 shows that p2 = 1/2.
3.2.1. ψ(0, p2, p3) < ψ(p1, p2, p3)
To show that the minimum of ψ(p1, p2, p3) over 0 ⩽ p1 ⩽ 1 at fixed (p2, p3) occurs at p1 = 0, we seek to show that there is no minimum of ψ(p1, p2, p3) for 0 < p1 < 1, so that the minimum must occur on the boundary of the unit interval. If ∂ψ/∂p1 > 0 for 0 < p1 < 1, then a minimum occurs at the lower bound of p1: p1 = 0. To show that ψ(0, p2, p3) < ψ(p1, p2, p3) for all p1, we show that ∂ψ/∂p1 > 0 everywhere in 0 < p1 < 1.
Note that
$$\frac{\partial\psi(p_1, p_2, p_3)}{\partial p_1} = \frac{\partial F_{ST}(p_1, p_2)}{\partial p_1} - \frac{\partial F_{ST}(p_1, p_3)}{\partial p_1},$$
because FST (p2, p3) does not depend on p1.
To show ∂FST (p1, p2)/∂p1−∂FST (p1, p3)/∂p1 > 0, it suffices to show ∂FST (p1, p2)/∂p1 > ∂FST (p1, p3)/∂p1.
Define a function f (p1, ρ) = ∂FST (p1, ρ)/∂p1. Note that showing f (p1, p2) > f (p1, p3) where p2 < p3 is the same as showing that ∂f (p1, ρ)/∂ρ < 0, for 0 < ρ < 1 (ρ must be strictly in the bounds of its domain because ρ = 0 would imply p1 = p2 and ρ = 1 would imply p2 = p3). Showing that ∂²FST (p1, ρ)/∂p1∂ρ < 0 implies that ∂ψ(p1, p2, p3)/∂p1 > 0.
Taking the partial derivative of FST (p1, ρ) with respect to p1 gives us
$$\frac{\partial F_{ST}(p_1, \rho)}{\partial p_1} = \frac{-2(\rho - p_1)(3\rho + p_1 - 2\rho^2 - 2\rho p_1)}{(\rho + p_1)^2(2 - \rho - p_1)^2}.$$
Taking the partial derivative again with respect to ρ yields
We seek to show that eq. (16) is strictly less than 0. By rearranging terms, it is equivalent to show that
First, consider that because p1 < 1 and ρ − ρ2 > 0 because ρ < 1. Thus,
Hence,
Because ρ − p1, ρ + p1, and 2 − ρ − p1 are positive, we have which can be rearranged to show
By noting that (ρ − p1)2/[(2 − ρ − p1)(ρ + p1)] is the expression for FST (eq. (3)), which is non-negative and less than or equal to 1, we find that both sides of eq. (18) are bounded in [0, 1]:
Therefore,
By adding (ρ − p1)2/[(2 − ρ − p1)(ρ + p1)] to both sides of we have eq. (17).
Because eq. (17) holds, we have completed our proof that eq. (16) is less than 0 for 0 < p1 < 1 and 0 < ρ < 1. Therefore, ∂ψ/∂p1 > 0 for 0 < p1 < 1 at fixed p2 and p3, and the minimum of ψ occurs at p1 = 0. We conclude ψ(0, p2, p3) < ψ(p1, p2, p3).
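The monotonicity in p1 can be checked on a grid without computing derivatives. In the sketch below, which relies on exact rational arithmetic and on hypothetical function names, ψ is evaluated at increasing values of p1 up to p2 for each fixed pair (p2, p3), and the sequence of values is required to be strictly increasing.

```python
from fractions import Fraction

def fst(pi, pj):
    """Biallelic F_ST of eq. (3), with F = 0 at pi = pj = 0 and pi = pj = 1."""
    total = pi + pj
    if total == 0 or total == 2:
        return Fraction(0)
    return (pi - pj) ** 2 / (total * (2 - total))

def psi(p1, p2, p3):
    return fst(p1, p2) + fst(p2, p3) - fst(p1, p3)

def increasing_in_p1(steps=20):
    """True if psi(p1, p2, p3) strictly increases with p1 on [0, p2] for grid p2 < p3."""
    grid = [Fraction(k, steps) for k in range(steps + 1)]
    for j in range(1, steps):                  # index of p2
        for k in range(j + 1, steps + 1):      # index of p3
            values = [psi(grid[i], grid[j], grid[k]) for i in range(j + 1)]
            if any(a >= b for a, b in zip(values, values[1:])):
                return False
    return True

print(increasing_in_p1())
```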
3.2.2. ψ(0, p2, 1) < ψ(0, p2, p3)
To show ψ(0, p2, 1) < ψ(0, p2, p3), we first comment that ψ(p1, p2, p3) is symmetric with respect to the choice of allele.
FST is symmetric with respect to an exchange of populations: FST(pi, pj) = FST(pj, pi). It is also symmetric in the choice of allele used for the computation, so that FST(pi, pj) = FST(1 − pi, 1 − pj).
Thus, we have
$$\psi(p_1, p_2, p_3) = \psi(1 - p_3, 1 - p_2, 1 - p_1). \tag{20}$$
From Section 3.2.1, we have ψ(0, p2, p3) < ψ(p1, p2, p3). By the symmetry in eq. (20), we have ψ(1 − p3, 1 − p2, 1 − 0) < ψ(1 − p3, 1 − p2, 1 − p1). Defining q1 = 1 − p3, q2 = 1 − p2, and q3 = 1 − p1, where 0 ⩽ q1 < q2 < q3 ⩽ 1, we can also express this inequality as ψ(q1, q2, 1) < ψ(q1, q2, q3), for all 0 ⩽ q1 < q2 < q3 ⩽ 1. Therefore p1 = 0 minimizes ψ for all values of (p2, p3) with 0 < p2 < 1 and 0 < p3 ⩽ 1, and p3 = 1 minimizes ψ for all values of (p1, p2) with 0 ⩽ p1 < 1 and 0 < p2 < 1. We then have ψ(0, p2, 1) < ψ(p1, p2, p3), which concludes the proof of the claim.
Given that we know that p1 = 0 and p3 = 1 minimize ψ(p1, p2, p3) at fixed p2, we have reduced this last step to a single variable problem to determine what value of p2 minimizes ψ(p1, p2, p3). Consider ψ(0, p2, 1):
$$\psi(0, p_2, 1) = \frac{p_2}{2 - p_2} + \frac{1 - p_2}{1 + p_2} - 1. \tag{21}$$
We can take the derivative of eq. (21) with respect to p2:
$$\frac{d\psi(0, p_2, 1)}{dp_2} = \frac{2}{(2 - p_2)^2} - \frac{2}{(1 + p_2)^2}. \tag{22}$$
Eq. (22) is only equal to 0 when p2 = 1/2. Therefore, p2 = 1/2 is a critical point for ψ in the domain 0 ⩽ p2 ⩽ 1 and specifically, is a minimum because ψ is greater when p2 = 0 or p2 = 1: ψ(0, 0, 1) = ψ(0, 1, 1) = 0.
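The location and value of the minimum can be confirmed by a brute-force search over a grid that contains 1/2. This is a sketch only: the grid resolution is arbitrary, the function names are hypothetical, and the single-variable expression −3p2(1 − p2)/[(2 − p2)(1 + p2)] used in the check is an algebraic simplification of ψ(0, p2, 1) worked out for this example.

```python
from fractions import Fraction
from itertools import combinations_with_replacement

def fst(pi, pj):
    """Biallelic F_ST of eq. (3), with F = 0 at pi = pj = 0 and pi = pj = 1."""
    total = pi + pj
    if total == 0 or total == 2:
        return Fraction(0)
    return (pi - pj) ** 2 / (total * (2 - total))

def psi(p1, p2, p3):
    return fst(p1, p2) + fst(p2, p3) - fst(p1, p3)

def minimize_psi(steps=20):
    """Grid triple p1 <= p2 <= p3 with the smallest psi (steps should be even)."""
    grid = [Fraction(k, steps) for k in range(steps + 1)]
    return min(combinations_with_replacement(grid, 3), key=lambda t: psi(*t))

def psi_on_edge(p2):
    """Simplified form of psi(0, p2, 1)."""
    return -3 * p2 * (1 - p2) / ((2 - p2) * (1 + p2))

best = minimize_psi()
print(best, psi(*best))
```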
4. The Four-Point Condition Never Holds for Biallelic Markers with Distinct Allele Frequencies
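The failure of FST to satisfy the triangle inequality for distinct allele frequencies (Section 2.3) raises the issue of the status of FST with respect to the four-point condition of Buneman (1974). The four-point condition is satisfied for a function d on a set X if and only if for all choices of four points x1, x2, x3, x4 ∈ X, not necessarily distinct, all of the following hold:
$$d(x_1, x_2) + d(x_3, x_4) \leq \max[d(x_1, x_3) + d(x_2, x_4),\ d(x_1, x_4) + d(x_2, x_3)], \tag{23}$$
$$d(x_1, x_3) + d(x_2, x_4) \leq \max[d(x_1, x_2) + d(x_3, x_4),\ d(x_1, x_4) + d(x_2, x_3)], \tag{24}$$
$$d(x_1, x_4) + d(x_2, x_3) \leq \max[d(x_1, x_2) + d(x_3, x_4),\ d(x_1, x_3) + d(x_2, x_4)]. \tag{25}$$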
Equivalently to eqs. (23)-(25), two of the quantities d(x1, x2) + d(x3, x4), d(x1, x3) + d(x2, x4), and d(x1, x4) + d(x2, x3) are equal and greater than or equal to the third.
The four-point condition can be satisfied for some set of four points x1, x2, x3, x4 ∈ X without necessarily holding for all sets of four points in X. For a specific set of four points, if and only if the four-point condition is satisfied, those points can be placed as the leaves of an unrooted tree whose edges are associated with lengths in such a manner that the pairwise distances between points computed along the tree accord with the function d (Buneman 1974, Steel 2016, p. 112).
For a function d that fails to satisfy the triangle inequality for all distinct points x1, x2, x4 in a set X, the four-point condition sometimes fails; supposing d(x1, x2) + d(x2, x4) < d(x1, x4), we simply take x3 = x2. Noting that d(x2, x3) = 0, eq. (25) does not hold. However, the four-point condition can be satisfied for x1, x2, x3, x4 even if the triangle inequality is not satisfied for any three of the points, as is the case if (d(x1, x2), d(x1, x3), d(x1, x4), d(x2, x3), d(x2, x4), d(x3, x4)) = (2, 5, 8, 2, 5, 2).
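The numerical example can be verified directly. The sketch below tests the four-point condition in its equivalent form (the two largest of the three pairwise sums are equal) and then tests the triangle inequality on each of the four triples; the dictionary layout and function names are illustrative choices.

```python
from itertools import combinations

# Dissimilarities from the example, keyed by (i, j) with i < j.
d = {(1, 2): 2, (1, 3): 5, (1, 4): 8, (2, 3): 2, (2, 4): 5, (3, 4): 2}

def four_point_holds(d):
    """True if the two largest of the three pairwise sums are equal."""
    sums = sorted([d[1, 2] + d[3, 4], d[1, 3] + d[2, 4], d[1, 4] + d[2, 3]])
    return sums[1] == sums[2]

def triangle_holds(d, i, j, k):
    """True if the three dissimilarities among i < j < k satisfy the triangle inequality."""
    a, b, c = d[i, j], d[i, k], d[j, k]
    return a <= b + c and b <= a + c and c <= a + b

triples = list(combinations([1, 2, 3, 4], 3))
failing = [t for t in triples if not triangle_holds(d, *t)]

print(four_point_holds(d), failing)
```

Here the three sums are 4, 10, and 10, so the four-point condition holds even though every one of the four triples violates the triangle inequality.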
We now demonstrate that for FST, not only does the triangle inequality fail for all sets of three points that correspond to the allele frequencies of a biallelic marker with distinct frequencies in three populations, as shown in Section 2.3, the four-point condition also fails for all sets of four points that correspond to frequencies of a biallelic marker with distinct frequencies in four populations. This result has the consequence that sets of four populations cannot be placed on an unrooted tree in such a way that pairwise distances, as computed along the tree, accord with FST.
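Consider four populations whose frequencies of a particular allele at a biallelic marker satisfy 0 ⩽ p1 < p2 < p3 < p4 ⩽ 1. We show that
$$F_{ST}(p_1, p_2) + F_{ST}(p_3, p_4) < F_{ST}(p_1, p_3) + F_{ST}(p_2, p_4) < F_{ST}(p_1, p_4) + F_{ST}(p_2, p_3),$$
so that eq. (25) fails to be satisfied with FST in the role of d.
Define s(pi, pj, pk, pℓ) = FST (pi, pj) + FST (pk, pℓ). For pi ⩽ pj ⩽ pk, recall from eq. (12) that ψ(pi, pj, pk) = FST (pi, pj) + FST (pj, pk) − FST (pi, pk). Applying the result of Section 2.3, ψ(pi, pj, pk) < 0 for 0 ⩽ pi < pj < pk ⩽ 1.
We can then use eq. (12) to write
$$s(p_1, p_3, p_2, p_4) = s(p_1, p_2, p_3, p_4) + 2F_{ST}(p_2, p_3) - \psi(p_1, p_2, p_3) - \psi(p_2, p_3, p_4), \tag{26}$$
$$s(p_1, p_4, p_2, p_3) = s(p_1, p_2, p_3, p_4) + 2F_{ST}(p_2, p_3) - \psi(p_1, p_2, p_3) - \psi(p_1, p_3, p_4). \tag{27}$$
Noting that ψ(p1, p2, p3), ψ(p1, p3, p4), and ψ(p2, p3, p4) are all bounded above by 0 owing to the failure of the triangle inequality for FST, we can cancel equal terms and use the positivity of FST in eq. (3) for distinct allele frequencies to obtain s(p1, p2, p3, p4) < s(p1, p3, p2, p4) and s(p1, p2, p3, p4) < s(p1, p4, p2, p3). We then have
$$s(p_1, p_2, p_3, p_4) < \min[s(p_1, p_3, p_2, p_4),\ s(p_1, p_4, p_2, p_3)].$$
To show that the four-point condition does not hold, we must show s(p1, p3, p2, p4) ≠ s(p1, p4, p2, p3), so that with FST in the role of d and pi in the role of xi, eqs. (24) and (25) cannot both hold simultaneously. Examining eqs. (26) and (27), we have
$$s(p_1, p_4, p_2, p_3) - s(p_1, p_3, p_2, p_4) = \psi(p_2, p_3, p_4) - \psi(p_1, p_3, p_4).$$
Thus, this problem reduces to showing that
$$\psi(p_1, p_3, p_4) \neq \psi(p_2, p_3, p_4).$$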
We have already shown in Section 3.2.1 that for 0 ⩽ pi < pj < pk < 1, ∂ψ(pi, pj, pk)/∂pi > 0.
Because p1 < p2, we can therefore conclude ψ(p1, p3, p4) < ψ(p2, p3, p4) and thus, s(p1, p3, p2, p4) < s(p1, p4, p2, p3). Hence, eq. (25) fails, so that the four-point condition does not hold for distinct allele frequencies p1, p2, p3, p4. Therefore, beyond the failure of the four-point condition that results quickly when p3 = p2 from the triangle inequality not holding for 0 ⩽ p1 < p2 < p4 ⩽ 1, FST never satisfies the four-point condition when 0 ⩽ p1 < p2 < p3 < p4 ⩽ 1.
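The strict ordering of the three pairwise sums can be checked on a grid of four distinct frequencies. In this sketch the three sums are labeled s1, s2, and s3 for the pairings (12)(34), (13)(24), and (14)(23); these labels, the function names, and the grid resolution are choices made for the illustration.

```python
from fractions import Fraction
from itertools import combinations

def fst(pi, pj):
    """Biallelic F_ST of eq. (3), with F = 0 at pi = pj = 0 and pi = pj = 1."""
    total = pi + pj
    if total == 0 or total == 2:
        return Fraction(0)
    return (pi - pj) ** 2 / (total * (2 - total))

def pairing_sums(p1, p2, p3, p4):
    """Sums of F_ST over the three ways of pairing four populations."""
    s1 = fst(p1, p2) + fst(p3, p4)
    s2 = fst(p1, p3) + fst(p2, p4)
    s3 = fst(p1, p4) + fst(p2, p3)
    return s1, s2, s3

def strictly_ordered(steps=12):
    """True if s1 < s2 < s3 for every grid quadruple p1 < p2 < p3 < p4."""
    grid = [Fraction(k, steps) for k in range(steps + 1)]
    # combinations on a sorted grid yields strictly increasing quadruples.
    for quad in combinations(grid, 4):
        s1, s2, s3 = pairing_sums(*quad)
        if not s1 < s2 < s3:
            return False
    return True

example = pairing_sums(Fraction(0), Fraction(1, 3), Fraction(2, 3), Fraction(1))
print(example, strictly_ordered())
```

Because the largest sum is strictly larger than the other two, no two of the three sums can be equal and at least as large as the third, so the four-point condition fails for each quadruple tested.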
5. Distribution of Allele Frequencies in the Parameter Space
Next, in the context of our FST results, we consider the placement of loci from human populations in the space of possible allele frequencies. For this analysis, we examined 590,461 single-nucleotide polymorphisms (SNPs) taken from the HapMap (International HapMap 3 Consortium 2010), as used by Verdu et al. (2014) and Kang et al. (2016). We considered three populations, CEU with sample size 112 individuals, CHB with 137 individuals, and YRI with 140 individuals.
We identified ordered triples (p1, p2, p3) of frequencies, with p1 representing an allele frequency in CHB, p2 representing the frequency of the same allele in CEU, and p3 in YRI, and with p1 ⩽ p2 ⩽ p3. The SNPs can be divided into three groups based on which of the three populations has allele frequencies that lie between those of the other two populations. For the 265,517 SNPs with CEU in the intermediate position, we relabeled alleles such that p1 ⩽ p2 ⩽ p3. Note that we placed CEU in the intermediate position in case of ties. At nonzero allele frequencies, we observed 380 two-way ties with CHB, 621 two-way ties with YRI, and 7 three-way ties.
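One way the relabeling step might be implemented is sketched below. It is a guess at the procedure rather than the authors' pipeline: the function name is hypothetical, the inputs are the frequencies of one arbitrarily chosen allele in the three populations, and a SNP is kept only when CEU is intermediate (ties included), with the complementary allele used when the order is reversed.

```python
from fractions import Fraction

def orient_with_ceu_intermediate(chb, ceu, yri):
    """Return (p1, p2, p3) = (CHB, CEU, YRI) with p1 <= p2 <= p3, or None.

    None means CEU is not the intermediate population for this SNP.
    """
    if chb <= ceu <= yri:
        return (chb, ceu, yri)
    if chb >= ceu >= yri:
        # Switch to the other allele, which reverses the order.
        return (1 - chb, 1 - ceu, 1 - yri)
    return None

print(orient_with_ceu_intermediate(Fraction(4, 5), Fraction(1, 2), Fraction(1, 10)))
```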
We plotted the values of (p1, p2, p3) over the permissible domain (Figure 2). Owing to the general similarity of allele frequencies among human populations, most points tend to have p1 only slightly less than p2 and p3 only slightly greater than p2. In regions with similar frequencies for p1, p2, and p3, ψ(p1, p2, p3) is only slightly less than zero. However, nontrivial numbers of points are placed in the upper left corner of the plots, where the deviation from 0 is greatest. Therefore, some SNPs in the three populations do indeed produce substantial deviations from the triangle inequality.
6. Discussion
In this paper, we have expanded on the observation of Sewall Wright (1978) that FST does not always satisfy the triangle inequality. In particular, we found that FST never satisfies the triangle inequality for biallelic markers with distinct allele frequencies. Interestingly, Wright’s case, arguably the simplest counterexample owing to its use of 0, 1/2, and 1 rather than more obscure frequency values, is the triplet that fails the triangle inequality by the largest amount.
6.1. Consequences for statistical methods
We have found that failure to satisfy the triangle inequality for all triplets of distinct allele frequencies implies failure to satisfy the four-point condition for all sets of four distinct allele frequencies. These failures to satisfy the triangle inequality and the four-point condition for all 3-tuples and 4-tuples of distinct allele frequencies for biallelic markers have implications for various forms of data analysis using FST.
6.1.1. Multidimensional scaling (MDS)
Matrices of pairwise dissimilarity among a set of populations are commonly used as a basis for visually depicting similarities of the populations in two or three dimensions by multidimensional scaling analysis (Jombart et al. 2009, Wang et al. 2010). These depictions find a representation of the matrix in a two- or three-dimensional space that has the property that Euclidean distances between points in the space approximate the matrix entries. Because FST does not satisfy the triangle inequality for biallelic markers, however, three distinct populations considered for a biallelic marker in an FST matrix cannot be represented as points in Euclidean space in such a way that Euclidean distances in the triangle connecting them correspond to the entries in the FST matrix. This imperfection of the spatial representation applies for any subset of three distinct points in a larger collection; thus, Euclidean distances between points in an MDS representation of an FST matrix necessarily only approximate the matrix entries.
Although MDS cannot always perfectly recapitulate the dissimilarities in the input matrix, MDS is frequently performed on distance matrices that are not Euclidean (Mardia et al. 1979, Cox and Cox 2001). Typical metric MDS finds a best-fit of distances between points in Euclidean space to dissimilarities in the non-Euclidean matrix. The matrix entries can also be transformed so that sets of three points necessarily satisfy the triangle inequality. One adjustment adds a constant c to each matrix entry (Cailliez 1983). For a dissimilarity d, after a large enough constant is added to obtain a new dissimilarity d′ = d + c, the transformed distances satisfy the triangle inequality: if d(x1, x2) + d(x2, x3) < d(x1, x3), then a choice c > d(x1, x3) − d(x1, x2) − d(x2, x3) leads to d′(x1, x2) + d′(x2, x3) > d′(x1, x3). Alternatively, taking the square root of values in [0, 1] yields larger values still in [0, 1], so that the sum of any two of three transformed values is more likely to exceed the third one (Legendre and Legendre 1998, p. 433). In Figure 3, we apply this transformation, finding that √FST satisfies the triangle inequality for all triplets plotted.
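Both adjustments can be tried on a small FST matrix. The sketch below is illustrative only: the additive constant is taken to be the largest triangle-inequality deficit in the matrix, which is enough to repair every triple but is not claimed to be the constant derived by Cailliez (1983) for Euclidean embedding, and the five equally spaced frequencies are an arbitrary choice.

```python
from fractions import Fraction
from itertools import permutations
import math

def fst(pi, pj):
    """Biallelic F_ST of eq. (3), with F = 0 at pi = pj = 0 and pi = pj = 1."""
    total = pi + pj
    if total == 0 or total == 2:
        return Fraction(0)
    return (pi - pj) ** 2 / (total * (2 - total))

def worst_deficit(d):
    """Largest d[i][k] - d[i][j] - d[j][k] over distinct i, j, k, floored at 0."""
    n = len(d)
    gaps = [d[i][k] - d[i][j] - d[j][k] for i, j, k in permutations(range(n), 3)]
    return max(gaps + [0])

freqs = [Fraction(k, 4) for k in range(5)]          # 0, 1/4, 1/2, 3/4, 1
n = len(freqs)
d = [[fst(a, b) for b in freqs] for a in freqs]

# Additive adjustment: add c to every off-diagonal entry.
c = worst_deficit(d)
shifted = [[d[i][j] + c if i != j else d[i][j] for j in range(n)] for i in range(n)]

# Square-root adjustment: take the square root of every entry.
rooted = [[math.sqrt(x) for x in row] for row in d]

print(c, worst_deficit(shifted), worst_deficit(rooted))
```

For these frequencies the largest deficit is attained at the triple (0, 1/2, 1), and neither transformed matrix has any triple that violates the triangle inequality.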
We note, however, that the choice of transformation does affect the resulting MDS representation. In Figure 4, we compare the output of the Cailliez transformation and the square root transformation on FST dissimilarity matrices with the same five allele frequencies chosen independently at random from a uniform distribution. The MDS plots differ and, in some cases, two points that are close together in one plot are distant in the other. Considering the distances between the output in the plots, neither distance matrix results in the same matrix as the original unmodified FST dissimilarities, because FST cannot be represented as distances in Euclidean space. The choice of transform ultimately affects the results, and is relevant to report for interpretation of MDS results.
6.1.2. Neighbor-joining inference of evolutionary trees
A second form of analysis affected by the failure of the triangle inequality is tree reconstruction from matrices of FST values computed from allele frequencies (e.g. Takezaki and Nei 1996, Pérez-Lezaun et al. 1997, Bosch et al. 2000). Here, we consider the behavior of the neighbor-joining (NJ) algorithm applied to FST dissimilarity matrices for biallelic markers with distinct frequencies.
If a dissimilarity matrix is generated exactly from a population tree by calculating path lengths between population pairs on the tree, then NJ recovers the generating tree (Saitou and Nei 1987, Studier and Keppler 1988, Atteson 1999). Because of the failure of the four-point condition, population trees constructed from FST matrices for biallelic markers do not perfectly represent those matrices. Moreover, for any four leaves of the population tree, the minimal path connecting the leaves does not faithfully represent the FST matrix entries associated with those leaves.
Nevertheless, the inferred tree might still place more genetically similar populations close together on the tree. Using results from Mihaescu et al. (2009), we determine the tree inferred by neighbor-joining from FST matrices for 4, 5, 6, and 7 taxa. In particular, we prove the following proposition.
Proposition 1. Consider a biallelic marker in n populations, with allele frequencies 0 ⩽ p1 < p2 < … < pn ⩽ 1 for a specified allele. For these populations, neighbor-joining applied to an FST dissimilarity matrix, d, produces the tree topologies in Figure 5 for 4 ⩽ n ⩽ 7.
The proposition states that with the populations ordered by allele frequency, neighboring populations in the sequence are placed in adjacent positions on the neighbor-joining tree. We begin with a lemma that addresses the case of n = 4.
Lemma 1. Consider a biallelic marker in 4 populations, with allele frequencies 0 ⩽ p1 < p2 < p3 < p4 ⩽ 1. For these populations, neighbor-joining applied to an FST dissimilarity matrix, d, produces the quartet ((1,2),(3,4)).
Proof. Proposition 6 of Mihaescu et al. (2009) demonstrates that neighbor-joining applied to a dissimilarity d in four populations i1, i2, i3, i4 returns the quartet ((i1, i2), (i3, i4)) if
$$d(i_1, i_2) + d(i_3, i_4) < \min[d(i_1, i_3) + d(i_2, i_4),\ d(i_1, i_4) + d(i_2, i_3)].$$
For dissimilarity measure FST, we have already demonstrated in Section 4 that, using our notation, s(p1, p2, p3, p4) < s(p1, p3, p2, p4) and s(p1, p2, p3, p4) < s(p1, p4, p2, p3), so that
$$s(p_1, p_2, p_3, p_4) < \min[s(p_1, p_3, p_2, p_4),\ s(p_1, p_4, p_2, p_3)].$$
Using the definition s(pi, pj, pk, pℓ) = FST (pi, pj) + FST (pk, pℓ), the condition in Proposition 6 of Mihaescu et al. (2009) is obtained. □
To prove Proposition 1, we rely on the concept of quartet consistency, which Mihaescu et al. (2009) introduced for assessing the output topology T of neighbor-joining in contexts in which no tree T exactly captures entries in the dissimilarity matrix. By Definition 8 of Mihaescu et al. (2009), a dissimilarity map d for n populations is quartet consistent with a tree T if for every quartet ((i, j), (k, l)) ∈ T, wd(ij: kl) > max[wd(ik: jl), wd(il: jk)], where
$$w_d(ij : kl) = \frac{1}{4}[d(i, k) + d(i, l) + d(j, k) + d(j, l)] - \frac{1}{2}[d(i, j) + d(k, l)].$$
Theorem 9 of Mihaescu et al. (2009) states that for 4 ⩽ n ⩽ 7, if there exists a tree T that is quartet consistent with a dissimilarity map d: X × X → ℝ, then NJ outputs a tree with the same topology as T. This theorem provides a method of determining the NJ tree from a dissimilarity map on n taxa, 4 ⩽ n ⩽ 7, without proceeding through the steps of the NJ algorithm: it suffices to exhibit T with which d is quartet consistent.
Lemma 2. Consider a biallelic marker in n populations, with allele frequencies 0 ⩽ p1 < p2 < … < pn ⩽ 1 for a specified allele. For any i1, i2, i3, i4 with i1 < i2 < i3 < i4, using FST for the dissimilarity d, wd(i1i2: i3i4) > max[wd(i1i3: i2i4), wd(i1i4: i2i3)].
Proof. By definition of d, we have
Recalling eqs. (28) and (29) and the result that ψ(pi1, pi2, pi3) < 0 for 0 ⩽ pi1 < pi2 < pi3 ⩽ 1,
We then have:
We conclude wd(i1i2: i3i4) > max[wd(i1i3: i2i4), wd(i1i4: i2i3)]. □
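The conclusion of this lemma can be illustrated for a concrete set of frequencies. The sketch below assumes that the quartet weight has the form wd(ij: kl) = ¼[d(i, k) + d(i, l) + d(j, k) + d(j, l)] − ½[d(i, j) + d(k, l)], the usual estimate of the internal branch length of a quartet; this normalization is an assumption and should be checked against Mihaescu et al. (2009). The FST formula is the standard biallelic form, and all names and example frequencies are hypothetical.

```python
from itertools import combinations


def fst(p, q):
    """Assumed biallelic form: (p - q)^2 / [(p + q)(2 - p - q)]."""
    if p == q:
        return 0.0
    return (p - q) ** 2 / ((p + q) * (2.0 - p - q))


def fst_matrix(freqs):
    """Pairwise FST dissimilarity matrix as a list of lists."""
    return [[fst(p, q) for q in freqs] for p in freqs]


def quartet_weight(d, i, j, k, m):
    """Assumed form of w_d(ij: km); see the lead-in for the caveat."""
    cross = d[i][k] + d[i][m] + d[j][k] + d[j][m]
    return 0.25 * cross - 0.5 * (d[i][j] + d[k][m])


def lemma_holds(freqs):
    """Check w(i1 i2: i3 i4) > max of the two alternatives for every
    index quadruple i1 < i2 < i3 < i4, taking freqs in the order given."""
    d = fst_matrix(freqs)
    for i1, i2, i3, i4 in combinations(range(len(freqs)), 4):
        main = quartet_weight(d, i1, i2, i3, i4)
        alt = max(quartet_weight(d, i1, i3, i2, i4),
                  quartet_weight(d, i1, i4, i2, i3))
        if not main > alt:
            return False
    return True


# Hypothetical frequencies, listed in increasing order as the lemma requires.
freqs = [0.02, 0.10, 0.35, 0.50, 0.80, 0.95]
print("lemma inequality holds for all quadruples:", lemma_holds(freqs))
```

If the populations are not indexed in order of increasing allele frequency, the check is expected to fail, which is one way to see that the ordering assumption matters.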
Proof of Proposition 1. We consider n populations labeled 1, 2, …, n such that the frequency of a specific allele has 0 ⩽ p1 < p2 < … < pn ⩽ 1. Consider a tree topology T in which populations 1 and 2 are in a cherry, n − 1 and n are in a cherry, and 3, …, n − 2 are arranged in numerical sequence incident to the internal branch of the quartet ((1, 2), (n − 1, n)) (Figure 5). It suffices to show that the FST dissimilarity matrix d is quartet consistent with such a topology.
Consider an arbitrary subset of four populations {i1, i2, i3, i4}, where pi1 < pi2 < pi3 < pi4 but i1, i2, i3, i4 are not necessarily consecutive. To show that FST is quartet consistent with T, it suffices to show that the quartet displayed by T for populations {i1, i2, i3, i4} has a larger value of w than either of the alternative quartets possible for the four populations.
By construction, T restricted to {i1, i2, i3, i4} gives quartet ((i1, i2), (i3, i4)). By Lemma 2, given {i1, i2, i3, i4} with i1 < i2 < i3 < i4, using FST for the dissimilarity d, wd(i1i2: i3i4) > max[wd(i1i3: i2i4), wd(i1i4: i2i3)]. Hence, FST is quartet consistent with T, and by Theorem 9 of Mihaescu et al. (2009), NJ applied to the FST dissimilarity matrix produces tree T. □
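The proposition can also be checked by running neighbor-joining directly on a small FST matrix. The following is a minimal sketch, assuming the textbook NJ criterion Q(a, b) = (m − 2)d(a, b) − r_a − r_b and the standard distance update for the new node; it is not the implementation used in the original analysis, and all function names are illustrative. Each join records the set of leaves it groups together. For the topology T of the proof, every such set must be an initial or a final block of the populations ordered by allele frequency.

```python
def fst(p, q):
    """Assumed biallelic form: (p - q)^2 / [(p + q)(2 - p - q)]."""
    if p == q:
        return 0.0
    return (p - q) ** 2 / ((p + q) * (2.0 - p - q))


def fst_matrix(freqs):
    return [[fst(p, q) for q in freqs] for p in freqs]


def neighbor_joining_clusters(d):
    """Run textbook NJ on a symmetric matrix d until three nodes remain.

    Returns the leaf sets (frozensets of row indices) created by the joins.
    """
    n = len(d)
    leaves = {i: frozenset([i]) for i in range(n)}
    dist = {i: {j: d[i][j] for j in range(n) if j != i} for i in range(n)}
    clusters = []
    next_id = n
    while len(dist) > 3:
        m = len(dist)
        row_sum = {a: sum(dist[a].values()) for a in dist}
        # Pair minimizing Q(a, b) = (m - 2) d(a, b) - r_a - r_b.
        a, b = min(
            ((x, y) for x in dist for y in dist if x < y),
            key=lambda xy: (m - 2) * dist[xy[0]][xy[1]]
            - row_sum[xy[0]] - row_sum[xy[1]],
        )
        # Distances from the new internal node to each remaining node.
        new_row = {c: 0.5 * (dist[a][c] + dist[b][c] - dist[a][b])
                   for c in dist if c not in (a, b)}
        del dist[a]
        del dist[b]
        for c in dist:
            del dist[c][a]
            del dist[c][b]
            dist[c][next_id] = new_row[c]
        dist[next_id] = new_row
        leaves[next_id] = leaves[a] | leaves[b]
        clusters.append(leaves[next_id])
        next_id += 1
    return clusters


def is_ordered_caterpillar(clusters, n):
    """True if every cluster is a block {0..k} or {k..n-1} of the indices."""
    for c in clusters:
        lo, hi = min(c), max(c)
        contiguous = len(c) == hi - lo + 1
        if not (contiguous and (lo == 0 or hi == n - 1)):
            return False
    return True


# Hypothetical frequencies in increasing order, n = 5.
freqs = [0.1, 0.3, 0.5, 0.7, 0.9]
clusters = neighbor_joining_clusters(fst_matrix(freqs))
print(sorted(sorted(c) for c in clusters))
print("matches topology T:", is_ordered_caterpillar(clusters, len(freqs)))
```

For n = 4 the two cherries have equal values of the criterion, so either may be joined first; both choices give the same unrooted topology.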
By the proposition, although FST matrices cannot be perfectly represented on a tree, for 4 ⩽ n ⩽ 7 populations, NJ applied to FST places populations with neighboring allele frequencies in adjacent positions on the tree. The restriction n ⩽ 7 arises from the use of Theorem 9 of Mihaescu et al. (2009). According to that theorem, for 4 ⩽ n ⩽ 7, exhibiting a tree T with which a dissimilarity d is quartet consistent suffices to determine the NJ output topology. For n > 7, however, quartet consistency does not suffice: it is possible to identify a tree T with which the dissimilarity matrix d is quartet consistent but whose topology differs from that of the tree produced by NJ.
6.1.3. Estimators of FST
Our results thus far have examined values of FST assuming that they are computed from true population allele frequencies. We can also examine the relationship of FST to the triangle inequality for an estimator of FST.
For a biallelic locus in K populations with equal samples of n diploid individuals each, the Weir–Cockerham estimator (Weir and Cockerham 1984) is computed according to where and (Weir 1996, p. 173). For pairwise FST, with K = 2, this expression simplifies to:
As n → ∞, the pairwise estimator converges to a limit, which we denote F̂∞(p1, p2):
F̂∞(p1, p2) = (p1 − p2)² / [p1(1 − p2) + p2(1 − p1)].
By applying the formula for FST from eq. (3), we can rewrite this limit as
F̂∞(p1, p2) = 2FST(p1, p2) / [1 + FST(p1, p2)]. (31)
To examine whether the large-sample limit of the estimator in eq. (30) satisfies the triangle inequality, we can consider the function
ψ̂(p1, p2, p3) = F̂∞(p1, p2) + F̂∞(p2, p3) − F̂∞(p1, p3).
Note that 2x/(x + 1) is a monotonically increasing function for x in [0, 1]. Hence, using eq. (31), because FST(p1, p3) > FST(p1, p2) and FST(p1, p3) > FST(p2, p3) from eqs. (7) and (8), we have F̂∞(p1, p3) > F̂∞(p1, p2) and F̂∞(p1, p3) > F̂∞(p2, p3). It follows that F̂∞(p1, p2) < F̂∞(p1, p3) + F̂∞(p2, p3) and F̂∞(p2, p3) < F̂∞(p1, p2) + F̂∞(p1, p3). Consequently, F̂∞ satisfies the triangle inequality for (p1, p2, p3) if and only if ψ̂(p1, p2, p3) ⩾ 0.
Figure 6 shows that for most of the allele frequency values shown, ψ̂ lies below zero, and hence the large-sample estimator fails the triangle inequality. We did not find any values with ψ̂ > 0. However, we did observe ψ̂ = 0 for some values with mutually distinct p1, p2, and p3, meaning that at some triples of distinct allele frequencies, the large-sample estimator satisfies the triangle inequality. In particular, in Wright's counterexample, ψ̂ = 0. This case illustrates that the minimum of ψ̂ does not occur in the same place as the minimum for ψ. The minimum for ψ̂ is not as far below zero as the corresponding minimum for ψ. From this large-sample analysis, we can conclude that the deviation from the triangle inequality is potentially not as great for the Weir–Cockerham estimator as it is for parametric FST.
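The comparison between parametric FST and the large-sample estimator can be made concrete with a short calculation. This is a minimal sketch under three assumptions: eq. (3) has the standard biallelic form (p − q)² / [(p + q)(2 − p − q)], the large-sample limit equals 2FST/(1 + FST) as implied by the monotonic function used above, and Wright's counterexample corresponds to frequencies (0, 1/2, 1). The names `wc_limit`, `psi`, and `psi_hat` are illustrative.

```python
def fst(p, q):
    """Assumed biallelic form of parametric FST."""
    if p == q:
        return 0.0
    return (p - q) ** 2 / ((p + q) * (2.0 - p - q))


def wc_limit(p, q):
    """Assumed large-sample limit of the pairwise estimator.

    Algebraically equal to 2 * fst(p, q) / (1 + fst(p, q)).
    """
    if p == q:
        return 0.0
    return (p - q) ** 2 / (p * (1.0 - q) + q * (1.0 - p))


def psi(p1, p2, p3):
    """Triangle-inequality slack for parametric FST, with p1 < p2 < p3."""
    return fst(p1, p2) + fst(p2, p3) - fst(p1, p3)


def psi_hat(p1, p2, p3):
    """Triangle-inequality slack for the large-sample limit."""
    return wc_limit(p1, p2) + wc_limit(p2, p3) - wc_limit(p1, p3)


# Frequencies assumed here for Wright's counterexample.
print("psi     :", psi(0.0, 0.5, 1.0))
print("psi_hat :", psi_hat(0.0, 0.5, 1.0))

# An interior triple, chosen arbitrarily for illustration.
print("psi     :", psi(0.1, 0.2, 0.3))
print("psi_hat :", psi_hat(0.1, 0.2, 0.3))
```

At (0, 1/2, 1), the pairwise FST values are 1/3, 1/3, and 1, whereas the corresponding limits of the estimator are 1/2, 1/2, and 1, so the slack moves from −1/3 to 0.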
6.2. Conclusions
We have seen that allele frequencies from three human populations sometimes lie in parts of the allele frequency space in which the deviation from the triangle inequality is fairly large. Although we have considered only small sets of populations, relationships in a larger set of populations are constrained by relationships in its smaller subsets. Hence, the results based on 3 and 4 populations, that MDS and NJ representations do not perfectly represent FST matrices, also apply to larger sets. In the case of neighbor-joining, because demonstrating quartet consistency of a dissimilarity matrix with a tree is not sufficient to determine the inferred tree for n > 7 taxa, the precise relationship between FST matrices and NJ remains to be assessed for larger n.
The failure of FST to satisfy the triangle inequality can often be mitigated in data analysis. First, in the human data, the loci at which the failures are most severe correspond to points with relatively large allele frequency differences and are relatively sparse in the genome. Second, in MDS analysis, transformations can be applied to data matrices to produce spatial representations that more closely accord with the input matrix. Another solution is to use non-metric MDS, which is designed for input dissimilarity measures that are not necessarily metric. Because the MDS visualization is affected by choices made in the analysis, including both the version of the multidimensional scaling algorithm chosen and any associated transformations applied, it is desirable for these choices to be documented as part of the analysis (Jombart et al. 2009). Third, in NJ inference, although FST dissimilarity matrices cannot be perfectly recapitulated by a tree, we have found that NJ inference from FST produces predictable and intuitively sensible topologies for n ⩽ 7 taxa.
We note that we have only considered biallelic markers. Recall that the triangle inequality is satisfied for a distance among three populations if the sum of any two of the pairwise distances is at least as large as the third. Consider a modified version of ψ from eq. (12) for allele frequency vectors p, q, and r of a multiallelic marker: where FST follows the general eq. (1) or (2). If ψmulti ⩾ 0, then the triangle inequality is satisfied, and if ψmulti < 0, then it fails. Taking two examples of (p, q, r), we have ψmulti((0, 0, 1), (1, 0, 0), (0, 1, 0)) = 1 and . Thus, for multiallelic markers, FST sometimes satisfies the triangle inequality and sometimes does not. The multiallelic case does not have a result as simple as the biallelic result that the triangle inequality is never satisfied for distinct allele frequency vectors, and merits a more detailed analysis.
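The multiallelic behavior can be explored with a few lines of code. The sketch below assumes that the general formula for FST is Nei's (JS − JT)/(1 − JT) with two equally weighted populations. The function `triangle_slack` is a stand-in of our own (the sum of the three pairwise values minus twice the largest), not necessarily the definition of ψmulti; it is nonnegative exactly when the triangle inequality holds. The second frequency triple is a hypothetical example chosen for illustration.

```python
def homozygosity(freqs):
    """Sum of squared allele frequencies."""
    return sum(x * x for x in freqs)


def fst_multi(p, q):
    """Assumed Nei-style FST = (J_S - J_T) / (1 - J_T) for two populations."""
    j_s = 0.5 * (homozygosity(p) + homozygosity(q))
    pooled = [(a + b) / 2.0 for a, b in zip(p, q)]
    j_t = homozygosity(pooled)
    if j_t == 1.0:
        # Both populations fixed for the same allele: no differentiation.
        return 0.0
    return (j_s - j_t) / (1.0 - j_t)


def triangle_slack(p, q, r):
    """Stand-in for psi_multi: sum of pairwise FST minus twice the largest."""
    d = [fst_multi(p, q), fst_multi(q, r), fst_multi(p, r)]
    return sum(d) - 2.0 * max(d)


# Three populations fixed for three different alleles (example from the text).
fixed = ((0, 0, 1), (1, 0, 0), (0, 1, 0))
print("slack, fixed populations:", triangle_slack(*fixed))

# Hypothetical triple in which the third allele is absent, so the
# calculation reduces to a biallelic case with frequencies 1, 1/2, 0.
embedded = ((1, 0, 0), (0.5, 0.5, 0), (0, 1, 0))
print("slack, embedded biallelic case:", triangle_slack(*embedded))
```

For the first triple, all three pairwise values equal 1, so the slack is 1 and the triangle inequality holds. For the second, the pairwise values are 1/3, 1/3, and 1, so the slack is −1/3 and the inequality fails.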
Sewall Wright’s use of a counterexample to demonstrate the failure of the triangle inequality for FST has suggested a broader investigation of the nature of FST dissimilarity matrices. The results illustrate that even fundamental statistics such as FST and simple properties such as the triangle inequality continue to permit rich mathematical analysis.
Acknowledgments
We thank Jonathan Kang for assistance with the SNP genotypes. Support was provided by NIH grants R01 GM117590, R01 GM131404, and R01 HG005855.