Abstract
The Genealogical Sorting Index (gsi) has been widely used in species-delimitation studies, where it is usually interpreted as a measure of the degree to which each of several predefined groups of specimens display a pattern of divergent evolution in a phylogenetic tree. Here we show that the gsi value obtained for a given group is highly dependent on the structure of the tree outside of the group of interest. By calculating the gsi from simulated datasets we demonstrate this dependence undermines some of desirable properties of the statistic. We also review the use of the gsi delimitation studies, and show that the gsi has typically been used under scenarios in which it is expected to produce large and statistically significant results for samples that are not divergent from all other populations and thus should not be considered species. Our proposed solution to this problem performs better than the gsi in under these conditions. Nevertheless, we show that our modified approach can produce positive results for populations that are connected by substantial levels of gene flow, and are thus unlikely to represent distinct species. We stress that the properties of gsi made clear in this manuscript must be taken into account if the statistic is used in species-delimitation studies. More generally, we argue that the results of genetic species-delimitation methods need to be interpreted in the light the biological and ecological setting of a study, and not treated as the final test applied to hypotheses generated by other data.
Introduction
Genetic sequence data and phylogenetic methods are increasingly being used to aid in the discovery and delimitation of species (reviewed in Fujita et al. 2012). The widespread application of such data and analyses to alpha taxonomy has confirmed that evolutionarily distinct species will not necessarily fall into reciprocally monophyletic groups in phylogenies estimated from DNA sequences. Indeed, species can remain paraphyletic with respect to their close relatives in gene trees even millions of years after they begin to diverge (Tajima 1983; Hudson & Coyne 2002).
A number of methods have been developed with the objective of delimiting species using such unsorted gene trees (Knowles & Carstens 2007; O’Meara 2010; Yang & Rannala 2010, 2014; Ence & Carstens 2011; Zhang et al. 2011). The increasingly popular use of these methods in empirical species-delimitation studies has inspired a number of methodological papers exploring their statistical properties. These theoretical investigations have shown the methods to be powerful and accurate when their underlying assumptions are met, but it has become clear that violations of these assumptions can generate misleading results (Reid & Carstens 2012; Carstens et al. 2013; Edwards & Knowles 2014; Olave et al. 2014). Thus, species delimitation methods are most useful when their statistical properties are understood, and studies can be designed and interpreted in the light of these properties.
Although not exclusively designed for species-delimitation studies, the gsi of Cummings et al. (2008) has been widely used in this context (see references in Table 1). This statistic is a measure of the degree to which a pre-defined group of leaves in a phylogenetic tree falls into an exclusive region in that tree. The value of the statistic ranges from 0 to 1, with higher values corresponding to more phylogenetic exclusivity. In this way, the gsi aims to bridge the gap between monophyly and paraphyly as categorical terms, and in so doing, quantify the degree to which a lineage has become distinct as a result of evolutionary divergence. The calculation of the gsi is usually accompanied by an hypothesis test, in which the gsi for each group is compared to values calculated from trees in which the tip labels have been permuted.
When compared to other widely used species delimitation methods (Table 2) the gsi has many desirable properties. As well as having power to detect recently diverged lineages, the gsi differs from many comparable methods in not needing, as input parameters, the values of often difficult-to-estimate quantities such as the effective population size or mutation rate of the genetic sequences under consideration. The gsi value obtained for a given group is also purported to be comparable to those obtained for different groups within the same tree and between those arising from different studies (Cummings et al. 2008).
In applying the gsi to empirical data, however, we have found the value of this statistic to be highly dependent on the structure of the tree outside of the group of interest. This dependence is not reflected in the way the statistic is typically applied and interpreted in species-delimitation studies, and this mismatch between the gsi as it is used and goals of species delimitation undermines the many advantages of the statistic.
The gsi measures exclusivity relative to the entire tree
An example serves to illustrate the dependence of the gsi obtained for a particular group on the over-all structure of the phylogeny from which it is calculated. Take the tree presented in Figure 1. To calculate gsi for the group “a” in this tree we first need to calculate the intermediate statistic gs, which is defined as where dμis the degree (i.e. the number of connections) of the node u, which is one of U total nodes in the smallest sub-tree uniting all members of a group and n is the minimum number of nodes that could be used to unite this group, which is one less than the number of leaves. In the case of Figure 1, n is 3, but all 7 nodes in the tree are needed to create a sub-tree uniting group “a”. As the tree is fully dichotomous, the degree of each node is 3. Thus, gs is 3 / (7 × (3-2)) = 3/7. To obtain gsi for group “a” the observed value of gs is normalised using the maximum and minimum obtainable value for the statistic given the size of the group and the number of nodes in the tree:
Here max(gs) is 1 (the case in which a group is united by the minimum number of nodes, i.e. monophyly) and min(gs) is given by the equation where diis the degree of node i, one of I nodes in the entire tree. Because the smallest sub-tree uniting the group “a” in Figure 1 is the entire tree, min(gs) for this group is equal to the observed gs value. Thus, the numerator in equation (2) is 3/7 - 3/7 = 0 and so the value of gsi is also 0. This result is desirable, as the tree presented in Figure 1 has each group arranged in the least exclusive fashion possible. Nevertheless, defining min(gs) in this way means the value of gsi is partially dependent on the degree to which other groups in the tree fall into exclusive regions.
Consider now the tree presented in Figure 2, which could be obtained from genetic data underlying the topology illustrated in Figure 1 by simply adding further data from two distantly related taxa. Because the clade containing the “a” and “b” groups is unchanged the observed value of gs for “a” is still 7. The addition of the two groups “c” and “d” to the tree, however, has added 8 new dichotomous nodes. Thus the value of min(gs) is now 3/(15×(3-2)) = 3/15 and, following equation (2), gsi is equal to [3/7 - 3/15-3/15] ≈ 0.29. This difference arises from the inclusion of min(gs) in the calculation of gsi, which makes a gsi value obtained for a group a reflection of that group's exclusivity relative to the entire tree. This property is not desirable for a species-delimitation statistic, as it means large gsi values can be obtained for groups that do not represent a population that is divergent from all other samples in a given analysis. Moreover, it also compromises comparisons between groups within one study, or between values obtained from different studies.
A similar issue affects the hypothesis test that is often performed alongside the gsi. Statistical significance, or P-values, arising from this test are usually reported for each of several putative species under consideration in a single analysis. In practice, these P-values are interpreted as the results of independent tests that each group being considered is divergent from all other groups. In fact, as Cummings et al. (2008) make clear, because the test is performed by permuting group assignments across the entire tree, the null hypothesis being tested is that all individuals included in the tree come from a single panmictic population. It is seldom the case that all individuals considered in a species-delimitation study could plausibly have come from a randomly mating population. Thus, statistically significant results may simply represent the rejection of an implausible null hypothesis.
The gsi and species delimitation
The problems discussed above are most likely to affect interpretation of the gsi when the statistic is calculated for a large number of groups, especially when those groups are likely to be divergent from at least some others under consideration. To determine how often the gsi has been used in such contexts, we performed a literature review (Table 1). We identified papers recorded as citing Cummings et al. (2008) in Web Of Science and Google Scholar. For each study we recorded the context in which the gsi was used, the largest number of groups considered in a single analysis and the criteria by which those groups were determined. The results of this analysis show the gsi has mainly been used in the context of species-delimitation (54 of 78 empirical studies) and that these studies have often applied the statistic to several groups (mean = 9.9, median = 6) for which there is a priori evidence for evolutionary divergence. The basis of the group assignment is frequently an existing taxonomic distinction, or a preliminary phylogenetic or clustering analysis performed on data from which the gsi was calculated. Worryingly, these circumstances are exactly those in which (as we show above) use of the gsi can be misleading. We did not find any papers in which the plausibility of the null hypothesis was considered in discussing the statistical significance of results.
To investigate how large the effect of including multiple divergent groups in the calculation of gsi is likely to be in practice, we calculated the statistic from simulated datasets. We used the program ms (Hudson 2002) to simulate gene trees arising from neutral evolution under the demographic history depicted in Figure 3, that is, four divergent populations with two (“a” and “b”) diverging at a time point t which was varied among simulations. We performed 500 simulations for each value of t between 0 and 1 Ne generations in 0.05 Ne increments, sampling 10 individuals per population in each simulation. For each simulation, we calculated the gsi for group “a” twice, first considering all populations in the simulation (the “four-population tree”) then after discarding individuals from the divergent populations “c” and “d” (the “two-population tree”).
As expected, the mean gsi value obtained for the group “a” tracks the divergence of this population from population “b” in all simulations (Fig 4). However, the gsi value calculated from the four-population tree is substantially larger than the value obtained from the two-population tree, even though the same individuals make up the “a” population in each case. The difference is especially pronounced early in the divergence process, indeed, the expected value of gsi in the four-population case is high (0.40) even when t = 0 Ne (i.e. when populations “a” and “b” are panmictic with respect to each other). For every simulation, including those in which the “a” and “b” populations were panmictic, the calculation based on the four population-tree produced a significant result. By contrast, calculations based on the two-population tree produced a near-uniform distribution of P-values under panmixia and became increasingly likely to return significant results as the populations diverged.
These simulations confirm that both the value and the significance obtained for the gsi of a given group is partially dependent on the degree to which other groups fall into exclusive regions of the phylogeny being considered. As we note above, this characteristic is not desirable in a statistic purported to relate only to the group under consideration, as it makes comparison of gsi values obtained from different trees problematic. Significant results can readily be obtained from populations that are not genealogically divergent from all others groups in an analysis.
Aligning the gsi with species delimitation as it is practiced
As noted above, the gsi has many desirable properties (Table 2). Unlike many species delimitation methods, the gsi can be calculated without the need for often difficult-to-estimate population-genetic parameters as input. Additionally, the relative simplicity of the gsi means the statistic can be applied to large datasets. Unlike the GMYC (Pons et al., 2006), the gsi can be applied to unsorted gene trees and the gsi can be used to test the validity of proposed species suggested by morphological or other data. Given these unique properties of the gsi, we do no propose that empiricists discarding the statistic entirely. Rather, the properties described here should be carefully considered before the statistic before is applied to datasets.
There are likely to be multiple ways to reasonably incorporate the gsi in particular species delimitation studies; here we propose a general solution that retains the gsi's simplicity but removes its dependence on the overall structure of the tree from which it is calculated. Our proposed statistic, the “mean pairwise gsi” or pwgsi is calculated for a pair of groups, after all tips other than those in the groups of interest have been dropped from the phylogenetic tree under consideration. This approach can be applied to all putative species under consideration in a given study, or only to a subset that are of particular interest.
For example, to analyse the tree and group-assignments depicted in Figure 2 we first produce trees representing the possible pairwise comparisons of groups (Fig 5). For each tree, the pwgsi is simply the mean of the gsi values obtained for these two groups. Thus, in the case of Figure 5 the pwgsi for the “a:b” comparison is 0 and all other comparisons have pwgsi of 1. This approach requires at most values to be calculated, where η is the number of groups being considered. Thus, the pairwise approach is not subject to the computation limitations of methods that consider all possible partitions of a group-assignment (O’Meara 2010; Ence & Carstens 2011), and can be applied to datasets in which a relatively large number of groups are being considered. It also relatively easy to calculate the pwgsi with the existing genealogicalSorting software, as we demonstrate in Supplementary Text 1.
The pairwise approach aligns well with the goals of species delimitation studies, as the pwgsi quantifies each population's exclusivity relative to all other populations. Moreover, the pattern of pwgsi values resulting from a single analysis can identify groups that are not divergent with respect to each other, but are divergent from all other groups and thus might be considered part of a single divergent population in subsequent analyses (as is the case with “a” and “b” in example discussed above). This approach uses the same procedure as in the calculation of the two-population scores in Fig 4. We can infer from Fig 4, therefore, that the pwgsi tracks lineage divergence and a permutation test applied to a particular between-population comparison has a strong power to detect divergent groups.
pwgsi and population structure
The high power of the pwgsi to detect an exclusive distribution of tips on a phylogeny may seem to make it an ideal statistic for species delimitation in the presence of incomplete lineage sorting. Care needs to be taken, however, in the interpretation of these results. The short period of divergence required to obtain significant results means that such results can be obtained even for what may turn out to be transient isolation between populations. Moreover, speciation is not the only one way in which a non-random distribution of groups might occur on a phylogeny. Specifically, sub-populations within a population with some degree of genetic structuring may be expected to fall into partially exclusive regions of a gene tree. To investigate the degree to which population structure affects the value of the pw-gsi, we simulated neutral gene trees under the scenario depicted in Figure 6. In this case, three populations diverged instantaneously at a time point that was held constant at 5 Ne generations. Two of the descendent populations (“a” and “b”) were united by ongoing and constant gene flow due to migration at a rate 4Nem with values {1, 2, 5, 25, 100}*. We performed 1500 simulations for each migration-rate value and sampled.
The pwgsi value increased for the “a:b” comparison as the number of migrants exchanged between these populations decreased (and thus the populations became more structured) (Fig 7). We also investigated the power of the pwgsi to detect population structure in these simulations by performing 104 group-label permutations per simulation. Significant results were obtained even with relatively limited population structure (Fig. 7). For example, when 4Nem = 100, a value giving a negligible expected FST of < 0.01 (Wright 1949), more than 10% of simulations produced a result with a P-value < 0.05.
Although speciation with gene flow is certainly possible (Emelianov et al. 2004; Davison et al. 2005; Niemiller et al. 2008; de León et al. 2010) and perhaps even common (Nosil 2008), it is generally accepted even a small number of successful migrants are enough to prevent speciation in the absence of selection (Slatkin 1995; Gavrilets 2000). Speciation is only possible with greater rates of migration when very strong divergent selection is acting (Felsenstein 1981; Kirkpatrick & Ravigné 2002). Clearly then, the results of the pwgsi cannot be treated as unambiguous evidence that the groups being considered are different species. Rather, researchers need to consider it in the design of their studies and the interpretation of results. In particular, the pwgsi may be a poor choice of statistic if a putative species is known to have a distinct geographic distribution with respect to others to which it is being compared (or if the population samples being analysed come from different regions).
By contrast, our finding that the pwgsi measures population structure may make it a useful statistic for within-species phylogeographic studies, complementing the AMOVA approach (Excoffier et al. 1992) that is currently widely used for sequence data in this context. Indeed, the gsi has already been used in this context (Chen & Hare 2011; Gustafsson & Olsson 2012).
Conclusions
Genetic sequences are a potentially powerful source of data for the discovery and delimitation of species. The results reported above, however, emphasise the care that needs to be taken in interpreting the results of DNA-based species-delimitation methods. We have shown that a naïve interpretation of gsi, a statistic that has been widely used in species-delimitation studies, can lead to erroneous conclusions. Although the gsi remains powerful approach to species when applied carefully (as in the pwgsi described here) it is still possible to obtain large gsi values and statistically significant results from populations connected by substantial gene flow.
Thus, we suggest that this statistic and other species-delimitation methods should be used as part of a genuinely integrative approach to taxonomy. In particular, the phylogenetic and population-genetic signals measured by species-delimitations methods should considered within the biological and ecological setting of a study, rather than as final arbiters of species’ status applied to hypotheses generated by other data.
Acknowledgments
We thank three anonymous reviewers, Steffen Klaere, Giulio Dalla Riva, and Michael Cummings, whose comments on this manuscript greatly improved it. SK in particular should be credited with the insight described in the footnote on pgX.
Footnotes
↵* Note, the inclusion of population “c” in this design illustrates the importance of the pw-gsi approach to quantifying lineage divergence. As this simulation proceeds population “c” is expected to become increasingly exclusive in gene trees arising from this history; thus the gsi values of “a” and “b” will increase over the course