Abstract
As a result of the process of descent with modification, closely related species tend to be similar to one another in a myriad different ways. In statistical terms, this means that traits measured on one species will not be independent of traits measured on others. Since their introduction in the 1980s, phylogenetic comparative methods (PCMs) have been framed as a solution to this problem. In this paper, we argue that this way of thinking about PCMs is deeply misleading. Not only has this sown widespread confusion in the literature about what PCMs are doing, but it has also led us to develop methods that are susceptible to the very thing we sought to build defenses against: unreplicated evolutionary events. Through three Case Studies, we demonstrate that the susceptibility to singular events is indeed a recurring problem in comparative biology that links several seemingly unrelated controversies. In each Case Study we propose a potential solution to the problem. While the details of our proposed solutions differ, they share a common theme: unifying hypothesis testing with data-driven approaches (which we term “phylogenetic natural history”) to disentangle the impact of singular evolutionary events from that of the factors we are investigating. More broadly, we argue that our field has, at times, been sloppy when weighing evidence in support of causal hypotheses. We suggest that one way to refine our inferences is to re-imagine phylogenies as probabilistic graphical models; adopting this way of thinking will help clarify precisely what we are testing and what evidence supports our claims.
Introduction
Every so often, evolution comes up with something totally new and unexpected, a so-crazy-it-just-might-work set of adaptations that is the stuff of nature documentaries. Many biologists likely have a favorite example of a lineage that has evolved something spectacular such as devilishly horned lizards that squirt blood from their eye sockets, marine sloths that grazed ancient seabeds, or that ancient lineage of therapsid reptile that became covered in hair and filled with warm blood and milk.
As macroevolutionary researchers, it is hard to know what to do with these types of events. Their singular and unreplicated nature seems incompatible with models that we typically use to model change over time, such as Brownian motion (BM; Felsenstein, 1973). Such models presume continuity, whereas rare events, such as the evolution of novel nutritive function in milk-producing glands, have no clear precedent in history. The evolution of such traits may set in motion a cascade of changes across an organism, such that descendant lineages may look very different in many ways from their more distant relatives. Or alternatively, a suite of traits may just happen to change at the same time. In either case, it is these sorts of idiosyncratic and unreplicated events that we often think of when we think of the need to consider phylogeny in analyses of comparative data. And this is not an abstract concern; a wide breadth of macroevolutionary data suggest that abrupt shifts and discontinuities have been a major feature of life on Earth (Uyeda et al., 2011, 2017; Landis and Schraiber, 2017; Jablonski, 2017). But as recent controversies in phylogenetic comparative biology have highlighted, our methods may not be up to this task.
As examples, we highlight two recent controversies in phylogenetic comparative methods (PCMs; for recent reviews, see Pennell and Harmon, 2013; O’Meara, 2012; Garamszegi, 2014). First, Maddison and FitzJohn (2015) demonstrated that common statistical tests (e.g., Pagel, 1994; Maddison, 1990) for the evolutionary correlation of discrete characters are prone to reporting a significant association even when the pattern is driven by a single (or very few) independent evolutionary event(s). Maddison and FitzJohn (2015) referred to such scenarios as cases of ‘phylogenetic pseudoreplication’ (see also Read and Nee, 1995; Nee et al., 1996). We will argue that this unresolved problem permeates not just tests for discrete character correlations, but nearly every method of finding associations in comparative methods (Figure 1), including those involved in our second example: the unacceptably high type-1 error rates (Rabosky and Goldberg, 2015) of methods used to infer trait-dependent diversification (e.g., BiSSE; Maddison et al., 2007). Specifically, Rabosky and Goldberg (2015) show that applying BiSSE to real-world phylogenies, which are usually not shaped like the birth-death trees assumed by our models (Mooers and Heard, 1997), often leads to erroneous support for trait-dependent diversification models even when diversification dynamics are unrelated to the traits being considered. The work of Beaulieu and O’Meara (Beaulieu et al., 2013; Beaulieu and O’Meara, 2014, 2016) has illuminated the underlying reasons behind Rabosky and Goldberg’s findings: the failure to consider biologically plausible alternative models. To address this shortcoming, Beaulieu et al. (2013) borrowed an idea from molecular phylogenetics (Penny et al., 2001; Galtier, 2001) and developed a Hidden Markov Model (HMM) for describing the evolution of a binary character.
In their HMM the transition rates between character states depend on the ‘hidden’ state of another, unobserved, trait also evolving along the tree (also see Price, 1997, who explored a related model). Applying the same principle to trait-dependent diversification models, they showed how models that included background heterogeneity in diversification rates provide a fairer comparison to the hypothesis of genuine state-dependent diversification (Beaulieu and O’Meara, 2016). Rather than considering a biologically unrealistic constant-rate null hypothesis, Beaulieu and colleagues built models that allowed traits and diversification to vary in biologically plausible ways (also see Zenil-Ferguson and Pennell, 2017, on this point).
We think that the type of solution suggested by Beaulieu and O’Meara (2016) is general and applies across comparative biology. In this paper we develop this argument through a series of three Case Studies, depicted in panels I–III of Figure 1. We will show in each Case Study that rare evolutionary events may deceive our methods and distort our interpretation. For each study, we will then sketch out possible solutions for making causal inferences from comparative data. Each of these approaches shares a common philosophy but may differ in its details. We do not have a one-size-fits-all solution and think that a diverse set of solutions is worth considering.
More specifically, all three Case Studies revolve around the problem of how to discover plausible histories of rare, evolutionary events — a practice we call “phylogenetic natural history” — and how to disentangle the impact of these events from that of the hypothesized effects we are investigating. But as we argue throughout this paper, the inference problems stemming from singular events are not actually specific to these cases. Rather they are only especially clear examples of broader challenges in comparative biology. By working through the singular events cases, we develop two ideas that we think will help move PCMs forward. First, we advocate for unifying hypothesis-testing and data-driven approaches. Rather than being alternative methods of investigating macroevolutionary processes and patterns, they are complementary, and in our view, essential, to one another. Second, we propose that comparative biologists need to be more careful about how we draw causal inference from phylogenetic data. One particular solution is to render comparative analyses as graphical models. These graphical models can help clarify exactly what causal statements we are making and what the limits of these inferences are.
Case Study I: Felsenstein’s Worst-Case Scenario
More than anything else, it was the famous series of figures depicting his “worst case scenario” (Figures 5, 6, and 7 in the original; our Figure 2) from Felsenstein’s iconic 1985 paper “Phylogenies and the comparative method” that really grabbed biologists by their Chacos and got the ball rolling on modern comparative thinking. The idea is simple: as a result of shared ancestry, measurements taken on one species will not be independent from those collected on another and especially so, if the two species are closely related. While other researchers had hit upon similar notions throughout the early 1980s (e.g., Clutton-Brock and Harvey, 1980; Mace et al., 1981; Ridley, 1983; Stearns, 1983; Cheverud et al., 1985), none of these had the pervasive impact that Felsenstein’s presentation did (see for example, Losos, 2011, who reproduces the figures and the accompanying reasoning in his presidential address for the American Society of Naturalists). The problem is just so obvious; all you have to do is look. And while of course his proposed solution, “independent contrasts” (IC), was widely adopted, we suspect it is the clarity with which Felsenstein articulated the problem that has kept his paper a hallmark of biological education and a testament to the importance of tree-thinking, even as his method has largely been superseded by the related least squares (Grafen, 1989) and mixed model (Lynch, 1991; Housworth et al., 2004; Hadfield and Nakagawa, 2010) approaches.
However, an important part of this story is often missed: Felsenstein also noted that the problem of non-independence does not occur if “characters respond essentially instantaneously to natural selection in the current environment, so that phylogenetic inertia is essentially absent” (p. 6). Despite this comment, a frequent misunderstanding of his argument is that the problem inherent in a non-phylogenetic regression of phylogenetically structured data is that species are not independent. In fact, independence of data is not an assumption of standard (non-phylogenetic) linear regression at all! Rather, standard linear regression assumes that the residuals of the fitted model are independent and identically distributed (i.i.d.). As a result, many applications of a “phylogenetic correction” seem to be missing the point (Revell, 2010; Hansen and Bartoszek, 2012): if all of the phylogenetic signal in a dataset is present in the predictor trait and residual variation is i.i.d., then there is no need for any phylogenetic correction (Rohlf, 2001, 2006). (However, phylogenetic analyses are nearly always needed to determine this condition in the first place.)
But for many researchers, applying non-phylogenetic methods to phylogenetically structured data is deeply unsettling; it just seems wrong somehow, even if we cannot quite put our finger on why (a problem that we revisit below). We suggest that what made Felsenstein’s prima facie argument so compelling was that it appealed to biologists’ intuition that many large clades of organisms are just different in many potentially idiosyncratic ways. In other words, singular events are a common feature of evolution across the tree of life (Uyeda et al., 2011; Landis and Schraiber, 2017; Uyeda et al., 2017; Jablonski, 2017) and we do not want to infer a causal relationship from unreplicated data (Nee et al., 1996). To illustrate the effect of non-independence of characters, Felsenstein simulated a “worst-case scenario” (our Figure 2) in which two clades are separated by long branches. He then evolved traits according to a BM process along the phylogeny; he recovered a significant regression slope using Ordinary Least Squares (OLS) despite there being no evolutionary covariance between the traits.
Here we revisit Felsenstein’s worst-case scenario in order to demonstrate that IC and PGLS (which is identical to IC when the residuals are assumed to covary according to a BM model; Blomberg et al., 2012) do not completely address the problem that we tend to think they do: these methods are still susceptible to singular evolutionary events. In our first scenario, we used a phylogeny with two clades, each of which is internally unresolved, similar to that of Felsenstein’s original example. We emphasize that the only phylogenetic structure is that stemming from the deepest split. We then simulated two traits under independent BM processes, each with an evolutionary rate (σ²) of 1. So far, this is an identical procedure to Felsenstein’s initial presentation. However, at some point on a stem branch of one of the two clades we introduce a singular evolutionary “event” drawn from a multivariate normal distribution with uncorrelated divergences and equal variances that are a scalar multiple of σ².
The resulting distribution of the data suggests a situation very similar to Felsenstein’s original worst-case scenario, and what we argue is the type of problem envisioned by most biologists when they warn their students of the dangers of ignoring phylogeny. To take a more concrete example, consider birds and mammals. Lots of things have happened since these groups diverged from their common ancestor and these have happened for many idiosyncratic reasons that are not well described by our models. For example, milk evolved somewhere along the mammalian lineage and surely this affected the evolution of other traits. Yet it would be nonsensical to describe the evolution of milk as a Brownian process, starting in some ancient reptile and merrily continuing on its way from Aardvarks to Zebra Finches.
One would hope that our tools for “correcting for phylogeny” would recognize that the apparently strong relationship between the two traits in our example was driven by only a single contrast. However, this is not the case. That single contrast results in a very high-leverage statistical outlier that drives significance as the size of the shift increases (Figure 2). We can repeat the same exercise with more phylogenetically structured data (where the two clades of interest are fully bifurcating following a Yule process) and obtain identical results (Figure 2, see Supplementary Material). This is disconcerting since our intuition suggests that we do not have compelling evidence for a causal relationship between these two traits (i.e., there is very little reason for us to believe from this correlation alone that one trait is an adaptation to the other).
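A minimal sketch of this kind of simulation is given below. It is an illustration under stated assumptions, not the analysis behind Figure 2: the number of tips per clade, the branch lengths (`t_stem`, `t_tip`), the shift standard deviation, the number of replicates, and the use of a fixed critical value of about 2.1 for the t statistic (roughly the two-sided 5% threshold with 18 degrees of freedom) are all choices made here for illustration. The tree consists of two star-shaped clades joined at the root, both traits evolve by independent BM with σ² = 1, and a single shift is added to each trait on the stem of the second clade. OLS is obtained as the special case of generalized least squares with an identity covariance matrix, and PGLS uses the BM covariance implied by the tree.

```python
import math
import random

def bm_covariance(n_per_clade, t_stem, t_tip):
    """Tip covariance under BM (sigma^2 = 1): branch length shared since the root."""
    n = 2 * n_per_clade
    same = lambda i, j: (i < n_per_clade) == (j < n_per_clade)
    return [[t_stem + t_tip if i == j else (t_stem if same(i, j) else 0.0)
             for j in range(n)] for i in range(n)]

def invert(A):
    """Invert a small matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0.0:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def gls_t(x, y, W):
    """t statistic for the slope of y ~ 1 + x, where W is the inverse residual covariance."""
    n = len(x)
    quad = lambda u, v: sum(u[i] * W[i][j] * v[j] for i in range(n) for j in range(n))
    one = [1.0] * n
    a11, a1x, axx = quad(one, one), quad(one, x), quad(x, x)
    b1, bx = quad(one, y), quad(x, y)
    det = a11 * axx - a1x ** 2
    intercept = (axx * b1 - a1x * bx) / det
    slope = (a11 * bx - a1x * b1) / det
    res = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    s2 = quad(res, res) / (n - 2)          # residual variance estimate
    return slope / math.sqrt(s2 * a11 / det)

def simulate_trait(n_per_clade, t_stem, t_tip, shift_sd, rng):
    """One BM trait on the two-clade tree, plus a single shift on the stem of clade 2."""
    stems = [rng.gauss(0.0, math.sqrt(t_stem)) for _ in range(2)]
    stems[1] += rng.gauss(0.0, shift_sd)   # the singular, unreplicated event
    return [stems[k] + rng.gauss(0.0, math.sqrt(t_tip))
            for k in range(2) for _ in range(n_per_clade)]

def rejection_rates(shift_sd, n_sims=200, n_per_clade=10, t_stem=1.0, t_tip=1.0,
                    t_crit=2.1, seed=1):
    """Fraction of replicates in which |t| exceeds t_crit for OLS and for PGLS."""
    rng = random.Random(seed)
    n = 2 * n_per_clade
    identity = [[float(i == j) for j in range(n)] for i in range(n)]
    W = invert(bm_covariance(n_per_clade, t_stem, t_tip))
    hits = {"OLS": 0, "PGLS": 0}
    for _ in range(n_sims):
        x = simulate_trait(n_per_clade, t_stem, t_tip, shift_sd, rng)
        y = simulate_trait(n_per_clade, t_stem, t_tip, shift_sd, rng)
        hits["OLS"] += abs(gls_t(x, y, identity)) > t_crit
        hits["PGLS"] += abs(gls_t(x, y, W)) > t_crit
    return {k: v / n_sims for k, v in hits.items()}

if __name__ == "__main__":
    for sd in (0.0, 10.0):
        print("shift sd =", sd, rejection_rates(sd))
```

With `shift_sd=0.0` the data follow the BM model exactly, so PGLS should reject at close to the nominal rate. With a large `shift_sd`, the single between-clade contrast dominates the fit and both OLS and PGLS should reject far more often, even though the two traits were simulated independently.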
How can we formulate a better set of models that can account for what our intuition tells us is a dangerous situation for causal inference? We can do so by including another phylogenetically plausible model: a singular shift driving differences between clades. Let us consider a scenario quite distinct from Felsenstein’s multivariate BM (mvBM) scenario, in which traits do not evolve by mvBM but rather undergo a shift at a single point (e.g., perhaps an ancient dispersal event in which one clade invaded a new environment, or the evolution of a novel key innovation). In such a scenario, we only need to consider the phylogeny inasmuch as a given species exists on either side of the event in question; except for this difference, the traits have no phylogenetic signal and the residuals are otherwise i.i.d. We can then erect two models: a linear regression model and a singular event model.
Linear regression model:

$$Y = \beta_0 + \beta_X X + \epsilon, \qquad X \sim \psi(\cdot)$$

where $\beta_X$ and $\beta_0$ are the slope and intercept of the regression of Y on X, $\epsilon$ is a vector containing i.i.d. random variables describing the error, and the predictor X is generated by some stochastic process $\psi(\cdot)$ on the phylogeny (e.g., a random variable describing a single burst in X on the stem branch of one of the two clades). Alternatively, X and Y may not be related to one another at all. Rather, they may be the products of singular random evolutionary events, $E_1$ and $E_2$, that just so happened to occur on the branch separating two clades:

Singular events model:

$$Y = \beta_{Y0} + \beta_Y I_{E_1} + \epsilon_Y$$
$$X = \beta_{X0} + \beta_X I_{E_2} + \epsilon_X$$

where the variables $I_{E_1}$ and $I_{E_2}$ are indicator random variables that take the value of 1 if an observation is from a lineage that experienced a phylogenetic event, and 0 otherwise. Furthermore, $\beta_{Y0}$ and $\beta_{X0}$ are the parameters that describe the trait means had they not experienced the singular evolutionary event in question (in this model $\beta_Y$ and $\beta_X$ are the magnitudes of the shifts rather than regression slopes). Thus, under the laws of conditional probability, the bivariate probability P(X, Y) under the linear model is:

$$P(X, Y, \beta_X, \beta_0, \sigma^2_Y, \theta_\psi, \sigma^2_X) = P(Y \mid X, \beta_X, \beta_0, \sigma^2_Y)\, P(X \mid \theta_\psi, \sigma^2_X)\, P(\beta_X)\, P(\beta_0)\, P(\sigma^2_Y)\, P(\theta_\psi)\, P(\sigma^2_X)$$

where $\theta_\psi$ are the parameters of the process for X on the phylogeny, and $\sigma^2_Y$ and $\sigma^2_X$ are the residual variances. This equation is derived from the assumed path of causation between X and Y, since the likelihood function of trait X, denoted by $P(X \mid \theta_\psi, \sigma^2_X)$, is independent of Y, while the likelihood function of Y, denoted by $P(Y \mid X, \beta_X, \beta_0, \sigma^2_Y)$, depends on X. The remaining terms in the probability statement are interpreted as prior distributions for the parameters in a Bayesian inferential framework. For the singular event model, a similar exercise results in:

$$\begin{aligned} P(X, Y, \beta_Y, \beta_{Y0}, \beta_X, \beta_{X0}, \sigma^2_Y, \sigma^2_X, L_{E_1}, L_{E_2}, N_{E_1} = 1, N_{E_2} = 1) = {} & P(Y \mid \beta_Y, \beta_{Y0}, L_{E_1}, \sigma^2_Y)\, P(X \mid \beta_X, \beta_{X0}, L_{E_2}, \sigma^2_X) \\ & \times P(L_{E_1} \mid N_{E_1} = 1)\, P(L_{E_2} \mid N_{E_2} = 1)\, P(N_{E_1} = 1)\, P(N_{E_2} = 1) \\ & \times P(\beta_Y)\, P(\beta_{Y0})\, P(\beta_X)\, P(\beta_{X0})\, P(\sigma^2_Y)\, P(\sigma^2_X) \end{aligned}$$

where $P(N_{E_1} = 1)$ and $P(N_{E_2} = 1)$ are the probabilities of observing a single shift on the phylogeny, and $P(L_{E_1} \mid N_{E_1} = 1)$ and $P(L_{E_2} \mid N_{E_2} = 1)$ are the probabilities of observing these singular shifts in locations $L_{E_1}$ and $L_{E_2}$, respectively. The linear regression and singular events models lead to potentially very different distributions of trait data at the tips. For example, under the singular event model, the distribution of Y is conditionally independent of X after accounting for $L_{E_1}$, $\beta_Y$, and $\beta_{Y0}$, a testable empirical prediction that will often result in these two models being easily distinguishable with model selection. But failing to consider the singular event model as a possibility is a problem: even for the simple case of two continuous traits, we have shown how easily data simulated under the singular event model can result in highly significant regressions for OLS, PGLS, and IC, regardless of whether the residuals are simulated as independent or phylogenetically correlated with respect to the model and phylogeny.
We also note that estimating a λ transformation for the residuals (Pagel, 1999; Freckleton et al., 2002) will not rescue the analysis; the estimated value of λ will lie between 0 and 1 and we have found both these more extreme cases (OLS and IC, respectively) to be susceptible.
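To illustrate how the linear regression and singular events models can be told apart by model selection, the following sketch compares them on data simulated under the singular events model. It simplifies the comparison in ways that are assumptions made here, not part of the analysis described above: the location of the event is treated as known, the fit is by maximum likelihood with AIC rather than a Bayesian analysis with priors, and both models are assumed to describe X by the same single-burst process, so the likelihood of X is shared and only the model for Y differs. The function names and simulation settings are hypothetical.

```python
import math
import random

def aic_simple_regression(y, predictor):
    """AIC of y ~ 1 + predictor with i.i.d. Gaussian residuals (3 free parameters)."""
    n = len(y)
    mp, my = sum(predictor) / n, sum(y) / n
    spp = sum((p - mp) ** 2 for p in predictor)
    spy = sum((p - mp) * (v - my) for p, v in zip(predictor, y))
    slope = spy / spp
    rss = sum((v - my - slope * (p - mp)) ** 2 for p, v in zip(predictor, y))
    sigma2 = rss / n                                  # maximum-likelihood variance
    loglik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return 2 * 3 - 2 * loglik

def simulate_singular_events(n_per_clade=20, shift_x=5.0, shift_y=5.0, seed=1):
    """X and Y each shift once on the same branch but are otherwise unrelated."""
    rng = random.Random(seed)
    indicator = [0.0] * n_per_clade + [1.0] * n_per_clade   # 1 = lineage after the event
    x = [shift_x * i + rng.gauss(0.0, 1.0) for i in indicator]
    y = [shift_y * i + rng.gauss(0.0, 1.0) for i in indicator]
    return indicator, x, y

indicator, x, y = simulate_singular_events()
aic_linear = aic_simple_regression(y, x)             # Y modeled as a function of X
aic_singular = aic_simple_regression(y, indicator)   # Y modeled as a function of the event only
print("AIC linear:", round(aic_linear, 2), "AIC singular:", round(aic_singular, 2))
```

Because X is only a noisy proxy for which side of the event a species falls on, the indicator should explain Y better than X does when the data really come from the singular events model, which is the conditional independence prediction described above.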
One might argue that the situation we describe is simply a violation of a BM model of evolution, and this would of course be correct (see also Maddison and FitzJohn, 2015). Indeed, for decades it has been common practice (but unfortunately, not universally so) to test whether contrasts are i.i.d. after conducting an analysis using IC (Garland et al., 1992; Purvis and Rambaut, 1995; Slater and Pennell, 2013; Pennell et al., 2015). Of course, Felsenstein recognized this particular vulnerability in his method, and correctly predicted that the underlying model was an “obvious point for future development” (p. 14). While today we have a much wider range of comparative models to choose from, most continuous trait models are Gaussian (e.g., Pagel, 1999; Blomberg et al., 2003; Butler and King, 2004; O’Meara et al., 2006; Eastman et al., 2011; Beaulieu et al., 2012; Uyeda and Harmon, 2014). It is only recently that alternative classes of models have been considered (Landis et al., 2012; Elliot and Mooers, 2014; Schraiber and Landis, 2015; Boucher et al., 2017; Duchen et al., 2017). Whether or not these types of models can sufficiently account for these types of singular events will be examined in the next section. Our primary point here is that the phenomenon that made Felsenstein’s argument so intuitive is not the violation of i.i.d. residuals but rather the biologically intuitive realization that unreplicated differences colocalized on a single branch provide only weak evidence of a causal relationship between traits. Yet a model of such singular events is rarely included as an alternative in comparative analyses. Even for continuous traits, such unreplicated events can cause problems similar to those outlined by Maddison and FitzJohn (2015) in the case of discrete character correlations (as we will further elaborate in Case Study III).
Case Study II: Adaptive hypotheses and singular shifts
As stated above, the IC method is based on the BM model of trait evolution. While this model is useful (and has often been used) for testing for adaptation, it is inconsistent with how we think of the process of adapting to an optimal state (Lande, 1976; Hansen, 1997; Hansen and Orzack, 2005; Hansen et al., 2008; Hansen and Bartoszek, 2012). Hansen’s introduction of the Ornstein-Uhlenbeck (OU) process to comparative biology and the suite of methods built on his approach have been the only real attempts to actually try and capture the basic dynamics of adaptive trait evolution on phylogenies. While it is formally equivalent to a model of stabilizing selection within a population with a fixed additive genetic variance (Lande, 1976; Hansen and Martins, 1996), we agree with other researchers (Hansen, 2012) that the OU model is usually best thought of as a phenomenological descriptor of the long-term movement of adaptive peaks or adaptive zones rather than that of a population climbing along a fixed adaptive landscape.
While an OU model with a single stationary peak is often matched up against BM and other alternatives (Harmon et al., 2010; Slater et al., 2012; Pennell et al., 2015; Cooper et al., 2016), multi-peak OU models have been widely used to test for the presence of shifts in evolutionary regimes (i.e., parts of the phylogeny with their own optima, or less commonly, their own strength of selection parameters). Tests of adaptive evolution come in two flavors: those with an a priori hypothesis (or hypotheses) regarding which lineages belong to which distinct regimes based on ancestral state reconstruction of explanatory factors (Butler and King, 2004; Beaulieu et al., 2012) and those where the locations of regime changes are themselves estimated along with the parameters of the OU process (Ingram and Mahler, 2013; Uyeda and Harmon, 2014; Khabbazian et al., 2016).
These two types of approaches represent two different philosophies of data analysis that follow a schism that cuts through comparative methods. For example, there are two major ways to investigate the dynamics of lineage diversification: test specific hypotheses about the drivers of diversification rate shifts (for example, the ‘SSE’ family of models; Maddison et al., 2007; FitzJohn, 2012) or search for the most-supported number and configuration of shifts (Alfaro et al., 2009; Stadler, 2011; Rabosky, 2014). The former (hypothesis-testing) seeks to understand the causes of evolutionary shifts, while the latter is a descriptive and exploratory approach to understanding evolutionary patterns. As we alluded to above, we refer to these data-driven approaches as “phylogenetic natural history” due to their similarity to the practice of natural history observations in nature but projected backwards through phylogenetic space and time (Maddison and FitzJohn, 2015).
Of course, the types of inferences we can make will be limited by our choice of approach. For example, it may be tempting to use exploratory approaches such as BAMM (Rabosky, 2014) or bayou (Uyeda and Harmon, 2014) to search a vast range of model space to find a particularly well-supported statistical hypothesis, observe the shifts identified, and then come up with post hoc explanations for why that particular configuration fits an adaptive story that the researcher can suddenly construct with great precision. (Comparative biologists are of course not unique in succumbing to such temptations; see for example Pavlidis et al., 2012). However, good scientists recognize that such a practice can easily become a form of data snooping. In fact, discovering the location of well-supported shifts on the phylogeny does not say anything about causation; it is merely a descriptive technique to find major features of the data where there is evidence that the parameters governing the dynamics of trait evolution have shifted on the phylogeny. It is nonetheless useful — and we argue essential — that a researcher know where these shifts occur. The reasons for this are covered in Case Study I: these major shifts are likely to drown out any biological signal in a dataset if they are unaccounted for by our hypothesis-driven models. While it is dangerous to come up with your hypothesis after viewing the data, it is equally dangerous to apply and interpret a model fit to your data without plotting and visualizing the signal in your data. We argue that hypothesis-driven and phylogenetic natural history approaches are complementary: we must pit our particular causal hypotheses against a “stuff-happens” model built on idiosyncratic singular evolutionary events.
To illustrate how we might go about uniting these two modes of inference to disentangle the support for causal models of evolution from that attributable to singular events, we reanalyze a dataset introduced by Scales et al. (2009) on lizard muscle fiber proportions (hereafter, the ‘Scales’ dataset). (An expanded dataset was re-analyzed by Scales and Butler (2016) with slightly modified hypotheses; but the original 2009 paper serves as a clearer illustration of our perspective and since we are using it only for rhetorical purposes, we will not delve into differences between the two.)
Scales et al. (2009) are interested in the composition of muscle fiber types in squamate lizards, and whether these muscle fibers evolve adaptively in response to the changing behavior and ecology of the organisms. They propose three primary adaptive hypotheses for the drivers of fast glycolytic (FG) muscle fiber proportions: i) foraging mode behavior (FM; e.g., sit-and-wait vs. active foraging vs. mixed); ii) predator escape behavior (PE; e.g., active flight vs. crypsis vs. mixed); and iii) a combined hypothesis of foraging mode and predator escape (FMPE) that assigns a unique regime to every combination of FM and PE represented in the dataset. For each hypothesis, they reconstruct a likely phylogenetic history of these behavioral modes on the phylogeny by conducting ancestral state reconstructions (Figure 3). After fitting the multi-optimum OU models to the muscle fiber data, they find strong support for the predator escape hypothesis, which is 13.0 AICc units better than the next closest model (FMPE). Such a finding appears quite reasonable under the “Life-Dinner Principle” (Dawkins and Krebs, 1979), which suggests that escaping a predator may have a far more direct effect on fitness than obtaining a food item (Scales et al., 2009).
However, AIC provides only relative support for a model given a set of alternatives (see Pennell et al., 2015, for more on this point in the context of comparative methods). An examination of the particular configuration of shifts in the three hypotheses may give pause to researchers familiar with squamates. For example, some may want to quibble with the suggestion that the “sit-and-wait” foraging behavior of Phrynosoma species, which are often ant-eating specialists that leisurely lap up passing insects, should be grouped with the “sit-and-wait” tactics of species such as Gambelia wislizenii, a voracious carnivore that frequently subdues and consumes other lizards close to its own size. Looking at the reconstructions, it is also apparent that the PE hypothesis is the simplest model that allows a shift on the branch leading to Phrynosoma, a group that any herpetologist would identify as “weird” for a multitude of reasons (indeed, these are the eyeball-socket-blood-squirters alluded to in the introduction). The question then arises: is the signal in the dataset for the PE hypothesis driven entirely by the singular evolution of different muscle fiber composition in Phrynosoma lizards? If so, then any number of causal factors that differ between Phrynosoma and other lizards could be just as likely as predator escape, including foraging mode with a slight reclassification of character states! We want to emphasize that we are not criticizing any of the particular choices the researchers involved in this study made. Rather, we argue that such quandaries are the inexorable result whenever the primary signal in the data is due to a singular historical event.
To explore the impact of the distinctiveness of simply being a Phrynosoma lizard, we developed a novel Bayesian model by building on the R package bayou (Uyeda and Harmon, 2014). To do so, we consider the macroevolutionary optimum of a particular species to be a weighted average of past regimes, as is typical in all OU models with discrete shifts in regimes (Butler and King, 2004; Beaulieu et al., 2012), but in our case, this weighted average is itself a weighted average of two differing configurations of the locations of adaptive shifts (often referred to as “regime paintings”). One configuration assumes that shifts in the optima have occurred where a discrete character, hypothesized to shape the evolutionary dynamics of the continuous character, is reconstructed to have shifted. The other configuration is estimated directly from the data using bayou’s reversible-jump MCMC (RJMCMC) algorithm.
$$E[Y_i] = \omega\, \Psi_{PE,i}\, \theta_{PE} + (1 - \omega)\, \Psi_{RJ,i}\, \theta_{RJ}$$

This equation describes the expected value of a trait for species i, $Y_i$, as a weighted average between the expected trait value under the PE hypothesis and the expected trait value under the reversible-jump estimate of shift configurations. The vectors $\theta_{PE}$ and $\theta_{RJ}$ are the values of the trait optima for the $N_{PE}$ and $N_{RJ}$ adaptive regimes, while $\Psi_{PE}$ and $\Psi_{RJ}$ correspond to the standard OU weight matrices (with $\Psi_{PE,i}$ and $\Psi_{RJ,i}$ denoting their rows for species i) that average over the history of adaptive regimes experienced by species i over the course of its evolution, with older regimes being discounted in proportion to the OU parameter α (for a full description of how these weight matrices are derived, see Hansen, 1997; Butler and King, 2004).
In our model, the regime painting for our a priori hypothesis, ΨPE, is fixed, while we estimate the configuration of shifts for the reversible-jump component, ΨRJ, the values of the optima θPE and θRJ, and the standard parameters of the OU model, α and σ², which are assumed constant across the phylogeny. We also estimate the weight parameter ω, which determines the degree of support for the PE hypothesis against the reversible-jump regime painting. We place a truncated Poisson prior on the number of shifts for the reversible-jump analysis that keeps the expected number quite low, with λ = 0.5 and a maximum of 10 shifts (meaning that we are placing a prior expectation of 0.5 shifts on the tree). Furthermore, we place a symmetric Beta-distributed prior on the ω parameter with shape parameters of (0.8, 0.8). Additional details on the model-fitting can be found in the supplementary material.
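The pieces of this model that do not depend on the tree can be sketched in a few lines. The sketch below is illustrative only and is not the bayou implementation: the function names and the small weight matrices in the usage example are hypothetical, and the way the Poisson prior is truncated and renormalized here is an assumption that may differ from what the package does. It computes the expected trait value as the ω-weighted average of the two regime paintings, along with the log prior on the number of shifts (λ = 0.5, at most 10 shifts) and the log density of the Beta(0.8, 0.8) prior on ω.

```python
import math

def expected_trait(omega, W_pe, theta_pe, W_rj, theta_rj):
    """E[Y_i] = omega * (W_pe @ theta_pe)_i + (1 - omega) * (W_rj @ theta_rj)_i."""
    matvec = lambda W, theta: [sum(w * t for w, t in zip(row, theta)) for row in W]
    return [omega * a + (1.0 - omega) * b
            for a, b in zip(matvec(W_pe, theta_pe), matvec(W_rj, theta_rj))]

def log_prior_shifts(k, lam=0.5, k_max=10):
    """Log of a Poisson(lam) prior on the number of shifts, truncated at k_max."""
    if not 0 <= k <= k_max:
        return float("-inf")
    log_pmf = lambda j: -lam + j * math.log(lam) - math.lgamma(j + 1)
    log_norm = math.log(sum(math.exp(log_pmf(j)) for j in range(k_max + 1)))
    return log_pmf(k) - log_norm

def log_prior_omega(omega, a=0.8, b=0.8):
    """Log density of a Beta(a, b) prior on the weight omega."""
    if not 0.0 < omega < 1.0:
        return float("-inf")
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(omega) + (b - 1) * math.log(1 - omega) - log_beta

# Hypothetical example: two species, two regimes per painting.
# Each row of a weight matrix sums to 1 (weights over the regimes a species experienced).
W_pe = [[0.3, 0.7], [1.0, 0.0]]
W_rj = [[0.5, 0.5], [0.2, 0.8]]
print(expected_trait(0.6, W_pe, [0.2, 0.6], W_rj, [0.1, 0.9]))
```

Shape parameters below 1 make the Beta prior U-shaped, so it places more density near ω = 0 and ω = 1 than in the middle; a posterior for ω that settles at an intermediate value is therefore not simply reflecting the prior.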
We then fit this model to 3 different datasets: i) the original Scales data; ii) data simulated using the Maximum Likelihood estimates for the parameters of the PE model fitted to the Scales dataset; and iii) data simulated under the Maximum Likelihood estimates for a “Phrynosoma-only” model in which a single shift occurs leading to the genus Phrynosoma. We could then compare the posterior distribution of the weight parameter ω to evaluate the weight of evidence for each hypothesis in each dataset.
We find that our approach places intermediate weight on the PE hypothesis for the original Scales dataset. When we simulated data under the PE hypothesis, the estimated weight given to the PE hypothesis was high (Figure 3B). When data were simulated under the Phrynosoma-only hypothesis, the weight given to the PE hypothesis was low, as predicted (Figure 3B). Furthermore, the RJ portion of the model fit to the Scales dataset recovers only a single highly supported shift on the stem branch of the Phrynosoma lizards (Figure 3C and 3D). This suggests that the PE hypothesis has statistically supported explanatory power, as its estimated weight is well bounded away from 0. But it does not explain everything. In particular, the PE hypothesis fails to fully explain the shift leading to the Phrynosoma lizards (Figure 3C and 3D), which are more extreme than they should be considering the other taxa in their regime (there is only one, Holbrookia maculata, which does not show such an extreme shift). Consequently, the answer to whether differences in predation escape behavior are driving the evolution of these traits is neither yes nor no, but somewhere in between. This more subtle view of muscle fiber evolution conforms quite well to the conclusions drawn in the original paper and to our biological intuition about the genus Phrynosoma: variation in predator escape behavior is a good explanation for observed patterns of muscle fiber divergence, but Phrynosoma are weird and other factors are likely influencing their trait evolution beyond predator escape.
We can conduct the same analysis where we test not the PE hypothesis, but the Phrynosoma-only hypothesis against the reversible-jump hypotheses (Figure 4). In this case, we recover high weights for the Phrynosoma-only hypothesis regardless of whether the model is fit to the Scales dataset, or to data simulated under either the Phrynosoma-only hypothesis or the PE hypothesis. This is because the Phrynosoma shift is the primary feature of all three datasets (though weights are somewhat higher for data simulated under the Phrynosoma-only hypothesis than for the others). It may appear unsatisfying that such high weights are recovered for the a priori hypothesis when a singular event, which is easily reconstructed by the RJMCMC, explains the distribution of the data just as well. However, the analysis favors the Phrynosoma-only hypothesis simply because of the vague priors placed on the number and location of shifts in the reversible-jump analysis. Correctly guessing with our hypothesis which of the 42 branches on the phylogeny has a single shift is rewarded by the analysis (we will return to this issue in Case Study III). In the original Scales dataset, there are weakly supported shifts in the clades leading to the sister group of Phrynosoma lizards, and on the branch leading to Acanthodactylus scutellatus and Aspidoscelis tigris. Finally, we can combine all three hypotheses simultaneously by placing a Dirichlet prior on the vector ω = [ωRJ, ωPE, ωPhrynosoma]. Doing so recovers strongest support for the Phrynosoma-only model, intermediate support for the PE hypothesis, and very little weight on the reversible-jump hypothesis, which has no strongly supported shifts (Figure 5).
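To illustrate what a Dirichlet prior on a three-component weight vector looks like, the following is a minimal Python sketch. It is not the analysis code used here: the flat concentration parameters (all equal to 1), the seed, and the function name `sample_dirichlet` are assumptions introduced for illustration only.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one vector from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


# Assumed flat prior over the three weights (RJ, PE, Phrynosoma-only).
ALPHA = [1.0, 1.0, 1.0]
rng = random.Random(1)
prior_draws = [sample_dirichlet(ALPHA, rng) for _ in range(20000)]

# Under a symmetric Dirichlet, each weight has the same prior mean (1/3 here).
prior_means = [
    sum(draw[k] for draw in prior_draws) / len(prior_draws) for k in range(3)
]
print(prior_means)
```

Every draw is a set of non-negative weights that sum to one, so weight gained by one hypothesis is necessarily lost by the others. Under this assumed flat prior no hypothesis is favored before the data are seen.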
By combining phylogenetic natural history approaches with our a priori hypotheses, we show that we can account for rare evolutionary events that are not well-accounted for by our generating model. In the case of the PE hypothesis, we show that it does indeed have explanatory power beyond simply explaining a singular shift in Phrynosoma and support the original authors’ conclusions. However, the intermediate result likely only occurs because the PE hypothesis places Phrynosoma in the same regime as Holbrookia maculata, which does not share the extreme shift that is found in Phrynosoma. Were this not the case (as in our fitting of the Phrynosoma-only hypothesis), it would still require visual inspection of the phylogenetic distribution of traits under the hypothesis in question to determine that a singular evolutionary event is driving support for a particular model. As discussed above, given a large enough tree such a priori hypotheses are likely to be strongly supported; if you can predict which one branch out of many will contain a shift then you may be on to something. But given the dangers of ascertainment bias and our biological intuition, we find this interpretation unsatisfying (Maddison and FitzJohn, 2015). We discuss this problem more in Case Study III.
Nevertheless, we show the value in combining a hypothesis testing framework with a natural history approach to identifying patterns of evolution. We show here that allowing for unaccounted shifts can provide a stronger test and more nuanced conclusions regarding the support for a particular predictor driving trait evolution across a phylogeny. Furthermore, predictors which provide additional explanatory power (if for example, regimes are convergent or if predictors vary continuously) will be even more favored over natural history models. Thus, our framework certainly does not automatically reward more complex, freely estimated models. Rather, the great uncertainty in possible models is incorporated as a prior on the arrangement of shifts and is limited in explanatory power, something that researcher-driven biological hypotheses are much more capable of accomplishing.
Case Study III: Darwin’s scenario and unreplicated bursts
We now turn to a case where both the explanatory variable and the focal trait are discrete characters. In comparison to the continuous cases described above, we expect the signal for evolutionary covariation between such characters to be more difficult to detect. However, as we mention above, Maddison and FitzJohn (2015) recently demonstrated that commonly used methods return significant correlations all the time, even in scenarios that seem to defy our statistical intuition. For example, Pagel’s (1994) correlation test would find the phylogenetic co-distribution of milk production and middle ear bones highly statistically significant even though both are defining characteristics of mammals (an inference so obviously dubious that even Darwin 1872 warned against it). This seems to be a clear case of phylogenetic pseudoreplication (Maddison and FitzJohn, 2015; Read and Nee, 1995). Maddison and FitzJohn describe the goal of correlation tests as finding the “weak” conclusion that “the two variables of interest appear to be part of the same adaptive/functional network, causally linked either directly, or indirectly through other variables” (p. 128). They assert that with our current approaches, we cannot even clear this (arguably low) bar. Here we delve into this idea a bit deeper. What constitutes good evidence of such a relationship? And is this a reasonable goal for comparative analyses?
Maddison and FitzJohn highlight two hypothetical situations, that they refer to as “Darwin’s scenario” and an “unreplicated burst”. They argue that these scenarios provide little evidence for an adaptive/functional relationship between two traits because the patterns of codistribution only reflect singular evolutionary events (Figure 1). In Darwin’s scenario, two traits are coextensive on the phylogeny, meaning that in every lineage where one trait is in the derived character state, the other trait is as well. As an example, consider the aforementioned phylogenetic distribution of middle ear bones and milk production in animals; all mammals (and only mammals) have middle ear bones and produce milk. These traits (depending on how they are defined) have only appeared once on the tree of life and both occurred on the same branch (the stem branch of mammals). The unreplicated burst scenario is identical to Darwin’s scenario except that rather than a single transition occurring in both traits, there is a single transition in the state of one trait (e.g., the gain of middle ear bones) and a sudden shift in the transition rates in another trait (e.g., the rates by which external testes are gained and lost across mammals). Note that these scenarios do not differ qualitatively from Felsenstein’s worst-case scenario nor the Phrynosoma-only model scenario from Case Studies I and II (Figure 1). In all three scenarios, something rare and interesting happened on a single branch and the distribution of traits at the tips of the phylogeny reflects this.
In their paper, Maddison and FitzJohn (2015) simulated comparative data and reported a preponderance of significant results using Pagel’s correlation test (1994) and Maddison’s (1990) concentrated changes test. In order to hone our intuition of the problems they present, we dig a bit deeper and investigate the mathematical reason that Pagel’s discrete correlation test (1994) returns a significant result in Darwin’s scenario. (We should note here that Brookfield [1993] conducted a similar analysis that was more-or-less completely overlooked.) To make the problem tractable, we assume that the traits were selected without first looking at their phylogenetic distribution, a condition that we (as well as Maddison and FitzJohn, 2015) suspect is rarely met in practice (more on this below).
Again, under Darwin’s scenario, there is a single concurrent origin of two traits leading to perfect codistribution across the phylogeny (a condition we define mathematically as event A). What is the probability that both traits X and Y undergo a single, irreversible shift on the same branch Li under a model where the two traits are independent (Mind)? And what is the probability of this occurring if the two traits are actually evolving in a correlated fashion (Mdep)?
Under the independent model, both traits X and Y have to switch from 0 to 1 exactly once, on the same branch. We also know that there was at least one transition in each of the traits, since we would not study traits if there weren’t any changes in the phylogeny. The probability of this happening is

$$P(A \mid M_{ind}) = P\left(N_x(t_i) = 1,\ N_y(t_i) = 1 \mid N_x(T) \geq 1,\ N_y(T) \geq 1\right)$$

where $N_x$ and $N_y$ are the stochastic processes that denote the number of shifts of traits X and Y at time t, respectively. $L_i$ is the branch on which both transitions occur, where $L_i$ has a branch length of $t_i$. The sum of all branch lengths is $T$. Since X and Y are independent, the joint probability of X and Y changing at the same time is simply the product of the probabilities of each event, so the above expression becomes

$$P(A \mid M_{ind}) = P\left(N_x(t_i) = 1 \mid N_x(T) \geq 1\right)\, P\left(N_y(t_i) = 1 \mid N_y(T) \geq 1\right)$$

where the probability of each trait changing along $L_i$ is given by $[e^{Q_x t_i}]_{(1,2)}$ and $[e^{Q_y t_i}]_{(1,2)}$, respectively. Here $Q_x$ and $Q_y$ are the infinitesimal probability matrices that describe the transition rates between states in the independent case (these Q matrices are used to conduct Pagel’s correlation test; see Supplementary Material for details on matrix definitions under the independent case), and the subscripts on $[e^{Q_y t_i}]_{(1,2)}$ indicate row 1, column 2 of the resulting probability matrix. We now consider the outcome of maximizing this expression under a likelihood framework. Since there is no evidence of a transition from 1 to 0 in either trait, the maximum likelihood estimates (MLEs) of the 1-to-0 transition rates in both traits will be 0. Meanwhile, the MLEs for the transitions from 0 to 1 in both traits will be small (because these events are so rare, occurring only once; see the small probability of a single shift occurring in the Supplementary Material) but positive, since one transition does occur on $L_i$. Given the resulting parameter estimates, a great many realizations of this process would likely result in no lineages evolving the traits of interest at all; replaying the tape of life, under Markovian assumptions, will likely lead to many worlds where milk and middle ear bones don’t exist at all. However, we do not study traits that don’t exist.
Because of this ascertainment bias, the probability of at least one switch occurring for traits that are unlikely to evolve at all (i.e., with very small 0-to-1 transition rates) should be almost exactly one, that is, $P(N_x(t) \geq 1) \approx 1$ when $t$ is taken to be the total branch length $T$ of the tree (see Supplementary Material for the exact derivation of this probability). The probability of exactly one transition of each trait occurring in the lineage $L_i$, given that there is at least one transition in the tree, is simply uniform along the tree (derived from a Poisson process; see Supplementary Material). Furthermore, with rare events, the estimated probability of each trait changing only once in lineage $L_i$, conditional upon observing Darwin’s scenario (under the independent model $M_{ind}$), is also approximately one, meaning that in the end the probability of the independent model reduces to

$$P(A \mid M_{ind}) = \left(\frac{t_i}{T}\right)^{2} \qquad \text{(Eq. 7)}$$

where $t_i$ is the branch length of branch $L_i$ containing both shifts (Karlin and Taylor, 1981).
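The statement that a single rare transition is uniformly located along the tree can be checked by simulation. The sketch below is a simplification rather than the derivation in the Supplementary Material: it treats the tree as a set of branches laid end to end, ignores topology and irreversibility, and uses hypothetical branch lengths, rate, and seed. It keeps only replicates with exactly one event and compares the share falling on each branch with the ratio of that branch's length to the total length.

```python
import random


def one_event_branch(branch_lengths, rate, rng):
    """Run a homogeneous Poisson process over all branches laid end to end.

    Returns the index of the branch that was hit if exactly one event
    occurred, otherwise None.
    """
    total = sum(branch_lengths)
    positions = []
    pos = rng.expovariate(rate)  # exponential waiting time to the first event
    while pos < total:
        positions.append(pos)
        pos += rng.expovariate(rate)
    if len(positions) != 1:
        return None
    cumulative = 0.0
    for index, length in enumerate(branch_lengths):
        cumulative += length
        if positions[0] < cumulative:
            return index
    return len(branch_lengths) - 1  # guard against floating-point edge cases


BRANCH_LENGTHS = [1.0, 2.0, 3.0, 4.0]  # hypothetical branch lengths
RATE = 0.05                            # hypothetical, deliberately small rate
rng = random.Random(42)

counts = [0] * len(BRANCH_LENGTHS)
for _ in range(40000):
    hit = one_event_branch(BRANCH_LENGTHS, RATE, rng)
    if hit is not None:
        counts[hit] += 1

kept = sum(counts)
simulated = [c / kept for c in counts]
expected = [t / sum(BRANCH_LENGTHS) for t in BRANCH_LENGTHS]
print(simulated)
print(expected)
```

Under these assumptions the simulated shares should track branch length divided by total length, which is the single-trait ingredient of the independent-model probability.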
In contrast, for the completely dependent model $M_{dep}$, it is enough to follow what happens in a single trait, since the second will simply change along with it. Therefore:

$$P(A \mid M_{dep}) = \frac{t_i}{T} \qquad \text{(Eq. 8)}$$
Thus, the test statistic used in the likelihood ratio test comparing Mind and Mdep is simply proportional to the ratio of the length of the branch where the shift occurred to the total length of the tree (i.e., the probability of two events happening on the same branch [Eq. 7] vs. the probability of one event happening on that branch [Eq. 8]).
In other words, the results of the analysis are predetermined. Under Darwin’s scenario, including additional taxa in the analysis will increase the support for the dependent model simply as a consequence of increasing the total length of the tree (i.e., the difference between ln(T) and ln(ti) will get bigger).
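A short numerical sketch makes this dependence on tree size explicit. It assumes the simplified probabilities discussed in this section, namely that the dependent model assigns the event a probability of ti/T and the independent model assigns it the square of that quantity. The function name `log_lr_darwin` and the example branch lengths are hypothetical.

```python
import math


def log_lr_darwin(branch_length, total_length):
    """Log likelihood ratio of the dependent over the independent model.

    Assumes P(A | M_dep) = t_i / T and P(A | M_ind) = (t_i / T) ** 2,
    so the result equals ln(T) - ln(t_i).
    """
    p_dep = branch_length / total_length
    p_ind = p_dep ** 2
    return math.log(p_dep) - math.log(p_ind)


# Same shift branch (t_i = 1) on trees of increasing total length T.
for total in (10.0, 100.0, 1000.0):
    print(total, round(log_lr_darwin(1.0, total), 3))
```

With the shift branch held fixed, the log ratio grows as the logarithm of total tree length, so adding taxa strengthens support for the dependent model even though no additional transitions have been observed.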
The assumptions used to derive this result differ very slightly from those used in available software; however, we can use simulation to test the validity of our result and to demonstrate that this is the mathematical reason that Pagel’s test returns a significant result. Using the R package diversitree (FitzJohn, 2012), we simulated a set of 20-taxon trees where both traits underwent an irreversible transition on a single, randomly chosen, internal branch. We then fit a Pagel model with constrained (Mdep) and unconstrained (Mind) transition rates. We also constrained the root state in both traits to 0, the rates of loss of both traits to 0, and the gain rates in the dependent model following the gain of the other trait to be extremely high. Plotting the empirically estimated differences in the MLEs against the predictions made under the simplifying assumptions above reveals a strong modal correlation between them (Figure 6). Differences likely reflect the fact that we have not explicitly made the assumption that P(Nx(t) ≥ 1) = P(Ny(t) ≥ 1) ≈ 1 when we fit the model with diversitree. Furthermore, we compare here only fully dependent and independent models. This can be seen when calculating the probability of one switch in each trait, P(Nx(t) = 1, Ny(t) = 1). In the fully dependent case this simply becomes P(Nx(t) = 1), and in the independent case it becomes P(Nx(t) = 1) P(Ny(t) = 1), but in the correlated case it takes an intermediate form that depends on the estimated correlation, which affects the likelihood ratio test (see Supplementary Material). However, such intermediate cases will only introduce slight differences and may not be distinguishable from the fully dependent case under Darwin’s scenario (though they will be important in more intermediate cases, see Supplementary Material).
Maddison and FitzJohn (2015) hinted that the coincident occurrence of single events could be a way of measuring the evidence for a correlation, but did not work out the details as we have done here. The key to understanding this result is to recall Gould and Eldredge’s famous dictum (1977) that “stasis is data”. The remarkable coincidence is not just that the two characters happened to evolve on the same branch but that they were never subsequently lost. For even a modestly sized tree, this coincidence is so unlikely that the alternative hypothesis of correlated evolution is preferred over the null. It is therefore not completely unreasonable that Pagel’s test tells us that these traits have evolved in an entirely correlated fashion.
However, one key consideration should make us suspicious of this line of reasoning. As Maddison and FitzJohn (2015) point out, the traits we use in comparative analyses are not chosen independently with respect to their phylogenetic distribution (as we assumed in our analysis). Rather, researchers’ prior ideas about how traits map onto trees likely inform which traits they choose to test for correlated evolution. For example, it is common practice among systematists to search for defining and diagnostic characteristics for named clades; these types of traits are of especial interest and are likely the same sorts of traits that researchers might include in comparative analyses, thereby greatly increasing the likelihood of finding traits with independent, unrelated origins that align with Darwin’s scenario. We agree with Maddison and FitzJohn (2015) that this type of ascertainment bias is likely prevalent in empirical studies, even if it is usually more subtle than testing for a correlation between milk and middle ear bones. However, we disagree with them that this renders establishing correlations in intermediate cases hopeless. Understanding the exact mathematical reasons why Pagel’s test infers a significant correlation in a given case provides a clear boundary condition that can help develop quantitative corrections for ascertainment bias. Furthermore, the issues of ascertainment bias are likely to rapidly dissipate as we move away from the boundary case of Darwin’s scenario. As a result, extending our analytical approach to more complicated scenarios will likely provide an even more meaningful estimate of the weight of evidence supporting a hypothesis of correlation.
The structure of a solution
We have shown in the three Case Studies that many PCMs, including those that form the bedrock of our field, are susceptible to being misled by rare or singular evolutionary events. This fundamental problem has sown doubts about the suitability and reliability of many methods in comparative biology, even if it was not obvious that these issues were connected. But again, the fact that apparently different issues share a common root makes us hopeful that there can be a common solution.
As we illustrate through our Case Studies, we think that accounting for idiosyncratic evolutionary events will be an essential step towards such a solution. However, we will need to think hard about how best to model such events. In Case Study II, we present one solution to the problem that involves explicitly accounting for the possibility of unaccounted adaptive shifts using Bayesian Mixture modeling. We believe this approach has a great deal of promise as it provides simultaneous identification of biologically interesting shifts and the explanatory power of a particular hypothesis.
However, we do not claim that such an approach is the only solution or that it solves the problem completely. Indeed, we find that in all three Case Studies, the uniting philosophy is to consider models that account for idiosyncratic background events, rather than strict adherence to a particular methodology. For example, we highlighted in the introduction that we think HMMs (following Beaulieu et al., 2013; Beaulieu and O’Meara, 2016) are a potentially powerful, and widely applicable solution, even though we did not consider these in detail here.
And there are still other potential solutions which we have not even mentioned yet. In our own work (Uyeda et al., 2017), we have used a strategy similar to the Bayesian Mixture Modeling, but instead of modeling the trait dynamics as a joint function of our hypothesized factors and background changes (represented by the RJMCMC component), we did the analyses in a two-step process: first, we used bayou (Uyeda and Harmon, 2014) to locate shift points on the phylogeny, then used Bayes Factors to determine if predictors could “explain away” shifts found through exploratory analyses. For PGLS and other linear modeling approaches, modeling the residuals using fat-tailed distributions (Landis et al., 2012; Blomberg et al., 2012; Elliot and Mooers, 2014; Duchen et al., 2017) may mitigate the impact of singular evolutionary events on the estimation of the slope (also see Slater and Pennell, 2013, for an alternative approach using robust regression). Furthermore, we also think that rigorous examination of goodness-of-fit and model adequacy following any comparative analysis is critical for finding unforeseen singular events driving signal in the dataset (Garland et al., 1992; Boettiger et al., 2012; Slater and Pennell, 2013; Pennell et al., 2015). Which of these solutions (including those that were included in our Case Studies and those that were not) will be the most profitable to pursue will probably differ depending on the question, dataset and application (we anticipate that there will not be a one-size-fits-all solution), but we do think that any compelling solution will involve a unification of phylogenetic natural history and hypothesis testing approaches.
But we want to take this a step further. While it is useful to account for phylogenetic events in our statistical models, a greater goal of comparative biology should be to explain why these events exist in the first place. We return to Maddison and FitzJohn’s (2015) “weak” goal of finding whether or not “two variables of interest appear to be part of the same adaptive/functional network, causally linked either directly, or indirectly through other variables.” We ultimately disagree with them that this constitutes a weak conclusion; the challenges of making these inferences from any comparative dataset are significant. Furthermore, we find the often repeated axiom “correlation does not mean causation” to be unhelpful. While the axiom is accurate in the strict sense, we believe that it obscures many logical and philosophical challenges to analyzing phylogenetic comparative data that are often ignored. And as is clear from reading the macroevolutionary literature, biologists do not shy away from forming causal statements from correlative data regardless. It therefore seems worthwhile to take seriously the question: “What would it take to infer causation from comparative data?” And even if we are to conclude that all the evidence for a hypothesized causal relationship stems from one or a few evolutionary events, is this finding biologically meaningful?
Phylogenies are graphical models of causation
One way to gain a foothold on the problem of causation is to build, communicate, and analyze phylogenetic comparative methods in a graphical modeling framework, a perspective that has recently been advocated by Höhna et al. (2014). Graphical models that depict hypothesized causal links between variables make explicit key underlying assumptions that may otherwise remain obscured; indeed, the precise assumptions of PCMs were hotly debated in the early days of their development (Westoby et al., 1995b, a; Nee et al., 1996; Harvey et al., 1995; McNab, 1988) and remain poorly understood to this day (Hansen and Orzack, 2005; Hansen and Bartoszek, 2012). As examples of how using graphical models forces us to be clearer in our reasoning, consider the graphs in Figure 7. We depict three different models of causation that have phylogenetic effects, each of which requires a different method of analysis to estimate the effect of trait X on trait Y. In our example, a four-species phylogeny provides possible pathways for causal effects, but variables may have entirely non-phylogenetic causes or may be blocked from ancestral causes by observed measurements, rendering the phylogeny irrelevant (e.g., Figure 7A). Edges connect nodes and indicate the direction of causality, where the nature of phylogenies allows us to assume that ancestors are causes of descendants, and not vice versa. This asymmetry results in what is known as a probabilistic Bayesian Network (a type of directed acyclic graph, or DAG) that predicts a specific set of conditional probabilities among the data.
Depending on the Bayesian network structure, the appropriate method of analysis can range from a non-phylogenetic regression (Figure 7A), to commonly used comparative methods such as Phylogenetic Generalized Least Squares (PGLS, Figure 7B), to methods that require modeling both the evolutionary history of interaction of both trait X and trait Y (Figure 7C) (Hansen, 1997; Butler and King, 2004; Hansen et al., 2008; Revell, 2010; Hansen and Bartoszek, 2012). We emphasize that this implies that the use of phylogeny in interspecific comparisons is an assumption that depends on the precise question being asked and the hypothesized causal network. It is often assumed and asserted that PCMs are simply a more rigorous version of standard regression. This is simply not true.
In cases where phylogeny does matter, we must specify the generating model for unobserved states in our causal graphs. For example, it is common to assume a BM model for residual variation in PGLS or that ancestral states are reconstructed using stochastic character mapping in OU modeling of adaptation. However, BM and other continuous Gaussian or Markov processes are only a few of the many types of processes that may generate change on a phylogeny. We have shown that discontinuous processes and rare, singular events are poorly handled in our current framework and lead to much confusion about what, exactly, our statistical methods are allowing us to infer from comparative data. Such models can be similarly illustrated using graphical models (Figure 8). By making our models explicit, we see that the phylogeny is best thought of as a pathway for past factors to causally influence the present-day distribution of observed states. These “singular-event” models are alternatives to the more continuous models we typically examine. Furthermore, by representing our models as graphs, we are poised to take advantage of the sophisticated approaches for causal reasoning (e.g., Pearl, 1995, 2009; Sugihara et al., 2012; Shipley, 2016) that have been embraced by fields like computer science but largely ignored by comparative biologists (a rare exception is the recent introduction of phylogenetic path analysis; Hardenberg and Gonzalez-Voyer, 2013).
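The idea of a phylogeny as a directed graph with a generating model on its edges can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a method from this paper: the four-species tree, unit branch lengths, rate, jump size, and the helper `sample_tree` are all hypothetical. Each node is drawn conditional on its parent (ancestral sampling), either under Brownian motion alone or with a one-off jump added on a single branch to mimic a singular event.

```python
import random

# Each node maps to (parent, branch length); the root has no parent.
TREE = {
    "root": (None, 0.0),
    "n1": ("root", 1.0), "n2": ("root", 1.0),
    "A": ("n1", 1.0), "B": ("n1", 1.0),
    "C": ("n2", 1.0), "D": ("n2", 1.0),
}
ORDER = ["root", "n1", "n2", "A", "B", "C", "D"]  # parents precede children


def sample_tree(tree, order, sigma2, rng, jumps=None):
    """Draw every node given its parent; optionally add a jump on a branch."""
    jumps = jumps or {}
    values = {}
    for node in order:
        parent, length = tree[node]
        if parent is None:
            values[node] = 0.0  # root state fixed at zero for simplicity
            continue
        step = rng.gauss(0.0, (sigma2 * length) ** 0.5)  # Brownian increment
        values[node] = values[parent] + step + jumps.get(node, 0.0)
    return values


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


rng = random.Random(7)
bm = [sample_tree(TREE, ORDER, 1.0, rng) for _ in range(20000)]
cov_sisters = covariance([d["A"] for d in bm], [d["B"] for d in bm])
cov_cousins = covariance([d["A"] for d in bm], [d["C"] for d in bm])

# Singular-event alternative: one jump on the branch leading to node n1.
jump = [sample_tree(TREE, ORDER, 1.0, rng, jumps={"n1": 5.0}) for _ in range(20000)]
mean_a_jump = mean([d["A"] for d in jump])
mean_c_jump = mean([d["C"] for d in jump])
print(cov_sisters, cov_cousins, mean_a_jump, mean_c_jump)
```

In this sketch, sister species covary because they share an unobserved ancestor, while the jump shifts every descendant of one branch at once. Both patterns flow along the same graph, yet they arise from different generating models.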
One clear situation in which such graphical modeling would improve inference is when considering phylogeny reverses the sign of the relationship between two variables. This is precisely what Nee et al. (1991) found looking at the relationship between body size and abundance in British birds; depending on how they aggregated the data (means of species, means of genera, means of tribes, etc.), the direction of correlation flipped back and forth. This reversal in the sign of the relationship between two variables X and Y when conditioning on a third, Z, is a general and widely studied statistical phenomenon known as “Simpson’s paradox” (Blyth, 1972). Nee and colleagues (1991; 1996) hold up their findings of the British bird study to be emblematic; in their view, the presence of Simpson’s paradox in their data clearly implies that phylogeny is key to making sense of interspecific data.
However, as Pearl (2014) has convincingly demonstrated, Simpson’s paradox is not really paradoxical at all when considered from the standpoint of Bayesian Networks. In fact, Pearl shows that the appropriate way to analyze the data depends crucially on what one assumes is causing what. To understand how causal inference resolves Simpson’s paradox, we now present a rather artificial, but nevertheless illustrative, example (Pearl, 2009). Consider three traits: body size (B), abundance (N) and migratory behavior (M) in birds. Given the Bayesian Networks presented in Figure 9, we have two possible hypotheses for the causal relationships between the traits. We further consider the possibility that we do not have adequate data on M, and thus only B and N are observed. Our goal is to estimate the causal effect of B on N. In Figure 9A, body size influences whether or not species become migratory, and both migratory status and body size influence species abundance (but in opposite directions). Furthermore, under this scenario, both body size and migratory status will have phylogenetic signal. We can evolve traits along the phylogeny depicted in Figure 9C and obtain a bivariate plot that looks like Figure 9D. Under the alternative Bayesian Network, migratory behavior still has a positive effect on species abundance, but also increases body size, which in turn decreases species abundance. These two causal structures are observationally equivalent, meaning that any distribution simulated under one can be replicated under the alternative causal structure. Therefore, both networks can produce datasets with phylogenetic signal in both body size and migratory behavior, and both can produce a dataset with the distribution in Figure 9D (see Supplementary Material for additional details on generating Figure 9).
How then should we analyze the data if we want to understand the effect of body size on species abundance? If we assume that body size influences migratory behavior, then increasing body size (for example, if natural selection leads a species to become larger), will increase the probability of that species becoming migratory — and the two opposing effects will result in relatively little change in species abundance. Therefore, we should perform Ordinary Least Squares regression to estimate the net causal effect of increasing body size. We also note that all the phylogenetic signal is coming from the evolution of body size, which becomes irrelevant once we observe body size, and thus we do not need to perform PGLS. By contrast, if migratory behavior causes changes in body size, then selecting for an increase in body size will not result in a lineage changing their migratory status at all. Therefore, we are assured that increasing body size will likewise always decrease species abundance. Consequently, we should perform PGLS to account for the phylogenetic signal in the residual variation imposed by (unobserved) migratory status.
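The sign reversal at the heart of this argument can be reproduced with a small, deliberately non-phylogenetic simulation. The sketch below is an assumption-laden illustration and not the procedure used to generate Figure 9: the effect sizes, noise levels, seed, and the helper `ols_slope` are hypothetical, data are generated with body size influencing migratory status, and migratory status is treated as observed purely so the two estimates can be compared.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least squares slope of y on x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(2024)
body, migratory, abundance = [], [], []
for _ in range(20000):
    b = rng.gauss(0.0, 1.0)                       # body size (B)
    m = 1 if b + rng.gauss(0.0, 0.5) > 0 else 0   # larger species migrate more often (M)
    n = -1.0 * b + 4.0 * m + rng.gauss(0.0, 0.5)  # B lowers, M raises abundance (N)
    body.append(b)
    migratory.append(m)
    abundance.append(n)

# Slope ignoring migratory status: the net association of B with N.
pooled_slope = ols_slope(body, abundance)

# Slopes within each migratory class: the association with M held fixed.
slope_by_group = {}
for state in (0, 1):
    xs = [b for b, m in zip(body, migratory) if m == state]
    ys = [n for n, m in zip(abundance, migratory) if m == state]
    slope_by_group[state] = ols_slope(xs, ys)

print(pooled_slope, slope_by_group)
```

With these assumed values the pooled slope is positive while both within-class slopes are negative. Neither number is wrong: the pooled slope answers the question posed by a graph in which body size influences migration, and the within-class slope answers the question posed by a graph in which migration influences body size.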
By working through the logic of comparative analyses using graphical models, we have come to essentially the same line of reasoning as Westoby et al. (1995b, a), who, in the early days of PCMs, challenged the growing consensus that phylogeny needed to be included in any interspecific comparison, a consensus which has only gotten stronger as the years have passed (also see McNab, 2003, for a related critique). Westoby and colleagues were concerned that including phylogeny in interspecific comparisons necessarily favored some causal explanations over others. At the time, their critique was dismissed as innumerate hogwash (Harvey et al., 1995; Nee et al., 1996) and this evaluation has largely stuck. However, from our example of bird size and abundance, it is apparent that Westoby et al. were right all along: phylogenetic comparative methods are powerful tools for drawing inferences from interspecific data, but they necessarily imply some types of causal structures and negate others. It is too much to ask of our methods to decide what questions we ought to ask. As Westoby et al. (1995a) put it: “No statistical procedure can substitute for thinking about alternative evolutionary scenarios and their plausibility” (p. 534).
Concluding remarks: are our models valid tests of our causal hypotheses?
By explicitly including phylogeny in our graphical models of causation, we are forced to reckon with the scope of the inference problem and the ability of our data to be informative. While most of the statistical assumptions of our methods are well known (e.g., for linear models, we assume that errors have equal variance and are normally distributed, etc.), Gelman and Hill (2006) argue that there is a more fundamental assumption, validity of data, that is almost always implicit and often overlooked:
“Most importantly, the data you are analyzing should map to the research question you are trying to answer. This sounds obvious but is often overlooked or ignored because it can be inconvenient. Optimally, this means that the outcome measure should accurately reflect the phenomenon of interest, the model should include all relevant predictors, and the model should generalize to the cases to which it will be applied.” (Gelman and Hill, 2006)
We believe that far less discussion in comparative methods has focused on whether the data collected are valid for the research questions being posed by a given study. This is in large part because comparative data and the phylogeny that underlies them are largely beyond the control of the researcher, but careful consideration of the data is required to understand what research questions can be reasonably answered. We find that most comparative research questions have a poorly defined scope of inference: it is unclear to what population a model or inference should generalize. If we ask “are fur and middle ear bones correlated?”, we must also specify “in what organisms?”. Since no organisms other than mammals have the particular traits we define as “fur” and “middle ear bones”, we actually do not need statistics at all to determine whether these traits are correlated; we have sampled nearly the entire population relevant to the question! In nature, they are perfectly collinear. If we wish to expand our scope of inference to hypothetical organisms that evolve fur and/or middle ear bones, we are free to do so. However, we have collected a very poor data sample for such a question. It is not the job of the statistical method to demonstrate that a poorly designed experiment does not represent its scope of inference; rather, it is our job as researchers and statisticians to ask whether or not such a relationship addresses our biological question and whether the sample of data collected is valid for the question being asked.
In this paper we have tried to synthesize a wide variety of statistical and philosophical concepts to lay out a roadmap for where we think comparative biology should go. We certainly do not have all the answers. Of the paths we have explored, there are many details that need to be worked out, and we fully anticipate that there are many alternative paths that we have not even considered. However, we argue that if we are going to make substantial progress in using phylogenetic data to test evolutionary hypotheses, we will need to reckon more seriously with the idiosyncratic nature of evolutionary history, and to more clearly articulate precisely what we want to test and whether our models and data are suitable for the task.
Code Availability
Data and code needed to reproduce all analyses in this manuscript are available at https://github.com/uyedaj/pnh-ms/.
Acknowledgments
We thank Luke Harmon, Daniel Caetano, Eliot Miller, Ben Freeman, Florent Mazel, Joel McGlothlin, Martha Muñoz, Barbara Neto-Bradley, Francisco Henao Diaz, and Mauro Sugawara for their critical feedback on these ideas and this manuscript. JCU would especially like to acknowledge the insight and teaching gleaned from conversations over the years with Thomas Hansen, which inspired the bulk of this manuscript (though he holds no culpability for the contents and opinions therein). MWP was supported by an NSERC Discovery Grant. JCU was supported by NSF Grants to Luke Harmon (DEB-1208912) and JCU (DBI-1661516).