Abstract
Evolutionary relationships between species are traditionally represented in the form of a tree, called the species tree. The reconstruction of the species tree from molecular data is hindered by frequent conflicts between gene genealogies. A standard way of dealing with this issue is to postulate the existence of a unique species tree where disagreements between gene trees are explained by incomplete lineage sorting (ILS) due to random coalescences of gene lineages inside the edges of the species tree. This paradigm, known as the multi-species coalescent (MSC), is constantly violated by the ubiquitous presence of gene flow revealed by empirical studies, leading to topological incongruences of gene trees that cannot be explained by ILS alone. Here we argue that this paradigm should be revised in favor of a vision acknowledging the importance of gene flow and where gene histories shape the species tree rather than the opposite. We propose a new, plastic framework for modeling the joint evolution of gene and species lineages relaxing the hierarchy between the species tree and gene trees. As an illustration, we implement this framework in a mathematical model called the genomic diversification (GD) model based on coalescent theory, with four parameters tuning replication, genetic differentiation, gene flow and reproductive isolation. We use it to evaluate the amount of gene flow in two empirical data-sets. We find that in these data-sets, gene tree distributions are better explained by the best fitting GD model than by the best fitting MSC model. This work should pave the way for approaches of diversification using the richer signal contained in genomic evolutionary histories rather than in the mere species tree.
Introduction
The most widely used way of representing evolutionary relationships between contemporary species is the so-called species tree, or phylogeny. The high efficiency of statistical methods using sequence data to reconstruct species trees, hence called ‘molecular phylogenies’, led to precise dating of the nodes of these phylogenies [34, 37, 82]. Notwithstanding the debatable accuracy of these datings, the use of time-calibrated phylogenies, sometimes called ‘timetrees’ [33], has progressively replaced a view where phylogenies merely represent tree-like relationships between species with a view where the timetree is the exact reflection of the diversification process [60, 67, 80]. In this view, the nodes of the phylogeny are consequently seen as punctual speciation events where one daughter species is instantaneously ‘born’ from a mother species. In this paper, we explore an alternative view of diversification, acknowledging that speciation is a long-term process [16, 42, 68] and not invoking any notion of mother-daughter relationship between species as done in the timetree view. This alternative view is gene-based rather than species-based, comparable with Wu’s genic view of speciation [85]. We use here the term ‘gene’ in the sense of “non-recombining locus”, i.e., a region of the genome with a unique evolutionary history. Our view is meant in particular to accommodate the well-recognized existence of gene flow between incipient species, which persists during the speciation process and long after [50].
The timetree view of phylogenies does acknowledge that gene trees are not independent and may disagree with the species tree [47], but current methods jointly inferring gene trees and species tree rely on the following assumptions that we question in the next section: there is a unique species tree, the species tree shapes the gene trees and the species tree is the only factor mediating all dependences between gene trees (they are independent conditional on the species tree).
This view is materialized in a model called the ‘multispecies coalescent’ (MSC) [38] where, conditional on the species tree, the evolutionary histories of genes follow independent coalescents constrained to take place within the hollow edges of the species tree. Many methods have been developed to estimate the species tree under the MSC, such as full likelihood methods (e.g. BEAST [34], BPP [88]) which average over gene trees and parameters [87], and the approximate or summary coalescent methods (e.g. Astral [57], mp-est [44], and stells [86]) which use a two-step approach: gene trees are first inferred and then combined to estimate the species tree that minimizes conflicts among gene trees. Discordance between gene topologies is then explained, as a first approximation at least, by the intrinsic randomness of coalescences resulting in incomplete lineage sorting (ILS) (figure 1).
However, the presence of gene flow (hybridization, horizontal transfer) is now widely recognized between closely related species, and even between distantly related species [50]. Porous species boundaries, allowing for gene exchange because of incomplete reproductive isolation, are indeed regularly observed in diverse taxa such as amphibians [20, 65], arthropods [12], cichlids [84], cyprinids [6, 23, 24, 25, 79], insects [61, 64, 83], and even more frequently among bacteria [50, 78]. Long neglected, gene flow has recently been recognized as an important evolutionary driving force, through adaptive introgression or the formation of new hybrid taxa [1]. The ubiquity of genetic exchange across the Tree of Life between contemporary species suggests that gene flow has occurred many times in the evolutionary past, and might actually be the most important cause of discrepancies between gene histories (e.g. [8, 11, 22, 36]) (figure 1). Accordingly, several extensions to the MSC model have been considered allowing for gene flow between species [39, 89]. These models acknowledge that species boundaries can be permeable at a few specific timepoints [32]. Unfortunately, because of the heavy computational cost of modeling the coalescent with gene flow, these methods are limited to small data-sets [89]. More importantly, they might not be appropriate to realistically model gene flow, given the frequency of gene flow across time and clades described in empirical studies [77]. Additionally, some of these methods, Astral and mp-est, might infer erroneous gene trees when gene flow is present [46]. These observations urge for novel approaches where gene flow is the rule rather than the exception.
To fill this void, we propose here an alternative model, that we call the genomic diversification (GD) model, framed with minimal assumptions arising from recent empirical evidence. Unlike the timetree view, our genomic view of diversification does not put the emphasis on the species tree (which in our model becomes a network rather than a tree) and assumes that gene trees shape the species tree (rather than the opposite).
The Genomic View of Diversification
Gene flow and the questionable existence of a species genealogy
The biological species concept (BSC [53]) defines species as groups of interbreeding populations that are reproductively isolated from other groups. This definition postulates the non-permeability of species boundaries, which is contradicted by the growing body of evidence describing permeable or semi-permeable genomes, even between distantly related taxa. To integrate the possibility of gene flow into the definition of species, Wu [85] shifted the emphasis from isolation at the level of the whole genome to differential isolation at the gene level. Species are thus defined as differentially adapted groups for which inter-specific gene flow is allowed except for genes involved in differential adaptation (a well-defined form of divergence in which the alternative alleles have opposite fitness effects in the two groups) [85]. Because a fraction of the genome may still be exchanged after speciation is complete, a mosaic of gene genealogies is expected between divergent genomes [85]. Much evidence supports this prediction with the observation of highly conflicting gene trees, e.g. Darwin’s finches [26, 28], sympatric sticklebacks [69, 73], Iberian barbels [24], and Rhagoletis species [2].
Accordingly, the notion of a species genealogy as the binary division of species into new independently evolving lineages in bifurcating phylogenetic trees appears inappropriate. To avoid this misleading vision of speciation, we here wish to relax the species tree constraint by considering only gene genealogies as real genealogies, thereby laying aside, at least temporarily, the notion of species genealogy. To do so, we do not specify mother-daughter relationships between species, yet we postulate the existence of species at any time, and assume that we can unambiguously follow the genealogies of genes (defined as non-recombining loci, as mentioned above).
Looking forward in time, genes belonging to two distinct individuals may find each other, in a next generation, in the same genome because of recombination. The same process might occur with two individuals belonging to different species under gene flow: this process, viewed in the backward direction of natural time, is defined here as disconnection (figure 2).
A primary consequence of the presence of gene flow is to challenge the notion of a unique ancestral species. If all genes ancestral to species S have traveled through the same species in the past, then species S has only one single ancestor species at any time. But because of gene flow (i.e. disconnection), these genes may lie in different species living at a given time in the past, such that species S can have several ancestral species at this time.
Genome cohesion under continuous gene flow
While some genes (e.g., genes involved in divergent adaptation) are hardly exchanged between populations, other genes (e.g., neutral genes unlinked to genes under divergent selection) can be subject to gene flow between different species [66, 85]. Gene flow can persist for long periods of time, with evidence suggesting introgression events occurring on periods lasting up to 20 Myr [6, 24, 84]. Over time, genetic differences will accumulate in regions of low recombination and expand via selective sweeps, leading eventually to complete reproductive isolation [85]. Accordingly, pairs of species will likely exhibit greater genetic incompatibility through time, i.e. be less permeable to gene flow, as has been observed for Iberian barbels [24], pea aphids [64], or salamanders [65]. In other words, gene lineages remaining too long isolated within the same species decrease their ability to introgress the genome of another species, a property that we name genome cohesion and which is the consequence of spontaneous genetic differentiation.
Seen from the viewpoint of a present-day genome, genome cohesion means that ancestral lineages of different genes in this genome cannot have spent too much time in the past in different species. As time goes backward, this results in an apparent attractive force that brings together lineages of genes belonging to the same present-day genome toward the same species, a phenomenon we term intragenomic connection.
This has to be distinguished from the coalescence which refers to the point in time when the lineages of homologous genes sampled from different genomes merge into a single lineage. To coalesce, homologous genes must be located in the same individual, hence in the same species. We call the intergenomic connection of two genomes sampled from different present-day species (figure 2) the first event (in backward time) of migration of two homologous genes from each of these genomes into the same species. Note that after coalescence (hence after intergenomic connection) of two homologous lineages from the two different genomes, the resulting lineage is now common to these two genomes. As a consequence of the mere intragenomic connection, going further back in time, all other genes will then converge to the same species and further coalesce, until all homologous gene lineages have coalesced.
The genomic diversification (GD) model
We propose here a new plastic framework, derived from the genomic view of diversification described above, that acknowledges the importance of gene flow and relaxes the hierarchy between the species tree and gene trees. This model that we named the genomic diversification (GD) model, uses coalescent theory for modeling the joint evolution of gene and species lineages, reconciling phylogenomics with our current knowledge of species diversification. The GD model features only four parameters prescribing four processes affecting gene genealogies: replication, genetic differentiation, introgression, and reproductive isolation. In backward time, these four processes respectively become: coalescence, intragenomic connection, disconnection and intergenomic connection. To model the progressive isolation with ongoing gene flow, we assume only one event at a time (figure 2). This framework can be made more complex by letting the parameters depend on time, on the gene, or on any prescribed category of genes.
The GD model was implemented in R (https://www.r-project.org) and evaluated under different sets of parameters. We also applied it to two empirical multi-locus data-sets showing complex evolutionary patterns due to gene flow, each comprising six morphologically and ecologically distinct species, the Ursinae (a bear subfamily) [41] and the Geospiza clade (a genus of Darwin’s finches) [19]. We estimated in particular 1) the relative amount of gene flow that has shaped each data-set, and 2) the corresponding average number of ancestral species.
Material and Methods
Parametrization of the GD model
At t = 0, n homologous genes are sampled in each of N sampled species. We will call a block at (backward) time t a (maximal) set of gene lineages that lie in the same species at time t, that are all ancestral to genes belonging to the same genome at t = 0. We now specify how we have parameterized the GD model. We follow the configuration of gene lineages into blocks in backward time, assuming a time-discrete Markov chain associated to the time-continuous chain with the following rates.
Intragenomic connection (rate a). At any time t in the past, due to genome cohesion, each gene lineage at rate a independently, escapes from its block and chooses a target block with a probability proportional to the number of lineages (ancestral to genes belonging to the same genome at t = 0) harbored by the target block.
Intergenomic connection (rate b). At any time t in the past, each pair of homologous genes (or preferably only of genes belonging to some previously prescribed category, like genes contributing to reproductive isolation) find themselves in the same species at rate b independently.
Coalescence (rate c). At any time t in the past, each pair of homologous genes lying within the same species coalesces at rate c.
Disconnection (rate d). At any time t in the past, as a result of gene flow, each gene lineage, at rate d independently, escapes from its block and creates a new block (i.e., gets into a species harboring no other gene lineage) (figure 2). To model the introgression of bigger chunks of DNA, we could alternatively assume that instead of one lineage, a given fraction of the lineages of a block can simultaneously create a new block. We will not consider this possibility in the present work.
We define the number of ancestral species of a given genome at time t, as the number of blocks at time t containing the ancestral lineages to this genome. We considered a time unit to be equal to the time elapsed between two events that we assumed to be constant for the sake of simplicity. In this manuscript we wish to explore the impact of gene flow rather than ILS to explain gene tree conflicts, and thus scale a large c value (coalescence rate) so that coalescent events are instantaneous. Therefore, only the parameters a, b, and d influence the gene genealogies.
A single sampled genome
We aimed to evaluate the variation in the number of ancestral species with gene flow. We performed simulations for a single sampled genome containing n genes (with n = 20, 50, 100, 200), and varied the relative amount of gene flow (disconnection) compared to genetic differentiation (intragenomic connection), i.e. the ratio d/a (with a = 1 and d ∈ [0.2, 2], every 0.2). The number of time units t was set to 10,000. We sampled the number of ancestral species every 500 time units starting at time t = 5,000, and averaged these 11 values for each simulation. For each set of parameters, 5 replicates were performed and averaged.
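The single-genome simulation described above can be sketched as follows. This is a minimal Python illustration and not the authors' R implementation. It assumes that each ordered pair of gene lineages reconnects at rate a (so that a lineage joins a block of k lineages at rate a·k) and that each lineage disconnects at rate d. The function name `simulate_blocks`, the burn-in, and the continuous time-averaging (instead of sampling every 500 time units) are choices made for the example only.

```python
import random


def simulate_blocks(n, a, d, t_max, burn_in, seed=0):
    """Time-averaged number of blocks (ancestral species) for one genome.

    Hypothetical helper: n >= 2 gene lineages, intragenomic connection at
    rate `a` per ordered pair of lineages, disconnection at rate `d` per
    lineage. The average is taken over times between burn_in and t_max.
    """
    rng = random.Random(seed)
    labels = [0] * n              # at t = 0 all lineages share one block
    next_label = 1
    # The total event rate is constant because connections between two
    # lineages already in the same block are kept as "null" events.
    total_rate = a * n * (n - 1) + d * n
    p_connect = a * n * (n - 1) / total_rate
    t, weighted_sum, weight = 0.0, 0.0, 0.0
    while t < t_max:
        dt = rng.expovariate(total_rate)
        if t >= burn_in:
            weighted_sum += len(set(labels)) * dt
            weight += dt
        t += dt
        if rng.random() < p_connect:
            i, j = rng.sample(range(n), 2)
            labels[j] = labels[i]          # lineage j joins the block of i
        else:
            labels[rng.randrange(n)] = next_label   # new, empty species
            next_label += 1
    return weighted_sum / weight


if __name__ == "__main__":
    for d in (0.2, 1.0, 2.0):
        print(d, round(simulate_blocks(20, 1.0, d, 200.0, 20.0), 2))
```

Under these assumptions the time-averaged number of blocks grows with d/a, which is the qualitative behaviour the simulations are designed to measure.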
A model is said to be sampling consistent if the same outcome is expected for any k sampled genes independently of the total number n of genes in the genome. To evaluate this property, we randomly sampled k = 20 genes from each genome of n ≥ 20 genes and computed their average number of ancestral species.
A sample of several genomes
When considering several sampled genomes of n genes, n gene genealogies are obtained for a particular parameter setting. To characterize each set of gene genealogies, we employed a tree comparison metric, the Billera-Holmes-Vogtmann (BHV) metric [3]. The BHV metric accounts for both branch length and topological differences. This metric is a distance based on the concept of tree space, a quadrant complex with quadrants sharing some faces. Two trees with the same topology lie in the same quadrant, otherwise they lie in two distinct quadrants. At a common edge between two quadrants, the incongruent internal branches between trees have lengths equal to zero. Then a distance can be calculated between two rooted trees across these interconnected quadrants.
To compare trees that did not evolve on the same time scale, BHV distances were computed on re-scaled trees. For each set of gene trees issued from a single simulation or data-set, we rescaled all the trees so that the median of the most recent node depth is 1. We scaled the trees according to the median first coalescence among gene trees because in our model, the first coalescence initiates the intragenomic connection between genomes of different species, and hence coalescence of all the remaining homologous genes.
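The rescaling step can be illustrated with a short Python sketch. It assumes, for the example only, that each gene tree is summarized by the list of its internal node depths (coalescence times); the function name `rescale_trees` is hypothetical, and a real analysis would rescale the branch lengths of full tree objects before computing BHV distances.

```python
from statistics import median


def rescale_trees(node_depths_per_tree):
    """Rescale a set of gene trees so that the median depth of the most
    recent node (the first coalescence) across trees equals 1.

    node_depths_per_tree: list of lists, one list of node depths per tree.
    """
    first_coalescences = [min(depths) for depths in node_depths_per_tree]
    scale = median(first_coalescences)
    return [[depth / scale for depth in depths]
            for depths in node_depths_per_tree]


if __name__ == "__main__":
    trees = [[2.0, 4.0], [1.0, 3.0], [4.0, 8.0]]
    print(rescale_trees(trees))
```

Dividing every depth in the set by the same constant preserves the relative timing of events within and between trees of that set.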
We evaluated the influence of the number of genes n (with n = 5, 10, 20), of the number of species N (with N = 6, 10), and of the relative amount of gene flow (with d =1 and , 0.5, 0.9, 1.3, 1.7, 2.1, 2.5, 2.9, 3.3) on gene tree diversity (BHV distances) (figure 4A). The other parameters were fixed, with b = 0.05 and c = 200.
For the same values of and c, and for n = 10, N = 6, we also evaluated the influence of the intergenomic connection rate b (with b = 0.01, 0.02, 0.05, 0.12) on gene tree diversity (BHV distances) (figure 4B).
The GD model versus the MSC model
To evaluate the ability of MSC methods to deal with gene flow, we estimated a species tree and its gene trees (MSC model with no gene flow) using sequences corresponding to gene trees simulated under the GD model (with gene flow).
A set of 10 gene trees was simulated under the GD model (with N = 6, b = 0.05,, and d = 1) (figure 5). We simulated DNA sequences (package ‘PhyloSim’ in R [75]) corresponding to each of the 10 gene trees with the model of DNA evolution estimated by modeltest (function ‘modelTest’, package ‘phangorn’ in R [72]) for the TRAPPC10 intron of the bear data-set detailed below [41]: HKY model, rate matrix: a = 1.00, b = 5.29, c = 1.00, d = 1.00, e = 5.29, f = 1.00, base frequencies: 0.26, 0.19, 0.21, 0.34. Prior to simulating the sequences, the 10 gene trees were scaled to the TRAPPC10 intron phylogenetic tree length (built with RAxML 8.1.11 [81] assuming the GTR (general time reversible) model with 1,000 bootstrap replicates).
The species tree and the gene trees associated were estimated from the simulated sequences with the program BEAST v. 2.4.8 [5] with the following parameters: unlinked substitution models, unlinked clock models, unlinked trees, HKY substitution model for each of the 10 genes, strict clock, Yule process to model speciation events, and 80 million generations with sampling every 5000 generations. To set the calibration time of the root we assumed that 1 time unit corresponded to 10 ky; on average the last coalescence event among the 10 GD trees occurred at t = 700. Accordingly, we used a normal distribution prior for the root heights (mean=7.0; stdev=1.0).
Inferences from empirical data-sets
Empirical data-sets
The amount of gene flow that has shaped the two empirical data-sets was estimated by comparing the distributions of their pairwise gene tree distances with those of simulated trees. The first data-set comprised 14 autosomal introns for 6 bear species (Helarctos malayanus, Melursus ursinus, Ursus americanus, U. arctos, U. maritimus, and U. thibetanus) and 2 outgroups (Ailuropoda melanoleuca and Tremarctos ornatus) [41]. The sequences were downloaded from GenBank (supplementary table S1). As in Kutschera et al. [41], all variation within and among individuals was collapsed into one single 50% majority-rule-consensus sequence for each of the 8 species. The phylogenetic trees were built with the program BEAST v. 1.8.3 [13], with the parameters used by the authors of [41]: Yule prior to model the branching process, strict clock, a normal prior on substitution rates (0.001 ± 0.001) (mean ± SD), minimum age of 11.6 My for the divergence of A. melanoleuca from other bears (exponential prior: mean = 0.5; offset = 11.6), and 10 million generations with sampling every 1000 generations. The models of DNA evolution were estimated by modeltest (function ‘modelTest’, package ‘phangorn’ in R [72]) (supplementary table S2). The monophyly of the ingroup and the topology among the outgroups were constrained according to the topology depicted in Kutschera et al. [41].
The second data-set comprised 7 nuclear markers for 6 finch species (Geospiza conirostris, G. fortis, G. fuliginosa, G. magnirostris, G. scandens, and G. septentrionalis) and 2 outgroups (Camarhynchus psittacula and Platyspiza crassirostris) [19]. The sequences were downloaded from GenBank (supplementary table S3). The phylogenetic trees were built with the program BEAST v. [13] with the parameters used by Farrington et al. [19]: coalescent constant size prior to model the branching process, strict clock, substitution rate equal to 1, specific models of DNA evolution defined by the authors (supplementary table S2), and 10 million generations with sampling every 1000 generations. The monophyly of the ingroup and the topology among the outgroups were constrained according to the topology depicted in [19].
Estimation of parameters under the multi-species coalescent (MSC) model
We optimized the MSC model for N = 6 species by varying two parameters, the speciation rate λ and the extinction rate µ, and fixing the coalescence rate to 1. Birth-death trees of 6 tips (function ‘sim.bdtree’, package ‘geiger’ in R) were simulated in a grid of (λ, µ = mλ) with λ ∈ [0.02, 0.34], every 0.02, and m ∈ [0.1, 0.65], every 0.05. Because we simulated small trees (6 tips), the degree of variation between trees simulated with the same parameters was high. Therefore, for each value of (λ, µ) we randomly selected 15 species trees for which the crown age did not differ by more than 2.5% from the expected crown age. Next, we simulated 10 gene genealogies for each species tree (coalescence rate fixed to 1).
If the diversification rate (speciation rate minus extinction rate) is low, all the homologous genes will coalesce before the next node in the species tree, so that all the gene trees will have the same topology. On the contrary, if the diversification rate is too high, some homologous genes will not have time to coalesce before the next node of the species tree, resulting in incongruent gene trees due to the randomness of coalescences (ILS).
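The link between short internodes and ILS can be made concrete in the simplest case of three species, which is a deliberate simplification of the 6-tip birth-death setting used here. Assuming an internal branch of length T with pairwise coalescence rate c, the two sister lineages fail to coalesce in that branch with probability exp(-cT), and two of the three equally likely coalescence orders in the ancestral species then give a gene tree that disagrees with the species tree. The Python sketch below (function names are hypothetical) checks this classical result by simulation.

```python
import math
import random


def discordance_probability(T, c=1.0):
    """Probability that a 3-taxon gene tree disagrees with the species
    tree ((A,B),C) when the internal branch has length T."""
    return (2.0 / 3.0) * math.exp(-c * T)


def simulate_discordance(T, c=1.0, reps=20000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    discordant = 0
    for _ in range(reps):
        # Waiting time for lineages A and B to coalesce in the internal branch.
        if rng.expovariate(c) < T:
            continue              # coalescence inside the branch: concordant
        # Otherwise three lineages enter the ancestral species and the first
        # pair to coalesce is chosen uniformly among the three pairs.
        if rng.randrange(3) != 0:  # pair 0 stands for (A, B)
            discordant += 1
    return discordant / reps


if __name__ == "__main__":
    for T in (0.1, 0.5, 2.0):
        print(T, discordance_probability(T), simulate_discordance(T))
```

Long internal branches (low diversification rate) drive the discordance probability toward zero, whereas short ones (high diversification rate) push it toward 2/3.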
Estimation of parameters under the genomic diversification (GD) model
Equivalently, we optimized the GD model for N = 6 by varying two parameters, here a and b, and fixing d = 1 and c = 200 (recall that c is given a sufficiently large value that coalescences are instantaneous). Since increasing n has no effect on BHV distances (see above and figure 4), we simulated genomes with n = 10 genes. The number of time units t was set to 5,000, which guarantees the coalescence of all homologous genes. We performed 15 replicates under each parameter combination in a grid of with , every 0.2, and b ∈ [0.01, 0.12], every 0.01.
For both models (MSC and GD) we employed the Kullback-Leibler (KL) divergence (package ‘FNN’ in R) as a distance metric to find the best set of parameters by minimizing this distance between the distributions of BHV pairwise distances of empirical and simulated trees. The lower the KL divergence, the better the fit.
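The fitting criterion can be sketched in Python as follows. Note the substitution: the analysis relies on the R package FNN, which estimates the KL divergence from nearest-neighbour distances, whereas this example uses a simpler histogram-based estimate with a small pseudocount. The function names `kl_divergence_hist` and `best_fit`, the number of bins, and the pseudocount are all assumptions made for illustration.

```python
import math


def kl_divergence_hist(p_sample, q_sample, bins=20, pseudo=1e-6):
    """Histogram estimate of KL(P || Q) from two samples of distances."""
    lo = min(min(p_sample), min(q_sample))
    hi = max(max(p_sample), max(q_sample))
    width = (hi - lo) / bins or 1.0    # guard against identical values

    def histogram(sample):
        counts = [pseudo] * bins       # pseudocount avoids log(0)
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        total = sum(counts)
        return [count / total for count in counts]

    p, q = histogram(p_sample), histogram(q_sample)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def best_fit(empirical, simulated_by_params):
    """Return the parameter setting whose simulated distances minimize
    the KL divergence from the empirical distances."""
    return min(simulated_by_params,
               key=lambda params: kl_divergence_hist(
                   empirical, simulated_by_params[params]))


if __name__ == "__main__":
    empirical = [0.1 * i for i in range(50)]
    simulated = {"setting A": [x + 0.05 for x in empirical],
                 "setting B": [x + 2.0 for x in empirical]}
    print(best_fit(empirical, simulated))
```

A histogram estimate depends on the binning, so it should be read as a rough stand-in for the nearest-neighbour estimator rather than a replacement.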
Results
A single sampled genome
Let us consider the case of N = 1 sampled genome containing n genes. We let A(t) = (A1(t), …, An(t)) denote the sorting of genes into ancestral species t units of time before the present. More precisely, Ak(t) denotes the number of ancestral species containing k gene lineages, so that $\sum_{k=1}^{n} k\,A_k(t) = n$ and $S(t) = \sum_{k=1}^{n} A_k(t)$ is the total number of species at t ancestral to the sampled genome. For each ε ∈ (0, 1], we will also be interested in the number $S_\varepsilon(t) = \sum_{k \ge \lceil \varepsilon n \rceil} A_k(t)$ of ancestral species containing at least a fraction ε of the genome (with $\lceil x \rceil$ denoting the smallest integer larger than or equal to x). All stationary quantities will be denoted by the same symbols, replacing t with ∞.
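These summary statistics are simple functions of the block configuration. As a small Python illustration (the function names are hypothetical and a configuration is represented, by assumption, as the list of block sizes):

```python
import math
from collections import Counter


def frequency_spectrum(block_sizes):
    """Return [A_1, ..., A_n], where A_k is the number of blocks
    (ancestral species) containing exactly k gene lineages."""
    n = sum(block_sizes)
    counts = Counter(block_sizes)
    return [counts.get(k, 0) for k in range(1, n + 1)]


def number_of_species(block_sizes):
    """Total number of ancestral species S."""
    return len(block_sizes)


def number_of_species_at_least(block_sizes, eps):
    """Number of ancestral species holding at least a fraction eps
    of the n gene lineages."""
    n = sum(block_sizes)
    threshold = math.ceil(eps * n)
    return sum(1 for size in block_sizes if size >= threshold)


if __name__ == "__main__":
    blocks = [5, 3, 1, 1]          # n = 10 lineages in 4 ancestral species
    print(frequency_spectrum(blocks))
    print(number_of_species(blocks))
    print(number_of_species_at_least(blocks, 0.25))
```

In this example two ancestral species each carry a single lineage, so they count toward S but not toward the number of species holding at least a quarter of the genome.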
The transition rates can be specified as follows in terms of the configuration of gene lineages into blocks (i.e., ancestral species). For each pair of blocks containing (j, k) lineages, intragenomic connection occurs at rate a·j·k and results in the configuration (j - 1, k + 1). For each block containing j lineages, disconnection occurs at rate d·j and results in the block losing one lineage; simultaneously a new block containing 1 single lineage is created. These are exactly the same rates as in the well-known Moran model with mutation under the infinite-allele model [58], replacing ‘block’ with ‘allele’, ‘connection’ with ‘resampling’ (simultaneous birth from one of the j carriers of a given allele and death of one of the k carriers of another given allele) and ‘disconnection’ with ‘mutation’ (mutation appearing in one of the j carriers of a given allele into a new allele never existing before). For this Moran model,
the total population size is n;
at rate a for each oriented pair of individuals independently, the first individual of the pair gives birth to a copy of herself and the second individual of the pair is simultaneously killed;
mutation occurs at rate d independently in each individual lineage.
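As a consequence, A(t) has the same distribution as the allele frequency spectrum in the Moran model with total population size n, resampling rate a and mutation rate d, starting at time t = 0 from a population of clonal individuals (one single block). In particular, the distribution of A(∞) is the stationary distribution of the allele frequency spectrum, which is known to be given by Ewens’ sampling formula with scaled mutation rate θ = d/a [14, 17, 18]. Expectations of this distribution are:
$$E[A_k(\infty)] = \frac{\theta}{k}\,\prod_{i=0}^{k-1}\frac{n-i}{\theta+n-1-i}, \qquad k = 1, \dots, n,$$
so that
$$E[S(\infty)] = \sum_{k=0}^{n-1}\frac{\theta}{\theta+k} \quad\text{and}\quad E[S_\varepsilon(\infty)] = \sum_{k \ge \lceil \varepsilon n \rceil} E[A_k(\infty)].$$
In particular, as n → ∞,
$$E[S(\infty)] \sim \theta \ln n \quad\text{and}\quad E[S_\varepsilon(\infty)] \to \theta \int_\varepsilon^1 x^{-1}(1-x)^{\theta-1}\,dx.$$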
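The stationary expectations under Ewens’ sampling formula can be evaluated numerically. The Python sketch below uses the standard expressions for the expected allele frequency spectrum with scaled mutation rate θ = d/a; the function names are hypothetical, and the code is meant as a check on the formulas rather than as the authors' implementation.

```python
import math


def expected_spectrum(n, theta):
    """E[A_k] for k = 1..n under Ewens' sampling formula:
    (theta / k) * prod_{i=0}^{k-1} (n - i) / (theta + n - 1 - i)."""
    expectations = []
    for k in range(1, n + 1):
        product = 1.0
        for i in range(k):
            product *= (n - i) / (theta + n - 1 - i)
        expectations.append(theta / k * product)
    return expectations


def expected_species(n, theta):
    """E[S] = sum_{k=0}^{n-1} theta / (theta + k)."""
    return sum(theta / (theta + k) for k in range(n))


def expected_species_at_least(n, theta, eps):
    """Expected number of ancestral species holding at least a
    fraction eps of the n gene lineages."""
    threshold = math.ceil(eps * n)
    return sum(expected_spectrum(n, theta)[threshold - 1:])


if __name__ == "__main__":
    n, theta = 100, 1.5            # theta = d / a, illustrative values
    print(expected_species(n, theta))
    print(expected_species_at_least(n, theta, 0.1))
```

Summing the expected spectrum recovers the expected number of ancestral species, and weighting it by k recovers n, which gives two quick consistency checks on the formulas.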
At stationarity, and particularly for large values of , the mean number of ancestral species S(∞) obtained from simulations was equal to the mathematical prediction (figure 3A). In particular, the mean number of ancestral species at stationarity increases with the relative amount of gene flow d/a.
An additional key feature of this model is sampling consistency. In words, the history of a sample of k genes taken from a genome of n genes does not depend on n. This property can again be deduced from the representation of our model in terms of the better known Moran model. Indeed, the dynamics of a sample of k individuals in the Moran model does not depend on the population size, as can be seen from the so-called lookdown construction [15]. The simulations performed with k genes randomly sampled from each genome of n genes, are in agreement with this claim of sampling consistency: the number of ancestral species at stationarity E(S(∞)) is independent of the number of genes n (figure 3B).
A sample of several genomes
Using simulations, we evaluated the GD model for several sampled genomes (N > 1) under several combinations of parameters. As expected, gene tree diversity, measured by BHV distances, increased with d/a, i.e. the relative amount of gene flow, and with the number of species N. Conversely, our results showed that the number of genes n had no effect on distances (figure 4A). This last result, the lack of influence of n on gene tree diversity, is of particular interest, because one usually has access to only a fraction of a genome. It shows that regardless of the number of genes sampled, the resulting gene tree diversity will remain the same as long as gene trees have been shaped by processes with similar parameter values.
Our results also showed that as the intergenomic connection rate b decreases, and for the same d/a, gene trees were more similar (lower BHV distances) (figure 4B). When a long period of time elapses between two intergenomic connection events (low b), all the genes belonging to the two genomes that have started to coalesce have enough time to converge toward the same species, and thus coalesce before the next intergenomic connection event, in spite of gene flow.
GD versus MSC: ignoring gene flow may lead to mistaken phylogenetic inferences
When evaluating the ability of the MSC model to deal with gene flow, we found strong support (posterior probabilities > 0.90) for all the nodes of the Bayesian species tree even though the individual gene trees of the GD model did not corroborate this topology (figure 5). For example, 7 out of 10 gene trees simulated under the GD model support the connection between species E and species C and D, and only 3 support the direct relationship between species E and F. By contrast, the Bayesian tree strongly supports the clade (E,F) with a posterior probability equal to 1, and attributes all the connections between E and (C,D) to ancestral polymorphism (i.e., ILS). Moreover, because gene trees are constrained within the species tree (MSC model), the coalescences between genes of E and (C,D) must take place after the species tree coalescence; these coalescences are therefore timed around 7 My instead of 2 My according to the GD tree. Failing to recognize that gene flow may have shaped gene genealogies, hence DNA sequences, can result in important topological and dating errors.
Inferences from empirical data-sets support the GD model
To find the best set of parameters, we minimized the Kullback-Leibler (KL) divergence between the distributions of BHV pairwise distances of empirical and simulated trees (figure 6). Under the multi-species coalescent (MSC) model, the most likely set of parameters was µ = 0.4 × λ and λ = 0.2 (KL divergence = 0.23) for the bears, and µ = 0.45 × λ and λ = 0.22 for the finches (KL divergence = 0.12). We noted longer tailed distributions for the distances between trees modeled under the MSC model than for the empirical data-sets (figure 7). This skewed distribution obtained with the MSC model explains why we did not detect a sharp peak in the optimization landscape for the MSC model (figure 6).
Under the genomic diversification (GD) model, the most likely set of parameters was b = 0.03 and (KL divergence = 0.14) for the bears, and b = 0.11 and for the finches (KL divergence = 0.01) (figure 6). Contrary to the MSC model, the distributions of the distances between trees modeled under the GD model, like those of the empirical trees, showed no long tail, or one to a lesser degree (figure 7), explaining why we could detect a sharp peak in the optimization landscape for the GD model (figure 6).
Comparing the parameters λ and µ to b and d/a is not straightforward as the two models, MSC and GD, are built under different assumptions. However, in both cases, the parameters influence the diversity among trees (shape of the distribution of BHV pairwise distances). A greater diversity among trees is expected with increasing λ and decreasing µ, and with increasing d/a and b, allowing us to explore the parameter landscape to find the setting that minimizes the distance between simulations and empirical data-sets for each model.
Given our results and the mathematical predictions, the time-averaged number Sε(∞) of ancestral species to the sampled genome containing at least 10% of the genome (ε = 0.1) when n → ∞ is 4.8 for the bear data-set and 3.4 for the finch data-set.
Discussion
Within species, gene flow allows the maintenance of species cohesion in the face of genetic differentiation [59, 76], preventing genetic isolation of populations and the subsequent emergence of reproductive barriers leading to speciation [10]. Among species, the existence of gene flow challenges the notion of a species genealogy as well as the current concepts of species. Indeed, if gene flow is as pervasive as recent empirical studies suggest [8, 11, 22, 36], the genealogical history of species should be represented as a phylogenetic network encompassing the mosaic of gene genealogies. Similarly, it seems very conservative to delineate species based on the widely used biological species concept (reproductive isolation) [53], or phylogenetic species concept (reciprocal monophyly) [62]. Because of the ubiquity of gene flow, which can persist for several millions of years after the lineages have started to diverge (i.e., onset of speciation) [4, 48], species should be rather defined by their capacity to coexist without fusion in spite of gene flow [49, 70].
The simplified view of diversification, consisting in representing lineages splitting instantaneously into divergent lineages with no interaction (gene exchange) after the split, has been preventing evolutionary biologists from fully apprehending diversification at the genomic level and from correctly interpreting discrepancies between gene histories. Indeed, conflicting gene trees make the interpretation of their evolutionary history difficult. However, we argue that phylogenetic incongruence among gene trees should not be considered as a nuisance, but rather as a meaningful biological signal revealing some features of the dynamics of genetic differentiation and of gene flow through time and across clades. Current phylogenetic methods rely on the assumption that gene trees are constrained within the species tree, and that gene flow occurs infrequently between species. For many data-sets such as sequence alignments of genomes sampled from young clades, such methods could lead to an evolutionary misinterpretation of gene trees, and in the worst case to species trees with high node support while the gene trees had very different evolutionary histories (see figure 5). These observations urge for a change of paradigm, where gene flow is fully part of the diversification model. To consider the ubiquity of gene flow across the Tree of Life described by many recent studies, we have developed a new framework focusing on gene genealogies and relaxing the constraints inherent to the MSC paradigm. This framework is materialized in a mathematical model that we named the genomic diversification (GD) model.
The GD model
Under the GD model, gene genealogies are governed by four parameters corresponding to four biological processes: coalescence (replication), intragenomic connection (genetic differentiation), intergenomic connection (reproductive isolation), and disconnection (introgression) (figure 2).
Intergenomic connection corresponds to finding the most recent common ancestor of the two species at the genomic level. The time spent between intergenomic connections depends crucially on the species sampled at the present, and in particular on the phylogenetic distance between them. Disconnection corresponds to the introgression of genetic material from one species into another, at a rate that scales with the intensity of gene flow. Intragenomic connection models genetic differentiation. The more slowly genes accumulate mutations and differentiate, the more time gene lineages can spend in different species. Hence, when genomes differentiate slowly, the rate of intragenomic connection is low.
Each of these parameters influences differently the resulting tree diversity, i.e. the distribution of the BHV distances among trees, that we used here as a summary statistic. Instead of focusing on the main phylogenetic signal alone as done by the current phylogenetic methods, the GD model makes use of the whole signal encompassed by all gene trees.
A higher amount of gene flow (disconnection) and a shorter time available to untangle gene genealogies before the connection of two other genomes (intergenomic connection) both increase the diversity among trees. Conversely, when homologous genes coalesce faster (coalescence) and genes converge faster toward the species harboring the other genes of their genome (intragenomic connection), a lower diversity among trees is expected.
After evaluating this model under various sets of parameters, we applied it to analyze two empirical multi-locus data-sets for which gene tree conflicts have obscured the evolutionary history.
Gene flow among bears and among finches
Our results showed support for the hypothesis that gene flow has shaped the gene trees of bears and finches (figure 7). For the bear data-set, we found that each species had on average in the past about 4.8 ancestral species carrying at least 10% of its present genome (equation (2)). This result is in line with previous studies reporting gene flow between pairs of bear species [7, 31, 41, 45, 55]. Moreover, a recent phylogenomic study (869 Mb divided into 18,621 genome fragments) confirmed the existence of gene flow between sister species as well as between more phylogenetically distant species [40]. That study used the D-statistics (gene flow between sister species) and the DFOIL-statistics (gene flow among ancestral lineages [63]) to detect gene flow among the 6 bear species. Using their results, for each pair of species ij among the N species, we determined whether species j has contributed (gij = 1) or not (gij = 0) to the genome of species i (with gii = 1), and calculated the average number of ancestral species S as follows:
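From the definitions of gij and N just given, a natural reading of this average is the mean, over species, of the number of species that contributed to each genome. The expression below is a reconstruction under that assumption, not a formula quoted from the original:

```latex
S = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} g_{ij}
```

With gii = 1, each species counts itself among its own ancestral species, so S ranges from 1 (no gene flow detected) to N (every species contributed to every genome).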
We found on average 5.3 ancestral species for each of the Ursinae bears [40], close to the estimate obtained with the GD model (4.8).
We detected lower gene flow among finches than among bears. Each finch species had on average in the past 3.4 ancestral species (for the subsample of gene trees analyzed here), which is also consistent with the extensive evidence that many species hybridize on several islands [21, 27, 29, 30, 71]. Because of gene flow, a Bayesian population structure analysis detected very little genetic structure, with only 3 genetic populations among the 6 Geospiza species [19]. Two species, G. magnirostris and G. scandens, were each mostly characterized by a single genetic population, and therefore had about 1 ancestral species each. Conversely, the 4 other Geospiza species shared the same genetic population, suggesting 4 ancestral species for each of these 4 species. Taken together, these results roughly indicate that each of the 6 Geospiza species had on average 3 ancestral species, in line with the GD estimate (3.4).
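The rough figure of 3 ancestral species can be checked with a short calculation. The sketch below assumes that the average number of ancestral species is the mean row sum of a 0/1 contribution matrix g, with g[i][j] = 1 when species j contributed to the genome of species i and g[i][i] = 1. The function name and the grouping of species into indices 0 to 5 are hypothetical and only encode the population structure summarized above.

```python
def mean_ancestral_species(g):
    """Mean, over species, of the number of species contributing to each genome."""
    return sum(sum(row) for row in g) / len(g)


# Hypothetical encoding of the Geospiza summary: two species each form their
# own genetic population, and the four other species share a single one.
groups = [[0], [1], [2, 3, 4, 5]]
n_species = 6

# Species in the same genetic population are treated as contributing to
# one another's genomes (a simplifying assumption).
g = [[0] * n_species for _ in range(n_species)]
for group in groups:
    for i in group:
        for j in group:
            g[i][j] = 1

print(mean_ancestral_species(g))  # (2 * 1 + 4 * 4) / 6 = 3.0
```

The same function could be applied to a matrix built from pairwise gene flow tests, as was done for the bears.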
We showed here that strictly bifurcating lineage-based models do not adequately capture complex evolutionary patterns at the species level. On the contrary, a model relaxing species boundaries and accounting for gene flow, like the GD model, better reproduced the complex history of gene genealogies under continuous gene flow. Note that we considered a simple scenario with no ILS and statistically exchangeable genes resulting in a model with only three parameters, but given the simplicity and the flexibility of our model, many extensions may be considered to address scenarios that could not have been considered previously, opening up new perspectives in the study of speciation and macro-evolution.
Gene flow: an evolutionary force driving diversification
Species diversification requires genetic variation among organisms, introduced by mutations and structural variation, upon which natural selection and drift can act by influencing the sorting of offspring and the survival of organisms [70]. Recently, gene flow has also been mentioned as another potential source of genetic variation [52], particularly in the case of adaptive radiations [9, 43, 54, 74]. Hybrid zones act as filters, preventing the introgression of deleterious genes while allowing advantageous or neutral genes to cross species boundaries [52]. Newly acquired genes will then be a source of variation [52], by providing adaptive evolutionary shortcuts (beneficial genes) or greater adaptability once in the gene pool of the introgressed species (neutral markers) [52]. The introgressed species then has a wider range of potentially adaptive allelic variants, allowing it to diversify rapidly if the opportunity arises. Accordingly, substantial gene flow should be detected prior to an adaptive radiation. This hypothesis is supported by empirical evidence, but has only been tested under limited conditions [9, 43, 54, 74]. The model proposed here offers an opportunity to investigate more systematically how gene flow is distributed throughout phylogenies and how it can influence the frequency of adaptive radiations.
Evolutionary dynamics along the genome
Along the genome, gene flow is not expected to be uniformly distributed either. Incongruent gene trees should reveal genes that have evolved more slowly. Indeed, because of the genome cohesion force, genes evolving more slowly will be able to stay longer in different species. Conversely, congruent gene trees should reveal genomic regions not subject to gene flow, such as genomic regions under strong selective differentiation [32, 35]. This framework could thus be used to evaluate how gene flow varies along the genome and to explore the genomic architecture of species barriers. Indeed, some regions, such as sex chromosomes or genomic regions of low recombination, are expected to be more differentiated and hence to undergo less gene flow (e.g. Heliconius species [51]). In order to distinguish between genes and to reduce potential errors in parameter estimation, data may be grouped by gene class (statistical binning) using a method aiming to evaluate whether two genes are likely to have the same tree (linked sites) or the same tree in distribution (statistical exchangeability) [56].
Perspectives
Models and methods inferring macro-evolutionary history from phylogenetic trees, such as speciation and extinction rates, trait evolution, and ancestral character reconstruction, have become increasingly complex [60, 67, 80]. Yet, the raw material used by these methods is often reduced to the species tree, which can be viewed as a summary statistic of the information contained in the genome. We argue here that a valuable amount of additional signal, not accessible in phylogenetic trees, is contained in gene trees, and is directly informative about the diversification process. Indeed, because genetic differentiation and gene flow impact each gene differently, genes may have experienced very different evolutionary trajectories.
In order to make use of the entire information conveyed by gene trees, we propose here a new approach to tackle the diversification process, the genomic view of diversification, under which gene trees shape the species tree rather than the opposite. This approach aims at better depicting the intricate evolutionary history of species and genomes. We hope that this view of diversification will pave the way for future developments in the perspective of inferring diversification processes directly from genomes rather than from their summary into one single species tree. One of the challenges in this direction will be to propose finer inference methods than the crude one used here, based on a single summary statistic, the BHV distances.
Acknowledgments
The authors thank the Center for Interdisciplinary Research in Biology (Collège de France, CNRS) for funding. JM is funded by LabEx MemoLife, project Genomics of Diversification.