ABSTRACT
Mimulus guttatus and M. nasutus are an evolutionary and ecological model sister species pair differentiated by ecology, mating system, and partial reproductive isolation. Despite extensive research on this system, the history of divergence and differentiation in this sister pair is unclear. We present and analyze a novel population genomic data set which shows that M. nasutus “budded” off of a central Californian M. guttatus population within the last 200 to 500 thousand years. In this time, the M. nasutus genome has accrued numerous genomic signatures of the transition to predominant selfing. Despite clear biological differentiation, we document ongoing, bidirectional introgression. We observe a negative relationship between the recombination rate and divergence between M. nasutus and sympatric M. guttatus samples, suggesting that selection acts against M. nasutus ancestry in M. guttatus.
While speciation is often depicted as a simple event in which a single species splits into two, there is increasing evidence that this process is often more complex. In particular, speciation reflects a tension among divergence, the assortment of ancestral variation, and introgression that plays out across the geography and ecology of the incipient species. Our population genetic view of this process has been fundamentally limited by examining few loci, where stochasticity in ancestral processes can prevent strong inferences about isolation and gene flow. By contrast, whole genome resequencing (even of only a few individuals) makes use of both many genealogical histories and contiguous genomic blocks of ancestry to provide well-resolved views of population history, divergence and introgression (1–4). Here, we present a population genomic investigation of the speciation history of two closely related species of yellow monkeyflowers, the primarily outcrossing Mimulus guttatus, and the self-pollinating M. nasutus – an evolutionary model system for which the genetic and ecological basis of reproductive isolation is reasonably well characterized (5).
In flowering plants, speciation often involves a shift in pollinator (e.g., (6–8)) or mating system (e.g., (6, 9–11)), with concomitant divergence in key floral traits causing reproductive isolation between lineages. The evolutionary transition from outcrossing to self-fertilization, as occurred in M. nasutus, is of particular interest because the expected reduction in both the effective population size and effective recombination rate (12, 13) can dramatically alter population genetic processes, and patterns of genomic variation (14, 15). Recent evidence for elevated levels of putatively deleterious alleles in selfing taxa (16–18) is consistent with the idea that inbreeding reduces the effectiveness of purifying selection (due to a lowered effective population size). However, we still have few examples of the effects of self-fertilization on patterns of diversity across the genome, particularly in the context of recently diverged and potentially hybridizing species. Genomic datasets from young selfing species can uniquely inform the process of mating system divergence by allowing us to compare regions of the genome that share a common ancestor before or after the origin of self-fertilization and thus understand the assortment of ancestral variation (19).
The M. guttatus-M. nasutus species pair is an excellent model for investigating the causes and consequences of mating system evolution and species divergence. M. guttatus is primarily outcrossing (although the rate of outcrossing varies among populations, 20, 21, 22) with large, bee-pollinated flowers and occupies diverse ecological habitats throughout western North America. M. nasutus is highly selfing with reduced, mostly closed flowers and overlaps broadly with M. guttatus throughout their shared range. Though allopatric populations are more common, the two species often co-occur. In sympatry, species are partially reproductively isolated by differences in floral morphology, flowering phenology, and pollen-pistil interactions (23–25). Although early-generation hybrids occur in nature (23, 26), numerous intrinsic hybrid incompatibilities decrease hybrid fitness (27–29). Based on the most detailed population genetic analyses of Mimulus to date (two and six sequenced nuclear loci, respectively: 29, 30), M. nasutus exhibits reduced diversity compared to M. guttatus, and some M. guttatus sequences are nearly identical to M. nasutus, suggestive of historical introgression. However, this very limited view of the genome cannot resolve the timing and genomic consequences of divergence between Mimulus species, nor can it inform the extent or consequences of introgression between them.
We present the first population genomic analysis of M. guttatus and M. nasutus, spanning diverse ecotypes collected from throughout the species’ ranges. We use these dense and contiguous population genomic data to estimate the population-split time, quantify rapid loss of ancestral variation accompanying the transition to selfing in M. nasutus, and identify ongoing, bidirectional introgression. Additionally, we observe a negative correlation between the recombination rate and interspecific divergence between M. nasutus and sympatric M. guttatus, a result best explained by selection against introgression of M. nasutus ancestry. Our approach provides a detailed view of differentiation and introgression in a tractable ecological, genetic, and evolutionary model system.
Results
We analyze a novel population genomic dataset of nineteen lab and/or naturally inbred Mimulus samples – thirteen M. guttatus, five M. nasutus, and one from the outgroup, M. dentilobus. Collections spanned the ecological and geographic ranges of each species (Figure 1A and Table S1). Many of our analyses focus on four M. guttatus and four M. nasutus collections sequenced to high depth (13.8X–24.7X) and with identical read lengths (100 bp, paired-end reads). All comparisons are presented in supplementary tables S2 and S3. Of our focal M. guttatus samples, CACG and DPRG are narrowly sympatric with M. nasutus populations from the northern and southern portion of both species’ ranges, respectively. Our focal northern allopatric M. guttatus collection, AHQT, is well outside the geographic range of M. nasutus. By contrast, the southern allopatric collection (SLP) is geographically close to M. nasutus populations. Our focal M. nasutus samples also include sympatric and allopatric samples from the north and south (Table S1).
Speciation history
Overall patterns of genomic differentiation show deep population structure in M. guttatus, with M. nasutus diverging from a central Californian M. guttatus population approximately 200 kya.
A rate-smoothed neighbor-joining (nj) tree clearly displays a deep phylogenetic split within M. guttatus, roughly corresponding to northern and southern parts of its range, and the placement of all M. nasutus samples as a node within the southern M. guttatus cluster (Figure 1B). Note, however, that geography is not a clean predictor of genetic structure within M. guttatus (e.g., DUN is from a northern latitude yet clusters with our southern M. guttatus samples). The fact that M. guttatus is paraphyletic (i.e., is split by M. nasutus), suggests that M. nasutus budded from within a structured ancestral M. guttatus population. A principle component analysis (PCA, Figure 1C) also reveals the genetic structure within M. guttatus as PC2 differentiates northern and southern M. guttatus groups identified in the nj tree. Consistent with the single origin of M. nasutus, PC1 separates M. guttatus from the strongly clustered M. nasutus, presumably as a consequence of a shared history of genetic drift among these M. nasutus samples,
We generate a quantitative description of the strong genetic structure within M. guttatus, focusing on our high-coverage (focal) samples. Pairwise sequence diversity at synonymous sites within northern (πS AHQT x CACG = 3.97% [3.89%–4.06%]) and southern (πS DPR x SLP = 4.45% [4.39%–4.52%]) samples is significantly lower than that within M. guttatus overall (πS = 4.91% [4.85%–4.96%]), or between north and south (πS= 5.26% [5.20%–5.30%], Figure 1D). Diversity within the northern and southern clades is consistent with a very large effective population size (Ne) of approximately one and a half million chromosomes for both groups (assuming the per generation mutation rate, μ = 1.5 * 10−8 [following Koch et al. 2008], 2 Ne = 1.5 × 106, and assuming an annual life-history this is also the per year mutation rate). One simple way to estimate a population split time (τ generations), assuming no introgression, is to assume that the divergence between populations is the sum of pairwise diversity (π) within an ancestral population and the product of the per-generation mutation rate, μ, and two times the split time (31). Using this relationship, and representing ancestral diversity by the southern M. guttatus samples, we set τ = (πS NorthGut x SouthGut – πS SouthGut) / 2μ and estimate a split between northern and southern Mimulus populations more than a quarter of a million years ago (265 ky [251 ky – 280 ky]).
Interspecific divergence between M. guttatus and M. nasutus (dS = 4.94% [4.88%–5.00%]) is comparable to overall M. guttatus diversity, and exceeds diversity within northern or southern M. guttatus collections (Figure 1D). We derive a simple estimate of split time between M. guttatus and M. nasutus, under a model assuming no introgression, as we did above to estimate the split between focal northern and southern M. guttatus samples. Using the difference between divergence of M. nasutus to the southern, allopatric M. guttatus sample (to minimize the influence of recent introgression and historical divergence between M. guttatus’ genetic clusters) and a proxy for diversity in an ancestral population (southern M. guttatus), we estimate that M. nasutus and M. guttatus split approximately 200 ky ago (τ = (πSNas x AlloSouthGut – πS SouthGut) / 2μ = 196 ky [181 ky – 212 ky]).
As a complementary inference of historical patterns of divergence within M. guttatus and between species, we applied Li and Durbin's implementation of the pairwise sequentially Markovian coalescent (PSMC) (32) to pairwise combinations of focal haploid genomes (Figures 1E and Figure S1–S5). Since we sample from a structured population, the inferred large recent population sizes likely represent reduced coalescent rates caused by population structure, rather than dramatic recent increases in Ne. Thus, the PSMC inference of larger recent population sizes between northern and southern M. guttatus (SLP x AHQT) compared to within these groups (SLP x DPRG and CACG x AHQT, Figures 1E and S1) likely reflects the strong genetic structure within M. guttatus. Similarly, the large recent population sizes inferred within both northern and southern samples likely reflect substructure within northern and southern regions.
We also use this PSMC analysis to roughly estimate a split date, by assessing when the inferred coalescent rate between species decreases (i.e., the population size estimate increases) relative to the rate within M. guttatus (see 32). In doing so, we focus on the southern M. guttatus samples that fall closest to M. nasutus in our nj tree. The inferred coalescent rate between M. nasutus and southern M. guttatus (SLP x KOOT, gray line) decreases relative to the rate within southern M. guttatus (SLP x DPRG, dark blue/navy line), i.e., the lines diverge, from ∼500 to ∼200 kya, suggesting either a gradual split between species over that time span, or a hard split sometime within that range (Figures 1E and S2). This result is qualitatively similar to our estimate based on synonymous nucleotide variation among these samples.
Genomic consequences of the transition to selfing
We find that patterns of genomic variation within M. nasutus reflect the genomic consequences of a recent transition to selfing. Synonymous diversity within M. nasutus (πS = 1.09% [1.03%–1.14%], Figure 1D) is one fifth that observed within M. guttatus, consistent with a high rate of genetic drift since M. nasutus’ origin. Moreover, most ancestral variation in M. nasutus has been homogenized: of the fixed differences between M. nasutus and M. guttatus, 90% are derived in M. nasutus and 10% are derived in M. guttatus (when polarizing by M. dentilobus). Although M. nasutus has lost much of its ancestral variation, shared variants still constitute a much higher proportion of its polymorphism (50%) relative to an equally sized sample of M. guttatus (10%). This suggests M. nasutus’ genetic diversity disproportionately comes from genomic regions retaining ancestral variation.
We find an excess of high-frequency derived synonymous mutations in M. nasutus (Figure S6), suggesting that its population contracted recently. By contrast, we observe slightly more rare synonymous alleles than expected under demographic and selective equilibrium in M. guttatus, likely reflecting the structure and/or history of introgression in these samples, or weak selection against unpreferred codons.
The distribution of synonymous diversity in 5 kb windows across the genome (Figure 2A) bolsters the view that M. nasutus’ genomic diversity is a mixture of closely related genomic regions that rapidly coalesce in the small M. nasutus population, and distantly related regions that do not coalesce until joining a large M. guttatus-like ancestral population. In pairwise comparisons of sequence diversity within M. nasutus, half of the genomic windows are differentiated by πS < 0.5% (corresponding to ∼170 thousand years of divergence), reflecting recent common ancestry since the species split. On the other hand, one third of such windows are differentiated by πS > 2.0%, reflecting deep ancestry in a large ancestral population (Figure 2A).
These findings contrast sharply with comparisons within M. guttatus, as well as between M. nasutus and allopatric M. guttatus samples, for which recent common ancestry since the species split is rare (πS < 0.5% for less than 1.5% of 5kb windows) and deep coalescence is the norm (mode πS = 4%, Figure 2A). Under the neutral coalescent, a pair of lineages will fail to find a common ancestor with each other by generation τ with probability , where is the (constant) effective number of chromosomes. Therefore, the observation that half of our windows share a common ancestor in the past ∼170 ky, by an admittedly crude calculation, predicts a population size between 150k and 250k effective chromosomes (compared to estimated of 1.25 in M. guttatus from synonymous diversity, above). This ten-fold reduction in effective population size as compared to M. guttatus far exceeds both the twofold decrease in Ne expected to accompany the evolution of selfing and the fourfold decrease calculated by the difference in intraspecific variation.
Likewise, our PSMC results strongly support a history of extensive recent shared ancestry and the incomplete sorting of ancestral diversity in M. nasutus. We infer a dramatic decline in M. nasutus’ effective population size after it split from M. guttatus (compare red and black-gray lines Figure 1E, see also Figure S3), suggesting that the evolution of selfing roughly coincided with M. nasutus’ split from M. guttatus. We caution, however that interpretation of PSMC's estimated population size in M. nasutus is not straightforward. This is because the transition to selfing reduces the population recombination rate more than the population mutation rate (13); however, Li and Durbin's (32) implementation of the PSMC assumes that both these values change proportionally with the historical effective population size.
Across the genome, the mosaic nature of ancestry within M. nasutus is apparent as long contiguous regions of recent common ancestry (colored windows in Figures 2B and S7) interrupted by regions of deep ancestry, due to incomplete lineage sorting and/or historical introgression (white windows in Figure 2B and S7). This block-like ancestry structure results in extensive linkage disequilibrium (LD) in M. nasutus. In contrast to M. guttatus, for which the sample pairwise LD drops halfway towards its minimum values within only 15–20 base-pairs, LD in M. nasutus decays much more slowly, not dropping halfway towards its minimum values until 22 kb (Figure 2C). This represents a thousand-fold difference in the decay of LD, as compared to a more modest ten-fold reduction in the effective population size between M. nasutus and M. guttatus. This dramatic difference in the scale of LD between Mimulus species is likely due to a major reduction in the effective recombination rate within the selfing M. nasutus. Following Nordborg ((13) Eq 1), we use the comparison of the population scaled recombination and mutation rate to estimate an effective selfing rate of 99% in M. nasutus.
Patterns of sequence variation suggest a reduced efficacy of purifying selection in M. nasutus, a result consistent with extensive genetic drift and/or linked selection within M. nasutus. All M. nasutus samples contain more premature stop codons than any M. guttatus sample (M. nasutus: mean 124, range = 121–126, M. guttatus: mean 95.5, range = 86–102), and a large proportion of these premature stops are at high frequency in M. nasutus (Figure 2D). For 27 of the 29 fixed differences for a premature stop codon, M. nasutus carries the premature stop and M. guttatus carries the intact allele. Additionally, after standardizing by synonymous variation, we observe an excess of putatively deleterious, non-synonymous variation in M. nasutus relative to M. guttatus πN/πS = 0.197 [0.192–0.203] and 0.157 [0.155–0.160], respectively). However, this difference is not yet reflected in divergence between the species (dN/dS = 0.156 [0.154–0.159]), presumably because interspecific sequence differences largely reflect variation that predates the origin of selfing in M. nasutus rather than the relatively few mutations accrued within the past ∼170 ky. This pattern of elevated πN/πS in selfing species but only modest dN/dS between selfers and their close relatives is common (14), even in genome-wide analyses (e.g., (19)).
Ongoing gene flow and its consequences
Somewhat surprisingly, given the placement of the sympatric, northern M. guttatus sample (CACG) far from M. nasutus in our neighbor-joining tree and PCA analyses (Figures 1B and 1C, respectively), CACG has the lowest level of nucleotide divergence to M. nasutus (Table S2). To test whether these seemingly contradictory observations are due to introgression, we used Treemix, an unsupervised method to construct a phylogeny featuring admixture events (33). Treemix finds a clear signal of admixture from M. nasutus into the population ancestral to CACG (Figure 3A). This result holds across a range of different sample subsets (Figure S8). Some Treemix analyses also suggest introgression from M. nasutus into the population ancestral to our southern, sympatric M. guttatus (DPRG); however, the manifestation of this second signal varies across sample subsets (Figure S8, see supplemental results).
Further quantitative evidence for introgression in the CACG sample comes from the bimodal distribution of divergence between M. nasutus and CACG in 5kb windows (dark purple lines in Figure 3B). Indeed, for most genomic windows, CACG shows deep divergence with M. nasutus (half of windows have more than 5% sequence divergence), but approximately one tenth of windows show nearly no divergence (πS < 0.5%). We find a similar, but subtler, bimodal distribution of divergence between Mimulus species featuring the sympatric southern M. guttatus sample (2.5% of windows comparing DPRG and M. nasutus exhibit less than 0.5% sequence divergence – dark blue lines in Figure 3B). We do not observe this pattern in the allopatric samples in which fewer than 0.5% of windows are less than 0.5% diverged from M. nasutus (light lines in Figure 3B and S7). The PSMC analysis also reflects ongoing introgression in sympatry, as it highlights recent common ancestry between sympatric M. guttatus samples (CACG and DPRG) and M. nasutus (Figures S4 and S5). Moreover, genomic regions of low interspecific divergence are spatially clustered (Figures 3C and S7), consistent with recent and ongoing introgression that is slowly broken down over generations of recombination.
We utilize this spatial distribution of windows of low interspecific divergence to probabilistically infer regions of recent M. nasutus or M. guttatus ancestry across the genomes of our four focal M. guttatus samples (Figure 3C) using a Hidden Markov Model (‘HMM’ see METHODS). From this HMM, we estimate recent M. nasutus ancestry for 15.1%, 5.7%, 1.1% and 0.6% of sympatric northern (CACG), sympatric southern (DPRG), allopatric southern (SLP), and allopatric northern (AHQT) M. guttatus genomes, respectively. The non-zero admixture proportions inferred in allopatric samples likely reflect low levels of admixture into allopatric M. guttatus and/or mis-assigned regions of incomplete lineage sorting.
To learn about the timing of admixture, we find contiguous regions of individual M. guttatus genomes with a greater than 95% posterior probability of M. nasutus ancestry and display the length distribution of these blocks in Figure 3D. The mean admixture block lengths are 132 kb (∼0.74 cM, n = 227 blocks) and 18.6 kb (∼0.10 cM, n = 350 blocks) for northern (CACG) and southern (DRPG) sympatric M. guttatus samples, respectively. Because our HMM occasionally breaks up what seem to be contiguous blocks of admixed ancestry, we instituted a range of strategies to fuse admixture blocks, none of which greatly influenced our block length distributions (see supplementary results and Figure S9). Under a single temporal pulse of gene flow, admixture block lengths are exponentially distributed with a mean length (in Morgans) that is the reciprocal of the number of generations. With this model, we estimate 135 and 962 generations since admixture for CACG and DPRG samples, respectively (Table S5). Because these estimates are much more recent than the split time of around 200 kya, this pattern cannot be explained by incomplete lineage sorting.
While this pulse model provides an intuitive summary of time back until an admixed region finds itself in an M. nasutus sample, our data are inconsistent with a model of a single admixture pulse. Specifically, the block length distribution of M. nasutus ancestry in M. guttatus samples is too variable to be consistent with a single admixture time, and so these estimates should be viewed as average times back to an admixture event. Furthermore, these blocks come from far enough back in the past that every single block is likely derived from a separate admixture event, rather than being derived from a single M. nasutus parent (Table S5, see also Methods and supplementary results for quantitative support for these points).
We also detected a signal of introgression from M. guttatus into M. nasutus, despite the numerous difficulties presented by the short scale of LD in M. guttatus and the similarity between interspecific divergence and M. guttatus diversity. We overcame this challenge by reasoning that admixture from M. guttatus into M. nasutus would result in long genomic regions that are more closely related to nearby M. guttatus samples than to other M. nasutus samples. Therefore, we focus on long contiguous regions (> 20 kb) for which one ‘outlier’ M. nasutus sample does not coalesce with any other M. nasutus samples until before the species split (as inferred by pairwise πS > 1%, alternative thresholds explored in the supplement), and compare synonymous divergence of outlier individuals and non-outliers in non-overlapping 20 kb windows to northern and southern allopatric M. guttatus samples. By excluding sympatric M. guttatus samples, we avoid conflating introgression into M. nasutus with the extensive signature of introgression from the opposite direction (i.e., M. nasutus into M. guttatus).
Pooling across northern M. nasutus samples (NHN, Koot, CACN), outlier windows are more often genetically closer to the northern M. guttatus sample (AHQT) than are the non-outlier windows (272 of 490, one sided binomial test against the null expectation of 50% P = 0.008). However, this result is individually significant only for our most geographically remote M. nasutus sample, NHN (For 141 of 289 outlier windows NHN are closer to northern M. guttatus than are the non-outlier M. nasutus). Additionally, the southern M. nasutus sample, DPRN had the smallest proportion of outlier windows resembling northern M. guttatus. By contrast, no samples have a disproportionate share of outliers genetically close to southern M. guttatus, SLP (Table S6), consistent with little or no introgression into DPRN (note, however, that this comparison is likely underpowered because of the genetic proximity of southern M. guttatus to the population that founded M. guttatus). In Table S6 and in the supplementary results, we show that there is significant support for introgression from M. guttatus into M. nasutus over a broad range of parameter choices for designating outlier windows – while no samples are genetically closer to southern M. guttatus in outlier windows more often than expected by chance, NHN individually and/or pooled northern M. nasutus samples are more often like northern M. guttatus than expected under incomplete lineage sorting.
Divergence between M. nasutus and sympatric M. guttatus samples decreases with increasing local recombination rate
We observed a genome-wide negative relationship between absolute synonymous divergence (i.e., the mean number of pairwise sequence differences at synonymous sites) and the local recombination rate in both sympatric M. guttatus samples (DPRG x Nas: Spearman's ρ = −0.080, P = 0.0008, CACG x Nas: Spearman's ρ = −0.0718, P = 0.0027). These sympatric M. guttatus samples display approximately a 5% reduction in synonymous nucleotide divergence in genomic regions with greater than average recombination rates. This signal is substantially weaker in the allopatric southern M. guttatus sample (SLP x Nas: Spearman's ρ = −0.0521, P = 0.0297), which is nestled within the range of M. nasutus, and altogether absent in comparisons with the allopatric northern M. guttatus sample, which occurs well outside of M. nasutus’ range (AHQT x Nas: Spearman's ρ = −0.0261, P = 0.2768). This result holds after accounting for the potential confounding effect of sequencing depth and the influence of synonymous divergence to M. dentilobus (a proxy for mutation rate variation, see Methods, Supplementary results and Table S7).
This negative relationship between divergence and the recombination rate in sympatry is consistent with selection against M. nasutus ancestry reducing effective gene flow at linked sites (see (34), for related theory and tentative evidence to date), and inconsistent with alternative scenarios. Specifically, neutral processes cannot generate this correlation because mean neutral coalescent times are independent of recombination rates (35). Selective sweeps and background selection acting simply on M. guttatus ancestry would not influence expected substitution rates at linked neutral sites (36), and so while they may influence relative measures of divergence (34, 37, 38), they will not influence absolute divergence. Moreover, since we observe no relationship between diversity and the recombination rate within M. guttatus (Spearman's ρ = – 0.0275 P = 0.244) or M. nasutus (Spearman's ρ = 0.0291 P = 0.218), it seems unlikely that linked selection processes within species (background selection and selective sweeps) could explain this result.
Finally, we note that genomic regions with the lowest recombination rates have the highest densities of centromeric and TE-like repeats, and in contrast, high-recombination regions have the highest local gene-densities per physical distance (data not shown). However, there is no obvious link between this observation and the consistent negative correlation between divergence and recombination between M. nasutus and sympatric, but not allopatric, M. guttatus.
Discussion
Speciation history and the origin of M. nasutus
Genetically, M. nasutus clusters with central Californian M. guttatus samples, suggesting that speciation post-dated the differentiation of some M. guttatus populations. Thus, speciation in this pair is best described as a ‘budding’ of M. nasutus from M. guttatus, rather than a split of an ancestral species into two. We observe an approximate coincidence between the timing of divergence and the decline in population size in M. nasutus (as inferred from our PSMC analysis), likely as a result of the transition to selfing being linked to speciation (see (10) for phylogenetic evidence of this link in the Solanaceae and (39, 40) for a likely case in Capsella). Future genomic analyses across the M. guttatus complex and other species groups will facilitate an in-depth view of the causes and consequences of speciation by the budding of selfing and/or endemic populations off of a widespread parental species, and the commonness of this mode of speciation.
We estimate that M. nasutus split from a M. guttatus population within the last two hundred to five hundred thousand years (with our estimate of ∼200 ky, inferred from differences in synonymous sequence differences within and between species, and the estimate of 500 ky corresponding to conservative estimates of population splits from the PSMC). This lies between the ∼50 ky separating selfing Capsella rubella from outcrossing C. grandiflora (19, 41) and Arabidopsis thaliana which has potentially been selfing for over a million years ((42), having split from A. lyrata ∼3–9 Mya (43)). Although 200 ky represents a relatively short time evolutionarily, it implies that M. nasutus managed to survive numerous dramatic bioclimatic fluctuations.
The transition to selfing and its genomic consequences
The transition from outcrossing to self-fertilization in M. nasutus has had clear consequences on patterns of genomic variation. In M. nasutus, linkage disequilibrium exceeds that in M. guttatus by three orders of magnitude. This result suggests a high selfing rate in M. nasutus (estimated at 99%, above), consistent with direct estimates from field studies (23). We observe a four-fold drop in diversity and infer a ten-fold reduction in effective population size in M. nasutus compared to M. guttatus, values far exceeding the two-fold decrease in Ne expected as a direct consequence of selfing (44, 45). This more than two-fold reduction in Ne of selfing populations relative to their outcrossing relatives has been identified in other species pairs (16, 41), and may be partially due to extreme founding bottlenecks and/or frequent colonization events and demographic stochasticity that further increase the rate of genetic drift (12). Additionally, due to the lower effective recombination rate in selfing species (13), the effect of linked selection is magnified (46–48), further reducing diversity.
Selfing populations are expected to experience a reduced effectiveness of purifying selection accompanying the drop in effective population size and recombination rates (14, 48, 49). Consistent with these predictions, M. nasutus has accumulated numerous putatively deleterious mutations, including nonsynonymous variants and premature stop codons. Presumably, this elevation in radical genetic variants reflects a reduction in the efficacy of purifying selection due to a high rate of genetic drift and linked selection, as well as perhaps the escape of some genes (e.g., loci involved in pollinator attraction) from the selective constraints they faced in an outcrossing population (e.g., (16)).
Selfing as a reproductive barrier and its significance for ongoing gene flow
Despite multiple reproductive isolating barriers, including mating system differences we find ongoing, bidirectional introgression between M. guttatus and M. nasutus.
Evidence of ongoing introgression from the selfer, M. nasutus, into the outcrosser, M. guttatus, is particularly stark. There are numerous evolutionary implications of introgression from selfers to outcrossers. Introgression of deleterious mutations accumulated in selfers may introduce a genetic load to outcrossers. This burden would result in selection against genetic material from selfers in hybridizing outcrossing populations, and could ultimately favor reinforcement of reproductive isolation. Alternatively, such introgression could provide a multi-locus suite of variation facilitating self-fertilization, and other correlated traits (e.g., drought resistance and rapid development), in favorable environments, as appears to be the case in introgression between wild and domestic beets (Beta vulgaris, (50)).
Evidence of introgression from M. guttatus into M. nasutus is subtler, but is potentially critically important. Even relatively low levels of introgression into a selfer may rescue the population from a build up of deleterious alleles, and reintroduce adaptive variation, and so may lower its chances of extinction, a fate considered likely for most selfing lineages (51, 52). However, before potentially rescuing a selfing population from extinction, genomic regions introduced from outcrossing species must themselves survive a purging of deleterious recessive alleles.
Higher rates of introgression from M. nasutus to M. guttatus would be consistent with the prediction that backcrosses should be asymmetric – because bees preferentially visit plants with larger flowers (53, 54) and/or larger floral displays (55, 56), both features of M. guttatus, visits to M. nasutus and F1 hybrids are likely preceded and followed by visits to M. guttatus (23, 29). Consistent with this prediction, direct estimates of hybridization in the DPR sympatric population reveal that F1 hybrids are the product of M. nasutus maternal and M. guttatus paternal parents, respectively (23). However, we caution that it is considerably more challenging to identify introgression into M. nasutus than into M. guttatus, as the similarity between interspecific divergence and diversity in M. guttatus makes historical admixture difficult to separate from the incomplete sorting of M. nasutus’ ancestral variation. We further note that, although asymmetrical introgression from selfers to outcrossers has been detected in other systems (Pitcairnia (57), and potentially in Geum (58, 59)), the relative contribution of selfing vs. other isolating barriers and/or selection is unclear. Dense sampling of sympatric and allopatric populations of outcrossing species experiencing ongoing gene flow with selfing relatives will allow for tests of these hypotheses. Importantly, the number, location and length-distribution of admixture blocks identified from genomic analyses provide information about the longer-term consequences and pace of introgression between selfers and outcrossers.
Selection against hybrids and implications for species maintenance
The numerous short blocks (in addition to long blocks) of M. nasutus ancestry observed in M. guttatus suggest that M. nasutus ancestry can potentially persist in an M. guttatus background for many generations. Despite this, M. guttatus and M. nasutus are still ecologically and genetically distinct. While this differentiation could arise via a balance of genetic drift and gene flow, there is a possible role for selection in the maintenance of species boundaries.
We identified a genome-wide signature of selection against introgression of M. nasutus ancestry in M. guttatus, in the form of a negative relationship between the local recombination rate and absolute divergence. This relationship was highly significant in both sympatric comparisons, but only weakly significant in parapatry, and insignificant in allopatry. Additionally, we did not find a relationship between recombination and diversity within either species. Moreover, unlike a negative relationship between the recombination rate and relative measures of differentiation, such as Fst or the number of fixed differences (e.g., 60, 61, 62), this finding cannot also be explained by a high rate of hitchhiking or background selection within populations since the species split (34, 37, 38). Instead, it seems more consistent with M. nasutus ancestry being selected against more strongly in regions of low recombination due to linkage with maladaptive alleles that introgression would introduce (suggesting that the genome has congealed in low recombination regions, (63, 64)). Previously this result had been hinted at by higher absolute divergence near the breakpoints of inversions (e.g., (65, 66)), or in centromeres relative to telomeres (67), but to our knowledge this is the first time this basic prediction has been demonstrated genome-wide.
Further work, including experiments measuring selection on genetic variants in the wild, and larger sample sizes from both allopatric and sympatric populations, is needed to pinpoint which (if any) genomic regions are particularly strongly selected against in hybrids. Genetically mapped loci for adaptive interspecific differences (68) and hybrid inviability and sterility (28) are promising candidates.
Materials and Methods
Mimulus sampling and whole genome sequencing
We utilized a combination of existing (downloaded from the NCBI SRA, sequenced by 69) and newly generated whole genome sequence data from 19 different lab and/or naturally inbred Mimulus accessions, including 13 M. guttatus, 5 M. nasutus, and 1 M. dentilobus individual as an outgroup. Samples varied in their geography and life history. Mean sequencing depths range from 2X to 25X, and read lengths include 36, 76, and 100 base pair paired end reads. We present SRA accession numbers as well as depth, read length and additional sample information in Table S1, and note that we obtained the DPRG sequence data directly from the U.S. Department of Energy Joint Genome Institute. Our analysis included newly generated whole genome sequences from five lines (CACG, CACN, DPRN, NHN, and KOOT), and we present details of sequence generation in the supplementary methods.
Genome sequence alignment, SNP identification and annotation
We aligned paired end reads to the M. guttatus v2.0 reference genome (69) using Burrows-Wheeler Aligner (bwa (70)) with a minimum alignment quality threshold of Q29 (filtering done using SAMtools (71)). Alignment-processing details can be found in the supplementary methods. We produced a high quality set of invariant sites and SNPs simultaneously for all lines using the GATK Unified Genotyper, with a site quality threshold of Q40 (72, 73). For all analyses described below, we exclusively used genotype calls from reference scaffolds 1–14, corresponding to the 14 chromosomes in the Mimulus genome. For all analyses (except PSMC, which requires a consistently high density of data, see below), we set also set a strict minimum depth cutoff of 10 reads per site (unless otherwise noted), removed triallelic sites, and censured genotypes at sites where individual depth was two standard deviations away from mean depth. To assign genotypes at heterozygous sites, we randomly selected one of two alternate alleles. Such heterozygous sites are not concentrated in long genomic regions and account for approximately 1% and 2% of synonymous SNPs in average focal M. nasutus and M. guttatus samples, respectively.
Following the filtering steps described above, all remaining genic genotype calls were classified as zero, two, three, or fourfold degenerate using the Mimulus guttatus v2.0 gene annotations provided by phytozome (69).
Data analysis
NJ Tree and PCA
We first analyzed broad patterns of genomic differentiation captured by principle component analysis and a neighbor-joining (nj) tree for all M. guttatus and M. nasutus samples, rooting our tree by the outgroup, M. dentilobus. For both these analyses (and the Treemix analyses) we downsampled to 1000 synonymous SNPs per chromosome with more than one copy of the minor allele. In this downsampling we insisted that no focal samples have missing genotypic data and then prioritized by the number of individuals with sequence data (see supplementary methods). After this downsampling, all individuals had genotypic data at more than 60% of SNPS, and 97% of these SNPs had genotypic data for at least 16 of our 19 individuals. We make the nj tree with the nj function in the R program (74), ape (75), and root it by the outgroup, M. dentilobus, We construct our PCA with customized R scripts accounting for missing genotype data described in the supplementary methods.
Nucleotide diversity and divergence
Preliminary analyses of nucleotide variation showed a strong influence of sequence features such as read depth and length (Table S3). In general, samples with low depth and short reads are less different from all other sample than are samples with high depth and long reads (Tables S3 and S4), a likely consequence of difficulties in aligning short reads when they differ substantially from the reference. We therefore focus all following analyses on eight samples with high and consistent mean read depths (13X–24X) and read lengths (all 100 bp paired end reads).
Divergence and diversity
We quantified patterns of synonymous and nonsynonymous sequence variation at fourfold and zerofold degenerate sites, respectively. For each pairwise comparison, we count the number of pairwise sequence differences and number of sites for which both samples have data above our quality and depth thresholds. To generate confidence intervals for our point estimates of diversity and divergence that acknowledges the non-independence of sequence variants due to linkage disequilibrium, we resample 100 kb windows with replacement.
Allele Frequency Spectrum (AFS)
To polarize the AFS we examined all sites passing our depth and quality thresholds for M. dentilobus as well as all focal samples. For sites polymorphic in of focal samples, we labeled the M. dentilobus allele as ancestral.
Premature stop codon identification
We searched for premature stop codons in each Mimulus accession using the Mimulus guttatus v2.0 gene annotations. We defined a mutant stop codon to be premature only if all three nucleotide sites were available for the codon, if it occurred in a gene for which at least 25% of the codons were available in the sample, and if the codon did not occur in the last 5% of all codons on the 3’ end of the gene.
Correlations between recombination rate and diversity (or divergence)
We estimated genetic distances in centiMorgans (cM) using information from three existing Mimulus genetic linkage maps: one intra-population M. guttatus composite map (IMxIM), integrated from three different F2 maps between individuals originating from the IM population (76), and two inter-specific F2 maps between IM62 M. guttatus (the reference line) and SF M. nasutus, IMxSF_2006 (68) and IMxSF_2009 (C. Wu and J. Willis unpublished). All three linkage maps are available at http://www.mimulusevolution.org/. See supplementary methods for more details.
With this integrated map, we estimated a local recombination rate for every 100 kb window, smoothed by the mean rate in the surrounding 500 kb. In each window we calculated mean pairwise sequence diversity at synonymous sites and used a non-parametric spearman rank sign test to evaluate the relationship between synonymous sequence diversity and the local recombination rate. We excluded windows with fewer than 100 pairwise comparisons, and regions without recombination estimates. The final number of 100 kb windows included in each pair-wise comparison ranged from 1,773–2,023 (∼177.3–202.3 Mb), or 60.5–69.0% of the reference genome. Moreover, the set of windows in each analysis largely overlapped – 1,756 windows were common to all analyses meaning that for a given test 87% to 99% of the windows were used in all other analyses. In the supplementary results and Table S7, we show that divergence to M. dentilobus and local depth do not influence our qualitative conclusions.
PSMC
As a complementary inference of historical demography and differentiation, we applied Li and Durbin's implementation of the pairwise sequentially Markovian coalescent (PSMC) (32) to pairwise combinations of focal haploid genomes. See the supplementary methods for additional details. Due to high diversity in our dataset, we used a window size of 10 bp for PSMC analysis. For all comparisons, we ran PSMC for 20 iterations and used the following input settings: recombination/mutation ratio (r) = 1.25, Tmax = 10, number of intervals (n) = 60, number of population size parameters = 24, parameter distribution pattern = ‘1*4 + 1*4 + 1*3 + 18*2 + 1*3 + 1*4 + 1*6’. We represented time using a generation time of 1 year and a mutation rate of 1.5 × 10−8. We note that choosing a fixed value for the recombination/mutation ratio is appropriate for comparisons between species and within M. guttatus; however, this does not capture the change in this ratio accompanying the transition to selfing in M. nasutus. Therefore quantitative estimates of population decline through time within M. nasutus are best viewed as very rough approximations. To generate a measure of variability in the PSMC estimates, we ran 100 bootstrap analyses for each pairwise comparison. See the supplementary methods for details.
Introgression Analyses
Treemix
As a test for recent admixture between M. guttatus and M. nasutus, we used Pickrell and Pritchard's (2012) Treemix method to model the evolutionary history of a group as a series of splits and gene flow events. For these analyses, we used the same subsample of 14,000 SNPs used for the nj tree and PCA described above. We specified a Treemix block size of 200 SNPs and estimated the evolutionary history including 1, 2, 3 or 4 admixture events. We also ran analyses with and without the reference line IM62 included, and with all nineteen Mimulus samples (see supplementary methods and results, as well as Figure S8).
HMM
We implement the forward-backward algorithm and posterior decoding as described by Durbin et al. (77) in a customized R script controlling for underflow (available upon request) to calculate posterior probabilities of M. nasutus or M. guttatus ancestry across all four focal M. guttatus samples.
We take as our emissions the minimum pairwise π between our focal M. guttatus sample and all M. nasutus samples in non-overlapping 1kb windows (πclose_nas). We compared πclose_nas to the genome-wide distribution of π between a M. nasutus sample and its genetically closest M. nasutus sample and π between M. guttatus samples and the genetically closest M. nasutus sample, to calculate the emission probability of admixed or pure M. guttatus ancestry, respectively. In doing so we accounted for both the heterogeneity in the number of informative sites across the genome and the fact that we compare each M. nasutus sample to three others, while we compare each M. guttatus sample to all four M. nasutus samples (see supplementary methods).
We calculate ti,j, the probability of transitioning from ancestry of species i to ancestry from species j from α, the admixture proportion, and r, the product of the recombination rate per 1kb multiplied by a point estimate of the number of generations since admixture. We set ti,j to tgut,gut = (1 − r) + r (1 − α), tgut,nas = r α, tnas,gut = r (1 − α), and tnas,nas = (1 − r) + r α, and optimize these parameters with the Nelder-Mead algorithm implemented in the R function, optim, calculating the likelihood of our data given these parameters from the forward algorithm. α estimated in this way is similar to estimates from the proportion of low divergence windows presented in the text, suggesting that our data provide information about both α and r. We assume that r is constant across windows, ignoring the influence of the recombination rate on the transition rate.
Inference of introgression history
The number of admixture blocks and our point estimate of admixture timing are strong evidence that admixture is not the result of a single chance M. nasutus ancestor in the history of these samples. To see this, consider that our current day genome is expected to be broken into X chunks v generations ago, where X is the number of recombination events (i.e, the map length in Morgans (approximately 14.7 Morgans in Mimulus) times the number of generations, v), plus the number of chromosomes (14 in Mimulus). So, for example, the CACG genome is broken into X = 2219 chunks at our point estimate of its admixture time, v = 150 generations ago (150 generations times 14.7 Morgans plus 14 chromosomes). These X chunks are drawn from 2v genealogical ancestors spread across many ancestors, meaning that CACG is expected to inherit 2219/2150 ∼ 1/2139 chromosomal segments from a typical genealogical ancestor, and therefore under a point model of admixture the odds of CACG inheriting two ancestry blocks from the same ancestor is vanishingly small. Therefore, each of the 227 M. nasutus ancestry blocks observed in CACG descends from a different admixture event, implying a vast number of admixed ancestors in the history of the sympatric populations M. guttatus. The case is starker for DPRG, for which we infer a much older point estimate of the admixture time.
Because recombination is a Poisson process, under a single admixture time, v, we expect the variance in admixture block length to equal the square of the mean block length. However the variance in admixture block lengths is inconsistent with this expectation for both southern (DPRG, mean = 0.0010 M, σ = 0.0013 M, Bootstrap P < 0.001), and northern (CACG, mean = 0.0074 M, σ = 0.0102M, Bootstrap P < 0.001) sympatric M. guttatus samples. This argues against a point model of admixture and suggests ongoing and sustained gene flow into M. guttatus where it is sympatric with M. nasutus. We acknowledge that these calculations are somewhat crude, especially because we assume a constant recombination rate genome-wide. Nonetheless, the extreme variation in admixture block length and the large number of blocks visually obvious in Figure S9 supports this qualitative result.
We note that since our inference of the extreme improbability that two admixture blocks are derived from the same introgression event relied on a point model, we must soften this conclusion. It is plausible that a few young ancestry blocks in CACG are descended from the same admixture event, however, the rejection of a point admixture model strengthens our major conclusion that gene flow is ongoing and that our samples do not represent single admixture events.
Inference of Introgression from M. guttatus to M. nasutus
To test for introgression from M. guttatus into M. nasutus, we took advantage of the structure of genetic variation in M. nasutus – for most of the genome all individuals are remarkably similar, and when this is not the case, one individual often differs sharply from all others. The genetic variation in such genomic regions has either been maintained from the stock of ancestral variation, or it has been reintroduced by introgression upon secondary contact. The extreme genetic variation and miniscule extent of linkage disequilibrium within M. guttatus makes these two alternative hypotheses nearly indistinguishable in any given region; however, by collating information across all such regions we can test the introgression hypothesis.
To do so, we find long regions (20 kb) of the M. nasutus genome for which one individual is a genetic outlier, as described above. In these regions, we ask whether the outlier is more closely related to a specific M. guttatus sample than are the non-outliers. Under incomplete lineage sorting, there is a 50% probability that this is the case. However, under an admixture model this probability is greater than 50% if the M. guttatus sample is more closely related to the potential admixture source than it is to the population that founded M. nasutus. See supplementary results for more details, and Table S5 for the robustness of this inference to our exact rules for identifying outlier windows.
We perform this test for all M. nasutus samples individually against two focal allopatric M. guttatus samples from the northern (AHQT) and southern (SLP) groups. In addition to testing the one sided hypothesis of whether the outlier is sample is more often like a give M. guttatus sample we also pool our northern M. nasutus samples to amplify any signal.
Analyses and Results
Pairwise comparisons
Values of πS and πN/πS for each pairwise comparison are presented in Table S2. The mean number of pairwise sequence differences between M. nasutus and each focal M. guttatus sample is 4.54% (Northern sympatric – CACG), 4.76% (Southern sympatric – DPRG), 5.05% (Southern allopatric – SLP), 5.41% (Northern allopatric – AHQT).
To highlight the influence of read length and depth on estimates of diversity, in Table S3 we present mean πS and πN/πS values across all population comparisons split by the number focal samples in a comparison (i.e., zero means that neither of the samples is included in our detailed genomic analyses due to low depth or short reads). Note that for a given comparison between populations, diversity between two focal samples is much higher than that between two non-focal samples illustrating the influence of sequencing effort on diversity estimates. To avoid these effects we focus on our focal samples when discussing levels of diversity. We also note that we did not present in-depth genomic analyses of comparisons including the reference, IM62, because of unknown biases it may introduce.
We present values of pairwise πS between each focal sample with (above the main diagonal) and without (below the main diagonal) putatively introgressed regions (as inferred by a >95% posterior probability of M. nasutus ancestry in a genomic region of an M. guttatus sample) in Table S4. Reassuringly, after removing such regions, our two northern focal M. guttatus samples no longer differ in the number of pairwise sequence difference to M. nasutus, suggesting that our HMM performed very well for CACG (compare CACG and AHQT to M. nasutus samples above and below the main diagonal). Removing regions of inferred recent introgression in DPRG also increased its distance from M. nasutus samples; however, this sample is still genetically closer to M. nasutus than is the allopatric southern sample, SLP. This suggests that our HMM may have missed short (i.e. old) regions of introgression into DPRG and/or that even without introgression, DPRG is more closely related to the M. nasutus progenitor population than is SLP.
Additional PSMC results
We present the results of additional PSMC analyses, including the effects of admixture on shared variation between M. guttatus and M. nasutus, and M. nasutus’ population size decline, as well as ‘zoomed-in’ and ‘zoomed-out’ views of alternative figures (changing the y-axis limits) for each analysis. In Figure S1, we present a ‘zoomed-out’ view of Figure 1E providing a view of historical population splits and population size changes over time. The extreme variation in recent population sizes demonstrates both the effect of population structure within M. guttatus on population size estimates, and the lower accuracy of PSMC time estimates in recent history (Li and Durbin 2011). In Figure S2, we present a ‘zoomed-in’ view of the split between M. nasutus and southern M. guttatus. The approximate split date of ∼200-500 kya is visible by evaluating roughly when the southern M. guttatus x M. nasutus curve (SLP x KOOT, gray) diverges from the southern M. guttatus curve (SLP x DPRG, blue; see Li and Durbin 2011).
We infer the history of population size decline in M. nasutus by running PSMC on all pairwise comparisons of our four high-coverage M. nasutus lines. From Figure S3, we observe that M. nasutus’ decline in effective population size was coincident with divergence between the species, indicating that it is plausible that the evolution of selfing was associated with speciation and the origin of M. nasutus. The extreme reduction in M. nasutus’ effective population size relative to M. guttatus is also evident from these analyses.
Finally, our PSMC analyses demonstrate an effect of admixture on the inferred history of divergence. We observe a reduction in the between species effective population size between M. nasutus and sympatric M. guttatus, relative to that between M. nasutus and allopatric M. guttatus (Figure S4, and S5 for the full, zoomed-out view). In Figure S4, relatively recent (i.e., between ∼10 and 70 kya) effective population sizes between M. nasutus and sympatric M. guttatus are reduced relative to allopatric comparisons, and roughly in the range of or even lower than population sizes within southern and northern M. guttatus, further supporting a history of ongoing and recent introgression.
Robustness of introgression results
Treemix
We explored our Treemix analyses over a range of different sample subsets and numbers of admixture events:
Focal samples and the reference (IM62) rooted by the outgroup
Focal samples rooted by the outgroup
All M. nasutus and M. guttatus samples rooted by the outgroup.
For each set of samples, we allowed one, two, three, or four historical admixture events (Figure S8). Regardless of sample subset and the number of admixture events allowed, we always see strong evidence of introgression from M. nasutus into CACG, a result consistent with all analyses in this manuscript. However, the other clear signal of introgression observed in our genomic analyses – introgression from M. nasutus into DPRG, was only observed when we allow for more than one introgression event and analyze all focal samples and the reference genome (Figure S8 A.2-A.4). When we limit our analysis to focal samples (rooted by the outgroup) and allow for two or more introgression events, treemix places an introgression arrow from northern M. guttatus samples to SLP (Figure S8 B.2-B.4). We view this result as consistent with introgression of M. nasutus material into DPRG, as introgression of the northern M. guttatus samples to SLP leaves DPRG as closer to the M. nasutus samples than SLP. Combined with our genome-wide analyses, we view this as good evidence of introgression from M. nasutus into DPRG.
Block length distributions
In the main text, we used the length distribution of admixture blocks to provide a detailed view of the recent history of introgression of M. nasutus ancestry into M. guttatus. While this summary of the data contains much information, our inference of this distribution is likely imperfect. In practice, we may break up long admixture blocks or we may mislabel short genomic regions with low divergence as short admixture blocks.
In practice, both problems could confuse our inference. Miscalling short unadmixed regions and breaking up long regions into numerous smaller ones will both push back our inferred admixture time. Additionally, introducing short, false positive blocks may mislead us into seeing a mix of old and new admixture events, when in practice there was a single recent pulse. A major claim of our manuscript is that admixed M. guttatus samples are not simply early-generation hybrids, but rather represent on ongoing history of introgression. We therefore wish to ensure that these potential challenges to characterizing the block length distribution do not mislead our inference.
We use two strategies to ensure the robustness of our results. First, we ‘heal’ admixture blocks within X = {0,20,50,100} kb of one another (Figure S9), to guard against breaking up few long admixture blocks into more short ones. We also use our allopatric and putatively ‘pure’ M. guttatus samples to empirically control for the false positive admixture blocks. To do so, we alternatively use the block length distribution of AHQT and SLP and remove the closest matched block lengths in our other samples (note that we remove use the AHQT block length distribution in an attempt to better characterize the introgression history of SLP as well). By factorially implementing these controls, we see that our inference of ongoing introgression of M. nasutus into sympatric M. guttatus populations is robust. In all controls, we observe more variation in admixture tract lengths than would be expected under a simple point admixture model. Moreover, while removing young blocks and creating longer blocks creates a more recent estimated admixture time, our most recent estimated admixture time in CACG is 37 generations ago, arguing against a single recent admixture event. Even if admixture occurred 37 generations ago into CACG, it is very unlikely that a block from a given event at that time would survive to the present – and therefore gene flow is likely (relatively) consistently ongoing (Table S5).
Robustness of inferred introgression from M. guttatus into M. nasutus
In the main text we described our strategy of using outlier windows – regions where one M. nasutus sample differed radically from all others to infer historical introgression from M. guttatus into M. nasutus. The identification of outlier windows required numerous decisions; here we investigate the robustness of the signal of introgression to these choices.
The first was the πS cutoff differentiating outlier and non-outlier regions. We chose three alternative values for this cutoff – 0.5% (roughly corresponding to the expected level of differentiation since the species split), 2.0% (roughly corresponding to expected levels of variation within an ancestral M. guttatus population), and 1.0% (representing a compromise between these values). Within a given cutoff, we identify 20 contiguous overlapping sliding windows (with a 1kb slide) where one sample differs from all others by πS greater than this threshold, while the others are differentiated from one another by πS less than this threshold. Although we always insist on 20 contiguous windows (representing 20kb), we vary the window size, allowing it to take the value of 5, 10 or 20 kb (noted by L in Table S5).
Regardless of exact thresholds, we always see evidence for either introgression into NHN and/or introgression into the pooled collection of northern M. nasutus samples (CACN, NHN, and Koot), in the form of too many outlier windows being too close to AHQT (Table S6). By contrast, no samples are closer to SLP in outlier regions more often than expected by chance. However, as noted in the main text, our inability to identify introgression from southern M. guttatus into southern M. nasutus is likely underpowered because SLP may be too similar to the population that founded M. nasutus.
The relationship between divergence and recombination rate is not driven by sequencing depth or mutation rate variation
In the main text, we report a strong negative relationship between the local recombination rate (in 100 kb windows, smoothed over 500 kb) and absolute divergence between M. nasutus and sympatric M. guttatus samples at synonymous sites. We control for the potential confounds of the mutation rate (measured as divergence to M. dentilobus) and/or sequencing depth (at synonymous sites) in Table S7. To do so, we find the nonparametric correlation (Spearman's ρ) between the recombination rate and residuals of predicted divergence given local depth and/or divergence to M. dentilobus (where predictions come from the best fit linear model).
Acknowledgements
We thank Dan Rokhsar, Kerrie Barry, and others at the U.S. Department of Energy Joint Genome Institute who provided primary genome sequencing and support. We also thank the Willis lab at Duke University (especially Jennifer Modliszewski) for DNA extractions. This study was supported in part by resources and technical expertise from the Georgia Advanced Computing Resource Center (GACRC), a partnership between the University of Georgia's Office of the Vice President for Research and Office of the Vice President for Information Technology. We particularly thank Shan-Ho Tsai, Curtis E. Combs Jr., and Yecheng Huang at the GACRC for their technical expertise and/or support. Saravanaraj ‘Raj’ Ayyampalayam at the Quantitative Biology Consulting Group in the University of Georgia Institute of Bioinformatics provided technical advice. John Willis and Stephen Wright provided thoughtful comments, which significantly improved the quality of our manuscript. Funding was provided by the University of Georgia Research Foundation to Andrea Sweigart and by the Alfred P. Sloan Foundation to Graham Coop. The work conducted by the Joint Genome Institute is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
Footnotes
↵⌘ These authors jointly supervised this work