Abstract
Adaptation in the wild often involves the use of standing genetic variation (SGV), allowing rapid responses to selection on ecological timescales. Despite increasing documentation of evolutionarily important SGV in natural populations, we still know little about how the genetic and genomic structure and molecular evolutionary history of SGV relate to adaptation. Here, we address this knowledge gap using the threespine stickleback fish (Gasterosteus aculeatus) as a model. We demonstrate that adaptive genetic variation is structured genome-wide into distinct marine and freshwater haplogroups. This divergent variation averages six million years old, nearly twice the genome-wide average, but has been evolving over the 15-million-year history of the species. Divergent marine and freshwater genomes maintain regions of ancient ancestry that include multiple chromosomal inversions and extensive linked variation. These discoveries about ancient SGV demonstrate the intertwined nature of selection on ecological timescales and genome evolution over geological timescales.
The mode and tempo of adaptive evolution depend on the sources of genetic variation affecting fitness1,2. While new mutation is ultimately the source of all genetic variation, recent studies of adaptation in the wild document adaptive genetic variation that was either segregating in the ancestral population as standing genetic variation (SGV)3-5 or introgressed from a separate population or species6,7. The use of SGV appears particularly important when dramatic responses to selection occur on ecological timescales, in dozens of generations or fewer3. When environments change rapidly, SGV can propel rapid evolution in ecologically relevant traits even in populations of long-lived organisms like Darwin’s finches8, monkeyflowers9, and threespine stickleback fish10.
Existing genetic variants have evolutionary histories that are often unknown but that may have significant impacts on subsequent adaptation9,11. The abundance, genomic distribution, and fitness effects10-14 of SGV are themselves products of evolution, and their unknown history raises fascinating questions for the genetics of adaptation in the wild. When did adaptive variants originally arise? How are they structured, across both geography and the genome? Which evolutionary forces shaped their distribution? And how does this evolutionary history of SGV potentially channel future evolutionary change?
Answers to these questions are critical for our understanding of the importance of SGV in nature and our ability to predict the paths available to adaptation on ecological timescales9. Biologists are beginning to probe evolutionary histories of SGV using genome-wide sequence variation across multiple individuals in numerous populations15, but this level of inference has been unavailable for most natural systems because of methodological limitations that remove phase information (e.g. pool-seq16) or produce very short reads (e.g. RAD-seq17). Here, we investigate the structure and evolutionary history of divergent SGV by implementing a novel haplotyping method based on restriction site-associated DNA sequencing (RAD-seq). This approach creates nearly 1kb haplotypes at thousands of densely sampled loci, allowing us to accurately measure sequence variation and estimate divergence times across the genome.
SGV is likely critical to adaptation in threespine stickleback. Marine stickleback have repeatedly colonized freshwater lakes and streams18,19, and adaptive divergence in isolated freshwater habitats is highly parallel at the phenotypic20,21 and genomic levels19,22 (but see Stuart et al.23). In addition, analyses of haplotype variation at the genes eda10,24 and atp1a124 present two clear results: separate freshwater populations share common ‘freshwater’ haplotypes that are identical-by-descent (IBD), and sequence divergence between the major marine and freshwater haplogroups suggests their ancient origins, perhaps over two million years ago in the case of eda10. While these findings are intriguing, it is not clear whether the deep evolutionary histories of these loci are outliers or representative of more widespread ancient history across the genome. To address this fundamental question we utilize the new haplotype RAD-seq approach to assay genome-wide variation associated with adaptive divergence in two young freshwater ponds, which formed during the end-Pleistocene glacial retreat (c. 12,000 years ago21,25; Fig. 1). Our results demonstrate a suite of adaptive variation structured into distinct marine and freshwater haplotypes that evolved over millions of years.
RESULTS AND DISCUSSION
Parallel adaptation to freshwater environments has been a major theme of stickleback evolutionary history26. Stereotypical morphological changes (e.g. bony armor20 and craniofacial structures27) presumably reflect adaptation to similar selective regimes28,29 and are accompanied by parallel genomic divergence19,22, which involves large regions spanning many megabases24,30, including multiple chromosomal inversions19. The leading hypothesis for the genetics of parallel divergence in stickleback posits that distinct freshwater-adaptive haplotypes that are identical-by-descent (IBD) are shared among freshwater populations due to historical gene flow between marine and freshwater populations30. To test for the presence of these haplotypes directly, we characterized the genomic architecture and evolutionary history of SGV by modifying the RAD-seq protocol31 to generate phased haplotypes similar in length to Sanger sequencing reads, each anchored to one of tens of thousands of PstI restriction sites spread across the genome (Fig. 1B-D). We sampled five fish (10 haploid genomes) each from Boot Lake (BL) and Rabbit Slough (RS), and four fish (8 genomes) from Bear Paw Lake (BP). After stringent data filtering (see Methods), this resulted in a dataset of 57,992 RAD loci (locus = two tags representing one cut site) with 694 potential variable sites per locus and a median of seven segregating sites per locus (range: 2-155, Suppl. Fig. 1, Suppl. Table 1). We then used these phased haplotypes to estimate genealogies at each RAD locus. By including haplotypes from all three populations in these genealogical analyses, we were able to jointly calculate population genetic statistics (Fst, π, dxy), estimate the degree of lineage sorting within populations, and identify patterns of IBD among populations.
We find that indeed, parallel population genomic divergence in each freshwater site consistently involved haplotypes that were IBD among both freshwater populations (Fig. 2). Background Fst between populations ranged from 0.139-0.226, with divergence between the freshwater populations BL and BP being highest (Fst(rs-bl) = 0.139, Fst(rs-bp) = 0.194, Fst(bl-bp) = 0.226; two-sided Mann-Whitney test for all pairwise comparisons: p ≤ 1×10−10). The degree and genomic distribution of pairwise Fst between the BL, BP, and RS populations were similar to those previously reported22, including marine-freshwater Fst outlier regions on chromosome 4 over a broad span in which the eda gene is embedded (orange triangle in Fig. 2A) and three regions now known to be associated with chromosomal inversions on chromosomes 1, 11, and 21 (yellow bars in Fig. 2; hereafter referred to as inv1, inv11, and inv21). The gene atp1a1 (green triangle in Fig. 2A) is contained within inv1. As expected, we found distinct haplogroups associated with marine and freshwater habitats at both eda and atp1a1 (Fig. 3, insets).
Strikingly, this finding of habitat-specific haplogroups was not at all unique to these well-studied genes or chromosomal inversions. The two isolated freshwater populations shared IBD haplotypes within all common marine-freshwater Fst peaks even though IBD was rare elsewhere (Fig. 2B). Furthermore, we observed a separate clade of haplotypes representing the marine RS population at the majority (1129 of 2172, 52%) of RAD loci showing freshwater IBD. The result was a genome-wide pattern of reciprocal monophyly between marine and freshwater haplotypes. Notably, this is the same genealogical structure previously reported at eda10,24 and atp1a124. These loci are therefore but a small part of a genome-wide suite of genetic variation sharing similar habitat-specific evolutionary histories, and the previous documentation of their genealogies was a harbinger of the much more extensive pattern across the genome revealed here. Hereafter, we refer collectively to this class of RAD loci as ‘divergent loci’.
Because the genealogical structure of divergence across the genome mirrors that at eda and atp1a1, we asked whether levels of sequence variation and divergence also showed consistent genomic patterns. At all RAD loci we therefore calculated π within each population, as well as in the combined freshwater populations, and dxy between marine and freshwater habitat types. Genome-wide diversity was similar across populations and habitat types (mean πrs = 0.0032, πbl = 0.0034, πbp = 0.0026, πfw = 0.0038) and comparable to previous estimates22. Likewise, genome-wide dxy among habitat types was modest (0.0049) when compared to π across all populations (π = 0.0042, two-sided Mann-Whitney test: p ≤ 1×10−10). Among divergent loci, however, we observed reductions in diversity in both habitats (mean πrs-divergent = 0.0012, πfw-divergent = 0.0016, two-sided permutation test: p ≤ 1×10−4, Fig. 3, Suppl. Fig. 2), indicating natural selection in both habitats. In contrast, sequence divergence associated with reciprocal monophyly was striking, averaging nearly three times the genome-wide mean (mean dxy-divergent = 0.0124). This divergence ranged over more than an order of magnitude (0.0013–0.0442), from substantially lower than the genome-wide average to nearly ten times greater than the average. These findings indicate that much of the genetic variation underlying adaptive divergence was not just standing and structured by habitat, but has been segregating and accumulating for millennia.
These data clearly support the hypothesis of Schluter and Conte30 of ancient haplotypes ‘transported’ among freshwater populations. Much of the divergence we observed was ancient in origin, with levels of sequence divergence at some RAD loci exceeding that observed at eda (Fig. 3, gold line) and suggestive of divergence times of at least two million years ago10. Our observation that sequence variation was consistently reduced in both habitat types emphasizes that alternative haplotypes at these loci are likely selected for in the marine population as well as the freshwater. These alternative fitness optima — driven by different ecologies — provide a favorable landscape for the maintenance of variation32,33, but also lead to a more potent barrier to gene flow among freshwater populations if there are fitness consequences in the marine habitat for stickleback carrying freshwater-adaptive variation. Conditional fitness effects through genetic interactions (e.g., dominance34 or epistasis35) and genotype-by-habitat interactions36 could potentially extend the residence time of freshwater haplotypes in the marine habitat. Future work should consider the phenotypic effects of divergently adaptive variation in different external environments36,37.
A steady accumulation of divergently adaptive variation between marine and freshwater stickleback genomes may also have been critical to the rapid divergence in the young pond populations we study here. We found reciprocal monophyly associated with a spectrum of sequence divergence, including a substantial fraction of divergent loci (11.0%, 124/1129) with dxy below the genome-wide average. Thus, ongoing marine-freshwater ecological divergence has yielded continuing marine-freshwater genomic divergence. Moreover, while this younger variation is shared between the freshwater populations in this study, and localizes to genomic regions of divergence shared globally19, some adaptive variants may be distributed only locally (e.g. to southern Alaska or the eastern Pacific basin). In addition to the globally distributed suite of variation, there may also exist a substantial amount of regional variation contributing to stickleback genomic and phenotypic diversity.
Sequence divergence provides an important relative evolutionary timescale. However, to more directly compare the timescales of ecological adaptation and genomic evolution, we translated patterns of sequence variation into the time to the most recent common ancestor (Tmrca) of allelic variation, in years. To do so, we performed a de novo genome assembly of the ninespine stickleback (Pungitius pungitius), a member of the Gasterosteidae that diverged from the threespine stickleback lineage approximately 15 million years ago38 (Fig. 4A, Suppl. Table 2). We then aligned our RAD dataset to this assembly and estimated gene trees for each alignment with BEAST39, setting divergence to the ninespine stickleback at 15 MYA (see Methods).
We find that the divergence of key marine and freshwater haplotypes has been ongoing for millions of years and extends back to the split with the ninespine stickleback lineage (Fig. 4B). Genome-wide variation averaged 4.1 MY old, and Tmrca for the vast majority of RAD loci was under 5 MY old. In contrast, divergent loci averaged 6.4 MY old and, amazingly, the most ancient 10% (118 of 1129 loci) are estimated at over 10 MY old. This deep genomic divergence not only underscores that the marine-freshwater transition has been occurring throughout the history of the threespine stickleback lineage, for which there is evidence in the fossil record going back 10 million years40, but it also demonstrates that at least some of the variation fueling those ancient events has persisted until the present day. In some genomic regions, then, marine and freshwater threespine stickleback are as divergent as threespine and ninespine stickleback, which are classified into separate genera.
Adaptive divergence has impacted the history of the stickleback genome as a whole (Fig. 4C). We identified 32.6 Mb, or 7.5%, of the genome as having elevated Tmrca (gray boxes in Fig. 4C; two-sided permutation test, p ≤ 0.001). Outside of the non-recombining portion of the sex chromosome (chr. 19), the oldest regions of the stickleback genome were those enriched for divergent loci. Patterns of ancient ancestry closely mirrored recent divergence in allele frequencies (Fig. 2A) and it appears that historical and contemporary marine-freshwater divergence has impacted ancestry across much of the length of some chromosomes. Chromosome 4, for example, contains at least three broad peaks in Tmrca and a total of 5.9 Mb identified as genome-wide outliers (two-sided permutation test, p ≤ 0.001). This chromosome has been of particular interest because of its association with a number of phenotypes20,41, including fitness42. We found the major-effect armor plate locus eda comprised a local peak (mean Tmrca = 6.4 MYA) nested within a large region of deep ancestry spanning 8.1 Mb. Moreover, at least two other peaks distal to eda, centered at 21.4 Mb and 26.6 Mb, were also several million years older than the genomic average at 6.8 MYA and 7.0 MYA, respectively.
Intriguingly, genomic regions of elevated Tmrca remained outliers even after removing marine-freshwater relative divergence outliers as measured by Fst (Suppl. Fig. 3). We estimated that 7.5% of the genome had increased Tmrca even though only 1.9% of RAD loci (1129 of 57,992) were classified as divergent. When we removed these loci along with loci with extreme values of marine-freshwater Fst (Fst > 0.5), many of the regions in which they resided were still Tmrca outliers. It is possible that the remainder of this old variation is neutral with respect to fitness. However, we identified divergence outliers based on only a single axis of divergence: the marine-freshwater axis. Throughout the entire species range, populations are locally experiencing multiple axes of divergence, including lake-stream and benthic-limnetic axes43, that often share a common genomic architecture44,45. Our data may indicate underlying similarities in selection regimes. Alternatively, this co-localized ancient variation may represent the accumulation of adaptive divergence along multiple axes in the same genomic regions, whether or not the underlying adaptive variants are the same. Aspects of the genomic architecture, such as gene density or local recombination rates, may in part govern where in the genome adaptive divergence can occur46-48. Multiple axes of divergence may therefore act synergistically to maintain genomic variation across the stickleback metapopulation.
Nevertheless, much of the ancient variation we observe may in fact itself be neutral, having been maintained by close linkage to loci under divergent selection between the marine and freshwater habitats32. Indeed, the broadest peaks of Tmrca we observe occur in genomic regions with low rates of recombination47,49 in other stickleback populations, which would extend the size of the linked region affected by divergent selection. On ecological timescales, low recombination rates in stickleback are thought to promote divergence by making locally adapted genomic regions resistant to gene flow47. Our results potentially extend the inferred impact of recombination rate variation on genomic variation to timescales that are 1000-fold longer, maintaining both multimillion-year-old adaptive variation and large stores of linked genetic variation. Future modeling efforts will be needed to explore the range of population genetic parameter values (e.g. selection coefficients, migration rates, and recombination rates) required to produce the extent of divergence we see here.
Lastly, our findings demonstrate that known chromosomal inversions maintain globally distributed, multilocus haplotypes. The three chromosomal inversions (inv1, inv11, and inv21; yellow bars in Fig. 4C) all showed sharp spikes in Tmrca. Genomic signatures of these inversions are distributed throughout the species range, including coastal marine-freshwater population pairs in the Pacific and Atlantic basins19 and inland lake-stream pairs in Switzerland44. Despite our limited geographic sampling, our finding that all three of these inversions are over six million years old is further evidence of single, ancient origins of each, followed by their spread across the species range. Each inversion contained a high density of divergent RAD loci (inv1: 64% of loci divergent; inv11: 60%; inv21: 71%), but we also identified regions within these inversions in which haplotypes from marine or freshwater habitats, or both, were not monophyletic. inv1 and inv11 both contained two regions separated by loci in which neither habitat type was monophyletic; inv21, the largest of the three, contained ten such regions. Additionally, Tmrca and Fst decreased sharply to background levels outside of the inversions, demonstrating the potential for gene flow and recombination to homogenize variation in these regions. We interpret this as evidence that these inversions help maintain linkage disequilibrium among multiple divergently adaptive variants in regions susceptible to homogenization11,50. The presence of these inversions, therefore, further supports the hypothesis that the recombinational landscape can influence where in the genome adaptive divergence can occur and emphasizes the degree to which gene flow among divergently adapted stickleback populations has impacted global genomic diversity.
CONCLUSIONS
Selection operating on two very different timescales, the ecological and the geological, has shaped genomic patterns of SGV in the threespine stickleback. Selection on ecological timescales drives phenotypic divergence in decades or millennia by sorting SGV across geography and throughout the genome22,44,51,52. Our findings show that the persistence of this ecological diversity and local adaptation of stickleback has set the stage for long-term divergent selection and for the continual accumulation and maintenance of adaptive variation over millions of years. A number of genetic variants fueling contemporary, rapid adaptation may even have been present, and under selection, since before the threespine and ninespine stickleback lineages split. The extent to which ecological adaptation in a single population drew on haplotypes that have evolved over millions of years and persisted in multiple populations, many of which are now extinct, underscores the need to understand macroevolutionary patterns when studying microevolutionary processes, and vice versa.
METHODS
Sample collection and library preparation
Wild threespine stickleback were collected from Rabbit Slough (N 61.5595, W 149.2583), Boot Lake (N 61.7167, W 149.1167), and Bear Paw Lake (N 61.6139, W 149.7539). Rabbit Slough is an offshoot of the Knik Arm of Cook Inlet and is known to be populated by anadromous populations of stickleback that are stereotypically oceanic in phenotype and genotype22,53. Boot Lake and Bear Paw Lake are both shallow lakes formed during the end-Pleistocene glacial retreat. Fish were collected in the summers of 2009 (Rabbit Slough), 2010 (Bear Paw Lake), and 2014 (Boot Lake) using wire minnow traps and euthanized in situ with Tricaine solution. Euthanized fish were immediately fixed in 95% ethanol and shipped to the Cresko Laboratory at the University of Oregon (Eugene, OR, USA). DNA was extracted from fin clips preserved in 95% ethanol using either Qiagen DNeasy spin column extraction kits or Ampure magnetic beads (Beckman Coulter, Inc) following the manufacturer’s instructions. Yields averaged 1-2 μg DNA per extraction (~30 mg tissue). Treatment of animals followed protocols approved by the University of Oregon Institutional Animal Care and Use Committee (IACUC).
We designed our library preparation strategy to identify sufficient sequence variation for gene tree reconstruction and to simplify downstream sequence processing and analysis by taking advantage of the phase information captured by paired-end sequencing. We generated RAD libraries from these samples using the single-digest sheared RAD protocol from Baird et al. with the following specifications and adjustments: 1 μg of genomic DNA per fish was digested with the restriction enzyme PstI-HF (New England Biolabs), followed by ligation to P1 Illumina adaptors with 6 bp inline barcodes. Ligated samples were multiplexed and sheared by sonication in a Bioruptor (Diagenode). To ensure that most of our paired-end reads would overlap unambiguously and produce longer contiguous sequences, we selected a narrow fragment size range of 425-475 bp. The remainder of the protocol was per Baird et al.31. All fish were sequenced on an Illumina HiSeq 2500 using paired-end 250 bp sequencing reads at the University of Oregon’s Genomics and Cell Characterization Core Facility (GC3F).
Sequence preparation
Raw Illumina sequence reads were demultiplexed, cleaned, and processed primarily using the Stacks pipeline54. Paired-end reads were demultiplexed with process_shortreads and cleaned using process_radtags using default criteria (throughout this document, names of scripts, programs, functions, and command-line arguments will appear in fixed-width font). Overlapping read pairs were then merged with fastq-join55 (Fig. S1). Pairs that failed to merge were removed from further analysis. In order to retain the majority of the sequence data for analysis in Stacks and still maintain adequate contig lengths, merged contigs were trimmed to 350 bp and all contigs shorter than 350 bp were discarded. We aligned these contigs to the stickleback reference genome19,49 using bbmap with the most sensitive alignment settings (‘vslow=t’; http://jgi.doe.gov/data-and-tools/bbtools/) and used the pstacks, cstacks, and sstacks components of the Stacks pipeline to create stacks and call SNPs and haplotypes, create a catalog of RAD tags across individuals, and match tags across individuals. All data were then passed through the Stacks error correction module rxstacks to prune unlikely haplotypes. We ran the Stacks component program populations on the final dataset to filter loci genotyped in fewer than four individuals in each population and to create output files for sequence analysis. We use the naming conventions of Baird et al.56: A “RAD tag” refers to sequence generated from a single end of a restriction site and the pair of RAD tags sequenced at a restriction site comprises a “RAD locus” (Figure 2.1).
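The length filter applied to merged contigs can be sketched in a few lines of Python. This is an illustration only, not the authors' script: the function name trim_contigs and the in-memory record format are hypothetical, and trimming from the 3' end (so that each contig still begins at the restriction site) is an assumption that the text does not state.

```python
from typing import Iterable, Iterator, Tuple

TARGET_LEN = 350  # contig length stated in the text

Record = Tuple[str, str, str]  # (name, sequence, quality string)


def trim_contigs(records: Iterable[Record],
                 target_len: int = TARGET_LEN) -> Iterator[Record]:
    """Trim merged contigs to target_len and drop any that are shorter.

    Assumption: bases are removed from the 3' end, so the retained
    sequence still starts at the restriction site.
    """
    for name, seq, qual in records:
        if len(seq) < target_len:
            continue  # shorter than the target: discarded
        yield name, seq[:target_len], qual[:target_len]


# Toy merged contigs (hypothetical data)
merged = [
    ("contig_a", "A" * 420, "I" * 420),
    ("contig_b", "C" * 349, "I" * 349),
    ("contig_c", "G" * 350, "I" * 350),
]
kept = list(trim_contigs(merged))
```

With this filter every retained contig has the same length, which keeps downstream alignments at a locus rectangular.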
We used the program phase57 to phase pairs of RAD tags originating from the same restriction site. We coded the haplotypes present at each RAD tag, which often contain multiple SNPs, as multiallelic genotypes. This both simplified the phasing process and reduced its computing time. Custom Python scripts automated this process and are included as supplementary files. We required that each individual have at least one sequenced haplotype at each tag for phasing to be attempted. If a sample had called genotypes at only one tag in the pair, the sample was removed from further analysis of that locus. The resultant phased haplotypes were used to generate sequence alignments for import into BEAST.
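The recoding step described above can be illustrated with a minimal Python sketch. It is not the authors' supplementary script: the function names, the integer coding scheme, and the toy data are all hypothetical. The idea is that each distinct multi-SNP haplotype at a tag becomes one allele of a single multiallelic marker, so that the phasing program only has to resolve two markers per restriction site.

```python
from typing import Dict, List, Tuple

Diploid = Tuple[str, str]  # the two haplotypes one fish carries at one RAD tag


def encode_haplotypes(
    tag_haps: Dict[str, Diploid]
) -> Tuple[Dict[str, Tuple[int, int]], List[str]]:
    """Recode each distinct tag haplotype as an integer allele (1, 2, 3, ...)."""
    alleles = sorted({hap for pair in tag_haps.values() for hap in pair})
    code = {hap: i + 1 for i, hap in enumerate(alleles)}
    genotypes = {fish: (code[a], code[b]) for fish, (a, b) in tag_haps.items()}
    return genotypes, alleles


def samples_with_both_tags(left: Dict[str, Diploid],
                           right: Dict[str, Diploid]) -> List[str]:
    """Keep only samples genotyped at both tags of a restriction site."""
    return sorted(set(left) & set(right))


# Toy data: SNP states at the variable sites of two adjacent tags
left_tag = {"fish1": ("AC", "AC"), "fish2": ("AC", "GT"), "fish3": ("GC", "GT")}
right_tag = {"fish1": ("T", "T"), "fish2": ("T", "C")}  # fish3 missing here

usable = samples_with_both_tags(left_tag, right_tag)
genotypes, alleles = encode_haplotypes({s: left_tag[s] for s in usable})
```

Decoding after phasing is a lookup in the returned allele list (allele code k corresponds to alleles[k - 1]).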
We recovered a total of 236,787 RAD tags after filtering, mapping to 151,813 PstI restriction sites. At 84,974 restriction sites, we recovered and successfully phased adjacent RAD tags (169,948 RAD tags) into single RAD loci. We retained these 84,974 RAD loci for our analysis.
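As a consistency check, the counts in this paragraph can be reconciled with a few lines of arithmetic. The sketch assumes that every RAD tag not phased into a pair sits alone at its own restriction site; the variable names are illustrative.

```python
# Counts reported in the text
total_tags = 236_787
restriction_sites = 151_813
paired_sites = 84_974  # sites where both adjacent tags were recovered and phased

# Two tags per paired site
tags_in_pairs = 2 * paired_sites

# Sites represented by one tag only, computed two independent ways
single_tag_sites = restriction_sites - paired_sites
unpaired_tags = total_tags - tags_in_pairs

print(tags_in_pairs, single_tag_sites, unpaired_tags)
```

Both routes give the same number of single-tag sites, and the paired-tag count matches the 169,948 tags reported above.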
Ninespine stickleback genome assembly
In order to estimate the Tmrca of threespine stickleback RAD alleles, we used the ninespine stickleback (Pungitius pungitius) as an outgroup (Figure 3.1, see Figure 1.2). RAD sequence analysis, however, relies on the presence of homologous restriction sites among sampled individuals and results in null alleles when mutations occur within a restriction site58. Because this probability increases with greater evolutionary distance among sampled sequences, we elected to use RAD-seq only to estimate sequence variation within the threespine stickleback. We then generated a contig-level de novo ninespine stickleback genome assembly from a single ninespine stickleback individual from St. Lawrence Island, Alaska (collected by J. Postlethwait) using DISCOVAR de novo (https://software.broadinstitute.org/software/discovar). We used this single ninespine stickleback haplotype to estimate threespine-ninespine sequence divergence and to time-calibrate coalescence times within the threespine stickleback. DISCOVAR de novo requires a single shotgun library of paired-end 250-bp sequence reads from short-insert-length DNA fragments. High molecular weight genomic DNA was extracted from an ethanol-preserved fin clip by proteinase K digestion followed by DNA extraction with Ampure magnetic beads. Purified genomic DNA was mechanically sheared by sonication and size selected to a range of 200-800 bp by gel electrophoresis and extraction. We selected this fragment range to agree with the recommendations for de novo assembly using DISCOVAR de novo (https://software.broadinstitute.org/software/discovar/blog). This library was sequenced on a single lane of an Illumina HiSeq2500 at the University of Oregon’s Genomics and Cell Characterization Core Facility (GC3F: https://gc3f.uoregon.edu/). We assembled the draft ninespine stickleback genome using DISCOVAR de novo.
Raw sequence read pairs were first quality filtered and adaptor sequence contamination removed using the program process_shortreads, which is included in the Stacks analysis pipeline59. We ran the genome assembly on the University of Oregon’s Applied Computational Instrument for Scientific Synthesis (ACISS: http://aciss-computing.uoregon.edu).
Alignment of RAD tags to the ninespine assembly
We incorporated the single ninespine stickleback haplotype into our sequence analyses by aligning a single phased threespine stickleback RAD haplotype from each locus to the ninespine genome assembly. For those that aligned uniquely (59,254 RAD loci), we used a custom Python script to parse the output BAM file60 and reconstruct the ninespine haplotype from the query sequence and alignment fields. The final dataset consists of 57,992 RAD loci that mapped to the 21 threespine stickleback chromosomes and aligned uniquely to the ninespine assembly.
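One way to recover the aligned reference (here, ninespine) sequence from a single alignment record is to combine the query sequence with the CIGAR string and the MD tag defined in the SAM specification. The sketch below illustrates that general technique; whether the authors' script used these particular fields is not stated, and the function name is hypothetical. It works on the sequence as stored in the record, so reverse-strand alignments would yield the reference on its forward strand.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
MD_RE = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def reference_from_alignment(query: str, cigar: str, md: str) -> str:
    """Rebuild the reference bases spanned by one alignment.

    query : read sequence as stored in the SAM/BAM record
    cigar : CIGAR string, e.g. "3M1I4M"
    md    : value of the MD tag, e.g. "2T4"
    """
    # Step 1: keep only query bases that occupy aligned (M, =, X) columns.
    aligned = []
    pos = 0
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":
            aligned.append(query[pos:pos + n])
            pos += n
        elif op in "IS":
            pos += n  # insertions and soft clips have no reference counterpart
        # D, N, H and P consume no query bases
    aligned_seq = "".join(aligned)

    # Step 2: walk the MD tag, swapping in reference bases where they differ.
    ref = []
    i = 0
    for match_len, deleted, mismatch in MD_RE.findall(md):
        if match_len:
            n = int(match_len)
            ref.append(aligned_seq[i:i + n])  # matching bases are shared
            i += n
        elif deleted:
            ref.append(deleted.upper())  # reference bases absent from the query
        else:
            ref.append(mismatch.upper())  # reference base at a mismatch
            i += 1
    return "".join(ref)


# Toy records: one with a mismatch and an insertion, one with a deletion
ref_a = reference_from_alignment("ACGTTAGC", "3M1I4M", "2T4")
ref_b = reference_from_alignment("ACGTA", "2M2D3M", "2^TT3")
```

In the first toy record the query carries a G where the reference has a T, plus one inserted base that is dropped; in the second, the two deleted reference bases are restored from the MD tag.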
Lineage sorting and time to the most recent common ancestor
Allelic divergence can occur by multiple modes of lineage sorting during adaptation. To identify patterns of lineage sorting associated with freshwater colonization, we analyzed gene tree topologies at all RAD loci using BEAST v. 1.739,61. We used blanket parameters and priors for BEAST analyses across all RAD loci. Markov chain Monte Carlo (MCMC) runs of 1,000,000 states were specified, and trees were logged every 100 states. We used a coalescent tree prior and the GTR+Γ substitution model with four rate categories and uniform priors for all substitution rates. We identified evidence of lineage sorting by using the program treeannotator to select the maximum clade credibility (MCC) tree for each RAD locus and the is.monophyletic() function included in the R package ‘ape’62. We determined for each MCC tree whether tips originating from the marine (RS) or freshwater (BL+BP) populations formed monophyletic clades.
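The monophyly test applied to each MCC tree can be illustrated with a small Python sketch. This is not the ape implementation: it assumes a rooted tree written as nested tuples with tip labels as strings, and the tip names are hypothetical. A set of tips is monophyletic if some clade in the tree contains exactly those tips and no others.

```python
from typing import FrozenSet, Iterable, List, Tuple, Union

Tree = Union[str, Tuple["Tree", ...]]  # a tip label or a tuple of subtrees


def clade_tip_sets(tree: Tree) -> List[FrozenSet[str]]:
    """Return the tip set of every clade in a rooted tree."""
    clades: List[FrozenSet[str]] = []

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            tips = frozenset([node])
        else:
            tips = frozenset().union(*(visit(child) for child in node))
        clades.append(tips)
        return tips

    visit(tree)
    return clades


def is_monophyletic(tree: Tree, tips: Iterable[str]) -> bool:
    """True if the given tips form a clade that excludes all other tips."""
    return frozenset(tips) in clade_tip_sets(tree)


marine = ["RS1", "RS2", "RS3"]
freshwater = ["BL1", "BL2", "BP1", "BP2"]

# Toy gene tree showing reciprocal monophyly of the two habitat types
sorted_tree = ((("RS1", "RS2"), "RS3"), (("BL1", "BP1"), ("BL2", "BP2")))
# Toy gene tree in which marine and freshwater haplotypes are interleaved
mixed_tree = ((("RS1", "BL1"), "RS2"), (("RS3", "BP1"), ("BL2", "BP2")))

reciprocal = is_monophyletic(sorted_tree, marine) and is_monophyletic(sorted_tree, freshwater)
```

In the first toy tree both habitat types are monophyletic, the pattern used above to define divergent loci; in the second neither is.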
To convert node ages estimated in BEAST into divergence times, in years, we assumed a 15 million-year divergence time between threespine and ninespine stickleback at each RAD locus38. The Tmrca of all alleles in each gene tree was set at 15 Mya, and each node age of interest was converted into years relative to the total height of the tree. Additionally, to use the ninespine stickleback as an outgroup, we required that threespine stickleback haplotypes at a RAD locus were monophyletic to the exclusion of the ninespine haplotype. Doing so reduced our analysis to 49,672 RAD loci for analyses included in Fig. 4 of the main text. RAD loci not showing this pattern of lineage sorting did not show evidence of a genome-wide correlation with marine-freshwater divergence and thus do not impact the assertions in the main text. We used medians of the posterior distributions as point estimates of Tmrca for each RAD locus. Because of the somewhat limited information from any single RAD locus, and because the nature of the genealogical process means that the true Tmrca at any locus likely differs from the 15 My estimate63-65, we do not rely heavily on Tmrca estimates at individual RAD loci. Rather, we use these estimates to understand broad patterns of ancestry throughout the threespine stickleback genome, both spatially along chromosomes and genome-wide.
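The rescaling described above amounts to expressing each node height as a fraction of the root height and multiplying by the 15-million-year calibration. The Python sketch below shows that arithmetic together with a posterior-median point estimate. The function names and toy values are hypothetical, and applying the conversion to each posterior sample before taking the median is an assumption about the order of operations.

```python
import statistics
from typing import Iterable, Tuple

CALIBRATION_YEARS = 15_000_000  # threespine-ninespine divergence assumed in the text


def node_age_years(node_height: float, root_height: float,
                   calibration_years: float = CALIBRATION_YEARS) -> float:
    """Convert a node height to years, with the root fixed at the calibration."""
    if root_height <= 0:
        raise ValueError("root height must be positive")
    if not 0 <= node_height <= root_height:
        raise ValueError("node height must lie between 0 and the root height")
    return node_height / root_height * calibration_years


def tmrca_point_estimate(posterior: Iterable[Tuple[float, float]]) -> float:
    """Median Tmrca in years over (node_height, root_height) posterior samples."""
    return statistics.median(node_age_years(n, r) for n, r in posterior)


# Toy posterior sample for one locus: heights of the threespine MRCA node
# and of the tree root, in arbitrary tree units
samples = [(0.20, 0.5), (0.25, 0.5), (0.15, 0.5)]
estimate = tmrca_point_estimate(samples)
```

For these toy values the threespine MRCA sits at 30% to 50% of the root height, giving ages between 4.5 and 7.5 million years and a median near 6 million years.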
We determined Tmrca outlier genomic regions by permuting and kernel smoothing the genomic distribution of Tmrca estimates using the same window sizes as we present in the main text. Windows where the actual Tmrca value exceeded 99.9% of permuted windows were considered outliers. This method controls for the local density of RAD loci (poorly sampled regions will have larger confidence bands) and the size of the windows used.
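A permutation test of this kind can be sketched as follows. The sketch is a simplified stand-in for the authors' procedure, under several assumptions that the text does not specify: a Gaussian kernel, permutation of Tmrca values among the observed locus positions, and evaluation at a fixed set of window centers. All names and data are hypothetical. Because locus positions are held fixed while values are shuffled, sparsely sampled windows naturally show a wider permuted distribution.

```python
import math
import random
from typing import List, Sequence


def kernel_weights(positions: Sequence[float], center: float,
                   bandwidth: float) -> List[float]:
    """Gaussian kernel weight of every locus relative to one window center."""
    return [math.exp(-0.5 * ((p - center) / bandwidth) ** 2) for p in positions]


def smoothed(values: Sequence[float], weights: Sequence[float]) -> float:
    """Kernel-weighted mean of the values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def outlier_windows(positions: Sequence[float], values: Sequence[float],
                    centers: Sequence[float], bandwidth: float,
                    n_perm: int = 1000, quantile: float = 0.999,
                    seed: int = 1) -> List[bool]:
    """Flag windows whose observed smoothed value exceeds at least
    `quantile` of the values obtained after shuffling among loci."""
    rng = random.Random(seed)
    weights = [kernel_weights(positions, c, bandwidth) for c in centers]
    observed = [smoothed(values, w) for w in weights]
    exceed = [0] * len(centers)
    shuffled = list(values)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between position and Tmrca
        for k, w in enumerate(weights):
            if observed[k] > smoothed(shuffled, w):
                exceed[k] += 1
    return [count / n_perm >= quantile for count in exceed]


# Toy chromosome: 100 evenly spaced loci with a block of old ancestry at 40-49
positions = list(range(100))
tmrca = [10.0 if 40 <= i < 50 else 4.0 + 0.1 * ((i * 7) % 5) for i in positions]
flags = outlier_windows(positions, tmrca, centers=[10, 45, 80], bandwidth=3.0)
```

Only the window centered on the block of elevated values is flagged in this toy example; the two background windows are not.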
Sequence diversity and haplotype networks
We quantified sequence diversity within and among populations and sequence divergence between populations using R (R Core Team66). We used the R package ‘ape’62 to compute pairwise distance matrices for all alleles at each RAD locus and used these matrices to calculate the average pairwise nucleotide distances, π, within and among populations along with dxy, the average pairwise distance between two sequences using only across-population comparisons67. We also calculated the haplotype-based Fst from Hudson et al.68 implemented in the R package ‘PopGenome’69. We used permutation tests written in R to identify differences in variation within- and between-habitat type at divergent RAD loci versus the genome-wide distributions. Mann-Whitney-Wilcoxon tests implemented in R were used to identify variation in genome-wide diversity among populations and habitat types.
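The three statistics can be written compactly in terms of pairwise distances between aligned haplotypes. The Python sketch below is illustrative rather than a reimplementation of the R code: it uses the raw proportion of differing sites as the distance (the distance model used with ‘ape’ is not stated in the text), and it computes a Hudson-style Fst as 1 - Hw/Hb, with Hw taken as the unweighted mean of the two within-population values, which is also an assumption.

```python
from itertools import combinations, product
from typing import Sequence


def p_distance(a: str, b: str) -> float:
    """Proportion of sites that differ between two aligned haplotypes."""
    if len(a) != len(b):
        raise ValueError("haplotypes must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pi(haps: Sequence[str]) -> float:
    """Mean pairwise distance among haplotypes within one group."""
    pairs = list(combinations(haps, 2))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


def dxy(haps_x: Sequence[str], haps_y: Sequence[str]) -> float:
    """Mean pairwise distance using only between-group comparisons."""
    pairs = list(product(haps_x, haps_y))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


def hudson_fst(haps_x: Sequence[str], haps_y: Sequence[str]) -> float:
    """Fst as 1 - Hw/Hb (within- over between-group mean distance)."""
    h_within = (pi(haps_x) + pi(haps_y)) / 2
    h_between = dxy(haps_x, haps_y)
    return 1 - h_within / h_between


# Toy alignment at one locus (hypothetical haplotypes)
marine = ["AAAA", "AAAT"]
freshwater = ["GGAA", "GGAT"]

pi_marine = pi(marine)
d_xy = dxy(marine, freshwater)
fst = hudson_fst(marine, freshwater)
```

In the toy alignment the two groups differ at two fixed sites and share one polymorphism, so dxy exceeds π within either group and Fst is well above zero.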
We constructed haplotype networks of the RAD loci at eda and atp1a1 using the infinite sites model with the function haploNet() in the R package ‘pegas’70. The atp1a1 network was constructed from a RAD locus spanning exon 15 of atp1a1 and including portions of introns 14 and 15 (chr1:21,726,729-21,727,381 [BROAD S1, v89]; chr1: 26,258,117-26,257,465 [re-scaffolding from Glazer et al.49]). The eda network spans exon 2 and portions of introns 1 and 3 of eda (chr4: 12,808,396-12,809,030).
Code availability
Scripts used to phase RAD tags, summarize gene trees, calculate population genetic statistics, and produce the figures and statistics presented in the paper are available at https://github.com/thomnelson/ancient-divergence. Scripts for processing raw sequence data are available from the authors upon request.
DATA AVAILABILITY
Raw sequence data supporting these findings are available on the Sequence Read Archive at PRJNAXXXXXX. The final datasets needed to reproduce the figures and statistics presented in the paper are available at https://github.com/thomnelson/ancient-divergence.
AUTHOR CONTRIBUTIONS
TCN and WAC conceived of the project and designed sampling, sequencing, and analysis. TCN prepared sequencing libraries, wrote software, and performed data analysis. TCN and WAC wrote the paper.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
ACKNOWLEDGEMENTS
We thank P. Phillips, M. Streisfeld, J. Postlethwait, and K. Sterner for valuable input and lively discussion throughout this project. We also thank K. Alligood, E. Beck, S. Bassham, M. Chase, M. Currey, M. Hahn, L. Fishman, C. Small, S. Stankowski, J. Willis, two anonymous reviewers, and members of the Cresko Lab and the Institute of Ecology and Evolution for advice and comments on previous versions of this manuscript. J. Postlethwait graciously donated ninespine stickleback tissue, collected under award XXXXXXXX. We acknowledge National Science Foundation awards NSF DEB 1501423 (WAC and TCN), NSF DEB 0949053 (WAC), and National Institutes of Health award NIH T32GM007413 (TCN).