Abstract
In the biological species concept, much of the genomes cannot be exchanged between species1,2. In the modern genic view, species are distinct as long as genes that delineate the morphological, ecological and reproductive differences remain distinct2. The rest (or the bulk) of the genomes should be freely interchangeable. The core of the species concept therefore demands finding out the full potential of introgressions between species. In a survey of two closely related mangrove species (Rhizophora mucronata and R. stylosa) on the coasts of the western Pacific and Indian oceans, we found that the genomes are well delineated in allopatry, echoing their morphological and ecological divergence. The two species are sympatric/parapatric in the Daintree River area of northeastern Australia. In sympatry, their genomes harbor 7,700 and 3,100 introgression blocks, respectively, with each block averaging about 3-4 Kb. These fine-grained and strongly-penetrant introgressions suggest that each species must have evolved many differentially-adaptive (and, hence, non-introgressable) genes that contribute to speciation. We identify 30 such genes, seven of which are about flower development, within small genomic islets with a mean size of 1.4 Kb. In sympatry, the species-specific genomic islets account for only a small fraction (< 15%) of the genomes while the rest appears interchangeable.
Introduction
The biological species concept (BSC) has been the gold standard of modern evolutionary biology1–3. In this concept, species are products of allopatric speciation during which geographical isolation ensures the absence of gene flow. BSC therefore makes clear assumptions about the genetics of species divergence, postulating that nearly the entire genome evolves as a cohesive unit1. In the lexicon of genetics, BSC postulates that the density of “speciation genes” is so high that every genomic segment would be precluded from introgressing across the species boundary. BSC is essentially a genomic concept of species2.
This genomic concept is plausible on phenotypic considerations because even closely-related species differ by a multitude of traits. When such traits are carefully dissected, each has been found to be highly polygenic4–9, implying extensive genetic divergence. A particularly instructive case is the spermatogenic programs in Drosophila. Among sibling species, the number of genes involved in hybrid male sterility, i.e., spermatogenic failure, is in the hundreds10–12. The cloning of the component genes further reveals a complex web of interactions13,14. This level of divergence means that selection, presumably sexual in nature, drives spermatogenic programs to evolve extremely rapidly15 with hybrid sterility being the incidental byproduct. Other “speciation traits” such as sexual isolation, genital morphology and neural development depict a genetic basis that is qualitatively similar16–18.
Therefore, the genetics of functional divergence between species, where rigorous dissection is possible, shows a surprising degree of cohesiveness postulated by BSC. On the other hand, because BSC demands the cohesiveness across the entire genome, the empirical observations could still fall far short of the requirement by BSC. This latter consideration has led to an alternative view.
In the alternative genic view, species are defined by a set of loci that govern the morphological, reproductive, behavioral and ecological characters. These “speciation genes” may, collectively, account for no more than a fraction of the genome10,12–14,19. This fraction should be fitness-reducing upon introgression, whereas the rest of the genome can be freely exchanged without a fitness consequence. In short, the diverging genomes comprise both introgressable and non-introgressable DNA segments. These non-introgressable segments are often referred to as “genomic islands”, which are, in theory, more divergent than the rest of the genome2,4,5,12,20–24. Based on this conjecture, genomic islands have been identified either by the relative level of divergence or by the absolute divergence4,20,22–29.
Assuming that the current evidence for “speciation with gene flow”30,31 is convincing (but see refs. 32,33), we ask what it takes to corroborate the genic concept of species. It requires determining the following: i) the proportion of the genome that is non-introgressable, ii) the number and size distribution of such non-introgressable segments and, in particular, iii) the genic content within these segments. Ideally, the genome of a species can be entirely replaced by that of a closely related species, except for the “speciation genes” themselves as shown in Fig. 1A. The dynamics of such replacement have been modeled in the “Recurrent Selection and Backcross (RSB)” theory34, but none of these questions has been answered by the current empirical approaches, which often do not have the resolution (see the legend of Fig. 1A).
In this study, we propose a different solution by comparing the same pair of species in allopatry vs. in sympatry. To realize the full introgression potentials, the sympatric taxa also have to meet a number of criteria including the stage of speciation, the timing of secondary contact and the duration of sympatry. Such opportunities may not be common. These criteria will be reviewed in Discussion after the results are presented.
Results
We study two closely related mangrove species, Rhizophora mucronata and R. stylosa35,36. Mangroves are woody plants that have colonized the intertidal zones of the tropical coasts35–37. Because of the narrow band of suitable habitats along the coasts (or near the river mouths), the global distributions of mangroves are essentially one dimensional, making them ideal for biogeographical studies. In particular, the genomes of R. mucronata and R. stylosa have been published36 and their speciation history has been analyzed35. Building on these previous analyses, this study surveys the allopatric and sympatric populations in their full ranges of distributions.
Rhizophora mucronata has a wide distribution in the Indo-Western Pacific (IWP), particularly to the west of the Strait of Malacca and all the way to East Africa. In contrast, R. stylosa differs by its extension eastward from the Strait of Malacca to the western Pacific Islands (Fig. 1B). The two species have been reported to overlap in scattered locales along a number of western Pacific coastlines. However, in our own field trips, their relative abundance is often skewed in favor of one species and the co-occurrence has been rarely found. The sole exception in our collection is in the Daintree River (DR) area of northeastern Australia, where both species are quite abundant (Fig. 1B). It is the evidence from this site of sympatry that is instructive about the genic makeup of species.
Genomic diversity within Rhizophora mucronata and R. stylosa, respectively
For genomic studies, 21 R. stylosa individuals from five locations (labeled s1-s5) and 31 R. mucronata individuals from seven locations (named m1-m7) were analyzed (Fig. 1B). Note that m1 and s1 designate the sympatric DR samples. All the samples are sequenced for the whole genome on the Illumina Hiseq 2000 platform, yielding a mean depth of 15X (ranging from 11X to 22X) (Supplementary Table S2 and Table S3). The short reads of each individual are mapped to the reference genome of R. apiculata36, with a genomic coverage of 81% (79% - 82%) (Table S1). The level of genetic diversity shows two patterns. Low genetic diversity is found in all allopatric populations (average θπ at 0.44 and 0.40 per Kb for R. mucronata and R. stylosa, respectively) and the level is much higher in the sympatric DR populations (θπ = 1.05/Kb and 1.22/Kb, respectively). The Watterson’s estimates (θw) are similar (Table S1) (see Materials and Methods).
Divergence between the two species in allopatry
We first constructed a Maximum Likelihood (ML) tree using RAxML on the sequences of the 31 R. mucronata and 21 R. stylosa individuals from the 11 populations38. The ML tree bifurcates with a clear delineation between species across all allopatric populations. However, the m1 and s1 (i.e. DR) samples show strong signs of admixture as they are “in the middle” of the bifurcated tree (Fig. 1B). When the DR samples are removed, the phylogeny shows clear delineation (Fig. 1C). Both trees are robust when rebuilt using the ML method in IQ-TREE39 or the Neighbour-Joining (NJ) method in MEGA7 (Supplementary Fig. S1 and S2)40. The monophyletic delineation of R. mucronata and R. stylosa in allopatry is also supported by the principal component analysis (PCA) (Supplementary Fig. S3)41.
In total, 1.2 million variable sites are detected across all populations of the two species. We first partition these sites by excluding the DR samples (see Materials and Methods). Each site is then represented by an FST value with FST = 0 indicating no differentiation between the two species in allopatry and FST = 1 indicating complete differentiation. Figure 1D shows the U-shaped distribution whereby the abundance of sites at the far right reveals the extensive differentiation between species. Such a U-shape distribution is typical of species of some divergence with little gene flow3.
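The per-site FST scale described above can be illustrated with a short Python sketch. The estimator shown, (HT - HS)/HT computed from the two allele frequencies with equal population weights, is an assumption made for illustration; the text does not specify which estimator was applied, and the function name `site_fst` is hypothetical.

```python
def site_fst(p1, p2):
    """Per-site F_ST between two populations from allele frequencies p1 and p2.

    Assumed estimator: (H_T - H_S) / H_T with equal population weights.
    This is an illustrative choice, not necessarily the one used in the study.
    """
    # Mean within-population expected heterozygosity
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # Expected heterozygosity of the pooled population
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        # Invariant site: no differentiation can be measured; reported as 0 here
        return 0.0
    return (h_t - h_s) / h_t
```

Under this assumed estimator, a site fixed for alternative alleles in the two species gives FST = 1, and a site with identical frequencies gives FST = 0, matching the two ends of the U-shaped distribution.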
Morphologically, the two species are distinguished by style length37,42, as pictured in Fig. 2A. The morphological differences between R. stylosa and R. mucronata across populations are shown in Fig. 2C. R. mucronata is readily distinguished by its short style, in the range of 0.9-1.6 mm (Fig. 2A). In contrast, the style of R. stylosa is long, 2.4-5.3 mm (Fig. 2A) with no overlap between the two species (Fig. 2C)37,42. While the style length varies from locale to locale in both species, this trait is a species-diagnostic one across locales. The two species also show different habitat preferences with R. mucronata usually found upriver while R. stylosa is close to the river mouth (Fig. 2D). Additional diagnostic morphological characters, which are less stable, are listed in the Supplement (Table S4).
Characterizations of R. stylosa and R. mucronata in sympatry (the DR samples)
For the sympatric DR samples, which appear admixed in their DNA sequences, the morphological characters remain distinct. The style length of each sample is concordant with that of the allopatric populations of the same species (Fig. 2B). In fact, the two species in all sympatric populations can be clearly delineated by this character (see Fig. 2C). In the DR area, these two species are parapatric-sympatric with distributions up- or down-river and extensive overlaps in the middle (Fig. 2D). This difference in habitat preference is seen in all locales42,43.
Corroborating the phylogenetic positions of the DR samples in Fig. 1B, we use the Bayesian clustering analysis, ADMIXTURE44. The analysis identifies two genetic components that make up the genomes of the DR samples (Fig. 3A and Supplementary Fig. S4). PCA results also indicate significant admixture in m1 and s1 individuals (Supplementary Fig. S3). Furthermore, because species divergence is monophyletic in all allopatric comparisons, incomplete lineage sorting as the cause of the observed admixture in the DR samples is rejected. In short, we interpret the high FST sites as manifesting the divergence after speciation (Fig. 1D) with subsequent admixture in the DR area. Additional tests of introgression (LD analysis, D statistic and the modified fd statistic) are presented in the Supplement (Tables S5-S6 and Figs. S5-S6).
Extensive introgressions in sympatry
For the two species in sympatry in the DR area, we ask the following questions: 1) How many introgressed segments can be found in each species? 2) Is the introgression symmetric between the two species? 3) How fine-grained are the introgressed segments (i.e., many small segments or a few large ones)? A few large blocks are expected after recent hybridizations but many fine-grained blocks may result from old introgressions that have been eroded by recombination. If true, introgression (and hence speciation itself) might be a prolonged process. 4) How many genomic segments are non-introgressable and what are their genic contents? Question 4 will be the subject of the next section.
To quantify the introgressions between R. stylosa and R. mucronata within the DR area, we use the divergent sites of the non-DR samples. There are 228,778 sites with FST > 0.8 between the two species, now referred to as d-sites (d for divergence). Note that the bulk of d-sites (163,089 sites) are fully divergent with FST = 1.0 outside the DR area (Fig. 1D). Furthermore, a fraction of the d-sites are introgression sites in the DR samples (referred to as i-sites). An i-site is where the introgressed allele (or i-allele) is found in >= n of the 10 genomes. (Note that both m1 and s1 samples have five diploid individuals, or 10 genomes; see Materials and Methods). We further impose a condition that the reciprocal introgression of the i-allele can happen at most once (<= 1) in the samples. This second condition is less crucial since high-frequency introgressions in both directions are rare.
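The two conditions that define an i-site can be written as a small Python sketch. It is a minimal illustration rather than the pipeline used in this study; the function name `classify_site` and its count-based inputs are assumptions. Each count is the number of introgressed alleles (i-alleles) observed among the 10 sampled genomes of one DR population.

```python
def classify_site(i_count_m1, i_count_s1, n=8, max_reciprocal=1):
    """Classify one d-site from i-allele counts in the DR samples.

    i_count_m1: R. stylosa alleles among the 10 m1 (R. mucronata) genomes.
    i_count_s1: R. mucronata alleles among the 10 s1 (R. stylosa) genomes.
    n: minimum i-allele count for an i-site (first condition in the text).
    max_reciprocal: maximum i-allele count allowed in the other species
        (second condition in the text).
    """
    if i_count_m1 >= n and i_count_s1 <= max_reciprocal:
        return "i-site in m1"
    if i_count_s1 >= n and i_count_m1 <= max_reciprocal:
        return "i-site in s1"
    return "not an i-site"
```

For example, `classify_site(9, 0)` is an i-site in m1, whereas `classify_site(9, 5)` is rejected because the reciprocal count exceeds one.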
It is necessary to set n close to the maximum of 10 for strongly penetrant introgressions. Fig. 3B shows the level of introgressions in the two directions. We set n = 8 for the m1 samples where the i-allele is usually found >= 8 times (the orange bars in Fig. 3B). Hence, the results with n= 2 and n = 8 would not be very different. Furthermore, to avoid the confounding presence of remnant ancient polymorphisms, we require introgressions at an i-site to be strongly asymmetric: >= n one way (say, from R. stylosa to R. mucronata) and <= 1 in the reciprocal direction (Supplementary Fig. S7). For the R. stylosa (s1) samples, the occurrence of the i-allele is rather even between 2 and 10 (the green bars in Fig. 3B). The asymmetry is probably due to the geography of the DR area, which is at the fringe of the R. mucronata distribution. Consequently, gene flow from R. mucronata into R. stylosa may be more limited here, resulting in the lower frequency of introgressions in the s1 samples. In this regard, setting n = 8 would miss many introgressions in R. stylosa leading to a much lower introgression rate than in R. mucronata. Nevertheless, the final estimations appear robust even when n is set as low as 2 (see below). Simulations of these scenarios are presented in the Supplement.
Obviously, introgressions do not happen site-by-site, but appear as long segments of DNA consisting of consecutive i-sites. We shall label these segments “introgression blocks” (or i-blocks). Fig. 4A shows a segment of the genome that comprises a string of d-sites, some of which are i-sites as defined above. These d-sites and i-sites are embedded in a background of low-FST or invariant sites (FST <= 0.8). This figure shows 3 i-blocks, each consisting of one, two or three i-sites. The length of each block is defined by the distance between the two breakpoints flanking the block. Unless specified, we remove the singleton i-blocks that harbor only a single i-site when presenting the length distribution of i-blocks.
The analysis of i-blocks is summarized in Table 1. We shall focus on the results with n = 8 but the results of n = 2 and n = 10 are given for comparison. In the DR area, samples of R. mucronata (m1) harbor far more introgressions than those of R. stylosa (s1). The bottom of Table 1 at n = 8 shows that 15.8% or 23.4% of the R. mucronata genomes are introgressions from R. stylosa, the two values depending on whether singleton i-blocks are counted. In the opposite direction, 7.8 – 11.3% of the R. stylosa genomes are introgressions. The introgressions of Table 1 can be visualized in Figs. 4–5. The salient observation is the highly fine-grained nature of the introgressions. In R. mucronata, the introgressions are distributed over 7,714 i-blocks with an average length of 3.40 Kb. In R. stylosa, there are 3,070 i-blocks with an average size of 4.21 Kb. Over evolutionary time, numerous recombination events must have broken the introgressions into thousands of tiny i-blocks.
It should be noted that Table 1 and Figs. 4–5 present the extreme cases of introgressions that rise to very high frequencies in a non-native genomic background. Because introgressions happen in both directions, beneath these highly penetrant i-blocks are many more introgressions that do not meet the stringent criteria (see the next section for more details).
The distributions of i-blocks are shown at the large genomic scale in Fig. 4B, at the scaffold scale in Fig. 4C and as individual sites in Fig. 5A–5C. Note that only d-sites and i-sites are portrayed in these figures, which convey the visual impression of the fine-grained nature of the i-blocks. (As shown in Fig. 1D, the d- and i-sites are the 228,778 sites with FST > 0.8; the rest are invariant and lowly divergent sites.) Specifically, the i-blocks are dispersed across the whole genome (Fig. 4B and Supplementary Fig. S9). Indeed, 93 (in s1 genomes) and 96 (in m1 genomes) of the top 100 scaffolds harbor the switching between i- and d-blocks (Table 1). Figure 4C shows that the switching between i- and d-blocks can occur in a few to tens of Kbs. At the site level, i-blocks and d-blocks can switch within a small distance (Fig. 5A–5C). An i-block (or d-block) may harbor only one i-site (or d-site), referred to as singleton block (Table 1 and Supplementary Table S7). Singleton blocks, not uncommon but less reliable, are not used in the tally.
The extensive fine-grained introgressions convey two messages. First, hybridizations may happen continually over a long span of time. Each hybridization event would initially bring in whole-chromosome introgressions that are subsequently broken down by recombination. Small DNA fragments may have been introgressed in this piece-meal manner continually. Second, loci of differential adaptation between species may be very common such that introgressions tend to be small, and thus free of the introgressed alleles that are deleterious in the genetic background of another species45. In the next section, we will direct the attention toward non-introgressions, which are blocks of native alleles flanked by introgressed DNA segments.
Very fine-grained interspersion between “introgressable” and “non-introgressable” blocks
Some DNA segments may not be introgressable due to the presence of genes of adaptive differences. Such loci, by definition, contribute to reproductive isolation or ecological speciation2,46 and have sometimes been referred to as “speciation genes”12,13,19,47–49. The number, size and direction of introgressions are therefore functions of a number of parameters: 1) the rate of hybridization; 2) the strength of selection against the speciation genes when introgressed; 3) the number and location of speciation loci; 4) the rate of recombination that frees neutral genes from the linkage to speciation genes; and 5) the length of time since the initial hybridization.
To probe the influences of these parameters, we carry out computer simulations based on the Recurrent Selection and Backcross (RSB) model (see Luo et al., 200234 and the Materials and Methods). The RSB model is proposed for identifying genes of complex traits34. In its execution, one dilutes the genome of breed A (say, the bull dog) with that of breed B (e.g., the border collie) but retains all the desired phenotypic traits of the former. This is done by continually selecting for the traits of breed A while backcrossing the culled products to breed B. The scheme is almost identical with the process of “speciation with gene flow” in their model structure. They differ only in the parameter values; for example, the length of time in speciation is far larger and the gene flow is much smaller, and often bidirectional as well. The differences entail separate simulations for speciation with gene flow.
One particular scenario of speciation with introgression is simulated in Fig. 5D–5E, where two speciation loci, at position 51 and 71, are assumed (see legends). In this demonstration, the introgression occurs in one direction only. At generation 1000, extensive admixture is evident but, as the process continues to generation 10,000, the genome is almost entirely replaced by the introgressed alleles (shown in blue). Importantly, the two speciation loci (shown in pink) resist replacement and bring with them traces of nearby native segments. In this scenario, the selection against introgression is strong (s = −0.05) and the recombination rate for a 100Kb simulated sequence is high (r = 1).
Additional scenarios are presented in Fig. S11A, which shows that a lower recombination rate (r = 0.1) would increase the size of the non-introgressed DNA segments, because the neutral genes near positions 51 and 71 are selected against along with the speciation loci. Figures S11B and S11C show that a reduced selection intensity (s= −0.01) or a 10-fold higher introgression rate would give rise to extensive introgressions. Interestingly, partial introgressions are detected even at positions 51 and 71, where selection acts against the invading alleles. The simulations suggest that, given the right parameter values, the pattern of introgression would follow exactly the prediction based solely on selection, whereby only the alleles of the speciation loci cannot be introgressed. The rest of the genome, even right next to the speciation loci, is freely shared between species.
In the previous section, we define introgressions as the invading DNA segments that rise to 0.8 – 1.0 in frequency (i-blocks), conditional on the reciprocal direction being 0 – 0.1 in frequency. In this section, we focus on the fraction of sites in the genome where the introgression frequency is <= 0.1 in either direction, referred to as j-sites (j-site is used as the antonym of i-site). Due to the various patterns of introgressions (Fig. 5D–5E, Fig. S7 and Fig. S11), a large grey area of partial introgression exists in frequencies between 0.1 and 0.8. Many such DNA segments are introgressable but have not risen to a high frequency whereas other segments may harbor non-introgressable loci but reach an appreciable frequency transiently. Therefore, while we will use the stringent low cutoff of 0.1 for identifying putative “speciation genes”, we use the much less stringent cutoff of <= 0.2 for evaluating the size of the non-introgressable genomes. Fig. S7 shows a total of 228,778 sites, among which 31,564 have the i-allele at <= 0.2 in both directions. Counting the sites alone, we non-conservatively estimate the fraction of non-introgressable genomes to be < 15% (i.e., 31.6/228.8 = 13.8% < 15%) as stated in the abstract. At present, showing that > 85% of the genomes should be introgressable is the best that we can do.
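The arithmetic behind the < 15% bound can be reproduced directly from the two site counts quoted above. The variable names in this short Python check are illustrative only.

```python
# Site counts quoted in the text (Fig. S7)
D_SITES = 228_778        # all d-sites (FST > 0.8 between allopatric populations)
LOW_BOTH_WAYS = 31_564   # d-sites with i-allele frequency <= 0.2 in both directions

# Upper-bound estimate of the non-introgressable fraction, counting sites alone
non_introgressable = LOW_BOTH_WAYS / D_SITES
introgressable = 1 - non_introgressable

print(f"non-introgressable: {non_introgressable:.1%}")
print(f"introgressable:     {introgressable:.1%}")
```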
With j-sites defined above, a j-block is defined as a DNA segment containing >= 2 j-sites. Of particular interest within j-blocks are the coding genes that have at least one j-site (Table 2 and Supplementary Table S10). By these stringent criteria, there are only 159 j-blocks which together account for < 0.1% of the genome (Table 2). While only 30 genes containing j-sites are found in these j-blocks (Table 2 and Supplementary Table S10), it is remarkable that 7 of the 30 genes function in flower development and/or gamete production as shown in Table 2 (see the WEGO gene ontology in Supplementary Fig. S10, where a larger set of genes is presented under less stringent criteria). Two of the seven genes regulate flowering period, which is later and shorter in R. stylosa than in R. mucronata50. Mutants of RA_08689 (encoding MRG family protein) and RA_19120 (known as SPF1) exhibit a late-flowering phenotype in Arabidopsis51,52. RA_11619 and RA_19120 are involved in female gametophyte development53,54. RA_08689, RA_10417, RA_13641, RA_19120 and RA_20369 all play a role in pollen germination, pollen tube growth and cotyledon development55–57. In particular, RA_08699 (known as LFR) is required for all stages of pollen development58,59 and the null allele of LFR is male-sterile in A. thaliana59. Since all seven genes contain highly differentiated amino acids and non-introgressable sites (j-sites) (Table 2, Supplementary Table S10 and Fig. S12), their involvement in the speciation between R. mucronata and R. stylosa seems plausible.
Discussion
The species of R. mucronata and R. stylosa in the DR area are unusual, or possibly unique, among sympatric species reported in the literature37,60,61 as explained below. These features may be the primary reason that they are fairly close to corroborating the genic view of species, whereby a small fraction of the genomes delineate species.
The DR populations stand out even among comparisons between these two species in other locales, where the sympatric species remain clearly delineated in their genomic sequences. For example, the m2/s2 collections from Singapore both show the expected phylogenetic relationship of their species designation (Fig. 1B). This expected pattern is consistently found in other locales of sympatry: in Brandan, Indonesia60, in Panay Island, Philippines, in Kosrae, Micronesia, in Yap, Micronesia and in North Sulawesi, Indonesia61. The two species in sympatry outside of the DR area occasionally show a slight tendency of being “on the fringe” of their phylogenetic cluster, thus suggesting low-level introgressions. Nevertheless, the extensive fine-grained introgression observed in the DR samples has not been reported before. Importantly, Yan et al. (2016)61 did notice samples from northeastern Australia (from Trinity Inlet and Daintree River, Queensland) to be different without further clarification.
The near absence of prior reports of fine-grained introgression is understandable as several conditions have to be met for this phenomenon to be realized. The first condition is that the two diverging populations have to be in the right stage when they first come into contact. This stage roughly corresponds to Stage III defined in Wu (2001)2 whereby speciation is nearly complete but gene flow is still possible. Had the secondary contact happened before this stage (in Stage I or II), the process of speciation could be arrested or even reversed. On the other hand, if the contact starts too late, there would be too little gene flow to give rise to the extensive introgressions observed in the DR area.
Among the locales of sympatry reported for R. mucronata and R. stylosa, northern Australia has been suggested to be where the two species came into contact in their incipient stage of speciation37. In this view, the two diverging taxa moved eastward across the southern Indian Ocean to Australia37. They then dispersed north from Australia before spreading east- and westward. By this time, the two species may have been too divergent to experience gene flow, thus explaining their clean phylogenetic relationship in sympatry in these other locales.
The second condition may be even more difficult to satisfy – that the two species need to remain in contact for a long period of time after establishing the secondary contact62. As discussed, numerous recombination events accumulated over a long period of time are necessary to achieve the fine-grained introgression. Continual gene flow also prevents further build-ups of functional divergence that would lead to the complete cessation of gene flow2.
The third condition is ecological. Two sympatric species without niche separation would face the problem of competitive exclusion63, making long-term coexistence unlikely. R. mucronata and R. stylosa have evolved a degree of niche separation that results in limited overlaps in habitat preference (Fig. 2D). Given the necessary confluence of all these conditions, R. mucronata and R. stylosa in the DR area may be truly exceptional.
In conclusion, non-introgressable DNA segments, or genomic islands, are often portrayed to be large segments of the genome. Instead of a few large “genomic islands”4,5,20,22,23,29, we observe in the DR samples a large number of tiny islets. In a previous study, the genomic island surrounding the speciation gene, Odysseus, is indeed found to be < 2 Kb64 and, hence, a veritable islet.
Small introgressions are obviously conducive to the identification of genes driving the adaptive divergence (or speciation genes) as only a few candidate genes are involved (see Table 2 and Supplementary Table S10). Finally, the contrast between large islands and small islets is important for understanding the process of speciation. The simple model presented above is intended to probe the relative importance of various parameters such as the introgression rate, selection intensity and length of hybridization. Realistic predictions will require careful measurements of all relevant parameters. In particular, measuring the selection intensity against “speciation genes” may reveal how selection drives speciation at the molecular level2,12,45,48,65–68.
Materials and Methods
Sampling and genome re-sequencing
To make the samples of R. mucronata and R. stylosa more representative, we collected individuals both in allopatry and sympatry in the Indo-West Pacific region (Fig. 1). We re-sequenced 31 R. mucronata individuals from seven populations and 21 R. stylosa individuals from four populations (Fig. 1 and Supplementary Tables S1-S3). To tell apart the two species by morphology, we observed the style length and shape in the bud and took photos (Fig. 2). Fresh leaves were sampled from individual trees and dried with silica gel. DNA isolation was done following the CTAB method69. Short-read libraries with an insert size of 350 bp were constructed following the TruSeq DNA Sample Preparation Guide and sequenced on the Illumina HiSeq 2000 platform. We obtained high quality sequencing data for each individual genome, with coverage in the 12 to 22X range (Supplementary Tables S2-S3).
SNP calling and genetic diversity detection
We used the Genome Analysis Toolkit (GATK)70 to call variants. Filtered reads from all 52 individuals were mapped to the R. apiculata reference genome using the Burrows-Wheeler Aligner (BWA)71. Our reference is the de novo genome sequence of R. apiculata36. SAMtools was used to import, sort, and pair BAM files and to remove duplicates. To obtain high quality variants, only SNPs called by both GATK and SAMtools/bcftools were retained72. To remove low-quality variants, we eliminated all loci that had base quality (Q) or mapping quality (q) smaller than 20. We additionally applied the following filters: 1) at least two reads had to support the minor allele to call a heterozygote; 2) homozygous SNPs were only retained if read depth was at least 2. After filtering, we selected these high quality sites for further analyses, with multi-allelic (>=3) sites, insertions, and deletions excluded. To estimate genetic diversity in each population, we calculated θw (Watterson’s θw) and θπ (Nei and Li’s θπ)73,74 within each population (Supplementary Table S1). To estimate genomic divergence between R. mucronata and R. stylosa populations, we calculated the genetic differentiation coefficient (FST) (Fig. 1D)74,75.
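For readers unfamiliar with the two diversity estimators, the following Python sketch shows their textbook forms for biallelic SNPs. It assumes complete genotype data, which the actual analysis would need to relax; the function names and the representation of each SNP as an alternate-allele count are assumptions for illustration, not the software used here.

```python
def watterson_theta(num_segregating, n_chrom, seq_len):
    """Watterson's theta per site: S / a_n / L, with a_n = sum_{i=1}^{n-1} 1/i."""
    a_n = sum(1.0 / i for i in range(1, n_chrom))
    return num_segregating / a_n / seq_len


def nucleotide_diversity(alt_counts, n_chrom, seq_len):
    """theta_pi per site: mean number of pairwise differences divided by L.

    alt_counts: for each biallelic SNP, the number of sampled chromosomes
        (out of n_chrom) that carry the alternate allele.
    """
    pairs = n_chrom * (n_chrom - 1) / 2
    # A SNP with k alternate copies differs in k * (n - k) of all pairs
    total = sum(k * (n_chrom - k) / pairs for k in alt_counts)
    return total / seq_len
```

Multiplying either result by 1,000 gives the per-Kb values in the units reported in the Results.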
Detecting gene flow
We applied Patterson’s D statistic and a modified fd statistic to quantify gene flow76,77. A positive D or fd value is an indicator of introgression (Supplementary Fig. S8 and Table S6). The basic model has three ingroups (P1, P2, and P3) and the outgroup (O) in the genealogical relationship (((P1,P2),P3),O). In our analysis, P1 and P2 are different populations from the same species R. mucronata (or R. stylosa), while P3 corresponds to the other species. The outgroup is the reference R. apiculata36. Positive D values imply that P2 and P3 have more shared alleles than P1 and P3 (see Supplement Table S6 and Fig. S6). A software package (plink-1.07) was used to estimate linkage disequilibrium (LD), represented by the r2 statistic within each population or group (Supplementary Fig. S5)78. LD decay was used to test for the presence of admixture events. We also calculated LD decay in sympatric populations in Singapore (s2 and m2) and allopatric R. mucronata and R. stylosa populations (Supplementary Fig. S5) as controls.
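The logic of the D statistic for the tree (((P1,P2),P3),O) can be sketched as follows. This Python example uses the commonly cited frequency-based ABBA-BABA form; the input format, a list of derived-allele frequencies per site, and the function name are assumptions, and the modified fd statistic is not shown.

```python
def patterson_d(sites):
    """Patterson's D from derived-allele frequencies.

    sites: iterable of (p1, p2, p3, p_out) tuples, one per site, giving the
        derived-allele frequency in P1, P2, P3 and the outgroup O.
    Returns (ABBA - BABA) / (ABBA + BABA), or NaN if no site is informative.
    """
    abba = 0.0
    baba = 0.0
    for p1, p2, p3, p_out in sites:
        abba += (1 - p1) * p2 * p3 * (1 - p_out)   # P2 and P3 share the derived allele
        baba += p1 * (1 - p2) * p3 * (1 - p_out)   # P1 and P3 share the derived allele
    if abba + baba == 0:
        return float("nan")
    return (abba - baba) / (abba + baba)
```

An excess of ABBA over BABA sites yields D > 0, i.e. P2 shares more derived alleles with P3 than P1 does, which is the pattern interpreted as introgression in the text.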
Genomic scan for introgressed and non-introgressable blocks
We used four predefined taxa: m1 (R. mucronata population in Daintree River), s1 (R. stylosa population in Daintree River), Mallo (allopatric R. mucronata populations m2-m7), and Sallo (allopatric R. stylosa populations s2-s4). To obtain a more informative data set, we filtered out sites with too many missing genotypes in any taxon or with low divergence (FST <= 0.8) between Mallo and Sallo. We retained 228,778 SNPs with FST > 0.8 between Mallo and Sallo, which we call divergent sites (d-sites). Of these d-sites, 163,089 are fixed (FST = 1.0 and Dxy = 1.0) between Mallo and Sallo. Each d-site in an individual has four possible states: homozygous R. mucronata variant (M type or MM), homozygous R. stylosa variant (S type or SS), heterozygote (MS), or missing data.
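The d-site filter can be sketched as follows. The text does not say which FST estimator was used, so this example assumes a simple per-site (HT - HS)/HT form with equal weighting of the two groups; the function names and the dictionary layout are hypothetical.

```python
def per_site_fst(p_m, p_s):
    """Per-site FST = (HT - HS) / HT for two equally weighted groups.

    p_m, p_s: frequency of the same allele in Mallo and Sallo.
    This simple estimator is an assumption made for illustration only.
    """
    p_bar = (p_m + p_s) / 2.0
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    h_s = (2.0 * p_m * (1.0 - p_m) + 2.0 * p_s * (1.0 - p_s)) / 2.0
    return (h_t - h_s) / h_t if h_t > 0 else 0.0


def per_site_dxy(p_m, p_s):
    """Probability that one allele drawn from each group differs."""
    return p_m * (1.0 - p_s) + p_s * (1.0 - p_m)


def select_d_sites(freqs, fst_min=0.8):
    """Keep positions whose FST exceeds fst_min.

    freqs: dict mapping position -> (p_m, p_s).
    """
    return sorted(pos for pos, (p_m, p_s) in freqs.items()
                  if per_site_fst(p_m, p_s) > fst_min)


if __name__ == "__main__":
    toy = {100: (1.0, 0.0), 250: (0.9, 0.1), 400: (0.5, 0.5)}
    print(select_d_sites(toy))
```

Under this estimator a fixed difference gives FST = 1.0 and Dxy = 1.0, matching the definition of fixed d-sites above.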
We then looked for introgressed sites (i-sites) and non-introgressable sites (j-sites) among all the d-sites across the m1 and s1 genomes. We have five diploid individuals (10 genomes) from the m1 and s1 populations. We define the allele and site classes as follows. Introgressed allele (i-allele): an R. stylosa variant in the m1 population or an R. mucronata variant in the s1 population. i-site: a d-site in the m1 or s1 genomes with >= 8 occurrences of the i-allele among the 10 genomes (Fig. 3B and Supplementary Fig. S7). j-site: a d-site with <= 1 occurrence of the i-allele in both the m1 and s1 populations (Fig. 3B and Supplementary Fig. S7). i-block: a genomic block in one species is considered to be introgressed from the other species if it contains one or more consecutive i-sites that are not interrupted by other d-sites (Fig. 4A). The boundaries of an i-block are placed at the midpoints of the intervals between its outermost i-sites and the flanking d-sites (as shown in Fig. 4A). j-block: a genomic block containing one or more consecutive j-sites, with boundaries defined in the same way as for i-blocks.
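The site and block definitions above can be sketched in a few lines. The example assumes complete genotypes for all 10 genomes and sorted d-site positions on a single chromosome; a block at the end of a chromosome, which has no flanking d-site, is closed at its outermost site, which is an assumption not stated in the text. All names are hypothetical.

```python
def classify_site(i_count_m1, i_count_s1, i_min=8, j_max=1):
    """Classify one d-site from its i-allele counts (out of 10 genomes)."""
    return {
        "i_site_m1": i_count_m1 >= i_min,
        "i_site_s1": i_count_s1 >= i_min,
        "j_site": i_count_m1 <= j_max and i_count_s1 <= j_max,
    }


def call_blocks(positions, flags):
    """Merge consecutive flagged d-sites into blocks.

    positions: sorted d-site coordinates on one chromosome.
    flags:     True where the d-site is an i-site (or a j-site).
    Each boundary is the midpoint between the outermost flagged site and
    the adjacent unflagged d-site. With no flanking d-site, the boundary
    is the flagged site itself (an assumption).
    """
    blocks = []
    n = len(positions)
    i = 0
    while i < n:
        if not flags[i]:
            i += 1
            continue
        j = i
        while j + 1 < n and flags[j + 1]:
            j += 1  # extend the run of consecutive flagged sites
        left = (positions[i - 1] + positions[i]) / 2 if i > 0 else positions[i]
        right = (positions[j] + positions[j + 1]) / 2 if j + 1 < n else positions[j]
        blocks.append((left, right))
        i = j + 1
    return blocks


if __name__ == "__main__":
    pos = [100, 200, 300, 400, 500]
    is_i_site = [False, True, True, False, True]
    print(call_blocks(pos, is_i_site))
```

The same `call_blocks` function serves for j-blocks when the flags mark j-sites instead of i-sites.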
Simulations of genomic sequences under hybridization, selection, and recombination
To probe the influences of hybridization, selection, and recombination on genomic sequences, we carried out computer simulations based on the Recurrent Selection and Backcross (RSB) model (see Luo et al., 2002). We set high and low levels for each parameter. Population size was set at 1000. The simulated sequences are 100 kb long (for convenience, 1 kb is the basic unit that cannot be separated by recombination). The original allele in the sequence and an i-allele from the other species are labeled differently. Hence, at the beginning of the simulations, the sequences of all individuals are entirely in the original-allele state (100 x). After several generations of hybridization, selection, and recombination, the sequences become shuffled (Fig. 5D–5E and Supplementary Fig. S11).
We first set a low hybridization (introgression) rate (1/1000 per generation) and a low recombination rate (10^-6 per generation between adjacent base pairs). In every generation, 999 individuals are drawn from the original population and one from the other population (or species). The recombination probability (r) for a 100 kb sequence is therefore about 0.1 (10^-6 x 100,000 bp). Since the population size is 1000, an average of 100 individuals undergo recombination in each generation. Two negatively selected loci (#51 and #71) are defined in the simulated sequences. If one or both of these loci harbor an i-allele, the relative fitness of the sequence is 0.95 (Supplementary Fig. S11A) or 0.99 (Supplementary Fig. S11B). We also examined a high introgression rate regime (10/1000). In this case, four loci (#41, #51, #71 and #76) were set as negatively selected (relative fitness = 0.95 for an i-allele) (Supplementary Fig. S11C). Finally, we simulated genomic sequences under a high recombination rate (10^-5 per generation between adjacent base pairs, r = 1.0 for a 100 kb simulated sequence) and a low introgression rate (1/1000 per generation). Two loci (#51 and #71) were negatively selected (relative fitness = 0.95 for an i-allele) (Fig. 5D–5E and Supplementary Fig. S11D-S11F).
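The low-introgression, low-recombination regime can be illustrated with a deliberately simplified sketch. It is a haploid, Wright-Fisher-style toy model with one crossover at most per sequence per generation, not an implementation of the RSB model; the function names, parameter names, and the default number of generations are hypothetical.

```python
import random


def simulate(pop_size=1000, n_units=100, generations=50, migrants=1,
             recomb_prob=0.1, selected=(51, 71), fitness=0.95, seed=0):
    """Simulate sequences of n_units 1-kb units (0 = original, 1 = i-allele).

    Each generation: (1) residents are sampled with weights given by
    relative fitness, (2) each sampled sequence recombines with a random
    resident with probability recomb_prob, (3) `migrants` sequences made
    entirely of i-alleles are added.
    """
    rng = random.Random(seed)
    pop = [[0] * n_units for _ in range(pop_size)]
    sel_idx = [s - 1 for s in selected]  # loci are numbered from 1 in the text
    for _ in range(generations):
        # Fitness is reduced if any selected locus carries an i-allele.
        weights = [fitness if any(seq[k] for k in sel_idx) else 1.0
                   for seq in pop]
        residents = rng.choices(pop, weights=weights, k=pop_size - migrants)
        new_pop = []
        for seq in residents:
            child = list(seq)
            if rng.random() < recomb_prob:
                mate = rng.choice(residents)
                cut = rng.randrange(1, n_units)  # breakpoint between units
                child = child[:cut] + mate[cut:]
            new_pop.append(child)
        new_pop.extend([1] * n_units for _ in range(migrants))
        pop = new_pop
    return pop


def i_allele_frequency(pop):
    """Frequency of the i-allele at each 1-kb unit."""
    n = len(pop)
    return [sum(seq[k] for seq in pop) / n for k in range(len(pop[0]))]


if __name__ == "__main__":
    freqs = i_allele_frequency(simulate())
    print(freqs[50], freqs[70], freqs[0])  # loci #51, #71, and #1
```

With the defaults, the expected number of recombining sequences per generation is about 1000 x 0.1 = 100, matching the arithmetic in the text. The other regimes correspond to `migrants=10` with `selected=(41, 51, 71, 76)`, or to `recomb_prob=1.0`.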
Acknowledgments
We thank Wei Lun Ng for the photo of R. mucronata style in Fig. 2A. This study was supported by the National Natural Science Foundation of China (91731301, 31600182 and 31830005); the National Key Research and Development Plan (2017FY100705); the 985 Project (33000-18841204) and the Fundamental Research Funds for the Central Universities (17lgpy99).