Abstract
Genetic variation is the fuel of evolution. However, analyzing evolutionary dynamics in natural populations is challenging: sequencing of entire populations remains costly and comprehensive sampling logistically difficult. To tackle this issue and to define relevant spatial and temporal scales of variation, we have founded the European Drosophila Population Genomics Consortium (DrosEU). Here we present the first analysis of 48 D. melanogaster population samples collected across Europe in 2014. Our analysis uncovers novel patterns of variation at multiple levels: genome-wide neutral SNPs, mtDNA haplotypes, inversions, and TEs showing previously cryptic longitudinal population structure; signatures of selective sweeps shared among populations; presumably adaptive clines in inversions; and geographic variation in TEs. Additionally, we document highly variable microbiota and identify several new Drosophila viruses. Our study reveals novel aspects of the population biology of D. melanogaster and illustrates the power of extensive sampling and pooled sequencing of populations on a continent-wide scale.
Introduction
Genetic variation is the raw material for evolutionary change. Understanding the processes that create and maintain variation in natural populations remains a fundamental goal in evolutionary biology. The identification of patterns of genetic variation within and among taxa (Dobzhansky 1970; Lewontin 1974; Kreitman 1983; Kimura 1984; Hudson et al. 1987; McDonald & Kreitman 1991; e.g., Adrian & Comeron 2013) provides fundamental insights into the action of various evolutionary forces. Historically, due to technological constraints, studies of genetic variation were limited to single loci or small genomic regions and to static sampling of small numbers of individuals from natural populations. The development of population genomics has extended such analyses to patterns of variation on a genome-wide scale (e.g., Black et al. 2001; Jorde et al. 2001; Luikart et al. 2003; Begun et al. 2007; Sella et al. 2009; Charlesworth 2010; Casillas & Barbadilla 2017). This has resulted in fundamental advances in our understanding of historical and contemporaneous evolutionary dynamics in natural populations (e.g., Sella et al. 2009; Hohenlohe et al. 2010; Cheng et al. 2012; Fabian et al. 2012; Pool et al. 2012; Messer & Petrov 2013; Ellegren 2014; Harpur et al. 2014; Kapun et al. 2014; Bergland et al. 2014; Charlesworth 2015; Zanini et al. 2015; Kapun et al. 2016a; Casillas & Barbadilla 2017).
However, large-scale sampling and genome sequencing of entire populations remain largely prohibitive in terms of sequencing costs and labor-intensive sample collection, limiting the number of populations that can be analyzed. Evolution is a highly dynamic process across a variety of spatial scales in many taxa; thus, to generate a comprehensive context for population genomic analyses, it is essential to define the appropriate spatial scales of analysis, from meters to thousands of kilometers (Levins 1968; Endler 1977; Richardson et al. 2014). Furthermore, one-time sampling of natural populations provides only a static view of patterns of genetic variation. Allele frequency changes can be highly dynamic even across very short timescales (e.g., Umina et al. 2005; Bergland et al. 2014; Behrman et al. 2018), and theoretical work suggests that such temporal dynamics may be an important yet largely understudied mechanism by which genetic variation is maintained (Wittmann et al. 2017). It is thus essential to define the relevant spatio-temporal scales for sampling and population genomic analyses accordingly.
To generate a population genomic framework that can deliver appropriate high-resolution sampling and to provide a unique resource to the research community, we formed the European Drosophila Population Genomics Consortium (DrosEU; https://droseu.net). Our primary objective is to utilize the strengths of this consortium to extensively sample and sequence European populations of Drosophila melanogaster on a continent-wide scale and across distinct timescales. In close cooperation with a complementary effort focused on North American populations, the Drosophila Real Time Evolution Consortium (Dros-RTEC; http://web.sas.upenn.edu/paul-schmidt-lab/dros-rtec/), our long-term goal is to define the appropriate spatio-temporal scales at which populations should be sampled and analyzed and to gain novel insights into the dynamics of genetic variation.
D. melanogaster offers several advantages for such a concerted sampling and analysis effort: a relatively small genome, a broad geographic range, a multivoltine life history that allows sampling across generations over short timescales, ease of sampling natural populations using standardized techniques, an extensive research community and a well-developed context for population genomic analysis (Powell 1997; Keller 2007; Hales et al. 2015). The species is native to sub-Saharan Africa and has subsequently expanded its range into novel habitats in Europe over the last 10,000-15,000 years and in North America and Australia in the last several hundred years (e.g., Lachaise et al. 1988; David & Capy 1988; Keller 2007). On both the North American and Australian continents, the prevalence of latitudinal clines in frequencies of alleles (e.g., Schmidt & Paaby 2008; Turner et al. 2008; Kolaczkowski et al. 2011b; Fabian et al. 2012; Bergland et al. 2014; Machado et al. 2016; Kapun et al. 2016a), structural variants such as chromosomal inversions (Mettler et al. 1977; Voelker et al. 1978; Knibb et al. 1981; Knibb 1982; 1986; Anderson et al. 1991; Rako et al. 2006; Kapun et al. 2014; Rane et al. 2015; Kapun et al. 2016a) and transposable elements (TEs) (Boussy et al. 1998; González et al. 2008; 2010), as well as complex phenotypes (de Jong & Bochdanovits 2003; Schmidt & Paaby 2008; Schmidt et al. 2008; Flatt et al. 2013; Adrion et al. 2015 and references therein; Kapun et al. 2016b; Behrman et al. 2018) have been interpreted to result from local adaptation to environmental factors that co-vary with latitude or as the legacy of an out-of-Africa dispersal history. However, sampling across these latitudinal gradients has not been replicated outside of a single transect on the east coasts of both continents. 
The observed latitudinal clines on the east coasts of North America and Australia may have been generated, at least in part, by demography and differential colonization histories of populations at high and low latitudes (Bergland et al. 2016). In North America, for example, temperate populations appear to be largely of European origin, whereas low-latitude populations show evidence of greater admixture from ancestral African populations and the Caribbean (Caracristi & Schlötterer 2003; Yukilevich & True 2008a; b; Duchen et al. 2013; Kao et al. 2015; Bergland et al. 2016). More intensive sampling and analysis of both African as well as European populations is thus essential to disentangling the relative importance of local adaptation versus colonization history and demography in generating the clinal patterns that have been widely observed. While there has been a great deal of progress in the analysis of ancestral African populations (e.g., Begun & Aquadro 1993; Corbett-Detig & Hartl 2012; Pool et al. 2012; Fabian et al. 2015; Lack et al. 2015; 2016), the European continent remains largely uncharacterized at the population genomic level (Božičević et al. 2016; Pool et al. 2016; Mateo et al. 2018).
Here, we present the first analysis of the DrosEU pool-sequencing data from a set of 48 European population samples collected in 2014. Our main focus is on describing spatial variation across the European continent. A similar consortium, Dros-RTEC, has been organized mainly in the United States. While the two consortia share the common goal of widespread and coordinated sampling, the Dros-RTEC consortium concentrates on seasonal dynamics in North American populations (Machado et al. 2018). We examine the 2014 DrosEU data at four levels: (1) patterns of variation at single-nucleotide polymorphisms (SNPs) in the nuclear (∼5.5 × 10⁶ SNPs) and mitochondrial (mtDNA) genomes; (2) variation in copy number of transposable elements (TEs); (3) cosmopolitan chromosomal inversions previously associated with climate adaptation; and (4) variation among populations in microbiota, including endosymbionts, bacteria, and viruses (Figure 1).
We find that European populations of D. melanogaster exhibit novel patterns of variation at all levels investigated: neutral SNPs in the nuclear genome and mtDNA haplotypes that reveal previously unknown longitudinal population structure; genomic regions consistent with selective sweeps that indicate selection on a continent-wide scale; new evidence for inversion clines in Europe; and spatio-temporal variation in TE frequencies. We also identify four new DNA viruses and for the first time assemble the complete genome of a fifth. These novel features are revealed by the comprehensive magnitude of our coordinated sampling, thus demonstrating the utility of this approach.
Together with other large-scale genomic datasets for D. melanogaster (Casillas & Barbadilla 2017), our data provide a rich and powerful community resource for studies of molecular population genetics. Importantly, the DrosEU dataset represents the first comprehensive characterization of genetic variation in D. melanogaster on the European continent and might yield important insights into how this species has adapted to temperate climates after its migration out of Africa.
Results
As part of the DrosEU effort, we collected and sequenced 48 population samples of D. melanogaster from 32 geographical locations across Europe in 2014 (Table 1; Figure 2 and Figure 3A).
While our analyses focus on spatial patterns, thirteen of the 32 locations were sampled repeatedly during the year (at least twice, once in summer and once in fall), allowing a first, crude analysis of seasonal changes in allele frequencies on a genome-wide level (Figure 2). For an extensive analysis of temporal (seasonal) patterns in mainly North American populations see the companion paper by Machado et al. (2018). All 48 samples were sequenced to high coverage, with a mean coverage per population of >50x (Table S1 and Figure 4).
Using this high-quality dataset, we performed the first comprehensive, continent-wide genomic analysis of European D. melanogaster populations (Figure 3). In addition to nuclear SNPs, we also investigated variation in mtDNA, TE insertions, chromosomal inversion polymorphisms, and the Drosophila-associated microbiome (Figure 3).
Most SNPs are widespread throughout Europe
We identified a total of 5,558,241 “high confidence” SNPs with frequencies > 0.1% across all 48 samples (Figure 3B, Table S1 and S2). Of these, 17% (941,080) were shared among all samples, whereas 62% were polymorphic in fewer than 50% of the samples (Figure 5A).
Due to our filtering scheme, SNPs that are private or nearly private to a sample will be recovered only if they are at a substantial frequency in that sample (∼5%). In fact, only a small proportion of SNPs (1% = 3,645) was found in fewer than 10% of the samples, and only 0.004% (210) were specific to a single sample (Figure 5A). To avoid an excess contribution of SNPs from populations with multiple (seasonal) sampling, we repeated the analysis by considering only the earliest (Figure 5 - figure supplement 1A) or the latest (Figure 5 - figure supplement 1B) sample from populations with seasonal data. We observed similar patterns across the three analyses: (i) a very small number of sample-specific, private SNPs (210, 527 and 455, respectively), (ii) a majority of SNPs shared among 20% to 40% of the samples (53%, 52% and 52%, respectively), and (iii) a substantial proportion shared among all samples (17%, 20% and 19%, respectively; Figure 5A and Figure 5 - figure supplement 1). These results suggest that most SNPs are geographically widespread in Europe and that genetic differentiation among populations is moderate, consistent with high levels of gene flow across the European continent.
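The sharing pattern summarized above can be illustrated with a minimal sketch. The function name `sharing_spectrum`, the input layout, the toy frequencies, and the presence rule (variant frequency > 0 in a sample) are assumptions made for illustration; they do not reproduce the SNP-calling or filtering pipeline used in this study.

```python
from collections import Counter

def sharing_spectrum(freqs):
    """Count, for each SNP, in how many samples the variant is present.

    freqs: list of per-SNP lists; freqs[i][j] is the variant frequency
    of SNP i in sample j (hypothetical input layout).
    Returns a Counter mapping 'number of samples' -> 'number of SNPs'.
    """
    return Counter(sum(1 for f in row if f > 0) for row in freqs)

# Toy data: 4 SNPs x 3 samples (invented values, not DrosEU data).
toy = [
    [0.20, 0.35, 0.10],  # present in all three samples
    [0.00, 0.06, 0.00],  # private to a single sample
    [0.50, 0.00, 0.40],  # shared by two samples
    [0.15, 0.25, 0.30],  # present in all three samples
]

spectrum = sharing_spectrum(toy)
n_samples = len(toy[0])
shared_by_all = spectrum[n_samples] / len(toy)  # fraction found in every sample
private = spectrum[1] / len(toy)                # fraction found in one sample only
```

Restricting the input to one sample per location (earliest or latest) before calling the function would mirror the robustness check described above.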
Derived European and North American populations share more SNPs with each other than they do with an ancestral African population
D. melanogaster originated in sub-Saharan Africa, migrated to Europe ∼10,000-15,000 years ago, and subsequently colonized the rest of the world, including North America and Australia ∼150 years ago (Lachaise et al. 1988; David & Capy 1988; Keller 2007). To search for genetic signatures of this shared history, we investigated the amount of allele sharing between African, European, and North American populations. We compared our SNP set to two published datasets, one from Zambia in sub-Saharan Africa (DPGP3; Lack et al. 2015) and one from North Carolina in North America (DGRP; Huang et al. 2014).
Populations from Zambia inhabit the ancestral geographical range of D. melanogaster (Pool et al. 2012; Lack et al. 2015); North American populations are thought to be derived from European populations, with some degree of admixture from African populations, particularly in the southern United States and the Caribbean (Caracristi & Schlötterer 2003; Yukilevich & True 2008a; b; Yukilevich et al. 2010; Duchen et al. 2013; Kao et al. 2015; Pool 2015; Bergland et al. 2016). The population from North Carolina exhibits primarily European ancestry, with ∼15% admixture from Africa (Bergland et al. 2016).
Approximately 10% of the SNPs (∼1 million) were shared among all three datasets (Figure 5B). Since the out-of-Africa range expansion and the subsequent colonization of the North American continent by European (and to a lesser degree African) ancestors was likely accompanied by founder effects, leading to a loss of African alleles, and adaptation to temperate climates (Mettler et al. 1977), we predicted that a relatively high proportion of SNPs would be shared between Europe and North America. As expected, the proportion of shared SNPs was higher between Europe and North America (22%) than between either Europe or North America and Zambia (11% and 13%, respectively; Figure 5B).
When we analyzed SNPs in variant frequency bins, the proportion of SNPs shared across at least two continents increased from 26% to 41% for SNPs with variant frequencies larger than 50% (Figure 5 - figure supplement 2A). In contrast, only 6% of the SNPs at low frequency (<10%; Figure 5 - figure supplement 2C) were shared. These results are consistent with the loss of low-frequency variants during the colonization of the European continent; they suggest that intermediate frequency alleles are more likely to be ancestral and thus shared across broad geographic scales. Interestingly, as compared to Africa and North America, we identified nearly 3 million private SNPs that are specific to Europe (Figure 5B). Given that North American and Australian populations are, at least partly, of European ancestry (see Lemeunier & Aulard 1992 for more details), future analysis of our data may be able to shed light on the demography and adaptation of these derived populations.
European and other derived populations exhibit similar amounts of genetic variation
Next, we estimated genome-wide levels of nucleotide diversity within the European population samples using population genetic summary statistics. Pairwise nucleotide diversity (π) and Watterson’s θ, corrected for pooling (Stalker 1976; Mettler et al. 1977; Voelker et al. 1978; Stalker 1980; Sezgin et al. 2004), ranged from 0.0047 to 0.0057 and from 0.0045 to 0.0064, respectively (Figure 6 and Figure 7), with our estimates being qualitatively similar to those from non-African D. melanogaster populations sequenced as individuals (Knibb et al. 1981; Knibb 1982; Anderson et al. 1987) or as pools (Inoue & Watanabe 1979; Inoue et al. 1984).
Estimates of π were slightly lower than, but in close agreement with, estimates of θ, leading to a slightly negative average of Tajima’s D (Das & Singh 1990; 1991; Singh & Das 1992; Singh 2018). Due to our SNP calling approach (see Materials and Methods), we found a deficiency of alleles with frequencies ≤ 0.01, both in the sample-wise site frequency spectra (SFS) as well as in the combined SFS by SNP type, with the sample-wise SFS being skewed towards low frequency variants (Figure 9A).
In addition, we observed an excess of low-frequency SNPs at non-synonymous sites as compared to other types of sites, which is consistent with purifying selection eliminating deleterious non-synonymous mutations (Endler 1977).
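To make the relationship between π, Watterson’s θ, and Tajima’s D concrete, the following sketch implements the classic estimators for a sample of individually sequenced chromosomes. This is an illustration only: the estimates reported here were corrected for pooled sequencing, which this sketch does not attempt, and the function name `tajimas_d` and the toy allele counts are assumptions.

```python
import math

def tajimas_d(allele_counts, n):
    """Return (pi, theta_w, D) for one locus using the classic estimators.

    allele_counts: derived-allele count k (0 < k < n) at each segregating site.
    n: number of sampled chromosomes.
    pi and theta_w are per locus; divide by the number of sites surveyed
    to obtain per-site values.
    """
    s = len(allele_counts)                       # number of segregating sites
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    # Average number of pairwise differences.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in allele_counts)
    theta_w = s / a1                             # Watterson's estimator
    # Constants for the variance of (pi - theta_w), following Tajima (1989).
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    d = (pi - theta_w) / math.sqrt(e1 * s + e2 * s * (s - 1))
    return pi, theta_w, d

# Toy locus dominated by rare variants (invented counts): pi < theta_w, so D < 0.
pi, theta_w, d = tajimas_d([1, 1, 1, 2, 5], n=10)
```

An excess of low-frequency variants lowers π relative to θ and pushes D below zero, which is the direction of the slightly negative average reported above.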
Overall, we detected only minor differences in the amount of genetic variation among populations. Specifically, genome-wide π ranged from 0.005 (Yalta, Ukraine) to 0.006 (Chalet à Gobet, Switzerland) for autosomes, and from 0.003 (Odesa, Ukraine) to 0.0035 (Chalet à Gobet, Switzerland) for the X chromosome (Table S1 and Figure 6). When testing for associations between geographic variables and genome-wide average levels of genetic variation, we found that both π and θ were strongly negatively correlated with altitude, but neither was correlated with latitude or longitude (Table 2). There were no correlations between the season in which the samples were collected and levels of average genome-wide genetic variation as measured by π and θ (Table 2).
The X chromosome showed markedly lower genetic variation than the autosomes, with the ratio of X-linked to autosomal variation (πX/πA) ranging from 0.53 to 0.66. These values are well below the ratio of 0.75 (one-sample Wilcoxon rank test, p < 0.001) expected under standard neutrality and equal sex ratio, but are consistent with previous findings for European populations of D. melanogaster and can be attributed to either selection (Knibb et al. 1981) or changes in population size (Knibb et al. 1981). This pattern is consistent with previous estimates of relatively low X-linked diversity for European (Kapun et al. 2014) and other non-African populations (Kapun et al. 2016a). Interestingly, the ratio πX/πA was significantly, albeit weakly, positively correlated with latitude (Spearman’s r = 0.315, p = 0.0289), with northern populations having slightly higher X/A ratios than southern populations. This is at odds with the prediction of periodically bottlenecked populations leading to a lower X/A ratio in the north and perhaps reflects more complex demographic scenarios (Mettler et al. 1977; Voelker et al. 1978; Knibb et al. 1981; Knibb 1982; Das & Singh 1991; Van ‘t Land et al. 2000; de Jong & Bochdanovits 2003; Anderson et al. 2005; Umina et al. 2005; Rako et al. 2006; Kapun et al. 2014; 2016a).
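The neutral expectation of 0.75 follows from counting chromosome copies. As a brief sketch, assume a population of N/2 females and N/2 males, equal mutation rates on the X chromosome and the autosomes, and equal variance in reproductive success between the sexes (the symbols N, c_A and c_X are introduced here for illustration only):

```latex
\begin{aligned}
c_A &= 2 \cdot \tfrac{N}{2} + 2 \cdot \tfrac{N}{2} = 2N, \\
c_X &= 2 \cdot \tfrac{N}{2} + 1 \cdot \tfrac{N}{2} = \tfrac{3N}{2}, \\
\frac{\pi_X}{\pi_A} &\approx \frac{c_X}{c_A} = \frac{3N/2}{2N} = \frac{3}{4}.
\end{aligned}
```

Under these assumptions, ratios of 0.53 to 0.66 indicate a reduction of X-linked diversity beyond what copy number alone predicts.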
In contrast to π and θ, we observed major differences in the genome-wide averages of Tajima’s D among samples (Figure 10).
The chromosome-wide Tajima’s D was negative in approximately half of all samples and close to zero or slightly positive in the remaining samples, possibly due to heterogeneity in the proportion of sequencing errors among the multiplexed sequencing runs. However, models that included sequence run as a covariate did not explain more of the observed variance than models without the covariate, suggesting that associations of π and θ with geographic variables were not confounded by sequencing heterogeneity (see Supporting Information; Table S4). Moreover, our results for π, θ and D are unlikely to be confounded by spatio-temporal autocorrelations: after accounting for similarity among spatial neighbors (Moran’s I ≈ 0, p > 0.05 for all tests), there were no significant residual autocorrelations among samples for these estimators.
Genetic variation was not distributed homogeneously across the genome. Both π and θ were markedly reduced close to centromeric and telomeric regions (Figure 11), which is in good agreement with previous studies reporting systematic reductions in genetic variation in regions with reduced recombination (Kennison 2008).
Consistent with this, we detected strong correlations with estimates of recombination rates based on the data of Comeron et al. (2012) (linear regression, p < 0.001; not accounting for autocorrelation), suggesting that the distribution of genome-wide genetic variation is strongly influenced by the recombination landscape (Table S5). For autosomes, fine-scale recombination rates explained 41-47% of the variation in π, whereas broad-scale recombination rates (Roberts 1998; Pimpinelli et al. 2010) explained 50-56% of the variation in diversity. We obtained similar results for X-chromosomes, with recombination rates explaining 31-38% (Dobzhansky & Sturtevant 1938; Kunze-Mühl & Müller 1957; Ashburner & Lemeunier 1976) or 24-33% (Wesley & Eanes 1994; Andolfatto et al. 1999; Matzkin et al. 2005; Corbett-Detig et al. 2012) of the variation (Figure 11, Table S5, Figure 11 - figure supplement 1).
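The statement that recombination rate "explained" a given percentage of the variation in π corresponds to the coefficient of determination (R²) of a linear regression. A minimal sketch of that calculation follows; the helper `linear_fit` and all numeric values are invented for illustration, and the sketch, like the regression reported above, does not account for autocorrelation along the chromosome.

```python
def linear_fit(x, y):
    """Ordinary least squares fit of y on x; returns (slope, intercept, r2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = sxy ** 2 / (sxx * syy)   # fraction of variance in y explained by x
    return slope, intercept, r2

# Invented window-level values: recombination rate (cM/Mb) and pi.
rec_rate = [0.0, 0.5, 1.0, 2.0, 3.0, 4.0]
pi_window = [0.0010, 0.0025, 0.0035, 0.0050, 0.0055, 0.0060]

slope, intercept, r2 = linear_fit(rec_rate, pi_window)
```

A positive slope with R² near 0.5 would correspond to the 41-56% of autosomal variation in π attributed to recombination rate above.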
We also observed variation in Tajima’s D with respect to genomic position (Figure 11). Notably, Tajima’s D was markedly lower than the corresponding chromosome-wide average in the proximity of telomeric and centromeric regions on all chromosomal arms. These patterns possibly reflect purifying selection or selective sweeps close to heterochromatic regions (Navarro & Faria 2014; Kapun et al. 2014; 2016a), or might alternatively be a result of sequencing errors having a stronger effect on the SFS in low SNP density regions.
Localized reductions in Tajima’s D are consistent with selective sweeps
We identified 144 genomic locations on the autosomes with non-zero recombination, reduced genetic variation, and a local reduction in Tajima’s D (see Methods, Table S6), which jointly may be indicative of selective sweeps. Although we cannot rule out that these patterns are the result of non-selective demographic effects (e.g., bottlenecks), two observations suggest that at least some of these regions are affected by positive selection. First, bottlenecks are typically expected to cause genome-wide, non-localized reductions in Tajima’s D. Second, several of the genomic regions in our data coincide with previously identified, well-supported selective sweeps in the proximity of Hen1, Cyp6g1 (Andolfatto et al. 1999; Corbett-Detig & Hartl 2012), wapl (Andolfatto et al. 1999; Matzkin et al. 2005; Kennington et al. 2007; Corbett-Detig & Hartl 2012; Kennington & Hoffmann 2013; Kapun et al. 2014; 2016a), HDAC6 (Begun 2015; Lavington & Kern 2017), and around the chimeric gene CR18217 (Kirkpatrick 2010).
However, some regions, such as those around wapl or HDAC6, are characterized by low recombination rates (< 0.5 cM/Mb; Table S5), which can itself lead to reduced variation and Tajima’s D (see also Nolte et al. 2013). Our screen also uncovered several regions that have not previously been described as harboring sweeps (Table S6). These represent promising candidate regions containing putative targets of positive selection. For several of these candidate regions, patterns of variation were highly similar across the majority of European samples, suggesting the existence of continent-wide selective sweeps that either predate the colonization of Europe (e.g., Beisswanger et al. 2006) or that have swept across all European populations more recently. In contrast, some candidate regions were restricted to only a few populations and characterized by highly negative values of Tajima’s D, i.e. deviating from the among-population average by more than two standard deviations, thus possibly hinting at cases of local, population-specific adaptation (Figure 12 - figure supplement 2 and Table S6 for examples).
European populations are strongly structured along an east-west gradient
We next investigated patterns of genetic differentiation due to demographic substructure. Overall, pairwise differentiation as measured by FST was relatively low, though markedly higher for X-chromosomes (0.043–0.076) than for autosomes (0.013–0.059; Student’s t-test; p < 0.001; Figure 13), possibly reflecting differences in effective population size between the X chromosome and the autosomes (Hutter et al. 2007). One population, from Sheffield (UK), showed an unusually high amount of differentiation on the X-chromosome as compared to other populations (Figure 13).
Despite these overall low levels of among-population differentiation, European populations showed some evidence of geographic substructure. To analyze this pattern in more detail, we focused on a set of SNPs located in short introns (< 60 bp), as these sites are relatively unaffected by selection (Haddrill et al. 2005; Singh et al. 2009; Parsch et al. 2010; Clemente & Vogl 2012; Lawrie et al. 2013). We analyzed the extent of isolation by distance (IBD) within Europe by correlating genetic and geographic distance and using pairwise FST between populations as a measure of genetic isolation. FST was overall low but significantly correlated with distance across the continent, indicating weak but significant IBD (Mantel test; p < 0.001; max. FST ∼ 0.05; Figure 14A). We also examined those populations that were most and least separated by genetic differentiation, estimated by pairwise FST (Figure 14B). In general, longitude had a stronger effect on isolation than latitude, with populations showing the strongest differentiation separated along an east-west, rather than a north-south, axis (Figure 14B). This pattern remained unchanged when the number of populations sampled from Ukraine was reduced to avoid overrepresentation (Figure 14 - figure supplement 1).
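The isolation-by-distance test described above can be sketched as a permutation-based Mantel test: the correlation between the two distance matrices is compared with correlations obtained after randomly relabeling populations in one matrix. The function names, the number of permutations, the seed, and the toy distances below are assumptions for illustration and are not the implementation used in this study.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def upper(matrix, order):
    """Upper-triangle entries of a square matrix after relabeling by 'order'."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def mantel(gen, geo, n_perm=999, seed=1):
    """One-sided Mantel test; returns (observed r, permutation p-value)."""
    rng = random.Random(seed)
    idx = list(range(len(gen)))
    geo_vec = upper(geo, idx)
    r_obs = pearson(upper(gen, idx), geo_vec)
    hits = 0
    for _ in range(n_perm):
        perm = idx[:]
        rng.shuffle(perm)   # relabel populations in the genetic matrix
        if pearson(upper(gen, perm), geo_vec) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)

# Toy example: six populations along a transect (positions in km, invented),
# with FST-like distances that increase linearly with geographic distance.
coords = [0, 300, 700, 1500, 2200, 3000]
geo = [[abs(a - b) for b in coords] for a in coords]
gen = [[0.0 if i == j else 0.01 + 1e-5 * geo[i][j] for j in range(6)]
       for i in range(6)]

r_obs, p_value = mantel(gen, geo)
```

Permuting whole populations, rather than individual matrix cells, preserves the dependence among pairwise distances that share a population, which is why a Mantel test is used instead of an ordinary correlation test.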
To further explore these patterns, we performed a principal component analysis (PCA) on the allele frequencies of SNPs in short introns. The first three principal components (PC) explained more than 25% of the total variance (PC1: 16.3%, PC2: 5.4%, PC3: 4.8%, eigenvalues = 599.2, 199.1, and 178.5 respectively; Figure 14C and Figure 14 - figure supplement 2). As expected, PC1 was strongly correlated with longitude. Despite significant signals of autocorrelation, as indicated by Moran’s test on residuals from linear regressions with PC1, the association with longitude was not due to spatial autocorrelation, since a spatial error model also resulted in a significant association. PC2 was similarly, but to a lesser extent, correlated with longitude and also with altitude. PC3, by contrast, was not associated with any variable examined (Table 2). None of the major PC axes were correlated with season, indicating that there were no shared seasonal differences across samples in our dataset. Hierarchical model fitting based on the first three PC axes resulted in five distinct clusters (Figure 14C) that were oriented along the axis of PC1, supporting the notion of strong longitudinal differentiation among European populations. To the best of our knowledge, such a pronounced longitudinal signature of differentiation has not previously been reported in European D. melanogaster.
Remarkably, this pattern is qualitatively similar to that observed for human populations (Cavalli-Sforza 1966; Xiao et al. 2004; Francalacci & Sanna 2008), perhaps consistent with co-migration of this commensal species.
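As an illustration of how a principal component and its share of the total variance can be obtained from a matrix of allele frequencies, the sketch below extracts PC1 by power iteration on the sample-by-sample cross-product matrix. Power iteration is used here only to keep the example dependency-free; it is not the method used in this study, and the function name `first_pc` and the toy frequencies are assumptions.

```python
import math

def first_pc(freqs, n_iter=200):
    """PC1 scores and fraction of variance explained.

    freqs: rows are samples, columns are SNP allele frequencies.
    """
    n, m = len(freqs), len(freqs[0])
    # Center each SNP (column) on its mean across samples.
    means = [sum(row[j] for row in freqs) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in freqs]
    # Cross-product matrix among samples; its eigenvalues are
    # proportional to the variances of the principal components.
    g = [[sum(a * b for a, b in zip(x[i], x[k])) for k in range(n)]
         for i in range(n)]
    total = sum(g[i][i] for i in range(n))
    v = [float(i + 1) for i in range(n)]   # arbitrary non-constant start vector
    for _ in range(n_iter):
        w = [sum(g[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    eigval = sum(v[i] * sum(g[i][k] * v[k] for k in range(n)) for i in range(n))
    scores = [c * math.sqrt(eigval) for c in v]
    return scores, eigval / total

# Toy data: four samples ordered west to east, three SNPs (invented values).
toy = [
    [0.10, 0.80, 0.30],
    [0.20, 0.70, 0.32],
    [0.40, 0.50, 0.29],
    [0.50, 0.40, 0.31],
]
scores, explained = first_pc(toy)
```

In this toy case the PC1 scores order the samples from west to east (the sign of a principal component is arbitrary), analogous to the correlation between PC1 and longitude reported above.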
Mitochondrial haplotypes also exhibit longitudinal population structure
Our finding that European populations are longitudinally structured is also supported by an analysis of mitochondrial haplotypes. We identified two main mitochondrial haplotypes in Europe, separated by at least 41 mutations (between G1.2 and G2.1; Figure 15A). Our findings are consistent with similar analyses of mitochondrial haplotypes from a North American D. melanogaster population (Cooper et al. 2015) as well as from worldwide samples (Wolff et al. 2016), revealing varying degrees of differentiation among haplotypes, ranging from only a few to hundreds of substitutions. The two G1 subtypes (G1.1 and G1.2) are separated by only four mutations, and the three G2 subtypes are separated by a maximum of four mutations (between G2.1 and G2.3). The estimated frequency of these haplotypes varied greatly among populations (Figure 15B). Qualitatively, three types of European populations can be distinguished based on these haplotypes, namely those with (1) a high frequency (> 60%) of the G1 haplotypes, characteristic of central European samples, (2) a low frequency (< 40%) of G1 haplotypes, a pattern common for Eastern European populations in summer, and (3) a combined frequency of G1 haplotypes between 40-60%, which is typical of samples from the Iberian Peninsula and from Eastern Europe in fall (Figure 15 - figure supplement 1).
We observed a significant shift in the relative frequencies of the two haplotype classes between summer and fall samples in only two of the nine possible comparisons among haplotypes. While there was no correlation between latitude and the combined frequency of G1 haplotypes, we found a weak but significant negative correlation between G1 haplotypes and longitude (r² = 0.10; p < 0.05), which is consistent with the longitudinal east-west population structure observed for intronic SNPs. In a subsequent analysis, we divided the dataset at 20° longitude into an eastern and a western subset since in northern Europe 20° longitude corresponds to the division of two major climatic zones, namely C (temperate) and D (cold), according to the Köppen-Geiger climate classification (Peel et al. 2007). When splitting the populations into a western (longitude < 20° E) and an eastern group (longitude > 20° E), we found a clear correlation between longitude and the combined frequency of G1 haplotypes, explaining as much as 50% of the variation in the western group (Figure 15 - figure supplement 1B). Similarly, in the eastern populations longitude and the combined frequency of G1 haplotypes were correlated, explaining approximately 20% of the variance (Figure 15 - figure supplement 1B). Thus, our data on mitochondrial haplotypes clearly confirm the existence of pronounced east-west population structure and differentiation in European D. melanogaster. While this might be due to climatic selection, as recently found for clinal mitochondrial haplotypes in Australia (Camus et al. 2017), we cannot presently rule out an effect of demography.
The majority of TEs vary with longitude and altitude
To examine the population genetics of structural variants in our data, we first focused on transposable elements (TEs). The repetitive content of the 48 samples analyzed ranged from 16% to 21% with respect to nuclear genome size (Figure 16). The vast majority of detected repeats were TEs, mostly represented by long terminal repeats (LTR) and long interspersed nuclear elements (LINE; Class I), as well as a few DNA elements (Class II). LTR content best explained total TE content (LINE+LTR+DNA) (Pearson’s r = 0.87, p < 0.01, vs. DNA r = 0.58, p = 0.0117, and LINE r = 0.36, p < 0.01; Figure S16A).
We next estimated population-wise frequencies of 1,630 TE insertions annotated in the D. melanogaster reference genome v.6.04 using T-lex2 (Table S7, Fiston-Lavier et al. 2010). On average, 56% of the TEs annotated in the reference genome were fixed in all samples. The remaining polymorphic TEs usually segregated at low frequency in all samples (Figure 16 - figure supplement 1A), potentially due to the effect of purifying selection (González et al. 2008; Petrov et al. 2011; Kofler et al. 2012; Cridland et al. 2013; Blumenstiel et al. 2014). However, we also observed 142 TE insertions present at intermediate (>10% and <95%) frequencies (Figure 16 - figure supplement 1B), which might be consistent with transposition-selection balance (Charlesworth et al. 1994).
In each of the 48 samples TE frequency and recombination rate were negatively correlated on a genome-wide level (Spearman rank sum test; p < 0.01), as previously reported (Bartolomé et al. 2002; Petrov et al. 2011; Kofler et al. 2012). This pattern still holds when only polymorphic TEs (population frequency <95%) are analyzed, although it becomes statistically non-significant for some chromosomes and populations (Table S8). In either case, the correlation is more negative when using broad-scale, rather than fine-scale, recombination rate estimates (Materials and methods, Tables S8B, S8D). This indicates that broad-scale recombination patterns may best capture long-term population recombination patterns.
We further tested whether the distribution of TE frequencies among samples could be explained by geographical or temporal variables. We focused on the 141 TE insertions that showed frequency variability among samples (interquartile range (IQR) > 10; see Materials and Methods). Of these, 73 TEs showed significant associations with geographical or temporal variables after multiple testing correction (Table S9). Note that we used a conservative p-value threshold (< 0.001), and we did not find significant residual spatio-temporal autocorrelation among samples for any TE tested (Moran’s I > 0.05 for all tests; Table S9). Sixteen of the 73 TEs were located in regions of very low recombination (0 cM/Mb for either of the two recombination measures used). Among the 57 significant TEs located in high recombination regions, we observed significant correlations of 13 TEs with longitude, 13 with altitude, 5 with latitude, and 3 with season (Table S9). In addition, the frequencies of the other 23 insertions were significantly correlated with more than one of the above-mentioned variables (Table S9). These significant TEs were scattered along the five main chromosome arms (Table S9). Among the 57 significant TEs located in high recombination regions, two TE families were enriched (χ2 p-values after Yates’ correction < 0.05): the LTR 297 family with 11 copies, and the DNA pogo family with 5 copies (Table S10). We also checked the genomic localization of the 57 TEs. Most of them (42) were located inside genes: two in 5’UTR, four in 3’UTR, 18 in the first intron, and 18 in subsequent introns. Additionally, 7 TEs are <1 kb from the nearest gene, indicating that these might potentially affect the regulation of nearby genes (Table S9). Interestingly, 14 of these 57 TEs coincide with previously identified candidate adaptive TEs (Table S9), suggesting that our dataset might be enriched for adaptive insertions. However, further analyses are needed to rule out the effect of non-selective forces on the patterns observed.
Inversion polymorphisms in Europe exhibit latitudinal and longitudinal clines
Chromosomal inversions are another class of important and common structural genomic variants, often exhibiting frequency clines on multiple continents, some of which have been shown to be adaptive (e.g. Knibb 1982; Umina et al. 2005; Kapun et al. 2014; 2016a). However, little is known yet about the spatial distribution and clinality of inversions in Europe. We used a panel of inversion-specific marker SNPs (Kapun et al. 2014) to examine the presence and frequency of six cosmopolitan inversion polymorphisms (In(2L)t, In(2R)NS, In(3L)P, In(3R)C, In(3R)Mo, In(3R)Payne) in the 48 samples. All populations were polymorphic for one or more inversions (Figure 17). However, only In(2L)t segregated at substantial frequencies in most populations (average frequency = 20.2%). All other inversions were either absent or occurred at low frequencies (average frequencies: In(2R)NS = 6.2%, In(3L)P = 4%, In(3R)C = 3.1%, In(3R)Mo = 2.2%, In(3R)Payne = 5.7%).
Despite their overall low frequencies, several inversions exhibited clinal patterns across space (Table 3). We observed significant latitudinal clines for In(3L)P, In(3R)C and In(3R)Payne. Although they differed in overall frequencies, In(3L)P and In(3R)Payne showed latitudinal clines in Europe that are qualitatively similar to the clines previously observed along the North American and Australian east coasts (Figure S17 and Table S11, Kapun et al. 2016a). For the first time, we also detected a longitudinal cline for In(2L)t and In(2R)NS, with both inversions decreasing in frequency from east to west, a result that is consistent with our finding of strong longitudinal among-population differentiation in Europe. In(2L)t also increased in frequency with altitude (Table 3). Except for In(3R)C, we did not find significant residual spatio-temporal autocorrelation among samples for any inversion tested (Moran’s I ≈ 0, p > 0.05 for all tests; Table 3), suggesting that our analysis was not confounded by spatial autocorrelation for most of the inversions. It will clearly be interesting to examine the extent to which clines in inversions (and other genomic variants) across Europe are shaped by selection and/or demography in future work.
European Drosophila microbiomes contain trypanosomatids and novel viruses
We were also interested in determining the abundance of microbiota associated with D. melanogaster from the Pool-Seq data – these endosymbionts often have crucial functions in affecting the life history, immunity, hormonal physiology, and metabolic homeostasis of their fly hosts (e.g., Trinder et al. 2017; Martino et al. 2017). The taxonomic origin of a total of 262 million non-Drosophila reads was inferred using MGRAST, which identifies and counts short protein motifs (‘features’) within reads (Meyer et al. 2008). The largest fraction of protein features was assigned to Wolbachia (on average 53.7%; Figure 18), a well-known endosymbiont of Drosophila (Werren et al. 2008). The relative abundance of Wolbachia protein features varied strongly between samples ranging from 8.8% in a sample from the UK to almost 100% in samples from Spain, Portugal, Turkey and Russia (Table 1). Similarly, Wolbachia loads varied 100x between samples if we use the ratio of Wolbachia protein features divided by the number of Drosophila sequences retrieved for that sample as a proxy for relative micro-organismal load (for a full table of micro-organismal loads standardized by Drosophila genome coverage see Table S12).
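The load proxy used above is a simple ratio, sketched below for illustration. The function names and the example numbers in the usage note are hypothetical and are not taken from Table S12.

```python
def relative_load(microbe_features, drosophila_sequences):
    """Proxy for micro-organismal load in one sample:
    microbial protein features divided by the number of Drosophila sequences."""
    if drosophila_sequences <= 0:
        raise ValueError("need a positive number of Drosophila sequences")
    return microbe_features / drosophila_sequences


def fold_range(loads):
    """Fold difference between the highest and the lowest non-zero load."""
    positive = [x for x in loads if x > 0]
    if not positive:
        raise ValueError("no non-zero loads")
    return max(positive) / min(positive)
```

For example, `fold_range([0.001, 0.05, 0.1])` corresponds to a 100-fold difference between the least and most heavily loaded samples.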
Acetic acid bacteria of the genera Gluconobacter, Gluconacetobacter, and Acetobacter were the second largest group, with an average relative abundance of 34.4%.
Furthermore, we found evidence for the presence of several genera of Enterobacteria (Serratia, Yersinia, Klebsiella, Pantoea, Escherichia, Enterobacter, Salmonella, and Pectobacterium). Serratia occurs only at low frequencies or is absent from most of our samples, but reaches a very high relative abundance in the Nicosia summer collection (54.5%). This high relative abundance was accompanied by an 80x increase in Serratia bacterial load. We detected several eukaryotic microorganisms, although they were less abundant than the bacteria. The fraction of fungal protein features is larger than 3% in only three of our samples from Finland, Austria and Turkey (Table 1). Interestingly, we detected the presence of trypanosomatids in 16 of our samples, consistent with other recent evidence that Drosophila can host these organisms (Wilfert et al. 2011; Chandler & James 2013; Hamilton et al. 2015).
Our data also allowed us to detect the presence of five different DNA viruses (Table S13). These included approximately two million reads from Kallithea nudivirus (Webster et al. 2015), allowing us to assemble the complete Kallithea genome for the first time (>300-fold coverage in the Ukrainian sample UA_Kha_14_46; GenBank accession KX130344). We also identified around 1,000 reads from a novel nudivirus that is closely related to Kallithea virus and to Drosophila innubila nudivirus (Unckless 2011) in sample DK_Kar_14_41 from Karensminde, Denmark (Table 1). These sequences permitted us to identify a publicly available dataset (SRR3939042: 27 male D. melanogaster from Esparto, California; Machado et al. 2016) that contained sufficient reads to complete the genome (provisionally named “Esparto virus”; KY608910). We further identified two novel Densoviruses (Parvoviridae), which we have provisionally named “Viltain virus”, a relative of Culex pipiens densovirus found at 94-fold coverage in sample FR_Vil_14_07 (Viltain; KX648535) and “Linvill Road virus”, a relative of Dendrolimus punctatus densovirus that was represented by only 300 reads here, but which has previously been found to have a high coverage in dataset SRR2396966 from a North American sample of D. simulans (KX648536; Machado et al. 2016). In addition, we detected a novel member of the Bidnaviridae family, “Vesanto virus”, a bidensovirus related to Bombyx mori densovirus 3 with approximately 900-fold coverage in sample FI_Ves_14_38 (Vesanto; KX648533 and KX648534). Using a detection threshold of >0.1% of the Drosophila genome copy number, the most commonly detected viruses were Kallithea virus (30/48 of the pools) and Vesanto virus (25/48), followed by Linvill Road virus (7/48) and Viltain virus (5/48), with Esparto virus being the rarest (2/48). In some samples, the viruses reached strikingly high titers: on 13 occasions the virus genome copy number in the pool exceeded the host genome copy number, reaching a maximum of nearly 20-fold in Vesanto.
This continent-wide analysis of the microbiota associated with fruit flies suggests that natural populations of European D. melanogaster differ greatly in the composition and relative abundance of microbes and viruses.
Discussion
In recent years, large-scale population resequencing projects have shed light on the biology of both model (Mackay et al. 2012; Langley et al. 2012; Consortium 2015; Lack et al. 2015; Alonso-Blanco et al. 2016; Lack et al. 2016) and non-model organisms (e.g., Hohenlohe et al. 2010; Wolf et al. 2010). Such massive datasets contribute greatly to our growing understanding of the processes that create and maintain genetic variation in natural populations. However, the relevant spatio-temporal scales for population genomic analyses remain largely unknown. Here we have applied, for the first time, a comprehensive sampling and sequencing strategy to European populations of D. melanogaster, allowing us to uncover previously unknown aspects of this species’ population biology.
A main result from our analyses of SNPs located in short introns and presumably evolving neutrally (Parsch et al. 2010) is that European D. melanogaster populations exhibit very pronounced longitudinal differentiation, a pattern that – to the best of our knowledge – has not been observed before for the European continent (for patterns of longitudinal differentiation in Africa see e.g. Michalakis & Veuille 1996; Aulard et al. 2002; Fabian et al. 2015). Genetic differentiation was greatest between populations from eastern and western Europe (Figure 14). The eastern populations included those from the Ukraine, Russia, and Turkey, as well as one from eastern Austria, suggesting that there may be a region of restricted gene flow in south-central Europe. However, populations from Finland and Cyprus are more similar to western populations than to eastern populations, possibly as a result of migration along shipping routes in the Baltic and Mediterranean seas. More data from populations in the unsampled, intermediate regions are needed to better delineate the geographic limits of the eastern and western population groups. Consistent with the strong differentiation between eastern and western populations, our PCA analysis revealed that longitude was the major factor associated with among-population divergence, with no significant effect of latitude (Figure 14C; Table 2). Thus, the patterns of neutral genetic differentiation in Europe contrast with those previously reported for North America, where latitude impacts neutral differentiation (Machado et al. 2016; Kapun et al. 2016a). However, our present analysis does not exclude the existence of clinally varying polymorphisms in European populations outside short introns: for example, we detected latitudinal frequency clines both for TEs and inversion polymorphisms. A detailed analysis of genome-wide patterns of clinal variation in the 2014 DrosEU data is beyond the scope of this paper and currently under way.
The mitochondrial genome and several chromosomal inversions and TEs showed similar patterns of differentiation as the rest of the genome, with the main axis of differentiation being longitudinal. Uncovering the extent to which this pattern is driven by demography and/or selection, and identifying the underlying environmental correlates (including any potential role of co-migration with human populations), will be an important task for future analyses. Due to the high density of samples and the large number of SNP markers examined, our results reveal that European populations of D. melanogaster exhibit much more differentiation and structure than previously thought (e.g., Baudry et al. 2004; Dieringer et al. 2005; Schlötterer et al. 2006; Nunes et al. 2008; Mateo et al. 2018).
Within the eastern and western population groups there was a low – but detectable – level of genetic differentiation among populations, including those that are geographically close (Figure 14C). These population differences persisted over a timespan of at least 2–3 months, as there was less genetic differentiation between the summer and fall samples of the 13 locations sampled at multiple time points than between neighboring populations (Figure 14C). Thus, while the weak but significant signal of IBD suggests homogenizing gene flow across geography, there is seasonally stable differentiation among populations. The season in which samples were collected did not show a significant association with genetic differentiation, except when considered in conjunction with longitude or altitude (Table 2). However, the data analyzed here are from a single year only: demonstrating recurrent shifts in SNP frequencies due to temporally varying selection will require analysis of additional annual samples. For an extensive analysis of patterns of seasonal variation across a broad geographic scale, see Machado et al. (2018).
Our Pool-Seq data also allowed us to characterize geographic patterns in both inversions and TEs. In marked contrast to putatively neutral SNPs, the frequencies of several chromosomal inversions, including In(3L)P, In(3R)C, and In(3R)Payne, showed a significant correlation with latitude (Table 3). For In(3L)P and In(3R)Payne, the latitudinal clines were in qualitative agreement with parallel clines reported from North America and Australia, with the inversions decreasing in frequency as distance from the equator increases (Mettler et al. 1977; Knibb et al. 1981; Fabian et al. 2012; Kapun et al. 2014; Rane et al. 2015; Kapun et al. 2016a). This suggests that these inversions may contain genetic variants that are better adapted to warmer environments than to temperate climates. The overall frequencies of these inversions are, however, low in Europe (<5%), indicating that they might play only a minor role in local adaptation to European habitats. Some euchromatic TE insertions also showed geographic or seasonal patterns of variation (Table S7), indicating that they might play a role in local adaptation, particularly as many of them are located in regions where they could affect gene regulation. Importantly, several inversions and TEs also showed longitudinal frequency gradients, thus supporting the notion that European populations exhibit marked longitudinal differentiation.
We also examined signatures of selective sweeps in our dataset. We found 144 genomic regions that showed signatures of hard sweeps in regions of normal recombination (cM/Mb ≥ 0.5), and with reduced variation and negative Tajima’s D (D ≤ −0.8) in all European populations (Figure 12, Table S6). Four of these regions were identified in previous studies as potential targets for positive selection.
The first region, at the center of chromosome arm 2R (Figure 12A, Table S6), was previously found to be strongly differentiated between African and North American populations (Langley et al. 2012) and contains two genes, Cyp6g1 and Hen-1, that are associated with recent, strong selection. The cytochrome P450 gene Cyp6g1 has been linked to insecticide resistance (Daborn et al. 2002; Schmidt et al. 2010), shows evidence for recent selection independently in both D. melanogaster and D. simulans (Schlenke & Begun 2003; Catania et al. 2004), and is associated with a large differentiated region in the Australian latitudinal cline (Kolaczkowski et al. 2011a). Hen-1, a methyltransferase involved in maturation of small RNAs involved in virus and TE suppression, showed marginally non-significant evidence for selective sweeps in North American and African populations of D. melanogaster (Kolaczkowski et al. 2011b).
The second region previously implicated in a selective sweep is located on chromosome arm 3L (Figure 12B, Table S6) and centered around the chimeric gene CR18217, which formed from the fusion of a gene encoding a DNA-repair enzyme (CG4098) and a centriole gene (spd-2; Rogers & Hartl 2012). CR18217 appears to be unique to D. melanogaster, but – in spite of its recent origin – segregates at frequencies of around 90% (Rogers & Hartl 2012), consistent with a recent strong sweep in this region of the genome. This putative sweep region also spans Prosbeta6, which (like HDAC6) encodes a protein involved in proteolysis (Flybase v. FB2017_05; Gramates et al. 2017). Prosbeta6 also shows homology to genes involved in immune function (Lyne et al. 2007; Handu et al. 2015), which might explain why it has been a target of positive selection.
The third previously characterized sweep region, surrounding the wapl gene on the X chromosome (Table S6), was identified as showing evidence of strong selective sweeps in both African and European D. melanogaster populations (Beisswanger et al. 2006; Boitard et al. 2012). The genic targets of selection in this region are unclear, but most likely are ph-p in Europe and ph-p or ph-d in Africa (Beisswanger et al. 2006). These genes are tandem duplicates involved in the Polycomb response pathway, which functions as an epigenetic repressor of transcription (reviewed in Kassis et al. 2017).
The fourth previously observed sweep region, originally identified in African populations of D. melanogaster, is also located on the X chromosome (Table S6), but 30 cM closer to the telomere and thus not implicating the wapl region (Beisswanger et al. 2006; Boitard et al. 2012). Selection in this region has been attributed to the HDAC6 gene (Svetec et al. 2009). HDAC6, although nominally a histone deacetylase, actually functions as a central player in managing cytotoxic assaults, including in transport and degradation of misfolded protein aggregates (reviewed in Matthias et al. 2008; Svetec et al. 2009).
Our data support the widespread occurrence of these previously identified sweeps in many populations in Europe. Notably, practically all European populations examined showed reduced variation and negative Tajima’s D in these sweep regions. This is consistent with the sweeps either pre-dating the colonization of Europe (e.g., Beisswanger et al. 2006) or having swept across Europe more recently (also see Stephan 2010 for discussion). In addition, we also uncovered several novel genomic regions with tentative evidence for hard sweeps (Table S6) – these regions represent a valuable source for future analyses of signals of adaptive evolution in European Drosophila.
Finally, we used our Pool-Seq data to identify microbes and viruses and to quantify their presence in natural populations of D. melanogaster across the European continent. Wolbachia was the most abundant bacterial genus associated with the flies, but its relative abundance and load varied greatly among samples (Figure 18). The second most abundant bacterial taxon was acetic acid bacteria (Acetobacteraceae), a group previously found among the most abundant bacteria in natural D. melanogaster isolates (Chandler et al. 2011; Staubach et al. 2013). Other microbes were highly variable in relative abundance. For example, Serratia abundance was low in most populations, but very high in the Nicosia sample, which might reflect that there are individuals in the Nicosia sample that carry a systemic Serratia infection generating high bacterial loads. Future sampling may shed light on the temporal stability and/or population specificity of these patterns. Contrary to expectation, we found relatively few yeast sequences. This is somewhat surprising because yeasts are commonly found on rotting fruit, the main food substrate of D. melanogaster, and have been found in association with Drosophila before (Barata et al. 2012; Chandler et al. 2012). This suggests that, although yeasts can attract flies and play a role in food choice (Becher et al. 2012; Buser et al. 2014), they might not be highly prevalent in or on D. melanogaster bodies. While trypanosomatids have been reported in association with Drosophila before (Wilfert et al. 2011; Chandler & James 2013; Hamilton et al. 2015), our study provides the first systematic detection across a wide geographic range in D. melanogaster. Despite being host to a wide diversity of RNA viruses (Huszar & Imler 2008; Webster et al. 2015), only three DNA viruses have previously been reported in association with Drosophilidae, and only one from D. melanogaster (Unckless 2011; Webster et al. 2015; 2016). Here, we have discovered four new DNA viruses in D. melanogaster. Although it is not possible to directly estimate viral prevalence from pooled sequencing data, we found that the DNA viruses of D. melanogaster can be very widespread, with Kallithea virus detectable at a low level in most populations.
A striking qualitative pattern in our microbiome data is the high level of variability among populations in the composition and relative amounts of different microbiota and viruses. Thus, an interesting open question is to what extent geographic differences in microbiota might contribute to phenotypic differences and local adaptation among fly populations, especially given that there might be tight and presumably local co-evolutionary interactions between fly hosts and their endosymbionts (e.g., Haselkorn et al. 2009; Richardson et al. 2012; Staubach et al. 2013; Kriesner et al. 2016).
In conclusion, our study demonstrates that extensive sampling on a continent-wide scale and pooled sequencing of natural populations can reveal new aspects of population biology, even for a well-studied species such as D. melanogaster. Such extensive population sampling is feasible due to the close cooperation and synergism within our international consortium. Our efforts in Europe are paralleled in North America by the Drosophila Real Time Evolution Consortium (Dros-RTEC), with whom we are currently collaborating to compare population genomic data across continents. In future years, our consortia will continue to sample and sequence European and North American Drosophila populations in order to study these populations with increasing spatial and temporal resolution and to provide an unprecedented resource for the Drosophila and population genetics communities.
Materials and Methods
The 2014 DrosEU dataset analyzed here consists of 48 samples of D. melanogaster collected from 32 geographical locations at different time-points across the European continent, through a joint effort of 18 European research groups (see Figure 2, Table 1). Field collections were performed with baited traps using a standardized protocol (see Supplementary file for details). Up to 40 males from each collection were pooled, and DNA extracted from each pool, using a standard phenol-chloroform based protocol. Each sample was processed in a single pool (Pool-Seq; Schlötterer et al. 2014), with each pool consisting of at least 33 wild-caught individuals. To exclude morphologically similar and co-occurring species, such as D. simulans, as potential contaminants from the samples, we only used wild-caught males and distinguished among species by examining genital morphology. Despite this precaution, we identified a low level of D. simulans contamination in our samples, and further steps were thus taken to exclude D. simulans sequences from our analysis (see below). The 2014 DrosEU dataset represents the most comprehensive spatio-temporal sampling of European D. melanogaster populations available to date (Table 1, Figure 3).
DNA extraction, library preparation and sequencing
DNA was extracted from pools of 33–40 males per sample after joint homogenization with bead beating and standard phenol/chloroform extraction. A detailed extraction protocol can be found in the Supporting Information file. In brief, 500 ng of DNA in a final volume of 55.5 μl were sheared with a Covaris instrument (Duty cycle 10, intensity 5, cycles/burst 200, time 30) for each sample separately. Library preparation was performed using NEBNext Ultra DNA Lib Prep-24 and NEBNext Multiplex Oligos for Illumina-24 following the manufacturer’s instructions. Each pool was sequenced as paired-end fragments on an Illumina NextSeq 500 sequencer at the Genomics Core Facility of Pompeu Fabra University (UPF; Barcelona, Spain). Samples were multiplexed in five batches of 10 samples each, except for one batch that contained only 8 samples (see Supplementary Table S1 for further information). Each multiplexed batch was sequenced on four lanes to obtain an approximate 50x raw coverage for each sample. Reads were sequenced to a length of 151 bp with a median insert size of 348 bp (ranging from 209 to 454 bp).
Mapping pipeline and variant calling
Prior to mapping, we trimmed and filtered raw FASTQ reads to remove low-quality bases (minimum base PHRED quality = 18; minimum sequence length = 75 bp) and sequencing adaptors using cutadapt (v. 1.8.3; Martin 2011). We only retained read pairs for which both reads fulfilled our quality criteria after trimming. FastQC analyses of trimmed and quality filtered reads showed overall high base-qualities (median ranging from 29 to 35 in all 48 samples) and indicated a loss of ∼1.36% of all bases after trimming relative to the raw data. We used bwa mem (v. 0.7.15; Li 2013) with default parameters to map trimmed reads against a compound reference genome consisting of the genomes from D. melanogaster (v.6.12) and genomes from common commensals and pathogens, including Saccharomyces cerevisiae (GCF_000146045.2), Wolbachia pipientis (NC_002978.6), Pseudomonas entomophila (NC_008027.1), Commensalibacter intestini (NZ_AGFR00000000.1), Acetobacter pomorum (NZ_AEUP00000000.1), Gluconobacter morbifer (NZ_AGQV00000000.1), Providencia burhodogranariea (NZ_AKKL00000000.1), Providencia alcalifaciens (NZ_AKKM01000049.1), Providencia rettgeri (NZ_AJSB00000000.1), Enterococcus faecalis (NC_004668.1), Lactobacillus brevis (NC_008497.1), and Lactobacillus plantarum (NC_004567.2), to avoid paralogous mapping. We used Picard (v.1.109; http://picard.sourceforge.net) to remove duplicate reads and reads with a mapping quality below 20. In addition, we re-aligned sequences flanking insertions-deletions (indels) with GATK (v3.4-46; McKenna et al. 2010).
After mapping, Pool-Seq samples were tested for DNA contamination from D. simulans. To do this, we used a set of SNPs known to be divergent between D. simulans and D. melanogaster and assessed the frequencies of D. simulans-specific alleles following the approach of Bastide et al. (2013). We combined the genomes of D. melanogaster (v.6.12) and D. simulans (Hu et al. 2013) and separated species-specific reads for samples with a contamination level > 1% via competitive mapping against the combined references using the pipeline described above. Custom software was used to remove reads uniquely mapping to D. simulans. In 9 samples, we identified contamination with D. simulans, ranging between 1.2% and 8.7% (Table S1). After applying our decontamination pipeline, contamination levels dropped below 0.4% in all 9 samples.
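To make the contamination screen concrete, the following minimal sketch estimates a contamination level as the mean frequency of D. simulans-specific alleles across diagnostic SNPs and flags samples above the 1% cut-off. It is a simplified illustration under stated assumptions, not the custom software used here: the function names and the input layout (one pair of read counts per diagnostic site) are hypothetical.

```python
def contamination_level(site_counts):
    """Mean frequency of D. simulans-specific alleles across diagnostic SNPs.

    site_counts: iterable of (simulans_allele_reads, total_reads), one pair
    per diagnostic site (hypothetical input layout). Uncovered sites are skipped.
    """
    freqs = [sim / total for sim, total in site_counts if total > 0]
    if not freqs:
        raise ValueError("no covered diagnostic sites")
    return sum(freqs) / len(freqs)


def needs_decontamination(level, threshold=0.01):
    """True if the estimated contamination exceeds the 1% cut-off."""
    return level > threshold
```

A sample with an estimated level of 0.02 would be passed on to competitive mapping, whereas one at 0.004 would not.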
We used Qualimap (v. 2.2; Okonechnikov et al. 2016) to evaluate average mapping qualities per population and chromosome, which ranged from 58.3 to 58.8 (Table S1). We found heterogeneous sequencing depths among the 48 samples, ranging from 34x to 115x for autosomes and from 17x to 59x for X-chromosomes (Figure S1, Table S1). We then combined individual BAM files from all samples into a single mpileup file using samtools (v. 1.3; Li & Durbin 2009). Due to the large number of Pool-Seq datasets analyzed in parallel, we had to implement quality control criteria for all libraries jointly in order to call SNPs. To accomplish this, we implemented a novel custom SNP calling software to call SNPs with stringent heuristic parameters (PoolSNP; see Supplementary Information), available at Dryad (doi: https://doi.org/10.5061/dryad.rj1gn54). A site was considered polymorphic if (1) the minimum coverage from all samples was greater than 10x, (2) the maximum coverage from all samples was less than the 95th coverage percentile for a given chromosome and sample (to avoid paralogous regions duplicated in the sample but not in the reference), (3) the minimum read count for a given allele was greater than 20x across all samples pooled, and (4) the minimum read frequency of a given allele was greater than 0.001 across all samples pooled. The above threshold parameters were optimized based on simulated Pool-Seq data in order to maximize true positives and minimize false positives (see Figure S18 and Supporting Information). Additionally, we excluded SNPs (1) for which more than 20% of all samples did not fulfill the above-mentioned coverage thresholds, (2) which were located within 5 bp of an indel with a minimum count larger than 10x in all samples pooled, and (3) which were located within known transposable elements (TE) based on the D. melanogaster TE library v.6.10. We further annotated our final set of SNPs with SNPeff (v.4.2; Cingolani et al. 2012) using the Ensembl genome annotation version BDGP6.82 (Figure 3).
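As an illustration only, criteria (1) to (4) above can be read as a per-site filter along the following lines. This is a minimal sketch and not the PoolSNP implementation: the function name, the input layout (one allele-count dictionary per sample) and the choice to reject a site as soon as any single sample fails a coverage threshold are assumptions made for brevity.

```python
def is_polymorphic(counts, max_cov, min_cov=10, min_count=20, min_freq=0.001):
    """Apply the four heuristic criteria to a single genomic site.

    counts:  list with one dict per sample, mapping allele -> read count.
    max_cov: list of per-sample coverage caps (e.g. the 95th coverage
             percentile of the chromosome in that sample).
    """
    depths = [sum(c.values()) for c in counts]
    # (1) every sample must exceed the minimum coverage
    if any(d <= min_cov for d in depths):
        return False
    # (2) every sample must stay below its coverage cap
    if any(d >= cap for d, cap in zip(depths, max_cov)):
        return False
    # Pool allele counts across all samples
    pooled = {}
    for c in counts:
        for allele, n in c.items():
            pooled[allele] = pooled.get(allele, 0) + n
    total = sum(pooled.values())
    # (3) + (4) an allele counts only if it passes both pooled thresholds
    passing = [a for a, n in pooled.items()
               if n > min_count and n / total > min_freq]
    # A polymorphic site needs at least two alleles that pass
    return len(passing) >= 2
```

In this reading, a site with a rare allele supported by 15 reads across all pools is treated as monomorphic, because the pooled count does not exceed 20.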
Combined and population-specific site frequency spectra (SFS)
We quantified the amount of allelic variation with respect to different SNP classes. For this, we first combined the full dataset across all 48 samples and used the SNPeff annotation (see above) to classify the SNPs into four classes (intergenic, intronic, non-synonymous, and synonymous). For each class, we calculated the site frequency spectrum (SFS) based on minor allele frequencies for the X-chromosome and the autosomes, as well as for each sample and chromosomal arm separately, by counting alleles in 50 frequency bins of size 0.01.
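The binning step can be sketched as follows. This is a minimal illustration assuming that allele frequencies are given as fractions between 0 and 1; the function name is hypothetical, and the handling of frequencies that fall exactly on a bin boundary is a choice of this sketch rather than something specified above.

```python
def folded_sfs(allele_freqs, n_bins=50, bin_size=0.01):
    """Count SNPs per minor-allele-frequency bin (50 bins of width 0.01)."""
    sfs = [0] * n_bins
    for f in allele_freqs:
        maf = min(f, 1.0 - f)          # fold to the minor allele frequency
        if maf <= 0:
            continue                   # monomorphic site, not part of the SFS
        # a frequency of exactly 0.5 goes into the last bin
        idx = min(int(maf / bin_size), n_bins - 1)
        sfs[idx] += 1
    return sfs
```

Calling this once per SNP class, and separately for the X chromosome and the autosomes, would give the class-specific spectra described above.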
Genetic variation in Europe
We characterized patterns of genetic variation among the 48 samples by estimating three standard population genetic parameters: π, Watterson’s θ and Tajima’s D (Watterson 1975; Nei 1987; Tajima 1989). We focused on SNPs located on the five major chromosomal arms (X, 2L, 2R, 3L, 3R) and calculated sample-wise π, θ and Tajima’s D with corrections for Pool-Seq data (Kofler et al. 2011). Since PoPoolation, the most commonly used software for population genetics inference from Pool-Seq data, does not allow using predefined SNPs (which was desirable for our analyses), we implemented corrected population genetic estimators described in Kofler et al. (2011) in Python (PoolGen; available at Dryad; doi: https://doi.org/10.5061/dryad.rj1gn54). Before calculating the estimators, we subsampled the data to an even coverage of 40x for the autosomes and 20x for the X-chromosome to control for the sensitivity to coverage variation of Watterson’s θ and Tajima’s D (Korneliussen et al. 2013). At sites with greater than 40x coverage, we randomly subsampled reads to 40x without replacement; at sites with below 40x coverage, we sampled reads 40 times with replacement. Using R (R Development Core Team 2009), we calculated sample-wise chromosome-wide averages for autosomes and X chromosomes separately and tested for correlations of π, θ and Tajima’s D with latitude, longitude, altitude, and season using a linear regression model of the following form: y_i = Lat + Lon + Alt + Season + ε_i, where y_i is π, θ, or D. Here, latitude, longitude, and altitude are continuous predictors (Table 1), while ‘season’ is a categorical factor with two levels S (“summer”) and F (“fall”), corresponding to collection dates before and after September 1st, respectively. We chose this arbitrary threshold for consistency with previous studies (Bergland et al. 2014; Kapun et al. 2016a).
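The coverage-equalization step described in this paragraph can be sketched for a single site as follows. This is an illustrative sketch rather than the PoolGen code: the function name and the allele-count dictionary are hypothetical, and sites with exactly the target coverage are simply returned unchanged in composition.

```python
import random


def subsample_site(allele_counts, target=40, rng=random):
    """Resample the reads at one site to a fixed target coverage.

    allele_counts: dict mapping allele -> read count in one sample.
    Sites at or above the target are subsampled without replacement;
    sites below the target are sampled `target` times with replacement.
    """
    reads = [a for a, n in allele_counts.items() for _ in range(n)]
    if not reads:
        return {}
    if len(reads) >= target:
        drawn = rng.sample(reads, target)       # without replacement
    else:
        drawn = rng.choices(reads, k=target)    # with replacement
    out = {}
    for allele in drawn:
        out[allele] = out.get(allele, 0) + 1
    return out
```

For X-linked sites the same function would be called with `target=20`. Passing a seeded `random.Random` instance as `rng` makes the resampling reproducible.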
To further test for residual spatio-temporal autocorrelation among the samples (Kühn & Dormann 2012), we calculated Moran’s I (Moran 1950) with the R package spdep (v.06-15., Bivand & Piras 2015). To do this, we used the residuals of the above-mentioned models, as well as matrices defining pairs of samples as neighbors weighted by geographical distances between the locations (samples within 10° latitude/longitude were considered neighbors). Whenever these tests revealed significant autocorrelation (indicating non-independence of the samples), we repeated the above-mentioned regressions using spatial error models as implemented in the R package spdep, which incorporate spatial effects through weighted error terms, as described above.
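For orientation, the Moran’s I statistic itself can be written in a few lines of Python. The sketch below only computes the statistic from a user-supplied weight matrix; it does not reproduce the significance test or the weighting scheme of spdep, and the function name `morans_i` is illustrative.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a square spatial weight matrix.

    weights[i][j] is the weight linking samples i and j (0 for non-neighbors);
    diagonal entries are ignored.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    w_sum = sum(weights[i][j] for i, j in pairs)
    cross = sum(weights[i][j] * dev[i] * dev[j] for i, j in pairs)
    total = sum(d * d for d in dev)
    return (n / w_sum) * (cross / total)
```

Applied to model residuals, values near the null expectation of −1/(n − 1) suggest little residual spatial autocorrelation, whereas clearly positive values indicate that neighboring samples have similar residuals.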
To test for confounding effects of variation in sequencing errors between runs, we extended the above-mentioned linear models including the run ID as a random factor using the R package lme4 (v.1.1-14; see Supporting Information). Preliminary analyses showed that this model was not significantly better than simpler models, so we did not include sequencing run in the final analysis (see Supporting Information and Table S4).
To investigate genome-wide patterns of variation, we averaged π, θ, and D in 200 kb non-overlapping windows for each sample and chromosomal arm separately and plotted the distributions in R. In addition, we calculated Tajima’s D in 50 kb sliding windows with a step size of 10 kb to investigate fine-scale deviations from neutral expectations. We applied heuristic parameters to identify genomic regions harboring potential candidates for selective sweeps. To identify candidate regions with sweep patterns across most of the 48 samples, we searched for windows with log-transformed recombination rates ≥ 0.5, pairwise nucleotide diversity (π ≤ 0.004), and average Tajima’s D across all populations ≤ −0.8 (5% percentile). To identify potential selective sweeps restricted to a few population samples only, we searched for regions characterized as above but allowing one or more samples with Tajima’s D being more than two standard deviations smaller than the window-wise average. To account for the effects of strong purifying selection in gene-rich genomic regions, which can result in locally negative Tajima’s D (Tajima 1989) and thus confound the detection of selective sweeps, we repeated the analysis based on silent sites only (4-fold degenerate sites, SNPs in short introns of ≤ 60 bp length, and SNPs in intergenic regions at ≥ 2000 bp distance to the closest gene). Despite the reduction in polymorphic sites available for this analysis, we found highly consistent sweep regions and therefore proceeded with the full SNP datasets, which provided better resolution (results not shown).
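The heuristic window filter can be sketched in Python as shown below. The dictionary keys (`id`, `log_rec`, `pi`, `tajima_d`) and both function names are hypothetical, and the sketch assumes that π has already been averaged across samples for each window, which the text does not state explicitly.

```python
from statistics import mean, stdev


def shared_sweep_candidates(windows, min_log_rec=0.5, max_pi=0.004, max_mean_d=-0.8):
    """Return ids of windows passing all three thresholds.

    Each window is a dict with keys 'id', 'log_rec' (log-transformed
    recombination rate), 'pi' (assumed averaged across samples) and
    'tajima_d' (list of per-sample Tajima's D values).
    """
    return [
        w["id"]
        for w in windows
        if w["log_rec"] >= min_log_rec
        and w["pi"] <= max_pi
        and mean(w["tajima_d"]) <= max_mean_d
    ]


def outlier_samples(tajima_d, n_sd=2.0):
    """Indices of samples whose D lies more than n_sd SDs below the window mean."""
    threshold = mean(tajima_d) - n_sd * stdev(tajima_d)
    return [i for i, d in enumerate(tajima_d) if d < threshold]
```

The first function corresponds to sweeps shared across most samples; the second flags individual samples in a window, as a starting point for detecting sweeps restricted to a few populations.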
For statistical analysis, the diversity statistics were log-transformed to normalize the data. We then tested for correlations between π and recombination rate using R in 100 kb non-overlapping windows and plotted these data using the ggplot2 package (v.2.2.1., Wickham 2016). We used two different recombination rate measurements: (i) a fine-scale, high resolution genomic recombination rate map based on millions of SNPs in a small number of strains (Comeron et al. 2012), and (ii) the broad-scale Recombination Rate Calculator based on Marey maps generated by laboratory cross data fitting genetic and physical positions of 644 markers to a third-order polynomial curve for each chromosome arm (Fiston-Lavier et al. 2010). Both measurements were converted to version 6 of the D. melanogaster reference genome to match the genomic position of π estimates (see above).
SNP counts and overlap with other datasets
We used the panel of SNPs identified in the DrosEU dataset (available at Dryad; doi: https://doi.org/10.5061/dryad.rj1gn54) to describe the overlap in SNP calls with other published D. melanogaster population data: the Drosophila Population Genomics Project 3 (DPGP3) from Siavonga, Zambia (69 non-admixed lines; Lack et al. 2015; 2016) and the Drosophila Genetic Reference Panel (DGRP) from Raleigh, North Carolina, USA (205 inbred lines; Mackay et al. 2012; Huang et al. 2014). For these comparisons, we focused on biallelic SNPs on the 5 major chromosome arms. We used bwa mem for mapping and a custom pipeline for heuristic SNP calling (PoolSNP; Figure 3). To make the data from the 69 non-admixed lines from Zambia (Lack et al. 2015; 2016) comparable to our data, we reanalyzed these data using our pipeline for mapping and variant calling (Figure 3).
The VCF file of the DGRP data was downloaded from http://dgrp2.gnets.ncsu.edu/ and converted to coordinates according to the D. melanogaster reference genome v.6. We depicted the overlap of SNPs called in the three different populations using elliptic Venn diagrams with eulerAPE software (v3 3.0.0., Micallef & Rodgers 2014). While the DrosEU data were generated from sequencing pools of wild-caught individuals, both the DGRP and DPGP3 data are based on individual sequencing of inbred lines and haploid individuals, respectively.
Genetic differentiation and population structure in European populations
To estimate genome-wide pairwise genetic differences, we used custom software to calculate SNP-wise FST using the approach of Weir and Cockerham (1984). We estimated SNP-wise FST for all possible pairwise combinations among samples. For each sample, we then averaged FST across all SNPs for all pairwise combinations that include this particular sample and finally ranked the 48 population samples by overall differentiation.
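The ranking step can be illustrated with a short Python sketch. It assumes that a mean FST value per sample pair is already available and averages these pairwise means per sample; this is a simplification of averaging across all SNPs directly, and the function name `rank_by_mean_fst` is hypothetical.

```python
def rank_by_mean_fst(pairwise_fst):
    """Rank samples by their average FST against all other samples.

    pairwise_fst: dict mapping (sample_a, sample_b) -> mean FST of that pair.
    Returns a list of (sample, mean FST) sorted from most to least differentiated.
    """
    totals, counts = {}, {}
    for (a, b), fst in pairwise_fst.items():
        for sample in (a, b):
            totals[sample] = totals.get(sample, 0.0) + fst
            counts[sample] = counts.get(sample, 0) + 1
    means = {s: totals[s] / counts[s] for s in totals}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)
```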
We inferred demographic patterns in European populations by focusing on 20,008 putatively neutrally evolving SNPs located in small introns (less than 60 bp length; Haddrill et al. 2005; Singh et al. 2009; Parsch et al. 2010; Clemente & Vogl 2012; Lawrie et al. 2013) that were at least 200 kb distant from the major chromosomal inversions (see below). To assess isolation by distance (IBD), we averaged FST values for each sample pair across all neutral markers and calculated geographic distances between samples using the haversine formula (Green & Smart 1985), which takes the spherical curvature of the planet into account. We tested for correlations between genetic differentiation and geographic distance with Mantel tests in the R package ade4 (v.1.7-8., Dray & Dufour 2007) with 1,000,000 iterations. In addition, we plotted the 5% smallest and 5% largest FST values from all 1,128 pairwise comparisons among the 48 population samples onto a map to visualize geographic patterns of genetic differentiation. From these putatively neutral SNPs, we used observed FST on the autosomes (Faut) to calculate the expected FST on X chromosomes (FX) as in Machado et al. (2016) using the equation FX = 1 − 9(z + 1)(1 − Faut) / [8(2z + 1) − (1 − Faut)(7z − 1)], where z is the ratio of effective population sizes of males (Nm) and females (Nf), Nm/Nf (Ramachandran et al. 2004). For the purposes of this study we assume z = 1, in which case the expression reduces to FX = 4Faut / (3 + Faut).
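Two of the calculations in this paragraph are compact enough to sketch in Python. The first function implements the haversine great-circle distance; the Earth radius of 6,371 km is an assumed mean value. The second computes the expected X-linked FST from the autosomal value, assuming the relationship of Ramachandran et al. (2004) in the form written in the code comment. Both function names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def expected_fx(f_aut, z=1.0):
    """Expected X-linked FST given autosomal FST and z = Nm/Nf.

    Assumed form: FX = 1 - 9(z+1)(1-Faut) / [8(2z+1) - (1-Faut)(7z-1)].
    """
    return 1 - 9 * (z + 1) * (1 - f_aut) / (8 * (2 * z + 1) - (1 - f_aut) * (7 * z - 1))
```

With z = 1 the second function returns 4Faut / (3 + Faut), so the expected differentiation on the X is always somewhat higher than on the autosomes.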
We further investigated genetic variation in our dataset by principal component analysis (PCA) based on allele frequencies of the neutral marker SNPs described above. We used the R package LEA (v. 1.2.0., Frichot et al. 2013) and performed PCA on unscaled allele frequencies as suggested by Menozzi et al. (1978) and Novembre and Stephens (2008). We focused on the first three principal components (PCs) and employed a model-based approach as implemented in the R package mclust (v. 5.2., Fraley & Raftery 2012) to identify the most likely number of clusters based on maximum likelihood and assigned population samples to clusters by k-means clustering in R (R Development Core Team 2009). Finally, we examined the first three PCs for correlations with latitude, longitude, altitude, and season using general linear models and tested for spatial autocorrelation as described above. A Bonferroni-corrected α threshold (α’= 0.05/3 = 0.017) was used to account for multiple testing.
Mitochondrial DNA
To obtain consensus mitochondrial sequences for each of the 48 European populations, reads from individual FASTQ files were aligned and minor variants replaced by the major variant using Coral (Salmela & Schröder 2011). This way, ambiguities that might prevent the growth of contigs from reads during the assembly process can be eliminated. For each population, a genome assembly was obtained using SPAdes using standard parameters and k-mers of size 21, 33, 55, and 77 (Bankevich et al. 2012) and the corrected FASTQ files. Mitochondrial contigs were retrieved by blastn, using the D. melanogaster NC_024511 sequence as a query and each genome assembly as the database. To avoid nuclear mitochondrial DNA segments (numts), we ensured that only contigs with a much higher coverage than the average coverage of the genome were retrieved. When multiple contigs were available for the same region, the one with the highest coverage was selected. Possible contamination with D. simulans was assessed by looking for two or more consecutive sites that show the same variant as D. simulans and looking for alternative contigs for that region with similar coverage. As an additional quality control measure, we also examined the presence of pairs of sites showing four gametic types using DnaSP 6 (Rozas et al. 2017) – given that there is no recombination in mitochondrial DNA, no such sites are expected. The very few sites presenting such features were rechecked by looking for alternative contigs for that region and were corrected if needed. The uncorrected raw reads for each population were mapped on top of the different consensus haplotypes using Express as implemented in Trinity (Grabherr et al. 2011). If most reads for a given population mapped to the consensus sequence derived for that population, the consensus sequence was retained; otherwise it was discarded as a possible chimera between different mitochondrial haplotypes.
The repetitive mitochondrial hypervariable region is difficult to assemble and was therefore not used; the mitochondrial region was thus analyzed as in Cooper et al. (2015). The mitochondrial genealogy was estimated from the surviving mitochondrial haplotypes using statistical parsimony (TCS network; Clement et al. 2000), as implemented in PopArt (http://popart.otago.ac.nz).
Frequencies of the different mitochondrial haplotypes were estimated from FPKM values using the surviving mitochondrial haplotypes and Express as implemented in Trinity (Grabherr et al. 2011).
Transposable elements
To quantify the transposable element (TE) abundance in each sample, we assembled and quantified the repeats from unassembled sequenced reads using dnaPipeTE (v.1.2., Goubert et al. 2015). The vast majority of high-quality trimmed reads were longer than 135 bp. We thus discarded reads shorter than 135 bp before sampling. Reads matching mtDNA were filtered out by mapping to the D. melanogaster reference mitochondrial genome (NC_024511.2) with bowtie2 (v. 2.1.0., Langmead & Salzberg 2012). Prokaryotic sequences, including reads from symbiotic bacteria such as Wolbachia, were filtered out from the reads using the implementation of blastx (translated nucleic vs. protein database) vs. the non-redundant protein database (nr) using DIAMOND (v. 0.8.7., Buchfink et al. 2015). To quantify TE content, we subsampled a proportion of the raw reads (after filtering) corresponding to a genome coverage of 0.1X (assuming a genome size of 175 Mb), and then assembled these reads with the Trinity assembler (Grabherr et al. 2011). Due to the low coverage of the genome obtained with the subsampled reads, only repetitive DNA present in multiple copies should be fully assembled (Goubert et al. 2015). We repeated this process with three iterations per sample, as recommended by the program guidelines, to assess the repeatability of the estimates.
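The number of reads implied by a given target coverage follows from simple arithmetic, sketched below in Python. The function name `reads_for_coverage` is hypothetical, and the mean read length has to be supplied by the user because it is not reported here.

```python
import math


def reads_for_coverage(coverage, genome_size_bp, mean_read_len_bp):
    """Number of reads needed to reach the requested genome coverage."""
    if mean_read_len_bp <= 0:
        raise ValueError("mean read length must be positive")
    return math.ceil(coverage * genome_size_bp / mean_read_len_bp)
```

For example, with a 175 Mb genome, 0.1X coverage and an assumed mean read length of 140 bp, roughly 125,000 reads would be sampled per iteration.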
We further estimated frequencies of previously characterized TEs present in the reference genome with T-lex2 (v. 2.2.2., Fiston-Lavier et al. 2015), using all annotated TEs (5,416 TEs) in version 6.04 of the D. melanogaster genome from flybase.org (Gramates et al. 2017). For 108 of these TEs, we used the corrected coordinates as described in Fiston-Lavier et al. (2015), based on the identification of target site duplications at the site of the insertion. We excluded TEs nested or flanked by other TEs (<100 bp on each side of the TE), and TEs which are part of segmental duplications, since T-lex2 does not provide accurate frequency estimates in complex regions (Fiston-Lavier et al. 2015). We additionally excluded the INE-1 TE family, as this TE family is ancient, with thousands of insertions in the reference genome, which appear to be mostly fixed (2,234 TEs; Kapitonov & Jurka 2003).
After applying these filters, we were able to estimate frequencies of 1,630 TE insertions from 113 families from the three main orders, LTR, non-LTR, and DNA across all DrosEU samples. T-lex2 contains three main modules: (i) the presence detection module, (ii) the absence detection module, and (iii) the combine module, which joins the results from the former two detection modules. In the presence module, T-lex2 uses Maq (v. 0.7.1., Li et al. 2008) for the mapping of reads. As Maq only accepts reads 127 bp or shorter, we cut the trimmed reads following the general pipeline (Figure 3) and then used Trimmomatic (v. 0.35; Bolger et al. 2014) to cut trimmed reads longer than 100 bp into two equally sized fragments using CROP and HEADCROP parameters. Only the presence module was run with the cut reads.
To avoid inaccurate TE frequency estimates due to very low numbers of reads, we only considered frequency estimates based on at least 3 reads. Despite the stringency of T-lex2 to select only high-quality reads, we additionally discarded frequency estimates supported by more than 90 reads, i.e. 3 times the average coverage of the sample with the lowest coverage (CH_Cha_14_43, Table 1), in order to avoid non-unique mapping reads.
This filtering allowed us to estimate TE frequencies for ∼96% (92.9% to 97.8%) of the TEs in each population. For 85% of the TEs, we were able to estimate their frequencies in more than 44 out of 48 DrosEU samples.
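The read-support filter described above can be sketched in Python as follows. The input layout (a dictionary of TE identifier to a frequency and read-count pair) and the function name are assumptions made for illustration; they do not reflect the T-lex2 output format.

```python
def filter_te_estimates(estimates, min_reads=3, max_reads=90):
    """Keep TE frequency estimates supported by min_reads..max_reads reads.

    estimates: dict mapping TE id -> (frequency, number of supporting reads).
    Estimates with fewer than min_reads or more than max_reads are dropped.
    """
    return {
        te: (freq, n_reads)
        for te, (freq, n_reads) in estimates.items()
        if min_reads <= n_reads <= max_reads
    }
```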
We tested for correlations between TE insertion frequencies and recombination rates using Spearman’s rank correlations as implemented in R. As for SNPs, we used recombination rates from Comeron et al. (2012) and from the Recombination Rate Calculator (Fiston-Lavier et al. 2010) in non-overlapping 100 kb windows, and assigned to each TE insertion the recombination rate of the corresponding 100 kb genomic window.
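Both steps, assigning a window-based recombination rate to an insertion and computing a rank correlation, can be illustrated in plain Python. The sketch assumes 1-based genomic positions and a dictionary of rates keyed by window index; all function names are hypothetical, and only the correlation coefficient is computed, not its p-value.

```python
import math


def window_rate(position, rates_by_window, window_size=100_000):
    """Recombination rate of the non-overlapping window containing a 1-based position."""
    return rates_by_window[(position - 1) // window_size]


def average_ranks(values):
    """1-based ranks, with tied values receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```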
To test for spatio-temporal variation of TE insertions, we excluded TEs with an interquartile range (IQR) < 10. There were 141 TE insertions with variable population frequencies among the DrosEU samples. We tested the population frequencies of these insertions for correlations with latitude, longitude, altitude, and season using generalized linear models (ANCOVA) following the method used for SNPs but with a binomial error structure in R.
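The variability filter can be sketched in Python as shown below. The sketch assumes that population frequencies are expressed in percent, so that the threshold of 10 is on the same scale, and it uses the "inclusive" quantile method of the standard library, which should correspond to the default quantile definition in R. Function names are illustrative.

```python
from statistics import quantiles


def iqr(values):
    """Interquartile range using linear interpolation between data points."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def variable_tes(freqs_by_te, min_iqr=10.0):
    """Ids of TEs whose frequency IQR across samples is at least min_iqr.

    freqs_by_te: dict mapping TE id -> list of per-sample frequencies (percent, assumed).
    """
    return [te for te, freqs in freqs_by_te.items() if iqr(freqs) >= min_iqr]
```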
We also tested for residual spatio-temporal autocorrelations with Moran’s I test (Moran 1950; Kühn & Dormann 2012). We used Bonferroni corrections to account for multiple testing (α’= 0.05/141 = 0.00035) and only considered Bonferroni-corrected p-values < 0.001 to be significant. TEs located in regions with a recombination rate that differed from 0 cM/Mb according to both measures (see above) were considered to be in high-recombination regions. To test for TE family enrichment among the significant TEs, we performed a χ2 test and applied Yates’ correction to account for the low counts in some of the cells.
Inversion polymorphisms
Since Pool-Seq data preclude a direct assessment of the presence and frequencies of chromosomal inversions, we indirectly estimated inversion frequencies using a panel of approximately 400 inversion-specific marker SNPs (Kapun et al. 2014) for six cosmopolitan inversions (In(2L)t, In(2R)NS, In(3L)P, In(3R)C, In(3R)Mo, In(3R)Payne). We averaged allele frequencies of these markers in each sample separately. To test for clinal variation in the frequencies of inversions, we tested for correlations with latitude, longitude, altitude and season using generalized linear models with a binomial error structure in R to account for the biallelic nature of karyotype frequencies. In addition, we tested for residual spatio-temporal autocorrelations as described above and Bonferroni-corrected the α threshold (α’= 0.05/7 = 0.007) to account for multiple testing.
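The marker-based frequency estimate for one sample can be sketched in Python as follows. The input layout and the function name are assumptions; markers without coverage are represented as `None` and skipped, which is an illustrative choice and not a documented step of the original analysis.

```python
from statistics import mean


def inversion_frequencies(marker_freqs):
    """Estimate inversion frequencies in one sample from marker SNPs.

    marker_freqs: dict mapping inversion name -> list of frequencies of the
    inversion-specific allele at each marker SNP (None where no data).
    Returns a dict mapping inversion name -> mean marker frequency (or None).
    """
    estimates = {}
    for inversion, freqs in marker_freqs.items():
        observed = [f for f in freqs if f is not None]
        estimates[inversion] = mean(observed) if observed else None
    return estimates
```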
Microbiome
Raw sequences were trimmed and quality filtered as described for the genomic data analysis. The remaining high quality sequences were mapped against the D. melanogaster genome (v.6.04) including mitochondria using bbmap (v. 35; Bushnell 2016) with standard settings. The unmapped sequences were submitted to the online classification tool, MGRAST (Meyer et al. 2008) for annotation. Taxonomy information was downloaded and analyzed in R (v. 3.2.3; R Development Core Team 2009) using the matR (v. 0.9; Braithwaite & Keegan) and RJSONIO (v. 1.3; Lang) packages. Metazoan sequence features were removed. For microbial load comparisons, the number of protein features identified by MGRAST for each taxon and sample was divided by the number of sequences that mapped to D. melanogaster chromosomes X, Y, 2L, 2R, 3L, 3R and 4.
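The normalization used for the microbial load comparison can be written as a short Python sketch. The input dictionaries and the function name are hypothetical and do not reflect the MGRAST export format.

```python
FLY_CHROMOSOMES = ("X", "Y", "2L", "2R", "3L", "3R", "4")


def microbial_load(feature_counts, fly_reads_by_chrom):
    """Protein features per taxon divided by reads mapped to the fly chromosomes.

    feature_counts: dict mapping taxon -> number of protein features in one sample.
    fly_reads_by_chrom: dict mapping chromosome name -> mapped read count.
    """
    fly_total = sum(fly_reads_by_chrom[chrom] for chrom in FLY_CHROMOSOMES)
    if fly_total == 0:
        raise ValueError("no reads mapped to the fly chromosomes")
    return {taxon: n / fly_total for taxon, n in feature_counts.items()}
```

Dividing by the number of fly-derived reads makes the values comparable among samples that differ in sequencing depth.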
We also surveyed the datasets for the presence of novel DNA viruses by performing de novo assembly of the non-fly reads using SPAdes 3.9.0 (Bankevich et al. 2012), and using conceptual translations to query virus proteins from Genbank using DIAMOND ‘blastp’ (Buchfink et al. 2015). In three cases (Kallithea virus, Vesanto virus, Viltain virus), reads from a single sample pool were sufficient to assemble a (near) complete genome. In two other cases, fragmentary assemblies allowed us to identify additional publicly available datasets that contained sufficient reads to complete the genomes (Linvill Road virus, Esparto virus; completed using SRA datasets SRR2396966 and SRR3939042, respectively). Novel viruses were provisionally named based on the localities where they were first detected, and the corresponding novel genome sequences were submitted to Genbank (KX130344, KY608910, KY457233, KX648533-KX648536). To assess the relative amount of viral DNA, unmapped (non-fly) reads from each sample pool were mapped to repeat-masked Drosophila DNA virus genomes using bowtie2, and coverage normalized relative to virus genome length and the number of mapped Drosophila reads.
Additional information
Funding
Author contributions
Martin Kapun, Visualization, Writing-original draft preparation, Formal analysis, Conceptualization, Writing-review & editing, Supervision, Methodology, Investigation, Data curation, Project administration, Validation, Resources, Software; Maite G. Barrón, Visualization, Writing-original draft preparation, Formal analysis, Conceptualization, Writing-review & editing, Methodology, Investigation, Data curation, Project administration, Validation, Resources, Software; Fabian Staubach, Visualization, Writing-original draft preparation, Formal analysis, Conceptualization, Writing-review & editing, Supervision, Funding acquisition, Methodology, Investigation, Data curation, Validation, Resources, Software; Jorge Vieira, Visualization, Writing-original draft preparation, Formal analysis, Conceptualization, Writing-review & editing, Methodology, Investigation, Validation, Resources; Darren J. Obbard, Writing-original draft preparation, Formal analysis, Conceptualization, Writing-review & editing, Methodology, Investigation, Validation, Resources; Clément Goubert, Visualization, Writing-original draft preparation, Formal analysis, Conceptualization, Writing-review & editing, Investigation, Resources; Omar Rota-Stabelli, Visualization, Writing-original draft preparation, Formal analysis, Conceptualization, Writing-review & editing, Methodology, Investigation, Resources; Maaria Kankare, Writing-original draft preparation, Conceptualization, Writing-review & editing, Methodology, Investigation, Resources; Annabelle Haudry, Writing-original draft preparation, Formal analysis, Conceptualization, Writing-review & editing, Investigation, Validation, Resources; R. Axel W. Wiberg, Writing-original draft preparation, Formal analysis, Conceptualization, Writing-review & editing, Methodology, Investigation, Resources, Software; Lena Waidele, Svitlana Serga, Patricia Gibert, Damiano Porcelli, Sonja Grath, Eliza Argyridou, Lain Guio, Mads Fristrup Schou, Conceptualization, Writing-review & editing, Investigation, Resources; Iryna Kozeretska, Conceptualization, Writing-review & editing, Methodology, Investigation, Resources; Elena G. Pasyukova, Marta Pascual, Alan O. Bergland, Conceptualization, Writing-review & editing, Funding acquisition, Methodology, Investigation, Resources; Volker Loeschcke, Catherine Montchamp-Moreau, Jessica Abbott, Nico Posnien, Maria Pilar Garcia Guerreiro, Banu Sebnem Onder, Conceptualization, Writing-review & editing, Funding acquisition, Investigation, Resources; Cristina P. Vieira, Visualization, Formal analysis, Conceptualization, Writing-review & editing, Investigation, Resources; Élio Sucena, Conceptualization, Writing-review & editing, Methodology, Investigation, Project administration, Resources; Cristina Vieira, Michael G. Ritchie, Thomas Flatt, Josefa González, Writing-original draft preparation, Conceptualization, Writing-review & editing, Supervision, Funding acquisition, Methodology, Investigation, Project administration, Validation, Resources; Bart Deplancke, Conceptualization, Writing-review & editing, Funding acquisition, Investigation; Bas J. Zwaan, Visualization, Writing-original draft preparation, Conceptualization, Writing-review & editing, Supervision, Funding acquisition, Methodology, Investigation, Project administration; Eran Tauber, Writing-original draft preparation, Conceptualization, Writing-review & editing, Funding acquisition, Methodology, Investigation, Resources; Dorcas J. Orengo, Eva Puerma, Conceptualization, Writing-review & editing, Investigation, Validation, Resources; Montserrat Aguadé, Writing-original draft preparation, Conceptualization, Writing-review & editing, Methodology, Investigation, Validation, Resources; Paul S. Schmidt, John Parsch, Writing-original draft preparation, Conceptualization, Writing-review & editing, Funding acquisition, Methodology, Investigation, Validation, Resources; Andrea J. Betancourt, Writing-original draft preparation, Formal analysis, Conceptualization, Writing-review & editing, Supervision, Funding acquisition, Methodology, Investigation, Project administration, Validation, Resources
Author ORCIDs
Acknowledgments
We are grateful to all members of the DrosEU and Dros-RTEC consortia and to Dmitri Petrov (Stanford University) for support and discussion. DrosEU is funded by a Special Topic Networks (STN) grant from the European Society for Evolutionary Biology (ESEB). Computational analyses were partially executed at the Vital-IT bioinformatics facility of the University of Lausanne (Switzerland) and at the computing facilities of the CC LBBE/PRABI in Lyon (France).
Footnotes
§ Members of the Drosophila Real Time Evolution (Dros-RTEC) Consortium