Abstract
Recent studies have consistently inferred high rates of adaptive molecular evolution between Drosophila species. At the same time, the Drosophila genome evolves under different rates of recombination, which results in partial genetic linkage between alleles at neighboring genomic loci. Here we analyze how linkage correlations affect adaptive evolution. We develop a new inference method for adaptation that takes into account the effect on an allele at a focal site caused by neighboring deleterious alleles (background selection) and by neighboring adaptive substitutions (hitchhiking). Using complete genome sequence data and fine-scale recombination maps, we infer a highly heterogeneous scenario of adaptation in Drosophila. In high-recombining regions, about 50% of all amino acid substitutions are adaptive, together with about 20% of all substitutions in proximal intergenic regions. In low-recombining regions, only a small fraction of the amino acid substitutions are adaptive, while hitchhiking accounts for the majority of these changes. Hitchhiking of deleterious alleles generates a substantial collateral cost of adaptation, leading to a fitness decline of about 30/2N per gene and per million years in the lowest-recombining regions. Our results show how recombination shapes rate and efficacy of the adaptive dynamics in eukaryotic genomes.
Author Summary Because recombination takes place at a limited rate, alleles at neighboring sites in a genome can remain genetically linked over evolutionary periods. In this paper, we show that evolutionary forces generated by genetic linkage have drastic consequences for the adaptive dynamics in low-recombining parts of the Drosophila genome. Our study is based on a new method to analyze allele frequencies that is applicable to genome data at both high and low rates of recombination. We show that genes in low-recombining regions of the Drosophila genome incur a substantial cost of adaptation, because deleterious alleles get fixed more frequently than under high recombination. This cost reduces rate and power of the adaptive process. Our results suggest that the Drosophila genome has evolved to minimize this cost by placing genes under high adaptive pressure in high-recombining regions.
Introduction
Genetic linkage imposes evolutionary correlations between neighboring genomic loci. Two particular effects are well known: adaptive mutations induce genetic hitchhiking of linked neutral and weakly selected variants [1–12], and deleterious mutations cause background selection on linked sites[13–20]. Both effects reduce sequence diversity, but in different ways. Background selection caused by strongly deleterious mutations leads to an unbiased removal of genetic diversity, which can be described by a reduced effective population size[17,18]. Genetic hitchhiking in selective sweeps affects common variants in a stronger way than rare variants and, hence, distorts the shape of the allele frequency spectrum [3,10,21-23]. Both effects depend on the rate of recombination, and are expected to be strong evolutionary forces in low-recombining regions of the Drosophila genome.
In this paper, we integrate hitchhiking and background selection into a new method to infer rates and the genomic distribution of adaptation under (partial) genetic linkage. Our model is based on diffusion theory [24,25]. Its key new variable is the effective rate of linked selective sweeps, which governs the hitchhiking rate of neutral and weakly selected polymorphisms and can be estimated directly from sequence data. We model linked recurrent sweeps as a Poisson process, which is often implicit in other approaches [8,10,26,27]. Our method exploits the entire allele frequency spectrum, rather than the level of diversity alone.
In Drosophila melanogaster, studying linkage effects has a long history. These studies are motivated by a strong correlation between the recombination rate and observed levels of diversity [28–30], which has been attributed to background selection and hitchhiking [31–36]. Our integrated inference of hitchhiking, background selection and adaptive evolution uses data from the Drosophila melanogaster genetic reference panel (DGRP) [29], which consists of 168 complete genome sequences from inbred lines sampled in North Carolina, USA. We also use the recently published high-resolution recombination map by Comeron and coworkers [30] to analyze polymorphism spectra as a function of the recombination rate.
We show that linkage correlations affect the adaptive process of the Drosophila genome in two ways. Background selection explains the broad reduction of genetic diversity with decreasing recombination rate, which is consistent with previous findings [28–30]. In addition, the low-recombining regions (which account for 21% of autosomal sequence) are marked by strongly linked selective sweeps that generate substantial hitchhiking and distort allele frequency spectra. Our integrated inference method leads to estimates of adaptive rates also in these low-recombining regions, which had to be excluded from most previous studies due to the confounding effects of linkage[24,26,37,38]. We find a sharp drop in the rate of adaptation, compared to high-recombining regions. In addition, we estimate rate and effect of deleterious fixations due to hitchhiking, which quantify the cost of adaptation imposed by genetic linkage.
Results
Probabilistic evolution of neutral sequence under genetic hitchhiking and background selection
Figure 1 illustrates the evolutionary models used for our analysis. The full model of linked adaptation describes the evolution of a focal genomic site, which is coupled by background selection to neighboring deleterious variants and by hitchhiking to neighboring beneficial variants sweeping through the population (Figure 1a). We describe both types of linkage interactions by summary parameters, which enter an effective single-site model for the focal genomic position. Specifically, background selection caused by deleterious mutations reduces the genetic diversity at the focal site but leaves the shape of the allele frequency spectrum invariant; this effect can be described by a reduction in effective population size [13,17]. Hitchhiking in selective sweeps reduces the diversity and changes the allele frequency spectrum. We capture the effects of hitchhiking by an effective rate of linked selective sweeps. This parameter measures sweeps that are close enough to the focal site to affect its alleles by hitchhiking. Together, our full model has four parameters: the scaled mutation rate θ, the scaled divergence time τ, the scaled effective population size λ, and the effective rate of linked sweeps, v (Materials and Methods and Supplementary Text). To quantify the importance of both kinds of linkage interactions in different parts of the Drosophila genome, we compare our inference from the full linked adaptation model to results from two partial models. First, the background selection model retains the coupling of the focal site with neighboring deleterious variants but neglects hitchhiking (Figure 1b). This model has the three independent parameters θ, τ, λ and the constraint v = 0. Second, the unlinked adaptation model assumes that recombination is strong enough to annihilate all evolutionary effects of genetic linkage (Figure 1c). This model has the two independent parameters θ, τ and the constraint v = 0, λ = 1.
For all three models, we derive analytic probability distributions for the allele frequencies at the focal site (Supplementary Text). These frequency spectra distinguish the full linked adaptation model from the background selection model and the unlinked adaptation model, which is similar to the model introduced by Mustonen and Lässig [24] (Figure 1d). Specifically, a positive rate of linked sweeps, v, removes common variants relative to rare variants, which is correctly captured by our model. A qualitatively similar depletion of the frequency spectrum was observed in rapidly evolving populations involving multiple segregating beneficial mutations [23]. The three models also have distinct effects on summary statistics of allele frequencies such as the diversity and Tajima’s D, which measures the relative abundance of intermediate-frequency polymorphisms compared to low-and high-frequency variants [39].
The background selection model lowers diversity, but leaves Tajima’s D invariant. In contrast, the full linked adaptation model lowers diversity and generates negative values for Tajima’s D, indicating a depletion of common variants (Figure 1e).
Our approach to model background selection as a simple reduction of the effective population size is valid if linked deleterious mutations are under sufficiently strong negative selection. For background selection caused by weakly deleterious mutations, it has been shown that the effect on the spectrum is more complicated than a simple reduction in effective population size [13,14,19,20]. For the data set of this study, our simple model captures the dominant effect of background selection, which is an unbiased removal of diversity with decreasing recombination.
Using the analytical probability distributions shown in Figure 1d, we develop an inference framework to estimate the effective parameters θ, λ and v and various other evolutionary characteristics under the different models. As data input for this inference, we use divergence data between D. melanogaster and D. simulans, together with outgroup-directed allele frequency distributions in the D. melanogaster population (these spectra count the number of alleles that are different from the corresponding allele in the outgroup species D. simulans). To test our inference framework, we use the linked adaptation model to simulate allele frequency data for varying values of v. Our inference recovers all three parameters θ, λ and v with good accuracy (Figure 1d and Supplementary Figure 2). For later purposes, we also test whether we can infer directional selection on the focal site itself (Supplementary Figure S2a-c), with similarly accurate results.
Quantifying linkage effects from synonymous sites in Drosophila
We study the autosomal genome sequences of 168 inbred lines of Drosophila melanogaster, published in the Drosophila melanogaster Genetic Reference Panel (DGRP) [29]. We use the Drosophila simulans reference genome to orient allele frequencies of segregating sites and to determine fixed differences. We first consider all synonymous sites, binned according to the local recombination rate (see Material and Methods and Supplementary Table S1). Summary statistics of this binning clearly show a strong dependency on the recombination rate (Figure 2). First, we observe a moderate increase in the rate of divergence from Drosophila simulans (Figure 2a) with decreasing recombination rate. Second, the diversity decreases sharply by about a factor of 8 as the recombination rate decreases, which has been reported before [28,29] (Figure 2b). Finally, for low recombination, the allele frequency spectrum deviates from the standard spectrum. This can be seen in the growing difference between two estimators of neutral diversity (Figure 2b) and further quantified by Tajima’s D [39], which drops below −1.0 for zero recombining regions (Figure 2c). While a drop in diversity alone can be explained by background selection from linked deleterious mutations alone, a drop in Tajima’s D is in this case best explained by hitchhiking with linked beneficial mutations.
To quantify further the change in the allele frequency data with the recombination rate, we apply our probabilistic model to each recombination bin separately, estimating the population size λ, the mutation rate θ and the effective rate of linked drivers v. We report maximum likelihood estimates for these parameters for all recombination bins (Figure 3a-c). With decreasing recombination rate we observe clines in the parameters, which mirror the clines in the summary statistics (Figure 2). First, the mutation rate increases mildly with decreasing recombination rate, which matches the observed increasing divergence. This is very similar for the background selection model and the linked adaptation model, with slightly higher estimates from the former (Figure 3a). Reflecting the strong drop in diversity (Figure 2b), we estimate a drop in effective population size for both models (Figure 3b), with a higher population size in the linked adaptation model (λ = 0.25 versus λ = 0.17), because it explains part of the reduction in polymorphisms by hitchhiking. Finally, the linked adaptation model estimates substantial levels of the rate of linked sweeps (v∼10) for recombination rates below about 0.4 cM/Mb (0.4 × 10−8 crossovers per nucleotide per generation) (Figure 3c). This rate of linked sweeps means that every site in this region is linked to about one sweep per 80,000 years (assuming an effective haploid population size of N0 =4 × 106 [40], see Supplementary Text), which is 10 times higher than the time needed for neutral variants to fix by drift.
The strong dependency of our parameter estimates on the recombination bin is clearly visible in the allele frequency spectra (Figure 3d). First, for the bin with highest recombination, the unlinked adaptation model (gray curve) fits the data very well, consistent with no strong linkage effects in that bin. The spectrum from the lowest recombination bin has a reduced level of polymorphisms, which is well explained by the drop in effective population size, captured by both the background selection and the linked adaptation model. However, this spectrum also exhibits a substantial V-shaped deviation from the background selection model prediction (blue curve), which is well captured by the linked adaptation model (red curve). The difference between the two models is also reflected in the log-likelihood score (Supplementary Figure S6), which is much larger for the linked adaptation model in regions of recombination rates below 0.4 cM/Mb.
We note that our model does not contain demographic effects, such as bottlenecks, which have been used previously to explain the synonymous site frequency spectrum in Drosophila [41]. Indeed, demographic effects on the site frequency spectrum are negligible for this data set, given the good fits of the simple unlinked adaptation model to highly recombining sites. This does not rule out that demography affects other observables, in particular haplotype structure (see also the Discussion below). We also tested whether the deviations from the neutral spectrum in low recombining regions could be caused by directional selection on synonymous sites. We find that in order to explain the depletion of intermediate frequency polymorphisms by selection, unrealistically high selection coefficients on synonymous sites are necessary (Supplementary Figure S3), clearly inconsistent with previous observations [42].
Mixed selection model for nonsynonymous and non-coding sites
We use a mixed model for polymorphism spectra and divergence in non-synonymous and non-coding annotation categories that consists of four components (Supplementary Text): neutral sites, weakly or moderately selected sites that contribute to rare variants, sites with adaptive substitutions (seen as fixed differences between the two species), and conserved sites under purifying selection. The component of weakly selected sites is similar to the neutral component with one additional scaled parameter σ, which denotes the selection coefficient (Supplementary Text and Supplementary Figure S2d). Together with the three neutral parameters, the full mixed model has 7 parameters: three weights to parameterize the contributions of the four components (the fourth is set by normalization), the scaled selection coefficient σ, and the three neutral parameters θ, λ and v introduced above. We estimate these parameters in a hierarchical way: First, we fit the neutral parameters θ, λ and v based on the synonymous sites in each bin, as already presented above. Second, we obtain maximum likelihood estimates of the other four parameters, keeping the neutral parameters fixed. To assess the impact of hitchhiking in particular, we compare our estimates of this full model to a background selection mixed model, with the constraint v = 0, and an unlinked mixed model, with λ = 1,v = 0.
We divide the genome into four further broad annotation categories beyond synonymous sites: Intergenic regions, Introns, untranslated regions in exons (UTR) and nonsynonymous sites (see Material and Methods). For each annotation category we bin all sites according to the same recombination rate bins as for synonymous sites and then perform the conditional model estimations as described above for each bin separately. Figure 4 shows the allele frequency spectrum and the model predictions for the first and the last recombination bin for each annotation category, divided by the background selection model spectrum as estimated from synonymous sites. This serves to highlight the distortions from the standard spectrum: In both high and low recombination, we see an overall reduced diversity due to selection, and for low recombination we additionally see a relative enrichment of rare variants and depletion of common variants. As can be seen, for high recombination (upper plots), the unlinked mixed model fits the data well, consistent with the results from synonymous sites. For low recombination (lower plots), the background selection mixed model is a poor explanation for the data, which clearly exhibits the V-shaped distortion observed previously. Here, the linked adaptation mixed model fits are substantially better. This is satisfying because the background selection mixed model and full mixed models have the same number of free parameters (four), since θ, λ and v are fixed from synonymous sites, so their performance is directly comparable.
All model parameter estimates with bootstrap error estimates are summarized in Supplementary Table S2. We find that the linked adaptation mixed model performs substantially better than the background selection mixed model for recombination values below 0.4 cM/Mb, with about 16,000 units of log likelihood difference for the lowest four recombination bins in total, which is highly significant (see Supplementary Figure S6).
The evolutionary cause of fixed differences
Because our full mixed model allows estimation of neutral, weakly selected and adaptive fractions of sites, we can specifically estimate how these fractions contribute to fixed differences between the two species. These estimates are shown in summary for the full mixed model in Figure 5a (see Supplementary Figure S5a for a split into annotation categories and estimates from the background selection mixed model), using a more coarse-grained binning for clarity (Methods). As can be seen, below 0.4 cM/Mb most mutations are hitchhiking with selective sweeps (dark blue). This includes neutral and weakly selected variants that have reached some frequency by drift and have then been picked up by a linked sweep. Overall, the fraction of adaptive substitutions is between 10% and 20%. Figure 5b and Supplementary Figures S5b show how this fraction of adaptive substitutions is distributed across recombination values and annotation classes. We find that in high recombining regions, between 44% and 59% of nonsynonymous substitutions are adaptive, confirming previous observations in Drosophila [24,29]. The second highest fraction of adaptive substitutions in highly recombining regions is seen in proximal intergenic regions (20%-25%) and in UTR (15%-20%), consistent with adaptively evolving regulatory regions in the untranslated parts of exons. In intergenic regions further away from genes and in Introns, we observe only low fractions of adaptive substitutions (around and below 15%). When we look at low-recombining regions, substitutions in nonsynonymous, UTR and proximal intergenic sequence have only a small adaptive component. In the arguably less functional categories like distal intergenics and introns, we see a more erratic pattern with an actual increase up to 30% of adaptive substitutions in zero recombining regions (Supplementary Figure S5b). While this signal may indicate increased non-coding adaptation in low recombining regions, it may also be caused by technical artifacts. In particular, alignment errors between D. simulans and D. melanogaster in non-coding sequence can cause increased rates of sequence mismatches between the two species and generate a spurious signal of adaptation.
Overall, we observe a consistently smaller fraction of adaptive substitutions for low recombination with the full linked adaptation mixed model than with the background selection mixed model (Supplementary Figure S5a). This is expected since the former provides a much better fit to the neutral component (Figure 3d) than the latter, which allows the full mixed model to explain a larger portion of the spectrum using neutral sites. Only the excess of substitutions not explained by neutral substitutions are interpreted as being adaptive.
We compared our estimates of the fraction of adaptive substitutions with the generalized McDonald-Kreitman (MK) test [43], which can be corrected for the presence of weakly selected sites [29,37]. As shown in Supplementary Figure S4, this method is much more conservative than ours, in particular in low recombining regions, where the MK estimate for the adaptive fraction is estimated to be zero for all annotation categories. This is consistent with an observation made by Messer and Petrov [44], and it highlights the importance of explicitly using the allele frequency spectrum to estimate parameters such as the adaptive fraction of substitutions. The only case for which the test and our method agree quantitatively are nonsynonymous substitutions in high recombining regions, which is the case the test was originally developed for [43].
Fitness Flux and the selective effect of adaptive substitutions
Fitness flux measures the speed of adaptation; according to the fitness flux theorem, fitness flux is generically positive [45]. In an adaptive process driven by the substitution of beneficial mutations, the fitness flux is simply the product of the substitution rate and the average selection coefficient of these changes [24,45]. In our model, we do not directly infer a selection coefficient of adaptive mutations, because any such direct inference would have a high degree of uncertainty. However, we can derive an upper bound on the strength of adaptation from its hitchhiking effects on synonymous changes. As shown in Supplementary Text, the fitness flux Φ is simply related to the rate v of linked sweeps and the recombination rate r, with a proportionality constant C that is greater than but of order one. Since our method cannot infer arbitrarily low levels of hitchhiking (in fact we find v = 0 for most recombination bins), we set v = 1 as a detection threshold and use the above equation to estimate an upper bound on the fitness flux based on that threshold. For nonsynonymous sites, Figure 5c shows this upper bound across the genome, except for zero-recombining regions (the lowest bin). Most estimates are between 10 and 100 in units of μ/2N0 per site, which is consistent with previous estimates in Drosophila [24,46]. We then use this upper bound on the fitness flux together with the inferred rate of adaptive nonsynonymous substitutions to estimate an upper bound on their selection coefficient (Supplementary Text). We find that in the part of the genome with the lower 50% of recombination rates, this upper bound on the selection coefficient is about sa <1 × 10−4, and about five times higher in regions with higher recombination, i.e. sa <5 × 10−4. Previous work in Drosophila has led to estimates across four orders of magnitude, sa = 10−5 [38], sa = 10−4 [24], sa = 10−3 [27,47], and sa = 10−2 [26] (see also [40] for a partial summary), some of which exceed our estimated upper bound by a factor 100.
The cost of adaptation in low-recombining regions
We can use our model to estimate the cost of adaptation imposed by genetic linkage. This cost is given by a negative component of the fitness flux, Φ-, which is the product of the rate of deleterious substitutions, which mainly fix via hitchhiking, and their average selection coefficient, which is negative (Supplementary Text). Figure 5c shows an estimate of the resulting hitchhiking flux Φ- for nonsynonymous sites in different recombination bins. In the lowest-recombining regions of the Drosophila genome, we find that Φ- reaches values of about 1.6 μ/2N0 per sequence site, or about 30/2N0 per million years per gene (Supplementary Text). This cost reaches about 5% of the upper bound on the total fitness flux in the second-lowest recombination bin (Figure 5c).
A related cost measure is the contribution of deleterious hitchhiking to genetic load [48–52]. Our mixed model predicts the stationary probability of any site to be fixed in a low-fitness allele (Supplementary Text). Multiplying this probability with the single-site selection coefficient, we obtain a genetic load of about 40/2N0 per gene. This load measures the fitness cost of placing an average gene into a region of low recombination; its evolutionary interpretation is discussed below.
Discussion
In this study, we have developed an analytic model for adaptive evolution under partial genetic linkage. This model maps the complex process of correlated multi-site evolution onto an effective single-site process with three evolutionary forces: positive selection causing primary adaptation, genetic draft inducing hitchhiking, and background selection constraining diversity and divergence. Despite its simplicity, our model explains allele frequency data across different recombination classes of the Drosophila genome with remarkable accuracy (Figures 3d and 4). Because our inference method does not use haplotypes, it can be applied to bulk sequencing data, which extends its possible range of applications.
Consistent with previous studies, we infer high rates of adaptive evolution in high-recombining sequence of the Drosophila genome: about 50% of the nonsynonymous substitutions in coding sequence, and 20% of the substitutions in UTR and in proximal intergenic sequence are adaptive. We obtain upper bounds for the resulting speed of adaptation, which is measured by a fitness flux of order 100 μ/2N0 per sequence site, and for the average selection coefficient of adaptive changes, which is of order 10−4. These bounds follow from a simple argument: adaptive processes with higher total fitness flux would distort the frequency spectrum of synonymous polymorphisms, which we do not observe in the high-recombining regions spanning 80% of the Drosophila genome. This argument constrains the average speed of adaptation by hard selective sweeps, which lead to substitutions of the beneficial allele and drive the long-term adaptive divergence between species. It does not exclude individual selective sweeps with far higher selection coefficients. It also does not constrain soft and partial sweeps, which involve beneficial alleles arising on diverse genetic backgrounds or alleles with a conditional selective advantage [53]. These sweeps leave a weaker trace in the synonymous frequency spectrum that hard sweeps. Soft sweeps have been inferred by haplotype-based genomic scans for adaptation in several systems including Drosophila [54].
Remarkably, the allele frequency spectra of the North Carolina flies lack a clear footprint of the population’s recent demography. About 80% of synonymous sites in the autosomal genome show a textbook neutral spectrum with constant effective population size. The spectrum in the 20% lowest-recombining sequence sites is depleted of common variants, but we attribute this recombination class-specific signal to hitchhiking rather than to recent changes of population size. We emphasize that our analysis is based on site frequency spectra, so this result does not rule out demography being visible in some other (e.g., haplotype-based) observables. In other systems, for example in humans, demographic effects are more prevalent in the allele frequency data, and our method will have to be extended to distinguish them from signals of adaptation and of linkage correlations. Similarly, we can extend our method to account for a variable density of functional elements, say gene content. This is expected to generate heterogeneous amounts of adaptation, hitchhiking, and background selection within one recombination class.
The most striking result of this study is a strong quantitative relation between the amount of adaptation and linkage correlations in the Drosophila genome, which is summarized in Figure 5. The fraction of adaptive amino acid substitutions drops from about 50% in high-recombining regions to small values in the 20% lowest-recombining sites; a similar drop is observed in UTR and proximal intergenic regions. The majority of substitutions in all of these sequence classes can be accounted for by hitchhiking. We also have shown that hitchhiking imposes a substantial cost on adaptation, which is measured by a negative fitness flux component Φ- of about 30/2N0 per million years per gene and a genetic load of about 40/2N0 per gene. Together, we obtain a complex picture of adaptation in low-recombining regions: linkage interactions reduce rate and power of primary selective sweeps by hitchhiking. In a continual adaptive process, the fitness cost of hitchhiking is compensated by a cascade of secondary adaptive changes at the hitchhiking sites. This complexity of the genomic dynamics of adaptation is a generic consequence of linkage interactions, which become a strong evolutionary force under low recombination [55–58]. We have shown that the interplay of adaptation and linkage interactions already generates strong effects in Drosophila, a species with overall high recombination rates. These effects are expected to be even stronger in other species with lower recombination rates that result, for example, from alternating sexual and asexual reproductive modes. The salient point about Drosophila is that recombination rates and, hence, the strength of linkage correlations vary strongly within its genome, with a broad decrease from central to distal parts of the chromosomes [30]. Thus, our findings may suggest that the distribution of genes in the Drosophila genome results, in part, from an adaptive minimization of the cost of adaptation: genes under high adaptive pressure are predominantly placed in high-recombining genomic regions. In this way, the interplay of adaptation and genetic linkage can shape the large-scale genome architecture.
Material and Methods
Genomic Data
We downloaded the complete genome sequences of 168 lines from the Drosophila Melanogaster Reference Panel (DGRP) from the DGRP website (http://dgrp.gnets.ncsu.edu) as fasta files. We downloaded the reference sequence from Drosophila simulans, aligned to the reference sequence of Drosophila melanogaster. We computed outgroup directed allele frequencies at all sites at which a) there is a valid Drosophila simulans allele, b) at least 150 lines of the DGRP sequences have a called allele (see Supplementary Table S1). To simplify downstream analysis, we normalized all sites to 150 called alleles. Specifically, if m ≥ 150 alleles are called, and k of those are different from the simulans allele, we computed the normalized outgroup directed allele frequency (allele count) as
Sequence Annotation
We downloaded gene annotations from flybase [59] and defined annotation categories as follows:
INTERGENIC FAR: Intergenic regions that are at least 5kb away from genes
INTERGENIC MEDIUM: Intergenic regions within 5kb distance to the next gene, but further away than 500bp
INTERGENIC NEAR: Intergenic sites within 500bp of a gene
INTRON: Introns on protein-coding genes
UTR: untranslated regions on the exons
SYNONYMOUS: protein-coding sites on the reference at which none of the three possible point mutation changes the encoded amino acid
NONSYNONYMOUS: protein-coding sites on the reference at which any of the three possible point mutations changes the encoded amino acid.
In some figures we joined the intergenic categories where appropriate. Most genes have multiple associated transcripts due to alternative splicing. We chose the transcript corresponding to the longest encoded protein coding sequence for each gene and annotated introns, UTRs, synonymous and nonsynonymous sites according to this one transcript. See Supplementary Table S1 for the number of sites in a given annotation category on the different chromosomes.
Recombination Rate Binning
Recombination maps were obtained from Comeron et al. [30] through their website http://www.recombinome.com, defined as mean rates within 100kb windows. We used the recombination map to annotate every site in the Drosophila genome. We then used only sites in the SYNONYMOUS annotation category on the autosomal chromosomes (2L, 2R, 3L and 3R) and defined quantile boundaries on this set. Specifically, we sorted all recombination rate values of this set of sites and determined recombination rate boundaries by dividing the data set into 21 equally large subsets of values. We then used these quantile boundaries to bin all sites (not just those in category SYNONYMOUS) into bins according to their local recombination rate. Here are the quantile boundaries used in this study for autosomal data (in cM/Mb): 0.0, 0.069, 0.217, 0.415, 0.44, 0.821, 1.055, 1.29, 1.415, 1.592, 1.741, 1.938, 2.169, 2.354, 2.612, 2.838, 3.156, 3.461, 3.796, 4.244, 5.395, Infinity. For figure 5, we used a more coarse binning, merging bins [0], [1, 2], [3, 4, 5], [6, 7, 8, 9], [10, 11, 12, 13, 14, 15, 16] and [17, 18, 19, 20].
For plotting purpose only, we show allele frequency spectra with averaged number of counts in neighboring allele frequencies. Specifically, of the 151 allele frequency values (incl. 0 and 150), we average values in groups of 2 from frequency 10 through frequency 11, and in groups of 3 from frequency 12 through 44 and from frequency 120 through 149. We average values in groups of 5 from frequency 45 through 119.
Summary of model parameter estimation
In the Supplementary Text we derive the allele frequency distribution for several basic models, which all derive from the full probability to observe in m samples of the ingroup species k alleles which differ from some outgroup. The parameters are the scaled time to the common ancestor of the two species , the scaled mutation rate , the scaled selection coefficient the scaled rate of linked selective sweeps and the scaled effective population size . The basic models are: These models are used to estimate parameters based on synonymous sites only.
To model other annotation categories, we introduce mixed models with the following components:
A neutral component with fraction cn, modeled by one of Punlinked , PBGS or Plinked
A weakly selected component with fraction cw, which is modeled by the most general model P, but with a constraint σ > 1.
A fraction of adaptive substitutions with fraction ca.
A fraction of additional conserved sites with fraction .
Taking the linked adaptation model as the neutral model, the full mixed model is: Similar mixed models are defined using the background selection or the unlinked adaptation model as neutral component, with accordingly fewer neutral parameters.
To estimate parameters, we consider a data set of outgroup-directed allele frequencies with a fixed sample size m. We denote the number of sites with allele frequency k by nk. The total log-likelihood of the data given parameters Θ is then: where P is a placeholder for the appropriate model, and Θ denotes the set of model parameters. Parameter estimates are obtained by maximization of the log-Likelihood: As detailed in Supplementary Text, we do not simultaneously estimate all 7 parameters of the mixed model, but use a hierarchical approach, first estimating the neutral parameters from synonymous sites only.
Bootstrapping
We use bootstrapping to obtain error estimates for all parameters and derived estimates (i.e. fractions of substitutions and genetic load). Each bootstrap sample is generated from the frequency count data in a given bin, by resampling all frequency counts with replacement from the original counts. We obtain a standard error estimate for each parameter by taking the standard deviation of that parameter across 20 bootstrap samples.
Simulations
We simulated the background selection and linked adaptation models using a Monte Carlo method. In the standard simulation, a population with N = 1000 individuals is simulated at a single site with two alleles. Mutations occur randomly with rate μ. Each generation is sampled with replacement from the previous generation. We introduce selection by assigning a modified sampling weight p = exp (s), where s is the selection coefficient. Linked selective sweeps occur with rate V. For each sweep we choose a linked allele with a probability equal to its frequency x. A sweep instantaneously fixes that allele, setting x = 1.
For a single sample, we start with an equilibrated allele frequency as ancestral value (obtained by simulating a single site for 2/μ. generations) and then simulate two separate populations for t generations, starting with the ancestral allele frequency. We sample a single outgroup allele and 20 ingroup alleles from the two evolved populations, respectively. The result is a single outgroup-directed allele frequency k, obtained by counting the number of ingroup alleles that are different from the outgroup. We simulate 40,000 independent samples before testing parameter inference based on the resulting spectrum.
Implementation
The implementation of the inference method is available under https://github.com/stschiff/hfit.
Acknowledgements
This work was funded by Wellcome Trust grant 098051. We would like to thank P.W. Messer for comments on an earlier version of the manuscript.
Footnotes
↵* Authors with equal contributions