Abstract
The rate of recombination affects the mode of molecular evolution. In high-recombining sequence, the targets of selection are individual genetic loci; under low recombination, selection collectively acts on large, genetically linked genomic segments. Selection under linkage can induce clonal interference, a specific mode of evolution by competition of genetic clades within a population. This mode is well known in asexually evolving microbes, but has not been traced systematically in an obligate sexual organism. Here we show that the Drosophila genome is partitioned into two modes of evolution: a local interference regime with limited effects of genetic linkage, and an interference condensate with clonal competition. We map these modes by differences in mutation frequency spectra, and we show that the transition between them occurs at a threshold recombination rate that is predictable from genomic summary statistics. We find the interference condensate in segments of low-recombining sequence that are located primarily in chromosomal regions flanking the centromeres and cover about 20% of the Drosophila genome. Condensate regions have characteristics of asexual evolution that impact gene function: the efficacy of selection and the speed of evolution are lower and the genetic load is higher than in regions of local interference. Our results suggest that multicellular eukaryotes can harbor heterogeneous modes and tempi of evolution within one genome. We argue that this variation generates selection on genome architecture.
Author Summary
The Drosophila genome is an ideal system to study how the rate of recombination affects molecular evolution. It harbors a wide range of local recombination rates, and its high-recombining parts show broad signatures of adaptive evolution. The low-recombining parts, however, have remained dark genomic matter that has been omitted from most studies on the inference of selection. Here we show that these genomic regions evolve in a different way, which involves clonal competition and is akin to the evolution of asexual systems. This regime shows a lower efficacy of selection, a lower speed of evolution, and a higher genetic load than high-recombining regions. We argue these evolutionary differences have functional consequences: protein stability and protein expression are gene traits likely to be partially compromised by low recombination rates.
Introduction
Genetic linkage affects molecular evolution by coupling the selective effects of mutations at different loci. This coupling, which is often called interference selection, generates two basic evolutionary processes: a strongly beneficial mutation can drive linked neutral and deleterious mutations to high frequency; conversely, a strongly deleterious mutation can impede the establishment of a linked neutral or beneficial mutation. The first process is an instance of hitchhiking or genetic draft [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], the second is known as background selection [13, 14, 15, 16, 17, 18, 19, 20, 21]. In both cases, interference selection has the same consequence: by spreading the effect of selected mutations onto neighbouring genomic sites, it reduces speed and degree of adaptation.
In an evolving population, interference links between genomic loci are established by new mutations, reinforced by selection on these mutants, and reduced by recombination. Hence, strength and genomic range of interference are set by all three of these evolutionary forces. In sexually reproducing organisms with a sufficiently high recombination rate, interference has limited effects because it remains local; that is, it acts only on mutations at proximal genomic positions but is randomized by recombination at larger distances. Without recombination, however, interference becomes global: it couples the evolution of mutations across an entire chromosome. Under strong selection pressures, chromosome-wide genetic linkage generates a specific mode of evolution in which populations harbour competing clades of closely related individuals, each clade containing a distinct set of beneficial and deleterious mutations. This evolutionary mode, which is commonly called clonal interference [22], is well known in asexually evolving microbial [23] and viral populations [24, 25]. Theoretical models suggest that a similar mode of evolution, in which selection acts on extended segments of genetically linked sequence, arises in sexual populations at low recombination rates [26, 27]. However, finding evidence for this mode of evolution in the eukaryotic genome of an obligate sexual organism has remained elusive so far.
Drosophila is an ideal system to map differential effects of interference in one genome. Fruit flies show overall high rates of adaptive genetic mutations [28, 29, 30], at least in high-recombining parts of the genome (low-recombining parts have been excluded from most previous studies). At the same time, local recombination rates vary by orders of magnitude within the Drosophila genome. In particular, extended segments of low-recombining sequence are located in genomic regions flanking the centromeres and, to a lesser extent, the telomeres [31]. In this paper, we present a systematic analysis of linkage effects across the Drosophila genome. For two populations of D. melanogaster, we map the frequency statistics of mutations and the divergence from the neighboring species D. simulans in their dependence on the local recombination rate. Consistently in both populations, we find two clearly distinct regimes: a local interference regime in high-recombining regions and an interference condensate regime in extended low-recombining regions, which cover about 20% of the autosomes. We delineate these two regimes by the statistics of synonymous mutations. In the local interference regime, the amount of synonymous mutations decreases with decreasing recombination rate and the frequency spectrum follows an almost perfect inverse-frequency power law, as predicted by classic theory [32]. This indicates that the establishment of synonymous mutations is constrained by background selection [14, 15, 13, 16, 18], but established mutations evolve predominantly under genetic drift. In contrast, the frequency spectrum of the interference condensate regime shows a specific depletion of intermediate- and high-frequency variants, which is consistent with genetic draft over extended genomic distances [6, 7].
To corroborate this scenario, we develop a scaling theory of evolution under positive and negative selection and limited recombination, building on recent models of asexual and sexual evolution [22, 11, 27, 33, 34, 35, 20, 36]. This theory provides a unified framework for background selection and genetic draft. It describes how amount and frequency-dependence of mutations depend on the recombination rate, and it predicts the transition point from local interference to the interference condensate. Over a wide range of evolutionary parameters, the transition occurs at a threshold recombination rate that is close to the sum of the rates of deleterious mutations and of beneficial substitutions per unit sequence; these rates can be inferred from genomic summary statistics. In the Drosophila genome, the predicted threshold recombination rate is numerically close to the point mutation rate, in perfect agreement with the transition point observed in mutation frequency spectra.
Our scaling theory also provides the tools to infer key evolutionary features of the condensate regime from genomic data. We use this inference to quantify similarities of the Drosophila interference condensate to asexual evolution, and to understand the likely biological impact of the condensate mode of evolution. Specifically, we show that genes in condensate regions are less evolvable in response to positive selection and have a higher genetic load than genes in the local interference regime. We discuss how these evolutionary differences can impact gene functions in the condensate and generate selective pressure on genome architecture.
Results
Evolutionary modes in recombining genomes
Why there are two distinct evolutionary regimes in recombining genomes can be understood from a remarkably simple scaling theory. We consider genome evolution under deleterious mutations with rate ud, beneficial mutations generating substitutions with rate vb, and recombination with rate ρ; all of these rates are measured per base pair of haploid sequence and per generation. In terms of these rates, we estimate the probability that a given selected mutation evolves autonomously (i.e., free of background selection and genetic draft), generalizing a previous argument by Weissman and Barton for evolution solely under beneficial mutations [26]. We first compute the average amount of interference generated by the substitution of a beneficial mutation with selection coefficient s at a given focal site in the genome. This mutation takes an average time of order τs ~ 1/s from establishment to high frequency and generates a linkage correlation interval of size ξs = 1/(ρτs) ~ s/ρ around the focal site. Other mutations at a distance r ≲ ξs are likely to retain their genetic linkage to one of the alleles at the focal site and are subject to strong interference; more distant mutations are likely to randomize their genetic linkage to the alleles at the focal site by recombination within the time span τs. Hence, each beneficial substitution generates an interference domain with an area τs × ξs ~ 1/ρ around its focal point in genomic space-time [26]. By exactly the same argument, each deleterious mutation creates background selection in an interference domain with area τs × ξs ~ 1/ρ around its focal point. In this case, τs ~ 1/s is the expected time between origination and loss of the deleterious allele. Genetic draft acts on all other mutations in the genomic neighborhood of the driver mutation, whereas background selection strongly affects only mutations on the genetic background of the deleterious allele; this difference does not affect the scaling of interference domains.
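The scaling argument above can be checked numerically with a minimal sketch. The function name `interference_domain` and the parameter values are illustrative assumptions, not taken from the paper; the code only evaluates the order-of-magnitude relations τs ~ 1/s and ξs = 1/(ρτs).

```python
def interference_domain(s, rho):
    """Order-of-magnitude scales of one interference domain.

    s   : selection coefficient of the focal mutation (assumed > 0)
    rho : recombination rate per base pair and generation
    Returns (tau, xi, area) with tau ~ 1/s in generations,
    xi = 1/(rho * tau) in base pairs, and area = tau * xi.
    """
    tau = 1.0 / s
    xi = 1.0 / (rho * tau)
    return tau, xi, tau * xi


rho = 1e-8  # illustrative recombination rate, not a measured value
for s in (1e-4, 1e-3, 1e-2):
    tau, xi, area = interference_domain(s, rho)
    # The shape (tau, xi) changes with s, but the area stays close to 1/rho.
    print(f"s={s:g}  tau={tau:g}  xi={xi:g}  area={area:g}")
```

Stronger selection makes a domain shorter in time and wider in sequence, while the product of the two scales stays at 1/ρ.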
To estimate the joint effects of beneficial and deleterious mutations, we combine both kinds of interference interactions into a single interference density parameter,
ω = (ud + vb)/ρ. (1)
This parameter delineates two universal evolutionary modes of recombining genomes:
The local interference mode (ω ≪ 1) has a dilute pattern of interference domains: the domains of beneficial substitutions and of deleterious mutations are randomly distributed with rates vb and ud, respectively, per unit sequence and per unit time (Fig. 1a). The space-time shape of a domain, which is given by the scales ξs and τs, depends on the selection coefficient of its focal mutation, but its area 1/ρ is universal. Because the interference domains are, on average, well-separated in the local interference mode, different mutations under selection are statistically independent. Hence, the rate of beneficial substitutions with selection coefficient s is related to the underlying mutation rate ub and the haploid effective population size N by Haldane’s classic formula for individual sites, vb = 2Ns ub [37]. A given (beneficial or deleterious) mutation evolves autonomously if its own interference domain has negligible overlap with any of the other interference domains, which happens with probability p0 = exp(−ω). Two target mutations on different genetic backgrounds see the same red domains but different blue domains; however, the no-interference probability p0 is independent of background.
The interference condensate mode (ω ≫ 1) has densely packed and overlapping interference domains. This indicates strong interference over extended genomic segments: genomic space-time is jammed by mutations under selection (Fig. 1b). By the definition of ω, the condensate is a broad evolutionary regime: it is generated by a sufficiently large supply of substantially beneficial or deleterious mutations, or a combination of both, but it does not depend on details of their effect distribution. Remarkably, the condensate domains have not only a universal area but also a typical shape that is given by universal scales τ and ξ = 1/(ρτ). Below, we will infer these scales from genomic data in Drosophila.
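The two modes can be illustrated with a short sketch. It assumes that the interference density is the ratio of the summed rates of deleterious mutations and beneficial substitutions to the recombination rate, ω = (ud + vb)/ρ, and it uses a sharp cutoff at ω* = 1 purely for illustration; the text describes a smooth transition at ω* of order 1. All function names and numerical values are hypothetical.

```python
import math


def interference_density(u_d, v_b, rho):
    """Interference density omega = (u_d + v_b) / rho (assumed form)."""
    return (u_d + v_b) / rho


def no_interference_probability(omega):
    """Probability p0 = exp(-omega) that a mutation evolves autonomously."""
    return math.exp(-omega)


def evolutionary_mode(omega, omega_star=1.0):
    """Crude two-way classification; the true transition is smooth."""
    if omega < omega_star:
        return "local interference"
    return "interference condensate"


u_d, v_b = 2.5e-9, 0.3e-9  # illustrative rates per base pair and generation
for rho in (1e-7, 1e-8, 1e-9, 1e-10):
    omega = interference_density(u_d, v_b, rho)
    p0 = no_interference_probability(omega)
    print(f"rho={rho:g}  omega={omega:.3g}  p0={p0:.3g}  {evolutionary_mode(omega)}")
```

With fixed mutational input, lowering ρ alone drives ω through the threshold and p0 towards zero.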
A key evolutionary quantity to map these evolutionary modes is the average fitness variance in a population, Σ, which measures the efficacy of selection: by Fisher’s fundamental theorem, Σ equals the rate of fitness increase by frequency gains of fitter genetic variants in the population. As detailed in Methods, our scaling theory captures the dependence of Σ on the evolutionary parameters in both interference modes. In the local interference regime, Σ depends linearly on the rates ud and ub, which reflects the statistical independence of mutation events. In the interference condensate, Σ is curbed to a sublinear function of ud and ub, which signals a reduced efficacy of selection caused by the jamming of genomic space-time (Fig. 1b). To test these results of our scaling theory, we performed simulations of evolving populations (see Methods for simulation details). In Fig. 2a, the average fitness variance per unit sequence, ς², is plotted against ω (with a suitable rescaling factor, as explained in Methods). In the regime ω ≪ 1, the rescaled fitness variance data obtained for a wide range of parameters ud, ub, ρ collapse onto a single-valued, linear function of ω. In the interference condensate (ω ≳ ω*), the fitness variance is seen to be curbed, and the rescaled data for different evolutionary parameters show some spread. Here we evaluate ω in terms of the mutation rates and the average selection coefficient of beneficial mutations, s̄, as ω = (ud + 2Ns̄ ub)/ρ, which maps the same regimes as equation (1) because vb ≈ 2Ns̄ ub for ω ≲ 1.
The reduction in the efficacy of selection and, in particular, the diminishing return of beneficial mutations in the interference condensate are hallmarks of asexual evolution in large populations, which are known from experiments with microbial populations [38, 23, 39, 40] and from theoretical models [22, 36, 12, 33]. They signal the competition between genetic clades in an evolving population, which prevents some beneficial mutations from reaching substantial frequencies because they are outrun by other clades. At the same time, deleterious mutations can reach fixation if they are part of a successful clade; this effect is often referred to as Muller’s ratchet [41, 42, 43, 44, 45, 46]. We conclude that the condensate regime shares important characteristics with asexual processes, in accordance with previous results in refs. [26, 27, 11]. Our scaling theory expresses this link mathematically: the modes of evolution in recombining systems can be mapped onto corresponding modes of asexual evolution in a specific low-recombination limit (Methods).
In a minimal scaling theory that is based on the interference density, the local interference regime and the interference condensate are separated by a smooth transition at a characteristic value ω* of order 1. The transition occurs when the interference probability 1 − exp(−ω) becomes of order 1; the transition point marks the onset of nonlinearity in the fitness variance. For a broad range of evolutionary parameters, which includes realistic assumptions for the Drosophila genome, this behavior is confirmed by our simulations (Fig. 2a) and is consistent with analytical results for specific cases [26, 27], including the low-recombination limit [33, 20, 12]. In Methods, we detail this minimal scaling theory and discuss extensions that cover parameter regimes with systematic shifts of the transition point ω* above or below 1 (Supplementary Figure S1).
Genomic signature of interference selection
The fitness variance is a key summary statistic to map evolutionary modes under interference selection, but it depends on the a priori unknown rates ud and vb. The distribution of mutation frequencies x at neutrally evolving sites, the so-called neutral site frequency spectrum q(x), provides an alternative test that can be directly evaluated from population-scale sequencing data. At high recombination rates, the neutral spectrum is dominated by genetic drift and has the universal Kimura form q(x) ~ 1/x [32] (blue line in Fig. 2b). At low recombination, the spectrum shows a characteristic depletion of intermediate- and high-frequency mutation counts (red line in Fig. 2b). This shape distortion is a result of genetic draft, which generates faster frequency changes and, hence, fewer variants in this frequency range than genetic drift. This distortion turns out to be a robust feature that can be read off from genomic data even if the exact form of the spectrum is hidden by noise and confounding factors.
We can compare the site frequency spectrum inferred from genomic data with spectra derived from specific evolutionary models. All analytically solvable models make strong simplifying assumptions on the evolutionary process, specifically on rate and effect distributions of mutations generating interference selection. The exact form of the spectral function depends on these model details, but broad shape features are universal markers of interference. An important class of models are so-called travelling fitness waves, which describe the asymptotic regime of linked genetic variation generated by multiple coexisting mutations with individually small selection coefficients [33, 34, 36, 47]. Fitness waves generate a steady turnover of sequence variation at a characteristic rate σ. In the asymptotic wave regime, genetically linked neutral sites have a spectrum depleted at intermediate frequencies, as given by an inverse-square power law q(x) ~ 1/x² for x < 1/2 and a minimum near x = 1/2 [27]. These models underscore an important general point: genetic draft (i.e., mutation frequency trajectories shaped by the substitution dynamics of a beneficial allele at a genetically linked locus) and the associated shape distortion of site frequency spectra are universally generated by a sufficient supply of deleterious or beneficial mutations [45, 46]. In the following, we use a specific model with spectral shapes depleted at intermediate frequencies that are tunable to the Drosophila data reported below. The model contains a focal site that evolves by mutations, selection, and genetic drift; the site is also subject to background selection and genetic draft with rate σ. Draft is generated by linked strongly beneficial alleles, each of which occurs on a random genetic background and leads to instantaneous fixation or loss of mutations at the focal site.
The resulting spectral function of neutral sites takes a simple approximate form, Q0(x; ν) = exp(−νx)/x, with a shape parameter ν that is proportional to the draft rate σ (dashed lines in Fig. 2b; details on Q0(x; ν) are given in Methods). In the following, this model will serve to parametrize the universal shape distortion of empirical spectra, without pretence to resolve details of the underlying selective forces.
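To make the shape distortion concrete, the following sketch evaluates the draft spectrum Q0(x; ν) = exp(−νx)/x against the Kimura form 1/x. The function names and the value ν = 2 are illustrative choices; the normalization of the spectra is left aside.

```python
import math


def kimura_spectrum(x):
    """Neutral spectrum under genetic drift, q(x) ~ 1/x (unnormalized)."""
    return 1.0 / x


def draft_spectrum(x, nu):
    """Approximate neutral spectrum under draft, Q0(x; nu) = exp(-nu*x)/x."""
    return math.exp(-nu * x) / x


nu = 2.0  # illustrative shape parameter
for x in (0.05, 0.25, 0.5, 0.9):
    ratio = draft_spectrum(x, nu) / kimura_spectrum(x)
    # The ratio exp(-nu*x) falls with frequency: intermediate- and
    # high-frequency variants are depleted relative to the drift expectation.
    print(f"x={x:.2f}  Q0={draft_spectrum(x, nu):.4f}  Q0/Kimura={ratio:.4f}")
```

For ν = 0 the two functions coincide, which is why ν ≈ 0 indicates a drift-dominated spectrum.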
In Fig. 2cd, we show that mutation frequency data produce two consistent markers of interference. First, we fit neutral site frequency spectra obtained from numerical simulations to the form Q0(x; ν) and plot the inferred shape parameter ν against the interference density ω of the underlying evolutionary process. Over a wide range of rates ud, ub, and ρ, we find ν ≈ 0 (i.e., neutral spectra of the form q(x) ~ 1/x) in the local interference regime (ω ≲ ω*) and ν > 1 (i.e., neutral spectra with depletion of intermediate and high frequencies) in the condensate regime (ω ≳ ω*). Below, we will link the shape parameter ν to specific evolutionary characteristics of the condensate regime. Second, we record the sequence diversity at synonymous sites as a function of the interference density ω (Fig. 2d). This quantity shows a strong dependence on ρ in the local interference regime (ω ≲ ω*) and a weaker dependence in the interference condensate (ω ≳ ω*), a pattern that is predicted by our scaling theory and will be described in more detail below. Hence, both genetic draft on neutral mutations and the depletion of the diversity pattern set in at the transition point ω* from local interference to the interference condensate (Fig. 2cd), the same point that is marked by the onset of nonlinearity in the fitness variance (Fig. 2a). The validity range of our inference method is detailed in Methods.
Interference selection in the Drosophila genome
To obtain a genome-wide map of interference in the Drosophila melanogaster genome, we use sequence data from an American [48] and an African population [49]. To equalize coverage, we take a random sample of 25 individuals in each population. At this sampling depth, site frequency spectra are quite insensitive to low-frequency variants (which would arise, for example, from a recent population expansion) but are well suited for studying intermediate-frequency variants, which are at the center of this study. Based on a published high-resolution recombination map for the Drosophila genome [31], we partition genomic sites in the autosomes (i.e., chromosomes 2L, 2R, 3L, 3R) according to the local recombination rate evaluated in windows of 10⁵ base pairs. This partitioning covers a range of rates between 10⁻¹⁰ and 10⁻⁷ with an average of 2.4 × 10⁻⁸ per unit sequence and per generation (often reported in units of 10⁻⁸ per unit sequence and per generation, called centiMorgans per Megabase).
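The unit conversion and the partitioning by recombination rate can be sketched as follows. The text does not specify how the recombination bins are spaced; the logarithmic spacing, the number of bins, and all function names below are assumptions made for illustration only.

```python
import math


def cm_per_mb_to_per_bp(rate_cm_per_mb):
    """Convert cM/Mb to crossovers per base pair and generation.

    1 cM/Mb = 0.01 crossovers per 10**6 bp = 1e-8 per bp.
    """
    return rate_cm_per_mb * 1e-8


def log_bin_edges(low, high, n_bins):
    """Edges of n_bins logarithmically spaced bins (assumed binning scheme)."""
    log_low, log_high = math.log10(low), math.log10(high)
    step = (log_high - log_low) / n_bins
    return [10 ** (log_low + i * step) for i in range(n_bins + 1)]


def bin_index(rate, edges):
    """Index of the bin containing rate, or None if rate is out of range."""
    for i in range(len(edges) - 1):
        if edges[i] <= rate < edges[i + 1]:
            return i
    return None


edges = log_bin_edges(1e-10, 1e-7, 3)   # one bin per decade, for illustration
mean_rate = cm_per_mb_to_per_bp(2.4)    # the average rate quoted in the text
print(edges)
print(mean_rate, bin_index(mean_rate, edges))
```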
In each recombination rate bin and in different sequence categories, we record the outgroup-directed site frequency spectrum, q̂0(x), which is defined as the number of sites per unit sequence at which a fraction x = k/n of the sampled individuals have a mutant allele and a fraction 1 − x have the outgroup allele (with k = 0, 1, …, 25 and n = 25; we disregard sites with more than two alleles). Following common practice, we determine the outgroup allele by alignment with the reference genome of the neighboring species D. simulans. These empirical spectra differ in two ways from the model spectra introduced above. First, the spectra are evaluated for discrete frequencies in a small sequence sample, which introduces sampling corrections compared to model distributions derived for larger populations (we use a hat to mark this difference; sampling corrections are detailed in Methods). Second, model spectra are directed from the ancestral allele at the origination time of the mutation. A substitution between the ingroup and the outgroup species reverses the role of the ancestral and the mutant allele at part of the sequence sites. In a sequence class with a given density d of substitutions, the ancestor-directed and the outgroup-directed spectra are related by a linear map, q0(x) = (1 − d) q(x) + d q(1 − x); the same map relates the sample spectra q̂0(x) and q̂(x). Hence, outgroup-directed spectra have a primary branch with a maximum at low frequency and a secondary branch with a maximum at high frequency.
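The linear map between ancestor-directed and outgroup-directed spectra, and its inverse, can be written out in a few lines. This is a minimal sketch: the function names are hypothetical, the spectrum is stored as a list over the polymorphic frequency classes k = 1, …, n − 1, and sampling corrections are ignored. The inverse follows from solving the pair of equations for x and 1 − x, which requires d ≠ 1/2.

```python
def to_outgroup_directed(q, d):
    """Apply q0(x) = (1 - d) q(x) + d q(1 - x).

    q is a list over frequency classes k = 1..n-1, so reversing the list
    maps x = k/n to 1 - x.
    """
    return [(1.0 - d) * a + d * b for a, b in zip(q, reversed(q))]


def to_ancestor_directed(q0, d):
    """Invert the map: q(x) = [(1 - d) q0(x) - d q0(1 - x)] / (1 - 2d)."""
    if abs(1.0 - 2.0 * d) < 1e-12:
        raise ValueError("the map is not invertible for d = 1/2")
    return [((1.0 - d) * a - d * b) / (1.0 - 2.0 * d)
            for a, b in zip(q0, reversed(q0))]


n = 25
d = 0.1  # illustrative substitution density, not a value from the data
kimura = [n / k for k in range(1, n)]          # q(x) ~ 1/x at x = k/n
outgroup = to_outgroup_directed(kimura, d)
recovered = to_ancestor_directed(outgroup, d)
# The secondary branch shows up as a rise of the spectrum near x = 1.
print(outgroup[0], outgroup[n // 2 - 1], outgroup[-1])
print(max(abs(a - b) for a, b in zip(kimura, recovered)))
```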
In Fig. 3ab, we plot the sample spectra of 4-fold synonymous sequence sites, q̂0(x), for three representative bins of high, intermediate, and low recombination rates in the American and the African population. These spectra have a striking common pattern. Across high and intermediate recombination rates, they follow almost perfectly the standard Kimura inverse-frequency form, q0(x) ~ (1 − d)/x + d/(1 − x), which appears as straight lines over most of the frequency range in the double-logarithmic plots of Fig. 3ab. This form indicates that the dominant evolutionary force acting on synonymous genetic mutations at high and intermediate recombination rates is genetic drift. It shows that the average selection at synonymous sites is weak, making this class a good approximation of neutrally evolving sequence. It also excludes strong demographic effects affecting the spectral form at intermediate frequencies (notwithstanding small differences at low frequencies between the American and the African population, which reflect differences in their recent demography [50, 51]). Strong selective sweeps are known to deplete the density and to distort the spectrum of synonymous mutations in the local vicinity of the positively selected site [29, 52]; however, the rate of these sweeps is low enough not to affect the aggregate spectra (Fig. 3ab). In contrast, the spectra at low recombination rates show a depletion of intermediate and high frequencies. This depletion signals genetic draft on synonymous mutations, which we attribute to interference selection. The argument for interference selection will be completed below, where we show that the onset of the shape distortion occurs at a value ω* ~ 1 predicted by the interference scaling model and is accompanied by a consistent ρ-dependence of the synonymous sequence diversity.
To map the transition point between local interference and interference condensate, we calibrate draft model spectra for neutral sites, qs(x) = θs Q0(x; ν), to the empirical frequency spectra of synonymous sequence sites in D. melanogaster. We use a consistent Bayesian inference scheme that includes sampling effects and the map from outgroup-directed to ancestor-directed spectra (Methods). This scheme provides maximum-likelihood values and confidence intervals of the shape parameter ν (Fig. 3c) and of the mutation density θs (shown below in Fig. 5a) in each recombination bin, without additional fit parameters. The synonymous spectral data signal an onset of interference selection at a threshold recombination rate ρ* ≈ μ, corresponding to a shape parameter or rescaled draft rate ν = 1. Regions with ρ ≳ ρ* are inferred to be in the local interference regime, regions with ρ ≲ ρ* are in the interference condensate. The condensate regime is characterized by moderate interference with values of the shape parameter in the range ν ≲ 2. To obtain some insight on the selective effects causing interference, we infer “corrected”, ancestor-directed sample spectra by an inverse linear map (Methods). The low-recombination spectra are monotonic and well approximated by the spectral functions Q0(x; ν) of the draft model, but they do not show the minimum at x = 1/2 characteristic of the travelling-wave model (Supplementary Figure S2). This suggests the underlying interference selection includes drivers under substantial selection and is at some distance from the travelling-wave regime of multiple mutations with individually small effects.
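The calibration of the shape parameter can be illustrated with a strongly simplified stand-in for the inference scheme. The sketch below is not the Bayesian scheme of Methods: it assumes binomial sampling of n sequences from the spectrum Q0(x; ν), uses only the shape of the ancestor-directed sample spectrum over classes k = 1, …, n − 1 (a multinomial likelihood), ignores the outgroup map and the mutation density, and maximizes over a coarse grid of ν. All names are hypothetical.

```python
import math


def class_probabilities(nu, n, grid=1000):
    """Relative expected counts of k = 1..n-1 mutant alleles in a sample of n.

    Integrates Q0(x; nu) = exp(-nu*x)/x against the binomial sampling
    probability by the midpoint rule, then normalizes over k.
    """
    weights = []
    for k in range(1, n):
        binom = math.comb(n, k)
        total = 0.0
        for j in range(grid):
            x = (j + 0.5) / grid
            total += math.exp(-nu * x) / x * binom * x**k * (1.0 - x) ** (n - k)
        weights.append(total / grid)
    norm = sum(weights)
    return [w / norm for w in weights]


def log_likelihood(counts, nu):
    """Multinomial log-likelihood (up to a constant) of the sample spectrum."""
    n = len(counts) + 1
    probs = class_probabilities(nu, n)
    return sum(c * math.log(p) for c, p in zip(counts, probs))


def fit_shape_parameter(counts, nu_grid):
    """Grid-based maximum-likelihood estimate of nu."""
    return max(nu_grid, key=lambda nu: log_likelihood(counts, nu))


n = 25
# Synthetic, noise-free "data" generated from the model itself with nu = 1.5.
synthetic_counts = [1000.0 * p for p in class_probabilities(1.5, n)]
nu_grid = [0.5 * i for i in range(7)]  # 0.0, 0.5, ..., 3.0
print(fit_shape_parameter(synthetic_counts, nu_grid))
```

A grid search keeps the sketch dependency-free; confidence intervals would require the full likelihood surface or resampling, which are not attempted here.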
The partitioning of the Drosophila genome in local interference regions and interference condensate regions can be consistently traced in all sequence categories (Fig. 4, Supplementary Figure S3). At low recombination rates, all of these site frequency spectra show qualitatively the same depletion of intermediate and high frequencies that is characteristic of the condensate regime. Specifically for nonsynonymous mutations, we calibrate a two-component model (equation 3) to the empirical spectra: a near-neutral component with mutation density θa and spectral function Q0(x; ν), and a moderately selected component with its own mutation density and spectral function Q(x; ζa, ν). Here Q(x; ζ, ν) is a spectral function for sequence sites with mean selection coefficient sa that are subject to genetic draft with rate σ. This function contains branches Q±(x; ζa, ν) = exp((±ζa − ν)x) that correspond to beneficial and deleterious mutations, respectively, and depend on the rescaled selection coefficient ζa (Methods). We obtain maximum-likelihood values of θa, of the mutation density of the moderately selected component, and of ζa in each recombination bin, using an extended Bayesian inference scheme. This scheme includes a model for cross-species evolution under selection and the resulting, more complex linear map between ancestor-directed and outgroup-directed spectra (Methods). Remarkably, the spectra for nonsynonymous mutations (Fig. 4) can be explained across all recombination classes by the two-component model (3) with the same shape parameter as inferred for synonymous sites (Fig. 3c) and, hence, with the same threshold rate ρ*. The maximum-likelihood model includes near-neutral sites with spectral shape Q0(x; ν), which is of Kimura form for ρ > ρ*, as well as moderately selected sites with spectrum Q(x; ζa, ν) and mean effect of order ζa ~ 20, which produce excess frequency counts in the range x ≲ 0.1 (see dashed vs. solid lines in Fig. 4). These excess counts cannot be explained by demographic factors, because they are common to both populations and no comparable excess is observed at synonymous sites.
Across all recombination rates, the inferred mutation densities of both nonsynonymous components are much lower than the density θs at synonymous sites (Fig. 5a), indicating that a large fraction of amino acid changes is under strong selection and, hence, suppressed in the frequency range of our spectra. The inferred selective effects of amino acid changes that do appear in our spectra are consistent with the expected fitness landscape of proteins. Important molecular phenotypes of proteins, such as fold stability or enzymatic activity, are quantitative traits encoded by multiple sequence sites. Such traits generically contain weakly and moderately selected constitutive sites, even if the trait itself is under strong stabilizing selection [53]. The maximum-likelihood selection coefficients ζa (Supplementary Table S2) are just one order of magnitude higher than the scaled draft rate ν in the lowest recombination classes. This suggests that a fraction of nonsynonymous sites is affected by genetic draft in the condensate regime; this point is discussed further below.
Predicting the condensation transition in Drosophila
The threshold recombination rate marking the onset of interference selection is numerically close to the point mutation rate in the Drosophila genome, μ = 2.8 × 10⁻⁹ per generation and per unit sequence [54] (see Fig. 3b). In light of our scaling theory, this is hardly surprising: the interference density ω, which determines the transition between local interference and interference condensate, is determined by a balance between the rates of local mutations and recombination. We now combine the scaling theory, our evolutionary model, and our Bayesian inference scheme to independently predict the transition point ρ*, as well as the behavior of the sequence diversity in both interference regimes, solely from genomic data in the high-recombination regime. This serves as a stringent consistency test for the scaling theory and provides additional evidence for our inference of interference selection in the Drosophila genome.
First, we estimate the rate of deleterious mutations in protein-coding sequence from the reduction in mutation density of amino acid changes compared to synonymous changes, ud/μ = αd, where αd = 1 − (θa/θs) is the fraction of amino acid mutations that are deleterious. Moderately deleterious and strongly deleterious amino acid changes contribute separate partial fractions to αd. Second, we estimate the rate of beneficial substitutions in a similar way from the excess of amino acid substitutions compared to the number expected in the near-neutral component, vb/μ = αb(da/ds), where αb = 1 − (ds/da)(θa/θs) is the fraction of amino acid substitutions that are beneficial. In Methods, we derive these expressions from our evolutionary model and show how they can consistently be extrapolated into the condensate regime. The expression for αb resembles a McDonald-Kreitman test [55, 56], but our mixed model (3) affords an improved estimate of the mutation density θa by discounting moderately deleterious mutations. Equation (1) then gives a simple estimate of the interference density from genomic summary data,
ω = (μ/ρ) [αd + αb (da/ds)].
In Fig. 5ab, we collect the relevant data of the Drosophila genome in all recombination classes: the maximum-likelihood mutation densities θs and θa inferred from spectral data, and the corresponding sequence divergence levels ds and da, defined as the number of substitutions between each D. melanogaster population and the D. simulans reference genome. From these data, we infer ρ-dependent fractions αd and αb (Fig. 5cd) and the resulting interference density ω (Fig. 5e).
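The chain from summary statistics to the interference density can be sketched as below. It combines the expressions αd = 1 − θa/θs and αb = 1 − (ds/da)(θa/θs) with the assumption that the interference density is ω = (ud + vb)/ρ. The input numbers are hypothetical and chosen only so that the resulting rates land near ud ≈ 0.9μ and vb ≈ 0.1μ; they are not the values of Fig. 5.

```python
def deleterious_fraction(theta_a, theta_s):
    """alpha_d = 1 - theta_a/theta_s: deleterious share of amino acid mutations."""
    return 1.0 - theta_a / theta_s


def beneficial_fraction(theta_a, theta_s, d_a, d_s):
    """alpha_b = 1 - (d_s/d_a)(theta_a/theta_s): beneficial share of substitutions."""
    return 1.0 - (d_s / d_a) * (theta_a / theta_s)


def interference_density_from_summary(theta_a, theta_s, d_a, d_s, mu, rho):
    """omega = (u_d + v_b)/rho with u_d = alpha_d*mu and v_b = alpha_b*(d_a/d_s)*mu."""
    u_d = deleterious_fraction(theta_a, theta_s) * mu
    v_b = beneficial_fraction(theta_a, theta_s, d_a, d_s) * (d_a / d_s) * mu
    return (u_d + v_b) / rho


# Hypothetical summary statistics for one recombination class.
theta_s, theta_a = 0.02, 0.002   # mutation densities (synonymous, amino acid)
d_s, d_a = 0.10, 0.02            # divergence levels (synonymous, amino acid)
mu = 2.8e-9                      # point mutation rate quoted in the text

print(deleterious_fraction(theta_a, theta_s))
print(beneficial_fraction(theta_a, theta_s, d_a, d_s))
for rho in (2.8e-8, 2.8e-9, 2.8e-10):
    print(rho, interference_density_from_summary(theta_a, theta_s, d_a, d_s, mu, rho))
```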
For ρ > ρ*, we consistently find a deleterious mutation rate ud ≈ 0.9μ and a beneficial substitution rate vb ≈ 0.1μ that are approximately independent of ρ (Fig. 5cd). Hence, the local interference regime has an approximately constant density of interference domains and an interference density that is inversely proportional to the recombination rate, ω ~ 1.0 μ/ρ. As inferred above, the Drosophila genome includes a sizeable fraction of moderately selected sites and genome-wide positive selection is not dominated solely by strong selective sweeps; these characteristics suggest the minimal scaling theory in terms of the interference density is applicable. This theory makes three quantitative predictions:
In the local interference regime, the mutation densities θs and θa, as well as the sequence diversity, are proportional to the probability of no interference, p0 = e^(−ω) ≈ e^(−μ/ρ). This formula generalizes the standard model of background selection, which predicts the size of the error-free sequence class to depend exponentially on the rate of deleterious mutations [14]. Indeed, the observed ρ-dependence of the sequence diversity at synonymous sites is in good agreement with this theory in the local interference regime, πs ~ e^(−μ/ρ) (Fig. 3d, see definition in Methods), in broad agreement with previous observations [57, 48].
The transition to the interference condensate occurs at a threshold interference density ω* of order 1. This determines the threshold recombination rate ρ* ~ μ (Fig. 5e), in agreement with the observed onset of interference selection (Fig. 3c).
In the interference condensate, the sequence diversity depends only weakly on the recombination rate. This dependence can be derived from a simple scaling argument based on extremal value statistics: an expected number ω/ω* of beneficial mutations with average selection coefficient s originates in each interference domain, but only the fittest of these mutants reaches fixation. This determines the draft rate , which sets the sequence diversity πs ≃ 2μ/σ [6, 19, 27]. With condensate interference densities bounded in the range ω ≈ (0.85–1.0)μ/ρ (Fig. 5e), we obtain the leading ρ-dependence πs ~ (1 + log(ρ*/ρ))^(−1), which is in agreement with the observed pattern (Fig. 3d). Our scaling argument is consistent with the scaling of σ in the numerical simulations (Fig. 2c) and with previous results for evolution solely under beneficial mutations [26]. In a state of stationary fitness, however, beneficial substitutions are a generic feature of the condensate regime. Even in the absence of adaptation, they compensate for deleterious mutations fixed by interference selection [44, 58]. Below, we will discuss the likely loci of these dynamics in the Drosophila genome.
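The first and third predictions can be combined into a single piecewise scaling curve for the sequence diversity. The following Python sketch assumes the leading forms quoted in this paper, π ~ e^(−ω) below the transition and π ~ (1 + log(ω/ω*))^(−1) above it. The prefactor that makes the two branches meet at ω* is an assumption added for illustration; prefactors of order 1 are not determined by the scaling theory.

```python
import math


def relative_diversity(omega, omega_star=1.0):
    """Diversity relative to its interference-free value (scaling form only).

    Local interference (omega <= omega_star): exp(-omega).
    Condensate (omega > omega_star): logarithmic decay, matched continuously
    at omega_star (the matching prefactor is an illustrative assumption).
    """
    if omega < 0:
        raise ValueError("interference density must be non-negative")
    if omega <= omega_star:
        return math.exp(-omega)
    return math.exp(-omega_star) / (1.0 + math.log(omega / omega_star))


# Diversity falls steeply up to omega* and only weakly beyond it.
curve = [(w, relative_diversity(w)) for w in (0.0, 0.5, 1.0, 2.0, 4.0)]
```

In this sketch, doubling ω from 0.5 to 1 reduces the diversity by a factor e^(−0.5), whereas doubling it from 2 to 4 in the condensate reduces it by a smaller factor.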
Taken together, genetic variation in Drosophila is in remarkable quantitative agreement with our interference scaling theory over two decades of recombination rates. Fig. 6 charts the local interference density ω in the autosomes of D. melanogaster, based on the high-resolution recombination map of ref. [31] and our genomic inference. Extended condensate regions, shown in orange, are located primarily adjacent to the centromere regions and, to a lesser extent, to the telomeres. The major part of condensate sequence maintains a residual level of recombination, corresponding to interference densities in the range 1 < ω ≤ 4. The remaining 9% of the autosomal genome consists of 38 contiguous segments with no recorded recombination (ω > 4); these segments have an average length of 0.2 Mb and a maximum length of 0.9 Mb. We now turn to inferring key genomic and evolutionary features of the interference condensate.
Linkage correlations in the condensate
Although the condensate is a complicated regime of strongly correlated mutations, it has remarkably simple emergent scaling properties. Because interference domains in the condensate are densely packed, the draft rate σ becomes similar to the neutral coalescence rate, , which is also the scale of fitness differences between competing clades. The emergence of a characteristic scale of genetic turnover is a common feature of models of asexual evolution [12, 59]. Under finite recombination, the rate sets the genomic correlation length ; coexisting mutations at a distance r ≲ ξ are likely to retain their genetic linkage over a mean coalescence time interval (Fig. 1b). We can estimate the coalescence rate from the inferred values of draft rate and synonymous mutation density [12], , or equivalently from the neutral sequence diversity (Methods). Together, we obtain the simple estimates which determine the universal shape of interference domains in the condensate regime (Fig. 1b). In coalescent models under selection, the same scaling links the coalescence rate to the neutral sequence diversity, in some cases with logarithmic corrections [19, 27]. In the condensate regime, we find coalescence times τ about an order of magnitude lower than at high recombination rates (Fig. 7a) and genomic correlations up to ξ ≲ 10^4 base pairs (Fig. 7b), signalling that neighboring genes are often in common interference domains.
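As a rough numerical illustration of these scales, the following Python sketch combines relations stated elsewhere in this paper: the condensate diversity πs ≃ 2μ/σ, a coalescence time of order 1/σ, and the correlation length ξ = σ/ρ. The function name and the input values are hypothetical, and prefactors of order 1 and logarithmic corrections are ignored.

```python
def condensate_scales(mu, pi_s, rho):
    """Order-of-magnitude scales of the condensate from summary data.

    mu:   mutation rate per site and generation
    pi_s: neutral (synonymous) sequence diversity
    rho:  recombination rate per site and generation
    Returns (sigma, tau, xi): draft rate, coalescence time, correlation length.
    """
    sigma = 2.0 * mu / pi_s   # draft rate from pi_s ~ 2 mu / sigma
    tau = 1.0 / sigma         # coalescence time scale (generations)
    xi = sigma / rho          # correlation length (sites)
    return sigma, tau, xi


# Hypothetical inputs for illustration; these are not estimates from Fig. 7.
sigma, tau, xi = condensate_scales(mu=1e-9, pi_s=2e-3, rho=1e-10)
```

Lower recombination rates lengthen ξ at fixed σ, which is why neighboring genes in coldspots tend to share interference domains.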
Speed and cost of evolution in the condensate
The most important asexual feature of the interference condensate is the drastically reduced efficacy of selection. In the Drosophila genome, we can quantify this effect in two ways. First, the fraction of beneficial substitutions, αb, which takes stable values of about 50% in the local interference regime, sharply drops in the condensate to below 10% in the lowest recombination class (Fig. 5d). Second, the fitness variance per unit sequence of the condensate is related to the neutral sequence diversity,
(Methods). The ρ-dependent values of ς^2 inferred from the synonymous sequence diversity πs (Fig. 7c) show a sharp drop in the efficacy of selection within the condensate regime; the fitness variance in the lowest recombination class is lower by a factor of 10 than at the transition point ρ*. The strong dependence of the fitness variance on the recombination rate is in tune with the simulation results (Fig. 2a). We conclude that interference selection curbs rate and selective effects of adaptive evolution in the condensate regions of the Drosophila genome.
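The link between fitness variance and neutral diversity can be made explicit by combining two scaling relations quoted in this paper: the condensate asymptotics ς^2 ≃ ρσ (Methods) and the diversity πs ≃ 2μ/σ. The following is a sketch of that algebra up to prefactors of order 1; it is not necessarily the exact form of the relation used for Fig. 7c.

```latex
\begin{align}
  \varsigma^2 &\simeq \rho\,\sigma ,
  &
  \pi_s &\simeq \frac{2\mu}{\sigma}
  \quad\Longrightarrow\quad
  \varsigma^2 \simeq \frac{2\mu\,\rho}{\pi_s} .
\end{align}
```

Because πs varies only logarithmically with ρ in the condensate, this form implies that ς^2 decreases nearly in proportion to ρ.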
The reduced efficacy of selection has an immediate consequence for genome functionality in the condensate regime: interference selection generates emergent neutrality of sequence sites with selection coefficients s ≲ σ; these sites become dysfunctional because their alleles are randomized by interference selection [12]. We can estimate the resulting fitness cost (or genetic load) for a protein, , where f(s) is the distribution of selection coefficients and ℓ is the length of the protein. This cost increases with decreasing ρ, because σ as inferred through the shape parameter increases (Fig. 3c). Emergent neutrality implies that the genetic load in the two modes differs not only in magnitude, but qualitatively. In the local interference regime, a genetic locus under moderate selection (s > 1/2N) incurs classical mutational load, where the beneficial allele is always prevalent and only a small minority fraction of the population, of average μ/s, carries the deleterious allele. In the condensate, deleterious and compensatory beneficial substitutions generate a new equilibrium in which the deleterious allele becomes dominant in the population with probability 1/(1 + e^(s/σ)) [12]. Hence, the functional impact of moderately deleterious alleles (s < σ) becomes important. Emergent neutrality likely affects part of the nonsynonymous sites in the moderate selection class (Fig. 4), as well as intron, UTR, and synonymous sites under selection for codon usage. Assuming that just a few percent of these sites become effectively neutral, the above estimate predicts a substantial scaled fitness cost 2NΔF ~ 10 – 100 per gene, even if the effect of each individual site is weak. This fitness cost is specific to genes in the condensate regime; its likely consequences for genome architecture are discussed below.
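The qualitative difference between the two load regimes can be illustrated numerically. The Python sketch below evaluates the two expressions given in the text: the classical minority fraction μ/s in the local interference regime, and the probability 1/(1 + e^(s/σ)) that the deleterious allele is dominant in the condensate. Function names and parameter values are hypothetical.

```python
import math


def deleterious_fraction_local(mu, s):
    """Average population fraction carrying the deleterious allele (mu / s)."""
    return mu / s


def deleterious_dominance_condensate(s, sigma):
    """Probability that the deleterious allele is dominant, 1 / (1 + exp(s / sigma))."""
    return 1.0 / (1.0 + math.exp(s / sigma))


# Hypothetical parameters: a site with s = sigma / 2 in the condensate.
mu, sigma = 1e-9, 1e-6
s = 0.5 * sigma
local = deleterious_fraction_local(mu, s)
condensate = deleterious_dominance_condensate(s, sigma)
```

For s well below σ the condensate probability approaches 1/2, which is the sense in which such sites are randomized; for s well above σ it becomes exponentially small.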
Discussion
The main method development of this paper is a unified scaling theory of genetic draft and background selection. This theory identifies a dominant scaling variable, the interference density ω = (ud + vb)/ρ, to discriminate between two evolutionary modes: the local interference regime (ω ≲ ω*) and the interference condensate (ω ≳ ω*). In the local interference mode, mutations evolve in an approximately independent way by selection and genetic drift; in the condensate, they are locked into clades of genetically linked sequence segments and many are governed by linkage. This mode requires a sufficiently high supply of mutations under substantial (beneficial or deleterious) selection but is insensitive to details of the evolutionary process – in particular, to the rate of adaptation. Over a broad range of evolutionary parameters, the transition point ω* between local interference and condensate is of order 1. The frequency spectrum of neutral mutations can be used as a marker of the evolutionary mode: the “convective” frequency evolution in the condensate regime is signalled by a characteristic depletion of intermediate and high frequencies.
In the Drosophila genome, we build a case for the joint presence of these evolutionary modes from a number of mutually consistent observations from sequence data. We infer the rates of deleterious amino acid changes, ud, and the rate of beneficial substitutions, vb, in protein coding sequence. Given these selective building blocks of the interference density ω, our scaling theory predicts how genetic variation depends on recombination: the sequence diversity varies strongly in the local interference regime, π ~ e^(−ω), and weakly in the condensate, π ~ (1 + log(ω/ω*))^(−1); the transition point between these regimes, ω* ~ 1, marks the onset of genetic draft. Together, amplitude and shape of mutational spectra change in a concerted way. These predictions are in agreement with direct genomic data of synonymous mutations. While any single characteristic of genetic variation could be explained by alternative evolutionary scenarios, the consistent joint pattern of diversity and spectral shape over the entire range of recombination rates provides strong evidence for an interference condensate in the Drosophila genome (Fig. 3cd).
Our results suggest that the established rationale of a strong evolutionary advantage of sex applies to about 80% of the Drosophila genes, which are in the local interference regime. The other 20%, some 3000 genes in the interference condensate, show evolutionary similarities with asexual systems. In the condensate regions, we infer a significantly lower fitness variance per unit sequence, indicating reduced evolvability in response to adaptive pressure. This may signal that condensate genes respond less efficiently to existing positive selection for change or that they are subject to less selection for change in the first place. We also infer a significantly increased fitness cost (genetic load) concentrated in weakly and moderately selected sequence sites, whose alleles are randomized by emergent neutrality [12]. This finding suggests that the evolutionary partitioning of the Drosophila genome is also a functional partitioning. We hypothesize that condensate genes have systematically lower intrinsic fold stability than other genes. They should also have reduced codon usage bias, which may affect speed and efficiency of translation and, hence, increase the cost of protein expression. These hypotheses on the functional impact of interference selection can be tested by experiment and by targeted sequence analysis.
A salient feature of Drosophila is that both evolutionary modes coexist in one genome. This implies that functional and fitness differences between condensate genes and other genes play out in the same individual, the same environment, and the same population. Over macro-evolutionary time scales, these differences can generate feedback effects on genome architecture. First, we expect selection against overly long recombination coldspots. This is qualitatively in line with observations: in 91% of the autosomes, the Drosophila genome maintains a residual level of recombination, keeping interference selection capped to a moderate level (ω ≤ 4); the remaining zero-recombination sequence is fragmented into short contiguous segments (Fig. 6). Second, a given gene incurs a fitness cost that depends on its target function and on the interference regime it is placed in. Therefore, genes with high requirements on protein stability or translation efficiency should be suppressed in the condensate. Whether there are differences in gene content and gene functions between the local interference regime and the condensate that can be explained as a consequence of differences in interference selection is an interesting question for future research.
Methods
Scaling theory
The heuristic scaling approach used in this paper is based on three main ingredients: (i) In the local interference regime, the behavior of an evolutionary observable can be calculated approximately from single-site population genetics. (ii) The crossover to the interference condensate regime can be described by a scaling function that depends only on the variable ω given by equation (1). Here and in the following, crossover is used as a technical term of scaling theory that is not to be confused with the genetics term. (iii) In the condensate, evolutionary observables follow broad heuristic constraints, and there is a matching condition between both scaling regimes at the crossover point ω*. We consider long genomic segments that evolve under limited recombination; individual sequence sites have a distribution of selection coefficients with , where s is the average selection coefficient and N is the effective population size. For simplicity, we neglect prefactors of order 1 and corrections to scaling, which often depend on more specific model assumptions. Our analysis in the main text builds on a minimal model with the following scaling relations:
The average fitness variance per unit sequence, ς^2, takes the form where σ is a characteristic selection strength in the condensate regime. The local interference expression follows from direct calculation by single-site population genetics, assuming statistical independence of selected alleles at different sites. The leading condensate asymptotics ς^2 ≃ ρσ is then already determined by the scaling properties (ii) and (iii). Specifically, we evaluate the matching condition at the crossover point ω* ~ 1 of the minimal scaling theory with the requirement that ς^2 depends only weakly on ud and vb in the condensate. This is expected from the jamming of genomic space-time shown in Fig. 1b and implies that the recombination rate becomes a limiting factor of ς^2 in the condensate. The scaling argument given in the main text suggests a specific functional form in the condensate, which is given by see also ref. [26]. Equation (7) can be rescaled to a dimensionless form, which is confirmed by our simulation results (Fig. 1c). Equations (7) and (9) and, in particular, the minimal crossover scaling ω* ~ 1 are consistent with previous results for evolution under beneficial mutations [26] and under background selection including moderate effects (log(2Ns) ~ 1) [17, 60]. Strong heterogeneities in the effect distribution or background selection by strongly deleterious mutations generate systematic shifts of the crossover point ω*; the corresponding extensions of scaling theory are discussed below.
As explained in the main text, σ determines the characteristic scales of time and genomic distance in the condensate [26],
The neutral sequence diversity, , can be estimated by the form
Using again the matching criterion at the crossover point ω*, equation (11) interpolates between background selection in the local interference regime and neutral evolution under genetic draft with rate σ in the condensate regime [17, 60, 20, 27]. In the local interference regime, π is proportional to N, which implies that established neutral mutations evolve predominantly under genetic drift and their spectrum is of Kimura form. Hence, the genetic draft of neutral mutations and the resulting depletion of intermediate and high frequencies is a marker of the condensate regime, which sets in at the crossover point ω*. Equation (8) then predicts the form of the diversity in the condensate regime,
Equations (7) – (11) show that π measures key characteristics of the condensate regime, the spacetime scaling (equation (5)) and the fitness variance per unit sequence (equation (6)). In the main text, we use the sample sequence diversity at synonymous sites, to infer these characteristics in the Drosophila genome. As discussed in the main text, the Drosophila genome has a distribution of selective effects that includes sites with weak and moderate selection, for which the minimal scaling theory with ω* ~ 1 should be applicable. We find clear evidence that πs follows the scaling behavior predicted by equations (11) and (12); see Fig. 3d.
Extensions of scaling theory
Evolution by beneficial and deleterious mutations is a complex process whose details depend on their rates and effect distribution. The minimal scaling theory is a coarse approximation of this process. It provides useful approximations of genomic statistics over a wide range of evolutionary parameters, which includes settings appropriate for Drosophila. Here we discuss two extensions that serve to link our scaling theory to existing evolutionary models and to delineate the range of validity of the minimal model.
Background selection with strong effects. Equation (11) predicts the onset of interference selection for sites of selection coefficient s at a characteristic value
This expression is consistent with known results of background selection theory [15, 20, 27, 43]. If background selection involves only strongly selected sites (2Ns ≫ 1), we obtain a shift of the crossover point ω* observed in aggregate data to values above 1 (Supplementary Figure S1a). The crossover point is still marked by the onset of interference selection on neutral sites and the resulting spectral shape distortion. This regime is not relevant for Drosophila, where we observe ω* ~ 1 and consistently, genomic sites under weak and moderate selection.
Selective sweeps. An adaptive process driven by strongly beneficial mutations with rate vb and average selection coefficient sb generates a fitness flux ϕ = vbs, which measures the speed of adaptation per unit sequence [30]. Under this process, a genomic focal site is subject to linked sweeps at a rate σsweep = vbξs = vbsb/ρ = 2Nϕ/ρ. Focal sites of selection coefficient s < sb are strongly affected by interference for σsweep > s, which sets the crossover point to interference selection,
In the special case of equal selection coefficients at all sites (s = sb), the crossover point is again ω* = 1, independently of sb [26, 61] (Supplementary Figure S1b). The onset of interference on neutral sites and the resulting spectral shape distortion occur at a value ω0 = 1/(2Nsb) = ω*/(2Ns), which is smaller than ω*. This regime is not observed in Drosophila; strong sweeps are too rare to distort neutral spectra for ω < 1.
Link to asexual evolution
An explicit expression for σ can be obtained if we identify the total fitness variance per correlation interval, ξς^2 = σ^2, with the corresponding quantity in models of asexual populations with a genome of length ξ [26]. In the intermediate fluctuation regime of travelling waves, this identification yields the analytic expression , where log(…) stands for a weak and model-specific dependence on the parameters ub, , N and the system size ξ; see equation (15) of ref. [61]. This form has a leading large-asymptotics consistent with our scaling argument given in the main text, . In ref. [27], an analogous identification is discussed for evolutionary processes dominated by weakly selected alleles.
The theoretical limit of strictly asexual evolution, which is reached at very low recombination rates ρ ≪ σ/L in a genome of length L, can be described in terms of our scaling theory by substituting L for the correlation length ξ = σ/ρ or ξs = s/ρ, respectively. In this limit, the interference density (1) for mutations of effect s takes the form which depends on the genome-wide rates Ud = Lud and Vb = Lvb. For Vb = 0, the identity (13) for ω* becomes the well-known criterion for the onset of Muller’s ratchet in asexual populations, Ud/s ~ log(Ns) [42, 62, 43, 45].
Evolutionary model: mutation frequency distributions
As explained in the main text, we use a specific model of interference selection to parametrize site frequency spectra: individual sites evolve under mutations (with rate μ), selection (with site selection coefficient s > 0), genetic drift (in a population of effective size N), and periodically recurrent genetic draft (with rate σ). The draft model generates site frequency spectra that can be estimated analytically by a saddle-point approximation to the path integral of mutation frequency paths [63]. For beneficial mutations (of selection coefficient s) and deleterious mutations (of selection coefficient −s) at two-allelic sites, we obtain the frequency distributions respectively; these distributions depend on the mutation density θ = θ0 e^(−ω) = μN e^(−ω) ≪ 1, the scaled draft rate ν = 2Nσθ/θ0, and the scaled site selection coefficient ζ = 2Nsθ/θ0. The function q0(x) = x^(−1+θ) (1 – x)^(−1+θ) denotes the neutral spectrum under genetic drift. The no-sweep probability p(τ, σ) over a time interval τ is assumed to be strongly suppressed for τ ≳ 1/σ. The exponential weight involves the maximum-likelihood frequency path with effective selection coefficient , which is denoted by . This path follows the equation of motion with g(x) = x(1 – x) and has a sojourn time up to frequency x. The prefactor Z±(θ,ζ,ν) ensures the normalization . An approximate evaluation of the integral in equation (16) results in the remarkably simple spectral function
This function consistently interpolates between the asymptotic regimes of effectively neutral mutations (ζ ≪ ν, i.e., ), which are dominated by genetic draft (Q ≃ e^(−νx)/x), and strongly selected mutations (ζ ≫ ν, i.e., ), which evolve in an autonomous way (Q ≃ e^(±ζx)/x). For the spectral function of neutral sites, we use the shorthand
The family of spectral functions (18) provides a good parametrization of the spectral data in our simulations (Fig. 1d), as well as in all sequence classes and recombination rate classes of Drosophila (Fig. 3ab, Fig. 4, and Supplementary Figure S3). Frequency distributions of the form (17) serve as building blocks for our genomic inference; see equations (2) and (3) of the main text and equations (25) – (27) below.
The spectral functions of the draft model map the crossover from drift-dominated to draft-dominated evolution in analytical form. The neutral site frequency spectrum (2) determines the sequence diversity which is consistent with the scaling behavior (11). In the local interference regime π ≃ θ = μN e^(−ω); i.e., background selection reduces diversity but does not affect the shape of the frequency spectrum. In the condensate regime, π ≃ 2μ/σ < θ, i.e., diversity and spectral shape are determined by interference selection [6]. These features hold for broad classes of interference selection [46], making the spectral functions a convenient choice for parametrizing the Drosophila site frequency spectra. Specifically, we use the condition ν > 1 on the shape parameter inferred from the spectrum of synonymous sequence sites as a marker of the interference condensate regime.
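The crossover of the diversity between the two regimes can be illustrated with a short Python sketch. It assumes, for illustration only, a neutral spectrum of the draft-dominated form q(x) = θ e^(−νx)/x (valid for θ ≪ 1) and the heterozygosity π = ∫ 2x(1 − x) q(x) dx. Under these assumptions the integral has the closed form π/θ = 2(ν − 1 + e^(−ν))/ν^2. The exact spectral function and its normalization in the draft model may differ by factors of order 1.

```python
import math


def diversity_over_theta(nu):
    """pi / theta for an assumed spectrum q(x) = theta * exp(-nu * x) / x.

    Closed form of the integral of 2 * (1 - x) * exp(-nu * x) over [0, 1].
    """
    if nu < 0:
        raise ValueError("shape parameter nu must be non-negative")
    if nu < 1e-6:
        return 1.0 - nu / 3.0  # leading terms of the small-nu expansion
    # nu + expm1(-nu) equals nu - 1 + exp(-nu), computed with less cancellation
    return 2.0 * (nu + math.expm1(-nu)) / nu**2


# Drift-dominated (nu << 1) versus draft-dominated (nu >> 1) limits.
limits = {nu: diversity_over_theta(nu) for nu in (0.0, 0.1, 1.0, 10.0, 100.0)}
```

In this sketch π approaches θ for ν ≪ 1 and decays as 2θ/ν for ν ≫ 1, so that ν of order 1 marks the crossover, in line with the condition ν > 1 used as a marker of the condensate.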
Evolutionary model: substitution dynamics and allele occupancy
The draft model also serves to parametrize the sequence evolution between the ingroup species D. melanogaster and the outgroup species D. simulans. In this model, allele substitutions at individual sites take place with Kimura-Ohta rates that depend on the local coalescence rate (or inverse effective population size)
Consistently with equations (11) and (20), the coalescence rate maps again the crossover between local interference and condensate regime. The beneficial and deleterious substitution rates depend on the scaled selection coefficient ζ and the scaled coalescence rate . Models of this form have been shown to provide an excellent approximation to the equilibrium substitution dynamics in linked genomes under different scenarios of interference selection [12, 58]. The rates (22) consistently determine the equilibrium occupancy probabilities of beneficial and deleterious alleles, as well as the expected sequence divergence between in- and outgroup species, where τd is the divergence time and d0 = μτd the expected divergence at neutral sites.
Ancestor-directed, outgroup-directed, and corrected frequency spectra
As explained in the main text, the evolutionary model specified by equations (22) – (24) serves to relate the outgroup-directed frequency spectra q0(x) and the basic frequency distributions of the draft model, q±(x) (equation 17) without additional fitting parameters. Specifically, the ancestor-directed spectrum qs(x; θ, ν) at synonymous sites, which is of the form (2), determines the outgroup-directed spectrum with the spectral function Q0(x; ν) given by equation (19). In other sequence classes, we use ancestor-directed spectra of the form (3), with the spectral functions Q±(x; ζ,ν) given by equation (18) and the allele occupancy probabilities λ±(ζ, ν) given by equation (23). These determine the outgroup-directed counterparts
We can reconstruct the synonymous spectrum qs(x) from by inverting the linear map (25),
Applying this transformation to the outgroup-polarized spectral data of synonymous mutations, (Fig. 3ab), produces the corrected spectral data shown in Supplementary Figure S2. These provide a bona fide improved approximation to the underlying spectrum qs(x). However, the reconstruction becomes noisy in the limit x → 1, where is dominated by the component .
Bayesian estimation of model parameters
Consider a sequence class with population frequency spectrum q(x; θ, θ′, ζ,ν) given by a two-component model of the form (3); the associated outgroup-polarized spectrum qo(x;θ,θ′,ζ,ν) is given by equation (27). In that class, a sample of n random individuals contains mutations of discrete outgroup-polarized frequency x = k/n with probability (see also ref. [64]) this expression yields closed analytical expressions involving hypergeometric and Gamma functions.
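The sampling step can be illustrated numerically. The following Python sketch assumes binomial sampling of n alleles from a population frequency spectrum q(x) and evaluates the resulting integral with a simple midpoint rule instead of the closed-form special functions mentioned above. The function names and the example spectrum θ e^(−νx)/x are illustrative assumptions, not the parametrization implemented by the authors.

```python
import math


def sampled_spectrum(q, n, steps=20000):
    """Probability that a site shows k of n sampled alleles derived, k = 1..n-1.

    Assumes binomial sampling: P(k) = C(n, k) * integral of q(x) x^k (1-x)^(n-k).
    The integral is approximated by the midpoint rule on [0, 1].
    """
    h = 1.0 / steps
    probs = {}
    for k in range(1, n):
        total = 0.0
        for i in range(steps):
            x = (i + 0.5) * h
            total += q(x) * x**k * (1.0 - x) ** (n - k)
        probs[k] = math.comb(n, k) * total * h
    return probs


def draft_spectrum(theta, nu):
    """Assumed neutral spectrum theta * exp(-nu * x) / x (illustrative form)."""
    return lambda x: theta * math.exp(-nu * x) / x


# Neutral drift (nu = 0) versus draft-distorted (nu = 5) spectra, sample size 10.
neutral = sampled_spectrum(draft_spectrum(0.01, 0.0), n=10)
drafted = sampled_spectrum(draft_spectrum(0.01, 5.0), n=10)
```

For ν = 0 the sketch recovers the standard neutral expectation θ/k, while ν > 0 depletes intermediate and high frequency classes relative to singletons.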
By calibrating the model distributions with observed site frequency spectra and divergence data, we can infer parameters of the model (25) for synonymous sites and of the mixed model (27) for other sequence classes. Our inference is based on the total log-likelihood score of the observed frequency counts in a given sequence class, where L is the total number of sequence sites in the class. We have developed a consistent Bayesian inference scheme that takes into account the allele occupancy (23), the evolutionary dynamics (24), and the sampling statistics (29). This scheme proceeds in a hierarchical way: we first determine a posterior distribution of parameters (θs, ν) for synonymous sites, using the single-component model (25). Then we obtain the posterior distribution of parameters for amino-acid changes and the analogous distributions for other sequence classes, using the mixed model (3) with the same value of ν as for synonymous sites (this constraint does not induce a significant drop in likelihood score). Our inference scheme is implemented in the software “hfit” (https://github.com/stschiff/hfit), using special functions and numerical optimization routines from the GNU Scientific Library (http://www.gnu.org/software/gsl/) and a custom MCMC algorithm to obtain maximum-likelihood estimates and confidence intervals for all parameters.
The Bayesian inference scheme, together with the substitution model given by equations (22) – (24), allows a direct estimate of the rates ud, vb and of the interference density ω from observed frequency spectra and substitutions at synonymous and non-synonymous sites. First, the rate of deleterious mutations in a given sequence class is simply ud = μλ+θ/θs. Equation (27) then determines the total rate of deleterious nonsynonymous mutations, which is the sum of contributions from moderately deleterious changes and from strongly deleterious changes. Second, the rate of adaptive amino acid substitutions is given by the excess of nonsynonymous divergence compared to the expectation from the equilibrium model (24),
Here we have treated synonymous mutations as (approximately) neutral. In the local interference regime, we have ζa ≫ 1 and, hence, λ+(ζa,ν) ≈ 1 and d(ζa, ν) ≈ 0. Equations (31) and (32) then reduce to the expressions given in the main text, ud/μ = αd with αd = 1 − (θa/θs) and vb/μ = αb(da/ds) with αb = 1 − (ds/da)(θa/θs). These expressions are evaluated using measured divergence data ds,da and maximum-likelihood spectral parameters θs,θa. They enter equation (4) for the interference density ω, which serves to estimate the threshold recombination rate ρ* from the condition ω* = 1. To estimate the fraction of adaptive substitutions, αb, in the condensate regime (Fig. 5d), we use the full expression (32).
Genomic data and sequence annotation
We downloaded the complete genome sequences of 168 lines from the Drosophila Genetic Reference Panel (DGRP) from the DGRP website http://dgrp.gnets.ncsu.edu and of 27 lines sampled from Rwanda from the Drosophila Population Genomics Project http://dpgp.org as fasta files. We downloaded the reference sequences of Drosophila simulans and Drosophila yakuba, aligned to the reference sequence of Drosophila melanogaster, from the UCSC genome browser (https://genome.ucsc.edu). For both outgroups, we compute outgroup-directed allele frequencies at all sites at which (i) there is a valid outgroup allele, and (ii) at least 150 lines of the DGRP sequences or 25 lines of the DPGP sequences have a called allele. We then downsample all sites to 25 called alleles, using random sampling without replacement (hypergeometric sampling).
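The downsampling step can be sketched as follows in Python. The function returns the hypergeometric distribution of the derived-allele count in a subsample of m = 25 alleles, given k derived alleles among n called alleles at a site. The function name and the example counts are hypothetical; the actual pipeline may draw one random subsample per site rather than compute the full distribution.

```python
from math import comb


def downsample_probabilities(k, n, m=25):
    """Hypergeometric distribution of derived-allele counts after downsampling.

    k: derived alleles among the n called alleles at a site
    m: target sample size (sampling without replacement)
    Returns a list p with p[j] = P(j derived alleles in the subsample).
    """
    if not 0 <= k <= n or m > n:
        raise ValueError("require 0 <= k <= n and m <= n")
    total = comb(n, m)
    # comb(a, b) is 0 when b > a, so impossible outcomes get probability 0
    return [comb(k, j) * comb(n - k, m - j) / total for j in range(m + 1)]


# Hypothetical site: 32 derived alleles among 160 called alleles.
probs = downsample_probabilities(k=32, n=160)
```

Averaging these probabilities over all sites yields the expected downsampled frequency spectrum without the sampling noise of a single random draw.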
We downloaded gene annotations from FlyBase [43]. We define annotation categories as follows. Intergenic: intergenic regions that are at least 5kb away from genes, Intron: introns of protein-coding genes, UTR: untranslated regions in exons, Synonymous: protein-coding sites of the reference genome at which none of the three possible point mutations changes the encoded amino acid, Nonsynonymous: protein-coding sites on the reference at which any of the three possible point mutations changes the encoded amino acid. Most genes have multiple associated transcripts due to alternative splicing. We choose the transcript corresponding to the longest encoded protein coding sequence for each gene and annotate introns, UTRs, synonymous and nonsynonymous sites according to that transcript. See Supplementary Table S1 for the number of sites in a given annotation category on the different chromosomes.
Maps of mean recombination rates within 100kb windows were obtained from Comeron et al. [31] through the website http://www.recombinome.com. We use the recombination map to annotate every site in the Drosophila genome. We then use only synonymous sites on the autosomes (2L, 2R, 3L and 3R) and define quantile boundaries on this set. Specifically, we sort all recombination rate values of this set of sites and determine recombination rate bins by dividing the data set into 21 equally large subsets of values. We then use these quantile boundaries to bin all sites (not just synonymous sites) into bins according to their local recombination rate. The quantile boundaries used in this study for autosomal data are (in cM/Mb): 0.0, 0.069, 0.217, 0.415, 0.44, 0.821, 1.055, 1.29, 1.415, 1.592, 1.741, 1.938, 2.169, 2.354, 2.612, 2.838, 3.156, 3.461, 3.796, 4.244, 5.395, Infinity. All binned allele frequency data is given in Supplementary Table S1.
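The binning by recombination rate can be sketched with a binary search over the quantile boundaries listed above. In this Python sketch, the left-closed convention (a site with rate exactly on a boundary goes to the higher bin) and the function name are assumptions.

```python
from bisect import bisect_right

# Quantile boundaries in cM/Mb as listed in the text (22 edges, 21 bins).
BOUNDARIES_CM_PER_MB = [
    0.0, 0.069, 0.217, 0.415, 0.44, 0.821, 1.055, 1.29, 1.415, 1.592, 1.741,
    1.938, 2.169, 2.354, 2.612, 2.838, 3.156, 3.461, 3.796, 4.244, 5.395,
    float("inf"),
]
NUM_BINS = len(BOUNDARIES_CM_PER_MB) - 1


def recombination_bin(rate_cm_per_mb):
    """Return the 0-based recombination class of a site.

    Bin i covers BOUNDARIES[i] <= rate < BOUNDARIES[i + 1] (assumed convention).
    """
    if rate_cm_per_mb < 0:
        raise ValueError("recombination rate must be non-negative")
    index = bisect_right(BOUNDARIES_CM_PER_MB, rate_cm_per_mb) - 1
    return min(index, NUM_BINS - 1)


# Example: assign a few hypothetical site rates to their classes.
bins = [recombination_bin(r) for r in (0.0, 0.05, 0.069, 2.0, 6.0)]
```

Because the boundaries are defined on synonymous autosomal sites only, other sequence classes will generally not be distributed equally across the 21 bins.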
Simulations of evolutionary processes in recombining populations
We use the SLiM simulator [65] to simulate a population of sequences evolving under mutations, drift, selection, and recombination; the genome of each individual has 100,000 sites. To mimic the Drosophila phylogeny, we start from a single population of size N = 1000 that evolves for 10,000 generations, then splits into ingroup and outgroup populations of size N = 1000; these evolve in isolation for another 10,000 generations. Finally, we sample one individual from the outgroup population and 20 individuals from the ingroup population.
We consider three classes of mutations: neutral mutations, beneficial mutations and deleterious mutations, the latter two with fixed selection coefficient sad = 0.01. The rate of neutral mutations is μ = 1.5 × 10^−6, the rate of beneficial mutations varies from ub = 0 to 2.5 × 10^−7, and the rate of deleterious mutations from ud = 0 to 3 × 10^−6. We also run simulations with only one class of selected mutations (i.e., ud = 0 or ub = 0). The recombination rate ρ varies in the range 10^−7 to 10^−4.
We use these simulations to display the transition from local interference to the interference condensate and to corroborate our scaling theory (Fig. 2). In particular, the simulations demonstrate that the maximum-likelihood shape parameter ν* inferred from the spectral data of synonymous sequence sites can serve as a faithful marker of the interference condensate regime.
Acknowledgments
We would like to thank P.W. Messer for comments on an earlier version of the manuscript.