ABSTRACT
Reassortment in viruses with segmented genomes is a major evolutionary process underlying their genetic diversity and adaptation. It is also crucial in generating different levels of sequence polymorphism among segments when positive selection occurs at different rates on them. Previous studies have detected intra-subtype reassortment events in human influenza H3N2 from between-segment incongruity in phylogenetic tree topology. Here, we quantitatively estimate the reassortment rate, defined as the probability that a pair of segments in a viral lineage become separated in a unit time, between hemagglutinin (HA) and four non-antigenic segments (PB2, PB1, PA and NP) in human influenza virus H3N2. Using statistics that measure incongruity in tree topology or linkage disequilibrium between segments, and performing simulations that are constrained to reproduce the various patterns of H3N2 molecular evolution, we infer that the reassortment rate ranges between 0.001 and 0.01 per generation, assuming one generation to be 1/80 year. However, we find that a higher rate of reassortment is required to generate the observed pattern of ~40% less synonymous sequence polymorphism on HA relative to non-HA segments, which results from recurrent selective sweeps by antigenic variants on the HA segment. Here, synonymous diversity was compared after correcting for differences in inferred mutation rates among segments, which we found to be significant. We also explored analytic approximations for the inter-segmental difference in sequence diversity for a given reassortment rate to understand the underlying dynamics of recurrent positive selection. We suggest that the effects of clonal interference and a potentially demography-dependent rate of reassortment in the process of recurrent selective sweeps must be considered to fully explain the genomic pattern of diversity in H3N2 viruses.
The evolution of influenza virus has long been a major subject of modern biological research, owing to its significant impact not only on human public health but also on the study of adaptive molecular evolution (Fitch et al. 1991; Yang 2000; Nelson and Holmes 2007). Studies have focused on the sequence evolution of the hemagglutinin (HA) and neuraminidase (NA) gene segments because HA and NA proteins are recognized as antigens by the host adaptive immune system. Rapid amino acid substitutions at their epitope sites cause “antigenic drift” that forces updates in flu vaccines. The occurrence of positive selection on these sites has been confirmed by various lines of evidence and is now generally believed to drive the evolutionary dynamics of influenza viruses. However, it is not easy to identify and analyze selection from the observation of viral sequence evolution, for two reasons: the complex seasonal dynamics of viral populations has its own effect on the temporal patterns of sequence diversity or polymorphism, and multiple antigenic sites, together with other functional sites on a non-recombining segment (“complete linkage” within a segment), undergo correlated evolutionary changes (Illingworth and Mustonen 2012; Strelkowa and Lässig 2012; Kim and Kim 2015).
The correct evolutionary model of influenza virus must predict the observed patterns of sequence diversity as accurately as possible. The observed patterns include the characteristic “cactus-like” genealogical trees for individual viral segments (Buonagurio et al. 1986; Ferguson et al. 2003; Bedford et al. 2011), the ratio of nonsynonymous versus synonymous sequence changes at epitope and non-epitope sites (Ina and Gojobori 1994), the relative distribution of amino acid substitutions in external versus internal branches of trees (Fitch et al. 1997; Pybus et al. 2007), the absolute level and geographic differentiation of sequence diversity (Bedford et al. 2010), and the temporal correlation of variants’ fixation events (Strelkowa and Lässig 2012; Kim and Kim 2015). These patterns were mostly observed in the HA segment and therefore models were developed mainly to explain the evolution of the HA gene. However, the evolutionary change of HA is not independent of other segments. Unless viruses co-infect hosts very frequently and exchange segments with each other, thus resulting in very frequent reassortment, different segments in a viral lineage are not separated during most of the infectious cycles. Such correlated inheritance, or genetic linkage across segments, means that evolutionary events on one segment can affect the dynamics of others. The pattern of genetic variation observed in HA is therefore expected to be shaped by the fitness effects of variants not only on HA but also on other segments.
The HA segment of H3N2 viruses exhibits a lower level of genetic diversity, measured in mean time to coalescence, than other segments (Rambaut et al. 2008). This is explained by recurrent positive selection occurring at a far higher rate on the HA segment than on other segments. Selective sweeps driven by antigenic variants on HA therefore cause the greatest reduction in polymorphism at linked sites on the same segment but a less severe reduction at other segments, because occasional reassortment events break down the hitchhiking effect (Maynard Smith and Haigh 1974). Relative diversity between segments is therefore informative about adaptive evolution in the HA gene. In addition, negative (or purifying) selection against deleterious mutations causes a reduction in polymorphism, an effect termed background selection (Charlesworth et al. 1993). This variation-reducing effect is also greatest on completely linked sites and diminishes as linkage becomes weaker. Since negative selection must be operating in all genes of influenza virus to maintain their functions, genetic diversity at HA must be affected not only by negative selection on the same segment but also by that on all the other segments, unless reassortment is very frequent relative to the strength of negative selection.
Therefore, the evolutionary model of positive and negative selection should be tested against the inter-segmental levels and patterns of sequence polymorphism. However, a crucial parameter in such a model with multiple viral segments, the rate of reassortment between segments, is not well known. Reassortment in segmented RNA viruses, effectively equivalent to meiotic recombination in most eukaryotes, plays a critical role in their evolution. To date, eleven families of RNA viruses are known to have segmented genomes (McDonald et al. 2016). Among these, reassortment in influenza virus has been most intensively studied. Through this process, influenza viruses can acquire novel variation that confers resistance to antivirals (Simonsen et al. 2007). Intrasubtype reassortments also drive adaptive amino acid replacements. Past pandemics have been attributed to reassortment between different influenza subtypes (Nelson and Holmes 2007). Therefore, detecting and understanding reassortment has been of great public health interest.
Numerous studies have detected reassortment from serially sampled influenza virus sequences. Reassortments were observed within and between subtypes of human influenza A (Holmes et al. 2005; Schweiger et al. 2006; Lycett et al. 2012; Lu et al. 2014; Westgeest et al. 2014; Pinsent et al. 2015; Berry et al. 2016; Villa and Lässig 2017), and between lineages of influenza B virus (Dudas et al. 2014). While reassortment can be identified manually by comparing phylogenies between segments, computational detection algorithms have been proposed for comprehensive analysis. The most widely used method detects a clade that occupies one position in the phylogenetic tree constructed for one segment but a different position in the corresponding tree for another segment (Nagarajan and Kingsford 2010). Such a clade thus represents a reassortant. Other methods do not depend on phylogeny. Rabadan et al. (2008) identified the presence of reassortment when the mean sequence difference between two taxa is highly variable across segments. However, this approach overlooked the possibility that different segments may have different sequence diversity not due to reassortment but due to segment-dependent effective population sizes.
Despite these sophisticated methods for identifying reassortment, few attempts have been made to estimate how frequently it occurs during viral reproduction, particularly in comparison to the rates of mutation and coalescence. The rate of reassortment per unit time (∆t) can be defined as the probability that a pair of segments in a given individual virus at time t come from different parental viruses that existed at time t - ∆t. If the reproduction of viruses can be approximated as a discrete-time process, a natural choice for the unit time above is the average length of a single host infection cycle, which we arbitrarily define to be one “generation” (Kim and Kim 2016). In previous studies, reassortment rate was often estimated as the number of detected events divided by years or by the number of synonymous changes on the tree (Villa and Lässig 2017). This quantity may be a lower bound of the actual rate since only those events leaving sufficiently conspicuous inter-segmental incongruence in phylogenies are counted. In this study, we perform a quantitative analysis of reassortment rate in influenza H3N2, using summary statistics (“metrics”) that measure either incongruity in tree topology or linkage disequilibrium. We conduct simulations of viral sequence evolution under four different models, including recurrent positive selection with and without complex demography, all of which are constrained to replicate the key patterns of H3N2 sequence variation. We then identify the range of reassortment rate that reproduces the observed values of these metrics as well as the ratio of sequence diversity at HA versus non-HA segments. We also seek analytic approximations for the effect of recurrent positive selection with varying reassortment rate, and other theoretical explanations, to understand inter-segmental variation in the level of polymorphism observed in the actual and simulated sequences.
MATERIALS and METHODS
Sequence data
Genome sets of human influenza A/H3N2 sequences were downloaded from Influenza Virus Genome Set of National Center for Biotechnology Information (NCBI). A genome set is defined as the sequences of viral segments from a single virus isolate. In this study, we use sequences of HA, PB2, PB1, PA and NP segments. Outlier sequences (different from other sequences in the same year at more than 100 sites) and sequences containing symbols other than A, C, G and T were discarded.
Statistics for inter-segmental genetic correlation
Robinson-Foulds distance (RFD; Robinson and Foulds 1981) was calculated between evolutionary trees from different viral segments to quantify incongruence between their topologies. Neighbor-joining trees constructed from individual segments were fed into TreeDist in PAUP 4 test version (Swofford 2003).
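As an illustration of what the Robinson-Foulds distance measures, the following minimal sketch counts the non-trivial bipartitions (splits) present in one unrooted tree but not in the other. It is not the PAUP TreeDist implementation used in this study; the nested-tuple tree encoding and the function names are assumptions made for the example.

```python
def leaf_sets(tree):
    """Return (leaves under this node, leaf sets of all internal nodes at or below it)."""
    if not isinstance(tree, tuple):           # a leaf label
        return frozenset([tree]), []
    leaves, clades = frozenset(), []
    for child in tree:
        child_leaves, child_clades = leaf_sets(child)
        leaves |= child_leaves
        clades.extend(child_clades)
    clades.append(leaves)
    return leaves, clades


def splits(tree):
    """Non-trivial bipartitions of an unrooted tree, each stored as the side lacking a reference taxon."""
    all_leaves, clades = leaf_sets(tree)
    ref = min(all_leaves)                     # arbitrary but fixed reference taxon
    out = set()
    for clade in clades:
        side = clade if ref not in clade else all_leaves - clade
        if 1 < len(side) < len(all_leaves) - 1:   # skip trivial splits
            out.add(side)
    return out


def robinson_foulds(tree1, tree2):
    """Number of splits found in exactly one of the two trees (same taxon set assumed)."""
    return len(splits(tree1) ^ splits(tree2))


# Hypothetical five-taxon trees for two segments; taxon B has changed position.
segment1_tree = (("A", "B"), ("C", ("D", "E")))
segment2_tree = (("A", "C"), ("B", ("D", "E")))
print(robinson_foulds(segment1_tree, segment2_tree))
```

With identical topologies the distance is zero, so a larger value across segment trees indicates more topological incongruence, which is the signal used here as evidence of reassortment.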
The standard measure of linkage disequilibrium (LD) for a pair of bi-allelic sites, ρ², is calculated as

$$\rho^2 = \frac{D_{AB}^2}{p_A (1 - p_A)\, p_B (1 - p_B)},$$

where p_A is the frequency of allele A at locus 1, p_B is the frequency of allele B at locus 2, and D_AB = p_AB − p_A p_B (Hill and Robertson 1968). To quantify LD between segments, ρ² is calculated for each pair of sites, one on segment 1 and another on segment 2. The average over all such pairs is given by $\overline{\rho^2_{12}}$. We define a metric that quantifies between-segment LD relative to within-segment LD as

$$\lambda = \frac{\overline{\rho^2_{12}}}{\overline{\rho^2_{22}}},$$

where $\overline{\rho^2_{22}}$ is the mean of ρ² between all pairs of sites within segment 2, which is either a non-HA segment in actual data or a segment that evolves without positive selection in simulation (see below).
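The calculation can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes that each site is given as a list of 0/1 alleles across the sampled genomes, that monomorphic sites are skipped, and that the between-segment metric λ is read as the ratio of mean between-segment ρ² to mean within-segment ρ² on segment 2. All function names are hypothetical.

```python
from itertools import combinations, product


def r_squared(site_a, site_b):
    """Squared LD correlation for two bi-allelic sites coded as 0/1 per sampled genome."""
    n = len(site_a)
    p_a = sum(site_a) / n
    p_b = sum(site_b) / n
    p_ab = sum(x * y for x, y in zip(site_a, site_b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:                  # at least one site is monomorphic
        return None
    d_ab = p_ab - p_a * p_b
    return d_ab ** 2 / denom


def mean_r_squared(site_pairs):
    """Average of r_squared over site pairs, ignoring pairs with a monomorphic site."""
    values = [r_squared(a, b) for a, b in site_pairs]
    values = [v for v in values if v is not None]
    return sum(values) / len(values)


def lambda_metric(segment1_sites, segment2_sites):
    """Between-segment LD relative to within-segment LD on segment 2 (ratio form assumed)."""
    between = mean_r_squared(product(segment1_sites, segment2_sites))
    within = mean_r_squared(combinations(segment2_sites, 2))
    return between / within


# Two perfectly associated sites give r_squared = 1; independent sites give 0.
print(r_squared([1, 1, 0, 0], [1, 1, 0, 0]))
print(r_squared([1, 1, 0, 0], [1, 0, 1, 0]))
```

Under this reading, λ close to 1 means that sites on different segments are as strongly associated as sites within one segment, and λ declines toward 0 as reassortment decouples the segments.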
Topology-based linkage disequilibrium (TBLD), proposed recently by Wirtz et al. (2018) as an improvement over conventional SNP-based LD, is obtained by grouping sequences of a segment into two alleles defined by tree topology, as illustrated in Figure 1. Then, ρ² is calculated between segments using the frequencies of such topology-based alleles. For this analysis, the neighbor-joining trees constructed above for the Robinson-Foulds metric were used again.
The above metrics were calculated for 30 genomic sets (from either actual H3N2 or simulated population) randomly sampled within each 6-month time window. For H3N2 data, sequences from different regions (Asia, Europe, North America, South America, Oceania and others) were sampled proportionally to the number of sequences in the database. For a given metric, the average value over time windows, from year 2007 to 2016 for H3N2 data or over 10 simulation years, was obtained.
Simulation
We conducted individual-based simulations of virus evolution following the procedure described in Kim and Kim (2016), with modifications. In this study, a virus consists of two segments, each containing 1,000 bi-allelic sites. Segment 1 is modeled after the HA1 segment and has 770 “nonsynonymous” sites, including Lb “epitope” sites where beneficial mutations occur to increase viral fitness by s. Segment 2, on the other hand, does not have epitope sites. Other sites on segment 1 or 2 are either under negative selection (770 – Lb nonsynonymous sites that mutate to deleterious alleles with selection coefficient sd) or under neutral evolution (Ls = 230 “synonymous” sites). The population evolves in discrete generations, with one generation corresponding to 1/80 year. The mutation rate per site per year is given by μ = 8.0 × 10^−3 (10^−4 per generation), which is approximately the estimate of the per-nucleotide mutation rate in H3N2 viruses.
As described in Kim and Kim (2016), after the steps of migration and mutation, a Poisson number of progeny is produced per parental copy of virus as a function of its absolute fitness, which is obtained by multiplying its relative fitness (after combining the effects of all beneficial and deleterious mutations) by the ratio of population size to carrying capacity (K) at each generation. Let N be the number of viruses after these steps of reproduction. Then, we randomly select two viruses, which exchange their segments with probability 0.5. This step is repeated N × r times. Because each repetition involves two viruses and produces an exchange with probability 0.5, it generates on average one reassorted virus; the rate of reassortment per viral lineage per generation is therefore r.
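The reassortment step can be sketched as below. This is an illustrative reading of the description rather than the simulation code itself: the representation of a virus as a two-element list, the rounding of N × r to an integer number of attempts, and the choice to swap the second segment are assumptions (with two segments, swapping either one is equivalent).

```python
import random


def reassortment_step(population, r, rng):
    """Apply round(N * r) exchange attempts to a population of [segment1, segment2] viruses."""
    n = len(population)
    for _ in range(round(n * r)):
        i, j = rng.sample(range(n), 2)       # two distinct viruses chosen at random
        if rng.random() < 0.5:               # exchange happens with probability 0.5
            population[i][1], population[j][1] = population[j][1], population[i][1]
    return population


# Hypothetical population in which both segments of virus k carry the label k,
# so any virus with mismatched labels has reassorted.
rng = random.Random(1)
viruses = [[k, k] for k in range(1000)]
reassortment_step(viruses, r=0.01, rng=rng)
reassorted = sum(seg1 != seg2 for seg1, seg2 in viruses)
print(reassorted / len(viruses))
```

Each attempt touches two viruses and succeeds half of the time, so the expected fraction of reassorted viruses per generation is close to r when r is small.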
Four evolutionary models are considered. First, in model 1, both segments are subjected only to genetic drift in a population of nearly constant size (thus s = sd = 0). Here, the carrying capacity (K = 140 ≈ N) was chosen to yield π1 = 0.027, the mean pairwise sequence difference per site in segment 1, which corresponds to the observed synonymous diversity in the HA segment of the H3N2 population. We also consider models with recurrent positive selection without (model 2; sd = 0) or with (model 3; sd > 0) negative selection at other nonsynonymous sites on both segments. The strength of positive selection, s, is set to either 0.05 or 0.1, as we previously estimated s to range between 0.05 and 0.11 by examining how rapidly the frequencies of known antigenic-cluster-changing variants (Koel et al. 2013) increase over time (Kim and Kim 2016). In all models with selection, N was adjusted to yield π1 very close to 0.027. Other evolutionary parameters relevant for segment 1 in model 2 and model 3 are identical to those of Model A and Model B1 (L = 1,000), respectively, in Kim and Kim (2016). Finally, we examine the model of positive selection only, but together with metapopulation dynamics (model 4; s = 0.1 and sd = 0). This model uses the same parameter values as Model C3a (constant carrying capacity in the tropical region) of Kim and Kim (2016). Briefly, the metapopulation consists of ten local demes, each of which sends migrants to other demes in proportion to its size. Five and two demes are colonized in “winter” and “summer”, respectively, and go extinct in the next season. One remaining deme, modeling a tropical population with continuous influenza epidemics, is maintained without extinction.
Synonymous diversity and divergence
To obtain synonymous diversity π for each segment, mean pairwise synonymous difference among sequences sampled within a 6-month window was calculated according to Nei-Gojobori method (Nei and Gojobori 1986). Then, the average over 20 years from 1997 to 2016 (40 time windows) was obtained. Synonymous diversity corrected for mutation rate, π*, is obtained by dividing π by the synonymous divergence of corresponding segment from 1997 to 2016 (see below). Tajima’s D (Tajima 1989) was also calculated for 30 sequences in each of the above 6-month windows and the average over windows was obtained.
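For orientation, the two summary statistics named above can be sketched as follows. This is a simplified illustration that assumes aligned sequences of equal length and counts all site differences; it does not implement the Nei-Gojobori partition into synonymous and nonsynonymous changes, nor any adjustment for longitudinal sampling. The function names are hypothetical, and the Tajima's D formula is the standard one from Tajima (1989).

```python
from itertools import combinations
from math import sqrt


def mean_pairwise_diff(seqs):
    """Mean number of differing sites over all pairs of aligned sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total / len(pairs)


def tajimas_d(seqs):
    """Standard Tajima's D; returns None when no site is segregating."""
    n = len(seqs)
    seg = sum(len(set(column)) > 1 for column in zip(*seqs))   # segregating sites
    if seg == 0:
        return None
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (mean_pairwise_diff(seqs) - seg / a1) / sqrt(e1 * seg + e2 * seg * (seg - 1))


# Hypothetical sample in which every variant is a singleton, as on a star-like tree.
sample = ["AAAA", "TAAA", "ATAA", "AATA"]
print(mean_pairwise_diff(sample))   # 1.5
print(tajimas_d(sample))            # negative: excess of rare variants
```

Dividing the mean pairwise difference by the number of sites compared gives a per-site diversity comparable to π; a negative D reflects long external branches, the feature discussed later in connection with the detectability of reassortment.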
To estimate synonymous divergence, which is the number of nucleotide substitutions per synonymous site, we reconstructed a phylogeny for each of the PB2, PB1, PA, HA and NP segments. We tracked ancestral sequences at all internal nodes of the phylogeny on the path from the tree root to sequences sampled in each year and counted the cumulative number of synonymous changes on the path. Measuring cumulative divergence along the phylogeny, rather than just calculating synonymous differences between the two terminal years of sampling, prevents multiple nonsynonymous changes at one site from being counted as a synonymous change, especially in the HA1 domain, and prevents multiple synonymous changes at one site from being counted as a smaller number of changes. Neighbor-joining trees were reconstructed using PAUP with 30 sub-sampled sequences per year from 1973 to 2016, drawn from all available sequences in GenBank. The trees were rooted at the common ancestor of sequences collected in 1973 because, across all segments, these sequences exhibit very little diversity and therefore their common ancestor is confidently dated to the same year. Internal node states were inferred, to track synonymous changes along the branches, using the ACCTRAN method in PAUP. For this analysis, we used either four-fold synonymous sites or all synonymous sites. To obtain synonymous divergence for the latter, we used the Nei-Gojobori method.
To test whether the rate of sequence divergence at one segment is significantly different from that at another segment, a bootstrap test was performed according to Hall and Wilson (1991). Let d_X and d_Y be the divergences of segments X and Y. We then define Δ = d_X − d_Y and test whether it is significantly greater than zero. As the distribution of Δ under the null hypothesis (d_X = d_Y) can be approximated by the distribution of Δ* − Δ, where Δ* is a bootstrap value of Δ, the P-value is approximately the proportion of bootstrap samples that satisfy Δ* − Δ ≥ Δ. For each pair of segments, Δ was obtained from divergences from 1973 to 2016 calculated by the above method. A pseudo data set is prepared by randomly sampling triplet-codon columns in the alignment of a given segment with replacement until it has the same number of codons as the original sequence. For the bootstrap test, 1000 pseudo data sets were generated for each segment pair.
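The resampling logic can be sketched as below, following the Hall and Wilson (1991) guideline of centering the bootstrap statistic on its observed value. This is a simplified stand-in for the procedure described here: it assumes that each segment has already been reduced to a list of per-codon synonymous change counts (a hypothetical representation; the study recomputes divergence on the phylogeny for each resampled alignment), and the statistic is taken to be the difference in mean divergence per codon.

```python
import random


def bootstrap_divergence_test(codon_changes_x, codon_changes_y, n_boot=1000, seed=0):
    """One-sided bootstrap test of d_X > d_Y using codon-level resampling with replacement."""
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    delta = mean(codon_changes_x) - mean(codon_changes_y)    # observed difference
    hits = 0
    for _ in range(n_boot):
        # Resample codon columns until each pseudo data set has the original length.
        boot_x = rng.choices(codon_changes_x, k=len(codon_changes_x))
        boot_y = rng.choices(codon_changes_y, k=len(codon_changes_y))
        delta_star = mean(boot_x) - mean(boot_y)
        # Null distribution approximated by the centered bootstrap statistic.
        if delta_star - delta >= delta:
            hits += 1
    return delta, hits / n_boot


# Hypothetical per-codon counts for two segments of 300 codons each.
data_rng = random.Random(42)
seg_x = [data_rng.choice([0, 0, 1, 2]) for _ in range(300)]
seg_y = [data_rng.choice([0, 0, 0, 1]) for _ in range(300)]
print(bootstrap_divergence_test(seg_x, seg_y))
```

Centering on the observed difference, rather than comparing the raw bootstrap values with zero, is what makes the resampled distribution approximate the null hypothesis of equal divergence.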
Data availability statement
The authors affirm that all data necessary for confirming the conclusions of this article are represented fully within the article and its tables and figures.
RESULTS
The estimation of inter-segmental reassortment rate in influenza virus H3N2
Population genetic processes at different segments become uncorrelated as reassortment occurs. Therefore, we attempted to infer reassortment rate in H3N2 viruses using multiple summary statistics that measure correlation in the patterns of sequence diversity across segments. One metric we use is Robinson-Foulds distance (RFD) between evolutionary trees, each of which is constructed from sequences of one particular segment (Robinson and Foulds 1981). As a measure of tree incongruity, RFD is expected to be positively correlated with reassortment rate. On the other hand, linkage disequilibrium (LD) between polymorphism on different segments is expected to decay with an increasing rate of reassortment. We consider two summary statistics (metrics) of inter-segmental LD, λ and TBLD (see Methods).
To investigate whether these metrics are sufficiently informative and robust for inferring reassortment rate, we performed simulation of virus population in which two segments (one modeling the HA segment and the other a non-HA segment) are undergoing varying rates of reassortment (r = 0 to 10−2). The relationship between a given metric and reassortment rate may depend on the pattern of sequence diversity, which is determined by how viruses evolve. We therefore simulated virus population under four distinct population genetic models: simple neutral evolution (model 1), recurrent positive selection (selective sweeps; model 2), recurrent positive and negative selection (model 3), and positive selection under complex demographic dynamics (model 4). Parameters of each evolutionary model were adjusted to yield a constant level of synonymous sequence diversity (or effective population size Ne ≈ 140) and constant rate of adaptive substitutions (k ≈ 1.3) at the first segment, matching those at the HA segment of H3N2 population (Bhattet al. 2011; Kim and Kim 2016).
All three metrics change monotonically (increase in RFD and decrease in λ and TBLD) with increasing reassortment rate, particularly in the range where r is between 10^−3 and 10^−2 (Ner ≅ 0.1 ~ 1) (Figure 2). RFD responds most sensitively to r: the distribution of RFD for a given r is relatively narrow compared to the change of its mean with increasing r. However, the absolute values of RFD change significantly depending on the evolutionary model used in the simulation. On the other hand, λ and TBLD exhibit larger variances but are less sensitive to the evolutionary model.
We calculated these three metrics from HA-PB2, HA-PB1, HA-PA and HA-NP segment pairs in influenza H3N2 (Table 1). We do not observe clear difference in reassortment rates among these segment pairs. For example, TBLD is smallest between HA and PA but λ is largest for this pair. We therefore take averages over segment pairs and compare them to the simulation results above (see horizontal lines in Figure 2). The agreement between observation and simulation is generally poor for r < 10−3 or r = 10−2. Within the range between 10−3 and 10−2, the most likely value of r (judged by difference between the empirical value and the mean of simulated distribution) depends on the combination of metric and simulation model. This may suggest that relationship between each metric and reassortment rate varies according to the pattern of sequence polymorphism in the population or, equivalently, the topology of evolutionary trees shaped by selection and population structure.
A well-known summary statistic for tree topology is Tajima’s D (Tajima 1989). We computed Tajima’s D, modified for longitudinally sampled sequences (see Methods), for each segment in actual and simulated data (Table 2, Table S1). Simulation under model 4 yields values of Tajima’s D that closely match the value from the HA segment of H3N2. Therefore, given the hypothesis that tree topology modulates the response of our metrics to reassortment rate, the estimates of r under model 4 might be more sensible than those under other models. In this case, based on RFD, r is estimated to be between 0.002 and 0.005. However, it is not yet clear whether Tajima’s D captures the aspect of tree topology that modulates the outcome of reassortment, or whether it is tree topology alone that matters. For instance, the relationship between r and RFD is quite different in models 2 and 3, which nevertheless yield similar values of Tajima’s D (Table S1).
Given that r is at least 0.002 under the assumption of 80 generations per year, the probability that a viral lineage undergoes at least one reassortment within a year is approximately 1 − (1 − 0.002)^80 ≈ 0.15 or higher: namely, one copy of the HA segment and one copy of a non-HA segment found in one virus trace back to two different ancestral viruses of the previous year with a chance of about 15% or more. We confirmed that this per-year estimate does not change when one generation is set to 1/160 or 1/40 year (Figure S1).
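The conversion from a per-generation rate to a per-year probability is a one-line calculation, sketched here for reference. The function name is hypothetical, and the calculation assumes that reassortment occurs independently in each generation.

```python
def per_year_reassortment_probability(r, generations_per_year=80):
    """Probability that a lineage reassorts at least once in a year, given per-generation rate r."""
    return 1 - (1 - r) ** generations_per_year


print(round(per_year_reassortment_probability(0.002), 3))   # 0.148
print(round(per_year_reassortment_probability(0.005), 3))
```

For small r the result is close to r multiplied by the number of generations per year (0.16 for r = 0.002), with the exact value slightly lower because multiple reassortments on the same lineage are counted only once.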
We next examine how well our inference on reassortment rate matches the result of widely-used method of identifying reassortment events through phylogenetic graph-mining. Using GiRaF (Graph-incompatibility-based Reassortment Finder; Nagarajan and Kingsford 2010), we obtained the candidate sets of reassorted taxa when phylogenies are compared in HA-PB2, HA-PB1, HA-PA, and HA-NP segment pairs (Table 1) and between two segments in the above simulation (Table 3). Simulations show that the numbers of detected reassortments vary greatly according to evolutionary model. Models 2 and 3 lead to a larger number of detection for a given value of r. This might be because single-branch reassortments are more detectable with GiRaF (Nagarajan and Kingsford 2010) and genealogies produced under these models have longer outer branches (thus more negative Tajima’s D). With models 1 and 4 (2 and 3), simulation with r = 10−3 (smaller than 10−3) leads to the number of detections similar to that observed in actual viral sequences. Therefore, based on the number of GiRaF-detected reassortment events as a summary statistic, a smaller estimate of r is inferred relative to that obtained above using RFD, λ, or TBLD. It however needs to be investigated whether simulated sequences generated under idealized models have allowed better phylogenetic inference and thus more sensitive detection of reassortments.
Conversely, the reassortment rate r can be translated into the number of reassortment events on the phylogeny of sampled sequences, which GiRaF targets. Focusing on a specific pair of segments, the probability that at least one of two randomly sampled viruses is a reassortant (namely, going backward in time, one viral lineage experiences reassortment before the coalescence of the two lineages occurs) is given approximately by P(2) ≡ 2r/(1/Ne + 2r) = 2Ner/(2Ner + 1). Then, assuming r = 0.001 and Ne = 140, P(2) is about 0.22. With r = 0.005 it is about 0.58. Recently, using GiRaF, Berry et al. (2016) estimated that about 40% of H3N2 sequences are reassortants, looking at all eight segments. Therefore, the probability of sampling at least one reassortant out of two is 1 − 0.6^2 = 0.64. Assuming that a random set of segments is exchanged at a reassortment event, this corresponds approximately to P(2) = 0.64/2 = 0.32. Considering that GiRaF cannot detect all reassortment events in the sampled genealogy (Nagarajan and Kingsford 2010), we may conclude from this result that the number of reassortment events detected by direct identification of incongruent tree branches is compatible with our estimate of r being on the order of 0.001 (~ 0.1 per year).
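The numbers quoted in this paragraph can be reproduced with the short calculation below. The function names are hypothetical; the formula is the competing-rates approximation given in the text, in which reassortment on either of two lineages (rate 2r) competes with their coalescence (rate 1/Ne).

```python
def prob_reassortant_in_pair(ne, r):
    """P(2): probability that reassortment on either lineage precedes coalescence of the pair."""
    return 2 * ne * r / (2 * ne * r + 1)


def prob_at_least_one_reassortant(fraction_reassortant, sample_size=2):
    """Probability that a random sample contains at least one reassortant sequence."""
    return 1 - (1 - fraction_reassortant) ** sample_size


print(round(prob_reassortant_in_pair(140, 0.001), 2))    # 0.22
print(round(prob_reassortant_in_pair(140, 0.005), 2))    # 0.58
print(round(prob_at_least_one_reassortant(0.4), 2))      # 0.64
```

The GiRaF-based value of 0.32 for a single segment pair falls between the two computed values of P(2), which is the basis for regarding the detected events as compatible with r on the order of 0.001 once incomplete detection is allowed for.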
Reassortment rate and the inter-segmental pattern of sequence diversity and divergence
Reassortment is critical in shaping the genomic pattern of genetic variation under the effect of selective sweeps and background selection. The more frequent reassortment is, the smaller neutral genetic variation is on a segment under positive selection relative to other segments evolving in more neutral manner. We investigate what range of reassortment rate is compatible with the relative levels of neutral sequence diversity in HA vs. non-HA segments of H3N2. We first calculated synonymous sequence diversities (mean pairwise synonymous differences; π) at the HA, PB2, PB1, PA and NP segments of H3N2 viruses, using sequences sampled from 1997 to 2016 (Table 2). These values however cannot be simply compared with each other because, if mutation rates are not uniform over segments, it can also contribute to differences in neutral genetic diversity. Note that the RNA segments of influenza virus may replicate independently within a host and thus can accumulate mutations at different rates.
Inter-segmental heterogeneity in mutation rate can be detected from differences in synonymous sequence divergence over time. We observe that synonymous substitutions from 1973 to 2016 occur at constant rates within the respective segments (Figure 3), in remarkable agreement with a molecular clock, despite uncertainty regarding whether synonymous substitutions in influenza viruses are strictly neutral or whether the total number of replications per year is constant over different flu seasons or years. At the same time we note that the rates vary across segments. For a given pair of segments, the significance of the difference in synonymous divergence was evaluated by bootstrapping (Table 4). We find that divergences at the HA and PB1 segments, calculated either from four-fold synonymous sites only (Figure 3A) or from all synonymous sites according to the Nei-Gojobori method (Figure 3B), are significantly larger than those at other segments. One may question whether frequent nonsynonymous substitutions in the HA segment have an (unknown) effect of elevating the rate of synonymous substitutions at the same or nearby codons. To test this possibility we measured the synonymous divergences of HA1- and HA2-domain sequences separately. Unlike HA1, on which the epitope sites of hemagglutinin are located, the HA2 domain is mainly under stabilizing selection, similar to other non-HA genes (Bhatt et al. 2011). Synonymous divergence at HA2, obtained either from four-fold degenerate sites only or using the Nei-Gojobori method, is actually larger than that of HA1, although the difference is not significant in the bootstrap test (p = 0.16). Therefore, we may rule out the possibility that recurrent nonsynonymous substitutions at HA elevate the rates of synonymous mutations at the corresponding codons.
The level of neutral sequence diversity correcting for mutation rate heterogeneity, denoted as π*, is then obtained by dividing π at each segment by its synonymous divergence between 1997 and 2016. π* of HA is about half the level of other segments (Table 2). Synonymous diversity (π) of HA before correction is already lower than those of other segments but the difference becomes larger after the correction. Differences in π* among non-HA segments are small. This result confirms the concentration of positive selection causing selective sweeps on the HA segment.
We next examined which value of r best explains the ratio of π* on HA versus non-HA segments. In the simulations described above, we calculated neutral diversity on segments 1 and 2, π1 and π2, and obtained their ratio (π1/π2) (Figure 4). Since the mutation rate is constant in simulation, diversity needs no correction by divergence. We find that a reassortment rate close to (in models 2 and 3) or larger than (in model 4) 0.01 best explains the HA vs. non-HA ratio of π* for both s = 0.1 (Figure 4) and s = 0.05 (Figure S2). This result does not depend on the frequency of positive selection, which we vary to yield different values of k, the number of advantageous substitutions per year (Figure S3). Note that the estimate of r obtained above using correlation statistics (RFD, λ, and TBLD) is smaller than 0.01. Why a higher rate of reassortment in selective sweep simulations, particularly for model 4, is compatible with πHA/πnon-HA needs explanation (see Discussion).
To gain further insight on the above result and the dynamics of recurrent selective sweeps, we sought a simple analytic approximation to π1/π2 using the following heuristic argument. Consider model 2 in which positive selection occurs recurrently in segment 1, generating an equilibrium flux of beneficial alleles reaching fixation in a single constant-sized population. Discrete events of sweeps can be arranged in order, backward in time: let allele B1 be a beneficial allele that was fixed in the last sweep. (There can be multiple beneficial alleles at different sites being fixed together at a single episode of sweep due to temporal clustering of substitutions (Kim and Kim 2015). In that case, B1 represents the one that originated by most recent mutation.) The beneficial alleles fixed in the preceding rounds of sweeps are defined as B2, B3, and so on. The allele frequency of Bi is given by xi. Two randomly chosen copies of segment 1 have their most recent common ancestor at t1 generations back in time. We may assume that t1 is distributed with mean τ1 that is determined only by the rate of selective sweeps. Namely, coalescence due to genetic drift during time interval between successive sweeps is ignored. Then, tracing events backward in time from present, coalescence occurs as x1 approaches close to zero. Therefore t1 should be slightly smaller than waiting time until the time of B1’s entrance into the population. It is also possible that, at the time of sampling lineages, there is a currently sweeping beneficial allele that has not reached fixation. If both sampled lineages carry this sweeping allele, their coalescence should occur close to the time of this beneficial allele’s entrance. Here we simply define τ1 as the mean of t1 when all of such possibilities under the equilibrium flux of beneficial alleles are taken into account.
With complete linkage (r = 0), an identical backward-in-time process governs the coalescence of randomly sampled lineages in segment 2, and their mean coalescent time is τ1. However, with r > 0 two lineages may avoid coalescence by reassortment: in a given generation, each lineage can recombine away from the B1 allele with probability r(1 - x1). Given that r/s ≪ 1, where s is the strength of selection, the probability of such a lineage recombining back to B1 is very small and can thus be ignored. The opportunity for a lineage to recombine away increases as x1 remains longer at low values (but not so low as to force coalescence). Therefore, the length of the trajectory of x1 determines the probability of escaping coalescence. While x1 should increase from 1/N to 1 forward in time, stochastic effects make the trajectory much shorter than the deterministic one: the change of x1 is approximated by an instantaneous increase from 1/N to 1/(Nf), where f is the fixation probability of a copy of the beneficial allele (Maynard Smith 1971), followed by the deterministic increase expected for selective advantage s. Then, using the approximation obtained in (Barton 2000) and other studies, the probability of escaping coalescence in one round of sweep is given by where Ne is the effective population size under which sweeps occur (i.e., Ne1 of (Kim and Kim 2016)). Now, two lineages that have just escaped coalescence are subject to coalescence in the next (earlier) round of sweep by B2. Assuming that successive sweeps occur as a random Poisson process, the waiting time until x2 becomes small enough to force coalescence is again τ1. Then, if the lineages coalesce in the nth round of sweep, it takes on average nτ1 generations. Therefore, the mean coalescent time for segment 2 is
The level of sequence diversity on segment 1 relative to that on segment 2 is therefore
This approximation shows that π1/π2 does not depend on the rate of recurrent positive selection (k) but on the strength of selection, in agreement with our simulation (Figure S3).
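The geometric-sum step in the heuristic argument above can be sketched as follows. Here $P_{\mathrm{esc}}$ and $\tau_2$ are notation introduced only for this sketch (not necessarily the authors' symbols): $P_{\mathrm{esc}}$ is the probability that a pair of segment-2 lineages escapes coalescence in one round of sweep, and $\tau_2$ is the mean coalescent time on segment 2. The explicit form of $P_{\mathrm{esc}}$ is left unspecified, and successive rounds are assumed to be independent with the same escape probability.

```latex
\begin{align*}
\tau_2 &= \sum_{n=1}^{\infty} n\,\tau_1\,P_{\mathrm{esc}}^{\,n-1}\,\bigl(1 - P_{\mathrm{esc}}\bigr)
        = \frac{\tau_1}{1 - P_{\mathrm{esc}}}, \\[4pt]
\frac{\pi_1}{\pi_2} &\approx \frac{\tau_1}{\tau_2} = 1 - P_{\mathrm{esc}} .
\end{align*}
```

Under these assumptions τ1 cancels in the ratio, so the rate of sweeps drops out and only the quantities entering $P_{\mathrm{esc}}$ (such as r, s, and f) remain.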
We compared the simulation results of model 2 with Eq. (5), in which f is replaced by either 2s, the usual approximation under infrequent selective sweeps, or the mean fixation probability observed in simulation. The latter is 0.0269 for s = 0.1 and 0.0253 for s = 0.05. Actual fixation probabilities are therefore much smaller than 2s, indicating that strong clonal interference occurs in our simulated populations (i.e., under parameters constrained to yield both π1 ≈ 0.027 per site and k ≈ 1.3 per year). Figure 5 shows that π1/π2 predicted by Eq. (5), using either choice of f, is much smaller than that observed in simulation. Namely, lineages on segment 2 in simulation do not escape coalescence as frequently as predicted under the assumptions of Eq. (5), so that π2 is not much larger than π1. It might be suggested that, in addition to the initial acceleration of xi by a factor of 1/f, xi would increase much faster than expected with selection coefficient s under clonal interference, because successful beneficial mutations reaching fixation tend to form temporal clusters, i.e., they are in positive linkage disequilibrium with each other (Strelkowa and Lassig 2012; Kim and Kim 2015). However, when we estimated the “effective” selection coefficients of beneficial alleles by counting the generations that the sample frequency xi takes to increase from ~0.2 to ~0.8 for all trajectories in simulation with s = 0.1, the mean was 0.069. We therefore did not obtain evidence of a faster increase in xi due to clonal interference. It remains to be investigated what causes coalescence to occur faster, relative to recombination, than expected from Eq. (5).
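One way to turn the count of generations between frequencies of ~0.2 and ~0.8 into an “effective” selection coefficient is sketched below. The sketch assumes deterministic logistic growth in continuous time, under which log[x/(1 - x)] increases linearly at rate s; the exact estimator used in the simulations is not specified here, so this is only one plausible choice, and the function names `logit` and `effective_s` are hypothetical.

```python
import math


def logit(x):
    """Log-odds of an allele frequency strictly between 0 and 1."""
    return math.log(x / (1.0 - x))


def effective_s(trajectory, x_low=0.2, x_high=0.8):
    """Estimate s from one allele-frequency trajectory (one value per generation).

    Assumes logistic growth: logit(x) rises by s per generation, so
    s is the change in logit divided by the number of generations elapsed.
    """
    # First generation at which the frequency reaches x_low.
    start = next((i for i, x in enumerate(trajectory) if x >= x_low), None)
    if start is None:
        raise ValueError("trajectory never reaches x_low")
    # First later generation at which the frequency reaches x_high.
    end = next(
        (i for i in range(start + 1, len(trajectory)) if trajectory[i] >= x_high),
        None,
    )
    if end is None:
        raise ValueError("trajectory never reaches x_high after x_low")
    x_start, x_end = trajectory[start], trajectory[end]
    if not (0.0 < x_start < 1.0 and 0.0 < x_end < 1.0):
        raise ValueError("frequencies at the crossing points must lie in (0, 1)")
    return (logit(x_end) - logit(x_start)) / (end - start)
```

If the crossing frequencies were exactly 0.2 and 0.8, this would reduce to s ≈ ln(16)/T, where T is the number of generations counted. Using the observed frequencies at the crossing points, as done here, avoids a bias from the discreteness of generations.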
DISCUSSION
The population genetics of sexually reproducing organisms has demonstrated that recombination rate is as important as other fundamental evolutionary parameters, such as mutation rate, effective population size, and selection coefficient, for understanding their evolution and genetic diversity. Therefore, in order to build a correct model that predicts the direction of influenza viral evolution, the rate of reassortment relative to other parameters needs to be estimated. While reassortment occurs in all 28 pairs of influenza viral segments, this study focused on reassortment between the HA segment, the major target of strong positive selection, and one of four non-HA segments (PB2, PB1, PA, NP) that are generally considered to evolve neutrally (Bhatt et al. 2011), because such reassortment is expected to cause inter-segmental differences in genetic diversity and is therefore key to inferring positive selection in influenza virus. The NA (neuraminidase) segment was not included among non-HA segments because, with epistatic interactions detected between HA and NA genes (Neverov et al. 2014), particular reassortants may have higher or lower fitness relative to non-reassortants and can thus bias the estimate of reassortment rate. MP and NS segments, also known to undergo little positive selection, were not included because their synonymous sites are not likely to evolve neutrally due to overlapping protein-coding regions. We used multiple summary statistics of correlation or congruence between segmental sequence diversity to infer the range of reassortment rate in the H3N2 viral population. In general, it is suggested that the probability of reassortment per virus per viral infection cycle is between 0.001 and 0.01. Ideally, information from multiple summary statistics might be combined to yield a narrower range of estimates, for example using approximate Bayesian computation (ABC) (Beaumont et al. 2002).
However, our individual-based simulation was too slow for such an implementation. It might be possible in the future to develop a coalescent-based simulation that is fast enough for ABC. Since the relationships between the summary statistics and reassortment rate depend on the evolutionary model of the virus, such an approach will have to estimate reassortment rate jointly with other parameters of selection and demography.
Reassortment rate determines the hitchhiking effect of recurrent positive selection at the HA segment on neutral genetic variation at other segments (Figure 4). Our simulation results suggest that more frequent reassortment (r ≥ 0.01) than inferred above using correlation statistics (i.e., 0.001 ≤ r < 0.01) is needed to explain the ~40% lower synonymous diversity on HA relative to that on non-HA segments, particularly under complex demography (model 4). Assuming that our earlier inference of r < 0.01 is correct, this discrepancy would indicate that, for a given r, neutral lineages on non-HA segments escape hitchhiking (i.e., avoid coalescence forced by a sweep) more frequently than expected under the simulation models. This might be because our simulation models use reassortment rates that are constant in the course of selective and demographic dynamics (even in model 4). Namely, they assume that hosts experience a constant rate of coinfection through time. This is not likely to be true in the H3N2 population, in which coinfection must be more frequent during the seasonal peaks of population size (the number of hosts infected). Then, if a new immunity-evading adaptive allele is more likely to arise during peaks of influenza epidemics, as expected from the principle that mutational input is proportional to population size and that the fixation probability of an adaptive allele is larger during periods of population size expansion (Otto and Whitlock 1997), this adaptive allele may be transmitted through coinfected hosts more frequently, thus participating in more reassortment, than non-adaptive alleles. Therefore, because the hitchhiking effect is mostly determined during the early phase when the adaptive variant is still at low frequency (Maynard Smith and Haigh 1974), the observed ratio of HA to non-HA synonymous diversity can be explained by an effectively higher reassortment rate experienced by antigenic variants on the HA segment.
Further investigation of adaptive evolution and inter-segmental diversity in influenza virus will require a theoretical/simulation model that allows realistic seasonal influenza dynamics and the associated changes in coinfection/reassortment rate.
The wider discrepancy between data and the model of positive selection under metapopulation structure (model 4 in Figure 4) demands further theoretical explanation. First, it is known that the spatial structure of a population slows down the spread of a beneficial allele across demes, thus weakening the hitchhiking effect as there are more opportunities for neutral lineages to recombine away from the beneficial allele (Kim and Maruki 2011; Barton et al. 2013). This contradicts our result: if the hitchhiking effect is weaker, π1/π2 should become smaller for a given r. However, the demographic models assumed in those studies are quite different from the one used here. In our model 4, seven out of eight demes undergo extinction-recolonization cycles. While a beneficial allele is increasing in frequency in the total population, “empty” demes are more likely to be colonized by viruses carrying this allele than by those carrying the ancestral allele. (Note that our model assigns absolute fitness to haploid individuals so that a local population can be established even from a single immigrant (Kim and Kim 2016).) Because few or no individuals carrying the non-beneficial allele exist where those carrying the beneficial allele increase exponentially, neutral lineages on segment 2 can hardly escape coalescence, thus resulting in a stronger reduction in polymorphism. This stronger hitchhiking effect during the establishment of a new local population was demonstrated in the model of “Genotype-Dependent Colonization and Introgression (GDCI)” in Kim and Gulisija (2010). Unless r is much larger than 0.01 (or the joint effect of selection and coinfection dynamics in increasing the effective recombination rate, considered above, is very dramatic), the overestimation of π1/π2 by model 4 may suggest that selective sweeps in the actual population of H3N2 do not occur predominantly through the GDCI process.
While the transmission of influenza virus in most regions of the northern and southern hemispheres is seasonal, continuous year-round transmission occurs in certain tropical or subtropical regions (Viboud et al. 2006). Selective sweeps in such continuous viral populations would not involve the GDCI process. Therefore, if global influenza genetic diversity is mainly shaped by variants arising from the permanent tropical populations (Rambaut et al. 2008; Chan et al. 2010), the overall effects of selective sweeps might be closer to those in our models 2 and 3. In our simulation of model 4, one out of eight demes is maintained at a constant size. Its small size, however, might have limited its contribution to the diversity of the total population.
While not initially a major focus of this study, the significant inter-segmental heterogeneity in the rate of synonymous substitutions, which indicates that new mutations occur at different rates in different segments, is an unexpected finding. Negative-sense viral RNA strands replicate via positive-sense RNA strands. Then, if segments are transcribed at different rates, for example due to different demands for or turnover rates of viral proteins, some segments may experience more negative-positive-negative replication cycles than others before being assembled into viral particles. Such a difference would translate into different mutation rates given a fixed rate of RNA replication errors per cycle. We may also speculate that replication error is influenced by the secondary structures of RNA strands, which probably differ among segments.
ACKNOWLEDGEMENT
This research was supported by the National Research Foundation of Korea grant 2012R1A1A2004932 to YK.