ABSTRACT
Clinical influenza A isolates are rarely sequenced directly. Instead, a majority of these isolates (~70% in 2015) are first subjected to serial passaging for amplification, most commonly in non-human cell culture. Here, we find that this passaging leaves distinct signals of adaptation in the viral sequences, and that it confounds evolutionary analyses of these sequences. We find distinct patterns of adaptation to generic (MDCK) and monkey cell culture. These patterns also dominate pooled data sets not separated by passaging type. By contrast, MDCK-SIAT1 passaged sequences seem mostly (but not entirely) free of passaging adaptations. Contrary to previous studies, we find that using only internal branches of the influenza phylogenetic trees is insufficient to correct for passaging artifacts. These artifacts can only be safely avoided by excluding passaged sequences entirely from subsequent analysis. We conclude that all future influenza evolutionary analyses must appropriately control for potentially confounding effects of passaging adaptations.
INTRODUCTION
The routine sequencing of clinical isolates has become a critical component of global seasonal influenza surveillance (World Health Organization Global influenza surveillance network, 2011). Analysis of these viral sequences informs the selection of future vaccine strains (Stöhr et al., 2012; WHO Writing Group et al., 2012), and a wide variety of computational methods have been developed to identify sites under selection or immune-escape mutations (Blackburne et al., 2008; Koelle et al., 2006; Nelson et al., 2006; Suzuki, 2008; Wolf et al., 2006), or to predict the short-term evolutionary future of influenza virus (Łuksza and Lässig, 2014; Neher et al., 2014). However, sites that appear positively selected in sequence analysis frequently do not agree with sites identified experimentally in hemagglutination inhibition assays (Meyer and Wilke, 2015; Tusche et al., 2012), and the origin of this discrepancy is unclear. Here, we argue that a major cause of this discrepancy is widespread serial passaging of influenza virus before sequencing.
Clinical isolates are generally passaged in culture to amplify viral copy number, as well as to introduce virus into a living system for testing strain features such as vaccine response, antiviral response, and replication efficiency (Kumar and Henrickson, 2012; World Health Organization Global influenza surveillance network, 2011). A variety of culture systems are used for virus amplification. Cell cultures derived from Madin-Darby canine kidney (MDCK) cells are by far the most widely used system, with the majority of sequences in influenza repositories deriving from virus that has been passaged through an MDCK cell culture (Balish et al., 2005; Bogner et al., 2006). Influenza virus may also be passaged through monkey kidney (RhMK or TMK) cell culture or injected directly into egg amniotes. Alternatively, complete influenza genomes can be obtained from PCR-amplified influenza samples without intermediate passaging (Katz et al., 1990; Lee et al., 2013a).
Several experiments have demonstrated that influenza hemagglutinin (HA) accumulates mutations following rounds of serial passaging in both cell (Ilyushina et al., 2012; Lee et al., 2013b; Wyde et al., 1977) and egg culture (Robertson et al., 1993). The decreased number of mutations in MDCK-based cell culture is the main argument for use of this system over egg amniotes in vaccine production (Katz and Webster, 1989), with MDCK cells expressing human SIAT1 having the highest fidelity to the original sequence and reduced host adaptation (Hamamoto et al., 2013). Viral adaptations to eggs have recently been linked to reduced vaccine efficacy (Skowronski et al., 2014; Xie et al., 2015) and were implicated as potentially contributing to reduced efficacy of 2014-2015 seasonal H3N2 influenza vaccination in the World Health Organization’s recommendations for 2015-2016 vaccine strains (The World Health Organization, 2015). As the majority of influenza vaccines worldwide are produced in eggs, vaccine strain selection is limited to virus with the ability to replicate rapidly in this system (World Health Organization Global influenza surveillance network, 2011).
Although egg-passaged sequences are increasingly excluded from influenza phylogenetic analysis (see e.g. the NextFlu tracker (Neher and Bedford, 2015)), due to the known high host-specific substitution rates, cell culture is generally not thought to be sufficiently selective to produce a discernible evolutionary signal. One of the few existing evolutionary analyses of passaging effects on influenza (Bush et al., 2000) demonstrated that passaging causes no major changes in clade structure between egg and cell passaged viruses. Moreover, several studies have recommended the use of internal branches in the phylogenetic tree to reduce passaging effects in evolutionary analysis of Influenza A (Bush et al., 2001; Suzuki, 2006). Another study discovered egg culture to be the cause of misidentification of several sites under positive selection in Influenza B (Gatherer, 2010), but this study was limited to comparing egg-cultured to cell-cultured virus. As the availability of unpassaged influenza sequences has dramatically increased over the past ten years, we can now perform a direct comparison of passaged to circulating virus.
Here, we compare patterns of adaptation in North American seasonal H3N2 influenza HA sequences derived from passaged and unpassaged virus. We divide viral sequences by their passaging history, distinguishing between unpassaged clinical samples, egg amniotes, RhMK (monkey) cell culture, and generic/MDCK-based cell culture. For the latter, we also distinguish between virus passaged in MDCK-SIAT1 cell culture (SIAT1) and in unmodified MDCK or unspecified cell culture (non-SIAT1). We find clear signals of adaptation to the various passaging conditions. These signals are strongly present in the tip branches of the phylogenetic trees but can also be detected in internal branches. Finally, we demonstrate that the identification of antigenic escape sites from sequence data has been confounded by passaging adaptations, and that the exclusion of passaged sequences allows us to use sequence and structural data to highlight regions involved in antigenic escape.
RESULTS
Most influenza-virus samples collected from patients are first serially passaged through one or more culturing systems, prior to PCR amplification and sequencing (Figure 1A). Reconstructed trees of influenza evolution contain a mixture of passage histories at their tips (Figure 1B). During serial passaging, influenza genomes accumulate adaptive mutations, and the effect of these mutations on evolutionary analyses of influenza sequences is not well understood.
Sitewise evolutionary rate patterns differ between passage groups
To quantify any evolutionary signal that may be introduced by passaging, we assembled, from the GISAID database (Bogner et al., 2006), a set of North American human influenza H3N2 hemagglutinin sequences collected between 2005 and 2015. We initially sorted these sequences into groups by their passage history: (1) unpassaged, (2) egg-passaged, (3) generic cell-passaged, and (4) monkey cell-passaged (Table 1). To assess evolutionary variation at individual sites, we calculated site-specific dN/dS (Echave et al., 2016), using Single Likelihood Ancestor Counting (SLAC). Specifically, we calculated one-rate dN/dS estimates, i.e., site-specific dN values normalized by a global dS value (see Methods for details). In addition to considering groups of sequences with specific passage histories, we also calculated dN/dS values by pooling all sequences into one combined analysis. This pooled group corresponds to a typical influenza evolutionary analysis in which passage history has not been accounted for.
We first correlated the sitewise dN/dS values we obtained for virus sequences derived from different passage histories. If passage history did not matter, then the dN/dS values obtained from different sources should correlate strongly with each other, with r approaching 1. Instead, we found that correlation coefficients ranged from 0.68 to 0.88, depending on which specific comparison we made (Figure 2A). (In this analysis, and throughout this work, we down-sampled alignments to the smallest number of sequences available for any of the conditions compared, to keep the samples as comparable as possible overall. The analysis of Figure 2 used n = 917 randomly drawn sequences for each condition.) Unpassaged dN/dS correlated more strongly with cell and pooled dN/dS (correlations of 0.77 and 0.79, respectively) than with monkey-cell dN/dS (0.68). Note that the dN/dS values from the pooled group, which corresponds to a typical data set used in a phylogenetic analysis of influenza, more closely correlated with the dN/dS values from the generic cell group (r = 0.87) than from the unpassaged group (r = 0.79). Egg-derived sequences were excluded from this analysis due to low sequence numbers (n = 79); however, evolutionary rates from this condition correlated particularly poorly with those of random draws of 79 unpassaged sequences (Supplementary Figure 1). This result is consistent with the conclusions of (Bush et al., 2000), (Suzuki, 2006), and (Gatherer, 2010) that egg-derived sequences show specific adaptations not found otherwise in influenza sequences.
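The down-sampling and correlation steps described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names downsample and pearson_r, the dictionary layout, and the fixed random seed are all assumptions, and sitewise dN/dS values are assumed to be available already as equal-length lists.

```python
import math
import random


def downsample(groups, seed=0):
    """Draw the same number of sequences at random from every passage group.

    `groups` maps a passage label to a list of sequence identifiers.
    The draw size is the size of the smallest group (an assumption that
    mirrors the down-sampling rule stated in the text).
    """
    rng = random.Random(seed)
    n = min(len(seqs) for seqs in groups.values())
    return {label: rng.sample(seqs, n) for label, seqs in groups.items()}


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

In this sketch, pearson_r would be applied to the sitewise dN/dS lists of two passage groups after each group had been reduced by downsample and run through the rate estimation.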
Because the common ancestor of any two passaged influenza viruses is a virus that replicated in humans, we would expect that any adaptations introduced during passaging should not extend into the internal branches of a reconstructed tree. Therefore, we additionally subdivided phylogenetic trees into internal branches and tip branches, and calculated site-specific dN/dS values separately for these two sets of branches. In fact, (Bush et al., 2000) had recommended the use of internal branches to reduce variation seen between egg and non-egg passaged virus. As expected, we found that when dN/dS calculations were restricted to the internal branches, the correlations between the passage groups overall increased (Figure 2B), even though distinct differences between the passage groups remained. Conversely, when only considering tip branches, correlations among most groups were relatively low (Figure 2C), with the exception of cell-passaged sequences compared to the pooled sequences. This finding emphasizes once again that the pooled sample is most similar to the cell-passaged sample. We conclude that different passaging histories leave distinct evolutionary signatures of adaptation to the passaging environment.
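The subdivision of a tree into tip and internal branches can be illustrated with a short sketch. The representation used here (a dictionary mapping each non-root node to its parent, with each branch identified by its child node) and the name split_branches are assumptions made for illustration; they do not describe how the tree was handled in the actual analysis.

```python
def split_branches(parent_of):
    """Partition the branches of a rooted tree into tip and internal sets.

    `parent_of` maps every non-root node to its parent node. A branch is
    identified by the node at its lower (child) end. Tip branches lead to
    leaves, i.e. to nodes that are never anyone's parent.
    """
    parents = set(parent_of.values())
    tips = {child for child in parent_of if child not in parents}
    internal = set(parent_of) - tips
    return tips, internal
```

With this split, sitewise substitution counts could be tallied separately over the two branch sets before computing dN/dS for each.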
To further investigate the apparent discrepancies between dN/dS derived from unpassaged sequences, monkey-cell passaged sequences, cell-passaged sequences, and the pooled set, we compared the magnitude of the site-wise rates (Figure 2D). Cell-passaged and pooled sequences had, on average, significantly inflated dN/dS values compared to unpassaged and monkey-cell-passaged sequences in the full phylogenetic tree (paired t test, P = 1.5 × 10⁻⁵ and P = 9.1 × 10⁻⁵, respectively) and along tip branches (paired t test, P = 1.8 × 10⁻⁶ and P = 6.3 × 10⁻⁵, respectively). By contrast, there were no significant differences between cell-passaged and pooled sequences in all three cases (paired t test, P = 0.26, P = 0.24, and P = 0.26, respectively, for the full tree, internal branches, and tip branches). dN/dS values were generally more similar along internal branches; however, a significant difference of dN/dS from cell-passaged and pooled sequences relative to monkey-cell-passaged sequences remained. These results demonstrate that both cell-passaged and pooled sequences show artificially inflated dN/dS values compared to unpassaged sequences.
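A paired t test on sitewise rates treats each site as one pair of observations, one from each passage group. The sketch below computes only the test statistic and its degrees of freedom using the standard library; the function name paired_t is hypothetical, and the software used for the tests reported here is not specified in this section.

```python
import math


def paired_t(x, y):
    """Paired t statistic and degrees of freedom for two matched samples.

    `x` and `y` hold the dN/dS values of the same sites in two passage
    groups, in the same site order.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Sample variance of the per-site differences (n - 1 denominator).
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    t_stat = mean_d / math.sqrt(var_d / n)
    return t_stat, n - 1
```

Converting the statistic into a P value requires the t distribution with n - 1 degrees of freedom, which a statistics package such as SciPy (scipy.stats.ttest_rel) provides directly.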
In aggregate, these results show that while both generic-cell-passaged sequences and monkey-cell-passaged sequences yield different sitewise dN/dS patterns relative to unpassaged sequences (Fig. 2A–C), cell passaging additionally creates inflated dN/dS values (Fig. 2D), indicating positive adaptation to the passaging condition. At the same time, dN/dS values derived from monkey-cell-passaged sequences are the least similar to dN/dS from unpassaged sequences (Fig. 2A–C). The pooled group of sequences, which corresponds to a typical data set used in evolutionary analyses of influenza virus, describes evolutionary rates of specifically cell passaged virus and poorly matches evolutionary rates of circulating influenza virus.
Adaptations to cell and monkey-cell passage display characteristic patterns of site variation
We next asked whether adaptations to passage history were located in specific regions of the HA protein. To address this question, we employed the geometric model of HA evolution we recently introduced (Meyer and Wilke, 2015). For H3N2 HA, this model explains over 30% of the variation in dN/dS using two simple physical measures, the relative solvent accessibility (RSA) of individual residues in the structure (Tien et al., 2013) and the inverse linear distance in 3D space from each residue to protein site 224 in the hemagglutinin monomer. Notably, the geometric model was previously applied to a pooled sequence set including sequences of various passaging histories. To what extent it carries over to sequences with specific passaging histories is not known.
We first considered the correlation between dN/dS and RSA (Figure 3A). We found that for all passage groups, R² values ranged from 0.10 to 0.16 in the full tree, consistent with our earlier work (Meyer and Wilke, 2015). The high congruence among R² values for internal branches and all branches suggests that RSA imposes a pervasive selection pressure on HA, independent of passaging adaptations. Thus, RSA represents a useful structural measure of a persistent effect on dN/dS, with stronger correlations in the full tree and internal branches than in tip branches.
Next we considered the correlation between dN/dS and the inverse distance to site 224 (Figure 3B). In contrast to RSA, correlations here were systematically higher in tip branches, suggesting a recent adaptive signal. We found virtually no correlation for unpassaged sequences, while a low correlation existed for monkey-cell cultured sequences and a higher correlation for cell-passaged and pooled sequences. Correlations from pooled sequences mirrored cell culture correlations and persisted through internal branches. Thus, the correlation of dN/dS with the inverse distance to site 224 seems to be primarily an artifact of cell passage, even though its effect can be seen along internal branches as well. As the majority of the available HA sequences are cell-derived, this cell-specific signal dominates the pooled data set. Further, this cell-specific signal is partially attenuated along internal branches and amplified along tip branches, as we would expect from a signal caused by recent host-specific adaptation. Even though this signal is a true predictor of influenza evolutionary rates for virus grown in cell culture, it does not transfer to unpassaged sequences and therefore has no relevance for the circulating virus. This finding serves as a strong demonstration of passage history as a confounder in evolutionary analysis of hemagglutinin evolution, not just for egg passage as previously demonstrated, but also for cell and monkey-cell passage.
Surprisingly, the correlation we found here between dN/dS and inverse distance to site 224 for pooled sequences (R² = 0.067) was less than half of the value reported by (Meyer and Wilke, 2015) (Fig. 3B). However, using a dataset of sequences more temporally matched to that paper’s dataset (2005–2014 instead of 2005–2015), we recovered the previously seen higher correlation. This finding suggests that there is some feature in the additional 2015 sequences that changes the pooled dataset’s relationship with inverse distance to site 224. In 2015, unpassaged and SIAT1 sequences each doubled in number compared to 2014, while the number of non-SIAT1 cell cultured sequences dropped dramatically (Table 2). Therefore, we next investigated whether the drop in correlation from 2014 to 2015 could be attributed to the recent reduction in cell culture using non-SIAT1 cells.
There is little signal of adaptation to passage in SIAT1 cells
In the preceding analyses, we lumped all cell cultures except monkey cells into the same category. However, there are more subtle distinctions in cell passaging systems, and they can exert differential selective pressures on human adapted virus (Hamamoto et al., 2013; Oh et al., 2008). As our generic cell culture group was composed of a mixture of wild type MDCK, SIAT1, and unspecified cell cultures, we next investigated whether any one culture type was the source of the high cell-culture signal in Figure 3B.
The SIAT1 cell system, which overexpresses human-like 6-linked sialic acids over native 3-linked sialic acids (Matrosovich et al., 2003), is currently the dominant system for serial passaging of influenza virus. Approximately half of the 2015 influenza sequences currently available from GISAID derive from serial passaging through SIAT1 cells. Experimental analysis of SIAT1 demonstrates improved sequence fidelity and reduced positive selection over unmodified MDCK cell culture (Hamamoto et al., 2013; Oh et al., 2008). We sought to determine if the apparently cell-culture-specific correlation of site-wise evolutionary rates and inverse distance to site 224 extended to SIAT1 cell culture. To compare cell-culture varieties, we created sample-size matched groups of non-SIAT1 cell culture, SIAT1 cell culture, and unpassaged sequences collected between 2005 and 2015 (n = 1046), excluding sequences that had been passaged through both a non-SIAT1 and a SIAT1 cell culture.
All groups showed similar correlations between dN/dS and RSA, regardless of whether dN/dS was calculated for the entire tree, for internal branches only, or for tip branches only (Figure 4A). By contrast, inverse distance to site 224 uniquely correlated with dN/dS from non-SIAT1-cultured virus (Figure 4B). This effect was the strongest along tip branches (R² = 0.139), but it was almost as strong along the entire tree (R² = 0.129). The correlation was reduced, though still significant, among internal branches (R² = 0.075). Thus, we conclude that the correlation between dN/dS and the inverse distance to site 224 (Meyer and Wilke, 2015) represents a unique signal of adaptation to passaging in non-SIAT1 cells. In our previous analysis (Meyer and Wilke, 2015), a non-SIAT1-specific signal completely dominated our evolutionary rate models, due to use of a standard, pooled data set mainly composed of sequences passaged in non-SIAT1 cells. In our new analysis (Figure 3B), the high correlation of non-SIAT1 cell dN/dS with inverse distance to site 224 is suppressed in the pooled condition, because the number of unpassaged and SIAT1-passaged sequences grew substantially in 2015. This difference in sample composition explains the lower than expected correlations in Figure 3B for pooled dN/dS.
When considering all branches in the phylogenetic tree, we found that dN/dS values were significantly inflated in sequences passaged in non-SIAT1 cells compared to both unpassaged and SIAT1-passaged sequences (paired t test, P = 5.05 × 10⁻⁶ and P = 6.94 × 10⁻⁸, respectively, Figure 4C), whereas unpassaged and SIAT1-passaged sequences showed no significant increase (Figure 4C). Unpassaged and non-SIAT1-passaged sequences showed significant differences along internal branches (paired t test, P = 0.036) and tip branches as well (paired t test, P = 2.03 × 10⁻⁶, Figure 4C). Thus, virus amplified in non-SIAT1 cell culture measurably adapts to this non-human host, and these adaptations can significantly confound downstream evolutionary analyses.
As these three conditions are somewhat temporally separated (most non-SIAT1 cell culture sequences are pre-2015, and most unpassaged and SIAT1 culture sequences are post-2014), we controlled for season-to-season variation by drawing 249 sequences from each group from 2014. First, we again considered site-wise dN/dS correlations among passaging groups, and we found that overall, unpassaged and SIAT1-passaged sequences appeared the most similar (Supplementary Figure 2A–C). However, both SIAT1 and non-SIAT1 showed dN/dS values that were inflated over dN/dS in unpassaged sequences when considering the full tree (paired t test, P = 0.029 and P = 0.0005, respectively, Supplementary Figure 2D), although only non-SIAT1 dN/dS was significantly inflated in tip branches (paired t test, P = 0.0008, Supplementary Figure 2D). (No significant difference was seen along internal branches.) Notably, in this more controlled comparison of SIAT1 cell culture to unpassaged sequences from the same year, we observed a significant difference in dN/dS between these conditions, suggesting that at least minor passaging artifacts remain after SIAT1 passaging.
Evolutionary variation in sequences from unpassaged virus predicts regions involved in antigenic escape
The preceding results might suggest that the inverse distance metric we previously proposed (Meyer and Wilke, 2015) only captures effects of adaptation to non-SIAT1 cell culture. However, this is not necessarily the case. Importantly, inverse distance needs to be calculated relative to a specific reference point. We previously used site 224 as the reference point because it yielded the highest correlation for the data set we analyzed then. For a different data set, one that doesn’t carry the signal of adaptation to non-SIAT1 cell culture, a different reference point may be more appropriate.
We thus repeated the analysis of (Meyer and Wilke, 2015) for a size matched sample of 1703 sequences from both non-SIAT1 cell passaged and unpassaged virus collected between 2005 and 2015 (Figure 5). In brief, for each possible reference site in the hemagglutinin structure, we measured the inverse distance in 3D space from that site to every other site in the structure. We then correlated the inverse distances with the dN/dS values at each site, resulting in one correlation coefficient per reference site. Finally, we mapped these correlation coefficients onto the HA structure, coloring each reference site by its associated correlation coefficient. If inverse distances measured from a particular reference amino acid have higher correlation with the sitewise dN/dS values, then this reference site will appear highlighted on the structure.
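The reference-site scan described above can be sketched in a few lines. This is an illustrative outline under stated assumptions rather than the original implementation: the function names are hypothetical, each site is assumed to be represented by a single 3D coordinate (which atom or centroid to use is not specified here), and the reference site itself is skipped because its distance to itself is zero.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def reference_site_scan(coords, dnds):
    """Correlate dN/dS with inverse distance for every possible reference site.

    `coords` maps a site number to an (x, y, z) tuple and `dnds` maps the
    same site numbers to dN/dS values. Returns one correlation coefficient
    per reference site.
    """
    sites = sorted(coords)
    result = {}
    for ref in sites:
        inverse_distances = []
        rates = []
        for site in sites:
            if site == ref:
                continue  # inverse distance to itself is undefined
            inverse_distances.append(1.0 / math.dist(coords[ref], coords[site]))
            rates.append(dnds[site])
        result[ref] = pearson_r(inverse_distances, rates)
    return result
```

The returned dictionary holds the values that would be mapped onto the structure as colors, with the highest-scoring reference sites appearing highlighted. Two distinct sites with identical coordinates would cause a division by zero in this sketch.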
For non-SIAT1-passaged virus, this analysis recovered the finding of (Meyer and Wilke, 2015) that the loop containing site 224 appeared strongly highlighted (Figure 5A). However, this signal was entirely absent in unpassaged virus (Figure 5B), with no sites in that loop working well as a reference point. These results suggest that this loop is specifically involved in adaptation of hemagglutinin to non-SIAT1 cell culture, explaining the non-SIAT1-specific signal shown in Figure 4B. Thus, the inverse distance metric is useful for differentiating regions of selection particular to different experimental groups. (Meyer and Wilke, 2015) had concluded that sites under positive selection differed from sites involved in immune escape. Here, we have found that the origin of this positive selection is adaptation to the non-human passaging host, not immune escape in or adaptation to humans. Therefore, we next asked what residual patterns of positive selection remained once the adaptation to non-SIAT1 cells was removed. Even though site-wise correlations are relatively low for unpassaged virus compared to the ones observed for non-SIAT1-passaged virus, we could still recover relevant patterns of HA adaptation after rescaling our coloring. In particular, we found that sites opposite to the loop containing site 224 lit up in our analysis of unpassaged sequences (Figure 6A). Sites in this region are known to be involved in antigenic escape. In fact, many of the highlighted regions contain experimentally determined antigenic sites (Koel et al., 2013) and/or the sites determined to be responsible for the antigenic drift in the 2014/2015 seasonal flu (Chambers et al., 2015) (Table 2). We found a similar pattern of concordance with antigenic sites when mapping dN/dS values directly onto the structure (Figure 6B). The inverse-distance correlations, however, performed better at identifying antigenic sites than did raw dN/dS values. When considering the 90th percentile (top 10% highest scored sites) by either metric, the inverse-distance correlations recovered 7 of 8 sites while dN/dS alone recovered only 2 of 8 sites (Table 2).
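The top-decile comparison can be expressed as a simple ranking. The sketch below is an assumption-laden illustration: the function names are hypothetical, the cutoff is taken as the top 10% of sites by rank (rounded up), and ties at the cutoff are broken by sort order, none of which is specified in the text.

```python
import math


def top_sites(scores, top_percent=10):
    """Return the set of sites whose scores fall in the top `top_percent`.

    `scores` maps a site number to its score (for example a sitewise
    dN/dS value or an inverse-distance correlation coefficient).
    """
    k = max(1, math.ceil(len(scores) * top_percent / 100))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def recovered_sites(scores, known_sites, top_percent=10):
    """List the known sites that appear among the top-scoring sites."""
    return sorted(top_sites(scores, top_percent) & set(known_sites))
```

Running recovered_sites once with inverse-distance correlations and once with raw dN/dS values, against the same list of known antigenic sites, would give the two counts being compared.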
DISCUSSION
We have found that serial passaging of influenza virus introduces a measurable signal of adaptation into the evolutionary analysis of natural influenza sequences. There are unique, characteristic patterns of adaptation to egg passage, monkey cell passage, and non-SIAT1 cell passage. Monkey cell-derived sequences show different molecule-wide evolutionary rate patterns, even though they show no dN/dS inflation when compared with unpassaged sequences. Non-SIAT1 cell-derived sequences instead display both dN/dS inflation and a hotspot of positive selection in a loop underneath the sialic-acid binding region. This hotspot has been previously noted (Meyer and Wilke, 2015) but no explanation for its origin was available. Further, we have found that virus passaged in SIAT1 cells seems to accumulate only minor passaging artifacts. Throughout our analyses, we have found limited utility to subdividing phylogenetic trees into internal and terminal branches. While signals of passage adaptation are consistently elevated along terminal branches and attenuated along internal branches, evolutionary rates along internal branches remain confounded by passaging artifacts. Finally, we could accurately recover the experimentally determined antigenic regions of hemagglutinin from evolutionary-rate analysis by using a data set consisting of only unpassaged viral sequences.
Previous studies (Bush et al., 2001; Suzuki, 2006) have suggested the use of internal branches to alleviate passage adaptations. However, we have found here that this strategy is insufficient, because the evolutionary signal of passage adaptations can often be detected along internal branches. This finding seems counterintuitive, as internal nodes should exclusively represent human-adapted virus. We suggest that passaging adaptations in internal branches may be caused by convergent evolution; if different clinical isolates converge onto the same adaptive mutations under passaging, then these mutations may incorrectly be placed along internal branches under phylogenetic tree reconstruction. Additionally, although the use of only internal branches removes some differences between the passage groups, the exclusion of terminal sequences can obscure recent natural adaptations and thus obscure actual sites under positive selection. Therefore, analysis of internal branches is not only insufficient for eliminating artifacts from passaging adaptations but also suboptimal for detecting positive selection in seasonal H3N2 influenza.
The safest route to avoid passaging artifacts is to limit sequence data sets to only unpassaged virus, although this approach limits sequence numbers. The human-like 6-linked sialic acids in SIAT1 (Matrosovich et al., 2003) greatly reduce observed cell culture-specific adaptations, particularly in the loop of hemagglutinin which contains site 224. This lack of selection concords with multiple experiments finding low levels of adaptation in this cell line (Hamamoto et al., 2013; Oh et al., 2008). As our analysis only detected minor differences between unpassaged and SIAT1 passaged virus, we posit that this passage condition is an acceptable substitute for unpassaged clinical samples. Even so, our findings do not preclude the existence of SIAT1-specific adaptations that may confound specific analyses.
Although the majority of the sequences from the year 2015 are SIAT1-passaged or unpassaged, several hundred sequences from that year derive from monkey cell culture. The use of monkey cell culture has surged in 2014 and 2015 compared to previous years. We recommend that these recently collected sequences be excluded from influenza rate analysis, in favor of the majority of unpassaged and SIAT1-passaged sequences. As passaging is a useful and cost effective method for amplification of clinically collected virus, unpassaged viral sequences are unlikely to completely dominate influenza sequence databases in the near future. However, new human epithelial cell culture systems for influenza passaging, as in (Ilyushina et al., 2012), could soon provide an ideal system that both amplifies virus and protects it from non-human selective pressures.
Passage history should routinely be considered as a potential confounding variable in future analyses of influenza evolutionary rates. Future studies should be checked against unpassaged samples to ensure that conclusions are not based on adaptation to non-human hosts. We recommend the exclusion of viral sequences which derive from serial passage in egg amniotes, monkey kidney cell culture, and any unspecified cell culture. Prior work that did not consider passaging history may likely have been confounded by passaging adaptations. In particular, we suggest that the evolutionary markers of influenza virus determined by (Belanov et al., 2015) be reevaluated to ensure these sites are not artifacts of viral passaging. Similarly, many of the earlier studies performing site-specific evolutionary analysis of HA, such as (Bush et al., 1999; Meyer and Wilke, 2015, 2013; Pan and Deem, 2011; Shih et al., 2007; Suzuki, 2008, 2006; Tusche et al., 2012), likely contain some conclusions that can be traced back to passaging artifacts. Additionally, even though passage artifacts do not appear to be sufficiently strong to affect clade-structure reconstruction (Bush et al., 2000), they do have the potential to cause artificially long branch lengths, due to dN/dS inflation, or misplaced branches, due to convergent evolution under passaging. Thus, future phylogenetic predictive models of influenza fitness and antigenicity, as in (Łuksza and Lässig, 2014), (Neher et al., 2014), and (Bedford et al., 2014), should too be checked for the presence of passage-related signals. Finally, while it is beyond the scope of this work to investigate passage history effects in other viruses, we suspect that passage-derived artifacts could be a factor in their phylogenetic analyses as well. The use of data sets free of passage adaptations will likely bring computational predictions of influenza positive selection more in line with corresponding experimental results.
Sequences without passage annotations are inadequate for reliable evolutionary analysis of influenza virus. Yet, passage annotations are often completely missing from strain information, and, when present, are often inconsistent; there is currently no standardized language to represent number and type of serial passage. We note, however, that passage annotations from the 2015 season are greatly improved when compared to previous seasons. Several major influenza repositories, including the Influenza Research Database (Squires et al., 2012) and the NCBI Influenza Virus Resource (Bao et al., 2008), do not provide any passaging annotations at all. Additionally, passage history is not required for new sequence submissions to the NCBI GenBank (Benson et al., 2012). The EpiFlu database maintained by the Global Initiative for Sharing Avian Influenza Data (GISAID) (Bogner et al., 2006) and OpenFluDB (Liechti et al., 2010), however, stand apart by providing passage history annotations for the majority of their sequences. Of these, only the OpenFluDB repository allows filtering of sequences by passage history during data download. Our results demonstrate the strength of passaging artifacts in evolutionary analysis of influenza. The lack of a universal standard for annotation of viral passage histories and of a universal standard for serial passage experimental conditions complicates the analysis and mitigation of passaging effects.
METHODS
Influenza sequence data
Non-laboratory strain H3N2 hemagglutinin (HA) sequences collected in North America were downloaded from the Global Initiative for Sharing Avian Influenza Data (GISAID) (Bogner et al., 2006) for the 1968–2015 influenza seasons. Incomplete HA sequences were excluded. Sequences were trimmed to open reading frames, filtered to remove redundancies, and aligned by translation-alignment-back-translation and MAFFT (Katoh and Standley, 2013). Sequence headers of FASTA files were standardized into an uppercase text format with non-alphanumeric characters replaced by underscores. As H3N2 strains have experienced no persistent insertion or deletion events, we deleted sequences that introduced gaps into the alignment. To ascertain overall data quality, we built a phylogenetic tree of the entire sequence set (using FastTree 2.0 (Price et al., 2010)) and checked for any abnormal clades or other unexpected tree features. We found one abnormal clade of approximately 20 sequences with an exceptionally long branch length (> 0.01) and removed the sequences in that clade from further analysis. Our final data set consisted of 6873 sequences from 2005–2015 as well as an outgroup of 45 sequences from 1968–1977 (not considered for further analysis). We did not consider sequences collected from 1978–2004.
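The header standardization and gap filtering described above can be illustrated with a minimal Python sketch. It is not the code used in this study. The function names `standardize_header` and `gap_introducing` are hypothetical, and the rule for deciding which sequences "introduce" a gap (the minority state in any gapped alignment column) is an assumption made for illustration.

```python
import re


def standardize_header(header: str) -> str:
    """Uppercase a FASTA header and replace each non-alphanumeric
    character with an underscore."""
    name = header.lstrip(">").strip().upper()
    return re.sub(r"[^A-Z0-9]", "_", name)


def gap_introducing(aligned: dict) -> set:
    """Return names of sequences that appear to introduce alignment gaps.

    ASSUMPTION: in every column that contains a gap, the sequences in the
    minority state are treated as the cause (a gap in a few sequences
    suggests a deletion; a residue in a few sequences suggests an insertion).
    """
    if not aligned:
        return set()
    names = list(aligned)
    length = len(aligned[names[0]])
    flagged = set()
    for col in range(length):
        gapped = [n for n in names if aligned[n][col] == "-"]
        if not gapped:
            continue  # column has no gaps, nothing to attribute
        ungapped = [n for n in names if aligned[n][col] != "-"]
        minority = gapped if len(gapped) <= len(ungapped) else ungapped
        flagged.update(minority)
    return flagged


# Toy alignment: C carries an insertion, D carries a deletion.
alignment = {
    "A": "ATG-CC",
    "B": "ATG-CC",
    "C": "ATGACC",
    "D": "AT--CC",
}
to_remove = gap_introducing(alignment)
kept = {name: seq for name, seq in alignment.items() if name not in to_remove}
```

After removing flagged sequences, all-gap columns would still need to be stripped from the remaining alignment; that step is omitted here.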
Identification of passage history and evolutionary-rate calculations
We divided sequences into groups by their passage history annotation and collection year, determining passage history by parsing FASTA headers with regular expressions for keywords (Table 1). We classified 1133 sequences with indeterminate or missing passage histories, or passage through multiple categories of hosts (i.e., both egg and cell), as “other”. The final data sets for individual passage groups contained between 79 and 3041 sequences (Table 1).
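A minimal Python sketch of keyword-based passage classification follows. The regular expressions actually used are those listed in Table 1; the patterns, group names, and the function `classify_passage` below are hypothetical stand-ins chosen only to show the approach, including the assumption that an MDCK-SIAT1 annotation counts as SIAT1 rather than as generic MDCK.

```python
import re

# HYPOTHETICAL keyword patterns for illustration only; see Table 1 for the
# expressions used in the study.
PATTERNS = {
    "unpassaged": re.compile(r"ORIGINAL|CLINICAL|DIRECT|UNPASSAGED"),
    "egg": re.compile(r"EGG|AMNIO|ALLANTOIC"),
    "monkey_cell": re.compile(r"RHMK|RMK|MONKEY"),
    "siat1": re.compile(r"SIAT"),
    "generic_cell": re.compile(r"MDCK"),
}


def classify_passage(annotation: str) -> str:
    """Assign a passage annotation to exactly one group, or to 'other'."""
    text = annotation.upper()
    hits = {group for group, pattern in PATTERNS.items() if pattern.search(text)}
    # ASSUMPTION: "MDCK-SIAT1" names a single cell line, so a SIAT match
    # overrides the generic MDCK match.
    if "siat1" in hits:
        hits.discard("generic_cell")
    if len(hits) == 1:
        return hits.pop()
    # Missing, indeterminate, or multi-category histories.
    return "other"


examples = ["Original specimen", "MDCK-SIAT1 2", "RhMK1", "EGG2_MDCK1", ""]
labels = [classify_passage(a) for a in examples]
```

Sending every annotation that matches zero or several groups to "other" keeps ambiguous histories out of the individual passage groups.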
We next constructed phylogenetic trees for each passage group as well as one tree for a pooled data set combining all individual passage groups and other sequences. All phylogenetic trees were constructed using FastTree 2.0 (Price et al., 2010). We calculated site-specific dN/dS values using a one-rate SLAC (Single-Likelihood Ancestor Counting) model implemented in HyPhy (Pond et al., 2005). One-rate models, which fit a site-specific dN and a global dS, yield more accurate estimates than two-rate models and hence are preferred (Spielman et al., 2015). Among different one-rate, site-specific models, SLAC performs nearly identically to other approaches, and it was chosen here due to its speed and the ease of extracting dN/dS estimates along internal and tip branches. To obtain branch-specific estimates, we extracted the dN/dS values calculated by the SLAC algorithm at internal and tip branches.
We chose sequences from 2005–2015 as our sample set due to the low number of available sequences prior to this period. As dN/dS estimates can be confounded by sample size (Spielman et al., 2015), we sought to limit this effect by down-sampling each experimental set to match the number of sequences in the smallest group being considered in a particular analysis (Table 1). To reduce season-to-season variation in the comparison of unpassaged, SIAT1, and non-SIAT1 cell culture, we performed one analysis with sequences from only 2014, which is the year that maximizes sequences available from all three conditions (n = 249 each).
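The down-sampling step can be sketched in Python as below. This is an illustration under stated assumptions, not the study's implementation: the function name `downsample_groups` is hypothetical, and uniform random sampling without replacement with a fixed seed is assumed because the text does not specify how the subsets were drawn.

```python
import random


def downsample_groups(groups: dict, seed: int = 0) -> dict:
    """Randomly reduce every group to the size of the smallest group.

    ASSUMPTION: sampling is uniform and without replacement; the seed is
    fixed only to make the sketch reproducible.
    """
    rng = random.Random(seed)
    target = min(len(seqs) for seqs in groups.values())
    return {name: rng.sample(seqs, target) for name, seqs in groups.items()}


# Toy groups of sequence identifiers with unequal sizes.
groups = {
    "unpassaged": [f"U{i}" for i in range(5)],
    "siat1": [f"S{i}" for i in range(3)],
    "generic_cell": [f"C{i}" for i in range(10)],
}
balanced = downsample_groups(groups)
```

Every group in `balanced` has the size of the smallest input group, so differences in dN/dS between groups are not driven by differences in sequence count.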
Geometric analysis of dN/dS distributions
For each site i in HA, we computed the correlation of dN/dS at every site j ≠ i with the inverse Euclidean distance between j and i in the 3D crystal structure of the protein. This method is discussed in detail in (Meyer and Wilke, 2015). This correlation is then color-mapped onto the reference site. Sites spatially closest to positively selected regions in the protein have the highest correlation in this analysis. Thus, this approach allows us to visualize regions of increased positive selection. We processed the HA PDB structure as discussed in (Meyer and Wilke, 2015), and we provide a renumbered and formatted H3N2 structure derived from PDB ID 2YP7 (Lin et al., 2012) with our data analysis code (see below).
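The per-site calculation can be sketched in plain Python as follows. This is a minimal illustration, not the analysis code: the function names are hypothetical, each site is assumed to be represented by a single 3D coordinate (for example, an alpha-carbon position), all coordinates are assumed distinct, and Pearson correlation is assumed as the correlation measure.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def inverse_distance_correlations(coords, dnds):
    """For each reference site i, correlate dN/dS at all sites j != i with
    the inverse Euclidean distance between sites j and i."""
    correlations = []
    for i, ref in enumerate(coords):
        inv_dist, rates = [], []
        for j, other in enumerate(coords):
            if j == i:
                continue  # the reference site itself is excluded
            inv_dist.append(1.0 / math.dist(ref, other))
            rates.append(dnds[j])
        correlations.append(pearson(rates, inv_dist))
    return correlations


# Toy example: four sites on a line, with dN/dS highest at the first site.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
dnds = [2.0, 0.5, 0.2, 0.1]
corr = inverse_distance_correlations(coords, dnds)
```

In the toy example, the reference site at the high-dN/dS end of the line receives a positive correlation and the site at the far end a negative one, which is the behavior the color-mapping relies on.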
Statistical analysis and data availability
Raw influenza sequences used in this analysis are available for download from GISAID (http://gisaid.org) using the parameters “North America”, “H3N2”, “1976 – 2015”. Acknowledgements for sequences used in this study are available in Supplementary File 1. The complete, processed data set used in our statistical analysis is available in Supplementary Dataset 6, including protein and gene numbering, computed evolutionary rates, relative solvent accessibility for the hemagglutinin trimer, and sitewise distance to protein site 224. Relative solvent accessibility of the hemagglutinin trimer was taken from (Meyer and Wilke, 2015). Site-wise distances between all amino acids in the HA structure PDBID:2YP7 were recalculated as in (Meyer and Wilke, 2015). Statistical analysis was performed using R (Ihaka and Gentleman, 1996), and all graphs were drawn with the R package ggplot2 (Wickham, 2009). Throughout this work, * denotes a significance of 0.01 ≤ P < 0.05, ** denotes a significance of 0.001 ≤ P < 0.01, and *** denotes a significance of P < 0.001.
Linear models between sitewise dN/dS and RSA or inverse distance were fit using the lm() function in R. Correlations were calculated using the R function cor() and significance determined using cor.test().
Our entire analysis pipeline, instructions for running analyses, and raw data (except initial sequence data, per the GISAID user agreement) are available at the following GitHub project repository: https://github.com/wilkelab/influenza_H3N2_passaging.
AUTHOR CONTRIBUTIONS
Conceived and designed the experiments: CDM COW. Wrote scripts and analytic tools: CDM AGM. Performed the experiments: CDM. Analyzed the data: CDM COW. Wrote the paper: CDM COW.
ACKNOWLEDGEMENTS
We would like to thank Sebastian Maurer-Stroh for help with interpreting passaging annotations in GISAID. This work was supported in part by NIH grant no. R01 GM088344, DTRA grant no. HDTRA1-12-C-0007, and NSF Cooperative agreement no. DBI-0939454 (BEACON Center). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.