Abstract
Many HLA loci show an excess of nonsynonymous (dN) with respect to synonymous (dS) substitutions at codons of the antigen recognition site (ARS), a hallmark of adaptive evolution. However, it remains unclear how these changes are distributed over time and across branches of the HLA phylogeny. In particular, although HLA alleles can be assigned to functionally and phylogenetically defined groups (“lineages”), a test for differences in ω (ω = dN/dS) within and between lineages is lacking. We analysed variation of ω across divergence times and phylogenetic contexts (placement of branches in the phylogeny).
We found a significant positive correlation between ω at ARS codons and divergence time, and that branches between lineages have higher ω than those within lineages. The excess of nonsynonymous changes between lineages attained significance when we used non-ARS codons to account for the fact that, even under purifying selection, ω is inflated for recently diverged alleles. Although less intensely selected, within-lineage variation at ARS codons bears evidence of selection, in the form of higher ω than those of non-ARS codons.
Our results show that ω ratios of class I HLA genes vary over time, and are higher in branches connecting alleles from distinct lineages. These results suggest that although within-lineage variation bears evidence of balancing selection, the between-lineage changes have been more intensely selected. Our findings indicate the importance of considering the effect of timescale when analysing ω values over a wide spectrum of divergences, and the value of using additional markers (in our case the tightly linked non-ARS codons) to account for the temporal dynamics of ω.
1 Introduction
MHC class I and II classical molecules are cell-surface glycoproteins which mediate presentation of peptides to T-cell receptors, and play a key role in triggering adaptive immune responses when the bound peptide is recognized as foreign (Klein and Sato 2000). In humans, they are coded by HLA class I (HLA-A, -B, and -C) and II (HLA-DR, -DQ, and -DP) genes. Several findings suggest these genes experienced balancing selection: unusually high level of heterozygosity with respect to neutral expectations (Hedrick and Thomson 1983); existence of trans-species polymorphisms (Takahata and Nei 1990); high levels of linkage disequilibrium (Huttley et al 1999); site frequency spectra with excess of common variants (Garrigan and Hedrick 2003); high levels of identity-by-descent compared to genomic averages (Albrechtsen et al 2010); positive correlation between HLA polymorphism and pathogen diversity (Prugnolle et al 2005), and significant associations of HLA alleles with the course of infectious diseases (e.g. Apps et al 2013). Information on the crystal structure of MHC molecules (Bjorkman et al 1987) allowed the identification of a specific set or amino-acids that make up the antigen recognition site (ARS), which determines which peptides the molecule can bind (Bjorkman et al 1987; Chelvanayagam 1996). The codons of the ARS were shown to have increased nonsynonymous substitution rates (Hughes and Nei 1988, 1989), consistent with the hypothesis that adaptive evolution at HLA loci is driven by peptide binding properties.
Several models of selection are compatible with balancing selection at HLA genes. The first is heterozygote advantage, which assumes that heterozygotes have higher fitness values because they are able to mount an immune response to a greater array of pathogens - an idea based on the observation that mice which are heterozygous for the MHC have increased immunological surveillance (Doherty and Zinkernagel 1975). Heterozygote advantage has received support from experiments in semi-natural populations of mice (Penn et al 2002), showing increased resistance of heterozygotes to multiple-strain infection, and through the finding that among humans infected with HIV, those which are heterozygous for HLA genes have slower progression to AIDS (reviewed in Dean et al 2002). A second form of balancing selection at MHC genes is negative frequency dependent selection, according to which rare variants have a selective advantage over common ones, because pathogens are more likely to evade presentation by common molecules (Slade and McCallum 1992). Although biologically compelling, most forms of summarizing genetic observation are incapable of differentiating this mode of selection from heterozygote advantage (Spurgin and Richardson 2010). A third model involves selection that is heterogeneous over space and/or time, favoring different alleles in different temporal or geographic compartments, and thus resulting in an overall increase in diversity at MHC loci. This model has been shown to be capable of accounting for features of HLA variation (Hedrick 2002). Many studies have investigated this model by comparing the degree of population differentiation at HLA and putatively neutral loci, with the expectation being that selection that is geographically heterogeneous will result in increased differentiation at HLA genes. As reviewed in Spurgin and Richardson (2010), the results are mixed, and interpretation is hampered due to differences in the mutational models underlying the evolution of HLA genes and loci used as neutral controls.
Although the specific form of selection acting on HLA genes remains an open question, the fact that these genes have evolved in a non-neutral way and are under balancing selection is an undisputed finding, which is robust to complications introduced by demographic history (Meyer and Thomson 2001; Hughes and Yeager 1998; Garrigan and Hedrick 2003). However, important questions regarding the properties of balancing selection at HLA genes remain unresolved. First, while most tests for selection have provided strong evidence for selection at HLA in deep timescales, there is comparatively less support for selection at recent timescales (Garrigan and Hedrick 2003). It has proved difficult to tease apart the possibility that selection differs across timescales from reduced statistical power of tests for recent selection, and thus the question of the timescale of selection on HLA genes remains open.
A second question concerns what type of variation is targeted by selection. Wakeland et al (1990) proposed a mechanism coined “divergent allele advantage”, which is a specific case of heterozygote advantage, according to which the fitness values of heterozygotes are proportional to the degree of divergence between the alleles they carry. This model was motivated by the observation that, in MHC class II murine genes, alleles within the same lineage often differ by only minor structural variations in the ARS, while alleles in different lineages have functionally different ARS. Classical MHC genes have many alleles, which can be grouped to reflect phylogenetic relatedness and functional attributes (with these groups commonly referred to as “lineages”). Although nucleotide diversity within lineages exceeds genome-wide averages, between lineage diversity is substantially higher than within (Takahata and Satta 1998). This raises the question of whether variation within lineages is under a different mode and intensity of selection with respect to differences among lineages.
We address these questions by analysing the temporal and phylogenetic dynamics of dN/dS (or ω) for ARS codons at the class I (HLA-A, -B and -C) loci. To quantify divergence times among alleles we estimate dS from non-ARS codons, thus avoiding the issue of statistical non-independence between the two measures. Additionally, we use a phylogenetic approach to compare the intensity of selection on branches within and between lineages, and on terminal and internal branches. Throughout, we evaluate the effect of intragenic recombinant alleles on the analyses. Our pairwise comparisons of alleles shows that more divergent pairs show higher ω for ARS codons than closely related pairs of alleles. Our phylogenetic analyses support the hypothesis that selection is stronger in branches connecting different lineages, or which are internal to the phylogeny, provided that a bias toward overestimating ω for recent divergence is taken into account. Although positive selection is weaker within than between lineages, our findings show that there is statistical support for deviation from a regime of neutrality or purifying selection within lineages. We conclude that divergence within lineages is not neutral, and that between lineage divergence bears a stronger signature of selection than within lineage variation.
2 Materials and Methods
2.1 Data
Alignments for HLA-A, HLA-B and HLA-C were obtained from the IMGT/HLA Database (Robinson et al 2013). All dN/dS estimations and analyses were implemented in CODEML (PAML package, Yang 2007). First codon position was considered to be the first codon of exon 2, as indicated by annotation on IMGT alignments. Our initial data sets were complete coding sequences, i.e, exons 2-7 (for HLA-A and HLA-C) and 2-6 (HLA-B). These data sets were used for the site models (SM) approach. For the pairwise and branch model (BM) approaches, we used two datasets: one with 48 ARS codons (Chelvanayagam 1996) and the other, non-ARS, consisting of the remaining codons (Table 1). Indels, null alleles, alleles encoding proteins with low cell surface expression (with the mutation inside or outside the coding region), alleles encoding proteins which are expressed as secreted molecules only and alleles with mutations that are putatively related to cell surface expression were removed from all downstream analyses. The non-ARS data sets were used for estimation of dS, used in the pairwise approach as an independent proxy for allelic divergence.
2.2 Recombination detection, clade filter and branch models
The complete alignments for each locus were used to generate NJ (Saitou and Nei 1987) trees in NEIGHBOR (PHYLIP package, Felsenstein 1989) using the F84 method, k = 2 and empirical base frequencies for the distance matrices in DNADIST (Felsenstein 1989). Intragenic recombinants were detected by applying RDP3 (Martin et al 2010) to the complete alignments and manual inspection.
2.2.1 RDP3
We chose RDP, Chimaera, Maxcho, GENECONV, BootScan and SiScan for recombination detection. Window size was adjusted to 100 for BootScan and SiScan, and to 15 for RDP. The number of variable sites per window was adjusted to 35 and 30 for Maxchi and Chimaera, respectively. In all other cases we used the default values. Trace evidence of recombination were ignored and we considered as significant p < 0.05 in at least 3 of these methods. These specifications were chosen based on a test alignment we provided to the software, in which parental and daughter HLA-B sequences were known a priori. Following this initial procedure, we visually inspected the filtered alignments for the detection of additional recombinant sequences. After listing the recombinant sequences, these were removed from the pairwise, site models and branch models data sets, resulting in a “recombinant” (R) and “non-recombinant” (NR) data set for each locus (Table 1).
2.2.2 Clade filter
For the branch models, we used t (expected number of nucleotide substitutions per codon) matrices obtained in the pairwise analyses of the NR non-ARS data sets as input for NEIGHBOR. The trees were visualized for manual pruning and labeling in Mesquite (v2.75, http://mesquiteproject.org/). We imposed that alleles from a given HLA lineage (this information is known a priori from the IMGT/HLA annotation, and is thus independent of our trees) had to group together in a clade and alleles which did not group in such manner were manually “pruned” from trees to adjust to this “clade criterium”. Tree branches were then labeled as “within” (w) lineages or “between” (b) lineages (or terminal/internal; figure 3) for the BM analyses. Table 1 summarizes the number of alleles used for each set of analyses.
2.2.3 Branch Models
With these pruned data sets we compared branch models 0 (one ω for all branches) and 2 (two or more categories of branches with independent ω) from CODEML via likelihood ratio tests (LRTs, see below). We provided CODEML with a topology based on the non-ARS data set, using branch lengths simply as starting points for ML estimation (fix_blength = 1). For all CODEML analyses (SM, BM and pairwise), the Goldman and Yang (1994) model was used for estimation of substitution rates, option F3×4 for codon frequency estimation, κ = 2 and ω = 0.4 as initial values. Tables S14-S16 (Online Resource 1) show likelihood convergence for the branch models. BM analyses were performed solely for NR (without recombinants) data sets (see tables 2 and 3).
2.3 Pairwise approach
Pairs with dN/dS > 5 were considered NAs, as high values of dN/dS are mainly caused by near zero dS values and, therefore, would bias our results (Wolf et al 2009). Correlations between allelic divergence and omega values were computed by Pearson’s correlation index. NA values were treated by casewise deletion. Significance for these correlations was evaluated via a in-house implementation of the Mantel Test. We repeated these Mantel tests using Spearman’s and Kendall’s correlation indexes (Online Resource 1, Table S13) and treating NA values with overall mean imputation. We obtained quantiles of the dSnon–ARS distribution and divided pairwise values according to these quantiles (Online Resource 1, Table S1 for non-ARS data set and Table 4 in main text for ARS data set). Differences in mean ω values for “within” and “between” were evaluated via a Wilcoxon rank sum test (Figure 2).
2.4 Site models
For the SM approach, the clade filter (see above) was not applied, which resulted in minor differences between this data set and the other two (pairwise and branch models approach, see Table 1).
We used site models from CODEML to verify which codons have evidence for ω > 1. M0 (one ratio) assumes the existence of only one ω ratio for all codons (it is the simplest model and used solely for an evaluation of consistency of the parameter estimates of the more complex models). M1 (neutral) assumes the existence of two categories of sites, one with ω1 = 1 (sites evolving in a neutral fashion) and the other with ωo < 1 (sites evolving under purifying selection), while M2 (selection) adds an extra category to M1, where ω2 > 1, corresponding to sites with evidence for adaptive evolution.M7 (beta) is a flexible null model where the value is sampled from a beta distribution, where where ω0 < 1, and 0 < ω < 1, while M8 adds an extra category to M7, ω2, which is estimated from the data (Yang 2006). Codons with posterior probabilities P > 0.95 of ω > 1 in the Bayes Empirical Bayes (BEB) (Yang et al 2005) approach implemented in CODEML were considered to have significant evidence for adaptive evolution, following criteria described elsewhere (Yang and Swanson 2002; Yang et al 2005). The ARS codons classification proposed by Bjorkman et al. (1987) are referred to as BJOR, the peptide binding environments described in (Chelvanayagam 1996) are referred to as CHEV and the list of codons in HLA genes with evidence of ω > 1 from Yang and Swanson (2002) is referred to as YANG (Online Resource 1, Table S9 and Figure 1). M1 vs M2 and M7 vs M8 were compared via likelihood ratio tests (see below).
Codons with P > 0.95 for ω > 1 in M8 (34 in total) were combined for the three loci, and the R and NR data sets, and compared to CHEV, BJOR and YANG. Of these 34, only one was outside of the exons 2 and 3 range (codon 305). Figure 1 shows the overlap between the BJOR and CHEV ARS classifications, the YANG set of codons and our approach.
Tables S3-S8 (Online Resource 1) show likelihoods obtained when altering initial CODEML conditions for the SM analyses. SM analyses were performed for R and NR data sets.
2.5 Likelihood Ratio Test (LRT)
When comparing two nested models, as in the SM (M1 and M2, M7 and M8) and BM (0 and 2) approaches (see above), the LRT test statistic is given by: where l2 and l0 are ML values for the more parameter rich model (M8 or M2 in the SM approach, or model 2 in the BM approach). Degrees of freedom: 2 d.f for SM and 1 d.f for BM. It is expected that the use of a chi-square distribution for significance evaluation is a conservative approach (Yang 2006).
2.6 Breslow-Day test
In order to compare ARS and non-ARS codons with respect to the distribution of synonymous and nonsynoymous changes within and between lineages (or for internal or terminal branches), we used a contingency table approach similar to the one described in Templeton (1996). We estimated the synonymous (S) and non-nonsynonymous (N) changes on each branch in CODEML, using the branch models. Next we counted N (nonsynonyous changes) and S (synonymous changes) for within/between or terminal/internal branches for each locus, and for ARS and non-ARS codons (Table 5).
We defined the odds ratio (OR) as Nwithin/Swithin/Nbetween/Sbetween and used a Breslow-Day test for homogeneity of odds ratios to test the hypothesis that contingency tables from ARS and non-ARS codons have the same odds ratio. We applied the same test to internal/terminal branches.
3 Results
3.1 Evidence for selection and assessment of recombination
Before investigating how ω varies over time and phylogenetic context, we tested (a) whether selection is detectable in our data set with pairwise comparisons and phylogenetic approaches; (b) if the presence of HLA alleles resulting from intragenic recombination influences our inferences; and (c) if there is agreement between the ARS codons defined by crystal structure (Bjorkman et al 1987; Chelvanayagam 1996) and the codons inferred to have ω > 1 in our data set.
We quantified the mean pairwise dN/dS , and found for all loci (Table 4). We used the non-ARS codons from the same sequences as an internal control, and found that is 3.9 (HLA-A), 4.0 (HLA-B) and 3.2-fold (HLA-C) greater than (Table 4). This effect is not driven by a subset of the pairwise comparisons, since dN > dS for the majority (between 67 and 84%) of ARS pairwise comparisons, in contrast to the non-ARS comparisons, where fewer than 7% show dN > dS (Table 4). Importantly, we find that the result is due to increased (3.5 to 14-fold higher for ARS), and not to decreased (0.5 to 2.8-fold higher for ARS, Table 4). Qualitatively similar results were obtained when we computed the ratio of mean substitution rates, (Table 4). These results are robust to the presence of recombinants (Online Resource 1, Table S1).
Evidence for adaptive evolution in ARS codons was also strongly supported by phylogenetic methods from CODEML (see Methods), where models allowing for selection (M2 and M8) in a subset of codons were significantly favored over the neutral models M1 and M7 (Online Resource 1, Table S2; p < 0.01, likelihood ratio test). Results were robust to starting conditions for HLA-A and HLA-B (Online Resource 1, Tables S3-S6), and less so for HLA-C (Online Resource 1, Tables S7 and S8).
We next quantified the overlap between codons inferred to be under selection (using site models from CODEML, from here on referred to as “SM”) and those defined as ARS by structural analyses of HLA molecules. We defined a set of ARS codons based on the work of Chelvanayagam (1996), which identified “peptide binding environments”, i,.e, the amino acid residues in a fixed neighborhood of the peptide binding residues in known crystal structure complexes, providing a less restrictive description of the antigen binding sites (Chelvanayagam 1996). Within exons 2 and 3 (which contain all ARS codons) we identified 33 codons with significant ω > 1 for the M8 site model (see Methods) in at least one locus, of which 27 (82%) are contained within the set that forms the ARS according to the crystal structure-based classification (Bjorkman et al 1987), 25 (76%) are contained within the peptide binding environments (Chelvanayagam 1996), and 25 (76%) overlap with Yang and Swanson’s (2002) site models approach to detect codons with ω > 1 in HLA (Figure 1 in main text and Online Resource 1, Table S9). This provides a highly significant association between the ARS and positively selected sites for all loci (p < 10−11, chi-square test). There is extensive overlap between the two ARS classifications (Bjorkman et al 1987; Chelvanayagam 1996) (Figure 1) and we also find a high overlap of selected sites between the R and NR data sets for each locus (27 out of 33) (Online Resource 1, Tables S10-S12).
Our results show that: (a) the pairwise and phylogenetic site models methods implemented by CODEML strongly support adaptive evolution on the ARS codons of HLA loci - as also described by Yang and Swanson (2002) through branch models; (b) there is an enrichment of codons with ω > 1 in the CHEV set of codons (Online Resource 1, Table S9), supporting the use of this classification for our study; (c) although the overall results were robust to the presence of recombinants - as indicated by simulation studies Anisimova et al (2003) -, the estimated values for ω appear to be sensitive to the inclusion of recombinants. Therefore, where appropriate, in subsequent pairwise analyses, we contrast results of non-recombinant (NR) and recombinant (R) datasets, while for the branch models we use the NR data set exclusively.
3.2 The time-dependence of ω at HLA loci
Having confirmed that selection at ARS sites is detectable with pairwise comparisons and phylogenetic approaches, we investigated if recent evolutionary change (accounting for differences among recently diverged alleles) shows different signatures of selection with respect to changes that occurred over greater timescales. Our first approach consisted in examining the distribution of ωARS as a function of the time since divergence between allele pairs. Our estimate of divergence time between allele pairs was based on the values of dSnon-ARS (estimated from non-ARS codons) for each allele pair, thus avoiding statistical non-independence with ωARS. Because very recently diverged alleles have low synonymous divergence (dSnon-ARS), the resulting ωARS values were often undefined or extremely large. We therefore followed a strategy adopted by Wolf et al (2009) to filter out the allele pairs with ωARS > 5 (resulting in the removal of 1.1%, 1.4% and 3.9% of ω values for pairwise comparisons at HLA-A, -B, and -C).
Pairwise estimates show that ωARS increases as a function of divergence time (Table 4). Indeed, ωARS and dSnon–ARS are positively correlated (Online Resource 1, Table S13; rHLA–A = 0.17, p < 0.001; rHLA–B = 0.20, p < 0.001; rHLA–C = 0.20, p < 0.001; Pearson, significance obtained by Mantel Test). Qualitatively similar results were found for NR data sets (Online Resource 1, Table S13; rHLA–A = 0.25, p < 0.001; rHLA–B = 0.12, p < 0.001; rHLA–C = 0.19, p < 0.001) and were robust to different correlation measures (Online Resource 1, Table S13). We also compared the pairwise ω for pairs of alleles within and between lineages (Figure 2). For all loci, the median value of ωARS is higher than 1 for the between lineage contrasts, and lower than 1 for the within lineage contrasts, and the distribution of ω is significantly higher for between lineage contrasts (p < 0.001, Wilcoxon rank sum test; Figure 2) of the ARS codons.
The above pairwise comparison approach suffers from the limitation that allele pairs with ω > 5 were treated as missing data, possibly underestimating ω for recently diverged alleles. This prompted us to use a phylogenetic model to contrast alleles at different levels of differentiation, which is more robust to the effects of low differentiation between specific allele pairs. We compared a branch model that estimates a single ω for all branches to one that estimates two values of ω (between or within; terminal or internal; see Figure 3 and Methods). For the branch models, the tree topology was obtained from the non-ARS (NR) data set. For all loci we found higher values of ωARS for branches which are between lineages, than for branches within lineages, although significance was not attained for these tests (Table 2). For the “internal-terminal” contrast we found higher ωARS for internal branches and there was statistical support for higher ω for internal than for terminal branches for HLA-C (Table 2).
Our results show that both pairwise comparisons and phylogenetic branch models indicate a heterogeneity of ω throught the diversification of HLA alleles, with higher ω values asssociated to contrasts between more divergent alleles (pairwise approach) or to branches connecting different lineages or that are internal to the phylogeny (phylogenetic branch models), although the difference was not significant for the “within-between” contrasts (Table 2).
3.3 Significantly more nonsynonymous changes between lineages at ARS codons
Our study estimates ω for allele pairs or branches sampled within a single species, and over varying timescales (recent to remote divergence). Both these features imply in possible biases to the estimation of ω, which we now discuss.
Kryazhimskiy and Plotkin (2008) used analytical and simulation approaches to show that under positive selection the behavior of ω within a single population is not a monotonic function of the intensity of selection, so that ω within a population can be low, even under positive selection. This occurs because the scenario they explored assumes that recent positive selection fixes a favourable nonsynonymous variant (thus decreasing ω). However, this scenario clearly does not apply to HLA genes, were balancing selection maintains multiple nonsyonynous polymorphisms simultaneously segregating within a population, contributing to ω > 1.
Another complexity in the interpretation of ω arises from that fact that many studies have shown that genes under purifying selection show surprisingly high ω (often close to 1) when samples with short divergence times are analyzed (e.g., those from a single population or species). For example, Rocha et al (2006) showed that dN/dS between two samples is negatively correlated with their divergence times, and exemplified these predictions with bacterial genomes. Likewise, a decrease of dN/dS with divergence time has been described in Wolf et al (2009), but considering a much deeper timescale. Kryazhimskiy and Plotkin (2008) demonstrated that this pattern is expected even under a regime of purifying selection that is constant over time. Thus, it is plausible that the recent divergence times among alleles within lineages could result in inflated ω values, explaining the modest differences in ω within and between lineages seen in the phylogenetic analyses (Tables 2 and 3). To explore this issue further, we took advantage of the availability of non-ARS codons from the same set of sequences, which provide a convenient internal control, given their tight linkage to the ARS codons (and hence similar within vs between-lineage phylogenetic structure). We find that non-ARS codons have larger values of ω within than between lineages, and for terminal versus internal brances (p < 0.05 for HLA-A in the within versus between lineage contrast, and for HLA-A and HLA-C in the tips versus internal contrast; LRT; Table 3). This distribution of ω values is in the exact opposite direction to that observed for the ARS (Table 2), consistent with an effect of short divergence times inflating the estimates of ω (Kryazhimskiy and Plotkin 2008).
To formally test whether ARS and non-ARS codons have a different distribution of synonymous and nonsynonymous changes within and between lineages (or for internal and terminal branches) we employed a contingency table approach implemented and described by Templeton (1996). We used the inferred number of synonymous (S) and nonsynonymous (N) changes on each branch to estimate the total number of each type of change in a specific class of branches (see Figure 3 for a schematic representation of the branch labeling). We defined the odds ratio (OR) as OR = Nwithin/Swithin/Nbetween/Sbetween. For all loci, we find that OR > 1 for non-ARS codons (proportionally more nonsynonymous changes within lineages) and OR < 1 for ARS codons (proportionally more nonsynonymous changes between lineages), as shown in Table 5. This finding is consistent with the maximum likelihood estimates of ω for branches (Tables 2 and 3), and the increased pairwise ω for alleles between lineages, relative to within (Figure 2). To test for differences between ARS and non-ARS codons, we pooled the contingency tables of all loci (due to the fact that several cells for individual loci had low counts). Using a Breslow-Day test for homogeneity of odds ratios (Table 5), we rejected the null hypothesis that contingency tables from ARS and non-ARS codons have the same OR (p − value = 0.0069). Our analysis with the contrast between terminal and internal branches (OR = Nterminal/Sterminal/Ninternal/Sinternal) showed the same pattern, with proportionally more nonsynonymous changes in internal branches for ARS codons (p − value = 0.00013; Table 5). Finally, although there is evidence for an excess of nonsynonymous changes between lineages (or for terminal branches) for ARS codons, there is also an enrichment for nonsynonymous changes within lineages for ARS codons, when compared to non-ARS codons (P < 0.001; Fisher’s exact test).
4 Discussion
Our study documents a positive correlation between ω values and the degree of divergence between allele pairs. This result is supported by phylogenetic analyses, which show higher ω values for branches connecting lineages, or branches which are internal to the phylogeny. A heterogenous nonsynonymous substitution rate was also reported in a recent study (Yasukochi and Satta 2014), which found that dN for ARS codons is not linearly correlated with divergence time in classical HLA loci. By further investigating the temporal dynamics in the DRB1 gene, these authors showed that this rate heterogeneity is likely the consequence of a reduction in the substitution rates in specific lineages, possibly as a consequence of continuous selective pressure by a specific pathogen. In the present study our goal is to explicity test for heterogeneity in the ω ratios over a priori defined groups of alleles or timescales. As was the case with the study of Yasukochi and Satta (2014), we find heterogeneity in the intensity of selection, in our case with evidence of increased selection at deeper timescales. Our findings indicate that long-term balancing selection has resulted in an enrichment for adaptive changes between HLA lineages, with recent and within lineage changes showing proportionally weaker signatures of molecular adaptation.
The finding that ωARS between lineage is greater than within is consistent with the divergent allele advantage model, according to which heterozygotes for more divergent alleles have higher fitnes than those carrying similar alleles. Under this model, excess of nonsynonymous changes between HLA allelic lineages would be expected, which is a result we have shown for the ARS data set.
Although our results show that ω within lineages is lower than between, our phylogenetic results show that within lineage variation does not behave neutrally, and ω for ARS codons within lineages is still higher than ω for non-ARS codons within lineages - a scenario which is expected to inflate ω Rocha et al (2006). This result suggests that within lineage variation cannot be viewed as neutral.
Recently several papers have drawn attention to the effects of divergence times on dN/dS estimation, and the complexities of interpreting these values when data is drawn from a single population (Rocha et al 2006; Kryazhimskiy and Plotkin 2008). Our finding of increased ωARS among more divergent alleles (or between lineages) is conservative in the light of these findings, which predict decreased ω for more divergent alleles. We accounted for this effect by using non-ARS codons, which have a similar phylogenetic structure to that of ARS codons (thanks to the removal of recombinants) to control for the background inflation of omega in recently diverged alleles, and found that ARS codons have very different distribution of ω, with increased evidence of selection between lineages, exactly the opposite to that seen for non-ARS codons.
An important caveat to this interpretation is that the temporal dynamics of dN/dS appear to be sensitive to the selective regime which is assumed to be operating. Thus, while several authors have shown that under purifying selection we expect increased dN/dS at low divergence, positive selection can produce a positive correlation with divergence times (Dos Reis and Yang 2013; Mugal et al 2014), which could account for part of the results we describe in this study. However, the case of directional positive selection, involving the sequential substitution of adaptive mutations, is markedly different from the dynamics of a balanced polymorphism, as is the case for HLA genes. Thus, simulation and analytical treatment of the temporal dynamics of dN/dS under balancing selection, which currently has not been explored, will be an additional source of information to intepret the dynamics we have described for HLA class I genes.
Electronic Supplementary Material
Supporting tables are available as an additional file.
Competing Interests
The authors declare that they have no competing interests.
Author’s Contributions
BDB carried out all data manipulation and statistical analyses and drafted the manuscript. RDF participated in the detection of recombination in the data sets and participated in the interpretation of the site models analyses. DM conceived of the study, participated in its design, coordination and in the statistical analyses and helped draft the manuscript. All authors read and approved the final manuscript.
Acknowledgements
The authors thank Aida M. Andrés and Richard Single for helpful discussions. This work was supported by Fundação de Amparo à Pesquisa do Estado de São Paulo (2009/09127-8; 08/56502-6 and 11/12500-2 to B.D.B) and Conselho Nacional de Desenvolvimento Científico e Tecnológico (152676/2011-2 to BDB, 142130/2009-5 to RSF and 308960/2009-2 to DM).