Coevolutionary analyses require phylogenetically deep alignments and better null models to accurately detect inter-protein contacts within and between species

Aram Avila-Herrera; Katherine S. Pollard

doi:10.1101/014902

Abstract

When biomolecules physically interact, natural selection operates on them jointly. Contacting positions in protein and RNA structures exhibit correlated patterns of sequence evolution due to constraints imposed by the interaction, and molecular arms races can develop between interacting proteins in pathogens and their hosts. To evaluate how well methods developed to detect coevolving residues within proteins can be adapted for cross-species, inter-protein analysis, we used statistical criteria to quantify the performance of these methods in detecting inter-protein residues within 8 angstroms of each other in the co-crystal structures of 33 bacterial protein interactions. We also evaluated their performance for detecting known residues at the interface of a host-virus protein complex with a partially solved structure. Our quantitative benchmarking showed that all coevolutionary methods clearly benefit from alignments with many sequences. Methods that aim to detect direct correlations generally outperform other approaches. However, faster mutual information based methods are occasionally competitive in small alignments and with relaxed false positive rates. All commonly used null distributions are anti-conservative and have high false positive rates in some scenarios, although the empirical distribution of scores performs reasonably well with deep alignments. We conclude that coevolutionary analysis of cross-species protein interactions holds great promise but requires sequencing many more species pairs.

Background

Coevolution—“the change of a biological object triggered by the change of a related object” [1]—is a powerful concept when applied to molecular sequence analysis because it reveals positional relationships that are worth preserving across evolutionary time scales. Sequence evolution is constrained by essential molecular interactions, such as contacts within a protein or RNA structure, as well as inter-molecular interactions in protein complexes and signaling pathways. These constraints define an epistasis between sites (residues or base-pairs) where the probability of a substitution depends on the states of other sites [2] involved in an interaction. Because epistasis can induce correlation between substitution patterns across columns in multiple sequence alignments, many methods have been developed that use evidence of coevolving alignment columns to detect physical interactions within and between biomolecules. These methods draw inspiration from diverse techniques in molecular phylogenetics, inverse statistical mechanics, Bayesian graphical modeling, information theory, sparse inference, and spectral theory (reviewed in [3, 4]).

Despite good rationale for coevolutionary approaches, physically interacting alignment columns have been notoriously difficult to identify from correlated patterns of sequence evolution for several reasons. First, shared evolutionary history creates a background of correlated substitution patterns against which it can be difficult to distinguish additional constraints derived from physical interactions. Common phylogeny is particularly strong within a gene family (e.g., predicting intra-molecular contacts). But it is also present across gene families within a species or even between species (e.g., predicting host-virus protein interactions), especially at shorter evolutionary distances where gene trees mirror species trees more closely. Coevolution methods have used a variety of approaches to counter the dependence induced by shared phylogeny, including removing closely related sequences from alignments to reduce non-independence [5, 6], differential weighting of sequences when computing statistics [7–9], and null distributions that directly model or indirectly account for phylogeny [10–13].

A second challenge arises when trying to distinguish correlated evolution that arises from direct versus indirect interactions. Alignment columns that are indirectly implicated in an interaction can be strongly correlated, and most columns are involved in multiple, partially overlapping interactions. For these reasons, close physical interactions may not produce patterns of substitution that are significantly more highly correlated than the background present in structures. This problem has been the focus of a recent class of coevolutionary methods that focuses on reducing the number of incorrect predictions by disentangling direct from indirect correlations [9, 14–17]. An alternative point of view considers these networks of indirectly correlated residues as protein sectors that can easily, through cooperative substitutions, respond to fluctuating evolutionary pressures [18].

Finally, due to low power—resulting in part from the previous two challenges—physically interacting sites can typically only be detected in multiple sequence alignments that span large evolutionary divergences and contain many hundreds to thousands of sequences. Recent evaluations of a number of coevolution methods concluded that accurate contact predictions require alignments with one to five times as many sequences (with <90 % sequence redundancy) as positions [19, 20].

To date, coevolutionary prediction of physically interacting alignment columns has been applied with success to intra-molecular contacts [7, 21–23] and well-characterized inter-molecular interactions [24], such as bacterial two-component signaling systems [25], enzyme complexes [26], and fertilization proteins [27]. The signal-to-noise ratio is too low and the search space too large to use sequence evolution to effectively identify pairs of physically interacting protein residues across entire proteomes; most pairs of sites with correlated substitution patterns are not in direct contact, and most physically interacting sites do not have statistically correlated substitution patterns [28].

However, the ability to now measure physical interactions between biomolecules with high-throughput technologies, such as affinity purification followed by mass spectrometry (APMS) [29], two-hybrid methods [30, 31], and protein complementation assays [32], raises the possibility of using sequence coevolution in a more specific way: to refine predicted interactions in an experimentally reduced search space. For example, correlated substitution patterns in pairs of proteins could help determine if an experimentally measured interaction is likely to represent direct physical contact versus an indirect interaction in a complex or a false positive. Coevolutionary analysis could also be informative regarding which of the sites in a pair of interacting molecules are most likely to be in physical contact.

One particularly exciting application of this approach is to characterize and potentially manipulate interacting residues in host-virus and host-parasite protein interactomes [33, 34]. Newly emerging data on antibody and antigen sequences within a host [35] offers an opportunity to harness coevolutionary signals to investigate the mechanisms of broadly neutralizing antibodies and immune evasion. The primary open question for these new applications is whether existing methods are sensitive and specific enough to detect coevolution with the levels of constraint and divergence that are present in inter-molecular data sets of modest size.

To this end, we designed data processing scripts, statistical evaluation and visualization tools, and simulation pipelines that allowed us to easily extend a suite of coevolution methods designed for intra-protein interaction prediction (Table 1) so that they can be used to test for patterns of correlated sequence evolution at pairs of sites in two different proteins, potentially from different sets of organisms in different parts of the tree of life (e.g., human-bacteria, bacteria-phage interactions). We then applied this integrated framework for coevolutionary analysis to refine and annotate a recently derived human-HIV1 protein-protein interaction network [33] and to test for coevolution in the well studied arms-race interaction between the mammalian cytidine deaminase APOBEC3G (A3G) and its HIV1 antagonist, Vif. Because fewer than ten orthologous mammal-lentivirus proteome pairs have been sequenced and mammalian divergence is low, we hypothesized that power would be low in these settings.

Table 1:

Coevolution methods included in analysis. Information-based methods: MI: mutual information [57], VI: variation of information [58], MI_j: MI divided by alignment column-pair entropy, MI_Hmin: MI divided by minimum column entropy [8], MI_w: MI with adjusted amino acid probabilities. Direct methods: DI: direct information—MI with re-estimated joint probabilities [9], DI₂₅₆, DI₃₂: DI using Hopfield-Potts for dimensional reduction (256 and 32 patterns respectively) [59], DI_plm: Frobenius norm of coupling matrices in 21-state Potts model using pseudolikelihood maximization [38], PSICOV: sparse inverse covariance estimation [14]. Phylogenetic methods: CoMap P-values for four analyses CMP_cor: substitution correlation analysis [10], CMP_pol for polarity compensation, CMP_chg for charge compensation, CMP_vol for volume compensation [2].

To quantify the limitations of coevolutionary methods when only a handful of sequences are available, we used a data set of 33 within-species bacterial protein-protein interactions. To systematically determine the parameters that affect performance, we focused on the well-characterized interaction between bacterial histidine kinase A (HisKA) and its response regulator (RR), for which a co-crystal structure and thousands of sequences are available. By subsampling HisKA-RR sequence pairs, we show that most methods have appreciable precision or power at low false positive rates for alignments with ~500 or more sequences. However, the best performing method depends on whether power or precision is more important, the number of non-redundant sequences in the alignment, and whether the goal is to find structurally or functionally linked residues. By expanding this analysis to 32 additional bacterial interactions [24], we show that these trends generalize beyond the specific example of HisKA and RR. We conclude that coevolution methods are able to identify some residues important for cross-species protein-protein interactions, but this approach will benefit greatly from additional sequence data.

Results

Performance benchmarking of coevolution methods

The coevolutionary methods benchmarked in our analyses fall into three general groups (Table 1). Information-based methods are various flavors of Mutual Information between pairs of sites, each considered independently. Direct methods are those that consider pairs of sites in the context of a sparse global statistical model for contacts in the multiple sequence alignment. Phylogenetic methods explicitly use a substitution rate matrix and phylogenetic tree in their calculation of a coevolution statistic that may take into account the biochemical and physical properties of amino acid residues, as well as report a P-value based on internal simulation of independently evolving sites. In this benchmark we use the CoMap P-value as a statistic for comparison with other coevolution methods. Other differences among the coevolution methods include the incorporation of two additional techniques that have been shown to improve performance: re-weighting sequences such that similar sequences contribute less to the final score [5], and applying an Average Product Correction (APC) to remove background noise and phylogenetic signal from “raw” coevolution statistics [8].
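As an illustration of the simplest Information-based statistic and of the APC step, the following is a minimal sketch, not the implementation used in this benchmark. The function names `mutual_information` and `apc_corrected` are hypothetical, the MI estimate is the plain plug-in estimate without sequence re-weighting or pseudocounts, and the APC is adapted here to a rectangular inter-protein score matrix by using row and column means, which is an assumption on our part.

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """Plug-in mutual information (in nats) between two alignment columns.

    col_a and col_b are equal-length sequences of residue symbols, one
    symbol per paired row of the two alignments.
    """
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p_ab / (p_a * p_b) simplifies to n_ab * n / (n_a * n_b)
        mi += (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
    return mi


def apc_corrected(scores):
    """Average Product Correction for a rows-by-columns score matrix.

    Assumption: rows index sites in protein 1 and columns index sites in
    protein 2, so the correction uses row means and column means.
    """
    n_rows, n_cols = len(scores), len(scores[0])
    row_means = [sum(row) / n_cols for row in scores]
    col_means = [sum(scores[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    if overall == 0:
        # Nothing to correct when every score is zero.
        return [list(row) for row in scores]
    return [[scores[i][j] - row_means[i] * col_means[j] / overall
             for j in range(n_cols)]
            for i in range(n_rows)]
```

Two perfectly covarying columns such as `"AACC"` and `"DDEE"` give an MI of ln 2, whereas `"AACC"` and `"DEDE"` give zero.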

To benchmark coevolution methods, we used 33 within-species pairs of proteins with co-crystal structures determined from E. coli proteins. These include a set of paired alignments compiled by [24], plus the histidine kinase-response regulator (HisKA-RR) bacterial two-component system from [36], provided by the authors. We included HisKA-RR, because it is a well-characterized interaction with a very large, diverse multiple sequence alignment (8998 sequences for each gene) and genetic evidence supporting several interactions. For these reasons, HisKA-RR has also been used previously in coevolutionary analyses [37].

Because the HisKA-RR alignment is so large, it enabled us to quantify the effects of alignment size and diversity by down-sampling the full alignment to produce a wide range of smaller pairs of HisKA and RR multiple sequence alignments with different numbers of sequences (range 5 to 5000 sequences) and phylogenies from the original alignment. The 32 alignment pairs from [24] naturally varied in size (range 168 to 1428 sequences).

For each pair of multiple sequence alignments from two interacting proteins, we compared every site in the first protein to every site in the second protein and scored these pairs of alignment columns for coevolution using each of the methods in Table 1. We then used coevolution scores to predict inter-domain pairs of amino acid residues that are within 8 angstroms (Å) of each other, measured between C_βs, in the representative co-crystal structure (see Methods). We also repeated our analyses of the HisKA-RR sub-alignments using a stricter definition of contacts that requires additional biochemical evidence for specificity determination, and an alternate definition that measures distance between the closest non-hydrogen atoms. Trends in our results were generally similar across these choices of definition for true interactions, but we observed some differences in performance between definitions when enforcing a false positive rate (FPR) (Figure S2).
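The distance-based contact definition can be sketched as follows. This is an illustrative example only: the function name `contacts_within` and the input format (one x, y, z coordinate tuple per residue, for example the C_β position) are assumptions, and parsing of the co-crystal structure is left out.

```python
import math


def contacts_within(coords_a, coords_b, cutoff=8.0):
    """Return (i, j) index pairs whose coordinates lie closer than `cutoff`.

    coords_a and coords_b are lists of (x, y, z) tuples in angstroms, one
    per residue of protein A and protein B respectively.
    """
    pairs = []
    for i, xyz_a in enumerate(coords_a):
        for j, xyz_b in enumerate(coords_b):
            if math.dist(xyz_a, xyz_b) < cutoff:
                pairs.append((i, j))
    return pairs
```

Supplying the closest non-hydrogen atom distance instead of C_β coordinates would require a different input, but the thresholding step is the same.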

The performance of each method to distinguish contacting pairs of residues (positives) from other residue pairs (negatives) was measured as previously described [14, 38] and evaluated using power (also called recall and true positive rate (TPR)) and precision (also called positive predictive value (PPV)) at a range of low FPRs. Power and precision are complementary performance measures that quantify the percentage of interacting residue pairs that are found and the percentage of identified residue pairs that are interacting, respectively. Precision is a useful measure of performance in cases where positives (contacting pairs of residues) are overwhelmed by negatives (non-contacting residues). A method with high precision is helpful for generating lists of high confidence pairs of residues for expensive follow-up studies, even if it misses a number of truly interacting sites and therefore has relatively low power. We additionally examined four threshold-independent performance measures: area under the receiver operating characteristic curve (auROC), area under the precision-recall curve (auPR), maximum F₁-score (f_max), and maximum ϕ (phi_max).
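To make the two threshold-dependent measures concrete, the sketch below computes power and precision at the most permissive score threshold whose FPR stays below a given bound. It is a hedged illustration rather than the evaluation code used here: the function name is hypothetical, higher scores are assumed to indicate stronger coevolution, and tied scores are admitted together.

```python
def power_and_precision_at_fpr(scores, is_contact, max_fpr):
    """Return (power, precision) at the loosest threshold with FPR < max_fpr.

    scores: one coevolution score per residue pair (higher = stronger).
    is_contact: matching booleans, True for contacting pairs.
    """
    n_pos = sum(1 for c in is_contact if c)
    n_neg = len(is_contact) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one contact and one non-contact pair")
    ranked = sorted(zip(scores, is_contact),
                    key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    power, precision = 0.0, 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Admit every pair tied at this score before checking the FPR.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / n_neg >= max_fpr:
            break
        power, precision = tp / n_pos, tp / (tp + fp)
    return power, precision
```

For example, with scores `[0.9, 0.8, 0.7, 0.6, 0.5, 0.4]` and labels `[True, True, False, True, False, False]`, a bound of 0.5 admits the top four pairs (power 1.0, precision 0.75), while a bound of 0.1 admits only the top two (power 2/3, precision 1.0).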

Physically interacting sites can be accurately detected in large sequence alignments

Our primary finding is that many coevolutionary methods are able to detect inter-molecular contacts at low FPRs in alignments with hundreds of diverse sequences from each protein, consistent with previous studies of intra-molecular contacts [3, 17], specifically when the alignments are deeper than they are long [19, 20]. We capture this rectangular quality in the statistic Neff/L, where Neff is the effective number of sequences as calculated by PSICOV [14] and L is the total number of columns in the pair of alignments. We observe similar trends when we use the number of sequences (N) or their phylogenetic diversity (PD), rather than Neff/L, to compare performance. The relationship between N, PD, and Neff is explored further in the Supplemental Text: Diversity of sequences and Supplemental Figures S10, S11 and S21. The diversity of residues within the individual alignment columns that make up each pair is another important factor to consider, and is explored in the Supplemental Text: Performance by column entropy categories.
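The idea behind Neff/L can be sketched with a generic sequence re-weighting scheme. This is an assumption-laden illustration and not PSICOV's exact procedure: the function names are hypothetical, the 0.8 identity threshold is an arbitrary example value, and each sequence is simply down-weighted by the number of sequences (including itself) that are at least that similar to it.

```python
def fraction_identical(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences agree."""
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def effective_sequences(alignment, identity_threshold=0.8):
    """Sum of per-sequence weights 1 / (number of similar sequences)."""
    neff = 0.0
    for seq in alignment:
        neighbours = sum(
            fraction_identical(seq, other) >= identity_threshold
            for other in alignment
        )
        # neighbours is at least 1 because each sequence matches itself.
        neff += 1.0 / neighbours
    return neff


def neff_per_column(alignment, identity_threshold=0.8):
    """Neff divided by L, the number of columns in the (paired) alignment."""
    return effective_sequences(alignment, identity_threshold) / len(alignment[0])
```

For a paired analysis, each row of `alignment` would be the concatenation of the two partner sequences, so that L counts the columns of both proteins.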

Both power and precision improve with increasing Neff/L for nearly all coevolutionary methods in the HisKA-RR data set (Figure 1). However, for alignments with Neff/L < 1.0, power at FPR<5% and precision at FPR<0.1% both remain relatively low (<50%). Additionally, the performance metrics f_max and phi_max show that there are no score thresholds (i.e., levels of prediction strictness) that achieve both high precision and power in alignments with Neff/L < ~3.0 (Supplemental Figure S1). Despite the smaller range in Neff/L values, these performance trends are also observed across the 32 alignments in [24] (Supplemental Figures S3 and S6).

Figure 1:

Coevolution statistics differ in their ability to detect residue contacts in HisKA-RR sub-alignments. Performance improves with larger, more diverse alignments. A: Power (TPR) and precision (PPV) at false positive rate (FPR) < 5%, B: at FPR < 0.1%. See Misc. Abbreviations and Table 1 for abbreviations.

In general, we confirm that coevolutionary methods that adjust for background phylogenetic signal through sequence re-weighting and/or average product correction (APC) (e.g., DI, DI_plm, and PSICOV) perform better than the phylogeny unaware mutual information (MI) based methods and the phylogeny aware approaches that explicitly use evolutionary models. In the HisKA-RR alignment, we observed two major exceptions to this trend when using the strictest definition for contacting pairs (i.e., requiring residue C_β < 8Å coupled with biochemical evidence for specificity determination) (Supplemental Figure S2). First, the standard MI statistic is the most precise method for detecting contacting sites in alignments with Neff/L > 1.6 and FPR < 0.1%. Second, mutual information normalized by the joint entropy (MI_j) has relatively high power in many scenarios and is the most powerful method for detecting contacting sites that are supported by experimental evidence at FPR < 5%. However, MI_j has drastically lower power at FPR < 0.1%. These findings suggest that MI is a reasonable choice if the goal of the analysis is to predict a small number of very high confidence contacts, whereas MI_j may be useful for detecting as many contacts as possible if a moderate FPR can be tolerated. These methods are both straightforward to compute, adding to their utility in these settings.

CoMap performance is an interesting case because, in contrast to DI, DI_plm, and PSICOV, it was not designed to find contacting residues. In the smallest alignments (5 sequences) we tested, it can have slightly better performance than the other methods. However, its poor performance in other alignments may indicate that it is identifying a set of coevolving residue pairs that only partially overlaps with contacting residues. It remains to be explored whether CoMap can be used to prioritize residue pairs predicted by the other methods for functional assays.

Finally, we looked at the relationship between performance and the proportion of residue pairs that are contacts. Comparing across all 33 structures in our analyses, we observed the proportion of contacts is correlated with precision (Supplemental Figure S7). This means that most strongly coevolving residues in a protein pair are more likely to be physically interacting in co-crystal structures with larger interfaces.

Choice of null distribution affects performance

The previous results show performance based on the known HisKA-RR structure. When the methods in our study are applied in practice, the structure is usually not known. One therefore uses a null distribution to control false predictions. Specifically, an upper quantile of the distribution of coevolutionary statistics in the absence of coevolutionary constraint is used as a threshold; one declares any pair of sites with a statistic exceeding the threshold a predicted contact. The goal is to minimize false predictions by predicting contacts only when statistics are much larger than expected by chance under the null distribution. A variety of null distributions are commonly used, including theoretical limiting distributions [5, 8], the observed empirical distribution (under the assumption that most pairs of sites are not coevolving) [39], and parametric, semi-parametric, and non-parametric bootstrap distributions [10, 40]. Theoretical and empirical nulls are more computationally efficient than bootstrap methods, which require simulating large data sets. The HisKA-RR interaction provides a framework for assessing the performance of these different approaches.

We used our sampled sub-alignments of HisKA-RR and the 32 alignments in [24] to compare the performance of two commonly used null distributions and to evaluate the sensitivity of each approach to alignment size. For each null distribution and coevolutionary statistic, we first employed the non-contact pairs of residues to assess if the FPR was truly controlled or not at given target FPRs of 5% and 0.1%.

The normal distribution is commonly used as a theoretical null for mutual information and its normalized variants. Under this assumption, we standardized the coevolution scores to Z-scores and compared these to upper quantiles of the standard normal distribution (mean = 0, variance = 1). We then used the resulting upper-tail P-values (P_normal) to predict contacting residue pairs. We found that nominal FPRs using this approach consistently exceed the target FPR across the range of Neff/L values in both the HisKA-RR sub-alignments and the alignments in [24] (Figure 2 and Supplemental Figure S4). In general, as Neff/L increases, the nominal FPR for Direct methods increases while it decreases for Information-based methods. Nominal FPRs were up to 2 and 20 times the target FPR for target FPRs of 5% and 0.1%, respectively. This suggests that non-contacting residue pairs carry signals of coevolution (e.g., due to phylogeny, structural, or other evolutionary constraints) and/or that Z-scores of coevolution statistics have variance greater than one across non-contacting residues (e.g., due to an underestimated standard deviation across residue pairs resulting from within protein constraints or residues appearing in many pairs). Three of the four phylogeny aware CoMap methods controlled the nominal FPR below the target in all cases, suggesting that the charge compensation analysis is predicting long-range residue interactions as well as contacts.
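The standardization and upper-tail calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in our pipeline: the function name is hypothetical, and the mean and sample standard deviation are taken over all scored residue pairs for a given method.

```python
import math
import statistics


def normal_upper_tail_pvalues(scores):
    """Standardize scores to Z-scores and return upper-tail normal P-values.

    P_normal for a score with Z-score z is P(Z > z) under a standard
    normal distribution, computed as 0.5 * erfc(z / sqrt(2)).
    """
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    pvalues = []
    for s in scores:
        z = (s - mean) / sd
        pvalues.append(0.5 * math.erfc(z / math.sqrt(2)))
    return pvalues
```

A residue pair is then predicted to be in contact when its P_normal falls below the target FPR (for example 0.05 or 0.001).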

Figure 2:

Commonly used null distributions for coevolution statistics often fail to control the false positive rate (FPR). A: Nominal FPRs for target FPR < 5%, B: target FPR < 0.1% (dashed lines) in the HisKA-RR alignments, assuming standardized scores have a standard normal null distribution (i.e., using P_normal). The phylogenetic methods control FPR at a threshold of 0.001, because they do not make any predictions at this significance level. See Misc. Abbreviations and Table 1 for abbreviations.

Thus, while the normal distribution applied to standardized coevolution statistics can practically be used as a null distribution, we conclude that this approach results in elevated rates of false positive predictions, likely due to shared phylogeny or structural constraints affecting non-contacting residue pairs. A theoretical null (e.g., noncentral gamma [41]) that is parameterized for individual column pairs may therefore be more appropriate.

Another choice of null distribution is the observed empirical distribution of the coevolution statistics. A P-value (P_empirical) for a score S is simply the proportion of scores that are more extreme than S. This straightforward method can be easily applied with any statistic. However, it also assumes that no pairs of sites are coevolving and should therefore produce thresholds that are too strict when there are some coevolving sites in the data set (i.e., making it harder to predict real contacts). Contrary to this expectation, we found that the empirical null distribution, like the normal null distribution, produces nominal FPRs that exceed target FPRs (Figure 3 and Supplemental Figure S5). However, it is the Direct methods that best control the nominal FPR in both sets of alignments, marginally exceeding the target FPR in only a couple of cases. The Information-based methods fared well in the alignments in [24]; however, the HisKA-RR sub-alignments reveal that at Neff/L < 0.3, control of the FPR is lost, especially in MI_Hmin. The Phylogenetic method that consistently exceeded the target FPR was the CoMap correlation analysis (CMP_cor), which makes no assumptions regarding the biochemical properties of the amino acids. These results suggest that the empirical null distribution is not as conservative an approach as one might expect from including contacting residue pairs in the null distribution. It may suffer from some of the same effects that make the normal null distribution anti-conservative, such as shared phylogeny or structural constraints. In addition, alignments with very few sequences (e.g., 5-50) have a limited number of possible scores, which leads to ties in P-values between contacting and non-contacting residues.
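A rank-based empirical P-value can be sketched in a few lines. This is an illustrative example with a hypothetical function name. One assumption to note: each score is compared with all scores greater than or equal to it (so a score counts itself and its ties), which keeps P_empirical strictly above zero; counting only strictly larger scores is an equally plausible reading of the definition.

```python
from bisect import bisect_left


def empirical_pvalues(scores):
    """P_empirical for each score: fraction of all scores that are >= it."""
    ordered = sorted(scores)
    n = len(ordered)
    # bisect_left returns the number of scores strictly below s, so
    # n minus that count is the number of scores greater than or equal to s.
    return [(n - bisect_left(ordered, s)) / n for s in scores]
```

For instance, `empirical_pvalues([0.1, 0.5, 0.5, 0.9])` returns `[1.0, 0.75, 0.75, 0.25]`, which also shows how tied scores share a P-value.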

Figure 3:

Commonly used null distributions for coevolution statistics often fail to control the false positive rate (FPR). A: Nominal FPRs for target FPR < 5%, B: target FPR < 0.1% (dashed lines) in the HisKA-RR alignments, using the empirical distribution of score ranks as the null distribution (i.e., using P_empirical). See Misc. Abbreviations and Table 1 for abbreviations.

These results are encouraging, but still leave us with the challenge of how to choose an appropriate P-value cutoff in a real analysis when the structure is unknown. Since our findings indicate that nominal FPRs exceed target FPRs with all three types of null distributions and nearly all methods, stricter P-value cutoffs than the target false positive rate seem warranted. But it is not clear how much stricter will be needed in any given alignment pair without additional information to guide such modifications (e.g., incorporating alignment properties such as Neff/L into a model for each coevolution method). Hence, in most applications one must simply aim to control a target FPR, knowing that the true error rate is likely to be larger (Supplemental Figures S8 and S9). For this reason, the empirical null distribution may be the best choice to use, as it controls error rates across the majority of alignment sizes, target FPRs, and coevolution methods tested (Figures 3 and S5). As a rule of thumb, the empirical null overall controls the FPR for the Direct methods, although in small alignments (5 sequences or Neff/L < 0.3) the nominal FPR can be up to 1.5 times the target FPR.

Cross-Species Case Study 1: Applying coevolution methods to Vif-A3G identifies some residues known to affect host-virus interactions

Viral infectivity factor (Vif) is a lentiviral accessory protein whose primary function is to target the antiviral cytidine deaminase APOBEC3G (A3G) of its mammalian hosts through ubiquitination. Because the two protein families are in an evolutionary arms race [42, 43], we hypothesized that they would be an informative example for exploring the utility of coevolution methods in host-virus protein pairs (i.e., inter-protein, inter-species interactions). This is a novel application of coevolution analysis, which has primarily been applied to residues within a protein or between pairs of proteins in the same genome.

A major challenge in performing coevolutionary analysis on cross-species protein pairs is acquiring appropriate data, including paired alignments and protein structures for validation. For Vif-A3G, we were able to identify 16 pairs of sequences (Neff = 10.0) from different primates (A3G orthologs) and their lentiviruses (Vif orthologs) in public databases (Table S2). Our benchmarking results on HisKA-RR indicate that such small protein families push the useful limits of the coevolution statistics we tested (Neff/L = 0.014). The low sequence diversity of A3G (Neff = 3.04) within primates compared to Vif (Neff = 11.3) within primate lentiviruses also presents challenges. Hence, we expect coevolutionary analysis to have limited power in this scenario. Quantitatively evaluating performance requires validated Vif-A3G interactions. The structure of Vif in complex with A3G has not been solved. However, biochemical assays have solidly identified regions important for binding and ubiquitination along the individual reference sequences of HIV1 Vif [44–47] and human A3G [48, 49] (Table S3). For this analysis, we therefore take the residues in biochemically-validated regions to be positives even though they might not be contacts (i.e., C_β distance ≥ 8Å), and assume that all remaining residues are negatives, even though other sites (including sites deleted in these reference sequences) are possibly involved in the interaction. While further experimentation is needed to understand the relationship between functionally important sites and the structure of the protein interaction, as well as the effects of mutations in these sites on the fitness of lentiviruses, we explore whether any clues can be identified in the limited data that describes the coevolutionary history of the Vif-A3G residues.

First, we computed coevolutionary statistics for all Vif-A3G residue pairs and evaluated how well the statistics pinpoint the positive functionally important residues compared to negatives. For this evaluation, we used the empirical distribution of scores as a null distribution to determine statistical significance (i.e., P_empirical) because it has lower false positive rates across Neff/L values at strict significance thresholds. Because the positives and negatives are single residues in each sequence instead of inter-protein residue pairs, we summarized P_empirical for each residue by assigning it the most significant P_empirical across all inter-protein pairs to which it belongs, and then explored the Vif and A3G results individually. From our benchmarking on the bacterial data sets, we know that significance thresholds that control the FPR vary by method and Neff/L, and that strict thresholds that yield very low (~2-3%) power are typically needed to control FPR in small alignments. We therefore chose to identify a significance threshold for each method that maximizes precision on the known functional sites in each protein. Then, we estimated power and FPR at these thresholds.
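The per-residue summary described above amounts to taking a minimum over pairs. The sketch below illustrates this step only; the function name and the dictionary input format, keyed by (Vif position, A3G position), are hypothetical choices made for the example.

```python
def best_pvalue_per_residue(pair_pvalues):
    """Assign each residue its most significant pairwise P_empirical.

    pair_pvalues maps (vif_position, a3g_position) to P_empirical for
    that inter-protein residue pair. Returns two dictionaries, one per
    protein, mapping each position to its smallest pairwise P-value.
    """
    vif_best, a3g_best = {}, {}
    for (vif_pos, a3g_pos), p in pair_pvalues.items():
        vif_best[vif_pos] = min(p, vif_best.get(vif_pos, 1.0))
        a3g_best[a3g_pos] = min(p, a3g_best.get(a3g_pos, 1.0))
    return vif_best, a3g_best
```

The resulting per-residue values can then be thresholded separately for Vif and A3G when estimating precision, power, and FPR against the biochemically validated sites.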

On Vif, with the exception of CMP_cor and DI₃₂, the maximum precisions for each method ranged from 9 to 20% (i.e., only one or two residues out of ten predicted to be positives are truly positives) (Supplemental Figure S14). At these precision-optimized thresholds, MI_j and MI_Hmin predict almost every Vif residue to be coevolving; a stricter threshold would not result in a lower proportion of incorrect predictions. In contrast, the precisions for CMP_cor, CMP_pol, and DI₃₂ are the highest (20%, 40%, 100% respectively). However, this comes at the cost of making the fewest predictions, with the last of these making only a single prediction. For these methods, less strict thresholds are needed to identify a greater proportion of positives at the cost of increasing the proportion of false discoveries. Across all methods, low f_max and phi_max values (0.26 and below) suggest there are no significance thresholds that balance power and precision for this data set.

We observed similarly low performance on A3G (Supplemental Figure S16). Encouragingly, we note that positions 128-130 are correctly identified by multiple methods (Supplemental Figure S12B). Residues at position 130 (e.g., D vs A) are highly likely to be adaptations that conferred species-specific resistance to Vif-induced degradation in Old World Monkeys 5-6 MYA [42, 43]. The adaptation at position 128, which also provides species-specific resistance, is thought to be more recent [42, 43, 50]. While these coevolution methods alone may not yet be accurate enough to identify functional residues, they potentially enhance other evolutionary analyses. For example, of the many APOBEC sites under positive selection [43], it is reasonable that lentiviruses are more likely shaping the evolution of those sites that coevolve with Vif than sites that coevolve with other viral or virus-like agents.

Second, we visualized the localization of Vif residues predicted to be coevolving with A3G on a partial structure of Vif in complex with cofactors utilized for protein ubiquitination [51] (Figure 4). The authors of [51] observed that a critical subset of the Vif positives is solvent-exposed. We reevaluated performance with only these residues as the positives (Supplemental Figure S15). All methods identify the putative solvent-exposed interface with poor precision; CMP_cor at 50% and CMP_vol at 10% are the only methods with precision >6%.

Figure 4:

HIV-1 Vif (light blue) in complex with co-factors (grey) without APOBEC3G (A3G) (PDB ID: 4N9F). Residues in red are A: previously known essential residues, which were used to optimize precision (PPV); B-D: residues predicted to be coevolving with A3G using CMP_chg, MI, and DI, respectively. E: Few Vif residues previously known to interact with A3G are correctly predicted by more than four methods, and none by methods in all classes (Information-based, Direct, Phylogenetic). See Misc. Abbreviations and Table 1 for abbreviations.

Our analysis of the Vif-A3G interaction confirms that power to detect functionally important residues in each protein family is also low in inter-protein analyses between species, even though it is plausible that an arms race between lentivirus and mammal would give rise to stronger signals of coevolution compared to background. It is important to consider that the positions we treated as positives may not all be of equal evolutionary importance across primates. Interfaces may be gained or lost, and the rapid evolution of the two proteins likely produces many alternative solutions to maintaining an antagonistic interaction. Many predicted positions were not among the positives, and further systematic validation and more comprehensive sequencing of lentiviruses and primates are needed to determine which pairs of residues are actually in close proximity or functionally required for other reasons. Coevolutionary analysis can help to generate and prioritize candidates for these experiments. Additionally, there appears to be some level of complementarity between the predictions made by VI and MI_minh and those made by the CMP methods, which measure different biochemical trade-offs between coevolving residues. This strengthens the rationale for integrating methods to better predict interface residues experiencing potentially different evolutionary constraints (e.g., structural, catalytic activity, specificity).

Cross-Species Case Study 2: The interaction network of HIV and human proteins shows only weak evidence of coevolution across mammals

We sought to use inter-protein residue coevolution to refine a recently derived APMS protein-protein interaction network of the HIV-human interactome [33]. This study detected human proteins that interact with each HIV protein, either via direct physical contact or as members of complexes. Specifically, we hoped to use evidence of sequence coevolution to resolve direct versus indirect protein interactions amongst all human proteins measured to interact with each HIV protein. Secondly, we wanted to know if coevolutionary signals are strong enough to pinpoint key residues involved in the interfaces of any direct interactions.

For each protein in the HIV genome, we computed a multiple sequence alignment with all other sequenced immunodeficiency viruses that infect mammals with sequenced genomes. Similarly, we generated a multiple alignment of each human protein with the sequences of its orthologs from any mammal with a sequenced immunodeficiency virus. This produced pairs of host-virus protein alignments with up to six immunodeficiency viruses and their primate, feline, and bovid hosts. For each pair of residues in a host-virus protein pair, we quantified coevolution using MI_j and a semi-parametric bootstrap to calculate P-values (see Supplemental Text: Simulating independently evolving pairs of alignments). For each protein pair, we varied the significance threshold and counted the significantly coevolving residue pairs. We then compared this count for interacting protein pairs from the APMS network versus a control set of 100 randomly chosen lentivirus-mammal protein pairs not included in the APMS network. We found that APMS-detected interactions have only marginally more significantly coevolving residue pairs than non-interactions (best auROC = 0.541 at P_bootstrap < 0.0001), and therefore counts of coevolving residues are not sensitive enough to distinguish direct interactions, or the residues involved in them, for this set of virus and host proteins. Based on our benchmarking, we conclude that this lack of signal may result from low power due to the scarcity of sequenced lentivirus-mammal proteome pairs.
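As a rough illustration of this comparison, the sketch below counts significant residue pairs for each protein pair and computes the auROC as the probability that a randomly chosen interacting pair has a higher count than a randomly chosen control pair (ties counted as one half). The function names and the pairwise formulation are assumptions for the example; the source does not state how the auROC was computed.

```python
def count_significant(pvalues, threshold):
    """Number of residue pairs in one protein pair with P below threshold."""
    return sum(1 for p in pvalues if p < threshold)


def auroc(interacting_counts, control_counts):
    """Area under the ROC curve via pairwise comparison.

    Equivalent to the Mann-Whitney U statistic divided by n1 * n2:
    each (interacting, control) pair scores 1 if the interacting count is
    larger, 0.5 on a tie, and 0 otherwise.
    """
    wins = 0.0
    for x in interacting_counts:
        for y in control_counts:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(interacting_counts) * len(control_counts))
```

An auROC near 0.5, such as the 0.541 reported above, means the counts rank interacting and control protein pairs almost interchangeably.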

Discussion

In this work we aimed to paint a picture of the performance of emerging methods to identify inter-protein contacts using coevolution and to identify properties of alignments where performance is expected to be best. As previously noted for intra-protein predictions [3, 9, 14], re-weighting the sequences to account for the underlying phylogeny is important for inter-protein predictions as well. However, as the comparison between MI_w and MI shows, it is important to tune the parameters controlling the re-weighting in cases where there are fast-evolving alignment columns in an otherwise conserved protein family. Fortunately, methods that search for direct correlations using a global statistical model of the sequence alignments seem able to correct for improper weighting (compare MI_w to DI). These methods are more precise at strict false positive rates than their counterparts, especially when the alignments have Neff/L < 1.0. However, it may be beneficial to use a faster, MI-based method if the use case allows for a relaxed FPR and prioritizes power over precision.

We also investigated the use of three null models to control the false positive rate. Counter-intuitively, a null model that explicitly simulates evolution independently for each alignment fails to control the false positive rate. We believe that our simulated alignments systematically score too low because they fail to capture the correct amount of variation in the observed alignments, resulting in artificially significant P-values, except when small alignment sizes result in overly conservative P-values. Using a standard normal or the empirical distribution of scores as null models also failed to control the false positive rate, likely due to the correlation structure imposed by the shared evolutionary history of the residues, the distribution of evolutionary rates of the residues, or because asymptotic assumptions do not hold at small sample sizes. Thus, choosing an appropriate P-value cutoff in a real analysis, when the structure is unknown and the alignment is shallow, remains a challenge. However, we show that in sufficiently diverse alignments the empirical null successfully controls the false positive rate for Direct methods. As a future direction, we suggest exploring theoretical null distributions that can be parameterized for individual alignment column pairs, such as [41], or further improving protein evolution simulators to generate score distributions in which the evolutionary rates are more similar between the null and alternative hypotheses.
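For concreteness, one common way to use the empirical distribution of scores as a null is to assign each residue pair the fraction of all scores in the same alignment that are at least as large. The sketch below assumes this rank-based definition; it is an illustration and may differ in detail from the P_empirical used in the benchmark.

```python
from bisect import bisect_left


def empirical_pvalues(scores):
    """Rank-based empirical P-values (assumed definition).

    For each score s, return the fraction of all scores that are >= s,
    so the single highest score among n receives P = 1/n.
    """
    ordered = sorted(scores)
    n = len(ordered)
    # bisect_left gives the count of scores strictly below s.
    return [(n - bisect_left(ordered, s)) / n for s in scores]
```

Because every P-value is computed relative to the other column pairs in the same alignment, this null cannot distinguish an alignment with many true contacts from one with none, which is one reason the choice of cutoff remains difficult.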

A related problem to the one discussed here is to search a large set of protein pairs (within or between species) to determine which ones are interacting. In this setting, coevolution method performance is potentially more important than when predicting contacting residues for known interactions, because the search space will contain so many negatives (i.e., non-interacting pairs). A permissive P-value cutoff will lead to a large number of false positives and that may misinform investigators, while being too strict will lead to false negatives that keep potentially important findings hidden. While models exist that identify cutoffs based on benchmark data sets (e.g., Supplemental Figures S8 and S9, [24]), it would be interesting to understand why the parameters in these studies are appropriate and if they generalize to all protein-protein interactions. Ideally, we would like to understand what a null model teaches us about phylogeny-induced coevolution in the absence of structural inter- or intra-protein constraints. Another challenge for predicting interacting protein pairs from coevolutionary tests is how to summarize statistics for individual pairs of residues to produce a single score for a pair of proteins. Based on some preliminary investigations of these questions, we conclude that it is unlikely that cross-species interacting protein pairs can be accurately distinguished from non-interacting pairs on a genome-wide scale.

The progress of high-throughput interaction mapping highlights the need for continued refinement of inter-protein coevolution detection methods. Given that improper re-weighting of sequences can negatively affect power and the false positive rate, a logical next step is to expand Direct methods so that they obtain sequence weights independently for each alignment, or use an evolution-based probabilistic weight (such as in CoMap or phylogenetic logistic regression) for unusual variation in each column. Another important contribution would be to develop a generalizable null model that can help differentiate contacts when very few sequences are available for protein families. Furthermore, investigating the correlations among the coevolution statistics themselves in inter-protein data sets could potentially disentangle structural from non-structural coevolutionary forces and could serve as the basis for an ensemble method. Comprehensively sequencing orthologous pairs of protein families is a straightforward way to test the usefulness of these future contributions while simultaneously enabling current methods to perform to their fullest.

Conclusion

We benchmarked 13 coevolution methods on 33 protein interactions with associated sequence alignments of varying depths. We conclude that coevolutionary analyses of cross-species protein-protein interactions are largely hindered by a lack of phylogenetically deep protein alignments for many proteins, and further demonstrate this in two example use cases involving HIV-human protein interactions. Additionally, we report that commonly used null distributions generally fail to control false positives in coevolutionary analyses, though errors are best controlled by the empirical null in large alignments.

Materials and Methods

Multiple sequence alignments

A master alignment of 8998 concatenated HisKA and RR sequences from [36] was graciously provided by the authors. From this alignment, aligned sequences were sampled uniformly (each sequence had equal probability of being sampled) to create sub-alignments with 5, 50, 250, 500, 1000, and 5000 sequences. We sampled 10 sub-alignments of each alignment size (number of sequences in sub-alignment), resulting in 60 total alignment pairs.
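The sub-sampling scheme can be sketched as below. The alignment sizes and the number of replicates come from the text; the function name, the fixed seed, and the choice to sample without replacement within each sub-alignment are assumptions made for the example.

```python
import random

SIZES = [5, 50, 250, 500, 1000, 5000]  # sequences per sub-alignment
REPLICATES = 10                        # sub-alignments per size


def sample_subalignments(sequences, sizes=SIZES, replicates=REPLICATES, seed=0):
    """Draw sub-alignments uniformly at random from a master alignment.

    sequences is a list of aligned (concatenated) sequences. Each
    sub-alignment is drawn without replacement (an assumption), so every
    sequence has the same probability of being included.
    Returns a dict keyed by (size, replicate_index).
    """
    rng = random.Random(seed)
    subs = {}
    for n in sizes:
        for rep in range(replicates):
            subs[(n, rep)] = rng.sample(sequences, n)
    return subs
```

With six sizes and ten replicates each, this yields the 60 sub-alignments described above.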

The alignments in [24] were downloaded from the complexes section of the Baker lab website (http://gremlin.bakerlab.org/complexes/PDB/_benchmark/_alignments.zip) on Aug 29, 2014. The corresponding structures were downloaded from PDB and processed to obtain contacts between representative protein chains.

The CoMap implementation requires a preprocessing step to remove sequence redundancy (a data munging alternative to sequence weighting). This additional step was also necessary to prevent buffer underflow errors when evaluating likelihoods in very large input trees. Therefore, all alignments with more than 200 sequences were culled to contain the 200 most diverse sequences before being passed to CoMap. The sub-alignment used corresponds to the 200-leaf sub-tree that maximizes PD for each original input alignment and tree.

Measuring coevolution

The coevolution methods used are listed in Table 1 and Table S1. Wrappers for the Direct methods are provided in coevo_tools to facilitate running them from the command line. Methods in the plmDCA, mfDCA, and hpDCA packages require MATLAB or the MATLAB runtime executable, as well as various MATLAB Toolbox dependencies and licenses.

Evaluating coevolution performance

For each method, coevolution scores for pairs of amino acid positions were used to predict inter-domain pairs of amino acid residues that are close to each other in the representative co-crystal structure (PDB ID: 3DGE).

We define positives as pairs of alignment positions mapping to amino acid residues whose beta carbons (C_β) are less than 8 angstroms apart in 3DGE. All other pairs of alignment positions are considered negatives.

We considered the following two alternative definitions of positives:

Closest non-hydrogen atom-atom distance between residues is less than 6 angstroms
C_β distance is less than 8 angstroms and at least one residue is mentioned as important in determining specificity of the HisKA-RR interaction in [52–56].

Residue pairs are predicted as coevolving if their scores or P-values pass a given threshold (e.g., top 1% of scores, P < 0.05) (Table S4).
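The primary definition of positives and a top-fraction prediction rule can be sketched as follows. The 8 angstrom C_β cutoff and the top 1% example come from the text; the function names, the coordinate dictionaries, and the convention that higher scores indicate stronger coevolution are assumptions made for the example.

```python
import math


def contact_pairs(cb_a, cb_b, cutoff=8.0):
    """Inter-protein positives: pairs whose C_beta atoms are < cutoff apart.

    cb_a and cb_b map alignment positions to (x, y, z) coordinates in
    angstroms for the two proteins.
    """
    return {(i, j)
            for i, a in cb_a.items()
            for j, b in cb_b.items()
            if math.dist(a, b) < cutoff}


def evaluate(scores, positives, top_fraction=0.01):
    """Predict the top-scoring fraction of pairs as coevolving and report
    precision, power (TPR), and false positive rate.

    scores maps (i, j) -> coevolution score; higher is assumed to mean
    stronger evidence of coevolution.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    predicted = set(ranked[:k])
    tp = len(predicted & positives)
    fp = k - tp
    n_pos = len(positives & set(scores))
    n_neg = len(scores) - n_pos
    return {
        "precision": tp / k,
        "power": tp / n_pos if n_pos else 0.0,
        "fpr": fp / n_neg if n_neg else 0.0,
    }
```

The alternative definitions of positives listed above would only change `contact_pairs`; the evaluation step stays the same.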

Phylogenetic diversity

Phylogenetic diversity (PD) is calculated as the sum of the branch lengths in a tree built from the concatenated multiple sequence alignment of both proteins. Trees were built using FastTree (version 2.1.7 SSE3) with options -gamma -nosupport -wag.
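Given a tree in Newick format, PD as defined here can be computed by summing every branch length. The sketch below is a minimal illustration that assumes an unquoted Newick string without support values (as expected when trees are built with -nosupport), so that every number following a colon is a branch length; the function name is hypothetical.

```python
import re

# A branch length is the number that follows a ':' in Newick notation.
_BRANCH_LENGTH = re.compile(r":\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")


def phylogenetic_diversity(newick):
    """Sum of all branch lengths in a Newick tree string.

    Assumes labels contain no ':' characters and that the tree carries
    no support values, so each match is a branch length.
    """
    return sum(float(x) for x in _BRANCH_LENGTH.findall(newick))
```

For example, the tree `((A:0.1,B:0.2):0.05,C:0.3);` has a PD of approximately 0.65.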

Abbreviations

CoMap is abbreviated CMP in the main text and figures and CoMapP in supplemental figures. Effective number of sequences per column is abbreviated Neff/L. Phylogenetic diversity is abbreviated PD. MI_Hmin appears as MIminh in figure legends. Precision (PPV) optimized metrics: ppvcut, ppvmax, ppvTPR, and ppvFPR are, respectively, the P_empirical threshold that maximizes PPV, the maximum PPV, and the power (TPR) and false positive rate (FPR) at that threshold.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

AA carried out the analysis. AA and KSP designed the analysis and wrote the manuscript. All authors read and approved the final manuscript.

Authors’ information

AA is a Bioinformatics graduate student at the University of California San Francisco in the laboratory of Dr. Katherine S. Pollard.

KSP is a Senior Investigator at the Gladstone Institutes and Professor of Epidemiology and Biostatistics at the University of California San Francisco.

Acknowledgements

We thank Martin Weigt for providing HisKA and RR alignments and for providing links to DCA source code. We also thank Julien Dutheil for help running CoMap correctly. This work was supported by a National Institutes of Health bioinformatics training grant, a UCSF Graduate Research Mentorship Fellowship, institutional funding from Gladstone Institutes, and a gift from the San Simeon Fund.

Footnotes

<aram.avilaherrera{at}ucsf.edu>
<katherine.pollard{at}gladstone.ucsf.edu>

References

[1] Yip KY, Patel P, Kim PM, Engelman DM, McDermott D, Gerstein M: An integrated system for studying residue coevolution in proteins. Bioinformatics 2008, 24:290–2.
[2] Dutheil J, Galtier N: Detecting groups of coevolving positions in a molecule: A clustering approach. BMC Evol Biol 2007, 7:242.
[3] Dutheil JY: Detecting coevolving positions in a molecule: Why and how to account for phylogeny. Briefings in bioinformatics 2012, 13:228–43.
[4] Juan D de, Pazos F, Valencia A: Emerging methods in protein co-evolution. Nature reviews Genetics 2013, 14:249–61.
[5] Buslje CM, Santos J, Delfino JM, Nielsen M: Correction for phylogeny, small number of observations and data redundancy improves the identification of coevolving amino acid pairs using mutual information. Bioinformatics 2009, 25:1125–31.
[6] Fares MA, Travers SA: A novel method for detecting intramolecular coevolution: Adding a further dimension to selective constraints analyses. Genetics 2006, 173:9–23.
[7] Dahirel V, Shekhar K, Pereyra F, Miura T, Artyomov M, Talsania S, Allen TM, Altfeld M, Carrington M, Irvine DJ, Walker BD, Chakraborty AK: Coordinate linkage of HIV evolution reveals regions of immunological vulnerability. Proceedings of the National Academy of Sciences of the United States of America 2011, 108:11530–5.
[8] Dunn SD, Wahl LM, Gloor GB: Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics 2008, 24:333–40.
[9] Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, Zecchina R, Onuchic JN, Hwa T, Weigt M: Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proceedings of the National Academy of Sciences of the United States of America 2011, 108:E1293–301.
[10] Dutheil J, Pupko T, Jean-Marie A, Galtier N: A model-based approach for detecting coevolving positions in a molecule. Molecular biology and evolution 2005, 22:1919–28.
[11] Pollock DD, Taylor WR, Goldman N: Coevolving protein residues: Maximum likelihood identification and relationship to structure. Journal of molecular biology 1999, 287:187–98.
[12] Caporaso JG, Smit S, Easton BC, Hunter L, Huttley GA, Knight R: Detecting coevolution without phylogenetic trees? Tree-ignorant metrics of coevolution perform as well as tree-aware metrics. BMC evolutionary biology 2008, 8:327.
[13] Weigt M, White RA, Szurmant H, Hoch JA, Hwa T: Identification of direct residue contacts in protein-protein interaction by message passing. Proceedings of the National Academy of Sciences of the United States of America 2009, 106:67–72.
[14] Jones DT, Buchan DW, Cozzetto D, Pontil M: PSICOV: Precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics 2012, 28:184–90.
[15] Burger L, Nimwegen E van: Disentangling direct from indirect co-evolution of residues in protein alignments. PLoS computational biology 2010, 6:e1000633.
[16] Delaporte E, Wyler Lazarevic CA, Iten A, Sudre P: Large measles outbreak in Geneva, Switzerland, January to August 2011: Descriptive epidemiology and demonstration of quarantine effectiveness. Euro surveillance: bulletin Europeen sur les maladies transmissibles = European communicable disease bulletin 2013, 18.
[17] Clark GW, Ackerman SH, Tillier ER, Gatti DL: Multidimensional mutual information methods for the analysis of covariation in multiple sequence alignments. BMC Bioinformatics 2014, 15:157.
[18] McLaughlin Jr. RN, Poelwijk FJ, Raman A, Gosal WS, Ranganathan R: The spatial architecture of protein function and adaptation. Nature 2012, 491:138–42.
[19] Kamisetty H, Ovchinnikov S, Baker D: Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era. Proceedings of the National Academy of Sciences of the United States of America 2013, 110:15674–9.
[20] Hopf TA, Scharfe CP, Rodrigues JP, Green AG, Kohlbacher O, Sander C, Bonvin AM, Marks DS: Sequence co-evolution gives 3D contacts and structures of protein complexes. Elife 2014, 3.
[21] Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, Zecchina R, Sander C: Protein 3D structure computed from evolutionary sequence variation. PloS one 2011, 6:e28766.
[22] Hopf TA, Colwell LJ, Sheridan R, Rost B, Sander C, Marks DS: Three-dimensional structures of membrane proteins from genomic sequencing. Cell 2012, 149:1607–21.
[23] Marks DS, Hopf TA, Sander C: Protein structure prediction from sequence variation. Nature biotechnology 2012, 30:1072–80.
[24] Ovchinnikov S, Kamisetty H, Baker D: Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information. Elife 2014, 3:e02030.
[25] Juan D, Pazos F, Valencia A: High-confidence prediction of global interactomes based on genome-wide coevolutionary networks. Proceedings of the National Academy of Sciences of the United States of America 2008, 105:934–9.
[26] Gershoni M, Fuchs A, Shani N, Fridman Y, Corral-Debrinski M, Aharoni A, Frishman D, Mishmar D: Coevolution predicts direct interactions between mtDNA-encoded and nDNA-encoded subunits of oxidative phosphorylation complex I. Journal of molecular biology 2010, 404:158–71.
[27] Clark NL, Gasper J, Sekino M, Springer SA, Aquadro CF, Swanson WJ: Coevolution of interacting fertilization proteins. PLoS genetics 2009, 5:e1000570.
[28] Yeang CH, Haussler D: Detecting coevolution in and among protein domains. PLoS computational biology 2007, 3:e211.
[29] Morris JH, Knudsen GM, Verschueren E, Johnson JR, Cimermancic P, Greninger AL, Pico AR: Affinity purification-mass spectrometry and network analysis to understand protein-protein interactions. Nat Protoc 2014, 9:2539–54.
[30] Brückner A, Polge C, Lentze N, Auerbach D, Schlattner U: Yeast two-hybrid, a powerful tool for systems biology. International Journal of Molecular Sciences 2009, 10:2763–2788.
[31] Vidal M, Fields S: The yeast two-hybrid assay: Still finding connections after 25 years. Nature methods 2014, 11:1203–1206.
[32] Michnick SW, Ear PH, Landry C, Malleshaiah MK, Messier V: Protein-fragment complementation assays for large-scale analysis, functional dissection and dynamic studies of protein-protein interactions in living cells. Methods Mol Biol 2011, 756:395–425.
[33] Jager S, Cimermancic P, Gulbahce N, Johnson JR, McGovern KE, Clarke SC, Shales M, Mercenne G, Pache L, Li K, Hernandez H, Jang GM, Roth SL, Akiva E, Marlett J, Stephens M, D’Orso I, Fernandes J, Fahey M, Mahon C, O’Donoghue AJ, Todorovic A, Morris JH, Maltby DA, Alber T, Cagney G, Bushman FD, Young JA, Chanda SK, Sundquist WI, et al.: Global landscape of HIV-human protein complexes. Nature 2012, 481:365–70.
[34] Shapira SD, Gat-Viks I, Shum BO, Dricot A, Grace MM de, Wu L, Gupta PB, Hao T, Silver SJ, Root DE, Hill DE, Regev A, Hacohen N: A physical and regulatory map of host-influenza interactions reveals pathways in H1N1 infection. Cell 2009, 139:1255–67.
[35] Liao HX, Lynch R, Zhou T, Gao F, Alam SM, Boyd SD, Fire AZ, Roskin KM, Schramm CA, Zhang Z, Zhu J, Shapiro L, Mullikin JC, Gnanakaran S, Hraber P, Wiehe K, Kelsoe G, Yang G, Xia SM, Montefiori DC, Parks R, Lloyd KE, Scearce RM, Soderberg KA, Cohen M, Kamanga G, Louder MK, Tran LM, Chen Y, Cai F, et al.: Co-evolution of a broadly neutralizing HIV-1 antibody and founder virus. Nature 2013, 496:469–76.
[36] Procaccini A, Lunt B, Szurmant H, Hwa T, Weigt M: Dissecting the specificity of protein-protein interaction in bacterial two-component signaling: Orphans and crosstalks. PLoS ONE 2011, 6:e19729.
[37] Schug A, Weigt M, Onuchic JN, Hwa T, Szurmant H: High-resolution protein complexes from integrating genomic information with molecular simulation. Proceedings of the National Academy of Sciences of the United States of America 2009, 106:22124–9.
[38] Ekeberg M, Lovkvist C, Lan Y, Weigt M, Aurell E: Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models. Physical review E, Statistical, nonlinear, and soft matter physics 2013, 87:012707.
[39] Gouveia-Oliveira R, Roque FS, Wernersson R, Sicheritz-Ponten T, Sackett PW, Molgaard A, Pedersen AG: InterMap3D: Predicting and visualizing co-evolving protein residues. Bioinformatics 2009, 25:1963–5.
[40] Wollenberg KR, Atchley WR: Separation of phylogenetic and functional associations in biological sequences by using the parametric bootstrap. Proceedings of the National Academy of Sciences of the United States of America 2000, 97:3288–91.
[41] Goebel B, Dawy Z, Hagenauer J, Mueller JC: An approximation to the distribution of finite sample size mutual information estimates. In Communications, 2005. ICC 2005. 2005 IEEE international conference on. Volume 2. IEEE; 2005:1102–1106.
[42] Compton AA, Hirsch VM, Emerman M: The host restriction factor APOBEC3G and retroviral Vif protein coevolve due to ongoing genetic conflict. Cell host & microbe 2012, 11:91–8.
[43] Compton AA, Emerman M: Convergence and divergence in the evolution of the APOBEC3G-Vif interaction reveal ancient origins of simian immunodeficiency viruses. PLoS pathogens 2013, 9:e1003135.
[44] Chen G, He Z, Wang T, Xu R, Yu XF: A patch of positively charged amino acids surrounding the human immunodeficiency virus type 1 Vif SLVx4Yx9Y motif influences its interaction with APOBEC3G. Journal of virology 2009, 83:8674–82.
[45] Russell RA, Pathak VK: Identification of two distinct human immunodeficiency virus type 1 Vif determinants critical for interactions with human APOBEC3G and APOBEC3F. Journal of virology 2007, 81:8201–10.
[46] Zhang H, Pomerantz RJ, Dornadula G, Sun Y: Human immunodeficiency virus type 1 Vif protein is an integral component of an mRNP complex of viral RNA and could be involved in the viral RNA folding and packaging process. Journal of virology 2000, 74:8252–61.
[47] He Z, Zhang W, Chen G, Xu R, Yu XF: Characterization of conserved motifs in HIV-1 Vif required for APOBEC3G and APOBEC3F interaction. Journal of molecular biology 2008, 381:1000–11.
[48] Zhang L, Saadatmand J, Li X, Guo F, Niu M, Jiang J, Kleiman L, Cen S: Function analysis of sequences in human APOBEC3G involved in Vif-mediated degradation. Virology 2008, 370:113–21.
[49] Russell RA, Smith J, Barr R, Bhattacharyya D, Pathak VK: Distinct domains within APOBEC3G and APOBEC3F interact with separate regions of human immunodeficiency virus type 1 Vif. Journal of virology 2009, 83:1992–2003.
[50] Xu H, Svarovskaia ES, Barr R, Zhang Y, Khan MA, Strebel K, Pathak VK: A single amino acid substitution in human APOBEC3G antiretroviral enzyme confers resistance to HIV-1 virion infectivity factor-induced depletion. Proceedings of the National Academy of Sciences of the United States of America 2004, 101:5652–7.
[51] Guo Y, Dong L, Qiu X, Wang Y, Zhang B, Liu H, Yu Y, Zang Y, Yang M, Huang Z: Structural basis for hijacking CBF-beta and CUL5 E3 ligase complex by HIV-1 Vif. Nature 2014, 505:229–33.
[52] Casino P, Rubio V, Marina A: Structural insight into partner specificity and phosphoryl transfer in two-component signal transduction. Cell 2009, 139:325–36.
[53] Li L, Shakhnovich EI, Mirny LA: Amino acids determining enzyme-substrate specificity in prokaryotic and eukaryotic protein kinases. Proceedings of the National Academy of Sciences of the United States of America 2003, 100:4463–8.
[54] Haldimann A, Prahalad MK, Fisher SL, Kim SK, Walsh CT, Wanner BL: Altered recognition mutants of the response regulator PhoB: A new genetic strategy for studying protein-protein interactions. Proceedings of the National Academy of Sciences of the United States of America 1996, 93:14361–6.
[55] Skerker JM, Perchuk BS, Siryaporn A, Lubin EA, Ashenberg O, Goulian M, Laub MT: Rewiring the specificity of two-component signal transduction systems. Cell 2008, 133:1043–54.
[56] Laub MT, Goulian M: Specificity in two-component signal transduction pathways. Annual review of genetics 2007, 41:121–45.
[57] Shannon CE: A mathematical theory of communication. The Bell System Technical Journal 1948, 27:379–423, 623–656.
[58] Meila M: Comparing clusterings–an information based distance. Journal of Multivariate Analysis 2007, 98:873–895.
[59] Cocco S, Monasson R, Weigt M: From principal component to direct coupling analysis of coevolution in proteins: Low-eigenvalue modes are needed for structure prediction. PLoS computational biology 2013, 9:e1003176.

Posted February 20, 2015.

Subject Area

Evolutionary Biology

[1] Yip KY, Patel P, Kim PM, Engelman DM, McDermott D, Gerstein M: An integrated system for studying residue coevolution in proteins. Bioinformatics 2008, 24:290–2.

[2] Dutheil J, Galtier N: Detecting groups of coevolving positions in a molecule: A clustering approach. BMC Evol Biol 2007, 7:242.

[3] Dutheil JY: Detecting coevolving positions in a molecule: Why and how to account for phylogeny. Briefings in bioinformatics 2012, 13:228–43.

[4] Juan D de, Pazos F, Valencia A: Emerging methods in protein co-evolution. Nature reviews Genetics 2013, 14:249–61.

[5] Buslje CM, Santos J, Delfino JM, Nielsen M: Correction for phylogeny, small number of observations and data redundancy improves the identification of coevolving amino acid pairs using mutual information. Bioinformatics 2009, 25:1125–31.

[6] Fares MA, Travers SA: A novel method for detecting intramolecular coevolution: Adding a further dimension to selective constraints analyses. Genetics 2006, 173:9–23.

[7] Dahirel V, Shekhar K, Pereyra F, Miura T, Artyomov M, Talsania S, Allen TM, Altfeld M, Carrington M, Irvine DJ, Walker BD, Chakraborty AK: Coordinate linkage of HIV evolution reveals regions of immunological vulnerability. Proceedings of the National Academy of Sciences of the United States of America 2011, 108:11530–5.

[8] Dunn SD, Wahl LM, Gloor GB: Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics 2008, 24:333–40.

[9] Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, Zecchina R, Onuchic JN, Hwa T, Weigt M: Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proceedings of the National Academy of Sciences of the United States of America 2011, 108:E1293–301.

[10] Dutheil J, Pupko T, Jean-Marie A, Galtier N: A model-based approach for detecting coevolving positions in a molecule. Molecular biology and evolution 2005, 22:1919–28.

[11] Pollock DD, Taylor WR, Goldman N: Coevolving protein residues: Maximum likelihood identification and relationship to structure. Journal of molecular biology 1999, 287:187–98.

[12] Caporaso JG, Smit S, Easton BC, Hunter L, Huttley GA, Knight R: Detecting coevolution without phylogenetic trees? Tree-ignorant metrics of coevolution perform as well as tree-aware metrics. BMC evolutionary biology 2008, 8:327.

[13] Weigt M, White RA, Szurmant H, Hoch JA, Hwa T: Identification of direct residue contacts in protein-protein interaction by message passing. Proceedings of the National Academy of Sciences of the United States of America 2009, 106:67–72.

[14] Jones DT, Buchan DW, Cozzetto D, Pontil M: PSICOV: Precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics 2012, 28:184–90.

[15] Burger L, Nimwegen E van: Disentangling direct from indirect co-evolution of residues in protein alignments. PLoS computational biology 2010, 6:e1000633.

[16] Delaporte E, Wyler Lazarevic CA, Iten A, Sudre P: Large measles outbreak in Geneva, Switzerland, January to August 2011: Descriptive epidemiology and demonstration of quarantine effectiveness. Euro surveillance: bulletin Europeen sur les maladies transmissibles = European communicable disease bulletin 2013, 18.

[17] Clark GW, Ackerman SH, Tillier ER, Gatti DL: Multidimensional mutual information methods for the analysis of covariation in multiple sequence alignments. BMC Bioinformatics 2014, 15:157.

[18] McLaughlin Jr. RN, Poelwijk FJ, Raman A, Gosal WS, Ranganathan R: The spatial architecture of protein function and adaptation. Nature 2012, 491:138–42.

[19] Kamisetty H, Ovchinnikov S, Baker D: Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era. Proceedings of the National Academy of Sciences of the United States of America 2013, 110:15674–9.

[20] Hopf TA, Scharfe CP, Rodrigues JP, Green AG, Kohlbacher O, Sander C, Bonvin AM, Marks DS: Sequence co-evolution gives 3D contacts and structures of protein complexes. Elife 2014, 3.

[21] Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, Zecchina R, Sander C: Protein 3D structure computed from evolutionary sequence variation. PloS one 2011, 6:e28766.

[22] Hopf TA, Colwell LJ, Sheridan R, Rost B, Sander C, Marks DS: Three-dimensional structures of membrane proteins from genomic sequencing. Cell 2012, 149:1607–21.

[23] Marks DS, Hopf TA, Sander C: Protein structure prediction from sequence variation. Nature biotechnology 2012, 30:1072–80.

[24] Ovchinnikov S, Kamisetty H, Baker D: Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information. Elife 2014, 3:e02030.

[25] Juan D, Pazos F, Valencia A: High-confidence prediction of global interactomes based on genome-wide coevolutionary networks. Proceedings of the National Academy of Sciences of the United States of America 2008, 105:934–9.

[26] Gershoni M, Fuchs A, Shani N, Fridman Y, Corral-Debrinski M, Aharoni A, Frishman D, Mishmar D: Coevolution predicts direct interactions between mtDNA-encoded and nDNA-encoded subunits of oxidative phosphorylation complex I. Journal of molecular biology 2010, 404:158–71.

[27] Clark NL, Gasper J, Sekino M, Springer SA, Aquadro CF, Swanson WJ: Coevolution of interacting fertilization proteins. PLoS genetics 2009, 5:e1000570.

[28] Yeang CH, Haussler D: Detecting coevolution in and among protein domains. PLoS computational biology 2007, 3:e211.

[29] Morris JH, Knudsen GM, Verschueren E, Johnson JR, Cimermancic P, Greninger AL, Pico AR: Affinity purification-mass spectrometry and network analysis to understand protein-protein interactions. Nat Protoc 2014, 9:2539–54.

[30] Brückner A, Polge C, Lentze N, Auerbach D, Schlattner U: Yeast two-hybrid, a powerful tool for systems biology. International Journal of Molecular Sciences 2009, 10:2763–2788.

[31] Vidal M, Fields S: The yeast two-hybrid assay: Still finding connections after 25 years. Nature methods 2014, 11:1203–1206.

[32] Michnick SW, Ear PH, Landry C, Malleshaiah MK, Messier V: Protein-fragment complementation assays for large-scale analysis, functional dissection and dynamic studies of protein-protein interactions in living cells. Methods Mol Biol 2011, 756:395–425.

[33] Jager S, Cimermancic P, Gulbahce N, Johnson JR, McGovern KE, Clarke SC, Shales M, Mercenne G, Pache L, Li K, Hernandez H, Jang GM, Roth SL, Akiva E, Marlett J, Stephens M, D’Orso I, Fernandes J, Fahey M, Mahon C, O’Donoghue AJ, Todorovic A, Morris JH, Maltby DA, Alber T, Cagney G, Bushman FD, Young JA, Chanda SK, Sundquist WI, et al.: Global landscape of HIV-human protein complexes. Nature 2012, 481:365–70.

[34] Shapira SD, Gat-Viks I, Shum BO, Dricot A, Grace MM de, Wu L, Gupta PB, Hao T, Silver SJ, Root DE, Hill DE, Regev A, Hacohen N: A physical and regulatory map of host-influenza interactions reveals pathways in H1N1 infection. Cell 2009, 139:1255–67.

[35] Liao HX, Lynch R, Zhou T, Gao F, Alam SM, Boyd SD, Fire AZ, Roskin KM, Schramm CA, Zhang Z, Zhu J, Shapiro L, Mullikin JC, Gnanakaran S, Hraber P, Wiehe K, Kelsoe G, Yang G, Xia SM, Montefiori DC, Parks R, Lloyd KE, Scearce RM, Soderberg KA, Cohen M, Kamanga G, Louder MK, Tran LM, Chen Y, Cai F, et al.: Co-evolution of a broadly neutralizing HIV-1 antibody and founder virus. Nature 2013, 496:469–76.

[36] Procaccini A, Lunt B, Szurmant H, Hwa T, Weigt M: Dissecting the specificity of protein-protein interaction in bacterial two-component signaling: Orphans and crosstalks. PLoS ONE 2011, 6:e19729.

[37] Schug A, Weigt M, Onuchic JN, Hwa T, Szurmant H: High-resolution protein complexes from integrating genomic information with molecular simulation. Proceedings of the National Academy of Sciences of the United States of America 2009, 106:22124–9.

[38] Ekeberg M, Lovkvist C, Lan Y, Weigt M, Aurell E: Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models. Physical review E, Statistical, nonlinear, and soft matter physics 2013, 87:012707.

[39] Gouveia-Oliveira R, Roque FS, Wernersson R, Sicheritz-Ponten T, Sackett PW, Molgaard A, Pedersen AG: InterMap3D: Predicting and visualizing co-evolving protein residues. Bioinformatics 2009, 25:1963–5.

[40] Wollenberg KR, Atchley WR: Separation of phylogenetic and functional associations in biological sequences by using the parametric bootstrap. Proceedings of the National Academy of Sciences of the United States of America 2000, 97:3288–91.

[41] Goebel B, Dawy Z, Hagenauer J, Mueller JC: An approximation to the distribution of finite sample size mutual information estimates. In Communications, 2005. ICC 2005. 2005 IEEE international conference on. Volume 2. IEEE; 2005:1102–1106.

[42] Compton AA, Hirsch VM, Emerman M: The host restriction factor APOBEC3G and retroviral Vif protein coevolve due to ongoing genetic conflict. Cell host & microbe 2012, 11:91–8.

[43] Compton AA, Emerman M: Convergence and divergence in the evolution of the APOBEC3G-Vif interaction reveal ancient origins of simian immunodeficiency viruses. PLoS pathogens 2013, 9:e1003135.

[44] Chen G, He Z, Wang T, Xu R, Yu XF: A patch of positively charged amino acids surrounding the human immunodeficiency virus type 1 Vif SLVx4Yx9Y motif influences its interaction with APOBEC3G. Journal of virology 2009, 83:8674–82.

[45] Russell RA, Pathak VK: Identification of two distinct human immunodeficiency virus type 1 Vif determinants critical for interactions with human APOBEC3G and APOBEC3F. Journal of virology 2007, 81:8201–10.

[46] Zhang H, Pomerantz RJ, Dornadula G, Sun Y: Human immunodeficiency virus type 1 Vif protein is an integral component of an mRNP complex of viral RNA and could be involved in the viral RNA folding and packaging process. Journal of virology 2000, 74:8252–61.

[47] He Z, Zhang W, Chen G, Xu R, Yu XF: Characterization of conserved motifs in HIV-1 Vif required for APOBEC3G and APOBEC3F interaction. Journal of molecular biology 2008, 381:1000–11.

[48] Zhang L, Saadatmand J, Li X, Guo F, Niu M, Jiang J, Kleiman L, Cen S: Function analysis of sequences in human APOBEC3G involved in Vif-mediated degradation. Virology 2008, 370:113–21.

[49] Russell RA, Smith J, Barr R, Bhattacharyya D, Pathak VK: Distinct domains within APOBEC3G and APOBEC3F interact with separate regions of human immunodeficiency virus type 1 Vif. Journal of virology 2009, 83:1992–2003.

[50] Xu H, Svarovskaia ES, Barr R, Zhang Y, Khan MA, Strebel K, Pathak VK: A single amino acid substitution in human APOBEC3G antiretroviral enzyme confers resistance to HIV-1 virion infectivity factor-induced depletion. Proceedings of the National Academy of Sciences of the United States of America 2004, 101:5652–7.

[51] Guo Y, Dong L, Qiu X, Wang Y, Zhang B, Liu H, Yu Y, Zang Y, Yang M, Huang Z: Structural basis for hijacking CBF-beta and CUL5 E3 ligase complex by HIV-1 Vif. Nature 2014, 505:229–33.

[52] Casino P, Rubio V, Marina A: Structural insight into partner specificity and phosphoryl transfer in two-component signal transduction. Cell 2009, 139:325–36.

[53] Li L, Shakhnovich EI, Mirny LA: Amino acids determining enzyme-substrate specificity in prokaryotic and eukaryotic protein kinases. Proceedings of the National Academy of Sciences of the United States of America 2003, 100:4463–8.

[54] Haldimann A, Prahalad MK, Fisher SL, Kim SK, Walsh CT, Wanner BL: Altered recognition mutants of the response regulator PhoB: A new genetic strategy for studying protein-protein interactions. Proceedings of the National Academy of Sciences of the United States of America 1996, 93:14361–6.

[55] Skerker JM, Perchuk BS, Siryaporn A, Lubin EA, Ashenberg O, Goulian M, Laub MT: Rewiring the specificity of two-component signal transduction systems. Cell 2008, 133:1043–54.

[56] Laub MT, Goulian M: Specificity in two-component signal transduction pathways. Annual review of genetics 2007, 41:121–45.

[57] Shannon CE: A mathematical theory of communication. The Bell System Technical Journal 1948, 27:379–423, 623–656.

[58] Meila M: Comparing clusterings–an information based distance. Journal of Multivariate Analysis 2007, 98:873–895.

[59] Cocco S, Monasson R, Weigt M: From principal component to direct coupling analysis of coevolution in proteins: Low-eigenvalue modes are needed for structure prediction. PLoS computational biology 2013, 9:e1003176.
