Abstract
When biomolecules physically interact, natural selection operates on them jointly. Contacting positions in protein and RNA structures exhibit correlated patterns of sequence evolution due to constraints imposed by the interaction, and molecular arms races can develop between interacting proteins in pathogens and their hosts. To evaluate how well methods developed to detect coevolving residues within proteins can be adapted for cross-species, inter-protein analysis, we used statistical criteria to quantify the performance of these methods in detecting inter-protein residues within 8 angstroms of each other in the co-crystal structures of 33 bacterial protein interactions. We also evaluated their performance for detecting known residues at the interface of a host-virus protein complex with a partially solved structure. Our quantitative benchmarking showed that all coevolutionary methods clearly benefit from alignments with many sequences. Methods that aim to detect direct correlations generally outperform other approaches. However, faster mutual information based methods are occasionally competitive in small alignments and with relaxed false positive rates. All commonly used null distributions are anti-conservative and have high false positive rates in some scenarios, although the empirical distribution of scores performs reasonably well with deep alignments. We conclude that coevolutionary analysis of cross-species protein interactions holds great promise but requires sequencing many more species pairs.
Background
Coevolution—“the change of a biological object triggered by the change of a related object” [1]—is a powerful concept when applied to molecular sequence analysis because it reveals positional relationships that are worth preserving across evolutionary time scales. Sequence evolution is constrained by essential molecular interactions, such as contacts within a protein or RNA structure, as well as inter-molecular interactions in protein complexes and signaling pathways. These constraints define an epistasis between sites (residues or base-pairs) where the probability of a substitution depends on the states of other sites [2] involved in an interaction. Because epistasis can induce correlation between substitution patterns across columns in multiple sequence alignments, many methods have been developed that use evidence of coevolving alignment columns to detect physical interactions within and between biomolecules. These methods draw inspiration from diverse techniques in molecular phylogenetics, inverse statistical mechanics, Bayesian graphical modeling, information theory, sparse inference, and spectral theory (reviewed in [3, 4]).
Despite good rationale for coevolutionary approaches, physically interacting alignment columns have been notoriously difficult to identify from correlated patterns of sequence evolution for several reasons. First, shared evolutionary history creates a background of correlated substitution patterns against which it can be difficult to distinguish additional constraints derived from physical interactions. Common phylogeny is particularly strong within a gene family (e.g., predicting intra-molecular contacts). But it is also present across gene families within a species or even between species (e.g., predicting host-virus protein interactions), especially at shorter evolutionary distances where gene trees mirror species trees more closely. Coevolution methods have used a variety of approaches to counter the dependence induced by shared phylogeny, including removing closely related sequences from alignments to reduce non-independence [5, 6], differential weighting of sequences when computing statistics [7–9], and null distributions that directly model or indirectly account for phylogeny [10–13].
A second challenge arises when trying to distinguish correlated evolution that arises from direct versus indirect interactions. Alignment columns that are indirectly implicated in an interaction can be strongly correlated, and most columns are involved in multiple, partially overlapping interactions. For these reasons, close physical interactions may not produce patterns of substitution that are significantly more highly correlated than the background present in structures. This problem has been the focus of a recent class of coevolutionary methods that focuses on reducing the number of incorrect predictions by disentangling direct from indirect correlations [9, 14–17]. An alternative point of view considers these networks of indirectly correlated residues as protein sectors that can easily, through cooperative substitutions, respond to fluctuating evolutionary pressures [18].
Finally, due to low power—resulting in part from the previous two challenges—physically interacting sites can typically only be detected in multiple sequence alignments that span large evolutionary divergences and contain many hundreds to thousands of sequences. Recent evaluations of a number of coevolution methods concluded that accurate contact predictions require alignments with one to five times as many sequences (with <90 % sequence redundancy) as positions [19, 20].
To date, coevolutionary prediction of physically interacting alignment columns has been applied with success to intra-molecular contacts [7, 21–23] and well-characterized inter-molecular interactions [24], such as bacterial two-component signaling systems [25], enzyme complexes [26], and fertilization proteins [27]. The signal-to-noise ratio is too low and the search space too large to use sequence evolution to effectively identify pairs of physically interacting protein residues across entire proteomes; most pairs of sites with correlated substitution patterns are not in direct contact, and most physically interacting sites do not have statistically correlated substitution patterns [28].
However, the ability to now measure physical interactions between biomolecules with high-throughput technologies, such as affinity purification followed by mass spectrometry (APMS) [29], two-hybrid methods [30, 31], and protein complementation assays [32], raises the possibility of using sequence coevolution in a more specific way: to refine predicted interactions in an experimentally reduced search space. For example, correlated substitution patterns in pairs of proteins could help determine if an experimentally measured interaction is likely to represent direct physical contact versus an indirect interaction in a complex or a false positive. Coevolutionary analysis could also be informative regarding which of the sites in a pair of interacting molecules are most likely to be in physical contact.
One particularly exciting application of this approach is to characterize and potentially manipulate interacting residues in host-virus and host-parasite protein interactomes [33, 34]. Newly emerging data on antibody and antigen sequences within a host [35] offers an opportunity to harness coevolutionary signals to investigate the mechanisms of broadly neutralizing antibodies and immune evasion. The primary open question for these new applications is whether existing methods are sensitive and specific enough to detect coevolution with the levels of constraint and divergence that are present in inter-molecular data sets of modest size.
To this end, we designed data processing scripts, statistical evaluation and visualization tools, and simulation pipelines that allowed us to easily extend a suite of coevolution methods designed for intra-protein interaction prediction (Table 1) so that they can be used to test for patterns of correlated sequence evolution at pairs of sites in two different proteins, potentially from different sets of organisms in different parts of the tree of life (e.g., human-bacteria, bacteria-phage interactions). We then applied this integrated framework for coevolutionary analysis to refine and annotate a recently derived human-HIV1 protein-protein interaction network [33] and to test for coevolution in the well studied arms-race interaction between the mammalian cytidine deaminase APOBEC3G (A3G) and its HIV1 antagonist, Vif. Because fewer than ten orthologous mammal-lentivirus proteome pairs have been sequenced and mammalian divergence is low, we hypothesized that power would be low in these settings.
To quantify the limitations of coevolutionary methods when only a handful of sequences are available, we used a data set of 33 within-species bacterial protein-protein interactions. To systematically determine the parameters that affect performance, we focused on the well-characterized interaction between bacterial histidine kinase A (HisKA) and its response regulator (RR), for which a co-crystal structure and thousands of sequences are available. By subsampling HisKA-RR sequence pairs, we show that most methods have appreciable precision or power at low false positive rates for alignments with ~500 or more sequences. However, the best performing method depends on whether power or precision is more important, the number of non-redundant sequences in the alignment, and whether the goal is to find structurally or functionally linked residues. By expanding this analysis to 32 additional bacterial interactions [24], we show that these trends generalize beyond the specific example of HisKA and RR. We conclude that coevolution methods are able to identify some residues important for cross-species protein-protein interactions, but this approach will benefit greatly from additional sequence data.
Results
Performance benchmarking of coevolution methods
The coevolutionary methods benchmarked in our analyses fall into three general groups (Table 1). Information-based methods are various flavors of Mutual Information between pairs of sites, each considered independently. Direct methods are those that consider pairs of sites in the context of a sparse global statistical model for contacts in the multiple sequence alignment. Phylogenetic methods explicitly use a substitution rate matrix and phylogenetic tree in their calculation of a coevolution statistic that may take into account the biochemical and physical properties of amino acid residues, as well as report a P-value based on internal simulation of independently evolving sites. In this benchmark we use the CoMap P-value as a statistic for comparison with other coevolution methods. Other differences among the coevolution methods include the incorporation of two additional techniques that have been shown to improve performance, re-weighting sequences such that similar sequences contribute less to the final score [5] and applying an Average Product Correction (APC) to remove background noise and phylogenetic signal from “raw” coevolution statistics [8].
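As an illustration of the simplest Information-based statistic, the following minimal sketch computes plain Mutual Information between one column from each protein and then applies an Average Product Correction to a rectangular (protein A sites by protein B sites) score matrix. The function names are hypothetical, and using row and column means for the inter-protein APC is an assumption made here for illustration; it is not taken from any of the benchmarked software.

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    """Mutual information (in nats) between two alignment columns of equal length."""
    n = len(col_a)
    count_a, count_b = Counter(col_a), Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        # p(a,b) * log(p(a,b) / (p(a) * p(b))), written in terms of raw counts
        mi += (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
    return mi

def apc(scores):
    """Average Product Correction for a rectangular score matrix (list of rows).

    Assumption: the background term for pair (i, j) is
    row_mean[i] * col_mean[j] / overall_mean.
    """
    n_rows, n_cols = len(scores), len(scores[0])
    row_mean = [sum(row) / n_cols for row in scores]
    col_mean = [sum(scores[i][j] for i in range(n_rows)) / n_rows
                for j in range(n_cols)]
    overall = sum(row_mean) / n_rows
    if overall == 0:
        # Nothing to correct when every score is zero
        return [row[:] for row in scores]
    return [[scores[i][j] - row_mean[i] * col_mean[j] / overall
             for j in range(n_cols)] for i in range(n_rows)]
```

Two perfectly covarying columns with two equally frequent residues each give MI = ln 2, and two independent columns give MI = 0. A score matrix that is entirely explained by per-site background (a rank-one matrix) is reduced to zeros by the correction.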
To benchmark coevolution methods, we used 33 within-species pairs of proteins with co-crystal structures determined from E. coli proteins. These include a set of paired alignments compiled by [24], plus the histidine kinase-response regulator (HisKA-RR) bacterial two-component system from [36], provided by the authors. We included HisKA-RR, because it is a well-characterized interaction with a very large, diverse multiple sequence alignment (8998 sequences for each gene) and genetic evidence supporting several interactions. For these reasons, HisKA-RR has also been used previously in coevolutionary analyses [37].
Because the HisKA-RR alignment is so large, it enabled us to quantify the effects of alignment size and diversity by down-sampling the full alignment to produce a wide range of smaller pairs of HisKA and RR multiple sequence alignments with different numbers of sequences (range 5 to 5000 sequences) and phylogenies from the original alignment. The 32 alignment pairs from [24] naturally varied in size (range 168 to 1428 sequences).
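The key constraint in down-sampling a pair of alignments is that the same rows must be kept in both proteins, so that each HisKA sequence stays matched to its RR partner. The sketch below assumes uniform random sampling of rows, which is only one possible scheme and may differ from the procedure actually used to generate the sub-alignments; `subsample_paired` is a hypothetical helper.

```python
import random

def subsample_paired(aln_a, aln_b, n, seed=0):
    """Draw n matched rows from two paired alignments (lists of aligned strings).

    Row k of aln_a is assumed to be the interaction partner of row k of aln_b.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("paired alignments must have the same number of rows")
    if n > len(aln_a):
        raise ValueError("cannot sample more rows than the alignment contains")
    rng = random.Random(seed)
    # Sample row indices once and apply them to both alignments
    idx = sorted(rng.sample(range(len(aln_a)), n))
    return [aln_a[i] for i in idx], [aln_b[i] for i in idx]
```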
For each pair of multiple sequence alignments from two interacting proteins, we compared every site in the first protein to every site in the second protein and scored these pairs of alignment columns for coevolution using each of the methods in Table 1. We then used coevolution scores to predict inter-domain pairs of amino acid residues that are within 8 angstroms (Å) of each other, measured between Cβ atoms, in the representative co-crystal structure (see Methods). We also repeated our analyses of the HisKA-RR sub-alignments using a stricter definition of contacts that requires additional biochemical evidence for specificity determination, and an alternate definition that measures distance between the closest non-hydrogen atoms. Trends in our results were generally similar across these choices of definition for true interactions, but we observed some differences in performance between definitions when enforcing a false positive rate (FPR) (Figure S2).
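The Cβ-based contact definition can be sketched as follows, assuming the Cβ coordinates of each chain have already been extracted from the co-crystal structure as (x, y, z) tuples in Å. The function name and input layout are hypothetical, and mapping between structure residues and alignment columns is not shown.

```python
import math

def inter_protein_contacts(cb_a, cb_b, cutoff=8.0):
    """Return index pairs (i, j) whose Cβ-Cβ distance is below cutoff (in Å).

    cb_a, cb_b: lists of (x, y, z) Cβ coordinates for the two chains.
    """
    contacts = []
    for i, p in enumerate(cb_a):
        for j, q in enumerate(cb_b):
            # Every residue of chain A is compared with every residue of chain B
            if math.dist(p, q) < cutoff:
                contacts.append((i, j))
    return contacts
```

All remaining inter-protein pairs are treated as negatives when scoring a method against this definition.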
The performance of each method to distinguish contacting pairs of residues (positives) from other residue pairs (negatives) was measured as previously described [14, 38] and evaluated using power (also called recall and true positive rate (TPR)) and precision (also called positive predictive value (PPV)) at a range of low FPRs. Power and precision are complementary performance measures that quantify the percentage of interacting residue pairs that are found and the percentage of identified residue pairs that are interacting, respectively. Precision is a useful measure of performance in cases where positives (contacting pairs of residues) are overwhelmed by negatives (non-contacting residues). A method with high precision is helpful for generating lists of high confidence pairs of residues for expensive follow-up studies, even if it misses a number of truly interacting sites and therefore has relatively low power. We additionally examined four threshold-independent performance measures, area under Receiver-Operator Curve (auROC), area under precision-recall curve (auPR), maximum F1-score (fmax), maximum ϕ (phimax).
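To make the definitions concrete, the sketch below computes power and precision at a target FPR from a list of coevolution scores and contact labels. It assumes that higher scores indicate stronger coevolution and that the threshold is set from the ranked non-contact scores; the function name is hypothetical and this is not the evaluation code used in the study.

```python
def power_precision_at_fpr(scores, labels, target_fpr):
    """Power (TPR) and precision (PPV) at the strictest threshold allowing target_fpr.

    scores: coevolution scores, higher meaning stronger evidence.
    labels: True for contacting pairs (positives), False otherwise.
    """
    neg = sorted((s for s, y in zip(scores, labels) if not y), reverse=True)
    pos = [s for s, y in zip(scores, labels) if y]
    # Number of false positives that the target FPR permits
    k = int(target_fpr * len(neg))
    # Predict pairs scoring strictly above the (k+1)-th highest negative score
    threshold = neg[k] if k < len(neg) else float("-inf")
    tp = sum(s > threshold for s in pos)
    fp = sum(s > threshold for s in neg)
    power = tp / len(pos) if pos else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return power, precision, threshold
```

Because contacts are a small minority of all inter-protein pairs, a method can have high power at a given FPR while still having low precision, which is why the two measures are reported together.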
Physically interacting sites can be accurately detected in large sequence alignments
Our primary finding is that many coevolutionary methods are able to detect inter-molecular contacts at low FPRs in alignments with hundreds of diverse sequences from each protein, consistent with previous studies of intra-molecular contacts [3, 17], specifically when the alignments are deeper than they are long [19, 20]. We capture this rectangular quality in the statistic Neff/L, where Neff is the effective number of sequences as calculated by PSICOV [14] and L is the total number of columns in the pair of alignments. We observe similar trends when we use the number of sequences (N) or their phylogenetic diversity (PD), rather than Neff/L, to compare performance. The relationship between N, PD, and Neff is explored further in the Supplemental Text: Diversity of sequences and Supplemental Figures S10, S11 and S21. The diversity of residues within the individual alignment columns that make up each pair is another important factor to consider, and is explored in the Supplemental Text: Performance by column entropy categories.
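The idea behind Neff/L can be sketched with a common sequence-weighting scheme in which each sequence is down-weighted by the number of sequences similar to it. This is an illustrative approximation only: the 90% identity cutoff and the use of the concatenated pair of alignments are assumptions made here, and PSICOV's own Neff calculation may differ in its details.

```python
def effective_sequences(aln, identity_cutoff=0.9):
    """Sum of sequence weights, where each weight is 1 / (number of similar sequences).

    aln: list of aligned sequences of equal length.
    identity_cutoff: assumed similarity threshold (hypothetical default).
    """
    def identity(s, t):
        return sum(a == b for a, b in zip(s, t)) / len(s)

    neff = 0.0
    for s in aln:
        # The count includes the sequence itself, so it is always at least 1
        similar = sum(identity(s, t) >= identity_cutoff for t in aln)
        neff += 1.0 / similar
    return neff

def neff_over_l(aln_a, aln_b, identity_cutoff=0.9):
    """Neff/L for a pair of alignments, with L the total number of columns in the pair."""
    joined = [a + b for a, b in zip(aln_a, aln_b)]
    return effective_sequences(joined, identity_cutoff) / len(joined[0])
```

Under this scheme two identical sequences together contribute a single effective sequence, so a deep alignment of near-duplicates can still have a low Neff/L.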
Both power and precision improve with increasing Neff/L for nearly all coevolutionary methods in the HisKA-RR data set (Figure 1). However, for alignments with Neff/L < 1.0, power at FPR<5% and precision at FPR<0.1% both remain relatively low (<50%). Additionally, the performance metrics fmax and phimax show that no score threshold (i.e., no level of prediction strictness) achieves both high precision and high power in alignments with Neff/L < ~3.0 (Supplemental Figure S1). Despite the smaller range in Neff/L values, these performance trends are also observed across the 32 alignments in [24] (Supplemental Figures S3 and S6).
In general, we confirm that coevolutionary methods that adjust for background phylogenetic signal through sequence re-weighting and/or average product correction (APC) (e.g., DI, DIplm, and PSICOV) perform better than the phylogeny unaware mutual information (MI) based methods and the phylogeny aware approaches that explicitly use evolutionary models. In the HisKA-RR alignment, we observed two major exceptions to this trend when using the strictest definition for contacting pairs (i.e., requiring residue Cβ < 8Å coupled with biochemical evidence for specificity determination) (Supplemental Figure S2). First, the standard MI statistic is the most precise method for detecting contacting sites in alignments with Neff/L > 1.6 and FPR < 0.1%. Second, mutual information normalized by the joint entropy (MIj) has relatively high power in many scenarios and is the most powerful method for detecting contacting sites that are supported by experimental evidence at FPR < 5%. However, MIj has drastically lower power at FPR < 0.1%. These findings suggest that MI is a reasonable choice if the goal of the analysis is to predict a small number of very high confidence contacts, whereas MIj may be useful for detecting as many contacts as possible if a moderate FPR can be tolerated. These methods are both straightforward to compute, adding to their utility in these settings.
CoMap performance is an interesting case because, in contrast to DI, DIplm, and PSICOV, it was not designed to find contacting residues. In the smallest alignments (5 sequences) we tested, it can have slightly better performance than the other methods. However, its poor performance in other alignments may indicate that it is identifying a set of coevolving residue pairs that partially overlap with contacting residues. It remains to explore whether CoMap can be used to prioritize residue pairs predicted by the other methods for functional assays.
Finally, we looked at the relationship between performance and the proportion of residue pairs that are contacts. Comparing across all 33 structures in our analyses, we observed the proportion of contacts is correlated with precision (Supplemental Figure S7). This means that most strongly coevolving residues in a protein pair are more likely to be physically interacting in co-crystal structures with larger interfaces.
Choice of null distribution affects performance
The previous results show performance based on the known HisKA-RR structure. When applying the methods in our study in practice the structure usually is not known. One therefore uses a null distribution to control false predictions. Specifically, an upper quantile of the distribution of coevolutionary statistics in the absence of coevolutionary constraint is used as a threshold; one declares any pair of sites with a statistic exceeding the threshold a predicted contact. The goal is to minimize false predictions by predicting contacts only when statistics are much larger than expected by chance under the null distribution. A variety of null distributions are commonly used, including theoretical limiting distributions [5, 8], the observed empirical distribution (under the assumption that most pairs of sites are not coevolving) [39] and parametric, semi-parametric, and non-parametric bootstrap distributions [10, 40]. Theoretical and empirical nulls are more computationally efficient than bootstrap methods, which require simulating large data sets. The HisKA-RR interaction provides a framework for assessing the performance of these different approaches.
We used our sampled sub-alignments of HisKA-RR and the 32 alignments in [24] to compare the performance of two commonly used null distributions and to evaluate the sensitivity of each approach to alignment size. For each null distribution and coevolutionary statistic, we first employed the non-contact pairs of residues to assess if the FPR was truly controlled or not at given target FPRs of 5% and 0.1%.
The normal distribution is commonly used as a theoretical null for mutual information and its normalized variants. Under this assumption, we standardized the coevolution scores to Z-scores and compared these to upper quantiles of the standard normal distribution (mean = 0, variance = 1). We then used the resulting upper-tail P-values (Pnormal) to predict contacting residue pairs. We found that nominal FPRs using this approach consistently exceed the target FPR across the range of Neff/L values in both the HisKA-RR sub-alignments and the alignments in [24] (Figure 2 and Supplemental Figure S4). In general, as Neff/L increases, the nominal FPR increases for Direct methods and decreases for Information-based methods. Nominal FPRs were up to twice the target FPR at a target of 5% and up to 20 times the target FPR at a target of 0.1%. This suggests that non-contacting residue pairs carry signals of coevolution (e.g., due to phylogeny, structural, or other evolutionary constraints) and/or that Z-scores of coevolution statistics have variance greater than one across non-contacting residues (e.g., due to an underestimated standard deviation across residue pairs resulting from within-protein constraints or residues appearing in many pairs). Three of the four phylogeny-aware CoMap methods controlled the nominal FPR below the target in all cases, suggesting that the charge compensation analysis is predicting long-range residue interactions as well as contacts.
Thus, while the normal distribution applied to standardized coevolution statistics can practically be used as a null distribution, we conclude that this approach results in elevated rates of false positive predictions, likely due to shared phylogeny or structural constraints affecting non-contacting residue pairs. A theoretical null (e.g., noncentral gamma [41]) that is parameterized for individual column pairs may therefore be more appropriate.
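The normal null described above can be sketched in a few lines: scores are standardized across all inter-protein pairs, converted to upper-tail P-values, and the FPR actually achieved (the nominal FPR in the terminology used here) is measured on the non-contacting pairs. The function names are hypothetical, and standardizing with the sample mean and standard deviation over all pairs is an assumption.

```python
import math
import statistics

def p_normal(scores):
    """Upper-tail standard normal P-values for Z-scores of the given statistics."""
    mu = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    if sd == 0:
        raise ValueError("scores are constant; Z-scores are undefined")
    # P(Z >= z) = 0.5 * erfc(z / sqrt(2)) for a standard normal Z
    return [0.5 * math.erfc((s - mu) / (sd * math.sqrt(2))) for s in scores]

def nominal_fpr(pvalues, labels, target_fpr):
    """Fraction of non-contacting pairs (label False) called significant at target_fpr."""
    neg = [p for p, y in zip(pvalues, labels) if not y]
    return sum(p < target_fpr for p in neg) / len(neg)
```

If the null were well calibrated, `nominal_fpr` would be close to `target_fpr`; values above the target correspond to the anti-conservative behavior reported here.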
Another choice of null distribution is the observed empirical distribution of the coevolution statistics. A P-value (Pempirical) for a score S is simply the proportion of scores that are more extreme than S. This straightforward method can be easily applied with any statistic. However, it also assumes that no pairs of sites are coevolving and should therefore produce thresholds that are too strict when there are some coevolving sites in the data set (i.e., making it harder to predict real contacts). Contrary to this expectation, we found that the empirical null distribution, like the normal null distribution, produces nominal FPRs that exceed target FPRs (Figure 3 and Supplemental Figure S5). However, it is the Direct methods that best control the nominal FPR in both sets of alignments, marginally exceeding the target FPR in only a couple of cases. The Information-based methods fared well in the alignments in [24]; however, the HisKA-RR sub-alignments reveal that at Neff/L < 0.3, control of the FPR is lost, especially in MIminh. The Phylogenetic method that consistently exceeded the target FPR was the CoMap correlation analysis (CMPcor), which makes no assumptions regarding the biochemical properties of the amino acids. These results suggest that the empirical null distribution is not as conservative an approach as one might expect from including contacting residue pairs in the null distribution. It may suffer from some of the same effects that make the normal null distribution anti-conservative, such as shared phylogeny or structural constraints. In addition, alignments with very few sequences (e.g., 5-50) have a limited number of possible scores, which leads to ties in P-values between contacting and non-contacting residues.
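A minimal sketch of the empirical P-value is given below. It counts scores that are at least as large as S (so that the top score does not receive a P-value of exactly zero), which is an assumption about how ties and the score itself are handled; `p_empirical` is a hypothetical name.

```python
from bisect import bisect_left

def p_empirical(scores):
    """Empirical upper-tail P-value for each score among all observed scores."""
    ordered = sorted(scores)
    n = len(ordered)
    # Scores >= s are those from the first position holding s to the end of the list
    return [(n - bisect_left(ordered, s)) / n for s in scores]
```

For example, the scores `[0.1, 0.5, 0.5, 0.9]` give P-values `[1.0, 0.75, 0.75, 0.25]`. The two tied scores share one P-value, which illustrates how a small set of attainable scores in very shallow alignments can produce ties between contacting and non-contacting pairs.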
These results are encouraging, but still leave us with the challenge of how to choose an appropriate P-value cutoff in a real analysis when the structure is unknown. Since our findings indicate that nominal FPRs exceed target FPRs with all three types of null distributions and nearly all methods, stricter P-value cutoffs than the target false positive rate seem warranted. But it is not clear how much stricter will be needed in any given alignment pair without additional information to guide such modifications (e.g., incorporating alignment properties such as Neff/L into a model for each coevolution method). Hence, in most applications one must simply aim to control a target FPR, knowing that the true error rate is likely to be larger (Supplemental Figures S8 and S9). For this reason, the empirical null distribution may be the best choice to use, as it controls error rates across the majority of alignment sizes, target FPRs, and coevolution methods tested (Figures 3 and S5). As a rule of thumb, the empirical null overall controls the FPR for the Direct methods; however, in small alignments (5 sequences or Neff/L < 0.3) the nominal FPR can be up to 1.5 times the target FPR.
Cross-Species Case Study 1: Applying coevolution methods to Vif-A3G identifies some residues known to affect host-virus interactions
Viral infectivity factor (Vif) is a lentiviral accessory protein whose primary function is to target the antiviral cytidine deaminase APOBEC3G (A3G) of its mammalian hosts through ubiquitination. Because the two protein families are in an evolutionary arms race [42, 43], we hypothesized that they would be an informative example for exploring the utility of coevolution methods in host-virus protein pairs (i.e., inter-protein, inter-species interactions). This is a novel application of coevolution analysis, which has primarily been applied to residues within a protein or between pairs of proteins in the same genome.
A major challenge in performing coevolutionary analysis on cross-species protein pairs is acquiring appropriate data, including paired alignments and protein structures for validation. For Vif-A3G, we were able to identify 16 pairs of sequences (Neff = 10.0) from different primates (A3G orthologs) and their lentiviruses (Vif orthologs) in public databases (Table S2). Our benchmarking results on HisKA-RR indicate that such small protein families push the useful limits of the coevolution statistics we tested (Neff/L = 0.014). The low sequence diversity of A3G (Neff = 3.04) within primates compared to Vif (Neff = 11.3) within primate lentiviruses also presents challenges. Hence, we expect coevolutionary analysis to potentially have limited power in this scenario. Quantitatively evaluating performance requires validated Vif-A3G interactions. The structure of Vif in complex with A3G has not been solved. However, biochemical assays have solidly identified regions important for binding and ubiquitination along the individual reference sequences of HIV1 Vif [44–47] and human A3G [48, 49] (Table S3). For this analysis, we therefore take the residues in biochemically-validated regions to be positives even though they might not be contacts (i.e., Cβ distance ≥ 8Å), and assume that all remaining residues are negatives, even though other sites (including sites deleted in these reference sequences) are possibly involved in the interaction. While further experimentation is needed to understand the relationship between functionally important sites and the structure of the protein interaction, as well as the effects of mutations in these sites on the fitness of lentiviruses, we explore whether any clues can be identified in the limited data that describe the coevolutionary history of the Vif-A3G residues.
First, we computed coevolutionary statistics for all Vif-A3G residue pairs and evaluated how well the statistics distinguish the positive (functionally important) residues from the negatives. For this evaluation, we used the empirical distribution of scores as a null distribution to determine statistical significance (i.e., Pempirical) because it has lower false positive rates across Neff/L values at strict significance thresholds. Because the positives and negatives are single residues in each sequence instead of inter-protein residue pairs, we summarized Pempirical for each residue by assigning it the most significant Pempirical across all inter-protein pairs to which it belongs, and then explored the Vif and A3G results individually. From our benchmarking on the bacterial data sets, we know that significance thresholds that control the FPR vary by method and Neff/L, and that strict thresholds that yield very low (~2-3%) power are typically needed to control FPR in small alignments. We therefore chose to identify a significance threshold for each method that maximizes precision on the known functional sites in each protein. Then, we estimated power and FPR at these thresholds.
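The per-residue summary and the precision-maximizing threshold can be sketched as follows. The input layout (a dictionary keyed by residue index pairs) and both function names are hypothetical, and breaking ties in favor of the strictest threshold is an assumption.

```python
def per_residue_min_p(pair_pvalues):
    """Assign each residue the most significant P-value over all pairs it belongs to.

    pair_pvalues: dict mapping (residue_in_A, residue_in_B) -> P-value.
    Returns one dict per protein, mapping residue index -> minimum P-value.
    """
    best_a, best_b = {}, {}
    for (i, j), p in pair_pvalues.items():
        best_a[i] = min(p, best_a.get(i, 1.0))
        best_b[j] = min(p, best_b.get(j, 1.0))
    return best_a, best_b

def max_precision_threshold(residue_p, positives):
    """Return (precision, threshold) for the threshold that maximizes precision.

    residue_p: dict residue -> summarized P-value; positives: set of known sites.
    Thresholds are scanned from strictest to most lenient, and the first
    (strictest) threshold reaching the maximum precision is kept.
    """
    best = (0.0, None)
    for t in sorted(set(residue_p.values())):
        predicted = {r for r, p in residue_p.items() if p <= t}
        precision = len(predicted & positives) / len(predicted)
        if precision > best[0]:
            best = (precision, t)
    return best
```

Note that choosing the threshold on the known functional sites uses the validation labels themselves, so the resulting precision is an upper bound on what the method can achieve for that protein rather than an estimate of performance on unlabeled data.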
On Vif, with the exception of CMPcor and DI32, the maximum precisions for each method ranged from 9 to 20% (i.e., only one or two residues out of ten predicted to be positives are truly positives) (Supplemental Figure S14). At these precision-optimized thresholds, MIj and MIminh predict almost every Vif residue to be coevolving; a stricter threshold would not result in a lower proportion of incorrect predictions. In contrast, the precisions for CMPcor, CMPpol, and DI32 are the highest (20%, 40%, 100% respectively). However, this comes at the cost of making the fewest predictions, with DI32 making only a single prediction. For these methods, less strict thresholds are needed to identify a greater proportion of positives at the cost of increasing the proportion of false discoveries. Across all methods, low fmax and phimax values (0.26 and below) suggest there are no significance thresholds that balance power and precision for this data set.
We observed similarly low performance on A3G (Supplemental Figure S16). Encouragingly, we note that positions 128-130 are correctly identified by multiple methods (Supplemental Figure S12B). Residues at position 130 (e.g., D vs A) are highly likely to be adaptations that conferred species-specific resistance to Vif-induced degradation in Old World Monkeys 5-6 MYA [42, 43]. The adaptation at position 128, which also provides species-specific resistance, is thought to be more recent [42, 43, 50]. While these coevolution methods alone may not yet be accurate enough to identify functional residues, they potentially enhance other evolutionary analyses. For example, of the many APOBEC sites under positive selection [43], it is reasonable that lentiviruses are more likely to be shaping the evolution of those sites that coevolve with Vif than of sites that coevolve with other viral or virus-like agents.
Secondly, we visualized the localization of Vif residues predicted to be coevolving with A3G on a partial structure of Vif in complex with cofactors utilized for protein ubiquitination [51] (Figure 4). The authors of [51] observed that a critical subset of the Vif positives is solvent-exposed. We reevaluated performance with only these residues as the positives (Supplemental Figure S15). The methods identify the putative solvent-exposed interface with poor precision; CMPcor at 50% and CMPvol at 10% are the only methods with precision >6%.
Our analysis of the Vif-A3G interaction confirms that power to detect functionally important residues in each protein family is also low in inter-protein analyses between species, even though it is plausible that an arms race between lentivirus and mammal would give rise to stronger signals of coevolution compared to background. It is important to consider that the positions we considered positives may not all be of equal evolutionary importance across primates. Interfaces may be gained or lost, and the rapid evolution of the two proteins likely produces many alternative solutions to maintaining an antagonistic interaction. Many predicted positions were not among the positives, and further systematic validation and more comprehensive sequencing of lentiviruses and primates are needed to determine which pairs of residues are actually in close proximity or functionally required for other reasons. Additionally, there appears to be some level of complementarity between the predictions made by VI and MIminh and those made by the CMP methods, which measure different biochemical trade-offs between coevolving residues. This strengthens the rationale for integrating methods to better predict interface residues experiencing potentially different evolutionary constraints (e.g., structural, catalytic activity, specificity). Coevolutionary analysis can help to generate and prioritize candidates for these experiments.
Cross-Species Case Study 2: The interaction network of HIV and human proteins shows only weak evidence of coevolution across mammals
We sought to use inter-protein residue coevolution to refine a recently derived APMS protein-protein interaction network of the HIV-human interactome [33]. That study detected human proteins that interact with each HIV protein, either via direct physical contact or as members of complexes. Specifically, we hoped to use evidence of sequence coevolution to resolve direct versus indirect protein interactions among all human proteins measured to interact with each HIV protein. We also wanted to know whether coevolutionary signals are strong enough to pinpoint key residues involved in the interfaces of any direct interactions.
For each protein in the HIV genome, we computed a multiple sequence alignment with all other sequenced immunodeficiency viruses that infect mammals with sequenced genomes. Similarly, we generated a multiple alignment of each human protein with the sequences of its orthologs from any mammal with a sequenced immunodeficiency virus. This produced pairs of host-virus protein alignments with up to six immunodeficiency viruses and their primate, feline, and bovid hosts. For each pair of residues in a host-virus protein pair, we quantified coevolution using MIj and a semi-parametric bootstrap to calculate P-values (see Supplemental Text: Simulating independently evolving pairs of alignments). For each protein pair, we varied the significance threshold and counted the significantly coevolving residue pairs. We then compared this count for interacting protein pairs from the APMS network versus a control set of 100 randomly chosen lentivirus-mammal protein pairs not included in the APMS network. We found that APMS-detected interactions have only marginally higher counts of significantly coevolving residue pairs than non-interactions (best auROC = 0.541 at Pbootstrap < 0.0001). Counts of coevolving residues are therefore not sensitive enough to distinguish direct interactions, or the residues involved in them, for this set of virus and host proteins. Based on our benchmarking, we conclude that this lack of signal may result from low power due to the small number of sequenced lentivirus-mammal proteome pairs.
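To make the auROC comparison concrete, the following minimal sketch computes the area under the ROC curve as the probability that a randomly chosen interacting pair has a higher count than a randomly chosen control pair, with ties counted as one half. The function name and the counts are hypothetical and purely illustrative; this is not the code or data used in the analysis.

```python
def auroc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative.

    Ties contribute 1/2, which makes this equivalent to the
    Mann-Whitney U statistic divided by (n_pos * n_neg).
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical counts of significantly coevolving residue pairs per
# protein pair at one P-value threshold (made-up numbers).
interacting_counts = [3, 0, 5, 1]   # pairs in the interaction network
control_counts = [2, 0, 1, 4]       # randomly chosen control pairs

print(auroc(interacting_counts, control_counts))  # 0.5625
```

An auROC near 0.5, as in the result reported above, means the count is barely more informative than a random ranking of protein pairs.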
Discussion
In this work we aimed to characterize the performance of emerging methods that identify inter-protein contacts using coevolution, and to identify the properties of alignments for which performance is expected to be best. As previously noted for intra-protein predictions [3, 9, 14], re-weighting the sequences to account for the underlying phylogeny is important for inter-protein predictions as well. However, as the comparison between MIw and MI shows, the parameters controlling the re-weighting must be tuned when there are fast-evolving alignment columns in an overall conserved protein family. Fortunately, methods that search for direct correlations using a global statistical model of the sequence alignments seem able to correct for improper weighting (compare MIw to DI). These methods are more precise at strict false positive rates than their counterparts, especially when the alignments have Neff/L < 1.0. However, a faster, MI-based method may be preferable if the use case allows a relaxed FPR and prioritizes power over precision.
We also investigated the use of three null models to control the false positive rate. Counter-intuitively, a null model that explicitly simulates evolution independently for each alignment fails to control the false positive rate. We believe that our simulated alignments systematically score too low because they fail to capture the correct amount of variation in the observed alignments, resulting in artificially significant P-values, except when small alignment sizes result in overly conservative P-values. Using a standard normal or the empirical distribution of scores as null models also failed to control the false positive rate, likely because of the correlation structure imposed by the shared evolutionary history of the residues, the distribution of evolutionary rates of the residues, or the failure of asymptotic assumptions at small sample sizes. Thus, choosing an appropriate P-value cutoff in a real analysis, when the structure is unknown and alignment depth is shallow, remains a challenge. However, we show that in sufficiently diverse alignments the empirical null successfully controls the false positive rate for Direct methods. As a future direction, we suggest exploring theoretical null distributions that can be parameterized for individual alignment column pairs, such as [41], or further improving protein evolution simulators to generate distributions of scores where the evolutionary rates are more similar between the null and alternative hypotheses.
A related problem is to search a large set of protein pairs (within or between species) to determine which ones interact. In this setting, the performance of coevolution methods is potentially more important than when predicting contacting residues for known interactions, because the search space will contain so many negatives (i.e., non-interacting pairs). A permissive P-value cutoff will produce a large number of false positives that may misinform investigators, while an overly strict cutoff will produce false negatives that keep potentially important findings hidden. While models exist that identify cutoffs based on benchmark data sets (e.g., Supplemental Figures S8 and S9, [24]), it would be interesting to understand why the parameters in these studies are appropriate and whether they generalize to all protein-protein interactions. Ideally, we would like to understand what a null model teaches us about phylogeny-induced coevolution in the absence of structural inter- or intra-protein constraints. Another challenge for predicting interacting protein pairs from coevolutionary tests is how to summarize statistics for individual pairs of residues to produce a single score for a pair of proteins. Based on some preliminary investigations of these questions, we conclude that it is unlikely that cross-species interacting protein pairs can be accurately distinguished from non-interacting pairs on a genome-wide scale.
The progress of high-throughput interaction mapping highlights the need for continued refinement of inter-protein coevolution detection methods. Given that improper re-weighting of sequences can negatively affect power and the false positive rate, a logical next step is to extend Direct methods to obtain sequence weights independently for each alignment, or to use an evolution-based probabilistic weight (such as in CoMap or phylogenetic logistic regression) for unusual variation in each column. Another important contribution would be to develop a generalizable null model that can help differentiate contacts when there are very few sequences available for protein families. Furthermore, investigating the correlations among the coevolution statistics themselves in inter-protein data sets could help disentangle structural from non-structural coevolutionary forces and could inform the construction of an ensemble method. Comprehensively sequencing orthologous pairs of protein families is a straightforward way to test the usefulness of these future contributions while simultaneously enabling current methods to perform to their fullest.
Conclusion
We benchmarked 13 coevolution methods on 33 protein interactions with associated sequence alignments of varying depths. We conclude that coevolutionary analysis of cross-species protein-protein interactions is largely hindered by a lack of phylogenetically deep alignments for many proteins, and we further demonstrate this in two example use cases involving HIV-human protein interactions. Additionally, we report that commonly used null distributions generally fail to control false positives in coevolutionary analyses, though errors are best controlled by the empirical null in large alignments.
Materials and Methods
Multiple sequence alignments
A master alignment of 8998 concatenated HisKA and RR sequences from [36] was graciously provided by the authors. From this alignment, aligned sequences were sampled uniformly (each sequence had equal probability of being sampled) to create sub-alignments with 5, 50, 250, 500, 1000, and 5000 sequences. We sampled 10 sub-alignments of each alignment size (number of sequences in sub-alignment), resulting in 60 total alignment pairs.
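The sampling design above can be sketched as follows. This is a minimal illustration under stated assumptions: sampling is taken to be without replacement within each sub-alignment, the sequence identifiers are placeholders, and the function name and seed are hypothetical. Because each row of the master alignment is a concatenated HisKA-RR sequence, sampling rows keeps the two halves of each pair together.

```python
import random


def sample_subalignments(sequences, sizes, replicates, seed=0):
    """Draw `replicates` uniform random sub-alignments for each size.

    Assumption: sampling is without replacement within a sub-alignment,
    so every sequence has the same probability of being included.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    subalignments = {}
    for size in sizes:
        subalignments[size] = [
            rng.sample(sequences, size) for _ in range(replicates)
        ]
    return subalignments


# Placeholder identifiers standing in for the 8998 concatenated sequences.
master = [f"seq{i}" for i in range(8998)]
sizes = [5, 50, 250, 500, 1000, 5000]

subs = sample_subalignments(master, sizes, replicates=10)
total = sum(len(reps) for reps in subs.values())
print(total)  # 60 sub-alignments: 6 sizes x 10 replicates
```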
The alignments in [24] were downloaded from the complexes section of the Baker lab website (http://gremlin.bakerlab.org/complexes/PDB/_benchmark/_alignments.zip) on Aug 29, 2014. The corresponding structures were downloaded from the PDB and processed to obtain contacts between representative protein chains.
The CoMap implementation requires a preprocessing step to remove sequence redundancy (a data munging alternative to sequence weighting). This additional step was also necessary to prevent buffer underflow errors when evaluating likelihoods in very large input trees. Therefore, all alignments with more than 200 sequences were culled to contain the 200 most diverse sequences before being passed to CoMap. The sub-alignment used corresponds to the 200-leaf sub-tree that maximizes PD for each original input alignment and tree.
Measuring coevolution
The coevolution methods used are listed in Table 1 and Table S1. Wrappers for the Direct methods are provided in coevo_tools to facilitate running them from the command line. Methods in the plmDCA, mfDCA and hpDCA packages require MATLAB or the MATLAB runtime executable, as well as various MATLAB Toolbox dependencies and licenses.
Evaluating coevolution performance
For each method, coevolution scores for pairs of amino acid positions were used to predict inter-domain pairs of amino acid residues that are close to each other in the representative co-crystal structure (PDB ID: 3DGE).
We define positives as pairs of alignment positions mapping to amino acid residues whose beta carbons (Cβ) are less than 8 angstroms apart in 3DGE. All other pairs of alignment positions are considered negatives.
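As an illustration of this definition, the sketch below labels inter-protein residue pairs as positives when their Cβ coordinates are less than 8 angstroms apart. The coordinates are made up, the function name is hypothetical, and parsing of 3DGE is omitted; how residues without a Cβ atom (glycine) are treated is not specified here and is left out of the sketch.

```python
import math


def inter_protein_contacts(coords_a, coords_b, cutoff=8.0):
    """Return (i, j) index pairs whose C-beta atoms are closer than `cutoff`.

    coords_a, coords_b: lists of (x, y, z) C-beta coordinates in angstroms,
    one entry per residue of each protein. The comparison is strict (<),
    matching "less than 8 angstroms apart".
    """
    contacts = []
    for i, a in enumerate(coords_a):
        for j, b in enumerate(coords_b):
            if math.dist(a, b) < cutoff:
                contacts.append((i, j))
    return contacts


# Made-up coordinates for two short chains (not taken from 3DGE).
hk = [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
rr = [(5.0, 0.0, 0.0), (0.0, 8.0, 0.0), (21.0, 3.0, 0.0)]

positives = inter_protein_contacts(hk, rr)
print(positives)  # [(0, 0), (1, 2)]; the pair at exactly 8 angstroms is excluded
```

Every pair not returned by the function would be treated as a negative.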
We considered the following two alternative definitions of positives:
Closest non-hydrogen atom-atom distance between residues is less than 6 angstroms
Cβ distance is less than 8 angstroms and at least one residue is mentioned as important in determining specificity of the HisKA-RR interaction in [52–56].
Residue pairs are predicted as coevolving if their scores are above, or their P-values below, a given threshold (e.g., top 1% of scores, P < 0.05) (Table S4).
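A score-based threshold of this kind, together with the resulting precision (PPV) and power (TPR), can be sketched as below. The scores, positives, and function names are hypothetical; ties at the cutoff are broken by sort order here, which is an assumption of the sketch rather than a documented choice.

```python
def top_fraction_predictions(scores, fraction):
    """Return the set of residue pairs in the top `fraction` of scores.

    scores: dict mapping (i, j) residue pairs to coevolution scores,
    where larger means stronger evidence of coevolution.
    """
    n_keep = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_keep])


def precision_and_power(predicted, positives):
    """Precision = TP / predicted; power = TP / all positives."""
    tp = len(predicted & positives)
    precision = tp / len(predicted) if predicted else 0.0
    power = tp / len(positives) if positives else 0.0
    return precision, power


# Made-up scores for ten residue pairs and three structural positives.
scores = {
    (0, 0): 0.90, (0, 1): 0.20, (0, 2): 0.30, (1, 0): 0.05, (1, 1): 0.40,
    (1, 2): 0.10, (2, 0): 0.15, (2, 1): 0.25, (2, 2): 0.80, (3, 3): 0.35,
}
true_contacts = {(0, 0), (1, 2), (3, 3)}

predicted = top_fraction_predictions(scores, fraction=0.2)  # top 20% = 2 pairs
precision, power = precision_and_power(predicted, true_contacts)
print(sorted(predicted), precision, power)
```

In this toy case one of the two predicted pairs is a true contact, so precision is 0.5 and power is one third.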
Phylogenetic diversity
Phylogenetic diversity (PD) is calculated as the sum of the branch lengths in a tree built from the concatenated multiple sequence alignment of both proteins. Trees were built using FastTree (version 2.1.7 SSE3) with options -gamma -nosupport -wag.
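A minimal sketch of this calculation is shown below, assuming the tree is available as a Newick string, for example from a command along the lines of `FastTree -gamma -nosupport -wag concat.fasta > concat.nwk` (file names hypothetical). The parser is a simplification: it assumes non-negative branch lengths, no support values or internal node labels, and no colons inside taxon names. It is not the code used for the analysis.

```python
import re

# Matches the number that follows each ':' in a Newick string,
# including scientific notation such as 1e-05.
BRANCH_LENGTH = re.compile(r":\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")


def phylogenetic_diversity(newick):
    """Sum all branch lengths in a Newick-formatted tree.

    Assumptions (hypothetical simplifications): branch lengths are
    non-negative and taxon names contain no ':' characters.
    """
    return sum(float(x) for x in BRANCH_LENGTH.findall(newick))


# Toy tree with four branches: 0.1 + 0.2 + 0.05 + 0.3
tree = "((A:0.1,B:0.2):0.05,C:0.3);"
pd_value = phylogenetic_diversity(tree)
print(round(pd_value, 6))  # 0.65
```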
Abbreviations
CoMap is abbreviated CMP in the main text and figures and CoMapP in supplemental figures. Effective number of sequences per column is abbreviated Neff/L. Phylogenetic diversity is abbreviated PD. MIHmin appears as MIminh in figure legends. The precision (PPV)-optimized metrics ppvcut, ppvmax, ppvTPR, and ppvFPR are, respectively, the Pempirical threshold that maximizes PPV, the maximum PPV, and the power (TPR) and false positive rate (FPR) at that threshold.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
AA carried out the analysis. AA and KSP designed the analysis and wrote the manuscript. All authors read and approved the final manuscript.
Authors’ information
AA is a Bioinformatics graduate student at the University of California San Francisco in the laboratory of Dr. Katherine S. Pollard.
KSP is a Senior Investigator at the Gladstone Institutes and Professor of Epidemiology and Biostatistics at the University of California San Francisco.
Acknowledgements
We thank Martin Weigt for providing HisKA and RR alignments and for providing links to DCA source code. We also thank Julien Dutheil for help running CoMap correctly. This work was supported by a National Institutes of Health bioinformatics training grant, a UCSF Graduate Research Mentorship Fellowship, institutional funding from Gladstone Institutes, and a gift from the San Simeon Fund.
Footnotes
<aram.avilaherrera{at}ucsf.edu>
<katherine.pollard{at}gladstone.ucsf.edu>