Coevolutionary analyses require phylogenetically deep alignments and better null models to accurately detect inter-protein contacts within and between species
© Avila-Herrera and Pollard. 2015
Received: 20 February 2015
Accepted: 17 July 2015
Published: 25 August 2015
When biomolecules physically interact, natural selection operates on them jointly. Contacting positions in protein and RNA structures exhibit correlated patterns of sequence evolution due to constraints imposed by the interaction, and molecular arms races can develop between interacting proteins in pathogens and their hosts. To evaluate how well methods developed to detect coevolving residues within proteins can be adapted for cross-species, inter-protein analysis, we used statistical criteria to quantify the performance of these methods in detecting inter-protein residues within 8 angstroms of each other in the co-crystal structures of 33 bacterial protein interactions. We also evaluated their performance for detecting known residues at the interface of a host-virus protein complex with a partially solved structure.
Our quantitative benchmarking showed that all coevolutionary methods clearly benefit from alignments with many sequences. Methods that aim to detect direct correlations generally outperform other approaches. However, faster mutual information based methods are occasionally competitive in small alignments and with relaxed false positive rates. Two commonly used null distributions are anti-conservative and have high false positive rates in some scenarios, although the empirical distribution of scores performs reasonably well with deep alignments.
We conclude that coevolutionary analysis of cross-species protein interactions holds great promise but requires sequencing many more species pairs.
Keywords: Coevolution, Methods comparison, Inter-protein, Cross-species, Host-virus, Contact prediction, Protein interaction
Coevolution—“the change of a biological object triggered by the change of a related object” —is a powerful concept when applied to molecular sequence analysis because it reveals positional relationships that are preserved across evolutionary time scales. Sequence evolution is constrained by essential molecular interactions, such as contacts within a protein or RNA structure, as well as inter-molecular interactions within protein complexes and signaling pathways. These constraints define an epistasis (i.e. genetic interaction) between sites (residues or base-pairs) where the probability of a substitution depends on the states of other sites involved in an interaction . For example, a mildly deleterious or neutral mutation may change the fitness landscape such that compensatory or advantageous mutations at another site become more likely. Understanding the basic connections and dependencies between these molecular machines is invaluable in learning how cells function, adapt, and how they can be manipulated into performing new tasks or correcting harmful behaviors, as in disease for example.
Because epistasis can induce correlation between substitution patterns among columns in multiple sequence alignments, many methods have been developed that use evidence of coevolving alignment columns to detect physical interactions within and between biomolecules. These methods draw inspiration from diverse techniques in molecular phylogenetics, inverse statistical mechanics, Bayesian graphical modeling, information theory, sparse inference, and spectral theory (reviewed in [3, 4]).
Despite good rationale for coevolutionary approaches, physically interacting alignment columns have been notoriously difficult to identify from correlated patterns of sequence evolution for several reasons. First, shared evolutionary history creates a background of correlated substitution patterns against which it can be difficult to distinguish additional constraints derived from physical interactions. Common phylogeny is particularly strong within a gene family (e.g. predicting intra-molecular contacts). But it is also present across gene families within a species or even between species (e.g. predicting host-virus protein interactions), especially at shorter evolutionary distances where gene trees mirror species trees more closely. Coevolution methods have used a variety of approaches to counter the dependence induced by shared phylogeny, including removing closely related sequences from alignments to reduce non-independence [5, 6], differential weighting of sequences when computing statistics [7–9], and null distributions that directly model or indirectly account for phylogeny [10–13].
A second challenge arises when trying to distinguish correlated evolution that arises from direct versus indirect interactions. Alignment columns that are indirectly implicated in an interaction can be strongly correlated, and most columns are involved in multiple, partially overlapping interactions. For these reasons, close physical interactions may not produce patterns of substitution that are significantly more highly correlated than the background present in structures. This problem has been the focus of a recent class of coevolution methods that aim to reduce the number of incorrect predictions by disentangling direct from indirect correlations [9, 14–17]. An alternative point of view considers these networks of indirectly correlated residues as protein sectors that can easily, through cooperative substitutions, respond to fluctuating evolutionary pressures. Proteins are in fact quite dynamic, and many unstructured proteins are known to have important interactions [19, 20]. Coevolution methods have the exciting potential to reveal these hard-to-identify interactions; however, distinguishing spurious correlations from true non-structural interactions remains a challenge.
The main barrier to overcoming this challenge is the impressively difficult task of compiling “gold standard” data sets in which true coevolving sites are clearly defined. Structural and systems biology have had great success in identifying and characterizing many important interactions (e.g. nucleosome, proteasome, regulation in protein networks [23, 24]). However, resolving large complexes and unstructured proteins remains technically difficult, a daunting task as the number of proteins is ever increasing.
Finally, due to low power—resulting in part from the previous two challenges—physically interacting sites can typically only be detected in multiple sequence alignments that span large evolutionary divergences and contain many hundreds to thousands of sequences. Recent evaluations of a number of coevolution methods concluded that accurate contact predictions require alignments with one to five times as many sequences (with < 90 % sequence redundancy) as positions [25, 26]. Even in the current data rich era of computational biology, such deep alignments are difficult to obtain, especially for cross-species protein interactions (e.g. host and pathogen interactions) because both members of the interaction must be equally deeply sequenced. Additionally, resolving orthologs and paralogs is not trivial.
Despite these challenges, coevolutionary prediction of physically interacting alignment columns has been applied with success to intra-molecular contacts [7, 27–29] and well-characterized inter-molecular interactions, such as bacterial two-component signaling systems, enzyme complexes, and fertilization proteins. Nonetheless, the signal-to-noise ratio is too low and the search space too large to use sequence evolution to effectively identify pairs of physically interacting protein residues across entire proteomes: most pairs of sites with correlated substitution patterns are not in direct contact, and most physically interacting sites do not have statistically correlated substitution patterns.
However, the ability to now measure physical interactions between biomolecules with high-throughput technologies, such as affinity purification followed by mass spectrometry (APMS) , two-hybrid methods [36, 37], and protein complementation assays , raises the possibility of using sequence coevolution to refine predicted interactions in an experimentally reduced search space. For example, correlated substitution patterns in pairs of proteins could help determine if an experimentally measured interaction is likely to represent direct physical contact versus an indirect interaction in a complex or a false positive. Coevolutionary analysis could also be informative regarding which of the sites in a pair of interacting molecules are most likely to be in physical contact.
One particularly exciting application of this approach is to characterize and potentially manipulate interacting residues in host-virus and host-parasite protein interactomes [23, 39]. Newly emerging data on antibody and antigen sequences within a host  offers an opportunity to harness coevolutionary signals to investigate the mechanisms of broadly neutralizing antibodies and immune evasion. The primary open question for these new applications is whether existing methods are sensitive and specific enough to detect coevolution with the levels of constraint and divergence that are present in inter-molecular data sets of modest size.
Table 1 List of methods benchmarked
To quantify the limitations of coevolutionary methods when only a handful of sequences are available, we used a data set of 33 within-species bacterial protein-protein interactions. To systematically determine the parameters that affect performance, we focused on the well-characterized interaction between bacterial histidine kinase A (HisKA) and its response regulator (RR), for which a co-crystal structure and thousands of sequences are available. By sub-sampling HisKA-RR sequence pairs, we show that most methods have appreciable precision or power at low false positive rates for alignments with ∼500 or more sequences. However, the best performing method for a particular analysis will depend on whether power or precision is more important, the number of non-redundant sequences in the alignment, and whether the goal is to find structurally or functionally linked residues (i.e. long range interactions). By expanding this analysis to 32 additional bacterial interactions, we showed that these trends generalize beyond the specific example of HisKA and RR. We conclude that coevolution methods are able to identify some residues important for cross-species protein-protein interactions, but this approach will benefit greatly from additional sequence data.
Performance benchmarking of coevolution methods
The coevolutionary methods benchmarked in our analyses fall into three general groups (Table 1). Information-based methods are various flavors of mutual information (MI) between pairs of sites, each considered independently. Direct methods are those that consider pairs of sites in the context of a sparse global statistical model for contacts in the multiple sequence alignment. Phylogenetic methods explicitly use a substitution rate matrix and phylogenetic tree in their calculation of a coevolution statistic. The phylogenetic tree is used to account for the relatedness of the sequences—the observed sequences are themselves correlated due to their shared evolutionary histories. The substitution rate matrix may take into account the biochemical and physical properties of amino acid residues. The main phylogenetic method we report on, CoMap, reports a P-value based on internal simulation of independently evolving sites. In this benchmark we use this P-value as a statistic for comparison with other coevolution methods. Other differences among the coevolution methods include the incorporation of two additional techniques that have been shown to improve performance, re-weighting sequences such that similar sequences contribute less to the final score  and applying an Average Product Correction (APC) to remove background noise and phylogenetic signal from “raw” coevolution statistics .
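To make the Information-based group concrete, the following is a minimal sketch of plain mutual information between columns of two paired alignments, followed by an Average Product Correction. It is an illustration under stated assumptions and not the implementation benchmarked here: the function names are hypothetical, gaps are treated as ordinary symbols, no sequence re-weighting or pseudocounts are applied, and the APC is adapted to a rectangular inter-protein score matrix by using row and column means.

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def mutual_information(col_x, col_y):
    """MI(X;Y) = H(X) + H(Y) - H(X,Y) for two alignment columns."""
    return entropy(col_x) + entropy(col_y) - entropy(list(zip(col_x, col_y)))


def inter_protein_mi(aln_a, aln_b):
    """MI for every (column of A, column of B) pair.

    aln_a and aln_b are lists of equal-length strings; row i of A is
    assumed to be paired with row i of B (same organism or same host-virus pair).
    """
    cols_a = list(zip(*aln_a))
    cols_b = list(zip(*aln_b))
    return [[mutual_information(ca, cb) for cb in cols_b] for ca in cols_a]


def apc(scores):
    """Subtract (row mean * column mean) / overall mean from each score.

    Assumption: this rectangular form stands in for the symmetric
    intra-protein APC; it is one reasonable adaptation, not the only one.
    """
    n_rows, n_cols = len(scores), len(scores[0])
    row_means = [sum(row) / n_cols for row in scores]
    col_means = [sum(scores[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    if overall == 0:
        return [row[:] for row in scores]
    return [[scores[i][j] - row_means[i] * col_means[j] / overall
             for j in range(n_cols)] for i in range(n_rows)]
```

By construction the corrected scores sum to zero, so a pair stands out only if its raw score exceeds what the average scores of its two columns would predict.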
To benchmark coevolution methods, we used 33 within-species pairs of proteins with co-crystal structures determined from E. coli proteins. These include a set of paired alignments compiled by  (Ovch32), plus the histidine kinase-response regulator (HisKA-RR) bacterial two-component system from Procaccini et al. , provided by the authors. We included HisKA-RR, because it is a well-characterized interaction with a very large, diverse multiple sequence alignment (8998 sequences for each gene pair) and genetic evidence supporting several interactions. For these reasons, HisKA-RR has also been used previously in coevolutionary analyses .
Because the HisKA-RR alignment is so deep, it enabled us to quantify the effects of alignment size and diversity by uniformly down-sampling the full alignment to produce a wide range of smaller pairs of HisKA and RR multiple sequence alignments. These sub-sampled alignments have six different numbers of sequences (5, 50, 250, 500, 1000, 5000), with phylogenies also sub-sampled from the original tree (Additional file 13: Figure S1). The 32 alignment pairs in Ovch32 naturally varied in size (range 216–6732 sequences) (Additional file 13: Figure S2).
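The down-sampling step can be sketched as follows. This is a minimal illustration, assuming the two alignments are held as lists whose rows are paired by index; the function name and the fixed seed are hypothetical, and the corresponding pruning of the phylogeny is not shown.

```python
import random


def subsample_pairs(seqs_a, seqs_b, n, seed=0):
    """Uniformly sample n paired rows without replacement.

    The same row indices are drawn from both alignments so that each
    sampled HisKA sequence keeps its RR partner.
    """
    if len(seqs_a) != len(seqs_b):
        raise ValueError("alignments must contain the same number of rows")
    if n > len(seqs_a):
        raise ValueError("cannot sample more rows than are available")
    rng = random.Random(seed)
    idx = sorted(rng.sample(range(len(seqs_a)), n))
    return [seqs_a[i] for i in idx], [seqs_b[i] for i in idx]
```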
In addition to the number of sequences in the alignments (N), we consider the phylogenetic diversity (PD) of the alignments, which is also captured in the effective number of sequences (Neff) as calculated by PSICOV; the diversity within individual alignment columns, measured by entropy; the alignment length (L), i.e. the number of alignment columns; and the proportion of contacting residues in the alignment.
TP: True positive, FP: False positive, TN: True negative, FN: False negative
C β distance
We also evaluated performance using two stricter definitions of contacts. First, we defined contacts as residue-pairs with less than 6Å between their closest non-hydrogen atoms. We then evaluated performance in the HisKA-RR sub-alignments using a definition of contacts that, in addition to spatial proximity (C β < 8Å), requires biochemical evidence for the role of the contacting residues in determining ortholog- and paralog-specificity of the interaction (i.e. reducing cross-talk between orthologous and paralogous interacting proteins). A list of such residues in representative sequences is found in Casino et al., Li et al., Haldimann et al., Skerker et al., and Laub and Goulian. Trends in the results were generally similar across these choices of definition for true interactions, but we observed some differences in performance between definitions when the false positive rate (FPR) is controlled (Additional file 13: Figure S8 and S10).
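A distance-based contact definition of this kind reduces to a threshold on inter-atomic distances. The sketch below assumes C β coordinates have already been extracted into dictionaries keyed by residue number; the function name and data layout are hypothetical, and the closest-heavy-atom variant would need all non-hydrogen atoms per residue rather than a single coordinate.

```python
import math


def call_contacts(cb_a, cb_b, cutoff=8.0):
    """Return the set of inter-protein residue pairs closer than cutoff.

    cb_a and cb_b map residue numbers to (x, y, z) coordinates in angstroms,
    one representative atom per residue (here assumed to be C-beta).
    """
    return {(ra, rb)
            for ra, xyz_a in cb_a.items()
            for rb, xyz_b in cb_b.items()
            if math.dist(xyz_a, xyz_b) < cutoff}
```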
Physically interacting sites can be accurately detected in large sequence alignments
Our primary finding is that many coevolutionary methods are able to detect inter-molecular contacts at low FPRs in alignments with hundreds of diverse sequences from each protein, consistent with previous studies of intra-molecular contacts [3, 17], specifically when the alignments are deeper than they are long [25, 26]. We capture this rectangular quality in the statistic N eff /L, where N eff is the effective number of sequences as calculated by PSICOV  and L is the total number of columns in both alignments. We observe similar trends when using the number of sequences (N) or their phylogenetic diversity (PD) , rather than N eff /L, to compare performance.
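For intuition about N eff /L, the sketch below computes an effective number of sequences by down-weighting similar sequences and divides it by the combined alignment length. It is an approximation under stated assumptions: the identity cutoff of 0.9 and the function names are illustrative, and PSICOV's own calculation may differ in its choice of threshold and other details.

```python
def identity(s, t):
    """Fraction of identical positions between two aligned sequences."""
    return sum(a == b for a, b in zip(s, t)) / len(s)


def effective_number(seqs, identity_cutoff=0.9):
    """Sum of weights 1/k, where k counts sequences at or above the cutoff.

    Each sequence is always similar to itself, so k >= 1.
    This is a quadratic-time loop, adequate only for small examples.
    """
    total = 0.0
    for s in seqs:
        k = sum(identity(s, t) >= identity_cutoff for t in seqs)
        total += 1.0 / k
    return total


def neff_over_l(aln_a, aln_b, identity_cutoff=0.9):
    """N eff / L for a paired alignment, with L the columns in both alignments."""
    joined = [a + b for a, b in zip(aln_a, aln_b)]
    return effective_number(joined, identity_cutoff) / len(joined[0])
```

With two identical rows and one unrelated row, the identical rows share a single unit of weight, so three sequences count as two.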
However, in the HisKA-RR alignment, we observed two exceptions to this trend when using the strictest definition for contacting pairs (i.e. requiring residue C β < 8Å coupled with biochemical evidence for specificity determination). First, the standard MI statistic is the most precise method for detecting contacting sites in alignments with N eff /L >1.6 and FPR < 0.1 % (Additional file 13: Figure S10, Additional file 11). Second, mutual information normalized by the joint entropy (MIj) has relatively high power compared to the Information-based methods and is the most powerful method for detecting contacting sites that are supported by experimental evidence at FPR < 5 % (Additional file 13: Figure S8, Additional file 12). However, MIj has drastically lower power at FPR < 0.1 % (Additional file 13: Figure S9). These findings suggest MIj may be useful for detecting as many contacts as possible if a moderate FPR can be tolerated. Information-based methods are straightforward to compute, adding to their utility in these settings.
CoMap performance is an interesting case because, in contrast to DI, DI plm, and PSICOV, it was not initially designed to find contacting residues, but rather a mix of both short and long-range interactions. In the smallest alignments (5 sequences) we tested, we occasionally observe that CMP chg has higher power than the Direct methods (Mann-Whitney U P = 0.003). However, its lower performance in other alignments may indicate that it is identifying a set of coevolving residue pairs that only partially overlaps with contacting residues. Additionally, a filtering step necessary to run CoMap on large alignments may be limiting its performance (See Methods). It remains to be explored whether CoMap can be used to prioritize residue pairs predicted by the other methods for functional assays.
Finally, we looked at the relationship between performance and the proportion of residue pairs that are contacts. Comparing across the structures in the Ovch32 data set, we observed that the proportion of contacts is correlated with precision at FPR < 0.1 % (Additional file 13: Figure S24, Additional file 10). This means that the most strongly coevolving residues in a protein pair are more likely to be physically interacting in co-crystal structures with a larger fraction of interface residues. Power is also correlated with the proportion of contacts, though not as strongly as precision (Additional file 13: Figure S25).
Diversity of sequences is important for accurately detecting contacts
To investigate whether higher power in larger alignments results primarily from the number of sequences per se or depends upon the diversity of the sequences, we compared the performance across alignments with different diversity values but the same number of sequences. We quantified diversity using phylogenetic diversity (PD) and the effective number of sequences as calculated by PSICOV (N eff) (Additional file 13: Figure S5 and S6).
For HisKA-RR sub-alignments, we found weak positive and negative relationships between the nominal false positive rate and PD for some methods in alignments with 5000 sequences at given target false positive rates. For each group of equally sized alignments for each method (and for each null distribution and significance threshold), we tested whether the false positive rate correlates with PD using Spearman’s rho. Few methods had uncorrected P-values < 0.05 and none did when controlling for the 336 comparisons (smallest uncorrected P: 1.73e-3; ρ: 0.85 for MI j at N = 5000, P empirical < 0.001). Testing for a bulk correlation (ignoring method; normalizing PD by alignment size) reveals a weak positive correlation (ρ= 0.27, P < 1.9e-29) at P normal and P empirical < 0.05 but not < 0.001. Overall this suggests that the false positive rate may increase with more diverse sequences at loose significance thresholds. Alternatively, the PD ranges were too small to detect a relationship with false positive rate.
While the range in diversity for alignments with 5 sequences is small (PD: 7.5-11, N eff: 5), under the normal distribution, the false positive rate is better controlled in diverse alignments. However, under the empirical null, the Information-based methods do not control the FPR for these alignments and have larger false positive rates as diversity increases in these alignments.
One caveat of the HisKA-RR analysis is that (for computational reasons) we generated sub-alignments by random sampling and therefore only explored a range of phylogenies close to the typical diversity for each alignment size. We observe fairly strong correlations between cutoff-independent performance metrics and N eff (and also N eff/L, as L is constant in HisKA-RR). The alignments in Ovch32 provide a broader range of phylogenetic scenarios. Across these 32 interactions, N eff is weakly negatively correlated with the same performance metrics (Additional file 8). However, accounting for alignment length (with N eff/L) reveals that there is a positive relationship between alignment depth and performance (Additional file 9; Additional file 13: Figure S5 and S7). These results show that high N eff alone does not guarantee good performance. For example, taking the best performing method at each alignment pair, the alignment pair with the highest N eff had at best the fourth poorest ϕ max. Conversely, the third smallest N eff corresponds to the third best ϕ max; and at FPR < 0.001, it had the highest precision (PPV = 63 %). Interestingly, it also has the shortest length (L = 168 columns), suggesting that taking into account the proportion of possible contacts may play an important role in estimating expected performance.
Choice of null distribution affects performance
The previous results show performance based on the known HisKA-RR structure. In practice, when applying the methods in our study the structure usually is not known. One therefore uses a null distribution to control false predictions. Specifically, an upper quantile of the distribution of coevolutionary statistics in the absence of coevolutionary constraint is used as a threshold; one declares any pair of sites with a statistic exceeding the threshold a predicted contact. The goal is to minimize false predictions by predicting contacts only when statistics are much larger than expected by chance under the null distribution. A variety of null distributions are commonly used, including theoretical limiting distributions [8, 49, 50], the empirical distribution of observed scores (under the assumption that most pairs of sites are not coevolving) , and parametric, semi-parametric, and non-parametric bootstrap distributions [10, 52]. Theoretical and empirical nulls are computationally inexpensive compared to bootstrap methods, which require accurately simulating thousands of large data sets (See Additional file 13: Supplemental Text).
We used our sampled sub-alignments of HisKA-RR and the Ovch32 alignments  to compare the performance of two commonly used null distributions and to evaluate the sensitivity of each approach to alignment size. For each null distribution and coevolutionary statistic, we first employed the non-contact pairs of residues to assess if the FPR was truly controlled or not at given target FPRs (α) of 5 % and 0.1 %.
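The check described here, whether the realized FPR among non-contacts stays at or below the target α, can be sketched as follows. The function name and the dictionary layout are hypothetical, and predictions are assumed to be the pairs with P < α.

```python
def evaluate_at_alpha(p_values, contacts, alpha):
    """Confusion counts and derived rates when predicting pairs with P < alpha.

    p_values maps residue pairs to P-values under some null distribution;
    contacts is the set of pairs that are true contacts in the structure.
    """
    tp = fp = tn = fn = 0
    for pair, p in p_values.items():
        predicted = p < alpha
        if pair in contacts:
            tp += predicted
            fn += not predicted
        else:
            fp += predicted
            tn += not predicted
    nan = float("nan")
    return {
        "FPR": fp / (fp + tn) if fp + tn else nan,          # non-contacts called
        "power": tp / (tp + fn) if tp + fn else nan,        # contacts recovered
        "precision": tp / (tp + fp) if tp + fp else nan,    # calls that are contacts
    }
```

A ratio FPR/α above 1 indicates that the null distribution is anti-conservative at that threshold.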
Thus, while the normal distribution applied to standardized coevolution statistics can practically be used as a null distribution, we conclude that this approach results in elevated rates of false positive predictions, likely due to shared phylogeny, structural constraints affecting non-contacting residue pairs, or coevolution scores not being normally distributed (Additional file 13: Figure S30-S32). A theoretical null (e.g. non-central gamma ) that is parameterized for individual column pairs may therefore be more appropriate (See Additional file 13: Supplemental Text) and warrants future investigation.
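As a minimal sketch of the normal null, scores can be standardized and converted to upper-tail probabilities. The assumptions here are that larger scores indicate stronger coevolution and that the mean and standard deviation are estimated from all inter-protein scores of a given method; the function name is hypothetical.

```python
import math
import statistics


def p_normal(scores):
    """Upper-tail standard normal P-values for standardized scores."""
    mu = statistics.mean(scores)
    sd = statistics.stdev(scores)
    if sd == 0:
        raise ValueError("scores have zero variance; cannot standardize")
    # P(Z >= z) = 0.5 * erfc(z / sqrt(2)) for a standard normal Z
    return [0.5 * math.erfc((s - mu) / (sd * math.sqrt(2))) for s in scores]
```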
Another choice of null distribution is the observed empirical distribution of the coevolution statistics. A P-value (P empirical) for a score S is simply the proportion of scores that are more extreme than S. This straightforward method can be easily applied with any statistic. However, it also assumes that no pairs of sites are coevolving and should therefore produce thresholds that are too strict when there are some coevolving sites in the data set (i.e., making it harder to predict real contacts). Nevertheless, we found that the empirical null distribution does produce nominal FPRs that exceed target FPRs (Fig. 3; Additional file 13: Figure S13). As the proportion of contacts increases, the P empirical-values become more conservative (Additional file 13: Figure S26 and S27). The Direct methods best control the nominal FPR in both sets of alignments, marginally exceeding the target FPR in only a couple of cases (maximum FPR/α = 3.68). The Information-based methods controlled the FPR below 1.58 times α in the Ovch32 alignments; however, the HisKA-RR sub-alignments reveal that at N eff/L < 0.3, control of the FPR is lost, especially in MI minh (FPR/α > 400). The Phylogenetic method that consistently exceeded the target FPR was the CoMap correlation analysis (CMP cor), which makes no assumptions regarding the biochemical properties of the amino acids. These results suggest that the empirical null distribution is not as conservative an approach as one might expect from including contacting residue pairs in the null distribution. Instead, it may suffer from some of the same effects that make the normal null distribution anti-conservative, such as shared phylogeny or structural constraints. In some methods like MI minh, alignments with very few sequences (e.g. 5–50) have a limited number of possible scores, which leads to ties in P-values between contacting and non-contacting residues.
If contacts and non-contacts have roughly the same P empirical values, the target and nominal FPRs should be similar. But with many ties, predictions are made in blocks, possibly forcing discontinuous jumps in the nominal FPR with respect to the target FPR. This could compound or diminish the anti-conservativeness of P empirical.
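The effect of ties can be seen in a small sketch of the empirical P-value. One assumption is made explicit: a score counts as at least as extreme as itself, so tied scores share a P-value and no P-value is exactly zero. The function names are hypothetical.

```python
from bisect import bisect_left


def p_empirical(scores):
    """Proportion of scores that are at least as large as each score."""
    ordered = sorted(scores)
    n = len(ordered)
    # bisect_left finds the first position holding a value >= s,
    # so n minus that position counts the scores >= s.
    return [(n - bisect_left(ordered, s)) / n for s in scores]


def n_predicted(p_values, alpha):
    """Number of pairs called at threshold alpha (P <= alpha)."""
    return sum(p <= alpha for p in p_values)
```

With scores 3, 1, 2, 3 the two tied top scores both receive P = 0.5, so the number of predictions jumps from zero to two as α crosses 0.5; no threshold selects just one of them.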
Cross-species case study: applying coevolution methods to Vif-A3G identifies some residues known to affect host-virus interactions
Viral infectivity factor (Vif) is a lentiviral accessory protein whose primary function is to target the antiviral cytidine deaminase APOBEC3G (A3G) of its mammalian hosts through ubiquitination. Because the two protein families are in an evolutionary arms race [54, 55], we hypothesized that they would be an informative example for exploring the utility of coevolution methods in host-virus protein pairs (i.e. inter-protein, inter-species interactions). This is a novel application of coevolution analysis, which has primarily been applied to residues within a protein or between pairs of proteins in the same genome.
Table 3 Important residues for the Vif-A3G interaction
First, we computed a subset of coevolutionary statistics for all Vif-A3G residue pairs and evaluated how well the statistics pinpoint the positive functionally important residues compared to negatives. For this evaluation, we used the empirical distribution of scores as a null distribution to determine statistical significance (i.e., P empirical) because it has lower false positive rates across N eff/L values at strict significance thresholds. Because the positives and negatives are single residues in each sequence instead of inter-protein residue pairs, we summarized P empirical for each residue by assigning it the most significant P empirical across all inter-protein pairs to which it belongs, and then explored the Vif and A3G results individually (Additional file 7). From our benchmarking on the bacterial data sets, we know that significance thresholds that control the FPR vary by method and N eff/L, and that strict thresholds that yield very low (∼ 2–3 %) power are typically needed to control FPR in small alignments. We therefore chose to identify a significance threshold for each method that maximizes precision on the known functional sites in each protein. Then, we estimated power and FPR at these thresholds.
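The per-residue summary can be sketched as a minimum over pairs. The data layout is an assumption: pair-level P-values are keyed by (Vif position, A3G position), and the function name is hypothetical.

```python
def residue_min_p(pair_p):
    """Assign each residue the smallest P-value over all pairs it belongs to.

    pair_p maps (position in protein A, position in protein B) to a P-value.
    Returns one dictionary per protein.
    """
    best_a, best_b = {}, {}
    for (i, j), p in pair_p.items():
        best_a[i] = min(p, best_a.get(i, 1.0))
        best_b[j] = min(p, best_b.get(j, 1.0))
    return best_a, best_b
```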
On Vif, with the exception of CMP cor and DI32, the maximum precisions for each method ranged from 9 to 20 % (i.e. only one or two residues out of ten predicted to be positives are truly positives) (Additional file 13: Figure S34). At these precision-optimized thresholds, MI j and MI minh predict almost every Vif residue to be coevolving; a stricter threshold would not result in a lower proportion of incorrect predictions. In contrast, the precisions for CMP pol, CMP cor, DI32 are the highest (20 %, 40 %, 100 % respectively). However, this comes at the cost of making the fewest number of predictions with the latter only making a single prediction. For these methods, less strict thresholds are needed to identify a greater proportion of positives at the cost of increasing the proportion of false discoveries. Across all methods, low f max and ϕ max values (0.26 and below) suggest there are no significance thresholds that balance power and precision for this data set.
Second, we visualized the localization of Vif residues predicted to be coevolving with A3G on a partial structure of Vif in complex with cofactors utilized for protein ubiquitination (Additional file 7, Additional file 13: Figure S36). The authors of that structure observed that a critical subset of the Vif positives is solvent-exposed. We re-evaluated performance with only these residues as the positives (Table 3). Precision for identifying the putative solvent-exposed interface is poor across methods; CMP cor at 40 % and CMP vol at 10 % are the only methods with precision > 6 % (Additional file 13: Figure S35).
Our analysis of the Vif-A3G interaction confirms that power to detect functionally important residues in each protein family is also low in inter-protein analyses between species, even though it is plausible that an arms race between lentivirus and mammal would give rise to stronger signals of coevolution compared to background. It is important to consider that the positions considered positives may not all be of equal evolutionary importance across primates. Interfaces may be gained or lost, and the rapid evolution of the two proteins likely produces many alternative solutions to maintaining an antagonistic interaction. There were many predicted positions that were not in the positives, and further systematic validation and more comprehensive sequencing of lentiviruses and primates are needed to determine which pairs of residues are actually in close proximity or functionally required for other reasons. Additionally, there appears to be some level of complementarity in the predictions made by VI and MI minh and the CMP methods, which measure different biochemical trade-offs between coevolving residues. This strengthens the rationale for integrating methods to better predict interface residues experiencing potentially different evolutionary constraints (e.g. structural, catalytic activity, specificity). Coevolutionary analysis can help to generate and prioritize candidates for these experiments.
A toolkit for inter-molecular coevolution analysis
Due to the diversity of coevolution methods and the time span over which they were developed, it is no surprise that they vary widely in the input and output formats they tolerate. Additionally, many of the coevolution methods we tested are computationally expensive, so we prepared our workflow to take advantage of multiprocessing workstations and high performance computing clusters. We outline a few utilities we developed to aid in processing sequences, structures, and coevolution results for benchmarking and making predictions and visualizations.
Our toolkit consists of three parts: (1) a collection of wrappers for running the coevolution programs from the command line and, where possible, in a Sun Grid Engine super computing environment (https://github.com/aavilahe/coevo_tools), (2) an R package for evaluating performance and calculating P empirical and P normal (https://github.com/aavilahe/coevo_analysis_Rpackage), and (3) pre- and post-processing utilities to facilitate input and output format management, mapping alignments to structural models, and visualizing coevolving residues on protein structures (https://github.com/aavilahe/coevo_analysis_pypackage).
We also implemented the canonical mutual information statistic, the normalizations of mutual information in Martin et al. , and VI, the information theoretic distance described by Meila  (https://github.com/aavilahe/infcalc).
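The relationships among these statistics can be written compactly in terms of the column entropies H(X), H(Y) and the joint entropy H(X,Y). The sketch below is illustrative rather than a copy of infcalc: the function name is hypothetical, MI minh is assumed to denote MI divided by the smaller of the two column entropies, and undefined ratios (zero denominators) are reported as 0.

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def info_stats(col_x, col_y):
    """MI, two normalized variants, and VI for a pair of alignment columns."""
    hx, hy = entropy(col_x), entropy(col_y)
    hxy = entropy(list(zip(col_x, col_y)))
    mi = hx + hy - hxy
    return {
        "MI": mi,
        "MIj": mi / hxy if hxy > 0 else 0.0,               # normalized by joint entropy
        "MIminh": mi / min(hx, hy) if min(hx, hy) > 0 else 0.0,
        "VI": hxy - mi,                                     # equals H(X|Y) + H(Y|X)
    }
```

Unlike MI, VI is a distance: perfectly covarying columns give VI = 0, so smaller values indicate stronger association.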
The coevolution methods we tested accept three different file formats and take alignments either as two separate files or as one horizontally concatenated file. The different formats (fasta, phylip, raw reads) store more or less meta-data and have limits on the length of sequence identifiers.
Our coevo package at https://github.com/aavilahe/coevo_analysis_pypackage depends on the Biopython library and contains many auxiliary functions and executable python scripts for input file preparation.
A typical processing step may involve truncating sequence identifiers when converting between sequence formats, taking care that they remain informative and unique. For example:
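A minimal sketch of such a step is given here. It is not taken from the coevo package; the function name truncate_ids and the default limit of 10 characters (the limit of strict phylip names) are assumptions made for illustration.

```python
def truncate_ids(ids, max_len=10):
    """Shorten identifiers to max_len characters, keeping them unique.

    When two truncated names collide, the later one gets a numeric suffix
    that replaces its trailing characters. Returns a dict mapping each
    original identifier to its shortened form.
    """
    mapping, used = {}, set()
    for original in ids:
        short = original[:max_len]
        counter = 1
        while short in used:
            suffix = f"_{counter}"
            short = original[:max_len - len(suffix)] + suffix
            counter += 1
        used.add(short)
        mapping[original] = short
    return mapping
```

Keeping the returned mapping makes it possible to restore the original, more informative identifiers after a coevolution program has run.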
The coevolution methods return tab, space, or comma delimited output with and without headers. The scores returned are often indexed by column numbers of the concatenated alignment and not the original two alignments of interest, and can be numbered starting from 0 or 1.
The scores module in our coevo package includes definitions for the various formats we encountered, extracts the relevant indices and scores, optionally merges results from different methods, and processes them to a standard tab delimited format with appropriate headers and indices that correspond to the alignments of interest. For example:
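The re-indexing idea can be sketched as follows, assuming the concatenated alignment places protein A before protein B. The function names and the (column, column, score) row layout are hypothetical and do not reproduce the scores module.

```python
def split_concat_index(col, len_a, len_b, one_based=True):
    """Map a column of a concatenated alignment (A then B) to
    (alignment label, 1-based column within that alignment)."""
    idx = col - 1 if one_based else col  # normalise to 0-based
    if not 0 <= idx < len_a + len_b:
        raise IndexError(f"column {col} is outside the concatenated alignment")
    if idx < len_a:
        return ("A", idx + 1)
    return ("B", idx - len_a + 1)


def inter_protein_pairs(rows, len_a, len_b, one_based=True):
    """Keep only A-B pairs from (col_i, col_j, score) rows and re-index them
    as (column in A, column in B, score)."""
    out = []
    for i, j, score in rows:
        (pa, ca), (pb, cb) = sorted([
            split_concat_index(i, len_a, len_b, one_based),
            split_concat_index(j, len_a, len_b, one_based),
        ])
        if pa == "A" and pb == "B":
            out.append((ca, cb, score))
    return out
```

Passing one_based=False handles programs that number columns from 0; pairs that fall within a single protein are dropped.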
Another important procedure is to map column numbers from a given alignment to a reference PDB structure. For example, we used map_column_to_resnum.py and get_dists.py to map atomic distances to column-pairs in existing alignments in order to compare them to coevolution scores and P-values and to validate predictions. The HisKA-RR complex (PDB: 3DGE) is actually an ABAB tetramer: two sets of identical chains form a structure such that a HisKA chain makes contact with two RR chains. One can use min_dists.py to get the minimum distances between residues from both interactions. For a detailed example, see https://github.com/aavilahe/coevo_analysis_pypackage/blob/dev/example/pdb_tests/example_3DGE_column_distances.sh.
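The idea of taking the minimum over both HisKA-RR chain pairings can be sketched as below. This is an illustration only, not the behaviour of min_dists.py; the function name and the dictionary layout keyed by (HisKA column, RR column) are assumptions.

```python
def min_distances(*distance_maps):
    """Combine several {(col_a, col_b): distance} maps by keeping, for each
    column pair, the smallest distance observed in any map."""
    combined = {}
    for dmap in distance_maps:
        for pair, dist in dmap.items():
            if pair not in combined or dist < combined[pair]:
                combined[pair] = dist
    return combined
```

With one map per chain pairing, a column pair counts as close if it is close in either of the two interfaces.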
Visualization of coevolution score summaries on individual residues can be accomplished by generating an attributes file for use with UCSF Chimera  using make_attributes.py (e.g. Additional file 7; Additional file 13: Figure S36 shows Vif residues predicted to coevolve with A3G, with each Vif residue colored by its most significant P-value over all A3G residues).
In this work we aimed to paint a picture of the performance of emerging methods to identify inter-protein contacts using coevolution and to identify properties of alignments where performance is expected to be best. As previously noted in intra-protein predictions [3, 9, 14], re-weighting of the sequences to account for the underlying phylogeny is important for inter-protein predictions as well. However, as the comparison between MI w and MI shows, it is important to tune the parameters controlling the re-weighting in cases where there are fast evolving alignment columns in an overall conserved protein family. Fortunately, methods that search for direct correlations (using a global statistical model for the sequence alignments) seem to be able to correct for the improper weighting (compare MI w to DI). These methods are more precise at strict false positive rates than their counterparts, especially when the alignments have N eff/L < 1.0. However, it may be beneficial to use a faster, MI-based method if the use case allows for a relaxed FPR and prioritizes power over precision.
We also investigated the use of three null models to control the false positive rate. Counter-intuitively, a null model that explicitly models evolution independently for each alignment fails to control the false positive rate. We believe that our simulated alignments systematically score too low because they fail to capture the correct amount of variation in the observed alignments, resulting in artificially significant P-values, except when small alignment sizes result in overly conservative P-values. Using a standard normal or the empirical distribution of scores as null models also failed to control the false positive rate, likely due to the correlation structure imposed by the shared evolutionary history of the residues, the distribution of evolutionary rates of the residues, or because asymptotic assumptions do not hold at small sample sizes. Thus, choosing an appropriate P-value cutoff in a real analysis when the structure is unknown and alignment depth is shallow remains a challenge. However, we show that in sufficiently diverse alignments the empirical null successfully controls the false positive rate for Direct methods. As a future direction, we suggest exploring theoretical null distributions that can be parameterized for individual alignment column pairs such as  or further improving protein evolution simulators to generate distributions of scores where the evolutionary rates are more similar between the null and alternative hypotheses.
These results are encouraging, but still leave us with the challenge of how to choose an appropriate P-value cutoff in a real analysis when the structure is unknown. Since our findings indicate that observed FPRs exceed target FPRs using P normal and P empirical for nearly all methods, P-value cutoffs stricter than the target false positive rate seem warranted. But it is not clear how much stricter they need to be in any given alignment pair. Such adjustments should take into account alignment properties such as N eff/L and the proportion of contacts expected to exist (Additional file 13: Figure S27; Fig. 3). However, large data sets of many protein interactions are needed in order to be confident in parameters or prior probabilities to be used to correct the P-values. Hence, in most applications one must simply aim to control a target FPR, knowing that the true error rate is likely to be larger. For this reason, the empirical null distribution may be the best choice, as it controls error rates across the majority of alignment sizes, target FPRs, and coevolution methods tested (Fig. 3; Additional file 13: Figure S13). As a rule of thumb, the empirical null overall controls the FPR for the Direct methods; however, in small alignments (5 sequences or N eff/L < 0.3) the observed FPR can be up to 1.5 times the target FPR. For the purposes of data collection and experimental design, we therefore recommend sequencing phylogenetically deeply enough to attain N eff/L > 1.0 to control FPR and > 2.0 to ensure modest TPR and PPV.
A related problem to the one discussed here is to search a large set of protein pairs (within or between species) to determine which ones are interacting. In this setting, coevolution method performance is potentially more important than when predicting contacting residues for known interactions, because the search space will contain so many negatives (i.e., non-interacting pairs). A permissive P-value cutoff will lead to a large number of false positives that may misinform investigators, while being too strict will lead to false negatives that keep potentially important findings hidden. It would be interesting to understand if thresholds and the methods for choosing them generalize to all protein-protein interactions. Different experimental techniques have strengths and weaknesses in identifying different types of interactions. Interactions may be transient but highly critical, or tightly binding but too conserved to detect any sequence variation among the sequenced orthologs. Ideally, we would like to understand what a null model teaches us about phylogeny-induced correlations when structural inter- or intra-protein constraints are minimal, perhaps at an evolutionary stage where a protein interaction is acquired or lost. What can this reveal about the birth and death of protein interactions, regulatory networks, and neofunctionalization? Another challenge for predicting interacting protein pairs from coevolutionary tests is how to summarize statistics for individual pairs of residues to produce a single score for a pair of proteins. Although outside the scope of our work, such a strategy would likely involve comparing tails of score or P-value distributions. Deciding how much of the tail to consider will depend highly on having an estimate of the false positive rate.
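To make the use of an empirical null concrete, the sketch below assigns each residue pair a P-value equal to the fraction of all observed scores that are at least as large. This is one common rank-based definition and is an assumption here; the R package may compute P empirical differently. The function names are hypothetical, and larger scores are assumed to indicate stronger coevolution.

```python
from bisect import bisect_left


def empirical_p_values(scores):
    """P-value of each score = fraction of all scores that are >= it."""
    ordered = sorted(scores)
    n = len(ordered)
    return [(n - bisect_left(ordered, s)) / n for s in scores]


def call_pairs(scores, alpha):
    """Indices of pairs whose empirical P-value is at most alpha."""
    return [i for i, p in enumerate(empirical_p_values(scores)) if p <= alpha]
```

Because the observed FPR can exceed the target in shallow alignments, a caller may prefer to pass an alpha stricter than the nominal target.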
Based on some preliminary investigations of these questions, we conclude that it is unlikely that cross-species interacting protein pairs can be accurately distinguished from non-interacting pairs on a genome-wide scale.
The progress of high-throughput interaction mapping highlights the need for continued refinement of inter-protein coevolution detection methods. Given that improper re-weighting of sequences can negatively affect power and the false positive rate, perhaps expanding Direct methods to independently obtain sequence weights for each alignment or using an evolution-based probabilistic weight (such as in CoMap or using phylogenetic logistic regression) for unusual variation in each column is a logical next step forward. Another important contribution would be to develop a generalizable null model that can help differentiate contacts when there are very few sequences available for protein families. Furthermore, investigating the correlations among the coevolution statistics themselves in inter-protein data sets could potentially disentangle structural from non-structural coevolutionary forces as well as serving to construct an ensemble method. Comprehensively sequencing orthologous pairs of protein families is a straightforward way to test the usefulness of these future contributions while simultaneously enabling current methods to perform to their fullest.
We benchmarked 13 coevolution methods on 33 protein interactions with associated sequence alignments of varying depths. We conclude that coevolutionary analyses of cross-species protein-protein interactions are largely hindered by a lack of phylogenetically deep alignments for many proteins, and further demonstrate this in an example case involving an HIV-1-human protein interaction. Additionally, we report that commonly used null distributions generally fail to control false positives in coevolutionary analyses, though errors are best controlled by the empirical null in large alignments.
Multiple sequence alignments
A master alignment of 8998 horizontally concatenated HisKA and RR sequences from Procaccini et al.  was graciously provided by the authors (Additional files 2 and 3). From this alignment, aligned sequences were sampled uniformly (each sequence had equal probability of being sampled) to create sub-alignments with 5, 50, 250, 500, 1000, and 5000 sequences. We sampled 10 sub-alignments of each alignment size (number of sequences in sub-alignment), resulting in 60 total alignment pairs (Additional file 4).
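The sub-sampling design can be sketched as follows. Sampling without replacement, the fixed seed, and the function name are assumptions for illustration. Rows are sampled from the concatenated alignment so that each HisKA sequence stays paired with its RR partner.

```python
import random

SIZES = (5, 50, 250, 500, 1000, 5000)
REPLICATES = 10


def sample_subalignments(n_sequences, sizes=SIZES, replicates=REPLICATES, seed=0):
    """Return {(size, replicate): sorted row indices}, each set drawn
    uniformly without replacement from range(n_sequences)."""
    rng = random.Random(seed)
    return {
        (size, rep): sorted(rng.sample(range(n_sequences), size))
        for size in sizes
        for rep in range(replicates)
    }
```

Called with n_sequences=8998, this yields the 60 index sets (6 sizes by 10 replicates) corresponding to the design described above.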
The Ovch32 alignments  were downloaded from the complexes section of the Baker lab website (http://gremlin.bakerlab.org/complexes/PDB_benchmark_alignments.zip) on Aug 29, 2014 (Additional file 1). A stable link is located at the Dryad repository, doi:10.5061/dryad.s00vr/7 . The corresponding structures were downloaded from PDB and processed to obtain contacts between representative protein chains. See Supplemental Files for accessions. Columns comprising more than 75 % gaps were removed as in . Additionally, only columns mapping to the representative structure were kept.
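The gap filter can be sketched as below, assuming the alignment is held as a list of equal-length strings and that "-" and "." denote gaps. The function name is hypothetical.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.75, gap_chars="-."):
    """Remove columns whose gap fraction exceeds max_gap_fraction.

    alignment: list of equal-length aligned sequence strings.
    Returns (filtered alignment, kept 0-based column indices).
    """
    n_seqs = len(alignment)
    keep = [
        j for j in range(len(alignment[0]))
        if sum(seq[j] in gap_chars for seq in alignment) / n_seqs <= max_gap_fraction
    ]
    return ["".join(seq[j] for j in keep) for seq in alignment], keep
```

Returning the kept column indices preserves the link between filtered columns and the original alignment, which is needed later when mapping columns to the representative structure.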
The CoMap implementation requires a preprocessing step to remove sequence redundancy (a data munging alternative to sequence weighting). This additional step was also necessary to prevent buffer underflow errors when evaluating likelihoods in very large input trees. Therefore, all alignments with more than 200 sequences were culled to contain the 200 most diverse sequences before being passed to CoMap. The sub-alignment used corresponds to the 200-leaf sub-tree that maximizes PD for each original input alignment and tree.
Versions and sources of coevolution methods benchmarked
Code S1 in 
Evaluating coevolution performance
For each method, coevolution scores for pairs of amino acid positions were used to predict inter-domain pairs of amino acid residues that are close to each other in the representative co-crystal structure (PDB ID: 3DGE).
As previously described in Ezkurdia et al. , Monastyrskyy et al. , Jones et al. , and to be consistent with Morcos et al. , we define positives as pairs of alignment positions mapping to amino acid residues whose beta carbons (C β ) are less than 8 angstroms apart in 3DGE. All other pairs of alignment positions are considered negatives.
Residue pairs are predicted as coevolving if their scores are above, or their P-values below, a given threshold (e.g. top 1 %, P < 0.05) (Table 2).
Phylogenetic diversity (PD) is calculated as the sum of the branch lengths in a tree built from the concatenated multiple sequence alignment of both proteins. Trees were built using FastTree (version 2.1.7 SSE3) with options -gamma -nosupport -wag.
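Given the definitions above, the performance measures can be sketched as follows. The function name and the dictionary inputs keyed by column pair are assumptions; the R package may organize this differently.

```python
def confusion_rates(scores, distances, score_cutoff, contact_cutoff=8.0):
    """Compare predictions (score > score_cutoff) with structural positives
    (distance < contact_cutoff, in angstroms).

    scores, distances: dicts keyed by (col_a, col_b).
    Returns a dict with TPR, FPR and PPV (nan when undefined).
    """
    tp = fp = fn = tn = 0
    for pair, score in scores.items():
        positive = distances[pair] < contact_cutoff
        predicted = score > score_cutoff
        if predicted and positive:
            tp += 1
        elif predicted:
            fp += 1
        elif positive:
            fn += 1
        else:
            tn += 1
    nan = float("nan")
    return {
        "TPR": tp / (tp + fn) if tp + fn else nan,
        "FPR": fp / (fp + tn) if fp + tn else nan,
        "PPV": tp / (tp + fp) if tp + fp else nan,
    }
```

For a P-value threshold, the same function can be applied to negated P-values, since a smaller P-value corresponds to a stronger prediction.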
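As a minimal sketch, PD can be obtained from a Newick string by summing every branch length. The regular expression below assumes a tree without support values (as produced with -nosupport), without negative branch lengths, and without quoted labels containing colons; the function name is hypothetical.

```python
import re


def phylogenetic_diversity(newick):
    """Sum every ':<length>' entry in a Newick tree string."""
    lengths = re.findall(r":\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)", newick)
    return sum(float(x) for x in lengths)
```

For the toy tree "((A:0.1,B:0.2):0.05,C:0.3);" the four branch lengths are summed.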
We thank Martin Weigt for providing HisKA and RR alignments and for providing links to DCA source code. We also thank Julien Dutheil for help running CoMap correctly. We greatly appreciate Art Poon for his help in adapting Spidermonkey batch files for amino acid analyses. This work was supported by a National Institutes of Health bioinformatics training grant, a UCSF Graduate Research Mentorship Fellowship, institutional funding from Gladstone Institutes, and a gift from the San Simeon Fund.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Yip KY, Patel P, Kim PM, Engelman DM, McDermott D, Gerstein M. An integrated system for studying residue coevolution in proteins. Bioinformatics. 2008; 24(2):290–2. doi:10.1093/bioinformatics/btm584.
- Dutheil J, Galtier N. Detecting groups of coevolving positions in a molecule: a clustering approach. BMC Evol Biol. 2007; 7:242. doi:10.1186/1471-2148-7-242.
- Dutheil JY. Detecting coevolving positions in a molecule: why and how to account for phylogeny. Brief Bioinform. 2012; 13(2):228–43. doi:10.1093/bib/bbr048.
- de Juan D, Pazos F, Valencia A. Emerging methods in protein co-evolution. Nat Rev Genet. 2013; 14(4):249–61. doi:10.1038/nrg3414.
- Buslje CM, Santos J, Delfino JM, Nielsen M. Correction for phylogeny, small number of observations and data redundancy improves the identification of coevolving amino acid pairs using mutual information. Bioinformatics. 2009; 25(9):1125–31. doi:10.1093/bioinformatics/btp135.
- Fares MA, Travers SA. A novel method for detecting intramolecular coevolution adding a further dimension to selective constraints analyses. Genetics. 2006; 173(1):9–23. doi:10.1534/genetics.105.053249.
- Dahirel V, Shekhar K, Pereyra F, Miura T, Artyomov M, Talsania S, et al. Coordinate linkage of HIV evolution reveals regions of immunological vulnerability. Proc Natl Acad Sci USA. 2011; 108(28):11530–5. doi:10.1073/pnas.1105315108.
- Dunn SD, Wahl LM, Gloor GB. Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics. 2008; 24(3):333–40. doi:10.1093/bioinformatics/btm604.
- Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci USA. 2011; 108(49):E1293–301. doi:10.1073/pnas.1111471108.
- Dutheil J, Pupko T, Jean-Marie A, Galtier N. A model-based approach for detecting coevolving positions in a molecule. Mol Biol Evol. 2005; 22(9):1919–28. doi:10.1093/molbev/msi183.
- Pollock DD, Taylor WR, Goldman N. Coevolving protein residues: maximum likelihood identification and relationship to structure. J Mol Biol. 1999; 287(1):187–98. doi:10.1006/jmbi.1998.2601.
- Caporaso JG, Smit S, Easton BC, Hunter L, Huttley GA, Knight R. Detecting coevolution without phylogenetic trees? tree-ignorant metrics of coevolution perform as well as tree-aware metrics. BMC Evol Biol. 2008;8(327). doi:10.1186/1471-2148-8-327.
- Weigt M, White RA, Szurmant H, Hoch JA, Hwa T. Identification of direct residue contacts in protein-protein interaction by message passing. Proc Natl Acad Sci USA. 2009; 106(1):67–72. doi:10.1073/pnas.0805923106.
- Jones DT, Buchan DW, Cozzetto D, Pontil M. PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics. 2012; 28(2):184–90. doi:10.1093/bioinformatics/btr638.
- Burger L, van Nimwegen E. Disentangling direct from indirect co-evolution of residues in protein alignments. PLoS Comput Biol. 2010; 6(1):e1000633. doi:10.1371/journal.pcbi.1000633.
- Delaporte E, Wyler Lazarevic CA, Iten A, Sudre P. Large measles outbreak in geneva, switzerland, january to august 2011: descriptive epidemiology and demonstration of quarantine effectiveness. Euro Surveill Bull. 2013;18(6). http://www.ncbi.nlm.nih.gov/pubmed/23410259.
- Clark GW, Ackerman SH, Tillier ER, Gatti DL. Multidimensional mutual information methods for the analysis of covariation in multiple sequence alignments. BMC Bioinformatics. 2014; 15(1):157. doi:10.1186/1471-2105-15-157.
- McLaughlin Jr RN, Poelwijk FJ, Raman A, Gosal WS, Ranganathan R. The spatial architecture of protein function and adaptation. Nature. 2012; 491(7422):138–42. doi:10.1038/nature11500.
- Uversky VN, Oldfield CJ, Dunker AK. Intrinsically disordered proteins in human diseases: Introducing the D2 concept. Ann Rev Biophys. 2008; 37(1):215–46. doi:10.1146/annurev.biophys.37.032807.125924.
- Dyson HJ, Wright PE. Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol. 2005; 6(3):197–208. doi:10.1038/nrm1589.
- Ben-Shem A, Garreau de Loubresse N, Melnikov S, Jenner L, Yusupova G, Yusupov M. The structure of the eukaryotic ribosome at 3.0 Å resolution. Science (New York, NY). 2011; 334(6062):1524–9. doi:10.1126/science.1212642.
- Lasker K, Forster F, Bohn S, Walzthoeni T, Villa E, Unverdorben P, et al. Molecular architecture of the 26s proteasome holocomplex determined by an integrative approach. Proc Natl Acad Sci USA. 2012; 109(5):1380–7. doi:10.1073/pnas.1120559109.
- Jager S, Cimermancic P, Gulbahce N, Johnson JR, McGovern KE, Clarke SC, et al. Global landscape of HIV-human protein complexes. Nature. 2012; 481(7381):365–70. doi:10.1038/nature10719.
- Vinayagam A, Zirin J, Roesel C, Hu Y, Yilmazel B, Samsonova AA, et al. Integrating protein-protein interaction networks with phenotypes reveals signs of interactions. Nat Methods. 2013; 11(1):94–9. doi:10.1038/nmeth.2733.
- Kamisetty H, Ovchinnikov S, Baker D. Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era. Proc Natl Acad Sci USA. 2013; 110(39):15674–9. doi:10.1073/pnas.1314045110.
- Hopf TA, Scharfe CP, Rodrigues JP, Green AG, Kohlbacher O, Sander C, et al. Sequence co-evolution gives 3d contacts and structures of protein complexes. Elife. 2014;3. doi:10.7554/eLife.03430.
- Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, Zecchina R, et al. Protein 3d structure computed from evolutionary sequence variation. PloS ONE. 2011; 6(12):e28766. doi:10.1371/journal.pone.0028766.
- Hopf TA, Colwell LJ, Sheridan R, Rost B, Sander C, Marks DS. Three-dimensional structures of membrane proteins from genomic sequencing. Cell. 2012; 149(7):1607–21. doi:10.1016/j.cell.2012.04.012.
- Marks DS, Hopf TA, Sander C. Protein structure prediction from sequence variation. Nat Biotechnol. 2012; 30(11):1072–80. doi:10.1038/nbt.2419.
- Ovchinnikov S, Kamisetty H, Baker D. Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information. Elife. 2014; 3:e02030. doi:10.7554/eLife.02030.
- Juan D, Pazos F, Valencia A. High-confidence prediction of global interactomes based on genome-wide coevolutionary networks. Proc Natl Acad Sci USA. 2008; 105(3):934–9. doi:10.1073/pnas.0709671105.
- Gershoni M, Fuchs A, Shani N, Fridman Y, Corral-Debrinski M, Aharoni A, et al. Coevolution predicts direct interactions between mtDNA-encoded and nDNA-encoded subunits of oxidative phosphorylation complex i. J Mol Biol. 2010; 404(1):158–71. doi:10.1016/j.jmb.2010.09.029.
- Clark NL, Gasper J, Sekino M, Springer SA, Aquadro CF, Swanson WJ. Coevolution of interacting fertilization proteins. PLoS Genet. 2009; 5(7):e1000570. doi:10.1371/journal.pgen.1000570.
- Yeang CH, Haussler D. Detecting coevolution in and among protein domains. PLoS Comput Biol. 2007; 3(11):e211. doi:10.1371/journal.pcbi.0030211.
- Morris JH, Knudsen GM, Verschueren E, Johnson JR, Cimermancic P, Greninger AL, et al. Affinity purification-mass spectrometry and network analysis to understand protein-protein interactions. Nat Protoc. 2014; 9(11):2539–54. doi:10.1038/nprot.2014.164.
- Brückner A, Polge C, Lentze N, Auerbach D, Schlattner U. Yeast two-hybrid, a powerful tool for systems biology. Int J Mol Sci. 2009; 10(6):2763–88. doi:10.3390/ijms10062763.
- Vidal M, Fields S. The yeast two-hybrid assay: still finding connections after 25 years. Nat Methods. 2014; 11(12):1203–6. http://www.nature.com/articles/nmeth.3182.
- Michnick SW, Ear PH, Landry C, Malleshaiah MK, Messier V. Protein-fragment complementation assays for large-scale analysis, functional dissection and dynamic studies of protein-protein interactions in living cells. Methods Mol Biol. 2011; 756:395–425. doi:10.1007/978-1-61779-160-4_25.
- Shapira SD, Gat-Viks I, Shum BO, Dricot A, de Grace MM, Wu L, et al. A physical and regulatory map of host-influenza interactions reveals pathways in h1n1 infection. Cell. 2009; 139(7):1255–67. doi:10.1016/j.cell.2009.12.018.
- Liao HX, Lynch R, Zhou T, Gao F, Alam SM, Boyd SD, et al. Co-evolution of a broadly neutralizing HIV-1 antibody and founder virus. Nature. 2013; 496(7446):469–76. doi:10.1038/nature12053.
- Procaccini A, Lunt B, Szurmant H, Hwa T, Weigt M. Dissecting the specificity of protein-protein interaction in bacterial two-component signaling: Orphans and crosstalks. PLoS ONE. 2011; 6(5):e19729. doi:10.1371/journal.pone.0019729.
- Schug A, Weigt M, Onuchic JN, Hwa T, Szurmant H. High-resolution protein complexes from integrating genomic information with molecular simulation. Proc Natl Acad Sci USA. 2009; 106(52):22124–9. doi:10.1073/pnas.0912100106.
- Faith DP. Conservation evaluation and phylogenetic diversity. Biol Conserv. 1992; 61(1):1–10. doi:10.1016/0006-3207(92)91201-3.
- Casino P, Rubio V, Marina A. Structural insight into partner specificity and phosphoryl transfer in two-component signal transduction. Cell. 2009; 139(2):325–36. doi:10.1016/j.cell.2009.08.032.
- Li L, Shakhnovich EI, Mirny LA. Amino acids determining enzyme-substrate specificity in prokaryotic and eukaryotic protein kinases. Proc Natl Acad Sci USA. 2003; 100(8):4463–8. doi:10.1073/pnas.0737647100.
- Haldimann A, Prahalad MK, Fisher SL, Kim SK, Walsh CT, Wanner BL. Altered recognition mutants of the response regulator PhoB: a new genetic strategy for studying protein-protein interactions. Proc Natl Acad Sci USA. 1996; 93(25):14361–6. http://www.ncbi.nlm.nih.gov/pubmed/8962056.
- Skerker JM, Perchuk BS, Siryaporn A, Lubin EA, Ashenberg O, Goulian M, et al. Rewiring the specificity of two-component signal transduction systems. Cell. 2008; 133(6):1043–54. doi:10.1016/j.cell.2008.04.040.
- Laub MT, Goulian M. Specificity in two-component signal transduction pathways. Annu Rev Genet. 2007; 41:121–45. doi:10.1146/annurev.genet.41.042007.170548.
- Tillier ERM, Lui TWH. Using multiple interdependency to separate functional from phylogenetic correlations in protein alignments. Bioinformatics. 2003; 19(6):750–55. doi:10.1093/bioinformatics/btg072.
- Fodor AA, Aldrich RW. Influence of conservation on calculations of amino acid covariance in multiple sequence alignments. Proteins: Structure Function Bioinform. 2004; 56(2):211–21. doi:10.1002/prot.20098.
- Gouveia-Oliveira R, Roque FS, Wernersson R, Sicheritz-Ponten T, Sackett PW, Molgaard A, et al. InterMap3d: predicting and visualizing co-evolving protein residues. Bioinformatics. 2009; 25(15):1963–5. doi:10.1093/bioinformatics/btp335.
- Wollenberg KR, Atchley WR. Separation of phylogenetic and functional associations in biological sequences by using the parametric bootstrap. Proc Natl Acad Sci USA. 2000; 97(7):3288–91. doi:10.1073/pnas.070154797.
- Goebel B, Dawy Z, Hagenauer J, Mueller JC. An approximation to the distribution of finite sample size mutual information estimates. In: Communications, 2005. ICC 2005. IEEE International Conference on, vol. 2. IEEE: 2005. p. 1102–6. https://ieeexplore.ieee.org/ielx5/9996/32110/01494518.pdf, doi:10.1109/ICC.2005.1494518.
- Compton AA, Hirsch VM, Emerman M. The host restriction factor APOBEC3g and retroviral vif protein coevolve due to ongoing genetic conflict. Cell Host Microbe. 2012; 11(1):91–8. doi:10.1016/j.chom.2011.11.010.
- Compton AA, Emerman M. Convergence and divergence in the evolution of the APOBEC3g-vif interaction reveal ancient origins of simian immunodeficiency viruses. PLoS Pathog. 2013; 9(1):e1003135. doi:10.1371/journal.ppat.1003135.
- Chen G, He Z, Wang T, Xu R, Yu XF. A patch of positively charged amino acids surrounding the human immunodeficiency virus type 1 vif SLVx4yx9y motif influences its interaction with APOBEC3g. J Virol. 2009; 83(17):8674–82. doi:10.1128/JVI.00653-09.
- Russell RA, Pathak VK. Identification of two distinct human immunodeficiency virus type 1 vif determinants critical for interactions with human APOBEC3g and APOBEC3f. J Virol. 2007; 81(15):8201–10. doi:10.1128/JVI.00395-07.
- Zhang H, Pomerantz RJ, Dornadula G, Sun Y. Human immunodeficiency virus type 1 vif protein is an integral component of an mRNP complex of viral RNA and could be involved in the viral RNA folding and packaging process. J Virol. 2000; 74(18):8252–61. http://www.ncbi.nlm.nih.gov/pubmed/10954522.
- He Z, Zhang W, Chen G, Xu R, Yu XF. Characterization of conserved motifs in HIV-1 vif required for APOBEC3g and APOBEC3f interaction. J Mol Biol. 2008; 381(4):1000–11. doi:10.1016/j.jmb.2008.06.061.
- Zhang L, Saadatmand J, Li X, Guo F, Niu M, Jiang J, et al. Function analysis of sequences in human APOBEC3g involved in vif-mediated degradation. Virology. 2008; 370(1):113–21. doi:10.1016/j.virol.2007.08.027.
- Russell RA, Smith J, Barr R, Bhattacharyya D, Pathak VK. Distinct domains within APOBEC3g and APOBEC3f interact with separate regions of human immunodeficiency virus type 1 vif. J Virol. 2009; 83(4):1992–2003. doi:10.1128/JVI.01621-08.
- Xu H, Svarovskaia ES, Barr R, Zhang Y, Khan MA, Strebel K, et al. A single amino acid substitution in human APOBEC3g antiretroviral enzyme confers resistance to HIV-1 virion infectivity factor-induced depletion. Proc Natl Acad Sci USA. 2004; 101(15):5652–7. doi:10.1073/pnas.0400830101.
- Guo Y, Dong L, Qiu X, Wang Y, Zhang B, Liu H, et al. Structural basis for hijacking CBF-beta and CUL5 e3 ligase complex by HIV-1 vif. Nature. 2014; 505(7482):229–33. doi:10.1038/nature12884.
- Martin LC, Gloor GB, Dunn SD, Wahl LM. Using information theory to search for co-evolving residues in proteins. Bioinformatics. 2005; 21(22):4116–24. doi:10.1093/bioinformatics/bti671.
- Meila M. Comparing clusterings–an information based distance. J Multivar Anal. 2007; 98(5):873–95. doi:10.1016/j.jmva.2006.11.013.
- Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, et al. UCSF chimera–a visualization system for exploratory research and analysis. J Comput Chem. 2004; 25(13):1605–12. doi:10.1002/jcc.20084.
- Ovchinnikov S, Kamisetty H, Baker D. Data from: Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information. 2014. doi:10.5061/dryad.s00vr.
- Cocco S, Monasson R, Weigt M. From principal component to direct coupling analysis of coevolution in proteins: low-eigenvalue modes are needed for structure prediction. PLoS Comput Biol. 2013; 9(8):e1003176. doi:10.1371/journal.pcbi.1003176.
- Ezkurdia I, Graña O, Izarzugaza JMG, Tress ML. Assessment of domain boundary predictions and the prediction of intramolecular contacts in CASP8. Proteins: Structure Function Bioinform. 2009; 77(S9):196–209. doi:10.1002/prot.22554.
- Monastyrskyy B, D’Andrea D, Fidelis K, Tramontano A, Kryshtafovych A. Evaluation of residue-residue contact prediction in CASP10: Contact prediction in CASP10. Proteins: Structure Function Bioinform. 2014; 82:138–53. doi:10.1002/prot.24340.
- Shannon CE. A mathematical theory of communication. Bell Syst Tech J. 1948; 27:379–423. https://dx.doi.org/10.1002%2Fj.1538-7305.1948.tb01338.x.
- Ekeberg M, Lovkvist C, Lan Y, Weigt M, Aurell E. Improved contact prediction in proteins: using pseudolikelihoods to infer potts models. Phys Rev E Stat Nonlinear Soft Matter Phys. 2013; 87(1):012707. http://www.ncbi.nlm.nih.gov/pubmed/23410359.