ConEVA: a toolbox for comprehensive assessment of protein contacts

Background In recent years, successful contact prediction methods and contact-guided ab initio protein structure prediction methods have highlighted the importance of incorporating contact information into protein structure prediction methods. It is also observed that for almost all globular proteins, the quality of contact prediction dictates the accuracy of structure prediction. Hence, like many existing evaluation measures for evaluating 3D protein models, various measures are currently used to evaluate predicted contacts, with the most popular ones being precision, coverage and distance distribution score (Xd). Results We have built a web application and a downloadable tool, ConEVA, for comprehensive assessment and detailed comparison of predicted contacts. Besides implementing existing measures for contact evaluation we have implemented new and useful methods of contact visualization using chord diagrams and comparison using Jaccard similarity computations. For a set (or sets) of predicted contacts, the web application runs even when a native structure is not available, visualizing the contact coverage and similarity between predicted contacts. We applied the tool on various contact prediction data sets and present our findings and insights we obtained from the evaluation of effective contact assessments. ConEVA is publicly available at http://cactus.rnet.missouri.edu/coneva/. Conclusion ConEVA is useful for a range of contact related analysis and evaluations including predicted contact comparison, investigation of individual protein folding using predicted contacts, and analysis of contacts in a structure of interest.


Background
The success of many protein residue contact prediction methods, in the recent years, has kindled a new hope to solve the long standing problem of ab initio protein structure prediction [1][2][3][4][5][6]. Consequently, contactguided ab initio structure prediction has emerged as an important field. When accurately predicted contacts are supplied as input to structure prediction or reconstruction methods, accurate folds can be predicted consistently [1,[7][8][9]. In general, accurate contacts lead to accurate structural models. However, for predicting folds of sequences which do not have homologous templates (hard sequences), the optimal way of utilizing predicted contacts is still an ongoing research. For instance, experiments on true contact reconstruction have suggested that 9 Å or more distance threshold delivers best reconstruction with Cβ atom [10,11], but the Critical Assessment of Protein Structure Prediction (CASP)'s definition of 8 Å threshold is still widely used to predict contacts [1-3, 6, 12]. Marks et al. have even demonstrated successful structure predictions using Cα atoms and 7 Å threshold for defining contacts [13]. Similarly, it is widely accepted that long-range contacts [12,14,15] are the most useful of the three contact types (short-, medium-, and long-range), but some structural domains introduced in CASP like T0765-D1, T0709-D1, T0711-D1, T0756-D2, T0700-D1 have very few or no long-range contacts at all. In addition, Michel et al. discuss some examples of proteins that could not be accurately reconstructed despite high accuracy of predicted contacts in their PconsFold method [16]. Using the protein 1JWQ, Vassura et al. show how some structures cannot be folded with distance thresholds below 16 Å [10]. Zhang et al. report folding 90 transmembrane proteins at 14 Å cut-off [17]. Furthermore, in these works, no common agreement is found on the optimal number of contacts (or a range) needed for accurate reconstruction.
Hence, a tool to study the relationship between contact parameters and structure types is deemed necessary. Currently, for evaluating predicted contacts, the three most widely used evaluation measures are precision, coverage and distance distribution score (X d ) [3,12,14,[18][19][20][21][22]. In addition, other measures like 'mean false positive error' , 'distance in contact map' or 'spread' [13], F-score and Matthews correlation coefficient (MCC) [12] are also used for a more rigorous evaluation of the predicted contacts. Osvaldo et al. [23] had published EVAcon in 2005 that could calculate some of these measures, which no longer seems accessible. On the other hand, existing tools like CMView [24] and CoeViz [25] only enable contact map visualization and multiple sequence visualization.
In this paper, we present ConEVA, a fast web application (along with a downloadable tool) for protein contact evaluation and comparison. Besides the server, we also report some of our observations obtained through the application of our tool on larger data sets. We discuss how the length of a protein can influence various evaluation measures, the minimum number of contacts to evaluate, and the range of the evaluation measure values associated with the determination of the correct fold of a protein.

Datasets
Throughout this manuscript we often refer to the dataset of 150 diverse proteins with average length of 150 residues introduced by Jones et al. in the PSICOV paper [4]. This data set along with other examples, including many CASP data sets, are provided as pre-curated data sets available through the "All Examples" link in the web server homepage.

Contact definition
Other than the places where we explicitly mention, in this work we primarily use the CASP definition of contacts, which isa pair of residues separated by at least 6 residues are said to be in contact if their Cβ atoms (Cα in case of Glycine) are closer than 8 Å.

Input and interface
The primary input to ConEVA is residue-residue contacts in CASP's RR file format, whose description is available at http://predictioncenter.org/casprol/index.cgi?-page=format#RR. A single RR file or multiple RR files zipped into a single zip file can be supplied. Along with predicted contacts, a native structure in PDB file format [26], may be supplied for contact evaluation. For domain based evaluations, as performed in CASP evaluations, the domain structure may be supplied as native PDB file instead of the full target structure. Besides these data inputs, the server also allows to specify if the input contacts are between Cα or Cβ atoms. In addition, a user can choose to evaluate short-, medium-, long-range, or all contacts by defining the sequence separation distances. Figure 1 shows a screenshot of ConEVA input interface. Besides allowing users to supply contact RR files, many pre-curated data sets are available through the "All Examples" link in the homepage for users to test.

Server description
Input contacts are first sorted using the confidence column in the contact rows. Using the minimum and maximum sequence separation thresholds supplied for defining short-, medium-and long-range contacts, and the choice made for contact type (all/short-range/ medium-range/long-range) contact rows that are not of a user's interest are filtered out. If a native structure is also supplied, contact residue pairs that do not exist in the native structure are filtered out. Then, the top-5, L/ 10, L/5, L/2, L, and top-2 L contacts are selected and grouped for assessment. L is the length of the native chain when supplied, and otherwise, it is the length of the sequence for which contacts are predicted.
Perl and Perl CGI is used for server development, and we use 'heatmap.2' function in the 'gplots' package [27] in R for visualizing Jaccard similarity matrix, and 'plotrix' package [28] for drawing chord diagrams.

Sever Output
The web-server output is organized in various sections. The first section summarizes the input files, contacts computed from the native structure in EVACon format [23], sequence length of contacts file and native structure with a link to the sequence comparison, and a description of the definition of contact used for all following results. The next section tabularizes contact counts for short-, medium-, and long-range contacts, and for top-5, top-L/10, etc. up to top-2 L contacts. Number of contacts that are not in native structure is also shown. In addition, if a native structure is provided as input, all numbers appear as hyperlinks to UCSF Chimera command line scripts, which can be downloaded and opened in UCSF Chimera to directly visualize the selected number of contacts within the native structure. The next section, visualizes Jaccard similarity matrices in the form of 'heatmap' and 'dendrogram' plots. The dendrogram shows similar contact sets in closer branches. Each plot has a link below it which links to the actual similarity matrix. The next section visualizes Chord diagrams. Contact maps appear in the next section, with native contact map shown in background. The subsequent sections present calculations and plots for precision, mean false positive error, coverage, X d , and spread. ROC curves with calculations for Area Under the Curve (AUC) are displayed next, followed by precision-recall curves. The last two sections present calculations for Matthew's correlation coefficient and 1D visualization of coordination numbers. In the absence of a native structure, only the first five sections and the last section are reported, and further, if only a single contact prediction file is supplied, the section for Jaccard similarity calculations is skipped. A screenshot of the output is shown in Fig. 2.

Measures computed on contacts
For each group of selected top contacts, coordination numbers [29] and contact maps are shown as 1D and 2D visualizations. Coordination number defines the number of contacts that a residue is involved in. Realizing the importance of contact assessment in the absence of a native structure, we introduce visualization and comparison using chord diagrams. See Discussion section for illustrations.

Quality measures with respect to native structure
For each group of these selected contacts the following evaluation measures are calculated: precision, coverage, mean false positive error, distance distribution score (X d ) [3,12,14,[18][19][20][21][22], Spread [13], MCC [12], AUC_PR [30]. Precision is defined as the percentage of correctly predicted contacts, calculated as the ratio of the number of predicted contacts that are correct and the number of predicted contacts selected for evaluation, Precision ¼ TP TPþFP . The true positives (TP) and false positives (FP) are the number of correctly and incorrectly predicted contacts. For instance, when we select top five contacts for evaluation, TP + FP is fixed at five and TP can range from 0 to 5. Coverage is the percentage of true contacts contained in a predicted list of contacts, calculated as the ratio of the number of correctly predicted contacts and the total number of contacts in the native structure, Coverage ¼ TP N c , where N c is the number of true contacts in the native structure. Mean false positive error is calculated as the mean of absolute deviation of all the incorrectly predicted contacts, Mean FP Error ¼ 1 , where d is the distance threshold for the contact definition (usually The distance distribution score (X d ) measures the weighted harmonic average difference between the predicted contacts distance distribution and the all-pairs distance distribution. While predicted contact distance distribution refers to the distribution of actual distances for the predicted contacts, all-pairs distance distribution is the distribution of distances for all the true contacts in the native structure. X d is calculated as, where the sum runs for 15 distance bins covering the range from 0 to 60 Å. d i is the distance representing each bin, its upper limit (normalized to 60). PiP is the percentage of predicted pairs whose distance is included in bin i. PiA is the same for all the pairs and is zero for all bins with d i > 8 Å, such that the value of X d increases heavily because of the contacts that are very incorrect, i.e. the contacts whose true distance is very large. Defined in this way, although the harmonic average reflects the difference between the real and predicted distances of residues, interpreting the meaning of a particular valued of X d can be difficult. In general, for a given set of predicted contacts, X d > 0 indicates the positive cases where at least some contacts in the set are correct, whereas when X d is closer to 0, the set can be considered random contacts. Spread [13] is computed using contact maps. For a given set of predicted contacts, it is the mean of the distances from every true contact to the nearest predicted contact in 2D contact map.
where N c is the number of true contacts, T i is a true contact in the native structure, and min{dist(T i − P)} is the minimum Euclidean distance between the true pair T i and all predicted residue pairs in the 2D contact map where every residue sequence separation is considered a unit.

Measures of similarity between predicted sets
In addition, for computing similarity between predicted contacts in the absence of native structure we introduce Jaccard similarity matrix [31] computations with neighborhood relaxation. For each pair of input contact sets, say A and B, we compute the Jaccard similarity score between A and B, J AB as J AB ¼ jA∩Bj jA∪Bj where |A∩B| is the number of common contacts (intersection) between sets A and B, and |A⋃B| is the count of contacts in the set A union B. This similarity computation can evaluate to very small percentages in case of hard predictions because two sets must have precisely the same residue pair to be common, especially when we are evaluating top five or top L/10 contacts. For this reason, we introduce the idea of relaxing the similarity computation by considering contacts with ± N residue number deviation as same contact (N may be selected as 0, 1, 2 or 3). For instance, if set A has a pair 3-15 and set B has a pair 3-16, they may be considered as the same contact at N equal to 1. However, high similarity observed with N more than 1 in helical proteins can be sometimes misleading because shifts of two or more residues can have dramatic effect on the quality of the models generated using the contacts.
Besides these "reduced list" metrics [32] that only evaluate selected top contacts, ConEVA also presents "full list" metrics including Matthew's correlation coefficient (MCC), area under the precision-recall curve (AUC_PR) [30], and Receiver Operating Characteristic (ROC) curve. To calculate MCC for a set of predicted contacts, all contacts having confidence more than 0.5 are considered as predicted contacts to calculate true positive (TP), true negative (TN), false positive (FP) and false negative (FN) so that

Contact prediction and model generation
Throughout this work, we use the publicly available contacts predicted by PSICOV [4]. In addition, we also installed a local copy of the tools coevolution based tool CCMpred [2], pure machine-learning based method DNcon [3], and a hybrid method MetaPSICOV [1] to make contact predictions for various data sets including the PSICOV data set of 150 proteins. These contacts along with secondary structures predicted using PSIPRED [33] were used for building models using CONFOLD [8], a fragment-free ab initio method that we recently developed to build 3D models from scratch. As discussed in the CONFOLD paper, for each protein, we selected various top predicted contacts (top-5, L/10, L/5, L/2, L, and 2 L) and built models using subsets, resulting in a total of 400 models for each protein. We selected the best model out of 400 for our analysis.
To study how various evaluation measure correlate to the final quality of models reconstructed using the predicted contacts, we build 3D models with CONFOLD using the contacts predicted for the 150 proteins in the PSICOV dataset. We argue that the TM-score [34] of the best model can be used as a score that suggests the best utility of the predicted contacts.

Dependence of evaluation measures on L
The length of the sequence may be ignored when we are evaluating and comparing contacts predicted for a single protein sequence. However, when we are comparing contact prediction methods on more than one protein sequence and the sequences are not of same length, sequence length can bias the comparisons. For instance, if the evaluation measures we choose to make the comparison is influenced by the length of the sequence and penalizes longer sequences more, then the methods that perform poorly particularly on longer sequences can be ranked lower than they should. This is also the reason why evaluation measures like TM-score were introduced Table 1 Spearman's rank correlation coefficient between the length of a protein (L) and evaluation measures for PSICOV predicted long-range contacts in the PSICOV data set. It shows that spread, coverage and X d are more correlated to L and N c than precision and mean false positive error, especially below top-L contact selection. To study the relationship between length of the protein (L) and the quality of contacts suggested by the various contact evaluation measures, we computed Spearman's rank correlation coefficient between the length of the protein and the evaluation measuresprecision, coverage, X d , mean false positive error, and spreadfor the long-range contacts (with sequence separation more than 23) predicted in the PSICOV dataset. In Table 1 we show that mean false positive error is the measure most uncorrelated with the length of a protein, followed by precision values for all contact selections (top-5 to top-2 L). Spread and coverage are more correlated with the length at lesser contact selections (top-5, top-L/10 and top-L/5) whereas X d is more correlated with L when we select more contacts for evaluation (top-L/2, top-L, and top-2 L). Similar correlation values were obtained for the number of contacts in a protein (N c ). In summary, these observations lead us to argue that precision and mean false positive error are the most reliable measures when comparing contact predictions.

Number of contacts to evaluate
How many contacts should we evaluate, top-5 or top-L or top-2 L? On one hand, reconstruction studies using true contacts focus on the minimum number of contacts needed to recover the fold of a protein. For instance, DE et al. suggest that 1 contact in every 12 residues is sufficient to robustly fold a protein at topology level [35]. This translates to L/12 predicted contacts if we assume that the contacts are spread out without any overlaps. In a similar study, introducing a novel cone-peeling algorithm, Sathyapriya et al. suggest that as little as 8% of the native contacts are sufficient to determine the tertiary structure [36]. On the other hand, contacts are currently evaluated on a wide range of contact selections. It is a common practice for CASP assessors to evaluate top-5, top-L/10, and top-L/5 predicted long-range contacts. Similarly, recent contact prediction methods that Fig. 4 Expected TM-score of the best model reconstructed using CONFOLD against precision, mean false positive error, X d , and coverage bins. Top-L/5 contacts predicted by PSICOV for the 150 proteins in the PSICOV data set were used as input for the calculations utilize the predicted contacts to build three dimensional models discuss evaluating top-L/10, L/5, L/2, up to top-L contacts [2,4,6].
We argue that the minimum set of contacts for which there is a high correlation between the quality of contacts and the quality of the reconstructed models, is the optimal number of contacts we can evaluate. To test this, in the PSICOV data set, we calculated the Spearman's rank correlation coefficients between the evaluation measures (precision, coverage, X d , spread, and mean false positive error) and the TM-score of the best CONFOLD reconstructed model, for various contact selections. The plot of correlation against top contact selections in Fig. 3, shows that correlation for the three important measures precision, X d , and mean false positive error, is high for at least top-L/5 contacts. In summary, we find that top-L/5 is the minimum number of long-range contacts to evaluate.

Expected TM-score for values of evaluation measures
For a given protein, what values of precision, coverage, X d , or mean false positive error of predicted contacts may fold the protein accurately (with TM-score > 0.5)? For the contacts predicted using PSICOV [4] for the 150 proteins in the PSICOV data set we classified top-L/5 long-range contacts into 3 bins for each measure. We binned predicted contacts into three precision bins -0 to 40%, 40% to 60% and 60+ %, three X d bins -0 to 20, 20 to 28, and 28+, three mean false positive error bins -0 to 1, 1 to 4, and 4+, and three coverage bins -0-10, 10-15, 15+, and observed the distribution of TM-score values in each bins. The thresholds for these bins were selected by clustering the TM-scores into three clusters. We find that on average at least 40-60% precision is Table 2 Spearman's rank correlation coefficient calculations of L, N c , and various evaluation measures with TM-score of the best CONFOLD built model for various protein fold types. Top-L/5 PSICOV predicted contacts are evaluated  5 Relationship between precision, coverage, mean false positive error, and X d with the best TM-score for various protein folds. It shows that β proteins are best evaluated using precision and X d and coverage is relatively most important for α proteins. Evaluations are performed on top L/5 long-range contacts predicted by PSICOV and TM-score is that of the best model built using CONFOLD required to get a TM-score of 0.5 when folding using predicted contacts only; see Fig. 4. We also find that to get similar TM-score, X d should be more than 20, mean false positive error should be less than 4 and coverage should be more than 10. It is important to also note that coverage and X d are also dependent upon the length of the protein unlike precision and mean false positive error.

Protein types and evaluation measures
Using ConEVA we studied how the evaluation of predicted long-range contacts vary for the various protein folds (α, α + β, α/β, β) in the PSICOV data set. We find that mean false positive error has the highest correlation with the TM-score of the models for all protein folds, except for β proteins. For α proteins, mean false positive error and spread have the highest correlation with TMscore suggesting that α proteins are better evaluated using these two measures than others. For α + β and α/β proteins we observed that coverage has much lower correlation than other measures (X d , precision, and meanfalse-positive-error). All correlations are presented in Table 2 and visualized in Fig. 5. Similar statistics were observed when we selected "all" contacts instead of long-range.

Similarity between predicted contacts
No methods currently exist for assessing the quality of predicted contacts in the absence of native structures. Since Jaccard similarity score provides a quantitative comparison of contact sets, we hypothesized that when there is larger agreement between multiple sets of predicted contacts, the confidence of the contact prediction for the protein is higher. Using the same PSICOV data set, we first computed the Jaccard similarity between the Fig. 6 Precision of top-L/5 PSICOV predicted contacts versus the Jaccard similarity score between PSICOV contacts and CCMpred predicted contacts for the 150 proteins in PSICOV data set. N corresponds to the neighborhood size in computing Jaccard similarity PSICOV predicted contacts and CCMpred predicted contacts, and then calculated the Spearman's rank correlation coefficient between this similarity and the precision of the predicted contacts (see Fig. 6). High correlation coefficients of 0.63, 0.64, and 0.57 for N (neighborhood relaxation for computing Jaccard similarity) equal to 0, 1, and 2 respectively validates our hypothesis. These findings, although obvious (i.e., accurate contacts will be correlated), can have interesting applications. For instance, a very wide range of features are used for developing protein model quality assessment (QA) methods, including many contact related scores [37][38][39]. Jaccard similarity score is a potentially useful feature for developing QA methods. In addition, this similarity score can even be integrated into model building methods like FUSION [40], UniCon3D [41], and FRAGFOLD [42] to decide the weight of the contact energy term.

Contact evaluation
For predicted contacts, ConEVA evaluates the top five, L/10, L/5, L/2, L and top 2 L contacts against a native structure using precision, coverage, X d , mean false positive error, spread, MCC, AUC_PR, and ROC curves (see Fig. 7). For analysis and comparison, it also produces neat plots of two dimensional contact maps. For convenient comparison, in the presence of a native structure, contact maps are displayed with the native structure's contact maps in the background (see Fig. 8).
For visualizing predicted contacts in the native structure, UCSF Chimera command scripts [43] are provided to download and run locally (see Fig. 9).

Contact assessment in the absence of a native structure
When only predicted contacts (or multiple set of contacts) are submitted, two dimensional contact maps and one dimensional coordination numbers are presented along with counts for short-, medium-, and long-range contacts and visualizations using contact maps, chord diagrams and Jaccard similarity matrixes along with dendrograms (see Fig. 10). The visualization of coordination numbers serves as a detailed analysis of the residue location of predicted contacts (see Fig. 11). When analyzed along with predicted three-state secondary structures (helix, strand, and coil), coordination numbers can show the contrast or agreement between predicted secondary structures and contacts. For instance, clusters of predicted contacts are expected in the strand regions. Similarly, Chord diagrams can be useful to observe contact clusters, similarities in predicted contacts and even to predict disordered regions (see Fig. 12). Both, coordination numbers and Chord diagrams can also be useful to detect predicted contacts that have extremely low coverage, i.e. highly clustered contact predictions. Identifying such predictions and prediction methods can help us make decisions on using more contacts from the same source or resort to other methods of contact prediction. These results can be useful for predictive analysis of contacts to study how the contacts may be selected and/ or combined for building models.

Analysis of a structure's contacts
A three-dimensional protein data bank (PDB) structure [26] file or a 'pdb id' may be provided as input to study its true contacts for a chosen definition of contacts. This feature is useful not only to study the reconstruction of a protein but also to understand the maximum and minimum values of measures like X d for a structure, also allowing us to investigate what contact definitions yield a desired set of contacts for a structure of interest. This is sometimes important to investigate whether some protein structure has too few or no long range contacts at all. Fig. 10 A screenshot of Jaccard similarity matrix visualization of contacts predicted for the protein 1a3a chain A. The Jaccard similarity matrix with N equals 0 (right) shows that contacts predicted by mfDCA and PSICOV are most similar and MetaPSICOV contacts are equally similar to all other predictions Fig. 11 A screenshot of 1D visualization of coordination numbers within the first 100 residues of the protein '1aa3'. Each row represents the contacts predicted by a single method, with number of contacts and number of residues involved in the contacts shown at the end. From this visualization, three clusters of contacts can be observed as common between the four methods . The diagrams show that contacts predicted for T0763-D1 are clustered with no contacts predicted for the first 30 residues (which is in fact a disordered region with no native coordinates), whereas, predicted contacts have high overlaps between methods and are well spread for '1aa3'