A pairwise residue contact area-based mean force potential for discrimination of native protein structure

Background Considering energy function to detect a correct protein fold from incorrect ones is very important for protein structure prediction and protein folding. Knowledge-based mean force potentials are certainly the most popular type of interaction function for protein threading. They are derived from statistical analyses of interacting groups in experimentally determined protein structures. These potentials are developed at the atom or the amino acid level. Based on orientation dependent contact area, a new type of knowledge-based mean force potential has been developed. Results We developed a new approach to calculate a knowledge-based potential of mean-force, using pairwise residue contact area. To test the performance of our approach, we performed it on several decoy sets to measure its ability to discriminate native structure from decoys. This potential has been able to distinguish native structures from the decoys in the most cases. Further, the calculated Z-scores were quite high for all protein datasets. Conclusions This knowledge-based potential of mean force can be used in protein structure prediction, fold recognition, comparative modelling and molecular recognition. The program is available at http://www.bioinf.cs.ipm.ac.ir/softwares/surfield


Background
Considering energy function to detect a correct protein fold from incorrect ones is very important for protein structure prediction and protein folding. Mainly, two different types of potential energy function are currently in use, either on the identification of native protein models from a large set of decoys or protein fold recognition and threading studies [1][2][3][4][5][6][7][8][9][10]. The first class of potentials, named physical-based potential, is based on the fundamental analysis of the forces between the particles referred to as physical energy function. The second type is knowledge-based energy function based on information from known protein structures. In physical energy function, a molecular mechanics force field is used. Molecular mechanics force fields are parameterized from ab-initio calculation and small molecule structural data. They are essentially the sum of pairwise electrostatic and Van der Waals interaction energies, bonds, angles and dihedral angle terms [11][12][13][14]. In addition, terms that are not included such as entropy and the solvent effect are implicitly considered. Although, physical energy function is widely used in molecular dynamic simulation of proteins in their native and denatured states which can be used to distinguish the decoy/ native structures, but these functions have not been efficient in protein structure prediction because of their greater computational cost. To reduce the computational complexity of the protein folding problem, knowledge-based or empirical mean-force potential is widely used. Since the structure of folded proteins reflects the free energy of the interaction of all their components, including all enthalpic and entropic contributions, as well as solvent effects, such potentials provide an excellent shortcut towards a powerful objective function. It can be used to force the system to obtain potential between groups of atoms by the use of experimentally determined structures. In this approach, statistical thermodynamics is used in an analysis of the frequency of observed states to estimate the underlying free energy [15]. Most often, the distribution of pairwise distances are used to extract a set of effective potential between residues or atoms. The distribution of pairwise distances can be compiled from the protein structure database and by defining a reference state, Boltzmann equation is used to calculate the interaction energy of a particular pair. The total potential energy of a protein is simply taken as a sum over all pairwise interactions. In most cases, one or two points for each residue are used to represent a protein [16][17][18]. These points are usually C (alpha), C(beta) or the centre of mass of each side chain. Each interaction can be distance-dependent. A large variety of knowledge-based potential of mean force have been developed by introducing additional interactions such as surface area terms, the main chain and side chain dihedral angles, three and four body terms and heavy atoms [6,[19][20][21][22][23].
In the contact potential, either distance-dependent or contact based, the distance between the centres of two C(alpha), C(beta) or centre of mass of two residues or the all heavy atoms of two residues are calculated and the observed frequency of contacts between residues converts to free energy using Boltzmann equation. In this way, two problems may be encountered. First, when an atom or centre of mass is selected for each residue, calculated potential is independent of orientation of the side chains and when the distance between two atoms of two residues are equal to the distance of two atoms of other residues in other positions, the same potentials are assigned to them although the orientation of two residue side chains may be quite different. Second, the atoms of two residues may not have direct contact with each other and some atoms may be located in an interval close to them.
In this study, we develop a new approach to calculate a knowledge-based potential energy using pairwise residue contact area. We calculate the parts of each pairwise residue area that are in contact in Å2 by rolling a probe ball of different sizes around the atoms of a residue to determine the contacts area of each pair. This pairwise contact area is used to determine statistical contact area preference between each residue pairs, when a contact area preference estimates a sum of energetic interactions and structural constraints.
A good energy function at its minimum should discriminate native structures from decoys. So, to test the effectiveness of this new potential, we calculate it on several decoy sets to measure its ability to discriminate native structure from decoys. Several decoy sets that contain one to hundreds of decoy structures generated in different ways are used and in the most cases this potential has been able to distinguish native structures from the decoys. Calculated Z-score and Pe, which are useful measures of the validity of the computed potential, show high value for all protein datasets.

Results and Discussion
One of the best ways to show the performance of a force field is its ability to find the native structure in a large set of decoys. Different decoys sets have been used to evaluate how well knowledge-based potentials and physical potentials discriminate native structures. In this study, the performance of our model based on pairwise contact area was tested on models from different decoy sets containing misfold, DecoyForMMPBSA, fisa, hgstructal, semfold, vhp-mcmd, 4state_reduced, lmds, ig_structal, ig_structal_hires, HRDecoy and Rosetta_Tsai. The quality of the models in decoy sets and the members of decoy structures are very different.
From the principle of statistical mechanics, we suppose that the energy of the native structure has to be minimum energy among all conformations and much lower than the average energy of all possible conformations. Then, in addition to finding the fold with lower energy, the Z-Score has been calculated. Energy profiles have been made for each residue pairwise separated by d residues in sequence (10 distinct values for sequence separation have been considered) using different probe sizes (r = 0.25, 0.5, 0.75, 1, 1.5, 2, 2.5 Å). So, we have 70 different energy profiles for energy calculation. The total energy of particular structure of a protein can be calculated as the sum of all the pairwise interactions, but the best discrimination has been achieved when only energy profile for d = 1 has been considered. This shows that contact area of consequent neighbors has more important role in distinguishing native folds from the decoys. In this situation, with r = 0.25Å we could discriminate native folds from decoys in almost all decoy sets. The increase in the probe size has had slightly improvement on the discrimination power shown by Z-Score. Since the increase of the probe radius resulted in an increase in the amount of calculations, so choosing the great probe radius was not efficient. Table 1 shows the results for discrimination of 1417 native folds from more than 1300000 decoys in 12 sets. In the most cases, the native structures have the lowest energy and they have got first rank. The high negative values of Z-Scores show that choosing this energy potential is highly effective [7,[24][25][26].
Tables in Additional file 1 show the details of results for proteins in each decoy set. Although these decoy sets have been produced in different ways andfinding native structures in a set of low quality models would not be difficult, but our pairwise contact area based potential ranked the native structures first in the most cases. There are some exceptions. First, in semfold, protein 1nkl has the rank five. Second in hg_structal, 1gdm has the rank 17, but in the last version of PDB this protein has been replaced by 2gdm and it is notable that when energy has been calculated for 2gdm, this protein ranked first in decoy structures. In the Roset-ta_Tsai dataset, our model could not find any native structures in the first rank.
The results obtained on different decoy sets show good performance of our methodology to discriminate native folds. However, an experiment to evaluate the performance of an energy model when performing abinito folding is to discriminate between the native-like and non-native structures. In Additional file 1, contact area energy is plotted against the RMSD from native structure for native and all decoy structures. Different datasets have different distributions of RMSD for nonnative proteins. Usually RMSD's are calculated using C a distances and residue side chain atoms are not considered. Since our method is based on content area of side chain atoms, then it is very sensitive to orientation of side chains atoms although the change in the position of C a may not be very large. As shown in plots in Additional file 1, in the most cases decoy structures have RMSD more than 2 Å and in these cases the contact area of side chain atoms may be far from native structure and there is a cosiderable distance between the energy of native and decoy structures in most datasets. Table 2 shows the comparison of the performances of different methods including DFIRE [27], Rosetta [28,29], ModPipe-Pair, Modpipe-surf [23], DOPE [30], PC2CA [31], Force model, [32], TE13, LHL [33], and MJ [34] together with our model (surfield) in recognizing native structures from decoys in three decoy datasets. Our model correctly identifies all 20 native structures in these datasets while other methods do not work well.

Conclusions
The aim of this study was to evaluate a mean force potential based on contact area of residues instead of contact or distance to separate correct from incorrect folds. This was done by calculation of contact area of all atoms of residues considering the Van der Walls spheres of atoms and obtaining a coefficient from training dataset used to quantify pairwise potential in a protein fold. The new potential not only is residue orientation-dependent, but also gives residue contact area in angstrom square for each pair of atoms in adjacent residues that provides a better quantification of atomic interaction than distance-based methods.
The analysis in this work showed that the best definition is the one involving the contact area between Van der Walls spheres of atoms of any two consecutive residues with employing a cut off distance around 0.5 Å. Considering atomic radii, those distances around 5 Å has been considered. This, in fact, is close to the cut off distance considered in the contact-based potential methods. Contact area-based potential was able to recognize the native structures on different decoy sets with a high degree of accuracy, as evident from the Z-score. Only one of 1386 native fold in 11 decoy sets was not ranked first, however this protein, 1nlk in semfold decoy set, was ranked five among 11600 models.
These results show that in addition to the important role of contact area between two atoms in improvement of potential function that reflects the orientation of residue side chain, short range contact between neighbor residues play more important role than long range contact.

Pairwise Contact Area
The presented potential is based on an assessment of contact area for the pairwise residue atoms in a training dataset containing protein structures from protein Data Bank records. The contact area is defined as the faces of sphere of a given atom in a residue contacts to the sphere of an atom in other residue. The radius of the sphere is the atomic Van der Waals radius plus the radius of a probe. The procedure similar to accessible surface area calculation is used to quantify pairwise contact area. For each atom, sufficient number of approximately evenly distribute points are placed on the sphere of radius Ra+Rp centered at the atom where Ra and Rp are the Van der Waals radius of atom A and sphere probe radius respectively (Figure 1). Each point is interpreted as a defined area. In the absence of hydrogen atoms, group radii are used [35]. Table 3 shows the values of atoms radii that are used. Contacting atoms are defined as atoms with overlapping sphere, and thus the maximum distance between two contacting atoms is Ra+Rb+2Rp. In this work, the probe is defined as a sphere with radius 0.25, 0.5, 0.75, 1, 1.5, 2 and 2.5 Angstrom. In theory, by rolling a sphere with radius Rp around an atom, all atoms that have contact with it will be detected and contact area for each residue pair can be calculated. In practice the probe is located at each point on atom sphere and the numbers of points that are in contact with each atom are calculated and the contact area of every two atoms and subsequently for every two residues is calculated.

Pairwise Contact Area Potential
The pairwise contact area potential for every residue types a and b in a given protein are derived from contact area preference within the training set of experimentally determined structures and the contact area of residues type a and b.
is the potential of residue type a in contact with residue type b separated by d residues in sequence calculated using probe with radius r. K(a, b, d, r) is a coefficient showing the preference of pairwise contact area for a pair of residues (a, b) in d sequence separation by probe radius r derived from observed contact area in the training set.  These potentials are used to score protein and decoy structures, where the total score is the product of contact area and potential coefficient, summed over all pairwise contact area.

Measure of significance RMSD
To quantify the similarity of different conformations, we use the coordinate root mean square (cRMS) deviation with the following equation: Figure 1 Contact area of oxygen atom from first amino acid with N, CA and C of next amino acid. where r ai and r bi are respectively, the i th position of structure a and structure b when structures a and b have been optimally superimposed [45].

Z-scores and P e
The average performance of a potential function to discriminate native structure from decoys can be expressed as Z-score: where E native is the energy calculated for native protein structure and <E decoy > and δ are respectively, the average and the standard deviation of energy distribution of decoy proteins.
A negative Z-score indicates that the conformation's energy is lower than the average of the distribution. The more is the absolute value of the Z-score; the better is the separation of the native conformation from the decoy ones.
Another parameter to compare performance of the various potential functions to discriminate native structures from decoys is based on ranking the decoys by their total potential scores. The parameter is as follows: where R native is the rank of the native structure and N structures is the total number of structures in the decoy set. If the rank of the native structure is held constant while the set size is increased, the value of the P e will become more negative (indicating improvement in discrimination capability), while a zero value is the worst possible score indicating the lowest possible rank [46]. Additional file 1: details of results for proteins in each decoy set. Significance details including Zscore, Pe and scatter plot of energy vs. rmsd for each decoy set are shown in different sheets. Click here for file [ http://www.biomedcentral.com/content/supplementary/1471-2105-11-16-S1.XLS ]