 Research article
 Open Access
 Published:
Identification of native protein structures captured by principal interactions
BMC Bioinformatics volume 20, Article number: 604 (2019)
Abstract
Background
Evaluation of protein structure is based on trustworthy potential function. The total potential of a protein structure is approximated as the summation of all pairwise interaction potentials. Knowledgebased potentials (KBP) are one type of potential functions derived by known experimentally determined protein structures. Although several KBP functions with different methods have been introduced, the key interactions that capture the total potential have not studied yet.
Results
In this study, we seek the interaction types that preserve as much of the total potential as possible. We employ a procedure based on the principal component analysis (PCA) to extract the significant and key interactions in native protein structures. We call these interactions as principal interactions and show that the results of the model that considers only these interactions are very close to the full interaction model that considers all interactions in protein fold recognition. In fact, the principal interactions maintain the discriminative power of the full interaction model. This method was evaluated on 3 KBPs with different contact definitions and thresholds of distance and revealed that their corresponding principal interactions are very similar and have a lot in common. Additionally, the principal interactions consisted of 20 % of the full interactions on average, and they are between residues, which are considered important in protein folding.
Conclusions
This work shows that all interaction types are not equally important in discrimination of native structure. The results of the reduced model based on principal interactions that were very close to the full interaction model suggest that a new strategy is needed to capture the role of remaining interactions (nonprincipal interactions) to improve the power of knowledgebased potential functions.
Background
The protein structure is a result of noncovalent interactions between its residues in threedimensional space. One of the key challenges in structural bioinformatics is to promote an understanding of the structure of the complex network of noncovalent interactions in proteins that significantly contribute to the threedimensional structure [1]. Because of its complexity, the structure and the amino acid sequence of proteins need to simplify to make analysis tractable [2]. Protein folds into a single threedimensional conformation among an astronomical number of possible conformations in order of microseconds. According to Levinthal paradox [3], a protein could not exhaustively search all possible conformational states to achieve the native structure. Despite the advances in experimental and computational methods in protein structure analysis, the problem of folding protein has not solved yet.
Amino acids are the building blocks and the basic components of proteins. Anfinsen [4] believed that information of amino acid sequence is sufficient to determine the native fold of a protein. Several studies with different computational and experimental approaches suggested reduced alphabet of amino acids in protein design, protein fold recognition [5,6,7,8,9]. Reduced models of protein structure could decrease the complexity of the protein folding problem and be useful in protein design [5]. Therefore the determination of essential amino acids and their interactions in native protein structures is an important issue that needs to be addressed.
According to Anfinsen thermodynamic hypothesis, the native structure of a protein corresponds to minimum free energy [4]. For a given amino acid sequence, the protein structure prediction problem is to find a threedimensional structure such that the total free energy is minimized. Therefore, the assessment of protein structure needs a trustworthy potential function. Knowledgebased potentials (KBPs) [6, 7] as an approximation of the free energy are extracted from the analysis of experimentally determined protein structures. They are mainly based on the distribution of a feature relative to a reference state [8]. There are various KBP including contact potentials [6, 9], orientationdependent potentials [10,11,12,13,14,15], distancedependent potentials [16,17,18,19,20], multibody potentials [21]. Impressive progresses in protein design [22], simulation of protein folding [23], ligand binding [24, 25], aggregation [26], protein structure prediction [27], and fold recognition [28] have been achieved using knowledgebased potential functions.
In the previous work [29], we reported the importance of hydrophobic nonlocal interactions in scoring function (based on Delaunay tessellation) in discrimination of native structures. Here, in order to reduce protein structure, we employ a PCAbased way to formulate and extract principal interactions in protein 3D structure. The PCA (Principal Component Analysis) technique was first introduced for only a few variables in 1901 by Karl Pearson [30] and then it was extended by Hotelling in 1933 for a large number of variables [31]. Today, PCA is a technique which has been widely used to select the most important variables from a multivariate data and retrieve dominant pattern from noisy data. In fact, PCA maps a complex system from multidimensional space to a reduced space spanned by a few principal components to clarify the principal features underlying the observed data [32]. The threedimensional structure of a protein could be considered as such multidimensional data [33]. Studies such as protein dynamics prediction from experimental data [34], discovery of binding site [35], prediction of protein intermediate states [36], protein folding dynamics [37], analysis of molecular dynamics trajectories [38], and principal components analysis of protein structure ensembles [33, 39] are a few examples among many applications of PCA in protein structure analysis. If the native structure of a protein lies in the minimum potential and the total potential approximated by a summation of pairwise interaction type energies, we assert that the goal of structure reduction should be to preserve as much of total energy as possible. PCA is a natural tool for extracting these most important variables (pairwise interaction types) in total potential. The presented method extracts significant (principal) interactions by considering only native conformations without using decoys. We show that the principal interactions perform well in the identification of native structure, close to the same discriminatory power as the full interaction model. Our model derives the observation that principal interactions extracted in the threedimensional structure of native structures will tend to dictate a minimum energy protein conformation among decoy structures, suggesting that these gross features are dominant in dictating the structure.
In this study, at first, the energy between all pair of residues using distance dependent knowledgebased potential function for each native structure was calculated. Since there is 210 possible amino acidamino acid interaction type between 20 amino acids, therefore for each protein structure we could have a vector with 210 elements, containing energy between amino acidamino acid interaction types, where for example i’th element is the summation of energy between all ALAGLY residues in the structure. We obtained a (n sample) × (210 variables) matrix where n is the number of native protein structures in the train dataset. Finally, the principal variables (interactions) were extracted using a PCAbased approach. To our astonishment, these interactions that consisted of 20% of the full interaction on average, were found between residues that are known as critical, important, and effective in protein folding. The model that considers only the principal interactions was assessed by six measures including the number of correctly identified natives, the Zscore of the native energy, the RMSD of the minimum energy, the Pearson correlation coefficient between energy and Cα RMSD from the native structure, 20% fraction enrichment and the Zscore of the best decoy structure using two decoy sets. The results are very similar and close to the full interaction model. Therefore, we are able to demonstrate that the principal interactions maintain the discriminative power of the full interaction model. Additionally, the similar results were found when the principal interaction model was evaluated on DOPE and DFIRE potential functions.
Results
Principal interactions of protein native structures
The general definition of distancedependent knowledgebased potential between pairs of atoms i and j at distance d is:
where \( {f}_{obs}^{ij}(d) \) and \( {f}_r^{ij}(d) \) represent the relative frequency of atomic pairs i and j at distance d in a database of native structures and the reference state, respectively [7]. The total potential of a given protein structure, S, denoted by ∆E(S) can be approximated as follows [7]:
The potential between pairs of residues A and B can be calculated as a summation of pairwise potentials between atoms of residue A and B:
where i and j are atoms of residues A and B. Therefore, the total potential of a protein structure, S, can be approximated as follows:
where T is set of all 20 amino acid types. Since there are 210 possible different amino acidamino acid interaction types between 20 amino acids, the total number of distinct ∆E(A, B) in the above summation is equal to 210.
A nonredundant structural dataset of 6944 protein chains was culled by PISCES [40] from Protein Data Bank with pairwise sequence identity < 50%, resolution < 1.6 Å, Rfactor < 0.25, protein length > 40 and < 1000 residues. We also excluded the proteins with more than 50% sequence identity with targets in test sets (proteins in the 3DRobot [41] and CASP10–13 [19]). The final list contains 6384 protein chains. The final list of PDB and chain IDs are provided in Additional file 1: Table S1.
Let S_{i} be i’th protein native structure in the data set and the 210dimensional vector P_{i}, containing 210 amino acidamino acid energies; ∆E(A, B) as explained above. Obviously, the sum of P_{i} is equal to ∆E(S_{i}); the total energy of the protein structure S_{i}. The total energy matrix (denoted by TE) is a 6384 sampleby210 variables matrix whose rows are P_{i} calculated from 6384 native protein structures.
At first, each of the samples in TE matrix has been centered to have mean zero, the new matrix is called CTE (centered total energy). Figure 1 shows the image representation of CTE matrix for DBNI scoring function [42]. In DBNI, the neighbors of atoms (interactions) are determined by Delaunay tessellation and only nonlocal interactions (those between atoms farther than five residues in the sequence) by a distance less than 6 Å are considered. In fact, the figure displays the data matrix CTE as an image that uses the range of colors as indicated in the colorbar. Each element of CTE specifies the color for one pixel of the image. The smallest value in CTE is mapped into the black color. As shown in Fig. 1, the values of some variables or interactions shown as black points are very low and different from the others.
The extracted principal interactions (variables) as described in Methods section are as follows: LEULEU, CYSCYS, PHELEU, VALVAL, ILELEU, PHEPHE, ILEILE, LEUVAL, ILEVAL, LEUTYR, PHEILE, PHEVAL, TYRTYR, LEUTRP, ILETYR, LEUALA, PHETYR, VALTYR, VALALA, TRPPHE, and METLEU. These interactions are shown by arrows in Fig. 1. Most of these interactions are between aminoacids that are critical and important in protein structure and protein folding. Val, Ala, Leu, Ile, Phe, Tyr, and Trp are hydrophobic amino acids and are important in determining the threedimensional structure of proteins. Tyrosine is an aromatic, partially hydrophobic amino acid that is inclined to be buried in protein hydrophobic cores. The aromatic sidechain can also mean that tyrosine is involved in stacking interactions with other aromatic sidechains such as phenylalanine and tyrosine. Cysteine has sulfurcontaining side group with the potential to be more reactive. As a group with low polarity, cysteine is known for its ability to connect with another cysteine via the sulfur atoms to form a covalent disulfide bridge, which is important in the formation and maintenance of the tertiary (folded) structure in many proteins. In order to assess the principal interactions in the identification of the native structure, the results of the full interaction model were compared with the principal interaction model. In the full interaction model, all amino acids and their interactions, as determined in their corresponding KBP definition, are considered. In comparison, in the principal interaction model, only the principal interaction types of the full interaction model are considered.
Validation of the principal interaction model on decoy sets
The principal interaction model was assessed in discrimination of native structure, using the decoy sets 3DRobot [41] and CASP10–13 [19]. The 3DRobot set includes decoy structures for 200 nonhomologous proteins randomly selected from the PDB library. This protein set contains 48 α, 40 β, and 112 α/βsingledomain proteins with lengths ranging from 80 to 250 residues. Each protein contains 300 structural decoys with RMSD ranging from 0 to 12 Å. The CASP10–13 decoy set includes a total of 13,474 structures for 175 proteins, which were collected and trimmed from CASP10CASP13 experiments by Yu et.al [19]. In this data set, all prediction sets contain experimental structures that are sequentially consecutive and all nonfirst prediction models (the second to fifth models of predictors) have been removed. In addition, the selected prediction models are also consecutive and identical in sequence to the corresponding experimental structure. This data set has many decoys with RMSDs evenly distributed from very similar to very different to the native structure. The model was assessed using six measures including, the number of correctly identified native structures (Top1), the Zscore of the native structure energy, the RMSD of the minimum energy, the Pearson correlation coefficient (denoted by PC) between RMSD from the native structure and the total energies, 20% fraction enrichment, and the Zscore with respect to the best decoy as explained in the Methods section.
In order to compare the difference between the principal interaction model and the full interaction model in size, the percentage of principal interactions denoted by PI for all decoy structures in the data sets was also calculated:
A comparison of two models using DBNI potential function is summarized in Table 1. The values in parenthesis depict the performance of the potential function in the full interaction model. The PI values represents the average of the percentage of interactions between atoms used for the calculation of energy in the principal interaction model. On average, 25% of interactions are used in the principal interaction model.
The principal interaction model and the full interaction model correctly identified 286 native structures with a success rate of 76%. The results of the principal interaction model in all six criteria are very similar to the full interaction model.
The F.E. and Zscore_{b} evaluate the performance of the model in decoy discrimination between good and bad structural decoys. The possible values of F.E. ranges from 0 to 5 [19]. The higher value of F.E. the better decoy discrimination. For 3DRobot, the results of F.E. and Zscore_{b} are better than CASP10–13. Because most of the targets in 3DRobot have decoy structures with low RMSD, however in CASP10–13 the minimum RMSD is 5 Å for 44% of the targets. For 3DRobot decoy set, all criteria justify the model. The averages of CC’s in the principal interaction model and the full interaction model are 0.50 and 0.58, respectively with the average of Zscores higher than 3. For both decoy sets, the results of the principal interaction model are very similar and close to the full interaction model using 25% of the full interactions on average and adding further down the list of interactions could not significantly increase the performance of the potential function.
Discussion
Principal interaction model on DFIRE and DOPE
The principal interaction model was also tested on DFIRE [43], and DOPE [17] potential functions. In DBNI, atomic interactions are determined by Delaunay tessellation and only nonlocal interactions (those between atoms farther than five residues in the sequence) by a distance less than 6 Å are considered. DOPE and DFIRE are derived from a noninteracting ideal gas reference state, except that in DOPE the size effect of proteins is taken into account. In DFIRE and DOPE, two atoms by a distance less than 15 Å are in contact. The image representation of the CTE matrix for DFIRE and DOPE are shown in the Additional file 1: Figure S1 and S2. The extracted principal interactions (variables) for these potential functions are shown in Fig. 2 by red, black, and blue edges, respectively, and are listed in Additional file 1: Table S2. Although the aforementioned potential functions have differences in the definition of contact, the threshold of distance, and ignorance of local interactions, most of their principal interactions, as shown in Fig. 2, are the same. Note that these interactions are between residues considered as important in protein folding. For each interaction type, the number and the contribution value were calculated. The number of interactions was extracted from train data set and the contribution value was calculated by applying the procedure described in the Methods section on CTE matrix whose columns are pairwise potentials calculated from 6384 native structures in the train set. In summary, at first the covariance matrix of the CTE matrix was calculated using Eq. 5 and then it diagonalized according to Eq. 6. As described in the Methods section, the set of m eigenvectors of the covariance matrix (V_{1},V_{2}, …,V_{m}) with 80% of total variations was selected (Eqs. 7 and 8). The contribution of the pairwise interaction k, in a given principal direction V_{j} denoted by C_{kj} was calculated in Eq. 9. Let C_{k1}, C_{k2}, …, C_{km} is the contributions of the pairwise interaction k on V_{1},V_{2}, …, V_{m}, respectively. The total contribution of the pairwise interaction k was calculated using Eq. 10. Figure 3 shows the relation between the number of interaction and the contribution value for DBNI potential function and the righthand side inset plot corresponds to only the principal interactions. As shown in the figure, the principal interactions indicated by red points mainly have a large number of interactions, however the number of interactions for CYSCYS (inferred as principal interaction and indicated by arrow in the figure) is relatively low and on the other hand, the nonprincipal interactions such as ARGLEU, GLUARG, and ASPARG (data points in the green rectangle, the bellow lefthand side) have a large number of interactions. The similar results obtained for DOPE and DFIRE and presented in Additional file 1: Figure S6 and S7.
The results for these welldesigned potential functions, DFIRE, and DOPE are summarized in Table 2. The results of the principal interaction model for all these potential functions are also very close to the full interaction model and justify the principal interaction model well.
The success rate of three KBP functions in terms of the number of firstranked native structures is shown in Fig. 4. The Xaxis represents the index of sorted interactions based on their contribution values; the details are shown in Additional file 1: Table S2. The horizontal dashed line indicates the success rates obtained by the principal interaction model, which intersect the diagrams in bold points. The progress of the overall performances represents a sharp initial increase up to the bold points, followed by a straightening out or a slim decrease and increase. Therefore, the performance of these scoring functions is captured and achieved by the principal interactions.
Conclusions
In this study, we showed that amino acids and their interactions are not equally important in protein structure. The method is relatively straightforward and easy to apply to other potential functions. The results of the reduced model based on principal interactions inferred using three knowledgebased potential functions reveal that nonprincipal interactions do not increase significantly the performance of the knowledgebased potential functions. Hence, a new strategy is needed to capture the role of polar and charged amino acids for improving knowledgebased potential functions. Additionally, the simplicity of the protein structure could help to disclose the hidden secret of protein folding. By reducing the 20 amino acids to the simplified alphabet, the degree of freedom of protein structure could be greatly reduced, and the main mechanism and the underlying rules could be decoded. An appropriate simplification should regenerate the structural features of proteins. It is generally assumed that native protein structure takes its conformation at minimum energy. The pairwise potential is a widely used measure to capture the strength of pairwise interactions. In the context of energy landscape theory, the energy of the simplified native structure should be also at the minimum on the landscape of the simplified version. The application of the principal interaction model in de novo design and threading approaches would be a future direction.
Methods
Principal interactions extraction
PCA is generally used to select a subset of variables from a large set while retaining most of the information. This is achieved by transforming the original variables to a new set of variables, the principal components (PCs), which are ordered so that the first few retain most of the variation present in all of the original variables and have the highest correlations with the principal components [44]. Let the data matrix to be analyzed by PCA comprises n samples described by p variables and represented by the n × p matrix E, whose jth column is the vector x_{j} of samples on the jth variable. In the mean centered data, which we write as U, the column means of E have been subtracted from their corresponding columns and so the column means of U are equal to zero. The covariance matrix of the mean centered data, U, could easily be calculated as follows [44]:
It is a symmetric matrix and so it can be diagonalized:
where V is a matrix of eigenvectors (each column is an eigenvector) and L is a diagonal matrix with eigenvalues λ_{i} in the decreasing order on the diagonal. Each eigenvalue is proportional to the portion of the variance that is associated with each eigenvector. The eigenvectors are called principal axes or principal directions of the data. Let the principal directions of the covariance matrix Σ be V_{1}, V_{2}, …, V_{p} (these vectors are the columns of V) and λ_{1}, λ_{2}, …, λ_{p} be the eigenvalues of the covariance matrix. Here, we choose the set of m (m ≤ p) eigenvectors of Σ which have the m largest eigenvalues with 80% of total variations.
In order to assess the effect of each ∆E(A, B) on the total potential of the mean force, ∆E(S), we used the following procedure on the 500 × 210 matrix CTE:

1)
Let V_{1}, V_{2}, …, V_{210} and λ_{1}, λ_{2}, …, λ_{210} be the eigenvectors (principal directions) and the eigenvalues (variances) of the covariance matrix of CTE. The covariance matrix is calculated using formula (3). For a given V_{j}, the proportion of total variance that it accounts for is:
The first m Vs (V_{1}, V_{2}, …, V_{m}) were chosen such that the sum of the proportion of total variance be at least 80%. That means:
Variables (interactions) that are correlated with V_{1}, V_{2}, …, V_{m} possess high contribution and we consider them as the most important variables (interactions). The contributions of the variable k in accounting for the variability in a given principal direction, V_{j}, are calculated (in percentage) as:
where V_{j}(k) is kth element of V_{j}

2)
Let C_{k1}, C_{k2}, …, C_{km} be the contributions of variable k on V_{1}, V_{2}, …, V_{m}, respectively. The total contribution of variable k on explaining the variations retained by V_{1}, V_{2}, …, V_{m} is calculated as:
The 210 variables are sorted by the values of the variable’s total contribution. The total contribution of a variable is compared to the average contribution of that variable to a uniform distribution of contributions as follows:

3)
Assuming uniform contributions for the considered 210 variables, the contribution of a variable on a given direction would be \( \frac{1}{210}\% \). In this case, the average contribution of a variable for V_{1}, V_{2}, …, V_{m} is: \( \tau =\sum \limits_{i=1}^m\left(\frac{1}{2.1}\times {\lambda}_i\right) \)

4)
For the given V_{1}, V_{2}, …, V_{m}, a variable with a contribution larger than this cutoff, τ, could be considered important and significant in contributing to the components. We call these interactions (variables) as the principal interactions (variables).
Performance criteria
The performance criteria for model assessment and their definitions are as follows:

1)
Top1: Native structure is correctly identified if its structure has the lowest value of energy. The number of correctly identified native structures in a decoy set is denoted by Top1.

2)
Zscore of the native structure energy in a decoy set is as follows:
$$ Z score=\frac{\left\langle {score}_{decoys}\right\rangle {score}_{native}}{\sigma_{decoys}} $$
where score_{native}scorenative is the score (energy) calculated for a native structure, <score_{decoys} > and σ_{decoys} are the average and standard deviation of score distributions of decoys proteins, respectively.

3)
RMSD of the minimum energy: The root mean square deviation of the CαCα pairs between the native structure and the structure with the minimum energy.

4)
PC: The Pearson correlation coefficient between Cα RMSD from the native structure and the total energies.

5)
Zscore_{b}: The energy gap between the best decoy structure (the model with lowest RMSD from the native structure) and the remaining structures is calculated as follows:
$$ Z{score}_b=\frac{\left\langle {score}_{decoys}\right\rangle {score}_{best\ decoy}}{\sigma_{decoys}} $$
where score_{best decoy}scorenative is the score (energy) calculated for the best decoy structure, 〈score_{decoys}〉 and σ_{decoys} are the average and standard deviation of score distributions of decoys proteins (the native and the best decoy are excluded), respectively.

6)
F.E (20% Fraction Enrichment): which measures the fraction of the most accurate 20% decoys (the top 20% lowest RMSD structures) among the 20% best scoring decoys (by excluding the native structure) compared to that for the entire decoy set [19].
Availability of data and materials
The final list of PDB and chain IDs are provided in Additional file 1: Table S1. The CASP10–13 decoy set is publicly available at http://qbp.hzau.edu.cn/ANDIS/ and the 3DRobot decoy set is publicly available at http://zhanglab.ccmb.med.umich.edu/3DRobot/decoys.
Abbreviations
 KBP:

Knowledgebased potential
 PCA:

Principal component analysis
References
Dill KA, MacCallum JL. The proteinfolding problem, 50 years on. Science. 2012;338(6110):1042–6.
Hills RD Jr, Lu L, Voth GA. Multiscale coarsegraining of the protein energy landscape. PLoS Comput Biol. 2010;6(6):e1000827.
Zwanzig R, Szabo A, Bagchi B. Levinthal’s paradox. Proc Natl Acad Sci. 1992;89(1):20–2.
Anfinsen CB. Principles that govern the folding of protein chains. Science. 1973;181(4096):223–30.
Huang JT, Wang T, Huang SR, Li X. Reduced alphabet for protein folding prediction. Proteins. 2015;83(4):631–9.
Miyazawa S, Jernigan RL. Estimation of effective interresidue contact energies from protein crystal structures quasi chemical approximation. Macromolecules. 1985;18(3):534–52.
Sippl MJ. Calculation of conformational ensembles potentials of mean force  an approach to the knowledgebased prediction of local structures in globular proteins. J Mol Biol. 1990;213(4):859–83.
Mirzaie M, Sadeghi M. Knowledgebased potentials in protein fold recognition. J Paramed Sci. 2011;1(4):63–73.
Arab S, Sadeghi M, Eslahchi C, Pezeshk H, Sheari A. A pairwise residue contact areabased mean force potential for discrimination of native protein structure. BMC Bioinformatics. 2010;11(1):16.
Lu M, Dousis AD, Ma J. OPUSPSP: an orientationdependent statistical allatom potential derived from sidechain packing. J Mol Biol. 2008;376(1):288–301.
Zhou H, Skolnick J. GOAP: a generalized orientationdependent, allatom statistical potential for protein structure prediction. Biophys J. 2011;101(8):2043–52.
Dong GQ, Fan H, SchneidmanDuhovny D, Webb B, Sali A. Optimized atomic statistical potentials: assessment of protein interfaces and loops. Bioinformatics. 2013;29(24):3158–66.
Liu Y, Zeng J, Gong H. Improving the orientationdependent statistical potential using a reference state. Proteins. 2014;82(10):2383–93.
Elhefnawy W, Chen L, Han Y, Li Y. ICOSA: a distancedependent, orientationspecific coarsegrained contact potential for protein structure modeling. J Mol Biol. 2015;427(15):2562–76.
LópezBlanco JR, Chacón P. KORP: knowledgebased 6D potential for fast protein and loop modeling. Bioinformatics. 2019;35(17):3013–9.
Mirzaie M, Eslahchi C, Pezeshk H, Sadeghi M. A distancedependent atomic knowledgebased potential and force for discrimination of native structures from decoys. Proteins. 2009;77(2):454–63.
Shen My, Sali A. Statistical potential for assessment and prediction of protein structures. Protein Sci. 2006;15(11):2507–24.
Zhang C, Liu S, Zhou H, Zhou Y. An accurate, residuelevel, pair potential of mean force for folding and binding based on the distancescaled, idealgas reference state. Protein Sci. 2004;13(2):400–11.
Yu Z, Yao Y, Deng H, Yi M. ANDIS: an atomic angleand distancedependent statistical potential for protein structure quality assessment. BMC Bioinformatics. 2019;20(1):299.
Mirzaie M, Sadeghi M. Distancedependent atomic knowledgebased force in protein fold recognition. Proteins. 2012;80(3):683–90.
Park J, Saitou K. ROTAS: a rotamerdependent, atomic statistical potential for assessment and prediction of protein structures. BMC Bioinformatics. 2014;15(1):307.
Boas FE, Harbury PB. Potential energy functions for protein design. Curr Opin Struct Biol. 2007;17(2):199–204.
Lee J, Liwo A, Scheraga HA. Energybased de novo protein folding by conformational space annealing and an offlattice unitedresidue force field: application to the 1055 fragment of staphylococcal protein a and to apo calbindin D9K. Proc Natl Acad Sci U S A. 1999;96(5):2025–30.
Turjanski AG, Gutkind JS, Best RB, Hummer G. Bindinginduced folding of a natively unstructured transcription factor. PLoS Comput Biol. 2008;4(4):e1000060.
Bianchi V, Gherardini PF, HelmerCitterich M, Ausiello G. Identification of binding pockets in protein structures using a knowledgebased potential derived from local structural similarities. BMC Bioinformatics. 2012;13(4):S17.
Bereau T, Deserno M. Generic coarsegrained model for protein folding and aggregation. J Chem Phys. 2009;130(23):235106.
Skolnick J. In quest of an empirical potential for protein structure prediction. Curr Opin Struct Biol. 2006;16(2):166–71.
Buchete NV, Straub JE, Thirumalai D. Development of novel statistical potentials for protein fold recognition. Curr Opin Struct Biol. 2004;14(2):225–32.
Mirzaie M. Hydrophobic residues can identify native protein structures. Proteins. 2018;86(4):467–74.
Pearson K. LIII. On lines and planes of closest fit to systems of points in space. London Edinburgh Dublin Philos Mag J Sci. 1901;2(11):559–72.
Hotelling H. Analysis of a complex of statistical variables into principal components. J Educ Psychol. 1933;24(6):417.
Jolliffe IT, Cadima J. Principal component analysis: a review and recent developments. Philos Trans A Math Phys Eng Sci. 2016;374(2065):20150202.
Yang LW, Eyal E, Bahar I, Kitao A. Principal component analysis of native ensembles of biomolecular structures (PCA_NEST): insights into functional dynamics. Bioinformatics. 2009;25(5):606–14.
Sankar K, Mishra SK, Jernigan RL. Comparisons of protein dynamics from experimental structure ensembles, molecular dynamics ensembles, and coarsegrained elastic network models. J Phys Chem B. 2018;122(21):5409–17.
Lukman S, Nguyen MN, Sim K, Teo JC. Discovery of Rab1 binding sites using an ensemble of clustering methods. Proteins. 2017;85(5):859–71.
Orellana L, Yoluk O, Carrillo O, Orozco M, Lindahl E. Prediction and validation of protein intermediate states from structurally rich ensembles and coarsegrained simulations. Nat Commun. 2016;7:12575.
Maisuradze GG, Liwo A, Scheraga HA. Principal component analysis for protein folding dynamics. J Mol Biol. 2009;385(1):312–29.
McGibbon RT, Beauchamp KA, Harrigan MP, Klein C, Swails JM, Hernández CX, Schwantes CR, Wang LP, Lane TJ, Pande VS. MDTraj: a modern open library for the analysis of molecular dynamics trajectories. Biophys J. 2015;109(8):1528–32.
Howe PW. Principal components analysis of protein structure ensembles calculated using NMR data. J Biomol NMR. 2001;20(1):61–70.
Wang G, Dunbrack RL Jr. PISCES: a protein sequence culling server. Bioinformatics. 2003;19(12):1589–91.
Deng H, Jia Y, Zhang Y. 3DRobot: automated generation of diverse and wellpacked protein structure decoys. Bioinformatics. 2016;32(3):378–87.
Mirzaie M, Sadeghi M. Delaunaybased nonlocal interactions are sufficient and accurate in protein fold recognition. Proteins. 2014;82(3):415–23.
Zhou H, Zhou Y. Distancescaled, finite idealgas reference state improves structurederived potentials of mean force for structure selection and stability prediction. Protein Sci. 2002;11(11):2714–26.
Abdi H, Williams LJ. Principal component analysis. Wiley Interdiscip Rev. 2010;2(4):433–59.
Acknowledgments
Not applicable.
Funding
This work was financially supported by the Iran National Science Foundation (INSF) (http://en.insf.org/) under grant number 97003215. The funders did not play any role in the design, analysis, or interpretation of the study or in writing the manuscript.
Author information
Authors and Affiliations
Contributions
MM introduced the model, analyzed the data and wrote the paper. The author read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The author declares that he has no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1.
The supplementary material includes additional figures and tables.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Mirzaie, M. Identification of native protein structures captured by principal interactions. BMC Bioinformatics 20, 604 (2019). https://doi.org/10.1186/s1285901931866
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901931866
Keywords
 Knowledgebased potential
 Principal interaction
 Native structure
 Decoy set