CRNPRED: highly accurate prediction of one-dimensional protein structures by large-scale critical random networks
 Akira R Kinjo^{1, 2, 3} and
 Ken Nishikawa^{1, 2}
DOI: 10.1186/1471-2105-7-401
© Kinjo and Nishikawa; licensee BioMed Central Ltd. 2006
Received: 07 April 2006
Accepted: 05 September 2006
Published: 05 September 2006
Abstract
Background
One-dimensional protein structures such as secondary structures or contact numbers are useful for three-dimensional structure prediction and helpful for intuitive understanding of the sequence-structure relationship. Accurate prediction methods will serve as a basis for these and other purposes.
Results
We implemented a program, CRNPRED, which predicts secondary structures, contact numbers and residue-wise contact orders. This program is based on a novel machine learning scheme called critical random networks. Unlike most conventional one-dimensional structure prediction methods, which are based on local windows of an amino acid sequence, CRNPRED takes into account the whole sequence. CRNPRED achieves, on average per chain, Q_{3} = 81% for secondary structure prediction, and correlation coefficients of 0.75 and 0.61 for contact number and residue-wise contact order predictions, respectively.
Conclusion
CRNPRED will be a useful tool for computational as well as experimental biologists who need accurate one-dimensional protein structure predictions.
Background
One-dimensional (1D) structures of a protein are residue-wise quantities or symbols onto which some features of the native three-dimensional (3D) structure are projected. 1D structures are of interest for several reasons. For example, predicted secondary structures, a kind of 1D structure, are often used to limit the conformational space to be searched in 3D structure prediction. Furthermore, it has recently been shown that certain sets of the native (as opposed to predicted) 1D structures of a protein contain sufficient information to recover the native 3D structure [1, 2]. These 1D structures are either the principal eigenvector of the contact map [1] or a set of secondary structures (SS), contact numbers (CN) and residue-wise contact orders (RWCO) [2]. Therefore, it is possible, at least in principle, to predict the native 3D structure by first predicting the 1D structures and then constructing the 3D structure from them. 1D structures are not only useful for 3D structure prediction but, owing to their residue-wise character, also helpful for intuitive understanding of the correspondence between a protein structure and its amino acid sequence. Therefore, accurate prediction of 1D protein structures is of fundamental biological interest.
Secondary structure prediction has a long history [3]. Almost all modern predictors are based on position-specific scoring matrices (PSSM) and some kind of machine learning technique such as neural networks or support vector machines. Currently the best predictors achieve Q_{3} of 77–79% [4, 5]. The study of contact number prediction also started a long time ago [6, 7], but further improvements were made only recently [8–10]. These recent methods are based on the ideas developed in SS prediction (i.e., PSSM and machine learning), and achieve a correlation coefficient of 0.68–0.73.
Recently, we have developed a new method for accurately predicting SS, CN, and RWCO based on a novel machine learning scheme, critical random networks (CRN) [10]. In this paper, we briefly describe the formulation of the method, and recent improvements leading to even better predictions. The computer program for SS, CN, and RWCO prediction named CRNPRED has been developed for the convenience of the general user, and a web interface and source code are made available online.
Implementation
Definition of 1D structures
Secondary structures (SS)
Secondary structures were defined by the DSSP program [11]. For three-state SS prediction, the simple encoding scheme (the so-called CK mapping) was employed [12]. That is, α helices (H), β strands (E), and other structures ("coils") defined by DSSP were encoded as H, E, and C, respectively. Note that we do not use the CASP-style conversion scheme (the so-called EHL mapping) in which DSSP's H, G (3_{10} helix) and I (π helix) are encoded as H, and DSSP's E and B (β bridge) as E. We believe the CK mapping is more natural and useful for 3D structure predictions (e.g., geometrical restraints should be different between an α helix and a 3_{10} helix). For SS prediction, we introduce feature variables (${y}_{i}^{H},{y}_{i}^{E},{y}_{i}^{C}$) to represent each type of secondary structure at the i-th residue position, so that H is represented as (1, −1, −1), E as (−1, 1, −1), and C as (−1, −1, 1).
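The CK mapping and the target triples can be sketched in a few lines of Python. This is a minimal illustration, not part of CRNPRED: the names `ck_map`, `encode` and `FEATURES` are hypothetical, and the triples are taken to be ±1 indicator vectors (+1 for the observed class, −1 for the other two).

```python
# Three-state reduction of DSSP codes (CK mapping): only H and E keep
# their class; every other DSSP code is treated as coil (C).
# The +1/-1 triples below are the assumed (y_H, y_E, y_C) targets.
FEATURES = {
    "H": (1, -1, -1),
    "E": (-1, 1, -1),
    "C": (-1, -1, 1),
}


def ck_map(dssp_string):
    """Map a DSSP secondary-structure string to the H/E/C alphabet."""
    return "".join(c if c in ("H", "E") else "C" for c in dssp_string)


def encode(dssp_string):
    """Return one (y_H, y_E, y_C) target triple per residue."""
    return [FEATURES[s] for s in ck_map(dssp_string)]
```

Under this mapping a 3_{10} helix (G) or a β bridge (B) becomes coil, which is exactly where the CK mapping differs from the EHL mapping.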
Contact numbers (CN)
Let C_{i,j} represent the contact map of a protein. Usually, the contact map is defined so that C_{i,j} = 1 if the i-th and j-th residues are in contact by some definition, and C_{i,j} = 0 otherwise. As in our previous study, we slightly modify the definition using a sigmoid function. That is,
C_{i,j} = 1/{1 + exp [w(r_{i,j} − d)]} (1)
where r_{i,j} is the distance between the C_{β} (C_{α} for glycines) atoms of the i-th and j-th residues, d = 12 Å is a cutoff distance, and w is a sharpness parameter of the sigmoid function, which is set to 3 [8, 2]. The rather generous cutoff length of 12 Å was shown to optimize the prediction accuracy [8]. The use of the sigmoid function enables us to use the contact numbers in molecular dynamics simulations [2]. Using the above definition of the contact map, the contact number of the i-th residue of a protein is defined as
${n}_{i}={\sum}_{j:|i-j|>2}{C}_{i,j}$ (2)
The feature variable y_{i} for CN is defined as y_{i} = n_{i}/log L, where L is the sequence length of a target protein. The normalization factor log L is introduced because we have observed that the contact number averaged over a protein chain is roughly proportional to log L, and thus division by this value removes the size-dependence of predicted contact numbers.
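As an illustration of the definitions in this subsection, the following sketch computes the sigmoid contact map, the contact numbers and the normalized feature from a list of C_{β} coordinates. It is a minimal example under stated assumptions: the function names are hypothetical, and the exclusion of near neighbours along the chain is controlled by an assumed parameter `min_sep` (only pairs with |i − j| ≥ `min_sep` are summed).

```python
import math


def contact_map(coords, d=12.0, w=3.0):
    """Sigmoid contact map C[i][j] = 1 / (1 + exp(w * (r_ij - d)))."""
    L = len(coords)
    C = [[0.0] * L for _ in range(L)]
    for i in range(L):
        for j in range(L):
            x = w * (math.dist(coords[i], coords[j]) - d)
            # Guard against overflow of exp() for very distant pairs.
            C[i][j] = 0.0 if x > 700.0 else 1.0 / (1.0 + math.exp(x))
    return C


def contact_numbers(coords, min_sep=3):
    """Contact number n_i, summed over residues with |i - j| >= min_sep.

    min_sep is an assumption of this sketch, not a documented option.
    """
    C = contact_map(coords)
    L = len(coords)
    return [sum(C[i][j] for j in range(L) if abs(i - j) >= min_sep)
            for i in range(L)]


def cn_feature(coords):
    """Size-normalized feature y_i = n_i / log(L)."""
    L = len(coords)
    return [n / math.log(L) for n in contact_numbers(coords)]
```

A pair of residues exactly at the cutoff distance of 12 Å contributes 0.5 to each contact number, which shows how the sigmoid replaces the hard 0/1 threshold.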
Residuewise contact orders (RWCO)
RWCO was first introduced in [2]. This quantity measures the extent to which a residue makes long-range contacts in a native protein structure. Using the same notation as contact numbers, the RWCO of the i-th residue in a protein structure is defined by
${o}_{i}={\sum}_{j:|i-j|>2}|i-j|{C}_{i,j}$ (3)
The feature variable y_{i} for RWCO is defined as y_{i} = o_{i}/L, where L is the sequence length. For the same reason as for CN, the normalization factor L was introduced to remove the size-dependence of the predicted RWCOs (the RWCO averaged over a protein chain is roughly proportional to the chain length).
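The RWCO differs from the contact number only in that each contact is weighted by its sequence separation. A minimal sketch, taking a precomputed contact map as a list of lists, could look as follows; the function names and the `min_sep` parameter (minimum sequence separation included in the sum) are assumptions of this example.

```python
def rwco(C, min_sep=3):
    """Residue-wise contact order o_i = sum_j |i - j| * C[i][j].

    Only pairs with |i - j| >= min_sep are included (assumed cutoff).
    """
    L = len(C)
    return [sum(abs(i - j) * C[i][j]
                for j in range(L) if abs(i - j) >= min_sep)
            for i in range(L)]


def rwco_feature(C):
    """Size-normalized feature y_i = o_i / L."""
    L = len(C)
    return [o / L for o in rwco(C)]
```

For a five-residue chain in which every pair is in full contact, the first residue gets o = 3 + 4 = 7, while the central residue has no partner far enough along the chain and gets o = 0.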
Critical random networks
Here we briefly describe the critical random network (CRN) method introduced in [10], which should be consulted for details. Unlike most conventional methods for 1D structure prediction (with some exceptions, including bidirectional recurrent neural networks [13, 5, 14]), the CRN method takes the whole amino acid sequence into account. In the CRN method, an N-dimensional state vector x_{i} is assigned to the i-th residue of the target sequence (we use N = 5000 throughout this paper). Neighboring state vectors along the sequence are connected via a random N × N orthogonal matrix W. This matrix is also block-diagonal, with block sizes drawn uniformly at random between 2 and 50. The input to the CRN is the position-specific scoring matrix (PSSM), U = (u_{1}, ..., u_{L}), of the target sequence obtained by PSI-BLAST [15] (L is the sequence length of the target protein). We require the state vectors to satisfy the following equation of state:
x_{i} = tanh [βW (x_{i−1} + x_{i+1}) + αV u_{i}] (4)
for i = 1, ..., L, where V is an N × 21 random matrix (the 21st component of u_{i} is always set to unity), and β and α are scalar parameters. The fixed boundary condition is imposed (x_{0} = x_{L+1} = 0). By setting β = 0.5, the system of state vectors is made to be near a critical point in a certain sense, and thus the range of site-site correlation is expected to be long when α is sufficiently small but finite [10]. The value of α was chosen so that the resulting solution x_{i} oscillates continuously with respect to the residue number i; that is, each component of x_{i} takes values between −1 and 1, rather than forming a discrete sequence of −1 or 1. It can be shown that there exists a unique solution of Eq. 4 for a given PSSM U (provided the above boundary condition and β = 0.5). The solution {x_{i}} of Eq. 4 (i.e., the state vectors) can be interpreted as a kind of pattern that reflects the complicated interactions among neighboring residues along the amino acid sequence. In this way, each state vector implicitly incorporates long-range correlations, and its components serve as additional independent variables for the linear predictor described in the following. The 1D structure of the i-th residue is predicted as a linear projection of a local window of the PSSM and the state vector obtained by solving Eq. 4:
${y}_{i}={\sum}_{m=-M}^{M}{\sum}_{a=1}^{21}{D}_{m,a}{u}_{a,i+m}+{\sum}_{k=1}^{N}{E}_{k}{x}_{k,i}$ (5)
where y_{i} is the predicted quantity, and D_{m,a} and E_{k} are the regression parameters. In the first summation, each PSSM column is extended to include the "terminal" residue. Since Eq. 5 is a simple linear equation once the equation of state (Eq. 4) has been solved, learning the parameters D_{m,a} and E_{k} reduces to an ordinary linear regression problem. For SS prediction, the triple (${y}_{i}^{H},{y}_{i}^{E},{y}_{i}^{C}$) is calculated simultaneously, and the SS class is predicted as arg max_{s∈{H,E,C}} ${y}_{i}^{s}$. For CN and RWCO prediction, real values are predicted. Two-state prediction is also made for CN using the average CN for each residue type as the threshold for "exposed" or "buried", as in [16]. We have noted earlier [8] that the apparent accuracy of two-state CN prediction depends on the threshold. Although we proposed that the median rather than the average CN would be a more appropriate threshold, here we use the average in order to compare our results with others. The half window size M is set to 9 for SS and CN predictions, and to 26 for RWCO. Note that the solution of the equation of state (Eq. 4) is determined solely by the PSSM. Therefore, obtaining the solution to Eq. 4 can be regarded as a kind of unsupervised learning, and the method for solving the equation of state is irrelevant for learning the parameters.
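The linear readout described above can be sketched as follows. This is an illustrative reading of the text, not the CRNPRED source: the function names are hypothetical, and the handling of the "terminal" residue is an assumption (window positions falling outside the chain are represented by a 21st component set to 1 with all 20 profile scores set to 0).

```python
def predict_residue(i, pssm, state, D, E, M):
    """Linear readout for residue i.

    pssm[j]  : 20 profile scores of residue j
    state[j] : N-dimensional state vector x_j (solution of Eq. 4)
    D        : (2M + 1) x 21 window weights; index 20 is the terminal flag
    E        : N state-vector weights
    """
    L = len(pssm)
    y = 0.0
    for m in range(-M, M + 1):
        j = i + m
        if 0 <= j < L:
            column = list(pssm[j]) + [0.0]
        else:
            # Assumed encoding of a window position beyond a chain terminus.
            column = [0.0] * 20 + [1.0]
        y += sum(d * u for d, u in zip(D[m + M], column))
    y += sum(e * x for e, x in zip(E, state[i]))
    return y


def predict_ss(y_h, y_e, y_c):
    """Pick the SS class with the largest feature value (arg max)."""
    scores = {"H": y_h, "E": y_e, "C": y_c}
    return max(scores, key=scores.get)
```

Because the readout is linear in D and E, fitting these weights for fixed state vectors is an ordinary least-squares problem, which is the point made in the paragraph above.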
Ensemble prediction
Since the CRN-based prediction is parametrized by the random matrices W and V, slightly different predictions are obtained for different pairs of W and V. We can improve the prediction by taking the average over an ensemble of such different predictions. Twenty CRN-based predictors were constructed using 20 different sets of random matrices W and V. CN and RWCO are predicted as uniform averages of these 20 predictions.
For SS prediction, we employ further training. Let ${s}_{i}^{t,n}$ be the prediction result of the n-th predictor for 1D structure t (H, E, C, CN, and RWCO) of the i-th residue. The second-stage SS prediction is made by the following linear scheme:
${y}_{i}^{ss}={\sum}_{n=1}^{20}{\sum}_{t}{\sum}_{m}{w}_{n,t,m}{s}_{i+m}^{t,n}$ (6)
where ss = H, E, C, and w_{n,t,m} is the weight obtained from a training set. Finally, the feature variable for each SS class of the i-th residue is obtained by (${y}_{i-1}^{ss}+2{y}_{i}^{ss}+{y}_{i+1}^{ss}$)/4. This last procedure was found particularly effective for improving the segment overlap (SOV) measure.
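The two simplest steps of this subsection, the uniform ensemble average and the final three-point smoothing, can be illustrated as below. The function names are hypothetical, and the treatment of the two chain ends in `smooth` (a missing neighbour is replaced by the residue's own value) is an assumption, since the text does not specify it.

```python
def ensemble_mean(predictions):
    """Uniform average over predictors.

    predictions[n][i] is the value predicted by predictor n for residue i.
    """
    n = len(predictions)
    return [sum(column) / n for column in zip(*predictions)]


def smooth(y):
    """Three-point filter (y[i-1] + 2*y[i] + y[i+1]) / 4.

    At the chain ends the missing neighbour is replaced by y[i] itself
    (an assumed boundary treatment).
    """
    L = len(y)
    out = []
    for i in range(L):
        left = y[i - 1] if i > 0 else y[i]
        right = y[i + 1] if i < L - 1 else y[i]
        out.append((left + 2 * y[i] + right) / 4)
    return out
```

The filter damps isolated single-residue spikes in a feature profile, which is consistent with its reported benefit for the segment-based SOV measure.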
Additional input
Another improvement is the addition of the amino acid composition of the target sequence to the predictor [9]: the term ${\sum}_{a=1}^{20}{F}_{a}{f}_{a}$ was added to Eq. 5, where F_{a} is a regression parameter and f_{a} is the fraction of amino acid type a. From a preliminary work based on a linear predictor [10], it was observed that this input slightly improved the accuracy, by ~0.2%.
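A minimal sketch of the composition term, assuming the 20 standard amino acids in a fixed (here alphabetical one-letter) order and ignoring any other symbols in the sequence; the names `composition` and `composition_term` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 types


def composition(sequence):
    """Fractions f_a of the 20 standard amino acids in the sequence."""
    counts = {a: 0 for a in AMINO_ACIDS}
    for residue in sequence.upper():
        if residue in counts:  # non-standard symbols are skipped
            counts[residue] += 1
    total = sum(counts.values())
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


def composition_term(sequence, F):
    """Additional input: sum over a of F_a * f_a."""
    return sum(F_a * f_a for F_a, f_a in zip(F, composition(sequence)))
```

Note that this term is the same for every residue of a given chain, so it acts as a chain-specific offset on top of the residue-wise terms of Eq. 5.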
Training and test data set
We carried out a 15-fold cross-validation test following exactly the same procedure and the same data set as the previous study [10]. In the data set, there are 680 protein domains, each of which represents a superfamily according to the SCOP database (version 1.65) [17]. This data set was randomly divided so that 630 domains were used for training and the remaining 50 domains for testing, and the random division was repeated 15 times [See Additional File 1]. No pair of these domains belongs to the same superfamily, and hence they are not expected to be homologous. Thus, the present benchmark is a very stringent one. For obtaining PSSMs by running PSI-BLAST, we use the UniRef100 (version 6.8) amino acid sequence database [18] containing some 3 million entries. Also, the number of iterations in PSI-BLAST homology searches was reduced from the 10 used in the previous study to 3. This especially increased the accuracy of SS predictions. These results are consistent with the study of [19].
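The repeated random division can be sketched as follows. This only illustrates the procedure; it does not reproduce the actual divisions listed in Additional File 1, and the function name and the seed are hypothetical.

```python
import random


def random_splits(domains, n_test=50, n_repeats=15, seed=0):
    """Repeat a random train/test division n_repeats times.

    Returns a list of (training_set, test_set) pairs. The seed is an
    arbitrary choice made for reproducibility of this sketch.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        shuffled = list(domains)
        rng.shuffle(shuffled)
        splits.append((shuffled[n_test:], shuffled[:n_test]))
    return splits
```

With 680 domains each repetition gives 630 training and 50 test domains. Because the divisions are drawn independently, a domain may appear in the test set of more than one repetition.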
Numerics
One drawback of the CRN method is the computational time required for numerically solving the equation of state (Eq. 4). For that purpose, instead of the Gauss-Seidel-like method previously used, we implemented a successive over-relaxation method, which was found to be much more efficient. Let v denote the stage of iteration. We set the initial value of the state vectors (with v = 0) as
Then, for i = 1, ..., L (in increasing order of i), we update the state vectors by
Next, we update them in the reverse order. That is, for i = L, ..., 1 (in decreasing order of i),
We then set v ← v + 1, and iterate Eqs. (8) and (9) until {x_{ i }} converges. The acceleration parameter of ω = 1.4 was found effective. The convergence criterion is
where ‖·‖_{R^{N}} denotes the Euclidean norm. This criterion is much less stringent than that of the previous study (10^{−7}), but this does not affect the prediction accuracy significantly. Convergence is typically achieved within 10 to 12 iterations for one protein.
It is noted that the algorithm and parameters presented in this subsection are determined only for efficiently solving the equation of state (Eq. 4). As such, the choice of the parameters such as ω or the threshold of convergence has little, if any, impact on the prediction accuracy.
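To make the sweep structure concrete, the following toy solver applies successive over-relaxation to Eq. 4 with alternating forward and backward passes. It is a sketch under several assumptions rather than the CRNPRED implementation: W is built from 2 × 2 rotation blocks only (so N must be even), the input term αV u_{i} is passed in precomputed as `drive[i]`, the starting guess tanh(αV u_{i}) and the stopping rule (largest component change below `tol`) are choices made for this example, and all names are hypothetical.

```python
import math


def apply_w(angles, x):
    """Multiply a block-diagonal orthogonal W (2x2 rotations) by x."""
    out = []
    for b, theta in enumerate(angles):
        a, c = x[2 * b], x[2 * b + 1]
        out.append(math.cos(theta) * a - math.sin(theta) * c)
        out.append(math.sin(theta) * a + math.cos(theta) * c)
    return out


def rhs(angles, drive_i, left, right, beta):
    """Right-hand side of Eq. 4 for one residue."""
    coupled = apply_w(angles, [l + r for l, r in zip(left, right)])
    return [math.tanh(beta * c + d) for c, d in zip(coupled, drive_i)]


def solve_state(angles, drive, beta=0.5, omega=1.4, tol=1e-10,
                max_iter=1000):
    """Over-relaxed sweeps (forward, then backward) over the chain.

    drive[i] holds alpha * V u_i; boundary vectors x_0, x_{L+1} are zero.
    """
    L, N = len(drive), len(drive[0])
    zero = [0.0] * N
    x = [[math.tanh(d) for d in row] for row in drive]  # assumed start
    for _ in range(max_iter):
        change = 0.0
        for order in (range(L), range(L - 1, -1, -1)):
            for i in order:
                left = x[i - 1] if i > 0 else zero
                right = x[i + 1] if i < L - 1 else zero
                target = rhs(angles, drive[i], left, right, beta)
                new = [(1 - omega) * xi + omega * t
                       for xi, t in zip(x[i], target)]
                change = max(change,
                             max(abs(a - b) for a, b in zip(new, x[i])))
                x[i] = new
        if change < tol:  # assumed stopping rule
            break
    return x
```

Setting `omega=1.0` recovers a Gauss-Seidel-like sweep. Since the solver only has to reach the unique solution of Eq. 4, choices such as `omega` or `tol` affect the running time rather than the state vectors that enter the regression.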
Results and discussion
Summary of average prediction accuracies per chain (median in parentheses).
SS      Q_{3} = 80.5% (81.6)    SOV = 80.0% (81.1)
CN      Cor = 0.746 (0.768)     DevA = 0.686 (0.670)
RWCO    Cor = 0.613 (0.646)     DevA = 0.877 (0.812)
Summary of per-residue accuracies for SS predictions.
measure                  H        E        C
Q_{s} (%)                82.7     69.3     84.0
${Q}_{s}^{pre}$ (%)      84.4     78.9     78.3
MC                       0.754    0.674    0.645
Finally, we comment on the practical applicability of predicted 1D structures. We do not believe, at present, that the construction of a 3D structure purely from the predicted 1D structures is practical, if possible at all, because of the limited accuracy of the RWCO prediction. However, SS and CN predictions are very accurate for many proteins, so they may already serve as valuable restraints for 3D structure predictions. Also, SS and CN predictions may be applied to domain identification, which is often necessary for experimental determination of protein structures. CRNPRED has proved useful for such a purpose [25]. Although of limited accuracy, predicted RWCOs still exhibit significant correlations with the correct values. Since RWCOs reflect the extent to which a residue is involved in long-range contacts, predicted RWCOs may be useful for enumerating potentially structurally important residues [26]. An interesting alternative application of the CRN framework is to regard the solution of the equation of state (Eq. 4) as an extended sequence profile. By so doing, it is straightforward to apply the solution to profile-profile comparison for fold recognition [27]. Such an application may also be pursued in the future.
Conclusion
We have developed the CRNPRED program that predicts secondary structures (SS), contact numbers (CN), and residue-wise contact orders (RWCO) of a protein given its amino acid sequence. The method is based on large-scale critical random networks. The achieved accuracies are at least as high as those of other predictors for SS and currently the best for CN and RWCO, although the success of RWCO prediction is still limited. CRNPRED will be a useful tool for computational as well as experimental biologists who need accurate one-dimensional protein structure predictions.
Availability and requirements
Project name: CRNPRED
Project home page: http://bioinformatics.org/crnpred/
Operating system: UNIX-like OS (including Linux and Mac OS X).
Programming language: C.
Other requirements: zsh, PSI-BLAST (blastpgp), the UniRef100 amino acid sequence database.
License: Public domain.
Any restrictions to use by nonacademics: None.
Abbreviations
CRN: critical random network
SS: secondary structure
CN: contact number
RWCO: residue-wise contact order
1D: one-dimensional
3D: three-dimensional
Declarations
Acknowledgements
We thank Yasumasa Shigemoto for helping construct the CRNPRED web interface. This work was supported in part by the MEXT, Japan.
Authors’ Affiliations
References
1. Porto M, Bastolla U, Roman HE, Vendruscolo M: Reconstruction of protein structures from a vectorial representation. Phys Rev Lett 2004, 92: 218101.
2. Kinjo AR, Nishikawa K: Recoverable one-dimensional encoding of three-dimensional protein structures. Bioinformatics 2005, 21: 2167–2170. [Doi:10.1093/bioinformatics/bti330]
3. Rost B: Prediction in 1D: secondary structure, membrane helices, and accessibility. In Structural Bioinformatics. Edited by: Bourne PE, Weissig H. Hoboken, U.S.A.: Wiley-Liss, Inc; 2003:559–587.
4. Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999, 292: 195–202.
5. Pollastri G, McLysaght A: Porter: a new, accurate server for protein secondary structure prediction. Bioinformatics 2005, 21: 1719–1720.
6. Nishikawa K, Ooi T: Prediction of the surface-interior diagram of globular proteins by an empirical method. Int J Pept Protein Res 1980, 16(1):19–32.
7. Nishikawa K, Ooi T: Radial locations of amino acid residues in a globular protein: Correlation with the sequence. J Biochem 1986, 100: 1043–1047.
8. Kinjo AR, Horimoto K, Nishikawa K: Predicting absolute contact numbers of native protein structure from amino acid sequence. Proteins 2005, 58: 158–165. [Doi:10.1002/prot.20300]
9. Yuan Z: Better prediction of protein contact number using a support vector regression analysis of amino acid sequence. BMC Bioinformatics 2005, 6: 248.
10. Kinjo AR, Nishikawa K: Predicting secondary structures, contact numbers, and residue-wise contact orders of native protein structure from amino acid sequence using critical random networks. BIOPHYSICS 2005, 1: 67–74. [Doi:10.2142/biophysics.1.67] http://www.jstage.jst.go.jp/article/biophysics/1/0/1_67/
11. Kabsch W, Sander C: Dictionary of Protein Secondary Structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22: 2577–2637.
12. Crooks GE, Brenner SE: Protein secondary structure: entropy, correlations and prediction. Bioinformatics 2004, 20: 1603–1611.
13. Baldi P, Brunak S, Frasconi P, Soda G, Pollastri G: Exploiting the past and the future in protein secondary structure prediction. Bioinformatics 1999, 15: 937–946.
14. Chen J, Chaudhari NS: Bidirectional segmented-memory recurrent neural network for protein secondary structure prediction. Soft Computing 2006, 10: 315–324.
15. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DL: Gapped Blast and PSI-Blast: A new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402.
16. Pollastri G, Baldi P, Fariselli P, Casadio R: Prediction of coordination number and relative solvent accessibility in proteins. Proteins 2002, 47: 142–153.
17. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: A structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247: 536–540.
18. Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale D, O'Donovan C, Redaschi N, Yeh LS: The universal protein resource (UniProt). Nucleic Acids Res 2005, 33: D154–D159.
19. Przybylski D, Rost B: Alignments grow, secondary structure prediction improves. Proteins 2002, 46: 197–205.
20. McGuffin LJ, Bryson K, Jones DT: The PSIPRED protein structure prediction server. Bioinformatics 2000, 16: 404–405.
21. Eyrich VA, Marti-Renom MA, Przybylski D, Madhusudhan MS, Fiser A, Pazos F, Valencia A, Sali A, Rost B: EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics 2001, 17: 1242–1243.
22. Sander C, Schneider R: Database of homology-derived protein structures. Proteins 1991, 9: 56–68.
23. Kinjo AR, Nishikawa K: Eigenvalue analysis of amino acid substitution matrices reveals a sharp transition of the mode of sequence conservation in proteins. Bioinformatics 2004, 20: 2504–2508.
24. Bastolla U, Porto M, Roman HE, Vendruscolo M: Principal eigenvector of contact matrices and hydrophobicity profiles in proteins. Proteins 2005, 58: 22–30.
25. Minezaki Y, Homma K, Kinjo AR, Nishikawa K: Human transcription factors contain a high fraction of intrinsically disordered regions essential for transcriptional regulation. J Mol Biol 2006, 359: 1137–1149.
26. Kihara D: The effect of long-range interactions on the secondary structure formation of proteins. Protein Sci 2005, 14: 1955–1963.
27. Tomii K, Akiyama Y: FORTE: a profile-profile comparison tool for protein fold recognition. Bioinformatics 2004, 20: 594–595.
28. Zemla A, Venclovas C, Fidelis K, Rost B: A modified definition of Sov, a segment-based measure for protein secondary structure prediction assessment. Proteins 1999, 34: 220–223.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.