- Methodology article
- Open Access
Accurate prediction of protein secondary structure and solvent accessibility by consensus combiners of sequence and structure information
© Pollastri et al; licensee BioMed Central Ltd. 2007
- Received: 31 January 2007
- Accepted: 14 June 2007
- Published: 14 June 2007
Structural properties of proteins such as secondary structure and solvent accessibility contribute to three-dimensional structure prediction, not only in the ab initio case but also when homology information to known structures is available. Structural properties are also routinely used in protein analysis even when homology is available, largely because homology modelling is lower throughput than, say, secondary structure prediction. Nonetheless, predictors of secondary structure and solvent accessibility are virtually always ab initio.
Here we develop high-throughput machine learning systems for the prediction of protein secondary structure and solvent accessibility that exploit homology to proteins of known structure, where available, in the form of simple structural frequency profiles extracted from sets of PDB templates. We compare these systems to their state-of-the-art ab initio counterparts, and with a number of baselines in which secondary structures and solvent accessibilities are extracted directly from the templates. We show that structural information from templates greatly improves secondary structure and solvent accessibility prediction quality, and that, on average, the systems significantly enrich the information contained in the templates. For sequence similarity exceeding 30%, secondary structure prediction quality is approximately 90%, close to its theoretical maximum, and 2-class solvent accessibility roughly 85%. Gains are robust with respect to template selection noise, and significant for marginal sequence similarity and for short alignments, supporting the claim that these improved predictions may prove beneficial beyond the case in which clear homology is available.
The predictive systems are publicly available at the address http://distill.ucd.ie.
- Secondary Structure
- Secondary Structure Prediction
- Solvent Accessibility
- Protein Secondary Structure
- Evolutionary Information
Protein secondary structure and solvent accessibility predictions are an important stage towards the prediction of protein structure and function. Accurate secondary structure and solvent accessibility information is not only at the core of most ab initio methods for the prediction of protein structure (e.g. see ) but is also effective in improving the sensitivity of fold recognition methods (e.g. [3–5]), and is routinely used in protein analysis and annotation .
Virtually all modern methods for the prediction of protein one-dimensional structural features (i.e. those features which may be represented as a string of the same length as the primary sequence, such as secondary structure and solvent accessibility) are based on machine learning techniques [7–22], and exploit evolutionary information in the form of amino acid frequency profiles extracted from alignments of multiple sequences, generally of unknown structure. The progress of these methods over the last 10 years has been slow, but steady, and is due to numerous factors: the ever-increasing size of training sets; more sensitive methods for the detection of homologues, such as PSI-BLAST ; the use of ensembles of multiple predictors trained independently, sometimes tens of them ; more sophisticated machine learning techniques (e.g. ); a combination of a number of the above .
Predictors of secondary structure and solvent accessibility are virtually always ab initio (with very few exceptions, e.g., recently, ), meaning that they do not rely directly on similarity to proteins of known structure. In fact, often, much care is taken to try to exclude any detectable similarity between training and test set instances when gauging predictive performances of structural feature predictors. The main reason for this seems to be a short-circuit, which happened early on in the field and was never disputed, between the idea of hypothesis validation by strict training and test set separation (borrowed from statistical learning), and the concept of ab initio prediction. For training and test sets to be strictly distinct, they are required to not only contain different examples (which is all the statistical learning principle dictates, together with independence and identical distribution), but to contain examples that do not show significant sequence identity to one another, as detected by a standard BLAST  search. A hint of the historical, more than scientific, nature of this issue is the fact that when subtler algorithms for sequence similarity detection became available (e.g. PSI-BLAST ), the criteria for training vs. test set separation did not always change.
Currently over half of all known protein sequences show some detectable degree of similarity to one or more sequences of known structure. Nearly 3/4 of newly deposited structures in the PDB  show significant similarity to previously deposited structures . Over 60% of the queries received by the server Porter  in the first six months of year 2006 have potential homologues in the PDB at the moment of submission (PSI-BLAST e-value smaller than 0.01), and another 25% have marginal similarity to some sequence in the PDB (PSI-BLAST e-value between 0.01 and 10). For the case of clear homology, direct structural information from the homologous proteins can be exploited for the prediction of structural features. For instance, secondary structure extracted from full three-dimensional comparative models is known to be significantly more reliable than secondary structure obtained from ab initio predictors [8, 22]. Moreover, even where alignments to PDB structures are of dubious reliability, or too short to reliably imply homology, these may carry information. One of the main sources of improvement for fold recognition and ab initio structure prediction methods over the last few CASP competitions [25–28] has been the reliance on sets of possible conformations for short fragments of chain , extracted from the PDB.
There are a number of reasons why direct, machine learning-based predictions of secondary structure or other structural features incorporating homology information are useful:
- nearly all the most reliable public predictors [6, 9, 14, 29, 30] ( is an exception, potentially equally reliable, although currently not tested by independent assessors such as EVA ) do not take structural information directly into account, which implies that over half of the responses provided to users could be improved, often dramatically;
- machine learning methods are robust with respect to noise – selecting a template from a set of candidate structures from the PDB may be less of a problem than in traditional comparative modelling, since a set or a profile of templates (possibly conflicting) may be provided to the method, rather than a single template which might be erroneous;
- machine learning methods are significantly faster than full comparative modelling methods – large-scale predictions may be generated with relatively modest computational resources, and feed into structure-based functional similarity algorithms, comparative modelling validation and template selection, protein analysis and proteome annotation efforts;
- low-similarity, short-alignment based predictions may improve on traditional ab initio ones in fold recognition or even novel fold cases.
Here we develop high-throughput systems for the prediction of protein secondary structure and solvent accessibility, which exploit similarity to proteins of known structure, where available, in the form of simple structural frequency profiles from sets of PDB templates. The systems have two stages: one in which a set of templates for a query sequence is generated based on a similarity search of the PDB; one in which this template information, plus the primary sequence, and evolutionary information in the form of multiple alignments is used as input to an ensemble of recursive neural networks to determine a query's secondary structure and solvent accessibility. Although here we use a simple PSI-BLAST-based protocol to find suitable templates (see Methods section), our systems are fully modular and may easily accommodate more sophisticated stages with better sensitivity to remote homology (e.g. [5, 32]). It is important to stress that, when homology information is available, the systems we design here do not simply take it as the final answer, but rather use it as a further input. This, on average, leads to significant improvements over extracting secondary structure and solvent accessibility directly from the best single PDB template, and weighed and unweighed averages of the top 10 templates, or all templates identified, suggesting that the combination of sequence and template information carries more information than templates alone. Not surprisingly, when only very high quality templates are available (PSI-BLAST e-value smaller than roughly 10⁻³⁰), which are almost guaranteed to be close homologues, the improvements become marginal.
We also compare the predictive systems to their state-of-the-art ab initio counterparts. We show that similarity information, when available, greatly improves prediction quality. For sequence similarity exceeding 30%, prediction quality is nearly at its theoretical maximum. Gains are significant for low sequence similarity when we design specialised systems for this case, and for alignments shorter than 20 residues, outside traditional comparative modelling territory, supporting the claim that these improved predictions may prove beneficial for fold recognition algorithms.
The predictive systems described in this paper are publicly available at the address http://distill.ucd.ie, as part of a suite of predictors of protein structural features. When the user requests secondary structure (Porter) or solvent accessibility (PaleAle) predictions, homology-based results are automatically selected when suitable templates are available. Up to 20,000 queries a day may be served by the 40-CPU cluster hosting the predictors.
The four systems we describe here are: an ab initio secondary structure predictor (Porter ) in three classes; the same, but homology-based (Porter_H); an ab initio predictor of relative solvent accessibility in 4 and 2 classes (PaleAle); the same, but homology-based (PaleAle_H). All systems are trained and tested in rigorous 5-fold cross-validation on the December 2003 25% pdb_select list  (for details, see Methods section). We use this set to make direct comparisons with Porter's published results, and performances as recorded by EVA . The public versions of the servers will undergo regular trainings to keep them up-to-date with the expansion of the PDB.
Porter_H vs. templates
Performances of the template-based secondary structure predictor (Porter_H) compared with a baseline predictor which copies the secondary structure of the best template (baseline) and with the ab initio secondary structure predictor (Porter).
We also tested different baselines in which, instead of just the top template, respectively, the top 10 templates (as ranked by PSI-BLAST) and all the templates are used to predict the secondary structure of a protein. In both cases the prediction is obtained as a majority vote among the templates covering each residue. We tested both an unweighed vote (i.e. one in which each template counts the same) and a vote in which each template is weighed by its sequence similarity to the query, cubed. The latter weighing scheme is identical to the one used to present the templates to the network (see Methods section for details), and we refer to it as baseline_input. In all cases the predictions are worse than those obtained by only considering the top template (by at least 2% for the 95% maximum similarity case, and at least 3% for the 30% and 20% maximum similarity cases), hence at least 4% worse than Porter_H.
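The voting baselines described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `vote_baseline`, the data layout (one `(similarity, structure)` pair per template, with `-` marking residues a template does not cover) and the tie-breaking rule are all hypothetical.

```python
from collections import defaultdict


def vote_baseline(templates, length, weighted=True):
    """Per-residue majority vote among the templates covering each residue.

    templates: list of (similarity, structure) pairs; similarity is a
    fraction in [0, 1], structure is a string of `length` characters with
    '-' where the template does not cover the residue (assumed layout).
    Returns a string with '-' where no template covers the residue.
    """
    prediction = []
    for j in range(length):
        votes = defaultdict(float)
        for similarity, structure in templates:
            label = structure[j]
            if label == "-":
                continue  # this template does not cover residue j
            # Weighted vote: similarity cubed. Unweighted vote: 1 per template.
            votes[label] += similarity ** 3 if weighted else 1.0
        if votes:
            # Ties are broken alphabetically (an arbitrary choice of this sketch).
            prediction.append(max(sorted(votes), key=lambda k: votes[k]))
        else:
            prediction.append("-")
    return "".join(prediction)
```

With one 90% template and two 30% templates that disagree with it, the cubed weighting lets the close template win, whereas the unweighted vote follows the two distant ones.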
Percentage of residues on which Porter_H and PaleAle_H disagree with the template-based part of their input (profile of secondary structure/solvent accessibility frequency from templates, in which each template is weighed by its cubed sequence similarity to the query – see Methods section for details), reported separately for e < 0.01 and e > 0.01.
These results suggest that combining sequence and structure information is a better choice than only relying on templates, i.e. the sequence contains enough information to resolve at least some of the ambiguity contained in sets of templates retrieved by sequence similarity.
We also checked whether the presence of membrane proteins in the sets we use for training and testing has any influence on the results. In total there are 64 membrane proteins out of 2171 in the set, covering roughly 5% of all amino acids. On these proteins Porter_H outperforms Porter by 5.1% (82.1% correct prediction vs. 77%), less than on the whole set. Removing membrane proteins from the set changes the performances of both Porter and Porter_H by less than 0.1%, and keeps the difference between the methods statistically unchanged.
Lastly, we tested Porter_H on the EVA common2 set as available in November 2004, containing 134 proteins. On this set, a version of Porter retrained from scratch, after having excluded from its training set all sequences with more than 25% similarity to any sequence in the set, achieves 76.8% correct prediction, better by at least 1.9% than all the other servers evaluated. On the same set, a similarly retrained Porter_H achieves 81.5% correct prediction when templates with more than 95% sequence similarity to the query are ignored. In this set for 68 out of 134 sequences the best template is below 30% sequence similarity and for 44 below 20%.
Solvent Accessibility prediction
Performances of the template-based 4-class solvent accessibility predictor (PaleAle_H) compared with a baseline predictor which copies the solvent accessibility of the best template (baseline) and with the ab initio solvent accessibility predictor (PaleAle).
Performances of the two-class PaleAle and PaleAle_H compared with a number of recent methods on the Manesh dataset .
We have developed high-throughput systems for the prediction of protein secondary structure and solvent accessibility, exploiting similarity to proteins of known structure. These systems, based on machine learning techniques, greatly outperform their ab initio counterparts when PDB templates are available, are capable of combining sequence information and structural information from multiple templates, and outperform simpler strategies such as the extraction of the structural properties in question from the best available template in the PDB, or from weighed and unweighed profiles of templates. Moreover, they are entirely automated, and can be run on multi-genomic or bioengineering scales. On a small cluster of machines, hundreds of thousands of protein structural features may be predicted in days.
What is especially encouraging is that performance gains are significant even for marginal sequence similarity when we design specialised systems for this case. This suggests that our strategy may feed into fold recognition systems, which currently rely on ab initio secondary structure predictors. A closed-loop strategy in which the results of fold recognition searches are fed back into the predictors is also possible, and is the object of our current investigation.
All predictive systems are available at the address http://distill.ucd.ie. Template-based predictions are automatically returned by the secondary structure prediction server (Porter) and the solvent accessibility server (PaleAle) when templates showing more than 20% sequence similarity to the query are detected. Given the current distribution of queries, this will yield greatly improved predictions for well over half of all requests.
The data set used in our simulations is extracted from the December 2003 25% pdb_select list . We assign each residue's secondary structure and solvent accessibility using the DSSP program . The relative solvent accessibility of a residue is defined as the accessibility in Å² as computed by DSSP, divided by the maximum observed accessibility for that type of residue. Secondary structure is mapped from the 8 DSSP classes into three classes as follows: H, G, I → Helix; E, B → Strand; S, T, . → Coil. Relative solvent accessibility is mapped into 4 classes where class thresholds are chosen to be maximally informative, i.e. to split the set into (roughly) equally numerous classes: [0%, 4%), [4%, 25%), [25%, 50%) and [50%, ∞) exposed.
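The class definitions above translate directly into a few lines of code. The sketch below is illustrative only: the single-letter labels `H`, `E`, `C` for the three classes, the function names, and the use of `.` for the DSSP coil state (as written in the text) are assumptions.

```python
# Three-class reduction of the 8 DSSP states, as listed in the text.
DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # helix
    "E": "E", "B": "E",             # strand
    "S": "C", "T": "C", ".": "C",   # coil
}

# Lower bounds of classes 1, 2 and 3, in percent relative accessibility.
RSA_THRESHOLDS = (4.0, 25.0, 50.0)


def three_class(dssp_string):
    """Map a string of 8-class DSSP states to the 3-class alphabet."""
    return "".join(DSSP_TO_3[s] for s in dssp_string)


def relative_accessibility(accessibility, max_accessibility):
    """Accessibility as a percentage of the maximum for that residue type."""
    return 100.0 * accessibility / max_accessibility


def rsa_class(rsa_percent):
    """Return 0 for [0, 4), 1 for [4, 25), 2 for [25, 50), 3 for [50, inf)."""
    return sum(rsa_percent >= t for t in RSA_THRESHOLDS)
```

Because the upper class is open-ended, a residue whose accessibility exceeds the tabulated maximum for its type still falls into class 3.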
We remove all sequences for which DSSP does not produce an output due, for instance, to missing entries (e.g. if only the Cα trace is present in the PDB file) or format errors. After processing by DSSP, this set (S2171) contains 2171 proteins and 344,653 amino acids. All the tests reported in this paper are run in 5-fold cross validation on S2171. The 5 folds are of roughly equal sizes, composed of 434–435 proteins and ranging between 67,345 and 70,098 residues. The datasets are available upon request.
Prediction from a multiple alignment of protein sequences rather than a single sequence has long been recognised as a way to improve prediction accuracy for virtually all protein structural features: secondary structure [9, 10, 14, 19, 22, 29, 40], solvent accessibility [11, 13, 15–18, 20, 21], beta-sheet pairing [41, 42], contact maps [43–45], etc. We exploit evolutionary information in the form of frequency profiles compiled from alignments of multiple homologous sequences, extracted from the NR database. Multiple sequence alignments for the S2171 set are extracted from the NR database as available on March 3 2004 containing over 1.4 million sequences. The database is first redundancy reduced at a 98% threshold, leading to a final 1.05 million sequences. The alignments are generated by three runs of PSI-BLAST  with parameters b = 3000 (maximum number of hits), e = 10⁻³ (expectation of a random hit) and h = 10⁻¹⁰ (expectation of a random hit for sequences used to generate the PSSM).
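As an illustration of what a frequency profile is, the sketch below computes column-wise amino acid frequencies from a small alignment. It is a plain 20-letter profile and therefore simpler than the input encoding actually used by the predictors; the function name, the alignment layout (query first, `-` for gaps) and the handling of gaps are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_profile(alignment):
    """Column-wise amino acid frequencies of a multiple alignment.

    alignment: list of equal-length strings, query first; '-' is a gap.
    Returns one dict per column mapping residue -> frequency. Gaps and
    non-standard letters are ignored (an assumption of this sketch).
    """
    profile = []
    for j in range(len(alignment[0])):
        counts = dict.fromkeys(AMINO_ACIDS, 0)
        for seq in alignment:
            if seq[j] in counts:
                counts[seq[j]] += 1
        total = sum(counts.values())
        profile.append(
            {a: (c / total if total else 0.0) for a, c in counts.items()}
        )
    return profile
```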
Data sets, training/test folds and multiple alignments are identical to those used to train and test the ab initio secondary structure predictor Porter .
For each of the proteins in S2171 we search for structural templates in the PDB. We base our search on PDBFINDERII  as available on August 22 2005. An obvious problem arising is that all proteins in the S2171 set are expected to be in PDB (barring name changes), hence every protein will have a perfect template. To avoid this, we exclude from PDBFINDERII every protein that appears in S2171. We also exclude all entries shorter than 10 residues, leading to a final 66,350 chains. Because of the PDBFINDERII origin, only one chain is present in this set for NMR entries.
To generate the actual templates for a protein, we run two rounds of PSI-BLAST against the version of the redundancy-reduced NR database described above, with parameters b = 3000, e = 10⁻³ and h = 10⁻¹⁰. We then run a third round of PSI-BLAST against the PDB using the PSSM generated in the first two rounds. In this third round we deliberately use a high expectation parameter (e = 10) to include hits that are beyond the usual Comparative Modelling scope (e < 0.01, at the CASP6 competition ). We further remove from each set of hits thus found all those with sequence similarity exceeding 95% over the whole query, to exclude PDB resubmissions of the same structure at different resolution, other chains in N-mers and close homologues.
Although the distribution is not uniform, all similarity intervals are adequately represented: for about 40% of the proteins no hit is above 30% similarity; for nearly 20% of the proteins the best hit is in the 30–50% similarity interval. Overall 74,543 residues (21.6% of the set) are not covered by any template. The average similarity for all PDB hits for each protein, not surprisingly, is generally low: for roughly 75% of all proteins in S2171 the average identity is below 30%.
To test template-based predictions in marginal similarity conditions we also extract two further template sets from which all hits are excluded that exceed, respectively, 30% and 20% sequence similarity. In this case the number of residues not covered by any template climbs, respectively, to 148,124 (43% of the total) and 193,921 (56.3%).
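The effect of the 95%, 30% and 20% similarity caps on template coverage can be illustrated with a small sketch. The hit representation (a dict with `similarity`, `start` and `end` keys, using 0-based half-open query ranges) and both function names are hypothetical.

```python
def filter_hits(hits, max_similarity):
    """Keep only hits whose similarity to the query does not exceed the cap."""
    return [h for h in hits if h["similarity"] <= max_similarity]


def uncovered_residues(hits, query_length):
    """Count query positions that no remaining hit is aligned to."""
    covered = [False] * query_length
    for h in hits:
        for j in range(h["start"], h["end"]):  # half-open, 0-based range
            covered[j] = True
    return covered.count(False)
```

Lowering the cap can only remove hits, so the number of uncovered residues never decreases as the cap moves from 95% to 30% to 20%.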
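$$o_j = \mathcal{N}^{(O)}\left(i_j, h_j^{(F)}, h_j^{(B)}\right), \qquad h_j^{(F)} = \mathcal{N}^{(F)}\left(i_j, h_{j-1}^{(F)}\right), \qquad h_j^{(B)} = \mathcal{N}^{(B)}\left(i_j, h_{j+1}^{(B)}\right), \qquad j = 1, \ldots, N$$
where $i_j$ (resp. $o_j$) is the input (resp. output) of the network in position j, and $h_j^{(F)}$ and $h_j^{(B)}$ are forward and backward chains of hidden vectors with $h_0^{(F)} = h_{N+1}^{(B)} = 0$. We parametrise the output update, forward update and backward update functions (respectively $\mathcal{N}^{(O)}$, $\mathcal{N}^{(F)}$ and $\mathcal{N}^{(B)}$) using three two-layered feed-forward neural networks.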
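A minimal sketch of how the output, forward and backward updates interact is given below. It assumes the standard bidirectional recursion, in which the forward hidden vector at position j depends on the input at j and the forward vector at j - 1, the backward vector depends on the input at j and the backward vector at j + 1, and both boundary vectors are zero. For brevity each update is a single tanh layer rather than a two-layered network, and all names are hypothetical.

```python
import math


def dense(weights, vector):
    """tanh(W v) for a weight matrix given as a list of rows."""
    return [math.tanh(sum(w * x for w, x in zip(row, vector))) for row in weights]


def brnn_forward(inputs, w_f, w_b, w_o):
    """Compute one output vector per position of the input sequence.

    inputs: list of N input vectors (lists of floats).
    w_f, w_b: forward/backward update weights, shape hidden x (input + hidden).
    w_o: output update weights, shape out x (input + 2 * hidden).
    Single tanh layers stand in for the two-layered networks of the text.
    """
    n = len(inputs)
    hidden = len(w_f)
    h_f = [[0.0] * hidden for _ in range(n + 2)]  # h_f[0] is the zero boundary
    h_b = [[0.0] * hidden for _ in range(n + 2)]  # h_b[n + 1] is the zero boundary
    for j in range(1, n + 1):       # forward chain, left to right
        h_f[j] = dense(w_f, inputs[j - 1] + h_f[j - 1])
    for j in range(n, 0, -1):       # backward chain, right to left
        h_b[j] = dense(w_b, inputs[j - 1] + h_b[j + 1])
    # The output at j sees the local input plus both context vectors.
    return [dense(w_o, inputs[j - 1] + h_f[j] + h_b[j]) for j in range(1, n + 1)]
```

The two chains are what allow the output at one position to depend on inputs arbitrarily far away on either side, which a fixed input window cannot do.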
Encoding sequence and template information
Hence $i_j$ contains a total of e + t components.
This input coding scheme is richer than simple 20-letter schemes and has proven effective in .
In the case of secondary structure prediction we use t = 10 for representing structural information from the templates. Hence the total number of inputs for a given residue is e + t = 35. The first 8 structural input units contain the average 8-class (DSSP style) secondary structure composition in the PDB templates, while the last 2 encode the average quality of the template column. Assume that $s_{p,j}$ is an 8-component vector encoding the DSSP-assigned 8-class secondary structure of the j-th residue in the p-th template as follows:
H = (1, 0, 0, 0, 0, 0, 0, 0)
G = (0, 1, 0, 0, 0, 0, 0, 0)
I = (0, 0, 1, 0, 0, 0, 0, 0)
E = (0, 0, 0, 1, 0, 0, 0, 0)
B = (0, 0, 0, 0, 1, 0, 0, 0)
S = (0, 0, 0, 0, 0, 1, 0, 0)
T = (0, 0, 0, 0, 0, 0, 1, 0)
· = (0, 0, 0, 0, 0, 0, 0, 1)
Taking the cube of the identity between template and query drastically reduces the contribution of low-similarity templates when good templates are available. For instance a 90% identity template is weighed two orders of magnitude more than a 20% one. In preliminary tests (not shown) this measure performed better than a number of alternatives.
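The effect of the cubed-identity weighting can be checked with a short sketch. It builds the 8-class template composition for one query position as a weighted average of one-hot vectors. The weights here are the cubed identity only, normalised by their sum; both choices, the data layout and the function names are assumptions of this illustration rather than a statement of the exact formula used by the predictors.

```python
DSSP_CLASSES = "HGIEBST."  # same order as the one-hot vectors listed above


def one_hot(symbol):
    """8-component indicator vector for a DSSP state."""
    return [1.0 if symbol == c else 0.0 for c in DSSP_CLASSES]


def template_profile(templates, j):
    """Weighted average 8-class composition at query position j.

    templates: list of (identity, structure) pairs; identity is a fraction
    in [0, 1], structure a DSSP string aligned to the query with '-' where
    the template does not cover the position (assumed layout).
    Returns 8 zeros when no template covers position j.
    """
    total = 0.0
    profile = [0.0] * len(DSSP_CLASSES)
    for identity, structure in templates:
        if structure[j] == "-":
            continue
        weight = identity ** 3  # cubed identity, as described in the text
        total += weight
        for k, x in enumerate(one_hot(structure[j])):
            profile[k] += weight * x
    return [x / total for x in profile] if total else profile
```

For a 90% and a 20% template the weights are 0.729 and 0.008, a ratio of about 91, which is the two orders of magnitude mentioned above.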
It is worth noting how both structural information from templates and the two indices of template quality above are residue-based. For this reason, the case in which only templates covering fragments of a protein exist does not pose a problem for the method – the residues not covered by templates will simply have the section of the input with template information blank, and predictions will be based only on the sequence (and on sequence and template information transmitted by the forward and backwards memory chains). Template information for solvent accessibility is encoded similarly to secondary structure, except that 4 units are adopted to represent average solvent accessibility from PDB-derived templates (4 approximately equal classes). The two units encoding the profile quality are the same as in the secondary structure case. For the comparative experiments without templates, exactly the same architectures are adopted, except that the part of the inputs representing the template profile is set to zero.
where $k_f = j + f(2w + 1)$, 2w + 1 is the size of the window over which first-stage predictions are averaged and 2p + 1 is the number of windows considered. In the tests we use w = 7 and p = 7, as in . This means that 15 contiguous, non-overlapping windows of 15 residues each are considered, i.e. first-stage outputs between position j - 112 and j + 112, for a total of 225 contiguous residues, are taken into account to generate the input to the filtering network in position j. This input contains a total of 16m real numbers: m representing the m-class output of the first stage in position j; 15m representing the m-class outputs of the first-stage averaged over each of the 15 windows. m is 3 in the case of secondary structure prediction and 4 for (4-class) solvent accessibility prediction.
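The window arithmetic above can be made concrete with a short sketch that assembles the 16m inputs of the filtering network for one position. The function name and the treatment of positions falling outside the sequence (they contribute zeros to the window average) are assumptions of this illustration.

```python
def filter_input(first_stage, j, w=7, p=7):
    """Build the second-stage input at position j (0-based).

    first_stage: list of m-component output vectors, one per residue.
    Returns the m values at position j followed by the m-component average
    over each of the 2p + 1 windows of 2w + 1 residues centred at
    k_f = j + f * (2w + 1), f = -p, ..., p.
    Positions outside the sequence contribute zeros (an assumption).
    """
    n = len(first_stage)
    m = len(first_stage[0])
    features = list(first_stage[j])
    for f in range(-p, p + 1):
        k = j + f * (2 * w + 1)  # window centre k_f
        avg = [0.0] * m
        for pos in range(k - w, k + w + 1):
            if 0 <= pos < n:
                for c in range(m):
                    avg[c] += first_stage[pos][c]
        features.extend(a / (2 * w + 1) for a in avg)
    return features
```

With w = 7 and p = 7 the outermost windows are centred at j - 105 and j + 105 and extend 7 residues further, which gives the j - 112 to j + 112 span quoted in the text.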
Five two-stage BRNN models are trained independently and ensemble averaged to build the final predictor. Differences among models are introduced by two factors: stochastic elements in the training protocol, such as different initial weights of the networks and different shuffling of the examples; different architecture and number of free parameters of the models. The training strategy is identical to that adopted for Porter : 1000 epochs of training are performed for each model; the learning rate is halved every time we do not observe a reduction of the error for more than 50 epochs. The size and architecture of the models, apart from differences caused by the different number of inputs, are the same as Porter's. The number of free parameters per model ranges between 5,800 and 8,000. The template-based models are only slightly larger (on average 7% more free parameters) than the corresponding ab initio ones. Averaging the 5 models' outputs leads to classification performance improvements between 1% and 1.5% over single models. Furthermore a copy of each of the 5 models is saved at regular intervals (100 epochs) during training. Stochastic elements in the training protocol (similar to that described in ) guarantee that differences during training are non-trivial. An ensemble of a total of 45 such models yields a further slight improvement over the ensemble of 5 models.
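Ensemble averaging as described above amounts to averaging the per-residue class probabilities of the individual models before picking the most probable class. The sketch below assumes each model returns one probability vector per residue; the function names and data layout are hypothetical.

```python
def ensemble_average(model_outputs):
    """Average per-residue class probabilities over models.

    model_outputs[k][j] is the class-probability vector produced by model k
    at residue j; all models must cover the same residues and classes.
    """
    n_models = len(model_outputs)
    n_residues = len(model_outputs[0])
    n_classes = len(model_outputs[0][0])
    return [
        [sum(model[j][c] for model in model_outputs) / n_models
         for c in range(n_classes)]
        for j in range(n_residues)
    ]


def predict_classes(averaged):
    """Index of the most probable class at each residue."""
    return [max(range(len(p)), key=p.__getitem__) for p in averaged]
```

The same function covers both the ensemble of 5 models and the larger ensemble that includes the copies saved during training, since it only needs a list of model outputs.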
This work is supported by Science Foundation Ireland grants 04/BR/CS0353 and 05/RFP/CMS0029, grant RP/2005/219 from the Health Research Board of Ireland, a UCD President's Award 2004, and an Embark Fellowship from the Irish Research Council for Science, Engineering and Technology to AV.
- Bradley P, Chivian D, Meiler J, Misura K, Rohl C, Schief W, Wedemeyer W, Schueler-Furman O, Murphy P, Schonbrun J, Strauss C, Baker D: Rosetta predictions in CASP5: Successes, failures, and prospects for complete automation. Proteins. 2003, 53 (S6): 457-468. 10.1002/prot.10552.View ArticlePubMedGoogle Scholar
- Jones D: GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol. 1999, 287: 797-815. 10.1006/jmbi.1999.2583.View ArticlePubMedGoogle Scholar
- Karchin R, Cline M, Mandel-Gutfreund Y, Karplus K: Hidden markov models that use predicted local structure for fold recognition: alphabets of backbone geometry. Proteins. 2003, 51 (4): 504-14. 10.1002/prot.10369.View ArticlePubMedGoogle Scholar
- Przybylski D, Rost B: Improving Fold Recognition Without Folds. Journal of Molecular Biology. 2004, 341: 255-269. 10.1016/j.jmb.2004.05.041.View ArticlePubMedGoogle Scholar
- Rost B, Yachdav G, Liu J: The PredictProtein server. Nucleic Acids Research. 2004, 32: W321-326. 10.1093/nar/gkh377.PubMed CentralView ArticlePubMedGoogle Scholar
- Salamov A, Solovyev V: Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments. Journal of Molecular Biology. 1995, 247: 11-5. 10.1006/jmbi.1994.0116.View ArticlePubMedGoogle Scholar
- Rost B: PHD: predicting 1D proteins structure by profile based neural networks. Meth in Enzym. 1996, 266: 525-539.View ArticleGoogle Scholar
- Jones D: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999, 292: 195-202. 10.1006/jmbi.1999.3091.View ArticlePubMedGoogle Scholar
- Baldi P, Brunak S, Frasconi P, Soda G, Pollastri G: Exploiting the past and the future in protein secondary structure prediction. Bioinformatics. 1999, 15: 937-946. 10.1093/bioinformatics/15.11.937.View ArticlePubMedGoogle Scholar
- Mucchielli-Giorgi M, Hazout S, Tuffery P: PredAcc: prediction of solvent accessibility. Bioinformatics. 1999, 15 (2): 176-7. 10.1093/bioinformatics/15.2.176.View ArticlePubMedGoogle Scholar
- Petersen T, Lundegaard C, Nielsen M, Bohr H, Bohr J, Brunak S, Gippert G, Lund O: Prediction of protein secondary structure at 80% accuracy. Proteins. 2000, 41 (1): 17-20. 10.1002/1097-0134(20001001)41:1<17::AID-PROT40>3.0.CO;2-F.View ArticlePubMedGoogle Scholar
- Cuff J, Barton G: Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins. 2000, 40 (3): 502-11. 10.1002/1097-0134(20000815)40:3<502::AID-PROT170>3.0.CO;2-Q.View ArticlePubMedGoogle Scholar
- Pollastri G, Przybylski D, Rost B, Baldi P: Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins. 2002, 47: 228-235. 10.1002/prot.10082.View ArticlePubMedGoogle Scholar
- Ahmad S, Gromiha M: NETASA: neural network based prediction of solvent accessibility. Bioinformatics. 2002, 18 (6): 819-24. 10.1093/bioinformatics/18.6.819.View ArticlePubMedGoogle Scholar
- Pollastri G, Fariselli P, Casadio R, Baldi P: Prediction of coordination number and relative solvent accessibility in proteins. Proteins. 2002, 47: 142-235. 10.1002/prot.10069.View ArticlePubMedGoogle Scholar
- Adamczak R, Porollo A, Meller J: Accurate prediction of solvent accessibility using neural networks-based regression. Proteins. 2004, 56 (4): 753-67. 10.1002/prot.20176.View ArticlePubMedGoogle Scholar
- Wagner M, Adamczak R, Porollo A, Meller J: Linear regression models for solvent accessibility prediction in proteins. Journal of Computational Biology. 2005, 12 (3): 355-69. 10.1089/cmb.2005.12.355.View ArticlePubMedGoogle Scholar
- Pollastri G, McLysaght A: Porter: a new, accurate server for protein secondary structure prediction. Bioinformatics. 2005, 21 (8): 1719-20. 10.1093/bioinformatics/bti203.View ArticlePubMedGoogle Scholar
- Qin S, Pan X: Predicting Protein Secondary Structure and Solvent Accessibility with and Improved Multiple Linear Regression Method. Proteins. 2005, 61: 473-80. 10.1002/prot.20645.View ArticlePubMedGoogle Scholar
- Nguyen M, Rajapakse J: Prediction of Protein Relative Solvent Accessibility With a Two-Stage SVM Approach. Proteins. 2005, 59: 30-7. 10.1002/prot.20404.View ArticlePubMedGoogle Scholar
- Montgomerie S, Sundaraj S, Gallin W, Wishart D: Improving the Accuracy of Protein Secondary Structure Prediction Using Structural Alignment. BMC Bioinformatics. 2006, 7: 301-10.1186/1471-2105-7-301.PubMed CentralView ArticlePubMedGoogle Scholar
- Altschul S, Madden T, Schaffer A: Gapped blast and psi-blast: a new generation of protein database search programs. Nucl Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.PubMed CentralView ArticlePubMedGoogle Scholar
- Berman H, Westbrook J, Feng Z, Gilliland G, Bhat T, Weissig H, Shindyalov I, Bourne P: The Protein Data Bank. Nucl Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235. [http://pdbbeta.rcsb.org/pdb/Welcome.do]PubMed CentralView ArticlePubMedGoogle Scholar
- Orengo C, Bray J, Hubbard T, Lo Conte L, Sillitoe I: Analysis and assessment of ab initio three-dimensional prediction, secondary structure, and contacts prediction. Proteins: Structure, Function and Genetics. 1999, 37 (S3): 149-170. 10.1002/(SICI)1097-0134(1999)37:3+<149::AID-PROT20>3.0.CO;2-H.View ArticleGoogle Scholar
- Lesk A, Lo Conte L, Hubbard T: Assessment of novel fold targets in CASP4: predictions of three-dimensional structures, secondary structures, function and genetics. Proteins: Structure, Function and Genetics. 2001, S5: 98-118. 10.1002/prot.10056.View ArticleGoogle Scholar
- Moult J, Fidelis K, Zemla A, Hubbard T: Critical assessment of methods of protein structure prediction (CASP)-round V. Proteins. 2003, 53 (Suppl 6): 334-339. 10.1002/prot.10556.View ArticlePubMedGoogle Scholar
- Moult J, Fidelis K, Tramontano A, Rost B, Hubbard T: Critical Assessment of Methods of Protein Structure Prediction (CASP)-Round VI. Proteins. 2005, 61 (Suppl 6): 3-7. 10.1002/prot.20716.View ArticlePubMedGoogle Scholar
- Rost B, Sander C: Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol. 1993, 232: 584-599. 10.1006/jmbi.1993.1413.View ArticlePubMedGoogle Scholar
- Cuff JA, Barton GJ: Application of multiple sequence alignments profiles to improve protein secondary structure prediction. Proteins: Structure, Function and Genetics. 2000, 40 (3): 502-511. 10.1002/1097-0134(20000815)40:3<502::AID-PROT170>3.0.CO;2-Q.View ArticleGoogle Scholar
- Eyrich V, Marti-Renom M, Przybylski D, Madhusudan M, Fiser A, Pazos F, Valencia A, Sali A, Rost B: EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics. 2001, 17: 1242-1251. 10.1093/bioinformatics/17.12.1242.
- Cheng J, Baldi P: A machine learning information retrieval approach to protein fold recognition. Bioinformatics. 2006, 22 (12): 1456-63. 10.1093/bioinformatics/btl102.
- Hobohm U, Sander C: Enlarged representative set of protein structures. Protein Sci. 1994, 3: 522-24. [http://bioinfo.tg.fh-giessen.de/pdbselect/]
- Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983, 22: 2577-2637. 10.1002/bip.360221211.
- Frishman D, Argos P: Knowledge-based protein secondary structure assignment. Proteins. 1995, 23 (4): 566-579. 10.1002/prot.340230412.
- Fourrier L, Benros C, de Brevern A: Use of a structural alphabet for analysis of short loops connecting repetitive structures. BMC Bioinformatics. 2004, 5: 58. 10.1186/1471-2105-5-58.
- Ceroni A, Frasconi P, Pollastri G: Learning Protein Secondary Structure from Sequential and Relational Data. Neural Networks. 2005, 18 (8): 1029-39. 10.1016/j.neunet.2005.07.001.
- Sim J, Kim S, Lee J: Prediction of protein solvent accessibility using fuzzy k-nearest neighbor method. Bioinformatics. 2005, 21 (12): 2844-9. 10.1093/bioinformatics/bti423.
- Naderi-Manesh H, Sadeghi M, Araf S, Movahedi A: Prediction of protein surface accessibility with information theory. Proteins. 2001, 42 (4): 452-9. 10.1002/1097-0134(20010301)42:4<452::AID-PROT40>3.0.CO;2-Q.
- Riis SK, Krogh A: Improving prediction of protein secondary structure using structured neural networks and multiple sequence alignments. J Comp Biol. 1996, 3 (1): 163-183.
- Baldi P, Pollastri G, Andersen CAF, Brunak S: Matching protein β-sheet partners by feedforward and recurrent neural networks. Proceedings of the 2000 Conference on Intelligent Systems for Molecular Biology (ISMB00), La Jolla, CA. 2000, Menlo Park, CA: AAAI Press, 8: 25-36.
- Cheng J, Baldi P: Three-stage prediction of protein β-sheets by neural networks, alignments and graph algorithms. Bioinformatics. 2005, 21: i75-i84. 10.1093/bioinformatics/bti1004.
- Pollastri G, Baldi P: Prediction of Contact Maps by Recurrent Neural Network Architectures and Hidden Context Propagation from All Four Cardinal Corners. Bioinformatics. 2002, 18 (Suppl 1): S62-S70.
- Baldi P, Pollastri G: The Principled Design of Large-Scale Recursive Neural Network Architectures – DAG-RNNs and the Protein Structure Prediction Problem. Journal of Machine Learning Research. 2003, 4 (Sep): 575-602.
- Vullo A, Walsh I, Pollastri G: A two-stage approach for improved prediction of residue contact maps. BMC Bioinformatics. 2006, 7: 180. 10.1186/1471-2105-7-180.
- Krieger E, Hooft R, Nabuurs S, Vriend G: PDBFinderII – a database for protein structure analysis and prediction. 2004, [http://swift.cmbi.ru.nl/gv/pdbfinder/]
- Gianese G, Bossa F, Pascarella S: Improvement in prediction of solvent accessibility by probability profiles. Protein Engineering. 2003, 16 (12): 987-92. 10.1093/protein/gzg139.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.