PreDisorder: ab initio sequence-based prediction of protein disordered regions
© Deng et al; licensee BioMed Central Ltd. 2009
Received: 3 August 2009
Accepted: 21 December 2009
Published: 21 December 2009
Disordered regions are segments of the protein chain which do not adopt stable structures. Such segments are often of interest because they have a close relationship with protein expression and functionality. As such, protein disorder prediction is important for protein structure prediction, structure determination and function annotation.
This paper presents our protein disorder prediction server, PreDisorder. It is based on our ab initio prediction method (MULTICOM-CMFR) which, along with our meta (or consensus) prediction method (MULTICOM), was recently ranked among the top disorder predictors in the eighth edition of the Critical Assessment of Techniques for Protein Structure Prediction (CASP8). We systematically benchmarked PreDisorder along with 26 other protein disorder predictors on the CASP8 data set and assessed its accuracy using a number of measures. The results show that it compared favourably with other ab initio methods and its performance is comparable to that of the best meta and clustering methods.
PreDisorder is a fast and reliable server that can be used to predict protein disordered regions on a genomic scale. It is available at http://casp.rnet.missouri.edu/predisorder.html.
While most regions of a protein adopt localized, stable structures, some segments of the protein chain do not. These are regions whose coordinates are hard to determine by experimental techniques or that simply do not fold into stable structures [1, 2]. Such regions are known as disordered regions. Proteins with disordered regions are capable of binding to multiple partners and participating in various reactions and pathways [3–5]. Disordered regions can also cause poor expression of a protein, making it difficult to produce for crystallization or other purposes. Consequently, the prediction of disordered regions in proteins has implications for protein production, structure prediction and determination, function annotation and cellular process recognition.
Measuring native disorder experimentally is time consuming and expensive, and thus computational approaches for the prediction of protein disordered regions have received considerable attention in recent years. As a result, disorder prediction software, web services and their underlying methods are quickly becoming valuable tools for protein structure prediction, structure determination and function annotation [8–18]. To stimulate further development of disorder prediction, CASP has dedicated a category to blindly benchmarking the current state of the art. Here we benchmark our ab initio and consensus (or meta) disorder predictors along with dozens of other predictors that participated in the CASP8 experiment. The good performance of our PreDisorder server makes it a valuable and accurate tool for protein structure prediction, structure determination and protein engineering.
Ab initio neural network method
Our server, PreDisorder, is based on our ab initio method that participated in CASP8 under the group name MULTICOM_CMFR. This is a machine learning approach using one-dimensional Recursive Neural Networks (1D-RNNs). A target protein sequence is first aligned against several template profiles using PSI-BLAST, which creates an input profile of the sequence. This profile, along with the predicted secondary structure and solvent accessibility, is fed into a 1D-RNN that makes the disorder predictions. More specifically, for each protein sequence the input is a one-dimensional array I whose length is the total number of residues in the sequence. Each element I_i of the array is a vector of 25 values that represents residue i. Of these 25 values, 20 are the frequencies of each amino acid at the corresponding position in the PSI-BLAST profile. The other five are binary values that encode the predicted secondary structure (Helix, Strand or Coil) and solvent accessibility of the residue [20–22]. Based on the input I, the 1D-RNN produces an array of real numbers O, where the i-th element O_i is the probability that the i-th residue is disordered.
A large curated dataset was randomly divided into ten subsets of approximately equal size, and the 1D-RNN was trained and tested on these subsets by ten-fold cross-validation. Finally, the predicted disorder probabilities of the residues were re-scaled so that the ratio of residues with disorder probability greater than or equal to 0.5 is close to the ratio of disordered residues in the training dataset. Specifically, the scaling method first identified a probability threshold t (e.g. 0.1) for selecting predicted disordered residues such that the ratio (the number of predicted disordered residues / the total number of residues in the test dataset) is equal to the ratio of disordered residues in the training dataset (e.g. 5%). Each predicted disorder probability x was then re-scaled as x/t * 0.5 (if x <= t) or 0.5 + 0.5 * (x - t)/(1 - t) (if x > t).
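The re-scaling step can be illustrated with a short Python sketch. The piecewise formula is the one given above; the helper names (find_threshold, rescale) and the rank-based way of choosing t are assumptions made for illustration and are not taken from the PreDisorder code.

```python
from typing import List


def find_threshold(probs: List[float], target_ratio: float) -> float:
    """Pick t so that roughly target_ratio of residues have probability >= t.

    Hypothetical helper: the paper states the goal but not the procedure.
    """
    if not probs:
        raise ValueError("probs must not be empty")
    ranked = sorted(probs, reverse=True)
    # Number of residues that should be called disordered (at least one).
    k = max(1, round(target_ratio * len(ranked)))
    return ranked[k - 1]


def rescale(x: float, t: float) -> float:
    """Map a raw probability x so that the threshold t lands exactly on 0.5."""
    if not 0.0 < t < 1.0:
        raise ValueError("t must lie strictly between 0 and 1")
    if x <= t:
        return x / t * 0.5
    return 0.5 + 0.5 * (x - t) / (1.0 - t)


if __name__ == "__main__":
    raw = [0.9, 0.05, 0.2, 0.01, 0.3, 0.02, 0.04, 0.03, 0.06, 0.07]
    t = find_threshold(raw, target_ratio=0.2)
    print(t, [round(rescale(x, t), 3) for x in raw])
```

The mapping is monotonic and leaves 0 and 1 fixed, so the ranking of residues is unchanged; only the position of the 0.5 decision cut-off moves.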
A meta method is a consensus approach that makes predictions based on the output of other predictors. Similar ideas have been applied to many prediction problems, such as protein fold recognition, and have achieved much better performance than individual predictors. One example of this approach is 3D-Jury, an automated protein structure meta prediction system available through Meta Server, which generates meta-predictions from models produced by a variety of methods. Our new meta predictor MULTICOM makes predictions based on a consensus formed from other CASP8 disorder predictors. It removes a few very inaccurate disorder predictors and then averages the output of the remaining disorder predictors. Our simple averaging approach is different from other meta methods based on consensus voting.
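A minimal Python sketch of the averaging idea follows. It assumes each predictor supplies one disorder probability per residue and that the set of predictors to drop is given by the caller; the text does not say how inaccurate predictors are identified, so the excluded argument and the function name are hypothetical.

```python
from typing import Dict, List, Set


def consensus_disorder(
    predictions: Dict[str, List[float]],
    excluded: Set[str],
) -> List[float]:
    """Average per-residue disorder probabilities over the retained predictors."""
    kept = [probs for name, probs in predictions.items() if name not in excluded]
    if not kept:
        raise ValueError("no predictors left after exclusion")
    length = len(kept[0])
    if any(len(probs) != length for probs in kept):
        raise ValueError("all predictions must cover the same residues")
    # zip(*kept) yields, for each residue, the probabilities from every predictor.
    return [sum(column) / len(kept) for column in zip(*kept)]


if __name__ == "__main__":
    preds = {
        "predictor_a": [0.2, 0.8],
        "predictor_b": [0.4, 0.6],
        "predictor_bad": [1.0, 0.0],
    }
    print(consensus_disorder(preds, excluded={"predictor_bad"}))
```

Unlike consensus voting, averaging keeps a continuous score per residue, so the combined output can still be ranked and evaluated with threshold-free measures.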
Results and discussion
We evaluated 27 disorder predictors that participated in CASP8. Among these predictors were our ab initio predictor (MULTICOM-CMFR) and our meta predictor (MULTICOM). They were evaluated on the 117 protein targets whose structures were available when our evaluation was conducted. These targets contain 25431 residues, and all the disorder predictions for them were downloaded from the CASP8 web site. When evaluating the disorder predictions against the protein targets, target residues that did not have corresponding coordinates in the target's PDB file were considered to be disordered. The disorder annotations for the targets were curated by Dr. McGuffin. Each residue in the target sequence is tagged with a binary label of "O" (order) or "D" (disorder). We evaluated the methods on all 117 targets and on two subsets (97 X-ray structures and 20 NMR structures). It is worth pointing out that our evaluation serves as a complementary, comparative benchmark of our methods. Readers should refer to the CASP8 assessment paper for the official assessment of disorder predictions.
In evaluating the disorder predictors, we considered a number of different, commonly used measures of performance for binary classifiers. One such measure was the ROC score. This value is the area under the Receiver Operating Characteristic (ROC) curve (AUC); it summarizes the performance of a classifier across all settings of its discrimination threshold. Ranking the predictors using ROC curves is a widely used method in bioinformatics and CASP competitions [7, 30, 31].
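The labelling rule can be sketched in Python as follows. The sketch assumes the set of residue positions that have coordinates has already been extracted from the PDB file; the function name and the 1-based position convention are illustrative assumptions, not part of the curated annotation pipeline.

```python
from typing import Set


def label_residues(sequence: str, observed_positions: Set[int]) -> str:
    """Return one label per residue: 'O' if coordinates exist, 'D' otherwise.

    observed_positions holds 1-based sequence positions with coordinates.
    """
    return "".join(
        "O" if position in observed_positions else "D"
        for position in range(1, len(sequence) + 1)
    )


if __name__ == "__main__":
    # Residues 1 and 5 lack coordinates, so they are labelled disordered.
    print(label_residues("MKTAY", {2, 3, 4}))
```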
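As an illustration of the ROC score, the sketch below computes the AUC directly from its probabilistic interpretation: the chance that a randomly chosen disordered residue receives a higher score than a randomly chosen ordered residue, with ties counted as one half. This pairwise formulation is a standard equivalent of the area under the ROC curve; it is not necessarily how the scores in this paper were computed, and the quadratic loop is only suitable for small examples.

```python
from typing import List


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """AUC for binary labels (1 = disordered, 0 = ordered) and real-valued scores."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```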
We also calculated four threshold-dependent measures: Positive Sensitivity (TP/(TP+FN)), Positive Specificity (TP/(TP+FP)), Negative Sensitivity (TN/(TN+FP)) and Negative Specificity (TN/(TN+FN)). Here, TP is the number of true positives (residues correctly identified as disordered) and FP is the number of false positives (residues predicted as disordered, but experimentally ordered). TN is the number of true negatives (residues correctly identified as ordered) and FN is the number of false negatives (residues predicted as ordered, but experimentally disordered).
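The following Python sketch computes these counts and ratios from "O"/"D" label strings. It assumes the conventional definitions: positive sensitivity TP/(TP+FN), positive specificity TP/(TP+FP), negative sensitivity TN/(TN+FP) and negative specificity TN/(TN+FN). The function names are illustrative, and returning 0.0 for an empty denominator is a choice made here, not something specified in the text.

```python
from typing import Dict, Tuple


def confusion_counts(truth: str, predicted: str) -> Tuple[int, int, int, int]:
    """Count (TP, FP, TN, FN), treating 'D' (disorder) as the positive class."""
    if len(truth) != len(predicted):
        raise ValueError("label strings must have equal length")
    tp = fp = tn = fn = 0
    for t, p in zip(truth, predicted):
        if t == "D" and p == "D":
            tp += 1
        elif t == "O" and p == "D":
            fp += 1
        elif t == "O" and p == "O":
            tn += 1
        elif t == "D" and p == "O":
            fn += 1
        else:
            raise ValueError("labels must be 'O' or 'D'")
    return tp, fp, tn, fn


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division: an empty denominator yields 0.0 (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def binary_measures(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """The four per-class measures used in the evaluation."""
    return {
        "positive_sensitivity": _ratio(tp, tp + fn),
        "positive_specificity": _ratio(tp, tp + fp),
        "negative_sensitivity": _ratio(tn, tn + fp),
        "negative_specificity": _ratio(tn, tn + fn),
    }


if __name__ == "__main__":
    counts = confusion_counts("DDOOOO", "DOOOOD")
    print(counts, binary_measures(*counts))
```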
While in principle it is possible for a system to achieve high values for both positive and negative sensitivity, in practice this does not happen often. Usually, a sharp increase in one results in a decrease in the other. An extreme example would be a predictor which identifies all residues as disordered. Such a system would have a positive sensitivity of 100% and a negative sensitivity of 0%. In an attempt to join several of these measures into one, we considered the product of positive sensitivity and negative sensitivity, and the harmonic mean, or F-measure, of the positive sensitivity and positive specificity.
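The two combined measures can be sketched in Python and applied to the extreme case just described. The definitions of sensitivity and specificity in terms of TP, FP, TN and FN are the conventional ones and are assumed here; the function names and the example counts (2 disordered and 8 ordered residues) are made up for illustration.

```python
def sensitivity_product(tp: int, fp: int, tn: int, fn: int) -> float:
    """Product of positive sensitivity TP/(TP+FN) and negative sensitivity TN/(TN+FP)."""
    positive_sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    negative_sensitivity = tn / (tn + fp) if (tn + fp) else 0.0
    return positive_sensitivity * negative_sensitivity


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of positive sensitivity and positive specificity.

    Algebraically equal to 2*TP / (2*TP + FP + FN).
    """
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


if __name__ == "__main__":
    # A predictor that calls every residue disordered on a chain with
    # 2 disordered and 8 ordered residues: TP=2, FP=8, TN=0, FN=0.
    print(sensitivity_product(2, 8, 0, 0))
    print(f_measure(2, 8, 0))
```

In this case the product is zero despite perfect positive sensitivity, which is why the product is a useful guard against degenerate predictors.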
We also calculated a weighted score for each predictor. This measure was introduced in CASP6 and is defined as Score = (Wdisorder × TP − Worder × FP + Worder × TN − Wdisorder × FN) / (Wdisorder × Ndisorder + Worder × Norder), where Ndisorder and Norder are the numbers of experimentally disordered and ordered residues, Wdisorder is set to 92.63 and Worder to 7.37. As defined, this measure greatly rewards disordered residues that are correctly classified as disordered while heavily penalizing any disordered residue that is misclassified.
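A short Python sketch of the weighted score is given below. The weights 92.63 and 7.37 come from the text; the functional form, (Wdisorder × TP − Worder × FP + Worder × TN − Wdisorder × FN) / (Wdisorder × Ndisorder + Worder × Norder), is our understanding of the CASP6 definition and should be checked against the CASP6 assessment before being relied on. The function name is illustrative.

```python
W_DISORDER = 92.63  # weight for disordered residues, as stated in the text
W_ORDER = 7.37      # weight for ordered residues, as stated in the text


def weighted_score(
    tp: int,
    fp: int,
    tn: int,
    fn: int,
    w_disorder: float = W_DISORDER,
    w_order: float = W_ORDER,
) -> float:
    """Weighted score in the assumed CASP6 form; ranges from -1 to 1."""
    n_disorder = tp + fn  # experimentally disordered residues
    n_order = tn + fp     # experimentally ordered residues
    denominator = w_disorder * n_disorder + w_order * n_order
    if denominator == 0:
        raise ValueError("no residues to score")
    numerator = (
        w_disorder * tp - w_order * fp + w_order * tn - w_disorder * fn
    )
    return numerator / denominator


if __name__ == "__main__":
    # 5 disordered and 95 ordered residues.
    print(weighted_score(tp=5, fp=0, tn=95, fn=0))  # every residue correct
    print(weighted_score(tp=0, fp=0, tn=95, fn=5))  # everything called ordered
```

Because the weights roughly invert the class frequencies, a predictor that calls every residue ordered scores far below a perfect one even though it labels most residues correctly.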
Table 1. Results for protein disorder predictors that participated in CASP8 on 117 targets.
Table 2. Results for protein disorder predictors that participated in CASP8 on the 20 NMR targets (T0437, T0460, T0462, T0464, T0466, T0467, T0468, T0469, T0471, T0472, T0473, T0474, T0475, T0476, T0480, T0482, T0484, T0492, T0498, T0499).
Table 3. Results for protein disorder predictors that participated in CASP8 on the 97 X-ray targets.
The CASP8 disorder prediction methods can be classified into four main categories: (1) Meta method. Predictors such as MULTICOM, GS-MetaServer, Metaprdos, GeneSilico, GSMetaDisorder and Distill use this method to make disorder predictions. (2) Clustering method. This method is used, for instance, by the predictor DISOclust. DISOclust first obtains multiple 3D models from the nFOLD3 server and then makes disorder predictions by combining the results of the DISOclust method and the DISOPRED3 method. (3) Ab initio method. A large number of predictors in CASP8 adopt this method; examples include 3Dpro, Mariner, Spritz, biomine, CBRC_poodle, disopred, OnD-CRF and our predictor MULTICOM_CMFR. (4) Hybrid method. Fais-server is a hybrid method that combines ab initio predictions with homology-based template information. Both ab initio and hybrid methods usually exist as standalone packages, while meta methods rely on other predictors.
In examining the results, no one method appears to perform decisively better than the rest according to all the measures. Predictors from each of the three types of methods (ab initio, meta and clustering) are represented in the top seven when comparing the predictors only on the basis of ROC score, weighted score, specificity or sensitivity. The meta method MULTICOM, the clustering method DISOclust, the hybrid method Fais-server and the ab initio methods MULTICOM-CMFR and 3Dpro are among the top 5 in terms of ROC scores. Other ab initio predictors such as mariner1 and Distill-Punch also performed well. Interestingly, our ab initio predictor MULTICOM-CMFR also ranks first in weighted score and in the product of positive and negative sensitivity. Being an ab initio method, it also has the benefit of being able to make predictions solely from an input sequence. The other types of methods need additional information such as output from other predictors (meta methods), tertiary structure models (clustering methods), or homologous structure templates (hybrid methods). Consequently, our PreDisorder server based on MULTICOM-CMFR is generally an accurate predictor that can be applied to the genome-scale annotation of protein disordered regions. Notably, given the limits on the predictability of intrinsically disordered residues from crystallographic experiments, both of our methods performed well on the X-ray targets shown in Table 3. Several methods (e.g., MULTICOM, DISOclust, fais-server, MULTICOM-CMFR, 3Dpro, mariner and Distill-Punch) yield similarly good AUC scores (>= 0.846), suggesting that the accuracy of disorder predictions might be close to the limit.
All of the predictors do quite well with respect to negative specificity and negative sensitivity. This is not too surprising, as most of the residues in a protein are ordered and hence the number of true negatives (TN) is very close to the true negatives plus false positives (TN+FP) and to the true negatives plus false negatives (TN+FN).
This paper presents our disorder prediction web server, PreDisorder, and evaluates its performance against several other disorder predictors. We benchmarked MULTICOM-CMFR, the method employed by PreDisorder, and our meta method MULTICOM, along with several other protein disorder predictors, on the 117 targets used in CASP8. The results show that our method is among the best and provides reliable protein disordered region predictions. Therefore, our server (PreDisorder) is a useful tool for structural and functional genomics.
Availability and Requirements
Project name: PreDisorder
Project Home Page: http://casp.rnet.missouri.edu/predisorder.html
Operating system(s): Platform independent (web server)
Programming languages: Perl, C, C++
Other requirements: None
License: Web application is freely accessible for all users.
Any restrictions to use by non-academics: None
This work was supported in part by a UM research board grant and a MU research council grant to JC.
- Tompa P: Intrinsically unstructured proteins. Trends in Biochemical Sciences 2002, 27: 527–533. 10.1016/S0968-0004(02)02169-2
- Receveur-Bréchot V, Bourhis JM, Uversky VN, Canard B, Longhi S: Assessing protein disorder and induced folding. Proteins: Structure, Function, and Bioinformatics 2006, 62: 24–45. 10.1002/prot.20750
- Dyson J, Wright P: Intrinsically unstructured proteins and their functions. Nature Reviews Molecular Cell Biology 2005, 6: 197–208. 10.1038/nrm1589
- Dunker AK, Obradovic Z: The protein trinity - linking function and disorder. Nature Biotechnology 2001, 19: 805–806. 10.1038/nbt0901-805
- Dunker AK, Brown CJ, Lawson JD, Iakoucheva LM, Obradovic Z: Intrinsic disorder and protein function. Biochemistry 2002, 21: 6573–82. 10.1021/bi012159+
- Cheng J, Sweredoski M, Baldi P: Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data. Data Mining and Knowledge Discovery 2005, 11: 213–222. 10.1007/s10618-005-0001-y
- Bordoli L, Kiefer F, Schwede T: Assessment of disorder predictions in CASP7. Proteins: Structure, Function, and Bioinformatics 2007, 69(Suppl 8): 129–136. 10.1002/prot.21671
- Ferron F, Longhi S, Canard B, Karlin D: A Practical Overview of Protein Disorder Prediction Methods. Proteins: Structure, Function, and Bioinformatics 2006, 65: 1–14. 10.1002/prot.21075
- Su CT, Chen CY, Ou YY: Protein disorder prediction by condensed PSSM considering propensity for order or disorder. BMC Bioinformatics 2006, 7: 319. 10.1186/1471-2105-7-319
- Yang ZR, Thomson R, McNeil P, Esnouf RM: RONN: the biobasis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics 2005, 21: 3369–3376. 10.1093/bioinformatics/bti534
- Coeytaux K, Poupon A: Prediction of unfolded segments in a protein sequence based on amino acid composition. Bioinformatics 2005, 21: 1891–1900. 10.1093/bioinformatics/bti266
- Melamud E, Moult J: Evaluation of disorder predictions in CASP5. Proteins 2003, 53: 561–565. 10.1002/prot.10533
- Oldfield CJ, Cheng Y, Cortese MS, Brown CJ, Uversky VN, Dunker AK: Comparing and combining predictors of mostly disordered proteins. Biochemistry, 44, 1989–2000. Proteins 2005, 61: 167–175.
- Ward JJ, McGuffin LJ, Bryson K, Buxton BF, Jones DT: The DISOPRED server for the prediction of protein disorder. Bioinformatics 2004, 20: 2138–2139. 10.1093/bioinformatics/bth195
- Peng K, Radivojac P, Vucetic S, Dunker AK, Obradovic Z: Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics 2006, 7: 208. 10.1186/1471-2105-7-208
- Vullo A, Bortolami O, Pollastri G, Tosatto S: Spritz: A server for the prediction of intrinsically disordered regions in protein sequences using kernel machines. Nucleic Acids Research 2006, 34: W164–W168. 10.1093/nar/gkl166
- Obradovic Z, Peng K, Vucetic S, Radivojac P, Dunker A: Exploiting Heterogeneous Sequence Properties Improves Prediction of Protein Disorder. Proteins 2005, 61(Suppl 1): 176–182. 10.1002/prot.20735
- Yang M, Yang J: IUP: Intrinsically Unstructured Protein predictor - A Software Tool for Analyzing Poly-Peptide Sequences. Proceedings of the Sixth Symposium on Bioinformatics and Bioengineering (IEEE BIBE 2006), IEEE Computer Society: 1–11.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 1997, 25: 3389–3402. 10.1093/nar/25.17.3389
- Cheng J, Randall A, Sweredoski M, Baldi P: SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Research 2005, 33: W72–W76. 10.1093/nar/gki396
- Pollastri G, Przybylski D, Rost B, Baldi P: Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins 2002, 47: 228–235. 10.1002/prot.10082
- Pollastri G, Baldi P, Fariselli P, Casadio R: Prediction of Coordination Number and Relative Solvent Accessibility in Proteins. Proteins 2002, 47: 142–153. 10.1002/prot.10069
- Hecker J, Yang J, Cheng J: Protein Disorder Prediction at Multiple Levels of Sensitivity and Specificity. BMC Genomics 2008, 9(Suppl 1): S9. 10.1186/1471-2164-9-S1-S9
- Meta server [http://meta.bioinfo.pl/submit_wizard.pl]
- Laszlo K, Leszek R: Evaluation of 3D-Jury on CASP7 models. BMC Bioinformatics 2007, 8: 304. 10.1186/1471-2105-8-304
- Ginalski K, Elofsson A, Fischer D, Rychlewski L: 3D-Jury: a simple approach to improve protein structure predictions. Bioinformatics 2003, 22: 1015–1018. 10.1093/bioinformatics/btg124
- CASP8 web site [http://predictioncenter.org/download_area/CASP8/predictions/]
- The disorder annotations for the targets curated by Dr. McGuffin [http://www.reading.ac.uk/bioinf/CASP8/index.html]
- Noivirt-Brik O, Prilusky J, Sussman JL: Assessment of disorder predictions in CASP8. Proteins: Structure, Function, and Bioinformatics 2009.
- Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT: Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 2004, 337: 635–645. 10.1016/j.jmb.2004.02.002
- Jin Y, Dunbrack RL Jr: Assessment of disorder predictions in CASP6. Proteins 2005, 61(Suppl 7): 167–175. 10.1002/prot.20734
- McGuffin LJ: Intrinsic disorder prediction from the analysis of multiple protein fold recognition models. Bioinformatics 2008, 24: 1798–1804. 10.1093/bioinformatics/btn326
- Mohan A, Uversky VN, Radivojac P: Influence of sequence changes and environment on intrinsically disordered proteins. PLoS Comput Biol 2009, 5(Suppl 9).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.