- Methodology article
- Open Access
Large-scale prediction of long disordered regions in proteins using random forests
BMC Bioinformatics volume 10, Article number: 8 (2009)
Many proteins contain disordered regions that lack fixed three-dimensional (3D) structure under physiological conditions but have important biological functions. Prediction of disordered regions in protein sequences is important for understanding protein function and in high-throughput determination of protein structures. Machine learning techniques, including neural networks and support vector machines, have been widely used in such predictions. Predictors designed for long disordered regions are usually less successful in predicting short disordered regions. Combining prediction of short and long disordered regions dramatically increases the complexity of the prediction algorithm and makes the predictor unsuitable for large-scale applications. Efficient batch prediction of long disordered regions alone is of greater interest in large-scale proteome studies.
A new algorithm, IUPforest-L, for predicting long disordered regions using the random forest learning model is proposed in this paper. IUPforest-L is based on the Moreau-Broto auto-correlation function of amino acid indices (AAIs) and other physicochemical features of the primary sequences. In 10-fold cross validation tests, IUPforest-L can achieve an area of 89.5% under the receiver operating characteristic (ROC) curve. Compared with existing disorder predictors, IUPforest-L has high prediction accuracy and is efficient for predicting long disordered regions in large-scale proteomes.
The random forest model based on the auto-correlation functions of the AAIs within a protein fragment and other physicochemical features could effectively detect long disordered regions in proteins. A new predictor, IUPforest-L, was developed to batch predict long disordered regions in proteins, and the server can be accessed from http://dmg.cs.rmit.edu.au/IUPforest/IUPforest-L.php
Intrinsically unstructured/disordered proteins (IUPs/IDPs) contain long disordered regions or are completely disordered. IUPs are abundant in higher organisms and often involved in key biological processes, such as transcriptional and translational regulation, membrane fusion and transport, cell-signal transduction, protein phosphorylation, the storage of small molecules and the regulation of self-assembly of large multi-protein complexes [2–11]. The disordered state in IUPs creates larger intermolecular interfaces, which increase the speed of interaction with potential binding partners even in the absence of tight binding, and provide flexibility for binding diverse ligands [2, 5, 11, 13–15]. However, long disordered regions in IUPs cause difficulties in protein structure determination by both X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy. Efficient prediction of disordered region(s) in IUPs by computational methods can provide valuable information in high-throughput protein structure characterization, and reveal useful information on protein function.
Many predictors have been developed to predict disordered regions in proteins, such as PONDR, RONN [17, 18], VL2, VL3, VL3H and VL3E from DisProt [1, 19, 20], NORSp [21, 22], DISpro, FoldIndex, DISOPRED and DISOPRED2 [25–27], GlobPlot and DisEMBL, IUPred, Prelink, DRIP-PRED (MacCallum, online publication http://www.forcasp.org/paper2127.html), FoldUnfold, Spritz, DisPSSMP, VSL1 and VSL2 [35, 36], POODLE-L, POODLE-S, Ucon, PrDOS and metaPrDOS [40, 41]. Among these predictors, neural networks and support vector machines (SVM) are widely used machine learning models.
The accuracy of disorder predictors is generally limited by the existence of various kinds of disorder, which are represented unevenly in the various databases, and by the lack of a unique definition of disorder. Predictors designed for long disordered regions are usually less successful in predicting short disordered regions [36, 42] because long and short disordered regions have different sequence features. As a result, some predictors, such as POODLE-L, are designed specifically for predicting long disordered regions, while predictors targeting all types of disordered regions usually have to sacrifice time efficiency in order to exploit heterogeneous sequence properties, especially the evolutionary information extracted from PSI-BLAST or protein secondary structure [25, 27, 33–36, 38].
In this paper, a new algorithm, IUPforest-L, is proposed for predicting long disordered regions based on the random forest learning model and simple parameters extracted from the amino acid sequences and amino acid indices (AAIs). 10-fold cross validation tests and blind tests demonstrate that IUPforest-L can achieve significantly higher accuracy than many existing algorithms in predicting long disordered regions. The high efficiency of IUPforest-L makes it a suitable tool for high-throughput comparative proteomics studies.
Training and test datasets
To train IUPforest-L, a subset (positive training set) of disordered regions was constructed based on DisProt (version 3.6), which includes 352 regions of 30 aa or more in length, and 47,251 aa in total. The negative training set was extracted from PDBSelect25 (Oct. 2004 version), from which 366 sequences (80,324 aa in total) of at least 80 aa were selected. Each of them has a high resolution crystal structure (< 2.0 Å), free from missing backbone or side chain coordinates and free from non-standard amino acid residues.
To assess the prediction performance of IUPforest-L, three datasets were used for blind tests. The first dataset was based on the dataset constructed by Hirose et al. (Hirose-ADS1) as a blind test dataset for POODLE-L. Hirose-ADS1 contains 53 ordered regions of at least 40 aa (11431 aa in total) from the Protein Data Bank and 63 disordered regions of at least 30 aa (8700 aa in total) from DisProt (version 3.0). The second test set (Han-ADS1) comprised the 53 ordered regions from Hirose-ADS1 and 33 long disordered regions (5959 aa in total) from the latest DisProt (version 4.8), after removing disordered regions homologous to those in DisProt (version 3.6) using the CD-HIT algorithm with a threshold of 0.9 sequence identity. The third test set (Peng-DB) was constructed based on the blind test dataset of VSL2, where 56 long disordered regions of at least 30 aa (2841 aa in total) and 1965 ordered regions (318431 aa in total) were used in the assessment. For an objective blind test of IUPforest-L on Hirose-ADS1 (as reported in Table 1), disordered and ordered regions homologous to those in Hirose-ADS1 were removed from our training set based on the CD-HIT algorithm with a threshold of 0.9 sequence identity, resulting in 293 disordered regions and 364 ordered regions for training the predictor. Similarly, for an objective blind test on Han-ADS1 (as reported in Table 2), ordered regions homologous to the 53 ordered regions in Hirose-ADS1 were also removed from the original training set for training the predictor. The final IUPforest-L was still trained on the whole training set. Han-ADS1 is listed in Additional file 1 and is also available online at http://dmg.cs.rmit.edu.au/IUPforest/Han-ADS1.fasta.
The random forest model
A random forest is an ensemble of unpruned decision trees (shown in Figure 1), where each tree is grown using a (bootstrap) subset of the training dataset. Bootstrapping is a resampling technique where a number of bootstrap training sets are drawn randomly from the original training set with replacement. Each tree induced from bootstrap samples grows to full length and the number of trees in the forest is adjustable. To classify an instance of unknown class label, each tree casts a unit classification vote. The forest selects the classification having the most votes over all the trees in the forest. Compared with the decision tree classifier, random forests have better classification accuracy, are more tolerant to noise and are less dependent on the training datasets.
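The bootstrap and voting steps described above can be sketched in a few lines of Python. This is only an illustration of the general random forest idea, not the IUPforest-L implementation: the function names are hypothetical, and the toy "trees" are hand-written threshold rules standing in for trained decision trees.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data at random, with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(votes):
    """Return the label that received the most unit votes."""
    return Counter(votes).most_common(1)[0][0]


def forest_predict(trees, x):
    """Each tree is any callable mapping a feature vector to a class label."""
    return majority_vote([tree(x) for tree in trees])


# Hypothetical stand-ins for trained trees (assumption, for illustration only).
toy_trees = [
    lambda x: "P" if x[0] > 0.5 else "N",
    lambda x: "P" if x[1] > 0.2 else "N",
    lambda x: "P" if x[0] + x[1] > 1.5 else "N",
]

if __name__ == "__main__":
    rng = random.Random(0)
    print(bootstrap_sample(["a", "b", "c", "d"], rng))
    print(forest_predict(toy_trees, [0.7, 0.3]))
```

In a real forest each tree would be induced from its own bootstrap sample; here the two steps are shown separately to keep the sketch short.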
Features used in training and test
When a window of w aa slides along a sequence, six types of features were derived from residues within the window, as defined and explained below.
1) Auto-correlation function of amino acid indices (AAIs)
Each residue in the training set was replaced with a value of the normalized amino acid index (AAI), which is a set of 20 numerical values representing a physicochemical or biological property of the 20 amino acids, chosen from the AAI Database (http://www.genome.ad.jp/dbget/aaindex.html). As such, a sequence of N amino acids in the training set was first transformed into a numerical sequence [49, 50], denoted as:
$$P_1 P_2 \cdots P_i \cdots P_{i+w} \cdots P_N \qquad (1)$$
Then the sequences were smoothed with the Savitzky-Golay filter. The Moreau-Broto auto-correlation function $F_d$ of an AAI was then calculated within a window, which is defined as:
$$F_d = \frac{1}{w-d} \sum_{i=1}^{w-d} P_i P_{i+d} \qquad (2)$$
where w is the window size, and $P_i$ and $P_{i+d}$ are the AAI values at positions i and i+d within the window, respectively [49, 50]. For example, when d = 1, the numerical value for each residue (i) in the window is multiplied by the value of the next residue (i+1), and $F_1$ is the average of these w-1 products. Similarly, $F_2$ is the average of the w-2 products generated from every other residue. The value of d represents the order of the correlation and was tuned to optimize the prediction performance. The $F_d$ (d = 1, 2,..., 30) for the 40 sets of AAIs listed in Table A1 in Additional file 2 were calculated and evaluated in training IUPforest-L.
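The averaging described above (for example, F1 as the mean of the w-1 products of neighbouring values) can be written as a short Python sketch. The function names are hypothetical, and the input is assumed to be a window of already normalized and smoothed AAI values.

```python
def moreau_broto(values, d):
    """Average of the products values[i] * values[i + d] within one window."""
    w = len(values)
    if not 1 <= d < w:
        raise ValueError("d must satisfy 1 <= d < window size")
    products = [values[i] * values[i + d] for i in range(w - d)]
    return sum(products) / (w - d)


def autocorrelation_profile(values, max_d):
    """Return [F_1, F_2, ..., F_max_d] for one window of AAI values."""
    return [moreau_broto(values, d) for d in range(1, max_d + 1)]


if __name__ == "__main__":
    window = [1.0, 2.0, 3.0, 4.0]      # toy values, not a real AAI
    print(moreau_broto(window, 1))     # (1*2 + 2*3 + 3*4) / 3
    print(moreau_broto(window, 2))     # (1*3 + 2*4) / 2
```

With one profile per AAI, a 31 aa window and d up to 15 or 30 would yield several hundred values per window, which is consistent with the "hundreds of parameters" attributed to this feature type.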
2) The mean hydrophobicity, defined as the average value of Kyte and Doolittle's hydrophobicity in the window.
3) The modified hydrophobic cluster, calculated as the length of the longest hydrophobic cluster in the window divided by the window size.
4) The mean net charge within the window and the local mean net charge within a 13 aa fragment centered at the middle residue. Residues K and R were defined as +1; D and E were defined as -1; other residues were 0.
5) The mean contact number, defined as the mean expected number of contacts in the globular state of all residues within the window.
6) The composition of four reduced amino acid groups and the Shannon entropy (K2) of the amino acid composition within the window.
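Some of the simpler window features above can be illustrated with a minimal Python sketch. It covers only the mean hydrophobicity (using the published Kyte-Doolittle hydropathy values), the mean and local mean net charge, and the Shannon entropy of the composition. The function names are hypothetical, and the use of a base-2 logarithm for the entropy is an assumption, since the text does not state the base; the hydrophobic cluster, contact number and reduced amino acid groups are omitted because their exact definitions are not given here.

```python
import math
from collections import Counter

# Kyte-Doolittle hydropathy scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# K and R count as +1, D and E as -1, all other residues as 0.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def mean_hydrophobicity(window):
    return sum(KD[aa] for aa in window) / len(window)


def mean_net_charge(window):
    return sum(CHARGE.get(aa, 0) for aa in window) / len(window)


def local_mean_net_charge(window, span=13):
    """Mean net charge of the `span` residues centred on the middle residue."""
    mid = len(window) // 2
    half = span // 2
    return mean_net_charge(window[max(0, mid - half): mid + half + 1])


def shannon_entropy(window):
    """Entropy of the amino acid composition (log base 2 is an assumption)."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


if __name__ == "__main__":
    w = "A" * 15 + "K" + "A" * 15      # toy 31 aa window
    print(mean_hydrophobicity(w), mean_net_charge(w),
          local_mean_net_charge(w), shannon_entropy(w))
```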
A flow chart of IUPforest-L is shown in Figure 2. At the training stage, the features listed above were calculated as a window of w aa slid from the N-terminal end to the C-terminal end of a protein sequence. Each window was labelled as disordered (Positive or P) or ordered (Negative or N) according to the label of its central residue, and the IUPforest-L models were trained on the six types of features. Each tree in the forest produced its own prediction, and the final score combined the outcomes from all trees by voting and smoothing. A threshold that best separates the ordered and disordered states of a residue could then be defined based on the scores and the optimal evaluation values in the 10-fold cross validation tests.
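The window labelling at the training stage can be sketched as follows. This is a hedged illustration only: the function name and the per-residue label string are hypothetical, and the defaults of 31 aa for the window and 20 aa for the step are taken from the default settings reported in the Results.

```python
def labelled_windows(sequence, labels, w=31, step=20):
    """Yield (fragment, label) pairs, one per window position.

    `labels` holds one character per residue ("P" for disorder, "N" for
    order); each window takes the label of its central residue.
    """
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    for start in range(0, len(sequence) - w + 1, step):
        centre = start + w // 2
        yield sequence[start:start + w], labels[centre]


if __name__ == "__main__":
    seq = "A" * 71                       # toy sequence
    lab = "N" * 30 + "P" * 41            # toy per-residue annotation
    for fragment, label in labelled_windows(seq, lab):
        print(len(fragment), label)
```

Sequences shorter than the window produce no training windows in this sketch; how such cases were handled in IUPforest-L is not stated in the text.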
During the prediction stage, the features were first calculated as a window slid over a query sequence, and a probability score of each residue being disordered was then assigned by IUPforest-L. A region was annotated as disordered only when 30 or more consecutive amino acid residues were predicted to be disordered.
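The rule that only runs of 30 or more consecutive disordered residues are reported can be expressed as a small post-processing step. The sketch below is illustrative: the function name is hypothetical and the score threshold of 0.5 is an assumption, since the operating threshold of IUPforest-L is chosen from the cross validation results.

```python
def long_disordered_regions(scores, threshold=0.5, min_length=30):
    """Return (start, end) pairs, 0-based and end-exclusive, for every run of
    at least `min_length` consecutive residues scoring >= threshold."""
    regions = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i                      # a new run begins here
        else:
            if start is not None and i - start >= min_length:
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(scores) - start >= min_length:
        regions.append((start, len(scores)))
    return regions


if __name__ == "__main__":
    toy_scores = [0.9] * 35 + [0.1] * 10 + [0.9] * 20
    print(long_disordered_regions(toy_scores))
```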
To estimate the generalization accuracy, 10-fold cross validation tests were conducted, where 90% of the sequences in the training set were randomly selected for training and the other 10% were used for testing. The process was repeated so that the entire dataset was covered, and the final result was the average of the results from the 10 runs. In addition, independent tests were performed on Hirose-ADS1, Han-ADS1 and Peng-DB.
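A sequence-level 10-fold split of the kind described above might look like the following Python sketch. The function name, the fixed random seed and the sequence identifiers are hypothetical; the point is only that whole sequences, not individual windows, are assigned to folds, and that each sequence is used for testing exactly once.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) lists; each item appears in exactly one test fold."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


if __name__ == "__main__":
    ids = ["seq%03d" % n for n in range(25)]   # hypothetical identifiers
    for train, test in k_fold_splits(ids):
        print(len(train), len(test))
```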
During the cross validation test, the confusion matrix, which comprises true positive (TP), false positive (FP), true negative (TN) and false negative (FN), was used to evaluate the prediction performance in terms of the following measures:
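1) AUC, the area under the receiver operating characteristic (ROC) curve. Each point of a ROC curve is defined by a pair of values for the true positive rate (TP/(TP+FN), or sensitivity) and the false positive rate (FP/(FP+TN), or 1-specificity).
2) Balanced overall accuracy
$$\mathrm{Accuracy} = \frac{\mathrm{sensitivity} + \mathrm{specificity}}{2} \qquad (3)$$
3) Product of sensitivity and specificity
$$S_{product} \equiv \mathrm{sensitivity} \times \mathrm{specificity} \qquad (4)$$
4) Matthews correlation coefficient (MCC)
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \qquad (5)$$
5) Weighted score $S_w$
$$S_w = \frac{w_{disorder} TP - w_{order} FP + w_{order} TN - w_{disorder} FN}{w_{disorder} N_{disorder} + w_{order} N_{order}} \qquad (6)$$
where $w_{disorder}$ and $w_{order}$ are the weights for disorder and order, respectively, which are inversely proportional to the numbers of residues in the disordered and ordered states ($N_{disorder}$ and $N_{order}$). $S_w$ is also referred to as probability excess.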
The Sproduct and Sw scores were used in assessing the prediction of disordered residues in the Critical Assessment of techniques for protein Structure Prediction (CASP6 and CASP7) .
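The confusion-matrix measures can be computed together, as in the Python sketch below. The function name is hypothetical. The weighted score is written in the form used in the CASP disorder assessments as understood here, with weights taken as the reciprocals of the class sizes (an assumption consistent with "inversely proportional"); under that choice Sw reduces to sensitivity + specificity - 1. Both classes are assumed to be non-empty.

```python
import math


def disorder_metrics(tp, fp, tn, fn):
    """Return the per-residue measures as a dict (positives = disordered)."""
    n_dis, n_ord = tp + fn, tn + fp            # residues in each true class
    sens = tp / n_dis
    spec = tn / n_ord
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    w_dis, w_ord = 1.0 / n_dis, 1.0 / n_ord    # assumed weighting
    sw = ((w_dis * tp - w_ord * fp + w_ord * tn - w_dis * fn)
          / (w_dis * n_dis + w_ord * n_ord))
    return {
        "sensitivity": sens,
        "specificity": spec,
        "accuracy": (sens + spec) / 2,         # balanced overall accuracy
        "sproduct": sens * spec,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
        "sw": sw,
    }


if __name__ == "__main__":
    for name, value in disorder_metrics(tp=80, fp=10, tn=90, fn=20).items():
        print(name, round(value, 4))
```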
10-fold cross validation
The 10-fold cross validation test results using a window of 31 aa are shown in Figure 3. With the type 1 features (the auto-correlation function of AAIs), a forest of more trees has better predictive ability. For example, the AUC increased by 2% when the number of trees increased from 10 to 50. However, the prediction accuracy increased only modestly when the number of trees increased further from 50 to 100, while the training and prediction times increased significantly. Detailed test results on the time consumption with number of trees from 10 to 300 are shown in Additional file 3. The default setting of IUPforest-L is a forest of 50 trees for large-scale application.
With a forest of a fixed number of trees, the ROC curve trained with the auto-correlation function with d value between 1 and 15 almost overlaps with the ROC curve trained with d between 1 and 30. This result indicates that continuous correlations between nearby residues from 1 to 15 along the sequence could determine whether the fragment is involved in a long disordered region.
Figure 3 shows that training with either type 1 features or the combination of type 2–6 features reached true positive rates of 70.5% and 70.0%, respectively, at a 10% false positive rate, while combining type 1–6 features led to a higher true positive rate of 76% and an area of 89.5% under the ROC curve. This result indicates that type 1 and type 2–6 features carry partly redundant but complementary structural information. Type 2–6 features generated only nine parameters in total within a given window, while type 1 features could generate hundreds of parameters that take into account both order information and physicochemical properties. It has been shown that the random forest model does not overfit as the number of trees increases, even when the number of input parameters grows. As such, using type 1 features to train the random forest could extract more sequence-structure information, and it was thus conjectured that better prediction accuracy could be achieved with the auto-correlation functions generated from AAIs combined with other features of type 2–6.
The window size and the step size for sliding the window are additional parameters for tuning the performance of IUPforest-L models. The window should be of a reasonable size so that the AAI-based correlation is meaningful and the training or test time remains acceptable. Training with small windows increases training time and can introduce noise, whereas training with large windows can lose local information. Our test results indicated that for window sizes from 19 aa to 47 aa, the random forest gave more stable results on the blind test set Han-ADS1, but the accuracy in the 10-fold cross validation test on the training set dropped with larger window sizes (details listed in Additional file 4). To batch predict long disordered regions, the window size was set to 31 aa by default to balance efficiency and accuracy. The step size for sliding windows can also affect the accuracy and overall time efficiency at both the training and test stages. If the step size is too small, neighbouring windows along a sequence become redundant and the time for training models is prolonged. Our experiments (details listed in Additional file 4) show that with a sliding step of 20 aa (default setting) models achieve stable sensitivity without significantly prolonging the training process.
Figure 4 depicts the ROC curves for IUPforest-L and nine other publicly available predictors on the blind test dataset Hirose-ADS1, including the most recently developed POODLE-L and the well-established predictor VSL2. IUPforest-L outperforms most of the other predictors in terms of the AUC in predicting long disordered regions. At low false positive rates (< 10%), IUPforest-L achieves the highest sensitivity among all the predictors. In terms of the other performance measures listed in Table 1, IUPforest-L is also comparable to or better than the other predictors. Figure 5 and Table 2 show the comparison of IUPforest-L with POODLE-L and other predictors on Han-ADS1, where IUPforest-L performs better than most of them. Figure 6 and Table 3 show the corresponding comparison on Peng-DB, where IUPforest-L again performs better than most of the other predictors.
Protein structures are stabilized by numerous intramolecular interactions such as hydrophobic, electrostatic, van der Waals, and hydrogen bonds. The autocorrelation function tests whether the physicochemical property of one residue is independent of that of neighbouring residues. A group of residues involved in ordered structure close to other groups of residues in space will be dynamically constrained by the backbone or side chain interactions from these residues, and hence the residues in both groups will show higher density in the contact map or have higher pairwise correlation. On the other hand, a repetitive sequence of amino acids can also give significant positive correlation for all physicochemical properties. Therefore, residues within a fragment exhibiting a higher autocorrelation may either be structurally constrained, or have low sequence complexity. The random forest learning model employed by the IUPforest-L disorder predictor combines the complementary contributions from the autocorrelation function (type 1 feature) and other types of features, so that structural information is extracted with a high degree of prediction accuracy.
The random forest model is an ensemble learning model and is known to be more robust to noise than many non-ensemble learning models. However, as a classifier based on the random forest needs to load many decision trees into memory, it is relatively slow for a forest to predict a single instance at a time. As a result, the current web server of IUPforest-L is better suited to batch prediction of a large number of protein sequences, which provides an alternative useful tool in large-scale analysis of long disordered regions in proteomics. As an initial application, we have provided a server, IUPforest-L, for batch protein sequences analysis with the output of overall summary and details for each sequence. For convenience in proteomic comparisons, the prediction results for 62 eukaryotes linked to The European Bioinformatics Institute are also pre-calculated and can be downloaded from the server.
IUP studies are important because disordered regions are common and functionally important in proteins. The new features, the auto-correlation functions of AAIs within a protein fragment, reflect both residues' contact information and sequence complexity. The random forest model based on this new type of features and other physicochemical features could effectively detect long disordered regions in proteins. As a result, a new predictor, IUPforest-L, was developed to predict long disordered regions in proteins. Its high accuracy and high efficiency make it a useful tool in large-scale protein sequence analysis.
Vucetic S, Brown CJ, Dunker AK, Obradovic Z: Flavors of protein disorder. Proteins. 2003, 52 (4): 573-584. 10.1002/prot.10437.
Dyson H, Wright PE: Intrinsically Unstructured Proteins and their Functions. Nat Rev Mol Cell Biol. 2005, 6: 197-208. 10.1038/nrm1589.
Tompa P, Szasz C, Buday L: Structural disorder throws new light on moonlighting. Trends Biochem Sci. 2005, 30 (9): 484-489. 10.1016/j.tibs.2005.07.008.
Tompa P: Intrinsically unstructured proteins. Trends Biochem Sci. 2002, 27 (10): 527-533. 10.1016/S0968-0004(02)02169-2.
Uversky VN, Oldfield CJ, Dunker AK: Showing your ID: intrinsic disorder as an ID for recognition, regulation and cell signaling. J Mol Recognit. 2005, 18 (5): 343-384. 10.1002/jmr.747.
Wright PE, Dyson HJ: Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J Mol Biol. 1999, 293 (2): 321-331. 10.1006/jmbi.1999.3110.
Dunker AK, Lawson JD, Brown CJ, Williams RM, Romero P, Oh JS, Oldfield CJ, Campen AM, Ratliff CM, Hipps KW: Intrinsically disordered protein. J Mol Graph Model. 2001, 19 (1): 26-59. 10.1016/S1093-3263(00)00138-8.
Russell RB, Gibson TJ: A careful disorderliness in the proteome: Sites for interaction and targets for future therapies. FEBS Lett. 2008, 582 (8): 1271-1275. 10.1016/j.febslet.2008.02.027.
Radivojac P, Iakoucheva LM, Oldfield CJ, Obradovic Z, Uversky VN, Dunker AK: Intrinsic disorder and functional proteomics. Biophys J. 2007, 92 (5): 1439-1456. 10.1529/biophysj.106.094045.
Oldfield CJ, Cheng Y, Cortese MS, Romero P, Uversky VN, Dunker AK: Coupled folding and binding with alpha-helix-forming molecular recognition elements. Biochemistry. 2005, 44 (37): 12454-12470. 10.1021/bi050736e.
Tompa P: The interplay between structure and function in intrinsically unstructured proteins. FEBS Lett. 2005, 579 (15): 3346-3354. 10.1016/j.febslet.2005.03.072.
Gunasekaran K, Tsai CJ, Kumar S, Zanuy D, Nussinov R: Extended disordered proteins: targeting function with less scaffold. Trends Biochem Sci. 2003, 28 (2): 81-85. 10.1016/S0968-0004(03)00003-3.
Namba K: Roles of partly unfolded conformations in macromolecular self-assembly. Genes Cells. 2001, 6 (1): 1-12. 10.1046/j.1365-2443.2001.00384.x.
Dunker AK, Cortese MS, Romero P, Iakoucheva LM, Uversky VN: Flexible nets. The roles of intrinsic disorder in protein interaction networks. Febs J. 2005, 272 (20): 5129-5148. 10.1111/j.1742-4658.2005.04948.x.
Oldfield CJ, Ulrich EL, Cheng Y, Dunker AK, Markley JL: Addressing the intrinsic disorder bottleneck in structural proteomics. Proteins. 2005, 59 (3): 444-453. 10.1002/prot.20446.
Li X, Romero P, Rani M, Dunker AK, Obradovic Z: Predicting Protein Disorder for N-, C-, and Internal Regions. Genome Inform Ser Workshop Genome Inform. 1999, 10: 30-40.
Yang ZR, Thomson R, McNeil P, Esnouf RM: RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics. 2005, 21 (16): 3369-3376. 10.1093/bioinformatics/bti534.
Thomson R, Esnouf R: Prediction of natively disordered regions in proteins using a bio-basis function neural network. Lecture Notes in Computer Science. 2004, 3177: 108-116.
Smith DK, Radivojac P, Obradovic Z, Dunker AK, Zhu G: Improved amino acid flexibility parameters. Protein Sci. 2003, 12 (5): 1060-1072. 10.1110/ps.0236203.
Vucetic S, Obradovic Z, Vacic V, Radivojac P, Peng K, Iakoucheva LM, Cortese MS, Lawson JD, Brown CJ, Sikes JG: DisProt: a database of protein disorder. Bioinformatics. 2005, 21 (1): 137-140. 10.1093/bioinformatics/bth476.
Liu J, Tan H, Rost B: Loopy proteins appear conserved in evolution. J Mol Biol. 2002, 322 (1): 53-64. 10.1016/S0022-2836(02)00736-2.
Liu J, Rost B: NORSp: Predictions of long regions without regular secondary structure. Nucleic Acids Res. 2003, 31 (13): 3833-3835. 10.1093/nar/gkg515.
Cheng J, Sweredoski MJ, Baldi P: Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data. Data Mining and Knowledge Discovery. 2005, 11: 213-222. 10.1007/s10618-005-0001-y.
Prilusky J, Felder CE, Zeev-Ben-Mordehai T, Rydberg E, Man O, Beckmann JS, Silman I, Sussman JL: FoldIndex(C): a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics. 2005, 21 (16): 3435-3438. 10.1093/bioinformatics/bti537.
Jones DT, Ward JJ: Prediction of disordered regions in proteins from position specific score matrices. Proteins. 2003, 53 (Suppl 6): 573-578. 10.1002/prot.10528.
Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT: Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol. 2004, 337: 635-645. 10.1016/j.jmb.2004.02.002.
Ward JJ, McGuffin LJ, Bryson K, Buxton BF, Jones DT: The DISOPRED server for the prediction of protein disorder. Bioinformatics. 2004, 20 (13): 2138-2139. 10.1093/bioinformatics/bth195.
Linding R, Russell RB, Neduva V, Gibson TJ: GlobPlot: Exploring protein sequences for globularity and disorder. Nucleic Acids Res. 2003, 31 (13): 3701-3708. 10.1093/nar/gkg519.
Linding R, Jensen LJ, Diella F, Bork P, Gibson TJ, Russell RB: Protein disorder prediction: implications for structural proteomics. Structure (Camb). 2003, 11 (11): 1453-1459. 10.1016/j.str.2003.10.002.
Dosztanyi Z, Csizmok V, Tompa P, Simon I: The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. J Mol Biol. 2005, 347 (4): 827-839. 10.1016/j.jmb.2005.01.071.
Coeytaux K, Poupon A: Prediction of unfolded segments in a protein sequence based on amino acid composition. Bioinformatics. 2005, 21 (9): 1891-1900. 10.1093/bioinformatics/bti266.
Galzitskaya OV, Garbuzynskiy SO, Lobanov MY: FoldUnfold: web server for the prediction of disordered regions in protein chain. Bioinformatics. 2006, 22 (23): 2948-2949. 10.1093/bioinformatics/btl504.
Vullo A, Bortolami O, Pollastri G, Tosatto SC: Spritz: a server for the prediction of intrinsically disordered regions in protein sequences using kernel machines. Nucleic Acids Res. 2006, 34 (Web Server): W164-168. 10.1093/nar/gkl166.
Su CT, Chen CY, Ou YY: Protein disorder prediction by condensed PSSM considering propensity for order or disorder. BMC Bioinformatics. 2006, 7: 319. 10.1186/1471-2105-7-319.
Peng K, Radivojac P, Vucetic S, Dunker AK, Obradovic Z: Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics. 2006, 7: 208. 10.1186/1471-2105-7-208.
Obradovic Z, Peng K, Vucetic S, Radivojac P, Dunker AK: Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins. 2005, 61 (Suppl 7): 176-182. 10.1002/prot.20735.
Hirose S, Shimizu K, Kanai S, Kuroda Y, Noguchi T: POODLE-L: a two-level SVM prediction system for reliably predicting long disordered regions. Bioinformatics. 2007, 23 (16): 2046-2053. 10.1093/bioinformatics/btm302.
Shimizu K, Hirose S, Noguchi T: POODLE-S: web application for predicting protein disorder by using physicochemical features and reduced amino acid set of a position-specific scoring matrix. Bioinformatics. 2007, 23 (17): 2337-2338. 10.1093/bioinformatics/btm330.
Schlessinger A, Punta M, Rost B: Natively unstructured regions in proteins identified from contact predictions. Bioinformatics. 2007, 23 (18): 2376-2384. 10.1093/bioinformatics/btm349.
Ishida T, Kinoshita K: Prediction of disordered regions in proteins based on the meta approach. Bioinformatics. 2008, 24 (11): 1344-1348. 10.1093/bioinformatics/btn195.
Ishida T, Kinoshita K: PrDOS: prediction of disordered protein regions from amino acid sequence. Nucleic Acids Res. 2007, 35 (Web Server): W460-464. 10.1093/nar/gkm363.
Peng K, Vucetic S, Radivojac P, Brown CJ, Dunker AK, Obradovic Z: Optimizing Intrinsic Disorder Predictors with Protein Evolutionary Information. J Bioinform Comput Biol. 2005, 3: 35-60. 10.1142/S0219720005000886.
Breiman L: Random Forests. Machine Learning. 2001, 45 (1): 5-32. 10.1023/A:1010933404324.
Kawashima S, Kanehisa M: AAindex: amino acid index database. Nucleic Acids Res. 2000, 28 (1): 374. 10.1093/nar/28.1.374.
Hobohm U, Sander C: Enlarged representative set of protein structures. Protein Sci. 1994, 3 (3): 522-524.
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28 (1): 235-242. 10.1093/nar/28.1.235.
Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006, 22 (13): 1658-1659. 10.1093/bioinformatics/btl158.
Dosztanyi Z, Chen J, Dunker AK, Simon I, Tompa P: Disorder and sequence repeats in hub proteins and their implications for network evolution. J Proteome Res. 2006, 5 (11): 2985-2995. 10.1021/pr060171o.
Feng ZP, Zhang CT: Prediction of membrane protein types based on the hydrophobic index of amino acids. J Protein Chem. 2000, 19 (4): 269-275. 10.1023/A:1007091128394.
Bu WS, Feng ZP, Zhang Z, Zhang CT: Prediction of protein (domain) structural classes based on amino-acid index. Eur J Biochem. 1999, 266 (3): 1043-1049. 10.1046/j.1432-1327.1999.00947.x.
Savitzky A, Golay MJE: Smoothing and Differentiation of Data by Simplified Least Squares Procedures. Analytical Chemistry. 1964, 36: 1627-1639. 10.1021/ac60214a047.
Kyte J, Doolittle RF: A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982, 157 (1): 105-132. 10.1016/0022-2836(82)90515-0.
Garbuzynskiy SO, Lobanov MY, Galzitskaya OV: To be folded or to be unfolded?. Protein Sci. 2004, 13 (11): 2871-2877. 10.1110/ps.04881304.
Jin Y, Dunbrack RL: Assessment of disorder predictions in CASP6. Proteins. 2005, 61 (Suppl 7): 167-175. 10.1002/prot.20734.
Han P, Zhang X, Norton RS, Feng ZP: Predicting disordered regions in proteins using the profiles of amino acid Indices. Supplement Issue of BMC Bioinformatics for APBC. 2009.
The authors thank Lefei Zhan for his help on conducting some experiments, Marc Cortese for his explanation of DisProt database. ZPF is supported by an APD award from the Australian Research Council. XZ is supported in part by an RMIT Emerging Researcher Grant.
PH wrote up the computer program, carried out calculations and developed the web interface; RSN participated in design of the project and drafting the manuscript; XZ and ZPF participated in design of the project, development of the algorithm, interpretation of the results and drafting the manuscript.
Han, P., Zhang, X., Norton, R.S. et al. Large-scale prediction of long disordered regions in proteins using random forests. BMC Bioinformatics 10, 8 (2009). https://doi.org/10.1186/1471-2105-10-8