Large-scale prediction of long disordered regions in proteins using random forests
© Han et al; licensee BioMed Central Ltd. 2009
Received: 30 August 2008
Accepted: 07 January 2009
Published: 07 January 2009
Many proteins contain disordered regions that lack fixed three-dimensional (3D) structure under physiological conditions but have important biological functions. Prediction of disordered regions in protein sequences is important for understanding protein function and in high-throughput determination of protein structures. Machine learning techniques, including neural networks and support vector machines have been widely used in such predictions. Predictors designed for long disordered regions are usually less successful in predicting short disordered regions. Combining prediction of short and long disordered regions will dramatically increase the complexity of the prediction algorithm and make the predictor unsuitable for large-scale applications. Efficient batch prediction of long disordered regions alone is of greater interest in large-scale proteome studies.
A new algorithm, IUPforest-L, for predicting long disordered regions using the random forest learning model is proposed in this paper. IUPforest-L is based on the Moreau-Broto auto-correlation function of amino acid indices (AAIs) and other physicochemical features of the primary sequences. In 10-fold cross validation tests, IUPforest-L can achieve an area of 89.5% under the receiver operating characteristic (ROC) curve. Compared with existing disorder predictors, IUPforest-L has high prediction accuracy and is efficient for predicting long disordered regions in large-scale proteomes.
The random forest model based on the auto-correlation functions of the AAIs within a protein fragment and other physicochemical features could effectively detect long disordered regions in proteins. A new predictor, IUPforest-L, was developed to batch predict long disordered regions in proteins, and the server can be accessed from http://dmg.cs.rmit.edu.au/IUPforest/IUPforest-L.php
Intrinsically unstructured/disordered proteins (IUPs/IDPs) contain long disordered regions or are completely disordered . IUPs are abundant in higher organisms and often involved in key biological processes, such as transcriptional and translational regulation, membrane fusion and transport, cell-signal transduction, protein phosphorylation, the storage of small molecules and the regulation of self-assembly of large multi-protein complexes [2–11]. The disordered state in IUPs creates larger intermolecular interfaces , which increase the speed of interaction with potential binding partners even in the absence of tight binding, and provide flexibility for binding diverse ligands [2, 5, 11, 13–15]. However, long disordered regions in IUPs cause difficulties in protein structure determination by both X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy. Efficient prediction of disordered region(s) in IUPs by computational methods can provide valuable information in high-throughput protein structure characterization, and reveal useful information on protein function .
Many predictors have been developed to predict disordered regions in proteins, such as PONDR , RONN [17, 18], VL2, VL3, VL3H and VL3E from DisProt [1, 19, 20], NORSp [21, 22], DISpro , FoldIndex , DISOPRED and DISOPRED2 [25–27], GlobPlot  and DisEMBL , IUPred , Prelink , DRIP-PRED (MacCallum, online publication http://www.forcasp.org/paper2127.html), FoldUnfold , Spritz , DisPSSMP , VSL1 and VSL2 [35, 36], POODLE-L , POODLE-S , Ucon , PrDOS and metaPrDOS [40, 41]. Among these predictors, neural networks and support vector machines (SVM) are widely used machine learning models.
The accuracy of disorder predictors is generally limited by the existence of various kinds of disorder which are represented unevenly in the various databases, and the lack of a unique definition of disorder . Predictors designed for long disordered regions are usually less successful in predicting short disordered regions [36, 42] because the long and short disordered regions have different sequence features. As a result, some predictors are specified for predicting long disordered regions, such as POODLE-L , while predictors targeting all types of disordered regions usually have to sacrifice time efficiency for exploiting heterogeneous sequence properties, especially the evolution information extracted from PSI-BLAST or protein secondary structure [25, 27, 33–36, 38].
In this paper, a new algorithm, IUPforest-L, is proposed for predicting long disordered regions based on the random forest learning model  and simple parameters extracted from the amino acid sequences and amino acid indices (AAIs) . 10-fold cross validation tests and blind tests demonstrate that IUPforest-L can achieve significantly higher accuracy than many existing algorithms in predicting long disordered regions. The high efficiency of IUPforest-L makes it a suitable tool for high-throughput comparative proteomics studies.
Training and test datasets
To train IUPforest-L, a subset (positive training set) of disordered regions was constructed based on DisProt  (version3.6), which includes 352 regions of 30 aa or more in length, and 47251 aa in total. The negative training set was extracted from PDBSelect25  (Oct. 2004 version), from which 366 sequences (80,324 aa in total) of at least 80 aa were selected. Each of them has a high resolution crystal structure (< 2.0Å), free from missing backbone or side chain coordinates and free from non-standard amino acid residues.
To assess the prediction performance of IUPforest-L, three datasets were used for blind tests. The first dataset was based on the dataset constructed by Hirose et al (Hirose-ADS1) as a blind test dataset of POODLE-L . Hirose-ADS1 contains 53 ordered regions of at least 40 aa (11431 aa in total) from the Protein Data Bank  and 63 disordered regions of at least 30 aa (8700 aa in total) from DisProt (version 3.0). The second test set (Han-ADS1) comprised of 53 ordered regions as in Hirose-ADS1 and 33 long disordered regions (5959 aa in total) from the latest DisProt (version 4.8), after removing disordered regions homologous to those in DisProt (version 3.6) using the CD-HIT algorithm with a threshold of 0.9 sequence identity . The third test set (Peng-DB) was constructed based on the blind test dataset of VLS2 , where 56 long disordered regions of at least 30 aa (2841 aa in total) and 1965 ordered regions (318431 aa in total) were used in the assessment. For an objective blind test of IUPforest-L on Hirose-ADS1 (as reported in Table 1), disordered and ordered regions homologous to those in Hirose-ADS1 were removed from our training set based on the CD-HIT algorithm with a threshold of 0.9 sequence identity , resulting in 293 disordered regions and 364 ordered regions for training the predictor. Similarly for an objective blind test on Han-ADS1 (as reported in Table 2), ordered regions homologous to the 53 ordered regions in Hirose-ADS1 were also removed from the original training set for training the predictor. The final IUPforest-L was still trained by the whole training set. Han-ADS1 is listed in the Additional file 1 and is also available online at http://dmg.cs.rmit.edu.au/IUPforest/Han-ADS1.fasta.
The random forest model
Features used in training and test
Auto-correlation function of amino acid indices (AAIs)
Each residue in the training set was replaced with a value of the normalized amino acid index (AAI), which is a set of 20 numerical values representing the physicochemical and biological property of 20 amino acids chosen from the AAI Database http://www.genome.ad.jp/dbget/aaindex.html. As such, a sequence of N amino acids in the training set was firstly transformed into a numerical sequence [49, 50], and denoted as:
P1P2 ⋯ P i ⋯ Pi+w⋯ P N (1)
where w is the window size, p i and pi+dare the AAI values at positions i and i+d respectively [49, 50]. For example, when d = 1, the numerical value for each residue (i) in the window multiplies by the value of the next nearby residue (i+1) and F1 is the average of these w-1 products. Similarly, F2 is the average of the w-2 products generated from every other residue. The value of d represented the order of the correlation and was tuned to optimize the prediction performance. The F d (d = 1, 2,..., 30) for the 40 sets of AAI listed in Table A1 in the Additional file 2 was calculated and evaluated in training IUPforest-L.
The mean hydrophobicity, defined as the average value of Kyte and Doolittle's hydrophobicity  in the window.
The modified hydrophobic cluster , calculated as the longest hydrophobic clusters in the window divided by the window size.
The mean net charge within the window and local mean net charge within a 13 aa fragment centered at the middle residue. Residues K and R were defined as +1; D and E were defined as -1; other residues were 0.
The mean contact number, defined as the mean expected number of contacts in the globular state of all residues within the window .
The composition of four reduced amino acid groups  and the Shannon's entropy (K2) of the amino acid composition within the window were calculated.
During the prediction stage, the features were firstly calculated when a window slides over an inquiry sequence and then a probability score of a residue being disordered was assigned by IUPforest-L. A region was annotated as disordered only when 30 or more consecutive amino acid residues were predicted to be disordered.
To estimate the generalization accuracy, 10-fold cross validation tests were conducted, where 90% of the sequences in the training set were randomly used in training and the other 10% were used in test. The process was repeated for the entire dataset and the final result was the average of the results from 10 processes. In addition, independent tests were performed on Hirose-ADS1 , Han-ADS1 and Peng-DB .
During the cross validation test, the confusion matrix, which comprises true positive (TP), false positive (FP), true negative (TN) and false negative (FN), was used to evaluate the prediction performance in terms of the following measures:
1) AUC, the area under the receiver operating characteristic (ROC) curve. Each point of a ROC curve was defined by a pair of values for the true positive rate (, or sensitivit y) and the false positive rate (, or 1-specificity).
Sproduct ≡ sensitivity × specificity (4)
where w disorder and w order are the weights for disorder and order, respectively, that are inversely proportional to the number of residues in the disordered and ordered state. Sw is also referred to as probability excess .
The Sproduct and Sw scores were used in assessing the prediction of disordered residues in the Critical Assessment of techniques for protein Structure Prediction (CASP6 and CASP7) .
10-fold cross validation
With a forest of a fixed number of trees, the ROC curve trained with the auto-correlation function with d value between 1 and 15 almost overlaps with the ROC curve trained with d between 1 and 30. This result indicates that continuous correlations between nearby residues from 1 to 15 along the sequence could determine whether the fragment is involved in a long disordered region.
Figure 3 shows that training with either type 1 or the combination of type 2–6 features could reach the 70.5% or 70.0% true positive rates with a 10% false positive rate, while their combination of type 1–6 features could lead to a higher true positive rate of 76%, and an area of 89.5% under the ROC curve. This result indicates that type 1 and type 2–6 features have redundant, but complementary structural information. Type 2–6 features generated only nine parameters in total within a given window, while type 1 features could generate hundreds of parameters that take into account both order information and physicochemical properties. It has been shown that the random forest model has no risk of overfitting with an increasing number of trees when the input parameters increase . As such, using type 1 features to train the random forest could extract more sequence-structure information  and it was thus conjectured that better prediction accuracy could be achieved with the auto-correlation functions generated from AAIs combined with other features of type 2–6.
The window size and step size for sliding the window are additional parameters for tuning the performance of IUPforest-L models. The window should be of a reasonable size so that the AAI-based correlation can be of significance within a reasonable training or test time. Training with small windows increases training time and can introduce noises, whereas training with large windows can lose local information. Our test results indicated that from window size of 19 aa to 47 aa, the random forest gave more stable result on blind test set Han-ADS1, but the accuracy on the 10-fold cross validation test on the training set will drop with larger window size (details listed in the Additional file 4). To batch predict long disordered regions, the window size of 31 aa was set in default to keep the balance between high efficiency and accuracy. The step size for sliding windows can also affect the accuracy and overall time efficiency at both the training and test stage. If the step size is too small, when a window slides along a sequence, it will introduce redundancy between windows and prolong the time for training models. Our experiments (details listed in the Additional file 4) show that with a sliding step of 20 aa (default setting) models achieve stable sensitivity without significantly prolonging the training process.
Comparison of IUPforest-L with other predictors on th test set Hirose-ADS1*
Comparison of IUPforest-L with other predictors on test set Han-ADS1*
Comparison of IUPforest-L with other predictors on test set Peng-DB.*
Protein structures are stabilized by numerous intramolecular interactions such as hydrophobic, electrostatic, van der Waals, and hydrogen bonds. The autocorrelation function tests whether the physicochemical property of one residue is independent of that of neighbouring residues. A group of residues involved in ordered structure close to other groups of residues in space will be dynamically constrained by the backbone or side chain interactions from these residues, and hence the residues in both groups will show higher density in the contact map or have higher pairwise correlation. On the other hand, a repetitive sequence of amino acids can also give significant positive correlation for all physicochemical properties. Therefore, residues within a fragment exhibiting a higher autocorrelation may either be structurally constrained, or have low sequence complexity. The random forest learning model employed by the IUPforest-L disorder predictor combines the complementary contributions from the autocorrelation function (type 1 feature) and other types of features, so that structural information is extracted with a high degree of prediction accuracy.
The random forest model is an ensemble learning model and is known to be more robust to noise than many non-ensemble learning models. However, as a classifier based on the random forest needs to load many decision trees into memory, it is relatively slow for a forest to predict a single instance at a time. As a result, the current web server of IUPforest-L is better suited to batch prediction of a large number of protein sequences, which provides an alternative useful tool in large-scale analysis of long disordered regions in proteomics. As an initial application, we have provided a server, IUPforest-L, for batch protein sequences analysis with the output of overall summary and details for each sequence. For convenience in proteomic comparisons, the prediction results for 62 eukaryotes linked to The European Bioinformatics Institute are also pre-calculated and can be downloaded from the server.
IUP studies are important because disordered regions are common and functionally important in proteins. The new features, the auto-correlation functions of AAIs within a protein fragment, reflect both residues' contact information and sequence complexity. The random forest model based on this new type of features and other physicochemical features could effectively detect long disordered regions in proteins. As a result, a new predictor, IUPforest-L, was developed to predict long disordered regions in proteins. Its high accuracy and high efficiency make it a useful tool in large-scale protein sequence analysis.
The authors thank Lefei Zhan for his help on conducting some experiments, Marc Cortese for his explanation of DisProt database. ZPF is supported by an APD award from the Australian Research Council. XZ is supported in part by an RMIT Emerging Researcher Grant.
- Vucetic S, Brown CJ, Dunker AK, Obradovic Z: Flavors of protein disorder. Proteins. 2003, 52 (4): 573-584. 10.1002/prot.10437.View ArticlePubMedGoogle Scholar
- Dyson H, Wright PE: Intrinsically Unstructured Proteins and their Functions. Nat Rev Mol Cell Biol. 2005, 6: 197-208. 10.1038/nrm1589.View ArticlePubMedGoogle Scholar
- Tompa P, Szasz C, Buday L: Structural disorder throws new light on moonlighting. Trends Biochem Sci. 2005, 30 (9): 484-489. 10.1016/j.tibs.2005.07.008.View ArticlePubMedGoogle Scholar
- Tompa P: Intrinsically unstructured proteins. Trends Biochem Sci. 2002, 27 (10): 527-533. 10.1016/S0968-0004(02)02169-2.View ArticlePubMedGoogle Scholar
- Uversky VN, Oldfield CJ, Dunker AK: Showing your ID: intrinsic disorder as an ID for recognition, regulation and cell signaling. J Mol Recognit. 2005, 18 (5): 343-384. 10.1002/jmr.747.View ArticlePubMedGoogle Scholar
- Wright PE, Dyson HJ: Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J Mol Biol. 1999, 293 (2): 321-331. 10.1006/jmbi.1999.3110.View ArticlePubMedGoogle Scholar
- Dunker AK, Lawson JD, Brown CJ, Williams RM, Romero P, Oh JS, Oldfield CJ, Campen AM, Ratliff CM, Hipps KW: Intrinsically disordered protein. J Mol Graph Model. 2001, 19 (1): 26-59. 10.1016/S1093-3263(00)00138-8.View ArticlePubMedGoogle Scholar
- Russell RB, Gibson TJ: A careful disorderliness in the proteome: Sites for interaction and targets for future therapies. FEBS Lett. 2008, 582 (8): 1271-1275. 10.1016/j.febslet.2008.02.027.View ArticlePubMedGoogle Scholar
- Radivojac P, Iakoucheva LM, Oldfield CJ, Obradovic Z, Uversky VN, Dunker AK: Intrinsic disorder and functional proteomics. Biophys J. 2007, 92 (5): 1439-1456. 10.1529/biophysj.106.094045.PubMed CentralView ArticlePubMedGoogle Scholar
- Oldfield CJ, Cheng Y, Cortese MS, Romero P, Uversky VN, Dunker AK: Coupled folding and binding with alpha-helix-forming molecular recognition elements. Biochemistry. 2005, 44 (37): 12454-12470. 10.1021/bi050736e.View ArticlePubMedGoogle Scholar
- Tompa P: The interplay between structure and function in intrinsically unstructured proteins. FEBS Lett. 2005, 579 (15): 3346-3354. 10.1016/j.febslet.2005.03.072.View ArticlePubMedGoogle Scholar
- Gunasekaran K, Tsai CJ, Kumar S, Zanuy D, Nussinov R: Extended disordered proteins: targeting function with less scaffold. Trends Biochem Sci. 2003, 28 (2): 81-85. 10.1016/S0968-0004(03)00003-3.View ArticlePubMedGoogle Scholar
- Namba K: Roles of partly unfolded conformations in macromolecular self-assembly. Genes Cells. 2001, 6 (1): 1-12. 10.1046/j.1365-2443.2001.00384.x.View ArticlePubMedGoogle Scholar
- Dunker AK, Cortese MS, Romero P, Iakoucheva LM, Uversky VN: Flexible nets. The roles of intrinsic disorder in protein interaction networks. Febs J. 2005, 272 (20): 5129-5148. 10.1111/j.1742-4658.2005.04948.x.View ArticlePubMedGoogle Scholar
- Oldfield CJ, Ulrich EL, Cheng Y, Dunker AK, Markley JL: Addressing the intrinsic disorder bottleneck in structural proteomics. Proteins. 2005, 59 (3): 444-453. 10.1002/prot.20446.View ArticlePubMedGoogle Scholar
- Li X, Romero P, Rani M, Dunker AK, Obradovic Z: Predicting Protein Disorder for N-, C-, and Internal Regions. Genome Inform Ser Workshop Genome Inform. 1999, 10: 30-40.PubMedGoogle Scholar
- Yang ZR, Thomson R, McNeil P, Esnouf RM: RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics. 2005, 21 (16): 3369-3376. 10.1093/bioinformatics/bti534.View ArticlePubMedGoogle Scholar
- Thomson R, Esnouf R: Prediction of natively disordered regions in proteins using a bio-basis function neural network. Lecture Notes in Computer Science. 2004, 3177: 108-116.View ArticleGoogle Scholar
- Smith DK, Radivojac P, Obradovic Z, Dunker AK, Zhu G: Improved amino acid flexibility parameters. Protein Sci. 2003, 12 (5): 1060-1072. 10.1110/ps.0236203.PubMed CentralView ArticlePubMedGoogle Scholar
- Vucetic S, Obradovic Z, Vacic V, Radivojac P, Peng K, Iakoucheva LM, Cortese MS, Lawson JD, Brown CJ, Sikes JG: DisProt: a database of protein disorder. Bioinformatics. 2005, 21 (1): 137-140. 10.1093/bioinformatics/bth476.View ArticlePubMedGoogle Scholar
- Liu J, Tan H, Rost B: Loopy proteins appear conserved in evolution. J Mol Biol. 2002, 322 (1): 53-64. 10.1016/S0022-2836(02)00736-2.View ArticlePubMedGoogle Scholar
- Liu J, Rost B: NORSp: Predictions of long regions without regular secondary structure. Nucleic Acids Res. 2003, 31 (13): 3833-3835. 10.1093/nar/gkg515.PubMed CentralView ArticlePubMedGoogle Scholar
- Cheng J, Sweredoski MJ, Baldi P: Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data. Data Mining and Knowledge Discovery. 2005, 11: 213-222. 10.1007/s10618-005-0001-y.View ArticleGoogle Scholar
- Prilusky J, Felder CE, Zeev-Ben-Mordehai T, Rydberg E, Man O, Beckmann JS, Silman I, Sussman JL: FoldIndex(C): a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics. 2005, 21 (16): 3435-3438. 10.1093/bioinformatics/bti537.View ArticlePubMedGoogle Scholar
- Jones DT, Ward JJ: Prediction of disordered regions in proteins from position specific score matrices. Proteins. 2003, 53 (Suppl 6): 573-578. 10.1002/prot.10528.View ArticlePubMedGoogle Scholar
- Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT: Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol. 2004, 337: 635-645. 10.1016/j.jmb.2004.02.002.View ArticlePubMedGoogle Scholar
- Ward JJ, McGuffin LJ, Bryson K, Buxton BF, Jones DT: The DISOPRED server for the prediction of protein disorder. Bioinformatics. 2004, 20 (13): 2138-2139. 10.1093/bioinformatics/bth195.View ArticlePubMedGoogle Scholar
- Linding R, Russell RB, Neduva V, Gibson TJ: GlobPlot: Exploring protein sequences for globularity and disorder. Nucleic Acids Res. 2003, 31 (13): 3701-3708. 10.1093/nar/gkg519.PubMed CentralView ArticlePubMedGoogle Scholar
- Linding R, Jensen LJ, Diella F, Bork P, Gibson TJ, Russell RB: Protein disorder prediction: implications for structural proteomics. Structure (Camb). 2003, 11 (11): 1453-1459. 10.1016/j.str.2003.10.002.View ArticleGoogle Scholar
- Dosztanyi Z, Csizmok V, Tompa P, Simon I: The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. J Mol Biol. 2005, 347 (4): 827-839. 10.1016/j.jmb.2005.01.071.View ArticlePubMedGoogle Scholar
- Coeytaux K, Poupon A: Prediction of unfolded segments in a protein sequence based on amino acid composition. Bioinformatics. 2005, 21 (9): 1891-1900. 10.1093/bioinformatics/bti266.View ArticlePubMedGoogle Scholar
- Galzitskaya OV, Garbuzynskiy SO, Lobanov MY: FoldUnfold: web server for the prediction of disordered regions in protein chain. Bioinformatics. 2006, 22 (23): 2948-2949. 10.1093/bioinformatics/btl504.View ArticlePubMedGoogle Scholar
- Vullo A, Bortolami O, Pollastri G, Tosatto SC: Spritz: a server for the prediction of intrinsically disordered regions in protein sequences using kernel machines. Nucleic Acids Res. 2006, W164-168. 10.1093/nar/gkl166. 34 Web ServerGoogle Scholar
- Su CT, Chen CY, Ou YY: Protein disorder prediction by condensed PSSM considering propensity for order or disorder. BMC Bioinformatics. 2006, 7: 319-10.1186/1471-2105-7-319.PubMed CentralView ArticlePubMedGoogle Scholar
- Peng K, Radivojac P, Vucetic S, Dunker AK, Obradovic Z: Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics. 2006, 7: 208-10.1186/1471-2105-7-208.PubMed CentralView ArticlePubMedGoogle Scholar
- Obradovic Z, Peng K, Vucetic S, Radivojac P, Dunker AK: Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins. 2005, 61 (Suppl 7): 176-182. 10.1002/prot.20735.View ArticlePubMedGoogle Scholar
- Hirose S, Shimizu K, Kanai S, Kuroda Y, Noguchi T: POODLE-L: a two-level SVM prediction system for reliably predicting long disordered regions. Bioinformatics. 2007, 23 (16): 2046-2053. 10.1093/bioinformatics/btm302.View ArticlePubMedGoogle Scholar
- Shimizu K, Hirose S, Noguchi T: POODLE-S: web application for predicting protein disorder by using physicochemical features and reduced amino acid set of a position-specific scoring matrix. Bioinformatics. 2007, 23 (17): 2337-2338. 10.1093/bioinformatics/btm330.View ArticlePubMedGoogle Scholar
- Schlessinger A, Punta M, Rost B: Natively unstructured regions in proteins identified from contact predictions. Bioinformatics. 2007, 23 (18): 2376-2384. 10.1093/bioinformatics/btm349.View ArticlePubMedGoogle Scholar
- Ishida T, Kinoshita K: Prediction of disordered regions in proteins based on the meta approach. Bioinformatics. 2008, 24 (11): 1344-1348. 10.1093/bioinformatics/btn195.View ArticlePubMedGoogle Scholar
- Ishida T, Kinoshita K: PrDOS: prediction of disordered protein regions from amino acid sequence. Nucleic Acids Res. 2007, W460-464. 10.1093/nar/gkm363. 35 Web ServerGoogle Scholar
- Peng K, Vucetic S, Radivojac P, Brown CJ, Dunker AK, Obradovic Z: Optimizing Intrinsic Disorder Predictors with Protein Evolutionary Information. J Bioinform Comput Biol. 2005, 3: 35-60. 10.1142/S0219720005000886.View ArticlePubMedGoogle Scholar
- Breiman L: Random Forest. Machine Learning. 2001, 45 (1): 5-32. 10.1023/A:1010933404324.View ArticleGoogle Scholar
- Kawashima S, Kanehisa M: AAindex: amino acid index database. Nucleic Acids Res. 2000, 28 (1): 374-10.1093/nar/28.1.374.PubMed CentralView ArticlePubMedGoogle Scholar
- Hobohm U, Sander C: Enlarged representative set of protein structures. Protein Sci. 1994, 3 (3): 522-524.PubMed CentralView ArticlePubMedGoogle Scholar
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28 (1): 235-242. 10.1093/nar/28.1.235.PubMed CentralView ArticlePubMedGoogle Scholar
- Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006, 22 (13): 1658-1659. 10.1093/bioinformatics/btl158.View ArticlePubMedGoogle Scholar
- Dosztanyi Z, Chen J, Dunker AK, Simon I, Tompa P: Disorder and sequence repeats in hub proteins and their implications for network evolution. J Proteome Res. 2006, 5 (11): 2985-2995. 10.1021/pr060171o.View ArticlePubMedGoogle Scholar
- Feng ZP, Zhang CT: Prediction of membrane protein types based on the hydrophobic index of amino acids. J Protein Chem. 2000, 19 (4): 269-275. 10.1023/A:1007091128394.View ArticlePubMedGoogle Scholar
- Bu WS, Feng ZP, Zhang Z, Zhang CT: Prediction of protein (domain) structural classes based on amino-acid index. Eur J Biochem. 1999, 266 (3): 1043-1049. 10.1046/j.1432-1327.1999.00947.x.View ArticlePubMedGoogle Scholar
- Savitzky A, Golay MJE: Smoothing and Differentiation of Data by Simplified Least Squares Procedures. Analytical Chemistry. 1964, 36: 1627-1639. 10.1021/ac60214a047.View ArticleGoogle Scholar
- Kyte J, Doolittle RF: A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982, 157 (1): 105-132. 10.1016/0022-2836(82)90515-0.View ArticlePubMedGoogle Scholar
- Garbuzynskiy SO, Lobanov MY, Galzitskaya OV: To be folded or to be unfolded?. Protein Sci. 2004, 13 (11): 2871-2877. 10.1110/ps.04881304.PubMed CentralView ArticlePubMedGoogle Scholar
- Jin Y, Dunbrack RL: Assessment of disorder predictions in CASP6. Proteins. 2005, 61 (Suppl 7): 167-175. 10.1002/prot.20734.View ArticlePubMedGoogle Scholar
- Han P, Zhang X, Norton RS, Feng ZP: Predicting disordered regions in proteins using the profiles of amino acid Indices. Supplement Issue of BMC Bioinformatics for APBC. 2009,Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.