Volume 15 Supplement 16
Protein inter-domain linker prediction using Random Forest and amino acid physiochemical properties
© Shatnawi et al.; licensee BioMed Central Ltd. 2014
Published: 8 December 2014
Protein chains are generally long and consist of multiple domains. Domains are distinct structural units of a protein that can evolve and function independently. The accurate prediction of protein domain linkers and boundaries is often regarded as the initial step of protein tertiary structure and function predictions. Such information not only enhances protein-targeted drug development but also reduces the experimental cost of protein analysis by allowing researchers to work on a set of smaller and independent units. In this study, we propose a novel and accurate domain-linker prediction approach based on protein primary structure information only. We utilize a nature-inspired machine-learning model called Random Forest along with a novel domain-linker profile that contains physiochemical and domain-linker information of amino acid sequences.
The proposed approach was tested on two well-known benchmark protein datasets and achieved 68% sensitivity and 99% precision, which is better than any existing protein domain-linker predictor. Without applying any data balancing technique such as class weighting and data re-sampling, the proposed approach is able to accurately classify inter-domain linkers from highly imbalanced datasets.
Our experimental results show that the proposed approach is useful for domain-linker identification in highly imbalanced single- and multi-domain proteins.
Keywords: Protein domain-linker prediction; random forest; machine learning; physiochemical properties; linker index
A domain is a conserved part of a protein that can evolve, function, and exist independently. Each domain forms a three-dimensional (3D) structure and can be folded and stabilized independently. Several domains can be joined in different combinations to form multi-domain proteins that perform specific biological tasks [1, 2]. One domain may exist in multiple proteins. Domains vary in length from 25 to 500 amino acids (AAs). Inter-domain linkers tie neighboring domains together and support inter-domain communication in multi-domain proteins. They also provide sufficient flexibility to facilitate domain motions and regulate the inter-domain geometry. Predicting inter-domain linkers is of great importance for the precise identification of structural domains within a protein sequence. Many domain prediction approaches first detect domain linkers and then predict the location of domain regions accordingly. This domain knowledge is then used to understand protein structures, functions, and evolution, to perform multiple sequence alignment, and to predict protein-protein interactions. In addition, downsizing proteins into functional domains without losing useful biological information significantly reduces the computational cost of protein analysis [5, 6]. Therefore, the development of an accurate computational method for splitting proteins into structural domains is regarded as a critical step in protein tertiary structure prediction and proteomics.
A number of protein inter-domain linker prediction methods have been developed. They can be classified into (i) statistical-based, (ii) alignment/homology-based, and (iii) machine-learning (ML)-based methods. DomCut is a typical early statistical-based method. DomCut predicts inter-domain linker regions based on the differences in AA composition between domain and linker regions in a protein sequence. It considers a region or segment in a sequence to be a linker if it is between 10 and 100 residues long, connects two adjacent domains, and does not contain membrane-spanning regions. To represent the preference of AA residues for linker regions, it defines the linker index as the ratio of the frequency of an AA residue in domain regions to that in linker regions. A linker preference profile is generated by plotting the averaged linker index values along an AA sequence using a sliding window of 15 AAs. A linker is predicted if there is a trough in the profile and the averaged linker index value at the minimum of the trough is lower than the threshold value. At the threshold value of 0.09, the sensitivity and selectivity of DomCut were 53.5% and 50.1%, respectively. Although DomCut showed some promise, Dong et al. reported that it has low sensitivity and specificity compared to other recent methods. However, integrating more biological evidence with the linker index could enhance the prediction, and the idea behind DomCut was therefore later adopted by several researchers such as Zaki et al. and Pang et al.
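The DomCut-style procedure described above (per-residue linker index, window averaging, trough detection) can be sketched as follows. This is a minimal illustration and not the DomCut implementation: the logarithm of the domain/linker frequency ratio, the truncated windows at the sequence ends, the toy two-letter alphabet, and all function names are assumptions made for this example.

```python
import math

def linker_index(domain_counts, linker_counts):
    """Per-residue index: log of (frequency in domains / frequency in linkers).

    The log transform is an assumption; the text only describes a ratio.
    Counts must be non-zero for every residue type.
    """
    d_total = sum(domain_counts.values())
    l_total = sum(linker_counts.values())
    return {
        aa: math.log((domain_counts[aa] / d_total) / (linker_counts[aa] / l_total))
        for aa in domain_counts
    }

def smoothed_profile(seq, index, window=15):
    """Average the index over a centred window (truncated at the ends)."""
    half = window // 2
    profile = []
    for i in range(len(seq)):
        lo, hi = max(0, i - half), min(len(seq), i + half + 1)
        vals = [index[aa] for aa in seq[lo:hi]]
        profile.append(sum(vals) / len(vals))
    return profile

def trough_positions(profile, threshold):
    """Positions that are local minima of the profile and lie below threshold."""
    return [
        i for i in range(1, len(profile) - 1)
        if profile[i] < threshold
        and profile[i] <= profile[i - 1]
        and profile[i] <= profile[i + 1]
    ]

# Toy data: 'P' is enriched in linkers, 'A' in domains (hypothetical counts).
index = linker_index({"A": 90, "P": 10}, {"A": 40, "P": 60})
profile = smoothed_profile("AAAAAPPPPPAAAAA", index, window=5)
troughs = trough_positions(profile, threshold=-0.09)
```

With these toy counts the index is negative for the linker-preferring residue, so the smoothed profile dips over the stretch of P residues and the trough is reported at its centre.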
Shatnawi and Zaki used an AA compositional index profile, which combines the linker index and AA composition. They divided the protein sequence into chunks and then applied a simulated annealing algorithm to predict the optimal threshold value for each chunk. Linding et al. proposed another statistical-based method called GlobPlot. GlobPlot allows users to plot the tendency within protein sequences for exploring both potential globular and disordered/flexible regions in proteins based on their AA sequence, and to identify inter-domain segments containing linear motifs.
A typical alignment/homology-based method, which requires PSI-BLAST to generate evolutionary and homology information, is DOMpro. DOMpro was independently evaluated along with 12 other predictors in the Critical Assessment of Fully Automated Structure Prediction 4 (CAFASP-4) [15, 16] and was ranked among the top ab initio domain predictors. Other popular homology-based methods include Scooby-Domain and FIEFDom.
ML-based methods have gained considerable attention in protein domain-linker prediction tasks. Recent approaches employ machine learning techniques such as Artificial Neural Networks (ANN) and variants of Support Vector Machines (SVM). Sim et al. introduced PPRODO, an ANN classifier that is trained using features obtained from the position-specific scoring matrix (PSSM) generated by PSI-BLAST. The training dataset, which contained 522 contiguous two-domain proteins, was obtained from the structural classification of proteins (SCOP) database, version 1.63. When tested on 48 newly added non-homologous proteins in SCOP version 1.65 and on CASP5 targets, PPRODO achieved a prediction accuracy of 65.5%. ANN models have also been used in DomNet, DOMpro, Shandy, and ThreaDom.
Ebina et al. developed a protein linker predictor called DROP, which utilizes an SVM with a Radial Basis Function (RBF) kernel. The classifier is trained using 25 optimal features. The optimal combination of features was selected from a set of 3000 features using a Random Forest (RF) algorithm. The selected features were related to secondary structures and PSSM elements of hydrophilic residues. The accuracy of DROP was evaluated on two domain-linker datasets: DS-All [24, 25] and CASP8 FM. DS-All contains 169 protein sequences, with a maximum sequence identity of 28.6%, and 201 linkers. DROP showed a sensitivity and precision of 41.3% and 49.4%, respectively. Variants of SVM have also been used in DomainDiscovery, Chatterjee et al., and DoBo. The above-mentioned methods, in general, have the following limitations:
- Although methods that use structural information can achieve good prediction results, obtaining the structural information is itself a challenge. In contrast, predicting the domain linkers could help infer the structural information.
- ML-based domain predictors have shown limited capability in multi-domain proteins.
- Although homology-based methods can achieve high prediction accuracy, especially when close templates are retrieved, the accuracy often decreases sharply when the sequence identity between target and template is low.
- Some methods discard any protein sequence with non-contiguous domains. Therefore, domains that are connected by small linkers may not be identifiable.
- Most ML-based methods are computationally expensive. They incur a high computational cost to generate PSSMs and/or predict secondary structure information for each protein.
- Some methods are evaluated based on the overall prediction accuracy only. This may not adequately reflect the class imbalance of protein domain-linker data.
In this study, we develop a compact and accurate domain-linker prediction approach based solely on protein primary structure information. A Random Forest (RF) classifier is trained on a novel profile containing AA physiochemical properties and linker indices. The linker index is deduced from the protein sequence dataset of domain-linker segments. A sliding window of variable length is used to extract information on the dependencies between each AA and its neighboring residues. The proposed approach efficiently processes high-dimensional multi-domain protein data with more accurate predictive performance than existing state-of-the-art approaches. Our approach does not use commonly accepted pre-processing techniques such as class weighting or data re-sampling, which add computational complexity.
The proposed approach consists of five consecutive steps. First, we construct a multi-domain benchmark dataset for fair comparison with existing methods. Second, we build a novel profile that contains useful structural and physiochemical information of protein sequences for the protein domain-linker prediction task. Third, a nature-inspired ML model using RF is constructed and trained on the profile built in the second step. Fourth, we find the optimal size of the averaging window that slides across the protein AA sequence. Last, all of the above steps are integrated and the performance of the resulting approach is compared to that of other existing ML models and domain-linker predictors on two benchmark datasets.
As mentioned, two protein sequence datasets are used to evaluate the performance of our approach. The first dataset is DS-All [24, 25], which was used to evaluate DROP. All the sequences in DS-All were extracted from the non-redundant Protein Data Bank (nr-PDB) chain set; the dataset contains 182 protein sequences including 216 linker segments. By examining each sequence, we found that the assignment of domains in the DS-All dataset is inconsistent with the one in PDB. We thus relabeled the domains and linkers of the protein sequences in this dataset according to the NCBI conserved domains database and ended up with 140 sequences including 334 domains and 183 linker segments. The average numbers of AA residues in linker and domain segments are 12.7 and 147.1, respectively. This means that about 95.5% (334 × 147.1) of the total AA residues are located in domain segments and only 4.5% (183 × 12.7) are in linker segments.
The sequences in the second set were extracted from the Swiss-Prot database and tested by Suyama and Ohara to evaluate the performance of DomCut. This dataset contains 273 non-redundant protein sequences including 486 linkers and 794 domain segments. The average numbers of AA residues in linker and domain segments are 35.8 and 122.1, respectively. Therefore, about 85% (794 × 122.1) of the total AA residues exist in domain segments and only 15% (486 × 35.5) are in linker segments.
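The class-imbalance percentages quoted for the two datasets follow from the segment counts and average segment lengths. The short check below reproduces that arithmetic; the function name is hypothetical, and for the Swiss-Prot set it assumes the stated average linker length of 35.8 residues.

```python
def residue_shares(n_domains, mean_domain_len, n_linkers, mean_linker_len):
    """Return the fractions of residues in domain and linker segments."""
    domain_residues = n_domains * mean_domain_len
    linker_residues = n_linkers * mean_linker_len
    total = domain_residues + linker_residues
    return domain_residues / total, linker_residues / total

# DS-All after relabeling: 334 domains (147.1 AAs on average), 183 linkers (12.7 AAs).
ds_all = residue_shares(334, 147.1, 183, 12.7)

# Swiss-Prot set: 794 domains (122.1 AAs), 486 linkers (35.8 AAs, as stated for the average).
swiss_prot = residue_shares(794, 122.1, 486, 35.8)

print(f"DS-All:     {ds_all[0]:.1%} domain, {ds_all[1]:.1%} linker")
print(f"Swiss-Prot: {swiss_prot[0]:.1%} domain, {swiss_prot[1]:.1%} linker")
```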
Amino acid composition and linker index.
Hydrophobicity index (kcal/mol) of amino acids in a distribution from non-polar to polar at pH = 7.
Rose hydrophobicity scale.
SARAH1 hydrophobicity scale.
Amino acid classification according to their physiochemical properties.
H, K, R
A, C, F, G, I, L, M, N, P, Q, S, T, V, W, Y
C, D, E, H, K, N, Q, R, S, T, Y
A, F, G, I, L, M, P, V, W
I, L, V
F, H, W, Y
A, C, D, E, G, K, M, N, P, Q, R, S, T
A, G, P, S
D, N, T
C, E, F, H, I, K, L, M, Q, R, V, W, Y
A, D, E, P
I, L, V
C, G, H, S, W
F, M, Q, T, Y
K, N, R
Protein sequence representation
where $L$ is the length of the protein sequence and $x_{s_i}$ is the feature vector for the AA residue $s_i$, which is located at position $i$ in the protein sequence $S$. Figure 1 depicts the protein sequence representation by the amino acid features and the sliding window.
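The sliding-window representation can be illustrated with a small sketch in which each residue's feature vector is replaced by the average over a centred window. This is only an illustration under stated assumptions: the truncated windows at the sequence ends, the odd window size, the function name, and the toy feature values are not taken from the paper.

```python
def window_average(features, window):
    """Average each residue's feature vector with its neighbours' vectors.

    features: list of equal-length feature vectors, one per residue.
    window:   odd window size; windows are truncated at the sequence ends
              (an assumption, the original edge handling is not specified).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(features)
    averaged = []
    for i in range(n):
        block = features[max(0, i - half):min(n, i + half + 1)]
        dim = len(block[0])
        averaged.append([sum(v[k] for v in block) / len(block) for k in range(dim)])
    return averaged

# Toy profile: five residues, two hypothetical features each.
toy_features = [[1, 0], [0, 0], [1, 1], [0, 1], [1, 0]]
toy_averaged = window_average(toy_features, window=3)
```

Each row of `toy_averaged` would then serve as one training instance for the classifier, labeled as domain or linker according to the central residue.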
Random Forest model
Due to its averaging strategy, the RF classifier is robust to outliers and noise, avoids overfitting, is relatively fast, simple, and easily parallelized, and performs well in many classification problems [40, 42]. RF shows a significant performance improvement over single-tree classifiers such as CART and C4.5. The RF model interprets the importance of the features using measures such as mean decrease in accuracy or Gini importance. RF benefits from the randomization of decision trees, as individual trees have low bias and high variance. RF has few parameters to tune and is less dependent on tuning parameters [44, 45].
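Gini importance is built from the decrease in Gini impurity achieved by each split that uses a given feature, accumulated over the trees of the forest. The sketch below shows only that per-split quantity for one feature and one threshold; the function names and the toy values are hypothetical, and it is not the forest implementation used in this study.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(values, labels, threshold):
    """Impurity decrease obtained by splitting on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted_child = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted_child

# Toy feature values for five residues (hypothetical numbers).
values = [0.1, 0.2, 0.3, 0.8, 0.9]
labels = ["domain", "domain", "domain", "linker", "linker"]

good_split = gini_decrease(values, labels, threshold=0.5)   # separates the classes
poor_split = gini_decrease(values, labels, threshold=0.1)   # isolates one residue
```

A feature whose splits repeatedly produce large decreases, as in `good_split`, ends up with a high Gini importance.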
Ensemble methods including RF, bagging, and boosting have been increasingly applied to bioinformatics. Compared to bagging and boosting, RF has the unique advantage of using multiple feature subsets, which is well suited to high-dimensional data, as demonstrated by several bioinformatics studies. Lee et al. compared ensembles of bagging, boosting, and RF using the same experimental settings and found that RF is the most successful one. Experimental results on ten microarray datasets showed that RF is able to preserve predictive accuracy while yielding smaller gene sets compared to diagonal linear discriminant analysis, kNN, SVM, shrunken centroids (SC), and kNN with feature selection. Other advantages of RF, such as robustness to noise, lack of dependence upon tuning parameters, and computation speed, have been verified in classifying SELDI-TOF proteomic data. Wu et al. compared the ensemble methods of bagging, boosting, and RF to the individual classifiers LDA, quadratic discriminant analysis, kNN, and SVM for MALDI-TOF (matrix-assisted laser desorption/ionization with time-of-flight) data classification and reported that, among all methods, RF gives the lowest error rate with the smallest variance. RF also has better generalization ability than AdaBoost ensembles.
Recently, RF has been successfully applied to a wide range of bioinformatics problems, including the prediction of protein-protein binding sites, protein-protein interactions [52, 53], protein disordered regions, transmembrane helices, residue-residue contacts and helix-helix interactions, and the solvent accessible surface area of TM helix residues in membrane proteins.
where TP (true positive) is the number of AAs within the known linker segments predicted as linkers, TN (true negative) is the number of AAs within the known domain segments predicted as domains, FN (false negative) is the number of AAs within the known linker segments predicted as domains, and FP (false positive) is the number of AAs within the known domain segments predicted as linkers.
Recall and precision are useful measures in the domain-linker prediction problem. Both are computed for the linker (positive) class without counting true negatives, so they remain informative for unbalanced data, where data points are not equally distributed among classes, as in domain-linker data. The F1-score is also used as a unified measure to compare two approaches when one approach has higher recall and lower precision than the other.
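The per-residue counts and the three measures can be computed as in the following sketch. It uses the standard definitions (recall = TP/(TP+FN), precision = TP/(TP+FP), F1 as their harmonic mean); the single-letter labels "L" and "D", the function names, and the toy sequences are assumptions for illustration only.

```python
def confusion_counts(true_labels, predicted_labels, positive="L"):
    """Count TP, FP, TN, FN per residue, treating linker ('L') as positive."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred == positive:
            fp += 1
        elif truth != positive and pred != positive:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def recall_precision_f1(tp, fp, fn):
    """Standard recall, precision, and F1; zero when a denominator is zero."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1

# Toy example: 'D' = domain residue, 'L' = linker residue.
truth = "DDDDLLLLDD"
pred = "DDDDLLDDDD"
tp, fp, tn, fn = confusion_counts(truth, pred)
recall, precision, f1 = recall_precision_f1(tp, fp, fn)

# A predictor that labels every residue as domain never produces a positive.
all_domain = recall_precision_f1(*[confusion_counts(truth, "D" * len(truth))[i] for i in (0, 1, 3)])
```

The last line illustrates why accuracy alone is misleading here: the all-domain predictor is correct on 6 of 10 residues in this toy case, yet its recall, precision, and F1 are all zero.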
The experimental results showed that the proposed approach is useful for domain-linker identification in highly imbalanced single-domain and multi-domain proteins. The proposed approach has several advantages. First, its better predictive performance was achieved on imbalanced domain-linker data without applying any class weights or data re-sampling techniques. In other words, the proposed approach is not biased towards the majority class like most other models. To compare RF performance to SVM and ANN, we trained SVM and ANN classifiers with the same protein data and found that both classifiers classified the whole protein sequences as domains. This can be explained by the fact that the training of such methods optimizes the model parameters to maximize the classification accuracy (by minimizing the error rate), which is not a successful strategy for highly imbalanced data. Second, the physiochemical properties used in this approach play important roles in shaping the behavior of AAs and their interactions with other AAs, and these interactions have a significant impact on the formation, folding, and stabilization of protein 3D structures. Therefore, these properties are important features for distinguishing structural domains from linkers. Third, the AA features used in this approach can be extracted at a low computational cost compared to other features such as PSSM and protein secondary structure, which are used in most current approaches. Generating PSSMs and predicting secondary structure features are computationally expensive and time consuming. Moreover, protein secondary structures are normally predicted by computational methods, so domain-linker prediction is influenced by the accuracy of secondary structure prediction, as incorrectly predicted secondary structures may lead to model misclassification.
On the other hand, one limitation of our approach is that RF may break the correlation between AAs. Each instance in the training data consists of the averaged feature values for a certain AA residue in the protein. The preference of each AA to exist in a domain or a linker strongly depends on its neighboring AAs. Therefore, there is a strong correlation between these AA instances, and when the RF algorithm randomly selects a number of instances for each decision tree, the sequence-order knowledge may be lost.
Prediction measures after removing features that have less information gain using DS-All dataset.
Charge and Polarity
Size and all the above
Electronic and all the above
Aromaticity and all the above
Hydrophobicity and all the above
Although various ML-based domain prediction approaches have been developed, they have shown a limited capability in multi-domain protein prediction. Capturing long-term AA dependencies and developing a more suitable representation of protein sequence profiles that includes evolutionary information may lead to better model performance. Existing approaches showed a limited capability in exploiting the long-range interactions that exist among amino acids and participate in the formation of protein secondary structure. Residues can be adjacent in 3D space while located far apart in the AA sequence [2, 60]. Regarding protein sequence profile representation, the input profiles proposed in most domain-linker predictors still provide insufficient structural information to reach the maximum accuracy.
One reason behind the limited capability of multi-domain protein predictors is the disagreement in domain assignment between different protein databases. The agreement between domain databases covers about 80% of single-domain proteins and only about 66% of multi-domain proteins. This disagreement is due to the variance in the experimental methods used in domain assignment. The most predominant techniques used to experimentally determine protein 3D structures are X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy. However, their domain assignments differ in about 20% of cases, so the upper limit of accuracy for such a domain-linker prediction task could be about 80%.
To the best of our knowledge, the use of a well-optimized RF classifier together with a profile that contains domain-linker and physiochemical property information is novel for the protein domain-linker identification problem. The profile also uses a sliding window of variable length to extract information on the dependencies between each AA and its neighbors. The utility of the proposed approach is demonstrated on two well-known benchmark datasets by achieving a recall of 68%, a precision of 99%, and an F1-score of 80%. The proposed approach eliminates data pre-processing steps such as class weighting or data re-sampling, and shows that the model can handle imbalanced data and is not biased towards the majority class. This work can be extended by examining longer averaging window sizes in order to capture long-range AA dependency information. The averaging window formula can also be improved to a weighted average so that AA neighbors closer to the central residue receive higher weights than those farther away.
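The weighted-average extension suggested above could look like the following sketch for a single feature. The triangular weighting (weight decreasing linearly with distance from the central residue), the truncated windows at the sequence ends, and the function name are assumptions chosen for illustration; the paper does not specify a weighting scheme.

```python
def weighted_window_average(values, window):
    """Weighted average of one feature over a centred window.

    Weights fall off linearly with distance from the central residue
    (a hypothetical triangular scheme); windows are truncated at the ends.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(values)
    result = []
    for i in range(n):
        numerator = 0.0
        denominator = 0.0
        for j in range(max(0, i - half), min(n, i + half + 1)):
            weight = half + 1 - abs(j - i)  # centre gets the largest weight
            numerator += weight * values[j]
            denominator += weight
        result.append(numerator / denominator)
    return result

# A single spike in a toy feature track, smoothed with window 3 (weights 1, 2, 1).
smoothed = weighted_window_average([0, 0, 1, 0, 0], window=3)
```

Compared with a uniform window of the same size, which would give the central position a value of 1/3, this weighting keeps more of the central residue's own signal (0.5 in the toy case).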
The authors would like to acknowledge the assistance provided by the College of Information Technology and the National Research Foundation (Grant Ref. No. 31T038) and the Research and Graduate Studies Office at the United Arab Emirates University (UAEU). The authors would like to also thank Mikita Suyama, Osamu Ohara, Teppei Ebina, Hiroyuki Toh, and Yutaka Kuroda for providing the protein datasets used in their studies.
The publication cost for this article was funded by the College of Information Technology at the United Arab Emirates University.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 16, 2014: Thirteenth International Conference on Bioinformatics (InCoB2014): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S16.
- Chothia C: Proteins. one thousand families for the molecular biologist. Nature. 1992, 357 (6379): 543. 10.1038/357543a0.
- Yoo PD, Sikder AR, Taheri J, Zhou BB, Zomaya AY: Domnet: protein domain boundary prediction using enhanced general regression network and new profiles. NanoBioscience, IEEE Transactions. 2008, 7 (2): 172-181.
- Suyama M, Ohara O: Domcut: prediction of inter-domain linker regions in amino acid sequences. Bioinformatics. 2003, 19 (5): 673-674. 10.1093/bioinformatics/btg031.
- Bhaskara RM, de Brevern AG, Srinivasan N: Understanding the role of domain-domain linkers in the spatial orientation of domains in multi-domain proteins. Journal of Biomolecular Structure and Dynamics. 2012.
- Zaki N: Prediction of protein-protein interactions using pairwise alignment and inter-domain linker region. Engineering Letters. 2008, 16 (4): 505-
- Zaki N, Campbell P: Domain linker region knowledge contributes to protein-protein interaction prediction. Proceedings of International Conference on Machine Learning and Computing (ICMLC 2009). 2009.
- Hondoh T, Kato A, Yokoyama S, Kuroda Y: Computer-aided nmr assay for detecting natively folded structural domains. Protein science. 2006, 15 (4): 871-883. 10.1110/ps.051880406.
- Dong Q, Wang X, Lin L, Xu Z: Domain boundary prediction based on profile domain linker propensity index. Computational biology and chemistry. 2006, 30 (2): 127-133. 10.1016/j.compbiolchem.2006.01.001.
- Zaki N, Bouktif S, Lazarova-Molnar S: A combination of compositional index and genetic algorithm for predicting transmembrane helical segments. PLoS ONE. 2011, 6 (7): 21821. 10.1371/journal.pone.0021821.
- Pang CN, Lin K, Wouters MA, Heringa J, George RA: Identifying foldable regions in protein sequence from the hydrophobic signal. Nucleic acids research. 2008, 36 (2): 578-588.
- Shatnawi M, Zaki N: Prediction of protein inter-domain linkers using compositional index and simulated annealing. Proceeding of the Fifteenth Annual Conference Companion on Genetic and Evolutionary Computation Conference Companion. GECCO '13 Companion. 2013, 1603-1608. [http://doi.acm.org/10.1145/2464576.2482740]
- Linding R, Russell RB, Neduva V, Gibson TJ: Globplot: exploring protein sequences for globularity and disorder. Nucleic acids research. 2003, 31 (13): 3701-3708. 10.1093/nar/gkg519.
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic acids research. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
- Cheng J, Sweredoski MJ, Baldi P: Dompro: protein domain prediction using profiles, secondary structure, relative solvent accessibility, and recursive neural networks. Data Mining and Knowledge Discovery. 2006, 13 (1): 1-10. 10.1007/s10618-005-0023-5.
- Fischer D, Barret C, Bryson K, Elofsson A, Godzik A, Jones D, Karplus KJ, Kelley LA, MacCallum RM, Pawowski K: Cafasp-1: critical assessment of fully automated structure prediction methods. Proteins: Structure, Function, and Bioinformatics. 1999, 37 (S3): 209-217. 10.1002/(SICI)1097-0134(1999)37:3+<209::AID-PROT27>3.0.CO;2-Y.
- Saini HK, Fischer D: Meta-dp: domain prediction meta-server. Bioinformatics. 2005, 21 (12): 2917-2920. 10.1093/bioinformatics/bti445.
- George RA, Lin K, Heringa J: Scooby-domain: prediction of globular domains in protein sequence. Nucleic acids research. 2005, 33 (suppl 2): 160-163.
- Bondugula R, Lee MS, Wallqvist A: Fiefdom: a transparent domain boundary recognition system using a fuzzy mean operator. Nucleic acids research. 2009, 37 (2): 452-462.
- Sim J, Kim S-Y, Lee J: Pprodo: Prediction of protein domain boundaries using neural networks. Proteins: Structure, Function, and Bioinformatics. 59 (3):
- Murzin AG, Brenner SE, Hubbard T, Chothia C: Scop: a structural classification of proteins database for the investigation of sequences and structures. Journal of molecular biology. 1995, 247 (4): 536-540.
- Walsh I, Martin AJ, Mooney C, Rubagotti E, Vullo A, Pollastri G: Ab initio and homology based prediction of protein domains by recursive neural networks. BMC bioinformatics. 2009, 10 (1): 195. 10.1186/1471-2105-10-195.
- Xue Z, Xu D, Wang Y, Zhang Y: Threadom: extracting protein domain boundary information from multiple threading alignments. Bioinformatics. 2013, 29 (13): 247-256. 10.1093/bioinformatics/btt209.
- Ebina T, Toh H, Kuroda Y: Drop: an svm domain linker predictor trained with optimal features selected by random forest. Bioinformatics. 2011, 27 (4): 487-494. 10.1093/bioinformatics/btq700.
- Tanaka T, Yokoyama S, Kuroda Y: Improvement of domain linker prediction by incorporating loop-length-dependent characteristics. Peptide Science. 2006, 84 (2): 161-168. 10.1002/bip.20361.
- Ebina T, Toh H, Kuroda Y: Loop-length-dependent svm prediction of domain linkers for high-throughput structural proteomics. Peptide Science. 2009, 92 (1): 1-8. 10.1002/bip.21105.
- Sikder AR, Zomaya AY: Improving the performance of domaindiscovery of protein domain boundary assignment using inter-domain linker index. BMC bioinformatics. 2006, 7 (Suppl 5): 6. 10.1186/1471-2105-7-S5-S6.
- Chatterjee P, Basu S, Kundu M, Nasipuri M, Basu DK: Improved prediction of multi-domains in protein chains using a support vector machine. 2009.
- Eickholt J, Deng X, Cheng J: Dobo: Protein domain boundary prediction by integrating evolutionary signals and machine learning. BMC bioinformatics. 2011, 12 (1): 43. 10.1186/1471-2105-12-43.
- Bairoch A, Apweiler R: The swiss-prot protein sequence database and its supplement trembl in 2000. Nucleic acids research. 2000, 28 (1): 45-48. 10.1093/nar/28.1.45.
- Hu H-J, Pan Y, Harrison R, Tai PC: Improved protein secondary structure prediction using support vector machine with a new encoding scheme and an advanced tertiary classifier. NanoBioscience, IEEE Transactions. 2004, 3 (4): 265-271. 10.1109/TNB.2004.837906.
- Kim H, Park H: Prediction of protein relative solvent accessibility with support vector machines and long-range interaction 3d local descriptor. Proteins: Structure, Function, and Bioinformatics. 2004, 54 (3): 557-562.
- Korenberg MJ, David R, Hunter IW, Solomon JE: Automatic classification of protein sequences into structure/function groups via parallel cascade identification: a feasibility study. Annals of biomedical engineering. 2000, 28 (7): 803-811.
- Yoo P, Zhou B, Zomaya A: A modular kernel approach for integrative analysis of protein domain boundaries. BMC genomics. 2009, 10 (Suppl 3): 21. 10.1186/1471-2164-10-S3-S21.
- Rose GD, Geselowitz AR, Lesser GJ, Lee RH, Zehfus MH: Hydrophobicity of amino acid residues in globular proteins. Science. 1985, 229 (4716): 834-838. 10.1126/science.4023714.
- Taylor WR: The classification of amino acid conservation. Journal of theoretical Biology. 1986, 119 (2): 205-218. 10.1016/S0022-5193(86)80075-3.
- Betts MJ, Russell RB: Amino acid properties and consequences of substitutions.
- Ganapathiraju M, Balakrishnan N, Reddy R, Klein-Seetharaman J: Transmembrane helix prediction using amino acid property features and latent semantic analysis. Bmc Bioinformatics. 2008, 9 (Suppl 1): 4. 10.1186/1471-2105-9-S1-S4.
- Hayat M, Khan A: Mem-phybrid: Hybrid features-based prediction system for classifying membrane protein types. Analytical biochemistry. 2012, 424 (1): 35-44. 10.1016/j.ab.2012.02.007.
- Hayat M, Khan A: Wrf-tmh: predicting transmembrane helix by fusing composition index and physicochemical properties of amino acids. Amino acids. 2013, 1-12.
- Breiman L: Random forests. Machine learning. 2001, 45 (1): 5-32. 10.1023/A:1010933404324.
- Wang X-F, Chen Z, Wang C, Yan R-X, Zhang Z, Song J: Predicting residue-residue contacts and helix-helix interactions in transmembrane proteins using an integrative feature-based random forest approach. PloS one. 2011, 6 (10): 26767. 10.1371/journal.pone.0026767.
- Caruana R, Karampatziakis N, Yessenalina A: An empirical evaluation of supervised learning in high dimensions. Proceedings of the 25th International Conference on Machine Learning ACM. 2008, 96-103.
- Chang KY, Yang J-R: Analysis and prediction of highly effective antiviral peptides based on random forests. PloS one. 2013, 8 (8): 70166. 10.1371/journal.pone.0070166.
- Izmirlian G: Application of the random forest classification algorithm to a seldi-tof proteomics study in the setting of a cancer prevention trial. Annals of the New York Academy of Sciences. 2004, 1020 (1): 154-174. 10.1196/annals.1310.015.
- Qi Y: Random forest for bioinformatics. Ensemble Machine Learning Springer. 2012, 307-323.
- Yang P, Hwa Yang Y, B Zhou B, Y Zomaya A: A review of ensemble methods in bioinformatics. Current Bioinformatics. 2010, 5 (4): 296-308. 10.2174/157489310794072508.
- Lee JW, Lee JB, Park M, Song SH: An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis. 2005, 48 (4): 869-885. 10.1016/j.csda.2004.03.017.
- Díaz-Uriarte R, De Andres SA: Gene selection and classification of microarray data using random forest. BMC bioinformatics. 2006, 7 (1): 3. 10.1186/1471-2105-7-3.
- Wu B, Abbott T, Fishman D, McMurray W, Mor G, Stone K, Ward D, Williams K, Zhao H: Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics. 2003, 19 (13): 1636-1643. 10.1093/bioinformatics/btg210.
- Chen C, Liaw A, Breiman L: Using random forest to learn imbalanced data. 2004, University of California, Berkeley.
- Bordner AJ: Predicting protein-protein binding sites in membrane proteins. BMC bioinformatics. 2009, 10 (1): 312. 10.1186/1471-2105-10-312.
- Chen X-W, Liu M: Prediction of protein-protein interactions using random decision forest framework. Bioinformatics. 2005, 21 (24): 4394-4400. 10.1093/bioinformatics/bti721.
- Šikić M, Tomić S, Vlahovićek K: Prediction of protein-protein interaction sites in sequences and 3d structures by random forests. PLoS computational biology. 2009, 5 (1): 1000278. 10.1371/journal.pcbi.1000278.
- Han P, Zhang X, Norton R, Feng Z-P: Large-scale prediction of long disordered regions in proteins using random forests. BMC bioinformatics. 2009, 10 (1): 8. 10.1186/1471-2105-10-8.
- Wang C, Xi L, Li S, Liu H, Yao X: A sequence-based computational model for the prediction of the solvent accessible surface area for α-helix and β-barrel transmembrane residues. Journal of computational chemistry. 2012, 33 (1): 11-17. 10.1002/jcc.21936.
- Sasaki Y: The truth of the f-measure. Teach Tutor mater. 2007, 1-5.
- Powers D: Evaluation: From precision, recall and f-measure to roc., informedness, markedness & correlation. Journal of Machine Learning Technologies. 2011, 2 (1): 37-63.
- Hernández-Lobato D, Martínez-Muñoz G, Suárez A: How large should ensembles of classifiers be?. Pattern Recognition. 2013, 46 (5): 1323-1336. 10.1016/j.patcog.2012.10.021.
- Bibimoune M, Elghazel H, Aussem A: An empirical comparison of supervised ensemble learning approaches. 2013.
- Chen J, Chaudhari NS: Bidirectional segmented-memory recurrent neural network for protein secondary structure prediction. Soft Computing. 2006, 10 (4): 315-324. 10.1007/s00500-005-0489-5.
- Marsden RL, McGuffin LJ, Jones DT: Rapid protein domain assignment from amino acid sequence using predicted secondary structure. Protein Science. 2002, 11 (12): 2814-2824.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.