Volume 14 Supplement 14
Identification and characterization of plastid-type proteins from sequence-attributed features using machine learning
© Kaundal et al; licensee BioMed Central Ltd. 2013
Published: 9 October 2013
Plastids are an important component of plant cells, being the site of manufacture and storage of chemical compounds used by the cell, and contain pigments such as those involved in photosynthesis, starch synthesis and storage, and cell color. They are essential organelles of the plant cell and are also present in algae. Recent advances in genomic technology and sequencing efforts are generating a huge amount of DNA sequence data every day. The predicted proteomes of these genomes need annotation at a faster pace. One such annotation need is an automated system that can accurately distinguish between plastid and non-plastid proteins, and further classify plastid-types based on their functionality. We compared the amino acid compositions of plastid proteins with those of non-plastid ones and found significant differences, which were used as a basis to develop various feature-based prediction models using similarity search and machine learning.
In this study, we developed separate Support Vector Machine (SVM) classifiers for characterizing plastids in two steps: first distinguishing plastid from non-plastid proteins, and then classifying the identified plastid proteins into types based on their function (chloroplast, chromoplast, etioplast, and amyloplast). Five diverse protein features (amino acid composition, dipeptide composition, pseudo amino acid composition, Nterminal-Center-Cterminal composition, and protein physicochemical properties) are used to develop the SVM models. Overall, the dipeptide composition-based module shows the best performance, with an accuracy of 86.80% and a Matthews Correlation Coefficient (MCC) of 0.74 in phase-I, and an accuracy of 78.60% with an MCC of 0.44 in phase-II. On independent test data, this model also performs better, with overall accuracies of 76.58% and 74.97% in phase-I and phase-II, respectively. The similarity-based PSI-BLAST module shows very low performance, with about 50% prediction accuracy for distinguishing plastid vs. non-plastid proteins and only 20% in classifying the various plastid-types, indicating the need for and importance of machine learning algorithms.
The current work is a first attempt to develop a methodology for classifying various plastid-type proteins. The prediction modules have also been made available as a web tool, PLpred, at http://bioinfo.okstate.edu/PLpred/ for real-time identification and characterization. We believe this tool will be very useful in the functional annotation of various genomes.
Moreover, in cases such as the ordered rearrangement of the proteome during plastid differentiation, profiling of static proteomes provides only limited information on proteome dynamics. To circumvent these constraints and to increase proteome coverage, the development of highly efficient computational prediction tools is a complementary approach that can provide useful global information about plastid proteomes. Various proteomic approaches have led to the development of databases for plant plastids, for example, the Chloroplast Genome Database, plprot, and PPDB. However, there is no computational prediction system for identifying and characterizing the various plastid types that could be used to classify 'unknown' proteins. TargetP is currently the most widely known prediction program, with a tested prediction accuracy of around 68% for known plastid proteins, suggesting that a significant number of proteins cannot be identified by this type of analysis [12, 16–19]. The most likely reason for this low performance is that TargetP relies on the presence of an N-terminal transit peptide region in a protein; where a protein carries an alternate signal, the prediction fails. It has been reported that plastid protein dynamics most likely also relate to the different protein-targeting routes that exist in plastids. This means that novel algorithms have to be developed based on whole amino acid sequence properties. Secondly, TargetP cannot predict the plastid type of a query protein, e.g. whether it is a chloroplast, chromoplast, etioplast or amyloplast protein. Previous attempts to predict plastid-types have been unsuccessful; several etioplast proteins are not predicted by TargetP for plastid localization.
In the current study, we have developed a prediction system for the genome-wide identification and classification of plastid proteins. The method works in two phases: first, distinction between plastid and non-plastid proteins, and second, classification of the identified plastid proteins into sub-classes (chloroplast, chromoplast, etioplast, and amyloplast). Various features of a protein sequence, viz. amino acid composition (AAC), dipeptide composition (DIPEP), pseudo amino acid composition (PseAAC), Nterminal-Center-Cterminal (NCC) composition, and physicochemical properties, are explored in a Support Vector Machine (SVM) framework to develop diverse prediction models. In addition, the models have been tested on 'independent test' datasets for better confidence and reliability. An online tool, PLpred, has also been developed for use by the research community. With recent advances in genomics technology and more and more genomes being sequenced, there has been a surge in data generation. The predicted proteomes of these genomes thus need annotation at a much faster pace. We have developed a prediction method trained on 'known' plastid proteins, which could be used to annotate the 'unknown' proteins predicted from these genomic DNA sequences. We believe the current method would be a useful resource in this direction.
As the current method is developed in two phases, we discuss the data collection and preparation for each phase separately below. Data were collected from various online repositories.
Table: Number of protein sequences for the plastid and non-plastid classes used in phase-I (identification) training/testing. Columns: < 30% cutoff; < 30% cutoff (across class); 10% independent test set.
Number of sequences available for plastid types in various online databases
Table: Number of protein sequences for various plastid types used in phase-II (classification) training/testing. Columns: < 30% cutoff; 10% independent test set.
Feature representation methods
The following diverse features were extracted from the protein sequences for use in a machine learning framework for developing prediction models in both phases:
Amino acid composition (AAC)
Dipeptide composition (DIPEP)
where P(x_i, x_j) is the fraction of each (x_i, x_j) dipeptide, f(x_i, x_j) is the frequency of occurrence of the (x_i, x_j) dipeptide, and the denominator represents the total number of all possible dipeptides.
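The two composition features can be illustrated with a minimal Python sketch. The function names and the fixed residue ordering are assumptions for illustration, and the dipeptide denominator is taken here to be the number of overlapping adjacent residue pairs in the sequence, which is one reading of "the total number of all possible dipeptides".

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def amino_acid_composition(seq):
    """Fraction of each standard amino acid (20 values, AMINO_ACIDS order)."""
    residues = [aa for aa in seq.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard residues")
    return [residues.count(aa) / len(residues) for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each of the 400 dipeptides (DIPEPTIDES order)."""
    seq = seq.upper()
    # Overlapping adjacent pairs; pairs with a non-standard residue are skipped.
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not pairs:
        raise ValueError("sequence contains no standard dipeptides")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for p in pairs:
        counts[p] += 1
    return [counts[d] / len(pairs) for d in DIPEPTIDES]
```

Under these assumptions a protein is represented by a 20-dimensional AAC vector and a 400-dimensional DIPEP vector, each summing to 1.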
Pseudo amino acid composition (PseAAC)
In composition-based methods, protein sequence order and length information are completely lost, which in turn may affect the prediction accuracy of the model. To retain sequence order and length information, Chou proposed an effective way of representing known proteins as pseudo amino acid compositions (PseAAC) in his seminal study.
In this representation, the protein character sequence is coded by some of its physicochemical properties. Since the amphiphilic property (hydrophobicity and hydrophilicity) plays a very important role in protein folding and function [34, 35], these two indices may be used to reflect the sequence order effects effectively.
where f_i (i = 1, 2, ..., 20) are the normalized occurrence frequencies of the 20 native amino acids in the protein P; the symbol θ_τ represents the τ-tier sequence correlation factor computed using (4), with H(P_i) and H(P_j) representing the hydrophobic and hydrophilic values of the amino acids P_i and P_j, respectively; and the symbol w represents the weight factor, which governs the degree of the sequence order effect to be incorporated. In the present study, we have judiciously chosen the weight w as 0.1 and λ as 5 for better accuracy. In essence, the first 20 values in (3) represent the classic amino acid composition, and the next 2λ values reflect the amphiphilic sequence correlation along the protein chain.
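The description above can be sketched in Python as follows. The sketch follows the general amphiphilic pseudo amino acid composition scheme (20 composition terms followed by 2λ correlation terms); it is not a reproduction of equations (3) and (4), which may differ in detail. The function names, the standardization of each property scale, and the reading of the value 5 as λ are assumptions, and the two property scales must be supplied by the caller.

```python
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def standardize(scale):
    """Rescale a 20-residue property table to zero mean and unit variance."""
    values = [scale[aa] for aa in AMINO_ACIDS]
    mu, sd = mean(values), pstdev(values)
    return {aa: (scale[aa] - mu) / sd for aa in AMINO_ACIDS}


def pseudo_aac(seq, hydrophobicity, hydrophilicity, lam=5, w=0.1):
    """Return 20 + 2*lam values: composition terms, then correlation terms."""
    seq = [aa for aa in seq.upper() if aa in AMINO_ACIDS]
    n = len(seq)
    if n <= lam:
        raise ValueError("sequence must be longer than lam residues")
    h1, h2 = standardize(hydrophobicity), standardize(hydrophilicity)
    freqs = [seq.count(aa) / n for aa in AMINO_ACIDS]
    thetas = []
    for tier in range(1, lam + 1):
        # One correlation factor per property for residues 'tier' apart.
        for h in (h1, h2):
            products = [h[seq[i]] * h[seq[i + tier]] for i in range(n - tier)]
            thetas.append(sum(products) / (n - tier))
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]
```

With lam = 5 the function returns a 30-dimensional vector; a larger w gives the sequence-order terms more influence relative to the plain composition.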
Terminal-based N-Center-C (NCC) amino acid composition
The AAC for each segment is computed using (1). Hence, a 60-dimensional feature vector is used to represent a protein. In an empirical study, a residue length of 25 was found to be the best compromise in both phase-I and phase-II predictions.
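A minimal Python sketch of the NCC feature is given below. It assumes that the residue length of 25 refers to the N-terminal and C-terminal segments and that the center is whatever remains between them; the function names and the handling of short sequences (an empty center yields zeros) are likewise assumptions, not details taken from this study.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(segment):
    """AAC of one segment; an empty segment gives a vector of zeros."""
    n = len(segment)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    # Non-standard residues count towards the length but not towards any bin.
    return [segment.count(aa) / n for aa in AMINO_ACIDS]


def ncc_composition(seq, terminal_len=25):
    """Concatenate the AAC of the N-terminal, center and C-terminal segments."""
    seq = seq.upper()
    n_term = seq[:terminal_len]
    c_term = seq[-terminal_len:]
    center = seq[terminal_len:-terminal_len]  # empty if len(seq) <= 2 * terminal_len
    return composition(n_term) + composition(center) + composition(c_term)
```

Each segment contributes 20 values, giving the 60-dimensional vector described above.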
Physicochemical property-based composition
Physicochemical properties used to represent a protein for predicting plastids and their types using SVM.
D, R, E, K, H
Hydrophilic (polar) and neutral
N, Q, S, T, Y
Basic polar or Positively charged
H, K, R
Acidic polar or Negatively charged
A, G, I, L, V
F, W, Y
T, D, N
G, A, S, P
F, R, W, Y
Hydrophobic (non-polar) and aromatic
Hydrophobic (non-polar) and neutral
A, C, G, I, L, M, F, P, W, V
Amidic (contains amide group)
C, W, N, Q, S, T, Y, K, R, H, D, E
Acidic and their Amide
D, E, N, Q
D, E, H, C, Y, K, R
Forms covalent cross-link (disulfide bond)
Theoretical pI (isoelectric point)
Similarity search-based PSI-BLAST module
In this study, we also performed PSI-BLAST-based predictions, in which a query sequence is searched for similarity against a non-redundant database, with all of UniProt/Swiss-Prot used as the target database. Previous studies have suggested that PSI-BLAST can detect remote homologies and is thus preferred over standard BLAST. It carries out an iterative search in which sequences found in one round are used to build a new score model for the next round of searching. Three iterations were carried out at a best cut-off E-value of 0.001. This module was run separately for the plastid and non-plastid data and for the various plastid-type classes, depending upon the similarity of the query protein to the proteins in the dataset. The module returns "unknown protein type" if no significant similarity is obtained. Accordingly, values for H (number of total hits), C (number of correct hits), P (percent of correct hits), and A (percent accuracy) are calculated to evaluate the PSI-BLAST-based prediction performance.
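As an illustration of the decision rule described above, the following Python sketch assigns a query the class of its most significant hit and otherwise returns "unknown protein type". It assumes the PSI-BLAST output has already been parsed into (subject identifier, E-value) pairs; the function name, the input layout, and the use of the single best hit are assumptions for illustration, not details given in this study.

```python
def classify_by_best_hit(hits, labels, evalue_cutoff=0.001):
    """Return the class of the lowest-E-value hit that has a known label.

    hits   -- iterable of (subject_id, evalue) pairs from a similarity search
    labels -- dict mapping subject_id to its annotated class
    """
    significant = [
        (evalue, subject_id)
        for subject_id, evalue in hits
        if evalue <= evalue_cutoff and subject_id in labels
    ]
    if not significant:
        return "unknown protein type"
    _, best_subject = min(significant)  # smallest E-value wins
    return labels[best_subject]
```

Counting how often the returned class matches the annotated class of each query then gives correct-hit counts of the kind summarized by C above.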
Support Vector Machine (SVM)
The Support Vector Machine is a class of learning machines based on optimization principles from statistical learning theory, originally introduced by Vapnik and co-workers [37, 38] about two decades ago. It has been well studied and extensively applied to pattern recognition, regression and classification problems in various fields of science and engineering, for example: predicting protein subcellular localization [19, 29, 30, 32, 39–42], classifying microarray data, predicting protein secondary structure [44, 45], forecasting disease, predicting membrane protein type, and many other areas. In classification problems, the objective of SVM is to separate the training data with a maximum margin while maintaining reasonable computing efficiency. Multi-class classification is handled by reducing it to a series of binary classifications. Popular methods include One-Versus-Rest (OVR), One-Versus-One (OVO), and Directed Acyclic Graph Support Vector Machines (DAGSVM). In this work, we followed the OVO method for the multi-class problem. More details of the theory of SVM have been described elsewhere [37, 38].
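The One-Versus-One strategy can be sketched in Python as follows. One binary classifier is assumed per pair of classes, and the class collecting the most pairwise votes is returned; the function name, the sign convention for the decision score, and the tie-breaking behaviour (first class to reach the top count) are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def ovo_predict(x, classes, binary_classifiers):
    """Predict a class for x by majority vote over all pairwise classifiers.

    binary_classifiers maps each pair (a, b) to a function whose score is
    positive when x looks like class a and non-positive when it looks like b.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        score = binary_classifiers[(a, b)](x)
        votes[a if score > 0 else b] += 1
    return votes.most_common(1)[0][0]
```

For the four plastid types considered here, this scheme requires six pairwise classifiers.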
To develop the various classifiers, we used SVM_light, a freely downloadable SVM package (http://svmlight.joachims.org/). This software enables the user to define a number of parameters and offers a choice of built-in kernel functions, including linear, polynomial, and radial basis function (RBF). In our preliminary study, we found that the RBF kernel performed better than the linear and polynomial kernels (data not shown). Therefore, we used the RBF kernel in all further analysis and have presented the results accordingly.
Training/testing schema: In both phases, the models were evaluated with five-fold cross-validation, in which the dataset is divided into five parts. Four parts are combined to form one training set, and the models developed from this set are then tested on the fifth part (the testing set). This process is repeated five times, changing the training/testing set each time. In addition, we tested the performance of our models on independent test datasets that had not been used in any kind of machine learning.
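The five-fold scheme can be sketched in Python as follows. The function name, the random shuffling, and the fixed seed are assumptions for illustration; how the sequences were actually partitioned in this study is not specified beyond the description above.

```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for each of the n_folds rounds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds parts of near-equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test
```

Every sample appears in exactly one testing set, so each sequence is predicted once by a model that did not see it during training.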
Evaluation parameters: The performance of models developed in both the phase-I (single class) and phase-II (multi class) predictions is evaluated based on the following standard parameters:
where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.
In addition, we plot Receiver Operating Characteristic (ROC) curves and calculate the Area Under the Curve (AUC) for each of the classifiers.
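The AUC can be computed directly from classifier scores without drawing the curve, as in the following Python sketch. It uses the rank-based formulation (the fraction of positive/negative pairs in which the positive example scores higher, with ties counted as one half), which equals the area under the ROC curve; the function name and the label encoding (1 for plastid, anything else for non-plastid) are assumptions.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from decision scores and true labels."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count pairs ranked correctly; a tie contributes half a pair.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.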
Results and discussion
We first describe the homology-based prediction results and then the SVM-based performance for both phases of plastid-type prediction, including testing on independent datasets.
(i). Homology-based PSI-BLAST
Table: Overall performance of homology-based (PSI-BLAST) prediction for the identification of plastid vs. non-plastid proteins and the classification of diverse plastid-types. Column: No. of sequences.
(ii). Phase-I: SVM-based identification of plastid proteins
Table: Overall performance of various feature classifiers in 5-fold cross-validation for the identification of plastid vs. non-plastid proteins (phase-I). SVM kernel type: RBF (γ = 370, C = 3, j = 1); RBF (γ = 385, C = 2, j = 2); RBF (γ = 265, C = 6, j = 1); RBF (γ = 20, C = 3, j = 2); RBF (γ = 135, C = 2, j = 1).
To include more diverse information, we further developed a dipeptide composition-based model. This classifier achieves the highest MCC (0.74) of all models, with a slight increase in accuracy (86.80%) and a significant reduction in the rate of false prediction (RFP; 14.12%). Earlier studies have reported that dipeptide composition performs better than simple amino acid composition [29, 30, 32], because it provides sequence order information along with the composition. Next, we compared the results of the NCC and physicochemical property-based composition models. The physicochemical model, with an overall sensitivity of 79.57% and an MCC of 0.61, performed comparatively poorly in predicting plastid proteins. The NCC-based classifier achieves an accuracy of 86.90% with an MCC of 0.74, which is on par with the DIPEP model, although its sensitivity was lower. However, it achieves higher specificity (89.66%) and precision (89.06%), with the lowest RFP (10.94%) of all the models. Thus, for distinguishing plastid vs. non-plastid proteins, both the DIPEP and NCC classifiers could be used efficiently, as both achieve the best MCC of 0.74 with accuracies of about 87%. To check this further, we plot ROC curves for each of the models as discussed below. Note that Table 6 reports the overall performance of the prediction modules at an SVM threshold score of 0.0. Individual performances of these classifiers at all threshold values (-1.2 to 1.2) are available in the Supplementary Material (Additional file 1: Tables S3-S7).
Overall performance of various feature classifiers on an 'independent test' dataset for the identification of plastid vs. non-plastid proteins (phase-I)
(iii). Phase-II: SVM-based classification of plastid-type proteins
Table: Overall performance of various feature classifiers in 5-fold cross-validation for the classification of diverse plastid-types* (phase-II). SVM kernel type: RBF (γ = 246, C = 1, j = 2); RBF (γ = 225, C = 1, j = 2); RBF (γ = 210, C = 1, j = 2); RBF (γ = 5, C = 2, j = 3); RBF (γ = 37, C = 9, j = 1).
It is worth mentioning that prediction performance falls significantly in phase-II compared to phase-I. This might be because all of the plastid sub-classes share common targeting signals (e.g. the transit peptides), as they all still belong to the one class 'plastids', which may make it very difficult to distinguish their individual patterns by machine learning. However, the overall amino acid composition varied significantly among them (Figure 4, and Additional file 2: Figure S1), which contributed towards the respectable prediction accuracies shown in Table 8. Combined, the results show that the plastid types could be categorized computationally with a satisfactory level of performance, although the models need more refinement, which we plan to carry out in the future as more plastid-type training data are added to the various repositories.
Overall performance of various feature classifiers on an 'independent test' dataset for the classification of diverse plastid-types* (phase-II)
Overall, the above results suggest that it is possible to categorize plastid proteins into various plastid-types using machine learning approaches with moderate to high accuracy; the similarity-based module showed very low performance in this study. Although we achieved high prediction performance in phase-I to distinguish plastid vs. non-plastid proteins, the performances of the models developed in phase-II were less outstanding. For a first attempt to develop prediction models for plastid types based on their function, the level of accuracy achieved is satisfactory. One possible reason for the lower success level is that very few training sequences are available in classes such as chromoplast, etioplast and amyloplast, and almost none in other subtypes. Although experimental proteomics approaches have generated a considerable amount of data, more training data are needed to develop highly accurate and efficient prediction models. A second possible reason is that sequence differences among plastid-types might be very small, making it very challenging for machine learning modules to distinguish among them. We were able to achieve about 79% prediction accuracy in phase-II with an MCC of 0.44 and a precision of 63%, which shows that it is certainly possible to classify plastid-types through machine learning. With larger datasets and novel algorithmic approaches, we will refine these models in the future and make them available on the PLpred web server.
(iv). Comparison with existing plastid localization predictors
Overall performance comparison of our method with the existing web tools for predicting plastid proteins.
Plastids, found in plants and algae, are the major site of manufacture and storage of important chemical compounds used by the cell. In plants, they are differentiated into various forms, such as the chloroplast, chromoplast, etioplast and amyloplast, depending on the function they perform in the cell. Recent proteomics approaches have generated an adequate amount of protein data in each of these sub-classes. However, large-scale plastid proteomics has become difficult and is nearing saturation due to several constraints, as discussed. On the other hand, with the emphasis on genome sequencing and more and more data being generated rapidly, there is a need for accurate computational systems that could be used for genome-wide annotation of various plant genomes. To date, there is no prediction system that can be used to categorize plastid proteins into their various functional types. The current work is an attempt in that direction, in which we explore homology-based as well as machine learning approaches to classify plastid protein types.
The similarity-based approach showed very weak performance, indicating the need for and importance of machine learning algorithms. Our benchmark tests on diverse training and testing data showed that it is possible to develop prediction models that distinguish various plastid-types from their sequences alone. Our SVM-based method works in two phases: it first identifies a query protein as plastid or non-plastid with high accuracy and then further classifies the identified sequences into one of the four plastid subclasses under study. Although we will further refine the phase-II models as more data become available, the current method should be applicable to the annotation of various available proteomes.
List of abbreviations
SVM: Support Vector Machine
AAC: Amino acid composition
PseAAC: Pseudo amino acid composition
MCC: Matthews correlation coefficient
RBF: Radial Basis Function
The authors acknowledge the support to this study from faculty start-up funds to RK from NIMFFAB/Department of Biochemistry & Molecular Biology, and support to TW/RK from OSU's Provost Office interdisciplinary grant (#12), the iCREST Center for Bioinformatics and Computational Biology (http://icrest.okstate.edu/). Support to RV from USDA-NIFA grant number 2010-85605-20542 is duly acknowledged. We also thank Dr. Ulrich Melcher for reading a draft of the manuscript. The authors thank the anonymous referees for help in improving the research article.
Funding for the publication of this article has come from start-up funds account AA-1-51220, OSU.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 14, 2013: Proceedings of the Tenth Annual MCBIOS Conference. Discovery in a sea of data. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S14.
1. Kleffmann T, von Zychlinski A, Russenberger D, Hirsch-Hoffmann M, Gehrig P, Gruissem W, Baginsky S: Proteome dynamics during plastid differentiation in rice. Plant Physiology. 2007, 143 (2): 912-923.
2. Cui L, Veeraraghavan N, Richter A, Wall K, Jansen RK, Leebens-Mack J, Makalowska I, dePamphilis CW: ChloroplastDB: the Chloroplast Genome Database. Nucleic Acids Research. 2006, 34 (Database): D692-696.
3. Gewolb J: Bioengineering: plant scientists see big potential in tiny plastids. Science. 2002, 295: 258-259. 10.1126/science.295.5553.258.
4. Baginsky S, Grossmann J, Gruissem W: Proteome analysis of chloroplast mRNA processing and degradation. Journal of Proteome Research. 2007, 6 (2): 809-820. 10.1021/pr060473q.
5. Siddique MA, Grossmann J, Gruissem W, Baginsky S: Proteome analysis of bell pepper (Capsicum annuum L.) chromoplasts. Plant & Cell Physiology. 2006, 47 (12): 1663-1673. 10.1093/pcp/pcl033.
6. Balmer Y, Vensel WH, Cai N, Manieri W, Schurmann P, Hurkman WJ, Buchanan BB: A complete ferredoxin/thioredoxin system regulates fundamental processes in amyloplasts. Proc Natl Acad Sci USA. 2006, 103: 2988-2993. 10.1073/pnas.0511040103.
7. Andon NL, Hollingworth S, Koller A, Greenland AJ, Yates JR, Haynes PA: Proteomic characterization of wheat amyloplasts using identification of proteins by tandem mass spectrometry. Proteomics. 2002, 2 (9): 1156-1168. 10.1002/1615-9861(200209)2:9<1156::AID-PROT1156>3.0.CO;2-4.
8. Zeng Y, Pan Z, Ding Y, Zhu A, Cao H, Xu Q, Deng X: A proteomic analysis of the chromoplasts isolated from sweet orange fruits [Citrus sinensis (L.) Osbeck]. Journal of Experimental Botany. 2011, 62 (15): 5297-5309. 10.1093/jxb/err140.
9. Balmer Y, Vensel WH, DuPont FM, Buchanan BB, Hurkman WJ: Proteome of amyloplasts isolated from developing wheat endosperm presents evidence of broad metabolic capability. Journal of Experimental Botany. 2006, 57 (7): 1591-1602. 10.1093/jxb/erj156.
10. Dupont FM: Metabolic pathways of the wheat (Triticum aestivum) endosperm amyloplast revealed by proteomics. BMC Plant Biology. 2008, 8: 39. 10.1186/1471-2229-8-39.
11. Barsan C, Sanchez-Bel P, Rombaldi C, Egea I, Rossignol M, Kuntz M, Zouine M, Latche A, Bouzayen M, Pech JC: Characteristics of the tomato chromoplast revealed by proteomic analysis. Journal of Experimental Botany. 2010, 61: 2413-2431. 10.1093/jxb/erq070.
12. Baginsky S, Kleffmann T, von Zychlinski A, Gruissem W: Analysis of shotgun proteomics and RNA profiling data from Arabidopsis thaliana chloroplasts. J Proteome Res. 2005, 4: 637-640. 10.1021/pr049764u.
13. Kleffmann T, Hirsch-Hoffmann M, Gruissem W, Baginsky S: plprot: a comprehensive proteome database for different plastid types. Plant Cell Physiol. 2006, 47: 432-436. 10.1093/pcp/pcj005.
14. Peltier JB, Cai Y, Sun Q, Zabrouskov V, Giacomelli L, Rudella A, Ytterberg AJ, Rutschow H, van Wijk KJ: The oligomeric stromal proteome of Arabidopsis thaliana chloroplasts. Mol Cell Proteomics. 2006, 5: 114-133.
15. Sun Q, Zybailov B, Majeran W, Friso G, Olinares PD, van Wijk KJ: PPDB, the Plant Proteomics Database at Cornell. Nucleic Acids Research. 2009, 37 (Database): D969-974. 10.1093/nar/gkn654.
16. Emanuelsson O, Nielsen H, Brunak S, von Heijne G: Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol. 2000, 300: 1005-1016. 10.1006/jmbi.2000.3903.
17. Kleffmann T, Russenberger D, von Zychlinski A, Christopher W, Sjolander K, Gruissem W, Baginsky S: The Arabidopsis thaliana chloroplast proteome reveals pathway abundance and novel protein functions. Current Biology. 2004, 14: 354-362. 10.1016/j.cub.2004.02.039.
18. Richly E, Leister D: An improved prediction of chloroplast proteins reveals diversities and commonalities in the chloroplast proteomes of Arabidopsis and rice. Gene. 2004, 329: 11-16.
19. Nair R, Rost B: Mimicking cellular sorting improves prediction of subcellular localization. J Mol Biol. 2005, 348: 85-100. 10.1016/j.jmb.2005.02.025.
20. Jarvis P, Robinson C: Mechanisms of protein import and routing in chloroplasts. Current Biology. 2004, 14: R1064-R1077. 10.1016/j.cub.2004.11.049.
21. von Zychlinski A, Kleffmann T, Krishnamurthy N, Sjölander K, Baginsky S, Gruissem W: Proteome analysis of the rice etioplast: metabolic and regulatory networks and novel protein functions. Mol Cell Proteomics. 2005, 4 (8): 1072-1084. 10.1074/mcp.M500018-MCP200.
22. Dondoshansky WY: BLASTCLUST - BLAST score-based single-linkage clustering. 2000.
23. Chou KC, Shen HB: Predicting eukaryotic protein subcellular location by fusing optimized evidence-theoretic K-nearest neighbor classifiers. Journal of Proteome Research. 2006, 5: 1888-1897. 10.1021/pr060167c.
24. Chou KC, Shen HB: Hum-PLoc: a novel ensemble classifier for predicting human protein subcellular localization. Biochem Biophys Res Commun. 2006, 347: 150-157. 10.1016/j.bbrc.2006.06.059.
25. Briesemeister S, Blum T, Brady S, Lam Y, Kohlbacher O, Shatkay H: SherLoc2: A High-Accuracy Hybrid Method for Predicting Subcellular Localization of Proteins. Journal of Proteome Research. 2009, 8: 5363-5366. 10.1021/pr900665y.
26. Yu CS, Chen YC, Lu CH, Hwang JK: Prediction of protein subcellular localization. Proteins: Structure, Function, and Bioinformatics. 2006, 64 (3): 643-651. 10.1002/prot.21018.
27. Su EC, Chiu HS, Lo A, Hwang JK, Sung TY, Hsu WL: Protein subcellular localization prediction based on compartment-specific features and structure conservation. BMC Bioinformatics. 2007, 8: 330. 10.1186/1471-2105-8-330.
28. Casadio R, Martelli PL, Pierleoni A: The prediction of protein subcellular localization from sequence: a shortcut to functional genome annotation. Briefings in Functional Genomics. 2008, 7 (1): 63-73. 10.1093/bfgp/eln003.
29. Kaundal R, Saini R, Zhao PX: Combining Machine Learning and Homology-Based Approaches to Accurately Predict Subcellular Localization in Arabidopsis. Plant Physiology. 2010, 154: 36-54. 10.1104/pp.110.156851.
30. Kaundal R, Raghava GPS: RSLpred: an integrative system for predicting subcellular localization of rice proteins combining compositional and evolutionary information. Proteomics. 2009, 9 (9): 2324-2342. 10.1002/pmic.200700597.
31. Sahu SS, Panda G: A novel feature representation method based on Chou's pseudo amino acid composition for protein structural class prediction. Computational Biology and Chemistry. 2010, 34: 320-327. 10.1016/j.compbiolchem.2010.09.002.
32. Garg A, Bhasin M, Raghava GPS: Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search. Journal of Biological Chemistry. 2005, 280: 14427-14432. 10.1074/jbc.M411789200.
33. Chou KC: Prediction of protein cellular attributes using pseudo amino acid composition. Proteins. 2001, 43: 246-255. 10.1002/prot.1035.
34. Jiang X, Wei R, Zhang TL, Gu Q: Using the concept of Chou's pseudo amino acid composition to predict apoptosis proteins subcellular location: an approach by approximate entropy. Protein Peptide Lett. 2001, 15: 392-396.
35. Zhang TL, Ding YS, Chou KC: Prediction protein structural classes with pseudo-amino acid composition: approximate entropy and hydrophobicity pattern. J Theor Biol. 2008, 250: 186-193. 10.1016/j.jtbi.2007.09.014.
36. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
37. Cortes C, Vapnik V: Support vector networks. Machine Learning. 1995, 20: 273-293.
38. Vapnik V: The Nature of Statistical Learning Theory. 1995, Springer, New York.
39. Hua S, Sun Z: Support vector machine approach for protein subcellular localization prediction. Bioinformatics. 2001, 17: 721-728. 10.1093/bioinformatics/17.8.721.
40. Park KJ, Kanehisa M: Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics. 2003, 19: 1656-1663. 10.1093/bioinformatics/btg222.
41. Bhasin M, Raghava GPS: ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Research. 2004, 32: 414-419. 10.1093/nar/gkh350.
42. Xie D, Li A, Wang M, Fan Z, Feng H: LOCSVMPSI: a web server for subcellular localization of eukaryotic proteins using SVM and profile of PSI-BLAST. Nucleic Acids Research. 2005, 33: 105-110.
43. Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M, Haussler D: Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci. 2000, 97: 262-267. 10.1073/pnas.97.1.262.
44. Ward JJ, McGuffin LJ, Buxton BF, Jones DT: Secondary structure prediction with support vector machines. Bioinformatics. 2003, 19: 1650-1655. 10.1093/bioinformatics/btg223.
45. Ding CHQ, Dubchak I: Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics. 2001, 17 (4): 349-358. 10.1093/bioinformatics/17.4.349.
46. Kaundal R, Kapoor AS, Raghava GPS: Machine learning techniques in disease forecasting: a case study on rice blast prediction. BMC Bioinformatics. 2006, 7: 485. 10.1186/1471-2105-7-485.
47. Cai YD, Zhou GP, Chou KC: Support vector machines for predicting membrane protein types by using functional domain composition. J Biophys. 2003, 84: 3257-3263. 10.1016/S0006-3495(03)70050-2.
48. Joachims T: Advances in Kernel Methods - Support Vector Learning. Edited by: Schölkopf B, Burges C, Smola A. 1999, MIT-Press, Massachusetts, 41-56.
49. Cedano J, Aloy P, Perez-Pons JA, Querol E: Relation Between Amino Acid Composition and Cellular Location of Proteins. Journal of Molecular Biology. 1997, 266: 594-600. 10.1006/jmbi.1996.0804.
50. Benedito VA, Li H, Dai X, Wandrey M, He J, Kaundal R, Torres-Jerez I, Gomez SK, Harrison MJ, Tang Y, Zhou P, Udvardi M: Genomic inventory and transcriptional analysis of Medicago truncatula transporters. Plant Physiology. 2010, 152 (3): 1716-1730. 10.1104/pp.109.148684.
51. Andrade MA, O'Donoghue SI, Rost B: Adaptation of Protein Surfaces to Subcellular Location. Journal of Molecular Biology. 1998, 276: 517-525. 10.1006/jmbi.1997.1498.
52. Emanuelsson O, Brunak S, von Heijne G, Nielsen H: Locating proteins in the cell using TargetP, SignalP and related tools. Nature Protocols. 2007, 2: 953-971. 10.1038/nprot.2007.131.
53. Horton P, Park KJ, Obayashi T, Fujita N, Harada H, Adams-Collier CJ, Nakai K: WoLF PSORT: protein localization predictor. Nucleic Acids Research. 2007, 35: W585-W587. 10.1093/nar/gkm259.
54. Briesemeister S, Rahnenführer J, Kohlbacher O: YLoc - an interpretable web server for predicting subcellular localization. Nucleic Acids Research. 2010, 38: W497-W502. 10.1093/nar/gkq477.
55. Wu ZC, Xiao X, Chou KC: iLoc-Plant: a multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites. Molecular Biosystems. 2011, 7: 3287-3297. 10.1039/c1mb05232b.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.