- Research article
- Open Access
pSLIP: SVM based protein subcellular localization prediction using multiple physicochemical properties
© Sarda et al; licensee BioMed Central Ltd. 2005
- Received: 01 December 2004
- Accepted: 17 June 2005
- Published: 17 June 2005
Protein subcellular localization is an important determinant of protein function and hence, reliable methods for prediction of localization are needed. A number of prediction algorithms have been developed based on amino acid compositions or on the N-terminal characteristics (signal peptides) of proteins. However, such approaches lead to a loss of contextual information. Moreover, where information about the physicochemical properties of amino acids has been used, the methods employed have been less than optimal and could exploit that information more effectively.
In this paper, we propose a new algorithm called pSLIP which uses Support Vector Machines (SVMs) in conjunction with multiple physicochemical properties of amino acids to predict protein subcellular localization in eukaryotes across six different locations, namely, chloroplast, cytoplasmic, extracellular, mitochondrial, nuclear and plasma membrane. The algorithm was applied to the dataset provided by Park and Kanehisa and we obtained prediction accuracies for the different classes ranging from 87.7% – 97.0% with an overall accuracy of 93.1%.
This study presents a physicochemical property based protein localization prediction algorithm. Unlike other algorithms, contextual information is preserved by dividing the protein sequences into clusters. The prediction accuracy shows an improvement over other algorithms based on various types of amino acid composition (single, pair and gapped pair). We have also implemented a web server to predict protein localization across the six classes (available at http://pslip.bii.a-star.edu.sg/).
- Support Vector Machine
- Feature Vector
- Amino Acid Composition
- Binary Classifier
- Vote Scheme
One of the biggest challenges facing biologists today is the structural and functional classification and characterization of protein sequences. For example, in humans, the proteins whose structures and functions are unknown make up more than 40% of the total number of proteins. As a result, extensive research over the past couple of decades has been devoted to identifying the structures and functions of proteins.
It is well known that the subcellular localization of proteins plays a crucial role in their functions. A number of computational approaches have been developed over the years to predict the localization of proteins, including recent works [2–12].
Initial efforts relied on amino acid compositions [13, 14], the prediction of signal peptides [15–19] or a combination of both [20, 21]. Later efforts were targeted at incorporating sequence order information (in the form of dipeptide compositions etc.) in the prediction algorithms [22–27].
There are drawbacks associated with all these methods. For example, prediction algorithms based on amino acid compositions suffer from a loss of contextual information. As a result, sequences which are completely different in function and localization but have very similar amino acid compositions would be predicted as belonging to the same region of the cell. On the other hand, approaches that rely on predicting signal peptides can lead to inaccurate predictions when the signals are missing or only partially included.
Recent efforts have also focused on the use of physicochemical properties to predict subcellular localization of proteins [28, 29]. Bhasin et al. created an algorithm which was a hybrid of four different predictive methods. In addition to using amino acid composition and dipeptide composition information, they also included 33 different physicochemical properties of amino acids, averaged over the entire protein. However, such a globally averaged value again leads to a loss of contextual information. Bickmore et al. studied the characteristics of the primary sequences of different proteins and concluded that motifs and domains are often shared amongst proteins co-localized within the same sub-nuclear compartment. Since the structure, and hence the function, of proteins is dictated by the different interacting physicochemical properties of their constituent amino acids, it stands to reason that co-localized proteins must share some conservation in those properties.
In this paper, we present a new algorithm called pSLIP: Prediction of Subcellular Localization in Proteins. We use multiple physicochemical properties of amino acids to obtain protein extracellular and subcellular localization predictions. A series of SVM based binary classifiers along with a new voting scheme enables us to obtain high prediction accuracies for six different localizations.
The number of proteins in the dataset. * These classes were not considered as they have too few proteins to achieve reliable training.
Sensitivity (sens) and Specificity (spec) (in %) on Park and Kanehisa's dataset. The first two columns show results from Park and Kanehisa's algorithm obtained by 5-fold cross-validation. The next column shows results from Chou and Cai's work obtained using the leave-one-out test. The last set of results is from our algorithm obtained using 5-fold and 10-fold cross-validation.
The results reported by Park and Kanehisa were obtained after 5-fold cross-validation testing. To ensure fairness in comparing results, we ran a 5-fold test on our algorithm. As is apparent from Table 2, our method provides a good overall accuracy of 89.5%, which is significantly higher than the 78.2% and 79.1% obtained for the two different cases in Park and Kanehisa's paper. Even more interesting is the fact that the accuracies obtained by Park and Kanehisa are skewed towards those locations that have the largest number of proteins in the dataset, viz., nuclear and plasma membrane. Total accuracies can sometimes present a misleading picture of the efficacy of a classification technique. Local accuracies, on the other hand, can provide a more realistic view of classification efficiency. We obtained a local accuracy of 88.7%, which is only slightly less than the overall accuracy (89.5%) of the technique. In contrast, the local accuracies obtained by Park and Kanehisa are significantly lower than the corresponding total accuracies (57.9% and 68.5% compared with total accuracies of 78.2% and 79.1%, respectively).
Chou and Cai used the leave one out cross validation (LOO-CV) test to assess the performance of their GO-FunD-PseAA predictor. For reasons described later, we have used only NF-CV tests. To make a reasonable comparison with their results, we ran a 10-fold test, which provides a good trade-off between bias and variance in test results. As the results in Table 2 show, our algorithm performs as well as the GO-FunD-PseAA predictor and the obtained accuracy of 93.1% compares favorably with the 92.4% accuracy obtained by Chou and Cai. Although Chou and Cai's work tackles the harder problem of classifying over more subcellular locations than we do, the results do show the promise of using physicochemical properties for localization prediction.
Cluster-wise Specificity (spec) and Sensitivity (sens) (in %) for pSLIP using 10-fold cross validation.
Cross validation experiments are frequently prone to an optimistic bias. This occurs because the experimental setup can be such that the choice of the learning machine's parameters becomes dependent on the test data. We have tried to minimize this effect by using only a small subset (ninety sequences of each type) of the available sequences for parameter search, as described later in this paper. As a further experiment, we have also carried out an independent dataset (ID) test using the eukaryotic sequences dataset developed by Reinhardt and Hubbard. This dataset has also been widely used for subcellular localization studies. Instead of doing cross-validation testing on this dataset, we use the SVM classifiers generated by our method from Park and Kanehisa's dataset and predict the subcellular localization of all sequences in the Reinhardt and Hubbard dataset.
Classification performance (sensitivity) (in %) on Reinhardt and Hubbard's dataset ; NF-CV: Results are given by N-fold cross validation. LOO-CV: Results are given by leave one out cross validation test. ID: Results are given by directly testing entire dataset, without any training on this dataset.
The results in Table 4 illustrate the importance of incorporating sequence order information in the classification method. The first two methods ignore order information entirely and we believe that their prediction accuracy suffers as a result of this. Furthermore, although the prediction accuracies of Esub8 and ESLpred are better than those of our method, it must be borne in mind that these results are from training and testing on the same dataset while our results are from training the classifiers on a different dataset. It must be noted however that the prediction accuracies for mitochondrial proteins, which are notoriously difficult to predict, are significantly higher using our method than any of the other methods (85.3% as compared to accuracies between 56% and 68.2% for the other methods).
The GO-FunD-PseAA predictor, whose classification performance on the Park and Kanehisa dataset is shown in Table 2, has also been tested on the Reinhardt and Hubbard dataset. The predictor performs well on this dataset too and yields the highest total accuracy of 92.9% using the rigorous leave one out cross validation test. However, we could not include these results in Table 4 since they do not provide a subcellular location-wise breakdown of prediction performance.
We have implemented our algorithm for predicting subcellular localizations as a web server which can be accessed at http://pslip.bii.a-star.edu.sg/.
Protein subcellular localization has been an active area of research due to the important role it plays in indicating, if not determining, protein function. A number of efforts have previously used amino acid compositions as well as limited sequence order information in order to predict protein localization. In this work, we have developed a novel approach based on using multiple physicochemical properties. In order to use sequence order information, we divide the set of proteins into four different clusters based on their lengths. Within each cluster, proteins are mapped onto the lowest length in that cluster (50, 150, 300 and 450 for the four clusters).
We then developed multiple binary classifiers for each cluster. For each protein, the output from each binary classifier was encoded as a binary bit sequence to form a meta-dataset. To predict the localization of a query protein, a similar binary sequence was generated based on the outputs of the different binary classifiers and the nearest neighbor to this protein was sought in the meta-dataset.
We obtained significantly higher classification accuracies (93.1% overall and 92.1% local) for the Park and Kanehisa dataset. The prediction accuracies obtained for mitochondrial and extracellular proteins in particular are among the highest that have been achieved so far.
The clustered approach was chosen not only to include sequence order information beyond that of di-, tri- and tetra-peptide information but also to mitigate the effects of over-averaging. One of the problems we encountered was the small number of proteins of length greater than 1350. As a result, these were averaged down to a base length of 450, leading to a drop in accuracies for the 450 cluster. Obviously, larger datasets with more representative samples in the length range greater than 1350 might yield greater accuracies.
We used the protein sequences dataset created by Park and Kanehisa. The dataset consists of 7579 eukaryotic proteins drawn from the SWISS-PROT database and classified into twelve subcellular locations. The protein sequences were classified based on the keywords found in the CC (comments or notes) and OC (organism classification) fields of SWISS-PROT. Proteins annotated with multiple subcellular locations were not included in the dataset. Further, proteins containing B, X or Z in the amino acid sequence were excluded from the dataset. Finally, proteins with high sequence similarity (greater than 80%) were not chosen for inclusion.
Table 1 summarizes the number of sequences in each of the twelve subcellular locations. For some of the locations such as cytoskeleton, there were too few sample sequences to achieve reliable training accuracies using SVM, the machine learning algorithm used in this work. Hence, we considered only sequences of type: chloroplast, cytoplasmic, extracellular, mitochondrial, nuclear and plasma membrane resulting in a dataset with 7106 eukaryotic protein sequences.
Support vector machine
The concept of Support Vector Machines (SVM) was first introduced by Vapnik [35, 36] and in recent times, the SVM approach has been used extensively in the areas of classification and regression. SVM is a learning algorithm which, upon training with a set of positively and negatively labeled samples, produces a classifier that can then be used to identify the correct label for unlabeled samples. SVM builds a classifier by constructing an optimal hyperplane that divides the positively and the negatively labeled samples with the maximum margin of separation. Each sample is described by a feature vector. Typically, training samples are not linearly separable. Hence, the feature vectors of all training samples are first mapped to a higher dimensional space H and an optimal dividing hyperplane is sought in this space.
The SVM algorithm requires solving a quadratic optimization problem. To simplify the problem, SVM does not explicitly map the feature vectors of all the samples to the space H. Instead, the mapping is done implicitly by defining a kernel function between two samples with feature vectors x_i and x_j as:

K(x_i, x_j) = Φ(x_i) · Φ(x_j)

where Φ is the mapping to the space H.
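The kernel is evaluated directly in the input space, never via an explicit Φ. As an illustration (the text does not name its kernel at this point, but the γ parameter selected later is consistent with a radial basis function kernel), a minimal sketch:

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2).

    This equals the inner product of Phi(x) and Phi(y) in an implicit
    high-dimensional space H, computed without constructing Phi."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

Identical feature vectors give K = 1, and the kernel value decays toward zero as the vectors move apart, at a rate controlled by γ.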
For a detailed description of the mathematics behind SVM, we refer the reader to an article by Burges. For the present study, we used the SVMlight package (version 6.01) created by Joachims. The package is available online and is free for scientific use.
Multi class SVM
A multi-class classification problem, such as the subcellular localization problem, is typically solved by reducing the multi-class problem into a series of binary classification problems. In the method employed here, called the 1-vs-1 method, a binary classifier is constructed for each pair of classes. Thus, a c-class problem is transformed into several two-class problems: one for each pair of classes (i, j), 1 ≤ i, j ≤ c, i ≠ j. We use the notation (i, j) to refer to both the binary classification problem of separating samples of classes i and j and the SVM classifier which is used to solve this problem. The classifier for the two-class problem (i, j) is trained with samples of classes i and j, ignoring samples from all other classes.
We term those classifiers which are trained to differentiate the true class of the test sample from other classes relevant classifiers, while the remaining classifiers are termed irrelevant classifiers. For example, in a three-class problem with classes a, b and c, if the (as yet unknown) true class of a test sample is a, then the relevant classifiers would be (a, b) and (a, c) while the classifier (b, c) would be irrelevant.
An unlabeled test sample is tested against all the c (c - 1) /2 binary classifiers and the predictions from each of these classifiers are combined by some algorithm to assign a class label to the sample. The design of the combining algorithm should be such that the predictions from the relevant classifiers gain precedence over those from the irrelevant classifiers. A simple voting scheme is one such algorithm that has been used earlier . In this scheme, a vote is assigned to a class every time a classifier predicts that the test sample belongs to that class. The class with the maximum number of votes is deemed to be the true class of the sample.
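The simple voting scheme can be sketched as follows, assuming each of the c(c − 1)/2 pairwise classifiers is a callable that returns one of its two classes (the names here are illustrative, not the paper's code):

```python
from itertools import combinations
from collections import Counter

def vote(sample, classifiers, classes):
    """Combine 1-vs-1 predictions by simple voting.

    classifiers[(i, j)] is assumed to be a callable returning either
    class i or class j for the given sample; the class collecting the
    most votes is deemed the predicted class."""
    votes = Counter()
    for i, j in combinations(classes, 2):
        votes[classifiers[(i, j)](sample)] += 1
    return votes.most_common(1)[0][0]
```

With three classes this queries the three classifiers (a, b), (a, c) and (b, c) and returns the majority label.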
The prediction performance of the voting scheme approach relies on the assumption that the relevant classifiers for the unlabeled sample perform very well and the number of votes they cast in favor of the true class outnumber the number of votes obtained by any other class from the irrelevant classifiers. In practice, we found this not to be the case. Some of the relevant classifiers performed poorly and a wrong class frequently got the highest number of votes by virtue of many irrelevant classifiers voting for it.
Thus, after testing a sample with all the classifiers, we get an encoded representation of the classifier predictions in the form of a bit sequence. Since this sample is a training sample, its true class (or label) is known. The true class and the bit sequence together constitute a meta-data instance derived from the training sample. The collection of all such meta-data instances derived from all the training samples is termed the meta-data set. Figure 1 provides an illustration of this process.
Truth table for the Exclusive OR (XOR) operator. Thus, for example, 101 XOR 011 would be 110.
After the sequence from the meta-data set that is closest to the test sequence is identified, we assign its known class label to the test sample. In case multiple (equidistant) sequences are found in the nearest neighbor search, resulting in a tie between two or more classes, we pick the class which would have received the maximum number of votes among the tied classes, the votes being counted according to the voting scheme described earlier. By implementing this meta-classification approach, we found an improvement in accuracy of 10%–15% over the voting method.
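The nearest neighbor search over the meta-data set reduces to comparing bit sequences: the XOR of two sequences marks the positions where they disagree, and counting the set bits gives the Hamming distance. A minimal sketch (the vote-based tie-break is omitted for brevity):

```python
def hamming(a, b):
    """Distance between two equal-length bit strings via XOR:
    101 XOR 011 = 110, which has two set bits, so the distance is 2."""
    return sum(x != y for x, y in zip(a, b))

def nearest_class(query_bits, meta_dataset):
    """meta_dataset: list of (bit_string, class_label) pairs.

    Returns the label of the closest meta-data instance; resolving
    ties between equidistant instances is left to the voting scheme."""
    best = min(meta_dataset, key=lambda inst: hamming(query_bits, inst[0]))
    return best[1]
```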
Many previous efforts have used amino acid composition as the feature to determine protein subcellular localization. In these efforts, the feature vector corresponding to an amino acid sequence is typically a 20-dimensional vector with each element of the vector representing the frequency of occurrence of an amino acid in that particular sequence. As highlighted earlier, this approach leads to a complete loss of sequence order information. On the other hand, the averaging of physicochemical properties over the entire length of the protein sequence also results in a loss of sequence order information. We believe that sequences that are co-localized must share some similarity across certain physicochemical properties, regardless of their length.
To overcome the shortcomings of earlier efforts, we employ a novel method of building feature vectors based on a previously proposed idea. Consider an amino acid sequence of length L. Suppose we wish to use M different physicochemical properties in the feature representation of the amino acid sequence. Corresponding to each amino acid i (i = 1, ..., 20), we build a property vector p_i = (p_i1, ..., p_iM), the vector of normalized values of the M physicochemical properties for amino acid i. Then, for the sequence of length L, we concatenate the property vectors of each of the amino acids in the sequence in succession to get a vector of dimension L × M.
While this method allows us to build feature vectors using any number of desired physicochemical properties, it results in vectors whose dimension is a function of the length of the amino acid sequence. One of the problems with using physicochemical properties averaged over the entire sequence length, in localization prediction efforts so far, has been the difference in lengths of the different protein sequences. Since SVM requires equal length feature vectors, this has always been a deterrent to utilizing sequence order information. Hence, we apply a local averaging process to scale all generated feature vectors to a standard dimension.
Suppose we take our standard dimension to be K and wish to construct a feature vector of this dimension for a protein sequence of length L. For now, let us assume that we are building amino acid property vectors using just one physicochemical property, so the feature vector obtained by concatenating property vectors as described above will also be of dimension L. We then sequentially group the feature vector's elements such that we end up with K groups or partitions of the vector. Within each partition, we take the average value of the constituent elements and then build a vector out of these averaged values. This operation constructs a many-to-one mapping and can be thought of conceptually as reducing the protein sequence of length L to a standard length l, where here l = K.
If we use M physicochemical properties instead of one, a conceptual scaling of a protein sequence of length L to a standard length l is equivalent to scaling the feature vector (of length L × M) to a standard dimension of K = l × M. To do this, we partition the amino acid sequence of length L into l nearly equal parts and then take the average value of the property vectors within these partitions.
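The local averaging step can be sketched per property; applying it to each of the M properties in turn yields the K = l × M dimensional feature vector:

```python
def scale_properties(props, l):
    """Average a per-residue property list of length L down to l values.

    props holds one physicochemical property value per residue; the list
    is split into l nearly equal partitions and each partition is replaced
    by its mean (assumes l <= len(props) so no partition is empty)."""
    L = len(props)
    out = []
    for k in range(l):
        # boundaries of the k-th of l nearly equal partitions
        start, end = k * L // l, (k + 1) * L // l
        part = props[start:end]
        out.append(sum(part) / len(part))
    return out
```

For example, a six-residue property list averaged down to three values pairs up consecutive residues.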
There is a loss of sequence order information due to this averaging process, and this loss is significant when scaling very long proteins down to a much shorter length. To minimize this information loss, we divided our dataset into clusters, defining a base length for each cluster. This base length is equivalent to the conceptual protein sequence standard length l. Within each cluster, no sequence has a length less than l and the length of the longest sequence is no greater than three times l. Initially, we built clusters using the base lengths (10, 30, 90, 270, 810, 2430) but this resulted in an uneven population distribution of sequences across clusters that caused problems in the SVM training stage. We adjusted the cluster sizes and finally chose the base lengths (50, 150, 300, 450). For sequences of length less than fifty, we extended their length to fifty by suitably repeating residues. For example, if the sequence AMKMSF of length six needs to be scaled to a length of ten, we would repeat residues to get AAMMKKMMSF.
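One consistent reading of the residue-repetition example is that each of the first (target − L) residues is repeated once. The text does not state the exact rule; this sketch reproduces AMKMSF → AAMMKKMMSF and assumes the target length is at most twice the sequence length:

```python
def extend_sequence(seq, target_len):
    """Extend a short sequence to target_len by duplicating each of its
    first (target_len - len(seq)) residues in place.

    Assumes target_len <= 2 * len(seq), as in the paper's example."""
    extra = target_len - len(seq)
    out = []
    for i, residue in enumerate(seq):
        out.append(residue)
        if i < extra:
            out.append(residue)  # duplicate this leading residue
    return "".join(out)
```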
Further, the SVM optimization model employs a regularization parameter C which controls the trade-off between the margin of separation (between positively and negatively labeled samples) and the error in classification. Thus, parameter selection is repeated for each binary classifier within a cluster and then for each of the clusters.
The set of M physicochemical properties to represent the amino acid sequences
The value of γ for the kernel function
The value of the regularization parameter C
For the set of physicochemical properties, we used the Amino Acid index database  available at http://www.genome.jp/dbget/aaindex.html. An amino acid index is a set of 20 numerical values representing any of the different physicochemical properties of amino acids. This database currently contains 484 such indices.
We then build feature vectors using these top five indices and do a search over the C and γ space looking for the best performing combination of these parameters. At the end of this parameter search, we obtain the best combination of amino acid indices, C and γ for each classifier in each sequence cluster. We then look at which sequence cluster achieved the best overall prediction performance and pick the set of best parameters for classifiers in that cluster as the best set for all the clusters.
The set of top five amino acid indices for each of the classifiers as found using this parameter search for the Park and Kanehisa dataset have been provided [see Additional file 1].
To assess the prediction performance of the proposed algorithm, a cross-validation test must be performed. The three methods most often used for cross-validation are the independent dataset (ID) test, the leave one out cross validation (LOO-CV) test and the N-fold cross validation (NF-CV) test . Of the three, the LOO-CV test is considered to be the most rigorous and objective . Although bias-free, this test is very computationally demanding and is often impractical for large datasets. Further, it suffers from possibly high variance in results depending on the composition of the dataset and the characteristics of the classifier. The NF-CV test provides a bias-free estimate of the accuracy  at a much reduced computational cost and is considered an acceptable test for evaluating prediction performance of an algorithm .
In NF-CV tests, the dataset is divided into N parts with approximately equal number of samples in each part. The learning machine is trained with samples from N - 1 parts while the Nth part is used as testing set to calculate classification accuracies. The learning-testing process is repeated N times until each part has been used as a testing set once.
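The NF-CV procedure described above can be sketched as follows (a minimal fold-splitting routine; the actual training and accuracy computation are elided):

```python
import random

def n_fold_splits(samples, n, seed=0):
    """Shuffle indices and partition them into n nearly equal folds.

    Each fold serves once as the test set while the remaining n - 1
    folds form the training set, yielding n (train, test) index pairs."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n] for k in range(n)]
    for k in range(n):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]
```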
Since the numbers of protein sequence samples in the classes differ, during the training phase of a binary classifier the numbers of training samples in the two classes will not be equal. If the SVM is trained on these unequally sized sets, the resulting classifier will be inherently biased toward the more populous class; it is more likely to predict that a test sample belongs to that class. The greater the disparity in populations between the two classes, the more pronounced the bias. It is difficult to prevent this bias in the training stage without adjusting more parameters per classifier. To prevent this problem, we reduce the training set to an equisized set by randomly selecting m samples from the larger set, m being the size of the smaller set.
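A minimal sketch of this balancing step, downsampling each class to the size of the smaller one:

```python
import random

def balance(pos, neg, seed=0):
    """Randomly downsample both classes to m samples, where m is the
    size of the smaller class, so the binary SVM is not biased toward
    the more populous class."""
    rng = random.Random(seed)
    m = min(len(pos), len(neg))
    return rng.sample(pos, m), rng.sample(neg, m)
```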
To quantify the performance of our proposed algorithm, we use the widely used measures of Specificity and Sensitivity. Let N be the total number of proteins in the testing dataset and let k be the number of subcellular locations (classes). Let N_ij be the number of proteins of class i classified by the algorithm as belonging to class j.
The specificity, also called precision, for class i measures how many of the proteins classified as belonging to class i truly belong to that class: spec_i = N_ii / Σ_j N_ji.
The sensitivity, also called recall, for class i measures how many of the proteins truly belonging to class i were correctly classified as belonging to that class: sens_i = N_ii / Σ_j N_ij.
We further define Total Accuracy to measure how many proteins overall were correctly classified: Total Accuracy = (Σ_i N_ii) / N.
It is expected that the most populous classes will dominate the total accuracy measure, and a classifier biased towards those classes will perform well according to this measure even if the prediction performance for the smaller classes is poor. Hence, we consider another measure termed Location Accuracy, defined as the average of the per-class sensitivities: Location Accuracy = (1/k) Σ_i (N_ii / Σ_j N_ij).
Location accuracy reveals any poor performance by an individual classifier by providing a measure of how well the classification works for each class of proteins.
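All four measures can be computed from a confusion matrix; a sketch, assuming N[i][j] counts proteins of true class i predicted as class j (the formulas here are the standard precision/recall definitions these measures correspond to):

```python
def performance(N):
    """Per-class sensitivity and specificity, plus total and location
    accuracy, from a k x k confusion matrix N where N[i][j] is the
    number of class-i proteins predicted as class j.

    Assumes every true class and every predicted class is non-empty."""
    k = len(N)
    total = sum(sum(row) for row in N)
    sens = [N[i][i] / sum(N[i]) for i in range(k)]                       # recall
    spec = [N[i][i] / sum(N[j][i] for j in range(k)) for i in range(k)]  # precision
    total_acc = sum(N[i][i] for i in range(k)) / total
    loc_acc = sum(sens) / k  # average of per-class sensitivities
    return sens, spec, total_acc, loc_acc
```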
The definitions of Total Accuracy and Location Accuracy used here are equivalent to those used by Park and Kanehisa.
The authors would like to acknowledge Tariq Riaz and Jiren Wang for their inputs during several helpful discussions.
- Feng ZP: An overview on predicting subcellular location of a protein. Silico Biology 2002., 2(0027):Google Scholar
- Cai YD, Chou KC: Nearest neighbour algorithm for predicting protein subcellular location by combining functional domain composition and pseudo-amino acid composition. Biochemical and Biophysical Research Communications 2003, 305: 407–411. 10.1016/S0006-291X(03)00775-7View ArticlePubMedGoogle Scholar
- Cai YD, Zhou GP, Chou KC: Support vector machines for predicting membrane protein types by using functional domain composition. Biophysical Journal 2003, 84: 3257–3263.PubMed CentralView ArticlePubMedGoogle Scholar
- Cai YD, Chou KC: Predicting subcellular localization of proteins in a hybridization space. Bioinformatics 2004, 20: 1151–1156. 10.1093/bioinformatics/bth054View ArticlePubMedGoogle Scholar
- Chou KC, Cai YD: A new hybrid approach to predict subcellular localization of proteins by incorporating Gene ontology. Biochemical and Biophysical Research Communications 2003, 311: 743–747. 10.1016/j.bbrc.2003.10.062View ArticlePubMedGoogle Scholar
- Chou KC, Cai YD: Prediction and classification of protein subcellular location: sequence-order effect and pseudo amino acid composition. Journal of Cellular Biochemistry 2003, 90: 1250–1260. 10.1002/jcb.10719View ArticlePubMedGoogle Scholar
- Chou KC, Cai YD: Predicting subcellular localization of proteins by hybridizing functional domain composition and pseudo-amino acid composition. Journal of Cellular Biochemistry 2004, 91: 1197–1203. 10.1002/jcb.10790View ArticlePubMedGoogle Scholar
- Chou KC, Cai YD: Prediction of protein subcellular locations by GO-FunD-PseAA predictor. Biochemical and Biophysical Research Communications 2004, 320: 1236–1239. 10.1016/j.bbrc.2004.06.073View ArticlePubMedGoogle Scholar
- Pan YX, Zhang ZZ, Guo ZM, Feng GY, Huang ZD, He L: Application of pseudo amino acid composition for predicting protein subcellular location: stochastic signal processing approach. Journal of Protein Chemistry 2003, 22: 395–402. 10.1023/A:1025350409648View ArticlePubMedGoogle Scholar
- Wang M, Yang J, Liu G, Xu ZJ, Chou KC: Weighted-support vector machines for predicting membrane protein types based on pseudo amino acid composition. Protein Engineering, Design and Selection 2004, 17: 509–516. 10.1093/protein/gzh061View ArticlePubMedGoogle Scholar
- Wang M, Yang J, Xu ZJ, Chou KC: SLLE for predicting membrane protein types. Journal of Theoretical Biology 2004, 232: 7–15. 10.1016/j.jtbi.2004.07.023View ArticleGoogle Scholar
- Zhou ZP, Doctor K: Subcellular location prediction of apoptosis proteins. Proteins: Structure, Function and Genetics 2003, 50: 44–48. doi:10.1002/prot.10251
- Hua S, Sun Z: Support vector machine approach for protein subcellular localization prediction. Bioinformatics 2001, 17(8): 721–728. doi:10.1093/bioinformatics/17.8.721
- Reinhardt A, Hubbard T: Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res 1998, 26(9): 2230–2236. doi:10.1093/nar/26.9.2230
- Claros M, Vincens P: Computational method to predict mitochondrially imported proteins and their targeting sequences. Eur J Biochem 1996, 241: 779–786. doi:10.1111/j.1432-1033.1996.00779.x
- Emanuelsson O, Nielsen H, Brunak S, von Heijne G: Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. Journal of Molecular Biology 2000, 300(4): 1005–1016. doi:10.1006/jmbi.2000.3903
- Emanuelsson O, Nielsen H, von Heijne G: ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Sci 1999, 8: 978–984.
- Fujiwara Y, Asogawa M, Nakai K: Prediction of mitochondrial targeting signals using hidden Markov models. In Genome Informatics 1997. Edited by: Miyano S, Takagi T. Japanese Society for Bioinformatics, Tokyo: Universal Academy Press; 1997: 53–60.
- Predotar: A prediction service for identifying putative mitochondrial and plastid targeting sequences. 1997. [http://www.inra.fr/predotar/]
- Nakai K, Horton P: PSORT: a program for detecting the sorting signals of proteins and predicting their subcellular localization. Trends Biochem Sci 1999, 24: 34–35. doi:10.1016/S0968-0004(98)01336-X
- Park KJ, Kanehisa M: Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 2003, 19(13): 1656–1663. doi:10.1093/bioinformatics/btg222
- Chou KC, Zhang CT: Predicting protein folding types by distance functions that make allowances for amino acid interactions. J Biol Chem 1994, 269(35): 22014–22020.
- Chou KC: A novel approach to predicting protein structural classes in a (20–1)-D amino acid composition space. Proteins 1995, 21(4): 319–344. doi:10.1002/prot.340210406
- Chou KC, Elrod DW: Prediction of membrane protein types and subcellular locations. Proteins 1999, 34: 137–153. doi:10.1002/(SICI)1097-0134(19990101)34:1<137::AID-PROT11>3.0.CO;2-O
- Chou KC, Elrod DW: Protein subcellular location prediction. Protein Eng 1999, 12(2): 107–118. doi:10.1093/protein/12.2.107
- Chou KC: Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins 2001, 43(3): 246–255. doi:10.1002/prot.1035
- Cui Q, Jiang T, Liu B, Ma S: Esub8: A novel tool to predict protein subcellular localizations in eukaryotic organisms. BMC Bioinformatics 2004, 5: 66. doi:10.1186/1471-2105-5-66
- Chou KC: Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochem Biophys Res Comm 2000, 278(2): 477–483. doi:10.1006/bbrc.2000.3815
- Feng ZP, Zhang CT: Prediction of the subcellular location of prokaryotic proteins based on the hydrophobicity index of amino acids. International Journal of Biological Macromolecules 2001, 28: 255–261. doi:10.1016/S0141-8130(01)00121-0
- Bhasin M, Raghava GPS: ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucl Acids Res 2004, 32: W414–419.
- Bickmore W, Sutherland H: Addressing protein localization within the nucleus. EMBO J 2002, 21(6): 1248–1254. doi:10.1093/emboj/21.6.1248
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene Ontology: tool for the unification of biology. Nature Genetics 2000, 25: 25–29. doi:10.1038/75556
- Chou KC, Cai YD: Using functional domain composition and support vector machines for prediction of protein subcellular location. Journal of Biological Chemistry 2002, 277: 45765–45769. doi:10.1074/jbc.M204161200
- Scheffer T, Herbrich R: Unbiased assessment of learning algorithms. In IJCAI-97; 1997: 798–803.
- Vapnik V: The Nature of Statistical Learning Theory. Springer; 1995.
- Vapnik V: Statistical Learning Theory. Wiley; 1998.
- Burges CJC: A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 1998, 2(2): 121–167. doi:10.1023/A:1009715923555
- Joachims T: Making large-scale support vector machine learning practical. In Advances in Kernel Methods: Support Vector Machines. Edited by: Schölkopf B, Burges C, Smola A. Cambridge, MA: MIT Press; 1998.
- Savicky P, Fürnkranz J: Combining pairwise classifiers with stacking. In Advances in Intelligent Data Analysis V. Edited by: Berthold M, Lenz H, Bradley E, Kruse R, Borgelt C. Springer; 2003: 219–229.
- Allwein EL, Schapire RE, Singer Y: Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research 2001, 1: 113–141. doi:10.1162/15324430152733133
- Bock JR, Gough DA: Predicting protein-protein interactions from primary structure. Bioinformatics 2001, 17(5): 455–460. doi:10.1093/bioinformatics/17.5.455
- Kawashima S, Kanehisa M: AAindex: amino acid index database. Nucleic Acids Res 2000, 28: 374. doi:10.1093/nar/28.1.374
- Chou KC, Zhang CT: Prediction of protein structural classes. Crit Rev Biochem Mol Biol 1995, 30(4): 275–349.
- Mardia KV, Kent JT, Bibby JM: Multivariate Analysis. London: Academic Press; 1979: 322–381.
- Stone M: Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B 1974, 36: 111–147.
- Kohavi R: Wrappers for performance enhancement and oblivious decision graphs. PhD thesis. Stanford University; 1995.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.