Protein subcellular localization prediction for Gram-negative bacteria using amino acid subalphabets and a combination of multiple support vector machines

Background: Predicting the subcellular localization of proteins is important for determining the function of proteins. Previous works focused on predicting protein localization in Gram-negative bacteria obtained good results. However, these methods had relatively low accuracies for the localization of extracellular proteins. This paper studies ways to improve the accuracy for predicting extracellular localization in Gram-negative bacteria.
Results: We have developed a system for predicting the subcellular localization of proteins for Gram-negative bacteria based on amino acid subalphabets and a combination of multiple support vector machines. The recall of the extracellular site and overall recall of our predictor reach 86.0% and 89.8%, respectively, in 5-fold cross-validation. To the best of our knowledge, these are the most accurate results for predicting subcellular localization in Gram-negative bacteria.
Conclusion: Clustering 20 amino acids into a few groups by the proposed greedy algorithm provides a new way to extract features from protein sequences to cover more adjacent amino acids and hence reduce the dimensionality of the input vector of protein features. It was observed that a good amino acid grouping leads to an increase in prediction performance. Furthermore, a proper choice of a subset of complementary support vector machines constructed by different features of proteins maximizes the prediction accuracy.


Background
Subcellular localization is a key functional attribute of a protein. Since cellular functions are often localized in specific compartments, predicting the subcellular localization of unknown proteins may be used to obtain useful information about their functions and to select proteins for further study. Moreover, studying the subcellular localization of proteins is also helpful in understanding disease mechanisms and for developing novel drugs.
As a result of large-scale genome sequencing efforts in recent years, protein data has accumulated in public data banks at an increasing rate. Analyzing protein data to extract useful knowledge is thus essential for projects like automatic annotation. It is desirable to have an automated and reliable system for predicting subcellular localization of proteins from amino acid sequences.
A number of efforts have been made to predict protein subcellular localization. Most of these prediction methods can be classified into two categories: one based on the recognition of protein N-terminal sorting signals and the other based on amino acid compositions [22].
Previous work has focused on protein localization prediction for Gram-negative bacteria. There are five primary localization sites in Gram-negative bacteria: the cytoplasm, the extracellular space, the inner membrane, the outer membrane, and the periplasm. PSORT I [23] is the most widely used tool for predicting multiple localizations for Gram-negative bacteria. It uses biological knowledge represented by "if-then" rules for predicting protein localization sites. Most of these rules were derived from experimental observations. However, PSORT I does not consider the extracellular space site. Additionally, its overall recall on the data set of [24] reaches only 60.9%.
Gardy et al. [24] presented PSORT-B to improve the prediction performance of PSORT I. PSORT-B combines information of the amino acid composition, similarity to proteins of known localization, presence of a signal peptide, transmembrane alpha-helices and motifs corresponding to specific localizations for a given protein sequence, through a probabilistic approach. It returns a list of five possible localization sites with associated probability scores. It attains an overall recall of 74.8% for the same data set mentioned above.
Recently, Yu et al. [25] proposed a predictive system called CELLO for Gram-negative bacteria by using support vector machines based on n-peptide compositions. They classified 20 amino acids into four groups (charged, polar, aromatic and nonpolar) to reduce the dimensionality of the input vector. Forty SVM classifiers were used to predict the localization sites. Their overall recall was 88.9%, a significant improvement over the previous results of PSORT-B. However, the recall for extracellular proteins was still relatively low at 78.9%. This paper studies ways to improve the accuracy for predicting extracellular localization in Gram-negative bacteria. We explored a new way to extract features from protein sequences for protein localization prediction by clustering 20 amino acids into a few groups using a greedy algorithm. Our method for clustering 20 amino acids considers the factors of both amino acids' physical-chemical properties and their contextual correlations. In contrast, the method presented by Yu et al. classifies the 20 amino acids into 4 groups (charged, polar, aromatic and nonpolar) based on physical-chemical properties of amino acids alone. Instead of simply combining multiple SVMs to give a better prediction, we propose a selection score function and a greedy algorithm to select a subset of SVMs to maximize the prediction accuracy.
Based on the proposed approaches, we have developed a system called P-CLASSIFIER for predicting the subcellular localization of Gram-negative bacteria by using a combination of multiple support vector machines. This has resulted in an improvement in the recall for extracellular proteins from 78.9% in CELLO [25] (currently the best predicting system for Gram-negative bacteria) to 86.0% in P-CLASSIFIER. The overall recall of P-CLASSIFIER reaches 89.8%. To the best of our knowledge, these are the most accurate results for predicting protein subcellular localization in Gram-negative bacteria.

Results
The dataset used in this study is from [24] and was extracted from SWISS-PROT release 40.29 [26]. It contains 1441 proteins of experimentally determined localization, where 1302 proteins are resident at a single localization site and 139 proteins are resident at dual localization sites. Table 1 lists the number of protein sequences from different sites in the data set.
The prediction performance of our prediction system is estimated from a 5-fold cross-validation where the given training samples are randomly partitioned into 5 mutually exclusive sets of approximately equal size and approximately equal class distribution.
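As a rough illustration of such a partition, the following Python sketch deals the sample indices of each class into five folds in turn, so that fold sizes and class distributions stay approximately equal. The function name `stratified_folds`, the fixed seed, and the round-robin dealing are assumptions made for illustration; the paper does not state how its partition was generated.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Partition sample indices into k disjoint folds with similar class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    slot = 0  # keeps rotating across classes so total fold sizes stay balanced
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)          # random assignment within each class
        for idx in indices:
            folds[slot % k].append(idx)
            slot += 1
    return folds
```

Each fold serves once as the test set while the remaining four folds form the training set.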
Some protein sequences in the dataset contain the character "X". To avoid possible noise from ambiguous information, protein entries containing "X" in the protein sequence are excluded from the cross-validation training set but included in the testing set in this work. Table 2 shows the prediction recall for single localization. The recall is calculated as $\mathrm{TP}_X / (\mathrm{TP}_X + \mathrm{FN}_X)$, where $\mathrm{TP}_X$ and $\mathrm{FN}_X$ represent true positives (number of samples correctly classified as X) and false negatives (number of samples classified as not X that are actually X) over the predictive site X.
In the dataset, some proteins occur in two different subcellular localizations. Since we are comparing our combined classifier P-CLASSIFIER with the PSORT-B and CELLO classifiers, we followed their method in evaluating the classifier for proteins resident at dual localization sites, where we consider them as predicted correctly if one of their localization sites is predicted correctly. Table 3 shows the prediction recall for dual localizations.
The Matthews correlation coefficient [27] is used to measure the predictive performance for five predictive sites. The Matthews correlation coefficient (MCC) is defined by:
$$\mathrm{MCC}_X = \frac{\mathrm{TP}_X \cdot \mathrm{TN}_X - \mathrm{FP}_X \cdot \mathrm{FN}_X}{\sqrt{(\mathrm{TP}_X + \mathrm{FN}_X)(\mathrm{TP}_X + \mathrm{FP}_X)(\mathrm{TN}_X + \mathrm{FP}_X)(\mathrm{TN}_X + \mathrm{FN}_X)}}$$
where $\mathrm{TP}_X$, $\mathrm{TN}_X$, $\mathrm{FP}_X$, and $\mathrm{FN}_X$ are true positives, true negatives (the number of samples correctly predicted as not X that are actually not X), false positives (the number of samples incorrectly predicted as X that are actually not X), and false negatives of localization site X, respectively. MCC offers a comprehensive and robust measurement for the predictive performance as this measurement considers both under- and over-predictions. The value of MCC equals 1 for a perfect prediction, and 0 for a completely random assignment. Table 4 compares the performance of P-CLASSIFIER (our system), PSORT-B, and CELLO [25]. As shown in Table 4, the MCC values of all five sites in our system are greater than or equal to the values in CELLO, currently the best predicting system for Gram-negative bacteria. Moreover, we increase the recall for the extracellular site from 78.9% in CELLO to 86.0% in P-CLASSIFIER, a significant improvement for the extracellular site on the previous results. The overall recall of P-CLASSIFIER reaches 89.8%, which is better than previous results. To the best of our knowledge, these are the most accurate results for predicting Gram-negative bacteria localization.
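The per-site recall and MCC can be computed from one-vs-rest counts, as in the short Python sketch below. The function names are illustrative, the MCC uses the standard definition, and returning 0.0 when a denominator is zero is a convention chosen here rather than something stated in the paper.

```python
import math

def site_counts(actual, predicted, site):
    """Return (TP, TN, FP, FN) for one localization site, one-vs-rest."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == site and p == site:
            tp += 1
        elif a != site and p != site:
            tn += 1
        elif a != site and p == site:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def recall(tp, fn):
    """Recall = TP / (TP + FN); 0.0 if the site has no samples."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 if the denominator vanishes."""
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Because MCC uses all four counts, a site that is over-predicted is penalized even when its recall is high.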

Discussion
To computationally analyse protein data, the representation of protein sequences is an important issue. A good input representation makes it easier for the SVM to identify underlying regularities and therefore is crucial to the success of SVM learning.
In this paper, we encode protein sequences by using the patterns of one amino acid, two adjacent amino acids, three adjacent amino acids, and four adjacent amino acids.
As there are 8000 and 160000 different patterns for the three and four adjacent amino acids cases, clustering 20 amino acids into several groups provides a way to reduce the number of unique patterns since it is difficult to train the SVM with very large number of features such as 160000 for all possible patterns of four adjacent amino acids. Since amino acids in proteins do not contribute to the function of proteins independently and functional patterns in proteins are embedded as sequence correlations, amino acids may not be grouped based on their physical-chemical properties alone [28]. For the prediction task, a good amino acid grouping leads to an improvement in prediction performance.
It is observed that the prediction results from SVMs constructed by different lengths of adjacent amino acid patterns, e.g. the patterns of a single amino acid and amino acid pairs, are complementary. That is, there are some cases where the prediction made by the SVM constructed by patterns of some particular length is correct while the prediction made by the SVM constructed by patterns of another length is incorrect, and vice versa. Therefore, combining complementary results provides a way to improve the prediction accuracy. However, combining all complementary results together may not be a good choice. Therefore, we propose to choose a subset of complementary support vector machines properly that will maximize the prediction accuracy.
After analyzing the predictive results, it is observed that there are some protein sequences that cannot be predicted correctly by any SVM in the combined classifier. It means that these protein sequences cannot be correctly classified by their composition. This is the reason why the recall of some predictive sites in Gram-negative bacteria cannot be further improved.
Since we are comparing our combined classifier P-CLASSIFIER with the PSORT-B and CELLO classifiers, we use the same data set as theirs. We did not check the sequence redundancy in the dataset. As the level of sequence redundancy normally strongly affects prediction accuracy, removing those protein sequences which have high sequence identity (e.g. more than 40%) with each other in the dataset can avoid redundancy and bias.
Instead of giving full credit for dual-localized proteins if either of the sites is predicted correctly, we also evaluate the prediction performance by counting "half" correct when only one of the sites of dual-localized proteins is predicted correctly. Table 5 shows their prediction recalls. The full credit for dual-localized proteins is only given when the two localization sites with the top two associated probability scores match the actual dual localizations of the protein. The corresponding overall recall for predicting dual localizations only reaches 67.3%. Properly handling proteins that reside in several different sites is a challenging problem, which is addressed in [5].
Three methods are used to test predictive performance: the independent dataset test, the n-fold cross-validation test, and the leave-one-out cross-validation test. Among these methods, the leave-one-out cross-validation test is the most rigorous and objective [29,42]. However, the leave-one-out cross-validation test is very expensive computationally and is often impractical for large datasets. The n-fold cross-validation test provides a bias-free estimation of the accuracy [30] at much reduced computational cost and is considered an acceptable test for evaluating the predictive performance of an algorithm [31] for large datasets.
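The half-credit scheme for dual-localized proteins can be sketched in Python as follows. This is one reading of the description, under the assumption that credit is judged against the two top-ranked predicted sites; the function name `dual_credit` and the site abbreviations in the usage below are hypothetical.

```python
def dual_credit(actual_sites, ranked_predictions):
    """Score one dual-localized protein.

    actual_sites: the two experimentally determined sites.
    ranked_predictions: predicted sites sorted by decreasing probability score.
    Returns 1.0 if both top-two predictions match, 0.5 if one matches, else 0.0.
    """
    top_two = set(ranked_predictions[:2])
    hits = len(set(actual_sites) & top_two)
    if hits == 2:
        return 1.0
    if hits == 1:
        return 0.5
    return 0.0

def dual_recall(proteins):
    """Average credit over (actual_sites, ranked_predictions) pairs."""
    if not proteins:
        return 0.0
    return sum(dual_credit(a, r) for a, r in proteins) / len(proteins)
```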

Conclusion
This paper introduces a protein subcellular localization prediction method using amino acid subalphabets and a combination of multiple support vector machines.
The main contributions of our work include: (1) A new way to extract features from protein sequences by clustering 20 amino acids into a few groups using the proposed greedy algorithm to reduce the input dimensionality of support vector machines. Our method for clustering 20 amino acids considers not only the factor of the amino acids' physical-chemical properties but also the factor of their contextual correlations. (2) A selection score function and a greedy algorithm are proposed to select a subset of candidate support vector machines to maximize the cross-validation accuracy instead of simply combining multiple support vector machines to give better prediction.
(3) A web-based system has been developed for predicting protein subcellular localization of Gram-negative bacteria. It allows people to submit multiple Gram-negative bacteria protein sequences to perform protein subcellular localization prediction. It is available at [43].
Clustering 20 amino acids into a few groups by our proposed greedy algorithm provides a new way to extract features to cover more adjacent amino acids from protein sequences and reduce the dimensionality of these features. Since amino acids in proteins do not contribute to the function of proteins independently, it may not be a good idea to group amino acids based on their physical-chemical properties alone. For the prediction task, a good amino acid grouping leads to an increase in prediction performance. Furthermore, properly choosing a subset of complementary support vector machines constructed by different features of proteins maximizes the prediction accuracy.

Support vector machines
Support Vector Machines (SVMs) have been widely used in the analysis of biological data [32][33][34]. SVM is a relatively new family of learning methods and has some theoretical support from statistical learning theory [35,36]. SVM non-linearly maps the input space into a high dimensional feature space, and seeks a hyperplane in this space that separates the positive samples from the negative ones with the largest possible margin and optimizes the trade-off between good classification and large margin. Instead of explicitly mapping the objects to the high dimensional feature space, SVM usually works implicitly in the feature space by only computing the corresponding kernel between any two objects.
Several parameters need to be set during the SVM training phase. These parameters include the regularization parameter, which controls the trade-off between good classification and large margin, the kernel type, and the kernel parameters. These parameters are tuned using cross-validation accuracy as the criterion. The radial basis function (RBF) kernel is used for all our experiments and the software BSVM [44], a multi-class SVM [37], is used in this work.

Protein features
The amino acid compositions in the full or partial sequences are considered as global features, which represent the overall similarity among multiple protein sequences. In this paper, the global features are used as the input of the SVMs to predict protein subcellular localization.
a. W-gram protein encoding
Two types of features are considered in our work: W-gram and gapped 2-gram. A W-gram is defined as a pattern of W (W ≥ 1) consecutive amino acid residues without any gaps, and a gapped 2-gram is defined as two amino acid residues separated by some specified number of gaps in a protein sequence. Here, a gapped 2-gram is also referred to as a 2-gram. The main purpose of introducing the gapped encoding features for 2-gram is to increase the number of 2-gram feature candidates.
For each protein sequence P and each W-gram (or feature) F, let N(P, F) be the number of occurrences of F in the protein sequence P. Further, let T(P, W) be the total number of possible W-grams in P, length(P) be the length of P, and G(F) be the specified number of gaps. We have
$$T(P, W) = \mathrm{length}(P) - W - G(F) + 1,$$
where G(F) = 0 for a W-gram without gaps. In the W-gram protein encoding method, the total number of different possible features is $20^W$.
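A minimal Python sketch of the two encodings is given below. It counts N(P, F) for contiguous W-grams and for gapped 2-grams; normalizing the counts by the total number of windows, and the function names themselves, are assumptions made for illustration.

```python
from collections import Counter

def wgram_counts(seq, w):
    """Count every contiguous W-gram in seq (length(P) - W + 1 windows)."""
    return Counter(seq[i:i + w] for i in range(len(seq) - w + 1))

def gapped_2gram_counts(seq, gap):
    """Count residue pairs with `gap` skipped positions between them
    (length(P) - gap - 1 windows)."""
    return Counter(seq[i] + seq[i + gap + 1] for i in range(len(seq) - gap - 1))

def frequencies(counts):
    """Turn raw counts into relative frequencies (assumed normalization)."""
    total = sum(counts.values())
    return {f: n / total for f, n in counts.items()} if total else {}
```

For the hypothetical sequence "MKKLA", the 2-gram windows are MK, KK, KL and LA, while the 2-gram with one gap yields MK, KL and KA.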

b. Amino acid subalphabets
It is difficult to train the SVM with a very large number of features, such as 8000 for 3-gram. One way to reduce dimensionality is to classify the 20 amino acids into a small number of groups based on their physical-chemical properties. All members in the same group can be represented by one symbol. The merged amino acid alphabet has fewer than 20 symbols and is called the amino acid subalphabet, which can be used to re-encode the original protein sequences. The re-encoded protein sequences have fewer features. For example, if the number of symbols in an alphabet is reduced from 20 to 6, the number of 3-gram features is reduced from 8000 (20 × 20 × 20) to 216 (6 × 6 × 6). Reducing the number of features to a manageable size for SVMs can help to improve the predictive performance.
This paper suggests optimizing the grouping by using the proposed greedy algorithm, which considers the factors of both the amino acids' physical-chemical properties and their contextual correlations, instead of using the grouping based on their physical-chemical properties alone. Note that there are an exponential number of ways to group the 20 amino acids. For example, there are 580606446 and 45232115901 ways to divide 20 amino acids into 3 and 4 groups, respectively. The number of subalphabets with m groups (1 ≤ m ≤ 20) for the protein alphabet size of 20, N(m), can be calculated by the formula [28] below.
$$N(m) = \frac{1}{m!} \sum_{i=0}^{m} (-1)^{i} \binom{m}{i} (m - i)^{20}$$
We learn a locally optimal grouping with a greedy algorithm that uses the SVM classification algorithm to evaluate the fitness of each candidate subalphabet, where the evaluation criterion is the 5-fold cross-validation accuracy.
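Re-encoding with a subalphabet amounts to a character substitution, as the Python sketch below shows. The four groups used in the usage example are purely illustrative and are not the grouping learned in the paper; the helper names are likewise assumptions.

```python
import string

def make_mapping(groups):
    """Map each amino acid to the one-letter symbol of its group."""
    mapping = {}
    for symbol, members in zip(string.ascii_lowercase, groups):
        for aa in members:
            mapping[aa] = symbol
    return mapping

def reencode(seq, mapping):
    """Rewrite a protein sequence in the reduced alphabet.

    Raises KeyError for residues outside the mapping (for example "X").
    """
    return "".join(mapping[aa] for aa in seq)

def feature_count(n_groups, w):
    """Number of distinct W-grams over an alphabet of n_groups symbols."""
    return n_groups ** w
```

With an illustrative grouping such as `["DEHKR", "CGNQSTY", "FW", "AILMPV"]`, every residue is replaced by one of four symbols before the W-grams are counted.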
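The two counts quoted above agree with the Stirling numbers of the second kind, which count the ways of partitioning a set into m non-empty groups. The Python sketch below evaluates them with the standard inclusion-exclusion sum; the function name `num_subalphabets` is an illustrative choice.

```python
from math import comb, factorial

def num_subalphabets(m, n=20):
    """Number of ways to split n amino acids into m non-empty groups
    (Stirling number of the second kind, exact integer arithmetic)."""
    total = sum((-1) ** i * comb(m, i) * (m - i) ** n for i in range(m + 1))
    return total // factorial(m)
```

The size of this search space is what motivates a greedy local search rather than exhaustive enumeration.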

c. Search for amino acid subalphabets
This section presents our greedy algorithm for finding a good grouping for the amino acids. Given a particular subalphabet encoding schema S, let $N_g$ and $T_c$ be the predefined number of groups and the threshold of cross-validation accuracy, respectively. Further, we assume the parameters of an SVM used to evaluate the fitness of a candidate subalphabet are given. These SVM parameters can be set either to the values suggested by the SVM software or to the values obtained by tuning, under the criterion of cross-validation accuracy, an SVM constructed from features re-encoded by grouping the 20 amino acids based on their physical-chemical properties. For a particular subalphabet encoding schema S, let the grouping score h(S) be the cross-validation accuracy when prediction is done by an SVM using W-grams and the subalphabet schema S. h(S) can be used to measure the goodness of the grouping S. Table 6 shows an example of clustering 20 amino acids into 4 groups for the 4-gram protein encoding method using the proposed greedy algorithm, starting from an initial node with a 4-group assignment. The proposed greedy algorithm to search for amino acid subalphabets is described in Table 7. The greedy local search [39] has been used for learning the subalphabets.

[Table 6 columns: searching states, cross-validation accuracy, actions]
In the search tree [39], every node represents an amino acid subalphabet encoding schema. The child nodes of a node are subalphabets encoding schemata, which are generated by moving every group member to each other group if the number of members in this group is greater than one.
This algorithm is composed of the following four steps. First, the 20 amino acids are initially divided into $N_g$ groups, either randomly with approximately the same size or based on some physical-chemical properties of the 20 amino acids. Amino acids in the same group are denoted by one symbol in a subalphabet. The current subalphabet encoding schema is represented by the current node, and its grouping score is calculated as the cross-validation accuracy obtained when prediction is done by an SVM using W-grams and this subalphabet schema.
Second, all child nodes of the current node are generated.
If a group has only one member, that member cannot move to any other group; otherwise, the total number of groups would fall below $N_g$. There are at most 20 × ($N_g$ - 1) possible child nodes in the search space, since there are 20 amino acids and each amino acid can move to at most ($N_g$ - 1) other groups. If the highest grouping score among the child nodes is greater than the grouping score of the current node, this child node becomes the current node.
Third, the above search for the child node with the highest grouping score is repeated until no child node has a grouping score greater than that of the current node.
Fourth, if the grouping score of the final current node is greater than $T_c$, the $N_g$ groups in the current node become the accepted merged subalphabet. Otherwise, we randomly re-generate the current node and repeat Steps 2 to 4 above.
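The four steps can be sketched in Python as a hill-climbing search with random restarts. In this sketch the `score` callable stands in for the SVM cross-validation accuracy h(S); the function names, the `max_restarts` safeguard, and the random initial grouping are assumptions made for illustration rather than details taken from Table 7.

```python
import random

def children(groups):
    """Yield every grouping obtained by moving one residue to another group."""
    for src, members in enumerate(groups):
        if len(members) <= 1:
            continue  # a lone member may not move, or the group count would drop
        for aa in sorted(members):
            for dst in range(len(groups)):
                if dst != src:
                    child = [set(g) for g in groups]
                    child[src].remove(aa)
                    child[dst].add(aa)
                    yield child

def random_grouping(alphabet, n_groups, rng):
    """Split the alphabet at random into n_groups groups of similar size."""
    letters = list(alphabet)
    rng.shuffle(letters)
    return [set(letters[i::n_groups]) for i in range(n_groups)]

def greedy_search(score, alphabet, n_groups, threshold, max_restarts=20, seed=0):
    """Hill-climb over groupings; restart at random until score > threshold."""
    rng = random.Random(seed)
    best_groups, best_score = None, float("-inf")
    for _ in range(max_restarts):
        current = random_grouping(alphabet, n_groups, rng)        # step 1
        current_score = score(current)
        while True:
            scored = [(score(c), c) for c in children(current)]   # step 2
            if not scored:
                break
            top_score, top = max(scored, key=lambda pair: pair[0])
            if top_score <= current_score:                        # step 3
                break
            current, current_score = top, top_score
        if current_score > best_score:
            best_groups, best_score = current, current_score
        if current_score > threshold:                             # step 4
            break
    return best_groups, best_score
```

Since every call to `score` would correspond to a full cross-validation run, a practical version would likely cache scores for groupings that have already been evaluated.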
The training sequences are divided into two parts: One part is used for choosing the subalphabet while the other is used for evaluating the performance of a subalphabet.
The greedy algorithm is applied to reduce the number of W-gram features. In particular, for 3-gram, we classify the 20 amino acids into 6, 7, and 8 groups. For 4-gram, we classify the 20 amino acids into 4 groups. The number of features is $m^W$, where m is the number of groups and W is the number of adjacent residues in the W-gram encoding method. For example, the number of features is 6 × 6 × 6 = 216 for 6 groups in the 3-gram encoding method.

Multiple SVMs
Due to the nature of the multi-class classification, it may not be easy to obtain a single SVM that can return high accuracies for the subcellular localization prediction. Therefore, multiple SVMs are trained from different features and their results are combined using voting.
Currently most of the existing protein subcellular localization prediction systems using SVMs only use the features generated from 1-gram or 2-gram protein encoding methods. For example, the extracted features of amino acid compositions [2] and features of amino acid pair and gapped amino acid pair compositions [40] can be considered as the features generated from the 1-gram and 2-gram encoding methods, respectively.
As many functional patterns in proteins are embedded as sequence correlations, it is expected that more information will be included by combining classifiers constructed from features generated by 1-gram, 2-gram, 3-gram, and 4-gram protein encoding methods, instead of just using the classifiers constructed from 1-gram and 2-gram encoding methods, since more adjacent amino acid residues will be considered.
In this paper, the following four types of features are extracted from protein sequences. The first type is the 1-gram encoding feature, which includes amino acid compositions and the partitioned amino acid compositions, where the protein sequence is partitioned into P parts with approximately the same length. The total number of these features is 20 × P. In this work, P is set from 2 to 6. The second one is the 2-gram encoding feature, which includes amino acid pair and gapped amino acid pair compositions, where the number of features is 400 (20 × 20) and the number of gaps is set from 1 to 2. The purpose of introducing the gapped encoding features only for 2-gram is to increase the number of 2-gram feature candidates. The third one is the 3-gram encoding feature, where the 20 amino acids are divided into 6, 7, and 8 groups whose numbers of features are 216 (6 × 6 × 6), 343 (7 × 7 × 7), and 512 (8 × 8 × 8), respectively. The last one is the 4-gram encoding feature, where the 20 amino acids are divided into 4 groups, whose number of features is 256 (4 × 4 × 4 × 4).
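The partitioned amino acid composition can be sketched in Python as follows. The sequence is cut into P contiguous segments of nearly equal length and a 20-dimensional composition is computed for each, giving 20 × P values. The exact segment boundaries and the function name are assumptions; the paper only states that the parts have approximately the same length.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def partitioned_composition(seq, parts):
    """Return 20 * parts composition features for a protein sequence."""
    n = len(seq)
    # Integer boundaries that split the sequence into nearly equal segments.
    bounds = [i * n // parts for i in range(parts + 1)]
    features = []
    for start, end in zip(bounds, bounds[1:]):
        segment = seq[start:end]
        length = len(segment) or 1  # avoid division by zero for empty segments
        features.extend(segment.count(aa) / length for aa in AMINO_ACIDS)
    return features
```

Setting `parts=1` reduces this to the plain amino acid composition of the whole sequence.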

Feature selection
We apply the wrapper approach [41] in the backward elimination version to select the feature subset for our SVM classifiers and use 5-fold cross-validation accuracy as the evaluation criterion.
Let $\mathrm{SVM}_a$ and $\mathrm{SVM}_b$ be the SVM classifiers using all features and the features selected by the wrapper approach, respectively. Although the prediction accuracy of $\mathrm{SVM}_b$ is improved, the prediction results from $\mathrm{SVM}_a$ and $\mathrm{SVM}_b$ are different. There are some cases where the prediction made by $\mathrm{SVM}_a$ is correct while the prediction made by $\mathrm{SVM}_b$ is not correct, and vice versa. Therefore, both $\mathrm{SVM}_a$ and $\mathrm{SVM}_b$ can be considered as candidates to build the final combined classifier.
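Backward elimination in a wrapper setting can be sketched in Python as below. The `evaluate` callable stands in for training an SVM on a feature subset and returning its 5-fold cross-validation accuracy; that callable, the function name, and the rule of dropping the single most helpful feature per round are assumptions, since the paper does not spell out the procedure.

```python
def backward_elimination(features, evaluate):
    """Drop features one at a time while the evaluation score keeps improving.

    features: iterable of feature identifiers.
    evaluate: callable mapping a list of features to a score (higher is better).
    Returns (selected_features, best_score).
    """
    selected = list(features)
    best_score = evaluate(selected)
    while len(selected) > 1:
        best_trial = None
        for feature in selected:
            trial = [f for f in selected if f != feature]
            trial_score = evaluate(trial)
            if trial_score > best_score:
                best_score, best_trial = trial_score, trial
        if best_trial is None:
            break  # no single removal improves the score
        selected = best_trial
    return selected, best_score
```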

SVM subset selection
Different SVMs give different predictions. One way to combine their predictions is by voting. That is, each protein sequence is assigned to the class with the most votes. For cases where two or more classes tie for the most votes, we adopt the prediction of a single SVM, namely the one that makes the most correct predictions over all such tied cases.
Suppose S is a set of protein sequences, N is the number of candidate SVMs, $M = \{\mathrm{SVM}_1, \mathrm{SVM}_2, \ldots, \mathrm{SVM}_N\}$ is the set of candidate SVMs defined previously, $V_1(S, M)$ is the number of correct predictions made by M when only one class receives the most votes, and $V_2(S, M)$ is the number of correct predictions made by the assigned SVM when two or more classes tie for the most votes. The selection score function V(S, M) is defined as $V(S, M) = V_1(S, M) + V_2(S, M)$ and is used to select a subset of all candidate SVMs to form a combined classifier, which maximizes the cross-validation accuracy. The proposed greedy algorithm to select a subset of M is described in Table 8.
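The voting rule and its tie-break can be sketched in Python as follows. Here `all_predictions[k][j]` is taken to be the class predicted by the j-th SVM for the k-th protein; this layout and the function names are assumptions made for illustration.

```python
from collections import Counter

def top_classes(predictions):
    """Return the class or classes that received the most votes."""
    counts = Counter(predictions)
    top = max(counts.values())
    return [c for c, n in counts.items() if n == top]

def pick_tie_breaker(all_predictions, truths):
    """Index of the SVM with the most correct predictions on tied cases."""
    tied = [k for k, preds in enumerate(all_predictions)
            if len(top_classes(preds)) > 1]
    n_svms = len(all_predictions[0])
    return max(range(n_svms),
               key=lambda j: sum(all_predictions[k][j] == truths[k] for k in tied))

def combine(all_predictions, tie_breaker):
    """Majority vote per protein; tied cases defer to the tie-breaker SVM."""
    combined = []
    for preds in all_predictions:
        winners = top_classes(preds)
        combined.append(winners[0] if len(winners) == 1 else preds[tie_breaker])
    return combined
```

The tie-breaker would be chosen on training data only, because it depends on the true labels of the tied cases.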
This greedy algorithm consists of the following two steps. The process for removing some $\mathrm{SVM}_p$ (1 ≤ p ≤ N) continues until i = 1, that is, until only one SVM is left. Then $\mathrm{Set}_{max}$ is selected to be the combined classifier. We can use the prediction results on four-fifths of the training protein sequences to select a subset of SVMs and use the prediction results on the remaining one-fifth to evaluate the result of the SVM subset selection.
In this work, 15 SVMs are selected and combined to form the final classifier. Table 9 shows the encoding methods of input vectors in the fifteen selected SVMs. We have conducted some experiments on constructing SVMs by using the 5-gram encoding method. Preliminary experimental results show that the cross-validation accuracies of SVMs constructed by the 3-gram, 4-gram, and 5-gram encoding methods are not satisfactory when the number of groups is less than 6, 4, and 4, respectively. When we increase the number of groups to 4 for 5-gram, training the corresponding SVM and calculating the 5-fold cross-validation accuracy takes a relatively long time, as the number of features reaches 1024 (4 × 4 × 4 × 4 × 4). Therefore, only the 1-gram, 2-gram, 3-gram, and 4-gram encoding methods are considered in this paper. Furthermore, the 20 amino acids are classified into 6, 7, and 8 groups for the 3-gram encoding method and into 4 groups for the 4-gram encoding method.
Since there are too many zero elements in the encoding results, the 2-gram, 3-gram, and 4-gram protein encoding methods are not applied to cases where the protein sequences are partitioned into P (P > 1) parts of approximately the same length.
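One plausible reading of this backward selection is sketched in Python below: starting from all candidates, the SVM whose removal gives the highest selection score is dropped at each round until one remains, and the best-scoring set seen along the way is returned. The details of Table 8 are not reproduced in the text, so the removal rule, the function name, and the `score` callable (standing in for V(S, M)) are assumptions.

```python
def greedy_subset(candidates, score):
    """Backward greedy selection of a subset of classifiers.

    candidates: list of classifier identifiers.
    score: callable mapping a list of identifiers to a selection score.
    Returns (best_subset, best_score) over all subsets visited.
    """
    current = list(candidates)
    best_set, best_score = list(current), score(current)
    while len(current) > 1:
        # Try removing each remaining classifier and keep the best reduced set.
        trials = [[m for m in current if m != p] for p in current]
        current = max(trials, key=score)
        current_score = score(current)
        if current_score > best_score:
            best_set, best_score = list(current), current_score
    return best_set, best_score
```

Only on the order of N² subsets are scored this way, instead of all 2^N possible subsets of the candidates.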