Multi-location gram-positive and gram-negative bacterial protein subcellular localization using gene ontology and multi-label classifier ensemble

Background: Predicting the subcellular locations of bacterial proteins by computational methods has become an important and challenging task. Although many prediction methods exist for bacterial proteins, most of them can handle only single-location proteins, whereas many bacterial proteins reside at more than one location. Moreover, multi-location proteins have special biological functions that can aid the development of new drugs. New computational methods are therefore needed to accurately predict the subcellular locations of multi-location bacterial proteins.
Results: In this article, two efficient multi-label predictors, Gpos-ECC-mPLoc and Gneg-ECC-mPLoc, are developed to predict the subcellular locations of multi-label gram-positive and gram-negative bacterial proteins respectively. The two predictors construct GO vectors from the GO terms of homologous proteins of the query proteins and then adopt a powerful multi-label ensemble classifier to make the final multi-label prediction. They have the following advantages: (1) they improve the prediction performance for multi-label proteins by taking the correlations among different labels into account; (2) they combine multiple CC classifiers and obtain better prediction results through ensemble learning; and (3) they construct the GO vectors from the frequency of occurrences of GO terms in the typical homologous set instead of using 0/1 values. Experimental results show that Gpos-ECC-mPLoc and Gneg-ECC-mPLoc can efficiently predict the subcellular locations of multi-label gram-positive and gram-negative bacterial proteins respectively.
Conclusions: Gpos-ECC-mPLoc and Gneg-ECC-mPLoc improve the prediction accuracy of subcellular localization for multi-location gram-positive and gram-negative bacterial proteins respectively.
The online web servers for Gpos-ECC-mPLoc and Gneg-ECC-mPLoc predictors are freely accessible at http://biomed.zzuli.edu.cn/bioinfo/gpos-ecc-mploc/ and http://biomed.zzuli.edu.cn/bioinfo/gneg-ecc-mploc/ respectively.


Background
Bacteria are widely distributed in soil and water or live in association with other organisms, and they are the most abundant of all organisms. All bacteria are prokaryotes, which have a very simple cell structure lacking a cell nucleus, mitochondria and chloroplasts. Bacteria can be classified into two groups by the Gram staining method: Gram-positive and Gram-negative. The former are stained dark blue or violet by Gram staining, while the latter appear red or pink. Because the functions of proteins are closely related to their subcellular locations, knowing the subcellular locations of proteins in a bacterial cell can help biologists elucidate the functions of proteins and thus screen candidates in drug design.
Nowadays, there are two approaches to identifying the subcellular locations of proteins: biochemical experiments and computational methods. In the post-genomic era, with the completion of various sequencing projects, the number of new protein sequences has grown exponentially [1]. Biochemical experiments are both time-consuming and costly, and thus cannot keep pace with this growth. Computational methods are therefore required to identify the subcellular locations of these proteins automatically and accurately.
Author affiliation: The Key Laboratory of Embedded System and Service Computing, Ministry of Education, Department of Control Science and Engineering, Tongji University, Shanghai, China
Although many prediction methods exist for the subcellular localization of proteins, most of them can handle only single-location proteins. However, many proteins reside at more than one subcellular location simultaneously. When prediction models are constructed by these single-location methods, multi-location proteins are excluded from the training set, even though multi-location proteins have special biological functions that can aid the development of new drugs.
In this article, two efficient multi-label predictors, Gpos-ECC-mPLoc and Gneg-ECC-mPLoc, are proposed to predict the subcellular locations of multi-label gram-positive and gram-negative bacterial proteins respectively. The two predictors extract GO feature vectors from the GO terms of homologs of the query proteins and then adopt a powerful multi-label ensemble classifier to output the final multi-label prediction results. Experimental results show that Gpos-ECC-mPLoc and Gneg-ECC-mPLoc can efficiently predict the subcellular locations of multi-label gram-positive and gram-negative bacterial proteins respectively. For readers' convenience, we developed online web servers for the Gpos-ECC-mPLoc and Gneg-ECC-mPLoc predictors, which are freely accessible at http://biomed.zzuli.edu.cn/bioinfo/gpos-ecc-mploc/ and http://biomed.zzuli.edu.cn/bioinfo/gneg-ecc-mploc/ respectively.
The gram-positive bacterial dataset consists of 519 gram-positive bacterial proteins, which are distributed over 4 locations (see Table 1). Of the 519 proteins, 515 belong to one subcellular location, 4 to two locations, and none to more than two locations. The number of locative proteins in this dataset is therefore 523. The concepts of locative proteins and actual proteins are explained in detail in [33][34][35]. Pairwise sequence identity in this dataset is kept below 25%.
The gram-negative bacterial dataset consists of 1392 gram-negative bacterial proteins, which are distributed over 8 locations (see Table 2). Of the 1392 proteins, 1328 belong to one subcellular location, 64 to two locations, and none to more than two locations. The number of locative proteins in this dataset is therefore 1456. Pairwise sequence identity in this dataset is also kept below 25%.
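The relation between actual and locative proteins in the two datasets can be checked with a few lines of arithmetic. The sketch below assumes that a protein with k locations counts as k locative proteins, which is consistent with the totals quoted above; the function name is ours and purely illustrative.

```python
def count_proteins(proteins_by_n_locations):
    """Return (actual, locative) protein counts.

    proteins_by_n_locations maps a number of locations k to the number of
    proteins that have exactly k locations. A protein with k locations is
    counted once as an actual protein and k times as a locative protein.
    """
    actual = sum(proteins_by_n_locations.values())
    locative = sum(k * n for k, n in proteins_by_n_locations.items())
    return actual, locative


# Figures taken from the dataset description above.
print(count_proteins({1: 515, 2: 4}))    # (519, 523)
print(count_proteins({1: 1328, 2: 64}))  # (1392, 1456)
```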

Performance measures
In this article, we use the overall locative accuracy and the overall absolute accuracy to measure the performance of multi-label predictors. The two measures are defined as follows:

$$\text{Overall locative accuracy} = \frac{1}{N_{loc}} \sum_{i=1}^{N_{dif}} |Y_i \cap Z_i|$$

$$\text{Overall absolute accuracy} = \frac{1}{N_{dif}} \sum_{i=1}^{N_{dif}} 1(Y_i \equiv Z_i)$$

where $Y_i$ is the set of true labels of the i-th protein, $Z_i$ the set of predicted labels of the i-th protein, $N_{loc}$ the number of locative proteins, $N_{dif}$ the number of different proteins, $|\cdot|$ the operator that counts the number of elements in a set, $\cap$ the intersection of sets, and $1(Y_i \equiv Z_i)$ equals 1 if the true labels are entirely identical to the predicted labels and 0 otherwise.
Under the overall absolute accuracy, the prediction for a query protein is counted as correct if and only if all of its subcellular locations are predicted exactly, with no location missed and none added. Therefore, the overall absolute accuracy is stricter than the overall locative accuracy. A more detailed explanation of the two measures can be found in [36].
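As a minimal sketch of the two measures, the code below computes them from lists of label sets. It assumes, following the symbol definitions above, that the locative accuracy is the total number of correctly predicted labels divided by the number of locative proteins, and that the absolute accuracy is the fraction of proteins whose predicted label set equals the true one. The function names and the toy data are illustrative only.

```python
def overall_locative_accuracy(true_sets, pred_sets):
    """Sum of |Y_i & Z_i| over all proteins, divided by N_loc."""
    n_loc = sum(len(y) for y in true_sets)  # number of locative proteins
    hits = sum(len(y & z) for y, z in zip(true_sets, pred_sets))
    return hits / n_loc


def overall_absolute_accuracy(true_sets, pred_sets):
    """Fraction of proteins whose predicted set equals the true set exactly."""
    n_dif = len(true_sets)  # number of different (actual) proteins
    exact = sum(1 for y, z in zip(true_sets, pred_sets) if y == z)
    return exact / n_dif


# Toy example with three proteins (made-up predictions).
true_sets = [{"Cytoplasm"}, {"Cell membrane", "Cell wall"}, {"Extracell"}]
pred_sets = [{"Cytoplasm"}, {"Cell membrane"}, {"Extracell", "Cell wall"}]

print(overall_locative_accuracy(true_sets, pred_sets))            # 0.75
print(round(overall_absolute_accuracy(true_sets, pred_sets), 3))  # 0.333
```

In the toy example, the third protein has its true location predicted but also receives a spurious one, so it counts toward the locative accuracy and not toward the absolute accuracy.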

Comparison with the state-of-the-art predictors
In statistical prediction, the jackknife test, also known as leave-one-out cross-validation, is considered the most rigorous and objective evaluation method [37]. The jackknife test has been widely used by researchers to evaluate the performance of various prediction methods [38][39][40][41][42][43]. Hence, in this article, we also use the jackknife test to evaluate the prediction performance of the proposed Gpos-ECC-mPLoc and Gneg-ECC-mPLoc predictors.
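The jackknife procedure can be sketched as a simple loop: each protein is held out in turn, a model is trained on all remaining proteins, and the held-out protein is predicted. The sketch below is generic; the 1-nearest-neighbour predictor is only a hypothetical stand-in so that the example runs, and it is not the classifier used in this article.

```python
def jackknife(features, labels, train_and_predict):
    """Leave-one-out evaluation: predict each example from all the others."""
    predictions = []
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(train_and_predict(train_x, train_y, features[i]))
    return predictions


def nearest_neighbour(train_x, train_y, query):
    """Stand-in predictor: copy the label set of the closest training example."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    best = min(range(len(train_x)), key=lambda j: sq_dist(train_x[j], query))
    return train_y[best]


# Toy data: two well-separated groups of two proteins each.
X = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
Y = [{"Cytoplasm"}, {"Cytoplasm"}, {"Cell wall"}, {"Cell wall"}]

preds = jackknife(X, Y, nearest_neighbour)
print(preds == Y)  # True
```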
We compare the proposed Gpos-ECC-mPLoc predictor with two state-of-the-art gram-positive bacterial multi-label predictors, Gpos-mPLoc [31] and iLoc-Gpos [30]. Likewise, we compare the proposed Gneg-ECC-mPLoc predictor with two state-of-the-art gram-negative bacterial multi-label predictors, Gneg-mPLoc [26] and iLoc-Gneg [32]. The ensemble sizes of the multi-label ensemble classifiers (i.e., ECC) used in Gpos-ECC-mPLoc and Gneg-ECC-mPLoc are set to 25 and 40 respectively, which gave the best performance.
Table 3 shows the comparison of Gpos-ECC-mPLoc against the two state-of-the-art gram-positive bacterial multi-label predictors on the gram-positive bacterial benchmark dataset by the jackknife test. Like Gpos-mPLoc [31] and iLoc-Gpos [30], Gpos-ECC-mPLoc uses the accession numbers of homologous proteins of the query proteins to retrieve the corresponding GO terms from the GOA database. Gpos-ECC-mPLoc uses homologous proteins that have ≥ 60% pairwise sequence similarity with the query protein. Note that if a query protein does not have any homologous protein, or if the accession numbers of its homologous proteins do not match any GO term in the GOA database, the dipeptide composition method is used as a backup for extracting its feature vector. In the gram-positive bacterial benchmark dataset, there is one protein without any homologs.
Table 4 shows the comparison of Gneg-ECC-mPLoc against the two state-of-the-art gram-negative bacterial multi-label predictors on the gram-negative bacterial benchmark dataset by the jackknife test. Gneg-mPLoc [26] uses methods similar to those of Gpos-mPLoc [31], and iLoc-Gneg [32] uses methods similar to those of iLoc-Gpos [30]. Gneg-ECC-mPLoc also uses homologous proteins that have ≥ 60% pairwise sequence similarity with the query protein. In the gram-negative bacterial benchmark dataset, there are two proteins without any homologs.
As can be seen from Tables 3 and 4, Gpos-ECC-mPLoc performs better than Gpos-mPLoc and iLoc-Gpos on the gram-positive bacterial dataset, and Gneg-ECC-mPLoc performs better than Gneg-mPLoc and iLoc-Gneg on the gram-negative bacterial dataset. Specifically, on the gram-positive bacterial dataset, the overall locative accuracy achieved by Gpos-ECC-mPLoc is 94.44%, which is more than 12% higher than that of Gpos-mPLoc and 1% higher than that of iLoc-Gpos, while the overall absolute accuracy of Gpos-ECC-mPLoc is 94.02%, which is more than 1% higher than that of iLoc-Gpos. On the gram-negative bacterial dataset, Gneg-ECC-mPLoc achieves 94.1% overall locative accuracy, an improvement of more than 8% over Gneg-mPLoc and approximately 3% over iLoc-Gneg, and 92.4% overall absolute accuracy, an improvement of approximately 3% over iLoc-Gneg. The results on both datasets show that Gpos-ECC-mPLoc and Gneg-ECC-mPLoc are more capable of handling multi-label problems than Gpos-mPLoc, iLoc-Gpos, Gneg-mPLoc and iLoc-Gneg. This is because Gpos-ECC-mPLoc and Gneg-ECC-mPLoc take the correlations among subcellular locations into account, whereas Gpos-mPLoc, iLoc-Gpos, Gneg-mPLoc and iLoc-Gneg merely transform the multi-label classification problem into a single-label classification problem and thus lose the beneficial label correlation information. Moreover, combining multiple multi-label classifiers in Gpos-ECC-mPLoc and Gneg-ECC-mPLoc further enhances the prediction performance.
As for the individual locative accuracies, on the gram-positive bacterial dataset Gpos-ECC-mPLoc achieves locative accuracies similar to those of iLoc-Gpos for 'Cell membrane', 'Cell wall' and 'Cytoplasm', while its locative accuracy for 'Extracell' is remarkably higher than that of iLoc-Gpos. On the gram-negative bacterial dataset, the locative accuracies of Gneg-ECC-mPLoc are significantly higher than those of Gneg-mPLoc for all 8 locations. Compared with iLoc-Gneg, Gneg-ECC-mPLoc achieves similar locative accuracies for 'Cell inner membrane', 'Fimbrium' and 'Flagellum' and performs remarkably better for the remaining location sites.

Conclusions
In this article, we propose two efficient multi-label predictors, Gpos-ECC-mPLoc and Gneg-ECC-mPLoc, to predict the subcellular locations of multi-label gram-positive and gram-negative bacterial proteins respectively. The two predictors use the GO terms of homologous proteins of the query proteins to construct GO vectors, which are then fed into the powerful ensemble of classifier chains (ECC) classifier to generate the final multi-label prediction results. Compared with the existing predictors, Gpos-ECC-mPLoc and Gneg-ECC-mPLoc have the following three advantages: (1) CC takes the correlations among different labels into account and thereby improves the prediction performance for multi-label proteins; (2) ECC combines multiple CC classifiers and can obtain better prediction results through ensemble learning; and (3) they construct the GO vectors from the frequency of occurrences of GO terms in the typical homologous set instead of using 0/1 values. Experimental results show that Gpos-ECC-mPLoc and Gneg-ECC-mPLoc can efficiently predict the subcellular locations of multi-label gram-positive and gram-negative bacterial proteins respectively. For readers' convenience, the online web servers for the Gpos-ECC-mPLoc and Gneg-ECC-mPLoc predictors are freely accessible at http://biomed.zzuli.edu.cn/bioinfo/gpos-ecc-mploc/ and http://biomed.zzuli.edu.cn/bioinfo/gneg-ecc-mploc/ respectively.

Feature extraction

Gene ontology
The Gpos-ECC-mPLoc and Gneg-ECC-mPLoc predictors use only amino acid sequences as input and do not need to know the accession numbers of query proteins in advance. Given a query protein, its amino acid sequence is submitted to BLAST [44] to search for its homologous proteins. The homologous proteins with ≥ 60% pairwise similarity are selected as the typical homologous set of the query protein. The GO terms corresponding to the query protein are then retrieved from the GOA database, using the accession numbers of the proteins in its typical homologous set as keys. Note that the size of the typical homologous set may differ from one query protein to another.
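The selection and retrieval steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the BLAST hits are given as already parsed (accession, similarity in percent) pairs, the GOA database is represented by a plain dictionary, and the accession strings are made up. No actual BLAST or GOA interface is shown.

```python
def typical_homologous_set(blast_hits, min_similarity=60.0):
    """Keep the accessions of hits with at least 60% pairwise similarity."""
    return [acc for acc, similarity in blast_hits if similarity >= min_similarity]


def retrieve_go_terms(homologs, goa):
    """Look up the GO terms of each homolog; homologs without entries are skipped."""
    return [goa[acc] for acc in homologs if acc in goa]


# Hypothetical parsed BLAST output and a toy stand-in for the GOA database.
hits = [("ACC_1", 98.0), ("ACC_2", 61.5), ("ACC_3", 42.0), ("ACC_4", 60.0)]
goa = {
    "ACC_1": {"GO:0005737", "GO:0005886"},
    "ACC_2": {"GO:0005737"},
}

homologs = typical_homologous_set(hits)
terms = retrieve_go_terms(homologs, goa)
print(homologs)    # ['ACC_1', 'ACC_2', 'ACC_4']
print(len(terms))  # 2
```

If the retrieved list is empty, the query protein has no usable GO annotation, which is the case where a sequence-based backup feature is needed.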
In this article, we used the GOA database released on 08-Apr-2011, which contains 18844 distinct GO terms. These GO terms form a Euclidean space with 18844 dimensions. Given a dataset, we used the procedure described above to retrieve the GO terms of all of its proteins. Each protein in the dataset can then be represented as a GO vector by matching its GO terms against all of the 18844 GO terms. We used the approach described in [45,46] to determine the elements of the GO vectors. Specifically, the GO vector $p_i$ of the i-th protein is defined as:

$$p_i = [f_{i,1}, \ldots, f_{i,j}, \ldots, f_{i,18844}]^T, \qquad f_{i,j} = \sum_{k=1}^{N_h} g(j,k)$$

where $N_h$ is the number of proteins in its typical homologous set, $g(j,k) = 1$ if the k-th homologous protein hits the j-th GO term and $g(j,k) = 0$ otherwise, and $f_{i,j}$ is the frequency of occurrences of the j-th GO term in the typical homologous set.
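To make the difference between frequency-based and 0/1 GO vectors concrete, the sketch below builds both from the GO terms of a small typical homologous set. It assumes that the frequency of a GO term is the raw number of homologs that hit it (not normalised by the size of the set), and it uses a toy vocabulary of four GO terms instead of the full 18844; the function names are illustrative.

```python
def go_frequency_vector(homolog_terms, go_index):
    """Count, for every GO term, how many homologs hit it.

    homolog_terms: one collection of GO terms per homolog.
    go_index: maps each GO term of the vocabulary to its vector position.
    """
    vector = [0] * len(go_index)
    for terms in homolog_terms:      # k-th homolog
        for term in set(terms):      # each homolog counts a term at most once
            j = go_index.get(term)
            if j is not None:        # ignore terms outside the vocabulary
                vector[j] += 1
    return vector


def go_binary_vector(homolog_terms, go_index):
    """0/1 variant: 1 if any homolog hits the term, 0 otherwise."""
    return [1 if f > 0 else 0 for f in go_frequency_vector(homolog_terms, go_index)]


go_index = {"GO:0005576": 0, "GO:0005618": 1, "GO:0005737": 2, "GO:0005886": 3}
homolog_terms = [
    {"GO:0005737", "GO:0005886"},
    {"GO:0005737"},
    {"GO:0005737", "GO:0005618"},
]

print(go_frequency_vector(homolog_terms, go_index))  # [0, 1, 3, 1]
print(go_binary_vector(homolog_terms, go_index))     # [0, 1, 1, 1]
```

The frequency vector retains the information that one term is supported by all three homologs, which the 0/1 representation discards.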

Dipeptide composition
Some proteins cannot be represented as GO vectors, either because they do not have any homologous proteins or because the accession numbers of their homologous proteins do not match any GO term in the GOA database. In this article, dipeptide composition, which represents the frequency of occurrences of each pair of adjacent amino acid residues, is used as a backup. It generates a 420-dimensional vector for such a query protein, in which the first 20 elements are the conventional amino acid composition (AAC) and the following 400 elements are the frequencies of occurrences of the 400 different dipeptides.
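A minimal sketch of the 420-dimensional backup feature is given below. The layout (20 AAC elements followed by 400 dipeptide elements) follows the description above; the normalisation by sequence length, the alphabetical ordering of residues, and the skipping of non-standard residues are our assumptions and may differ from the implementation used in the predictors.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, alphabetical order
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400


def dipeptide_features(sequence):
    """Return 20 amino acid frequencies followed by 400 dipeptide frequencies."""
    # Assumption: residues outside the 20 standard ones are dropped.
    residues = [c for c in sequence.upper() if c in AMINO_ACIDS]
    n = len(residues)
    aac = [residues.count(a) / n if n else 0.0 for a in AMINO_ACIDS]

    counts = dict.fromkeys(DIPEPTIDES, 0)
    for first, second in zip(residues, residues[1:]):  # adjacent pairs
        counts[first + second] += 1
    n_pairs = max(n - 1, 0)
    dpc = [counts[d] / n_pairs if n_pairs else 0.0 for d in DIPEPTIDES]
    return aac + dpc


vector = dipeptide_features("ACAC")
print(len(vector))  # 420
```

For the toy sequence "ACAC", the AAC part holds 0.5 for both A and C, and the dipeptide part holds 2/3 for "AC" and 1/3 for "CA".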

Prediction method

Binary relevance
The binary relevance (BR) method [47] uses the one-against-rest strategy to convert a multi-label problem into several binary classification problems. Given a multi-label dataset with N class labels, BR trains one classifier for each class label. When training the classifier for a given class label, BR treats all training examples associated with that label as positive examples and all remaining examples as negative examples. Given a test example, each classifier in BR outputs a prediction score, and BR combines these scores into an N-dimensional score vector in which each score corresponds to a specific class label. A positive score means that the binary classifier predicts that the test example belongs to the corresponding class label, and a negative score means that it does not. Note that if all N scores are negative, the class label with the maximum score is assigned to the test example.
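The decision rule at the end of the BR description can be written in a few lines. In this sketch the score vector is assumed to be given already; a score of exactly zero is treated as negative, which is our assumption since the text does not cover that case.

```python
def decide_labels(scores, labels):
    """Turn an N-dimensional score vector into a set of predicted labels.

    Every label with a positive score is predicted. If no score is positive,
    the single label with the maximum score is predicted instead.
    """
    predicted = {label for label, score in zip(labels, scores) if score > 0}
    if not predicted:
        best = max(range(len(scores)), key=lambda j: scores[j])
        predicted = {labels[best]}
    return predicted


LABELS = ["Cell membrane", "Cell wall", "Cytoplasm", "Extracell"]

print(sorted(decide_labels([0.8, -0.2, 0.1, -1.0], LABELS)))    # ['Cell membrane', 'Cytoplasm']
print(sorted(decide_labels([-0.5, -0.1, -0.9, -0.3], LABELS)))  # ['Cell wall']
```

The fallback guarantees that every query protein receives at least one location.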

Classifier chain
The classifier chain (CC) method [48] is derived from the BR method and, like BR, consists of N binary classifiers. Unlike in BR, the classifiers in CC are trained sequentially and are linked along a chain in the order in which they are trained. Because examples in a multi-label dataset can have multiple class labels and these labels may be correlated, CC extends the feature space of each classifier in the chain with the predicted labels of all previous classifiers. By passing class label information between classifiers in this way, CC takes label correlations into account and thus overcomes the label independence weakness of the BR method. The final prediction is made from the score vector in the same way as in the BR method.
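The chaining idea can be sketched in pure Python as follows. Two points are assumptions on our side: a tiny perceptron replaces the base learner so that the example is self-contained, and during training the true labels of earlier chain positions are used as the extra features (the usual CC formulation), while at prediction time the predicted labels are passed along the chain.

```python
class Perceptron:
    """Tiny linear base learner, used here only as a stand-in."""

    def __init__(self, n_features, epochs=50):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.epochs = epochs

    def score(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def fit(self, X, targets):
        for _ in range(self.epochs):
            for x, t in zip(X, targets):  # t is +1 or -1
                if t * self.score(x) <= 0:  # misclassified: update weights
                    self.w = [wi + t * xi for wi, xi in zip(self.w, x)]
                    self.b += t
        return self


class ClassifierChain:
    def __init__(self, order):
        self.order = list(order)  # label indices in chain order
        self.models = []

    def fit(self, X, Y):
        """Y[i][j] is 1 if example i carries label j, else 0."""
        self.models = []
        rows = [list(x) for x in X]
        for j in self.order:
            targets = [1 if y[j] else -1 for y in Y]
            self.models.append(Perceptron(len(rows[0])).fit(rows, targets))
            # Extend the feature space with label j for the next classifier.
            rows = [row + [y[j]] for row, y in zip(rows, Y)]
        return self

    def scores(self, x):
        """Return one score per label, indexed by label (not chain position)."""
        row = list(x)
        out = [0.0] * len(self.order)
        for j, model in zip(self.order, self.models):
            out[j] = model.score(row)
            row.append(1 if out[j] > 0 else 0)  # pass the predicted label on
        return out


X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
Y = [[1, 0], [1, 0], [0, 1], [0, 1]]
chain = ClassifierChain(order=[0, 1]).fit(X, Y)
print([s > 0 for s in chain.scores([0.8, 0.2])])  # [True, False]
```

The second classifier in the chain sees three input features: the two original ones plus the label decided by the first classifier.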

Ensemble of classifier chains
Since an ensemble of multiple classifiers generally yields better prediction accuracy [49], we construct a multi-label classifier ensemble by combining multiple CC classifiers. Because different label orders can lead to different prediction results, the ensemble of classifier chains (ECC) trains multiple different CC classifiers, each with a random chain order. Each CC classifier outputs a score vector, and we take the average of these score vectors and make the final prediction with the prediction process described for the BR method. In this article, we use ECC as the prediction engine in Gpos-ECC-mPLoc and Gneg-ECC-mPLoc.
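The ensemble step can be sketched as follows: draw one random label order per chain, average the score vectors that the trained chains output, and apply the same decision rule as in BR. The chain training itself is omitted here; the score vectors in the example are made-up numbers, and the function names and the fixed seed are illustrative assumptions.

```python
import random


def random_chain_orders(n_labels, ensemble_size, seed=0):
    """Draw one random permutation of the label indices per chain."""
    rng = random.Random(seed)
    orders = []
    for _ in range(ensemble_size):
        order = list(range(n_labels))
        rng.shuffle(order)
        orders.append(order)
    return orders


def average_scores(score_vectors):
    """Element-wise mean of the score vectors output by the individual chains."""
    n = len(score_vectors)
    return [sum(column) / n for column in zip(*score_vectors)]


def decide(scores):
    """Positive scores are predicted; otherwise fall back to the maximum score."""
    positive = [j for j, s in enumerate(scores) if s > 0]
    return positive or [max(range(len(scores)), key=lambda j: scores[j])]


orders = random_chain_orders(n_labels=4, ensemble_size=25)

# Hypothetical score vectors from three chains for one query protein.
chain_scores = [
    [0.9, -0.4, 0.2, -1.0],
    [0.5, -0.1, -0.3, -0.8],
    [0.7, -0.6, 0.4, -0.9],
]
print(decide(average_scores(chain_scores)))  # [0, 2]
```

In the example, the second chain alone would reject label 2, but the averaged score for that label is positive, so the ensemble predicts it.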

Support vector machine
Each classifier in the BR and CC methods can be trained with any binary classification algorithm. For simplicity, in this article we use the support vector machine (SVM) [50] as the base learner to train each classifier in the CC method. SVM is a well-known binary classification algorithm that is commonly used in various fields of bioinformatics [28,51-57]. The LIBLINEAR software package [58] is used to train the SVMs. It is very efficient and is designed especially for high-dimensional vectors such as the GO vectors used in this work.