ProLoc-GO: Utilizing informative Gene Ontology terms for sequence-based prediction of protein subcellular localization

Background Gene Ontology (GO) annotation, which describes the function of genes and gene products across species, has recently been used to predict protein subcellular and subnuclear localization. Existing GO-based prediction methods for protein subcellular localization use the known accession numbers of query proteins to obtain their annotated GO terms. An accurate prediction method for predicting subcellular localization of novel proteins without known accession numbers, using only the input sequence, is worth developing. Results This study proposes an efficient sequence-based method (named ProLoc-GO) by mining informative GO terms for predicting protein subcellular localization. For each protein, BLAST is used to obtain a homology with a known accession number to the protein for retrieving the GO annotation. A large number n of all annotated GO terms that have ever appeared are then obtained from a large set of training proteins. A novel genetic algorithm based method (named GOmining) combined with a support vector machine (SVM) classifier is proposed to simultaneously identify a small number m out of the n GO terms as input features to SVM, where m << n. The m informative GO terms contain the essential GO terms annotating subcellular compartments such as GO:0005634 (Nucleus), GO:0005737 (Cytoplasm) and GO:0005856 (Cytoskeleton). Two existing data sets SCL12 (human proteins with 12 locations) and SCL16 (eukaryotic proteins with 16 locations) with <25% sequence identity are used to evaluate ProLoc-GO, which has been implemented by using a single SVM classifier with the m = 44 and m = 60 informative GO terms, respectively. ProLoc-GO using input sequences yields test accuracies of 88.1% and 83.3% for SCL12 and SCL16, respectively, which are significantly better than the SVM-based methods, which achieve <35% test accuracies using amino acid composition (AAC) with acid pairs and AAC with dipeptide composition.
For comparison, ProLoc-GO using known accession numbers of query proteins yields test accuracies of 90.6% and 85.7%, which is also better than Hum-PLoc (85.0%) and Euk-OET-PLoc (83.7%) using ensemble classifiers with hybridization of GO terms and amphiphilic pseudo amino acid composition for SCL12 and SCL16, respectively. Conclusion The growth of Gene Ontology in size and popularity has increased the effectiveness of GO-based features. GOmining can serve as a tool for selecting informative GO terms in solving sequence-based prediction problems. The prediction system using ProLoc-GO with input sequences of query proteins for protein subcellular localization has been implemented (see Availability).


Background
Gene Ontology (GO) [1] annotation, which describes the function of genes and gene products across species, has recently been utilized to predict protein subcellular and subnuclear localization. The prediction of protein localization is important for elucidating protein functions involved in various cellular processes. Additionally, the completion of various genome sequencing projects has led to the accumulation of a massive amount of gene sequence information. For example, the percentage of large-scale eukaryotic proteins with subcellular locations annotated in the Swiss-Prot database increased rapidly from 52.4% (version 49.5, released on April 18, 2006) [2] to 69.4% (version 50.7, released Sep. 11, 2006) [3]. Meanwhile, the percentage of proteins with subcellular locations annotated in the GO database increased from 44.9% [2] to 65.5% [3]. The growth of the GO database in size and popularity increases the effectiveness of GO-based features.
Two efficient GO-based systems, Euk-OET-PLoc [2] and Hum-PLoc [4], predict subcellular localization of proteins using their known accession numbers. However, they cannot work for novel proteins without known accession numbers. The GO-AA method [5], which uses GO terms of homologies retrieved by BLAST to assess protein similarity, can deal with novel proteins without known accession numbers for subnuclear localization prediction. In addition, some SVM-based methods using only the features derived from input sequences, such as ProtLock with AAC [8], Ploc with AAC and acid pairs [10], and HSLPred with AAC and dipeptide composition [11], predict subcellular localization inaccurately [4]. Therefore, this study develops an accurate SVM-based method for predicting subcellular localization of novel proteins by using input sequences with BLAST.
The Gene Ontology provided by the GO Consortium [1] has quickly grown in size and popularity. The newest version (UniProt 52.0 released in September 2007) of GO [22] contained 29,383 terms in the three branches, molecular function, biological process and cellular component. The terms and relationships among them are represented by a directed acyclic graph in which vertices represent the GO terms, and edges represent the relationships among these terms. Genes can be annotated with GO terms creating gene associations that can be used for whole genome analyses [23].
GO annotation has been successfully used in various sequence-based applications, which can be classified into two groups. 1) The first group uses the GO terms and their corresponding structure information of the GO graph, such as grouping GO terms to improve the assessment of gene set enrichment [24]; using GO with probabilistic chain graphs for protein classification [25,26], and prediction of subnuclear localization [5]. 2) The second group uses GO terms only without structure information, such as predicting transcription factor DNA binding preference [27] and various predictions of subcellular and subnuclear localization [2,4,5]. In the second group, protein sequences are often represented as high dimensional vectors of n binary features, where n is the total number of terms in the complete annotation set (a component of 1 if the annotation is hit, and 0 otherwise) [28]. This representation is valuable in well-known vector space clustering algorithms such as k-NN [2,4,13,19,21] and fuzzy k-NN [13,29,30]. However, because n is often large, and each gene product is generally annotated by few GO terms, the vectors become long and sparse, making the clustering rather problematic [28].
This study proposes an efficient method, named GOmining, based on an intelligent genetic algorithm (IGA) [31,32] incorporating an SVM classifier to simultaneously identify a small number m out of a large number n of GO terms as input features, where m << n. Some GO annotations corresponding to subcellular compartments are called essential GO terms for subcellular localization prediction, such as GO:0005634 (Nucleus), GO:0005737 (Cytoplasm) and GO:0005856 (Cytoskeleton), shown in Table 1. These essential GO terms are regarded as domain knowledge to be included in the feature set of m informative GO terms for subcellular localization prediction. A prediction method ProLoc-GO based on GOmining was implemented using the feature set of informative GO terms. This method performed well in predicting protein subcellular localization from input sequences only.

Data sets
Two existing data sets, SCL12 [4] and SCL16 [2], obtained from the UniProtKB/Swiss-Prot database [33], were used to evaluate the proposed method ProLoc-GO. SCL12 and SCL16 have 2041 human proteins localized in 12 human subcellular compartments and 4150 eukaryotic proteins in 16 subcellular compartments, respectively. The two data sets were processed with a culling program [34] so that the remaining sequences had <25% sequence identity.
The proteins in SCL12 were screened strictly using the following rules: 1) only those sequences annotated with "human" in the ID (identification) field were collected; 2) sequences annotated with ambiguous or uncertain terms, such as "potential", "probable", "probably", "maybe", or "by similarity", were excluded; 3) sequences annotated by two or more locations were excluded, and 4) sequences with less than 50 amino acid residues were removed [4]. The data set SCL12 was divided into two parts, SCL12L and SCL12T, with 919 and 1122 proteins, respectively. The SCL12L set was used for training and the SCL12T was used for independent testing, as shown in Table 2 [4].
The proteins of SCL16 were screened according to four criteria. The first criterion is to exclude sequences annotated with "prokaryotic", because this study focused only on eukaryotic proteins. The other three criteria were the same as criteria 2-4 for SCL12 above. Table 3 shows the numbers of proteins within each compartment, where the SCL16 consists of two parts, SCL16L for training and SCL16T for independent testing. The sequences in the training and test data sets were obtained from the web servers of Euk-OET-PLoc [2] and Hum-PLoc [4].

GO annotation
This study applied the Gene Ontology Annotation (GOA) database [35], which includes GO annotations for non-redundant proteins from many species in the UniProtKB/Swiss-Prot database [33]. The GOA database was downloaded directly from [36] (UniProt 45.0 released in Jan. 2007). The accession numbers of proteins are required for querying the GOA database to obtain GO terms. BLAST [37,38] was used to obtain a homology with a known accession number to the protein for retrieving the GO terms. The corresponding accession numbers of all protein sequences in SCL12 and SCL16 were obtained by using BLAST with h = 1 and e = 10^-9. Table 4 shows the GO annotation results of all proteins in the training data sets SCL12L and SCL16L. For SCL12L, the size of the complete set of all GO terms that appeared was n = 1714 from the 919 human proteins. The smallest, largest and mean numbers of GO terms annotated for individual proteins were 0, 35 and 8.3, respectively. The percentage of training proteins whose homologies were not annotated by any GO term (that is, the number of GO terms annotated is zero) was 1.31%. For SCL16L, n = 2870 GO terms were obtained from 2423 eukaryotic proteins. The smallest, largest and mean numbers of GO terms annotated were 0, 50 and 7.7, respectively. The percentage of training proteins whose homologies were not annotated was 3.96%. The proteins annotated by GO are often represented as an n-dimensional binary feature vector, where the attribute value is 1 if the corresponding GO term is annotated, and 0 otherwise.
To assess the prediction performance attainable from the essential GO terms alone, we calculated the numbers of sequences annotated by g essential GO terms. Table 4 shows that 453 out of 919 (49.3%) sequences are annotated by only one essential GO term (g = 1) for SCL12L, where 425 sequences are correctly annotated and 28 sequences are incorrectly annotated. The other 466 sequences annotated by zero (g = 0) or more than one (g > 1) essential GO term cannot be effectively predicted. Table 2 lists the numbers of sequences which are correctly annotated by only one essential GO term for every compartment. The two GO terms, GO:0005634 (Nucleus) and GO:0005739 (Mitochondrion), which correctly annotate a large number of sequences (179 and 111, respectively), made a great contribution to the prediction accuracy of 46.2% (= 425/919).
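The binary representation described above can be sketched as follows. This is a minimal illustration only: the protein names and the three-term annotation set are toy assumptions, not data from SCL12L or SCL16L, and the helper names are hypothetical.

```python
from typing import Dict, List, Set


def build_vocabulary(annotations: Dict[str, Set[str]]) -> List[str]:
    """Collect every GO term that appears in the training set (size n), sorted."""
    terms: Set[str] = set()
    for go_terms in annotations.values():
        terms |= go_terms
    return sorted(terms)


def to_binary_vector(go_terms: Set[str], vocabulary: List[str]) -> List[int]:
    """1 if the corresponding GO term is annotated, 0 otherwise."""
    return [1 if term in go_terms else 0 for term in vocabulary]


# Toy annotations (assumed for illustration); protein_C has no GO term at all.
annotations = {
    "protein_A": {"GO:0005634", "GO:0005737"},
    "protein_B": {"GO:0005856"},
    "protein_C": set(),
}
vocabulary = build_vocabulary(annotations)
vectors = {name: to_binary_vector(terms, vocabulary)
           for name, terms in annotations.items()}

for name, vector in vectors.items():
    print(name, vector)
```

A protein whose homology carries no GO annotation maps to the all-zero vector, which is the case counted by the 1.31% and 3.96% figures above.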
As for SCL16L, the number of sequences annotated by only one essential GO term is 1247 out of 2423 (51.5%). Table 3 lists the numbers of sequences which are correctly annotated by only one essential GO term for every compartment. Only 48.0% (= 1162/2423) of the sequences with known accession numbers can be correctly predicted by using only the annotation of essential GO terms. According to Table 3, the three essential GO terms, GO:0005634 (Nucleus, 395 out of 474), GO:0009507 (Chloroplast, 192 out of 207) and GO:0005739 (Mitochondrion, 173 out of 183), made a great contribution to prediction accuracy.
The analytic results reveal that it is not sufficient to use only essential GO terms for accurately predicting protein subcellular localization. However, the essential GO terms play an important role in designing GO-based prediction methods.
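The essential-term-only baseline analyzed above can be written as a short rule: predict a compartment only when exactly one essential GO term is annotated (g = 1). The sketch below is an illustration under assumptions; the three-entry essential-term map and the four toy proteins are made up for the example, and the function names are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# Subset of essential GO terms named in the text, mapped to their compartments.
ESSENTIAL: Dict[str, str] = {
    "GO:0005634": "Nucleus",
    "GO:0005737": "Cytoplasm",
    "GO:0005856": "Cytoskeleton",
}


def predict_from_essential(go_terms: Set[str],
                           essential: Dict[str, str]) -> Optional[str]:
    """Return a compartment only when exactly one essential term is hit (g = 1)."""
    hits = [term for term in go_terms if term in essential]
    if len(hits) == 1:
        return essential[hits[0]]
    return None  # g = 0 or g > 1: no effective prediction


def essential_only_accuracy(proteins: List[Tuple[Set[str], str]],
                            essential: Dict[str, str]) -> float:
    """Fraction of all proteins whose single essential term matches the true location."""
    correct = sum(1 for terms, location in proteins
                  if predict_from_essential(terms, essential) == location)
    return correct / len(proteins)


# Toy proteins (assumed): (annotated GO terms, true location).
toy_proteins = [
    ({"GO:0005634"}, "Nucleus"),                # g = 1, correct
    ({"GO:0005634", "GO:0005737"}, "Nucleus"),  # g = 2, not predicted
    (set(), "Cytoplasm"),                       # g = 0, not predicted
    ({"GO:0005856"}, "Nucleus"),                # g = 1, incorrect
]
print(essential_only_accuracy(toy_proteins, ESSENTIAL))
```

The accuracy is taken over all proteins, mirroring how 425/919 and 1162/2423 are computed in the text.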

Selected informative GO terms
Selecting a set of m informative GO terms out of n candidate GO terms is a combinatorial optimization problem C(n, m), which can be solved by using the intelligent genetic algorithm with an inheritance mechanism (IGA) [31,32]. IGA can efficiently search for the solution S_{r+1} to C(n, r+1) by inheriting a good solution S_r to C(n, r). This study proposes an efficient algorithm based on IGA, called GOmining, to identify a small set of m informative GO terms including the essential GO terms as features to SVM. The GOmining algorithm incorporates LIBSVM [39] using a series of binary classifiers. GOmining aims to maximize the training accuracy of prediction using 10-fold cross-validation (10-CV) when identifying the m informative GO terms.
The SVM classifier based on the selected informative GO terms as features is called SVM-IGO. To evaluate a candidate set of r informative GO terms accompanied with the SVM parameters, the prediction accuracy of 10-CV serves as a fitness function of IGA. Figure 1 shows the results of SVM-IGO from r = 40, 41,..., 70. Table 5 lists the m = 44 informative GO terms for SCL12L obtained from the highest accuracy of 89.8% (r = 44), where the SVM parameters (C, γ) = (2^3, 2^-4). Table 6 lists the m = 60 informative GO terms for SCL16L, where the highest accuracy was 86.5%, and (C, γ) = (2^5, 2^-3).
Figure 1: Training accuracies of SVM-IGO and SVM-RBS performed by using SVM with a number r of selected informative GO terms.
The orthogonal experimental design with orthogonal array and factor analysis used in IGA is an efficient method for simultaneously examining the individual effects of several factors on the evaluation function [40,41]. The factors are the parameters (GO terms) that manipulate the evaluation function, and a setting of a parameter is regarded as a level of the factor. In this study, the two levels of one factor are the inclusion and exclusion of the ith GO term in the feature selection using IGA. The factor analysis can quantify the effects of individual factors on the evaluation function, rank the most effective factors and determine the best level for each factor to optimize the evaluation function. The most effective factor has the largest main effect difference (MED). Tables 5 and 6 show that the essential GO term GO:0005634 (Nucleus), which has the largest value of MED, is the most effective feature for discrimination. Only one essential GO term, GO:0030198 (Extracellular matrix organization and biogenesis), belongs to the biological process branch; the other essential GO terms belong to the cellular component branch. The abbreviations M, B and C represent the three branches molecular function, biological process, and cellular component, respectively. Table 7 lists all prediction accuracies using 10-CV for both data sets SCL12L and SCL16L.
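The two-level factor analysis can be illustrated with the smallest standard orthogonal array, L4(2^3). The sketch below is an assumption-laden toy: the array size, the additive fitness function and its weights are chosen only for illustration, whereas in GOmining the response would be the 10-CV accuracy, and the exact main-effect formula used by IGA [40,41] may differ in detail.

```python
from typing import Callable, List, Sequence, Tuple

# Standard L4(2^3) orthogonal array: 4 experiments, 3 two-level factors.
# Level 1 = include the GO term, level 0 = exclude it.
L4: List[Tuple[int, int, int]] = [
    (0, 0, 0),
    (0, 1, 1),
    (1, 0, 1),
    (1, 1, 0),
]


def main_effect_differences(array: Sequence[Sequence[int]],
                            fitness: Callable[[Sequence[int]], float]) -> List[float]:
    """MED of each factor: |sum of responses at level 1 - sum at level 0|."""
    responses = [fitness(row) for row in array]
    meds = []
    for j in range(len(array[0])):
        included = sum(y for row, y in zip(array, responses) if row[j] == 1)
        excluded = sum(y for row, y in zip(array, responses) if row[j] == 0)
        meds.append(abs(included - excluded))
    return meds


def toy_fitness(row: Sequence[int]) -> float:
    """Hypothetical stand-in for 10-CV accuracy: each included term adds a fixed gain."""
    gains = (0.30, 0.05, 0.10)
    return 0.5 + sum(g * level for g, level in zip(gains, row))


meds = main_effect_differences(L4, toy_fitness)
ranking = sorted(range(len(meds)), key=lambda j: meds[j], reverse=True)
print("MED per factor:", meds)
print("factors ranked by effect:", ranking)
```

Because every pair of columns in an orthogonal array is balanced, each factor's effect is estimated from only four evaluations instead of the eight required by a full factorial design.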

Evaluation of feature selection
The highest accuracies of SVM-RBS are 86.5% and 83.5% using 65 and 68 selected GO terms for SCL12L and SCL16L, respectively, shown in Fig. 1. Table 7 shows that the three SVM-based classifiers (SVM-GO, SVM-RBS and SVM-IGO), with accuracies >80%, were better than the two k-NN based classifiers (k-NN-GO and fuzzy k-NN-GO), with accuracies <75%, for both data sets. SVM-IGO had the highest accuracies 89.8% and 86.5% for SCL12L and SCL16L, respectively. The GO term selection method based on GOmining was more effective than RBS and the method without selection of GO terms. Furthermore, SVM uses the selected GO terms as features, making it better than the k-NN classifier.

Performance comparison
The proposed ProLoc-GO method predicts the subcellular localization of an input sequence using either SVM-IGO or SVM-GO, depending on its annotation on the informative GO terms (see Methods for detail). Tables 8, 9, 10, 11 list the results of ProLoc-GO using SCL12 and SCL16. Some existing AAC-based prediction methods, such as ProtLock [8], Least Euclidean distance [9], Ploc [10] and HSLPred [11], use only the query sequence as input data for their classifiers. Hum-PLoc [4] and Euk-OET-PLoc [2] use both the sequence and its accession number as input data. For comparison with these predictors, the method ProLoc-GO was performed using the two kinds of input data separately. The first test used only the sequence and used BLAST to obtain annotated GO terms. The second test used the known accession number of proteins directly. For the accuracy on both SCL12L and SCL16L, ProLoc-GO used leave-one-out cross-validation (LOOCV) for comparison with the other methods (see Methods section).
The Matthews correlation coefficient (MCC) [5,12,18] is commonly used to evaluate performance on unbalanced data sets. In addition to the overall accuracy, the MCC values were also recorded because the numbers of proteins localized in the compartments are unbalanced, such as 196 in Nucleus vs. 7 in Microsome (Table 2). The MCC is defined as follows [5]:
$$\mathrm{MCC} = \frac{1}{N_c} \sum_{c=1}^{N_c} \frac{p_c s_c - u_c o_c}{\sqrt{(p_c + u_c)(p_c + o_c)(s_c + u_c)(s_c + o_c)}} \qquad (1)$$
where p_c is the number of correctly predicted proteins of the location c, s_c is the number of correctly predicted proteins not in the location c, u_c is the number of under-predicted proteins, o_c is the number of over-predicted proteins, and N_c is the number of locations. The test MCC performances of ProLoc-GO were 0.822 and 0.661 for SCL12L and SCL12T, respectively. Table 10 presents the detailed results for individual compartments. The results of the five sequence-based methods reveal that the set of informative GO terms is more useful for protein subcellular localization than the AAC-based features.
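A minimal sketch of the MCC computation with the counts p_c, s_c, u_c and o_c defined above follows. It uses the standard one-vs-rest MCC per location; averaging over the N_c locations and returning 0 when the denominator vanishes are assumptions of this sketch, and the labels are toy data.

```python
import math
from typing import List


def mcc_for_location(true_labels: List[str], predicted: List[str], location: str) -> float:
    """One-vs-rest MCC for a single location c."""
    p = s = u = o = 0
    for truth, guess in zip(true_labels, predicted):
        if truth == location and guess == location:
            p += 1  # correctly predicted as c
        elif truth != location and guess != location:
            s += 1  # correctly predicted as not c
        elif truth == location:
            u += 1  # under-predicted: in c but assigned elsewhere
        else:
            o += 1  # over-predicted: assigned to c but located elsewhere
    denominator = math.sqrt((p + u) * (p + o) * (s + u) * (s + o))
    # Assumption: report 0 when the coefficient is undefined.
    return (p * s - u * o) / denominator if denominator else 0.0


def mean_mcc(true_labels: List[str], predicted: List[str]) -> float:
    """Average the per-location MCC over all N_c locations (assumed aggregation)."""
    locations = sorted(set(true_labels))
    return sum(mcc_for_location(true_labels, predicted, c) for c in locations) / len(locations)


# Toy labels (assumed for illustration).
truth = ["Nucleus", "Nucleus", "Cytoplasm", "Cytoplasm"]
guess = ["Nucleus", "Cytoplasm", "Cytoplasm", "Cytoplasm"]
print(mean_mcc(truth, guess))
```

Unlike overall accuracy, the per-location coefficient penalizes a classifier that ignores a small compartment such as Microsome.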

Performance of using known accession numbers
The accession number of each protein sequence in SCL12 and SCL16 was available for querying the GOA database. [...] for SVM-IGO and (C, γ) = (2^3, 2^-5) for SVM-GO. Euk-OET-PLoc [2], using ensemble classifiers with features of both sequence and accession number, obtains training and test accuracies of 81.6% and 83.7%, respectively. ProLoc-GO performed better than Euk-OET-PLoc on SCL16 using either sequences or accession numbers as the input data [2].

Analysis of informative GO terms
The GOmining method identifies a feature set of m effective GO terms, called informative GO terms, to design an accurate SVM-based prediction method. Table 12 shows the distribution of the m informative GO terms in the GO graph. For SCL12L with m = 44, GOmining selected 12 essential GO terms and 32 instructive GO terms. The 32 instructive GO terms consist of 7 GO terms from the molecular function branch, 14 terms from the biological process branch, and 11 terms from the cellular component branch, denoted as 7(M), 14(B) and 11(C), respectively. Analytical results reveal that all the three branches contain instructive GO terms.
Due to the high correlation among GO terms in the GO graph, the feature selection of SVM should consider simultaneously a set of informative GO terms, rather than individual GO terms. Since the essential GO terms are always included, GOmining benefits from a confined search space of candidate instructive GO terms. Considering the position relationships between instructive and essential GO terms in the GO graph, instructive GO terms belonged to one of the three classes: (a) offspring but not ancestor of some essential GO term; (b) between two essential GO terms, and (c) not offspring of any essential GO term. Of the 32 instructive GO terms, 4, 2 and 26 GO terms belonged to the classes (a), (b) and (c), respectively. The 26 GO terms consist of 7(M), 14(B) and 5(C). The GO terms near the root of the GO graphs are considered to be more generic while terms near the leaves are more specific [23]. Of the instructive GO terms, 81.2% (26/32) were not offspring of any essential GO term. These analytical results reveal that the essential GO terms are informative enough in predicting subcellular localization, and are effective in confining the space of searching instructive GO terms. The other six instructive GO terms from the cellular component branch have more specific functions than the essential GO terms in discrimination of the subcellular localization. The statistical results of instructive GO terms distributed in the three classes for both SCL12L and SCL16L reveal that the inclusion of essential GO terms can be regarded as using domain knowledge for GOmining to mine a feature set of informative GO terms. The heuristic approach (using domain knowledge) of GOmining is efficient when the GO database grows fast. Therefore, GOmining can be easily applied to other applications of sequence-based predictions using SVM with the features of informative GO terms.
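The three classes (a), (b) and (c) can be assigned mechanically from ancestor sets in the GO graph. The sketch below is illustrative only: the term IDs are those named in the Figure 3 caption, but the parent-child edges of the toy graph are assumptions, `TERM_X` is a hypothetical unrelated term, and the function names are not from the paper.

```python
from typing import Dict, Set


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """All terms reachable by following parent edges in the directed acyclic graph."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def classify_instructive(term: str, essential: Set[str],
                         parents: Dict[str, Set[str]]) -> str:
    """Return 'a', 'b' or 'c' following the three classes in the text."""
    is_offspring = bool(ancestors(term, parents) & essential)
    is_ancestor = any(term in ancestors(e, parents) for e in essential)
    if not is_offspring:
        return "c"  # not offspring of any essential GO term
    return "b" if is_ancestor else "a"  # between two essential terms, or offspring only


# Toy graph: child -> parents. The edges are assumed for illustration.
toy_parents: Dict[str, Set[str]] = {
    "GO:0005815": {"GO:0005856"},
    "GO:0005813": {"GO:0005815"},
    "GO:0005814": {"GO:0005813"},
    "GO:0000922": {"GO:0005856"},
    "TERM_X": set(),
}

scl12_essential = {"GO:0005856", "GO:0005814"}
scl16_essential = {"GO:0005856"}
print(classify_instructive("GO:0005815", scl12_essential, toy_parents))
print(classify_instructive("GO:0005813", scl16_essential, toy_parents))
print(classify_instructive("TERM_X", scl12_essential, toy_parents))
```

The same term can change class between data sets because the set of essential GO terms differs, as happens for GO:0005813 here.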

Discussion
The GO database has grown in size recently, increasing the effectiveness of GO-based features. Meanwhile, the percentage of proteins with subcellular locations annotated in the GO database increased rapidly from 44.9% [2] to 65.5% [3]. It is indicated that there is a linkage in the GO annotation process between molecular function annotation and subcellular localization annotation [43]. Therefore, the GO-based prediction method for protein subcellular localization is increasingly efficient. Because the accession number of proteins is necessary for retrieving GO terms from GO databases, existing efficient GO-based systems Euk-OET-PLoc [2] and Hum-PLoc [4] directly utilize the accession numbers of proteins and a large number n of GO terms annotated in a complete set where n = 9918 for SCL12L [4] and n = 9567 for SCL16L [2].
Figure 3: Some of the selected GO terms which are between two essential GO terms. For SCL12L, the two instructive GO terms GO:0005815 and GO:0005813 are between the essential GO terms GO:0005856 and GO:0005814. For SCL16L, GO:0005813 and GO:0000922 are offspring of the essential GO term GO:0005856, belonging to the class (a). GO:0005814 is not an essential GO term for SCL16L.
To predict subcellular localizations for novel proteins, ProLoc-GO uses a good homology, rather than the query protein itself, to retrieve annotated GO terms using BLAST. To use GO term features effectively, ProLoc-GO uses only a homology with annotated GO terms to reduce n. Thus, n = 1714 for SCL12L and n = 2870 for SCL16L. Furthermore, a small set of m informative GO terms is selected simultaneously by GOmining. GOmining can consider internal relevant-feature correlation, instead of individual features by using an efficient global optimization method. The distribution analysis of informative GO terms in the GO graph is consistent with the properties of

Conclusion
Computational prediction methods from primary protein sequences are fairly economical in terms of identifying large-scale eukaryotic proteins with unknown functions. The GO annotation, which describes the function of genes and gene products across species, has been used to improve the prediction of protein subcellular localization. The accession numbers of proteins are necessary to query the GOA database to obtain GO terms. Since novel proteins have no known accession numbers, BLAST was used to obtain homologies with known accession numbers to the proteins for the retrieval of GO terms.
GO annotation has grown in size and popularity. However, few studies have explored informative GO terms from the over 20,000 annotations available at present for sequence-based prediction problems. This study proposes a genetic algorithm based method, GOmining, which is combined with SVM to simultaneously identify a small number m out of the n GO terms as features to SVM, where m << n. The m GO terms include the essential GO terms annotating subcellular compartments such as GO:0005634 (Nucleus), GO:0005737 (Cytoplasm) and GO:0005856 (Cytoskeleton). ProLoc-GO was evaluated using SVM with the GO-based features from two kinds of input data, sequence and known accession numbers of proteins.
ProLoc-GO yields test accuracies of 88.1% and 83.3% for SCL12 and SCL16, respectively, when using only input sequences. These results are significantly superior to those of the other SVM-based methods, which have accuracies <35% using AAC with acid pairs, and using AAC with dipeptide composition. ProLoc-GO using known accession numbers of proteins has accuracies of 90.6% and 85.7% for SCL12 and SCL16, which are also better than those of Hum-PLoc and Euk-OET-PLoc, which have 85.0% and 83.7%, respectively.
Analysis of m informative GO terms in the GO graph reveals that GOmining can consider internal relevant-feature correlation, rather than individual features, by using an efficient global optimization method. GOmining can serve as an efficient tool for mining informative GO terms for various sequence-based predictions of proteins, especially when the GO database grows fast. The prediction system using ProLoc-GO with protein sequence as input data for protein subcellular localization has been implemented (see Availability).

Proposed GOmining algorithm
An efficient genetic-algorithm-based method, called GOmining, is proposed for selecting informative GO terms. GOmining uses an intelligent genetic algorithm with an inheritance mechanism (IGA) [31,32], combined with an SVM classifier, to simultaneously identify a small number m out of a large number n of GO terms as input features, where m << n. The exploration of the m informative GO terms from n candidate GO terms is a combinatorial optimization problem C(n, m) with a huge search space of size C(n, m) = n!/(m!(n-m)!). An IGA based on orthogonal experimental design using a divide-and-conquer strategy and systematic reasoning method can efficiently solve this large combinatorial optimization problem.
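To give a sense of the size of C(n, m), the binomial coefficient can be evaluated exactly with Python's arbitrary-precision integers. The values n = 1714 and m = 44 are those reported for SCL12L; the helper name is hypothetical.

```python
import math


def search_space_size(n: int, m: int) -> int:
    """Number of ways to choose m GO terms out of n candidates: n! / (m! (n-m)!)."""
    return math.comb(n, m)


size = search_space_size(1714, 44)
print("decimal digits in C(1714, 44):", len(str(size)))
```

Exhaustive enumeration of such a space is infeasible, which motivates a guided search such as IGA.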
Leave-one-out cross-validation (LOOCV) is considered to be the most rigorous and objective test. Although bias-free, this test is computationally demanding and is often impractical for large data sets. N-fold cross-validation not only provides a bias-free estimate of the accuracy at a much reduced computational cost, but is also considered an acceptable test for evaluating the prediction performance of an algorithm [44]. Therefore, in consideration of the computational cost, GOmining uses the prediction accuracy of 10-CV as the fitness function to perform IGA on the entire training sets of proteins.
The input of the algorithm GOmining is composed of 1) a training set of protein sequences categorized into a number of compartments (classes), and 2) the essential GO terms corresponding to the compartments. The output comprises a set of m informative GO terms and the associated parameter settings of an SVM classifier. Since novel sequences without known accession numbers rely on BLAST to obtain annotated GO terms, all training sequences use the same BLAST procedure to obtain GO terms for consistency.
Step 2: (sequence representation) Obtain annotated GO terms from the GOA database for all training proteins using BLAST with h = 1 and e = 10^-9. Let n be the total number of GO terms that appear among all proteins in the training data set. For example, n = 1714 and n = 2870 were derived for SCL12L and SCL16L, respectively. Each protein is represented as an n-dimensional binary feature vector.
Step 3: (inclusion of essential GO terms) Identify d essential GO terms out of n GO terms and number them from 1 to d. For example, d = 12 and d = 15 were found from SCL12L and SCL16L, respectively.
Step 4: (chromosome encoding) The IGA-chromosome comprises n binary IGA-genes f_i for selecting informative GO terms and two 4-bit IGA-genes for encoding γ and C, where f_i = 1, i = 1,..., d. The ith GO term is included in the feature set of the SVM classifier if f_i = 1; otherwise, the ith GO term is excluded (f_i = 0). Figure 5 shows the sequence representation and IGA-chromosome encoding method.
Step 5: (initial solution) Perform IGA to select r_start out of n GO terms, i.e., the solution to C(n, r_start), where the d GO terms are always selected. Table 13 shows the parameter settings of IGA, such as crossover probability p_c = 0.8. The procedure of IGA is described in detail in the work [18].
Step 6: (inheritance mechanism) The inheritance mechanism of IGA can efficiently search for the solution to C(n, r+1) by inheriting a good solution S_r to C(n, r). Obtain all solutions S_r for r = r_start + 1,..., r_end one by one using IGA [31,32]. For example, r_start = 40 and r_end = 70 according to former experience.
Step 7: (decoding chromosome) Let S_m be the most accurate solution with m selected GO terms among all solutions S_r. Obtain the m informative GO terms and parameter values of γ and C.
Step 8: (robust performance) Perform Steps 5-7 for N independent runs to obtain the best one of N solutions S_m and the associated settings of the SVM parameters. The best solution considers both high prediction accuracy and high mean frequency of the m selected GO terms appearing in the N runs. In this study, N = 30.
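The chromosome layout of Step 4 and the decoding of Step 7 can be sketched as below. Several details are assumptions of this sketch rather than statements of the paper: the gene order (n selection bits, then γ, then C), zero-based indexing with the essential terms in positions 0..d-1, and the mapping of a 4-bit value v to 2^(v-7), chosen so that 16 codes span 2^-7 to 2^8.

```python
from typing import List, Sequence, Tuple


def bits_to_int(bits: Sequence[int]) -> int:
    """Interpret a bit sequence as an unsigned integer, most significant bit first."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def decode_chromosome(chromosome: Sequence[int], n: int,
                      d: int) -> Tuple[List[int], float, float]:
    """Return (indices of selected GO terms, gamma, C) from n + 8 binary genes."""
    if len(chromosome) != n + 8:
        raise ValueError("expected n selection genes plus two 4-bit parameter genes")
    # Essential terms occupy positions 0..d-1 and are always selected (f_i = 1).
    selected = [i for i in range(n) if i < d or chromosome[i] == 1]
    # Assumed mapping: 4-bit value v -> 2^(v - 7).
    gamma = 2.0 ** (bits_to_int(chromosome[n:n + 4]) - 7)
    cost = 2.0 ** (bits_to_int(chromosome[n + 4:n + 8]) - 7)
    return selected, gamma, cost


# Toy chromosome with n = 6 candidate terms and d = 2 essential terms.
toy = [0, 0, 1, 0, 0, 1] + [0, 0, 1, 1] + [1, 0, 1, 0]
selected, gamma, cost = decode_chromosome(toy, n=6, d=2)
print(selected, gamma, cost)
```

Forcing the first d genes on at decoding time keeps the essential GO terms in every candidate solution regardless of what crossover or mutation does to those positions.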

ProLoc-GO
As shown in Fig. 6, each query protein is first BLASTed with h = 1 and e = 10^-9 against the Swiss-Prot database to obtain a homology with a known accession number. If no such homology exists, then adjust the threshold value e of BLAST until the desired homology is obtained, where h = 1 and e ∈ {10^-9, 10^-8,..., 10^-1}. The accession number of the homology of each protein sequence in SCL12 and SCL16 was obtained by using BLAST with h = 1 and e = 10^-9. This accession number is used as input to the GOA database for retrieving the corresponding k (>1) GO terms: GO:1, GO:2,... GO:k. If none of the k GO terms belongs to the set of m informative GO terms, then the sequence is represented using an n-dimensional binary vector and is predicted by the SVM-GO classifier. Otherwise, the sequence is represented as an m-dimensional binary vector and is predicted by the SVM-IGO classifier. Notably, the SVM-GO classifier predicts only a very small percentage of input sequences. ProLoc-GO is derived from the two major classifiers SVM-GO and SVM-IGO for subcellular localization prediction.
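The control flow of this prediction procedure can be sketched as follows. The sketch does not call BLAST, the GOA database or LIBSVM: `blast_top_hit`, `svm_go` and `svm_igo` are hypothetical callables supplied by the caller, and the toy stand-ins below exist only to show which branch is taken.

```python
from typing import Callable, List, Optional, Sequence, Set


def find_homolog(sequence: str,
                 blast_top_hit: Callable[[str, float], Optional[str]]) -> Optional[str]:
    """Relax the e-value threshold from 10^-9 to 10^-1 until a top hit (h = 1) is found."""
    for exponent in range(-9, 0):
        accession = blast_top_hit(sequence, 10.0 ** exponent)
        if accession is not None:
            return accession
    return None


def prolocgo_predict(go_terms: Set[str],
                     all_terms: Sequence[str],
                     informative_terms: Sequence[str],
                     svm_go: Callable[[List[int]], str],
                     svm_igo: Callable[[List[int]], str]) -> str:
    """Use SVM-IGO when any informative GO term is annotated, otherwise SVM-GO."""
    if go_terms & set(informative_terms):
        vector = [1 if t in go_terms else 0 for t in informative_terms]  # m-dimensional
        return svm_igo(vector)
    vector = [1 if t in go_terms else 0 for t in all_terms]  # n-dimensional
    return svm_go(vector)


# Toy stand-ins (assumed): four candidate terms, two of them informative.
ALL_TERMS = ["GO:0005634", "GO:0005737", "GO:0005739", "GO:0005856"]
INFORMATIVE = ["GO:0005634", "GO:0005737"]


def fake_svm_go(vector: List[int]) -> str:
    return "SVM-GO saw %d features" % len(vector)


def fake_svm_igo(vector: List[int]) -> str:
    return "SVM-IGO saw %d features" % len(vector)


print(prolocgo_predict({"GO:0005634"}, ALL_TERMS, INFORMATIVE, fake_svm_go, fake_svm_igo))
print(prolocgo_predict({"GO:0005739"}, ALL_TERMS, INFORMATIVE, fake_svm_go, fake_svm_igo))
```

Passing the classifiers in as callables keeps the routing logic testable without a trained model.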

Fuzzy k-NN
The protein is represented as an n-dimensional binary vector, and the generalized distance between two proteins P and P_i [2] is defined as
$$D(P, P_i) = 1 - \frac{P \cdot P_i}{\|P\| \, \|P_i\|}$$
where P·P_i is the dot product of vectors P and P_i, and ||P|| and ||P_i|| are their moduli.
This study determined the best value of k by using a stepwise approach where k ∈ {1, 2,..., 10}.
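The fuzzy k-NN classifier [13,29,30] is a variation of k-NN, which assigns fuzzy membership values r_c(P) of a query sequence P to each class c as follows:
$$r_c(P) = \frac{\sum_{j=1}^{k} r_c(P_j)\, D(P, P_j)^{-2/(w-1)}}{\sum_{j=1}^{k} D(P, P_j)^{-2/(w-1)}}$$
where P_j is the jth of the k nearest training proteins, r_c(P_j) is its membership value for class c, w is the fuzzy weight, and D is the generalized distance between two proteins defined above. In this study, the best values of parameters (k, w) are tuned iteratively from k ∈ {1, 2,..., 10} and w ∈ {1.05, 1.10,..., 1.95} for the fuzzy k-NN classifier.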
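A minimal sketch of the distance and of fuzzy k-NN memberships follows. It rests on assumptions that the text does not spell out: the common inverse-distance weighting with exponent -2/(w-1), crisp memberships for training proteins, a distance of 1 for unannotated (all-zero) vectors, and equal sharing among exact matches. Weights are normalized in log space because small distances raised to a large negative power would overflow.

```python
import math
from typing import Dict, List, Sequence, Tuple


def generalized_distance(p: Sequence[int], q: Sequence[int]) -> float:
    """1 - (p . q) / (||p|| ||q||), clamped at 0 to absorb rounding error."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 1.0  # assumption: an unannotated protein is maximally distant
    return max(0.0, 1.0 - dot / (norm_p * norm_q))


def fuzzy_knn_memberships(query: Sequence[int],
                          training: List[Tuple[Sequence[int], str]],
                          k: int, w: float) -> Dict[str, float]:
    """Class memberships of the query from its k nearest training proteins."""
    scored = [(generalized_distance(query, vec), label) for vec, label in training]
    neighbours = sorted(scored, key=lambda item: item[0])[:k]
    exact = [label for dist, label in neighbours if dist == 0.0]
    if exact:
        # Assumption: exact matches share the whole membership equally.
        weighted = [(label, 1.0 / len(exact)) for label in exact]
    else:
        exponent = -2.0 / (w - 1.0)
        logs = [exponent * math.log(dist) for dist, _ in neighbours]
        top = max(logs)
        raw = [math.exp(value - top) for value in logs]
        total = sum(raw)
        weighted = [(label, r / total) for (_, label), r in zip(neighbours, raw)]
    memberships: Dict[str, float] = {}
    for label, weight in weighted:
        memberships[label] = memberships.get(label, 0.0) + weight
    return memberships


# Toy training set (assumed): binary GO vectors with crisp class labels.
toy_training = [
    ([1, 1, 0], "Nucleus"),
    ([1, 0, 0], "Nucleus"),
    ([0, 0, 1], "Cytoplasm"),
]
result = fuzzy_knn_memberships([1, 1, 0], toy_training, k=3, w=1.5)
print(max(result, key=result.get), result)
```

The query is assigned to the class with the largest membership; values of w close to 1 make the nearest neighbour dominate, whereas larger w spreads the vote more evenly.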

SVM-RBS
To evaluate the proposed IGA-based feature selection method GOmining, this study implements a classifier SVM-RBS by using SVM with a subset of the n GO terms by the rank-based selection (RBS) method [17,42]. One previous work on ProLoc [18] showed that this univariate method RBS is inferior to the multivariate feature selection by IGA for selecting physicochemical properties. First, each of all n GO terms (for example, n = 1714 for SCL12L) is ranked according to the accuracy of SVM with the evaluated single feature, where the best values of parameters (C, γ) were determined using a step-wise approach where γ ∈ {2^-7, 2^-6,..., 2^8} and C ∈ {2^-7, 2^-6,...,