Assessing protein similarity with Gene Ontology and its use in subnuclear localization prediction

Background The completion of the various genome sequencing projects has resulted in the accumulation of a massive amount of gene sequence information. This calls for large-scale computational methods for predicting protein localization from sequence. The localization of a protein can provide valuable information about its molecular function, as well as the biological pathway in which it participates. Predicting the localization of a protein at the subnuclear level is a challenging task. In our previous work we proposed an SVM-based system that uses protein sequence information for this prediction task. In this work, we assess protein similarity with Gene Ontology (GO) and then improve the performance of the system by adding a nearest neighbor classifier module that uses a similarity measure derived from the GO annotation terms of protein sequences. Results The performance of the new system was compared with that of our previous system using a set of proteins residing within 6 localizations collected from the Nuclear Protein Database (NPD). The overall MCC (accuracy) is elevated from 0.284 (50.0%) to 0.519 (66.5%) for single-localization proteins in leave-one-out cross-validation, and from 0.420 (65.2%) to 0.541 (65.2%) for an independent set of multi-localization proteins. The new system is available at . Conclusion The prediction of protein subnuclear localizations can be strongly influenced by the definition of similarity for a pair of proteins, which can be built on different similarity measures of GO terms. Using the sum of similarity scores over the matched GO term pairs for two proteins as the similarity definition produced the best predictive outcome. Substantial improvement in predicting protein subnuclear localizations has been achieved by combining Gene Ontology with sequence information.


Background
With the completion of genomic sequencing projects, the need for automated prediction of protein subcellular or subnuclear localizations has become increasingly important. The localization of a protein can provide valuable information about its molecular function, as well as the biological pathway in which it participates [1,2]. The bulk of past work has focused on protein subcellular localizations [3][4][5][6][7][8][9][10][11][12][13][14][15], and has achieved high accuracy. However, the prediction of protein localization at the subnuclear level is far more challenging. We have developed the first SVM-based system using protein sequence information for this task with considerable predictive accuracy [16]. In this work, we attempted to improve the performance of the system by incorporating information obtained from Gene Ontology (GO).
GO has been developed to help manage the overwhelming mass of current biological data that are difficult to tie together into a cohesive whole from a computational perspective [17,18]. It has become a de facto standard tool to annotate gene products for various databases. GO is a controlled vocabulary of terms split into three related ontologies: Molecular Function (MF), Biological Process (BP) and Cellular Component (CC). Molecular function describes activities, such as catalytic or binding activities, at the molecular level. Molecular functions generally correspond to activities that can be performed by individual gene products, but some activities are performed by assembled complexes of gene products. A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions. A cellular component is a component of a cell, with the proviso that it is part of some larger object such as an anatomical structure or a gene product group. A gene product might be associated with or located in one or more cellular components [17]. It is active in one or more biological processes, during which it performs one or more molecular functions.
Each category of GO terms is structured as a directed acyclic graph (DAG). Currently there are over 20,000 GO terms [18]. The relationships between GO terms have been extensively explored and applied to various biological problems, such as the search for genes with similar function. One of the key problems in these applications is how to define similarity between two GO terms. Lord et al. [19,20] proposed a measure based on information content for the semantic similarity of GO terms. They showed that the semantic similarity is correlated with protein sequence similarity and that this correlation is more marked for Molecular Function annotation. However, their definition of the similarity measure relies on a particular database, e.g. SWISS-PROT. Zhang et al. [21] used a recursive procedure to define a statistical measure, the D-value (distribution value), for each GO term in the GO DAG to avoid the dependency on a single annotation database, and developed a gene functional similarity search tool. Gentleman [22] proposed two measures based on graph similarity: simUI and simLP. The former is the ratio of the number of common nodes in the two graphs reduced from the GO DAG to the number of nodes in their union. The latter is defined as the depth of the longest shared path from the root node. Wu et al. [23] predicted functional modules encoded in microbial genomes using a similarity measure similar to simLP.
Although the semantic similarity between two GO terms has been extensively investigated, how to define similarity between two gene products based on GO annotations for a specific application remains unclear. Suppose that each gene product is annotated by a set of GO terms. Each GO term from one set will be paired with all GO terms in the other set. There are three general ways of defining similarity for two gene products from those GO term pairs: (1) to take the maximum value of the similarity scores of GO term pairs [23,24], (2) to take the average over all the similarity scores of GO term pairs [19,20], and (3) to count the number of identical GO terms in the two GO term sets [9,25]. We are particularly interested in identifying an appropriate definition of protein similarity for the prediction of protein subnuclear localization. To do so, it is necessary to investigate how various combinations of GO term similarity measures and protein pair similarity definitions affect the predictive performance. This evaluation was carried out with our new predictive system, which extends the previous SVM module [16] with a nearest neighbor classification module built on a similarity definition between a pair of proteins.

Dataset
To provide a valid comparison with our previous system, the same dataset as in [16] was used for evaluation of the new system. The dataset was extracted from the Nuclear Protein Database (NPD) [26] using a Perl script. The NPD is a curated database that stores information on more than 2000 vertebrate proteins, chiefly from human and mouse, which are reported in the literature to be localized in the cell nucleus. Since certain proteins are associated with more than one compartment, a test dataset consisting of proteins with multiple localizations was extracted. Each of these proteins appears under the same SwissProt or Entrez Protein accession number in different compartments. This preparative procedure resulted in 92 proteins that are localized within the six compartments described below. The majority are localized in 2 compartments and the remainder are localized in 3 or 4 compartments. After excluding the multi-localization proteins, a non-redundant dataset was further constructed by PRO-SET [27] to ensure low sequence identity (<50%). In order to have a sufficient number of proteins for training and testing, only six localizations were selected for evaluation. These are PML Body (38), Nuclear Lamina (55), Nuclear Splicing Speckles (56), Chromatin (61), Nucleoplasm (75), and Nucleolus (219). Each of these proteins has a single localization and the total number is 504. The 92 multi-localization proteins are not included in the set of 504 single-localization proteins used for the leave-one-out cross-validation (LOOCV). Therefore, the multi-localization dataset is an independent test set. The summary of the datasets is presented in Table 1.

Predictive system and evaluation criteria
Given a test protein with GO annotations, the similarity scores between this protein and all the proteins in the training set are calculated from the similarity scores of GO term pairs (see Methods). The training protein with the highest similarity score is designated as the nearest neighbor of the test protein, and its class label is assigned to the test protein. If multiple proteins in different localizations attain the same highest score, or if the test protein does not have any GO annotation, then the test protein is labeled "unpredicted". The unpredicted proteins are passed on to the SVM module, which uses sequence information [16], so that every protein receives a prediction.
Since the numbers of proteins for the six localizations are unbalanced, the Matthews correlation coefficient (MCC) was employed for the optimization of parameters and evaluation of performance [28]. The overall accuracy for multi-class classification proposed by Rost [29] was also used for the evaluation of our system. Definitions of the MCC and overall accuracy are detailed in the Methods section.

Comparison of various similarity measures for GO term pairs
Three different similarity measures for GO term pairs were compared: (1) Lord's method [20], (2) simLP as described in Bioconductor [22], and (3) Exact Match. For Lord's method, the GO term frequencies were extracted from UniProtKB/Swiss-Prot [30]. For a GO term pair, Exact Match defines the similarity score as 1 if the two GO terms are identical, and 0 otherwise. SUM_Match was used to compute the similarity score between two proteins from the similarity scores of GO term pairs. It takes the sum of similarity scores for all matched GO terms from two proteins. Note that the SUM_Match score is equivalent to the inner product of two GO term vectors if Exact Match is used for GO term similarity (see Methods for details). As shown in Table 2, no significant difference in performance can be observed for these three similarity measures of GO term pairs. Surprisingly, the Exact Match method, which does not use the DAG structure of GO at all, achieved competitive performance in comparison with the other two methods.

Comparison of various similarity definitions for proteins
Very few studies have focused on exploring the definition of protein similarity based on GO terms. Two simple ways are usually employed to define the similarity between two proteins annotated by GO terms. One is to take the maximum value of the similarity scores of GO term pairs. The other is to take the average over all the similarity scores of GO term pairs. However, for the prediction of protein subnuclear localization, these two methods produced poor results, especially when the proteins were annotated by many GO terms. Consequently, an extensive investigation of various similarity definitions obtained from the similarity scores of GO terms was warranted. As shown in Table 3, the similarity definition has a profound impact on the quality of prediction. The overall accuracy ranges from 27.0% to 66.5% and the overall MCC ranges from 0.141 to 0.519 for single-localization proteins. Using the sum of similarity scores over the matched GO term pairs for two proteins as the similarity definition produces the best predictive outcome for this prediction task.

Effect of using GO terms from homologs
Lord et al. [20] reported a problem that many GO term pairs have identical similarity values. This problem stems from two sources: (1) proteins are represented by a relatively small number of GO terms; (2) the similarity measure considers only the information content $p_{ms}$ (probability of the minimum subsumer) of shared parents of the query terms, meaning that the semantic distances of many different GO term pairs are identical. In order to alleviate this problem, GO terms of homologs retrieved by BLAST were used for the representation of a query protein.
The E-value parameter in BLAST is crucial for the quality of homologs, as well as the number of candidate homologs. If the E-value is too large, then homologs of low quality may be retrieved. On the other hand, if the E-value is too small, then the number of candidate homologs retrieved becomes small. We tested the following E-values: $10^{0}$, $10^{-1}$, $10^{-2}$, ..., $10^{-10}$, $10^{-15}$, $10^{-20}$, $10^{-30}$, $10^{-50}$, $10^{-100}$, $10^{-200}$, and found that an E-value of $10^{-9}$ was a good trade-off. Even with this threshold, BLAST could retrieve different numbers of hits for different query proteins. We found that up to 5 homologs were suitable to represent the query protein (see Table 4).

Predictive performance of the new system
As demonstrated above, the predictive outcome is greatly influenced by the way in which the similarity scores of GO term pairs are combined to give the similarity between two proteins. With an appropriate similarity definition, the performance of the current system can be significantly better than that of the previous SVM system. As seen in Table 5, the overall MCC (accuracy) is elevated from 0.284 to 0.519 (50.0% to 66.5%) for single-localization proteins in the leave-one-out cross-validation; and from 0.420 to 0.541 (65.2%, no change in accuracy) for an independent set of multi-localization proteins. More specifically, 401 (281 true predictions and 120 false predictions) out of 504 proteins were predicted by the GO module in the LOOCV, and the remaining 103 were passed on to the SVM module. For the independent test set of proteins with multiple localizations, 82 (55 true predictions and 27 false predictions) out of 92 proteins were predicted by the GO module, and the remaining 10 were passed on to the SVM module.
It should also be noted that our system is currently designed to predict only one localization. In fact, the results shown for the proteins with multiple localizations are somewhat overestimated, as a prediction is considered correct if any one of the localizations of a protein is correctly predicted.

Discussion
GO terms have been used in the prediction of protein subcellular localization [9,25]. The similarity of two proteins was defined as the number of GO terms shared exactly by the two proteins, or equivalently as the inner product of the GO term vectors representing the two proteins (see Methods). The inner product of two GO term vectors can be considered a special case of the similarity definition SUM_Match for two proteins used in this work. SUM_Match is essentially a weighted sum over the matched GO term pairs, where the weight is the depth of the term if simLP is the GO term similarity, whereas the inner product assigns a uniform weight of 1 to all matched GO term pairs. Consequently, the more specific the two matched GO terms are, the greater the weight is, and the higher the contribution to the similarity is.
It seems that the inclusion of similarity scores of all GO term pairs is in general not a good strategy for the definition of similarity between two protein sequences. The same conclusion can be drawn for the use of scores of all best GO term pairs (see Methods). A possible reason is as follows. If two GO terms are only remotely related but share a common ancestor, they still have a positive score, which contributes to the similarity of the two proteins. In contrast, a similarity for protein pairs based on the matched GO terms receives zero contribution from unmatched GO terms. It seems that the unmatched terms add noise to the data and thus weaken the discriminative ability of the nearest neighbour module in our system. In our study, the best performance was attained when the similarity measure of two protein sequences was defined as SUM_Match. The similarity scores of ~20,000 matched GO term pairs can be pre-computed and stored in a hash table to reduce the computation time.
A question that needs to be clarified in the GO-based approach is whether the prediction accuracy could be artificially inflated if the proteins in the training or test sets have their specific subnuclear class annotated in GO. We examined this issue as follows. In this study, there are six GO terms associated with the subnuclear compartments. To assess whether these specific GO terms are influential in the prediction, the performance of the GO module was compared before and after the removal of the six GO terms from the annotation list. As shown in Table 6, the accuracies for the compartments Nuclear Lamina, Chromatin and Nucleolus decreased slightly, those for the compartments Nuclear Splicing Speckles and Nucleoplasm increased slightly, and there was no change for the compartment PML Body. The GO terms of the subnuclear compartments therefore do not appear to be decisive in the identification of the subnuclear compartment of a protein. Rather, the information in the overall set of annotated GO terms, that is, the similarity of two proteins defined from the GO term pairs, is more important.
The incorporation of the GO module has substantially improved the system performance. However, the module still makes a relatively high number of incorrect predictions. These errors cannot be corrected by the subsequent SVM module, which only receives the proteins left unpredicted by the GO module. Therefore, it would be desirable for the system to integrate the outcomes from the two modules whenever two predictions are available. We are investigating this possibility.
Our system can be combined with other subcellular localization predictors, e.g. WoLF PSORT [32], PA-SUB [33] and pTARGET [34], for genome-scale prediction of protein localizations. Our system can take a list of predicted nuclear proteins obtained from the subcellular localization predictors and make a refined prediction at the subnuclear level.

Conclusion
Gene Ontology terms have been effectively incorporated into our previous SVM-based system for the prediction of protein subnuclear localization through a nearest neighbour classification module. The improvement in performance of the new system is substantial. Various similarity definitions for a pair of proteins, built on different similarity measures of GO terms, have been examined for their effect on prediction. The use of the sum of similarity scores over the matched GO term pairs for two proteins as the similarity definition produced the best predictive outcome in our study. The extensive investigation conducted in this work may provide some guidance on the choice of similarity definition for protein pairs based on GO terms in other applications.

Retrieval of GO terms
Given a protein sequence, we first BLASTed it against the Swiss-Prot database with an E-value threshold of $10^{-9}$. We selected up to 5 homologs, and submitted the Swiss-Prot accession numbers of the homologs to the QuickGO server [31] for the retrieval of predicted GO terms. The retrieved GO terms were used to represent the given protein.
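The homolog selection step can be sketched as follows. This is a minimal illustration, assuming the BLAST hits have already been parsed into (accession, E-value) pairs; the function name `select_homologs`, the input format, the inclusive threshold comparison and the sample accessions are all hypothetical, and the QuickGO lookup is not shown.

```python
from typing import List, Tuple

E_VALUE_THRESHOLD = 1e-9   # threshold reported in the text
MAX_HOMOLOGS = 5           # upper limit reported in the text


def select_homologs(hits: List[Tuple[str, float]],
                    threshold: float = E_VALUE_THRESHOLD,
                    max_homologs: int = MAX_HOMOLOGS) -> List[str]:
    """Return the accessions of the best hits that pass the E-value threshold.

    `hits` is a list of (accession, e_value) pairs. Treating the threshold
    as inclusive is an assumption made for this sketch.
    """
    passing = [(acc, e) for acc, e in hits if e <= threshold]
    passing.sort(key=lambda pair: pair[1])   # smallest E-value first
    return [acc for acc, _ in passing[:max_homologs]]


# Hypothetical hits for one query protein.
hits = [("P00001", 1e-50), ("P00002", 1e-3), ("P00003", 1e-12),
        ("P00004", 1e-9), ("P00005", 1e-30), ("P00006", 1e-20),
        ("P00007", 1e-10)]
selected = select_homologs(hits)
```

The GO terms of the selected accessions would then be pooled to represent the query protein.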

Definitions of similarity between two GO terms
First we define the depth for each GO term as follows.
$\mathrm{Depth}(g_i)$ = the length of the longest path from GO term $g_i$ to the root of Gene_Ontology, i.e., GO:0003673. Fig. 1 shows an example of some GO depths, e.g. Depth(GO:0001838) = 7.
The similarity of two GO terms $g_1$ and $g_2$ can be defined as the depth of their most recent common ancestor (MRCA):

$$\mathrm{Sim\_GO}(g_1, g_2) = \max_{g_c \in P(g_1, g_2)} \mathrm{Depth}(g_c) \qquad (1)$$

where $P(g_1, g_2)$ is the set of ancestral GO terms shared by both $g_1$ and $g_2$, including themselves. When $g_1 = g_2$, $\mathrm{Depth}(g_c) = \mathrm{Depth}(g_1) = \mathrm{Depth}(g_2)$. For two GO terms from different ontologies (MF, BP, CC), their MRCA is the root GO:0003673, whose depth is zero. This means that there is no similarity between two GO terms from different ontologies.
The GO term similarity described here is the same as the method simLP implemented by Gentleman [22] in Bioconductor.
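As an illustration of the depth and MRCA-based similarity described above, the following is a minimal sketch on a toy DAG. The term names and edges are hypothetical and do not come from the real GO graph; the function names `depth`, `ancestors` and `sim_go` are chosen for this example only.

```python
from functools import lru_cache
from typing import Dict, List, Set

# Toy DAG given as child -> list of parents (hypothetical terms and edges).
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "MF": ["root"],
    "BP": ["root"],
    "binding": ["MF"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
    "ss_dna_binding": ["binding", "dna_binding"],  # two parents
    "transport": ["BP"],
}


@lru_cache(maxsize=None)
def depth(term: str) -> int:
    """Length of the longest path from `term` up to the root."""
    parents = PARENTS[term]
    if not parents:
        return 0
    return 1 + max(depth(p) for p in parents)


def ancestors(term: str) -> Set[str]:
    """All ancestors of `term`, including the term itself."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def sim_go(g1: str, g2: str) -> int:
    """Depth of the deepest term shared by the ancestries of g1 and g2."""
    shared = ancestors(g1) & ancestors(g2)
    return max(depth(g) for g in shared)
```

In this toy graph, `ss_dna_binding` has depth 4 because the longest of its two paths to the root is used, and terms under `MF` and `BP` share only the root, so their similarity is 0.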

Definitions of similarity between two protein sequences
Consider two proteins that are represented by the sets of GO terms $G_1$ and $G_2$, respectively. The similarity Sim_Pro between the two proteins can be defined as a function of Sim_GO.
(g) AVG_BestPairs: Average similarity between the best paired GO terms, calculated with the following pseudocode:
    Sim_Pro ← Sim_Pro + Max_sim_GO
    Delete $g_i$ from $G_1$, and $g_j$ from $G_2$
In this work, the similarity Sim_Pro of two proteins employed in the final system is based on function (f) SUM_Match:

$$\mathrm{Sim\_Pro}(p_1, p_2) = \sum_{g \in G_1,\; g \in G_2} \mathrm{Sim\_GO}(g, g)$$

where Sim_GO is defined in (1). Alternatively, if Sim_GO is defined as the constant 1, Sim_Pro is exactly the Inner Product of two GO term vectors (see below).
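Only part of the AVG_BestPairs pseudocode is given above. One possible reading is sketched below, under several assumptions that are not stated in the text: the best pair is chosen greedily among the remaining terms, the loop stops when either set is empty, and the accumulated score is divided by the number of pairs formed. The function name `avg_best_pairs` and the `sim_go` callable are placeholders for this example.

```python
from typing import Callable, Iterable


def avg_best_pairs(terms1: Iterable[str], terms2: Iterable[str],
                   sim_go: Callable[[str, str], float]) -> float:
    """Greedy best-pair averaging (assumed reading of AVG_BestPairs)."""
    g1, g2 = set(terms1), set(terms2)
    total, n_pairs = 0.0, 0
    while g1 and g2:
        # Find the remaining pair with the highest GO term similarity.
        best_i, best_j = max(((a, b) for a in g1 for b in g2),
                             key=lambda pair: sim_go(*pair))
        total += sim_go(best_i, best_j)   # Sim_Pro <- Sim_Pro + Max_sim_GO
        g1.remove(best_i)                 # delete g_i from G1
        g2.remove(best_j)                 # delete g_j from G2
        n_pairs += 1
    # Returning 0.0 for an empty annotation set is an assumption.
    return total / n_pairs if n_pairs else 0.0


def exact_match(a: str, b: str) -> float:
    """Exact Match GO term similarity: 1 if identical, 0 otherwise."""
    return 1.0 if a == b else 0.0


score = avg_best_pairs({"GO:A", "GO:B"}, {"GO:B", "GO:C"}, exact_match)
```

With Exact Match, the two proteins in the usage example share one of two possible pairs, so the averaged score is 0.5.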
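A minimal sketch of SUM_Match follows, assuming the depth of every GO term has been pre-computed into a dictionary (in the spirit of the hash table mentioned in the Discussion). The function name `sum_match`, the term identifiers and the depth values are hypothetical.

```python
from typing import Dict, Iterable

# Hypothetical pre-computed depths, i.e. Sim_GO(g, g) for each matched term.
DEPTH: Dict[str, int] = {"GO:A": 3, "GO:B": 5, "GO:C": 2, "GO:D": 7}


def sum_match(terms1: Iterable[str], terms2: Iterable[str],
              self_similarity: Dict[str, int]) -> int:
    """Sum Sim_GO(g, g) over the GO terms shared by both proteins."""
    matched = set(terms1) & set(terms2)
    return sum(self_similarity[g] for g in matched)


protein1 = {"GO:A", "GO:B", "GO:D"}
protein2 = {"GO:B", "GO:C", "GO:D"}
weighted = sum_match(protein1, protein2, DEPTH)

# With a constant weight of 1 the score reduces to the number of shared terms.
uniform = sum_match(protein1, protein2, {g: 1 for g in DEPTH})
```

Here the shared terms are `GO:B` and `GO:D`, so deeper (more specific) shared terms contribute more to the weighted score than they do under uniform weights.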

Inner product of two GO term vectors
The Inner Product of two GO term vectors has been used in previous studies for the prediction of protein subcellular localization [9,25]. For a given protein, a vector is prepared whose length equals the number of distinct GO terms that appear in the dataset. An entry is assigned the value 1 if the corresponding GO term is used for the annotation of the protein, and 0 otherwise. Each protein is thus represented by a binary vector. The similarity between two proteins is defined as the inner product of the two corresponding GO term vectors. Equivalently, the Inner Product is the total number of matched GO terms in the annotation lists of the two proteins.
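The equivalence between the inner product of binary GO term vectors and the count of matched GO terms can be checked with a short sketch. The helper names `go_vector` and `inner_product` and the term identifiers are hypothetical.

```python
from typing import List, Sequence, Set


def go_vector(annotation: Set[str], vocabulary: Sequence[str]) -> List[int]:
    """Binary vector with a 1 for every vocabulary term in `annotation`."""
    return [1 if term in annotation else 0 for term in vocabulary]


def inner_product(u: Sequence[int], v: Sequence[int]) -> int:
    """Plain dot product of two equal-length integer vectors."""
    return sum(a * b for a, b in zip(u, v))


protein1 = {"GO:A", "GO:B", "GO:D"}
protein2 = {"GO:B", "GO:C", "GO:D"}

# The vocabulary is every GO term that appears in the (toy) dataset.
vocabulary = sorted(protein1 | protein2)
v1 = go_vector(protein1, vocabulary)
v2 = go_vector(protein2, vocabulary)
similarity = inner_product(v1, v2)
```

In practice the set intersection is cheaper than building full-length vectors, since each protein carries only a few of the many possible GO terms.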

Nearest neighbor classification
Our system includes a K-Nearest Neighbor (KNN) model. The best result was achieved with K = 1. A protein is assigned the localization label of its nearest neighbor, i.e. the training protein with the highest similarity score Sim_Pro. If the protein does not have associated GO terms or has multiple nearest neighbors in different classes, then the second module, the SVM built on sequence information [16], is called to give a prediction.
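The decision rule with K = 1 and the hand-off to the SVM module can be sketched as follows. This is an illustration only: the function names, the training data, and the `svm_predict` callable (a stand-in for the sequence-based module) are hypothetical, and the similarity function is passed in so that any Sim_Pro definition can be used.

```python
from typing import Callable, List, Optional, Set, Tuple

Training = List[Tuple[Set[str], str]]   # (GO term set, localization label)


def nearest_neighbor_label(query: Set[str], training: Training,
                           sim_pro: Callable[[Set[str], Set[str]], float]
                           ) -> Optional[str]:
    """Return the label of the unique best-scoring class, else None."""
    if not query or not training:
        return None                      # no GO annotation: unpredicted
    scores = [(sim_pro(query, terms), label) for terms, label in training]
    best = max(score for score, _ in scores)
    best_labels = {label for score, label in scores if score == best}
    # A tie between different localizations leaves the protein unpredicted.
    return best_labels.pop() if len(best_labels) == 1 else None


def predict(query: Set[str], sequence: str, training: Training,
            sim_pro: Callable[[Set[str], Set[str]], float],
            svm_predict: Callable[[str], str]) -> str:
    """GO module first; fall back to the sequence-based module."""
    label = nearest_neighbor_label(query, training, sim_pro)
    return label if label is not None else svm_predict(sequence)


def shared_terms(a: Set[str], b: Set[str]) -> float:
    """Toy Sim_Pro: number of shared GO terms."""
    return float(len(a & b))


training = [({"GO:A", "GO:B"}, "Nucleolus"),
            ({"GO:B", "GO:C"}, "Chromatin"),
            ({"GO:D"}, "Nucleoplasm")]
```

Note that under this sketch a query sharing no GO term with any training protein ties at score 0 across classes and is therefore also passed to the fallback.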

The SVM module
In our previous work [16], we built an SVM system for the prediction of protein subnuclear localizations based solely on protein sequence information. New SVM kernel functions were introduced to measure sequence similarity. The k-peptide vectors are first mapped by a matrix of high-scoring pairs of k-peptides, which are scored with BLOSUM62. The kernels, which measure the similarity between sequences, are then defined on the mapped vectors. By combining these new encoding methods, a multi-class classification system for the prediction of protein subnuclear localizations was established.

Evaluation
Since the numbers of proteins for the various localizations are unbalanced, the Matthews correlation coefficient (MCC) was employed for the optimization of parameters and evaluation of performance [28]:

$$\mathrm{MCC}_n = \frac{p_n s_n - u_n o_n}{\sqrt{(p_n + u_n)(p_n + o_n)(s_n + u_n)(s_n + o_n)}}$$

where $p_n$ is the number of correctly predicted proteins of the location $n$, $s_n$ is the number of correctly predicted proteins not in the location $n$, $u_n$ is the number of under-predicted proteins, and $o_n$ is the number of over-predicted proteins.
Also, the overall accuracy for multi-class classification proposed by Rost [29] was used for the evaluation of our system. Suppose there are $m = m_1 + m_2 + \dots + m_N$ test proteins, where $m_n$ is the number of proteins belonging to class $n$ ($n = 1, \dots, N$). Suppose further that out of the proteins considered, $p_n$ proteins are correctly predicted to belong to class $n$. Then $p = p_1 + p_2 + \dots + p_N$ is the number of correctly predicted proteins. The accuracy for class $n$ is $p_n / m_n$, and the overall accuracy is $p / m$.
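The evaluation measures can be computed along the following lines. This is a minimal sketch using the standard per-class form of the MCC with the four counts named in the text (p, s, u, o); the function names and the toy labels are hypothetical, and returning 0.0 when the denominator is zero is a convention assumed here rather than one stated in the text.

```python
import math
from typing import Dict, Sequence, Tuple


def class_counts(true: Sequence[str], pred: Sequence[str],
                 cls: str) -> Tuple[int, int, int, int]:
    """Return (p, s, u, o) for one class."""
    pairs = list(zip(true, pred))
    p = sum(t == cls and y == cls for t, y in pairs)   # correctly predicted in cls
    s = sum(t != cls and y != cls for t, y in pairs)   # correctly predicted not in cls
    u = sum(t == cls and y != cls for t, y in pairs)   # under-predicted
    o = sum(t != cls and y == cls for t, y in pairs)   # over-predicted
    return p, s, u, o


def mcc(p: int, s: int, u: int, o: int) -> float:
    """Matthews correlation coefficient from the four counts."""
    denom = math.sqrt((p + u) * (p + o) * (s + u) * (s + o))
    return (p * s - u * o) / denom if denom else 0.0


def accuracies(true: Sequence[str],
               pred: Sequence[str]) -> Tuple[Dict[str, float], float]:
    """Per-class accuracy p_n / m_n and overall accuracy p / m."""
    per_class = {}
    for cls in sorted(set(true)):
        p, _, u, _ = class_counts(true, pred, cls)
        per_class[cls] = p / (p + u)                   # m_n = p_n + u_n
    overall = sum(t == y for t, y in zip(true, pred)) / len(true)
    return per_class, overall


true = ["Nucleolus", "Nucleolus", "Chromatin", "Chromatin"]
pred = ["Nucleolus", "Chromatin", "Chromatin", "Chromatin"]
per_class, overall = accuracies(true, pred)
mcc_nucleolus = mcc(*class_counts(true, pred, "Nucleolus"))
```

In the toy example one Nucleolus protein is under-predicted, which lowers both the class accuracy and the MCC for that class while leaving the Chromatin accuracy at 1.0.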

Additional File 1
GO annotation for single-localization proteins. The data provide the single-localization proteins annotated by the six subnuclear compartment GO terms.