Predicting gene ontology functions from protein's regional surface structures

Background
Annotation of protein functions is an important task in the post-genomic era. Most early approaches to this task exploit only sequence or global structure information. However, protein surfaces are believed to be crucial to protein functions because they are the main interfaces that facilitate biological interactions. Recently, several databases of structural surface features, such as pockets and cavities, have been constructed, each with a comprehensive library of identified surface structures. For example, CASTp identifies and measures surface-accessible pockets as well as inaccessible interior cavities.

Results
A novel method is proposed to predict the Gene Ontology (GO) functions of proteins from the pocket similarity network, which is constructed from the structural similarities of pockets. Statistics of the networks are presented to explore the relationship between similar pockets and the GO functions of proteins. Cross-validation experiments were conducted to evaluate the performance of the proposed method. Results and code are available at: .

Conclusion
The computational results demonstrate that the proposed method based on the pocket similarity network is effective and efficient for predicting the GO functions of proteins, in terms of both computational complexity and prediction accuracy. The method reveals a strong relationship between small surface patterns (pockets) and GO functions, which can be further used to identify active sites or functional motifs. The prediction performance, together with the network statistics, also indicates that pockets play essential roles in biological interactions and GO functions. Moreover, the proposed network framework can accommodate protein spatial surface patterns other than pockets for predicting protein functions.


Pocket pairs and closest neighbors
An example of pocket query results in the pvSOAR database is shown in Figure 2. We also illustrate the closest neighbors of the example pocket.

Figure 2: The query pocket 1B8O 31 A in the pvSOAR database. 1ULB 51 0 is one of the hit pockets that satisfy the required threshold. The three similarity scores between the query pocket and the hit pocket are shown on the same line. The GO terms annotated to the corresponding proteins, 1B8O and 1ULB, are listed for each. Because the two proteins share at least one GO term, GO:0016763, the edge representing the pocket pair 1B8O 31 A and 1ULB 51 0 is a GO-related edge. The closest neighborhood of pocket 1B8O 31 A, which has three closest neighbors, is shown at the bottom of the figure.
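The GO-related edge test described for Figure 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the annotation table, the helper names, and the placeholder terms are hypothetical (only GO:0016763 comes from the example), the protein ID is assumed to be the first field of a pocket ID, and "closest neighbors" are assumed to be the pockets directly linked to a pocket in the network.

```python
# Hypothetical annotation table: protein ID -> set of GO terms.
# Only GO:0016763 is taken from the Figure 2 example; the other
# entries are placeholders, not real annotations.
go_annotations = {
    "1B8O": {"GO:0016763", "GO:PLACEHOLDER_A"},
    "1ULB": {"GO:0016763", "GO:PLACEHOLDER_B"},
}

# Hypothetical edge list of the pocket similarity network.
edges = [("1B8O 31 A", "1ULB 51 0")]


def protein_of(pocket_id):
    """Return the protein ID, assumed to be the first field of a pocket ID."""
    return pocket_id.split()[0]


def is_go_related(pocket_a, pocket_b, annotations):
    """An edge is GO related if the two proteins share at least one GO term."""
    terms_a = annotations.get(protein_of(pocket_a), set())
    terms_b = annotations.get(protein_of(pocket_b), set())
    return bool(terms_a & terms_b)


def closest_neighbors(pocket, edge_list):
    """Pockets directly linked to `pocket` (assumed meaning of closest neighbors)."""
    neighbors = set()
    for a, b in edge_list:
        if a == pocket:
            neighbors.add(b)
        elif b == pocket:
            neighbors.add(a)
    return neighbors


print(is_go_related("1B8O 31 A", "1ULB 51 0", go_annotations))
print(closest_neighbors("1B8O 31 A", edges))
```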

Sizes of pocket similarity networks
The sizes of the pocket similarity networks constructed with different similarity thresholds are listed in Table 1.
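As a rough illustration of how the network size depends on the threshold, the sketch below keeps a pocket pair as an edge when its p-value does not exceed the threshold and then counts pockets and edges. The function name, the toy pairs, their p-values, and the use of "less than or equal" are assumptions made for this example; they are not data or code from the paper.

```python
def build_network(scored_pairs, threshold):
    """Keep pairs whose p-value is at or below `threshold`.

    Returns (nodes, edges); each edge is an unordered pair of pocket IDs.
    """
    edges = {
        frozenset((a, b))
        for a, b, p_value in scored_pairs
        if p_value <= threshold and a != b
    }
    nodes = set().union(*edges)
    return nodes, edges


# Toy pocket pairs with made-up p-values.
scored_pairs = [
    ("p1", "p2", 1e-6),
    ("p2", "p3", 5e-4),
    ("p3", "p4", 5e-3),
    ("p4", "p5", 0.5),
]

for threshold in (1e-2, 1e-3, 1e-4, 1e-5):
    nodes, edges = build_network(scored_pairs, threshold)
    print(f"threshold {threshold:g}: {len(nodes)} pockets, {len(edges)} edges")
```

A stricter threshold can only remove edges, so the network shrinks monotonically as the threshold decreases.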

Functional similarity between pocket and its closest neighbors
We study the relationship between a pocket and the most frequent function among its closest neighbors.
The statistical results are shown in Tables 2-5. The statistics are computed on the pocket similarity networks constructed with different similarity thresholds. The 'GO Annotated' row gives the number of pockets that have GO annotations and at least one closest neighbor with GO annotations. The 'Similar Pockets' row gives the number of pockets that share the most frequent function among their closest neighbors. The high percentages show a strong relationship between a pocket and the most frequent function among its closest neighbors.
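The 'Similar Pockets' count can be illustrated with a minimal sketch. It assumes that each pocket inherits the GO terms of its protein, that ties for the most frequent term are all kept, and that a pocket counts as similar when it carries at least one of those terms. The function names and toy term labels are hypothetical.

```python
from collections import Counter


def most_frequent_functions(neighbor_term_sets):
    """Return the GO term(s) that occur most often among the neighbors."""
    counts = Counter(term for terms in neighbor_term_sets for term in terms)
    if not counts:
        return set()
    top = max(counts.values())
    return {term for term, count in counts.items() if count == top}


def shares_top_function(pocket_terms, neighbor_term_sets):
    """True if the pocket carries a most frequent function of its neighbors."""
    return bool(set(pocket_terms) & most_frequent_functions(neighbor_term_sets))


# Toy example with placeholder term labels.
pocket_terms = {"GO:A", "GO:B"}
neighbor_term_sets = [{"GO:A"}, {"GO:A", "GO:C"}, {"GO:D"}]

print(most_frequent_functions(neighbor_term_sets))
print(shares_top_function(pocket_terms, neighbor_term_sets))
```

The percentage reported in the tables would then be the number of pockets for which this test succeeds divided by the 'GO Annotated' count.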

Frequent functions associated with similar pocket pairs
We study the most frequent GO functions associated with similar pocket pairs, to find which kinds of GO functions are frequently shared by two proteins with similar pockets. Based on the statistical results in the article, we choose a cRMSD p-value of 10^-5 as the threshold for constructing the pocket similarity network. The 15 most frequent GO functions and their GO descriptions are shown in Table 6. Note that most of them are related to binding or catalytic activity. The full list of GO terms and their frequencies can be found on our web site.
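One way to obtain such a ranking is to count, over all edges, the GO terms shared by the two proteins of each pair. The sketch below assumes that each edge is given as a pair of protein IDs and that a term is counted once per edge. The annotation table and term labels are placeholders.

```python
from collections import Counter


def shared_term_frequencies(edges, annotations):
    """Count how many edges share each GO term between their two proteins."""
    counts = Counter()
    for a, b in edges:
        counts.update(annotations.get(a, set()) & annotations.get(b, set()))
    return counts


# Toy annotation table and edges (placeholders, not data from the paper).
annotations = {
    "P1": {"GO:X", "GO:Y"},
    "P2": {"GO:X"},
    "P3": {"GO:X", "GO:Y"},
}
edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3")]

frequencies = shared_term_frequencies(edges, annotations)
print(frequencies.most_common(15))  # the most frequently shared terms first
```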

C. Additional prediction results
Prediction results using another scoring scheme
We also use another scoring scheme for pockets, based on the observation that the pocket similarity networks are very sparse. We treat each connected component of the network as a group of similar pockets. The score of a pocket is then evaluated from the pockets in the same group instead of from its closest neighbors. The remaining learning and prediction procedures are the same as in the closest-neighbor-based method. The experiments are again performed on the pocket similarity network constructed with a cRMSD p-value threshold of 10^-2.
The recall-precision graphs and the prediction results are shown in Figure 3 and Table 7. The results are very similar to those of the closest-neighbor-based method. One possible reason is that most groups are densely connected internally, so that the group of similar pockets is almost the same as the closest neighborhood.
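The grouping step can be sketched with a breadth-first search over the network. The code below is a minimal illustration with hypothetical pocket labels: it finds the connected components and shows that, in a densely connected group, the group members of a pocket coincide with its direct neighbors, which is the explanation suggested above.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return the connected components of an undirected edge list."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, groups = set(), []
    for start in list(adjacency):
        if start in seen:
            continue
        group, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    group.add(neighbor)
                    queue.append(neighbor)
        groups.append(group)
    return groups


def neighbors_of(node, edges):
    """Pockets directly linked to `node`."""
    return {b if a == node else a for a, b in edges if node in (a, b)}


# Toy network: one fully connected triple and one isolated pair.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e")]
groups = connected_components(edges)
group_of = {node: group for group in groups for node in group}

print(group_of["a"] - {"a"} == neighbors_of("a", edges))
```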

Prediction results using different thresholds
We use different cRMSD p-value thresholds, ranging from 10^-3 to 10^-5, to construct the pocket similarity network. For each threshold, we perform the same experiments as in the article. The recall-precision graphs and prediction results are shown in Figures 4-7 and Tables 8-11. These results are very similar to those in the article, which uses a cRMSD p-value of 10^-2 as the threshold.

D. Prediction results by the protein similarity network
We construct the corresponding protein similarity networks in the same way as the pocket similarity networks. The global structure similarity between proteins is measured by CE (combinatorial extension, Shindyalov and Bourne, Protein Engineering 1998). We run the same tests and compare the results with those obtained on the pocket similarity networks. The prediction results on the protein similarity networks constructed with CE Z-score thresholds of 3.8, 4.8 and 5.8 are shown in Figure 8 (CE recommends a Z-score of 3.8 as the threshold for filtering structural similarities). The details of the prediction are listed in Table 12. We compare the results of the pocket similarity networks with p-value thresholds 10^-2 and 10^-3 with those of the protein similarity networks with Z-score thresholds 4.8 and 5.8 in an all-against-all manner. The details of the comparison can be found in Table 13. Figure 9 shows the recall-precision (RP) curve comparisons, in addition to the two RP comparison graphs shown in the main text.
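For reference, recall, precision and the F-measure of a single protein can be computed from its predicted and true GO term sets, and a maximum F-measure can be found by sweeping a score cutoff. The sketch below uses the standard definitions. How the paper aggregates these values over proteins is not shown here, and the scores and term labels are made up.

```python
def precision_recall_f(predicted, true_terms):
    """Standard precision, recall and F-measure for two sets of GO terms."""
    predicted, true_terms = set(predicted), set(true_terms)
    hits = len(predicted & true_terms)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(true_terms) if true_terms else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def max_f_measure(scores, true_terms):
    """Largest F-measure over all score cutoffs (terms scoring >= cutoff are predicted)."""
    best = 0.0
    for cutoff in sorted(set(scores.values())):
        predicted = {term for term, score in scores.items() if score >= cutoff}
        best = max(best, precision_recall_f(predicted, true_terms)[2])
    return best


# Toy scores for one protein (placeholder terms and values).
scores = {"GO:A": 0.9, "GO:B": 0.6, "GO:C": 0.2}
true_terms = {"GO:A", "GO:B"}

print(precision_recall_f({"GO:A"}, true_terms))
print(max_f_measure(scores, true_terms))
```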
In Table 13, "ProtNum" is the number of proteins in each similarity network. "Common" denotes the proteins common to the two kinds of networks. "Max F" is the maximum F-measure. "R-P" stands for recall-precision. "R 100%" and "P 100%" are the numbers of proteins predicted with recall 1 and with precision 1, respectively. "R & P 100%" is the number of proteins predicted with both recall 1 and precision 1.

E. An example of two proteins with similar pockets
Figure 10 shows an example of two proteins with similar pockets. Part (a) shows the global folds of the two proteins, which have one similar pocket, together with the positions of the two pockets on the protein surfaces. Part (b) visualizes the two similar pockets. Part (c) gives the sequences of the two proteins. The red characters are the amino acid residues of the two similar pockets. In each protein, these residues are not consecutive in the sequence. The GO annotations (EBI, http://www.ebi.ac.uk/goa/) of the proteins are also listed. The two proteins have some GO functions in common. When the two protein sequences are aligned with EMBOSS (EBI, http://www.ebi.ac.uk/emboss/), the sequence identity is 7.9% and the sequence similarity is 11.7%. Structure alignment by CE (UCSD, http://cl.sdsc.edu/) gives an RMSD of 4.9 Å and a Z-score of 1.2. The alignments show that the two proteins have similar functions but low sequence similarity and low global structure similarity. Predicting their functions from sequence and/or global structure similarity would therefore be unlikely to give correct predictions. However, the pockets on the two protein surfaces are similar in both sequence and structure.
When we predict functions with the method proposed in the paper, we obtain the correct annotations. The detailed descriptions of the GO terms and the common annotations of the two proteins are shown in Table 14. The depth is the level of the GO term in the GO hierarchy tree. When a GO term occurs in different branches, the mean level is used. The example also suggests that functionally important pockets on protein surfaces could have important applications in functional genomics and bioengineering. The similar pockets in this example are ion binding sites (which can be identified from the Swissprot database). The functional residues are discontinuous in the primary sequence. However, when the protein folds into its 3D shape, these residues form a binding site that performs a specific function. If we can detect the biochemical features of pockets and identify such pockets as functional motifs, they could serve as functional templates. Template-based function prediction is an important future direction.

F. Semantic measure of the functional similarity
Proteins are often annotated with several GO terms simultaneously. Semantic similarity can therefore be used to compare the functions of the two proteins joined by an edge of the pocket similarity network, instead of simply checking for common GO terms. The two GO term sets of the proteins can be compared in each of the three ontologies separately. The relevance semantic similarity score of each edge in the pocket similarity networks constructed with cRMSD p-value thresholds ranging from 10^-1 to 10^-5 is calculated by the method of Schlicker A. et al. (BMC Bioinformatics, 2006). The scores in the three ontologies are calculated independently, and the semantic similarity values range between 0 and 1. The distributions of the semantic similarity scores in the different pocket networks, i.e. the percentage of GO annotated edges in each semantic similarity score interval, are illustrated in Figure 11. The figures show that pocket similarity (represented by the edges of the pocket similarity networks) is closely related to semantic similarity.
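As a hedged illustration of the term-level score, the sketch below follows our understanding of the relevance similarity in the cited reference: for two terms c1 and c2, take the maximum over their common ancestors c of 2 log p(c) / (log p(c1) + log p(c2)) multiplied by (1 - p(c)), where p is the annotation probability of a term. The toy hierarchy, the term names and the probabilities are invented for the example, and the step that combines term scores into a protein-level score is not shown.

```python
import math


def sim_rel(c1, c2, ancestors, prob):
    """Relevance similarity of two terms (our reading of Schlicker et al., 2006)."""
    common = (ancestors[c1] | {c1}) & (ancestors[c2] | {c2})
    denominator = math.log(prob[c1]) + math.log(prob[c2])
    if denominator == 0:  # both terms have probability 1 (the root)
        return 0.0
    best = 0.0
    for c in common:
        score = 2 * math.log(prob[c]) / denominator * (1 - prob[c])
        best = max(best, score)
    return best


# Toy hierarchy: every term maps to the set of all its ancestors.
ancestors = {
    "root": set(),
    "binding": {"root"},
    "ion": {"binding", "root"},
    "metal": {"binding", "root"},
}
# Made-up annotation probabilities; the root always has probability 1.
prob = {"root": 1.0, "binding": 0.4, "ion": 0.1, "metal": 0.05}

print(sim_rel("ion", "ion", ancestors, prob))
print(sim_rel("ion", "metal", ancestors, prob))
```

The factor (1 - p(c)) lowers the score when the shared ancestor is a very general term, which is why two terms that share only the root score 0.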

G. Influence of GO relevance information
GO organizes its terms as directed acyclic graphs (DAGs), in which a child term is more specific and informative than its ancestors. To analyze the influence of unspecific GO terms on the proposed prediction method, we use the GO term probability and the GO depth level to select informative GO terms.

The GO term probability
Table 15 shows the detailed prediction results for the GO term probability thresholds 0.05, 0.01 and 0.005. For the informative GO terms with probability less than 0.01, we also compare the prediction results of the pocket similarity network (constructed with a pvSOAR cRMSD p-value of 10^-2) and of the protein similarity network (constructed with a CE Z-score of 4.8). The recall-precision graph on their common predicted proteins and the coverage of the predicted proteins are shown in Figure 12. Table 16 records the detailed prediction values for the common proteins.

Figure 12: Recall-precision graph and coverage of the results of the pocket similarity network and the protein similarity network for the GO terms with probability less than 0.01.

The GO depth level
An alternative, simple way to select informative GO terms is by their depth level. Figure 13 shows the distribution of GO terms over the depth levels. When a GO term belongs to several branches of the hierarchical tree and therefore has different depth levels, the mean value is used.
Table 17 shows the detailed prediction results for the GO depth level thresholds 5, 6, 7 and 8. We also compare the prediction results of the pocket similarity network (constructed with a pvSOAR cRMSD p-value of 10^-2) and of the protein similarity network (constructed with a CE Z-score of 4.8) for the GO terms with depth level ≥ 5. The recall-precision graph on their common predicted proteins and the coverage of the predicted proteins are shown in Figure 14. Table 18 records the detailed prediction values for the common proteins.

Figure 14: Recall-precision graph and coverage of the results of the pocket similarity network and the protein similarity network for the GO terms with depth level ≥ 5.
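The two selection criteria of this section can be sketched as follows. The code assumes that the root has depth 0, that the mean depth of a term is taken over all of its paths to the root, and that term probabilities are already available. The small hierarchy, the term names and the probabilities are hypothetical.

```python
# Toy DAG: each term maps to its direct parents (placeholder names).
parents = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a"],
    "d": ["b", "c"],  # reachable through two branches
}
# Made-up term probabilities.
prob = {"root": 1.0, "a": 0.2, "b": 0.3, "c": 0.008, "d": 0.004}


def path_depths(term):
    """Depth of `term` along every path to the root (root depth assumed 0)."""
    if not parents[term]:
        return [0]
    return [depth + 1 for parent in parents[term] for depth in path_depths(parent)]


def mean_depth(term):
    """Mean depth over all paths, used when a term lies in several branches."""
    depths = path_depths(term)
    return sum(depths) / len(depths)


def select_by_probability(terms, max_prob):
    """Keep terms whose probability is below `max_prob`."""
    return {term for term in terms if prob[term] < max_prob}


def select_by_depth(terms, min_depth):
    """Keep terms whose mean depth is at least `min_depth`."""
    return {term for term in terms if mean_depth(term) >= min_depth}


print(mean_depth("d"))
print(select_by_probability(parents, 0.01))
print(select_by_depth(parents, 2))
```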