A novel gene functional similarity calculation model by utilizing the specificity of terms and relationships in gene ontology

Background Recently, with the foundation and development of gene ontology (GO) resources, numerous works have been proposed to compute functional similarity of genes and achieved series of successes in some research fields. Focusing on the calculation of the information content (IC) of terms is the main idea of these methods, which is essential for measuring functional similarity of genes. However, most approaches have some deficiencies, especially when measuring the IC of both GO terms and their corresponding annotated term sets. To this end, measuring functional similarity of genes accurately is still challenging. Results In this article, we proposed a novel gene functional similarity calculation method, which especially encapsulates the specificity of terms and edges (STE). The proposed method mainly contains three steps. Firstly, a novel computing model is put forward to compute the IC of terms. This model has the ability to exploit the specific structural information of GO terms. Secondly, the IC of term sets are computed by capturing the genetic structure between the terms contained in the set. Lastly, we measure the gene functional similarity according to the IC overlap ratio of the corresponding annotated genes sets. The proposed method accurately measures the IC of not only GO terms but also the annotated term sets by leveraging the specificity of edges in the GO graph. Conclusions We conduct experiments on gene functional classification in biological pathways, gene expression datasets, and protein-protein interaction datasets. Extensive experimental results show the better performances of our proposed STE against several baseline methods.


Background
Since Gene Ontology (GO) [1,2] was first founded in 1998, it has been an important resource to support modern biological research. The GO knowledge base contains a controlled vocabulary of terms, which has three different orthogonal ontologies named biological process (BP), molecular function (MF), and cellular component (CC) . In each ontology, terms are employed to describe the function of genes and the relationships which have specific meanings are used to connect two terms. There are many relationships in the GO database and we only consider two of them: is_a and part_of.
GO exists in the form of a directed acyclic graph (DAG) and has two important characteristics. One is that terms with lower hierarchy generally show more specific meanings while terms with higher hierarchy have more generic meanings. Traditionally, the specific and generic meanings of terms are measured by the IC values, which could represent their specificity. The other is that the edges in different levels also have different specificity because of the terms that they connect.
The functions of a gene could be described by go terms and thus we suggest that this gene is annotated by the terms. The GO annotations (GOA) [3][4][5] database is specifically used to describe genes and their annotation terms. Since GO has three branches, genes can also be annotated from BP, CC, and MF aspects. Comparing the functional similarity between genes has many significant applications [6][7][8][9][10], such as protein interaction prediction, gene clustering, and disease gene identification.
In the past decades, various kinds of methods have already been developed for studying gene functional similarity. The most important concept in functional similarity comparison of genes domain is IC, which could measure the specificity of a GO term. So far, there are two types of IC values computing categories: corpus-based [11][12][13][14][15][16][17] and structured-based [18][19][20][21].
For a term t, its IC value calculated by corpus-based approaches shows as follow: where p(t) denotes the probability of both term t and its descendants appearing in the corpus. Method Resnik [11], Jiang and Conrath [12], and Lin [13] are all based on this definition. Equation 1 strongly demonstrates that the IC value of a given GO term is mainly attributed to the number of genes or proteins it annotates in the corpus. Therefore, the IC value of terms may vary according to the corpus. On the other hand, the annotation information in a corpus is updating over time, which also has an effect on the IC values of terms [22]. To overcome this drawback, researcher David Sánchez [18] put forward another IC computing model based on the GO structure. For a given GO term t, its IC value can be expressed as: where max_leaves means the amount of leaf terms. subsumers(t) is the ancestor set of term t. Additionally, the terms in leaves(t) are belonged to the descendants of term t that are also belonged to leaves. From the equation, we can find that this model exploits the specific genetic information of term t. The information contains the leaves of ontology, the number of their descendants and ancestors. Later, method SORA [19] and WIS [20] make an improvement based on this model, and achieve better performances.
For pair-wise strategy methods, they measure gene functional similarity mainly utilizing two steps: the first one focusing on computing the semantic similarity between annotated terms and the second one is measuring functional similarity with respect to the semantic similarity in the first step. The best matches average rule is commonly used in the second step. For group-wise approaches, they measure the gene functional similarity from the annotation set perspective. Here we select some typical computing models for a brief review. A detailed review is beyond the scope of this paper and has already been presented by Catia Pesquita [29].
Method Resnik [11] is a pair-wise strategy approach. For two given term t 1 and t 2 , the semantic similarity between them can be expressed as: where LCA(t 1 , t 2 ) means the lowest common ancestor for term t 1 and t 2 . Then it calculates gene functional similarity leveraging the BMA rule. The procedure for some other methods like Wang [23], Jiang and Conrath [12], and Lin [13] are similar to method Resnik.
Method simUI [26] is a group-wise method, which is proposed by Gentlman. Suppose there are two genes G 1 and G 2 , the formula of this model can be expressed as: where S G 1 and S G 2 represent the annotation term set for G 1 and G 2 respectively. Followed Gentlman, some other models such as SimGIC [27], SORA [19] and WIS [20] are also proposed. All of these approaches pay much attention to compute the IC of annotated term sets accurately and effectively. For example, simGIC sums up the IC value of each term, while SORA puts forward the concept of inherited semantics to avoid computing the overlap IC of terms in annotation sets. Method WIS first assigns a weighted value to the relationships of GO structure and then designs a rule to compute the inherited IC values of GO terms.
Based on the idea of vector representation, some other approaches [24,25,30] are proposed. These methods employ the one-hot coding to deal with the annotation terms. Terms in the annotation term set will be represented in a vector of which dimension indicates the total amount of GO terms. Each dimension denoted by a binary digit. Suppose there are two genes G 1 and G 2 , their annotation term vectors are v 1 and v 2 , the functional similarity between G 1 and G 2 based on basic vector space model (VSM) can be expressed as: For group-wise approaches, they do not make the best use of the GO structure, which may cause the calculation of IC not accurately. For example, method GIC does not take the number of ancestors of terms into consideration on the IC values calculation. VSM neglects the relationship between GO terms. To overcome the drawbacks, we put forward a novel gene functional similarity calculation method, which especially encapsulates the Specificity of Terms and Edges (STE). STE mainly has two models: the first one calculates the IC value of terms and another one is designed for computing the weighted value of edges. Their detailed description will be shown in "Methods" section.

Results
In this section, the experimental results on various datasets are presented. Before that, we first introduce the experimental data.

Datasets
The GO data is downloaded from the online resource website. In this version, the term number of BP, CC, and MF are 29,380, 4,181, and 11,113 respectively. Besides, the Gene Ontology annotation for H. sapiens and Saccharomyces cerevisiae data are also downloaded from the gene ontology resource website. In this study, we divide annotation types into two categories: IEA+, and IEA-, which means that the annotation term sets of genes contain the Inferred electronic-assigned (IEA) terms or not. Moreover, six annotation combinations are presented as MF_IEA+, MF_IEA-, CC_IEA+, CC_IEA-,BP_IEA+,BP_IEA-.
It is an important and popular validation strategy for gene functional similarity methods to classify the genes based on molecular function. In this study, we employ the yeast pathway data in Saccharomyces genome database (SGD) to make an analysis for the functional classification of genes based on the gene functional similarity calculation results.
For protein-protein interaction experiments, we download the data from the previous approaches [20,23]. Besides, we remove the obsoleted data and rebuild a new experimental dataset. Negative PPIs for human and yeast are randomly generated based on the annotation of genes on three ontologies. What's more, the number of negative PPIs and positive PPIs are the same.
In the end, gene expression data of Saccharomyces cerevisiae is from Jain and Davis [31]. In this dataset, there are a total of 11,966 pairs of cerevisiae gene when we remove some obsoleted data. In the end, there are a total of 4,211, 3,888, and 3,867 gene pairs for CC, BP, and MF aspects respectively.

The analysis for the distribution of IC
Measuring the IC of GO terms reasonably is the foundation for accurately calculating the gene functional similarity. Therefore, we firstly investigate the distribution of the IC of terms on three sub-ontologies with three methods which are Resnik, WIS, STE. The detailed results are shown in Fig. 1 Fig. 2. The GO terms with the middle level are 89 percent of the total in the three ontologies, which demonstrates that the IC values of most terms should be medium. Lastly, we analyze the results of method Resnik, WIS, and STE in detail. For method Resnik, there is more than 85 percent of IC of terms are larger than 0.9. In other words, this model does not distinguish the difference of terms in the GO graph. Method WIS makes a big improvement compared with Resnik. However, many small IC values are presented on the curve of WIS. The results of the proposed model STE are highly consistent with the Eq. 6, which meets the human perspectives. Overall, STE is the best model in these three methods in measuring the IC values of terms.

Gene functional classification in biological pathways
Compare gene functional similarity calculation methods with meaningful pathways is effective to a large extent. If the results of a gene functional similarity method are consistent with the fact that demonstrated in the biological pathways, this method will be an effective one. Meanwhile, there are more than 80 biological pathways in the selected dataset, and we choose one pathway named 'phenylalanine degradation' with ten different genes and eight various EC numbers to validate the performances of methods to be compared. The selected pathway is shown in Fig. 3. At the same time, we compute the functional similarities of the 10 genes with respect to MF ontology with STE and three baseline methods Resnik, Wang, and VSM.
It is generally believed that genes with similar EC numbers will have a higher functional similarity. The results are demonstrated in Fig. 4. For method Resnik, there is one pair of genes of which functional similarity is higher than 0.5 and the similarity of other genes pairs are small. Taking gene 'PDC1' and 'PDC5 ' as an example, the EC number of these two genes are the same, and their similarity value is only 0.43. For method Wang, the functional similarities of gene pairs are not very distinguishable. For example, gene 'PDC6' has a higher similarity with gene ' ADH5' and ' ADH4' Fig. 2 The amount distribution of terms based on depth with respect to BP, CC and MF ontologies than gene 'SFA1' . This is not inconsistent with the EC number knowledge. For method VSM, the EIC number of gene 'H1S5' and 'PDC1' is quite different, but the functional similarity of them has a higher similarity, which is unreasonable.
In the end, for STE, the functional similarities of gene pairs are consistent with the class of the EC number. Moreover, it can also distinguish different the 'distance' of

Results on PPIs
It is another critical evaluation criterion to score the functional similarity calculation methods utilizing protein interactions. In this sub-experiment, according to the selected PPIs in the dataset, we calculate their functional similarity. Then the performance of functional calculation methods is deeply compared based on the receiver operating characteristic (ROC) and the area under the curve (AUC) metric.
The functional similarity values of PPI pairs for S. cerevisiae and H. sapiens are measured using all seven methods. Tables 1 and 2 present their corresponding AUC values respectively. Specifically, on S. cerevisiae dataset, method simGIC runs first on CC_IEA-and MF_IEA+. STE achieves the best performance on four sub-datasets, which are CC_IEA+, BP_IEA+, MF_IEA-, and BP_IEA-. The performances of the other four methods are inferior to these two methods on the whole. On H. sapiens dataset, similar to the results on S. cerevisiae, method STE get the rank first on three sub-ontologies: BP_IEA+, BP_IEA-and MF_IEA-. Besides, simGIC achieves a relatively good performance, since it also got first results on MF_IEA+, CC_IEA+ and CC_IEA-. However, there is only a small gap between STE and simGIC on CC_IEAthat the score of simGIC is 0.0028 higher than that of STE. Therefore, method STE is superior to method simGIC and the other five methods on PPI experiments. It is worth noting that group-wise methods show better performance than pairwise methods on PPI experiment.

Results of gene expression experiment
In this experiment, we randomly selected 3500 gene pairs on the three ontologies. At the same time, the functional similarity with IEA+ and IEA-on different ontologies of these gene pairs are computed using our proposed STE and six baselines (WIS, VSM, Resnik, Wang, simUI, and simGIC). Based on the obtained gene functional similarity values and gene expression values, the pearson's correlation coefficients between them are calculated. The results for these seven methods are listed in Table 3.
On the whole, the correlation coefficients on CC, BP, and MF have a different distribution that CC ontology has the highest values, followed by BP and MF. On the method aspect, method GIC performs best on four sets of experiments, which are BP_IEA+, MF_IEA+, CC_IEA-, and BP_IEA-respectively. Meanwhile, method Resnik get, the highest score on CC_IEA+. The proposed method STE only runs first on MF_IEA-, which is less unsatisfactory. In this experiment, the performance of method simGIC is best, followed by method STE and Resnik. On the whole, the performance of group-wise approaches is better than that of the pairwise methods.

Discussion
In the current study, we propose a novel computational model for calculating the IC of a term in the GO graph. As far as we know, there are two categories of methods for computing the IC of GO terms, which are corpus-based and structural-based. Corpus-based methods such as Lin [13], Jiang and Conrath [12], and Resnik [11], measure the IC of a term by calculating the frequency that the term appears in a specific corpus. However, owing to the diversity and variability of corpora, corpus-based methods may obtain inaccurate IC of terms. Structural-based methods incorporate the structural information of term into its IC, which can effectively capture the information of the GO graph. For example, Sánchez [18] uses the ancestors of terms and the leaf terms as the information to measure IC. Subsequently, SORA [19] propose to add the depth information of terms into the measurement of IC. Following SORA, WIS [20] employ the depth, ancestors, descendants simultaneously to enrich the information contained by terms. Nevertheless, Sánchez may lose some useful structural information such as depth compared with SORA and WIS. WIS improves SORA by introducing the depth of descendants and the number of ancestors and . These works show the IC of a term has a strong correlation with its depth in the GO graph. Inspired by the above observations, we encapsulate the depth of both the given term and its ancestors to compute its IC. Additionally, we also exploit the number of descendants and all GO terms to enrich the information contained by terms. From Fig. 2, it can be found that a large proportion of terms located in the middle hierarchy of the GO graph. Further, Fig. 1 demonstrates the IC of terms calculated by our proposed method are concentrated in the middle range, whether in BP, CC, or MF, which fits well with the distribution in Fig. 2. To sum up, our proposed method for calculating the IC of terms is more effective against the early proposed methods.

Conclusion
In the current study, we proposed a novel computational model called STE to measure gene functional similarity. This method could make the best use of the GO structure to calculate the IC values of GO terms accurately by assigning a reasonable weighted value to the relationships of the GO structure. Especially, the depth and the genetic structure of GO terms are all merged into the IC value calculation model. Therefore, the IC values of terms are ranging from 0 to 1 and most of them are between 0.3 and 0.7. Besides, based on the values of edges, we have the ability to accurately estimate the IC values of annotation term sets with the concept of the inherited IC value concept. This is critical to the functional similarity calculation methods. Consequently, experimental results on various datasets demonstrated that STE is superior to the other six competitive methods in measuring functional similarity of genes.

Measuring the IC value of a term
A GO term with a lower level will describe a more specific function and vice versa. The IC of a term will be employed as a metric to measure how specific the term is. Therefore, terms with lower hierarchy will show higher values than those with higher hierarchy. Aside from this, terms with lower hierarchy always tend to have more ancestors and fewer descendants. Therefore, for a give GO term t, a novel computational model for calculating its IC value is developed as follows: where Ance(t) and Desc(t) denote the ancestor set and descendant set of term t, depth (t) and N are the max depth of term t and the total amount of GO terms.

The weighted value of an edge
As we know, there are many edges at different levels that linking the terms in the GO graph. To show the specificity of the edge, we assign a value ranging from 0 to 1 to each edge in the GO graph. The model for calculating the weighted value of an edge between term t i and t j can be expressed as: (6)  the eight terms and the computational process using our method to calculate the IC of term S is presented in Table 6.
In the first step, we initialize the IC(S) to be 0. It's obvious that term t 1 is the root of the term set and IC own (t 1 ) is 0. Hence,the result after the first step is IC(S) plus IC own (t 1 ) and is equal to 0.
In the second step, according to the Eq. 9, we calculate IC own (t 2 ) and its result is 0.01. Therefore, the result of IC(S) after the second step is IC(S) calculated by the last step plus IC own (t 2 ) and is equal to 0.01.
After iteration is finished, we have calculated all IC own of terms in the DAG and get the final result of IC(S). It is worth noting that using the proposed method to calculate the IC of term sets is very efficient and Algorithm 1 describes the computational process of the IC of o term set by the proposed model.

Measuring the gene functional similarity
Suppose there are two genes G 1 and G 2 , their annotation term sets are T G 1 and T G 2 respectively. The functional similarity between them is expressed as: where ∩ denotes the intersection while ∪ represents the union of the two sets respectively. (11) simSTE(G 1 , G 2 ) = IC(T G 1 ∩ T G 2 ) IC(T G 1 ∪ T G 2 )  Table 5 The weight values of corresponding edges in Fig. 5 Edge  Table 6 The computational process for measuring the IC of term set S Step Term IC IC own IC(S)