Skip to main content

A novel gene functional similarity calculation model by utilizing the specificity of terms and relationships in gene ontology

Abstract

Background

Recently, with the foundation and development of gene ontology (GO) resources, numerous works have been proposed to compute functional similarity of genes and achieved series of successes in some research fields. Focusing on the calculation of the information content (IC) of terms is the main idea of these methods, which is essential for measuring functional similarity of genes. However, most approaches have some deficiencies, especially when measuring the IC of both GO terms and their corresponding annotated term sets. To this end, measuring functional similarity of genes accurately is still challenging.

Results

In this article, we proposed a novel gene functional similarity calculation method, which especially encapsulates the specificity of terms and edges (STE). The proposed method mainly contains three steps. Firstly, a novel computing model is put forward to compute the IC of terms. This model has the ability to exploit the specific structural information of GO terms. Secondly, the IC of term sets are computed by capturing the genetic structure between the terms contained in the set. Lastly, we measure the gene functional similarity according to the IC overlap ratio of the corresponding annotated genes sets. The proposed method accurately measures the IC of not only GO terms but also the annotated term sets by leveraging the specificity of edges in the GO graph.

Conclusions

We conduct experiments on gene functional classification in biological pathways, gene expression datasets, and protein-protein interaction datasets. Extensive experimental results show the better performances of our proposed STE against several baseline methods.

Background

Since Gene Ontology (GO) [1, 2] was first founded in 1998, it has been an important resource to support modern biological research. The GO knowledge base contains a controlled vocabulary of terms, which has three different orthogonal ontologies named biological process (BP), molecular function (MF), and cellular component (CC) . In each ontology, terms are employed to describe the function of genes and the relationships which have specific meanings are used to connect two terms. There are many relationships in the GO database and we only consider two of them: is_a and part_of.

GO exists in the form of a directed acyclic graph (DAG) and has two important characteristics. One is that terms with lower hierarchy generally show more specific meanings while terms with higher hierarchy have more generic meanings. Traditionally, the specific and generic meanings of terms are measured by the IC values, which could represent their specificity. The other is that the edges in different levels also have different specificity because of the terms that they connect.

The functions of a gene could be described by go terms and thus we suggest that this gene is annotated by the terms. The GO annotations (GOA) [3,4,5] database is specifically used to describe genes and their annotation terms. Since GO has three branches, genes can also be annotated from BP, CC, and MF aspects. Comparing the functional similarity between genes has many significant applications [6,7,8,9,10], such as protein interaction prediction, gene clustering, and disease gene identification.

In the past decades, various kinds of methods have already been developed for studying gene functional similarity. The most important concept in functional similarity comparison of genes domain is IC, which could measure the specificity of a GO term. So far, there are two types of IC values computing categories: corpus-based [11,12,13,14,15,16,17] and structured-based [18,19,20,21].

For a term t, its IC value calculated by corpus-based approaches shows as follow:

$$\begin{aligned} I C_{corpus}(t)=-\log (p(t)) \end{aligned}$$
(1)

where p(t) denotes the probability of both term t and its descendants appearing in the corpus. Method Resnik [11], Jiang and Conrath [12], and Lin [13] are all based on this definition. Equation 1 strongly demonstrates that the IC value of a given GO term is mainly attributed to the number of genes or proteins it annotates in the corpus. Therefore, the IC value of terms may vary according to the corpus. On the other hand, the annotation information in a corpus is updating over time, which also has an effect on the IC values of terms [22].

To overcome this drawback, researcher David Sánchez [18] put forward another IC computing model based on the GO structure. For a given GO term t, its IC value can be expressed as:

$$\begin{aligned} IC_{structure}(t)=-\log \left( {\frac{\frac{\left| leaves(t) \right| }{\left| subsumers(t) \right| } +1 }{max\_leaves+1}}\right) \end{aligned}$$
(2)

where \(max\_leaves\) means the amount of leaf terms. subsumers(t) is the ancestor set of term t. Additionally, the terms in leaves(t) are belonged to the descendants of term t that are also belonged to leaves. From the equation, we can find that this model exploits the specific genetic information of term t. The information contains the leaves of ontology, the number of their descendants and ancestors. Later, method SORA [19] and WIS [20] make an improvement based on this model, and achieve better performances.

Based on the IC values of terms, researchers have developed numerous gene functional similarity methods, which have two categories: pair-wise strategies [11,12,13, 23] and group-wise strategies [19, 20, 24,25,26,27,28].

For pair-wise strategy methods, they measure gene functional similarity mainly utilizing two steps: the first one focusing on computing the semantic similarity between annotated terms and the second one is measuring functional similarity with respect to the semantic similarity in the first step. The best matches average rule is commonly used in the second step. For group-wise approaches, they measure the gene functional similarity from the annotation set perspective. Here we select some typical computing models for a brief review. A detailed review is beyond the scope of this paper and has already been presented by Catia Pesquita [29].

Method Resnik [11] is a pair-wise strategy approach. For two given term \(t_1\) and \(t_2\), the semantic similarity between them can be expressed as:

$$\begin{aligned} Sim_{Resnik}(t_1,t_2)=IC(LCA(t_1,t_2)) \end{aligned}$$
(3)

where \(LCA(t_1, t_2)\) means the lowest common ancestor for term \(t_1\) and \(t_2\). Then it calculates gene functional similarity leveraging the BMA rule. The procedure for some other methods like Wang [23], Jiang and Conrath [12], and Lin [13] are similar to method Resnik.

Method simUI [26] is a group-wise method, which is proposed by Gentlman. Suppose there are two genes \(G_1\) and \(G_2\), the formula of this model can be expressed as:

$$\begin{aligned} simUI(G_1,G_2)=\frac{\left| S_{G_1}\cap S_{G_2}\right| }{\left| S_{G_1}\cup S_{G_2}\right| } \end{aligned}$$
(4)

where \(S_{G_1}\) and \(S_{G_2}\) represent the annotation term set for \(G_1\) and \(G_2\) respectively. Followed Gentlman, some other models such as SimGIC [27], SORA [19] and WIS [20] are also proposed. All of these approaches pay much attention to compute the IC of annotated term sets accurately and effectively. For example, simGIC sums up the IC value of each term, while SORA puts forward the concept of inherited semantics to avoid computing the overlap IC of terms in annotation sets. Method WIS first assigns a weighted value to the relationships of GO structure and then designs a rule to compute the inherited IC values of GO terms.

Based on the idea of vector representation, some other approaches [24, 25, 30] are proposed. These methods employ the one-hot coding to deal with the annotation terms. Terms in the annotation term set will be represented in a vector of which dimension indicates the total amount of GO terms. Each dimension denoted by a binary digit. Suppose there are two genes \(G_1\) and \(G_2\), their annotation term vectors are \(v_1\) and \(v_2\), the functional similarity between \(G_1\) and \(G_2\) based on basic vector space model (VSM) can be expressed as:

$$\begin{aligned} Sim_{VSM}(G_1,G_2)=\frac{\upsilon _1\cdot \upsilon _2}{\left| \upsilon _1 \right| \left| \upsilon _2 \right| } \end{aligned}$$
(5)

For group-wise approaches, they do not make the best use of the GO structure, which may cause the calculation of IC not accurately. For example, method GIC does not take the number of ancestors of terms into consideration on the IC values calculation. VSM neglects the relationship between GO terms.

To overcome the drawbacks, we put forward a novel gene functional similarity calculation method, which especially encapsulates the Specificity of Terms and Edges (STE). STE mainly has two models: the first one calculates the IC value of terms and another one is designed for computing the weighted value of edges. Their detailed description will be shown in “Methods” section.

Results

In this section, the experimental results on various datasets are presented. Before that, we first introduce the experimental data.

Datasets

The GO data is downloaded from the online resource website. In this version, the term number of BP, CC, and MF are 29,380, 4,181, and 11,113 respectively. Besides, the Gene Ontology annotation for H. sapiens and Saccharomyces cerevisiae data are also downloaded from the gene ontology resource website. In this study, we divide annotation types into two categories: IEA+, and IEA-, which means that the annotation term sets of genes contain the Inferred electronic-assigned (IEA) terms or not. Moreover, six annotation combinations are presented as MF_IEA+, MF_IEA-, CC_IEA+, CC_IEA-,BP_IEA+,BP_IEA-.

It is an important and popular validation strategy for gene functional similarity methods to classify the genes based on molecular function. In this study, we employ the yeast pathway data in Saccharomyces genome database (SGD) to make an analysis for the functional classification of genes based on the gene functional similarity calculation results.

For protein-protein interaction experiments, we download the data from the previous approaches [20, 23]. Besides, we remove the obsoleted data and rebuild a new experimental dataset. Negative PPIs for human and yeast are randomly generated based on the annotation of genes on three ontologies. What’s more, the number of negative PPIs and positive PPIs are the same.

In the end, gene expression data of Saccharomyces cerevisiae is from Jain and Davis [31]. In this dataset, there are a total of 11,966 pairs of cerevisiae gene when we remove some obsoleted data. In the end, there are a total of 4,211, 3,888, and 3,867 gene pairs for CC, BP, and MF aspects respectively.

The analysis for the distribution of IC

Measuring the IC of GO terms reasonably is the foundation for accurately calculating the gene functional similarity. Therefore, we firstly investigate the distribution of the IC of terms on three sub-ontologies with three methods which are Resnik, WIS, STE. The detailed results are shown in Fig. 1. According to Eq. 6, terms with higher levels tend to have smaller IC values, while terms with lower levels have bigger IC values. To prove this point of view, we investigate the relationships between the number of terms and their depth. The results are shown in Fig. 2. The GO terms with the middle level are 89 percent of the total in the three ontologies, which demonstrates that the IC values of most terms should be medium.

Fig. 1
figure 1

The amount distribution of terms based on IC with respect to ontologies a BP, b CC, and c MF. X-axis and Y-axis indicate the the scope of IC value and the amount of terms respectively

Fig. 2
figure 2

The amount distribution of terms based on depth with respect to BP, CC and MF ontologies

Lastly, we analyze the results of method Resnik, WIS, and STE in detail. For method Resnik, there is more than 85 percent of IC of terms are larger than 0.9. In other words, this model does not distinguish the difference of terms in the GO graph. Method WIS makes a big improvement compared with Resnik. However, many small IC values are presented on the curve of WIS. The results of the proposed model STE are highly consistent with the Eq. 6, which meets the human perspectives. Overall, STE is the best model in these three methods in measuring the IC values of terms.

Gene functional classification in biological pathways

Compare gene functional similarity calculation methods with meaningful pathways is effective to a large extent. If the results of a gene functional similarity method are consistent with the fact that demonstrated in the biological pathways, this method will be an effective one. Meanwhile, there are more than 80 biological pathways in the selected dataset, and we choose one pathway named ‘phenylalanine degradation’ with ten different genes and eight various EC numbers to validate the performances of methods to be compared. The selected pathway is shown in Fig. 3. At the same time, we compute the functional similarities of the 10 genes with respect to MF ontology with STE and three baseline methods Resnik, Wang, and VSM.

Fig. 3
figure 3

The detailed information of phenylalanine degradation pathway

It is generally believed that genes with similar EC numbers will have a higher functional similarity. The results are demonstrated in Fig. 4. For method Resnik, there is one pair of genes of which functional similarity is higher than 0.5 and the similarity of other genes pairs are small. Taking gene ‘PDC1’ and ‘PDC5 ’ as an example, the EC number of these two genes are the same, and their similarity value is only 0.43. For method Wang, the functional similarities of gene pairs are not very distinguishable. For example, gene ‘PDC6’ has a higher similarity with gene ‘ADH5’ and ‘ADH4’ than gene ‘SFA1’. This is not inconsistent with the EC number knowledge. For method VSM, the EIC number of gene ‘H1S5’ and ‘PDC1’ is quite different, but the functional similarity of them has a higher similarity, which is unreasonable.

Fig. 4
figure 4

Gene function classification results with respect to MF ontology. a Resnik, b Wang, c VSM and d STE

In the end, for STE, the functional similarities of gene pairs are consistent with the class of the EC number. Moreover, it can also distinguish different the ‘distance’ of gene pairs from the functional similarity. Overall, STE is the best of the four compared methods.

Results on PPIs

It is another critical evaluation criterion to score the functional similarity calculation methods utilizing protein interactions. In this sub-experiment, according to the selected PPIs in the dataset, we calculate their functional similarity. Then the performance of functional calculation methods is deeply compared based on the receiver operating characteristic (ROC) and the area under the curve (AUC) metric.

The functional similarity values of PPI pairs for S. cerevisiae and H. sapiens are measured using all seven methods. Tables 1 and 2 present their corresponding AUC values respectively. Specifically, on S. cerevisiae dataset, method simGIC runs first on CC_IEA- and MF_IEA+. STE achieves the best performance on four sub-datasets, which are CC_IEA+, BP_IEA+, MF_IEA-, and BP_IEA-. The performances of the other four methods are inferior to these two methods on the whole. On H. sapiens dataset, similar to the results on S. cerevisiae, method STE get the rank first on three sub-ontologies: BP_IEA+, BP_IEA- and MF_IEA-. Besides, simGIC achieves a relatively good performance, since it also got first results on MF_IEA+, CC_IEA+ and CC_IEA-. However, there is only a small gap between STE and simGIC on CC_IEA- that the score of simGIC is 0.0028 higher than that of STE. Therefore, method STE is superior to method simGIC and the other five methods on PPI experiments. It is worth noting that group-wise methods show better performance than pairwise methods on PPI experiment.

Table 1 AUC values in S. cerevisiae datasets with respect to ontology BP, CC and MF (IEA+ and IEA-)
Table 2 AUC values in H. sapiens datasets with respect to ontology BP, CC and MF (IEA+ and IEA-)

Results of gene expression experiment

In this experiment, we randomly selected 3500 gene pairs on the three ontologies. At the same time, the functional similarity with IEA+ and IEA- on different ontologies of these gene pairs are computed using our proposed STE and six baselines (WIS, VSM, Resnik, Wang, simUI, and simGIC). Based on the obtained gene functional similarity values and gene expression values, the pearson’s correlation coefficients between them are calculated. The results for these seven methods are listed in Table 3.

Table 3 Pearson’s correlation coefficient with gene expression dataset with respect to ontology BP, CC and MF (IEA+ and IEA-)

On the whole, the correlation coefficients on CC, BP, and MF have a different distribution that CC ontology has the highest values, followed by BP and MF. On the method aspect, method GIC performs best on four sets of experiments, which are BP_IEA+, MF_IEA+, CC_IEA-, and BP_IEA- respectively. Meanwhile, method Resnik get, the highest score on CC_IEA+. The proposed method STE only runs first on MF_IEA-, which is less unsatisfactory. In this experiment, the performance of method simGIC is best, followed by method STE and Resnik. On the whole, the performance of group-wise approaches is better than that of the pairwise methods.

Discussion

In the current study, we propose a novel computational model for calculating the IC of a term in the GO graph. As far as we know, there are two categories of methods for computing the IC of GO terms, which are corpus-based and structural-based. Corpus-based methods such as Lin [13], Jiang and Conrath [12], and Resnik [11], measure the IC of a term by calculating the frequency that the term appears in a specific corpus. However, owing to the diversity and variability of corpora, corpus-based methods may obtain inaccurate IC of terms. Structural-based methods incorporate the structural information of term into its IC, which can effectively capture the information of the GO graph. For example, Sánchez [18] uses the ancestors of terms and the leaf terms as the information to measure IC. Subsequently, SORA [19] propose to add the depth information of terms into the measurement of IC. Following SORA, WIS [20] employ the depth, ancestors, descendants simultaneously to enrich the information contained by terms. Nevertheless, Sánchez may lose some useful structural information such as depth compared with SORA and WIS. WIS improves SORA by introducing the depth of descendants and the number of ancestors and . These works show the IC of a term has a strong correlation with its depth in the GO graph.

Inspired by the above observations, we encapsulate the depth of both the given term and its ancestors to compute its IC. Additionally, we also exploit the number of descendants and all GO terms to enrich the information contained by terms. From Fig. 2, it can be found that a large proportion of terms located in the middle hierarchy of the GO graph. Further, Fig. 1 demonstrates the IC of terms calculated by our proposed method are concentrated in the middle range, whether in BP, CC, or MF, which fits well with the distribution in Fig. 2. To sum up, our proposed method for calculating the IC of terms is more effective against the early proposed methods.

Conclusion

In the current study, we proposed a novel computational model called STE to measure gene functional similarity. This method could make the best use of the GO structure to calculate the IC values of GO terms accurately by assigning a reasonable weighted value to the relationships of the GO structure. Especially, the depth and the genetic structure of GO terms are all merged into the IC value calculation model. Therefore, the IC values of terms are ranging from 0 to 1 and most of them are between 0.3 and 0.7. Besides, based on the values of edges, we have the ability to accurately estimate the IC values of annotation term sets with the concept of the inherited IC value concept. This is critical to the functional similarity calculation methods. Consequently, experimental results on various datasets demonstrated that STE is superior to the other six competitive methods in measuring functional similarity of genes.

Methods

Measuring the IC value of a term

A GO term with a lower level will describe a more specific function and vice versa. The IC of a term will be employed as a metric to measure how specific the term is. Therefore, terms with lower hierarchy will show higher values than those with higher hierarchy. Aside from this, terms with lower hierarchy always tend to have more ancestors and fewer descendants. Therefore, for a give GO term t, a novel computational model for calculating its IC value is developed as follows:

$$\begin{aligned} \begin{aligned} IC(t)=log(depth(t))&*\left( log\left( \sum \limits _{t_i\in Ance(t)}depth(t_i)\right) +1 \right) \\&*\left( 1-\frac{log(\left| Desc(t)\right| )}{log(N+1)} \right) \end{aligned} \end{aligned}$$
(6)

where Ance(t) and Desc(t) denote the ancestor set and descendant set of term t, depth(t) and N are the max depth of term t and the total amount of GO terms.

The weighted value of an edge

As we know, there are many edges at different levels that linking the terms in the GO graph. To show the specificity of the edge, we assign a value ranging from 0 to 1 to each edge in the GO graph. The model for calculating the weighted value of an edge between term \(t_i\) and \(t_j\) can be expressed as:

$$\begin{aligned} \omega _{ij}=\frac{\sum \limits _{t_m\in Desc(t_j)}IC(t_m)}{\sum \limits _{t_n\in Desc(t_i)}IC(t_n)} \end{aligned}$$
(7)

where \(t_j\) is the direct descendant node of \(t_i\), Desc(t) contains the descendants of term t.

Measuring the own IC of a term

Based on the true path rule in the gene annotation area, if one gene is annotated by a GO term, all the ancestors of this term will annotate the gene. Therefore, terms in the annotation term set have an inheritance relationship. The IC of term \(t_j\) can be seen as two parts: one is the inherited IC denoted as \(IC_{inherited}(t_j)\) from its parent terms and the other is its own IC denoted as the \(IC_{own}(t_j)\), which can be calculated as follows:

$$\begin{aligned} IC_{inherited}(t_j)=\sum _{t_i\in Parent(t_j)}\omega _{ij}*IC(t_i) \end{aligned}$$
(8)

where \(Parent(t_j)\) includes all the direct ancestors of term \(t_j\). The weighted value \(\omega _{ij}\) is calculated using Eq. 7.

$$\begin{aligned} \begin{aligned} IC_{own}(t_j)&=IC(t_j)-IC_{inherited}(t_j) \end{aligned} \end{aligned}$$
(9)

Measuring the IC value of a term set

To avoid calculating the overlap semantics in a term set, we utilize Eq. 9 to sum up the IC of a term. Suppose T is a term set, the IC value of T defined as follows:

$$\begin{aligned} IC(T)=\sum \limits _{t\in T}IC_{own}(t) \end{aligned}$$
(10)

An example of calculating the IC of a term set

From Fig. 5, eight GO terms are contained in a term set. The IC of each term and the weight of every edge are listed in Tables 4 and 5 respectively. Suppose set S contains the eight terms and the computational process using our method to calculate the IC of term S is presented in Table 6.

Fig. 5
figure 5

DAG for GO term organelle assembly:0070925

Table 4 The IC values of corresponding terms in Fig. 5
Table 5 The weight values of corresponding edges in Fig. 5
Table 6 The computational process for measuring the IC of term set S

In the first step, we initialize the IC(S) to be 0. It’s obvious that term \(t_1\) is the root of the term set and \(IC_{own}(t_1)\) is 0. Hence,the result after the first step is IC(S) plus \(IC_{own}(t_1)\) and is equal to 0.

In the second step, according to the Eq. 9, we calculate \(IC_{own}(t_2)\) and its result is 0.01. Therefore, the result of IC(S) after the second step is IC(S) calculated by the last step plus \(IC_{own}(t_2)\) and is equal to 0.01.

After iteration is finished, we have calculated all \(IC_{own}\) of terms in the DAG and get the final result of IC(S). It is worth noting that using the proposed method to calculate the IC of term sets is very efficient and Algorithm 1 describes the computational process of the IC of o term set by the proposed model.

Measuring the gene functional similarity

Suppose there are two genes \(G_1\) and \(G_2\), their annotation term sets are \(T_{G_1}\) and \(T_{G_2}\) respectively. The functional similarity between them is expressed as:

$$\begin{aligned} simSTE(G_1,G_2)=\frac{IC(T_{G_1}\cap T_{G_2})}{IC(T_{G_1}\cup T_{G_2})} \end{aligned}$$
(11)

where \(\cap\) denotes the intersection while \(\cup\) represents the union of the two sets respectively.

Algorithms

figure a

Availability of data and materials

The public datasets used in this study are described in the results section of this paper. The experimental data are available at: https://github.com/hc-fang/STE.

Abbreviations

GO:

Gene ontology

GOA:

Gene ontology annotation

LCA:

lowest common ancestor

IC:

Information content

STE:

Specificity of terms and edges

BP:

Biological process

MF:

Molecular function

CC:

Cellular component

DAG:

Directed acyclic graph

PPIs:

Protein-protein interactions

EC number:

Enzyme class number

IEA:

Inferred electronic-assigned terms

MF_IEA+:

Molecular function with IEA

MF_IEA-:

Molecular function without IEA

CC_IEA+:

Cellular component with IEA

CC_IEA-:

Cellular component without IEA

BP_IEA+:

Biological process with IEA

BP_IEA-:

Biological process without IEA.

References

  1. Consortium GO. The gene ontology (go) database and informatics resource. Nucleic Acids Res. 2004;32(suppl-1):258–61.

    Article  Google Scholar 

  2. Consortium GO. The gene ontology resource: 20 years and still going strong. Nucleic Acids Res. 2019;47(D1):330–8.

    Article  Google Scholar 

  3. Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R. The gene ontology annotation (goa) database: sharing knowledge in uniprot with gene ontology. Nucleic Acids Res. 2004;32(suppl-1):262–6.

    Article  Google Scholar 

  4. Consortium GO. Gene ontology annotations and resources. Nucleic Acids Res. 2012;41(D1):530–5.

    Article  Google Scholar 

  5. Huntley RP, Sawford T, Mutowo-Meullenet P, Shypitsyna A, Bonilla C, Martin MJ, O’Donovan C. The goa database: gene ontology annotation updates for 2015. Nucleic Acids Res. 2015;43(D1):1057–63.

    Article  Google Scholar 

  6. Peng J, Xue H, Wei Z, Tuncali I, Hao J, Shang X. Integrating multi-network topology for gene function prediction using deep neural networks. Brief Bioinform. 2021;22(2):2096–105.

    Article  CAS  PubMed  Google Scholar 

  7. Peng J, Wang Y, Guan J, Li J, Han R, Hao J, Wei Z, Shang X. An end-to-end heterogeneous graph representation learning-based framework for drug–target interaction prediction. Brief Bioinform (2021).

  8. Zhao Y, Fu G, Wang J, Guo M, Yu G. Gene function prediction based on gene ontology hierarchy preserving hashing. Genomics. 2019;111(3):334–42.

    Article  CAS  PubMed  Google Scholar 

  9. Qiu Y, Ching WK, Zou Q. Matrix factorization-based data fusion for the prediction of RNA-binding proteins and alternative splicing event associations during epithelial–mesenchymal transition. Briefings Bioinform. 2021;22(6). https://doi.org/10.1093/bib/bbab332.

  10. Zhang X, Zou Q, Rodriguez-Paton A, Zeng X. Meta-Path Methods for Prioritizing Candidate Disease miRNAs. IEEE/ACM Trans Comput Biol Bioinf. 2019;16(1):283–291. https://doi.org/10.1109/TCBB.2017.2776280.

  11. Resnik P. Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language. J Artif Intell Res. 1999;11:95–130.

    Article  Google Scholar 

  12. Jiang JJ, Conrath DW. Semantic similarity based on corpus statistics and lexical taxonomy. In: Proceedings of the 10th Research on Computational Linguistics International Conference, 1997; 19–33.

  13. Lin D. An information-theoretic definition of similarity. In: Proceedings of the Fifteenth International Conference on Machine Learning, 1998;296–304.

  14. Islam A, Inkpen D. Semantic text similarity using corpus-based word similarity and string similarity. ACM Trans Knowl Discov Data (TKDD). 2008;2(2):1–25.

    Article  Google Scholar 

  15. Ferret O. Testing semantic similarity measures for extracting synonyms from a corpus. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10) (2010).

  16. Ahmed M, Dixit C, Mercer RE, Khan A, Samee MR, Urra F. Multilingual corpus creation for multilingual semantic similarity task. In: Proceedings of the 12th Language Resources and Evaluation Conference. 2020;4190–4196.

  17. Yu G, Luo W, Fu G, Wang J. Interspecies gene function prediction using semantic similarity. BMC Syst Biol. 2016;10(4):495–507.

    Google Scholar 

  18. Sánchez D, Batet M. A semantic similarity method based on information content exploiting multiple ontologies. Expert Syst Appl. 2013;40(4):1393–9.

    Article  Google Scholar 

  19. Teng Z, Guo M, Liu X, Dai Q, Wang C, Xuan P. Measuring gene functional similarity based on group-wise comparison of go terms. Bioinformatics. 2013;29(11):1424–32.

    Article  CAS  PubMed  Google Scholar 

  20. Tian Z, Wang C, Guo M, Liu X, Teng Z. An improved method for functional similarity analysis of genes based on gene ontology. BMC Syst Biol. 2016;10(4):465–84.

    Google Scholar 

  21. Dutta P, Basu S, Kundu M. Assessment of semantic similarity between proteins using information content and topological properties of the gene ontology graph. IEEE/ACM Trans Comput Biol Bioinf. 2017;15(3):839–49.

    Article  Google Scholar 

  22. Milano M, Agapito G, Guzzi PH, Cannataro M. An experimental study of information content measurement of gene ontology terms. Int J Mach Learn Cybern. 2018;9(3):427–39.

    Article  Google Scholar 

  23. Wang JZ, Du Z, Payattakool R, Yu PS, Chen C-F. A new method to measure the semantic similarity of go terms. Bioinformatics. 2007;23(10):1274–81.

    Article  CAS  PubMed  Google Scholar 

  24. Zhang J, Jia K, Jia J, Qian Y. An improved approach to infer protein-protein interaction based on a hierarchical vector space model. BMC Bioinform. 2018;19(1):1–14.

    Article  Google Scholar 

  25. Benabderrahmane S, Smail-Tabbone M, Poch O, Napoli A, Devignes M-D. Intelligo: a new vector-based semantic similarity measure including annotation origin. BMC Bioinform. 2010;11(1):1–16.

    Article  Google Scholar 

  26. Falcon S, Gentleman R. Using gostats to test gene lists for go term association. Bioinformatics. 2007;23(2):257–8.

    Article  CAS  PubMed  Google Scholar 

  27. Pesquita C, Faria D, Bastos H, Ferreira AE, Falcão AO, Couto FM. Metrics for go based protein semantic similarity: a systematic evaluation. In: BMC Bioinformatics. 2008;9:1–16. BioMed Central

  28. Yu G, Fu G, Wang J, Zhu H. Predicting protein function via semantic integration of multiple networks. IEEE/ACM Trans Comput Biol Bioinf. 2015;13(2):220–32.

    Article  Google Scholar 

  29. Xu Y, Guo M, Shi W, Liu X, Wang C. A novel insight into gene ontology semantic similarity. Genomics. 2013;101(6):368–75.

    Article  CAS  PubMed  Google Scholar 

  30. Yu G, Lu C, Wang J. Nogoa: predicting noisy go annotations using evidences and sparse representation. BMC Bioinform. 2017;18(1):1–13.

    Article  Google Scholar 

  31. Jain S, Bader GD. An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology. BMC Bioinform. 2010;11(1):1–14.

    Article  Google Scholar 

Download references

Acknowledgements

Thanks go to the editor and the anonymous reviewers for their comments and suggestions.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 23 Supplement 1, 2022: Selected articles from the Biological Ontologies and Knowledge bases workshop 2020. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-23-supplement-1.

Funding

This work was supported by the Natural Science Foundation of China (Grant Nos. 61772475, 61801432). Especially, publication costs are funded by the Natural Science Foundation of China (Grant No. 61801432). The funding bodies played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Author information

Authors and Affiliations

Authors

Contributions

ZT and HF developed the codes, processed the experimental data, and drafted the manuscript together. YY set up the general idea of this study. ZZ revised the whole manuscript and gave some advice. All authors have read and approved the manuscript.

Corresponding author

Correspondence to Zhenfeng Zhu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Tian, Z., Fang, H., Ye, Y. et al. A novel gene functional similarity calculation model by utilizing the specificity of terms and relationships in gene ontology. BMC Bioinformatics 23 (Suppl 1), 47 (2022). https://doi.org/10.1186/s12859-022-04557-6

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12859-022-04557-6

Keywords