The Gene Ontology (GO) [1] is a widely used vocabulary system in bioinformatics, which systematically describes the functional relations between different genes or their products. The GO consists of three independent ontologies: biological process (BP), cellular component (CC), and molecular function (MF). Each ontology is structured as a Directed Acyclic Graph (DAG), in which GO terms form the nodes, and the relations between the GO terms form the edges. In the DAG, GO terms are connected by different hierarchical relations (mostly is_a and part_of relations). The is_a relation describes the fact that a child term is a specialization of a parent term, while the part_of relation denotes the fact that a child term is a component of a parent term. The term at the lower level (e.g., leaf term) has more specific information than the term at the upper level (e.g., root term). Recently, GO has been widely used in protein function prediction, validation [2, 3] and classification of protein-protein interactions [4, 5], gene expression studies [6] and pathway analysis [7].
Gene products are usually annotated with a set of GO terms. The functional relations between gene products are quantified either by using the GO terms shared by the gene products [8–10] or by explicitly using semantic similarity measures [11]. Semantic similarity measures, which generate numerical values describing the likeness between two terms, have been widely used [12].
In this paper, we presented a new method to calculate semantic similarity, the Hierarchical Vector Space Model (HVSM), which enhanced the basic vector space model (VSM) by explicitly introducing the relations between GO terms. When constructing the vector for a gene, HVSM takes into consideration not only the terms annotated to the gene but also their ancestors and descendants. In addition, HVSM considers both “is_a” and “part_of” relations. A Certainty Factor, which calibrates the similarity value based on the number of annotated terms, further improves the effectiveness of HVSM. The simplicity of the algorithm makes it very efficient. We tested HVSM on Homo sapiens and Saccharomyces cerevisiae protein-protein interaction datasets and compared the results with two other vector-based measures, IntelliGO [13] and basic VSM, and six other popular measures: TCSS [14], Resnik [15], Lin [16], Jiang [17], Schlicker [18], and SimGIC [19]. The results showed that HVSM outperformed the other eight measures in most cases. HVSM achieved an improvement of up to 4% compared to TCSS, 8% compared to IntelliGO, 12% compared to VSM, 6% compared to Resnik, 8% compared to Lin, 11% compared to Jiang, 8% compared to Schlicker, and 11% compared to SimGIC. In the CESSM test, the correlation coefficients with protein sequence, EC, and Pfam similarity also showed that HVSM was comparable to SimGIC and outperformed all other similarity measures.
Related Work
Different approaches have been proposed to calculate the semantic similarity, such as the vector-based approach, the term-based approach, the set-based approach, and the graph-based approach. The vector-based approach transforms a gene product into a vector, and functional similarity is measured by the similarity of corresponding vectors. The term-based approach calculates semantic similarities from term similarities using various combination strategies. The set-based approach views the set of terms as bags of words. Two gene products are similar if there is a large overlap between the two corresponding sets of terms. The graph-based approach uses graph matching techniques to compute the similarity.
In vector-based approaches, the dimension of the vector is equal to the total number of terms in GO. Each dimension corresponds to a term in GO. Each vector component is either 1 or 0, denoting the presence or absence of a term in the set of annotations of a given gene product. The alternative way is to have each dimension represent a certain property of a term (e.g., IC value) [20]. The most common method of measuring similarity between vectors is the cosine similarity:
$$\begin{array}{@{}rcl@{}} S_{v}\left(G_{1},G_{2}\right)=\frac{v_{1}\cdot v_{2}}{\left|v_{1}\right|\left|v_{2}\right|} \end{array} $$
(1)
where \(v_{i}\) represents the vector of the gene product \(G_{i}\), \(v_{1}\cdot v_{2}\) corresponds to the dot product between the two vectors, and \(|v_{i}|\) denotes the magnitude of vector \(v_{i}\).
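As a minimal sketch of Eq. (1), assuming binary annotation vectors over a small, made-up term list (the term identifiers and the helper names `binary_vector` and `cosine_similarity` are illustrative and do not come from any cited tool):

```python
import math


def binary_vector(annotations, all_terms):
    """Return a 0/1 vector with one component per ontology term."""
    return [1 if term in annotations else 0 for term in all_terms]


def cosine_similarity(v1, v2):
    """Cosine of the angle between v1 and v2 (Eq. 1); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical term list and annotation sets, for illustration only.
ALL_TERMS = ["T1", "T2", "T3", "T4"]
g1 = binary_vector({"T1", "T2"}, ALL_TERMS)
g2 = binary_vector({"T2", "T3"}, ALL_TERMS)
similarity = cosine_similarity(g1, g2)
```

Here the two gene products share one of their two annotated terms, so the cosine is about 0.5. Two gene products with no shared term always score 0, regardless of how close their terms are in the ontology.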
Suppose \(G_{1}\) and \(G_{2}\) are two given genes or gene products annotated by two sets of GO terms \(\{t_{11},t_{12},\cdots,t_{1n}\}\) and \(\{t_{21},t_{22},\cdots,t_{2m}\}\). IntelliGO [13], a vector-based method, represented each gene as a vector \(g=\sum _{i}\alpha _{i}e_{i}\), where \(\alpha_{i}=w(g,t_{i})IFA(t_{i})\), with \(w(g,t_{i})\) representing the weight assigned to the evidence code between \(g\) and \(t_{i}\), \(IFA(t_{i})\) being the inverse annotation frequency of the term \(t_{i}\), and \(e_{i}\) being the i-th basis vector corresponding to the annotation term \(t_{i}\). The dot product between two gene vectors was defined as:
$$\begin{array}{@{}rcl@{}} g_{1}*g_{2}=\sum_{ij}\alpha_{i}*\beta_{j}*e_{i}*e_{j} \end{array} $$
(2)
$$\begin{array}{@{}rcl@{}} e_{i}*e_{j}=\frac{2Depth(LCA)}{MinSPL\left(t_{1i},t_{2j}\right)+2Depth(LCA)} \end{array} $$
(3)
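where \(\alpha\) and \(\beta\) denoted the coefficients of \(g_{1}\) and \(g_{2}\), respectively, Depth(LCA) was the depth of the lowest common ancestor (LCA) of \(t_{1i}\) and \(t_{2j}\), and \(MinSPL(t_{1i},t_{2j})\) was the length of the shortest path between \(t_{1i}\) and \(t_{2j}\) that passed through the LCA. The similarity measure for the two gene vectors \(g_{1}\) and \(g_{2}\) was then defined using the cosine formula: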
$$\begin{array}{@{}rcl@{}} {SIM}_{IntelliGO}\left(g_{1},g_{2}\right)=\frac{g_{1}*g_{2}}{\sqrt{g_{1}*g_{1}}\sqrt{g_{2}*g_{2}}} \end{array} $$
(4)
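The following sketch evaluates Eqs. (2) to (4) on a toy example. It assumes that the coefficients of each gene vector are already computed and that Depth(LCA) and MinSPL are supplied through a lookup table; the table `PAIR_INFO`, the term names, and all numeric values are hypothetical, and this is not the IntelliGO implementation.

```python
import math


def basis_product(depth_lca, min_spl):
    """Eq. (3): 2*Depth(LCA) / (MinSPL + 2*Depth(LCA))."""
    denominator = min_spl + 2 * depth_lca
    if denominator == 0:
        # 0/0 arises only for the root paired with itself; returning 1.0 is an assumption.
        return 1.0
    return 2 * depth_lca / denominator


def generalized_dot(coeffs_a, coeffs_b, pair_info):
    """Eq. (2): sum over all term pairs of alpha_i * beta_j * (e_i * e_j)."""
    total = 0.0
    for term_i, alpha in coeffs_a.items():
        for term_j, beta in coeffs_b.items():
            depth_lca, min_spl = pair_info[frozenset((term_i, term_j))]
            total += alpha * beta * basis_product(depth_lca, min_spl)
    return total


def intelligo_similarity(coeffs_a, coeffs_b, pair_info):
    """Eq. (4): generalized cosine between two gene vectors."""
    norm = math.sqrt(generalized_dot(coeffs_a, coeffs_a, pair_info)) * math.sqrt(
        generalized_dot(coeffs_b, coeffs_b, pair_info)
    )
    if norm == 0:
        return 0.0
    return generalized_dot(coeffs_a, coeffs_b, pair_info) / norm


# Hypothetical ontology fragment: A and B are siblings at depth 2 under a parent at depth 1.
# Each entry maps a term pair to (Depth(LCA), MinSPL).
PAIR_INFO = {
    frozenset(("A",)): (2, 0),      # A with itself
    frozenset(("B",)): (2, 0),      # B with itself
    frozenset(("A", "B")): (1, 2),  # LCA is the parent; path A -> parent -> B
}
gene_a = {"A": 1.0, "B": 0.5}  # term -> coefficient alpha_i
gene_b = {"B": 1.0}            # term -> coefficient beta_j
sim = intelligo_similarity(gene_a, gene_b, PAIR_INFO)
```

Unlike the plain cosine of Eq. (1), the cross terms between distinct but related terms (here A and B) contribute to the score.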
The basic vector-based methods ignore the intrinsic relationships between terms and treat different terms as independent components, which may make the resulting semantic similarity inaccurate.
Term-based approaches can be classified into two groups: path-based and IC-based.
Path-based approaches, also called edge-based approaches [2, 21–26], use the number of edges or the distance between two terms to quantify the semantic similarity. When more than one path exists between two terms, the shortest path or the average of all paths is usually used. Similar approaches were adapted to the biomedical field [27]. Path-based methods rest on two assumptions: (1) edges and nodes are uniformly distributed [28], and (2) edges at the same level in the ontology correspond to the same semantic distance between terms. However, both assumptions rarely hold.
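As an illustration of the edge-counting idea, the sketch below finds the shortest path between two terms by breadth-first search over a toy ontology, ignoring edge direction. The `PARENTS` map and term names are hypothetical, and how a published measure converts this distance into a similarity score varies and is not shown here.

```python
from collections import deque


def shortest_path_length(parents, source, target):
    """Number of edges on the shortest path between two terms, ignoring edge direction."""
    # Build an undirected adjacency map from the child -> parents map.
    neighbours = {}
    for child, parent_list in parents.items():
        neighbours.setdefault(child, set())
        for parent in parent_list:
            neighbours[child].add(parent)
            neighbours.setdefault(parent, set()).add(child)

    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        term, distance = queue.popleft()
        if term == target:
            return distance
        for nxt in neighbours.get(term, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, distance + 1))
    return None  # the two terms are not connected


# Hypothetical ontology fragment: child term -> list of parent terms.
PARENTS = {
    "root": [],
    "P": ["root"],
    "Q": ["root"],
    "A": ["P"],
    "B": ["P"],
    "C": ["Q"],
}
siblings = shortest_path_length(PARENTS, "A", "B")
cousins = shortest_path_length(PARENTS, "A", "C")
```

In this fragment the siblings A and B are two edges apart and the cousins A and C are four edges apart. Every edge counts equally, which is exactly the second assumption questioned above.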
IC-based approaches [14–19, 29–32] use the Information Content (IC) to measure how specific and informative a term is. IC can be quantified by the negative log likelihood, \(-\log p(c)\), where \(p(c)\) is the occurrence probability of the term c in a specific corpus, such as the UniProt Knowledgebase [12]. The TCSS [14] measure defined a different way to calculate IC, which depended upon the specificity of the term in the graph:
$$\begin{array}{@{}rcl@{}} ICT(t)=-\ln\left(\frac{\left|{N(t)}\right|}{\left|{O}\right|}\right) \end{array} $$
(5)
where t was a term in the ontology O, |N(t)| was the number of child terms of t, and |O| was the total number of terms in O. The IC value of a term depended only on its children; its parents were not considered [15].
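Both IC definitions above reduce to the negative logarithm of a ratio, which the following sketch makes concrete. The function name `negative_log_ratio` and the counts are hypothetical; how the counts are obtained (corpus annotation frequencies or |N(t)| and |O| from the graph) is left to the caller.

```python
import math


def negative_log_ratio(count, total):
    """Return -ln(count / total), the common form of -log p(c) and of Eq. (5)."""
    if count <= 0 or total <= 0:
        raise ValueError("count and total must be positive")
    return -math.log(count / total)


# Hypothetical counts in an ontology of 1000 terms.
specific_term_ic = negative_log_ratio(10, 1000)   # small |N(t)|: specific term
general_term_ic = negative_log_ratio(500, 1000)   # large |N(t)|: general term
```

With these made-up counts, the specific term scores about 4.61 and the general term about 0.69, so a smaller ratio yields a higher IC.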
Many of the term-based methods are hybrid. They combine ideas from both the path-based and IC-based approaches, so the distinction between the two groups is not clear-cut. Three combination approaches are commonly used in term-based approaches to obtain semantic similarities of gene pairs from term similarities: maximum (MAX), average (AVG) and best-match average (BMA) [18]. Let GO(A) and GO(B) denote the term sets annotated to two proteins A and B. The MAX and the AVG approaches are given, respectively, by the maximum and the average of the similarities over all pairs of one term in GO(A) and one term in GO(B). The BMA is given by the average similarity between each term in GO(A) and its most similar term in GO(B), averaged with its reciprocal [33].
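The three combination strategies can be sketched as follows, assuming a term similarity function `sim` is already available. The lookup table `TERM_SIM` and the term names are made up for illustration. BMA is implemented here as the mean of the two directional best-match averages, following the description above; some tools normalize differently.

```python
def max_similarity(terms_a, terms_b, sim):
    """MAX: the highest similarity over all term pairs."""
    return max(sim(a, b) for a in terms_a for b in terms_b)


def avg_similarity(terms_a, terms_b, sim):
    """AVG: the mean similarity over all term pairs."""
    total = sum(sim(a, b) for a in terms_a for b in terms_b)
    return total / (len(terms_a) * len(terms_b))


def bma_similarity(terms_a, terms_b, sim):
    """BMA: mean of the best-match averages taken in both directions."""
    best_a = sum(max(sim(a, b) for b in terms_b) for a in terms_a) / len(terms_a)
    best_b = sum(max(sim(a, b) for a in terms_a) for b in terms_b) / len(terms_b)
    return (best_a + best_b) / 2


# Hypothetical term similarities between GO(A) = {a1, a2} and GO(B) = {b1, b2}.
TERM_SIM = {
    ("a1", "b1"): 0.9,
    ("a1", "b2"): 0.2,
    ("a2", "b1"): 0.4,
    ("a2", "b2"): 0.6,
}


def lookup_sim(a, b):
    return TERM_SIM[(a, b)]


go_a = ["a1", "a2"]
go_b = ["b1", "b2"]
max_score = max_similarity(go_a, go_b, lookup_sim)
avg_score = avg_similarity(go_a, go_b, lookup_sim)
bma_score = bma_similarity(go_a, go_b, lookup_sim)
```

On this toy table MAX is 0.9, AVG is about 0.525, and BMA is about 0.75. AVG is pulled down by the unrelated pairs, while BMA rewards each term for having at least one good match.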
Set-based approaches use the Tversky ratio model of similarity [34] (a general model of distance) to calculate the similarity between gene products, which is defined as:
$$\begin{array}{@{}rcl@{}} \frac{f\left(G_{1}\cap G_{2}\right)}{f\left(G_{1}\cap G_{2}\right)+\alpha*f\left(G_{1}-G_{2}\right)+\beta*f(G_{2}-G_{1})} \end{array} $$
(6)
where \(G_{1}\) and \(G_{2}\) are sets of terms annotated to two different gene products from the same ontology and f is an additive function on sets. When \(\alpha=\beta=1\), we get the Jaccard index of the two sets:
$$\begin{array}{@{}rcl@{}} S_{Jaccard}=\frac{f\left(G_{1}\cap G_{2}\right)}{f\left(G_{1}\cup G_{2}\right)} \end{array} $$
(7)
When \(\alpha =\beta =\frac {1}{2}\), we have the Dice coefficient of the two sets:
$$\begin{array}{@{}rcl@{}} S_{Dice}=\frac{2*f\left(G_{1}\cap G_{2}\right)}{f\left(G_{1}\right)+f\left(G_{2}\right)} \end{array} $$
(8)
Set-based approaches assume that the terms are independent of each other. The similarity and dissimilarity of genes are modeled by two sets and their interactions. From Eqs. (7) and (8), we can conclude that the Jaccard and Dice measures both return a similarity of 0 if two sets have no shared terms, even though these terms may be related in the GO hierarchy.
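A minimal sketch of Eqs. (6) to (8), assuming set cardinality as the additive function f. The function name `tversky` and the term sets are illustrative only.

```python
def tversky(g1, g2, alpha, beta, f=len):
    """Tversky ratio model (Eq. 6) for two term sets; f is an additive set function."""
    common = f(g1 & g2)
    denominator = common + alpha * f(g1 - g2) + beta * f(g2 - g1)
    if denominator == 0:
        return 0.0
    return common / denominator


# Hypothetical annotation sets.
g1 = {"T1", "T2", "T3"}
g2 = {"T2", "T3", "T4"}

jaccard = tversky(g1, g2, alpha=1.0, beta=1.0)  # Eq. (7)
dice = tversky(g1, g2, alpha=0.5, beta=0.5)     # Eq. (8)
disjoint = tversky({"T1"}, {"T4"}, alpha=1.0, beta=1.0)
```

For these sets the Jaccard value is 0.5 and the Dice value is about 0.667. The disjoint pair scores 0 no matter where T1 and T4 sit in the ontology.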
Graph-based approaches make use of graph matching and graph similarity to calculate the similarity between gene products. A gene is modeled by the sets of nodes and edges associated with a sub-graph. The similarity is calculated by quantifying the difference between two sub-graphs.
Graph-based methods have three disadvantages: (1) a few measures take into account only the shared terms in the sub-graphs, ignoring the edge type [35–38]; (2) graph matching has a weak correlation with similarity between terms [39]; (3) graph matching is an NP-complete problem [40].
Mazandu et al. [11] compared fourteen GO-based semantic similarity tools, classified in the context of IC models, term similarity approaches and functional similarity measures. The features and challenges of each approach were analyzed, including its scope of use and limitations. Mazandu et al. also described two key reasons why comparison is difficult: the dataset issue, where different tools use different versions of GO or annotation datasets, and the scaling issue, which results from tools making different assumptions regarding normalization methods.
The effect of shared information on the semantic similarity calculation was discussed in [41]. The shared information of a term pair is the set of common inheritance relations extracted from the structure of the GO graph. Experiments with three term similarity methods, each combined with five shared information methods, were conducted on three ontologies across six benchmarks. Among the choice of shared information, term similarity algorithm, and ontology type, the choice of ontology type most strongly influenced the performance, and shared information type had the least influence [41].
An increasing number of hybrid approaches have been proposed in recent years, such as the algorithm described in [42], which utilized both the topological features of the GO graph and the information contents of the GO terms. Based on the topological structure of the GO graph, the measure [42] identified a number of GO terms as cluster centers according to a specific threshold, and then a membership was calculated for each cluster center and term pair. Semantic similarity scores were obtained by combining the relevant memberships and shared information contents. The threshold and the width of the Gaussian membership function were determined separately for different ontologies and datasets to achieve the best AUC scores, while most of the other methods, including TCSS, used fixed parameter values. In addition, the normalization method used in [42] depended on the ontology. Therefore, the method showed relatively good performance.
Machine learning approaches to semantic similarity are also emerging, such as support vector machine (SVM) [43], random forest [44], and AdaBoost [45]. Among these techniques, random forest and SVM were found to achieve the best performance [43].
Methods involving natural language processing have also been reported. w2vGO [46] utilized the Word2vec model to compare the definitions of two GO terms and therefore did not rely on the GO graph. The results showed that w2vGO was comparable to Resnik [15].
Semantic similarity measures were also extended to gene network analysis. GFD-Net [47] combined the concept of semantic similarity with gene network topology to analyze the functional dissimilarity of gene networks based on GO. Its effectiveness was demonstrated in gene network validation.