A relation based measure of semantic similarity for Gene Ontology annotations
 Brendan Sheehan^{1},
 Aaron Quigley^{1},
 Benoit Gaudin^{1} and
 Simon Dobson^{1}
https://doi.org/10.1186/147121059468
© Sheehan et al; licensee BioMed Central Ltd. 2008
Received: 09 May 2008
Accepted: 04 November 2008
Published: 04 November 2008
Abstract
Background
Various measures of semantic similarity of terms in bio-ontologies such as the Gene Ontology (GO) have been used to compare gene products. Such measures of similarity have been used to annotate uncharacterized gene products and group gene products into functional groups. There are various ways to measure semantic similarity, either using the topological structure of the ontology, the instances (gene products) associated with terms or a mixture of both. We focus on an instance level definition of semantic similarity while using the information contained in the ontology, both in the graphical structure of the ontology and the semantics of relations between terms, to provide constraints on our instance level description.
Semantic similarity of terms is extended to annotations by various approaches, either through aggregation operations such as min, max and average or through an extrapolative method. These approaches introduce assumptions about how semantic similarity of terms relates to the semantic similarity of annotations that do not necessarily reflect how terms relate to each other.
Results
We exploit the semantics of relations in the GO to construct an algorithm called SSA that provides the basis of a framework that naturally extends instance based methods of semantic similarity of terms, such as Resnik's measure, to describing annotations and not just terms. Our measure attempts to correctly interpret how terms combine via their relationships in the ontological hierarchy. SSA uses these relationships to identify the most specific common ancestors between terms. We outline the set of cases in which terms can combine and associate partial order constraints with each case that order the specificity of terms. These cases form the basis for the SSA algorithm. The set of associated constraints also provides a set of principles that any improvement on our method should seek to satisfy.
Conclusion
We derive a measure of semantic similarity between annotations that exploits all available information without introducing assumptions about the nature of the ontology or data. We preserve the principles underlying instance based methods of semantic similarity of terms at the annotation level. As a result our measure better describes the information contained in annotations associated with gene products and is better suited to characterizing and classifying gene products through their annotations.
Background
Although the semantic similarity between two GO terms has been extensively investigated [1–4], how to define similarity between two gene products based on GO annotations for a specific application remains unclear [5]. To date annotation similarity has been computed by four general approaches: the set-based approach; the graph-based approach; the vector-based approach; and the term-based approach. In the set-based approach an annotation is viewed as a 'bag of words'. Two annotations are similar if there is a large overlap between their sets of terms. A graph-based approach views similarity as a graph-matching procedure. Vector-based methods embed annotations in a vector space where each possible term in the ontology forms a dimension. Term-based approaches compute similarity between individual terms and then combine these similarities to produce a measure of annotation similarity.
None of the above approaches considers the semantics of relationships between terms. How terms are related can significantly alter how an annotation, which is a set of terms, is interpreted. In the GO there are two main types of relations: is_a and part_of. The is_a relation represents a taxonomic relationship between terms that can be modeled using the improper subset relation, which is a partial ordering of terms. The part_of relation represents a partonomic relationship between terms that can also be modeled in terms of a partial order. Though the partial orders represented by taxonomies and partonomies are well understood, little attention has been given to how these two partial orderings combine. Using the various cases identified by combining taxonomies and partonomies we construct an algorithm called SSA (Semantic Similarity of Annotations) that identifies the terms that can be associated with an annotation and terms that relate to both annotations. Instances associated with these terms are then used to construct a Resnik-like measure of annotation similarity, thus extending the underlying intuitions behind this term-based measure to the annotation level.
A measure of term or annotation similarity should be based on a set of principles that form the basis for what is considered similar. The nature of similarity has been the focus of intense research in the areas of aesthetics [6, 7] and psychology [8]. In mathematics properties such as identity, symmetry and the triangle inequality have been used to form the basis of measures of similarity of mathematical objects. Principles of term and annotation similarity have been suggested by various authors. This work intends to build on these principles and introduce additional principles that a measure of similarity should seek to satisfy.
Similarity between objects is normally expressed as a number that ranges along an interval on the real numbers ℝ. However the main purpose of similarity is usually to determine whether two or more objects are similar to a reference object. For this reason a measure of similarity can be viewed as a partial order on a set of objects, the actual numbers play only a secondary purpose. For example, we may say that an object X is more similar to Z than another object Y. Formally this is expressed as sim(X, Z) > sim(Y, Z).
In the study of ontological similarity Lin [9] develops the principles of commonality and difference when constructing a measure of term similarity. The greater the commonality between objects the greater the similarity. Likewise, the greater the difference between objects the greater the dissimilarity. The source of both the commonality and difference between terms depends on the method chosen to measure the descriptiveness of terms. Different sources of descriptiveness may result in different orderings of similarity between terms or annotations.
Popescu et al. [10] recognize that an important property of term similarity is that two different terms should have a nonzero similarity value if the terms are related. They also recognize that an important property of annotation similarity is that the descriptiveness of annotations should be greater than or equal to the descriptiveness of its constituent terms. In this paper this property is called the monotonicity property.
In defining a measure of similarity a set of relevant properties that objects can be compared along are identified. In ontological similarity, whether of terms or annotations, there are two main sources of similarity: the conceptual or structural level; and the instance level. At the structural level we may consider such properties as graph distance, graph similarity, relation types, common ancestors, etc. At the instance level we consider the set of instances associated with a term or annotation. Our measure of ontological similarity combines aspects from both levels. Here we survey how various measures of annotation similarity combine these properties in various ways to form the basis for a measure of descriptiveness of a term or annotation.
Set-Based Approaches
In this situation the source of descriptiveness of an annotation is its set of terms. Each term and its set of associated instances is considered independent of other terms. The commonality and difference between annotations are modeled as the set intersection and set difference of their sets of terms respectively. Set-based approaches return a similarity of zero if the annotations do not share common terms, ignoring the fact that terms may be closely related. Because of the atomic nature of terms in the set-based approach the monotonicity property does not apply.
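The behaviour described above can be illustrated with a small sketch. The text does not name a particular set-based index, so the Jaccard index used here is an assumption, and the term names are hypothetical.

```python
def jaccard_similarity(g1, g2):
    """Set-based similarity: shared terms divided by all terms (Jaccard index)."""
    g1, g2 = set(g1), set(g2)
    union = g1 | g2
    if not union:
        return 0.0  # two empty annotations: nothing to compare
    return len(g1 & g2) / len(union)


# Hypothetical annotations. "child_of_A" is closely related to "A" in the
# ontology, but a purely set-based view cannot see that relationship.
g1 = {"A", "B"}
g2 = {"child_of_A", "C"}
g3 = {"A", "C"}
print(jaccard_similarity(g1, g2))  # no shared terms, so similarity is zero
print(jaccard_similarity(g1, g3))  # one shared term out of three
```

The first comparison returns zero even though the two annotations contain related terms, which is the weakness noted above.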
Vector-Based Approaches
where v_{i} represents a vector of terms constructed from an annotation (group of terms) G_{i}, |·| corresponds to the size of the vector and • corresponds to the dot product between two vectors. The source of descriptiveness, commonality and difference is the same as for set-based approaches.
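A measure built from a dot product and vector sizes is consistent with cosine similarity. The sketch below assumes that reading, and assumes binary vectors with one dimension per ontology term; the function names and term labels are illustrative only.

```python
import math


def annotation_vector(annotation, ontology_terms):
    """Binary vector: one dimension per ontology term, 1.0 if the term is present."""
    return [1.0 if term in annotation else 0.0 for term in ontology_terms]


def cosine_similarity(v1, v2):
    """Dot product of v1 and v2 divided by the product of their sizes."""
    dot = sum(a * b for a, b in zip(v1, v2))
    size1 = math.sqrt(sum(a * a for a in v1))
    size2 = math.sqrt(sum(b * b for b in v2))
    if size1 == 0.0 or size2 == 0.0:
        return 0.0  # an empty annotation has no direction to compare
    return dot / (size1 * size2)


terms = ["A", "B", "C"]  # hypothetical ontology terms
v1 = annotation_vector({"A", "B"}, terms)
v2 = annotation_vector({"A", "C"}, terms)
print(cosine_similarity(v1, v2))
```

As with the set-based view, annotations with no term in common receive a similarity of zero regardless of how the terms are related in the ontology.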
Graph-Based Approaches
An ontology is a directed, acyclic graph (DAG) whose edges correspond to relationships between terms. Thus it is natural to compare terms using methods for graph matching and graph similarity. We may consider the similarity between annotations in terms of the subgraph that connects terms within each annotation. Annotation similarity is then measured in terms of similarity between two graphs. Graph matching has only a weak correlation with similarity between terms. It is also computationally expensive, graph matching being an NP-complete problem on general graphs [12].
The descriptiveness of an annotation is modeled by the set of nodes and edges associated with a subgraph. Commonality between annotations is based on the set intersection while difference is modeled by the set difference where each set consists of the nodes and edges associated with each subgraph. Alternatively, the set of edges may be ignored and only common terms of both graphs are considered [13–15].
Improving Similarity Measures by Weighting Terms
where, as before, G_{1} and G_{2} are annotations or sets of terms describing data (e.g. a gene product), T_{x} is the x^{th} term from a set of terms and m(T_{x}) denotes the weight of T_{x}. This weighting function can be used to represent various properties of a term or annotation such as a measure of vagueness, uncertainty, sense of preference or a combination of the above. The vector-based approach may be extended so that values along a particular dimension can lie on the interval [0, 1] or [0, ∞). The graph-based approach can be extended by weighting the edges between terms in the graph.
Assigning a weight to each term in an annotation allows for the possibility of introducing the monotonicity property into a similarity measure. Using the monotonicity property, the weight associated with an annotation should be greater than or equal to the weight associated with any of its constituent terms. Weights can form an additional basis on which to measure the descriptiveness of a term or annotation.
Instance-Based Weights
One approach to assigning weight to an ontological term is to measure how informative a term is in describing data. A method of measuring information is to analyze a term's use in a corpus against the general use of ontological terms in the same corpus. Information is measured using the surprisal function:
IC_{Corpus}(T_{i}) = −log(p(T_{i}))    (1)
where p(T_{i}) corresponds to the probability of a term T_{i} or its taxonomic descendants occurring in a corpus. For example, consider the case where there are 30 distinct instances in a corpus and 5, 3 and 2 of these instances are annotated by the terms T_{i}, T_{j} and T_{k} respectively. If T_{j} and T_{k} are subtypes or children of T_{i} and do not have child terms themselves then $IC_{Corpus}(T_{i}) = -\log\left(\frac{5+3+2}{30}\right) \approx 1.099$.
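The worked example can be reproduced with a short sketch. It assumes natural logarithms (which give the value 1.099 quoted above) and, as in the example, that the instances annotated by each term are disjoint so that counts can simply be summed. The dictionaries `direct_counts` and `children` are hypothetical data structures, not part of the original method description.

```python
import math


def information_content(term, direct_counts, children, total_instances):
    """Surprisal of a term: -log of the share of instances annotated by the
    term or any of its taxonomic descendants."""
    seen = set()      # guards against visiting a term twice in a DAG
    stack = [term]
    count = 0
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        count += direct_counts.get(current, 0)
        stack.extend(children.get(current, ()))
    if count == 0:
        return math.inf  # a term never used in the corpus is maximally surprising
    return -math.log(count / total_instances)


direct_counts = {"Ti": 5, "Tj": 3, "Tk": 2}   # instances annotated directly
children = {"Ti": ["Tj", "Tk"]}               # Tj and Tk are subtypes of Ti
print(round(information_content("Ti", direct_counts, children, 30), 3))  # 1.099
```

A leaf term such as T_{j}, with 3 of 30 instances, is rarer than its parent and therefore receives a larger information content.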
Other Weighting Approaches
where desc(T_{i}) returns the number of descendants of term T_{i} and numTerms refers to the total number of terms in the ontology.
Term-Based Approaches
More sophisticated term based approaches combine multiple measures of term similarity and aggregate similarity values using more complex functions, for example [17].
Graphical Measures of Term Similarity
Where T_{1} and T_{2} are the two terms being compared, T_{lcta} is the term that corresponds to the lowest common taxonomic ancestor of T_{1} and T_{2}. T_{root} denotes the root node of the ontology (assuming that the ontology has only one root). dist(T_{i}, T_{j}) denotes the graph distance between terms T_{i} and T_{j}. The 2 * dist(T_{lcta}, T_{root}) component of the denominator serves to normalize the measure.
Instance-Based Measures of Term Similarity
Similarity may be measured using an instance based measure of semantic similarity as computed by either Resnik (eqn. 2) or Lin (eqn. 3). Resnik [21, 22] exploits the informativeness of the lowest common ancestor between terms as a measure of semantic similarity:
s_{Resnik}(T_{i}, T_{j}) = IC_{Corpus}(T_{lcta})    (2)
where T_{lcta} denotes the lowest common taxonomic ancestor between ontological terms T_{i} and T_{j}. This measure only accounts for the commonality between terms.
Lin [9] normalizes this value by the information content of the two compared terms:
s_{Lin}(T_{i}, T_{j}) = 2 * IC_{Corpus}(T_{lcta}) / (IC_{Corpus}(T_{i}) + IC_{Corpus}(T_{j}))    (3)
which has the advantage that it maps onto values on the interval [0, 1] unlike Resnik's measure which maps onto the interval [0, ∞). Lin's measure also accounts for both the commonality and difference between terms. Resnik's measure does, however, have the desirable property that terms close to the root of the ontology have a low similarity. This is not the case for Lin's measure.
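The contrast between the two measures near the root can be made concrete. The sketch below uses Resnik's measure as given in eqn. 2 and Lin's measure in its usual form, 2 * IC(T_{lcta}) / (IC(T_{i}) + IC(T_{j})); the information content values are invented for illustration.

```python
def resnik_similarity(ic, lcta):
    """Resnik: information content of the lowest common taxonomic ancestor."""
    return ic[lcta]


def lin_similarity(ic, ti, tj, lcta):
    """Lin: shared information normalized by the information of both terms."""
    denominator = ic[ti] + ic[tj]
    if denominator == 0.0:
        return 0.0  # assumption: both terms carry no information (e.g. the root)
    return 2.0 * ic[lcta] / denominator


# Hypothetical information content values for two pairs of sibling terms:
# one pair just below the root, one pair deep in the ontology.
ic = {
    "shallow_parent": 0.10, "shallow_a": 0.12, "shallow_b": 0.12,
    "deep_parent": 3.00, "deep_a": 3.60, "deep_b": 3.60,
}
print(resnik_similarity(ic, "shallow_parent"), resnik_similarity(ic, "deep_parent"))
print(lin_similarity(ic, "shallow_a", "shallow_b", "shallow_parent"),
      lin_similarity(ic, "deep_a", "deep_b", "deep_parent"))
```

With these values Resnik's measure separates the shallow pair from the deep pair, while Lin's measure assigns both pairs essentially the same score because the ratio of shared to total information is the same.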
The only structural property that both Resnik and Lin exploit is the lowest common taxonomic ancestor. To overcome this weakness Jiang and Conrath [23] integrate graph distance based measures of similarity into information based approaches. They construct a generalized weighting measure between a child and its immediate parent that accounts for the number of out edges and depth of terms along the shortest path between the compared terms in the ontology. While they acknowledge that other relation types might be relevant to measuring similarity their measure is based solely on the taxonomic or is_a relations in the ontology.
New Approaches to Annotation Similarity
Beyond the set-, vector-, graph- and term-based approaches to measuring similarity of annotations, other methods exist that introduce the additional properties discussed above, such as monotonicity, and take into account the semantics of ontological relations.
Similarity Based on Fuzzy Measures
The monotonicity property leads naturally to the use of fuzzy measures as a basis for measuring the descriptiveness of an annotation. Using the information content measure of terms described in eqn. 1 as the basis for measuring similarity a fuzzy measure is constructed. A fuzzy measure is a weighting on sets of terms such that the weight associated with a set of terms is greater than or equal to the weight associated with any of its subsets.
Popescu et al. [10] use fuzzy measures to induce a weighting m for an annotation from its constituent terms. This weight is extrapolated from the weights of individual terms by using the formula for constructing a Sugeno λ-fuzzy measure. For sets of terms G_{a}, G_{b} and G_{c}, where G_{c} = G_{a} ∪ G_{b} and G_{a} ∩ G_{b} = ∅, a λ-fuzzy measure for G_{c} is
m_{λ}(G_{c}) = m_{λ}(G_{a}) + m_{λ}(G_{b}) + λ * m_{λ}(G_{a}) * m_{λ}(G_{b})
where $m_{G_{1}}$ and $m_{G_{2}}$ are the λ-fuzzy measure functions that characterize G_{1} and G_{2} respectively. The relatedness of terms is accounted for by augmenting each annotation with the lowest common ancestors for each pair of terms from each annotation. This ensures a nonzero similarity between annotations containing related terms.
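The λ-fuzzy combination rule for disjoint sets of terms can be sketched as follows. The value of λ is simply passed in; how it is chosen is not covered here, and the term weights are invented for illustration.

```python
def combine_lambda(m_a, m_b, lam):
    """Measure of the union of two disjoint sets under the Sugeno lambda rule."""
    return m_a + m_b + lam * m_a * m_b


def lambda_measure(term_weights, lam):
    """Fold the rule over singleton weights to weight a whole annotation.

    Starts from the empty set (measure 0), so a single term keeps its own weight.
    """
    total = 0.0
    for weight in term_weights:
        total = combine_lambda(total, weight, lam)
    return total


weights = [0.2, 0.3]  # hypothetical weights of two terms in an annotation
print(lambda_measure(weights, lam=0.5))
```

For weights in [0, 1] and λ ≥ −1 the combined weight is never smaller than the weight of either part, which is the monotonicity property discussed earlier.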
However, an ontology models other aspects of relatedness that should be taken into account. Relations between terms in an annotation can be used to identify redundant terms whose relevance to the descriptiveness of an annotation is already accounted for by other terms. For example, if two terms in an annotation are taxonomically related the existence of the parent term is implied by the existence of the child term.
If redundancy of terms is not taken into account it may lead to too many or too few instances being associated with the term. This is especially true when a term is part_of another term. The instances associated with the annotation consist of the parts and not what the instances are part of.
Exploiting Semantics of Ontological Relations
where $s(T_{x}, G_{y}) = \max_{T_{y} \in G_{y}}(s(T_{x}, T_{y}))$ and |G_{y}| denotes the number of terms in annotation G_{y}.
Wang et al. make the observation that the instance based measures of term similarity will produce varying results based on the corpus chosen. They keep a fixed value for the contribution each relation type makes to the descriptiveness of a term. This does not account for the varying influence of terms on each other throughout the ontology even if the graph distance is the same. Exploiting the corpus statistics, if used appropriately, may account for this drawback. As with all term-based methods, where terms from each annotation are compared in a pairwise fashion, it is difficult to see how the monotonicity property is ensured when measuring the similarities between two annotations.
Methods
The Gene Ontology relates terms using is_a and part_of relations. We develop a measure of informativeness that provides a description of an annotation that takes into consideration the relations between terms. We use the informativeness measure of a term (eqn. 1) as the basis for providing a description of an annotation. We define an algorithm called SSA that combines the instances of terms while taking into account how these sets of instances are related by how their associated terms are related in the ontology. This results in a set of instances that can be said to be associated with an annotation and not just a term. We can then extend the concept of instance based semantic similarity of terms, such as Resnik's measure, to annotations.
Interpreting Annotations from Taxonomies
A taxonomy induces a partial ordering on a set of terms by the improper subset relation ⊆. If T_{i} is_a T_{k} and T_{j} is_a T_{k} then the sets of instances associated with T_{i} and with T_{j} are both subsets of the set of instances of T_{k}. Assuming that we know all possible instances that can be associated with a term, whatever properties the instances of both T_{i} and T_{j} share can be associated with any of the instances of T_{k}. This forms the basis for measuring the commonality between terms used in instance-based measures of similarity between terms.
The difference between terms T_{ i }and T_{ j }is modeled by the difference between the set of instances associated with each term. If we have two or more terms from a taxonomy in an annotation then it is reasonable to argue that the set of instances associated with an annotation should be the intersection of the set of instances associated with each term. The informativeness of the annotation is then based on the set of instances resulting from this intersection.
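A minimal sketch of this interpretation follows. It assumes a hypothetical mapping `instances_of` from each term to the set of instance identifiers associated with it, and measures informativeness with the surprisal function using natural logarithms.

```python
import math


def annotation_instances(annotation, instances_of):
    """Instances associated with an annotation of taxonomy terms:
    the intersection of the instance sets of its terms."""
    sets = [set(instances_of[term]) for term in annotation]
    return set.intersection(*sets) if sets else set()


def annotation_information(annotation, instances_of, total_instances):
    """Surprisal of the annotation based on its intersected instance set."""
    count = len(annotation_instances(annotation, instances_of))
    return math.inf if count == 0 else -math.log(count / total_instances)


# Hypothetical corpus of 10 gene products, three of them annotated by each term.
instances_of = {"Ti": {"g1", "g2", "g3"}, "Tj": {"g2", "g3", "g4"}}
print(sorted(annotation_instances(["Ti", "Tj"], instances_of)))  # ['g2', 'g3']
print(annotation_information(["Ti", "Tj"], instances_of, 10))
```

Because the intersection can only shrink as terms are added, the informativeness of the annotation is at least that of each constituent term.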
Interpreting Annotations from Partonomies
The part_of relation between terms denotes the concept that one term is 'part of ' another. It provides an alternative notion of relatedness between terms. An ontology consisting only of part_of relations is known as a partonomy. An example of a simple partonomy is wheel part_of car. It would not make sense to say that a wheel is_a car. The study of partness is complicated by the fact that there are many kinds of part_of relations. Yet the study of partness, known as mereology [24], has shown that there are also common aspects to all types of part_of relations, namely that part_of relations form a partial ordering on the sets of instances associated with each term.
According to the GO Consortium's usage guidelines since 2004 [25] the part_of relation should be interpreted as 'necessarily part of' where T_{i} part_of T_{j} means that all instances of T_{i} are part of one or more instances of T_{j}. The converse is not necessarily true. For example, all nuclei are part of cells but not all cells contain a nucleus. Bittner [26] models such a part_of relation using an improper partial order, i.e. for term T_{i} with descendant terms T_{j}:
T_{j} ≤_{part_of} T_{i}  ∀ T_{j} part_of T_{i}
Annotations consisting of terms such that one term is part_of another should view the child term as being relevant to the annotation while the parent term provides redundant, contextual information. For example, consider an annotation consisting of two terms T_{ i }and T_{ j }from a partonomy. If T_{ j }part_of T_{ i }then the annotation should be interpreted as the set of instances of T_{ j }. All we can say is that the number of instances of T_{ i }associated with the annotation can be no more than the number of instances of T_{ j }. In general, an annotation consisting of terms belonging to a partonomy consists of terms that provide the set of instances that can be associated with the annotation while other terms provide the context in which these instances are embedded.
Partial Order Constraints for GO Annotations
The GO consists of many examples similar to the one described above. In general, the GO can be viewed as a taxonomy interspersed with part_of relations. Two terms are said to be directly related if there exists a series of relations on a single path between them. Terms that are not directly related along a path in the graph are indirectly related via a common ancestor. For example there may be other terms that are part_of 'mitochondrial nucleoid' in which case the term 'mitochondrial chromosome' is only related to the other parts by an indirect path of part_of relations. Though not shown, the terms 'mitochondrial nucleoid' and 'chromosome' are only indirectly related via a common ancestor through a number of is_a relations. When interpreting an annotation it is necessary to account for such situations.
Table 1. Partial Order Constraints

| Relatedness | Situation | Ordering |
|---|---|---|
| Directly* | T_{i} IS T_{j} | ρ(T_{i}) ≤ ρ(T_{j}) |
| | T_{i} PART T_{j} | ρ(T_{i}) ≤ ρ(T_{j}) |
| | T_{i} MIXED T_{j} via T_{k} | ρ(T_{i}) ≤ ρ(T_{k}) ≤ ρ(T_{j}) |
| Indirectly via T_{lca}* | T_{i} IS T_{lca}, T_{j} IS T_{lca} | ρ(T_{i}), ρ(T_{j}) ≤ ρ(T_{lca}) |
| | T_{i} PART T_{lca}, T_{j} IS T_{lca} | ρ(T_{i}) ≤ ρ(T_{j}) ≤ ρ(T_{lca}) |
| | T_{i} PART T_{lca}, T_{j} PART T_{lca} | ρ(T_{i}), ρ(T_{j}) ≤ ρ(T_{lca}) |
| | T_{i} MIXED T_{lca} via T_{k}, T_{j} IS T_{lca} | (ρ(T_{i}) ≤ ρ(T_{k})), ρ(T_{j}) ≤ ρ(T_{lca}) |
| | T_{i} MIXED T_{lca} via T_{k}, T_{j} PART T_{lca} | ρ(T_{i}), ρ(T_{j}) ≤ ρ(T_{k}) ≤ ρ(T_{lca}) |
| | T_{i} MIXED T_{lca} via T_{k}, T_{j} MIXED T_{lca} via T_{m} | (ρ(T_{i}) ≤ ρ(T_{k})), (ρ(T_{j}) ≤ ρ(T_{m})) ≤ ρ(T_{lca}) |
Directly Related Cases
it can be inferred that T_{ i }⊆ T_{ j }. Where terms are related by a PART path a similar argument can be inferred for how two terms are ordered.
For the MIXED case there exists a mixture of is_a and part_of relations. The nature of the MIXED relationship is ultimately determined by the part_of relations. For example, if T_{i} MIXED T_{j} then this can be interpreted as T_{i} part_of T_{j}. There may be several is_a relations traversed along a MIXED path from T_{j} to T_{i} before a part_of relation is encountered. This means that T_{i} can only be part_of a subset of the instances of T_{j}. This subset is identified by the set of instances associated with the term (labeled T_{k} in table 1) which is the parent term of the first part_of relation encountered along a MIXED path from T_{j} to T_{i}. This results in the partial order:
T_{i} ≤ T_{k} ≤ T_{j}
where T_{i} is the descendant of T_{j}, T_{j} is the ancestor and T_{k} denotes the first term before a part_of relation is encountered while traversing the MIXED path in the ontology from T_{j} to T_{i}. This form of reasoning can be further extended along the rest of the MIXED path to produce a more detailed partial order. However, if the ultimate goal is only to determine the partial order between T_{i} and T_{j} then extending the reasoning in this way is unnecessary.
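Locating T_{k} on a path can be sketched as below. The representation of a path as a list of (parent, relation, child) triples ordered from the ancestor T_{j} down to the descendant T_{i} is an assumption made for illustration, as are the function names.

```python
def classify_path(edges):
    """Classify a downward path as 'IS', 'PART' or 'MIXED'."""
    relations = {relation for _parent, relation, _child in edges}
    if not relations:
        raise ValueError("a path must contain at least one edge")
    if relations == {"is_a"}:
        return "IS"
    if relations == {"part_of"}:
        return "PART"
    if relations == {"is_a", "part_of"}:
        return "MIXED"
    raise ValueError("unexpected relation types: %r" % sorted(relations))


def first_part_of_parent(edges):
    """Return T_k: the parent term of the first part_of edge met when
    walking from the ancestor towards the descendant."""
    for parent, relation, _child in edges:
        if relation == "part_of":
            return parent
    return None  # a pure IS path has no such term


# Ta is_a Tj, Tk is_a Ta, Ti part_of Tk (hypothetical terms)
path = [("Tj", "is_a", "Ta"), ("Ta", "is_a", "Tk"), ("Tk", "part_of", "Ti")]
print(classify_path(path), first_part_of_parent(path))  # MIXED Tk
```

On this path the two is_a edges narrow T_{j} down to T_{k} before the part_of edge is reached, giving the ordering T_{i} ≤ T_{k} ≤ T_{j}.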
Indirectly Related Homogeneous Cases
There are three cases to handle where both the paths to the common ancestor between terms are homogeneous: IS – IS, PART – PART and IS – PART (or PART – IS). In the first case, where T_{i} IS T_{lca} and T_{j} IS T_{lca}, both terms T_{i} and T_{j} are taxonomic descendants of a lowest common ancestor T_{lca}, so the number of instances associated with each of T_{i} and T_{j} should be no greater than the number of instances associated with T_{lca}. This results in the partial order
T_{i}, T_{j} ≤ T_{lca}
An annotation consisting of two such related terms can be interpreted as the set of instances that are associated with both T_{i} and T_{j}. A similar form of reasoning can be applied to the PART – PART case. The partial order for the final case IS – PART (or PART – IS) can be derived in a similar fashion to the inhomogeneous direct MIXED case. If T_{i} IS T_{lca} and T_{j} PART T_{lca} then it can be inferred that T_{j} PART T_{i}. If an annotation consists of two such terms then it should be interpreted as the set of instances of T_{j}. As a partial order constraint this can be modeled as
T_{j} ≤ T_{i} ≤ T_{lca}
Indirectly Related Inhomogeneous Cases
Indirectly related inhomogeneous cases occur when terms are related by a common ancestor in the ontology and one or both of the paths connecting the common ancestor with each term consists of an inhomogeneous set of relation types. There are three such cases to account for: IS – MIXED (or MIXED – IS), PART – MIXED (or MIXED – PART) and MIXED – MIXED.
The partial order for the first case IS – MIXED (or MIXED – IS) can be handled by considering each path separately. The partial order for the T_{i} IS T_{lca} path is T_{i} ≤ T_{lca}. The partial order for the MIXED path is T_{j} ≤ T_{k} ≤ T_{lca}, which is derived in the same way as the directly related MIXED case. Combining the two partial orders results in
(T_{j} ≤ T_{k}), T_{i} ≤ T_{lca}
If an annotation consists of two such terms then it should be interpreted as the set of instances of T_{ j }that are part of instances that are of type T_{ i }and T_{ k }.
The PART – MIXED (or MIXED – PART) case requires slightly more reasoning to construct its associated partial order. If T_{i} PART T_{lca} and T_{j} MIXED T_{lca} then it can be inferred that both T_{i} and T_{j} are part of T_{lca}. Because T_{j} is only part of a subset of the instances associated with T_{lca}, namely the instances associated with T_{k}, T_{i} can only be part of the set of instances associated with T_{k} also. This results in the partial order
T_{j}, T_{i} ≤ T_{k} ≤ T_{lca}
An annotation consisting of two such related terms should be interpreted as the set of instances of T_{ i }and T_{ j }that are part of the same instances of T_{ k }.
The final case MIXED – MIXED occurs when paths from both terms to the common ancestor consist of a mixture of relation types. The partial order for such a case can be constructed by looking at each path separately. If T_{i} MIXED T_{lca} then the partial ordering is T_{i} ≤ T_{k} ≤ T_{lca}. Similarly for T_{j} MIXED T_{lca} we get T_{j} ≤ T_{m} ≤ T_{lca}. Combining the two partial orders results in
(T_{i} ≤ T_{k}), (T_{j} ≤ T_{m}) ≤ T_{lca}
If an annotation consists of two such terms then it should be interpreted as the set of instances of T_{ i }and T_{ j }that are part of the same instances of T_{ k }and T_{ m }.
The SSA Algorithm
The SSA algorithm is based on the nine cases of term relatedness described above. The SSA algorithm derives the set of instances that can be associated with an annotation from the set of instances associated with that annotation's constituent terms. There are two aspects to the algorithm: identifying which terms are the contextual, redundant instances and which terms' instances can be associated with the annotation. For example, a contextual instance may be 'mitochondrial nucleoid' that provides the context for the set of instances of 'chromosome'. Throughout we denote the set of contextual terms by exclTerms and the set of terms whose instances can be associated with the annotation as inclTerms. numInst(T_{ i }) denotes the number of instances associated with T_{ i }.
The above partial order constraints were constructed under the ideal assumptions assumed by the partial orderings in taxonomies and partonomies. In reality there only ever exists an incomplete set of instances associated with terms and some adjustment of the number of instances is required if the partial order constraints are to be satisfied. Terms that are taxonomically related are guaranteed to satisfy the taxonomic constraints. However, terms that are partonomically related may not satisfy their associated partial order constraints. In these cases some adjustment of the number of instances associated with a term is necessary. For example, if T_{ i }PART T_{ j }and there are no instances associated with T_{ j }in the corpus while there are a number of instances associated with T_{ i }then in order to satisfy the PART constraint the number of instances of T_{ j }is set equal to the number of instances associated with T_{ i }.
The algorithm consists of the following steps:
1. For each distinct ordered pair (T_{i}, T_{j}) of terms in annotations G_{1} and G_{2} respectively:
   (a) Identify the case that corresponds to how T_{i} is related to T_{j}
       * Terms are assigned to inclTerms or exclTerms depending on case
       * The number of instances associated with a term may be adjusted if the case allows
2. Remove any terms from inclTerms also found in exclTerms
3. Return the sets inclTerms and exclTerms
where an ordered pair of terms (T_{i}, T_{j}) means that (T_{i}, T_{j}) ≠ (T_{j}, T_{i}). In the following sections we identify how each case assigns terms to inclTerms and exclTerms and adjusts the number of instances associated with each term used to compare annotations.
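The outer loop of these steps can be sketched as follows. This is an outline under stated assumptions rather than the authors' implementation: the case analysis is delegated to a hypothetical callback `classify_pair` that returns the terms to include and exclude for one pair, and the adjustment of instance counts is left out. The toy callback knows only a single is_a relationship and treats unrelated terms as both included, which is also an assumption.

```python
from itertools import product


def ssa_term_sets(g1, g2, classify_pair):
    """Collect inclTerms and exclTerms over all pairs (T_i, T_j) with
    T_i drawn from annotation g1 and T_j from annotation g2."""
    incl_terms, excl_terms = set(), set()
    for ti, tj in product(g1, g2):
        incl, excl = classify_pair(ti, tj)
        incl_terms |= set(incl)
        excl_terms |= set(excl)
    # A term that is contextual for any pair is not counted for the annotation.
    incl_terms -= excl_terms
    return incl_terms, excl_terms


IS_A_ANCESTORS = {"B": {"A"}}  # hypothetical ontology fragment: B is_a A


def toy_classify(ti, tj):
    """Toy case analysis handling only a direct IS relation in either direction."""
    if tj in IS_A_ANCESTORS.get(ti, set()):
        return {ti}, {tj}
    if ti in IS_A_ANCESTORS.get(tj, set()):
        return {tj}, {ti}
    return {ti, tj}, set()  # assumption: unrelated terms are both kept


incl, excl = ssa_term_sets({"B", "C"}, {"A"}, toy_classify)
print(sorted(incl), sorted(excl))  # ['B', 'C'] ['A']
```

Term A is first included through the pair (C, A) and then removed again because the pair (B, A) marks it as redundant, which is the purpose of the removal step.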
Direct Cases
The IS constraint where one term in an annotation is a special case of another term can be implemented as follows:
1   if (T_{i} IS T_{j})
        inclTerms ← inclTerms ∪ T_{i}
        exclTerms ← exclTerms ∪ T_{j}
In this situation the term T_{ j }is viewed as being the common taxonomic ancestor of both terms.
The PART constraint where one term is a part of another term can be implemented as:
2   if (T_{i} PART T_{j})
        inclTerms ← inclTerms ∪ T_{i}
        exclTerms ← exclTerms ∪ T_{j}
        if (numInst(T_{j}) < numInst(T_{i}))
            numInst(T_{j}) = numInst(T_{i})
In this situation the term T_{ j }is viewed as providing the context that instances of T_{ i }are part of.
The case is similar for T_{ i }MIXED T_{ j }. In these cases we are relating terms that belong to two different lines of taxonomic inheritance where terms have a possibly incomplete set of associated instances. In order to ensure that the partial order constraint associated with this case is implemented correctly if T_{ j }has fewer instances associated with it than T_{ i }then we adjust the number of instances associated with T_{ j }to be equal to the number of instances associated with T_{ i }.
The MIXED constraint where T_{ i }is a part of another term T_{ j }via an intermediate term T_{ k }can be implemented similarly to the PART case:
3   if (T_{i} MIXED T_{j})
        inclTerms ← inclTerms ∪ T_{i}
        exclTerms ← exclTerms ∪ T_{j}
        exclTerms ← exclTerms ∪ T_{k}
        if (numInst(T_{k}) < numInst(T_{i}))
            numInst(T_{k}) = numInst(T_{i})
        if (numInst(T_{j}) < numInst(T_{i}))
            numInst(T_{j}) = numInst(T_{i})
In this situation the term T_{ k }is viewed as providing the context that instances of T_{ i }are part of.
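The three direct cases can be gathered into one function. The sketch below assumes that instance counts are held in a dictionary `num_inst` (terms absent from it count as zero) and that the case label and the intermediate term T_{k} have already been determined; the function and parameter names are hypothetical.

```python
def raise_count(num_inst, term, floor_term):
    """Ensure numInst(term) >= numInst(floor_term)."""
    floor = num_inst.get(floor_term, 0)
    if num_inst.get(term, 0) < floor:
        num_inst[term] = floor


def apply_direct_case(case, ti, tj, num_inst, incl_terms, excl_terms, tk=None):
    """Apply the IS, PART or MIXED rule for a directly related pair,
    where T_i is the descendant and T_j the ancestor."""
    if case not in ("IS", "PART", "MIXED"):
        raise ValueError("unknown direct case: %r" % (case,))
    if case == "MIXED" and tk is None:
        raise ValueError("the MIXED case needs the intermediate term T_k")
    incl_terms.add(ti)   # instances of T_i describe the annotation
    excl_terms.add(tj)   # T_j is redundant or contextual
    if case == "MIXED":
        excl_terms.add(tk)
        raise_count(num_inst, tk, ti)
    if case in ("PART", "MIXED"):
        raise_count(num_inst, tj, ti)


num_inst = {"Ti": 5, "Tk": 2}          # hypothetical counts; Tj has no instances
incl_terms, excl_terms = set(), set()
apply_direct_case("MIXED", "Ti", "Tj", num_inst, incl_terms, excl_terms, tk="Tk")
print(num_inst, sorted(incl_terms), sorted(excl_terms))
```

No count adjustment is made in the IS case, since taxonomically related terms already satisfy their constraint.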
Indirect Homogeneous Cases
In the indirect homogeneous cases compared terms T_{i} and T_{j} are indirectly related via a common ancestor T_{lca} along homogeneous paths. The first such case is where T_{i} IS T_{lca} and T_{j} IS T_{lca}. In this situation the number of instances associated with T_{lca} provides a measure of similarity between T_{i} and T_{j}:
4   if (T_{i} IS T_{lca} & T_{j} IS T_{lca})
        numInst(T_{i}), numInst(T_{j}) ← min(numInst(T_{i}), numInst(T_{j}))
        inclTerms ← inclTerms ∪ T_{j} ∪ T_{i}
        exclTerms ← exclTerms ∪ T_{lca}
In the case where T_{ i } PART T_{ lca } and T_{ j } PART T_{ lca }, T_{ lca } provides the context in which instances of T_{ i } and T_{ j } are embedded.
5 if (T_{ i } PART T_{ lca } & T_{ j } PART T_{ lca })
    numInst(T_{ i }), numInst(T_{ j }) ← min(numInst(T_{ i }), numInst(T_{ j }))
    inclTerms ← inclTerms ∪ T_{ j } ∪ T_{ i }
    exclTerms ← exclTerms ∪ T_{ lca }
    if (numInst(T_{ lca }) < numInst(T_{ i }))
        numInst(T_{ lca }) = numInst(T_{ i })
Since terms from two different lines of taxonomic inheritance are being compared and the set of instances associated with each term is incomplete an adjustment of the number of instances associated with each term is necessary.
The final homogeneous indirect case occurs when T_{ i }PART T_{ lca }and T_{ j }IS T_{ lca }. This is equivalent to T_{ i }PART T_{ j }since if T_{ i }is a part of T_{ lca }and T_{ j }is a kind of T_{ lca }then T_{ i }is a part of T_{ j }.
6 else if (T_{ i } PART T_{ lca } & T_{ j } IS T_{ lca })
    inclTerms ← inclTerms ∪ T_{ i }
    exclTerms ← exclTerms ∪ T_{ j }
    exclTerms ← exclTerms ∪ T_{ lca }
    if (numInst(T_{ j }) < numInst(T_{ i }))
        numInst(T_{ j }) = numInst(T_{ i })
    if (numInst(T_{ lca }) < numInst(T_{ i }))
        numInst(T_{ lca }) = numInst(T_{ i })
As with the other cases, the number of instances associated with each term is adjusted to ensure that the partial order constraint associated with the case is satisfied.
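The three indirect homogeneous cases can be sketched in the same spirit. The function name `apply_indirect_homogeneous`, its argument order and the relation labels are assumptions made for illustration; `num_inst`, `incl` and `excl` stand in for numInst, inclTerms and exclTerms.

```python
def apply_indirect_homogeneous(num_inst, incl, excl, rel_i, rel_j, t_i, t_j, t_lca):
    """Cases 4-6: t_i and t_j reach t_lca along IS-only or PART-only paths."""
    if rel_i == rel_j and rel_i in ("IS", "PART"):        # cases 4 and 5
        shared = min(num_inst[t_i], num_inst[t_j])
        num_inst[t_i] = num_inst[t_j] = shared
        incl.update({t_i, t_j})
        excl.add(t_lca)
        # Only the PART case (case 5) raises the count of the common ancestor.
        if rel_i == "PART" and num_inst[t_lca] < shared:
            num_inst[t_lca] = shared
    elif (rel_i, rel_j) == ("PART", "IS"):                # case 6
        incl.add(t_i)
        excl.update({t_j, t_lca})
        for context in (t_j, t_lca):
            if num_inst[context] < num_inst[t_i]:
                num_inst[context] = num_inst[t_i]
    else:
        raise ValueError(f"not a homogeneous case: {rel_i}, {rel_j}")


num_inst = {"a": 30, "b": 18, "lca": 10}
incl, excl = set(), set()
apply_indirect_homogeneous(num_inst, incl, excl, "PART", "PART", "a", "b", "lca")
print(num_inst)    # {'a': 18, 'b': 18, 'lca': 18}
```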
Indirect Inhomogeneous Cases
In these cases one or both paths from T_{ lca }to terms T_{ i }and T_{ j }contain inhomogeneous types of relations. Throughout this section the term T_{ k }is a term in the ontology such that T_{ m }MIXED T_{ k }and T_{ k }IS T_{ n }if T_{ n }is an ancestor of T_{ m }in the ontology.
The first such case occurs where, for two indirectly related terms being compared, T_{ i } and T_{ j }, there exists a MIXED path from T_{ i } to T_{ lca } via T_{ k } and an IS path from T_{ j } to T_{ lca }.
7 if (T_{ i } MIXED T_{ lca } & T_{ j } IS T_{ lca })
    inclTerms ← inclTerms ∪ T_{ i }
    exclTerms ← exclTerms ∪ T_{ lca }
    if (numInst(T_{ k }) < numInst(T_{ i }))
        numInst(T_{ k }) = numInst(T_{ i })
    if (numInst(T_{ lca }) < numInst(T_{ k }))
        numInst(T_{ lca }) = numInst(T_{ k })
Since the relationship between T_{ i }and T_{ j }cannot be refined further than their relationship via T_{ lca }only T_{ lca }is assigned to exclTerms.
The second case occurs when T_{ i }MIXED T_{ lca }via T_{ k }and T_{ j }PART T_{ lca }. Since T_{ j }is part of T_{ lca }and T_{ i }is part of T_{ k }which is a kind of T_{ lca }then T_{ j }is a part of T_{ k }.
8 if (T_{ i } MIXED T_{ lca } & T_{ j } PART T_{ lca })
    inclTerms ← inclTerms ∪ T_{ i }
    inclTerms ← inclTerms ∪ T_{ j }
    exclTerms ← exclTerms ∪ T_{ k }
    exclTerms ← exclTerms ∪ T_{ lca }
    if (numInst(T_{ k }) < numInst(T_{ i }))
        numInst(T_{ k }) = numInst(T_{ i })
    if (numInst(T_{ k }) < numInst(T_{ j }))
        numInst(T_{ k }) = numInst(T_{ j })
    if (numInst(T_{ lca }) < numInst(T_{ k }))
        numInst(T_{ lca }) = numInst(T_{ k })
The final case occurs when both terms T_{ i }and T_{ j }are MIXED related to T_{ lca }via T_{ k }and T_{ m }respectively. What is common between both terms T_{ i }and T_{ j }is that they are both part of T_{ lca }. The number of instances associated with each term is adjusted to satisfy the partial order constraints associated with this case.
9 if (T_{ i } MIXED T_{ lca } & T_{ j } MIXED T_{ lca })
    inclTerms ← inclTerms ∪ T_{ i }
    inclTerms ← inclTerms ∪ T_{ j }
    exclTerms ← exclTerms ∪ T_{ lca }
    if (numInst(T_{ k }) < numInst(T_{ i }))
        numInst(T_{ k }) = numInst(T_{ i })
    if (numInst(T_{ m }) < numInst(T_{ j }))
        numInst(T_{ m }) = numInst(T_{ j })
    if (numInst(T_{ lca }) < numInst(T_{ k }))
        numInst(T_{ lca }) = numInst(T_{ k })
    if (numInst(T_{ lca }) < numInst(T_{ m }))
        numInst(T_{ lca }) = numInst(T_{ m })
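The count adjustments in the inhomogeneous cases all have the same shape: along each path from a compared term up to T_{ lca }, every term must end up with at least as many instances as the term below it. A possible way to express this, assuming a hypothetical helper `enforce_along_path` and a dictionary of counts, is:

```python
def enforce_along_path(num_inst, path):
    """Make counts non-decreasing from the most specific term to the ancestor.

    `path` lists terms from specific to general, e.g. [t_i, t_k, t_lca].
    """
    for lower, upper in zip(path, path[1:]):
        if num_inst[upper] < num_inst[lower]:
            num_inst[upper] = num_inst[lower]


# Counts adjusted as in the final case: T_i via T_k, and T_j via T_m.
num_inst = {"i": 9, "k": 4, "j": 15, "m": 20, "lca": 12}
enforce_along_path(num_inst, ["i", "k", "lca"])
enforce_along_path(num_inst, ["j", "m", "lca"])
print(num_inst)    # {'i': 9, 'k': 9, 'j': 15, 'm': 20, 'lca': 20}
```

Because each step only raises a count to the value below it, calling the helper once per path gives the same final counts as the sequence of `if` tests in the pseudocode, whatever order the paths are processed in.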
After all terms have been compared with each other it is necessary to remove any terms from inclTerms that are found in exclTerms. This can occur when one comparison assigns a term to inclTerms while another comparison identifies the term as belonging to the excluded set. After all terms are compared each term in inclTerms should have the same number of instances associated with it. The number of instances that are associated with an annotation G is equal to the minimum number of instances that can be associated with any of the terms in inclTerms ∩ G.
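This clean-up step can be sketched as follows. The function name `finalize` and the toy counts are hypothetical; the sketch simply removes excluded terms from the included set and takes the minimum count over the terms that remain in the annotation.

```python
def finalize(incl_terms, excl_terms, num_inst, annotation):
    """Drop excluded terms, then count the instances of an annotation.

    Returns the cleaned set of included terms and the minimum instance
    count over the included terms that belong to `annotation`, or None
    if no included term belongs to it.
    """
    kept = incl_terms - excl_terms
    candidates = kept & annotation
    if not candidates:
        return kept, None
    return kept, min(num_inst[t] for t in candidates)


incl = {"a", "b", "c"}
excl = {"c", "root"}
counts = {"a": 19, "b": 40, "c": 300, "root": 5554}
kept, n = finalize(incl, excl, counts, annotation={"a", "b", "c"})
print(sorted(kept), n)    # ['a', 'b'] 19
```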
Finding the Nearest Common Annotation
Just as in semantic similarity of terms, where there is a common ancestor between two terms, there exists a nearest common annotation between two annotations. The concept of a nearest common annotation allows the extension of information based semantic similarity measures of terms, such as Resnik's and Lin's measures, to information based measures of semantic similarity of annotations.
We define the nearest common annotation (NCA) between two annotations G_{1} and G_{2} to be the annotation containing terms related to both annotations. The NCA should have the minimum possible number of instances associated with it such that either G_{1} or G_{2} can be derived from it. The set of terms exclTerms which results from applying SSA to two annotations G_{1} and G_{2} will return the set of terms associated with the NCA.
Measuring Similarity
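Combining the NCA with Resnik's measure gives the similarity of two annotations G_{1} and G_{2}:
$SSA_{Resnik}(G_{1}, G_{2}) = -\log\left(\frac{numInst(NCA(G_{1}, G_{2}))}{maxNumInst}\right)$
where maxNumInst is the number of distinct instances in the corpus.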
In this case the SSA algorithm is used to find the non redundant terms that can be associated with an annotation.
Example
Table 2: Example Annotations and Their Descriptions
Gene  Term        Description
AAH1  GO:0000034  adenine deaminase activity
      GO:0004000  adenosine deaminase activity
      GO:0005634  nucleus
      GO:0005737  cytoplasm
      GO:0006146  adenine catabolic process
      GO:0009117  nucleotide metabolic process
      GO:0009168  purine ribonucleoside monophosphate biosynthetic process
      GO:0016787  hydrolase activity
      GO:0019239  deaminase activity
      GO:0042254  ribosome biogenesis and assembly
      GO:0043101  purine salvage
      GO:0043103  hypoxanthine salvage
FUR1  GO:0004845  uracil phosphoribosyltransferase activity
      GO:0005622  intracellular
      GO:0008655  pyrimidine salvage
      GO:0009116  nucleoside metabolic process
      GO:0016740  transferase activity
      GO:0016757  transferase activity, transferring glycosyl groups
FUR1's annotation consists of six terms: {GO:0004845, GO:0005622, GO:0008655, GO:0009116, GO:0016740, GO:0016757}. Each term's description is found in Table 2. Likewise, AAH1's annotation consists of twelve terms: {GO:0000034, GO:0004000, GO:0005634, GO:0005737, GO:0006146, GO:0009117, GO:0009168, GO:0016787, GO:0019239, GO:0042254, GO:0043101, GO:0043103}. The NCA is constructed by applying the SSA algorithm to identify the set of contextual terms common to both annotations. Terms such as the root term 'all' are immediately added to exclTerms. The term 'cellular component' (GO:0005575) is added to exclTerms since another term, 'cell part', is related to it by an is_a relation. The term 'nucleobase metabolic process' (GO:0009112) is a more specific type of 'nucleobase, nucleoside and nucleotide metabolic process' (GO:0055086), so the two terms are added to inclTerms and exclTerms respectively. Similar assignments occur for 'nucleobase metabolic process' (GO:0009112)/'cellular metabolic process' (GO:0044237), 'nucleobase metabolic process' (GO:0009112)/'cellular process' (GO:0009987) as well as other terms.
The SSA algorithm returns the root term 'all' and nine contextual GO terms: {'all' (all), 'cellular process' (GO:0009987), 'cellular metabolic process' (GO:0044237), 'nucleobase metabolic process' (GO:0009112), 'nucleobase, nucleoside, nucleotide and nucleic acid metabolic process' (GO:0006139), 'nucleobase, nucleoside and nucleotide metabolic process' (GO:0055086), 'cell part' (GO:0044464), 'intracellular' (GO:0005622), 'catalytic activity' (GO:0003824), 'metabolic compound salvage' (GO:0043094)}. The resulting annotation contains terms from all three ontologies in the GO. There are 19 instances associated with the annotation. The number of instances is determined by the most specific term: 'metabolic compound salvage' (GO:0043094). The total number of instances in the corpus is 5554, so $SSA_{Resnik} = -\log\left(\frac{19}{5554}\right) \approx 5.678$. Since the highest value that SSA_{ Resnik } could return for the chosen corpus is ~8.622, obtained by taking the negative natural log of $\frac{1}{5554}$, a value of 5.678 corresponds to a high degree of similarity.
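The arithmetic of this example can be checked with a few lines of Python. The function name `ssa_resnik` is a hypothetical label, and the sketch assumes the usual Resnik form, the negative natural log of the fraction of corpus instances associated with the NCA.

```python
import math


def ssa_resnik(num_inst_nca, max_num_inst):
    """Negative natural log of the share of corpus instances under the NCA."""
    return -math.log(num_inst_nca / max_num_inst)


score = ssa_resnik(19, 5554)     # the FUR1/AAH1 example
upper = ssa_resnik(1, 5554)      # an NCA with a single instance
print(f"{score:.3f} {upper:.3f}")    # 5.678 8.622
```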
Results
To validate our approach the discriminatory power of our method to identify clusters of related gene products was compared against Wang's measure of annotation similarity that also exploits the differences between types of relations. The average similarity of gene products found in the same biochemical pathway in the SGD database was compared against the average similarity of the same gene products compared with gene products found in other pathways. A large difference between these two values indicates the effectiveness of a similarity measure in discovering new pathways in a set of gene products. Average similarity of annotations inside and outside pathways was measured under four conditions: all terms; cellular component terms only; biological process terms only; and molecular function terms only.
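The inside/outside comparison can be sketched as below. The function `pathway_contrast` and the toy data are hypothetical, and the Jaccard overlap used as `sim` is only a stand-in for an annotation similarity measure such as SSA_{ Resnik }; it is not one of the measures evaluated here.

```python
from itertools import combinations
from statistics import mean


def pathway_contrast(pathway, others, sim):
    """Average similarity within a pathway and against products outside it.

    Needs at least two gene products in `pathway` and one in `others`.
    """
    inside = mean(sim(a, b) for a, b in combinations(pathway, 2))
    outside = mean(sim(a, b) for a in pathway for b in others)
    return inside, outside


# Toy annotations: gene product -> set of term identifiers (made up).
annotations = {"g1": {"t1", "t2"}, "g2": {"t1", "t2", "t3"}, "g3": {"t9"}}


def jaccard(a, b):
    """Stand-in similarity: overlap of the two term sets."""
    x, y = annotations[a], annotations[b]
    return len(x & y) / len(x | y)


inside, outside = pathway_contrast(["g1", "g2"], ["g3"], jaccard)
print(inside > outside)    # True
```

A large gap between the two returned values is what the comparison in this section looks for.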
A better test would be to take the average similarity of a set of gene products found in the same pathway and find the average or max of the average similarities of all other similarly sized sets of gene products. Of course this is intractable: the computational complexity of such a test is O(n!), since there are $\binom{N}{n}$ ways of creating a set of size n from a set of N elements.
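To get a feel for how quickly the number of candidate sets grows, the binomial coefficient can be evaluated directly. The function name is hypothetical, and using the corpus size of 5554 as N with a set size of 10 is purely illustrative.

```python
import math


def count_candidate_sets(total, size):
    """Number of distinct sets of `size` elements drawn from `total` elements."""
    return math.comb(total, size)


print(count_candidate_sets(10, 3))                   # 120
print(count_candidate_sets(5554, 10) > 10 ** 30)     # True
```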
As shown in figures 4, 5, 6, when only terms from the cellular component subontology are used the difference between SSA_{ Resnik } and Max_{ Resnik } becomes clear. Max_{ Resnik } returns a very high average similarity value between terms inside and outside a pathway. This may be an artifact of the low number of instances associated with cellular component terms. However, when SSA is applied the average similarity values between annotations inside and outside pathways remain consistently low. SSA_{ Resnik } returns a comparatively high average similarity value for annotations inside pathways for approximately half the cases to which it can reasonably be applied. Wang's method behaves similarly to Max_{ Resnik } in this situation.
As shown in figures 7, 8, 9, if only biological process terms are used, further dissimilarity between Max_{ Resnik } and SSA_{ Resnik } can be observed. The average similarity of annotations inside a pathway with annotations outside a pathway is much higher for Max_{ Resnik } than for SSA_{ Resnik }. Wang's method and SSA_{ Resnik } behave similarly. For all methods, similarity values of annotations inside a pathway remain consistently higher than when the same annotations are compared with annotations outside the pathway.
The source of the similarity between SSA_{ Resnik } and Max_{ Resnik } can be identified when only molecular function terms are used, as shown in figures 10 and 11. In this case both methods behave exactly the same since there are no part_of relations to exploit when comparing terms. Wang's method, shown in figure 12, returns a consistently high average similarity value for annotations inside a pathway compared with annotations outside a pathway.
Further discriminatory power can be achieved by considering the standard deviation of similarity values inside and outside a pathway. A set of gene products paired with other gene products in a pathway tends to have a high standard deviation of similarity values over all pairs, mainly due to the small number of pairs being compared. Conversely, pairing gene products inside a pathway with those found outside the pathway should produce a set of similarity values with a lower standard deviation, since annotations are expected to be dissimilar and values come from a larger set.
Figures 13, 14, 15 show the standard deviation of similarity values of annotations consisting of cellular component terms inside pathways. Max_{ Resnik } returns a low internal standard deviation while reporting a consistently high standard deviation of similarity values when annotations inside a pathway are compared with annotations outside a pathway. The standard deviations of annotation similarity values between different pathways returned by SSA_{ Resnik } and Wang's method are both consistently low. For annotations consisting only of biological process terms, the standard deviations of all methods behave similarly as the average similarity of annotations within pathways increases, as shown in figures 16, 17, 18. The same is also true of annotations consisting of molecular function terms, as shown in figures 19, 20, 21.
Discussion and conclusion
The SSA algorithm provides the basis of a framework for extending instance based measures of term similarity to annotations. The algorithm's construction is based on the set of cases for how terms are related to each other when the ontology consists only of is_a and part_of relations. Due to the incomplete nature of the set of instances associated with a term, it is necessary to adjust the number of instances associated with a term in order to satisfy the partial order constraints of each case fully. As the number of annotations of gene products increases and ontological terms are applied more consistently, it may be possible to satisfy the constraints without such adjustment. Alternatively, the partial order constraints can be used to develop a similarity method which is less dependent on the set of instances associated with terms.
When terms from all three subontologies (CC, BP and MF) are used, Max_{ Resnik } and SSA_{ Resnik } return equivalent annotation similarities on proteins found in the SGD database. This is due to the high degree of specificity of molecular function terms, which are not related partonomically and so cause the two measures to return the same values. When only cellular component and biological process terms are used, the experimental evidence indicates that SSA_{ Resnik } becomes a better identifier of proteins belonging to pathways. SSA_{ Resnik } may identify new gene products that belong to pathways but have a different molecular function to those proteins already identified as belonging to the pathway. Molecular function terms only play a small role in identifying new pathway proteins since proteins tend to have different molecular functions inside pathways.
By finding the set of instances that can be associated with an annotation it is possible to preserve, at the annotation level, the properties of instance based methods used to measure the similarity of terms. For two given annotations, the nearest common annotation (NCA) is a minimal set of terms such that either annotation could be derived from it. The SSA algorithm provides a method for finding the set of terms associated with the NCA.
By combining the SSA algorithm with Resnik's measure and the concept of nearest common annotation we have developed a measure that provides good discriminatory power to identify possible pathways and other functional groups from gene product annotations. More generally, the set of cases and their associated constraints further extend the set of principles that a reasonable measure of annotation similarity should be built on.
Declarations
Acknowledgements
This work has been supported by Microsoft Research Cambridge and the Irish Research Council for Science, Engineering and Technology.
References
Lord P, Stevens R, Brass A, Goble CA: Semantic Similarity Measures as Tools for Exploring the Gene Ontology. Pacific Symposium on Biocomputing 2003, 8: 601–612.
Lord P, Stevens R, Brass A, Goble C: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 2003, 19(10):1275–1283.
Sevilla J, Segura V, Podhorski A, Guruceaga JE Mato, Martinez-Cruz L, Corrales F, Rubio A: Correlation between gene expression and GO semantic similarity. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2005, 2(4):330–338.
Couto FM, Silva MJ, Coutinho PM: Measuring semantic similarity between Gene Ontology terms, Data and Knowledge Engineering. Business Process Management – Where business processes and web services meet 2007, 61: 137–152.
Lei Z, Dai Y: Assessing protein similarity with Gene Ontology and its use in subnuclear localization prediction. BMC Bioinformatics 2006, 7: 491.
Goodman N: Seven strictures on similarity. In Problems and Projects. Edited by: Goodman N. New York: Bobbs-Merrill; 1972:437–447.
Arrell D: What Goodman Should Have Said about Representation. The Journal of Aesthetics and Art Criticism Autumn 1987, 46: 41–49.
Tversky A: Features of Similarity. Psychological Rev 1977, 84: 327–352.
Lin D: An Information-Theoretic Definition of Similarity. In Fifteenth International Conference on Machine Learning (ICML'98). Madison, WI: Morgan Kaufmann; 1998.
Popescu M, Keller J, Mitchell J: Fuzzy Measures on the Gene Ontology for Gene Product Similarity. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2006, 3(3):263–274.
Cross V: Tversky's Parameterized Similarity Ratio Model: A Basis for Semantic Relatedness. Fuzzy Information Processing Society, 2006. NAFIPS 2006. Annual meeting of the North American 541–546. 3–6 June 2006.
Torsello A, Hidovic D, Pelillo M: Four Metrics for Efficiently Comparing Attributed Trees. Proc of 17th International Conference on Pattern Recognition 2004, 2: 467–470.
Guo X, Liu R, Shriver CD, Hu H, Liebman MN: Assessing semantic similarity measures for the characterization of human regulatory pathways. Bioinformatics 2006, 22(8):967–973.
Wang JZZ, Du Z, Payattakool R, Yu PSS, Chen CFF: A New Method to Measure the Semantic Similarity of GO Terms. Bioinformatics 2007.
Pesquita C, Faria D, Bastos H, Falcao A, Couto F: Evaluating GO-based Semantic Similarity Measures. BioOntologies SIG at ISMB/ECCB – 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB) 2007.
Veale N, Seco JHT: An Intrinsic Information Content Metric for Semantic Similarity in WordNet. ECAI 2004 2004, 1089–1090.
Schlicker A, Albrecht M: FunSimMat: a comprehensive functional similarity database. Nucl Acids Res 2007. gkm806+
Rada R, Mili H, Bicknell E, Bletner M: Development and Application of a Metric on Semantic Nets. IEEE Transactions on Systems, Man, and Cybernetics 1989, 19: 17–30.
Lee JH, Kim MH, Lee YJ: Information Retrieval Based on Conceptual Distance in ISA Hierarchies. Journal of Documentation 1993, 49: 188–207.
Wu Z, Palmer M: Verb semantics and lexical selection. In 32nd. Annual Meeting of the Association for Computational Linguistics. New Mexico State University, Las Cruces, New Mexico; 1994:133–138.
Resnik P: Using Information Content to Evaluate Semantic Similarity in a Taxonomy. Proceedings of IJCAI-95 1995.
Resnik P: Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research 1999, 11: 95–130.
Jiang J, Conrath D: Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. Proc Int'l Conf Research in Computational Linguistics, ROCLING X 1997.
Simons P: Parts: a study in ontology. Oxford: Clarendon Press; 1987.
Gene Ontology Consortium: GO Editorial Style Guide. 2004. [http://www.geneontology.org/GO.usage.html]
Bittner T: Axioms for parthood and containment relations in bio-ontologies. Unknown 2004.
Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET, Jia Y, Juvik G, Roe T, Schroeder M, Weng S, Botstein D: SGD: Saccharomyces Genome Database. Nucleic Acids Res 1998, 26: 73–79.
Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Research 2004, (32 Database):D262–D266.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.