A relation based measure of semantic similarity for Gene Ontology annotations

Background Various measures of semantic similarity of terms in bio-ontologies such as the Gene Ontology (GO) have been used to compare gene products. Such measures of similarity have been used to annotate uncharacterized gene products and group gene products into functional groups. There are various ways to measure semantic similarity, either using the topological structure of the ontology, the instances (gene products) associated with terms or a mixture of both. We focus on an instance level definition of semantic similarity while using the information contained in the ontology, both in the graphical structure of the ontology and the semantics of relations between terms, to provide constraints on our instance level description. Semantic similarity of terms is extended to annotations by various approaches, either through aggregation operations such as min, max and average or through an extrapolative method. These approaches introduce assumptions about how semantic similarity of terms relates to the semantic similarity of annotations that do not necessarily reflect how terms relate to each other.

Results We exploit the semantics of relations in the GO to construct an algorithm called SSA that provides the basis of a framework that naturally extends instance based methods of semantic similarity of terms, such as Resnik's measure, to describing annotations and not just terms. Our measure attempts to correctly interpret how terms combine via their relationships in the ontological hierarchy. SSA uses these relationships to identify the most specific common ancestors between terms. We outline the set of cases in which terms can combine and associate partial order constraints with each case that order the specificity of terms. These cases form the basis for the SSA algorithm. The set of associated constraints also provides a set of principles that any improvement on our method should seek to satisfy.
Conclusion We derive a measure of semantic similarity between annotations that exploits all available information without introducing assumptions about the nature of the ontology or data. We preserve the principles underlying instance based methods of semantic similarity of terms at the annotation level. As a result our measure better describes the information contained in annotations associated with gene products and is therefore better suited to characterizing and classifying gene products through their annotations.


Background
Although the semantic similarity between two GO terms has been extensively investigated [1][2][3][4], how to define similarity between two gene products based on GO annotations for a specific application remains unclear [5]. To date annotation similarity has been computed by four general approaches: the set-based approach; the graph-based approach; the vector-based approach; and the term-based approach. In the set-based approach an annotation is viewed as a 'bag of words'. Two annotations are similar if there is a large overlap between their sets of terms. A graph-based approach views similarity as a graph-matching procedure. Vector-based methods embed annotations in a vector space where each possible term in the ontology forms a dimension. Term-based approaches compute similarity between individual terms and then combine these similarities to produce a measure of annotation similarity.
None of the above approaches considers the semantics of relationships between terms. How terms are related can significantly alter how an annotation, which is a set of terms, is interpreted. In the GO there are two main types of relations: is_a and part_of. The is_a relation represents a taxonomic relationship between terms that can be modeled using the improper subset relation, which is a partial ordering of terms. The part_of relation represents a partonomic relationship between terms that can also be modeled in terms of a partial order. Though the partial orders represented by taxonomies and partonomies are well understood, little attention has been given to how these two partial orderings combine. Using the various cases identified by combining taxonomies and partonomies we construct an algorithm called SSA (Semantic Similarity of Annotations) that identifies the terms that can be associated with an annotation and terms that relate to both annotations. Instances associated with these terms are then used to construct a Resnik-like measure of annotation similarity, thus extending the underlying intuitions behind this term-based measure to the annotation level.
A measure of term or annotation similarity should be based on a set of principles that form the basis for what is considered similar. The nature of similarity has been the focus of intense research in the areas of aesthetics [6,7] and psychology [8]. In mathematics properties such as identity, symmetry and the triangle inequality have been used to form the basis of measures of similarity of mathematical objects. Principles of term and annotation similarity have been suggested by various authors. This work intends to build on these principles and introduce additional principles that a measure of similarity should seek to satisfy.
Similarity between objects is normally expressed as a number that ranges along an interval on the real numbers ℝ. However the main purpose of similarity is usually to determine whether two or more objects are similar to a reference object. For this reason a measure of similarity can be viewed as a partial order on a set of objects; the actual numbers play only a secondary role. For example, we may say that an object X is more similar to Z than another object Y. Formally this is expressed as sim(X, Z) > sim(Y, Z).
In the study of ontological similarity Lin [9] develops the principles of commonality and difference when constructing a measure of term similarity. The greater the commonality between objects the greater the similarity. Likewise, the greater the difference between objects the greater the dissimilarity. The source of both the commonality and difference between terms depends on the method chosen to measure the descriptiveness of terms. Different sources of descriptiveness may result in different orderings of similarity between terms or annotations.
Popescu et al. [10] recognize that an important property of term similarity is that two different terms should have a non-zero similarity value if the terms are related. They also recognize that an important property of annotation similarity is that the descriptiveness of annotations should be greater than or equal to the descriptiveness of its constituent terms. In this paper this property is called the monotonicity property.
In defining a measure of similarity a set of relevant properties that objects can be compared along are identified. In ontological similarity, whether of terms or annotations, there are two main sources of similarity: the conceptual or structural level; and the instance level. At the structural level we may consider such properties as graph distance, graph similarity, relation types, common ancestors, etc. At the instance level we consider the set of instances associated with a term or annotation. Our measure of ontological similarity combines aspects from both levels. Here we survey how various measures of annotation similarity combine these properties in various ways to form the basis for a measure of descriptiveness of a term or annotation.

Set-Based Approaches
Set based methods for measuring the similarity of annotations are based on the Tversky ratio model of similarity [8,11] which is a general model of distance between sets of terms. It is represented by the formula

$$S(G_1, G_2) = \frac{f(G_1 \cap G_2)}{f(G_1 \cap G_2) + \alpha f(G_1 - G_2) + \beta f(G_2 - G_1)}$$

where $G_1$ and $G_2$ are sets of terms or annotations from the same ontology and $f$ is an additive function on sets (usually set cardinality). For $\alpha = \beta = 1$ we get the Jaccard distance between sets:

$$S_{Jaccard}(G_1, G_2) = \frac{|G_1 \cap G_2|}{|G_1 \cup G_2|}$$

and for $\alpha = \beta = 1/2$ we get the Dice distance between sets [11]:

$$S_{Dice}(G_1, G_2) = \frac{2\,|G_1 \cap G_2|}{|G_1| + |G_2|}$$

In this situation the source of descriptiveness of an annotation is its set of terms. Each term and its set of associated instances is considered independent of other terms. The commonality and difference between annotations are modeled as set intersection and set difference of sets of terms respectively. Set-based approaches return a similarity of zero if the annotations do not share common terms, ignoring the fact that terms may be closely related. Because of the atomic nature of terms in the set-based approach the monotonicity property does not apply.
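The ratio model can be illustrated with a few lines of Python. This is only a minimal sketch: it assumes the standard form of the Tversky ratio, f(A ∩ B) / (f(A ∩ B) + α f(A − B) + β f(B − A)), takes f to be set cardinality, and uses hypothetical term labels T1 to T5.

```python
def tversky(g1, g2, alpha, beta, f=len):
    """Tversky ratio model for two sets of terms; f is an additive set function."""
    g1, g2 = set(g1), set(g2)
    common = f(g1 & g2)
    denom = common + alpha * f(g1 - g2) + beta * f(g2 - g1)
    return common / denom if denom else 0.0


def jaccard(g1, g2):
    # alpha = beta = 1 reduces the ratio to |A ∩ B| / |A ∪ B|
    return tversky(g1, g2, 1.0, 1.0)


def dice(g1, g2):
    # alpha = beta = 1/2 reduces the ratio to 2|A ∩ B| / (|A| + |B|)
    return tversky(g1, g2, 0.5, 0.5)


g1 = {"T1", "T2", "T3"}            # hypothetical annotation
g2 = {"T2", "T3", "T4", "T5"}      # hypothetical annotation
print(jaccard(g1, g2))             # 2 / 5 = 0.4
print(dice(g1, g2))                # 4 / 7, about 0.571
print(jaccard({"T1"}, {"T9"}))     # 0.0: no shared terms, however related
```

The last call shows the limitation noted above: two annotations with no term in common score zero even if their terms are neighbours in the ontology.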

Vector-Based Approaches
Vector-based methods embed ontological terms in a vector space by associating each term with a dimension. Usually a vector is binary, consisting of 0's and 1's, where 0 (resp. 1) denotes the absence (resp. presence) of a term (along a particular dimension) in an annotation. This has the advantage that standard clustering techniques on vector spaces such as k-means can be applied to group similar terms. What is required is a means of measuring the size of vectors. This can be achieved by embedding terms in a metric space (usually Euclidean). The most common method of measuring similarity between vectors of terms is the cosine similarity

$$sim_{cos}(G_1, G_2) = \frac{v_1 \cdot v_2}{|v_1|\,|v_2|}$$

where $v_i$ represents a vector of terms constructed from an annotation (group of terms) $G_i$, $|\cdot|$ corresponds to the size of the vector and $\cdot$ corresponds to the dot product between two vectors. The source of descriptiveness, commonality and difference is the same as for set-based approaches.
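As an illustration, the binary embedding and the cosine similarity can be sketched as follows. The vocabulary and the annotations are hypothetical, and the Euclidean norm is assumed for the size of a vector.

```python
import math


def to_vector(annotation, vocabulary):
    """Binary embedding: 1 if the term is present in the annotation, else 0."""
    return [1 if term in annotation else 0 for term in vocabulary]


def cosine(v1, v2):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0


vocabulary = ["T1", "T2", "T3", "T4", "T5"]       # one dimension per term
v1 = to_vector({"T1", "T2", "T3"}, vocabulary)
v2 = to_vector({"T2", "T3", "T4", "T5"}, vocabulary)
print(v1, v2)
print(cosine(v1, v2))   # 2 / (sqrt(3) * 2), about 0.577
```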

Graph-Based Approaches
An ontology is a directed, acyclic graph (DAG) whose edges correspond to relationships between terms. Thus it is natural to compare terms using methods for graph matching and graph similarity. We may consider the similarity between annotations in terms of the sub-graph that connects terms within each annotation. Annotation similarity is then measured in terms of similarity between two graphs. Graph matching has only a weak correlation with similarity between terms. It is also computationally expensive to compute, graph matching being an NP-complete problem on general graphs [12].
The descriptiveness of an annotation is modeled by the set of nodes and edges associated with a subgraph. Commonality between annotations is based on the set intersection while difference is modeled by the set difference where each set consists of the nodes and edges associated with each subgraph. Alternatively, the set of edges may be ignored and only common terms of both graphs are considered [13][14][15].

Improving Similarity Measures by Weighting Terms
Set, vector and graph-based methods for measuring similarity between annotations can be improved by introducing a weighting function into the similarity measure. For example, the weighted Jaccard distance can be formulated as:

$$S_{wJaccard}(G_1, G_2) = \frac{\sum_{T_x \in G_1 \cap G_2} m(T_x)}{\sum_{T_x \in G_1 \cup G_2} m(T_x)}$$

where, as before, $G_1$ and $G_2$ are annotations or sets of terms describing data (e.g. a gene product), $T_x$ is the x-th term from a set of terms and $m(T_x)$ denotes the weight of $T_x$. This weighting function can be used to represent various properties of a term or annotation such as a measure of vagueness, uncertainty, sense of preference or a combination of the above. The vector-based approach may be extended so that values along a particular dimension can lie on the interval [0, 1] or [0, ∞). The graph-based approach can be extended by weighting the edges between terms in the graph.
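A weighted Jaccard measure of this kind might be sketched as below. The form used (sum of weights over the intersection divided by sum of weights over the union) and the weight values are assumptions made for illustration.

```python
def weighted_jaccard(g1, g2, m):
    """Sum of term weights over the intersection divided by that over the union."""
    g1, g2 = set(g1), set(g2)
    union_weight = sum(m[t] for t in g1 | g2)
    if union_weight == 0:
        return 0.0
    return sum(m[t] for t in g1 & g2) / union_weight


# Hypothetical weights: T2 and T3 are informative terms, T1 and T4 are not.
weights = {"T1": 0.2, "T2": 1.5, "T3": 2.0, "T4": 0.3}
g1 = {"T1", "T2", "T3"}
g2 = {"T2", "T3", "T4"}
print(weighted_jaccard(g1, g2, weights))                    # 3.5 / 4.0 = 0.875
print(weighted_jaccard(g1, g2, dict.fromkeys(weights, 1)))  # unweighted: 2 / 4 = 0.5
```

With unit weights the measure falls back to the plain Jaccard ratio, which makes the effect of the weighting easy to see: sharing the two informative terms counts for far more than disagreeing on the two uninformative ones.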
Assigning a weight to each term in an annotation allows for the possibility of introducing the monotonicity property into a similarity measure. Using the monotonicity property, the weight associated with an annotation should be greater than or equal to the weight associated with any of its constituent terms. Weights can form an additional basis on which to measure the descriptiveness of a term or annotation.

Instance-Based Weights
One approach to assigning weight to an ontological term is to measure how informative a term is in describing data. A method of measuring information is to analyze a term's use in a corpus against the general use of ontological terms in the same corpus. Information is measured using the surprisal function:

$$IC_{Corpus}(T_i) = -\log p(T_i) \qquad (1)$$

where $p(T_i)$ corresponds to the probability of a term $T_i$ or its taxonomic descendants occurring in a corpus. For example, consider the case where there are 30 distinct instances in a corpus and 5, 3 and 2 of these instances are annotated by the terms $T_i$, $T_j$ and $T_k$ respectively. If $T_j$ and $T_k$ are sub-types or children of $T_i$ and do not have child terms themselves then $p(T_i) = (5 + 3 + 2)/30 = 1/3$ and $IC_{Corpus}(T_i) = -\log(1/3) \approx 1.099$.
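The worked example can be checked with a short script. It is a sketch under two assumptions that the text does not state explicitly: the taxonomy is a tree (no term has two parents) and the instances annotated by different terms are disjoint, so that counts can simply be added. The natural logarithm is used.

```python
import math

# Taxonomy and direct annotation counts from the example above.
CHILDREN = {"Ti": ["Tj", "Tk"], "Tj": [], "Tk": []}
DIRECT_COUNTS = {"Ti": 5, "Tj": 3, "Tk": 2}
TOTAL_INSTANCES = 30


def subtree_count(term):
    """Instances annotated by the term or by any taxonomic descendant."""
    return DIRECT_COUNTS[term] + sum(subtree_count(c) for c in CHILDREN[term])


def information_content(term):
    """Surprisal -log p(term), with p estimated from corpus counts."""
    return -math.log(subtree_count(term) / TOTAL_INSTANCES)


for term in ("Ti", "Tj", "Tk"):
    print(term, subtree_count(term), round(information_content(term), 3))
# Ti 10 1.099
# Tj 3 2.303
# Tk 2 2.708
```

The parent term is the least informative of the three because it covers the instances of both of its children as well as its own.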

Other Weighting Approaches
Other measures of information can be used that do not necessarily rely on corpus data. One measure [16] relies on the assumption that how the ontology is constructed is semantically meaningful:

$$IC(T_i) = 1 - \frac{\log(desc(T_i) + 1)}{\log(numTerms)}$$

where $desc(T_i)$ returns the number of descendants of term $T_i$ and $numTerms$ refers to the total number of terms in the ontology.

Term-Based Approaches
In term-based approaches the similarity between pairs of terms from each annotation is computed. These similarities are then combined in order to characterize the similarity between annotations as a whole. There are several ways to combine similarities of pairs of terms such as the min, max or average operations. Term-based approaches depend on a function $s(T_i, T_j)$ where $T_i$ and $T_j$ are terms from two annotations $G_1$ and $G_2$ respectively. $s(T_i, T_j)$ provides a measure of distance/similarity between these two terms. Once distances have been measured between all possible pairs of terms they are then aggregated using an operation such as max or the average of all distances. For example:

$$sim_{max}(G_1, G_2) = \max_{T_i \in G_1,\, T_j \in G_2} s(T_i, T_j)$$

More sophisticated term based approaches combine multiple measures of term similarity and aggregate similarity values using more complex functions, for example [17].
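The aggregation step can be sketched independently of the choice of term measure. In the snippet below the term similarity s is a hypothetical lookup table; any of the term measures discussed in this paper could be passed in its place.

```python
from statistics import mean


def annotation_similarity(g1, g2, s, combine=max):
    """Combine the term similarities s(ti, tj) over all cross-annotation pairs."""
    scores = [s(ti, tj) for ti in g1 for tj in g2]
    return combine(scores)


# Hypothetical symmetric term similarities, for illustration only.
TABLE = {
    frozenset(["T1", "T3"]): 0.2,
    frozenset(["T1", "T4"]): 0.6,
    frozenset(["T2", "T3"]): 0.9,
    frozenset(["T2", "T4"]): 0.3,
}


def s(ti, tj):
    return 1.0 if ti == tj else TABLE[frozenset([ti, tj])]


g1, g2 = ["T1", "T2"], ["T3", "T4"]
print(annotation_similarity(g1, g2, s, max))    # 0.9
print(annotation_similarity(g1, g2, s, min))    # 0.2
print(annotation_similarity(g1, g2, s, mean))   # about 0.5
```

The three results differ widely for the same pair of annotations, which illustrates how much the choice of aggregation operator, rather than the ontology, determines the outcome.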

Graphical Measures of Term Similarity
The simplest approach to measuring similarity between ontological terms using the graph structure is to measure the shortest path distance between terms in the graph [18,19]. Referring to figure 1, in terms of graph distance, we may consider the terms 'muscle cell proliferation' and 'fibroblast cell proliferation' (graph distance of 2) as being more similar than the former term with 'fibroblast regulation' (graph distance of 3). However the graph distance has only a weak correlation with similarity of terms. The semantic similarity between 'positive fibroblast regulation' and 'negative fibroblast regulation' is far greater than the similarity between 'muscle cell proliferation' and 'fibroblast cell proliferation' even though both examples have a graph distance of two. A simple graph distance-based measure of similarity does not model in a consistent way any notion of commonality or difference between terms.
A more refined use of graph distance as a basis for a measure of term similarity is found in the Wu-Palmer measure of similarity [20]. It uses the idea that the distance from the root to the lowest common taxonomic ancestor (LCTA) measures the commonality between two terms while the sum of the distances between the LCTA and each term measures the difference between two terms. Combining these aspects results in the formula:

$$sim_{WP}(T_1, T_2) = \frac{2 \cdot dist(T_{lcta}, T_{root})}{dist(T_1, T_{lcta}) + dist(T_2, T_{lcta}) + 2 \cdot dist(T_{lcta}, T_{root})}$$

where $T_1$ and $T_2$ are the two terms being compared and $T_{lcta}$ is the term that corresponds to the lowest common taxonomic ancestor between $T_1$ and $T_2$. $T_{root}$ denotes the root node of the ontology (assuming that the ontology has only one root). $dist(T_i, T_j)$ denotes the graph distance between terms $T_i$ and $T_j$. The $2 \cdot dist(T_{lcta}, T_{root})$ component of the denominator serves to normalize the measure.
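A minimal sketch of the Wu-Palmer measure on a small hypothetical taxonomy follows. It assumes a tree with a single root in which every term has exactly one parent, so the lowest common taxonomic ancestor is unique; the term names A to D are placeholders.

```python
# Hypothetical single-rooted taxonomy: child -> parent.
PARENT = {"root": None, "A": "root", "B": "A", "C": "A", "D": "root"}


def path_to_root(term):
    """The term followed by its successive parents up to the root."""
    path = [term]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def lcta(t1, t2):
    """Lowest common taxonomic ancestor (a term counts as its own ancestor)."""
    seen = set(path_to_root(t1))
    return next(t for t in path_to_root(t2) if t in seen)


def dist(term, ancestor):
    """Number of edges between a term and one of its ancestors."""
    return path_to_root(term).index(ancestor)


def wu_palmer(t1, t2):
    a = lcta(t1, t2)
    commonality = 2 * dist(a, "root")
    difference = dist(t1, a) + dist(t2, a)
    total = commonality + difference
    return commonality / total if total else 1.0


print(wu_palmer("B", "C"))   # LCTA is A: 2 / (2 + 2) = 0.5
print(wu_palmer("B", "D"))   # LCTA is the root: 0.0
print(wu_palmer("B", "B"))   # identical terms: 1.0
```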

Instance-Based Measures of Term Similarity
Similarity may be measured using an instance based measure of semantic similarity as computed by either Resnik (eqn. 2) or Lin (eqn. 3). Resnik [21,22] exploits the informativeness of the lowest common ancestor between terms as a measure of semantic similarity:

$$sim_{Resnik}(T_i, T_j) = IC_{Corpus}(T_{lcta}) \qquad (2)$$

where $T_{lcta}$ denotes the lowest common taxonomic ancestor between ontological terms $T_i$ and $T_j$. This measure only accounts for the commonality between terms.
Another method of measuring similarity, derived by Lin [9], accounts for both the commonality and the difference between terms:

$$sim_{Lin}(T_i, T_j) = \frac{2 \cdot IC_{Corpus}(T_{lcta})}{IC_{Corpus}(T_i) + IC_{Corpus}(T_j)} \qquad (3)$$

[Figure 1: An example of an ontology of GO terms.]

The only structural property that both Resnik and Lin exploit is the lowest common taxonomic ancestor. To overcome this weakness Jiang and Conrath [23] integrate graph distance based measures of similarity into information based approaches. They construct a generalized weighting measure between a child and its immediate parent that accounts for the number of out edges and depth of terms along the shortest path between the compared terms in the ontology. While they acknowledge that other relation types might be relevant to measuring similarity, their measure is based solely on the taxonomic or is_a relations in the ontology.
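Both instance-based measures can be sketched in a few lines. The snippet assumes the usual forms of the two measures (Resnik: information content of the lowest common taxonomic ancestor; Lin: twice that value divided by the summed information content of the two terms) and reuses the three-term corpus example given earlier for the surprisal function, with a tree-shaped taxonomy.

```python
import math

# Three-term example: Tj and Tk are children of Ti; probabilities from a
# corpus of 30 instances (10, 3 and 2 instances at or below each term).
PARENT = {"Ti": None, "Tj": "Ti", "Tk": "Ti"}
P = {"Ti": 10 / 30, "Tj": 3 / 30, "Tk": 2 / 30}


def ic(term):
    return -math.log(P[term])


def ancestors_inclusive(term):
    out = []
    while term is not None:
        out.append(term)
        term = PARENT[term]
    return out


def lcta(a, b):
    seen = set(ancestors_inclusive(a))
    return next(t for t in ancestors_inclusive(b) if t in seen)


def resnik(a, b):
    """Commonality only: information content of the shared ancestor."""
    return ic(lcta(a, b))


def lin(a, b):
    """Commonality normalized by the information content of both terms."""
    denom = ic(a) + ic(b)
    return 2 * ic(lcta(a, b)) / denom if denom else 1.0


print(round(resnik("Tj", "Tk"), 3))   # 1.099
print(round(lin("Tj", "Tk"), 3))      # 0.439
print(lin("Tj", "Tj"))                # 1.0
```

Resnik's value is unbounded and depends only on the ancestor, whereas Lin's value lies in [0, 1] and reaches 1 only when a term is compared with itself.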

New Approaches to Annotation Similarity
Beyond the set, vector, graph and term-based approaches to measuring similarity of annotations exist other methods that introduce the additional properties discussed above such as monotonicity and taking into account the semantics of ontological relations.

Similarity Based on Fuzzy Measures
The monotonicity property leads naturally to the use of fuzzy measures as a basis for measuring the descriptiveness of an annotation. Using the information content measure of terms described in eqn. 1 as the basis for measuring similarity a fuzzy measure is constructed. A fuzzy measure is a weighting on sets of terms such that the weight associated with a set of terms is greater than or equal to the weight associated with any of its subsets.
Popescu et al. [10] use fuzzy measures to induce a weighting $m$ for an annotation from its constituent terms. This weight is extrapolated from the weights of individual terms by using the formula for constructing a Sugeno λ-fuzzy measure:

$$m(A \cup B) = m(A) + m(B) + \lambda\, m(A)\, m(B)$$

for disjoint sets of terms $A$ and $B$, with $\lambda > -1$. Given that the weights (fuzzy measure densities) $m(T_i)$ for the individual terms $T_i$ in an annotation are known then λ can be determined by solving the following equation:

$$1 + \lambda = \prod_{i=1}^{n} \bigl(1 + \lambda\, m(T_i)\bigr)$$

In [10] the weight for each term is based on the ICCorpus measure (eqn. 1). The similarity of two annotations, represented by sets of terms $G_1$ and $G_2$ from the same ontology, is computed using the similarity function:

$$S(G_1, G_2) = \frac{m_1(G_1 \cap G_2) + m_2(G_1 \cap G_2)}{2}$$

where $m_1$ and $m_2$ are the λ-fuzzy measure functions that characterize $G_1$ and $G_2$ respectively. The relatedness of terms is accounted for by augmenting each annotation with the lowest common ancestors for each pair of terms from each annotation. This ensures a non-zero similarity between annotations containing related terms.
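Solving for λ has no closed form in general, so a numerical root finder is needed. The sketch below is not taken from [10]; it assumes the standard Sugeno condition 1 + λ = ∏(1 + λ mᵢ) with densities strictly between 0 and 1, and finds the non-trivial root by bisection. It is not hardened for density sums extremely close to 1.

```python
import math


def sugeno_lambda(densities, iterations=200):
    """Non-trivial root (lambda > -1) of 1 + lam = prod(1 + lam * d_i)."""
    if len(densities) < 2 or not all(0.0 < d < 1.0 for d in densities):
        raise ValueError("need at least two densities strictly between 0 and 1")
    total = sum(densities)
    if abs(total - 1.0) < 1e-9:
        return 0.0                      # densities already additive

    def h(lam):
        return math.prod(1 + lam * d for d in densities) - (1 + lam)

    if total > 1.0:
        lo, hi = -1.0 + 1e-9, -1e-9     # root lies in (-1, 0)
    else:
        lo, hi = 1e-9, 1.0              # root lies in (0, infinity)
        while h(hi) < 0:
            hi *= 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if (h(mid) > 0) == (h(lo) > 0):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def measure(densities, lam):
    """Lambda-fuzzy measure of the set of terms carrying the given densities."""
    if lam == 0.0:
        return sum(densities)
    return (math.prod(1 + lam * d for d in densities) - 1) / lam


lam = sugeno_lambda([0.2, 0.3])
print(round(lam, 4))                       # 8.3333
print(round(measure([0.2, 0.3], lam), 6))  # 1.0: the whole annotation has weight 1
print(round(measure([0.2], lam), 6))       # 0.2: a single term keeps its density
```

The measure is monotone by construction: with λ > −1 the weight of a set of terms is never smaller than the weight of any of its subsets.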
However, an ontology models other aspects of relatedness that should be taken into account. Relations between terms in an annotation can be used to identify redundant terms whose relevance to the descriptiveness of an annotation is already accounted for by other terms. For example, if two terms in an annotation are taxonomically related the existence of the parent term is implied by the existence of the child term.
If redundancy of terms is not taken into account it may lead to too many or too few instances being associated with the term. This is especially true when a term is part_of another term. The instances associated with the annotation consist of the parts and not what the instances are part of.

Exploiting Semantics of Ontological Relations
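Wang et al. [14] account for the different contributions that terms related by is_a and part_of relations make to the descriptiveness of a term. The semantic contribution that ancestor terms make to a child term is calculated by:

$$SV(T_i) = \sum_{T_x \in T_{anc,i}} S_{T_i}(T_x)$$

where $T_{anc,i}$ denotes the ancestors of term $T_i$ (together with $T_i$ itself) and $S_{T_i}(T_x)$ is calculated as

$$S_{T_i}(T_i) = 1, \qquad S_{T_i}(T_x) = \max\{\, w_e \cdot S_{T_i}(T_y) \mid T_y \in childrenOf(T_x) \,\} \ \text{ for } T_x \neq T_i$$

where $w_e \in [0, 1]$ is a number that corresponds to the semantic contribution factor for edge $e$. $childrenOf(T_x)$ is a function that returns the immediate children of $T_x$ that are ancestor terms of $T_i$. In this paper $w_{is\_a} = 0.8$ and $w_{part\_of} = 0.6$. The similarity of two terms is computed by the formula

$$sim(T_i, T_j) = \frac{\sum_{T_x \in T_{anc,i} \cap T_{anc,j}} \bigl(S_{T_i}(T_x) + S_{T_j}(T_x)\bigr)}{SV(T_i) + SV(T_j)}$$

A term-based approach is taken to measuring the similarity between annotations $G_1$ and $G_2$. The similarities of the most similar pairs of terms from each annotation are averaged to calculate the similarity between annotations:

$$sim(G_1, G_2) = \frac{\sum_{T_i \in G_1} sim(T_i, G_2) + \sum_{T_j \in G_2} sim(T_j, G_1)}{|G_1| + |G_2|}$$

where $sim(T_x, G_y) = \max_{T_z \in G_y} sim(T_x, T_z)$ and $|G_y|$ denotes the number of terms in annotation $G_y$.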
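The recursive definition of the semantic contribution is easier to follow on a small example. The sketch below is our reading of the description above rather than the implementation of [14]: the S-value of a term with respect to itself is 1, each ancestor receives the largest product of edge weights along any path up from the term, and the ontology fragment (terms A to D) is hypothetical.

```python
W = {"is_a": 0.8, "part_of": 0.6}   # contribution factors quoted above

# Hypothetical DAG: term -> list of (parent, relation type).
PARENTS = {
    "A": [],
    "B": [("A", "is_a")],
    "C": [("A", "is_a")],
    "D": [("B", "is_a"), ("C", "part_of")],
}


def s_values(term):
    """S-value of the term (1.0) and of each of its ancestors."""
    s = {term: 1.0}
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for parent, relation in PARENTS[child]:
            candidate = W[relation] * s[child]
            if candidate > s.get(parent, 0.0):   # keep the best path only
                s[parent] = candidate
                frontier.append(parent)
    return s


def term_similarity(t1, t2):
    s1, s2 = s_values(t1), s_values(t2)
    shared = set(s1) & set(s2)
    return sum(s1[t] + s2[t] for t in shared) / (sum(s1.values()) + sum(s2.values()))


def best_match_average(g1, g2):
    """Average, over all terms of both annotations, of the best cross match."""
    total = sum(max(term_similarity(t, u) for u in g2) for t in g1)
    total += sum(max(term_similarity(u, t) for t in g1) for u in g2)
    return total / (len(g1) + len(g2))


print(s_values("D"))                         # A is reached best via B: 0.8 * 0.8
print(round(term_similarity("B", "C"), 3))   # 1.6 / 3.6 = 0.444
print(round(best_match_average(["D"], ["B", "C"]), 3))
```

Because the factors are fixed per relation type, two term pairs at the same graph distance through the same relation types always receive the same score, whatever their position in the ontology.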
Wang et al. make the observation that the instance based measures of term similarity will produce varying results based on the corpus chosen. They keep a fixed value for the contribution each relation type makes to the descriptiveness of a term. This does not account for the varying influence of terms on each other throughout the ontology even if the graph distance is the same. Exploiting the corpus statistics, if used appropriately, may account for this drawback. As with all term-based methods, where terms from each annotation are compared in a pairwise fashion, it is difficult to see how the monotonicity property is ensured when measuring the similarities between two annotations.

Methods
The Gene Ontology relates terms using is_a and part_of relations. We develop a measure of informativeness that provides a description of an annotation that takes into consideration the relations between terms. We use the informativeness measure of a term (eqn. 1) as the basis for providing a description of an annotation. We define an algorithm called SSA that combines the instances of terms while taking into account how these sets of instances are related by how their associated terms are related in the ontology. This results in a set of instances that can be said to be associated with an annotation and not just a term. We can then extend the concept of instance based semantic similarity of terms, such as Resnik's measure, to annotations.

Interpreting Annotations from Taxonomies
A taxonomy induces a partial ordering on a set of terms by the improper subset relation ⊆. If T i is_a T k and T j is_a T k then the set of instances associated with both T i and T j are subsets of T k . Assuming that we know of all possible instances that can be associated with a term, whatever properties that instances of both T i and T j share can be associated with any of the instances that can be associated with T k . This forms the basis for measuring the commonality between terms used in instance-based measures of similarity between terms.
The difference between terms T i and T j is modeled by the difference between the set of instances associated with each term. If we have two or more terms from a taxonomy in an annotation then it is reasonable to argue that the set of instances associated with an annotation should be the intersection of the set of instances associated with each term. The informativeness of the annotation is then based on the set of instances resulting from this intersection.

Interpreting Annotations from Partonomies
The part_of relation between terms denotes the concept that one term is 'part of ' another. It provides an alternative notion of relatedness between terms. An ontology consisting only of part_of relations is known as a partonomy. An example of a simple partonomy is wheel part_of car. It would not make sense to say that a wheel is_a car. The study of partness is complicated by the fact that there are many kinds of part_of relations. Yet the study of partness, known as mereology [24], has shown that there are also common aspects to all types of part_of relations, namely that part_of relations form a partial ordering on the sets of instances associated with each term.
According to the GO Consortium's usage guidelines since 2004 [25] the part_of relation should be interpreted as 'necessarily part of' where $T_i$ part_of $T_j$ means that all instances of $T_i$ are part of one or more instances of $T_j$. The converse is not necessarily true. For example, all nuclei are part of cells but not all cells contain a nucleus. Bittner [26] models such a part_of relation using an improper partial order, i.e. $T_j \le_{part\_of} T_i$ for term $T_i$ with descendant terms $T_j$.
Annotations consisting of terms such that one term is part_of another should view the child term as being relevant to the annotation while the parent term provides redundant, contextual information. For example, consider an annotation consisting of two terms $T_i$ and $T_j$ from a partonomy. If $T_j$ part_of $T_i$ then the annotation should be interpreted as the set of instances of $T_j$. All we can say is that the number of instances of $T_i$ associated with the annotation can be no more than the number of instances of $T_j$. In general, an annotation consisting of terms belonging to a partonomy consists of terms that provide the set of instances that can be associated with the annotation while other terms provide the context in which these instances are embedded. Figure 2 shows a subset of the GO consisting of both part_of and is_a relations. According to the taxonomic is_a relations both 'mitochondrial chromosome' and 'mitochondrial nucleoid' are kinds of 'mitochondrial part'. A measure of descriptiveness of a term should at least say that both 'mitochondrial chromosome' (a) and 'mitochondrial nucleoid' (b) are more descriptive than 'mitochondrial part' (c), i.e. $a, b \subseteq c$. Likewise, the part_of relation in figure 2 indicates that $a \le_{part\_of} b$. Here we can see how the part_of relation provides additional indirect information about descriptiveness not represented by the taxonomic relations. If an annotation consists of the terms 'mitochondrial chromosome' and 'mitochondrial nucleoid' then the annotation should be interpreted as the set of instances of 'mitochondrial chromosome'. If the terms 'mitochondrial part' and 'chromosome' are added to the annotation then the same set of instances should be associated with the annotation. All additional terms are already implied by the existence of 'mitochondrial chromosome' in the annotation.
If we had either treated the part_of relation as an is_a relation or ignored it then the [...]

[Figure 2: A subset of GO terms and relations.]

The GO consists of many examples similar to the one described above. In general, the GO can be viewed as a taxonomy interspersed with part_of relations. Two terms are said to be directly related if there exists a series of relations on a single path between them. Terms that are not directly related along a path in the graph are indirectly related via a common ancestor. For example there may be other terms that are part_of 'mitochondrial nucleoid' in which case the term 'mitochondrial chromosome' is only related to the other parts by an indirect path of part_of relations. Though not shown, the terms 'mitochondrial nucleoid' and 'chromosome' are only indirectly related via a common ancestor through a number of is_a relations. When interpreting an annotation it is necessary to account for such situations.

Partial Order Constraints for GO Annotations
In general, as described in table 1, there are nine cases to handle when trying to account for how terms are related. Terms or their taxonomic descendants may be directly related to each other in the ontology via a single path. Alternatively they may be indirectly related to each other via a common ancestor in which case we consider the two paths from the common ancestor to each term. A path may be homogeneous in that it consists of relations of only one type i.e. all relations are either only is_a or only part_of. Such paths are denoted by IS and PART respectively. A path that is inhomogeneous, consisting of both is_a and part_of relations, is denoted by MIXED.

Directly Related Cases
There are three cases to handle when there exists a single path between terms in the ontology: IS, PART and MIXED paths. The first case is the generalized case of taxonomic relations where $T_i$ IS $T_j$. For two terms $T_i$ and $T_j$, where $T_j$ is the parent term and $T_i$ is a descendant, and a set of n intermediate terms $\{T_1, \ldots, T_n\}$ such that:

$$T_i \ \text{is\_a} \ T_1 \ \text{is\_a} \ \cdots \ \text{is\_a} \ T_n \ \text{is\_a} \ T_j$$

it can be inferred that $T_i \subseteq T_j$. Where terms are related by a PART path a similar argument can be made for how two terms are ordered.
For the MIXED case there exists a mixture of is_a and part_of relations. The nature of the MIXED relationship is ultimately determined by the part_of relations. For example, if $T_i$ MIXED $T_j$ then this can be interpreted as $T_i$ part_of $T_j$. There may be several is_a relations traversed along a MIXED path from $T_j$ to $T_i$ before a part_of relation is encountered. This means that $T_i$ can only be part_of a subset of the instances of $T_j$. This subset is identified by the set of instances associated with the term (labeled $T_k$ in table 1) which is the parent term of the first part_of relation encountered along a MIXED path from $T_j$ to $T_i$. This results in the partial order:

$$T_i \le T_k \le T_j$$

where $T_i$ is the descendant of $T_j$, $T_j$ is the parent and $T_k$ denotes the first term before a part_of relation is encountered while traversing the MIXED path in the ontology from $T_j$ to $T_i$. This form of reasoning can be further extended along the rest of the MIXED path to produce a more detailed partial order. However if the ultimate goal is only to determine the partial order between $T_i$ and $T_j$ then such an extension of this reasoning is unnecessary.
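The three directly related cases can be told apart from the sequence of relation types along the path. The following sketch uses a hypothetical path representation (a list of terms from the ancestor down to the descendant, plus the relation type of each edge) and also locates the term T_k of a MIXED path.

```python
def classify_path(relations):
    """Return 'IS', 'PART' or 'MIXED' for a non-empty list of edge types."""
    kinds = set(relations)
    if kinds == {"is_a"}:
        return "IS"
    if kinds == {"part_of"}:
        return "PART"
    if kinds == {"is_a", "part_of"}:
        return "MIXED"
    raise ValueError(f"unexpected relation types: {sorted(kinds)}")


def first_part_parent(terms, relations):
    """Parent term of the first part_of edge met walking down from the ancestor.

    terms[0] is the ancestor T_j and terms[-1] the descendant T_i;
    relations[n] is the type of the edge linking terms[n + 1] to terms[n].
    """
    for index, relation in enumerate(relations):
        if relation == "part_of":
            return terms[index]
    return None


# Hypothetical MIXED path: Ti part_of Tk, Tk is_a Ta, Ta is_a Tj.
terms = ["Tj", "Ta", "Tk", "Ti"]
relations = ["is_a", "is_a", "part_of"]
print(classify_path(relations))              # MIXED
print(first_part_parent(terms, relations))   # Tk, giving Ti <= Tk <= Tj
print(classify_path(["is_a", "is_a"]))       # IS
```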

Indirectly Related Homogeneous Cases
There are three cases to handle where both the paths to the common ancestor between terms are homogeneous: IS-IS, PART-PART and IS-PART (or PART-IS). In the first case, where $T_i$ IS $T_{lca}$ and $T_j$ IS $T_{lca}$, since both terms $T_i$ and $T_j$ are taxonomic descendants of a lowest common ancestor $T_{lca}$ it should be expected that the numbers of instances associated with $T_i$ and $T_j$ are each less than the number of instances associated with $T_{lca}$. This results in the partial order

$$T_i, T_j \le T_{lca}$$

An annotation consisting of two such related terms can be interpreted as the set of instances that are associated with both $T_i$ and $T_j$. A similar form of reasoning can be applied to the PART-PART case. The partial order for the final case IS-PART (or PART-IS) can be derived in a similar fashion to the inhomogeneous direct MIXED case. If $T_i$ IS $T_{lca}$ and $T_j$ PART $T_{lca}$ then it can be inferred that $T_j$ PART $T_i$. If an annotation consists of two such terms then it should be interpreted as the set of instances of $T_j$. As a partial order constraint this can be modeled as

$$T_j \le T_i \le T_{lca}$$

Table 1: Overview of general forms of relation based ordering for directly and indirectly related terms. Terms are indirectly related via a common ancestor term $T_{lca}$. Instances of terms $T_i$ and $T_j$ may be part of the common ancestor $T_{lca}$ via terms $T_k$ and $T_m$ respectively. ρ denotes a function that measures the number of instances (our source of descriptiveness) of terms. These orderings assume complete knowledge of all instances associated with a term.

Indirectly Related Inhomogeneous Cases
Indirectly related inhomogeneous cases occur when terms are related by a common ancestor in the ontology and one or both of the paths connecting the common ancestor with each term consist of an inhomogeneous set of relation types. There are three such cases to account for: IS-MIXED (or MIXED-IS), PART-MIXED (or MIXED-PART) and MIXED-MIXED.
The partial order for the first case IS-MIXED (or MIXED-IS) can be handled by considering each path separately. The partial order for the $T_i$ IS $T_{lca}$ path is $T_i \le T_{lca}$. The partial order for the MIXED path is $T_j \le T_k \le T_{lca}$ which is derived in the same way as the directly related MIXED case. Combining the two partial orders results in

$$T_j \le T_i \cap T_k \le T_{lca}$$

If an annotation consists of two such terms then it should be interpreted as the set of instances of $T_j$ that are part of instances that are of type $T_i$ and $T_k$.
The PART-MIXED (or MIXED-PART) case requires slightly more reasoning to construct its associated partial order. If $T_i$ PART $T_{lca}$ and $T_j$ MIXED $T_{lca}$ then it can be inferred that both $T_i$ and $T_j$ are part of $T_{lca}$. Because $T_j$ is only part of a subset of the instances associated with $T_{lca}$, namely the instances associated with $T_k$, $T_i$ can only be part of the set of instances associated with $T_k$ also. This results in the partial order

$$T_i, T_j \le T_k \le T_{lca}$$

An annotation consisting of two such related terms should be interpreted as the set of instances of T i and T j that are part of the same instances of T k .
The final case MIXED-MIXED occurs when paths from both terms to the common ancestor consist of a mixture of relation types. The partial order for such a case can be constructed by looking at each path separately. If $T_i$ MIXED $T_{lca}$ then the partial ordering is $T_i \le T_k \le T_{lca}$. Similarly for $T_j$ MIXED $T_{lca}$ we get $T_j \le T_m \le T_{lca}$. Combining the two partial orders results in

$$T_i, T_j \le T_k \cap T_m \le T_{lca}$$

If an annotation consists of two such terms then it should be interpreted as the set of instances of $T_i$ and $T_j$ that are part of the same instances of $T_k$ and $T_m$.

The SSA Algorithm
The SSA algorithm is based on the nine cases of term relatedness described above. It derives the set of instances that can be associated with an annotation from the sets of instances associated with that annotation's constituent terms. There are two aspects to the algorithm: identifying which terms are contextual, and therefore redundant, and identifying which terms' instances can be associated with the annotation. For example, the contextual term 'mitochondrial nucleoid' may provide the context for the set of instances of 'chromosome'. Throughout we denote the set of contextual terms by exclTerms and the set of terms whose instances can be associated with the annotation by inclTerms. We write numInst(T i ) for the number of instances associated with T i .
The above partial order constraints were constructed under the ideal assumptions made by the partial orderings in taxonomies and partonomies. In reality only an incomplete set of instances is ever associated with terms, and some adjustment of the number of instances is required if the partial order constraints are to be satisfied. Terms that are taxonomically related are guaranteed to satisfy the taxonomic constraints. However, terms that are partonomically related may not satisfy their associated partial order constraints. In these cases some adjustment of the number of instances associated with a term is necessary. For example, if T i PART T j and there are no instances associated with T j in the corpus while there are a number of instances associated with T i , then in order to satisfy the PART constraint the number of instances of T j is set equal to the number of instances associated with T i .
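The adjustment in this example can be sketched as follows. This is a minimal illustration, assuming instance counts are held in a plain dictionary; the names `enforce_part_constraint` and `counts` are hypothetical and are not taken from the original implementation.

```python
def enforce_part_constraint(num_inst, part, whole):
    """Return adjusted counts in which numInst(part) <= numInst(whole).

    If `whole` has fewer recorded instances than `part`, its count is
    raised to match the count of `part`, as in the T_i PART T_j example.
    """
    adjusted = dict(num_inst)  # copy so the corpus counts stay untouched
    part_count = adjusted.get(part, 0)
    if adjusted.get(whole, 0) < part_count:
        adjusted[whole] = part_count
    return adjusted


# T_j has no instances in the corpus, T_i has twelve (made-up numbers).
counts = {"T_i": 12, "T_j": 0}
adjusted_counts = enforce_part_constraint(counts, "T_i", "T_j")
```

Working on a copy keeps the raw corpus counts available for other comparisons; whether the original method adjusts counts in place is not stated.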
The algorithm consists of the following steps:
• For each distinct ordered pair (T i , T j ) of terms in annotations G 1 and G 2 respectively
  - Identify the case that corresponds to how T i is related to T j
    * Terms are assigned to inclTerms or exclTerms depending on case
    * The number of instances associated with a term may be adjusted if the case allows
• Remove any terms from inclTerms also found in exclTerms
• Return the sets inclTerms and exclTerms
where an ordered pair of terms (T i , T j ) means that (T i , T j ) ≠ (T j , T i ). In the following sections we identify how each case assigns terms to inclTerms and exclTerms and adjusts the number of instances associated with each term used to compare annotations.
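The overall control flow of these steps can be sketched as below. This is only an outline under stated assumptions: the case classifier and the per-case handlers are passed in as parameters because their details are given case by case in the following sections, and the names `ssa`, `classify`, `handlers`, `toy_classify` and `toy_is_handler` are hypothetical. The toy IS handler, which puts T i in inclTerms and T j in exclTerms, is an illustrative guess rather than the authors' rule.

```python
from itertools import product


def ssa(g1, g2, classify, handlers, num_inst):
    """Outline of the SSA loop over ordered pairs of terms.

    classify(t_i, t_j) -> case label, or None if the terms are unrelated.
    handlers[case](t_i, t_j, incl, excl, counts) updates the sets and counts.
    """
    incl_terms, excl_terms = set(), set()
    counts = dict(num_inst)  # working copy that handlers may adjust
    for t_i, t_j in product(set(g1), set(g2)):
        handler = handlers.get(classify(t_i, t_j))
        if handler is not None:
            handler(t_i, t_j, incl_terms, excl_terms, counts)
    incl_terms -= excl_terms  # remove terms also found in exclTerms
    return incl_terms, excl_terms, counts


def toy_classify(t_i, t_j):
    # Hypothetical one-edge ontology: "child" IS "parent".
    return "IS" if (t_i, t_j) == ("child", "parent") else None


def toy_is_handler(t_i, t_j, incl, excl, counts):
    # Assumed behaviour for illustration only.
    incl.add(t_i)
    excl.add(t_j)


incl, excl, _ = ssa({"child"}, {"parent"}, toy_classify,
                    {"IS": toy_is_handler}, {"child": 2, "parent": 5})
```

The adjusted counts are returned alongside the two sets here only so that a caller can inspect them; the steps above return inclTerms and exclTerms.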

Direct Cases
The IS constraint where one term in an annotation is a special case of another term can be implemented as follows: In this situation the term T j is viewed as being the common taxonomic ancestor of both terms.
The PART constraint where one term is a part of another term can be implemented as: In this situation the term T j is viewed as providing the context that instances of T i are part of.
The case is similar for T i MIXED T j . In these cases we are relating terms that belong to two different lines of taxonomic inheritance, where terms have a possibly incomplete set of associated instances. To ensure that the partial order constraint associated with this case is implemented correctly, if T j has fewer instances associated with it than T i then we adjust the number of instances associated with T j to be equal to the number of instances associated with T i .
The MIXED constraint where T i is a part of another term T j via an intermediate term T k can be implemented similarly to the PART case: In this situation the term T k is viewed as providing the context that instances of T i are part of.

Indirect Homogeneous Cases
In the indirect homogeneous cases, the compared terms T i and T j are indirectly related via a common ancestor T lca along homogeneous paths. The first such case is where T i IS T lca and T j IS T lca . In this situation the number of instances associated with T lca provides a measure of similarity between T i and T j :
In the case where T i PART T lca and T j PART T lca , T lca provides the context in which instances of T i and T j are embedded.
Since terms from two different lines of taxonomic inheritance are being compared and the set of instances associated with each term is incomplete an adjustment of the number of instances associated with each term is necessary.
The final homogeneous indirect case occurs when T i PART T lca and T j IS T lca . This is equivalent to T i PART T j since if T i is a part of T lca and T j is a kind of T lca then T i is a part of T j .
As with other cases, the number of instances associated with each term is adjusted to ensure that the partial order constraint associated with the case is satisfied.

Indirect Inhomogeneous Cases
In these cases one or both paths from T lca to terms T i and T j contain inhomogeneous types of relations. Throughout this section the term T k is a term in the ontology such that T m MIXED T k and T k IS T n if T n is an ancestor of T m in the ontology.
The first such case occurs where, for two indirectly related terms being compared, T i and T j , there exists a MIXED path from T i to T lca via T k and an IS path from T j to T lca .
Since the relationship between T i and T j cannot be refined further than their relationship via T lca only T lca is assigned to exclTerms.
The second case occurs when T i MIXED T lca via T k and T j PART T lca . Since T j is part of T lca and T i is part of T k which is a kind of T lca then T j is a part of T k .
The final case occurs when both terms T i and T j are MIXED related to T lca via T k and T m respectively. What is common between both terms T i and T j is that they are both part of T lca . The number of instances associated with each term is adjusted to satisfy the partial order constraints associated with this case.
After all terms have been compared with each other it is necessary to remove any terms from inclTerms that are found in exclTerms. This can occur when one comparison assigns a term to inclTerms while another comparison identifies the term as belonging to the excluded set. After all terms are compared each term in inclTerms should have the same number of instances associated with it. The number of instances that are associated with an annotation G is equal to the minimum number of instances that can be associated with any of the terms in inclTerms ∩ G.
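The final bookkeeping step can be sketched as follows, assuming the term sets are Python sets and the (possibly adjusted) instance counts are held in a dictionary. The function name `annotation_instance_count` and the choice to return None when no included term belongs to the annotation are assumptions of this sketch.

```python
def annotation_instance_count(annotation, incl_terms, excl_terms, num_inst):
    """Minimum instance count over the terms in inclTerms ∩ G.

    Terms that also appear in exclTerms are dropped from inclTerms first.
    Returns None if no included term belongs to the annotation.
    """
    retained = set(incl_terms) - set(excl_terms)
    candidates = retained & set(annotation)
    if not candidates:
        return None
    return min(num_inst[term] for term in candidates)


# Made-up example: "b" is excluded and "x" is not in the annotation,
# so only "a" contributes.
example_count = annotation_instance_count(
    annotation={"a", "b", "c"},
    incl_terms={"a", "b", "x"},
    excl_terms={"b"},
    num_inst={"a": 7, "b": 3, "x": 1},
)
```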

Finding the Nearest Common Annotation
Just as in semantic similarity of terms, where there is a common ancestor between two terms, there exists a nearest common annotation between two annotations. The concept of a nearest common annotation allows the extension of information based semantic similarity measures of terms, such as Resnik's and Lin's measures, to information based measures of semantic similarity of annotations.
We define the nearest common annotation (NCA) between two annotations G 1 and G 2 to be the annotation containing terms related to both annotations. The NCA should have the minimum possible number of instances associated with it such that either G 1 or G 2 can be derived from it. The set exclTerms that results from applying SSA to two annotations G 1 and G 2 is the set of terms associated with the NCA.

Measuring Similarity
By introducing the notion of nearest common annotation we can naturally extend Resnik's measure to measuring similarity of annotation. The LCA between two terms is replaced with the NCA of two annotations G 1 and G 2 . Likewise, instead of applying IC Corpus (eqn. 1) to instances associated with a term we apply IC Corpus to instances of an annotation. Thus the extension of Resnik's measure from terms to annotations G 1 and G 2 , SSA Resnik , becomes: where maxNumInst is the number of distinct instances in the corpus.
Lin's measure may be extended as follows: In this case the SSA algorithm is used to find the non redundant terms that can be associated with an annotation.
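For concreteness, one plausible form of the two extended measures is written out below. It assumes that IC Corpus is the usual negative log relative frequency, that numInst(G) denotes the number of instances SSA associates with annotation G, and that NCA(G 1 , G 2 ) denotes the nearest common annotation; the logarithm base and these notational choices are assumptions, not a restatement of the authors' exact equations.

```latex
\begin{align}
IC_{\mathrm{Corpus}}(G) &= -\log \frac{\mathrm{numInst}(G)}{\mathrm{maxNumInst}} \\
\mathrm{SSA}_{\mathrm{Resnik}}(G_1, G_2) &= IC_{\mathrm{Corpus}}\bigl(\mathrm{NCA}(G_1, G_2)\bigr)
  = -\log \frac{\mathrm{numInst}\bigl(\mathrm{NCA}(G_1, G_2)\bigr)}{\mathrm{maxNumInst}} \\
\mathrm{SSA}_{\mathrm{Lin}}(G_1, G_2) &= \frac{2\, IC_{\mathrm{Corpus}}\bigl(\mathrm{NCA}(G_1, G_2)\bigr)}
  {IC_{\mathrm{Corpus}}(G_1) + IC_{\mathrm{Corpus}}(G_2)}
\end{align}
```

Under this reading, a rarer NCA (fewer associated instances) yields a larger Resnik-style value, while the Lin-style value is normalized by the information content of the two annotations themselves.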

Example
We give an example of two gene products whose annotations return a high similarity value when compared using our measure SSA Resnik . Two gene products, AAH1 and FUR1, whose annotations (listed in table 2) were taken from the SGD database [27], were compared, producing a similarity value of 5.678. The number of instances associated with each term was obtained from the GOA [28] S. cerevisiae table of GO assignments.

Results
To validate our approach the discriminatory power of our method to identify clusters of related gene products was compared against Wang's measure of annotation similarity that also exploits the differences between types of relations. The average similarity of gene products found in the same biochemical pathway in the SGD database was compared against the average similarity of the same gene products compared with gene products found in other pathways. A large difference between these two values indicates the effectiveness of a similarity measure in discovering new pathways in a set of gene products. Average similarity of annotations inside and outside pathways was measured under four conditions: all terms; cellular component terms only; biological process terms only; and molecular function terms only.
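The inside-versus-outside comparison can be sketched as follows. This is a minimal illustration in which the similarity function is passed in as a parameter; the names `pathway_contrast`, `toy_sim` and the toy gene labels are hypothetical, and the toy similarity values are invented purely to show the computation.

```python
from itertools import combinations, product
from statistics import mean


def pathway_contrast(pathway, others, sim):
    """Average similarity inside a pathway and against outside gene products.

    `pathway` and `others` are lists of gene product identifiers and
    `sim(a, b)` is any annotation similarity measure.
    """
    inside = [sim(a, b) for a, b in combinations(pathway, 2)]
    outside = [sim(a, b) for a, b in product(pathway, others)]
    return mean(inside), mean(outside)


# Toy data: gene products in the same (made-up) group are fully similar.
toy_group = {"g1": "A", "g2": "A", "g3": "A", "g4": "B", "g5": "B"}


def toy_sim(a, b):
    return 1.0 if toy_group[a] == toy_group[b] else 0.25


inside_avg, outside_avg = pathway_contrast(["g1", "g2", "g3"],
                                           ["g4", "g5"], toy_sim)
```

The gap between the two averages is the quantity compared across measures and across the four term conditions.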
A better test would be to take the average similarity of a set of gene products found in the same pathway and find the average or max of the average similarities of all other similarly sized sets of gene products. Of course, this is intractable.
Figure 3: Normalized SSA Resnik vs Wang's Method vs Normalized Max Resnik. Values shown correspond to the average annotation similarity values between gene products with other gene products in the same pathway (taken from the SGD biochemical pathways database) and between gene products in a pathway with other gene products not found in the pathway.
Similarity values of annotations inside a pathway remain consistently higher than when the same annotations are compared with annotations outside the pathway for all methods.
The source of the similarity between SSA Resnik and Max Resnik can be identified when only molecular function terms are used, as shown in figures 10 and 11. In this case both methods behave exactly the same, since there are no part_of relations to exploit when comparing terms. Wang's method, shown in figure 12, returns a consistently high average similarity value for annotations inside a pathway compared with annotations outside a pathway.

Discussion and conclusion
The SSA algorithm provides the basis of a framework for extending instance based measures of term similarity to annotations. The algorithm's construction is based on the set of cases for how terms are related to each other when the ontology consists only of is_a and part_of relations. Due to the incomplete nature of the set of instances associated with a term, it is necessary to adjust the number of instances associated with a term in order to satisfy the partial order constraints of each case fully. As the number of annotations of gene products increases and ontological terms are applied more consistently, it may be possible to satisfy the constraints without such adjustment. Alternatively, the partial order constraints can be used to develop a similarity method which is less dependent on the set of instances associated with terms.
When terms from all three sub-ontologies (CC, BP and MF) are used, Max Resnik and SSA Resnik give equivalent annotation similarities for proteins found in the SGD database. This is due to the high degree of specificity of molecular function terms, which are not related partonomically and therefore cause the two measures to return the same values. When only cellular component and biological process terms are used, the experimental evidence indicates that SSA Resnik becomes a better identifier of proteins belonging to pathways. SSA Resnik may identify new gene products that belong to pathways but have a different molecular function to those proteins already identified as belonging to the pathway. Molecular function terms play only a small role in identifying new pathway proteins, since proteins tend to have different molecular functions inside pathways.
By finding the set of instances that can be associated with an annotation it is possible to preserve, at the annotation level, the properties of instance based methods used to measure the similarity of terms. For two given annotations, the nearest common annotation (NCA) is a minimal set of terms such that either annotation could be derived from it. The SSA algorithm provides a method for finding the set of terms associated with the NCA.
By combining the SSA algorithm with Resnik's measure and the concept of nearest common annotation we have developed a measure that provides good discriminatory power to identify possible pathways and other functional groups from gene product annotations. More generally, the set of cases and their associated constraints further