Gene Ontology term overlap as a measure of gene functional similarity
© Mistry and Pavlidis. 2008
Received: 22 February 2008
Accepted: 04 August 2008
Published: 04 August 2008
The availability of various high-throughput experimental and computational methods allows biologists to rapidly infer functional relationships between genes. It is often necessary to evaluate these predictions computationally, a task that requires a reference database for functional relatedness. One such reference is the Gene Ontology (GO). A number of groups have suggested that the semantic similarity of the GO annotations of genes can serve as a proxy for functional relatedness. Here we evaluate a simple measure of semantic similarity, term overlap (TO).
We computed the TO for randomly selected gene pairs from the mouse genome. For comparison, we implemented six previously reported semantic similarity measures that estimate the information content of terms from their probabilities of occurrence, three vector-based approaches, and a normalized version of the TO measure. We find that the overlap measure is highly correlated with the others but differs in detail. TO is at least as good a predictor of sequence similarity as the other measures. We further show that term overlap may avoid some problems that affect the probability-based measures. Term overlap is also much faster to compute than the information content-based measures.
Our experiments suggest that term overlap can serve as a simple and fast alternative to other approaches which use explicit information content estimation or require complex pre-calculations, while also avoiding problems that some other measures may encounter.
In this paper we consider the problem of deciding if two genes are functionally related using computational methods. In particular, we are interested in how existing information about gene function can be used to enhance or evaluate computational predictions of functional relationships among genes.
Many genes have been functionally characterized by experimental methods, sequencing efforts, and high-throughput techniques, and as a consequence those genes appear in public databases annotated with terms or concepts representing their deduced function or biological role in the cell. The Gene Ontology (GO) is a structured, controlled vocabulary of terms that provides consistency in annotating how a given gene product behaves in a cellular context, and many genes are now annotated with GO terms. It is increasingly common to define the functional relatedness of genes by the "semantic similarity" of their GO annotations [2–7]. While many measures have been used, their relative benefits and drawbacks are unclear. The current work examines the behaviour of various semantic similarity measures that have been proposed, including one that has not previously been considered in comparisons.
Previous work on the use of the GO to measure functional similarity focussed on the use of information content [9, 10]. The information content (IC) of a term is related to how often the term is applied to genes in the database, such that rarely used terms are ascribed higher IC. The IC of GO terms decreases monotonically as one follows the graph from a leaf term towards the root term. Intuitively, terms low in the hierarchy are "more detailed" and impart more information about function than high-level terms such as "metabolism". Semantic similarity measures based on IC make use of the idea that genes sharing terms with high IC are expected to be more functionally similar than genes that share terms with low IC. Indeed, it was shown that some IC-based measures correlate with other measures of functional relatedness such as sequence similarity [9, 10].
A second set of semantic similarity measures are variations on the Vector Space Model (VSM), an algebraic model originally developed for use in information retrieval. Unlike the IC-based measures, these methods do not account for hierarchical relations in the GO, and instead refer to GO terms in a 'flat' matrix format. The need to generate individual gene vectors is an extra cost incurred before the similarity computation itself. In this study we implement three such methods [12–14] for comparison against our measure.
Our proposed method, Term Overlap (TO), was used previously by Lee et al. in a study of gene coexpression, where it was shown that TO correlates with increasing confidence in coexpression. Although Lee et al. first implemented the TO measure, it was neither thoroughly evaluated nor compared to other similarity measures. In this study we sought to test whether TO is an adequate substitute for other measures that have been put forward. In contrast to the other semantic similarity measures, TO does not use an explicit information content computation, and is less algorithmically complex. Here we explore the properties of TO in more detail and carry out a more formal evaluation of the approach, and find that TO has a number of attractive features that may recommend it as an alternative to other semantic similarity measures.
Sets of gene pairs were generated by random pairwise selection from the mouse genome. Each of the genes was annotated with its respective GO terms as they appear in the NCBI Gene database (downloaded on January 8, 2008). Genes which were not annotated with any GO terms were not considered, leaving a set of 18,161 genes. Several sets containing 10,000 gene pairs each were evaluated initially, with a final dataset of 100,000 gene pairs generated for which the results are displayed in this paper ("100 k"). The 100 k set has pairs covering the entire corpus of mouse genes. The 100 k set with associated statistics is available as supplementary data from http://bioinformatics.ubc.ca/pavlidis/lab/gometric/.
Several of the measures we considered require the computation of the information content of each GO term. These measures were originally described for the analysis of any corpus of text, and were adapted for use with GO by Lord et al. (2003), where full details are given. The information content of a GO term $t_i$ is:
$$\mathrm{IC}(t_i) = -\log p(t_i) \tag{1}$$
where $p(t_i)$ is the probability of the term occurring in the corpus:
$$p(t_i) = \frac{\mathrm{freq}(t_i)}{\mathrm{freq}(\mathrm{root})} \tag{2}$$
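Here $\mathrm{freq}(\mathrm{root})$ is the frequency of the root term, and $\mathrm{freq}(t_i)$ counts the occurrences of $t_i$ itself together with the occurrences of all of its child terms:
$$\mathrm{freq}(t_i) = \mathrm{occur}(t_i) + \sum_{t_j \in \mathrm{children}(t_i)} \mathrm{occur}(t_j) \tag{3}$$
where $\mathrm{occur}(t)$ is the number of times term $t$ occurs in the annotations and $\mathrm{children}(t_i)$ is the set of all children terms for the term $t_i$ (that is, the set of all terms for which $t_i$ is a parent term, either directly or indirectly).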
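As a minimal sketch of equations (1) and (2), the following Python computes term frequencies and information content on a toy ontology. The term names, the `PARENTS` map and the gene annotations are invented for illustration and are not taken from GO. The sketch assumes that the frequency of a term is its own number of occurrences plus the occurrences of all of its direct and indirect children.

```python
import math
from collections import Counter

# Hypothetical toy ontology: term -> set of direct parents (not real GO terms).
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "kinase": {"catalysis"},
}

# Hypothetical direct annotations: gene -> set of terms.
ANNOTATIONS = {
    "geneA": {"dna_binding"},
    "geneB": {"binding"},
    "geneC": {"kinase"},
    "geneD": {"kinase", "dna_binding"},
}

def ancestors(term, parents=PARENTS):
    """All direct and indirect parents of `term` (the term itself excluded)."""
    found, stack = set(), list(parents[term])
    while stack:
        t = stack.pop()
        if t not in found:
            found.add(t)
            stack.extend(parents[t])
    return found

def term_frequencies(annotations, parents=PARENTS):
    """freq(t): occurrences of t plus occurrences of all terms below it."""
    freq = Counter({t: 0 for t in parents})
    for terms in annotations.values():
        for t in terms:
            freq[t] += 1
            for a in ancestors(t, parents):
                freq[a] += 1
    return freq

def information_content(freq, root="root"):
    """IC(t) = -log(freq(t) / freq(root)); unused terms are left out."""
    return {t: -math.log(n / freq[root]) for t, n in freq.items() if n > 0}

FREQ = term_frequencies(ANNOTATIONS)
IC = information_content(FREQ)
```

In this toy corpus the root has frequency 5 and an IC of zero, and IC rises from "binding" to its child "dna_binding", matching the monotonic behaviour described in the text.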
In our analysis we focus on three IC-based measures adapted from the work of Resnik, Lin, and Jiang and Conrath. Resnik's measure calculates the similarity between two terms $t_1$ and $t_2$ using only the IC of the lowest common ancestor (LCA) that they share:
$$\mathrm{sim}_{\mathrm{Res}}(t_1, t_2) = \mathrm{IC}(\mathrm{LCA}) \tag{4}$$
The Lin and Jiang measures use the IC of the two query terms in addition to that of the LCA. For each of the three measures, a higher score indicates a higher semantic similarity between two terms. The lowest score for all three measures is 0. The highest score for Lin and Jiang is 1, and Resnik's measure has no upper bound.
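The term-level measures can be sketched as follows. The IC values and ancestor sets are invented for illustration. Resnik's measure follows equation (4); the Lin function uses the standard published form of Lin's measure, 2·IC(LCA)/(IC(t1) + IC(t2)), which is an assumption about the exact variant used in this study. Jiang's measure is omitted because several rescalings of it exist.

```python
# Hypothetical IC values and ancestor sets (each set includes the term itself).
IC = {"root": 0.0, "binding": 0.51, "dna_binding": 0.92, "rna_binding": 1.61}
ANCESTORS = {
    "root": {"root"},
    "binding": {"binding", "root"},
    "dna_binding": {"dna_binding", "binding", "root"},
    "rna_binding": {"rna_binding", "binding", "root"},
}

def lca_ic(t1, t2):
    """IC of the shared ancestor with the highest IC (the LCA)."""
    return max(IC[t] for t in ANCESTORS[t1] & ANCESTORS[t2])

def sim_resnik(t1, t2):
    """Resnik: the IC of the LCA, with no upper bound."""
    return lca_ic(t1, t2)

def sim_lin(t1, t2):
    """Lin (standard form, assumed): 2 * IC(LCA) / (IC(t1) + IC(t2))."""
    denom = IC[t1] + IC[t2]
    # Assumption: a pair of zero-IC terms is scored 0 rather than left undefined.
    return 2.0 * lca_ic(t1, t2) / denom if denom > 0 else 0.0

RESNIK_SIBLINGS = sim_resnik("dna_binding", "rna_binding")
LIN_SIBLINGS = sim_lin("dna_binding", "rna_binding")
LIN_IDENTICAL = sim_lin("dna_binding", "dna_binding")
```

The two sibling terms share "binding" as their LCA, so Resnik returns its IC, while Lin rescales that value by the IC of the two query terms and returns 1 for identical terms.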
These measures are intended to score the similarity between two GO terms, and must be extended to compare genes, each of which can have multiple GO terms. Following a previously described approach, let us compare two gene products $g_1$ and $g_2$. Every term in the direct annotation set for gene $g_1$ is compared against every term in the direct annotation set for gene $g_2$. For each pairwise comparison, if the two direct annotations are identical, that term is considered the LCA. If they are not identical, we retrieve the parent term sets induced for the two annotation terms, and the shared parent term with the highest information content is considered the LCA. The similarity score is then calculated for that pair of terms. The scores generated for all pairs of GO terms are used to produce a final score for the gene pair in one of two ways: i) scores can be averaged across all possible term pairs for the two genes, or ii) only the maximum score resulting from all possible term pairs for the two genes is used. We refer to these as the "average" and "maximum" methods in the following. Thus we consider six IC-based measures: Lin, Jiang and Resnik, each with average and maximum variants.
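The "average" and "maximum" aggregation steps can be sketched independently of the term-level measure. In this sketch `toy_term_sim` and its score table are hypothetical stand-ins for any of the IC-based term similarities, and returning zero for a gene without annotations is an assumption.

```python
from itertools import product

# Hypothetical term-to-term scores, standing in for an IC-based measure.
TOY_SCORES = {
    frozenset(("a", "b")): 0.2,
    frozenset(("a", "c")): 0.8,
}

def toy_term_sim(t1, t2):
    """Look up an invented similarity; identical terms score 1.0 here."""
    if t1 == t2:
        return 1.0
    return TOY_SCORES.get(frozenset((t1, t2)), 0.0)

def gene_similarity(terms1, terms2, term_sim, method="max"):
    """Combine all term-pair scores for two genes by "max" or "average"."""
    scores = [term_sim(a, b) for a, b in product(terms1, terms2)]
    if not scores:
        return 0.0  # assumption: genes without annotations score zero
    if method == "max":
        return max(scores)
    if method == "average":
        return sum(scores) / len(scores)
    raise ValueError(f"unknown method: {method}")

MAX_SCORE = gene_similarity({"a"}, {"b", "c"}, toy_term_sim, "max")
AVG_SCORE = gene_similarity({"a"}, {"b", "c"}, toy_term_sim, "average")
```

Both variants evaluate every term pair, so the work grows with the product of the two annotation set sizes.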
A variation on the Cosine measure, which has been previously used in ontology-based similarity, first generates a weight, $w_t$, for each GO term based on the frequency of its occurrence in the corpus.
$$w_t = \log\left(\frac{N}{n_t}\right) \tag{8}$$
where $N$ is the total number of genes in the corpus and $n_t$ is the number of genes in the corpus annotated with the term $t$. These weights replace the non-zero values in the binary vector, and the cosine measure is then calculated as in (7). We refer to this method as the Weighted Cosine measure in this study.
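A minimal sketch of the binary and weighted cosine measures follows, using the standard cosine of two vectors and the weights of equation (8). The gene names and annotation sets are invented, and each gene vector is represented implicitly by its set of terms.

```python
import math

# Hypothetical annotation sets: gene -> set of terms.
ANNOT = {
    "g1": {"a", "b"},
    "g2": {"a", "c"},
    "g3": {"a"},
    "g4": {"b"},
}

def idf_weights(annotations):
    """w_t = log(N / n_t): N genes in total, n_t genes annotated with t."""
    n_genes = len(annotations)
    counts = {}
    for terms in annotations.values():
        for t in terms:
            counts[t] = counts.get(t, 0) + 1
    return {t: math.log(n_genes / n) for t, n in counts.items()}

def cosine(terms1, terms2, weights=None):
    """Cosine of two gene vectors; binary when `weights` is None."""
    w = (lambda t: 1.0) if weights is None else weights.__getitem__
    dot = sum(w(t) ** 2 for t in terms1 & terms2)
    norm1 = math.sqrt(sum(w(t) ** 2 for t in terms1))
    norm2 = math.sqrt(sum(w(t) ** 2 for t in terms2))
    # Assumption: a zero-length vector yields a similarity of 0.
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

WEIGHTS = idf_weights(ANNOT)
BINARY = cosine(ANNOT["g1"], ANNOT["g2"])
WEIGHTED = cosine(ANNOT["g1"], ANNOT["g2"], WEIGHTS)
```

Here the only shared term, "a", is the most frequent one in the toy corpus, so it receives the smallest weight and the weighted score falls below the binary score.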
Finally, Huang et al. also proposed a vector-based similarity measure, which is integrated in the DAVID Gene Functional Classification Tool. For a given gene pair, binary gene vectors are extracted from the compiled matrix as described above. Kappa statistics are then used to measure co-occurrence of annotation between gene pairs. The algorithm is described in detail in the original publication.
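The kappa statistic for a pair of binary gene vectors can be sketched with the textbook definition of Cohen's kappa. This is an illustration only: the term names and the vector length `n_terms` are hypothetical, and any further details of the DAVID implementation are not reproduced here.

```python
def kappa(terms1, terms2, n_terms):
    """Cohen's kappa for two binary annotation vectors of length `n_terms`."""
    both = len(terms1 & terms2)
    only1 = len(terms1 - terms2)
    only2 = len(terms2 - terms1)
    neither = n_terms - both - only1 - only2
    observed = (both + neither) / n_terms
    # Chance agreement: both vectors 1 by chance, plus both 0 by chance.
    expected = (
        (both + only1) * (both + only2) + (neither + only2) * (neither + only1)
    ) / n_terms ** 2
    if expected == 1.0:
        return 0.0  # assumption: an undefined kappa is reported as 0
    return (observed - expected) / (1.0 - expected)

G1 = {"t0", "t1", "t2", "t3"}
G2 = {"t2", "t3", "t4", "t5"}
KAPPA_PARTIAL = kappa(G1, G2, n_terms=10)
KAPPA_IDENTICAL = kappa(G1, G1, n_terms=10)
```

With ten terms in the matrix, two genes that share two of their four terms agree on six positions, but much of that agreement is expected by chance, so kappa is only about 0.17.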
When calculating the term overlap between two gene products we consider the set of all direct annotations for each gene and all of their associated parent terms (excluding the root of the hierarchy) as a gene product annotation set, $\mathrm{annot}_{g_1}$. The term overlap score for two genes is then calculated as the number of terms that occur in the intersection of the two gene product annotation sets.
$$\mathrm{sim}_{\mathrm{TO}}(g_1, g_2) = \left|\mathrm{annot}_{g_1} \cap \mathrm{annot}_{g_2}\right| \tag{9}$$
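Equation (9) can be sketched in a few lines of Python. The toy ontology and the term names are invented for illustration; the expansion adds every direct and indirect parent of each direct annotation and leaves out the root, as described above.

```python
# Hypothetical toy ontology: term -> set of direct parents (not real GO terms).
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "dna_binding": {"binding"},
    "tf_activity": {"dna_binding"},
    "catalysis": {"root"},
    "kinase": {"catalysis"},
}

def annotation_set(direct_terms, parents=PARENTS, root="root"):
    """Direct annotations plus all of their parents, excluding the root."""
    closure, stack = set(), list(direct_terms)
    while stack:
        t = stack.pop()
        if t != root and t not in closure:
            closure.add(t)
            stack.extend(parents.get(t, ()))
    return closure

def term_overlap(direct1, direct2, parents=PARENTS):
    """TO: size of the intersection of the two expanded annotation sets."""
    return len(annotation_set(direct1, parents) & annotation_set(direct2, parents))

TO_DEEP = term_overlap({"tf_activity"}, {"dna_binding", "kinase"})
TO_SHALLOW = term_overlap({"binding"}, {"binding"})
```

The first pair shares "dna_binding" and "binding" and scores 2, whereas two genes annotated only with the shallow term "binding" score 1, so depth of the shared annotation directly raises the score.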
Traditional cardinality-based similarity measures such as Jaccard and Dice are computed similarly to the normalized term overlap (NTO), but use the union or sum, respectively, of the two gene annotation set sizes as the normalizing factor. The Czekanowski-Dice distance, used in the functional analysis module of GOToolBox, is calculated by normalizing the size of the symmetric difference between the two gene term sets by the sum of the sizes of the intersection and union sets. In this way, the scale for the distance is reversed from that of NTO, with genes having no GO terms in common scoring a distance of 1 and highly functionally related genes scoring closer to 0. Since these measures are very similar to NTO, we chose not to include them in our study.
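The normalized variants differ only in their denominators, as the following sketch shows. The Jaccard, Dice and Czekanowski-Dice forms follow the description above. The NTO denominator, the size of the smaller annotation set, is an assumption chosen to be consistent with the statement that a gene with a single shared term yields an NTO of 1.0. The sets are invented and are taken to be already expanded with parent terms.

```python
def nto(a, b):
    """Assumed NTO: overlap divided by the size of the smaller set."""
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0

def jaccard(a, b):
    """Overlap divided by the size of the union."""
    return len(a & b) / len(a | b) if a or b else 0.0

def dice(a, b):
    """Twice the overlap divided by the sum of the two set sizes."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

def czekanowski_dice_distance(a, b):
    """Symmetric difference divided by (union + intersection)."""
    denom = len(a | b) + len(a & b)
    return len(a ^ b) / denom if denom else 0.0

RICH = {"a", "b", "c", "d"}   # hypothetical well-annotated gene
SPARSE = {"a"}                # hypothetical gene with one shallow term

SCORES = {
    "nto": nto(RICH, SPARSE),
    "jaccard": jaccard(RICH, SPARSE),
    "dice": dice(RICH, SPARSE),
    "cd_distance": czekanowski_dice_distance(RICH, SPARSE),
}
```

For this pair the assumed NTO reaches its maximum of 1.0, while Jaccard and Dice give 0.25 and 0.4. Because the union plus the intersection equals the sum of the two set sizes, the Czekanowski-Dice distance is one minus the Dice score.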
The NCBI Consensus Coding Sequence (CCDS) protein sequences were obtained from the NCBI FTP site ftp://ftp.ncbi.nih.gov/blast/. We used the NCBI BLAST suite program "bl2seq" to analyze the similarity between the protein sequences for each of the 100,000 gene pairs. Similarity for this analysis was measured using the bit score values. CCDS did not include sequences for some of the genes considered, yielding similarity scores for 67,179 pairs. Correlations and their statistical significance were determined using the cor.test function in R.
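The correlations in this study were computed with cor.test in R. As an illustration only, a rank correlation of the Spearman type between semantic similarity scores and bit scores can be sketched in Python as follows. The score lists are invented, tied values receive their average rank, and no significance test is included.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / math.sqrt(vx * vy)

TO_SCORES = [1, 2, 3, 4]                # hypothetical TO scores
BIT_SCORES = [10.0, 20.0, 30.0, 25.0]   # hypothetical bit scores
RHO = spearman(TO_SCORES, BIT_SCORES)
```

Only the ordering of the scores matters here, which is useful because TO, Resnik and bit scores are on different, unbounded scales.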
For each data set (consisting of randomly selected pairs of mouse genes), we computed similarity scores using TO, NTO, each of the IC-based similarity measures (Resnik, Lin and Jiang, using both the "average" and "maximum" variants for each), and each of the three vector-based measures (Cosine, Weighted Cosine and Kappa), for a total of eleven measures. We then examined how well the scores generated by the different measures correlated with one another. The correlation analysis was repeated for several sets of 10,000 randomly chosen gene pairs. We found the variation between the results for each of the sets to be negligible (Additional file 1), thus only data from the final 100 k gene pair set was studied in detail.
[Table: Correlation of TO scores with various other similarity measures]
[Table: Rank correlation of semantic similarity with sequence similarity]
[Table: Running times for some of the similarity measures]
Normalized Term Overlap
In this work we conducted a detailed study of a measure of gene functional similarity based on Gene Ontology terms, Term Overlap. We found that TO compares very well to other semantic similarity measures, and is easier and faster to compute. This suggests that TO can be used as an alternative to the more complex measures that have been proposed. In addition we demonstrate that in general, the various measures are all highly correlated, with some important exceptions. Here we discuss some of the reasons for differences in performance among the methods.
We find that, among the IC-based methods, the scores from TO correlate best with the Resnik-maximum and Resnik-average scores. Recent studies have shown that the similarity measure proposed by Resnik out-performs the Lin and Jiang methods in terms of correlation with gene sequence similarities [9, 10] and gene expression profiles, consistent with our findings. Our sequence analysis indicates that TO correlates with sequence similarity at least as well as, and in fact slightly better than, Resnik's measure. This suggests that TO is at least as reflective of "true" gene function as the measure used by Lord et al. (2003). However, we point out that overall correlations are low; the difference might be corpus-specific, and there is no unassailable "gold-standard" for evaluating semantic similarity measures.
We found that the Lin and Jiang measures correlate relatively poorly with most of the other methods, while being most similar to NTO. It has been previously shown that the Lin and Jiang methods suffer from what is referred to as the "shallow annotation problem" [7, 23]. This is because Lin and Jiang both use the IC of the query terms as well as that of the LCA. As a result, genes that are annotated at only very shallow levels of the GO hierarchy (e.g., "metabolism") can yield very high semantic similarities. Such pairs are therefore not distinguishable from high-scoring pairs of genes that have "deep" annotations. The effect of shallow annotations can be seen in Figures 3 and 4, where both the Lin and Jiang measures have large numbers of points with scores of 1.0, distributed over a wide range of TO scores, including very low TO values. Thus, although the Lin and Jiang methods attempt to capture the nature of the hierarchy, the shallow annotation problem shows that they can produce misleading results. For example, the gene pair Akap1 (A kinase anchor protein 1) and Bbs9 (Bardet-Biedl syndrome 9) scores a similarity of 1.0 under both the Lin and Jiang methods. Akap1 is a trans-membrane protein that participates in second messenger signalling [24, 25], and has 29 GO terms associated with it (including parents). The function of Bbs9, on the other hand, is poorly understood, and it has only 3 associated terms, including "extracellular space", which it happens to share with Akap1. Despite this weak link, according to both the Lin and Jiang methods these genes are not only similar but generate the maximum attainable score for those measures.
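The shallow annotation effect can be made explicit with a short worked case. The derivation below assumes that Lin's measure takes its standard published form; under that assumption, any term shared directly by both genes scores 1 regardless of how shallow it is, whereas Resnik's measure for the same pair is simply the (low) IC of that term.

```latex
\begin{align*}
\mathrm{sim}_{\mathrm{Lin}}(t_1, t_2)
  &= \frac{2\,\mathrm{IC}(\mathrm{LCA})}{\mathrm{IC}(t_1) + \mathrm{IC}(t_2)} \\
t_1 = t_2 = t \;\Rightarrow\; \mathrm{LCA} = t, \quad
\mathrm{sim}_{\mathrm{Lin}}(t, t)
  &= \frac{2\,\mathrm{IC}(t)}{2\,\mathrm{IC}(t)} = 1
  \qquad (\mathrm{IC}(t) > 0) \\
\mathrm{sim}_{\mathrm{Res}}(t, t) &= \mathrm{IC}(t)
\end{align*}
```

Under the "maximum" method a single such shared term is enough to give the gene pair the top score, which is consistent with the behaviour reported for Akap1 and Bbs9.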
Scrutiny of the data leads us to believe that the NTO scores also suffer from the shallow annotation problem. This is because even if a gene is annotated with only one term, and it shares that term with another gene, the NTO is 1.0. For the previously mentioned gene pair Akap1 and Bbs9, the NTO measure generates a high score of 0.75, whereas the TO measure generates a more appropriate low score of 3.0. The Jaccard, Dice and Czekanowski-Dice methods [14, 20], which are computed in a similar fashion to NTO but use larger normalizing factors, will ameliorate the shallow annotation problem. However, shallow annotation artifacts will persist when comparing pairs of genes where both have few terms. For this reason we favour the raw TO over the normalized overlap measures. On the other hand, the normalized measures have the useful property of yielding values restricted between zero and one.
Others have suggested that using IC values can induce substantial artifacts for terms that are rarely used, but not necessarily very specific. This problem will be particularly acute for organisms with sparse GO annotations. We refer to this problem as the "corpus bias". For example, a high-level, general term such as "cell growth" (GO:0016049) should have a high background probability and low IC. However, if the corpus does not contain many genes involved in cell growth, the term will score a low probability and will be incorrectly identified as a specific, high-IC term. As shown in Figure 7A, most terms near the top of the hierarchy have low IC. However, there are exceptions, as noted above. These terms, located near the top of the hierarchy but with high IC, have two potential causes. First, it is possible that depth in the hierarchy is not always related to semantic information content. In other words, there may be terms near the top of the hierarchy that are as "specific" as terms deep in the hierarchy. Second, there may be true corpus bias, where terms with many children are rarely used. We can partly distinguish between these possibilities by examining the relationship between IC and the number of child terms (Figure 7B). This showed that high-IC terms almost always have few child terms, arguing against corpus bias. It is still unclear whether the cases of terms with few children and few parents are truly "specific" terms or just parts of the GO which are not yet fully fleshed out. For example, the term "chemoattractant activity" (GO:0042056) is a direct child of the root of the molecular function hierarchy, but has no child terms.
TO appears to avoid the shallow annotation problem. However, it may be questioned whether depth in the hierarchy is a strong enough correlate of IC. TO is in effect based entirely on how many parents a term has, with no consideration of the frequency of use by annotators. Thus two genes annotated with low-level terms falling far from the root term, and sharing all parent terms, would obtain a high similarity score. On the other hand, two genes annotated with high-level terms falling closer to the root term, also sharing all parent terms, would obtain a low similarity score. This will work so long as depth in the hierarchy is a reasonably uniform measure of semantic specificity (in effect, information content). The data in Figure 7A suggest this might not be an entirely safe assumption, but the overall good behaviour of the overlap statistic indicates that the assumption is not completely without basis.
The scores generated using the VSM-based measures also show a high correlation with TO (>0.8). These methods rely upon a gene-term annotation matrix that essentially flattens the redundant and structured GO terms into a collection of 'independent' terms. The Kappa and Cosine methods weight each of the GO terms equally by using a binary-valued matrix, which would explain their high correlation with each other and with TO. On the other hand, using weighted values in the matrix delivers scores that still correlate fairly well with TO (higher than the correlation with Resnik), as we found with the Weighted Cosine measure. Both Cosine measures also correlate very well with the NTO measure. This is not surprising, since the dot product of two binary gene vectors equals the term overlap, and thus the two measures differ only in the factor by which they normalize the TO value. This is also in general agreement with results presented recently by Chagoyen et al. (8th Spanish Symposium on Bioinformatics and Computational Biology, 2008).
TO (and NTO) also differs from the other measures in algorithmic complexity. First, computing the IC for each term is expensive, as is compiling the gene-term annotation matrix required for the vector-based methods. The IC requires obtaining counts of each term in GO for all genes, as does the Weighted Cosine method. Making matters worse, in both cases the data generated should in principle be recomputed for the entire database every time the annotations are updated. TO completely avoids this step. In addition, the computation of overlap for a pair of genes is O(N), where N is the number of terms. Computation of the IC-based measures requires pairwise comparison of all terms for the pair of genes, which is O(N^2).
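The difference in per-pair cost can be illustrated by counting operations on two invented annotation sets. This sketch assumes hash-based sets, for which a membership test takes constant time on average; the overlap then needs one lookup per term of the smaller set, while the pairwise scheme needs one term-to-term evaluation per pair of terms.

```python
def overlap_lookups(a, b):
    """Return (overlap, number of membership tests) for two term sets."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    shared = lookups = 0
    for term in small:
        lookups += 1
        if term in large:  # average constant-time lookup in a hash set
            shared += 1
    return shared, lookups

def pairwise_evaluations(a, b):
    """Number of term-to-term comparisons made by the pairwise measures."""
    return len(a) * len(b)

# Hypothetical expanded annotation sets of 30 and 40 terms sharing 10 terms.
GENE1 = {f"t{i}" for i in range(30)}
GENE2 = {f"t{i}" for i in range(20, 60)}

SHARED, LOOKUPS = overlap_lookups(GENE1, GENE2)
PAIRS = pairwise_evaluations(GENE1, GENE2)
```

For these sets the overlap needs 30 lookups against 1200 pairwise evaluations, each of which would also require locating an LCA.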
In summary, given the generally high correlation among the various measures, it seems reasonable to use the simplest and fastest method when high throughput is necessary. Therefore we expect that TO will be of use for rapidly evaluating algorithms predicting gene functional relationships, and in exploring high-throughput experimental data. For example, in Lee et al. (2004), TO was used to evaluate the performance of an algorithm for predicting gene function on the basis of expression profile similarity. TO is fast enough to use in on-line applications, and is used in the Gemma system http://www.bioinformatics.ubc.ca/Gemma to display gene semantic similarities (Hamer et al., in preparation). Semantic similarity computed by TO could be used to evaluate and examine results of high throughput studies such as yeast 2-hybrid screens or proteomic studies.
Supported by a Canadian Institutes for Health Research/Michael Smith Foundation for Health Research graduate scholarship to MM, a Michael Smith Foundation for Health Research Career Investigator Award to PP and NIH grant GM076990 to PP. We thank Kelsey Hamer for technical support.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.