GOSim – an R-package for computation of information theoretic GO similarities between terms and gene products

Background With the increased availability of high throughput data, such as DNA microarray data, researchers are capable of producing large amounts of biological data. During the analysis of such data often there is the need to further explore the similarity of genes not only with respect to their expression, but also with respect to their functional annotation which can be obtained from Gene Ontology (GO). Results We present the freely available software package GOSim, which allows to calculate the functional similarity of genes based on various information theoretic similarity concepts for GO terms. GOSim extends existing tools by providing additional lately developed functional similarity measures for genes. These can e.g. be used to cluster genes according to their biological function. Vice versa, they can also be used to evaluate the homogeneity of a given grouping of genes with respect to their GO annotation. GOSim hence provides the researcher with a flexible and powerful tool to combine knowledge stored in GO with experimental data. It can be seen as complementary to other tools that, for instance, search for significantly overrepresented GO terms within a given group of genes. Conclusion GOSim is implemented as a package for the statistical computing environment R and is distributed under GPL within the CRAN project.


Background
Modern DNA microarray analysis has lead to an enormous collection of experimental data. Different analysis techniques are applied to gain insight into the underlying biological processes: statistical tests are used to find significantly regulated genes. Other methods cluster genes according to their expression profiles [1]. The hypothesis is that genes with expression patterns similar to known genes are involved in similar biological processes. In either case, researchers often end up with long lists of interesting candidate genes that need further examination. At this point, a second step is almost always applied: biologists categorize these genes to known biological functions and thus try to combine experimental outcomes with biological knowledge. Such information is e.g. provided by Gene Ontology (GO). GO has become one of the most widespread systems for systematically annotating gene products within the bioinformatics community and is developed by the Gene Ontology Consortium [2] with the intention to describe gene products with a controlled and structured vocabulary. GO terms are part of a Directed Acyclic Graph (DAG), covering three orthogonal taxonomies: molecular function, biological process and cellular component. Two different kinds of relationship between GO terms exist: the "is-a" relationship and the "part-of" relationship. By providing a standard vocabulary for all biological resources, GO enables researchers to use this information for further data analysis.
The GOSim package provides the user with an easy to use implementation of various information theoretical similarity concepts for GO terms [3][4][5][6][7][8][9]. It additionally implements different methods for computing functional similarities between gene products based on the similarities between the associated GO terms. This can for example be used for clustering genes according to their biological function [10,11] and thus may help to get a better understanding of the biological aspects covered by a set of genes. GOSim can be seen as complementary to existing tools that, for instance, search for significantly overrepresented GO terms within a given group of genes [12]. It extends methods like FuSSiMeg [8] by offering additional similarity measures for GO terms and gene products. With its specific focus, to our knowledge, GOSim is the most comprehensive software package of this kind.

GO term similarities
GOSim concentrates on similarity concepts for GO terms derived from information theory. One of the most wellknown information theoretic similarity measures was introduced by Resnik [3]. It relies on the notion of the socalled minimum subsumer of two GO terms t and t', which is the lowest common ancestor in the GO hierarchy (Figure 1). Its information content IC ms , which can be understood as a measure of similarity between t and t', is given by: Here Pa(t, t') denotes the set of all common (also indirect) ancestors of GO terms t and t', while IC(t) denotes the information content of term t. It is defined as (c.f. [7]) i.e. as the negative logarithm of the probability of observing t. The information content of each GO term can be computed with GOSim for each of the taxonomies molecular function, biological process and cellular component. The calculation is based on counting, how many times a specific GO term or any of its direct or indirect offspring appear in annotated gene products. The association between gene products and GO identifiers is reported reg-ularly by the GO Consortium [2]. The GO Consortium further provides evidence codes on the annotations, which can be used to calculate the information contents of all GO terms on a different basis. GOSim stores the information contents of all GO terms in data files to speed up all following computations. By default, for some combinations of evidence codes the information contents are already precomputed.
Besides Resnik's GO term similarity measure, extensions of Lin [5], and Jiang and Conrath [6] exist, which are included into GOSim as well. Both only differ in the way they normalize (Eq. 1). Jiang and Conrath's similarity measure is defined as i.e. the similarity between t and t' is 0, if their normalized distance is at least 1. Lin's pairwise similarity between GO terms is defined as: GOSim also contains a similarity concept introduced by Couto et al. [9], which is not based on the minimum subsumer, but on the set of all so-called disjunctive common ancestors. Roughly speaking, the idea is not to consider the common ancestor having the highest information content Example of a GO graph starting with leaves GO:0007166 and GO:0007267 Figure 1 Example of a GO graph starting with leaves GO:0007166 and GO:0007267. only, but also others, if they are somehow "separate" from each other, i.e. there is a path to t and t' not passing any other of the disjunctive common ancestors. In our example from Figure 1 the set of disjunctive common ancestors only consists of the minimum subsumer, because any path from the other ancestors to GO:0007166 and GO:0007267 would have to pass the minimum subsumer. Based on the notion of disjunctive common ancestors Resnik's similarity concept can be extended by defining: Likewise, Jiang-Conraths's and Lin's measures can be extended as well by replacing IC ms (t, t') by IC share (t, t'). Finally, it should be mentioned that also the depth and density enriched term similarity by Couto et al. [8] has been integrated into GOSim.

Functional gene similarities
The special strength of GOSim lies in the possibility not only to calculate similarities for individual GO terms, but also for genes based on their complete GO annotation. For this purpose three basic ideas have been implemented: 1. Computation of the maximum and average similarity between any pair of GO terms.
2. Computation of a so-called optimal assignment of terms from one gene to those of another one [11].
3. Embedding of each gene into a feature space defined by the gene's similarity to certain prototype genes [10,11]. Within this feature space similarities naturally arise as dot products between the feature vectors. These dot products can be understood as so-called kernel functions, as used e.g. in Support Vector Machines [13].
Especially the inclusion of the last two methods clearly distinguishes GOSim from existing tools like FuSSiMeg [8]. More information on these methods, including a detailed evaluation, can also be found in our earlier publications [10] and [11].

Maximum and average pairwise GO term similarity
The idea of the maximum pairwise GO term similarity is straight forward and is for instance employed in FuSSiMeg [8]. Given two genes g and g' annotated with GO terms t 1 ,..., t n and we define the functional similarity between between g and g' as where sim is some similarity measure to compare GO terms t i and . In GOSim the resulting value can be further normalized to account for an unequal number of GO terms for both genes: Instead of computing the maximum pairwise GO term similarity one may also take the average here.
Optimal assignment gene similarities Given a similarity concept sim to compare individual GO terms, the idea of an optimal assignment is to assign each term of the gene having fewer GO terms to exactly one term of the other gene such that the overall similarity is maximized (c.f. Figure 2). More formally this can be stated as follows: Let π be some permutation of either an n-sub- ,,..., ,..., Idea of an optimal assignment: each GO term belonging to gene 2 is assigned to exactly one GO term belonging to gene 1 such that the overall GO term similarity is maximized Figure 2 Idea of an optimal assignment: each GO term belonging to gene 2 is assigned to exactly one GO term belonging to gene 1 such that the overall GO term similarity is maximized.
set of natural numbers {1,..., m} or an m-subset of natural numbers {1,..., n} (this will be clear from context). Then we are looking for the quantity The computation of (Eq. 8) corresponds to the solution of the classical maximum weighted bipartite matching (optimal assignment) problem in graph theory and can be carried out in O(max(n, m) 3 ) time [14]. To prevent that larger lists of terms automatically achieve a higher similarity we should again normalize sim gene according to (Eq. 7).

Feature space embedding of gene products
The idea of this method is to calculate for each gene g feature vectors φ(g) by using their similarity to certain prototype genes p 1 ,..., p n : By default the 250 best annotated genes, i.e. which have been annotated with GO terms most often, are used as prototypes, and sim' is the maximum pairwise GO term similarity. Alternatively, one can use the optimal assignment similarity for sim' as well. Both similarity measures can by itself again be combined with arbitrary GO term similarity concepts. The default is that of Jiang and Conrath.
Feature space constructions like in (Eq. 9) are known from the literature on Support Vector Machines and other kernel methods and give rise to so-called "empirical kernel maps" [13].
Because the feature vectors are very high-dimensional we usually perform a principal component analysis (PCA) to project the data into a lower dimensional subspace (Figure 3). The number of principal components is by default chosen such that at least 95% of the total variance in feature space can be explained (this is a relatively conservatve criterion), and the feature vectors are normalized to norm 1. It should be mentioned that in principle one can combine functional similarities between gene products with regard to different GO sub-categories ("biological process", "molecular function", "cellular component"). An obvious way for doing so would be to consider the sum of the respective similarities: sim total (g, g') = sim Ontology1 (g, g') + sim Ontology2 (g, g') (10) Of course, one could also use a weighted averaging scheme here, if desired.

Functional gene clustering
The calculated GO similarities between gene products can be used to cluster genes with respect to their function. The practical usage of this method is highlighted in more detail in an example study on microarray data in the Results Section of this paper.

Cluster evaluations
GOSim has the possibility to evaluate a given clustering of genes or terms by means of their GO similarities. Supposed we have decided to group genes into certain clusters on the basis of other experiments (e.g. microarray). Then we can ask ourselves, how similar these groups are with respect to their GO annotations. GOSim uses the functional similarity between genes to calculate for each cluster the median within cluster similarity and the median absolute deviation (mad). Moreover, a visualization via cluster silhouettes [15] is provided by GOSim as well.
Likewise, different groupings of GO terms can be evaluated in a similar manner. Again, the practical usage of this method is highlighted in more detail in the example study contained in the Results Section.

Implementation
GOSim is implemented as a package within the statistical computing environment R and is distributed under GPL within the CRAN project [16]. Functions of the following Genes embedded into a feature space defined by the GO similarity to certain prototype genes Figure 3 Genes embedded into a feature space defined by the GO similarity to certain prototype genes. principal components analysis was used to reduce the dimensionality of the feature space and the first two principal components are displayed. R packages are internally utilized and are hence required by GOSim to work properly: • GOStats (≥ 1.7) • mclust (≥ 2.1) • cluster (≥ 1.11) Furthermore, Rgraphviz is recommended to visualize GO graphs. To calculate functional similarities between gene products GOSim needs their Entrez Gene IDs. The mapping of these IDs to the Gene Ontology is provided by the R package GO, which is required by the GOStats package. Genes without annotation are filtered out automatically, when GOSim performs similarity calculations for gene products. Their exists a function in GOSim to specify, on which sub-category ("biological process", "molecular function", "cellular component") all computations are based on. Moreover, it is possible to restrict the GO annotation of gene products to arbitrary user defined evidence codes.
To summarize, GOSim provides R-methods for the following tasks: • low-level functions for GO graph traversal • specification of the GO sub-category ("biological process", "molecular function", "cellular component") and of evidence codes • calculation of the information content of GO terms and of the similarity between GO terms (see last Section) • similarity calculation between gene products based on their GO annotations (see last Section) • filtering and printing the GO annotation of a given list of genes • evaluation of a given clustering of genes or terms via precomputed similarities (see last Section)

Data
In this Section we show an example application of the GOSim package to an analysis of DNA microarray data. The data was taken from [17], who derived a gene expression profile for dilated cardiomyopathy based on cDNA and Affymetrix microarray chips. The details of this study can be found in the paper. The number of differentially upregulated genes was 1107 on the cDNA and 336 on the Affymetrix microarray data. The number of differentially downregulated genes was 278 on the cDNA and 67 on the Affymetrix chips.

Methods
The Entrez Gene IDs of differentially upregulated and downregulated genes collected from the study were treated separately for cDNA and Affymetrix chips. 677 of the differentially upregulated genes on the cDNA chips and 230 on the Affymetrix chips showed a mapping to the GO category "biological process". 157 differentially downregulated genes could be mapped on the cDNA and 43 on the Affymetrix chips. We used the GOSim package to calculate gene similarities based on the feature vector representation [10,11] (c.f. Section Background). This was done by defining each gene by its maximum Jiang-Conrath similarity [6] to 250 prototypes genes. Prototype genes were those 250 genes, which were most frequently annotated with GO terms. The similarity between feature vectors x, y was taken as their normalized dot product (c.f. Section Background): Prior to similarity calculation we performed a principal component analysis (PCA) on the feature vectors to reduce their dimensionality. The number of principal components was chosen such that at least 95% of the total variance in feature space could be explained (c.f. Section "Feature Space Embedding of Gene Products"). This way we obtained 29 principal components for the upregulated and 23 principal components for the downregulated genes on the cDNA chips. For the Affymetrix chips the number of principal components was 24 for the upregulated and 11 for the downregulated genes.
Having the functional similarities between regulated genes, we used the GOSim package to compute a hierarchical clustering using Ward's method. We decided to cut the clustering tree at height 0.05 and looked at the clustering silhouettes by employing the GOSim package ( Figures  4, 5, 6, 7). Cluster silhouettes [15] are a classical way of depicting the quality of a given clustering of objects. The silhouette value for each point in a cluster is a measure of how similar that point is to points in its own cluster vs. points in other clusters, and ranges from -1 to +1. It is defined as: where W (i) is the average distance from the i-th point to the other points in its own cluster, and B (i, j) is the average distance from the i-th point to points in another cluster j. The quality for a given cluster can effectively be expressed by the average silhouette value (= cluster silhouette index) of points belonging to that specific cluster.
In contrast, downregulated genes on the cDNA chips are involved into transcription (cluster 2), transport (cluster 4) and immune response (cluster 7). On the Affymetrix chips there is just one identifiable cluster of downregulated genes are involved into metabolism.
The whole clusters are also included in an additional supplement as text files. In conclusion of this analysis one could say that in this data upregulated genes are more involved in protein modification, protein localization, blood coagulation and cell adhesion while downregulated genes are more related to metabolic tasks.
These findings can be seen complementary to those by Bart et al., who found a significant enrichment of GO terms related to protein biosynthesis in upregulated and GO terms related to immune response in downregulated genes. Both results are based on different statistical analysis methods of the data. Functional gene clustering is based on comparing the full GO term profiles for genes of interest, while GO-significance analysis looks for the statistical over-representation of individual GO terms in the full list of regulated genes. An advantage of a functional gene clustering compared to a traditional GO-significance analysis is that the complete list of regulated genes is structured into functionally related groups, which can help later interpretation. This kind of analysis is possible for very large as well as for very small lists. On the other hand functional clusters may also contain several genes, which are regarded as statistical significant, but are actually not differentially expressed (false positives). Hence, the interpretation of small functional clusters should generally be taken with care.

Discussion
The GOSim package offers an easy and straight forward way to gain insights into functional gene groups by using the Gene Ontology. It offers a rich toolbox of similarity concepts for GO terms as well as for gene products. The present example study highlights the usefulness of GOSim in a real life scenario.

Conclusion
The GOSim package integrates information theoretic similarity concepts for GO terms and derived functional similarity measures for gene products in a novel and unique way. Applications thereof are clusterings of gene products with regard to their function [10,11] or scorings of a given grouping of genes or terms with regard to their GO similarities. Both tasks can be performed with GOSim in a simple and straight forward way. It hence provides the user with a flexible and powerful tool to combine biological knowledge with experimental data. GOSim systematically extends existing tools like FuSSiMeg [8] by integrating its functionality and providing additional similarity concepts for gene products [10,11]. GOSim is implemented as a package for the statistical computing environment R and has been integrated into the CRAN project. It is thus made available for a broad community of potential users. More documentation on the implemented methods can be found in the user manual. Detailed quantitative results can be found in [8][9][10][11].

Availability and requirements
• Project name: GOSim The software is also included as a supplement to this paper.