During the last decade, several Gene Ontology (GO) semantic similarity approaches [1–10] have been introduced for assessing the specificity of, and the relationships between, GO terms based on their position in the GO Directed Acyclic Graph (DAG) [11–13]. Terms in the GO DAG are semantically and topologically linked by the relations ‘is_a’ and ‘part_of’, which express the relationship between a given child term and its parents. Semantic similarity approaches are based on these relations between terms, and they enable efficient exploitation of the enormous corpus of biological knowledge embedded in the GO DAG by comparing GO terms and proteins at the functional level. GO semantic similarity measures have been widely used in different contexts of protein analysis, including gene clustering, gene expression data analysis, prediction and validation of molecular interactions, and disease gene prioritization.
Initially, path- or edge-based approaches, which use a distance or the number of edges between terms in the ontology structure, were introduced [15, 16]. For these approaches, the similarity score between two terms is a function of the number of edges on the shortest path between them, so that terms separated by fewer edges are considered more similar. Path-based approaches were criticized for being limited to edge counting, ignoring the positions of terms in the structure and producing uniform similarity scores. Thus, information content based approaches, which rely on a numerical value to convey the description and specificity of a GO term using its position in the structure, were introduced. This numerical value is called the information content (IC) or semantic value of the term. Depending on how the IC of a term is defined, these approaches are divided into two main families: annotation-based and topology-based. Those depending only on the intrinsic topology of the GO structure are referred to as topology-based approaches, while those using the frequencies at which terms occur in the corpus under consideration are referred to as annotation-based approaches.
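As an illustration of the annotation-based family, the sketch below computes IC in the usual Resnik-style form, IC(t) = -ln p(t), where p(t) is the fraction of annotations that fall on t or any of its descendants. The toy ontology, the term names and the annotation counts are hypothetical and serve only to make the example runnable; they are not taken from GO.

```python
import math

# Hypothetical toy ontology: each term maps to its direct parents
# (edges stand in for 'is_a' / 'part_of' relations).
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A", "B"},
    "D": {"A"},
}

# Hypothetical number of direct annotations per term.
DIRECT_COUNTS = {"root": 0, "A": 2, "B": 3, "C": 4, "D": 1}


def ancestors(term, parents=PARENTS):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(parents=PARENTS, counts=DIRECT_COUNTS):
    """Annotation-based IC: ln(total / cumulative count) = -ln p(term)."""
    # An annotation to a term also counts for every ancestor of that term.
    cumulative = {term: 0 for term in parents}
    for term, n in counts.items():
        for anc in ancestors(term, parents):
            cumulative[anc] += n
    total = cumulative["root"]
    # Terms with no annotations at all have undefined IC and are skipped.
    return {t: math.log(total / c) for t, c in cumulative.items() if c > 0}
```

In this sketch the root receives an IC of zero and rarely annotated, specific terms receive the highest values, which is the sense in which IC conveys specificity. A topology-based measure would replace the annotation counts with a quantity derived from the DAG structure alone.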
Annotation-based approaches have been widely analyzed and deployed in many biological applications, and have been shown to outperform path-based models. Most of them are adapted from the Resnik or Jiang & Conrath methods, and are referred to as classical IC-based similarity approaches. These classical approaches use the most informative common ancestor (MICA) of two terms to assess their semantic similarity. Beyond these classical approaches, several other IC-based GO semantic similarity approaches and enhancements have been suggested in order to improve annotation-based measures. These include the graph-based similarity measure (GraSM), developed by Couto et al., which uses all the disjunctive common ancestors (DCA) instead of the MICA, the relevance similarity approach proposed by Schlicker et al., and the information coefficient idea of Li et al. to correct the overestimation of similarity scores in Lin’s metric. However, the reliance of these approaches on the annotation statistics of the terms biases the scores produced. Topology-based approaches, including the GO-universal metric and the Zhang et al. and Wang et al. methods, were proposed to remove the effect of annotation dependence.
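The MICA-based classical measures can be sketched as follows, using their standard textbook forms: Resnik similarity is the IC of the MICA, Lin similarity normalizes twice that value by the summed IC of the two terms, and the Jiang & Conrath measure is a distance built from the same quantities. The toy ontology and the IC values are hypothetical, and the handling of the zero-denominator case is a convention chosen for this sketch only.

```python
# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A", "B"},
    "D": {"A"},
}

# Hypothetical precomputed IC values (illustrative only).
IC = {"root": 0.0, "A": 0.36, "B": 0.36, "C": 0.92, "D": 2.30}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def mica(a, b):
    """Most informative common ancestor: shared ancestor with highest IC."""
    common = ancestors(a) & ancestors(b)
    return max(common, key=lambda t: IC[t])


def resnik(a, b):
    """Resnik similarity: the IC of the MICA."""
    return IC[mica(a, b)]


def lin(a, b):
    """Lin similarity: 2 * IC(MICA) / (IC(a) + IC(b))."""
    denom = IC[a] + IC[b]
    # Undefined when both terms have zero IC; 0.0 is this sketch's convention.
    return 2 * resnik(a, b) / denom if denom > 0 else 0.0


def jiang_conrath_distance(a, b):
    """Jiang & Conrath distance: IC(a) + IC(b) - 2 * IC(MICA)."""
    return IC[a] + IC[b] - 2 * resnik(a, b)
```

Note that the Jiang & Conrath quantity is a distance; tools that report it as a similarity apply a further transformation, and the choice of transformation differs between implementations.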
The main use of GO semantic similarity measures is the computation of protein semantic similarity, or functional similarity between proteins, based on their GO annotations. The completion of several genome sequencing projects has generated immense quantities of sequence data. Subsequently, with the continuous development of new high-throughput methods, the amount of functional data has increased dramatically, justifying the development of dedicated methods and tools that help extract information from these data. GO has successfully provided a way of consistently describing genes and proteins, and a well-adapted platform for computationally processing data at the functional level. Protein functional similarity methods are among the tools that allow integration of the biological knowledge contained in the GO DAG, and have contributed to the improvement of biological analyses. These protein functional similarity measures have been used in several applications, including microarray data analysis, protein-protein interaction assessments, clustering and identification of functional modules in protein-protein interaction networks, and putative disease gene identification.
As with GO term semantic similarity measures, several functional similarity approaches have been proposed. Some of them depend directly on the GO term IC and are referred to as Direct Term- or graph-based approaches; others are constructed from GO term semantic similarity measures and are referred to as Term Semantic-based approaches. The former include approaches derived from the Jaccard, Dice and universal indices based on the Tversky ratio model of similarity, referred to as SimGIC [8, 27], SimDIC and SimUIC, respectively. The latter include the average (Avg), best-match average (BMA) [8, 22], average best matches (ABM) [5, 24], and maximum (Max) combinations of GO term similarities, which are used to calculate protein functional similarities when proteins are annotated to multiple GO terms. The recent proliferation of these measures in the biomedical and bioinformatics areas was accompanied by the development of tools (http://neurolex.org/wiki/Category:Resource:Gene_Ontology_Tools) that facilitate effective exploration of these measures.
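The two families of functional similarity measures can be illustrated with a minimal sketch. The Term Semantic-based combinations (Avg, Max, BMA) take a term-level similarity function and the two annotation sets; the Direct Term-based SimGIC is written in its usual weighted-Jaccard form over ancestor-closed term sets. The exact definitions of BMA and ABM vary across the literature, so the BMA form below (mean of the two directional best-match averages) is an assumption, and ABM is omitted for that reason. All term names, similarity scores and IC values are hypothetical.

```python
from itertools import product

# Hypothetical symmetric term-similarity scores, keyed by sorted term pair.
TOY_SIM = {
    ("B", "B"): 1.0, ("B", "C"): 0.5, ("B", "D"): 0.0,
    ("C", "C"): 1.0, ("C", "D"): 0.2, ("D", "D"): 1.0,
}

# Hypothetical IC values (illustrative only).
IC = {"root": 0.0, "A": 0.36, "B": 0.36, "C": 0.92, "D": 2.30}


def term_sim(a, b):
    """Look up the toy similarity of two terms, ignoring argument order."""
    return TOY_SIM[tuple(sorted((a, b)))]


def avg_sim(terms_p, terms_q, sim=term_sim):
    """Avg: mean of all pairwise term similarities."""
    scores = [sim(s, t) for s, t in product(terms_p, terms_q)]
    return sum(scores) / len(scores)


def max_sim(terms_p, terms_q, sim=term_sim):
    """Max: highest pairwise term similarity."""
    return max(sim(s, t) for s, t in product(terms_p, terms_q))


def bma_sim(terms_p, terms_q, sim=term_sim):
    """BMA (assumed form): mean of the two directional best-match averages."""
    best_p = [max(sim(s, t) for t in terms_q) for s in terms_p]
    best_q = [max(sim(s, t) for s in terms_p) for t in terms_q]
    return 0.5 * (sum(best_p) / len(best_p) + sum(best_q) / len(best_q))


def sim_gic(closed_p, closed_q, ic=IC):
    """SimGIC: IC-weighted Jaccard index of two ancestor-closed term sets."""
    denom = sum(ic[t] for t in closed_p | closed_q)
    if denom == 0:
        return 0.0  # both sets carry no information; convention of this sketch
    return sum(ic[t] for t in closed_p & closed_q) / denom
```

For example, a hypothetical protein annotated to terms C and D can be compared with one annotated to B and C by calling `bma_sim(["C", "D"], ["B", "C"])`, while `sim_gic` expects each annotation set to be extended with all ancestors of its terms beforehand.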
These tools include software packages and web-based online tools. Most of the software packages are implemented in the R programming language [28, 29], among them SemSim and csbl.go. There are also online tools, such as ProteInOn and G-SESAME. In addition, an integrated online tool, the Collaborative Evaluation of Semantic Similarity Measures (CESSM), exists for automated evaluation of GO-based semantic similarity approaches, enabling the comparison of new measures against previously published annotation-based GO similarity measures. Evaluation is done in terms of performance with respect to sequence, Pfam and EC similarity. Note that most of the online tools do not support topology-based approaches. The G-SESAME online tool, designed by Du et al. in the context of the Wang et al. approach, supports only the classical Resnik, Jiang & Conrath and Lin similarity measures for protein or gene clustering applications.
The appropriate choice of functional similarity measure depends on the application [9, 24], since the measures perform differently for different applications: a given measure can perform well in one application but poorly in another. Numerous online tools have been developed, but to the best of our knowledge there is no single tool that exhaustively integrates the IC-based functional similarity metrics and so gives researchers the freedom to choose the most relevant approach for their specific applications. Here, we address this gap with the DaGO-Fun online tool, which integrates up to 27 functional similarity measures, including topology- and annotation-based approaches. This tool also includes some important biological applications directly linked to the use of GO semantic similarity measures, namely the identification of genes based on their GO annotations, the clustering of functionally related genes within a set, and GO term enrichment analysis.