### Measuring the similarity between diseases

Terms in DO include disease names and disease-related concepts. Exploring the similarity between them can help us to understand the relatedness between diseases. The past few years have seen an increase in the number of measures used to calculate semantic similarity. Based on the review by Pesquita et al. [14] of semantic similarity measures applied to biomedical ontologies, and for general applicability, we implemented ten representative semantic similarity measures in DOSim: the Resnik measure [15], the Lin measure [16], the Jiang and Conrath measure (JC) [17], the Relevance measure (Rel) [18], the Graph Information Content measure (GIC) [19], the Information Coefficient similarity measure (simIC) [20], the Wang measure [21], the modified Resnik measure (CoutoResnik) [22], the modified Lin measure (CoutoLin) [22], and the modified Jiang and Conrath measure (CoutoJC) [22]. Except for the Wang measure, which is a hybrid measure, the other nine measures are based on information content (IC).

The IC of a term/disease $t$ in the DO database measures how specific and informative the term is, and is defined as $IC(t) = -\log p(t)$, where $p(t)$ is the number of genes annotated to the term $t$ and its descendants divided by the total number of genes annotated to DO. Two concepts are widely used to characterize the IC shared by two terms: the most informative common ancestor (MICA) and the disjunctive common ancestors (DCAs) [22]. The MICA of two terms $t_1$ and $t_2$ is the term with the maximum IC among all their common ancestors, $CommonAnc(t_1,t_2)$. The DCAs of $t_1$ and $t_2$ are the most informative common ancestors of the disjunctive ancestors of the two terms, which can be defined as follows:

$$DCA(t_1,t_2) = \{\, a_1 \mid a_1 \in CommonAnc(t_1,t_2) \wedge \forall a_2 : [(a_2 \in CommonAnc(t_1,t_2)) \wedge (IC(a_1) \le IC(a_2)) \wedge (a_1 \ne a_2)] \Rightarrow [(a_1,a_2) \in DisjAnc(t_1) \cup DisjAnc(t_2)] \,\}$$

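where the disjunctive ancestors of a term $t$, $DisjAnc(t)$, are defined as follows: two ancestors $a_1$ and $a_2$ are disjunctive ancestors of $t$ if there is a path from $a_1$ to $t$ that does not pass through $a_2$ and a path from $a_2$ to $t$ that does not pass through $a_1$. With $Paths(a,t)$ denoting the set of paths from $a$ to $t$, this can be formulated as follows:

$$DisjAnc(t) = \{\, (a_1,a_2) \mid (\exists p \in Paths(a_1,t) : a_2 \notin p) \wedge (\exists p \in Paths(a_2,t) : a_1 \notin p) \,\}$$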

Then, the shared information of two terms $t_1$ and $t_2$, $Share(t_1,t_2)$, is defined as the average IC of their DCAs, formulated as:

$$Share(t_1,t_2) = \frac{1}{|DCA(t_1,t_2)|} \sum_{a \in DCA(t_1,t_2)} IC(a)$$

Let $t_{MICA}$ represent the MICA of two terms $t_1$ and $t_2$; the nine IC-based similarity measures are then calculated as follows:
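As a rough illustration of the IC-based family, the Python sketch below computes $IC(t)$ from gene annotations on a toy DAG, finds the MICA of two terms, and evaluates four of the measures (Resnik, Lin, Rel and simIC). The toy ontology, the gene sets and all function names are hypothetical, and the formulas are the commonly cited forms from the literature, assumed here rather than taken from the DOSim source code.

```python
import math

# Hypothetical toy ontology: child -> set of parents ("is-a" edges).
PARENTS = {
    "disease": set(),
    "cancer": {"disease"},
    "metabolic": {"disease"},
    "lung_cancer": {"cancer"},
    "breast_cancer": {"cancer"},
}
# Hypothetical direct gene annotations for each term.
DIRECT = {
    "disease": {"g8"},
    "cancer": {"g1"},
    "metabolic": {"g6", "g7"},
    "lung_cancer": {"g2", "g3"},
    "breast_cancer": {"g3", "g4", "g5"},
}

def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def prob(term):
    """p(t): genes annotated to t or its descendants / all annotated genes."""
    all_genes = set().union(*DIRECT.values())
    genes = set()
    for t, direct in DIRECT.items():
        if term in ancestors(t):  # t is `term` or one of its descendants
            genes |= direct
    return len(genes) / len(all_genes)

def ic(term):
    """IC(t) = -log p(t)."""
    return -math.log(prob(term))

def mica(t1, t2):
    """Common ancestor of t1 and t2 with the highest IC."""
    return max(ancestors(t1) & ancestors(t2), key=ic)

def resnik(t1, t2):
    return ic(mica(t1, t2))

def lin(t1, t2):
    denom = ic(t1) + ic(t2)
    return 2 * resnik(t1, t2) / denom if denom > 0 else 0.0

def rel(t1, t2):
    # Assumed form: Lin score weighted by (1 - p(MICA)).
    return lin(t1, t2) * (1 - prob(mica(t1, t2)))

def sim_ic(t1, t2):
    # Assumed form: Lin score weighted by 1 - 1 / (1 + IC(MICA)).
    return lin(t1, t2) * (1 - 1 / (1 + resnik(t1, t2)))

print(mica("lung_cancer", "breast_cancer"), round(lin("lung_cancer", "breast_cancer"), 3))
```

In this toy example the two cancer terms share "cancer" as their MICA, whereas "lung_cancer" and "metabolic" only share the root, whose IC is zero, so all four scores vanish for that pair.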

In the Wang measure, each edge is given a weight according to the type of relationship it represents. For a term $A$, the sub-DAG comprising $A$ and all its ancestor terms can be represented as $DAG_A = (A, T_A, E_A)$, where $T_A$ is the set of ancestor terms of $A$ (including $A$ itself) and $E_A$ is the set of edges connecting the terms in $DAG_A$. For any term $t$ in $DAG_A$, Wang et al. [21] defined the semantic contribution of $t$ to $A$, $D_A(t)$, as the product of all the edge weights along the "best" path from $t$ to $A$, where the "best" path is the one that maximises the product (the semantic contribution of $A$ to itself is set to 1). It can be represented as follows:

$$D_A(t) = \begin{cases} 1 & \text{if } t = A \\ \max\{\, w_e \cdot D_A(t') \mid t' \in \text{children of } t \,\} & \text{if } t \ne A \end{cases}$$

where $w_e$ is the semantic contribution factor of edge $e$ ($e \in E_A$). It is set between 0 and 1 according to the type of relationship, e.g., "is-a" or "part-of". In DO, there is only one type of relationship, "is-a". In DOSim, we set $w_e$ to 0.7.

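The semantic similarity between two terms $A$ and $B$ is then calculated as follows:

$$Sim_{Wang}(A,B) = \frac{\sum_{t \in T_A \cap T_B} \left( D_A(t) + D_B(t) \right)}{SV(A) + SV(B)}$$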

where $SV(A)$ (or $SV(B)$) is the total semantic contribution of all terms in $DAG_A$ (or $DAG_B$) to the term $A$ (or $B$), which is calculated as:

$$SV(A) = \sum_{t \in T_A} D_A(t)$$
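The Wang measure described above can be sketched in a few lines of Python. This is a minimal illustration on a hypothetical toy DAG, not the DOSim implementation; the function names are assumptions, and only the edge weight of 0.7 is taken from the text.

```python
# Hypothetical toy "is-a" DAG: child -> tuple of parents.
PARENTS = {
    "root": (),
    "a": ("root",),
    "b": ("root",),
    "c": ("a", "b"),
    "d": ("a",),
}
W_E = 0.7  # semantic contribution factor for "is-a" edges, as set in DOSim

def semantic_contributions(term, weight=W_E):
    """Return {t: D_term(t)} for every t in the sub-DAG of `term`.

    D_term(term) = 1; each ancestor gets the maximum product of edge
    weights over all paths leading up from `term`.
    """
    contrib = {term: 1.0}
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for parent in PARENTS[child]:
            value = weight * contrib[child]
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value   # found a better path to this ancestor
                frontier.append(parent)   # propagate the improvement upwards
    return contrib

def wang_similarity(term_a, term_b, weight=W_E):
    """Shared semantic contributions divided by SV(A) + SV(B)."""
    d_a = semantic_contributions(term_a, weight)
    d_b = semantic_contributions(term_b, weight)
    shared = d_a.keys() & d_b.keys()
    numerator = sum(d_a[t] + d_b[t] for t in shared)
    return numerator / (sum(d_a.values()) + sum(d_b.values()))

print(round(wang_similarity("c", "d"), 4))
```

Because every edge in DO has the same weight, the "best" path in this setting is simply the shortest path from the ancestor to the term.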

### Measuring the similarity between human genes in terms of diseases

In the DOSim package, the similarity between two genes is calculated from the similarity of their DO term annotation groups. Each gene is represented by its set of direct DO term annotations, and semantic similarity is calculated between the terms in one set and the terms in the other (using one of the measures described above). Some methods consider every pairwise combination of terms from the two sets, while others consider only the best-matching pair for each term. Five different methods are implemented in DOSim: the maximum and the arithmetic mean of the pairwise similarities between the two groups of DO terms describing the two genes (Max, Mean) [23]; the maximum and the average of the two directional comparison scores derived from the similarity matrix $S$ of the two genes (funSimMax, funSimAvg) [18]; and the best-match average approach (BMA) [21], which considers the contributions of the most semantically similar terms annotating the two genes (Formula 23).

Let $DO_1$ and $DO_2$ be the groups of annotation terms for two genes $g_1$ and $g_2$, and let $m$ and $n$ be the numbers of terms in $DO_1$ and $DO_2$, respectively. The similarity matrix $S = [s_{ij}]_{m \times n}$ contains all pairwise similarity scores between the terms of $DO_1$ and $DO_2$: each row holds the scores of one term of $DO_1$ against all terms of $DO_2$, and each column holds the scores of one term of $DO_2$ against all terms of $DO_1$. The $rowScore$ and $columnScore$ of $S$ are the averages over the row maxima and the column maxima, which give similarity scores for the comparison of $DO_1$ to $DO_2$ and the comparison of $DO_2$ to $DO_1$, respectively:

$$rowScore = \frac{1}{m} \sum_{i=1}^{m} \max_{1 \le j \le n} s_{ij}, \qquad columnScore = \frac{1}{n} \sum_{j=1}^{n} \max_{1 \le i \le m} s_{ij}$$

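Using these definitions, the five methods for computing the similarity between two genes $g_1$ and $g_2$ are defined as follows:

$$\begin{aligned}
Sim_{Max}(g_1,g_2) &= \max_{1 \le i \le m,\, 1 \le j \le n} s_{ij} \\
Sim_{Mean}(g_1,g_2) &= \frac{1}{m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} s_{ij} \\
Sim_{funSimMax}(g_1,g_2) &= \max(rowScore,\ columnScore) \\
Sim_{funSimAvg}(g_1,g_2) &= \frac{1}{2}\,(rowScore + columnScore) \\
Sim_{BMA}(g_1,g_2) &= \frac{\sum_{i=1}^{m} \max_{1 \le j \le n} s_{ij} + \sum_{j=1}^{n} \max_{1 \le i \le m} s_{ij}}{m + n}
\end{aligned}$$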
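The five combination methods can be sketched as a single Python function operating on a precomputed term similarity matrix. This is an illustrative sketch only: the function name and the list-of-rows matrix layout are assumptions, and the scores in the example are made up.

```python
def gene_similarity(S, method="BMA"):
    """Combine an m x n term-similarity matrix S (list of rows) into one score."""
    m, n = len(S), len(S[0])
    row_max = [max(row) for row in S]                              # best match for each term of DO_1
    col_max = [max(S[i][j] for i in range(m)) for j in range(n)]   # best match for each term of DO_2
    row_score = sum(row_max) / m
    col_score = sum(col_max) / n

    if method == "Max":
        return max(row_max)
    if method == "Mean":
        return sum(sum(row) for row in S) / (m * n)
    if method == "funSimMax":
        return max(row_score, col_score)
    if method == "funSimAvg":
        return 0.5 * (row_score + col_score)
    if method == "BMA":
        return (sum(row_max) + sum(col_max)) / (m + n)
    raise ValueError(f"unknown method: {method}")

# Made-up scores for a gene with two DO terms versus a gene with three.
S = [[0.9, 0.2, 0.1],
     [0.3, 0.8, 0.4]]
for name in ("Max", "Mean", "funSimMax", "funSimAvg", "BMA"):
    print(name, round(gene_similarity(S, name), 3))
```

Note that funSimAvg and BMA coincide when both genes have the same number of annotation terms, and differ otherwise because BMA weights each term equally rather than each direction.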

For a set of $n$ genes $G = \{g_1, g_2, \ldots, g_n\}$, the similarity matrix for these genes is defined as $Sim = [Sim_{ij}]_{n \times n}$, where $Sim_{ij}$ is the similarity between genes $g_i$ and $g_j$ derived by any of the five methods defined above.

In DOSim, there are a total of fifty optional semantic similarity measures for genes, which are combinations of the ten semantic similarity measures for term pairs and the five similarity methods mentioned above.

### Other utilities

#### Conducting DO enrichment analysis

In DOSim, DO-based enrichment analysis is implemented to explore the disease features of an independent gene set, for example, a set of differentially expressed genes from a microarray analysis. The significance of the enrichment is assessed by the hypergeometric test, and the *p*-values are adjusted by controlling the false discovery rate (FDR). For a DO term $t$ that meets the selection criteria (see below), let $M$ be the number of genes in the human genome annotated to $t$ and $x$ the number of genes in the gene set annotated to $t$. The *p*-value for the enrichment of the gene set in $t$ is then calculated as:

$$p = 1 - \sum_{i=0}^{x-1} \frac{\binom{M}{i} \binom{N-M}{k-i}}{\binom{N}{k}}$$

where $N$ is the total number of human genes in the genome, $k$ is the size of the gene set of interest, and $\binom{N}{k}$ is the number of combinations of the $N$ genes taken $k$ at a time, which is equal to $\frac{N!}{k!\,(N-k)!}$.
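A minimal Python sketch of this test is given below, using the symbols $x$, $M$, $N$ and $k$ from the text. The function names are hypothetical, and the Benjamini-Hochberg procedure is an assumption: the text only states that the *p*-values are adjusted by FDR, not which procedure is used.

```python
from math import comb

def enrichment_p_value(x, M, N, k):
    """Upper-tail hypergeometric probability P(X >= x).

    N: genes in the genome, M: genes annotated to the term,
    k: size of the gene set, x: annotated genes in the gene set.
    Summing the upper tail directly equals 1 - sum_{i=0}^{x-1} of the same terms.
    """
    total = comb(N, k)
    upper = min(M, k)
    hits = sum(comb(M, i) * comb(N - M, k - i) for i in range(x, upper + 1))
    return hits / total

def benjamini_hochberg(p_values):
    """FDR-adjusted p-values (Benjamini-Hochberg; assumed procedure)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for offset, idx in enumerate(order):
        rank = n - offset  # rank of this p-value in ascending order
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

# Toy numbers: 10 genes, 4 annotated to the term, gene set of 3 with 2 hits.
print(round(enrichment_p_value(x=2, M=4, N=10, k=3), 4))
```

Python's arbitrary-precision integers keep the binomial coefficients exact, so the sketch remains usable for genome-scale values of $N$, although a dedicated statistics library would normally be preferred.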

In contrast to FunDO, which uses a small set of DO terms (DOLite) [13], DOSim selects for enrichment analysis the DO terms that satisfy two criteria, with the aim of uncovering more biological results. The first criterion is that the term should be annotated by at least $n$ genes, and the second is that the term should lie beneath depth $m$ in the DAG of DO; both $n$ and $m$ can be set by the user when running the DO enrichment analysis.

In the DOSim package, the *DOEnrichment* function carries out the DO enrichment analysis; its input is a list of Entrez gene IDs. The *filter* and *layer* parameters correspond to the two criteria mentioned above and control which terms are analysed: a term is included only if it is annotated by at least *filter* genes and lies beneath depth *layer* in the DAG of DO.
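The effect of the two criteria can be illustrated with a short Python sketch. It is not the *DOEnrichment* code: the function names and toy data are hypothetical, and two interpretations are assumed, namely that depth is the shortest distance from the root and that "beneath" means a depth of at least `min_depth`.

```python
from collections import deque

def term_depths(children, root):
    """Shortest-path depth of every term reachable from `root` (root = 0)."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, ()):
            if child not in depth:
                depth[child] = depth[term] + 1
                queue.append(child)
    return depth

def select_terms(children, root, gene_counts, min_genes, min_depth):
    """Terms annotated by at least `min_genes` genes at depth >= `min_depth`."""
    depth = term_depths(children, root)
    return sorted(
        term for term, d in depth.items()
        if d >= min_depth and gene_counts.get(term, 0) >= min_genes
    )

# Hypothetical toy ontology (parent -> children) and annotation counts.
children = {"disease": ["cancer", "metabolic"], "cancer": ["lung_cancer", "breast_cancer"]}
gene_counts = {"disease": 500, "cancer": 120, "metabolic": 3, "lung_cancer": 40, "breast_cancer": 4}

print(select_terms(children, "disease", gene_counts, min_genes=5, min_depth=1))
```

Raising either threshold shrinks the set of tested terms, which also reduces the multiple-testing burden of the enrichment analysis.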

#### Detecting and annotating DO-directed gene modules

A gene module is a group of highly correlated genes. In DOSim, gene modules are detected as follows: after the gene similarity matrix for a gene set has been constructed, hierarchical clustering is performed using the standard R function *hclust*, and one of three branch cutting methods is applied (one constant-height cutting method and two dynamic branch cutting methods are embedded in our package) [24].
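To make the constant-height variant concrete, the following Python sketch clusters genes from a similarity matrix. It is an illustration under stated assumptions rather than a port of DOSim: the distance is taken as 1 minus the similarity, complete linkage is used (the default of *hclust*), the function name is hypothetical, and the two dynamic branch cutting methods are not covered.

```python
def cluster_genes(genes, sim, cut_height, linkage=max):
    """Agglomerative clustering on distance = 1 - similarity.

    `linkage=max` gives complete linkage. Merging stops once the closest
    pair of clusters is farther apart than `cut_height`, which matches
    cutting the dendrogram at a constant height.
    """
    clusters = [[i] for i in range(len(genes))]

    def distance(c1, c2):
        return linkage(1.0 - sim[i][j] for i in c1 for j in c2)

    while len(clusters) > 1:
        d, a, b = min(
            (distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d > cut_height:
            break
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [sorted(genes[i] for i in cluster) for cluster in clusters]

# Made-up gene similarity matrix with two obvious modules.
genes = ["g1", "g2", "g3", "g4"]
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
print(cluster_genes(genes, sim, cut_height=0.5))
```

The pairwise search is cubic in the number of genes, which is acceptable for a sketch but not for large gene sets.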

The DOSim package incorporates multilayered enrichment analysis (GO and KEGG annotation) to explore the biological meaning of the detected gene modules. The GO annotations are produced using GOSim [25] and the KEGG annotations are generated using SubpathwayMiner [26]. For both GO and KEGG annotation, the input is a list of Entrez gene IDs, enrichment is assessed by the hypergeometric test, and the output is the set of enriched terms with their *p*-values.

#### Describing and visualizing DO structures and terms

DO is a collection of terms associated with human diseases, organised in a DAG (Figure 1). DOSim also provides utilities for visualising the DO structure, so users need not turn to other tools (e.g., OBO-Edit). Specifically, the hierarchical structure of DO terms can be represented as a *graphNEL* object, and the *getDOGraph* function in DOSim can be used to fetch the DO graph with specified DO terms at its leaves. For a given DO term, DOSim provides a series of functions to extract related terms (e.g., parent and child terms).
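The kind of graph extraction described here can be sketched in Python on a toy fragment. This does not reproduce *getDOGraph* or any other DOSim function; the data and all names below are hypothetical and only illustrate the idea of collecting the sub-DAG that has the specified terms at its leaves.

```python
# Hypothetical toy DO fragment: child -> list of parents ("is-a" edges).
PARENTS = {
    "disease": [],
    "cancer": ["disease"],
    "respiratory_disease": ["disease"],
    "lung_cancer": ["cancer", "respiratory_disease"],
    "breast_cancer": ["cancer"],
}

def parents_of(term):
    """Direct parent terms of `term`."""
    return sorted(PARENTS[term])

def children_of(term):
    """Direct child terms of `term`."""
    return sorted(child for child, parents in PARENTS.items() if term in parents)

def induced_graph(leaves):
    """Nodes and (child, parent) edges of the sub-DAG above the given leaves."""
    nodes, edges = set(leaves), set()
    stack = list(leaves)
    while stack:
        term = stack.pop()
        for parent in PARENTS[term]:
            edges.add((term, parent))
            if parent not in nodes:
                nodes.add(parent)
                stack.append(parent)
    return nodes, edges

nodes, edges = induced_graph(["lung_cancer"])
print(sorted(nodes))
print(sorted(edges))
```

The resulting node and edge sets are what a graph object such as *graphNEL* would hold, and they can be passed to any plotting tool for display.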