# Table 3 Similarity measures used in this study

Name Formula
Euclidean similarity $$s^{2}\left (g_{1}, g_{2}\right)=\frac {1}{1+\left (x_{g1}-x_{g2}\right)\left (x_{g1}-x_{g2}\right)^{'}}$$
Correlation similarity $$s\left (g_{1},g_{2}\right) = \frac {\left (x_{g1}-\overline {x}_{g1}\right)\left (x_{g2}-\overline {x}_{g2}\right)^{'}} {\sqrt {\left (x_{g1}-\overline {x}_{g1}\right)\left (x_{g1}-\overline {x}_{g1}\right)^{'}} \sqrt {\left (x_{g2}-\overline {x}_{g2}\right)\left (x_{g2}-\overline {x}_{g2}\right)^{'}}}$$
where $$\overline {x}_{g1}=\frac {1}{n}\sum _{p \in P}x^{p}_{g1}$$ and $$\overline {x}_{g2}=\frac {1}{n}\sum _{p \in P}x^{p}_{g2}$$
Cosine similarity $$s\left (g_{1},g_{2}\right) = \frac {x_{g1}x_{g2}^{'}}{\sqrt {x_{g1}x_{g1}^{'}} \sqrt {x_{g2}x_{g2}^{'}}}$$
Hamming similarity $$s\left (g_{1},g_{2}\right) = \frac {\left |\left \{p \in P : x^{p}_{g1}=x^{p}_{g2}\right \}\right |}{n}$$
Jaccard similarity $$s\left (g_{1},g_{2}\right) = 1 - \frac {\left |\left \{p \in P : \left (x^{p}_{g1} \neq x^{p}_{g2}\right)\wedge \left (\left (x^{p}_{g1} \neq 0\right) \vee \left (x^{p}_{g2} \neq 0\right)\right)\right \}\right |} {\left |\left \{p \in P : \left (x^{p}_{g1} \neq 0\right) \vee \left (x^{p}_{g2} \neq 0\right)\right \}\right |}$$
Cohen’s kappa $$s\left (g_{1},g_{2}\right)=\frac {p_{0}-p_{c}}{1-p_{c}}$$ where:
- $$p_{0}$$ is the proportion of terms common to profiles $$g_{1}$$ and $$g_{2}$$, and
- $$p_{c}$$ is the proportion of terms common to profiles $$g_{1}$$ and $$g_{2}$$ expected by chance.
TF-IDF similarity $$s\left (g_{1},g_{2}\right) = \max _{p \in P}\left \{x^{p}_{g1}x^{p}_{g2}IDF(p)\right \}$$ where $$IDF(p)=\log \frac {n_{G}}{1+\sum _{g \in G}{x^{p}_{g}}}$$
Resnik’s semantic similarity $$s\left (t_{1},t_{2}\right) = IC\left (t_{MICA}\right)$$ where:
- the Most Informative Common Ancestor is $$t_{MICA}=\operatorname {argmax}_{t \in S\left (t_{1},t_{2}\right)}{IC(t)}$$,
- the information content (IC) of a term t is $$IC(t)=-\log \left (p(t)\right)$$,
- the probability of a term t is $$p(t)=\frac {annotations(t)}{totalAnnotations}$$, and
- $$S\left (t_{1},t_{2}\right)$$ is the set of common ancestors of $$t_{1}$$ and $$t_{2}$$.
Lin’s semantic similarity $$s\left (t_{1},t_{2}\right) = {\frac {{2\cdot IC\left (t_{MICA}\right)}}{IC\left (t_{1}\right)+IC\left (t_{2}\right)}}$$
Schlicker’s semantic similarity $$s\left (t_{1},t_{2}\right) = \frac {2\cdot IC\left (t_{MICA}\right)}{IC\left (t_{1}\right)+IC\left (t_{2}\right)}\cdot \left (1-p\left (t_{MICA}\right)\right)$$
Jiang’s semantic similarity $$s\left (t_{1},t_{2}\right) = 1+2\cdot IC\left (t_{MICA}\right)-\left (IC\left (t_{1}\right)+IC\left (t_{2}\right)\right)$$
Pesquita’s semantic similarity $$s\left (t_{1},t_{2}\right) = \frac {\sum \limits _{t \in S(t_{1},t_{2})}{IC(t)}}{\sum \limits _{t \in P(t_{1},t_{2})}{IC(t)}}$$ where:
- $$P\left (t_{1},t_{2}\right)$$ is the set of ancestors of either $$t_{1}$$ or $$t_{2}$$.
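
The term-level semantic measures above can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the five-term toy ontology, the annotation counts, and the function names. It further assumes that a term counts as its own ancestor, that the logarithm is natural, and that Jiang's measure takes the form 1 + 2·IC(t_MICA) − (IC(t_1) + IC(t_2)).

```python
import math

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {"root": [], "A": ["root"], "B": ["A"], "C": ["A"], "D": ["root"]}
# Hypothetical annotation counts (annotations of descendants included).
ANNOTATIONS = {"root": 8, "A": 4, "B": 2, "C": 1, "D": 4}
TOTAL_ANNOTATIONS = 8


def ancestors(t):
    """Return all ancestors of t; t itself is included (an assumption)."""
    seen = {t}
    stack = [t]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(t):
    """Information content IC(t) = -log p(t), natural log assumed."""
    return -math.log(ANNOTATIONS[t] / TOTAL_ANNOTATIONS)


def mica(t1, t2):
    """Most informative common ancestor: the common ancestor with maximal IC."""
    return max(ancestors(t1) & ancestors(t2), key=ic)


def resnik(t1, t2):
    return ic(mica(t1, t2))


def lin(t1, t2):
    # Undefined when IC(t1) + IC(t2) == 0, e.g. for two root terms.
    return 2 * ic(mica(t1, t2)) / (ic(t1) + ic(t2))


def schlicker(t1, t2):
    p_mica = ANNOTATIONS[mica(t1, t2)] / TOTAL_ANNOTATIONS
    return lin(t1, t2) * (1 - p_mica)


def jiang(t1, t2):
    # Assumed form: 1 + 2*IC(MICA) - (IC(t1) + IC(t2)).
    return 1 + 2 * ic(mica(t1, t2)) - (ic(t1) + ic(t2))


def pesquita(t1, t2):
    common = ancestors(t1) & ancestors(t2)  # S(t1, t2)
    either = ancestors(t1) | ancestors(t2)  # P(t1, t2)
    return sum(ic(t) for t in common) / sum(ic(t) for t in either)


for name, f in [("Resnik", resnik), ("Lin", lin), ("Schlicker", schlicker),
                ("Jiang", jiang), ("Pesquita", pesquita)]:
    print(name, f("B", "C"))
```

Because the information content is not normalised in this sketch, the Jiang value can fall below zero for terms that are far apart.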
1. G is the full set of genes ($$n_{G}=4198$$) and P is the set of $$n_{P}=36$$ phenotypes. $$x_{g}$$ denotes the phenotypic profile of gene g, with $$x^{p}_{g}=1$$ if g shows phenotype p and $$x^{p}_{g}=0$$ otherwise.
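
For binary phenotypic profiles, the profile-based measures in the table can be sketched with NumPy as follows. The function names and the 4 × 4 toy matrix are hypothetical and serve only as illustration. The sketch assumes that n is the profile length, that the Hamming and Jaccard fractions count phenotypes satisfying the stated conditions, and that the Euclidean function returns the right-hand side of the formula as printed. Cohen's kappa is left out because the table does not specify how the chance proportion is computed.

```python
import numpy as np


def euclidean_similarity(x1, x2):
    d = x1 - x2
    return 1.0 / (1.0 + d @ d)


def correlation_similarity(x1, x2):
    # Centre each profile on its own mean, then take the normalised dot product.
    c1, c2 = x1 - x1.mean(), x2 - x2.mean()
    return (c1 @ c2) / np.sqrt((c1 @ c1) * (c2 @ c2))


def cosine_similarity(x1, x2):
    return (x1 @ x2) / np.sqrt((x1 @ x1) * (x2 @ x2))


def hamming_similarity(x1, x2):
    # Fraction of phenotypes on which the two profiles agree.
    return np.mean(x1 == x2)


def jaccard_similarity(x1, x2):
    # Undefined (0/0) when both profiles are all zeros.
    either = (x1 != 0) | (x2 != 0)
    differ = (x1 != x2) & either
    return 1.0 - differ.sum() / either.sum()


def tfidf_similarity(x1, x2, X):
    # X holds one profile per row (genes x phenotypes); n_G is the row count.
    idf = np.log(X.shape[0] / (1.0 + X.sum(axis=0)))
    return np.max(x1 * x2 * idf)


# Hypothetical toy data: 4 genes, 4 phenotypes.
X = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)
g1, g2 = X[0], X[1]

print("Euclidean  ", euclidean_similarity(g1, g2))
print("Correlation", correlation_similarity(g1, g2))
print("Cosine     ", cosine_similarity(g1, g2))
print("Hamming    ", hamming_similarity(g1, g2))
print("Jaccard    ", jaccard_similarity(g1, g2))
print("TF-IDF     ", tfidf_similarity(g1, g2, X))
```

The correlation and cosine functions divide by zero for constant or all-zero profiles, so a real pipeline would need to decide how to treat genes with no phenotypes.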