Table 3 Similarity measures used in this study

From: How can functional annotations be derived from profiles of phenotypic annotations?

Name

Formula

Euclidean similarity

\(s^{2}\left (g_{1}, g_{2}\right)=\frac {1}{1+\left (x_{g1}-x_{g2}\right)\left (x_{g1}-x_{g2}\right)^{'}}\)

Correlation similarity

\(s\left (g_{1},g_{2}\right) = \frac {\left (x_{g1}-\overline {x}_{g1}\right)\left (x_{g2}-\overline {x}_{g2}\right)^{'}} {\sqrt {\left (x_{g1}-\overline {x}_{g1}\right)\left (x_{g1}-\overline {x}_{g1}\right)^{'}} \sqrt {\left (x_{g2}-\overline {x}_{g2}\right)\left (x_{g2}-\overline {x}_{g2}\right)^{'}}}\)

 

where \(\overline {x}_{g1}=\frac {1}{n}\sum _{p \in P}x^{p}_{g1}\) and \(\overline {x}_{g2}=\frac {1}{n}\sum _{p \in P}x^{p}_{g2}\)

Cosine similarity

\(s\left (g_{1},g_{2}\right) = \frac {x_{g1}x_{g2}^{'}}{\sqrt {x_{g1}x_{g1}^{'}} \sqrt {x_{g2}x_{g2}^{'}}}\)

Hamming similarity

\(s\left (g_{1},g_{2}\right) = \frac {\left |\left \{p \in P : x^{p}_{g1}=x^{p}_{g2}\right \}\right |}{n}\)

Jaccard similarity

\(s\left (g_{1},g_{2}\right) = 1 - \frac {\left |\left \{p \in P : \left (x^{p}_{g1} \neq x^{p}_{g2}\right)\wedge \left (\left (x^{p}_{g1} \neq 0\right) \vee \left (x^{p}_{g2} \neq 0\right)\right)\right \}\right |} {\left |\left \{p \in P : \left (x^{p}_{g1} \neq 0\right) \vee \left (x^{p}_{g2} \neq 0\right)\right \}\right |}\)

Cohen’s kappa

\(s\left (g_{1},g_{2}\right)=\frac {p_{0}-p_{c}}{1-p_{c}}\) where:

 

- \(p_{0}\) is the proportion of terms common to profiles \(g_{1}\) and \(g_{2}\), and

 

- \(p_{c}\) is the proportion of terms common to profiles \(g_{1}\) and \(g_{2}\) expected by chance.

TF-IDF similarity

\(s\left (g_{1},g_{2}\right) = \max _{p \in P}\left \{x^{p}_{g1}x^{p}_{g2}IDF(p)\right \}\) where \(IDF(p)=\log \frac {n_{G}}{1+\sum _{g \in G}{x^{p}_{g}}}\)

Resnik’s semantic similarity

\(s\left (t_{1},t_{2}\right) = IC\left (t_{MICA}\right)\) where:

 

- the Most Informative Common Ancestor is \(t_{MICA}={argmax}_{t \in S\left (t_{1},t_{2}\right)}{IC(t)}\),

 

- the information content (IC) of a term t is \(IC(t)=-\log \left (p(t)\right)\),

 

- the probability of a term t is \(p(t)=\frac {annotations(t)}{totalAnnotations}\), and

 

- \(S\left (t_{1},t_{2}\right)\) is the set of common ancestors of \(t_{1}\) and \(t_{2}\).

Lin’s semantic similarity

\(s\left (t_{1},t_{2}\right) = {\frac {{2\cdot IC\left (t_{MICA}\right)}}{IC\left (t_{1}\right)+IC\left (t_{2}\right)}}\)

Schlicker’s semantic similarity

\(s\left (t_{1},t_{2}\right) = \frac {2\cdot IC\left (t_{MICA}\right)}{IC\left (t_{1}\right)+IC\left (t_{2}\right)}\cdot \left (1-p\left (t_{MICA}\right)\right)\)

Jiang’s semantic similarity

\(s\left (t_{1},t_{2}\right) = 1+2\cdot IC\left (t_{MICA}\right)-\left (IC\left (t_{1}\right)+IC\left (t_{2}\right)\right)\)

Pesquita’s semantic similarity

\(s\left (t_{1},t_{2}\right) = \frac {\sum \limits _{t \in S(t_{1},t_{2})}{IC(t)}}{\sum \limits _{t \in P(t_{1},t_{2})}{IC(t)}}\) where:

 

- \(P\left (t_{1},t_{2}\right)\) is the set of ancestors of either \(t_{1}\) or \(t_{2}\).
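
The term-level (semantic) measures in the table can be illustrated with a small Python sketch. It is not the implementation used in the study: the toy ontology `PARENTS`, the annotation counts `ANNOTATIONS`, and all function names are hypothetical. It also assumes that a term counts as its own ancestor, that annotation counts already include descendants, and that Jiang's measure takes the usual form 1 + 2·IC(t_MICA) − (IC(t_1) + IC(t_2)).

```python
import math

# Hypothetical toy ontology: term -> set of direct parents (a DAG).
PARENTS = {
    "root": set(),
    "a": {"root"},
    "b": {"root"},
    "c": {"a"},
    "d": {"a"},
}
# Hypothetical annotation counts, assumed to already include descendants.
ANNOTATIONS = {"root": 10, "a": 6, "b": 4, "c": 2, "d": 3}
TOTAL = ANNOTATIONS["root"]


def ancestors(t):
    """All ancestors of t, including t itself (an assumption of this sketch)."""
    seen = {t}
    stack = [t]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def p(t):
    """Probability of a term: annotations(t) / totalAnnotations."""
    return ANNOTATIONS[t] / TOTAL


def ic(t):
    """Information content: IC(t) = -log(p(t))."""
    return -math.log(p(t))


def mica(t1, t2):
    """Most Informative Common Ancestor: the common ancestor with highest IC."""
    return max(ancestors(t1) & ancestors(t2), key=ic)


def resnik(t1, t2):
    return ic(mica(t1, t2))


def lin(t1, t2):
    denom = ic(t1) + ic(t2)
    # Guard for two terms with IC = 0 (e.g. the root compared with itself).
    return 2 * resnik(t1, t2) / denom if denom else 0.0


def schlicker(t1, t2):
    return lin(t1, t2) * (1 - p(mica(t1, t2)))


def jiang(t1, t2):
    # Assumed form: 1 + 2*IC(MICA) - (IC(t1) + IC(t2)).
    return 1 + 2 * resnik(t1, t2) - (ic(t1) + ic(t2))


def pesquita(t1, t2):
    common = ancestors(t1) & ancestors(t2)
    either = ancestors(t1) | ancestors(t2)
    denom = sum(ic(t) for t in either)
    return sum(ic(t) for t in common) / denom if denom else 0.0
```

In this toy ontology, `mica("c", "d")` is `"a"`, so `resnik("c", "d")` equals `-log(0.6)`, while terms whose only common ancestor is the root (for example `"c"` and `"b"`) get a Resnik similarity of 0.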

  1. G is the full set of genes (\(n_{G}=4198\)) and P is the set of phenotypes (\(n_{P}=36\)). \(x_{g}\) denotes the phenotypic profile of gene g, with \(x^{p}_{g}=1\) if g shows phenotype p and \(x^{p}_{g}=0\) otherwise.
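
The profile-based measures in the table can be sketched in plain Python on binary profiles, as defined in the note above. This is only an illustration under stated assumptions, not the code used in the study: the function names and the four example profiles are hypothetical, n is taken to be the profile length, and Cohen's kappa is written in its standard two-rater form (agreement on shared presences and shared absences), which may differ from the exact definition the authors applied.

```python
import math


def dot(u, v):
    """Inner product of two equal-length profiles."""
    return sum(a * b for a, b in zip(u, v))


def euclidean_sim(x1, x2):
    diff = [a - b for a, b in zip(x1, x2)]
    return 1 / (1 + dot(diff, diff))


def correlation_sim(x1, x2):
    # Undefined (division by zero) for constant profiles; not handled here.
    m1, m2 = sum(x1) / len(x1), sum(x2) / len(x2)
    c1 = [a - m1 for a in x1]
    c2 = [b - m2 for b in x2]
    return dot(c1, c2) / (math.sqrt(dot(c1, c1)) * math.sqrt(dot(c2, c2)))


def cosine_sim(x1, x2):
    # Undefined for all-zero profiles; not handled here.
    return dot(x1, x2) / (math.sqrt(dot(x1, x1)) * math.sqrt(dot(x2, x2)))


def hamming_sim(x1, x2):
    """Fraction of phenotypes on which the two profiles agree."""
    return sum(a == b for a, b in zip(x1, x2)) / len(x1)


def jaccard_sim(x1, x2):
    # Undefined when both profiles are all zero; not handled here.
    either = sum(1 for a, b in zip(x1, x2) if a != 0 or b != 0)
    differ = sum(1 for a, b in zip(x1, x2) if a != b and (a != 0 or b != 0))
    return 1 - differ / either


def cohen_kappa(x1, x2):
    # Assumption: standard kappa for two binary raters.
    n = len(x1)
    p0 = hamming_sim(x1, x2)                # observed agreement
    q1, q2 = sum(x1) / n, sum(x2) / n       # fraction of 1s in each profile
    pc = q1 * q2 + (1 - q1) * (1 - q2)      # agreement expected by chance
    return (p0 - pc) / (1 - pc)             # undefined when pc == 1


def tfidf_sim(x1, x2, profiles):
    """Max over phenotypes of x1[p] * x2[p] * IDF(p), IDF from all profiles."""
    n_genes = len(profiles)

    def idf(p):
        return math.log(n_genes / (1 + sum(g[p] for g in profiles)))

    return max(x1[p] * x2[p] * idf(p) for p in range(len(x1)))


# Hypothetical example: four genes, four phenotypes.
g1 = [1, 1, 0, 0]
g2 = [1, 0, 1, 0]
profiles = [g1, g2, [0, 0, 0, 1], [0, 1, 0, 0]]
```

For `g1` and `g2` the profiles agree on two of four phenotypes, so `hamming_sim(g1, g2)` is 0.5, and they share one of the three phenotypes present in either profile, so `jaccard_sim(g1, g2)` is 1/3.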