Table 3 Similarity measures used in this study

From: How can functional annotations be derived from profiles of phenotypic annotations?

Name Formula
Euclidean similarity \(s^{2}\left (g_{1}, g_{2}\right)=\frac {1}{1+\left (x_{g1}-x_{g2}\right)\left (x_{g1}-x_{g2}\right)^{'}}\)
Correlation similarity \(s\left (g_{1},g_{2}\right) = \frac {\left (x_{g1}-\overline {x}_{g1}\right)\left (x_{g2}-\overline {x}_{g2}\right)^{'}} {\sqrt {\left (x_{g1}-\overline {x}_{g1}\right)\left (x_{g1}-\overline {x}_{g1}\right)^{'}} \sqrt {\left (x_{g2}-\overline {x}_{g2}\right)\left (x_{g2}-\overline {x}_{g2}\right)^{'}}}\)
  where \(\overline {x}_{g1}=\frac {1}{n}\sum _{p \in P}x^{p}_{g1}\) and \(\overline {x}_{g2}=\frac {1}{n}\sum _{p \in P}x^{p}_{g2}\)
Cosine similarity \(s\left (g_{1},g_{2}\right) = \frac {x_{g1}x_{g2}^{'}}{\sqrt {x_{g1}^{'}x_{g1}} \sqrt {x_{g2}^{'}x_{g2}}}\)
Hamming similarity \(s\left (g_{1},g_{2}\right) = \frac {\#\left (x^{p}_{g1}=x^{p}_{g2}\right)}{n}\)
Jaccard similarity \(s\left (g_{1},g_{2}\right) = 1 - \frac {\#\left [\left (x^{p}_{g1} \neq x^{p}_{g2}\right)\wedge \left (\left (x^{p}_{g1} \neq 0\right) \vee \left (x^{p}_{g2} \neq 0\right)\right)\right ]} {\#\left [\left (x^{p}_{g1} \neq 0\right) \vee \left (x^{p}_{g2} \neq 0\right)\right ]}\)
  where \(\#(\cdot)\) is the number of phenotypes \(p \in P\) for which the condition holds
Cohen’s kappa \(s\left (g_{1},g_{2}\right)=\frac {p_{0}-p_{c}}{1-p_{c}}\) where:
  - \(p_{0}\) is the proportion of terms common to profiles \(g_{1}\) and \(g_{2}\), and
  - \(p_{c}\) is the proportion of terms common to profiles \(g_{1}\) and \(g_{2}\) expected by chance.
TF-IDF similarity \(s\left (g_{1},g_{2}\right) = \max _{p \in P}\left \{x^{p}_{g1}x^{p}_{g2}IDF(p)\right \}\) where \(IDF(p)=\log \frac {n_{G}}{1+\sum _{g \in G}{x^{p}_{g}}}\)
Resnik’s semantic similarity \(s\left (t_{1},t_{2}\right) = IC\left (t_{MICA}\right)\) where:
  - the Most Informative Common Ancestor is \(t_{MICA}={argmax}_{t \in S\left (t_{1},t_{2}\right)}{IC(t)}\),
  - the information content (IC) of a term \(t\) is \(IC(t)=-\log (p(t))\),
  - the probability of a term \(t\) is \(p(t)=\frac {annotations(t)}{totalAnnotations}\), and
  - \(S\left (t_{1},t_{2}\right)\) is the set of common ancestors of \(t_{1}\) and \(t_{2}\).
Lin’s semantic similarity \(s\left (t_{1},t_{2}\right) = {\frac {{2\cdot IC\left (t_{MICA}\right)}}{IC\left (t_{1}\right)+IC\left (t_{2}\right)}}\)
Schlicker’s semantic similarity \(s\left (t_{1},t_{2}\right) = \frac {2\cdot IC\left (t_{MICA}\right)}{IC\left (t_{1}\right)+IC\left (t_{2}\right)}\cdot \left (1-p\left (t_{MICA}\right)\right)\)
Jiang’s semantic similarity \(s\left (t_{1},t_{2}\right) = 1+2\cdot IC\left (t_{MICA}\right)-\left (IC\left (t_{1}\right)+IC\left (t_{2}\right)\right)\)
Pesquita’s semantic similarity \(s\left (t_{1},t_{2}\right) = \frac {\sum \limits _{t \in S(t_{1},t_{2})}{IC(t)}}{\sum \limits _{t \in P(t_{1},t_{2})}{IC(t)}}\) where:
  - \(P\left (t_{1},t_{2}\right)\) is the set of ancestors of either \(t_{1}\) or \(t_{2}\).
  1. \(G\) is the full set of genes (\(n_{G}=4198\)) and \(P\) is the set of \(n_{P}=36\) phenotypes. \(x_{g}\) denotes the phenotypic profile of gene \(g\), with \(x^{p}_{g}=1\) if \(g\) shows phenotype \(p\) and \(x^{p}_{g}=0\) otherwise.
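
As an illustration of the profile-based rows of the table, the following is a minimal NumPy sketch of the Euclidean, correlation, cosine, Hamming, Jaccard and TF-IDF similarities on binary phenotypic profiles. It is not the study's code: the function names and the toy `profiles` matrix (4 genes by 5 phenotypes) are hypothetical, and the sketch assumes that \(n\) in the Hamming formula is the number of phenotypes and that the Hamming and Jaccard numerators and denominators are counts over phenotypes.

```python
import numpy as np

def euclidean_sim(x, y):
    d = x - y
    return 1.0 / (1.0 + d @ d)          # 1 / (1 + squared Euclidean distance)

def correlation_sim(x, y):
    # Pearson correlation; undefined (division by zero) for constant profiles.
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

def cosine_sim(x, y):
    # Undefined for an all-zero profile.
    return (x @ y) / np.sqrt((x @ x) * (y @ y))

def hamming_sim(x, y):
    return np.mean(x == y)              # fraction of phenotypes on which the profiles agree

def jaccard_sim(x, y):
    either = (x != 0) | (y != 0)        # phenotypes shown by at least one gene
    differ = (x != y) & either          # ... on which the two genes disagree
    return 1.0 - differ.sum() / either.sum()

def tfidf_sim(x, y, profiles):
    n_genes = profiles.shape[0]
    idf = np.log(n_genes / (1.0 + profiles.sum(axis=0)))  # one IDF value per phenotype
    return np.max(x * y * idf)          # best-weighted shared phenotype

# Hypothetical toy data: rows are genes, columns are phenotypes.
profiles = np.array([
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
])
g1, g2 = profiles[0], profiles[1]

for name, value in [
    ("euclidean", euclidean_sim(g1, g2)),
    ("correlation", correlation_sim(g1, g2)),
    ("cosine", cosine_sim(g1, g2)),
    ("hamming", hamming_sim(g1, g2)),
    ("jaccard", jaccard_sim(g1, g2)),
    ("tf-idf", tfidf_sim(g1, g2, profiles)),
]:
    print(f"{name:12s}{value:.4f}")
```

In this toy example the two profiles differ on a single phenotype, so the Euclidean similarity is 1/(1+1) = 0.5 and the Hamming similarity is 4/5 = 0.8. Cohen's kappa is left out because the table does not specify how \(p_{c}\) is estimated.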
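
The information-content measures in the table (Resnik, Lin, Schlicker and Pesquita) can be sketched in the same spirit. The ontology, the annotation counts and all function names below are hypothetical, chosen only to make the arithmetic easy to follow. The sketch also assumes that a term counts as one of its own ancestors and that a term's annotation count includes those of its descendants; the table does not state either convention.

```python
import math

# Hypothetical toy ontology: each term maps to its ancestors (the term itself included).
ANCESTORS = {
    "root": {"root"},
    "a": {"a", "root"},
    "b": {"b", "a", "root"},
    "c": {"c", "a", "root"},
}
# Hypothetical annotation counts; a term's count includes its descendants' annotations.
ANNOTATIONS = {"root": 8, "a": 4, "b": 1, "c": 2}
TOTAL_ANNOTATIONS = ANNOTATIONS["root"]

def p(t):
    return ANNOTATIONS[t] / TOTAL_ANNOTATIONS

def ic(t):
    return -math.log(p(t))              # information content of term t

def common_ancestors(t1, t2):
    return ANCESTORS[t1] & ANCESTORS[t2]

def mica(t1, t2):
    # Most Informative Common Ancestor: the common ancestor with the highest IC.
    return max(common_ancestors(t1, t2), key=ic)

def resnik(t1, t2):
    return ic(mica(t1, t2))

def lin(t1, t2):
    # Undefined when both terms have zero IC (e.g. root compared with root).
    return 2.0 * resnik(t1, t2) / (ic(t1) + ic(t2))

def schlicker(t1, t2):
    return lin(t1, t2) * (1.0 - p(mica(t1, t2)))

def pesquita(t1, t2):
    union = ANCESTORS[t1] | ANCESTORS[t2]
    return sum(ic(t) for t in common_ancestors(t1, t2)) / sum(ic(t) for t in union)

for name, fn in [("resnik", resnik), ("lin", lin),
                 ("schlicker", schlicker), ("pesquita", pesquita)]:
    print(f"{name:10s}{fn('b', 'c'):.4f}")
```

For terms `b` and `c` the most informative common ancestor is `a`, with \(p(a)=0.5\), so the Resnik value is \(\log 2\), the Lin value is \(2\log 2/(\log 8+\log 4)=0.4\), and the Schlicker value halves that to 0.2.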