Revealing and avoiding bias in semantic similarity scores for protein pairs

BMC Bioinformatics

Table 1 Summary of 14 semantic similarity scores for protein pairs.

Measure	Description	Range
Similarity scores for term pairs
Resnik [6]	Information content of the most informative common ancestor of two terms	≥ 0
Lin [5]	Resnik similarity normalized by how close the two terms are to their most informative common ancestor	[0, 1)
RS [4]	Lin similarity weighted by the annotation probability of the most informative common ancestor	[0, 1)
Jiang [7]	Based on the difference in information content between the two terms and their most informative common ancestor	(0, 1]
Similarity scores for protein pairs based on pairwise similarity scores between term groups
AVG [2]	The average of the similarity scores for all pairs of terms between two groups of protein annotations	Same as the range of the corresponding similarity score for term pairs
BMA [3]	The score of the best-matching pairs between two groups of protein annotations	Same as the range of the corresponding similarity score for term pairs
Similarity scores for protein pairs based on groupwise similarity scores between term groups
TO [9]	The number of terms shared by the annotations for two proteins	≥ 1
NTO [9]	TO divided by the minimum of the annotation lengths of the two proteins	(0, 1]
Dice [12]	TO divided by the average of the annotation lengths of the two proteins	(0, 1]
Kappa [11]	A chance-corrected measure of co-occurrence between two groups of protein annotations	[0, 1]
GIC [8]	Jaccard index weighted by the information content of each GO term	[0, 1]
VSM [10]	Cosine similarity weighted by the information content of each GO term	[0, 1]
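
To make the set-based definitions in Table 1 concrete, the following is a minimal Python sketch of TO, NTO, Dice, GIC, AVG and BMA. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy term labels and the toy information content values are hypothetical, annotations are treated as plain sets without expanding ancestor terms (so TO can be 0 here), and BMA is written in one common best-match-average form, which may differ from the variant used in [3].

```python
from itertools import product


def term_overlap(a, b):
    """TO: number of terms shared by two annotation sets."""
    return len(set(a) & set(b))


def normalized_term_overlap(a, b):
    """NTO: TO divided by the size of the smaller annotation set."""
    return term_overlap(a, b) / min(len(set(a)), len(set(b)))


def dice(a, b):
    """Dice: TO divided by the average size of the two annotation sets."""
    return 2 * term_overlap(a, b) / (len(set(a)) + len(set(b)))


def gic(a, b, ic):
    """GIC: Jaccard index with each term weighted by its information content."""
    a, b = set(a), set(b)
    union_ic = sum(ic[t] for t in a | b)
    return sum(ic[t] for t in a & b) / union_ic if union_ic else 0.0


def avg_score(a, b, term_sim):
    """AVG: mean term-pair similarity over all pairs between the two sets."""
    pairs = list(product(a, b))
    return sum(term_sim(s, t) for s, t in pairs) / len(pairs)


def bma_score(a, b, term_sim):
    """BMA (assumed form): mean of each term's best match in the other set."""
    best_for_a = [max(term_sim(s, t) for t in b) for s in a]
    best_for_b = [max(term_sim(s, t) for s in a) for t in b]
    return (sum(best_for_a) + sum(best_for_b)) / (len(a) + len(b))


# Toy data: hypothetical term labels and information content values.
p1 = ["t1", "t2", "t3"]
p2 = ["t2", "t3", "t4", "t5"]
toy_ic = {"t1": 1.0, "t2": 2.0, "t3": 3.0, "t4": 4.0, "t5": 0.0}


def exact_match(s, t):
    """Stand-in term-pair similarity: 1 for identical terms, 0 otherwise."""
    return 1.0 if s == t else 0.0


to_value = term_overlap(p1, p2)
nto_value = normalized_term_overlap(p1, p2)
dice_value = dice(p1, p2)
gic_value = gic(p1, p2, toy_ic)
avg_value = avg_score(p1, p2, exact_match)
bma_value = bma_score(p1, p2, exact_match)
```

In practice `exact_match` would be replaced by a term-pair score such as Resnik, Lin, RS or Jiang, which is how each of AVG and BMA yields four protein-pair scores. Even on this toy input the two aggregations diverge: AVG is pulled down by the many non-matching pairs, while BMA only considers each term's best partner.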

ISSN: 1471-2105