Table 2. Log of probability measures for all the methods using rigorous values. The best performance in each row is marked with an asterisk.

From: Improving a gold standard: treating human relevance judgments of MEDLINE document pairs

Judge  M1      M2      M3      M23     M4      M5
0      -8897   -8884   -8384   -8704   -8501   -8202*
1      -7103   -7085   -7006   -6843   -6940   -6690*
2      -6900   -6884   -6889   -6687   -6701   -6371*
3      -6806   -6729   -6699   -6493   -6734   -6192*
4      -7694   -7637   -7501*  -7560   -8121   -9350
5      -7131   -7045   -6912   -6872*  -7259   -7514
6      -7044   -6993   -6884   -6814*  -7026   -7237
7      -7110   -7149   -7035   -6876   -6557   -6446*
8      -7354   -7521   -7374   -7266   -6559*  -6838
9      -7122   -7040   -7100   -6911*  -7004   -7125
10     -8032   -8128   -7862   -7881   -7576   -7545*
11     -7281   -7123   -7071   -7008*  -7450   -7593
12     -8153   -8305   -8044   -8047   -7694*  -8056
Ave    -7433   -7425   -7289   -7228   -7240   -7320
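
The asterisks and the Ave row can be rechecked from the table values alone. The following is a minimal sketch, not the authors' code. It assumes that "best" means the highest (least negative) log probability in a row, and that Ave is the column mean over the 13 judges rounded to the nearest integer. The method labels M1 to M5 and M23, and the names `TABLE`, `best_method_per_judge` and `column_averages`, are illustrative choices.

```python
# Values transcribed from Table 2: one row per judge (0-12), one column per method.
METHODS = ["M1", "M2", "M3", "M23", "M4", "M5"]  # labels as read from the header
TABLE = [
    [-8897, -8884, -8384, -8704, -8501, -8202],
    [-7103, -7085, -7006, -6843, -6940, -6690],
    [-6900, -6884, -6889, -6687, -6701, -6371],
    [-6806, -6729, -6699, -6493, -6734, -6192],
    [-7694, -7637, -7501, -7560, -8121, -9350],
    [-7131, -7045, -6912, -6872, -7259, -7514],
    [-7044, -6993, -6884, -6814, -7026, -7237],
    [-7110, -7149, -7035, -6876, -6557, -6446],
    [-7354, -7521, -7374, -7266, -6559, -6838],
    [-7122, -7040, -7100, -6911, -7004, -7125],
    [-8032, -8128, -7862, -7881, -7576, -7545],
    [-7281, -7123, -7071, -7008, -7450, -7593],
    [-8153, -8305, -8044, -8047, -7694, -8056],
]


def best_method_per_judge(table, methods):
    """Return, for each row, the method with the highest (least negative) value.

    Assumption: a higher log probability is the better performance.
    """
    return [methods[max(range(len(row)), key=row.__getitem__)] for row in table]


def column_averages(table):
    """Return the mean of each column, rounded to the nearest integer."""
    n_rows = len(table)
    return [round(sum(column) / n_rows) for column in zip(*table)]


best = best_method_per_judge(TABLE, METHODS)
averages = column_averages(TABLE)

for judge, method in enumerate(best):
    print(f"judge {judge}: best method {method}")
print("column averages:", dict(zip(METHODS, averages)))
```

Under these assumptions, the per-row winners match the asterisked cells and the rounded column means match the Ave row.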