
Table 5 To measure which of two methods better predicts the individual class values assigned by a test judge, we apply the signed rank test. For each method, we also count the query-document pairs for which its predicted probability of the class value is larger, along with the number of ties. An asterisk marks the better result when the difference has a p-value less than 0.05 by the signed rank test. The optimal parameters are those of the single-parameter optimizations of Table 1.

From: Improving a gold standard: treating human relevance judgments of MEDLINE document pairs

M4 vs M5

Judge    M4       M5       Ties
0        1992     3008*    0
1        2546     2454*    0
2        2864*    2136     0
3        2598     2402*    0
4        2148     2851*    1
5        2247     2753*    0
6        2527     2473*    0
7        3392*    1608     0
8        3798*    1202     0
9        2676     2324*    0
10       2802*    2198     0
11       2084     2916*    0
12       2938*    2062     0
Total    34612    30387    1
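
The per-judge comparison described in the caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `count_wins` and `signed_rank_test` and the toy probabilities are hypothetical, and the test shown is a two-sided Wilcoxon signed rank test with a normal approximation (zero differences dropped, tied magnitudes given average ranks, no tie correction in the variance). The source does not say which variant was used.

```python
import math


def count_wins(p_a, p_b):
    """Count pairs where method A's probability is larger, B's is larger, or equal."""
    a_wins = sum(1 for x, y in zip(p_a, p_b) if x > y)
    b_wins = sum(1 for x, y in zip(p_a, p_b) if y > x)
    ties = len(p_a) - a_wins - b_wins
    return a_wins, b_wins, ties


def signed_rank_test(p_a, p_b):
    """Two-sided Wilcoxon signed rank test (normal approximation; an assumption).

    Returns (W+, p-value), where W+ is the sum of ranks of the pairs
    on which method A has the larger probability.
    """
    # Drop zero differences, as in the classical form of the test.
    diffs = [x - y for x, y in zip(p_a, p_b) if x != y]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank the absolute differences, giving tied magnitudes their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # sorted positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w_plus, p_value


if __name__ == "__main__":
    # Hypothetical predicted probabilities for four query-document pairs.
    method_a = [0.9, 0.6, 0.5, 0.2]
    method_b = [0.5, 0.8, 0.5, 0.1]
    print(count_wins(method_a, method_b))
    print(signed_rank_test(method_a, method_b))
```

Because the signed rank test weighs the magnitudes of the differences and not only their signs, a method can win fewer pairs and still be favored by the test. That would be consistent with rows such as judge 1, where the asterisk falls on the smaller count.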