Table 4. To determine which of two methods better predicts the individual class values assigned by a test judge, we apply the signed rank test. For each pair of methods we also count the query-document pairs on which each method gives the larger predicted probability of the class value, along with the number of ties (column "="). An asterisk marks the better result when the difference has a p-value below 0.05 by the signed rank test. The optimal parameters are obtained through the rigorous induction method, as in Table 2.

From: Improving a gold standard: treating human relevance judgments of MEDLINE document pairs

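| Judge | M23 vs M4: M23 | M23 vs M4: M4 | M23 vs M4: = | M23 vs M5: M23 | M23 vs M5: M5 | M23 vs M5: = | M4 vs M5: M4 | M4 vs M5: M5 | M4 vs M5: = |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2326 | 2674* | 0 | 1763 | 3237* | 0 | 2197 | 2803* | 0 |
| 1 | 2576* | 2423 | 1 | 1741 | 3259* | 0 | 1808 | 3192* | 0 |
| 2 | 2336* | 2664 | 0 | 1580 | 3420* | 0 | 1892 | 3108* | 0 |
| 3 | 2637* | 2363 | 0 | 1616 | 3384* | 0 | 1592 | 3408* | 0 |
| 4 | 3130* | 1870 | 0 | 2817* | 2183 | 0 | 2788 | 2212 | 0 |
| 5 | 2955* | 2045 | 0 | 2463 | 2537* | 0 | 2341 | 2659* | 0 |
| 6 | 2692* | 2308 | 0 | 2302 | 2698* | 0 | 2301 | 2699* | 0 |
| 7 | 1829 | 3171* | 0 | 1504 | 3496* | 0 | 1972 | 3028* | 0 |
| 8 | 1398 | 3602* | 0 | 1504 | 3496* | 0 | 2313 | 2687* | 0 |
| 9 | 2449* | 2551 | 0 | 1964 | 3036* | 0 | 2024 | 2976* | 0 |
| 10 | 1970 | 3030* | 0 | 1689 | 3311* | 0 | 2337 | 2663* | 0 |
| 11 | 3035* | 1965 | 0 | 2199 | 2801* | 0 | 2096 | 2904* | 0 |
| 12 | 1965 | 3035* | 0 | 1915 | 3085* | 0 | 2452 | 2548* | 0 |
| Total | 31298 | 33701 | 1 | 25057 | 39943 | 0 | 28113 | 36887 | 0 |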
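
The comparison described in the caption can be illustrated with a minimal sketch. It assumes that, for one judge, each method supplies a list of predicted probabilities of the judge's class value, one per query-document pair. The function names `count_wins` and `signed_rank_test` are hypothetical, and the normal approximation used for the p-value is an assumption; the source does not state which variant of the signed rank test was applied.

```python
import math


def count_wins(p_a, p_b):
    """Count pairs where method A's probability is larger, B's is larger, or they tie."""
    a_wins = sum(1 for a, b in zip(p_a, p_b) if a > b)
    b_wins = sum(1 for a, b in zip(p_a, p_b) if b > a)
    ties = len(p_a) - a_wins - b_wins
    return a_wins, b_wins, ties


def signed_rank_test(p_a, p_b):
    """Two-sided Wilcoxon signed rank test using a normal approximation (assumed variant).

    Zero differences are dropped and tied absolute differences share an
    average rank. Returns (w_plus, w_minus, p_value).
    """
    diffs = [a - b for a, b in zip(p_a, p_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0

    # Rank the absolute differences, averaging ranks within tied groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Mean and tie-corrected variance of W+ under the null hypothesis.
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return w_plus, w_minus, p_value


# Made-up probabilities for illustration only; these are not data from the study.
method_a = [0.9, 0.8, 0.7, 0.6]
method_b = [0.1, 0.3, 0.5, 0.65]
wins = count_wins(method_a, method_b)
w_plus, w_minus, p_value = signed_rank_test(method_a, method_b)
a_is_marked = p_value < 0.05 and w_plus > w_minus  # asterisk rule from the caption
```

Because the signed rank test weighs the magnitude of each difference and not only its sign, it can in principle favor a method that wins on fewer pairs.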