
Table 4. To determine which of two methods better predicts the individual class values assigned by a test judge, we apply the signed rank test. For each pair of methods, we also count the query-document pairs on which each method gives the higher predicted probability for the judge's class value, along with the number of ties (=). An asterisk marks the better result when the difference has a p-value less than 0.05 by the signed rank test. The optimal parameters are obtained through the rigorous induction method, as in Table 2.

From: Improving a gold standard: treating human relevance judgments of MEDLINE document pairs

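| Judge | M23 vs M4: M23 | M23 vs M4: M4 | M23 vs M4: = | M23 vs M5: M23 | M23 vs M5: M5 | M23 vs M5: = | M4 vs M5: M4 | M4 vs M5: M5 | M4 vs M5: = |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2326 | 2674* | 0 | 1763 | 3237* | 0 | 2197 | 2803* | 0 |
| 1 | 2576* | 2423 | 1 | 1741 | 3259* | 0 | 1808 | 3192* | 0 |
| 2 | 2336* | 2664 | 0 | 1580 | 3420* | 0 | 1892 | 3108* | 0 |
| 3 | 2637* | 2363 | 0 | 1616 | 3384* | 0 | 1592 | 3408* | 0 |
| 4 | 3130* | 1870 | 0 | 2817* | 2183 | 0 | 2788 | 2212 | 0 |
| 5 | 2955* | 2045 | 0 | 2463 | 2537* | 0 | 2341 | 2659* | 0 |
| 6 | 2692* | 2308 | 0 | 2302 | 2698* | 0 | 2301 | 2699* | 0 |
| 7 | 1829 | 3171* | 0 | 1504 | 3496* | 0 | 1972 | 3028* | 0 |
| 8 | 1398 | 3602* | 0 | 1504 | 3496* | 0 | 2313 | 2687* | 0 |
| 9 | 2449* | 2551 | 0 | 1964 | 3036* | 0 | 2024 | 2976* | 0 |
| 10 | 1970 | 3030* | 0 | 1689 | 3311* | 0 | 2337 | 2663* | 0 |
| 11 | 3035* | 1965 | 0 | 2199 | 2801* | 0 | 2096 | 2904* | 0 |
| 12 | 1965 | 3035* | 0 | 1915 | 3085* | 0 | 2452 | 2548* | 0 |
| Total | 31298 | 33701 | 1 | 25057 | 39943 | 0 | 28113 | 36887 | 0 |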
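
The comparison described in the caption (win and tie counts plus a signed rank test) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `compare_methods` and its inputs are hypothetical, and the p-value uses the large-sample normal approximation to the signed rank statistic, a choice the source does not specify.

```python
import math


def compare_methods(p_a, p_b):
    """Count wins/ties and run a two-sided Wilcoxon signed rank test.

    p_a, p_b: predicted probabilities of the judge's class value from two
    methods, aligned by query-document pair (hypothetical inputs).
    Returns (wins_a, wins_b, ties, w_plus, p_value).
    """
    if len(p_a) != len(p_b):
        raise ValueError("inputs must have the same length")

    diffs = [a - b for a, b in zip(p_a, p_b)]
    wins_a = sum(d > 0 for d in diffs)
    wins_b = sum(d < 0 for d in diffs)
    ties = sum(d == 0 for d in diffs)

    # Drop zero differences, then rank the absolute differences,
    # giving equal magnitudes their average rank.
    nonzero = sorted((d for d in diffs if d != 0), key=abs)
    n = len(nonzero)
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(nonzero[j + 1]) == abs(nonzero[i]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    # W+ is the sum of ranks of the positive differences.
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    if var == 0:
        p_value = 1.0
    else:
        z = (w_plus - mean) / math.sqrt(var)
        # Two-sided p-value under the normal approximation (assumption).
        p_value = math.erfc(abs(z) / math.sqrt(2))
    return wins_a, wins_b, ties, w_plus, p_value
```

For one judge, calling `compare_methods` on the two methods' probabilities over all 5000 query-document pairs would yield the three counts reported in a table cell group; an asterisk would correspond to `p_value < 0.05`. Because the test weighs the size of the differences and not only their sign, the method with fewer wins can still come out as the significantly better one.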