Improving a gold standard: treating human relevance judgments of MEDLINE document pairs

BMC Bioinformatics

Table 4. To determine which of two methods better predicts the individual class values assigned by a test judge, we apply the signed rank test. For each pair of methods we also count the query-document pairs for which each method gives the higher predicted probability of the class value, along with the ties (=). An asterisk marks the better result when the difference has a p-value less than 0.05 by the signed rank test. The optimal parameters are obtained through the rigorous induction method, as in Table 2.

Judge	M₂₃ vs M₄			M₂₃ vs M₅			M₄ vs M₅
	M₂₃	M₄	=	M₂₃	M₅	=	M₄	M₅	=
0	2326	2674*	0	1763	3237*	0	2197	2803*	0
1	2576*	2423	1	1741	3259*	0	1808	3192*	0
2	2336*	2664	0	1580	3420*	0	1892	3108*	0
3	2637*	2363	0	1616	3384*	0	1592	3408*	0
4	3130*	1870	0	2817*	2183	0	2788	2212	0
5	2955*	2045	0	2463	2537*	0	2341	2659*	0
6	2692*	2308	0	2302	2698*	0	2301	2699*	0
7	1829	3171*	0	1504	3496*	0	1972	3028*	0
8	1398	3602*	0	1504	3496*	0	2313	2687*	0
9	2449*	2551	0	1964	3036*	0	2024	2976*	0
10	1970	3030*	0	1689	3311*	0	2337	2663*	0
11	3035*	1965	0	2199	2801*	0	2096	2904*	0
12	1965	3035*	0	1915	3085*	0	2452	2548*	0
Total	31298	33701	1	25057	39943	0	28113	36887	0
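
The comparison described in the caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `count_wins` and `signed_rank_test` and the toy probabilities are hypothetical, and the test shown is a two-sided Wilcoxon signed rank test with a normal approximation, which may differ in detail (tie handling, one-sided versus two-sided) from the procedure used for the table.

```python
import math


def count_wins(p_a, p_b):
    """Count pairs where method A is higher, method B is higher, or they tie."""
    wins_a = sum(1 for a, b in zip(p_a, p_b) if a > b)
    wins_b = sum(1 for a, b in zip(p_a, p_b) if b > a)
    ties = len(p_a) - wins_a - wins_b
    return wins_a, wins_b, ties


def signed_rank_test(p_a, p_b):
    """Two-sided Wilcoxon signed rank test (normal approximation).

    Zero differences are dropped; tied absolute differences share the
    average rank. Returns (W_plus, W_minus, p_value).
    """
    diffs = [a - b for a, b in zip(p_a, p_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0

    # Rank the absolute differences, averaging ranks within tie groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Null distribution of the rank sum: mean and tie-corrected variance.
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (min(w_plus, w_minus) - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return w_plus, w_minus, p_value


# Hypothetical predicted probabilities for four query-document pairs.
method_a = [0.9, 0.2, 0.5, 0.7]
method_b = [0.1, 0.4, 0.5, 0.1]
print(count_wins(method_a, method_b))
print(signed_rank_test(method_a, method_b))
```

The win counts and the test answer different questions: the counts only record which method is higher on each pair, while the signed rank test also weighs how large each difference is. This is why an asterisk in the table can fall on the method with the smaller count.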

ISSN: 1471-2105