Improving a gold standard: treating human relevance judgments of MEDLINE document pairs

Given prior human judgments of the condition of an object it is possible to use these judgments to make a maximal likelihood estimate of what future human judgments of the condition of that object will be. However, if one has a reasonably large collection of similar objects and the prior human judgments of a number of judges regarding the condition of each object in the collection, then it is possible to make predictions of future human judgments for the whole collection that are superior to the simple maximal likelihood estimate for each object in isolation. This is possible because the multiple judgments over the collection allow an analysis to determine the relative value of a judge as compared with the other judges in the group and this value can be used to augment or diminish a particular judge’s influence in predicting future judgments. Here we study and compare five different methods for making such improved predictions and show that each is superior to simple maximal likelihood estimates.


Introduction
Human relevance judgments of a document in answer to a query are important as a means of evaluating the performance of a search engine and as a source of training data for machine learning methods to improve search engine performance [1,2]. Because human judgments are difficult, time consuming and expensive to obtain, it is important to extract as much advantage or information from human judgments as possible. If one is fortunate enough to have multiple judgments for the same query-document pair, the question arises as to how these multiple answers can best be used. It is the purpose of this paper to argue that ideally all available data should be used. It is not uncommon that relevance judgments are made on an ordinal scale consisting of {0,1,2,…,k} categories of relevance where k is as large as four [3,4]. We will not concern ourselves here with why a particular application might benefit from judgments on a scale with k greater than 2, but will simply assume that if this is important then it is important to predict the relevance of documents on this same scale. We propose that all available judgment data should be used to produce the most accurate assignment of probabilities to the different relevance categories for a document in answer to a query. The meaning of these probabilities must be the probabilities that these categories would be assigned by some new unseen judge (or user). Such probabilities will then provide optimal training data for improving system performance. But this leads to the important question, how shall we measure the quality of the probabilities produced from the human judgments? Our answer is to leave out one judge's judgments and measure the quality of the predicted probabilities by how well they predict the held out judge's judgments.
Before we proceed further with our discussion it is important to point out a distinction between what we are doing and work that has been done on a related problem. There are many examples of classification problems for which the true class of any object definitely exists. For example, a patient either is or is not fit to undergo anaesthesia [5], a certain number of volcanoes are present in a given region of the surface of Venus or that many are not present [6], a mushroom is either known to be edible or not known to be edible [7], etc. For such data, where it is known that there is a ground truth, it makes sense to study models of the labeling process that incorporate an estimate of the reliability of labelers and an estimate of the ground truth for a task. Several such models have been developed and applied to a variety of data [5,7-14]. Such models are generally tested on how well they predict the ground truth, which is known independently of the labeling process and labels being studied. This situation is fundamentally different from the problem we are interested in. Our data consist of multiple judgments of relevance of a query to a document and we consider each of these judgments to be legitimate and valuable. Judgments of relevance are generally understood to be highly subjective and their diversity represents different interests and insights of the judges [15-22]. Search services and online merchants are interested in what interests their customers and how to predict this interest, and to the question of interest there is generally no one correct answer. Thus we make no assumption regarding correctness, but only seek how best to predict what some new searcher will find relevant.
Given the goal of producing probabilities of relevance categories from multiple human judgments, the next question is what are the options to do this? Clearly the simplest and most obvious approach is to compute the maximal likelihood estimates of class probabilities for each document. As an example suppose we have a document d retrieved by a query q and we require judgments to be made from the set of numbers {0,1,2,3,4} where 0 means clearly irrelevant and 4 means clearly relevant and the other options represent grades between these two extremes. Suppose we have ten prior human judgments {2,3,1,2,4,2,3,2,2,0}. Then the maximal likelihood predictions for future human judgments are $p_0 = 1/10$, $p_1 = 1/10$, $p_2 = 5/10$, $p_3 = 2/10$, and $p_4 = 1/10$, which are proportional to the number of times each different judgment was seen in the past. Based on these predictions it seems much more likely that some future judge will assign a label of 2 than a label of 4 to the question of d's relevance to q. The maximal likelihood approach treats all the judges as of equal value, i.e., we have assumed that all make judgments that are equally predictive of what a future judge would do. However, there is already in the data {2,3,1,2,4,2,3,2,2,0} a hint that some judges might be more valuable than others. There is a consensus in the data that 2 may be more likely as the value of a future judgment than other values. Thus a judge who chose the value 2 may be more useful than a judge who chose a different value. Of course we cannot really rate the usefulness of judges based on their judgment of a single object. But with judgments over a reasonably sized collection of objects it becomes quite feasible to rate judges for usefulness. To put this approach into practice, methods must be designed which account for the predictive value of judges.
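The maximal likelihood estimate in the example above is simply a relative frequency count. A minimal sketch in Python follows; the function name `mle_distribution` and the use of exact fractions are illustrative choices, not part of the original study.

```python
from collections import Counter
from fractions import Fraction

CATEGORIES = (0, 1, 2, 3, 4)  # ordinal relevance scale from the example


def mle_distribution(judgments, categories=CATEGORIES):
    """Return the relative frequency of each category among the judgments."""
    counts = Counter(judgments)
    total = len(judgments)
    # Counter returns 0 for categories that no judge chose.
    return {c: Fraction(counts[c], total) for c in categories}


prior = [2, 3, 1, 2, 4, 2, 3, 2, 2, 0]  # the ten prior judgments in the text
probs = mle_distribution(prior)
for c in CATEGORIES:
    print(c, probs[c])
```

Every judge contributes exactly one count here, which is the equal-value assumption that the weighted methods relax.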
As far as we have been able to ascertain, little work has been done in this area. Yu and colleagues [23,24] proposed a method to estimate the hidden intrinsic values of a set of objects that have been evaluated by a group of judges. They argue that the intrinsic value of an object judged by a group of judges is a suitably weighted average over the judgments of those judges, where the weights represent the rating power of the judges. The intrinsic values determined in this way are interpreted as the consensus values of the group, and each individual judge j's mean squared deviation $\sigma_j^2$ from the consensus values over the set of objects represents the reputation of the judge. They propose that the weight for judge j should be proportional to $1/\sigma_j$ and by normalizing one obtains the weights. Beginning with uniform weights one may calculate intrinsic values and then a more refined set of weights. This procedure may be iterated to convergence. They then suggest using the final intrinsic value for an object as the mean of a Gaussian distribution representing the distribution predictive of future judgments. This requires determining the variance of this predictive distribution, but this can be done using held out data. We evaluate this approach and compare it with our proposed methods and find that it performs well.
One of our approaches is related to the method proposed by Yu, et al. [24] in that we assume there are weights that represent the value of the individual judges. However, our approach differs from theirs in several respects. First, we are dealing with a small discrete set of possible judgments (five in number). In this setting it is convenient to combine prior judgments in a weighted manner to approximate a distribution predictive of future judgments. Instead of obtaining the weights by some iterative procedure we take a machine learning approach and learn optimal weights based on predicting held out data from the training set. We obtain our best results with this approach.
Our second approach is to treat the problem of predicting future judgments as a multiclass (five classes) classification problem. It is then natural to apply the maximum entropy classifier to this problem as it readily allows the computation of probabilities for multiple classes. In this approach the features are judgments of the training set judges and each training judge takes a turn at being held out to provide labels used to learn the weights for the features derived from the other training judges. When the training is completed the learned weights are then suitable for prediction. While this method works well it seems to be somewhat less reliable than the other methods.
The paper is organized as follows. Section 2 describes the judgment data we study and how it was obtained. Section 3 presents the six different methods of predicting future judgments that we tested. The results are in section 4 and the discussion of these results comprises section 5. Section 6 presents conclusions and possible directions for future work.

Judgment data
The data that we study are human judgments of relevance between a query document q and a second document d where both documents were extracted from approximately a million MEDLINE documents dealing with aspects of molecular biology [25]. One hundred query documents q were selected at random and for each q a generic cosine retrieval algorithm [26,27] was used to find the top 50 documents d in relation to q. The resulting set of 5,000 query-document pairs will be denoted here by DP. Each human judge was asked to judge for each pair in DP whether they would want to read d if they had to write the paper represented by q. They were asked to make their judgments on a scale of 0-4 where 0 means the document is clearly not relevant; 1, the document has a 0.25 probability of relevance to writing the query document; 2, a 0.50 probability of relevance; 3, a 0.75 probability of relevance; and 4, the document is certainly relevant to the query-writing task [28]. Initially, a panel of seven judges trained in the area of molecular biology was hired to judge the set DP. Multiple judges were asked to perform the task because of the known variability in human judgments [18,29]. Later, because of questions raised by the work of the first panel [25,28], a panel of six untrained judges was hired to judge the 5,000 query-document pairs of DP. One of the interesting findings coming from the work of the second panel was that while the untrained judges on average did not perform as well as the trained judges, some of the untrained judges were competitive, and the pooled results of the untrained judges were almost as good as the pooled results for the trained judges and better than any single trained judge. Here we study the full set of thirteen judges who have judged DP.
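We use the following notation, writing $\xi_{dp}^{k}$ for the judgment value that judge $k$ assigned to the pair $dp$, $J$ for a set of judges, and $C = \{0,1,2,3,4\}$ for the set of relevance categories.
$\Xi_{dp}(J)$: set of judgment values of the query-document pair $dp \in DP$ made by the judge set $J$, i.e., $\Xi_{dp}(J) = \{\xi_{dp}^{k}\}_{k \in J}$.
$\Xi(J)$: set of judgment values of all 5,000 query-document pairs made by the judge set $J$, i.e., $\Xi(J) = \{\Xi_{dp}(J)\}_{dp \in DP}$.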

Methods
Our approach is to consider a method $M(\Phi)$ that depends on a set of parameters $\Phi$ and that can be applied to a set of judgments $\Xi(J)$ to make predictions about the judgments of an as yet unseen judge $k$ who has also judged the members of DP. We require these predictions to be in the form of probability distributions $P_{dp}(c \mid M(\Phi), \Xi(J))$ where $c \in C$.
We can then evaluate the performance of $M(\Phi)$ by the log probability that it assigns to $k$'s judgments,

$$S(k) = \sum_{dp \in DP} \log P_{dp}\left(\xi_{dp}^{k} \mid M(\Phi), \Xi(J)\right),$$

where $\xi_{dp}^{k}$ denotes the judgment of judge $k$ on the pair $dp$. Because our data is limited to 13 judges on the set DP, we follow a standard leave-one-out cross validation scheme for training and testing. We remove a judge $k$ from the set $J$ and denote the remaining set by $J_k = J - \{k\}$. (Hereafter, any judges marked as subscripts on the set $J$ are to be understood as removed from the set $J$. For example, the set $J_{k,i}$ means that the judges $k$ and $i$ have been removed from the set $J$.) Now there is a particular problem in applying a method $M(\Phi)$ to data such as $\Xi(J_k)$. The problem is how to choose $\Phi$ to achieve good performance and yet avoid overtraining. If we choose $\Phi$ according to [...] there is a serious risk of overtraining. In order to overcome this issue, we apply a cross inductive learning process to get optimal parameters $\Phi^*$ as follows. Given a test judge $k$, let us exclude one more judge $i \neq k$ from the set $J$. Cycling through all 12 judges $i \in J_k$, the induction process is to find the optimal $\Phi$ according to

$$\Phi_k^{*} = \arg\max_{\Phi} \sum_{i \in J_k} \sum_{dp \in DP} \log P_{dp}\left(\xi_{dp}^{i} \mid M(\Phi), \Xi(J_{k,i})\right). \tag{5}$$

Then the optimal parameters $\Phi_k^*$ obtained from (5) may be applied to training on $\Xi(J_k)$ and the success $S(k)$ of the method $M(\Phi)$ is given by

$$S(k) = \sum_{dp \in DP} \log P_{dp}\left(\xi_{dp}^{k} \mid M(\Phi_k^{*}), \Xi(J_k)\right). \tag{6}$$

To perform the cross validation we compute $S(k)$ in (6) over all judges $k \in J$, average the results,

$$\mathrm{Ave} = \frac{1}{|J|} \sum_{k \in J} S(k), \tag{7}$$

and use Ave as the measure of overall performance for the method $M$ in this study. We consider six different methods to predict a probability distribution over the possible judgment values of a query-document pair $dp \in DP$. Each method is applied to induce the associated optimal parameters $\Phi^*$ according to (5) and then evaluated according to (6) and (7).
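The nested leave-one-out scheme behind equations (5) to (7) can be sketched as follows. This is only an illustration under stated assumptions: the parameter is chosen from a small hypothetical grid rather than by a numerical optimizer, judgments are stored as a mapping from judge to a list of labels, and `uniform_smoothed_vote` is a placeholder predictor rather than any of the six methods.

```python
import math

CATEGORIES = (0, 1, 2, 3, 4)


def without(judgments, *removed):
    """Copy of the judge -> labels mapping with the named judges removed."""
    return {j: labels for j, labels in judgments.items() if j not in removed}


def log_score(predict, param, train, held_out):
    """Summed log probability assigned to the held-out judge's labels."""
    return sum(math.log(predict(train, dp, param)[label])
               for dp, label in enumerate(held_out))


def induce_param(predict, grid, judgments, k):
    """Inner loop, as in (5): best parameter over all held-out judges i != k."""
    inner = without(judgments, k)

    def objective(param):
        return sum(log_score(predict, param, without(inner, i), inner[i])
                   for i in inner)

    return max(grid, key=objective)


def cross_validate(predict, grid, judgments):
    """Outer loop, as in (6) and (7): per-judge scores S(k) and their mean."""
    scores = {}
    for k in judgments:
        best = induce_param(predict, grid, judgments, k)
        scores[k] = log_score(predict, best, without(judgments, k), judgments[k])
    return sum(scores.values()) / len(scores), scores


def uniform_smoothed_vote(train, dp, tau):
    """Placeholder method (an assumption): vote fractions mixed with uniform."""
    votes = [labels[dp] for labels in train.values()]
    return {c: (1 - tau) * votes.count(c) / len(votes) + tau / len(CATEGORIES)
            for c in CATEGORIES}


# Hypothetical toy data: four judges, six query-document pairs.
toy = {
    "a": [2, 3, 1, 4, 0, 2],
    "b": [2, 3, 1, 3, 0, 2],
    "c": [2, 2, 1, 4, 1, 2],
    "d": [3, 3, 0, 4, 0, 1],
}
grid = (0.05, 0.2, 0.5)
ave, scores = cross_validate(uniform_smoothed_vote, grid, toy)
print(ave, scores)
```

The test judge k never influences the parameter chosen for predicting k, which is the point of the inner loop.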
We proceed to a description of the individual methods. Here we present the basic ideas of the methods. The mathematical details can be found online at: http://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/IRET/research/methods.pdf

Method $M_1$: direct probability estimation
We take as our estimate of the probability of a given relevance category for a given query-document pair the fraction of the training judges that assigned that category to that pair, i.e., we take the maximal likelihood estimate of the probability of that category for that pair based on the training judges. However, we must modify this estimate slightly to avoid predicting zero for any category because the test judge may have chosen a category that no training judge chose. We do this by mixing in a small fraction τ of the training judges' probabilities of choosing the categories over the whole set of query-document pairs. We optimize the choice of τ by holding out from the training each training judge in turn and choosing the single τ that gives the best overall average of predictions over all such experiments.
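A minimal sketch of this smoothed estimate is given below. The exact mixing form, (1 - τ) times the per-pair vote fraction plus τ times the overall category frequency, is an assumption made for illustration; the text only states that a small fraction τ of the overall frequencies is mixed in. The names `global_frequencies` and `m1_predict` and the toy data are hypothetical.

```python
from collections import Counter

CATEGORIES = (0, 1, 2, 3, 4)


def global_frequencies(train):
    """Fraction of all training judgments, over every pair, in each category."""
    counts = Counter(label for labels in train.values() for label in labels)
    total = sum(counts.values())
    return {c: counts[c] / total for c in CATEGORIES}


def m1_predict(train, dp, tau):
    """Per-pair vote fractions mixed with the overall category frequencies."""
    votes = [labels[dp] for labels in train.values()]
    background = global_frequencies(train)
    return {c: (1 - tau) * votes.count(c) / len(votes) + tau * background[c]
            for c in CATEGORIES}


# Hypothetical toy data: three training judges, three query-document pairs.
train = {"a": [2, 0, 4], "b": [2, 1, 3], "c": [3, 1, 4]}
dist = m1_predict(train, 0, tau=0.1)
print(dist)
```

In this sketch a category that no training judge used anywhere would still receive probability zero; with many pairs and judges every category is normally observed somewhere in the training data.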

Method $M_2$: direct probability estimation with weighting parameters
It is not optimal to put each judge on an equal footing for his or her class label judgments of query-document pairs, as the previous method $M_1$ does, since the predictive value of judgments will differ among judges. To deal with this we assign an arbitrary positive weight to each judge and, instead of counting as in the previous method, we add the weights of judges to obtain probabilities. Thus if three training judges chose category $c \in C$ for a given query-document pair dp we add the weights for the three judges and divide by the sum of the weights for all the training judges to obtain the probability assigned to c for dp. We also smooth in the same way as for the previous method and for the same reason. In fact we use the value of τ determined in $M_1$. Finally we optimize the choice of the weights by leaving each training judge out in turn, predicting his or her judgments based on the weights, and optimizing the choice of weights based on the whole set of such experiments at once. (The choice of τ here can be arbitrary since the weights will always adjust themselves to produce the same optimal results.)

Method $M_3$: correlation matrix with weighting parameters
If a given judge j assigns a category c to a query-document pair dp we can examine all the instances dp′ when this judge assigned the category c. Based on all these instances we can come up with probabilities p(c′ assigned by any judge ≠ j|c assigned by j). This matrix of probabilities should have predictive value and may capture aspects not captured by the previous methods. Thus if for a particular dp judge j has judged c, we will use the distribution p(c′ assigned by any judge ≠ j|c assigned by j) as part of our prediction for dp. If a different judge j′ has assigned c′ we also want p(c″ assigned by any judge ≠ j′|c′ assigned by j′) to contribute to our prediction. Thus we take a weighted average over all these distributions to obtain our prediction and we smooth as before for the same reason.
For each judge and each category there is assigned a weight and these same weights are used whenever the corresponding distribution is used in the predictions. Thus there are more weights here than in the previous method. The approach to optimization is the same as before.

Method $M_{23}$: combining the methods $M_2$ and $M_3$
We can combine the methods $M_2$ and $M_3$, defining the probabilities as a mixture of weighted terms coming from each method plus the smoothing. The optimization is then performed over all the weights at once.
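The two weighted predictors can be sketched as follows, with the weights taken as given. Learning the weights on held-out judges is not shown, and the normalization used in `m3_predict` (dividing by the sum of the weights of the rows actually used) is an assumption made for illustration. All function names and the toy data are hypothetical.

```python
from collections import Counter

CATEGORIES = (0, 1, 2, 3, 4)


def global_frequencies(train):
    """Overall category frequencies of the training judges (for smoothing)."""
    counts = Counter(label for labels in train.values() for label in labels)
    total = sum(counts.values())
    return {c: counts[c] / total for c in CATEGORIES}


def m2_predict(train, dp, weights, tau, background):
    """Weighted vote: sum the weights of the judges who chose each category."""
    total = sum(weights[j] for j in train)
    dist = {}
    for c in CATEGORIES:
        mass = sum(weights[j] for j, labels in train.items() if labels[dp] == c)
        dist[c] = (1 - tau) * mass / total + tau * background[c]
    return dist


def conditional_table(train, j):
    """table[c][c2]: fraction of other judges' labels equal to c2 on pairs
    where judge j chose c."""
    counts = {c: Counter() for c in CATEGORIES}
    for dp, c in enumerate(train[j]):
        for other, labels in train.items():
            if other != j:
                counts[c][labels[dp]] += 1
    table = {}
    for c in CATEGORIES:
        n = sum(counts[c].values())
        table[c] = {c2: (counts[c][c2] / n if n else 0.0) for c2 in CATEGORIES}
    return table


def m3_predict(train, dp, weights, tables, tau, background):
    """Weighted average of the conditional rows selected by each judge's label.
    weights[(j, c)] is the weight for judge j's row for category c."""
    used = [(j, labels[dp]) for j, labels in train.items()]
    total = sum(weights[jc] for jc in used)
    dist = {}
    for c2 in CATEGORIES:
        mix = sum(weights[(j, c)] * tables[j][c][c2] for j, c in used) / total
        dist[c2] = (1 - tau) * mix + tau * background[c2]
    return dist


# Hypothetical toy data: three training judges, three query-document pairs.
train = {"a": [2, 0, 4], "b": [2, 1, 3], "c": [3, 1, 4]}
bg = global_frequencies(train)
w2 = {j: 1.0 for j in train}
tables = {j: conditional_table(train, j) for j in train}
w3 = {(j, c): 1.0 for j in train for c in CATEGORIES}
d2 = m2_predict(train, 0, w2, 0.1, bg)
d3 = m3_predict(train, 0, w3, tables, 0.1, bg)
print(d2)
print(d3)
```

With all weights equal, `m2_predict` reduces to the unweighted smoothed vote, which makes the role of the weights easy to check.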

Method $M_4$: intrinsic judgments from a weighted average
Yu et al. [24] devised a method whereby a community judgment can be obtained from a suitably weighted average over judgments for any given item. Given the judge set $J_{k,i}$ and a query-document pair $dp \in DP$, one defines the weighted average of judgments by

$$m_{dp}^{k,i} = \sum_{j \in J_{k,i}} r_j \, \xi_{dp}^{j}, \tag{8}$$

where $\xi_{dp}^{j}$ is the judgment of judge $j$ on $dp$. Here the numbers $r_j$ are a nonnegative normalized set of weights and are designed to reflect the importance of each judge's judgments. A judge's predictive capability is reflected in the average quadratic error in her judging history on all query-document pairs in DP:

$$e_j = \frac{1}{|DP|} \sum_{dp \in DP} \left(\xi_{dp}^{j} - m_{dp}^{k,i}\right)^2, \quad j \in J_{k,i}. \tag{9}$$

The weights are then obtained from these errors by

$$r_j = \frac{e_j^{-\beta}}{\sum_{j' \in J_{k,i}} e_{j'}^{-\beta}}. \tag{10}$$

Although $\beta = 1$ in (10) gives the optimal weighting for statistical estimation [24,30], we use $\beta = 0.5$ for better numerical stability [24].
Starting with uniform weighting, the algorithm iterates eqs. (8), (9), and (10) to convergence. Once the solution has been obtained and we have the intrinsic class value $m_{dp}^{k,i}$ for each dp, $m_{dp}^{k,i}$ is taken as the mean of a Gaussian distribution which is used to predict the judgments of a test judge. There is one parameter, the $\sigma$ of the predictive distributions, and this is taken to be the same number for all dp. The value of $\sigma$ is optimized as in the previous methods by optimizing the predictions for all $i \in J_k$ simultaneously. Once $\sigma$ is determined, the method is evaluated by its predictions for the judge $k$ and the evaluation is completed by averaging over all such $k \in J$.
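The iterative refinement can be sketched as below. Several details are assumptions made for illustration rather than taken from the source: the convergence test, the small constant `eps` guarding against a zero error, and the way the Gaussian is turned into a distribution over the five categories (density evaluated at each category value and renormalized). The toy data are hypothetical.

```python
import math

CATEGORIES = (0, 1, 2, 3, 4)


def intrinsic_values(train, beta=0.5, tol=1e-10, max_iter=1000, eps=1e-12):
    """Alternate weighted averages, quadratic errors, and weight updates."""
    judges = list(train)
    n_pairs = len(train[judges[0]])
    r = {j: 1.0 / len(judges) for j in judges}  # start from uniform weights
    for _ in range(max_iter):
        # Weighted average judgment for every pair.
        m = [sum(r[j] * train[j][dp] for j in judges) for dp in range(n_pairs)]
        # Average quadratic error of each judge against the averages.
        e = {j: sum((train[j][dp] - m[dp]) ** 2 for dp in range(n_pairs)) / n_pairs
             for j in judges}
        # New weights proportional to e_j ** (-beta), normalized to sum to one.
        raw = {j: (e[j] + eps) ** -beta for j in judges}
        z = sum(raw.values())
        new_r = {j: raw[j] / z for j in judges}
        change = max(abs(new_r[j] - r[j]) for j in judges)
        r = new_r
        if change < tol:
            break
    m = [sum(r[j] * train[j][dp] for j in judges) for dp in range(n_pairs)]
    return m, r


def gaussian_prediction(mean, sigma, categories=CATEGORIES):
    """Assumed discretization: Gaussian density at each category, renormalized."""
    dens = {c: math.exp(-0.5 * ((c - mean) / sigma) ** 2) for c in categories}
    z = sum(dens.values())
    return {c: dens[c] / z for c in categories}


# Hypothetical toy data: judge "d" disagrees strongly with the other three.
toy = {
    "a": [2, 3, 1, 4, 0, 2],
    "b": [2, 3, 1, 3, 0, 2],
    "c": [2, 2, 1, 4, 1, 2],
    "d": [4, 0, 3, 0, 4, 0],
}
m, r = intrinsic_values(toy)
dist = gaussian_prediction(2.2, 1.0)
print(r)
print(dist)
```

On the toy data the dissenting judge ends up with the smallest weight, which is the behavior the reputation mechanism is meant to produce.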

Method $M_5$: maximum entropy classifier
For details of the Maximum Entropy classifier we refer the reader to Berger, Della Pietra, and Della Pietra [31]. Here data points to be classified correspond to the query-document pairs $dp \in DP$. In order to apply a maximum entropy classifier we need to define a class label and features for each instance. The basic approach is to use one judge to supply the label for a pair dp and let other judges paired with their judgments on dp serve as the features. The same pair dp can serve repeatedly as an instance with each judge in turn supplying the label and the other judges and their judgments supplying the features. Since the labels are treated as true it is not crucial that they remain connected to the judges that produced them. But for the features it is crucial that they are pairs consisting of the judgment and the judge who produced that judgment. In this way when the features are weighted the weights reflect the predictive value of the judges that are involved. The scheme that we use is straightforward but a little complicated by several levels of held out judges. First we hold out judge $k$ for testing, leaving the set of judges $J_k$ for training. Then we hold out judge $i$ for determining the regularization parameter for the Maximum Entropy classifier, leaving judges $J_{k,i}$ for training. Finally, we leave out judge $j$ for labeling the instances coming from all the pairs in DP and use the judges remaining in $J_{k,i,j}$ to provide the features for each such instance. When we have created instances from all of DP for each $j \in J_{k,i}$ we train the classifier over all these instances together and then evaluate performance at predicting judge $i$'s labels for different values of the regularization parameter. We choose as optimal that value of the regularization parameter that gives the best average performance at prediction over all the $i \in J_k$ at once.
When this regularization parameter is determined we use it and repeat the training on all instances coming from all $j \in J_k$ and test the prediction of $k$'s labels. By repeating this for all $k \in J$ and averaging the results we measure the method's performance.
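The instance construction described above can be sketched as follows. Only the creation of labeled instances is shown; training a regularized maximum entropy model on them is left out, and the function name `build_instances` and the toy data are hypothetical.

```python
def build_instances(train):
    """One instance per (labeling judge j, pair dp).

    The label is judge j's judgment on dp. The features are (judge, judgment)
    pairs from every other training judge, so a learned feature weight stays
    tied to the judge who produced the judgment.
    """
    n_pairs = len(next(iter(train.values())))
    instances = []
    for j, labels in train.items():
        for dp in range(n_pairs):
            features = {(other, other_labels[dp])
                        for other, other_labels in train.items() if other != j}
            instances.append((features, labels[dp]))
    return instances


# Hypothetical toy data: three training judges, two query-document pairs.
train = {"a": [2, 0], "b": [2, 1], "c": [3, 1]}
instances = build_instances(train)
for features, label in instances:
    print(sorted(features), "->", label)
```

Each pair therefore appears once per labeling judge, so a training set of n judges over DP yields n times |DP| instances.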

One parameter optimizations
The foregoing methods rigorously avoid overtraining in choosing the optimal parameter set $\Phi_k^*$ by equation (5). This clearly has advantages. On the other hand, for methods where $\Phi_k$ involves only a single parameter, it is reasonable to consider the optimization of that single parameter for performance on the test data. This means optimization of (7) by choice of a single parameter value for all $k$. We have done this for the methods $M_1$, $M_4$, and $M_5$. The optimal parameters for these methods are given in Table 1.

Results
For a baseline performance of the predicted probability of human judgments for query-document pairs in the set DP, we assume the uniform distribution where all pairs receive the probability 1/5 for all relevance categories. The measure (7) for this baseline method is

$$\mathrm{Ave} = 5000 \log \tfrac{1}{5} = -8047.19, \tag{11}$$

where the logarithm is the natural logarithm. We applied each of the methods $M_1$-$M_5$ to induce the optimal parameters in (6) and the results are shown in Table 2. We also applied the parameters given in Table 1 for methods $M_1$, $M_4$, and $M_5$ and the results are shown in Table 3. Overall, one can observe that the performances of all methods are almost always better than the random level on each judge. The major exception is judge 0, where almost all methods make predictions that are less accurate than random predictions. Judge 12 is also challenging to predict and about half the predictions are worse than random. In a comparison of different methods we see that among the methods based on a rigorous determination of $\Phi_k^*$, method $M_{23}$ performs best based on the average log probability measure. $M_4$ is a close second. Using the same measure for the single parameter optimizations in Table 3, the method $M_5$ performs best. If one considers the predictions for individual judges, the method $M_5$ achieves the best result more often than any of the other methods in both tables. However, the differences between methods do not achieve statistical significance by the sign test. While the results in Tables 2 and 3 provide a useful performance gauge, they do not allow meaningful statistical testing of the differences seen. To allow statistical testing we consider the 5,000 predicted probabilities for the judgments of each test judge by the different methods under the same circumstances as those used to obtain the results in Tables 2 and 3.

Table 1. Optimal parameters associated with the methods $M_1$, $M_4$ and $M_5$, accurate to two digits.
To compare methods $M_4$ and $M_5$ on how well they predict the judgments of judge 0 we examine judge 0's assigned label for each $dp \in DP$ and consider the difference between the probability it receives from method $M_4$ and the probability it receives from $M_5$. If this difference is positive it favors method $M_4$, but if negative method $M_5$. Of course the magnitude of the difference is also important, as a large magnitude is more important than a small magnitude. This leads us to apply the Wilcoxon signed rank test [32] to determine the significance of differences. For the conditions of Table 2 we find method $M_4$ makes the better prediction for judge 0's judgments in 2,197 cases and $M_5$ in 2,803 cases and there are no ties. We then apply the signed rank test to see that the likelihood of the observed differences happening by chance, if the two methods were equally good at making such predictions, would be a probability of $5.12 \times 10^{-63}$. This indicates there is a very significant difference in the ability of the two methods in predicting this judge's judgments over the 5,000 query-document pairs.
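A sketch of this paired comparison is given below. It computes the signed rank statistic with average ranks for ties and a two-sided p-value from the normal approximation without continuity correction, which is reasonable for thousands of pairs. Whether the original analysis used this approximation is not stated, so that choice is an assumption, and the differences shown are made-up illustrative values.

```python
import math


def signed_rank_test(diffs):
    """Two-sided Wilcoxon signed rank test via the normal approximation.

    diffs[i] is the probability method A gave to the judge's label on pair i
    minus the probability method B gave. Zero differences are dropped.
    Requires at least one nonzero difference.
    """
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    order = sorted(range(n), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(nonzero[order[end + 1]]) == abs(nonzero[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1  # mean of 1-based ranks in the tie group
        size = end - start + 1
        tie_term += size ** 3 - size
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        start = end + 1
    w_plus = sum(rank for rank, d in zip(ranks, nonzero) if d > 0)
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, z, p_value


# Made-up differences for six pairs (positive favors the first method).
diffs = [0.3, -0.1, 0.2, 0.5, -0.05, 0.4]
wins_first = sum(d > 0 for d in diffs)
wins_second = sum(d < 0 for d in diffs)
w_plus, z, p_value = signed_rank_test(diffs)
print(wins_first, wins_second, w_plus, z, p_value)
```

Unlike the sign test, the statistic weights each pair by the rank of its absolute difference, so a few large differences count for more than many tiny ones.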

Discussion
It is evident from the results of Table 2 that each of the six different methods of predicting relevance judgments for the unseen judge is far better than random, i.e., the -8047.19 given in (11). The method $M_1$, which takes the simplest approach of making the maximal likelihood estimate under the assumption that all the judges are of equal value in making predictions for what an unknown judge would judge, gives the poorest result. Improved results come from making estimates of how to weight individual judges in combining their judgments. When these weights are learned by the iterative method of Yu, et al. [24] we see that the result is very good. When the weights are learned from the training judges using a held out judge, as in methods $M_2$, $M_3$, and $M_{23}$, we see our best result in $M_{23}$. The methods $M_2$ and $M_3$ each represent only a part of the solution and to get the best result both methods have to be combined in $M_{23}$. The method $M_5$, based on the maximum entropy method, comes in fourth in the competition based on the summary figure of -7320 for the average log probability of the judgments computed over all judges. From one point of view this summary figure is a little deceptive in that $M_5$ actually obtained the best score on six of the judges and this is a greater number of best scores than even the method $M_{23}$, which achieved the overall best average. An examination of the scores for different judges shows that $M_5$ would have done much better had it not done very poorly predicting the judgments for judge 4. Analysis for judge 4 shows the algorithm attempts to use a regularization parameter $\lambda_4^*$ that is much too small and hence overtrains and makes poor predictions. This problem led us to ask what performance would be if optimization were done to produce a single optimal $\lambda^*$ for all test data at once as given in Table 1. As seen in Table 3, one obtains improved overall performance. Of course there is a small risk of overtraining.
The same single parameter optimization for method $M_4$ essentially does not work. The reason for this failure is not clear. For $M_1$ the single optimization just involves the smoothing parameter and has little effect, but does not degrade performance. The methods $M_2$, $M_3$, and $M_{23}$ all involve multiple parameters and have a higher risk of overtraining and hence are not included in this analysis. While the log probability of the judgments of an unseen judge averaged over all judges in turn seems like a reasonable way to rate overall performance, it does not provide a method to determine whether an observed difference between methods has statistical significance. In order to compute such significance values we have resorted to examining the difference in the probabilities assigned by two methods to a judge's judgments over the whole set DP. We can apply the Wilcoxon signed rank test to this data to ascertain statistical significance in a comparison of two methods for each judge. Such data is contained in Table 4 and Table 5 and is based on the same calculations reported in Table 2 and Table 3, respectively. The results are interesting in that they show that method $M_5$ is superior in the comparison of the rigorous approaches reported in Table 2 except for its performance in predicting judge 4's judgments. The data in Table 5 do not support any conclusion regarding the comparison of methods $M_4$ and $M_5$. A natural question that may have occurred to the reader is why not apply some of the techniques used in studying noisy classification labels [5,6,9,10,13,33] to our problem. Indeed this might be an interesting thing to try. However, there are two reasons we have not done it. First, all these models involve latent values of annotator reliability and true labels and are more complicated in concept and/or in application than the methods we use.
Second and more important, all these models are constructed to predict the reliability of the judges and the true labels for items and are evaluated on how well they predict these true labels. Since we do not have true labels and true labels are philosophically inconsistent with our data and how it was obtained, we would not be able to evaluate such an application of these models to our data except in how well they could predict judgments of held out judges. But it is not readily apparent how such predictions could or should be derived in these approaches. Therefore we consider this a question beyond the scope of our current investigation.

Conclusions and future work
We have studied basically three methods of predicting human judgments from known human judgments. We find that method $M_{23}$ gives the best overall predictions for all judges. On the other hand the maximum entropy method $M_5$ gave the best results on twelve of the thirteen judges. However, it failed badly on one judge. As a result we conclude that $M_5$ is usually the best method, but is subject to occasional large errors. It is possible that such large errors could be prevented by setting a lower limit for the regularization parameter which is followed regardless of training. The method $M_4$ we regard as somewhere between $M_{23}$ and $M_5$ in that it does not give quite as good results as $M_{23}$ on the one hand and did not experience the large error seen with $M_5$ on the other hand.
Several directions for further investigation are suggested by our results. First, it is possible that the method $M_4$ of Yu, et al. [24] could be improved by taking their same basic approach, but determining the weights for individual judges as those that are optimal for predicting held out data. Determining the weights using their iterative algorithm works well, but there is no theoretical reason why that approach should be optimal for the purpose of making the desired predictions. Second, it may be useful to explore the connections of our methods with methods for fusing multiple classifiers [34-36], as both problems have solutions involving weighting the individual members to be combined or involving second stage machine learning to learn how to combine individuals. On the other hand the problems are distinct because combining human judgments employs no gold standard and attempts to predict what an unknown member typical of the group would do, whereas the classification problem generally works with a gold standard set of training data. Third, in a real application the predictive value of judges could be used to control the judgment process so that if less predictive judges judge material, then more such judgments are needed to obtain a certain level of assurance regarding the predictive value achieved. This suggests an active learning scenario in which not only the entities to be judged, but the judges, are controlled for maximum efficiency much as has been done for the classification problems [8,13,14].

Table 4. In order to measure which of two methods best predicts the individual class values made by a test judge, we apply the signed rank test. We also count query-document pairs where the predicted probability of the class value is bigger for each method (and also ties). An asterisk marks the better result when the difference has a p-value less than 0.05 by the signed rank test. The optimal parameters are obtained through the rigorous induction method as in Table 2.

Table 5. In order to measure which of two methods best predicts the individual class values made by a test judge, we apply the signed rank test. We also count query-document pairs where the predicted probability of the class value is bigger for each method (and also ties). An asterisk marks the better result when the difference has a p-value less than 0.05 by the signed rank test. The optimal parameters are the single parameter optimizations of Table 1.