Volume 11 Supplement 2
Third International Workshop on Data and Text Mining in Bioinformatics (DTMBio) 2009
Enabling multilevel relevance feedback on PubMed by integrating rank learning into DBMS
 Hwanjo Yu^{1},
 Taehoon Kim^{1},
 Jinoh Oh^{1},
 Ilhwan Ko^{1},
 Sungchul Kim^{1} and
 Wook-Shin Han^{2}
DOI: 10.1186/1471-2105-11-S2-S6
© Han et al; licensee BioMed Central Ltd. 2010
Published: 16 April 2010
Abstract
Background
Finding relevant articles from PubMed is challenging because it is hard to express the user's specific intention in the given query interface, and a keyword query typically retrieves a large number of results. Researchers have applied machine learning techniques to find relevant articles by ranking the articles according to the learned relevance function. However, the process of learning and ranking is usually done offline without being integrated with the keyword queries, and the users have to provide a large number of training documents to get a reasonable learning accuracy. This paper proposes a novel multilevel relevance feedback system for PubMed, called RefMed, which supports both ad-hoc keyword queries and multilevel relevance feedback in real time on PubMed.
Results
RefMed supports multilevel relevance feedback by using RankSVM as the learning method, and thus it achieves higher accuracy with less feedback. RefMed "tightly" integrates the RankSVM into an RDBMS to support both keyword queries and the multilevel relevance feedback in real time; the tight coupling of the RankSVM and DBMS substantially improves the processing time. An efficient parameter selection method for the RankSVM is also proposed, which tunes the RankSVM parameter without performing validation. Thereby, RefMed achieves a high learning accuracy in real time without performing a validation process. RefMed is accessible at http://dm.postech.ac.kr/refmed.
Conclusions
RefMed is the first multilevel relevance feedback system for PubMed, which achieves a high accuracy with less feedback. It effectively learns an accurate relevance function from the user’s feedback and efficiently processes the function to return relevant articles in real time.
Background
PubMed is one of the most important information sources for biomedical researchers. It supports efficient processing of keyword and constraint queries. However, finding relevant articles from PubMed is still challenging because it is hard to express the user’s specific intention in the given query interface, and a keyword query typically retrieves a large number of results. For example, the keyword “breast cancer” returns more than two hundred thousand articles. Adding a few more constraints could narrow down the search results but is still likely to return more results than the user can easily handle. The user can sort the results according to publication date, author’s first or last name, or journal name, but sorting them by some notion of relevance is hard.
To improve the search quality on PubMed, researchers have studied querying methodologies for PubMed, such as how to use controlled vocabulary, MeSH terms, or background knowledge to formulate proper PubMed queries [1, 2]. Reorganizing the search results using ontologies or clustering techniques has been explored to provide better presentation of the results to the users [3–5]. Text mining researchers have also tried to compute the global importance of articles using citation information and have applied it to rank the results, as done in Google [4, 6, 7]. However, users’ specific intentions typically vary widely even with the same keyword query. For example, with the query “breast cancer”, one user may be interested in finding genetic-study-related papers while another may want to find the latest cancer treatments. Thus, ranking according to global importance often does not meet the users’ specific information needs.
Researchers have also applied machine learning techniques to find relevant articles by ranking the articles according to the learned relevance function [8, 9]. However, the process of learning and ranking is usually done offline without being integrated with PubMed’s keyword queries, and the users have to provide a large number of training articles to get a reasonable learning accuracy.
Finally, relevance feedback, a well-established technique in IR for improving retrieval performance [10, 11], has been applied on PubMed (e.g., MiSearch, a recent relevance feedback system for PubMed [12]). However, existing relevance feedback systems use classification methods and thus are limited to two-level relevance judgments (relevant or not).
This paper proposes a novel multilevel relevance feedback system for PubMed, called RefMed, which supports both ad-hoc keyword queries and multilevel relevance feedback in real time on PubMed.
To the best of our knowledge, RefMed is the first “multilevel” relevance feedback system for PubMed. The new technical contributions of RefMed are as follows.

RefMed supports multilevel relevance feedback by using RankSVM as the learning method, and thus it achieves higher accuracy with less feedback. Traditional relevance feedback systems use classification methods for learning (e.g., SVM, Bayesian learning) and thus are limited to two levels of relevance judgments (i.e., relevant or not). RankSVM is one of the most actively researched algorithms for learning ranking functions in the machine learning community and is regarded as the most accurate methodology when the size of training data is relatively small [13]. In a real-time relevance feedback system such as RefMed, the amount of user feedback, i.e., training data, is typically small. Thus, we adopted RankSVM as the learning method.

RefMed “tightly” integrates the RankSVM into a relational database management system (RDBMS) to support keyword queries and relevance feedback in the same framework and to minimize the response time. Specifically, we develop and integrate new SQL expressions for learning and predicting rankings into the DBMS. The tight coupling of RankSVM and DBMS improves the processing time substantially by running the RankSVM directly on the data tables instead of files. The new SQL expressions also facilitate the application development process by running the rank learning and predicting operations within SQL.

An efficient parameter selection method for RankSVM is proposed, which tunes the parameter without performing validation. Validation is a necessary process in learning with RankSVM in order to tune the soft margin parameter C. However, it is not feasible to perform validation in RefMed, as no validation set is given during the search process. With this parameter selection method, RefMed estimates the best parameter to achieve a high learning accuracy without performing validation.
The Methods section overviews RankSVM and presents the integration of RankSVM within SQL and our parameter selection method. The Results section demonstrates RefMed and reports experimental results. We report (1) the learning accuracy of RankSVM against Rocchio with different amounts of feedback and relevance levels, (2) the query processing time of the tight coupling against a loose coupling, and (3) the accuracy of our parameter selection method against cross validation and other parameter selection methods.
Methods
Preprocessing PubMed
Stopword removal and word stemming are performed before extracting the features.
RankSVM
Then, the goal is to learn F which is concordant with the ordering R and also generalizes well beyond R. That is, to find a weight vector w such that w · x_{ i } > w · x_{ j } for most data pairs (x_{ i }, x_{ j }) ∈ R. RankSVM finds such a weight vector by solving the following optimization problem [14]:

minimize: V(w, ξ) = (1/2) w · w + C Σ ξ_{ ij } (2)

subject to: w · x_{ i } ≥ w · x_{ j } + 1 − ξ_{ ij } for all (x_{ i }, x_{ j }) ∈ R (3)

ξ_{ ij } ≥ 0 (4)

By the constraint (3) and by minimizing the upper bound Σ ξ_{ ij } in (2), the above optimization problem satisfies the orderings on the training set R with minimal error. By minimizing w · w, or equivalently maximizing the margin, it tries to maximize the generalization of the ranking function. C is the soft margin parameter that controls the tradeoff between the margin size and training error. (Refer to [15] for a more detailed explanation of the formulation of the RankSVM optimization problem.)
The primal problem of RankSVM can be transformed to the following dual problem using Lagrange multipliers:

maximize: L(α) = Σ_{ ij } α_{ ij } − (1/2) Σ_{ ij } Σ_{ uv } α_{ ij } α_{ uv } K(x_{ i } − x_{ j }, x_{ u } − x_{ v }) (5)

subject to: 0 ≤ α_{ ij } ≤ C (6)

Once transformed to the dual, the kernel trick can be applied to support nonlinear ranking functions, and the learned ranking function becomes

F(x) = Σ_{ ij } α_{ ij } K(x_{ i } − x_{ j }, x) (7)

K(·) is a kernel function, where K(a, b) = a · b in the linear kernel or K(a, b) = exp(−g‖a − b‖^{ 2 }) in the RBF kernel. The RBF kernel contains an additional parameter g that needs to be tuned. (Refer to [15] for a more detailed explanation of the kernel trick.)
The function F becomes a linear function w.r.t. the features when K is the linear kernel, or a nonlinear function when K is a nonlinear kernel such as the RBF kernel. RefMed applies the linear kernel since the linear kernel is known to perform well for high-dimensional data such as documents [16].
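As an illustration of the pairwise idea behind RankSVM with a linear kernel (not the authors' implementation, which couples SVM-light with MySQL), the following numpy sketch minimizes the pairwise hinge objective of Eqs. (2)–(4) by subgradient descent; the toy data, step size, and epoch count are made up for the example.

```python
import numpy as np

def rank_svm_fit(X, y, C=1.0, epochs=200, lr=0.01):
    """Learn a linear ranking function F(x) = w.x from multilevel labels y by
    subgradient descent on (1/2) w.w + C * sum max(0, 1 - w.(x_i - x_j))
    taken over all pairs with y_i > y_j (the pairwise hinge loss)."""
    n, d = X.shape
    # one difference vector per ordered pair (i ranked above j)
    P = np.array([X[i] - X[j] for i in range(n) for j in range(n) if y[i] > y[j]])
    w = np.zeros(d)
    for _ in range(epochs):
        viol = (P @ w) < 1.0              # pairs violating the margin
        grad = w - C * P[viol].sum(axis=0)  # subgradient of the objective
        w -= lr * grad
    return w

# toy check: relevance is driven by the first feature
X = np.array([[3.0, 0.1], [2.0, 0.5], [1.0, 0.9], [0.0, 0.2]])
y = np.array([3, 2, 1, 1])                # multilevel relevance judgments
w = rank_svm_fit(X, y)
scores = X @ w                            # ranking scores F(x) for each article
```

The learned w scores the articles consistently with the feedback levels, which is the behavior RefMed relies on when re-ranking search results.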
Integration of RankSVM within SQL
RefMed tightly integrates the RankSVM into the MySQL DBMS in order to minimize the response time. The tight integration of RankSVM enables the learning and processing of ranking on the SQL data tables directly without additional disk accesses for generating intermediate files. By integrating RankSVM within MySQL, we can also use the DB facilities such as indexes and optimizers for managing and accessing the data.
The model_table is constructed after running RANKSVM_LEARN; it contains the model information and is used as an input to the RANKSVM_PREDICT command. The model information includes the parameters (i.e., CVal, KType, and KVal) and the set of support vectors with their coefficients (i.e., the α_{ ij } in Eq.(7)). The test_table contains attributes ID and FVector, and the output_table contains attributes ID and RScore (the ranking score).
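RANKSVM_LEARN and RANKSVM_PREDICT are RefMed's own MySQL extensions and are not reproduced here. As a rough illustration of the tight-coupling idea — scoring rows inside the SQL engine rather than exporting tables to files — one can register a scoring function as a SQL user-defined function. This sketch uses Python's sqlite3 with hypothetical model weights; the table and column names mirror the text.

```python
import sqlite3, json

# Hypothetical linear-model weights; in RefMed these would come from
# model_table after RANKSVM_LEARN (here they are made up for illustration).
weights = [0.8, -0.2, 0.5]

def rscore(fvector_json):
    """Ranking score F(x) = w.x, evaluated inside the SQL engine."""
    x = json.loads(fvector_json)
    return sum(w * v for w, v in zip(weights, x))

conn = sqlite3.connect(":memory:")
conn.create_function("RANKSVM_PREDICT", 1, rscore)
conn.execute("CREATE TABLE test_table (ID INTEGER, FVector TEXT)")
conn.executemany("INSERT INTO test_table VALUES (?, ?)",
                 [(1, "[1.0, 0.0, 0.0]"),
                  (2, "[0.0, 1.0, 0.0]"),
                  (3, "[0.0, 0.0, 1.0]")])
# documents come back ordered by ranking score without leaving the database
rows = conn.execute(
    "SELECT ID, RANKSVM_PREDICT(FVector) AS RScore "
    "FROM test_table ORDER BY RScore DESC").fetchall()
```

The point of the design is that no intermediate files are written: the scoring runs on the data tables directly, which is what makes the tight coupling faster than the loose coupling evaluated later.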
Parameter selection
RankSVM has the soft margin parameter C (in Eq.(2)) that controls the tradeoff between the margin size and training error. The parameter typically needs to be tuned by a validation process, but it is infeasible to perform a validation process in a realtime relevance feedback system such as RefMed, as no validation set is provided. We develop a parameter selection method for RankSVM that tunes the soft margin parameter without running a validation process as follows.
Since K(·) > 0 and every coefficient α_{ ij } is bounded by C (Eq.(6)), the value of the ranking function F is bounded in terms of C. Assume the training set is a set of multilevel relevance levels (e.g., {1:"not relevant", 2:"partially relevant", 3:"highly relevant"}) where F(a) > F(b) whenever a belongs to a higher relevance level than b. Then, we can estimate the lower bound of C by computing Eq.(9) using the training set; in fact, when the pair is a bounded support vector whose α = C, the inequality in Eq.(9) becomes an equality. The summation in the denominator is over all records in the training set. The resulting set of C values is sorted in descending order, and the 90^{ th } percentile value is selected. Thus, the model is given sufficient capacity to achieve good, but not perfect, performance on the training data.
Our experiments (in "Evaluation of Parameter Selection" Section) show that our parameter selection method generates significantly higher accuracy than the default parameters provided in the SVM light [13] and LibSVM [17], and its accuracy is very close to that of the cross validation.
Results
RefMed demonstration
Accuracy evaluation
This section evaluates the effectiveness of the multilevel relevance feedback over binary relevance feedback. The effectiveness is measured based on NDCG and Kendall's τ, the two popularly used measures for evaluating ranking accuracy. NDCG is popularly used for IR applications where ranking of the top results is more important than that of the bottom results [18–21], and Kendall's τ is favored for measuring the overall accuracy based on the number of correctly ranked pairs [22–24]. There are other measures for evaluating ranking accuracy, such as AUC (Area Under the Curve) and MAP (Mean Average Precision), but they are used when there are only two levels of relevance. Descriptions of Kendall's τ and NDCG follow.
Kendall's τ
In this example, τ (R^{ * }, R^{ F }) is computed as 0.7: 3 of the 10 pairs are discordant while the remaining 7 pairs are concordant.
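Since the worked example yields τ = 0.7 when 7 of 10 pairs are concordant, τ here appears to be the fraction of concordant pairs; the following is a small sketch under that assumption, with hypothetical document IDs.

```python
from itertools import combinations

def kendall_tau(rank_true, rank_pred):
    """Fraction of item pairs ordered the same way in both rankings
    (7 concordant pairs out of 10 would give tau = 0.7)."""
    pairs = list(combinations(rank_true, 2))
    concordant = sum(
        (rank_true.index(a) < rank_true.index(b)) ==
        (rank_pred.index(a) < rank_pred.index(b))
        for a, b in pairs)
    return concordant / len(pairs)

# 5 documents -> 10 pairs; the predicted ranking swaps two neighbouring pairs
r_star = ["d1", "d2", "d3", "d4", "d5"]   # ideal ranking R*
r_f    = ["d2", "d1", "d3", "d5", "d4"]   # learned ranking R^F
tau = kendall_tau(r_star, r_f)
```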
Normalized Discounted Cumulative Gain (NDCG)

NDCG@n = Z_{ n } Σ_{ j=1 }^{ n } (2^{ R(j) } − 1) / log(1 + j)

where j is the position from the top, R(j) is the rating of the j-th document, and Z_{ n } is a normalization factor that guarantees the NDCG score of a perfect ranking equals 1. As j increases, i.e., as a returned document gets farther from the top, its impact on the NDCG score decreases logarithmically.
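Under the standard formulation consistent with the symbols above (gain 2^{R(j)} − 1, logarithmic discount, normalizer Z_n), NDCG@n can be computed as follows; the ratings are made up for the example.

```python
import math

def ndcg(ratings, n):
    """NDCG@n = Z_n * sum_{j=1..n} (2^R(j) - 1) / log2(1 + j),
    where Z_n normalizes so a perfect ranking scores exactly 1."""
    def dcg(rs):
        return sum((2 ** r - 1) / math.log2(1 + j)
                   for j, r in enumerate(rs[:n], start=1))
    ideal = dcg(sorted(ratings, reverse=True))  # 1/Z_n
    return dcg(ratings) / ideal if ideal > 0 else 0.0

perfect = ndcg([3, 2, 2, 1, 0], n=5)   # already ideally ordered
swapped = ndcg([2, 3, 2, 1, 0], n=5)   # top two swapped: penalized most
```

Swapping items near the top costs more than swapping items near the bottom, which is exactly why NDCG suits search applications where the top of the list matters most.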
Data sets
We used synthetic and OHSUMED data sets. The synthetic data consists of 150 data instances with 50 features, and each feature value is a random number between zero and one. A linear function F is then created by generating a random weight vector. The training and testing sets are created using the function F, and the accuracy is measured by comparing F' and F on the testing set, where F' is learned from the training set.
The OHSUMED data set is a subset of the PubMed articles and consists of 348,566 documents and 106 queries [26]. In total, there are 16,140 query-document pairs on which relevance judgments are made. The relevance judgments are either 'd' (definitely relevant), 'p' (partially relevant), or 'n' (not relevant). The data has been used in many experiments in IR [21]. In the same way we preprocessed the PubMed data, we preprocessed the OHSUMED documents and extracted features by removing stopwords, stemming words, and computing TF-IDF. A feature is the TF-IDF value for each word.
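A minimal sketch of such a feature extraction pipeline — stopword removal, stemming, and TF-IDF — with a toy stopword list and a crude suffix stripper standing in for a real stemmer (e.g., Porter); the documents are made up.

```python
import math
from collections import Counter

STOPWORDS = {"the", "of", "a", "in", "and", "for", "is"}

def stem(w):
    # crude suffix stripping as a stand-in for a real stemmer
    for suf in ("ing", "ed", "s"):
        if w.endswith(suf) and len(w) > len(suf) + 2:
            return w[:-len(suf)]
    return w

def tokenize(text):
    return [stem(w) for w in text.lower().split() if w not in STOPWORDS]

def tfidf(docs):
    """One sparse TF-IDF vector per document; each feature is a word's TF-IDF."""
    toks = [tokenize(d) for d in docs]
    N = len(docs)
    df = Counter(w for t in toks for w in set(t))   # document frequencies
    vecs = []
    for t in toks:
        tf = Counter(t)
        vecs.append({w: (c / len(t)) * math.log(N / df[w])
                     for w, c in tf.items()})
    return vecs

docs = ["breast cancer treatment",
        "genetic study of breast cancer",
        "cancer treatments and outcomes"]
vectors = tfidf(docs)   # words appearing in every document get weight 0
```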
Results
Efficiency evaluation
We compare the efficiency of the tight coupling against a loose coupling. In the loose coupling, the DBMS exports the data in the tables to files, RankSVM trains and predicts on the files, the prediction results are imported back into the DBMS tables, and the DBMS processes the rest of the query. We ran the experiments using MySQL v5.0.5 on a Linux machine with two Intel Quad-core CPUs, 48GB of memory, and a 4.5TB HDD.
Evaluation of parameter selection
We compare five different parameter selection methods: (1) RankSelection: our parameter selection method, (2) CSelection: the method proposed in [27], (3) SVMLight: the SVM-light default parameter, (4) LIBSVM: the LibSVM default parameter (C = 1), and (5) CV: 3-fold cross-validation.
Accuracy of RankSVM with five different parameter selection methods
NDCG@10

             SVMLight   LIBSVM     CSelection  RankSelection  CV
qid 1-6      0.472788   0.471753   0.493179    0.654761       0.66833
qid 7-12     0.386431   0.390575   0.377642    0.454554       0.45718
qid 13-18    0.320041   0.320191   0.31299     0.35037        0.37794

Kendall's tau

             SVMLight   LIBSVM     CSelection  RankSelection  CV
qid 1-6      0.787682   0.786749   0.789671    0.817379       0.818743
qid 7-12     0.786614   0.786388   0.788733    0.803645       0.804825
qid 13-18    0.780289   0.779738   0.781045    0.782841       0.78917
Conclusions
This paper proposes RefMed, a novel multilevel relevance feedback system for PubMed. RefMed supports multilevel relevance feedback by using RankSVM as the learning method. RefMed tightly integrates the RankSVM into an RDBMS to support both keyword queries and relevance feedback in real time. A novel parameter selection method for the RankSVM is also proposed, which tunes the soft margin parameter without performing a validation process. With the tight coupling of RankSVM within the DBMS and the parameter selection method, RefMed achieves a high relevance accuracy with less feedback.
Declarations
Acknowledgements
This work was supported by the Korea Research Foundation Grant funded by the Korean Government (KRF-2009-0080667).
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 2, 2010: Third International Workshop on Data and Text Mining in Bioinformatics (DTMBio) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S2.
Authors’ Affiliations
References
 Murphy L, Reinsch S, Najm W, Dickerson V, Seffinger M, Adams A, Mishra S: Searching biomedical databases on complementary medicine: the use of controlled vocabulary among authors, indexers and investigators. BMC Complementary and Alternative Medicine 2003.
 Sneiderman C, Demner-Fushman D, Fisaman M, Ide N, Rindflesch T: Knowledge-based Methods to Help Clinicians Find Answers in MEDLINE. Journal of the American Medical Informatics Association 2003.
 GoPubMed [http://www.gopubmed.com/]
 Lin Y, Li W, Chen K, Liu Y: A Document Clustering and Ranking System for Exploring MEDLINE Citations. Journal of the American Medical Informatics Association 2007.
 Yoo I, Song M: Biomedical Ontologies and Text Mining for Biomedicine and Healthcare: A Survey. Journal of Computing Science and Engineering 2008, 2(2):109–136.
 Lu Z, Kim W, Wilbur W: Evaluating Relevance Ranking Strategies for MEDLINE Retrieval. Journal of the American Medical Informatics Association 2009.
 Siadaty M, Shu J, Knaus W: Relemed: sentence-level search engine with relevance score for the MEDLINE database of biomedical articles. BMC Med Inform Decis Mak 2007, 7:1. 10.1186/1472-6947-7-1
 Suomela B, Andrade M: Ranking the whole MEDLINE database according to a large training set using text indexing. BMC Bioinformatics 2005, 6:75. 10.1186/1471-2105-6-75
 Poulter G, Rubin D, Altman R, Seoighe C: MScanner: a classifier for retrieving Medline citations. BMC Bioinformatics 2008, 9:108. 10.1186/1471-2105-9-108
 Salton G, Buckley C: Improving Retrieval Performance by Relevance Feedback. Journal of the American Society for Information Science and Technology 1990, 41:288–297. 10.1002/(SICI)1097-4571(199006)41:4<288::AID-ASI8>3.0.CO;2-H
 Oh H, Myaeng S, Lee M: A Practical Hypertext Categorization Method Using Links and Incrementally Available Class Information. Proc. ACM SIGIR Int. Conf. Information Retrieval (SIGIR'00) 2000, 264–271.
 States D, Ade A, Wright Z, Bookvich A, Athey B: MiSearch Adaptive PubMed Search Tool. Bioinformatics 2008.
 SVM-light [http://svmlight.joachims.org/]
 Herbrich R, Graepel T, Obermayer K: Large margin rank boundaries for ordinal regression. MIT Press; 2000.
 Yu H, Kim S: SVM Tutorial: Classification, Regression, and Ranking. Handbook of Natural Computing 2009. [http://hwanjoyu.org/publication/svmtutorial.pdf]
 Joachims T: Text Categorization with Support Vector Machines. Proc. European Conf. Machine Learning (ECML'98) 1998, 137–142.
 LIBSVM [http://www.csie.ntu.edu.tw/~cjlin/libsvm/]
 Cao Y, Xu J, Liu TY, Li H, Huang Y, Hon HW: Adapting Ranking SVM to Document Retrieval. Proc. ACM SIGIR Int. Conf. Information Retrieval (SIGIR'06) 2006.
 Burges C, Shaked T, Renshaw E, Lazier A, Deeds M, Hamilton N, Hullender G: Learning to Rank using Gradient Descent. Proc. Int. Conf. Machine Learning (ICML'05) 2005.
 Qin T, Liu TY, Lai W, Zhang XD, Wang DS, Li H: Ranking with Multiple Hyperplanes. Proc. ACM SIGIR Int. Conf. Information Retrieval (SIGIR'07) 2007.
 Xu J, Li H: AdaRank: A Boosting Algorithm for Information Retrieval. Proc. ACM SIGIR Int. Conf. Information Retrieval (SIGIR'07) 2007.
 Joachims T: Optimizing Search Engines using Clickthrough Data. Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'02) 2002.
 Radlinski F, Joachims T: Query Chains: Learning to Rank from Implicit Feedback. Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'05) 2005.
 Yu H: SVM Selective Sampling for Ranking with Application to Data Retrieval. Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'05) 2005.
 Geng X, Liu T, Qin T, Li H: Feature Selection for Ranking. Proc. ACM SIGIR Int. Conf. Information Retrieval (SIGIR'07) 2007.
 Hersh W, Buckley C, Leone T, Hickam D: OHSUMED: An Interactive Retrieval Evaluation and New Large Test Collection for Research. Proc. ACM SIGIR Int. Conf. Information Retrieval (SIGIR'94) 1994.
 Cherkassky V, Ma Y: Practical selection of SVM parameters and noise estimation for SVM regression. Neural Networks 2003.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.