Objective and automated protocols for the evaluation of biomedical search engines using No Title Evaluation protocols

Background The evaluation of information retrieval techniques has traditionally relied on human judges to determine which documents are relevant to a query and which are not. This protocol is used in the Text Retrieval Evaluation Conference (TREC), organized annually for the past 15 years, to support the unbiased evaluation of novel information retrieval approaches. The TREC Genomics Track has recently been introduced to measure the performance of information retrieval for biomedical applications. Results We describe two protocols for evaluating biomedical information retrieval techniques without human relevance judgments. We call these protocols No Title Evaluation (NT Evaluation). The first protocol measures performance for focused searches, where only one relevant document exists for each query. The second protocol measures performance for queries expected to have potentially many relevant documents per query (high-recall searches). Both protocols take advantage of the clear separation of titles and abstracts found in Medline. We compare the performance obtained with these evaluation protocols to results obtained by reusing the relevance judgments produced in the 2004 and 2005 TREC Genomics Track and observe significant correlations between performance rankings generated by our approach and TREC. Spearman's correlation coefficients in the range of 0.79–0.92 are observed comparing bpref measured with NT Evaluation or with TREC evaluations. For comparison, coefficients in the range 0.86–0.94 can be observed when evaluating the same set of methods with data from two independent TREC Genomics Track evaluations. We discuss the advantages of NT Evaluation over the TRels and the data fusion evaluation protocols introduced recently. Conclusion Our results suggest that the NT Evaluation protocols described here could be used to optimize some search engine parameters before human evaluation. Further research is needed to determine if NT Evaluation or variants of these protocols can fully substitute for human evaluations.

engine by calculating various quantitative performance measures. Such measures include Mean Average Precision (MAP), binary preference (bpref), precision at rank (e.g., P5, P10 or P20), among others. Performance measures and the traditional information retrieval evaluation paradigms have been reviewed in [1] and the reader should refer to this source for background information.
Most established evaluation methodologies commonly rely on domain experts to make relevance judgments for documents retrieved by search engines. For instance, in the evaluation paradigm used by various tracks (including the genomics track) in the annual Text Retrieval Conference (TREC), groups who participate in the evaluation share the same corpus and perform the same queries. Ranked lists of documents retrieved by each group are pooled to keep only unique documents, and these documents are evaluated by the judges. In another evaluation paradigm recently introduced [2], for each query, judges evaluate specific terms for their likelihood to be included in relevant documents (onTopic terms) or in non relevant documents (offTopic terms). Such term judgments are called TRels and are used to evaluate retrieval effectiveness.
Judging documents (and to a lesser extent judging terms) is an expensive activity that limits the scope of current search engine evaluations. For instance, current studies are limited to numbers of queries ranging from 25-100 because this is the number of queries for which the results can be judged by TREC staff-or judges funded by organizers of the TREC Genomics Track [3]-in a couple of months.
In this manuscript, we describe two approaches which can be used to evaluate the retrieval effectiveness of search engines without human judgments. The two approaches are entirely automated and rely on an objective metric of document relevance made possible by the document structure of Medline. We describe the approaches, discuss the reasons why they would be expected to correlate with human judgments; and present empirical evidence that confirms the existence of a significant correlation between performance measures obtained in the TREC Genomics Track and the results obtained with our approaches.

Results
We describe two evaluation protocols in this section. The first protocol is appropriate for the evaluation of search methods for focused searches, while the second protocol is suggested for the evaluation of high-recall search methods. Figure 1 provides an overview of these two protocols. Methods common to both evaluation protocols are described in the Method section.

Evaluating focused searches
With this protocol, we limit the evaluation to focused searches. We call focused searches requests that are expected to retrieve only one relevant document (such searches may retrieve many non relevant documents, but can retrieve one relevant document at most). In the biomedical domain, examples of focused searches include cases when the end-user is trying to locate the article that describes the discovery of the interaction between two proteins (e.g., the specific article which describes the discovery and characterization of the direct interaction between iNOS and COX2 [4]), the cloning of a specific gene, or the first demonstration that a disease is caused by mutations in a specific gene. In the web search domain, queries where users try to locate the home page of a person are also focused searches, if only one such page exists.
In the following sections of this manuscript, we denote sets of documents as D 0 , D 1 , etc. where the subscripts differentiate between different document sets defined in the text (e.g., a random sample of documents or a subset of documents from top ranking retrieval hits). The set of documents in the complete text collection is referred to as D all . We write Q 0 , Q 1 , and so on to denote sets of queries. We denote search engine methods as S 0 , S 1 , etc., where a search method consists of an algorithm and set of parameters. A search method S can process a set of queries Q over a document collection D and produce a ranked list of document hits H for each query in Q. The query operation is summarized as query(D, S, Q) → H.
We call T 0 , T 1 , and so on sets of natural language texts which describe some information need (for instance, the following text set T e has two elements which describe two possible information needs of a researcher studying Alzheimer's Disease T e = {"provide information about genes which interact with PSN1", "gamma secretase and Alzheimer's disease"}. In TREC, T sets would represent the set of narratives for the topics used in a track. We define the operation produce-query(T, S) → Q as the transformation of a set of text into a set of queries suitable for search method S. This notation abstracts procedures that range from completely automatic to fully manual where an expert translates an information need expressed as an element of T into a well-formed query for the search method S. We write evaluate(H, D) → P as the process of scoring hits H against relevance judgments where all documents of D are considered relevant. A perfect search method would produce hits H where all the k documents in D appear in the k top ranks. The resulting vector P has one element per performance measure that is being scored. The procedure evaluate can therefore produce a variety of performance measures (for instance MAP (mean average precision), bpref (binary preference), or reciprocal rank). Figure 1 Overview of the NT Evaluation protocols. The three panels introduce notation used throughout the manuscript. Each protocol randomly samples documents from the text collection to produce information requests/topics. See text for a description of each protocol. Step 9

Overview of the NT Evaluation protocols
Step 6.
For all protocols: Sample articles randomly from D all to produce the set of document D 0.
Step We call T 0 , T 1 , and so on sets of natural language texts which describe some information need (for instance, the following text set T e has two elements which describe two possible information needs of a researcher studying Alzheimer's Disease T e ={"provide information about genes which interact with PSN1", "gamma secretase and Alzheimer's disease"} We denote sets of documents as D0 D 1 , etc. where the subscripts differentiate between different document sets defined in the text (e.g., a random sample of documents or a subset of documents from top ranking retrieval hits). The set of documents in the complete text collection is referred to as D all.
To evaluate focused searches, we leverage the structure of Medline records. Most Medline records contain both title and abstract of the article referenced by the record. Further, most authors carefully craft the title of an article to summarize the content of the abstract. Briefly, we construct a document collection where titles have been removed and only abstracts are indexed (we call this document collection D nt , for documents-no-title), and ask how well the title of an article can retrieve the corresponding abstract.
More precisely, we collect a random sample of documents from the original text collection, thereafter denoted D 0 .
Step 1. extract-random-sample(D all ) → D 0 We extract the titles of the articles in D 0 and call the set of these titles T 0 : Step 2. extract-title(D 0 ) → T 0 We now retrieve documents from D nt using each element of T 0 as a query. Formally, we perform the steps: For each method S i under evaluation, do: Step 3. generate-query(T 0 , S i ) → Q 0i Step 4. query(D nt , S i , Q 0i ) → H 0i We score the H 0i hits with traditional measures of information retrieval performance using D 0 as the relevance document set. Formally, Step 5. evaluate(H 0i , D 0 ) → P i This overall focused evaluation strategy produces evaluation measures which indicate how well a search method S i can identify the abstract of an article using the title of the article as the query, when the title is absent from the text collection.
This strategy can be seen as mimicking the scenario where a user is trying to locate an article in Medline for which the user only remembers some keywords about what the article was about. Indeed, the title of an article is likely to contain similar keywords to those found in the abstract, but is also unlikely to contain exactly the same words as found in the abstract because authors try to avoid repetitions.
The soundness of this evaluation strategy should be selfevident since the title of an article is clearly relevant to the abstract of the same article (if that were not the case, the title would not match the relevant abstract for any of the method evaluated). However, is this evaluation scheme inadequate because it evaluates such a trivial problem that most information retrieval approaches will always find the correct answer in the first document retrieved? If so, comparisons among information retrieval approaches would be impossible because each method would have the best performance score. Table 1 show that this is not the case. This table lists the mean reciprocal rank obtained by different search approaches and measured over Medline with the focused evaluation search for a query set of 1,000 titles. Mean reciprocal rank is ideal to measure the performance of a focused search because it averages the inverse of the rank of the relevant document averaged over each query. A value of 1 would indicate that the relevant document was always found at rank 1 in the list of retrieved result, for each query. Values in Table 1 range from 0.493 to 0.580. The span of the values shows that the evaluation methodology for focused searches can rank search approaches by performance. (Details of the approaches are given in Method Details).

Evaluating high-recall searches
The strategy that we present here aims to evaluate approaches for high-recall searches. In contrast to focused searches, high-recall searches are expected to retrieve more than one, and potentially many, relevant documents for each query. This is the search scenario that is typically evaluated in TREC (for instance in the ad hoc task of the TREC terabyte track). We propose a simple extension to the focused search evaluation strategy to evaluate highrecall searches. We start as for the focused search evaluation and produce D nt , D 0 and T 0 .
The following step uses a search engine S ref to query D all with T 0 and produce H' 0 hits: H' 0 is produced by searching the document collection with titles, while H 0 was produced by searching D nt (collection without titles).
We keep the k highest scoring hits from H' 0 to produce H' 0 [1.
Step 3.best-scores( We evaluate the Z-score for each document in H' in this study. These steps can be summarized as follow: Step Our evaluation strategy stands on the assumption that D' 0ZT can be used as a set of relevant documents when evaluating a query over D nt . Since we have used T 0 to produce a set of relevant documents, we need an independent set of texts (we will denote this text T 1 ) that can be used to evaluate retrieval effectiveness against D' 0ZT . We chose to construct this text with the n th sentence of each abstract in D 0 . In this manu-script, we have used the third sentence of each abstract in D 0 to produce the set of texts T 1. We have confirmed that evaluation results are insensitive to the choice of which sentence is used when taken from the same abstract (data not shown). We proceed with T 1 to query D nt and evaluate performance against the D' 0ZT pseudo-relevance judgments: Step For each search method under evaluation S i , do Step 7. generate-query(T 1n , S i ) → Q 1i Step 8.
Step 9. evaluate(H 1i , D' 0ZT ) → P i In contrast to the evaluation for focused searches, it is not clear a priori that the high-recall evaluation protocol presented here will produce performance measures that adequately reflect the retrieval performance of the methods under test. To gain insight into this question, we asked whether the results of the evaluation would correlate with the results of a traditional evaluation. Figure 2 and Table  2 present empirical results that indicate that the highrecall evaluation protocol presented in this manuscript generates performance measurements that correlate with those obtained in TREC (genomics tracks 2004 and  2005). The search methods tested are described in Table 3.
The correlation observed between mapTREC and mapNT is significantly different from a random correlation (Pvalue = 0 two-tailed test Pearson coefficients correlation calculated with S-Plus 7.06). The same is true comparing bprefTREC and bprefNT (P-value = 0 two-tailed test). Correlation tests using ranks are also significant (P-value = 4.35 10-5 using a Kendall-τ test for MAP, P-value = 2.26 10-6 for bpref). See the Discussion section for a comparison of these correlation coefficients to the correlations reported in other studies.
We next asked to what extent the results of a high-recall evaluation were sensitive to the choice of the method S ref used to generate relevance judgments (see parameter S ref in Step 1 above). To address this question, we generated relevance judgments with a different search method than was used to produce Figure 2. Figure 3 plots how well the MAP and bpref measures agree between the two evaluations.
In the plot shown in Figure 2  Finally, we asked if NT Evaluation could help tune the parameters of a specific search method without human judgments. Search methods often contain parameters that require tuning for each text collection. An example is the Okapi BM25 probabilistic scoring method, which accepts two parameters k1 and b. Choice of parameters has been shown to significantly affect retrieval performance in past TREC experiments. We therefore tested the ability of NT evaluation to identify favorable and unfavorable regions of the BM25 parameter space. Figure 5 shows   Evaluations with a different sample of search methods

Discussion
In this manuscript, we have presented NT Evaluation, two protocols to evaluate the performance of a search engine. The first protocol helps evaluate so called focused searches, i.e., those searches expected to retrieve one or a few relevant results.

Focused vs. high-recall searches
A drawback of focused search evaluations is that it is unclear if the performance of search approaches measured for focused searches is indicative of the performance of the same approaches for searches expected to retrieve potentially many results. The question is not just academic because Büttcher et al. have recently tried to optimize search parameters for either focused searches or high-recall searches in the TREC 2006 terabyte track, and have not been able to obtain a parameter set that would perform optimally for both types of searches [5]. The results reported in [5] do not entirely establish that focused searches are intrinsically different from high-NT Evaluation predicts favorable regions of the search parameter space recall searches, since the difference in parameters could have been caused by artifacts introduced by the construction of the text collection (named pages have a shorter length on average in the TREC corpus used). Yet, the suspicion remains that the two types of searches may be sufficiently different that evaluating approaches with focused searches may not inform about the performance of the same approaches for high-recall searches.
Interestingly, NT evaluation for focused searches shows very different contours for the best parameter performance compared to high-recall searches, with a preference for higher values of the parameter b (i.e., values of b higher than 1 seem to be preferred, see contour of bpref value = 0.7800). A similar preference for high-values of b was also observed by Buttcher et al. for named page searches [5]. These differences support the hypothesis that focused searches and high-recall searches are distinct tasks that may benefit from different search parameters.

Evaluation with random judgments
Our work aims to develop and test approaches that require minimal human assessment of the information retrieved by the search engine. While Soboroff and colleagues pioneered research in this field by considering random relevance judgments [6], it should be noted that their method is unable to rank best and worse performing systems reliably and is therefore of little practical use.

Evaluation with TRels
Notable progress has been accomplished in this field of study with the development of the TRels approach [2]. This approach reduces human evaluation efforts to the assembly of TRels (lists of onTopic and offTopic terms for each query in an evaluation). Correlation coefficients reported for a TRel evaluation vs a TREC evaluation (MAP-tScore correlation) was 0.938 (Pearson's) and 0.746 (Kendall's-τ). Compared with a different TREC evaluation, NT Evaluation reaches correlation coefficients of 0.9214 (Pearson's) and 0.562 (Kendall's-τ). This comparison therefore suggests that TRels correlate better with TREC evaluation than NT Evaluation. However, while TRels are easier to assemble than document relevance judgments, the approach is still semi-automatic because each query must be studied to identify onTopic and offTopic terms. In contrast to TRels, NT Evaluation requires no human assessment of terms potentially present in retrieved documents. Consequently, NT Evaluation can be used to perform large scale evaluations with tens of thousands of queries. For instance, we report in this manuscript a series of evaluation runs performed with 1,000 queries over the whole set of MEDLINE abstract and titles (see Focused evaluation protocol, Table 1). This evaluation would have been impractical with the TRels approach.

Estimates of MAP from samples of judged documents
In an elegant paper, Aslam and colleagues describe how to efficiently sample pairs of documents to produce unbiased estimates of MAP with low variance [7]. . Sources of variability due to varying query difficulty in the topic samples may therefore yield lower correlations than reported in the study of Aslam.

Evaluation with Data Fusion
Data fusion is a technique where hits from different search engines are aggregated based on rank. In a recent article, Nuray and Can show that data fusion can produce pseudo-relevance judgments which correlate with TREC evaluations [8]. When fusing results from the best search engines in an evaluation, they obtain mean Spearman's correlation coefficients ranging from 0.752 to 0.854. Since the identity of the best systems is unknown before a TREC evaluation is conducted, these correlation values should be regarded as an upper-bound on the correlation that the data fusion approach can produce. Indeed, when the identity of these systems is unknown a priori, the approaches described in [8] achieve on average lower correlation with TREC (average correlations range from 0.527 to 0.627 depending on the data fusion technique used). Correlations obtained with the NT Evaluation protocol are above 0.7, suggesting that NT Evaluation outperforms data fusion (confirmation of this claim will require testing NT Evaluation on the same text TREC evaluations and systems as reported in [8]). Furthermore, in [8] data fusion was used with official document results from methods that participated in the TREC evaluation (results were used by TREC staff to build the pool used by the assessors). Methods included in a TREC pool are known to evaluate better on average than methods not in the pool evaluated. Our study evaluates some search methods which were related to methods in the TREC pool, and others very dissimilar. It is unclear what level of correlation would be observed with data fusion if used to evaluate mixes of in-the-pool and out-of-the-pool systems.

Cross topic variability
A long history of TREC evaluations has shown that performance of the same approach can vary widely from one topic/query to the next, so that performance measures are now only reported on sets of queries [1]. This effect can clearly be seen in Figures 2 and 4, where performance of the same set of methods measured with TREC Genomics Track data from 2004 and 2005 yields Spearman correlation coefficients in the range 0.86-0.87. To counter the effects of topic variability on performance, TREC experiments try to sample the types of topics that users of the search engines are likely to be interested in and search for.
In TREC, this is usually achieved by interviewing search engine users and asking for examples of searches that the users would perform. In the NT Evaluation protocol, however, topics used to generate queries are randomly sampled from the document collection without interviewing users. In the case of Medline, this results in search topics from basic research to clinical interests. While NT Evaluation provides no guarantee that somebody would want to perform a search corresponding to a given query used in the evaluation, there must be some level of interest in the topic since each topic is derived from an article published and indexed in Medline (also, information topics more frequently discussed in Medline are more likely to be sampled as evaluation topics).

Subjectivity of human evaluations
TREC evaluations have also shown that assessor disagreement is common when human judges assess documents. Disagreement occurs when one assessor judged a document relevant while another assessor judged the same document non relevant to the topic. Detailed analysis of the impact of assessor disagreement showed a minimal impact on the reproducibility of performance estimates as long as all the systems tested are assessed consistently (all systems compared judged by the same set of assessors) [1]. The fact that two assessors can disagree on 30% of documents in various TREC experiments confirms the subjective nature of relevance. A recent study by Dong et al suggests that the cause of assessor disagreement may be rooted in the different background and familiarity of the judges with the material discussed in the documents [9]. The NT Evaluation approach substitutes an objective measure that is used consistently across the entire evaluation (objective in the sense that the protocol can be automated and is reproducible). Because it does not use human judgments, the NT approach can be considered not sensitive to assessor disagreement.

NT Evaluation as yet another assessor
An alternative view suggested by a reviewer of this manuscript would be to consider the NT approach as yet another assessor. However, there are major differences between a human assessor and the protocol described in this manuscript. Most importantly, NT Evaluation cannot produce relevance judgments for arbitrarily formulated topics. Human assessors have no problem judging relevance of documents to arbitrary topics as long as the topics overlap with the assessors domain of expertise. In contrast, NT Evaluation considers only topics that can be constructed from the corpus by sampling a random set of documents (see Figure 1). Therefore, each topic is derived from a single article in the corpus. This is usually not the case in evaluations that involve human assessors. It should be noted that this difference prevents the comparison of inter-assessor agreement between NT Evaluation and traditional human relevance judgments. Indeed, there is no way to calculate statistics of assessor agreement when the set of topics 'judged' is disjoint. This is the reason why our study evaluated the correlation in overall system performance between NT Evaluation and TREC Genomics Track evaluations.

Sensitivity to the quality of titles
Medline is a text collection where the quality of titles is high. Most articles are described with a title and an abstract such that the title accurately describes the content of the abstract. Exceptions occur, however, and the titles of some articles may not contain enough information to locate the corresponding article (i.e., consider "A productivity study", a non informative title, or an article with a title but no abstract). The NT Evaluation protocols leverage the association between title and abstract in Medline. How are the protocols affected when there are many non informative titles in the text collection? Since the focused evaluation protocol derives queries from titles, noninformative titles will fail to match the corresponding abstract. For a given method, a higher proportion of non informative titles will therefore decrease the performance of the approach. However, because the set of queries is fixed for all methods under consideration in the NT evaluation protocol, the decrease in performance will be consistent across all methods under evaluation. Higher proportions of non-informative titles in the text collection require the evaluation of more abstracts to find some that match documents, but do not affect the relative performance scores of the methods compared. The same argument can be made for the high recall NT evaluation protocol because the set of queries used is again fixed for all methods under evaluation.

Sensitivity to the S ref parameter
In the first step of the high-recall NT Evaluation, we use a search method to assemble the pseudo-relevance judg-ments that other search methods will be judged against. This could introduce a bias in the evaluation in that the methods among S i most similar to S ref would score better than other methods. We minimize this problem by giving an unfair advantage to S ref ( Figure 3).

Fusion of search engine results
A powerful method to improve retrieval effectiveness is to combine results of different search methods. Various approaches have been developed to this effect, for instance rank fusion [8], or the combination of individual method scores, as described in [10]. Fusion methods weight each method to produce a final ranking of results where each method has a given influence. Machine learning methods have also been used to learn how best to combine results from different search methods and improve search effectiveness (see [11] and references therein). Optimization of fusion parameters is an important activity that could benefit from the NT evaluation protocols. The ability to scale up the number of queries considered in the evaluation may allow the determination of general fusion parameters that work well across a variety of topics. If sufficiently large numbers of queries are evaluated (i.e., 1,000 queries or more), it may also be possible to mine the resulting data to determine which parameters will work best for specific queries or classes of queries. This type of study is currently impossible because of the cost of human relevance judgments.

Future experiments
Our results indicate that the high-recall evaluation protocol produces performance measures that correlate with results obtained in the 2004 and 2005 TREC Genomics Track evaluations. Testing the correlation of NT Evaluations with the results of other TREC evaluations (e.g., ad hoc terabyte track which traditionally evaluates 50 topics per year and has been organized for several years) will help establish how well this protocol agrees with evaluation protocols relying on human judges. The NT Evaluation protocol described here must be adapted to noisy HTML text collections before these experiments can be conducted.

Conclusion
We have presented two evaluation protocols designed to evaluate biomedical search engines over Medline. The protocols are fully automated and do not require human relevance judgment, but will require further validation on large non-biomedical text collection before they can be used confidently for search engine evaluation. If future evaluations confirm our findings, NT Evaluation protocols will allow scaling up search engine evaluation studies to very large number of queries. The first protocol described makes it possible to evaluate search engines when users look for one relevant document per query. The second protocol supports evaluation of searches when many relevant documents are expected. The evaluation protocols that we have described can be used to optimize the parameters of search engines for a specific corpus in the absence of preexisting relevance judgements. Table 3 describes the search methods that were compared in this study. To obtain a sample of methods with different performances, we varied several search method parameters. Parameters varied included:

Search Methods
• query generation approach; • document scoring approach and parameters (BM25ec, INTER_MATCH_DISTANCE_SCORER (IMD scorer), or BM25EC2_IMD_SCORER); • use or not of relevance feedback (and associated parameters).
These approaches are described in the sections below. All queries were performed with MG4J 1.1.2.1 (local version derived from the official MG4J distribution 1.1.2) and the latest development version of the Twease search engine [12,13].

Scoring approaches
BM25ec is an extension of the Okapi BM25 scoring method [14,15] presented in [12]. INTER_MATCH_DISTANCE_SCORER (IMD scorer) is a scoring approach which uses only information about the distance between matches of the query words to the document. To estimate the IMD score, minimal interval semantic [16] is used to determine the intervals of text that match the query, these sets of intervals are pruned to remove overlapping intervals (when two intervals overlap, the shorter is kept), and the gaps between these intervals are used to evaluate: denotes the length of the document, and D denotes the average length of documents in the text collection. Because the IMD scorer only uses information about the density of query word matches to a document, it is not expected to do well (and was included in this study as an example of poorly performing method). We use the IMD scorer and combinations of this scorer with BM25ec in this evaluation to provide intermediate low performance search approaches. The BM25EC2_IMD_SCORER scores documents as the linear combination of scores from BM25ec and IMD: 2*BM25ec-Score + IMDScore.

Query Generation
Query generation implements the procedure generatequery(T 0 , S) → Q 0. We used three automatic query generation algorithms: DisjunctiveQueryDistributor, Conjunc-tiveDisjunctiveQueryDistributor and CombinationAndThenQueryDistributor. Each algorithm first tokenizes T 0 to produce a list of unique word-tokens. The list is sorted by increasing corpus frequency for each token. Assume the list of words produced is {A, B, C, D, E}, with frequency of token A in corpus less or equal to frequency of word B. DisjunctiveQueryDistributor will produce the query A|B|C, including at most maxWord-Keep = 3 words in the final query. ConjunctiveDisjunc-tiveQueryDistributor will split {A, B, C, D, E} into two lists {A, B} {C, D, E} and produce the query (A|B)(C|D) when maxWordKeep = 2. CombinationAndThenQuery-Distributor implements the query generation mechanism described in [16] and produces (A&B&C&D), (A&B&C)| (B&C&D) |(A&C&D), (A&B)| (B&C) |(C&D)|(A&C)|..., A|B|C|D. The symbol & represents conjunction and the symbol ',' represents the "and then" query mechanism described in [16]. Our implementation is governed by maxWordKeep, the number of words to include in the query, starting with lowest frequency words, and maxInclude, the number of words per disjunctive clause in the last and then query clause generated.

Pseudo-Relevance Feedback
Some runs were performed with pseudo-relevance feedback. Such runs are performed in two steps. The first step obtains the same hits as when no relevance feedback is used. The second step inspects k top documents retrieved to select j words. Words are selected by rank according to TF-IDF score in the k documents considered for feedback, and are used for scoring with a BM25 score multiplied by 1/3. This process is different from that described in [17,18] and aims to produce a search method not included in the TREC pool.

Text Collections
The results of the focused evaluation presented in Table 1 were performed by searching a no-title version of the whole Medline text collection consisting of about 16 million abstracts (the Medline baseline of 2006 was used). In order to compare with results obtained in the TREC Genomics Track, the high-recall evaluation was performed on the TREC-genomics track 2004 corpus (re-used in the 2005 evaluation). This corpus is a subset of Medline with about 4.5 million abstracts.

Performance evaluations
The procedure evaluate(H, D) → P is performed with the official TREC evaluation tool, trec_eval version 8.0 [19].