PubMed related articles: a probabilistic topic-based model for content similarity
© Lin and Wilbur. 2007
Received: 25 July 2007
Accepted: 30 October 2007
Published: 30 October 2007
We present a probabilistic topic-based model for content similarity called pmra that underlies the related article search feature in PubMed. Whether or not a document is about a particular topic is computed from term frequencies, modeled as Poisson distributions. Unlike previous probabilistic retrieval models, we do not attempt to estimate relevance; rather, our focus is "relatedness", the probability that a user would want to examine a particular document given known interest in another. We also describe a novel technique for estimating parameters that does not require human relevance judgments; instead, the process is based on the existence of MeSH® in MEDLINE®.
The pmra retrieval model was compared against bm25, a competitive probabilistic model that shares theoretical similarities. Experiments using the test collection from the TREC 2005 genomics track show a small but statistically significant improvement of pmra over bm25 in terms of precision.
Our experiments suggest that the pmra model provides an effective ranking algorithm for related article search.
There is evidence to suggest that related article search is a useful feature. Based on PubMed query logs gathered during a one-week period in June 2007, we observed approximately 35 million page views across 8 million browser sessions. Of those sessions, 63% consisted of a single page view–representing bots and direct access into MEDLINE (e.g., from an embedded link or another search engine). Of all sessions in our data set, approximately 2 million include at least one PubMed search query and at least one view of an abstract–this figure roughly quantifies actual searches. About 19% of these involve at least one click on a related article. In other words, roughly a fifth of all non-trivial user sessions contain at least one invocation of related article search. In terms of overall frequency, approximately five percent of all page views in these non-trivial sessions were generated from clicks on related article links. More details can be found in .
We evaluate the pmra retrieval model with the test collection from the TREC 2005 genomics track. A test collection is a standard laboratory tool for evaluating retrieval systems, and it consists of three major components:
a corpus–a collection of documents on which retrieval is performed,
a set of information needs–written statements describing the desired information, which translate into queries to the system, and
relevance judgments–records specifying the documents that should be retrieved in response to each information need (typically, these are gathered from human assessors in large-scale evaluations ).
The use of test collections to assess the performance of retrieval algorithms is a well-established methodology in the information retrieval (IR) literature, dating back to the Cranfield experiments in the 1960s . These tools enable rapid, reproducible experiments in a controlled setting without requiring users.
The pmra model is compared against bm25 [5, 6], a competitive probabilistic model that shares theoretical similarities with pmra. On test data from the TREC 2005 genomics track, we observe a small but statistically significant improvement in terms of precision.
Before proceeding, a clarification on terminology: although MEDLINE records contain only abstract text and associated bibliographic information, PubMed provides access to the full text articles (if available). Thus, it is not inaccurate to speak of searching for articles, even though the search itself is only performed on information in MEDLINE. Throughout this work, we use "document" and "article" interchangeably.
We formalize the related document search problem as follows: given a document that the user has indicated interest in, the system task is to retrieve other documents that the user may also want to examine. Since this activity generally occurs in the context of broader information-seeking behaviors, relevance can serve as one indicator of interest, i.e., retrieve other relevant documents. However, we think of the problem in broader terms: other documents may be interesting because they discuss similar topics, share the same citations, provide general background, lead to interesting hypotheses, etc.
To constrain this problem, we assume in our theoretical model that documents of interest are similar in terms of the topics or concepts that they are about; in the case of MEDLINE citations, we limit ourselves to the article title and abstract (the deployed algorithm in PubMed also takes advantage of MeSH terms, which we do not discuss here). Following typical assumptions in information retrieval , we wish to rank documents (MEDLINE citations, in our case) based on the probability that the user will want to see them. Thus, our pmra retrieval model focuses on estimating P(c|d), the probability that the user will find document c interesting given expressed interest in document d.
Rephrased in prose, P(c|s_j) is the probability that a user would want to see c given an interest in topic s_j, and similarly for P(d|s_j). Thus, the degree to which two documents are related can be computed by the product of these two probabilities and the prior probability on the topic P(s_j), summed across all topics.
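The sum over topics described above can be sketched in a few lines of Python. This is a minimal illustration rather than the deployed PubMed code: the function name `relatedness` and the toy probabilities are hypothetical, and the sketch assumes the topic set is exhaustive and mutually exclusive, as the model requires.

```python
def relatedness(p_c_given_s, p_d_given_s, p_s):
    """Sum over topics of P(c|s) * P(d|s) * P(s).

    Each argument maps a topic identifier to a probability. Topics missing
    from either document contribute nothing to the sum.
    """
    shared = p_c_given_s.keys() & p_d_given_s.keys()
    return sum(p_c_given_s[s] * p_d_given_s[s] * p_s[s] for s in shared)


# Toy, made-up numbers purely for illustration.
p_s = {"headache": 0.5, "cancer": 0.5}
p_c = {"headache": 0.5, "cancer": 0.25}
p_d = {"headache": 0.5}
score = relatedness(p_c, p_d, p_s)
```

Only the topic shared by both documents ("headache") contributes to the score in this toy case.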
Thus far, we have not addressed the important question of what a topic actually is. For computational tractability, we make the simplifying assumption that each term in a document represents a topic (that is, each term conveys an idea or concept). Thus, the "aboutness" of a document (i.e., what topics the document discusses) is conveyed through the terms in the document. As with most retrieval models, we assume single-word terms, as opposed to potentially complex multi-word concepts. This satisfies our requirement that the set of topics be exhaustive and mutually-exclusive.
From this starting point, we leverage previous work in probabilistic retrieval models based on Poisson distributions (e.g., [6, 8, 9]). A Poisson distribution characterizes the probability of a specific number of events occurring in a fixed period of time if these events occur with a known average rate. The underlying assumption is a generative model of document content: let us suppose that an author uses a particular term with constant probability, and that documents are generated as a sequence of terms. A Poisson distribution specifies the probability that we would observe the term n times in a document. Obviously, this does not accurately reflect how content is actually produced–nevertheless, this simple model has served as the starting point for many effective retrieval algorithms.
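As a concrete reminder of what the Poisson assumption means, the following sketch evaluates the probability of seeing a term n times given its mean count. The function name `poisson_pmf` and the mean of 2.0 are illustrative choices, not values from this work.

```python
import math


def poisson_pmf(n, mean):
    """Probability of observing a term exactly n times given its mean count."""
    return math.exp(-mean) * mean ** n / math.factorial(n)


# Probabilities for n = 0..49 under an illustrative mean of 2.0.
probs = [poisson_pmf(n, 2.0) for n in range(50)]
```

With a mean of 2.0, the probabilities for n = 0 through 49 already sum to 1 within floating-point error, since the tail beyond that point is negligible.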
This content model also assumes that each term occurrence is independent. Although in reality term occurrences are not independent–for example, observing the term "breast" in a document makes the term "cancer" more likely to also be observed–such a simplification makes the problem computationally tractable. This is commonly known as the term-independence assumption and dates back to the earliest days of information retrieval research . See  for recent work that attempts to introduce term dependencies into retrieval algorithms.
Building on this, we invoke the concept of eliteness, which is closely associated with probabilistic IR models . A given document d can be about a particular topic s_i or not. Following standard definitions, in the first case we say that the term t_i (representing the topic s_i) is elite for document d (and not elite in the second case).
Let us further assume, as others have before, that elite terms and non-elite terms are used with different frequencies. That is, if the author intends to convey topic s_i in a document, the author will use term t_i with a certain probability (elite case); if the document is not about s_i, the author will use term t_i with a different (presumably smaller) probability. We can characterize the observed frequency of a term by a Poisson distribution, defined by a single parameter (the mean), which in our model is different for the elite and non-elite cases.
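One way to see how two Poisson distributions lead to an estimate of eliteness is to apply Bayes' rule to the observed term count. The sketch below is a generic illustration under stated assumptions, not the model's exact equation: `prior_elite` is a hypothetical stand-in for the prior probability of eliteness, and the two means are passed in directly without any document-length adjustment.

```python
import math


def poisson_pmf(k, mean):
    """Probability of observing exactly k occurrences given the mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def p_elite_given_count(k, mean_elite, mean_non_elite, prior_elite):
    """Posterior probability that a term is elite, by Bayes' rule over two cases."""
    elite = prior_elite * poisson_pmf(k, mean_elite)
    non_elite = (1.0 - prior_elite) * poisson_pmf(k, mean_non_elite)
    return elite / (elite + non_elite)


# Illustrative values only: the elite mean is larger than the non-elite mean.
posteriors = [p_elite_given_count(k, 3.0, 1.0, 0.5) for k in range(6)]
```

When the elite mean exceeds the non-elite mean, the posterior rises with every additional occurrence of the term, which matches the intuition that frequent use of a term signals that the document is about the corresponding topic.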
A term's weight with respect to a particular document (w_t) can be computed using Equation 9, derived from the estimation of eliteness in our probabilistic topic similarity model. Similarity between two documents is computed by an inner product of term weights, and documents are sorted by their similarity to the current document d in the final output. We note that this derivation shares similarities with existing probabilistic retrieval models, which we discuss in Section 3.
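The ranking step (inner product of term weights, then sorting) can be sketched as follows. The term weights here are made-up numbers standing in for the output of the weighting equation, and the names `similarity` and `rank_related` are hypothetical.

```python
def similarity(weights_a, weights_b):
    """Inner product of two sparse term-weight vectors (term -> weight)."""
    if len(weights_a) > len(weights_b):
        weights_a, weights_b = weights_b, weights_a
    return sum(w * weights_b[t] for t, w in weights_a.items() if t in weights_b)


def rank_related(query_weights, candidates):
    """Return candidate ids sorted by decreasing similarity to the query document."""
    scored = [(similarity(query_weights, w), doc_id) for doc_id, w in candidates.items()]
    # Ties are broken by document id so the ordering is deterministic.
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


d = {"headache": 2.0, "migraine": 1.0}
candidates = {
    "c1": {"headache": 1.0, "aspirin": 3.0},
    "c2": {"headache": 1.0, "migraine": 2.0},
    "c3": {"insulin": 5.0},
}
ranking = rank_related(d, candidates)
```

Storing weights sparsely means only terms shared by both documents are touched, which matters when the "query" is an entire abstract.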
The optimization of parameters is one key to good retrieval performance. In many cases, test collections with relevance judgments are required to tune parameters in terms of metrics such as mean average precision (the standard single-point measure for quantifying system performance in the IR literature). However, test collections are expensive to build and not available for many retrieval applications. To address this issue, we have developed a novel process for estimating pmra parameters that does not require relevance judgments.
Nevertheless, we must still determine the parameters λ and μ (Poisson parameters for the elite and non-elite distributions). If a document collection were annotated with actual topics, then these values could be estimated directly. Fortunately, for MEDLINE we have exactly this metadata, in the form of MeSH terms associated with each record. MeSH terms are useful for parameter estimation in our model precisely because they represent topics present in the articles. Thus, we can assume that if a MeSH descriptor H_n is assigned to document d, the terms in that descriptor are elite. For example, if the MeSH descriptor "headache" [C10.597.617.470] were assigned to a citation, then the term "headache" must be elite in that abstract. We can record the frequency of the term and estimate λ from such observations. Similarly, we can treat as the non-elite case terms in a document that do not appear in any MeSH descriptors, and from this we can derive μ. There is, however, one additional consideration: from what set of citations should these parameters be estimated? A few possibilities include: the entire corpus, a random sample, or a biased sample (e.g., results of a search). In this work, we experiment with variants of the third approach.
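The estimation idea can be sketched roughly as below. Several details are assumptions made for the sake of a runnable example and are not specified in the text: frequencies are normalized by abstract length, MeSH descriptors are pre-split into single-word tokens, "any MeSH descriptors" is read as those assigned to the same record, and each parameter is taken as a simple mean. The function name `estimate_rates` is hypothetical.

```python
from collections import Counter


def estimate_rates(records):
    """records: iterable of (abstract_tokens, mesh_descriptor_tokens) pairs.

    Returns (lam, mu): the mean per-token rate of terms that appear in an
    assigned MeSH descriptor (treated as elite) and of all other terms
    (treated as non-elite). Assumes non-empty abstracts and at least one
    observation of each kind.
    """
    elite, non_elite = [], []
    for tokens, mesh_tokens in records:
        mesh = set(mesh_tokens)
        for term, count in Counter(tokens).items():
            rate = count / len(tokens)  # assumption: normalize by abstract length
            (elite if term in mesh else non_elite).append(rate)
    return sum(elite) / len(elite), sum(non_elite) / len(non_elite)


# Toy records, invented for illustration.
records = [
    (["headache", "patients", "headache", "treatment"], ["headache"]),
    (["insulin", "therapy", "insulin", "insulin"], ["insulin"]),
]
lam, mu = estimate_rates(records)
```

No relevance judgments appear anywhere in this procedure; the only supervision comes from the indexing terms already attached to each record.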
As a final note, while it is theoretically possible to estimate the parameter η based on MeSH descriptors using a similar procedure, this assumes that the coverage of MeSH terms is complete, i.e., that they completely enumerate all topics present in the abstract. Since the assignment of MeSH is performed by humans, we suspect that recall is less than perfect–therefore, we do not explore this idea further.
We evaluated our pmra retrieval model against bm25–a comparison that is appropriate given their shared theoretical ancestry (see Section 3.2). Despite the popularity and performance of language modeling techniques for information retrieval (see  for an overview), bm25 remains a competitive baseline.
Our experiments were conducted using the test collection from the TREC 2005 genomics track , which used a ten-year subset of MEDLINE. The test collection contains fifty information needs and relevance judgments for each, which take the form of lists of PMIDs (unique identifiers for MEDLINE citations) that were previously determined to be relevant by human assessors. See Section 5.1 for more details.
More specifically, we measured precision at a cutoff of five retrieved documents, commonly written as P5 for short. Since our test collection contains a list of relevant PMIDs for each information need (i.e., the relevance judgments), this computation was straightforward.
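Precision at five can be computed directly from a ranked list and a set of relevant PMIDs. The following is a minimal sketch with made-up identifiers; the function name `precision_at_k` is hypothetical.

```python
def precision_at_k(ranked_pmids, relevant_pmids, k=5):
    """Fraction of the top-k retrieved PMIDs that are judged relevant."""
    top = ranked_pmids[:k]
    return sum(1 for pmid in top if pmid in relevant_pmids) / k


# Invented PMIDs for illustration.
ranked = ["101", "102", "103", "104", "105", "106"]
relevant = {"101", "104", "106", "999"}
p5 = precision_at_k(ranked, relevant)
```

Note that the relevant hit at rank six does not count, and that the denominator stays at k even when fewer than k documents are returned.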
We performed two types of experiments:
a number of runs that exhaustively explored the parameter space to determine optimal values, and
additional runs of pmra using parameters that were estimated in different ways.
The pmra experiments used the ranking algorithm described in the previous section. For bm25, we used the complete text of the abstract verbatim as the "query" and treated the resulting output as the ranked list of related documents. Finally, as a computational expedient, we ran retrieval experiments as a reranking task using the top 100 documents retrieved by bm25 with default parameter settings (k1 = 1.2, b = 0.75), as implemented in the open source Lemur Toolkit for language modeling and information retrieval . Due to the large number of queries involved in our exhaustive exploration of the parameter space and the length of each query (the entire abstract text), this setup made the problem much more tractable given the computational resources we had access to (half a dozen commodity PCs). Since we were only evaluating the top five hits, we believe that this procedure is unlikely to yield different results from a retrieval run against the complete corpus. An experiment to validate this assumption is presented in Section 5.2.
The following procedures were adopted for our exhaustive runs: For bm25, we tried all possible parameter combinations, with k1 ranging from 0.5 to 3.0 in 0.1 increments and b from 0.6 to 1.0 in 0.05 increments. This range was selected based on the default settings of k1 = 1.2, b = 0.75 widely reported in the literature. Our exploration of the pmra parameter space started with arbitrary values of λ and μ. Assuming that the performance surface was convex and smooth, we tried different values until its shape became apparent. This was accomplished by first fixing a λ value and varying μ values in increments of 0.001; this process was repeated for different λ values in 0.001 increments.
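The bm25 grid described above can be enumerated as in the sketch below. The `evaluate` callback is a hypothetical stand-in for running retrieval with a given setting and measuring P5; it is not part of any toolkit.

```python
from itertools import product

# Integer steps avoid accumulating floating-point error in the grid values.
k1_values = [round(0.5 + 0.1 * i, 1) for i in range(26)]   # 0.5 .. 3.0
b_values = [round(0.6 + 0.05 * i, 2) for i in range(9)]    # 0.60 .. 1.00
grid = list(product(k1_values, b_values))


def best_setting(grid, evaluate):
    """Return the (k1, b) pair with the highest score under `evaluate`."""
    return max(grid, key=lambda params: evaluate(*params))
```

The grid contains 26 × 9 = 234 settings, each of which requires a full pass over every test abstract, which helps explain why the experiments were run as a reranking task.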
In the second set of experiments, λ and μ for pmra were estimated using the procedure described in Section 1.2, on different sets of citations. We also performed cross-validation as necessary to further verify our experimental results.
Overall comparison between the bm25 and pmra models (comparison column: vs. bm25b).
bm25, default parameters (k1 = 1.2, b = 0.75)
bm25, optimal parameters (k1 = 1.9, b = 1.00)
pmra, optimal parameters (λ = 0.022, μ = 0.013)
Comparison between the bm25 and pmra models, broken down by template (rows: #1 methods or protocols; #2 role of gene in disease; #3 role of gene in biological process; #4 gene interactions in organ/disease; #5 mutation of gene and its impact).
Relative differences between the bm25 and pmra models (columns: bm25* vs. bm25b; pmra* vs. bm25b; pmra* vs. bm25*; rows: #1 methods or protocols; #2 role of gene in disease; #3 role of gene in biological process; #4 gene interactions in organ/disease; #5 mutation of gene and its impact).
We also attempted to automatically estimate parameters for the pmra model using the method described in Section 1.2. However, that method is underspecified with respect to the set of MEDLINE citations over which it is applied. We experimented with the following possibilities:
The complete set of documents examined by human assessors in the TREC 2005 genomics track (see  for a description of how these documents were gathered).
The top 100 hits for each of the 4584 PMIDs that comprise our test abstracts, using bm25 with default parameters.
The top 100 hits for each of the 50 template queries that comprise the TREC 2005 genomics track, retrieved using Indri's default ranking algorithm based on language models. Indri is a component in the open source Lemur Toolkit.
Same as previous, except with top 1000 hits.
Values of pmra parameters (λ, μ) estimated using different sets of MEDLINE citations (rows: all assessed documents from the TREC 2005 genomics track; top 100 hits for every relevant citation, bm25; top 100 hits for every template query, Indri; top 1000 hits for every template query, Indri).
Finally, to further verify these results and to ensure that we were not estimating parameters from the same set used to measure precision, cross-validation experiments were performed on the second condition. The 4584 test abstracts were divided into five folds, stratified across the templates so that each template was represented in each fold. We conducted five separate experiments, using four of the folds for parameter estimation and the final fold for evaluation. The outcome was the same: P5 figures were statistically indistinguishable from the optimal values.
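A stratified five-fold split of this kind can be sketched as follows. The round-robin assignment is one simple way to stratify and is an assumption here (the text does not say how abstracts were assigned to folds, and a real run would likely shuffle first); the function names are hypothetical.

```python
from collections import defaultdict


def stratified_folds(items, num_folds=5):
    """items: iterable of (pmid, template_id). Returns num_folds lists of PMIDs.

    PMIDs are dealt round-robin within each template so that every template
    is spread as evenly as possible across the folds.
    """
    by_template = defaultdict(list)
    for pmid, template in items:
        by_template[template].append(pmid)
    folds = [[] for _ in range(num_folds)]
    for template in sorted(by_template):
        for i, pmid in enumerate(sorted(by_template[template])):
            folds[i % num_folds].append(pmid)
    return folds


def train_test_splits(folds):
    """Yield (estimation_pmids, evaluation_pmids), holding out one fold at a time."""
    for held_out in range(len(folds)):
        train = [p for i, fold in enumerate(folds) if i != held_out for p in fold]
        yield train, folds[held_out]
```

Each split estimates parameters on four folds and measures precision on the fifth, so no abstract is ever used for both purposes within the same experiment.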
In summary, we have empirically demonstrated the effectiveness of our pmra retrieval model and shown a small but statistically significant improvement in precision at five documents over the bm25 baseline. Furthermore, our novel parameter estimation method was found to be effective when applied to a wide range of citation sets varying in both composition and size. Notably, the tuning of parameters did not require relevance judgments, the component in a test collection that is the most expensive and time-consuming to gather.
Although we measured statistically significant differences in P5 between pmra and bm25, are the improvements meaningful in a real sense? The difference between baseline bm25 and optimal pmra (achievable by our parameter estimation process) is 4.7%. In terms of the PubMed interface, for each abstract, one would expect 2.0 vs. 1.9 interesting articles in the related links display. We argue that although small, this is nevertheless a meaningful improvement.
PubMed is one of the Internet's most-visited gateways to MEDLINE; small differences, multiplied by thousands of users and many more interactions, add up to substantial quantities. In addition, our metrics are measuring performance differences per interaction, since a list of related articles is retrieved for every citation that the user examines. In the course of a search session, a user may examine many citations, especially when conducting in-depth research on a particular subject. Thus, the effects of small performance improvements accumulate.
One might also argue that this accumulation of benefits is not linear. Consider the case of repeatedly browsing related articles–the user views a citation, examines related articles, selects an interesting one, and repeats (cf. the simulation studies in [17, 18]). In that case, the expected number of interesting links per interaction can be viewed as a branching factor if one wanted to quantify the total number of interesting articles that are accessible in this manner. In about 13 interactions, an improvement of 0.1 (i.e., 1.913 vs. 2.013) would result in potential access to twice as many interesting articles.
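The branching-factor argument can be checked with a short calculation: if each interaction multiplies the number of reachable interesting articles by the branching factor, the ratio between the two conditions after n interactions is (2.013 / 1.913)^n. The sketch below solves for the n at which that ratio reaches two; the function name is hypothetical, and treating the count as a pure power of the branching factor is a simplifying assumption.

```python
import math


def interactions_to_double(baseline, improved):
    """Real-valued n at which (improved / baseline) ** n equals 2."""
    return math.log(2) / math.log(improved / baseline)


n = interactions_to_double(1.913, 2.013)
```

Under this simplification the doubling point falls between 13 and 14 interactions, in line with the figure quoted above.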
A suitable point of comparison for this work is the Binary Independence Retrieval (BIR) model for probabilistic IR [5, 6], which underlies bm25. Indeed, bm25 was chosen as a baseline not only for its performance, but also because it shares certain theoretical similarities with our model. Like related work dating back several decades [8, 9], both models attempt to capture term frequencies with Poisson distributions. However, there are important differences that set our work apart.
The pmra model was designed for a fundamentally different task–related document search, not ad hoc retrieval. In the latter, the system's task is to return a ranked list of documents that is relevant to a user's query (what most people think of as "search"). One substantial difference is query length–in ad hoc retrieval, user queries are typically very short (a few words at the most). As a result, query-length normalization is not a critical problem, and hence has not received much attention. In contrast, since the "query" in related document search is a complete document, more care is required to account for document length differences.
Another important difference between pmra and bm25 is that there is no notion of relevance in the pmra model, only that of relatedness, mediated via topic similarity. Note, however, that the concept of relevance is still implicitly present in the task definition, in that the examination of documents may take place in the context of broader information-seeking behaviors. In contrast, the starting point of BIR is a log-odds, i.e., log P(R|D)/P(R̄|D), which explicitly attempts to estimate the relevance (R) and non-relevance (R̄) of a document (D). Relevance is then modeled in terms of eliteness (see below). The starting point of our task definition leads to a different derivation.
Although both bm25 and pmra attempt to capture term frequencies in terms of Poisson distributions, they do so in different ways. BIR employs a more complex representation, where term frequencies are modeled as mixtures of two different Poisson distributions (elite and non-elite). In total, the complete model has four parameters: the two Poisson parameters, P(E|R), and P(E|R̄). Since eliteness is a hidden variable, there is no way to estimate the parameters directly. Instead, Robertson and Walker devised simple approximations that work well empirically . One side effect of this 2-Poisson approximation is that bm25 parameters are not physically meaningful, unlike λ and μ in pmra, which correspond to comprehensible quantities. Unlike BIR, our model makes the simplifying assumption that terms are exclusively drawn from either the elite or non-elite distribution. That is, if the document is about a particular topic, then the corresponding term frequency is dictated solely by the elite Poisson distribution; otherwise, it is dictated solely by the non-elite distribution.
Finally, the derivation of our model, coupled with the availability of MeSH headings in the biomedical domain, allows us to directly estimate parameters for our system. Most notably, the process does not require a test collection with relevance judgments, making the parameter optimization process far less onerous.
The estimation of parameters in the pmra model depends on the existence of MeSH terms, which is indeed a fortuitous happenstance in the case of MEDLINE. Does this limit the applicability of our model to other domains in which topic indexing and controlled vocabularies are not available? We note that effective access to biomedical text is sufficiently important an application that even a narrowly-tailored solution represents a contribution. Nevertheless, we present evidence to suggest that the pmra model provides a general solution to related document search.
This finding suggests that the pmra model is relatively insensitive to parameter settings, so long as a particular relationship is maintained between λ and μ. Thus, it would be reasonable to apply our model to texts for which controlled-vocabulary resources do not exist.
In most search applications, system input is comprised of a short query, which is a textual representation of the user's information need. In contrast, this work focuses on related document search, where given a document, the goal is to find other documents that may be of interest to the user–in our case, the specific task is to retrieve related MEDLINE abstracts. We present a novel probabilistic topic-based content similarity algorithm for accomplishing this, deployed in the PubMed search engine. Experiments on the TREC 2005 genomics track test collection show a small but statistically significant improvement over bm25, a competitive probabilistic retrieval model. Evidence suggests that the pmra model is able to effectively retrieve related articles, and that its integration into PubMed enriches the user experience.
The test collection used in our experiments was developed from the TREC 2005 genomics track . The Text Retrieval Conferences (TRECs) are annual evaluations of information retrieval systems that draw dozens of participants from all over the world each year . Numerous "tracks" at TREC focus on different aspects of information retrieval, ranging from spam detection to question answering. The genomics track in 2005 focused on retrieval of MEDLINE abstracts in response to typical information needs of biologists and other biomedical researchers.
The live MEDLINE database as deployed in PubMed is constantly evolving as new articles are added, making it unsuitable for controlled, reproducible experiments. Therefore, the TREC 2005 genomics track evaluation employed a ten-year subset of MEDLINE (1994–2003), which totals 4.6 million citations (approximately a third of the size of the entire database at the time it was collected in 2004). Each record is identified by a unique PMID and includes bibliographic information and abstract text (if available).
Templates and sample instantiations used in the TREC 2005 genomics track evaluation.
#1 Information describing standard [methods or protocols] for doing some sort of experiment or procedure.
methods or protocols: how to "open up" a cell through a process called "electroporation"
#2 Information describing the role(s) of a [gene] involved in a [disease].
disease: multiple sclerosis
#3 Information describing the role of a [gene] in a specific [biological process].
gene: nucleoside diphosphate kinase (NM23)
biological process: tumor progression
#4 Information describing interactions between two or more [genes] in the [function of an organ] or in a [disease].
genes: CFTR and Sec61
function of an organ: degradation of CFTR
disease: cystic fibrosis
#5 Information describing one or more [mutations] of a given [gene] and its [biological impact or role].
gene with mutation: BRCA1 185delAG mutation
biological impact: role in ovarian cancer
In total, 32 groups submitted 59 runs to the TREC 2005 genomics track, consisting of both automatic runs and those with human intervention. Relevance judgments were provided by an undergraduate student and a Ph.D. researcher in biology. We adapted the judgments for our task by treating each relevant document as a test abstract–citations relevant to the same information need were said to be related to each other. In other words, we assume that if a user were examining a MEDLINE citation to address a particular information need, other relevant citations would also be of interest.
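The adaptation of relevance judgments to the related-document task can be sketched as below. The input layout (a mapping from information need to its set of relevant PMIDs) and the function name are assumptions for illustration; keys pair the information need with the test abstract because the same PMID may be relevant to more than one need.

```python
def related_sets(relevant_by_topic):
    """Map each test abstract to the other PMIDs relevant to the same need.

    relevant_by_topic: dict of topic id -> set of relevant PMIDs.
    Returns a dict keyed by (topic id, test PMID).
    """
    return {
        (topic, pmid): pmids - {pmid}
        for topic, pmids in relevant_by_topic.items()
        for pmid in pmids
    }


# Invented identifiers for illustration.
qrels = {"100": {"a", "b", "c"}, "101": {"a", "d"}}
related = related_sets(qrels)
```

Each relevant citation becomes one test abstract, and every other citation relevant to the same information need counts as a correct related article for it.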
Recall from Section 2.1 that for computational expediency, our experiments were performed as reranking runs over results retrieved by bm25 with default parameters. We describe an experiment that examined the potential impact of this setup.
In theory, both bm25 and pmra establish an ordering over all documents in a corpus with respect to a query. Reranking in the limit yields exactly the same results; thus, the substantive question is whether reranking the top hundred hits would yield the same results as searching over the entire corpus. We can examine this issue by tallying the original rank positions of the top five results after reranking–that is, if reranking promotes hits that are highly ranked in the original list to begin with, then we can conclude that hits in the lower ranked positions of the original list matter little. On the other hand, if the reranking brings up hits that are very far down in the original ranked list, it might cause us to wonder what other documents from lower-ranked positions are missed.
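The tally described above can be sketched as follows: for each test abstract, look up where each of the top five reranked hits sat in the original list. The function name and the toy lists are hypothetical.

```python
from collections import Counter


def tally_original_ranks(original, reranked, top_n=5):
    """Count, per original rank (1-based), how often it reaches the reranked top_n."""
    position = {doc_id: rank for rank, doc_id in enumerate(original, start=1)}
    return Counter(position[doc_id] for doc_id in reranked[:top_n])


# Invented document ids for illustration.
original = ["a", "b", "c", "d", "e", "f", "g"]
reranked = ["c", "a", "g", "b", "d", "e", "f"]
tally = tally_original_ranks(original, reranked)
```

Because `Counter` objects can be added together, tallies from individual test abstracts can be summed into a single histogram of original rank positions; a histogram concentrated near the top of the original list would support the reranking shortcut.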
For this work, JL was funded in part by the National Library of Medicine, where he was a visiting research scientist during the summer of 2006. WJW is supported by the Intramural Research Program of the NIH, National Library of Medicine. JL would also like to thank Esther and Kiri for their kind support.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.