There is evidence to suggest that related article search is a useful feature. Based on PubMed query logs gathered during a one-week period in June 2007, we observed approximately 35 million page views across 8 million browser sessions. Of those sessions, 63% consisted of a single page view, representing bots and direct access into MEDLINE (e.g., from an embedded link or another search engine). Of all sessions in our data set, approximately 2 million include at least one PubMed search query and at least one view of an abstract; this figure roughly quantifies actual searches. About 19% of these involve at least one click on a related article. In other words, roughly a fifth of all non-trivial user sessions contain at least one invocation of related article search. In terms of overall frequency, approximately 5% of all page views in these non-trivial sessions were generated from clicks on related article links. More details can be found in [2].

The use of test collections to assess the performance of retrieval algorithms is a well-established methodology in the information retrieval (IR) literature, dating back to the Cranfield experiments in the 1960s [4]. These tools enable rapid, reproducible experiments in a controlled setting without requiring live users.

Before proceeding, a clarification on terminology: although MEDLINE records contain only abstract text and associated bibliographic information, PubMed provides access to the full text articles (if available). Thus, it is not inaccurate to speak of searching for articles, even though the search itself is only performed on information in MEDLINE. Throughout this work, we use "document" and "article" interchangeably.

### 1.1 Formal Model

We formalize the related document search problem as follows: given a document that the user has indicated interest in, the system's task is to retrieve other documents that the user may also want to examine. Since this activity generally occurs in the context of broader information-seeking behaviors, relevance can serve as one indicator of interest, i.e., retrieve other relevant documents. However, we think of the problem in broader terms: other documents may be interesting because they discuss similar topics, share the same citations, provide general background, lead to interesting hypotheses, etc.

To constrain this problem, we assume in our theoretical model that documents of interest are similar in terms of the topics or concepts that they are *about*; in the case of MEDLINE citations, we limit ourselves to the article title and abstract (the deployed algorithm in PubMed also takes advantage of MeSH terms, which we do not discuss here). Following typical assumptions in information retrieval [7], we wish to rank documents (MEDLINE citations, in our case) based on the probability that the user will want to see them. Thus, our *pmra* retrieval model focuses on estimating *P*(*c*|*d*), the probability that the user will find document *c* interesting given expressed interest in document *d*.

Let us begin by decomposing documents into mutually-exclusive and exhaustive "topics", denoted by the set {*s*_{1}, ..., *s*_{N}}. Assuming that the relatedness of documents is mediated through topics, we get the following:

$$P(c|d) = \sum_{j=1}^{N} P(c|s_j)\,P(s_j|d) \qquad (1)$$

Expanding *P*(*s*_{j}|*d*) by Bayes' Theorem, we get:

$$P(c|d) = \sum_{j=1}^{N} P(c|s_j)\,\frac{P(d|s_j)\,P(s_j)}{P(d)} \qquad (2)$$

Since we are only concerned with the ranking of documents, the denominator *P*(*d*) can be safely ignored, as it is independent of *c*. Thus, we arrive at the following criterion for ranking documents:

$$P(c|d) \propto \sum_{j=1}^{N} P(c|s_j)\,P(d|s_j)\,P(s_j) \qquad (3)$$

Rephrased in prose, *P*(*c*|*s*_{j}) is the probability that a user would want to see *c* given an interest in topic *s*_{j}, and similarly for *P*(*d*|*s*_{j}). Thus, the degree to which two documents are related can be computed by the product of these two probabilities and the prior probability of the topic, *P*(*s*_{j}), summed across all topics.
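To make the ranking criterion concrete, here is a minimal sketch in Python; the topic probabilities are invented purely for illustration:

```python
# Toy illustration of the ranking criterion: c and d are related to the
# degree that interest in both is mediated by the same topics.
# score(c, d) = sum_j P(c|s_j) * P(d|s_j) * P(s_j)

def relatedness(p_c_given_s, p_d_given_s, p_s):
    """Sum the product of the two topic-conditional probabilities,
    weighted by the topic prior, across all topics."""
    return sum(pc * pd * ps
               for pc, pd, ps in zip(p_c_given_s, p_d_given_s, p_s))

# Three hypothetical topics; c and d overlap heavily on the first one.
p_s = [0.2, 0.5, 0.3]   # topic priors P(s_j), made up
p_c = [0.9, 0.1, 0.0]   # P(c|s_j), made up
p_d = [0.8, 0.2, 0.0]   # P(d|s_j), made up

print(relatedness(p_c, p_d, p_s))  # 0.9*0.8*0.2 + 0.1*0.2*0.5 + 0 = 0.154
```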

Thus far, we have not addressed the important question of what a topic actually is. For computational tractability, we make the simplifying assumption that each term in a document represents a topic (that is, each term conveys an idea or concept). Thus, the "aboutness" of a document (i.e., what topics the document discusses) is conveyed through the terms in the document. As with most retrieval models, we assume single-word terms, as opposed to potentially complex multi-word concepts. This satisfies our requirement that the set of topics be exhaustive and mutually-exclusive.

From this starting point, we leverage previous work in probabilistic retrieval models based on Poisson distributions (e.g., [6, 8, 9]). A Poisson distribution characterizes the probability of a specific number of events occurring in a fixed period of time if these events occur with a known average rate. The underlying assumption is a generative model of document content: let us suppose that an author uses a particular term with constant probability, and that documents are generated as a sequence of terms. A Poisson distribution specifies the probability that we would observe the term *n* times in a document. Obviously, this does not accurately reflect how content is actually produced–nevertheless, this simple model has served as the starting point for many effective retrieval algorithms.
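As a concrete illustration of this generative model (a sketch, not the deployed PubMed code), the Poisson probability of observing a term exactly *n* times can be computed directly:

```python
import math

def poisson_pmf(n, rate):
    """Probability of observing a term exactly n times in a document,
    assuming the author emits the term at a known average rate."""
    return math.exp(-rate) * rate ** n / math.factorial(n)

# With an average rate of 1.0 occurrence per document, the term most
# likely appears 0 or 1 times; higher counts fall off rapidly.
for n in range(4):
    print(n, poisson_pmf(n, 1.0))
```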

This content model also assumes that each term occurrence is independent. Although in reality term occurrences are *not* independent–for example, observing the term "breast" in a document makes the term "cancer" more likely to also be observed–such a simplification makes the problem computationally tractable. This is commonly known as the term-independence assumption and dates back to the earliest days of information retrieval research [10]. See [11] for recent work that attempts to introduce term dependencies into retrieval algorithms.

Building on this, we invoke the concept of *eliteness*, which is closely associated with probabilistic IR models [8]. A given document *d* can be *about* a particular topic *s*_{i} or not. Following standard definitions, in the first case we say that the term *t*_{i} (representing the topic *s*_{i}) is *elite* for document *d* (and not elite in the second case).

Let us further assume, as others have before, that elite terms and non-elite terms are used with different frequencies. That is, if the author intends to convey topic *s*_{i} in a document, the author will use term *t*_{i} with a certain probability (the elite case); if the document is not about *s*_{i}, the author will use term *t*_{i} with a different (presumably smaller) probability. We can characterize the observed frequency of a term by a Poisson distribution, defined by a single parameter (the mean), which in our model differs between the elite and non-elite cases.

Thus, we wish to compute *P*(*E*|*k*), the probability that a document is *about* a topic, given that we observed its corresponding term *k* times in the document. By Bayes' rule:

$$P(E|k) = \frac{P(k|E)\,P(E)}{P(k|E)\,P(E) + P(k|\bar{E})\,P(\bar{E})} \qquad (4)$$

Next, we must compute the two probabilities *P*(*k*|*E*) and *P*(*k*|*Ē*). As discussed above, we model the two as Poisson distributions. For the elite case, the distribution is defined by the parameter *λ*; for the non-elite case, by the parameter *μ*:

$$P(k|E) = \frac{\lambda^{k} e^{-\lambda}}{k!} \qquad (5) \qquad\qquad P(k|\bar{E}) = \frac{\mu^{k} e^{-\mu}}{k!} \qquad (6)$$

Substituting these into Bayes' rule above and dividing the numerator and denominator by *P*(*k*|*E*)*P*(*E*) yields:

$$P(E|k) = \frac{1}{1 + \frac{P(\bar{E})}{P(E)} \left(\frac{\mu}{\lambda}\right)^{k} e^{\lambda-\mu}} \qquad (7)$$

After further algebraic manipulation, we get the expression in Equation 8. Since there are differences in length between documents in the same collection, we account for this by introducing *l*, the length of the document in words, so that the Poisson means become *λl* and *μl*. Previous research has shown that document length normalization plays an important role in retrieval performance (e.g., [12]), since longer documents are likely to contain more query terms *a priori*. Finally, we define the parameter *η* = *P*(*Ē*)/*P*(*E*):

$$P(E|k) = \frac{1}{1 + \eta \left(\frac{\mu}{\lambda}\right)^{k} e^{(\lambda-\mu)l}} \qquad (8)$$
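Equation 8 is straightforward to compute directly. The sketch below uses invented rate values (not the parameters actually estimated for MEDLINE); *λ* and *μ* are treated as per-word rates, so the Poisson means for a document of length *l* are *λl* and *μl*:

```python
import math

def p_elite_given_k(k, l, lam, mu, eta=1.0):
    """Equation 8: probability that a document is about a topic, given
    that the topic's term occurs k times in a document of l words.
    lam, mu: per-word elite/non-elite rates; eta = P(not-elite)/P(elite)."""
    return 1.0 / (1.0 + eta * (mu / lam) ** k * math.exp((lam - mu) * l))

# Illustrative values only: the more often the term occurs, the more
# confident we become that the document is about the topic.
for k in (0, 1, 3, 5):
    print(k, p_elite_given_k(k, l=100, lam=0.05, mu=0.01))
```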

How does Equation 8 relate to our retrieval model? Recall from Equation 3 that we need to compute *P*(*c*|*s*_{j}) and *P*(*d*|*s*_{j}), the probability that a user would want to see a particular document given interest in a specific topic. Let us employ *P*(*E*|*k*) for exactly this purpose: we assume that users want to see the elite set of documents for a particular topic, which is computed by observing the frequency of the term that represents the topic. Finally, we approximate *P*(*s*_{i}) with *idf*, that is, the inverse document frequency of *t*_{i}. Putting everything together, we derive the following term weighting and document ranking function:

$$w_{t} = \frac{\sqrt{idf_{t}}}{1 + \eta \left(\frac{\mu}{\lambda}\right)^{k} e^{(\lambda-\mu)l}} \qquad (9) \qquad\qquad \text{sim}(c, d) = \sum_{t} w_{t,c}\, w_{t,d}$$

A term's weight with respect to a particular document (*w*_{t}) can be computed using Equation 9, derived from the estimation of eliteness in our probabilistic topic similarity model. Similarity between two documents is computed by an inner product of term weights, and documents are sorted by their similarity to the current document *d* in the final output. We note that this derivation shares similarities with existing probabilistic retrieval models, which we discuss in Section 3.
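As a sketch of how this ranking might be implemented (the parameter values and toy documents are invented; this is not the deployed PubMed code):

```python
import math

def pmra_weight(k, l, idf, lam, mu, eta=1.0):
    """Equation 9: term weight = sqrt(idf) scaled by P(E|k), the
    probability that the term is elite for this document."""
    return math.sqrt(idf) / (1.0 + eta * (mu / lam) ** k
                             * math.exp((lam - mu) * l))

def similarity(doc_a, doc_b, idf, lam, mu):
    """Inner product of term weights over the terms the documents share.
    Each doc is a (term -> count mapping, length-in-words) pair."""
    counts_a, len_a = doc_a
    counts_b, len_b = doc_b
    return sum(
        pmra_weight(counts_a[t], len_a, idf[t], lam, mu)
        * pmra_weight(counts_b[t], len_b, idf[t], lam, mu)
        for t in counts_a.keys() & counts_b.keys()
    )

# Toy collection statistics (all invented).
idf = {"headache": 5.0, "migraine": 7.0, "aura": 6.0}
d = ({"headache": 3, "migraine": 2}, 120)   # the current document
c1 = ({"headache": 2, "aura": 1}, 150)      # shares "headache" with d
c2 = ({"aura": 2}, 90)                      # shares no terms with d

# c1 shares a term with d, so it outranks c2 (which scores zero).
print(similarity(d, c1, idf, lam=0.05, mu=0.01))
print(similarity(d, c2, idf, lam=0.05, mu=0.01))
```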

### 1.2 Parameter Estimation

The optimization of parameters is one key to good retrieval performance. In many cases, test collections with relevance judgments are required to tune parameters in terms of metrics such as mean average precision (the standard single-point measure for quantifying system performance in the IR literature). However, test collections are expensive to build and not available for many retrieval applications. To address this issue, we have developed a novel process for estimating *pmra* parameters that does not require relevance judgments.

The *pmra* model has three parameters: *λ*, *μ*, and *η*. The first two define the means of the elite and non-elite Poisson distributions, respectively, and the third is *P*(*Ē*)/*P*(*E*). To make our model computationally tractable, we make one additional simplifying assumption: that half the term occurrences in a document are elite and the other half are not. This corresponds to assuming a uniform probability distribution in the absence of any other information; a similar principle underlies maximum entropy models commonly used in natural language processing [13]. This leads to the following:

$$P(E) = P(\bar{E}) = \frac{1}{2} \quad\Longrightarrow\quad \eta = 1$$

Experimental results presented in Sections 2.2 and 2.3 suggest that this assumption works reasonably well. More importantly, it reduces the number of parameters in *pmra* from three to two and yields a slightly simpler weighting function:

$$w_{t} = \frac{\sqrt{idf_{t}}}{1 + \left(\frac{\mu}{\lambda}\right)^{k} e^{(\lambda-\mu)l}} \qquad (10)$$

Nevertheless, we must still determine the parameters *λ* and *μ* (the Poisson parameters for the elite and non-elite distributions). If a document collection were annotated with actual topics, then these values could be estimated directly. Fortunately, for MEDLINE we have exactly this metadata, in the form of MeSH terms associated with each record. MeSH terms are useful for parameter estimation in our model precisely because they represent topics present in the articles. Thus, we can assume that if a MeSH descriptor *H*_{n} is assigned to document *d*, the terms in that descriptor are elite for *d*. For example, if the MeSH descriptor "headache" [C10.597.617.470] were assigned to a citation, then the term "headache" must be elite in that abstract. We can record the frequency of the term and estimate *λ* from such observations. Similarly, we can treat terms in a document that do not appear in any of its MeSH descriptors as the non-elite case, and from this we can derive *μ*. There is, however, one additional consideration: from what set of citations should these parameters be estimated? A few possibilities include the entire corpus, a random sample, or a biased sample (e.g., the results of a search). In this work, we experiment with variants of the third approach.
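One plausible reading of this estimation procedure is sketched below; the data structures and sample values are invented for illustration, and a term is treated as elite in a citation when it appears in one of that citation's assigned MeSH descriptors:

```python
def estimate_rates(citations):
    """Estimate per-word Poisson rates for the elite and non-elite cases.

    Each citation is (term_counts, length_in_words, mesh_terms), where
    mesh_terms is the set of words from the assigned MeSH descriptors.
    The maximum-likelihood estimate of a per-word rate is total observed
    occurrences divided by total document length observed."""
    elite_occ = elite_len = 0
    non_occ = non_len = 0
    for counts, length, mesh in citations:
        for term, k in counts.items():
            if term in mesh:           # elite observation for this term
                elite_occ += k
                elite_len += length
            else:                      # non-elite observation
                non_occ += k
                non_len += length
    lam = elite_occ / elite_len        # elite rate (lambda)
    mu = non_occ / non_len             # non-elite rate (mu)
    return lam, mu

# A single toy citation: "headache" appears in the MeSH descriptors
# (elite), "fatigue" does not (non-elite).
lam, mu = estimate_rates([({"headache": 4, "fatigue": 1}, 100, {"headache"})])
# lam = 4/100, mu = 1/100
```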

As a final note, while it is theoretically possible to estimate the parameter *η* based on MeSH descriptors using a similar procedure, this assumes that the coverage of MeSH terms is complete, i.e., that they completely enumerate all topics present in the abstract. Since the assignment of MeSH is performed by humans, we suspect that recall is less than perfect–therefore, we do not explore this idea further.