Passage relevance models for genomics search
© Urbain et al; licensee BioMed Central Ltd. 2009
Published: 19 March 2009
We present a passage relevance model for integrating syntactic and semantic evidence of biomedical concepts and topics using a probabilistic graphical model. Component models of topics, concepts, terms, and document are represented as potential functions within a Markov Random Field. The probability of a passage being relevant to a biologist's information need is represented as the joint distribution across all potential functions. Relevance model feedback of top ranked passages is used to improve distributional estimates of query concepts and topics in context, and a dimensional indexing strategy is used for efficient aggregation of concept and term statistics. By integrating multiple sources of evidence including dependencies between topics, concepts, and terms, we seek to improve genomics literature passage retrieval precision. Using this model, we are able to demonstrate statistically significant improvements in retrieval precision using a large genomics literature corpus.
Traditional retrieval functions, including state-of-the-art probabilistic and language models, are typically based on a bag-of-words assumption in which text is represented as an unordered set of terms, and any notion of concept identification, term ordering, or proximity is lost. A passage that captures a greater number of distinct query concepts, however, is more likely to be relevant than a document containing fewer concepts whose score is dominated by high IDF or term frequency values. Without modeling contextual dependencies between terms, traditional models are poorly suited to disambiguating terms and to identifying relevant text that lacks explicit term matches. These issues are particularly relevant when retrieving passages from biological literature, where the heavy use of ambiguous terms, acronyms, and term variants makes identification of biological concepts especially challenging. We use concepts here to refer to the meanings, or definitions, of natural language terms, where a concept can be represented by one or more terms, and a term can consist of one or more words.
The use of external knowledge sources coupled with query expansion techniques has been a popular method for identifying concept term variants. For example, bovine spongiform encephalopathy, BSE, and Mad Cow Disease all refer to the same biological concept. Use of external knowledge sources, however, can be problematic. An acronym like IP could represent immunoprecipitation or ischemic preconditioning. In this case we can only disambiguate IP if we have sufficient context to understand that one of the topics covered in the document involves immunoprecipitation rather than cardiology. These techniques also give no relevance weight to passages that are contextually similar but lack explicit matches of key terms. For example, immunoglobulin G can be abbreviated as IGG, Ig G, or IgG. Since all capitals are frequently used in knowledge sources such as the UMLS, a query augmented with IGG would fail to match more general and alternative forms such as IG or IgM using standard gene and protein name normalization techniques.
Dealing with general concepts like gene, protein, or disease can be especially troublesome. First, knowledge sources can generate an intractable number of query expansion terms. Second, general concepts take on a more specific meaning when coupled with contextual information. For example, a general term like protein, when used within the context of the topic chronic wasting disease, is likely to refer to a prion protein. A term like progression, when used within the topic neoplasm, is likely to refer to tumor progression.
To address these issues, we present a passage retrieval model for capturing semantics through the notion of topic and concept relevance by learning the latent relationships between terms and concepts in relevant passages. First we present our passage relevance model, followed by the model's component topic, concept, term, and document models. Next, we review our dimensional indexing and query processing strategies. Finally, we present our results and discussion of prior work.
Passage relevance model
Our passage relevance model is based on the framework of an undirected probabilistic graphical model (Markov Random Field). A graphical model is a graph that models the joint probability distribution over a set of random variables. Each node in the graph is a random variable, and missing edges between nodes represent conditional independencies. By modeling conditional independence assumptions, the full joint distribution can be factorized into a typically much more manageable product of local distributions. Unlike directed graphical models, Markov Random Fields (MRFs) are unable to represent induced dependencies (causality) between random variables. In exchange, the undirected form allows more modeling flexibility, including the ability to model cyclic dependencies and more freedom in defining component models expressed as potential functions over cliques of random variables.
We define our model based on the belief that effective retrieval of relevant passages requires a model for integrating syntactic and semantic evidence from multiple levels of document context. Context is captured at the document level using term statistics, at the passage level through query topic modeling, and at the concept level by identifying concept terms and the terms they co-occur with within the context of a sentence. We posit that the most relevant passages contain the maximum number of distinct query concepts and terms within the minimum spanning lexical distance. We define passages as one or more contiguous sentences identified from the minimum spanning distance of query concepts.
Potential functions in the passage retrieval model are defined for the topic, concept, term, and document cliques:
\[ \psi(Q, P, P_R, \theta_p), \quad \psi(Q, P, P_R, \theta_c), \quad \psi(Q, P, P_R, \theta_t), \quad \psi(Q, P, \theta_d) \]
Since all potential functions are restricted to being strictly positive, it is customary to express them as exponentials, where $f(c)$ is a real-valued feature function over clique values and $\lambda_c$ is the weight given to the feature. As we are interested in the relative likelihood of each ranked passage within a potential function being relevant, and to eliminate parameter tuning, we set all feature function weighting constants $\lambda_c$ to 1 and normalize each function to between 0 and 1 (equation 4).
\[ \psi(c) = \exp(\lambda_c f(c)), \qquad \log \psi(c) = \lambda_c f(c) = f_{norm}(c) \qquad (4) \]
It is important to appreciate that no tuning parameters have been introduced to adjust the weight contributed by each potential function. Instead we rely on the relative likelihood of relevance expressed by each potential function. This notion follows from Robertson's probability ranking principle (PRP). Next, we present the passage model's component topic, concept, term, and document models.
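A minimal sketch of how equally weighted potentials could be combined for ranking. Because each potential is an exponential, the product of potentials is rank-equivalent to the sum of their log values. The min-max rescaling used here to bring each feature into [0, 1] is an assumption (the text only states that each function is normalized), and the function names and feature values are hypothetical.

```python
from typing import Dict, List


def min_max_normalize(scores: List[float]) -> List[float]:
    """Rescale raw feature values to [0, 1] across the candidate passages.

    Assumption: min-max scaling; the paper does not say which normalization it uses.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All candidates tie on this feature, so it carries no ranking signal.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def rank_passages(features: Dict[str, List[float]]) -> List[int]:
    """Rank passages by the sum of normalized log-potentials (all weights set to 1)."""
    n = len(next(iter(features.values())))
    normalized = {name: min_max_normalize(vals) for name, vals in features.items()}
    totals = [sum(normalized[name][i] for name in normalized) for i in range(n)]
    # Sort passage indices by descending total; ties keep their original order.
    return sorted(range(n), key=lambda i: -totals[i])


# Hypothetical raw feature values for three candidate passages.
example = {
    "topic": [0.2, 0.9, 0.4],
    "concept": [1.0, 3.0, 2.0],
    "term": [5.0, 4.0, 1.0],
    "document": [-12.0, -10.0, -15.0],
}
ranking = rank_passages(example)
print(ranking)
```

With no per-model weights to tune, a passage ranks highly only when several component models agree that it is relevant.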
Topic relevance model
Topics are fundamentally based on the distribution of terms within and across documents. In prior efforts, we generated a corpus-wide topic model using an unsupervised Markov Chain Monte Carlo sampling procedure. We evaluated the topic model in isolation and as a component within a probabilistic graphical retrieval model. The overall results from the retrieval model were excellent; however, the topic model component did not significantly improve results, was difficult to parameterize, and was computationally expensive. These results were consistent with Azzopardi, Girolami, and van Rijsbergen, and Wei and Croft. The technique effectively identified related words for automatically generated topics; however, these topics were not necessarily relevant to the topic of a user query.
Our objective in modeling topic relevance is to directly address the issue of learning a topic model that is relevant to the latent structure, or topic, of a user query by capturing the probability of each term over all other terms in a relevant set of passages.
$|S_R|$: the size of the set of relevant passages.
The count of relevant passages containing $w_i$.
The count of paragraphs in the collection containing $w_i$, excluding the relevant passage count. This serves as a proxy for the count of non-relevant passages.
$\beta$: a smoothing parameter, set to zero since only terms occurring in at least two relevant passages are considered.
As a proxy for the relevant set of passages, we sample terms from the top 30 ranked passages containing at least one resolved concept. These passages are ranked with the passage retrieval model (Figure 1), omitting the discrete random variable for topic relevance, since that is the variable we are estimating here. The full model is then evaluated on the top 500 retrieved passages for final ranking.
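The counts that feed the topic relevance estimate can be gathered as in the sketch below. It only collects the statistics described above (relevant passage counts, the non-relevant proxy, and the two-passage minimum); how the counts are combined into a probability is not shown, and the function name, data layout, and toy numbers are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def feedback_term_counts(
    top_passages: List[Set[str]],
    collection_df: Dict[str, int],
    min_relevant: int = 2,
) -> Dict[str, Dict[str, int]]:
    """Collect per-term counts from the top-ranked (pseudo-relevant) passages.

    top_passages: one set of terms per top-ranked passage (the text uses the top 30).
    collection_df: number of paragraphs in the collection containing each term.
    """
    relevant_df = Counter()
    for terms in top_passages:
        relevant_df.update(terms)  # a set, so each passage counts a term at most once

    stats = {}
    for term, r in relevant_df.items():
        if r < min_relevant:
            continue  # only terms seen in at least two relevant passages are kept
        n = collection_df.get(term, r)
        # Collection count minus the relevant count stands in for non-relevant passages.
        stats[term] = {"relevant": r, "non_relevant_proxy": max(n - r, 0)}
    return stats


# Toy data: three pseudo-relevant passages instead of thirty.
top = [{"prion", "protein"}, {"prion", "disease"}, {"protein", "prion"}]
df = {"prion": 10, "protein": 500, "disease": 300}
stats = feedback_term_counts(top, df)
print(stats)
```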
Concepts from 2007 TREC Genomics
Concept 1: Blood protein
Concept 2: Lupus Erythematosus
Concept 8: Lysosome
1. Concept term instances are identified during query processing (refer to the query processing section).
2. The likelihood of each sentence (within a candidate passage) being generated for a given concept is determined from the concept-word co-occurrence distribution. The distribution is generated for all words co-occurring with an instance of a concept term in the same sentence. The likelihood of each sentence $s_k$ within a candidate passage generating a given query concept $c_j$ is estimated using equation (9).
3. As shown in equation (10), the probability of a concept being generated for a given sentence is determined using Jelinek-Mercer style linear weighting of the Boolean presence of the concept, weighted by the likelihood of the concept term instance distinctly representing the concept, and the likelihood of the sentence given the concept (equation 9).
\[ p(c_j \mid s_k) = \lambda \, (\mathrm{present}(c_j) \, \Gamma) + (1 - \lambda) \, p_d(c_j \mid s_k) \qquad (10) \]
4. Finally, the probability of passage P generating query Q for concept model $\theta_c$ is shown in equation (11). C is the set of query concepts, and S is the minimum set of contiguous sentences covering the maximum number of distinct query concepts.
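Equation (10) is a two-term interpolation and can be written directly as a small function. This is a sketch only: the argument names are hypothetical, and the value of λ used in the example is arbitrary because the text does not state the weight used for the concept model.

```python
def concept_given_sentence(present: bool, gamma: float, p_d: float, lam: float) -> float:
    """Interpolate explicit concept presence with distributional evidence (equation 10).

    present: whether a term instance of the concept occurs in the sentence.
    gamma:   likelihood that the term instance distinctly represents the concept.
    p_d:     co-occurrence based likelihood of the concept given the sentence.
    lam:     interpolation weight in [0, 1] (value not given in the text).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * (float(present) * gamma) + (1.0 - lam) * p_d


# A sentence that names the concept, and one that only shares its context words.
with_match = concept_given_sentence(True, gamma=0.9, p_d=0.2, lam=0.5)
without_match = concept_given_sentence(False, gamma=0.9, p_d=0.2, lam=0.5)
print(with_match, without_match)
```

The second call shows why the model can still give weight to a sentence with no explicit term match: the distributional term stays non-zero when the presence indicator is zero.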
Sentence-level term co-occurrence distributions are used with term matching within the term model. The term model uses the same formulation as the concept model and is based on the likelihood of passage terms co-occurring.
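The Jelinek-Mercer language model is used to capture document context (equation 12). We set $\lambda = 0.8$.
\[ p(q \mid d) = \sum_{w \in q} \log\big(\lambda \, P_{ml}(w \mid d) + (1 - \lambda) \, P(w \mid C)\big) \qquad (12) \]
Here $P_{ml}(w \mid d) = tf_d / doclen$ is the maximum likelihood estimate of a term given a document (the term's frequency in the document divided by the document length), and $P(w \mid C)$ is the term's collection frequency.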
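The document model in equation (12) can be sketched in a few lines. The function name and the toy documents are hypothetical, and the collection model is assumed here to be the relative frequency of the term over all tokens in the collection.

```python
import math
from collections import Counter
from typing import List


def jm_score(query: List[str], doc: List[str], collection: List[str], lam: float = 0.8) -> float:
    """Jelinek-Mercer smoothed query log-likelihood for one document."""
    if not doc or not collection:
        raise ValueError("doc and collection must be non-empty")
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    score = 0.0
    for w in query:
        p_ml = doc_tf[w] / len(doc)            # maximum likelihood estimate in the document
        p_c = coll_tf[w] / len(collection)     # background collection estimate (assumed form)
        mixed = lam * p_ml + (1.0 - lam) * p_c
        if mixed == 0.0:
            return float("-inf")               # term unseen anywhere in the collection
        score += math.log(mixed)
    return score


doc1 = ["prion", "protein", "disease"]
doc2 = ["gene", "expression", "protein"]
collection = doc1 + doc2
query = ["prion", "protein"]
score1 = jm_score(query, doc1, collection)
score2 = jm_score(query, doc2, collection)
print(score1, score2)
```

The second document never mentions "prion", yet it still receives a finite score because the collection model supplies a small background probability.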
Dimensional indexing model
By indexing each individual word, queries can be developed for searching single- and multi-word terms. In the data warehousing literature, this model is referred to as a star schema [9, 10]. A more detailed treatment of the dimensional indexing model can be found in Urbain, Goharian, and Frieder.
Lexical Partitioning: Documents are parsed into paragraphs, and sentences.
Indexing: Each term along with its long-form expansion and lexical variants are stored in the index with the same positional information.
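A simplified sketch of a word-level positional index in the spirit of a star schema, using an in-memory SQLite database. The table and column names are hypothetical, only a single term dimension is shown, and positions are counted within each sentence for brevity. It illustrates how a multi-word term can be found by joining postings for adjacent positions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term (term_id INTEGER PRIMARY KEY, text TEXT UNIQUE);
CREATE TABLE posting (            -- fact table: one row per word occurrence
    term_id INTEGER REFERENCES term(term_id),
    doc_id INTEGER,
    paragraph_id INTEGER,
    sentence_id INTEGER,
    position INTEGER
);
""")


def index_sentence(doc_id, paragraph_id, sentence_id, words):
    """Store every word of a sentence with its positional information."""
    for position, word in enumerate(words):
        conn.execute("INSERT OR IGNORE INTO term (text) VALUES (?)", (word,))
        term_id = conn.execute(
            "SELECT term_id FROM term WHERE text = ?", (word,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO posting VALUES (?, ?, ?, ?, ?)",
            (term_id, doc_id, paragraph_id, sentence_id, position),
        )


def find_phrase(first, second):
    """Find occurrences of a two-word term by joining adjacent positions."""
    return conn.execute(
        """
        SELECT a.doc_id, a.paragraph_id, a.sentence_id, a.position
        FROM posting a
        JOIN term ta ON ta.term_id = a.term_id
        JOIN posting b ON b.doc_id = a.doc_id
                      AND b.paragraph_id = a.paragraph_id
                      AND b.sentence_id = a.sentence_id
                      AND b.position = a.position + 1
        JOIN term tb ON tb.term_id = b.term_id
        WHERE ta.text = ? AND tb.text = ?
        ORDER BY a.doc_id, a.position
        """,
        (first, second),
    ).fetchall()


index_sentence(1, 1, 1, ["prion", "protein", "in", "mad", "cow", "disease"])
index_sentence(2, 1, 1, ["protein", "folding", "and", "prion", "strains"])
matches = find_phrase("prion", "protein")
print(matches)
```

Storing an acronym's long form and lexical variants at the same position as the original term, as described above, would let the same join match any of the variants.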
1. Sentences are extracted, and acronyms and their long-forms are identified: PRNP (PRioN Protein).
2. Part-of-speech tagging is performed using our 2nd order statistical Hidden Markov Model tagger: ... role_NN of_II the_DD gene_NN PRNP_NN (_( prion_NN protein_NN )_) in_II the_DD disease_NN Mad_NN Cow_NN Disease_NN.
3. Stop and function words are removed.
4. Candidate concepts are identified by locating non-recursive noun phrases ("noun chunks"): [gene PRNP], [prion protein], [Mad Cow Disease].
5. Candidate concepts are verified in the index, and resolved using the UMLS Metathesaurus® and Entrez Gene databases. If an entity is successfully resolved, all synonyms and one level of hyponyms (from the UMLS) are included.
6. If a synonym is considered ambiguous, it is not included. We consider a term ambiguous if either:
   - The synonym's normalized IDF (NIDF) is < 0.1 (IDF = log(N/df), normalized to between 0 and 1).
   - The synonym correlates with the long-form in less than 20% of all instances within the acronym table.
[Encephalopathy, Bovine Spongiform]
[Mad Cow Disease]
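The ambiguity test in step 6 can be sketched as follows. The thresholds come from the text; the normalization of IDF by its maximum value log(N) is an assumption, since the text says only that IDF is normalized to between 0 and 1, and the function and argument names are hypothetical.

```python
import math


def normalized_idf(df: int, n_docs: int) -> float:
    """IDF = log(N/df), scaled to [0, 1] by dividing by log(N) (assumed normalization)."""
    if n_docs <= 1 or df <= 0 or df > n_docs:
        raise ValueError("require 0 < df <= n_docs and n_docs > 1")
    return math.log(n_docs / df) / math.log(n_docs)


def is_ambiguous(
    df: int,
    n_docs: int,
    long_form_matches: int,
    acronym_instances: int,
    nidf_threshold: float = 0.1,
    correlation_threshold: float = 0.20,
) -> bool:
    """Apply the two exclusion rules for a candidate synonym."""
    if normalized_idf(df, n_docs) < nidf_threshold:
        return True  # too common to be a useful expansion term
    if acronym_instances > 0:
        correlation = long_form_matches / acronym_instances
        if correlation < correlation_threshold:
            return True  # the acronym usually stands for some other long-form
    return False


too_common = is_ambiguous(df=600_000, n_docs=1_000_000, long_form_matches=0, acronym_instances=0)
reliable = is_ambiguous(df=100, n_docs=1_000_000, long_form_matches=90, acronym_instances=100)
wrong_sense = is_ambiguous(df=100, n_docs=1_000_000, long_form_matches=10, acronym_instances=100)
print(too_common, reliable, wrong_sense)
```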
The position of all term variants of each concept is retrieved from the dimensional index by paragraph.
A minimum-spanning tree is constructed from the adjacency list by determining the maximum number of distinct concepts identified within the shortest lexical distance.
Finally, the passage boundary based on the first and last occurrences of distinct concepts is expanded out to include sentence boundaries.
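The passage identification steps above can be approximated with a sliding window over concept positions. Note that this is a swapped-in technique, not the minimum-spanning tree over an adjacency list that the text describes: it finds the shortest token span in a paragraph that covers every distinct concept present, then widens the span to sentence boundaries. All names and positions are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def min_span(hits: List[Tuple[int, str]]) -> Tuple[int, int]:
    """Shortest (start, end) token span covering every distinct concept in hits.

    hits: (token position, concept name) pairs found in one paragraph.
    """
    if not hits:
        raise ValueError("no concept hits in paragraph")
    hits = sorted(hits)
    needed = len({concept for _, concept in hits})
    counts = Counter()
    covered = 0
    best: Optional[Tuple[int, int]] = None
    left = 0
    for pos_r, concept_r in hits:
        counts[concept_r] += 1
        if counts[concept_r] == 1:
            covered += 1
        # Shrink from the left while the window still covers all concepts.
        while covered == needed:
            pos_l, concept_l = hits[left]
            if best is None or pos_r - pos_l < best[1] - best[0]:
                best = (pos_l, pos_r)
            counts[concept_l] -= 1
            if counts[concept_l] == 0:
                covered -= 1
            left += 1
    return best


def expand_to_sentences(span: Tuple[int, int], sentence_bounds: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Widen a token span to the start and end of the sentences it touches."""
    start, end = span
    first = next(b for b in sentence_bounds if b[0] <= start <= b[1])
    last = next(b for b in sentence_bounds if b[0] <= end <= b[1])
    return (first[0], last[1])


hits = [(2, "gst"), (10, "affinity"), (14, "cleavage"), (17, "gst"), (30, "affinity")]
sentences = [(0, 8), (9, 20), (21, 35)]  # inclusive token ranges per sentence
span = min_span(hits)
passage = expand_to_sentences(span, sentences)
print(span, passage)
```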
Passage level concept search is illustrated with the following query: "Exact reactions that take place when you do glutathione S-transferase (GST) cleavage during affinity chromatography".
First, concepts and term variants are identified:
Cleavage: [cleavag], [merogenesi], [cytokinesi]
Affinity purification: [affin, purif], [affin, chromatographi]
Glutathione S-transferase: [glutathion, s, transferase], [gst]
Second, the index is searched for all concept term variants.
Third, passages are identified: "affinity chromatography, and purified Mce1A and Mce1E, free of the fusion partner, were recovered following specific proteolytic cleavage of the GST"
Fourth, passages are expanded to sentence boundaries: "The fusion proteins were purified to near homogeneity by affinity chromatography, and purified Mce1A and Mce1E, free of the fusion partner, were recovered following specific proteolytic cleavage of the GST portion by thrombin protease."
Results 2007 TREC Genomics collection (MAP)
TREC 2007 Submission
Min Spanning Passage 5
0.3554 (+14.46%) p = 0.0582
0.1214 (+24.39%) p = 0.0321 †
Max Spanning Passage 6
0.3576 (+15.17%) p = 0.0504
0.1280 (+16.68%) p = 0.0834
To get a better understanding of the effectiveness of our proposed topic relevance model, we include the results from automatically learned topic models from our earlier work. These topic models lack a model of relevance with respect to the user's information need. The topic relevance model outperforms passages scored using the general topic model by 28.07%. This is clearly a more effective technique for incorporating topic models in information retrieval.
To understand the contributions of each component model, we have listed the results for ranking passages by each model individually along with the results of the full passage retrieval model (integrating evidence from the document, concept, term, and topic relevance models). The percentage improvements shown for the full passage retrieval model are relative to the top results in each category from all submissions to the 2007 TREC Genomics track.
The use of the minimal spanning distance of distinct concepts expanded to sentence boundaries for defining passages for evaluation resulted in a significant improvement for the Passage measurement which emphasizes precision, but less effective results for the Passage2 and Aspect measurements which place greater emphasis on recall. As an alternative, we evaluated submitting passages by the maximal spanning distance of all concepts (non-distinct). This resulted in significant improvements in the Passage2 and Aspect scores, no significant difference in the Document score, and a modest decrease in the Passage score. The overall results show significant improvements in Document and Passage retrieval using min spanning distance passages, and also improvements for Passage2 retrieval using the max spanning distance passages.
Discussion and related work
The proposed passage retrieval model exceeded the top results in each category, and demonstrated statistically significant improvements in document and the original passage retrieval measurement across a large test collection of genomics literature. The model can be used to help disambiguate polysemous terms and provide weight to potentially relevant passages without explicit term matching by capturing term co-occurrence distributions within context, and incorporating these distributions within a statistical relevance model.
Combining evidence from all component models, rather than using evidence from any individual component alone, achieved the best results. Examination of relevant passages returned from the system indicated that the use of term distributions within the concept, term, and topic relevance models had the most significant impact on passages where the system was able to identify only one or a small number of potentially ambiguous query concepts or terms, respectively. In these cases, the distributions were helpful in disambiguating acronyms and terms for biological concepts. For example, in the 2005 TREC query: How to "open up" cell through "electroporation," we were able to identify only one distinct concept: electroporation, as the acronym EPT.
The system was able to disambiguate the use of the acronym EPT in the following two passages and rank the second passage significantly higher even though electroporation is often used to treat endocrine pancreatic tumors.
Passage 1: ...malignant potential among endocrine pancreatic tumors (EPTs) varies greatly and can frequently not be predicted using histopathological parameters...
Passage 2: ... EPT, which uses pulsed electric fields in combination with a chemotherapeutic agent is being developed to treat human pancreatic tumors...
In the 2007 TREC query "What serum [proteins] change in association with high disease activity in lupus?", the system was able to identify only the concept lupus, in the form of SLE (Systemic Lupus Erythematosus), in the following passage. We were not able to identify serum in the form of sera, but we were able to accurately identify this passage as relevant and rank it higher than less relevant passages about lupus that did not deal with blood serum.
Passage: ... SLE sera were used. The most marked nuclear staining occurred with sera from patients with active disease...
In passages where multiple distinct concepts are identified, the distributions had only a marginal effect in the ranking as enough evidence is provided by the presence of other query concepts for accurate disambiguation.
The topic relevance model clearly improved results, exceeding the automatically learned topic models by 28.07%, and our 2007 TREC results, which had the benefit of using concepts. The topic-relevance model also exceeded the median results of the track in all categories, and improved the performance of our composite passage retrieval model. Most importantly, the topic relevance model significantly outperforms general topic modeling using only a fraction of the computational resources.
To the best of our knowledge, there has not been prior work that models topics, concepts, and terms with distributional evidence within the framework of an undirected graphical model. Blei, Jordan, and Ng introduced the idea of using hierarchical Bayesian models for applications in information retrieval, including the estimation of latent Dirichlet hyperparameters using variational Bayes inference. They reported empirical results only and did not analyze precision with respect to user queries. Liu and Croft introduced a cluster model for document retrieval; and Azzopardi, Girolami, and van Rijsbergen, and Wei and Croft used an LDA-based language model for document retrieval. Both techniques demonstrated good results, but did not exceed the results of top-performing relevance-based language models.
Using word co-occurrence information has a long history in word sense disambiguation research and goes back to the famous dictum by J. R. Firth: "you shall know a word by the company it keeps". Yarowsky showed that with high probability a polysemous word has one sense per discourse.
Several researchers have made contributions to modeling term dependencies. Most work has focused on phrases, term proximity, and co-occurrence for pairs of terms [21, 22]. Metzler and Croft developed a Markov Random Field for modeling single terms, ordered phrases, and unordered phrases. They explored a number of independence assumptions and optimized their model for mean average precision rather than likelihood to achieve their best results.
Using retrieval of fixed-length passages of text to improve retrieval of relevant documents is based on the premise that only a small portion of each relevant document is relevant to a user's query. Similarity coefficients are computed at the passage level, and the highest scoring passage or some combination of the scores of individual passages is used to compute a document's similarity coefficient [24–26]. Callan used a combination score with document and passage level evidence to obtain their best results. These efforts focused on fixed-length passages of text and did not include multiple levels of document context and semantic evidence. Tellex performed a quantitative evaluation of passage retrieval algorithms used by question-answering systems. Common to all three top-performing algorithms is a non-linear boost to query terms that occur very close together in a candidate passage.
We presented a passage relevance model based on an undirected graphical model (Markov Random Field), and methods for modeling concepts, terms, and topic relevance as potential functions within the model. Using relevance modeling, we have introduced a new, more effective method for incorporating topic modeling into information retrieval applications that is also computationally efficient. Topic modeling using relevance outperformed automatically generated topic models by 28.07%.
The full model outperforms models of query terms, concepts, document, or passage relevance alone. Modeling query topic relevance improves the overall performance of the model and significantly outperforms topic models without relevance modeling. The model exceeds the top results in each category of retrieval as assessed by the 2007 TREC Genomics track and the results are statistically significant for automatic document and passage retrieval.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 3, 2009: Second International Workshop on Data and Text Mining in Bioinformatics (DTMBio) 2008. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S3.
1. Urbain J, Goharian N, Frieder O: IIT TREC-2006: Genomics Track. Proceedings of the Fifteenth Text REtrieval Conference. 2006.
2. Robertson S: The probability ranking principle in IR. Journal of Documentation. 1977, 33 (4): 294-303.
3. Steyvers M: Probabilistic Topic Models. Latent Semantic Analysis: A Road to Meaning. Edited by: Landauer T, McNamara D, Dennis S, Kintsch W. 2006, Lawrence Erlbaum.
4. Urbain J, Frieder O, Goharian N: Probabilistic Passage Models for Semantic Search of Genomics Literature. Journal of the American Society of Information Science. 2008, 59 (12): 2008-2023.
5. Azzopardi L, Girolami M, van Rijsbergen CJ: Topic Based Language Models for ad hoc Information Retrieval. Proceedings of the International Joint Conference on Neural Networks. 2004.
6. Wei X, Croft WB: LDA-based document models for ad-hoc retrieval. Proceedings of the 29th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. 2006.
7. Lavrenko V, Croft WB: Relevance-based language models. Proceedings of the 24th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2001, 120-127.
8. Hersh W: TREC 2007 Genomics Track Overview. The Sixteenth Text REtrieval Conference Proceedings. 2007.
9. Kimball R: Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses. 1996, John Wiley.
10. Gray J, Chaudhuri S, Bosworth A, Layman A, Reichart D, Venckatrao M, Pells F: Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. Data Mining and Knowledge Discovery. 1997, 1 (1): 29-53.
11. Urbain J, Goharian N, Frieder O: Combining Semantics, Context, and Statistical Evidence in Genomics Literature Search. IEEE 7th International Symposium on BioInformatics and BioEngineering. 2007.
12. Schwartz A, Hearst M: A simple algorithm for identifying abbreviation definitions in biomedical text. Pacific Symposium on Biocomputing. 2003.
13. Urbain J, Goharian N, Frieder O: IIT TREC 2007 Genomics Track: Using Concept-Based Semantics in Context for Genomics Literature Passage Retrieval. The Sixteenth Text REtrieval Conference (TREC 2007) Conference Proceedings. 2007.
14. National Center for Biotechnology Information (NCBI). [http://www.ncbi.nlm.nih.gov/]
15. Demner-Fushman D, Humphrey SM, Ide NC, Loane RF, Mork JG, Ruiz ME, Smith LH, Wilbur WJ, Aronson AR, Ruch P: Combining Resources to Find Answers to Biomedical Questions. The Sixteenth Text REtrieval Conference Proceedings. 2007.
16. Zhou W, Yu C: TREC Genomics Track at UIC. The Sixteenth Text REtrieval Conference Proceedings. 2007.
17. Blei D, Jordan M, Ng A: Hierarchical Bayesian models for applications in information retrieval. Bayesian Statistics. Edited by: Bernardo JM, Bayarri M, Berger JO, Dawid AP, Heckerman D, Smith AFM, West M. 2003, 7.
18. Liu X, Croft WB: Cluster-based retrieval using language models. Proceedings of the 27th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2004, 186-193.
19. Firth JR: A Synopsis of Linguistic Theory, 1930–1955. Studies in Linguistic Analysis. 1957, Oxford: Blackwell, 1-32.
20. Yarowsky D: Word Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora. Proceedings, COLING-92. 1992.
21. Croft WB, Turtle HR, Lewis DD: The use of phrases and structured queries in information retrieval. Proceedings of the 14th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1991.
22. Rijsbergen CJ: A theoretical basis for using co-occurrence data in information retrieval. Journal of Documentation. 1977, 33 (2): 106-119.
23. Metzler D, Croft WB: A Markov random field model for term dependencies. Proceedings of the 28th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2005.
24. Ittycheriah A, Roukos S: IBM's Statistical Question Answering System. TREC-11. 2001.
25. Kaszkiel M, Zobel J: Passage retrieval revisited. Proceedings of the 20th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. 2001.
26. Lin J: Role of Information Retrieval in Answering Complex Questions. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. 2006.
27. Callan J: Passage-Level Evidence in Document Retrieval. Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. 1994.
28. Tellex S, Katz B, Lin J, Fernandes A, Marton G: Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering. Proceedings of the 26th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. 2003.
29. Urbain J, Frieder O, Goharian N: Passage Relevance Models for Genomics Search. Data and Textmining in Bioinformatics DTMBIO CIKM. 2008.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.