PubMatrix: a tool for multiplex literature mining
© Becker et al; licensee BioMed Central Ltd. 2003
Received: 29 July 2003
Accepted: 10 December 2003
Published: 10 December 2003
Molecular experiments using multiplex strategies such as cDNA microarrays or proteomic approaches generate large datasets requiring biological interpretation. Text-based data mining tools have recently been developed to query large biological datasets of this type. PubMatrix is a web-based tool that allows simple text-based mining of the NCBI literature search service PubMed using any two lists of keyword terms, resulting in a frequency matrix of term co-occurrence.
For example, a simple term selection procedure allows automatic pair-wise comparisons of approximately 1–100 search terms versus approximately 1–10 modifier terms, resulting in up to 1,000 pair-wise comparisons. The matrix table of pair-wise comparisons can then be surveyed, queried individually, and archived. Lists of keywords can include any terms that can currently be searched in PubMed. In the context of cDNA microarray studies, this may be used to annotate gene lists from clusters of genes that are expressed coordinately. An associated PubMatrix public archive provides previous searches that use common, useful lists of keyword terms.
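The pair-wise expansion described above can be sketched in a few lines of Python. This is only an illustration of how two lists multiply out into Boolean queries; the function name `pairwise_queries`, the placeholder terms, and the exact `(A) AND (B)` query format are assumptions and are not taken from the PubMatrix source.

```python
from itertools import product


def pairwise_queries(search_terms, modifier_terms):
    """Return one Boolean AND query per (search term, modifier term) pair."""
    return [
        f"({search}) AND ({modifier})"
        for search, modifier in product(search_terms, modifier_terms)
    ]


# Hypothetical placeholder terms, used only to show how the counts scale.
search_terms = [f"gene{i}" for i in range(1, 101)]    # 100 search terms
modifier_terms = [f"topic{i}" for i in range(1, 11)]  # 10 modifier terms

queries = pairwise_queries(search_terms, modifier_terms)
print(len(queries))  # 100 x 10 = 1000 pair-wise comparisons
print(queries[0])    # (gene1) AND (topic1)
```

With 100 search terms and 10 modifier terms, the list holds the 1,000 comparisons quoted above.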
In this way, lists of terms, such as gene names or functional assignments, can be assigned genetic, biological, or clinical relevance in a rapid, flexible, systematic fashion. http://pubmatrix.grc.nia.nih.gov/
With the advent of high-throughput genomic and proteomic approaches, the ability to generate data has outstripped the ability to assign biological relevance. Searching the MEDLINE literature database of more than 14 million entries one-by-one makes establishing biological significance a daunting task. The basic PubMed search window contains a typical single search box for the input of simple keyword combinations. PubMed does allow complex searches using advanced search options, but this requires some knowledge of search string assembly and an understanding of the PubMed Entrez programming utilities. These are still relatively obscure for many molecular biologists.
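To give a sense of the search string assembly involved, the sketch below builds a single Entrez ESearch request URL for one pair of terms. It is a hedged illustration rather than a description of PubMatrix itself: the helper name `esearch_count_url` is hypothetical, the endpoint URL is the ESearch address as understood at the time of writing (it differs from the older help page cited in the references), and the request is only constructed, not sent.

```python
from urllib.parse import urlencode

# ESearch endpoint of the NCBI E-utilities (assumed current base URL).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_count_url(term_a, term_b):
    """Build an ESearch URL asking PubMed how many records match both terms."""
    query = f"({term_a}) AND ({term_b})"
    params = {
        "db": "pubmed",      # search the PubMed database
        "term": query,       # Boolean query string
        "rettype": "count",  # ask for the hit count only
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = esearch_count_url("BDNF", "asthma")
print(url)
```

Even this single request requires the user to know the database name, the Boolean syntax, and the URL encoding rules, which is the kind of overhead the paragraph above refers to.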
Literature mining approaches have been developed to place multiplex biological datasets into context relative to published medical literature. Computational tools such as PubGene, VxInsight, MedMiner, EASE, MeSHmap, XplorMed, AbXtract, and HAPI are available that allow the user to query more complex gene name or keyword combinations, including multiplex proteomic or cDNA microarray results, against literature citations in PubMed with defined types of output. These computational tools use different strategies in literature mining and in some cases produce statistical significance or graphical displays.
Similarly, literature-based annotation projects such as the GO project have devised a complex hierarchical annotation system using expert annotators based on a defined controlled vocabulary. This is important because of significant problems with nomenclature in molecular biology. Although quite useful, the GO approach does not allow for flexibility, variable interpretation, or individual investigator input in the hierarchies assigned to individual genes. In this report, we describe a simple, freely accessible, web-based application for basic literature mining of PubMed that performs automatic multiplex Boolean queries. This allows the naive user to assign biological relevance and annotate gene lists through PubMed.
PubMatrix is a CGI front-end application that submits queries consisting of search and modifier terms against NCBI's PubMed database and presents the results as a matrix of document hits. Results are stored in a database for retrieval and are presented to the user as hyperlinks for rerunning individual queries of interest. The application runs on an Apache HTTP server using the Perl programming language and a MySQL database for storing terms and results.
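The matrix of document hits can be pictured with the following Python sketch. It is not the Perl/CGI implementation described above; the names `build_hit_matrix`, `format_matrix`, and `fake_count` are hypothetical, and the counter is a stand-in that returns the length of the query string so the example runs without network access. In real use, the counter would submit each query to PubMed and return the number of matching documents.

```python
def build_hit_matrix(search_terms, modifier_terms, count_hits):
    """Return {search_term: {modifier_term: hit_count}} for every pair."""
    matrix = {}
    for search in search_terms:
        row = {}
        for modifier in modifier_terms:
            query = f"({search}) AND ({modifier})"
            row[modifier] = count_hits(query)
        matrix[search] = row
    return matrix


def format_matrix(matrix, modifier_terms):
    """Render the matrix as tab-separated text, one search term per row."""
    lines = ["\t".join(["term"] + list(modifier_terms))]
    for search, row in matrix.items():
        cells = [str(row[modifier]) for modifier in modifier_terms]
        lines.append("\t".join([search] + cells))
    return "\n".join(lines)


def fake_count(query):
    """Stand-in counter: returns the query length, NOT a real PubMed hit count."""
    return len(query)


modifiers = ["asthma", "acyclovir"]
matrix = build_hit_matrix(["APOB", "ACE"], modifiers, fake_count)
table = format_matrix(matrix, modifiers)
print(table)
```

Passing the counter in as a function keeps the matrix logic separate from the PubMed access layer, so the same code can be exercised offline or pointed at stored results.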
PubMatrix is a simple, intuitive multiplex comparison tool requiring no user understanding of algorithm design, Perl, or scripting, and minimal instruction time. It allows automatic, systematic searching of PubMed followed by a quick, intuitive survey of results, and as such dramatically reduces the investigator time needed to query PubMed compared with a gene-by-gene approach. Moreover, it is systematic and objective. Unlike typical PubMed searches performed on an ad hoc basis, the 1st and the 1000th query are searched in an unbiased manner. This helps the investigator avoid being distracted by seemingly interesting comparisons early in a traditional literature search session. It is flexible in the sense that any list of terms in almost any combination may be used. With PubMatrix, large lists of keywords can be compared and, when used with lists of gene names and functions, may be used to analyze and annotate microarray gene lists and datasets. GO keywords may be used to annotate gene lists in a semi-automated fashion. Additionally, the ability to save searches in an archive allows sharing among collaborators and public accessibility of curated lists of useful search results and search terms.
Examples of categorical search lists
Official Gene Symbols: APOB, ACE, BDNF, CD45, ...
D1S478, D6S470, D13S193, ...
AAATTT, CAGCAG, TTTTTT, ...
1ter*, 1p36*, 1p35*, ..., Xq27*, Xter*
sweden, canad*, mexic*, finland, ...
Common Prescription drugs: acetaminophen, acyclovir, albuterol, alprazolam, ...
atopic dermatitis, asthma, crohn's, Celiac, Graves', ...
Date of Publication: 1973 [dp], 1974 [dp], ..., 2000 [dp], 2001 [dp], 2002 [dp]
Weiss A, Pierce SK, Kupfer A, ...
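Some of these lists are regular enough to generate rather than type. As a small sketch, the publication-date list can be produced with the PubMed `[dp]` field tag shown in the examples; the function name `publication_year_terms` is hypothetical.

```python
def publication_year_terms(first_year, last_year):
    """Return one PubMed date-of-publication term per year, inclusive."""
    return [f"{year} [dp]" for year in range(first_year, last_year + 1)]


year_terms = publication_year_terms(1973, 2002)
print(year_terms[:2], year_terms[-1])
```

Used as the second list, such terms would give one column per publication year for each term in the first list.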
PubMatrix allows a simple systematic approach to query the medical literature in PubMed with comparative keyword lists. It performs simple automatic queries and greatly reduces analysis time. In this way, increasingly large datasets generated by high-throughput multiplex assays such as proteomic or microarray assays can be mined, archived, displayed, and annotated for biological and disease relevance.
Availability and requirements
PubMatrix is available for free use at this URL: http://pubmatrix.grc.nia.nih.gov/
The authors would like to thank E. Hovig and D. Wheeler for helpful discussions, and C. Sherman-Baust and P. Morin for sharing data.
- Jenssen TK, Laegreid A, Komorowski J, Hovig E: A literature network of human genes for high-throughput analysis of gene expression. Nat Genet 2001, 28: 21–28. 10.1038/88213
- Kim SK, Lund J, Kiraly M, Duke K, Jiang M, Stuart JM, Eizinger A, Wylie BN, Davidson GS: A gene expression map for Caenorhabditis elegans. Science 2001, 293: 2087–2092. 10.1126/science.1061603
- Tanabe L, Scherf U, Smith LH, Lee JK, Hunter L, Weinstein JN: MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling. Biotechniques 1999, 27: 1210–1214, 1216–1217.
- Hosack DA, Dennis G, Sherman BT, Lane HC, Lempicki RA: Identifying biological themes within lists of genes with EASE. Genome Biology 2003, 4: R70. 10.1186/gb-2003-4-10-r70
- Srinivasan P: MeSHmap: a text mining tool for MEDLINE. Proc AMIA Symp 2001, 642–646.
- Perez-Iratxeta C, Perez AJ, Bork P, Andrade MA: Update on XplorMed: A web server for exploring scientific literature. Nucleic Acids Res 2003, 31: 3866–3868. 10.1093/nar/gkg538
- Andrade MA, Valencia A: Automatic annotation for biological sequences by extraction of keywords from MEDLINE abstracts. Development of a prototype system. Proc Int Conf Intell Syst Mol Biol 1997, 5: 25–32.
- Masys DR, Welsh JB, Fink JL, Gribskov M, Klacansky I, Corbeil J: Use of keyword hierarchies to interpret gene expression patterns. Bioinformatics 2001, 17: 319–326. 10.1093/bioinformatics/17.4.319
- The Gene Ontology Consortium: Gene Ontology: tool for the unification of biology. Nature Genetics 2000, 25: 25–29. 10.1038/75556
- Entrez Programming Utilities [http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html]
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 1998, 95: 14863–14868. 10.1073/pnas.95.25.14863
- Asher B: Decision analytics software solutions for proteomics analysis. J Mol Graph Model 2000, 18: 79–82.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.