GAPscreener: An automatic tool for screening human genetic association literature in PubMed using the support vector machine technique
© Yu et al; licensee BioMed Central Ltd. 2008
Received: 07 December 2007
Accepted: 22 April 2008
Published: 22 April 2008
Synthesis of data from published human genetic association studies is a critical step in the translation of human genome discoveries into health applications. Although genetic association studies account for a substantial proportion of the abstracts in PubMed, identifying them with standard queries is not always accurate or efficient. Further automating the literature-screening process can reduce the burden of a labor-intensive and time-consuming traditional literature search. The Support Vector Machine (SVM), a well-established machine learning technique, has been successful in classifying text, including biomedical literature. The GAPscreener, a free SVM-based software tool, can be used to assist in screening PubMed abstracts for human genetic association studies.
The data source for this research was the HuGE Navigator, formerly known as the HuGE Pub Lit database. Weighted SVM feature selection based on a keyword list obtained by the two-way z score method demonstrated the best screening performance, achieving 97.5% recall, 98.3% specificity and 31.9% precision in performance testing. Compared with the traditional screening process based on a complex PubMed query, the SVM tool reduced the number of abstracts requiring individual review by the database curator by about 90%. The tool also identified 47 articles that were missed by the traditional literature screening process during the 4-week test period. We examined the literature on genetic associations with preterm birth as an example. Compared with the traditional, manual process, the GAPscreener both reduced effort and improved accuracy.
GAPscreener is the first free SVM-based application available for screening the human genetic association literature in PubMed with high recall and specificity. The user-friendly graphical user interface makes this a practical, stand-alone application. The software can be downloaded at no charge.
The peer-reviewed scientific literature is a major source of information for developing research hypotheses and creating new knowledge through synthesis of research findings. The information explosion in biomedical science has created a huge challenge for researchers who want to obtain useful information promptly and efficiently. Human genetic association studies epitomize this challenge because they have proliferated rapidly since completion of the Human Genome Project. Systematic review and meta-analysis have become important approaches for evaluating the robustness of such associations across different study platforms and populations. A key factor in the quality of a systematic review is complete capture of the relevant studies. Many databases that deposit genetic association information, including citations from PubMed, have been built and curated [5–7]. PubMed is the largest publicly accessible biomedical literature database and is the main source for such activities. However, because of its large size and the complex syntax required for query formation, it is fairly difficult to comprehensively and effectively search PubMed for genetic association studies. The necessarily labor-intensive screening and curation process makes the maintenance of such databases extremely challenging.
Automatic literature classification is becoming increasingly attractive and has already demonstrated some successes in the biomedical literature [9–12]. The support vector machine (SVM) method is a powerful machine learning technique that has been used to solve classification problems [14–18]. An earlier report described a potential application of SVM methods to classify literature on human genome epidemiology. In this paper, we report a novel method for feature selection and show that using it to train the SVM model significantly improved its ability to classify reports of human genetic association studies. We implemented the method as a Java-based application named GAPscreener (Genetic Association Publication screener) that can be freely downloaded.
SVM Model Generation
To generate the training dataset for the SVM experiment, we used 10,000 randomly selected abstracts from articles published between 2001 and 2006 in PubMed as a background dataset. The positive dataset consisted of 10,000 randomly selected gene-disease association articles from the HuGE Navigator (formerly known as the HuGE Pub Lit database), a continuously updated database of studies relevant to human genome epidemiology sponsored by the National Office of Public Health Genomics. Inclusion and exclusion criteria for the positive dataset from the HuGE Pub Lit database have been reported.
PubMed abstract text retrieval
We developed a PubMed text extraction tool using the NCBI E-utility to retrieve text content based on PubMed identification numbers (PMIDs). The text used for processing consisted of the title and the abstract, or the title alone if the abstract was not available. The text data were stored in a data structure for processing.
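The retrieval step can be illustrated with a minimal Python sketch. It is not the GAPscreener code (which is Java-based); it assumes the public EFetch endpoint of the NCBI E-utilities and the standard PubMed XML element names, and the function names build_efetch_url and extract_text are hypothetical. The sketch only builds the request URL and parses a returned XML document; sending the request is left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Public EFetch endpoint of the NCBI E-utilities.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmids):
    """Return an EFetch URL requesting PubMed XML for the given PMIDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),
        "retmode": "xml",
    }
    return EFETCH_URL + "?" + urlencode(params)


def extract_text(pubmed_xml):
    """Map each PMID to its title plus abstract, or the title alone."""
    records = {}
    root = ET.fromstring(pubmed_xml)
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        title_node = article.find(".//ArticleTitle")
        title = "".join(title_node.itertext()) if title_node is not None else ""
        # An abstract may be split into several AbstractText sections.
        parts = ["".join(node.itertext()) for node in article.iter("AbstractText")]
        abstract = " ".join(p for p in parts if p)
        records[pmid] = (title + " " + abstract).strip()
    return records
```

Keeping the result as a dictionary keyed by PMID is one simple choice of data structure for the later processing steps; the source does not say which structure was actually used.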
Text processing and extraction of keywords
The abstract and title of each article were then processed with the text-processing tool we developed. A stemming technique was used to deal with morphologic word changes, for example, polymorph(isms) and polymorph(ic) were considered the same word. A stop word list was generated for some common English words, such as pronouns and articles, to reduce the number of words extracted.
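As a rough illustration of this step, the following Python sketch tokenizes text, drops stop words, and stems the remaining words. The stop word list and the suffix rules are hypothetical and deliberately tiny; the stemming technique and stop word list actually used are not specified here, and a production system would more likely rely on an established stemmer.

```python
import re

# Hypothetical, deliberately small stop word list (pronouns, articles, etc.).
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "we", "it", "with", "was", "were"}

# Hypothetical suffix rules, checked longest first.
SUFFIXES = ("isms", "ism", "ic", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping a stem of at least four letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def extract_keywords(text):
    """Lowercase, tokenize, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]
```

With these toy rules, "polymorphisms" and "polymorphic" both reduce to "polymorph", matching the example given in the text.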
Significant keyword generation
We selected keywords by identifying statistically significant differences between the probability of their occurrence in the text (title and abstract) of human genetic association articles, compared with their frequency in all other articles. The sample sizes of both groups were large enough that the distribution of differences in probabilities was approximated by a normal distribution. Thus words with a z score greater than 1.96 or less than -1.96 (significance level of α = 0.05) were chosen as feature keywords.
p1 = probability of occurrence of word in genetic association abstracts.
p2 = probability of occurrence of word in non-genetic association abstracts.
n1 = total occurrences of word in genetic association abstracts.
n2 = total occurrences of word in non-genetic association abstracts.
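The formula to which these definitions belong is not reproduced above. As a hedged illustration only, the Python sketch below computes a standard pooled two-proportion z statistic, which is one common form consistent with a normal approximation to the difference p1 - p2. It assumes that x1 and x2 are the numbers of abstracts containing the word in each group and treats n1 and n2 as the group sizes; that reading, and the function names, are assumptions rather than the authors' stated method.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """z statistic for p1 - p2 using a pooled standard error.

    x1, x2: abstracts containing the word in each group (assumed meaning).
    n1, n2: group sizes used as denominators (assumed meaning).
    """
    p1 = x1 / n1
    p2 = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if standard_error == 0:
        # The word occurs in none or all of the abstracts: no evidence either way.
        return 0.0
    return (p1 - p2) / standard_error


def is_feature_keyword(z, threshold=1.96):
    """Apply the two-sided cutoff described in the text."""
    return abs(z) > threshold
```

For example, a word present in 60 of 100 positive abstracts and 40 of 100 background abstracts gives z of about 2.83 under this formula and would pass the 1.96 cutoff.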
Generating SVM input data
The statistically significant keywords are called feature keywords and were used to construct the SVM features. Each feature keyword was weighted according to its z score, normalized to values from -1 to +1. For the training and testing data sets, the script generated the SVM input based on sparse format. The presence of each keyword was represented by its position on the feature keyword list, followed by a colon and the normalized z score; the absence of keywords was ignored and each feature was separated by a space, for example, 1:0.003589 30:-0.81189. In the training data set, the first column of the input data was set to the known outcome, i.e., 1 for positive, -1 for negative. In the test set, the first column of the input dataset was set to 0.
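The sparse input line described here can be sketched in Python as follows. The normalization rule (dividing by the largest absolute z score) is an assumption, since the text states only that weights fall between -1 and +1; the function names and the example keywords are hypothetical.

```python
def normalize_z_scores(z_scores):
    """Scale z scores into [-1, 1] by dividing by the largest absolute value.

    This particular rule is an assumption; the source gives only the range.
    """
    largest = max(abs(z) for z in z_scores.values())
    return {word: z / largest for word, z in z_scores.items()}


def to_sparse_line(label, words_present, keyword_index, weights):
    """Build one 'label index:value ...' line for the keywords that occur.

    keyword_index maps each feature keyword to its 1-based list position.
    Absent keywords are simply omitted, and indices are written in
    ascending order.
    """
    index_to_word = {i: w for w, i in keyword_index.items()}
    indices = sorted({keyword_index[w] for w in words_present if w in keyword_index})
    features = ["%d:%.6f" % (i, weights[index_to_word[i]]) for i in indices]
    return " ".join([str(label)] + features)
```

With weights {"gene": 1.0, "surgery": -0.5} at list positions 1 and 30, a positive training abstract containing both words would be written as 1 1:1.000000 30:-0.500000; for a test abstract the leading label would be 0.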
Two sets of significant keywords were generated. One set contained those with positive z scores above the threshold (1.96) (called one-way weighted scheme); the other contained key words with both positive (greater than 1.96) and negative z scores (less than -1.96) (called two-way weighted scheme).
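The difference between the two schemes amounts to which side of the z score distribution is kept. A minimal Python sketch, with a hypothetical function name and made-up example values:

```python
def select_keywords(z_scores, scheme, threshold=1.96):
    """Return the keywords retained under the one-way or two-way scheme."""
    if scheme == "one-way":
        # Only words over-represented in genetic association abstracts.
        return {w: z for w, z in z_scores.items() if z > threshold}
    if scheme == "two-way":
        # Words over-represented in either group of abstracts.
        return {w: z for w, z in z_scores.items() if abs(z) > threshold}
    raise ValueError("scheme must be 'one-way' or 'two-way'")
```

The two-way list is always a superset of the one-way list, which is why it is the longer of the two.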
SVM model training
We used LibSVM, a freely available SVM software library, to train the SVM model. The accompanying utility, grid.py, was used to find optimum parameters for penalty parameter C and gamma in the radial basis function (RBF) kernel. The RBF kernel was chosen based on its potential in terms of performance.
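To make the two tuned parameters concrete, the Python sketch below evaluates the RBF kernel on sparse vectors and enumerates a grid of candidate (C, gamma) pairs on a log2 scale. It does not call LibSVM or grid.py; the exponent ranges are an assumption modelled on commonly used defaults, and in practice each pair would be scored by cross-validation before the best one is kept.

```python
import math


def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2) for sparse vectors stored as dicts."""
    keys = set(x) | set(y)
    squared_distance = sum((x.get(k, 0.0) - y.get(k, 0.0)) ** 2 for k in keys)
    return math.exp(-gamma * squared_distance)


def parameter_grid(log2c=range(-5, 16, 2), log2g=range(3, -16, -2)):
    """Enumerate candidate (C, gamma) pairs; the ranges are assumed defaults."""
    return [(2.0 ** c, 2.0 ** g) for c in log2c for g in log2g]
```

A larger gamma makes the kernel fall off faster with distance, while a larger C penalizes training errors more heavily; searching both on an exponential grid is a common way to cover several orders of magnitude cheaply.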
Stand-alone Application Implementation
GAPscreener is a stand-alone application built with the Java programming language. Java Swing components were used to build the graphical user interface (GUI). The application incorporates open-source LibSVM Java code for prediction, employing the SVM model we trained. Java-based Web services in the NCBI E-utility were used to query and retrieve PubMed records. EzInstall, a freeware application, was used to package the application with a Java Runtime Environment (JRE), for automatic, self-contained installation.
General performance evaluation
To evaluate the performance of the screening tool, we used a series of new test data (not included in the training set). The first test data set (92,253 negatives, 773 positives) consisted of selections from PubMed during five consecutive weeks (February 22, 2007 to March 22, 2007) according to the routine, traditional screening process used to build the HuGE Navigator. Positive or negative status assigned by the routine process was considered the gold standard. We used this data set to evaluate two keyword weighting schemes. A second data set (68,255 negatives, 597 positives), selected from PubMed during four subsequent weeks (April 5, 2007 to April 26, 2007), was used to evaluate false-positive results generated by the GAPscreener using the selected weighting scheme.
where TP, TN, FP and FN represent the number of true positive, true negative, false positive and false negative results, respectively.
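The formulas to which this sentence refers are not reproduced above. Under the standard definitions of the three measures reported in this paper (recall, specificity and precision), they can be computed as in the Python sketch below; the function name and the counts used in the example are hypothetical.

```python
def screening_metrics(tp, tn, fp, fn):
    """Standard definitions based on the confusion-matrix counts."""
    return {
        "recall": tp / (tp + fn),        # share of true positives retrieved
        "specificity": tn / (tn + fp),   # share of true negatives rejected
        "precision": tp / (tp + fp),     # share of retrieved items that are positive
    }
```

With made-up counts of TP = 90, FN = 10, TN = 880 and FP = 20, recall is 0.90, specificity is about 0.978 and precision is about 0.818. When negatives vastly outnumber positives, as in PubMed screening, precision can stay low even at high specificity because a small false-positive rate still yields many false positives.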
To compare the results of classification by the SVM tool with the gold standard, we used logistic regression (SAS Version 9.13, SAS Institute, Cary, NC). We produced separate logistic regression models for results of the one-way and two-way SVM schemes during the 5-week experiment (February 22, 2007 to March 28, 2007). Results from each model were used to generate receiver-operating characteristics (ROC) and calculate the area under the curve (AUC) with 95% confidence intervals. The AUC of ROC curves for the two models were compared using nonparametric methods [26, 27].
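For readers unfamiliar with the AUC, the Python sketch below computes it directly from classifier scores as the probability that a randomly chosen positive abstract scores higher than a randomly chosen negative one. This is a simplified illustration with hypothetical names; it does not reproduce the logistic regression models, the confidence intervals, or the nonparametric comparison of correlated curves used in the analysis.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison; ties count half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))
```

The pairwise loop is quadratic in the number of abstracts, so for test sets of the size used here a rank-based computation would be preferable; the loop is kept only because it mirrors the definition.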
Domain-specific performance evaluation
A list of articles compiled independently by domain experts was used as the gold standard to evaluate the predictive accuracy of the application. A network of eight experts in the analysis of genetic associations with preterm birth performed a comprehensive literature search to build a knowledge base for systematic review and meta-analysis. The search was limited to articles published from January 1, 1990, to April 12, 2007. Complex queries compiled by a librarian were used to query PubMed and EMBASE. The complex queries consisted of sophisticated PubMed and EMBASE syntax filling more than four single-spaced pages. The results were manually reviewed by the domain experts.
For comparison, we used the GAPscreener to screen all PubMed abstracts published during the same period of time in a two-step process. First, we compiled a broad PubMed query based on common terms related to preterm birth. The 42,585 PubMed abstracts returned by this query were then classified by the SVM tool.
Query: Prematurity OR infant, premature OR infant, low birth weight OR labor, premature OR preterm labour OR premature birth OR preterm birth OR preterm infant OR preterm premature rupture OR preterm pregnancy outcome OR preterm delivery OR adverse outcomes of pregnancy OR obstetric labor, premature.
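A broad query of this kind, restricted to a publication date range, could be submitted through the ESearch endpoint of the NCBI E-utilities. The Python sketch below only assembles the request URL; it is an illustration rather than the GAPscreener implementation, the function name is hypothetical, and only a few of the query terms are used in the example.

```python
from urllib.parse import urlencode

# Public ESearch endpoint of the NCBI E-utilities.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(terms, mindate, maxdate, retmax=100):
    """Return an ESearch URL that ORs the terms within a publication date range.

    Dates use the YYYY/MM/DD form; retmax caps the number of PMIDs returned.
    """
    params = {
        "db": "pubmed",
        "term": " OR ".join(terms),
        "datetype": "pdat",
        "mindate": mindate,
        "maxdate": maxdate,
        "retmax": retmax,
    }
    return ESEARCH_URL + "?" + urlencode(params)
```

For the period studied here, the call would look like build_esearch_url(["preterm birth", "premature birth", "preterm delivery"], "1990/01/01", "2007/04/12"). The PMIDs returned by such a search are then the input to the classification step.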
SVM feature selection
We generated a list of significant keywords using the z score method, based on comparing their relative frequencies in 10,000 general PubMed abstracts and 10,000 gene disease-associated abstracts included in the HuGE Pub Lit database. The one-way and two-way weighted schemes generated lists of 1,301 and 4,589 keywords, respectively. Normalized z scores between 1 and -1 were used as weighting parameters for each keyword.
Performance test results comparing SVM results with known classification in the test set (data selected from PubMed during five consecutive weeks, from February 22, 2007 to March 28, 2007)
ROC area (95% CI): 0.976 – 0.987
Using the SVM tool for HuGE Pub Lit database screening and curation
Results of the SVM method and previous method in screening PubMed for the HuGE Pub Lit database. Rows: number of positive abstracts missed by the previous method*; number of positive abstracts missed by SVM; number of positive abstracts picked up by both methods; number of total positive abstracts.
Numbers of PubMed abstracts requiring manual review after screening by SVM method and previous method*. Rows: the SVM tool; the previous method.
Screening PubMed for genetic associations with preterm birth
Implementation of the user-friendly application
The number of published genetic associations has exploded during the past decade. Finding these associations in major online databases like PubMed is critical for establishing the knowledge base on genetic factors in specific diseases. Automated tools are needed to help scientists cope with the information overload. For 6 years, the HuGE Pub Lit database has continuously collected PubMed literature related to human genome epidemiology, providing a great opportunity to test machine learning techniques for automating the screening process. Compared with the existing, traditional screening process, the GAPscreener dramatically reduced the burden of manual review and substantially improved screening recall, from 80% to 97.5%.
Feature selection is an important element of the support vector machine technique. Our weighted z score method performed better than a previously reported method based on the Term Frequency × Inverse Document Frequency (TFIDF) weighting scheme. Representing statistical information for each keyword as a normalized z score (value between 1 and -1) performed better than the binary representation.
As we demonstrated in the example of preterm birth, a potentially important application of the GAPscreener is identifying genetic association literature in a specific domain (e.g., disease, gene, or pathway). This could be very useful to disease-specific networks or consortia, such as those that have banded together in a global HuGENet collaboration. The GAPscreener takes advantage of PubMed search capacity to narrow down the returned abstracts to a specific topic before applying the SVM technique.
The GAPscreener could become a routine screening tool for researchers and database curators for maintaining a local reference database. The tool can be downloaded at no charge and source code is available upon request. It is a freeware search tool that can assist researchers with systematic reviews by identifying genetic association literature in PubMed in a user-friendly and sensitive way. To our knowledge, it is the first free application that uses SVM techniques to classify published literature related to human genetic association studies. Certainly, a similar approach could be used to classify literature in other biomedical fields.
Although the GAPscreener demonstrated high recall and specificity, several aspects could be improved. For example, the two-way weighted z score scheme based on a threshold of ±1.96 generated 4,589 keywords. The number of feature keywords influences the processing speed, which in this example averaged about 0.02 seconds per abstract. We are planning to experiment with shorter feature keyword lists to improve processing time without sacrificing recall.
The keyword approach is only one of many ways to transform text into a feature vector. Use of controlled vocabularies can make "keywords" more meaningful and condense the list by reducing synonyms for a particular concept to a single term. The Unified Medical Language System (UMLS) sponsored by the National Library of Medicine provides a central repository for standard controlled vocabularies in the biomedical fields. MetaMap Transfer (MMTx) is a tool that maps free text to concepts in the UMLS Metathesaurus. UMLS terms could be used during the selection of feature keywords.
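The idea of collapsing synonyms into a single concept before feature selection can be sketched in Python as follows. The synonym table and concept labels are hypothetical placeholders, not UMLS identifiers, and no UMLS or MMTx interface is used; a real mapping would come from those resources.

```python
# Hypothetical synonym table; the labels are placeholders, not UMLS identifiers.
SYNONYM_TO_CONCEPT = {
    "preterm birth": "CONCEPT_PRETERM_BIRTH",
    "premature birth": "CONCEPT_PRETERM_BIRTH",
    "preterm delivery": "CONCEPT_PRETERM_BIRTH",
    "low birth weight": "CONCEPT_LOW_BIRTH_WEIGHT",
    "low birthweight": "CONCEPT_LOW_BIRTH_WEIGHT",
}


def concepts_in(text):
    """Return the sorted set of concepts whose synonyms occur in the text."""
    lowered = text.lower()
    return sorted({c for phrase, c in SYNONYM_TO_CONCEPT.items() if phrase in lowered})
```

In this toy example, five surface phrases become two features, which illustrates how a controlled vocabulary could shorten the feature list.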
GAPscreener is the first free SVM-based application available for screening the human genetic association literature in PubMed. It uses a novel SVM weighted-feature selection scheme. A performance evaluation demonstrated high recall and specificity. The user-friendly graphical user interface makes this a practical, stand-alone application.
Availability and requirements
Project home page:
Operating systems: Windows
Programming language: Java
Software packages: J2EE 1.4.
License: GNU General Public License. This license allows the source code to be redistributed and/or modified under the terms of the GNU General Public License as published by the Free Software Foundation. The source code for the application is available at no charge.
Any restrictions to use by non-academics: None
We thank Dr. Sham Navathe and his group at the Georgia Institute of Technology for useful discussions on support vector machines. Thanks also to Joseph Long for comments on the manuscript.
- Jensen LJ, Saric J, Bork P: Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet 2006, 7: 119–129. 10.1038/nrg1768
- Guttmacher AE, Collins FS: Realizing the promise of genomics in biomedical research. JAMA 2005, 294: 1399–1402. 10.1001/jama.294.11.1399
- Ioannidis JP, Gwinn M, Little J, Higgins JP, Bernstein JL, Boffetta P, et al.: A road map for efficient and reliable human genome epidemiology. Nat Genet 2006, 38: 3–5. 10.1038/ng0106-3
- HuGENet Handbook of Systematic Reviews. 2007. [http://www.genesens.net/_intranet/doc_nouvelles/HuGEReviewHandbookv11.pdf]
- Yu W, Gwinn M, Clyne M, Yesupriya A, Khoury MJ: A navigator for human genome epidemiology. Nat Genet 2008, 40: 124–125. 10.1038/ng0208-124
- Lin BK, Clyne M, Walsh M, Gomez O, Yu W, Gwinn M, et al.: Tracking the epidemiology of human genes in the literature: the HuGE Published Literature database. Am J Epidemiol 2006, 164: 1–4. 10.1093/aje/kwj175
- Bertram L, McQueen MB, Mullin K, Blacker D, Tanzi RE: Systematic meta-analyses of Alzheimer disease genetic association studies: the AlzGene database. Nat Genet 2007, 39: 17–23. 10.1038/ng1934
- PubMed. Bethesda, MD: National Library of Medicine; 2006. [http://www.ncbi.nlm.nih.gov/entrez]
- Shatkay H: Hairpins in bookstacks: information retrieval from biomedical text. Brief Bioinform 2005, 6: 222–238. 10.1093/bib/6.3.222
- Polavarapu N, Navathe SB, Ramnarayanan R, ul HA, Sahay S, Liu Y: Investigation into biomedical literature classification using support vector machines. Proc IEEE Comput Syst Bioinform Conf 2005, 366–374.
- Donaldson I, Martin J, de BB, Wolting C, Lay V, Tuekam B, et al.: PreBIND and Textomy–mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics 2003, 4: 11. 10.1186/1471-2105-4-11
- Cohen AM, Hersh WR: The TREC 2004 genomics track categorization task: classifying full text biomedical documents. J Biomed Discov Collab 2006, 1: 4. 10.1186/1747-5333-1-4
- Cortes C, Vapnik V: Support-vector networks. Machine Learning 1995, 20: 273–297.
- Han B, Obradovic Z, Hu ZZ, Wu CH, Vucetic S: Substring selection for biomedical document classification. Bioinformatics 2006, 22: 2136–2142. 10.1093/bioinformatics/btl350
- Chapelle O: Training a support vector machine in the primal. Neural Comput 2007, 19: 1155–1178. 10.1162/neco.2007.19.5.1155
- Ng KL, Mishra SK: De novo SVM classification of precursor microRNAs from genomic pseudo hairpins using global and intrinsic folding measures. Bioinformatics 2007, 23: 1321–1330. 10.1093/bioinformatics/btm026
- Leong MK: A novel approach using pharmacophore ensemble/support vector machine (PhE/SVM) for prediction of hERG liability. Chem Res Toxicol 2007, 20: 217–226. 10.1021/tx060230c
- Rice SB, Nenadic G, Stapley BJ: Mining protein function from text using term-based support vector machines. BMC Bioinformatics 2005, 6(Suppl 1): S22. 10.1186/1471-2105-6-S1-S22
- Entrez Programming Utilities. Bethesda, MD: National Library of Medicine; 2006. [http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html]
- Rosner B: Fundamentals of Biostatistics. 5th edition. Boston: Duxbury Press; 2000: 356–359.
- Chang CC, Lin CJ: A library for support vector machines. 2001. [http://www.csie.ntu.edu.tw/~cjlin/libsvm]
- Lin HT, Lin CJ: A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Technical report, Department of Computer Science, National Taiwan University; 2003. [http://www.csie.ntu.edu.tw/~cjlin/papers/tanh.pdf]
- Eckstein R, Loy M, Wood M: Java Swing. Sebastopol, CA: O'Reilly & Associates, Inc.; 1998.
- EzInstall 5.2. [http://www.download3000.com/download_500.html]
- DeLong ER, DeLong DM, Clarke-Pearson DL: Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988, 44: 837–845. 10.2307/2531595
- Puri ML, Sen PK: Nonparametric Methods in Multivariate Analysis. Wiley; 1971.
- EMBASE Excerpta Medica. New York, NY: Elsevier; 2005. [http://www.elsevier.com/wps/find/bibliographicdatabasedescription.cws_home/523328/description]
- Sebastiani F: Machine learning in automated text categorization. ACM Computing Surveys 2002, 34: 1–47. 10.1145/505282.505283
- Ioannidis JP, Bernstein J, Boffetta P, Danesh J, Dolan S, Hartge P, et al.: A network of investigator networks in human genome epidemiology. Am J Epidemiol 2005, 162: 302–304. 10.1093/aje/kwi201
- Lindberg DA, Humphreys BL, McCray AT: The Unified Medical Language System. Methods Inf Med 1993, 32: 281–291.
- Aronson AR: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp 2001, 17–21.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.