Methodology article
SNPHunter: a bioinformatic software for single nucleotide polymorphism data acquisition and management
BMC Bioinformatics volume 6, Article number: 60 (2005)
Single nucleotide polymorphisms (SNPs) provide an important tool in pinpointing susceptibility genes for complex diseases and in unveiling human molecular evolution. Selection and retrieval of an optimal SNP set from publicly available databases have emerged as the foremost bottlenecks in designing large-scale linkage disequilibrium studies, particularly in case-control settings.
We describe the architectural structure and implementations of a novel software program, SNPHunter, which allows for both ad hoc-mode and batch-mode SNP search, automatic SNP filtering, and retrieval of SNP data, including physical position, function class, flanking sequences at user-defined lengths, and heterozygosity from NCBI dbSNP. The SNP data extracted from dbSNP via SNPHunter can be exported and saved in plain text format for further down-stream analyses. As an illustration, we applied SNPHunter for selecting SNPs for 10 major candidate genes for type 2 diabetes, including CAPN10, FABP4, IL6, NOS3, PPARG, TNF, UCP2, CRP, ESR1, and AR.
SNPHunter constitutes an efficient and user-friendly tool for SNP screening, selection, and acquisition. The executable and user's manual are available at http://www.hsph.harvard.edu/ppg/software.htm.
With the ever-increasing volume of single nucleotide polymorphisms (SNPs) deposited in publicly available databases such as the National Center for Biotechnology Information (NCBI) dbSNP, laboratory geneticists are faced with the routine need to select an appropriate set of SNPs in both gene mapping and molecular evolution studies. The major bottleneck in the workflow for SNP-based studies has shifted away from SNP discovery toward SNP selection. Several web-based applications and stand-alone software packages are available for handling SNP data, including viewGene, Genotools SNP manager, SNPbox, SNPicker, and SNPper. However, none of them is designed for selecting the best SNP set: they focus instead on primer design (e.g. SNPbox and SNPicker), SNP visualization (e.g. viewGene), specific platform applications such as MassARRAY technology (e.g. Genotools SNP manager), or SNP search (e.g. SNPper). In light of the surging interest in haplotype inference [6, 7] and haplotype-based association studies, the power of a linkage disequilibrium (LD) study is determined not only by the number of SNPs used, but also by their quality. Contemporary geneticists aim to maximize the statistical power to detect a disease-susceptibility locus by selecting a "best set" of closely linked SNPs given a limited (and often fixed) genotyping budget.
When a large number of SNPs are available for a susceptibility gene of interest, genotyping all SNPs on all samples is an inefficient use of resources. Recently, a cost-effective two-stage method has been proposed to identify disease-susceptibility markers. In stage I, a set of SNPs (S1), spaced at a predefined interval, is selected (e.g., evenly spaced every 3 to 5 Kb in and surrounding a candidate gene). The genotypes of markers in S1 are then used to define LD blocks and to reconstruct haplotypes within blocks across the candidate gene locus in a representative random sample, C1, of the original source population (e.g., a multiethnic cohort of men and women). In stage II, a representative set (S2; S2 ⊆ S1) of haplotype-tagging SNPs (htSNPs) is selected on the basis of the LD characterization in the random sample C1. S2 is then genotyped in a much larger case-control set C2 (C1 ⊂ C2), nested in the original source population, and haplotype-based association tests are performed in C2.
Both stages are critical to the success of an association study. However, selecting S1 (i.e. a set of evenly spaced SNPs in and surrounding a candidate gene) is not a trivial task, because the number of available SNPs for each human gene varies dramatically (from <10 to >200) with gene size and SNP density. Furthermore, S1 should not be restricted to common SNPs [i.e., those with minor allele frequency (MAF) ≥ 5%]; missense and regulatory SNPs should still be considered for inclusion in S1 even if their MAFs fall below 5%. Hand-picking S1 by "eyeballing" is extremely labor-intensive, time-consuming, and error-prone for candidate genomic regions with hundreds or thousands of SNPs. In addition, obtaining a SNP flanking sequence that is long enough (~200 bp) and annotated with nearby potential SNPs is essential for the successful design of a SNP genotyping assay. Unfortunately, the flanking sequences of many SNPs recorded in dbSNP are short (<100 bp) and carry no annotation of nearby SNPs.
NCBI's dbSNP offers a comprehensive SNP searching tool. However, tools are still needed to easily and efficiently locate the desired SNPs, to evaluate their annotations, and to export them in suitable formats for downstream analyses. To meet such needs, we have developed SNPHunter, a program with a friendly graphical user interface (GUI) that works as a portal between the user and NCBI dbSNP. The program can extract and export SNP data retrieved from dbSNP, import saved SNP data, and offers a very flexible SNP selection function with graphic illustration of SNP position, function, and heterozygosity. Furthermore, it retrieves SNP flanking sequences of any user-specified length, with annotation of all nearby SNPs.
SNPHunter was written using Microsoft Visual Basic .NET. A schematic diagram of the architectural framework for SNPHunter is shown in Figure 1. The tool relies on an HTTP parser that delegates the user's query to databases at NCBI, including dbSNP, MapViewer, LocusLink, and AceView (Figure 1, left), and parses the retrieved data. It consists of three modular components: SNP Search, SNP Management, and LocusLink SNP. In the SNP Search module, the user inputs the gene symbol of interest and chooses SNPs based on heterozygosity (HET), chromosomal position, and functional class (Figure 1, upper right). The user can also specify whether upstream/downstream sequences of the gene should be included in the search. In the SNP Management module, the user fetches and manages detailed information for SNPs retrieved in the SNP Search module or for the user's own SNP list. In the LocusLink SNP module, SNPHunter reads in a list of LocusLink gene IDs (i.e. Entrez Gene IDs) and performs a batch-mode SNP search via LocusLink (although NCBI LocusLink was superseded by NCBI Entrez Gene, this SNP search mode is still fully functional). This module is very useful for obtaining SNP data for a large number of genes. SNPHunter creates a SNP summary and pops up a new "Filter SNP" panel (Figure 1, lower right). SNP filtering can be performed on all or selected genes according to user-specified filtering criteria. One advantage of SNPHunter's local filter is that it does not rely on any Web server to perform the filtering. Once the user has downloaded all the SNP information and exported it to a local file, SNPHunter can perform filtering either automatically or manually, which means that the user can further modify the selection after automatic filtering. The selected SNP list can be exported to local directories for storage or further analyses. The batch-mode search operation is fast: in the example shown in Figure 1, all the SNPs on the six genes were retrieved and downloaded in 10 sec, and automatic filtering on a regular personal computer with one Intel Pentium 4 2.8 GHz processor took another 7 sec.
A detailed description of the implementation of the three modules is presented in the User's Manual. In brief, since retrieval of the flanking sequences of a desired SNP relies on knowledge of its genomic coordinate, in ad hoc mode SNPHunter first pinpoints the SNP's genomic coordinate from dbSNP's reference SNP (refSNP) record, together with its strand orientation and corresponding contig number. SNPHunter then communicates with the NCBI MapViewer database and retrieves the corresponding sequence centered on the desired SNP, with the sequence length specified by the user. Once the SNP's genomic coordinate and contig data are retrieved, SNPHunter also detects all neighboring SNPs by querying dbSNP for all available SNPs that lie within a user-defined radius around the SNP of interest. Once the starting and ending coordinates of a particular gene are determined by SNPHunter through NCBI's AceView, the 5' upstream and 3' downstream regions of the gene can be retrieved according to user-defined lengths. In batch mode, SNPHunter communicates with NCBI's LocusLink to fetch the SNPs that reside within each LocusLink gene. Since LocusLink has a curated SNP list for each gene included in the LocusLink database, this batch-mode search offers a reliable, efficient way to conduct a systematic SNP search for a large set of candidate genes (e.g. genes belonging to the same biological pathway/network). Furthermore, SNP data can be stored in the user's local directories, and SNP filtering can be performed automatically according to user-defined criteria.
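The flanking-sequence step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SNPHunter's actual code (which is written in Visual Basic .NET and queries NCBI over HTTP): the function name `annotate_flanks`, the 0-based coordinates, and the convention of bracketing the target SNP and lower-casing neighboring SNPs are all hypothetical choices, and the contig sequence and SNP positions are assumed to have been downloaded already.

```python
def annotate_flanks(contig_seq, snp_pos, flank_len, known_snp_positions):
    """Return the flanking window around a SNP with neighboring SNPs marked.

    contig_seq: contig sequence as a string (assumed already retrieved).
    snp_pos: 0-based position of the SNP of interest on the contig.
    flank_len: number of bases wanted on each side (user-defined).
    known_snp_positions: 0-based positions of all known SNPs on the contig.
    """
    # Clip the window to the contig boundaries.
    start = max(0, snp_pos - flank_len)
    end = min(len(contig_seq), snp_pos + flank_len + 1)

    # Neighbors are the other known SNPs that fall inside the window.
    neighbors = sorted(
        p for p in set(known_snp_positions)
        if start <= p < end and p != snp_pos
    )

    pieces = []
    for p in range(start, end):
        base = contig_seq[p].upper()
        if p == snp_pos:
            pieces.append("[" + base + "]")   # target SNP (hypothetical notation)
        elif p in neighbors:
            pieces.append(base.lower())       # neighboring SNP, shown in lower case
        else:
            pieces.append(base)

    # Offsets of neighbors relative to the target SNP (negative = upstream).
    offsets = [p - snp_pos for p in neighbors]
    return "".join(pieces), offsets


window, offsets = annotate_flanks("ACGTACGTAC", 5, 3, [3, 5, 7, 9])
print(window, offsets)
```

Marking neighbors inside the returned window is what lets an assay designer avoid placing a primer on top of another polymorphism, which is the reason the text gives for wanting annotated flanks.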
To demonstrate the SNP selection process from dbSNP using SNPHunter, we applied SNPHunter for S1 selection for 10 biological candidate genes (Table 1) for a type 2 diabetes mellitus (DM) case-control study. These 10 candidate genes were chosen on the basis of their biochemical and physiological functions.
We used the following four SNP selection criteria:
(1) Genome coverage: SNPs should cover the gene region as well as its 30 Kb 5' upstream and 30 Kb 3' downstream regions (the gene sizes are shown in Table 1).
(2) Functionality priority: coding SNPs (cSNPs; including both synonymous and nonsynonymous SNPs) and splice site SNPs (ssSNPs) must be kept; for SNPs located in the 5' upstream and 3' downstream regions, the function is defined according to existing in vivo/in vitro experimental data. The priority of SNP selection is nonsynonymous SNPs > synonymous SNPs > ssSNPs > 5' upstream SNPs > 3' downstream SNPs > intronic SNPs.
(3) Priority based on HET: For cSNPs and ssSNPs, no HET threshold is set (HET can be calculated using the POLYMORPHISM software); for intronic and 5' upstream or 3' downstream region SNPs, those with HET values at or above the threshold of 0.095 (which corresponds to MAF ≥ 5%) have higher priority.
(4) SNP density: The SNPs should be relatively evenly distributed across the gene region (as well as the 30 Kb 5' upstream and 30 Kb 3' downstream regions) with a density of 5–50 SNPs/Kb depending on the gene size (see Table 1). Specifically, for gene sizes < 10 Kb, we use a density of 50 SNPs/Kb; for gene sizes 10–100 Kb, we use a density of 10 SNPs/Kb; for gene sizes > 100 Kb, we use a density of 5 SNPs/Kb.
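As a rough illustration of the functionality and HET criteria, the sketch below shows where the 0.095 threshold comes from and how the priority order could be applied. It assumes a biallelic SNP in Hardy-Weinberg equilibrium, for which expected heterozygosity is 2p(1 − p), so that MAF = 0.05 gives 2 × 0.05 × 0.95 = 0.095. The function names, the class labels, and the dictionary layout are hypothetical and are not taken from SNPHunter or from the POLYMORPHISM software.

```python
# Priority order from the functionality criterion (labels are hypothetical).
PRIORITY = [
    "nonsynonymous",
    "synonymous",
    "splice-site",
    "5' upstream",
    "3' downstream",
    "intronic",
]

# Coding and splice-site SNPs are kept regardless of heterozygosity.
ALWAYS_KEEP = {"nonsynonymous", "synonymous", "splice-site"}


def expected_het(maf):
    """Expected heterozygosity of a biallelic SNP, assuming Hardy-Weinberg."""
    return 2.0 * maf * (1.0 - maf)


def passes_het_rule(func_class, het, threshold=0.095):
    """Apply the HET rule: no threshold for cSNPs/ssSNPs, HET >= threshold otherwise."""
    if func_class in ALWAYS_KEEP:
        return True
    return het >= threshold


def rank_snps(snps):
    """Sort SNP records by functional priority, then by decreasing HET."""
    return sorted(snps, key=lambda s: (PRIORITY.index(s["func"]), -s["het"]))


example = [
    {"id": "a", "func": "intronic", "het": 0.40},
    {"id": "b", "func": "nonsynonymous", "het": 0.02},
    {"id": "c", "func": "intronic", "het": 0.20},
    {"id": "d", "func": "splice-site", "het": 0.10},
]
print([s["id"] for s in rank_snps(example)])
```

Breaking ties within a functional class by decreasing HET is one possible reading of how the two criteria combine; the text itself only states that higher-HET SNPs have higher priority.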
To date, there are no turn-key solutions that can select the best SNP set automatically. Our SNP selection procedure is an iterative process consisting of the following four major steps:
1. Retrieve all SNPs regardless of HET values according to SNP selection criterion (1).
2. Select all cSNPs and ssSNPs; in addition, select 5' upstream, 3' downstream, and intronic SNPs with HET ≥ 0.095, according to SNP selection criteria (2) and (3).
3. Enforce a relatively even SNP density according to SNP selection criterion (4). We implement this by setting a maximum inter-marker distance d: for a given set of selected SNPs S, if there exists a pair of neighboring SNPs (SNP_i, SNP_j) whose physical distance is > d, the program picks a random available SNP, say SNP_k, lying between SNP_i and SNP_j, inserts SNP_k into S, and repeats recursively on the resulting sub-intervals; by mathematical induction, this process guarantees that S will eventually become a saturated set, S', at a resolution level of d. We then re-adjust the marker density by iteratively adding available SNPs in the priority order set by SNP selection criteria (2) and (3) until we reach the target number of SNPs with the desired density, according to SNP selection criterion (4).
4. Include any non-redundant SNPs from sources other than dbSNP, such as from literature review.
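The density-enforcement step in the procedure above can be sketched as a simple gap-filling loop. This is a minimal sketch under stated assumptions rather than the program's actual routine: positions are plain integers in base pairs, the function name `fill_gaps` is hypothetical, and, to keep the example deterministic, it picks the available SNP nearest the midpoint of an oversized gap instead of a random one as described in the text. A gap that contains no available SNP is left as it is.

```python
def fill_gaps(selected, candidates, d):
    """Add candidate SNP positions until no fillable gap exceeds d.

    selected: positions (bp) of SNPs already chosen.
    candidates: positions (bp) of other available SNPs.
    d: maximum allowed distance between neighboring selected SNPs.
    """
    chosen = sorted(set(selected))
    pool = set(candidates) - set(chosen)

    changed = True
    while changed:
        changed = False
        for left, right in zip(chosen, chosen[1:]):
            if right - left <= d:
                continue  # this gap is already within the resolution d
            between = [p for p in pool if left < p < right]
            if not between:
                continue  # no available SNP can fill this gap
            mid = (left + right) / 2
            # Assumption: take the candidate nearest the midpoint
            # (ties go to the smaller position); the text picks at random.
            pick = min(between, key=lambda p: (abs(p - mid), p))
            pool.remove(pick)
            chosen.append(pick)
            chosen.sort()
            changed = True
            break  # restart the scan over the updated list
    return chosen


print(fill_gaps([0, 10000], [2000, 4800, 5200, 9000], 3000))
```

Each pass removes one SNP from the pool, so the loop always terminates; the result is saturated at resolution d only to the extent that dbSNP actually contains SNPs inside every oversized gap.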
Using these criteria and selection procedures, we selected a total of 670 SNPs for the 10 genes listed in Table 1.
Besides SNP selection, SNPHunter retrieves genomic coordinates and flanking sequences for specific SNPs and graphically displays all the SNPs within the gene of interest. Figure 2 illustrates the 28 SNPs found in NCBI dbSNP in a 2.7 Kb region spanning the tumor necrosis factor (TNF) gene.
The motivation for developing SNPHunter is to allow the efficient and accurate selection of S1 (see Background) because of its intrinsic value in LD studies, particularly in a case-control setting. A few Web resources, such as NCBI's Entrez, Ensembl's EnsMart, and SNPper, provide SNP database searching and SNP information downloading according to user-specified criteria. These tools, each with its own unique capabilities and focus, have benefited the work of geneticists. However, few of them are dedicated solely to SNP search and the management of SNP data. Although SNPper offers a very helpful function for filtering SNP sets, it is a locally stored SNP-centric database resource maintained by the Children's Hospital Informatics Program, Harvard Medical School, and requires regular data downloads from NCBI dbSNP. By contrast, SNPHunter is designed to work as a stand-alone application that retrieves the most up-to-date SNP and sequence data without the need for complicated local database support. Thus, the user is relieved of maintaining a local database and updating the data frequently. The ability to export and save every dataset locally in plain text format gives the user the freedom to reuse the data later or to perform any other customized analysis without any website support. In addition, SNPHunter offers a very friendly GUI, allowing researchers without much computer background to perform SNP searches easily and efficiently. Moreover, its batch search and automatic SNP selection proved very efficient in large-scale candidate-gene studies. Table 2 compares the features of SNPHunter with those of other major SNP-related software and web tools.
It is worth noting that SNPHunter relies on dbSNP for data retrieval, and thus lacks the independence that applications with local database support usually have. Moreover, SNP selection should not be limited to NCBI dbSNP, although dbSNP is the largest publicly available SNP database that can be accessed via the Internet worldwide. Some SNPs reported in the earlier literature have not yet been incorporated into dbSNP. Furthermore, there are several ongoing gene re-sequencing projects for selected human genes, such as SeattleSNPs and SNP500Cancer. Therefore, SNPs from these other sources, if not yet included in dbSNP, should also be considered in SNP selection. Nevertheless, NCBI dbSNP has been steadily updated and has gradually emerged as one of the most comprehensive SNP repositories.
In summary, SNPHunter allows for customized SNP searches (both ad hoc-mode and batch-mode) by directly retrieving and managing SNP information from the NCBI dbSNP database, eliminating tedious and costly local database maintenance on the user's side. To date, SNPHunter has received more than 1000 downloads worldwide. We hope this simple program can serve as an efficient and reliable tool for researchers everywhere to facilitate their genetic studies.
Availability and requirements
Project name: SNPHunter
Project home page: http://www.hsph.harvard.edu/ppg/software.htm
Operating system(s): Microsoft Windows
Programming language: Visual Basic .NET
Other requirements: Microsoft .NET Framework 1.0 or above.
Any restrictions to use by non-academics: Contact authors
Kashuk C, SenGupta S, Eichler E, Chakravarti A: ViewGene: a graphical tool for polymorphism visualization and characterization. Genome Res 2002, 12: 333–338. 10.1101/gr.211202
Pusch W, Kraeuter KO, Froehlich T, Stalgies Y, Kostrzewa M: Genotools SNP manager: a new software for automated high-throughput MALDI-TOF mass spectrometry SNP genotyping. Biotechniques 2001, 30: 210–215.
Weckx S, De Rijk P, Van Broeckhoven C, Del-Favero J: SNPbox: web-based high-throughput primer design from gene to genome. Nucleic Acids Res 2004, 32: W170–2. 10.1093/nar/gnh168
Niu T, Hu Z: SNPicker: a graphical tool for primer picking in designing mutagenic endonuclease restriction assays. Bioinformatics 2004, 20: 3263–3265. 10.1093/bioinformatics/bth360
Riva A, Kohane IS: SNPper: retrieval and analysis of human SNPs. Bioinformatics 2002, 18: 1681–1685. 10.1093/bioinformatics/18.12.1681
Niu T, Qin ZS, Xu X, Liu JS: Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am J Hum Genet 2002, 70: 157–169. 10.1086/338446
Niu T: Algorithms for inferring haplotypes. Genet Epidemiol 2004, 27: 334–347. 10.1002/gepi.20024
Hoh J, Wille A, Zee R, Cheng S, Reynolds R, Lindpaintner K, Ott J: Selecting SNPs in two-stage analysis of disease association data: a model-free approach. Ann Hum Genet 2000, 64: 413–417. 10.1046/j.1469-1809.2000.6450413.x
Thompson D, Stram D, Goldgar D, Witte JS: Haplotype tagging single nucleotide polymorphisms and association studies. Hum Hered 2003, 56: 48–55. 10.1159/000073732
Stram DO, Haiman CA, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, Pike MC: Choosing haplotype-tagging SNPS based on unphased genotype data using a preliminary sample of unrelated subjects with an example from the Multiethnic Cohort Study. Hum Hered 2003, 55: 27–36. 10.1159/000071807
Haiman CA, Stram DO, Pike MC, Kolonel LN, Burtt NP, Altshuler D, Hirschhorn J, Henderson BE: A comprehensive haplotype analysis of CYP19 and breast cancer risk: the Multiethnic Cohort. Hum Mol Genet 2003, 12: 2679–2692. 10.1093/hmg/ddg294
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001, 29: 308–311. 10.1093/nar/29.1.308
Pruitt KD, Maglott DR: RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res 2001, 29: 137–140. 10.1093/nar/29.1.137
Niu T, Struk B, Lindpaintner K: Statistical considerations for genome-wide scans: design and application of a novel software package POLYMORPHISM. Hum Hered 2001, 52: 102–109. 10.1159/000053361
Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E: EnsMart: a generic system for fast and flexible access to biological data. Genome Res 2004, 14: 160–169. 10.1101/gr.1645104
Packer BR, Yeager M, Staats B, Welch R, Crenshaw A, Kiley M, Eckert A, Beerman M, Miller E, Bergen A, Rothman N, Strausberg R, Chanock SJ: SNP500Cancer: a public resource for sequence validation and assay development for genetic variation in candidate genes. Nucleic Acids Res 2004, 32(Database issue):D528-D532. 10.1093/nar/gkh005
We wish to thank the many SNPHunter users for their constructive comments, especially Illumina, Inc. We thank Melissa Veno for editorial assistance. This work was supported in part by National Institutes of Health grants R01 HG002518, R01 DK062290, R01 DK066401, and R01 HL073882.
Simin Liu, Tianhua Niu, and Lin Wang identified the need to develop such a program, initiated the project, and designed the basic functions. Lin Wang wrote the source code for the software and interface design. Tianhua Niu and Xin Xu contributed with ideas on overall design, feature requirements, and implementation. All authors participated in the drafting of the manuscript and approved the final version.
Tianhua Niu and Xin Xu contributed equally to this work.