- Methodology article
- Open Access
SNPHunter: a bioinformatic software for single nucleotide polymorphism data acquisition and management
© Wang et al; licensee BioMed Central Ltd. 2005
- Received: 12 January 2005
- Accepted: 18 March 2005
- Published: 18 March 2005
Single nucleotide polymorphisms (SNPs) provide an important tool in pinpointing susceptibility genes for complex diseases and in unveiling human molecular evolution. Selection and retrieval of an optimal SNP set from publicly available databases have emerged as the foremost bottlenecks in designing large-scale linkage disequilibrium studies, particularly in case-control settings.
We describe the architectural structure and implementations of a novel software program, SNPHunter, which allows for both ad hoc-mode and batch-mode SNP search, automatic SNP filtering, and retrieval of SNP data, including physical position, function class, flanking sequences at user-defined lengths, and heterozygosity from NCBI dbSNP. The SNP data extracted from dbSNP via SNPHunter can be exported and saved in plain text format for further down-stream analyses. As an illustration, we applied SNPHunter for selecting SNPs for 10 major candidate genes for type 2 diabetes, including CAPN10, FABP4, IL6, NOS3, PPARG, TNF, UCP2, CRP, ESR1, and AR.
SNPHunter constitutes an efficient and user-friendly tool for SNP screening, selection, and acquisition. The executable and user's manual are available at http://www.hsph.harvard.edu/ppg/software.htm.
- Gene Size
- Nonsynonymous SNPs
- Plain Text Format
- Linkage Disequilibrium
- Splice Site SNPs
With the ever-increasing volume of single nucleotide polymorphisms (SNPs) deposited in publicly available databases such as the National Center for Biotechnology Information (NCBI) dbSNP, laboratory geneticists routinely need to select an appropriate set of SNPs for both gene mapping and molecular evolution studies. The major bottleneck in the workflow for SNP-based studies has shifted away from SNP discovery toward SNP selection. Several web-based applications and stand-alone software packages are available for handling SNP data, including viewGene, Genotools SNP manager, SNPbox, SNPicker, and SNPper. However, none of them is designed for selecting the best SNP set; they focus instead on primer design (e.g. SNPbox and SNPicker), SNP visualization (e.g. viewGene), specific platform applications such as MassARRAY technology (e.g. Genotools SNP manager), or SNP search (e.g. SNPper). In light of the surging interest in haplotype inference [6, 7] and haplotype-based association studies, the power of a linkage disequilibrium (LD) study is determined not only by the number of SNPs used, but also by their quality. Contemporary geneticists aim to maximize the statistical power to detect a disease-susceptibility locus by selecting a "best set" of closely linked SNPs given a limited (and often fixed) genotyping budget.
When a large number of SNPs are available for a susceptibility gene of interest, genotyping all SNPs on all samples is an inefficient use of resources. Recently, a cost-effective two-stage method has been proposed to identify disease-susceptibility markers. In stage I, a set of SNPs (S1), spaced at a predefined interval, is selected (e.g., evenly spaced every 3 to 5 Kb in and surrounding a candidate gene). The genotypes of the markers in S1 are then used to define LD blocks and to reconstruct haplotypes within blocks across the candidate gene locus in a representative random sample, C1, of the original source population (e.g., a multiethnic cohort of men and women). In stage II, a representative set (S2; S2 ⊆ S1) of haplotype-tagging SNPs (htSNPs) is selected on the basis of the LD characterization in the random sample C1. S2 is then genotyped in a much larger case-control set C2 (C1 ⊂ C2), nested in the original source population, and haplotype-based association tests are performed in C2.
Both stages are critical to the success of an association study. However, selecting S1 (i.e. a set of evenly spaced SNPs in and surrounding a candidate gene) is not a trivial task, because the number of available SNPs for each human gene varies dramatically (from <10 to >200) owing to varying gene sizes and SNP densities. Furthermore, S1 should not be restricted to common SNPs [i.e., minor allele frequency (MAF) ≥ 5%]; missense and regulatory SNPs should still be considered for inclusion in S1 even if their MAFs fall below 5%. Hand-picking S1 by "eyeballing" is extremely labor-intensive, time-consuming, and error-prone for candidate genomic regions with hundreds or thousands of SNPs. In addition, obtaining a SNP flanking sequence that is long enough (~200 bp) and annotated with nearby potential SNPs is essential for the successful design of a SNP genotyping assay. Unfortunately, the flanking sequences of many SNPs recorded in dbSNP are short (<100 bp) and lack any annotation of nearby SNPs.
NCBI's dbSNP offers a comprehensive SNP searching tool. However, tools are still needed to locate the desired SNPs easily and efficiently, to evaluate their annotations, and to export them in formats suitable for downstream analyses. To meet these needs, we have developed SNPHunter, a program with a friendly graphical user interface (GUI) that works as a portal between the user and NCBI dbSNP. The program can extract and export SNP data retrieved from dbSNP, import saved SNP data, and offers a flexible SNP selection function with graphic illustration of SNP position, function, and heterozygosity. Furthermore, it retrieves SNP flanking sequences of any user-specified length, with annotation of all nearby SNPs.
A detailed description of the implementation of the three modules is given in the User's Manual. In brief, retrieval of the flanking sequences of a desired SNP relies on knowledge of its genomic coordinate. In ad hoc mode, SNPHunter therefore first pinpoints the SNP's genomic coordinate using dbSNP's reference SNP (refSNP) record, the strand orientation, and the SNP's corresponding contig number. SNPHunter then communicates with the NCBI MapViewer database and retrieves the sequence centered on the desired SNP, with the sequence length specified by the user. Once the SNP's genomic coordinate and contig data are retrieved, SNPHunter also detects all neighboring SNPs by querying dbSNP for all available SNPs that lie within a user-defined radius around the SNP of interest. Once the starting and ending coordinates of a particular gene are determined by SNPHunter through NCBI's AceView, the 5' upstream and 3' downstream regions of the gene can be retrieved according to user-defined lengths. In batch mode, SNPHunter communicates with NCBI's LocusLink to fetch the SNPs that reside within each LocusLink gene. Since LocusLink has a curated SNP list for each gene included in the LocusLink database, this batch-mode search offers a reliable, efficient way to conduct a systematic SNP search for a large set of candidate genes (e.g. genes belonging to the same biological pathway/network). Furthermore, SNP data can be stored in the user's local directories, and SNP filtering can be performed automatically according to user-defined criteria.
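The coordinate arithmetic behind the flanking-sequence and neighboring-SNP steps can be illustrated with a minimal Python sketch. It is not SNPHunter's implementation (the program is written in Visual Basic .NET and queries NCBI directly); the `Snp` record, the function names, and the 1-based coordinate convention are assumptions made for illustration, and the sketch works on data already held in memory rather than calling any NCBI service.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Snp:
    """Hypothetical in-memory SNP record (not a dbSNP data structure)."""
    rs_id: str      # illustrative identifier
    position: int   # assumed 1-based coordinate on the contig


def flanking_window(position: int, flank_length: int,
                    contig_length: int) -> Tuple[int, int]:
    """Return the 1-based inclusive (start, end) window centered on a SNP.

    The window extends flank_length bases on each side and is clipped
    to the contig boundaries.
    """
    start = max(1, position - flank_length)
    end = min(contig_length, position + flank_length)
    return start, end


def neighbors_within_radius(target: Snp, snps: Iterable[Snp],
                            radius: int) -> List[Snp]:
    """Return all other SNPs within `radius` bases of the target, by position."""
    hits = [s for s in snps
            if s.rs_id != target.rs_id
            and abs(s.position - target.position) <= radius]
    return sorted(hits, key=lambda s: s.position)


if __name__ == "__main__":
    target = Snp("rsA", 1000)
    known = [target, Snp("rsB", 900), Snp("rsC", 1250), Snp("rsD", 1100)]
    print(flanking_window(target.position, 200, contig_length=5000))
    print([s.rs_id for s in neighbors_within_radius(target, known, 200)])
```

Clipping the window at the contig boundaries is a design choice of this sketch; a SNP near the end of a contig simply yields a shorter flank on that side.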
Table 1. Size, location, and the estimated number of SNPs for each of the 10 candidate genes for type 2 diabetes mellitus.
Columns: Size (bp) (# exons) (a); Total # SNPs (#/Kb) (b); # SNPs with MAF ≥ 5% (density) (c); # SNPs selected for S1; 5' 30 Kb; 3' 30 Kb.
(1) Genome coverage: SNPs should cover the gene region as well as its 30 Kb 5' upstream and 30 Kb 3' downstream regions (the gene sizes are shown in Table 1).
(2) Functionality priority: coding SNPs (cSNPs; including both synonymous and nonsynonymous SNPs) and splice site SNPs (ssSNPs) must be kept; for SNPs located in the 5' upstream and 3' downstream regions, the function is defined according to existing in vivo/in vitro experimental data. The priority of SNP selection is nonsynonymous SNPs > synonymous SNPs > ssSNPs > 5' upstream SNPs > 3' downstream SNPs > intronic SNPs.
(3) Priority based on HET: for cSNPs and ssSNPs, no HET threshold is set (HET can be calculated using the POLYMORPHISM software); for intronic and 5' upstream or 3' downstream region SNPs, those with HET values at or above the threshold of 0.095 (which corresponds to MAF ≥ 5%) have higher priority.
(4) SNP density: the SNPs should be relatively evenly distributed across the gene region (as well as the 30 Kb 5' upstream and 30 Kb 3' downstream regions) with a density of 5–50 SNPs/Kb depending on the gene size (see Table 1). Specifically, for gene sizes < 10 Kb, we use a density of 50 SNPs/Kb; for gene sizes of 10–100 Kb, we use a density of 10 SNPs/Kb; for gene sizes > 100 Kb, we use a density of 5 SNPs/Kb.
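The selection criteria above can be expressed as a few small rules. The following Python sketch is illustrative only: the function-class labels and function names are hypothetical, and the heterozygosity formula assumes a biallelic SNP in Hardy-Weinberg equilibrium, which is consistent with the stated correspondence between HET = 0.095 and MAF = 5% but is not spelled out in the text. The density values are returned exactly as stated in the criteria.

```python
# Priority order of function classes as stated in the criteria.
# The string labels themselves are hypothetical names for this sketch.
FUNCTION_PRIORITY = [
    "nonsynonymous", "synonymous", "splice_site",
    "upstream_5", "downstream_3", "intronic",
]
ALWAYS_KEPT = frozenset({"nonsynonymous", "synonymous", "splice_site"})


def heterozygosity(maf: float) -> float:
    """Expected heterozygosity 2p(1-p), assuming a biallelic SNP under HWE."""
    return 2.0 * maf * (1.0 - maf)


def passes_het_filter(function_class: str, het: float,
                      threshold: float = 0.095) -> bool:
    """cSNPs and ssSNPs are kept regardless of HET; others need HET >= threshold."""
    if function_class in ALWAYS_KEPT:
        return True
    return het >= threshold


def target_density(gene_size_bp: int) -> int:
    """Target density for a gene of the given size, using the values in the text."""
    if gene_size_bp < 10_000:
        return 50
    if gene_size_bp <= 100_000:
        return 10
    return 5


def priority_rank(function_class: str) -> int:
    """Lower rank means higher selection priority."""
    return FUNCTION_PRIORITY.index(function_class)


if __name__ == "__main__":
    print(round(heterozygosity(0.05), 3))
    print(sorted(["intronic", "synonymous", "upstream_5"], key=priority_rank))
```

For example, an intronic SNP with HET = 0.05 would be filtered out, whereas a nonsynonymous SNP with the same HET would be retained.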
Retrieve all SNPs regardless of HET values according to SNP selection criterion (1).
Select all cSNPs and ssSNPs; in addition, 5' upstream, 3' downstream and intronic SNPs with HET ≥ 0.095 will also be selected, according to SNP selection criteria (2) and (3).
Enforce a relatively even SNP density according to SNP selection criterion (4). We implement this by setting a maximum inter-marker distance d: for a given set of selected SNPs S, if there exists a pair of neighboring SNPs (SNP i, SNP j) whose physical distance is > d, the program picks a random available SNP, say SNP k, located between SNP i and SNP j and inserts SNP k into S between them, repeating this recursively; by mathematical induction, this process guarantees that S will eventually become a saturated set, S', at a resolution level of d. The marker density is then re-adjusted by iteratively adding available SNPs in the priority order set by SNP selection criteria (2) and (3) until the target number of SNPs at the desired density is reached, according to SNP selection criterion (4).
Include any non-redundant SNPs from sources other than dbSNP, such as from literature review.
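The density-enforcement step of the procedure above can be sketched as a simple gap-filling loop. This is a hedged illustration rather than the program's actual code: the function name is hypothetical, positions are plain integers, and, to keep the result reproducible, the sketch picks the available SNP closest to the midpoint of each oversized gap instead of a random one.

```python
from typing import Iterable, List


def fill_gaps(selected: Iterable[int], candidates: Iterable[int],
              max_gap: int) -> List[int]:
    """Add candidate positions until no fillable gap exceeds max_gap.

    selected:   positions of SNPs already chosen (e.g. cSNPs, ssSNPs)
    candidates: positions of other available SNPs
    max_gap:    maximum allowed inter-marker distance d
    Gaps with no candidate SNP inside them are left as they are.
    """
    chosen = sorted(set(selected))
    pool = set(candidates) - set(chosen)
    changed = True
    while changed:
        changed = False
        for left, right in zip(chosen, chosen[1:]):
            if right - left <= max_gap:
                continue
            between = [p for p in pool if left < p < right]
            if not between:
                continue  # nothing available to fill this gap
            midpoint = (left + right) / 2
            # Deterministic stand-in for the random pick described in the text.
            pick = min(between, key=lambda p: (abs(p - midpoint), p))
            pool.remove(pick)
            chosen.append(pick)
            chosen.sort()
            changed = True
            break  # rescan from the start with the updated set
    return chosen


if __name__ == "__main__":
    print(fill_gaps([0, 10000], [2000, 5200, 7000, 9000], max_gap=3000))
```

Each pass removes one SNP from the candidate pool, so the loop terminates; the resulting set is saturated at resolution d only where candidate SNPs exist inside the oversized gaps.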
Using these criteria and selection procedures, we selected a total of 670 SNPs for the 10 genes listed in Table 1.
Comparisons between SNPHunter and other publicly available software tools.
Columns: SNP search related functions; Arbitrary flanking sequence length; Nearby SNP annotation; SNP selection interface; Graphic SNP illustration.
Entries (row alignment not recoverable):
- NCBI dbSNP (a)
- None for user; server contains latest updated data
- None for user; server needs to update periodically
- Web-Client (depends on dbSNP for new SNP retrieval)
- None for user; rely on dbSNP for data update
- Web-Client (depends on NEB restriction-enzyme data)
- None for user; rely on NEB for data update
- Software depends on annotated record
- Needs to download annotated data
It is worth noting that SNPHunter relies on dbSNP for data retrieval and thus lacks the independence that applications with local database support usually have. Moreover, SNP selection should not be limited to NCBI dbSNP, although dbSNP represents the largest publicly available SNP database that can be accessed via the Internet worldwide. Some SNPs reported in the earlier literature have not yet been incorporated into dbSNP. Furthermore, there are several ongoing gene re-sequencing projects for selected human genes, such as SeattleSNPs and SNP500Cancer. Therefore, SNPs from these other sources, if not yet included in dbSNP, should also be considered in SNP selection. Nevertheless, NCBI dbSNP has been steadily updated and has gradually emerged as one of the most comprehensive SNP repositories.
In summary, SNPHunter allows for customized SNP searches (both ad hoc-mode and batch-mode) by directly retrieving and managing SNP information from the NCBI dbSNP database, eliminating tedious and costly local database maintenance on the user's side. To date, SNPHunter has received more than 1000 downloads worldwide. We hope this simple program can serve as an efficient and reliable tool for researchers everywhere to facilitate their genetic studies.
Project name: SNPHunter
Project home page: http://www.hsph.harvard.edu/ppg/software.htm
Operating system(s): Microsoft Windows
Programming language: Visual Basic .NET
Other requirements: Microsoft .NET Framework 1.0 or above.
Any restrictions to use by non-academics: Contact authors
We wish to thank the many SNPHunter users for their constructive comments, especially Illumina, Inc. We thank Melissa Veno for editorial assistance. This work was supported in part by National Institutes of Health grants R01 HG002518, R01 DK062290, R01 DK066401, and R01 HL073882.
- Kashuk C, SenGupta S, Eichler E, Chakravarti A: ViewGene: a graphical tool for polymorphism visualization and characterization. Genome Res 2002, 12: 333–338. 10.1101/gr.211202
- Pusch W, Kraeuter KO, Froehlich T, Stalgies Y, Kostrzewa M: Genotools SNP manager: a new software for automated high-throughput MALDI-TOF mass spectrometry SNP genotyping. Biotechniques 2001, 30: 210–215.
- Weckx S, De Rijk P, Van Broeckhoven C, Del-Favero J: SNPbox: web-based high-throughput primer design from gene to genome. Nucleic Acids Res 2004, 32: W170–2. 10.1093/nar/gnh168
- Niu T, Hu Z: SNPicker: a graphical tool for primer picking in designing mutagenic endonuclease restriction assays. Bioinformatics 2004, 20: 3263–3265. 10.1093/bioinformatics/bth360
- Riva A, Kohane IS: SNPper: retrieval and analysis of human SNPs. Bioinformatics 2002, 18: 1681–1685. 10.1093/bioinformatics/18.12.1681
- Niu T, Qin ZS, Xu X, Liu JS: Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am J Hum Genet 2002, 70: 157–169. 10.1086/338446
- Niu T: Algorithms for inferring haplotypes. Genet Epidemiol 2004, 27: 334–347. 10.1002/gepi.20024
- Hoh J, Wille A, Zee R, Cheng S, Reynolds R, Lindpaintner K, Ott J: Selecting SNPs in two-stage analysis of disease association data: a model-free approach. Ann Hum Genet 2000, 64: 413–417. 10.1046/j.1469-1809.2000.6450413.x
- Thompson D, Stram D, Goldgar D, Witte JS: Haplotype tagging single nucleotide polymorphisms and association studies. Hum Hered 2003, 56: 48–55. 10.1159/000073732
- Stram DO, Haiman CA, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, Pike MC: Choosing haplotype-tagging SNPS based on unphased genotype data using a preliminary sample of unrelated subjects with an example from the Multiethnic Cohort Study. Hum Hered 2003, 55: 27–36. 10.1159/000071807
- Haiman CA, Stram DO, Pike MC, Kolonel LN, Burtt NP, Altshuler D, Hirschhorn J, Henderson BE: A comprehensive haplotype analysis of CYP19 and breast cancer risk: the Multiethnic Cohort. Hum Mol Genet 2003, 12: 2679–2692. 10.1093/hmg/ddg294
- dbSNP: [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Snp&cmd=Limits].
- Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001, 29: 308–311. 10.1093/nar/29.1.308
- Pruitt KD, Maglott DR: RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res 2001, 29: 137–140. 10.1093/nar/29.1.137
- PPGWebsite: [http://www.hsph.harvard.edu/ppg/software.htm].
- Niu T, Struk B, Lindpaintner K: Statistical considerations for genome-wide scans: design and application of a novel software package POLYMORPHISM. Hum Hered 2001, 52: 102–109. 10.1159/000053361
- Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E: EnsMart: a generic system for fast and flexible access to biological data. Genome Res 2004, 14: 160–169. 10.1101/gr.1645104
- Packer BR, Yeager M, Staats B, Welch R, Crenshaw A, Kiley M, Eckert A, Beerman M, Miller E, Bergen A, Rothman N, Strausberg R, Chanock SJ: SNP500Cancer: a public resource for sequence validation and assay development for genetic variation in candidate genes. Nucleic Acids Res 2004, 32(Database issue): D528–D532. 10.1093/nar/gkh005
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.