With the ever-increasing volume of single nucleotide polymorphisms (SNPs) deposited in publicly available databases such as the National Center for Biotechnology Information (NCBI) dbSNP, laboratory geneticists routinely need to select an appropriate set of SNPs for both gene mapping and molecular evolution studies. The major bottleneck in the workflow for SNP-based studies has shifted away from SNP discovery toward SNP selection. Several web-based applications and stand-alone software packages are available for handling SNP data, including viewGene [1], Genotools SNP manager [2], SNPbox [3], SNPicker [4], and SNPper [5]. However, none of them is designed for selecting the best SNP set, because they focus on primer design (e.g., SNPbox and SNPicker), SNP visualization (e.g., viewGene), specific platform applications such as MassARRAY technology (e.g., Genotools SNP manager), or SNP search (e.g., SNPper). In light of the surging interest in haplotype inference [6, 7] and haplotype-based association studies, the power of a linkage disequilibrium (LD) study is determined not only by the number of SNPs used, but also by their quality. Contemporary geneticists aim to maximize the statistical power to detect a disease-susceptibility locus by selecting a "best set" of closely linked SNPs given a limited (and often fixed) genotyping budget [8].
When a large number of SNPs are available for a susceptibility gene of interest, genotyping all SNPs on all samples is an inefficient use of resources. Recently, a cost-effective two-stage method has been proposed to identify disease-susceptibility markers [9]. In stage I, a set of SNPs (S1), spaced at a predefined interval, is selected (e.g., evenly spaced every 3 to 5 kb in and surrounding a candidate gene [10]). The genotypes of the markers in S1 are then used to define LD blocks and to reconstruct haplotypes within blocks across the candidate gene locus in a representative random sample, C1, of the original source population (e.g., a multiethnic cohort of men and women [11]). In stage II, a representative set (S2; S2 ⊆ S1) of haplotype-tagging SNPs (htSNPs) is selected on the basis of the LD characterization in the random sample C1. S2 is then genotyped in a much larger case-control set C2 (C1 ⊂ C2), nested in the original source population, and haplotype-based association tests are performed in C2.
Both stages are critical to the success of an association study. However, selecting S1 (i.e., a set of evenly spaced SNPs in and surrounding a candidate gene) is not a trivial task, because the number of available SNPs for each human gene varies dramatically (from <10 to >200) with gene size and SNP density. Furthermore, S1 should not be restricted to common SNPs [i.e., those with a minor allele frequency (MAF) ≥ 5%]; missense and regulatory SNPs should still be considered for inclusion in S1 even if their MAFs fall below 5% [11]. Hand-picking S1 by "eyeballing" is extremely labor-intensive, time-consuming, and error-prone for candidate genomic regions with hundreds or thousands of SNPs. In addition, obtaining a SNP flanking sequence that is long enough (~200 bp) and annotated with nearby potential SNPs is essential for the successful design of a SNP genotyping assay. Unfortunately, the flanking sequences of many SNPs recorded in dbSNP are short (<100 bp) and lack any annotation of nearby SNPs.
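The stage I selection criteria described above can be illustrated with a minimal Python sketch. It is not the procedure used by any particular tool; it simply assumes a greedy left-to-right pass in which missense and regulatory SNPs are always kept and other SNPs are kept only if they are common and sufficiently far from the last selected SNP. The `Snp` record, the `select_stage1` function, the 4 kb default spacing, and the rs identifiers are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snp:
    rs_id: str       # hypothetical identifier, for illustration only
    position: int    # coordinate in bp along the candidate region
    maf: float       # minor allele frequency
    func_class: str  # e.g. "intron", "missense", "regulatory"

# Assumption: these functional classes are kept regardless of MAF.
PRIORITY_CLASSES = frozenset({"missense", "regulatory"})

def select_stage1(snps, spacing_bp=4000, min_maf=0.05):
    """Greedy sketch of an S1 selection.

    Walks the SNPs in positional order. Priority-class SNPs are always
    kept; any other SNP is kept only if its MAF is at least min_maf and
    it lies at least spacing_bp downstream of the last kept SNP.
    """
    selected = []
    last_pos = None
    for snp in sorted(snps, key=lambda s: s.position):
        is_priority = snp.func_class in PRIORITY_CLASSES
        far_enough = last_pos is None or snp.position - last_pos >= spacing_bp
        if is_priority or (snp.maf >= min_maf and far_enough):
            selected.append(snp)
            last_pos = snp.position
    return selected

# Made-up example data.
example = [
    Snp("rs1", 1000, 0.20, "intron"),
    Snp("rs2", 2500, 0.30, "intron"),
    Snp("rs3", 3000, 0.01, "missense"),
    Snp("rs4", 5200, 0.02, "intron"),
    Snp("rs5", 6000, 0.10, "intron"),
    Snp("rs6", 7500, 0.15, "intron"),
    Snp("rs7", 9000, 0.03, "regulatory"),
]
chosen = [s.rs_id for s in select_stage1(example)]
print(chosen)
```

In this sketch the spacing is measured from the last kept SNP of any class, so a rare missense SNP also resets the interval; measuring spacing among common SNPs only would be an equally defensible choice.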
NCBI's dbSNP offers a comprehensive SNP searching tool [12]. However, tools are still needed to locate the desired SNPs easily and efficiently, to evaluate their annotations, and to export them in suitable formats for downstream analyses. To meet these needs, we have developed SNPHunter, a program with a friendly graphical user interface (GUI) that serves as a portal between the user and NCBI dbSNP [13]. The program extracts and exports SNP data retrieved from dbSNP, imports saved SNP data, and offers a flexible SNP selection function with graphical illustration of SNP position, function, and heterozygosity. Furthermore, it retrieves a user-specified length of SNP flanking sequence with annotation of all nearby SNPs.
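To make the idea of an annotated flanking sequence concrete, the following Python sketch shows one possible representation. It is an assumption made for illustration, not the output format of SNPHunter: the target SNP is written as bracketed alleles and each nearby biallelic SNP is replaced by its IUPAC ambiguity code. The function name `annotated_flank`, its parameters, and the toy reference sequence are hypothetical.

```python
# IUPAC ambiguity codes for biallelic sites.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def annotated_flank(reference, target_index, target_alleles, flank_len, nearby):
    """Return the target SNP with flank_len bases on each side.

    reference      -- reference sequence as a string (0-based indexing)
    target_index   -- index of the target SNP in reference
    target_alleles -- alleles of the target SNP, e.g. ("G", "A")
    flank_len      -- number of flanking bases requested on each side
    nearby         -- dict mapping index -> alleles of other known SNPs
    The flank is truncated if it would run past either end of reference.
    """
    start = max(0, target_index - flank_len)
    end = min(len(reference), target_index + flank_len + 1)
    parts = []
    for i in range(start, end):
        if i == target_index:
            parts.append("[" + "/".join(target_alleles) + "]")
        elif i in nearby:
            # Fall back to "N" for allele sets without a two-base code.
            parts.append(IUPAC.get(frozenset(nearby[i]), "N"))
        else:
            parts.append(reference[i])
    return "".join(parts)

# Toy example with a made-up 12 bp reference.
flank = annotated_flank(
    reference="ACGTACGTACGT",
    target_index=6,
    target_alleles=("G", "A"),
    flank_len=4,
    nearby={3: ("T", "C"), 9: ("C", "G")},
)
print(flank)
```

Marking nearby SNPs in the flank lets an assay designer avoid placing primers or probes over polymorphic bases.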