Seq4SNPs: new software for retrieval of multiple, accurately annotated DNA sequences, ready formatted for SNP assay design

Background In moderate-throughput SNP genotyping there was a gap in the workflow, between choosing a set of SNPs and submitting their sequences to proprietary assay design software, which was not met by existing software. Retrieval and formatting of sequences flanking each SNP, prior to assay design, becomes rate-limiting for more than about ten SNPs, especially if annotated for repetitive regions and adjacent variations. We routinely process up to 50 SNPs at once. Implementation We created Seq4SNPs, a web-based, walk-away software that can process one to several hundred SNPs given rs numbers as input. It outputs a file of fully annotated sequences formatted for one of three proprietary design softwares: TaqMan's Primer-By-Design FileBuilder, Sequenom's iPLEX or SNPstream's Autoprimer, as well as unannotated fasta sequences. We found genotyping assays to be inhibited by repetitive sequences or the presence of additional variations flanking the SNP under test, and in multiplexes, repetitive sequence flanking one SNP adversely affects multiple assays. Assay design software programs avoid such regions if the input sequences are appropriately annotated, so we used Seq4SNPs to provide suitably annotated input sequences, and improved our genotyping success rate. Adjacent SNPs can also be avoided, by annotating sequences used as input for primer design. Conclusion The accuracy of annotation by Seq4SNPs is significantly better than manual annotation (P < 1e-5). Using Seq4SNPs to incorporate all annotation for additional SNPs and repetitive elements into sequences, for genotyping assay designer software, minimizes assay failure at the design stage, reducing the cost of genotyping. Seq4SNPs provides a rapid route for replacement of poor test SNP sequences. We routinely use this software for assay sequence preparation. Seq4SNPs is available as a service at and , currently for human SNPs, but easily extended to include any species in dbSNP.


Background
A survey of single nucleotide polymorphism (SNP) and primer design software reveals several packages that align EST or genome sequences to discover SNPs [1][2][3][4][5][6]. SNP-VISTA visualizes SNPs from aligned genome sequences [7]. Other packages take a chromosome region then use recorded SNP genotypes, and additional information, to reduce the set of SNPs that need genotyping [ [8,9] and references therein]. SNP information packages collect data on a SNP, given a Reference SNP ID (rs number) or chromosome position, and may predict protein disruption, and are presented as searchable database resources [10]. These include LS-SNP, an annotated database of human SNPs within genes, and CASCAD, a rat/zebrafish database of candidate developmental SNPs [11,12]. The Functional Element SNPs Database (FESD) lists human SNPs in promoters and UTRs, CpG islands, transcription start sites, polyAdenylation signals elements of genes, and provides fasta output of SNPs with flanking sequences [13].
Several web tools output fasta sequences for SNPs. The dbSNP website outputs multiple fasta sequences flanking SNPs, given the rs numbers, after a time [10]. SNPper outputs flanking sequences annotated for adjacent SNPs [14]. SNP-Flankplus takes one or more rs numbers as input, with sequence length (like Seq4SNPs) [15]. SNP500 Cancer is a resequencing project focusing on identified SNPs associated with cancer, so displays potentially novel SNPs in flanking sequences [16]. Repeat Masker takes a fasta DNA sequence and masks it for repetitive elements, while SNPmasker takes as input a chromosome position (region) or sequence and outputs a fasta sequence masked for adjacent dbSNP SNPs and repeat sequences [17,18].
Primer design programs, like Primer3, take standard fasta DNA sequences and output PCR primer pairs with pertinent information like melting temperatures [19]. BatchPrimer3 will design both PCR and genotyping allele detection primers [20]. SNPbox is a primer designer for uniform PCR amplification across multiple regions, utilizing Primer 3 and incorporating sequence masking: input is fasta sequence and annotation with SNPs in flanking sequences is not included [21]. Genotyping assay design software outputs PCR as well as genotyping primers. These include Taqman ® 's Assay-By-Design FileBuilder http://www.appliedbiosystems.com/filebuilder; SNP-IT, used in SNPstream ® 's Autoprimer, and iPLEX (Sequenom) also optimize PCR primers for multiplexing [22][23][24]. The standard fasta format is not a suitable input for these proprietary genotyping assay design software packages, so reformatting is required (Table 1).
Genotyping assay design software packages require as input a file containing specifically formatted DNA sequences, one line per assay (Table 1 and legend). To our knowledge, the only software (other than Seq4SNPs) that delivers sequences in genotyping assay designer format is Multiple SNP Query Tool (MSQT), used for Arabidopsisbut MSQT does not annotate the sequences for variations and repeats [4]. Seq4SNPs is unique in providing annotation of flanking sequences with adjacent SNPs from dbSNP, in assay designer formats. Successful genotyping assays avoid placing PCR or allele detection primers across a variation [25]. The performance of multiplexed genotyping assays decreases as the proportion of SNP sequences having repetitive elements increases ( Fig. 1) [26]. Assay design software packages all utilize the information on variations and repetitive elements, if provided as annotation of the input sequences (Table 1). We identified the reformatting of the text and annotation of the flanking sequences for repetitive elements and variations as a rate-limiting step.
We required a sequence collection and formatting software which could take multiple inputs of one to fifty or more SNPs as rs numbers. When we began, ~300 SNP sequences were the required input for multiplex assay design software, which output optimally multiplexed groups of 48 SNPs, partly based on allele [23,26]. More flexible genotyping assay protocols now require fewer sequences. We also needed alternative inputs for newly identified SNPs without an rs number, i.e. chromosome position and allele.
We created Seq4SNPs for human SNP assay sequence preparation, to improve our medium-throughput genotyping workflow on SNPstream ® (Fig. 2). In our current workflow, about 48 sequences are prepared at a time, submitting fasta output of unannotated sequences to Repeat Masker http://www.repeatmasker.org. Repeat Masker outputs fasta sequence that is masked for repetitive regions. Seq4SNPs combines the masking from that file into the final formatted output, and adds additional variations (SNPs). A file containing all the sequences is submitted to the multiplex assay designer, which rejects SNPs with too much repetitive sequence. The user replaces rejected SNPs with alternatives having less repetitive sequence. One to three rounds of assay design are required per 48-plex. Thus, problematic assays are dealt with bioinformatically, before primers are ordered.
Automation of the sequence collection/annotation/reformatting process facilitates multiplex design iteration and dramatically reduces the time/cost involved in assay design (~10 seconds compared to 5-30 minutes per SNP). We have increased the success of assays and reduced the genotyping cost (Fig. 1). This paper describes the Seq4SNPs software, and the algorithms which collect, annotate and format SNP sequences.
Uses of Seq4SNPs software may be extended to other bioinformatics, as, given a list of rs numbers the software will retrieve new rs numbers, create fasta sequences, find the most common European allele, and find the gene and heterozygosity information, writing the data to output files designed for use in Excel, or to simple fasta text files with multiple sequences.

Implementation
Briefly, Seq4SNPs takes a file of rs numbers, retrieves DNA sequences around the SNPs, from dbSNP, and formats them as a single file of fasta sequences, each with the allele in the fasta header. This fasta file is optionally submitted to Repeat Masker which outputs a masked fasta file. The original fasta file is automatically used for the Seq4SNPs annotation step, with the masked file (if submitted to Seq4SNPs). In the annotation step, the sequence is annotated with any dbSNP SNP or indel falling within the flanking sequences, and with masked sequence regions (if supplied). The annotated sequence is then formatted for assay designers. Completed sequences are saved to a file of sequences, formatted for submission to the genotype assay design software via the appropriate webpage (Table  1) [22,24,26]. A single SNP can be processed, also a SNP without an rs number. Data sources are listed ( Table 2). A detailed description follows.
To begin, Seq4SNPs takes each SNP input as an rs number with a laboratory name. For batch processing it takes a set of assay SNPs in a plain text file, containing one SNP per line (e.g. 'TOX3-A1, rs12443621'). An intuitive user interface is provided for Seq4SNPs: the information or input file is submitted via a web page, where the user chooses the size of the flanking sequence (e.g. 200 for a 401 nucleotide sequence, with a minimum of 20 and a maximum of 600 nucleotides either side). Alternative starting points are: submitted SNP assay IDs (dbSNP ss numbers) instead of rs numbers; for a single SNP without an rs number, a sequence may be submitted, with chromosome, position and alleles, so that any sequence may be annotated [9]. Two other options are available: selection of 'major allele first' (e.g. [G/A] where G is the common allele) asks Seq4SNPs to retrieve the major Caucasian human alleles and to compile a report containing the heterozygosity data from the cluster pages; 'gene report' will retrieve gene name and description from each dbSNP cluster page. European common allele is selected from the data given on dbSNP cluster pages for AFD_EUR, or, if unavailable, from HAPMAP CEU. (For non-human species the allele is reported as given by the cluster page).
For each rs, Seq4SNPs automatically retrieves the flanking sequences and alleles from the NCBI cluster pages and creates one output file containing multiple fasta sequences with specialized headers for the annotation step. The sequence is taken from the HTML source code of the web page retrieved with a GET query (perl CGI). If a sequence is too short, it is retrieved from the NCBI contig sequence referenced by the dbSNP rs cluster page -the contig used is referenced by the first link listed under 'GeneView' if present, or under 'Integrated Maps', and the contig position utilized to request a sequence of appropriate length (URLS are listed on the webpage report, see example in Table 2 legend). If an ss is given or the rs is out of date, the new rs is first retrieved ( Table 2 legend). Seq4SNPs outputs a text file containing fasta sequences (with customised fasta headers), which is used for the next stage, and may be saved. For identifying repetitive sequences this file can be submitted to the Repeat Masker server http:// www.repeatmasker.org which produces another fasta file containing the same sequences masked for repetitive sequences, homologous regions and transposon elements.
The final stage of Seq4SNPs is sequence annotation and formatting for genotyping assay design software. Inputs are: the fasta file generated above (automatic) and, if repeat masking is required, the masked file from repeat masker is uploaded via the webpage. The assay format is selected as Taqman, SNPstream or Sequenom. A fasta file may be submitted as the primary input, here, provided that headers are compatible (see Fig. 3 for example). Seq4SNPs will generate an appropriate fasta header for Reduction of the cost of genotyping by reducing the number of sequences in a multiplex having repetitive elements Figure 1 Reduction of the cost of genotyping by reducing the number of sequences in a multiplex having repetitive elements. The cost of genotyping using SNPstream 48-plex assays was calculated for a 100% pass rate (1 on y-axis): this cost increases non-linearly, as the pass rate decreases. Here we show how the cost increases with the number of assays containing repetitive sequences, per multiplex. Some repetitive sequences clearly affect the pass rate in the whole multiplex. We now use Seq4SNPs with Repeat Masker output, put masked sequences output from Seq4SNPs into SNP-IT primer design software for SNPstream, and replace SNPs rejected by SNP-IT. This improves assay success and reduces the cost of genotyping significantly. Using Seq4SNPs software as part of the Workflow in SNPstream genotyping sequence preparation Figure 2 (see previous page) Using Seq4SNPs software as part of the Workflow in SNPstream genotyping sequence preparation. After choosing 48 or more SNPs to assay (START), input to Seq4SNPs software (grey box) is a text file (white, top) containing a list of rs numbers and assay names and the desired assay size (e.g. 200 nucleotides each side of the assay SNP). The first process (blue box, top) outputs a file of sequences in fasta format (white, centre). The user submits this to Repeat Masker which quickly produces a masked fasta file (white, right); at this point the user might reject some assays and start again if insufficient assays remain. In the second process (blue box, bottom) the Seq4SNPs fasta file is automatically used to annotate and reformat. The user inputs the masked file for this step. Final outputs (white, bottom) are a formatted sequence file for assay design software, and a report for Excel, containing warnings, error flags, minisequences for adjacent SNPs and links to dbSNP and is used for assisted error checking. User time is mainly confined to the delay points (green). The formatted assays are submitted to the SNP-IT at Autoprimer (Assay Designer), which rejects sequences that are insufficiently repeat-free, when the user may choose alternatives again (upward arrows). Legend: Seq4SNPs (grey box): times (sec ") are per SNP and are improving. Start and end points (circles); processes (blue rectangles); file/stored output (white rectangles); additional software decision aids (yellow diamonds). Seq4SNPs locates the chromosome position from the rs number, finds all SNPs within the assay flanking sequences, annotates the sequence text for these adjacent SNPs, adds the repeat/masked regions (optional), formats the sequence and creates an output text file, containing all annotated assay sequences, that is compatible with assay design software. Also, an 'adjacent SNPs' report file is supplied for Excel (and also shown as a web page), contain-ing: links to dbSNP for all assay and adjacent SNPs; 21 bp sequence around every SNP or indel for checking; validation information; IUPAC code, chromosome position and position relative to the start of the assay sequence* [21,27]. GC content is calculated for SNPstream ® format.
Warnings are given about SNPs (and repeats) very close to the assay position, or short sequences (Fig. 3).
Adjacent SNPs are most efficiently retrieved from a local MySQL database currently containing only human SNPs.
(This dataset may easily be extended to other species by Final output: web page from Seq4SNPs after processing one SNP, showing report For multiple sequences, links to the output files are always given before processing, so that the user can watch progress, and recover the completed files. Every finished sequence is printed to the output file as soon as it is ready. Thus it is possible to process jobs containing more than 300 sequences: if a process does not complete then the input files can be truncated, and the remaining sequences completed. The 'adjacent SNPs' report can be recovered and opened in Excel, for assisted placement checking (21 mer sequences and links to dbSNP provided).

Results
This software has been used in our Taqman and SNPstream workflow for four years, with incremental refinement. Approximate usage is currently about 40 SNPs per day. We routinely process more than 300 SNPs to fasta stage from a single input file. For a file of 327 fasta sequences, up to 227 were annotated in one go: the remainder were processed separately from a portion of the fasta file, for SNPstream Autoprimer. We have also generated large numbers of sequences for iPLEX. Single SNPs and multiples of less than 50 rs numbers are routinely processed for sequence preparation for Taqman and SNPstream. We also use Seq4SNPs for fasta retrieval for other bioinformatics tests and to determine the chromosome and position of a SNP given the rs number, or the latest rs for an old rs or ss number.
For 327 sequences submitted as a single batch, fasta retrieval took 4 seconds per SNP (6 minutes/100 rs numbers), with fast network and internet connections. Annotation with adjacent SNPs also took 4 seconds per SNP. Only a couple of seconds is hands-on time, which compares with manual sequence preparation at 5-20 minutes per sequence, excluding time taken to find adjacent SNPs outside of genes. User time was limited to seconds during processing, and afterwards a few minutes, checking the 'adjacent SNPs' report for the word 'ERROR' (in Excel or by HTML search) -which, if arising, might result in some manual edits; the 'adjacent SNPs' report also indicates any sequences that are 'TOO SHORT' and single-side matches (usually for indels) where the final sequence may be checked for correct placement against the 21 mer sequences given in the report.
To test the accuracy of annotation with adjacent SNPs, the assay designer sequence output from Seq4SNPs was compared, character by character, with a file containing correctly annotated and masked sequences for 327 assays. The test file of formatted sequences was generated for SNPstream by an earlier version of Seq4SNPs: the SNPstream format contains all the IUPAC codes for each SNP so that match orientation could be tested. We checked the sequences against dbSNP cluster pages and manually corrected for every adjacent SNP, verifying the correct IUPAC code and placement by sequence. (The original test file was submitted to Autoprimer and used in genotyping.) We found that the automated placement gave a significant improvement over sequences corrected manually (P = 2e-6 by Fishers exact 2-sided test). This is based on a two by two table of correctly annotated vs. incorrectly annotated adjacent SNPs in 507 adjacent SNPs in 327 sequences prepared for the test batch: after manual correction, 20 adjacent SNPs were still misannotated, usually by use of the opposite IUPAC code and once by missing a SNP placed on the wrong side, while none were incorrect when done with Seq4SNPs. We also noted that since dbSNP had been updated from version 128 to 129, 73 additional SNPs, mostly indels, were now correctly placed into our sequences. Retrospective assessment of assay sequences delivering poor genotyping results with Seq4SNPs may rapidly reveal new variations not previously known to be in the flanking sequences.
The latest version of the software (September 2008) contained 0 error flags, indicating correct placement of 507 adjacent SNPs in 327 sequences. 380 sequences were correct for major European allele assignment. There were 177 flags for 327 sequences, of which 111 were reverse complement, 40 were single side matches and 12 were new placements 1 base from the given chromosome position (indels). The flags were confined to 96 SNPs, but none had placement errors.

Conclusion
Thorough testing has shown that Seq4SNPs delivers multiple sequences suited to genotyping assay designers, that are accurately annotated with adjacent SNPs. Repeat masking annotation and Caucasian major allele recovery are also completed correctly. Automated annotation by Seq4SNPs is significantly more accurate even than manual verification over large numbers of sequences, and by using Seq4SNPs with Repeat Masker to eliminate poor test SNPs, we have improved our SNPstream genotyping efficiency from a 70% pass rate to greater than 80%.