SNPdat: Easy and rapid annotation of results from de novo SNP discovery projects for model and non-model organisms
© Doran and Creevey; licensee BioMed Central Ltd. 2013
Received: 18 September 2012
Accepted: 5 February 2013
Published: 8 February 2013
Single nucleotide polymorphisms (SNPs) are the most abundant genetic variant found in vertebrates and invertebrates. SNP discovery has become a highly automated, robust and relatively inexpensive process allowing the identification of many thousands of mutations for model and non-model organisms. Annotating large numbers of SNPs can be a difficult and complex process. Many tools available are optimised for use with organisms densely sampled for SNPs, such as humans. There are currently few tools available that are species non-specific or support non-model organism data.
Here we present SNPdat, a high throughput analysis tool that can provide a comprehensive annotation of both novel and known SNPs for any organism with a draft sequence and annotation. Using a dataset of 4,566 SNPs identified in cattle using high-throughput DNA sequencing we demonstrate the annotations performed and the statistics that can be generated by SNPdat.
SNPdat provides users with a simple tool for annotation of genomes that are either not supported by other tools or have a small number of annotated SNPs available. SNPdat can also be used to analyse datasets from organisms which are densely sampled for SNPs. As a command line tool it can easily be incorporated into existing SNP discovery pipelines and fills a niche for analyses involving non-model organisms that are not supported by many available SNP annotation tools. SNPdat will be of great interest to scientists involved in SNP discovery and analysis projects, particularly those with limited bioinformatics experience.
Keywords: SNPs, Annotation, Software, Non-model organisms
Single nucleotide polymorphisms (SNPs) are the most common genetic variant found in vertebrates and invertebrates. SNPs are regularly used as the favoured molecular marker in association studies, genetic mapping and population genetics. Improving technologies and decreasing costs have enabled researchers to identify thousands of mutations, including rare variants, with potential influence on phenotypic variation [5, 6]. Increasingly, researchers without bioinformatics training are required to analyse ever larger datasets. Disease susceptibility, agriculture and evolution are among the areas concerned with understanding the influence SNPs have on biological function and phenotypic variation of complex traits [7-9]. However, annotating large numbers of SNPs with this type of information can prove daunting and is impractical to perform manually.
The number of SNP annotations (ss#) in dbSNP for species with a reference sequence available from ensembl and at least one SNP annotation in dbSNP (build 137)
Annotations in dbSNP
Homo sapiens (Human)
Mus musculus (Mouse)
Pongo abelii (Orangutan)
Bos taurus (Cow)
Rattus norvegicus (Rat)
Canis familiaris (Dog)
Gallus gallus (Chicken)
Macaca mulatta (Macaque)
Taeniopygia guttata (Zebra Finch)
Pan troglodytes (Chimpanzee)
Danio rerio (Zebrafish)
Ornithorhynchus anatinus (Platypus)
Monodelphis domestica (Opossum)
Equus caballus (Horse)
Tetraodon nigroviridis (Tetraodon)
Sus scrofa (Pig)
Felis catus (Cat)
Caenorhabditis elegans (C.elegans)
Meleagris gallopavo (Turkey)
Gadus morhua (Cod)
Gasterosteus aculeatus (Stickleback)
Callithrix jacchus (Marmoset)
Gorilla gorilla (Gorilla)
We have developed a simple to use SNP data analysis tool (SNPdat) specifically for use with organisms which are not supported by other tools and may have a small number of annotated SNPs available, but can equally be used to analyse datasets from organisms which are densely sampled for SNPs.
SNPdat is a cross-platform command line tool written in Perl. It can easily be incorporated into existing SNP discovery or annotation pipelines, or run by a user on a standard desktop machine. SNPdat can provide comprehensive annotation of both novel and known SNPs for any organism with a draft sequence and annotation.
Many available tools require the user to create a local database before SNP annotation can be performed (FunctSNP, Snat, Annovar, SNPper). However, this process is not practical in all cases, nor is it straightforward enough for inexperienced users. For example, to perform SNP annotation using FunctSNP, users must first supply a list of Uniform Resource Locators (URLs) linked with online resource data files and then download them. They must then decompress any of these files matching specific suffixes, convert the data to SQL format and finally import these files into a SQLite database. This process is time-consuming and difficult for users inexperienced in bioinformatics, even when annotating a single SNP.
Additionally, some tools (Annovar, Snat) involve a number of pre-processing steps to parse and reformat either sequence or annotation files. This can be a difficult and confusing step for novice users, especially when dealing with non-model organisms. SNPdat does not require the creation of any local relational databases or pre-processing of any mandatory input files.
SNPdat requires only three input files: a SNP input file, which is either a Variant Call Format (VCF) file or a simple tab delimited text file (containing chromosome ID, genomic location and the mutation for each SNP to be analysed); a reference FASTA formatted sequence file for the species of interest; and a gene annotation file in GFF/GTF format. GTF is a standard format for storing information on gene structure (http://genome.ucsc.edu/FAQ/FAQformat.html#format4). GTF files define genomic structures as features. Typical features include coding sequences (CDS), exons, start and stop codons. Additional features may include untranslated regions (UTRs), introns and microRNAs.
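To make the two text inputs concrete, the following Python sketch parses one line of a tab delimited SNP file and one GTF record. It is an illustration only and is not taken from the SNPdat source, which is written in Perl; the names Feature, parse_gtf_line and parse_snp_line, and the example records, are hypothetical.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    chrom: str
    kind: str        # feature type, e.g. "CDS", "exon", "start_codon"
    start: int       # 1-based, inclusive
    end: int         # 1-based, inclusive
    strand: str      # "+", "-" or "." when unknown
    frame: str       # "0", "1", "2" or "." when not annotated
    attributes: dict


def parse_gtf_line(line: str) -> Feature:
    """Split one GTF record into its nine tab-separated fields."""
    (chrom, _source, kind, start, end,
     _score, strand, frame, attrs) = line.rstrip("\n").split("\t")
    attributes = {}
    for item in attrs.split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition(" ")
            attributes[key] = value.strip('"')
    return Feature(chrom, kind, int(start), int(end), strand, frame, attributes)


def parse_snp_line(line: str):
    """Read chromosome ID, genomic location and mutation from one line."""
    chrom, position, mutation = line.rstrip("\n").split("\t")[:3]
    return chrom, int(position), mutation.upper()


# Made-up example records, for illustration only.
gtf_line = 'chr1\tensGene\tCDS\t1000\t1200\t.\t+\t0\tgene_id "G1"; transcript_id "T1";'
feature = parse_gtf_line(gtf_line)
snp = parse_snp_line("chr1\t1100\ta\n")
```

The sketch assumes well-formed records; comment lines and malformed rows would need separate handling.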
Both FASTA and GTF files are available from Ensembl for over 50 eukaryotic species (http://www.ensembl.org/info/data/ftp/index.html). Optional files include a processed file of SNP information from other databases such as dbSNP. SNPdat uses the extra information provided by this file to cross reference de novo SNPs against known annotations. Separate scripts are provided to automate the retrieval and formatting of these data for any organism with SNP information in dbSNP. Additional scripts which automate the retrieval of GTF, FASTA and dbSNP information are described in the following sections and are available from the SNPdat webpage (http://code.google.com/p/snpdat/).
Retrieval of GTF and FASTA information
Retrieval of information from external databases
The script “dbSNP_finder.pl” retrieves SNP information for any organism in the dbSNP database (Figure 1B). This script also uses the cURL system call and requires a connection to the internet. Once run, the user is prompted to select an organism from all those currently with SNP information in dbSNP. The SNP information is then retrieved for that organism. SNP information from dbSNP can also be downloaded manually from the dbSNP ftp site (ftp://ftp.ncbi.nih.gov/snp/organisms/). When dbSNP information has been retrieved, an additional script (SNPdat_parse_dbsnp.pl) can be used to convert the dbSNP file into a format suitable for use with SNPdat.
Conversion tools for databases that are not currently supported are available upon request.
To run SNPdat, the user specifies the input/output files and desired options with a single command (Figure 1C). In the case of malformed commands, SNPdat will print an error message to the screen and a short example of how the correct command should look. SNPdat does not require the user to install any additional packages or modules and only uses modules included in the core installation of Perl.
Initially SNPdat reads the annotation information into memory from the GTF file. Each SNP is checked for errors such as non-numeric SNP locations and any warnings are printed to the output. All chromosome names provided by the user are compared against the annotation file. A warning message is printed to the output file for every SNP location provided which does not exist in the annotation. Once all SNPs have been parsed, SNPdat will read the FASTA file one chromosome at a time. To save on memory usage and time, any chromosomes that do not appear in the list of queried SNPs are skipped.
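The input checks and the chromosome-by-chromosome FASTA pass described above can be sketched in Python as follows. This is a minimal illustration, not the SNPdat implementation: the function names are hypothetical, and dropping SNPs with non-numeric locations (rather than only warning about them) is an assumption made for the example.

```python
import io


def check_snps(rows, annotated_chroms):
    """Return (usable SNPs, warnings) for rows of (chrom, position, mutation) strings."""
    usable, warnings = [], []
    for chrom, position, mutation in rows:
        if not position.isdigit():
            # Assumption: a SNP with a non-numeric location cannot be annotated.
            warnings.append(f"{chrom}:{position} non-numeric location")
            continue
        if chrom not in annotated_chroms:
            warnings.append(f"{chrom}:{position} chromosome not in annotation")
        usable.append((chrom, int(position), mutation))
    return usable, warnings


def read_fasta(handle, wanted):
    """Yield (name, sequence) one chromosome at a time, skipping unwanted ones."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name in wanted:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name in wanted:
            # Sequence lines of skipped chromosomes are never stored.
            chunks.append(line)
    if name in wanted:
        yield name, "".join(chunks)


# Toy data, for illustration only.
rows = [("chr1", "100", "A"), ("chr1", "12x", "G"), ("chrUn", "5", "T")]
usable, warnings = check_snps(rows, annotated_chroms={"chr1", "chr2"})
queried = {chrom for chrom, _, _ in usable}
fasta = io.StringIO(">chr1 toy\nACGT\nAC\n>chr2\nGGGG\n>chrUn\nTT\n")
sequences = list(read_fasta(fasta, queried))
```

Because the generator holds only one chromosome in memory at a time and never accumulates sequence for chromosomes outside the query set, memory use stays bounded by the largest queried chromosome.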
Table 2. Summary description of the annotations provided by SNPdat, listed by output column
Column 1: The queried SNP's chromosome ID
Column 2: The queried SNP's genomic location
Column 3: Whether or not the SNP was within a feature
Column 4: Region containing the SNP; either exonic, intronic, or intergenic
Column 5: Distance to nearest feature
Column 6: Either the closest feature to the SNP or the feature containing the SNP
Column 7: The number of different features that the SNP is annotated to
Column 8: The number of annotations of the current feature
Column 9: Start of feature (bp)
Column 10: End of feature (bp)
Column 11: The gene ID for the current feature
Column 12: The gene name for the current feature
Column 13: The transcript ID for the current feature
Column 14: The transcript name for the current feature
Column 15: The exon that contains the current feature and the total number of annotated exons for the gene containing the feature
Column 16: The strand sense of the feature
Column 17: The annotated reading frame (when contained in GTF)
Column 18: The reading frame estimated by SNPdat
Column 19: The estimated number of stop codons in the estimated reading frame
Column 20: The codon containing the SNP, position in the codon and reference base and mutation
Column 21: The amino acid for the reference codon and new amino acid with mutation in place
Column 22: Whether or not the mutation is synonymous
Column 23: The protein ID for the current feature
Column 24: The RS identifier for queries that map to known SNPs
Column 25: Error messages, warnings etc.
SNPs that do not have sequence information in the FASTA file but have information in the GTF are still annotated by SNPdat. However, the returned information is limited to the first 17 columns and columns 23, 24 and 25 of the output file (Table 2).
Next, all intronic and intergenic SNPs are identified and processed. The nearest feature to a non-coding SNP is identified and relevant data, such as distance to feature, feature IDs, strand sense, and start and end position, are retrieved. If a SNP is equidistant from more than one feature, a separate line is reported for each feature. Column seven in the output file contains the number of features reported for a SNP (see Table 2).
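The nearest-feature search with ties can be illustrated with a short Python sketch. It uses a simple linear scan over the features of one chromosome; how SNPdat itself performs the search is not described here, and the function names and coordinates are hypothetical.

```python
def distance_to_feature(position, start, end):
    """Distance in bp from a SNP to a feature; 0 if the SNP lies inside it."""
    if position < start:
        return start - position
    if position > end:
        return position - end
    return 0


def nearest_features(position, features):
    """Return (distance, features) for the closest feature(s) to a SNP.

    `features` is a non-empty list of (name, start, end) tuples on the SNP's
    chromosome. Every feature tied at the minimum distance is returned, so
    that each one can be reported on its own output line.
    """
    scored = [(distance_to_feature(position, start, end), (name, start, end))
              for name, start, end in features]
    best = min(distance for distance, _ in scored)
    return best, [feature for distance, feature in scored if distance == best]


# Toy coordinates: a SNP at 300 is 100 bp from both exonA and exonB.
features = [("exonA", 100, 200), ("exonB", 400, 500), ("exonC", 900, 950)]
distance, hits = nearest_features(300, features)
```

For genome-scale annotation a sorted index with binary search would be a more efficient choice than the linear scan shown here.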
All features that a SNP occurs in are identified and printed to separate lines. Information calculated and retrieved for a feature containing a SNP is contained in columns 9-17 of the output file (see Table 2). Columns 18-22 contain information estimated from the sequence of the feature such as the reading frame, the position in the codon, reference and mutant amino acid and whether or not the SNP is synonymous. The estimated reading frame is relative to the strand sense of the feature. If no strand sense is available from the GTF, SNPdat assumes that the strand sense is positive.
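The sequence-derived annotations (position in the codon, reference and mutant amino acid, and synonymous status) can be sketched in Python as below. This is a simplified illustration rather than SNPdat's method: it assumes the standard genetic code, a reading frame that is already known, uppercase A/C/G/T bases only, and a codon that does not span a feature boundary. The function name codon_change and the toy sequences are hypothetical.

```python
# Standard genetic code, codons ordered by T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def codon_change(feature_seq, feature_start, position, mutation, strand="+", frame=0):
    """Describe the codon change caused by a SNP inside a coding feature.

    feature_seq is the forward-strand reference sequence of the feature,
    feature_start its 1-based genomic start, and mutation the forward-strand
    allele. Returns None when the SNP falls outside a complete codon.
    """
    feature_end = feature_start + len(feature_seq) - 1
    if strand == "-":
        # Work on the reverse complement so offsets run 5' to 3' on the feature.
        seq = feature_seq.translate(COMPLEMENT)[::-1]
        offset = feature_end - position
        alt = mutation.translate(COMPLEMENT)
    else:
        # A missing strand is treated as positive, as described in the text.
        seq, offset, alt = feature_seq, position - feature_start, mutation
    if offset < frame or offset >= len(seq):
        return None
    codon_start = frame + 3 * ((offset - frame) // 3)
    ref_codon = seq[codon_start:codon_start + 3]
    if len(ref_codon) < 3:
        return None
    in_codon = offset - codon_start
    mut_codon = ref_codon[:in_codon] + alt + ref_codon[in_codon + 1:]
    ref_aa, mut_aa = CODON_TABLE[ref_codon], CODON_TABLE[mut_codon]
    return {
        "ref_codon": ref_codon,
        "mut_codon": mut_codon,
        "position_in_codon": in_codon + 1,
        "ref_aa": ref_aa,
        "mut_aa": mut_aa,
        "synonymous": ref_aa == mut_aa,
    }


# Toy features starting at genomic position 101.
plus = codon_change("ATGGCCAAA", 101, 106, "T")
minus = codon_change("TTTGGCCAT", 101, 104, "A", strand="-")
```

Handling the negative strand by reverse-complementing both the feature and the allele keeps the codon arithmetic identical for both strands.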
Finally, all SNPs are cross referenced against information retrieved from external databases such as dbSNP. SNPs that do not have sequence information in the FASTA file but have information in the GTF are still annotated by SNPdat. However, the returned information is limited to information which can be returned without reference to the DNA sequence (columns 1 to 17 and 23 to 25). See Table 2 for more details.
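Cross referencing against known SNPs amounts to a lookup keyed on chromosome and position. The Python sketch below illustrates the idea under the assumption of a tab delimited file of known SNPs with chromosome, position and RS identifier columns; this layout, the function names and the identifiers shown are hypothetical and do not describe the file produced by SNPdat_parse_dbsnp.pl.

```python
def build_known_index(lines):
    """Map (chromosome, position) to an RS identifier."""
    index = {}
    for line in lines:
        chrom, position, rs_id = line.rstrip("\n").split("\t")[:3]
        index[(chrom, int(position))] = rs_id
    return index


def cross_reference(snps, index):
    """Attach the known RS identifier, or None, to each (chrom, position) query."""
    return [(chrom, position, index.get((chrom, position)))
            for chrom, position in snps]


# Made-up identifiers, for illustration only.
known = build_known_index(["chr1\t1100\trs0000001\n", "chr2\t55\trs0000002\n"])
matches = cross_reference([("chr1", 1100), ("chr1", 1200)], known)
```

Queries with no entry in the index are the candidate novel SNPs.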
A tutorial demonstrating the use of SNPdat and the additional scripts is available from the SNPdat website (http://code.google.com/p/snpdat/). A user manual and sample dataset are also available to download from here.
Results and discussion
To demonstrate its ease of use, de novo SNPs discovered by Mullen et al. (2012) were annotated using SNPdat. As a comparison, Annovar was also used to analyse this dataset. This dataset consists of 4,566 SNPs discovered using high-throughput DNA sequencing of target-enriched pooled DNA samples of 83 genomic regions from groups of dairy cattle. The SNPs included novel and putative variants from 28 chromosomes including the X chromosome.
For SNPdat: EnsGene annotation and FASTA sequence files for Bos taurus were retrieved from the UCSC ftp site (ftp://hgdownload.cse.ucsc.edu/goldenPath/bosTau4/). A GTF version of the ensGene annotation file was supplied to SNPdat along with the FASTA file. SNPdat does not require any pre-processing steps and so both these files were used as input for the software.
For Annovar: The same annotation and FASTA files were retrieved for use with Annovar. The FASTA file was pre-processed to create a sequence file using information from both the FASTA file and the ensGene annotation file. The new sequence file and original ensGene file were then supplied as input for Annovar.
The number of SNPs annotated to different regions by SNPdat and Annovar
3 prime UTR
5 prime UTR
SNPdat and Annovar both found that a large proportion (77%) of intergenic SNPs were within 2,000 base pairs of coding regions. Additionally, from the SNPdat output file it was determined that 39% of intronic SNPs were within a 1,000 base pair region surrounding exons (Figure 2D).
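Summary statistics of this kind can be derived directly from the distance-to-nearest-feature column of the output. A minimal Python sketch follows, using toy distances rather than the cattle dataset; the function name proportion_within is hypothetical.

```python
def proportion_within(distances, window):
    """Fraction of SNPs whose distance to the nearest feature is <= window (bp)."""
    if not distances:
        return 0.0
    return sum(1 for distance in distances if distance <= window) / len(distances)


# Toy distances in bp, for illustration only.
toy_distances = [150, 1999, 2000, 5000]
share = proportion_within(toy_distances, 2000)
```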
The rationale behind SNPdat is to provide a simple to use tool for researchers annotating the results of de novo SNP discovery projects. It is especially intended for use by researchers with limited bioinformatics experience. It can provide valuable insight into the functional roles associated with discovered SNPs and cross reference this information with external sources. As a command line tool it can easily be incorporated into existing SNP discovery pipelines and fills a niche for analyses involving non-model organisms that are not supported by many available SNP annotation tools.
Availability and requirements
Project name: SNPdat
Project home page: http://code.google.com/p/snpdat
Operating system: Platform independent
Programming language: Perl
Other requirements: Perl
Any restrictions to use by non-academics: None
SNP: Single Nucleotide Polymorphism
URL: Uniform Resource Locator
VCF: Variant Call Format
GTF: Gene Transfer Format
AGD is funded under the Teagasc Walsh Fellowship Scheme (number:2009183); CJC is funded under the Science Foundation Ireland (SFI) Stokes lecturer scheme (number: 07/SK/B1236A).
- Cohuet A, Krishnakumar S, Simard F, Morlais I, Koutsos A, Fontenille D, Mindrinos M, Kafatos FC: SNP discovery and molecular evolution in Anopheles gambiae, with special emphasis on innate immune system. BMC Genomics 2008, 9: 227. 10.1186/1471-2164-9-227
- WTCCC: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007, 447(7145): 661-678. 10.1038/nature05911
- Hoskins RA, Phan AC, Naeemuddin M, Mapa FA, Ruddy DA, Ryan JJ, Young LM, Wells T, Kopczynski C, Ellis MC: Single nucleotide polymorphism markers for genetic mapping in Drosophila melanogaster. Genome Res 2001, 11(6): 1100-1113. 10.1101/gr.GR-1780R
- Tishkoff SA, Verrelli BC: Patterns of human genetic diversity: implications for human evolutionary history and disease. Annu Rev Genomics Hum Genet 2003, 4: 293-340. 10.1146/annurev.genom.4.070802.110226
- Altshuler D, Pollara VJ, Cowles CR, Van Etten WJ, Baldwin J, Linton L, Lander ES: An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 2000, 407(6803): 513-516. 10.1038/35035083
- Mullen MP, Creevey CJ, Berry DP, McCabe MS, Magee DA, Howard DJ, Killeen AP, Park SD, McGettigan PA, Lucy MC, et al.: Polymorphism discovery and allele frequency estimation using high-throughput DNA sequencing of target-enriched pooled DNA samples. BMC Genomics 2012, 13(1): 16. 10.1186/1471-2164-13-16
- Allan MF, Smith TP: Present and future applications of DNA technologies to improve beef production. Meat Sci 2008, 80(1): 79-85. 10.1016/j.meatsci.2008.05.023
- Corona E, Dudley JT, Butte AJ: Extreme evolutionary disparities seen in positive selection across seven complex diseases. PLoS One 2010, 5(8): e12236. 10.1371/journal.pone.0012236
- Cutter AD, Choi JY: Natural selection shapes nucleotide polymorphism across the genome of the nematode Caenorhabditis briggsae. Genome Res 2010, 20(8): 1103-1111. 10.1101/gr.104331.109
- Shen TH, Carlson CS, Tarczy-Hornoch P: SNPit: a federated data integration system for the purpose of functional SNP annotation. Comput Methods Programs Biomed 2009, 95(2): 181-189. 10.1016/j.cmpb.2009.02.010
- Chelala C, Khan A, Lemoine NR: SNPnexus: a web database for functional annotation of newly discovered and public domain single nucleotide polymorphisms. Bioinformatics 2009, 25(5): 655-661. 10.1093/bioinformatics/btn653
- Li S, Ma L, Li H, Vang S, Hu Y, Bolund L, Wang J: Snap: an integrated SNP annotation platform. Nucleic Acids Res 2007, 35(Database issue): D707-710. 10.1093/nar/gkl969
- Wang P, Dai M, Xuan W, McEachin RC, Jackson AU, Scott LJ, Athey B, Watson SJ, Meng F: SNP Function Portal: a web database for exploring the function implication of SNP alleles. Bioinformatics 2006, 22(14): e523-529. 10.1093/bioinformatics/btl241
- Riva A, Kohane IS: SNPper: retrieval and analysis of human SNPs. Bioinformatics 2002, 18(12): 1681-1685. 10.1093/bioinformatics/18.12.1681
- Liu CK, Chen YH, Tang CY, Chang SC, Lin YJ, Tsai MF, Chen YT, Yao A: Functional analysis of novel SNPs and mutations in human and mouse genomes. BMC Bioinforma 2008, 9(12): S10. 10.1186/1471-2105-9-S12-S10
- Goodswen SJ, Gondro C, Watson-Haigh NS, Kadarmideen HN: FunctSNP: an R package to link SNPs to functional knowledge and dbAutoMaker: a suite of Perl scripts to build SNP databases. BMC Bioinforma 2010, 11: 311. 10.1186/1471-2105-11-311
- Wang K, Li M, Hakonarson H: ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010, 38(16): e164. 10.1093/nar/gkq603
- Flicek P, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, Fitzgerald S, et al.: Ensembl 2012. Nucleic Acids Res 2012, 40(Database issue): D84-90. 10.1093/nar/gkr991
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.