FunctSNP: an R package to link SNPs to functional knowledge and dbAutoMaker: a suite of Perl scripts to build SNP databases
© Goodswen et al. 2010
Received: 21 March 2010
Accepted: 9 June 2010
Published: 9 June 2010
Skip to main content
© Goodswen et al. 2010
Received: 21 March 2010
Accepted: 9 June 2010
Published: 9 June 2010
Whole genome association studies using highly dense single nucleotide polymorphisms (SNPs) are a set of methods to identify DNA markers associated with variation in a particular complex trait of interest. One of the main outcomes from these studies is a subset of statistically significant SNPs. Finding the potential biological functions of such SNPs can be an important step towards further use in human and agricultural populations (e.g., for identifying genes related to susceptibility to complex diseases or genes playing key roles in development or performance). The current challenge is that the information holding the clues to SNP functions is distributed across many different databases. Efficient bioinformatics tools are therefore needed to seamlessly integrate up-to-date functional information on SNPs. Many web services have arisen to meet the challenge but most work only within the framework of human medical research. Although we acknowledge the importance of human research, we identify there is a need for SNP annotation tools for other organisms.
We introduce an R package called FunctSNP, which is the user interface to custom built species-specific databases. The local relational databases contain SNP data together with functional annotations extracted from online resources. FunctSNP provides a unified bioinformatics resource to link SNPs with functional knowledge (e.g., genes, pathways, ontologies). We also introduce dbAutoMaker, a suite of Perl scripts, which can be scheduled to run periodically to automatically create/update the customised SNP databases. We illustrate the use of FunctSNP with a livestock example, but the approach and software tools presented here can be applied also to human and other organisms.
Finding the potential functional significance of SNPs is important when further using the outcomes from whole genome association studies. FunctSNP is unique in that it is the only R package that links SNPs to functional annotation. FunctSNP interfaces to local SNP customised databases which can be built for any species contained in the National Center for Biotechnology Information dbSNP database.
Whole genome association studies (WGAS) are a set of methods to study common genetic variation across the entire genome of organisms with the purpose of identifying associations between genetic polymorphisms with observable phenotypic variation in traits. Single nucleotide polymorphisms (SNPs) are the most common genetic variant. There are essentially two advances that have allowed WGAS. Firstly, the huge increase in the availability of SNPs. For example, the dbSNP database housed by the National Center for Biotechnology Information (NCBI) contains millions of SNPs for many model species. Secondly, the falling cost and rapid improvements in SNP genotyping technology. One of the main outcomes from WGAS is a subset of SNPs at a specified level of statistical significance (p - values) and false discovery rates (FDR). These SNPs are currently used in two conventional, albeit alternative, approaches depending on the requirements of the researcher: 1) a molecular biologist usually searches for candidate genes in the region of the significant SNP and then does functional experiments, which could include targeted sequencing; 2) a geneticist uses significant SNPs as DNA markers for predicting disease risks, phenotypic performance or breeding value in individuals and may use this information in associated trait selection (marker-assisted selection/gene-assisted selection).
One of the major challenges for a researcher using WGAS is minimising false positive rates while maintaining the power to identify true positive associations . WGAS relies on linkage disequilibrium (LD) between DNA markers and causal variants. SNPs located near each other on a chromosome are said to be in LD if they tend to be inherited together more often than can be expected by chance . Unlike the SNP chip design for humans that had input from the HapMap project to assist in its SNP selection, the tag SNP selection for most other species is not strictly based on LD knowledge. For many species the LD patterns are still to be conclusively determined and therefore the tag SNPs for the genotyping SNP chip are selected to be evenly distributed .
In WGAS, association does not imply causation. The significant SNPs derived from WGAS can be generally classified into 4 types: 1) a causal SNP that contributes to variation in the complex trait; 2) a SNP with no known biological effect but in LD to an untagged SNP (not genotyped) that contributes to variation in the complex trait; 3) a SNP with an association only - no known biological effect or linkage to a causal SNP; 4) a SNP not associated with the complex trait (a false positive). Since the type of significant SNP following WGAS is not distinguishable to the researcher, there is a potential to perform functional experiments based on a false-positive SNP and/or use a SNP as a marker that is neither the causal variant nor in LD with the causal variant.
Currently, the only approach to determine if a WGAS derived significant SNP is a true positive and in LD with the causal variant is to replicate the association findings in an independent population of adequate size . Hence, a need exists for SNP annotation tools that can assist a researcher in distinguishing the type of significant SNP. The tool would need to first collect and integrate all existing biological functional information from online resources. Then provide an interface to access, analyse, and present the information so that the researcher may make informed judgments as to whether a SNP could be a causal variant (type 1) or is in LD with a causal variant (type 2).
Existing SNP annotation tools
External Resources Used
Analyses the potential functional significance of SNPs derived from genome wide association studies.
dbSNP, EntrezGene, UCSC Browser, HGMD, ECR Browser, Haplotter, SIFT
SNP Function Portal 
Explores the functional
implication of SNP alleles
dbSNP, UniSTS, NCBI ideogram, Entrez Gene, NCBI human genome assembly, HapMap II project
SNP prioritization and characterisation
Ensembl, Gene Ontology, OMIM
Provides functional annotation for both novel and public SNPs
NCBI RefSeq, UCSC Known Genes, VEGA, AceView
Annotates non-synonymous SNPs mapped to Protein Data Bank structures
UniProtKB, Genome Browser, dbSNP, PDB
Computationally predicts functional SNPs for disease association studies
PolyPhen, SIFT, SNPeffect, SNPs3D, LS-SNP, ESEfinder, RescueESE, ESRSearch, PESX, Ensembl, TFSearch, Consite, GoldenPath, Ensembl, KinasePhos, OGPET, Sulfinator, GoldenPath
Functional analysis of novel SNPs
NCBI, Ensembl, UCSC BLAT, Rescue-ESE
Facilitates the retrieval and use of Human SNPs for high-throughput research purposes.
LocusLink, GeneOntology, SWISS-PROT, OMIM, Unigene
An integrated SNP annotation platform
Ensembl, UCSC, Uniprot, UniProt, Pfam, DAS-CBS, MINT, BIND, KEGG, TreeFam
Predicts the effect of non-synonymous
SNPs on the molecular phenotype of proteins.
Ensembl human databases
Assigns molecular functional effects of non-synonymous SNPs based on structure and sequence analysis.
dbSNP, SWIIS-Prot, RefSeq, Bioscience, Gene Ontology, KEGG, BIND, OMIM, HGMD, Entrez Gene
For many of the bioinformatics tools, third party applications are required for upstream and downstream research analyses, which in most cases require data format changes. The R environment has a global data currency of vectors and data frames so there are few data conversion concerns between R packages. For instance, there are two R packages called GenABEL  and SNPassoc , which perform genome wide association studies. FunctSNP ideally complements these R packages by linking their output to SNP annotation. In addition, there are R packages called GSA , GoSim , GeneNet , and snp.plotter  for further downstream analysis to provide advanced algorithms, and coloured graphs.
The biological information required for SNP annotation is publically available in online databases. Although most of these online resources provide cross-links to other databases, it is not straightforward to extract the relevant research data in one query. Furthermore, even though the databases contain well-annotated data, finding and managing the data of interest among the large assortment of data can be a challenge. Our solution to superfluous and disseminated data was to link SNPs with functional knowledge within species-specific local relational databases.
External resources used to create SNP customised database
Single Nucleotide Polymorphism database
Genes and gene product attributes
Kyoto Encyclopaedia of Genes and Genomes
Universal Protein Resource
Protein sequences and functional information
Animal Quantitative Trait Locus database
Quantitative Trait Loci data (QTL)
Online Mendelian Inheritance in Animals
Genes, inherited disorders and traits
The core of dbAutoMaker is a species-specific configuration file in an INI file format. Although we have demonstrated dbAutoMaker using mainly livestock species, it can be configured by the user to create a SNP database for any species contained in NCBI's dbSNP [for more detail refer to dbAutoMaker User Guide on http://www.csiro.au/science/dbAutoMaker.html].
The SQLite databases can be downloaded from http://www.csiro.au/science/FunctSNP.html. The user can directly query the database using Structured Query Language (SQL) through the SQLite program. However, we recommend FunctSNP as the database user interface.
The quality of results that can be expected from FunctSNP is reliant on the amount and reliability of information extracted from the online resources. For example, based on the amount of data available, the Homo sapiens database is information rich in comparison to all other species. Bos taurus and Gallus gallus databases are poor in comparison to Homo sapiens but substantially richer compared to Sus scrofa and Ovis aries. The table in Additional file 1 is a summary of the number of entries in the FunctSNP species-specific databases that were extracted from the online resources. Given time it is expected that the amount of available data for all species will dramatically increase.
FunctSNP R Functions
Adds a new species to the list of species recognized by FunctSNP
Download pre-assembled species-specific databases
Extract gene ID information using SNP IDs or SNP locations
Extract gene information using SNP IDs or Gene IDs
Extract gene ID within a specified distance from a SNP
Extract gene ontology using SNP IDs or Gene IDs
Extract highest scoring SNP using SNP IDs or SNP locations
Extract homologous genes using SNP IDs or Gene IDs
Extract pathway names using SNP IDs or Gene IDs
Find nearest genes to either SNP IDs or SNP locations
Extract OMIA using SNP IDs or Gene IDs
Extract protein information using SNP IDs or Gene IDs
Extract SNP ID using Gene IDs or SNP locations
Extract SNP information using SNP IDs or Gene IDs
Extract traits associated with QTL regions using SNP IDs or Gene IDs
Displays the local available databases
Build a species-specific database from external sources
Sets the default species
Displays the supported species
Displays the species added by user
The FunctSNP functions provide access to the biological information contained within species-specific databases. Information such as: SNP chromosomal location, exon/intron status, synonymous/non-synonymous effect, SNPs in Quantitative Trait Loci (QTL) regions, biological pathways, GO terms, and protein products for related genes. Multiple databases (one for each species) can be queried in the same R session.
Central to the SNP databases are the SNP ID (NCBI's dbSNP rs# cluster ID) and gene ID (NCBI's gene ID). Consequently, the majority of FunctSNP functions accept, as input, a vector of either SNP IDs or gene IDs by specifying what type of IDs are being passed to the function using id.type. For example:
snps <- getSNPs(ids,id.type = "snp")
The getSNPs () function returns SNP information (e.g., chromosome number, chromosomal location, protein coding status etc.) for each SNP ID in the ids vector.
snps <- getSNPs(ids,id.type = "gene")
This time, the getSNPs () function returns SNP information for all SNPs residing within each of the genes specified by their gene IDs in the ids vector.
As a general guideline to the user in prioritising the SNPs most likely to have significant functional roles, all SNPs in the database have been scored. The total score is comprised of a simple sum of two components. The first component is a score according to the SNP location and the type of amino acid alteration it may cause. We use NCBI's dbSNP function classification for SNPs on genes. The table in Additional file 2 shows the function classification in the first column and the FunctSNP score value in the second column. The classification is ranked and scored according to the likelihood of the function type altering the amino acid. For example, a SNP is given a maximum score if the polymorphism alters an amino acid. We acknowledge that scoring is subjective and therefore it is our intention in a future version of FunctSNP to allow a user-defined score value for each function classification. The second component used in the calculation of the total score is a cumulative score according to the amount of supporting biological information found in the database. If the SNP can be linked to proteins, GO terms, KEGG pathways, QTL regions, and homologous genes, the score is incremented for each linkage. For example, a SNP assigned a score of 26 (the maximum possible score) is derived as follows: 20 (SNP alters codon to make an altered amino acid) + 1 (SNP is located in a QTL region) + 1 (Protein information found) + 1 (one or more GO terms found) + 1 (one or more KEGG pathways found) + 1 (one or more OMIA information found) + 1 (homologous genes found).
There are R functions to: find the nearest genes to each SNP ID entered, using a chromosomal base pairs search distance - getGenesByDist(); find the nearest high scoring SNP to each SNP ID, or chromosomal location entered - getHighScoreSNP(); find homologous genes across all species or a specified species for each SNP ID, or gene ID entered - getHomolo(); find the quantitative trait loci (QTL) region associated to a trait (as annotated within QTLdb database) for each SNP ID or gene ID entered - getTraits(); and extract SNP ID for each chromosomal location entered - getSNPID(). There are also R functions to download the latest customised SNP database for a particular species (currently, databases for cattle, chicken, sheep, pig, and human are available) - downloadDB(); and a function to invoke dbAutoMaker for the user to make their own databases - makeDB().
To test FunctSNP, we used 10 k SNP genotype data from a yet to be published livestock WGAS. From this 10 k SNP data, 165 SNPs had been identified as being significantly associated with a particular trait of economic interest. These SNPs were sufficiently quality controlled and accepted as reliable prior to using FunctSNP.
The aim of the test was to rank the 165 SNPs according to their likelihood of being reliable DNA markers. Our premise for reliable DNA markers is that WGAS derived SNPs should be either the causal variant or close to a SNP that is the causal variant. In FunctSNP, whether a SNP is a potential causal variant is determined by its assigned functional role and its links to biological information (e.g., pathways and gene ontology). The SNPs in the database have been scored in accordance with this premise. Although, high scoring SNPs in FunctSNP are the most likely to have significant functional roles, the expertise of the researcher is required to make that important mental link between the biological information provided and its relevance to the trait of interest. We also use the general rule that the closer a WGAS derived SNP is to a high scoring SNP the more likely the SNPs will be in LD. However, the researcher needs to be aware that this rule does not always hold true as there are blocks of SNPs along the chromosome with higher and lower degrees of LD. For example, for the most part the closer 2 SNPs are together the stronger the LD is going to be, nevertheless, it is possible that the 2 SNPs are located on different blocks of LD.
The test involved 4 different runs based on the following criteria that determined the data input to the FunctSNP functions: 1) using all 165 WGAS derived significant SNPs; 2) using only significant SNPs with a score greater than 7; 3) using SNPs with a score greater than 11 from the database that are within 10,000 base pairs from the 165 significant SNPs; and 4) using SNPs from the database with a score greater than 24 that are located on genes within 10,000 base pairs from the 165 significant SNPs. The threshold score used is at the discretion of the user. SNPs with a score equal or greater than 20 are considered highly likely to have significant functional roles. The base pairs search distance is also subjective. Knowledge of the LD patterns for the research species will need to govern the distance used.
Shows a summary of the output count after 4 program test runs
Number of SNPs ...
Number of entries found in database for SNPs meeting input criteria ++
meeting input criteria $$
derived from WGAS meeting criteria
found in database with score = > 20
Homologous Genes(with Homo sapiens)
The purpose of run 3 was to find the highest scoring SNPs in the database residing in close proximity to the significant SNPs. We found 15 SNPs located on 15 different genes and 8 of these SNPs were considered to have significant functional roles. Similarly, the purpose of run 4 was to find genes in close proximity to the significant SNPs and then find the highest scoring SNPs residing on these genes. We found 6 genes in close proximity that contained 11 SNPs with significant functional roles. Based on the results of the 4 tests we were able to rank the 165 SNPs according to their likelihood of being reliable DNA markers and concluded that the top 8 were realistic candidate DNA markers. In particular, 2 WGAS derived SNPs were identified to be in close proximity to 2 genes influencing biological pathways that literature have linked to the WGAS trait of interest. See additional file 3 for FunctSNP commands used in testing program.
As part of the ongoing development, we will monitor web resources for new and improved sources of data that can be incorporated into FunctSNP. The design of dbAutoMaker allows data to be extracted from new online resources with minimal modification to dbAutoMaker's configuration file.
FunctSNP and our approach for identifying SNP markers currently make the assumption that all causal SNPs are located on genes. There are now many suggestions that non coding SNPs (e.g., SNPs in transcription factor binding sites, splice sites, repressor, promoter regions, and eQTLs ) could significantly contribute to variation in complex traits. Therefore, as part of the ongoing development, online resources (such as coliSNP , TRANSFAC , SNPeffect , eQTL Explorer ) are being investigated to find data for the functional role of non-coding SNPs with the intention of incorporating this data in FunctSNP databases.
To take full advantage of WGAS, we need to make that essential link between the outcomes from WGAS and the information that exists about the function of genes and pathways. Whilst there are many emerging bioinformatics tools to make this link, we have developed a tool called FunctSNP which we believe is unique and addresses some of the shortcomings of existing tools. Firstly, FunctSNP is the only R package designed specifically for SNP annotation. There are considerable advantages over a web interface in integrating FunctSNP within the R statistical computing environment offering high quality graphics and data visualization tools. Operating in the same R environment, FunctSNP has notable complementary upstream and downstream research R packages. Secondly, the purpose built database for FunctSNP can include information about SNPs in other organisms besides human. We have developed an application called dbAutoMaker for a researcher to automatically build their own SNP database for any species contained in NCBI's dbSNP. We have scheduled dbAutoMaker to build databases for cattle, chicken, pig, sheep, and human on a monthly basis. These databases can easily be downloaded whenever required directly from the web or using a FunctSNP function.
We conclude that FunctSNP is a post-WGAS tool that provides an opportunity to screen and select for SNPs that have a higher likelihood to be related to variation in a particular complex trait of interest.
The FunctSNP R package is available from CRAN:
http://cran.r-project.org/package=FunctSNP. In FunctSNP there are no new classes, no overloading, and default storage modes for R objects are used.
The software for dbAutoMaker is implemented in Perl and supported on Linux and MS Windows. Databases and source code are freely available from: http://www.csiro.au/science/FunctSNP.html and http://www.csiro.au/science/dbAutoMaker.html
Office of the Chief Executive (OCE) of CSIRO for postgraduate scholarship. Special thanks to Professor Julius van der Werf (University of New England) who provided helpful guidance and manuscript editing.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.