SNPranker 2.0: a gene-centric data mining tool for diseases associated SNP prioritization in GWAS
© Merelli et al.; licensee BioMed Central Ltd. 2013
Published: 14 January 2013
Correlating specific genotypes with human diseases remains a complex issue, in spite of all the advantages arising from high-throughput technologies such as Genome Wide Association Studies (GWAS). New tools for genetic variant interpretation and for Single Nucleotide Polymorphism (SNP) prioritization are currently needed. Given a list of the most relevant SNPs statistically associated with a specific pathology as the result of a genotype study, a critical issue is the identification of genes that are effectively related to the disease by re-scoring the importance of the identified genetic variations. Vice versa, given a list of genes, it can be of great importance to predict which SNPs may be involved in the onset of a particular disease, in order to focus research on their effects.
We propose a new bioinformatics approach to support biological data mining in the analysis and interpretation of SNPs associated with pathologies. This system can be employed to design custom genotyping chips for disease-oriented studies and to re-score GWAS results. The proposed method relies (1) on the data integration of public resources using a gene-centric database design, (2) on the evaluation of a set of static biomolecular annotations, defined as features, and (3) on the SNP scoring function, which computes SNP scores using parameters and weights set by users. We employed a machine learning classifier to set default feature weights and an ontological annotation layer to enable the enrichment of the input gene set. We implemented our method as a web tool called SNPranker 2.0 (http://www.itb.cnr.it/snpranker), improving on the first published release of this system. A user-friendly interface allows users to input a list of genes, SNPs or a biological process, and to customize the feature set with relative weights. As a result, SNPranker 2.0 returns a list of SNPs, localized within input and ontologically enriched genes, combined with their prioritization scores.
Different databases and resources are already available for SNPs annotation, but they do not prioritize or re-score SNPs relying on a-priori biomolecular knowledge. SNPranker 2.0 attempts to fill this gap through a user-friendly integrated web resource. End users, such as researchers in medical genetics and epidemiology, may find in SNPranker 2.0 a new tool for data mining and interpretation able to support SNPs analysis. Possible scenarios are GWAS data re-scoring, SNPs selection for custom genotyping arrays and SNPs/diseases association studies.
The increasing importance of high throughput molecular biology techniques, such as whole genome genotyping and next generation sequencing, has boosted the identification of novel biomarkers for diseases having a genetic component [1, 2]. In particular, the evaluation of Single Nucleotide Polymorphisms (SNPs) is very promising, because they represent established single base variations with respect to the wild type, and their knowledge can be exploited to characterize each subject by associating a specific phenotype with the corresponding genomic pattern.
The human genome contains more than 10 million SNPs, and the number of SNPs with a minor allele frequency over 10% is estimated to be perhaps as many as five million. SNPs are distributed throughout the human genome and their effect on phenotype depends on the biological role (e.g. exon or regulatory) and state (e.g. silent or active) of the genomic regions where they occur.
SNP knowledge is widely exploited for Genome Wide Association Studies (GWAS) [5–7], identification of Copy Number Variations (CNV), and observations about Population Stratification. Nowadays, chip technologies allow the analysis of up to one million SNPs for each patient. The selection of SNPs to be included in the analysis is a critical problem for genotype array providers, which exploit the non-random inheritance of these genomic variations to identify TAG SNPs representing haplotype blocks. A widely used approach to optimize the SNP probe set relies on the concept of Linkage Disequilibrium (LD), which uses a statistical similarity measure between adjacent SNPs to compute, for each pair of SNPs, the information gained by using both of them rather than only the most representative one. LD mapping is used to optimize the experimental information content by limiting the number of probes employed for the genotype analysis to 1 million TAG SNPs.
SNPs filtering and prioritizing methods are also very important in case of custom genotyping chip design, defining disease-oriented arrays by pre-selecting a set of SNPs that can be related to a specific pathology. In this general scenario, no automatic methods have been proposed to support the identification of the most probable SNPs associated to a pathology relying on the available biomolecular knowledge.
On the other hand, GWAS can identify SNPs associated with a disease by analysing genotypes and phenotypes. Generally, a GWAS output has to be interpreted in its biological context to enrich the purely statistical results, in which the variations that are effectively related to the disease could be dispersed among many less critical SNPs. This process amounts to "re-ranking" GWAS scores relying on SNP properties (annotations), in order to shed light on variations that are effectively critical for the pathology under analysis.
Herein we describe SNPranker 2.0, a system for the prioritization of SNPs that builds on a previously published version of the system. SNPranker 2.0 ranks SNPs according to a user-selected set of features, which in this version has been enriched with epigenetics and functional genomics attributes, by employing a novel data mining approach. SNPranker 2.0 provides a machine learning derived scoring schema, which consists of a data mining model, optimized against experimental evidence by employing a genetic algorithm, for characterizing SNPs related to an input dataset of genes, biological processes or GWAS results. The system provides a ranked list of SNPs as output, with annotations about their statistical enrichment with respect to the most represented pathologies.
The bioinformatics analysis of genotype experiments is a complex task, which is usually addressed with statistical methods, if sufficient knowledge is available to formulate hypotheses, or using machine learning approaches, if it is necessary to create classificatory rules relying on data themselves.
Statistical approaches are commonly used in genetic epidemiology, and in many studies these methods have achieved good results [12, 13]. Despite these successes, they show some limits, which are mainly related to the underlying statistical hypotheses. The computation of P-values, which is a typical approach in GWAS, is prone to bias in the selection of the studied population, and the capability of inferring correlations between genomic variations and pathologies is inevitably restricted to the set of TAG SNPs used as probe set (although a posteriori imputation techniques can partially correct this issue, at the price of a huge amount of computation). Statistical approaches are solid, but the abstraction they use to manage data often provides results that are difficult to interpret, because best hits are selected without any correlation to real genomic features that can be identified as causes of the disease.
On the other hand, machine learning approaches are very flexible thanks to their ability to create a model directly from the data, although in well-defined analysis contexts (i.e. when the hypotheses of statistical methods are very solid) they are considered less reliable. Considering SNP prioritization as a classification problem, we chose a supervised machine learning approach to generate a function able to map inputs to desired outputs. When employing a supervised method, the training and validation sets must be selected carefully, usually through a cross-validation approach, since the model is created on them.
Machine learning approaches have a long tradition in bioinformatics, which requires the development of tools and methods capable of transforming 'omics' data into real knowledge about the underlying biological mechanisms [14–16]. Nonetheless, only a few applications have been developed to exploit machine learning approaches for ranking genetic features in relation to specific diseases. An example is Endeavour, a software tool that performs gene prioritization for ranking candidate genes involved in biological processes or diseases relying on their similarity to known genes related to these phenomena.
Concerning SNP prioritization, some solutions are available, such as PupaSNP Finder, Wjst's system, PolyMAPr and SNPselector. These servers usually integrate information from a variety of databases and analytical tools in order to create a knowledge base for SNP annotation, starting from public domain databases, such as dbSNP, GoldenPath and SNPper, which contain well-organized catalogues of SNPs and provide portals to search for fundamental information about them. More recent solutions, which can be used in the frame of GWAS, are FastSNP, which employs a complete decision tree to assign risk rankings for SNP prioritization; F-SNP, which integrates more than 16 features for SNP annotation; SPOT, which relies on GIN (Genomic Information Networks) scores, cumulative measures of biological relevance obtained by combining information across multiple domains; and FitSNP, which provides predictions about SNPs involved in diseases relying on a meta-analysis of microarray data. An R package available in Bioconductor has also been developed in this context. It is based on variance prioritization, which selects SNPs having significant heterogeneity in variance per genotype using a pre-determined P-value threshold.
The first version of SNPranker was also a web tool for SNP prioritization. Like many of the listed resources, it relied on a data-warehouse approach for collecting as much data about SNP features as possible, to provide users with the most complete annotation schema according to the publicly available information. The innovation of SNPranker concerned the use of an ontological expansion to enrich the set of input SNPs with data about semantically associated genomic traits that could have statistical correlations and functional influences on the data provided by users. Nonetheless, in the first version of SNPranker, as in many of the discussed solutions, users must select the weights of the SNP features based on their expertise. To the best of our knowledge, no methods are available in the literature to evaluate SNPs by scoring features through machine learning algorithms within a data mining approach.
Similarly to the first implementation of SNPranker, SNPranker 2.0 relies on a data-warehouse architecture, which integrates public information about genes and gene products, in order to provide a solid knowledge base for the SNP scoring engine. As discussed in our previous work, the advantage of this database is the use of a strong systems biology approach for data organization, combined with an ontology layer for the annotation of retrieved data. An improvement of SNPranker 2.0 is the use of the NDB engine of MySQL Cluster as backend server that, in combination with the optimization of the database schema, overcomes the latency problem of some complex query requests.
The peculiarity of the developed database is its multi-level approach to data integration, which enables a more comprehensive view of the examined process or disease, therefore leading to a better selection of the set of SNPs to be included in a disease-oriented custom chip or a better re-scoring of GWAS data.
The SNPranker 2.0 database follows a gene-centric approach, which means that all tables are related to each other through the concept of gene, which creates the relations in the data-warehouse schema and allows the connection of molecular levels to the pathway level. Human genes are annotated employing, among other features, their symbols, descriptions, aliases and sequences. Data about SNPs are downloaded from GoldenPath, with reference to the hg18 genome assembly, which allows the integration of data about chromosomal and contig positions, heterozygosity, alleles and functions of the related DNA portions. Data about known genes and SNPs involved in particular diseases have been downloaded from OMIM.
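As a rough illustration of what "gene-centric" means in practice, the sketch below builds a tiny schema in which every table points back to a gene record, so that a SNP can be joined to a pathway through its gene. It is only a sketch under stated assumptions: the table names, column names and rows are hypothetical, and SQLite is used here purely for convenience in place of the MySQL Cluster backend described in the text.

```python
import sqlite3

# In-memory database; the real system uses MySQL Cluster (NDB), not SQLite.
conn = sqlite3.connect(":memory:")

# Hypothetical gene-centric schema: every table references gene.gene_id.
conn.executescript(
    """
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT NOT NULL);
    CREATE TABLE snp (
        rs_id TEXT PRIMARY KEY,
        gene_id INTEGER REFERENCES gene(gene_id),
        position INTEGER
    );
    CREATE TABLE pathway (
        pathway_id TEXT,
        gene_id INTEGER REFERENCES gene(gene_id)
    );
    """
)

# Placeholder rows, invented for illustration only.
conn.execute("INSERT INTO gene VALUES (1, 'GENE_A')")
conn.execute("INSERT INTO snp VALUES ('rs0000001', 1, 12345)")
conn.execute("INSERT INTO pathway VALUES ('pathway_X', 1)")

# The gene is the hub that connects the SNP level to the pathway level.
rows = conn.execute(
    """
    SELECT s.rs_id, g.symbol, p.pathway_id
    FROM snp AS s
    JOIN gene AS g ON g.gene_id = s.gene_id
    JOIN pathway AS p ON p.gene_id = g.gene_id
    """
).fetchall()
print(rows)
```

The design choice illustrated is that no table links SNPs to pathways directly; the gene identifier carries every cross-level relation.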
From the epigenetics point of view, UCSC tracks about DNAse clusters, chromatin structures and methylation patterns have been downloaded and integrated in our database for different tissues and cell lines in order to characterize the specific activity of SNPs in particular environments.
Concerning transcriptomics data, gene products have been collected as lists of mRNA sequences, considering alternative splicing patterns and miRNA binding regions, which can be useful to characterize SNPs in the corresponding DNA regions. Since SNPs can modify the mRNA produced from the same locus, by varying the transcription start sites (TSSs), the protein coding DNA sequences (CDSs) or the untranslated regions (UTRs), gene isoforms are also stored in the SNPranker 2.0 database, according to the NCBI RefSeq annotations.
The systems biology knowledge base has been created by querying databases of biochemical pathways (KEGG) and reactions (Reactome) for all human gene products, while information about protein-protein interactions (PPIs), collected from BioGRID, has been employed to complement the available data about hub proteins and neighbourhoods that are crucial for network-based analyses.
SNPranker 2.0 exploits this multilevel knowledge integration as key infrastructure to perform SNP scoring. In this updated version of the database the set of features considered for SNP prioritization consists of more than 30 elements. A complete overview of integrated features is presented in Additional File 1.
The SNPranker 2.0 database has been built on a strong ontology layer, in order to provide a reliable framework for data integration and an improved engine for the enrichment and annotation of gene and SNP lists. In particular, gene and pathway data have been annotated with terms from the Gene Ontology and the KEGG Pathway Ontology, respectively. By exploiting the ontological annotation of genes, it is possible to measure gene similarity, which can then be used to expand the initial gene lists. SNPranker 2.0 provides two similarity measures, which differ in that one takes into account the bare ontological terms (Rel measure [40–42]) while the other also considers the ancestor terms following the ontological tree (Wang measure).
Considering one of the ontologies provided by SNPranker 2.0 and a particular similarity measure, from an input gene list $g_1$ the system generates the list $g_2 \supseteq g_1$, which also contains the genes that correlate with the genes in $g_1$ according to the selected similarity threshold.
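The expansion step can be sketched as follows. This is a minimal illustration, not the SNPranker 2.0 implementation: the function `expand_gene_list`, the `similarity` callable and the toy similarity values are hypothetical, and a real run would plug in a Rel or Wang semantic similarity computed from ontology annotations.

```python
from typing import Callable, Iterable, List


def expand_gene_list(
    g1: Iterable[str],
    candidates: Iterable[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Return g2: g1 plus every candidate similar enough to some gene in g1."""
    seeds = list(g1)
    g2 = list(seeds)
    seen = set(seeds)
    for gene in candidates:
        if gene in seen:
            continue  # already part of the list, nothing to add
        if any(similarity(gene, seed) >= threshold for seed in seeds):
            g2.append(gene)
            seen.add(gene)
    return g2


# Toy symmetric similarity table (invented values, for illustration only).
_TOY_SCORES = {
    frozenset(("BCL2", "BAX")): 0.9,
    frozenset(("BCL2", "TP53")): 0.4,
}


def toy_similarity(a: str, b: str) -> float:
    """Look up a pairwise score; identical genes score 1.0, unknown pairs 0.0."""
    if a == b:
        return 1.0
    return _TOY_SCORES.get(frozenset((a, b)), 0.0)


g2 = expand_gene_list(["BCL2"], ["BAX", "TP53", "BCL2"], toy_similarity, threshold=0.8)
print(g2)
```

With the toy values, only the candidate whose similarity reaches the threshold is added, and the input genes are always preserved at the head of the expanded list.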
The features are the characteristics of SNPs that represent the a priori knowledge of their underlying biology, which is the base for modelling the biomolecular information related to these polymorphisms. In this data mining approach, users can select the features they would like to consider for SNP evaluation and assign a custom relevance to these features in the score computation.
An added value of this work is the pre-computation of an optimal set of weights for the features to provide by default suitable ranked SNP lists associated to the disease genes provided as input. The idea is to give a general scoring model, which can predict the importance of each attribute in a generic pathological context, assuring a valuable SNP ranking. A machine learning approach has been used to find this parameter setting, which is proposed by default in the SNPranker 2.0 web site. To find the appropriate weights, we formulated an optimization problem, solved using a genetic algorithm, which considers as fitness the system sensitivity and involves cross-validation during the assessment of candidate weights. In other words, we exploit a genetic algorithm to optimize a model from the data, which is a classical method of supervised machine learning. The combination of this machine learning approach with a framework that allows users to perform a fine-tuning of the system parameters (useful for verifying the effect of changes in the features and relative weights on the final SNP scores) realizes the data mining approach.
This strategy allows the computation of the final SNP score as a single real number.
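To make the idea of a single-number score concrete, the sketch below combines feature values and user weights as a plain weighted sum. The weighted-sum form is an assumption made for illustration, since the exact scoring function is not reproduced in this text, and the feature names and values are hypothetical.

```python
from typing import Dict


def snp_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum over the features the user selected (the keys of `weights`).

    Features that are not selected are ignored; selected features that are
    missing for this SNP contribute zero.
    """
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


# Hypothetical user selection: two features with custom weights.
weights = {"cpg_island": 2.0, "mirna_binding": 1.0}

# Hypothetical feature values for one SNP; "methylation" is not selected.
snp = {"cpg_island": 1.0, "mirna_binding": 0.0, "methylation": 0.5}

score = snp_score(snp, weights)
print(score)  # 2.0
```

Changing a weight or removing a feature from `weights` immediately changes the score, which is the kind of fine-tuning the web interface exposes to users.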
Starting from a set of genes associated by experimental evidence with specific pathologies, a genetic algorithm has been implemented in order to find the set of feature weights that minimizes the distance between the set of SNPs retrieved by the system and the list of SNPs experimentally associated with the same disease. The optimal values of the weights were found taking into account the specificity of the SNPranker 2.0 predictions. To this end, we considered, as input, all the genes and SNPs associated with a set of 16 pathologies, reported in Additional File 2, as described in OMIM. Given the set S of SNPs containing all the SNPs associated with the genes of the considered pathologies, taking into account a flanking region of 100,000 bp, and defining A as the set of SNPs certified by OMIM as involved in the disease, we minimise a fitness function defined on S and A.
In this way, a total of 640 simulations were run, by choosing iteratively one of the 16 pathologies as test set and evaluating 10 steps of the first parameter (from 0.1 to 1.0) and 4 steps of the second parameter (from 0.25 to 1.0). Once all the simulations were completed, we validated each parameter configuration against the disease previously chosen as test case for that simulation: for each validation test, we evaluated the fitness with Eq. 2 and we determined the sensitivity, the specificity, and the accuracy of that parameter configuration. Then, we calculated the average values of these indexes over the 16 simulations sharing the same pair of parameter values, in order to estimate how accurately our predictive model will perform in practice. All the data relative to the genetic evolution of the parameter configurations have been collected according to the same values of the two parameters, in order to estimate the performance of the predictive model, as reported in Additional File 3. The best parameters for assigning higher scores to disease-associated SNPs, determined using our machine learning approach, are visible as default feature weights on the SNPranker 2.0 home page. Pathologies, however, are characterized by different traits, and so each parameter configuration may work better with certain diseases than with others. For this reason, users can directly set each feature weight on the basis of the aims of their specific study. This fine-tuning procedure is possible because each feature is associated with a weight that represents the importance attributed by users to the feature in the final SNP score computation.
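The bookkeeping behind the 640 simulations can be sketched as a simple enumeration: 16 held-out pathologies times 10 values of one parameter times 4 values of the other. The names `p1` and `p2` are placeholders, evenly spaced steps are an assumption, and `average_by_config` is a hypothetical helper showing how an index could be averaged over the 16 runs that share a configuration.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, Iterable, Tuple

N_PATHOLOGIES = 16
# Evenly spaced steps are assumed here; the text only gives the ranges.
p1_values = [round(0.1 * k, 1) for k in range(1, 11)]  # 0.1, 0.2, ..., 1.0
p2_values = [0.25 * k for k in range(1, 5)]            # 0.25, 0.5, 0.75, 1.0

# One run per (held-out disease, p1, p2) combination.
runs = [
    {"test_disease": disease, "p1": p1, "p2": p2}
    for disease, p1, p2 in product(range(N_PATHOLOGIES), p1_values, p2_values)
]
print(len(runs))  # 640


def average_by_config(
    results: Iterable[Tuple[float, float, float]]
) -> Dict[Tuple[float, float], float]:
    """Average an index (e.g. sensitivity) over all runs sharing (p1, p2)."""
    totals = defaultdict(lambda: [0.0, 0])
    for p1, p2, value in results:
        entry = totals[(p1, p2)]
        entry[0] += value
        entry[1] += 1
    return {config: total / count for config, (total, count) in totals.items()}


# Two made-up results for the same configuration.
averaged = average_by_config([(0.1, 0.25, 0.5), (0.1, 0.25, 1.0)])
print(averaged)
```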
Once users complete their selections and start the computation, SNPranker 2.0 first extracts all the SNPs related to the selected genes (or biological process), then computes the ontological expansion (if required), and finally computes the score for all the input SNPs according to the selected features and weights. For each SNP, all selected features are presented in a final table combined with each score, showing both the original data, for annotation purpose, and the related scores. The web interface displays on the fly all the results and, at the end of the computation, output SNPs can be effectively ranked according to their scores, which are available in the last column of the output table. When the result page is completely loaded, a link to download output data in compressed format is presented at the bottom of the table and the SNP list enrichment tools about pathways and diseases become available to users.
The enrichment of the ranked SNP list, considering KEGG pathways, GO terms and OMIM genes and genetic disorders, is a valuable tool to interpret the output of the system. For example, considering the annotation of the top ranked SNPs in terms of KEGG pathways, it is possible to verify whether the system has privileged genomic features belonging to a particular biological network. In the same way, an enrichment of best hits in a particular genetic disorder according to OMIM can be a clear indication that the identified SNPs are effectively involved in a specific disease. The enrichment is computed by comparing the total number of genes that have a particular ontological annotation with the number of top ranked genes with the same annotation (considering the genes that carry the identified SNPs). Statistical significance of the enrichments is assessed with hypergeometric tests, which make it possible to verify whether the number of occurrences of a particular ontological annotation in the top ranked list of SNPs is due to chance. Due to the high number of P-values computed for this analysis, the statistics are corrected with the False Discovery Rate control method, using the "phyper", "dhyper" and "p.adjust" routines available in R.
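For readers who want to see the arithmetic, the enrichment test can be sketched in plain Python. This is an illustrative re-implementation, not the code used by the tool, which relies on the R routines named above. Two assumptions are made: the test is one-sided (probability of observing at least k annotated genes), and the False Discovery Rate correction is the Benjamini-Hochberg procedure. Function names are hypothetical.

```python
from math import comb
from typing import List


def hypergeom_upper_tail(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the annotation of interest."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    # Walk from the largest P-value to the smallest, keeping a running minimum.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for steps_from_top, idx in enumerate(order):
        rank = m - steps_from_top  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy example: 10 genes, 5 annotated, 1 gene in the top-ranked list, 1 hit.
p = hypergeom_upper_tail(k=1, N=10, K=5, n=1)
print(p)  # 0.5

print(benjamini_hochberg([0.01, 0.04, 0.03]))
```

One P-value is computed per annotation term, and the whole vector of P-values is then passed to the correction step.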
The system takes as input a list of genes, a set of SNPs or a biological process. Genes and SNPs can be provided as comma separated values of IDs (EntrezGene or GeneSymbol for genes, RS identifiers for SNPs). For the biological process option, the web interface provides an auto-completion box with the GO names of biological processes. Once a particular biological process is selected, all genes annotated with this GO term are provided as input to the system. Since many SNPs are not directly associated to genes because of their inter-genic localization, SNPranker 2.0 provides a parameter for customizing the flanking regions.
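The input handling described above can be sketched as two small steps: splitting a comma-separated list of identifiers, and collecting the SNPs that fall inside a gene or its flanking region. The helper names, coordinates and identifiers below are hypothetical, and the sketch assumes all positions lie on the same chromosome.

```python
from typing import Dict, List, Tuple

Gene = Tuple[str, int, int]  # (symbol, start, end), single chromosome assumed


def parse_ids(text: str) -> List[str]:
    """Split a comma-separated list of IDs, dropping blanks and whitespace."""
    return [token.strip() for token in text.split(",") if token.strip()]


def snps_near_genes(
    snps: Dict[str, int], genes: List[Gene], flank: int
) -> Dict[str, List[str]]:
    """Map each gene to the SNPs located within the gene or its flanking region."""
    hits: Dict[str, List[str]] = {}
    for symbol, start, end in genes:
        low, high = start - flank, end + flank
        hits[symbol] = sorted(rs for rs, pos in snps.items() if low <= pos <= high)
    return hits


# Made-up coordinates for illustration only.
genes = [("GENE_A", 1000, 2000)]
snps = {"rs1": 900, "rs2": 1500, "rs3": 5000}

print(parse_ids("GENE_A, GENE_B ,,GENE_C"))
print(snps_near_genes(snps, genes, flank=200))
```

With a flank of 200 bp the upstream SNP at position 900 is attributed to the gene, while with a flank of 0 only the intragenic SNP is retained.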
The ontological expansion is an important method for studying SNPs related to pathologies, since it extends the analysis to SNPs that could potentially be involved in the onset of a pathology, but are not annotated as disease-associated and have not been highlighted by more traditional approaches. Including in the computation SNPs that belong to genes annotated similarly to those provided as input increases the number of associated SNPs under analysis. For this reason, the ontological expansion enriches the input list $g_1$ by adding new genes that are biologically related to those it contains, relying on GO terms. The biological relationship among genes is evaluated through two semantic similarity metrics (referred to as Rel and Wang in the web page), which compare the GO terms associated with each gene. Depending on the interests of the user, for each gene in $g_1$, the system retrieves a number of genes with the highest semantic similarity according to their Gene Ontology annotations.
Default feature weights as result of the optimization process.
SNPs and Genes
Lamina associated domains
Epigenetics and transcription regulations
Methylation (seq regions)
CpG islands, promoters, first exons
TAF1 binding sites
Intergenic regulatory elements
Regulatory regions (OregAnno)
TXN factor ChIP-Seq
miRNA binding regions
A description of each feature is available within the web interface in an intuitive balloon text close to each feature name. For example, considering the epigenetics features, a user can select a particular tissue or a cell line, while in case of selection of the general feature only, without the detail of tissue or cell line, the average values are considered for the score computation.
Once all values have been computed, the last step consists in ranking all SNPs according to their scores. Due to the large number of SNPs potentially reported as output, SNPranker 2.0 allows users to cut the list at a given threshold, based on a percentile of the total number of SNPs. Moreover, the enrichment tools allow testing whether the provided SNP list is enriched in genes associated with a particular disease or pathway. The final list of enriched SNPs with scores is specifically aimed at supporting the evaluation of disease-associated SNPs.
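The ranking and cut-off step can be sketched as a sort followed by a slice. The function name `top_percent` and the scores are hypothetical, and rounding the number of retained SNPs upwards is an assumption of this sketch rather than documented behaviour of the tool.

```python
import math
from typing import List, Tuple


def top_percent(scored: List[Tuple[str, float]], percent: float) -> List[Tuple[str, float]]:
    """Rank SNPs by decreasing score and keep the top `percent` of the list."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    keep = math.ceil(len(ranked) * percent / 100)  # round up (assumption)
    return ranked[:keep]


# Made-up scores for illustration only.
scored = [("rs1", 0.2), ("rs2", 0.9), ("rs3", 0.5), ("rs4", 0.7)]
best = top_percent(scored, percent=50)
print(best)
```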
The SNPranker 2.0 tool has been validated using OMIM data, considering a set of pathologies influenced by recognized SNPs. For each disease the list of associated genes has been given as input to the system and the list of ranked SNPs has been compared to the set of SNPs provided by OMIM for the same disease.
Optimal feature weights were found in order to obtain the best sensitivity, which is the system's capability of detecting correct cases, but other indexes should also be taken into account, such as specificity, which evaluates the capability of the system to filter out incorrect cases, and accuracy, which measures the degree of closeness of our classification to the real cases. This is due to the dependence of the fitness on the ratio between high scored SNPs and the total number of evaluated SNPs, which makes sensitivity more unfavourable when this ratio is higher.
Therefore, fitness values proposed by the machine learning algorithm should be carefully considered, because simulations that can identify almost all the disease associated SNPs do not filter out SNPs with the same effectiveness, and so the specificity and the accuracy indexes tend to lower values. Vice versa, greater values of accuracy and specificity mean less predictive power of the system.
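The three indexes and the trade-off between them can be illustrated with a short sketch. The function name and the SNP identifiers are hypothetical; "selected" stands for the high scored SNPs, "associated" for the SNPs known to be involved in the disease, and both are assumed to be subsets of the evaluated "universe".

```python
from typing import Dict, Iterable


def classification_indexes(
    selected: Iterable[str], associated: Iterable[str], universe: Iterable[str]
) -> Dict[str, float]:
    """Sensitivity, specificity and accuracy of a set of selected SNPs."""
    sel, assoc, univ = set(selected), set(associated), set(universe)
    tp = len(sel & assoc)          # associated SNPs that were selected
    fp = len(sel - assoc)          # selected SNPs that are not associated
    fn = len(assoc - sel)          # associated SNPs that were missed
    tn = len(univ - sel - assoc)   # correctly discarded SNPs
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / len(univ) if univ else 0.0,
    }


# Made-up identifiers: 10 evaluated SNPs, 2 of them disease-associated.
universe = [f"rs{i}" for i in range(1, 11)]
associated = ["rs1", "rs2"]

strict = classification_indexes(["rs1", "rs3", "rs4"], associated, universe)
permissive = classification_indexes(universe, associated, universe)
print(strict)
print(permissive)
```

Selecting every SNP reaches a sensitivity of 1.0 but drives specificity to 0.0, which is the behaviour the paragraph above warns about.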
The chosen values of the two parameters seem reasonable, since they result in the identification of 81% of associated SNPs with an accuracy and specificity of 76%. In some of the considered cases (such as Cystic Fibrosis, Sickle Cell Anaemia, and Haemophilia), the top ranked SNPs show a statistically significant enrichment (P < 0.05, hypergeometric test) in SNPs known to be associated with the tested pathologies. In Huntington's disease, the first three SNPs appearing in the ranked list are exactly those reported in OMIM for this pathology.
We tested SNPranker 2.0 using different parameters and here we discuss two case studies: the first scenario is a search for semantic annotation and the second case is a comparison with a GWAS output.
Semantic similarity analysis of tested genes.
B-Cell Cll/Lymphoma 2
SNPranker results comparison with a GWAS for Bipolar Disorder.
Given the need for SNP prioritization tools, we updated our prototype system by developing SNPranker 2.0, a web-based system that performs data mining of publicly available biomolecular knowledge about SNPs. SNPranker 2.0 is based on a gene-centric data-warehouse approach, which exploits a machine learning method to rank SNPs and compute final scores. It relies on the identification of a set of crucial features characterizing SNPs related to a list of input genes. These features represent the a priori knowledge from which our data mining approach derives a final score for each SNP, which can be tuned by users according to their preferences. By employing a genetic algorithm we created a supervised classifier, which estimates the optimal weights of the SNP features. Using these parameters, SNPranker 2.0 provides a scored list of variations, which can be statistically analysed to verify its enrichment in particular pathways or disease genes. Concrete usage scenarios are the identification of the most important SNPs in population genetics studies, in order to create custom genotyping chips, and GWAS output re-scoring for interpreting top ranked SNPs in a specific biological context.
The publication costs for this article were funded by the Italian Ministry of Education and Research (MIUR) through the Flagship (PB05) "InterOmics" project.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 1, 2013: Computational Intelligence in Bioinformatics and Biostatistics: new trends from the CIBB conference series. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S1.
This work has been supported by the Italian Ministry of Education and Research (MIUR) through the Flagship (PB05) "InterOmics", ITALBIONET (RBPR05ZK2Z), HIRMA (RBAP11YS7K) and the European "MIMOMICS" projects.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.