Genome-wide prediction of splice-modifying SNPs in human genes using a new analysis pipeline called AASsites
© Faber et al; licensee BioMed Central Ltd. 2011
Published: 5 July 2011
Some single nucleotide polymorphisms (SNPs) are known to modify the risk of developing certain diseases or the reaction to drugs. Due to next generation sequencing methods the number of known human SNPs has grown. Not every SNP leads to a modified protein that may be the origin of a disease, so functional SNPs need to be recognized. Because most SNP annotation tools look for SNPs which lead to an amino acid exchange or a premature stop, we designed a new tool called AASsites which searches for SNPs which modify splicing.
AASsites uses several gene prediction programs and open reading frame prediction to compare the wild type (wt) and the variant gene sequence. The results of the comparison are combined by a handmade rule system to classify a change in splicing as “likely”, “probable” or “unlikely”. Having obtained good results in tests with SNPs known to change the splicing pattern, we checked 80,000 SNPs from the human genome which are located near splice sites for their ability to change the splicing pattern of the gene and thereby result in a different protein. We identified 301 SNPs classified as “likely” and 985 classified as “probable” with such characteristics. Within this set, 33 SNPs are described in the ssSNP Target database as causing modified splicing.
With AASsites, single SNPs can be checked for their potential to cause splice modifications. Screening 80,000 known human SNPs, we detected about 1,300 SNPs (301 “likely” and 985 “probable”) which probably modify splicing. AASsites is available at http://genius.embnet.dkfz-heidelberg.de/menu/biounit/open-husar using any web browser.
Approximately 6.5 million SNPs have been identified in human genes, deposited in the dbSNP database (http://www.ncbi.nlm.nih.gov/projects/SNP/) and used by the EnsEMBL database (http://www.ensembl.org/). In dbSNP, the term SNP covers not only the exchange of a nucleotide but also the deletion or insertion of a single base (indels). For many SNPs located in genes, the effects on the genes are not known. Application of the new sequencing technologies 454 and Solexa will allow the discovery of many more SNPs whose effects need to be elucidated. It is important to know the effect, as SNPs can be relevant for diseases; for example, a SNP in the APOE gene increases the risk of developing Alzheimer disease. SNPs account for differences in cancer risk (Dong et al., 2008; Chen et al., 2009) and drug metabolism. Available prediction tools for SNPs, like LS-SNP, mostly evaluate whether the SNP is within a coding region and changes or abolishes the protein. Others contain a collection of previously evaluated SNPs which can be queried by SNP id, disease or chromosomal region [4, 5] (http://compbio.cs.queensu.ca/F-SNP/). Those SNPs are analysed and scored according to the location of the SNP (splice site, ESE, TFBS, coding region) and known effects in diseases. A further list with more than ten web servers which analyse SNPs can be found in Karchin, 2009. In contrast, our tool AASsites looks at the potential of the SNPs to modify the splicing pattern of a gene and does not depend on the annotation of known SNPs. Modified splicing is likely to have a profound effect on the phenotype, with relevance to disease risk or drug metabolism. A change in splicing can be caused by modifying any of the components of the splicing machinery, such as splice sites, splice enhancers or silencers. These are evaluated separately by “Skippy” to predict a score for modulated splicing.
A new tool called SpliceScanII looks at all these elements to predict splice changes in genetic variants and has proven to work in the context of disease-linked variations. AASsites uses the power of gene prediction programs, which are trained to evaluate the splice-relevant components, in order to predict changes in splicing patterns caused by SNPs. Additionally, ESEdetector, for discovering changes in exonic splicing enhancers (ESEs), and programs to detect changes in the open reading frame (ORF) are used. A handmade rule system combines the results and classifies the SNP as “likely”, “probable” or “unlikely” to lead to modified splicing of the gene.
Table: Test results of AASsites using SNPs with known changes (column: Number of SNPs)
Table: Classification results of selected human SNPs
Table: SNPs with known changes in splicing identified by AASsites (column: Change in splice pattern; entry: Renal cell carcinoma)
Table: Pathways over-represented in genes with SNPs modifying splicing
Metabolism of xenobiotics by cytochrome P450
ABC transporters - General
Regulation of actin cytoskeleton
Small cell lung cancer
Cyanoamino acid metabolism
Non-small cell lung cancer
Pathogenic Escherichia coli infection - EPEC
To identify SNPs which modify the protein by changing the splicing pattern, the pipeline AASsites was developed. This pipeline is available through its web interface at http://genius.embnet.dkfz-heidelberg.de/menu/biounit/open-husar. Unlike many other SNP analysis tools, our tool predicts the effect of SNPs on splicing. Not only SNPs localized at splice sites can modify the splicing of a gene; SNPs near splice sites can have the same effect because other regulatory sequences are involved. Gene prediction programs take these regulatory sequences into account by using hidden Markov models or similar algorithms. Still, 433 genes could not be analysed because in these cases none of the five gene prediction programs worked correctly on the wt sequence. This minor problem could be solved by integrating one or two more prediction tools. A second problem is the prediction of SNPs in alternatively spliced products. Most gene prediction programs do not predict alternative splice sites. The only exception is Augustus (http://augustus.gobics.de), which should therefore be integrated. In that case the different alternatively spliced wild type forms of the gene also have to be considered.
We have shown with a set of SNPs known to affect or not to affect splicing that the pipeline was able to correctly predict the change in splicing caused by the SNP in 83% of 109 cases. The difficulty in testing and improving the rule system for combining the results lies in the small number of experimentally proven SNP-derived modifications in splicing. With more experimental data available, we could replace the rule system by a knowledge system based on machine learning algorithms, or we could optimize the rules. The comparison with SpliceScanII shows that AASsites performs better on our small test set, but the number of examples is much too small for a final evaluation.
New tools could be implemented to assist AASsites by selecting the correct splice change if different changes are predicted by the different gene prediction tools. A further analysis of the predicted splice sites with tools like the “Human Splicing Finder”, which predicts the effect of mutations on the splice signals, or “Skippy”, which analyses ESEs and ESSs and the evolutionary constraint of the region surrounding the variant, could complement our approach.
Another improvement would be to evaluate different SNPs of the same haplotype together. At the moment, AASsites treats all SNPs as independent: the analysis is done for only one SNP at a time, even if the input sequence contains several SNPs. As a result, the combined effects of multiple SNPs are missed.
The genome-wide analysis of known SNPs near splice sites revealed 1,300 SNPs which are probably capable of modifying the protein by changed splicing. We could show that not only SNPs directly at splice sites are likely to modify splicing. Among the splice-relevant SNPs were 33 cases which, according to the ssSNP Target database, were experimentally verified and involved in the genesis of diseases, demonstrating the functionality of the pipeline. Other SNPs in genes which are related to diseases were found and could be candidates for further research.
To identify SNPs which modify the protein by changing the splicing pattern the pipeline AASsites was developed. This pipeline uses gene prediction programs for this purpose and is available through its web interface at http://genius.embnet.dkfz-heidelberg.de/menu/biounit/open-husar. The genome-wide analysis of human SNPs near splice sites revealed 1300 SNPs which are probably capable of modifying the protein by changed splicing. Some already known SNPs were identified, but other SNPs in genes related to diseases could be good candidate SNPs for further research.
An overview of the AASsites pipeline is outlined in Figure 1. Input is a DNA sequence containing the SNP and the EnsEMBL gene id (EnsEMBL version 53) to which the SNP belongs. The EnsEMBL gene id is used to extract the wt genomic sequence and the wt protein as well as to derive the real exon-intron structure. The different analysis steps which are outlined below are performed with the SNP containing sequence. An HTML report page with the classification and the single results (see Figure 2) is produced as output.
The input DNA sequence is compared to the wt sequence by the FASTA program. The position of the SNP determines whether it is located in an intron or an exon. Depending on the location (intron or exon), a different set of tools is run and different rules are applied.
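The exon/intron decision described above can be sketched as follows. This is only an illustration, not the AASsites code: the function name `locate_snp`, the 1-based inclusive coordinates and the way distance is counted (the base directly at a splice junction has distance 1) are assumptions made for the example.

```python
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive; assumed convention


def locate_snp(pos: int, exons: List[Exon]) -> Tuple[str, Optional[int]]:
    """Return the region type of a SNP and its distance to the nearest splice site.

    Region is 'exon', 'intron' or 'outside'. The distance is None when the
    SNP lies outside the gene or when no splice site exists (single exon).
    """
    exons = sorted(exons)
    if not exons or pos < exons[0][0] or pos > exons[-1][1]:
        return "outside", None
    last = len(exons) - 1
    for i, (start, end) in enumerate(exons):
        if start <= pos <= end:
            # An exon edge is a splice site only if an intron is adjacent to it.
            dists = []
            if i > 0:
                dists.append(pos - start + 1)
            if i < last:
                dists.append(end - pos + 1)
            return "exon", (min(dists) if dists else None)
        if i < last and end < pos < exons[i + 1][0]:
            # Intronic SNP: distance to the closer flanking exon.
            return "intron", min(pos - end, exons[i + 1][0] - pos)
    return "outside", None
```

For a hypothetical gene with exons at 100-200, 301-400 and 501-600, `locate_snp(203, ...)` reports an intronic SNP three bases from the donor site.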
At the moment, five different gene prediction programs are integrated into the AASsites pipeline. They rely on different models for prediction.
GenScan is based on hidden Markov models and considers elementary signals like basic transcriptional, translational and splicing signals as well as length distributions and compositional features of exons, introns and intergenic regions.
Class hidden Markov models are used in HMMgene to predict the most probable gene structure based directly on labelled sequences, using labels for coding regions, introns and intergenic regions.
The program GeneID uses a hierarchical approach composed of three different steps to assemble the gene structure. It starts out by scoring splice sites and start and stop codons using so-called Position Weight Matrices (PWMs). In the second step, exons are built from the sites. Exons are scored as the sum of the scores of the defining sites, plus the log-likelihood ratio of a Markov Model for coding DNA. In the last step, the gene structure is assembled from the set of predicted exons, maximizing the sum of the scores of the assembled exons.
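The three steps described for GeneID can be illustrated with a small sketch. It is a deliberately simplified stand-in, not GeneID itself: the uniform background of 0.25, the toy PWM format and the chaining of non-overlapping exons by dynamic programming (which ignores reading-frame compatibility) are all assumptions made for the example.

```python
import math
from typing import Dict, List, Tuple

ScoredExon = Tuple[int, int, float]  # (start, end, score); assumed representation


def pwm_score(site: str, pwm: List[Dict[str, float]], background: float = 0.25) -> float:
    """Step 1: log-odds score of a candidate site under a position weight matrix.

    Each PWM column maps a base to its probability; probabilities must be
    non-zero (e.g. after adding pseudocounts).
    """
    if len(site) != len(pwm):
        raise ValueError("site length must match PWM length")
    return sum(math.log2(col[base] / background) for col, base in zip(pwm, site))


def exon_score(acceptor: float, donor: float, coding_llr: float) -> float:
    """Step 2: exon score = scores of the defining sites + coding log-likelihood ratio."""
    return acceptor + donor + coding_llr


def best_chain(exons: List[ScoredExon]) -> Tuple[float, List[ScoredExon]]:
    """Step 3: choose non-overlapping exons that maximise the summed score."""
    exons = sorted(exons, key=lambda e: e[1])
    best: List[Tuple[float, List[ScoredExon]]] = []  # best chain ending at exon i
    for i, (start, _end, score) in enumerate(exons):
        cand = (score, [exons[i]])
        for j in range(i):
            # Exon j may precede exon i only if it ends before exon i starts.
            if exons[j][1] < start and best[j][0] + score > cand[0]:
                cand = (best[j][0] + score, best[j][1] + [exons[i]])
        best.append(cand)
    return max(best, key=lambda b: b[0]) if best else (0.0, [])
```

A real gene finder additionally enforces frame consistency and start/stop codon placement when chaining exons; the sketch only shows how additive exon scores lead to an optimisation over exon sets.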
A generalised HMM is the basis of GlimmerHMM, which also uses decision trees and the maximal dependence decomposition method.
The last program, GrailEXP6 (http://grail.lsd.ornl.gov/grailexp/index.html), is implemented as a building block system consisting of three different parts. It first uses statistical techniques to pinpoint possible locations of exons. Then it brings in empirical evidence from nucleotide and protein databases to create possible "pieces" of genes. Finally, an intelligent algorithm constructs the genes from these pieces.
To determine possible changes due to the SNP, the wt sequence and information about the structure have to be determined. Using the EnsEMBL Perl API (Ensembl53) the wildtype sequence, the intron-exon structure and the protein sequence are extracted from the EnsEMBL database.
Five gene prediction programs are used to predict the gene structure of the wildtype gene sequence. These predictions are compared independently to the gene structure derived from EnsEMBL. A given gene prediction program is used for the prediction of the sequence containing the mutation if the exon or intron, in which the SNP is localised, was correctly predicted for the wildtype sequence. This selection means that not all prediction programs are used for each SNP. If no prediction program can be found to predict the wt exon or intron, the program will output “No prediction available”. The predicted gene structures for the SNP-containing sequence are compared to the wildtype structure to detect changes.
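The selection of gene prediction programs can be sketched as below. The sketch assumes that each program's output has already been parsed into a list of exon coordinates and that the variant prediction is compared with the same program's wt prediction; both points, as well as all function names, are assumptions for illustration. Coordinate shifts caused by indels are ignored here.

```python
from typing import Dict, List, Optional, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive; assumed convention


def feature_at(pos: int, exons: List[Exon]) -> Optional[Tuple[str, int, int]]:
    """Return the exon or intron (with its boundaries) that contains pos."""
    exons = sorted(exons)
    for i, (start, end) in enumerate(exons):
        if start <= pos <= end:
            return ("exon", start, end)
        if i + 1 < len(exons) and end < pos < exons[i + 1][0]:
            return ("intron", end + 1, exons[i + 1][0] - 1)
    return None


def usable_programs(pos: int, reference: List[Exon],
                    wt_predictions: Dict[str, List[Exon]]) -> List[str]:
    """Programs whose wt prediction reproduces the reference feature holding the SNP."""
    target = feature_at(pos, reference)
    if target is None:
        return []
    return [name for name, exons in wt_predictions.items()
            if feature_at(pos, exons) == target]


def splice_changes(pos: int, reference: List[Exon],
                   wt_predictions: Dict[str, List[Exon]],
                   variant_predictions: Dict[str, List[Exon]]) -> Optional[Dict[str, bool]]:
    """Per usable program, report whether the variant structure differs from wt.

    Returns None when no program predicted the wt feature correctly, which
    corresponds to the 'No prediction available' outcome.
    """
    programs = usable_programs(pos, reference, wt_predictions)
    if not programs:
        return None
    return {name: sorted(variant_predictions[name]) != sorted(wt_predictions[name])
            for name in programs}
```

Because a program is only consulted when it gets the affected exon or intron right on the wt sequence, a change seen in its variant prediction is more plausibly caused by the SNP than by a general weakness of the program on that gene.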
Changes in the open reading frame are analysed using the GeneWise program. GeneWise combines a gene structure model and a homology model to predict the protein sequence for a genomic sequence and to compare this sequence with a homologous protein sequence. In AASsites, 100 coding base pairs of the variant sequence around the SNP are analysed with GeneWise.
If a SNP is localised in an exon, the ESEs are analysed with ESEfinder or ESEdetector. ESEfinder predicts putative ESEs in query sequences using weight matrices corresponding to the motifs of four different human SR proteins. The values that constitute the matrices are derived from frequency values obtained from the alignment of so-called winner sequences of SELEX experiments. ESEdetector is based on a support vector machine and uses a combined oligo-kernel to predict possible exonic splicing enhancers in an input sequence. It has a better prediction accuracy than ESEfinder but needs exons of at least 100 bp. AASsites therefore uses ESEdetector to predict ESE elements in the wildtype and the variant exon if the exon is at least 100 bp long; otherwise it uses ESEfinder. Up to 300 bp of the exon are taken into account, and the ESE elements in the wildtype exon and the variant exon are compared.
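The tool choice and the wt/variant comparison can be sketched as follows. The 100 bp threshold and the 300 bp limit come from the text; how the 300 bp window is positioned is not stated, so centring it on the SNP is an assumption, as is the representation of ESE hits as (position, motif) pairs.

```python
from typing import Dict, Set, Tuple

Hit = Tuple[int, str]  # (position in exon, motif label); hypothetical representation


def choose_ese_tool(exon_length: int) -> str:
    """ESEdetector needs exons of at least 100 bp; shorter exons go to ESEfinder."""
    return "ESEdetector" if exon_length >= 100 else "ESEfinder"


def ese_window(exon: str, snp_index: int, max_len: int = 300) -> Tuple[int, str]:
    """Return (offset, subsequence) of at most max_len bases around the SNP.

    Assumption: the window is centred on the SNP and shifted inwards at the
    exon ends so that it never runs past the exon.
    """
    if len(exon) <= max_len:
        return 0, exon
    start = max(0, min(snp_index - max_len // 2, len(exon) - max_len))
    return start, exon[start:start + max_len]


def compare_ese(wt_hits: Set[Hit], variant_hits: Set[Hit]) -> Dict[str, Set[Hit]]:
    """ESE elements lost or gained in the variant exon relative to the wt exon."""
    return {"lost": wt_hits - variant_hits, "gained": variant_hits - wt_hits}
```

The sets returned by `compare_ese` would then feed into the rule system as evidence for or against a splicing change.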
Scoring table for combining the results of the AASsites analysis tools
SNP distance to splice site
>2 nt and <=4 nt
Intron/Exon disappeared/appeared
No prediction available
Indel with frameshift
Indel without frameshift
No frameshift no stop-codon appeared
New Amino Acid
No genewise prediction
ORF 1 or 0
ORF 2 and Gene prediction 1
ORF 2 and (Gene prediction 2 or 3 or 4)
ORF 3 or 4
ORF 5 and (Gene prediction 1 or 2) and (SNP distance 1 or 2)
ORF 5 and ((Gene prediction 3 or 4) or (SNP distance 3 or 4))
SNP distance 1 and (Gene prediction 1 or 2)
SNP distance 1 and (Gene prediction 3 or 4)
SNP distance 2 and (Gene prediction 1 or 2)
SNP distance 2 and (Gene prediction 3 or 4)
SNP distance 3 and (Gene prediction 1 or 2)
SNP distance 3 and (Gene prediction 3 or 4)
The database DBASS (database for aberrant splicing, first release, http://www.dbass.org.uk) contains mutations and their experimentally determined effects on splicing. Using this database and the publications it refers to, a set of 37 SNPs was selected which affect the splice pattern in a defined way (positive set 1). Added to this was a randomly chosen set of 19 SNPs from DBASS3, not manually checked (positive set 2). As negative set 1, 23 SNPs were chosen which cause an amino acid exchange only. The SNPs of positive set 1 and negative set 1, together with the described effects and the publications, are shown in Additional file 1, Table 1. Additionally, 30 SNPs randomly selected from dbSNP were used as negative set 2, as splice-modifying SNPs are rare and should not appear in a small randomly selected set. This set includes 17 intronic SNPs. SpliceScanII was run on all wt and variant sequences with default parameters and compact output. Differences in exon numbers or exon start/stop sites were counted as a predicted splice modification of the variant.
Approximately 5 million human SNPs located in protein coding genes and found in EnsEMBL 53 (http://www.ensembl.org) were the starting point. Assuming that SNPs near splice sites are more likely to be involved in splice changes, only intronic SNPs located within 10 bases of the exon-intron boundary and exonic SNPs within 100 bases of the splice site were considered. Additionally, a population frequency of the SNP of at least 0.1 in the CEU population was required. According to these criteria, 82,838 SNPs were selected by a Perl script which used the EnsEMBL API to extract the SNPs, the splice sites, the sequences and the population frequencies.
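The selection criteria can be written as a simple filter. The original selection was done with a Perl script against the EnsEMBL API; the Python sketch below only restates the thresholds from the text, and the record layout (`Snp` with `region`, `distance` and `ceu_freq` fields) is a hypothetical one chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Snp:
    snp_id: str      # identifier, e.g. a dbSNP rs number
    region: str      # "exon" or "intron"
    distance: int    # bases to the nearest exon-intron boundary
    ceu_freq: float  # frequency of the SNP in the CEU population


def near_splice_site(snp: Snp, intron_max: int = 10, exon_max: int = 100,
                     min_freq: float = 0.1) -> bool:
    """Apply the distance and frequency thresholds described in the text."""
    if snp.region == "intron":
        limit = intron_max
    elif snp.region == "exon":
        limit = exon_max
    else:
        return False  # SNPs outside exons and introns are not considered
    return snp.distance <= limit and snp.ceu_freq >= min_freq


def select_snps(snps: Iterable[Snp]) -> List[Snp]:
    """Keep only SNPs that pass the filter, preserving input order."""
    return [s for s in snps if near_splice_site(s)]
```

Keeping the thresholds as parameters makes it easy to test how sensitive the size of the selected set is to the distance cut-offs.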
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 4, 2011: Proceedings of the European Conference on Computational Biology (ECCB) 2010 Workshop: Annotation, interpretation and management of mutation (AIMM). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S4.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.