Volume 9 Supplement 10
Optimal spliced alignments of short sequence reads
© De Bona et al; licensee BioMed Central Ltd 2008
Published: 30 October 2008
Next generation sequencing technologies open exciting new possibilities for genome and transcriptome sequencing. While reads produced by these technologies are relatively short and error-prone compared to the Sanger method, their throughput is several orders of magnitude higher. We present a novel approach, called QPALMA, for computing accurate spliced alignments of short sequence reads that takes advantage of the read's quality information as well as computational splice site predictions. In computational experiments we illustrate that both the quality information and the splice site predictions help to considerably improve the alignment quality. Our algorithms were optimized and tested using artificially spliced genomic reads produced with the Illumina Genome Analyzer for the model plant Arabidopsis thaliana.
In this work we aim to develop a method that exploits all available information to accurately align as many spliced reads as possible to the genome. In previous work we proposed methods that take advantage of splice site predictions and an intron length model (PALMA). We extend this method to also benefit from the read's quality scores. The algorithm is based on extensions of the Smith-Waterman algorithm that use more sophisticated parametrized scoring functions. The idea is to tune the parameters of the scoring functions such that the true alignment not only achieves a high score, but also scores higher than all other alignments.
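To make the role of quality scores concrete, the following is a minimal sketch of a Smith-Waterman style local alignment in which the substitution score depends on the quality of each read base. It is not the QPALMA implementation: the function names, the linear form of `substitution_score`, the quality range of 0 to 40 and the gap penalty are all assumptions chosen for illustration, and the sketch omits splice sites and the intron length model entirely. In the method described above, the scoring functions are parametrized and their parameters are tuned on training data rather than fixed by hand.

```python
from typing import Sequence


def substitution_score(read_base: str, genome_base: str, quality: int,
                       max_quality: int = 40) -> float:
    """Hypothetical quality-aware score (an assumption, not the QPALMA function).

    Matches at high-quality positions earn more; mismatches at high-quality
    positions cost more, because they are less likely to be sequencing errors.
    """
    confidence = min(max(quality, 0), max_quality) / max_quality
    if read_base == genome_base:
        return 1.0 + confidence
    return -1.0 - 2.0 * confidence


def local_alignment_score(read: str, qualities: Sequence[int], genome: str,
                          gap_penalty: float = 2.0) -> float:
    """Best Smith-Waterman local alignment score of a read against a genome."""
    if len(read) != len(qualities):
        raise ValueError("need exactly one quality value per read base")
    prev = [0.0] * (len(genome) + 1)  # previous row of the DP matrix
    best = 0.0
    for i in range(1, len(read) + 1):
        curr = [0.0]
        for j in range(1, len(genome) + 1):
            diag = prev[j - 1] + substitution_score(
                read[i - 1], genome[j - 1], qualities[i - 1])
            up = prev[j] - gap_penalty        # read base aligned to a gap
            left = curr[j - 1] - gap_penalty  # genome base aligned to a gap
            cell = max(0.0, diag, up, left)   # local alignment: never below 0
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# The same mismatch is penalized less when the read base has low quality.
confident = local_alignment_score("ACGT", [40, 40, 40, 40], "ACCT")
doubtful = local_alignment_score("ACGT", [40, 40, 0, 40], "ACCT")
print(confident, doubtful)
```

With these assumed scores, a mismatch at a low-quality position barely interrupts the alignment, whereas the same mismatch at a high-quality position cuts it short. This is the kind of behaviour that quality-dependent scoring functions are meant to capture.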
We first studied the accuracy of aligning 30,000 spliced sequences using different variants of QPALMA: with and without quality information, splice site predictions, and intron length information. From the results given in Table 2 we conclude that all three components help to reduce the alignment error rate. We also tested the proposed pipeline on about 3 million short reads, about 10% of which were spliced. The alignment took 12.5 h (on one CPU) and almost all of the reads (98.4%) were aligned correctly. This illustrates that the approach is not only accurate but also fast enough to be used in a next generation mRNA sequencing project.
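As a back-of-the-envelope check of the runtime claim, the reported figures can be converted into a throughput. The input numbers are taken from the text and are stated there as approximate ("about 3 million", "about 10%"), so the results are order-of-magnitude estimates only; the variable names are illustrative.

```python
# Figures reported in the text (approximate values).
reads_total = 3_000_000    # "about 3 million short reads"
spliced_fraction = 0.10    # "about 10% spliced reads"
runtime_hours = 12.5       # on one CPU
correct_fraction = 0.984   # reads aligned correctly

reads_per_hour = reads_total / runtime_hours
reads_per_second = reads_per_hour / 3600
spliced_reads = reads_total * spliced_fraction
correct_reads = reads_total * correct_fraction

print(f"throughput: {reads_per_hour:,.0f} reads/h "
      f"({reads_per_second:.1f} reads/s)")
print(f"spliced reads: about {spliced_reads:,.0f}")
print(f"correctly aligned reads: about {correct_reads:,.0f}")
```

Under these figures, a single CPU processes roughly 240,000 reads per hour, or about 67 reads per second.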
We have presented a novel approach to the difficult task of aligning short reads, as generated by next generation sequencing techniques, across exon boundaries. We were able to successfully exploit all available information sources (the read including its quality scores, splice site predictions, the intron length and, of course, the genome), each of which contributes significantly to reducing the alignment error rate.
- Sonnenburg S, Schweikert G, Philips P, Behr J, Rätsch G: Accurate Splice Site Prediction Using Support Vector Machines. BMC Bioinformatics 2007, 8(Suppl 10):S7. doi:10.1186/1471-2105-8-S10-S7
- Schulze U, Ong C, Hepp B, Rätsch G: PALMA: mRNA to genome alignments using large margin algorithms. Bioinformatics 2007, 23(15):1892–1900. doi:10.1093/bioinformatics/btm275
- Tsochantaridis I, Hofmann T, Joachims T, Altun Y: Support Vector Machine Learning for Interdependent and Structured Output Spaces. Proceedings of the 16th International Conference on Machine Learning 2004.