Volume 13 Supplement 18
Color call improvement in next generation sequencing using multi-class support vector machines
© Viswanath and Yang; licensee BioMed Central Ltd. 2012
Published: 14 December 2012
There is considerable ongoing effort to make DNA sequencing machines faster and more affordable. Improving the accuracy of next-generation sequencers directly lowers sequencing costs by reducing the need for resequencing, which in turn makes genome-based diagnostics and research more affordable. In this paper, we show that the accuracy of next-generation sequencing machines can be significantly improved using supervised learning, specifically multi-class support vector machines. We demonstrate our methods on the SOLiD 5500/5500 XL platform.
Base-calling is the process of determining the order of nucleotides in the read sequence. In SOLiD, base-calling proceeds through color calling, since the SOLiD platform uses an encoding system in which each adjacent pair of nucleotides is represented by one of four colored dyes [2]. Base-callers have been developed for other next-generation sequencing platforms, in particular Illumina and Roche 454. Most of them are based on explicit statistical models, and some are based on supervised learning with support vector machines [3, 4]. Ours is the first supervised learning method applied on a large scale directly to color space, and the first applied on a large scale to SOLiD. Moreover, we show that our methods require less training data, and hence train much faster, than previous methods.
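The color encoding described above can be illustrated with a short sketch. It assumes the standard 2-base color code, in which each nucleotide is given a two-bit code (A=0, C=1, G=2, T=3) and the color of an adjacent pair is the XOR of the two codes. The function names are hypothetical and are not taken from any SOLiD software.

```python
# Two-bit codes for the nucleotides; the color of a dinucleotide is the XOR
# of the two codes (assumed here to match the standard 2-base color code).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_colors(bases):
    """Return the color (0-3) for each adjacent pair of nucleotides."""
    codes = [BASE_CODE[b] for b in bases]
    return [a ^ b for a, b in zip(codes, codes[1:])]


def decode_colors(first_base, colors):
    """Recover the nucleotides from a known first base and the color calls."""
    code = BASE_CODE[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color  # XOR with the color gives the next base's code
        bases.append(CODE_BASE[code])
    return "".join(bases)


colors = encode_colors("ATGGC")
print(colors)                      # color calls for the four adjacent pairs
print(decode_colors("A", colors))  # decodes back to the original sequence
```

Because decoding chains each base to the one before it, a single miscalled color alters every base decoded after it, which is one reason color-call accuracy matters in this setting.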
Materials and methods
Noise in sequencing is due to the imperfect nature of the chemical processes involved. Specifically, incomplete cleavage of bases from previous cycles results in residual signal, a problem known as phasing. Also, signal strength diminishes along the sequence due to depletion of chemicals. These errors accumulate over the sequence length, leading to lower accuracy at the end of a read sequence. We improve the sequencing accuracy by modeling these sources of error explicitly through support vector machines.
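To make the two noise sources concrete, the following is a deliberately simplified toy model, not the chemistry of the instrument and not a model used in this work. It assumes that a fixed fraction of each cycle's signal is carried over from the immediately preceding cycle (phasing) and that overall signal strength falls by a constant factor per cycle (depletion). The function name and both parameters are hypothetical.

```python
def simulate_intensities(colors, phasing=0.1, decay=0.98):
    """Toy model: return one 4-channel intensity vector per cycle.

    phasing: fraction of signal carried over from the previous cycle (assumed).
    decay:   per-cycle multiplicative loss of signal strength (assumed).
    """
    observed = []
    for cycle, color in enumerate(colors):
        channels = [0.0] * 4
        if cycle == 0:
            channels[color] += 1.0  # no earlier cycle to leak from
        else:
            channels[color] += 1.0 - phasing
            channels[colors[cycle - 1]] += phasing  # residual signal
        scale = decay ** cycle  # signal weakens along the read
        observed.append([scale * x for x in channels])
    return observed


for vector in simulate_intensities([0, 1, 1, 3], phasing=0.25, decay=0.5):
    print(vector)
```

Even in this toy setting, the channel of the previous color stays lit in the current cycle and all channels shrink with position, so the gap between the strongest and second-strongest channel narrows toward the end of the read.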
We frame color calling as a classification problem whose input is the raw color intensities of the current cycle (or sequence position) and whose output is the color for that cycle. We use the raw dye intensities directly because doing so does not require each source of error to be known explicitly, and it makes the method more general and applicable to future releases and to different platforms. To address the phasing problem, we use not only the current-cycle color intensities but also the previous-cycle color intensities as input to the classifier. To account for depletion of chemicals, we train a separate classifier for each position in the read sequence. We use the SVMLight Multi-class package [5] with a polynomial kernel and slack rescaling to test our methods.
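The input construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: each cycle is assumed to provide four raw channel intensities, the first cycle (which has no predecessor) is padded with zeros, and examples are written in the SVM-light style text format `<target> <feature>:<value> ...` with classes numbered 1 to 4. All function names are hypothetical.

```python
def features_for_cycle(read_intensities, cycle):
    """Concatenate current and previous cycle intensities (8 values).

    read_intensities: list of 4-channel intensity lists, one per cycle.
    The first cycle has no predecessor, so zeros are used (an assumption).
    """
    current = read_intensities[cycle]
    previous = read_intensities[cycle - 1] if cycle > 0 else [0.0] * 4
    return list(current) + list(previous)


def svmlight_line(color, features):
    """Format one example; classes are numbered 1-4, features from 1."""
    pairs = " ".join(f"{i}:{v:g}" for i, v in enumerate(features, start=1))
    return f"{color + 1} {pairs}"


def per_position_training_sets(reads, labels):
    """Group examples by cycle so each position gets its own training set."""
    by_cycle = {}
    for intensities, colors in zip(reads, labels):
        for cycle, color in enumerate(colors):
            line = svmlight_line(color, features_for_cycle(intensities, cycle))
            by_cycle.setdefault(cycle, []).append(line)
    return by_cycle


# One hypothetical two-cycle read with known colors 0 then 1.
reads = [[[0.9, 0.1, 0.0, 0.0], [0.2, 0.7, 0.05, 0.05]]]
labels = [[0, 1]]
for cycle, lines in per_position_training_sets(reads, labels).items():
    print(cycle, lines)
```

Each per-cycle list could then be written to its own file and used to train one multi-class classifier for that read position, which mirrors the one-classifier-per-position design described above.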
Results and conclusions
References
1. Ledergerber C, Dessimoz C: Base-calling for next-generation sequencing platforms. Briefings in Bioinformatics 2011, 12(5):489–497. doi:10.1093/bib/bbq077
2. Breu H: A theoretical understanding of 2 base color codes and its application to annotation, error detection, and error correction. White Paper, SOLiD™ System 2010.
3. Kircher M, Stenzel U, Kelso J: Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biology 2009, 10(8):R83. doi:10.1186/gb-2009-10-8-r83
4. Erlich Y, Mitra PP, delaBastide M, McCombie WR, Hannon GJ: Alta-Cyclic: a self-optimizing base caller for next-generation sequencing. Nature Methods 2008, 5(8):679–682. doi:10.1038/nmeth.1230
5. Joachims T: Making large-scale SVM learning practical. Cambridge: MIT Press; 1999.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.