- Poster presentation
- Open Access
Robust consensus computation
© Rausch et al; licensee BioMed Central Ltd 2008
- Published: 30 October 2008
- Average Read Length
- Trace Archive
- Progressive Alignment
- Segment Match
- Initial Layout
High-throughput sequencing technologies with short read data pose a new challenge to the current three-phase assembly methodology: Overlap-Phase, Layout-Phase, and Consensus-Phase. We describe a new consensus method that is robust in the face of high coverage, shorter reads, and genomic variation.
We used a read simulator and real data from the NCBI trace archive to evaluate our consensus tool. The main parameters of the read simulator are the source sequence length, the average read length, the number of reads and the error rate per base call. In addition, multiple haplotypes can be simulated. Two further parameters, namely the number of SNPs and the number of indels, specify the genetic variation randomly introduced into these haplotypes. We performed two experiments: (1) Given a source sequence length of 10000, we simulated reads under different settings. The read length varied from 35 to 200, the coverage from 20× to 50× and the error rate per base call from 2% to 4%. In all cases, the computed gap-free consensus matched the simulated source sequence at every position with coverage > 2. (2) Given two haplotypes, each of length 10000 with 100 SNPs and 5 indels, we simulated reads of length 200 at 20× coverage with a 4% error rate. We then manually inspected the multi-read alignment with Hawkeye to evaluate the consensus in the presence of genetic variation (see Fig. 2).
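The setup of experiment (1) can be illustrated with a minimal sketch. It is not the authors' read simulator or their consensus method: it assumes substitution errors only (no indels, no haplotypes), a fixed read length, known read start positions, and a plain per-column majority vote in place of a real multi-read alignment. All function names (`simulate_reads`, `majority_consensus`, `mismatches`) are hypothetical.

```python
import random
from collections import Counter

ALPHABET = "ACGT"


def simulate_reads(source, read_length, num_reads, error_rate, rng):
    """Sample reads uniformly from `source`; each base is replaced by a
    different base with probability `error_rate` (substitutions only)."""
    reads = []
    for _ in range(num_reads):
        start = rng.randrange(len(source) - read_length + 1)
        bases = []
        for base in source[start:start + read_length]:
            if rng.random() < error_rate:
                base = rng.choice([b for b in ALPHABET if b != base])
            bases.append(base)
        reads.append((start, "".join(bases)))
    return reads


def majority_consensus(reads, source_length):
    """Per-column majority vote using the known start positions.
    Returns the consensus string and the per-position coverage."""
    columns = [Counter() for _ in range(source_length)]
    for start, read in reads:
        for offset, base in enumerate(read):
            columns[start + offset][base] += 1
    coverage = [sum(column.values()) for column in columns]
    consensus = "".join(
        column.most_common(1)[0][0] if column else "N" for column in columns
    )
    return consensus, coverage


def mismatches(source, consensus, coverage, min_coverage=3):
    """Positions with coverage > 2 (by default) where the consensus
    differs from the source sequence."""
    return [
        i for i in range(len(source))
        if coverage[i] >= min_coverage and source[i] != consensus[i]
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    source = "".join(rng.choice(ALPHABET) for _ in range(10000))
    read_length, target_coverage = 200, 20
    # coverage = num_reads * read_length / source_length
    num_reads = target_coverage * len(source) // read_length
    reads = simulate_reads(source, read_length, num_reads, 0.04, rng)
    consensus, coverage = majority_consensus(reads, len(source))
    print("mismatching positions:", len(mismatches(source, consensus, coverage)))
```

Because the start positions are known here, the sketch sidesteps the hard part of the problem, namely aligning the reads to each other. It only shows how the simulator parameters relate to coverage and how a consensus can be compared against the source at positions with coverage > 2.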
The results on simulated data are encouraging, and preliminary results on real data show that our consensus quality is comparable to that of other tools. It remains to be shown that our program outperforms other tools in difficult settings, namely high coverage and short, error-prone read data. The consensus tool is part of the SeqAn library (http://www.seqan.de) and the read simulator is available on request: email@example.com.
- Rausch T, Emde AK, Weese D, Döring A, Notredame C, Reinert K: Segment-based multiple sequence alignment. Bioinformatics 2008, 24(16):i187–192. doi:10.1093/bioinformatics/btn281
- Notredame C, Higgins D, Heringa J: T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment. Journal of Molecular Biology 2000, 302:205–217. doi:10.1006/jmbi.2000.4042
- Schatz M, Phillippy A, Shneiderman B, Salzberg S: Hawkeye: an interactive visual analytics tool for genome assemblies. Genome Biology 2007, 8(3):R34. doi:10.1186/gb-2007-8-3-r34
- Döring A, Weese D, Rausch T, Reinert K: SeqAn – An efficient, generic C++ library for sequence analysis. BMC Bioinformatics 2008, 9:11. doi:10.1186/1471-2105-9-11
This article is published under license to BioMed Central Ltd.