FAAST: Flow-space Assisted Alignment Search Tool
© Lysholm et al; licensee BioMed Central Ltd. 2011
Received: 24 February 2011
Accepted: 19 July 2011
Published: 19 July 2011
High throughput pyrosequencing (454 sequencing) is the major sequencing platform for producing long read high throughput data. While most other sequencing techniques produce reading errors mainly comparable with substitutions, pyrosequencing produce errors mainly comparable with gaps. These errors are less efficiently detected by most conventional alignment programs and may produce inaccurate alignments.
We suggest a novel algorithm for calculating the optimal local alignment which utilises flowpeak information in order to improve alignment accuracy. Flowpeak information can be retained from a 454 sequencing run through interpretation of the binary SFF-file format. This novel algorithm has been implemented in a program named FAAST (Flow-space Assisted Alignment Search Tool).
We present and discuss the results of simulations that show that FAAST, through the use of the novel algorithm, can gain several percentage points of accuracy compared to Smith-Waterman-Gotoh alignments, depending on the 454 data quality. Furthermore, through an efficient multi-thread aware implementation, FAAST is able to perform these high quality alignments at high speed.
The tool is available at http://www.ifm.liu.se/bioinfo/
The nature of DNA sequencing has taken a dramatic turn in the last few years, most notably improved through the development and broad use of 2nd generation sequencing methods. The first 2nd generation sequencing method was 454 sequencing, introduced in 2005 with the GS20 sequencing machine which produced 20 million base-pairs (Mbp) per run . 454 sequencing has since been improved steadily both regarding quality and throughput, and the GS FLX Titanium, introduced in 2008, produces 500 Mbp per run, as reads of approximately 350 bp . Although, since 2005, other 2nd generation sequencing methods have emerged, 454 still produces the longest reads and is one of the most widely used platforms. The long reads produced by 454 sequencing makes the method especially attractive for metagenomic sequencing, where the sample is highly complex and overlapping reads are more rare. The enormous technology improvements represented by novel sequencing technologies do not only enable many new studies but also poses great challenges in terms of processing the sequence data. The major underlying technology for data processing is sequence alignment, which plays a key part in all steps from sequence assembly to annotation.
In 1970, the global sequence alignment was proposed and a computational method for solving it . The algorithm utilised the fact that the problem can be solved through solving a number of sub-problems, dynamic programming, which greatly reduced the number of pathways to explore. A decade later, through a modified dynamic programming algorithm, Smith and Waterman defined the local alignment and a method for solving it . Yet another year later Gotoh added non-linear gap penalties to the algorithm . In terms of accuracy dynamic programming methods are still the preeminent methods for solving the two problems and later methods such as FASTA [6, 7] and BLAST [8, 9] use the same underlying technology to calculate alignments.
Recently, to improve the alignment quality using 454 reads, an attempt was made to utilise the flowgram information through probabilistic flowgram matching . The downside of using flowgram matching, i.e. direct matching of flowgrams, is that an SNP will either shift the flowgram one cycle or be matched as two insertions, resulting in an insignificant hit or a hit of low significance, respectively. As SNPs, if not already a factor, also occur as PCR artefacts in sequencing, direct flowgram matching can only be used in conjunction with sequence alignment to improve the accuracy in the cases where homopolymer ambiguity affects the results. Another tool named PanGEA  employs a dynamic gap penalty for alignments where the gap penalties are decreased with an increasing homopolymer. The downside of PanGEA is that it does not consider the pyrosequencing flowpeak values and also uses a linear gap-extension penalty for homopolymer correcting gaps.
Through combining the ability to correct for homopolymer reading errors with sequence alignments, more accurate alignments of 454 data can be achieved. To address these problems, we suggest the use of flow-space assisted Smith-Waterman-Gotoh alignments, i.e. giving the local alignment algorithm the ability to correct for likely sequencing errors while computing the alignment. We implemented the flow-space assisted Smith-Waterman-Gotoh alignment algorithm in a C++ tool named FAAST (Flow-space Assisted Alignment Search Tool) and performed alignments using both regular Smith-Waterman-Gotoh alignments and FAAST.
Evaluation of the effect of flow-space assisted local alignment
By introducing the possibility to perform flow-peak correction, the 'degrees of freedom' for the maximum likelihood estimate increases, potentially producing untrue alignments. For example, if any flow-peak correction was allowed without penalty, any flowgram could match any sequence identically. Therefore, an extensive study of the effect of the flow-space assisted local alignment is needed. The model used for the Smith-Waterman-Gotoh alignment is match/mismatch score = 2/-3 and gap open/extended penalty 5/2, and the additional parameter (see Methods) for the flow-space assisted local alignment is program default (k = 0.25).
Three targets of 25, 50 and 100 nucleotides were randomly picked from the ethidium bromide resistance determinant of Staphylococcus epidermidis (NC_003969), see additional files. For each of these an additional 100 decoy sequences were generated, resulting in a database of 101 nucleotide sequences. The decoys were generated through introducing random SNPs corresponding to 92% identity. This small sequence set would represent the homologs found in an everyday database search.
Finally, query sequences were generated from the target sequence and the algorithms were assessed on their ability to recover the target sequence as the highest scoring alignment, thus find the 'correct' homology in a set of similar decoys. The query sequences were generated ranging from 100% down to 72% nucleotide identity (through introducing random SNPs) using no 454 data simulation (Plain) as well as using Flowsim  to simulate 454 data. 454 data was generated through Flowsim with the generation settings (-G) set to 'Titanium' and 'GS20', as well as a 'high noise' model. The 'high noise'-model constituted a LogNormal(-2.5, 0.2) distribution for negative flows and a Normal(n, 0.15*n) distribution for positive flows of length n.
Evaluating the performance of FAAST
As a straightforward performance test, FAAST was compared to NCBI BLAST 2 and SSAHA2 in a moderate sized alignment task. The test consisted of the typical task of aligning a set of sequenced reads against a small database. The query set was made up of Giardia P15 reads , produced using the Roche 454 GS FLX sequencing platform . The sequencing run produced 221,245 reads, in total 45.8 Mbp, averaging 207 bp in length. The Giardia reads were queried against a database of all Giardia sequences in GenBank (2011-02-20), accessed through taxonomy identifier 5740. The database was composed of 11,178 sequences, in total 65.3 Mbp, see additional files for a complete list of files and scripts used in the evaluation.
Performance evaluation of the FAAST program
FAAST - Flow-space Assisted Alignment Search Tool
In order to implement our algorithm, we have constructed an alignment search tool called FAAST (Flow-space Assisted Alignment Search Tool). FAAST is implemented as a C++ program and compatibility has been ensured using GNU GCC, Intel ICC 12.0 on Linux, but FAAST also compiles with minGW or Intel on the Windows platform. FAAST is an open source project and it is available both as pre-compiled Linux binaries and as source code at http://www.ifm.liu.se/bioinfo/. To facilitate searching with flow peak information, FAAST reads the SFF format (Standard Flowgram Format), which is used to pack 454 data. Furthermore, since the SFF format is a binary format that may be difficult to edit manually, a new format named FFASTA (Flowgram-FASTA) is supported. FFASTA is a FASTA-like format, but it expresses a flowgram for each entry instead of a nucleotide/amino acid sequence. In the FFASTA-format the flowgram is represented as an array of float values for each peak separated by white-space, just as the QUAL format for quality scores. FAAST is implemented with a wide range of parameters for adjusting indexing heuristics, the local alignment model, output-format etc. More information can be found in the FAAST documentation at http://www.ifm.liu.se/bioinfo/ (Under '454 Tools').
Due to the specific nature of pyrosequencing, mostly produced by 454 sequencing machines, regular Smith-Waterman-Gotoh alignments may be inadequate. A homopolymer reading error will introduce a gap in the alignment, which needs approximately 4-5 identities to be outweighed using typical alignment parameters. Thus, any homopolymer indel not flanked by a high enough number of identities will cause early termination of the alignment and/or erroneous alignments. By extending the model to allow the introduction of these gaps at a lower cost at points of homopolymer uncertainties, we show that alignment accuracy can be improved, see Figure 2. While flow-space assisted local alignment slightly decreases accuracy for non-pyrosequencing data, accuracy is gained for pyrosequencing data. Naturally, the improvement in alignment results also depends on the amount and lengths of homopolymer-tracts present in the original data as well as the complexity of the background, see Figure 2. We also note that with increasing sequence length the accuracy is higher at the same query identity level as 100 decoys sample less of the combinatorial space (i.e. there are more ways to place 4 SNPs in a 50 bp sequence then there are to place 2 SNPs in a 25 bp sequence). However, when then complexity of the background begin the affect the results the performance difference is similar regardless of nucleotide length. This is illustrated in Figure 3 where FAAST is able to improve the alignment results of full-length Titanium reads. However, since Titanium 454 data generally is of very high quality and trivially aligned, regular Smith-Waterman-Gotoh aligned over 99.5% of the nucleotides correctly and in total the gain of using FAAST could in some cases be considered small.
Even though FAAST was developed to deal with homopolymer reading errors of 454 data, the FAAST algorithm may be applied to other pyrosequencing methods or any sequencing method or data where homopolymer reading errors occur, for example the Ion Torrent Technology. Furthermore, many bioinformatic algorithms and software are based upon or use to some extent Smith-Waterman, for which the FAAST algorithm could be utilised to better handle homopolymer reading errors.
While BLAST relies on query indexing, FAAST uses database indexing, as implemented in the SSAHA alignment search tool . This provides a speed advantage at the cost of requiring more RAM. The heuristics of FAAST is rudimentary and restricts the number of alignments performed simply through requiring n number of k-mer hits along the same diagonal not spaced more than J nucleotides apart. Although SSAHA2 is much faster that FAAST in a single CPU context, FAAST utilizes the multi-core environment well and still completes the alignment task evaluated in reasonable time. FAAST also compares fairly well to BLAST and in general it would be possible to use the FAAST software for producing alignments with 454 data.
FAAST provides the possibility to both identify potential homopolymer reading errors in pyroseqencing data as well as providing more accurate alignments with pyrosequencing data. FAAST does not only provide high quality alignments but it does so using reasonable computational resources. Therefore, we propose that FAAST could serve as a useful tool in the analysis of genomic and metagenomic data as well as analysis where correctly aligned bases are vital, such as SNP detection.
FAAST alignment algorithm
The FAAST alignment algorithm is based on the Smith-Waterman-Gotoh algorithm. For a Smith-Waterman-Gotoh alignment using pyro-sequencing data, at least two maximum likelihood estimates are used. The first interprets the pyro-sequencing flowgram data into a nucleotide sequence and the second constitutes the actual alignment. By performing an alignment with flowpreak information, a single maximum likelihood estimate will be calculated, which eliminates error propagation from the first estimate to the second.
The variables, d and q represent the two sequences aligned where d i and q j are position i and j in the respective sequence. The function s(d i , q j ) determines the score for aligning d i with q j . Finally, E i, j describes the optimal score of each cell that ends with a gap in the sequence d and H i, j describes the optimal score of each cell that ends with a gap in the sequence q. The two variables are calculated through E i, j = max(E i-1, j - G 0 , E i-1, j - G e ) and H i, j = max(H i, j-1 - G 0 , H i, j-1 - G e ) where G 0 is the minimum gap penalty and G e is the gap extension penalty.
Here, S i. j is limited to describing a homopolymer correction up to four nucleotides. Pd j, n describes the 'down-calling' penalty for a query position j, where n is the number of bases by which the homopolymer is shortened. Accordingly, Pu j, n describes the 'up-calling' penalty for a query position j where n is the number of bases by which the homopolymer is lengthened. However, Pu is only evaluated as non-infinite when d i is equal to q j . Consequently, corrections are not allowed when a corresponding database nucleotide that can be corrected against is not found. For example, the query sequence "TAAT" would potentially align to "TAAAT" with a homopolymer correction in A, while it could not be aligned against "TAACT".
Homopolymer correction penalties
Indexing and heuristics
The first part of the search algorithm is composed of constructing a database index, by which the occurrence of any k-tuple can be requested. FAAST employs a direct and compact database indexing model where the occurrences of all valid k-tuples and the corresponding database position are noted, in the same way as in SSAHA . All occurrences (database positions) are noted in a list, L. Each k-tuple of valid nucleotides ("T", "A", "C" or "G") is then treated as an integer value of base 4 with the 4 nucleotides as alphabet. A second pointer list, P, of size 4 k is allocated and populated so that P holds a pointer to L for each k-tuple of valid nucleotides. Finally, L and P are ordered so that each position of the list P points to the first occurrence of the corresponding k-tuple in L and the value of P i < P i+1 . Consequently, the last occurrence of any k-tuple found at position, i, is retrieved through reading the first occurrence of the next k-tuple found in position, i + 1, in P.
Through matching all k-tuples found in each query against the database-index, all positions at which the query and database share at least k nucleotides are found, denoted as a 'hit'. The 'hits' are subsequently sorted by diagonal and an alignment is generated if at least n hits are found on the same diagonal spaced less than J nucleotides apart. Finally, for the v top-scoring alignments, a re-alignment with complete trace is performed to enable full-alignment output. The default parameters of FAAST use k = 11, n = 2 and J = 50, thus requiring at least two 11-mer 'hits' spaced no more than 50 bp apart.
We gratefully acknowledge financial support from the Swedish Research Council, the Research School of Medical Bioinformatics supported by the Knowledge Foundation Sweden, Karolinska Institutet and Linköping University. We thank Oscar Franzén for providing 454 SFF data from the Giardia P15 sequencing.
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer ML, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, et al.: Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005, 437: 376–380.PubMed CentralPubMed
- Droege M, Hill B: The Genome Sequencer FLX System--longer reads, more applications, straight forward bioinformatics and more complete data sets. J Biotechnol 2008, 136: 3–10. 10.1016/j.jbiotec.2008.03.021View ArticlePubMed
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48: 443–453. 10.1016/0022-2836(70)90057-4View ArticlePubMed
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147: 195–197. 10.1016/0022-2836(81)90087-5View ArticlePubMed
- Gotoh O: An improved algorithm for matching biological sequences. J Mol Biol 1982, 162: 705–708. 10.1016/0022-2836(82)90398-9View ArticlePubMed
- Lipman DJ, Pearson WR: Rapid and sensitive protein similarity searches. Science 1985, 227: 1435–1441. 10.1126/science.2983426View ArticlePubMed
- Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 1988, 85: 2444–2448. 10.1073/pnas.85.8.2444PubMed CentralView ArticlePubMed
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410.View ArticlePubMed
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMed
- Dayhoff MO: Computer analysis of protein evolution. Sci Am 1969, 221: 86–95.View ArticlePubMed
- Vacic V, Jin H, Zhu JK, Lonardi S: A probabilistic method for small RNA flowgram matching. Pac Symp Biocomput 2008, 75–86.
- Kofler R, Torres TT, Lelley T, Schlötterer C: PanGEA: identification of allele specific gene expression using the 454 technology. BMC Bioinformatics 2009, 10: 143. 10.1186/1471-2105-10-143PubMed CentralView ArticlePubMed
- Balzer S, Malde K, Lanzén A, Sharma A, Jonassen I: Characteristics of 454 pyrosequencing data--enabling realistic simulation with flowsim. Bioinformatics 2010, 26: i420--i425. 10.1093/bioinformatics/btq365PubMed CentralView ArticlePubMed
- Jerlström-Hultqvist J, Franzén O, Ankarklev J, Xu F, Nohýnková E, Andersson JO, Svärd SG, Andersson B: Genome analysis and comparative genomics of a Giardia intestinalis assemblage E isolate. BMC Genomics 2010, 11: 543. 10.1186/1471-2164-11-543PubMed CentralView ArticlePubMed
- Ning Z, Cox AJ, Mullikin JC: SSAHA: a fast search method for large DNA databases. Genome Res 2001, 11: 1725–1729. 10.1101/gr.194201PubMed CentralView ArticlePubMed
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.