Highly improved homopolymer aware nucleotide-protein alignments with 454 data
© Lysholm; licensee BioMed Central Ltd. 2012
Received: 13 March 2012
Accepted: 7 September 2012
Published: 12 September 2012
Skip to main content
© Lysholm; licensee BioMed Central Ltd. 2012
Received: 13 March 2012
Accepted: 7 September 2012
Published: 12 September 2012
Roche 454 sequencing is the leading sequencing technology for producing long read high throughput sequence data. Unlike most methods where sequencing errors translate to base uncertainties, 454 sequencing inaccuracies create nucleotide gaps. These gaps are particularly troublesome for translated search tools such as BLASTx where they introduce frame-shifts and result in regions of decreased identity and/or terminated alignments, which affect further analysis.
To address this issue, the Homopolymer Aware Cross Alignment Tool (HAXAT) was developed. HAXAT uses a novel dynamic programming algorithm for solving the optimal local alignment between a 454 nucleotide and a protein sequence by allowing frame-shifts, guided by 454 flowpeak values. The algorithm is an efficient minimal extension of the Smith-Waterman-Gotoh algorithm that easily fits in into other tools. Experiments using HAXAT demonstrate, through the introduction of 454 specific frame-shift penalties, significantly increased accuracy of alignments spanning homopolymer sequence errors. The full effect of the new parameters introduced with this novel alignment model is explored. Experimental results evaluating homopolymer inaccuracy through alignments show a two to five-fold increase in Matthews Correlation Coefficient over previous algorithms, for 454-derived data.
This increased accuracy provided by HAXAT does not only result in improved homologue estimations, but also provides un-interrupted reading-frames, which greatly facilitate further analysis of protein space, for example phylogenetic analysis. The alignment tool is available at http://bioinfo.ifm.liu.se/454tools/haxat.
In the last decade, next-generation sequencing methods have emerged, become widely used and are today a prerequisite for most large-scale studies. One of the early next-generation techniques is Roche 454 sequencing , currently extensively used as it produces longer sequence reads compared to other platforms. The first 454 sequencing platform was introduced in 2005 with the GS20 instrument. The GS20 brought a huge leap in performance over present Sanger based techniques and produced around 20 million bases (Mb) per run . Since then, the 454 technique has repeatedly evolved and the GS FLX Titanium instrument, introduced in 2008, produces as much as 500 Mb per run in reads of approximately 350 bases per read . The long reads produced by 454 sequencing are particularly useful when scaffolding is difficult, where homology to known sequences is distant or absent and overlaps may be ambiguous or non-existent, e.g. de novo genome sequencing or metagenomics. The underlying methodology involved in sequence comparisons as well as in most subsequent data processing steps, is sequence alignment, which is an important fundament for the entire field of bioinformatics.
The global sequence alignment problem and a method for solving it was first proposed by Needleman and Wunsch in 1970 . The suggested algorithm utilised a technique called dynamic programming. Dynamic programming solves a problem by dividing the problem into solvable sub-problems. In 1981, Smith and Waterman defined the local alignment and proposed a slightly modified algorithm to solve it . Finally, one year later Gotoh proposed the usage of non-linear gap penalties and the corresponding modified algorithms  for solving the global and local alignment problems. While a single sequence alignment is quickly calculated, a full database search using Smith-Waterman-Gotoh would be unnecessarily time consuming as most comparisons would render non-significant alignments. To address this issue, heuristic methods such as FASTA [6, 7] and BLAST [8, 9], which greatly reduces the number of sub-problems solved, were developed. Later, methods using improved heuristics, and/or innovative implementations to more efficiently utilise computer memory and processors have been proposed, e.g. MegaBLAST , SSAHA2 , BLAT  and Smith-Waterman-Gotoh alignments utilising SIMD instructions .
The characteristics of 454 data are to some extent different from most other sequencing technologies. 454 sequencing is a pyrosequencing method where the nucleotide reagents for detecting thymine (T), adenine (A), cytosine (C) and guanine (G) are repeatedly cycled over each DNA template fragment, while elongating the complementary strand. The intensity of each flow of nucleotide reagents is recorded, as a so-called flowpeak and collected in a flowgram . As the homopolymer length is estimated from peak intensity, occasional uncertainties occur . These uncertainties are challenging for downstream analysis algorithms. Current analysis methods do not provide any modification to the alignment algorithm, but instead recommend using adjusted heuristics, e.g. a shorter indexing word length, to cope with this problem. While a majority of the homopolymer insertions/deletions (indels) are placed correctly by gaps, close to the end of the read, the alignments are often truncated and sequential undercall-overcall (and vice versa) are often mistaken for a single nucleotide polymorphism (SNP). The first method that addressed the problem of homopolymer uncertainties was FLAT , which performs probabilistic flowgram matching. The downside of using flowgram matching is the limited possibility to detect non-identical matches, apart from homopolymer dissimilarities. Another tool that addressed this problem was PanGEA , which uses a modified Smith-Waterman-Gotoh alignment algorithm, which allows gaps at the end of long homopolymer runs at a lowered cost. FAAST  was developed in order to combine the benefits of both methods. FAAST utilises the flowgram to dynamically set a position specific homopolymer gap penalty relative to the uncertainty at that position, regardless of homopolymer length.
Since nucleotide alignments fail to produce significant results for remote homologues, homopolymer indel recover methods based on nucleotide alignments are not applicable at remote homology. Due to this limitation of nucleotide alignments, many metagenomic studies [17–19] perform translated homology searches (against a protein database), using translated BLAST, e.g. tBLASTx or BLASTx . However, BLAST does not handle frame-shifts and at best presents two alignments in each frame, and these may be either significantly separated or overlapping, depending on the similarity in the shifted frame. These issues add complexity to metagenomic analysis, and potential homopolymer inaccuracies often have to be corrected manually to preserve an open reading frame (ORF). There are methods that do allow nucleotide-protein alignments with frame-shifts, for example the method proposed by States and Botstein . However, none of these methods take homopolymer errors into account, which makes them unsuitable for 454 data or other data where homopolymer reading errors occur, for example the Ion Torrent platform .
To address these problems, I here introduce a new tool named HAXAT (Homopolymer Aware Cross Alignment Tool). This new tool solves the protein-nucleotide alignment through a novel dynamic programming algorithm. The algorithm is based on the Smith-Waterman-Gotoh algorithm, and is capable of handling frame-shifting matches. It considers homopolymer inaccuracies through position-specific penalties, provided by a pre-calculated query profile. Through this novel algorithm, HAXAT produces more sensitive results with 454 data, even in the absence of flowpeak information (FASTA input).
The new tool HAXAT addresses the issues of frame-shifts in translated nucleotide – protein alignments while considering an underlying 454 data model. HAXAT utilises a novel alignment algorithm, with corresponding parameters, to perform these alignments. HAXAT was designed to be useful in the analysis of metagenomic data, but it can be used in various genome sequencing and annotation tasks, such as ORF annotation.
A representative benchmark set with different degrees of difficulty was constructed from a single coding seed sequence. The ORF1 of Torque Teno Sus Virus 1, GU570202, was chosen as seed sequence. A Torque Teno virus sequence represents a likely single read sequence, resulting from a metagenomics or genome project, as TT viruses are highly heterogeneous . The nucleotide sequence was first translated into its corresponding amino acid sequence. Six additional sequences, one for each degree of difficulty, were subsequently constructed through mutations in protein space down to 90, 80, 70, 60, 50 and 40% identity. Each mutation position was randomly chosen, while the substitution frequencies were proportional to the corresponding BLOSUM(X) substitution probabilities , where X denotes the clustering identity threshold used. In order to supply query sequences, 1,000 454 Titanium reads were simulated by the “454sim” simulator , using default settings. Finally, 4216 parameter combinations were evaluated for all seven degrees of difficulty, by alignment of the 1,000 reads to each of the seven target protein sequences. The complete study design and evaluation scripts are available in Additional file 1. HAXAT was executed using the BLOSUM62 substitution matrix and a gap open/ext cost of 8 and 2 (default settings).
The predicted homopolymer inaccuracies were extracted from the alignment and the prediction accuracy was evaluated. A correct prediction was defined as either a correctly predicted deletion (flowpeak overcall) or correctly placed insertion (flowpeak undercall) of a particular nucleotide. Consequently, an incorrect prediction is an incorrectly placed deletion/insertion or an insertion of an incorrect nucleotide at a particular position. Homopolymer indels not covered by the alignment were not regarded as a false negative prediction (results where these are counted as FN are available in Additional file 1, showing highly similar results). The parameters evaluated were; the cost for opening a single (−S parameter) or a double (−D) frame-shifting gap. The 454 model specific parameters evaluated were; the homopolymer frame-shifting gap penalty fraction (−h) which determines the minimum single and double frame-shifting penalties (Sh = S · h and Dh = D · h), the relative flowpeak deviation allowed (−k) and whether to use insertion validation or not (−V), see Methods. HAXAT can be executed in three different modes by applying two different alignment models. The first mode uses a neutral (non-454-aware) alignment model (activated by the -F0 parameter with the HAXAT tool). Consequently, for this model, only a single (−S) and double frame-shifting gap cost (−D) is applicable. This model is highly similar to the method previously proposed by States and Botstein , while it provides the possibility to set different costs to single and double gaps. The second mode uses a 454 aware alignment model without flowpeak information, i.e. running HAXAT with FASTA input (e.g. homopolymer aware alignment). Finally, the last mode applies the same 454 aware alignment model but uses flowpeak information through SFF input (e.g. flow-space aware alignment). The prediction accuracy was evaluated by the Matthews Correlation Coefficient (MCC)  for all three modes.
Parameter evaluation using a non-454 model
S = 10, D = 20, F = 0
S = 8, D = 14, F = 0
S = 8, D = 12, F = 0
S = 6, D = 10, F = 0
S = 6, D = 12, F = 0
S = 8, D = 15, F = 0
Parameter evaluation using a 454-model without flowpeak information
S = 30, D = 54, h = 0.2, k = 0.9, V = 1
S = 15, D = 26, h = 0.4, k = 0.6, V = 0
S = 15, D = 29, h = 0.4, k = 0.6, V = 0
S = 12, D = 24, h = 0.5, k = 0.6, V = 0
S = 8, D = 15, h = 0.7, k = 1, V = 0
Parameter evaluation using a 454-model using flowpeak information
S = 30, D = 60, h = 0.2, k = 0.4, V = 1
S = 30, D = 60, h = 0.2, k = 0.5, V = 1
S = 20, D = 40, h = 0.3, k = 0.4, V = 1
S = 30, D = 60, h = 0.2, k = 0.8, V = 1
S = 15, D = 30, h = 0.4, k = 0.5, V = 1
S = 8, D = 15, h = 0.7, k = 0.4, V = 0
In conclusion, due to the low number of homopolymer indels (positives) in comparison to non-mutated positions (negatives) even with only a few misplaced indels, each resulting in a consequent FP and FN, it is difficult to reach a high MCC-value. This is clear from Table 3 where, even at 100% protein identity and using flowpeak information, the best MCC-value was 0.862. When examined, many of the indels were placed just a few bases away or even at the same location but with the adjacent homopolymer corrected, for example “TTTtAA” instead of the correct “TTTaAA”. As seen in the Tables 1, 2 and 3, in all cases, the novel 454 model significantly outperformed the previously existing neutral (non-454-aware) alignment model, regardless of the parameter settings.
Sequences homologues to GU570202
The homopolymer aware cross alignment algorithm has been implemented in a tool called HAXAT (Homopolymer Aware X-Alignment Tool). HAXAT was implemented in C++ and it compiles using GNU GCC or Intel ICC on both Linux/Windows. The tool is open source under the GNU GPL licence and available as source code and pre-compiled binaries at http://www.bioinfo.ifm.liu.se/454tools/haxat. HAXAT is able to use a wide range of parameters for adjusting the alignment model. More information about parameters, usage and input/output formats can be found in the documentation at the webpage. The webpage also provides a web-version of HAXAT which can either align two sequences using HAXAT alone or search a query sequence against a database using BLASTx heuristics.
As shown above, HAXAT introduces a novel sensitive tool for performing frame-shifting nucleotide-protein alignments using dynamic programming. The increased sensitivity stems from retaining flowpeak information and applying a 454 model to the frame-shifting gap-penalties. HAXAT is intended as an alignment refinement tool in down-stream studies of sequences suspected to contain homopolymer indels. At this moment, HAXAT does not implement any heuristics and performs full dynamic programming calculations for each query-database combination. This greatly reduces the speed at which a full database scan can be performed using this tool alone. On the other hand, a full alignment ensures that the optimal local alignment, given the alignment model, is found. Furthermore, many tools implement excellent heuristics, for example alignment search tools like TFASTA , BLASTx , and these can be used to reduce the number of alignments computed by HAXAT. An example on how to achieve HAXAT results with the aid of BLASTx heuristics (BLAST + package), using four simple commands, is available at the webpage (http://www.bioinfo.ifm.liu.se/454tools/haxat). The webpage also provides the possibility to run HAXAT directly, both for alignment of two sequences and running a query against a database using BLAST heuristics. Even though HAXAT is developed to handle 454 data, the algorithm could also be applied to other pyrosequencing methods or any sequencing method or data where homopolymer inaccuracies occur, for example the Ion Torrent platform .
The HAXAT alignment algorithm is optimised to isolate the most costly calculations so that these can be performed beforehand and stored into a query profile. The calculations of each cell introduces four additional evaluations and references at a maximum i-1, j-5, see Methods, in comparison to the Smith-Waterman-Gotoh algorithm which references i-1, j-1. This implementation is computationally efficient, but it has some limitations. Most notably; no homopolymer correction tracking is employed and thus it is possible to correct a homopolymer position several times. For example, if the single frame-shifting gap penalty is smaller than half of the double gap penalty, two single gaps would be favoured in cases where the double gap is between two codons (thus interchangeable with two single gaps). Furthermore, the single gap penalty plus the double gap penalty should be larger than the penalty for introducing a gap in protein space, to ensure that protein gaps are not substituted with homopolymer corrections. The HAXAT tool solves these issues by not allowing such penalty combinations. While this limited model for correcting homopolymer indels cannot stably correct homopolymer lengths at low costs, the results show that this model still provides a significant step forward in performance compared to the previously available neutral (non 454-aware) model.
The HAXAT algorithm has been put to the test, both in alignments against simulated data where the identity is fairly uniform (does not vary much along the sequence) and in alignments with real homologues containing low identity regions and complicating gaps. The algorithm provides a substantial gain compared to the previously available neutral model. Additionally, the ability to place homopolymer indels correctly does not only increase accuracy but also the query coverage (see Additional file 1) in these alignments as fewer alignments are terminated by the frame-shift.
Sequencing methods based on sequencing by synthesis without a terminator, for example 454 sequencing, suffer from frequent homopolymer indels. As most previous sequencing techniques do not show this type of error, many bioinformatics algorithms are not well adapted to handle homopolymer indels. This paper presents a novel algorithm for solving the fundamental problem of sequencing alignment between a nucleotide sequence, potentially burdened by homopolymer reading errors, and a protein sequence. This algorithm has been implemented in an open source tool called HAXAT, available as source code, pre-compiled binaries and through a web-based tool. The results show that HAXAT provides a significant improvement in alignment accuracy for this type of data. The improved accuracy stems from the homopolymer aware algorithm which makes use of raw flowpeak values to further improve homopolymer length predictions. HAXAT provides an important step forward in solving the ‘homopolymer problem’ which faces bioinformaticians when working with 454 data. HAXAT will be useful in metagenomic and genomic analysis, not only for 454 sequencing data, but also for emerging methods, such as the Ion Torrent platform. The HAXAT tool is open source under the GNU GPL licence and is publically available as source code, pre-compiled binaries and as a web-based tool at http://www.bioinfo.ifm.liu.se/454tools/haxat.
The novel alignment algorithm is a Smith-Waterman-Gotoh inspired dynamic programming algorithm and aligns a nucleotide sequence (horizontal) with a protein sequence (vertical). The dynamic programming table is defined with a cell for each protein and nucleotide sequence position, thus examining all three forward frames. As a consequence, the algorithm is modified for the match and protein gaps so that each step in nucleotide sequence space is in jumps of 3 instead of 1. Additionally, the nucleotide sequence can be gapped in a frame-breaking fashion through matching 1, 2, 4 and 5 nucleotides to one amino acid. This corresponds to; insertion of 2 nucleotides, 1 nucleotide, deletion of 1 nucleotide and 2 nucleotides, respectively.
The variable S i,j represents the optimal cell score in position i,j. The protein sequence is represented by p and thus q is the nucleotide query sequence. p i is an amino acid in p at position i and q j is the nucleotide j in q (q j-n.j denotes a range of n + 1 nucleotides that ends in j). The function, m 3 (p i , q j-2.j ), determines the score of aligning the amino acid p i with the q j-2.j nucleotide triplet. In general, the score of aligning n nucleotides with the amino acid p i determined by the m n (p i , q j-n+1.j ), where n = 3 is the trivial case of aligning a complete codon to an amino acid. Finally, E i,j and H i,j describe the optimal cell score that ends with a gap in the sequence p and q, respectively. The variables are calculated through; E i, j = max(S i-1, j - G 0 , E i-1, j - G e ) and H i, j = max(S i, j-3 - G 0 , H i, j-3 - G e ), where G 0 is the minimum gap penalty and G e is the gap extension penalty.
The frame-breaking match function defined for matching 1, 2 and 4, 5 nucleotides to an amino acid can be considered as inserting 2 and 1 as well as deleting 1 and 2 nucleotides in order to get a nucleotide triplet. The resulting match score is thus a composite score of the gap penalty and the match score of the resulting nucleotide triplet against the amino acid.
The variable T k represents one of the k possible nucleotide triplets, see Figure 6, and P k the insertion/deletion penalty. Finally, the m 3 (p i , T k ) function is simply the substitution score given by any substitution matrix where the T k triplet is translated to the corresponding amino acid.
The position specific frame-shift penalty is divided into a single and a double gap penalty. The single insertion/deletion match path employ the single gap penalty while the double insertion/deletion match path apply the double gap penalty for the grouped deletes/inserts and the double single gap penalty (2 × Single gap penalty) for a non-grouped deletes/inserts, see Figure 6 for delete/insert possibilities.
As 454 data are generated from an underlying peak interpretation of sequential flows, some inserts are not likely to occur. For example, given the default 454 flow-order, “TACG”, and the flows; 3.05(T) and 2.95(C), thus a “TTTCCC”, an adenine (A) insert between the T- and C-peak are more likely compared to a guanine (G) as no G has been flowed between the peaks and the T- and C-peaks are strong evidence of the sequentiality of the T- and C-mer, with reservation for a false negative A-mer in between. This rule is applied in the alignment algorithm to avoid these types of unlikely inserts (the -V parameter). Furthermore, if 454 data is used the A-mer peak in between is examined to determine the insertion penalty. For example, if the flowpeak value is 0.49(A) the factor, f, as given by Equation 3 is 1–0.01/k, and the resulting insertion penalty associated with an A-insert is low.
As the m n function is not dependent on the protein sequence, a query profile where the value of the m n function is pre-calculated can be constructed. Given such a query profile, the calculation cost of the cell-score is not much different from the regular Smith-Waterman-Gotoh algorithm cost, where just the values of four extra states are tested at each cell, see Equation 1. The algorithm is implemented, using C++, in a tool called HAXAT. HAXAT is available as open source and takes a wide range of parameters (see manual). HAXAT implements no heuristics and performs a complete traced alignment for all database/query combinations. HAXAT is available at http://www.bioinfo.ifm.liu.se/454tools/haxat.
I would like to thank Bengt Person, Björn Wallner and Björn Andersson for critical reading of the manuscript. I gratefully acknowledge financial support from the Swedish Research Council, the Research School of Medical Bioinformatics supported by the Knowledge Foundation Sweden, Karolinska Institutet and Linköping University.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.