Automated generation of heuristics for biological sequence comparison
© Slater and Birney; licensee BioMed Central Ltd. 2005
Received: 15 September 2004
Accepted: 15 February 2005
Published: 15 February 2005
Exhaustive methods of sequence alignment are accurate but slow, whereas heuristic approaches run quickly, but their complexity makes them more difficult to implement. We introduce bounded sparse dynamic programming (BSDP) to allow rapid approximation to exhaustive alignment. This is used within a framework whereby the alignment algorithms are described in terms of their underlying model, to allow automated development of efficient heuristic implementations which may be applied to a general set of sequence comparison problems.
The speed and accuracy of this approach compares favourably with existing methods. Examples of its use in the context of genome annotation are given.
This system allows rapid implementation of heuristics approximating to many complex alignment models, and has been incorporated into the freely available sequence alignment program, exonerate.
George Box  once remarked that,
"All models are wrong, but some are useful."
This statement bears much relevance to biological sequence alignment, where there is no guarantee that the alignment model will accurately represent the evolutionary (or other) processes which separated the two sequences. All that is certain is that exhaustive dynamic programming (DP) algorithms (such as Smith-Waterman ) will yield the optimally scoring path in terms of the given model. Heuristics for sequence comparison (such as BLAST ) generate alignments which are valid paths through the underlying alignment model, but are not guaranteed to be optimal. Alignments generated by these heuristics can be calculated much more rapidly, and often closely match alignments which would be generated by exhaustive methods. Furthermore, many problems in the context of genome analysis consist simply of alignment of gene products (cDNA or protein) back onto the gene from which they were produced, and consequently do not require very sensitive alignment. Hence the aim here is not to attempt to develop models which are correct, but to facilitate development of models which are more useful in the context of large scale analyses.
Transformation between a finite state automata describing an alignment model and the recurrence relations used in DP is a powerful technique [4, 5] as it allows modification of the alignment algorithm by direct manipulation of the alignment model. The Dynamite compiler  allows automated implementation of alignment algorithms from a description of the alignment model. This allows development of complex models which can be used to generate accurate alignments. However, calculation of alignments using these models always requires quadratic time, which is prohibitively slow for many large scale applications. This method can be applied to any alignment problem which can be represented by a regular grammar (or the equivalent finite state machine). This includes the simple three state model required for affine gaps in the Smith-Waterman-Gotoh algorithm [2, 7], but also more complex models such as that used by EST_GENOME , where splice site prediction is integrated into the DP, allowing alignments to include introns. Other alignment models which may be expressed by a regular grammar include those allowing non-equivalenced regions such as the ABC model  for improved modelling of divergent loops regions in protein alignments, and DNA Block Aligner  which finds co-linear conserved blocks in the alignment of genomic sequences. This framework also allows models such as that used by PairWise  where the sequences are translated during alignment allowing for frameshifts, and GeneWise  which integrates translated alignments with modelling of introns for alignment of proteins against genomic DNA.
The availability of vast amounts of sequence data has generated a need for faster alignment algorithms . Many of these approaches use fast algorithms to identify closely matching words which seed un-gapped alignments that are subsequently joined to form the final gapped alignment. For example, BLAST , FASTA , and sim4  all operate by first finding matching words before building alignments. Such word finding can be done by a multitude of techniques, such as finite state machines used in BLAST , table-lookup used in SSAHA , or by suffix arrays as used in QUASAR . Novel methods for word-based seeding of these alignments are not presented here, but rather a general system of joining these seeds to produce gapped alignments.
The alignment program FASTA , generates alignments from sets of seeds by performing DP confined to a diagonal band surrounding the initial matches . In contrast, when building alignments, Gapped BLAST  permits gapped extension which allows the DP to be applied to an arbitrary high-scoring region surrounding the HSP seeds. Such heuristics allow very fast sequence alignment, but it is difficult to apply to more complex models. Furthermore, features such as introns cannot easily be incorporated into the resulting alignments without necessitating a large amount of DP to ensure that both very short exons and large introns can be included in alignments.
The aim here is to combine the strengths of Gapped BLAST and Dynamite, in a system which allows automated generation of heuristics in terms of the underlying model. This allows the development of heuristics for complex models such as those used in GeneWise.
Firstly, we describe bounded sparse dynamic programming (BSDP), a novel heuristic for sequence alignment, then we describe the system for automated implementation of the complex alignment algorithms required. We present proof of principle results, but this paper is primarily focussed on developing a framework for the general case.
The basic strategy of seeding alignments used here is the same as for BLAST, in that alignments seeds are generated, and then extended to form High-scoring Segment Pairs (HSPs), which are then joined together to form the alignments. The alignments are seeded using an Aho and Corasick  type finite state machine (FSM) built using the word neighbourhood of the query sequence. This generates the seeds which are extended to form the HSPs. For large scale analyses, the FSM is multiplexed using word neighbourhoods from multiple sequences. This allows analysis of multiple queries during a single pass of a genomic database, in a manner similar to that used in MPBLAST .
However the methods for seeding HSPs are independent from those used for building the alignments, and this paper only deals with algorithms involved in the generation of gapped alignments from sets of HSPs, and not in the calculation of the HSPs themselves.
The following strategies are employed to enable alignments to be built efficiently from sets of HSPs:
To connect the underlying alignment model to the heuristics, a portal describes a set of states in the model which correspond to a set of HSPs, a span refers to a looping state for large alignment features such as introns, and a SAR (Sub-Alignment Region) describes a rectangular region on an HSP to which DP is applied.
To avoid DP in every SAR, upper bounds are generated for the best alignment score for each SAR, and BSDP (Bounded Sparse Dynamic Programming) exploits these bounds to yield alignments in an efficient manner.
To perform various types of DP in these SARs, the required models are generated automati cally, including C code to produce an efficient viterbi implementation for each model.
Bounded sparse dynamic programming
Dynamic programming (for any alignment model which can be represented by a regular grammar) requires quadratic time, and hence is the most computationally expensive part of building an alignment. For pairs of sequences more than a few kilobases long, DP becomes prohibitively slow. The approach used here is similar to sparse dynamic programming , and the fragment chaining approaches used in the program sim2 , in that DP is applied to rectangular regions surrounding alignment seeds. However, there are two major differences in our approach. Firstly, DP is only applied to two small discrete regions on each HSP, as it is assumed that most of the HSP itself should appear in the alignment. These sub-alignments improve the quality of the overall alignments, and they allow complex alignment models, and large gaps such as introns to be integrated into a sparse DP framework. Secondly, as it would take too long to apply DP to every sub-alignment region (SAR), upper bounds are calculated for the DP scores for each of these SARs. This allows the sub-alignment DP to be avoided in cases where it joins HSPs which cannot feature in the final alignment, so that alignments can still be generated very rapidly even when large numbers of HSPs and SARs are involved.
Regions for the sub-alignments are selected within the area between the centre points of the two HSPs to be joined, or in the case of terminal HSPs, between the HSP and the ends of the sequences. In addition, the positions of the SARs must be constrained to limit their size, and so that the HSPs correctly intersect with the corners of the SAR. In the case of overlapping HSPs, where there is a choice of positions for placement of the SAR, the position is chosen such that the highest scoring parts of the HSPs are outside the SARs.
The BSDP is mediated through a set of priority queues, one of which is associated with each HSP, and will contain an entry for each partial alignment that ends at the HSP. The key for these priority queues is the highest score for any partial alignment ending in that HSP. Additionally, there is one global priority queue containing the highest scores from each of the other priority queues. The upper bound scores are confirmed by DP in the SARs in the current highest scoring putative alignment path. The highest scoring path will change as the scores are updated during this process. DP is applied to SARs in this way until the highest scoring path does not contain any bound scores, and then the alignment may be extracted. Upper bounds dictate that there can be no better alignment using these HSPs.
This algorithm is similar to A* search, (but differs in that many different points may be the start or end of the search), and retains the admissibility property of A* search, such that the result of the BSDP computation is guaranteed to be the same as if DP had been performed on every candidate SAR prior to calculation of the alignment. This is because no alignment can be extracted until all the alternative sub-alignments (which have upper bounds that indicate they could contribute to a higher-score) have been eliminated.
The BSDP alignment process can be iterated to generate sub-optimal alignments similar to those generated by the Waterman-Eggert algorithm , with only minimal recalculation of the partial alignments in the SARs. Each HSP may only appear in a single alignment, but further constraints are required to prevent overlapping alignments arising from overlapping SARs. The likelihood of this is occurring is greatly increased during translated alignments when HSPs in different reading frames may overlap each other.
After the first alignment has been reported, the scores for any SARs which have already been confirmed by DP, but which are not yet included in a reported alignment, are then considered as an upper bound. SARs are disallowed before recalculation when the region between the centre of the HSP and its SARs overlaps with a previously reported HSP, in which case, the SARs are disallowed. Otherwise the DP is recomputed for SARs which contain part of an alignment which has been reported since the DP for the SAR was last calculated.
Automated model generation
As illustrated in the previous section, BSDP becomes quite complex and requires a large number of DP algorithms for computation of the alignment through the SARs. We have build a system to facilitate implementation of these models and the integration of the sub-alignments which they produce. To allow generalisation of the BSDP, everything must be defined in terms of the underlying alignment models. The alignment models are described as finite state machines, consisting of states and transitions, similar to those used in Dynamite .
Briefly, to convert these models into DP implementations, each state must correspond to a score in each cell of the DP matrix, and the scores for each cell are calculated by taking the maximum of the score from transitions arriving at each cell. In addition, a topological sort is required to satisfy dependency ordering for silent states.
However, in addition to automated generation of code from alignment models, the generation of the models required for DP in the SARs is also automated, as described below.
Building simple models
Examples of hierarchical model building in C4. Instead of building each model from scratch, most models are created in a modular fashion by making adaptions and additions to other models. This facilitates the building of complex models.
ungapped with gap penalties
NW + local alignment scope
SW + affine gap costs
SWG + intron models
SWG + translation + frameshifts
p2d + intron model
Building the heuristic model
To enable DP in the SARs for the BSDP, a heuristic model is generated from the original alignment model. This model is not used directly for calculation of alignments, but a derived model is generated corresponding to each transition in the heuristic model. Each of these derived models correspond to a type of SAR used in the BSDP.
The model is first annotated, labelling certain states as either portal states or span states. A portal defines a group of states which can share a set of HSPs (High-scoring Segment Pairs); these are the match states. A span is a state which has sequence independent looping transitions (e.g. states for introns, or non-equivalenced regions).
Building derived models
For some cases, such as between the match forward and match reverse states, shown in Figure 4, there is no possible path, and no corresponding transition in the heuristic model, in which case, a derived model is not produced.
Ten different derived models are generated from the model shown for cDNA to genomic alignment in Figure 4, because there are two portal states and two span states in this model, and therefore, a derived model is generated for each of the five cases in Figure 2 for both the forward and reversed genes.
Building models for calculation of upper bounds
For each derived model, an additional model is created which is used for the pre-calculation of upper bounds for all possible sizes of SARs. For each transition in the model which has a position dependent score associated with it, the upper bound is also supplied. For example, in a match transition, the upper bound is the maximum value from the substitution matrix. A special model is created using these transition upper bounds, instead of the normal sequence-dependent transition weights. As this removes the sequence-dependent components of the algorithm, it allows pre-calculation of the upper bound for alignment score of any sequences up to the maximum permitted size of SARs. These bounds are then stored in a table for retrieval during the BSDP.
Results of an evaluation comparing exonerate (using the methods presented here), with sim4 and est2genome. A subset of the ensembl database was used, containing 1,827 genes, 9,917 exons and 2,387,971 bases. Times are given for alignment of all cDNAs to all contigs. The CPU time for est2genome was estimated from a smaller number of alignments. The accuracies are comparable, but the times are much faster for the heuristic methods.
CPU Time (seconds)
≅ 7.2 years (226,637,523)
≅ 60 hours (216,956)
≅ 46 minutes (2,577)
This system is used for alignment of ESTs within the Ensembl gene-building pipeline , and a prototype implementation of this system has been used in comparison mouse and human sequences . In this analysis, 13 million raw shotgun mouse reads were aligned as part of the comparative analysis of chromosome 20.
Another example is the alignment of the Collagen alpha 1(IX) precursor to a region of chromosome 6 containing the corresponding gene. This is a large gene, containing 38 exons over about a 90 Kb region of the genome. Using GeneWise, the alignnment required 4260 seconds (about 1 hour, 11 minutes) using exonerate to perform full DP alignment required 2700 seconds (about 45 minutes), and using exonerate with the heuristic BSDP approach required only 7 seconds, with all three methods generating identical alignments.
This approach represents an advance from previous methods for automatic implementation of sequence alignment algorithms [4–6], in that it is not just the generation of code from the models which is automated, but also the generation of many of the models themselves.
This has allowed development of heuristics with sub-quadratic running times. This makes their application practical to a much larger set of problems, while retaining the much of the simplicity of their implementation.
We recognise that this approach is limited to the subset of sequence comparison problems which can be represented by a regular grammar. For example, it is not possible to use this system to model stochastic context free grammars as used in some types of RNA secondary structure such as pseudo-knots . This approach is also unsuitable for the types of DP algorithms used for determining optimal segmentation in gene finding algorithms , however for these types of problems running time of the algorithm is not an issue.
This framework has already been used for many different alignment models ranging from simple models such as Smith-Waterman-Gotoh, to more complex models such as those for protein to genome alignment as used by GeneWise . It is also well suited to problems requiring comparison of very long sequences, and we are currently extending this system to accommodate syntenic pairwise comparisons of genomic sequences of the type tackled by MUMmer .
Although this method is useful in many cases, for some types of distantly related sequences, the method can breakdown. For example, when long regions of the correct alignment contain many gaps with intervening sequences too short to be an alignment seed, the alignment cannot be extended beyond the boundary of the SAR, and the correct alignment may be missed. These cases can be avoided by increasing the size of the SARs, but this results in higher bound scores, so DP become necessary in more of the SARs. Such gap rich alignment are the type where the gapped HSP extension used by Gapped BLAST excels, however such gapped extension necessitates DP surrounding all HSPs, which becomes time consuming when allowing both short exons and long introns during alignments of cDNAs or proteins to genomic sequences.
As this system separates the development of the underlying alignment model from the heuristics which are built on top of them, we expect that this framework will prove useful for evaluation of the quality of the heuristics, as comparison between the alignments from the two techniques can be automated, and training may be performed to select optimal parameter sets.
The implementation of the system described here is called C4 (to reflect the aim of producing something more powerful than Dynamite ), and is implemented in C programming language. It is available as part of the exonerate sequence alignment package available from http://www.ebi.ac.uk/~guy/exonerate/, and in the exonerate module of the Ensembl CVS repository from http://www.ensembl.org/. It is built using the glib portabilility library available from http://www.gtk.org/ Both exonerate and glib are available under the GNU lesser general public license. The code has been tested extensively on Linux and OSF1/alpha but has been written to be portable to many UNIX systems.
Code generation for the DP is performed to exploit compile time optimisations such as loop un- rolling and and efficient handling of edge conditions in the viterbi matrix, which is particularly beneficial for the small DP calculations required by SARs. This generated code is typically about five times faster than using a generic viterbi implementation, but produces about a million lines of code for the current set of models used by exonerate, and hence compilation takes longer.
GSS would like to thank Ian Holmes and Roger Sayle for invaluable discussions (some several years ago), which have been a big influence on this work.
Also, we would like to thank all members of the Ensembl group, in particular Steve Searle contributed some code to an early version of exonerate.
- Box GE: Robustness in the Strategy of Scientific Model Building. In Robustness in Statistics. Edited by: Launer R, Wilkinson G. Academic Press New York; 1979.Google Scholar
- Smith T, Waterman M: Identification of Common Molecular Subsequences. Journal of Molecular Biology 1981, 147: 195–197.View ArticlePubMedGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic Local Alignment Search Tool. Journal of Molecular Biology 1990, 215: 403–410.View ArticlePubMedGoogle Scholar
- Searls DB, Murphy KP: Automata-Theoretic Models of Mutation and Alignment. In Proceedings of the Third International Conference On Intelligent Systems for Molecular Biology. The AAAI Press; 1995:341–349.Google Scholar
- Searls DB: Sequence alignment through pictures. Trends in Genetics 1996, 12: 35–37.View ArticlePubMedGoogle Scholar
- Birney E, Durbin R: Dynamite: A flexible code generating language for dynamic programming methods used in sequence comparison. In Proceedings of the Fifth International Conference On Intelligent Systems for Molecular Biology. The AAAI Press; 1997:56–64.Google Scholar
- Gotoh O: An Improved Algorithm for Matching Biological Sequences. Journal of Molecular Biology 1982, 162: 705–708.View ArticlePubMedGoogle Scholar
- Mott R: EST_GENOME: a program to align spliced DNA sequences to unspliced genomic DNA. CABIOS 1997, 13(4):477–478.PubMedGoogle Scholar
- Altschul SF: Generalized Affine Gap Costs for Protein Sequence Alignment. PROTEINS: Structure, Function and Genetics 1998, 32: 88–96.View ArticleGoogle Scholar
- Jareborg N, Birney E, Durbin R: Comparative Analysis of Noncoding Regions of 77 Orthologous Mouse and Human Gene Pairs. Genome Research 1999, 9: 815–824.PubMed CentralView ArticlePubMedGoogle Scholar
- Birney E, Thompson JD, Gibson TJ: PairWise and SearchWise: finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames. Nucleic Acids Research 1996, 24(14):2730–2739.PubMed CentralView ArticlePubMedGoogle Scholar
- Birney E, Clamp M, Durbin R: GeneWise and Genomewise. Genome Research 2004, 14(5):988–995.PubMed CentralView ArticlePubMedGoogle Scholar
- Smith RF: Perspectives: Sequence Data Base Searching in the Era of Large-scale Genomic Sequencing. Genome Research 1996, 6: 653–660.View ArticlePubMedGoogle Scholar
- Lipman DJ, Pearson WR: Rapid and Sensitive Protein Similarity Searches. Science 1985, 277: 1435–1441.View ArticleGoogle Scholar
- Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W: A Computer Program for Aligning a cDNA Sequence with a Genomic DNA Sequence. Genome Research 1998, 8: 967–974.PubMed CentralPubMedGoogle Scholar
- Ning Z, Cox AJ, Mullikin JC: SSAHA: A Fast Search Method for Large DNA Databases. Genome Research 2001, 11(10):1725–1729.PubMed CentralView ArticlePubMedGoogle Scholar
- Burkhardt S, Crauser A, Ferragina P, Lenhof HP, Rivals E, Vingron M: q -gram Based Database Searching Using a Suffix Array (QUASAR). In Proceedings of the Third Annual International Conference on Computational Molecular Biology. The ACM Press; 1999:77–83.Google Scholar
- Chao KM, Pearson WR, Miller W: Aligning two sequences within a specified diagonal band. CABIOS 1992, 8(5):481–487.PubMedGoogle Scholar
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 1997, 25(17):3389–3402.PubMed CentralView ArticlePubMedGoogle Scholar
- Aho AV, Corasick MJ: Efficient String Matching: An Aid to Bibiographic Search. Communications of the ACM 1975, 18(6):333–340.View ArticleGoogle Scholar
- Korf I, Gish W: MPBLAST: improved BLAST performance with multiplexed queries. Bioinformatics 2000, 16(11):1052–1053.View ArticlePubMedGoogle Scholar
- Eppstein D, Galil Z, Giancarlo R, Italiano GF: Sparse Dynamic Programming. In Proc 1st Symp Discrete Algorithms. ACM and SIAM; 1990:513–522.Google Scholar
- Chao KM, Zhang J, Ostell J, Miller W: A local alignment tool for very long DNA sequences. CABIOS 1995, 11(2):147–153.PubMedGoogle Scholar
- Waterman MS, Eggert M: A New Algorithm for Best Subsequence Alignments with Application to tRNA-rRNA Comparisons. Journal of Molecular Biology 1987, 197: 723–728.View ArticlePubMedGoogle Scholar
- Curwen V, Eyras E, Andrews TD, Clarke L, Mongin E, Searle SM, Clamp M: The Ensembl Automatic Gene Annotation System. Genome Research 2004, 14: 942–950.PubMed CentralView ArticlePubMedGoogle Scholar
- Deloukas P, Matthews LH, Ashurst J, Burton J, Gilbert JGR, Jones M, Stavrides G, Almeida JP, Babbage AK, Bagguley CL, Bailey J, Barlow KF, Bates KN, Beard LM, Beare DM, Beasley OP, Bird CP, Blakey SE, Bridgeman AM, Brown AJ, Buck D, Burrill W, Butler AP, Carder C, Carter NP, Chapman JC, Clamp M, Clark G, Clark LN, Clark SY, Clee CM, Clegg S, Cobley VE, Collier RE, Connor R, Corby NR, Coulson A, Coville GJ, Dead-man R, Dhami P, Dunn M, Ellington AG, Frankland JA, Fraser A, French L, Garner P, Grafham DV, Griffiths C, Griffiths MND, Gwilliam R, Hall RE, Hammond S, Harley JL, Heath PD, Ho S, Holden JL, Howden PJ, Huckle E, Hunt AR, Hunt SE, Jekosch K, Johnson CM, Johnson D, Kay MP, Kimberley AM, King A, Knights A, Laird GK, Lawlor S, Lehvaslaiho MH, Leversha M, Lloyd C, Lloyd DM, Lovell JD, Marsh VL, Martin SL, McConnachie LJ, McLay K, McMurray AA, Milne S, Mistry D, Moore MJF, Mullikin JC, Nickerson T, Oliver K, Parker A, Patel R, Pearce TAV, Peck AI, Phillimore BJCT, Prathalingam SR, Plumb RW, Ramsay H, Rice CM, Ross MT, Scott CE, Sehra HK, Shownkeen R, Sims S, Skuce CD, Smith ML, Soderlund C, Steward CA, Sulston JE, Swann M, Sycamore N, Taylor R, Tee L, Thomas DW, Thorpe A, Tracey A, Tromans AC, Vaudin M, Wall M, Wallis JM, Whitehead SL, Whittaker P, Willey DL, Williams L, Williams SA, Wilming L, Wray PW, Hubbard T, Durbin RM, Bentley DR, Beck S, Rogers J: The DNA sequence and comparative analysis of human chromosome 20. Nature 2001, 414: 865–871.View ArticlePubMedGoogle Scholar
- Rivas E, Eddy SR: The language of RNA: a formal grammar that includes pseudoknots. Bioinformatics 2000, 16(4):334–340.View ArticlePubMedGoogle Scholar
- Snyder EE, Stormo GD: Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks. Nucleic Acids Research 1993, 21(3):607–613.PubMed CentralView ArticlePubMedGoogle Scholar
- Delcher AL, Kasif S, Fleischmann RD, Peterson J, White O, Salzberg SL: Alignment of whole genomes. Nucleic Acids Research 1999, 27(11):2369–2376.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.