STELLAR: fast and exact local alignments
© Kehr et al; licensee BioMed Central Ltd. 2011
Published: 5 October 2011
Large-scale comparison of genomic sequences requires reliable tools for the search of local alignments. Practical local aligners are in general fast, but heuristic, and hence sometimes miss significant matches.
We present here the local pairwise aligner STELLAR that has full sensitivity for ε-alignments, i.e. guarantees to report all local alignments of a given minimal length and maximal error rate. The aligner is composed of two steps, filtering and verification. We apply the SWIFT algorithm for lossless filtering, and have developed a new verification strategy that we prove to be exact. Our results on simulated and real genomic data confirm and quantify the conjecture that heuristic tools like BLAST or BLAT miss a large percentage of significant local alignments.
STELLAR is very practical and fast on very long sequences which makes it a suitable new tool for finding local alignments between genomic sequences under the edit distance model. Binaries are freely available for Linux, Windows, and Mac OS X at http://www.seqan.de/projects/stellar. The source code is freely distributed with the SeqAn C++ library version 1.3 and later at http://www.seqan.de.
Computing good local alignments is a fundamental problem in bioinformatics. By looking for local alignments of biological sequences one aims for example at identifying homologous regions, i.e. regions that are assumed to originate from the same ancestral sequence, or at finding functionally similar sequences. The problem has been studied for more than 30 years [1, 2], but still remains interesting. In the beginning, local alignments were used to look for homologous regions in relatively short protein or nucleic acid sequences. Also, for a long time, local alignments have been used to identify conserved, functionally related elements. More recently, local alignments were applied on a genomic scale as prerequisite to global genomic alignments. For several reasons genomic scale alignments are usually not collinear and hence one has to resort to computing local similarities. Now the aim is not anymore to identify some homologous regions but rather to display all similarities between two or more genomic sequences [3–7]. This requires not only efficiency in computation to process very long sequences, but also accuracy regarding sensitivity, i.e. exact tools that do not miss regions of significant local similarity.
For the computation of local alignments numerous tools have been developed: Early tools such as SSEARCH  and FASTA  are sensitive but too slow for large-scale sequence comparison. Then, there are efficient heuristics, with the BLAST family [17–19] being the most prominent example. Further developments for specific large-scale analyzes resulted in tools like BLAT  which was designed for high speed, and BLASTZ  which was designed for higher sensitivity. The more recently published tool BWT-SW  again focuses on sensitivity and is able to report all local alignments.
To assess homology of biological sequences by local alignment, generally some kind of similarity criterion is necessary. The most widely accepted criterion is the E-value[18, 23], a probabilistic measure that assesses the significance of a local alignment. The E-value denotes the expected number of local alignments with a minimal score occurring by chance in the input sequences. The score underlying the E-value is a Smith-Waterman-like score, most commonly using affine gap costs. All popular alignment tools that report E-values, e.g. BLAST, compute such a score, and afterward apply an E-value threshold on their output. In this paper, we describe a method that follows an alternative approach to compute only significant local alignments of nucleic acid sequences: We compute only high-scoring alignments, which are guaranteed to have a good (low) E-value. We use a maximal error rate for local alignments (normalized edit distance) as a score threshold, and additionally require a minimal alignment length. Our method is specialized on relatively low error rates, which in turn justifies the use of edit distance instead of affine gap costs. For a given error rate and a minimal alignment length our method is exact, i.e. it identifies all local alignments without loss of sensitivity. We compute an E-value for all generated local alignments and a minimal E-value for the input parameters. The method is implemented in the program STELLAR (SwifT Exact LocaL AligneR) using the SeqAn C++ library [24, 25]. The program depends only on very few and clearly understandable parameters. We prove that our algorithm is exact for all reasonable parameter settings and confirm this experimentally. We compare STELLAR against popular local alignment programs, namely BLAST, LASTZ, BLAT, and BWT-SW in terms of speed and sensitivity and show that some of the tools miss many significant local alignments that can be detected with STELLAR.
Definitions and overview of algorithm
The goal of this work is to find ε-matches of two sequences without ε-X-drops. Often, ε-matches overlap to a large extent (see Fig. 1 for an example), or segments of ε-matches are themselves ε-matches (e.g. we obtain an ε-match by removing a column from one end of the ε-matches in Fig. 1). Thus, the number of ε-matches can be very large with very redundant similarity information. We handle this issue by only identifying the longest ε-match of one location: If two ε-matches overlap, we output only the longer one unless the overhanging part of the shorter ε-match still has a minimal length of n0. In that case we output both complete ε-matches. Thus, for the example in Fig. 1 we only output one ε-match. We say that such an ε-match is maximal.
We now present an algorithm to compute exactly all maximal ε-matches without ε-X-drop. Our algorithm runs in two phases: filtering and verification. The filtering phase implements the SWIFT algorithm , a very efficient full-sensitivity filter for ε-matches. Note that SWIFT is a filter algorithm and does not output a list of ε-matches. While it guarantees not to miss any maximal ε-match and hugely reduces the search space, a verification phase is necessary to identify false positive hits of the filter. Furthermore, verification is needed to determine the exact start and end positions of maximal ε-matches. We have developed a verification strategy that runs in five steps: ε-core identification, ε-X-drop core filter, ε-X-drop extension, identification of maximal ε-matches, and filtering of overlapping matches. Verification may stop after any of these steps if it is clear that we will not identify a new ε-match without ε-X-drop. The strategy guarantees to find all maximal ε-matches without ε-X-drop.
A similar two-step algorithm that consists of SWIFT filtering and subsequent verification is implemented in the read mapper RazerS . The difference to RazerS is, however, that we are looking for ε-matches that are local in both sequences whereas read mappers compute semi-global alignments, i.e align the full read sequence to a reference. RazerS uses a slightly modified SWIFT filtering, and the verification is much simpler since the length of the final alignments is preset by the read length.
The SWIFT algorithm proposed by Rasmussen et al. is an efficient q-gram based filter to detect potential ε-match regions between two sequences. It is based on the q-gram lemma [29, 30]. This lemma states that every alignment of length n with k error columns contains at least T(n, k, q) := n + 1 – q(k + 1) q-hits, substrings of q consecutive match columns. Considering the dotplot of two sequences, every q-hit corresponds to a diagonal stretch of matches with length q. Obviously, all q-hits of an alignment with k errors can cover at most k + 1 different diagonals in the dotplot.
Rasmussen et al. proved that for any given ε and n0 there exist w, q, e and τ such that every ε-match contains τ q-hits that reside in a w × e parallelogram. A w × e parallelogram is the intersection of e + 1 consecutive diagonals and w + 1 consecutive columns in the dotplot.
To detect w × e parallelograms with τ q-hits in the dotplot, the SWIFT algorithm slides from left to right over one sequence and searches overlapping q-grams in a q-gram index of the other sequence. Found q-hits are counted in bins of Δ + e consecutive diagonals whose first diagonal is a multiple of Δ. As adjacent bins share e diagonals, every w × e parallelogram is contained in one bin. Every bin contains a q-hit counter and represents the parallelogram with columns bounded by the leftmost and rightmost contained q-hit. If a q-hit is found that is at most w – q columns apart from the rightmost q-hit, the parallelogram is extended. Otherwise it is closed and a new one starting at the found q-hit is opened as the two hits cannot be part of the same w × e parallelogram. A closed parallelogram whose bin counter has reached τ is output as a SWIFT hit and verified as described in the following section.
Fig. 4 demonstrates that the output of the filtering phase is not yet a list of ε-matches, although the SWIFT algorithm hugely reduces the search space. SWIFT hits may contain one or more ε-matches, but may as well be false positive and contain no ε-match. Some SWIFT hits may overlap and contain the same or parts of the same ε-match. Further, they may be much longer than a contained ε-match, or they may cover only parts of an ε-match. Therefore, we have developed the following verification strategy.
We start verifying SWIFT hits by identifying a segment of an ε-match that overlaps with the SWIFT hit. We call such a segment an ε-core. We guarantee not to miss any ε-match by identifying all ε-cores contained in a SWIFT hit. ε-cores will then serve as starting points for extension, possibly beyond the ends of SWIFT hits. Finally, we cut the longest ε-matches from extended ε-cores and remove overlapping ε-matches.
Definition and existence of ε-cores
In the following two lemmata we prove the correctness of our approach that starts verification from ε-cores.
Lemma 1. Every ε-mαtch contains at least one ε-core.
Lemma 1 settles the existence of an ε-core for every ε-match. Unfortunately, the SWIFT filter only guarantees to report SWIFT hits that overlap a part of each ε-match. Hence, in principle, the SWIFT hit could not contain an ε-core, which in turn could make our algorithm miss the ε-match. In the next lemma we will show that for a certain value of the parameter q this is never the case.
Lemma 2. The intersection of a SWIFT hit with an ε-match contains at least one ε-core if q := s min .
Proof. By definition, the intersection of a SWIFT hit with an ε-match contains at least τ q-grams interspersed by at most e errors. Therefore, every SWIFT hit contains at least one segment of at least consecutive q-grams. The length of this segment is at least . Because τ > 0 and e ≥ 0 the first summand is greater or equal one, so we obtain . Thus, if we set q := s min every SWIFT hit contains at least one segment of length s min , which is an ε-core. □
Step 1: ε-core identification
In our verification strategy, we identify ε-cores by applying a banded version of the Waterman-Eggert local alignment algorithm . The original algorithm computes all non-overlapping local alignments that reach a specified minimal score under a certain scoring scheme by dynamic programming (DP). We use the scoring scheme that scores matches by +1 and errors by p = – 1/ε + 1 and set the minimal score to s min . In our version, we reduce running time and space requirements of the algorithm by banding the computation of the DP matrix according to the parallelogram shape of SWIFT hits (see Fig. 4, only the shaded parts of the alignment matrix need to be computed). Thus, per ε-core there is a maximum running time of where w′ is the length and e the width of the corresponding SWIFT hit.
Step 2: ε-X-drop core filter
The second step of our verification strategy is a filter for ε-X-drops in the ε-cores. In the previous step we ignored ε-X-drops in ε-cores. If now one of our ε-cores contains an ε-X-drop, the ε-core should be divided into two cores in order to remove the ε-X-drop. Similarly, if an ε-core contains more than one ε-X-drop, the ε-core should be divided into even more cores.
For this decomposition of the ε-cores, we apply the post-processing algorithm from Zhang et al.  with the same scores and penalties for matching and error positions as in our ε-core identification step. The worst-case running time of this algorithm is linear in the length of the ε-core.
Possibly, we obtain more than one ε-core after this step, but in any case the following extension step has to be conducted only to the left and right of the original non-decomposed ε-core. If we started with an ε-core with more than one ε-X-drop, we can skip the following extension step for the middle parts, since the extension algorithm would run immediately into the previously detected ε-X-drops.
Step 3: ε-X-drop extension
The goal of the extension step is to obtain a region that spans all ε-matches without ε-X-drop containing the ε-core. In this region, the extended ε-core, we can then look for the maximal ε-match. Clearly, we can discard extended ε-cores that are shorter than n0.
For extension we apply the gapped extension algorithm by Zhang et al.  with the ε-adjusted scoring parameters as before. This algorithm is a score-only algorithm, i.e. it reports only the score and the sequence positions of the maximal extension but not the precise alignment. However, it reports the maximal and minimal diagonal of the alignment matrix (a band) that needs to be computed when looking for the precise alignment. We will determine the alignment in the next step of the verification strategy together with the exact begin and end positions of the maximal ε-match.
It is hard to do an informative running time analysis for this step. In theory, the dynamic programming algorithm could fill big parts of the alignment matrix. However, for very similar sequences only a narrow diagonal stretch will be filled, and for very distinct sequences we will soon reach an X-drop and stop. Still, we can estimate the running time by where L is the length of the extension and b is the width of the band.
It is easy to confirm that if an ε-core is part of an ε-match without ε-X-drop, then the extended ε-core spans this ε-match without ε-X-drop.
Step 4: identification of maximal ε-matches
The computation of the optimal end position of a trace for all possible extension lengths can be done along with the computation of the alignment matrix. Unfortunately, two traces of different lengths may end in the same column of the alignment matrix. For this reason, it is necessary to compute the alignment matrix using an algorithm for normalized alignment score [32, 33], which in addition iterates over all alignment lengths. As already mentioned, the alignment matrix can be banded by the minimal and maximal diagonal from the seed extension algorithm.
Along with the alignment matrix computation we check and if necessary update in each iteration step a list bestEnds with the best alignment matrix entry for the corresponding extension length. Each entry consists of the length and score (or number of errors) of the alignment, and coordinates in the alignment matrix. This list can afterward be reduced to a smaller set of possible lengths using the following observation: The position before the start and the position behind the end of the sought ε-match is an error, otherwise the ε-match would not be maximal. Hence, we only need to keep those list entries for lengths l where the score difference of bestEnd[l] and bestEnd[l + 1] is smaller than the score of a match. As a result we obtain two lists with possible traceback starting points, one for start positions of the ε-match obtained from the left extension, and one for end positions obtained from the right extension.
On these lists we then apply the following exhaustive search algorithm that iterates over combinations of possible start and end positions: We start with the leftmost possible start position and iterate over possible end positions from right to left. We continue with the next possible start position and restart with the rightmost possible end position as soon as the segment between the current start and end position has an error rate of at most ε (update currently longest ε-match), or if this segment is shorter than the minimal ε-match length or our currently longest ε-match (do not update currently longest ε-match). The algorithm stops when the segment between the current start position and the rightmost possible end position is shorter than the minimal ε-match length or the currently longest ε-match. Using this strategy we cannot miss the longest ε-match without ε-X-drop if the ε-core is part of one.
In case there is another maximal ε-match containing the ε-core, we have to recurse this search twice with the lists reduced by the following entries: All start positions before the start position of the longest ε-match and all end positions that are smaller than n0 added to the end position of the longest ε-match; and all start positions before the start position of the longest ε-match minus n0 and all end positions behind the end position of the longest ε-match.
As a last step, we have to look up the coordinates for the optimal extensions in the bestEnd lists and start traceback from these positions in the alignment matrices. The ε-core extended by the resulting alignments is a maximal ε-match that contains the ε-core.
The running time for computing the alignment matrix using a normalized alignment score is in where b is again the width of the band and L is the length of the extension. Dropping the normalization of the alignment score reduces running time by a factor of L. Determining the optimal start end end position in the left extension of length L1 and right extension of length L2 takes time per ε-match, and finally, the traceback takes time linear in the length of the final ε-match.
Step 5: removal of largely overlapping ε-matches
An ε-match identified during the verification phase is the longest that contains one specific ε-core but it is not necessarily maximal in that it does not overlap with a longer ε-match. In addition, an ε-match containing two ε-cores will be identified twice. To ensure that we output each ε-match only once and also only maximal ε-matches, this last step is necessary.
We remove overlapping ε-matches by sorting the ε-matches by their begin position in one sequence and pairwisely comparing here overlapping matches further. If two ε-matches are found to be identical, one is discarded. Also, if the shorter of the two ε-matches has no unique part of length n0, this ε-match is discarded. The running time of this last step is dominated by sorting the ε-matches, i.e. it is in , where M is the number of ε-matches before removal.
Theorem. Let M be the set of maximal ε-matches without ε-X-drop between two sequences. Then the algorithm that uses SWIFT for filtering and the described strategy for verification will detect exactly all ε-matches in M.
Proof. The SWIFT filter algorithm guarantees to report at least one overlapping SWIFT hit for every ε-match of the input sequences. The first step of the verification strategy detects all ε-cores in SWIFT hits. Apply Lemma 1 to prove that every ε-match contains an ε-core. According to Lemma 2, one of the ε-cores of every ε-match is contained in a SWIFT hit. Thus, for every ε-match in M the first verification step identifies at least one ε-core.
Let C′ be the set of ε-cores identified during the first step, and let C ⊆ C′ be the subset of ε-cores that are part of an ε-match in M. Since none of the ε-matches in M contain an ε-X-drop, the local alignment obtained after ε-X-drop extension (Step 3) of an ε-core c ∈ C spans the corresponding ε-match.
By cutting the extended ε-cores as described in Step 4, we eventually end up with a set of ε-matches M ′ that each contains a certain ε-core. Step 4 also guarantees that per ε-core no longer ε-match exists than the ε-match in M′. Therefore, after removal of overlapping ε-matches (Step 5), our set of ε-matches contains exactly all maximal ε-matches without ε-X-drop. □
Results and discussion
We have implemented the algorithmic pipeline in the program STELLAR following exactly the above described steps with one exception: To improve running time, STELLAR computes the alignment matrix during the identification of the longest ε-match with unnormalized alignment score. The following results show that this has in practice no effect on the sensitivity.
Running times and sensitivity on simulated sequences containing local alignments of different error rates
Running times and sensitivity on simulated sequences of different lengths
We downloaded the assembled genomes of Drosophíla melanogaster (release 5.26) and Drosophíla pseudoobscura (release 2.14) from FlyBase . We selected chromosome arm 2L from D. melanogaster (~23.5 Mb) and group 3 from the chromosome 4 assembly of D. pseudoobscura (~11.7 Mb) for our test runs. Unfortunately, this data set is too big to compute local alignments with SSEARCH, and since BLAST performs better than BLAT and LASTZ on the simulated data we chose to compare STELLAR on the genomic data only to BLAST. We expect BLAST to compute very short alignments with low error rates and some long alignments with higher error rates that do not fulfill the minimal length or error rate criterion for ε-matches, and hence STELLAR will not find them. Therefore, to double-check STELLAR’s sensitivity, we filter all ε-matches from the set of BLAST hits. Additonally, some of the longer BLAST hits may contain valid ε-matches that we extract and add to the set of filtered BLAST hits.
Results of STELLAR on drosophila chromosomes
num. of ε-matches
overlap BLAST a
Results of BLAST on drosophila chromosomes
num. of hits
overlap STELLAR 200 b
overlap STELLAR 100 b
We presented STELLAR, an algorithm to compute all local alignments of a minimal length according to a clear quality definition using the established measures error rate and X-drop. STELLAR brings exact local alignments to the community at the speed of heuristic state-of-the-art tools like BLAST, BLAT, or LASTZ. In addition, our experiments show that our effort is worthwhile since the heuristic tools miss up to about a third of the matches using simulated and real genomic data. Compared to its closest competitor, BWT-SW, it is in most benchmarks faster and offers with the X-drop parameter a possibility to exclude local alignments with bad regions. A limitation of STELLAR is that only ε-matches up to a certain error rate can be computed since the filtering phase loses specificity with increasing error rate. Therefore, for longer and less similar though significant local alignments BLAST remains more appropriate.
As an outlook another relatively new application for local alignments that has emerged with the advent of cheap next-generation sequencing should be mentioned. Standard read mapping programs [8–14] usually can only map entire reads to the reference. With the increasing read length, there will be more reads that span breakage points, e.g. translocations, gene fusions, or splice junctions. The application of an efficient and exact local alignment program could be one way to successfully map such reads and detect variation . Application of STELLAR is especially promising in that it uses the error rate for sensitivity control, an established criterion for read mappers. With a downstream chaining procedure of the partial read matches detected by STELLAR, it may then be possible to detect even multiple splits of reads. Hence, for finding local alignments in the tested range of error rates STELLAR could replace the heuristic tools.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 9, 2011: Proceedings of the Ninth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S9.
- Sellers PH: The theory and computation of evolutionary distances: Pattern recognition. Journal of Algorithms 1980, 1(4):359–373. 10.1016/0196-6774(80)90016-4View ArticleGoogle Scholar
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147: 195–197. 10.1016/0022-2836(81)90087-5PubMedView ArticleGoogle Scholar
- Paten B, Herrero J, Beal K, Birney E: Sequence progressive alignment, a framework for practical large-scale probabilistic consistency alignment. Bioinformatics 2009, 25(3):295–301. 10.1093/bioinformatics/btn630PubMedView ArticleGoogle Scholar
- Darling AE, Mau B, Perna NT: progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 2010, 5(6):e11147. 10.1371/journal.pone.0011147PubMedPubMed CentralView ArticleGoogle Scholar
- Dubchak I, Poliakov A, Kislyuk A, Brudno M: Multiple whole-genome alignments without a reference organism. Genome Res 2009, 19(4):682–689. 10.1101/gr.081778.108PubMedPubMed CentralView ArticleGoogle Scholar
- Raphael B, Zhi D, Tang H, Pevzner P: A novel method for multiple alignment of sequences with repeated and shuffled elements. Genome Res 2004, 14(11):2336–2346. 10.1101/gr.2657504PubMedPubMed CentralView ArticleGoogle Scholar
- Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AFA, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, Haussler D, Miller W: Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 2004, 14(4):708–715. 10.1101/gr.1933104PubMedPubMed CentralView ArticleGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009, 10(3):R25. 10.1186/gb-2009-10-3-r25PubMedPubMed CentralView ArticleGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25(14):1754–1760. 10.1093/bioinformatics/btp324PubMedPubMed CentralView ArticleGoogle Scholar
- Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008, 18(11):1851–1858. 10.1101/gr.078212.108PubMedPubMed CentralView ArticleGoogle Scholar
- Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program. Bioinformatics 2008, 24(5):713–714. 10.1093/bioinformatics/btn025PubMedView ArticleGoogle Scholar
- Rumble S, Brudno M: Shrimp – short read mapping package.2008. [http://compbio.cs.toronto.edu/shrimp/]Google Scholar
- Jiang H, Wong WH: SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics 2008, 24(20):2395–2396. 10.1093/bioinformatics/btn429PubMedPubMed CentralView ArticleGoogle Scholar
- Weese D, Emde AK, Rausch T, Döring A, Reinert K: RazerS–fast read mapping with sensitivity control. Genome Res 2009, 19(9):1646–1654. 10.1101/gr.088823.108PubMedPubMed CentralView ArticleGoogle Scholar
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008, 5(7):621–628. 10.1038/nmeth.1226PubMedView ArticleGoogle Scholar
- Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 1988, 85(8):2444–2448. 10.1073/pnas.85.8.2444PubMedPubMed CentralView ArticleGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403–410.PubMedView ArticleGoogle Scholar
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389PubMedPubMed CentralView ArticleGoogle Scholar
- Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comput Biol 2000, 7(1–2):203–214. 10.1089/10665270050081478PubMedView ArticleGoogle Scholar
- Kent WJ: BLAT-the BLAST-like alignment tool. Genome Res 2002, 12(4):656–664.PubMedPubMed CentralView ArticleGoogle Scholar
- Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W: Human-mouse alignments with BLASTZ. Genome Res 2003, 13: 103–107. 10.1101/gr.809403PubMedPubMed CentralView ArticleGoogle Scholar
- Lam TW, Sung WK, Tam SL, Wong CK, Yiu SM: Compressed indexing and local alignment of DNA. Bioinformatics 2008, 24(6):791–797. 10.1093/bioinformatics/btn032PubMedView ArticleGoogle Scholar
- Karlin S, Altschul SF: Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A 1990, 87(6):2264–2268. 10.1073/pnas.87.6.2264PubMedPubMed CentralView ArticleGoogle Scholar
- Döring A, Weese D, Rausch T, Reinert K: SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinformatics 2008, 9: 11. 10.1186/1471-2105-9-11PubMedPubMed CentralView ArticleGoogle Scholar
- Gogol-Döring A, Reinert K: Biological Sequence Analysis Using the SeqAn C++ Library.Chapman & Hall/CRC Mathematical & Computational Biology, CRC Press, Boca Raton, USA; 2009. [http://www.crcpress.com/ecommerce_product/product_detail.jsf?isbn=9781420076233]Google Scholar
- Zhang Z, Berman P, Miller W: Alignments without low-scoring regions. J Comput Biol 1998, 5(2):197–210. 10.1089/cmb.1998.5.197PubMedView ArticleGoogle Scholar
- Zhang Z, Berman P, Wiehe T, Miller W: Post-processing long pairwise alignments. Bioinformatics 1999, 15(12):1012–1019. 10.1093/bioinformatics/15.12.1012PubMedView ArticleGoogle Scholar
- Rasmussen KR, Stoye J, Myers EW: Efficient q-gram filters for finding all ε-matches over a given length. J Comput Biol 2006, 13(2):296–308. 10.1089/cmb.2006.13.296PubMedView ArticleGoogle Scholar
- Burkhardt S, Crauser A, Ferragina P, Lenhof HP, Rivals E, Vingron M: q-gram based database searching using a suffix array (QUASAR). J Comput Biol, RECOMB ’99 1999, 77–83.Google Scholar
- Jokinen P, Ukkonen E: Two algorithms for approxmate string matching in static texts. Mathematical Foundations of Computer Science 1991, Volume 520 of Lect Notes Comput Sc 1991, 240–248. [http://www.springerlink.com/content/p58155n8012x0477/]View ArticleGoogle Scholar
- Waterman MS, Eggert M: A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons. J Mol Biol 1987, 197(4):723–728. 10.1016/0022-2836(87)90478-5PubMedView ArticleGoogle Scholar
- Marzal A, Vidal E: Computation of normalized edit distance and applications. IEEE T Pattern Anal 1993, 15: 926–932. 10.1109/34.232078View ArticleGoogle Scholar
- Arslan AN, Eǧecioǧlu Ö: Efficient algorithms for normalized edit distance. Journal of Discrete Algorithms 2000, 1: 3–20.Google Scholar
- Tweedie S, Ashburner M, Falls K, Leyland P, McQuilton P, Marygold S, Millburn G, Osumi-Sutherland D, Schroeder A, Seal R, Zhang H, Consortium F: FlyBase: enhancing drosophila gene ontology annotations. Nucleic Acids Res 2009, 37(Database issue):D555-D559.PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.