Volume 14 Supplement 11
Selected articles from The Second Workshop on Data Mining of Next-Generation Sequencing in conjunction with the 2012 IEEE International Conference on Bioinformatics and Biomedicine
libgapmis: extending short-read alignments
 Nikolaos Alachiotis^{1},
 Simon Berger^{1},
 Tomáš Flouri^{1},
 Solon P Pissis^{1, 2} and
 Alexandros Stamatakis^{1}
DOI: 10.1186/1471-2105-14-S11-S4
© Alachiotis et al; licensee BioMed Central Ltd. 2013
Published: 4 November 2013
Abstract
Background
A wide variety of short-read alignment programmes have been published recently to tackle the problem of mapping millions of short reads to a reference genome, focusing on different aspects of the procedure such as time and memory efficiency, sensitivity, and accuracy. These tools allow for a small number of mismatches in the alignment; however, their ability to allow for gaps varies greatly, with many performing poorly or not allowing them at all. The seed-and-extend strategy is applied in most short-read alignment programmes. After aligning a substring of the reference sequence against the high-quality prefix of a short read (the seed), an important problem is to find the best possible alignment between a substring of the reference sequence succeeding the seed and the remaining low-quality suffix of the read (the extend). The fact that the reads are rather short and that the gap occurrence frequency observed in various studies is rather low suggests that aligning (parts of) those reads with a single gap is in fact desirable.
Results
In this article, we present libgapmis, a library for extending pairwise short-read alignments. Apart from the standard CPU version, it includes ultrafast SSE- and GPU-based implementations. libgapmis is based on an algorithm computing a modified version of the traditional dynamic-programming matrix for sequence alignment. Extensive experimental results demonstrate that the functions of the CPU version provided in this library accelerate the computations by a factor of 20 compared to other programmes. The analogous SSE- and GPU-based implementations accelerate the computations by a factor of 6 and 11, respectively, compared to the CPU version. The library also provides the user the flexibility to split the read into fragments, based on the observed gap occurrence frequency and the length of the read, thereby allowing for a variable, but bounded, number of gaps in the alignment.
Conclusions
We present libgapmis, a library for extending pairwise short-read alignments. We show that libgapmis is better suited to and more efficient for this task than existing algorithms. The importance of our contribution is underlined by the fact that the provided functions may be seamlessly integrated into any short-read alignment pipeline. The open-source code of libgapmis is available at http://www.exelixis-lab.org/gapmis.
Background
The problem of finding substrings of a text similar to a given pattern has been intensively studied over the past decades, and it is a central problem in a wide range of applications, including signal processing [1], information retrieval [2], searching for similarities among biological sequences [3], file comparison [4], spelling correction [5], and music analysis [6]. Some examples are recovering the original signals after their transmission over noisy channels, finding DNA subsequences after possible mutations, and text searching where there are typing or spelling errors.
Approximate string matching, in general, consists in locating all the occurrences of substrings inside a text t that are similar to a pattern x. It consists of producing the positions of the substrings of t that are at distance at most k from x, for a given natural number k. For the rest of this article, we assume that k < |x| ≤ |t|. We focus on online searching, that is, the text cannot be preprocessed to build an index on it. There exist four main approaches to online approximate string matching: algorithms based on dynamic programming; algorithms based on automata; algorithms based on word-level parallelism; and algorithms based on filtering. We focus on algorithms based on dynamic programming. There mainly exist two different distances for measuring the approximation: the edit distance and the Hamming distance.
The edit distance between two strings, not necessarily of the same length, is the minimum cost of a sequence of elementary edit operations between these two strings. A restricted notion of this distance is obtained by considering the minimum number of edit operations rather than the sum of their costs. The Hamming distance between two strings of the same length is the number of positions where mismatches occur between the two strings.
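The two distances can be illustrated with a short sketch. It is a minimal example written for this text, not code taken from libgapmis: the function names are hypothetical, and the edit distance shown is the restricted, unit-cost variant (the minimum number of insertions, deletions, and substitutions).

```python
def hamming_distance(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance computed row by row with dynamic programming."""
    # prev[j] holds the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        curr = [i]  # distance between x[:i] and the empty string
        for j, b in enumerate(y, 1):
            curr.append(min(prev[j] + 1,              # delete a from x
                            curr[j - 1] + 1,          # insert b into x
                            prev[j - 1] + (a != b)))  # match or substitute
        prev = curr
    return prev[-1]
```

Unlike the Hamming distance, the edit distance is defined for strings of different lengths, which is why it can account for insertions and deletions.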
Alignments are a commonly used technique to compare strings and are based on notions of distance [1] or of similarity between strings; for example, similarities among biological sequences [3]. Alignments are often computed by dynamic programming [2].
A gap is a sequence of consecutive insertions or deletions (indels) of letters in an alignment. The extensive use of alignments on biological sequences has shown that it can be desirable to penalise the formation of long gaps rather than penalising individual insertions or deletions of letters.
The notion of gap in a biological sequence can be described as the absence (respectively, presence) of a fragment, which is (respectively, is not) present in another sequence [7]. Gaps occur naturally in biological sequences as part of the diversity between individuals. In many biological applications, a single mutational event can cause the insertion (or deletion) of a large DNA fragment, so the notion of gap in an alignment is an important one. Moreover, the creation of gaps can occur in a wide, but bounded, range of sizes with almost equal likelihood.
A number of natural processes can cause gaps in DNA sequences: long pieces of DNA can be copied and inserted by a single mutational event; slippage during the replication of DNA may cause the same area to be repeated multiple times as the replication machinery loses its place on the template; an insertion in one sequence paired with a reciprocal deletion in another may be caused by unequal crossover in meiosis; insertion of transposable elements (jumping genes) into a DNA sequence; insertion of DNA by retroviruses; and translocations of DNA between chromosomes [8]. The accurate identification of gaps is shown to be fundamental in various studies on disorders; for example, on Hajdu-Cheney syndrome [9], a disorder of severe and progressive bone loss.
The focus of this work is directly motivated by the well-known and challenging application of resequencing, that is, the assembly of a genome directed by a reference sequence. New developments in sequencing technologies (see [10–12], for example) allow whole-genome sequencing to be turned into a routine procedure, creating sequencing data in massive amounts. Short sequences (reads) are produced in huge amounts (tens of gigabytes), and in order to determine the part of the genome from which a read was derived, it must be mapped (aligned) back to some reference sequence, a few gigabases long.
A wide variety of short-read alignment programmes (e.g. Bowtie [13], SOAP2 [14], REAL [15], BWA [16], Bowtie2 [17]) were published in the past five years to address the challenge of efficiently mapping tens of millions of short reads to a genome, focusing on different aspects of the procedure: speed, sensitivity, and accuracy. These tools allow for a small number of mismatches in the alignment; however, their ability to allow for gaps varies greatly, with many performing poorly and others not allowing them at all.
From Figure 1, we observe that a gap might need to be inserted in the leftmost position of the alignment (position 4). However, we are not able to know the length of the substring of the reference sequence to be aligned beforehand. Due to this observation, it is clear we need an intermediate between the global (Needleman-Wunsch algorithm [19], for example) and the local alignment (Smith-Waterman algorithm [20], for example), known as semi-global alignment, that allows the insertion of a gap at the end of an alignment with no penalty (positions 10-12).
Moreover, Figure 3 shows an exponential decrease in the occurrence of gaps as the length increases and a preference for lengths which are multiples of 3. The presence of many gaps in short reads in the order of 25-150 base pairs (bp) is rather unlikely due to the low gap occurrence frequency. Hence, applying a traditional dynamic-programming approach, which essentially cannot bound the number of deletions and insertions in the alignment, would greatly affect the mapping confidence.
Motivated by the aforementioned observations, in [7], the authors presented GapMis, a tool for pairwise global and semi-global sequence alignment with a single gap. In this article, we present libgapmis, the analogous library implementation. libgapmis also includes two highly optimised versions: one based on Streaming SIMD Extensions (SSE); and one based on Graphics Processing Units (GPU). Proof-of-concept versions of GapMis and libgapmis were presented in [25] and [21], respectively. Millions of pairwise sequence alignments, performed here under realistic conditions based on the properties of real full-length genomes, demonstrate that libgapmis can increase the accuracy of extending short-read alignments end-to-end compared to more traditional approaches. The importance of our contribution is underlined by the fact that the provided open-source library functions can directly be integrated into any short-read alignment programme.
Definitions and notation
In this section, we give a few definitions, generally following [26] and [7].
An alphabet ∑ is a finite nonempty set whose elements are called letters. A string on an alphabet ∑ is a finite, possibly empty, sequence of elements of ∑. The zero-letter sequence is called the empty string, and is denoted by ε. The set of all the strings on the alphabet ∑ is denoted by ∑*. The length of a string x is defined as the length of the sequence associated with the string x, and is denoted by |x|. We denote by x[i], for all 1 ≤ i ≤ |x|, the letter at index i of x. Each index i, for all 1 ≤ i ≤ |x|, is a position in x when x ≠ ε. It follows that the i-th letter of x is the letter at position i in x, and that x = x[1 .. |x|]. A string x is a substring of a string y if there exist two strings u and v, such that y = uxv. Let x, y, u, and v be strings, such that y = uxv holds. If u = ε, then x is a prefix of y. If v = ε, then x is a suffix of y.
where y_{i} ∈ ∑*, for all 0 ≤ i < n, and g_{j} ∈ {★}*, for all 0 ≤ j < n − 1, returns the string y = y_{0}y_{1} ... y_{n−1}, such that y ∈ ∑*.
The approximate string matching with k-mismatches and a single gap problem can now be formally defined:
Problem 1 ([21]) Given a text t of length n, a pattern x of length m ≤ n, an integer k, such that 0 ≤ k < m, and integers α and β, such that 0 ≤ α ≤ β and β < n, find all prefixes of t, such that for each prefix y

either there exists a single-gap string y', with a gap g, such that y = conc(y'), δ_{G}(x, y') ≤ k, and α ≤ |g| ≤ β;

or there exists a single-gap string x', with a gap g, such that x = conc(x'), δ_{G}(x', y) ≤ k, and α ≤ |g| ≤ β;

or δ_{H}(x, y) ≤ k and α = 0.
Example 2 ([21]) Let t = AGCAGAGGAGCAGGCGTTCCGTGGT, x = ACCGT, k = 2, α = 6, and β = 7. A solution to this problem instance is the ending position 11, since there exists a single-gap string x' = ACC★★★★★★GT, with a gap g = ★★★★★★, such that x = conc(ACC★★★★★★GT), δ_{G}(x', t[1 .. 11]) = 2, and |g| = 6.
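Example 2 can be checked with a brute-force sketch of Problem 1. This is not algorithm GapMis, only an exhaustive search written for illustration. It assumes that δ_{G} counts mismatching letters outside the gap, that δ_{H} is the Hamming distance, and that the gap lies strictly inside the string; all function names are hypothetical.

```python
GAP = "★"


def gapped_mismatches(a: str, b: str) -> int:
    """Mismatches between two equal-length strings, skipping gap positions."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(1 for p, q in zip(a, b) if GAP not in (p, q) and p != q)


def insert_gap(s: str, pos: int, length: int) -> str:
    """Return s with a gap of the given length inserted before index pos."""
    return s[:pos] + GAP * length + s[pos:]


def solves_problem_1(x: str, y: str, k: int, alpha: int, beta: int) -> bool:
    """Check the three cases of Problem 1 for a pattern x and a prefix y."""
    g = abs(len(x) - len(y))
    if g == 0:
        # No gap: plain Hamming distance, allowed only when alpha = 0.
        return alpha == 0 and gapped_mismatches(x, y) <= k
    if not alpha <= g <= beta:
        return False
    # The gap goes into the shorter string (assumed strictly inside it).
    shorter, longer = (x, y) if len(x) < len(y) else (y, x)
    return any(gapped_mismatches(insert_gap(shorter, pos, g), longer) <= k
               for pos in range(1, len(shorter)))


def single_gap_end_positions(t, x, k, alpha, beta):
    """1-based ending positions of all prefixes of t that solve Problem 1."""
    return [end for end in range(1, len(t) + 1)
            if solves_problem_1(x, t[:end], k, alpha, beta)]
```

With the values of Example 2, `single_gap_end_positions("AGCAGAGGAGCAGGCGTTCCGTGGT", "ACCGT", 2, 6, 7)` returns `[11]`, the ending position given in the example.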
Algorithm GapMis
Given the text t of length n, the pattern x of length m, and the threshold β as input, algorithm GapMis, first introduced in [7] (see Additional File 1), computes matrices G and H. In fact, we only need to compute a diagonal stripe (a narrow band) of width 2β + 1 in matrix G and in matrix H. As a result, algorithm GapMis computes a pruned version of matrices G and H, denoted by G^{P} and H^{P}, respectively (see Figure 4c and 4d).
Proposition 1 ([7]) There exist at most 2β + 1 cells of matrix G that solve Problem 1.
Proposition 2 ([7]) Problem 1 can be solved by algorithm GapMis in time $\mathcal{O}(m\beta)$.
Alternatively, we could compute matrix G and matrix H based on a simple alignment score scheme depending on the application of the algorithm (see the following section or [27], for example), and compute the maximum score in time Θ(β) by Proposition 1.
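The size of the pruned matrices can be made concrete with a small sketch. It assumes 1-based row indices i over the pattern and column indices j over the text, and that a cell belongs to the band when |i − j| ≤ β; the exact indexing used by GapMis is not reproduced here, and the function names are hypothetical.

```python
def band_columns(i: int, n: int, beta: int) -> range:
    """Columns j of row i that fall inside the band |i - j| <= beta."""
    return range(max(1, i - beta), min(n, i + beta) + 1)


def count_band_cells(m: int, n: int, beta: int) -> int:
    """Number of cells of an m x n matrix inside a band of width 2*beta + 1."""
    return sum(len(band_columns(i, n, beta)) for i in range(1, m + 1))
```

For m = 5, n = 25, and β = 2, the band holds 22 of the 125 cells, never more than m(2β + 1) = 25, and its last row contains 2β + 1 = 5 cells, in line with Propositions 1 and 2.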
Library libgapmis
In this section, we give a brief description of the library implementation. libgapmis was implemented in the C programming language. First, we start by describing the standard CPU version of the library. Thereafter, we discuss some technical issues regarding the SSE- and GPU-based implementations. Finally, we describe how the provided functions are extended to accommodate a variable, but bounded, number of gaps in the alignment.
Finally, the total score for each alignment is obtained by adding these two scores: scoring matrix and affine gap penalty scores. The optimal alignment is the alignment with the highest such total score. The same alignment score scheme is applied in package EMBOSS [30].
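The total score of a single-gap alignment can be sketched as follows. The sketch rests on assumptions that are not stated in the text above: a match score of 5 and a mismatch score of −4 stand in for a nucleotide scoring matrix, and a gap of length ℓ is penalised by gap_open + (ℓ − 1) × gap_extend, one common affine convention. Free end gaps of semi-global alignment are not modelled, and the function names are hypothetical.

```python
def affine_gap_penalty(gap_length: int, gap_open: float = 10.0,
                       gap_extend: float = 0.5) -> float:
    """Penalty of one gap: opening cost plus extension for each further letter."""
    if gap_length <= 0:
        return 0.0
    return gap_open + (gap_length - 1) * gap_extend


def total_score(row_x: str, row_y: str, match: float = 5.0,
                mismatch: float = -4.0, gap_open: float = 10.0,
                gap_extend: float = 0.5, gap_char: str = "-") -> float:
    """Sum of substitution scores minus affine penalties of the gap runs."""
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    substitution = 0.0
    penalty = 0.0
    run = 0  # length of the gap run currently being read
    for a, b in zip(row_x, row_y):
        if gap_char in (a, b):
            run += 1
            continue
        # A gap run (if any) ends here; charge it once. Adjacent gaps in
        # different rows would be merged, which cannot happen with one gap.
        penalty += affine_gap_penalty(run, gap_open, gap_extend)
        run = 0
        substitution += match if a == b else mismatch
    penalty += affine_gap_penalty(run, gap_open, gap_extend)
    return substitution - penalty
```

Under these assumed scores, `total_score("ACC------GT", "AGCAGAGGAGC")` adds three matches (15) and two mismatches (−8) and subtracts a penalty of 12.5 for the gap of length 6, giving −5.5.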
We implemented the following functions:

gapmis_one_to_one: this function finds the optimal semi-global alignment between two sequences. It first implements algorithm GapMis in time $\mathcal{O}(m\beta)$; thereafter, it finds the optimal semi-global alignment in time $\mathcal{O}(\beta)$. Finally, gapmis_one_to_one finds the position of the single gap via backtracking in matrix H in time $\mathcal{O}(m)$. The user can omit computing the position of the single gap, and thereby the computation of matrix H.

gapmis_one_to_many: this function uses function gapmis_one_to_one as a building block. It computes the ℓ optimal semi-global alignments between a query sequence and a set of ℓ target sequences.

gapmis_many_to_many: this function uses function gapmis_one_to_many as a building block. It computes the κ × ℓ optimal semi-global alignments between a set of κ query sequences and a set of ℓ target sequences.
Finally, we implemented functions results_one_to_one, results_one_to_many, and results_many_to_many for generating the visualisation of the analogous output in a format similar to the one generated by EMBOSS.
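The layering of the three alignment functions can be sketched in a few lines. The sketch does not reproduce the C signatures of libgapmis: the names `one_to_many` and `many_to_many` and the scoring callback are hypothetical, and any pairwise scoring function can be plugged in.

```python
from typing import Callable, List, Sequence

# A pairwise scorer takes (query, target) and returns an alignment score.
Scorer = Callable[[str, str], float]


def one_to_many(query: str, targets: Sequence[str],
                score_one_to_one: Scorer) -> List[float]:
    """Score one query against each of the l targets."""
    return [score_one_to_one(query, target) for target in targets]


def many_to_many(queries: Sequence[str], targets: Sequence[str],
                 score_one_to_one: Scorer) -> List[List[float]]:
    """Score each of the k queries against each of the l targets (k x l)."""
    return [one_to_many(query, targets, score_one_to_one) for query in queries]
```

For example, with a toy scorer that counts matching positions, two queries and three targets give a 2 × 3 table of scores, one row per query.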
SSE-based implementation
The SSE-based implementation is a direct application of the inter-sequence vectorisation scheme. It has been used to accelerate the Smith-Waterman algorithm and analogous dynamic-programming algorithms [31, 32]. Algorithm GapMis, under this vectorisation scheme, uses SSE instructions to simultaneously compute multiple separate matrices (usually 2, 4, or 8 depending on the vector width and the data type used) corresponding to alignments of one query sequence against multiple other target sequences.
Currently, the vectorisation uses 32-bit floating-point arithmetic to represent scores, implying that, on CPUs with SSE3 vector units, a vector width w := 4 is used. By restricting scores to integer values and using 16-bit integers, we may increase the vector width to w := 8. For performance-related reasons, the SSE-based version only supports the computation of alignment scores, and, therefore, does not support backtracking. The functions provided are gapmis_one_to_many_opt_sse and gapmis_many_to_many_opt_sse, which make use of the aforementioned vectorisation scheme to compute the scores for each pair of sequences. Finally, we make use of the purely sequential function gapmis_one_to_one to find the position of the single gap via backtracking in matrix H. In order to further accelerate the computations, the user may optionally and transparently execute these functions on multi-core architectures by setting the number of threads. More technical details of the SSE-based implementation can be found in [21].
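The idea of inter-sequence vectorisation can be illustrated without SSE intrinsics. In the sketch below, each matrix cell holds a list of w values, one per target sequence ("lane"), and all lanes are updated by the same operation in lockstep, which is what a vector instruction would do in hardware. For brevity the recurrence is the unit-cost edit distance, a stand-in for the GapMis recurrence; the targets are assumed to have equal length, and the function name is hypothetical.

```python
def edit_distance_lanes(query, targets):
    """Edit distance of one query against w equal-length targets in lockstep."""
    if not targets or len({len(t) for t in targets}) != 1:
        raise ValueError("expected a non-empty batch of equal-length targets")
    w = len(targets)       # vector width: one lane per target
    n = len(targets[0])
    # prev[j] is a vector of w scores, one per lane, for column j.
    prev = [[j] * w for j in range(n + 1)]
    for i, a in enumerate(query, 1):
        curr = [[i] * w]
        for j in range(1, n + 1):
            letters = [t[j - 1] for t in targets]  # one letter per lane
            curr.append([min(prev[j][lane] + 1,
                             curr[j - 1][lane] + 1,
                             prev[j - 1][lane] + (a != letters[lane]))
                         for lane in range(w)])
        prev = curr
    return prev[n]
```

With w = 4 lanes, `edit_distance_lanes("ACGT", ["ACGT", "AGGT", "TTTT", "ACGA"])` returns `[0, 1, 3, 1]`; no lane depends on another, which is what makes this layout suitable for SIMD units.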
GPU-based implementation
The function gapmis_one_to_one was ported to GPUs using OpenCL in order to maintain a vendor-independent GPU version. In analogy to the SSE-based implementation, only the computation of alignment scores is offloaded to the GPU. The GPU implementation is also similar to the SSE-based implementation in the sense that multiple dynamic-programming matrices are computed simultaneously.
Similar to the SSE-based version, the functions provided are gapmis_one_to_many_opt_gpu and gapmis_many_to_many_opt_gpu. Finally, we make use of the purely sequential function gapmis_one_to_one to find the position of the single gap via backtracking in matrix H. More technical details of the GPU-based implementation can be found in [21].
Accommodating multiple gaps
The presence of multiple gaps is unlikely given the observed gap occurrence frequency in real-life applications: 5.7 × 10^{-6} in the human exome (see the Background Section), 1.7 × 10^{-5} in Beta vulgaris [24], 2.4 × 10^{-5} in Arabidopsis thaliana [24], and 3.2 × 10^{-6} in bacteriophage PhiX174 [24]. However, in order to increase the flexibility of our library, we implemented two additional functions, gapmis_one_to_one_f and gapmis_one_to_one_onf, to allow for a variable, but bounded, number of gaps in the alignment.

gapmis_one_to_one_f: this function provides the user the option to split the query sequence into f fragments, based on the observed gap occurrence frequency and the query length, by taking the number of fragments as input argument. It then uses function gapmis_one_to_one to perform a single-gap alignment for each fragment independently. The total score of the alignment is obtained by adding the f individual scores of the fragments. We denote this function by gm f <int>, where <int> is the number of fragments f used as input argument.

gapmis_one_to_one_onf: this function computes the alignment by using the optimal number of fragments. First, it takes the maximum number of fragments as input argument, say f_{max}, and only computes the total score of the alignments, for each different number 1, 2, ..., f_{max} of fragments. It then uses function gapmis_one_to_one_f to compute the alignment by passing the optimal number of fragments (the one that gives the maximum total score in the previous step) as input argument. We denote this function by gm onf <int>, where <int> is the maximum number of fragments f_{max} used as input argument.
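The two fragment-based functions can be sketched as follows. The text does not state how the query is cut or how ties between fragment counts are broken, so the near-equal split and the preference for the smallest f are assumptions; the function names and the scoring callback are hypothetical as well.

```python
def split_into_fragments(query: str, f: int):
    """Split query into f consecutive fragments of near-equal length."""
    if not 1 <= f <= len(query):
        raise ValueError("f must lie between 1 and the query length")
    base, extra = divmod(len(query), f)
    fragments, start = [], 0
    for index in range(f):
        size = base + (1 if index < extra else 0)  # first fragments get the rest
        fragments.append(query[start:start + size])
        start += size
    return fragments


def best_fragment_count(total_score_for, f_max: int):
    """Return (f, score) maximising total_score_for(f) over f = 1..f_max."""
    best_f, best_score = 1, total_score_for(1)
    for f in range(2, f_max + 1):
        score = total_score_for(f)
        if score > best_score:  # strict comparison keeps the smallest f on ties
            best_f, best_score = f, score
    return best_f, best_score
```

Here `total_score_for(f)` stands for the sum of the f single-gap alignment scores obtained with f fragments; only after the best f is known would the full alignment, including the gap positions, be computed.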
Experimental results
The experiments were conducted on a desktop PC using up to 4 cores of an Intel i7 2600 CPU at 3.4 GHz under Linux, and an NVIDIA GeForce 560 GPU with 336 CUDA cores and 1 GB DDR5 device memory. libgapmis is distributed under the GNU General Public License (GPL). The library is available at http://www.exelixis-lab.org/gapmis, which is set up for maintaining the source code and the man page documentation.
To the best of our knowledge, libgapmis is the first library for extending pairwise short-read alignments. The main design goal of libgapmis is to identify a single gap in the alignment (see the Background Section for the motivation). Therefore, in this section, we focus on comparing the performance of function gapmis_one_to_one to the analogous performance of EMBOSS needle. The latter implements the Needleman-Wunsch algorithm, the traditional approach used for semi-global alignment. needle is, to date, one of the most popular pairwise sequence alignment programmes for global and semi-global alignment.
We generated 100,000 pairs of 100 bp-long sequences on the DNA alphabet. Initially, each pair consisted of two identical sequences. Subsequently, we inserted:

a single gap with a uniformly random length that ranged between 1 and 30 into one of the two sequences;

a uniformly random number of mismatches that ranged between 1 and 10.
Since the presence of multiple gaps is unlikely based on the gap occurrence frequency observed in real datasets, this experimental setting aims to demonstrate the suitability of the proposed algorithm compared to more traditional approaches in identifying the simulated inserted gap.
We seamlessly integrated function gapmis_one_to_one into a test programme, denoted by gapmis, for computing the optimal semi-global alignment between a pair of sequences. In each case, for a fair comparison of needle and gapmis, an effort was made to run the programmes under conditions that were as similar as possible. In gapmis, we additionally used function results_one_to_one to produce the corresponding output. While parsing the output generated by the two programmes, we counted any inserted gap as a gap, excluding, however, a gap inserted at the end of the alignment.
We consider as valid those alignments where the number of inserted gaps is less than or equal to the number originally inserted. Furthermore, we consider as correct those valid alignments whose gaps have a total length smaller than or equal to that of the gaps originally inserted and whose number of mismatches is less than or equal to the number originally inserted.
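The two criteria translate directly into predicates. The sketch below is an illustration only: the record type and its field names are hypothetical, and it assumes that the number of gaps, their total length, and the number of mismatches have already been parsed from a programme's output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentSummary:
    gaps: int        # number of gaps (end gaps excluded)
    gap_length: int  # total length of those gaps
    mismatches: int  # number of mismatching positions


def is_valid(reported: AlignmentSummary, truth: AlignmentSummary) -> bool:
    """Valid: no more gaps than were originally inserted."""
    return reported.gaps <= truth.gaps


def is_correct(reported: AlignmentSummary, truth: AlignmentSummary) -> bool:
    """Correct: valid, with neither longer gaps nor more mismatches."""
    return (is_valid(reported, truth)
            and reported.gap_length <= truth.gap_length
            and reported.mismatches <= truth.mismatches)
```

An alignment that reports one gap of length 20 when a gap of length 12 was inserted is therefore valid but not correct.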
Valid and correct alignments with gap opening penalty 10.0 and gap extension penalty 0.5.
| Programme | Valid | Correct |
| --- | --- | --- |
| needle | 94,552 | 94,516 |
| gapmis | 100,000 | 99,996 |

Valid and correct alignments with gap opening penalty 8.0 and gap extension penalty 1.0.
| Programme | Valid | Correct |
| --- | --- | --- |
| needle | 76,512 | 76,501 |
| gapmis | 100,000 | 99,997 |

Valid and correct alignments with gap opening penalty 12.0 and gap extension penalty 0.5.
| Programme | Valid | Correct |
| --- | --- | --- |
| needle | 95,452 | 95,427 |
| gapmis | 100,000 | 99,999 |
We also evaluated the time efficiency of the accelerated SSE- and GPU-based versions of libgapmis, by comparing their processing times against the ones of the standard CPU version. In particular, we generated a 75 bp-long DNA query sequence and 4,639,576 100 bp-long DNA target sequences. This represents a realistic setting for resequencing applications because the seed part of a short read usually occurs in thousands or millions of positions along the reference sequence. Hence an important problem in resequencing is the efficient and accurate extension of these thousands to millions of potential alignments. We used the following versions of the function gapmis_one_to_many:

the CPU version;

the single-core SSE version;

the SSE version with 4 threads (t 4);

the GPU version.
As a further experiment, in order to evaluate the performance of programme gapmis, function gapmis_one_to_one_f, function gapmis_one_to_one_onf, and needle under real conditions, we simulated 1,000,000 100 bp-long query sequences from the 30 Mbp chromosome 1 of Arabidopsis thaliana (AT) obtained from [33], and inserted mismatches and gaps into the reference sequence; then we aligned them back against the original reference sequence. As mismatch occurrence frequency and gap occurrence frequency we used 1.6 × 10^{-3} and 2.4 × 10^{-5}, respectively (the ones observed in AT [24]). Since, in practice, insertions occur less frequently than deletions, 42% of the inserted gaps were insertions and 58% deletions (also observed in AT [24]). For the length of the inserted gaps, we used the distribution of gap lengths shown in Figure 3, which is consistent with other studies on gap distributions (cf. [9, 22, 23]). Since the queries were simulated, we were able to know the exact location of the fragments of the reference sequence they were derived from (the target sequences). Hence, we were able to classify each generated alignment as valid/invalid and correct/incorrect. We define accuracy as the proportion of correct alignments in the dataset. Thus, we evaluated the accuracy of the aforementioned programmes in extending an alignment end-to-end, assuming that the seed part of the alignment is already performed by using a conventional indexing scheme, that is, a hash-based index [15] or an FM-index [16]. We repeated the same experiment by simulating 150 bp-long query sequences and using other gap occurrence frequencies, namely those observed in Beta vulgaris (BV) [24] and the Homo sapiens (HS) exome [9].
Correct alignments using gapmis, gapmis_one_to_one_f, gapmis_one_to_one_onf, and needle.
| Species | Length of queries [bp] | Gap occurrence frequency | gapmis | gm f 2 | gm f 3 | gm onf 2 | gm onf 3 | needle |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AT | 100 | 2.4 × 10^{-5} | 999,099 | 998,404 | 997,561 | 999,207 | 999,259 | 999,126 |
| AT | 150 | 2.4 × 10^{-5} | 998,805 | 998,171 | 997,542 | 999,024 | 999,152 | 999,115 |
| BV | 100 | 1.7 × 10^{-5} | 999,361 | 998,868 | 998,229 | 999,432 | 999,459 | 999,353 |
| BV | 150 | 1.7 × 10^{-5} | 999,196 | 998,771 | 998,249 | 999,347 | 999,432 | 999,378 |
| HS | 100 | 5.7 × 10^{-6} | 999,809 | 999,615 | 999,419 | 999,822 | 999,825 | 999,782 |
| HS | 150 | 5.7 × 10^{-6} | 999,795 | 999,606 | 999,408 | 999,817 | 999,825 | 999,793 |
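As a worked example of the accuracy measure (the proportion of correct alignments in the dataset), the sketch below uses three of the counts reported for AT with 100 bp-long queries, out of 1,000,000 simulated alignments. The helper function is hypothetical and only restates the definition.

```python
def accuracy(correct: int, total: int) -> float:
    """Proportion of correct alignments in a dataset of the given size."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


TOTAL_QUERIES = 1_000_000
# Correct alignments for AT, 100 bp-long queries, copied from the table.
AT_100 = {"gapmis": 999_099, "gm onf 3": 999_259, "needle": 999_126}

accuracies = {name: accuracy(count, TOTAL_QUERIES)
              for name, count in AT_100.items()}
most_accurate = max(accuracies, key=accuracies.get)
```

For this row, gapmis reaches an accuracy of 0.999099 and gm onf 3 has the highest count of the three, with 133 more correct alignments than needle.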
Valid and correct alignments using needle.
| Programme | Species | Length of queries [bp] | Gap occurrence frequency | Gap opening penalty | Gap extension penalty | Valid alignments | Correct alignments |
| --- | --- | --- | --- | --- | --- | --- | --- |
| needle | AT | 100 | 2.4 × 10^{-5} | 10.0 | 0.5 | 99,988 | 99,917 |
| needle | AT | 100 | 2.4 × 10^{-5} | 15.0 | 0.5 | 99,992 | 99,911 |
| needle | AT | 100 | 2.4 × 10^{-5} | 20.0 | 0.5 | 99,996 | 99,850 |
| needle | AT | 150 | 2.4 × 10^{-5} | 10.0 | 0.5 | 99,991 | 99,919 |
| needle | AT | 150 | 2.4 × 10^{-5} | 15.0 | 0.5 | 99,992 | 99,901 |
| needle | AT | 150 | 2.4 × 10^{-5} | 20.0 | 0.5 | 99,996 | 99,834 |
Conclusions
In this article, we presented libgapmis, an ultrafast and flexible library for extending pairwise short-read alignments end-to-end. Apart from the standard CPU version, it includes ultrafast SSE- and GPU-based implementations. libgapmis is based on GapMis, a tool that computes a different version of the traditional dynamic-programming matrix for sequence alignment.
This work is directly motivated by the next-generation resequencing application. We demonstrated that libgapmis is more suitable and efficient than more traditional approaches for extending short-read alignments end-to-end. Adding the flexibility of bounding the number of gaps inserted in the alignment strengthens the classical scheme of scoring matrices and affine gap penalty scores. The presented experimental results are very promising, both in terms of identifying gaps and efficiency.
By exploiting the potential of modern CPU and GPU architectures and applying multi-threading, we improved the performance of the purely sequential CPU version by more than one order of magnitude. More importantly, the functions provided in libgapmis can be directly integrated into any short-read alignment programme. Our immediate target is to further optimise the code, and also integrate the functions of this library into a short-read alignment pipeline.
Declarations
Acknowledgements
The publication costs for this article were funded by the Heidelberg Institute for Theoretical Studies (HITS gGmbH). NA, SB, and TF are supported by funding from the DFG (German Science Foundation, grants STA 860/2 and STA 860/4). SPP is supported by the NSF-funded iPlant Collaborative (NSF grant #DBI-0735191). AS is supported by institutional funding from HITS gGmbH. We thank Rajesh Kumar Gottimukkala from Life Technologies for valuable comments and useful discussions.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 11, 2013: Selected articles from The Second Workshop on Data Mining of Next-Generation Sequencing in conjunction with the 2012 IEEE International Conference on Bioinformatics and Biomedicine. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S11.
References
 1. Levenshtein VI: Binary codes capable of correcting deletions, insertions, and reversals. Tech Rep 8. 1966, Soviet Physics Doklady
 2. Wagner RA, Fischer MJ: The String-to-String Correction Problem. Journal of the ACM. 1974, 21: 168-173. 10.1145/321796.321811.
 3. Sellers PH: On the Theory and Computation of Evolutionary Distances. SIAM Journal on Applied Mathematics. 1974, 26 (4): 787-793. 10.1137/0126070.
 4. Heckel P: A technique for isolating differences between files. Communications of the ACM. 1978, 21 (4): 264-268. 10.1145/359460.359467.
 5. Peterson JL: Computer programs for detecting and correcting spelling errors. Communications of the ACM. 1980, 23 (12): 676-687. 10.1145/359038.359041.
 6. Cambouropoulos E, Crochemore M, Iliopoulos CS, Mouchard L, Pinzon YJ: Algorithms for Computing Approximate Repetitions in Musical Sequences. International Journal of Computational Mathematics. 2000, 79 (11): 1135-1148.
 7. Flouri T, Frousios K, Iliopoulos CS, Park K, Pissis SP, Tischler G: GapMis: a tool for pairwise sequence alignment with a single gap. Recent Pat DNA Gene Seq. 2013, 7: 84-95. 10.2174/1872215611307020002.
 8. Gusfield D: Algorithms on strings, trees, and sequences: computer science and computational biology. 1997, USA: Cambridge University Press
 9. Simpson MA, Irving MD, Asilmaz E, Gray MJ, Dafou D, Elmslie FV, Mansour S, Holder SE, Brain CE, Burton BK, Kim KH, Pauli RM, Aftimos S, Stewart H, Kim CA, Holder-Espinasse M, Robertson SP, Drake WM, Trembath RC: Mutations in NOTCH2 cause Hajdu-Cheney syndrome, a disorder of severe and progressive bone loss. Nature Genetics. 2011, 43 (4): 303-305. 10.1038/ng.779.
 10. Balasubramanian S, Klenerman D, Barnes C, Osborne M: 2007, Patent US20077232656
 11. Ju J, Li Z, Edwards J, Itagaki Y: 2007, Patent EP1790736
 12. Rothberg J, Bader J, Dewell S, McDade K, Simpson J, Berka J, Colangelo C: Founding patent of 454 Life Sciences. 2007, Patent US20077211390
 13. Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009, 10 (3): R25. 10.1186/gb-2009-10-3-r25.
 14. Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009, 25 (16): 1966-1967.
 15. Frousios K, Iliopoulos CS, Mouchard L, Pissis SP, Tischler G: REAL: an efficient REad ALigner for next generation sequencing reads. Proceedings of the first ACM International Conference on Bioinformatics and Computational Biology (BCB 2011). Edited by: Zhang A, Borodovsky M, Özsoyoglu G, Mikler AR. 2010, USA: ACM, 154-159.
 16. Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.
 17. Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012, 9: 357-359. 10.1038/nmeth.1923.
 18. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic Local Alignment Search Tool. Journal of Molecular Biology. 1990, 215 (3): 403-410. 10.1016/S0022-2836(05)80360-2.
 19. Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology. 1970, 48 (3): 443-453. 10.1016/0022-2836(70)90057-4.
 20. Waterman MS, Smith TF: Identification of common molecular subsequences. Journal of Molecular Biology. 1981, 147: 195-197. 10.1016/0022-2836(81)90087-5.
 21. Alachiotis N, Berger S, Flouri T, Pissis SP, Stamatakis A: libgapmis: an ultrafast library for short-read single-gap alignment. Bioinformatics and Biomedicine Workshops (BIBMW), 2012 IEEE International Conference on: 4-7 October 2012. 2012, 688-695. 10.1109/BIBMW.2012.6470221.
 22. Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, Shendure J: Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009, 461 (7261): 272-276. 10.1038/nature08250.
 23. Ostergaard P, Simpson MA, Brice G, Mansour S, Connell FC, Onoufriadis A, Child AH, Hwang J, Kalidas K, Mortimer PS, Trembath R, Jeffery S: Rapid identification of mutations in GJC2 in primary lymphoedema using whole exome sequencing combined with linkage analysis with delineation of the phenotype. J Med Genet. 2011, 48 (4): 251-255. 10.1136/jmg.2010.085563.
 24. Minosche AE, Dohm JC, Himmelbauer H: Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems. Genome Biology. 2011, 12: R112. 10.1186/gb-2011-12-11-r112.
 25. Flouri T, Frousios K, Iliopoulos CS, Park K, Pissis SP, Tischler G: Approximate string-matching with a single gap for sequence alignment. Proceedings of the second ACM International Conference on Bioinformatics and Computational Biology (BCB 2011). Edited by: ACM. 2011, USA: ACM, 490-492.
 26. Crochemore M, Hancart C, Lecroq T: Algorithms on Strings. 2007, USA: Cambridge University Press
 27. Na JC, Roh K, Apostolico A, Park K: Alignment of biological sequences with quality scores. International Journal of Bioinformatics Research and Applications. 2009, 5: 97-113. 10.1504/IJBRA.2009.022466.
 28. National Center for Biotechnology Information (NCBI): 2013, [ftp://ftp.ncbi.nih.gov/blast/matrices/NUC.4.4]
 29. National Center for Biotechnology Information (NCBI): 2013, [ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62]
 30. Rice P, Longden I, Bleasby A: EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics. 2000, 16 (6): 276-277. 10.1016/S0168-9525(00)02024-2.
 31. Alachiotis N, Berger S, Stamatakis A: Coupling SIMD and SIMT architectures to boost performance of a phylogeny-aware alignment kernel. BMC Bioinformatics. 2012, 13: 196. 10.1186/1471-2105-13-196.
 32. Rognes T: Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinformatics. 2011, 12: 221. 10.1186/1471-2105-12-221.
 33. National Center for Biotechnology Information (NCBI): [http://www.ncbi.nlm.nih.gov/]
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.