Problem definition
Notation
The Levenshtein distance between two strings A=a_{1}a_{2}…a_{|A|} and B=b_{1}b_{2}…b_{|B|}, noted as ∥A,B∥_{lev}, is given by a recursive formula (assuming |A|>0, |B|>0):
\|A,B\|_{\text{lev}} := \min\left\{\begin{array}{l} \|a_{1}a_{2}\dots a_{|A|-1},\; b_{1}b_{2}\dots b_{|B|-1}\|_{\text{lev}} + \delta_{a_{|A|},\,b_{|B|}} \\ \|a_{1}a_{2}\dots a_{|A|-1},\; B\|_{\text{lev}} + 1 \\ \|A,\; b_{1}b_{2}\dots b_{|B|-1}\|_{\text{lev}} + 1 \end{array}\right.
where δ_{a,b}=1 if a≠b, otherwise δ_{a,b}=0. When min(|A|,|B|)=0, ∥A,B∥_{lev}:=max(|A|,|B|).
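The recursive definition can be transcribed directly; the following Python sketch (our illustration, not part of the original method) makes the three cases explicit. The naive recursion takes exponential time, so it serves only as a readable specification of the distance:

```python
def lev(a: str, b: str) -> int:
    """Levenshtein distance via the recursive definition above.

    Illustrative only: exponential time; use the dynamic programming
    formulation in practice.
    """
    if min(len(a), len(b)) == 0:          # base case: one string is empty
        return max(len(a), len(b))
    delta = 0 if a[-1] == b[-1] else 1    # delta_{a|A|, b|B|}
    return min(
        lev(a[:-1], b[:-1]) + delta,      # substitute (or match) last chars
        lev(a[:-1], b) + 1,               # delete last char of A
        lev(a, b[:-1]) + 1,               # insert last char of B
    )
```

For example, `lev("kitten", "sitting")` evaluates to 3 (two substitutions and one insertion).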
k-difference problem
Given a sequence S=s_{1}s_{2}…s_{n}, a query pattern P=p_{1}p_{2}…p_{m}, and a threshold k (0≤k<m), search all substrings of S, noted as {P′}, such that ∥P′,P∥_{lev}≤k.
Extended k-difference problem
Given a sequence S=s_{1}s_{2}…s_{n}, a query pattern P=p_{1}p_{2}…p_{m}, and a threshold e (0≤e<1, ⌊m×e⌋=k), search all substrings of S, noted as {P′}, such that ∥P′,P∥_{lev}≤k; and all suffixes of S, noted as {S′}, such that there exists a prefix of P, noted as P′, with ∥S′,P′∥_{lev}≤|S′|×e.
Algorithms
For the k-difference problem, the classic approach [2] computes an (m+1)×(n+1) dynamic programming matrix C[0..m, 0..n] using the following recurrence:
C[i,j] = \min\left\{\begin{array}{l} C[i-1,j-1] + \delta_{ij} \\ C[i-1,j] + 1 \\ C[i,j-1] + 1 \end{array}\right.
where
\delta_{ij} = \left\{\begin{array}{ll} 0, & \text{if } p_i = s_j, \\ 1, & \text{otherwise}. \end{array}\right.
with initialization at the upper boundary by C[0,j]=0 and at the left boundary by C[i,0]=i, for i=1,2,…,m and j=1,2,…,n. Finally, the algorithm tests the last row of the matrix, i.e. C[m,j], and reports those elements that are no greater than k. This algorithm has O(mn) time and O(mn) space complexity.
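A minimal Python transcription of this classic scheme (our illustration; the function name and interface are not from the original program):

```python
def k_difference_matches(P: str, S: str, k: int):
    """Classic O(mn) dynamic programming for the k-difference problem.

    C[0][j] = 0 because an approximate occurrence may start anywhere in S;
    C[i][0] = i. End positions j with C[m][j] <= k are reported.
    """
    m, n = len(P), len(S)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        C[i][0] = i                       # left boundary
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            delta = 0 if P[i - 1] == S[j - 1] else 1
            C[i][j] = min(C[i - 1][j - 1] + delta,   # substitution/match
                          C[i - 1][j] + 1,           # deletion from P
                          C[i][j - 1] + 1)           # insertion into P
    return [j for j in range(n + 1) if C[m][j] <= k]
```

For instance, searching P = "abc" in S = "xabcx" with k = 0 reports the single end position 4, where the exact occurrence of "abc" ends.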
The space bound can be easily reduced to O(m) if matrix C is computed by columns, noted as C_{j} for j=1,2,…,n, and a match is reported each time C_{j}[m]≤k, because computing column C_{j} requires only the knowledge of the previous column C_{j−1}. With careful design, C_{j} and C_{j−1} can share one column vector, as proposed by Ukkonen [4].
Ukkonen also observed that for columns whose last element is greater than k, there is a boundary index of C_{j}, noted as lac(C_{j}), such that C_{j}[lac(C_{j})]=k and C_{j}[l]>k for l=lac(C_{j})+1,…,m. It is easy to prove that lac(C_{j})≤lac(C_{j−1})+1. Using this observation, Ukkonen reduced the time from O(mn) to expected O(kn) [4].
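Ukkonen's cutoff can be sketched as follows (an illustrative Python rendering under our own naming, not the original code). Values above k are capped at k+1, which is safe because all values greater than k are interchangeable for this problem, and only rows up to lac(C_{j−1})+1 are recomputed in each column:

```python
def ukkonen_kdiff(P: str, S: str, k: int):
    """Column-wise DP with Ukkonen's cutoff: O(m) space, expected O(kn) time.

    A single column vector is reused; `lac` tracks the boundary index below
    which all (capped) values exceed k.
    """
    m = len(P)
    cap = k + 1                                   # stand-in for "any value > k"
    col = [min(i, cap) for i in range(m + 1)]     # C[i][0] = i, capped
    lac = min(k, m)                               # boundary index of column 0
    matches = []
    for j in range(1, len(S) + 1):
        top = min(lac + 1, m)                     # lac grows by at most 1/column
        if top > lac:
            col[top] = cap                        # row below old boundary is > k
        prev_diag, col[0] = col[0], 0             # C[0][j] = 0
        for i in range(1, top + 1):
            delta = 0 if P[i - 1] == S[j - 1] else 1
            cur = min(prev_diag + delta, col[i - 1] + 1, col[i] + 1, cap)
            prev_diag, col[i] = col[i], cur       # keep old C[i][j-1] for diag
        lac = top
        while lac > 0 and col[lac] > k:           # re-establish the boundary
            lac -= 1
        if top == m and col[m] <= k:              # last row reached and <= k
            matches.append(j)
    return matches
```

On random sequences the inner loop runs O(k) iterations per column, which is where the expected O(kn) bound comes from.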
Our algorithm was developed from Ukkonen’s algorithm; however, we use a queue instead of an array to store all elements of the current column above the boundary index. When a new element arrives that corresponds to the topmost element of the new column, all elements in the queue shift automatically to the next (lower) position, just as elements transfer along the diagonals of matrix C. This process inherently preserves the basic properties of Ukkonen’s algorithm and facilitates subsequent improvements.
Lemma 1
In the dynamic programming matrix C for tackling the k-difference problem, the values of elements along each diagonal are monotonically non-decreasing.
The proof is provided in Additional file 5 Appendix A.
Theorem 1
All the matched elements of the query pattern and sequence are equal to their precursors in the diagonal and do not need to be updated in the dynamic programming process.
Proof.
This theorem is a direct consequence of Lemma 1 and the dynamic programming recurrence when δ_{ij}=0.
In other words, only mismatched elements need to be updated in the dynamic programming process. If bit-vectors that denote the mismatched positions of comparisons between the adapter sequence and each of the four nucleotide characters are precomputed, and a bit-vector that marks all positions of the queue elements that exceed the k-difference constraint is maintained, then unnecessary computations in updating the column vector can be avoided. This is the key point that led to the main improvement of our algorithm over Ukkonen’s algorithm.
As listed in Algorithm 1, the bitmasked k-difference matching algorithm has the following characteristics:

Use a queue instead of an array to store all elements of the current column above the boundary index.

In preprocessing, calculate for each of the four nucleotide characters a bit-vector that denotes the mismatched positions compared with the adapter.

Mark the internal cells that exceed the k-difference constraint by a bit-vector that shifts as the queue pushes.

When processing the column starting from each input nucleotide, update only the cells that mismatch and have not been marked.
This algorithm uses a queue of size m and several bit-vectors of size ⌈m/w⌉, where w is the word length of the computer (for example, w=64 for a 64-bit machine), and hence has O(m) space complexity. For each of the n characters in a target sequence, the character enters the queue once and exits from the queue at most once. For a random sequence, the expected size of the queue is O(k); hence, the algorithm generally has O(kn) expected time. However, because it is restricted by the bitmask operations, each element in the queue is usually updated at most k+1 times. Because bit operations are negligible compared with element update operations, this algorithm achieves O(kn) worst-case time in practice, which is better than the O(kn) expected time of Ukkonen’s algorithm.
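The preprocessing step for the mismatch bit-vectors (the second characteristic above) can be sketched as follows; this is an illustrative Python rendering under our own naming, not the program's actual code:

```python
def mismatch_bitvectors(adapter: str) -> dict:
    """For each nucleotide c, build a bit-vector whose i-th bit is set
    when adapter[i] != c.

    With these vectors precomputed, the mismatched positions for any input
    character can be fetched in O(m/w) word operations per column.
    """
    vectors = {}
    for c in "ACGT":
        bv = 0
        for i, a in enumerate(adapter):
            if a != c:
                bv |= 1 << i      # bit i marks a mismatch at adapter position i
        vectors[c] = bv
    return vectors
```

For the adapter "ACGT", the vector for 'A' is 0b1110: only position 0 matches, so bits 1-3 are set.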
Algorithm 1 can be improved further by avoiding all unnecessary updates through constant time bit operations within each iteration cycle of the target nucleotide. The basic principle is that when an element in a diagonal of the original dynamic programming matrix has a value that is derived from an adjacent diagonal (i.e. an indel occurs in the corresponding path), the score and associated index of the element will remain unchanged if the precursor remains unchanged.
Although the above improvement can reduce the theoretical time complexity from expected O(kn) to worst-case O(kn), experiments on large volumes of real data showed that the savings from fewer element update operations did not compensate for the additional bit operations.
For the extended k-difference problem, an additional step bounded by O(m) is performed to check all the elements remaining in the queue if no hit was found in the previous steps.
Deal with base-call qualities
The main advantage of the bitmasked k-difference matching algorithm over Myers’ bit-vector algorithm is that it can be extended to handle base-call quality values.
To handle base-call quality values, we introduce the following parameters: P_{min}=−log_{10}(1/3), the minimum penalty for a mismatch; P_{max}=−log_{10}(10^{40/(−10)}/3), the maximum penalty for a mismatch; delta=P_{max}, the penalty for an insertion or deletion. The penalty of a mismatch with quality value q is calculated as:
P(q) = \left\{\begin{array}{ll} P_{\text{min}}, & q \le 0 \\ P_{\text{min}} + q/40 \times (P_{\text{max}} - P_{\text{min}}), & 0 < q < 40 \\ P_{\text{max}}, & q \ge 40 \end{array}\right.
It is trivial to prove that P(q)=−log_{10}(10^{q/(−10)}/3) when 0<q<40. This score is the negative logarithm of the probability that the corresponding base is actually a match to the adapter sequence under sequencing error.
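The penalty formula can be checked numerically; the Python sketch below (our illustration, with hypothetical names) implements P(q) with P_min and P_max computed from the definitions above:

```python
import math

# Penalty at quality 0 and at quality 40, from the definitions above.
P_MIN = -math.log10(1 / 3)                   # = log10(3) ~ 0.477
P_MAX = -math.log10(10 ** (40 / -10) / 3)    # = 4 + log10(3) ~ 4.477


def mismatch_penalty(q: float) -> float:
    """Piecewise-linear mismatch penalty P(q) for base-call quality q."""
    if q <= 0:
        return P_MIN
    if q >= 40:
        return P_MAX
    return P_MIN + q / 40 * (P_MAX - P_MIN)
```

Note that on 0 < q < 40 the linear interpolation coincides exactly with −log10(10^(q/−10)/3) = q/10 + log10(3), so the piecewise form introduces no approximation inside that range.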
Note that the scoring scheme herein only induces penalties for mismatches and insertions/deletions. For matches, because the probability of false matching based on sequence homology decreases exponentially as the matching length increases, the false matching probability can be reduced by setting a longer alignment length. When PE information is available, the false matching probability can be reduced to a minimum.
The extended version of Algorithm 1 that can take quality values into consideration is outlined in Additional file 6 Appendix B.
Deal with paired-end information
Unlike standard SE sequencing, PE sequencing reads out each DNA template twice, in opposite directions from different ends. The underlying fact is that all PE reads that need to be trimmed must have their preserved paired sequences reverse-complementary to each other, as illustrated in Figure 5 and Additional file 7 Appendix C.
Using this property, the program first finds all k-difference occurrences of adapters in both paired reads using the extended version of Algorithm 1 with quality values considered. Then the reverse-complementary property of each pair of trimmed sequences is checked. Next, all candidates are evaluated with a scoring scheme that takes into account the fitness of the adapter sequences in the paired reads and the alignment of the reverse-complementary counterparts. Finally, the program outputs the optimal occurrence, if any.
The scoring scheme we used is as follows:
\begin{array}{ll} \text{score}(\text{idx}) = & \text{pscore}(\text{read1}[\text{idx} \dots \text{readLen}-1], \text{adapter1}) \\ & +\, \text{pscore}(\text{read2}[\text{idx} \dots \text{readLen}-1], \text{adapter2}) \\ & +\, \text{pscore}(\text{read1}[0 \dots \text{idx}-1], \text{revComp}(\text{read2}[0 \dots \text{idx}-1])) \end{array}
where idx is the start position for trimming, pscore(x,y)=|x|×P_{max}−penalty(x,y), P_{max} is the maximum penalty for a difference, penalty(x,y) is the penalty for matching x and y as calculated by the k-difference matching algorithms, and revComp(x) denotes the reverse complementary sequence of x. The goal is to find the idx that meets the k-difference requirement and maximizes the score function.
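The scoring scheme above can be sketched as follows. This Python rendering is illustrative: the names are ours, and `penalty(x, y)` is passed in as a stand-in for the k-difference penalty computation described earlier:

```python
# Complement table for the four nucleotides.
COMP = str.maketrans("ACGT", "TGCA")


def rev_comp(seq: str) -> str:
    """Reverse complement of a nucleotide sequence."""
    return seq.translate(COMP)[::-1]


def score(idx, read1, read2, adapter1, adapter2, p_max, penalty):
    """score(idx) for a candidate trimming start position, per the formula:
    adapter fitness in both reads plus the alignment of the preserved
    fragment of read1 against the reverse complement of that of read2.
    """
    def pscore(x, y):
        return len(x) * p_max - penalty(x, y)   # |x| * P_max - penalty(x, y)

    read_len = len(read1)
    return (pscore(read1[idx:read_len], adapter1)
            + pscore(read2[idx:read_len], adapter2)
            + pscore(read1[0:idx], rev_comp(read2[0:idx])))
```

The optimal idx is then the candidate (among those satisfying the k-difference requirement) that maximizes this score.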
Deal with Nextera LMP information
Mate-pair library sequencing allows the generation of long-insert PE libraries that are useful in the scaffolding process of de novo genome assembly and in the detection of long-range genome structural variations. In the Nextera LMP library construction process, there are additional reactions called “tagmentation” and “circularization” before the normal PE library construction. The tagmentation reaction uses a specially engineered transposome to fragment the DNA sample and simultaneously tag the DNA fragments by attaching a pair of biotinylated junction adapters to the ends. Next, the tagmented DNA molecules are circularized and sheared by ultrasonication, and the subfragments containing the original junction parts are enriched via the biotin tag in the junction adapter.
Trimming adapters from Nextera LMP reads is like a reverse process of Nextera LMP library construction. To process Nextera mate-pair reads, the program first trims the adapters as if it were dealing with PE reads. Then, it trims junction adapters from the processed paired reads separately using the extended version of Algorithm 1.