
Linear space string correction algorithm using the Damerau-Levenshtein distance

Abstract

Background

The Damerau-Levenshtein (DL) distance metric has been widely used in the biological sciences. It identifies similar regions of DNA, RNA, and protein sequences by transforming one sequence into another using substitution, insertion, deletion, and transposition operations. Lowrance and Wagner developed an O(mn) time, O(mn) space algorithm to find the minimum cost edit sequence between strings of length m and n, respectively. In our previous research, we developed algorithms that run in O(mn) time using only O(s min{m,n}+m+n) space, where s is the size of the alphabet comprising the strings, to compute the DL distance as well as the corresponding edit sequence. These are, so far, the fastest and most space-efficient algorithms. In this paper, we focus on the development of algorithms whose asymptotic space complexity is linear.

Results

We develop linear space algorithms to compute the Damerau-Levenshtein (DL) distance between two strings and determine the optimal trace (the corresponding edit operations). Extensive experiments conducted on three computational platforms (Xeon E5 2603, I7-x980, and Xeon E5 2695) show that our algorithms, in addition to using less space, are much faster than earlier algorithms.

Conclusion

Besides using less space than the previously known algorithms, our new algorithms showed a significant run-time improvement on all three of our experimental platforms. Across these platforms, our linear-space cache-efficient algorithms reduced run time by as much as 56.4% when computing the DL distance and by as much as 57.4% when computing an optimal edit sequence, compared to previous algorithms. Our multi-core algorithms reduced the run time by up to 59.3% compared to the best previously known multi-core algorithms.

Background

Introduction

The Damerau-Levenshtein (DL) distance between two strings is the minimum number of substitutions, inserts, deletes, and transpositions of adjacent characters required to transform one string into the other. Some of the applications of the DL distance are spelling error correction [1–3], comparing packet traces [4], data mining and clustering [5], quantifying the similarity of biological sequences and gene function prediction [6], analysis of B cell receptor repertoire data [7], virus detection in software [8], clustering of RNA-seq read segments [9], DNA repeats detection [10], and codes for DNA based storage [11]. In some of these applications (e.g., spelling error correction), the strings are rather small while in others (e.g., comparing protein sequences) the strings could be tens of thousands of characters long [12], and in yet others (e.g., comparing chromosomes) they could be millions of characters long [13].

Other string edit distances used in the literature permit only a proper subset of the operations permitted by the DL distance. For example, in the Levenshtein distance [14] transpositions are not permitted, in the Hamming distance [15] only substitutions are permitted, and in the Jaro distance [16], only transpositions are permitted. The correct distance metric to use depends on the application. In the applications cited above, the DL distance is used as all 4 edit operations are permitted.

Lowrance and Wagner [17] considered a generalization of DL distance to the case when substitutions, inserts, deletes, and transpositions have different costs. Through a suitable choice of weights, the weighted DL distance can be made equal to the DL distance, Levenshtein distance, Hamming distance, and Jaro distance.

Lowrance and Wagner [17] have developed an O(mn) time and O(mn) space algorithm to find the minimum cost edit sequence (i.e., sequence of substitutions, inserts, deletes, and transpositions) that transforms a given string of length m into a given string of length n provided that 2T≥I+D, where T, I, and D are, respectively, the cost of a transposition, insertion, and deletion. In the DL distance, T=I=D=1 and so, 2T=I+D. Hence, the algorithm of Lowrance and Wagner [17] may be used to compute the DL distance as well as the corresponding edit sequence in O(mn) time and O(mn) space. This observation has also been made in [2]. In [18] we developed algorithms that run in O(mn) time using only O(s min{m,n}+m+n) space, where s is the size of the alphabet comprising the strings, to compute the DL distance as well as the corresponding edit sequence. Since s<<m and s<<n in most applications (e.g., s=20 for protein sequences), this reduction in space enables the solution of much larger instances than is possible using the algorithm of [17]. Our algorithms in [18] are much faster as well. In this paper, we develop algorithms to compute the DL distance and corresponding edit sequence using O(m+n) space and O(mn) time. Extensive experimentation using 3 different platforms indicates that the algorithms of this paper are also faster than those of [18]. In fact, our fastest algorithm for the DL distance is up to 56.4% faster than the fastest algorithm in [18] when run on a single core. The single core speedup to find the corresponding edit sequence is up to 57.4%. Our algorithms may be adapted to run on multicores, providing a speedup of up to 59.3%.

DL dynamic programming recurrences

Let A[1:m]=a1a2...am and B[1:n]=b1b2...bn be two strings of length m and n, respectively. Let Hi,j be the DL distance between A[1:i] and B[1:j]. So, Hm,n is the DL distance between A and B. The dynamic programming recurrence for H is given below [17, 18].

$$ H_{i,0} = i,\ H_{0,j} = j, \ 0 \le i \le m, \ 0 \le j \le n $$
(1)

When i>0 and j>0,

$$ H_{i,j} = \min\left\{ \begin{array} {lcr} H_{i-1,j-1}+ c(a_{i},b_{j}) \\ H_{i,j-1}+ 1 \\ H_{i-1,j}+ 1 \\ H_{k-1,l-1} + (i-k-1)+ 1 + (j-l-1) \\ \end{array} \right. $$
(2)

where c(ai,bj) is 1 if ai≠bj and 0 otherwise, k=lastA[i][bj] is the last (i.e., rightmost) occurrence of bj in A that precedes position i of A, and l=lastB[j][ai] is the last occurrence of ai in B that precedes position j of B. When k or l does not exist, case 4 of Eq. 2 does not apply.

The four cases in Eq. 2 correspond to the four allowable edit operations: substitution, insertion, deletion and transposition. These cases are illustrated in Fig. 1, which depicts the possibilities for an optimal transformation of A[1:i] to B[1:j]. Figure 1a illustrates the first case, which is to optimally transform A[1:i−1] into B[1:j−1] and then substitute bj for ai. If ai=bj, the substitution cost is 0, otherwise it is 1. Figure 1b shows the second case. Here, A[1:i] is optimally transformed into B[1:j−1] and then bj is inserted at the end. In the third case (Fig. 1c) A[1:i−1] is optimally transformed into B[1:j] and then ai is deleted. For the fourth and final case (Fig. 1d), assume that k and l exist. In this case, we are going to transpose ak and ai. We first optimally transform A[1:k−1] into B[1:l−1]. Since only adjacent characters may be transposed, the transposition of ak and ai must be preceded by a deletion of ak+1 through ai−1, which results in ak and ai becoming adjacent. Following the transposition, we insert bl+1 through bj−1 between the transposed ak and ai, thereby transforming A[1:i] into B[1:j]. The cost of optimally transforming A[1:k−1] into B[1:l−1] is Hk−1,l−1. The ensuing deletions have a cost of i−k−1 as this is the number of deletions performed, the transposition of ak and ai costs 1, and the final inserts cost j−l−1. So, the overall cost of case 4 is Hk−1,l−1 + (i−k−1)+1 + (j−l−1).
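The recurrence of Eqs. 1 and 2 can be sketched directly as a full-table computation. The Python below is only an illustration under our own naming (`dl_distance_full`, `last_row`, and `last_col` are hypothetical names, not taken from [17] or [18]); it uses O(mn) space and is meant to make the four cases concrete, not to reproduce any published implementation.

```python
def dl_distance_full(a: str, b: str) -> int:
    """DL distance via Eqs. 1 and 2 using a full (m+1) x (n+1) table."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        H[i][0] = i                      # Eq. 1: i deletions
    for j in range(n + 1):
        H[0][j] = j                      # Eq. 1: j insertions
    last_row = {}                        # character -> last row k < i with a_k == character
    for i in range(1, m + 1):
        last_col = 0                     # last column l < j with b_l == a_i (0 = none)
        for j in range(1, n + 1):
            k = last_row.get(b[j - 1], 0)
            l = last_col
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(H[i - 1][j - 1] + cost,   # case 1: substitution (or match)
                       H[i][j - 1] + 1,          # case 2: insertion
                       H[i - 1][j] + 1)          # case 3: deletion
            if k > 0 and l > 0:                  # case 4 applies only if k and l exist
                best = min(best,
                           H[k - 1][l - 1] + (i - k - 1) + 1 + (j - l - 1))
            H[i][j] = best
            if cost == 0:
                last_col = j
        last_row[a[i - 1]] = i
    return H[m][n]
```

For instance, `dl_distance_full("ab", "ba")` evaluates to 1 (a single transposition), and `dl_distance_full("ca", "abc")` evaluates to 2 (a transposition with one character inserted between the transposed pair).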

Fig. 1

DL distance recurrence. a substitution. b insertion. c deletion. d translate A[k:i] to B[l:j] where (ak,bj) and (bl,ai) form a transposition opportunity

The algorithm of Lowrance and Wagner [17] computes Hm,n using an m×n array for H and one-dimensional arrays of size s for lastA and lastB, where s is the size of the alphabet from which the strings A and B are drawn. It computes the Hi,j’s by rows beginning with row 1 and within a row, the elements are computed left-to-right by columns. While algorithm LS_DL of [18] also computes H by rows and within a row by columns left-to-right, it does this using a one-dimensional array of size n for the current row being computed, a one-dimensional array of size s for lastA, and an s×n array T with the property that if w is the last row of H computed so far such that A[w]=c, then T[c][*]=H[w−1][*]. As noted in [18], when m<n, we may swap the strings A and B, resulting in a space requirement of O(s min{m,n}+n). The time complexity of LS_DL is O(mn). A cache-efficient version, Strip_DL, of LS_DL that computes H by strips whose width is no larger than the cache size is also developed in [18]. This cache-efficient algorithm has the same asymptotic time and space complexities as LS_DL but, as demonstrated in [18], Strip_DL is much faster than LS_DL.

The linear space algorithms we develop in this paper use a refined dynamic programming recurrence for H. We make the observation that when either ai=bj or min{i−k,j−l}≥2 in the fourth case of Eq. 2, then it is sufficient to consider only the first three cases. To see this, note that when ai=bj the transposition of ak and ai done following the deletion of ak+1 through ai−1 in case 4 (Fig. 1d) is unnecessary as ak = bj = ai. So, one of the first three cases yields a value no larger than that of case 4. Next, consider the case when ai≠bj and min{i−k,j−l}≥2. Suppose that 2≤i−k≤j−l. The cost of transforming A[1:i] to B[1:j] by using an optimal transformation of A[1:k−1] to B[1:l−1] and then doing j−l+1 substitutions and inserts is Hk−1,l−1+j−l+1. Doing the transformation as in case 4 has a cost Hk−1,l−1+(i−k−1)+1+(j−l−1)≥Hk−1,l−1+j−l+1. So, doing the transposition (case 4) isn’t any better than using only substitutions and inserts. Hence, case 4 need not be considered. The case when 2≤j−l≤i−k is symmetric.

The preceding observation establishes the correctness of the following refined recurrence for H.

$$ H_{i,0} = i,\ H_{0,j} = j, \ 0 \le i \le m, \ 0 \le j \le n $$
(3)

When i>0 and j>0,

$$ {\begin{aligned} H_{i,j} = \min\left\{ \begin{array} {lcr} H_{i-1,j-1}+ c(a_{i},b_{j}) \\[-2pt] H_{i,j-1}+ 1 \\[-2pt] H_{i-1,j}+ 1 \\[-2pt] \left\{ \begin{array} {lcr} H_{k-1,j-2} + (i-k) \\[-2pt] \ \ \ \ \ \ if\ j-l=1\ and\ a_{i} \ne b_{j} \\[-2pt] H_{i-2,l-1} + (j-l) \\[-2pt] \ \ \ \ \ \ if\ i-k=1\ \ and\ a_{i} \ne b_{j} \\[-2pt] \infty \ otherwise \end{array} \right. \end{array} \right. \end{aligned}} $$
(4)

where c(ai,bj) is 1 if ai≠bj and 0 otherwise, k=lastA[i][bj] and l=lastB[j][ai].
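To make the difference from Eq. 2 concrete, the following is a minimal full-table sketch of the refined recurrence of Eqs. 3 and 4. The function and variable names are ours and purely illustrative; the sketch still uses O(mn) space, since its only purpose is to show that the transposition case reads just H[k−1][j−2] (when j−l=1) or H[i−2][l−1] (when i−k=1).

```python
def dl_distance_refined(a: str, b: str) -> int:
    """DL distance via the refined recurrence (Eqs. 3 and 4), full table."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        H[i][0] = i                      # Eq. 3
    for j in range(n + 1):
        H[0][j] = j                      # Eq. 3
    last_row = {}                        # character -> last row k < i holding it
    for i in range(1, m + 1):
        last_col = 0                     # last column l < j with b_l == a_i (0 = none)
        for j in range(1, n + 1):
            k = last_row.get(b[j - 1], 0)
            l = last_col
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(H[i - 1][j - 1] + cost,   # case 1
                       H[i][j - 1] + 1,          # case 2
                       H[i - 1][j] + 1)          # case 3
            # Case 4 is considered only when a_i != b_j and the transposed
            # characters are adjacent in A (i - k == 1) or in B (j - l == 1).
            if cost == 1 and k > 0 and l > 0:
                if j - l == 1:
                    best = min(best, H[k - 1][j - 2] + (i - k))
                if i - k == 1:
                    best = min(best, H[i - 2][l - 1] + (j - l))
            H[i][j] = best
            if cost == 0:
                last_col = j
        last_row[a[i - 1]] = i
    return H[m][n]
```

On the small inputs we tried by hand, for example ("ab", "ba") with distance 1 and ("ca", "abc") with distance 2, this gives the same values as the unrefined recurrence, as the argument above predicts.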

We observe that the above refined recurrence for H holds even in the weighted setting provided that 2S≤I+D≤2T, where S, I, D, and T are, respectively, the cost of a substitution, insertion, deletion, and transposition; the cost of a substitution is >0 when the characters involved are different and 0 when these are the same. This observation follows from the following.

When ai=bj, S=0:

$$ \begin{aligned} &H_{k-1,l-1} + (i-k-1)D+ T + (j-l-1)I \\ &= H_{k,l} + (i-k-1)D+ T + (j-l-1)I \\ &\ge H_{i-1,l} + T + (j-l-1)I \\ &\ge H_{i-1,j-1} \\ &= H_{i,j} \end{aligned} $$
(5)

When 2≤i−k≤j−l:

$$ \begin{aligned} &H_{k-1,l-1} + (i-k-1)D+ T + (j-l-1)I \\ &\ge H_{k-1,l-1} + (i-k-1)(2S-I)+ S + (j-l-1)I \\ &= H_{k-1,l-1} + (i-k+1)S + ((j-l)-(i-k))I\\ &\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ + (i-k-2)S \\ &\ge H_{k-1,l-1} + (i-k+1)S + ((j-l)-(i-k))I \\ &\ge H_{i,j} \\ \end{aligned} $$
(6)

The case when 2≤j−l≤i−k is symmetric.

The algorithms developed in this paper are based on our refined recurrence for H.

Methods

DL distance algorithms

In this section, we develop two algorithms, LS_DL2 and Strip_DL2, to compute the DL distance between two strings of length m and n drawn from an alphabet of size s. We note that when s>m+n, at least s−m−n characters of the alphabet appear in neither A nor B. So, these non-appearing characters may be removed from the alphabet and we can work with this reduced size alphabet. Hence, throughout this paper, we assume that s≤m+n. Our algorithms, which take O(m+n) space, are based on the recurrence of Eqs. 3 and 4 and are the counterparts of algorithms LS_DL and Strip_DL of [18] that are based on the recurrence of Eqs. 1 and 2.

Algorithm LS_DL2

Like algorithm LS_DL of [18], LS_DL2 (Algorithm 1) computes H by rows from top to bottom and within a row by columns from left to right. For convenience, we augment H with row −1 and column −1. All values on this row and on this column are maxVal, where maxVal is a large value. Algorithm LS_DL2 uses 4 one-dimensional arrays, last_row_id[1:s], R[−1:n], R1[−1:n], and FR[−1:n], and a few simple variables that have the following semantics when we are computing Hi,j. Here, k and l are as in Eq. 4.

  1. FR[q]=Hk−1,q−2 for the current i in case q≥j and for the next i in case q<j

  2. R[q]=Hi,q if q<j and Hi−2,q if q≥j

  3. R1[q]=Hi−1,q

  4. last_row_id[c] = largest k<i such that A[k]=c

  5. last_col_id = largest l<j such that B[l]=A[i]

  6. T is the value to use for Hi−2,l−1 should this be needed in the computation of Hi,j

  7. last_i2l1=Hi−2,j−1

  8. diag = case 1 of Eq. 4

  9. left = case 2 of Eq. 4

  10. up = case 3 of Eq. 4

  11. transpose = case 4 of Eq. 4

Lines 2 and 3 of the algorithm initialize FR, R1, and R so that following the swap of line 6, R1[q]=H0,q and R[q]=H−1,q, −1≤q≤n. In other words, at the start of iteration i=1 of the for loop of lines 9–30, R1 and R, respectively, correspond to rows i−1 (i.e., row 0) and i−2 (i.e., row −1) of H. At this time, FR[−1:n]=maxVal, which corresponds to the initial situation that k=lastA[i][bj] is undefined. This will be updated as lastA[i][bj] gets defined. last_row_id[1:s] is initialized to −1 in line 4 for each character c to indicate the fact that at the start of the i=1 loop, A[p], p<1, isn’t a character of the alphabet. last_col_id is set to −1 at the start of each iteration of the loop of lines 5–32 as at the start of this loop, no character of B has been examined and so there is no last seen occurrence of A[i] in B (for this iteration of the for loop of lines 5–32). Also, at the start of the computation of each row of H, last_i2l1 is set to R[0], because, by the semantics of last_i2l1, when we are computing Hi,1, last_i2l1=Hi−2,0=R[0]. Following this, R[0] is set to i to indicate that the cheapest way to transform A[1:i] into B[0:0] is to do i deletes at a total cost of i (hence, when we are computing Hi,1 in the loop of lines 9–30, R[q]=Hi,q, q<1), and T is set to maxVal as when we start a row computation, l is undefined. So, the initializations establish the variable semantics given above.

In lines 10–12 of the loop of lines 9–30, diag, left, and up are set to the values specified in cases 1–3 of Eq. 4. Note that from the semantics of the variables, R1[j−1]=Hi−1,j−1, R[j−1]=Hi,j−1, and R1[j]=Hi−1,j at this time. Line 13 computes the minimum of the terms in the first 3 cases of Eq. 4. If A[i]=B[j] (line 14), then Hi,j is determined by cases 1–3 and the value of temp computed in line 13 is Hi,j. At this time, we need to update last_col_id as the most recently seen occurrence of A[i] is now at position j of B. Since A[i]=B[j], lastA[i+1][bj]=i. So, the Hk−1,j−2 to use for the next i in case 4 is Hi−1,j−2=R1[j−2]. This value is saved in FR[j] in line 16. Since A[i]=B[j], the value to use for Hi−2,l−1 in case 4 of Eq. 4 in future iterations of the for j loop (until, of course, we encounter another j where A[i]=B[j]) becomes Hi−2,j−1, which by the variable semantics is last_i2l1. This value is saved in T in line 17.

When A[i]≠B[j], lines 19–26 are executed. From the semantics of last_row_id and last_col_id, it follows that line 19 correctly sets k and l. Lines 22 and 24 compute the cost of case 4 when j−l=1 and i−k=1, respectively. Note that by the semantics of FR and T, FR[j]=Hk−1,j−2 and T=Hi−2,l−1. So, lines 22 and 25 update temp to be Hi,j. When we reach line 29, regardless of whether A[i]=B[j] or A[i]≠B[j], R[j]=Hi−2,j and temp=Hi,j.

In line 28, we set last_i2l1=R[j]=Hi−2,j. While this momentarily upsets the semantics of last_i2l1, the semantics are restored upon the start of the next iteration of the for j loop as j increases by 1, or in line 8 if the for j loop terminates and we advance to the next iteration of the for i loop. Line 29 sets R[j]=Hi,j, which similarly upsets the semantics of R but correctly sets up the semantics for the next iteration. Finally, line 31 correctly updates last_row_id so as to preserve its semantics for the next iteration of the for i loop.

The correctness of LS_DL2 follows from the preceding discussion. The space and time complexities are readily seen to be O(m+n+s)=O(m+n) and O(mn+s)=O(mn), respectively. When m<n, the space complexity may be reduced by a constant factor by swapping A and B. Using the LRU cache model of [18], one may show that LS_DL2 has approximately 3mn/w cache misses, where w is the width of a cache line. By comparison, the number of cache misses for LS_DL is mn(1+3/w).
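The walkthrough above can be turned into a short linear-space sketch. The Python below is our reading of the description, not the authors' Algorithm 1: it does not follow that algorithm's line numbering, it replaces last_row_id[1:s] with a dictionary, and it stores column q of R, R1, and FR at list index q+1 so that column −1 is representable. The function name `ls_dl2` and the choice of maxVal are assumptions made for illustration.

```python
def ls_dl2(a: str, b: str) -> int:
    """Linear-space DL distance following the LS_DL2 variable semantics."""
    m, n = len(a), len(b)
    max_val = 2 * (m + n) + 2            # plays the role of maxVal
    # Arrays cover columns -1..n; column q lives at index q + 1.
    R = [max_val] + list(range(n + 1))   # row 0: H[0][q] = q, H[0][-1] = maxVal
    R1 = [max_val] * (n + 2)             # row -1: all maxVal
    FR = [max_val] * (n + 2)             # FR[q] = H[k-1][q-2], undefined at first
    last_row_id = {}                     # character -> largest k < i with A[k] == c
    for i in range(1, m + 1):
        R, R1 = R1, R                    # now R1 = row i-1 and R = row i-2
        last_col_id = -1                 # no occurrence of A[i] seen in B yet
        last_i2l1 = R[1]                 # H[i-2][0]
        R[1] = i                         # H[i][0] = i
        T = max_val                      # l is undefined at the start of a row
        ai = a[i - 1]
        for j in range(1, n + 1):
            bj = b[j - 1]
            diag = R1[j] + (0 if ai == bj else 1)    # H[i-1][j-1] + c(a_i, b_j)
            left = R[j] + 1                          # H[i][j-1] + 1
            up = R1[j + 1] + 1                       # H[i-1][j] + 1
            temp = min(diag, left, up)
            if ai == bj:
                last_col_id = j
                FR[j + 1] = R1[j - 1]                # H[i-1][j-2], used by later rows
                T = last_i2l1                        # H[i-2][j-1], used by later columns
            else:
                k = last_row_id.get(bj, -1)
                l = last_col_id
                if j - l == 1:                       # transposed pair adjacent in B
                    temp = min(temp, FR[j + 1] + (i - k))
                if i - k == 1:                       # transposed pair adjacent in A
                    temp = min(temp, T + (j - l))
            last_i2l1 = R[j + 1]                     # H[i-2][j], needed at column j+1
            R[j + 1] = temp                          # H[i][j]
        last_row_id[ai] = i
    return R[n + 1]
```

Only three arrays of length n+2 and a per-character table are kept, which is where the O(m+n) space bound comes from once the shorter string is used for the columns. As a quick check, `ls_dl2("ab", "ba")` evaluates to 1 and `ls_dl2("ca", "abc")` evaluates to 2.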

Strip algorithm Strip_DL2

As in [18], we can reduce cache misses, which in turn reduces run time, by partitioning H into n/q strips of size m×q, where q is the largest strip width for which the data needed in the computation of the strip fits into the cache. H is computed by strips from left to right and the computation of each strip is done using LS_DL2. To enable this computation by strips, one strip needs to pass computed values to the next using three additional one-dimensional arrays C, C1, and FC of size m each. C records the values of H computed for the rightmost column in the strip; C1 records the values of H computed for the next to rightmost column in the strip; and FC[i] is the value of T (i.e., H[i−2][l−1]) at row i, where l is the last column where B[l]=A[i] in the strip. We name this new algorithm as Strip_DL2. The space complexity of Strip_DL2 is O(m+n) and its time complexity is O(mn). For the LRU cache model of [18] the number of cache misses is approximately \(\frac {6mn}{wq}\).
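The cache-miss estimates quoted for LS_DL, LS_DL2, and Strip_DL2 are easy to compare numerically. The helper functions below simply evaluate the three formulas as stated in the text; the function names and the sample values of w and q are hypothetical and chosen only for illustration, not taken from the experiments.

```python
def misses_ls_dl(m: int, n: int, w: int) -> float:
    """Estimated cache misses for LS_DL: mn(1 + 3/w)."""
    return m * n * (1 + 3 / w)


def misses_ls_dl2(m: int, n: int, w: int) -> float:
    """Estimated cache misses for LS_DL2: 3mn/w."""
    return 3 * m * n / w


def misses_strip_dl2(m: int, n: int, w: int, q: int) -> float:
    """Estimated cache misses for Strip_DL2: 6mn/(wq)."""
    return 6 * m * n / (w * q)


# Illustrative values only: w = 16 entries per cache line, strip width q = 100.
m = n = 1000
w, q = 16, 100
ratio = misses_ls_dl(m, n, w) / misses_ls_dl2(m, n, w)   # equals (w + 3) / 3
```

Under these formulas the ratio of LS_DL to LS_DL2 misses is (w+3)/3 regardless of m and n, and Strip_DL2 improves on LS_DL2 by a factor of q/2.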

DL trace algorithms

Wagner and Fischer [19] introduced the concept of a trace to describe an edit sequence when the edit operations are limited to insert, delete, and substitute. Lowrance and Wagner [17] extended the concept of a trace to include transpositions. We reproduce here the definition and example used by us in [18]. A trace for the strings A=a1...am and B=b1...bn is a set T of lines, where the endpoints u and v of a line (u,v) denote positions in A and B, respectively. A set of lines T is a trace iff:

  1. For every (u,v)∈T, u≤m and v≤n.

  2. The lines in T have distinct A positions and distinct B positions. That is, no two lines in T have the same u or the same v.

A line (u,v) is balanced iff au=bv and two lines (u1,v1) and (u2,v2) cross iff (u1<u2) and (v1>v2). It is easy to see that T={(1,2),(3,1),(4,3),(5,6)} (see Fig. 2) is a trace for the strings A=dafac and B=fdbbec. Line (4,3) is not balanced as a4≠b3. The remaining 3 lines in the trace are balanced. The lines (1,2) and (3,1) cross.

Fig. 2

DL trace example [18]

In a trace, unbalanced lines denote a substitution operation and balanced lines denote retaining the character of A. If ai has no line attached to it, ai is to be deleted and when bj has no attached line, it is to be inserted. When two balanced lines (u1,v1) and (u2,v2), u1<u2, cross, \(a_{u_{1}+1} \cdots a_{u_{2}-1}\) are to be deleted from A making \(a_{u_{1}}\) and \(a_{u_{2}}\) adjacent, then \(a_{u_{1}}\) and \(a_{u_{2}}\) are to be transposed, and finally, \(b_{v_{2}+1} \cdots b_{v_{1}-1}\) are to be inserted between the just transposed characters of A.

The edit sequence corresponding to the trace of Fig. 2 is delete a2, transpose a1 and a3, substitute b for a4, insert b4=b and b5=e, retain a5. The cost of this edit sequence is 5.
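The definitions of trace, balanced line, and crossing, together with the cost of the example, can be checked with a few lines of code. This is a sketch under simplifying assumptions of ours: positions are 1-based as in the text, every crossing pair is taken to denote exactly one unit-cost transposition, and the helper names are hypothetical.

```python
from itertools import combinations


def is_trace(lines, m, n):
    """Lines stay within the strings and use distinct A and B positions."""
    us = [u for u, _ in lines]
    vs = [v for _, v in lines]
    in_range = all(1 <= u <= m and 1 <= v <= n for u, v in lines)
    return in_range and len(set(us)) == len(us) and len(set(vs)) == len(vs)


def is_balanced(line, a, b):
    u, v = line
    return a[u - 1] == b[v - 1]          # a_u == b_v (1-based positions)


def crosses(line1, line2):
    (u1, v1), (u2, v2) = sorted([line1, line2])   # order so that u1 < u2
    return v1 > v2


def trace_cost(lines, a, b):
    """Unit-cost edit sequence implied by a trace."""
    deletions = len(a) - len(lines)      # characters of A with no line
    insertions = len(b) - len(lines)     # characters of B with no line
    substitutions = sum(1 for ln in lines if not is_balanced(ln, a, b))
    transpositions = sum(1 for p, q in combinations(lines, 2) if crosses(p, q))
    return deletions + insertions + substitutions + transpositions


A, B = "dafac", "fdbbec"
T = [(1, 2), (3, 1), (4, 3), (5, 6)]
```

For the trace of Fig. 2, `trace_cost(T, A, B)` counts one deletion, two insertions, one substitution, and one transposition, for a total of 5, matching the cost given above.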

In [18], we used a divide-and-conquer strategy similar to that used by Hirschberg [20] to determine an optimal trace in O(mn) time and O(s min{m,n}+n) space. In [18], we made a distinction between traces that have a center crossing and those that do not. A trace has a center crossing iff it contains two lines (u1,v1) and (u2,v2) such that v2≤n/2 and v1>n/2, u1<u2, while satisfying (a) \(a_{i} \neq a_{u_{1}} = b_{v_{1}}, u_{1} < i < u_{2}\) and (b) \(b_{j} \neq b_{v_{2}} = a_{u_{2}}, v_{2} < j < v_{1}\). In words, u1 is the last (i.e., rightmost) occurrence of \(b_{v_{1}}\) in A that precedes position u2 of A and v2 is the last occurrence of \(a_{u_{2}}\) in B that precedes position v1 of B (Fig. 3).

Fig. 3

Traces with and without center crossings [18]. a No center crossing. b With center crossing

In [18], we showed that the cost of an optimal trace T is given by Eq. 7 when T has no center crossing and by Eq. 8 when T has a center crossing. Hence, the cost of T is the smaller of these two costs.

$$ costNoCC(T) = \min_{1 \le i \le m}\{ H[i] + H'[i+1]\} $$
(7)

where H[i] is the cost of an optimal trace for A[1:i] and B[1:n/2] and H′[i+1] is that for an optimal trace for A[i+1:m] and B[n/2+1:n].
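The split in Eq. 7 can be illustrated with a small sketch that evaluates the forward and backward costs from full DL tables. This is not the linear-space method of the paper: it uses O(mn) space, it lets i range over 0..m so that an empty prefix or suffix of A is allowed, and it obtains the backward costs by reversing both strings, which relies on the DL distance being unchanged under reversal. All names are ours.

```python
def dl_table(a: str, b: str):
    """Full DL table for Eqs. 1 and 2; H[i][j] = DL(a[:i], b[:j])."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        H[i][0] = i
    for j in range(n + 1):
        H[0][j] = j
    last_row = {}
    for i in range(1, m + 1):
        last_col = 0
        for j in range(1, n + 1):
            k, l = last_row.get(b[j - 1], 0), last_col
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(H[i - 1][j - 1] + cost, H[i][j - 1] + 1, H[i - 1][j] + 1)
            if k > 0 and l > 0:
                best = min(best, H[k - 1][l - 1] + (i - k - 1) + 1 + (j - l - 1))
            H[i][j] = best
            if cost == 0:
                last_col = j
        last_row[a[i - 1]] = i
    return H


def cost_no_center_crossing(a: str, b: str) -> int:
    """Best cost over traces that split B at n/2 with no center crossing."""
    m, n = len(a), len(b)
    h = n // 2
    fwd = dl_table(a, b[:h])                    # fwd[i][h] = DL(a[:i], b[:h])
    bwd = dl_table(a[::-1], b[h:][::-1])        # bwd[m-i][n-h] = DL(a[i:], b[h:])
    return min(fwd[i][h] + bwd[m - i][n - h] for i in range(m + 1))
```

For "kitten" and "sitting", where no transposition helps, the split recovers the DL distance of 3. For "ab" and "ba" it returns 2 while the DL distance is 1, because the only optimal trace has a center crossing; that gap is what the center-crossing cost accounts for.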

$$ {{}\begin{aligned} costCC(T) = \min\{H[u_{1}\,-\,1][v_{2}\,-\,1] + H'[u_{2}\,+\,1][v_{1}\!+1] \\[-2pt] + (u_{2}-u_{1}-1) + 1 + (v_{1}-v_{2}-1)\} \end{aligned}} $$
(8)

where H[i][j] is the cost of an optimal trace for A[1:i] and B[1:j] and H′[i][j] is that for an optimal trace for A[i:m] and B[j:n]. For the min{}, we try 1≤u1<m and for each such u1, we set v1 to be the smallest i>n/2 for which \(b_{i} = a_{u_{1}}\). For each u1 we examine all characters other than \(a_{u_{1}}\) in the alphabet. For each such character c, v2 is set to the largest j≤n/2 for which bj=c and u2 is the smallest i>u1 for which ai=c.

Our new algorithms, LS_TRACE2 and Strip_TRACE2, are based on an adaptation of Eqs. 7 and 8 using Eq. 4.

Algorithm LS_TRACE2

Consider the case when the optimal trace has no center crossing. Let \(R^{f}[]\) be the value of R[] when LS_DL2(B[1:n/2],A[1:m]) terminates and let \(R'^{f}[]\) be the value of R[] when LS_DL2(B[n:n/2+1],A[m:1]) terminates. Let \(R1^{f}\), \(R1'^{f}\), \(FR^{f}\), and \(FR'^{f}\) be the corresponding final values for R1 and FR. From Eq. 7 and LS_DL2, we obtain

$$ \begin{aligned} costNoCC(T) & = \min_{1 \le i \le m}\{ H[i] + H'[i+1]\} \\ & = \min_{1 \le i \le m}\{R^{f}[i] + R'^{f}[i+1] \} \\ \end{aligned} $$
(9)

When T has a center crossing {(u1,v1),(u2,v2)}, then it follows from Eq. 4 that either u1 and u2 are adjacent in A or v1 and v2 are adjacent in B (or both). When v1 and v2 are adjacent in B, then \(v_{2} = \frac {n}{2}\) and \(v_{1}=\frac {n}{2}+1\). Substituting into Eq. 8, we get

$$ {{}\begin{aligned} costCC(T) &\,=\, \min\{H[u_{1}\,-\,1][v_{2}-1] \,+\, H'[u_{2}\,+\,1][v_{1}\,+\,1] \\ & + (u_{2}-u_{1}-1) + 1 + (v_{1}-v_{2}-1)\} \\ &= \min\{H[u_{1}\,-\,1][\frac{n}{2}\,-\,1] + H'[u_{2}\,+\,1][\frac{n}{2}+2] \\ & + (u_{2}-u_{1}) \} \\ &= \min\left\{R1^{f}[u_{1}\,-\,1] \,+\, R1'^{f}[u_{2}+1]+(u_{2}\,-\,u_{1}) \right\} \\ \end{aligned}} $$
(10)

When u1 and u2 are adjacent in A, then u2−u1=1, v2 is the rightmost occurrence of A[u2] in B that precedes position \(\frac {n}{2}+1\) (i.e., v2≤n/2) and v1 is the leftmost occurrence of A[u1] in B after position \(\frac {n}{2}\) (i.e., v1≥n/2+1). So, we have

$$ {{}\begin{aligned} costCC(T) &= \min\{H[u_{1}\,-\,1][v_{2}\,-\,1] + H'[u_{2}\,+\,1][v_{1}\,+\,1] \\ & + (u_{2}-u_{1}-1) + 1 + (v_{1}-v_{2}-1)\} \\ &= \min\left\{H[u_{1}\,-\,1][v_{2}\,-\,1] \,+\, H'[u_{2}\,+\,1][v_{1}+1]\right.\\ & \left. + (v_{1}-v_{2}) \right\} \\ &\,=\, \min\left\{FR^{f}[u_{1}\,+\,1] + FR'^{f}[u_{2}\,-\,1]+ (v_{1}-v_{2}) \right\} \\ \end{aligned}} $$
(11)

Algorithm LS_TRACE2 (Algorithm 2) provides the pseudocode for our linear space computation of an optimal trace. It assumes that LS_DL2 has been modified to return the arrays R,R1 and FR.

Using an analysis similar to that used by us in [18] for the analysis of LS_TRACE, we see that the time complexity of LS_TRACE2 is O(mn). The space required is the same as for LS_DL2. The number of cache misses is approximately twice that for LS_DL2 when invoked with strings of size n and m. Hence, the cache miss count for LS_TRACE2 is ≈6mn/w.

Strip trace algorithm Strip_TRACE2

This algorithm differs from LS_TRACE2 in that it uses a modified version of Strip_DL2 rather than a modified version of LS_DL2. The modified version of Strip_DL2 returns the arrays C, C1 and FC computed by Strip_DL2. The asymptotic time complexity of Strip_TRACE2 is also O(mn) and it takes the same amount of space as does Strip_DL2. The number of cache misses is approximately twice that for Strip_DL2.

Results

We benchmarked the single-core algorithms LS_DL2, Strip_DL2, LS_TRACE2, and Strip_TRACE2 of this paper against the corresponding single-core algorithms developed by us in [18]. Using the parallelization techniques of [18], we obtained multi-core versions of our new algorithms. Their names are obtained by prefixing PP_ to the single-core name (e.g., PP_LS_DL2 is the multi-core version of LS_DL2). The new multi-core versions were also benchmarked against the corresponding multi-core algorithms of [18].

Platforms and test data

The single-core algorithms were implemented using C and the multi-core ones using C and OpenMP. The relative performance of these algorithms was measured on the following platforms:

  1. 1.

    Intel Xeon CPU E5-2603 v2 Quad-Core processor 1.8GHz with 10MB cache.

  2. 2.

    Intel I7-x980 Six-Core processor 3.33GHz with 12MB LLC cache.

  3. 3.

    Intel Xeon CPU E5-2695 v2 2x12-Core processors 2.40GHz with 30MB cache.

For convenience, we will, at times, refer to these platforms as Xeon4, Xeon6, and Xeon24 (i.e., the number of cores is appended to the name Xeon).

All codes were compiled using the gcc compiler with the -O2 option. On our Xeon4 platform, the benchmarking included a comparison of memory, cache misses, run time, and energy consumption. The cache miss count and the energy consumption were measured using the "perf" [21] software through the RAPL interface. For the Xeon6 and Xeon24 platforms only the run time was benchmarked.

For test data, we used randomly generated protein sequences as well as real protein sequences obtained from the Protein Data Bank [22] and DNA/RNA/protein sequences from the National Center for Biotechnology Information (NCBI) database [23]. The results for our randomly generated protein sequences were comparable to those for similarly sized sequences used from the two databases [22] and [23]. So, we present only the results for the random data sets here.

Xeon E5-2603 (Xeon4)

DL distance algorithms

Table 1 gives the memory required to process random protein sequences of length 400,000 using each of the single-core DL scoring algorithms considered in this paper. LS_DL takes 4.75 times the memory taken by LS_DL2 and Strip_DL takes 4.69 times the memory taken by Strip_DL2.

Table 1 Memory usage for DL distance algorithms on Xeon4

Figure 4 and Table 2 give the number of cache misses on our Xeon4 platform for randomly generated sequences of size between 40,000 and 400,000. The column of Table 2 labeled LvsL2 gives the percentage reduction in cache misses achieved by LS_DL2 relative to LS_DL, that labeled SvsS2 gives this percentage for Strip_DL2 relative to Strip_DL, and that labeled L2vsS2 gives this percentage for Strip_DL2 relative to LS_DL2. Strip_DL2 has the fewest cache misses. Strip_DL2 reduces cache misses by up to 91.9% relative to LS_DL2 and by up to 94.4% relative to Strip_DL.

Fig. 4

Cache misses for DL distance algorithms, in billions, on Xeon4

Table 2 Cache misses for DL distance algorithms, in millions, on Xeon4

Figure 5 and Table 3 give the run times on our Xeon4 platform for our random data set. In the figure, the time is in seconds while in the table, the time is given using the format hh:mm:ss. The table also gives the percentage reduction in run time.

Fig. 5

Run time for DL distance algorithms on Xeon4

Table 3 Run time in hh:mm:ss for DL distance algorithms on Xeon4

As can be seen, on our Xeon4 platform, Strip_DL2 is the fastest followed by LS_DL2, Strip_DL, and LS_DL (in this order). Strip_DL2 reduces run time by up to 15.9% relative to LS_DL2 and by up to 40.1% relative to Strip_DL.

Figure 6 and Table 4 give the CPU and cache energy consumed, in joules, on our Xeon4 platform. On our data sets, Strip_DL2 required up to 17.6% less CPU and cache energy than LS_DL2 and up to 40.0% less than Strip_DL. It is interesting to note that the energy reduction is comparable to the reduction in run time suggesting a close relationship between run time and energy consumption for this application.

Fig. 6

CPU/cache energy for DL distance algorithms on Xeon4

Table 4 CPU/cache energy in joules for DL distance algorithms on Xeon4

DL trace algorithms

Figure 7 and Table 5 give the number of cache misses for the trace algorithms on our Xeon4 platform for randomly generated sequences of size between 40,000 and 400,000. The column of Table 5 labeled LvsL2 gives the percentage reduction in cache misses achieved by LS_Trace2 relative to LS_Trace, that labeled SvsS2 gives this percentage for Strip_Trace2 relative to Strip_Trace, and that labeled L2vsS2 gives this percentage for Strip_Trace2 relative to LS_Trace2. Strip_Trace2 has the fewest cache misses. Strip_Trace2 reduces cache misses by up to 89.1% relative to LS_Trace2 and by up to 95.4% relative to Strip_Trace.

Fig. 7

Cache misses for DL trace algorithms on Xeon4

Table 5 Cache misses in millions for DL trace algorithms on Xeon4

Figure 8 and Table 6 give the run times on our Xeon4 platform for our random data set. The table also gives the percentage reduction in run time. Strip_Trace2 is the fastest followed by LS_Trace2, Strip_Trace, and LS_Trace (in this order). Strip_Trace2 reduces run time by up to 4.3% relative to LS_Trace2 and by up to 37.8% relative to Strip_Trace.

Fig. 8

Run time for DL trace algorithms on Xeon4

Table 6 Run time in hh:mm:ss for DL trace algorithms on Xeon4

Figure 9 and Table 7 give the CPU and cache energy consumed, in joules, on our Xeon4 platform. Strip_Trace2 required up to 6.3% less CPU and cache energy than LS_Trace2 and up to 37.9% less than Strip_Trace.

Fig. 9

CPU/cache energy for DL trace algorithms on Xeon4

Table 7 CPU/cache energy in joules for DL trace algorithms on Xeon4

Parallel algorithms

Figure 10 and Table 8 give the run times for our parallel DL distance algorithms on our Xeon4 platform. PP_LS_DL2 is up to 61.2% faster than PP_LS_DL and PP_Strip_DL2 is up to 40.6% faster than PP_Strip_DL. Also, PP_LS_DL2 and PP_Strip_DL2 achieve a speedup of up to 3.15 and 3.98 compared to the corresponding single-core algorithms on a four-core machine, respectively.

Fig. 10

Run time of parallel DL distance algorithms, in seconds, on Xeon4

Table 8 Run time of parallel DL distance algorithms, in hh:mm:ss, on Xeon4

Figure 11 and Table 9 give the run times for our parallel DL trace algorithms on our Xeon4 platform. PP_LS_Trace2 is up to 64.5% faster than PP_LS_Trace and PP_Strip_Trace2 is up to 35.4% faster than PP_Strip_Trace. Also, PP_LS_Trace2 and PP_Strip_Trace2 achieve a speedup of up to 2.94 and 3.83, respectively, compared to the corresponding single-core algorithms.

Fig. 11

Run time of parallel DL trace algorithms, in seconds, on Xeon4

Table 9 Run time of parallel DL trace algorithms, in hh:mm:ss, on Xeon4

I7-x980 (Xeon6)

DL distance algorithms

Figure 12 and Table 10 give the run times of our single-core distance algorithms on our Xeon6 platform. As can be seen, Strip_DL2 is the fastest followed by LS_DL2, Strip_DL, and LS_DL (in this order). Strip_DL2 reduces run time by up to 10.2% relative to LS_DL2 and by up to 42.0% relative to Strip_DL.

Fig. 12

Run time for DL distance algorithms on Xeon6

Table 10 Run time in hh:mm:ss for DL distance algorithms on Xeon6

DL trace algorithms

Figure 13 and Table 11 give the run times of our single-core trace algorithms on our Xeon6 platform. Strip_Trace2 is the fastest, followed by LS_Trace2, Strip_Trace, and LS_Trace (in this order). Strip_Trace2 reduces run time by up to 3.9% relative to LS_Trace2 and by up to 37.6% relative to Strip_Trace.

Fig. 13 Run time for DL trace algorithms on Xeon6

Table 11 Run time in hh:mm:ss for DL trace algorithms on Xeon6

Parallel algorithms

Figure 14 and Table 12 give the run times for our parallel DL distance algorithms on our Xeon6 platform. PP_LS_DL2 is up to 82.4% faster than PP_LS_DL and PP_Strip_DL2 is up to 39.8% faster than PP_Strip_DL. Also, PP_LS_DL2 and PP_Strip_DL2 achieve speedups of up to 4.32 and 5.44, respectively, compared to the corresponding single-core algorithms on a six-core machine.

Fig. 14 Run time of parallel DL distance algorithms, in seconds, on Xeon6

Table 12 Run time of parallel DL distance algorithms, in hh:mm:ss, on Xeon6

Figure 15 and Table 13 give the run times for our parallel DL trace algorithms on our Xeon6 platform. PP_LS_Trace2 is up to 74.7% faster than PP_LS_Trace and PP_Strip_Trace2 is up to 41.8% faster than PP_Strip_Trace. Also, PP_LS_Trace2 and PP_Strip_Trace2 achieve speedups of up to 4.32 and 5.30, respectively, compared to the corresponding single-core algorithms.

Fig. 15 Run time of parallel DL trace algorithms, in seconds, on Xeon6

Table 13 Run time of parallel DL trace algorithms, in hh:mm:ss, on Xeon6

Xeon E5-2695 (Xeon24)

DL distance algorithms

Figure 16 and Table 14 give the run times of our single-core distance algorithms on our Xeon24 platform. Strip_DL2 is the fastest, followed by LS_DL2, Strip_DL, and LS_DL (in this order). Strip_DL2 reduces run time by up to 13.8% relative to LS_DL2 and by up to 56.4% relative to Strip_DL.

Fig. 16 Run time for DL distance algorithms on Xeon24

Table 14 Run time in hh:mm:ss for DL distance algorithms on Xeon24

DL trace algorithms

Figure 17 and Table 15 give the run times of our single-core trace algorithms on our Xeon24 platform. Strip_Trace2 is the fastest, followed by LS_Trace2, Strip_Trace, and LS_Trace (in this order). Strip_Trace2 reduces run time by up to 9.4% relative to LS_Trace2 and by up to 57.4% relative to Strip_Trace.

Fig. 17 Run time for DL trace algorithms on Xeon24

Table 15 Run time in hh:mm:ss for DL trace algorithms on Xeon24

Parallel algorithms

Figure 18 and Table 16 give the run times for our parallel DL distance algorithms on our Xeon24 platform. PP_LS_DL2 is up to 68.7% faster than PP_LS_DL and PP_Strip_DL2 is up to 54.6% faster than PP_Strip_DL. Also, PP_LS_DL2 and PP_Strip_DL2 achieve speedups of up to 11.2 and 21.36, respectively, compared to the corresponding single-core algorithms on a twenty-four-core machine.

Fig. 18 Run time of parallel DL distance algorithms, in seconds, on Xeon24

Table 16 Run time of parallel DL distance algorithms, in hh:mm:ss, on Xeon24

Figure 19 and Table 17 give the run times for our parallel DL trace algorithms on our Xeon24 platform. PP_LS_Trace2 is up to 58.3% faster than PP_LS_Trace and PP_Strip_Trace2 is up to 59.3% faster than PP_Strip_Trace. Also, PP_LS_Trace2 and PP_Strip_Trace2 achieve speedups of up to 9.65 and 16.4, respectively, compared to the corresponding single-core algorithms.

Fig. 19 Run time of parallel DL trace algorithms, in seconds, on Xeon24

Table 17 Run time of parallel DL trace algorithms, in hh:mm:ss, on Xeon24
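
To compare the parallel results across machines with different core counts, the speedups quoted above can be normalized to parallel efficiency (speedup divided by number of cores). The Python sketch below does this for the parallel DL distance speedups reported for the three platforms. The function name `parallel_efficiency` and the dictionary layout are illustrative assumptions, and the core count of 24 for Xeon24 is taken from the platform name.

```python
# Reported maximum speedups of the parallel DL distance algorithms over their
# single-core counterparts, keyed by platform: (cores, {algorithm: speedup}).
# The core count for Xeon24 is an assumption based on the platform name.
REPORTED_SPEEDUPS = {
    "Xeon4": (4, {"PP_LS_DL2": 3.15, "PP_Strip_DL2": 3.98}),
    "Xeon6": (6, {"PP_LS_DL2": 4.32, "PP_Strip_DL2": 5.44}),
    "Xeon24": (24, {"PP_LS_DL2": 11.2, "PP_Strip_DL2": 21.36}),
}


def parallel_efficiency(speedup: float, cores: int) -> float:
    """Return speedup per core; 1.0 would be ideal linear scaling."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


if __name__ == "__main__":
    for platform, (cores, speedups) in REPORTED_SPEEDUPS.items():
        for name, speedup in speedups.items():
            eff = parallel_efficiency(speedup, cores)
            print(f"{platform:7s} {name:13s} speedup {speedup:5.2f} "
                  f"efficiency {eff:.2f}")
```

Under these numbers, PP_Strip_DL2 stays close to linear scaling on all three machines (efficiency of about 0.89 or higher), while the efficiency of PP_LS_DL2 falls below 0.5 on Xeon24.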

Discussion

We have developed linear space algorithms to compute the DL distance between two strings and also to determine an optimal trace (edit sequence). Besides using less space than the space efficient algorithms of [18], these algorithms are also faster. Significant run-time improvement (relative to known algorithms) was seen for our new algorithms on all three of our experimental platforms.

Conclusion

On all platforms, the linear-space cache-efficient algorithms Strip_DL2 and Strip_Trace2 were the best-performing single-core algorithms for determining the DL distance and an optimal trace, respectively. Strip_DL2 reduced run time by as much as 56.4% relative to Strip_DL, and Strip_Trace2 reduced run time by as much as 57.4% relative to Strip_Trace. Our multi-core algorithms reduced run time by up to 59.3% compared to the best previously known multi-core algorithms.
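
A percentage reduction in run time can be restated as a speedup factor, which is sometimes easier to compare with the parallel speedups reported earlier. The sketch below shows the conversion for the three headline figures; the helper name `speedup_from_reduction` is an illustrative assumption.

```python
def speedup_from_reduction(percent_reduction: float) -> float:
    """Convert a run-time reduction in percent into a speedup factor.

    A reduction of p% means the new run time is (1 - p/100) of the old one,
    so the speedup is old/new = 1 / (1 - p/100).
    """
    if not 0 <= percent_reduction < 100:
        raise ValueError("percent_reduction must be in [0, 100)")
    return 1.0 / (1.0 - percent_reduction / 100.0)


if __name__ == "__main__":
    headline = [
        ("Strip_DL2 vs Strip_DL", 56.4),
        ("Strip_Trace2 vs Strip_Trace", 57.4),
        ("multi-core vs best previous multi-core", 59.3),
    ]
    for label, pct in headline:
        factor = speedup_from_reduction(pct)
        print(f"{label}: {pct}% less run time, about {factor:.2f}x faster")
```

For example, the 56.4% reduction corresponds to a speedup of roughly 2.3.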

The linear space string correction algorithm developed in this paper requires 2S≤I+D≤2T, where S, I, D, and T are, respectively, the costs of a substitution, insertion, deletion, and transposition. As noted earlier, this requirement is met by the DL distance because, for this metric, S=I=D=T=1. When only I+D≤2T is satisfied, the more general algorithm of [18], which runs in O(mn) time and uses O(s min{m,n}+m+n) space, where s is the size of the alphabet comprising the strings, may be used.
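
As a point of reference for the cost parameters discussed above, the following Python sketch implements the classical O(mn)-time, O(mn)-space dynamic program in the style of Lowrance and Wagner [17] with explicit costs S, I, D, and T. It is not the linear-space algorithm of this paper. The function name `dl_distance`, the dictionary used in place of an alphabet-indexed array, and the default costs are illustrative assumptions, and the code checks only the condition I+D≤2T.

```python
import math


def dl_distance(a: str, b: str, S: float = 1, I: float = 1,
                D: float = 1, T: float = 1) -> float:
    """Weighted Damerau-Levenshtein distance from a to b (reference version)."""
    if I + D > 2 * T:
        raise ValueError("the recurrence used here assumes I + D <= 2T")
    m, n = len(a), len(b)
    # H is shifted by one: H[i + 1][j + 1] is the cost of turning a[:i] into b[:j].
    # Row 0 and column 0 hold infinity so an undefined transposition never wins.
    H = [[math.inf] * (n + 2) for _ in range(m + 2)]
    for i in range(m + 1):
        H[i + 1][1] = i * D          # delete the first i characters of a
    for j in range(n + 1):
        H[1][j + 1] = j * I          # insert the first j characters of b
    last_row = {}                    # last_row[c]: largest i seen with a[i - 1] == c
    for i in range(1, m + 1):
        last_col = 0                 # largest j in this row with b[j - 1] == a[i - 1]
        for j in range(1, n + 1):
            k = last_row.get(b[j - 1], 0)
            l = last_col
            if a[i - 1] == b[j - 1]:
                sub = 0
                last_col = j
            else:
                sub = S
            H[i + 1][j + 1] = min(
                H[i][j] + sub,       # match or substitution
                H[i + 1][j] + I,     # insertion
                H[i][j + 1] + D,     # deletion
                # transpose a[k - 1] and a[i - 1], deleting and inserting
                # the characters that lie between the transposed pair
                H[k][l] + (i - k - 1) * D + T + (j - l - 1) * I,
            )
        last_row[a[i - 1]] = i
    return H[m + 1][n + 1]
```

With unit costs, `dl_distance("CA", "ABC")` evaluates to 2 (one transposition followed by one insertion), a case in which the restricted "optimal string alignment" variant would give 3.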

Availability of data and materials

Data sharing is not applicable to this article as all test data are randomly generated protein sequences.

Abbreviations

CPU: Central Processing Unit

DL: Damerau-Levenshtein

DNA: DeoxyriboNucleic acid

LLC: Last level cache

LRU: Least recently used

NCBI: National Center for Biotechnology Information

PDB: Protein Data Bank

RNA: RiboNucleic acid

References

  1. Brill E, Moore RC. An improved error model for noisy channel spelling correction. In: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics: 2000. p. 286–93.

  2. Bard GV. Spelling-error tolerant, order-independent pass-phrases via the Damerau-Levenshtein string-edit distance metric. In: Proceedings of the Fifth Australasian Symposium on ACSW Frontiers - Volume 68. ACSW ’07. Darlinghurst: Australian Computer Society, Inc.: 2007. p. 117–24.

  3. Li M, Zhang Y, Zhu M, Zhou M. Exploring distributional similarity based models for query spelling correction. In: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics. ACL-44. Stroudsburg: Association for Computational Linguistics: 2006. p. 1025–32.

  4. Cai X, Zhang XC, Joshi B, Johnson R. Touching from a distance: Website fingerprinting attacks and defenses. In: Proceedings of the 2012 ACM Conference on Computer and Communications Security. New York: ACM: 2012. p. 605–16.

  5. Faloutsos C, Megalooikonomou V. On data mining, compression, and kolmogorov complexity. Data Min Knowl Discov. 2007; 15(1):3–20.

  6. Majorek KA, Dunin-Horkawicz S, Steczkiewicz K, Muszewska A, Nowotny M, Ginalski K, Bujnicki JM. The RNase H-like superfamily: new members, comparative structural analysis and evolutionary classification. In: Nucleic Acids Research, vol. 42. Oxford, United Kingdom: Oxford University Press: 2014. p. 4160–4179.

  7. Bischof J, Ibrahim SM. bcRep: R package for comprehensive analysis of B cell receptor repertoire data. PLoS ONE. 2016; 11(8):1–15.

  8. Pomorova O, Savenko O, Lysenko S, Nicheporuk A. Information and Communication Technologies in Education, Research, and Industrial Applications. In: 12th Internal conference. Kyiv: Springer: 2016. http://ceur-ws.org/Vol-1614/.

  9. Biswas AK, Gao JX. PR2S2Clust: patched rna-seq read segments’ structure-oriented clustering. J Bioinforma Comput Biol. 2016; 14(05):1650027.

  10. Pop PG. In-memory dedicated dot-plot analysis for DNA repeats detection. In: Intelligent Computer Communication and Processing (ICCP), 2016 IEEE 12th International Conference On. Washington, DC: IEEE: 2016. p. 317–20.

  11. Gabrys R, Yaakobi E, Milenkovic O. Codes in the damerau distance for deletion and adjacent transposition correction. IEEE Trans Inf Theory. 2018; 64(4):2550–70.

  12. Largest Protein Sequence: Titin. https://en.wikipedia.org/wiki/Titin. Accessed 15 Feb 2019.

  13. Largest Chromosome. https://en.wikipedia.org/wiki/Chromosome_1. Accessed 15 Feb 2019.

  14. Levenshtein VI. Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl. 1966; 10(8):707–10.

  15. Robinson DJS. An Introduction to Abstract Algebra. Berlin: Walter de Gruyter; 2003, pp. 255–7.

  16. Jaro MA. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J Am Stat Assoc. 1989; 84(406):414–20.

  17. Lowrance R, Wagner RA. An extension of the string-to-string correction problem. J ACM. 1975; 22(2):177–83.

  18. Zhao C, Sahni S. String correction using the damerau-levenshtein distance. BMC Bioinforma. 2019; 20:277. https://doi.org/10.1186/s12859-019-2819-0.

  19. Wagner RA, Fischer MJ. The string-to-string correction problem. J ACM. 1974; 21(1):168–73.

  20. Hirschberg DS. A linear space algorithm for computing longest common subsequences. Commun ACM. 1975; 18(6):341–3.

  21. Perf Tool. https://perf.wiki.kernel.org/index.php/Main_Page. Accessed 15 Feb 2019.

  22. Protein Data Bank. http://www.rcsb.org/pdb/home/home.do. Accessed 15 Feb 2019.

  23. NCBI Database. http://www.ncbi.nlm.nih.gov/gquery. Accessed 15 Feb 2019.

Acknowledgments

The authors also would like to thank the anonymous reviewers for their valuable comments and suggestions.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 21 Supplement 1, 2020: Selected articles from the 8th IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2018): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-21-supplement-1.

Funding

Publication of this supplement was funded, in part, by the National Science Foundation under award NSF 1447711. Publication costs were funded by the University of Florida.

Author information

Contributions

CZ and SS developed the new linear space cache efficient DL distance algorithms, did theoretical analysis and the experimental results analysis, and wrote the manuscript. CZ programmed the algorithms and ran the benchmark tests. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Chunchun Zhao.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Zhao, C., Sahni, S. Linear space string correction algorithm using the Damerau-Levenshtein distance. BMC Bioinformatics 21 (Suppl 1), 4 (2020). https://doi.org/10.1186/s12859-019-3184-8

  • DOI: https://doi.org/10.1186/s12859-019-3184-8

Keywords