Linear space string correction algorithm using the Damerau-Levenshtein distance

Background: The Damerau-Levenshtein (DL) distance metric is widely used in the biological sciences. It is used to identify similar regions of DNA, RNA, and protein sequences by transforming one sequence into the other using substitution, insertion, deletion, and transposition operations. Lowrance and Wagner developed an O(mn) time, O(mn) space algorithm to find the minimum cost edit sequence between strings of length m and n, respectively. In our previous research, we developed algorithms that run in O(mn) time using only O(s·min{m,n}+m+n) space, where s is the size of the alphabet comprising the strings, to compute the DL distance as well as the corresponding edit sequence; these were, until now, the fastest and most space-efficient algorithms. In this paper, we focus on the development of algorithms whose asymptotic space complexity is linear.

Results: We develop linear space algorithms to compute the Damerau-Levenshtein (DL) distance between two strings and to determine an optimal trace (corresponding edit operations). Extensive experiments conducted on three computational platforms (Xeon E5-2603, I7-x980, and Xeon E5-2695) show that our algorithms, in addition to using less space, are much faster than earlier algorithms.

Conclusion: Besides using less space than the previously known algorithms, our new algorithms showed significant run-time improvement on all three of our experimental platforms. On all platforms, our linear-space cache-efficient algorithms reduced run time by as much as 56.4% when computing the DL distance and by as much as 57.4% when computing an optimal edit sequence, compared to previous algorithms. Our multi-core algorithms reduced the run time by up to 59.3% compared to the best previously known multi-core algorithms.


Introduction
The Damerau-Levenshtein (DL) distance between two strings is the minimum number of substitutions, inserts, deletes, and transpositions of adjacent characters required to transform one string into the other. Some of the applications of the DL distance are spelling error correction [1-3], comparing packet traces [4], data mining and clustering [5], quantifying the similarity of biological sequences and gene function prediction [6], analysis of B cell receptor repertoire data [7], and virus detection in software [8]. Related distance metrics restrict the permitted edit operations: in the Levenshtein distance, transpositions are not permitted; in the Hamming distance, only substitutions are permitted; and in the Jaro distance [16], only transpositions are permitted. The correct distance metric to use depends on the application. In the applications cited above, the DL distance is used as all 4 edit operations are permitted. Lowrance and Wagner [17] considered a generalization of the DL distance to the case when substitutions, inserts, deletes, and transpositions have different costs. Through a suitable choice of weights, the weighted DL distance can be made equal to the DL distance, Levenshtein distance, Hamming distance, and Jaro distance.
Lowrance and Wagner [17] have developed an O(mn) time and O(mn) space algorithm to find the minimum cost edit sequence (i.e., sequence of substitutions, inserts, deletes, and transpositions) that transforms a given string of length m into a given string of length n, provided that 2T ≥ I + D, where T, I, and D are, respectively, the cost of a transposition, insertion, and deletion. In the DL distance, T = I = D = 1 and so 2T = I + D. Hence, the algorithm of Lowrance and Wagner [17] may be used to compute the DL distance as well as the corresponding edit sequence in O(mn) time and O(mn) space. This observation has also been made in [2]. In [18] we developed algorithms that run in O(mn) time using only O(s · min{m, n} + m + n) space, where s is the size of the alphabet comprising the strings, to compute the DL distance as well as the corresponding edit sequence. Since s << m and s << n in most applications (e.g., s = 20 for protein sequences), this reduction in space enables the solution of much larger instances than is possible using the algorithm of [17]. Our algorithms in [18] are much faster as well. In this paper, we develop algorithms to compute the DL distance and corresponding edit sequence using O(m + n) space and O(mn) time. Extensive experimentation using 3 different platforms indicates that the algorithms of this paper are also faster than those of [18]. In fact, our fastest algorithm for the DL distance is up to 56.4% faster than the fastest algorithm in [18] when run on a single core. The single core speedup to find the corresponding edit sequence is up to 57.4%. Our algorithms may be adapted to run on multicores, providing a speedup of up to 59.3%.

DL dynamic programming recurrences
Let A[1:m] = a_1 a_2 ··· a_m and B[1:n] = b_1 b_2 ··· b_n be two strings of length m and n, respectively. Let H_{i,j} be the DL distance between A[1:i] and B[1:j]. So, H_{m,n} is the DL distance between A and B. The dynamic programming recurrence for H is given below [17, 18].
H_{i,0} = i for 0 ≤ i ≤ m and H_{0,j} = j for 0 ≤ j ≤ n.    (1)

When i > 0 and j > 0,

H_{i,j} = min{ H_{i-1,j-1} + cost(a_i, b_j),  H_{i,j-1} + 1,  H_{i-1,j} + 1,  H_{k-1,l-1} + (i-k-1) + 1 + (j-l-1) },    (2)

where cost(a_i, b_j) = 0 if a_i = b_j and 1 otherwise, k is the last occurrence of b_j in A that precedes position i of A, and l is the last occurrence of a_i in B that precedes position j of B. When k or l does not exist, case 4 of Eq. 2 does not apply.
The four cases in Eq. 2 correspond to the four allowable edit operations: substitution, insertion, deletion, and transposition. These cases are illustrated in Fig. 1, which depicts the possibilities for an optimal transformation of A[1:i] to B[1:j]. Figure 1a illustrates the first case, which is to optimally transform A[1:i-1] into B[1:j-1] and then substitute b_j for a_i. If a_i = b_j, the substitution cost is 0; otherwise it is 1. Figure 1b shows the second case. Here, A[1:i] is optimally transformed into B[1:j-1] and then b_j is inserted at the end. In the third case (Fig. 1c), A[1:i-1] is optimally transformed into B[1:j] and then a_i is deleted. For the fourth and final case (Fig. 1d), assume that k and l exist. In this case, we are going to transpose a_k and a_i. We first optimally transform A[1:k-1] into B[1:l-1]. Since only adjacent characters may be transposed, the transposition of a_k and a_i must be preceded by a deletion of a_{k+1} through a_{i-1}, which results in a_k and a_i becoming adjacent. Following the transposition, we insert b_{l+1} through b_{j-1} between the transposed a_k and a_i, thereby transforming A[1:i] into B[1:j]. The cost of optimally transforming A[1:k-1] into B[1:l-1] is H_{k-1,l-1}. The ensuing deletions have a cost of i - k - 1 as this is the number of deletions performed, the transposition of a_k and a_i costs 1, and the final inserts cost j - l - 1. So, the overall cost of case 4 is H_{k-1,l-1} + (i - k - 1) + 1 + (j - l - 1).

The algorithm of Lowrance and Wagner [17] computes H_{m,n} using an m × n array for H and one-dimensional arrays of size s for lastA and lastB, where s is the size of the alphabet from which the strings A and B are drawn. It computes the H_{i,j}'s by rows beginning with row 1 and, within a row, the elements are computed left-to-right by columns. While algorithm LS_DL of [18] also computes H by rows and, within a row, by columns left-to-right, it does this using a one-dimensional array of size n for the current row being computed, a one-dimensional array of size s for lastA, and an s × n array T that retains, for each character c of the alphabet, the row of H values needed for case 4 of Eq. 2 (values associated with the last row w computed so far for which a_w = c). As noted in [18], when m < n, we may swap the strings A and B, resulting in a space requirement of O(s · min{m, n} + m + n). The time complexity of LS_DL is O(mn). A cache-efficient version, Strip_DL, of LS_DL that computes H by strips whose width is no larger than the cache size is also developed in [18]. This cache-efficient algorithm has the same asymptotic time and space complexities as LS_DL. But, as demonstrated in [18], Strip_DL is much faster than LS_DL.
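For concreteness, the following C sketch evaluates Eqs. 1 and 2 directly using the full (m+1) × (n+1) matrix H. It is our illustration of the recurrence (unit costs; the function name and array layout are ours), not the implementation of [17].

#include <stdlib.h>
#include <string.h>

static int min4(int a, int b, int c, int d)
{
    int m = a;
    if (b < m) m = b;
    if (c < m) m = c;
    if (d < m) m = d;
    return m;
}

/* DL distance per Eqs. 1 and 2, unit costs, full (m+1) x (n+1) matrix. */
int dl_distance_full(const char *a, int m, const char *b, int n)
{
    int (*H)[n + 1] = malloc((size_t)(m + 1) * sizeof *H);   /* H[i][j] */
    int lastA[256];                       /* last row < i whose character is c; 0 if none */
    memset(lastA, 0, sizeof lastA);

    for (int j = 0; j <= n; j++) H[0][j] = j;                 /* Eq. 1 */
    for (int i = 1; i <= m; i++) {
        H[i][0] = i;                                          /* Eq. 1 */
        int l = 0;                        /* last column < j with b = a_i; 0 if none */
        for (int j = 1; j <= n; j++) {
            int k = lastA[(unsigned char)b[j - 1]];
            int sub = H[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);   /* case 1 */
            int ins = H[i][j - 1] + 1;                                    /* case 2 */
            int del = H[i - 1][j] + 1;                                    /* case 3 */
            int tra = (k > 0 && l > 0)    /* case 4 applies only when k and l exist */
                    ? H[k - 1][l - 1] + (i - k - 1) + 1 + (j - l - 1)
                    : sub;
            H[i][j] = min4(sub, ins, del, tra);
            if (b[j - 1] == a[i - 1]) l = j;
        }
        lastA[(unsigned char)a[i - 1]] = i;
    }
    int d = H[m][n];
    free(H);
    return d;
}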
The linear space algorithms we develop in this paper use a refined dynamic programming recurrence for H. We make the observation that when either a_i = b_j or min{i - k, j - l} ≥ 2 in the fourth case of Eq. 2, it is sufficient to consider only the first three cases. To see this, note that when a_i = b_j, the transposition of a_k and a_i done following the deletion of a_{k+1} through a_{i-1} in case 4 (Fig. 1d) is unnecessary as a_k = b_j = a_i. So, one of the first three cases has to result in a smaller value than case 4. Next, consider the case when a_i ≠ b_j and min{i - k, j - l} ≥ 2. Suppose that 2 ≤ i - k ≤ j - l. We may then transform A[1:i] into B[1:j] by optimally transforming A[1:k-1] into B[1:l-1], substituting, where needed, for each of a_k through a_i (at most i - k + 1 substitutions), and inserting the remaining (j - l) - (i - k) characters of B. This costs at most H_{k-1,l-1} + (j - l + 1) which, since i - k ≥ 2, is no more than H_{k-1,l-1} + (i - k - 1) + 1 + (j - l - 1), the cost of case 4. So, doing the transposition (case 4) isn't any better than using only substitutions and inserts. Hence, case 4 need not be considered. The case when 2 ≤ j - l ≤ i - k is symmetric. The preceding observation establishes the correctness of the following refined recurrence for H:

H_{i,0} = i for 0 ≤ i ≤ m and H_{0,j} = j for 0 ≤ j ≤ n.    (3)

When i > 0 and j > 0,

H_{i,j} = min{ H_{i-1,j-1} + cost(a_i, b_j),  H_{i,j-1} + 1,  H_{i-1,j} + 1,  H_{k-1,l-1} + (i-k-1) + 1 + (j-l-1) },    (4)

where the fourth term is considered only when a_i ≠ b_j, both k and l exist, and min{i - k, j - l} = 1 (i.e., k = i - 1 or l = j - 1).
We observe that the above refined recurrence for H holds even in the weighted setting provided that 2S ≤ I + D ≤ 2T, where S, I, D, and T are, respectively, the cost of a substitution, insertion, deletion, and transposition; the cost of a substitution is > 0 when the characters involved are different and 0 when they are the same. The observation follows from an argument analogous to the one just given: when 2 ≤ i - k ≤ j - l, the condition 2S ≤ I + D ≤ 2T ensures that replacing the transposition and its accompanying deletions and insertions by substitutions and insertions does not increase the cost. The case when 2 ≤ j - l ≤ i - k is symmetric. The algorithms developed in this paper are based on our refined recurrence for H.

DL distance algorithms
In this section, we develop two algorithms, LS_DL2 and Strip_DL2, to compute the DL distance between two strings of length m and n drawn from an alphabet of size s. We note that when s > m + n, at least s − m − n characters of the alphabet appear in neither A nor B. So, these nonappearing characters may be removed from the alphabet and we can work with this reduced size alphabet. Hence, throughout this paper, we assume that s ≤ m + n. Our algorithms, which take O(m + n) space, are based on the recurrence of Eqs. 3 and 4 and are the counterparts of algorithms LS_DL and Strip_DL of [18] that are based on the recurrence of Eqs. 1 and 2.
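As an illustration of this alphabet reduction, the following C sketch (the function name is ours) remaps the characters of A and B to a compact alphabet 0..s'-1, where s' ≤ m + n is the number of distinct characters that actually occur in the two strings:

#include <string.h>

/* Write remapped copies of a and b into outa/outb using codes 0..s'-1,
   where s' is the number of distinct characters that occur in a or b.
   Returns s'. */
int compact_alphabet(const char *a, int m, const char *b, int n,
                     unsigned char *outa, unsigned char *outb)
{
    int code[256];
    int next = 0;
    memset(code, -1, sizeof code);        /* -1 marks "character not yet seen" */
    for (int i = 0; i < m; i++) {
        unsigned char c = (unsigned char)a[i];
        if (code[c] < 0) code[c] = next++;
        outa[i] = (unsigned char)code[c];
    }
    for (int j = 0; j < n; j++) {
        unsigned char c = (unsigned char)b[j];
        if (code[c] < 0) code[c] = next++;
        outb[j] = (unsigned char)code[c];
    }
    return next;                          /* s' <= m + n */
}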

Algorithm LS_DL2
Like algorithm LS_DL of [18], LS_DL2 (Algorithm 1) computes H by rows from top to bottom and, within a row, left to right by columns. Among the quantities it maintains are:

FR[q] = H_{k-1,q-2} for the current i in case q ≥ j and for the next i in case q < j.

last_i2l1 is the value to use for H_{i-2,l-1} should this be needed in the computation of H_{i,j}.

In the pseudocode, the variables left, up, and transpose hold the values of cases 2, 3, and 4 of Eq. 4, respectively, and last_col_id records the last occurrence of a_i.

[Algorithm 1 (LS_DL2) pseudocode listing; the line numbers referenced below refer to this listing.]

In line 28, we set last_i2l1 = R[j] = H_{i-2,j}. While this momentarily upsets the semantics of last_i2l1, the semantics are restored upon the start of the next iteration of the for j loop as j increases by 1, or in line 8 if the for j loop terminates and we advance to the next iteration of the for i loop. Line 29 sets R[j] = H_{i,j}, which similarly upsets the semantics of R but correctly sets up the semantics for the next iteration. Finally, line 31 correctly updates last_row_id so as to preserve its semantics for the next iteration of the for i loop.
The correctness of LS_DL2 follows from the preceding discussion. The space and time complexities are readily seen to be O(m + n + s) = O(m + n) and O(mn + s) = O(mn), respectively. When m < n, the space complexity may be reduced by a constant factor by swapping A and B. Using the LRU cache model of [18], one may show that LS_DL2 has approximately 3mn/w cache misses, where w is the width of a cache line. By comparison, the number of cache misses for LS_DL is mn(1 + 3/w).
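To make the flavor of the linear-space computation concrete, the following C sketch evaluates the refined recurrence (Eqs. 3 and 4) while keeping only rows i-2, i-1, and i of H together with two auxiliary arrays: FRval plays a role similar to FR of Algorithm 1 (it stores H_{k-1,q-2}) and FRrow stores the corresponding row index k. This is our simplified rendering for exposition, not Algorithm 1 itself; LS_DL2 uses a more economical arrangement of these arrays.

#include <stdlib.h>

static int min2(int x, int y) { return x < y ? x : y; }

/* Linear-space DL distance via the refined recurrence (Eqs. 3 and 4). */
int dl_distance_linear(const char *a, int m, const char *b, int n)
{
    int *prev2 = malloc((n + 1) * sizeof(int));  /* row i-2 of H */
    int *prev  = malloc((n + 1) * sizeof(int));  /* row i-1 of H */
    int *cur   = malloc((n + 1) * sizeof(int));  /* row i of H   */
    int *FRval = malloc((n + 1) * sizeof(int));  /* FRval[q] = H[k-1][q-2], k = last row < i with a_k = b_q */
    int *FRrow = calloc((size_t)(n + 1), sizeof(int));   /* the corresponding k (0 if none) */

    for (int j = 0; j <= n; j++) prev[j] = j;    /* Eq. 3: H[0][j] = j */

    for (int i = 1; i <= m; i++) {
        cur[0] = i;                              /* Eq. 3: H[i][0] = i */
        int l = 0;                               /* last column < j with b = a_i (0 if none) */
        for (int j = 1; j <= n; j++) {
            int best = min2(prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1),   /* case 1 */
                            min2(cur[j - 1] + 1, prev[j] + 1));             /* cases 2, 3 */
            if (a[i - 1] != b[j - 1]) {          /* refined case 4: k = i-1 or l = j-1 */
                if (i >= 2 && a[i - 2] == b[j - 1] && l >= 1)               /* k = i-1 */
                    best = min2(best, prev2[l - 1] + 1 + (j - l - 1));
                if (j >= 2 && b[j - 2] == a[i - 1] && FRrow[j] >= 1)        /* l = j-1 */
                    best = min2(best, FRval[j] + (i - FRrow[j] - 1) + 1);
            }
            cur[j] = best;
            if (b[j - 1] == a[i - 1]) l = j;
        }
        /* record row i as the most recent occurrence of character a_i */
        for (int q = 2; q <= n; q++)
            if (b[q - 1] == a[i - 1]) { FRval[q] = prev[q - 2]; FRrow[q] = i; }
        int *tmp = prev2; prev2 = prev; prev = cur; cur = tmp;   /* rotate rows */
    }
    int d = prev[n];                             /* after the last rotation, prev holds row m */
    free(prev2); free(prev); free(cur); free(FRval); free(FRrow);
    return d;
}

The sketch uses O(n) integers in addition to the two strings, matching the O(m + n) bound; swapping A and B when m < n reduces this further, as noted above.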

Strip algorithm Strip_DL2
As in [18], we can reduce cache misses, which in turn reduces run time, by partitioning H into n/q strips of size m × q, where q is the largest strip width for which the data needed in the computation of the strip fits into the cache. H is computed by strips from left to right and the computation of each strip is done using LS_DL2. To enable this computation by strips, each strip needs to pass the values computed on its rightmost column, together with the associated auxiliary values, to the strip to its right.
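As a rough illustration (the constants below are ours, not Strip_DL2's), the strip width q may be chosen so that the per-column working data of a strip fits in the cache:

#include <stddef.h>

/* Assumed per-column working set: a handful of int-valued arrays
   (e.g., the five arrays in the sketch above). */
enum { INTS_PER_COLUMN = 5 };

int strip_width(size_t cache_bytes, int n)
{
    int q = (int)(cache_bytes / (INTS_PER_COLUMN * sizeof(int)));
    if (q < 1) q = 1;
    if (q > n) q = n;
    return q;   /* H is then computed in ceil(n / q) strips, left to right */
}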

DL trace algorithms
Wagner and Fischer [19] introduced the concept of a trace to describe an edit sequence when the edit operations are limited to insert, delete, and substitute. Lowrance and Wagner [17] extended the concept of a trace to include transpositions. We reproduce here the definition and example used by us in [18]. A trace for the strings A = a_1 ··· a_m and B = b_1 ··· b_n is a set T of lines, where the endpoints u and v of a line (u, v) denote positions in A and B, respectively. A set of lines T is a trace iff every line (u, v) satisfies 1 ≤ u ≤ m and 1 ≤ v ≤ n and no two lines of T share an endpoint. A line (u, v) is balanced iff a_u = b_v. In a trace, unbalanced lines denote a substitution operation and balanced lines denote retaining the character of A. If a_i has no line attached to it, a_i is to be deleted, and when b_j has no attached line, it is to be inserted. When two balanced lines (u_1, v_1) and (u_2, v_2), u_1 < u_2, cross, a_{u_1+1} ··· a_{u_2-1} are to be deleted from A making a_{u_1} and a_{u_2} adjacent, then a_{u_1} and a_{u_2} are to be transposed, and finally, b_{v_2+1} ··· b_{v_1-1} are to be inserted between the just transposed characters of A.
The edit sequence corresponding to the trace of Fig. 2 is: delete a_2, transpose a_1 and a_3, substitute b for a_4, insert b_4 = b and b_5 = e, and retain a_5. The cost of this edit sequence is 5.
In [18], we used a divide-and-conquer strategy similar to that used by Hirschberg [20] to determine an optimal trace in O(mn) time and O(s · min{m, n} + n) space. In [18], we made a distinction between traces that have a center crossing and those that do not. A trace has a center crossing iff it contains two lines (u_1, v_1) and (u_2, v_2) with v_2 ≤ n/2, v_1 > n/2, and u_1 < u_2 such that both lines are balanced (a_{u_1} = b_{v_1} and a_{u_2} = b_{v_2}) and, additionally, u_1 is the last (i.e., rightmost) occurrence of b_{v_1} in A that precedes position u_2 of A and v_2 is the last occurrence of a_{u_2} in B that precedes position v_1 of B. Fig. 3 shows traces with and without a center crossing.
In [18], we showed that the cost of an optimal trace T is given by Eq. 7 when T has no center crossing and by Eq. 8 when T has a center crossing; hence, the cost of T is the smaller of these two costs. Equation 7 splits the problem at position n/2 of B: it adds the cost of an optimal trace for A[1:i] and B[1:n/2] to that for A[i+1:m] and B[n/2+1:n] and minimizes over the split index i. Equation 8 adds, to the costs of the two subproblems delimited by the center crossing {(u_1, v_1), (u_2, v_2)}, the cost (u_2 - u_1 - 1) + 1 + (v_1 - v_2 - 1) of the deletions, transposition, and insertions implied by the crossing, and minimizes over the candidate crossings. For the min{}, we try 1 ≤ u_1 < m and, for each such u_1, we set v_1 to be the smallest i > n/2 for which b_i = a_{u_1}. For each u_1 we examine all characters other than a_{u_1} in the alphabet. For each such character c, v_2 is set to the largest j ≤ n/2 for which b_j = c and u_2 is the smallest i > u_1 for which a_i = c.
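To illustrate the divide step for the no-center-crossing case (Eq. 7), the following self-contained C sketch finds the split index i that minimizes the sum of the two subproblem costs. For brevity it uses plain Levenshtein costs (substitution, insertion, and deletion only), so it demonstrates only the Hirschberg-style split, not the DL recurrence or the center-crossing case; all names are ours.

#include <limits.h>
#include <stdlib.h>

/* out[i] = unit-cost edit distance between the length-i prefix of a
   (suffix if rev != 0) and all of b[0..nb-1] (scanned right to left if rev != 0). */
static void dist_profile(const char *a, int m, const char *b, int nb,
                         int rev, int *out)
{
    for (int i = 0; i <= m; i++) out[i] = i;
    for (int j = 1; j <= nb; j++) {
        int diag = out[0];
        out[0] = j;
        for (int i = 1; i <= m; i++) {
            int old = out[i];
            char ca = rev ? a[m - i] : a[i - 1];
            char cb = rev ? b[nb - j] : b[j - 1];
            int best = diag + (ca == cb ? 0 : 1);
            if (out[i - 1] + 1 < best) best = out[i - 1] + 1;
            if (old + 1 < best) best = old + 1;
            out[i] = best;
            diag = old;
        }
    }
}

/* Best split row i minimizing D(A[1:i], B[1:n/2]) + D(A[i+1:m], B[n/2+1:n]). */
int best_split(const char *a, int m, const char *b, int n)
{
    int h = n / 2;
    int *fwd = malloc((m + 1) * sizeof(int));
    int *bwd = malloc((m + 1) * sizeof(int));
    dist_profile(a, m, b, h, 0, fwd);          /* prefixes of A vs B[1:h]   */
    dist_profile(a, m, b + h, n - h, 1, bwd);  /* suffixes of A vs B[h+1:n] */
    int best_i = 0, best = INT_MAX;
    for (int i = 0; i <= m; i++)
        if (fwd[i] + bwd[m - i] < best) { best = fwd[i] + bwd[m - i]; best_i = i; }
    free(fwd); free(bwd);
    return best_i;
}

Note that both distance profiles are computed in O(m) space; it is this property that keeps the overall divide-and-conquer trace computation linear in space.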
Our new algorithms, LS_TRACE2 and Strip_TRACE2, are based on an adaptation of Eqs. 7 and 8 using Eq. 4.

Algorithm LS_TRACE2
Consider first the case when the optimal trace has no center crossing. Adapting Eq. 7 to our refined recurrence (Eq. 4) gives the expression of Eq. 9 for costNoCC(T), which is minimized over the split index i. When T has a center crossing {(u_1, v_1), (u_2, v_2)}, it follows from Eq. 4 that either u_1 and u_2 are adjacent in A or v_1 and v_2 are adjacent in B (or both). When v_1 and v_2 are adjacent in B, then v_2 = n/2 and v_1 = n/2 + 1; substituting into Eq. 8, we get Eq. 10. When u_1 and u_2 are adjacent in A, then u_2 - u_1 = 1, v_2 is the rightmost occurrence of A[u_2] in B that precedes position n/2 + 1 (i.e., v_2 ≤ n/2), and v_1 is the leftmost occurrence of A[u_1] in B after position n/2 (i.e., v_1 ≥ n/2 + 1); this gives Eq. 11.

Algorithm LS_TRACE2 (Algorithm 2) provides the pseudocode for our linear space computation of an optimal trace. It assumes that LS_DL2 has been modified to return the arrays R, R1, and FR. The algorithm computes costNoCC(T) using Eq. 9, letting i be the split index that minimizes this cost, and computes costCC(T) using Eqs. 10 and 11, letting (u_1, u_2) and (v_1, v_2) be the indexes that minimize that cost. Whichever of the two costs is smaller determines how the problem is divided; the traces T1 and T2 of the resulting left and right subproblems are computed recursively and combined. [Algorithm 2 (LS_TRACE2) pseudocode listing omitted.]

Using an analysis similar to that used by us in [18] for LS_TRACE, we see that the time complexity of LS_TRACE2 is O(mn). The space required is the same as for LS_DL2. The number of cache misses is approximately twice that for LS_DL2 when invoked with strings of size n and m. Hence, the cache miss count for LS_TRACE2 is ≈ 6mn/w.

Results
We benchmarked the single-core algorithms LS_DL2, Strip_DL2, LS_TRACE2, and Strip_TRACE2 of this paper against the corresponding single-core algorithms developed by us in [18]. Using the parallelization techniques of [18], we obtained multi-core versions of our new algorithms. Their names are obtained by prefixing PP_ to the single-core name (e.g., PP_LS_DL2 is the multi-core version of LS_DL2). The new multi-core versions were also benchmarked against the corresponding multi-core algorithms of [18].

Platforms and test data
The single-core algorithms were implemented using C and the multi-core ones using C and OpenMP. The relative performance of these algorithms was measured on the following platforms:

1. Intel Xeon CPU E5-2603 v2 quad-core processor, 1.8 GHz, with 10 MB cache.
2. Intel I7-x980 six-core processor, 3.33 GHz, with 12 MB LLC cache.
3. Intel Xeon CPU E5-2695 v2, two 12-core processors, 2.40 GHz, with 30 MB cache.
For convenience, we will, at times, refer to these platforms as Xeon4, Xeon6, and Xeon24 (i.e., the number of cores is appended to the name Xeon).
All codes were compiled using the gcc compiler with the -O2 option. On our Xeon4 platform, the benchmarking included a comparison of memory, cache misses, run time, and energy consumption. The cache miss count and the energy consumption were measured using the "perf" [21] software through the RAPL interface. For the Xeon6 and Xeon24 platforms, only the run time was benchmarked.
For test data, we used randomly generated protein sequences as well as real protein sequences obtained from the Protein Data Bank [22] and DNA/RNA/protein sequences from the National Center for Biotechnology Information (NCBI) database [23]. The results for our randomly generated protein sequences were comparable to those for similarly sized sequences from the two databases [22] and [23], so we present only the results for the random data sets here.

Table 1 gives the memory required to process random protein sequences of length 400,000 using each of the single-core DL scoring algorithms considered in this paper. LS_DL takes 4.75 times the memory taken by LS_DL2 and Strip_DL takes 4.69 times the memory taken by Strip_DL2.

Figure 4 and Table 2 give the number of cache misses on our Xeon4 platform for randomly generated sequences of size between 40,000 and 400,000. The column of Table 2 labeled LvsL2 gives the percentage reduction in cache misses achieved by LS_DL2 relative to LS_DL, that labeled SvsS2 gives this percentage for Strip_DL2 relative to Strip_DL, and that labeled L2vsS2 gives this percentage for Strip_DL2 relative to LS_DL2. Strip_DL2 has the fewest cache misses. Strip_DL2 reduces cache misses by up to 91.9% relative to LS_DL2 and by up to 94.4% relative to Strip_DL.

Figure 5 and Table 3 give the run times on our Xeon4 platform for our random data set. In the figure, the time is in seconds while in the table, the time is given using the format hh:mm:ss. The table also gives the percentage reduction in run time.

Xeon E5-2603 (Xeon4) DL distance algorithms
As can be seen, on our Xeon4 platform, Strip_DL2 is the fastest followed by LS_DL2, Strip_DL, and LS_DL. Strip_DL2 reduces run time by up to 15.9% relative to LS_DL2 and by up to 40.1% relative to Strip_DL. Figure 6 and Table 4 give the CPU and cache energy consumed, in joules, on our Xeon4 platform. On our data sets, Strip_DL2 required up to 17.6% less CPU and cache energy than LS_DL2 and up to 40.0% less than Strip_DL. It is interesting to note that the energy reduction is comparable to the reduction in run time, suggesting a close relationship between run time and energy consumption for this application.

Figure 7 and Table 5 give the number of cache misses for the trace algorithms on our Xeon4 platform for randomly generated sequences of size between 40,000 and 400,000. The column of Table 5 labeled LvsL2 gives the percentage reduction in cache misses achieved by LS_Trace2 relative to LS_Trace, that labeled SvsS2 gives this percentage for Strip_Trace2 relative to Strip_Trace, and that labeled L2vsS2 gives this percentage for Strip_Trace2 relative to LS_Trace2. Strip_Trace2 has the fewest cache misses. Figure 8 and Table 6 give the run times on our Xeon4 platform for our random data set. The table also gives the percentage reduction in run time. Strip_Trace2 is the fastest followed by LS_Trace2, Strip_Trace, and LS_Trace (in this order). Strip_Trace2 reduces run time by up to 4.3% relative to LS_Trace2 and by up to 37.8% relative to Strip_Trace. Figure 9 and Table 7 give the CPU and cache energy consumed, in joules, on our Xeon4 platform. Strip_Trace2 required up to 6.3% less CPU and cache energy than LS_Trace2 and up to 37.9% less than Strip_Trace.

Figure 10 and Table 8 give the run times for our parallel DL distance algorithms on our Xeon4 platform. PP_LS_DL2 is up to 61.2% faster than PP_LS_DL and PP_Strip_DL2 is up to 40.6% faster than PP_Strip_DL. Also, PP_LS_DL2 and PP_Strip_DL2 achieve speedups of up to 3.15 and 3.98, respectively, compared to the corresponding single-core algorithms on a four-core machine. Figure 11 and Table 9 give the run times for our parallel DL trace algorithms on our Xeon4 platform. PP_LS_Trace2 is up to 64.5% faster than PP_LS_Trace and PP_Strip_Trace2 is up to 35.4% faster than PP_Strip_Trace. Also, PP_LS_Trace2 and PP_Strip_Trace2 achieve speedups of up to 2.94 and 3.83, respectively, compared to the corresponding single-core algorithms.

Figure 12 and Table 10 give the run times of our single-core distance algorithms on our Xeon6 platform. As can be seen, Strip_DL2 is the fastest followed by LS_DL2, Strip_DL, and LS_DL (in this order). Strip_DL2 reduces run time by up to 10.2% relative to LS_DL2 and by up to 42.0% relative to Strip_DL. Figure 13 and Table 11 give the run times of our single-core trace algorithms on our Xeon6 platform. Strip_Trace2 is the fastest followed by LS_Trace2, Strip_Trace, and LS_Trace (in this order). Strip_Trace2 reduces run time by up to 3.9% relative to LS_Trace2 and by up to 37.6% relative to Strip_Trace.

Figure 14 and Table 12 give the run times for our parallel DL distance algorithms on our Xeon6 platform. PP_LS_DL2 is up to 82.4% faster than PP_LS_DL and PP_Strip_DL2 is up to 39.8% faster than PP_Strip_DL.

Parallel algorithms
Also, PP_LS_DL2 and PP_Strip_DL2 achieve speedups of up to 4.32 and 5.44, respectively, compared to the corresponding single-core algorithms on a six-core machine. Figure 15 and Table 13 give the run times for our parallel DL trace algorithms on our Xeon6 platform. PP_LS_Trace2 is up to 74.7% faster than PP_LS_Trace and PP_Strip_Trace2 is up to 41.8% faster than PP_Strip_Trace. Also, PP_LS_Trace2 and PP_Strip_Trace2 achieve speedups of up to 4.32 and 5.30, respectively, compared to the corresponding single-core algorithms.
Xeon E5-2695 (Xeon24) DL distance algorithms

Figure 16 and Table 14 give the run times of our single-core distance algorithms on our Xeon24 platform. Strip_DL2 is the fastest followed by LS_DL2, Strip_DL, and LS_DL (in this order). Strip_DL2 reduces run time by up to 13.8% relative to LS_DL2 and by up to 56.4% relative to Strip_DL. Figure 17 and Table 15 give the run times of our single-core trace algorithms on our Xeon24 platform. Strip_Trace2 is the fastest followed by LS_Trace2, Strip_Trace, and LS_Trace (in this order). Strip_Trace2 reduces run time by up to 9.4% relative to LS_Trace2 and by up to 57.4% relative to Strip_Trace. Figure 18 and Table 16 give the run times for our parallel DL distance algorithms on our Xeon24 platform.

Parallel algorithms
PP_LS_DL2 is up to 68.7% faster than PP_LS_DL and PP_Strip_DL2 is up to 54.6% faster than PP_Strip_DL. Also, PP_LS_DL2 and PP_Strip_DL2 achieve speedups of up to 11.2 and 21.36, respectively, compared to the corresponding single-core algorithms on a twenty-four-core machine. Figure 19 and Table 17 give the run times for our parallel DL trace algorithms on our Xeon24 platform. PP_LS_Trace2 is up to 58.3% faster than PP_LS_Trace and PP_Strip_Trace2 is up to 59.3% faster than PP_Strip_Trace. Also, PP_LS_Trace2 and PP_Strip_Trace2 achieve speedups of up to 9.65 and 16.4, respectively, compared to the corresponding single-core algorithms.

Discussion
We have developed linear space algorithms to compute the DL distance between two strings and also to determine an optimal trace (edit sequence). Besides using less space than the space efficient algorithms of [18], these algorithms are also faster. Significant run-time improvement (relative to known algorithms) was seen for our new algorithms on all three of our experimental platforms.

Conclusion
On all platforms, the linear-space cache-efficient algorithms Strip_DL2 and Strip_TRACE2 were the best-performing single-core algorithms to determine the DL distance and optimal trace, respectively. Strip_DL2 reduced run time by as much as 56.4% relative to Strip_DL and Strip_TRACE2 reduced run time by as much as 57.4% relative to Strip_TRACE. Our multi-core algorithms reduced the run time by up to 59.3% compared to the best previously known multi-core algorithms.
The linear space string correction algorithm developed in this paper requires 2S ≤ I + D ≤ 2T, where S, I, D, and T are, respectively, the cost of a substitution, insertion, deletion, and transposition. As noted earlier, this requirement is met by the DL distance as, for this metric, S = I = D = T = 1. When only I + D ≤ 2T is satisfied, the more general algorithm of [18] that runs in O(mn) time and uses O(s · min{m, n} + m + n) space, where s is the size of the alphabet comprising the strings, may be used.