 Research
 Open access
 Published:
Linear space string correction algorithm using the DamerauLevenshtein distance
BMC Bioinformatics volume 21, Article number: 4 (2020)
Abstract
Background
The DamerauLevenshtein (DL) distance metric has been widely used in the biological science. It tries to identify the similar region of DNA,RNA and protein sequences by transforming one sequence to the another using the substitution, insertion, deletion and transposition operations. Lowrance and Wagner have developed an O(mn) time O(mn) space algorithm to find the minimum cost edit sequence between strings of length m and n, respectively. In our previous research, we have developed algorithms that run in O(mn) time using only O(s∗min{m,n}+m+n) space, where s is the size of the alphabet comprising the strings, to compute the DL distance as well as the corresponding edit sequence. These are so far the fastest and most space efficient algorithms. In this paper, we focus on the development of algorithms whose asymptotic space complexity is linear.
Results
We develop linear space algorithms to compute the DamerauLevenshtein (DL) distance between two strings and determine the optimal trace (corresponding edit operations.)Extensive experiments conducted on three computational platforms–Xeon E5 2603, I7x980 and Xeon E5 2695–show that, our algorithms, in addition to using less space, are much faster than earlier algorithms.
Conclusion
Besides using less space than the previously known algorithms,significant runtime improvement was seen for our new algorithms on all three of our experimental platforms. On all platforms, our linearspace cacheefficient algorithms reduced run time by as much as 56.4% and 57.4% in respect to compute the DL distance and an optimal edit sequences compared to previous algorithms. Our multicore algorithms reduced the run time by up to 59.3% compared to the best previously known multicore algorithms.
Background
Introduction
The DamerauLevenshtein (DL) distance between two strings is the minimum number of substitutions, inserts, deletes, and transpositions of adjacent characters required to transform one string into the other. Some of the applications of the DL are spelling error correction [1–3], comparing packet traces [4], data mining and clustering [5], quantifying the similarity of biological sequences, and gene function prediction [6], analysis of B cell receptor repertoire data [7], virus detection in software [8], clustering of RNAseq read se6gments [9], DNA repeats detection [10], and codes for DNA based storage [11]. In some of these applications (e.g., spelling error correction), the strings are rather small while in others (e.g., comparing protein sequences) the strings could be tens of thousands of characters long [12], and in yet others (e.g., comparing chromosomes) they could be millions of characters long [13].
Other string edit distances used in the literature permit only a proper subset of the operations permitted by the DL distance. For example, in the Levenshtein distance [14] transpositions are not permitted, in the Hamming distance [15] only substitutions are permitted, and in the Jaro distance [16], only transpositions are permitted. The correct distance metric to use depends on the application. In the applications cited above, the DL distance is used as all 4 edit operations are permitted.
Lowrance and Wagner [17] considered a generalization of DL distance to the case when substitutions, inserts, deletes, and transpositions have different costs. Through a suitable choice of weights, the weighted DL distance can be made equal to the DL distance, Levenshtein distance, Hamming distance, and Jaro distance.
Lowrance and Wagner [17] have developed an O(mn) time and O(mn) space algorithm to find the minimum cost edit sequence (ie., sequence of substitutions, inserts, deletes, and transpositions) that transforms a given string of length m into a given string of length n provided that 2T≥I+D, where T, I, and D, are respectively, the cost of a transposition, insertion, and deletion. In the DL distance, T=I=D=1 and so, 2T=I+D. Hence, the algorithm of Lowrance and Wagner [17] may be used to compute the DL distance as well as the corresponding edit sequence in O(mn) time and O(mn) space. This observation has also been made in [2]. In [18] we developed algorithms that run in O(mn) time using only O(s∗ min{m,n}+m+n) space, where s is the size of the alphabet comprising the strings, to compute the DL distance as well as the corresponding edit sequence. Since s<<m and s<<n in most applications (e.g., s=20 for protein sequences), this reduction in space enables the solution of much larger instances than is possible using the algorithm of [17]. Our algorithms in [18] are much faster as well. In this paper, we develop algorithms to compute the DL distance and corresponding edit sequence using O(m+n) space and O(mn) time. Extensive experimentation using 3 different platforms indicates that the algorithms of this paper are also faster than those of [18]. In fact, our fastest algorithm for the DL distance is up to 56.4% faster than the fastest algorithm in [18] when run on a single core. The single core speedup to find the corresponding edit sequence is up to 57.4%. Our algorithms may be adapted to run on multicores providing a speedup of up to 59.3%.
DL dynamic programming recurrences
Let A[1:m]=a_{1}a_{2}⋯a_{m} and B[1:n]=b_{1}b_{2}...b_{n} be two strings of length m and n, respectively. Let H_{ij} be the DL distance between A[1:i] and B[1:j]. So, H_{mn} is the DL distance between A and B. The dynamic programming recurrence for H is given below [17, 18].
When i>0 and j>0,
where c(a_{i},b_{j}) is 1 if a_{i}≠b_{j} and 0 otherwise, k=lastA[i][b_{j}] is the last (i.e.,rightmost) occurrence of b_{j} in A that precedes position i of A, and l=lastB[j][a_{i}] is the last occurrence of a_{i} in B that precedes position j of B. When k or l do not exist, case 4 of Eq. 2 does not apply.
The four cases in Eq. 2 correspond to the four allowable edit operations: substitution, insertion, deletion and transposition. These cases are illustrated in Fig. 1, which depicts the possibilities for an optimal transformation of A[1:i] to B[1:j]. Figure 1a illustrates the first case, which is to optimally transform A[1:i−1] into B[1:j−1] and then substitute b_{j} for a_{i}. If a_{i}=b_{j}, the substitution cost is 0, otherwise it is 1. Figure 1b shows the second case. Here, A[1:i] is optimally transformed into B[1:j−1] and then b_{j} is inserted at the end. In the third case (Fig. 1c) A[1:i−1] is optimally transformed into B[1:j] and then a_{i} is deleted. For the fourth and final case (Fig. 1d), assume that k and l exist. In this case, we are going to transpose a_{k} and a_{i}. We first optimally transform A[1:k−1] into B[1:l−1]. Since only adjacent characters may be transposed, the transposition of a_{k} and a_{i} must be preceded by a deletion of a_{k+1} through a_{i−1}, which results in a_{k} and a_{i} becoming adjacent. Following the transposition, we insert b_{l+1} through b_{j−1} between the transposed a_{k} and a_{i}, thereby transforming A[1:i] into B[1:j]. The cost of optimally transforming A[1:k−1] into B[1:l−1] is H_{k−1,l−1}. The ensuing deletions have a cost of i−k−1 as this is the number of deletions performed, the transposition a_{k} and a_{i} costs 1, and the final inserts cost l−k−1. So, the overall cost of case 4 is H_{k−1,l−1} + (i−k−1)+1 + (j−l−1).
The algorithm of Lowrance and Wagner [17] computes H_{m,n} using a m×n array for H and onedimensional arrays of size s for lastA and lastB, where s is the size of the alphabet from which the strings A and B are drawn. It computes the H_{i,j}’s by rows beginning with row 1 and within a row, the elements are computed lefttoright by columns. While algorithm LS_DL of [18] also computes H by rows and within a row by columns lefttoright, it does this using a onedimensional array of size n for the current row being computed, a onedimensional array of size s for lastA, and an s×n array T with the property that if w is the last row of H computed so far such that A[w]=c, then T[c][∗]=H[w−1][∗]. As noted in [18] when m<n, we may swap the strings A and B resulting in a space requirement of O(s min{m,n}+n). The time complexity of LS_DL is O(mn). A cacheefficient version, Strip_DL, of LS_DL that computes H by strips whose width is no larger than the cache size is also developed in [18]. This cache efficient algorithm has the same asymptotic time and space complexities as does LS_DL. But, as demonstrated in [18], Strip_DL is much faster than LS_DL.
The linear space algorithms we develop in this paper use a refined dynamic programming recurrence for H. We make the observation that when either a_{i}=b_{j} or min{i−k,j−l}≥2 in the fourth case of Eq. 2, then it is sufficient to consider only the first three cases. To see this, note that when a_{i}=b_{j} the transposition of a_{k} and a_{i} done following the deletion of a_{k+1} through a_{i−1} in case 4 (Fig. 1d) is unnecessary as a_{k} = b_{j} = a_{i}. So, one of the first three cases has to result in a smaller value than case 4. Next, consider the case when a_{i}≠b_{j} and min{i−k,j−l}≥2. Suppose that 2≤i−k≤j−l. The cost of transforming A[1:i] to B[1:j] by using an optimal transformation of A[1:k−1] to B[1:l−1] and then doing j−l+1 substitutions and inserts is H_{k−1,l−1}+j−l+1. Doing the transformation as is case 4 has a cost H_{k−1,l−1}+(i−k−1)+1+(j−l−1)≥H_{k−1,l−1}+j−l+1. So, doing the transposition (case 4) isn’t any better than using only substitutions and inserts. Hence, case 4 need not be considered. The case when 2≤j−l≤i−k is symmetric.
The preceding observation establishes the correctness of the following refined recurrence for H.
When i>0 and j>0,
where c(a_{i},b_{j}) is 1 if a_{i}≠b_{j} and 0 otherwise, k=lastA[i][b_{j}] and l=lastB[j][a_{i}].
We observe that the above refined recurrence for H holds even in the weighted setting provided that 2S≤I+D≤2T, where S, I, D, and T are, respectively, the cost of a substitution, insertion, deletion, and transposition; the cost of a substitution is >0 when the characters involved are different and 0 when these are the same. This observation follows from the following.
When a_{i}=b_{j},S=0
When 2≤i−k≤j−l
The case when 2≤j−l≤i−k is symmetric.
The algorithms developed in this paper are based on our refined recurrence for H.
Methods
DL distance algorithms
In this section, we develop two algorithms, LS_DL2 and Strip_DL2, to compute the DL distance between two strings of length m and n drawn from an alphabet of size s. We note that when s>m+n, at least s−m−n characters of the alphabet appear in neither A nor B. So, these nonappearing characters may be removed from the alphabet and we can work with this reduced size alphabet. Hence, throughout this paper, we assume that s≤m+n. Our algorithms, which take O(m+n) space, are based on the recurrence of Eqs. 3 and 4 and are the counterparts of algorithms LS_DL and Strip_DL of [18] that are based on the recurrence of Eqs. 1 and 2.
Algorithm LS_DL2
Like algorithm LS_DL of [18], LS_DL2 (Algorithm 1) computes H by rows from top to bottom and within a row by columns from left to right. For convenience, we augment H with row −1 and column −1. All values on this row and on this column are maxVal, where maxVal is a large value. Algorithm LS_DL2 uses 4 onedimensional arrays last_row_id[1:s],R[−1:n],R1[−1:n], and FR[−1:n] and a few simple variables that have the following semantics when we are computing H_{ij}. k and l are as in Eqs. 4.

1.
FR[q]=H_{k−1,q−2} for the current i in case q≥j and for the next i in case q<j

2.
R[q]=H_{i,q} if q<j and H_{i−2,q} if q≥j.

3.
R1[q]=H_{i−1,q}

4.
last_row_id[c] = largest k<i such that A[k]=c

5.
last_col_id = largest l<j such that B[l]=A[i]

6.
T is the value to use for H_{i−2,l−1} should this be needed in the computation of H_{ij}

7.
last_i2l1=H_{i−2,j−1}

8.
diag ⋯ Case 1 of Eq. 4

9.
left ⋯ Case 2 of Eq. 4

10.
up ⋯ Case 3 of Eq. 4

11.
transpose ⋯ Case 4 of Eq. 4
Lines 2 and 3 of the algorithm initialize FR, R1, and R so that following the swap of line 6, R1[q]=H_{0,q} and R[q]=H_{−1,q},−1≤q≤n. In other words, at the start of iteration i=1 of the for loop of lines 930, R1 and R, respectively, correspond to rows i−1 (i.e., row 0) and i−2 (i.e., row 1) of H. At this time, FR[−1:n]=maxValue, which corresponds to the initial situation that k=lastA[i][b_{j}] is undefined. This will be updated as lastA[i][b_{j}] gets defined. last_row_id[1:s] is initialized to −1 in line 4 for each character c to indicate the fact that at the start of the i=1 loop, A[p],p<1 isn’t a character of the alphabet. last_col_id is set to −1 at the start of each iteration of the loop of lines 5–32 as at the start of this loop, no character of B has been examined and so there is no last seen occurrence of A[i] in B (for this iteration of the for loop of lines 5–32). Also, at the start of the computation of each row of H, last_i2l1 is set to R[0], because, by the semantics of last_i2l1, when we are computing H_{i,1},last_i2l1=H_{i−2,0}=R[0]. Following this, R[0] is set to i to indicate that the cheapest way to transform A[1:i] into B[0:0] is to do i deletes at a total cost of i (hence, when we are computing H_{i,1} in the loop of lines 9–30, R[q]=H_{ij},q<1), and T is set to maxVal as when we start a row computation, l is undefined. So, the initializations establish the variable semantics given above.
In lines 10–12 of the loop of lines 9–30, diag, left, and up are set to the values specified in cases 1–3 of Eq. 4. Note that from the semantics of the variables, R1[j−1]=H_{i−1,j−1},R[j−1]=H_{i,j−1}, and R1[j]=H_{i−1,j} at this time. Line 13 computes the minimum of the terms in the first 3 cases of Eq. 4. If A[i]=B[j] (line 14), then H_{ij} is determined by cases 1–3 and the value of temp computed in line 13 is H_{i,j}. At this time, we need to update last_col_id as the most recently seen occurrence of A[i] is now at position j of B. Since A[i]=B[j],lastA[i+1][b_{j}]=i. So, the H_{k−1,j−2} to use for the next i in case 4 is H_{i−1,j−2}=R1[j−2]. This value is saved in FR[j] in line 16. Since A[i]=B[j], the value to use for H_{i−2,l−1} in case 4 of Eq. 4 in future iterations of the forj loop (until, of course, we encounter another j where A[i]=B[j]) becomes H_{i−2,j−1}, which by the variable semantics is last_i2l1. This value is saved in T in line 17.
When A[i]≠B[j], lines 19–26 are executed. From the semantics of last_row_id and last_col_id, it follows that line 19 correctly sets k and l. Lines 22 and 24, respectively, compute the cost of case 4 when j−l=1 and i−k=1, respectively. Note that by the semantics of FR and T, FR[j]=H_{k−1,j−2} and T=H_{i−2,l−1}. So, lines 22 and 25 update temp to be H_{i,j}. When we reach line 29, regardless of whether A[i]=B[j] or A[i]≠B[j],R[j]=H_{i−2,j} and temp=H_{i,j}.
In line 28, we set last_i2l1=R[j]=H_{i−2,j}. While this momentarily upsets the semantics of last_i2l1, the semantics are restored upon the start of next iteration of the forj loop as j increases by 1 or in line 8 if the forj loop terminates and we advance to the next iteration of the fori loop. Line 29 sets R[j]=H_{ij}, which similarly upsets the semantics of R but correctly sets up the semantics for the next iteration. Finally, line 31 correctly updates last_row_id so as to preserve its semantics for the next iteration of the fori loop.
The correctness of LS_DL2 follows from the preceding discussion. The space and time complexities are readily seen to be O(m+n+s)=O(m+n) and O(mn+s)=O(mn), respectively. When m<n, the space complexity may be reduced by a constant factor by swapping A and B. Using the LRU cache model of [18], one may show that LS_DL2 has approximately 3mn/w cache misses, where w is the width of a cache line. By comparison, the number of cache misses for LS_DL is mn(1+3/w).
Strip algorithm Strip_DL2
As in [18], we can reduce cache misses, which in turn reduces run time, by partitioning H into n/q strips of size m×q, where q is the largest strip width for which the data needed in the computation of the strip fits into the cache. H is computed by strips from left to right and the computation of each strip is done using LS_DL2. To enable this computation by strips, one strip needs to pass computed values to the next using three additional onedimensional arrays C, C1, and FC of size m each. C records the values of H computed for the rightmost column in the strip; C1 records the values of H computed for the next to rightmost column in the strip; and FC[i] is the value of T (i.e., H[i−2][l−1]) at row i, where l is the last column where B[l]=A[i] in the strip. We name this new algorithm as Strip_DL2. The space complexity of Strip_DL2 is O(m+n) and its time complexity is O(mn). For the LRU cache model of [18] the number of cache misses is approximately \(\frac {6mn}{wq}\).
DL trace algorithms
Wagner and Fischer [19] introduced the concept of a trace to describe an edit sequence when the edit operations are limited to insert, delete, and substitute. Lowrance and Wagner [17] extended the concept of a trace to include transpositions. We reproduce here the definition and example used by us in [18]. A trace for the strings A=a_{1}⋯a_{m} and B=b_{1}⋯b_{n} is a set T of lines, where the endpoints u and v of a line (u,v) denote positions in A and B, respectively. A set of lines T is a trace iff:

1.
For every (u,v)∈T,u≤m and v≤n.

2.
The lines in T have distinct A positions and distinct B positions. That is, no two lines in T have the same u or the same v.
A line (u,v) is balanced iff a_{u}=b_{v} and two lines (u_{1},v_{1}) and (u_{2},v_{2}) cross iff (u_{1}<u_{2}) and (v_{1}>v_{2}). It is easy to see that T={(1,2),(3,1),(4,3),(5,6)} (see Fig. 2) is a trace for the strings A=dafac and B=fdbbec. Line (4,3) is not balanced as a_{4}≠b_{3}. The remaining 3 lines in the trace are balanced. The lines (1,2) and (3,1) cross.
In a trace, unbalanced lines denote a substitution operation and balanced lines denote retaining the character of A. If a_{i} has no line attached to it, a_{i} is to be deleted and when b_{j} has no attached line, it is to be inserted. When two balanced lines (u_{1},v_{1}) and (u_{2},v_{2}),u_{1}<u_{2} cross, \(a_{u_{1}+1} \cdots a_{u_{2}1}\) are to be deleted from A making \(a_{u_{1}}\) and \(a_{u_{2}}\) adjacent, then \(a_{u_{1}}\) and \(a_{u_{2}}\) are to be transposed, and finally, \(b_{v_{2}+1} \cdots b_{v_{1}1}\) are to be inserted between the just transposed characters of A.
The edit sequence corresponding to the trace of Fig. 2 is delete a_{2}, transpose a_{1} and a_{3}, substitute b for a_{4}, insert b_{4}=b and b_{5}=e, retain a_{5}. The cost of this edit sequence is 5.
In [18], we used a divideandconquer strategy similar to that used by Hirschberg [20] to determine an optimal trace in O(mn) time and O(s min{m,n}+n) space. In [18], we made a distinction between traces that have a center crossing and those that do not. A trace has a center crossing iff it contains two lines (u_{1},v_{1}) and (u_{2},v_{2}) such that v_{2}≤n/2 and v_{1}>n/2,u_{1}<u_{2}, while satisfying (a) \(\phantom {\dot {i}\!}a_{i} \neq a_{u_{1}} = b_{v_{1}}, u_{1} < i < u_{2}\) and (b) \(\phantom {\dot {i}\!}b_{j} \neq b_{v_{2}} = a_{u_{2}}, v_{2} < j < v_{1}\). In words, u_{1} is the last (i.e., rightmost) occurrence of \(b_{v_{1}}\) in A that precedes position u_{2} of A and v_{2} is the last occurrence of \(\phantom {\dot {i}\!}a_{u_{2}}\) in B that precedes position v_{1} of B. (Figure 3).
In [18], we showed that the cost of an optimal trace T is given by Eq. 7 when T has no center crossing and by Eq. 8 when T has a center crossing. Hence, the cost of T is the smaller of these two costs.
where H[i] is the cost of an optimal trace for A[1:i] and B[1:n/2] and H^{′}[i+1] that for an optimal trace for A[i+1:m] and B[n/2+1:n].
where H[i][j] is the cost of an optimal trace for A[1:i] and B[1][j] and H^{′}[i][j] is that for an optimal trace for A[i:m] and B[j][n]. For the min{}, we try 1≤u_{1}<m and for each such u_{1}, we set v_{1} to be the smallest i>n/2 for which \(\phantom {\dot {i}\!}b_{i} = a_{u_{1}}\). For each u_{1} we examine all characters other than \(a_{u_{1}}\) in the alphabet. For each such character c, v_{2} is set to the largest j≤n/2 for which b_{j}=c and u_{2} is the smallest i>u_{1} for which a_{i}=c.
Our new algorithms, LS_TRACE2 and Strip_TRACE2, are based on an adaptation of Eqs. 7 and 8 using Eq. 4.
Algorithm LS_T R A C E2
Consider the case when the optimal trace has no center crossing. Let R^{f}[] be the value of R[] when LS_DL2(B[1:n/2],A[1:m]) terminates and let R^{′}^{f}[] be the value of R[] when LS_DL2(B[n:n/2+1],A[m:1]) terminates. Let R1^{f},R1^{′}^{f},FR^{f}, and FR^{′}^{f} be the corresponding final values for R1 and RF. From Eq. 7 and LD_DL2, we obtain
When T has a center crossing {(u_{1},v_{1}),(u_{2},v_{2})}, then it follows from Eq. 4 that either u_{1} and u_{2} are adjacent in A or v_{1} and v_{2} are adjacent in B (or both). When v_{1}and v_{2} are adjacent in B, then \(v_{2} = \frac {n}{2}\) and \(v_{1}=\frac {n}{2}+1\). Substituting into Eq. 8, we get
When u_{1} and u_{2} are adjacent in A, then u_{2}−u_{1}=1,v_{2} is the right most occurrence of A[u_{2}] in B that precedes position \(\frac {n}{2}+1\) (i.e., v_{2}≤n/2) and v_{1} is the left most occurrence of A[u_{1}] in B after position \(\frac {n}{2}\) (i.e., v_{1}≥n/2+1). So, we have
Algorithm LS_TRACE2 (Algorithm 2) provides the pseudocode for our linear space computation of an optimal trace. It assumes that LS_DL2 has been modified to return the arrays R,R1 and FR.
Using an analysis similar to that used by us in [18] for the analysis of DL_TRACE, we see that the time complexity of DL_TRACE2 is O(mn). The space required is the same as for LS_DL2. The number of cache misses is approximately twice that for LS_DL2 when invoked with strings of size n and m. Hence, the cache miss count for LS_TRACE2 is ≈6mn/w.
Strip trace algorithm Strip_T R A C E2
This algorithm differs from LS_TRACE2 in that it uses a modified version of Strip_DL2 rather than a modified version of LS_DL2. The modified version of Strip_DL2 returns the arrays C, C1 and FC computed by Strip_DL2. The asymptotic time complexity of Strip_TRACE2 is also O(mn) and it takes the same amount of space as does Strip_DL2. The number of cache misses is approximately twice that for Strip_DL2.
Results
We benchmarked the singlecore algorithms LS_DL2,Strip_DL2,DL_TRACE2, and Strip_TRACE2 of this paper against the corresponding singlecore algorithms developed by us in [18]. Using the parallelization techniques of [18], we obtained multicore versions of our new algorithms. Their names are obtained by prefixing PP_ to the singlecore name (e.g., PP_LS_DL2 is the multicore version of LS_DL2). The new multicore versions also were benchmarked against the corresponding multicore algorithms of [18].
Platforms and test data
The singlecore algorithms were implemented using C and the multicore ones using C and OpenMP. The relative performance of these algorithms was measured on the following platforms:

1.
Intel Xeon CPU E52603 v2 QuadCore processor 1.8GHz with 10MB cache.

2.
Intel I7x980 SixCore processor 3.33GHz with 12MB LLC cache.

3.
Intel Xeon CPU E52695 v2 2x12Core processors 2.40GHz with 30MB cache.
For convenience, we will, at times, refer to these platforms as Xeon4, Xeon6, and Xeon24 (i.e., the number of cores is appended to the name Xeon).
All codes were compiled using the gcc compiler with the O2 option. On our Xeon4 platform, the benchmarking included a comparison of memory, cache misses, run time, and energy consumption. The cache miss count and the energy consumption was measured using the "perf" [21] software through the RAPL interface. For the Xeon6 and Xeon24 platforms only the run time was benchmarked.
For test data, we used randomly generated protein sequences as well as real protein sequences obtained from the Protein Data Bank [22] and DNA/RNA/protein sequences from the National Center for Biotechnology Information (NCBI) database [23]. The results for our randomly generated protein sequences were comparable to those for similarly sized sequences used from the two databases [22] and [23]. So, we present only the results for the random data sets here.
Xeon E52603 (Xeon4)
DL distance algorithms
Table 1 gives the memory required to process random protein sequences of length 400,000 using each of the singlecore DL scoring algorithms considered in this paper. LS_DL takes 4.75 times the memory taken by LS_DL2 and LS_Strip takes 4.69 times the memory taken by LS_Strip2.
Figure 4 and Table 2 give the number of cache misses on our Xeon4 platform for randomly generated sequences of size between 40,000 and 400,000. The column of Table 2 labeled LvsL2 gives the percentage reduction in cache misses achieved by LS_DL2 relative to LS_DL, that labeled SvsS2 gives this percentage for Strip_DL2 relative to Strip_DL, and that labeled L2vsS2 gives this percentage for Strip_DL2 relative to LS_DL2. Strip_DL2 has the fewest cache misses. Strip_DL2 reduces cache misses by up to 91.9% relative to LS_DL2 and by up to 94.4% relative to Strip_DL.
Figure 5 and Table 3 give the run times on our Xeon4 platform for our random data set. In the figure, the time is in seconds while in the table, the time is given using the format hh:mm:ss. The table also gives the percentage reduction in run time.
As can be seen, on our Xeon4 platform, Strip_DL2 is the fastest followed by LS_DL2,Strip_DL, and LS_DL. Strip_DL2 reduces run time by up to 15.9% relative to LS_DL2 and by up to 40.1% relative to Strip_DL.
Figure 6 and Table 4 give the CPU and cache energy consumed, in joules, on our Xeon4 platform. On our data sets, Strip_DL2 required up to 17.6% less CPU and cache energy than LS_DL2 and up to 40.0% less than Strip_DL. It is interesting to note that the energy reduction is comparable to the reduction in run time suggesting a close relationship between run time and energy consumption for this application.
DL trace algorithms
Figure 7 and Table 5 give the number of cache misses for the trace algorithms on our Xeon4 platform for randomly generated sequences of size between 40,000 and 400,000. The column of Table 5 labeled LvsL2 gives the percentage reduction in cache misses achieved by LS_Trace2 relative to LS_Trace, that labeled SvsS2 gives this percentage Strip_Trace2 relative to Strip_Trace, and that labeled L2vsS2 gives this percentage Strip_Trace2 relative to LS_Trace2. Strip_Trace2 has the fewest cache misses. Strip_Trace2 reduces cache misses by up to 89.1% relative to LS_Trace2 and by up to 95.4% relative to Strip_Trace.
Figure 8 and Table 6 give the run times on our Xeon4 platform for our random data set. The table also gives the percentage reduction in run time. Strip_Trace2 is the fastest followed by LS_Trace2,Strip_Trace, and LS_Trace (in this order). Strip_Trace2 reduces run time by up to 4.3% relative to LS_Trace2 and by up to 37.8% relative to Strip_Trace.
Figure 9 and Table 7 give the CPU and cache energy consumed, in joules, Strip_Trace2 required up to 6.3% less CPU and cache energy than LS_Trace2 and up to 37.9% less than Strip_Trace.
Parallel algorithms
Figure 10 and Table 8 give the run times for our parallel DL distance algorithms on our Xeon4 platform. PP_LS_DL2 is up to 61.2% faster than PP_LS_DL and PP_Strip_DL2 is up to 40.6% faster than PP_Strip_DL. Also, PP_LS_DL2 and PP_Strip_DL2 achieve a speedup of up to 3.15 and 3.98 compared to the corresponding singlecore algorithms on a fourcore machine, respectively.
Figure 11 and Table 9 give the run times for our parallel DL trace algorithms on our Xeon4 platform. PP_LS_Trace2 is up to 64.5% faster than PP_LS_Trace and PP_Strip_Trace2 is up to 35.4% faster than PP_Strip_Trace. Also, PP_LS_Trace2 and PP_Strip_Trace2 achieves a speedup up to 2.94 and 3.83 compared to the corresponding singlecore algorithms, respectively.
I7x980 (Xeon6)
DL distance algorithms
Figure 12 and Table 10 give the run times of our singlecore distance algorithms on our Xeon6 platform. As can be seen, Strip_DL2 is the fastest followed by LS_DL2,Strip_DL, and LS_DL (in this order). Strip_DL2 reduces run time by up to 10.2% relative to LS_DL2 and by up to 42.0% relative to Strip_DL.
DL trace algorithms
Figure 13 and Table 11 give the run times of our singlecore trace algorithms on our Xeon6 platform. Strip_Trace2 is the fastest followed by LS_Trace2,Strip_Trace, and LS_Trace (in this order). Strip_Trace2 reduces run time by up to 3.9% relative to LS_Trace2 and by up to 37.6% relative to Strip_Trace.
Parallel algorithms
Figure 14 and Table 12 give the run times for our parallel DL distance algorithms on our Xeon6 platform. PP_LS_DL2 is up to 82.4% faster than PP_LS_DL and PP_Strip_DL2 is up to 39.8% faster than PP_Strip_DL. Also, PP_LS_DL2 and PP_Strip_DL2 achieves a speedup up to 4.32 and 5.44 compared to the corresponding single core algorithms on a six core machine, respectively.
Figure 15 and Table 13 give the run times for our parallel DL trace algorithms on our Xeon6 platform. PP_LS_Trace2 is up to 74.7% faster than PP_LS_Trace and PP_Strip_Trace2 is up to 41.8% faster than PP_Strip_Trace. Also, PP_LS_Trace2 and PP_Strip_Trace2 achieves a speedup up to 4.32 and 5.30 compared to the corresponding singlecore algorithms, respectively.
Xeon E52695 (Xeon24)
DL distance algorithms
Figure 16 and Table 14 give the run times of our singlecore distance algorithms on our Xeon24 platform. Strip_DL2 is the fastest followed by LS_DL2,Strip_DL, and LS_DL (in this order). Strip_DL2 reduces run time by up to 13.8% relative to LS_DL2 and by up to 56.4% relative to Strip_DL.
DL trace algorithms
Figure 17 and Table 15 give the run times of our singlecore trace algorithms on our Xeon24 platform. Strip_Trace2 is the fastest followed by LS_Trace2,Strip_Trace, and LS_Trace (in this order). Strip_Trace2 reduces run time by up to 9.4% relative to LS_Trace2 and by up to 57.4% relative to Strip_Trace.
Parallel algorithms
Figure 18 and Table 16 give the run times for our parallel DL distance algorithms on our Xeon24 platform. PP_LS_DL2 is up to 68.7% faster than PP_LS_DL and PP_Strip_DL2 is up to 54.6% faster than PP_Strip_DL. Also, PP_LS_DL2 and PP_Strip_DL2 achieves a speedup up to 11.2 and 21.36 compared to the corresponding single core algorithms on a twelvefour core machine, respectively.
Figure 19 and Table 17 give the run times for our parallel DL trace algorithms on our Xeon24 platform. PP_LS_Trace2 is up to 58.3% faster than PP_LS_Trace and PP_Strip_Trace2 is up to 59.3% faster than PP_Strip_Trace. Also, PP_LS_Trace2 and PP_Strip_Trace2 achieves a speedup up to 9.65 and 16.4 compared to the corresponding singlecore algorithms, respectively.
Discussion
We have developed linear space algorithms to compute the DL distance between two strings and also to determine an optimal trace (edit sequence). Besides using less space than the space efficient algorithms of [18], these algorithms are also faster. Significant runtime improvement (relative to known algorithms) was seen for our new algorithms on all three of our experimental platforms.
Conclusion
On all platforms, the linearspace cacheefficient algorithms Strip_DL2 and Strip_TRACE2 were the bestperforming singlecore algorithms to determine the DL distance and optimal trace, respectively. Strip_DL2 reduced run time by as much as 56.4% relative to the Strip_DL and Strip_TRACE2 reduced run time by as much as 57.4% relative to the Strip_Trace. Our multicore algorithms reduced the run time by up to 59.3% compared to the best previously known multicore algorithms.
The linear space string correction algorithm developed in this paper requires 2S≤I+D≤2T, where S, I, D and T are, respectively, the cost of a substitution, insertion, deletion and transposition. As noted earlier, this requirement is met by the DL distance as for this metric, S=I=D=T=1. When only I+D≤2T is satisfied, the more general algorithm ([18]) that runs in O(mn) time and uses O(s∗min{m,n}+m+n) space, where s is the size of alphabet comprising the strings, may be used.
Availability of data and materials
Data sharing is not applicable to this article as all test data are randomly generated protein sequences.
Abbreviations
 CPU:

Central Processing Unit
 DL:

DamerauLevenshtein
 DNA:

DeoxyriboNucleic acid
 LLC:

Last level cache
 LRU:

Least recently used
 NCBI:

National Center for Biotechnology Information
 PDB:

Protein Data Bank
 RNA:

RiboNucleic acid
References
Brill E, Moore RC. An improved error model for noisy channel spelling correction. In: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics: 2000. p. 286–93.
Bard GV. Spellingerror tolerant, orderindependent passphrases via the DamerauLevenshtein stringedit distance metric. In: Proceedings of the Fifth Australasian Symposium on ACSW Frontiers  Volume 68. ACSW ’07. Darlinghurst: Australian Computer Society, Inc.: 2007. p. 117–24.
Li M, Zhang Y, Zhu M, Zhou M. Exploring distributional similarity based models for query spelling correction. In: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics. ACL44. Stroudsburg: Association for Computational Linguistics: 2006. p. 1025–32.
Cai X, Zhang XC, Joshi B, Johnson R. Touching from a distance: Website fingerprinting attacks and defenses. In: Proceedings of the 2012 ACM Conference on Computer and Communications Security. New York: ACM: 2012. p. 605–16.
Faloutsos C, Megalooikonomou V. On data mining, compression, and kolmogorov complexity. Data Min Knowl Discov. 2007; 15(1):3–20.
Majorek KA, DuninHorkawicz S, Steczkiewicz K, Muszewska A, Nowotny M, Ginalski K, Bujnicki JM. The RNase Hlike superfamily: new members, comparative structural analysis and evolutionary classification. In: Nucleic Acids Research, vol. 42. Oxford, United Kingdom: Oxford University Press: 2014. p. 4160–4179.
Bischof J, Ibrahim SM. bcRep: R package for comprehensive analysis of B cell receptor repertoire data. PLoS ONE. 2016; 11(8):1–15.
Pomorova O, Savenko O, Lysenko S, Nicheporuk A. Information and Communication Technologies in Education, Research, and Industrial Applications. In: 12th Internal conference. Kyiv: Springer: 2016. http://ceurws.org/Vol1614/.
Biswas AK, Gao JX. PR2S2Clust: patched rnaseq read segments’ structureoriented clustering. J Bioinforma Comput Biol. 2016; 14(05):1650027.
Pop PG. Inmemory dedicated dotplot analysis for DNA repeats detection. In: Intelligent Computer Communication and Processing (ICCP), 2016 IEEE 12th International Conference On. Washington, DC: IEEE: 2016. p. 317–20.
Gabrys R, Yaakobi E, Milenkovic O. Codes in the damerau distance for deletion and adjacent transposition correction. IEEE Trans Inf Theory. 2018; 64(4):2550–70.
Largest Protein Sequence: Titin. https://en.wikipedia.org/wiki/Titin. Accessed 15 Feb 2019.
Largest Chromosome. https://en.wikipedia.org/wiki/Chromosome_1. Accessed 115 Feb 2019.
Levenshtein VI. Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl. 1966; 10(8):707–10.
Robinson DJS. An Introduction to Abstract Algebra. Berlin: Walter de Gruyter; 2003, pp. 255–7.
Jaro MA. Advances in recordlinkage methodology as applied to matching the 1985 census of Tampa, Florida. J Am Stat Assoc. 1989; 84(406):414–20.
Lowrance R, Wagner RA. An extension of the stringtostring correction problem. J ACM. 1975; 22(2):177–83.
Zhao C, Sahni S. String correction using the dameraulevenshtein distance. BMC Bioinforma; 20:277. https://doi.org/10.1186/s1285901928190.
Wagner RA, Fischer MJ. The stringtostring correction problem. J ACM. 1974; 21(1):168–73.
Hirschberg DS. A linear space algorithm for computing longest common subsequences. Commun ACM. 1975; 18(6):341–3.
Perf Tool. https://perf.wiki.kernel.org/index.php/Main_Page. Accessed 15 Feb 2019.
Protein Data Bank. http://www.rcsb.org/pdb/home/home.do.Accessed 15 Feb 2019.
NCBI Database. http://www.ncbi.nlm.nih.gov/gquery. Accessed 15 Feb 2019.
Acknowledgments
The authors also would like to thank the anonymous reviewers for their valuable comments and suggestions.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 21 Supplement 1, 2020: Selected articles from the 8th IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2018): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume21supplement1.
Funding
Publication of this supplement was funded, in part, by the National Science Foundation under award NSF 1447711. Publication costs were funded by the University of Florida.
Author information
Authors and Affiliations
Contributions
CZ and SS developed the new linear space cache efficient DL distance algorithms, did theoretical analysis and the experimental results analysis, and wrote the manuscript. CZ programmed the algorithms and ran the benchmark tests. All authors have read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Zhao, C., Sahni, S. Linear space string correction algorithm using the DamerauLevenshtein distance. BMC Bioinformatics 21 (Suppl 1), 4 (2020). https://doi.org/10.1186/s1285901931848
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901931848