 Research
 Open Access
 Published:
String correction using the DamerauLevenshtein distance
BMC Bioinformatics volume 20, Article number: 277 (2019)
Abstract
Background
In the string correction problem, we are to transform one string into another using a set of prescribed edit operations. In string correction using the DamerauLevenshtein (DL) distance, the permissible edit operations are: substitution, insertion, deletion and transposition. Several algorithms for string correction using the DL distance have been proposed. The fastest and most space efficient of these algorithms is due to Lowrance and Wagner. It computes the DL distance between strings of length m and n, respectively, in O(mn) time and O(mn) space. In this paper, we focus on the development of algorithms whose asymptotic space complexity is less and whose actual runtime and energy consumption are less than those of the algorithm of Lowrance and Wagner.
Results
We develop space and cacheefficient algorithms to compute the DamerauLevenshtein (DL) distance between two strings as well as to find a sequence of edit operations of length equal to the DL distance. Our algorithms require O(s min{m,n}+m+n) space, where s is the size of the alphabet and m and n are, respectively, the lengths of the two strings. Previously known algorithms require O(mn) space. The space and cacheefficient algorithms of this paper are demonstrated, experimentally, to be superior to earlier algorithms for the DL distance problem on time, space, and enery metrics using three different computational platforms.
Conclusion
Our benchmarking shows that, our algorithms are able to handle much larger sequences than earlier algorithms due to the reduction in space requirements. On a single core, we are able to compute the DL distance and an optimal edit sequence faster than known algorithms by as much as 73.1% and 63.5%, respectively. Further, we reduce energy consumption by as much as 68.5%. Multicore versions of our algorithms achieve a speedup of 23.2 on 24 cores.
Background
Introduction
In the string correction problem, we are given two strings A and B and are required to find the minimum number of edit operations needed to transform A into B. The permitted edit operations are: (a) substitute a character in A to a different character, (b) insert a character into A, (c) delete a character of A, and (d) transpose two adjacent characters of A. When all four edit operations are permitted, the length of the optimal edit sequence is known as the DamerauLevenshtein (DL) distance [1, 2]. Some applications limit the permissible edit operations to a subset of the stated four operations. As a result, string correction has been studied using other distance metrics as well. For example, the Levenshtein distance [1] is the length of the shortest sequence of substitutions, insertions, and deletions needed to transform A into B. This distance is used in the longest common subsequence problem [3], for example. When only substitutions are allowed, the length of the minimum edit sequence is the Hamming distance [4] and when only transpositions are allowed, this length is the Jaro distance [5].
The cost of an edit sequence may be generalized by using weights for the various operations. For example, in sequence alignment using the methods of Needleman and Wunsch [6] and Smith and Waterman [7], transpositions are not permitted, the cost of a substitution depends on the two characters involved, and there is a gap penalty. The stringtostring correction algorithm of Lowrance and Wagner [8] uses a cost of S for a substitution, I for an insertion, D for a deletion, and T for a transposition and requires 2T≥I+D. We note that the costs used in computing the DL distance are S=I=D=T=1 and that these costs satisfy the 2T≥I+D requirement of the algorithm of Lowrance and Wagner [8]. In fact, the best algorithm currently known for the DL distance is the one in [8] with edit operation costs set to 1.
Spelling error correction [9–11], data clustering and data mining [12], comparing packet traces [13], quantifying the similarity of DNA/RNA/protein sequences, gene finding, and gene function prediction [14] are some of the applications of the DL distance. While, in spelling error correction, the strings A and B are relatively short, in other applications, these strings may be quite long. For example, the length of a protein sequence may exceed 300,000 [15].
Bard [10] has shown that the DL distance is a true metric; that is, it satisfies 1) nonnegativity, 2) identity, 3) symmetry, and 4) triangle inequality. The algorithm of Bard [10] computes the DL distance in O(mn∗ max{m,n}) time, where m is the length of string A and n is the length of B. This algorithm uses O(mn) space. Hyyro [16] has developed a bitparallel algorithm to determine whether the DL distance between two strings is less than a specified threshold. This bitparallel algorithm was tested using DNA sequences of length up to 10,000.
In an effort to reduce time complexity, Oommen and Loke [17] consider restricting edit sequences so that no substring is edited more than once. We illustrate this restriction using the example given in [18]. The string CA may be transformed into ABC using the edit sequence CA (transposition) → AC (insertion) → ABC. So, the DL distance between CA and ABC is 2. With the restriction of [17], the second operation in this edit sequence is not permitted as it involves reediting AC, which resulted from the first edit operation. The restricted DL distance is 3, which corresponds to the restricted edit sequence CA (deletion) → A (insertion) → AB (insertion) → ABC. The restricted DL distance is not a metric as it does not satisfy the triangle inequality.
The algorithm of Lowrance and Wagner [8] computes the DL distance in O(mn) time while also using O(mn) space. This is the fastest and most space efficient algorithm known for string correction using the DL distance.
Neither the algorithm of Bard [10] nor that of Lowrance and Wagner [8] is practical when m and n are large due to their excessive space requirement. The former algorithm becomes impractical also due to its excessive run time. In this paper, we focus on the development of algorithms that are more space, time, and energy efficient than that of Lowrance and Wagner [8]. To obtain space efficiency, we observe that the DL distance can be computed by retaining only O(sm) or O(sn) data, where s is the size of the alphabet. We note that, when m and n are large, s is much smaller than m and n. In fact, s=4 for RNA and DNA sequences and s=20 for protein sequences and the length of these sequences is often orders of magnitude larger than s.
Cache model
To analyze the cache performance of our algorithms, we use the rather simple cache model which has been used by us successfully in our past work [19]. In this model we have a singlelevel cache that has l cache lines of size w, where w is the number of data items that can be stored in one cache line. So, when the data size is 4 bytes and w=8, each cache line is 32 bytes. The size (i.e., capacity) of our onelevel cache is lw. In accordance with this cache model, we assume that main memory is divided into blocks whose size is the same as that of a cache line (i.e., w words each). When we attempt to read a piece of data that is not in the cache, a read miss occurs. A read miss causes the corresponding block of main memory to be read into a cache line. When the cache is full, this read miss requires us to first evict the block that is in the least recently used (LRU) cache line. This eviction results in a write of the evicted block to main memory in case the evicted block has changed. A write miss occurs when we attempt to write data that is not in a cache line. At this time, the corresponding block of main memory is read into a cache line and the data we wish to write is written to this cache line.
Notice that every read and write miss results in a read access of main memory; some read and write misses also result in the writing of a cache line to main memory.
Today’s computers actually employ multiple levels of cache and a far more sophisticated and proprietary cache servicing policy combined with prefetching to hide memory latency. As a result, it is extremely difficult to analyze cache performance using a realistic cache model. The described simple cache model is amenable to analysis and our experiments establish its usefulness for this purpose as algorithms with reduced cache misses using this model actually run faster on computers with more sophisticated cache architectures, replacement policies, and prefetching techniques.
Classical DL distance algorithm
Wagner and Fischer [20] developed the notion of a trace, which is useful in reasoning about edit sequences that are limited to substitutions, insertions, and deletions. Lowrance and Wagner [8] extended this notion to include the transposition operation. A trace for the strings A=a_{1}⋯a_{m} and B=b_{1}⋯b_{n} is a set T of lines, where the endpoints u and v of a line (u,v) denote positions in A and B, respectively. A set of lines T is a trace iff:

1
For every (u,v)∈T, u≤m and v≤n.

2
The lines in T have distinct A positions and distinct B positions. That is, no two lines in T have the same u or the same v.
A line (u,v) is balanced iff a_{u}=b_{v} and two lines (u_{1},v_{1}) and (u_{2},v_{2}) cross iff (u_{1}<u_{2}) and (v_{1}>v_{2}). As an example, consider A=dafac and B=fdbbec. The set of lines T={(1,2),(3,1),(4,3),(5,6)} satisfies the requirements for a trace. Line (4,3) is not balanced as a_{4}≠b_{3}. The remaining 3 lines in the trace are balanced. The lines (1,2) and (3,1) cross. This trace may be depicted as a diagram as in Fig. 1.
In a trace, an unbalanced line denotes a substitution operation and a balanced line denotes retaining the character of A. If a_{i} has no line attached to it, a_{i} is to be deleted and when b_{j} has no attached line, it is to be inserted. When two balanced lines (u_{1},v_{1}) and (u_{2},v_{2}) cross, \(a_{u_{1}+1} \cdots a_{u_{2}1}\) are to be deleted from A making \(a_{u_{1}}\) and \(a_{u_{2}}\) adjacent, then \(a_{u_{1}}\) and \(a_{u_{2}}\) are to be transposed, and finally, \(b_{v_{2}+1} \cdots b_{v_{1}1}\) are to be inserted between the just transposed characters of A.
The edit sequence corresponding to the trace of Fig. 1 is delete a_{2}, transpose a_{1} and a_{3}, substitute b for a_{4}, insert b_{4}=b and b_{5}=e, retain a_{5}. The cost of this edit sequence is 5.
Lowrance and Wagner [8] have proved the following properties:

The cost of a trace equals the number of unbalanced lines plus the number of positions in A and B not touched by a line plus the number of line crossings.

There is a trace whose cost equals that of an optimal edit sequence (Theorem 2 of [8]). Since every trace corresponds to an edit sequence, it follows that the edit sequence that corresponds to a minimum cost trace is optimal.

There is a minimum cost trace in which each line crosses at most one other line and in which every line that crosses another is balanced (Theorem 4 of [8]).

There is trace T that satisfies property P3 and for every pair of crossing lines (u_{1},v_{1}), (u_{2},v_{2}), u_{1}<u_{2} in T, (a) \(a_{i} \neq a_{u_{1}} = b_{v_{1}}\phantom {\dot {i}\!}\), u_{1}<i<u_{2} and (b) \(b_{j} \neq b_{v_{2}} = a_{u_{2}}\phantom {\dot {i}\!}\), v_{2}<j<v_{1}. In words, u_{1} is the last (i.e., rightmost) occurrence of \(b_{v_{1}}\) in A that precedes position u_{2} of A and v_{2} is the last occurrence of \(a_{u_{2}}\) in B that precedes position v_{1} of B. We refer to these positions as \(lastA[\!u_{2}][\!b_{v_{1}}]\phantom {\dot {i}\!}\) and \(\phantom {\dot {i}\!}lastB[\!v_{1}][\!a_{u_{2}}]\), respectively (Theorem 5 of [8]).
Let H_{ij} be the DL distance between A[ 1:i] to B[ 1:j]. So, H_{mn} is the DL distance between A and B. The following dynamic programming recurrence follows from properties P1P4 of a trace.
When i>0 and j>0,
where c(a_{i},b_{j}) is 1 if a_{i}≠b_{j} and 0 otherwise, k=lastA[ i][ b_{j}] and l=lastB[ j][ a_{i}]. If k or l do not exist, then case 4 of the recurrence does not apply.
Figure 2 illustrates the four cases of this recurrence. These cases correspond to the four possibilities for an optimal trace that transforms A[ 1:i] into B[ 1:j] and satisfies properties P2P4. Such a trace may (a) contain the line (i,j), (b) contain no line that touches b_{j}, (c) contain no line that touches a_{i}, or (d) have crossing balanced lines that involve a_{i} and b_{j}. Figure 2a illustrates the first case, which is a substitution between a_{i} and b_{j}; we optimally transform A[ 1:i−1] into B[ 1:j−1] and then substitute b_{j} for a_{i}. If a_{i}=b_{j}, the substitution cost is 0, otherwise it is 1. Figure 2b shows the second case. Here, b_{j} is inserted at the end of B[ 1:j−1] following an optimal transformation of A[ 1:i] into B[ 1:j−1]. Figure 2c shows the third case in which a_{i} is deleted from A[1:i] following an optimal transformation of A[ 1:i−1] into B[ 1:j]. Figure 2d shows the case of crossing balanced lines (i,l) and (k,j). Here, A[ 1:k−1] must be optimally transformed into B[ 1:l−1]. Note that to perform the crossing operation, we must delete i−k−1 characters from A, do an adjacent character transposition in A, and then insert j−l−1 characters from B between the two just transposed positions. So, the cost is (i−k−1)+1+(j−l−1).
Algorithm 1 is the pseudocode to compute H using Eqs. 1 and 2. This is a simplification of the pseudocode given in Lowrance and Wagner [8] to the case when each edit operation has unit cost. In this algorithm, last_row_id[c] keeps track of the last occurrence of character c in A (note that this is a row index of H) and last_col_id keeps track of the last occurrence of a_{i} in B.
We shall refer to Algorithm 1 as algorithm DL. Its time and space complexities are readily seen to be O(mn). Once H has been computed using algorithm DL, an optimal trace may be obtained in O(m+n) additional time using a standard dynamic programming traceback. We refer to the combination of DL and the traceback as algorithm DL_TRACE.
The total number of cache misses is dominated by the read and write misses of the array H. So, we count only these misses. In each iteration of the loop for computing row i of H, we need the elements of rows i and i−1 of H in lefttoright order as in Algorithm 1 lines 911 and 14. Since these rows are read from main memory in blocks of size w and row i is written to main memory in blocks of this size, lines 911 and 14 result in 2n/w read accesses and n/w write accesses for each i. These lines, therefore, result in 3mn/w cache misses over the entire execution of DL. Line 13 makes one read access of H per iteration and so contributes at most mn to the total cachemiss count. Hence, the cachemiss count for algorithm DL is approximately mn(1+3/w).
Methods
Singlecore algorithms
In this section, we develop four linearspace singlecore algorithms for string correction using the DL distance. All four run in O(mn) time. The first two (LS_DL and Strip_DL) compute only the score H_{mn} of the optimal trace; they differ in their cache efficiency. The last two (LSDL_TRACE and Strip_TRACE) compute an optimal trace.
The linear space algorithm L S_D L
Let s be the size of the alphabet. Instead of using the array H used in DL, algorithm LS_DL uses a onedimensional array U[ −1:n] and a twodimensional array T[ 1:s][ −1:n]. These two arrays have a space requirement of O((s+1)n) = O(n) for constant s. When m<n, one may swap A and B to reduce the required memory. Adding the memory needed for A and B, the space complexity is O(s min{m,n}+m+n) = O(m+n) when s is a constant.
As in algorithm DL, the H_{ij} values are computed by rows. The onedimensional array U is used to save the H[ i][ ∗] values computed by algorithm DL when row i is being computed. Let H[ w][ ∗] be the last row computed for character c. Then, T[ c][ ∗] is row w−1 of H. Algorithm 2 gives the pseudocode for LS_DL. Its correctness follows from the correctness of algorithm DL. Note that swap(T[A[ i]],U) takes O(1) time as pointers to 2 onedimensional arrays are swapped rather than the content of these arrays. The cachemiss count for LS_DL is the same as that for DL when n is suitably large as both have the same data access pattern. However, for smaller instances LS_DL will exhibit much better cache behavior. For example, because of its use of much less memory, we may have enough LLC cache to store all the data in LS_DL but not in DL (O(sn) vs O(mn)).
The cacheefficient linearspace algorithm S t r i p_D L
When (s+1)n is larger than the size of the LLC cache, we may reduce cache misses relative to algorithm LS_DL by computing H_{ij} by strips of width q, for some q less than n (the last strip may have a width smaller than q). This is shown in Fig. 3. The strips are computed in the order 0, 1,... using algorithm LS_DL. However, the space needed by T and U in LS_DL is reduced to (s+1)q as the strip width is q rather than n. By choosing q small enough, we can ensure that blocks of the T and U arrays used by LS_DL are not evicted from cache once they are brought in. So, if each entry of T and U takes 1 word, then when the cache size is lw, we have q<lw/(s+1). Note that, in addition to T and U, the cache needs to hold partials of A, B and other arrays needed to pass the data from one strip to the next.
To pass the data from one strip to next, we use an additional onedimensional array strip of size m and a twodimensional s∗m array V. The array strip records the values of H computed for the rightmost column in the strip. V[ c][i] gives the H value in the rightmost column j of row i of H that is (a) in a strip to the left of the one currently being computed and (b) c=B[ j].
The pseudocode for Strip_DL is given in Algorithm 3. For clarity, this pseudocode uses two strip arrays (lines 18 and 30) and two V arrays (lines 24 and 32). One set of arrays is used to fetch data calculated for the previous strip and the other set for data that is to be passed to the next strip. In the actual implementation, we use a single strip array and a single V array overwriting values received from the previous strip with values to be passed to the next strip.
The time and complexity of Strip_DL are, respectively, O(mn) and O((s+1)m+(s+1)q+n) = O(sm+sq+n) = O(sm+n) as q is a constant. When m>n, we may switch A and B to conserve memory and so the space complexity becomes O(s min{m,n}+m+n) = O(m+n) for constant s.
When we analyze the cache miss, we note that q is chosen such that U and T fit into cache. We make the reasonable assumption that the LRU replacement rule does not cause any block of U or T to be evicted during the running of algorithm Strip_DL. As a result, the total number of cache misses due to U and T is independent of m and n and so may be ignored in the analysis. The initialization of strip and V results in m/w and (s+1)m/w read accesses, respectively. The number of write accesses is approximately the same as the number of read accesses. The computation for each strip accesses the array strip in ascending order of index. This results in (approximately) the same number of cache misses as made during the initialization phase. Hence, the total number of cache misses due to strip is approximately (2m/w)(n/q+1). For V, we note that when computing the current strip, the elements in any row of V are accessed in nondecreasing order of index (i.e., from left to right) and that we need to retain, in cache, only the most recently read value for each character of the alphabet (i.e., at most s values are to be retained). Making the assumption that a V value is evicted from cache only when a new value for the same character is accessed, the total number of read misses from V when computing a single strip is sm/w. The number of write misses is approximately the same. So, V contributes (2sm/w)(n/q+1). Hence, the total number of cache misses for algorithm Strip_DL is ≈2(s+1)mn/(wq) when m and n are large.
Recall that the approximate cachemiss count for algorithms DL and LS_DL is mn(1+3/w). This is (wq+3q)/(2s+2) times that for Strip_DL.
The linearspace trace algorithm L S D L_T R A C E
Although algorithms LS_DL and Strip_DL determine the score (cost) of an optimal trace (and hence of an optimal edit sequence) that transforms A into B, these algorithms do not save enough information to actually determine an optimal trace. To determine an optimal trace using linear space, we adopt a divideandconquer strategy similar to that used by Hirschberg [21] for the simple string editing problem (i.e., transpositions are not permitted) and Myers and Miller [22] for the sequence alignment problem.
We say that a trace has a center crossing iff it contains two lines (u_{1},v_{1}) and (u_{2},v_{2}), u_{1}<u_{2} such that v_{1}>n/2 and v_{2}≤n/2 (Fig. 4).
Let T be an optimal trace that satisfies properties P2P4. If T contains no center crossing, then its lines may be partitioned into sets TL and TR such that TL contains all lines (u,v)∈T with v≤n/2 and TR contains the remaining lines (Fig. 4a). Since there is no center crossing, all lines in TR have a u value greater than the u value of every line in TL. It follows from properties P2P4 that there is an i, 1≤i≤m such that T is the union of an optimal trace for A[ 1:i] and B[ 1:n/2] and that for A[ i+1:m] and B[ n/2+1:n]. Let H[ i] be the cost the former optimal trace and H^{′}[ i+1] that of the latter optimal trace. We see that when T has no center crossing, the cost of T is
When T contains a center crossing, its lines may be partitioned into 3 sets, TL, TM, and TR, as shown in Fig. 4b. Let (u_{1},v_{1}) and (u_{2},v_{2}) be the lines defining the center crossing. Note that TL contains all lines of T with v<v_{2}, TR contains all lines with v>v_{1}, and TM={(u_{1},v_{1}),(u_{2},v_{2})}. Note also that all lines in TL have a u<u_{1} and all in TR have u>u_{2}. From property P1, it follows that TL is an optimal trace for A[ 1:u_{1}−1] and B[ 1:v_{2}−1] and TR is an optimal trace for A[ u_{2}+1:m] and B[ v_{1}+1:n]. Further, since (u_{1},v_{1}) and (u_{2},v_{2}) are balanced lines, the cost of TM is (u_{2}−u_{1}−1)+1+(v_{1}−v_{2}−1). Also, A[ u_{1}]≠A[ u_{2}] as otherwise, replacing the centercrossing lines with (u_{1},v_{2}) and (u_{2},v_{1}) results in a lower cost trace. From property P4, we know that \(u_{1} = lastA[\!u_{2}][\!b_{v_{1}}]\phantom {\dot {i}\!}\) and \(v_{2} = lastB[\!v_{1}][\!a_{u_{2}}]\phantom {\dot {i}\!}\). Let H[ i][ j] be the cost of an optimal trace for A[ 1:i] and B[ 1:j] and let H^{′}[ i][ j] be that for an optimal trace for A[ i:m] and B[ j:n]. So, when T has a center crossing, its cost is
where, for the min{}, we try 1≤u_{1}<m and for each such u_{1}, we set v_{1} to be the smallest i>n/2 for which \(b_{i} = a_{u_{1}}\phantom {\dot {i}\!}\). For each u_{1} we examine all characters other than \(\phantom {\dot {i}\!}a_{u_{1}}\) in the alphabet. For each such character c, v_{2} is set to the largest j≤n/2 for which b_{j}=c and u_{2} is the smallest i>u_{1} for which a_{i}=c. So, the min is taken over (s−1)m terms.
Let U_{top} and T_{top} be the final U and T arrays computed by LS_DL with inputs B[ 1:n/2] and A[ 1:m] and U_{bot} and T_{bot} be these arrays when the inputs are the reverse of B[ n/2+1] and A[m:1]. From these arrays, we may readily determine the H and H^{′} values needed to evaluate Eqs. 3 and 4. Algorithm LSDL_TRACE (Algorithm 4) provides the pseudocode for our linear space computation of an optimal trace. It assumes that LS_DL has been modified to return both the arrays U and T.
For the time complexity, we see that at the top level of the recursion, we invoke LS_DL twice with strings A and B of size m and n/2, respectively. This takes at most amn time for some constant a. The time required to compute Eqs. 3 and 4 is O(sn) and may be absorbed into amn by using a suitably large constant a. At the next level of recursion, LS_DL is invoked 4 times. The sum of the lengths of the A strings across these 4 invocations is at most 2m and the B string has length at most n/4. So, the time for these four invocations is at most amn/2. Generalizing to the remaining levels of recursion, we see that algorithm LSDL_TRACE takes amn(1+1/2+1/4+1/8+…)<2amn=O(mn) time. The space needed is the same as that for LS_DL (note that the parameters to this algorithm have been switched). From the time analysis, it follows that the number of cache misses is approximately twice that for LS_DL when invoked with strings of size m and n. Hence the approximate cache miss count for LSDL_TRACE is 2mn(1+3/w).
We note that some reduction in actual run time can be achieved by switching A and B when A is shorter than B thus ensuring that the shorter string is split at each level of recursion. This enables us to get the recursion terminates faster.
The strip trace algorithm S t r i p_T R A C E
This algorithm differs from LSDL_TRACE in that it uses a modified version of Strip_DL rather than a modified version of LS_DL. The modified version of Strip_DL returns the arrays strip and V computed by Strip_DL. Correspondingly, Strip_TRACE uses V_{top} and V_{bot} in place of T_{top} and T_{bot}. The asymptotic time complexity of Strip_TRACE is also O(mn) and it takes the same amount of space as does Strip_DL (note that the parameters to Strip_DL are switched relative to those for Strip_TRACE). The number of cache misses is approximately twice that for Strip_DL.
Multicore algorithms
In this section, we describe our parallelizations of algorithm DL and the four singlecore algorithms of previous section. These parallelizations assume that the number of processors is small relative to string length. The naming convention we adopt for the parallel versions is adding PP_ as a prefix to the name of the singlecore algorithm.
The algorithm P P_D L
Our parallel version of algorithm DL, PP_DL, computes the elements in the same order as does DL. However, it starts the computation of a row before the computation of its preceding row is complete. Each processor is assigned a unique row to compute and it computes this row from left to right. Let p be the number of processors. Processor z is initially assigned to do the outer loop computation for i=z, 1≤i≤p. Processor z begins after a suitable time lag relative to the start of processor z−1 so that the data it needs for its computation have already been computed by processor z−1. In our code, the time lag between the start of the computation of two consecutive rows is the time needed to compute n/p elements. Upon completion of its iteration i computation, the processor proceeds to iteration i+p of the outer loop. The time complexity of PP_DL is O(mn/p).
The algorithm P P_L S_D L
While the general parallelization strategy for PP_LS_DL is the same as that used in PP_DL, extra care is needed to ensure a computation identical to that of LS_DL. Divergence in results is possible when two or more processors are simultaneously computing different rows of H using the same memory. This happens for example when A=aaabc⋯ and p≥3. We start with processor i assigned to compute row i of H, 1≤i≤p. Suppose that U=x and T[ a]=y initially (note that x and y are addresses in memory). Because of the swap(T[ A[ i]],U) statement in LS_DL, processor 1 begins to compute row 1 of H using memory beginning at the address y. If processor 2 begins with a suitable time lag as in PP_DL, it will compute row 2 of H using memory beginning at the address x. With a further lag, processor 3 will begin to compute row 3 of H again using memory beginning at the address y. Now, both processors 1 and 3 are using the same memory to compute different rows of H and so we run the risk of overwriting H values that may be needed for subsequent computations. As another example, consider A=ababa⋯ and p≥4. Suppose that U=x and T[ a,b]=[ y,z] initially. Processor 1 begins to compute row 1 using the memory y, then, with a lag, processor 2 begins to compute row 2 using memory z, then processor 3 starts to compute row 3 using memory x. Next processor 4 begins to compute row 4 using memory y. At this time processor 1 is computing row 1 with A[ 1]=a and processor 4 is computing row 4 with A[ 4]=b and both processors are using the same row memory y.
Let p_{1} and p_{2} be two processors that are using the same memory to compute rows r_{1}<r_{2} of H and that no processor is using this memory to compute a row between r_{1} and r_{2}. From the swapping assignment scheme used in LS_DL, it follows that p_{1} is computing the row \(r_{1} = lastA[\!r_{2}][\!a_{r_{2}}]1\phantom {\dot {i}\!}\). The H values in this row are needed to compute rows r_{1}+1 through r_{2} as \(\phantom {\dot {i}\!}r_{1} = lastA[\!i][\!a_{r_{2}}] r_{1} < i \le r_{2}\). These values are not needed for rows i>r_{2} as for these rows \(\phantom {\dot {i}\!}lastA[\!i][\!a_{r_{2}}] = r_{2} > r_{1} + 1 = lastA[\!r_{2}][\!a_{r_{2}}]\). Let j_{1} be such that \(\phantom {\dot {i}\!}b_{j} = a_{r_{2}} = a_{r_{1} + 1}\). Then, for j>j_{1}, \(\phantom {\dot {i}\!}lastB[\!j][\!a_{r_{2}}] \ge j_{1}\). Hence, for j>j_{1} columns 1 through j_{1}−2 of row r_{1} are not needed to compute an H in rows between r_{1} and r_{2}.
Our parallel code uses a synchronization scheme that is based on the observations of the preceding paragraph to delay the overwriting of values that are needed for later computations and ensure a correct computation of the DL distance. Our synchronization scheme employs another array W[ 1:n] that is initialized to 1. Suppose that a processor is computing row i of H and that A[ i]=a. When this processor first encounters an a in B, say at position j_{1}, it increments W[0:j_{1}−2]. When the next a isencountered, say at j_{2}, it increments W[ j_{1}−1:j_{2}−2] by 1. When the processor finishes its computation of row i, the remaining positions of W are incremented by 1. The processor assigned to compute row q of H may compute U[ j] iff W[ j]=q. From our earlier observations, it follows that when W[ j]=q, the old values in memory positions U[ 1:j] may be overwritten as these are not needed for future computations.
This pprocessor algorithm PP_LS_DL’s time complexity depends on the data sets as the synchronization delay is data dependent. We, however, expect a runtime performance of approximately O(mn/p) when the characters in B are roughly uniformly distributed.
The algorithm P P_S t r i p_D L
In the parallel version PP_Strip_DL of Strip_DL, processor i is initially assigned to compute strip i, 1≤i≤p. Upon completion of its currently assigned strip j, the processor proceeds to compute strip j+p. An array signal[] is used for synchronization purposes. When computing a row r in its assigned strip s, a processor needs to wait until signal[ r]=s. signal[ r] is set to s by the processor working on strip s−1 when the values to the left of strip s needed in the computation of row r of strip s have been computed and there is no risk that the computations for row r of strip s will overwrite V values needed by other processors. signal works very much like W in PP_LS_DL.
Note that when we are working on p strips, we need p copies of the arrays U and T used by Strip_DL.
The time complexity of PP_Strip_DL depends on the synchronization delay and is expected to approximate O(mn/p).
The algorithm P P_D L_T R A C E
This algorithm first uses PP_DL to compute H[][]. Then, a single processor performs a traceback to construct the optimal trace. For reasonable values of p, the run time is dominated by PP_DL and so, the complexity of PP_DL_TRACE is also O(mn/p).
The algorithms P P_L S D L_T R A C E and P P_S t r i p_T R A C E
In LSDL_TRACE (Strip_TRACE), we repeatedly partition the problem into two and apply either LS_DL (Strip_DL) to each partition. The parallel version PP_LSDL_TRACE (PP_Strip_TRACE) employs the following parallelization strategy:

Each subproblem is solved using PP_LS_DL (PP_Strip_DL) when the number of independent subproblems is small; all p processors are assigned to the parallel solution of a single subproblem. I.e., the subproblems are solved in sequence.

p subproblems are solved in parallel using LS_DL (Strip_DL) to solve each subproblem serially when the number of independent subproblems is large,
The time complexity of PP_LSDL_TRACE and PP_Strip_TRACE is O(mn/p).
Results
Experimental platform and test data
The singlecore algorithms were implemented using C and the multicore ones using C and OpenMP. Our codes may be downloaded from [23]. The following computational platforms were used:

1
Xeon4: Intel Xeon CPU E52603 v2 QuadCore processor 1.8GHz with 10MB cache and 32GB memory.

2
Xeon6: Intel I7x980 SixCore processor 3.33GHz with 12MB LLC cache and 16GB memory.

3
Xeon24: Intel Xeon CPU E52695 v2 2xTwelveCore processors 2.40GHz with 30MB cache and 512GB memory.
We compiled all codes using the gcc compiler with the O2 option. Cache miss and energy consumption data were obtained for our Xeon4 platform using the “perf” [24] software and the RAPL interface. This is the only platform for which we obtained cache miss and energy consumption data.
For test data, we downloaded the real DNA/RNA/protein sequences from the NCBI (National Center for Biotechnology Information) server [25] and PDB (Protein Data Bank) server [15]. In addition to that, we also generated random DNA/RNA and protein sequences.
Xeon E52603 (Xeon4) using random data
DL distance algorithms
The observed cache misses for our DL distance algorithms on our Xeon4 platform for randomly generated sequences of size between 40000 and 400000 are given in Fig. 5 and Table 1. “**” in the table indicates there was insufficient memory for the algorithm to run. The column of Table 1 labeled LvsD (SvsD) presents the percentage changes in cache misses reduced by LS_DL (Strip_DL) relative to DL while that labeled SvsL gives this percentage changes reduced by Strip_DL relative to LS_DL.
Notice that DL runs out of memory when A=B≥160000. Strip_DL has fewer cache misses than LS_DL and LS_DL has fewer cache misses than DL. Strip_DL reduces cache misses by up to 99.0% relative to DL and by up to 99.5% relative to LS_DL.
Run times are given in seconds in Fig. 6 and using the format hh:mm:ss in Table 2 for our random data set. Strip_DL is the fastest followed by LS_DL and DL. Strip_DL reduces run time by up to 61.3% relative to DL and by up to 47.6% relative to LS_DL.
Energy consumption by the CPU and cache are gievn, in joules, in Fig. 7 and Table 3. Strip_DL required up to 68.5% less CPU and cache energy than DL and up to 48.2% less than LS_DL.
DL trace algorithms
The observed cache misses for our singlecore DL trace algorithms on our Xeon4 platform are given in Fig. 8 and Table 4. Since DL_TRACE is simply DL with a linear time traceback added, that cache miss count for DL_TRACE is only slightly more than that for DL. LSDL_TRACE has a higher count than does DL_TRACE for the instances that DL has sufficient memory to solve though the gap narrows with increasing instance size. Strip_TRACE consistently has fewer cache misses than both DL_TRACE and LSDL_TRACE. Strip_TRACE reduces cache misses by up to 98.6% relative to DL_TRACE and by up to 99.5% relative to LSDL_TRACE.
Run times of the DL trace algorithms on our Xeon4 platform are given in seconds in Fig. 9 and Table 5. Strip_TRACE is competitive with DL_TRACE on our instances of size 40,000 and 80,000 and 23.5% faster on the instance of size 120,000. Strip_TRACE was consistently faster than LSDL_TRACE achieving a speedup of up to 47.1%.
The energy consumed by the CPU and cache is given in Fig. 10 and Table 6. Strip_TRACE required up to 46.8% less CPU and cache energy than LSDL_TRACE.
Parallel DL distance algorithms
The observed cache misses for our parallel DL algorithms are given in Fig. 11 and Table 7. PP_Strip_DL has the fewest cache misses followed by PP_LS_DL and PP_DL (in this order). The reduction is cache misses achieved by PP_Strip_DL is up to 99.6% relative to PP_DL and up to 99.4% relative to PP_LS_DL.
Run times for our parallel DL algorithms are given in Fig. 12 and Table 8. PP_Strip is up to 84.2% faster than PP_DL and up to 57.0% faster than PP_LS_DL.
Speedup numbers are given in Table 9. The column labeled DL/PP, for example, is the time for DL divided by that for PP_DL. PP_Strip_DL has a speedup between 3.95 and 3.99, which is quite close to the number of cores (4) on our Xeon4 platform. The speedup for PP_DL is up to 3.45 and that for PP_LS_DL is up to 3.40.
Energy data are given in Fig. 13 and Table 10. PP_Strip_DL used up to 81.4% less CPU and cache energy than did PP_DL and up to 55.7% less than PP_LS_DL.
Although the multicore algorithms use more CPU power than used by their singlecore counterparts, the power increase is less than the decrease in run time. Hence, energy consumption is reduced.
Parallel DL trace algorithms
The number of cache misses incurred by our multicore DL trace algorithms is given in Fig. 14 and Table 11. PP_Strip_TRACE has the fewest number of cache misses. PP_Strip_TRACE reduces cache misses by up to 99.3% and 99.6% relative to PP_DL_TRACE and PP_LSDL_TRACE, respectively.
Run times are given in Fig. 15 and Table 12. PP_Strip_Trace is faster than PP_LSDL_TRACE by up to 59.2%. As in Table 13, the speedup achieved by PP_Strip_TRACE relative to its singlecore version ranges from 3.44 to 3.90.
Energy consumption data are given in Fig. 16 and Table 14. PP_Strip_TRACE required up to 57.6% less CPU and cache energy than PP_LSDL_TRACE.
Xeon E52603 (Xeon4) using real data
Tables 15 and 16, respectively, give the run times for our singlecore and multicore DL and DL trace algorithms using real DNA sequences on our Xeon4 platform. The observed times are quite comparable to those for similarly sized random strings. Further, the speed up achieved by our parallel algorithms relative to the singlecore algorithms is also comparable to that for random strings. So, for our remaining test platforms, we present only the results for our randomly generated data sets.
I7x980 (Xeon6) using random data
DL distance algorithms
Single core run times are given in Fig. 17 and Table 17 for our Xeon6 platform. As can be seen, Strip_DL is the fastest followed by LS_DL and DL. Strip_DL reduces run time by up to 52.3% relative to DL and by up to 76.1% relative to LS_DL. The classical DL algorithm ran out memory when A=B=8000.
DL trace algorithms
The run times for the DL trace algorithms are given in Fig. 18 and Table 18. Strip_TRACE reduces run time by up to 63.5% relative to LSDL_TRACE.
Parallel DL distance algorithms
Run times for the parallel DL distance algorithms are given in Fig. 19 and Table 19. As was the case on our Xeon4 platform, PP_Strip_DL is faster than PP_DL and PP_LS_DL. It reduces the run time by up to 25.6% and 79.4%, respectively. The speedup of our parallel algorithm PP_Strip_DL relative to its singlecore version (Table 20) is up to 5.71. This is quite close to the number of cores (6). The maximum speedup achieved by PP_DL and PP_LS_DL was 4.89 and 5.27, respectively.
Parallel DL trace algorithms
Xeon6 run times for the parallel DL trace algorithms are given in Fig. 20 and Table 21. PP_Strip_TRACE is faster than PP_LSDL_TRACE and reduces the run time by up to 68.9%. As shown in Table 22, PP_Strip_TRACE obtains a speedup of up to 5.33 while the maximum speedup by PP_DL_TRACE and PP_LSDL_TRACE was 5.23 and 4.55, respectively.
Xeon E52695 (Xeon24) using random data
DL distance algorithms
The run times for our singlecore DL distance algorithms on the Xeon24 are given in Fig. 21 and Table 23. As on our other test platforms, Strip_DL is the fastest followed by LS_DL and DL. Strip_DL reduces run time by up to 73.1% relative to DL and by up to 42.9% relative to LS_DL.
DL trace algorithms
The run times for our singlecore DL trace algorithms on the Xeon24 are given in Fig. 22 and Table 24. Strip_TRACE reduces run time by up to 46.9% and 31.5% relative to DL_TRACE and LSDL_TRACE, respectively.
Parallel DL trace algorithms
Parallel DL trace run times are given in Fig. 24 and Table 27. PP_Strip_TRACE is faster than PP_DL_TRACE and PP_LSDL_TRACE on large data. It reduces the run time by up to 51.1% and 50.1%, respectively. PP_Strip_TRACE achieves a speedup of up to 17.42 (Table 28); PP_DL_TRACE and PP_LSDL_TRACE have maximum speedups of 16.98 and 12.60.
Parallel DL distance algorithms
Parallel DL distance run times are given in Fig. 23 and Table 25. PP_Strip_DL is faster than PP_DL and PP_LS_DL and reduces the run time by up to 79.1% and 72.8%, respectively. As can be seen from Table 26, PP_Strip_DL scales quite well and results in a speedup of up to 23.22. The maximum speedups provided by PP_DL and PP_LS_DL are 18.00 and 17.88, respectively.
Discussion
Cache efficient and multicore linearspace algorithms to compute the DL distance between two strings as well as to determine an optimal trace (edit sequence) have been developed. The reduction in space provided by these algorithms enables the solution of much larger instances than is possible using previously known algorithms.
Conclusion
Our algorithms were empirically evaluated on 3 computational platforms. Cachemisses were experimentally measured on one of these platforms and we verified that the algorithms analyzed to have a smaller number of cache misses using our simple cache model actually had fewer misses on a real computational platform. Significant runtime improvement (relative to known algorithms) was seen for our cacheefficient algorithms on all three platforms. On all platforms, the linearspace cacheefficient algorithms Strip_DL and Strip_TRACE were the bestperforming singlecore algorithms to determine the DL distance and optimal trace, respectively.
Strip_DL reduced run time by as much as 73.1% relative to the classical distance algorithm DL and Strip_TRACE reduced run time by as much as 63.5% relative to the classical trace algorithm. Multicore versions of these two algorithms scaled quite well and achieved a speedup of up to 23.22 on a 24 core computer.
We also measured the energy efficiency of our algorithms on one of the platforms. Our best singlecore algorithms reduced energy consumption by as much as 68.5% (relative to the best previously known algorithm) when computing the DL distance and by as much as 46.8% when computing an optimal trace. Our best multicore algorithms achieves up to 81.4% and 57.6% energy consumption reduction, respectively.
Abbreviations
 CPU:

Central processing unit
 DL:

DamerauLevenshtein
 DNA:

DeoxyriboNucleic acid
 LLC:

Last level cache
 LRU:

Least recently used
 NCBI:

National center for biotechnology information
 PDB:

Protein data bank
 RNA:

RiboNucleic acid
References
 1
Levenshtein VI. Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl. 1966; 10(8):707–10.
 2
Damerau FJ. A technique for computer detection and correction of spelling errors. Commun ACM. 1964; 7(3):171–6.
 3
Maier D. The complexity of some problems on subsequences and supersequences. J ACM. 1978; 25(2):322–36.
 4
Robinson DJS. An Introduction to Abstract Algebra. Berlin: Walter de Gruyter; 2003, pp. 255–7.
 5
Jaro MA. Advances in recordlinkage methodology as applied to matching the 1985 census of Tampa, Florida. J Am Stat Assoc. 1989; 84(406):414–20.
 6
Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970; 48(3):443–53.
 7
Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981; 147:195–7.
 8
Lowrance R, Wagner RA. An extension of the stringtostring correction problem. J ACM. 1975; 22(2):177–83.
 9
Brill E, Moore RC. An improved error model for noisy channel spelling correction. In: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics: 2000. p. 286–93.
 10
Bard GV. Spellingerror tolerant, orderindependent passphrases via the DamerauLevenshtein stringedit distance metric. In: Proceedings of the Fifth Australasian Symposium on ACSW Frontiers  Volume 68. ACSW ’07. Darlinghurst: Australian Computer Society, Inc.: 2007. p. 117–24.
 11
Li M, Zhang Y, Zhu M, Zhou M. Exploring distributional similarity based models for query spelling correction. In: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics. ACL44. Stroudsburg: Association for Computational Linguistics: 2006. p. 1025–32.
 12
Faloutsos C, Megalooikonomou V. On data mining, compression, and kolmogorov complexity. Data Min Knowl Disc. 2007; 15(1):3–20.
 13
Cai X, Zhang XC, Joshi B, Johnson R. Touching from a distance: Website fingerprinting attacks and defenses. In: Proceedings of the 2012 ACM Conference on Computer and Communications Security. New York: ACM: 2012. p. 605–16.
 14
Majorek KA, DuninHorkawicz S, Steczkiewicz K, Muszewska A, Nowotny M, Ginalski K, Bujnicki JM. The RNase Hlike superfamily: new members, comparative structural analysis and evolutionary classification. In: Nucleic Acids Research, vol. 2. Oxford: Oxford University Press: 2014. p. 4160–79.
 15
Protein Data Bank. http://www.rcsb.org/pdb/home/home.do. Accessed 15 Aug 2017.
 16
Hyyrö H. A bitvector algorithm for computing levenshtein and damerau edit distances. Nord J Comput. 2003; 10(1):29–39.
 17
Oommen BJ, Loke R. Pattern recognition of strings containing traditional and generalized transposition errors. In: 1995 IEEE International Conference on Systems, Man and Cybernetics. Intelligent Systems for the 21st Century, vol. 2. Washington: IEEE: 1995. p. 1154–9.
 18
Damerau–Levenshtein Distance. https://en.wikipedia.org/wiki/DamerauLevenshtein_distance. Accessed 15 Aug 2017.
 19
Zhao C, Sahni S. Cache and energy efficient alignment of very long sequences. In: 5th IEEE International Conference on Computational Advances in Bio and Medical Sciences. Washington: IEEE: 2015. p. 1–6.
 20
Wagner RA, Fischer MJ. The stringtostring correction problem. J ACM. 1974; 21(1):168–73.
 21
Hirschberg DS. A linear space algorithm for computing longest common subsequences. Commun ACM. 1975; 18(6):341–3.
 22
Myers EW, Miller W. Optimal alignments in linear space. Bioinformatics. 1988; 4(1):11–7.
 23
Source Code Download Link. http://www.cise.ufl.edu/%7esahni/papers/DLDistanceAlgs.tar.gz. Accessed 15 Aug 2017.
 24
Perf Tool. https://perf.wiki.kernel.org/index.php/Main_Page. Accessed 15 Aug 2017.
 25
NCBI Database. http://www.ncbi.nlm.nih.gov/gquery. Accessed 15 Aug 2017.
Acknowledgments
The authors also would like to thank the anonymous reviewers for their valuable comments and suggestions.
Funding
This research was funded, in part, by the National Science Foundation under award NSF 1447711. Publication costs were funded by the University of Florida.
Availability of data and materials
All data generated or analyzed during this study are included in this supplementary information files.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 11, 2019: Selected articles from the 7th IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2017): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume20supplement11.
Author information
Affiliations
Contributions
CZ and SS developed the new linear space cache efficient DL distance algorithms, did theoretical analysis and the experimental results analysis, and wrote the manuscript. CZ programmed the algorithms and ran the benchmark tests. Both authors read and approved the final manuscript.
Corresponding author
Correspondence to Chunchun Zhao.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Published
DOI
Keywords
 Edit distance
 DamerauLevenshtein distance
 Cache efficient
 String correction