String correction using the Damerau-Levenshtein distance

Background
In the string correction problem, we are to transform one string into another using a set of prescribed edit operations. In string correction using the Damerau-Levenshtein (DL) distance, the permissible edit operations are: substitution, insertion, deletion and transposition. Several algorithms for string correction using the DL distance have been proposed. The fastest and most space-efficient of these algorithms is due to Lowrance and Wagner. It computes the DL distance between strings of length m and n, respectively, in O(mn) time and O(mn) space. In this paper, we focus on the development of algorithms whose asymptotic space complexity is lower, and whose actual run time and energy consumption are lower, than those of the algorithm of Lowrance and Wagner.

Results
We develop space- and cache-efficient algorithms to compute the Damerau-Levenshtein (DL) distance between two strings as well as to find a sequence of edit operations of length equal to the DL distance. Our algorithms require O(s min{m,n}+m+n) space, where s is the size of the alphabet and m and n are, respectively, the lengths of the two strings. Previously known algorithms require O(mn) space. The space- and cache-efficient algorithms of this paper are demonstrated, experimentally, to be superior to earlier algorithms for the DL distance problem on time, space, and energy metrics using three different computational platforms.

Conclusion
Our benchmarking shows that our algorithms are able to handle much larger sequences than earlier algorithms due to the reduction in space requirements. On a single core, we are able to compute the DL distance and an optimal edit sequence faster than known algorithms by as much as 73.1% and 63.5%, respectively. Further, we reduce energy consumption by as much as 68.5%. Multicore versions of our algorithms achieve a speedup of 23.2 on 24 cores.


Introduction
In the string correction problem, we are given two strings A and B and are required to find the minimum number of edit operations needed to transform A into B. The permitted edit operations are: (a) substitute a character in A with a different character, (b) insert a character into A, (c) delete a character of A, and (d) transpose two adjacent characters of A. When all four edit operations are permitted, the length of the optimal edit sequence is known as the Damerau-Levenshtein (DL) distance [1,2]. Some applications limit the permissible edit operations to a subset of the stated four operations. As a result, string correction has been studied using other distance metrics as well. For example, the Levenshtein distance [1] is the length of the shortest sequence of substitutions, insertions, and deletions needed to transform A into B. This distance is used in the longest common subsequence problem [3], for example. When only substitutions are allowed, the length of the minimum edit sequence is the Hamming distance [4] and when only transpositions are allowed, this length is the Jaro distance [5].
*Correspondence: czhao@cise.ufl.edu. Department of Computer and Information Science and Engineering, University of Florida, Gainesville 32611, FL, USA
The cost of an edit sequence may be generalized by using weights for the various operations. For example, in sequence alignment using the methods of Needleman and Wunsch [6] and Smith and Waterman [7], transpositions are not permitted, the cost of a substitution depends on the two characters involved, and there is a gap penalty. The string-to-string correction algorithm of Lowrance and Wagner [8] uses a cost of S for a substitution, I for an insertion, D for a deletion, and T for a transposition and requires 2T ≥ I + D. We note that the costs used in computing the DL distance are S = I = D = T = 1 and that these costs satisfy the 2T ≥ I + D requirement of the algorithm of Lowrance and Wagner [8]. In fact, the best algorithm currently known for the DL distance is the one in [8] with edit operation costs set to 1.
Spelling error correction [9][10][11], data clustering and data mining [12], comparing packet traces [13], quantifying the similarity of DNA/RNA/protein sequences, gene finding, and gene function prediction [14] are some of the applications of the DL distance. While, in spelling error correction, the strings A and B are relatively short, in other applications, these strings may be quite long. For example, the length of a protein sequence may exceed 300,000 [15]. Bard [10] has shown that the DL distance is a true metric; that is, it satisfies 1) non-negativity, 2) identity, 3) symmetry, and 4) triangle inequality. The algorithm of Bard [10] computes the DL distance in O(mn * max{m, n}) time, where m is the length of string A and n is the length of B. This algorithm uses O(mn) space. Hyyro [16] has developed a bit-parallel algorithm to determine whether the DL distance between two strings is less than a specified threshold. This bit-parallel algorithm was tested using DNA sequences of length up to 10,000.
In an effort to reduce time complexity, Oommen and Loke [17] consider restricting edit sequences so that no substring is edited more than once. We illustrate this restriction using the example given in [18]. The string CA may be transformed into ABC using the edit sequence CA (transposition) → AC (insertion) → ABC. So, the DL distance between CA and ABC is 2. With the restriction of [17], the second operation in this edit sequence is not permitted as it involves re-editing AC, which resulted from the first edit operation. The restricted DL distance is 3, which corresponds to the restricted edit sequence CA (deletion) → A (insertion) → AB (insertion) → ABC. The restricted DL distance is not a metric as it does not satisfy the triangle inequality.
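The CA/ABC example can be checked with a short script. The sketch below is not taken from [17]; it assumes that the restriction corresponds to the dynamic program commonly called optimal string alignment, in which a transposition may only swap two adjacent characters that are not edited again. The function name `restricted_dl_distance` is ours.

```python
def restricted_dl_distance(a: str, b: str) -> int:
    """Restricted DL distance (assumed here to be optimal string alignment)."""
    m, n = len(a), len(b)
    # h[i][j] = restricted distance between a[:i] and b[:j]
    h = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        h[i][0] = i  # delete all i characters
    for j in range(n + 1):
        h[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(
                h[i - 1][j - 1] + cost,  # substitute or retain
                h[i][j - 1] + 1,         # insert
                h[i - 1][j] + 1,         # delete
            )
            # Transposition of two adjacent characters only; nothing may be
            # inserted between them afterwards.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                best = min(best, h[i - 2][j - 2] + 1)
            h[i][j] = best
    return h[m][n]


print(restricted_dl_distance("CA", "ABC"))  # 3, versus an unrestricted DL distance of 2
```

Under this assumption the script reports 3 for CA and ABC, in agreement with the restricted edit sequence given above.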
The algorithm of Lowrance and Wagner [8] computes the DL distance in O(mn) time while also using O(mn) space. This is the fastest and most space-efficient algorithm known for string correction using the DL distance.
Neither the algorithm of Bard [10] nor that of Lowrance and Wagner [8] is practical when m and n are large due to their excessive space requirement. The former algorithm becomes impractical also due to its excessive run time. In this paper, we focus on the development of algorithms that are more space, time, and energy efficient than that of Lowrance and Wagner [8]. To obtain space efficiency, we observe that the DL distance can be computed by retaining only O(sm) or O(sn) data, where s is the size of the alphabet. We note that, when m and n are large, s is much smaller than m and n. In fact, s = 4 for RNA and DNA sequences and s = 20 for protein sequences and the length of these sequences is often orders of magnitude larger than s.

Cache model
To analyze the cache performance of our algorithms, we use the rather simple cache model which has been used by us successfully in our past work [19]. In this model we have a single-level cache that has l cache lines of size w, where w is the number of data items that can be stored in one cache line. So, when the data size is 4 bytes and w = 8, each cache line is 32 bytes. The size (i.e., capacity) of our one-level cache is lw. In accordance with this cache model, we assume that main memory is divided into blocks whose size is the same as that of a cache line (i.e., w words each). When we attempt to read a piece of data that is not in the cache, a read miss occurs. A read miss causes the corresponding block of main memory to be read into a cache line. When the cache is full, this read miss requires us to first evict the block that is in the least recently used (LRU) cache line. This eviction results in a write of the evicted block to main memory in case the evicted block has changed. A write miss occurs when we attempt to write data that is not in a cache line. At this time, the corresponding block of main memory is read into a cache line and the data we wish to write is written to this cache line. Notice that every read and write miss results in a read access of main memory; some read and write misses also result in the writing of a cache line to main memory.
Today's computers actually employ multiple levels of cache and a far more sophisticated and proprietary cache servicing policy combined with prefetching to hide memory latency. As a result, it is extremely difficult to analyze cache performance using a realistic cache model. The described simple cache model is amenable to analysis and our experiments establish its usefulness for this purpose as algorithms with reduced cache misses using this model actually run faster on computers with more sophisticated cache architectures, replacement policies, and prefetching techniques.
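The cache model described above is simple enough to simulate directly. The following sketch is illustrative only: the function name `count_cache_misses` and its parameters are ours, addresses are word indices, and write-backs of modified blocks are not tracked, so only the miss count of the model is reproduced.

```python
from collections import OrderedDict


def count_cache_misses(addresses, num_lines, line_size):
    """Count misses of a single-level LRU cache with num_lines lines of line_size words."""
    cache = OrderedDict()  # block id -> None, ordered from least to most recently used
    misses = 0
    for addr in addresses:
        block = addr // line_size  # main-memory block that holds this word
        if block in cache:
            cache.move_to_end(block)  # hit: mark as most recently used
        else:
            misses += 1
            if len(cache) == num_lines:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = None
    return misses


# Two left-to-right scans of a 64-word array with w = 8.
scan_twice = list(range(64)) * 2
print(count_cache_misses(scan_twice, num_lines=8, line_size=8))  # 8: array fits in cache
print(count_cache_misses(scan_twice, num_lines=4, line_size=8))  # 16: every block is re-read
```

The second call shows the behavior that the strip-based algorithms of this paper aim to avoid: once the working set exceeds lw words, repeated scans miss on every block.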

Classical DL distance algorithm
Wagner and Fischer [20] developed the notion of a trace, which is useful in reasoning about edit sequences that are limited to substitutions, insertions, and deletions. Lowrance and Wagner [8] extended this notion to include the transposition operation. A trace for the strings $A = a_1 \cdots a_m$ and $B = b_1 \cdots b_n$ is a set T of lines, where the endpoints u and v of a line (u, v) denote positions in A and B, respectively. A set of lines T is a trace iff the lines in T have distinct A positions and distinct B positions. That is, no two lines in T have the same u or the same v.
A line (u, v) is balanced iff $a_u = b_v$ and two lines $(u_1, v_1)$ and $(u_2, v_2)$ cross iff $u_1 < u_2$ and $v_1 > v_2$. As an example, consider A = dafac and B = fdbbec. The set of lines T = {(1, 2), (3, 1), (4, 3), (5, 6)} satisfies the requirements for a trace. Line (4, 3) is not balanced as $a_4 \neq b_3$. The remaining 3 lines in the trace are balanced. The lines (1, 2) and (3, 1) cross. This trace may be depicted as a diagram as in Fig. 1. In a trace, an unbalanced line denotes a substitution operation and a balanced line denotes retaining the character of A. If $a_i$ has no line attached to it, $a_i$ is to be deleted and when $b_j$ has no attached line, it is to be inserted. When two balanced lines $(u_1, v_1)$ and $(u_2, v_2)$ cross, $a_{u_1+1} \cdots a_{u_2-1}$ are to be deleted from A making $a_{u_1}$ and $a_{u_2}$ adjacent, then $a_{u_1}$ and $a_{u_2}$ are to be transposed, and finally, $b_{v_2+1} \cdots b_{v_1-1}$ are to be inserted between the just transposed characters of A.
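The example trace can be checked mechanically. The sketch below assumes the cost rule of Lowrance and Wagner [8], under which a trace costs one unit per unbalanced line, one per position of A or B not touched by a line, and one per crossing; the helper names `is_trace` and `trace_cost` are ours, and positions are 1-indexed as in the text.

```python
def is_trace(trace):
    """A set of lines is a trace iff no two lines share a u or share a v."""
    us = [u for u, _ in trace]
    vs = [v for _, v in trace]
    return len(set(us)) == len(us) and len(set(vs)) == len(vs)


def trace_cost(a, b, trace):
    """Cost of a trace: unbalanced lines + untouched positions + crossings."""
    lines = sorted(trace)  # sorted by u, so u values increase along the list
    unbalanced = sum(1 for u, v in lines if a[u - 1] != b[v - 1])
    untouched = (len(a) - len(lines)) + (len(b) - len(lines))
    crossings = sum(
        1
        for x in range(len(lines))
        for y in range(x + 1, len(lines))
        if lines[x][1] > lines[y][1]  # u_x < u_y and v_x > v_y
    )
    return unbalanced + untouched + crossings


A, B = "dafac", "fdbbec"
T = {(1, 2), (3, 1), (4, 3), (5, 6)}
print(is_trace(T), trace_cost(A, B, T))  # True 5
```

For this trace the count is 1 unbalanced line, 3 untouched positions ($a_2$, $b_4$, $b_5$), and 1 crossing.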
The edit sequence corresponding to the trace of Fig. 1 is: delete $a_2$, transpose $a_1$ and $a_3$, substitute b for $a_4$, insert $b_4$ = b and $b_5$ = e, retain $a_5$. The cost of this edit sequence is 5.
Lowrance and Wagner [8] have proved the following properties:
P1: The cost of a trace equals the number of unbalanced lines plus the number of positions in A and B not touched by a line plus the number of line crossings.
P2: There is a trace whose cost equals that of an optimal edit sequence (Theorem 2 of [8]). Since every trace corresponds to an edit sequence, it follows that the edit sequence that corresponds to a minimum cost trace is optimal.
P3: There is a minimum cost trace in which each line crosses at most one other line and in which every line that crosses another is balanced (Theorem 4 of [8]).
P4: There is a trace T that satisfies property P3 and for every pair of crossing lines ( [...]
[...] When i > 0 and j > 0, [...] If k or l do not exist, then case 4 of the recurrence does not apply. Figure 2 illustrates the four cases of this recurrence. These cases correspond to the four possibilities for an optimal trace that transforms A[1 : i] into B[1 : j] and satisfies properties P2-P4. Such a trace may (a) contain the [...] Figure 2d shows the case of crossing balanced lines (i, l) and (k, j). Note that to perform the crossing operation, we must delete i − k − 1 characters from A, do an adjacent character transposition in A, and then insert j − l − 1 characters from B between the two just transposed positions. So, the cost is $H_{k-1,l-1} + (i - k - 1) + 1 + (j - l - 1)$. Algorithm 1 is the pseudocode to compute H using Eqs. 1 and 2. This is a simplification of the pseudocode given in Lowrance and Wagner [8] to the case when each edit operation has unit cost. In this algorithm, last_row_id[c] keeps track of the last occurrence of character c in A (note that this is a row index of H) and last_col_id keeps track of the last occurrence of $a_i$ in B.
We shall refer to Algorithm 1 as algorithm DL. Its time and space complexities are readily seen to be O(mn). Once H has been computed using algorithm DL, an optimal trace may be obtained in O(m + n) additional time using a standard dynamic programming traceback. We refer to the combination of DL and the traceback as algorithm DL_TRACE.
[...]
16: last_col_id ← j
17: end if
18: end for
19: [...]
The total number of cache misses is dominated by the read and write misses of the array H. So, we count only these misses. In each iteration of the loop for computing row i of H, we need the elements of rows i and i − 1 of H in left-to-right order as in Algorithm 1 lines 9-11 and 14. Since these rows are read from main memory in blocks of size w and row i is written to main memory in blocks of this size, lines 9-11 and 14 result in 2n/w read accesses and n/w write accesses for each i. These lines, therefore, result in 3mn/w cache misses over the entire execution of DL. Line 13 makes one read access of H per iteration and so contributes at most mn to the total cache-miss count. Hence, the cache-miss count for algorithm DL is approximately mn(1 + 3/w).
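A runnable sketch of the quadratic-space computation may help in following the analysis. It is not a transcription of Algorithm 1. It assumes the unit-cost recurrence in which H[i][j] is the minimum of H[i−1][j−1] + (0 if $a_i = b_j$, else 1), H[i][j−1] + 1, H[i−1][j] + 1 and, when both k and l exist, H[k−1][l−1] + (i − k − 1) + 1 + (j − l − 1), where k is the last position before i in A that holds $b_j$ and l is the last position before j in B that holds $a_i$. The names `last_row_id` and `last_col_id` follow the text; the function name is ours.

```python
def dl_distance(a: str, b: str) -> int:
    """Unit-cost Damerau-Levenshtein distance using an (m+1) x (n+1) array H."""
    m, n = len(a), len(b)
    h = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        h[i][0] = i
    for j in range(n + 1):
        h[0][j] = j
    last_row_id = {}  # character c -> last row k < i with a_k == c
    for i in range(1, m + 1):
        last_col_id = 0  # last column l < j with b_l == a_i (0 means none)
        for j in range(1, n + 1):
            k = last_row_id.get(b[j - 1], 0)
            l = last_col_id
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(
                h[i - 1][j - 1] + cost,  # substitute or retain
                h[i][j - 1] + 1,         # insert b_j
                h[i - 1][j] + 1,         # delete a_i
            )
            if k > 0 and l > 0:
                # Crossing lines (i, l) and (k, j): delete, transpose, insert.
                best = min(best, h[k - 1][l - 1] + (i - k - 1) + 1 + (j - l - 1))
            h[i][j] = best
            if cost == 0:
                last_col_id = j
        last_row_id[a[i - 1]] = i
    return h[m][n]


print(dl_distance("CA", "ABC"))  # 2
```

The read of `h[k - 1][l - 1]` is the access that does not follow the left-to-right pattern of the other three terms, which is why the analysis above charges it up to one miss per iteration.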

Single-core algorithms
In this section, we develop four linear-space single-core algorithms for string correction using the DL distance. All four run in O(mn) time. The first two (LS_DL and Strip_DL) compute only the score $H_{mn}$ of the optimal trace; they differ in their cache efficiency. The last two (LSDL_TRACE and Strip_TRACE) compute an optimal trace.

The linear space algorithm LS_DL
Let s be the size of the alphabet. Instead of using the array H used in DL, algorithm LS_DL uses a one-dimensional array U[−1 : n] and a two-dimensional array T[1 : s] [...]
[...]
13: for j ← 1 to n do
14: [...]
[...] O(1) time as pointers to two one-dimensional arrays are swapped rather than the contents of these arrays. The cache-miss count for LS_DL is the same as that for DL when n is suitably large as both have the same data access pattern. However, for smaller instances LS_DL will exhibit much better cache behavior. For example, because of its use of much less memory, the last-level cache (LLC) may be large enough to store all the data in LS_DL but not in DL (O(sn) vs O(mn)).
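The idea of keeping one row per alphabet character can be sketched as follows. This is our own simplified variant and not Algorithm 2: it keeps the previous row, the current row, and, for every character c seen so far in A, the row of H just above the last row in which c occurred, for at most s + 2 rows in total (LS_DL as described uses U and T only). As in LS_DL, rows change roles by swapping references rather than copying contents. All names are ours.

```python
def ls_dl_distance(a: str, b: str) -> int:
    """DL distance in O(s * n) space by keeping one saved row per character of A."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # row i-1 of H (row 0 initially)
    cur = [0] * (n + 1)        # row i of H, filled left to right
    saved = {}                 # c -> row H[k-1], k = last row of A holding c
    last_row_id = {}           # c -> k
    for i in range(1, m + 1):
        ai = a[i - 1]
        cur[0] = i
        last_col_id = 0
        for j in range(1, n + 1):
            bj = b[j - 1]
            cost = 0 if ai == bj else 1
            best = min(prev[j - 1] + cost, cur[j - 1] + 1, prev[j] + 1)
            k, l = last_row_id.get(bj, 0), last_col_id
            if k > 0 and l > 0:
                best = min(best, saved[bj][l - 1] + (i - k - 1) + 1 + (j - l - 1))
            cur[j] = best
            if cost == 0:
                last_col_id = j
        # Rotate references in O(1): row i-1 becomes the saved row for a_i,
        # row i becomes the previous row, and the buffer freed from saved[a_i]
        # (if any) is reused for the next row.
        free = saved.get(ai)
        saved[ai] = prev
        last_row_id[ai] = i
        prev = cur
        cur = free if free is not None else [0] * (n + 1)
    return prev[n]


print(ls_dl_distance("CA", "ABC"))  # 2
```

Only the final score is available at the end; as with LS_DL, the rows needed for a traceback have been overwritten.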

The cache-efficient linear-space algorithm Strip_DL
When (s + 1)n is larger than the size of the last-level cache (LLC), we may reduce cache misses relative to algorithm LS_DL by computing $H_{ij}$ by strips of width q, for some q less than n (the last strip may have a width smaller than q). This is shown in Fig. 3. The strips are computed in the order 0, 1, ... using algorithm LS_DL. However, the space needed by T and U in LS_DL is reduced to (s + 1)q as the strip width is q rather than n. By choosing q small enough, we can ensure that blocks of the T and U arrays used by LS_DL are not evicted from cache once they are brought in. So, if each entry of T and U takes 1 word, then when the cache size is lw, we need q < lw/(s + 1). Note that, in addition to T and U, the cache needs to hold parts of A and B and the other arrays needed to pass data from one strip to the next.
To pass the data from one strip to the next, we use an additional one-dimensional array strip of size m and a two-dimensional s × m array V.
The pseudocode for Strip_DL is given in Algorithm 3. For clarity, this pseudocode uses two strip arrays (lines 18 and 30) and two V arrays (lines 24 and 32). One set of arrays is used to fetch data calculated for the previous strip and the other set for data that is to be passed to the next strip. In the actual implementation, we use a single strip array and a single V array overwriting values received from the previous strip with values to be passed to the next strip.
The time and space complexities of Strip_DL are, respectively, O(mn) and O((s + 1)m + (s + 1)q + n) = O(sm + sq + n) = O(sm + n) as q is a constant. When m > n, we may switch A and B to conserve memory and so the space complexity becomes O(s min{m, n} + m + n).
[...]
end for
7: end for
8: for t ← 1 to n/q do
9: for j ← t * q to t * q + q − 1 do
10: [...]
11: for each character c in the alphabet do
12: T[c][j] ← maxVal;
13: end for
14: end for
15: for i ← 1 to m do
16: [...]
For the cache-miss analysis, we note that q is chosen such that U and T fit into cache. We make the reasonable assumption that the LRU replacement rule does not cause any block of U or T to be evicted during the running of algorithm Strip_DL. As a result, the total number of cache misses due to U and T is independent of m and n and so may be ignored in the analysis. The initialization of strip and V results in m/w and (s + 1)m/w read accesses, respectively. The number of write accesses is approximately the same as the number of read accesses. The computation for each strip accesses the array strip in ascending order of index. This results in (approximately) the same number of cache misses as made during the initialization phase. Hence, the total number of cache misses due to strip is approximately (2m/w)(n/q + 1). For V, we note that when computing the current strip, the elements in any row of V are accessed in non-decreasing order of index (i.e., from left to right) and that we need to retain, in cache, only the most recently read value for each character of the alphabet (i.e., at most s values are to be retained). Making the assumption that a V value is evicted from cache only when a new value for the same character is accessed, the total number of read misses from V when computing a single strip is sm/w. The number of write misses is approximately the same. So, V contributes (2sm/w)(n/q + 1). Hence, the total number of cache misses for algorithm Strip_DL is ≈ 2(s+1)mn/(wq) when m and n are large.
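To get a feel for the two cache-miss estimates, they can be evaluated side by side. The sketch below only restates the approximations derived in the text, mn(1 + 3/w) for DL and (2m/w)(n/q + 1) + (2sm/w)(n/q + 1) for Strip_DL; the function names and the sample parameter values are ours and do not describe any of the test platforms.

```python
def dl_cache_misses(m, n, w):
    """Approximate cache-miss count of algorithm DL: mn(1 + 3/w)."""
    return m * n * (1 + 3 / w)


def strip_dl_cache_misses(m, n, w, s, q):
    """Approximate cache-miss count of Strip_DL from the strip and V arrays."""
    passes = n / q + 1  # one pass per strip plus the initialization
    strip_misses = (2 * m / w) * passes
    v_misses = (2 * s * m / w) * passes
    return strip_misses + v_misses


def max_strip_width(l, w, s):
    """Largest integer q with q < l*w/(s + 1), ignoring the other cached data."""
    return (l * w - 1) // (s + 1)


# Hypothetical sizes: DNA alphabet (s = 4), 8 words per cache line.
m = n = 1000
print(dl_cache_misses(m, n, w=8))                    # 1375000.0
print(strip_dl_cache_misses(m, n, w=8, s=4, q=100))  # 13750.0
print(max_strip_width(l=1024, w=8, s=4))             # 1638
```

Because the cache must also hold parts of A, B, strip and V, the value returned by `max_strip_width` is an upper bound on a usable q rather than a recommended setting.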

The linear-space trace algorithm LSDL_TRACE
Although algorithms LS_DL and Strip_DL determine the score (cost) of an optimal trace (and hence of an optimal edit sequence) that transforms A into B, these algorithms do not save enough information to actually determine an optimal trace. To determine an optimal trace using linear space, we adopt a divide-and-conquer strategy similar to that used by Hirschberg [21] for the simple string editing problem (i.e., transpositions are not permitted) and Myers and Miller [22] for the sequence alignment problem.
We say that a trace has a center crossing iff it contains two lines $(u_1, v_1)$ and $(u_2, v_2)$, $u_1 < u_2$, such that $v_1 > n/2$ and $v_2 \le n/2$ (Fig. 4).
Let T be an optimal trace that satisfies properties P2-P4. If T contains no center crossing, then its lines may be partitioned into sets TL and TR such that TL contains all lines (u, v) ∈ T with v ≤ n/2 and TR contains the remaining lines (Fig. 4a). Since there is no center crossing, all lines in TR have a u value greater than the u value of every line in TL. It follows from properties P2-P4 that there is an i, 1 ≤ i ≤ m, such that T is the union of an optimal trace for A[1 : i] [...]
When T contains a center crossing, its lines may be partitioned into 3 sets, TL, TM, and TR, as shown in Fig. 4b. Let $(u_1, v_1)$ and $(u_2, v_2)$ be the lines defining the center crossing. Note that TL contains all lines of T with $v < v_2$, TR contains all lines with $v > v_1$, and TM = {$(u_1, v_1)$, $(u_2, v_2)$}. Note also that all lines in TL have $u < u_1$ and all in TR have $u > u_2$. From property P1, it follows that TL is an optimal trace for A[1 : [...] where, for the min{}, we try $1 \le u_1 < m$ and for each such $u_1$, we set $v_1$ to be the smallest i > n/2 for which $b_i = a_{u_1}$. For each $u_1$ we examine all characters other than $a_{u_1}$ in the alphabet. For each such character c, $v_2$ is set to the largest j ≤ n/2 for which $b_j = c$ and $u_2$ is the smallest $i > u_1$ for which $a_i = c$. So, the min is taken over (s − 1)m terms.
Let [...] Algorithm 4 provides the pseudocode for LSDL_TRACE, our linear-space computation of an optimal trace. It assumes that LS_DL has been modified to return both the arrays U and T.
For the time complexity, we see that at the top level of the recursion, we invoke LS_DL twice with strings A and B of size m and n/2, respectively. This takes at most amn time for some constant a. The time required to compute Eqs. 3 and 4 is O(sn) and may be absorbed into amn by using a suitably large constant a. At the next level of recursion, LS_DL is invoked 4 times. The sum of the lengths of the A strings across these 4 invocations is at most 2m and the B string has length at most n/4. So, the time for these four invocations is at most amn/2. Generalizing to the remaining levels of recursion, we see that algorithm LSDL_TRACE takes amn(1 + 1/2 + 1/4 + 1/8 + ···) < 2amn = O(mn) time. The space needed is the same as that for LS_DL (note that the parameters to this algorithm have been switched). From the time analysis, it follows that the number of cache misses is approximately twice that for LS_DL when invoked with strings of size m and n. Hence the approximate cache miss count for LSDL_TRACE is 2mn(1 + 3/w).
We note that some reduction in actual run time can be achieved by switching A and B when A is shorter than B, thus ensuring that the shorter string is split at each level of recursion. This allows the recursion to terminate sooner.
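The geometric-series argument in the time analysis can be illustrated numerically. The sketch below is a simplified cost model of our own, not the algorithm itself: a subproblem with strings of length m and n is charged mn units (two LS_DL calls on halves of B with a = 1), B is split in half, and A is split at an arbitrary position chosen by a hypothetical `split` function. Center crossings and the cost of evaluating Eqs. 3 and 4 are ignored.

```python
def recursion_work(m, n, split=lambda m: m // 2):
    """Total units charged by the divide-and-conquer recursion in this cost model."""
    if n <= 1 or m == 0:
        return m * n  # base case: solved directly
    i = split(m)      # position at which A is divided (data dependent in practice)
    half = n // 2
    return (
        m * n                                   # two LS_DL calls on (m, n/2)
        + recursion_work(i, half, split)        # left subproblem
        + recursion_work(m - i, n - half, split)  # right subproblem
    )


m, n = 1000, 1024
print(recursion_work(m, n))                           # 2047000
print(recursion_work(m, n, split=lambda m: m // 10))  # 2047000
print(2 * m * n)                                      # 2048000
```

In this model the total does not depend on where A is split, since the A lengths at each level always sum to m while the B length halves, which is the observation behind the bound of 2amn.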

The strip trace algorithm Strip_TRACE
This algorithm differs from LSDL_TRACE in that it uses a modified version of Strip_DL rather than a modified version of LS_DL. The modified version of Strip_DL returns the arrays strip and V computed by Strip_DL. Correspondingly, Strip_TRACE uses $V_{top}$ and $V_{bot}$ in place of $T_{top}$ and $T_{bot}$. The asymptotic time complexity of Strip_TRACE is also O(mn) and it takes the same amount of space as does Strip_DL (note that the parameters to Strip_DL are switched relative to those for Strip_TRACE). The number of cache misses is approximately twice that for Strip_DL.

Multi-core algorithms
In this section, we describe our parallelizations of algorithm DL and the four single-core algorithms of the previous section. These parallelizations assume that the number of processors is small relative to string length. Our naming convention for the parallel versions is to add PP_ as a prefix to the name of the single-core algorithm.

The algorithm PP_DL
Our parallel version of algorithm DL, PP_DL, computes the elements in the same order as does DL. However, it starts the computation of a row before the computation of its preceding row is complete. Each processor is assigned a unique row to compute and it computes this row from left to right. Let p be the number of processors. Processor z is initially assigned to do the outer loop computation for i = z, 1 ≤ i ≤ p. Processor z begins after a suitable time lag relative to the start of processor z − 1 so that the data it needs for its computation have already been computed by processor z − 1. In our code, the time lag between the start of the computation of two consecutive rows is the time needed to compute n/p elements. Upon completion of its iteration i computation, the processor proceeds to iteration i + p of the outer loop. The time complexity of PP_DL is O(mn/p).
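The pipelined schedule can be illustrated with a small timing model. The sketch below is not the OpenMP implementation: it assumes that computing one element of H takes one time unit, that row i may start once row i − 1 is n/p elements ahead (rounded up), and that the processor handling row i also handles rows i + p, i + 2p, and so on. The function name and return values are ours.

```python
def pp_dl_schedule(m, n, p):
    """Start time of each row and total time under the lagged row schedule."""
    lag = -(-n // p)  # ceil(n / p) elements between consecutive row starts
    start = [0] * (m + 1)
    free_at = [0] * p  # time at which each processor finishes its current row
    for i in range(1, m + 1):
        z = (i - 1) % p  # processor handling rows z+1, z+1+p, ...
        earliest = start[i - 1] + lag if i > 1 else 0
        start[i] = max(free_at[z], earliest)
        free_at[z] = start[i] + n  # a row takes n time units
    return start[1:], max(free_at)


starts, total = pp_dl_schedule(m=8, n=100, p=4)
print(starts)  # [0, 25, 50, 75, 100, 125, 150, 175]
print(total)   # 275, compared with 8 * 100 = 800 on one processor
```

In this model the total time is (m − 1)·⌈n/p⌉ + n, which approaches mn/p when m is much larger than p.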

The algorithm PP_LS_DL
While the general parallelization strategy for PP_LS_DL is the same as that used in PP_DL, extra care is needed to ensure a computation identical to that of LS_DL. Divergence in results is possible when two or more processors are simultaneously computing different rows of H using the same memory. This happens, for example, when A = aaabc · · · and p ≥ 3. We start with processor i assigned to compute row i of H, 1 ≤ i ≤ p. Suppose that U = x and T[a] = y initially (note that x and y are addresses in memory). Because of the swap(T[A[i]], U) statement in LS_DL, processor 1 begins to compute row 1 of H using memory beginning at the address y. If processor 2 begins with a suitable time lag as in PP_DL, it will compute row 2 of H using memory beginning at the address x. With a further lag, processor 3 will begin to compute row 3 of H again using memory beginning at the address y. Now, both processors 1 and 3 are using the same memory to compute different rows of H and so we run the risk of overwriting H values that may be needed for subsequent computations. As another example, consider A = ababa · · · and p ≥ 4. Suppose that U = x and T[a, b] = [y, z] initially. Processor 1 begins to compute row 1 using the memory y, then, with a lag, processor 2 begins to compute row 2 using memory z, then processor 3 starts to compute row 3 [...]
Let $p_1$ and $p_2$ be two processors that are using the same memory to compute rows $r_1 < r_2$ of H and that no processor is using this memory to compute a row between $r_1$ and $r_2$. From the swapping assignment scheme used in LS_DL, it follows that $p_1$ is computing the row $r_1 = lastA[r_2][a_{r_2}] - 1$. The H values in this row are needed to compute rows $r_1 + 1$ through $r_2$ as $r_1 = lastA[i][a_{r_2}]$, $r_1 < i \le r_2$. These values are not needed for rows $i > r_2$ as for these rows lastA [...] Let $j_1$ be such that $b_{j_1} = a_{r_2} = a_{r_1+1}$. Then, for $j > j_1$, $lastB[j][a_{r_2}] \ge j_1$. Hence, for $j > j_1$, columns 1 through $j_1 - 2$ of row $r_1$ are not needed to compute an H in rows between $r_1$ and $r_2$.
Our parallel code uses a synchronization scheme that is based on the observations of the preceding paragraph to delay the overwriting of values that are needed for later computations and ensure a correct computation of the DL distance. Our synchronization scheme employs another array W[1 : n] that is initialized to 1. Suppose that a processor is computing row i of H and that A[i] = a. When this processor first encounters an a in B, say at position $j_1$, it increments W[0 : $j_1$ − 2]. When the next a is encountered, say at $j_2$, it increments W[$j_1$ − 1 : $j_2$ − 2] by [...]
The time complexity of the p-processor algorithm PP_LS_DL depends on the data set as the synchronization delay is data dependent. We, however, expect a run-time performance of approximately O(mn/p) when the characters in B are roughly uniformly distributed.

The algorithm PP_Strip_DL
In the parallel version PP_Strip_DL of Strip_DL, processor i is initially assigned to compute strip i, 1 ≤ i ≤ p. Upon completion of its currently assigned strip j, the processor proceeds to compute strip j + p. An array signal[ ] is used for synchronization purposes. When computing a row r in its assigned strip s, a processor needs to wait until signal[r] = s. signal[r] is set to s by the processor working on strip s − 1 when the values to the left of strip s needed in the computation of row r of strip s have been computed and there is no risk that the computations for row r of strip s will overwrite V values needed by other processors. signal works very much like W in PP_LS_DL.
Note that when we are working on p strips, we need p copies of the arrays U and T used by Strip_DL.
The time complexity of PP_Strip_DL depends on the synchronization delay and is expected to approximate O(mn/p).

The algorithms PP_LSDL_TRACE and PP_Strip_TRACE
In LSDL_TRACE (Strip_TRACE), we repeatedly partition the problem into two and apply LS_DL (Strip_DL) to each partition. The parallel version PP_LSDL_TRACE (PP_Strip_TRACE) employs the following parallelization strategy:
• When the number of independent subproblems is small, each subproblem is solved using PP_LS_DL (PP_Strip_DL); all p processors are assigned to the parallel solution of a single subproblem. That is, the subproblems are solved in sequence.
• When the number of independent subproblems is large, p subproblems are solved in parallel, using LS_DL (Strip_DL) to solve each subproblem serially.
The time complexity of PP_LSDL_TRACE and PP_Strip_TRACE is O(mn/p).

Experimental platform and test data
The single-core algorithms were implemented using C and the multi-core ones using C and OpenMP. Our codes may be downloaded from [23]. The following computational platforms were used:
1. Xeon4: Intel Xeon CPU E5-2603 v2 quad-core processor, 1.8 GHz, with 10 MB cache and 32 GB memory.
We compiled all codes using the gcc compiler with the O2 option. Cache miss and energy consumption data were obtained for our Xeon4 platform using the "perf" [24] software and the RAPL interface. This is the only platform for which we obtained cache miss and energy consumption data.
For test data, we downloaded real DNA/RNA/protein sequences from the NCBI (National Center for Biotechnology Information) server [25] and the PDB (Protein Data Bank) server [15]. We also generated random DNA/RNA and protein sequences.

Xeon E5-2603 (Xeon4) using random data

DL distance algorithms
The observed cache misses for our DL distance algorithms on our Xeon4 platform for randomly generated sequences of size between 40000 and 400000 are given in Fig. 5 and Table 1. "**" in the table indicates there was insufficient memory for the algorithm to run. The column of Table 1 labeled LvsD (SvsD) presents the percentage reduction in cache misses achieved by LS_DL (Strip_DL) relative to DL, while that labeled SvsL gives the percentage reduction achieved by Strip_DL relative to LS_DL. Notice that DL runs out of memory when |A| = |B| ≥ 160000. Strip_DL has fewer cache misses than LS_DL and LS_DL has fewer cache misses than DL. Strip_DL reduces cache misses by up to 99.0% relative to DL and by up to 99.5% relative to LS_DL.
Run times are given in seconds in Fig. 6 and using the format hh : mm : ss in Table 2 for our random data set. Strip_DL is the fastest followed by LS_DL and DL. Strip_DL reduces run time by up to 61.3% relative to DL and by up to 47.6% relative to LS_DL.
Energy consumption by the CPU and cache is given, in joules, in Fig. 7 and Table 3. Strip_DL required up to 68.5% less CPU and cache energy than DL and up to 48.2% less than LS_DL.

DL trace algorithms
The observed cache misses for our single-core DL trace algorithms on our Xeon4 platform are given in Fig. 8 and Table 4. Since DL_TRACE is simply DL with a linear time traceback added, the cache miss count for DL_TRACE is only slightly more than that for DL. LSDL_TRACE has a higher count than does DL_TRACE for the instances that DL has sufficient memory to solve, though the gap narrows with increasing instance size. Strip_TRACE consistently has fewer cache misses than both DL_TRACE and LSDL_TRACE. Strip_TRACE reduces cache misses by up to 98.6% relative to DL_TRACE and by up to 99.5% relative to LSDL_TRACE.
Run times of the DL trace algorithms on our Xeon4 platform are given in seconds in Fig. 9 and Table 5. Strip_TRACE is competitive with DL_TRACE on our instances of size 40,000 and 80,000 and 23.5% faster on the instance of size 120,000. Strip_TRACE was consistently faster than LSDL_TRACE achieving a speedup of up to 47.1%. The energy consumed by the CPU and cache is given in Fig. 10 and Table 6. Strip_TRACE required up to 46.8% less CPU and cache energy than LSDL_TRACE.

Parallel DL distance algorithms
The observed cache misses for our parallel DL algorithms are given in Fig. 11 and Table 7. PP_Strip_DL has the fewest cache misses followed by PP_LS_DL and PP_DL (in this order). The reduction in cache misses achieved by PP_Strip_DL is up to 99.6% relative to PP_DL and up to 99.4% relative to PP_LS_DL.
Run times for our parallel DL algorithms are given in Fig. 12 and Table 8. PP_Strip_DL is up to 84.2% faster than PP_DL and up to 57.0% faster than PP_LS_DL.
Speedup numbers are given in Table 9. The column labeled DL/PP, for example, is the time for DL divided by that for PP_DL. PP_Strip_DL has a speedup between 3.95 and 3.99, which is quite close to the number of cores (4) on our Xeon4 platform. The speedup for PP_DL is up to 3.45 and that for PP_LS_DL is up to 3.40.
Energy data are given in Fig. 13 and Table 10. PP_Strip_DL used up to 81.4% less CPU and cache energy than did PP_DL and up to 55.7% less than PP_LS_DL.
Although the multi-core algorithms use more CPU power than used by their single-core counterparts, the power increase is less than the decrease in run time. Hence, energy consumption is reduced.
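The argument can be written out as a one-line relation. The symbols below are ours and are introduced only for illustration: $P_1$ and $t_1$ are the power and run time of a single-core algorithm, $\alpha > 1$ is the factor by which power increases in the multi-core version, and $\sigma$ is the speedup.

```latex
E_1 = P_1 t_1, \qquad
E_p = (\alpha P_1)\left(\frac{t_1}{\sigma}\right) = \frac{\alpha}{\sigma}\, E_1, \qquad
E_p < E_1 \iff \alpha < \sigma .
```

So energy falls exactly when the factor by which power rises is smaller than the speedup, which is the situation reported above.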

Parallel DL trace algorithms
The number of cache misses incurred by our multicore DL trace algorithms is given in Fig. 14 and Table 11. PP_Strip_TRACE has the fewest cache misses. Run times are given in Fig. 15 and Table 12. PP_Strip_TRACE is faster than PP_LSDL_TRACE by up to 59.2%. As shown in Table 13, the speedup achieved by PP_Strip_TRACE relative to its single-core version ranges from 3.44 to 3.90.
Energy consumption data are given in Fig. 16 and Table 14. PP_Strip_TRACE required up to 57.6% less CPU and cache energy than PP_LSDL_TRACE.
Tables 15 and 16, respectively, give the run times for our single-core and multi-core DL and DL trace algorithms using real DNA sequences on our Xeon4 platform. The observed times are quite comparable to those for similarly sized random strings. Further, the speedup achieved by our parallel algorithms relative to the single-core algorithms is also comparable to that for random strings. So, for our remaining test platforms, we present only the results for our randomly generated data sets.

I7-x980 (Xeon6) using random data

DL distance algorithms
Single-core run times are given in Fig. 17 and Table 17 for our Xeon6 platform. As can be seen, Strip_DL is the fastest, followed by LS_DL and DL. Strip_DL reduces run time by up to 52.3% relative to DL and by up to 76.1% relative to LS_DL. The classical DL algorithm ran out of memory when |A| = |B| = 8000.

DL trace algorithms
The run times for the DL trace algorithms are given in Fig. 18 and Table 18. Strip_TRACE reduces run time by up to 63.5% relative to LSDL_TRACE.

Parallel DL distance algorithms
Run times for the parallel DL distance algorithms are given in Fig. 19 and Table 19. As was the case on our Xeon4 platform, PP_Strip_DL is faster than PP_DL and PP_LS_DL. It reduces the run time by up to 25.6% and 79.4%, respectively. The speedup of our parallel algorithm PP_Strip_DL relative to its single-core version (Table 20) is up to 5.71. This is quite close to the number of cores (6). The maximum speedups achieved by PP_DL and PP_LS_DL were 4.89 and 5.27, respectively.

Parallel DL trace algorithms
Xeon6 run times for the parallel DL trace algorithms are given in Fig. 20 and Table 21. PP_Strip_TRACE is faster than PP_LSDL_TRACE and reduces the run time by up to 68.9%. As shown in Table 22, PP_Strip_TRACE obtains a speedup of up to 5.33, while the maximum speedups of PP_DL_TRACE and PP_LSDL_TRACE were 5.23 and 4.55, respectively.

Xeon E5-2695 (Xeon24) using random data

DL distance algorithms
The run times for our single-core DL distance algorithms on the Xeon24 are given in Fig. 21 and Table 23. As on our other test platforms, Strip_DL is the fastest, followed by LS_DL and DL. Strip_DL reduces run time by up to 73.1% relative to DL and by up to 42.9% relative to LS_DL.
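To relate the percentage reductions quoted in this section to time ratios, the following sketch assumes that "reduces run time by r%" means (t_base - t_new) / t_base expressed as a percentage. That reading, the function names, and the sample timings are assumptions made here for illustration.

```python
def reduction_percent(t_base, t_new):
    """Percentage by which t_new is smaller than t_base."""
    if t_base <= 0:
        raise ValueError("baseline run time must be positive")
    return 100.0 * (t_base - t_new) / t_base


def ratio_from_reduction(percent):
    """Baseline time divided by new time implied by a given percentage reduction."""
    if not 0.0 <= percent < 100.0:
        raise ValueError("percent must be in [0, 100)")
    return 1.0 / (1.0 - percent / 100.0)


# Hypothetical timings in seconds: 100 s reduced to 26.9 s is a 73.1% reduction.
r = reduction_percent(100.0, 26.9)

# A 73.1% reduction means the baseline takes about 3.7 times as long.
factor = ratio_from_reduction(73.1)
```

Under the stated assumption, the 42.9% reduction relative to LS_DL corresponds to a ratio of about 1.75.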

DL trace algorithms
The run times for our single-core DL trace algorithms on the Xeon24 are given in Fig. 22 and Table 24. Strip_TRACE reduces run time by up to 46.9% and 31.5% relative to DL_TRACE and LSDL_TRACE, respectively.

Parallel DL trace algorithms
Parallel DL trace run times are given in Fig. 24 and

Parallel DL distance algorithms
Parallel DL distance run times are given in Fig. 23 and Table 25. PP_Strip_DL is faster than PP_DL and PP_LS_DL and reduces the run time by up to 79.1% and 72.8%, respectively. As can be seen from

Discussion
Cache-efficient and multi-core linear-space algorithms to compute the DL distance between two strings, as well as to determine an optimal trace (edit sequence), have been developed. The reduction in space provided by these algorithms enables the solution of much larger instances than is possible using previously known algorithms.
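For context on the space reduction, the sketch below is the widely used textbook formulation of the quadratic-space DL dynamic program with unit costs, the kind of O(mn)-space baseline that the linear-space algorithms are compared against. It is not the Strip or linear-space algorithm of this paper, and the function and variable names are illustrative choices.

```python
def dl_distance(a, b):
    """Unit-cost Damerau-Levenshtein distance using a full (m+2) x (n+2) table."""
    m, n = len(a), len(b)
    inf = m + n  # larger than any achievable distance
    # H[i+1][j+1] holds the distance between a[:i] and b[:j];
    # row 0 and column 0 are sentinels filled with inf.
    H = [[0] * (n + 2) for _ in range(m + 2)]
    H[0][0] = inf
    for i in range(m + 1):
        H[i + 1][0] = inf
        H[i + 1][1] = i
    for j in range(n + 1):
        H[0][j + 1] = inf
        H[1][j + 1] = j

    last_row = {}  # last (1-based) position in a where each character occurred
    for i in range(1, m + 1):
        last_col = 0  # last (1-based) column in this row where a[i-1] matched b
        for j in range(1, n + 1):
            i1 = last_row.get(b[j - 1], 0)
            j1 = last_col
            cost = 1
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_col = j
            H[i + 1][j + 1] = min(
                H[i][j] + cost,       # substitution or match
                H[i + 1][j] + 1,      # insertion
                H[i][j + 1] + 1,      # deletion
                # transposition, with deletions and insertions in between
                H[i1][j1] + (i - i1 - 1) + 1 + (j - j1 - 1),
            )
        last_row[a[i - 1]] = i
    return H[m + 1][n + 1]
```

The table has (m+2)(n+2) entries, which is why a formulation of this kind exhausts memory on large instances, whereas a linear-space method keeps only a bounded number of rows at a time.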

Conclusion
Our algorithms were empirically evaluated on three computational platforms. Cache misses were experimentally measured on one of these platforms, and we verified that the algorithms predicted by our simple cache model to have fewer cache misses actually had fewer misses on a real computational platform. Significant run-time improvement (relative to known algorithms) was seen for our cache-efficient algorithms on all three platforms. On all platforms, the linear-space cache-efficient algorithms Strip_DL and Strip_TRACE were the best-performing single-core algorithms to determine the DL distance and optimal trace, respectively. Strip_DL reduced run time by as much as 73.1% relative to the classical distance algorithm DL, and Strip_TRACE reduced run time by as much as 63.5% relative to the classical trace algorithm. Multi-core versions of these two algorithms scaled quite well and achieved a speedup of up to 23.22 on a 24-core computer.
We also measured the energy efficiency of our algorithms on one of the platforms. Our best single-core algorithms reduced energy consumption by as much as 68.5% (relative to the best previously known algorithm) when computing the DL distance and by as much as 46.8% when computing an optimal trace. Our best multi-core algorithms achieve up to 81.4% and 57.6% energy consumption reduction, respectively.