String correction using the Damerau-Levenshtein distance

Zhao, Chunchun; Sahni, Sartaj

doi:10.1186/s12859-019-2819-0

Volume 20 Supplement 11

Selected articles from the 7th IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2017): bioinformatics

Research
Open access
Published: 06 June 2019


BMC Bioinformatics volume 20, Article number: 277 (2019)


Abstract

Background

In the string correction problem, we are to transform one string into another using a set of prescribed edit operations. In string correction using the Damerau-Levenshtein (DL) distance, the permissible edit operations are: substitution, insertion, deletion and transposition. Several algorithms for string correction using the DL distance have been proposed. The fastest and most space efficient of these algorithms is due to Lowrance and Wagner. It computes the DL distance between strings of length m and n, respectively, in O(mn) time and O(mn) space. In this paper, we focus on the development of algorithms whose asymptotic space complexity is less and whose actual runtime and energy consumption are less than those of the algorithm of Lowrance and Wagner.

Results

We develop space- and cache-efficient algorithms to compute the Damerau-Levenshtein (DL) distance between two strings as well as to find a sequence of edit operations of length equal to the DL distance. Our algorithms require O(s min{m,n}+m+n) space, where s is the size of the alphabet and m and n are, respectively, the lengths of the two strings. Previously known algorithms require O(mn) space. The space- and cache-efficient algorithms of this paper are demonstrated, experimentally, to be superior to earlier algorithms for the DL distance problem on time, space, and energy metrics using three different computational platforms.

Conclusion

Our benchmarking shows that our algorithms are able to handle much larger sequences than earlier algorithms due to the reduction in space requirements. On a single core, we are able to compute the DL distance and an optimal edit sequence faster than known algorithms by as much as 73.1% and 63.5%, respectively. Further, we reduce energy consumption by as much as 68.5%. Multicore versions of our algorithms achieve a speedup of 23.2 on 24 cores.

Background

Introduction

In the string correction problem, we are given two strings A and B and are required to find the minimum number of edit operations needed to transform A into B. The permitted edit operations are: (a) substitute a character in A to a different character, (b) insert a character into A, (c) delete a character of A, and (d) transpose two adjacent characters of A. When all four edit operations are permitted, the length of the optimal edit sequence is known as the Damerau-Levenshtein (DL) distance [1, 2]. Some applications limit the permissible edit operations to a subset of the stated four operations. As a result, string correction has been studied using other distance metrics as well. For example, the Levenshtein distance [1] is the length of the shortest sequence of substitutions, insertions, and deletions needed to transform A into B. This distance is used in the longest common subsequence problem [3], for example. When only substitutions are allowed, the length of the minimum edit sequence is the Hamming distance [4] and when only transpositions are allowed, this length is the Jaro distance [5].

The cost of an edit sequence may be generalized by using weights for the various operations. For example, in sequence alignment using the methods of Needleman and Wunsch [6] and Smith and Waterman [7], transpositions are not permitted, the cost of a substitution depends on the two characters involved, and there is a gap penalty. The string-to-string correction algorithm of Lowrance and Wagner [8] uses a cost of S for a substitution, I for an insertion, D for a deletion, and T for a transposition and requires 2T≥I+D. We note that the costs used in computing the DL distance are S=I=D=T=1 and that these costs satisfy the 2T≥I+D requirement of the algorithm of Lowrance and Wagner [8]. In fact, the best algorithm currently known for the DL distance is the one in [8] with edit operation costs set to 1.

Spelling error correction [9–11], data clustering and data mining [12], comparing packet traces [13], quantifying the similarity of DNA/RNA/protein sequences, gene finding, and gene function prediction [14] are some of the applications of the DL distance. While, in spelling error correction, the strings A and B are relatively short, in other applications, these strings may be quite long. For example, the length of a protein sequence may exceed 300,000 [15].

Bard [10] has shown that the DL distance is a true metric; that is, it satisfies 1) non-negativity, 2) identity, 3) symmetry, and 4) triangle inequality. The algorithm of Bard [10] computes the DL distance in O(mn∗ max{m,n}) time, where m is the length of string A and n is the length of B. This algorithm uses O(mn) space. Hyyro [16] has developed a bit-parallel algorithm to determine whether the DL distance between two strings is less than a specified threshold. This bit-parallel algorithm was tested using DNA sequences of length up to 10,000.

In an effort to reduce time complexity, Oommen and Loke [17] consider restricting edit sequences so that no substring is edited more than once. We illustrate this restriction using the example given in [18]. The string CA may be transformed into ABC using the edit sequence CA (transposition) → AC (insertion) → ABC. So, the DL distance between CA and ABC is 2. With the restriction of [17], the second operation in this edit sequence is not permitted as it involves re-editing AC, which resulted from the first edit operation. The restricted DL distance is 3, which corresponds to the restricted edit sequence CA (deletion) → A (insertion) → AB (insertion) → ABC. The restricted DL distance is not a metric as it does not satisfy the triangle inequality.
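The CA → ABC example can be checked with a short sketch of the restricted distance. This is an illustration only: it uses the recurrence commonly called the optimal string alignment distance, which we assume here matches the restriction of [17]; the function name `restricted_dl` is hypothetical and not taken from the paper.

```python
def restricted_dl(a: str, b: str) -> int:
    """Restricted DL distance (assumed: optimal string alignment recurrence)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,  # substitute or retain
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j] + 1)         # delete
            # Transposition of two adjacent characters that are not edited again.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]


print(restricted_dl("CA", "ABC"))  # 3
```

With this sketch, CA to AC costs 1 and AC to ABC costs 1, yet CA to ABC costs 3, which is the triangle-inequality violation noted above.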

The algorithm of Lowrance and Wagner [8] computes the DL distance in O(mn) time while also using O(mn) space. This is the fastest and most space efficient algorithm known for string correction using the DL distance.

Neither the algorithm of Bard [10] nor that of Lowrance and Wagner [8] is practical when m and n are large due to their excessive space requirement. The former algorithm becomes impractical also due to its excessive run time. In this paper, we focus on the development of algorithms that are more space, time, and energy efficient than that of Lowrance and Wagner [8]. To obtain space efficiency, we observe that the DL distance can be computed by retaining only O(sm) or O(sn) data, where s is the size of the alphabet. We note that, when m and n are large, s is much smaller than m and n. In fact, s=4 for RNA and DNA sequences and s=20 for protein sequences and the length of these sequences is often orders of magnitude larger than s.

Cache model

To analyze the cache performance of our algorithms, we use the rather simple cache model which has been used by us successfully in our past work [19]. In this model we have a single-level cache that has l cache lines of size w, where w is the number of data items that can be stored in one cache line. So, when the data size is 4 bytes and w=8, each cache line is 32 bytes. The size (i.e., capacity) of our one-level cache is lw. In accordance with this cache model, we assume that main memory is divided into blocks whose size is the same as that of a cache line (i.e., w words each). When we attempt to read a piece of data that is not in the cache, a read miss occurs. A read miss causes the corresponding block of main memory to be read into a cache line. When the cache is full, this read miss requires us to first evict the block that is in the least recently used (LRU) cache line. This eviction results in a write of the evicted block to main memory in case the evicted block has changed. A write miss occurs when we attempt to write data that is not in a cache line. At this time, the corresponding block of main memory is read into a cache line and the data we wish to write is written to this cache line.

Notice that every read and write miss results in a read access of main memory; some read and write misses also result in the writing of a cache line to main memory.
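The cache model described above is simple enough to simulate directly. The following is a minimal sketch, not the authors' tooling: the class name `LRUCacheModel` and its counters are hypothetical, addresses are item indices, and a block is identified by `addr // w`.

```python
from collections import OrderedDict


class LRUCacheModel:
    """Single-level cache with `lines` cache lines of `w` items each, LRU eviction."""

    def __init__(self, lines: int, w: int):
        self.lines, self.w = lines, w
        self.cache = OrderedDict()   # block id -> dirty flag, least recently used first
        self.read_misses = 0
        self.write_misses = 0
        self.writebacks = 0          # evicted blocks that had been modified

    def _access(self, addr: int, is_write: bool) -> None:
        block = addr // self.w
        if block in self.cache:
            self.cache.move_to_end(block)           # mark as most recently used
        else:
            if is_write:
                self.write_misses += 1
            else:
                self.read_misses += 1
            if len(self.cache) >= self.lines:       # cache full: evict the LRU line
                _, dirty = self.cache.popitem(last=False)
                if dirty:
                    self.writebacks += 1
            self.cache[block] = False               # block is read from main memory
        if is_write:
            self.cache[block] = True                # block now differs from memory

    def read(self, addr: int) -> None:
        self._access(addr, is_write=False)

    def write(self, addr: int) -> None:
        self._access(addr, is_write=True)


model = LRUCacheModel(lines=4, w=8)
for addr in range(64):       # sequential scan of 64 items
    model.read(addr)
print(model.read_misses)     # 8, i.e., 64/w
```

A sequential scan of n items incurs about n/w misses in this model, which is the per-row count used in the analyses that follow.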

Today’s computers actually employ multiple levels of cache and a far more sophisticated and proprietary cache servicing policy combined with prefetching to hide memory latency. As a result, it is extremely difficult to analyze cache performance using a realistic cache model. The described simple cache model is amenable to analysis and our experiments establish its usefulness for this purpose as algorithms with reduced cache misses using this model actually run faster on computers with more sophisticated cache architectures, replacement policies, and prefetching techniques.

Classical DL distance algorithm

Wagner and Fischer [20] developed the notion of a trace, which is useful in reasoning about edit sequences that are limited to substitutions, insertions, and deletions. Lowrance and Wagner [8] extended this notion to include the transposition operation. A trace for the strings A=a₁⋯a_m and B=b₁⋯b_n is a set T of lines, where the endpoints u and v of a line (u,v) denote positions in A and B, respectively. A set of lines T is a trace iff:

1. For every (u,v)∈T, u≤m and v≤n.
2. The lines in T have distinct A positions and distinct B positions. That is, no two lines in T have the same u or the same v.

A line (u,v) is balanced iff a_u=b_v and two lines (u₁,v₁) and (u₂,v₂) cross iff (u₁<u₂) and (v₁>v₂). As an example, consider A=dafac and B=fdbbec. The set of lines T={(1,2),(3,1),(4,3),(5,6)} satisfies the requirements for a trace. Line (4,3) is not balanced as a₄≠b₃. The remaining 3 lines in the trace are balanced. The lines (1,2) and (3,1) cross. This trace may be depicted as a diagram as in Fig. 1.

In a trace, an unbalanced line denotes a substitution operation and a balanced line denotes retaining the character of A. If a_i has no line attached to it, a_i is to be deleted and when b_j has no attached line, it is to be inserted. When two balanced lines (u₁,v₁) and (u₂,v₂) cross, $a_{u_{1}+1} \cdots a_{u_{2}-1}$ are to be deleted from A making $a_{u_{1}}$ and $a_{u_{2}}$ adjacent, then $a_{u_{1}}$ and $a_{u_{2}}$ are to be transposed, and finally, $b_{v_{2}+1} \cdots b_{v_{1}-1}$ are to be inserted between the just transposed characters of A.

The edit sequence corresponding to the trace of Fig. 1 is delete a₂, transpose a₁ and a₃, substitute b for a₄, insert b₄=b and b₅=e, retain a₅. The cost of this edit sequence is 5.

Lowrance and Wagner [8] have proved the following properties:

P1. The cost of a trace equals the number of unbalanced lines plus the number of positions in A and B not touched by a line plus the number of line crossings.
P2. There is a trace whose cost equals that of an optimal edit sequence (Theorem 2 of [8]). Since every trace corresponds to an edit sequence, it follows that the edit sequence that corresponds to a minimum cost trace is optimal.
P3. There is a minimum cost trace in which each line crosses at most one other line and in which every line that crosses another is balanced (Theorem 4 of [8]).
P4. There is a trace T that satisfies property P3 and in which, for every pair of crossing lines (u₁,v₁), (u₂,v₂), u₁<u₂, (a) $a_{i} \neq a_{u_{1}} = b_{v_{1}}$ for u₁<i<u₂ and (b) $b_{j} \neq b_{v_{2}} = a_{u_{2}}$ for v₂<j<v₁. In words, u₁ is the last (i.e., rightmost) occurrence of $b_{v_{1}}$ in A that precedes position u₂ of A and v₂ is the last occurrence of $a_{u_{2}}$ in B that precedes position v₁ of B. We refer to these positions as $lastA[u_{2}][b_{v_{1}}]$ and $lastB[v_{1}][a_{u_{2}}]$, respectively (Theorem 5 of [8]).
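The cost formula in the first of these properties can be evaluated mechanically. The sketch below applies it to the example trace for A=dafac and B=fdbbec given earlier; the helper name `trace_cost` is hypothetical, positions are 1-based as in the text, and the input is assumed to already be a valid trace.

```python
def trace_cost(a: str, b: str, trace: set) -> int:
    """Cost of a trace: unbalanced lines + untouched positions + line crossings."""
    # A line (u, v) is unbalanced when a_u != b_v (1-based positions).
    unbalanced = sum(1 for u, v in trace if a[u - 1] != b[v - 1])
    # Every line touches exactly one position of A and one position of B.
    untouched = (len(a) - len(trace)) + (len(b) - len(trace))
    # Lines (u1, v1) and (u2, v2) cross iff u1 < u2 and v1 > v2.
    crossings = sum(1 for u1, v1 in trace for u2, v2 in trace
                    if u1 < u2 and v1 > v2)
    return unbalanced + untouched + crossings


T = {(1, 2), (3, 1), (4, 3), (5, 6)}
print(trace_cost("dafac", "fdbbec", T))  # 5
```

The result agrees with the cost of the edit sequence read off the trace: one deletion, one transposition, one substitution, and two insertions.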

Let H_ij be the DL distance between A[1:i] and B[1:j]. So, H_mn is the DL distance between A and B. The following dynamic programming recurrence follows from properties P1-P4 of a trace.

$$ H_{i,0} = i,\ H_{0,j} = j, \ 0 \le i \le m, \ 0 \le j \le n \tag{1} $$

When i>0 and j>0,

$$ H_{i,j} = \min\left\{ \begin{array}{l} H_{i-1,j-1} + c(a_{i},b_{j}) \\ H_{i,j-1} + 1 \\ H_{i-1,j} + 1 \\ H_{k-1,l-1} + (i-k-1) + 1 + (j-l-1) \end{array} \right. \tag{2} $$

where c(a_i,b_j) is 1 if a_i≠b_j and 0 otherwise, k=lastA[i][b_j] and l=lastB[j][a_i]. If either k or l does not exist, then case 4 of the recurrence does not apply.

Figure 2 illustrates the four cases of this recurrence. These cases correspond to the four possibilities for an optimal trace that transforms A[ 1:i] into B[ 1:j] and satisfies properties P2-P4. Such a trace may (a) contain the line (i,j), (b) contain no line that touches b_j, (c) contain no line that touches a_i, or (d) have crossing balanced lines that involve a_i and b_j. Figure 2a illustrates the first case, which is a substitution between a_i and b_j; we optimally transform A[ 1:i−1] into B[ 1:j−1] and then substitute b_j for a_i. If a_i=b_j, the substitution cost is 0, otherwise it is 1. Figure 2b shows the second case. Here, b_j is inserted at the end of B[ 1:j−1] following an optimal transformation of A[ 1:i] into B[ 1:j−1]. Figure 2c shows the third case in which a_i is deleted from A[1:i] following an optimal transformation of A[ 1:i−1] into B[ 1:j]. Figure 2d shows the case of crossing balanced lines (i,l) and (k,j). Here, A[ 1:k−1] must be optimally transformed into B[ 1:l−1]. Note that to perform the crossing operation, we must delete i−k−1 characters from A, do an adjacent character transposition in A, and then insert j−l−1 characters from B between the two just transposed positions. So, the cost is (i−k−1)+1+(j−l−1).

Algorithm 1 is the pseudocode to compute H using Eqs. 1 and 2. This is a simplification of the pseudocode given in Lowrance and Wagner [8] to the case when each edit operation has unit cost. In this algorithm, last_row_id[c] keeps track of the last occurrence of character c in A (note that this is a row index of H) and last_col_id keeps track of the last occurrence of a_i in B.

We shall refer to Algorithm 1 as algorithm DL. Its time and space complexities are readily seen to be O(mn). Once H has been computed using algorithm DL, an optimal trace may be obtained in O(m+n) additional time using a standard dynamic programming traceback. We refer to the combination of DL and the traceback as algorithm DL_TRACE.
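As an illustration of Eqs. 1 and 2, the following is a minimal Python sketch of the full-matrix computation with unit costs. It is not a transcription of Algorithm 1: dictionaries stand in for the arrays indexed by alphabet symbol, a value of 0 for `k` or `l` signals that the position does not exist, and the names `dl_distance`, `last_row`, and `last_col` are our own.

```python
def dl_distance(a: str, b: str) -> int:
    """DL distance via the O(mn)-time, O(mn)-space recurrence (Eqs. 1 and 2)."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        H[i][0] = i
    for j in range(n + 1):
        H[0][j] = j
    last_row = {}                      # last_row[c]: last row k < i with a_k == c
    for i in range(1, m + 1):
        last_col = 0                   # last column l < j with b_l == a_i
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(H[i - 1][j - 1] + cost,   # substitute or retain
                       H[i][j - 1] + 1,          # insert b_j
                       H[i - 1][j] + 1)          # delete a_i
            k, l = last_row.get(b[j - 1], 0), last_col
            if k and l:                          # case 4 applies only if both exist
                best = min(best,
                           H[k - 1][l - 1] + (i - k - 1) + 1 + (j - l - 1))
            H[i][j] = best
            if cost == 0:
                last_col = j
        last_row[a[i - 1]] = i
    return H[m][n]


print(dl_distance("CA", "ABC"))  # 2
```

The printed value matches the unrestricted DL distance of the CA and ABC example from the Introduction.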

The total number of cache misses is dominated by the read and write misses of the array H. So, we count only these misses. In each iteration of the loop for computing row i of H, we need the elements of rows i and i−1 of H in left-to-right order as in Algorithm 1 lines 9-11 and 14. Since these rows are read from main memory in blocks of size w and row i is written to main memory in blocks of this size, lines 9-11 and 14 result in 2n/w read accesses and n/w write accesses for each i. These lines, therefore, result in 3mn/w cache misses over the entire execution of DL. Line 13 makes one read access of H per iteration and so contributes at most mn to the total cache-miss count. Hence, the cache-miss count for algorithm DL is approximately mn(1+3/w).

Methods

Single-core algorithms

In this section, we develop four linear-space single-core algorithms for string correction using the DL distance. All four run in O(mn) time. The first two (LS_DL and Strip_DL) compute only the score H_mn of the optimal trace; they differ in their cache efficiency. The last two (LSDL_TRACE and Strip_TRACE) compute an optimal trace.

The linear space algorithm LS_DL

Let s be the size of the alphabet. Instead of using the array H used in DL, algorithm LS_DL uses a one-dimensional array U[ −1:n] and a two-dimensional array T[ 1:s][ −1:n]. These two arrays have a space requirement of O((s+1)n) = O(n) for constant s. When m<n, one may swap A and B to reduce the required memory. Adding the memory needed for A and B, the space complexity is O(s min{m,n}+m+n) = O(m+n) when s is a constant.

As in algorithm DL, the H_ij values are computed by rows. The one-dimensional array U is used to save the H[ i][ ∗] values computed by algorithm DL when row i is being computed. Let H[ w][ ∗] be the last row computed for character c. Then, T[ c][ ∗] is row w−1 of H. Algorithm 2 gives the pseudocode for LS_DL. Its correctness follows from the correctness of algorithm DL. Note that swap(T[A[ i]],U) takes O(1) time as pointers to 2 one-dimensional arrays are swapped rather than the content of these arrays. The cache-miss count for LS_DL is the same as that for DL when n is suitably large as both have the same data access pattern. However, for smaller instances LS_DL will exhibit much better cache behavior. For example, because of its use of much less memory, we may have enough LLC cache to store all the data in LS_DL but not in DL (O(sn) vs O(mn)).
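The row-swapping idea can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' Algorithm 2: a dictionary `T` replaces the s-row array, rows are indexed 0..n instead of −1..n, a buffer for a character is allocated on its first occurrence in A, and case 4 is skipped when a_i=b_j (where the diagonal term already attains the minimum). The Python tuple swap exchanges references, mirroring the O(1) pointer swap.

```python
def ls_dl(a: str, b: str) -> int:
    """DL distance using one working row U plus one saved row per character."""
    m, n = len(a), len(b)
    U = list(range(n + 1))        # row 0 of H
    T = {}                        # T[c]: row k-1 of H, k = last row with a_k == c
    last_row = {}                 # last_row[c]: that row index k
    for i in range(1, m + 1):
        ai = a[i - 1]
        # swap(T[a_i], U): T[a_i] now holds row i-1; U becomes a buffer for row i.
        T[ai], U = U, T.get(ai) or [0] * (n + 1)
        prev = T[ai]
        U[0] = i
        last_col = 0              # last column l < j with b_l == a_i
        for j in range(1, n + 1):
            bj = b[j - 1]
            cost = 0 if ai == bj else 1
            best = min(prev[j - 1] + cost, U[j - 1] + 1, prev[j] + 1)
            if cost:              # b_j != a_i, so T[b_j] was not touched by the swap
                k, l = last_row.get(bj, 0), last_col
                if k and l:
                    best = min(best,
                               T[bj][l - 1] + (i - k - 1) + 1 + (j - l - 1))
            else:
                last_col = j
            U[j] = best
        last_row[ai] = i
    return U[n]


print(ls_dl("CA", "ABC"))  # 2
```

At most s+1 rows of length n+1 are alive at any time, which corresponds to the O((s+1)n) bound stated above.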

The cache-efficient linear-space algorithm Strip_DL

When (s+1)n is larger than the size of the LLC cache, we may reduce cache misses relative to algorithm LS_DL by computing H_ij by strips of width q, for some q less than n (the last strip may have a width smaller than q). This is shown in Fig. 3. The strips are computed in the order 0, 1,... using algorithm LS_DL. However, the space needed by T and U in LS_DL is reduced to (s+1)q as the strip width is q rather than n. By choosing q small enough, we can ensure that blocks of the T and U arrays used by LS_DL are not evicted from cache once they are brought in. So, if each entry of T and U takes 1 word, then when the cache size is lw, we require q<lw/(s+1). Note that, in addition to T and U, the cache needs to hold portions of A and B as well as of the other arrays needed to pass the data from one strip to the next.

To pass the data from one strip to the next, we use an additional one-dimensional array strip of size m and a two-dimensional s∗m array V. The array strip records the values of H computed for the rightmost column in the strip. V[c][i] is the H value in row i at the rightmost column j such that (a) column j lies in a strip to the left of the one currently being computed and (b) B[j]=c.

The pseudocode for Strip_DL is given in Algorithm 3. For clarity, this pseudocode uses two strip arrays (lines 18 and 30) and two V arrays (lines 24 and 32). One set of arrays is used to fetch data calculated for the previous strip and the other set for data that is to be passed to the next strip. In the actual implementation, we use a single strip array and a single V array overwriting values received from the previous strip with values to be passed to the next strip.

The time and space complexities of Strip_DL are, respectively, O(mn) and O((s+1)m+(s+1)q+n) = O(sm+sq+n) = O(sm+n) as q is a constant. When m>n, we may switch A and B to conserve memory and so the space complexity becomes O(s min{m,n}+m+n) = O(m+n) for constant s.

In analyzing cache misses, we note that q is chosen such that U and T fit into cache. We make the reasonable assumption that the LRU replacement rule does not cause any block of U or T to be evicted during the running of algorithm Strip_DL. As a result, the total number of cache misses due to U and T is independent of m and n and so may be ignored in the analysis. The initialization of strip and V results in m/w and (s+1)m/w read accesses, respectively. The number of write accesses is approximately the same as the number of read accesses. The computation for each strip accesses the array strip in ascending order of index. This results in (approximately) the same number of cache misses as made during the initialization phase. Hence, the total number of cache misses due to strip is approximately (2m/w)(n/q+1). For V, we note that when computing the current strip, the elements in any row of V are accessed in non-decreasing order of index (i.e., from left to right) and that we need to retain, in cache, only the most recently read value for each character of the alphabet (i.e., at most s values are to be retained). Making the assumption that a V value is evicted from cache only when a new value for the same character is accessed, the total number of read misses from V when computing a single strip is sm/w. The number of write misses is approximately the same. So, V contributes (2sm/w)(n/q+1). Hence, the total number of cache misses for algorithm Strip_DL is ≈2(s+1)mn/(wq) when m and n are large.

Recall that the approximate cache-miss count for algorithms DL and LS_DL is mn(1+3/w). This is (wq+3q)/(2s+2) times that for Strip_DL.
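As a rough numerical illustration of this ratio, consider hypothetical values that are not taken from the experiments reported here: w=8 items per cache line, strip width q=1024 (assumed to satisfy q<lw/(s+1) for the cache at hand), and s=4 as for DNA. Then

$$ \frac{mn(1+3/w)}{2(s+1)mn/(wq)} = \frac{wq+3q}{2s+2} = \frac{8 \cdot 1024 + 3 \cdot 1024}{2 \cdot 4 + 2} = \frac{11264}{10} \approx 1126 $$

so, under the simple cache model and these assumed parameters, DL and LS_DL would incur roughly three orders of magnitude more cache misses than Strip_DL. The ratio grows linearly in q and shrinks as the alphabet size s grows.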

The linear-space trace algorithm LSDL_TRACE

Although algorithms LS_DL and Strip_DL determine the score (cost) of an optimal trace (and hence of an optimal edit sequence) that transforms A into B, these algorithms do not save enough information to actually determine an optimal trace. To determine an optimal trace using linear space, we adopt a divide-and-conquer strategy similar to that used by Hirschberg [21] for the simple string editing problem (i.e., transpositions are not permitted) and Myers and Miller [22] for the sequence alignment problem.

We say that a trace has a center crossing iff it contains two lines (u₁,v₁) and (u₂,v₂), u₁<u₂ such that v₁>n/2 and v₂≤n/2 (Fig. 4).

Let T be an optimal trace that satisfies properties P2-P4. If T contains no center crossing, then its lines may be partitioned into sets TL and TR such that TL contains all lines (u,v)∈T with v≤n/2 and TR contains the remaining lines (Fig. 4a). Since there is no center crossing, all lines in TR have a u value greater than the u value of every line in TL. It follows from properties P2-P4 that there is an i, 1≤i≤m, such that T is the union of an optimal trace for A[1:i] and B[1:n/2] and one for A[i+1:m] and B[n/2+1:n]. Let H[i] be the cost of the former optimal trace and H′[i+1] that of the latter optimal trace. We see that when T has no center crossing, the cost of T is

$$ costNoCC(T) = \min_{1 \le i \le m}\{ H[i] + H'[i+1]\} \tag{3} $$

When T contains a center crossing, its lines may be partitioned into 3 sets, TL, TM, and TR, as shown in Fig. 4b. Let (u₁,v₁) and (u₂,v₂) be the lines defining the center crossing. Note that TL contains all lines of T with v<v₂, TR contains all lines with v>v₁, and TM={(u₁,v₁),(u₂,v₂)}. Note also that all lines in TL have a u<u₁ and all in TR have u>u₂. From property P1, it follows that TL is an optimal trace for A[1:u₁−1] and B[1:v₂−1] and TR is an optimal trace for A[u₂+1:m] and B[v₁+1:n]. Further, since (u₁,v₁) and (u₂,v₂) are balanced lines, the cost of TM is (u₂−u₁−1)+1+(v₁−v₂−1). Also, A[u₁]≠A[u₂] as otherwise, replacing the center-crossing lines with (u₁,v₂) and (u₂,v₁) results in a lower cost trace. From property P4, we know that $u_{1} = lastA[u_{2}][b_{v_{1}}]$ and $v_{2} = lastB[v_{1}][a_{u_{2}}]$. Let H[i][j] be the cost of an optimal trace for A[1:i] and B[1:j] and let H′[i][j] be that for an optimal trace for A[i:m] and B[j:n]. So, when T has a center crossing, its cost is

$$ \begin{aligned} costCC(T) = \min\{ & H[u_{1}-1][v_{2}-1] + H'[u_{2}+1][v_{1}+1] \\ & + (u_{2}-u_{1}-1) + 1 + (v_{1}-v_{2}-1) \} \end{aligned} \tag{4} $$

where, for the min{}, we try 1≤u₁<m and for each such u₁, we set v₁ to be the smallest i>n/2 for which $b_{i} = a_{u_{1}}$. For each u₁ we examine all characters other than $a_{u_{1}}$ in the alphabet. For each such character c, v₂ is set to the largest j≤n/2 for which b_j=c and u₂ is the smallest i>u₁ for which a_i=c. So, the min is taken over (s−1)m terms.

Let U_top and T_top be the final U and T arrays computed by LS_DL with inputs B[1:n/2] and A[1:m] and let U_bot and T_bot be these arrays when the inputs are the reverses of B[n/2+1:n] and A[1:m]. From these arrays, we may readily determine the H and H′ values needed to evaluate Eqs. 3 and 4. Algorithm LSDL_TRACE (Algorithm 4) provides the pseudocode for our linear space computation of an optimal trace. It assumes that LS_DL has been modified to return both the arrays U and T.

For the time complexity, we see that at the top level of the recursion, we invoke LS_DL twice with strings A and B of size m and n/2, respectively. This takes at most amn time for some constant a. The time required to compute Eqs. 3 and 4 is O(sn) and may be absorbed into amn by using a suitably large constant a. At the next level of recursion, LS_DL is invoked 4 times. The sum of the lengths of the A strings across these 4 invocations is at most 2m and the B string has length at most n/4. So, the time for these four invocations is at most amn/2. Generalizing to the remaining levels of recursion, we see that algorithm LSDL_TRACE takes amn(1+1/2+1/4+1/8+…)<2amn=O(mn) time. The space needed is the same as that for LS_DL (note that the parameters to this algorithm have been switched). From the time analysis, it follows that the number of cache misses is approximately twice that for LS_DL when invoked with strings of size m and n. Hence the approximate cache miss count for LSDL_TRACE is 2mn(1+3/w).

We note that some reduction in actual run time can be achieved by switching A and B when A is shorter than B, thus ensuring that the shorter string is split at each level of recursion. This causes the recursion to terminate sooner.

The strip trace algorithm Strip_TRACE

This algorithm differs from LSDL_TRACE in that it uses a modified version of Strip_DL rather than a modified version of LS_DL. The modified version of Strip_DL returns the arrays strip and V computed by Strip_DL. Correspondingly, Strip_TRACE uses V_top and V_bot in place of T_top and T_bot. The asymptotic time complexity of Strip_TRACE is also O(mn) and it takes the same amount of space as does Strip_DL (note that the parameters to Strip_DL are switched relative to those for Strip_TRACE). The number of cache misses is approximately twice that for Strip_DL.

Multi-core algorithms

In this section, we describe our parallelizations of algorithm DL and the four single-core algorithms of the previous section. These parallelizations assume that the number of processors is small relative to string length. Our naming convention for the parallel versions is to add PP_ as a prefix to the name of the single-core algorithm.

The algorithm PP_DL

Our parallel version of algorithm DL, PP_DL, computes the elements in the same order as does DL. However, it starts the computation of a row before the computation of its preceding row is complete. Each processor is assigned a unique row to compute and it computes this row from left to right. Let p be the number of processors. Processor z is initially assigned to do the outer loop computation for i=z, 1≤i≤p. Processor z begins after a suitable time lag relative to the start of processor z−1 so that the data it needs for its computation have already been computed by processor z−1. In our code, the time lag between the start of the computation of two consecutive rows is the time needed to compute n/p elements. Upon completion of its iteration i computation, the processor proceeds to iteration i+p of the outer loop. The time complexity of PP_DL is O(mn/p).

The algorithm PP_LS_DL

While the general parallelization strategy for PP_LS_DL is the same as that used in PP_DL, extra care is needed to ensure a computation identical to that of LS_DL. Divergence in results is possible when two or more processors are simultaneously computing different rows of H using the same memory. This happens for example when A=aaabc⋯ and p≥3. We start with processor i assigned to compute row i of H, 1≤i≤p. Suppose that U=x and T[ a]=y initially (note that x and y are addresses in memory). Because of the swap(T[ A[ i]],U) statement in LS_DL, processor 1 begins to compute row 1 of H using memory beginning at the address y. If processor 2 begins with a suitable time lag as in PP_DL, it will compute row 2 of H using memory beginning at the address x. With a further lag, processor 3 will begin to compute row 3 of H again using memory beginning at the address y. Now, both processors 1 and 3 are using the same memory to compute different rows of H and so we run the risk of overwriting H values that may be needed for subsequent computations. As another example, consider A=ababa⋯ and p≥4. Suppose that U=x and T[ a,b]=[ y,z] initially. Processor 1 begins to compute row 1 using the memory y, then, with a lag, processor 2 begins to compute row 2 using memory z, then processor 3 starts to compute row 3 using memory x. Next processor 4 begins to compute row 4 using memory y. At this time processor 1 is computing row 1 with A[ 1]=a and processor 4 is computing row 4 with A[ 4]=b and both processors are using the same row memory y.
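The two aliasing examples can be reproduced by simulating only the pointer swaps. In this sketch the labels x, y, z stand for the memory addresses named in the text, and the function `row_buffers` is a hypothetical helper that reports which buffer each row of H is written into; it models swap(T[A[i]],U) and nothing else.

```python
def row_buffers(a: str, t_init: dict, u_init: str = "x") -> list:
    """Return, for rows 1..len(a), the label of the buffer row i is computed in."""
    t = dict(t_init)   # t[c]: buffer currently referenced by T[c]
    u = u_init         # buffer currently referenced by U
    used = []
    for ch in a:
        t[ch], u = u, t[ch]   # swap(T[A[i]], U)
        used.append(u)        # row i is written into the buffer U now points to
    return used


print(row_buffers("aaa", {"a": "y"}))             # ['y', 'x', 'y']
print(row_buffers("abab", {"a": "y", "b": "z"}))  # ['y', 'z', 'x', 'y']
```

Rows 1 and 3 of the first example, and rows 1 and 4 of the second, land in the same buffer y, which is the conflict that arises once enough processors are active at the same time.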

Let p₁ and p₂ be two processors that are using the same memory to compute rows r₁<r₂ of H and suppose that no processor is using this memory to compute a row between r₁ and r₂. From the swapping assignment scheme used in LS_DL, it follows that p₁ is computing the row $r_{1} = lastA[r_{2}][a_{r_{2}}]-1$. The H values in this row are needed to compute rows r₁+1 through r₂ as $r_{1} = lastA[i][a_{r_{2}}]$, $r_{1} < i \le r_{2}$. These values are not needed for rows i>r₂ as for these rows $lastA[i][a_{r_{2}}] = r_{2} > r_{1} + 1 = lastA[r_{2}][a_{r_{2}}]$. Let j₁ be such that $b_{j_{1}} = a_{r_{2}} = a_{r_{1} + 1}$. Then, for j>j₁, $lastB[j][a_{r_{2}}] \ge j_{1}$. Hence, for j>j₁, columns 1 through j₁−2 of row r₁ are not needed to compute an H in rows between r₁ and r₂.

Our parallel code uses a synchronization scheme that is based on the observations of the preceding paragraph to delay the overwriting of values that are needed for later computations and ensure a correct computation of the DL distance. Our synchronization scheme employs another array W[1:n] that is initialized to 1. Suppose that a processor is computing row i of H and that A[i]=a. When this processor first encounters an a in B, say at position j₁, it increments W[0:j₁−2] by 1. When the next a is encountered, say at j₂, it increments W[j₁−1:j₂−2] by 1. When the processor finishes its computation of row i, the remaining positions of W are incremented by 1. The processor assigned to compute row q of H may compute U[j] iff W[j]=q. From our earlier observations, it follows that when W[j]=q, the old values in memory positions U[1:j] may be overwritten as these are not needed for future computations.

The time complexity of the p-processor algorithm PP_LS_DL depends on the input data, as the synchronization delay is data dependent. We, however, expect a run-time performance of approximately O(mn/p) when the characters in B are roughly uniformly distributed.

The algorithm PP_Strip_DL

In the parallel version PP_Strip_DL of Strip_DL, processor i is initially assigned to compute strip i, 1≤i≤p. Upon completion of its currently assigned strip j, the processor proceeds to compute strip j+p. An array signal[] is used for synchronization purposes. When computing a row r in its assigned strip s, a processor needs to wait until signal[ r]=s. signal[ r] is set to s by the processor working on strip s−1 when the values to the left of strip s needed in the computation of row r of strip s have been computed and there is no risk that the computations for row r of strip s will overwrite V values needed by other processors. signal works very much like W in PP_LS_DL.

Note that when we are working on p strips, we need p copies of the arrays U and T used by Strip_DL.

The time complexity of PP_Strip_DL depends on the synchronization delay and is expected to approximate O(mn/p).

The algorithm PP_DL_TRACE

This algorithm first uses PP_DL to compute H[][]. Then, a single processor performs a traceback to construct the optimal trace. For reasonable values of p, the run time is dominated by PP_DL and so, the complexity of PP_DL_TRACE is also O(mn/p).
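To see why the parallel fill dominates, one can compare idealized operation counts for the two phases. The sketch below assumes a perfectly balanced fill of the m×n matrix (about mn/p cells per processor, ignoring synchronization delay) and uses the fact that a traceback moves to a smaller row or column index at every step, so it takes at most m+n steps. The function name `phase_work` is hypothetical and the counts are not measurements.

```python
def phase_work(m, n, p):
    """Return (cells filled per processor, upper bound on traceback steps)
    under an idealized, perfectly balanced model."""
    fill_per_processor = (m * n + p - 1) // p   # ceiling of m*n/p
    trace_steps_max = m + n                     # each step decreases i or j
    return fill_per_processor, trace_steps_max


# Example: two strings of length 40000 on 4 processors.
print(phase_work(40000, 40000, 4))
```

In this model the serial traceback becomes comparable to the fill only when p approaches mn/(m+n), which for strings of length 40000 is about 20000 processors.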

The algorithms PP_LSDL_TRACE and PP_Strip_TRACE

In LSDL_TRACE (Strip_TRACE), we repeatedly partition the problem into two and apply LS_DL (Strip_DL) to each partition. The parallel version PP_LSDL_TRACE (PP_Strip_TRACE) employs the following parallelization strategy:

When the number of independent subproblems is small, each subproblem is solved using PP_LS_DL (PP_Strip_DL); all p processors are assigned to the parallel solution of a single subproblem, i.e., the subproblems are solved in sequence.
When the number of independent subproblems is large, p subproblems are solved in parallel, with each subproblem solved serially using LS_DL (Strip_DL).

The time complexity of PP_LSDL_TRACE and PP_Strip_TRACE is O(mn/p).
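The two-mode strategy can be sketched as a per-level plan over the recursion. The text only distinguishes a "small" from a "large" number of independent subproblems; the threshold used here (fewer than p subproblems counts as small) and the doubling of the subproblem count at each level are assumptions made for illustration, and `plan_levels` is a hypothetical name.

```python
def plan_levels(depth, p):
    """For each recursion level, return (level, number of subproblems, mode).
    'within' : subproblems solved one after another, p processors on each.
    'across' : p subproblems solved concurrently, one processor on each.
    Assumption: the switch happens once there are at least p subproblems."""
    plan = []
    for level in range(depth):
        count = 2 ** level          # each level splits every problem into two
        mode = "within" if count < p else "across"
        plan.append((level, count, mode))
    return plan


for entry in plan_levels(4, 4):
    print(entry)
```

Under these assumptions, the top levels of the recursion use the parallel distance algorithm on each subproblem, and the deeper levels switch to running serial solvers side by side.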

Results

Experimental platform and test data

The single-core algorithms were implemented using C and the multi-core ones using C and OpenMP. Our codes may be downloaded from [23]. The following computational platforms were used:

1. Xeon4: Intel Xeon CPU E5-2603 v2 Quad-Core processor 1.8GHz with 10MB cache and 32GB memory.
2. Xeon6: Intel I7-x980 Six-Core processor 3.33GHz with 12MB LLC cache and 16GB memory.
3. Xeon24: Intel Xeon CPU E5-2695 v2 2xTwelve-Core processors 2.40GHz with 30MB cache and 512GB memory.

We compiled all codes using the gcc compiler with the -O2 option. Cache miss and energy consumption data were obtained using the “perf” [24] software and the RAPL interface. Xeon4 is the only platform for which we obtained these data.

For test data, we downloaded real DNA/RNA/protein sequences from the NCBI (National Center for Biotechnology Information) server [25] and the PDB (Protein Data Bank) server [15]. In addition, we generated random DNA/RNA and protein sequences.

Xeon E5-2603 (Xeon4) using random data

DL distance algorithms

The observed cache misses for our DL distance algorithms on our Xeon4 platform for randomly generated sequences of size between 40000 and 400000 are given in Fig. 5 and Table 1. “**” in the table indicates that there was insufficient memory for the algorithm to run. The column of Table 1 labeled LvsD (SvsD) gives the percentage reduction in cache misses achieved by LS_DL (Strip_DL) relative to DL, while the column labeled SvsL gives the percentage reduction achieved by Strip_DL relative to LS_DL.
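A percentage reduction of this kind is conventionally computed as below. The formula is assumed here rather than stated in the text, the function name `pct_reduction` is hypothetical, and the numbers in the example are illustrative only and are not taken from Table 1.

```python
def pct_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`,
    e.g. LvsD with baseline = misses of DL and improved = misses of LS_DL."""
    return 100.0 * (baseline - improved) / baseline


# Illustrative values only: 200 million misses reduced to 50 million.
print(pct_reduction(200, 50))
```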

Table 1 Cache misses for DL distance algorithms, in millions, on Xeon4
