Linear space string correction algorithm using the Damerau-Levenshtein distance

Zhao, Chunchun; Sahni, Sartaj

doi:10.1186/s12859-019-3184-8

Volume 21 Supplement 1

Selected articles from the 8th IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2018): bioinformatics

Research
Open access
Published: 09 December 2020


BMC Bioinformatics volume 21, Article number: 4 (2020)


Abstract

Background

The Damerau-Levenshtein (DL) distance metric has been widely used in the biological sciences. It identifies similar regions of DNA, RNA, and protein sequences by transforming one sequence into another using substitution, insertion, deletion, and transposition operations. Lowrance and Wagner developed an O(mn) time, O(mn) space algorithm to find the minimum cost edit sequence between strings of length m and n, respectively. In our previous research, we developed algorithms that run in O(mn) time using only O(s∗min{m,n}+m+n) space, where s is the size of the alphabet comprising the strings, to compute the DL distance as well as the corresponding edit sequence. These are, so far, the fastest and most space-efficient algorithms. In this paper, we focus on the development of algorithms whose asymptotic space complexity is linear.

Results

We develop linear space algorithms to compute the Damerau-Levenshtein (DL) distance between two strings and determine the optimal trace (the corresponding edit operations). Extensive experiments conducted on three computational platforms (Xeon E5 2603, I7-x980, and Xeon E5 2695) show that our algorithms, in addition to using less space, are much faster than earlier algorithms.

Conclusion

Besides using less space than the previously known algorithms, our new algorithms showed significant run-time improvement on all three of our experimental platforms. Compared to previous algorithms, our linear-space cache-efficient algorithms reduced run time by as much as 56.4% when computing the DL distance and by as much as 57.4% when computing an optimal edit sequence. Our multi-core algorithms reduced the run time by up to 59.3% compared to the best previously known multi-core algorithms.

Background

Introduction

The Damerau-Levenshtein (DL) distance between two strings is the minimum number of substitutions, inserts, deletes, and transpositions of adjacent characters required to transform one string into the other. Some of the applications of the DL distance are spelling error correction [1–3], comparing packet traces [4], data mining and clustering [5], quantifying the similarity of biological sequences and gene function prediction [6], analysis of B cell receptor repertoire data [7], virus detection in software [8], clustering of RNA-seq read segments [9], DNA repeats detection [10], and codes for DNA based storage [11]. In some of these applications (e.g., spelling error correction), the strings are rather small, while in others (e.g., comparing protein sequences) the strings could be tens of thousands of characters long [12], and in yet others (e.g., comparing chromosomes) they could be millions of characters long [13].

Other string edit distances used in the literature permit only a proper subset of the operations permitted by the DL distance. For example, in the Levenshtein distance [14] transpositions are not permitted, in the Hamming distance [15] only substitutions are permitted, and in the Jaro distance [16], only transpositions are permitted. The correct distance metric to use depends on the application. In the applications cited above, the DL distance is used as all 4 edit operations are permitted.

Lowrance and Wagner [17] considered a generalization of DL distance to the case when substitutions, inserts, deletes, and transpositions have different costs. Through a suitable choice of weights, the weighted DL distance can be made equal to the DL distance, Levenshtein distance, Hamming distance, and Jaro distance.

Lowrance and Wagner [17] have developed an O(mn) time and O(mn) space algorithm to find the minimum cost edit sequence (i.e., sequence of substitutions, inserts, deletes, and transpositions) that transforms a given string of length m into a given string of length n provided that 2T≥I+D, where T, I, and D are, respectively, the cost of a transposition, insertion, and deletion. In the DL distance, T=I=D=1 and so, 2T=I+D. Hence, the algorithm of Lowrance and Wagner [17] may be used to compute the DL distance as well as the corresponding edit sequence in O(mn) time and O(mn) space. This observation has also been made in [2]. In [18] we developed algorithms that run in O(mn) time using only O(s∗min{m,n}+m+n) space, where s is the size of the alphabet comprising the strings, to compute the DL distance as well as the corresponding edit sequence. Since s<<m and s<<n in most applications (e.g., s=20 for protein sequences), this reduction in space enables the solution of much larger instances than is possible using the algorithm of [17]. Our algorithms in [18] are much faster as well. In this paper, we develop algorithms to compute the DL distance and corresponding edit sequence using O(m+n) space and O(mn) time. Extensive experimentation using 3 different platforms indicates that the algorithms of this paper are also faster than those of [18]. In fact, our fastest algorithm for the DL distance is up to 56.4% faster than the fastest algorithm in [18] when run on a single core. The single core speedup to find the corresponding edit sequence is up to 57.4%. Our algorithms may be adapted to run on multicores, providing a speedup of up to 59.3%.

DL dynamic programming recurrences

Let A[1:m]=a₁a₂⋯a_m and B[1:n]=b₁b₂...b_n be two strings of length m and n, respectively. Let H_ij be the DL distance between A[1:i] and B[1:j]. So, H_mn is the DL distance between A and B. The dynamic programming recurrence for H is given below [17, 18].

$$ H_{i,0} = i,\ H_{0,j} = j, \ 0 \le i \le m, \ 0 \le j \le n $$

(1)

When i>0 and j>0,

$$ H_{i,j} = \min\left\{ \begin{array} {lcr} H_{i-1,j-1}+ c(a_{i},b_{j}) \\ H_{i,j-1}+ 1 \\ H_{i-1,j}+ 1 \\ H_{k-1,l-1} + (i-k-1)+ 1 + (j-l-1) \\ \end{array} \right. $$

(2)

where c(a_i,b_j) is 1 if a_i≠b_j and 0 otherwise, k=lastA[i][b_j] is the last (i.e., rightmost) occurrence of b_j in A that precedes position i of A, and l=lastB[j][a_i] is the last occurrence of a_i in B that precedes position j of B. When either k or l does not exist, case 4 of Eq. 2 does not apply.

The four cases in Eq. 2 correspond to the four allowable edit operations: substitution, insertion, deletion and transposition. These cases are illustrated in Fig. 1, which depicts the possibilities for an optimal transformation of A[1:i] to B[1:j]. Figure 1a illustrates the first case, which is to optimally transform A[1:i−1] into B[1:j−1] and then substitute b_j for a_i. If a_i=b_j, the substitution cost is 0, otherwise it is 1. Figure 1b shows the second case. Here, A[1:i] is optimally transformed into B[1:j−1] and then b_j is inserted at the end. In the third case (Fig. 1c) A[1:i−1] is optimally transformed into B[1:j] and then a_i is deleted. For the fourth and final case (Fig. 1d), assume that k and l exist. In this case, we are going to transpose a_k and a_i. We first optimally transform A[1:k−1] into B[1:l−1]. Since only adjacent characters may be transposed, the transposition of a_k and a_i must be preceded by a deletion of a_{k+1} through a_{i−1}, which results in a_k and a_i becoming adjacent. Following the transposition, we insert b_{l+1} through b_{j−1} between the transposed a_k and a_i, thereby transforming A[1:i] into B[1:j]. The cost of optimally transforming A[1:k−1] into B[1:l−1] is H_{k−1,l−1}. The ensuing deletions have a cost of i−k−1 as this is the number of deletions performed, the transposition of a_k and a_i costs 1, and the final inserts cost j−l−1. So, the overall cost of case 4 is H_{k−1,l−1} + (i−k−1)+1 + (j−l−1).
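The recurrence of Eqs. 1 and 2 can be sketched directly as a full-matrix computation. The following Python sketch is illustrative only: the function name dl_distance_full is hypothetical, a dictionary stands in for lastA, a scalar stands in for lastB, and the value 0 is used to mean that k or l does not exist. It uses O(mn) space, as in the algorithm of [17], and is not the implementation benchmarked in this paper.

```python
def dl_distance_full(a: str, b: str) -> int:
    """DL distance via Eqs. 1 and 2 using a full (m+1) x (n+1) table."""
    m, n = len(a), len(b)
    # H[i][j] is the DL distance between a[:i] and b[:j] (Eq. 1 gives the borders).
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        H[i][0] = i
    for j in range(n + 1):
        H[0][j] = j

    last_a = {}  # last_a[c] = largest 1-based k < i with a_k == c (absent: none)
    for i in range(1, m + 1):
        last_b = 0  # largest 1-based l < j with b_l == a_i (0: none)
        for j in range(1, n + 1):
            k = last_a.get(b[j - 1], 0)
            l = last_b
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(
                H[i - 1][j - 1] + cost,  # case 1: substitute (or retain)
                H[i][j - 1] + 1,         # case 2: insert b_j
                H[i - 1][j] + 1,         # case 3: delete a_i
            )
            if k > 0 and l > 0:
                # case 4: delete a_{k+1..i-1}, transpose, insert b_{l+1..j-1}
                best = min(best, H[k - 1][l - 1] + (i - k - 1) + 1 + (j - l - 1))
            H[i][j] = best
            if cost == 0:
                last_b = j
        last_a[a[i - 1]] = i
    return H[m][n]
```

For instance, transforming "ca" into "abc" costs 2 here (one transposition followed by one insert), whereas without transpositions three operations would be needed.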

The algorithm of Lowrance and Wagner [17] computes H_m,n using an m×n array for H and one-dimensional arrays of size s for lastA and lastB, where s is the size of the alphabet from which the strings A and B are drawn. It computes the H_i,j’s by rows beginning with row 1 and within a row, the elements are computed left-to-right by columns. While algorithm LS_DL of [18] also computes H by rows and within a row by columns left-to-right, it does this using a one-dimensional array of size n for the current row being computed, a one-dimensional array of size s for lastA, and an s×n array T with the property that if w is the last row of H computed so far such that A[w]=c, then T[c][∗]=H[w−1][∗]. As noted in [18], when m<n, we may swap the strings A and B, resulting in a space requirement of O(s min{m,n}+n). The time complexity of LS_DL is O(mn). A cache-efficient version, Strip_DL, of LS_DL that computes H by strips whose width is no larger than the cache size is also developed in [18]. This cache-efficient algorithm has the same asymptotic time and space complexities as does LS_DL. But, as demonstrated in [18], Strip_DL is much faster than LS_DL.

The linear space algorithms we develop in this paper use a refined dynamic programming recurrence for H. We make the observation that when either a_i=b_j or min{i−k,j−l}≥2 in the fourth case of Eq. 2, then it is sufficient to consider only the first three cases. To see this, note that when a_i=b_j the transposition of a_k and a_i done following the deletion of a_{k+1} through a_{i−1} in case 4 (Fig. 1d) is unnecessary as a_k = b_j = a_i. So, one of the first three cases yields a value no larger than that of case 4. Next, consider the case when a_i≠b_j and min{i−k,j−l}≥2. Suppose that 2≤i−k≤j−l. The cost of transforming A[1:i] to B[1:j] by using an optimal transformation of A[1:k−1] to B[1:l−1] and then doing j−l+1 substitutions and inserts is H_{k−1,l−1}+j−l+1. Doing the transformation as in case 4 has a cost H_{k−1,l−1}+(i−k−1)+1+(j−l−1)≥H_{k−1,l−1}+j−l+1. So, doing the transposition (case 4) isn’t any better than using only substitutions and inserts. Hence, case 4 need not be considered. The case when 2≤j−l≤i−k is symmetric.
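The inequality used in the preceding argument can be written out step by step. The derivation below is only a restatement under the unit-cost assumption of the DL distance. Since A[k:i] has i−k+1 characters and B[l:j] has j−l+1 characters, and i−k≤j−l, the substitution-and-insert alternative uses i−k+1 substitutions and (j−l)−(i−k) inserts, for a total of j−l+1 operations. For case 4, using i−k≥2,

$$ \begin{aligned} (i-k-1) + 1 + (j-l-1) &= (i-k) + (j-l) - 1 \\ &\ge 2 + (j-l) - 1 \\ &= j-l+1 \end{aligned} $$

For example, with i−k=3 and j−l=4, case 4 costs 2+1+3=6 beyond H_{k−1,l−1}, while the substitution-and-insert alternative costs 5.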

The preceding observation establishes the correctness of the following refined recurrence for H.

$$ H_{i,0} = i,\ H_{0,j} = j, \ 0 \le i \le m, \ 0 \le j \le n $$

(3)

When i>0 and j>0,

$$ {\begin{aligned} H_{i,j} = \min\left\{ \begin{array} {lcr} H_{i-1,j-1}+ c(a_{i},b_{j}) \\[-2pt] H_{i,j-1}+ 1 \\[-2pt] H_{i-1,j}+ 1 \\[-2pt] \left\{ \begin{array} {lcr} H_{k-1,j-2} + (i-k) \\[-2pt] \ \ \ \ \ \ if\ j-l=1\ and\ a_{i} \ne b_{j} \\[-2pt] H_{i-2,l-1} + (j-l) \\[-2pt] \ \ \ \ \ \ if\ i-k=1\ \ and\ a_{i} \ne b_{j} \\[-2pt] \infty \ otherwise \end{array} \right. \end{array} \right. \end{aligned}} $$

(4)

where c(a_i,b_j) is 1 if a_i≠b_j and 0 otherwise, k=lastA[i][b_j] and l=lastB[j][a_i].
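As a minimal sketch of the refined recurrence of Eqs. 3 and 4, the Python function below fills a full table and applies the transposition term only when a_i≠b_j and either j−l=1 or i−k=1. The function name dl_distance_refined and the use of 0 to mean that k or l does not exist are assumptions made for illustration. The full table is kept only for readability; the linear-space algorithms of this paper avoid it.

```python
def dl_distance_refined(a: str, b: str) -> int:
    """DL distance via the refined recurrence (Eqs. 3 and 4), full table."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        H[i][0] = i
    for j in range(n + 1):
        H[0][j] = j

    last_a = {}  # last_a[c] = largest 1-based k < i with a_k == c
    for i in range(1, m + 1):
        last_b = 0  # largest 1-based l < j with b_l == a_i (0: none)
        for j in range(1, n + 1):
            k = last_a.get(b[j - 1], 0)
            l = last_b
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(H[i - 1][j - 1] + cost, H[i][j - 1] + 1, H[i - 1][j] + 1)
            if cost == 1 and k > 0 and l > 0:
                if j - l == 1:
                    # b_{j-1} == a_i: delete a_{k+1..i-1}, then transpose
                    best = min(best, H[k - 1][j - 2] + (i - k))
                if i - k == 1:
                    # a_{i-1} == b_j: transpose, then insert b_{l+1..j-1}
                    best = min(best, H[i - 2][l - 1] + (j - l))
            H[i][j] = best
            if cost == 0:
                last_b = j
        last_a[a[i - 1]] = i
    return H[m][n]
```

Note that the transposition term now reads only H_{k−1,j−2} or H_{i−2,l−1}, i.e., a value tied to column j−2 or to row i−2. This is the property that makes a linear-space formulation possible.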

We observe that the above refined recurrence for H holds even in the weighted setting provided that 2S≤I+D≤2T, where S, I, D, and T are, respectively, the cost of a substitution, insertion, deletion, and transposition; the cost of a substitution is >0 when the characters involved are different and 0 when these are the same. This observation follows from the following.

When a_i=b_j, S=0

$$ \begin{aligned} &H_{k-1,l-1} + (i-k-1)D+ T + (j-l-1)I \\ &= H_{k,l} + (i-k-1)D+ T + (j-l-1)I \\ &\ge H_{i-1,l} + T + (j-l-1)I \\ &\ge H_{i-1,j-1} \\ &= H_{i,j} \end{aligned} $$

(5)

When 2≤i−k≤j−l

$$ \begin{aligned} &H_{k-1,l-1} + (i-k-1)D+ T + (j-l-1)I \\ &\ge H_{k-1,l-1} + (i-k-1)(2S-I)+ S + (j-l-1)I \\ &= H_{k-1,l-1} + (i-k+1)S + ((j-l)-(i-k))I\\ &\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ + (i-k-2)S \\ &\ge H_{k-1,l-1} + (i-k+1)S + ((j-l)-(i-k))I \\ &\ge H_{i,j} \\ \end{aligned} $$

(6)

The case when 2≤j−l≤i−k is symmetric.

The algorithms developed in this paper are based on our refined recurrence for H.

Methods

DL distance algorithms

In this section, we develop two algorithms, LS_DL2 and Strip_DL2, to compute the DL distance between two strings of length m and n drawn from an alphabet of size s. We note that when s>m+n, at least s−m−n characters of the alphabet appear in neither A nor B. So, these non-appearing characters may be removed from the alphabet and we can work with this reduced size alphabet. Hence, throughout this paper, we assume that s≤m+n. Our algorithms, which take O(m+n) space, are based on the recurrence of Eqs. 3 and 4 and are the counterparts of algorithms LS_DL and Strip_DL of [18] that are based on the recurrence of Eqs. 1 and 2.

Algorithm LS_DL2

Like algorithm LS_DL of [18], LS_DL2 (Algorithm 1) computes H by rows from top to bottom and within a row by columns from left to right. For convenience, we augment H with row −1 and column −1. All values on this row and on this column are maxVal, where maxVal is a large value. Algorithm LS_DL2 uses 4 one-dimensional arrays, last_row_id[1:s], R[−1:n], R1[−1:n], and FR[−1:n], and a few simple variables that have the following semantics when we are computing H_ij. k and l are as in Eq. 4.

1. FR[q]=H_k−1,q−2 for the current i in case q≥j and for the next i in case q<j
2. R[q]=H_i,q if q<j and H_i−2,q if q≥j.
3. R1[q]=H_i−1,q
4. last_row_id[c] = largest k<i such that A[k]=c
5. last_col_id = largest l<j such that B[l]=A[i]
6. T is the value to use for H_i−2,l−1 should this be needed in the computation of H_ij
7. last_i2l1=H_i−2,j−1
8. diag ⋯ Case 1 of Eq. 4
9. left ⋯ Case 2 of Eq. 4
10. up ⋯ Case 3 of Eq. 4
11. transpose ⋯ Case 4 of Eq. 4

Lines 2 and 3 of the algorithm initialize FR, R1, and R so that following the swap of line 6, R1[q]=H_0,q and R[q]=H_−1,q, −1≤q≤n. In other words, at the start of iteration i=1 of the for loop of lines 9–30, R1 and R, respectively, correspond to rows i−1 (i.e., row 0) and i−2 (i.e., row −1) of H. At this time, FR[−1:n]=maxVal, which corresponds to the initial situation that k=lastA[i][b_j] is undefined. This will be updated as lastA[i][b_j] gets defined. last_row_id[1:s] is initialized to −1 in line 4 for each character c to indicate the fact that at the start of the i=1 loop, A[p], p<1, isn’t a character of the alphabet. last_col_id is set to −1 at the start of each iteration of the loop of lines 5–32 as at the start of this loop, no character of B has been examined and so there is no last seen occurrence of A[i] in B (for this iteration of the for loop of lines 5–32). Also, at the start of the computation of each row of H, last_i2l1 is set to R[0], because, by the semantics of last_i2l1, when we are computing H_i,1, last_i2l1=H_i−2,0=R[0]. Following this, R[0] is set to i to indicate that the cheapest way to transform A[1:i] into B[0:0] is to do i deletes at a total cost of i (hence, when we are computing H_i,1 in the loop of lines 9–30, R[q]=H_i,q for q<1), and T is set to maxVal as when we start a row computation, l is undefined. So, the initializations establish the variable semantics given above.

In lines 10–12 of the loop of lines 9–30, diag, left, and up are set to the values specified in cases 1–3 of Eq. 4. Note that from the semantics of the variables, R1[j−1]=H_i−1,j−1, R[j−1]=H_i,j−1, and R1[j]=H_i−1,j at this time. Line 13 computes the minimum of the terms in the first 3 cases of Eq. 4. If A[i]=B[j] (line 14), then H_ij is determined by cases 1–3 and the value of temp computed in line 13 is H_i,j. At this time, we need to update last_col_id as the most recently seen occurrence of A[i] is now at position j of B. Since A[i]=B[j], lastA[i+1][b_j]=i. So, the H_k−1,j−2 to use for the next i in case 4 is H_i−1,j−2=R1[j−2]. This value is saved in FR[j] in line 16. Since A[i]=B[j], the value to use for H_i−2,l−1 in case 4 of Eq. 4 in future iterations of the for j loop (until, of course, we encounter another j where A[i]=B[j]) becomes H_i−2,j−1, which by the variable semantics is last_i2l1. This value is saved in T in line 17.

When A[i]≠B[j], lines 19–26 are executed. From the semantics of last_row_id and last_col_id, it follows that line 19 correctly sets k and l. Lines 22 and 24 compute the cost of case 4 when j−l=1 and i−k=1, respectively. Note that by the semantics of FR and T, FR[j]=H_k−1,j−2 and T=H_i−2,l−1. So, lines 22 and 25 update temp to be H_i,j. When we reach line 29, regardless of whether A[i]=B[j] or A[i]≠B[j], R[j]=H_i−2,j and temp=H_i,j.

In line 28, we set last_i2l1=R[j]=H_i−2,j. While this momentarily upsets the semantics of last_i2l1, the semantics are restored upon the start of the next iteration of the for j loop as j increases by 1, or in line 8 if the for j loop terminates and we advance to the next iteration of the for i loop. Line 29 sets R[j]=H_ij, which similarly upsets the semantics of R but correctly sets up the semantics for the next iteration. Finally, line 31 correctly updates last_row_id so as to preserve its semantics for the next iteration of the for i loop.

The correctness of LS_DL2 follows from the preceding discussion. The space and time complexities are readily seen to be O(m+n+s)=O(m+n) and O(mn+s)=O(mn), respectively. When m<n, the space complexity may be reduced by a constant factor by swapping A and B. Using the LRU cache model of [18], one may show that LS_DL2 has approximately 3mn/w cache misses, where w is the width of a cache line. By comparison, the number of cache misses for LS_DL is mn(1+3/w).
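The variable semantics described above can be illustrated with the following Python sketch. It is reconstructed from the prose description rather than transcribed from Algorithm 1, so its statement order and line numbers do not match those cited in the text. The function name ls_dl2 is hypothetical, lists are indexed 0..n instead of −1..n (entries that would fall in column −1 are replaced by max_val), the two rows are swapped at the end of each row rather than at the start, and a dictionary stands in for last_row_id with 0 meaning undefined.

```python
def ls_dl2(a: str, b: str) -> int:
    """Linear-space DL distance following the semantics of R, R1, FR and T."""
    m, n = len(a), len(b)
    max_val = m + n + 1                # larger than any possible H value
    R = [max_val] * (n + 1)            # row i-2 (initially row -1)
    R1 = list(range(n + 1))            # row i-1 (initially row 0, H[0][j] = j)
    FR = [max_val] * (n + 1)           # FR[j] = H[k-1][j-2], k = last row with a_k == b_j
    last_row_id = {}                   # last_row_id[c] = largest k < i with a_k == c

    for i in range(1, m + 1):
        last_col_id = 0                # largest l < j with b_l == a_i (0: none)
        T = max_val                    # H[i-2][l-1] for the current l
        last_i2l1 = R[0]               # H[i-2][j-1] for j = 1
        R[0] = i                       # H[i][0] = i
        for j in range(1, n + 1):
            equal = a[i - 1] == b[j - 1]
            diag = R1[j - 1] + (0 if equal else 1)   # case 1
            left = R[j - 1] + 1                      # case 2 (R[j-1] is already row i)
            up = R1[j] + 1                           # case 3
            temp = min(diag, left, up)
            if equal:
                last_col_id = j
                FR[j] = R1[j - 2] if j >= 2 else max_val  # H[i-1][j-2] for later rows
                T = last_i2l1                              # H[i-2][j-1]
            else:
                k = last_row_id.get(b[j - 1], 0)
                l = last_col_id
                if k > 0 and l > 0:
                    if j - l == 1:
                        temp = min(temp, FR[j] + (i - k))  # case 4, j-l = 1
                    if i - k == 1:
                        temp = min(temp, T + (j - l))      # case 4, i-k = 1
            last_i2l1 = R[j]           # H[i-2][j], needed at column j+1
            R[j] = temp                # R[j] now holds H[i][j]
        last_row_id[a[i - 1]] = i
        R, R1 = R1, R                  # R1 becomes row i, R becomes row i-1
    return R1[n]
```

Only three arrays of length n+1 and one table with an entry per distinct character of A are kept, which matches the O(m+n) space bound stated above.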

Strip algorithm Strip_DL2

As in [18], we can reduce cache misses, which in turn reduces run time, by partitioning H into n/q strips of size m×q, where q is the largest strip width for which the data needed in the computation of the strip fits into the cache. H is computed by strips from left to right and the computation of each strip is done using LS_DL2. To enable this computation by strips, one strip needs to pass computed values to the next using three additional one-dimensional arrays C, C1, and FC of size m each. C records the values of H computed for the rightmost column in the strip; C1 records the values of H computed for the next to rightmost column in the strip; and FC[i] is the value of T (i.e., H[i−2][l−1]) at row i, where l is the last column where B[l]=A[i] in the strip. We name this new algorithm Strip_DL2. The space complexity of Strip_DL2 is O(m+n) and its time complexity is O(mn). For the LRU cache model of [18], the number of cache misses is approximately $\frac {6mn}{wq}$.
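To get a feel for the three cache-miss estimates quoted in this section, mn(1+3/w) for LS_DL, 3mn/w for LS_DL2, and 6mn/(wq) for Strip_DL2, the following small Python sketch evaluates them side by side. The function name and the sample values of m, n, w, and q are hypothetical and chosen only for illustration; they are not measurements from our platforms.

```python
def cache_miss_estimates(m: int, n: int, w: int, q: int) -> dict:
    """Evaluate the approximate cache-miss formulas under the LRU model of [18].

    m, n: string lengths; w: cache-line width; q: strip width.
    """
    mn = m * n
    return {
        "LS_DL": mn * (1 + 3 / w),
        "LS_DL2": 3 * mn / w,
        "Strip_DL2": 6 * mn / (w * q),
    }


# Hypothetical parameters, for illustration only.
estimates = cache_miss_estimates(m=1000, n=1000, w=16, q=100)
for name, misses in estimates.items():
    print(f"{name}: about {misses:.0f} cache misses")
```

With these sample parameters the estimate for LS_DL2 is (w+3)/3 times smaller than that for LS_DL, and the strip version lowers the estimate by a further factor of q/2.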

DL trace algorithms

Wagner and Fischer [19] introduced the concept of a trace to describe an edit sequence when the edit operations are limited to insert, delete, and substitute. Lowrance and Wagner [17] extended the concept of a trace to include transpositions. We reproduce here the definition and example used by us in [18]. A trace for the strings A=a₁⋯a_m and B=b₁⋯b_n is a set T of lines, where the endpoints u and v of a line (u,v) denote positions in A and B, respectively. A set of lines T is a trace iff:

1. For every (u,v)∈T, u≤m and v≤n.
2. The lines in T have distinct A positions and distinct B positions. That is, no two lines in T have the same u or the same v.

A line (u,v) is balanced iff a_u=b_v and two lines (u₁,v₁) and (u₂,v₂) cross iff (u₁<u₂) and (v₁>v₂). It is easy to see that T={(1,2),(3,1),(4,3),(5,6)} (see Fig. 2) is a trace for the strings A=dafac and B=fdbbec. Line (4,3) is not balanced as a₄≠b₃. The remaining 3 lines in the trace are balanced. The lines (1,2) and (3,1) cross.

In a trace, unbalanced lines denote a substitution operation and balanced lines denote retaining the character of A. If a_i has no line attached to it, a_i is to be deleted and when b_j has no attached line, it is to be inserted. When two balanced lines (u₁,v₁) and (u₂,v₂),u₁<u₂ cross, $a_{u_{1}+1} \cdots a_{u_{2}-1}$ are to be deleted from A making $a_{u_{1}}$ and $a_{u_{2}}$ adjacent, then $a_{u_{1}}$ and $a_{u_{2}}$ are to be transposed, and finally, $b_{v_{2}+1} \cdots b_{v_{1}-1}$ are to be inserted between the just transposed characters of A.

The edit sequence corresponding to the trace of Fig. 2 is delete a₂, transpose a₁ and a₃, substitute b for a₄, insert b₄=b and b₅=e, retain a₅. The cost of this edit sequence is 5.
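The cost of the edit sequence for a given trace can be checked mechanically. The Python sketch below assumes unit costs and counts one delete per character of A with no line, one insert per character of B with no line, one substitution per unbalanced line, and one transposition per pair of crossing lines. The function name trace_cost is hypothetical, positions are 1-based as in the text, and the function validates only the two trace conditions listed above.

```python
def trace_cost(a: str, b: str, trace: list) -> int:
    """Unit-cost edit sequence length implied by a trace of (u, v) lines."""
    m, n = len(a), len(b)
    us = [u for u, _ in trace]
    vs = [v for _, v in trace]
    # Condition 1: endpoints are valid positions in A and B.
    if any(not (1 <= u <= m and 1 <= v <= n) for u, v in trace):
        raise ValueError("line endpoint out of range")
    # Condition 2: no two lines share an A position or a B position.
    if len(set(us)) != len(us) or len(set(vs)) != len(vs):
        raise ValueError("lines must have distinct endpoints")

    deletes = m - len(trace)        # characters of A with no line
    inserts = n - len(trace)        # characters of B with no line
    substitutions = sum(1 for u, v in trace if a[u - 1] != b[v - 1])
    lines = sorted(trace)           # ordered by u, so u1 < u2 for earlier/later lines
    transpositions = sum(
        1
        for x in range(len(lines))
        for y in range(x + 1, len(lines))
        if lines[x][1] > lines[y][1]  # u1 < u2 and v1 > v2: the lines cross
    )
    return deletes + inserts + substitutions + transpositions


print(trace_cost("dafac", "fdbbec", [(1, 2), (3, 1), (4, 3), (5, 6)]))
```

For the trace of Fig. 2 this counts 1 delete, 2 inserts, 1 substitution, and 1 transposition, which agrees with the cost of 5 stated above.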

In [18], we used a divide-and-conquer strategy similar to that used by Hirschberg [20] to determine an optimal trace in O(mn) time and O(s min{m,n}+n) space. In [18], we made a distinction between traces that have a center crossing and those that do not. A trace has a center crossing iff it contains two lines (u₁,v₁) and (u₂,v₂) such that v₂≤n/2 and v₁>n/2, u₁<u₂, while satisfying (a) $a_{i} \neq a_{u_{1}} = b_{v_{1}}, u_{1} < i < u_{2}$ and (b) $b_{j} \neq b_{v_{2}} = a_{u_{2}}, v_{2} < j < v_{1}$. In words, u₁ is the last (i.e., rightmost) occurrence of $b_{v_{1}}$ in A that precedes position u₂ of A and v₂ is the last occurrence of $a_{u_{2}}$ in B that precedes position v₁ of B (see Fig. 3).

In [18], we showed that the cost of an optimal trace T is given by Eq. 7 when T has no center crossing and by Eq. 8 when T has a center crossing. Hence, the cost of T is the smaller of these two costs.

$$ costNoCC(T) = \min_{1 \le i \le m}\{ H[i] + H'[i+1]\} $$

(7)

where H[i] is the cost of an optimal trace for A[1:i] and B[1:n/2] and H^′[i+1] that for an optimal trace for A[i+1:m] and B[n/2+1:n].

$$ {{}\begin{aligned} costCC(T) = \min\{H[u_{1}\,-\,1][v_{2}\,-\,1] + H'[u_{2}\,+\,1][v_{1}\!+1] \\[-2pt] + (u_{2}-u_{1}-1) + 1 + (v_{1}-v_{2}-1)\} \end{aligned}} $$

(8)

where H[i][j] is the cost of an optimal trace for A[1:i] and B[1:j] and H^′[i][j] is that for an optimal trace for A[i:m] and B[j:n]. For the min{}, we try 1≤u₁<m and for each such u₁, we set v₁ to be the smallest i>n/2 for which $b_{i} = a_{u_{1}}$. For each u₁ we examine all characters other than $a_{u_{1}}$ in the alphabet. For each such character c, v₂ is set to the largest j≤n/2 for which b_j=c and u₂ is the smallest i>u₁ for which a_i=c.

Our new algorithms, LS_TRACE2 and Strip_TRACE2, are based on an adaptation of Eqs. 7 and 8 using Eq. 4.

Algorithm LS_TRACE2

Consider the case when the optimal trace has no center crossing. Let R^f[] be the value of R[] when LS_DL2(B[1:n/2],A[1:m]) terminates and let R′^f[] be the value of R[] when LS_DL2(B[n:n/2+1],A[m:1]) terminates. Let R1^f, R1′^f, FR^f, and FR′^f be the corresponding final values for R1 and FR. From Eq. 7 and LS_DL2, we obtain

$$ \begin{aligned} costNoCC(T) & = \min_{1 \le i \le m}\{ H[i] + H'[i+1]\} \\ & = \min_{1 \le i \le m}\{R^{f}[i] + R'^{f}[i+1] \} \\ \end{aligned} $$

(9)
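The split in Eqs. 7 and 9 can be sketched as follows. This Python sketch is an illustration under stated assumptions, not the LS_TRACE2 code: the helper dl_last_column and the function cost_no_cc are hypothetical names, the helper keeps a full table for readability instead of the linear-space arrays of LS_DL2, and the split position i=0 (all of A matched against the second half of B) is also tried.

```python
def dl_last_column(a: str, b: str) -> list:
    """Return [H[i][len(b)] for i = 0..len(a)] using the recurrence of Eq. 2."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        H[i][0] = i
    for j in range(n + 1):
        H[0][j] = j
    last_a = {}
    for i in range(1, m + 1):
        last_b = 0
        for j in range(1, n + 1):
            k, l = last_a.get(b[j - 1], 0), last_b
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(H[i - 1][j - 1] + cost, H[i][j - 1] + 1, H[i - 1][j] + 1)
            if k and l:
                best = min(best, H[k - 1][l - 1] + (i - k - 1) + 1 + (j - l - 1))
            H[i][j] = best
            if cost == 0:
                last_b = j
        last_a[a[i - 1]] = i
    return [H[i][n] for i in range(m + 1)]


def cost_no_cc(a: str, b: str) -> int:
    """Best cost over traces with no center crossing (Eq. 7)."""
    m, half = len(a), len(b) // 2
    # fwd[i]: cost for a[:i] versus the first half of b.
    fwd = dl_last_column(a, b[:half])
    # rev[t]: cost for the last t characters of a versus the second half of b,
    # obtained by running the same computation on the reversed strings.
    rev = dl_last_column(a[::-1], b[half:][::-1])
    return min(fwd[i] + rev[m - i] for i in range(m + 1))


print(cost_no_cc("kitten", "sitting"))  # no transposition needed
print(cost_no_cc("ab", "ba"))           # optimal trace crosses the center
```

For "kitten" and "sitting" the split already attains the DL distance of 3. For "ab" and "ba" it yields 2, whereas the DL distance is 1, because the only optimal trace has a center crossing; this is the case covered by the center-crossing cost of Eq. 8.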

When T has a center crossing {(u₁,v₁),(u₂,v₂)}, then it follows from Eq. 4 that either u₁ and u₂ are adjacent in A or v₁ and v₂ are adjacent in B (or both). When v₁ and v₂ are adjacent in B, then $v_{2} = \frac {n}{2}$ and $v_{1}=\frac {n}{2}+1$. Substituting into Eq. 8, we get

$$ {{}\begin{aligned} costCC(T) &\,=\, \min\{H[u_{1}\,-\,1][v_{2}-1] \,+\, H'[u_{2}\,+\,1][v_{1}\,+\,1] \\ & + (u_{2}-u_{1}-1) + 1 + (v_{1}-v_{2}-1)\} \\ &= \min\{H[u_{1}\,-\,1][\frac{n}{2}\,-\,1] + H'[u_{2}\,+\,1][\frac{n}{2}+2] \\ & + (u_{2}-u_{1}) \} \\ &= \min\left\{R1^{f}[u_{1}\,-\,1] \,+\, R1'^{f}[u_{2}+1]+(u_{2}\,-\,u_{1}) \right\} \\ \end{aligned}} $$

(10)

When u₁ and u₂ are adjacent in A, then u₂−u₁=1, v₂ is the rightmost occurrence of A[u₂] in B that precedes position $\frac {n}{2}+1$ (i.e., v₂≤n/2) and v₁ is the leftmost occurrence of A[u₁] in B after position $\frac {n}{2}$ (i.e., v₁≥n/2+1). So, we have

$$ {{}\begin{aligned} costCC(T) &= \min\{H[u_{1}\,-\,1][v_{2}\,-\,1] + H'[u_{2}\,+\,1][v_{1}\,+\,1] \\ & + (u_{2}-u_{1}-1) + 1 + (v_{1}-v_{2}-1)\} \\ &= \min\left\{H[u_{1}\,-\,1][v_{2}\,-\,1] \,+\, H'[u_{2}\,+\,1][v_{1}+1]\right.\\ & \left. + (v_{1}-v_{2}) \right\} \\ &\,=\, \min\left\{FR^{f}[u_{1}\,+\,1] + FR'^{f}[u_{2}\,-\,1]+ (v_{1}-v_{2}) \right\} \\ \end{aligned}} $$

(11)

Algorithm LS_TRACE2 (Algorithm 2) provides the pseudocode for our linear space computation of an optimal trace. It assumes that LS_DL2 has been modified to return the arrays R,R1 and FR.

Using an analysis similar to that used by us in [18] for the analysis of DL_TRACE, we see that the time complexity of LS_TRACE2 is O(mn). The space required is the same as for LS_DL2. The number of cache misses is approximately twice that for LS_DL2 when invoked with strings of size n and m. Hence, the cache miss count for LS_TRACE2 is ≈6mn/w.

Strip trace algorithm Strip_TRACE2

This algorithm differs from LS_TRACE2 in that it uses a modified version of Strip_DL2 rather than a modified version of LS_DL2. The modified version of Strip_DL2 returns the arrays C, C1 and FC computed by Strip_DL2. The asymptotic time complexity of Strip_TRACE2 is also O(mn) and it takes the same amount of space as does Strip_DL2. The number of cache misses is approximately twice that for Strip_DL2.

Results

We benchmarked the single-core algorithms LS_DL2, Strip_DL2, LS_TRACE2, and Strip_TRACE2 of this paper against the corresponding single-core algorithms developed by us in [18]. Using the parallelization techniques of [18], we obtained multi-core versions of our new algorithms. Their names are obtained by prefixing PP_ to the single-core name (e.g., PP_LS_DL2 is the multi-core version of LS_DL2). The new multi-core versions were also benchmarked against the corresponding multi-core algorithms of [18].

Platforms and test data

The single-core algorithms were implemented using C and the multi-core ones using C and OpenMP. The relative performance of these algorithms was measured on the following platforms:

1. Intel Xeon CPU E5-2603 v2 Quad-Core processor 1.8GHz with 10MB cache.
2. Intel I7-x980 Six-Core processor 3.33GHz with 12MB LLC cache.
3. Intel Xeon CPU E5-2695 v2 2x12-Core processors 2.40GHz with 30MB cache.

For convenience, we will, at times, refer to these platforms as Xeon4, Xeon6, and Xeon24 (i.e., the number of cores is appended to the name Xeon).

All codes were compiled using the gcc compiler with the O2 option. On our Xeon4 platform, the benchmarking included a comparison of memory, cache misses, run time, and energy consumption. The cache miss count and the energy consumption were measured using the "perf" [21] software through the RAPL interface. For the Xeon6 and Xeon24 platforms, only the run time was benchmarked.

For test data, we used randomly generated protein sequences as well as real protein sequences obtained from the Protein Data Bank [22] and DNA/RNA/protein sequences from the National Center for Biotechnology Information (NCBI) database [23]. The results for our randomly generated protein sequences were comparable to those for similarly sized sequences used from the two databases [22] and [23]. So, we present only the results for the random data sets here.

Xeon E5-2603 (Xeon4)

DL distance algorithms

Table 1 gives the memory required to process random protein sequences of length 400,000 using each of the single-core DL scoring algorithms considered in this paper. LS_DL takes 4.75 times the memory taken by LS_DL2 and Strip_DL takes 4.69 times the memory taken by Strip_DL2.

Table 1 Memory usage for DL distance algorithms on Xeon4
