A fast and efficient path elimination algorithm for large-scale multiple common longest sequence problems
BMC Bioinformatics volume 23, Article number: 366 (2022)
Abstract
Background
In various fields, searching for the Longest Common Subsequences (LCS) of Multiple (i.e., three or more) sequences (MLCS) is a classic but difficult problem. The primary bottleneck is that present state-of-the-art algorithms must construct a huge graph (a directed acyclic graph, or DAG), which usually exceeds the memory available on the computer. Because of their massive time and space consumption, present algorithms are inapplicable to problems with long and large-scale sequences.
Results
A mini Directed Acyclic Graph (miniDAG) model and a novel Path Elimination Algorithm are proposed to address large-scale MLCS problems efficiently. In miniDAG, we employ the branch and bound approach to eliminate paths during DAG construction, resulting in a much smaller DAG (miniDAG), which saves memory space and search time.
Conclusion
Empirical experiments have been performed on a standard benchmark set of DNA sequences. The experimental results show that our model outperforms the leading algorithms, especially for large-scale MLCS problems.
Introduction
In various fields such as cancer treatment [1], cancer detection [2], protein sequence classification [3], gene data search [4], and gene data analysis [5], searching for the Longest Common Subsequences (LCS) of Multiple (i.e., three or more) sequences (MLCS) is a classic but difficult problem. With the increase in the number of sequences and the advancement of biotechnology, this problem is usually divided into two categories. The first is to find the longest common subsequence of two sequences, called the LCS problem; the second is to find the longest common subsequence of three or more sequences, called the MLCS problem.
In the past few decades, many algorithms dedicated to solving LCS problems have been proposed; for example, Sankoff [6] described how to use the dynamic programming (DP) algorithm to determine the LCS of two sequences. LCS problems can be solved in \(O(n^2)\) running time and memory space, where n is the length of the sequences. Generally, MLCS problems are more difficult to solve than LCS ones. Numerous algorithms developed for LCS problems are inapplicable to MLCS problems [7,8,9,10,11,12], especially large-scale MLCS problems (i.e., problems with numerous and long sequences). As the number and length of sequences rise, runtime and memory use grow exponentially owing to the time and space complexity of \(O(n^d)\) [13], where d (d \(\ge\) 2) denotes the number of sequences and n denotes their length.
The dominant point-based approach, whose central concept was first introduced by Hakata and Imai [14, 15], is a category of algorithms for the MLCS problem that is more effective and efficient than its predecessors. It constructs a Directed Acyclic Graph (DAG) and transforms the problem of finding the MLCS into finding the longest path from the source node to the ending node of the DAG. It is based on the fact that the vast majority of points in the dynamic tables of DP-type algorithms are irrelevant and that only the critical points, i.e., the so-called dominant points, need to be calculated and saved [13]. The dominant point-based approach therefore searches a significantly narrower space than DP-type approaches, which makes considerable savings in runtime and memory feasible. Subsequently, several versions of the dominant point-based approach have been presented in an effort to further enhance its performance [16,17,18,19]. To accelerate the search for a match point's successors, Chen et al. [17] designed a special data structure called the successor table, in an algorithm referred to as FastLCS. Additionally, Wang et al. [19] described a new method dubbed QuickDP that employs a divide-and-conquer strategy to accelerate the generation of the DAG. In terms of time complexity, QuickDP outperforms FastLCS. However, as the number and length of sequences grow, the DAG generated by FastLCS and QuickDP grows in size, and both algorithms often get stuck during the DAG building process. This is because the time complexity of the non-dominated sorting used by both algorithms is \(O(N^2)\), where N is the number of match points in the DAG, which is substantially larger than n and d.
Recently, Wang et al. [20] introduced an algorithm dubbed TopMLCS that employs a novel approach for DAG construction and a forward-and-backward topological sorting technique to determine the longest paths in the DAG. Thanks to topological sorting, this approach has a lower time consumption. Nonetheless, as the size of the DAG grows, topological sorting algorithms consume a significant amount of memory because they must store the entire DAG, including all match points and paths (i.e., unnecessary match points and non-optimal paths cannot be identified and removed in time). As a result of memory overflow, they cannot properly tackle the large-scale MLCS problem. Liu et al. [21] offered a character merging algorithm (CMA) that merges consecutively repeated characters, shortens the sequences, reduces the complexity of the problem to be handled, and therefore efficiently solves the large-scale MLCS problem. CMA is very effective for MLCS problems whose sequences contain many repeated characters. In 2020, a new PRDAG model for large-scale MLCS problems was developed [22]. There are no repeated match points in the PRDAG model, and each match point is allocated a unique path recorder (a key precursor pointer) to keep track of the longest paths connecting the source point to itself. In addition to exact algorithms, some excellent heuristic algorithms have been proposed. The algorithm MLCSA* [27] finds an LCS for any given number of sequences. MLCSA* is a variant of the A* algorithm, a provably optimal best-first search algorithm [28]. But unlike A*, which finds the least-cost path in a graph, MLCSA* searches a multidimensional matrix for the longest path corresponding to an LCS. In 2020, a new anytime A* search [26] was proposed to solve problem instances in various scenarios; apart from providing excellent solutions, the anytime A* search can return proven gaps at almost any time when terminated prematurely.
However, as the number and length of sequences increase, these strategies still cannot cope with the challenges of large-scale MLCS problems.
In this paper, we propose miniMLCS to ensure robust performance on MLCS problems. The primary difference between the proposed algorithm and state-of-the-art algorithms is that miniMLCS employs a novel path elimination strategy based on lower bound and upper bound estimation to efficiently remove a large number of unnecessary match points and non-optimal paths from the DAG. This avoids non-dominated sorting and topological sorting, which are both extremely time-consuming and require a large amount of memory. As a result, the constructed DAG is small, and miniMLCS can efficiently identify the longest paths in the DAG, and hence the MLCSs, with minimal runtime and memory usage. Our main contributions are as follows:

1
Design a novel branch and bound strategy to eliminate unnecessary paths when constructing the DAG. If, before obtaining the final MLCS, we can determine that the match point currently being calculated cannot be part of an MLCS, then no path through this point is a longest path; such points and paths are called unnecessary match points and non-optimal paths. Therefore, we do not need to include them in the DAG.

2
Design a smaller DAG (miniDAG) that avoids the non-dominated sorting and topological sorting of the typical DAG construction process. The proposed branch and bound strategy can eliminate match points without comparing them with each other, which greatly reduces the time otherwise spent on non-dominated sorting and topological sorting.

3
Propose a fast and efficient algorithm (miniMLCS) to deal with large-scale sequence problems at lower time and space costs. We design a novel MLCS algorithm based on the branch and bound strategy, called miniMLCS, and compare it with state-of-the-art algorithms. The results show that our algorithm outperforms these algorithms and is suitable for large-scale MLCS problems.
Related work
Definitions of LCS/MLCS problems
Definition 1
Let s = \(\langle\) \(c_1\),\(c_2\),...,\(c_n\) \(\rangle\) represent a sequence on a character set \(\Sigma\), where \(c_i\) \(\in\) \(\Sigma\), 1 \(\le\) i \(\le\) n, \(\left| \Sigma \right|\) represents the cardinality of \(\Sigma\), and \(\left| s\right|\) represents the length of s, i.e., \(\left| s\right|\) = n. If the sequence \(s^*\) = \(\langle\) \(c_{i_1}\),\(c_{i_2}\),...,\(c_{i_m}\) \(\rangle\) fulfils 1 \(\le\) \(i_1\) < \(i_2\) < ...< \(i_m\) \(\le\) n, \(s^{*}\) is referred to as a subsequence of s, represented by \(s^{*}\) \(\in\) sub(s), where sub(s) is the set of all subsequences of s.
In other words, deleting zero or more characters (not necessarily adjacent) from a given sequence s, while keeping the remaining characters in their original order, yields a subsequence of s.
For example, if the characters G and T are deleted from the sequence \(s = \texttt {ACGTA}\), the resulting sequence \(s^* = \texttt {ACA}\) is a subsequence of s.
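Definition 1 can be checked mechanically. The following is a minimal sketch in Python; the function name `is_subsequence` is an illustrative choice, not something defined in the paper.

```python
def is_subsequence(candidate: str, s: str) -> bool:
    """Return True if `candidate` can be obtained from `s` by deleting characters."""
    it = iter(s)
    # `ch in it` consumes the iterator up to the first occurrence of ch,
    # so later characters can only match at later positions of s.
    return all(ch in it for ch in candidate)


print(is_subsequence("ACA", "ACGTA"))
```

The empty sequence is accepted as a subsequence of every sequence, which matches the "zero or more deletions" reading of the definition.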
Definition 2
Given a sequence set \(Y = \left\{ s_1,s_2,...,s_d \right\}\) on a character set \(\Sigma\), where d is the number of sequences contained in Y and d \(\ge\) 2, if there is a sequence \(s^*\) that is a subsequence of every sequence in Y, then \(s^*\) is called a common subsequence of all sequences in Y. If, in addition, \(s^*\) is the longest of all common subsequences of Y, then \(s^*\) is a longest common subsequence of all sequences in Y.
For example, for \(Y = \left\{ s_1 = \texttt {AACGTCGT}, s_2 = \texttt {CGACGTCC}, s_3 = \texttt {GACCGTCT} \right\}\), the sequences \(s^*_1 = \texttt {ACGTC}\), \(s^*_2 = \texttt {AGC}\), ..., \(s^*_m = \texttt {C}\) are all common subsequences of \(s_1\), \(s_2\), \(s_3\) in the set Y. Since \(s^*_1\) is the longest among all common subsequences, it is the longest common subsequence of all sequences in Y.
Usually, there is more than one LCS for d given sequences. If \(d = 2\), the problem of finding the LCS is usually called the LCS problem; if \(d \ge 3\), it is called the MLCS problem.
Definition 3
Given a sequence s = \(\langle\) \(c_1\),\(c_2\),...,\(c_n\) \(\rangle\), for i = 0,1,...,n, define pre(s[i]) = \(\langle\) \(c_1\),\(c_2\),...,\(c_i\) \(\rangle\) as the ith prefix of s and suf(s[i]) = \(\langle\) \(c_{i+1}\), \(c_{i+2}\), ..., \(c_n\) \(\rangle\) as the (i+1)th suffix of s (which excludes the ith character).
For example, if \(s = \texttt {AACGTCGT}\), then \(pre(s[5]) = \texttt {AACGT}\) is the 5th prefix of s and \(suf(s[5]) = \texttt {CGT}\) is the 6th suffix of s. Moreover, pre(s[0]) is the empty sequence and suf(s[0]) is the entire sequence.
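Because the definition uses 1-based positions, pre(s[i]) and suf(s[i]) map directly onto Python slices. The helper names `pre` and `suf` below mirror the notation of Definition 3 and are otherwise hypothetical.

```python
def pre(s: str, i: int) -> str:
    """i-th prefix: the first i characters c_1..c_i (empty for i = 0)."""
    return s[:i]


def suf(s: str, i: int) -> str:
    """(i+1)-th suffix: the characters c_{i+1}..c_n, excluding the i-th character."""
    return s[i:]


s = "AACGTCGT"
print(pre(s, 5), suf(s, 5))
```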
Dynamic programming approaches
The DP approach is a time-honored method for solving LCS and MLCS problems [23]. Given d sequences of length n, \(s_1\),\(s_2\),...,\(s_d\), it iteratively constructs a d-dimensional score table L with \(n^d\) elements, where the element \(L[i_1, i_2,..., i_d]\) represents the length of the MLCS of the prefix sequences \(pre(s_1[i_1])\), \(pre(s_2[i_2])\),..., \(pre(s_d[i_d])\), which can be calculated by the following formula [24]:
\[L[i_1, i_2, \cdots , i_d] = {\left\{ \begin{array}{ll} 0 &{} \text {if } \exists k\ (1 \le k \le d),\ i_k = 0 \\ L[i_1 - 1, i_2 - 1, \cdots , i_d - 1] + 1 &{} \text {if } s_1[i_1] = s_2[i_2] = \cdots = s_d[i_d] \\ \max ({\bar{L}}) &{} \text {otherwise} \end{array}\right. }\]
where \({\bar{L}} = \left\{ L\left[ i_{1}, i_{2}, \cdots ,\left( i_{k}-1\right) , \cdots , i_{d}\right] \mid k=1,2, \cdots , d\right\}\).
After constructing the score table L, the MLCS may be obtained by tracing back from the bottom-right element L[n, n, ..., n] to the top-left element L[0, 0, ..., 0]. For example, Fig. 1 illustrates the score table L constructed for the sequences \(s_1 = \texttt {AACGTCGT}\) and \(s_2 = \texttt {CGACGTCC}\), and the LCS of these two sequences is determined by tracing back from L[8, 8] to L[0, 0].
As Fig. 1 suggests, for d given sequences of length n, the DP approach has a time and space complexity of up to \(O(n^d)\) [9]. As d and n grow, these approaches use exponentially more space and time. That is, the scalability of the DP approach is restricted, making it unsuitable for large-scale MLCS problems.
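The score-table recurrence for d sequences can be sketched as a memoized recursion. This is an illustrative implementation of the generic DP idea under the definitions above, not code from the paper; the name `mlcs_length` is hypothetical, and it only returns the length (no traceback). It visits up to \(\prod (|s_k| + 1)\) table cells, which is exactly the exponential growth the text warns about.

```python
from functools import lru_cache
from typing import Sequence, Tuple


def mlcs_length(seqs: Sequence[str]) -> int:
    """Length of the longest common subsequence of all sequences in `seqs`."""
    d = len(seqs)

    @lru_cache(maxsize=None)
    def L(idx: Tuple[int, ...]) -> int:
        # L(idx) is the MLCS length of the prefixes pre(s_k[idx[k]]).
        if any(i == 0 for i in idx):
            return 0  # an empty prefix has no common subsequence
        last_chars = {seqs[k][idx[k] - 1] for k in range(d)}
        if len(last_chars) == 1:
            # All prefixes end with the same character: extend the diagonal cell.
            return L(tuple(i - 1 for i in idx)) + 1
        # Otherwise take the best neighbour obtained by shortening one prefix.
        return max(L(idx[:k] + (idx[k] - 1,) + idx[k + 1:]) for k in range(d))

    return L(tuple(len(s) for s in seqs))


print(mlcs_length(["AACGTCGT", "CGACGTCC", "GACCGTCT"]))
```

The recursion depth is bounded by the sum of the sequence lengths, so this sketch is only suitable for short inputs such as the running example.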
Dominant point-based approaches
Before delving into the details of the dominant point-based method, we'll define a few terms.
Definition 4
Given a sequence set \(Y = \left\{ s_1,s_2,...,s_d \right\}\) on a character set \(\Sigma\), where d is the number of sequences contained in Y and d \(\ge\) 2, let \(s_i[p_i]\) represent the \(p_i\)th character of the sequence \(s_i\). If \(s_1[p_1]\) = \(s_2[p_2]\) = ... = \(s_d[p_d]\) = \(\sigma\), the vector p = (\(p_1\), \(p_2\),..., \(p_d\)) is referred to as a match point of these d sequences. Each match point p = (\(p_1\), \(p_2\),..., \(p_d\)) is associated with a single symbol \(\sigma\). As a result, we often write p = \(\sigma\)(\(p_1\), \(p_2\),..., \(p_d\)) to express the match point, where \(\sigma\) is the symbol of p, denoted by Ch(p) = \(\sigma\).
For example, for the two sequences \(s_1 = \texttt {AACGTCGT}\) and \(s_2 = \texttt {CGACGTCC}\), there are several match points of the form \(\sigma (i, j)\). The common character \(\sigma\) \(\in\) \(\Sigma\), connected by a dotted line, corresponds to its indices i and j in the two sequences, i.e., \(s_1[i]\) = \(s_2[j]\) = \(\sigma\), such as A(1, 3) and G(4, 2). Because \(Ch(1,3) = A\), the match point A(1, 3) is sometimes abbreviated as (1, 3). Similarly, G(4, 2) may be denoted as (4, 2).
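For two sequences, the match points of Definition 4 can be listed by brute force. The sketch below is illustrative only (the name `match_points` is hypothetical) and uses 1-based positions to agree with the notation above.

```python
from typing import Dict, Tuple


def match_points(s1: str, s2: str) -> Dict[Tuple[int, int], str]:
    """Map every match point (i, j) with s1[i] = s2[j] to its symbol Ch((i, j))."""
    return {
        (i, j): a
        for i, a in enumerate(s1, start=1)
        for j, b in enumerate(s2, start=1)
        if a == b
    }


points = match_points("AACGTCGT", "CGACGTCC")
print(points[(1, 3)], points[(4, 2)])
```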
Definition 5
Given two match points p and q of d sequences on a symbol set T, we say:

1
p = q if \(\forall\) i (1 \(\le\) i \(\le\) d), \(p_{i}=q_{i}\).

2
p weakly dominates q, if \(\forall\) i (1 \(\le\) i \(\le\) d), \(p_{i}\) \(\le\) \(q_{i}\) and \(\exists\) i, \(p_{i}\) < \(q_{i}\) (denoted by p \(\preceq\) q).

3
p dominates q or q is dominated by p, if \(\forall\) i (1 \(\le\) i \(\le\) d), \(p_{i}\) < \(q_{i}\) (denoted by p \(\prec\) q).

4
q is called a successor of p if p \(\prec\) q. Further, if there is no match point r to satisfy p \(\prec\) r \(\prec\) q, then q is called an immediate successor of p.

5
If q is a successor of p, we call p a predecessor of q.
Generally, a match point p has no more than \(\left| \Sigma \right|\) immediate successors.
Definition 6
Given a collection of match points P = \(\left\{ P_1, P_2, ...,P_m \right\}\) and a match point \(P_{j}\) \(\in\) P, if \(\lnot\) \(\exists\) \(P_{i}\) \(\preceq\) \(P_{j}\), 1 \(\le\) i, j \(\le\) m, i \(\ne\) j, then \(P_{j}\) is called a non-dominated point (dominant point for short) on P. All dominant points on P form the dominant set of P.
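The relations of Definition 5 and the dominant set of Definition 6 translate into a few lines of code. This is a minimal sketch with hypothetical function names. Definition 6 is written with the weak relation \(\preceq\), while the level-by-level walkthrough in this section removes only strictly dominated points, so the relation is passed in as a parameter rather than fixed; the pairwise filter below is the quadratic comparison whose cost the text discusses.

```python
from typing import Callable, Iterable, Set, Tuple

Point = Tuple[int, ...]


def weakly_dominates(p: Point, q: Point) -> bool:
    """p weakly dominates q: p_i <= q_i for all i and p_i < q_i for some i."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def dominates(p: Point, q: Point) -> bool:
    """p dominates q: p_i < q_i for all i."""
    return all(a < b for a, b in zip(p, q))


def dominant_set(points: Iterable[Point],
                 relation: Callable[[Point, Point], bool] = dominates) -> Set[Point]:
    """Keep the points that no other point dominates under `relation`."""
    pts = set(points)
    return {q for q in pts if not any(relation(p, q) for p in pts if p != q)}


level_1 = [(1, 3, 2), (3, 1, 3), (4, 2, 1), (5, 6, 6)]
print(sorted(dominant_set(level_1)))
```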
The dominant point-based approaches are based on constructing a directed acyclic graph (DAG), which is built as follows. To begin, given d sequences, the graph's source point is defined as the d-dimensional point O(0, 0, ..., 0). This point has no incoming edge, so its in-degree is 0. The source point's level is defined as level 0. Following that, we identify all successor points of the present point O and create a directed edge connecting it to every one of its successors. The level of these successor points is defined as level 1. Non-dominated sorting is used to identify and remove all dominated points on level 1. Then, for each non-dominated point on level 1, we identify all successor points, create an edge connecting each non-dominated point on level 1 to every one of its successors, and designate the level of all these successors as level 2. We again use non-dominated sorting to identify and remove all dominated points on level 2. This procedure is continued until no additional successor points are found, at which time the DAG construction is complete. If a point has no successor, the ending point (\(\infty\),\(\infty\),...,\(\infty\)) is taken as its successor. Once the DAG is established, the LCS/MLCS is given by the character sequence represented by the points along the longest path from the source point to the ending point. Thus, the primary difficulty of the LCS/MLCS problem is how to create the DAG.
For example, as shown in Fig. 2, for the sequences \(s_1 = \texttt {AACGTCGT}\), \(s_2 = \texttt {CGACGTCC}\) and \(s_3 = \texttt {GACCGTCT}\), the MLCS is generated using the dominant point-based approach as follows.

1
Initialization.
Set the source node O(0, 0, 0) and the ending node (\(\infty\),\(\infty\),\(\infty\)).

2
DAG construction on level 1.
For point O(0, 0, 0) on \(D_0\), find all of its successors: A(1, 3, 2), C(3, 1, 3), G(4, 2, 1) and T(5, 6, 6), and add a directed edge from point O(0, 0, 0) to each one of them. These are all successors of the level-0 point and are therefore level-1 points. Put them in the set \(L_1\). The dominated point T(5, 6, 6) in \(L_1\) is removed using non-dominated sorting (notice that A(1, 3, 2) \(\prec\) T(5, 6, 6), so T(5, 6, 6) is a dominated point). Set \(D_1\) \(=\) \(\left\{ A(1, 3, 2), C(3, 1, 3), G(4, 2, 1) \right\}\), and k to k \(+\) 1.

3
DAG construction on level 2.
For each point in \(D_1\) \(=\) \(\left\{ A(1,3,2), C(3,1,3), G(4,2,1) \right\}\), find all of its successors. C(3, 4, 3), G(4, 5, 5), and T(5, 6, 6) are the successors of point A(1, 3, 2) \(\in\) \(D_1\) and are included in \(L_2\). Create a directed edge connecting point A(1, 3, 2) to each one of its successors. For point C(3, 1, 3), the successors are C(6, 4, 4), G(4, 2, 5), and T(5, 6, 6), which are included in \(L_2\). Add a directed edge from C(3, 1, 3) to each of its successors. For point G(4, 2, 1), the successors are C(6, 4, 3), G(7, 5, 5), and T(5, 6, 6); add a directed edge from G(4, 2, 1) to each of its successors and put these successors into \(L_2\). Eliminate the repeated successors in \(L_2\). The dominated points G(4, 5, 5), T(5, 6, 6) and G(7, 5, 5) are deleted using non-dominated sorting on \(L_2\). Set \(D_2\) = \(\left\{ C(3, 4, 3), C(6, 4, 4), G(4, 2, 5), C(6, 4, 3) \right\}\), and k to k \(+\) 1.

4
Repeat step 3 until no successor exists for all points in the set \(D_k\). Then substitute \((\infty , \infty , \infty )\) for the points in \(D_k\), and the DAG construction is complete.
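The four steps above can be sketched as a short level-by-level loop. This is a simplified illustration, not the FastLCS or QuickDP implementation: successors are found with a plain string search instead of a successor table, edges are not stored, dominated points are removed with the strict relation \(\prec\) (which is what reproduces \(D_1\) and \(D_2\) of the example), and the DNA alphabet `"ACGT"` is an assumed default.

```python
from typing import Dict, List, Sequence, Set, Tuple

Point = Tuple[int, ...]


def successors(seqs: Sequence[str], p: Point, alphabet: str = "ACGT") -> Dict[Point, str]:
    """For each character, the first match point strictly after p in every sequence."""
    result: Dict[Point, str] = {}
    for c in alphabet:
        q = []
        for s, pk in zip(seqs, p):
            pos = s.find(c, pk)  # 0-based index pk is the 1-based position pk + 1
            if pos < 0:
                break  # c does not occur after p in this sequence
            q.append(pos + 1)  # convert back to a 1-based position
        else:
            result[tuple(q)] = c
    return result


def strictly_dominates(p: Point, q: Point) -> bool:
    return all(a < b for a, b in zip(p, q))


def build_levels(seqs: Sequence[str], alphabet: str = "ACGT") -> List[Set[Point]]:
    """Return the non-dominated sets D_1, D_2, ... level by level."""
    levels: List[Set[Point]] = []
    current: Set[Point] = {tuple(0 for _ in seqs)}  # source point O
    while True:
        candidates: Set[Point] = set()  # L_k, with repeated points merged
        for p in current:
            candidates.update(successors(seqs, p, alphabet))
        if not candidates:
            return levels
        current = {q for q in candidates
                   if not any(strictly_dominates(p, q) for p in candidates)}
        levels.append(current)


levels = build_levels(["AACGTCGT", "CGACGTCC", "GACCGTCT"])
print(len(levels), sorted(levels[0]), sorted(levels[1]))
```

On the running example the number of levels equals the MLCS length, and the quadratic filter in `build_levels` is the non-dominated sorting step whose cost motivates the rest of the paper.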
As seen above, the dominant point-based techniques have the following significant disadvantages:

1
Each level may contain numerous repeated match points and dominated points (e.g., T(5, 6, 6) appears three times in \(L_2\), and the three points G(4, 5, 5), T(5, 6, 6), G(7, 5, 5) are dominated points), and a match point appearing in one level may appear numerous times in subsequent levels (e.g., T(5, 6, 6) appears in \(L_1\) to \(L_4\)) while being useful only in the final one. Thus, the created DAG will be very large, to the point that the computer will run out of memory to hold it.

2
Non-dominated sorting requires a significant amount of work to obtain \(D_k\). It has \(O(d{N_{k}}^2)\) time complexity at level k, where \(N_k\) is the number of match points in \(L_k\) and d is the number of sequences. Note that \(N_k\) can be huge (in the worst case, \(N_k = \left| \Sigma \right| ^k\) grows exponentially as level k increases). Thus, when n, d, and \(\left| \Sigma \right|\) are large, i.e., when the MLCS problem becomes a large-scale problem, non-dominated sorting becomes very time-consuming.
FastLCS [17] and QuickDP [19] are two representative algorithms of this kind.
The proposed miniMLCS
The main framework of miniMLCS
As previously stated, existing approaches cannot address large-scale MLCS problems owing to their enormous time and space requirements [22]. The underlying reason is that as the number d and length n of sequences grow, non-dominated sorting and topological sorting spend a lot of time on comparisons between match points, so that the computing time and storage space requirements exceed the available limits. To address these problems, the proposed miniMLCS rapidly finds unnecessary match points and non-optimal paths during DAG construction and eliminates them in time to limit the DAG's size.
To be precise, in order to get the final MLCS of the sequences, miniMLCS first uses a strategy to quickly estimate a lower bound Lower(R) on the length of the true MLCS R. Then, before deciding whether or not to include a match point p in the DAG, miniMLCS calculates an upper bound \(Upper(p, \infty )\) on the length of any path from p to the ending match point. If the true distance between p and the ending point is distance(p), the upper bound \(Upper(p,\infty )\) must be greater than or equal to distance(p). Finally, miniMLCS checks whether the estimated length of the longest path from the starting point O through p to the ending point is less than Lower(R) (note that the distance from O to p is the current level value level(p) when the DAG construction reaches p). If \(Upper(O,p,\infty )\) = level(p) + \(Upper(p,\infty )\) < Lower(R), then no path via p is a longest path. As a result, p is an unnecessary match point, and none of the paths going through it is a longest path in the DAG. Based on this observation, all unnecessary match points and non-optimal paths can be deleted immediately.
Estimation of the lower bound Lower(R) in a short time
We do not know the true length of the MLCS R until we obtain it, but we can obtain a lower bound Lower(R) by generating an estimated MLCS. The length of this estimated MLCS is a lower bound on the length of R, and the longer the estimated MLCS, the closer the bound is to the length of R. Our objective is to find an estimated MLCS as rapidly as feasible. A rapid heuristic strategy for calculating the lower bound Lower(R) is designed based on these ideas. The critical stages are listed below.
For a d-dimensional match point p \(=\) (\(p_1\), \(p_2\),... , \(p_d\)), max(p) represents the largest coordinate of p, min(p) represents the smallest coordinate of p, and \(\varphi (p) = max(p) - min(p)\) represents the largest position offset among the d sequences at p. Among all match points at each level, we pick the t points with the smallest \(\varphi ()\). As can be observed from Fig. 1, a match point with a smaller \(\varphi ()\) tends to dominate a larger region than a point with a larger \(\varphi ()\), and a larger dominated region may contain more match points, so a match point with a smaller \(\varphi ()\) is more likely to belong to the longest common subsequence than one with a larger \(\varphi ()\). For example, in Fig. 1, C(3, 4) dominates a larger region than C(3, 7), so C(3, 4) is more likely to be part of the longest common subsequence. Based on this idea, the following strategy is proposed to obtain an accurate lower bound.
Strategy 1: Assign a small initial value to t in the DAG construction process, so that the t match points with the smallest \(\varphi ()\) values are selected in each level of the DAG. The role of t is to reduce the search space and obtain a suitable Lower(R) quickly. In this way we get an initial Lower(R). Next, add a step length \(\mu\) to t each time, so that \(t = t + \mu\), and construct the DAG again to recalculate Lower(R). Update Lower(R) if it changes; if Lower(R) does not change for more than \(\tau\) consecutive iterations (\(\tau\) is a user-defined parameter), we consider that a sufficiently accurate Lower(R) has been obtained.
The following is an example of calculating the lower bound.
Assume t is a small positive integer (for example, \(t = 4\)) and D is a collection of the t match points with the lowest values of \(\varphi ()\), with ties broken at random.

1
Initialization: Set \(O = (0,0,...,0)\) as the first chosen match point, i.e., \(D = \left\{ O \right\}\), and set \(Lower(R) = 0\).

2
Update Lower(R): Pick the t successors with the lowest values of \(\varphi ()\) from all successors of the points in D (if there are only \(\beta\) successors with \(\beta\) \(\le\) t, then let \(t =\) \(\beta\)). Update D by deleting all of its existing elements and inserting the chosen match points, and let \(Lower(R) = Lower(R) + 1\).

3
Return Lower(R) if D is empty; otherwise, proceed to step 2.
Fig. 3 demonstrates the method on the previous example. At first, \(Lower(R) = 0\) and \(D = \left\{ O \right\}\). O has four successors (1, 3, 2), (3, 1, 3), (4, 2, 1) and (5, 6, 6). Because \(t =\) 4, all of these successors are selected. Thus, we update D to \(D = \left\{ (1,3,2),(3,1,3),(4,2,1),(5,6,6) \right\}\) and set \(Lower(R)= Lower(R)+ 1\). Next, we calculate the successors of each match point in D. There are 10 successors in total (in Fig. 2, in order to find the lower bound as soon as possible, point filtering is not carried out, that is, points that would otherwise be filtered out are kept, including the successor C(6, 7, 7) of T(5, 6, 6)), and the four successors (3, 4, 3), (4, 5, 5), (5, 6, 6), (6, 4, 4) with the smallest \(\varphi ()\) values are selected. Note that the match points with the smallest \(\varphi ()\) value are (3, 4, 3), (4, 5, 5) and (5, 6, 6), and that two match points, (6, 4, 4) and (7, 5, 5), share the second smallest \(\varphi ()\) value. In this scenario, we merely need to choose one of (6, 4, 4) and (7, 5, 5) at random (assume (6, 4, 4) is chosen). We then update D to \(D =\) \(\left\{ (3,4,3),(4,5,5),(5,6,6),(6,4,4) \right\}\) and Lower(R) to \(Lower(R) = 2\). Similarly, when \(Lower(R) = 3\), \(D =\) \(\left\{ (4,5,5),(5,6,6),(6,7,7),(6,7,4) \right\}\). When \(Lower(R) = 4\), the corresponding \(D =\) \(\left\{ (6,7,7),(5,6,6) \right\}\). Finally, we obtain \(Lower(R) = 5\) and the corresponding \(D =\) \(\left\{ (6,7,7) \right\}\).
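The three-step procedure for a fixed t can be sketched as follows. This is an illustrative reading of the steps, not the authors' code: the names `lower_bound` and `successors` are hypothetical, the alphabet defaults to `"ACGT"`, repeated successors are merged, and ties in \(\varphi ()\) are broken deterministically by point order instead of at random, so intermediate sets D may differ from those in the example. Every chain of successors spells a common subsequence, so the returned value can never exceed the true MLCS length.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, ...]


def successors(seqs: Sequence[str], p: Point, alphabet: str = "ACGT") -> List[Point]:
    """For each character, the first match point strictly after p in every sequence."""
    result: List[Point] = []
    for c in alphabet:
        q = []
        for s, pk in zip(seqs, p):
            pos = s.find(c, pk)  # search strictly after 1-based position pk
            if pos < 0:
                break
            q.append(pos + 1)
        else:
            result.append(tuple(q))
    return result


def phi(p: Point) -> int:
    """Largest position offset among the d sequences at match point p."""
    return max(p) - min(p)


def lower_bound(seqs: Sequence[str], t: int, alphabet: str = "ACGT") -> int:
    """Greedy estimate of Lower(R): keep the t smallest-phi successors per level."""
    chosen: List[Point] = [tuple(0 for _ in seqs)]  # step 1: D = {O}
    lower = 0
    while True:
        candidates = set()
        for p in chosen:
            candidates.update(successors(seqs, p, alphabet))
        if not candidates:  # step 3: nothing left to extend
            return lower
        # step 2: keep at most t candidates with the smallest phi values
        chosen = sorted(candidates, key=lambda q: (phi(q), q))[:t]
        lower += 1


print(lower_bound(["AACGTCGT", "CGACGTCC", "GACCGTCT"], t=4))
```

Strategy 1 would call such a routine repeatedly with \(t = t + \mu\) and keep the largest value seen until it stops improving.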
Estimation of the upper bound \(Upper(O,p,\infty )\) with efficiency
Assume that p is the current point on the DAG and that we want to know the lengths of all paths from O to the ending match point that pass through p. The lengths of these paths are unknown until they are constructed. However, if we can estimate an upper bound \(Upper(O,p,\infty )\) on the lengths of these paths and know that it is less than the lower bound Lower(R) (i.e., \(Upper(O,p,\infty )\) < Lower(R)), then we can conclude that these paths via p are not longest paths and can be removed from the DAG. In this manner, the new DAG will be far smaller than the previous ones.
Notably, once the DAG's current match point p has been established, the length of the longest path from O to p can be determined. Indeed, it is the DAG level of p (denoted by level(p)).
In contrast, the true length distance(p) of the longest path between the current match point p and the ending match point is generally unknown. A possible method is to estimate an upper bound \(Upper(p,\infty )\) on it. Then \(Upper(O,p,\infty )= level(p) + Upper(p,\infty )\) is an upper bound on the length of any path via p. In the following, we design some strategies for rapidly estimating \(Upper(p,\infty )\) and bringing it as near to the real value distance(p) as feasible (i.e., making it as small as possible).
Given d sequences \(s_1\), \(s_2\),...,\(s_d\) on a character set and a match point \(p =\) (\(p_1\), \(p_2\),..., \(p_d\)), the following conclusion is obtained:
Theorem 1
Let p = (\(p_1\), \(p_2\),..., \(p_d\)) be a match point and let n be the length of the sequence corresponding to max(p). Then Upper(p,\(\infty\)) = n - max(p) is an upper bound on the length of any longest path between the match point p and the ending match point \(\infty\), and Upper(O,p,\(\infty\)) = level(p) + n - max(p) is an upper bound on the length of the longest path from O to \(\infty\) through p.
Proof
Denote \(\xi\) \(= n - max(p)\). Obviously, distance(p) is equal to the length of the longest common subsequence of the sequences \(suf(s_i[p_i])(1 \le i \le d)\), and \(distance(p) \le \xi\) because in the sequence corresponding to max(p) there are at most \(\xi\) characters after position max(p). Therefore \(Upper(p,\infty ) = n - max(p)\), namely \(Upper(O,p,\infty ) = level(p) + n - max(p)\).
Some extensions can be developed on the basis of Theorem 1. Select the \(\delta\) largest coordinates of match point p, then compute the LCS of the corresponding suffixes \({suf(s_i[max_i(p)])\ (}1 \le i \le \delta )\) and use its length as the upper bound of p. The upper bound obtained is at least as close to distance(p) as the one given by Theorem 1, but each match point takes longer to evaluate.
Let us analyze the aforementioned example in Fig. 2 in further depth to demonstrate the strategy for identifying unnecessary match points. The MLCS's estimated lower bound is known, i.e., \(Lower(R) = 5\), as shown in Fig. 3. According to Theorem 1, \(Upper(p,\infty )\) can be estimated, and level(p) is obtained during the DAG construction process, as shown in Fig. 4a.
The level of the match point C(3, 1, 3) is 1. C(3, 1, 3) has three successors: C(6, 4, 4), G(4, 2, 5) and T(5, 6, 6). Set level(p) equal to 2 for each of these successors, i.e., \(level(C(6, 4, 4)) = level(G(4, 2, 5)) = level(T(5, 6, 6)) = 2\), because the longest path from the starting match point to each of them has length 2, and recall that \(Lower(R) = 5\), as shown in Fig. 3. According to Theorem 1, \(Upper(C(6, 4, 4),\infty ) = 8 - max(C(6, 4, 4)) = 2\), \(Upper(G(4, 2, 5),\infty ) = 8 - max(G(4, 2, 5)) = 3\) and \(Upper(T(5, 6, 6),\infty ) = 8 - max(T(5, 6, 6)) = 2\). So, as seen in Fig. 4a, match points C(6, 4, 4) and T(5, 6, 6) are clearly identified as unnecessary match points since they meet
\(Upper(O,p, \infty ) = level(p) + Upper(p, \infty )\) < Lower(R)
and none of the paths (branches) linking O and them will be included in the DAG.
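The Theorem 1 bound and the elimination test can be combined into a small level-wise search. This is a hedged sketch of the path elimination idea rather than the miniMLCS implementation: it keeps only the current level (no edges, so it returns the MLCS length, not the sequences themselves), the function names are hypothetical, and the alphabet defaults to `"ACGT"`. The result is exact only if `lower` does not exceed the true MLCS length.

```python
from typing import List, Sequence, Set, Tuple

Point = Tuple[int, ...]


def successors(seqs: Sequence[str], p: Point, alphabet: str = "ACGT") -> List[Point]:
    """For each character, the first match point strictly after p in every sequence."""
    result: List[Point] = []
    for c in alphabet:
        q = []
        for s, pk in zip(seqs, p):
            pos = s.find(c, pk)
            if pos < 0:
                break
            q.append(pos + 1)
        else:
            result.append(tuple(q))
    return result


def theorem1_upper(seqs: Sequence[str], p: Point) -> int:
    """Upper(p, inf) = n - max(p), with n the length of the sequence attaining max(p)."""
    k = max(range(len(p)), key=lambda i: p[i])
    return len(seqs[k]) - p[k]


def pruned_mlcs_length(seqs: Sequence[str], lower: int, alphabet: str = "ACGT") -> int:
    """Level-wise search that drops q when level(q) + Upper(q, inf) < lower."""
    frontier: Set[Point] = {tuple(0 for _ in seqs)}
    level = 0
    while True:
        nxt: Set[Point] = set()
        for p in frontier:
            for q in successors(seqs, p, alphabet):
                if level + 1 + theorem1_upper(seqs, q) >= lower:
                    nxt.add(q)  # q may still lie on a longest path
        if not nxt:
            return level
        frontier, level = nxt, level + 1


seqs = ["AACGTCGT", "CGACGTCC", "GACCGTCT"]
print(pruned_mlcs_length(seqs, lower=5))
```

No dominance comparison between match points is needed here: each point is accepted or rejected on its own, which is the property the paper relies on to avoid non-dominated sorting.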
In addition to the method for finding the upper bound introduced above, the works [26] and [27] describe two tighter strategies for finding the upper bound. They build auxiliary tables in the preprocessing stage and then use them to find the upper bound for each match point in the DAG construction stage. The details are given in strategy 2 and strategy 3.
Strategy 2: Let \({num_{s_{i}}^c}\) represent the number of occurrences of the character c in sequence \({s_i\ (}1 \le i \le d)\). Consider any longest path between match point p = (\(p_1\), \(p_2\),..., \(p_d\)) and the ending match point. For any c in \(\Sigma\), the number of occurrences of c in the sequence spelled by the path from p to the ending node cannot exceed \(min\left\{ num_{suf(s_1[p_1])}^c,num_{suf(s_2[p_2])}^c,...,num_{suf(s_d[p_d])}^c\right\}\).
Hence,
\[Upper(p,\infty ) = \sum \limits _{c \in \Sigma } \min \left\{ num_{suf(s_1[p_1])}^c,num_{suf(s_2[p_2])}^c,...,num_{suf(s_d[p_d])}^c\right\} .\]
Figure 4b shows an application of strategy 2. The match point A(1, 3, 2) has three successors in level 2, namely C(3, 4, 3), G(4, 5, 5) and T(5, 6, 6). For G(4, 5, 5), \(suf(s_1[4]) = \texttt {TCGT}\), \(suf(s_2[5]) = \texttt {TCC}\) and \(suf(s_3[5]) = \texttt {TCT}\). Thus \(Upper(G(4,5,5),\infty ) = \sum \limits _{c \in \Sigma } \min \left\{ num_{suf(s_1[p_1])}^c,num_{suf(s_2[p_2])}^c,num_{suf(s_3[p_3])}^c\right\} = 1 + 1 + 0 + 0 = 2\), because \(\texttt {C}\) and \(\texttt {T}\) each occur at least once in \(suf(s_1[4])\), \(suf(s_2[5])\) and \(suf(s_3[5])\), while \(\texttt {A}\) and \(\texttt {G}\) are each missing from at least one of the three suffixes, so their minimum count is 0. Since \(Upper(O,G(4,5,5),\infty ) = level(G(4,5,5)) + Upper(G(4,5,5),\infty ) = 2 + 2 = 4\) < Lower(R), match point G(4, 5, 5) is identified as an unnecessary point, and none of the paths (branches) linking O and G(4, 5, 5) will be included in the DAG.
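The character-count bound of strategy 2 can be sketched directly from its formula. For brevity this illustration recounts each suffix on demand with `collections.Counter`; the preprocessing described in the text would instead store the suffix counts once per position. The function name `count_upper` is hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple

Point = Tuple[int, ...]


def count_upper(seqs: Sequence[str], p: Point) -> int:
    """Sum over characters of the minimum count in the suffixes suf(s_k[p_k])."""
    suffix_counts = [Counter(s[pk:]) for s, pk in zip(seqs, p)]
    alphabet = set().union(*seqs)  # every character that occurs in any sequence
    # A Counter returns 0 for characters that are absent from a suffix.
    return sum(min(cnt[c] for cnt in suffix_counts) for c in alphabet)


seqs = ["AACGTCGT", "CGACGTCC", "GACCGTCT"]
print(count_upper(seqs, (4, 5, 5)))
```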
Strategy 3: Consider each longest path between match point p = (\(p_1\), \(p_2\),..., \(p_d\)) and the ending match point. For each i \((1 \le i \le d - 1)\), let \(m_{i}\) be a vector in which \(m_{i}[p_{i}, p_{i+1}]\), with \(p_{i} = 1, . . . , |s_{i}|\) and \(p_{i+1} = 1, . . . , |s_{i+1}|\), stores the length of the LCS of the strings \(suf(s_i[p_{i}])\) and \(suf(s_{i+1}[p_{i+1}])\).
Hence,
\(Upper(p,\infty ) = \min \limits _{1 \le i \le d-1} m_{i}[p_{i}, p_{i+1}]\)
Figure 4c shows an application of strategy 3. For match point G(4, 2, 1) on level 1, \(suf(s_1[4]) = \texttt {TCGT}\), \(suf(s_2[2]) = \texttt {ACGTCC}\) and \(suf(s_3[1]) = \texttt {ACCGTCT}\). The result is \(Upper(G(4,2,1),\infty ) = min(3,5) = 3\), the minimum of the LCS length of \(suf(s_1[4])\) and \(suf(s_2[2])\) and the LCS length of \(suf(s_2[2])\) and \(suf(s_3[1])\). Since \(Upper(O, G(4,2,1), \infty ) = level(G(4,2,1)) + Upper(G(4,2,1),\infty ) = 1 + 3 = 4 < Lower(R)\), match point G(4, 2, 1) is identified as an unnecessary point, and none of the paths (branches) linking O and G(4, 2, 1) will be included in the DAG.
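Strategy 3 can be illustrated with the classical dynamic-programming table for suffix LCS lengths. The sketch below is an assumption-laden illustration rather than the implementation of [26] or [27]: `suffix_lcs_table` and `upper_strategy3` are hypothetical names, and one table is built for each pair of adjacent sequences in the running example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, ...]  # 1-based positions, one per sequence


def suffix_lcs_table(a: str, b: str) -> List[List[int]]:
    """m[i][j] = LCS length of a[i:] and b[j:], i.e. of the suffixes that
    follow 1-based positions i and j. Built in O(|a| * |b|) time."""
    m = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                m[i][j] = m[i + 1][j + 1] + 1
            else:
                m[i][j] = max(m[i + 1][j], m[i][j + 1])
    return m


def upper_strategy3(p: Point, tables: Sequence[List[List[int]]]) -> int:
    """Minimum over adjacent sequence pairs of the pairwise suffix LCS length."""
    return min(m[p[i]][p[i + 1]] for i, m in enumerate(tables))


SEQS = ["AACGTCGT", "CGACGTCC", "GACCGTCT"]
# One table per adjacent pair (s_i, s_{i+1}); this is the preprocessing stage.
TABLES = [suffix_lcs_table(SEQS[i], SEQS[i + 1]) for i in range(len(SEQS) - 1)]
bound_g = upper_strategy3((4, 2, 1), TABLES)
```

The tables cost quadratic time and space per pair, which is why the text restricts preprocessing to a small number of selected sequences when n and d are large.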
For strategies 2 and 3, preprocessing takes a long time when n and d are particularly large. To reduce the preprocessing time, we therefore do not apply all sequences to the preprocessing, but choose \(\delta\) sequences that are as different from each other as possible. In this way, the preprocessing results are obtained in a shorter time.
Fortunately, Wang et al. [25] define a metric for comparing two sequences. This metric is referred to as the diversity metric.
Let \({num_{s_{i}}^c}\) represent the number of occurrences of the character c in sequence \(s_i\). Note that the greater the value of \(\left| {num_{s_{i}}^c} - {num_{s_{j}}^c}\right|\), the more diverse \(s_i\) and \(s_j\) are, and the greater the difference between these two sequences. By considering this factor, the diversity between \(s_i\) and \(s_j\) is defined as:
The following example illustrates the above.
Using the diversity measure, we can choose, relative to sequence \(s_1\), the second sequence with the greater diversity.
\(s_1 = \texttt {AACGTCGT}\).
\(s_2 = \texttt {CGACGTCC}\).
\(s_3 = \texttt {GACCGTCT}\). We take \(s_1\) as \(s_i\) and count how many times each character appears in each sequence.
\({num_{s_{1}}^A} = 2\), \({num_{s_{1}}^C} = 2\), \({num_{s_{1}}^G} = 2\), \({num_{s_{1}}^T} = 2\)
\({num_{s_{2}}^A} = 1\), \({num_{s_{2}}^C} = 4\), \({num_{s_{2}}^G} = 2\), \({num_{s_{2}}^T} = 1\)
\({num_{s_{3}}^A} = 1\), \({num_{s_{3}}^C} = 3\), \({num_{s_{3}}^G} = 2\), \({num_{s_{3}}^T} = 2\)
\(diversity\left( s_{1}, {s}_{2}\right) = \frac{2}{8}\left| 2 - 1\right| + \frac{2}{8}\left| 2 - 4\right| + \frac{2}{8}\left| 2 - 2\right| + \frac{2}{8}\left| 2 - 1\right| = 1\)
\(diversity\left( s_{1}, {s}_{3}\right) = \frac{2}{8}\left| 2 - 1\right| + \frac{2}{8}\left| 2 - 3\right| + \frac{2}{8}\left| 2 - 2\right| + \frac{2}{8}\left| 2 - 2\right| = \frac{1}{2}\)
As a result, we conclude that \(s_1\) and \(s_2\) are more dissimilar than \(s_1\) and \(s_3\).
Therefore, when the number of sequences is large, we take \(s_1\) as \(s_i\) and use formula (4) to pick the \(\delta\) \((\delta \ll d)\) most different sequences (including \(s_1\)) from the given d sequences. In this way, we greatly reduce the preprocessing time of strategies 2 and 3.
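The selection step can be sketched as follows. Note the assumption: the weight \(num_{s_i}^c / |s_i|\) used below is read off the factors \(\frac{2}{8}\) in the worked example and may not match formula (4) of [25] exactly; the names `diversity` and `pick_most_diverse` are hypothetical.

```python
from collections import Counter
from typing import List


def diversity(si: str, sj: str, alphabet: str = "ACGT") -> float:
    """Weighted count difference between si and sj.
    ASSUMPTION: each character c is weighted by num_{si}^c / |si|, as the
    worked example suggests; the source formula may be defined differently."""
    ci, cj = Counter(si), Counter(sj)
    return sum(ci[c] / len(si) * abs(ci[c] - cj[c]) for c in alphabet)


def pick_most_diverse(seqs: List[str], delta: int, alphabet: str = "ACGT") -> List[str]:
    """Keep seqs[0] (playing the role of s_1) plus the delta - 1 sequences
    with the highest diversity relative to it."""
    base = seqs[0]
    ranked = sorted(seqs[1:], key=lambda s: diversity(base, s, alphabet), reverse=True)
    return [base] + ranked[:delta - 1]


SEQS = ["AACGTCGT", "CGACGTCC", "GACCGTCT"]
chosen = pick_most_diverse(SEQS, delta=2)
```

Because only character counts are needed, ranking all d sequences against \(s_1\) takes time linear in the total input length plus the cost of one sort.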
Construct miniDAG
Based on the above branch elimination approach, we construct the miniDAG level by level. First, level zero \(D_0\) consists only of the beginning match point O; then level 1 through level R, denoted \(D_1\), \(D_2\),..., \(D_R\), are created consecutively, where R denotes the length of the final MLCS. To save time and space, we create and store only one level at a time.
After \(D_k\) is constructed (initially, only \(D_0\) is constructed), \(D_{k+1}\) is built by the following steps:

1. Select every match point \(p \in D_k\) and search its successor set succ(p).
2. For each successor \(q \in succ(p)\), set the level of q (i.e., the length of the current longest path from O to q) as \(level(q) = k + 1\).
3. Identify whether q is a useless match point according to Theorem 1, Strategy 2 and Strategy 3. If it is, do not put q in the DAG and go to step 5. Otherwise, put q into \(D_{k+1}\).
4. Add a directed edge from p to q in the DAG.
5. If the successors of all match points in \(D_k\) have been checked, the construction of \(D_{k+1}\) is finished. Otherwise, go to step 1.
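The steps above can be sketched as a single function that builds \(D_{k+1}\) from \(D_k\). This is a simplified illustration under stated assumptions: only the Theorem 1 bound is applied (generalised here to \(\min_i (|s_i| - q_i)\), which equals \(n - max(q)\) for equal-length sequences), successors are found with `str.find` instead of a Successor Table, and the names `successors`, `upper_theorem1` and `next_level` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Point = Tuple[int, ...]  # 1-based positions; O is the all-zero point


def successors(p: Point, seqs: List[str], alphabet: str) -> Dict[str, Point]:
    """One candidate per character: its first occurrence strictly after p_i
    in every sequence. Characters missing from some suffix yield no successor."""
    result: Dict[str, Point] = {}
    for c in alphabet:
        q = tuple(s.find(c, pi) + 1 for s, pi in zip(seqs, p))
        if all(qi > 0 for qi in q):
            result[c] = q
    return result


def upper_theorem1(q: Point, seqs: List[str]) -> int:
    """Shortest remaining suffix; equals n - max(q) when all lengths are n."""
    return min(len(s) - qi for s, qi in zip(seqs, q))


def next_level(level_k: Set[Point], k: int, seqs: List[str], alphabet: str,
               lower_r: int) -> Tuple[Set[Point], List[Tuple[Point, Point]]]:
    next_points: Set[Point] = set()
    edges: List[Tuple[Point, Point]] = []
    for p in level_k:                                        # step 1
        for q in successors(p, seqs, alphabet).values():
            level_q = k + 1                                  # step 2
            if level_q + upper_theorem1(q, seqs) < lower_r:  # step 3: useless
                continue
            next_points.add(q)
            edges.append((p, q))                             # step 4
    return next_points, edges                                # step 5


SEQS = ["AACGTCGT", "CGACGTCC", "GACCGTCT"]
d1, e1 = next_level({(0, 0, 0)}, 0, SEQS, "ACGT", lower_r=5)
d2_from_c, _ = next_level({(3, 1, 3)}, 1, SEQS, "ACGT", lower_r=5)
```

With Theorem 1 alone, G(4, 2, 1) stays on level 1; adding the bounds of strategies 2 and 3 to the test in step 3 would remove further points.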
MiniMLCS algorithm
The pseudocode of the miniMLCS algorithm is presented in Algorithm 1, which describes the new algorithm in detail.
At the beginning, the estimated lower bound Lower(R) is calculated. The key steps of the proposed algorithm are lines 3 to 21, which describe how a miniDAG is built level by level. Ch(q) in line 8 denotes the character represented by the match point q, as described in Definition 4, and q.precs represents the longest common subsequence from the beginning match point O to the current point q. Finally, the longest paths in the miniDAG, which correspond to the MLCSs, are obtained, and all MLCSs are returned in lines 22 to 23.
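The overall flow described above can be approximated by the following Python sketch. It is not Algorithm 1 itself: the lower bound is passed in as a parameter `lower_r` (assumed not to exceed the true MLCS length) instead of being estimated, only the Theorem 1 bound is used for pruning, and the dictionary value attached to each point plays the role of q.precs. All function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Point = Tuple[int, ...]  # 1-based positions; O is the all-zero point


def successors(p: Point, seqs: List[str], alphabet: str) -> Dict[str, Point]:
    """First occurrence of each character strictly after p in every sequence."""
    result: Dict[str, Point] = {}
    for c in alphabet:
        q = tuple(s.find(c, pi) + 1 for s, pi in zip(seqs, p))
        if all(qi > 0 for qi in q):
            result[c] = q
    return result


def upper_theorem1(q: Point, seqs: List[str]) -> int:
    """Shortest remaining suffix; equals n - max(q) when all lengths are n."""
    return min(len(s) - qi for s, qi in zip(seqs, q))


def mini_mlcs_sketch(seqs: List[str], alphabet: str,
                     lower_r: int) -> Tuple[int, Set[str]]:
    """Return (R, set of all MLCSs). ASSUMPTION: lower_r <= true MLCS length."""
    origin = tuple(0 for _ in seqs)
    level: Dict[Point, Set[str]] = {origin: {""}}  # point -> prefixes (precs)
    k = 0
    while True:
        nxt: Dict[Point, Set[str]] = {}
        for p, precs in level.items():
            for ch, q in successors(p, seqs, alphabet).items():
                if (k + 1) + upper_theorem1(q, seqs) < lower_r:
                    continue  # useless match point: never stored
                nxt.setdefault(q, set()).update(pre + ch for pre in precs)
        if not nxt:          # no further level: the current one is D_R
            break
        level, k = nxt, k + 1
    return k, set().union(*level.values())


SEQS = ["AACGTCGT", "CGACGTCC", "GACCGTCT"]
length, mlcs = mini_mlcs_sketch(SEQS, "ACGT", lower_r=5)
```

Only one level is kept in memory at a time, mirroring the construction described earlier; storing whole prefix sets is a simplification that would not scale to the instance sizes used in the experiments.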
Time and space complexity
MiniMLCS time complexity
To compare the efficiency of Algorithm 1 with that of other algorithms, we give the time complexity of Algorithm 1 and of the compared algorithms. Let n denote the length of the sequences and d the number of sequences. In the initialization, we build the Successor Table proposed by FastLCS, which lets us rapidly find the successor nodes of a point, in \(O(d |\Sigma| n)\) time [17]. Second, we estimate the time needed to find successor nodes and add them to the Vector Hash Table: let N denote the whole collection of points in the miniDAG; the time complexity of this step is O(N). Finally, let E denote the whole collection of edges in the miniDAG; the corresponding time complexity is O(E). For strategy 1, the time complexity of building a DAG once and finding its lower bound is \(O(d|\Sigma||MLCS|t)\). Next, consider the upper bound. With Theorem 1, the time complexity of finding the upper bound is O(d). In the preprocessing phase (before the miniDAG construction), we choose \(\delta\) sequences from the d sequences by formula (4) and apply strategies 2 and 3 to them. Because the preprocessing results are stored in appropriate data structures, \(Upper(p,\infty )\) of strategy 2 can be computed quickly at any match point, in \(O(\delta |\Sigma|)\) time per match point. The preprocessing phase of strategy 3 takes \(O(\delta n^{2})\). Therefore, in the worst case, the time complexity of miniMLCS is \(O(d |\Sigma| n)\) + \(O(d|\Sigma||MLCS|t)\) + \(O(\delta n^{2})\) + O(E) + O(N). Since \(O(d |\Sigma| n) \ll O(\delta n^{2})\), \(O(d|\Sigma||MLCS|t) \ll O(\delta n^{2})\), O(N) = O(E) and \(O(\delta n^{2})\) < O(N), the time complexity of the proposed Algorithm 1 is O(N).
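To make the \(O(d |\Sigma| n)\) initialization cost concrete, the sketch below builds one successor table per sequence in a single backward pass per character. It follows the general idea of a successor table as described in the text; the exact layout used by FastLCS [17] may differ, and the names `build_successor_table` and `successors_from_tables` are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, ...]  # 1-based positions; O is the all-zero point


def build_successor_table(s: str, alphabet: str) -> Dict[str, List[int]]:
    """table[c][j] = smallest 1-based position > j holding character c,
    or 0 if c does not occur after position j. Costs O(|alphabet| * |s|)."""
    n = len(s)
    table = {c: [0] * (n + 1) for c in alphabet}
    for c in alphabet:
        nearest = 0
        for j in range(n, -1, -1):
            table[c][j] = nearest
            if j > 0 and s[j - 1] == c:
                nearest = j
    return table


def successors_from_tables(p: Point, tables: List[Dict[str, List[int]]],
                           alphabet: str) -> Dict[str, Point]:
    """Look up one candidate successor per character in O(d) time each."""
    result: Dict[str, Point] = {}
    for c in alphabet:
        q = tuple(t[c][pi] for t, pi in zip(tables, p))
        if all(qi > 0 for qi in q):
            result[c] = q
    return result


SEQS = ["AACGTCGT", "CGACGTCC", "GACCGTCT"]
TABLES = [build_successor_table(s, "ACGT") for s in SEQS]  # d tables in total
succ_c = successors_from_tables((3, 1, 3), TABLES, "ACGT")
```

After this preprocessing, finding all successors of a point needs \(d |\Sigma|\) table lookups and no scanning of the sequences.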
For the compared algorithms, QuickDP takes \(O\left( d(\log n)^{d-2}\left| N_{Q}\right| \right)\) [19], where \(N_{Q}\) is the set of points in the DAG constructed by QuickDP, and the time complexity of TopMLCS is \(O(N_{T})\) [20], where \(N_{T}\) is the set of points in the ICSG constructed by TopMLCS. It should be noted that, because TopMLCS lacks a reasonable scheme to reduce the search space, the DAG it constructs is much larger than those constructed by QuickDP and miniMLCS, and the DAG constructed by QuickDP is larger than that constructed by miniMLCS, i.e., \(N_{T}\) \(\gg\) \(N_{Q}\) > N. However, QuickDP uses the time-consuming non-dominated sorting method to reduce the search space, so \(O\left( d(\log n)^{d-2}\left| N_{Q}\right| \right)\) > \(O(N_{T})\). Thus, \(O\left( d(\log n)^{d-2}\left| N_{Q}\right| \right)\) > \(O(N_{T})\) \(\gg\) O(N).
MiniMLCS space complexity
Next, we calculate the space complexity of Algorithm 1. The space consumed by the Successor Table is \(O(d |\Sigma| n)\); the space complexity of strategy 2 is \(O(\delta |\Sigma| n)\) and that of strategy 3 is at most \(O(\delta n^2)\); storing the points takes O(dN) space and storing the edges takes O(E) space. Since \(O(d|\Sigma| n) \ll O(dN) + O(E)\), \(O(\delta |\Sigma| n) \ll O(dN) + O(E)\), \(O(dN) = O(E)\) and \(O(\delta n^2) \ll O(dN) + O(E)\), the space complexity of Algorithm 1 is O(dN). The space complexity of QuickDP and TopMLCS can be expressed as \(O(dN_{Q})\) [19] and \(O(dN_{T})\) [20], respectively. Since \(N_{T}\) > \(N_{Q}\) > N, we can deduce that our miniMLCS algorithm has lower space complexity than the two compared algorithms, owing to the use of the branch and bound strategy.
Experiments and analysis
Experimental setups and compared algorithms
To illustrate the performance of miniMLCS on large-scale MLCS problems, we conduct experiments comparing it to four state-of-the-art algorithms: FastLCS [17], QuickDP [19], TopMLCS [20], and A* search [26]. All experiments are conducted on a server equipped with four Intel(R) Xeon(R) E5-2640 2.40 GHz ten-core CPUs, 160 GB RAM, 4 NVidia Tesla K40 graphics cards, and 1.1 TB of disk space. Ubuntu 16.04 is the operating system. All algorithms are implemented in C and C++ and developed in Eclipse. Biological sequences from NCBI (http://www.ncbi.nlm.nih.gov/nuccore/110645304?report=fasta) are selected as the test sets. This is the complete genome sequence of Pseudomonas aeruginosa PAO1, and the experimental data are randomly selected from this genome. In addition, reference [26] offers a public benchmark set for the LCS problem. The BL benchmark [26] consists of 450 problem instances grouped by different values for the number of input strings (d), the maximum length of the input strings (n), and the alphabet size (\(|\Sigma|\)). For each combination of d, n, and \(|\Sigma|\), the set offers ten instances generated uniformly at random. We conduct the following four types of experiments, whose results are summarized in Tables 1, 2, 3 and 4.
For the first kind of experiment, we fix the sequence length to 120 and perform trials on 9 examples with the number of sequences ranging from 10,000 to 50,000. For the second kind of experiment, we fix the number of sequences to 20,000 and run the experiment on sixteen cases with sequence lengths ranging from 90 to 120. For the third kind of experiment, we extract the instances with a sequence length of 100 from the BL set and report the average results in Table 3. For the last kind of experiment, to show the robustness of the estimation method, we evaluate the effect of t on Lower(R) by varying the value of t.
Column 1 in Tables 1 and 3 indicates the number of DNA sequences; column 1 in Tables 2 and 4 indicates the length of the DNA sequences; column 2 in Tables 1, 2 and 4 indicates the length R of the MLCS of the test sequences. In Tables 1, 2 and 4, columns 3 to 6 give the mean running time (s) / memory (GB) for miniMLCS, TopMLCS, QuickDP and FastLCS, respectively. Boldface numbers denote the lowest running time for each data set, and '+' indicates that no result was obtained because the running time exceeded 5000 s. In Table 3, '−' indicates that the memory usage exceeded 32 GB and no result could be obtained.
Experimental results and analysis
As seen in Table 1, the FastLCS algorithm consistently fails to handle the instances with 10,000 to 50,000 DNA sequences, owing to its extraordinarily long runtime. Likewise, the QuickDP algorithm cannot handle the instances with 10,000 to 25,000 sequences. The results indicate that the difficulty of the tasks does not keep increasing as the number of sequences increases. This is because, for a fixed sequence length, increasing the number of sequences at first increases the cost of searching all MLCSs, but once the number of sequences reaches a certain level, both the number and the length of the MLCSs drop, and the cost of searching the MLCSs is reduced. The primary reason these two algorithms take so long is that they spend too much time on non-dominated sorting (note that as the number of sequences increases, non-dominated sorting requires a significant amount of time and space), whereas TopMLCS and miniMLCS do not require non-dominated sorting and thus take much less time.
As shown in Table 2, the time consumption of FastLCS and QuickDP increases dramatically as the length of the DNA sequences increases, whereas the time consumption of TopMLCS and miniMLCS increases considerably more slowly. Additionally, QuickDP runs faster than TopMLCS and miniMLCS when the sequence length is less than 95, because non-dominated sorting takes little time in these cases.
Table 3 lists average solution lengths \({\overline{R}}\), average times \({\overline{t}}\) in seconds until proven optimality has been reached, and the number of instances that could be solved to optimality \(\#\)opt (out of ten per line) for three approaches. It can be seen that miniMLCS is more efficient than TopMLCS and A* in processing small instance sets: because it filters out most of the match points, it does not cause memory overflow. In contrast, none of the instances with \(|\Sigma|\) = 4 and d \(\ge\) 50 could be solved to optimality by A* and TopMLCS due to the memory limit.
As shown in Table 4, the proposed algorithm has one parameter, t, for calculating the lower bound Lower(R) of the length of the MLCS. To assess the robustness of the estimation method, we examine the influence of t on Lower(R) by varying the value of t throughout the experiments. The more exact the estimated lower bound Lower(R) is, the more unnecessary match points can be found and removed from the DAG. The test problems consist of 20,000 sequences with lengths between 90 and 120. As can be observed from the experimental results, for each test problem with a constant length, the value of Lower(R) is affected very little by changes in the value of t. This indicates that the effect of t on Lower(R) is negligible and that the Lower(R) estimation approach is robust.
Conclusion
This paper proposed a branch elimination strategy (miniMLCS) that identifies useless match points in order to tackle large-scale MLCS problems effectively and efficiently. It quickly identifies unnecessary match points during the construction of the DAG, which reduces the time that previous DAG-based approaches spend on non-dominated sorting at each level. The experimental results show that it outperforms the current state-of-the-art algorithms FastLCS, QuickDP and TopMLCS, taking much less time and space, particularly for large-scale MLCS problems.
Availability of data and materials
The program code is available at https://github.com/BioLab310/mini_MLCS. Biological sequences from NCBI (http://www.ncbi.nlm.nih.gov/nuccore/110645304?report=fasta) are selected as the test sets.
References
Nogrady B. How cancer genomics is transforming diagnosis and treatment. Nature. 2020;579(7800):S10–1.
Aravanis A, Lee M, Klausner R. Next-generation sequencing of circulating tumor DNA for early cancer detection. Cell. 2017;168(4):571–4.
Huang DS, Zhao XM, Huang GB, Cheung YM. Classifying protein sequences using hydropathy blocks. Pattern Recognit. 2006;39(12):2293–300.
Pham D. Spectral distortion measures for biological sequence comparisons and database searching. Pattern Recognit. 2007;40(2):516–29.
Ou-Yang L, Zhang XF, Yan H. Sparse regularized low-rank tensor regression with applications in genomic data analysis. Pattern Recognit. 2020;107(502):107516.
Sankoff D. Matching sequences under deletion–insertion constraints. Proc Natl Acad Sci USA. 1972;69(1):4–6.
Hirschberg DS. Algorithms for the longest common subsequence problem. J ACM. 1977;24(4):664–75.
Masek WJ, Paterson M. A faster algorithm computing string edit distances. J Comput Syst Sci. 1980;20(1):18–31.
Hsu WJ, Du HW. Computing the longest common subsequence for a set of strings. BIT. 1984;24(1):45–59.
Apostolico A, Browne S, Guerra C. Fast linear-space computations of longest common subsequences. Theor Comput Sci. 1992;92(1):3–17.
Gregor J, Thomason MG. Dynamic programming alignment of sequences representing cyclic patterns. IEEE Trans Pattern Anal Mach Intell. 1993;15(2):129–35.
Huang RS, Yang CB, Tseng KT, Peng YH, Ann HY. Dynamic programming algorithms for the mosaic longest common subsequence problem. Inf Process Lett. 2007;102(2–3):99–103.
Yang J, Xu Y, Shang Y, Chen G, Peng YH, Ann HY. A space-bounded anytime algorithm for the multiple longest common subsequence problem. IEEE Trans Knowl Data Eng. 2014;26(11):2599–609.
Hakata K, Imai H. The longest common subsequence problem for small alphabet size between many strings. ISAAC. 1992;92:469–78.
Hakata K, Imai H. Algorithms for the longest common subsequence problem for multiple strings based on geometric maxima. Optim Methods Softw. 1998;10(2):223–60.
Korkin D. A new dominant point-based parallel algorithm for multiple longest common subsequence problem. Technical Report TR01148, Univ. of New Brunswick, Tech. Rep. 2001.
Chen Y, Wan A, Liu W. A fast parallel algorithm for finding the longest common sequence of multiple biosequences. BMC Bioinform. 2006;7(S4):S4.
Korkin D, Wang Q, Shang Y. An efficient parallel algorithm for the multiple longest common subsequence (MLCS) problem. ICPP. 2008;354–363.
Wang Q, Korkin D, Shang Y. A fast multiple longest common subsequence (MLCS) algorithm. IEEE Trans Knowl Data Eng. 2011;23(3):321–34.
Li Y, Wang Y, Zhang Z, Wang Y, Ma D, Huang J. A novel fast and memory-efficient parallel MLCS algorithm for long and large-scale sequences alignments. ICDE. 2016;1170–1181.
Liu S, Wang Y, Tong W, Wei S. A fast and memory efficient MLCS algorithm by character merging for DNA sequences alignment. Bioinformatics. 2019;36(4):1066–79.
Wei S, Wang Y, Yang Y, Liu S. A path recorder algorithm for Multiple Longest Common Subsequences (MLCS) problems. Bioinformatics. 2020;36(10):3035–42.
Smith T, Waterman M. Identification of common molecular subsequences. J Mol Biol. 1981;147(1):195–7.
Peng Z, Wang Y. A novel efficient graph model for the multiple longest common subsequences (MLCS) problem. Front Genet. 2017;8:104.
Wang C, Wang Y, Cheung Y. A branch and bound irredundant graph algorithm for large-scale MLCS problems. Pattern Recognit. 2021;119(4):108059.
Djukanovic M, Raidl GR, Blum C. Finding longest common subsequences: new anytime A* search results. Appl Soft Comput. 2020;95:106499.
Wang Q, Pan M, Shang Y. A fast heuristic search algorithm for finding the longest common subsequence of multiple strings. AAAI. 2010;24(1):1287–92.
Pearl J. Heuristics: intelligent search strategies for computer problem solving. Fri. 1984;1(1):382.
Acknowledgements
Not applicable.
Funding
This work is supported by the National Natural Science Foundation of China (61772124).
Author information
Contributions
CY implemented the algorithms, performed the experiments, analyzed the data, and wrote the paper. PL and YZ designed the experiments and the algorithms, analyzed the data, and reviewed drafts of the paper. TR and GW analyzed the data, and reviewed drafts of the paper. All authors have read and approved the manuscript for publication.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Yu, C., Lin, P., Zhao, Y. et al. A fast and efficient path elimination algorithm for large-scale multiple common longest sequence problems. BMC Bioinformatics 23, 366 (2022). https://doi.org/10.1186/s12859-022-04906-5
DOI: https://doi.org/10.1186/s12859-022-04906-5
Keywords
 Multiple longest common subsequences (MLCS)
 The branch and bound
 MiniMLCS