A fast and efficient path elimination algorithm for large-scale multiple longest common subsequence problems

Background In various fields, searching for the Longest Common Subsequence (LCS) of Multiple (i.e., three or more) sequences (MLCS) is a classic but difficult problem. The primary bottleneck is that present state-of-the-art algorithms must construct a huge graph (a directed acyclic graph, or DAG), which the computer usually does not have enough space to handle. Because of their massive time and space consumption, present algorithms are inapplicable to problems with long and large-scale sequences. Results A mini Directed Acyclic Graph (mini-DAG) model and a novel Path Elimination Algorithm are proposed to address large-scale MLCS problems efficiently. In the mini-DAG model, we employ a branch and bound approach to eliminate paths during DAG construction, resulting in a very small DAG (mini-DAG), which saves memory space and search time. Conclusion Empirical experiments have been performed on a standard benchmark set of DNA sequences. The experimental results show that our model outperforms the leading algorithms, especially on large-scale MLCS problems.

track of the longest paths connecting the source point to itself. In addition to the optimal algorithms, some excellent heuristic algorithms have been proposed. The algorithm MLCS-A* [27] was presented to find an LCS for any given number of sequences. MLCS-A* is a variant of the A* algorithm, a provably optimal best-first search algorithm [28]. But unlike A*, which finds the least-cost path in a graph, MLCS-A* searches a multidimensional matrix for the longest path corresponding to an LCS. In 2020, a new anytime A* search [26] was proposed to solve problem instances in various scenarios; apart from providing excellent solutions, the anytime A* search can return proven gaps at almost any time when terminated prematurely. However, as the number and length of the sequences increase, these strategies still cannot cope with the challenges of large-scale MLCS problems.
In this paper, mini-MLCS is developed to ensure robust performance when solving MLCS problems. The primary difference between the proposed algorithm and state-of-the-art algorithms is that mini-MLCS employs a novel path elimination strategy based on lower and upper bound estimation to efficiently remove a large number of unnecessary match points and non-optimal paths from the DAG, avoiding the use of non-dominated sorting and topological sorting, which are both extremely time-consuming and require a large amount of memory. As a result, the constructed DAG is modest in size, and mini-MLCS can efficiently identify the longest paths corresponding to the MLCSs with minimal run time and memory usage. Our main contributions are as follows:
1. We design a novel branch and bound strategy to eliminate unnecessary paths when constructing the DAG. Before obtaining the final MLCS, if we can determine that the currently calculated match point cannot lie on an MLCS, then no path through this point can be among the longest; such points and paths are called unnecessary match points and non-optimal paths, and they need not be included in the DAG.
2. We design a smaller DAG (mini-DAG) that avoids the non-dominated sorting and topological sorting of the typical DAG construction process. The proposed branch and bound strategy can eliminate match points without pairwise comparisons between them, which greatly reduces the time spent on non-dominated sorting and topological sorting.
3. We propose a fast and efficient algorithm (mini-MLCS) to deal with large-scale sequence problems at lower time and space cost. We compare mini-MLCS with the state-of-the-art algorithms, and the results show that it outperforms them and is well suited to large-scale MLCS problems.

Definitions of LCS/MLCS problems
Definition 1 Let s = c_1 c_2 ... c_n represent a sequence on a character set Σ, where c_i ∈ Σ, 1 ≤ i ≤ n, |Σ| represents the cardinality of Σ, and |s| represents the length of s, i.e., |s| = n. If the sequence s* = c_{i_1} c_{i_2} ... c_{i_m} fulfills 1 ≤ i_1 < i_2 < ... < i_m ≤ n, s* is referred to as a subsequence of s, represented by s* ∈ sub(s), where sub(s) is the set of all subsequences of s. In other words, deleting zero or more characters from a given sequence s, without changing the order of the remaining characters, yields a subsequence of s.
For example, if we delete the characters G and T from the sequence s = ACGTA, the resulting sequence s* = ACA is a subsequence of s. Definition 2 Given a sequence set Y = {s_1, s_2, ..., s_d} on a character set Σ, where d ≥ 2 is the number of sequences contained in Y, if a sequence s* is a subsequence of every sequence in Y, then s* is called a common subsequence of the sequences in Y. If, in addition, the length of s* is the longest among all common subsequences of Y, then s* is a longest common subsequence of the sequences in Y.
For example, given Y = {s_1 = AACGTCGT, s_2 = CGACGTCC, s_3 = GACCGTCT}, the sequences s*_1 = ACGTC, s*_2 = AGC, ..., s*_m = C are all common subsequences of s_1, s_2, s_3 in the set Y, and s*_1 is the longest among them, so s*_1 is a longest common subsequence of the sequences in Y. Usually, there is more than one LCS for given d sequences. If d = 2, the problem of finding an LCS is usually called the LCS problem; if d ≥ 3, it is called the MLCS problem.
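The subsequence relation in Definitions 1 and 2 is easy to check programmatically. The following minimal sketch (the helper name `is_subsequence` is ours, not from the paper) verifies the example above:

```python
def is_subsequence(sub, s):
    """True if `sub` can be obtained from `s` by deleting characters
    without reordering the remainder (Definition 1)."""
    it = iter(s)
    return all(c in it for c in sub)  # `in` consumes the iterator left to right

# The example set Y from the text: ACGTC is a common subsequence of all three.
Y = ["AACGTCGT", "CGACGTCC", "GACCGTCT"]
print(all(is_subsequence("ACGTC", s) for s in Y))  # True
print(is_subsequence("ACGTC", "ACCTG"))            # False: order not preserved
```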

Dynamic programming approaches
The DP approach is a time-honored method for solving LCS and MLCS problems [23]. Given d sequences s_1, s_2, ..., s_d of length n, it iteratively constructs a d-dimensional score table L with n^d elements, where the element L[i_1, i_2, ..., i_d] represents the length of the MLCS of the prefix sequences pre(s_1, i_1), pre(s_2, i_2), ..., pre(s_d, i_d), which can be calculated by the following recurrence [24]:

L[i_1, ..., i_d] = 0, if some i_j = 0;
L[i_1, ..., i_d] = L[i_1 - 1, ..., i_d - 1] + 1, if s_1[i_1] = s_2[i_2] = ... = s_d[i_d];
L[i_1, ..., i_d] = max{L[i_1 - 1, i_2, ..., i_d], ..., L[i_1, i_2, ..., i_d - 1]}, otherwise.

After constructing the score table L, the MLCS can be obtained by tracing back from the bottom-right element L[n, n, ..., n] to the top-left element L[0, 0, ..., 0]. For example, Fig. 1 illustrates the score table L constructed for the sequences s_1 = AACGTCGT and s_2 = CGACGTCC, and the LCS of these two sequences is determined by tracing back from L[8, 8]. It can be observed from Fig. 1 that, for d sequences of length n, the DP approach has time and space complexity up to O(n^d) [9]. As d and n grow, these approaches use exponentially more space and time. That is, the scalability of a DP approach is restricted, making it unsuitable for large-scale MLCS problems.
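For d = 2, the recurrence is the classic quadratic LCS algorithm. A minimal sketch on the example pair from Fig. 1:

```python
def lcs_length(s1, s2):
    """Dynamic-programming score table L for two sequences (d = 2).
    L[i][j] holds the LCS length of the prefixes s1[:i] and s2[:j]."""
    n1, n2 = len(s1), len(s2)
    L = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            if s1[i - 1] == s2[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1   # match: extend the diagonal
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n1][n2]

print(lcs_length("AACGTCGT", "CGACGTCC"))  # 5 (e.g., ACGTC)
```

The table has (n + 1)^d entries in general, which is exactly why the O(n^d) cost becomes prohibitive beyond small d.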

Dominant point-based approaches
Before delving into the details of the dominant point-based method, we define a few terms. Definition 5 Given two match points p and q of d sequences on a character set Σ, we say: 4. q is called a successor of p if p ≺ q; further, if there is no match point r satisfying p ≺ r ≺ q, then q is called an immediate successor of p. 5. If q is a successor of p, we call p a predecessor of q.
Generally, a match point p has no more than |Σ| successors.
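Successors can be found in constant time per character if a successor table is precomputed for each sequence, as Fast-LCS does [17]. A sketch of one possible layout of such a table (1-based positions; the function name and layout are ours):

```python
def build_successor_table(s, alphabet):
    """succ[c][j] = smallest 1-based position k > j with s[k-1] == c,
    or None if character c does not occur after position j."""
    n = len(s)
    succ = {c: [None] * (n + 1) for c in alphabet}
    for c in alphabet:
        nxt = None
        for j in range(n, -1, -1):        # fill right to left
            succ[c][j] = nxt
            if j > 0 and s[j - 1] == c:
                nxt = j
    return succ

tab = build_successor_table("AACGTCGT", "ACGT")
print(tab["C"][0], tab["C"][3], tab["G"][4])  # 3 6 7
```

Building the table for all d sequences takes O(d|Σ|n) time and space, after which each successor lookup is O(1).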

Definition 6
Given a collection of match points P = {P_1, P_2, ..., P_m}, a match point P_j ∈ P is called a non-dominated point (dominant point for short) on P if ¬∃ P_i ≺ P_j, 1 ≤ i, j ≤ m, i ≠ j. All dominant points on P form the dominant set of P.
The dominant point-based approaches are based on constructing a directed acyclic graph (DAG), as follows. First, given d sequences, the graph's source point is defined as a d-dimensional point O = (0, 0, ..., 0). This point has no incoming edge, so its in-degree is 0, and its level is defined as level 0. Next, we identify all successor points of the current point O and create a directed edge from it to each of its successors; these successor points form level 1. Non-dominated sorting is used to compute the set of dominant points on level 1. Then, for each dominant point on level 1, we identify all its successor points, create an edge from it to each successor, and assign these successors to level 2, on which non-dominated sorting again identifies the dominant points. This procedure continues until no additional successor points are generated, at which time the DAG construction is complete. A point without a successor is connected to the ending point (∞, ∞, ..., ∞). Once the DAG is established, the LCS/MLCS is the character sequence represented by the points along the longest path from the source point to the ending point. Thus, the primary problem of the LCS/MLCS problem is how to construct the DAG.
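The successor step of this construction can be sketched directly: for each character c, every sequence advances to its next occurrence of c, giving at most |Σ| successors per point (a sketch with 1-based sequence positions, where 0 denotes the source coordinate):

```python
def successors(point, seqs, alphabet):
    """Successors of a match point: for each character c, the component-wise
    next occurrence of c strictly after the current position in every sequence."""
    out = []
    for c in alphabet:
        q = tuple(s.find(c, p) + 1 for s, p in zip(seqs, point))  # 0 = not found
        if all(q):
            out.append((c, q))
    return out

seqs = ["AACGTCGT", "CGACGTCC", "GACCGTCT"]
for c, q in successors((0, 0, 0), seqs, "ACGT"):  # level 1 from the source O
    print(c, q)
```

On the example sequences this yields the level-1 points A(1, 3, 2), C(3, 1, 3), G(4, 2, 1) and T(5, 6, 6).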
As seen above, the dominant point-based techniques have the following significant disadvantages: 1. Each level may contain numerous repeated match points and dominated points (e.g., T(5, 6, 6) appears three times in L_2, and the three points G(4, 5, 5), T(5, 6, 6), G(7, 5, 5) are dominated points), and a match point appearing in one level may appear numerous times in subsequent levels (e.g., T(5, 6, 6) appears in L_1-L_4) while being useful only in the final level. Thus, the constructed DAG can become so large that the computer runs out of memory to hold it. 2. The non-dominated sorting approach requires a significant amount of work to obtain D_k. It has O(d·N_k^2) time complexity at level k, where N_k is the number of match points in L_k and d is the number of sequences. Note that N_k can be huge (in the worst case, N_k = |Σ|^k rises exponentially as level k increases). Thus, when n, d, and |Σ| are large, i.e., when the MLCS problem becomes a large-scale problem, the non-dominated sorting approach becomes very time-consuming.
Fast-LCS [17] and Quick-DP [19] are two representative algorithms of this kind.

The proposed mini-MLCS
The main framework of mini-MLCS
As previously stated, existing approaches cannot address large-scale MLCS problems owing to their enormous time and space requirements [22]. The underlying reason is that, as the number d and length n of the sequences grow, non-dominated sorting and topological sorting spend a great deal of time comparing match points; consequently, the computing time and storage requirements exceed practical limits. To address these problems, the proposed mini-MLCS rapidly finds unnecessary match points and non-optimal paths during DAG construction and eliminates them in time to limit the DAG's size.
More precisely, in order to obtain the final MLCS, mini-MLCS first uses a strategy to quickly estimate a lower bound Lower(R) on the length of the genuine MLCS R. Then, before deciding whether to include a match point p in the DAG, mini-MLCS calculates an upper bound Upper(p, ∞) on the length of any path from p to the ending match point. If the true length of the longest path from p to the ending point is distance(p), the obtained upper bound satisfies Upper(p, ∞) ≥ distance(p). Finally, it checks whether the estimated length of the longest path from the source point to the ending point through p is less than Lower(R) (note that the length of the longest path from the source point O to p is the current level value level(p) when the DAG reaches p). If Upper(O, p, ∞) = level(p) + Upper(p, ∞) < Lower(R), then no path via p can be among the longest. As a result, p is an unnecessary match point, and all paths passing through it are non-optimal and can be deleted from the DAG immediately.
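With the simple upper bound of Theorem 1 (Upper(p, ∞) = n − max(p)), the elimination test reduces to one comparison per match point. A sketch (function name ours):

```python
def is_unnecessary(point, level, n, lower_R):
    """Branch-and-bound test: prune p when
    Upper(O, p, inf) = level(p) + Upper(p, inf) < Lower(R),
    here with the simple bound Upper(p, inf) = n - max(p) (Theorem 1)."""
    return level + (n - max(point)) < lower_R

# Example with Lower(R) = 5 and n = 8: the point (7, 5, 5) at level 1
# has upper bound 1 + (8 - 7) = 2 < 5, so no longest path passes through it.
print(is_unnecessary((7, 5, 5), 1, 8, 5))  # True: prune
print(is_unnecessary((5, 6, 6), 1, 8, 5))  # True: reached too early, prune
print(is_unnecessary((5, 6, 6), 4, 8, 5))  # False: 4 + 2 >= 5, keep
```

Note how the same point T(5, 6, 6) is pruned at level 1 but kept at level 4, matching the earlier observation that it is only useful in a late level.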

Estimation of the lower bound Lower(R) in a short time
We do not know the genuine length of the MLCS R until we obtain it, but we can obtain a lower bound Lower(R) by generating an estimated MLCS: the length of this estimated MLCS is a lower bound on the length of R, and the longer the estimated MLCS, the more closely Lower(R) approaches the true length. Our objective is therefore to find a long estimated MLCS as rapidly as feasible. A rapid heuristic strategy for calculating the lower bound Lower(R) is designed based on these ideas. The critical steps are listed below.
For a d-dimensional match point p = (p_1, p_2, ..., p_d), let max(p) denote the largest coordinate of p, min(p) the smallest coordinate, and ϕ(p) = max(p) − min(p) the largest position offset of the d sequences at p. Among all match points at each level, we pick the t points with the smallest ϕ(). As can be observed from Fig. 1, a match point with smaller ϕ() tends to have a larger dominated region than a point with larger ϕ(), and a larger dominated region may contain more match points, so a point with smaller ϕ() is more likely to lie on a longest common subsequence. For example, in Fig. 1, C(3, 4) has a larger dominated region than C(3, 7), so C(3, 4) is more likely to be part of the longest common subsequence. Based on this idea, two strategies are proposed to obtain an accurate lower bound.
Strategy 1: Assign a small initial value to t in the DAG construction process, meaning that the t match points with the smallest ϕ() value are selected at each level of the DAG. The role of t is to reduce the search space and obtain a suitable Lower(R) quickly; in this way we get an initial Lower(R). Next, add a step length µ to t each time, so that t = t + µ, and construct the DAG again to recalculate Lower(R), updating it if it improves. If Lower(R) does not change for more than τ (a user-defined number of) iterations, we consider that a sufficiently accurate Lower(R) has been obtained.
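Strategy 1 can be read as a beam search whose beam width t grows by µ until the estimate stalls for τ rounds. The following is a rough sketch under that reading (the default values of `t0`, `mu`, `tau` are illustrative, not from the paper):

```python
def estimate_lower_bound(seqs, alphabet, t0=2, mu=2, tau=3):
    """Sketch of Strategy 1: keep only the t match points with the smallest
    offset phi(p) = max(p) - min(p) at each level, and widen the beam by mu
    until the lower bound stops improving for tau consecutive rounds."""
    def beam_lcs(t):
        level, length = [tuple([0] * len(seqs))], 0
        while True:
            nxt = set()
            for p in level:
                for c in alphabet:
                    q = tuple(s.find(c, i) + 1 for s, i in zip(seqs, p))
                    if all(q):                    # c occurs after p everywhere
                        nxt.add(q)
            if not nxt:
                return length
            level = sorted(nxt, key=lambda q: max(q) - min(q))[:t]
            length += 1

    t, lower, stable = t0, 0, 0
    while stable < tau:
        cand = beam_lcs(t)
        if cand > lower:
            lower, stable = cand, 0
        else:
            stable += 1
        t += mu
    return lower

print(estimate_lower_bound(["AACGTCGT", "CGACGTCC", "GACCGTCT"], "ACGT"))
```

On the example set this recovers the true MLCS length of 5, so the lower bound happens to be tight here; in general it only bounds R from below.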

Estimation of the upper bound Upper(O, p, ∞) with efficiency
Assume p is a current point in the DAG and we want to know the lengths of all paths from O through p to the ending match point. These lengths are unknown until the paths are constructed, but if we can estimate an upper bound Upper(O, p, ∞) on them and it is less than the lower bound Lower(R) (i.e., Upper(O, p, ∞) < Lower(R)), then we can conclude that the paths via p are not the longest paths and can be removed from the DAG. In this manner, the new DAG will be far smaller than before.
Notably, once the DAG's current match point p has been established, the length of the longest path from O to p is already determined: it is the DAG level of p (denoted by level(p)).
On the other hand, the genuine length distance(p) of the longest path between the current match point p and the ending match point is generally unknown. A feasible method is to estimate an upper bound Upper(p, ∞) on it; then Upper(O, p, ∞) = level(p) + Upper(p, ∞) is an upper bound on the length of any path via p. In the following, we design strategies for rapidly estimating Upper(p, ∞) while bringing it as near to the real value distance(p) as feasible (i.e., making it as small as possible).
Given d sequences s_1, s_2, ..., s_d of length n on a character set Σ and a match point p = (p_1, p_2, ..., p_d), the following conclusion is obtained. Theorem 1 Upper(p, ∞) = n − max(p) is an upper bound on distance(p). Proof Denote ξ = n − max(p). Obviously, distance(p) equals the length of the longest common subsequence of the suffix sequences suf(s_i[p_i]) (1 ≤ i ≤ d), and distance(p) ≤ ξ because the sequence corresponding to max(p) has at most ξ characters after that position. Therefore Upper(p, ∞) = n − max(p), and hence Upper(O, p, ∞) = level(p) + n − max(p).
Some extensions can be developed on the basis of Theorem 1. Select the δ sequences of the match point p with the largest coordinates, compute the LCS of the corresponding suffix sequences suf(s_i[p_i]) (1 ≤ i ≤ δ), and use its length as the upper bound for p. The upper bound obtained in this way is undoubtedly closer to distance(p) than the one given by Theorem 1, but it takes longer to calculate for each match point.
Let us analyze the example in Fig. 2 in further depth to demonstrate the strategy for identifying unnecessary match points. The MLCS's estimated lower bound is known, i.e., Lower(R) = 5, as shown in Fig. 3. According to Theorem 1, Upper(p, ∞) can be estimated, and level(p) is available during the DAG construction process, as shown in Fig. 4a.
Fortunately, Wang et al. [25] define a metric for comparing two sequences. This metric is referred to as the diversity metric.
Let num_c(s_i) represent the number of occurrences of the character c in sequence s_i. Note that the greater the value of |num_c(s_i) − num_c(s_j)|, the more diverse s_i and s_j are with respect to c, and the greater the difference between the two sequences. By considering this factor over all characters, the diversity between s_i and s_j is defined as:

diversity(s_i, s_j) = Σ_{c∈Σ} |num_c(s_i) − num_c(s_j)|    (4)

Here are a few instances that illustrate the above.
Based on the diversity measure, we can choose the sequence with the greatest diversity relative to sequence s_1. Given s_1 = AACGTCGT, s_2 = CGACGTCC, s_3 = GACCGTCT, we choose s_i = s_1 and then count how many times each character appears in these sequences.
As a result, we conclude that s_1 and s_2 are more dissimilar than s_1 and s_3. Therefore, when the scale of the problem is large, we choose s_i = s_1 and pick the δ (δ ≪ d) most mutually different sequences (including s_1) from the given d sequences using formula (4). In this way, we greatly reduce the preprocessing time of strategies 2 and 3.
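Under our reading of formula (4) as a sum of absolute character-count differences, the example computation looks like this:

```python
from collections import Counter

def diversity(si, sj, alphabet="ACGT"):
    """Diversity of two sequences: sum over the alphabet of absolute
    differences in character counts (our reading of formula (4))."""
    ci, cj = Counter(si), Counter(sj)
    return sum(abs(ci[c] - cj[c]) for c in alphabet)

s1, s2, s3 = "AACGTCGT", "CGACGTCC", "GACCGTCT"
print(diversity(s1, s2))  # 4: s2 differs more from s1 ...
print(diversity(s1, s3))  # 2: ... than s3 does
```

Counting characters once per sequence makes each pairwise diversity O(|Σ|), so selecting the δ most different sequences is cheap relative to DAG construction.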

Construct mini-DAG
Based on the above branch elimination approach, we construct the mini-DAG level by level. First, level 0, D_0, consists merely of the beginning match point O; then levels 1 through R, denoted D_1, D_2, ..., D_R, respectively, are constructed consecutively, where R denotes the length of the final MLCS. To minimize time and space, we create and store only one level at a time.
After D_k is constructed (initially, D_0), the following procedure can be taken to build D_{k+1}:

Mini-MLCS algorithm
The pseudo-codes of algorithm mini-MLCS are presented in Algorithm 1 in order to describe the new algorithm in detail.
At the beginning, the estimated lower bound Lower(R) is calculated. The proposed algorithm's key steps are lines 3∼21, which describe how the mini-DAG is built level by level. Ch(q) in line 8 denotes the character represented by the match point q, as described in Definition 4, and q.precs represents the longest common subsequence from the beginning match point O to the current point q. Finally, the longest paths, corresponding to the MLCSs, are extracted from the mini-DAG, and all MLCSs are returned in lines 22∼23.

Mini-MLCS time complexity
In order to show the efficiency of Algorithm 1 compared with other algorithms, the time complexities of the proposed Algorithm 1 and the compared algorithms are derived here. First, let n denote the length of the sequences and d the number of sequences. In the initialization, we build the successor table proposed by Fast-LCS, so that the successor nodes of a point can be discovered rapidly, in O(d|Σ|n) time [17]. Second, we estimate the time cost of finding successor nodes and adding them to the vector hash table: using N to represent the whole collection of points in the mini-DAG, this takes O(|N|) time. Finally, using E to represent the whole collection of edges in the mini-DAG, processing them takes O(|E|) time. For strategy 1, the time complexity of building a DAG once and finding its lower bound is O(d|Σ||MLCS|t). Next consider the upper bounds. In Theorem 1, the time complexity of computing the upper bound is O(d). In the pretreatment phase (before the mini-DAG construction), we choose δ sequences from the d sequences by formula (4) and apply them to strategies 2 and 3. By storing the preprocessing results in appropriate data structures, we can compute the Upper(p, ∞) of strategy 2 at any match point quickly, with time complexity O(δ|Σ|) per match point; strategy 3 takes O(δn^2) in the preprocessing phase. Therefore, in the worst case, the time complexity of mini-MLCS is O(d|Σ|n + d|Σ||MLCS|t + δn^2 + δ|Σ||N| + |E|). For the compared algorithms, Quick-DP takes O(d(log n)^{d−2}|N_Q|) [19], where N_Q is the set of points in the DAG constructed by Quick-DP, and the time complexity of Top-MLCS is O(|N_T|) [20], where N_T is the set of points in the ICSG constructed by Top-MLCS.
It should be noted that, due to the lack of a reasonable scheme to reduce the search space, the DAG constructed by Top-MLCS is much larger than those constructed by Quick-DP and mini-MLCS, and the DAG constructed by Quick-DP is larger than that constructed by mini-MLCS, i.e., |N_T| ≫ |N_Q| > |N|; moreover, Quick-DP relies on the time-consuming non-dominated sorting method to reduce its search space.

Mini-MLCS space complexity
Next, we calculate the space complexity of Algorithm 1. The space consumed by the successor table is O(d|Σ|n), and mini-MLCS additionally stores O(d|N|) for the points of the mini-DAG; the space complexities of Quick-DP and Top-MLCS are O(d|N_Q|) [19] and O(d|N_T|) [20], respectively. Since |N_T| > |N_Q| > |N|, we can deduce that our mini-MLCS algorithm has lower space complexity than the two compared algorithms, owing to the use of the branch and bound strategy.

Experimental setups and compared algorithms
To illustrate mini-MLCS's performance on large-scale MLCS problems, we conduct experiments comparing it to four state-of-the-art algorithms: Fast-LCS [17], Quick-DP [19], Top-MLCS [20], and A* search [26]. All experiments are conducted on a server equipped with four Intel(R) Xeon(R) E5-2640 2.40 GHz ten-core CPUs, 160 GB RAM, 4 NVidia Tesla K40 graphics cards, and 1.1 TB of disk space. Ubuntu 16.04 is the operating system. All algorithms are written in Eclipse and compiled as C and C++ code. Biological sequences from NCBI (http://www.ncbi.nlm.nih.gov/nuccore/110645304?report=fasta) are selected as the test sets. This is the complete genome sequence of Pseudomonas aeruginosa PAO1, and the experimental data are randomly selected from this genome. The related literature [26] on the LCS problem offers a public benchmark set: the BL benchmark [26] consists of 450 problem instances grouped by different values for the number of input strings (d), the maximum length of the input strings (n), and the alphabet size (|Σ|). For each combination of d, n, and |Σ| the set offers ten instances generated uniformly at random. We conduct the following four types of experiments, with results summarized in Tables 1, 2, 3 and 4.
Table 1 reports the run time (s)/memory (GB) consumed by the compared algorithms on DNA sequences with length fixed to 120; the bold values represent the minimum running time and minimum memory among all the algorithms on each dataset.

Number of sequences
For the first kind of experiment, we fix the sequence length to 120 and perform trials on 9 instances with the number of sequences ranging from 10,000 to 50,000. For the second kind, we fix the number of sequences to 20,000 and run the experiment on sixteen cases with sequence lengths ranging from 90 to 120. For the third kind, we extract instances with sequence length 100 from the BL set and report the average results in Table 3. For the last kind, to show the estimation method's robustness, we evaluate the effect of t on Lower(R) by changing the values of t throughout the experiments.
Column 1 in Tables 1 and 3 indicates the total number of DNA sequences; column 1 in Tables 2 and 4 indicates the length of the DNA sequences; column 2 in Tables 1, 2 and 4 indicates the length R of the MLCS of the test sequences. In Tables 1, 2 and 4, columns 3 to 6 provide the mean running time (s)/memory (GB) for mini-MLCS, Top-MLCS, Quick-DP and Fast-LCS, respectively. Boldface numbers denote the lowest running time on the data set, and '+' indicates that no result is obtained because the running time exceeds 5000 s. For Table 3, '−' indicates that the running memory exceeds 32 GB and the result cannot be obtained.

Experimental results and analysis
As seen in Table 1, the Fast-LCS algorithm consistently fails on the instances with 10,000 to 50,000 DNA sequences owing to its extraordinarily long runtime, and the Quick-DP algorithm likewise cannot handle the instances with 10,000 to 25,000 sequences. The results indicate that the difficulty of the tasks does not always increase with the number of sequences. This is because, for a fixed sequence length, increasing the number of sequences at first increases the cost of searching for all MLCSs, but once the number of sequences reaches a certain level, both the number and length of the MLCSs drop, and the search cost is reduced. The primary reason these two algorithms take so long is the cost of non-dominated sorting (note that as the number of sequences increases, non-dominated sorting requires a significant amount of time and space), whereas Top-MLCS and mini-MLCS do not require non-dominated sorting and thus take much less time.
As shown in Table 2, the time consumption of Fast-LCS and Quick-DP increases dramatically as the length of the DNA sequences increases, whereas the time consumption of Top-MLCS and mini-MLCS increases considerably more slowly. Additionally, Quick-DP runs faster than Top-MLCS and mini-MLCS when the sequence length is less than 95; in these circumstances the time required for non-dominated sorting is short.
Table 3 lists average solution lengths R, average times t in seconds until proven optimality is reached, and the number of instances that could be solved to optimality, #opt (out of ten per line), for three approaches. It can be seen that mini-MLCS is more efficient than Top-MLCS and A* in processing the small instance sets: because it filters out most of the match points, it does not cause memory overflow.
However, none of the instances with |Σ| = 4 and d ≥ 50 could be solved to optimality by A* and Top-MLCS due to the memory limit. As shown in Table 4, the proposed algorithm has one parameter, t, for calculating the lower bound Lower(R) of the length of the MLCS. To assess the estimation method's robustness, we examine the influence on Lower(R) of altering the values of t throughout the experiments; the more exact the estimated lower bound Lower(R) is, the more unnecessary match points can be found and removed from the DAG. The experiments are on problems with 20,000 sequences of lengths varying between 90 and 120. As can be observed from the experimental results, for each test problem with a constant length, the values of Lower(R) are affected very little by changes in the values of t. This indicates that the effect of t on Lower(R) is negligible and that the Lower(R) estimation approach is robust.

Conclusion
This paper proposed a unique branch elimination strategy (mini-MLCS) for effectively and efficiently tackling large-scale MLCS problems. By identifying useless match points during the construction of the DAG, it eliminates the time spent on non-dominated sorting at each level of a conventional DAG. The experimental results show that mini-MLCS outperforms the current state-of-the-art algorithms Fast-LCS, Quick-DP and Top-MLCS, taking much less time and space, particularly on large-scale MLCS problems.