- Research
- Open Access
A fast exact sequential algorithm for the partial digest problem
- Mostafa M. Abbas^{1} and
- Hazem M. Bahig^{2}
https://doi.org/10.1186/s12859-016-1365-2
© The Author(s). 2016
- Published: 22 December 2016
Abstract
Background
Restriction site analysis involves determining the locations of restriction sites after digestion by reconstructing their positions from the lengths of the cut DNA fragments. Using different reaction times with a single enzyme to cut DNA is a technique known as partial digestion. Determining the exact locations of restriction sites following a partial digestion is challenging due to the computational time required, even with the best known practical algorithm.
Results
In this paper, we introduce an efficient algorithm to find the exact solution for the partial digest problem. The algorithm is able to find all possible solutions for the input and works by traversing the solution tree with a breadth-first search in two stages and deleting all repeated subproblems. Two types of simulated data, random and Zhang, are used to measure the efficiency of the algorithm. We also apply the algorithm to real data for the Luciferase gene and the E. coli K12 genome.
Conclusion
Our algorithm is a fast tool for finding the exact solution for the partial digest problem. The improvement over the best known practical algorithm is more than 75% in the worst case. For large inputs, our algorithm solves the problem in a suitable time, while the best known practical algorithm cannot.
Keywords
- Restriction site analysis
- Digestion process
- Partial digest problem
- DNA
- Bioinformatics algorithm
- Breadth first search
Background
In 1970, Hamilton Smith discovered that long DNA molecules could be digested into a set of restriction fragments by the restriction enzyme HindII based on the occurrence of restriction sites with the sequences GTGCAC or GTTAAC [1]. Since that time, restriction enzymes have played a crucial role in many biological experiments, including genome editing [2], gene cloning [3], protein expression [4, 5] and genome mapping [6–11]. Restriction enzymes are commonly used to physically map genomes. In physical mapping, the restriction enzymes are used to cut a DNA molecule at restriction sites with the goal of identifying the locations of the restriction sites after digestion. Their positions in the genome are determined by analyzing the lengths of the digested DNA. Based on the experimental assumptions of digestion, there are two main types of digestions, a partial digest [9] and a double digest [10]. Constructing an accurate physical map following a partial digestion is a fundamental problem in genome analysis. In this work, we consider the partial digestion.
In a partial digestion experiment, one restriction enzyme is used to cut one or more target DNA molecules at several specific restriction sites. The digestion results in a collection of short DNA fragments, and the lengths of these fragments are recorded in a multiset A. Reconstructing the locations of the restriction sites in the target DNA molecules from the multiset A is known as the Partial Digest Problem (PDP). Some modifications have been introduced into the partial digestion process to produce simplified variants of the PDP. These variants include the simplified partial digest problem, SPDP [12], the labeled partial digest problem, LPDP [13], and the probed partial digest problem, PPDP [14]. In this work, we consider the PDP.
Several algorithms [15–20] have been developed to solve the PDP. Some of these algorithms have short running times, but the solutions are not exact. These algorithms [15–17] are based on heuristic and approximation strategies. Other algorithms require a long running time in the worst case, but the solution is exact for any instance. These algorithms [18–20] are based on brute force or branch and bound strategies.
In this research paper, we describe an algorithm with a suitable running time that generates an exact solution for the PDP. The previous algorithms that yield an exact PDP solution can be divided into impractical and practical types. The impractical solutions are based on the brute force strategy and polynomial factorization [19]. The best known practical algorithm for the PDP is the algorithm proposed by Skiena et al. [20], which is based on the branch and bound strategy. The algorithm is practical because its running time is O(n^2 log n) in the average case, although exponential time is required in the worst case.
Problem formulation and related definitions
Before we give the formal definition of the PDP, we need the following related definitions.
Definition 1 [21]: The difference of two multisets D and L, denoted by D\L, is the multiset D\L = {x | x ∈ D and C(D, x) − C(L, x) > 0} with C(D\L, x) = C(D, x) − C(L, x), where C(D, x) denotes the number of occurrences of the element x in D.
Definition 2 [21]: The sum (or disjoint union) of two multisets D and L, denoted by D ∪_{+} L, is the multiset D ∪_{+} L = {x | x ∈ D or x ∈ L} with C(D ∪_{+} L, x) = C(D, x) + C(L, x).
Definition 3 [22]: The differences of an element y and a set X, denoted by Δ(y, X), is the multiset Δ(y, X) = {|y − x_i| : x_i ∈ X}.
Definition 4 [22]: The differences of a set X, denoted by ΔX, is the multiset ΔX = {|x_j − x_i| : 0 ≤ i < j ≤ n − 1}.
Remark 1: We can write ΔX in another form: ΔX = ∪_{+, i = 0}^{n − 1} Δ(x_i, X\{x_i}), where X = {x_0, x_1, …, x_{n − 1}}.
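Definitions 1 and 2 map directly onto counter arithmetic. A minimal Python sketch (the function names are mine, not from the paper):

```python
from collections import Counter

def multiset_diff(D, L):
    # Definition 1: x survives with C(D, x) - C(L, x) copies when that count is positive.
    return Counter(D) - Counter(L)  # Counter subtraction drops non-positive counts

def multiset_sum(D, L):
    # Definition 2: disjoint union, occurrence counts add.
    return Counter(D) + Counter(L)

D = [1, 1, 2, 3]
L = [1, 2, 2]
# D \ L keeps one 1 and one 3; D ∪+ L has counts 1:3, 2:3, 3:1.
```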
The Partial Digest Problem, PDP [11]: Given a multiset of N = n(n − 1)/2 positive integers D = {d_0, d_1, d_2, …, d_{N − 1}}, is there a set of n integers X = {x_0, x_1, x_2, …, x_{n − 1}} such that ΔX = D?
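For concreteness, ΔX and the PDP condition ΔX = D can be computed directly from Definition 4 (a sketch; the variable names are mine):

```python
from collections import Counter
from itertools import combinations

def delta(X):
    # Definition 4: multiset of pairwise absolute differences of X.
    return Counter(abs(a - b) for a, b in combinations(sorted(X), 2))

def is_pdp_solution(X, D):
    # X solves the PDP instance D exactly when ΔX = D as multisets.
    return delta(X) == Counter(D)

X = [0, 2, 4, 7, 10]             # n = 5 restriction sites
D = sorted(delta(X).elements())  # N = n(n - 1)/2 = 10 distances
```

Note that solutions are generally not unique: the mirror image {width − x : x ∈ X} always has the same difference multiset.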
We also need two propositions. The first proposition gives another formula for the difference of three multisets; this formula will be used to prove the correctness of our proposed method. The second proposition illustrates how to construct a worst-case example that forces Skiena's algorithm [23] to take exponential time.
Proposition 1: Let D, L and Z be three multisets, then (D\L)\Z = D\(L ∪_{+} Z).
Proposition 2 [23]: Let 0 < ε < 1/(12n), A_1 = {1 − nε, …, 1 − 2ε, 1 − ε}, A_2 = {ε, 2ε, …, nε}, A_3 = {(n + 1)ε, (n + 2)ε, …, 2nε}, A_4 = {(2n + 1)ε, (2n + 2)ε, …, 3nε}, A_5 = {1 − 3nε, …, 1 − (2n + 2)ε, 1 − (2n + 1)ε} and D = F ∪ G, where F and G are disjoint sets satisfying F ∪ G* = A_3 and G* = {1 − g | ∀g ∈ G}. Let A = A_1 ∪ A_2 ∪ A_4 ∪ A_5 ∪ D ∪ {0, 1}. We can choose D such that giving ΔA to Skiena's algorithm will take at least Ω(2^{n − 1}) time to find A.
Methods
In this section, we present three algorithms. The first one is the best previous practical algorithm, while the other two algorithms are the proposed algorithms.
Best previous practical algorithm
The main goal of the PDP is to reconstruct the elements x_i from a multiset D of N = n(n − 1)/2 integers by finding a set X such that ΔX = D. The best known algorithm for solving the PDP is based on the branch and bound strategy. The main idea of this algorithm is to construct the set X incrementally. The algorithm is based on depth-first search with two bounding conditions. We refer to this method as Algorithm BBd (branch and bound based on depth).
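A compact sketch of this depth-first scheme (not the authors' code; at each step the largest remaining distance y is tried as a site at position y or at width − y):

```python
from collections import Counter

def bbd(distances):
    """Depth-first branch and bound for the PDP (BBd-style sketch)."""
    D = Counter(distances)
    width = max(D)
    D -= Counter([width])            # place 0 and width first
    solutions = []

    def place(D, X):
        if not D:                    # all distances explained: X is a solution
            solutions.append(sorted(X))
            return
        y = max(D)
        for cand in {y, width - y}:  # the two branches of the search tree
            d = Counter(abs(cand - x) for x in X)
            if all(D[v] >= c for v, c in d.items()):  # bounding condition
                place(D - d, X + [cand])

    place(D, [0, width])
    return solutions
```

In the worst case, identical subproblems are re-explored independently, which is the repeated work that the proposed algorithms target.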
Observation
- i.
Each node in the solution tree for the PDP represents a pair of sets (D_k^i, X_k^i), where the index k represents the level number and the index i represents the node number within level k.
- ii.
The number t inside the circle at the top right of each node represents the t-th call of the procedure Place.
- iii.
The symbol “⊥” is used when the current node does not generate any new elements.
- 1.
Add an element y to X_k^j to generate X_{k + 1}^i = X_k^j ∪ {y}.
- 2.
Remove the elements of Δ(y, X_k^j) from D_k^j to generate D_{k + 1}^i = D_k^j \ Δ(y, X_k^j).
- 1.
There are two identical subproblems (D_2^0, X_2^0) and (D_2^1, X_2^1) such that D_2^0 = D_2^1 = {1, 3, 5, 7} and X_2^0 = X_2^1 = {0, 2, 8, 10}.
- 2.
The two identical subproblems lead to the same two solutions, {0, 2, 7, 8, 10} and {0, 2, 3, 8, 10}, so each solution is generated twice.
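These two solutions are homometric: they share the same difference multiset, which can be verified directly (a small sketch, not from the paper):

```python
from collections import Counter
from itertools import combinations

def delta(X):
    # Multiset of pairwise absolute differences (Definition 4).
    return Counter(abs(a - b) for a, b in combinations(sorted(X), 2))

A = [0, 2, 7, 8, 10]
B = [0, 2, 3, 8, 10]
# Both sets yield the multiset {1, 2, 2, 3, 5, 6, 7, 8, 8, 10};
# B is in fact the mirror image {10 - a : a in A}.
```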
First proposed method
- 1.
Build the solution tree for the PDP using the breadth-first strategy, level by level.
- 2.
Before creating the nodes of the new level, we remove all repeated nodes in the current level, so that each node appears only once in that level.
Algorithm BBb (branch and bound based on breadth) shows the steps of our proposed method to solve the PDP. The input of the algorithm is the multiset D, which consists of n(n − 1)/2 elements. Initially, the algorithm starts with two sets and two lists. The two sets are X_0^0 = {0, width} and D_0^0 = D \ {width}. The two lists are L_X = {X_0^0} and L_D = {D_0^0}. In general, L_D and L_X represent the lists of sets D_k^j and X_k^j, respectively, at the current level k of the solution tree. The main step of the proposed algorithm is a while loop that iterates over the levels of the solution tree for the PDP. In each iteration k ≥ 0 of the while loop, we generate the elements of the next level, k + 1, by calling the procedure GenerateNextLevel(L_D, L_X, S). The inputs of the procedure are the three lists of sets L_D, L_X and S for the current level k; the outputs are the three lists of sets L_D, L_X and S for level k + 1.
- (i)
D_{k + 1}^j = D_k^i \ Δ(y, X_k^i) and X_{k + 1}^j = X_k^i ∪ {y}, if the condition Δ(y, X_k^i) ⊆ D_k^i is true.
- (ii)
D_{k + 1}^l = D_k^i \ Δ(width − y, X_k^i) and X_{k + 1}^l = X_k^i ∪ {width − y}, if the condition Δ(width − y, X_k^i) ⊆ D_k^i is true.
The two sets X_{k + 1}^j and D_{k + 1}^j (and, in a similar way, X_{k + 1}^l and D_{k + 1}^l) are added to the auxiliary lists AL_X and AL_D, respectively, only if the set X_{k + 1}^j does not already exist in the list AL_X.
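The level-by-level expansion with duplicate removal can be sketched as follows (a simplification of BBb under my own data layout: each level is a dictionary keyed by the sorted tuple X, which suffices to identify a subproblem):

```python
from collections import Counter

def bbb(distances):
    """Breadth-first branch and bound with repeated-subproblem deletion (BBb sketch)."""
    D0 = Counter(distances)
    width = max(D0)
    D0 -= Counter([width])
    level = {(0, width): D0}   # X (sorted tuple) -> remaining multiset D_k
    solutions = []
    while level:
        nxt = {}
        for X, Dk in level.items():
            if not Dk:                   # nothing left to explain: X is a solution
                solutions.append(list(X))
                continue
            y = max(Dk)
            for cand in {y, width - y}:  # the two branches
                d = Counter(abs(cand - x) for x in X)
                if all(Dk[v] >= c for v, c in d.items()):
                    key = tuple(sorted(X + (cand,)))
                    if key not in nxt:   # delete repeated nodes within the level
                        nxt[key] = Dk - d
        level = nxt
    return solutions
```

Keying the level by X alone is what makes the duplicate check cheap: identical X implies an identical remaining multiset, so the D_k side never needs to be compared.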
Now, we investigate how to test the equality of two nodes (D_k^i, X_k^i) and (D_k^j, X_k^j) in the solution tree for the PDP. The test can be performed by comparing D_k^i with D_k^j and X_k^i with X_k^j. The following theorems and corollary prove that the two nodes (D_k^i, X_k^i) and (D_k^j, X_k^j) are equal if the set X_k^i is equal to the set X_k^j.
Theorem 1: Given a subproblem (D_k^i, X_k^i) in the solution tree for the PDP, the relation D_k^i = D \ ΔX_k^i is valid, where k ≥ 0, D is the input of the PDP, D_k^i is a modified version of D produced by removing some of its elements, and X_k^i contains k + 2 elements of the candidate solution.
Proof : We will prove the theorem using mathematical induction.
First, we prove that the relation is true at k = 0.
The algorithm starts by finding the maximum element of the set D, width, and updates the sets D and X. So, X_0^0 = {0, width} and D_0^0 = D \ {width}. From the definition of Δ, we have ΔX_0^0 = {width}. Therefore, D_0^0 = D \ ΔX_0^0.
Second, we assume that the relation is true at k (i.e., D_k^i = D \ ΔX_k^i).
Third, we prove that the relation is true at k + 1.
From the algorithm, the set D_{k + 1}^j can be constructed from a node at level k, say (D_k^i, X_k^i). Therefore, D_{k + 1}^j = D_k^i \ Δ(y, X_k^i) and X_{k + 1}^j = X_k^i ∪ {y}. This implies that D_{k + 1}^j = (D \ ΔX_k^i) \ Δ(y, X_k^i), because D_k^i = D \ ΔX_k^i. Therefore, D_{k + 1}^j = D \ (ΔX_k^i ∪_{+} Δ(y, X_k^i)) (from Proposition 1). From Definition 4 and because X_{k + 1}^j = {y} ∪ X_k^i, we get D_{k + 1}^j = D \ ΔX_{k + 1}^j.
Theorem 2: If there are two subproblems (D_k^i, X_k^i) and (D_k^j, X_k^j) such that X_k^i = X_k^j, then D_k^i = D_k^j.
Proof: From Theorem 1, the equations D_k^i = D \ ΔX_k^i and D_k^j = D \ ΔX_k^j are valid for the subproblems (D_k^i, X_k^i) and (D_k^j, X_k^j), respectively. Because X_k^i = X_k^j, it follows that D_k^i = D \ ΔX_k^i = D \ ΔX_k^j = D_k^j.
Corollary 1: If there are two subproblems (D_k^i, X_k^i) and (D_k^j, X_k^j) such that X_k^i = X_k^j, then (D_k^i, X_k^i) = (D_k^j, X_k^j).
Final proposed algorithm
- 1.
Build the solution tree for the PDP down to a specified level α using the BBb algorithm.
- 2.
For each node at level α, build the remaining subtree individually with the BBb algorithm.
Let α_M be the number of levels that leads to the minimum memory required by BBb2. The value of α_M can be computed by determining the number of levels α that minimizes the memory consumption M(α), for 0 ≤ α ≤ n − 1. The Find_α_M procedure computes the value of α_M for an instance of size n in O(n) time. The input of the procedure is N, the size of the multiset D, and the output is α_M. From the size of the multiset D, we can compute the value of n by solving the quadratic equation n(n − 1)/2 = N. The pseudocode of the Find_α_M procedure is as follows:
Procedure Find_α_M (N, α_M)
- 1. n = (1 + √(1 + 8N))/2
- 2. α = α_M = 1
- 3. M_min = n^2 (2^α + 2^{n − α})
- 4. for α = 2 to n do
- 5. M = n^2 (2^α + 2^{n − α})
- 6. if M < M_min then
- 7. M_min = M
- 8. α_M = α
- 9. end if
- 10. end for
End
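A direct Python transcription of Find_α_M (a sketch; the memory model M(α) = n^2(2^α + 2^{n − α}) is taken from the pseudocode above):

```python
import math

def find_alpha_m(N):
    # Solve n(n - 1)/2 = N; since 1 + 8N = (2n - 1)^2, the square root is exact.
    n = (1 + math.isqrt(1 + 8 * N)) // 2
    alpha_m, m_min = 1, n * n * (2**1 + 2**(n - 1))
    for alpha in range(2, n + 1):
        m = n * n * (2**alpha + 2**(n - alpha))
        if m < m_min:
            m_min, alpha_m = m, alpha
    return alpha_m
```

Since 2^α + 2^{n − α} is minimized around α = n/2, the scan settles on ⌊n/2⌋; the O(n) loop mirrors the procedure above.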
In Algorithm BBb2, we start by finding the value of α_M, and then we build the solution tree until level α_M is reached using the breadth-first strategy and deleting all repeated elements. At level α_M, we have at most 2^{α_M} elements. After obtaining these elements, we consider each node as a root and expand it using the BBb algorithm. In the worst case, the number of elements handled at levels α_M + 1, α_M + 2, …, n is 2, 2^2, …, 2^{n − α_M}, respectively, while the number of elements handled during execution of the BBb algorithm at levels α_M + 1, α_M + 2, …, n is 2^{α_M + 1}, 2^{α_M + 2}, …, 2^n, respectively. Therefore, BBb2 reduces the memory required by the BBb algorithm.
Test methodology
In this section, we present the methodology that is used to evaluate the performance of the algorithms, BBd, BBb and BBb2, according to their running times and memory consumptions.
Platform Specification
We implemented the algorithms on a Dual Octa-core processor machine (Intel Xeon E5-2690) with 128 GB RAM. Each processor has a 2.9 GHz speed with 20 MB of cache. The algorithms were implemented in C++ programs. The programs were compiled using g++ with the -O3 flag under 64-bit Red Hat Enterprise Linux 4.4. In the experiments, we used a single core only.
Simulated data
We studied the performance of the three algorithms, BBd, BBb and BBb2, on different types and sizes of data sets. We used two types of data described in previous studies: a random data set (RD) [16] and the Zhang data (ZD) [16, 17, 23]. The RD was used to measure the performance of the algorithms in the average case, while the ZD was used to measure their performance in the worst case. In terms of data set size, two factors affect the performance of the algorithms: the number of elements in the set X, n, and the upper bound on the values of the n elements, M (the maximum element of X).
In the case of the RD, we assumed that there were n restriction sites in the DNA segments distributed randomly. Each input instance of the simulated data is a multiset D = ΔX such that the set X contains n positive numbers randomly selected and each number is less than or equal to M. In the ZD, the locations of the restriction sites were selected randomly according to Proposition 2 [23].
The values of n for the RD are 100, 200, 300, …, 1000, while the values of n for the ZD are 15, 20, 25, 30, …, 90. The range of n for the ZD is small because the running time of the algorithms exceeded 1 day when n > 90. We also used different values of M as follows: M = n·q, where q = 10, 100, 1000 and 10,000.
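An RD-style instance can be generated as follows (a sketch; the paper's exact sampling scheme may differ, and the `seed` parameter is mine, added for reproducibility):

```python
import random
from itertools import combinations

def random_instance(n, q, seed=0):
    # n random restriction sites with values at most M = n*q (distinct sites assumed).
    rng = random.Random(seed)
    M = n * q
    X = sorted(rng.sample(range(M + 1), n))
    D = sorted(abs(a - b) for a, b in combinations(X, 2))
    return X, D

X, D = random_instance(100, 10)   # n = 100, M = 1000, |D| = 4950
```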
Running time and memory
For each value of n, we ran each algorithm 50 times with different inputs for the RD. For the ZD, we reduced this number to 20 due to the increased running time of the algorithms, especially BBd. The running time of each algorithm for a fixed n is the average over all instances studied. If the running time of an algorithm exceeded 24 h for an input instance, the algorithm was terminated, and its result for that instance was omitted from the figures and the results.
We also measured the standard error of the mean (SEM) and the coefficient of variation (CV) for the three algorithms on the RD and ZD. Finally, we used a non-parametric statistical test, the Wilcoxon signed-rank test [24], to determine whether there are significant differences between the three algorithms in running time or memory consumption on the RD and ZD. We applied the test, at a significance level of 0.05, to the following pairs of algorithms: BBd & BBb, BBd & BBb2, and BBb & BBb2, with respect to both running time and memory consumption.
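The paired signed-rank statistic underlying this test is simple to state; a minimal sketch (zero differences dropped, tied |d| given average ranks; in practice one would use a statistics library such as scipy.stats.wilcoxon to also obtain the p-value):

```python
def wilcoxon_w(x, y):
    # Paired Wilcoxon signed-rank statistic: W = min(W+, W-).
    d = [b - a for a, b in zip(x, y) if b != a]   # drop zero differences
    ordered = sorted(d, key=abs)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        rank_of[abs(ordered[i])] = (i + 1 + j) / 2  # average rank for ties
        i = j
    w_plus = sum(rank_of[abs(v)] for v in d if v > 0)
    w_minus = sum(rank_of[abs(v)] for v in d if v < 0)
    return min(w_plus, w_minus)
```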
We measured the running time in seconds using a C++ function. We also measured the memory consumed by the algorithm in megabytes using the Linux command top.
Results and discussion
In the case of the RD, the results showed that the running times of the proposed algorithms, BBb and BBb2, were less than the running time of BBd for all values of n and M as shown in Fig. 4. For large values of n and small values of q, the BBd algorithm required a large amount of time to find a solution, while the proposed algorithms, BBb and BBb2, found the solution in a suitable time. For example, in the case of n ≥ 300 and M = n * 10, the running time for the BBd algorithm was greater than 24 h, while both proposed algorithms found the solution in time less than 13 min, as shown in Fig. 4a and b. However, the difference in running times between the BBd algorithm and the proposed algorithms, BBb and BBb2, decreased with increasing values of M. This behavior is attributed to the probability of repeated subproblems decreasing with increasing values of M. Therefore, both algorithms, BBb and BBb2, spend a large amount of time checking for the repetition of elements with a low probability of repetitions. Additionally, for small M values the probability of repeated subproblems is high, thereby increasing the running time of the BBd algorithm. If we fixed the value of n, the running time for all algorithms decreased with increasing M values as shown in (Additional file 1: Figures F1 a-c). Both proposed algorithms behaved similarly with increasing M values. We also observed that the difference in the running times for two successive values of M is relatively small for both proposed algorithms (especially when M is large) as shown in (Additional file 1: Figures F1 b and c). On the other hand, the difference in the running times for the BBd algorithm for two successive values of M is large when M = n * 100 and M = n * 1000 as shown in (Additional file 1: Figure F1a). Finally, we found that there was little difference, increasing or decreasing, between the running times of BBb and BBb2 for all values of n and M. 
In general, for large values of n and M, the BBb2 algorithm was faster than the BBb algorithm. For example, for n ≥ 700 and M = n·q with q = 1000 and 10,000, BBb2 was slightly better than BBb.
For the ZD, there was no change in the running time when we used different values of M for the three algorithms. Therefore, we used the ZD with the M value fixed at M = n*1000. The consistent performance of the algorithms using the ZD with different values of M was due to the systematic selection of the elements of A according to Proposition 2.
The performance of the three algorithms on the Zhang data set instances is given in Fig. 5. We observed that the running time of BBb2 was less than that of BBb, and that of BBb was less than that of BBd, for all instances. The percentage of running time improvement of BBb and BBb2 with respect to BBd increased with increasing n. In the cases studied, the running time was improved by at least 75%. Moreover, the BBd and BBb algorithms cannot solve instances with n ≥ 60 and n ≥ 80, respectively, in less than 24 h. On the other hand, the BBb2 algorithm solves any instance with n ≤ 90 in less than 24 h.
The improved running time of the BBb2 algorithm can be attributed to its reduction of the number of subproblems held at levels α_M + 1, α_M + 2, …, n of the solution tree for the PDP. Therefore, the time spent checking for subproblem repetition is reduced.
The Wilcoxon signed-rank test showed significant differences in running time between the following pairs:
- i.
BBd and BBb, for both the RD and ZD.
- ii.
BBd and BBb2, for both the RD and ZD.
- iii.
BBb and BBb2, for the ZD.
For the ZD, BBb2 consumed less memory than BBb for all instances; thus, the BBb2 algorithm significantly reduces the memory required by the BBb algorithm. The percentage of improvement in memory consumption of BBb2 compared to BBb increases as n increases. In general, the memory required by the BBd algorithm is less than that required by the BBb and BBb2 algorithms.
The Wilcoxon signed-rank test showed significant differences in memory consumption between the following pairs:
- i.
BBd and BBb, on the ZD.
- ii.
BBd and BBb2, on the ZD.
- iii.
BBb and BBb2, on the ZD.
Real data
We tested the final proposed BBb2 algorithm on real digestion experiments and simulated digestion experiments.
Luciferase gene
We extracted the data from a partial digestion of the luciferase gene [25], which has a length of 2009. The partial digestion was performed with the enzyme TaqI, which cuts the gene at the TCGA sequence motif. The output of the partial digestion process is the multiset D consisting of the distances between the TCGA locations on the luciferase gene: D = {9, 30, 100, 170, 293, 302, 393, 402, 462, 562, 632, 732, 855, 864, 945, 954, 975, 984, 1025, 1034, 1247, 1277, 1347, 1377, 1809, 1839, 1979, 2009}. Our proposed algorithm takes the multiset D as input and outputs two solutions in the set S = {{0, 170, 632, 732, 1025, 1034, 1979, 2009}, {0, 30, 975, 984, 1277, 1377, 1839, 2009}}. The solution X = {0, 30, 975, 984, 1277, 1377, 1839, 2009} corresponds to the real data, meaning that the TCGA motif occurs in the luciferase gene at positions 30, 975, 984, 1277, 1377 and 1839.
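Both reported solutions can be checked against D directly (a verification sketch; the data are copied from the paragraph above):

```python
from collections import Counter
from itertools import combinations

def delta(X):
    # Multiset of pairwise absolute differences.
    return Counter(abs(a - b) for a, b in combinations(X, 2))

D = [9, 30, 100, 170, 293, 302, 393, 402, 462, 562, 632, 732, 855, 864,
     945, 954, 975, 984, 1025, 1034, 1247, 1277, 1347, 1377, 1809, 1839,
     1979, 2009]
S1 = [0, 170, 632, 732, 1025, 1034, 1979, 2009]
S2 = [0, 30, 975, 984, 1277, 1377, 1839, 2009]   # matches the real map

assert delta(S1) == delta(S2) == Counter(D)
```

S1 is the mirror image {2009 − x : x ∈ S2}, so the two solutions are homometric by construction.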
E. coli K12 genome digestion simulation
Performance of the BBb2 algorithm for the restriction enzymes used in this study

Restriction enzyme | Restriction site | N | n | Time (s) | Memory (MB)
---|---|---|---|---|---
NotI | GCGGCCGC | 780 | 40 | 0.007 | 0.2
XbaI | TCTAGA | 1891 | 62 | 0.017 | 3.6
SmaI | CCCGGG | 103,285 | 455 | 56.2 | 8.9
BamHI | GGATCC | 135,460 | 521 | 72 | 11.7
HindIII | AAGCTT | 170,820 | 585 | 138.6 | 13.6
EcoRI | GAATTC | 228,826 | 677 | 360 | 15.5
PvuII | CAGCTG | 1,599,366 | 1789 | 24,012 | 98.8
EcoRV | GATATC | 1,999,000 | 2000 | 42,840 | 979
Conclusions
In this paper, we addressed the PDP and proposed two algorithms, BBb and BBb2. In the BBb algorithm, we built the solution tree for the PDP in a breadth-first manner instead of a depth-first manner, applying two pruning conditions and deleting all repeated subproblems within each level. BBb solves, in less than 24 h, many instances that BBd cannot. The main disadvantage of BBb is that its memory consumption grows exponentially. In BBb2, we reduced the memory required by BBb by solving the problem in two stages, each working in a breadth-first manner. We also determined the number of levels that reduces memory consumption without increasing the running time.
We measured the efficiency of the proposed algorithms compared to the best known practical algorithm on the basis of time and memory consumption. In the evaluation, we considered the following parameters: (1) the type of data, RD or ZD; (2) the value of n; and (3) the value of M. In terms of running time, the BBb2 algorithm is faster than the other algorithms, and its advantage increases on the ZD. In terms of memory, the BBd algorithm consumes less memory than the other algorithms, but its running time is very slow, especially on the ZD. Finally, we applied the BBb2 algorithm to the luciferase gene and the E. coli K12 genome.
Declarations
Acknowledgements
The experimental study of this work was carried out while the first author was at the KINDI Center for Computing Research, College of Engineering, Qatar University, Doha, Qatar.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 17 Supplement 19, 2016. 15th International Conference On Bioinformatics (INCOB 2016): bioinformatics. The full contents of the supplement are available online at https://www.bmcbioinformatics.biomedcentral.com/articles/supplements/volume-17-supplement-19.
Funding
Publication charges for this article have been funded by Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University, Qatar Foundation.
Availability of data and materials
Data used in the article are publicly available. The software is available at: https://www.github.com/mostafa-abbas/PDP.
Authors’ contributions
Both authors contributed to theoretical and practical study equally. Both authors wrote and approved the manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
All the data sets used in this study were obtained from open-access databases and were previously published by other authors. No human or animal data were collected; therefore, no informed consent was needed for this study.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
- Pevzner P. DNA physical mapping and alternating Eulerian cycles in colored graphs. Algorithmica. 1995;13(1–2):77–105.
- Baker M. Gene-editing nucleases. Nat Methods. 2012;9(1):23–6.
- Sambrook J, Fritsch EF, Maniatis T. Molecular cloning: a laboratory manual. 2nd ed. New York: Cold Spring Harbor Laboratory Press; 1989.
- Liu Z, Ping-Chang Y. Construction of pET-32 α (+) vector for protein expression and purification. N Am J Med Sci. 2012;4(12):651–5.
- He X, Hull V, Thomas JA, Fu X, Gidwani S, Gupta YK, Black LW, Xu SY. Expression and purification of a single-chain type IV restriction enzyme Eco94GmrSD and determination of its substrate preference. Sci Rep. 2015;5:9747.
- Narayanan P. Bioinformatics: a primer. New Age International; 2005. ISBN 10: 8122416101, ISBN 13: 9788122416107.
- Kalavacharla V, Hossain K, Riera-Lizarazu O, Gu Y, Maan SS, Kianian SF. Radiation hybrid mapping in crop plants. Adv Agron. 2009;102:201–22.
- Dear PH. Genome mapping. In: eLS. John Wiley & Sons; 2001. doi:10.1038/npg.els.0001467.
- Błażewicz J, Formanowicz P, Kasprzak M, Jaroszewski M, Markiewicz WT. Construction of DNA restriction maps based on a simplified experiment. Bioinformatics. 2001;17(5):398–404.
- Paliswiat B, Pryputniewicz P. On the complexity of the double digest problem. Control Cybern. 2004;33(1):133–40.
- Cieliebak M, Eidenbenz S, Penna P. Noisy data make the partial digest problem NP-hard. Lect Notes Comput Sci. 2003;2812:111–23.
- Blazewicz J, Burke EK, Kasprzak M, Kovalev A, Kovalyov MY. Simplified partial digest problem: enumerative and dynamic programming algorithms. IEEE/ACM Trans Comput Biol Bioinform. 2007;4(4):668–80.
- Pandurangan G, Ramesh H. The restriction mapping problem revisited. J Comput Syst Sci. 2002;65(3):526–44.
- Karp RM, Newberg LA. An algorithm for analysing probed partial digestion experiments. Comput Appl Biosci. 1995;11(3):229–35.
- Dakic T. On the turnpike problem. PhD thesis, Simon Fraser University; 2000. ISBN 0-612-61635-5.
- Nadimi R, Fathabadi HS, Ganjtabesh M. A fast algorithm for the partial digest problem. Jpn J Ind Appl Math. 2011;28:315–25.
- Ahrabian H, Ganjtabesh M, Nowzari-Dalini A, Razaghi-Moghadam-Kashani Z. Genetic algorithm solution for partial digest problem. Int J Bioinform Res Appl. 2013;9(6):584–94.
- Rosenblatt J, Seymour PD. The structure of homometric sets. SIAM J Algebraic Discrete Methods. 1982;3(3):343–50.
- Lemke P, Werman M. On the complexity of inverting the autocorrelation function of a finite integer sequence, and the problem of locating n points on a line, given the (n choose 2) unlabelled distances between them. Technical report #453. Minneapolis: IMA; 1988.
- Skiena SS, Smith WD, Lemke P. Reconstructing sets from interpoint distances. In: SCG '90: Proceedings of the sixth annual symposium on computational geometry. 1990. p. 332–9.
- Syropoulos A. Mathematics of multisets. In: WMP '00: Proceedings of the workshop on multiset processing. 2000. p. 347–58.
- Jones NC, Pevzner P. An introduction to bioinformatics algorithms. Chapter 4, p. 83–123. MIT Press; 2004.
- Zhang Z. An exponential example for a partial digest mapping algorithm. J Comput Biol. 1994;1(3):235–9.
- Woolson RF. Wilcoxon signed-rank test. In: Wiley encyclopedia of clinical trials. 2008. doi:10.1002/9780471462422.eoct979.
- Devine JH, Kutuzova GD, Green VA, Ugarova NN, Baldwin TO. Luciferase from the east European firefly Luciola mingrelica: cloning and nucleotide sequence of the cDNA, overexpression in Escherichia coli and purification of the enzyme. Biochim Biophys Acta. 1993;1173(2):121–32.