DDmap: a MATLAB package for the double digest problem using multiple genetic operators

Wang, Licheng; Suo, Jingwen; Pan, Yun; Li, Lixiang

doi:10.1186/s12859-019-2862-x

Research article
Open access
Published: 18 June 2019

DDmap: a MATLAB package for the double digest problem using multiple genetic operators

Licheng Wang ORCID: orcid.org/0000-0001-8418-1897¹,
Jingwen Suo¹,
Yun Pan² &
…
Lixiang Li¹

BMC Bioinformatics volume 20, Article number: 348 (2019) Cite this article

1559 Accesses
2 Citations
9 Altmetric
Metrics details

Abstract

Background

In computational biology, the physical mapping of DNA is a key problem. We know that the double digest problem (DDP) is NP-complete. Many algorithms have been proposed for solving the DDP, although it is still far from being resolved.

Results

We present DDmap, an open-source MATLAB package for solving the DDP, based on a newly designed genetic algorithm that combines six genetic operators in searching for optimal solutions. We test the performance of DDmap by using a typical DDP dataset, and we depict exact solutions to these DDP instances in an explicit manner. In addition, we propose an approximate method for solving some hard DDP scenarios via a scaling-rounding-adjusting process.

Conclusions

For typical DDP test instances, DDmap finds exact solutions within approximately 1 s. Based on our simulations on 1000 random DDP instances by using DDmap, we find that the maximum length of the combining fragments has observable effects towards genetic algorithms for solving the DDP problem. In addition, a Maple source code for illustrating DDP solutions as nested pie charts is also included.

Background

The physical mapping of DNA is a key problem in computational biology [5]. A large DNA molecule is a long string composed of four nucleotides, A, C, G and T. To understand the structure of DNA molecules, it is of interest to determine the occurrences of short substrings, such as GAATTC, on the DNA. Double digest experiments (DDE for short) are a standard approach for constructing physical DNA maps [2]. Given two restriction enzymes, denoted by $ \mathfrak{A} $ and $ \mathfrak{B} $, this approach cuts a target DNA sequence by using only enzyme $ \mathfrak{A} $, only enzyme $ \mathfrak{B} $, and both enzymes simultaneously, in three separate and parallel experiments [5]. As a result, we obtain three multisets of short DNA fragments. However, due to certain experimental limitations, only the length information (i.e., The number of nucleotides) of these short fragments can be measured with certain accuracy using certain mature biological techniques, such as gel electrophoresis. The objective of the double digest problem (DDP) is to reconstruct the original ordering of the fragments in the target DNA molecule.

Since the first successful reconstruction of restriction site mapping in the earlier 1970s [7, 11], the DDP problem has become an intensively studied issue that covers a variety of disciplines [6, 9]. Although the major concerns come from the community of bioinformation, the challenges related to this problem have also attracted attention from the artificial intelligence, algorithmic complexity, and optimization communities. We now know that DDP is strongly NP-complete [1, 2], and many algorithms have been proposed for solving the DDP problem [3,4,5,6, 8,9,10, 12,13,14,15]. However, the DDP problem is still far from being resolved. All of the algorithms developed to address this problem have encountered significant difficulties as the number of restriction sites increases. Moreover, even for different DDP instances with the same size, the hardness for finding an exact solution might vary remarkably.

The main motivation of this work comes from three considerations: First, almost all existing formulations of the DDP problem use multiset as the basic data structure, while we find that it is even easier to model the DDP problem by using vectors. Second, some recently proposed genetic algorithms [3, 13] for addressing the DDP problem should be improved. Third, it is of interest to develop an open-source package for studying the DDP problem by using easily accessible engineering computation platforms, such as MATLAB.

Our main contributions are summarized as follows:

A vector-based formulation of the DDP problem is presented and illustrated step-by-step.
A novel genetic algorithm for solving the DDP problem is proposed by combining six genetic operators, and a MATLAB package, DDmap, is implemented by integrating the proposed genetic algorithm and other necessary supporting and testing widgets. Then, by using DDmap, exact solutions for typical DDP test instances [13] are explicitly derived and depicted. (See the right column of Table 1.)
A relation between the hardness of certain DDP instances and the maximum length of double digest sequences is revealed based on our simulations of 1000 random DDP instances. Meanwhile, an approximate approach for typical hard DDP instances is conceived based on this relation.

Table 1 Main results: separated and integrated effects of all six genetic operators. Instance 1,3,4,5,7,8 come from [13], instance 2' is derived by using a scaling-rounding-adjusting process towards instance 2, m, n, k are the lengths of the input fragments A, B and C, respectively. There are six genetic operators, RWS is selection operator defined as the well-known roulette wheel algorithm. PCC and RSC are crossing operators, PCC is the combination of two permutations, RSC is Referencing Sorting Crossing, P4X, FLP, CSH are mutation operators, P4X is a four-point mutating, FLP defined as the flipping of the given fragment. CSH defined as the cyclic shifting of the given fragment. The average running time, average evolution generations and success rate are listed in the table. At the right of the table, we draw pie charts of DDmap’s two solutions.

Full size table

Results

To test the utility of DDmap, eight DDP instances, referred to as INS_j(j = 1⋯8), are taken from [13]. They are shown in the following Table 2:

Table 2 Test instances from [13]. Suppose giving two restriction enzymes, denoted by $ \mathfrak{A} $ and $ \mathfrak{B} $, $ \overrightarrow{a},\overrightarrow{b},\overrightarrow{c} $ are the multisets of short DNA fragments by cuts a target DNA sequence by using enzyme $ \mathfrak{A} $ only, enzyme $ \mathfrak{B} $ only, and both enzymes simultaneously.

Full size table

First, the integrated effects of the six aforementioned genetic operators of DDmap are verified. For the instances INS_{1, 3, 4, 5, 7, 8}, DDmap performs considerably well, and the related results are collected in Table 1. For each instance, 100 trails were run using DDmap with respect to each combination of six genetic operators. Then, the average running time, the average evolution generation and the success rate of finding exact DDP solutions are counted. Two different exact solutions for the instances INS_{1, 3, 4, 5, 7, 8} are also depicted in the right column of Table 1. In addition, the average running time and the average evolution generations of finding exact DDP solutions are depicted in Fig. 1. From Table 1 and Fig. 1. We can see that the genetic operators combination of RWS + PCC performs best in running time, RWS + ALL performs best in evolving generation, while other combinations of different genetic operators perform similarly and equally effective. Moreover, the tendency of running time curve and evolving generation curve are very similar.

However, we find that DDmap performs very poorly for INS₂ and INS₆. Upon further examination, we find that INS₆, is invalid, Simple calculation shows that as for INS₆, we have

$$ 45=\sum \left(\overrightarrow{a}\right)=\sum \left(\overrightarrow{b}\right)\ne \sum \left(\overrightarrow{c}\right)=19 $$

because it violates the restriction condition of (5) (See Definition 1).

For INS₂, we run DDmap 100 trails and successfully obtain exact solutions of INS₂ in 67 trails. But the average running time and evolution generations for reaching the exact solution of INS₂ are 122 s and 3828, respectively, i.e., approximately 1000 times slower than the results of other test instances (see Table 1). Furthermore, we find that these 67 solutions are essentially the same: One solution is depicted in Fig. 2(a), and another solution is just to read out the sequences A, B and C of Fig. 2(a) in an reverse order. It seems that the solution to INS₂ ’s solutions are very sparse, and thus, DDmap faces the difficulty of escaping from so many local optima.

We deal with the INS₂ by using the scaling-rounding-adjusting approach. As expected, DDmap can find solutions towards INS_2′ very efficiently. For each combination of six genetic operators, we run DDmap towards INS_2′ 100 trials. The average running time is no more than 2 s, the evolution generation is no more than 80, and the success rate for finding exact DDP solutions is always 100%. The results are already contained in Table 1 and Fig. 1. Now, we directly take some INS₂, ’s solution, (μ, ν) ∈ S_m × S_n, as an approximate solution of INS₂. The resulted double digest pie charts are depicted in Fig. 2(b). Compared to the exact solution given in Fig. 2(a), we think this kind of approximation is an interesting result in the sense that the relative error, defined as the proportion of total length of gaps between two miss-aligned fragments, is merely 4.8%, calculated by

$$ \frac{115+17+256+171+117+188+280+1120}{48502}=0.0487. $$

Next, via a number of simulations, we find that DDmap’s performance is tightly related to the maximum length of a piece in the sequence of C, denoted by ρ_C = max c_i. This is reasonable considering that for a fixed length of sequence C, denoted by L_C = |C|, the smaller ρ_C is, the denser the solutions, and thus, the easier for genetic algorithms, such as DDmap, to meet an exact solution during the evolution process. Based on our simulations towards 1000 random DDP instances with different ρ_C, the relationship between the success rate of finding exact DDP solutions with respect to ρ_C is depicted in Fig. 3.

Discussion

♦ Cases of k ≠ m + n − 1

Note that in both INS₄ and INS₅, the given two enzymes cut the target DNA molecule at some of the same sites and lead to the case where k ≠ m + n − 1. At the beginning, DDmap performs very poorly on INS₄ and INS₅. The performance of DDmap on INS₄ and INS₅ improves remarkably after we adopt the following simple preprocessing strategy:

• If k < m + n − 1, then introduce δ = (m + n − 1) − k fragments with length 0 into.

the sequence $ \overrightarrow{c} $;

• Otherwise, if k > m + n − 1, then introduce δ = k − (m + n − 1) fragments with length 0.

into the shorter sequence among $ \overrightarrow{a} $ and $ \overrightarrow{b} $;

• Otherwise, do nothing.

An interesting observation is that the newly introduced 0-length fragments will explicitly appear in the pie charts of exact DDP solutions. For instance, Fig. 4(a) shows that a 0-length fragment in sequence $ \overrightarrow{c} $ of INS₄ appears at the fifteenth site, while Fig. 4(b) shows that two 0-length fragments in sequence $ \overrightarrow{c} $ of INS₅ appear at the sixth and eighth sites, respectively.

Here, we follow the convention of reading a pie chart from 0^° to 180^° or 360^°.

♦ Comparison

Figure 5(a) and (b) are the comparison of the average running time between DDmap and the algorithm in 2005 [13] and 2012 [3]. Operator 1–5 are the crossover and mutation operator in DDmap. Because the crossover operator in [13] is the same as our operator 2 and the two mutation operators in [3] are similar to our operators op4 and op5, so we only implement the mutation operator op6 in [13] and crossover operator op7 in [3]. Eight instances are from the paper [13]. In the comparison experiment, each instance is run 100 times for operators op1–7 respectively, and then we got the average running time and the success rate of finding the exact DDP solution.

Through the experimental data, we found the data of op6 is much larger than that of the other six operators, the data of the other six operators will be neglected in the rectangular coordinate system, so we choose the logarithmic coordinate system. Figure 5(a) is the comparison between DDmap and the algorithm in 2005 [13], the blue line is the average running time of op6, it is higher than the other six lines, our algorithm has a significant time advantage over the [3]‘s algorithm. As can be seen from Fig. 5(b), the six lines have little difference, however, the op7’s line is always at the top, so our algorithm has a slight advantage over that of [3].

The comparison of success rate is shown in Fig. 5(c). The success rate of operators 1, 2, 3, 4, 5, 7 is 100%, they are all effective for these instances. Operator 6 runs very irregularly and the results are not very good.

Instance 2 and 6 does not appear in Fig. 5. In fact, INS₆ is invalid. As aforementioned, INS₂ is very complex, so we analyze it separately. To reset the maximum evolution generation as large as 100,000, running each operator 10 times towards INS₂, the average running time and the success rate is shown in Fig. 6(a) and (b), respectively. We can see that the running time of op6 is about 10 times longer than other operators, while the running time of op7 is about twice longer than our operators op1–5. The success rates of our five operators are all 100%, however, op7’s success rate is 90%, but op6 does not produce the exact DDP solution.

In conclusion, DDmap is much better than the algorithm in [13] and it is slightly better than [3]’s algorithm.

Conclusions

An open-source MATLAB package DDmap based on a newly designed genetic algorithm that combines six genetic operators is designed for solving the double digest problem. This algorithm finds exact solutions within approximately 1 s for typical DDP test instances. For some hard DDP instances, DDmap performs very well via a scaling-rounding-adjusting process. The experimental results of our algorithm confirm its efficiency.

Methods

Problem formulation

Let S_m denote the symmetric group on m indices {1, 2, ⋯, m}. Then, for a given permutation π ∈ S_m and a given vector $ \overrightarrow{a}=\left({a}_1,\cdots, {a}_m\right) $, the action of π on $ \overrightarrow{\mathrm{a}} $ derives a vector $ {\overrightarrow{a}}^{\pi }=\left({a}_{\pi (i)},\cdots, {a}_{\pi (m)}\right) $, reassembling of the order of entries of $ \overrightarrow{\mathrm{a}} $ according to π. Further, let us define the accumulative sum vector of $ \overrightarrow{\mathrm{a}} $, denoted by $ \mathrm{AS}\left(\overrightarrow{\mathrm{a}}\right) $, and the step difference vector of $ \overrightarrow{\mathrm{a}} $, denoted by $ \mathrm{AS}\left(\overrightarrow{\mathrm{a}}\right) $, as follows:

$$ \mathrm{AS}\left(\overrightarrow{\mathrm{a}}\right)=\left(\Sigma \left(\overrightarrow{a},1\right),\cdots, \Sigma \left(\overrightarrow{a},\mathrm{m}\right)\right) $$

(1)

and

$$ \mathrm{SD}\left(\overrightarrow{\mathrm{a}}\right)=\left(\sum \left(\overrightarrow{a},1\right),\sum \left(\overrightarrow{a},2\right)-\sum \left(\overrightarrow{a},1\right),\cdots, \sum \left(\overrightarrow{a},\mathrm{m}\right)-\sum \left(\overrightarrow{a},\mathrm{m}-1\right)\right) $$

(2)

where $ \Sigma \left(\overrightarrow{a},\mathrm{j}\right)={\sum}_{i=1}^j{a}_i\left(j=1,\cdots, m\right) $ indicates the partial sum of $ \overrightarrow{\mathrm{a}} $.

Now, the double digest problem (DDP) can be formulated by the following steps:

Given two vectors $ \overrightarrow{a}=\left({a}_1,\cdots, {a}_m\right) $ and $ \overrightarrow{b}=\left({b}_1,\cdots, {b}_n\right) $ with the restriction $ \Sigma \left(\overrightarrow{a},\mathrm{m}\right)=\Sigma \left(\overrightarrow{b},\mathrm{n}\right) $, we define the combining sequence of $ \overrightarrow{\mathrm{a}} $ and $ \overrightarrow{\mathrm{b}} $, denoted by $ \coprod \left(\overrightarrow{a},\overrightarrow{b}\right) $, as the concatenation of vectors $ \mathrm{AS}\left(\overrightarrow{\mathrm{a}}\right) $ and $ \mathrm{AS}\left(\overrightarrow{\mathrm{b}}\right) $ and removing the tail entry. That is,

$$ \coprod \left(\overrightarrow{a},\overrightarrow{b}\right)=\left(\mathrm{AS}{\left(\overrightarrow{\mathrm{a}}\right)}_1,\cdots, \mathrm{AS}{\left(\overrightarrow{\mathrm{a}}\right)}_m,\mathrm{AS}{\left(\overrightarrow{\mathrm{b}}\right)}_1,\cdots, \mathrm{AS}{\left(\overrightarrow{\mathrm{b}}\right)}_{n-1}\right) $$

(3)

The sequence $ \coprod \left(\overrightarrow{a},\overrightarrow{b}\right) $ can be reassembled to obtain a new sequence according to the nondecreasing order, denoted by $ \hat{\bigsqcup}\left(\overrightarrow{a},\overrightarrow{b}\right) $.
The double digest sequence of $ \overrightarrow{\mathrm{a}} $ and $ \overrightarrow{\mathrm{b}} $, denoted by $ \mathrm{DDS}\left(\overrightarrow{a},\overrightarrow{b}\right) $, can be defined as the step difference vector of $ \hat{\bigsqcup}\left(\overrightarrow{a},\overrightarrow{b}\right) $. That is,

$$ \mathrm{DDS}\left(\overrightarrow{a},\overrightarrow{b}\right)= SD\left(\hat{\bigsqcup}\left(\overrightarrow{a},\overrightarrow{b}\right)\right) $$

(4)

Now, we introduce the following definition:

Definition 1

A double digest problem (DDP) instance is specified by three vectors $ \overrightarrow{a}=\left({a}_1,\cdots, {a}_m\right),\overrightarrow{b}=\left({b}_1,\cdots, {b}_n\right)\ and\;\overrightarrow{c}=\left({c}_1,\cdots, {c}_k\right) $ with the restriction of

$$ \Sigma \left(\overrightarrow{a},m\right)=\Sigma \left(\overrightarrow{b},n\right)=\Sigma \left(\overrightarrow{c},k\right) $$

(5)

and the objective is to find a pair permutations (μ, ν) ∈ S_m × S_n such that.

$$ DDS\left({\overrightarrow{a}}^{\mu },{\overrightarrow{b}}^v\right)={\overrightarrow{c}}^{\pi }\ for\; som\ \pi \in {S}_m $$

(6)

Remark 1

If two enzymes cut a target DNA molecule at disjoint sites, then we have the condition k = m + n − 1. It was previously suspected that this case might lead to easier reconstruction problems [2]. (However, our simulation does support this conjecture, and details are given in the supplementary part). However, due to some unavoidable experimental errors, this condition does not always hold. Thus, in DDmap, we employ a very simple strategy to address the cases of k = m + n − 1: Introducing 0-length fragments in sequence A,B, or C if necessary. Our simulation results show that this method is considerably robust.

Remark 2

If we take into consideration possible partial cleavage errors, then the optimization goal (6) should be updated to

$$ \left.{\mathit{\min}}_{\mu \in {S}_m,\nu \in {S}_n}\right|\left| DDS\left({\overrightarrow{a}}^{\mu },{\overrightarrow{b}}^{\nu}\right)\oplus \overrightarrow{c}\right| $$

(7)

where symbol ⊕ indicates the set exclusive operation, and the two operands $ DDS\left({\overrightarrow{a}}^{\mu },{\overrightarrow{b}}^{\nu}\right) $ and$ \overrightarrow{c} $ should be regarded as unordered multisets. By doing so, the searching space of the DDP solution is reduced to S_m × S_n, instead of S_m × S_n × S_k. In fact,π can be easily extracted from any valid solution (μ, ν). A simple method for obtaining π is to at first sort $ DDS\left({\overrightarrow{a}}^{\mu },{\overrightarrow{b}}^{\nu}\right) $ to obtain a nondecreasing sequence and then let π be the permutation specified by the reverse index of the sorting subscripts. Apparently, this step can be performed within the complexity Ο(klogk).

Example 1

For given three vectors $ \overrightarrow{a}=\left(1,2,3,5\right) $ , $ \overrightarrow{b}=\left(2,2,3,4\right) $ and $ \overrightarrow{c}=\left(1,1,1,2,2,2,2\right) $ as well as two permutations.

$ \mu =\left(\begin{array}{c}1\ 2\ 3\ 4\\ {}2\ 4\ 3\ 1\end{array}\right) $ and $ \nu =\left(\begin{array}{c}1\ 2\ 3\ 4\\ {}3\ 1\ 2\ 4\end{array}\right) $, we can verify that (μ, ν) is a valid solution for the DDP instance specified by $ \left(\overrightarrow{a},\overrightarrow{b},\overrightarrow{c}\right) $. The pie charts of a solution and the corresponding calculation steps and complexities are depicted in Table 3 .

Table 3 Illustration of the proposed formulation. This is the detailed process of solving the double digest problem, The calculation process of example 1 is listed in (a). The pie chart for this example’s solution is in (b).

Full size table

The proposed genetic operators

Recall that the basic idea of a genetic algorithm consists of the following concepts: an individual is totally specified by a chromosome; a chromosome is the carrier of a gene, and the position of a gene in a chromosome is called a locus; the gene composition of an individual is called the genotype; and the fitness value, called phenotype, is the result of mutual effects of genotype and external environments. Thus, to design a genetic algorithm for a given optimization problem, we need to specify how to represent a chromosome, evaluate the fitness value, design genetic operators, and determine evolution strategies such as the population size, the maximum evolution generation, the elitism keeping method, the probabilities for each genetic operator, etc.

First, for a given DDP instance $ \left(\overrightarrow{a},\overrightarrow{b},\overrightarrow{c}\right) $, we directly use a random pair of permutations (μ, ν) ∈ S_m × S_n to represent a chromosome, and its fitness value is given by

$$ f\left(\mu, \nu \right)=\frac{1}{1+\left| DDS\left({\overrightarrow{a}}^{\mu },{\overrightarrow{b}}^{\nu}\right)\oplus \overrightarrow{c}\right|} $$

(8)

Second, the following 6 genetic operators are employed in this work:

RWS. This is a natural selection operator defined as the well-known roulette wheel algorithm.
PCC. This is a crossing operator defined as a combination of two permutations. Given two chromosomes (μ⁽¹⁾, ν⁽¹⁾) and (μ⁽²⁾, v⁽²⁾), this operator produces two new offspring

$$ \left({\mu}^{(1)}\circ {\mu}^{(2)}\circ {\nu}^{(1)}\circ {\nu}^{(2)}\right) $$

and

$$ \left({\mu}^{(2)}\circ {\mu}^{(1)}\circ {\nu}^{(2)}\circ {\nu}^{(1)}\right) $$

respectively.

RSC. This is a crossing operator defined as the so-called referencing sorting (RS). Given a target sequence $ \overrightarrow{a} $ and a reference sequence $ \overrightarrow{b} $, assuming both are defined over the same alphabet. Then, during the sorting process, the swapping operation of two elements in $ \overrightarrow{a} $ is performed only if they are in the reverse order in the referencing sequence $ \overrightarrow{b} $. RS is a generalization of ordinary sorting in the sense that any two elements can be compared even if they do not come from a complete order. RS is inspired by operator precedence grammars. More details about RS and RSC are given in the supplementary section. In fact, RSC is called order preserving weighted crossover in [13].
P4X. This is a four-point mutating operator defined as follows: Given a chromosome (μ, ν), randomly exchange two elements of μ and two elements of ν.
FLP. This is a fragment mutating operator defined as flipping of the given fragment. By flipping a fragment (2, 5, 4, 1), we obtain (1, 4, 5, 2).
CSH. This is a fragment mutating operator defined as cyclic shifting of the given fragment. By cyclically shifting a fragment (2, 5, 4, 1), we obtain (5, 4, 1, 2).

More details about the referenced sorting crossover (RSC) genetic operator.

RSC is in fact the order preserving weighted crossover given in [13]. Suppose two parent chromosomes are

$$ {\displaystyle \begin{array}{l}{\mathrm{p}}_1=\left(1,3,2,1,3,4,2,2\right)\mathrm{and}\\ {}{\mathrm{p}}_2=\left(1,2,2,2,4,3,3,1\right),\end{array}} $$

and the crossover point is 3. Then, the producing of the offspring is given below:

(1)
p₁ is split into two pieces: p₁₁ = (1, 3, 2) and p₁₂ = (1, 3, 4, 2, 2), and p₂ is split into two pieces: p₂₁ = (1, 2, 2) and p₂₂ = (2, 4, 3, 3, 1).
(2)
The piece p₁₂ is sorted by taking p₂ as the referenced sequence. Since in p₂ there exists a chain 2 − 2 − 4 − 3 − 1 this leads to p ' ₁₂ = (2, 2, 4, 3, 1).
(3)
Similarly, p₂₂ is sorted by taking p₁ as the referenced sequence. This time, we obtain p '₂₂ = (3, 1, 3, 4, 2) since there exists a chain 3 − 1 − 3 − 4 − 2 in p₁.
(4)
Two offspring chromosomes are

$$ {\displaystyle \begin{array}{l}{\mathrm{c}}_1={\mathrm{p}}_{11}\Big\Vert {\mathrm{p}}_{12}^{\prime }=\operatorname{}\left(1,3,2,2,2,4,3,1\right)\mathrm{and}\\ {}{\mathrm{c}}_2={\mathrm{p}}_{21}\Big\Vert {\mathrm{p}}_{22}^{\prime}\operatorname{}\left(1,2,2,3,1,3,4,2\right).\end{array}} $$

Among the above 6 genetic operators, RWS is widely used in most genetic algorithms, and RSC was first used in [13] to solve the DDP problem. Four other genetic operators, although being easily conceived, are new to DDP-oriented genetic algorithms, as far as we know.

Third, the evolution strategies in this work refer to [13]. That is, the population size and maximum evolution generation are set to 50 and 10,000, respectively. Elitists in each generation are kept, and the crossing probability is set to 0.85. The linearly adaptive mutation probability in [13] is also used in our work, but with a slight modification to ensure the cyclic increment of mutation probability is nonnegative. The details are as follows:

We follow the suggestion given in [13] by letting the mutation probability vary linearly in cycles of 200 iterations. However, in the original paper, this cycle varies from $ \frac{2}{m+n} $ to 0.45, while in our work, the cycle varies from $ \frac{2}{m+n} $ to 0.55, considering that in the case of m = n = 2, the start point would be 0.5, which is larger than 0.45.

Scaling-rounding-adjusting approach

Based on the above observation, we try to deal with the instance INS₂ in another way. A new test instance, INS₂, is derived by using a scaling-rounding-adjusting process on INS₂. The details of this process are as follows:

Scaling and rounding. Because the minimum length of pieces in sequence $ \overrightarrow{c} $ of INS₂ is 1120, we take 0.001 as the scaling factor. That is, we multiply the sequences $ \overrightarrow{\mathrm{a}},\overrightarrow{\mathrm{b}},\overrightarrow{\mathrm{c}} $ by 0.001 and then round them. By doing so, we obtain

$$ {\displaystyle \begin{array}{c}\overrightarrow{{\mathrm{a}}^{\prime }}=\left(6,6,7,7,7,17\right)\\ {}\overrightarrow{{\mathrm{b}}^{\prime }}=\left(4,5,6,6,7,21\right)\\ {}\overrightarrow{{\mathrm{c}}^{\prime }}=\left(1,2,3,3,3,4,4,4,5,6,16\right)\end{array}} $$

Adjusting. Next, we find that

$$ \sum \left(\overrightarrow{{\mathrm{a}}^{\prime }}\right)=50\ne \sum \left(\overrightarrow{{\mathrm{b}}^{\prime }}\right)=49\ne \sum \left(\overrightarrow{{\mathrm{c}}^{\prime }}\right)=51 $$

That is, $ \left(\overrightarrow{\mathrm{a}},\overrightarrow{\mathrm{b}},\overrightarrow{\mathrm{c}}\right) $ is an invalid DDP instance. Intuitively, this occurs because the round operation, round(·), introduces more errors. Thus, we try to adjust the rounding operation in the previous step according to the so-called rounding-up and rounding-down strategies:

- Rounding-up: round(x) is replaced by x ' = round(x + 0.1), and we obtain

$$ {\displaystyle \begin{array}{c}\overrightarrow{{\mathrm{a}}^{{\prime\prime} }}=\left(6,6,7,7,7,17\right)\\ {}\overrightarrow{{\mathrm{b}}^{{\prime\prime} }}=\left(4,5,6,6,8,21\right)\\ {}\overrightarrow{{\mathrm{c}}^{{\prime\prime} }}=\left(1,2,3,3,3,4,4,4,5,6,16\right)\end{array}} $$

This DDP instance is again invalid since

$$ \sum \left(\overrightarrow{{\mathrm{a}}^{{\prime\prime} }}\right)=50=\sum \left(\overrightarrow{{\mathrm{b}}^{{\prime\prime} }}\right)\ne \sum \left(\overrightarrow{{\mathrm{c}}^{{\prime\prime} }}\right)=51 $$

- Rounding-down: round(x) is replaced by x ' = round(x − 0.1)

$$ {\displaystyle \begin{array}{c}\overrightarrow{{\mathrm{a}}^{{\prime\prime\prime} }}=\left(5,6,6,7,7,17\right)\\ {}\overrightarrow{{\mathrm{b}}^{{\prime\prime\prime} }}=\left(3,5,6,6,7,21\right)\\ {}\overrightarrow{{\mathrm{c}}^{{\prime\prime\prime} }}=\left(1,2,2,3,3,3,4,4,5,5,16\right)\end{array}} $$

Now, the DDP instance may be valid since

$$ \sum \left(\overrightarrow{{\mathrm{a}}^{{\prime\prime\prime} }}\right)=\sum \left(\overrightarrow{{\mathrm{b}}^{{\prime\prime\prime} }}\right)=\sum \left(\overrightarrow{{\mathrm{c}}^{{\prime\prime\prime} }}\right)=48 $$

Note that the constant 0.1 in rounding-up/rounding-down is a value defined by experience. A reasonable domain of this constant would be the interval [0.0001, 0.4999].

• Now, the newly derived DDP instance INS₂, is given by the three vectors $ \left(\overrightarrow{{\mathrm{a}}^{{\prime\prime\prime} }},\overrightarrow{{\mathrm{b}}^{{\prime\prime\prime} }},\overrightarrow{{\mathrm{c}}^{{\prime\prime\prime} }}\right) $.

Finally, we would like to mention that all simulations in this work are conducted on a X1 Carbon laptop with Windows(TM) 8, Intel(R)Core(TM)i5 − 4300U CPU @ 1.90 GHz/2.49 GHz and 8GB RAM. The complete genetic algorithm for solving the DDP problem is implemented as a MATLAB package, DDmap, and a Maple source code for drawing DDP solutions as nested pie charts is also included in this package.

The package DDmap consists of

13 MATLAB algorithms:
permGA.m, the MATLAB genetic algorithm for solving the DDP problem. This is the main algorithm, and its flowchart is depicted in Fig. 7. Note that this file also contains the definitions of five genetic operators — RWS, PCC, P4X, FLP, CSH and related MATLAB functions for calculating the fitness values.

referIndexSort.m, the MATLAB algorithm for implementing the so-called referenced sorting (based on index).
opPermCross.m, the MATLAB algorithm for implementing the RSC genetic operator.
getInstance.m, the auxiliary MATLAB algorithm for outputting test DDP instances in [Sur-Kolay S. et al., 2005].
randDDPinstance.m, the auxiliary MATLAB algorithm for producing a valid DDP instance according the given parameters.
strABC.m, the auxiliary MATLAB algorithm for producing Maple commands for reading data before calling the Maple algorithm DDdraw.mws.
simu1004.m, simu1004plots.m, simu1007.m, and simu1008.m, the auxiliary MATLAB algorithms for organizing simulations and producing the related figures.
Trans.m, the MATLAB algorithm for implementing Scaling-rounding-adjusting approach for Cases of INS_2.
Plot1.m, the auxiliary MATLAB algorithms for comparison of DDmap and algorithm in [3, 13].
Plot2.m, the auxiliary MATLAB algorithms for comparison of DDmap and other algorithms under the condition of INS_2.
1 Maple algorithm, DDdraw.mws, is used for drawing the DDP solution in nested pie charts, with inputs A, B and C that are assigned by using Maple commands produced by strABC.m.
43 Data files: 42 of them are named as INS_{xx − ggg}. TEX, where xx ∈ {01, 03, 04, 05, 07, 08, 10} and ggg ∈ {pcc, rsc, p4x, flp, csh, all}, and the last is named as INS_{02 − rs − 1008}. TEX. These data files are in fact the running records of our simulations towards the 7 valid DDP test instances given in [13]. In these running records, many exact DDP solutions are provided.

Abbreviations

CSH:: Cyclic shifting
DDE:: Double digest experiments
DDP:: Double digest problem
FLP:: Flipping
P4X:: Four-point exchange
PCC:: Permutations combination crossing
RSC:: Referencing sorting crossing
RWS:: Roulette wheel selection

References

Blazewicz J, Burke E, Jaroszewski M, Kasprzak M, Paliswiat B, Pryputniewicz P. On the complexity of the double digest problem. Control Cybern. 2004;33(1):133–40.
Google Scholar
Cieliebak M, Eidenbenz S, Woeginger GJ. Double digest revisited: complexity and Approximability in the presence of Noisy data. In: International conference on Computing & Combinatorics: Springer-Verlag; 2003. p. 519–27.
Ganjtabesh M, Ahrabian H, Nowari-Dalini A, Moghadam ZRK. Genetic algorithm solution for double digest problem. Bioinformation. 2012;8(10):453–6.
Article Google Scholar
Goldstein L, Waterman MS. Mapping DNA by stochastic relaxation. Adv Appl Math. 1987;8:194–207.
Article Google Scholar
Kao MY, Samet J, Sung WK. The enhanced double digest problem for dna physical mapping. J Comb Optim. 2003;7(1):69–78.
Article Google Scholar
Karp R. Mapping the genome: some combinatorial problems arising in molecular biology. In: ACM symposium on theory of computing; 1993. p. 278–85.
Google Scholar
Nathans D, Smith HO. Restriction endonuleases in the analysis and restructuring of DNA molecules. Annu Rev Biochem. 1975;44:273–93.
Article CAS Google Scholar
Pevzner PA. DNA physical mapping, flows in networks and minimum cycles mean in graphs. In: Gindikin SG, editor. DIMACS Series in Discrete Mathematics and Theoretical Computer Science 8: Mathematical Methods of Analysis of Biopolymer Sequences, pages 99–112. Providence: American Mathematical Society; 1992.
Google Scholar
Pevzner PA. DNA physical mapping and alternating Eulerian cycles in colored graphs. Algorithmica. 1995;13(1/2):77–105.
Article Google Scholar
Schmitt W, Waterman MS. Multiple solutions of DNA restriction mapping problems. Adv Appl Math. 1991;12:412–27.
Article Google Scholar
Smith HO, Wilcox KW. A restriction enzyme from Hemophilus influenzae. I. Purification and general properties. J Mol Biol. 1970;51(2):379–91.
Article CAS Google Scholar
Stefik M. Inferring DNA structure from segmentation data. Artif Intell. 1978;11:85–114.
Article Google Scholar
Sur-Kolay S, Banerjee S, Mukhopadhyaya S, et al. Genetic algorithm for double digest problem. In: International conference on Pattern Recognition & Machine Intelligence. Berlin Heidelberg: Springer; 2005.
Google Scholar
Waterman MS, Griggs JR. Interval graphs and maps of DNA. Bull Math Biol. 1986;48(2):189–95.
Article CAS Google Scholar
Wu Z, Zhang Y. Solving large double digestion problems for DNA restriction mapping by using branch-and-bound integer linear programming. Int J Bioinforma Res Appl. 2008;4(4):351.
Article CAS Google Scholar

Download references

Acknowledgements

We thank the anonymous reviewers for giving us valuable suggestions.

Funding

This work was supported by the National Key R&D Program of China (Grant No. 2016YFB0800602), the Shandong provincial Key R&D Program of China (Grant No. 2018CXGC0701), and the 111 Project (No. B08004). The experimental environments and the editing services are supported by the Shandong provincial Key R&D Program of China (Grant No. 2018CXGC0701) and the 111 Project (No. B08004), respectively. The publication cost is funded by National Key R&D Program of China (Grant No. 2016YFB0800602). The funders had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Availability of data and materials

DDmap is open source and available at https://github.com/wanglc2012/DDmap.

Author information

Authors and Affiliations

State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, 10 West Tucheng Road, Haidian District, Beijing, 100876, China
Licheng Wang, Jingwen Suo & Lixiang Li
School of Computer Science, Communication University of China, 1 East of Dingfuzhuang Street, Chaoyang District, Beijing, 100024, China
Yun Pan

Authors

Licheng Wang
View author publications
You can also search for this author in PubMed Google Scholar
Jingwen Suo
View author publications
You can also search for this author in PubMed Google Scholar
Yun Pan
View author publications
You can also search for this author in PubMed Google Scholar
Lixiang Li
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

LW proposed the main concept of the presented technique, implemented the main framework. JS finished the main experimental study, as well as analysis on related work. YP conceived the main idea about this work and implemented the algorithms about different genetic operators. LL verified the main method, checked the correctness of the presented technique. All authors reviewed the manuscript and approved the final manuscript.

Corresponding author

Correspondence to Yun Pan.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Wang, L., Suo, J., Pan, Y. et al. DDmap: a MATLAB package for the double digest problem using multiple genetic operators. BMC Bioinformatics 20, 348 (2019). https://doi.org/10.1186/s12859-019-2862-x

Download citation

Received: 19 February 2019
Accepted: 29 April 2019
Published: 18 June 2019
DOI: https://doi.org/10.1186/s12859-019-2862-x

DDmap: a MATLAB package for the double digest problem using multiple genetic operators

Abstract

Background

Results

Conclusions

Background

Results

Discussion

Conclusions

Methods

Problem formulation

Definition 1

Remark 1

Remark 2

Example 1

The proposed genetic operators

Scaling-rounding-adjusting approach

Abbreviations

References

Acknowledgements

Funding

Availability of data and materials

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Publisher’s Note

Rights and permissions

About this article

Cite this article

Share this article

BMC Bioinformatics

Contact us