DDmap: a MATLAB package for the double digest problem using multiple genetic operators

Background In computational biology, the physical mapping of DNA is a key problem. We know that the double digest problem (DDP) is NP-complete. Many algorithms have been proposed for solving the DDP, although it is still far from being resolved. Results We present DDmap, an open-source MATLAB package for solving the DDP, based on a newly designed genetic algorithm that combines six genetic operators in searching for optimal solutions. We test the performance of DDmap by using a typical DDP dataset, and we depict exact solutions to these DDP instances in an explicit manner. In addition, we propose an approximate method for solving some hard DDP scenarios via a scaling-rounding-adjusting process. Conclusions For typical DDP test instances, DDmap finds exact solutions within approximately 1 s. Based on our simulations on 1000 random DDP instances by using DDmap, we find that the maximum length of the combining fragments has observable effects towards genetic algorithms for solving the DDP problem. In addition, a Maple source code for illustrating DDP solutions as nested pie charts is also included.


Background
The physical mapping of DNA is a key problem in computational biology [5]. A large DNA molecule is a long string composed of four nucleotides, A, C, G and T. To understand the structure of DNA molecules, it is of interest to determine the occurrences of short substrings, such as GAATTC, on the DNA. Double digest experiments (DDE for short) are a standard approach for constructing physical DNA maps [2]. Given two restriction enzymes, denoted by A and B, this approach cuts a target DNA sequence by using only enzyme A , only enzyme B , and both enzymes simultaneously, in three separate and parallel experiments [5]. As a result, we obtain three multisets of short DNA fragments. However, due to certain experimental limitations, only the length information (i.e., The number of nucleotides) of these short fragments can be measured with certain accuracy using certain mature biological techniques, such as gel electrophoresis. The objective of the double digest problem (DDP) is to reconstruct the original ordering of the fragments in the target DNA molecule.
Since the first successful reconstruction of restriction site mapping in the earlier 1970s [7,11], the DDP problem has become an intensively studied issue that covers a variety of disciplines [6,9]. Although the major concerns come from the community of bioinformation, the challenges related to this problem have also attracted attention from the artificial intelligence, algorithmic complexity, and optimization communities. We now know that DDP is strongly NP-complete [1,2], and many algorithms have been proposed for solving the DDP problem [3-6, 8-10, 12-15]. However, the DDP problem is still far from being resolved. All of the algorithms developed to address this problem have encountered significant difficulties as the number of restriction sites increases. Moreover, even for different DDP instances with the same size, the hardness for finding an exact solution might vary remarkably.
The main motivation of this work comes from three considerations: First, almost all existing formulations of Table 1 Main results: separated and integrated effects of all six genetic operators. Instance 1,3,4,5,7,8 come from [13], instance 2' is derived by using a scaling-rounding-adjusting process towards instance 2, m, n, k are the lengths of the input fragments A, B and C, respectively. There are six genetic operators, RWS is selection operator defined as the well-known roulette wheel algorithm. PCC and RSC are crossing operators, PCC is the combination of two permutations, RSC is Referencing Sorting Crossing, P4X, FLP, CSH are mutation operators, P4X is a four-point mutating, FLP defined as the flipping of the given fragment. CSH defined as the cyclic shifting of the given fragment. The average running time, average evolution generations and success rate are listed in the table. At the right of the table, we draw pie charts of DDmap's two solutions.
the DDP problem use multiset as the basic data structure, while we find that it is even easier to model the DDP problem by using vectors. Second, some recently proposed genetic algorithms [3,13] for addressing the DDP problem should be improved. Third, it is of interest to develop an open-source package for studying the DDP problem by using easily accessible engineering computation platforms, such as MATLAB.
Our main contributions are summarized as follows: A vector-based formulation of the DDP problem is presented and illustrated step-by-step. A novel genetic algorithm for solving the DDP problem is proposed by combining six genetic operators, and a MATLAB package, DDmap, is implemented by integrating the proposed genetic algorithm and other necessary supporting and testing widgets. Then, by using DDmap, exact solutions for typical DDP test instances [13] are explicitly derived and depicted. (See the right column of Table 1.) A relation between the hardness of certain DDP instances and the maximum length of double digest sequences is revealed based on our simulations of 1000 random DDP instances. Meanwhile, an approximate approach for typical hard DDP instances is conceived based on this relation.

Results
To test the utility of DDmap, eight DDP instances, referred to as INS j (j = 1⋯8), are taken from [13]. They are shown in the following Table 2: First, the integrated effects of the six aforementioned genetic operators of DDmap are verified. For the instances INS 1,3,4,5,7,8 , DDmap performs considerably well, and the related results are collected in Table 1. For each instance, 100 trails were run using DDmap with respect to each combination of six genetic operators. Then, the average running time, the average evolution generation and the success rate of finding exact DDP solutions are counted. Two different exact solutions for the instances INS 1, 3, 4, 5, 7, 8 are also depicted in the right column of Table 1. In addition, the average running time and the average evolution generations of finding exact DDP solutions are depicted in Fig. 1. From Table 1 and Fig. 1. We can see that the genetic operators combination of RWS + PCC performs best in running time, RWS + ALL performs best in evolving generation, while other combinations of different genetic operators perform similarly and equally effective. Moreover, the tendency of running time curve and evolving generation curve are very similar.
However, we find that DDmap performs very poorly for INS 2 and INS 6 . Upon further examination, we find that INS 6 , is invalid, Simple calculation shows that as for INS 6 because it violates the restriction condition of (5) (See Definition 1).
For INS 2 , we run DDmap 100 trails and successfully obtain exact solutions of INS 2 in 67 trails. But the average running time and evolution generations for reaching the exact solution of INS 2 are 122 s and 3828, respectively, i.e., approximately 1000 times slower than the results of other test instances (see Table 1). Furthermore, we find that these 67 solutions are essentially the same: One solution is depicted in Fig. 2(a), and another solution is just to read out the sequences A, B and C of Fig. 2(a) in an reverse order. It seems that the solution to INS 2 's solutions are very sparse, and thus, DDmap faces the difficulty of escaping from so many local optima. We deal with the INS 2 by using the scaling-roundingadjusting approach. As expected, DDmap can find solutions towards INS 2′ very efficiently. For each combination of six genetic operators, we run DDmap towards INS 2′ 100 trials. The average running time is no more than 2 s, the evolution generation is no more than 80, and the success rate for finding exact DDP solutions is always 100%. The results are already contained in Table 1 and Fig. 1. Now, we directly take some INS 2 , 's solution, (μ, ν) ∈ S m × S n , as an approximate solution of INS 2 . The resulted double digest pie charts are depicted in Fig. 2(b). Compared to the exact solution given in Fig. 2(a), we think this kind of approximation is an interesting result in the sense that the relative error, defined as the proportion of total length of gaps between two miss-aligned fragments, is merely 4.8%, calculated by 115 þ 17 þ 256 þ 171 þ 117 þ 188 þ 280 þ 1120 48502 ¼ 0:0487: Next, via a number of simulations, we find that DDmap's performance is tightly related to the maximum length of a piece in the sequence of C, denoted by ρ C = max c i . This is reasonable considering that for a fixed length of sequence C, denoted by L C = |C|, the smaller ρ C is, the denser the solutions, and thus, the easier for genetic algorithms, such as DDmap, to meet an exact solution during the evolution process. Based on our simulations towards 1000 random DDP instances with different ρ C , the relationship between the success rate of finding exact DDP solutions with respect to ρ C is depicted in Fig. 3.

Discussion
Note that in both INS 4 and INS 5 , the given two enzymes cut the target DNA molecule at some of the same sites and lead to the case where k ≠ m + n − 1. At the beginning, DDmap performs very poorly on INS 4 and INS 5 . The performance of DDmap on INS 4 and INS 5 improves remarkably after we adopt the following simple preprocessing strategy: • If k < m + n − 1, then introduce δ = (m + n − 1) − k fragments with length 0 into. the sequence c ! ; • Otherwise, if k > m + n − 1, then introduce δ = k − (m + n − 1) fragments with length 0.
into the shorter sequence among a ! and b ! ; • Otherwise, do nothing. An interesting observation is that the newly introduced 0-length fragments will explicitly appear in the pie charts of exact DDP solutions. For instance, Fig. 4(a) shows that a 0-length fragment in sequence c ! of INS 4 appears at the fifteenth site, while Fig. 4(b) shows that two 0-length fragments in sequence c ! of INS 5 appear at the sixth and eighth sites, respectively. Here, we follow the convention of reading a pie chart from 0°to 180°or 360°. operator in [13] is the same as our operator 2 and the two mutation operators in [3] are similar to our operators op4 and op5, so we only implement the mutation operator op6 in [13] and crossover operator op7 in [3]. Eight instances are from the paper [13]. In the comparison experiment, each instance is run 100 times for operators op1-7 respectively, and then we got the average running time and the success rate of finding the exact DDP solution.

♦ Comparison
Through the experimental data, we found the data of op6 is much larger than that of the other six operators, the data of the other six operators will be neglected in the rectangular coordinate system, so we choose the logarithmic coordinate system. Figure 5(a) is the comparison between DDmap and the algorithm in 2005 [13], the blue line is the average running time of op6, it is higher than the other six lines, our DDmap's performance is tightly related to the maximum length of piece in C, we generated a series of random double digest instances with the maximum length of C ranging from 10 to 100, then test the DDmap's success rate, the line of success rate changing with the maximum length of C is shown in Fig. 3 algorithm has a significant time advantage over the [3]'s algorithm. As can be seen from Fig. 5(b), the six lines have little difference, however, the op7's line is always at the top, so our algorithm has a slight advantage over that of [3].
The comparison of success rate is shown in Fig. 5(c). The success rate of operators 1, 2, 3, 4, 5, 7 is 100%, they are all effective for these instances. Operator 6 runs very irregularly and the results are not very good.
Instance 2 and 6 does not appear in Fig. 5. In fact, INS 6 is invalid. As aforementioned, INS 2 is very complex, so we analyze it separately. To reset the maximum evolution generation as large as 100,000, running each operator 10 times towards INS 2 , the average running time and the success rate is shown in Fig. 6(a) and (b), respectively. We can see that the running time of op6 is about 10 times longer than other operators, while the running time of op7 is about twice longer than our operators op1-5. The success rates of our five operators are all 100%, however, op7's success rate is 90%, but op6 does not produce the exact DDP solution.
In conclusion, DDmap is much better than the algorithm in [13] and it is slightly better than [3]'s algorithm.

Problem formulation
Let S m denote the symmetric group on m indices {1, 2, ⋯, m}. Then, for a given permutation π ∈ S m and a given vector a ! ¼ ða 1 ; ⋯; a m Þ , the action of π on a ! derives a vector a ! π ¼ ða πðiÞ ; ⋯; a πðmÞ Þ, reassembling of the order of entries of a ! according to π. Further, let us define the accumulative sum vector of a ! , denoted by ASð a ! Þ, and the step difference vector of a ! , denoted by ASð a ! Þ, as follows: where Σð a ! ; jÞ ¼ X j i¼1 a i ð j ¼ 1; ⋯; mÞ indicates the partial sum of a ! . Now, the double digest problem (DDP) can be formulated by the following steps: vectors ASð a ! Þ and ASð b ! Þ and removing the tail entry. That is, The sequence ∐ð a ! ; b ! Þ can be reassembled to obtain a new sequence according to the nondecreasing order, denoted by⊔ð a ! ; b ! Þ. The double digest sequence of a ! and b ! , denoted by DDSð a ! ; b ! Þ, can be defined as the step difference vector of⊔ð a ! ; b ! Þ. That is, Now, we introduce the following definition: (a) (b) Fig. 6 Comparison of DDmap and algorithm in [3,13] under the condition of INS 2 . The maximum evolution generation is set to 100,000, running each operator 10 times, (a) is the average running time of each operator. b is the success rate of each operator (See figure on previous page.) Fig. 5 Comparison of DDmap and algorithm in [3,13]. Operators 1-5 are the crossover and mutation operators in DDmap, op6 is the mutation operator in [13] and op7 is the crossover operator in [3]. Each instance is run 100 times by using op1-7 respectively. a is a logarithmic coordinate system figure, we can see the average running time comparison between DDmap and the algorithm in [13] in (a). b is the average running time comparison between DDmap and the algorithm in [3]. c is the success rate comparison between DDmap and the algorithm in [3,13] Definition 1 A double digest problem (DDP) instance is specified by three vectors a ! ¼ ða 1 ; ⋯; a m Þ; b ! ¼ ðb 1 ; ⋯; b n Þ and and the objective is to find a pair permutations (μ, ν) ∈ S m × S n such that.
Remark 1 If two enzymes cut a target DNA molecule at disjoint sites, then we have the condition k = m + n − 1. It was previously suspected that this case might lead to easier reconstruction problems [2]. (However, our simulation does support this conjecture, and details are given in the supplementary part). However, due to some unavoidable experimental errors, this condition does not always hold. Thus, in DDmap, we employ a very simple strategy to address the cases of k = m + n − 1: Introducing 0-length fragments in sequence A,B, or C if necessary.
Our simulation results show that this method is considerably robust.

Remark 2
If we take into consideration possible partial cleavage errors, then the optimization goal (6) should be updated to where symbol ⊕ indicates the set exclusive operation, and the two operands DDSð a ! μ ; b ! ν Þ and c ! should be regarded as unordered multisets. By doing so, the searching space of the DDP solution is reduced to S m × S n , instead of S m × S n × S k . In fact,π can be easily extracted from any valid solution (μ, ν). A simple method for obtaining π is Fig. 7 Flowchart of main GA algorithm of DDmap. The input DDP instance includes the instances in [13] and random instances, after calculating the fitness value, if not satisfied the stop condition, the crossover and mutation operators will be performed probabilistically, then generate new offsprings and recalculate the fitness values to at first sort DDSð a ! μ ; b ! ν Þ to obtain a nondecreasing sequence and then let π be the permutation specified by the reverse index of the sorting subscripts. Apparently, this step can be performed within the complexity Ο(klogk  Table 3.

The proposed genetic operators
Recall that the basic idea of a genetic algorithm consists of the following concepts: an individual is totally specified by a chromosome; a chromosome is the carrier of a gene, and the position of a gene in a chromosome is called a locus; the gene composition of an individual is called the genotype; and the fitness value, called phenotype, is the result of mutual effects of genotype and external environments. Thus, to design a genetic algorithm for a given optimization problem, we need to specify how to represent a chromosome, evaluate the fitness value, design genetic operators, and determine evolution strategies such as the population size, the maximum evolution generation, the elitism keeping method, the probabilities for each genetic operator, etc.
First, for a given DDP instance ð a ! ; b ! ; c ! Þ, we directly use a random pair of permutations (μ, ν) ∈ S m × S n to represent a chromosome, and its fitness value is given by Second, the following 6 genetic operators are employed in this work: RWS. This is a natural selection operator defined as the well-known roulette wheel algorithm. PCC. This is a crossing operator defined as a combination of two permutations. Given two chromosomes (μ (1) , ν (1) ) and (μ (2) , v (2) ), this operator produces two new offspring RSC. This is a crossing operator defined as the so-called referencing sorting (RS). Given a target sequence a ! and a reference sequence b ! , assuming both are defined over the same alphabet. Then, during the sorting process, the swapping operation Table 3 Illustration of the proposed formulation. This is the detailed process of solving the double digest problem, The calculation process of example 1 is listed in (a). The pie chart for this example's solution is in (b).