DNA motifs are short sequences in the genome that play important functional roles in gene regulation. Because of their short length, it is difficult to identify them using features intrinsic to their composition. Assuming that motifs are conserved in closely related species due to their functional importance, they can be discovered by comparing the respective DNA sequences and identifying the subsequences that are highly similar to each other.

There are two common combinatorial formulations of the motif finding problem: the first is the consensus motif problem, which made its first appearance in 1984 [1], while the second is the planted (*l, d*)-motif problem, presented in 2000 [2]. The latter formulation is a special case of the former. The exact definitions are as follows:

Given a set of *t* sequences *s*_{i}, 1 ≤ *i* ≤ *t*, defined over an alphabet Σ, the consensus motif problem is to find an *l*-length motif sequence *M* such that in each sequence *s*_{i}, 1 ≤ *i* ≤ *t*, there is at least one subsequence *p*_{i} differing with at most *d* mismatches from *M*; i.e., *d*_{H}(*p*_{i}, *M*) ≤ *d*, where *d*_{H} is the Hamming distance between *p*_{i} and *M*.

The planted (*l, d*) motif problem is a special case of the consensus problem in which we restrict *p*_{i} to occur only once in *s*_{i}.
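To make the definitions concrete, the following minimal Python sketch (function names are ours, introduced only for illustration) checks the consensus condition: whether every sequence contains at least one *l*-length window within Hamming distance *d* of a candidate *M*.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)

def is_consensus_motif(M: str, sequences: list[str], d: int) -> bool:
    """True if every sequence has an l-length window within distance d of M."""
    l = len(M)
    return all(
        any(hamming(M, s[i:i + l]) <= d for i in range(len(s) - l + 1))
        for s in sequences
    )

# Example: "ACGT" occurs with at most one mismatch in both sequences.
print(is_consensus_motif("ACGT", ["TTACGTTT", "GGACCTGG"], d=1))  # True
```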

Due to its combinatorial nature, the consensus motif problem and the variant defined above are extremely challenging. On a benchmark dataset of 20 sequences, each of length 600 characters, challenging instances such as (15, 5), (17, 6), (19, 7), and (21, 8) have been posed, and many algorithms have been developed to solve them one after another. These algorithms can be classified into two major categories: approximation algorithms [2–12] and exact algorithms [13–30]. Approximation algorithms are based on probabilistic local search techniques, such as Gibbs sampling and expectation maximization. Although these algorithms may solve the challenging instances in practice, there is no guarantee that the motif will be found, even when *l* is short.

Exact algorithms are based on exhaustive search techniques. The brute force algorithm proceeds by testing all possible motifs of length *l* using pattern matching, leading to *O*(*l n t* 4^{*l*}) time complexity (a small illustrative sketch is given below). This algorithm, however, is not suitable for discovering long motifs in practice, and many algorithms have been developed to provide faster solutions. Examples of these algorithms are CENSUS [23], PMS1 [26], PMSP [27], PMSprune [29], PMS5 [30], SMILE [19], RISO [24], RISOTTO [28], and Voting [25]. In the following, we briefly review the most efficient ones and the ones related to our work.
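For reference, a minimal Python sketch of the brute-force enumeration (our own illustration, not one of the cited implementations): it enumerates all 4^{*l*} candidate patterns and keeps those that occur in every sequence with at most *d* mismatches, which is why the cost grows as *O*(*l n t* 4^{*l*}).

```python
from itertools import product

def hamming(a: str, b: str) -> int:
    return sum(1 for x, y in zip(a, b) if x != y)

def occurs_with_mismatches(pattern: str, s: str, d: int) -> bool:
    l = len(pattern)
    return any(hamming(pattern, s[i:i + l]) <= d for i in range(len(s) - l + 1))

def brute_force_motifs(sequences: list[str], l: int, d: int) -> list[str]:
    """Enumerate all 4^l patterns and keep the (l, d) motifs."""
    motifs = []
    for letters in product("ACGT", repeat=l):
        pattern = "".join(letters)
        if all(occurs_with_mismatches(pattern, s, d) for s in sequences):
            motifs.append(pattern)
    return motifs

# Tiny example; real instances (e.g., l = 15) are far beyond this approach.
print(brute_force_motifs(["ACGTAC", "TTACGA", "CACGTT"], l=4, d=1))
```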

The algorithms SMILE [19], RISO [24], and RISOTTO [28] are based on the use of suffix trees. These algorithms have the same time complexity, namely O(*t*^{2} *N* *v*(*l, d*)), where *v*(*l, d*) is the size of the *d*-mismatch neighbourhood of motifs of length *l* and $N = \sum_{i=1}^{t} n_i$, with *n*_{i} being the length of sequence *i* in the input. RISOTTO improved the time complexity of SMILE and RISO in the average case and solved some challenging instances such as (15, 5) and (17, 6).

PMSP [27] is based on exploring the neighbourhood of each *l*-mer of the first sequence and checking whether the elements of these neighbourhoods are (*l, d*) motifs; its worst-case time complexity is analysed in [27]. It is able to solve some challenging instances such as (15, 5) and (17, 6). PMSprune [29, 31] is an improved version of the PMSP algorithm based on the branch and bound strategy. Although it has the same worst-case time complexity as PMSP, it is more efficient in practice, and it tackled the (17, 6) and (19, 7) instances for the first time. PMS5 [30] is based on computing the common *d*-neighbourhood of three *l*-mers using an integer programming formulation. It combines this idea with the algorithms PMS1 and PMSprune. PMS5 can tackle the large challenging instances (19, 7), (21, 8), and (23, 9). The drawback of PMS5 is that it requires a larger amount of internal memory to finish the computation.

### Our contribution

In a previous work [32, 33], we introduced an idea composed of two stages to speed up the exact algorithms: in the first stage, we generate a set of candidate motifs by applying one of the exact algorithms based on the neighbourhood method (such as Voting [25] or PMSP [27]) using *q* ≤ *t* sequences. In the second stage, for each candidate motif we check whether it is a valid motif using pattern matching on the remaining (*t* - *q*) sequences. This dramatically reduces the search space and leads to a significant speed up. The bottleneck in this approach, however, was the determination of the value of *q* that yields the fastest running time. That is, the user had to guess the value of *q*, which might lead to non-optimal running time and even to no speed up compared to the traditional methods. The authors in [34] have also applied the same idea to the PMS1, RISOTTO, and PMSprune algorithms.

In this paper, we present a theoretical method which can be used to determine the appropriate value of *q*. We then apply this strategy to the PMSprune algorithm and solve some big challenging instances, such as (21, 8). Furthermore, we propose a parallel version of our algorithm to provide a practical solution to the challenging instances of the motif problem. Our parallel version further speeds up the solution of the (21, 8) instance.

### Definitions and related work

In this section, we introduce some notations and definitions that will help us to describe our algorithm and related work in a concise manner.

**Definition 1 adapted from [29]:** For any string *x* with |*x*| = *l*, let *B*_{d}(*x*) = {*y*: |*y*| = *l*, *d*_{H}(*y, x*) ≤ *d*}, where *d*_{H} denotes the Hamming distance and *B*_{d}(*x*) denotes the set of neighbours (the *d*-neighbourhood) of *x*. We also write *v*(*l, d*) to refer to |*B*_{d}(*x*)|.
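A minimal Python sketch (our own helper, added for illustration) of how *B*_{d}(*x*) can be generated recursively over the DNA alphabet; the size of the returned set is *v*(*l, d*), which for a 4-letter alphabet equals $\sum_{i=0}^{d}\binom{l}{i}3^{i}$.

```python
def neighborhood(x: str, d: int, alphabet: str = "ACGT") -> set[str]:
    """All strings of length |x| within Hamming distance d of x (the set B_d(x))."""
    if d == 0:
        return {x}
    if len(x) == 1:
        return set(alphabet)  # any single letter is within distance >= 1
    result = set()
    for suffix in neighborhood(x[1:], d, alphabet):
        mismatches = sum(1 for a, b in zip(x[1:], suffix) if a != b)
        if mismatches < d:
            result.update(c + suffix for c in alphabet)  # first letter may also change
        else:
            result.add(x[0] + suffix)  # distance budget already used up
    return result

B = neighborhood("ACGT", 1)
print(len(B))  # v(4, 1) = C(4,0)*3^0 + C(4,1)*3^1 = 13
```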

**Definition 2 adapted from [29]:** Let *s* denote a string of length *n* and let *x* denote another string of length *l*, *l* < *n*. We define the minimum distance between *s* and *x* as $\bar{d}_H(x, s) = \min_{y \,\triangleleft_l\, s} d_H(x, y)$, where *y* ⊲_{l} *s* denotes that *y* is a substring of *s* with length *l*.

**Definition 3 adapted from [29]:** Given an *l*-length string *x* and a set of strings *S* = {*s*_{1}, ..., *s*_{t}} with |*s*_{i}| = *n* for *i* = 1, ..., *t* and *l* < *n*, we define the distance between *S* and *x* as $\bar{d}_H(x, S) = \max_{1 \le i \le t} \bar{d}_H(x, s_i)$.
**Definition 4 adapted from [29]:** A string *x* is an (*l, d*) motif for a set of sequences *S* = {*s*_{1}, ..., *s*_{t}}, if:

- 1) |*x*| = *l*, and

- 2) $\bar{d}_H(x, S) \le d$.
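The following short Python sketch (helper names are ours) implements Definitions 2–4: the minimum distance of an *l*-mer to a sequence, the maximum of those minima over a set of sequences, and the resulting (*l, d*) motif test.

```python
def hamming(a: str, b: str) -> int:
    return sum(1 for x, y in zip(a, b) if x != y)

def min_dist_to_seq(x: str, s: str) -> int:
    """Definition 2: minimum Hamming distance between x and any l-length substring of s."""
    l = len(x)
    return min(hamming(x, s[i:i + l]) for i in range(len(s) - l + 1))

def dist_to_set(x: str, S: list[str]) -> int:
    """Definition 3: maximum over the sequences of the per-sequence minimum distance."""
    return max(min_dist_to_seq(x, s) for s in S)

def is_motif(x: str, S: list[str], l: int, d: int) -> bool:
    """Definition 4: x is an (l, d) motif iff |x| = l and its distance to S is at most d."""
    return len(x) == l and dist_to_set(x, S) <= d

print(is_motif("ACGT", ["TTACGTTT", "GGACCTGG"], l=4, d=1))  # True
```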

**Proposition 1 adapted from [10]:** Let *u* and *v* be two random strings of length *l* over an alphabet of 4 characters with equal probability of occurrence. The probability *p*_{d} that *d*_{H}(*u, v*) ≤ *d* is

$$p_d = \sum_{i=0}^{d} \binom{l}{i} \left(\frac{3}{4}\right)^{i} \left(\frac{1}{4}\right)^{l-i},$$

and the probability that $\bar{d}_H(u, S) \le d$ for a set *S* of *t* random sequences, each of length *n*, is $\left(1-(1-p_d)^{n-l+1}\right)^{t}$. The expected number of *l*-length motifs that occur at least once in each of the *t* sequences with up to *d* substitutions is $E(l, d, t, n) = 4^{l}\left(1-(1-p_d)^{n-l+1}\right)^{t}$.
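As a worked example (a small utility we add for illustration), the quantities in Proposition 1 can be computed directly; for instance, for the challenging instance (*l, d*) = (15, 5) with *t* = 20 and *n* = 600, *E* estimates how many spurious motifs are expected by chance.

```python
from math import comb

def p_d(l: int, d: int) -> float:
    """Probability that two random l-mers over {A,C,G,T} differ in at most d positions."""
    return sum(comb(l, i) * (3 / 4) ** i * (1 / 4) ** (l - i) for i in range(d + 1))

def expected_motifs(l: int, d: int, t: int, n: int) -> float:
    """E(l, d, t, n): expected number of l-mers that are (l, d) motifs of t random sequences."""
    pd = p_d(l, d)
    return 4 ** l * (1 - (1 - pd) ** (n - l + 1)) ** t

print(expected_motifs(15, 5, 20, 600))
```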

### PMSprune Algorithm

Because the first stage of our method depends on the PMSprune algorithm, we review its basic steps using the notions presented above.

The main strategy of PMSprune is to generate *B*_{d}(*y*), for every *l*-mer *y* in *s*_{1}, using a branch and bound technique. An element *x* ∈ *B*_{d}(*y*) is a motif only if $\bar{d}_H(x, S) \le d$. The step of verifying that $\bar{d}_H(x, S) \le d$ is achieved by scanning all substrings of *S*. For fixed values of *t*, *n*, and *l*, the expected time complexity of PMSprune is a function of *p*_{2d}, the probability that the Hamming distance between two random *l*-mers is at most 2*d* (Proposition 1 with 2*d* in place of *d*); the exact expression is derived in [29]. For fixed values of *t*, *n*, and *l*, the value *d'* is estimated such that the probability that a random *l*-mer is an (*l, d'*) motif, which by Proposition 1 equals $\left(1-(1-p_{d'})^{n-l+1}\right)^{t}$, is close to 1.
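The following simplified Python sketch (our own code; it omits PMSprune's branch and bound pruning and incremental distance updates, so it reflects only the overall generate-and-verify strategy) enumerates the *d*-neighbourhood of every *l*-mer of *s*_{1} and keeps the elements whose distance to every other sequence is at most *d*.

```python
def hamming(a, b):
    return sum(1 for x, y in zip(a, b) if x != y)

def neighborhood(x, d, alphabet="ACGT"):
    """All strings within Hamming distance d of x (same helper as in the earlier sketch)."""
    if d == 0:
        return {x}
    if len(x) == 1:
        return set(alphabet)
    out = set()
    for suf in neighborhood(x[1:], d, alphabet):
        if sum(1 for a, b in zip(x[1:], suf) if a != b) < d:
            out.update(c + suf for c in alphabet)
        else:
            out.add(x[0] + suf)
    return out

def naive_pms(S, l, d):
    """Candidates come from s_1's l-mer neighbourhoods; each is verified against S[1:]."""
    s1 = S[0]
    motifs = set()
    for i in range(len(s1) - l + 1):
        for x in neighborhood(s1[i:i + l], d):
            if all(min(hamming(x, s[j:j + l]) for j in range(len(s) - l + 1)) <= d
                   for s in S[1:]):
                motifs.add(x)
    return motifs

print(sorted(naive_pms(["ACGTAC", "TTACGA", "CACGTT"], l=4, d=1)))
```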

### Implementation

#### Our proposed strategy

Our new strategy, referred to as *hybrid exact pattern motif search* (HEP), is composed of three steps: first, we determine the value *q*, corresponding to the size of a subset of the input sequences, as explained below. Second, we apply an exact exhaustive algorithm £ (such as PMSprune) to the set of *q* sequences to find the set of *d*-neighbourhoods *B*_{d}(*x*) (see Definition 1 for the exact definition of the *d*-neighbourhood); we call this set the candidate motif set. Finally, we apply a pattern search algorithm over the remaining sequences to verify each candidate motif. Note that our algorithm is generic in the sense that it takes the program £ as input, in addition to the input sequences and the user parameters. A pseudocode for this strategy using the exact algorithm £ is as follows:

**Algorithm 2: HEP** (£, *s*_{1}, ..., *s*_{t}, *n, l, d*)

**Begin**

- 1) Determine the number of sequences *q* using the method given below.

- 2) Apply the exact algorithm £ to the *q* input sequences. Let *C* be the set of candidate motifs found in the *q* sequences.

- 3) For each pattern *v* in *C*, check whether *v* is a valid motif in the remaining (*t* - *q*) input sequences using a pattern matching algorithm.

**End**.
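A minimal Python sketch of the HEP flow (all names here are ours; the exact algorithm £ is passed in as a callable): candidates are generated from the first *q* sequences and then verified against the remaining *t* - *q* sequences by approximate pattern matching.

```python
from typing import Callable, Iterable

def hamming(a: str, b: str) -> int:
    return sum(1 for x, y in zip(a, b) if x != y)

def occurs(pattern: str, s: str, d: int) -> bool:
    l = len(pattern)
    return any(hamming(pattern, s[i:i + l]) <= d for i in range(len(s) - l + 1))

def hep(exact_algorithm: Callable[[list[str], int, int], Iterable[str]],
        sequences: list[str], l: int, d: int, q: int) -> set[str]:
    """Step 2: run the exact algorithm on q sequences; Step 3: verify on the rest."""
    candidates = exact_algorithm(sequences[:q], l, d)   # candidate motif set C
    remaining = sequences[q:]
    return {m for m in candidates if all(occurs(m, s, d) for s in remaining)}
```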

**Theorem 1:** Algorithm 2 correctly finds all (*l, d*) motifs in a given set of *t* input sequences.

**Proof:** Step 2 of the algorithm is exhaustive and finds the whole set of *d*-neighbourhoods for the *q* sequences. Therefore, by the definition of the (*l, d*) motif problem, any (*l, d*) motif belongs to this set, even if *q* = 1. In Step 3, each candidate motif is verified by comparison against the substrings of the remaining sequences. This step is conducted by an approximate pattern matching algorithm that checks, for each *l*-length candidate motif and each *l*-length substring of the remaining sequences, whether the Hamming distance between the two is at most *d*. This guarantees that no motif is missed.

**Theorem 2:** The running time of HEP is equal to

$$T_{HEP} = T_{£(q)} + l\,(t - q)(n - l + 1)\,E(l, d, q, n), \qquad (2)$$

where *T*_{£(q)} is the running time of step 2, i.e., of the exact algorithm £ applied to the *q* input sequences, and *l*(*t* - *q*)(*n* - *l* + 1) *E*(*l, d, q, n*) is the running time of step 3, where *E*(*l, d, q, n*) is the number of elements in the set *C*, estimated as $4^{l}\left(1 - (1 - p_d)^{n-l+1}\right)^{q}$. The complexity of step 1 is constant, as we explain below. Note that the running time of the brute force algorithm is obtained by setting *q* = 0 in equation (2), and the running time of the exact algorithm £ alone is obtained by setting *q* = *t*.

### Determination of the best *q*

The range of the number of sequences *q* that improves the performance of exact motif finding is calculated by solving the following inequality for the parameter *q*:

$$T_{HEP} = T_{£(q)} + l\,(t - q)(n - l + 1)\,E(l, d, q, n) < T_{£(t)}, \qquad (3)$$

where *T*_{£(t)} denotes the running time of the exact algorithm £ applied to all *t* sequences.

**Definition 5:** We define *mns* as the minimum number of sequences *q* that yields a better running time; i.e., the first value of *q* that satisfies the inequality. We also define *ons* as the optimal number of sequences *q* that yields the best running time; i.e., the value of *q* for which *T*_{HEP} is minimum over 1 ≤ *q* ≤ *t*.

### Implementing HEP based on PMSprune

We decided to use PMSprune for implementing the first step in our method because of its superiority compared to other algorithms, as discussed in [31]. However, we stress that our approach is generic and can be used with any better algorithm that may appear in the future. In the following, we refer to our method based on PMSprune as *HEP_PMSprune*. If *q* = *mns*, we denote it *HEP_PMSprune*(*mns*), and if *q* = *ons*, we denote it *HEP_PMSprune*(*ons*).

### Determining *mns* for PMSprune

Replacing *T*_{£(q)} in equation (2) with *T*_{PMSprune}(*q*), the running time of PMSprune on *q* sequences, gives the running time *T*_{HEP_PMSprune} of our method.

Replacing *T*_{HEP} with *T*_{HEP_PMSprune} and *T*_{£} with *T*_{PMSprune} in inequality (3) results in the following variation:

$$T_{PMSprune}(q) + l\,(t - q)(n - l + 1)\,E(l, d, q, n) < T_{PMSprune}(t).$$

Substituting the value of *E*(*l, d, q, n*) given in Proposition 1 into the left-hand side yields

$$T_{PMSprune}(q) + l\,(t - q)(n - l + 1)\,4^{l}\left(1 - (1 - p_d)^{n - l + 1}\right)^{q} < T_{PMSprune}(t).$$

Moving the term *T*_{PMSprune}(*q*) to the right-hand side, dividing both sides by 4^{*l*}, and taking the logarithm gives

$$\log\!\big(l\,(t - q)(n - l + 1)\big) + q\,\log\!\left(1 - (1 - p_d)^{n - l + 1}\right) < \log\frac{T_{PMSprune}(t) - T_{PMSprune}(q)}{4^{l}}. \qquad (4)$$

Inequality (4) provides the range of values of *q* for which the running time of HEP using PMSprune is less than the running time of the original PMSprune over the whole set of sequences. The minimum value of *q* in this range is *mns*.

### Determining *ons* for PMSprune

For fixed values of *t*, *n*, *l*, and *d*, *ons* can be calculated for PMSprune by selecting the value of *q* that minimizes the total number of operations *T*_{HEP_PMSprune} for 1 ≤ *q* ≤ *t*. The following algorithm computes the value of *ons* for each instance (*l, d*).

**Algorithm 3: Find** *ons*

**Begin**

- 1) *T*_{min} = ∞

- 2) **for** *q* = 1 **to** *t* **do**

- 3) compute *T* = *T*_{HEP_PMSprune}(*q*) using equation (2)

- 4) **if** *T* < *T*_{min} **then** *T*_{min} = *T*, *ons* = *q*

- 5) **end for**

**End**

The above algorithm computes *q* in *O*(*t*) time. In practice, computing *q* takes negligible time relative to the rest of the motif finding steps; it took at most one second in all experiments included in this paper, on both simulated and real datasets. To save time, our implementation includes a look-up table containing pre-computed values of *q* for different values of *l*, *n*, and *d*, where *l* < 20, *d* < 3, and selected values of *n* (*n* = 300, 350, 400, ..., 700). For other values of *l*, *n*, and *d*, we compute the best *q* using the above algorithm.
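As an illustration, a Python sketch of Algorithm 3 (the cost model follows equation (2); `t_exact` is a placeholder we introduce for an estimate of the exact algorithm's running time on *q* sequences): it evaluates the estimated HEP cost for every *q* and keeps the minimiser.

```python
from math import comb
from typing import Callable

def p_d(l: int, d: int) -> float:
    return sum(comb(l, i) * (3 / 4) ** i * (1 / 4) ** (l - i) for i in range(d + 1))

def expected_candidates(l: int, d: int, q: int, n: int) -> float:
    """E(l, d, q, n) from Proposition 1."""
    return 4 ** l * (1 - (1 - p_d(l, d)) ** (n - l + 1)) ** q

def find_ons(t: int, n: int, l: int, d: int,
             t_exact: Callable[[int], float]) -> int:
    """Return the q in [1, t] minimizing the estimated HEP cost (equation (2))."""
    best_q, best_cost = 1, float("inf")
    for q in range(1, t + 1):
        cost = t_exact(q) + l * (t - q) * (n - l + 1) * expected_candidates(l, d, q, n)
        if cost < best_cost:
            best_q, best_cost = q, cost
    return best_q
```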

### Parallel version of HEP_PMSprune(*ons*)

We propose a parallel version of HEP_PMSprune(*ons*), called PHEP_PMSprune(*ons*). The two main steps of HEP_PMSprune(*ons*) can be parallelized as follows:

We parallelize the PMSprune algorithm by assigning a set of *l*-mers from *s*_{1} to each processor for establishing the set of neighbouring motifs. The resulting sets are stored in candidate motif lists *L*_{i}, *i* ∈ {1, 2, ..., *p*}, where *p* is the number of processors. After each processor finishes its computation, the *L*_{i} lists are merged into a larger set *C* such that each motif is represented only once in this list; i.e., all repetitions are removed. Creating the *C* list is done in linear time with respect to the number of candidate motifs, as follows:

We incrementally construct the partial list *C*_{j} that contains the lists *L*_{1}, ..., *L*_{j}, 1 ≤ *j* ≤ *p*, by appending the list *L*_{j} to the end of the list *C*_{j-1} such that all elements of *L*_{j} already existing in *C*_{j-1} are discarded. This continues until *j* = *p*; i.e., *C*_{p} is *C*. Discarding a repeated element is done efficiently as follows: For small values of *l*, we create a look-up table of size |Σ|^{l}, where |Σ| is the alphabet size. Each possible *l*-length string can be mapped to a number in the range [0, |Σ|^{l}) in *O*(*l*) time. The *i*^{th} entry in this table contains one if a string in *C*_{j-1} is mapped to *i*; otherwise, it contains zero. The strings in *L*_{j} are queried against this look-up table to discard repetitions, and the entries they are mapped to are set to one. For larger values of *l*, we use the Aho-Corasick automaton to index all *l*-length motifs in *C*_{j-1}, check whether a string of *L*_{j} exists in the automaton or not, and add the new strings of *L*_{j} to the automaton. For these string matching algorithms, we refer the reader to [35].
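A minimal sketch of this merging step (our own simplification, suitable only for small *l*, using the integer-encoding look-up table described above):

```python
def encode(kmer: str) -> int:
    """Map an l-length DNA string to an integer in [0, 4^l) in O(l) time."""
    value = 0
    for c in kmer:
        value = value * 4 + "ACGT".index(c)
    return value

def merge_candidate_lists(lists: list[list[str]], l: int) -> list[str]:
    """Append each processor's list in turn, discarding l-mers seen before."""
    seen = bytearray(4 ** l)   # the look-up table; entry i is 1 if some string maps to i
    merged = []
    for L_j in lists:
        for kmer in L_j:
            idx = encode(kmer)
            if not seen[idx]:
                seen[idx] = 1
                merged.append(kmer)
    return merged

print(merge_candidate_lists([["ACGT", "AAAA"], ["AAAA", "CCCC"]], l=4))
# ['ACGT', 'AAAA', 'CCCC']
```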

In the second step, we validate each candidate motif independently in parallel over the available processors. The running time of this algorithm is *O*(*T*_{s}/*p* + |*C*|), where *T*_{s} is the sequential running time and |*C*| is the size of the set *C*.

The first step in the parallel algorithm does not lead to the loss of any motifs, because the set *C* includes the *d*-neighbourhood set of the *q* sequences. The reason is that we run PMSprune in parallel on the strings (*x*, *s*_{2}, *s*_{3}, ..., *s*_{q}), where *x* is a substring of *s*_{1}; that is, no *l*-mer of *s*_{1} is left unprocessed. The second step in the parallel algorithm is also correct, because the elements of *C* are independent of each other, and checking the validity of each candidate motif can safely be run in parallel. Our experimental results confirm the correctness of our parallelization procedure.