Sequence motif finder using memetic algorithm

Background De novo prediction of Transcription Factor Binding Sites (TFBS) using computational methods is a difficult task and it is an important problem in Bioinformatics. The correct recognition of TFBS plays an important role in understanding the mechanisms of gene regulation and helps to develop new drugs. Results We here present Memetic Framework for Motif Discovery (MFMD), an algorithm that uses semi-greedy constructive heuristics as a local optimizer. In addition, we used a hybridization of the classic genetic algorithm as a global optimizer to refine the solutions initially found. MFMD can find and classify overrepresented patterns in DNA sequences and predict their respective initial positions. MFMD performance was assessed using ChIP-seq data retrieved from the JASPAR site, promoter sequences extracted from the ABS site, and artificially generated synthetic data. The MFMD was evaluated and compared with well-known approaches in the literature, called MEME and Gibbs Motif Sampler, achieving a higher f-score in the most datasets used in this work. Conclusions We have developed an approach for detecting motifs in biopolymers sequences. MFMD is a freely available software that can be promising as an alternative to the development of new tools for de novo motif discovery. Its open-source software can be downloaded at https://github.com/jadermcg/mfmd. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-2005-1) contains supplementary material, which is available to authorized users.


Background
Sequence motifs are small sequences capable of acting as binding sites for a particular transcription factor [1]. In many situations, the localization of the motifs should be learned without prior knowledge. For that reason, this problem is called de novo motif discovery [2].
Transcription factors are specific proteins that bind to distinct sites on the genome. This binding is an essential process in gene regulation which may lead to changes in transcriptional activity for a particular gene target [3]. These sites are short (< 30 bps) and have a typical nucleotide sequence, although there may normally be variations due to mutations that occurred because of the selective pressure that the genome has undergone over time [4].
According to [5], several approaches have been proposed to solve efficiently this problem. Also, we have algorithm [16], that is a greedy local search method that explores the solution space through systematic exchanges of increasingly distant neighborhood structures. Also, the VNS step is important for recombination and mutation sub-stages to fine-tune individuals previously constructed by GRASP.

Previous work
We have developed in previous work two approaches called Discovery Motifs by Evolutionary Computation (DMEC) [17] and Discovery Motifs by Memetic Algorithms (DMMA) [18]. In the DMEC, we evolved a population of PSSM matrices using a canonical evolutionary algorithm and a greedy mutation operator. Good results were obtained in several synthetic datasets and some real ones, such as the cyclic-AMP dataset (CRP). DMMA is an evolution of DMEC where we have some heuristics along with traditional evolutionary algorithm. Furthermore, the DMMA algorithm obtained a substantial gain compared to DMEC. MFMD extends the idea of DMMA and DMEC including a new mechanism of search that control the exploration vs exploitation in the search space.
For most of these approaches, the emphasis is on the application of canonical evolutionary algorithms to solve biosequence problems. Our motivation is slightly different in that we intend to use the flexibility of evolutionary algorithms in addition to the efficiency that some heuristics have. Thus, it was possible to develop strategies that are more applicable to the resolution of discovery motif problems in real situations.

Problem definition
Although there are several formulations of this problem, we will begin with the canonical and more general definition of motif discovery in the following manner.
Let S = {s 1 , s 2 , · · · , s n } be the set of sequences over = {A, C, G, T} and let w be the motif length. In this paper, we assume that the length of all sequences is equal to L and 0 < w L. The problem consists in finding the most promising pattern of subsequences X * = {x 1 , x 2 , . . . , x n } of size w and their respective initial positions in each sequence in S. The choice of a particular pattern is based on the definition of one or more score functions that measure the similarity or difference between the motifs pattern and their respective occurrences. Li et al. (1999) proved that the canonical definition of motif problem is NP-Hard even with the most simplified assumptions [19].
There are several methods for measuring the quality of the motifs. The objective functions should be able to reflect the efficiency of a modeling accurately. An inadequate evaluation function will not be able to provide a good solution even whether a strong optimization algorithm is used. We have used in this work the Information Content Score [20] and the Complexity Score [21] as objective functions.
Information Content (IC) can be interpreted as an energy estimate that a set of motifs exerts on its respective binding site as opposed to the rest of the organism's genome [1]. In other words, the IC measures the statistical difference between a motif from a specific probabilistic model or a motif from a background probabilistic model (usually inferred from the genomic sequences of a given organism). The background statistical model is typically constructed under a homogeneous Markov chain of order zero or higher. Complexity score was defined by Gary B. Fogel and Weekes [21] and penalizes sequences with low complexity, i.e., whose entropy value is very low. In general, this may disrupt the search and should be considered a noise [22].

Implementation
MFMD was developed using Java programming language release 8u111 (64-bit) and Ubuntu Linux operating system. The algorithm evolves a population of PSSM matrix and finds solutions that maximize the Information Content Score and Complexity Score using a bi-objective Weighted Sum Model. The algorithm receives as input a typical DNA dataset of co-regulated genes and returns the initial positions of the found motifs. MFMD was divided into three steps: Pre-Processing, Pattern Discovery and Pattern Matching. Figure 1 illustrates the simplified MFMD pipeline.

Preprocessing
This step aims to find and remove subsegment entries that can direct the search to invalid locations. According to D'haeseleer [6] these subsequences, called spurious [23], can contribute negatively to the performance of the search algorithms. To mitigate this problem, before the algorithm starts the pattern discovery phase, we execute DUST [24], to meet the above requirements. DUST is a tool created by R. L. Tatusov and D. J. Lipman, whose objective is to remove sub-sequences with low complexity from the dataset.

Pattern discovery
This step consists of optimizing and discovering the best PSSM matrix from an input dataset. Moreover, we have sub-divided the cycle into Initial Population Construction, Fitness Calculation, Recombination, Mutation and Selection steps.

Initial population
This step is the most important action of the algorithm in which each solution is represented by a tree-like data structure. In this structure, the nodes represent the initial (1) In Preprocessing step MFMD uses DUST to remove sub-sequences with low complexity entropy. If DUST can not be run, MFMD uses an objective function defined in [21] to mitigate this problem. (2) In Pattern Discovery step, MFMD attempts to find the best PSSM matrix using GRASP and VNS heuristics. (3) In Pattern Matching step, MFMD uses the PSSM matrix found in the previous step to predict the initial positions of the motifs in the dataset positions, where the root node represents the initial position of the first dataset sequence. In this way, the algorithm creates a tree solution for each valid position in the dataset sequence. For example, whether the dataset has 100 valid positions, then the algorithm will generate 100 trees, each with its starting position.
The total number of valid positions can be obtained by where v is the total number of valid positions, L is the size of each sequence and w is the size of a particular motif. In MFMD, solutions are built gradually with the aid of a GRASP-based heuristic. In general, this paradigm shift led the initial solutions to the most promising locations in the search space.
The modifications consist in the use of a variable q that modifies the algorithm behavior and determines whether it will make a greedy or a random choice. The multi-start function has also been disabled because in this approach GRASP is only used as a startup tool. Then, at each iteration, a number n ∈ [0, 1] is uniformly drawn, and the behavior of the algorithm follows Eq. 1: If the choice is greedy, the algorithm tests whether there are still other positions having a score equal to the best score found so far, i.e., whether there is a tie between the scores from the valid positions list. If so, all tied positions are added to the tree. If the choice is not greedy, the solutions are ranked in a Restricted Candidate List (RCL). Then a solution of the list is uniformly chosen and included in the tree.
The RCL size and the q parameter can be quite different, but in our experiments we have found the best values empirically, hence, we have used the following values RCL = 5 e q = 0.9. Finally, the algorithm is done when all initial positions are included in the tree, i.e, when the height of the tree is equal to the number of sequences minus one.
The algorithm complexity grows according to the size of the dataset. For example, whether a dataset has N sequences of length L = 30 and motifs with length w = 5 there will L + w − 1 = 26 valid positions in that dataset. Thus, the algorithm will make 26 2 comparisons between the first and second sequences, plus 26 2 comparisons between the second and third sequences, and so on.
Therefore, the final complexity of the algorithm is In the worst case the algorithm can achieve the complexity O(L N ) whether all valid positions of all dataset sequences should tie in terms of score value. However, this is extremely unlikely and in practice, we have only a few draws occurring at each iteration with complexity The objective of this approach is to establish a compromise between the need to maintain the practical computational algorithm and the desire to obtain the mathematically optimal alignment.

Fitness calculation
Fitness is calculated by converting the initial positions of each individual into a structure called Multiple Local Sequence Alignment (MLSA). From the MLSA it is possible to calculate the Position Specific Score Matrix (PSSM). The PSSM is a zero-order non-homogeneous Markov chain [7] commonly used to represent probabilistic models of motifs whose statistical independence between the different "columns" of an MLSA is assumed. That means, from a statistical point of view, the nucleic bases that form the regulatory elements do not correlate. In practice, according to Benos et al., this independence is a good approximation [25].
For a motif of size w, a PSSM takes the form of a matrix 4 × w. More details can be reviewed at [26]. The fitness of each individual was calculated using the bi-objective weighted sum model, whose functions were: Information Content (Eq. 2) and Complexity Score (Eq. 3).
Where w is the motif size, Σ is the number of letters from the alphabet ( = 4 for nucleotides), (i,j) is the matrix of the relative frequencies and (0,i) is the vector of background probabilities. The IC measures the statistical difference between a motif from a specific probabilistic model or a motif from a probabilistic background model [1]. The specific probabilistic model is constructed using a non-homogeneous Markov chain of order 0 or higher. In particular, we use the PSSM model that has zero order. The background statistical model is typically constructed under a homogeneous Markov chain of order zero or higher.
Where Σ is the number of letters from the alphabet. ( = 4 for nucleotides), N = 4 for nucleotides, w is the motif size and n i is the total number of nucleotides i ∈ A, C, G, T. The Complexity Score was defined by Gary B. Fogel and Weekes [21] and penalizes low complexity sequences, i.e., sequences whose entropy value is very low. In general, this may di srupt the search and should be considered a noise [22]. For example, the motif "aaaaaa" (n a = 6, n c = 0, n g = 0, n t = 0) will have minimal complexity since it will obtain a maximum value in Π n i . On the other hand, the motif "atacgt" (n a = 2, n c = 1, n g = 1, n t = 2) will obtain a value of complexity greater than the previous one, since the value of the function Π n i will be smaller. In this example, CS(aaaaaa) = In Eq. 4, v ∈ [0, 1] are arbitrary weight chosen in a random way. The parameter v changes the importance level of each objective function. In our experiments v = 0.8 is the value that produced the best results. This equation establishes a relation between the objective functions and the parameter v. In particular, it becomes important when is not possible to remove spurious a priori.
Since DUST runs in the preprocessing step, the MFMD can be reduced to a mono-objective algorithm running only the Information Content Score. This brings faster execution and does not compromise the accuracy of the approach. For palindromic sequences, the reverse motif complement must also be taken into account. Whether the motif is a palindrome, this predilection may lead the algorithm to more accurate results. It is important to note that when inserting the reverse complement in the (a) score calculation, the PSSM matrix becomes a symmetric matrix.

Recombination, mutation and selection
The recombination operator is applied in some individuals from the initial population P. The individuals are selected in a random way. Also, the recombination occurs between pairs called individuals parents generating child individuals that are stored in an new population called intermediate population Q. At each recombination, the algorithm calculates the scores of parents p 1 and p 2 , selects the best and puts it in p * . After the children c 1 and c 2 are generated, The score of these are also calculated and compared with p * . If F(c 1 ) < p * then the mutation occurs through the local search in c 1 using the VNS heuristic. The same situation holds true for the child c 2 . The mutation is performed through the following rule (Eq. 5): After mutation operator, populations P and Q are joined generating the R population (R = P ∪ Q). Then the R population is sorted in descending order and the first |P| solutions from R are put back to the population P.

Pattern matching
This step consists in the application of statistical techniques for the motifs recognition that were not found along the Pattern Recognition stage. In many cases, the promoter regions have more than one binding site. Therefore, it is expected that search algorithms will be able to find as many motifs as possible from a particular coregulated gene.
The MFMD assume the distribution of the final scores is a Gaussian distribution [27] of mean μ and standard deviation σ X ∼ N μ, σ 2 . The parameters of the statistical model were estimated using the PSSM matrix found in the previous step. Thus, the scores are normalized and transformed into z-scores using Eq. 6:  Where x is the raw score, μ is the mean and σ is the estimated standard deviation.
Then the p-value is calculated using the cumulative distribution function defined by Eq. 7: Where The objective is to calculate the area under the curve and find which positions have the highest statistical significance. In short, the following actions are performed in this step (

Illustrative examples
Let us consider the dataset S = {seq 1 , seq 2 } of length L = 180 from the alphabet = A, C, G, T. In addition, we have a motif size w = 11. There are L − w + 1 = 170 valid positions and, for each of them, MFMD constructs a different solution tree. Without loss of generalization and for simplification purposes we will consider that the dataset is normalized and therefore we will use Eq. 2 to calculate the scores.
Here, we introduce how the tree is generated from the first valid position (GTCTGTGGTTT) whose parameters are represented by and can be viewed in the Fig. 3a. The building of the solutions tree depends on a random number n ∈ [0, 1] and a constant q = 0.9. Whether n ≤ q the algorithm constructs the tree greedily, otherwise it uses a Restricted Candidates List RCL = 5 to choose the next node. Some predictors failed to score in these experiments because they found initial positions with a deviation greater than 2. These data are highlighted in bold Whether the choice is greedy, only the node with the best score is added to the tree. If there is a tie between the best node and the others, all tied nodes are added to the tree. Figure 3b illustrates a model where three nodes (0,49 and 80) have the same score (2 ≤ s ≤ 2.2). In this instance all three nodes would be added to the tree, as shown in Fig. 3c.
If the choice is semi-greedy, the nodes are sorted in descending order and the top five are added in the RCL. Then, a node is uniformly chosen to compose the tree as shown in Fig. 3d. It is interesting to note that if the choice is greedy, more than one node can be added to the tree whereas in semi-greedy choice only one node is added.
It is important to highlight that the ABS, SCPD, CRP and JASPAR datasets have real background data and motifs. For simplicity, at this point we will call the ABS, SCPD and CRP datasets of real datasets experiments. We have emphasized the discussion only in ABS, SCPD, CRP and JASPAR datasets since they are real, publicly available and they have been used extensively in other works.
In JASPAR were randomly selected the datasets based on their identification. Five datasets were chosen using data collected from ChIP-seq experiments. Table 1 shows a summary of these datasets. Finally, eleven datasets were used in real datasets experiments, seven extracted from the ABS site [29], three extracted from the SCPD site [30] and one extracted from the publication of Stormo and Hartzell [20]. Table 2 shows the information about these datasets.
For details and results about simulated datasets, see Additional file 1.

Evaluation methods
For each dataset, 30 tests were performed and the results obtained were compared to two other approaches: Gibbs Motif Sampler [31] and Meme (Multiple EM for Motif Elicitation) [8].
To measure the performance of each strategy, we adopted the initial position that each approach found. A position is considered correct if it equals the real or varies two units more or less. For example, if the annotated position of a given motif is 60, all of these values will be considered correct: 58, 59, 60, 61 e 62. For each experiment performed by MFMD, we calculated the mean and standard deviation of the performance measures. For the experiments performed by Meme and Gibbs Motif Sampler, values were used which showed better execution performance.
We evaluated the approaches according to the metrics of information retrieval precision, recall, and f-score [32]. These measures have a minimum value of zero and a maximum value of one, where zero represents no predicted position, and one characterizes a perfect prediction.

Rank analysis
The results were compared using the dominance method proposed by L. I. Kuncheva and J. J. Rodríguez [33]. In this system, each approach receives a score when compared to the other approaches. The dominance hierarchy is determined by the classification of methodologies according to a score calculated through the losses and victories that each approach has achieved in each f-score measure. This corresponds to the total number of times that, for example, the "A" approach was able to be better than the "B" approach minus the total number of times that the "B" approach was better than the "A" approach.
In addition, wins and losses were defined in terms of the f-score values that each strategy was able to achieve. Since the f-score represents the harmonic mean between precision and recall, the magnitude of its value is directly influenced by both measurements, i.e., a low precision value will imply a low f-score even if the recall is high. The inverse relationship is also true.

Statistical analysis
The objective of this analysis was to compare the results obtained by the MFMD with the results achieved by the other approaches using statistical methods of hypothesis testing. The purpose of this test is to indicate if there is a significant difference between them and to determine which approach presented the best performance. Statistical significance tests were performed between the differences of the f-scores by all approaches.
The analyzes consisted of the following steps: (1) sample selection: some datasets were selected to compare the statistical test. There were 2 of each synthetic group, 5 ChIPseq and 2 real datasets experiment, totaling 21 datasets; (2) statistical analysis: the analyzed parameter was the f-score calculated from the 30 executions performed in each dataset by each algorithm; (3) the Shapiro-Wilk test [34] was applied to each set of parameters. In the case of normality being verified, a paired Student's T test [35] was applied. Otherwise, the non-parametric test used was the Wilcoxon [36] paired; (3) the significance level used was 0.05 or 95%. Tables 3 and 4 illustrate the results obtained by the predictors in JASPAR and real datasets experiments, respectively. It is important to note in Table 4, that in some datasets MEME obtained zero in precision, recall and f-score measures. In particular, MEME reached this value in datasets CREB and NFKB. Also, it is evident that the deviation measured by the initial positions predicted by MEME was higher than two, leading to the true positive (TP) counts to zero. Consequently, this led to the values of precision, recall and f-score also at zero. Table 5 shows the results obtained by the approaches in the ranking analysis. Moreover, it is possible to observe that MFMD presented a higher score (ranking) in relation to the other approaches compared for all datasets analyzed. The good relationship between precision and recall evidenced that the MFMD achieved a balance between the true positives and the predicted false positives.

Results and discussion
In Table 6 all approaches are ordered according to the performance obtained in Table 5. In this case, the leftmost algorithm indicates a better performance compared to the rightmost algorithm (ordering from best to worst). From the analysis of Table 6, we can verify that MEME performed well in the JASPAR datasets. This was even more evident in the data presented in Tables 3 and 7, where we highlight the good behavior of this algorithm in the GATA3 and IRF1 datasets. On the other hand, the Gibbs Motif Sampler has obtained good results in real datasets experiments. However, MFMD still figures first in both. This demonstrates the good capability of MFMD to handle datasets of varying sizes. This is even more visible in smaller datasets, as shown in Table 4, where the MFMD performed considerably better than MEME and Gibbs Motif Sampler. The MCB, PDR3 and NF-Kb are the smallest real datasets, having 6, 7 and 6 sequences respectively. MFMD ties with Gibbs Motif Sampler in NF-Kb and wins both in the others. In this context, with less number of samples, the estimation of the probabilistic model loses precision, but MFMD was able to recognize a greater number of motifs. In general, the best performance achieved by MFMD can be attributed to its optimization architecture and the most effective way that its heuristics are applied, allowing to explore the search space more efficiently and thus achieving better results. Table 7 shows the result of the statistical test performed with the f-scores obtained by each approach.
The following experiments were conducted: MFMD vs Gibbs Motif Sampler and MFMD vs MEME and the results were presented as follows: (+) there is statistical difference favorable to MFMD; (=) there is no statistical difference; and (-) There is statistical difference unfavorable to MFMD. The statistical test corroborates the results presented in Table 6 (ranking) where MFMD obtained an advantage in relation to the other approaches.
MFMD uses in construction step q = 0.9. Whether q = 1 then the algorithm is greedy. On the other hand, whether q = 0, then the algorithm is random. While low values of q promote randomness and consequently lowquality solutions, high values of q lead the algorithm to local optima. The same is true of RCL. If RCL = 1, then the algorithm becomes greedy, even though q = 0. Conversely, if RCL = L − w + 1, the algorithm becomes random. It is important to note that in both cases a compromise must be found between randomness and greediness.
The significance level used in the Pattern Matching step was 0.0001. The addition of this value would lead to greater permissibility to the method, increasing the number of predicted false positives. On the other hand, its decrease would leave the approach more "rigid" and consequently a smaller number of true positives would be observed. Therefore, the correct adjustment of this parameter directly implies the prediction quality of the algorithm.
Although all the programs compared in this work are based on probabilistic models, there are considerable differences in the results obtained due to the size of the search space and the existence of a large number of possible solutions. Optimization algorithms, such as MEME for example, can optimize the statistical models locally. However, the inherent multi-modality of the search space, in general, does not allow purely local driven optimization procedures to explore many different solutions. The MFMD architecture allows greater flexibility of search engine space because it applies an evolutionary process to a population of possible candidate solutions.
Finally, Figs. 4 and 5 compare the logos obtained by MFMD in the JASPAR and real datasets experiments with the logos generated from the real motifs. In them, we can see that the logos generated by the MFMD is very similar to the real logos.

Conclusions
In this work we propose a new algorithm for the motif discovery in DNA sequences using local search and evolutionary algorithms as an optimization strategy.
The proposed approach, called MFMD, starts from a population of gradually generated motifs and performs an extensive search through operations such as recombination, mutation, and local search.
To demonstrate the efficiency of MFMD, several experiments were carried out in four groups of datasets: simulated datasets; JASPAR (datasets and motifs extracted from ChIP-seq experiments) and real datasets experiments. Through the comparisons made between the MFMD and other approaches found in the literature, it can be concluded that the MFMD was able to achieve better results in most of the experiments in all datasets.
Although there are several more robust probabilistic models than PSSM, such as Dinucleotide Weight Matrices (DWM) [37] and Transcription Factor Flexible Models (TFFM) [38], the objective of this work was to highlight the efficiency of the hybrid evolutionary approach in relation to approaches Literature. In future works, we intend to investigate other forms of representation. While there is a considerable effort in the scientific community, it remains a complex challenge for computational biologists to predict convincing regulatory elements in DNA sequences.
Current paradigms of motif discovery can be seen as an approximation of biological reality, although recent efforts have sought to include correlation between motif positions [39], phylogenetic information [40], and synergistic relationships among transcription factors [41]. As the complexity of these models increases, the need arises to develop increasingly sophisticated algorithms that can find optimal solutions for these models and this will become increasingly important over time.