 Software
 Open access
biomapp::chip: large-scale motif analysis
BMC Bioinformatics volume 25, Article number: 128 (2024)
Abstract
Background
The discovery of biological motifs plays a fundamental role in understanding regulatory mechanisms. Computationally, motifs can be efficiently represented as k-mers, making the counting of these elements a critical aspect for ensuring not only the accuracy but also the efficiency of the analytical process. This is particularly useful in scenarios involving large data volumes, such as those generated by the ChIP-seq protocol. Against this backdrop, we introduce biomapp::chip, a tool specifically designed to optimize the discovery of biological motifs in large data volumes.
Results
We conducted a comprehensive set of comparative tests with state-of-the-art algorithms. Our analyses revealed that biomapp::chip outperforms existing approaches in various metrics, excelling both in terms of performance and accuracy. The tests demonstrated a higher detection rate of significant motifs and also greater agility in the execution of the algorithm. Furthermore, the smt component played a vital role in the system's efficiency, proving to be both agile and accurate in k-mer counting, which in turn improved the overall efficacy of our tool.
Conclusion
biomapp::chip represents a real advancement in the discovery of biological motifs, particularly in large data volume scenarios, offering a relevant alternative for the analysis of ChIP-seq data, and it has the potential to boost future research in the field. This software can be found at the following address: (https://github.com/jadermcg/biomappchip).
Background
Motifs are subsequences of length \(k\) that occur with high frequency in a set of sequences of dna, rna, or proteins. Formally, consider an alphabet \(\Sigma\), which can be \(\{A, C, G, T\}\) for dna, \(\{A, C, G, U\}\) for rna, or a set of amino acids for proteins. A sequence \(s\) is an ordered list of symbols from \(\Sigma\) and a motif \(m\) is a subsequence of \(s\) such that \(m \in \Sigma ^k\), where \(k\) is the length of the motif. The discovery of motifs involves identifying these subsequences that frequently occur in a set of sequences, possibly with some variations [4, 5].
It is important to note that motifs can be represented through k-mers (contiguous subsequences of a given size k), making the efficient counting of these structures highly relevant. The representation of motifs as k-mers allows for the simplification of complex computational problems, enabling efficient counting algorithms to be applied as a preliminary step in identifying biologically significant sequences [18].
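As a concrete illustration of this preliminary counting step, the minimal C++ sketch below enumerates and counts every contiguous k-mer of a sequence with a plain hash map. This is for exposition only: biomapp::chip itself relies on the smt structure, and the function name here is ours.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Enumerate and count every contiguous k-mer of a sequence with a plain
// hash map (illustrative sketch; biomapp::chip uses the smt instead).
std::unordered_map<std::string, int> count_kmers(const std::string& seq,
                                                 std::size_t k) {
    std::unordered_map<std::string, int> counts;
    for (std::size_t i = 0; i + k <= seq.size(); ++i)
        ++counts[seq.substr(i, k)];   // one window per valid start position
    return counts;
}
```

For instance, counting 4-mers in ACGTACGT yields four distinct 4-mers, with ACGT counted twice (it starts at positions 0 and 4).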
In motif discovery, there are two main approaches: enumerative techniques and probabilistic techniques. Both have their own advantages and disadvantages, and the choice between them depends on the specific needs of the analysis at hand. Enumerative techniques, as the name suggests, aim to identify motifs by exhaustively enumerating all possible subsequences within a set of sequences. These methods are generally more accurate as they consider all possibilities. However, they are computationally intensive and may not be feasible for large volumes of data or for motifs of significant length [10].
On the other hand, probabilistic techniques such as Gibbs Sampling (gs) and Expectation Maximization (em) employ statistical models to estimate the presence of motifs. These methods are generally faster in handling large volumes of data; however, they are sensitive to initial conditions and model parameters, which may compromise accuracy. Both approaches have their merits: while enumerative techniques are typically more accurate, probabilistic techniques are more time-efficient. Nevertheless, it is possible to integrate these two approaches to leverage the advantages of both. For example, one could use an enumerative approach to filter candidates and a probabilistic approach for final optimization [15].
This interconnection between k-mer counting and motif discovery highlights the importance of efficient algorithms in both domains for the acceleration and enhancement of bioinformatics analyses. In this context, algorithms for k-mer counting serve as initial tools in the processing chain for discovering biological motifs, especially in analyses that involve large volumes of data, such as ChIP-seq (Chromatin Immunoprecipitation followed by Sequencing) data.
The ChIP-seq protocol is a modern technology that allows the identification of dna-protein interactions on a genomic scale. This method has become an indispensable tool in the discovery of biological motifs, as it provides a comprehensive mapping of protein-binding regions throughout the genome. One of the major challenges of ChIP-seq is the substantial volume of data generated. The handling, storage, and analysis of such data require robust computational infrastructure and efficient algorithms, as traditional motif discovery techniques may not be suitable for such a magnitude of data [9].
In addition to the data volume challenge, complexity and the presence of noise are also significant factors. The ChIP-seq protocol is susceptible to various types of artifacts and noise that can introduce errors in motif identification. Therefore, robust preprocessing and filtering methods are required before the motif discovery stage itself. Many current motif discovery algorithms were not designed to handle the challenges imposed by ChIP-seq, such as large data volumes and complexity in sequence structure. This often results in a trade-off between accuracy and computational efficiency [12].
In this context, biomapp::chip (Biological Application for ChIP-seq data) is designed to address these challenges. Our algorithm adopts a two-step approach for motif discovery: counting and optimization. In the counting phase, the smt (Sparse Motif Tree) is employed for efficient k-mer counting, enabling rapid and precise analysis. For the optimization stage, biomapp::chip employs an enhanced version of the em algorithm, aimed at improving accuracy in motif identification.
Implementation
Implemented in C++ and R, biomapp::chip combines analytical and numerical methods optimized for in-depth data treatment, particularly data derived from ChIP-seq experiments. The objective of this approach is to build effective solutions that assist researchers and experts in the field of molecular biology, more specifically in the study of conserved sequences, thereby facilitating complex investigations that involve sequence motif analysis. Through the appropriate combination of methods, some already established and others specifically created for our framework, biomapp::chip offers a robust and adaptable tool for the scenario of functional genomics. Figure 1 illustrates the general pipeline, all the framework modules, and their respective interactions.
As illustrated in Fig. 1, the framework initiates its process with the preprocessing of the input sequences. These are then forwarded to the smt, responsible for the efficient counting of kmers, generating seeds that have already gone through an initial stage of optimization as a result. These seeds are subsequently refined by the fastem algorithm. The architecture is modular and composed of seven main modules: utils, smt, oops, zoops, anr, em, and biomapp::chip. The utils module acts as the backbone of the system, providing essential functionalities used by all the other modules.
The smt module is developed on the foundation provided by the utils, and the oops, zoops, anr, and em modules are likewise constructed on this foundation. In addition to being supported by the utils, the em module has interdependencies with the oops, zoops, and anr modules. The biomapp::chip module, in turn, represents the highest level of complexity as it integrates the functionalities of all the previous modules to execute its operations. Parallelization is supported in all modules through the openmp (https://www.openmp.org/) and thread building blocks (https://www.intel.com/content/www/us/en/developer/tools/oneapi/onetbb.html) libraries. The source code of biomapp::chip is available at: https://github.com/jadermcg/biomappchip.
Preprocessing
Before delving into the method description, it is important to understand how the data are acquired. They can be obtained from various public and private sources. For more information on the ChIP-seq databases used in this work, please refer to the "Results and discussion" section. The data acquisition process can be summarized in the following steps:

1. Experiment execution: the proteins of interest are immunoprecipitated along with the associated dna fragments.

2. Raw data processing: the raw sequencer data are processed to obtain short dna sequences, called reads.

3. Peak identification: specific programs, such as macs [21] or sicer [20], are used to identify regions of read enrichment, called peaks.

4. Preprocessing: removal of peaks in nonspecific regions, peak filtering based on quality criteria, data normalization, and removal of spurious peaks.

5. Analysis: performing the analysis of interest, which may include motif extraction, functional annotation, regulatory network analysis, conservation analysis, among others.
Steps 1 and 2 are part of the biological experiment and are generally conducted by a biologist. Step 3, called enrichment analysis, aims to identify the peak regions, ranging from 100 to 300 bp, where there is a high probability that the fragments of interest are present. Step 4 is critical as it involves the removal of repetitive and low-complexity sequences. The final step consists of the analysis of the preprocessed data.
biomapp::chip starts operating at the end of step 3, where it receives as input the peak regions properly extracted from the genome. In step 4, specialized algorithms are used for preprocessing, and step 5 is fully implemented. biomapp::chip consists of three main phases: preprocessing, initialization, and optimization.
The preprocessing phase involves adapting the dataset for the execution of experiments, such as standardizing the size of the sequences and converting all characters to uppercase. However, the main goal of this stage is to identify low-complexity sequences, transposable elements such as alus (Arthrobacter luteus) and sines (Short Interspersed Nuclear Elements), E. coli insertions, or any other components that could lead to inaccurate conclusions in subsequent steps. This is carried out with the help of specialized algorithms like dust [19] and repeat masker [17]. This phase also involves the elimination of redundant sequences and error correction, ensuring the quality and integrity of the data.
Counting
The counting phase takes as input preprocessed peak regions, where it is presumed that there are fragments from the distribution of interest surrounded by background distribution elements. The purpose of the initialization algorithm is to efficiently identify, count, and group enriched k-mers, as well as to perform hypothesis tests to determine their statistical significance. The core of this stage consists of two components specifically developed for this purpose: (i) smt: a scalable data structure based on suffix trees, responsible for storing all k-mers from the main dataset; (ii) kdive: an efficient algorithm whose objective is to search for fragments with up to d degrees of mutation.
The counting phase follows a set of steps to identify enriched k-mers in the main dataset and generate initial models based on this information. First, the algorithm loads the preprocessed data, which contains the nucleotide sequences of interest. Next, if available, control sequences are loaded. Otherwise, they are generated by shuffling the main dataset using the Markov method [6] or the Euler method [1]. A background probabilistic model is then generated from the control data, which will serve as a reference for identifying enriched k-mers. The value of k is then estimated, or a predefined value is used, depending on the approach adopted. Based on the peak regions, the smt data structure is built, allowing the efficient extraction of enriched k-mers.
SMT
The smt is represented by a two-dimensional data structure \(M^{\nu \times 6}\), where \(\nu\) is the number of nodes, implemented from fixed-width text fragments. It is lightly inspired by Room squares theory [2], in which each element \(M_{i,j}\) is either empty or has a value set between 1 and \(\nu\). Its main goal is to efficiently store all k-mers belonging to the main dataset, along with their representation and count, allowing for rapid searches. Thus, each row of \(M\) represents a node \(\{1, 2, 3, \ldots , \nu \}\), columns 1 to 4 represent the nucleotides, and the last two columns, 5 and 6, represent the number of times a fragment appeared in the dataset and its respective address or numeric representation. Figure 2 graphically shows how this process occurs.
In panel (a), we observe the extraction of six overlapping k-mers from the ACGTACGAT dna sequence, where each k-mer is visually distinguished by a specific color. This color overlay method makes it easy to follow how the k-mers are interspersed throughout the sequence. Moving forward to panel (b), these k-mers are integrated into the smt. The structure of the smt is represented by a matrix, in which k-mers are positioned according to their corresponding nucleotide sequences. Each cell in the matrix is color-coded according to the k-mer it represents, creating a visual mapping that reflects the composition and frequency in the original dna sequence. The columns in the matrix marked with the @ symbol record the addresses, indicating the specific location of each k-mer in the dna sequence from which they were extracted. Conversely, columns marked with the \(\#\) symbol account for the frequency, providing a quantitative means to assess the recurrence of each sequence within the dataset. Notably, the fully colored lines within the matrix denote the so-called totalization nodes of the smt. These nodes are fundamental to the tree structure, as they aggregate counts and consolidate their respective positioning information. Through this mechanism, the smt captures the diversity and distribution of k-mers in the sequence and also provides a cumulative summary that is essential for subsequent analyses.
For example, if \(M_{1,T} = 10\), then there is an edge labeled T leading from node 1 to node 10. Its construction has time complexity \(O(nk)\), where \(k\) is the fixed size and \(n\) is the number of fragments. However, \(k\) does not scale with \(n\) and has a well-defined limit falling between \(5 \le k \le 35\), so we can simplify the analysis to \(O(n)\), which is linear in \(n\). Furthermore, it is possible to run several algorithms on the smt in linear time, such as exact search and approximate search, which, for example, allow quick recovery of the most abundant k-mers in the dataset. The approximate search is performed by the kdive algorithm, which will be presented later. Algorithm 1 shows how the smt is constructed.
In line 3, the root (root) is set to the value 1 and the new node (new_node) as 2. In line 5, all the sequences \(\{s_1, s_2, s_3, \ldots , s_n\}\) are processed starting from the root, each of which has a width of k. In line 9, the algorithm checks if M[node, symbol] is equal to 0. If this occurs, it means that there is no edge with the symbol defined by the variable symbol leaving from the variable node and going towards another node. In other words, it is necessary to add a new node at this position, which is done in lines 10 and 11. Finally, the variable node takes the value of the new node (new_node) and in line 12, the new node is incremented. In line 13, if M[node, symbol] is different from 0, then the node already exists, and in line 14, the variable node is simply updated with the value represented by \(M[\text {node, symbol}]\).
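The logic described by Algorithm 1 can be sketched in C++ as follows. This is a dense, illustrative rendering: the struct, method names, and column layout (columns 0-3 for A, C, G, T child pointers, column 4 for counts, column 5 reserved for addresses) are our assumptions, whereas the actual implementation uses a sparse armadillo matrix.

```cpp
#include <array>
#include <string>
#include <vector>

// Minimal dense sketch of the smt construction (Algorithm 1).
// Row = node; columns 0-3 = A,C,G,T child pointers; column 4 = count;
// column 5 = address (kept for fidelity, unused in this sketch).
struct Smt {
    std::vector<std::array<long, 6>> M;
    long root = 1, new_node = 2;

    Smt() : M(2) {}  // row 0 unused; row 1 = root (all cells zeroed)

    static int idx(char c) {  // map nucleotide to its column
        switch (c) { case 'A': return 0; case 'C': return 1;
                     case 'G': return 2; default:  return 3; }
    }

    // Insert one k-mer, creating nodes along missing edges (lines 9-12
    // of Algorithm 1) or following existing ones (lines 13-14), then
    // incrementing the count at the final (totalization) node.
    void insert(const std::string& kmer) {
        long node = root;
        for (char c : kmer) {
            int j = idx(c);
            if (M[node][j] == 0) {          // no edge: create a new node
                M.push_back({});
                M[node][j] = new_node;
                node = new_node++;
            } else {                        // edge exists: follow it
                node = M[node][j];
            }
        }
        ++M[node][4];                       // count column
    }

    // Exact lookup: returns how many times the k-mer was inserted.
    long count(const std::string& kmer) const {
        long node = root;
        for (char c : kmer) {
            node = M[node][idx(c)];
            if (node == 0) return 0;
        }
        return M[node][4];
    }
};
```

Inserting ACGT twice and ACGA once exercises both the shared-prefix path and the count update at the final node.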
The space complexity of smt in the worst case is \(O(\sigma \nu )\), where \(\sigma\) is the number of symbols in the alphabet, in this case 4, and \(\nu\) is the number of nodes in the tree. The value of \(\nu\) directly depends on the size and variance of the fragments. Thus, the higher the variance, the greater the number of nodes. An inherent characteristic of smt is that most elements in M remain empty. This behavior is recurrent and can be calculated. In particular, we can compute the expected occupancy level of M. To do so, consider that the maximum and minimum number of children a node can have is 4 and 0, respectively. The average would then be \(\frac{4 + 0}{2} = 2\) children per node.
A tree with depth k, with each node having an average of 2 children, will have an average of \(2^k\) leaves. The average number of nodes in the tree will be \(\nu = \sum _{i=0}^{k} 2^i = 2^{k+1} - 1\). In this case, we will need a matrix M with \(2^{k+1} - 1\) rows and 4 columns (plus two more columns for metadata). However, each row has 4 columns and on average only 2 of them will be occupied. With this, we can calculate the occupancy level as \(\omega = \frac{2 \nu }{4 \nu } = 0.5\). In practice, this number will be even smaller, as the average number of children decreases as the depth of the tree increases. This way, the average space complexity of M is \(O(\omega \nu )\); since \(\omega < \sigma\), it follows that \(O(\omega \nu ) < O(\sigma \nu )\). Thus, we can expect that only half of the capacity of \(\texttt {M}\) will be utilized in a dense representation, which characterizes a significant waste of space. For this reason, the \(\textsc {createsmt}\) algorithm was implemented using a high-performance sparse matrix through the \(\textsc {armadillo}\) linear algebra library [16], employing the \(\textsc {C++}\) programming language.
Example
To add the k-mer ACGT to the smt, we start at the root of the matrix, which represents the starting point for all k-mers. Each node in the matrix, or row, can be considered a decision point, which directs to the next node based on the next nucleotide. We insert the first nucleotide, A, checking the corresponding cell in the root row. If there is no previous entry (indicated by a zero), this means that A is a new path and a new node is created to represent it, with its index recorded in cell A of the root.
Now, moving to the new node, we repeat the process for the second nucleotide, C. Likewise, the absence of a node for C indicates that we must add a new node, connecting it to the path starting with A. This addition process is sequential and continues for G and T, creating a unique path within the smt for the ACGT fragment. When we reach the end of the k-mer, we update the count and address at the final node to reflect the new addition. When two or more fragments are identical or share a common prefix, the smt uses this shared structure to save space and facilitate parsing. For example, if the k-mer ACGT is already in the smt and we come across it again in the sequence, we simply increment the count at the final node, without the need to add new nodes for this repeated fragment.
In the case of k-mers that share a prefix, such as ACGA and ACGT, the smt takes advantage of common nodes and diverges only in the last step, creating new paths for the distinct final nucleotides. This is particularly efficient because the paths that are shared do not need to be duplicated. Each k-mer has its own count and address register, allowing the smt to accumulate information in a compact and structured way. Totalization nodes aggregate the counts of all k-mers that pass through that point, providing a cumulative count along different paths.
KDIVE
While kdive stands as the most important algorithm for analyzing conserved sequences in the context of smt, it is not the only option available. Other algorithms that operate on the smt include ksearch, iupacsearch, and khmap. Each of these methods offers distinct advantages and functionalities, and they are discussed in detail in the supplementary data.
kdive was created with the objective of performing efficient text fragment searches in the \(\textsc {smt}\), even if the fragments present up to \(\texttt {d}\) mutations. In other words, \(\textsc {kdive}\) should return true for the search even if the query \(\textsc {string}\) contains up to \(\texttt {d}\) \(\textsc {mismatches}\). It is important to highlight that the algorithm assumes as a premise that the probability of a mutation occurring is the same for any position in the sequence. For example, if the probability of a \(\textsc {match}\) is equal to \(\frac{1}{4}\), then that of a \(\textsc {mismatch}\) is \(1 - \frac{1}{4} = \frac{3}{4}\). Therefore, the occurrence of a mutation is three times more likely than that of a base conservation. This characteristic is important because it alters the expected number of computations carried out and consequently modifies the algorithm's complexity. Algorithm 2 shows the basic functioning of \(\textsc {kdive}\).
Algorithm 2 was implemented recursively and takes as its main parameters the \(\textsc {smt}\) matrix, the fragment \(\texttt {s}\), and the total number of allowed mutations \(\texttt {d}\). If \(s \in M\) with at most \(\texttt {d}\) mutations, the algorithm returns the boolean value \(\textsc {true}\), indicating the presence of the fragment with up to \(\texttt {d}\) mutations. The algorithm performs its task by traversing the smt to identify matches between the search string and the stored kmers, taking into account the number of allowed mutations. It is worth noting that this algorithm can be easily adapted to return the complete set of mutated elements, instead of just a boolean value (true or false). This modification can be useful in certain applications, as it allows for more detailed information about the matches found. For example, it is possible to identify which kmers in the smt correspond to the search fragment, considering the allowed mutations.
The use of a recursive algorithm in the implementation of kdive stems from several strategic and technical considerations. Tree data structures, such as the smt, benefit from the recursive approach because of how naturally recursion allows one to navigate nodes and branches. Code simplicity is another significant advantage: recursion offers a more readable and concise implementation compared to iterative methods that would require explicit stack or queue management. Furthermore, recursion aligns well with the divide-and-conquer paradigm, dividing the problem into manageable subproblems and handling them individually in an efficient manner.
This approach simplifies implementation and also improves code clarity and maintainability. The divide and conquer nature of recursion is particularly useful in scenarios with multiple choices or branches, common in structures like smt where each node can lead to multiple paths. Efficiency in dealing with these multiple branches is one of the fundamental reasons for choosing a recursive method. Furthermore, recursion facilitates backtracking, allowing the algorithm to explore alternative paths in an orderly and efficient manner, essential in search and optimization processes such as the one described by kdive.
Although recursion carries the risk of stack overflow and overhead associated with function calls, these concerns are mitigated by the fact that the problem is well bounded and the depth of recursion is controlled by the size of the kmer, imposing a natural limit. Thus, the advantages of recursion, especially in terms of simplicity, readability and suitability for tree structures, justify its choice for kdive, providing a robust and effective method for analyzing kmers in smt.
The parameters \(d_{\text {max}}\), d, i, and k allow for the control of searching for exact or approximate matches. The recursion enables the algorithm to traverse the tree in depth and check all possible paths until it finds a match or reaches the maximum depth defined by the parameter \(d_{\text {max}}\). A possible call to Algorithm 2 could be kdive(M, s, \(d_{\text {max}} = 2\), d = 0, i = 1, k = 10, node = root, resp = false). In this case, the algorithm expects to find a match even if up to \(d=2\) mutations are detected in sequences of size \(k=10\). Algorithm 2 has two base cases, shown on lines 1 and 4. The first checks if the number of mutations has exceeded the limit \(d_{\text {max}}\), that is, if \(d \ge d_{\text {max}}\). If this condition is met, a pruning will occur in the search. The second base case checks if the algorithm has completely analyzed the query string, which will happen if \(i \ge k\). If this condition is true, then the pattern has been found and the \(resp\) variable is updated to the value true.
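The recursion can be sketched as follows, assuming the smt-like matrix layout used earlier in the text (columns 0-3 for A, C, G, T child pointers, column 4 for counts). The sketch prunes once the mutation budget \(d_{\text{max}}\) is exceeded, succeeds when the whole query has been consumed, and returns only a boolean; the helper names and the free-function layout are our assumptions, not the tool's API.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// smt-like matrix: row = node, columns 0-3 = A,C,G,T children, 4 = count.
using Matrix = std::vector<std::array<long, 6>>;

int idx(char c) {
    switch (c) { case 'A': return 0; case 'C': return 1;
                 case 'G': return 2; default:  return 3; }
}

// Helper to populate the matrix with one k-mer (root is node 1).
void insert(Matrix& M, long& new_node, const std::string& kmer) {
    long node = 1;
    for (char c : kmer) {
        int j = idx(c);
        if (M[node][j] == 0) { M.push_back({}); M[node][j] = new_node; node = new_node++; }
        else node = M[node][j];
    }
    ++M[node][4];
}

// Recursive approximate search: at each depth, try every existing edge,
// paying one mutation whenever the edge symbol differs from s[i].
// Base case 1: prune when the budget d_max is exceeded.
// Base case 2: the whole query was consumed within budget -> match.
bool kdive(const Matrix& M, const std::string& s, int d_max,
           int d, std::size_t i, long node) {
    if (d > d_max) return false;        // pruning base case
    if (i == s.size()) return true;     // pattern found
    for (int j = 0; j < 4; ++j) {
        long child = M[node][j];
        if (child == 0) continue;       // no edge with this symbol
        int cost = (j == idx(s[i])) ? 0 : 1;
        if (kdive(M, s, d_max, d + cost, i + 1, child)) return true;
    }
    return false;
}
```

With only ACGT stored, an exact query succeeds with \(d_{\text{max}} = 0\), a one-mismatch query such as ACGA succeeds with \(d_{\text{max}} = 1\), and a two-mismatch query such as AGGA is pruned.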
The time complexity of kdive can be measured through the Negative Binomial distribution. Consider an smt with \(\nu\) nodes, \(d\) mutations, and fragments of width \(k\). Also consider the random variable \(X \sim NB(d, p)\), which counts the number of comparisons that the kdive algorithm performs. It is easy to verify that \(p = \frac{3}{4}\): if the probability of a match is \(\frac{1}{4}\), the probability of a mismatch will be \(1 - \frac{1}{4} = \frac{3}{4}\), therefore \(X \sim NB(d, \frac{3}{4})\). The PMF \(P_x(d, p)\) is given by Eq. 1 and the expectation by Eq. 2.
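Equations 1 and 2 are not reproduced in this excerpt; under the parameterization the text suggests, where \(X\) counts the number of trials until the \(d\)-th mismatch and each mismatch occurs with probability \(p\), the standard Negative Binomial PMF and expectation would read:

```latex
P_X(x; d, p) = \binom{x-1}{d-1}\, p^{d} (1-p)^{x-d},
\qquad x = d,\, d+1,\, d+2,\, \ldots
\tag{1}
```

```latex
E(X) = \frac{d}{p}
\tag{2}
```

This expectation is consistent with the value \(E(X) = \frac{d}{p} = \frac{4d}{3}\) used in the complexity analysis below.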
The Negative Binomial distribution is a generalization of the geometric distribution, which is defined as \(X \sim GEOM(p) = NB(1, p)\). Therefore, the geometric distribution is the Negative Binomial distribution with \(d = 1\), making it possible to use it for calculating the expected value. For example, consider \(X \sim NB(1, p)\), \(Y \sim NB(1, p)\), and \(Z = X + Y\). If \(X\) and \(Y\) are independent, then \(Z = X + Y \sim NB(2, p)\). Therefore, we can calculate \(E(Z) = E(X) + E(Y) = \frac{1}{p} + \frac{1}{p} = \frac{2}{p}\). In this way, \(X \sim NB(d, p)\) can be written as \(Z = X_1 + X_2 + X_3 + \cdots + X_d\), where \(X_i \sim NB(1, p)\). Calculating \(E(Z) = E(X_1) + E(X_2) + E(X_3) + \cdots + E(X_d) = \frac{1}{p} + \cdots + \frac{1}{p} = \frac{d}{p}\).
The time complexity in the worst case will still be O(k), but on average \(\frac{d\times 4}{3}\) comparisons will be executed. With this result, we can write that the average time complexity of kdive is \(O\left( \frac{d \times 4}{3}\right) = O(d)\). It is easy to verify that the larger the size of \(d\), the more operations will be necessary. It is important to highlight that motifs rarely exceed a width of 20 nucleotides, and the number of mutations is generally no more than 20% of their size. Considering this, we can expect to find an average of \(d = 5\) mutations, which results in \(\frac{5 \times 4}{3} \approx 6.666\) comparisons per fragment.
We can calculate the complexity of the kdive algorithm in relation to the size of the input sequences. Consider a dataset with \(n\) sequences of width \(t\). We have a total of \(m = t - k + 1\) fragments of size \(k\) in each sequence of size \(t\). Let \(X\) be the number of comparisons made for each fragment. We know that \(X\) follows a negative binomial distribution with parameters \(p = \frac{3}{4}\) and \(d\), and that \(E(X) = \frac{d}{p} = \frac{4d}{3}\). Therefore, the number of comparisons per sequence will be \(\frac{4dm}{3}\), resulting in an average asymptotic behavior of \(O(dm)\).
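These expectations are simple enough to check numerically. The small C++ sketch below works through the figures quoted in the text (function names are ours):

```cpp
// Expected comparisons under E(X) = d / p with p = 3/4.
// Per fragment: 4d/3. Per sequence of width t, with m = t - k + 1
// fragments of size k: 4dm/3.
double expected_per_fragment(int d) { return 4.0 * d / 3.0; }

double expected_per_sequence(int d, int t, int k) {
    int m = t - k + 1;                    // fragments per sequence
    return expected_per_fragment(d) * m;  // 4dm / 3 on average
}
```

With \(d = 5\), the per-fragment expectation is \(\frac{20}{3} \approx 6.67\) comparisons, matching the figure above; a sequence of width 109 with \(k = 10\) contributes 100 fragments, hence roughly 667 comparisons on average.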
Optimization
The aim of this stage is to optimize the initial seeds obtained in the previous phase through the em algorithm. The use of this algorithm allowed refining the parameters associated with the seeds, making them more accurate representations. This optimization process is important for the quality of the resulting models and for the success of subsequent analyses.
In this phase, preoptimized seeds obtained from the smt are used as a starting point for running the em algorithm. The em algorithm is responsible for refining these seeds, adjusting their parameters in order to maximize the likelihood of the given observations. As a result of this phase, each seed is transformed into a pwm model, which represents a more accurate and adjusted description of the motif in question. This pwm model can then be employed on new data to find patterns that have not yet been labeled.
To select the appropriate variant of the em algorithm for ChIP-seq data, it is important to consider the protocol's nuances, which include the presence of noise and nonspecific signals. While ChIP-seq aims to capture sequences with the motif of interest, not all peaks may contain it due to various factors. Therefore, the zoops model is generally the most fitting for motif analysis in ChIP-seq data, allowing for the motif's occasional absence. Depending on the biological context, the anr or oops models may also be relevant.
FASTEM
fastem represents a significant optimization over the traditional oops and zoops models. This algorithm was inspired by the implementation provided by [7, 8]; our modification not only speeds up the execution time but also enhances efficiency in handling larger data sets. In other words, fastem enhances the capabilities of the original models, making them more adaptable and robust when faced with large volumes of information.
The calculation of marginal probabilities represents one of the most computationally demanding phases in the em algorithm. This stage requires computing the probability of a k-mer being found at each of the \(n \times (t - k + 1)\) valid positions, given the \(n\) input sequences. This process can become a significant bottleneck, especially in ChIP-seq experiments, where a large volume of sequences is often dealt with, in turn making the value of \(n\) considerably high. Although it is possible to mitigate this computational challenge by using only a subset of the sequences as a sample, this approach compromises the predictive power of the model.
To understand its workings, we need to recap some concepts. Bayes' rule for the em algorithm states: \(P(p_j \mid s_i) = \frac{P(s_i \mid p_j) \times P(p_j)}{P(s_i)}\), which can be interpreted as: "the probability of a motif existing at position \(p_j\), given that we are observing sequence \(s_i\), is the likelihood of \(s_i\) given \(p_j\) times the prior of \(p_j\), divided by the marginal probability of \(s_i\)." The marginal probability can be written as the weighted sum: \(P(s_i) = P(s_i \mid p_1) P(p_1) + \cdots + P(s_i \mid p_m) P(p_m)\). If we consider that a motif can be in any position with equal likelihood, then this sum simplifies (up to the constant factor \(\frac{1}{m}\)) to: \(P(s_i) = P(s_i \mid p_1) + \cdots + P(s_i \mid p_m)\).
To compute each term of the marginal probability, it is necessary to employ both the positive model \((\alpha )\) and the negative model \((\beta )\) through the following equation: \(P(s_i \mid p_u) = \prod _{j=1}^{u} P(s_{ij} \mid \beta ) \prod _{j=u+1}^{u+k} P(s_{ij} \mid \alpha ) \prod _{j=u+k+1}^{m} P(s_{ij} \mid \beta )\), for \(1 \le u \le m\). In other words, this equation must be run for all \(m\) valid positions in each sequence. The fastem algorithm does this by computing \(P(s_i \mid \beta ) = \prod _{j=1}^{m} P(s_{ij} \mid \beta )\) just once. Then, for each \(1 \le u \le m\), this value is divided by \(P(p_u \mid \beta ) = \prod _{j=u}^{u+k} P(s_{ij} \mid \beta )\) and multiplied by \(P(p_u \mid \alpha ) = \prod _{j=u}^{u+k} P(s_{ij} \mid \alpha )\). To work on a logarithmic scale, simply replace multiplications with additions and divisions with subtractions. This simple optimization, when combined with multithreading techniques, makes the fastem algorithm significantly faster and, therefore, better suited for handling large volumes of data.
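A compact sketch of this precomputation, in log scale, might look as follows. The data layout (nucleotide codes 0-3, a \(k \times 4\) log-probability matrix for the positive model \(\alpha\), a 4-vector for the background \(\beta\)) and the function name are our assumptions, not the tool's actual interface.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// fastem-style marginal terms in log scale: compute log P(s | beta) for
// the whole sequence once, then for each start u subtract the background
// window and add the motif window (subtraction/addition replace the
// division/multiplication of the linear-scale formulation).
std::vector<double> log_marginal_terms(
        const std::vector<int>& seq,                       // codes 0-3
        const std::vector<std::array<double, 4>>& motif,   // k x 4 log-pwm
        const std::array<double, 4>& bg) {                 // background logs
    const std::size_t t = seq.size(), k = motif.size();

    // log P(s | beta) over the whole sequence, computed only once.
    double full_bg = 0.0;
    for (int c : seq) full_bg += bg[c];

    std::vector<double> out;
    for (std::size_t u = 0; u + k <= t; ++u) {     // all m = t - k + 1 starts
        double term = full_bg;
        for (std::size_t j = 0; j < k; ++j) {
            term -= bg[seq[u + j]];                // remove background window
            term += motif[j][seq[u + j]];          // add motif window
        }
        out.push_back(term);
    }
    return out;
}
```

Each output entry plays the role of \(\log P(s_i \mid p_u)\); a convenient sanity check is that when the motif model equals the background, every term collapses to the full background log-probability.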
To more intuitively understand the optimization provided by fastem, we can consider the calculation of marginal probabilities as the most expensive optimization operation. In essence, the algorithm seeks to understand the probability of each kmer in all possible positions, a considerable computational challenge when dealing with a large number of sequences. The efficiency of fastem lies in its ability to simplify this process without compromising the integrity of the results. Instead of repeatedly calculating complex probabilities for each position and kmer, fastem makes use of a positive model and a negative model to precalculate and reuse results, significantly reducing the number of operations required.
By focusing on precalculating the negative model for the entire sequence and adjusting it as needed for each position, the algorithm avoids redundancy and speeds up the process. This optimization, although simple in concept, has a profound impact on the efficiency of the algorithm, allowing it to operate faster and handle significantly larger volumes of data. Furthermore, by operating on a logarithmic scale, replacing multiplications with additions and divisions with subtractions, fastem further improves its performance.
Results and discussion
In this section, we show and analyze the results obtained by the biomapp::chip framework in comparison with state-of-the-art algorithms, namely meme [3], prosampler [13], and homer [11]. In the experiments conducted, two distinct types of data were employed to evaluate the effectiveness of our method. Synthetic data were used for load tests, allowing a comparative analysis of time consumption and ram usage between the different approaches. Real data, in turn, were employed to assess accuracy, thus ensuring a more realistic evaluation of the method's applicability in practical scenarios.
In addition to the comparisons with state-of-the-art methods, we also conducted a direct evaluation between smt and jellyfish [14], a widely used tool for kmer counting. The aim was to understand the efficacy of the smt algorithm in relation to established methods in the literature. Details on this comparison, including performance metrics, are provided in the supplementary data. This additional analysis reinforces the robustness of our approach and offers further insights into the performance of the smt algorithm.
It is important to highlight that all performance metrics presented in this study were rigorously evaluated. Load tests, which include measurements of execution time and memory consumption, were standardized and quantified. The execution time of each algorithm was measured in seconds, while memory consumption was recorded in megabytes. In addition to the performance metrics, accuracy was evaluated using several distance and correlation measures as references: euclidean distance, manhattan distance, hellinger distance, pearson correlation, bhattacharyya coefficient, and sandelin-wasserman similarity. These measures were used to evaluate the discovered models against the reference models.
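For concreteness, the measures listed above can be computed between a discovered position weight matrix and its reference. The sketch below is a Python illustration; the exact per-column aggregation used in the evaluation is not specified here, so averaging over columns is an assumption of this sketch:

```python
import numpy as np

def pwm_metrics(p, q):
    """Column-averaged comparison measures between two PWMs of equal
    shape (k, 4). Averaging over columns is an assumption; other
    aggregations (e.g. sums) are also common in the literature."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    euc  = np.mean(np.sqrt(((p - q) ** 2).sum(axis=1)))          # euclidean distance
    man  = np.mean(np.abs(p - q).sum(axis=1))                     # manhattan distance
    hell = np.mean(np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum(axis=1)))
    bha  = np.mean(np.sqrt(p * q).sum(axis=1))                    # bhattacharyya coefficient
    pcc  = np.corrcoef(p.ravel(), q.ravel())[0, 1]                # pearson correlation
    sw   = np.mean(2.0 - ((p - q) ** 2).sum(axis=1))              # sandelin-wasserman similarity
    return {"euc": euc, "man": man, "hell": hell,
            "bha": bha, "pcc": pcc, "sw": sw}
```

For identical matrices the distances are 0, the bhattacharyya coefficient is 1, and the sandelin-wasserman similarity reaches its maximum of 2 per column, which is the reference point used when interpreting the results below.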
Synthetic data
The generation of synthetic data was carefully planned to allow a comprehensive comparison between the biomapp::chip framework and other motif discovery algorithms. For the comparison, 50 datasets were generated, with sizes ranging from \(1 \times 10^5\) to \(2 \times 10^7\) bases. Each of these datasets was submitted to all algorithms with kmer sizes (k) ranging from 5 to 30 bases. This experimental design allowed for a rigorous evaluation of performance and efficiency in kmer counting under different load and complexity conditions.
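A common way to build such benchmarks is to plant one (possibly mutated) motif instance per sequence, in the spirit of the zoops model. The generator used in the paper is not described in detail, so the following Python sketch is purely illustrative:

```python
import random

def make_dataset(n_seqs, seq_len, motif, mutation_rate=0.1, seed=0):
    """Generate synthetic DNA sequences with one (possibly mutated)
    copy of `motif` planted at a random position in each sequence.
    Illustrative sketch only; parameters and scheme are assumptions."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    seqs = []
    for _ in range(n_seqs):
        # uniform random background
        s = [rng.choice(alphabet) for _ in range(seq_len)]
        # mutate each motif symbol independently with the given rate
        inst = [c if rng.random() > mutation_rate else rng.choice(alphabet)
                for c in motif]
        pos = rng.randrange(seq_len - len(motif) + 1)
        s[pos:pos + len(motif)] = inst
        seqs.append("".join(s))
    return seqs
```

Varying `n_seqs`, `seq_len`, and the motif length reproduces the kind of load and complexity gradient described above.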
Real data
Real data were extracted from the jaspar database version 2022 along with the genomic position files (.bed). These experiments form a fundamental step in validating our approach. By using real data, we can evaluate the performance of the algorithms in practical applications, increasing the reliability of the results and the robustness of our conclusions. Table 1 presents a detailed analysis of the data volume (measured by the number of peak regions) generated by different types of experiments available in the jaspar 2022 database.
The ChIPseq protocol stands out as the main source of data, generating the largest volume of information, with an average number of peak regions of 17,538.78. The data volume generated by this protocol varies considerably, ranging from a minimum of 22 to a maximum of 322,803 peak regions. Other significant protocols in terms of data volume include capselex and htselex, which generated an average of 9,878.26 and 9,856.47 peak regions, respectively. Although these protocols generate considerable data volumes, they are still significantly below ChIPseq.
In the experiments involving real data, datasets containing more than 10,000 sequences were selected. This criterion resulted in an initial set of 148 distinct datasets. However, 17 of these datasets were subsequently excluded from the analysis, as they were still in the process of validation. Therefore, the final set for evaluation consisted of 131 validated datasets. This selection criterion aims to ensure a sufficiently large and varied sample to evaluate the performance of the algorithms rigorously and comprehensively. This approach not only tests the efficacy of the algorithms on real data, but also provides an in-depth analysis of their applicability in scenarios closer to the experimental conditions frequently encountered in biological motif discovery research.
Parameters
In order to ensure the reproducibility of the experiments, we established the parameters for each algorithm according to specific guidelines. The standardization of these settings is important to guarantee the integrity and comparability of the results obtained. Furthermore, all experiments were run on Intel Xeon E5-2673 v4 2.30 GHz servers with 8 GB of ram, under the Linux/Ubuntu 22.04 operating system. The parameters used in the experiments are displayed in Table 2.
It is essential to clarify some aspects regarding the choice of parameters. All methods were adjusted to identify only the single best model, as evidenced by the options -n 1, -nmotifs 1, -m 1, and -S 1 for the smt, meme, prosampler, and homer algorithms, respectively. The zoops model was employed in the biomapp::chip and meme algorithms, while homer uses the zoops model intrinsically. The prosampler algorithm, on the other hand, does not offer the option of selecting this model. Additionally, the parameters -r 1000 and -f 0.001 govern the convergence of biomapp::chip: the iterative process halts when the score improvement between two successive iterations falls below 0.001; otherwise, the algorithm continues until it reaches the limit of 1000 iterations.
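The stopping rule described above can be expressed as a small generic loop. This Python sketch mirrors the described behavior; the function name and interface are illustrative, not part of the actual tool:

```python
def run_until_converged(step, max_iter=1000, tol=1e-3):
    """Generic convergence loop mirroring the stopping rule described
    for biomapp::chip: stop when the score improvement between two
    successive iterations falls below `tol` (the -f parameter) or
    after `max_iter` iterations (the -r parameter).

    `step` performs one iteration and returns the current model score.
    Returns (final score, number of iterations performed)."""
    prev = float("-inf")
    for it in range(1, max_iter + 1):
        score = step()
        if score - prev < tol:      # improvement too small: converged
            return score, it
        prev = score
    return prev, max_iter           # iteration budget exhausted
```

With the defaults above, a run whose score plateaus early stops immediately, while a slowly improving run is capped at 1000 iterations.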
Biomapp::chip on synthetic data
The aim of these experiments was to measure the performance of the biomapp::chip framework on a diverse set of synthetic datasets. These data were generated with variable parameters to simulate different scenarios and complexities associated with the real world. The use of synthetic data offers a controlled platform that allows isolating and evaluating the performance and efficacy of algorithms under well-defined conditions. This analysis procedure is essential for the initial validation of the adopted methodological approach, serving as a preliminary step before handling real data.
These trials were designed to collect basic data on the time consumption and ram usage of the biomapp::chip algorithm. This information was compared with time and space metrics obtained from other established algorithms in the field, such as meme, prosampler, and homer. The purpose of this comparison was to understand how biomapp::chip stands in relation to other solutions and to identify the strengths and weaknesses of each approach. To carry out the experiments, the \(k\) parameter, representing the size of the kmers, was systematically varied from 5 to 30, in line with the criteria established in the smt load tests. In this way, 5200 experiments were performed, 1300 for each algorithm. The capture of the relevant metrics, covering both time consumption and ram allocation, was conducted using the unix/linux /usr/bin/time -v command, ensuring consistent and comparable data collection across all algorithms under study.
Figure 3 illustrates the global averages of the tests for each algorithm, considering both time and space consumption. biomapp::chip proved superior in terms of time efficiency, closely followed by meme. On the other hand, prosampler and homer showed less efficacy in this aspect, requiring longer periods for execution. Regarding memory usage, meme led in performance but was closely followed by biomapp::chip. The latter employs the smt data structure to initialize its models, which demonstrates the efficiency of this structure in terms of memory usage. As observed in the time dimension, homer and prosampler occupied lower positions in memory consumption.
Figure 4 analyzes the execution time of the four algorithms (biomapp::chip, meme, homer, and prosampler) as a function of the k size. It is noticeable that biomapp::chip is the most time-efficient algorithm, closely followed by meme, both maintaining an average execution time below 100 s for all evaluated cases. In contrast, the average execution times for prosampler and homer were substantially higher, with prosampler exceeding 200 s and homer surpassing 400 s. Additionally, the analysis reveals that the execution time of all evaluated algorithms increases as the size of k grows.
Figure 5 presents a line graph illustrating the ram memory consumption of each algorithm. It is a unified graph, allowing direct comparison between all of them. Similar to what was observed for execution time, biomapp::chip and meme were close in memory efficiency, although meme had a slight advantage, consuming just under 500 Mbytes in the worst case. biomapp::chip showed an average memory consumption slightly below 1000 Mbytes. In contrast, the homer and prosampler algorithms displayed much higher memory usage, reaching just over 2000 Mbytes and exceeding 4000 Mbytes in the worst scenario, respectively.
Biomapp::chip on real data
Although experiments in simulated environments are useful for the initial understanding and fine-tuning of algorithms, it is the trials with real data that provide the true validity test for any computational method in bioinformatics. They offer a much more complex and variable scenario, one that more accurately reflects the conditions algorithms will face in real-world applications. Therefore, this section is the most critical and provides the most precise information about the feasibility and robustness of the evaluated algorithms.
In this test set, all algorithms were executed on a sample of 131 datasets obtained from the jaspar repository. These data were carefully selected based on two specific criteria: (i) all come from ChIPseq experiments and (ii) all have more than 10,000 sequences. The main focus of this test was the comparison of accuracy between the approaches. For this reason, unlike the tests performed on synthetic data, where the value of k was varied systematically, in this scenario the value of k adopted was that suggested by the scientific literature. Thus, each algorithm was tested with a single, specific value of k, according to the most recent academic guidelines.
Performance in relation to time and space consumption
Figure 6 reveals important nuances in the performance of the four examined algorithms in terms of execution time and ram memory consumption. biomapp::chip stands out for displaying the shortest average execution time, only 15.9 s, making it the most agile option. On the other hand, homer proved to be the slowest, with an average time of 624 s and an average ram consumption of 1436.16 MB, which could be a limitation in scenarios that demand quick responses.
In terms of ram consumption, meme was the most efficient with an average of 217 mbytes, while homer exhibited the highest consumption, with 1436 mbytes. prosampler, which had moderate performance in time, registered the highest peak in ram usage, reaching 3501 mbytes. In addition, prosampler displayed the greatest variation in both time and ram consumption, suggesting that its performance may be highly sensitive to the characteristics of the analyzed dataset. Overall, the tests indicated that biomapp::chip and meme proved to be more robust, making them more predictable in their performance, as shown in Table 3.
The analysis of the data presented in Figs. 7 and 8 provides important insights into the performance and efficiency of the four algorithms evaluated in various scenarios, parameterized by the value of k. First, it is possible to observe that the biomapp::chip algorithm displays, on average, the lowest consumption of time and ram memory in almost all scenarios. Its average execution time ranges from 9.79 to 23.00 s, while the average ram usage lies between 251.50 and 838.70 mbytes. The algorithm is also consistent in maintaining a relatively low maximum and minimum execution time, as shown by the data in Table 4.
On the other hand, the homer algorithm presents the highest consumption of both time and ram. It is interesting to note the increase in average execution time and average ram usage as k increases, reaching a peak of 1174.47 s and 1950.41 mbytes for k = 16.
The meme algorithm shows relative stability in average execution time, ranging from 134.64 to 143.33 s, but with peaks of ram usage reaching 1618.04 mbytes for k = 12. The prosampler displays intermediate performance in terms of time efficiency and ram usage, with a large variation in maximum and minimum metrics, particularly for ram usage, which reaches up to 3500.79 mbytes for k = 12.
Comparison with reference models
In this subsection, we will address the evaluation of the models generated by each algorithm, contrasting them with reference models extracted from the jaspar database version 2022. The goal of this comparison is to quantify how well the algorithms were able to approximate the pwm matrices found by each approach to their respective reference matrices. For this, specific metrics were employed to measure the distance between the matrices, thus providing a comprehensive analysis of the accuracy of the methods in finding reliable and precise representations of the investigated biological motifs.
According to Fig. 9 and Table 5, the biomapp algorithm showed superiority in almost all metrics analyzed. In the bha metric, the biomapp algorithm recorded an average of 0.993 and a very low standard deviation of 0.0308, signaling consistency in the results. However, it is interesting to note that prosampler had the lowest minimum in this metric (0.685), which could be a point of concern in terms of consistency.
When we observe distance metrics such as euc, hell, and man, biomapp once again stands out for having the lowest averages. Here, homer’s performance is notably inferior; for example, in the man metric, the average is 0.264 with a standard deviation of 0.194, both values being the highest among the algorithms analyzed. This could point to a generally inferior performance of homer in multidimensional comparisons. Regarding the pcc metric, all algorithms showed high average values, indicating a good degree of correlation. However, it is interesting to note that prosampler had the lowest minimum value (0.600) in this metric. Such a minimum value is significantly lower than the others, which may indicate instability in specific cases. Figure 10 displays the evolution of each algorithm as a function of the variable k.
It is possible to observe in this figure that biomapp, in the euclidean, hellinger, and manhattan distances, always stayed below the other algorithms, for all values of k. In the bhattacharyya coefficient, prosampler obtained the best result (0.988) for k = 10, followed by meme (0.975) and biomapp::chip (0.972). For other values of k, biomapp::chip achieved better values in this metric. Something similar occurred with the pearson correlation and sandelin-wasserman similarity, in which biomapp::chip had the lowest score for k = 10. The sw metric showed a closer competition among the algorithms, with all averages approaching 2 and relatively low standard deviations. However, prosampler here showed the lowest minimum value of 1.39, which once again raises questions about its consistency compared to the other algorithms. While biomapp exhibited a generally superior performance in all metrics evaluated, the other algorithms have weak points that deserve attention. The inferior performance of homer in distance metrics and the variability of prosampler in metrics like bha and pcc are points that require additional investigation.
Analyzing Fig. 10 focusing on biomapp and considering the points where it is surpassed by other algorithms, an interesting pattern can be seen in its behavior in relation to the different metrics and k values. In the euclidean (euc), hellinger (hell), and manhattan (man) distance metrics, biomapp demonstrates superiority for all k values, which implies significant consistency and efficiency in capturing differences between motifs, regardless of the kmer size. However, when evaluating the bha, pcc, and sw metrics, we observed that biomapp is not the leader for \(k = 10\). This could be a reflection of intrinsic characteristics of the data or the nature of shorter motifs, which may not be as well captured by the algorithm as those of larger size. For small kmers, measurement precision may be more susceptible to variations due to the limited amount of information available for analysis. When considering k values greater than 10, biomapp outperforms the other algorithms in most cases, indicating optimization for intermediate to large kmers. From \(k = 15\) onwards, biomapp's performance is quite similar to that of prosampler, suggesting that both algorithms are efficient in capturing the relevant features of the motifs for these kmer sizes, although biomapp retains a slight advantage. The reason biomapp exhibits an improvement may be related to its ability to integrate and interpret more complex information, which is more evident in larger sequences. As the kmer size grows, there is more context for analysis, allowing biomapp's strengths in terms of algorithms and statistical modeling to come to the fore. This may be a direct consequence of the method used to model motifs or the way biomapp weights different aspects of the sequences when calculating similarity and correlation measures.
Statistical analysis
In the performance analysis of algorithms, it is imperative to employ rigorous statistical techniques to determine whether the differences observed in various metrics are truly significant. The application of statistical tests provides a means to validate comparisons between different approaches or settings, ensuring that the conclusions drawn are robust and reliable.
In this subsection, we employ the friedman test, a nonparametric method used to identify significant variations in medians among multiple paired groups. This statistical test is especially relevant when the conditions of normality and homoscedasticity, prerequisites for the application of anova, are not met, as observed in the present study. After the main tests, post hoc tests are performed for multiple comparison analyses, with the aim of identifying which groups are statistically different from each other. The results of the post hoc tests are essential for establishing concrete conclusions about the evaluated metrics.
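The statistic being computed can be illustrated with a minimal Python sketch: rank the algorithms within each dataset, then test whether the mean ranks differ. The score matrix below is a placeholder; the real analysis would use the per-dataset metric values, and the resulting statistic would be compared against the chi-square distribution with \(k-1\) degrees of freedom:

```python
import numpy as np

def friedman_statistic(scores):
    """Friedman chi-square statistic for a (datasets x algorithms)
    score matrix. Rank algorithms within each dataset (1 = smallest
    score, i.e. best for distance metrics), then measure how far the
    mean ranks deviate from their expectation (k+1)/2.

    Note: ties are not handled here; the metrics compared in the
    paper are continuous, so exact ties are unlikely."""
    scores = np.asarray(scores, float)
    n, k = scores.shape
    # rank 1..k within each row via double argsort
    ranks = scores.argsort(axis=1).argsort(axis=1) + 1
    mean_ranks = ranks.mean(axis=0)
    chi2 = 12.0 * n / (k * (k + 1)) * ((mean_ranks - (k + 1) / 2.0) ** 2).sum()
    return chi2, mean_ranks
```

For example, if one algorithm ranks first on every one of 10 datasets among 4 competitors, the statistic is 30, far above the 0.05 critical value of about 7.81 for 3 degrees of freedom, so the null hypothesis of equal performance is rejected.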
The data displayed in Table 6 show the results of the friedman and nemenyi statistical tests applied to various metrics to evaluate the performance of the motif discovery algorithms. The metrics evaluated include euc (euclidean distance), man (manhattan distance), pcc (pearson correlation), hell (hellinger distance), sw (sandelin-wasserman similarity), and bha (bhattacharyya coefficient).
For each metric, the friedman test produced highly significant pvalues, pointing to a considerable difference between the approaches. All the pvalues are very close to zero, well below the adopted significance level of \(\alpha = 0.05\), indicating that the differences are statistically significant. The winner column indicates that the biomapp program outperformed all other approaches for all the tested metrics, as denoted by the symbol (+). The results of the nemenyi test reinforce this conclusion, with pvalues very close to zero. These data can be graphically visualized through Fig. 11.
Analyzing this figure, we can verify that the results obtained from the nemenyi test point to statistically significant differences between the biomapp::chip program and the other evaluated methodologies, covering all the metrics in question. It is relevant to highlight that meme was statistically similar to homer in all distance metrics. However, in the pcc metric, the meme program was more similar to prosampler, showing no statistically significant differences between them in this regard. The similarity between meme and homer in distance metrics and the closeness between meme and prosampler in the pcc metric can be attributed to various factors. These may include the nature of the algorithms, sensitivity to outliers, data peculiarities, or even optimization for different loss functions. Each metric may be highlighting different aspects of the relationships between the approaches, which may contribute to this behavior.
Lastly, Fig. 12 presents the critical distance graphs between the different algorithms, providing an intuitive visualization of the relative positions of each method in terms of performance. The critical distance is a key value obtained from statistical tests such as the nemenyi post hoc test, and serves as a threshold for determining whether performance differences between multiple algorithms are statistically significant. This measure is important because it provides an objective criterion for evaluating and comparing the effectiveness of different methods. If the difference in performance between two algorithms exceeds the critical distance, we can conclude that one algorithm is significantly better than the other with a certain level of confidence. It is clearly observed that the biomapp::chip algorithm achieved the best performance, demonstrating statistical superiority compared to all other approaches. A close relationship is also noted between the meme and homer algorithms in various metrics, especially those related to distance, corroborating the analysis performed on Fig. 11. Similarly, it is also possible to see in this figure that, in the pcc metric, the meme algorithm displayed a high proximity to the prosampler algorithm, indicating that there are no statistically significant differences between them in this specific metric.
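The threshold itself has a closed form. For the nemenyi test, two algorithms differ significantly when their mean ranks differ by more than \(CD = q_\alpha \sqrt{k(k+1)/(6N)}\), where \(k\) is the number of algorithms, \(N\) the number of datasets, and \(q_\alpha\) a studentized-range-based critical value. A Python sketch (the \(q_\alpha\) value used below is the commonly tabulated one for \(k = 4\) at \(\alpha = 0.05\), taken as an assumption here):

```python
import math

def nemenyi_cd(k, n, q_alpha):
    """Critical distance for the nemenyi post hoc test: two algorithms
    differ significantly when their mean ranks (averaged over the n
    paired datasets) differ by more than this value."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# e.g. 4 algorithms, 131 datasets, q_0.05 ~ 2.569 for k = 4
cd = nemenyi_cd(4, 131, 2.569)
```

With the 131 real datasets used here, the critical distance comes out to roughly 0.41 mean-rank units, which is the kind of spacing the critical distance diagrams in Fig. 12 visualize.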
Conclusions
This study addressed the problem of biological motif discovery, a field of extreme relevance for understanding regulatory mechanisms in genomes and with implications in various areas of biology. Aiming to overcome the current limitations of available algorithms, we introduced biomapp::chip, a tool designed to optimize both accuracy and efficiency in the analysis of large volumes of data, such as those generated by ChIPseq protocols. biomapp::chip sets itself apart with its two-step approach: the first step is dedicated to efficient kmer counting through the smt algorithm, and the second focuses on optimization via an enhanced version of the em algorithm. Our comparative analyses with state-of-the-art methods, such as meme, prosampler, and homer, demonstrated that biomapp::chip offers a strong balance between accuracy and efficiency. The success of biomapp::chip in the experiments suggests that the tool can be an important resource for researchers looking for reliable and effective approaches to motif discovery in large datasets. Future research may explore additional optimizations and the application of the tool in different biological scenarios.
Availability and requirements
Project name: biomapp::chip; Project home page: https://github.com/jadermcg/biomappchip; Operating system(s): Linux; Programming language: C++ and R; Other requirements: Threading Building Blocks https://www.intel.com/content/www/us/en/developer/tools/oneapi/onetbb.html; License: Apache License 2.0.
Availability of data and materials
All datasets used in the paper can be downloaded from: https://github.com/jadermcg/biomappchip/tree/main/datasets.
Notes
The resolution of the reads heavily depends on the sequencing technology employed.
References
Altschul SF, Erickson BW. Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage. Mol Biol Evol. 1985;2(6):526–38.
Archbold J, Johnson N. A construction for room’s squares and an application in experimental design. Ann Math Stat. 1958;29(1):219–25.
Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. UCSD Technical Report CS94-351; 1994.
D’haeseleer P. How does DNA sequence motif discovery work? Nat Biotechnol. 2006;24(8):959–61.
D’haeseleer P. What are DNA sequence motifs? Nat Biotechnol. 2006;24(4):423–25.
Fitch WM. Random sequences. J Mol Biol. 1983;163(2):171–6.
Garbelini JMC, Sanches DS, Pozo ATR. Expectation maximization based algorithm applied to DNA sequence motif finder. In: 2022 IEEE Congress on Evolutionary Computation (CEC), IEEE, 2022; pp 1–8
Garbelini JMC, Sanches DS, Pozo ATR. Towards a better understanding of heuristic approaches applied to the biological motif discovery. In: Brazilian conference on intelligent systems. Springer, 2022; pp 180–194
Hashim FA, Mabrouk MS, AlAtabany W. Review of different sequence motif finding algorithms. Avicenna J Med Biotechnol. 2019;11(2):130.
He Y, Shen Z, Zhang Q, et al. A survey on deep learning in DNA/RNA motif mining. Brief Bioinform. 2021;22(4):bbaa229
Heinz S, Benner C, Spann N, et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell. 2010;38(4):576–89.
Kumar A, Hu MY, Mei Y, et al. CSSQ: a ChIP-seq signal quantifier pipeline. Front Cell Dev Biol. 2023;11:1167111.
Li Y, Ni P, Zhang S, et al. ProSampler: an ultrafast and accurate motif finder in large ChIP-seq datasets for combinatory motif discovery. Bioinformatics. 2019;35(22):4632–9.
Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27(6):764–70.
Pevzner PA, Sze SH, et al. Combinatorial approaches to finding subtle signals in DNA sequences. In: ISMB, 2000; pp. 269–278
Sanderson C, Curtin R. Armadillo: a template-based C++ library for linear algebra. J Open Source Softw. 2016;1(2):26.
Smit AF, Hubley R, Green P. RepeatMasker; 1996.
Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000;16(1):16–23.
Tatusov R, Lipman D. Dust, in the NCBI Toolkit; 1996. Available at http://blast.wustl.edu/pub/dust.
Zang C, Schones DE, Zeng C, et al. A clustering approach for identification of enriched domains from histone modification ChIP-seq data. Bioinformatics. 2009;25(15):1952–8.
Zhang Y, Liu T, Meyer CA, et al. Model-based analysis of ChIP-seq (MACS). Genome Biol. 2008;9(9):1–9.
Acknowledgements
The authors would like to thank Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001—for the financial support given to this research.
Funding
This work has been supported by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES) under Finance Code 001.
Author information
Authors and Affiliations
Contributions
JMCG conceived and designed the approach. ATPR and DSS oversaw and coordinated the project. JMCG developed, implemented, realized the experiments and analyzed the results. JMCG, ATPR and DSS tested the algorithm and wrote this paper. All authors approved the final version of this manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Garbelini, J.M.C., Sanches, D.S. & Pozo, A.T.R. biomapp::chip: large-scale motif analysis. BMC Bioinformatics 25, 128 (2024). https://doi.org/10.1186/s12859-024-05752-3
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859-024-05752-3