A fast weak motif-finding algorithm based on community detection in graphs

Background Identification of transcription factor binding sites (also called ‘motif discovery’) in DNA sequences is a basic step in understanding genetic regulation. Although many successful programs have been developed, the problem is far from being solved on account of diversity in gene expression/regulation and the low specificity of binding sites. State-of-the-art algorithms have their own constraints (e.g., high time or space complexity for finding long motifs, low precision in identification of weak motifs, or the OOPS constraint: one occurrence of the motif instance per sequence) which limit their scope of application. Results In this paper, we present a novel and fast algorithm we call TFBSGroup. It is based on community detection from a graph and is used to discover long and weak (l,d) motifs under the ZOMOPS constraint (zero, one or multiple occurrence(s) of the motif instance(s) per sequence), where l is the length of a motif and d is the maximum number of mutations between a motif instance and the motif itself. Firstly, TFBSGroup transforms the (l, d) motif search in sequences to focus on the discovery of dense subgraphs within a graph. It identifies these subgraphs using a fast community detection method for obtaining coarse-grained candidate motifs. Next, it greedily refines these candidate motifs towards the true motif within their own communities. Empirical studies on synthetic (l, d) samples have shown that TFBSGroup is very efficient (e.g., it can find true (18, 6), (24, 8) motifs within 30 seconds). More importantly, the algorithm has succeeded in rapidly identifying motifs in a large data set of prokaryotic promoters generated from the Escherichia coli database RegulonDB. The algorithm has also accurately identified motifs in ChIP-seq data sets for 12 mouse transcription factors involved in ES cell pluripotency and self-renewal. Conclusions Our novel heuristic algorithm, TFBSGroup, is able to quickly identify nearly exact matches for long and weak (l, d) motifs in DNA sequences under the ZOMOPS constraint. It is also capable of finding motifs in real applications. The source code for TFBSGroup can be obtained from http://bioinformatics.bioengr.uic.edu/TFBSGroup/.


Background
Transcription factors play an irreplaceable role in the activation and repression of gene expression by binding to specific sites within promoter regions of target genes. Identification of transcription factor binding sites (TFBSs) is a basic task for elucidating the molecular mechanisms of transcriptional regulation. Although traditional footprinting assays can accurately identify the precise binding sites of any factor, this low-throughput method is highly technical and can analyze only a single small region (< 1 kb) at a time. With the development of high-throughput http://www.biomedcentral.com/1471-2105/14/227 sites [2][3][4] or (2) characterize a motif as an l-length consensus string describing a motif with the most frequent nucleotide in each position of all aligned sites. According to these two TFBS models, existing motif-finding algorithms can be divided into two classes. The first includes algorithms that maximize a statistical or entropy-related score of a PSSM [ n j,k ] l×4 . CONSENSUS [5], MEME [6], Gibbs Sampler [7], AlignACE [8], PROJECTION [9], and CRMD [10] belong to this group. These algorithms use optimization techniques from the fields of statistics and machine learning, including the greedy strategy [5], the Expectation-Maximization method [6], Gibbs sampling methods [7][8][9], and the clustering method [10]. These algorithms usually have a fast run time. Sometimes, however, they cannot converge to the global optimum, especially for short motifs with high levels of statistical noise or long motifs with a large search space. The second class of algorithms usually searches for (l, d) motifs based on the consensus model [11] by employing heuristic methods, but in some cases use optimal techniques. It is supposed that each sequence contains zero, one, or multiple motif instance(s) with up to d mutations within a true motif [12]. A large number of algorithms have been proposed to exactly or almost exactly extract (l, d) motifs from N input sequences with length L. SPELLER [13], WEEDER [14,15], MITRA-count [16], Voting [17], PMSprune [18], WIN-NOWER [11], iTriplet [19], VINE [20], Stemming [21], RecMotif [22], and sMCL-WMR [23] are included in this group. These algorithms usually have a high time complexity for long motifs. This limits their application, especially toward prokaryotic promoters [24,25]. In this study, we intend to offer a highly efficient algorithm for finding long and weak (l, d) motifs and to use this algorithm to identify TFBSs in prokaryotic data sets.
Motif-finding algorithms based on the consensus model can be further divided into two categories: pattern-driven and sample-driven approaches [16]. Using a patterndriven approach, one tries to enumerate all possible 4 l l-mer motifs with lexical order. When applying a sampledriven approach, all possible (l, d) motifs generated from the real l-mers of input sequences are often tested. SPELLER, WEEDER, and MITRA-count are patterndriven approaches and Voting, PMPprune, WINNOWER, iTriple, VINE, Stemming, RecMotif, and sMCL-WMR are sample-driven. In general, pattern-driven approaches can automatically find (l, d) motifs in samples without being given the length l. However, the state-of-the-art sampledriven approaches are usually faster than the state-of-theart pattern-driven approaches, and thus can be used to extract motifs with a larger l and d.
Recently, the sample-driven approach, which transforms the (l, d) motif search by extracting the maximum clique or q-cliques (q ≤ N) from an N-partite graph, has attracted much attention. In this graph, each vertex is an l-mer. There is an edge between two l-mers from two different sequences if the Hamming distance between these two l-mers is no more than 2d. This is because the Hamming distance between each instance of a motif and the motif itself is assumed to be at most d. Thus, all instances of a motif must form a maximum clique or q-cliques in the graph. This idea was first presented in WINNOWER, which utilizes an extendable mechanism to cut off spurious edges by extending k-cliques (k = 2 or k = 3) to larger (k + 1)-cliques. However, the accuracy of WINNOWER cannot be guaranteed since there is strong background noise in many sequences and true edges may be pruned by its local extension mechanism. The VINE algorithm, with its rigorous pruning steps, was proposed to speed up and increase the precision rate of WINNOWER. Similarly, iTriplet was designed to randomly select two reference sequences and identify all triplets (3-cliques) in these as well as each of the remaining N − 2 sequences. All discovered triplets along with their sequence information are then inserted into hash tables as candidate motifs. If a table has enough instances (e.g., at least q), the motif can be identified as 'true' . RecMotif was created to extract N-cliques by using reference sequences as well. It takes the selected reference vertices from the first x (x = {1, 2, · · · , N}) reference sequences in order to select new reference vertices in the remaining sequences. As x is increased, the selection is continued (x ← x + 1) if new reference vertices can be selected from the remaining sequences to obtain xcliques. Otherwise, the algorithm backtracks to the first x − 1 reference vertices and finds a substitute. RecMotif has been shown to be very fast for some (l, d) cases (e.g., (15,4), (18,5) and (21,6)). However, in tests performed by Sun et al. [22], it failed to find some weak motifs such as (15,5), (18,6), and (19,7). Additionally, RecMotif operates under the OOPS constraint. During real applications, some sequences may not contain any instance of a true motif while some may contain multiple instances. With this work, we offer a more efficient algorithm for extracting long and weak (l, d) motifs from N-partite graphs using the more biologically-relevant ZOMOPS constraint.
During our research, we made the following observations: (1) there may be too many spurious edges in an N-partite graph to extract the q-cliques (q ≤ N) needed to identify a weak motif (e.g., (15,4) or (18,6)) and (2) if we construct the N-partite graph such that there is an edge between two l-mers from two different sequences if the Hamming distance between these two l-mers is no more than x (d ≤ x ≤ 2d) instead of exactly 2d, the motif instances in the graph may form a dense subgraph instead of a clique. Based on this information, we present a new algorithm: TFBSGroup. It first extracts dense subgraphs, which are groups of candidate instances (i.e., TFBSs), by the fast community detection algorithm BGLL [26]. It http://www.biomedcentral.com/1471-2105/14/227 then greedily refines these candidate motifs towards the true motif within their own communities. Our empirical study shows that TFBSGroup can quickly discover long and weak (l, d) motifs (e.g., (18,6) and (24,8) motifs within 30 seconds) in synthetical samples under the ZOMOPS constraint. More importantly, it is able to rapidly identify motifs in a large data set of prokaryotic promoters [25] generated from the Escherichia coli database Regu-lonDB [27]. It is also able to accurately discover motifs in 12 mouse transcription factor ChIP-seq data sets involved in ES cell pluripotency and self-renewal [28].

Results
We first tested TFBSGroup on a series of synthetic (l, d) samples and compared it with iTriplet (source code: http://www.rci.rutgers.edu/∼gundersn/iTriplet/) and RecMotif (source code provided by the authors). iTriplet and RecMotif are both sample-driven algorithms which heuristically extract q-cliques from an N-partite graph (q = N for RecMotif because of the OOPS constraint). Meanwhile, we compared TFBSGroup with the pattern-driven algorithms SPELLER, WEEDER, and MITRA in order to reveal more differences between sample-driven and pattern-driven approaches. We then used TFBSGroup on a large data set of prokaryotic promoters generated from the Escherichia coli database RegulonDB for the purpose of finding real long and weak motifs. Also, we showed the results of TFBSGroup in discovering motifs in ChIP-seq data sets for 12 mouse transcription factors involved in ES cell pluripotency and self-renewal [28]. All experiments were performed on a computer with an Intel 2.99 GHz processor and 2GB of main memory running the Windows XP.

Benchmark data sets
Like the previous work of Buhler and Tompa [9], the testing samples were generated synthetically using the following steps: 1) A parent motif of length l was chosen by randomly picking l bases from the nucleotide set {A, C, G, T}. 2) N i.i.d. background sequences of length L were constructed at random. 3) q (q ≤ N) sequences were randomly selected from these N background sequences. 4) The following steps were performed for each of the selected q background sequences: 4.1) An instance of the parent motif was created by randomly choosing d (d < l) positions of the motif and randomly mutating these d bases to one of the four nucleotides. 4.2) A consecutive substring of random length l was selected from each background sequence.

4.3) The substring in
Step 4.2 was replaced with a newly generated instance of the motif.
In our experiments, unless otherwise specified, the number N and the length L of sequences in an (l, d) sample are set to 20 and 600, respectively.

Comparisons between TFBSGroup and state-of-art algorithms using (l, d) samples
Firstly, to show the efficiency of TFBSGroup, we compared it with state-of-the-art algorithms including the pattern-driven SPELLER, WEEDER, and MITRA-count (MITRA for short) and the sample-driven iTriplet and RecMotif on the same testing samples (Table 1). Secondly, we tested the effect of the maximal length L of DNA sequences [20]. The test results are shown in Table 2. WEEDER(q) indicates the execution time of WEEDER given q, q/f indicates that WEEDER failed to find the true motif for the given value q, TFBSGroup(x) indicates the run time of TFBSGroup given x (d ≤ x ≤ 2d), s, m, and h denote seconds, minutes, and hours respectively, and − indicates a run time of over 10h.
To demonstrate the accuracy of our algorithm, we ran TFBSgroup on 100 randomly generated test samples for each (l, d) pair and reported the number of samples in which the implanted motifs were correctly reported in the top 1/5/10, and which were correctly identified but listed below the top 10 (denoted b10). We also reported the number of samples in which the implanted motifs were not correctly reported by TFBSGroup (denoted f ), since our algorithm TFBSGroup may fail in some cases. The accuracy of TFBSGroup for different (l, d) pairs is shown as 1/5/10/b10/f after TFBSGroup(x) in Tables 1 and 2. For example, we correctly found motifs ranked first in 95 samples, motifs ranked within the top 5 in 98 samples, and motifs ranked within the top 10 in 99 samples for (15,4). However, we failed on one sample set. Thus, the accuracy of TFBSGroup on 100 (15,4) samples can be estimated as 95/98/99/0/1, where motifs were ranked by their significance score [14,15]. Furthermore, since the run times of TFBSGroup on different samples of the same (l, d) pair showed negligible difference (usually < 1 second) under the same parameter settings, we did not average the run times of 100 samples for an (l, d) pair but instead kept the run times of TFBSGroup on the same sample sets used in the efficiency comparisons with other algorithms.
In addition, Table 1 and Table 2 describe the results of VINE (Huang et al. [20]) and sMCL-WMR (Boucher and King [23], sMCL for short) in order to compare these sample-driven algorithms, which extract cliques from N-partite graphs, to our work. * indicates that no result was available from literature. The experiments using VINE were performed on a PC with an Intel Pentium IV 3.40 GHz processor and 1GB of main memory http://www.biomedcentral.com/1471-2105/14/227 running Windows. Those for sMCL-WMR were performed on a Linux machine with a 2.6 GHz processor and 1Gb of RAM running Ubuntu Linux. We also tested iTriplet and found that the run times of two implementations of this algorithm on the same (l, d) sample were greatly different due to its random mechanism for selecting two reference sequences and an l-mer within a reference sequence. Taking five runs of iTriplet on the same (15,4) sample as an example, the minimum execution time was 0.859 seconds and the maximum was 9.156 seconds. We have reported the average run time of 5 runs of iTriplet.
As shown in Table 1, the sample-driven algorithms run faster than the pattern-driven variety. However, except for MITRA, we used only q and d as input in all implementations of the pattern-driven algorithms. The length l of planted motifs can be predicted by these algorithms while all the sample-driven types tested http://www.biomedcentral.com/1471-2105/14/227 Most planted motifs in the synthetic samples (with the exception of (10, 2)) were correctly reported in the top 1/5/10. TFBSGroup performed with high accuracy when identifying exact matches for long and weak (l, d) motifs such as (15,4), (16,5), (18,6), and (19,7). However, TFBSGroup failed to find exact planted motifs in many cases involving conserved motifs (e.g., (10,2)). This may be because the networks are too sparse to form good communities. For hard (l, d) motif search problems such as (15,5), (16,5), (17,6), and (18,6), TFBSGroup is much more efficient than iTriplet, RecMotif, VINE, and sMCL-WMR. In addition, it is not easy to tune the parameter q in WEEDER since, with the decrease of q, the run time is dramatically increased. Moreover, according to our experiments, iTriplet cannot be guaranteed to find true inserted motifs in all cases because of its random mechanism. Also, the memory usage of iTriplet is much higher and can freeze the computer during searches for long and weak motifs such as (19,7). Table 1 and Table 2 show that the sample-driven algorithms are more sensitive to the length L of DNA sequences, which influences the scale and the noise ratio of an N-partite graph given x. The pattern-driven approach is much more sensitive to l and d, which dominate the scale of the search space. These results are consistent with the time complexity of these algorithms collected from [21,22] and shown in Table 3. According to Table 1 and Table 2, the value of q has a strong influence on the efficiency of WEEDER. For instance, when L = 700, the true motif could not be found when we set q = 19. We then let q = 18 and ran WEEDER again. The run time when q = 18 was two to three times longer than when q = 19. We also observed that a shorter length L of sequences led to a more accurate TFBSGroup result. This is consistent with the theoretical results in Zia and Moses [29].

Mining for transcription factor binding sites in Escherichia coli K-12
In order to further evaluate TFBSGroup, we used the algorithm on a large data set (ECRDB70-10) to find the  LN(l, d)) binding sites within promoters of Escherichia coli K-12 DNA. This data, collected by Hu et al. [25], is stored in RegulonDB [27] and contains groups of sequences with experimentally-determined binding sites in the middle regions of the sequences. We selected sequence sets with more than five DNA sequences. We used published motif consensuses in the literature, especially the results in Li et al. [30], as a guide for inferring the values of l and d.
We listed the motif consensuses, which were similar to the consensuses published in the literature. If no exact match or similar result was found in the literature, we listed the top ranked motif consensuses with the most binding locations in the middle regions of the sequences. The test results (obtained within 1 minute by running TFBSGroup) are shown in Figures 1 and 2, where 'TF' indicates the name of the transcription factor, '#' indicates the number of sequences in the corresponding set, 'Consensus' indicates the motif consensus of the corresponding TFBSs given a TF, 'Logo' indicates the sequence logo of all TFBSs for a specified TF (created using the webbased application tool Weblogo [31]), 'Rank' indicates the ranking number of the significance score [14,15] for the motif consensuses, 'Lit. ' indicates that similar motif consensuses have been published in the literature while '*' means no similar motif consensus was found in the literature, and (l, d) and x are represented in the same way as in Table 1.
As shown Figures 1 and 2, TFBSGroup can efficiently find over-represented long motifs from prokaryotic promoters. We illustrate this point using the well-studied TFs CRP, FNR, and LexA as examples [21]. Binding site data for the CRP protein includes 138 DNA sequences of length 219 with the consensus TGTGAnnnnnnTCACA (consensus model: (18,7)) and 138 actual binding sites. The FNR data includes 50 DNA sequences of length 222 with the consensus TTGATnnnnATCAA (consensus model: (14,4)) and 50 actual binding sites. The LexA data includes 10 DNA sequences of length 222 with the consensus CTGTnnnnnnnnnnCAG (consensus model: (16,6)) and 10 actual binding sites. For all three sets, the expected motifs are ranked first by TFBSGroup in terms of their significance score: CRP is reported to have 121 sites (63 true), FNR is predicted to have 46 sites (23 true), and LexA is reported to have 12 sites (8 true). The precision on the site level (precision = TP TP+FP ) is close to or greater than 50% on these three data sets, where TP is the number of true positive sites and FP is the number of false positive sites. It should be pointed out, however, that some results marked with an '*' in Figures 1 and 2 may not be satisfactory due to the low specificity of binding sites for the TFs, insufficient number of sequences from which to a draw statistical conclusion, or a lack of knowledge of the proper (l, d) models. Compounding the problem is the fact that the true consensuses in these data sets are unknown, a difficulty which exists for all consensus model-based algorithms.

Motif discovery in 12 mouse ES CELL ChIP-seq data sets
To further evaluate the accuracy of motifs predicted by TFBSGroup, we analyzed the ChIP-seq data sets for 12 DNA-binding TFs (CTCF, cMyc, Esrrb, Klf4, Nanog, nMyc, Oct4, Smad1, Sox2, STAT3, Tcfcp2I1, and Zfx) involved in mouse embryonic stem cell pluripotency and self-renewal [28]. To prepare the data sets for use with motif discovery algorithms, we first extracted peak regions from ChIP-seq data using MACS [50] with a FDR (false discovery rate) threshold of 0.2. We then mapped the centers of the ChIP-seq peaks to the mouse mm10 assembly and extracted 100-bp of genomic sequence centered around each peak. To compare motifs identified by TFBSGroup with motifs found in Chen et al. [28], we ran TFBSGroup on hundreds of peaks with low p-values. The results of Chen et al. and TFBSGroup are shown in Figure 3, where all sequence logos predicted by TFB-SGroup, including those in Figure 4, were also created using the web-based application tool Weblogo [31]. We found motifs matching those identified in Chen et al. [28]. Specifically, motif logos predicted by Chen et al. [28] and TFBSGroup for each TF in Figure 3 (with the exception of Klf4 and Zfx) are exactly or 'almost' exactly the same. However, it is well known that Klf4 is able to recognize GC-rich regions. ZFX has no known published consensus sequence, but its predicted motif agrees to some extent with the result of Chen et al. and the result predicted by cEMRMIT [51].
In [28], Chen et al. used WEEDER and then refined and extended the motifs with an Expectation-Maximization method. This second step was necessary because the supplied version of the WEEDER algorithm limited the motif search to a maximum of 12 bps. As discussed in the previous sections, WEEDER operated with low efficiency for long motifs and was difficult to tune for the parameter q. On the contrary, TFBSGroup was able to find both long and weak motifs. We obtained the motifs and their TFBS locations in sequences within 1 minute for all data sets with just one run of TFBSGroup.
In addition, we found alternative motifs for some TFs such as OCT4, Esrrb and CTCF, which were also reported in a previous study [52]. One significant alternative motif for each of the three TFs is shown in Figure 4. The TFBS sequences of this alternative motif were complementary to those of the main motif in Figure 3 for each of three TFs.
algorithm designed for processing large-scale networks (BGLL). Based on these extracted communities, TFB-SGroup heuristically refines candidate motifs and their instances towards the true motifs. Experimental tests on synthetical samples have shown that TFBSGroup can very quickly discover long and weak (l, d) motifs and their instances. More importantly, TFBSGroup has achieved good performance in rapidly identifying motifs in a large data set of promoters generated from Escherichia coli and in accurately discovering motifs in ChIP-seq data sets http://www.biomedcentral.com/1471-2105/14/227 for 12 mouse transcription factors involved in ES cell pluripotency and self-renewal. Still, TFBSgroup may not work well in the extreme case that the number of mutations between each motif instance and the motif itself is exactly d, since the graph will be too dense to be partitioned sufficiently. Fortunately, this case seldom occurs in real applications. In the future, we plan to improve the algorithm by combining structure-and sequence-based methods in order to address this issue. Also, we plan to improve the algorithm's ability to process large-scale ChIP-Seq data sets.

(l, d) motif search and dense subgraph extraction
Given a set of sequences S={s 1 , s 2 , . . . , s N } over a symbol set ={A,C,G,T} and positive integers l and d (|s i | ≤ L, indicates the Hamming distance between the two strings t and t ij . Since the Hamming distance between each instance of a motif and the motif itself is at most d, the Hamming distance between two instances is no more than 2d and all instances of the true motif must form a q-clique. Therefore, we can obtain (l, d) motifs by extracting q-cliques from an N-partite graph where each vertex is an l-mer in S and there is an edge between two l-mers l i and l j (l i and l j are l-mers of s i and s j , respectively, i = j) if the Hamming distance between the two l-mers is no more than 2d.
As far as synthetic samples randomly generated by the method mentioned in the above section are concerned, the probability of two random l-mers with a maximum distance of x is instances. Still, only q * (q − 1)/2 ≤ N * (N − 1)/2 edges are true positive links forming an expected q-clique. Suppose there is an edge between two vertices in an Npartite graph if the Hamming distance of the vertices is no more than x (d ≤ x ≤ 2d) instead of 2d. In this case, the number of spurious edges may be dramatically reduced. For example, if we set x = 7 for a real (18,6) sample, there are only 82,343 edges in the N-partite graph (there are an estimated 80,335 edges using Eq. 1). However, the instances of a motif in this situation may not form a qclique but rather a densely connected subgraph. We can obtain an (l, d) motif by detecting dense subgraphs in an N-partite graph where the distance of two vertices is at most x.

Community detection and dense subgraph identification
In recent years, complex network analysis has been highlighted in the research community. It is a powerful tool used to describe the structure of many complex systems in nature and society and has many potential applications. A network is usually represented by a graph G = (V , E), where V is a set of n vertices and E is a set of m edges representing relationships between pairs of vertices.
Community structure is one of the most important topological characteristics in a network. A community structure is a subgraph of a network whose vertices are more highly connected with each other than with vertices outside the subgraph. Therefore, the problem of community detection requires the partition of a network into communities of densely connected nodes such that ∀u, v, u = v, C u ∩ C v = ∅ and ∪ u C u = V , where C u (or C v ) is one of the partitioned communities. It should be apparent that community structure is a type of dense subgraph. The current algorithms for identifying communities in complex networks can be used to find dense subgraphs within graphs. Many methods to identify community structures in complex networks have been developed [53,54]. As mentioned above, however, an N-partite graph of input sequences with a long and weak (l, d) motif may be a large network with millions of edges. Fast community detection methods are required to partition a large-scale graph. In the field of complex network analysis, algorithms including Infomap [55], BGLL [26], LPA [56], and RG [57] are designed to efficiently detect communities in very large networks. In this study, we use the BGLL algorithm [26] to find dense subgraphs in an N-partite graph. This algorithm is best for our purposes since we only need a coarse partition and BGLL is very fast and easy to use. The source code for this software can be obtained from http://findcommunities.googlepages.com/.

BGLL: a near linear time algorithm for community detection
BGLL is a heuristic method for optimizing modularity (Eq. 2), which measures the difference between the empirical distribution of in-community connections of a partition and the expected distribution of in-community connections of a partition in a randomly generated graph with the same vertex degree distribution [58].
where A ij is the weight of the edge (i, j). If a network is unweighted, A ij is the sum of the weights of the edges attached to vertex i or the degree of the node i for an unweighted network. C i is the community to which the node i is assigned and δ(·, ·) is the Kronecker function. A larger Q yields a better partition.
The BGLL algorithm can be divided into two iterative phases. Firstly, it assigns a different community to each node of a network. Then, for each node i, it considers the neighbors j of i ((i, j) ∈ E) and evaluates the gain of modularity Q (Eq. 3) that would take place by removing i from its community and placing it in the community of j.
where in is the sum of the weights of the edges inside a community C u , tot is the sum of the weights of the edges incident to nodes in C u , and k i,in is the sum of the weights of the edges from i to nodes in C u . If the gain is negative, i stays in its original community, otherwise, i is placed in the community which provides maximum gain. The second phase of the algorithm involves building a new network whose nodes make up the communities found during the first phase. The weights of the edges between the new nodes are the sum of the weight of the edges between nodes in the corresponding two communities. Edges between nodes of the same community lead to self-loops for this community in the new network. The algorithm naturally incorporates a notion of hierarchy, which results in communities of communities. The BGLL algorithm is extremely fast and performs in linear time on typical and sparse data, since the gain of modularity is very easy to compute with Eq. 2 and the number of communities decreases drastically after just a few passes. Most of the run time lies within the first iteration [26]. In this study, we took the results of the first iteration only since we are interested in obtaining coarse-grained candidate motifs and their group of instances from dense subgraphs of the original network and are not concerned about the hierarchical structure of communities. http://www.biomedcentral.com/1471-2105/14/227

TFBSGroup: a fast motif-finding algorithm
TFBSgroup operates in two phases. Firstly, we construct an N-partite graph where the distances of pairs of vertices are no more than x (d ≤ x ≤ 2d) for a set of input sequences, which is assumed to contain an (l, d) motif and at least q (q ≤ N) instances. We then detect all communities with a size of at least t (the default is q/2) in the N-partite graph using the BGLL algorithm to obtain a candidate motif consensus by aligning all l-mers in each community. Secondly, we greedily refine these candidate motifs toward the true motif using the following three steps: 1) For each candidate motif consensus, we look for the vertices in the neighbor set of the current community C u for which the Hamming distance between the consensus and the corresponding l-mers of these vertices are no more than d in order to form a new community. A vertex belonging to the new community is in the neighbor set of the current community C u if the position of the corresponding l-mer of the vertex is in the interval [ max{0, pos v i − l}, min{pos v i + l, L − l}]. pos v i is the start position of the corresponding l-mer of a vertex v i (v i ∈ C u ) in the sequence to which it belongs. 2) We align the new community to obtain a new candidate motif consensus. We then iteratively execute step 1 and step 2 until each new candidate consensus cannot be changed. 3) We shift the corresponding l-mers (an l-mer corresponds to a vertex v i in the final community) of the final candidate motif in the interval [ max{0, pos v i − l/3 }, min{pos v i + l/3 , L − l}], since the true motif and its instances may be near the final candidate motif consensus and their instances. Furthermore, since there may be many false positive motifs, we sort all output motifs according to their statistical significance using the method of Pavesi et al. [14,15] and then delete the duplicates. Finally, the top k significant motifs and their instances are returned. TFBSGroup is so named because it completes its run after all potential motifs and the groups of their instances (corresponding to the groups of TFBSs in a set of DNA sequences) are reported as output.
The details of the TFBSGroup algorithm are shown below, where (i, j) indicates an instance starting at the jth position of the i-th sequence s i . t (default = q/2) is used to filter false positive groups forming small communities during the initial stage. This will not affect the result or the speed of the algorithm in our simulations because the largest group usually has the highest significance score. We can set t = 0 to ensure all candidate groups are examined. Furthermore, the window size l/3 is used to ensure that the predicted TFBSs are within its inserted positions and not around them. Generally, we can identify true motifs when the window size is less than l/3. The source code for TFBSGroup can be obtained from http://bioinformatics.bioengr.uic.edu/ TFBSGroup/ or Additional file 1.

Time and space complexity of TFBSGroup
The time complexity of TFBSGroup depends mainly on the first phase of the algorithm, which includes time for constructing an N-partite graph with distance x (d ≤ x ≤ 2d) and time for extracting communities from the constructed N-partite graph. During the second phase, the algorithm searches through candidate motif consensuses and their instances within each of the communities.  6) and (19, 7). The height of a panel in the figure correspond to the number n y of instance pairs with distance y (y = {0, 1, 2, · · · , 2d}).
vertices in a network, where time Q is the time complexity for computing Q. As a result, the time complexity of TFBSGroup for the worst case is where m is the number of edges in a network, m = O(p(l, x) × N 2 × L 2 ) and t = O(p(l, x) × N × L), as estimated using Eq. 1. However, it should be noted that the above bound can be a substantial overestimate. The time complexity of TFBSGroup is almost equal to the time complexity of BGLL, which is near linear with respect to m in real applications, especially for sparse networks [26]. The space complexity of TFBSGroup is mainly affected by the storage of all l-mers and an N-partite graph, where the distance of two vertices is at most x. Thus, the space complexity of TFBSGroup is O(m) = O(p(l, x) × N 2 × L 2 ), while it is at least O(p(l, 2d) × N 2 × L 2 ) for previous graph-based algorithms.
The time and space complexity of TFBSGroup and several related algorithms (except for sMCL, since no complexity analysis is available for this algorithm in the literature [23]) are shown in Table 3, where the left half lists pattern-driven algorithms (labeled 'Pattern-D') and the right half lists sample-driven algorithms (labeled 'Sample-D'). N(l, d) = d i=0 i l 3 i and w is the word length, which corresponds to bit length of a processor. The time complexity of RecMotif relies heavily on the value of p(l, 2d) (see Table one in Sun et al. [22]). According to Table 3, each algorithm has its own advantages. The time complexity of pattern-driven algorithms is higher but they have lower space complexity. The time complexity of sample-driven algorithms is lower but they generally have higher space complexity. RecMotif is too sensitive to p(l, 2d) and L. When p(l, 2d) is small, RecMotif runs very fast. However, when p(l, 2d) is larger than 0.28, RecMotif can not produce the results within a reasonable amount of time. TFBSGroup is a complement to sample-driven algorithms since it makes a reasonable trade-off between speed and accuracy.

The choice of x
The key problem with TFBSGroup is the selection of the parameter x. If x is too large, the N-partite graph may be too dense to define a community containing only the instances of a motif. If x is too small, the graph is too sparse to form the expected communities and the true group of TFBSs will be missed. In this study, we use an experimental statistical method to estimate x for a specified (l, d) motif search problem. Firstly, for a given l-mer consensus, we created 500 instances of the consensus with the Hamming distance between the consensus and each instance equal to at most d. We then computed the Hamming distance for each pair of instances and counted the number n y of instance pairs with distance y (y = {0, 1, 2, · · · , 2d}) to get the frequency n y distribution. The center of this distribution should be an estimation of x, i.e., x ≥ max{n y , y ∈ {0, 1, 2, · · · , 2d}} or close to this. Taking (18,6) and (19,7) as examples, the histograms of the frequency n y distribution are shown in Figure 5. Using these distributions as a guide, we set x = 7 and x = 8 for (18,6) and (19,7), respectively.