 Research article
 Open access
 Published:
SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets
BMC Bioinformatics volume 19, Article number: 228 (2018)
Abstract
Background
Given a set of t nlength DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, the quorum planted motif search (qPMS) finds llength strings that occur in at least qt input sequences with up to d mismatches and is mainly used to locate transcription factor binding sites in DNA sequences. Existing qPMS algorithms have been able to efficiently process small standard datasets (e.g., t = 20 and n = 600), but they are too time consuming to process large DNA datasets, such as ChIPseq datasets that contain thousands of sequences or more.
Results
We analyze the effects of t and q on the time performance of qPMS algorithms and find that a large t or a small q causes a longer computation time. Based on this information, we improve the time performance of existing qPMS algorithms by selecting a sample sequence set D’ with a small t and a large q from the large input dataset D and then executing qPMS algorithms on D’. A sample sequence selection algorithm named SamSelect is proposed. The experimental results on both simulated and real data show (1) that SamSelect can select D’ efficiently and (2) that the qPMS algorithms executed on D’ can find implanted or real motifs in a significantly shorter time than when executed on D.
Conclusions
We improve the ability of existing qPMS algorithms to process large DNA datasets from the perspective of selecting highquality sample sequence sets so that the qPMS algorithms can find motifs in a short time in the selected sample sequence set D’, rather than take an unfeasibly long time to search the original sequence set D. Our motif discovery method is an approximate algorithm.
Background
DNA motif discovery is a key factor in locating regulatory elements (e.g., transcription factor binding sites) in DNA sequences [1,2,3,4]. The quorum planted motif search (qPMS) [5, 6], a widely studied formulation for motif discovery, defines a motif as an llength string (lmer) m that occurs in at least qt out of t nlength (n > l) input sequences with up to d (0 ≤ d < l) mismatches, where q (0 < q ≤ 1) is the proportion of the input sequences containing motif occurrences; m and its occurrences in the sequences are called an (l, d) motif and its instances, respectively. Given a set of t nlength DNA sequences D = {s_{1}, s_{2}, …, s_{ t }} containing a motif m and the parameters l, d and q describing m, the task of qPMS is to find all (l, d) motifs present in D such that m must exist in the found motifs.
qPMS is NPcomplete [7]. Over the past two decades, there have been many studies on qPMS algorithms [8,9,10,11]. The qPMS algorithms are based on searching possible combinations of motif instances or possible candidate motifs and are either sample driven or pattern driven. The sampledriven qPMS algorithms, such as WINNOWER [5], DPCFG [12] and RecMotif [13], have an initial search space of (n – l + 1)^{t}ttuples (x_{1}, x_{2}, …, x_{ t }) in the case of q = 1; each t tuple is composed of t lmers from t input sequences, i.e., a group of possible motif instances. The patterndriven qPMS algorithms have an initial search space of 4^{l} candidate motifs and verify if each candidate motif is an (l, d) motif. Because of the much smaller initial search space, the patterndriven qPMS algorithms usually exhibit better time performance than the sampledriven qPMS algorithms.
The time performance of the patterndriven qPMS algorithms depends mainly on two aspects: the number of candidate motifs and the efficiency of candidate motif verification. To speed up candidate motif verification, the suffix treebased pattern driven (stpd) qPMS algorithms, such as Speller [14], Weeder [15], RISOTTO [16] and FMotif [17], construct a suffix tree of input sequences. The basic procedure for verifying a candidate motif m is then as follows: match m along different paths from the suffix tree root and record the current number of mismatches e on each path; if e is greater than d, then terminate the match on the corresponding path; and if the llength paths with e ≤ d correspond to a group of strings that can span at least qt input sequences, then m is determined to be an (l, d) motif.
With a focus on reducing the number of candidate motifs, some algorithms combine the sampledriven and patterndriven approaches. These are called samplepatterndriven (spd) qPMS algorithms. In the sampledriven phase, these algorithms use t – qt + h reference sequences, which must contain at least h motif instances, and traverse all the htuples (x_{1}, x_{2}, …, x_{ h }) in these reference sequences. An htuple consists of h lmers from different reference sequences, i.e., a group of h possible motif instances. In the patterndriven phase, these algorithms generate common dneighbors of each htuple (a dneighbor of an htuple is an lmer y such that the Hamming distance between y and each lmer x_{ i } in the htuple is less than or equal to d), and take them as candidate motifs to verify one by one. The existing spd qPMS algorithms can be classified according to the different values of h, as follows: PMSP [18] and PMSprune [6] have h = 1, PairMotif [19], qPMS7 [20] and TravStrR [21] have h = 2, iTriplet [22] and PMS5 [23] have h = 3, and PMS8 [24] and qPMS9 [25] have h ≥ 3.
The existing qPMS algorithms currently perform well when processing traditional standard DNA datasets [5] (e.g., t = 20, n = 600), even for challenging (l, d) problem instances [26]. However, these algorithms encounter bottlenecks when processing large DNA datasets, such as the ChIPseq datasets [9, 27], which typically contain thousands of DNA sequences or even more. ChIPseq datasets enable the identification of transcription factor binding sites within the genome but present a significant computational challenge for qPMS. First, the sampledriven qPMS algorithms undergo a combinatorial explosion because the search space grows exponentially with the number t of DNA sequences. Second, for the stpd qPMS algorithms, the running time shows quadratic growth as t increases and also increases as q decreases (see the analysis in the section Why to Select Sample Sequences). Third, for the spd qPMS algorithms, there are too many htuples to be considered in the t – qt + h reference sequences, greatly extending the time required. Therefore, it is necessary to accelerate the existing qPMS algorithms for large DNA datasets.
As described above, the time performance of the qPMS algorithms is affected by both the number t of input sequences and the proportion q of the input sequences containing motif instances; specifically, a large t or a small q will increase the computation time for both the stpd and the spd qPMS algorithms. Consider a dataset D of a motif m such that there are qt sequences containing instances of m in a total of t sequences and a subset D’ of D such that there are q’t’ sequences containing instances of m in a total of t’ sequences, satisfying 0 < t’ < t and 1 ≥ q’ > q > 0. It is not difficult to find that when a qPMS algorithm is executed on D and D’ separately, the motif m can be found in both cases, and the running time on D’ can be significantly smaller than that on D. Based on this consideration, given a large DNA dataset D, one way to effectively improve the time performance of qPMS algorithms is to select a portion of the sequences from D to form a sample sequence set D’, making the proportion of the sequences containing motif instances higher in D’ than in D, and then execute qPMS algorithms on D’ to perform motif discovery.
In this paper, we analyze why the selection of sample sequences for the qPMS algorithms is important. Then, we propose a method of selecting sample sequences. Additionally, we use both simulated data and real data to validate the ability of the qPMS algorithms to perform motif discovery on the selected sample sequences, i.e., whether they can find the implanted or real motifs in a significantly shorter time.
Methods
Why to select sample sequences
The notations frequently used in this paper are summarized in Table 1.
Fixing (l, d) and the length n of a single sequence, we analyze the effects of the number t of input sequences and the proportion q of the input sequences containing motif instances on the time performance of qPMS algorithms. We analyze the stpd and the spd qPMS algorithms.
The stpd qPMS algorithms construct a suffix tree of t nlength input sequences [14]. In the tree, each edge is labeled with a nonempty substring of the input sequences, and each node v corresponds to a string str_{ v } representing the concatenation of the substrings on the path from the root of tree to v. If v is a leaf, then str_{ v } is a suffix of input sequences; otherwise, str_{ v } is a common prefix of the suffixes represented by all leaves under v. The suffix tree has exactly tn leaves, representing tn suffixes of input sequences. For each node v of the tree, the IDs of sequences in which str_{ v } occurs exactly are stored by using a vector of t bits for good storage efficiency.
In addition to the suffix tree, these algorithms also use a pattern tree, a complete quadtree of depth l representing all the patterns over Σ with length ranging from 1 to l. Then, they perform a depthfirst search on the pattern tree. When visiting a node v corresponding to a pattern p, they use the suffix tree to obtain the IDs of sequences in which all dneighbors of p occur exactly, i.e., the IDs of sequences in which p occurs with up to d mismatches. If the number of the sequence IDs obtained is greater than or equal to qt and the length of p is less than l, they continue to visit the children of v corresponding to the patterns pb (b ∈ Σ) and otherwise prune the subtree of v. Finally, they output all the llength patterns that span at least qt sequences.
The time and space complexity of the stpd qPMS algorithms can be evaluated as follows [14]. The suffix tree of t nlength sequences has tn leaves and thus up to tn nodes of llength strings; for each such node v in the suffix tree, at most B_{ d }(str_{ v }) patterns in the pattern tree have up to d mismatches with str_{ v }; for each such pattern y, when it is verified as a candidate motif, the node v needs to be visited once, and the binary OR operation is executed on the vector of t bits in O(t) time. Therefore, the time complexity is O(t^{2}nB_{ d }(str_{ v })), which is approximately O(t^{2}nl^{d}4^{d}). Since a vector of t bits is stored in each of O(tn) nodes of the suffix tree, the space complexity is O(t^{2}n/w), where w is the word size of the computer.
We find that t has a strong effect on both the time and space performance of the stpd qPMS algorithms, i.e., both the running time and the storage space show quadratic growth as t increases. Furthermore, although q does not appear in the time complexity evaluated above, it also affects the time performance because it affects the pruning efficiency when searching the pattern tree. As described above, the subtree of a node v corresponding to a pattern p that cannot span at least qt sequences is pruned. If q is small, then p has a higher probability P_{span} of spanning at least qt sequences (P_{span} is calculated by (1), where P_{ d } is the probability that the Hamming distance between two random lmers is less than or equal to d), which is detrimental to pruning. Therefore, the smaller the value of q, the higher is the computational time of the stpd qPMS algorithms.
The time performance of the spd qPMS algorithms depends mainly on the number of generated candidate motifs. These algorithms use all htuples in t – qt + h reference sequences to generate candidate motifs. That is, they must consider all possible combinations of h reference sequences in t – qt + h reference sequences; the number of possible combinations is denoted by N_{ com } and calculated by (3). For a given algorithm, the value of h (h ≥ 1) is generally fixed, so N_{ com } is mainly affected by t and q. Obviously, when t increases or q decreases, N_{ com } will increase, leading to more candidate motifs and a higher computation time.
Based on the above analysis, both t and q have the same effect on the stpd qPMS algorithms as on the spd qPMS algorithms: a large t or a small q will increase the computation time. Large DNA datasets, such as ChIPseq datasets (see Tables 2 and 3), typically contain thousands DNA sequences or even more; that is, t is very large. On the other hand, the proportion of sequences containing motif instances is not large, that is, q is small. The two aspects make qPMS algorithms too time consuming to process large DNA datasets.
One way to effectively improve the time performance of qPMS algorithms is to select a sample sequence set D’ with a larger proportion of sequences containing motif instances from the given dataset D and then to execute qPMS algorithms on D’ to perform motif discovery. Accordingly, the problem to be solved is described as follows.
Sample sequence selection problem
Given a set of t nlength DNA sequences D = {s_{1}, s_{2}, …, s_{ t }} containing instances of a motif m, along with the parameters l, d and q describing m (see Table 1 for the explanation of these parameters), the task is to select a portion of the sequences from D to form a sample sequence set D’ (let t’ = D’, and let q’ be the proportion of sequences containing instances of m in D’), so that t’ < t and q’ > q.
How to select sample sequences
Basic concept
Because of the conservation of DNA motifs, the instances of a particular motif are similar to each other. Thus, if a substring x in the input sequences overlaps a motif instance, the occurrence frequency of x is generally higher than that of a substring y with y = x in the background sequences. Based on this difference in frequency, our basic idea is to convert the problem of selecting sample sequences containing motif instances into the problem of selecting sample sequences containing highfrequency substrings. That is, we test whether a sequence contains a highfrequency substring to determine whether the sequence contains a motif instance.
Since most of the motif instances are similar but not exactly the same, the occurrence frequency of a substring x is evaluated by the count of x in D with up to k mismatches, denoted by count_{ k }(x), i.e., the number of substrings y in D satisfying d_{ H }(y, x) ≤ k. Notably, the time complexity of computing count_{ k }(x) for a substring x grows dramatically as k increases; moreover, we need to compute count_{ k }(x) for all substrings of a specified length w in the input sequences. Therefore, the value of k cannot be large if good time complexity is to be achieved. When k is small, the length w should also be small to obtain enough substrings overlapping motif instances.
The length w is generally smaller than the motif length l, and a motif instance in a sequence may produce multiple overlapped highfrequency wmers. Therefore, after fetching highfrequency wmers, a step is needed to combine multiple overlapped wmers into one highfrequency substring. The length of the combined highfrequency substrings may not be equal but is generally greater than l. A highfrequency substring is expected to cover a motif instance.
Furthermore, the obtained highfrequency substrings need to be grouped. To guarantee a large value of q’, a sample sequence set is expected to contain only instances of a single motif. However, the input sequences may contain multiple motifs and the disturbance of random highfrequency substrings; that is, in general, the obtained highfrequency substrings are composed of instances of multiple motifs and some random highfrequency substrings. Therefore, we use a clustering method to divide the obtained highfrequency substrings into groups and thus may obtain two or more highquality sample sequence sets so that a sample sequence set exists corresponding to the motif to be found.
Based on these considerations, SamSelect consists of the following three steps: i) word count with mismatches, used to fetch highfrequency wmers; ii) highfrequency substring obtainment, used to obtain highfrequency substrings by combining overlapped wmers; and iii) highfrequency substring grouping, used to obtain sample sequence sets by clustering highfrequency substrings.
Word count with mismatches
We compute count_{ k }(x) for all wmers x in the input sequences. Given a wmer x, count_{ k }(x) is represented as
where I_{ y } is an indicator variable and it is 1 if d_{ H }(y, x) ≤ k, 0 otherwise.
Our method for computing count_{ k }(x) is based on the count operation (computing the number of occurrences of a string y in D, i.e., count(y)) of FMIndex [28]. That is, count_{ k }(x) is converted into the sum of the number of occurrences of all kneighbors of x:
FMIndex is a selfindexed data structure. Let [L_{ y }, R_{ y }] denote the ranking interval of the suffixes of input sequences prefixed by a string y. With [L_{ y }, R_{ y }], count(y) = R_{ y }– L_{ y } + 1 can be obtained immediately. The process of computing [L_{ y }, R_{ y }] is to traverse w characters of y from right to left (i.e., backward search); when the ith (1 ≤ i ≤ w) character y[i] is visited, the interval [L_{ φ }, R_{ φ }] for φ = y[i..w] is obtained in O(logΣ) time based on the interval [L_{φ’}, R_{φ’}] for φ’ = y[i + 1..w] through FMIndex. Thus, count(y) is computed in O(wlogΣ) time.
The count of a single wmer can be computed efficiently with FMIndex, but if we obtain count_{ k }(x) by independently computing the count of each wmer in B_{ k }(x), then the backward search on the common suffixes of wmers in B_{ k }(x) will be performed repeatedly. For example, when computing count_{1}(x) for a 3mer x = ACG, if we independently compute the counts of the four 3mers ACG, CCG, GCG and TCG in B_{1}(x), then the backward search on the common suffix CG will be performed four times. Moreover, our goal is to obtain count_{ k }(x) for all wmers x in the input sequences, making the number of repeated backward searches even larger.
To address this problem, we design a method to minimize the number of repeated backward searches. As shown in Fig. 1, we first efficiently compute the values of count(y) for all wmers y in the input sequences by using Algorithm 1 and store them in a Table T of size 4^{w}, where T[i] stores the value of count(y) for the wmer y with stn(y) = i; then, we obtain count_{ k }(x) for a given wmer x by querying T B_{ k }(x) times and summing T[stn(y)] for each y in B_{ k }(x). In Algorithm 1, we obtain T by searching a quadtree of depth w. The leaves and internal nodes of the quadtree correspond to all wlength strings over Σ and their common suffixes, respectively. All elements in T are initialized to zero; in searching the quadtree, when the value of count(y) for a wmer y is greater than zero, T[stn(y)] is updated to count(y).
Algorithm 1 is able to minimize the number of repeated backward searches. When an arbitrary node v of the quadtree is being visited (let φ be the string corresponding to v), the interval [L_{φ’}, R_{φ’}] for φ’ = φ[2..φ] has already been obtained, and only O(logΣ) time is needed to obtain the interval [L_{ φ }, R_{ φ }] for φ. Therefore, for all strings with a common suffix φ, the backward search on the suffix φ is only executed once. Moreover, we use pruning technology in the search process. Once count(φ) for a string φ that corresponds to a node v is 0, the subtree of v is pruned.
To guarantee good space and time performance of word count with up to k mismatches, it is necessary to select appropriate values of w and k. Except for building FMIndex, which is not affected by w and k, the space complexity is O(4^{w}), which is mainly used to store the Table T. The time complexity T_{count} depends on two parts, T_{1} and T_{2}. T_{1} is involved in building T by visiting every node of the wdepth quadtree in the worst case. T_{2} is used to compute count_{ k }(x) for each wmer x in t nlength sequences by querying T B_{ k }(wmer) times.
Because k affects the time T_{2}, it is expected to be kept as small as possible; on the other hand, since the instances of a particular motif are a group of substrings similar to each other, it is more meaningful that k is greater than or equal to 1. The value of w affects both the space and time performance of the word count with up to k mismatches. According to empirical studies, w should be less than 15 to guarantee good performance by a personal computer. In SamSelect, we set w and k to 12 and 1, respectively. With this setting, in addition to the guarantee of good space and time performance, we would also like to obtain more motif information, as the probability analysis shows that count_{1}(12mer) for a motif instance is significantly larger than that for a background substring [29].
Highfrequency substring obtainment
We use highfrequency substrings in input sequences to represent the corresponding sequences, and make the following considerations for obtaining highfrequency substrings. First, we select the wmers x in input sequences with count_{ k }(x) greater than a certain threshold f, combine the overlapped wmers to one substring and store the substrings of length greater than or equal to l in a set A. Second, to guarantee good time performance of the substring clustering in the next step, we set the total number of substrings to no more than 5000, which is much larger than the number of outputted sample sequences; if we obtain more than 5000 substrings, we will increase f repeatedly by a small amount. Third, we need to segment long highfrequency substrings because they may contain instances of two or more adjacent different motifs. This division guarantees that the substrings in a particular group correspond to the instances of the same motif; after segmentation, we store the substrings of length greater than or equal to l to a set A’.
The overall process of this step is shown in Fig. 2. The initial value of threshold f is set to the sum of N_{ r } and N_{ m }, where N_{ r } and N_{ m } are count_{ k }(wmer) for a background substring and a motif instance for a random case, respectively; the calculation method of N_{ r } and N_{ m } is given in [29]. For any two overlapped wmers, if the length of the overlap is greater than or equal to w/2, we combine the two wmers into one substring. Notably, some substrings are obtained by combining more than two overlapped wmers (e.g., the substring of s_{ t } in Fig. 2).
Next, we describe how to segment substrings. We first give some definitions. A φ – l + 1 size table denoted by attractTable_{ φ } is built for each substring φ in A. To explain this table, we define the distance dis(φ, φ’) between two given substrings φ and φ’ as the minimum Hamming distance between two lmers x ∈_{ l }φ and x’ ∈_{ l }φ’; dis(φ, φ’) is calculated by (7). The ith element of the table attractTable_{ φ }[i] is calculated by (8), where minPos_{ φ }(φ’) is the set of all positions of the lmers in φ leading to dis(φ, φ’).
The process of segmenting a substring φ is given in Algorithm 3. Let x be the lmer in φ with the position of the maximum element in attractTable_{ φ }. Since some deviations may occur between the position of x and that of the corresponding motif instance, we cut out x from φ and form a new substring by extending up to 3 characters from both the left and the right side of x. After cutting out x, if the length of the remaining left/right part of φ is still greater than or equal to l, we recursively segment the remaining left/right part of φ.
The computation time of this step is mainly determined by the following two aspects. First, we scan all wmers in the entire dataset in O(tn) time to obtain the initial highfrequency substrings and store them to the set A. Second, in segmenting substrings, we need to calculate the distance between each pair of substrings in A in O(L^{2}) time, where L is the average length of the substrings in A. Therefore, the time complexity of this step is O(tn + A^{2} L^{2}).
Highfrequency substring grouping
We mainly use the clustering method to obtain sample sequence sets. The process is described in Algorithm 4, which includes three stages.
In the first stage (line 1), we cluster the highfrequency substrings to distinguish substrings corresponding to different motifs. The AP algorithm [30] is used for clustering; it can automatically determine the number of clusters and obtain cluster centers. For each cluster, we take the cluster center as the substring that is most similar to the motif and use it to filter out random highfrequency substrings in the cluster. In clustering, the similarity sim(φ, φ’) between two substrings φ and φ’ is evaluated as follows.
In the second stage (lines 2 to 11), the resulting clusters are combined, since multiple clusters may correspond to the same motif. For two clusters c and c’ (c ≥ c’), we use the cluster center φ of c to compare each substring φ’ in c’; in terms of (11), if the number of φ’ satisfying dis(φ, φ’) ≤ d is significantly larger than the number under random case P_{ d }c’, we combine c and c’. Multiple clusters are combined by using a greedy strategy.
In the third stage (lines 12 to 17), we obtain sample sequence sets. For each cluster c, we sort the substrings in c in ascending order according to their distance from the cluster center and update c by keeping the first t’ substrings. The value of t’ is specified by the user and should be less than or equal to the maximum number of sequences containing motif instances qt. Then, to maximize the possibility that c corresponds to a set of motif instances, we use the following three rules in turn to test c and filter out a portion of substrings to make c satisfy these rules. Thus, the final value of t’ may be less than the specified value. Finally, for each cluster c, after filtering, we obtain a sample sequence set D’ consisting of the input sequences from which substrings in c are obtained. If we obtain two or more sample sequence sets, we rank them in descending order by size, since a large sample sequence set is more likely to contain a highly conserved motif.
Rule 1
The distance between any two substrings in c is less than or equal to 2d.
Rule 2
The distance between each substring in c and the cluster center is less than or equal to 3d/2.
The reason for adopting these two rules is as follows. For any two motif instances, their Hamming distance is less than or equal to 2d. The cluster center usually contains a motif instance of high conservation that is close to the motif and at distance < d from the motif. Therefore, a more stringent distance constraint (≤ 3d/2) should be observed between each substring in c and the cluster center.
Rule 3
The set c is a motif set.
The set c satisfying Rule 1 is called a pairwise bounded set. If c is a set of motif instances, a consensus m should exist such that the distance between m and each substring in c is less than or equal to d; such set c is called a motif set. A pairwise bounded set that is not a motif set is called a decoy set.
The work of Boucher and King [31] shows a clear difference between the weight of motif sets and that of decoy sets (the weight is calculated by (12)), so the majority of motif sets and decoy sets can be distinguished with statistical methods. Specifically, for a given pairwise bounded set c, if w(c) ≤ a_{ m } or w(c) ≥ a_{ d }, where a_{ m } and a_{ d } (a_{ m } < a_{ d }) are two thresholds obtained by statistical methods, c is determined as a motif set or a decoy set. Otherwise, an exhaustive method is required to determine whether c is a motif set. In our work, to maximize the possibility that c is a motif set, it is determined as a motif set if w(c) ≤ a_{ m }; otherwise, ten substrings are removed from c iteratively. We use the following method to set the threshold a_{ m }: randomly generate 1000 samples, each containing c motif instances; then, compute the mean μ and the standard deviation σ of the weights of these samples; finally, set a_{ m } to μ + σ.
For each obtained sample sequence set D’, t’ = D’, and the value of q’ is set to 0.9 to 0.95 according to the intensity of the disturbance information in the processed data. Although we maximize the possibility that D’ corresponds to a motif set, q’ cannot be set to 1. The reasons are as follows. First, the statistical method is used to determine a cluster of substrings as a motif set. Second, the distance between two substrings φ and φ’ is defined as the minimum Hamming distance between two lmers x ∈_{ l }φ and x’ ∈_{ l }φ’; thus, when the distance of φ is calculated from different φ’, the lmer in φ leading to dis(φ, φ’) may not come from a fixed position, which also affects the accuracy of determining a set as a motif set.
The computation time of this step is mainly determined by clustering the highfrequency substrings obtained in the previous step, i.e., the substrings stored in the set A’. To obtain the similarity matrix for clustering, we need to calculate the distance between each pair of substrings in A’ in O(L’^{2}) time, where L’ is the average length of the substrings in A’. Then, given the similarity matrix, the time complexity of the AP clustering algorithm is O(A’^{2}r) [30], where r is the number of iterations. Therefore, the time complexity of this step is O(A’^{2}(L’^{2} + r)).
The overall time complexity of SamSelect, denoted by T_{SamSelect}, is obtained by adding up the time complexity of the three steps of SamSelect. Since each sequence contains constant occurrences of highfrequency substrings, the number of obtained highfrequency substrings is O(t). Then, we have A = O(t) and A’ = O(t). According to empirical studies, we have L = O(l) and L’ = O(l). Therefore, T_{SamSelect} is given as follows.
Results and discussion
Data, experimental setting and evaluation
Both the simulated data and real data are used in our experiment. The simulated data are generated as follows [5]: randomly generate t nlength DNA sequences and an llength motif m; then, randomly select qt sequences, each implanted with a random instance m’ of m in a random position. The Hamming distance between m and m’ is less than or equal to d. To control the motif conservation, an instance m’ of m is generated as follows: randomly select d positions of m, and then, for each selected position i, change m[i] to a different character with probability g; a large g leads to lower motif conservation.
According to the settings of (l, d), t, q and g, three groups of simulated datasets are generated. The first group of simulated datasets is used to test qPMS algorithms under different (l, d) problem instances by fixing t = 3000 and q = 0.5, varying (l, d) from (9, 2) to (19, 7) and taking g as 0.2, 0.5 and 0.8 to represent high, intermediate and low conservation, respectively. The second group of simulated datasets is used to test qPMS algorithms under different proportions of sequences containing motif instances by fixing (l, d) = (9, 2), t = 3000 and g = 0.8 and varying q from 0.2 to 0.9. The third group of simulated datasets is used to test qPMS algorithms with a different scale of input by fixing (l, d) = (9, 2), g = 0.8 and q = 0.5 and varying t from 3000 to 10,000. For each combination of (l, d), t, q and g, the result is the average obtained on five randomly generated datasets.
Eight Homo sapiens datasets selected from the ENCODE TF ChIPseq data [32] and twelve mouse datasets in the mouse embryonic stem cell (mESC) data [33] are used as the real data. As shown in Tables 2 and 3, these datasets, each named for the corresponding transcription factor, have different numbers t of sequences, ranging from 1126 to 39,601. We use the following method to obtain the proportion q of sequences containing motif instances for each dataset: determine a consensus motif m (see the second column of Tables 2 and 3) according to the published motif (see Figs. 3 and 4), and set its value of (l, d) to a challenge problem instance [25]; then, scan the entire dataset using m to obtain the number Q of sequences containing at least one occurrence of m with up to d mismatches; finally, take q as Q/t. Note that, the actual value of q will be less than Q/t because the sequences contain random occurrences of m. We find that, although more sequences in ChIPseq datasets than in traditional small datasets containing motif instances, the proportion q of sequences containing motif instances in ChIPseq datasets is small. That is, a ChIPseq dataset contains many background sequences.
For the simulated data, the stpd qPMS algorithms (FMotif [17]) and spd qPMS algorithms (TravStrR [21] and qPMS9 [25]) are tested separately to verify the effect of using the sample sequences. FMotif is designed to handle ChIPseq datasets based on the suffix tree, whereas TravStrR and qPMS9 show good time performance when identifying motifs of large (l, d) on traditional datasets. For the real data, since the qPMS algorithms report the same results, we use a representative algorithm FMotif to verify that we can find real motifs in a reasonable time.
For each dataset D, the experiment uses SamSelect to select the sample sequence sets D’ from D, and then qPMS algorithms are executed separately on D and D’. When determining a sample sequence set D’, the number of sample sequences t’ is set to 100, and the proportion q’ of the sequences containing motif instances in D’ is set to 0.95 and 0.9 under the simulated and real data, respectively. Note that we use a smaller q’ for real data because more disturbance information is present in real data. The experimental environment is a 2.60 GHz 24core platform with 64 Gbyte memory. SamSelect and FMotif are executed on a single core. TravStrR and qPMS9 are executed on 24 cores.
The sample sequence selection is evaluated in terms of the following two goals. The first is to compute the speedup of running time T_{ D }/T_{ s } + T_{ D }’, where T_{ s } is the time of selecting sample sequences using SamSelect, and T_{ D } and T_{ D }’ are the running time of a particular qPMS algorithm on D and D’, respectively. The speedup can be fairly large as the number of sequences grows. The second is to verify whether the qPMS algorithms can find the implanted or real motifs m on D’; for FMotif, since it can output the rank of the identified motifs, we also compare the rank of m among the motifs obtained on D and that on D’. Note that in the case of two or more D’, T_{ D }’ is the total time on each D’. For the simulated data, the rank of m among the identified motifs is obtained on the first D’, since experimental results show that m is always present in the first D’; for the real data, both the rank of D’ containing m (denoted by D’_{m}) among all D’ and the rank of m among the motifs obtained on D’_{m} are reported.
Results of accelerating suffix treebased patterndriven qPMS algorithms
Since the maximum number of sequences processed by FMotif is limited to 3000, we only perform experiments on the first and second groups of simulated datasets, and the results are shown in Tables 4 and 5, respectively. We find that using the sample sequences selected by SamSelect to accelerate FMotif is effective. On the one hand, for each dataset D, the implanted motif m can be found on the selected sample sequence sets D’; in particular, the rank of m among the (l, d) motifs obtained on D’ can hold that on D, except for a few cases with a slight rise. On the other hand, the execution of FMotif on D’ achieves a good speedup (in some cases, the speedup can be more than 200); moreover, the running time of SamSelect is very small, generally negligible relative to the running time of qPMS algorithms on D.
We perform the following further analysis according to the results. First, the use of D’ can effectively reduce the effect of (l, d) on the time performance of FMotif. As shown in Table 4, although the running time of FMotif increases dramatically with increasing (l, d), which is easily explained by the time complexity of the stpd qPMS algorithms, the largest (l, d) problem instances processed by FMotif within 48 h on D and D’ are (15, 5) and (19, 7), respectively. Second, the use of D’ can effectively reduce the effect of q on the time performance of FMotif. As shown in Table 5, the running time of FMotif increases as q decreases for D of the same size, whereas FMotif executed on D’ can have efficient and stable time performance because the sizes of D’ and q’ obtained by SamSelect are nearly fixed. Third, the speedup is relatively small when processing small (l, d) problem instances with large q (e.g., (l, d) = (9, 2) and q = 0.9). In this case, the running time of SamSelect is larger than that of the FMotif executed on D’ but still smaller than that of FMotif executed on D. Finally, as shown in Table 4, for a particular (l, d), the higher the conservation of implanted motifs the larger the running time of SamSelect. This difference occurs because a high conservation of implanted motifs leads to the accumulation of more substrings to be clustered, thus increasing the time cost of clustering.
We also perform experiments on the simulated datasets of nonchallenging (l, d) instances. Except for (l, d), the settings of t, q and g for this group of simulated datasets are the same as those for the first group of simulated datasets. The results are shown in Table 6. We find that using the selected sample sequences to accelerate FMotif is also effective for nonchallenging (l, d) instances. It should be noted that, the speedup is less than 1 for the (9, 1) instance, which is a nonchallenging instance with a small (l, d). In this case, the running time of FMotif is small even on the entire dataset and it is not necessary to further accelerate FMotif using the selected sample sequences.
Results of accelerating samplepatterndriven qPMS algorithms
Tables 7, 8 and 9 give the results of testing spd qPMS algorithms (qPMS9 and TravStrR) on the first, second and third groups of simulated datasets, respectively. Since they output the same motifs as FMotif, they can also find the implanted (l, d) motifs, and thus we mainly consider their running time.
On the whole, both qPMS9 and TravStrR show poor time performance on D, spending more than 48 h for all (l, d) problem instances except small ones with large q. Therefore, a large speedup on D’ is achieved. The use of D’ can effectively reduce the effects of (l, d), q and t on the time performance. Furthermore, we perform the following analysis. First, as shown in Table 7, for a particular (l, d), spd qPMS algorithms require more time to solve problem instances of high conservation because the motif instances contained in D’ are more similar in the case of high conservation, and too many h tuples are needed to generate candidate motifs. Therefore, it is not surprising that, for the case of (l, d) = (19, 7) with high conservation, qPMS9 executed on D’ still takes more than 48 h. Second, as shown in Table 9, the running time of SamSelect increases slightly as the data scale increases but is still very small when t = 10,000.
Results on real data
We use FMotif to validate that qPMS algorithms identify real motifs by using the selected sample sequence sets D’. For the sake of fairness, a uniform parameter setting is used for each data set D in the experiments: we set q = 0.3, (l, d) = (13, 4) and t’ = 100 to execute SamSelect. After obtaining D’, we set q’ = 0.9 and use FMotif to search (13, 4) motifs in D’.
In Figs. 3 and 4, we give the experimental results, including the running time of SamSelect, the running time of FMotif on D’ and the predicted motifs. The found motif that is most similar to the published motif is taken as the predicted motif, shown in the form of a sequence logo [34]. Let D’_{m} denote the sample sequence set containing the predicted motif. Figures 3 and 4 also show the number of sample sequence sets obtained, the rank of D’_{m} (R_{1}) and the rank of the predicted motif among the motifs present in D’_{m} (R_{2}). For the real data, R_{2} is obtained by sorting the motifs present in D’_{m} in ascending order according to their enrichment Pvalue [35]. The sequence logo of the predicted motif is drawn by using the substrings similar to the motif in the entire dataset, i.e., the substrings with a Hamming distance no more than d / 2 from the motif. We find that FMotif executed on D’ can find the real motifs in a short time. It should be noted that the rank R_{1} and R_{2} differ greatly on some of the real datasets. The reasons are as follows. First, both the coregulated motifs and the spurious motifs can disturb finding the motif to be identified. Second, the intensity of the disturbance, which affects the rank R_{1} and R_{2}, is usually different for different datasets.
Applicability of SamSelect
Our motif discovery method is not an exact algorithm. Although our method can find the implanted (l, d) motif, it does not guarantee finding all (l, d) motifs present in the entire dataset D. Besides the implanted (l, d) motif, some spurious (l, d) motifs may also be present in D by chance and are usually less conserved than the implanted motif. Our method selects sample sequences by mining highfrequency substrings, which are more likely to be the instances of highly conserved motifs. Therefore, it may miss some spurious (l, d) motifs. Moreover, some reported motifs present in the sample sequence set D’ may not be the (l, d) motifs present in D, but it is not difficult to eliminate such motifs by verifying them in D.
Our method is particularly designed for large DNA sequence datasets. When processing traditional datasets (t = 20, n = 600), the existing qPMS algorithms have already performed well, even for challenging (l, d) problem instances. Therefore, it is not necessary to use our method to accelerate existing qPMS algorithms on small datasets.
Moreover, the setting of q’ is discussed as follows. In general, the proportion q of sequences containing motif instances in large datasets is relatively small. For example, the maximum value of q for the ChIPseq datasets given in Tables 2 and 3 is 0.68. For the sample sequence set selected by our method, the value of q’ is set to 0.9 to 0.95 as described in the section Methods. When q > 0.95, we still use our method to select the sample sequence set and set q’ = q. For a special case when q = 1, the reported motifs present in the sample sequence set must contain all the (l, d) motifs present in the entire dataset.
Conclusions
To address the problem that existing qPMS algorithms are too time consuming for motif discovery on large DNA sequence datasets, we propose an algorithm to select a sample sequence set D’ from D such that D’ has a larger proportion of input sequences containing motif instances. Executed on D’, the qPMS algorithms are able to find implanted or real motifs in a significantly shorter time. In our future work, we will design the parallel version of SamSelect and the extended SamSelect algorithm for motif discovery on large alphabet datasets, e.g., protein datasets.
Notably, qPMS10 [36, 37] is also a work of sample sequence selection for the quorum planted motif search. The main difference between qPMS10 and our work is as follows. qPMS10 adopts random sampling to select a sample sequence set with t’ ≤ t and q’ ≤ q. In our work, we analyze that for a particular t, a small q will cause larger computation time. Therefore, we use word count and clustering methods to select sample sequence sets with t’ < t and 1 ≥ q’ > q.
Abbreviations
 mESC:

Mouse embryonic stem cell
 qPMS:

Quorum planted motif search
 spd:

Samplepatterndriven
 stpd:

Suffix treebased pattern driven
References
D’haeseleer P. How does DNA sequence motif discovery work. Nat Biotechnol. 2006;24(8):959–61.
Wong KC, Chan TM, Peng C, Li Y, Zhang Z. DNA motif elucidation using belief propagation. Nucleic Acids Res. 2013;41(16):e153.
Weirauch MT, Yang A, Albu M, Cote A, MontenegroMontero A, Drewe P, Najafabadi HS, Lambert SA, Mann I, Cook K, Zheng H, Goity A, van Bakel H, Lozano JC, Galli M, Lewsey M, Huang E, Mukherjee T, Chen X, ReeceHoyes JS, Govindarajan S, Shaulsky G, Walhout AJM, Bouget FY, Ratsch G, Larrondo LF, Ecker JR, Hughes TR. Determination and inference of eukaryotic transcription factor sequence specificity. Cell. 2014;158(6):1431–43.
Wong KC. MotifHyades: expectation maximization for de novo DNA motif pair discovery on paired sequences. Bioinformatics. 2017;33(19):3028–35.
Pevzner PA, Sze SH. Combinatorial approaches to finding subtle signals in DNA sequences. In: Altman R, Bailey TL, editors. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology. California: AAAI Press; 2000. p. 269–78.
Davila J, Balla S, Rajasekaran S. Fast and practical algorithms for planted (l, d) motif search. IEEE/ACM Trans Comput Biol Bioinform. 2007;4(4):544–52.
Evans PA, Smith A, Wareham HT. On the complexity of finding common approximate substrings. Theor Comput Sci. 2003;306:407–30.
Das M, Dai H. A survey of DNA motif finding algorithms. BMC Bioinf. 2007;8(Suppl 7):S21.
Zambelli F, Pesole G, Pavesi G. Motif discovery and transcription factor binding sites before and after the next generation sequencing era. Brief Bioinform. 2013;14(2):225–37.
Lihu A, Holban Ş. A review of ensemble methods for de novo motif discovery in ChIPSeq data. Brief Bioinform. 2015;16(6):964–73.
Liu B, Yang J, Li Y, Mcdermaid A, Ma Q. An algorithmic perspective of de novo cisregulatory motif finding based on ChIPseq data. Brief Bioinform. 2017; https://doi.org/10.1093/bib/bbx026.
Yang X, Rajapakse JC. Graphical approach to weak motif recognition. Genome Inform. 2004;15(2):52–62.
Sun H, Low MYH, Hsu WJ, Rajapakse JC. RecMotif: a novel fast algorithm for weak motif discovery. BMC Bioinformatics. 2010;11(Suppl 11):S8.
Sagot MF. Spelling approximate repeated or common motifs using a suffix tree. In: Lucchesi CL, Moura AV, editors. Proceedings of the Third Latin American Symposium: Theoretical Informatics. Campinas: LNCS; 1998. p. 111–27.
Pavesi G, Mereghetti P, Mauri G, Pesole G. Weeder web: discovery of transcription factor binding sites in a set of sequences from coregulated genes. Nucleic Acids Res. 2004;32(Web Server issue):199–203.
Pisanti N, Carvalho AM, Marsan L, Sagot MF. RISOTTO: Fast extraction of motifs with mismatches. In: Correa JR, Hevia A, Kiwi MA, editors. Proceedings of the Seventh Latin American Symposium: Theoretical Informatics. Valdivia: LNCS; 2006. p. 757–68.
Jia C, Carson MB, Wang Y, Lin Y, Lu H. A new exhaustive method and strategy for finding motifs in ChIPenriched regions. PLoS One. 2014;9(1):e86044.
Davila J, Balla S, Rajasekaran S. Space and time efficient algorithms for planted motif search. In: Yi P, Zelikovsky A, editors. Proceedings of the Second International Workshop on Bioinformatics Research and Applications. UK: LNCS; 2006. p. 822–9.
Yu Q, Huo H, Zhang Y, Guo H. PairMotif: a new patterndriven algorithm for planted (l, d) DNA motif search. PLoS One. 2012;7(10):e48442.
Dinh H, Rajasekaran S, Davila J. qPMS7: a fast algorithm for finding (l, d)motifs in DNA and protein sequences. PLoS One. 2012;7(7):e41425.
Tanaka S. Improved exact enumerative algorithms for the planted (l, d)motif search problem. IEEE/ACM Trans Comput Biol Bioinf. 2014;11(2):361–74.
Ho ES, Jakubowski CD, Gunderson SI. iTriplet, a rulebased nucleic acid sequence motif finder. Algorithms Mol Biol. 2009;4(1):1–14.
Dinh H, Rajasekaran S, Kundeti VK. PMS5: an efficient exact algorithm for the (l, d)motif finding problem. BMC Bioinf. 2011;12:410.
Nicolae M, Rajasekaran S. Efficient sequential and parallel algorithms for planted motif search. BMC Bioinf. 2014;15:34.
Nicolae M, Rajasekaran S. qPMS9: an efficient algorithm for quorum planted motif search. Sci Rep. 2015;5:7813.
Buhler J, Tompa M. Finding motifs using random projections. J Comput Biol. 2002;9(2):225–42.
Bailey T, Krajewski P, Ladunga I, Lefebvre C, Li Q, Liu T, Madrigal P, Taslim C, Zhang J. Practical guidelines for the comprehensive analysis of ChIPseq data. PLoS Comput Biol. 2013;9(11):e1003326.
Huo H, Chen L, Zhao H, Vitter JS, Nekrich Y, Yu Q. A dataaware FMindex. In: Indyk P, editor. Proceedings of the SODA Algorithm Engineering and Experiments (ALENEX). San Diego: ACM Press; 2015. p. 10–23.
Yu Q, Huo H, Feng D. PairMotifChIP: a fast algorithm for discovery of patterns conserved in large ChIPseq data sets. Biomed Res Int. 2016;2016:4986707.
Frey BJ, Dueck D. Clustering by passing messages between data points. Science. 2007;315(5814):972–6.
Boucher C, King J. Fast motif recognition via application of statistical thresholds. BMC Bioinf. 2010;11(1):1–8.
Kheradpour P, Kellis M. Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments. Nucleic Acids Res. 2014;42(5):2976–87.
Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W, Jiang J, Loh YH, Yeo HC, Yeo ZX, Narang V, Govindarajan KR, Leong B, Shahab A, Ruan Y, Bourque G, Sung WK, Clarke ND, Wei CL, Ng HH. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008;133(6):1106–17.
Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence Logo generator. Genome Res. 2004;14(6):1188–90.
Hartmann H, Guthöhrlein EW, Siebert M, Luehr S, Söding J. Pvaluebased regulatory motif discovery using positional weight matrices. Genome Res. 2013;23(1):181–94.
Xiao P, Pal S, Rajasekaran S. qPMS10: a randomized algorithm for efficiently solving quorum planted motif search problem. In: Wang Y, Burrage K, editors. Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine. Shenzhen: IEEE Press; 2016. p. 670–5.
Xiao P, Pal S, Rajasekaran S. Randomised sequential and parallel algorithms for efficient quorum planted motif search. Int J Data Min Bioinform. 2017;18(2):105–24.
Acknowledgements
We express our sincere appreciation to the editors and specialist reviewers for their instructions and help improving the article.
Funding
This work was supported, in part, by the National Natural Science Foundation of China under Grants 61502366, 61741215 and 61373044 and by the Fundamental Research Funds for the Central Universities under Grant XJS17092. The funding body did not contribute to the design of the study, to the collection, analysis and interpretation of the data, or to the writing of the manuscript.
Availability of data and materials
The source code of SamSelect and the datasets generated and analyzed during the current study are available at https://github.com/qyu071/samselect.
Author information
Authors and Affiliations
Contributions
Initial idea of the research was from QY. QY, DW and HH designed the proposed algorithm. DW and QY implemented the proposed algorithm and carried out the experiments. All authors participated in analysis and manuscript preparation. All authors read and approved the final version of the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Yu, Q., Wei, D. & Huo, H. SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets. BMC Bioinformatics 19, 228 (2018). https://doi.org/10.1186/s128590182242y
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s128590182242y