HeliCis: a DNA motif discovery tool for colocalized motif pairs with periodic spacing

Background Correct temporal and spatial gene expression during metazoan development relies on combinatorial interactions between different transcription factors. As a consequence, cis-regulatory elements often colocalize in clusters termed cis-regulatory modules. These may have requirements on organizational features such as spacing, order and helical phasing (periodic spacing) between binding sites. Due to the turning of the DNA helix, a small modification of the distance between a pair of sites may sometimes drastically disrupt function, while insertion of a full helical turn of DNA (10–11 bp) between cis elements may cause functionality to be restored. Recently, de novo motif discovery methods which incorporate organizational properties such as colocalization and order preferences have been developed, but there are no tools which incorporate periodic spacing into the model. Results We have developed a web based motif discovery tool, HeliCis, which features a flexible model which allows de novo detection of motifs with periodic spacing. Depending on the parameter settings it may also be used for discovering colocalized motifs without periodicity or motifs separated by a fixed gap of known or unknown length. We show on simulated data that it can efficiently capture the synergistic effects of colocalization and periodic spacing to improve detection of weak DNA motifs. It provides a simple to use web interface which interactively visualizes the current settings and thereby makes it easy to understand the parameters and the model structure. Conclusion HeliCis provides simple and efficient de novo discovery of colocalized DNA motif pairs, with or without periodic spacing. Our evaluations show that it can detect weak periodic patterns which are not easily discovered using a sequential approach, i.e. first finding the binding sites and second analyzing the properties of their pairwise distances.


Background
DNA sequence motifs recognized by transcription factors are usually short (~10 bp) with low information content, and matching sequence elements therefore occur randomly in large numbers in the genome. The precise specificity required for correct temporal and spatial transcription during metazoan development relies on combinatorial interactions between binding sites in relatively dense clusters [1]. These clusters, termed cis-regulatory modules (CRMs), typically contain sites (cisregulatory elements) for several different transcriptional activators and repressors. CRMs may be unstructured, serving as "billboards" that bring DNA binding proteins into proximity [2]. In this case, the balance of activators and repressors, rather than the order or spacing between factors, is the most important property. They may however also be highly structured, the extreme example being the "enhanceosome"-type CRM, with very little flexibility in the arrangement of recognition sites [3]. Others are more flexible, but with requirements on organizational features such as spacing, order and helical phasing between binding sites.
Numerous examples demonstrate the importance of the last feature, the phase. A small modification of the distance between a pair of sites may sometimes drastically disrupt function and this is usually attributed to the turning of the DNA helix. In many cases, insertion of a full helical turn of DNA (10-11 bp [4]) between cis elements will cause functionality to be restored, as this will cause the same face of the binding protein to be exposed to cofactors and nearby DNA binding factors. The phenomenon has been observed in many studies of single genes, e.g. for AP-1 and RD binding sites in the collagenase-3 promoter [5] as well as for the smooth muscle α-actin promoter, where introduction of a 20 bp spacer caused significantly higher reporter activity than a 15 bp spacer [6]. Other examples include the HPV18 enhancer [7], lung surfactant protein B [8], TNF-α [9] and Igamma1 [10]. In study of four coregulated Drosophila developmental enhancers, a conserved shared organization with pairwise periodic distances between neighboring sites was identified [11]. Periodic signals in distances between neighboring motif pairs have also been observed on a genomic scale in Drosophila [12] and other eukaryotes [13].
Significant effort has been put into the problem of de novo motif discovery of transcription factor binding sites [14]. The task, often described as a local multiple alignment problem, is difficult due to the degenerate nature of transcription factor recognition sequences. Prediction may sometimes be improved by incorporating organizational features such as colocalization and order preferences into the model, and in recent years several such methods have been proposed [15][16][17][18][19]. The idea of incorporating helical phasing into a motif discovery tool has been suggested [12], but to our knowledge no such tool has yet been devised. We propose a motif sampler which can efficiently discover ordered or unordered colocalized motif pairs de novo in DNA sequences. In addition, our tool incorporates an optional periodic spacing model, and we show on simulated data that it can detect weak periodic patterns that are not easily discovered using single motif or colocalization methods.

Algorithm overview
We propose a de novo method for motif discovery, HeliCis, which can find motif pairs separated by a distance (gap) that varies in a periodical manner. More specifically, the distance is modeled as some fixed offset φ (the phase) plus a variable integer multiple of the period T ( Figure 1). Small deviations from exact periodic spacing may optionally be allowed. HeliCis detects patterns which are common to a group of sequences. A typical input would be regulatory DNA from a set of assumedly coregulated genes. The motif pair is assumed to be either present or absent in each sequence and may optionally be allowed to occur on either strand. The period T is specified by the user, but the program can be provided with a range of periods to evaluate. Upper and lower boundaries for the distance can be specified. The distance can be allowed to be negative, making it possible to find unordered motif pairs. Our method is not limited to finding periodically distributed binding sites. The flexibility of the algorithm makes the task of finding colocalized motifs (e.g. positioned within 100 bp of one another without periodicity) or motifs with fixed spacing (e.g. always exactly 25 bp from each other) into special cases simply achieved by choosing appropriate parameters. E.g. by setting the period to one, the model will find colocalized motifs without periodicity. Examples of parameter settings for different scenarios are available on the HeliCis home page [20]. The software also incorporates the possibility to take advantage of interspecies conservation by favoring motif placement in highly conserved regions.

Mathematical model
Let S be the set of N sequences to be analyzed. Each sequence s i ∈ S, of length n i , (i = 1...N) is assumed to contain zero or one motif pair. Below, we refer to motif-containing sequences as being regulated and denote this by R i = true. The position of the first and second motif in a particular sequence s i is denoted a i and b i respectively. Motifs are modeled as two position frequency matrices A and B, where A j [l] and B j [l] denotes the probability of the nucleotide l appearing in position j of motif A and B respectively. Unregulated sequences are modeled as background sequence, described by an order 0 Markov process with nucleotide frequencies θ 0 . Regulated sequences are mod-eled as a combination of motif and background sequence. The probability of a sequence s i (where s i,j denotes the j:th base in the sequence) can therefore be written and where and where W A and W B are widths of the motifs. a i and b i cannot take on arbitrary values but will depend on each other, since we are looking for motif pairs where the distance between the two must follow certain criteria. We use a prior p(a i ,b i ) to reflect this, described below. We also assume there is a fixed prior probability p(R i = true) for any sequence to be regulated. For θ 0 , A j and B j we use Dirichlet priors, with pseudocounts α[l] proportional to the frequencies of the bases in all the sequences. Our goal is to find values for R = (R 1 , ..., R N ), a = (a 1 , ..., a N ) and b = (b 1 , ..., b N ) which maximize the posterior p(R,a,b | S). To accomplish this we use an algorithm based on the Gibbs sampling principle for motif discovery [21], which makes use of the predictive update version of the Gibbs sampler [22]. As in other Gibbs motif samplers, an iterative update/ sampling procedure is applied. One of the sequences, s i , is removed from the alignment by setting R i = 0. Given values for A, B, and θ 0 according to the formulas above, new values for R i , a i and b i are determined by sampling from θ 0 , S) using the following steps: Bayes formula on odds form gives that from which we get that which is used to sample whether R i = true. Note that, using (1) and (2), we have We define the prior p(a i ,b i ) to be proportional to an indicator function e(a i ,b i ) which is zero unless a i and b i represent a pair of motif positions compatible with the assumptions that the motifs are both within the sequence, do not overlap, and have a distance conforming to the assumed periodicity and the assumed possible variation around this periodicity. As described above, the allowed distance is modeled as a fixed phase φ plus a variable integer multiple of the period T ( Figure 1). Specifically, given W A , W B , the period T, the phase φ, the allowed deviation from exact periodic distance ("noise"), the length of sequence i and the minimum and maximum distances, we can for all i = 1..n i find all j such that e(i,j) = 1, and the value of (8) can be calculated as Schematic drawing of the model structure Figure 1 Schematic drawing of the model structure. The triangle and rectangle represent the first and second motif respectively. Gray boxes indicate valid locations for the second motif given the position of the first. The "phase" (distance offset) is assumed to be constant over all sequences and is determined by the algorithm.
Period T

Phase
Secondly, we get that p(a i | R i = true, A, B, θ 0 , S) is proportional to so if R i = true, a value for a i can be sampled by using probabilities proportional to the numbers (10). Finally, b i can be sampled by noting that given R i = true and a value for a i , the probabilities for valid values of b i according to e(i, The algorithm is initiated by setting all R i = false. The update/sampling procedure described above is then performed for each sequence s i , i = 1...N. When all R i , a i and b i have been updated, the alignment is scored according to We are interested in finding values which maximize p(R, a, b | S), which approximately corresponds to maximizing F above. Having completed a full iteration of the update/ sampling procedure, sampling continues at the first sequence. The algorithm stops when the same F has been observed several times in a row or when the maximum number of iterations is reached. To avoid getting stuck in local maxima, the algorithm is restarted several times. It is also systematically restarted with different settings of the phase φ (all values between 0...T-1 are evaluated), as this parameter is not updated during each run of the algorithm and therefore has to be determined exhaustively.
To avoid that the algorithm finds "shifted versions" of the actual motifs, a type of shift jump is introduced. Each time the score F is improved, possible shifts of the motifs are found, defined by adding or subtracting some integer to all a i and b i . For each of the possible shifts (a*, b*), we calculate F. If a better score is encountered, the positions are updated and used as a starting point for the next update/ sampling iteration.
For simplicity, we have described the case where motif pairs are assumed to occur only on the forward strand. Our method optionally permits both forward and reverse strands to be searched. In this case, the sampling distribution and the calculation of the posterior probability for R is extended to included both strands. Optionally, information about conservation between species can be used to favor placement of motifs in evolutionarily conserved regions. In this case, instead of single sequences, pairwise alignments of orthologous sequences are loaded into the program. Gaps are removed from the "base" sequences to ensure that correct distances are maintained. The fraction of conserved bases over windows the same size as the motifs is calculated for each possible motif position. The sampling distributions are then weighted according to this vector. A similar strategy is implemented in [23]. The same vector is also used to exclude regions from being searched. This allows the sampler to be restarted after convergence to search for a new set of non-overlapping binding sites.

Implementation and user interface
The main algorithm is implemented in Matlab while time critical functions are written in the C language. These can be downloaded for local use (see Additional File 1). Heli-Cis is also available through a web interface [20] which provides several templates to simplify parameter setup. To make it easier to understand the function of the different parameters, these are visualized using an interactive schematic figure which is updated to reflect the current settings ( Figure 2). The web interface is implemented in php and the source files can be made available upon request.

Performance vs. motif information content
The performance was evaluated on synthetic sequence datasets. Ordered pairs of SRF (CArG) and ETS binding sites, generated from raw TRANSFAC [24] weight matrices (M01007 and M00771), were planted into sets of 15 random sequences of length 400 bp. The choice of matrices was arbitrary, although these factors have been shown to cooperatively regulate certain genes [25]. One motif pair was assigned to each sequence and the distance between each pair was set to a uniformly random multiple (n = 0...4) of the helical period (10 bp) plus a 5 bp offset. The binding sites were thus both colocalized and periodically spaced. The TRANSFAC CArG matrix is based on 54 occurrences and the central 12 bases were used when generating the test sequences (the core CArG motif is 10 bp long). The ETS matrix is 12 bp long and based on 48 occurrences. Raw counts were converted into relative frequencies and bases were randomly selected according to this distribution. Several sequence sets with increasingly weaker motifs were generated by varying the number of pseudocounts between 0 and 4. The information content of the resulting matrices was calculated. Evaluation sequence sets are available both as supplementary information (see Additional File 2) and for download on the HeliCis homepage [20].
HeliCis with default settings for periodic spacing (period 10, motif distance 0...50 bp), HeliCis with colocalization settings (period 1, distance 0...50 bp) and HeliCis with single motif settings were compared to an established single motif discovery tool based on the EM algorithm, MEME [26], and a motif discovery tool based on Gibbs sampling, BioProspector [27]. The latter was run in "twoblock" mode, searching for motif pairs with a maximum gap of 50 bp. All were configured to search the forward Web interface screenshot, showing the parameter setup screen strand only with a fixed motif width of 12 bp, and with forced presence of a motif in each sequence (oops = "one occurrence per sequence" model in MEME, "-a 1" switch for BioProspector, "-p 1" switch for HeliCis). The quality of the resulting alignments was determined by calculating the fraction of correctly identified sites (Figure 3). Results shown are average values from five independent trials where the sequence sets were regenerated each time. It should be noted that BioProspector, unlike HeliCis and MEME, cannot be forced to detect exactly one occurrence per sequence, but will often assign several motifs per sequence. This should be taken into account when evaluating the results, as this model may be slightly disadvantageous on this dataset.
The CArG motif has high information content and all tested tools performed reasonably well on this motif before pseudocounts were added. However, the sensitivity of HeliCis with periodic and colocalization settings was still higher, reaching 99 % and 97 % respectively, as opposed to 88 % for MEME and BioProspector. As the information content of the motifs was lowered, the ability of the periodic model to make use of the periodicity in the data became obvious and the other methods were outperformed. When the already weak ETS motif was obscured by added pseudocounts, HeliCis in colocalization mode quickly lost its ability to make use of this motif to improve detection of the CArG box.
The ETS motif was not efficiently detected using any of the single motif methods, and this is where the advantages of the HeliCis model were most obvious. BioProspector in two-block mode was able to draw some advantage of the proximity to the stronger CArG motif and reached 65 % sensivity with no added pseudocounts, to be compared with ~42 % for MEME and HeliCis in single motif mode. The corresponding result for HeliCis in colocalization mode was 92 %, and the advantage was even bigger when the information content of the motifs was reduced. On the ETS motif, HeliCis in periodic mode had considerably higher sensitivity than all the other tested methods throughout the series.

Performance vs. fraction of sequences containing motifs
In a second evaluation, sets of 20 sequences containing artificially planted CArG and ETS motifs were generated as described above. However, this time the information content of the motif matrices was kept constant (one pseudocount added). Instead, the fraction of sequences containing motifs was gradually reduced from 20/20 to 10/20, thus making them increasingly difficult to detect. In this case, the tools were not forced to detect motifs in all sequences (zoops = "zero or one occurrences per sequence" model in MEME, default for BioProspector and HeliCis). Other settings were as described above. To account for false positive predictions, a PPV score (positive predictive value, i.e. the fraction of predicted sites which are correct) was calculated, in addition to sensitivity. The results, shown in Figure 4, are average values from 5 independent trials.
Again, the less informative ETS motif benefited considerably from the HeliCis model, both with periodic and colocalization settings. This motif was only sporadically detected by MEME, BioProspector and HeliCis with single motif settings, while HeliCis in periodic mode reached 91 % sensitivity when 16/20 sequences contained the motifs. When the fraction of motif-containing sequences was Performance on synthetic sequence datasets containing colocalized and periodically spaced CArG and ETS motifs with varying information content Figure 3 Performance on synthetic sequence datasets containing colocalized and periodically spaced CArG and ETS motifs with varying information content. HeliCis with different settings was compared to MEME and BioProspector. The information content of the motifs was gradually reduced by varying the number of pseudocounts and the sensitivity of the different tools was determined by calculating the fraction of correctly identified motifs. Results are from 5 averaged trials. In the most challenging dataset, with motifs in 10 out of 20 sequences, HeliCis was not able to detect any motifs. However, both MEME and BioProspector could sporadically detect the CArG motif with average sensitivity scores of 32 % and 18 % respectively. MEME generally performed well in the PPV plots, reflecting that it was less prone to assigning false positive motifs in non-motif containing sequences. BioProspector does not have the possibility to limit the number of detected two-block motifs to maximum one per sequence. Due to a larger number of false positive predictions it therefore scored unfavorably in the PPV plots. It should be noted that its two-block model was occasionally able to detect the difficult ETS motif with high sensitivity, however, the average performance was still similar to the single motif methods.

Discussion
We have described a novel tool for de novo discovery of regulatory DNA motifs, HeliCis, available for local use and through a web interface [20]. Our method can efficiently detect motif pairs which are spatially colocalized in regulatory DNA. It is based on a flexible probabilistic model which optionally allows de novo discovery of motif pairs with periodic spacing (helical phasing). A large number of experimental studies show the importance of helical phasing in regulatory regions. The ability to detect such patterns de novo without prior knowledge of recognition sequences may be useful in the study of coregulated CRMs.
Our results show that HeliCis is able to efficiently take advantage of the synergistic effects of colocalization to improve sensitivity to weak DNA patterns. HeliCis in colocalization mode was evaluated on planted ETS and CArG motifs which were colocalized with a spacer of ran-Performance on synthetic sequence datasets with varying motif coverage Figure 4 Performance on synthetic sequence datasets with varying motif coverage. Datasets of 20 sequences with colocalized and periodically spaced CArG and ETS motifs were generated. The proportion of sequences containing the motifs was gradually reduced, thus making them increasingly difficult to detect. HeliCis with different settings was compared to MEME and BioProspector. The plots show sensitivity and positive predictive value (PPV = TP/(TP + FP)). Results are from 5 averaged trials.

HeliCis periodic
HeliCis colocalization dom variable length. The weaker ETS motif was detected with far better accuracy compared to other tested methods, and this can be attributed to the ability of our method to make use of the nearby stronger CArG motif to improve sensitivity. Detection of the CArG motif also benefited from the ETS-motif, although to a lesser extent. Sensitivity was further improved in a drastic way by running HeliCis in periodic mode. Both the CArG and the ETS motif benefited considerably from this reduction of the search space. Importantly, this shows that the method is capable of finding weak periodic patterns which are not readily detected using a "sequential" approach, i.e. first detecting single motifs and second analyzing their spacing properties.
One limitation of our model is that the motifs widths are fixed. Some Gibbs sampling algorithms handle this using an alternative scoring function and restarts using several widths [21] or the "fragmentation algorithm [28]," while others use a fixed width [15,27]. TF binding sites are usually within the 8-12 bp range and we have found results to be quite robust to changes in this parameter as long as the motif width is not set too short. Results were nearly identical when HeliCis was applied to the test sets in this paper using a 10 bp motif width instead of the default 12 bp (data not shown).
HeliCis models the intermotif distance as a variable integer multiple of the period T plus a fixed "phase" (offset) φ = 0...T-1. The phase is determined exhaustively by restarting the sampler several times, leading execution time to be proportional to the chosen period. A desirable improvement would be to determine the phase during execution of the algorithm rather than to use restarts. If several periods other than the default 10 bp are to be evaluated, more restarts are required and the algorithm can become computationally demanding. However, the current implementation normally does not cause problems with sequence sets of reasonable size. With 15 400 bp sequences, execution time with the periodic model (10 bp period) is typically around 10 minutes on a low-end processor (Pentium 4 2.4 GHz). The execution time in each iteration theoretically scales linearly with the number of sequences, the total amount of sequence data, the motif length and the maximum motif distance. In practice, as long as each individual sequence is not to long (<1000 bp), the number of sequences is the most important factor (data not shown). Some parameters in the web interface have been slightly limited to avoid overloading the server, but no such limitations are present in the downloadable version.

Conclusion
HeliCis is a flexible and efficient tool for de novo discovery of colocalized DNA motif pairs. It incorporates structural features such as ordered or unordered colocalization and periodic spacing. Our evaluations show that it can detect weak periodic patterns which cannot be easily discovered by others means. It is available both for local use and through a simple web interface.

Availability and requirements
Project name: HeliCis Project home page: http://lymphomics.wall.gu.se/helicis Operating system: Platform independent Programming language: Matlab, C License: Free for academic and non-profit researchers. Contact the authors for commercial licensing.