HeliCis: a DNA motif discovery tool for colocalized motif pairs with periodic spacing
© Larsson et al. 2007
Received: 23 May 2007
Accepted: 28 October 2007
Published: 28 October 2007
Correct temporal and spatial gene expression during metazoan development relies on combinatorial interactions between different transcription factors. As a consequence, cis-regulatory elements often colocalize in clusters termed cis-regulatory modules. These may have requirements on organizational features such as spacing, order and helical phasing (periodic spacing) between binding sites. Due to the turning of the DNA helix, a small modification of the distance between a pair of sites may sometimes drastically disrupt function, while insertion of a full helical turn of DNA (10–11 bp) between cis elements may cause functionality to be restored. Recently, de novo motif discovery methods which incorporate organizational properties such as colocalization and order preferences have been developed, but there are no tools which incorporate periodic spacing into the model.
We have developed a web-based motif discovery tool, HeliCis, whose flexible model allows de novo detection of motifs with periodic spacing. Depending on the parameter settings, it may also be used to discover colocalized motifs without periodicity or motifs separated by a fixed gap of known or unknown length. We show on simulated data that it can efficiently capture the synergistic effects of colocalization and periodic spacing to improve detection of weak DNA motifs. It provides a simple web interface that interactively visualizes the current settings, making the parameters and the model structure easy to understand.
HeliCis provides simple and efficient de novo discovery of colocalized DNA motif pairs, with or without periodic spacing. Our evaluations show that it can detect weak periodic patterns which are not easily discovered using a sequential approach, i.e. first finding the binding sites and second analyzing the properties of their pairwise distances.
DNA sequence motifs recognized by transcription factors are usually short (~10 bp) with low information content, and matching sequence elements therefore occur randomly in large numbers in the genome. The precise specificity required for correct temporal and spatial transcription during metazoan development relies on combinatorial interactions between binding sites in relatively dense clusters. These clusters, termed cis-regulatory modules (CRMs), typically contain sites (cis-regulatory elements) for several different transcriptional activators and repressors. CRMs may be unstructured, serving as "billboards" that bring DNA binding proteins into proximity. In this case, the balance of activators and repressors, rather than the order or spacing between factors, is the most important property. They may, however, also be highly structured, the extreme example being the "enhanceosome"-type CRM, with very little flexibility in the arrangement of recognition sites. Others are more flexible, but with requirements on organizational features such as spacing, order and helical phasing between binding sites.
Numerous examples demonstrate the importance of the last feature, the phase. A small modification of the distance between a pair of sites may sometimes drastically disrupt function, and this is usually attributed to the turning of the DNA helix. In many cases, insertion of a full helical turn of DNA (10–11 bp) between cis elements will cause functionality to be restored, as this will cause the same face of the binding protein to be exposed to cofactors and nearby DNA binding factors. The phenomenon has been observed in many studies of single genes, e.g. for AP-1 and RD binding sites in the collagenase-3 promoter as well as for the smooth muscle α-actin promoter, where introduction of a 20 bp spacer caused significantly higher reporter activity than a 15 bp spacer. Other examples include the HPV18 enhancer, lung surfactant protein B, TNF-α and Igamma1. In a study of four coregulated Drosophila developmental enhancers, a conserved shared organization with pairwise periodic distances between neighboring sites was identified. Periodic signals in distances between neighboring motif pairs have also been observed on a genomic scale in Drosophila and other eukaryotes.
Significant effort has been put into the problem of de novo motif discovery of transcription factor binding sites. The task, often described as a local multiple alignment problem, is difficult due to the degenerate nature of transcription factor recognition sequences. Prediction may sometimes be improved by incorporating organizational features such as colocalization and order preferences into the model, and in recent years several such methods have been proposed [15–19]. The idea of incorporating helical phasing into a motif discovery tool has been suggested, but to our knowledge no such tool has yet been devised. We propose a motif sampler which can efficiently discover ordered or unordered colocalized motif pairs de novo in DNA sequences. In addition, our tool incorporates an optional periodic spacing model, and we show on simulated data that it can detect weak periodic patterns that are not easily discovered using single motif or colocalization methods.
and where W_A and W_B are the widths of the motifs. a_i and b_i cannot take on arbitrary values but will depend on each other, since we are looking for motif pairs where the distance between the two must follow certain criteria. We use a prior p(a_i, b_i) to reflect this, described below. We also assume there is a fixed prior probability p(R_i = true) for any sequence to be regulated. For θ_0, A_j and B_j we use Dirichlet priors, with pseudocounts α[l] proportional to the frequencies of the bases in all the sequences. Our goal is to find values for R = (R_1, ..., R_N), a = (a_1, ..., a_N) and b = (b_1, ..., b_N) which maximize the posterior p(R, a, b | S). To accomplish this we use an algorithm based on the Gibbs sampling principle for motif discovery, which makes use of the predictive update version of the Gibbs sampler.
so if R_i = true, a value for a_i can be sampled by using probabilities proportional to the numbers (10). Finally, b_i can be sampled by noting that given R_i = true and a value for a_i, the probabilities for valid values of b_i according to e(i, a_i) are proportional to Q_B[b_i].
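The role of the pseudocounts α[l] can be illustrated with a minimal Python sketch. It is not the HeliCis implementation: the function names and the `total` scaling factor are assumptions made for illustration, since the text only states that the pseudocounts are proportional to the base frequencies in all sequences.

```python
from collections import Counter

BASES = "ACGT"

def background_pseudocounts(sequences, total=1.0):
    """Pseudocounts alpha[l] proportional to the base frequencies in all sequences.

    `total` (the sum of the pseudocounts) is a hypothetical tuning parameter.
    """
    counts = Counter(b for seq in sequences for b in seq.upper() if b in BASES)
    n = sum(counts.values())
    return {b: total * counts[b] / n for b in BASES}

def column_estimate(observed, alpha):
    """Posterior-mean base probabilities for one motif column under a Dirichlet prior."""
    denom = sum(observed.get(b, 0) + alpha[b] for b in BASES)
    return {b: (observed.get(b, 0) + alpha[b]) / denom for b in BASES}
```

With such a prior, a base that is never observed in a motif column still receives a small nonzero probability, which keeps the sampler from ruling out positions on the basis of very few sites.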
We are interested in finding values which maximize p(R, a, b | S), which approximately corresponds to maximizing F above. Having completed a full iteration of the update/sampling procedure, sampling continues at the first sequence. The algorithm stops when the same F has been observed several times in a row or when the maximum number of iterations is reached. To avoid getting stuck in local maxima, the algorithm is restarted several times. It is also systematically restarted with different settings of the phase φ (all values between 0...T-1 are evaluated), as this parameter is not updated during each run of the algorithm and therefore has to be determined exhaustively.
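The restart strategy can be sketched as an outer loop over all phases and random restarts. This is an illustrative Python sketch only: `run_sampler` is a hypothetical callable standing in for one complete run of the sampler at a fixed phase, and the default number of restarts is an assumption.

```python
import random

def search_all_phases(run_sampler, period=10, restarts=3, seed=0):
    """Run the sampler for every phase 0..period-1 and keep the best-scoring result.

    run_sampler(phase, rng) is assumed to return (score, state), where score
    plays the role of F and state holds the sampled R, a and b.
    """
    rng = random.Random(seed)
    best = None
    for phase in range(period):        # the phase is fixed within a run
        for _ in range(restarts):      # restarts reduce the risk of local maxima
            score, state = run_sampler(phase, rng)
            if best is None or score > best[0]:
                best = (score, phase, state)
    return best
```

The total number of sampler runs is the period multiplied by the number of restarts, which is why the cost grows with the chosen period.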
To prevent the algorithm from finding "shifted versions" of the actual motifs, a type of shift jump is introduced. Each time the score F is improved, possible shifts of the motifs are found, defined by adding or subtracting some integer to all a_i and b_i. For each of the possible shifts (a*, b*), we calculate F. If a better score is encountered, the positions are updated and used as a starting point for the next update/sampling iteration.
For simplicity, we have described the case where motif pairs are assumed to occur only on the forward strand. Our method optionally permits both forward and reverse strands to be searched. In this case, the sampling distribution and the calculation of the posterior probability for R are extended to include both strands. Optionally, information about conservation between species can be used to favor placement of motifs in evolutionarily conserved regions. In this case, instead of single sequences, pairwise alignments of orthologous sequences are loaded into the program. Gaps are removed from the "base" sequences to ensure that correct distances are maintained. The fraction of conserved bases over windows the same size as the motifs is calculated for each possible motif position. The sampling distributions are then weighted according to this vector. A similar strategy has been implemented previously. The same vector is also used to exclude regions from being searched. This allows the sampler to be restarted after convergence to search for a new set of non-overlapping binding sites.
The performance was evaluated on synthetic sequence datasets. Ordered pairs of SRF (CArG) and ETS binding sites, generated from raw TRANSFAC weight matrices (M01007 and M00771), were planted into sets of 15 random sequences of length 400 bp. The choice of matrices was arbitrary, although these factors have been shown to cooperatively regulate certain genes. One motif pair was assigned to each sequence and the distance between each pair was set to a uniformly random multiple (n = 0...4) of the helical period (10 bp) plus a 5 bp offset. The binding sites were thus both colocalized and periodically spaced. The TRANSFAC CArG matrix is based on 54 occurrences and the central 12 bases were used when generating the test sequences (the core CArG motif is 10 bp long). The ETS matrix is 12 bp long and based on 48 occurrences. Raw counts were converted into relative frequencies and bases were randomly selected according to this distribution. Several sequence sets with increasingly weaker motifs were generated by varying the number of pseudocounts between 0 and 4. The information content of the resulting matrices was calculated. Evaluation sequence sets are available both as supplementary information (see Additional File 2) and for download on the HeliCis homepage.
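A shift jump of this kind might look as follows in Python. This is a sketch under stated assumptions: the same integer shift is applied to both motifs, `score` is a hypothetical callable returning F for a set of positions, and `max_shift` is an illustrative bound that the text does not specify.

```python
def best_shift(a, b, score, max_shift=3):
    """Try shifting all motif start positions by a common integer and keep the best.

    a, b  : lists of start positions for motif A and motif B
    score : hypothetical callable score(a, b) returning the value of F
    """
    best_a, best_b, best_f = list(a), list(b), score(a, b)
    for s in range(-max_shift, max_shift + 1):
        if s == 0:
            continue
        a_s = [x + s for x in a]
        b_s = [x + s for x in b]
        if any(x < 0 for x in a_s + b_s):
            continue  # skip shifts that would move a site off the sequence start
        f = score(a_s, b_s)
        if f > best_f:
            best_a, best_b, best_f = a_s, b_s, f
    return best_a, best_b, best_f
```

A complete implementation would also need to check the upper sequence boundary, which is left to the caller or the scoring function here.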
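The conservation vector described above can be sketched in Python as a sliding-window fraction of identical alignment columns. The function name and the use of "-" as the gap character are assumptions; the exact weighting applied in HeliCis may differ.

```python
def conservation_weights(base_aln, ortholog_aln, width):
    """Fraction of conserved bases in each motif-sized window of the base sequence.

    base_aln and ortholog_aln are the two rows of a pairwise alignment.
    Columns where the base sequence has a gap are dropped so that positions
    refer to the ungapped base sequence.
    """
    conserved = [int(x == y)
                 for x, y in zip(base_aln.upper(), ortholog_aln.upper())
                 if x != "-"]
    n_windows = len(conserved) - width + 1
    if n_windows <= 0:
        return []
    window = sum(conserved[:width])
    weights = [window / width]
    for i in range(1, n_windows):
        # slide the window one position to the right
        window += conserved[i + width - 1] - conserved[i - 1]
        weights.append(window / width)
    return weights
```

Setting selected entries of the resulting vector to zero would exclude the corresponding positions from the search, in line with the masking use described in the text.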
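The generation of planted sites can be sketched in Python as below. Several details are assumptions rather than statements about the original evaluation: the pseudocount is added to every cell of the count matrix, information content is computed in bits against a uniform background, and the count matrices in the usage are toy values, not the TRANSFAC matrices.

```python
import math
import random

BASES = "ACGT"

def column_frequencies(counts, pseudocount=0.0):
    """Convert raw counts (A, C, G, T) for one column into relative frequencies."""
    total = sum(counts) + 4 * pseudocount
    return [(c + pseudocount) / total for c in counts]

def information_content(matrix, pseudocount=0.0):
    """Total information content in bits, assuming a uniform background."""
    ic = 0.0
    for counts in matrix:
        freqs = column_frequencies(counts, pseudocount)
        ic += 2.0 + sum(p * math.log2(p) for p in freqs if p > 0)
    return ic

def sample_site(matrix, pseudocount, rng):
    """Draw one binding site, column by column, from the frequency matrix."""
    return "".join(
        rng.choices(BASES, weights=column_frequencies(col, pseudocount))[0]
        for col in matrix
    )

def pair_distance(rng, period=10, offset=5, max_multiple=4):
    """Distance between paired sites: a random multiple of the period plus an offset."""
    return rng.randint(0, max_multiple) * period + offset
```

Adding pseudocounts flattens each column toward the uniform distribution, so the information content drops and the sampled sites deviate more often from the consensus.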
The CArG motif has high information content and all tested tools performed reasonably well on this motif before pseudocounts were added. However, the sensitivity of HeliCis with periodic and colocalization settings was still higher, reaching 99 % and 97 % respectively, as opposed to 88 % for MEME and BioProspector. As the information content of the motifs was lowered, the ability of the periodic model to make use of the periodicity in the data became obvious and the other methods were outperformed. When the already weak ETS motif was obscured by added pseudocounts, HeliCis in colocalization mode quickly lost its ability to make use of this motif to improve detection of the CArG box.
The ETS motif was not efficiently detected using any of the single motif methods, and this is where the advantages of the HeliCis model were most obvious. BioProspector in two-block mode was able to derive some advantage from the proximity to the stronger CArG motif and reached 65 % sensitivity with no added pseudocounts, compared with ~42 % for MEME and HeliCis in single motif mode. The corresponding result for HeliCis in colocalization mode was 92 %, and the advantage was even bigger when the information content of the motifs was reduced. On the ETS motif, HeliCis in periodic mode had considerably higher sensitivity than all the other tested methods throughout the series.
Again, the less informative ETS motif benefited considerably from the HeliCis model, both with periodic and colocalization settings. This motif was only sporadically detected by MEME, BioProspector and HeliCis with single motif settings, while HeliCis in periodic mode reached 91 % sensitivity when 16/20 sequences contained the motifs. When the fraction of motif-containing sequences was high (20/20 to 16/20), the CArG motif was also detected with higher sensitivity by HeliCis in periodic mode than by the other tested tools.
In the most challenging dataset, with motifs in 10 out of 20 sequences, HeliCis was not able to detect any motifs. However, both MEME and BioProspector could sporadically detect the CArG motif with average sensitivity scores of 32 % and 18 % respectively. MEME generally performed well in the PPV plots, reflecting that it was less prone to assigning false positive motifs in sequences that did not contain motifs. BioProspector cannot limit the number of detected two-block motifs to at most one per sequence. Due to a larger number of false positive predictions it therefore scored unfavorably in the PPV plots. It should be noted that its two-block model was occasionally able to detect the difficult ETS motif with high sensitivity; however, the average performance was still similar to that of the single motif methods.
We have described a novel tool for de novo discovery of regulatory DNA motifs, HeliCis, available for local use and through a web interface. Our method can efficiently detect motif pairs which are spatially colocalized in regulatory DNA. It is based on a flexible probabilistic model which optionally allows de novo discovery of motif pairs with periodic spacing (helical phasing). A large number of experimental studies show the importance of helical phasing in regulatory regions. The ability to detect such patterns de novo without prior knowledge of recognition sequences may be useful in the study of coregulated CRMs.
Our results show that HeliCis is able to efficiently take advantage of the synergistic effects of colocalization to improve sensitivity to weak DNA patterns. HeliCis in colocalization mode was evaluated on planted ETS and CArG motifs which were colocalized with a spacer of random variable length. The weaker ETS motif was detected with far better accuracy than with the other tested methods, and this can be attributed to the ability of our method to make use of the nearby stronger CArG motif to improve sensitivity. Detection of the CArG motif also benefited from the ETS motif, although to a lesser extent. Sensitivity improved drastically again when HeliCis was run in periodic mode. Both the CArG and the ETS motif benefited considerably from this reduction of the search space. Importantly, this shows that the method is capable of finding weak periodic patterns which are not readily detected using a "sequential" approach, i.e. first detecting single motifs and second analyzing their spacing properties.
One limitation of our model is that the motif widths are fixed. Some Gibbs sampling algorithms handle this using an alternative scoring function and restarts using several widths or the "fragmentation algorithm", while others use a fixed width [15, 27]. TF binding sites are usually within the 8–12 bp range and we have found results to be quite robust to changes in this parameter as long as the motif width is not set too short. Results were nearly identical when HeliCis was applied to the test sets in this paper using a 10 bp motif width instead of the default 12 bp (data not shown).
HeliCis models the intermotif distance as a variable integer multiple of the period T plus a fixed "phase" (offset) φ = 0...T-1. The phase is determined exhaustively by restarting the sampler several times, so execution time is proportional to the chosen period. A desirable improvement would be to determine the phase during execution of the algorithm rather than to use restarts. If several periods other than the default 10 bp are to be evaluated, more restarts are required and the algorithm can become computationally demanding. However, the current implementation normally does not cause problems with sequence sets of reasonable size. With 15 sequences of 400 bp each, execution time with the periodic model (10 bp period) is typically around 10 minutes on a low-end processor (Pentium 4 2.4 GHz). The execution time in each iteration theoretically scales linearly with the number of sequences, the total amount of sequence data, the motif length and the maximum motif distance. In practice, as long as each individual sequence is not too long (<1000 bp), the number of sequences is the most important factor (data not shown). Some parameters in the web interface have been slightly limited to avoid overloading the server, but no such limitations are present in the downloadable version.
HeliCis is a flexible and efficient tool for de novo discovery of colocalized DNA motif pairs. It incorporates structural features such as ordered or unordered colocalization and periodic spacing. Our evaluations show that it can detect weak periodic patterns which cannot be easily discovered by other means. It is available both for local use and through a simple web interface.
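The distance model can be made concrete with a short Python sketch that lists the intermotif distances permitted for a given period and phase. The function name and the `max_distance` bound are illustrative assumptions, and the sketch does not say whether HeliCis measures distance between site starts or as the gap between sites.

```python
def allowed_distances(period, phase, max_distance):
    """All distances of the form k * period + phase (k = 0, 1, ...) up to max_distance."""
    if not 0 <= phase < period:
        raise ValueError("phase must lie in 0..period-1")
    return list(range(phase, max_distance + 1, period))
```

For a 10 bp period only about one tenth of all candidate distances remain valid at a given phase, which illustrates the reduction of the search space that the periodic model relies on.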
Project name: HeliCis
Project home page: http://lymphomics.wall.gu.se/helicis
Operating system: Platform independent
Programming language: Matlab, C
License: Free for academic and non-profit researchers. Contact the authors for commercial licensing.
Prof. Olle Nerman is gratefully acknowledged for fruitful discussions during the initial part of the project. The work was partly funded by the European Commission: The Sixth Framework Programme (LSHG-CT-2004-503573).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.