An integrative computational approach to effectively guide experimental identification of regulatory elements in promoters

Background: Transcriptional activity of genes depends on many factors, such as DNA motifs, conformational characteristics of DNA, and melting, and computational approaches exist for identifying them. In real applications, however, the number of predicted elements (DNA motifs, for example) may be very large. When several computational programs are applied, systematic experimental knockout of every potential element becomes impractical. What is needed is an approach that integrates many heterogeneous computational methods and, on that basis, suggests selected regulatory elements for experimental verification.

Results: Here, we present an integrative bioinformatic approach aimed at the discovery of regulatory modules that can be effectively verified experimentally. It is based on combinatorial analysis of known and novel binding motifs, as well as of any other known features of promoters. The goal of the method is to identify a collection of modules that are specific to an established dataset and, at the same time, optimal for experimental verification. The method is particularly effective on small datasets, where most statistical approaches fail. We apply it to promoters that drive tumor-specific gene expression in tumor-colonizing Gram-negative bacteria. The method successfully identified a number of potential modules, which required only a few experiments to verify. The resulting minimal functional bacterial promoter exhibited high specificity of expression in cancerous tissue.

Conclusions: Experimental analysis of promoter structures guided by bioinformatics has proved to be efficient. The developed computational method can include heterogeneous features of promoters and suggest combinatorial modules for experimental testing. The extensibility and robustness of the methodology implemented in the approach ensure good results for a wide range of problems.

Programs that are not applicable to our dataset.
Improbizer: fails to run with "number of motifs to find" = 6; the program reports "sequence data is too big". With "number of motifs to find" = 2, the program outputs only one motif, then hangs.
Ann-spec 1.0: searches only in human and yeast promoters.
MotifRegressor: needs expression values; cannot be run on a set of sequences alone.
CisModule: found modules 200 bp in length that contain from 1 to 14 instances of 3 different single motifs in all combinations. No general rule for module structure could be established.
ModuleSearcher: This program is part of the TOUCAN project. Results are dominated by 2 frequent motifs (~10 times more frequent than the others). CRMs comprising these frequent motifs show good statistical significance but are not present in all sequences.
Stubb: did not identify any hits.
COMET, Cluster-Buster and Cister: 3 programs from the same lab that search for dense clusters of sites. All three treat multiple sequences as one long sequence. The programs found clusters with 6 or more motifs and a length > 150 bp, but most of the clusters overlap adjacent sequences. If the sequences are submitted separately, the programs find no clusters.
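
The boundary problem noted for COMET, Cluster-Buster and Cister can be checked mechanically. The following is a minimal sketch, not part of any of those programs: it assumes clusters are reported as half-open (start, end) intervals in the coordinates of the concatenated input, and the function names, sequence lengths and cluster positions are all hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def sequence_index(offsets, pos):
    """Return the index of the input sequence containing 0-based position pos."""
    return bisect_right(offsets, pos) - 1


def split_clusters(seq_lengths, clusters):
    """Separate clusters lying inside one sequence from those crossing a boundary.

    seq_lengths: lengths of the input sequences, in concatenation order.
    clusters: (start, end) half-open intervals in concatenated coordinates
              (an assumed output format, for illustration only).
    """
    # offsets[i] is the start of sequence i within the concatenation
    offsets = [0] + list(accumulate(seq_lengths))[:-1]
    within, spanning = [], []
    for start, end in clusters:
        # Compare the sequence holding the first base with the one holding the last base
        if sequence_index(offsets, start) == sequence_index(offsets, end - 1):
            within.append((start, end))
        else:
            spanning.append((start, end))
    return within, spanning


# Hypothetical data: three inserts and three reported clusters
lengths = [1150, 900, 1000]
clusters = [(100, 260), (1100, 1300), (2060, 2250)]
within, spanning = split_clusters(lengths, clusters)
```

Clusters that end up in `spanning` are artifacts of concatenation, since they combine sites from two unrelated inserts.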

Programs not used for other reasons.
ModuleFinder: Only executables are provided. It is unclear how to use user-defined PWMs.
MCAST: identifies very long modules. For example, the top hit is a module consisting of 28 binding sites and spanning 620 bp of the 1150 bp sequence of insert 48. Despite a very significant E-value and a high score, this result seems to have little practical use.
CMA: Website is partially down (images are not available). No details are provided on model parameters (distance, orientation).
ModuleScanner: This program is part of the TOUCAN project. It only scans for CRMs using provided templates.

Programs that are not available:
HexDiff: Website is not available. The link to the source code provided in the manuscript is invalid. No supplementary material provided. Google search: no results.
MSCAN: Website does not exist. Email to authors: "The recipient's e-mail address was not found in the recipient's e-mail system". Google search: no results.
CO-Bind: Executables are reported to be on an FTP site. FTP fails with "no such directory".
EMCMODULE: "Executables upon request"; e-mail to authors failed: "user unknown".
RP: no website or standalone executables provided.
Supplementary Figure 2S. Graphical representation of promoter modules. Potential promoter modules responsible for tumor-specific transcriptional activation. Similar modules found by the genetic algorithm were combined using "OR" logic (modules 3, 4, 5, 9). The DNA sequences of individual features can be found in the Excel file in the supplementary material. For a description of individual features see the legend below. Legend headings: Eukaryotic motifs; Newly found motifs.
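
The "OR" combination of similar modules can be illustrated with a short sketch. It assumes, for illustration only, that a single module is a set of features that must all be present in a promoter, and that an OR-combined module matches when any one of its alternatives matches. The feature names, insert names and helper functions are hypothetical and are not taken from the supplementary data.

```python
def module_matches(module, promoter_features):
    """A module (a set of feature names) matches when all of its features are present."""
    return set(module) <= set(promoter_features)


def or_module_matches(alternatives, promoter_features):
    """An OR-combined module matches when at least one alternative module matches."""
    return any(module_matches(m, promoter_features) for m in alternatives)


# Hypothetical OR-combined module built from two similar modules
combined = [{"motif_A", "motif_B"}, {"motif_A", "motif_C"}]

# Hypothetical feature annotations of two promoter inserts
promoters = {
    "insert_1": {"motif_A", "motif_C", "motif_D"},
    "insert_2": {"motif_B", "motif_D"},
}

hits = {name: or_module_matches(combined, feats) for name, feats in promoters.items()}
```

Here `insert_1` satisfies the second alternative, while `insert_2` lacks `motif_A` and satisfies neither.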