Scanning sequences after Gibbs sampling to find multiple occurrences of functional elements

Background Many DNA regulatory elements occur as multiple instances within a target promoter. Gibbs sampling programs for finding DNA regulatory elements de novo can be prohibitively slow in locating all instances of such an element in a sequence set. Results We describe an improvement to the A-GLAM computer program, which predicts regulatory elements within DNA sequences with Gibbs sampling. The improvement adds an optional "scanning step" after Gibbs sampling. Gibbs sampling produces a position specific scoring matrix (PSSM). The new scanning step resembles an iterative PSI-BLAST search based on the PSSM. First, it assigns an "individual score" to each subsequence of appropriate length within the input sequences using the initial PSSM. Second, it computes an E-value from each individual score, to assess the agreement between the corresponding subsequence and the PSSM. Third, it permits subsequences with E-values falling below a threshold to contribute to the underlying PSSM, which is then updated using the Bayesian calculus. A-GLAM iterates its scanning step to convergence, at which point no new subsequences contribute to the PSSM. After convergence, A-GLAM reports predicted regulatory elements within each sequence in order of increasing E-values, so users have a statistical evaluation of the predicted elements in a convenient presentation. Thus, although the Gibbs sampling step in A-GLAM finds at most one regulatory element per input sequence, the scanning step can now rapidly locate further instances of the element in each sequence. Conclusion Datasets from experiments determining the binding sites of transcription factors were used to evaluate the improvement to A-GLAM. Typically, the datasets included several sequences containing multiple instances of a regulatory motif. The improvements to A-GLAM permitted it to predict the multiple instances.


Background
Regulation of gene transcription is complex and often combinatorial in nature [1][2][3]. Combinatorial gene regula-tion is a major factor in evolution, because it helps coordinate diverse novel phenotypic features in a new species. Because it often reflects chemical synergies between tran-scription factors (TFs), combinatorial gene regulation can be broadly classified as either: (1) homotypic, where a single TF binds to multiple sites in the regulatory region of a gene; or (2) heterotypic, where multiple TFs target a single gene. Accurate knowledge of potential synergies between regulatory elements is therefore essential to understanding evolution and phenotypic diversity.
Many computational tools are available for prediction of regulatory elements. Most tools are based on one of two methods: (1) an enumeration of over-represented words [4][5][6][7][8][9]; or (2) probabilistic sequence models [10][11][12][13][14][15]. Our previous work [16] produced the A-GLAM computer program, which combines word enumeration with probabilistic sequence models to identify cis-regulatory sequences in human promoters, as follows. Given any gapless subsequence alignment, probabilistic sequence models yield a marginal Bayesian log-odds score. The Gibbs sampler in A-GLAM uses simulated annealing to maximize the logodds score over all possible gapless alignments. A-GLAM also can start from a set of "seeds", e.g., statistically significant positions from word enumeration, to maximize the log-odds score over all possible gapless alignments containing the seeds.
Gibbs sampling (or more descriptively, successive substitution sampling) is a respected Markov-chain Monte Carlo procedure for discovering sequence motifs. As a theoretical framework, however, it encounters several practical problems when searching for regulatory elements in DNA. First, it tends to find DNA repeat elements, regardless of their biological interest. Second, it often requires prohibitive computational time to find multiple instances of a regulatory element in a single sequence.
Because A-GLAM was based on Gibbs sampling, we were eager to overcome the practical problems above. Our previous work [16] used seeds to overcome the first problem, repeats. The user can constrain the alignment output to include the seeds, a so-called "anchored alignment". Our implementation of Gibbs sampling therefore avoids repeats, because the user can specify in advance which motif is of biological interest.
To overcome the second problem, multiple instances of a motif, A-GLAM now has an option for post-processing the results of Gibbs sampling. Gibbs sampling produces a position specific scoring matrix (PSSM). The new scanning step resembles a PSI-BLAST search based on the PSSM. The Methods section describes it under the subheading "The new scanning step".

Implementation
A-GLAM was written in C++ and compiled by gcc (GCC version 3.4.0) under the linux operating environment.
The binary files, documentation and the datasets are available for download from the project ftp site [see Additional files 1 &2].

The Gibbs sampling step in the previous implementation of A-GLAM
Briefly, A-GLAM takes a set of sequences as input. The Gibbs sampler step in A-GLAM uses simulated annealing to maximize an "overall score", a figure of merit corresponding to a Bayesian marginal log-odds score. The overall score is given by In Equation (1), m! = m(m -1)...1 denotes a factorial; a j , the pseudo-counts for nucleic acid j in each position; a = a 1 + a 2 + a 3 + a 4 , the total pseudo-counts in each position; c ij , the count of nucleic acid j in position i; and c = c i1 + c i2 + c i3 + c i4 , the total number of aligned windows, which is independent of the position i. The rationale behind the overall score s in A-GLAM is explained in detail elsewhere [17].
To initialize its annealing maximization, A-GLAM places a single window of size 3 (the default permissible minimum window size) within every sequence randomly (according to a uniform distribution), implicitly placing the windowed subsequences into a gapless multiple alignment. It then iterates the following procedure. In the procedure's first step, A-GLAM proposes a set of possible changes to the alignment. The proposal step is either a repositioning or resizing step. In a repositioning step, one sequence is selected uniformly at random; the set of proposed changes includes all possible positions for its window. In a resizing step, either the right or the left end of the alignment windows is selected; and the set of proposed changes includes expanding or contracting the corresponding end of all alignment windows by one position at a time, expansion being permitted only up to the ends of the sequences. (The resizing step has been slightly modified from its original implementation in A-GLAM, which expanded or contracted each window by a single column.) Each one of the proposed changes leads to different value of the overall score s. In the procedure's second step, A-GLAM then accepts one of the proposals randomly, with probability proportional to exp(s/T), where T is a parameter representing an annealing temperature. The temperature T is gradually lowered to T = 0, with the intent of finding the gapless multiple alignment of the windows maximizing s. The maximization implicitly determines the final window size. The randomness in the algorithm helps it avoid local maxima and find the global maximum of s. We ran the annealing algorithm within A-GLAM ten times independently. The steps (below) corresponding to E-values and post-processing were then carried out with the PSSM corresponding to the best of the ten scores s.

The individual score and its E-value in the previous implementation of A-GLAM
The Gibbs sampling step produces an alignment whose overall score s is given by Equation (1). Consider a window of length w that is about to be added to A-GLAM's alignment. Let δ i (j) equal 1 if the window has nucleic acid j in position i, and 0 otherwise. The addition of the new window changes the overall score by The score change corresponds to scoring the new window according to a PSSM that assigns the "individual score" to nucleic acid j in position i. Equation To assign an E-value to a subsequence with a particular individual score, consider the alignment sequence containing the subsequence. Let n be the sequence length, and recall that w is the window size. If ΔS i denotes the quantity in Equation (2) if the final letter in the window falls at position i of the alignment sequence, then ΔS* = max{ΔS i : i = w,...,n} is the maximum individual score over all sequence positions i. We assigned an E-value to the actual value ΔS* = Δs*, as follows. Staden's method [18] yields Ρ{ΔS i ≥ Δs*} (independent of i) under the null hypothesis of bases chosen independently and randomly from the frequency distribution {p j }. Our E-value E = (nw + 1) Ρ {ΔS i ≥ Δs*} is therefore the expected number of sequence positions with an individual score exceeding Δs*. The factor n -w + 1 in E is essentially a multiple test correction.

The new scanning step
Our scanning method shares some similarities with the algorithm previously developed by Neuwald et al [19]. Given a PSSM like Equation (3), the scanning step scans all sequences, assigning an E-value E to every subsequence of length w. Every subsequence with a small E-value E ≤ E 0 , where E 0 is some pre-assigned small threshold, contributes to the counts c ij in a new PSSM. The new PSSM replaces the old PSSM, and the step is repeated. The step is repeated until either: (1) no new motifs contribute to the PSSM (a condition called "convergence"); or (2) some user-specified number of iterations is attained. Figure 1 describes the method graphically. Finally, the algorithm reports the predicted motifs within each sequence, in order of increasing E-values. Analogous to PSI-BLAST, the iterative procedure usually converges, or else background motifs come to dominate the PSSM (a condition called "corruption"). Corruption indicates that a lower threshold E 0 is required.
Thus, although the Gibbs sampling step in A-GLAM finds at most one regulatory element per sequence, the scanning step can rapidly locate several instances of the element in each sequence.

Prediction performance of A-GLAM
A-GLAM's predictions of transcription factor binding sites were evaluated with reference sets containing known functional sites. Sequence logos [20] of the motifs predicted by A-GLAM were generated using WebLogo [21]. The height of a stack of letters in the logo represents the total amount of information at that position, in bits. Within each stack, the height of each letter is proportional to the nucleotide frequency at that position.

UAS elements in histone promoters
Others have identified the SPT10 gene as a global regulator of core histone promoter activity in yeast. A recent study [22] concluded that the Spt10p transcription factor is involved in sequence-specific activation of histone genes. The protein promotes histone gene expression by binding in highly cooperative manner to paired instances of a DNA regulatory motif, UAS (upstream activating sequence). Accordingly, we tested A-GLAM with four histone promoter sequences known to contain multiple instances of the binding site for the Spt10p transcription factor. All binding sites had been experimentally verified with gel-shift assays.
Of the nineteen motifs in the dataset, A-GLAM correctly identified fifteen sites without any false positives ( Figure  2). A-GLAM's consensus motif of GTTCN 2 ANTTTTTCNC corresponds closely to previous results [23,24]. Previous knowledge about the consensus permits some further evaluation of A-GLAM's predictions. In the HHT2-HHF2 and HTA2-HTB2 promoters, Spt10p is known to bind to six sites. In the HHT2-HHF2 sequence, the two sites A-GLAM missed lacked the complete TTC motif, however, suggesting that Spt10p might only bind weakly there. Similarly, in the HTA2-HTB2 promoter, alignment sites contain the consensus TT/GC and TTCT/GC sites. However, the two sites A-GLAM missed lack important consen- Description of the scanning step in A-GLAM sus nucleotides, again suggesting weak binding. These results suggested A-GLAM's potential to rank motifs based on binding strengths. Accordingly, we sought datasets where binding affinities had been measured experimentally.

Operator sites in lambda phage
Many previous studies have examined the kinetics of operator binding in the promoter region of phage lambda [25,26]. Gene regulation in phage lambda is complex, and its description can be found elsewhere [27].
We extracted two sequences corresponding to adjacent promoter regions of the lambda chromosome from the RefSeq database [28]. The first promoter sequence contains the three right operator sites (OR); the second, the three left operator sites (OL). In each case, the binding sites correspond to the palindromic consensus TATCAC-CGCCGGTGATA [29]. Previous studies have deduced that the lambda repressor and cro compete for the operator sites, with the outcome often deciding the fate of the infected bacteria. Molecules of the repressor bound to adjacent operator sites interact and display positive cooperativity [30,31].
A-GLAM identified five out of the six experimentally verified operator sites, missing only OR2. When the five operator sites A-GLAM identified were placed in increasing order by their experimentally determined constants for dissociation with cro repressor, the E-values were also in increasing order, with the exception of OR3 (Table 1).

Gal4p regulatory elements
Gal4p, encoded by the GAL4 gene, is one of the best known regulatory transcription factors in yeast. Gal4p and a complex of other proteins activate yeast galactose catabolic genes (GAL) [32]. They regulate GAL genes having multiple binding sites (GAL1, GAL2, GAL7, and GAL10) in a highly cooperative manner [33,34]. Gal4p itself binds to correctly spaced pairs of low affinity binding sites in the upstream activating sequence for GAL (UASG) [35]. Cooperativity in DNA binding causes a synergistic enhancement of Gal4p activation of transcription. Gal4p also binds to pairs of high-affinity binding sites, but binding studies involving isolated sites have shown that the corresponding UASG is only twice that of a single high affinity binding site.
We extracted sequences from the Saccharomyces cerevisiae Promoter Database for the upstream promoter regions of the six genes GAL1, GAL2, GAL7, GAL10, GAL80 and GCY1 [36]. The sequences contained fourteen experimentally verified Gal4p binding sites. Among the 14 binding sites, A-GLAM correctly identified 13 sites excluding one false positive. A-GLAM was run using the ZOOPS (zero or one position per sequence) mode, setting maximum motif width to 23. The consensus motif identified by A-GLAM closely matched the known consensus CGG(N 11 )CCG of the Gal4p binding site [37,38] (see Figure 3A).   Figure 2B shows the sequence logo for the predicted motifs within the histone promoters. (The Methods section gives a brief explanation of sequence logos.) The letters at the top of the logo closely match the consensus of the histone UAS. A-GLAM correctly identified 12 binding sites without any false positives. The consensus produced by A-GLAM agreed well with the known consensus (see Figure 3B).

Comparison of A-GLAM, GLAM and AlignACE
We compared A-GLAM's accuracy in identifying the binding sites with GLAM [39] and AlignACE [11], another de novo Gibbs sampler based algorithm. The Gibbs sampling procedure in AlignACE permits multiple occurrences of a motif in a sequence, unlike the ZOOPS model in GLAM and A-GLAM. For comparison purposes, we obtained test datasets that were used to assess various motif discovery tools in a recent contest [40]. The datasets were comprised of sequences from four different species: (1) fly; (2) human; (3) mouse; (4) yeast. Each data set contained known binding sites in the original promoter sequence (not in a random sequence). Approximately 85% of the datasets contained multiple occurrences of a binding site in at least one sequence. Hence, these datasets provided a convenient benchmark for assessment purposes. Detailed description of the datasets appear elsewhere [40]. For each tool, the accuracy of the top motif predicted on each data set was compared using the correlation coefficient The comparison yielded some valuable insights into A-GLAM's performance. In general, the scanning step improved A-GLAM's ability to identify known sites on mouse and yeast datasets (see Figure 4). A-GLAM and GLAM performed poorly on fly data sets, however, worse than AlignACE. The fly datasets where AlignACE performed better contained only single sequences, however. A-GLAM and GLAM probably failed on such datasets because of their ZOOPS mode, in which the Gibbs sampler permits at most one motif occurrence per sequence. On human datasets, surprisingly, GLAM outperformed both A-GLAM and AlignACE. Moreover, the three programs often produced alignments corresponding to completely different motifs. In all such cases, A-GLAM and AlignACE identified motifs corresponding to a repeat sequence. A partial explanation follows. In the ZOOPS mode, Gibbs sampling searches for at most one motif instance in any single sequence, so the multiplicity of a repeat does not affect the Gibbs sampling step much. The best alignment after Gibbs sampling therefore might correspond to a known biological signal. Unfortunately, the large number of repeat elements in the human genome then can decoy A-GLAM in the scanning step. The multiplicity of a repeat does affect the scanning step, so after iterating sufficiently, the scanning step incorporates the repeat into the PSSM to overwhelm the original biological signal. The multiplicity of a repeat also affects the Gibbs sampling step in AlignACE, so AlignACE converges on repeats for similar reasons. Alignments of transcription factor binding sites Figure 3 Alignments of transcription factor binding sites. The sequence logo generated using the motifs predicted by A-GLAM in data sets containing experimentally verified binding sites for Gal4p ( Figure 3A) and Kruppel ( Figure 3B).

B
A step predicted at most one regulatory element per input sequence. Now, A-GLAM uses an iterative strategy like PSI-BLAST, so the PSSM from the sampling step finds multiple instances of the regulatory motif in individual sequences. Instances with an E-value below a user-defined threshold E 0 are then permitted to contribute to the PSSM, which is then updated. The PSSM-updating step is then iterated. Finally, the predicted instances of the regulatory motif are reported, by sequence in the order of increasing E-value.
Validation with regulatory elements from well characterized systems confirmed that the scanning step can identify regulatory elements rapidly and dependably, even in the presence of homotypic regulatory clusters (multiple instances of the motif in a single sequence). In comparison to Gibbs sampling for homotypic regulatory clusters, the scanning step is faster, with little loss (if any) in predictive accuracy, particularly in yeast datasets. Moreover, the E-values for predicted elements sometimes corresponded well with their experimental binding affinities (see Table 1). Further investigation of the correspondence would therefore be desirable.
The scanning step uses a threshold E 0 for inclusion into A-GLAM's PSSM. The threshold E 0 is critical, because it is subject to the same conflicting constraints as in PSI-BLAST. On one hand, stringent thresholds (low values of E 0 ) can eliminate interesting instances of a motif; on the other hand, loose thresholds (high values of E 0 ) can cause the PSSM to include too many false positives, possibly diluting the true positives to oblivion, causing "corruption". In particular, corruption can occur for subtle motifs that do not deviate much from the background nucleotide Our studies of A-GLAM's performance on particular datasets indicated some general conclusions about Gibbs sampling and the identification of binding sites. In a fly dataset consisting of a single sequence, GLAM's Gibbs sampling step performed poorly because the step identifies only a single binding site per sequence. The scanning step added in this article therefore identifies multiple instances of a binding site per sequence only when the dataset contains multiple sequences. The scanning step noticeably degraded predictions on human datasets, primarily because of repeats (e.g., Alu or poly-A). The degradation is likely to be a pitfall for any program able to detect homotypic regulatory clusters (e.g., either A-GLAM or AlignACE), because through sheer multiplicity, repeat elements can overwhelm a signal from homotypic binding elements. Notably, the scanning step improved A-GLAM's performance on yeast datasets, a behaviour likely to generalize to any genome containing homotypic regulatory clusters and lacking repeats. We do not specifically address the issue of interaction among binding elements for different transcription factors, a phenomenon largely confined to complex organisms. Hence, our methods are most effective in lower organisms such as yeast, fly and microbes.
Several remedies to the problem of repeats are available. First, a user can focus the A-GLAM program on a motif of interest by providing either: (1) a "seed word" contained in the motif of interest or (2) a list of "seed windows", at most one per input sequence and all of equal size. In its seed-oriented mode, A-GLAM then constrains its gapless multiple alignment to contain the user-provided seeds. Second, the repeats can be masked with standard programs, such as RepeatMasker [41]. Third, many recent studies have also suggested that a high-order background Markov model can avoid repeats and aid the detection of regulatory elements [42]. We are currently incorporating an option for a Markov background into A-GLAM.

Conclusion
In summary, our scanning step identifies multiple elements in a single sequence with E-values. It speeds up regulatory motif discovery, by avoiding unnecessary use of the computationally expensive Gibbs sampling step, with little loss (if any) in predictive accuracy. The availability of completely sequenced genomes presents an increased demand for rapid and accurate prediction of regulatory elements. Our methods seem well adapted for this challenge.

Authors' contributions
KT analyzed the data and implemented the scanning step in A-GLAM; LM-R participated in the design of the study and acquired the datasets for analysis; SS provided the software for the E-value computation; DL contributed to the design of the datasets; JLS developed the mathematical theory.