Design of a combinatorial DNA microarray for protein-DNA interaction studies
© Mintseris and Eisen. 2006
Received: 07 July 2006
Accepted: 03 October 2006
Published: 03 October 2006
Discovery of precise specificity of transcription factors is an important step on the way to understanding the complex mechanisms of gene regulation in eukaryotes. Recently, double-stranded protein-binding microarrays were developed as a potentially scalable approach to tackle transcription factor binding site identification.
Here we present an algorithmic approach to experimental design of a microarray that allows for testing full specificity of a transcription factor binding to all possible DNA binding sites of a given length, with optimally efficient use of the array. This design is universal, works for any factor that binds a sequence motif and is not species-specific. Furthermore, simulation results show that data produced with the designed arrays is easier to analyze and would result in more precise identification of binding sites.
In this study, we present a design of a double stranded DNA microarray for protein-DNA interaction studies and show that our algorithm allows optimally efficient use of the arrays for this purpose. We believe such a design will prove useful for transcription factor binding site identification and other biological problems.
With the human and many other genome sequences complete or nearing completion, we are approaching the goal of identifying all the protein coding genes. However, to understand the function of these genes in different physiological contexts, it is important to understand how their expression is regulated. Mechanisms of gene regulation are varied and complex and unraveling them will require a combination of approaches[1, 2]. Having a catalog of all the transcription factors and being able to characterize their binding specificity at cis-regulatory sites would provide a fruitful starting point.
Recent advances in chromatin immunoprecipitation (ChIP) methods have led to large-scale efforts to determine all protein-DNA binding events in yeast[3, 4], but scaling up such methods for mammalian genomes may prove difficult. Protein-binding microarrays (PBMs), initially developed on a small scale by Bulyk et al.[5, 6], showed promise in identifying transcription factor binding specificity with high accuracy and were recently scaled up for the yeast genome by using PBMs with all known yeast intergenic regions. Although this is an exciting advance in the field, the current design of PBMs still leaves room for uncertainty because some of the intergenic regions may be too long to pinpoint the binding sites with high accuracy. Scaling this method up to mammalian genomes would also require designs spanning multiple arrays, with a new design for each genome. Both ChIP and PBM methods are well suited for low-resolution identification of genes affected by a given transcription factor. However, in order to fully understand regulation, researchers will always be interested in pinpointing the specific regions to which the factor binds. Identifying this region from ChIP-chip or PBM data requires sophisticated computational analysis, much like that used in ab initio cis-regulatory region discovery. The reliability of such analyses is sometimes questionable, in part because of the repetitive and degenerate nature of the intergenic sequences. Harbison et al. note that some intergenic sequences are highly homologous, thus skewing the results of motif discovery algorithms. If there were a way to test the binding of a given factor to all possible motifs of a given length, it would then be trivial to scan the intergenic sequences for potential sequences corresponding to a well-defined motif. We therefore propose a new PBM design that would allow the testing of all possible binding sequences of a given length in an optimally efficient, non-degenerate manner.
In recent years, a number of technological innovations took place, allowing programmable synthesis of microarrays as well as new techniques to make the arrays double-stranded[8, 9]. In particular, Warren et al. successfully constructed and tested a combinatorial dsDNA array with all possible 8-mer sequences, with one sequence per spot. Since the proof of principle for this technology has now been shown, here we focus on optimizing experimental design. Using variations on established graph theory algorithms, we propose a new design of a PBM, which would allow the in-vitro testing of transcription factor binding to all possible DNA targets up to length 12. This approach removes some of the redundancy in testing long intergenic regions. In addition, our design is organism-independent.
The design described by Bulyk et al. in proof-of-concept papers[5, 6] allows for testing N binding sites by screening N spots on the array. This approach is straightforward but not very practical for most transcription factors because the number of possible binding sequences is 4^k, where k is the length of the binding site.
The more recent design involved spotting all annotated yeast intergenic regions on the array. This comprehensive approach is more scalable, although mammalian genomes contain long "desert" regions which would most likely have to be broken up into shorter segments for spotting on microarrays. To identify the transcription factor binding sites within the spotted regions, the authors rely, as in many other approaches, on a variant of the Gibbs sampling algorithm. Some of the longer intergenic regions tested may present a problem in identifying binding patterns for low-specificity transcription factors. The uniform probe length and optimal non-redundancy of the array proposed here would make it easier to analyze experimental results and estimate their statistical significance.
We propose the design of a dsDNA array that allows screening for length k TF binding sites with maximum efficiency by allowing the k-mers to overlap. For instance, the 8-mer probe ACTGTGCA represents two potential 7-mer TF binding sites – ACTGTGC and CTGTGCA. It turns out that we can easily design an array with probes of certain length b that contain all possible k-mers, such that the required number of probes is minimal. If we can find the shortest string that contains all possible k-mer substrings, we can then "cut up" this string into individual probes of desired length. The problem of constructing such a minimum-length string can be represented in graph-theoretical formulation (see Methods for details).
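As a minimal illustration of how overlapping k-mers are read off a single probe, the following Python sketch enumerates the candidate sites in the 8-mer example above. The helper name kmers_in_probe is hypothetical and is introduced here only for illustration.

```python
def kmers_in_probe(probe, k):
    """Return every length-k window of a probe, read left to right."""
    return [probe[i:i + k] for i in range(len(probe) - k + 1)]


# The 8-mer probe from the text carries two overlapping 7-mer candidate sites.
print(kmers_in_probe("ACTGTGCA", 7))  # ['ACTGTGC', 'CTGTGCA']
```

A probe of length b therefore carries b - k + 1 candidate sites of length k on the strand shown.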
Imagine a directed graph with nodes represented by all possible k-mers, where the edges exist between nodes that overlap by (k-1). Finding the shortest path for a graph of all possible k-mers results in a superstring of length 4^k + k - 1. Given a desired probe length b > k, we can design an array with N probes that enables us to test the binding specificity of any transcription factor that can bind to a k-mer. The number of probes would have to be approximately
N = 4^k / (b - k + 1)
The length of a string produced by naively joining all possible k-mers is k*4^k. This means we are able to reduce the number of probes by a factor of k. Furthermore, we can turn the reverse complementarity of double-stranded DNA sequence to our advantage and gain another factor of 2 reduction in the number of array probes[9, 11]. For instance, having included the 7-mer ACTGTGC in the superstring and assuming that the array probe will be double-stranded, we are already accounting for the reverse complement 7-mer GCACAGT. This introduces some complications in the algorithm, which we discuss in Methods.
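The probe-count estimate and the reverse-complement argument can be sketched in a few lines of Python. This is an approximation under stated assumptions: the function names are hypothetical, the count is rounded up to a whole probe, and the double-stranded saving is taken as exactly a factor of 2, which ignores the self-complementary k-mers discussed in Methods.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def approx_probe_count(k, b, double_stranded=False):
    """Approximate N = 4^k / (b - k + 1), rounded up to a whole probe.

    Assumption: with double_stranded=True the k-mer count is simply halved.
    """
    if b < k:
        raise ValueError("probe length b must be at least the motif length k")
    n_kmers = 4 ** k
    if double_stranded:
        n_kmers //= 2
    sites_per_probe = b - k + 1
    return -(-n_kmers // sites_per_probe)  # integer ceiling division


print(reverse_complement("ACTGTGC"))  # GCACAGT, as in the text
for k, b in [(8, 20), (10, 30), (12, 40)]:  # illustrative values only
    print(k, b, approx_probe_count(k, b), approx_probe_count(k, b, True))
```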
We would also need to take into account some additional considerations, such as allowing for spacers on either side of the designed sequence to ensure reliable binding, as well as a primer, if the double stranded DNA is constructed enzymatically. We believe such an approach takes some of the ambiguity out of the decoding process that is needed in current approaches that rely on spotting long intergenic regions.
Table: Sample calculations for the number of probes per array, by probe length b and motif length k.
Identifying the actual binding sequences given intergenic array spot data is a non-trivial problem, which Mukherjee et al. addressed by Gibbs-sampling algorithms[7, 12]. This problem arises from a combination of two factors: 1) many intergenic sequences are quite long (mean length 486 bp for yeast), increasing the probability of finding multiple binding sites; 2) intergenic sequences are inherently redundant. Our combinatorial design addresses both of these issues by proposing reasonably short and optimally non-redundant sequence features.
The early versions of PBMs were made double-stranded by enzymatic primer extension[5, 8], which would mean that the combinatorial portion of the probe intended to assay for protein binding would be adjacent (either 3' or 5') to a constant primer sequence. Of course, any such primer sequence could also contain a portion of a binding site or even an entire binding site, making it difficult to analyze the data. The more recent approach involved only a short 3-base flanking sequence on either side of the combinatorial portion of the probe, thus eliminating the problem. Nevertheless, the enzymatic primer extension approach remains a valid option and has the advantage of higher fidelity compared with oligo synthesis. It is therefore important to address the potential challenge of analyzing data from an experiment where the flanking sequence is bound on some probes and of deciphering the true binding site in such an experiment.
We propose that this challenge be addressed by making a replicate array (Figure 1). The simplest approach would be to make a replicate array with different primers/flanking sequences. If the number of bound probes differs significantly between the two replicates, it would suggest that the flanking sequence is involved in one of them. Analysis of the array with the smaller number of bound probes should reveal the true binding site and help extract additional information from the other replicate.
Table: Using array replicates to discover the Rap1 binding site when the flanking sequence is involved in binding. Columns: total number of probes bound; top BioProspector hits (first n); number of replicate probes bound; top BioProspector hits (first n).
While the technological aspects of array construction have been the subject of much recent work, less attention has been paid to the oligonucleotides on these arrays in terms of experimental design. Here we have laid out an algorithmic solution to the design of a DNA microarray that would allow the characterization of binding specificity of any transcription factor independent of the species under study. The solution discussed here focuses on the algorithmic part of the problem and does not address some of the concerns involved in the production of such an array. However, we believe that, given the recent advances in microarray technology, the arrays described here are well within reach of the current state of the art. Custom arrays can be obtained from several sources, such as Agilent and Nimblegen, and new technologies for programmable array synthesis are still being developed. Synthesis of the complementary strand on the arrays can be achieved enzymatically with a surface-proximal primer or with other, more recently developed methods[8, 9].
Analysis of intergenic PBM data has been complicated by the fact that the sequences are long, redundant, and often contain multiple binding sites especially for factors that do not bind with high specificity. Our design addresses this problem and in simulations produces data that is much easier to analyze due to higher signal-to-noise ratio. Given our simulation data, it seems reasonable to make the assumption of a single binding site per probe and thus make it much easier for Gibbs sampling algorithms to converge on the correct solution.
The combinatorial array design that includes all possible k-mers also has the advantage that as genome annotation continues to improve, including the validation of intron/exon boundaries and discovery of novel genes, the data obtained from such an array remains valid and relevant.
Despite the probe number savings offered by the design presented here, the exponential growth of the number of probes as a function of k will limit the length of combinatorial binding sites. However, even with k up to 12, the design can be applied to many important unresolved problems. Applications of ideas presented here extend beyond transcription factor interactions. For instance, they may also prove useful to characterize restriction enzyme specificity, DNA methylation patterns and in other systematic studies. The array could be used to study not only the binding patterns of natural DNA-binding proteins, but also to analyze mutants and thus help us gain a more detailed understanding of the nature of specificity/promiscuity of these interactions as well as design new ones.
In this study, we present the design of a microarray containing all combinations of a DNA motif for testing of transcription factor binding and other protein-DNA interaction applications. The advantage of this approach is that it is exhaustive and the exact same design could be used for any genome. Furthermore, uniform probe lengths and optimal non-redundancy allow for a more straightforward statistical analysis of the results. Combined with recent advances in PBM technology development, our design will enable more precise identification of true binding sites.
The problem of constructing a minimum-length string can be represented in graph-theoretical formulation. Imagine a directed graph with nodes represented by all possible k-mers, where the edges
<u,v> exist iff u = s_1 s_2 ... s_{n-1} and v = s_2 ... s_{n-1} s_n
Then, walking the shortest path through this graph results in the construction of the shortest cyclical sequence that contains every subsequence exactly once. This is a well-known problem in computer science, the Chinese Postman problem. A closed path that traverses every edge exactly once is known as an Eulerian cycle. The resulting minimal string is known as a de Bruijn sequence, and the graph consisting of all possible subsequences of a certain length from an alphabet of a given size is known as the de Bruijn graph. An Eulerian cycle is easily found with Fleury's algorithm; Hierholzer's algorithm finds one in time linear in the number of edges.
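The construction can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses Hierholzer's algorithm rather than Fleury's, takes the (k-1)-mers as nodes so that each edge spells one k-mer, and does not yet exploit reverse complementarity. The function names de_bruijn_cycle and cut_probes are hypothetical.

```python
from itertools import product


def de_bruijn_cycle(k, alphabet="ACGT"):
    """Return a cyclic string in which every k-mer occurs exactly once."""
    if k == 1:
        return alphabet
    # Nodes are (k-1)-mers; each unused outgoing edge is one appended letter.
    unused = {"".join(p): list(alphabet)
              for p in product(alphabet, repeat=k - 1)}
    stack, circuit = [alphabet[0] * (k - 1)], []
    while stack:  # Hierholzer's algorithm, iterative form
        node = stack[-1]
        if unused[node]:
            stack.append(node[1:] + unused[node].pop())
        else:
            circuit.append(stack.pop())
    circuit.reverse()
    # Each step along the circuit contributes the last letter of its node.
    return "".join(node[-1] for node in circuit[1:])


def cut_probes(k, b, alphabet="ACGT"):
    """Cut the linearized superstring into probes of length b.

    Consecutive probes overlap by k-1 bases so no k-mer is lost at a cut.
    The last probe may be shorter than b.
    """
    cycle = de_bruijn_cycle(k, alphabet)
    superstring = cycle + cycle[:k - 1]  # length len(alphabet)**k + k - 1
    step = b - k + 1
    return [superstring[i:i + b]
            for i in range(0, len(superstring) - k + 1, step)]


probes = cut_probes(k=3, b=6)
covered = {p[i:i + 3] for p in probes for i in range(len(p) - 2)}
print(len(probes), len(covered))  # 16 probes covering all 64 3-mers
```

Overlapping adjacent probes by k-1 bases is what keeps the probe count at roughly 4^k / (b - k + 1).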
The algorithm has to be modified to take advantage of the fact that for a double-stranded DNA probe, every k-mer in the probe will also have a reverse complement, and therefore the reverse complement sequence optimally should not be included in the superstring. Every de Bruijn graph therefore contains within it two "reverse complementary" sub-graphs. There is an additional complication arising from the fact that graphs with even k and odd k have different properties. Constructing the minimal superstring for odd-k graphs amounts to finding two "pseudo-Eulerian" cycles, which are reverse complementary to each other. This can be achieved simultaneously in the context of Fleury's algorithm. Even-k graphs are further complicated by the fact that some nodes are their own reverse complements (e.g. ACGT) and are therefore shared between the two reverse complementary sub-graphs. Because of this peculiarity, the number of nodes in a "pseudo-Eulerian" cycle containing each k-mer or its reverse complement only once is equal to 4^k/2 for odd-k graphs and slightly more than 4^k/2 for even-k graphs. As shown in Figure 1, this comes from the fact that k-mers that are their own reverse complements have to be counted twice, once for each of the reverse-complementary sub-graphs. The figure shows two possible "pseudo-Eulerian" reverse-complementary cycles for k = 2, with the four self-complementary nodes highlighted.
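The odd/even difference can be checked by brute-force enumeration for small k. The Python sketch below counts k-mers when a k-mer and its reverse complement are treated as one, and separately counts the self-complementary k-mers. The function names are hypothetical, and the enumeration is only practical for small k.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def count_strand_classes(k):
    """Count k-mers, treating a k-mer and its reverse complement as one."""
    classes = set()
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        # Use the lexicographically smaller strand as the representative.
        classes.add(min(kmer, reverse_complement(kmer)))
    return len(classes)


def count_self_complementary(k):
    """Count k-mers that equal their own reverse complement."""
    return sum(1 for letters in product("ACGT", repeat=k)
               if "".join(letters) == reverse_complement("".join(letters)))


for k in range(2, 7):
    print(k, 4 ** k // 2, count_strand_classes(k), count_self_complementary(k))
```

For k = 2 the enumeration gives 10 classes and 4 self-complementary 2-mers (AT, TA, CG, GC), against 4^2/2 = 8; for odd k there are no self-complementary k-mers, so the count is exactly 4^k/2.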
In simulations testing how robust the array probes are to noise, the BioProspector software was run to search for a motif 100 times per run, using the probe sequences from the entire designed array as background.
In primer/flanking sequence simulations, we used ACTGACGTACTGGTTT as a control primer (not containing a part of Rap1 binding site) and ACTGACGTACTCACCC as the primer sequence with the last 5 bases overlapping the Rap1 consensus binding site (CACCCATACA).
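The overlap between each primer and the Rap1 consensus can be verified with a short Python check. This is only a sketch: the helper suffix_prefix_overlap is hypothetical, and it looks solely for a primer suffix matching a prefix of the site on the strand shown, not for sites on the opposite strand or fully inside the primer.

```python
def suffix_prefix_overlap(primer, site):
    """Length of the longest primer suffix that is also a prefix of the site."""
    for n in range(min(len(primer), len(site)), 0, -1):
        if primer.endswith(site[:n]):
            return n
    return 0


RAP1_CONSENSUS = "CACCCATACA"
print(suffix_prefix_overlap("ACTGACGTACTGGTTT", RAP1_CONSENSUS))  # 0 (control)
print(suffix_prefix_overlap("ACTGACGTACTCACCC", RAP1_CONSENSUS))  # 5 (CACCC)
```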
J.M. was supported by Department of Energy Computational Science Graduate Fellowship (CSGF). The authors wish to thank Boris Shakhnovich for advice and discussions.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.