The identification of transcription factor binding sites is an important biological question. To date, the majority of methods to detect these sites have focused on creating statistical models, such as position weight matrices, of transcription factor specificities. However, these models are limited due to the fact that they must make generalized assumptions about transcription factor binding properties that are not completely understood. Other recent technologies have been developed such as ChIP-Seq to look at genomic transcription factor occupancy . However, these technologies do not identify the precise binding sites, are technically difficult, and limited by the lack of high quality antibodies for many transcription factors. Therefore, bioinformatic prediction of transcription factor binding sites remains a powerful and useful tool for understanding transcriptional regulation. In this paper, we propose a modified technique for creating genome-wide TFBS site maps using direct mapping of protein binding microarray data (PBM-mapping). This method is technically simpler than ChIP-seq methods and does not rely on the assumptions used in statistical models.
We have shown that PBM-mapping more accurately predicts relative binding affinity than previously reported TRANSFAC or PBM based PWMs. PWM inaccuracies have been attributed to both experimental and theoretical errors . Our studies support both of these limitations. The TRANSFAC PWM was developed from the alignment of 23 sequences enriched using SELEX experiments. The PBM-PWM was based on microarray experiments, which provide data for all possible octamers. In our experiments, the PBM-PWM was more highly correlated with observed relative binding affinity than the TRANSFAC PWM, most likely due to the increased information content in the PBM-PWM. However, PBM-mapping scores correlated to a greater extent with relative binding affinity than either PWM method, even though it was generated from the same data used for the PBM-PWM. Therefore, the PWM statistical model does not accurately model Nkx2.2 binding.
Numerous methods for generating PWMs exist and there are several statistical corrections that can be applied to the PWM model, however accurately testing and comparing all of these corrections is technically difficult and therefore were outside the scope of this study . Predictions based on the combined results of more than one PWM could also be attempted. However, these predictions would still be susceptible to the limitations of the PWM model to account for the influence of neighboring nucleotides and flanking regions.
A method for using the average E-score of 3 overlapping octamers surrounding a previously identified core sequence (AvgES) has been utilized for predicting binding sites in C. elegans. However, this method was not fully tested for accuracy and is biased by the assumption of a completely conserved 4-basepair long core sequence. Our results show that using an average E-score from an increased number of overlapping octamers improved the accuracy of transcription factor binding site prediction. Interestingly, an average E-score of 7 overlapping octamers resulted in the highest correlation with relative binding affinity. This represents the greatest number of overlapping octamers possible with at least one base pair common between all oligos. Nkx2.2 binding has been reported to be influenced by the flanking DNA sequence [16, 34]. Therefore, the increase in accuracy is most likely due to accounting for the flanking regions; however, it may also be due to reductions in the inherent noise present from microarray quantification . It remains possible that bases outside of the 7 octamer window can affect binding affinity, although using a window larger than 7 overlapping octamers resulted in decreased accuracy. This may be due to limitations of using 8 bp long motifs for generation of PBM data. It remains to be seen if increasing the length of DNA-binding sequences used in the microarray based experiments to generate E-scores would further increase the accuracy of binding affinity prediction.
We did not observe a large difference between the flanking regions of "AAGT" and "GAGT" containing sequences. Furthermore, both core sequences share common flanking regions in the octamers with the highest E-score (TCAAGT GG and TCGAGT GG). However, the magnitude of the effect that base substitutions in the flanking regions have of the overall score of the site differed between "AAGT" and "GAGT" containing sites. For example, conversion of "G" to "C" in the last position of the above mentioned octamers reduces the single octamer E-score of the "AAGT" site by 0.0038 and the "GAGT" site by 0.011, confirming that the bases in these positions are not independently additive, but are dependent on other bases in the binding site, even when these bases are not immediately adjacent.
We tested PBM-mapping using the homeodomain transcription factor Nkx2.2. Nkx2.2 has been shown to act as both an activator and a repressor during pancreatic islet formation [23, 35]. β-cells are completely absent in the Nkx2.2 null embryo . There was also a corresponding decrease in many, but not all, β-cell markers . However, it was unclear which of these genes were down-regulated due to direct transcriptional control by Nkx2.2 or because the β-cell population was absent. We used PBM-mapping to predict novel Nkx2.2 binding sites within bound promoters of genes differentially expressed between wildtype and Nkx2.2 null embryos. Interestingly, a large majority of differentially expressed genes had predicted sites, suggesting that between e12.5 and e13.5 the changes seen in the Nkx2.2 null pancreas are largely due to direct regulation by Nkx2.2 and not a cascade of downstream transcription factors. We were also able to predict binding sites in several β-cell specific genes, including a battery of genes encoding proteins present in secretory vesicles, such as insulin, IAPP and ChgB that appear to be coordinately activated by Nkx2.2.
Several studies have shown that relatively weak or secondary binding sites are biologically important [1, 2]. These sites may create genomic binding profiles dependent on protein concentrations or they may differentially bind closely related transcription factors that share a common primary binding motif . Careful analysis of the PBM data for Nkx2.2 revealed a previously unidentified alternative "GAGT" binding site motif for Nkx2.2. GAGT containing sites were also represented in our predicted sites (Gcg -432, Iapp -1355, Nkx2.2 -716, and Tm4sf4 -1723)--confirming the ability of the algorithm to predict secondary sites. This is the first secondary motif that has been identified for an Nkx2 family member, although a unique secondary motif has been identified for Nkx3.1 (GTAC). PBM-mapping does not predict that Nkx2.2 binds the GTAC core sequence; however this is not surprising since the homeodomain sequences between Nkx2 and Nkx3 family members are not well conserved and the two protein families are known to preferentially bind different core sequences . Further research is necessary to determine whether the GAGT motif identified in this study is unique to Nkx2.2 or is shared among several Nkx2 genes.
The importance of weak or secondary binding sites highlights the importance of finding a suitable threshold for determining positive prediction results. In our study, we set the threshold for putative Nkx2.2 binding sites at 0.37 based on the results from our EMSA analysis. However, this may not accurately reflect in vivo binding. For example, the Gcg -1080 site, which had an average E-score just below our threshold, did not show binding in EMSA analysis (Figure 3A), but appeared to be occupied in ChIP (Figure 3B). Therefore, it is possible that our threshold is overly stringent or that Nkx2.2 binding to weak sites is dependent on the presence of cofactors. The appropriate threshold may also vary between cell types due to different expression levels of the transcription factor. Furthermore, our results show that the appropriate threshold for different transcription factors will differ.