Ndt80 binding sequence requirements in vitro
We determined the binding preferences of the Ndt80 DNA binding domain (aa 51–350) fused N-terminally to Maltose binding protein (MBP-Ndt80 51–350) for the SPO77 MSE using EMSA analysis [26–28]. The SPO77 MSE (located at -152 from the start of translation) was chosen because it has good homology to the Ndt80 consensus [22, 24, 25] and because a single copy was sufficient to direct transcription [20]. We systematically examined the effects of every alteration at each position of a 14 bp region, which consisted of the 9 bp core sequence GTCACAAAA together with three upstream and two downstream base-pairs, presented as the central region of a 35 bp oligonucleotide (Figure 1a). The quantitative data from the mutational analysis is shown in Figure 1b. This study complements two previous binding preference studies, performed while this work was in progress [26, 29].
We confirmed that most binding determinants are in the 9 bp core MSE. Of the five positions systematically mutated outside the core, only changes to the second position downstream of the MSE (+11) had any effect on binding and these were minor (Figure 1b). Within the core MSE, C5 is essential, as a change to any other nucleotide essentially abolished binding. The A4 and A6 positions are next most important, exhibiting significant decreases in binding with any change and severe restriction of binding with two of the three changes. The C3, A7 and A8 positions are also significant. For those positions where similar changes have been examined, the data of this study and that of Pierce et al are in excellent agreement [29]. These two studies generally agree with the more restricted data set of Lamoureaux et al [26], except for their conclusion that at C5, only a change to G severely restricts activity.
Ndt80 binding sequence requirements in vivo
A subset of the mutations were analyzed in vivo during sporulation, using a strain having one authentic SPO77 locus, so that sporulation was unimpaired, and a second SPO77 locus encoding GFP and driven either by the wide type promoter or one having an altered MSE (Figure 2). Quantitative analysis of the in vivo expression is included in the supplementary material (see Additional files 1 and 2). By comparing the RNA levels from the GFP construct with that from SPO77 expressed in the same cells and normalizing both to expression of an internal control, we were able to make an accurate determination of the effect of the MSE mutations on expression. When driven by the wild type MSE, GFP was induced at levels similar to SPO77 RNA with similar timing (Figures 2a), indicating that this system could be used to test expression of GFP driven by mutant MSEs.
With the exception of position 2, where nucleotide identity did not affect binding (Figure 1), we tested the effects of at least one nucleotide change at each MSE position on expression during sporulation (Figures 2). Overall, we find excellent agreement between our in vivo expression measurements and the in vitro binding data (Compare Figures 1 and 2). Importantly, these data validate our finding in vitro that changes to the C5 position other than G are very deleterious, as a C5A change has essentially no activity in vivo (Figures 2g). Additionally, by testing all changes at position C3 (Figures 2c–e), we could show that even small changes in in vitro binding are significant for expression in vivo.
The location of the MSE is important for sporulation specific gene induction in vivo
Although total genomic MSEs are found in widely varying positions [22, 24, 25], our computational analysis of positive targets showed that functional MSEs are distributed in a preferred window between -75 to -300 bps upstream (see next section) of the translation start point. We therefore tested positional dependence of MSE activity by changing the endogenous MSE at -152 to a nonfunctional sequence (gtcaAaaaa) and positioning a consensus MSE at either-450 or -50. These MSEs mediate very little, if any GFP expression. Lack of gene expression is not due to inhibition of RNA synthesis, as normal GFP RNA synthesis is observed when the positionally altered variants also have the normal MSE at position -152 (Additional file 3). We conclude that MSEs position is important for function.
Identification of the test set of middle genes utilized in computational analysis
To get a highly specific set of target genes as our test set, we define middle genes as those exhibiting less than a 1.2 fold change at 2 hours and greater than 3 fold change at 5 hours following the initiation of sporulation. Using these criteria, we identified 68 middle sporulation ORFs from the microarray data published by Chu et al.[20]. Of these 68 ORFs, 54 have multi-species sequence data (see method) and the Ndt80 core consensus binding site CRCAAA. These genes were used as our test set. 36 genes in this test set overlap with 62 strongly induced middle sporulation genes independently identified by Primig et al. via clustering analysis [21].
The location of the MSE is specifically distributed in middle sporulation genes
We used a statistical analysis that compares the location of MSE relative to the translation start site in the test set with that in the whole genome promoters to determine whether there is a positional preference for MSE in the true target sets. We find that MSEs are highly localized between positions -75 to -300 from the translation start site in the test set (Fig 3a). In striking contrast, the MSE sequences in the whole genome promoters do not have a preferential location (Fig 3a, inset), suggesting that MSE location might be a determinant of middle gene activity. To pursue this issue further, we asked whether positional bias of the core MSE sequence, CRCAAA, of the test set promoters is conserved between S. cerevisiae and the orthologous promoters in S. bayanus, S. mikatae, and S. paradoxus [11, 30]. We found that the positional distribution of the CRCAAA motif in all the other three species have similar preference (Fig. 3a), whereas no preference is observed for MSE sequences in the whole genome promoters.
Distribution of the binding affinity of potential Ndt80 binding sites in the genome
There are over two thousand genes in the S. cerevisiae genome whose promoter (800 base pairs upstream) contains the Ndt80 core consensus binding motif CRCAAA. Given the modest number of middle genes, it is likely that most of these matches are false positives. It might be possible to discriminate against the false positives using quantitative biochemical data from the in vitro binding assay, since bases matching the degenerate symbols in the core motif and/or from the flanking sequences can contribute to the binding affinity (Figure 1). We therefore extracted the N5-CRCAAA-N3 motifs in all S. cerevisiae promoter sequences and scored each motif using the in vitro binding data, which we converted to binding affinity relative to the wild type SPO77 motif (see Materials and Methods). The resulting relative binding affinity reflects the contributions from the degenerate core position (R) and the flanking sequences. We observe that the distribution of relative binding affinities for matches in the test middle sporulation gene set is distinct from the set of all the matches in the genome (presumably mostly false positives) (Figure 3b). Whereas the mean of the genome-wide distribution is -0.58, with a standard deviation of 1.05, that of the reference set shifted by about one standard deviation towards a higher score, showing a mean of m
s
= 0.57 and a standard deviation of σ
s
= 0.96. These data show that the false positive matches can be discriminated to certain degree based on the quantitative binding affinity. For example, using m
s
- σ
s
as a cutoff score and assuming Gaussian distribution, we would eliminate about 50% false positives, while keeping more than 80% of true positives.
Predicting the direct targets of Ndt80 in the genome
We have shown that the potential Ndt80 binding sites in the promoters of middle genes differ from their non-middle gene counterparts by having on average a higher binding affinity and positional preference. With the availability of complete sequences of several yeast species, it is possible to use evolutionary conservation as an additional filter to select true binding sites, since functional sites are more conserved due to selection pressure. We therefore searched for the direct targets of Ndt80 using the following criteria: 1) the selected ORF should have at least one core motif N5-CRCAAA-N3 with binding affinity score higher than a specified cutoff (m
s
- σ
s
); 2) the motif is located between -80 to -400 from the translation start site; and 3) the core motif is conserved in a number of yeast species (see methods for details). To assess the predictive value of this selection scheme, we compared its performance on the middle gene test set with that on the 2259 ORFs whose promoter has at least one core MSE motif. When we use conditions 1, 2 and demand conservation of the core motifs in all 4 species, we identify 115 ORFs in S. cerevisiae as potential Ndt80 regulatory targets (Figure 4). These combined criteria reduced the total number of predictions by 20 fold, and thus greatly reduced the number of false positives. On the other hand, the predicted ORFs still include 44% of the ORFs in the test set having the MSE motif (24 out of 54). This demonstrates that our procedure can successfully discriminate between bonafide middle genes and false positives. Furthermore, the computational analysis also predicted 91 new genes as potential Ndt80 targets.
Calculating the false positive and the false negative rate
To gauge the accuracy of the prediction, it is important to ask how many of the predicted genes are false positives and how many true targets are missed. In a functional genomics approach, the false positive rate is typically estimated by experimental verification of the predicted targets. Here we develop a mathematical approach that allows us to estimate both false positive and false negative rate, and consequently the total number of targets in the genome. Crucial to this approach are: 1) a set of true targets known with high confidence; and 2) multiple independent filters that can be used to select potential targets. The idea is illustrated in Figure 4 where we start with a high confidence positive set of 54 genes and 2259 potential targets with the consensus MSE site in their promoters. We believe that the 54 genes are the authentic targets of Ndt80 as they passed the stringent criterion of gene expression (exhibiting less than 1.2 fold change at 2 hours and greater than 3 fold change at 5 hours following the initiation of sporulation) and also posses a consensus MSE in their promoters. We now apply two different filters individually and jointly. One filter selects genes whose binding sites score is larger than (m
s
- σ
s
) and the motif is located between -80 to -400 from the translation start site (condition 1 and 2 of the selection in the previous section). The other filter selects genes whose promoter have the MSE site conserved across 4 species (condition 3). The fractions of true positives passing the two filters can be estimated from the test set. There are three unknowns: the fractions of false positives passing the two filters, and the total number of true positives. Given the number of genes (from the set of 2259) passing the first, the second, and the joint filters, we can write down three equations, and solve for the three unknowns (see methods). From the obtained parameters, we estimate that out of 2259 ORFs with CRCAAA motif in their promoters, 169 ± 20 are the regulatory targets of Ndt80. Among the 115 genes we identified as putative Ndt80 targets, 43 ± 5 are false positives. We also found that the probability for a functional MSE to be conserved across 4 species is 0.5 while the probability of chance conservation for a false positive site is 0.1. This gives a quantitative measure of the strength of selection on a functional regulatory site.
Validating the predictions using gene expression profiles
To validate these predictions, we performed a clustering analysis of the mRNA expression profiles during sporulation using the expression of the 115 genes predicted by our computational method and the 54 middle genes in the test set. We separately clustered the expression profiles of the following three sets of genes: 30 genes in the test set but missed by our computational analysis; 24 genes given by the overlap between the test set and the predicted set; and 91 genes in the predicted set but not in the test set. The first two sets of genes showed a clear middle gene expression profile – repressed till about 2 hours and then sharply turned on at 5 hours (group 1 and 2 in Figure 5). The set of 91 genes gives rise to three distinct clusters (group 3 – 5 in Figure 5). The first cluster (41 genes, group 3) resembles middle gene expression profile. The second cluster (group 4) contains 9 genes turned on at 0.5 hours and then at much higher level at 5 hours. The third cluster (group 5) contains 41 genes which are not up-regulated during sporulation. We believe that the 74 genes in groups 2–4 predicted by the computational method are likely to be the true targets of Ndt80, while the 41 genes in group 5 are likely false positives. This estimation of false positives is consistent with the number we obtained from the simple model (43 ± 5) presented above.
We also compared the targets predicted by our computational method to those identified by Ndt80 over-expression during vegetative growth [20]. We selected 236 genes whose log transformed expression ratio is 2 standard deviation above the mean as targets predicted by over-expression data. We found that 49 out of 54 genes in the test set, and 20 out of 50 in group 3 and 4 were predicted by Ndt80 over-expression. These overlaps were highly significant (only expect 2 by random for each case). For group 5 genes which we believe are false positives, the overlap is insignificant, only 2 out of 41 were predicted by the over-expression data.
Our predictions are also consistent with the microarray data for the Ndt80 deletion strain during sporulation[20]. We compared the fold changes for the mRNA induction of the Ndt80 deletion strain to the wild type strain. The fold change ratio is defined as , where we use the mRNA expression at 6 hours () and at 2 hours () for the deletion strain and the mRNA expression at 7 hours () and 2 hours () for the wild type strain. For Ndt80 targets, we expect that the fold induction during sporulation will be much less in the deletion strain compared to the wild type, thus r will be much smaller than 1, while for non-targets, we expect r to be around 1. The genome-wide average of the fold change ratio r is close to one (1.07 ± 0.57) and only 12% genes have the ratio less than 0.5. In group1, the fold change ratio of 29 genes out of 30 genes is less than 0.5 in the deletion strain. The fold change ratio of all genes in group 2 and 41 out of 50 genes in group 3 and 4 are less than 0.5, while only 8 out of 41 genes in group 5 have the ratio less than 0.5. The result for group 5 is consistent with the prediction that these genes are false positives based on the clustering analysis.
Examination of the promoter sequences of these genes reveals that different groups have the binding sites for other transcription factors in addition to the Ndt80 binding sites. For example, most genes in groups 2 and 3 (middle genes) have a binding site resembling that of both Ndt80 and Sum1 [29, 31]. The binding of Sum1 may be important for repressing these genes early in sporulation. Two genes in group 4 (which are turned on early) ADY3 and MPC54 have the conserved URS1 site DSGGCGGC in their promoter sequences close to the conserved Ndt80 binding site. It is known that Ume6/Ime1 complex bind to URS1 and turn on their target genes early in sporulation[20, 32]. Another gene with similar expression profile and conserved URS1 and Ndt80 sites is YNL196C. This gene is not in the cluster as the position of the Ndt80 site is -78, which barely missed our positional cutoff of -80. Thus it seemed that Ndt80 works together with other factors to fine tune the temporal expression of different gene sets in a combinatorial fashion.
Comparison with ChIP-chip data
ChIP-chip technology has been widely used to identify the genomic targets of transcription factors. Recently, Harbison et al. published a comprehensive dataset for yeast containing 203 factors under a variety of conditions[2]. This dataset includes Ndt80 done under the YPD condition. We found that with the standard P value cut off (P < 0.001), 18 genes were predicted, and none is in the test set. With the less stringent cut off P < 0.01, 86 genes were predicted, and again none is in the test set. Only a few genes out of the 86 predicted by the ChIP data showed a middle gene expression profile (see additional file 1 and 4). Thus it seems that the ChIP-chip data for Ndt80 under YPD condition is dominated by noise. This is not surprising as the experiment was done under normal growth condition, and Ndt80 is activated only during sporulation. For Ndt80, the signal would be improved greatly by doing ChIP-chip experiment under sporulation condition. For an uncharacterized factor, it is quite challenging to search for conditions under which the factor is active. Although ChIP-chip is a powerful and systematic approach to dissecting transcription networks of a cell, the Ndt80 example highlighted the need for complementary approaches such as the one we described here.