Correlating overrepresented upstream motifs to gene expression: a computational approach to regulatory element discovery in eukaryotes

Background Gene regulation in eukaryotes is mainly effected through transcription factors binding to rather short recognition motifs generally located upstream of the coding region. We present a novel computational method to identify regulatory elements in the upstream region of eukaryotic genes. The genes are grouped in sets sharing an overrepresented short motif in their upstream sequence. For each set, the average expression level from a microarray experiment is determined: If this level is significantly higher or lower than the average taken over the whole genome, then the overerpresented motif shared by the genes in the set is likely to play a role in their regulation. Results The method was tested by applying it to the genome of Saccharomyces cerevisiae, using the publicly available results of a DNA microarray experiment, in which expression levels for virtually all the genes were measured during the diauxic shift from fermentation to respiration. Several known motifs were correctly identified, and a new candidate regulatory sequence was determined. Conclusions We have described and successfully tested a simple computational method to identify upstream motifs relevant to gene regulation in eukaryotes by studying the statistical correlation between overepresented upstream motifs and gene expression levels.


Introduction
One of the biggest challenges of modern genetics is to extract biologically meaningful information from the huge mass of raw data that is becoming available.In particular, the availability of complete genome sequences on one hand, and of genome-wide microarray data on the other, provide invaluable tools to elucidate the mechanisms underlying transcriptional regulation.The sheer amount of available data and the complexity of the mechanisms at work require the development of specific data analysis techniques to identify statistical patterns and regularities, that can then be the subject of experimental investigation.
The regulation of gene expression in eukaryotes is known to be mainly effected through transcription factors binding to rather short recognition motifs generally located upstream of the coding region.One of the main problems in studying regulation of gene expression is to identify the motifs that have transcriptional meaning, and the genes each motif regulates.
The usual approach to this kind of analysis begins by identifying groups of co-regulated genes, for example by applying clustering techniques to the expression profiles obtained from microarray experiments.One then studies the upstream sequences of a set of coregulated genes looking for shared motifs.Examples of this approach as applied to S. cerevisiae are Refs.[1,2,4].
In this paper we suggest an alternative method which somehow follows the inverse route: genes are grouped into (non-disjoint) sets, each set being characterized by a short motif which is overrepresented in the upstream sequence.For each set, the average expression is computed for a certain microarray experiment, and compared to the genome-wide average expression from the same experiment.If a statistically significant difference is found, then the motif that defines the set of genes is a candidate regulatory sequence.The rationale for looking for overrepresented motifs is that, in many instances, regulatory motifs are known to appear repeated many times within a relatively short upstream sequence [2,3], so that the number of repetitions turns out to be much bigger than what would be expected from chance alone.A somehow related approach, which does not require any previous grouping of genes based on their expression profiles, was presented in Ref. [5], where the effect of upstream motifs on gene expression levels is modeled by a sum of activating and inhibitory terms.Experimental expression levels are then fitted to the model, and statistically significant motifs are identified.Our approach differs in the importance given to overrepresented motifs, thus considering activation and inhibition as an effect that depends on a threshold number of repetitions of a motif rather than on additive contributions from all motifs.Clearly the two mechanisms are far from being mutually exclusive, therefore we expect the candidate regulatory sites found with the two methods to significantly overlap.
However it is important to notice that the kind of statistical correlation between upstream motifs and expression that our algorithm identifies does not depend on any special assumption on the functional dependence of expression levels on the number of motif repetitions, as long as this dependence is strong enough to provide a significant deviation from the average expression when enough copies of the motif are present.A comparison of our results with those obtained in Ref. [5] is provided in the "Results and discussion" section.

The method
In general the motifs with known regulatory function are not identified with a fixed nucleotide sequence, but rather with sequences where substitutions are allowed, or spaced dyads of fixed sequences, etc. However in this study, in order to test the method while keeping the technical complications to a minimum, we will limit ourselves to fixed short nucleotide sequences, that we call words.While previous studies (see e.g [2]) show that even this simple analysis can give interesting results, the method we present can easily be generalized to include variable sequences and other more complicated patterns.
The computational method we propose has two main steps: first the open reading frames (ORFs) of an eukaryote genome are grouped in (overlapping) sets based on words that are overrepresented in their upstream region, compared to their frequencies in the reference sample made of all the upstream regions of the whole genome.Each set is labelled by a word.Then for each of these sets the average expression in one or more microarray experiments are compared to the genome-wide average: if a statistically significant difference is found, the word that labels the set is a candidate regulatory site for the genes in the set, either enhancing or inhibiting their expression.
It is worth stressing that the grouping of the genes into sets depends only on the upstream sequences and not on the microarray experiment considered: It needs to be done only once for each organism, and can then be used to analyse an arbitrary number of microarray experiments.It is precisely this fact that should allow the extension of the method to patterns more complex than fixed sequences, while keeping the required computational resources within reasonable limits.

Constructing the sets
We consider the upstream region of each open reading frame (ORF), and we fix the maximum length K of the upstream sequence to be considered.The choice of K depends on the typical location of most regulatory sites: in general K is a number between several hundred and a few thousand.For each ORF g, the actual length of the sequence we consider is K g defined as the minimum between K and the available upstream sequence before the coding region of the previous gene.
For each word w of length l (6 ≤ l ≤ 8 in this study), and for each ORF g we compute the number m g (w) of occurrences of w in the upstream region of g.Non palindromic words are counted on both strands: therefore we define the effective number of occurrences n g (w) as where w is the reverse complement of w.
We define the global frequency p(w) of each word w as where, in order to count correctly the available space for palindromic and non palindromic words, p(w) is therefore the frequency with which the word w appears in the upstream regions of the whole genome: it is the "background frequency" against which occurrences in the upstream regions of the individual genes are compared to determine which words are overrepresented.
For each ORF g and each word w we compute the probability b g (w) of finding n g (w) or more occurrences of w based on the global frequency p(w): We define a maximum probability P , depending in general on the length l of the words under consideration, and consider, for each w, the set of the ORFs in which the word w is overepresented compared to the frequency of w in the upstream regions of the whole genome.That is, w is considered overrepresented in the upstream region of g if the probability of finding n g (w) or more instances of w based on the global frequency is less than P .
This completes the construction of the sets S(w).Two free parameters have to be fixed: the length K of the upstream region to be considered and the probability cutoff P for each length l of words considered.A result in Ref. [2] suggests suitable choices of these two numbers: the authors list the 34 ORFs of S. cerevisiae that have 3 or more occurrences of the word GATAAG in their 500 bp upstream region.23 out of these 34 ORFs correspond to a gene with known function, and 20 out of these 23 are regulated by nitrogen.This result suggests to choose K = 500 for the upstream length, and a value of the probability cutoff such that three or more instances of GATAAG in the 500 bp upstream region of an ORF are considered significant.Any choice of P between 0.018 and 0.1 would satisfy this criterion, and we chose P = 0.02.Tentatively, we kept the same value of P for all values of l.
With this choice, the number of instances of a word that are necessary to be considered overrepresented in a 500bp upstream sequence can be as high as six for common 6-letter words and as low as one for rare 8-letter words.In particular, our set S(GAT AAG) almost1 coincides with the one discussed in [2].However the word GATAAG will not turn out to be significant in our study.
As noted above, it would be natural to make the probability cutoff P depend on the word length, simply because the number of possible words increases with their length: For example one could take the cutoff for each word length to be inversely proportional to the number of independent words of such length.However it turns out that this procedure tends to construct sets that are less significant when tested for correlation with expression.Therefore we chose to fix the cutoff at 0.02 for all word lengths.It is important to keep in mind that no statistical significance whatsoever is attributed to the sets per se: The only sets that are retained at the end of the analysis are the ones that show significant correlation with expression.Therefore the choice of the cutoff in the construction of the sets can be based on such a pragmatic approach without jeopardizing the statistical relevance of the final result.

Studying the average expression level in each set
The second step of our procedure consists in studying, for each set S(w) defined as above, the expression profiles of the ORFs belonging to S(w) in DNA microarray experiments.The idea is that if the average expression profile in the set S(w) for a certain experiment is significantly different from the average expression for the same experiment computed on the whole genome, then it is likely that some of the ORFs in S(w) are coregulated and that the word w is a binding site for the common regulating factor.
To look for such instances we consider the gene expression profiles during the diauxic shift, i.e. the metabolic shift from fermentation to respiration, as measured with DNA microarrays techniques in Ref. [1].In the experiment gene expression levels were measured for virtually all the genes of S. Cerevisiae at seven time-points while such metabolic shift took place.The experimental results are publicly available from the web supplement to Ref. [1].
We considered each time-point as a single experiment, and for each gene g we defined the quantity r g (i) (1 = 1, . . ., 7) as the log 2 of the ratio between the mRNA levels for the gene g at time-point i and the initial mRNA level.Therefore e.g.r g (i) = 1 means a two-fold increase in expression at timepoint i compared to initial expression.
For each time-point i we computed the genome-wide average expression R(i) and its standard deviation σ(i).These are reported in Tab. 1, where N(i) is the number of genes with available expression value for each timepoint.
Then for each word w we compute the average expression in the subset of S w given by the genes for which an experimental result is available at timepoint i (in most cases this coincides with S w ): where N(i, w) is the number of ORFs in S w for which an experimental result at timepoint i is available, and the difference ∆R w (i) is the discrepancy between the genome-wide average expression at time-point i and the average expression at the same time-point of the ORFs that share an abundance of the word w in their upstream region.A significance index sig(i, w) is defined as and the word w is considered significantly correlated with expression at time In this work we chose Λ = 6: this means that we consider meaningful a deviation of R w (i) by six s.d.'s from its expected value.The sign of sig(i, w) indicates whether w acts as an enhancer or an inhibitor of gene expression.

Results and discussion
We found a total of 29 words of length between 6 and 8 above our significance threshold |sig| > 6.Most of them are related to known regulatory motifs; two words turned out to be false positives due to the presence, in their sets, of families of identical ORF's.Finally, one word does not match any known motif and is a candidate new binding site.
The comparison between our significant words and known motifs was performed using the database of regulatory motifs made publicly available by the authors of Ref. [6], and the CompareACE software [7] available from the same web source.This package allowed us to compute the Pearson correlation coefficient of the best alignment between each of our significant words and each known regulatory motif (expressed as a set of nucleotide frequencies).
We used the following criterion to associate our significant words to known motifs: a motif is considered as identified if at least one significant word scores better than 0.8 when compared to it.A probability value for this choice of the cutoff can be estimated to be a few percent: out of all the 2080 independent 6-letter words, 66 (that is 3.17%) score better than 0.8 with at least one motif.For 7-and 8-letter words we have respectively 2.21% and 1.51%.Once a motif has been identified, all words which score best with the motif are attributed to it, independently of the score, provided their expression pattern is consistent with the word(s) scoring better than 0.8.

PAC and RRPE motifs
Nine significant words can be associated to the PAC motif [8,4,7], all of them with rather high scores.They are shown in Tab. 2, where, as in all the following tables, significativity indices are shown only for those timepoints where they exceed our threshold |sig| > 6.Given the perfect alignment of these words, it is not surprising that these sets largely overlap each other: The union af all the nine sets contains a total of 96 genes.As an example, in Fig. 1 we show the average expression for the genes associated with the word GATGAG as a function of the time, compared to the average expression computed over the whole genome.Fig. 2 shows the significance index for the same set.In Tab. 3 we show the set of 24 genes associated to the word GATGAG, together with their expression profiles.
Two words can be associated with confidence to the motif RRPE [4,7], and are shown in Tab. 4. The union of the two sets contains 76 genes.We see that genes containing the motifs PAC and RRPE are repressed at the late stage of the diauxic shift compared to the early stages.This result is in agreement with the expression coherence score data available from the web supplement to Ref. [6]: There one can see that (1) of all known regulatory motifs, PAC and RRPE show the highest expression coherence for the diauxic shift and (2) viceversa, of the eight experimental conditions considered in Ref. [6], the diauxic shift is the one in which both the PAC and RRPE motif show the highest expression coherence score.

STRE and MIG1 motifs
A total of ten significant words can be associated to the motifs STRE [9,10] and MIG1 [11,12].It is well known that these play an important role in glucose repression (see e.g.[1,13] and references therein).Most of these words show comparable scores for the two motifs (due to their similarity) so we decided to show them together in Tab. 5 which shows the two scores for each word.A total of 212 genes belong to the union of all these sets.

The UME6 motif
Two words are associated to the known UME6 motif, a.k.a.URS1 [14,15], known to be a pleiotropic regulator implicated in glucose repression [16].They are shown in Tab. 6.The two sets do not overlap, so that a total of 56 genes are associated to this motif.

Other significant words
Three words, shown in Tab. 7, are of uncertain status: for the first one, the set S(ACTTTC) contains only 2 genes, making the statistical significance of the result questionable.The word CCCCTGAA scores best with the PDR motif (0.58): given the low significance of this score, and the fact that PDR does not seem to be relevant for any other word, this is most likely accidental.The word should probably be considered as belonging to the STRE/MIG1 motif (the scores are STRE: 0.46, MIG1: 0.49).Finally the word GCCCCTGA scores best with UME6 (0.55), but its expression pattern is more similar to the STRE/MIG1 motifs (scores: STRE:0.44,MIG1: 0.46).As stated in the introduction, the method proposed in Ref. [5] also allows one to identify regulatory motifs without any previous clustering of gene expression data: a linear dependence of the logarithm of the expression levels on the number of repetitions of each regulatory motifs is postulated, and motifs are ranked according to the reduction in χ 2 obtained when such dependence is subtracted from the experimental expression levels.Iteration of the procedure produces a model, that is a set of relevant regulatory motifs, for each expression data set.

S(ATAAGGG)
In Ref. [5] such a model is presented for the 14 min.time point in the α-synchronized cell-cycle experiment of Spellmann et al., Ref. [17].We used our algorithm on the same data set to compare the findings.Let us concentrate on the 7-letter words (the longest considered in [5]).We found 9 significant words, reported in Tab. 9. Of these, five coincide with or are very similar to words found by the authors of Ref. [5] (see their Tab.2).The remaining four (AGGCTAA, GGCTAAG, GCTAAGC and CTAAGCG, whose similarity clearly suggests the existence of a longer motif) are of particular interest for the purpose of comparing the two methods: If one looks at the dependence of the expression levels on the number of occurrences of these words in the 500 bp upstream region, one clearly sees the existence of an activation threshold (see Fig. 5, where such dependence is shown for GGC-TAAG).On the other hand, by looking at these data one hardly expects a significant reduction in χ 2 when trying to describe this dependence with a straight line.This should be compared to the same dependence for the word AAAATTT, shown in Fig. 6, which is found by both algorithms.On the other hand, there are two 7-word motifs found in [5] that do not pass our significativity threshold, that is CCTCGAC and TAAACAA.
We can conclude that the two methods tend to find motifs with a different effect on gene expression: probably the best results can be obtained by using them both on the same data set.

Conclusions
We have presented a new computational method to identify regulatory motifs in eukaryotes, suitable to identify those motifs that are effective when repeated many times in the upstream sequence of a gene.The main feature occurrences GCTAAGC ( that differentiates our method from existing algorithms for motif discovery is the fact that genes are grouped a priori based on similarities in their upstream sequences. Most of the significant words the algorithm finds can be associated to five known regulatory motifs: This fact consitutes a strong validation of the method.Three of them (STRE, MIG1 and UME6) were previously known to be implicated in glucose suppression, while the fact that PAC and RRPE sites are relevant to regulation during the diauxic shift is in agreement with expression coherence data as reported in the web supplement to Ref. [6].
One of the significant words we find (ATAAGGG) cannot be identified with any known motif, and is a candidate new binding site.
It is easy, at least in principle, to extend the method to a larger class of regulatory sites.According to our knowledge of gene regulation, this should be done at least in two directions: (1) the analysis should not be restricted to fixed sequences, but extended to motifs with controlled variability; in particular the extension to spaced dyads [18] should be straightforward; (2) the combinatorial analysis of binding sites [6] could also be performed along the same lines, that is first grouping genes according to which combinations of motifs appear in their upstream region, and then analysing expression profiles within each group.

Figure 1 :Figure 2 :
Figure1: Expression of the genes in the set S(GATGAG): The average expression of the genes in the set (solid red line) are compared to the genomewide average expression (dashed green line) at the seven time points of the diauxic shift experiment.The expression data are the log 2 of the ratio between mRNA levels at each timepoint and the initial mRNA level.

Figure 3 :Figure 4 :
Figure 3: Expression of the genes in the set S(ATAAGGG): Same as Fig. 1 for our new candidate regulatory motif.

Figure 5 :
Figure 5: Expression as a function of occurrences of the word GGCTAAG:The average expression of genes presenting n occurrences of the word GGC-TAAG as a function of n in the 14 min.time point of the α-synchronized cell-cycle experiment of Spellmann et al., Ref.[17].In parentheses is the number of genes with n occurrences of GGCTAAG in the upstream region.The horizontal line represents the average expression for the whole genome.

Figure 6 :
Figure 6: Expression as a function of occurrences of the word AAAATTT: Same as Fig. 5 for AAAATTT.

Table 3 :
The ORFs in the set S(GATGAG) with their expression profiles.

Table 4 :
Significant words related to the RRPE motif.

Table 5 :
Significant words related to the STRE and MIG1 motifs.The words marked * actually score better with the variant STRE' motif (0.60 and 0.55 respectively).

Table 6 :
Significant words related to the UME6 motif.

Table 7 :
Significant words of uncertain attribution.

Table 8 :
The ORFs in the set S(ATAAGGG) with their expression profiles.