Volume 8 Supplement 6

Otto Warburg International Summer School and Workshop on Networks and Regulation

Open Access

Dissecting complex transcriptional responses using pathway-level scores based on prior information

BMC Bioinformatics20078(Suppl 6):S6


Published: 27 September 2007



The genomewide pattern of changes in mRNA expression measured using DNA microarrays is typically a complex superposition of the response of multiple regulatory pathways to changes in the environment of the cells. The use of prior information, either about the function of the protein encoded by each gene, or about the physical interactions between regulatory factors and the sequences controlling its expression, has emerged as a powerful approach for dissecting complex transcriptional responses.


We review two different approaches for combining the noisy expression levels of multiple individual genes into robust pathway-level differential expression scores. The first is based on a comparison between the distribution of expression levels of genes within a predefined gene set and those of all other genes in the genome. The second starts from an estimate of the strength of genomewide regulatory network connectivities based on sequence information or direct measurements of protein-DNA interactions, and uses regression analysis to estimate the activity of gene regulatory pathways. The statistical methods used are explained in detail.


By avoiding the thresholding of individual genes, pathway-level analysis of differential expression based on prior information can be considerably more sensitive to subtle changes in gene expression than gene-level analysis. The methods are technically straightforward and yield results that are easily interpretable, both biologically and statistically.


Many of the popular methods for analyzing DNA microarray expression data, from clustering [1] to more sophisticated machine-learning approaches [25], require expression data over a large number of different conditions as input. However, it is common to only have expression data for a few different strains and/or conditions. In this case, what is of interest are the changes in mRNA abundance for each gene, usually represented as the logarithm of the fold-change between test and control. The traditional way of analyzing such data is to first identify significantly up- and down-regulated genes, and subsequently to characterize these sets in terms of enrichment for functional annotation [6] or upstream promoter elements [79]. However, by requiring statistically significant differential expression at the level of individual genes, a lot of information about differential expression will be lost that could have been detected using analysis methods working at the level of pathways.

To understand this, assume that we are comparing two conditions and that the measurement error for the fold-change of individual genes is 20%. Now consider a specific pathway consisting of 100 genes that are all upregulated by 10%. This level of differential expression is well within the noise for individual genes, none of which will therefore be classified as significantly induced. However, the error in the average expression of 100 randomly chosen genes will be on the order of 20%/ 100 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaGcaaqaaiabigdaXiabicdaWiabicdaWaWcbeaaaaa@2F93@ = 2%. The 10% change in expression at the level of the whole pathway therefore corresponds to five units of standard error and is highly statistically significant.

In recent years, two distinct classes of methods have been developed that use prior information about how genes can be viewed as belonging to different regulatory or functional pathways (Figure 1). This information can be used to score differential expression at the pathway level rather than at the gene level. The first class of methods represents pathways as gene sets, to which individual genes either belong or do not belong. One well-known source of such gene sets is the Gene Ontology (GO) project [6], where the classification is based on the function of the proteins encoded by each gene. The second class of methods takes a more sophisticated approach by assigning a regulatory susceptibility to each gene, quantifying how strongly this gene is expected to respond to a change in the activity of a specific regulatory pathway. For example, the affinity of a gene's promoter sequence for a specific transcription factor (TF) could be predicted using consensus motifs or weight matrices [10] and be used to predict the response of that gene to changes in TF activity.
Figure 1

Scoring pathway activity: gene sets versus regression. Two types of prior information, categorical and quantitative, may be combined with non-thresholded genome-wide expression data to derive a statistical measure of pathway-level activity. In (A) a pre-defined gene set (gray), such as those annotated by the Gene Ontology project, is scored using a t-test for its expression response (red = positive, green = negative) compared to all other genes. In (B) estimated interaction strengths (shades of gray), such as those derived from regulatory sequence analysis or ChIP-chip experiments, are correlated with the expression response of all genes. In both instances the result is a t-value (yellow = positive, blue = negative) that measures the change in mRNA expression associated with a category (A) or interaction (B).

In this review, we describe how such pathway-level analyses can be implemented mathematically. It is helpful to understand that, in general, information about genes comes in two different types: categorical information of boolean type ("true" or "false"), which tells us whether or not a gene belongs to a specific gene set; and quantitative information, e.g., the mRNA expression log-ratio between two conditions for a gene or the ChIP-chip [11] fold enrichment for the gene's promoter region. Given any two distinct features characterizing each gene, their genomewide statistical association can be scored using an appropriate statistical test (Table 1).
Table 1

When to use which statistical test.

First Feature

Second Feature














Spearman, Kendall




two-sample t-test




Wilcoxon-Mann-Whitney, Kolmogorov-Smirnov

When analyzing the statistical association between two features across the genome, the choice of statistical test depends on whether the features are categorical or quantitative, and whether or not a parametric method can be used. For each case the appropriate test is listed.

The traditional approach: scoring over-representation of predefined gene sets

Suppose that we want to know whether a specific set of genes of interest is statistically enriched for genes with a specific annotation in Gene Ontology. In this case, both features (namely, "does the gene belong to the set of genes of interest" and "is the gene associated with GO term X") are categorical, and the appropriate statistic is the overlap between both gene sets. Let the total number of genes in set A be a, the total number of genes in set B be b, and the total number of genes in the genome be n. Furthermore, let the overlap x denote the number of genes shared between A and B. If the two sets are chosen randomly and independently, the average overlap will be:
x = a b n MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqGHPms4cqWG4baEcqGHQms8cqGH9aqpdaWcaaqaaiabdggaHjabdkgaIbqaaiabd6gaUbaaaaa@36BB@
This makes sense: if a fraction b/n of all genes belongs to set B then the expected fraction of genes in set A that also belongs to set B equals x/a. In the case of over-representation, when x > <x>, the P-value that quantifies how likely it is to get at least the same number of overlapping genes by chance, is given by
P over ( x ) = x = x min ( a , b ) H ( x | a , b , n ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaudaWgaaWcbaGaee4Ba8MaeeODayNaeeyzauMaeeOCaihabeaakiabcIcaOiabdIha4jabcMcaPiabg2da9maaqahabaGaemisaGKaeiikaGIafmiEaGNbauaacqGG8baFcqWGHbqycqGGSaalcqWGIbGycqGGSaalcqWGUbGBcqGGPaqkaSqaaiqbdIha4zaafaGaeyypa0JaemiEaGhabaGagiyBa0MaeiyAaKMaeiOBa4MaeiikaGIaemyyaeMaeiilaWIaemOyaiMaeiykaKcaniabggHiLdaaaa@52EF@
where H is the hypergeometric distribution given by
H ( x | a , b , n ) = ( a x ) ( n a b x ) ( n b ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGibascqGGOaakcqWG4baEcqGG8baFcqWGHbqycqGGSaalcqWGIbGycqGGSaalcqWGUbGBcqGGPaqkcqGH9aqpdaWcaaqaamaabmaabaqbaeqabiqaaaqaaiabdggaHbqaaiabdIha4baaaiaawIcacaGLPaaadaqadaqaauaabeqaceaaaeaacqWGUbGBcqGHsislcqWGHbqyaeaacqWGIbGycqGHsislcqWG4baEaaaacaGLOaGaayzkaaaabaWaaeWaaeaafaqabeGabaaabaGaemOBa4gabaGaemOyaigaaaGaayjkaiaawMcaaaaaaaa@4ACB@
( n k ) = n ! k ! ( n k ) ! MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaqadaqaauaabeqaceaaaeaacqWGUbGBaeaacqWGRbWAaaaacaGLOaGaayzkaaGaeyypa0ZaaSaaaeaacqWGUbGBcqGGHaqiaeaacqWGRbWAcqGGHaqicqGGOaakcqWGUbGBcqGHsislcqWGRbWAcqGGPaqkcqGGHaqiaaaaaa@3CA1@
It is also possible to have significant under-representation (x < <x>). In that case, the P-value is given by
P under ( x ) = x = 0 x H ( x | a , b , n ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaudaWgaaWcbaGaeeyDauNaeeOBa4MaeeizaqMaeeyzauMaeeOCaihabeaakiabcIcaOiabdIha4jabcMcaPiabg2da9maaqahabaGaemisaGKaeiikaGIafmiEaGNbauaacqGG8baFcqWGHbqycqGGSaalcqWGIbGycqGGSaalcqWGUbGBcqGGPaqkaSqaaiqbdIha4zaafaGaeyypa0JaeGimaadabaGaemiEaGhaniabggHiLdaaaa@4BDC@

This use of the cumulative hypergeometric distribution is also known as "Fisher's exact test." The test is by nature non-parametric because both input features are non-parametric. Under specific conditions the hypergeometric distribution may be approximated by the binomial or chi-square distribution. Several implementations of this approach are reviewed by Khatri and Draghici [12]. Since typically a large number of gene sets are scored in parallel, the p-values must be corrected for multiple testing. Grossman et al. [13] recently addressed technical complications arising from the strong overlap between the hierarchically organized Gene Ontology categories.

An alternative: scoring the distribution of expression levels for predefined gene sets

An early example of the use of predefined gene sets to analyze differential expression at the pathway level can be found in Lascaris et al. [14]. The authors used a z-score to represent the difference between the average expression in a gene set S consisting of n genes and the genomewide mean μ:
z = x ¯ S μ σ x ¯ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWG6bGEcqGH9aqpdaWcaaqaaiqbdIha4zaaraWaaSbaaSqaaiabdofatbqabaGccqGHsisliiGacqWF8oqBaeaacqWFdpWCdaWgaaWcbaGafmiEaGNbaebaaeqaaaaaaaa@385A@

Here σ x ¯ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWFdpWCdaWgaaWcbaGafmiEaGNbaebaaeqaaaaa@3033@ = σ/ n MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaGcaaqaaiabd6gaUbWcbeaaaaa@2E2C@ is the standard error of the mean, σ being the standard deviation of the genomewide distribution of log-ratios. The same metric is used by the "parametric analysis of gene expression" (PAGE) method of Kim and Volsky [15]. For larger gene sets, however, the standard t-test for the difference between means yields more accurate results [16]. The t-test, in general, scores the statistical association between a categorical and quantitative feature. The categorical feature is used to partition the set of all genes, G, into two complementary subsets S and S'. The t statistic measures the difference between the means of the two subsets in units of its standard error:

t = x ¯ S x ¯ S σ 1 | S | + 1 | S | MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegyvzYrwyUfgarqqtubsr4rNCHbGeaGqiA8vkIkVAFgIELiFeLkFeLk=iY=Hhbbf9v8qqaqFr0xc9pk0xbba9q8WqFfeaY=biLkVcLq=JHqVepeea0=as0db9vqpepesP0xe9Fve9Fve9GapdbaqaaeGacaGaaiaabeqaamqadiabaaGcbaGaemiDaqNaeyypa0ZaaSaaaeaacuWG4baEgaqeamaaBaaaleaacqWGtbWuaeqaaOGaeyOeI0IafmiEaGNbaebadaWgaaWcbaGafe4uamLbauaaaeqaaaGcbaGaeq4Wdm3aaOaaaeaadaWcaaqaaiabigdaXaqaaiabcYha8jabdofatjabcYha8baacqGHRaWkdaWcaaqaaiabigdaXaqaaiabcYha8jqbdofatzaafaGaeiiFaWhaaaWcbeaaaaaaaa@537F@

Here x ¯ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWG4baEgaqeaaaa@2E3D@ S and x ¯ S MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegyvzYrwyUfgarqqtubsr4rNCHbGeaGqiA8vkIkVAFgIELiFeLkFeLk=iY=Hhbbf9v8qqaqFr0xc9pk0xbba9q8WqFfeaY=biLkVcLq=JHqVepeea0=as0db9vqpepesP0xe9Fve9Fve9GapdbaqaaeGacaGaaiaabeqaamqadiabaaGcbaGafmiEaGNbaebadaWgaaWcbaGafe4uamLbauaaaeqaaaaa@3FE1@ are the mean expression value of genes in set S and S', respe ctively, and the standard error of the difference is given by
s = ( | S | 1 ) σ S 2 + ( | S | 1 ) σ S 2 | S | + | S | 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGZbWCcqGH9aqpdaGcaaqaamaalaaabaGaeiikaGIaeiiFaWNaem4uamLaeiiFaWNaeyOeI0IaeGymaeJaeiykaKccciGae83Wdm3aa0baaSqaaiabdofatbqaaiabikdaYaaakiabgUcaRiabcIcaOiabcYha8jqbdofatzaafaGaeiiFaWNaeyOeI0IaeGymaeJaeiykaKIae83Wdm3aa0baaSqaaiqbdofatzaafaaabaGaeGOmaidaaaGcbaGaeiiFaWNaem4uamLaeiiFaWNaey4kaSIaeiiFaWNafm4uamLbauaacqGG8baFcqGHsislcqaIYaGmaaaaleqaaaaa@5325@

with σ S and σ S' the standard deviation of the expression values of the genes within set S and S', respectively. Using a t-distribution with n - 2 degrees of freedom, each t-value can be converted to a p-value, which should again be corrected for multiple testing.

Figure 2 shows a side-by-side comparison of Fisher's exact test and the t-test for a specific combination of GO category and genomewide differential expression profile. Fisher's exact test can only be applied once a set of "genes of interest" has been defined. We thresholded the fold-induction of individual genes to define this gene set, and computed GO category enrichment P-values at different thresholds (solid line/symbols). The smallest, most significant, P-value is obtained at an individual-gene threshold significantly below 2-fold induction, satisfied by over 500 genes. In general, the optimal threshold will depend on both the GO category and the expression data. By contrast, the two-sample t-test uses the expression value for all genes; no threshold for individual genes is required, an important practical advantage. While the optimal P-value from Fisher's exact test is slightly smaller than that of the two-sample t-test (dashed line), this seeming advantage disappears as soon as multiple-testing correction associated with the required threshold optimization is taken into account. Note that at the commonly used threshold of 2-fold induction, the two-sample t-test performs dramatically better.
Figure 2

Scoring GO categories: Fisher's exact test versus two-sample t-test. We analyzed gene expression data for the response to the ergosterol biosynthesis inhibitor Lovastatin as measured by Hughes et al. [27]. The two-sample t-test reveals that the mean expression level of genes in the GO category "ergosterol biosynthesis" is significantly higher than expected (dotted line; t = 7.4; P = 1.1·10-13). Fisher's exact test can be used to score over-representation of the same GO category in the set of most induced genes. However, this requires one to first define a threshold for the expression fold-change of individual genes. The solid line shows how the P-value from Fisher's exact test depends on this threshold.

Other statistical tests have also been used to detect differential expression of gene sets based on the distribution of expression values. The original version of the "gene set enrichment analysis" (GSEA) method [17] used the Kolmogorov-Smirnov (KS) statistic to test whether the distribution of expression levels in a specific gene set was different from that of all genes; this approach was later found to require a modification to work reliably [18]. The Wilcoxon-Mann-Whitney test, a non-parametric equivalent of the t-test that uses expression values only to rank the genes, has also been applied to this problem [19].

Beyond gene sets: approaches based on regression analysis

The assignment of genes to gene sets is categorical: Either the gene belongs to the set, or it does not. However, gene sets are often a proxy for regulatory pathways. This is most obvious in the case of the gene sets based on ChIP-chip data [11], which were used by Boorsma et al. [16] to analyze differential mRNA expression using the t-test. The strict delineation of "targets" of a given TF based on thresholding of the ChIP-chip signals is an oversimplification. In reality, the degree to which the transcription rate for a given gene responds to a change in the activity of the TF depends in a continuous fashion on the binding affinity between the TF and the promoter DNA (as well as interactions with co-factors, chromatin, etc.). Thus, if an estimate of this affinity is used as a predictor for changes in transcription rate (and therefore expression), a single parameter that quantifies the global change in TF activity may explain a wide range of transcriptional responses across the genome. This intuition can be formalized in the form of a linear regression model:

A g = C + FN g

where C is an intercept and F a slope estimating the change in TF activity. The dependent ("response") variable A g is the mRNA expression log-ratio of gene g between conditions. The independent ("predictor") variable N g represents the regulatory network connectivity between the TF and the promoter region of gene g. For given A g and N g , the deviance D between the measured and predicted expression values
D = g ( A g C F N g ) 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGebarcqGH9aqpdaaeqbqaaiabcIcaOiabdgeabnaaBaaaleaacqWGNbWzaeqaaOGaeyOeI0Iaem4qamKaeyOeI0IaemOrayKaemOta40aaSbaaSqaaiabdEgaNbqabaGccqGGPaqkdaahaaWcbeqaaiabikdaYaaaaeaacqWGNbWzaeqaniabggHiLdaaaa@3E4B@
is minimized. The solution is given by
F = A N A N N 2 N 2 = δ A δ N δ N 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGgbGrcqGH9aqpdaWcaaqaaiabgMYiHlabdgeabjabd6eaojabgQYiXlabgkHiTiabgMYiHlabdgeabjabgQYiXlabgMYiHlabd6eaojabgQYiXdqaaiabgMYiHlabd6eaonaaCaaaleqabaGaeGOmaidaaOGaeyOkJeVaeyOeI0IaeyykJeUaemOta4KaeyOkJe=aaWbaaSqabeaacqaIYaGmaaaaaOGaeyypa0ZaaSaaaeaacqGHPms4iiGacqWF0oazcqWGbbqqcqaH0oazcqWGobGtcqGHQms8aeaacqGHPms4cqWF0oazcqWGobGtdaahaaWcbeqaaiabikdaYaaakiabgQYiXdaaaaa@5CC7@

andC = <A> - F <N>.

where <X> = (1/G) ∑ g X g denotes an average over all genes and δX g X g - <X> denotes the deviation from the genomic mean, so that <δX2> equals the variance of X. Because we are dealing with univariate regression (a single independent variable), the Pearson correlation coefficient between A and N,
r = δ A δ N δ A 2 δ N 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGYbGCcqGH9aqpdaWcaaqaaiabgMYiHJGaciab=r7aKjabdgeabjab=r7aKjabd6eaojabgQYiXdqaamaakaaabaGaeyykJeUae8hTdqMaemyqae0aaWbaaSqabeaacqaIYaGmaaGccqGHQms8cqGHPms4cqWF0oazcqWGobGtdaahaaWcbeqaaiabikdaYaaakiabgQYiXdWcbeaaaaaaaa@4711@
can be directly related to the slope F by the following equation:
F = r δ A 2 δ N 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGgbGrcqGH9aqpcqWGYbGCdaGcaaqaamaalaaabaGaeyykJeocciGae8hTdqMaemyqae0aaWbaaSqabeaacqaIYaGmaaGccqGHQms8aeaacqGHPms4cqWF0oazcqWGobGtdaahaaWcbeqaaiabikdaYaaakiabgQYiXdaaaSqabaaaaa@3F33@
It can furthermore be shown that, in the univariate case, R2, defined as the fraction of the variance in expression that can be explained by the linear model, is given by the square of Pearson correlation:
R 2 = var ( C + F N ) var ( A ) = δ A δ N 2 δ A 2 δ N 2 = r 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGsbGudaahaaWcbeqaaiabikdaYaaakiabg2da9maalaaabaGagiODayNaeiyyaeMaeiOCaiNaeiikaGIaem4qamKaey4kaSIaemOrayKaemOta4KaeiykaKcabaGagiODayNaeiyyaeMaeiOCaiNaeiikaGIaemyqaeKaeiykaKcaaiabg2da9maalaaabaGaeyykJeocciGae8hTdqMaemyqaeKae8hTdqMaemOta4KaeyOkJe=aaWbaaSqabeaacqaIYaGmaaaakeaacqGHPms4cqWF0oazcqWGbbqqdaahaaWcbeqaaiabikdaYaaakiabgQYiXlabgMYiHlab=r7aKjabd6eaonaaCaaaleqabaGaeGOmaidaaOGaeyOkJepaaiabg2da9iabdkhaYnaaCaaaleqabaGaeGOmaidaaaaa@5EA2@
A transformation of r due to R.A. Fisher
t = 1 2 ln ( 1 + r 1 r ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWG0baDcqGH9aqpdaWcaaqaaiabigdaXaqaaiabikdaYaaacyGGSbaBcqGGUbGBdaqadaqaamaalaaabaGaeGymaeJaey4kaSIaemOCaihabaGaeGymaeJaeyOeI0IaemOCaihaaaGaayjkaiaawMcaaaaa@3BFD@

yields a statistic t that is distributed according a t-distribution with n - 3 degrees of freedom, and can thus be easily converted to a p-value. Again, multiple testing will need to be accounted for whenever the association with multiple features is scored in parallel.

There are many ways in which the regulatory network connectivities N g can be chosen. The first application of regression analysis to microarray data, by Bussemaker et al. [20], used integer motif counts in promoter regions. Continuous sequence scores based on position-specific scoring matrices (PSSMs) [21, 22] and position-specific affinity matrices (PSAMs) [23, 24] have also been used. The values for R2 obtained with such sequence-based predictors are typically in the range of 1–5%. Another possible choice for N are ChIP-chip enrichment (log-)ratios [25, 26]. As these values are relatively noisy experimental measurements, the values for R2 observed in this case are usually smaller (< 1%).


In this work, rather than providing a comprehensive review of all relevant literature, we have outlined two conceptually different approaches for scoring differential expression at the pathway level. These methods use prior information about how different genes relate to each other to reduce the dimensionality of the problem. This obviates the need to first obtain gene clusters or modules from expression data over multiple conditions, and thereby makes it possible to analyze each differential expression profile by itself in a condition-specific fashion.



We thank members of the Bussemaker Lab for valuable discussions, and Barrett Foat for a critical reading of the manuscript. This work was supported by grants HG003008, CA121852, and GM074105 from the National Institutes of Health and grant APB.5504 from the Netherlands Foundation for Technical Research (STW).

This article has been published as part of BMC Bioinformatics Volume 8 Supplement 6, 2007: Otto Warburg International Summer School and Workshop on Networks and Regulation. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/8?issue=S6

Authors’ Affiliations

Department of Biological Sciences, Columbia University
Center for Computational Biology and Bioinformatics, Columbia University
Swammerdam Institute for Life Sciences, University of Amsterdam, BioCentrum Amsterdam


  1. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998, 95 (25): 14863-14868. 10.1073/pnas.95.25.14863.PubMed CentralView ArticlePubMedGoogle Scholar
  2. Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N: Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet. 2003, 34 (2): 166-176. [http://dx.doi.org/10.1038/ng1165]View ArticlePubMedGoogle Scholar
  3. Beer MA, Tavazoie S: Predicting gene expression from sequence. Cell. 2004, 117 (2): 185-198. 10.1016/S0092-8674(04)00304-6.View ArticlePubMedGoogle Scholar
  4. Friedman N: Inferring cellular networks using probabilistic graphical models. Science. 2004, 303 (5659): 799-805. 10.1126/science.1094068. [http://dx.doi.org/10.1126/science.1094068]View ArticlePubMedGoogle Scholar
  5. Middendorf M, Kundaje A, Wiggins C, Freund Y, Leslie C: Predicting genetic regulatory response using classification. Bioinformatics. 2004, 20 (Suppl 1): I232-I240. 10.1093/bioinformatics/bth923. [http://dx.doi.org/10.1093/bioinformatics/bth923]View ArticlePubMedGoogle Scholar
  6. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556. [http://dx.doi.org/10.1038/75556]PubMed CentralView ArticlePubMedGoogle Scholar
  7. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998, 9 (12): 3273-3297.PubMed CentralView ArticlePubMedGoogle Scholar
  8. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat Genet. 1999, 22 (3): 281-285. 10.1038/10343. [http://dx.doi.org/10.1038/10343]View ArticlePubMedGoogle Scholar
  9. van Helden J, Andre B, Collado-Vides J: Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol. 1998, 281 (5): 827-842. 10.1006/jmbi.1998.1947. [http://dx.doi.org/10.1006/jmbi.1998.1947]View ArticlePubMedGoogle Scholar
  10. Stormo GD: DNA binding sites: representation and discovery. Bioinformatics. 2000, 16: 16-23. 10.1093/bioinformatics/16.1.16.View ArticlePubMedGoogle Scholar
  11. Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, Jennings EG, Zeitlinger J, Pokholok DK, Kellis M, Rolfe PA, Takusagawa KT, Lander ES, Gifford DK, Fraenkel E, Young RA: Transcriptional regulatory code of a eukaryotic genome. Nature. 2004, 431 (7004): 99-104. 10.1038/nature02800. [http://dx.doi.org/10.1038/nature02800]PubMed CentralView ArticlePubMedGoogle Scholar
  12. Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics. 2005, 21 (18): 3587-3595. 10.1093/bioinformatics/bti565. [http://dx.doi.org/10.1093/bioinformatics/bti565]PubMed CentralView ArticlePubMedGoogle Scholar
  13. Grossmann S, Bauer S, Robinson PN, Vingron M: An Improved Statistic for Detecting Over-Represented Gene Ontology Annotations in Gene Sets. RECOMB. 2006, 85-98.Google Scholar
  14. Lascaris R, Bussemaker HJ, Boorsma A, Piper M, van der Spek H, Grivell L, Blom J: Hap4p overexpression in glucose-grown Saccharomyces cerevisiae induces cells to enter a novel metabolic state. Genome Biol. 2003, 4: R3-10.1186/gb-2002-4-1-r3.PubMed CentralView ArticlePubMedGoogle Scholar
  15. Kim SY, Volsky DJ: PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics. 2005, 6: 144-10.1186/1471-2105-6-144. [http://dx.doi.org/10.1186/1471-2105-6-144]PubMed CentralView ArticlePubMedGoogle Scholar
  16. Boorsma A, Foat BC, Vis D, Klis F, Bussemaker HJ: T-profller: scoring the activity of predefined groups of genes using gene expression data. Nucleic Acids Res. 2005, W592-W595. 10.1093/nar/gki484. [http://dx.doi.org/10.1093/nar/gki484]33 Web Server
  17. Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC: PGC-lalpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003, 34 (3): 267-273. 10.1038/ng1180. [http://dx.doi.org/10.1038/ng1180]View ArticlePubMedGoogle Scholar
  18. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005, 102 (43): 15545-15550. 10.1073/pnas.0506580102. [http://dx.doi.org/10.1073/pnas.0506580102]PubMed CentralView ArticlePubMedGoogle Scholar
  19. Scheer M, Klawonn F, Muench R, Grote A, Killer K, Choi C, Koch I, Schobert M, Haertig E, Klages U, Jahn D: JProGO: a novel tool for the functional interpretation of prokaryotic microarray data using Gene Ontology information. Nucleic Acids Res. 2006, W510-W515. 10.1093/nar/gkl329. [http://dx.doi.org/10.1093/nar/gkl329]34 Web Server
  20. Bussemaker HJ, Li H, Siggia ED: Regulatory element detection using correlation with expression. Nat Genet. 2001, 27 (2): 167-171. 10.1038/84792. [http://dx.doi.org/10.1038/84792]View ArticlePubMedGoogle Scholar
  21. Conlon EM, Liu XS, Lieb JD, Liu JS: Integrating regulatory motif discovery and genome-wide expression analysis. Proc Natl Acad Sci USA. 2003, 100 (6): 3339-3344. 10.1073/pnas.0630591100. [http://dx.doi.org/10.1073/pnas.0630591100]PubMed CentralView ArticlePubMedGoogle Scholar
  22. Nguyen DH, D'haeseleer P: Deciphering principles of transcription regulation in eukaryotic genomes. Mol Syst Biol. 2006, 2: 2006.0012-10.1038/msb4100054. [http://dx.doi.org/10.1038/msb4100054]PubMed CentralView ArticlePubMedGoogle Scholar
  23. Foat BC, Houshmandi SS, Olivas WM, Bussemaker HJ: Profiling condition-specific, genome-wide regulation of mRNA stability in yeast. Proc Natl Acad Sci USA. 2005, 102 (49): 17675-17680. 10.1073/pnas.0503803102. [http://dx.doi.org/10.1073/pnas.0503803102]PubMed CentralView ArticlePubMedGoogle Scholar
  24. Foat BC, Morozov AV, Bussemaker HJ: Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics. 2006, 22 (14): e141-e149. 10.1093/bioinformatics/btl223. [http://dx.doi.org/10.1093/bioinformatics/btl223]View ArticlePubMedGoogle Scholar
  25. Liao JC, Boscolo R, Yang YL, Tran LM, Sabatti C, Roychowdhury VP: Network component analysis: reconstruction of regulatory signals in biological systems. Proc Natl Acad Sci USA. 2003, 100 (26): 15522-15527. 10.1073/pnas.2136632100. [http://dx.doi.org/10.1073/pnas.2136632100]PubMed CentralView ArticlePubMedGoogle Scholar
  26. Gao F, Foat BC, Bussemaker HJ: Defining transcriptional networks through integrative modeling of mRNA expression and transcription factor binding data. BMC Bioinformatics. 2004, 5: 31-10.1186/1471-2105-5-31. [http://dx.doi.org/10.1186/1471-2105-5-31]PubMed CentralView ArticlePubMedGoogle Scholar
  27. Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD, Kidd MJ, King AM, Meyer MR, Slade D, Lum PY, Stepaniants SB, Shoemaker DD, Gachotte D, Chakraburtty K, Simon J, Bard M, Friend SH: Functional discovery via a compendium of expression profiles. Cell. 2000, 102: 109-126. 10.1016/S0092-8674(00)00015-5.View ArticlePubMedGoogle Scholar


© Bussemaker et al; licensee BioMed Central Ltd. 2007

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.