Signal transduction pathway profiling of individual tumor samples

Background Signal transduction pathways convey information from the outside of the cell to transcription factors, which in turn regulate gene expression. Our objective is to analyze tumor gene expression data from microarrays in the context of such pathways. Results We use pathways compiled from the TRANSPATH/TRANSFAC databases and the literature, and three publicly available cancer microarray data sets. Variation in pathway activity, across the samples, is gauged by the degree of correlation between downstream targets of a pathway. Two correlation scores are applied; one considers all pairs of downstream targets, and the other considers only pairs without common transcription factors. Several pathways are found to be differentially active in the data sets using these scores. Moreover, we devise a score for pathway activity in individual samples, based on the average expression value of the downstream targets. Statistical significance is assigned to the scores using permutation of genes as null model. Hence, for individual samples, the status of a pathway is given as a sign, + or -, and a p-value. This approach defines a projection of high-dimensional gene expression data onto low-dimensional pathway activity scores. For each dataset and many pathways we find a much larger number of significant samples than expected by chance. Finally, we find that several sample-wise pathway activities are significantly associated with clinical classifications of the samples. Conclusion This study shows that it is feasible to infer signal transduction pathway activity, in individual samples, from gene expression data. Furthermore, these pathway activities are biologically relevant in the three cancer data sets.


Background
The interpretation of microarray data is facilitated by combining the data, or results of data analysis, with prior contextual knowledge, e.g., ontologies [1][2][3][4], pathways [5][6][7] and other annotation groups of interest [8]. By using prior knowledge about pathways, we aim at inferring cellular signaling pathway activity from tumor microarray data, on a sample-by-sample basis. Furthermore, we examine whether the pathway activity of individual sam-ples is associated with clinical classifications of the samples. Our approach is in sharp contrast to establishing pathways from gene expression data (see e.g. [9]). What we do here is to project gene expresson data onto prior knowledge, in this case established pathway databases.
Signaling pathway activity scoring is a more direct measurement of biological processes than ontology mapping, which aims at finding over-representation of genes in various groups of contextual annotation. A cellular signaling pathway (see Fig. 1) is composed of a series of signaling molecules that convey information, typically from the outside of the cell to the nucleus. The initial step consists of extracellular signaling molecules, ligands, that activate receptors of the cell. These receptors then initiate intracellular signaling events, which eventually regulate the activity of various transcription factors. These transcription factors, in turn, regulate the expression levels of various genes, termed downstream targets of the pathway.
To characterize pathway activity, it would be desirable to have both proteomic and gene expression data. Gene expression data alone is not sufficient for assessing protein concentrations [10] and post-translational modifications of proteins. In the absence of proteomic data, one is thus forced to rely on aspects of the pathway that are detectable at the mRNA level. The foremost candidate for this is the downstream targets, which we will focus on here. It is of course also possible that the mRNA levels of effector proteins in a pathway change due to altered pathway activity. However, such effects are outside the scope of this paper. A further complication is that many pathways overlap, both in terms of having common transcription factors, and in terms of distinct transcription factors having common downstream targets. Our methods will not be able to distinguish very similar pathways, and in some sense this problem can be seen as a result of the ambiguities that follow when the full protein network is partitioned into separate pathways.
Cellular signaling pathways are subject to intense research, and current knowledge is compiled into databases such as STKE [11], TRANSPATH [12] and TRANS-FAC [13]. These databases do not yet account for all pathways or transcription factors, but develop over time. Most pathway information utilized in this work is collected from TRANSPATH/TRANSFAC, where information about transcription factors and downstream targets is readily available. The only exception is the estrogen receptor pathway, which is taken from [14]. We analyze three microarray data sets in this study: Two breast cancer data sets [15,16], and one leukemia data set [17].
For the three data sets, we assess pathway activity from two related, but different, points of view. The first is to examine which pathways behave in a coherent way across the entire data set, i.e., which pathways have significantly co-expressed downstream targets. This is done both with and without accounting for the fact that downstream targets of a single transcription factor are correlated irrespective of pathway behavior. The second point of view is to assess pathway activity of individual samples, relative to the other samples in the same experiment, yielding an active or inactive status for each pathway in each sample. Finally, we relate the sample-wise pathway activity to clinical classifications of samples by way of contingency tables. Several pathways are found to be highly predictive of the clinical classifications. Figure 1 Example pathway. A simplified and partial view of the TNF-α pathway. A ligand (yellow) binds to a receptor (green) on the cell surface, triggering a cascade of events. Eventually, transcription factors (blue) activate or repress the expression of genes.

Results
In this section we assess variation in transcription factor and pathway activity across the samples. We then proceed to probe pathway activity in individual samples. Finally, we study the association between pathway activity and the clinical classifications of the samples.

Co-expression of the downstream targets of a transcription factor
As a prelude to the study of pathways, we quantified the degree of correlation among downstream targets of single transcription factors. For this purpose we used the Group Correlation Score defined in Methods. The p-values were calculated using random reshuffling of the genes. With this in mind we conclude that the downstream targets of transcription factors are co-expressed in the these data sets, albeit not to a high degree. The full lists for all 3 data sets can be found in Additional file 1.

Co-expression of the downstream targets of a pathway
Here we use the same Group Correlation Score as above, applied to the downstream targets of entire pathways rather than those of individual transcription factors. Table  1 shows the results for the van 't Veer data set, where 21 out of 29 pathways have a p-value below 0.05, which is significantly more than expected by chance. However, many of the downstream targets have common transcription factors, which might be the major cause of the coexpression. To eliminate such a contribution we used the Exclusive Group Correlation Score, which considers only pairs of downstream targets lacking common transcription factors. The p-values for the Exclusive Group Correlation Score are also shown in Table 1. Although these p-values are larger, they are still significant; out of 20 pathways with more than one transcription factor, 12 have a Exclusive Group Correlation Score p-value below 0.05, which is still more than expected by chance. We conclude that the co-expression of downstream targets in a pathway can only in part be explained by the genes having common transcription factors. This co-expression at the pathway level justifies the view of pathways as functional units. Similar tables for the two other data sets are shown in Additional file 1. Both data sets have smaller p-values than expected by chance, albeit not as convincingly as the van 't Veer data set.

Pathway assignments for individual samples
After having established that downstream target genes are co-expressed in some pathways, we proceeded to study the status of pathway activity in individual samples. To this end we employed the Group Sample Score, which for each pathway designates every sample in a data set as either active or inactive, with an associated p-value. Table 3 shows p-values and pathway activity status, for six samples in the van 't Veer data set. For example, in the first sample the RANK pathway is designated as inactive with a p-value of 0.004, whereas the inactiveness of the ERinduced pathway cannot be considered significant. The full tables for all samples and pathways in all 3 data sets are provided in Additional file 1. Table 4 shows the number of samples that are active and inactive at the 5% level, for every pathway in the van 't Veer data set. The table also contains the family-wise pvalue, defined in Methods, which gives the probability of observing at least this total number of significant samples for a pathway. The family-wise p-value assumes that the samples are independent, which is only approximately true since the mean expression value of a gene across all samples is zero. Corresponding tables for the two other data sets can be found in Additional file 1. We note that the most significant pathways according to this measure are mostly the same as with the correlation based scores, although the p-values are numerically different.  There are also many genes in the pathway that do not seem to be regulated, and even a few which act oppositely. The latter ones are possibly genes, that are repressed by the pathway. Similar figures for the entire genome (not shown) do not show large red or green upper corners, confirming the statistical analysis.

Association between sample-wise pathway activity and clinical classifications
We analyzed the association between sample-wise pathway activity and clinical classifications using contingency tables. For every pathway and data set, we divided the samples into three groups: Samples where the pathway was active at a 5% significance level, samples where it was inactive at a 5% significance level, and insignificant samples referred to as undecided. For each data set, contingency tables of pathway activity versus clinical classifications were created, and χ 2 p-values were calculated.
In the data set of Golub et al., the only available clinical classification is tumor type, i.e., ALL or AML. Table 5 shows the contingencies for the Insulin pathway and the I1-1 pathway. Seven out of 29 pathways have contingency tables with a χ 2 p-value below 0.01.
For the breast cancer data set of van 't Veer et al., we investigated six clinical classifications: metastasis status (0), estrogen receptor status (20), progesterone receptor status (12), lymph node status (12), BRCA mutations (15) and histological grade (8). The numbers in parentheses refer to the number of significant contingency tables at the 0.01 level. The total number of pathways was again 29. For metastasis status, only 97 out of the 117 samples were labeled in the original data set, and this may contribute to the low degree of association between this clinical classification and pathway activity. However, similar results were obtained for the data set of Sotiriou et al., which indicates that it may be difficult to obtain any association between the pathways analyzed in this work and breast cancer metastasis status. Table 6 shows the contingency between estrogen receptor status and the ER-induced pathway. As expected, there is a strong association between presence of the estrogen receptor protein, and the activity status of the ER-induced pathway. Somewhat more surprisingly, there are also strong associations between ER status and many other pathways. Similar results are obtained for the data set of Sotiriou et al., but with fewer significant associations.
A general tendency of the contingency table analysis is illustrated in Table 7. Lowering the pathway activity pvalue cutoff makes the association to clinical classifications more specific but less sensitive. The complete set of contingency tables for all three data sets can be found in Additional file 1.

Conclusion
We have shown that downstream target genes of signal transduction pathways behave coherently in gene expression tumor data sets. First, we confirmed that downstream targets of transcription factors are correlated across samples. We then demonstrated that the same holds true for downstream targets of an entire pathway, even after discounting the correlations due to genes having a common transcription factor. The correlations for entire pathways were found to be more significant than those for individual transcription factors. ER-induced pathway heatmap Figure 2 ER-induced pathway heatmap. Heatmap of the ER-induced pathway activity corresponding to Table 4. The vertical divisions correspond to samples where the ER-induced pathway is significantly upregulated (30 samples, red bar), its status undetermined (50 samples, black bar) or significantly downregulated (37 samples, green bar) respectively. The rows represent genes in the ER-induced pathway. The genes are shown in descending order according to scalar product with the vector ± (1 -pvalue) for each sample, where the sign is the sign of the pathway activity of that sample. The expression values are log ratios normalized as described in Methods. Red and green represent up-and down-regulation respectively, and the precise color scheme is illustrated in the color key.

Density
The presence of significant correlations confirms the expectation that gene expression is controlled by the activity of pathways. However, these correlations do not tell us in which samples a pathway is active or inactive. To reveal this, we devised the Group Sample Score. With this score we classified the samples into those where the pathway was significantly active, significantly inactive or undecided, respectively. As seen in Table 4, the number of significant samples is, for most pathways, much higher than the random expectation.

Density
In many cases, the active/inactive pathway status was highly correlated with independent clinical classifications. This confirms the relevance of pathways for understanding of the underlying biology. Furthermore, the activity status of one or more pathways may be used to subdivide the samples into groups with distinct biological characteristics. Such a subdivision is feasible if, for instance, tumors of a certain clinical diagnosis are an agglomerate of several subtypes.
The Group Sample Score is natural if a pathway either induces all its downstream targets, or represses them. However, in most pathways some downstream genes are induced, while others are repressed. To account for a mix-ture of induction and repression, one should include a sign, or more generally a weight, to each term in the sum. Such a weight might even depend on the type of tissue and the environment. Since this information was not readily available for the studied pathways, all genes were weighted equally. Nevertheless, we obtained significant results, indicative of a dominant trend among the downstream genes. For the estrogen receptor (ER) pathway, we did have information about the sign, but instead of introducing a more general score for this pathway alone, we split the ER pathway into two parts, with induced and repressed genes, respectively. In the breast cancer data set of van 't Veer et al. [15], there were 50 genes in the induced part and 27 in the repressed. As seen in Tables 1 and 4, the ER-induced pathway was highly significant, whereas the repressed pathway was not. The full pathway was also highly significant, although to a lesser extent. The significance of the full pathway is thus due to the induced genes, which constitute a majority of the downstream targets. The situation is similar for other pathways and data sets.
It should be stressed that correlations, and the pathway activity status observed in a sample, are only defined relative to the other samples in the same data set. If a pathway were active in all samples, it would not show up in our significance test. The status of a pathway, as we define it, is given by the downstream genes, and the connection to ligands, receptors and other pathway components cannot be inferred from this analysis. Table 2 shows that the most significant transcription factor in the breast cancer data set of van 't Veer et al. is NF-κB. This transcription factor is also the most one in the Table 5: Contingency tables for the ALL/AML status versus the Insulin and IL-1 pathways in the leukemia data set of Golub et al. [17]. Active, non-active and undecided pathways are denoted +,and U respectively.

ALL AML
Insulin pw(+) p-value: 5e-04 Table 6: Contingency tables of estrogen receptor protein (binned at three levels: 0, 5-50, 60-100) versus the ER-induced and RANK pathways in the breast cancer data set of van't Veer et al. [15]. Same notation as in Table 5.   [18] demonstrate this in a mouse model of human hepatocellular carcinoma, where the inflammatory mediator tumornecrosis factor-α (TNF-α) is shown to play an important role as an activator of NF-κB. Greten et al. [19] find similar results in a mouse model of colitis-associated cancer.
Our current knowledge of pathways, and of downstream targets of transcription factors, is far from complete. However, we find that the results presented herein constitute a proof of concept for analyzing microarray gene expression in the context of signal transduction pathways.

Pathway information and UniGene clusters
Transcription factors for 23 pathways were extracted from TRANSPATH [12]. The downstream target genes of those transcription factors were obtained from TRANSFAC [13]. Since our study contains breast cancer data, we have augmented the pathway information with the Estrogen Receptor (ER) pathway compiled from [14], where 89 direct target genes were identified. 59 of them were induced by the ER complex and 30 were repressed. 8 out of the 89 genes were previously verified. We employ all six combinations of induced/repressed/all and verified/all, yielding 6 versions of the ER pathway. The 29 pathways employed are listed in Table 1. Downstream target genes were represented as UniGene IDs http:// www.ncbi.nlm.nih.gov/UniGene, using UniGene Hs build 171. For the analysis of the data sets, gene identifiers were converted into UniGene IDs and expression values of clones belonging to the same UniGene cluster were averaged.

Data sets
The following three publicly available data sets were analyzed: 1. The breast cancer data set of van 't Veer et al. [15], consisting of samples from 117 patients, of which 46 developed metastases. After UniGene merging the data set contains 20663 genes.
2. The breast cancer data set of Sotiriou et al. [16], consisting of 99 samples of different clinical classifications, with 4878 genes after UniGene merging.
3. The leukemia data set of Golub et al. [17], derived from bone marrow samples from 38 patients, 27 of which were diagnosed with Acute Myeloid Leukemia (AML) and 11 with Acute Lymphoblastic Leukemia (ALL). After Uni Gene merging and removal of genes with no variance across the samples, 4701 genes remain.

Normalization of microarray data
The data sets in [15,16] are given in the form of log ratios of expression values in the samples versus a reference. The data set in [17] is given in the form of Affymetrix average difference values. For the calculation of Group Correlation Score and Exclusive Group Correlation Score the affymetrix average difference values were logarithm transformed, since the Pearson correlation is very sensitive to single outlier samples. For the Group Sample Score the original differences were kept. We denote the expression value for gene g in sample s by x gs , with missing values allowed. We normalized the expression values in two steps. First, for each sample, the mean of all genes was subtracted, in order to ensure that no samples are up-or down-regulated on average. The transformed expression values satisfy: Second, for each gene, the mean of all samples was subtracted from the expression values of that gene, yielding: The second normalization implies that the expression value of a gene is measured relative to the same gene in other samples.

Pearson correlation p-values
To determine if a group of downstream target genes is significantly co-expressed, a total score of the group is needed. Two scores were used here, both based on the Pearson correlation of a pair of genes g and h: where the sums exclude missing values and is the mean of expression values for gene g.
The Group Correlation Score is defined as the sum of squares of Pearson correlations among all pairs of genes in a group of genes: where the sum runs over all genes in the group. The square ensures that both correlations and anti-correlations contribute to the score. We use the Group Correlation Score for where the sum runs exclusively over pairs of genes g and h that do not share any transcription factor.
The p-value of a score is defined as the fraction of random cases, drawn under the null hypothesis, which achieve a higher score than the score tested. For both scores, GCS and EGCS, our null hypothesis is reshuffling of the genes on the microarray. This null hypothesis keeps the structure and overlap of all pathways fixed, but changes the identity of the genes.

Pathway activity for individual samples
For each sample, s, and pathway, PW, the Group Sample Score is defined as follows: where the sum runs over all downstream target genes of the pathway.
The null hypothesis is again reshuffling of the genes from the microarray. We are interested in pathways both with high and low scores. Hence, we consider the p-values for the score being higher (p + ) and lower (p -) than random, respectively, and the final p-value is given by two times the smaller of these two p-values: p = 2·min(p + , p -).
The pathway is said to be active (+) if p + <p -, and inactive (-) otherwise.

Family-wise p-value
If N independent hypotheses are tested simultaneously, the probability to obtain K or more p-values below q is given by a binomial distribution: We refer to this probability as the family-wise p-value.