Sensitive detection of pathway perturbations in cancers
© BioMed Central Ltd 2012
Published: 21 March 2012
Skip to main content
© BioMed Central Ltd 2012
Published: 21 March 2012
The normal functioning of a living cell is characterized by complex interaction networks involving many different types of molecules. Associations detected between diseases and perturbations in well-defined pathways within such interaction networks have the potential to illuminate the molecular mechanisms underlying disease progression and response to treatment.
In this paper, we present a computational method that compares expression profiles of genes in cancer samples to samples from normal tissues in order to detect perturbations of pre-defined pathways in the cancer. In contrast to many previous methods, our scoring function approach explicitly takes into account the interactions between the gene products in a pathway. Moreover, we compute the sub-pathway that has the highest score, as opposed to merely computing the score for the entire pathway. We use a permutation test to assess the statistical significance of the most perturbed sub-pathway. We apply our method to 20 pathways in the Netpath database and to the Global Cancer Map of gene expression in 18 cancers. We demonstrate that our method yields more sensitive results than alternatives that do not consider interactions or measure the perturbation of a pathway as a whole. We perform a sensitivity analysis to show that our approach is robust to modest changes in the input data. Our method confirms numerous well-known connections between pathways and cancers.
Our results indicate that integrating differential gene expression with the interaction structure in a pathway is a powerful approach for detecting links between a cancer and the pathways perturbed in it. Our results also suggest that even well-studied pathways may be perturbed only partially in any given cancer. Further analysis of cancer-specific sub-pathways may shed new light on the similarities and differences between cancers.
Complex diseases such as cancer are associated with the alteration or dis-regulation of multiple pathways and processes in the cell. Discovering and cataloging which pathways are perturbed in each type of cancer is important for improving our understanding of the mechanisms underlying these diseases. In particular, such studies can pin-point pathways that may be uniquely perturbed in one or a small number of related cancers, thus providing potential targets for therapeutic studies.
Many methods have been developed to study the activation of pre-defined gene sets in human diseases and tissues [1–7]. In this context, a "gene set" is usually taken to be a collection of genes that share a common attribute, e.g., Gene Ontology annotation or membership in a pathway. For instance, Subramanian et al.  developed "Gene Set Enrichment Analysis" to test if a gene set is differentially expressed in two phenotypes by ranking all genes by some measure (say, the t statistic) and using a modified Kolmogorov-Smirnov statistic to decide whether the genes in the set have surprisingly high or low ranks. Segal et al.  used a hierarchical clustering algorithm to combine pre-defined gene sets into modules. They characterized gene-expression profiles in specific (sets of) tumors as a combination of activated and de-activated modules.
These methods ignore physical or functional interactions between the genes (or their products) in a gene set. Analysis of gene expression measurements in the context of the interaction structure inherent in a pathway can take into account both perturbations in gene expression and the topological properties of the network. More recent methods have sought to capture information about the activation of a pathway from the perspective of the interactions in it. A number of these techniques, reviewed by , have been developed for case-control data, for which we can compute p-values reflecting the statistical significance of the differential expression of each gene between the samples in the treatment and those in the control [10–14]. Draghici et al.  combined a term that captured the significance of the genes in a pathway with an additional weighted term that measured how well the data matches the expected pattern of induction and repression, as encoded by the interactions in the pathway. Efroni et al.  used pathway perturbation measurements to predict prognosis and tumor grade. Both approaches measure the perturbation of a pathway in its entirety. Thus, they may not be sensitive to situations when only a sub-pathway is highly perturbed.
Related techniques analyze gene expression measurements made under an experimental condition in the context of a large-scale protein-protein interaction network (often integrated from multiple sources) in order to determine the sub-network of interactions that respond to the experimental condition [15–19]. These approaches have primarily been used for determining the global response network perturbed in the cell in a particular condition, especially since most experimentally-determined protein interactions have not yet been explicitly associated with pathways.
In this paper, we develop a systematic methodology to detect which pathways are perturbed in a disease. Here, we use the term pathway to refer to a network of physical interactions between genes and gene products that together perform a specific biological function. Given the interactions in a pathway (e.g., the TNF alpha pathway) and genome-wide case-control gene expression data, i.e., measurements for a disease phenotype (e.g., melanoma) and a control phenotype (e.g., normal skin cells), our method computes the sub-pathway that is most perturbed in the disease (when compared to the control). Thus, our method combines the features of the two classes of methods discussed above: (i) it treats a pathway as a network of interconnected molecules rather than merely as a set of genes and gene products; (ii) it is sensitive to the possibility that the pathway is not perturbed in its entirety but that only some portion of it is significantly perturbed; and (iii) it can be applied to specific, well-defined pathways that a scientist may be interested in studying.
Our algorithm takes the interactions in a pathway P and case-control gene expression measurements as input. We first assess the differential expression of each gene in P. We develop a statistic based on the Liptak-Stouffer z-score that measures the combined perturbation of the genes in P. This statistic takes into account both the interactions in P and the differential expression of each gene. We use this statistic to compute which sub-pathway of P is maximally perturbed. Finally, we use a permutation-based test to assess the statistical significance of the maximally-perturbed sub-pathway.
We applied this approach to 20 cancer and immune signaling pathways in the Netpath database . We used gene expression measurements in the Global Cancer Map (GCM) . The GCM dataset spans 18 cancers and 13 normal tissues. First, we showed that the scores of perturbed sub-pathways computed by our method are much more statistically significant than the scores of the complete pathways. Second, we compared our results to those obtained by applying three techniques that analyze case-control gene expression data: ActiveModules , Gene Set Enrichment Analysis (GSEA) , and Sub-GSE . ActiveModules integrates the gene expression data with protein interaction networks in order to find highly perturbed sub-networks. GSEA and Sub-GSE are network-free approaches that find gene sets that are highly perturbed in the gene expression data. Our method showed much better sensitivity than both ActiveModules and GSEA in detecting perturbed sub-pathways. The comparison between our approach and Sub-GSE was mixed. Third, our method was robust to missing data, specifically to the removal of gene expression samples from the input. Finally, we found ample literature support for a number of pathway-cancer associations detected by our approach. Taken together, these results underscore the importance of carefully incorporating pathway structure into the analysis of gene expression data. We considered other recent approaches for comparison, which use mutual information to score individual genes , measure the synergistic relationship among a set of genes , or use biclustering to account for phenotypic variation among individuals [23, 24]. However, the number of samples per cancer in the GCM dataset is not sufficient to support robust computation of mutual information. These counts may not large enough to yield informative biclusters either. Therefore, we decided not to compare these methods with our approach in this paper.
We describe our approach in three stages. First, we formalize a measure of how perturbed a sub-network of a pathway is in a case-control gene expression data set. Next, we describe how to compute a sub-network that maximizes this measure. Finally, we discuss how we measure the statistical significance of the most perturbed sub-network.
We define a pathway P = (G, I) to be a graph composed of a set G of genes and a set I of physical or functional interactions between the genes in G or their gene products. Typically, P may be composed of multiple connected components. Given genome-wide gene expression measurements in multiple patients diagnosed with a disease in a tissue and from normal samples of that tissue, our goal is to determine whether the pathway P = (G, I) is perturbed in the disease (when compared to normal tissue) and to compute the subgraph of P that is most perturbed in the disease.
For each gene g ∈ G, let p(g) denote the p-value of its differential expression in the disease (when compared to normal tissue). We computed p(g) as the p-value of the two-sided t-test under the null hypothesis that the distributions of the expression values of g in the disease samples and in the normal samples have identical means (but may have different variances). We note that our pathway perturbation algorithm can take as input any gene expression pre-processing method that computes p-values for differential gene expression. We converted the p-value into a z-score z(g) = N - 1(1 - p(g)), where N - 1 is the inverse of the normal cumulative distribution function . At this stage, we did not impose a cut-off on z(g). Instead, we included all genes in subsequent analysis. The rationale for this choice was that while individual genes may not be differentially expressed to a statistically-significant extent, significant perturbations may be noticeable at the level of sets of genes .
The numerator of z(Q) is the weighted sum of the z-scores of all genes that appear in Q, where each gene is weighted by the number of interactions in Q that are incident on it. Dividing by the square root of the sum of squared gene degrees ensures that z(Q) is normally distributed with mean 0 and standard deviation 1, under the assumption that the z-scores for the individual genes arise from a normal distribution. Thus, this formulation of perturbation combines p-values over multiple genes in a statistically-sound way . Each gene in Q contributes both its z-score and its degree in Q to z(Q). Thus, z(Q) incorporates both the differential expression of the genes in Q as well as the network of interactions between them.
Among all subgraphs of P, let be the one with maximum value of perturbation. Since is the most differentially-perturbed subgraph of P, we use its perturbation to assess the overall perturbation of P. Thus, our formulation does not require that every gene in P be differentially expressed in order for us to declare that P itself is perturbed in the disease. We now describe how we compute . Note that we do not require that be connected, since P itself may not be connected. Ideker et al. demonstrated that a similar problem is -complete . Hence, we use a heuristic approach based on simulated annealing. Although simulated annealing is a very well known technique, we describe it below and sketch it in Algorithm 1 for the sake of completeness. To initialize , we include each interaction in P with a uniform probability of 0.5.
We perform the following series of operations for 100|I| iterations. (Recall that I is the set of interactions in P.) We select a node or an edge uniformly at random from P. Let the selected element be a. If a is already in , we delete it from ; if a is a node, we also delete all edges that are incident on a from . If a is not a member of , we add it to ; if a is a node, we insert into all edges that were incident on a in P. Let be the resulting subgraph. We compute z( ) and compare it to z( ). If z( ) is larger, we accept the modification, since we have increased the z-score. Otherwise, we accept the modification with a probability of , where T is the temperature in the current iteration. Over the iterations, we decrease the temperature T geometrically from T s = 100 to T e = 10-5. We output the final value of .
Algorithm 1 Compute , the subgraph of P with the maximum perturbation.
Initialize by including each interaction in P with probability 0.5.
T ← T s
for i = 1 .. . 100|I| do
Select a node or an edge a ∈ P uniformly at random.
if a is in then
Delete a from .
Insert a into .
if z( ) > z( ) then
Set to be
Set to be with probability
We experimented with other options within this framework such as starting with an empty subgraph and performing more than 100|I| iterations. We did not find a significant benefit from either of these choices, i.e., the score of most perturbed sub-pathway did not increase substantially (data not shown). We also found that including the addition and deletion of nodes (along with their incident edges) yielded subgraphs with much larger scores than those obtained by addition and deletion of edges alone.
We permuted node labels (and associated gene expression data) in the pathway. Let k be the number of genes in P. We replaced these k genes with k other genes, selected uniformly at random from a universe of genes (defined below). Let be the new pathway. Note that and P are isomorphic to each other, i.e., they have identical interaction structures.
We obtained the z-scores of the genes in from the gene expression data set for the disease.
We used the simulated annealing algorithm to compute z( ).
In the first step, we defined the universe to be the intersection of the set of all genes measured in the gene expression data set for the disease and the set of genes whose products were present in a protein interaction network containing 9352 proteins and 39890 interactions (assembled from multiple sources [28–31]). We used these two sets so that every gene in (i) would have gene expression values and (ii) had a protein product that was known to participate in at least one interaction.
We computed the p-value for z( ) as the fraction of random trials where z( ) > z( ). Since we tested multiple pathway-disease pairs, we controlled the false discovery rate using the method of Benjamini and Hochberg . We used the adjusted p-value in all of the subsequent analysis.
After describing the pathway and gene expression datasets we used, we present our results in five stages. First, we evaluate whether the most-perturbed pathway we computed were more statistically significantly that the entire pathway. Second, we compare the significance of the most-perturbed pathways computed by our algorithm to those found by the ActiveModules approach . Third, we compare our results to GSEA , a purely gene-set based approach. Fourth, we assess the robustness of our results to the removal of gene expression samples from the input. Finally, we present data in the literature that supports the pathway-cancer connections unearthed by our approach. At this stage, we also compare our results Sub-GSE , another gene-set based approach.
We obtained 20 curated pathways from the Netpath database . These pathways include 10 signaling pathways associated with proliferation (Androgen receptor, Alpha6 Beta4 integrin, EGFR1, Hedgehog, ID, Kit receptor, Notch, TGF beta receptor, TNF alpha/NF-kB, and Wnt) and 10 immune response signaling pathways (B cell receptor, T cell receptor, IL-1 IL-2 IL-3 IL-4 IL-5 IL-6 IL-7 and IL-9). We used gene expression measurements in the Global Cancer Map (GCM) . The GCM dataset contains 190 samples spanning 18 cancers (adenocarcinomas of the breast, colon, lung, ovary, pancreas, prostate, and uterus; follicular and large B-cell lymphomas; melanoma; bladder; acute lymphoblastic leukemias of the B cell and T cell; acute myeloid leukemia; renal carcinoma; mesothelioma; and glioblastoma and medulloblastoma, which are two cancers of the central nervous system) and 90 samples from 13 normal tissues (bladder, breast, cerebellum, colon, germinal center, lung, kidney, ovary, pancreas, peripheral blood, prostate, uterus, and whole brain). We compared the samples for each cancer in the dataset to the samples from the corresponding normal tissue (e.g., prostate cancer and normal prostate) using the t test. We applied our algorithm to 360 cancer-pathway pairs (18 cancers times 20 pathways). Note that if we performed k iterations of permutation testing followed by Benjamini-Hochberg FDR correction, the smallest p-value we would obtain would be 360/k.
Fraction of interactions in a pathway that occur in any significant most-perturbed sub-pathway
#cancers perturbed in
Alpha6 Beta4 Integrin
B Cell Receptor
T Cell Receptor
Note that the set of interactions I' does not play a role in the definition of z AM (Q). ActiveModules utilizes the interactions in the network during the search for a subgraph Q with the highest value of z AM (Q), by ensuring that Q is connected. Thus, interactions play an indirect role in this approach. In contrast, our approach directly incorporates interaction structure into the scoring function.
Our approach explicitly uses the interaction structure of pathways to calculate their perturbation. To assess the advantages of this approach, we compared our method to the gene-oriented method GSEA . GSEA compares two phenotypes of interest by sorting all the genes based on the difference in their expression profiles in the two phenotypes, e.g., by using the t statistic. Given a gene set of interest, GSEA uses a modified Kolmogorov-Smirnov statistic to test whether the genes in the gene set are ranked toward the top or the bottom of the sorted list. GSEA measures the statistical significance of an observed score by repeatedly permuting the phenotype labels of the samples.
We converted each Netpath pathway into the set of genes that are members of the pathway. We tested each of the 360 pathway-cancer using GSEA, ranking genes by the t statistic and generating 100,000 random permutations to assess the statistical significance of the computed scores. GSEA identified no Netpath gene set as significant in any cancer, even with an FDR-adjusted p-value less than 0.1. We had observed that perturbed pathways computed by our method may contain both up- and down-regulated genes. We reasoned that GSEA may not detect corresponding gene sets as significantly differentially expressed since these gene sets contain both genes with low ranks (large positive t statistics) and with high ranks (large negative t statistics). Therefore, we repeated the analysis using GSEA's option to rank genes by the absolute value of the t statistic. Even with this option, GSEA identified no pathway-cancer pairs as significant, even at the 0.1 level.
GSEA uses the null hypothesis that the distribution of the perturbation of the genes in a particular gene set is the same as the distribution of the rest of the genes measured in the transcriptional data set. Our approach uses the null hypothesis that the distribution of the perturbation of the genes in a particular pathway P is the same as the distribution of an equal number of randomly-selected genes, where the interactions between the randomly-selected genes are isomorphic to the interactions in P. To test the possibility that the stricter null hypothesis of GSEA prevents it from finding significant perturbations detected by our method, we used our results to construct a new gene set for each cancer. Each new gene set was composed of only those genes that participate in at least one of the most-perturbed sub-pathways in that cancer as determined by our method. We applied GSEA to these new gene sets, ranking genes by the absolute value of the t statistic. For 13 out of the 18 cancers, GSEA found that the combined gene set constructed based on our results was more significant than the gene set for any individual pathway. Yet, only two of these combined gene sets had an FDR-corrected p-value less than 0.1. From this comparison with GSEA, we conclude that incorporating interaction structure is an important aspect of determining pathway perturbation.
We evaluated the robustness of our approach to missing data. The GCM data contains multiple samples for each cancer. For each pathway-cancer pair, we removed each sample for that cancer from the input and re-computed the most perturbed sub-pathway and its statistical significance. This process was computationally intensive since we had to compute the statistical significance for each pathway-cancer about 15 times (depending on the number of samples in each cancer). Therefore, we ran 36,000 iterations of permutation testing, yielding p-values at least as large as 0.01. For each pathway-cancer pair, we counted how many leave-one-out datasets yielded results that were similar to results obtained with the complete dataset. Specifically, if the pair was statistically significant, i.e., had a p-value at most 0.05 in the full dataset, we counted the fraction of leave-one-out datasets for which the most-perturbed sub-pathway also had a p-value at most 0.05. Conversely, for pathway-cancer pairs that were not statistically significant, i.e., had a p-value greater than 0.05, we counted the fraction of leave-one-out datasets for which the most perturbed sub-pathway also had a p-value greater than 0.05. We expected all these fractions to be close to 1, i.e., the significance for the full dataset would hold in the leave-one-out dataset as well.
Both of these pathways have down-regulated expression in multiple tumor types [35, 36]. The TNF-alpha pathway is perturbed in association with CNS, melanoma, and bladder tumors, among others. TNF-alpha is down regulated in tumors like melanoma . Studies have observed the cytotoxic effects of TNF-alpha on medulloblastoma . Other work has found that TNF-alpha is an important factor in breast cancer promotion and survival . TNF is used for localized treatment of metastatic melanomas and other irresectable tumors . Recombinant TNF has been effectively used to treat bladder tumors in vivo . The interleukin 2 (IL2) pathway is another pathway that we find perturbed by many cancers. The IL2 pathway is an immune signaling pathway that is commonly down regulated in tumors like T-cell lymphoma . Like TNF-alpha, IL2 is also added exogenously to treat multiple cancer types including metastatic melanoma  and superficial bladder tumors . These treatments were found to work in breast cancer cell lines that express the interleukin 2 receptor on the cell surface .
A comprehensive understanding of pathway perturbations has important implications in disease treatment. As noted above, exogenous treatment with recombinant TNF and IL2 have had success in mitigating tumor progression in a number of diseases [37, 39, 40, 42]. The success of these treatments illustrates that reversing pathway perturbation to a pre-cancerous state can help to restore the healthy phenotype. Therefore, it is important to characterize both the extent and direction of pathway perturbation across diseases.
Sub-GSE  is another gene-set oriented method that has been reported to be more sensitive than GSEA. Therefore, we ran Sub-GSE on the GCM and Netpath data and compared the results to our perturbed pathways-cancer pairs. We ran Sub-GSE with 10,000 iterations (for the permutation test Sub-GSE uses to compute significance). The Sub-GSE software gave memory allocation errors for approximately 20,000 or more iterations. Since we could not run Sub-GSE for larger numbers of iterations, we did not correct the p-values yielded by Sub-GSE for multiple hypotheses testing.
We found that both Sub-GSE and our method identified many common pathway-cancer associations (Figure 4). However, Sub-GSE failed to identify any cancer associations for the ID or the alpha 6 beta 4 integrin signaling pathways. These pathways are known to be perturbed in multiple tumor types. The ID signaling pathway has been associated with carcinogenesis by supporting tumor cell migration and invasion . Although the ID pathway is mostly dormant after embryogenesis, the pathway is reactivated during tumor progression . Upregulation of the alpha 6 beta 4 integrin pathway has been associated with metastatic potential in many cancers . In tumor microenvironments, alpha 6 beta 4 is re-localized to the leading edge of tumor cells and promotes invasion . While Sub-GSE has superior sensitivity to GSEA, Sub-GSE was not sensitive enough to identify these important associations.
However, our approach was not able to identify significant associations with the Wnt or Androgen receptor pathways that were detected by Sub-GSE. Both pathways are known to be associated with multiple cancer types. Disregulation of the Wnt signaling pathway leads to upregulated expression of B-catenin, which ultimately results in increased proliferation of tumor cells . Sub-GSE is able to identify numerous cancers in which the Androgen receptor pathway is perturbed. However, neither Sub-GSE nor our method was detect the well-known and widely studied association between this pathway and prostate cancer . We note that the increased sensitivity of Sub-GSE in the case of these two pathways may arise from the fact that we did not adjust for testing multiple hypotheses in the case of Sub-GSE.
Our results indicate that integrating differential gene expression with the interaction structure in a pathway is a powerful approach for detecting links between a cancer and the pathways perturbed in it. The use of Stouffer's z-score to combine multiple p-values provides an important advantage over methods that consider pathway membership alone: in many perturbed pathways, we noticed that the receptor protein at the head of the pathway was very slightly differentially expressed, often not to a statistically significant extent, whereas many genes with products downstream of the receptor were differentially expressed (data not shown). Our use of meta analysis to combine p-values enabled detection of the perturbation of the pathway even in such cases.
There are several avenues for future work. Our method currently ignores the direction of differential expression of each gene. Incorporating this information is important. It is also essential to take into account prior information on whether any interactions are regulatory and on the type of regulation implied by an interaction . Such information may yield pathways with both directed and undirected interactions. Finally, it would be interesting to use universal protein interaction networks in order to expand curated pathways.
This work was supported by a grant from the Institute for Critical Technology and Applied Science at the Virginia Polytechnic Institute and State University to TMM and NSF grant DBI-0211863 to BT.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 3, 2012: ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/13/S3.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.