- Methodology article
- Open Access
Identification of differentially expressed subnetworks based on multivariate ANOVA
BMC Bioinformatics volume 10, Article number: 128 (2009)
Since high-throughput protein-protein interaction (PPI) data has recently become available for humans, there has been a growing interest in combining PPI data with other genome-wide data. In particular, the identification of phenotype-related PPI subnetworks using gene expression data has been of great concern. Successful integration for the identification of significant subnetworks requires the use of a search algorithm with a proper scoring method. Here we propose a multivariate analysis of variance (MANOVA)-based scoring method with a greedy search for identifying differentially expressed PPI subnetworks.
Given the MANOVA-based scoring method, we performed a greedy search to identify the subnetworks with the maximum scores in the PPI network. Our approach was successfully applied to human microarray datasets. Each identified subnetwork was annotated with the Gene Ontology (GO) term, resulting in the phenotype-related functional pathway or complex. We also compared these results with those of other scoring methods such as t statistic- and mutual information-based scoring methods. The MANOVA-based method produced subnetworks with a larger number of proteins than the other methods. Furthermore, the subnetworks identified by the MANOVA-based method tended to consist of highly correlated proteins.
This article proposes a MANOVA-based scoring method to combine PPI data with expression data using a greedy search. This method is recommended for the highly sensitive detection of large subnetworks.
Since the advent of microarray technology, genome-wide expression analysis has become an important tool in biology . Much of the initial research with expression data has focused on evaluating the significance of individual genes in a comparison between two groups of samples to identify differentially expressed genes (DEGs). Various statistical approaches have been proposed in the literature [2–4]. However, the main difficulty lies not in the identification of DEGs but in their interpretation . The interpretation of a long list of DEGs can be a daunting and ad hoc endeavor, since it is dependent on the biologist's area of expertise . A compromise can be reached by constructing sets of individual genes (hereafter referred to as "gene sets") from prior biological data. For example, gene sets from a public database such as Gene Ontology (GO) and KEGG would provide functional interpretations of biological modules. In addition, by extending the level of analysis from an individual gene to multiple genes, we can identify genes with small changes that are not identified at the single-gene level analysis . Furthermore, this multiple-gene approach is reasonable from a biological perspective because many genes function in concert rather than alone . The hurdle, however, is that the majority of human genes have not yet been assigned to a definite pathway or complex. As protein-protein interaction (PPI) data become available, integrating PPI data with genome-wide expression data will provide new opportunities to at least partially address this challenge . The combined analysis of expression data and PPI data allows us to detect unknown gene sets–those not obtainable from a public database . Therefore, researchers could utilize this approach to formulate a novel hypothesis about a pathway.
The integration of PPI data with gene expression data requires an adequate scoring method to measure the discriminative potential of a given subnetwork or a connected subgraph. Several scoring methods have previously been proposed: combining multiple p-values , mutual information , and edge scoring [12, 13]. In addition, it is possible to utilize protein structure information in the scoring method. For example, the fact that no two partners can interact with a given binding site at the same time provides data that is complementary to the gene expression data. In this way, the prediction of protein-protein interactions by utilizing the information of protein-protein interfaces in public databases provides additional information for proper scoring [14–17]. In addition to a proper scoring method, an effective search algorithm is required to find the subnetworks with the maximum scores. Various search algorithms have been suggested. For example, simulated annealing , greedy search [8, 9], and exact search  have been proposed as search algorithms. In particular, Chuang et al.  used a greedy search with mutual information-based scoring method to classify breast cancer metastasis. They successfully identified the subnetworks related to breast cancer metastasis and showed the usefulness of integrated analysis for identifying a disease marker.
This paper focuses on a scoring method adopting when using a greedy search. It has been reported that while the correlation among the genes in a subnetwork is not conspicuous, it is important in assessing statistical significance . The multivariate analysis of variance (MANOVA) model is a well-known statistical model used to determine whether significant mean differences exist among groups. One advantage of MANOVA is that the correlation structure is taken into consideration. Motivated by this characteristic feature of MANOVA, we propose a MANOVA-based scoring method for identifying differentially expressed PPI subnetworks. We also suggest several criteria appropriate for the comparison of scoring methods in the context of a greedy search algorithm.
Biological significance of identified subnetworks
Fibroblast serum response
A greedy search with our MANOVA-based scoring function identified 86 significant subnetworks with p < 0.05. We performed the GO term enrichment analysis to investigate how well the identified subnetworks represented the functional modules. Under the assumption of a hyper-geometric distribution, we calculated the significance level for the number of proteins in an identified subnetwork that were included in each GO term in the category of "biological process" and selected most significant GO terms for functional annotations of the identified subnetwork. The ten most significant subnetworks are listed in Table 1. As shown in Table 1, the wound healing-relevant biological processes in GO, such as "cell-cell adhesion" (GO: 0016337), "fibroblast growth factor receptor signaling pathway" (GO: 0008543), and "blood coagulation, extrinsic pathway" (GO: 0007598), were enriched in our subnetworks. These are known as major biological processes related to wound healing.
Prostate cancer progression
In this example, 123 significant subnetworks were identified by our method. The ten most significant subnetworks are shown in Table 2. The "BMP signaling pathway" (GO: 0030509) and "epithelial to mesenchymal transition" (GO: 0001837) were observed. These processes have been reported to be essential to prostate cancer progression, showing how our results are useful [18, 19].
Comparison with other methods
Real data sets
We compared our MANOVA-based scoring method with two other methods. Nacu et al.  suggested a scoring method based on averaging gene expression levels (hereafter referred to as the TΣ-based scoring method). And Chuang et al. proposed the mutual information-based scoring method  (hereafter referred to as the MI-based scoring method). We evaluated the performance of these three methods and compared the number of significant subnetworks, the size of each significant subnetwork, and the percentage of proteins with a higher correlation coefficient for a given seed protein in the subnetworks. In the case of the prostate cancer metastasis data, we used just two phenotypes–benign epithelium and primary prostate cancer–because the TΣ – based scoring method can only handle data consisting of two phenotypes. First, the number of significant subnetworks was determined, as shown in Table 3. The MANOVA-based scoring method produced the smallest number of significant subnetworks among the three methods.
Next, we examined the sizes of the significant subnetworks commonly identified by all three methods. Figure 1 shows the distribution of number of proteins in the commonly identified subnetworks. The MANOVA-based scoring method usually identified subnetworks with a larger number of nodes (genes) than the other methods. In summary, the MANOVA-based scoring method tended to yield a smaller number of significant subnetworks with larger numbers of proteins than the other scoring methods.
Finally, we investigated the distribution of the correlation coefficients within the subnetworks identified as being significant by all three scoring methods to determine the strength of correlation among the proteins in the subnetworks. The correlation coefficient for every protein in the sub-networks with the seed protein was calculated. Figure 2 shows the distribution of the absolute values of the correlation coefficients for the subnetworks identified by each scoring method. The MI-based scoring method and the MANOVA-based scoring method had higher correlations than the TΣ – scoring method. We then calculated the percentages of the correlation coefficients higher than various thresholds. As shown by Tables 4 and Table 5, the MANOVA-based scoring method tended to have higher percentages than the other methods. This suggests that the MANOVA-based scoring method tends to construct subnetworks containing relatively larger numbers of highly co-regulated genes.
We performed a series of simulation studies to evaluate the three scoring methods. Specifically, we focused on the ability of each method to identify previously assumed target subnetworks. The simulation study was performed for the PPI network in Figure 3a, which has proteins within a depth of two from a given seed protein. For this PPI network, five types of network structures were assumed to be true target networks (Figure 3b). Expression values with correlation coefficient ρ were assigned to the nodes of the target networks, while expression values with correlation coefficient ρ' were assigned to the other remaining nodes. The expression values with correlation coefficient ρ and ρ' were generated by the following procedure.
For the kthsubnetwork, assume that there are p k nodes (genes) among which t k nodes (genes) belong to a given target network. For the ithgene in the phenotype group g, let x ig (i = 1, ⋯, t k ; g = 1, ⋯, G) epresent the target node and y ig (i = 1, ⋯, p k ; g = 1, ⋯, G) the nontarget node. Then, the expression values are generated to obtain a correlation coefficient of ρ for every pair in the target nodes and ρ' for every pair in the nontarget nodes by the following formulas:
where the Z s are normally distributed random variables and the c and c' are constants satisfying
For given values of ρ and ρ', we calculated c and c' by setting c0 and to be 0.001. Finally, we added the constant value α to the target nodes to make an average difference between the target nodes and nontarget nodes. α was assigned a value of 0.01. A greedy search was performed for each scoring method. Permutation tests were omitted in the simulation study because the expression values were generated from the normal distribution. If the identified subnetwork included all of the nodes of the target subnetworks, we considered it to be a successful identification. This procedure was repeated 100 times and the true positive (TP) rate was estimated as the number of successful identifications divided by 100. In order to estimate the false positive (FP) rate, every node (gene) in a subnetwork, regardless of whether it was a target or nontarget node, was assigned expression values of the same correlation coefficient value (ρ = ρ' in the above formulas).
We performed the simulation study with two cases of ρ and ρ' (0.8, 0.4; 0.8, 0.1) as shown in Table 6 and Table 7. The MANOVA-based scoring method had the highest sensitivity (TP rate) in both cases. In the case of specificity (true negative rate), the MI-based scoring method had the highest specificity.
This article proposes a MANOVA-based scoring method to identify subnetworks in a PPI network. Our method was successfully applied to two human microarray data sets, resulting in biologically significant PPI subnetworks. We also compared the results with those of two other scoring methods using a greedy search algorithm. The characteristics of these scoring methods when using a greedy search are not well known, because applying the different scoring methods to a greedy search algorithm usually generates subnetworks with different proteins and sizes. To our knowledge, our research is the first comparison study to evaluate the performance of various scoring methods in the context of a greedy search. We believe that our results provide a guide to develop and expedite the integration of PPI data with other biological data using a greedy search.
In an individual gene-level analysis [1–4], a list of DEGs is identified and examined for the enrichment of gene sets or subnetworks. In this case, DEGs are identified under the assumption of independence between genes. However, as several studies have pointed out, the correlation structure in a subnetwork is not a trivial problem in assessing the statistical significance of subnetworks [5, 8]. MANOVA takes into account the correlation between multiple dependent variables when testing whether significant differences exist among groups. Our empirical studies showed this consistently. The MANOVA-based scoring method provides the smallest number of significant subnetworks. However, it tends to provide larger subnetworks with higher correlated proteins than the TΣ – based scoring method, which does not consider the correlation among genes. In addition, the simulation study showed that the MANOVA-based scoring method has a high sensitivity for the identification of true target-networks consisting of higher correlated proteins, though at the cost of low specificity. These characteristics could be emphasized further considering the report that hub proteins have a tendency to exhibit low expression values even though they are more essential . Our method could detect significant subnetworks that have essential proteins with moderate expression levels around hub proteins by reflecting the correlation in the construction of the subnetworks, providing more opportunities to identify novel pathways or complexes around hub proteins than other methods.
Furthermore, the MANOVA-based scoring method can deal with the data of any number of groups. Our results show the successful identification of the subnetworks differentiating three groups related to prostate cancer metastasis, as well as two groups related to the wound response of a fibroblast culture. Since it is common for researchers to compare more than two experimental groups defined by multiple experimental factors, our method could provide a useful tool for identifying subnetworks under multiple conditions.
However, there are some assumptions when using the MANOVA approach . First, it assumes a normal distribution of the dependent variables. However, we adopted a permutation test to derive p-values. Thus, the violation of this normal assumption is not critical to our method. Second, the correlation among the dependent variables is linear. Note that linear correlation has been successfully applied to the analysis of gene expression data in many previous studies [21, 22]. Therefore, the assumption of linear correlation in MANOVA is biologically acceptable. Third, the variances and covariances of the dependent variables should be homogenous across the phenotype groups. Several tests may be applied to check the validity of this assumption. When this homogenous assumption is not satisfied, the transformation of the dependent variables may be recommended.
Our method adopted the greedy search for identifying PPI subnetworks. In contrast to approaches using a predefined gene set [5–7], greedy search has the potential to identify novel subnetworks. However, one shortcoming is the difficulty in creating a biological interpretation of the identified subnetworks. Moreover, the greedy search algorithm is not guaranteed to identify the highest scoring subnetwork because it is fundamentally a heuristic algorithm. To solve this problem, an algorithm for finding optimal-scoring subnetworks has recently been suggested . However, methods of this kind generally involve huge computational complexity.
Finally, it is worth noting that other types of additional information could be used to improve our method. For example, managing multiple microarray experiments with annotations of the experimental conditions  would produce finer phenotype-related subnetworks. In addition, integrating other factors influencing the protein abundance, such as the miRNA activity and degradation rate of proteins, with the current algorithm could improve the performance of our method, because the gene expression level does not necessarily represent the true protein abundance.
A MANOVA-based scoring method with a greedy search was proposed for combining PPI data with expression data. Our method takes advantage of a characteristic feature of MANOVA: it considers the correlation structure of multiple dependent variables when comparing the mean vectors across different groups. Our method was successfully applied to two human microarray data sets. The results from comparisons with two other scoring methods showed that the MANOVA-based scoring method tends to yield a smaller number of significant subnetworks with larger numbers of highly coregulated proteins. It also performed better in terms of sensitivity than other methods in simulation studies. Therefore, the MANOVA-based scoring method could provide more opportunities to identify novel pathways or complexes, including unknown genes or proteins related to the phenotypes of microarray experiments.
Following the idea of Ideker et al. , we matched the expression values of each gene with its corresponding protein in the PPI network. We then searched for the subnetworks with the maximum MANOVA-based scores. The significance of the subnetworks was determined by a permutation test.
We tested our method using two real expression data sets. One was related to the serum effect on fibroblast cultures, and the other involved the progression of prostate cancer in 44 individuals.
Fibroblast serum response
Chang et al.  studied the relationship between tumor growth and wound recovery using a microarray experiment. In vivo cells encounter serum, the soluble fraction of coagulated blood, when they are injured. Serum promotes the biological processes involved in wound healing by fibroblasts. To characterize wound response, Chang et al.  investigated the effect of serum on the expression profiles of fibroblast cultures. The expression data were obtained from Nacu et al. .
Prostate cancer progression
This data set was a collection of cDNA microarray expression measurements from 22 samples of benign epithelium, 32 samples of primary prostate cancer, and 17 samples of metastatic prostate cancer. The progression was benign to prostate cancer (PCA) to metastasis . The data were obtained from NCBI GEO , and the accession number was GSE6099.
Protein-protein interaction data
We exploited the interaction dataset used in Nacu et al.  to test our scoring method and compare it with other scoring methods. To obtain human PPI network data, Nacu et al.  used data from EntrezGene (December 2005) and 33 human pathways in KEGG (March 2006). After removing loops (nodes connected to themselves), a graph with 7180 nodes and 27,082 edges was obtained.
A subnetwork is defined as a gene set that induces a single connected component in the PPI network . Given a set of genes, we need to compute a score that measures how much the set is differentially expressed. Nacu et al.  suggested a scoring method based on averaging gene expression levels (hereafter referred to as "TΣ"). TΣ first sums the expression levels for the genes in the subnetwork and then computes the t statistic. Chuang et al.  proposed a mutual information-based scoring method (hereafter referred to as "MI") between the activity vector over the samples and the phenotype. The activity vector of a given subnetwork is defined by averaging their gene-wise normalized expression values.
We propose using Wilks' Λ statistic as a MANOVA-based score for a given subnetwork. MANOVA is an extension of analysis of variance (ANOVA) that covers cases where there is more than one dependent variable. In our case, we treat each gene of a subnetwork as a dependent variable. Suppose there are K subnetworks and G phenotype groups. For the kthsubnetwork, assume that there are p k nodes (genes). For the ithgene in group g, let be the expected value of its expression value. Then the following hypotheses are of interest:
MANOVA considers the correlation structure of multiple dependent variables when it is used to compare the mean vectors across different groups. Because subnetworks representing complexes or pathways usually consist of multiple coexpressed genes, we propose using the MANOVA model to consider these genes simultaneously.
Given the scoring functions, a greedy search is performed to identify subnetworks within the PPI network. In the beginning, each candidate subnetwork has a single seed protein. In this study, proteins with more than five interactions were chosen as seed proteins. To expand the subnetwork from a seed protein, we first construct every possible subnetwork consisting of the seed and each of its neighboring proteins. After completing the score calculation for all of the possible subnetworks, we choose the neighboring proteins in the subnetworks with the maximum scores and include them as members of expanded subnetworks. This process is iterated until the termination conditions are met. Three termination criteria are used. First, the search stops when no addition of neighbor proteins increases the score over a specified relative improvement rate r, which is defined as the difference between the previous and current scores divided by the previous score. Second, the distance from the seed is adopted as another criterion. That is, only proteins within a specified distance d from the seed are added to an expanded subnetwork. Finally, the maximum possible number of proteins in a subnetwork is also used as a termination criterion to avoid the singular matrix conversion in the process of calculating Wilks' Λ statistic from MANOVA. When one of the three criteria is satisfied, the iteration stops. Given that the median distance between any two proteins in the human PPI network is five, we set d to be 2 to provide a sufficient number of neighbors to keep the search local. A value of 0.05 is chosen for the parameter r . The maximum number of nodes (genes) in a subnetwork is set to one less than the number of samples in the smallest group.
Computing significance levels
We proposed a permutation-based test to derive the significance levels. The permutation-based approach provided a reasonable estimate for the null distribution of the subnetwork scores, and allowed us to compute p-values. The iteration number for this permutation test was set to 1000.
Pavlidis P, Li Q, Noble WS: The effect of replication on gene expression microarray experiments. Bioinformatics 2003, 19(13):1620–1627. 10.1093/bioinformatics/btg227
Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW: On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol 2001, 8(1):37–52. 10.1089/106652701300099074
Park T, Yi SG, Lee S, Lee SY, Yoo DH, Ahn JI, Lee YS: Statistical tests for identifying differentially expressed genes in time-course microarray experiments. Bioinformatics 2003, 19(6):694–703. 10.1093/bioinformatics/btg068
Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 2001, 98(9):5116–5121. 10.1073/pnas.091062498
Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci USA 2005, 102(38):13544–13549. 10.1073/pnas.0506577102
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al.: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005, 102(43):15545–15550. 10.1073/pnas.0506580102
Kim SB, Yang S, Kim SK, Kim SC, Woo HG, Volsky DJ, Kim SY, Chu IS: GAzer: gene set analyzer. Bioinformatics 2007, 23(13):1697–1699. 10.1093/bioinformatics/btm144
Nacu S, Critchley-Thorne R, Lee P, Holmes S: Gene expression network analysis and applications to immunology. Bioinformatics 2007, 23(7):850–858. 10.1093/bioinformatics/btm019
Chuang HY, Lee E, Liu YT, Lee D, Ideker T: Network-based classification of breast cancer metastasis. Mol Syst Biol 2007, 3: 140. 10.1038/msb4100180
Dittrich MT, Klau GW, Rosenwald A, Dandekar T, Muller T: Identifying functional modules in protein-protein interaction networks: an integrated exact approach. Bioinformatics 2008, 24(13):i223–231. 10.1093/bioinformatics/btn161
Ideker T, Ozier O, Schwikowski B, Siegel AF: Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 2002, 18(Suppl 1):S233–240.
Cabusora L, Sutton E, Fulmer A, Forst CV: Differential network expression during drug and stress response. Bioinformatics 2005, 21(12):2898–2905. 10.1093/bioinformatics/bti440
Guo Z, Li Y, Gong X, Yao C, Ma W, Wang D, Zhu J, Zhang M, Yang D, Wang J: Edge-based scoring and searching method for identifying condition-responsive protein-protein interaction sub-network. Bioinformatics 2007, 23(16):2121–2128. 10.1093/bioinformatics/btm294
Gursoy A, Keskin O, Nussinov R: Topological properties of protein interaction networks from a structural perspective. Biochem Soc Trans 2008, 36(Pt 6):1398–1403. 10.1042/BST0361398
Keskin O, Nussinov R: Similar binding sites and different partners: implications to shared proteins in cellular pathways. Structure 2007, 15(3):341–354. 10.1016/j.str.2007.01.007
Keskin O, Ma B, Rogale K, Gunasekaran K, Nussinov R: Protein-protein interactions: organization, cooperativity and mapping in a bottom-up Systems Biology approach. Phys Biol 2005, 2(2):S24–35. 10.1088/1478-3975/2/2/S03
Keskin O, Gursoy A, Ma B, Nussinov R: Towards drugs targeting multiple proteins in a systems biology approach. Curr Top Med Chem 2007, 7(10):943–951. 10.2174/156802607780906690
Dai J, Hall CL, Escara-Wilke J, Mizokami A, Keller JM, Keller ET: Prostate cancer induces bone metastasis through Wnt-induced bone morphogenetic protein-dependent and independent mechanisms. Cancer Res 2008, 68(14):5785–5794. 10.1158/0008-5472.CAN-07-6541
Tse JC, Kalluri R: Mechanisms of metastasis: epithelial-to-mesenchymal transition and contribution of tumor microenvironment. J Cell Biochem 2007, 101(4):816–829. 10.1002/jcb.21215
Lu X, Jain VV, Finn PW, Perkins DL: Hubs in biological interaction networks exhibit low changes in expression in experimental asthma. Mol Syst Biol 2007, 3: 98. 10.1038/msb4100138
Ucar D, Neuhaus I, Ross-MacDonald P, Tilford C, Parthasarathy S, Siemers N, Ji RR: Construction of a reference gene association network from multiple profiling data: application to data analysis. Bioinformatics 2007, 23(20):2716–2724. 10.1093/bioinformatics/btm423
Han JD, Bertin N, Hao T, Goldberg DS, Berriz GF, Zhang LV, Dupuy D, Walhout AJ, Cusick ME, Roth FP, et al.: Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature 2004, 430(6995):88–93. 10.1038/nature02555
Fellenberg K, Busold CH, Witt O, Bauer A, Beckmann B, Hauser NC, Frohme M, Winter S, Dippon J, Hoheisel JD: Systematic interpretation of microarray data using experiment annotations. BMC Genomics 2006, 7: 319. 10.1186/1471-2164-7-319
Chang HY, Sneddon JB, Alizadeh AA, Sood R, West RB, Montgomery K, Chi JT, Rijn M, Botstein D, Brown PO: Gene expression signature of fibroblast serum response predicts human cancer progression: similarities between tumors and wounds. PLoS Biol 2004, 2(2):E7. 10.1371/journal.pbio.0020007
Edelman EJ, Guinney J, Chi JT, Febbo PG, Mukherjee S: Modeling cancer progression via pathway dependencies. PLoS Comput Biol 2008, 4(2):e28. 10.1371/journal.pcbi.0040028
Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, Rudnev D, Lash AE, Fujibuchi W, Edgar R: NCBI GEO: mining millions of expression profiles–database and tools. Nucleic Acids Res 2005, (33 Database):D562–566.
The authors would like to thank two anonymous reviewers for their helpful comments which improved the manuscript greatly. This work was supported by the National Research Laboratory Program of the Korea Science and Engineering Foundation (M10500000126) and a fellowship for a Seoul citizen in Science.
The authors declare that they have no competing interests.
All of the authors participated in the design of the methods. TH implemented the algorithm and analyzed the results. TP coordinated the research. All of the authors have read and approved the manuscript.
About this article
Cite this article
Hwang, T., Park, T. Identification of differentially expressed subnetworks based on multivariate ANOVA. BMC Bioinformatics 10, 128 (2009). https://doi.org/10.1186/1471-2105-10-128
- Gene Ontology
- Seed Protein
- Target Node
- Greedy Search
- Primary Prostate Cancer