Reconstructing networks of pathways via significance analysis of their intersections

Background Significance analysis at single gene level may suffer from the limited number of samples and experimental noise that can severely limit the power of the chosen statistical test. This problem is typically approached by applying post hoc corrections to control the false discovery rate, without taking into account prior biological knowledge. Pathway or gene ontology analysis can provide an alternative way to relax the significance threshold applied to single genes and may lead to a better biological interpretation. Results Here we propose a new analysis method based on the study of networks of pathways. These networks are reconstructed considering both the significance of single pathways (network nodes) and the intersection between them (links). We apply this method for the reconstruction of networks of pathways to two gene expression datasets: the first one obtained from a c-Myc rat fibroblast cell line expressing a conditional Myc-estrogen receptor oncoprotein; the second one obtained from the comparison of Acute Myeloid Leukemia and Acute Lymphoblastic Leukemia derived from bone marrow samples. Conclusion Our method extends statistical models that have been recently adopted for the significance analysis of functional groups of genes to infer links between these groups. We show that groups of genes at the interface between different pathways can be considered as relevant even if the pathways they belong to are not significant by themselves.


Background
High-throughput gene expression analysis has become one of the methods of choice in the exploratory phase of cellular molecular biology and medical research studies. Although microarray technology has improved measurement accuracy, and new statistical algorithms for better signal estimation have been developed [1][2][3], reproducibility remains an issue [4]. A way to overcome this difficulty is to move the analysis from the gene level to a higher level where genes are grouped into functional categories. This approach has been shown to be more robust and reproducible [5,6], and leads to an easier biological interpretation of the experimental observations. Gene Ontology (GO) [7] and pathways are the two main gene-grouping schemes in use. GO organizes genes according to a hierarchy of terms divided into three categories: "cellular component", "biological process", and "molecular function". Genes appear in more than one level in each of the three categories, but no relation between genes is described (apart from them being in the same group). KEGG [8] is one of the most popular pathway databases; it groups genes into pathways of interacting genes and substrates, and contains specific links between genes and substrates that interact directly. Both databases are manually curated but incomplete. The GO database is also redundant, as it contains several terms in the top of the hierarchy that are too broad in their meaning and include thousands of genes. KEGG provides a more detailed organization of the genes but contains information on fewer genes than GO. Different approaches have been proposed to identify significant gene groups based on lists of differentially expressed genes. Several methods have been implemented that can be directly applied to existing gene-grouping schemes. GOstat [9] compares the occurrences of each GO term in a given list of genes (tested group) with its occurrence in a reference group (typically all the genes on the array) assigning a p value to each term. In the context of pathway analysis, a similar approach is used by Pathway Miner [10], which ranks pathways by p values obtained via a one-sided Fisher exact test. Other methods allow investigators the possibility to define their own genegrouping schemes. For example, Global Test package [11] applies a generalized linear model to determine if a userdefined group of genes is significantly related to a clinical outcome. With the Gene Set Enrichment Analysis (GSEA) [12,5] an investigator can test if the members of a gene set tend to occur towards the top or the bottom of a ranked gene list obtained from the differential expression analysis, and therefore are correlated with the phenotypic class distinction.
In this paper, we extend the significance analysis of gene pathways to higher order structures, i.e. networks of pathways whose intersections contain a significant number of differentially expressed genes. Network structure can reveal the degree of coordination of different biological functions as a consequence of the treatment, as well as the presence of "focal areas" in which groups of genes play central roles. We show examples in which some biological functions (related to specific pathways) are biologically relevant for the studied process, due to their position inside the pathway network. This analysis can be extended to groups of genes at the "interface" between pathways, whose imbalance can affect more than one biological function.
Our approach is aimed at understanding how external perturbations, such as gene activation or tumor induction, can induce in various types of cells, cell lines or derived tissues, behaviours that can generate, integrate, and respond to dynamic informational cues.
The broad question that we are trying to answer is how a cell converts perturbations to signalling activity into a binary decision resulting from the appearance of a given phenotype. Thus the signalling activity has to be diffused within the cell between and within pathways. A signaling pathway is not a rigid unit, but is made of modules with different functions (e.g. the communication with other pathways) that may be captured by selecting those elements belonging to the interface between pathways.

Data set
The first dataset we consider consist of time course gene expression arrays based on reconstituted c-myc -/rat fibroblast cell lines with the conditionally active, tamoxifen-specific c-Myc-estrogen receptor fusion protein. Binding of Tamoxifen to the estrogen receptor domain elicits a conformational change that allows the fusion protein to migrate to the nucleus and to act as a transcription factor. This data set (MYC data set) contains the gene expression data collected after the addition of Tamoxifen. Samples were harvested at five time points after the addition of Tamoxifen to the culture medium: 0, 2, 4, 8, and 16 h. The entire experiment was repeated on three separate occasions, providing three biological replicates for each gene and time point. Expression profiling was done by using the Affymetrix platform and U34A Gene Chips [13].
The second dataset we consider consists of the gene expression measurements described in [14]. This data set contains bone marrow samples obtained from acute leukaemia patients, that can be classified as Acute Lymphob-lastic Leukemia (ALL) and Acute Myeloid Leukemia (AML). The mRNA prepared from bone marrow mononuclear cells was hybridized with Affymetrix Hgu6800 containing probes for 6817 human genes. The experimental design is a comparison between ALL and AML (one factor) on the basis of 6817 probes. The dataset (AML/ALL dataset) contains 72 samples, 47 obtained from ALL patients and 25 obtained from AML patients.

Gene selection and pathway grouping
For the MYC dataset, one-way ANOVA was applied to each of the 8799 probe sets to identify those that significantly changed expression level over time. A p value of 0.05 was chosen as the cut-off significance level. No posthoc correction for multiple testing (i.e. Benjamini-Hockberg, FDR) was applied, since post-hoc validation is provided by pathway analysis: 765 genes resulted significant, 251 of which are annotated in KEGG and belonged to 142 pathways.
The AML/ALL dataset was analysed with a linear model with an empirical Bayes method to shrink gene variances (limma, R package) and 1924 genes were found as significantly differentially expressed between the AML and the ALL groups (p <0.05). Among the differentially expressed genes, 801 genes were annotated in the KEGG database.

Pathway significance and pathway network
In order to reconstruct a network, we need to specify both its nodes and links. From a biological point of view, nodes can be defined as groups of genes (such as pathways or ontologies) coding for proteins/peptides with similar functional properties (e.g. ion channels, kinases, phosphatases, and transcription factors), performing similar tasks or involved in the same biological function. The links between nodes can be drawn in various ways and their definition may also depend on the particular type of experimental design (e.g. temporal correlation or physical interactions of proteins) [15,16].
We choose to define network nodes as groups of genes belonging to the same pathway as described in the KEGG database. To each node we associate a feature corresponding to the state of the pathway, which can be significantly involved (overrepresented), significantly not involved (underrepresented) or not significant in the experimental context [17]. The same classification is used for the links between nodes by analyzing the ratio of significant genes at the intersection between the corresponding pathways.
Significance of nodes and links can be assessed within the framework of 2×2 contingency tables (Table 1)  Given a subset G of the N measured genes with N G genes, α will be differentially expressed while β = N G -α will not.
We compare α and β to the number of differentially expressed genes γ and not differentially expressed genes δ not belonging to G. The statistical significance of the contingency table can be computed in different ways: Fisher exact test, binomial and χ 2 distribution-based tests [17]. We chose to apply the Fisher exact test because the computation of the hypergeometric distribution is straightforward for tables with both small numbers (arising when testing intersections, see below) and large numbers (arising when testing pathways). The Fisher exact test first computes the probability p* of the observed 2×2 table by using the hypergeometric distribution with parameters (S,N G ,N): Differentially expressed Not differentially expressed The p value to reject the null hypothesis (independence of rows and columns in the contingency table) is given by the sum of the probabilities of all the tables with a probability lower than p* and with the same marginal totals, that is: This procedure gives a probability for a two-tailed Fisher test. Distinction between over-or under-representation of the selected group of genes G can then be obtained by comparing the proportion α/ N G of differentially expressed genes in G with the proportion of differentially expressed genes S/N on the array. A group G is considered significant if p ≤ 0.05.
We apply a similar framework to evaluate the significance of all non-empty intersections between two pathways: The only difference is the definition of the total number of genes N, which is taken to be equal to the total number of genes in the two groups. More precisely, for the group intersection significance analysis, the significant genes in the intersection S I and the total number of genes N I in the intersection are compared to the total number of genes found in the union of the two groups and the number of significant ones, N G 1 ∪G 2 and S G 1 ∪G 2 respectively. The reason for this choice is as follows. Suppose we have two groups with 100 genes each and with 50 genes in common. Suppose 60 genes are significant in each group, 30 of which are in the intersection. If 60% call rate is significant for the two sets, it is likely that it will also be for the intersection. However if we take a random subset of 50 genes from 150 genes in the union of the two groups, we can expect on average a 60% call rate. Hence a random subset with the same numbers of genes as the original intersection would be likely to be significant. By using the union of the two groups as a background, we increase the requirement for the intersection to be considered as significant and reduce the above problem. Intersections are considered significant if their p value is lower than 0.05, in which case a link is drawn between the two pathways either red if it is "significantly involved" or blue if it is "significantly not involved", while if p >0.05 no link is drawn.
All the gene groups that we consider are biological pathways defined according to the KEGG annotation, and the mapping between probes and pathways is accomplished by querying the KEGG Database via R software (KEGGSOAP package).
Once the significant links and nodes are established, we perform a meta-analysis on the obtained network structure. The aim of this analysis is twofold: first, the network structure (e.g. the presence of sub-networks, clusters, communities) can reveal important biological features. Second, each network element (node or link) can be ranked not only on the basis of its statistical significance (the p value obtained by the above method) but also considering its centrality in the network. We consider as a centrality measure the Betweenness Centrality (BC) for each vertex, a parameter that characterizes the degree of "trafficking" through a network element [18]. For a given vertex, BC is proportional to the sum of the shortest paths passing through it: For a graph G with n nodes, the Betweenness Centrality Where σ st is the number of shortest paths from node s to t, and σ st (v) is the number of shortest paths from s to t that pass through a node v.

Results
The list of pathways obtained from the Fisher test shows the most significant overrepresented and underrepresented pathways in the MYC (Table 2) and AML/ALL datasets (Table 3). In the MYC dataset three pathways are significantly underrepresented: neuroactive ligand receptor interaction, Cytokine-Cytokine receptor interaction, and Jak-STAT signalling pathway. Among the overrepresented pathways are RNA and DNA polymerase, Cell cycle and some metabolic and bio-synthetic pathways (Pyrimidine, fatty acid and Ether lipid metabolism, folate and Glycan structures Biosynthesis). In the case of AML/ALL dataset two pathways are found to be underrepresented (Ribosome and Neuroactive ligand receptor interaction); among the overrepresented pathway are Cell Cycle, many metabolic pathways (Glycerophospholipid, Galactose, Pyrimidine and Purine metabolism among other) and a signaling pathway (B cell receptor signaling pathway).
,... For the AML/ALL dataset, the pathways network showed in Figure 2 evidences a sub-network connecting signalling and metabolism (Insulin Signalling pathway, Glycolysis/ Gluconeogenesis, Galactose Metabolism, Fructose and  Mannose Metabolism and mTOR Signalling Pathway that shows an underrepresented intersection with Insulin Signalling Pathway). Another interesting sub-network contains some basic metabolic pathways (Aminophosphonate, Tryptophan and Tyrosine metabolism androgen and estrogen metabolism, etc.). The presence of a sub-network involving Calcium and Phosphatidylinositol signalling, Huntington disease, Glioma and olfactory transduction can be noticed.
As far as BC is concerned, the most central pathway for the MYC dataset is Regulation of Actin Cytoskeleton followed by Gap Junction and GnRH signalling pathway (Table 4). For the AML/ALL dataset, the most central pathway is the Androgen and Estrogen metabolism followed by Huntington disease and Insulin signalling pathway ( Table 5).
The existence of important genes belonging to the interface among pathways for the MYC dataset is clearly evidenced in (Figure 3) where the distribution of pathway membership (the number of pathways that a gene belongs to) is showed for all the genes obtained from Rat-tus norvegicus KEGG database, for the genes on the rat Affymetrix U34A Gene Chip and for the significant genes arising from c-Myc activation. All the three histograms show the same heavy-tailed distribution, meaning that the majority of genes belong to few pathways whereas few genes (the hubs) belong to several pathways.
In (Figure 4) the pathway membership distributions for the AML/ALL dataset are shown and similar results are obtained. Histograms from Homo sapiens KEGG database, for the whole Affymetrix Hgu6800 array genes and for the selected genes, are of the heavy-tailed type, highlighting the presence of hub genes.
The bipartite graphs highlight the central role of hub genes emerging from the pathway networks. In the MYC dataset ( Figure 5) the Signalling sub-network is strongly connected and presumably coordinated by a small number of genes such as MAP Kinase III (Mapk3), neuroblastoma ras oncogene (Nras), v-raf-1 murine leukemia viral oncogene homolog 1 (Raf1), platelet derived growth factor receptor alpha polypeptide (Pdgfra) and cell divi- Network of pathways for the AML/ALL dataset. Gray circles indicate not significant pathways, red and blue circles indicate significant pathways that are respectively overrepresented and underrepresented. Red lines indicate pathways interconnections (links) that resulted statistically significant and overrepresented, whereas blue lines indicate significant pathways interconnections that are underrepresented.  (Hrmt1l2, Hrmt1l3) that are involved in histone modification and chromatin remodelling.

Network of pathways for MYC dataset
As far as the AML/ALL dataset ( Figure 6) is concerned, in the sub-network connecting Signalling and Metabolism, especially at the intersection between insulin and mTOR signalling pathways, some crucial genes emerge (PIK3CA, PIK3CB, PIK3R2 and AKT1).
Similarly to the MYC dataset, the sub-network that contains some basic metabolic pathways (Aminophosphonate, Tryptophan and Tyrosine metabolism, androgen and estrogen metabolism, etc.) shows at their intersection genes belonging to the family of protein arginine methyltransferases (PRMT1 that is homologue to the rat counterpart Hrmt112 seen in the MYC dataset, PRMT2 and PRMT5).
It is worth noticing that in the sub-network involving both Calcium and Phosphatidylinositol signalling it is possible to evidence some crucial genes, for example calmodulin and the well known tumor protein p53 (TP53).

Discussion
A global picture of gene expression is greatly enhanced by the use of genes categorization and pathway analysis, but there can be several cases where this approach is not completely satisfactory. It may fail to capture the relationship between the categories and it may discard some important pathways or genes, because it does not take into account their relevance based on their central position. A typical case is that of genes that are at the interface among pathways (as in the case of hubs). With our method we try to overcome these limitations assigning more relevance to the position occupied by a group of genes in a higher level structure (i.e. the pathway network) in addition to their statistical significance alone.
The comparison between the pathways listed in Table 2 and Table 3 with the corresponding networks of Figure 1 and Figure 2 shows how the lists alone can not grasp the complexity of pathway activation induced by gene expression changes, remarkably most of the pathways in these networks are not listed in Table 2 and 3. For the MYC dataset, the biggest sub-network comprises pathways that are involved in signalling processes (MAP Kinases Signalling Pathway, VEGF Signalling Pathway, GnRH Signalling Pathway), and in structural reorganization, communication and connections (Regulation of actin cytoskeleton and Gap junction). The bipartite graph shows how this sub-network is strongly connected and coordinated by a small number of hub genes (Mapk3, Nras, Raf1, Pdgfra, Cdc42). All these genes are well-known proto-oncogenes involved in proliferation, regulation of growth, cell cycle progression control and structural reor- Distribution of gene pathway membership for MYC dataset Figure 3 Distribution of gene pathway membership for MYC dataset. The y axis is the gene frequency and the x axis is the number of pathways that genes belong to. The majority of genes belong to few pathways whereas few genes (hubs) belong to several pathways (scale free distribution). A) Histogram obtained from the KEGG database for the Rattus norvegicus organism. B) Histogram obtained from the genes on rat Affymetrix U34A Gene Chip. 3) Histogram obtained from the significant genes for MYC dataset.
ganization of the cell. They are major c-Myc downstream effectors and they are responsible of the profound effects that c-Myc exerts in cellular physiology: the down-regulation of the connections among cells and the connections between cells and the extracellular matrix, cytoskeleton reorganization, and the induction of cell growth and proliferation [19]. The MYC network evidences also a strongly connected component related to basic metabolism, comprising both biosynthetic and catabolic pathways and a small sub-network related to the synthesis of nucleic acids, which are known to be major targets of c-Myc that upregulates both energy metabolism and biosynthesis needed for growth and proliferation [13,19].
For the AML/ALL dataset, the pathways network showed in Figure 2 evidences a sub-network connecting signalling and metabolism that underlines how the regulation of energy metabolism may play a key role in the discrimination between the two types of leukaemia. It is interesting to note that Insulin and mTOR signalling pathways are known to be involved in AML [20,21], in particular the crucial genes (PIK3CA, PIK3CB, PIK3R2 and AKT1, Fig 6) in the intersection between these two pathways have been recently pointed out as promising novel targets for AML therapy [22].
Among the relevant genes extracted by our method we can notice the well known tumor protein p53 (TP53), involved in a wide variety of cancers, found in the sub-network involving Calcium and Phosphatidylinositol signalling systems.
In both datasets the sub-networks containing basic metabolic pathways (Aminophosphonate, Tryptophan and Tyrosine metabolism, androgen and estrogen metabolism, etc.) show at their intersections genes belonging to the family of protein arginine methyltransferases that are involved in histone modification and chromatin remodelling. They have been recently pointed out to have a major role in lymphoid tumours, leukaemia and more generally in cancer [23,24].
The role of epigenetic modification in cancer induction and differentiation, is gaining several experimental evidences and is giving new perspectives on cellular complex processes: our results can also suggest another mechanism for MYC in promoting oncogenesis through chromatin remodelling.

Conclusion
Our results show that it is possible to combine highthroughput experimental procedures and advanced data processing as a general Systems Biology approach to discover pathway network changes following variation of cellular phenotypes. The use of known pathways, such as those described in the KEGG database, is motivated by the clarity of their biological interpretation, but our method can be applied also to custom defined pathways or to group of genes obtained from other methods. This approach can be further generalized by considering different statistical methods for assessing single gene significance, or the significance of single network modules.
This approach leads to an increased biological insight of the results by adding topological information (Betweenness Centrality) to a list of pathways obtained by significance test. It may improve the comparability of microarray studies, both between different cell types and different perturbations by considering changes in pathway networks instead of single genes.
Moreover, this network-based method highlights the existence of "focal areas" or hub genes that are more likely found in the intersection between pathways. In this way it is possible to reconsider genes on the basis of their central role in the network and not only for their statistical significance. This can be of great importance also considering that the most central genes typically are subjected to very Distribution of gene pathways membership for AML/ALL dataset Figure 4 Distribution of gene pathways membership for AML/ALL dataset. The y axis is the gene frequency and the x axis is the number of pathways, as in Fig 3. A) Histogram obtained from the Homo sapiens KEGG database. B) Histogram obtained from the genes on Affymetrix Hgu6800 array. 3) Histogram obtained from the significant genes from AML/ALL dataset.
small changes that could be hardly detectable by any single gene statistical analysis, but can anyway exert great biological effects due to their central role in pathway interconnections and communication. Recently other authors are developing methods trying to extract from the data genes that can be biologically relevant even if they are not top-ranking in terms of statistical significance [25].
The problem of assessing pathway relevance is issued in a different way by Draghici [26]. In his paper the biological relevance of each pathway is scored both on the basis of a statistical significance test (pathway enrichment analysis) and on other parameters referred to the position of single genes in the pathway. Our method for pathway relevance can be seen as a top-down approach (from a KEGG-based network to single pathways and genes), as much as the method by Draghici is a bottom-up one (from genes to pathways). Bipartite graph of pathways and genes for the MYC dataset Figure 5 Bipartite graph of pathways and genes for the MYC dataset. Pathways are represented by squares and genes by circles. The isolated pathways have been removed so that the graph contains only the connected components of Figure 1, evidencing only the significant genes of the significant intersections. Red and blue squares indicate respectively significantly overrepresented and underrepresented pathways whereas the gray indicates not significant pathways. The green tone indicates the degree of pathways membership: from light green (connected to few pathways) to dark green (the hubs).