- Methodology article
- Open Access
Graph-based iterative Group Analysis enhances microarray interpretation
BMC Bioinformaticsvolume 5, Article number: 100 (2004)
One of the most time-consuming tasks after performing a gene expression experiment is the biological interpretation of the results by identifying physiologically important associations between the differentially expressed genes. A large part of the relevant functional evidence can be represented in the form of graphs, e.g. metabolic and signaling pathways, protein interaction maps, shared GeneOntology annotations, or literature co-citation relations. Such graphs are easily constructed from available genome annotation data. The problem of biological interpretation can then be described as identifying the subgraphs showing the most significant patterns of gene expression. We applied a graph-based extension of our iterative Group Analysis (iGA) approach to obtain a statistically rigorous identification of the subgraphs of interest in any evidence graph.
We validated the Graph-based iterative Group Analysis (GiGA) by applying it to the classic yeast diauxic shift experiment of DeRisi et al., using GeneOntology and metabolic network information. GiGA reliably identified and summarized all the biological processes discussed in the original publication. Visualization of the detected subgraphs allowed the convenient exploration of the results. The method also identified several processes that were not presented in the original paper but are of obvious relevance to the yeast starvation response.
GiGA provides a fast and flexible delimitation of the most interesting areas in a microarray experiment, and leads to a considerable speed-up and improvement of the interpretation process.
Microarray experiments can provide a comprehensive picture of gene expression levels in biological samples. In a typical application they compare expression of several thousand genes under two different conditions (e.g. healthy vs. diseased tissue, wild type vs. mutant animals, drug-treated vs. control cells), using a small number of replicate experiments. Various techniques have been developed to rank genes according to their expression changes, e.g. based on the t-statistic  or the strong non-parametric RankProducts . The resulting list of genes can then be restricted to those genes that fulfill a certain statistical criterion, usually an arbitrarily chosen maximum accepted false discovery rate.
The main challenge to the biologist is contained in the next step of the analysis. It consists in identifying the biologically relevant expression changes, the "big picture" of the experiment. As microarray experiments tend to generate unexpected observations in areas outside the specialized expertise of the experimentalist, this can be quite difficult and time-consuming. A principled mechanism to identify the significant higher-level features of the experimental results would therefore be very useful.
The biological interpretation process consists to a large extent of finding evidence connecting certain genes that are differentially expressed. This evidence can consist, e.g., of joint participation in some physiological process, physical interaction at the protein level, reported co-expression in earlier microarray experiments, a shared functional annotation, etc. This kind of evidence can intuitively be represented as a graph, and this feature is regularly used to visualize biological data, in the form of metabolic or signaling pathways or protein interaction maps. The task can then be described as the identification of subgraphs that as a whole show a statistically significant expression change. This would allow the biologist to focus her analysis on the most promising areas, without prior bias, while at the same time presenting the relevant evidence underlying each association for critical evaluation.
Results and Discussion
We have recently developed an approach, iterative Group Analysis (iGA) that identifies significantly changed functional classes of genes in a microarray experiment . In contrast to similar approaches such as [4–8], the iGA method does not require a previous delimitation of a set of "differentially expressed genes", but uses an iterative calculation of p-values to determine the subset of class members that is most likely to be changed. Due to this feature, the iGA method is more sensitive in identifying functional classes that are slightly but consistently regulated, and works well on noisy data with small numbers of replicates, where the delimitation of gene lists can be overly restrictive.
Here we extend this approach to the analysis of "evidence graphs", which offers much larger flexibility of the annotations that can be used and allows substantially improved visualization. Evidence graphs can be represented as bigraphs with two types of nodes, one for genes and one for the associated "evidence" (Fig. 1A). For evaluation purposes we focused on two types of networks, one where the evidence consists of GeneOntology annotations (GO network) and one where the evidence comprises enzyme substrates (metabolic network). The construction of these networks from gene annotation files is fast and simple, but more unconventional networks are also straightforward (e.g. regulatory networks inferred from previous gene expression experiments, or "literature networks" based on co-occurrence of genes in publications). Before the calculation the bigraph is converted to a simple graph, eliminating the evidence nodes and introducing pairwise edges between all nodes that were connected to a common evidence node (Fig. 1B).
In addition to the graph, a complete list of genes sorted by differential expression is provided. All nodes without corresponding expression data are eliminated from the network. In the first step of the analysis, each gene node is assigned its rank in the list of genes, such that the node for the most strongly changed gene is labeled "1" (Fig. 1C), the second most changed gene is labeled "2", etc. Then, local minima are identified in the graph, i.e. those nodes that have a lower rank than all their direct neighbors in the graph (Fig. 1D). In the next step, subgraphs are iteratively extended from each of those local minima by including the neighboring node with the next highest rank (m) and, if present, all adjacent nodes of ranks equal or smaller than m. Hence, at each step of the extension process the newly extended subgraph is not adjacent to any outside node with a rank lower than m (Fig. 1E,1F,1G,1H). At each extension step we thus obtain a subgraph with n members with a maximum rank m and can calculate the p-value for observing all n of n genes at rank m or better in a list of N total genes in the graph, , which follows easily from inserting these values in the cumulative hypergeometric equation, which is also used for iGA .
The extension process is continued until all nodes reachable from the local minimum are included or the subgraph reaches an arbitrary maximum size. After extending the subgraphs, for each local minimum the subgraph yielding the smallest p-value is selected as its "regulated neighborhood" and all local minima are sorted by increasing p-value of these regulated neighborhoods. The subgraphs at the top positions of the resulting list should contain the most relevant regions of the total evidence network.
Comparison with Previous Approaches
The method devised by Ideker et al.  for the determination of signaling and regulation circuits from a combination of protein interaction information and expression data could easily be extended to cover the more general case of microarray interpretation addressed by GiGA. This would only require the extension of "interaction" to include, e.g., participation in a shared cellular process or shared functional annotation. Their approach uses aggregate z-scores to evaluate the quality of each subgraph. This requires a relatively complex parameter estimation procedure followed by simulated annealing. In contrast, the rank-transform of the data that is the basis of GiGA allows non-parametric p-value calculations and is thus much faster (and computationally less demanding) than the method described by Ideker et al. A disadvantage of both methods is that they both do not guarantee to find the optimally scoring subgraph.
The recently released commercial Pathway Analysis software from Ingenuity Systems http://www.ingenuity.com seems to be based on a similar concept as GiGA, i.e. the determination of regulated subgraphs in annotation networks. However, it just classifies genes as changed ("focus genes") or unchanged based on an arbitrary selected significance cut-off and thus discards most of the relative-change information (gene ranks) used by GiGA http://www.ingenuity.com/products/Circadian_Application_Note0104.pdf. Therefore, this method is difficult to apply to very noisy or unreplicated experiments where a reliable delimitation of the "changed" genes becomes impossible.
Experimental Case Study
To validate the GiGA approach, we used the yeast diauxic shift experiment by DeRisi et al. . In this classical study the authors examined the response of yeast cells to glucose depletion in the growth medium. As the biology of this process is extremely well understood and the functional annotation of the yeast genome is very comprehensive, we could use this dataset to test the ability (and reliability) of GiGA to identify the relevant subgraphs of interest.
In their original publication, DeRisi et al. highlight the following changes during starvation: Rechanneling of metabolites into the tricarboxylic acid (TCA) and glyoxylate cycle, increase in aerobic respiration (cytochrome c oxidase and reductase), gluconeogenesis, and carbohydrate storage (glycogen and trehalose biosynthesis). In contrast, 95% of ribosomal proteins, as well as tRNA synthetases and translation elongation/initiation factors were strongly down-regulated. Twenty hours after the initial inoculation of the sample about 20% of all genes showed at least a two-fold change in expression.
Table 1 and 2 show the result of an iterative Group Analysis of these expression data. Seven time points after inoculation were examined. Significant expression changes become apparent at 13.5 hours. It can be seen that all the processes identified manually by DeRisi et al. are already apparent at this level of analysis. In addition, iGA highlights the up-regulation of sugar transporters at the early stages of starvation, obviously a desperate attempt by the cells to take up the last remaining sugars from the medium. It also highlights the induction of heat shock proteins and the repression of ribosome biosynthesis processes in the nucleolus, which were not discussed in the original paper. What is missing in these lists, however, is the automatic identification of the interrelationships between the identified processes, which would be particularly informative in more realistic applications with a less well understood biology.
This connectivity between genes and functional classes is provided by GiGA. Table 3 summarizes the results for the 20.5 hour time point, using two different networks, one for GeneOntology classes, and one for enzyme substrates, extracted from the SwissProt catalytic activity descriptors of yeast proteins. For both up- and down-regulated genes, the most significant subgraph is widely separated from the next best one, contains the largest number of genes, and comprises almost all the processes detected by iGA and the original authors. (We here restricted the size of subgraphs to a maximum of 40 genes to keep navigation of the results simple.) Figures 2 and 3 show the automatically generated visualization of the corresponding most significant subgraphs and the associated annotation. It is obvious how GiGA highlights the functional connections between different enzymes. In Fig. 2, the association between small and large ribosomal subunits, nucleolar rRNA processing and translational elongation is faithfully reconstructed. In Fig. 3, which uses the enzyme substrate network (which we considered to be more challenging for the algorithm), the interplay between the TCA cycle, the overlapping glyoxylate cycle, and all the relevant protein complexes of the respiratory chain are readily apparent. The agreement with the manual interpretation by DeRisi et al. extends down to the single gene level, while at the same time adding additional, obviously relevant connections, e.g. from the TCA cycle to the ATP synthase complex. Fig. 3 also shows the second best subgraph, which contains the cytochrome c oxidase subunits together with two catalase genes that may be involved in the detoxification of the hydrogen peroxide generated by the respiratory burst induced by starvation.
The performance of GiGA (as well as iGA) is best appreciated when compared to the results of an extensive expert interpretation of the same data. Table 1 to 3, and Fig. 2 and 3 show how both techniques succeed in detecting and condensing exactly the genes and processes that were considered relevant by the expert biologists when first interpreting the same data . GiGA effectively summarizes the original publication in three subgraph pictures (Fig. 2 and 3). This is even more astonishing when considering that these results are achieved for each single time-point separately (see, e.g., Tab. 1 and 2). This reveals the stability of the approach towards the measurement variance inherent in any unreplicated microarray experiment.
It is important to be aware that each of the highlighted subgraphs has to be carefully evaluated for its biological relevance. On the one hand, the sheer number of possible subgraphs in an evidence network creates a major multiple-testing problem, which means that some of the detected associations may be due to chance. Random permutations of the expression data – which can be generated by the GiGA software – can give an idea of the expected false-discovery rate. On the other hand, functional annotations are at present notoriously unreliable and spurious edges may affect the details of the results. Also, not all genes within a detected subgraph will necessarily show a strong expression change, because sometimes less strongly affected genes may connect those genes that do change. Such a relation is for example expected for many transcription factors and their targets . Nonetheless, GiGA is able to direct the user to the most interesting areas in the evidence network.
The GiGA method is not restricted to use with exhaustively annotated genomes. It can work on a wide variety of "evidence" to build the necessary network, including hypothetical predicted functions or associations. It is even possible to apply GiGA to metabolomics results, which are characterized by the absence of any significant amount of annotation – usually only exact molecular masses and their differential abundance are known. In that case, an evidence network can be built from the measured masses themselves, linking compounds m1 and m2 whenever their mass difference (Δm = |m1 - m2|) can be explained by a common biochemical transformation (e.g. dehydrogenation: Δm = 2* mass(hydrogen)). The set of relevant transformations can easily be collected from any biochemistry textbook. In addition, one can introduce edges for condensation reactions between observed masses, i.e. if m1 + m2 = m3 + mass(H2O) then edges between m1 and m3, and m2 and m3 are added to the evidence network. We are currently developing the application of GiGA to this kind of data in the Sir Henry Wellcome Functional Genomics Facility at the University of Glasgow http://www.gla.ac.uk/functionalgenomics/; data not shown).
The present analysis of a biologically well-understood test case demonstrates the reliable performance of GiGA. The method automatically identifies all relevant physiological processes, puts them into context, summarizes them in an intuitive format, and associates them with the underlying evidence (Fig. 2 and 3). It can be applied to experiments with very small numbers of replicates (a single time point in the diauxic shift test case) and can be used with any available functional annotation, including protein interaction networks, co-expression data or literature mining results, as well as in areas beyond microarray analysis. For visualization, we have used the graph-layout software aiSee, but output files suitable for a variety of graphical tools can easily be generated by slight modifications in the implementation. GiGA can be used as a stand-alone tool, but we expect that it will be most useful when integrated into existing microarray analysis software, and for that reason the GiGA algorithm is freely available without restrictions.
Yeast gene expression data for the diauxic shift experiment were obtained from the Stanford Microarray Database http://genome-www5.stanford.edu/. GeneOntology annotations http://www.geneontology.org were obtained from Affymetrix http://www.affymetrix.com/analysis/download_center.affx. The enzyme substrate networks were built based on information contained in the annotation of the yeast proteome in the SwissProt database http://us.expasy.org/sprot/. The GiGA algorithm has been implemented as a Perl script and compiled as a Windows command line executable. These files are available (together with a manual and example files) as Additional files 1 to 8.
Dudoit S, Yang YH, Callow MJ, Speed TP: Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 2002, 12: 111–139.
Breitling R, Armengaud P, Amtmann A, Herzyk P: Rank products: A simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Letters, in press.
Breitling R, Amtmann A, Herzyk P: Iterative Group Analysis (iGA): A simple tool to enhance sensitivity and facilitate interpretation of microarray experiments. BMC Bioinformatics 2004, 5: 34. 10.1186/1471-2105-5-34
Doniger SW, Salomonis N, Dahlquist KD, Vranizan K, Lawlor SC, Conklin BR: MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol 2003, 4: R7. 10.1186/gb-2003-4-1-r7
Hosack DA, Dennis G., Jr., Sherman BT, Lane HC, Lempicki RA: Identifying biological themes within lists of genes with EASE. Genome Biol 2003, 4: R70. 10.1186/gb-2003-4-10-r70
Kim CC, Falkow S: Significance analysis of lexical bias in microarray data. BMC Bioinformatics 2003, 4: 12. 10.1186/1471-2105-4-12
Provart NJ, Zhu T: A Browser-based Functional Classification SuperViewer for Arabidopsis Genomics. Currents in Computational Molecular Biology 2003, 2003: 271–272.
Zeeberg BR, Feng W, Wang G, Wang MD, Fojo AT, Sunshine M, Narasimhan S, Kane DW, Reinhold WC, Lababidi S, Bussey KJ, Riss J, Barrett JC, Weinstein JN: GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol 2003, 4: R28. 10.1186/gb-2003-4-4-r28
Ideker T, Ozier O, Schwikowski B, Siegel AF: Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 2002, 18: S233-S240.
DeRisi JL, Iyer VR, Brown PO: Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997, 278: 680–686. 10.1126/science.278.5338.680
We thank Dr. Mike Barrett for stimulating discussions on the application of GiGA to metabolomics data. We also thank the referees for their very constructive evaluation of the software. This work was supported by BBSRC grants 17/GG17989 and 17/P17237.
RB devised and implemented the GiGA technique and drafted the manuscript. AA and PH supervised the project. All authors read and approved the final manuscript.