Iterative Group Analysis (iGA): A simple tool to enhance sensitivity and facilitate interpretation of microarray experiments
© Breitling et al; licensee BioMed Central Ltd. 2004
Received: 15 November 2003
Accepted: 29 March 2004
Published: 29 March 2004
The biological interpretation of even a simple microarray experiment can be a challenging and highly complex task. Here we present a new method (Iterative Group Analysis) to facilitate, improve, and accelerate this process.
Our Iterative Group Analysis approach (iGA) uses elementary statistics to identify those functional classes of genes that are significantly changed in an experiment and at the same time determines which of the class members are most likely to be differentially expressed. iGA does not require that all members of a class change and is therefore robust against imperfect class assignments, which can be derived from public sources (e.g. GeneOntologies) or automated processes (e.g. key word extraction from gene names).
In contrast to previous non-iterative approaches, iGA does not depend on the availability of fixed lists of differentially expressed genes, and thus can be used to increase the sensitivity of gene detection especially in very noisy or small data sets. In the extreme, iGA can even produce statistically meaningful results without any experimental replication.
The automated functional annotation provided by iGA greatly reduces the complexity of microarray results and facilitates the interpretation process. In addition, iGA can be used as a fast and efficient tool for the platform-independent comparison of a microarray experiment to the vast number of published results, automatically highlighting shared genes of potential interest.
By applying iGA to a wide variety of data from diverse organisms and platforms we show that this approach enhances and accelerates the interpretation of microarray experiments.
Microarray experiments determine the relative expression levels of a large number of genes in various conditions. In the simplest case they are used to detect differences in gene expression between two conditions, e.g. in diseased vs. healthy tissue, or in mutant vs. wild type organisms. The results are generally presented as lists of "differentially expressed" genes. A number of statistical techniques have been successfully employed to prepare and analyze these lists (preparation reviewed in [1–5]; analysis methods e.g. [6–11]).
In some applications, e.g. identification of marker genes or potential drug targets, such mere lists of genes may be considered sufficient. More often, however, it will be necessary to make biological sense of the data. Given lists of hundreds of genes, this is a daunting task, especially when nothing is known about the expected outcome. But even when the physiology of the experimental system is well-defined, it is sometimes difficult to distinguish which parts of the interpretation are significant and which may be artifacts of a biased focus on familiar features.
Here we present a method (Iterative Group Analysis, iGA) that provides an automatic functional annotation of microarray results together with a statistical confidence level for each annotation feature. This annotation is based on a comprehensive hypergeometric statistics calculation detecting concerted changes in "functional classes" of genes. In contrast to previous methods with similar objectives [6–11], iGA does not require a fixed list of reliably differentially expressed genes but rather uses an iterative procedure to determine the optimal threshold for each functional class. This feature should make iGA more flexible in the case of noisy data, when a reliable list of genes is not available, and allows its use as a sensitive gene detection method. The functional classes can be of diverse origin (e.g. GeneOntology assignments http://www.geneontology.org, BLAST result key words, literature extracts) and the detection algorithm will automatically determine the genes in each class that are most likely to be differentially expressed.
By focusing on groups instead of single genes, it is possible to determine statistical significance even without experimental replication, the group members serving as "internal replicates". While this will not provide the same detail as "real" replication, it can provide important information on the biology of a sample. At the same time, the iGA approach should enhance the sensitivity of gene detection, especially for small, noisy data sets.
The iGA approach also provides a fast and efficient way to compare an experiment with a large number of published microarray experiments, without requiring a common experimental platform or analytical technique.
Results and discussion
Functional annotation of microarray results
where n is the total number of genes, x is the total number of class members, and t is the rank of the z-th class member (z being the current step in the iteration; see Fig. 1). The notation $\binom{n}{k}$ indicates the binomial coefficient, i.e. the number of ways of picking k unordered items from a list of n items. This equation is equivalent to that used by Onto-Express in a more restricted context. Third, we determine the position in the list that yields the smallest p-value, and assign this as the PC-value of the class. All class members above that position are considered as "potentially differentially expressed". Fourth, all classes are sorted by their PC-values. The classes with the lowest PC-value are most significantly changed.
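The iteration over class members can be illustrated with a minimal Python sketch. It is not the authors' Perl script. It assumes that the p-value at each step is the upper tail of the hypergeometric distribution, i.e. the probability of finding at least z of the x class members among the top t of n genes; the function names and the toy classes are hypothetical.

```python
from math import comb


def hypergeom_tail(n, x, t, z):
    """P(at least z of the x class members lie among the top t of n genes).

    Assumed form: sum_{k=z}^{min(x,t)} C(x,k) * C(n-x,t-k) / C(n,t).
    """
    upper = min(x, t)
    hits = sum(comb(x, k) * comb(n - x, t - k) for k in range(z, upper + 1))
    return hits / comb(n, t)


def pc_value(member_ranks, n):
    """Return (PC-value, z), z being the number of members at the best cut-off.

    member_ranks: 1-based ranks of the class members in the sorted gene list.
    """
    ranks = sorted(member_ranks)
    x = len(ranks)
    best_p, best_z = 1.0, 0
    for z, t in enumerate(ranks, start=1):  # one step per class member
        p = hypergeom_tail(n, x, t, z)
        if p < best_p:
            best_p, best_z = p, z
    return best_p, best_z


def rank_classes(classes, n):
    """Sort classes (name -> list of member ranks) by ascending PC-value."""
    scored = [(name, *pc_value(ranks, n)) for name, ranks in classes.items()]
    return sorted(scored, key=lambda row: row[1])


# Toy data (hypothetical): 20 genes, two classes given by member ranks.
toy = {"class_A": [1, 2, 3, 15], "class_B": [5, 12, 18]}
for name, p, z in rank_classes(toy, n=20):
    print(f"{name}: PC-value={p:.4f}, members up to cut-off={z}")
```

In the toy data, the fourth member of class_A sits at rank 15 and does not improve the p-value, so the cut-off stays after the third member. This mirrors the point that not all class members need to change.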
Note that this process does not require that all members of a group change at the same time or in the same direction – a very important feature, because genes that share a functional annotation may include activators as well as inhibitors of a certain process, and may also include a large number of hypothetically assigned genes that in the majority won't change at all.
In the case of genes that are present several times on a chip, the user can decide how to proceed: Either the replicate spots are merged into an average value during the previous steps of analysis, i.e. before sorting the gene list, or each replicate is treated as an independent group member. The latter approach is statistically not quite correct, as replicate spots do not represent independent measurements, but in the case studies described below we found the resulting summaries of replicates very useful.
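The first option, merging replicate spots before the gene list is sorted, could look like the following sketch. The input layout (pairs of gene identifier and log ratio) and the function name are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean


def merge_replicates(spots):
    """Average replicate spots per gene, then sort by decreasing value.

    spots: iterable of (gene_id, log_ratio) pairs (assumed layout).
    """
    by_gene = defaultdict(list)
    for gene, value in spots:
        by_gene[gene].append(value)
    merged = [(gene, mean(values)) for gene, values in by_gene.items()]
    merged.sort(key=lambda item: item[1], reverse=True)
    return merged


# Hypothetical example: gene "a" is spotted twice on the chip.
spots = [("a", 2.0), ("b", 1.0), ("a", 1.0), ("c", 1.8)]
print(merge_replicates(spots))
```

Treating each replicate as a separate group member instead would simply skip the averaging step and sort the raw spots.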
Up-regulated GeneOntology classes after 6 h of white light-treatment applied to etiolated A. thaliana seedlings (chip ID 7341, see Material and Methods for details). The top 10 classes are shown. They contain a total of 26 distinct genes that are detected as "possibly changed". Six of the 10 classes are directly related to the light reactions of photosynthesis.
GO class description
photosystem I reaction center
reductive pentose-phosphate cycle
plastid large ribosomal subunit
chloroplast stromal thylakoid
Assigning statistical confidence to the annotation
As the PC-values are directly derived from p-values, they already give a good idea of the statistical significance of an observed change, after correcting for the effect of multiple testing, i.e. the fact that many hundreds of groups are tested for differential expression at the same time. However, as many functional classes overlap, multiple testing correction using the total number of functional classes tested is too restrictive. At the same time, the PC-values underestimate the true probability of observing such changes by chance, because they are based on determining the minimum p-value within each class. As iGA is rank-based, it is very easy to overcome these two problems by calculating PC-values from a large number of random permutations of genes and counting how often a given PC-value or a smaller one is actually observed in those random data. This approach is possible even when the list of genes is based on a single experiment. For example, PC-values smaller than 0.00092 occurred 48 times in 100 random permutations of the photosynthesis example discussed above. This also means that the expected false-discovery rate (FDR) among the top 5 groups in the example is less than 10%. Such an FDR is not particularly impressive at first glance, but these results are based on a single array, while in a single-gene approach no statistical confidence determination is possible at all without replication (or a detailed a priori knowledge of measurement errors). The results can be corroborated by the analysis of three replicate hybridizations for the same experiment, as iGA finds almost exactly the same groups in two of them (the third [chip 14753] may be a biological "outlier", as its iGA result shows hardly any similarity to the other three chips, although clustering analysis does not indicate a serious technical problem with this hybridization).
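The permutation procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the PC-value is taken as the minimum hypergeometric tail probability over all class members, "48 times" is read as 48 class-level occurrences summed over all permutations, and 0.00092 is read as the PC-value of the fifth-ranked class. All function names are hypothetical.

```python
import random
from math import comb


def pc_value(member_ranks, n):
    """Minimum hypergeometric tail probability over all class members."""
    ranks = sorted(member_ranks)
    x = len(ranks)
    best = 1.0
    for z, t in enumerate(ranks, start=1):
        tail = sum(comb(x, k) * comb(n - x, t - k)
                   for k in range(z, min(x, t) + 1))
        best = min(best, tail / comb(n, t))
    return best


def count_random_hits(classes, n, threshold, n_perm=100, seed=0):
    """Count class PC-values <= threshold over n_perm random gene orderings.

    classes: name -> collection of gene indices in range(n).
    """
    rng = random.Random(seed)
    order = list(range(n))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)  # random ranking of all genes
        rank_of = {gene: rank for rank, gene in enumerate(order, start=1)}
        for members in classes.values():
            if pc_value([rank_of[g] for g in members], n) <= threshold:
                hits += 1
    return hits


def estimated_fdr(hits, n_perm, n_called):
    """Expected false positives per ordering, divided by the classes called."""
    return (hits / n_perm) / n_called


# Figures quoted in the text: 48 hits in 100 permutations, top 5 groups.
print(estimated_fdr(48, 100, 5))
```

With the quoted figures this gives 0.48 expected false positives per random list, or about 9.6% of the top 5 groups, consistent with the stated FDR of less than 10%.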
Major assumptions underlying the iGA statistics are the independence of measurements within a group and equal variation between groups. While the first assumption will generally hold unless replicate spots or cross-hybridizing isoforms of a gene are included as group members, the second assumption may be violated for biological reasons: There may be certain groups of genes that tend to show more variation of their expression than others, even in constant conditions, and will therefore show up more often in the iGA results. The same problem occurs with a manual analysis of microarrays. iGA is still applicable in these cases, but it is essential that the statistical analysis is followed by a rigorous assessment of the biological significance of the results.
Sources of functional annotations
The power of iGA increases with the quality of the available annotations. In the case of human and mouse genomes, the functional annotation by GeneOntology terms is relatively detailed and comprehensive and in our test cases (see below) gave satisfactory results. For rat and A. thaliana data we found it useful to include classes defined by key words automatically extracted from the complete annotation of each gene (including the gene description and the description of BLAST hits). While those key word-based classes are not necessarily biologically meaningful, iGA proved to be sufficiently robust to detect the most relevant changes. Of course, user-defined groups of interest can be particularly powerful and are easily integrated in iGA.
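A simple way to derive key word-based classes from free-text annotations is sketched below. The tokenization rule, the stop-word list, and the minimum class size are illustrative assumptions; the text does not specify how key words were extracted.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; a real one would be tuned to the annotation source.
STOP_WORDS = {"putative", "protein", "similar", "like", "the", "and", "family"}


def keyword_classes(annotations, min_size=2):
    """Group genes by key words found in their annotation text.

    annotations: gene_id -> free-text description (assumed layout).
    Returns key word -> set of gene_ids, keeping classes with >= min_size genes.
    """
    classes = defaultdict(set)
    for gene, text in annotations.items():
        # Words of at least three characters, starting with a letter.
        for word in set(re.findall(r"[a-z][a-z0-9-]{2,}", text.lower())):
            if word not in STOP_WORDS:
                classes[word].add(gene)
    return {w: genes for w, genes in classes.items() if len(genes) >= min_size}


# Hypothetical annotations.
annotations = {
    "g1": "Putative potassium transporter",
    "g2": "potassium channel protein",
    "g3": "nitrate transporter",
}
print(keyword_classes(annotations))
```

Classes built this way will include words without biological meaning, which is why the robustness of the group statistics to imperfect class assignments matters here.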
Sensitive gene detection by using Iterative Group Analysis
Group analysis of gene expression in potassium-starved A. thaliana roots. This experiment used a custom-made transporter array and corresponding annotations. The PC-values, number of group members and of significantly changed genes are indicated. Only groups with a group-wise FDR <10% are shown. The last column indicates the number of group members that were detected by Significance Analysis of Microarrays with an FDR <10%.
Potassium transporter (up)
Nitrate transporter (down)
Putative anion exchanger (up)
19 TMS proteins (down)
Comparison of SAM and iGA performance on noisy lymphocyte data. The cells were obtained from elicitor-treated animals and show an expression pattern indicative of immune system activation. For iGA the genes on the array were annotated by Gene Ontology terms and keywords extracted from gene names.
iGA (based on lists of genes sorted by fold-change)
Kallikrein (4 genes)
Rhesus transporters (2)
Antimicrobial peptides (4)
Classical complement (4)
Scavenger receptors (2)
7 other groups (15)
Using Iterative Group Analysis to compare microarray results
A variation of iGA uses classes based on the results of previous microarray experiments. Those classes can either be manually extracted from the published literature (e.g., each list of up- or down-regulated genes defines one class) or automatically determined from the raw data files (e.g., each class contains the genes that change more than 2-fold in a certain experiment). In this way, each experiment is described by a number of "signature classes", which can then be tested for concerted differential expression in the experiment of interest. This approach not only identifies the experimental conditions that come closest to the present experiment, but also highlights the shared genes. In contrast to clustering techniques that can be used for similar purposes, iGA does not require that the same genes are present in all experiments, that the experimental platform is the same, or that all genes behave identically. The "open", iterative technique of iGA makes it more robust than previous "fixed list"-based approaches used for similar purposes. It is also not necessary that the complete data are available for all experiments. The statistical confidence measures associated with iGA results are particularly useful when only a small number of regulated genes are shared between two experiments, as we often found to be the case even for almost identical treatments (not shown).
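The automatic route, where each class contains the genes that change more than 2-fold in a previous experiment, might be sketched like this. The data layout (linear expression ratios keyed by gene identifier), the class naming, and the function names are assumptions for illustration.

```python
def signature_classes(experiments, fold=2.0):
    """Build up/down signature classes from earlier experiments.

    experiments: name -> {gene_id: ratio}, ratios on a linear scale
    (treated / control); this layout is assumed.
    """
    classes = {}
    for name, ratios in experiments.items():
        up = {g for g, r in ratios.items() if r > fold}
        down = {g for g, r in ratios.items() if r < 1.0 / fold}
        if up:
            classes[f"{name} (up)"] = up
        if down:
            classes[f"{name} (down)"] = down
    return classes


def restrict_to_array(classes, genes_on_array):
    """Keep only class members measured in the experiment of interest."""
    restricted = {}
    for name, members in classes.items():
        shared = members & genes_on_array
        if shared:
            restricted[name] = shared
    return restricted


# Hypothetical published experiments.
experiments = {
    "heat": {"a": 4.0, "b": 0.3, "c": 1.1},
    "cold": {"a": 0.9, "d": 2.5},
}
classes = signature_classes(experiments)
print(restrict_to_array(classes, {"a", "b", "c"}))
```

Restricting each class to the genes present on the current array is one way to handle experiments run on different platforms; the resulting classes can then be scored like any other functional class.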
Validation of Iterative Group Analysis
To validate our approach we performed a "semi-blind" study of microarray data produced at the Sir Henry Wellcome Functional Genomics Facility at the University of Glasgow. The data were collected by one of us (P.H.) from samples provided by several collaborators. The pre-processed and normalized data were then passed on to iGA without information on the biology of the experiment other than the array used. Using these "anonymized" results the first author (R.B.) produced a detailed description of the samples, trying to identify the sample source, treatment and physiological status of each sample. The final interpretation was then discussed with the experimentalists to establish the correctness of the iGA conclusions.
Summary of the case studies used for the "semi-blind" evaluation of iGA. The first two columns indicate the study organism and experiment performed. The next two columns show the tissue and physiological condition identified on the basis of iGA.
CaCo-2 cell Campylobacter infection
anoxia, growth arrest, immediate-early response
CaCo-2 cell Campylobacter infection
cell culture – hepatocytes
growth arrest, acute-phase response
Chronic Fatigue Syndrome muscle
lipid phosphate phosphatase-transfected HEK cells
mostly minor changes
testis of hormone receptor knock-outs
steroidogenesis vs. spermatogenesis
phaeochromocytoma treated with neural growth factor
growth-factor response vs. basic metabolism
inflammation; early activation; chemokines
This study shows that iGA can yield a reasonable, unbiased, and statistically supported interpretation of microarray data even without prior knowledge of the expected outcome. With increasing comprehensiveness and quality of available annotations the method will become ever more powerful. iGA can be used as a stand-alone application, but it should also be easily integrated with existing graphical interfaces for microarray exploration such as MAPPFinder, SuperViewer, GoMiner, or various commercial programs.
Material and methods
Pre-processed microarray data for light-treated A. thaliana seedlings were obtained from TAIR (http://www.arabidopsis.org; ExpressionSet:1005823603, AFGC IDs 7341, 14753, 14765, 14779). These were from two-color cDNA arrays that covered about 5000 genes after pre-processing as described on the Arabidopsis website. All other experiments were performed at the Sir Henry Wellcome Functional Genomics Facility at the University of Glasgow http://www.gla.ac.uk/functionalgenomics/ and mainly involved whole-genome Affymetrix arrays for various mammalian species. Details of those experiments will be published elsewhere. Mammalian gene annotations were downloaded from Affymetrix http://www.affymetrix.com/analysis/download_center.affx, A. thaliana gene annotations were obtained from TAIR.
iGA was implemented as a Perl script (Additional file 2). Required input data are a list of genes sorted by differential expression (Additional file 4) and a list of functional annotations (Additional file 5). A Windows executable of iGA (Additional file 1) as well as a manual (Additional file 3) and further annotation and examples (Additional files 6, 7, and 8) are also provided.
We thank Patrick Armengaud, Wilhelmina Behan, Simone Boldt, Anna Casburn-Jones, Gillian Douce, Paul Everest, Michael Farthing, Heather Johnston, Walter Kolch, Peter O'Shaughnessy, Susan Pyne, Rosemary Smith, and Hawys Williams, who allowed us to test iGA on their data and were always willing to discuss their results. This work was supported by BBSRC grants 17/GG17989 (AA and PH) and 17/P17237 (AA).
- Cui X, Churchill GA: Statistical tests for differential expression in cDNA microarray experiments. Genome Biol 2003, 4: 210. 10.1186/gb-2003-4-4-210
- Pan W: A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 2002, 18: 546–554. 10.1093/bioinformatics/18.4.546
- Pan W: On the use of permutation in and the performance of a class of nonparametric methods to detect differential gene expression. Bioinformatics 2003, 19: 1333–1340. 10.1093/bioinformatics/btg167
- Pepe MS, Longton G, Anderson GL, Schummer M: Selecting differentially expressed genes from microarray experiments. Biometrics 2003, 59: 133–142.
- Stolovitzky G: Gene selection in microarray data: the elephant, the blind men and our algorithms. Curr Opin Struct Biol 2003, 13: 370–376. 10.1016/S0959-440X(03)00078-2
- Doniger SW, Salomonis N, Dahlquist KD, Vranizan K, Lawlor SC, Conklin BR: MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol 2003, 4: R7. 10.1186/gb-2003-4-1-r7
- Draghici S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA: Global functional profiling of gene expression. Genomics 2003, 81: 98–104. 10.1016/S0888-7543(02)00021-6
- Hosack DA, Dennis G Jr, Sherman BT, Lane HC, Lempicki RA: Identifying biological themes within lists of genes with EASE. Genome Biol 2003, 4: R70. 10.1186/gb-2003-4-10-r70
- Kim CC, Falkow S: Significance analysis of lexical bias in microarray data. BMC Bioinformatics 2003, 4: 12. 10.1186/1471-2105-4-12
- Provart NJ, Zhu T: A Browser-based Functional Classification SuperViewer for Arabidopsis Genomics. Currents in Computational Molecular Biology 2003, 2003: 271–272.
- Zeeberg BR, Feng W, Wang G, Wang MD, Fojo AT, Sunshine M, Narasimhan S, Kane DW, Reinhold WC, Lababidi S, Bussey KJ, Riss J, Barrett JC, Weinstein JN: GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol 2003, 4: R28. 10.1186/gb-2003-4-4-r28
- Storey JD: The positive false discovery rate: A Bayesian interpretation and the q-value. Ann Stat, in press.
- Maathuis FJ, Filatov V, Herzyk P, Krijger GC, Axelsen KB, Chen S, Green BJ, Li Y, Madagan KL, Sanchez-Fernandez R, Forde BG, Palmgren MG, Rea PA, Williams LE, Sanders D, Amtmann A: Transcriptome analysis of root transporters reveals participation of multiple gene families in the response to cation stress. Plant J 2003, 35: 675–692. 10.1046/j.1365-313X.2003.01839.x
- Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 2001, 98: 5116–5121. 10.1073/pnas.091062498
- Smid M, Dorssers LC, Jenster G: Venn Mapping: clustering of heterologous microarray data based on the number of co-occurring differentially expressed genes. Bioinformatics 2003, 19: 2065–2071. 10.1093/bioinformatics/btg282
- Molmenti EP, Ziambaras T, Perlmutter DH: Evidence for an acute phase response in human intestinal epithelial cells. J Biol Chem 1993, 268: 14116–14124.
- Shaw G, Morse S, Ararat M, Graham FL: Preferential transformation of human neuronal cells by human adenoviruses and the origin of HEK 293 cells. Faseb J 2002, 16: 869–871.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.