We previously developed GoMiner and High-Throughput GoMiner, applications that organize lists of "interesting" genes (for example, under- and over-expressed genes from a microarray experiment) for biological interpretation in the context of the Gene Ontology [3, 4]. GoMiner and related tools typically generate a list of significant functional categories. In addition to lists and tables, High-Throughput GoMiner also provides a valuable graphical output termed a "clustered image map" (CIM). The "integrative" and "individual" CIMs depict the relationship between categories and either multiple experiments or genes, respectively.
When designing an algorithm for a program like GoMiner, a number of implementation decisions must be made. One such decision is how to handle genes mapping to a category that is a child of the category under consideration. The particular algorithm adopted by GoMiner "rolls up" genes mapping to a child category; that is, genes mapping to a child category are (recursively) assigned to the parent of that child category. Although that approach provides robust protection against variability in curation techniques, it can result in redundancy between parent and child categories.
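The "rolling up" idea can be illustrated with a minimal sketch. This is not GoMiner's actual implementation; the function name `roll_up`, the child-to-parents dictionary, and the toy annotations are all assumptions made for illustration only.

```python
from collections import defaultdict


def roll_up(parents, direct_genes):
    """Assign each category's directly annotated genes to all of its ancestors.

    parents: dict mapping a category to a list of its parent categories.
    direct_genes: dict mapping a category to the set of genes annotated to it.
    Both structures are hypothetical stand-ins for a real ontology.
    """
    rolled = defaultdict(set)
    for category, genes in direct_genes.items():
        # Walk upward from the annotated category through every ancestor.
        stack = [category]
        seen = set()
        while stack:
            term = stack.pop()
            if term in seen:
                continue
            seen.add(term)
            rolled[term] |= genes
            stack.extend(parents.get(term, ()))
    return dict(rolled)


# Toy hierarchy and annotations (illustrative only, not real GO data).
parents = {
    "apoptosis": ["cell death"],
    "cell death": ["biological process"],
}
direct_genes = {
    "apoptosis": {"TP53", "BAX"},
    "cell death": {"RIPK1"},
}
rolled = roll_up(parents, direct_genes)
```

In this toy case, "cell death" and "biological process" end up with identical gene sets after rolling up, which is the kind of parent/child redundancy described above.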
Even in the absence of "rolling up," redundancy can be an important issue. That is, two categories that are not in a parent/child relationship may include identical or nearly identical sets of genes. Overall, redundancy can easily inflate the number of categories considered statistically significant by a factor of about three, creating the illusion of an overly long list of significant categories and obscuring the relevant biological interpretation.
One way of addressing redundancy is exemplified by GO slims: "GO slims are cut-down versions of the GO ontologies containing a subset of the terms in the whole GO. They give a broad overview of the ontology content without the detail of the specific fine grained terms. GO slims are particularly useful for giving a summary of the results of GO annotation of a genome, microarray, or cDNA collection when broad classification of gene product function is required."
However, in the context of GoMiner analysis, the GO slims approach has several drawbacks:
It cannot deal with redundancy that might not result from "rolling up"
It is rather inflexible, as it is pre-computed and cannot adapt to the characteristics of a particular data set
It "throws out the baby with the bathwater:" a simplified view might be a useful first approximation, but the molecular biologist also needs to be able to "drill down" to see the full details
We propose here a solution that overcomes these limitations of GO slims. Full details are given in the Methods section. Briefly, our approach, RedundancyMiner, de-replicates the (fully or partially) redundant GO categories: the user selects a desired redundancy threshold, and a new, reduced clustered image map (CIM) is created. That CIM represents those categories that were not affected by the processing, as well as composite categories that represent groups of merged categories. An additional new type of CIM is also created, which we term a "META CIM." The META CIM conveniently visualizes the pattern of grouping within the merged categories. Thus, an overview is afforded by the reduced CIM, and the details by the META CIM.
Furthermore, the redundancy computation can be based on either (a) all genes that map to a category or (b) just the genes that exhibited altered expression levels in the current experiment. The latter approach (b) will provide a META CIM that reflects redundancy and interaction between categories that is specific to the conditions of the study. This pattern may be significantly different from the static reference behavior obtained by approach (a), and it may suggest the underlying systems biology.
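To make the role of the threshold and of the two gene bases concrete, the following is a rough sketch only. It assumes a Jaccard overlap measure and single-linkage grouping, neither of which is stated here as RedundancyMiner's actual procedure (that is described in the Methods section); the names `jaccard`, `group_redundant`, and the toy data are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two gene sets: |intersection| / |union| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_redundant(category_genes, threshold, changed_genes=None):
    """Group categories whose gene-set overlap meets the threshold.

    Assumed measure: Jaccard index. Assumed grouping: single linkage.
    If changed_genes is given, only those genes are compared (approach b);
    otherwise all mapped genes are compared (approach a).
    """
    if changed_genes is not None:
        category_genes = {c: g & changed_genes for c, g in category_genes.items()}

    # Union-find structure that tracks which categories have been merged.
    leader = {c: c for c in category_genes}

    def find(c):
        while leader[c] != c:
            leader[c] = leader[leader[c]]
            c = leader[c]
        return c

    for a, b in combinations(category_genes, 2):
        if jaccard(category_genes[a], category_genes[b]) >= threshold:
            leader[find(a)] = find(b)

    groups = {}
    for c in category_genes:
        groups.setdefault(find(c), set()).add(c)
    return sorted(sorted(g) for g in groups.values())


# Toy categories and genes (illustrative only).
category_genes = {
    "A": {"g1", "g2", "g3", "g4"},
    "B": {"g1", "g2", "g3"},
    "C": {"g3", "g4", "g5", "g6"},
}
all_gene_groups = group_redundant(category_genes, 0.7)
changed_gene_groups = group_redundant(category_genes, 0.7, {"g3", "g4"})
```

With all mapped genes, categories A and B are grouped; when only the two hypothetical changed genes are considered, A groups with C instead. This toy result mirrors the point that the grouping pattern under approach (b) can differ from the static pattern under approach (a).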
The META CIM does not simply discard redundancy, as might be the case for GO slims; rather, it processes the patterns of redundancy and extracts information from them. Ignoring the existence of redundancy, as GO slims does, is an oversimplification that may throw away valuable information.
A number of earlier papers address related issues. Several of those papers address approaches to studying gene enrichment, but not specifically the redundancy problem. For example, Pehkonen developed a method that clusters genes into groups with homogeneous functionalities. The method uses Nonnegative Matrix Factorization (NMF) to create several clustering results with varying numbers of clusters. The clustering results are combined into a simple graphical presentation showing the functional groups over-represented in the analyzed gene list. Prufer developed "FUNC," a package for detecting significant associations between gene sets and ontological annotations. Xu developed "CeaGO," which enriches clustered GO terms based on semantic similarity. Hermann developed "SimCT," which draws a simplified representation of the biological terms present in the set of objects.
Several papers address the redundancy problem within the context of studying gene enrichment, and are therefore potentially more germane. For example, Alexa proposed a method, "TopW," to eliminate local dependencies between GO terms; Lu developed "GenGO," a generative probabilistic model that identifies a small subset of categories that, together, explain the selected gene set; and Grossmann developed the "Ontologizer," which uses the parent child union (PCU) to reduce the dependencies between the measurements for individual terms, and recomputes the P-value for a specific category by taking into account the immediately more general terms (the parents). That procedure can often lead to the removal of false positives, since some of the more specific categories are eliminated if their parent category is determined to be significant. The more recent global "MGSA" method of Bauer outperformed the local methods of Alexa, Lu, and Grossmann. Finally, Richards recently developed a novel global approach to assess the functional coherence of gene sets by taking into account both the enrichment of GO terms and the relationships among those terms.
In summary, the two most promising of the previously published methods for addressing redundancy are those of Bauer and Richards. However, unlike RedundancyMiner, neither of those methods takes advantage of the redundancy patterns to infer subtle themes among groups of GO categories. RedundancyMiner's META CIM is shown here to be of great potential value in such analyses.