VennMaster: Area-proportional Euler diagrams for functional GO analysis of microarrays
 Hans A Kestler^{1, 2},
 André Müller^{2},
 Johann M Kraus^{1, 2},
 Malte Buchholz^{2, 3},
 Thomas M Gress^{2, 3},
 Hongfang Liu^{4, 6},
 David W Kane^{5},
 Barry R Zeeberg^{6} and
 John N Weinstein^{6}
DOI: 10.1186/1471-2105-9-67
© Kestler et al; licensee BioMed Central Ltd. 2008
Received: 16 March 2007
Accepted: 29 January 2008
Published: 29 January 2008
Abstract
Background
Microarray experiments generate vast amounts of data. The functional context of differentially expressed genes can be assessed by querying the Gene Ontology (GO) database via GoMiner. Directed acyclic graph representations, which are used to depict GO categories enriched with differentially expressed genes, are difficult to interpret and, depending on the particular analysis, may not be well suited for formulating new hypotheses. Additional graphical methods are therefore needed to augment the GO graphical representation.
Results
We present an alternative visualization approach, area-proportional Euler diagrams, showing set relationships with semi-quantitative size information in a single diagram to support biological hypothesis formulation. The cardinalities of sets and intersection sets are represented by area-proportional Euler diagrams and their corresponding graphical (circular or polygonal) intersection areas. Optimally proportional representations are obtained using swarm and evolutionary optimization algorithms.
Conclusion
VennMaster's area-proportional Euler diagrams effectively structure and visualize the results of a GO analysis by indicating to what extent flagged genes are shared by different categories. In addition to reducing the complexity of the output, the visualizations facilitate generation of novel hypotheses from the analysis of seemingly unrelated categories that share differentially expressed genes.
Background
A major goal, as well as a major challenge, of transcriptome analyses is the interpretation of results in a biological context. In many comparative studies, the primary results of the analyses are lists of genes expressed differentially between different groups of samples. The identification of underlying biological themes (e.g. alterations of specific pathways, triggering of complex cellular responses, activation of specific transcriptional programs) is usually not straightforward. By providing a controlled and structured vocabulary for the functional description of gene products, the Gene Ontology (GO) database [1] represents a useful resource for comprehensive functional annotation of gene lists. Moreover, GO categories that are significantly enriched in the differentially expressed genes can be identified, providing clues to the biological causes and consequences of observed transcriptome changes. Since genes and gene products are usually associated with several GO terms, such an analysis tends to increase, rather than reduce, the information load. Methods are therefore needed to structure and adequately visualize the results of a GO analysis (e.g., by indicating to what extent genes are shared by different categories). In addition to simply reducing the complexity of the output, such visualizations may facilitate the generation of novel hypotheses from observation of seemingly unrelated categories that share differentially expressed genes.
Diagrammatic notations involving circles and other closed curves have been used to represent classical syllogisms since the Middle Ages [2]. In the 18th century the mathematician Leonhard Euler introduced the notation that is now called the "Euler diagram" to illustrate relationships among sets. That notation uses the topological properties of enclosure, exclusion, and partial overlap to represent the set-theoretic concepts of containment, disjointness, and intersection. Another notation was invented by John Venn in the 19th century. A Venn diagram contains n closed curves representing n sets, in which all sets must intersect. Those diagrams rarely provide a useful visual representation if five or more sets are involved (in general using non-oval contours). Moreover, it can be shown that Venn diagrams with circles are not generally possible for more than three sets. Here, we relax the requirement of total intersection of all curves, limit ourselves to circles, but impose the additional requirement that area must be as nearly as possible proportional to set size. The last restriction enables us to visualize the set relationships at least semi-quantitatively. The problem of proportional areas is in general not perfectly solvable (i.e. fulfilling all requirements of containment, disjointness, and intersection with set size proportional to the corresponding area). Rather, the aim is to construct approximate solutions. A preliminary report [3] described a basic implementation of those ideas. We now describe how the analytical ideas can be used to construct Euler/Venn diagrams, together with full, seamless integration into GoMiner.
Finding interesting intersections
The Gene Ontology (GO) database imposes three hierarchically structured ontologies, or classification systems, on gene products:

Molecular function – an activity at the molecular level (e.g. catalytic/transporter activity or binding).

Biological process – a series of molecular functions (e.g. signal transduction).

Cellular component – an anatomical structure (e.g. rough endoplasmic reticulum or nucleus) or a gene product macromolecular structure (e.g. ribosome, proteasome or a protein dimer).
Because each GO category may have more than one parent, the hierarchy takes the form of a directed acyclic graph (DAG), with edges pointing from a parent (a more general category) to children (more specific categories). The three major ontologies share no nodes and are therefore independent DAGs. Each gene product is associated with one or more categories. The root subsumes all three ontologies and is therefore associated with all categorized gene products in the database.
GoMiner [4–6] evaluates the significance of each GO category by a Fisher's exact p-value and a false discovery rate (FDR) in order to detect GO categories in which the differentially expressed genes of a microarray assay are significantly over-represented.
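As a hedged illustration of the kind of test named above, the following sketch computes a one-sided Fisher's exact p-value for over-representation from the hypergeometric distribution. The function name, its parameters, and the way the counts are arranged are assumptions made for this example; they are not taken from GoMiner.

```python
from math import comb


def fisher_enrichment_pvalue(k, n_changed, n_category, n_total):
    """One-sided Fisher's exact p-value P(X >= k) under the hypergeometric null.

    k          -- flagged (differentially expressed) genes inside the category
    n_changed  -- flagged genes overall
    n_category -- genes annotated to the category
    n_total    -- all genes considered (hypothetical background size)
    """
    upper = min(n_changed, n_category)
    denominator = comb(n_total, n_changed)
    # Sum the hypergeometric probabilities of observing k or more hits.
    tail = sum(
        comb(n_category, i) * comb(n_total - n_category, n_changed - i)
        for i in range(k, upper + 1)
    )
    return tail / denominator


if __name__ == "__main__":
    # Hypothetical counts: 8 of 30 flagged genes fall into a 50-gene category.
    print(fisher_enrichment_pvalue(8, 30, 50, 1000))
```

A small p-value indicates that the category contains more flagged genes than expected by chance; correcting for multiple categories (the FDR) is a separate step that is not sketched here.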
One analytical approach is to select from the GO DAG an interesting subset of categories that meet two filtering criteria:

the p-value or the FDR does not exceed a threshold, and

the number of genes in a category lies in a prespecified range of interest, since categories that are too small (containing only a few genes) or too large (such as the whole ontology) may be considered uninformative.
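The two filtering criteria can be sketched as a simple predicate over category records. The `Category` record and the `select_categories` function are hypothetical names introduced for this example; the default bounds (size 40 to 140, threshold 0.05) mirror the import parameters reported for the stellate cell data set in the Results section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    """Hypothetical record for one GO category after a GoMiner-style analysis."""
    name: str
    genes: frozenset
    p_value: float
    fdr: float


def select_categories(categories, threshold=0.05, use_fdr=False,
                      min_size=40, max_size=140):
    """Keep categories that are significant and neither too small nor too large."""
    selected = []
    for cat in categories:
        score = cat.fdr if use_fdr else cat.p_value
        if score <= threshold and min_size <= len(cat.genes) <= max_size:
            selected.append(cat)
    return selected


if __name__ == "__main__":
    demo = [
        Category("small", frozenset(range(5)), 0.001, 0.01),
        Category("medium", frozenset(range(60)), 0.01, 0.04),
        Category("medium_ns", frozenset(range(60)), 0.30, 0.50),
        Category("huge", frozenset(range(5000)), 0.001, 0.01),
    ]
    print([c.name for c in select_categories(demo)])
```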
Reasoning with area-proportional Euler diagrams
The scenario of Figure 1 (right) could be described by the following syllogisms:
(a1) All Cs are As
(a2) No Cs are Bs
(a3) Some As are Bs
(a4) All Ds and Es are Bs
(a5) All Fs are Es and Ds
(a6) Some Es are Ds
(a7) No Es are As
(a8) Some Ds are As
From the Euler diagram representation or the previously defined rule set the following relations can be inferred:
(a5) + (a4) + (a7) ⇒ (b1) No Fs are As
(a2) + (a4) + (a5) ⇒ (b2) No Fs are Cs
(a2) + (a4) ⇒ (b3) No Ds are Cs
In addition, the (approximate) area-proportionality enables assessment of the number of elements in the sets, and leads to the following inferences:
(c1) |A| ≈ |B|
(c2) |E| ≈ |D|
(c3) |C| < |E| and |C| < |D|
(c4) |F| < |C|
(c5') |D ∩ B| < |A ∩ B|
The last conclusion (c5') has to be verified against the exact cardinalities, because the overlaps need not be strictly proportional to the area; the visualization depends on the concrete set family and the parameters of the cost functional. It is important to check for missing intersections if the Euler arrangement is not able to express all set relations fully. However, missing intersections occurred for GO data.
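The inferences above can be checked mechanically on a concrete set family. The sketch below uses a small, invented family of integer sets that satisfies premises (a1) to (a8); the element values and the function names are hypothetical and serve only to show how the derived relations and the size comparison of the last conclusion would be verified against exact cardinalities.

```python
def premises_hold(s):
    """Check the premises (a1)-(a8) on a dict of Python sets."""
    A, B, C, D, E, F = (s[k] for k in "ABCDEF")
    return all([
        C <= A,                 # (a1) all C are A
        C.isdisjoint(B),        # (a2) no C is B
        bool(A & B),            # (a3) some A are B
        D <= B and E <= B,      # (a4) all D and E are B
        F <= E and F <= D,      # (a5) all F are E and D
        bool(E & D),            # (a6) some E are D
        E.isdisjoint(A),        # (a7) no E is A
        bool(D & A),            # (a8) some D are A
    ])


def derived_relations(s):
    """Evaluate the inferred relations directly on the exact sets."""
    A, B, C, D, E, F = (s[k] for k in "ABCDEF")
    return {
        "no F is A": F.isdisjoint(A),
        "no F is C": F.isdisjoint(C),
        "no D is C": D.isdisjoint(C),
        "|D n B| < |A n B|": len(D & B) < len(A & B),
    }


# Invented example family; the numbers carry no biological meaning.
EXAMPLE = {
    "A": set(range(1, 10)),
    "B": set(range(5, 15)),
    "C": {1, 2, 3},
    "D": {9, 10, 11, 12},
    "E": {11, 12, 13, 14},
    "F": {11, 12},
}

if __name__ == "__main__":
    print(premises_hold(EXAMPLE))
    for relation, value in derived_relations(EXAMPLE).items():
        print(relation, value)
```

The first three relations follow from the premises for any set family, whereas the size comparison depends on the actual cardinalities and may fail for other data.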
Results
Visualization results
Simulation results
We also compared the two biologically inspired optimization strategies, evolutionary optimization (EO) and particle swarm optimization (PSO). The quality of the solution was assessed by the value of the cost function (E; see Methods section) and the number of optimization steps required to reach a stable solution. Toward that end, we varied the parameters of the two optimization algorithms. We used 20 different settings for the compactness term (delta parameter; see Methods), equally spaced in the range [0, 2000], with 20 runs (using different seed values for the random number generator) for each data set. Further, the number of individuals (for the EO) or particles (for the PSO) was set to 50 with a maximum of 500 iterations. If the best individuals/particles could not improve the cost function within 50 iterations, the optimization was stopped.
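The simulation protocol described above can be sketched as a parameter grid plus a stall-based stopping rule. This is an illustrative sketch only: the function names are hypothetical, the grid is assumed to include both end points of [0, 2000], and `propose_cost` stands in for one optimizer iteration that reports the current best cost.

```python
def delta_grid(low=0.0, high=2000.0, steps=20):
    """Equally spaced delta values; including both end points is an assumption."""
    width = (high - low) / (steps - 1)
    return [low + i * width for i in range(steps)]


def optimize_until_stalled(propose_cost, max_iter=500, patience=50):
    """Run until max_iter is reached or the best cost stalls for `patience` steps.

    propose_cost(iteration) is a hypothetical callback returning the best cost
    found by the optimizer in that iteration.
    """
    best = float("inf")
    stalled = 0
    iteration = 0
    for iteration in range(1, max_iter + 1):
        cost = propose_cost(iteration)
        if cost < best:
            best, stalled = cost, 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return best, iteration


# One run per (delta, seed) pair: 20 deltas x 20 seeds per data set.
RUNS = [(delta, seed) for delta in delta_grid() for seed in range(20)]

if __name__ == "__main__":
    print(len(RUNS))
    print(optimize_until_stalled(lambda i: max(10 - i, 0)))
```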
For the stellate cell data set (import parameters: minimum category size 40, maximum category size 140, maximum p-value of 0.05) the evaluation resulted in a total of 400 simulations for each of the two optimization strategies (EO versus PSO). An unpaired one-sided Wilcoxon rank sum test revealed significantly lower cost function values (p < 2.2 · 10^{-16}) and a significantly lower number of iteration steps (p < 2.2 · 10^{-16}) for the PSO.
In addition to testing the real-world data set, we performed simulations on 10 random set families (a total of 4000 simulations for each algorithm; details of the simulations are given in the supplementary information [9]). Each of those 10 families consisted of 5 sets. The pooled results (for both algorithms) gave a p-value of 4.567 · 10^{-10} for the value of the cost function, and a p-value below 2.2 · 10^{-16} for the number of iteration steps (both unpaired one-sided Wilcoxon rank sum tests).
Discussion and Conclusion
Analyzing functional annotations of genes and gene products is becoming increasingly important in the comprehensive GO analysis of microarray data. Identification of functional interrelations between differentially expressed genes detected by GoMiner contributes substantially to uncovering fundamental biological programs and superordinate pathways reflected in the transcriptional differences. Displaying the results of such analyses has remained challenging.
We have introduced a new method for visualizing annotated gene sets as overlapping circles in the plane. The approach is loosely related to other procedures such as Venny [10] (4-set Venn diagrams, no area proportionality) and TreeEASE [11, 12] (which uses hierarchical clustering), or GoMiner to find functionally related genes, which are then annotated. Our focus is on a semi-quantitative visualization that could be performed after such analyses. Although there is in general no perfect solution for these area-proportional Euler diagrams using circles or regular polygons, the proposed approach leads to easily interpretable visualizations.
We draw diagrams with zero-size zones that are shaded, in accord with the original visualization of Venn diagrams [13]. The proposed type of Euler diagram is appropriate only for problems involving a relatively small number of intersections, a situation that often pertains to data originating from the GO database, since those data are naturally hierarchically structured. Area-proportional Euler diagrams are, in most cases, a trade-off between accuracy of the intersection areas and meaningful polygon arrangements without missing faces (= inconsistencies) and without too many empty faces (which are shaded). Therefore, we suggest several alternative formulations of the cost function to focus on different aspects of the data, such as the importance attached to intersections involving many sets (weights w_{k}) and the importance attached to giving equal weight to elements (genes) or groups (GO categories) (see error function f_{1} versus f_{2}, respectively, in the Methods section).
The simulations produced the rather unexpected result that the PSO outperformed the EO, both in generating solutions with a lower cost and in faster convergence. The momentum inherent in the PSO seems to be better suited to the graphical optimization situation. A possible further improvement could be achieved by using a gradient descent optimization (similar to those proposed in [14]) for fine-tuning a coarse solution from the evolutionary strategy. Gradient descent alone is not able to find the optimal solution, since for more than 3 sets, local minima exist in the error function. On the other hand, it is impractical to differentiate the cost function analytically, and an approximation of the gradient is computationally expensive (compare the complexity estimation in the Methods section). Therefore, a gradient descent algorithm seems not to be particularly appropriate for this problem.
In summary, we have developed a method for visualizing set relationships that extends the inferences that can be expressed by DAGs. Intersections in different branches can now be visualized. The approach is implemented as an interactive application specifically designed for use with GoMiner in the context of the GO database. It has been integrated directly into the original GUI GoMiner software and is compatible with High-Throughput GoMiner.
Methods
It was demonstrated by Chow and Ruskey [14, 15] that the task of visualizing intersecting sample sets by area-proportional Euler diagrams is in general not perfectly solvable for more than two sets with circles in the plane. We therefore defined a cost function reflecting the conflicting constraints of circle overlap and cardinality of the intersection set and sought the best compromise solutions employing evolutionary and swarm approaches for optimization [for details see Additional file 1].
Cost function
We propose a cost functional E mapping the regular polygon (or circle) centers to an error value describing the goodness of the solution. The function E includes a trade-off between the correct graphical intersection areas and the true set sizes. The problem is first partitioned into disjoint, independently solvable subproblems. That can be accomplished by finding the connected components of an intersection graph that has one vertex for each set and edges that connect intersecting sets. The connected components can be found using a depth-first search (compare [16, 17]) which takes O(n + m) steps, where n is the number of vertices (sets) and m is the number of edges (which can be at most n(n − 1)/2). The resulting complexity is O(n^{2}) to partition the problem. In the following it is therefore assumed that all sets have at least one intersecting partner. Let A_{1} ... A_{m} ⊆ $\mathcal{U}$ be a sequence of intersecting subsets of the overall gene set $\mathcal{U}$ and let G_{1} ... G_{m} ⊆ ℝ^{2} be a graphical two-dimensional representation of the sets.
with a distance function d(g, c) and constants α, β, γ ≥ 0 allowing different weights on the three cases: unwanted graphical overlaps, missing graphical intersections, and area deviations.
In general one should select β > γ.
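The partitioning step described at the beginning of this section (connected components of the intersection graph, found by depth-first search) can be sketched as follows. The function name and the example sets are hypothetical; the sketch is not taken from the VennMaster source code.

```python
def intersection_components(sets):
    """Split a family of named sets into independently solvable groups.

    Vertices are set names; an edge joins two sets that share an element.
    Components are found with an iterative depth-first search.
    """
    names = list(sets)
    adjacency = {name: [] for name in names}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if sets[first] & sets[second]:
                adjacency[first].append(second)
                adjacency[second].append(first)

    seen = set()
    components = []
    for start in names:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


if __name__ == "__main__":
    demo = {"A": {1, 2}, "B": {2, 3}, "C": {7}, "D": {7, 8}}
    print(intersection_components(demo))
```

Building the graph by pairwise comparison accounts for the quadratic number of set pairs; the search itself is linear in vertices plus edges.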
An error function evaluation requires O(Lm 2^{m−1}) computation steps when using polygons with L edges (intersecting two polygons with M and N edges can be computed in O(M + N) with O'Rourke's algorithm [18]). For problems with ≥ 8 categories, time and space limitations may make it necessary to reduce the complexity by observing only intersection sets I with |I| ≤ K for an upper bound K. The probability that a perfect diagram exists for a highly intersecting group of sets (up to a high level of intersections) is nevertheless very low. The size of an Euler diagram is defined as the number of faces the diagram should have to reflect all intersections occurring in the data: e(I) = |{I ⊆ {1 ... m} : A(I) ≠ ∅}|.
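To make the notion of diagram size concrete, the sketch below counts the non-empty zones of a set family. It rests on an assumption that is not spelled out in the surviving text: A(I) is taken to be the exclusive zone of I, i.e. the elements contained in every set indexed by I and in no other set. The function name and the optional `max_order` argument (playing the role of the upper bound K) are hypothetical.

```python
from collections import defaultdict


def nonempty_zones(sets, max_order=None):
    """Map each index set I (a frozenset of set names) to its exclusive zone.

    Assumption: a zone holds the elements that belong to exactly the sets
    named in I. Grouping elements by membership avoids enumerating all
    2^m index sets explicitly.
    """
    membership = defaultdict(set)
    for name, members in sets.items():
        for element in members:
            membership[element].add(name)

    zones = defaultdict(set)
    for element, names in membership.items():
        zones[frozenset(names)].add(element)

    if max_order is not None:
        # Observe only index sets that involve at most max_order sets.
        return {I: z for I, z in zones.items() if len(I) <= max_order}
    return dict(zones)


if __name__ == "__main__":
    demo = {"A": {1, 2, 3}, "B": {3, 4}, "C": {5}}
    zones = nonempty_zones(demo)
    print(len(zones))  # number of faces a perfect diagram would need
```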
The current implementation enables one to observe the partial error f for each intersection set A(I), I ⊆ {1 ... m}. In the following we propose some extensions of the previous visualization scheme:
i) To allow for better adaptation and reduction of the unwanted regions (marked in gray), the solution space was extended by allowing the optimization to vary the polygon areas in a certain range such that the order conditions area(G_{π(1)}) ≤ ... ≤ area(G_{π(m)}) were preserved with a permutation π such that |A_{π(1)}| ≤ ... ≤ |A_{π(m)}|. Sets with equal cardinalities were represented by equal graphical areas. Only radial scaling of the polygons was allowed. Unfortunately, this strategy did not improve the visual representation even though the current implementation neglected the order criteria and so had more freedom to adapt (the solutions found were certainly not more informative).
ii) If using many sets, the scaling must be chosen small in order to fit the polygons into the unit box [0, 1]^{2}. Therefore, the empty space in relation to the polygon areas will be very large, and the optimization may take a long time or may not produce a plausible solution (in those cases, the graphical representation is unconnected).
with the weighting parameter δ ≥ 0.
To avoid local minima, the cost function E is minimized over the polygon centers (shape and orientation of the polygons remain fixed) using a swarm optimization algorithm [19] and an evolutionary strategy with self-adapting mutation rates [20].
Particle swarm optimization
with the positive acceleration constants c_{glob} and c_{loc}. The global best solution (having the maximum fitness value) among all particles for all t' ≤ t is defined as ${x}_{t}^{(glob)}$; the local best solution for a single particle j is ${x}_{t}^{(j,loc)}$. The velocities are restricted to the interval [−v_{max}, v_{max}] for each axis. The interaction among particles is regulated by the influence of the global and local optima on the velocity term. $\mathcal{U}$ [0, 1) is a uniform random variate generating a number from the interval [0, 1). One variation of the method is to restrict the locations to the bounding box [0, 1]^{n} and to change the sign of the velocity components of the respective dimensions such that the particles bounce back from the wall.
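The update rule itself is not reproduced in this text, so the following sketch falls back on the textbook particle swarm update without an inertia weight; that choice, the default constants, and all function and parameter names are assumptions rather than the VennMaster implementation. It does follow the elements described above: acceleration constants c_glob and c_loc, uniform variates from [0, 1), velocity clamping to [−v_max, v_max], and reflection at the walls of the unit box.

```python
import random


def pso_minimize(cost, dim, n_particles=50, max_iter=500,
                 c_glob=1.5, c_loc=1.5, v_max=0.1, seed=0):
    """Minimize `cost` over the unit box [0, 1]^dim with a basic particle swarm."""
    rng = random.Random(seed)
    xs = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vs = [[rng.uniform(-v_max, v_max) for _ in range(dim)] for _ in range(n_particles)]
    loc_best = [x[:] for x in xs]
    loc_cost = [cost(x) for x in xs]
    g = min(range(n_particles), key=loc_cost.__getitem__)
    glob_best, glob_cost = loc_best[g][:], loc_cost[g]

    for _ in range(max_iter):
        for j in range(n_particles):
            for d in range(dim):
                v = (vs[j][d]
                     + c_glob * rng.random() * (glob_best[d] - xs[j][d])
                     + c_loc * rng.random() * (loc_best[j][d] - xs[j][d]))
                v = max(-v_max, min(v_max, v))   # clamp velocity per axis
                x = xs[j][d] + v
                if x < 0.0:                      # bounce back from the wall
                    x, v = -x, -v
                elif x > 1.0:
                    x, v = 2.0 - x, -v
                xs[j][d], vs[j][d] = x, v
            c = cost(xs[j])
            if c < loc_cost[j]:
                loc_best[j], loc_cost[j] = xs[j][:], c
                if c < glob_cost:
                    glob_best, glob_cost = xs[j][:], c
    return glob_best, glob_cost


def demo_cost(x):
    """Hypothetical smooth test function with its minimum at (0.3, ..., 0.3)."""
    return sum((xi - 0.3) ** 2 for xi in x)


if __name__ == "__main__":
    print(pso_minimize(demo_cost, dim=2, max_iter=100))
```

In the application described here, the position vector would hold the polygon centers and `cost` would be the cost function E; the quadratic test function only demonstrates the mechanics.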
Evolutionary optimization
A generation contains N individuals each representing a permissible solution of the problem. In the following step each individual is mutated, and its fitness is evaluated by the previously defined cost function E. An individual is replicated with frequencies proportional to its fitness rank, thus generating offspring until the original generation size is reached. The best individual is always transferred unchanged into the new generation. The process of mutation and replication is repeated until the best individual does not change over a certain number of steps.
An individual consists of a parameter vector ${v}_{1}^{t}\cdots {v}_{m}^{t}\in {\mathbb{R}}^{2}$ representing the polygon centers and a vector σ^{t} ∈ ${\mathbb{R}}_{+}^{m}$ describing the mutation rate for each parameter. The first population is initialized with uniformly distributed random values such that each parameter stays in a certain range, i.e. the polygons must be enclosed by the bounding box [0, 1]^{2} and the mutation parameters have to be contained in the interval [τ_{lower}, τ_{upper}] with 0 < τ_{lower} < τ_{upper}.
where $\mathcal{N}$(0, σ) represents a normally distributed variate with mean 0 and variance σ. After each mutation, all parameters are restricted to meet the above conditions.
Evolutionary selection and offspring generation are performed by assigning each individual a rank r = 1 ... N according to its fitness as determined according to the value of the cost functional E or E' such that the best individual (the one with the lowest cost or highest fitness) has r = 1. Each individual is then replicated a number of times inversely proportional to its rank value. Therefore, an individual with rank r will have at most [qN/r] (for a fixed 0 <q < 1) offspring. Starting with the highest rank r = 1 the new population is filled up until the size N is reached and the new generation is complete. All but the first individual (the fittest of the last generation) are mutated.
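The rank-based replication can be sketched as follows. Two points are assumptions of this example: the bracket in [qN/r] is read as rounding up, which guarantees that the new generation can always be filled, and the `mutate` callback is a placeholder for the self-adapting mutation, which is not reproduced here. The function names are hypothetical.

```python
from math import ceil


def offspring_counts(n, q=0.5):
    """Copies per rank r = 1..n: at most ceil(q*n/r), filled until n is reached."""
    counts = []
    remaining = n
    for rank in range(1, n + 1):
        copies = min(ceil(q * n / rank), remaining)
        counts.append(copies)
        remaining -= copies
    return counts


def next_generation(population, costs, mutate, q=0.5):
    """Build the next generation; the best individual is kept unmutated."""
    order = sorted(range(len(population)), key=costs.__getitem__)  # rank 1 = lowest cost
    counts = offspring_counts(len(population), q)
    parents = [population[i] for i, c in zip(order, counts) for _ in range(c)]
    return [parents[0]] + [mutate(p) for p in parents[1:]]


if __name__ == "__main__":
    print(offspring_counts(10, q=0.5))
    pop = ["a", "b", "c", "d"]
    print(next_generation(pop, [3.0, 1.0, 2.0, 4.0], mutate=str.upper))
```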
The optimization process is stopped when the cost functional of the best individual does not improve over a certain number of steps or the number of generations exceeds an upper bound.
Implementation of VennMaster
The visualization approach was implemented as a platform-independent open source Java application, which is available online [9]. The application allows interactive exploration of Euler diagrams and was tested under Windows XP, Linux, and Mac OS X using the Java Runtime Environment 1.5 [21].
When one touches a polygon with the cursor, its area is highlighted and the involved group names and the cardinality of the intersection set are shown. Among many other parameters involving the evolutionary strategy and the error function, the number of edges of the polygons can be configured. Those settings can be exported and imported in XML format [22]. Furthermore, a gene list of the selected intersection set(s) is shown in an information field. Unresolved intersections (for which no corresponding polygon intersection exists) are listed in the field "Inconsistencies". For each set or intersection set, a text label can be attached (see Figure 3). Labels and polygons can be moved by drag and drop (the cost function will be updated immediately), so the user can interactively modify the configuration and may restart the optimization process on the changed arrangement. Set positions can be locked so that they will not be moved by the optimizer. The optimization process can be controlled via a parameter dialog (see supplementary information [9]). The area-proportional Euler diagrams may be saved as JPEG or SVG (Scalable Vector Graphics, an XML-based graphics format [23]).
Integration with GoMiner
GoMiner [4] is a tool for biological interpretation of "omic" data – including data from gene expression microarrays. Omic experiments often generate lists of dozens or hundreds of genes that differ in expression between samples. GoMiner uses the Gene Ontology to identify the biological processes, functions and cellular components represented in these lists, and groups the genes into biologically coherent categories. The ability to import files from GoMiner was included in the software to permit analysis of functional categories of differentially expressed genes. For the GoMiner files, categories were prefiltered so that the number of genes in the included categories would lie in a certain definable range and would not exceed a given FDR or p-value. Alternatively, a simple tab-delimited file format with an element/group pair in each line can be used as input. The current version enables the export of an error profile listing all partial errors for each non-empty set combination I ⊆ {1 ... n} such that |G(I)| > 0 or A(I) ≠ ∅. Each line contains the values I, |I|, |G(I)|, |A(I)|, and f(I).
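Reading the simple tab-delimited input into sets can be sketched as below. The column order (element first, group second) is an assumption based on the phrase "element/group pair", and the function name and gene symbols are hypothetical; the sketch does not reproduce the VennMaster importer.

```python
from collections import defaultdict


def read_element_group_pairs(lines):
    """Collect one set of elements per group from 'element<TAB>group' lines."""
    groups = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        element, group = line.split("\t")[:2]
        groups[group.strip()].add(element.strip())
    return dict(groups)


if __name__ == "__main__":
    demo = [
        "geneA\tsignal transduction\n",
        "geneB\tsignal transduction\n",
        "geneA\tnucleus\n",
        "\n",
    ]
    for group, elements in read_element_group_pairs(demo).items():
        print(group, sorted(elements))
```

The same function accepts an open text file, since iterating over a file object yields its lines.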
Availability and requirements
Declarations
Acknowledgements
We thank Bernd Ruoss for Java support. This research was funded in part by a "Forschungsdozent" grant through the Stifterverband für die Deutsche Wissenschaft and by the German Science Foundation (SFB 518, Project C05) to HAK. This research was supported in part by the Intramural Research Program of the NIH, National Cancer Institute, Center for Cancer Research.
Authors’ Affiliations
References
1. The Gene Ontology Consortium: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004, 32: D258–D261.
2. Gil J, Howse J, Tulchinskiy E: Positive semantics of projections in Venn-Euler diagrams. In Diagrams LNAI 1889. Edited by: Anderson M, Cheng P, Haarslev V. Springer Verlag; 2000:7–25.
3. Kestler HA, Müller A, Gress TM, Buchholz M: Generalized Venn diagrams: a new method of visualizing complex genetic set relations. Bioinformatics 2005, 21: 1592–1595.
4. Zeeberg BR, Feng W, Wang G, Wang MD, Fojo AT, Sunshine M, Narasimhan S, Kane DW, Reinhold WC, Lababidi S, Bussey KJ, Riss J, Barrett JC, Weinstein JN: GoMiner: A Resource for Biological Interpretation of Genomic and Proteomic Data. Genome Biology 2003, 4: R28.
5. Zeeberg BR, Qin H, Narasimhan S, Sunshine M, Cao H, Kane DW, Reimers M, Stephens RM, Bryant D, Burt SK, Elnekave E, Hari DM, Wynn TA, Cunningham-Rundles C, Stewart DM, Nelson D, Weinstein J: High-Throughput GoMiner, an 'industrial-strength' integrative gene ontology tool for interpretation of multiple-microarray experiments, with application to studies of Common Variable Immune Deficiency (CVID). BMC Bioinformatics 2005, 6: 168.
6. GoMiner [http://discover.nci.nih.gov/gominer/]
7. Allwein G, Barwise J: Logical Reasoning with Diagrams. Oxford University Press; 1996.
8. Buchholz M, Kestler HA, Holzmann K, Ellenrieder V, Schneiderhan W, Siech M, Adler G, Bachem MG, Gress TM: Transcriptome analysis of human hepatic and pancreatic stellate cells: Evidence for common cell lineage and function. J Molecular Medicine 2005, 83: 795–805. MB and HAK contributed equally.
9. Supplementary information and VennMaster software [http://www.informatik.uni-ulm.de/ni/mitarbeiter/HKestler/vennhyp/]
10. Oliveros JC: VENNY: An interactive tool for comparing lists with Venn diagrams. 2007. [http://bioinfogp.cnb.csic.es/tools/venny/index.html]
11. Saeed A, Sharov V, White J, Li J, Liang W, Bhagabati N, Braisted J, Klapa M, Currier T, Thiagarajan M, Sturn A, Snuffin M, Rezantsev A, Popov D, Ryltsov A, Kostukovich E, Borisovsky I, Liu Z, Vinsavich A, Trush V, Quackenbush J: TM4: a free, open-source system for microarray data management and analysis. Biotechniques 2003, 34(2): 374–378.
12. Hosack D, Dennis G Jr, Sherman B, Lane H, Lempicki R: Identifying biological themes within lists of genes with EASE. Genome Biology 2003, 4: R70–R70.8.
13. Venn J: On the diagrammatic and mechanical representation of propositions and reasoning. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 1880, 9: 1–18.
14. Chow S, Rodgers P: Extended Abstract: Constructing Area-Proportional Venn and Euler Diagrams with Three Circles. Euler Diagrams INRIA 2005, 9–12.
15. Chow S, Ruskey F: Drawing Area-Proportional Venn and Euler Diagrams. In Graph Drawing. Volume 2912. Edited by: Giuseppe Liotta. Springer Verlag; 2004:466–477.
16. Cormen TH, Leiserson CE, Rivest RL: Introduction to Algorithms. MIT Press; 1989.
17. Skiena SS: The Algorithm Design Manual. Springer Verlag; 1998.
18. O'Rourke J: Computational Geometry in C. Second edition. Cambridge University Press; 2000.
19. Kennedy J, Eberhart R: Swarm Intelligence. Morgan Kaufmann; 2001.
20. Bäck T: Evolutionary algorithms in theory and practice. Oxford University Press; 1996.
21. Java Runtime Environment [http://java.sun.com]
22. Extensible Markup Language (XML) [http://www.w3.org/XML/]
23. Scalable Vector Graphics (SVG) [http://www.w3c.org/Graphics/SVG/]
24. Cancer Biomedical Informatics Grid [https://cabig.nci.nih.gov/]
25. Buetow KH: Cyberinfrastructure: Empowering a "Third Way" in Biomedical Research. Science 2005, 308(5723):821–824.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.