A visual analytics approach for understanding biclustering results from microarray data

Santamaría, Rodrigo; Therón, Roberto; Quintales, Luis

doi:10.1186/1471-2105-9-247

Methodology article
Open access
Published: 27 May 2008

BMC Bioinformatics volume 9, Article number: 247 (2008)

Abstract

Background

Microarray analysis is an important area of bioinformatics. In the last few years, biclustering has become one of the most popular methods for classifying data from microarrays. Although biclustering can be used in any kind of classification problem, nowadays it is mostly used for microarray data classification. A large number of biclustering algorithms have been developed over the years; however, little effort has been devoted to the representation of the results.

Results

We present an interactive framework that helps to infer differences or similarities between biclustering results, to unravel trends and to highlight robust groupings of genes and conditions. These linked representations of biclusters can complement biological analysis and reduce the time spent by specialists on interpreting the results. Within the framework, besides other standard representations, a visualization technique is presented which is based on a force-directed graph where biclusters are represented as flexible overlapped groups of genes and conditions. This microarray analysis framework (BicOverlapper) is available at http://vis.usal.es/bicoverlapper

Conclusion

The main visualization technique, tested with different biclustering results on a real dataset, allows researchers to extract interesting features of the biclustering results, especially the highlighting of overlapping zones that usually represent robust groups of genes and/or conditions. The visual analytics methodology will permit biology experts to study biclustering results without inspecting an overwhelming number of biclusters individually.

Background

Biclustering

Microarray experiments determine the transcript abundance of an organism's genes under different conditions. Microarray analysis tries to identify groups of genes that exhibit similar behavior under certain conditions.

One of the main methods to analyze microarray data is biclustering, an unsupervised technique that has become widespread in recent years (see [1] for a survey). Biclustering outperforms traditional clustering because of its two main characteristics: simultaneous grouping of genes and conditions, and overlapping. Simultaneous grouping means that biclusters (the groups found by biclustering algorithms) group genes with similar behavior under a certain number of conditions (thus, the bicluster will group genes and conditions), while traditional clustering techniques only group genes with similar behavior across all the conditions (or vice versa). This characteristic makes biclusters better fitted to biological behavior in several circumstances, for example, when an interesting cellular process is active only in a subset of the conditions. Although it is unusual for the subsets of genes grouped by two different clusters to intersect, overlapping is an intrinsic characteristic of biclusters. If two biclusters B₁ and B₂ that group genes G₁ and G₂ and conditions C₁ and C₂, respectively, have G₁ ∩ G₂ ≠ ∅ and/or C₁ ∩ C₂ ≠ ∅, it is said that B₁ and B₂ overlap. Overlapping gives biclusters the flexibility to represent biological circumstances such as genes that participate in multiple pathways active under a subset of conditions.
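The overlap condition above can be sketched in a few lines of Python. The `Bicluster` class and the `overlaps` function are hypothetical names introduced only for illustration; they are not part of BicOverlapper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bicluster:
    """Hypothetical container: a set of gene names and a set of condition names."""
    genes: frozenset
    conditions: frozenset


def overlaps(b1: Bicluster, b2: Bicluster) -> bool:
    """True if the biclusters share at least one gene and/or one condition."""
    return bool(b1.genes & b2.genes) or bool(b1.conditions & b2.conditions)


# Toy example with made-up gene and condition names.
b1 = Bicluster(frozenset({"g1", "g2"}), frozenset({"c1", "c2"}))
b2 = Bicluster(frozenset({"g2", "g3"}), frozenset({"c3"}))
b3 = Bicluster(frozenset({"g4"}), frozenset({"c4"}))
print(overlaps(b1, b2), overlaps(b1, b3))
```

Here `b1` and `b2` overlap through the shared gene `g2`, while `b3` shares nothing with `b1`.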

Visualization of single biclusters

The most widespread visualization technique to represent a single bicluster is the heatmap, which is used in several popular tools [2–4]. In a heatmap (Fig. 1a) genes are displayed as the rows, and conditions as the columns, of a matrix A, where element a_ij is the transcript abundance of gene i under condition j. Each element a_ij is then represented as a square colored according to its transcript abundance. To draw a bicluster B_k that groups a subset of genes G_k and conditions C_k, the heatmap is reordered so that the G_k rows and C_k columns appear together, usually in the upper left section of the matrix (Fig. 1b).
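As a minimal sketch of this reordering, assuming the matrix is held as a NumPy array and the bicluster is given as lists of row and column indices (the function name `reorder_for_bicluster` is hypothetical):

```python
import numpy as np


def reorder_for_bicluster(a, gene_idx, cond_idx):
    """Move the bicluster's rows and columns to the upper-left corner of the matrix."""
    gene_set, cond_set = set(gene_idx), set(cond_idx)
    # Bicluster rows/columns first, then the remaining ones in their original order.
    rows = list(gene_idx) + [i for i in range(a.shape[0]) if i not in gene_set]
    cols = list(cond_idx) + [j for j in range(a.shape[1]) if j not in cond_set]
    return a[np.ix_(rows, cols)], rows, cols


a = np.arange(12).reshape(4, 3)  # toy 4 genes x 3 conditions matrix
reordered, rows, cols = reorder_for_bicluster(a, gene_idx=[2, 3], cond_idx=[1])
print(reordered)
```

The returned `rows` and `cols` permutations can be used to relabel the heatmap axes. With several overlapping biclusters, no single permutation places all of them in contiguous blocks, which is the geometrical limitation discussed next.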

Heatmaps usually satisfy the purpose of inspecting a single bicluster. Unfortunately, they have geometrical limitations when representing several biclusters simultaneously, especially if they overlap (see Fig. 1c). BiVoc [5] addresses this problem by repeating rows and columns to properly represent overlapped biclusters. Although it is a useful tool and implements a method that minimizes the number of repeated rows and columns, this replication could lead to ambiguities and misinterpretations.

Parallel coordinates [6] have also been used to represent biclusters, but they are less widespread than heatmaps. In this technique, the gene g_i is an m-dimensional point p_i = (a_{i1}, a_{i2}, ..., a_{im}) where a_{ik} is the transcript abundance of g_i under condition c_k. Conditions are visualized as vertical axes and genes as lines joining the corresponding transcript abundances (Fig. 2).

In this case, a bicluster with n' genes and m' conditions is represented by n' lines corresponding to the genes, and by rearranging or somehow highlighting the axes corresponding to the m' conditions. When we try to represent several biclusters with this method, geometrical problems arise again because of the large number of lines and the overlapping of several different ones. BicAT [3] and BiVisu [4] use parallel coordinates to display single biclusters. However, their representations are limited. BicAT does not rearrange bicluster conditions, as it simply marks their corresponding axes with vertical lines (making it hard to visualize the whole bicluster). On the other hand, BiVisu only visualizes gene profiles under the conditions in the bicluster, losing context information for other conditions, which could be related but not grouped by the bicluster. None of these methods provides interactive thresholds to manipulate the display.

Visualization of multiple clusters

As in the case of single biclusters, the most widespread technique used to visualize multiple clusters from a single clustering is the heatmap. Usually heatmaps are used together with dendrograms, as introduced by Treeview [7]. This way, the hierarchical clustering is represented in a tree and the heatmap rows are rearranged to fit the clusters found. Sometimes the attached dendrogram can also be used to visually vary the clustering threshold to check the robustness of clusters (see Fig. 3). Usually, clustering is applied to rows (genes) and to columns (conditions), so both dimensions are rearranged and two dendrograms are displayed. Treeview has been enhanced [8], adding a scatterplot visualization for one-by-one condition comparison of transcription levels and a "karyoscope" visualization that represents the transcription levels of the genes under one condition, ordered as they are located in the chromosomes.

gCLUTO [9] uses a variation of this heatmap visualization to represent clusters from hierarchical clustering, including the representation of clustering averages for rows and/or columns. In addition, it introduces mountain maps, a 3D visualization technique (see Fig. 4) that displays clusters simultaneously by means of projections onto a 2D space, while the third dimension is used to represent geometrical properties of the mountains (height, width, slope, etc.) that are used to represent properties of the clusters (size, homogeneity, etc.). However, clusters only group genes (not conditions) with similar transcription levels under all the conditions, and therefore its adaptation to biclusters is not satisfactory.

Hibbs et al. [10] take advantage of a linked-views approach, so two visualizations, heatmaps and cluster projections, are displayed simultaneously, boosting the visual analysis. The projection used is similar to that of gCLUTO but now in a 3D space. It improves heatmap representation by assigning colors by rank and by visualizing cluster averages. As in gCLUTO, projections are useful, but because of the reduction of dimensionality that they require, some information is lost. Although this is not so important when representing clusters, it becomes an issue with biclusters, where overlapping is a main characteristic and projections usually fail to convey actual overlap between biclusters.

Related Biological Knowledge

Besides the input (the microarray data matrix) and the outcome of the analysis (for our purpose, biclusters), additional information is available from previous biological studies. This information is usually structured in ontologies. For example, in the case of genes and proteins from eukaryotes, there is the Gene Ontology [11], a dynamic, controlled vocabulary that describes all known biological processes, molecular functions and cellular components associated with them. On the other hand, Transcription Regulation Networks (TRN) represent transcription relationships between genes. In these networks, nodes correspond to genes, and an edge from node a to b means that gene a transcriptionally regulates (activates or inhibits) gene b.

This information can be used to partially guide the bicluster search, or to validate the biclusters found. Note that although this information may be helpful for finding or validating groups, it is rarely complete and grows every day with new biological discoveries (as an example, the TRN for Escherichia coli increased from 577 relationships in 2002 [12] to 2724 in 2004 [13]). Also, its use may bias the search for biclusters towards already known groups, limiting the knowledge discovery capability of the methods.

Some of the visualization tools discussed above make use of ontologies to complement their displays, either embedded in the visualization [2] or from a web navigator [8]. There are also several tools that visualize TRNs (for example, Cytoscape [14] or Hawkeye [15]) and link them with other biological knowledge, but it is more difficult to find tools that link TRNs with clustering or biclustering results.

Motivation

As described above, the display of several clusters and single biclusters is well known, but the visualization of several biclusters is an almost entirely new area of study. The ability to visualize several biclusters in the same display speeds up the understanding of relationships among the different biological groups represented by biclusters. Specifically, it makes it possible:

To find genes with similar biological functions, or conditions that similarly affect a particular group of genes. This is given by each bicluster alone, but the relevance of these relationships grows if several biclusters coincide in them (forming sorts of 'super-biclusters').
To trace third-party relationships among biclusters, helping to find, for example, two groups of genes related under different groups of conditions, but also with some conditions in common. Finding these common genes or conditions ('hubs') is key to inferring relationships or bridges among different functional groups.
To quickly characterize the biclustering algorithm search through its results: is it exhaustive? Does it find several groups? Of which size? How strongly are they connected? Are there unconnected groups?

Currently, during an analysis biclusters must be individually inspected and/or filtered using statistical methods or a priori biological knowledge. Due to the heterogeneity of biclustering approaches and the novelty of most of the biclustering algorithms (an increasing number of which have appeared since the year 2000), few theoretical statistical methods to analyze or filter them are available. Most of them are based on significance tests over biological knowledge such as Transcription Regulatory Networks [16] or Gene Ontology [17]. These tests are not perfect since biological knowledge is still incomplete. Because of this lack of statistical or biological filters, it is usually difficult to reduce the number of biclusters. Even when the number is reduced, drawing conclusions quickly requires a way of putting all the biclusters together in a single graphical representation, which is an urgent need.

Since there are no fault-free standards to determine which is the best biclustering method for each case, problems are usually approached from different points of view, often by using different methods, or different configurations of the same method, in order to identify the most robust results (the biclusters that are found under different approaches).

Due to overlap, even in the outcome of a single biclustering method with a single configuration, a kind of robustness can still be found in genes or conditions that are grouped together by several biclusters (in other words, they are at the intersection of several biclusters). This can be extended to the use of different methods or configurations of parameters. The robust groups of genes and/or conditions formed by the intersection of different biclusters are a kind of super-bicluster, usually not directly grouped by any method (which can lead to groups of only genes or only conditions) but grouped together by several biclusters.

Visual analytics is the science of analytical reasoning supported by highly interactive visual interfaces [18]. This is our approach, which focuses on the representation of biclusters in several ways that enhance the analysis of biclustering results. Thus, while the center of the analysis is based on a representation of biclusters that is capable of visualizing several biclusters simultaneously, this visualization technique has been implemented as part of a framework that includes other traditional bicluster representations such as heatmaps or parallel coordinates, so the user can inspect biclustering results from different points of view. All the visualizations are highly interactive and are linked together. As a result, the detection of super-biclusters and hub nodes is easy and useful. The framework helps in the comprehension of the differences and similarities among biclusters from different biclustering methods and quickens the task of analyzing biclustering results.

Results and Discussion

Group data: movie relations analogy

The main difficulty when it comes to assessing a visualization for biclustering results is the need for a very well known data set that permits the validation of the conclusions reached using the tool. Taking this into account, before using our visualization technique to represent biclustering results from microarray data sets, it was validated using a database of more than 20,000 movies and over 250,000 people, extracted from IMDb [19]. Each movie is treated as a 'bicluster', so each person involved in the movie is a node in it. Of course these are fictional biclusters because they do not come from a biclustering algorithm, nor do they refer to two dimensions, but they have the most interesting property of overlapping. The characteristics that our tool helps to discover in the movie relationships usually have a direct analogy in a biological context, for example:

Working groups involved in more than one movie. These groups are of special interest if the movies in which the people worked together are prize-winning movies or movies that earned lots of money, because one can identify which are the most successful collaborations. For example, in Fig. 5 both successful sagas (Spider-man at the left and The Lord of the Rings at the top) and couples working in prize-winning films (such as Paul Haggis and Michael Peña at the bottom) are easily distinguished. Analogously, groups of genes present in several similar biclusters that are expected to have similar behaviors can be identified, for instance.

Hub nodes (or groups of nodes) joining two larger, otherwise separated, groups. In the case of movie relationships, these groups are quite interesting because they connect working groups of different nationalities, movie genres or degrees of success. For example, if you are a producer, the hub nodes that join blockbusters with prize-winning movies will lead you to the people who are capable of making quality movies that earn money (in Fig. 5, we can see hub nodes connecting prize-winning movies with blockbusters, such as Danny Elfman, and others such as Catherine Zeta-Jones or Orlando Bloom). In biology, hub genes related to two groups of biclusters, each one grouping different biological processes, can be interesting as they may participate in the regulation of both processes.
Indirect relationships. Each single group gives information about direct relationships among movie people. However, the inspection of several groups, by means of navigation through the graph (possibly tracing hub nodes), helps to discover third-party relationships (notice how, in Fig. 5, Russell Crowe and Cate Blanchett have worked with John Logan in different films). Biologically, this can lead to the discovery of side effects of the activation (or inhibition) of genes.

Familiarity with the movie ontology makes it much easier to test the analytical capability of the presented technique than using gene or biological ontologies (usually incomplete) applied to the results of biclustering algorithms (which are conceptually very heterogeneous). In this same domain, the framework in which the visualization technique is embedded was also validated by entering two contests, one centered on visual analytics and the other one on graph drawing (social networks). Our entry was selected as a finalist in the former [20] and was awarded first place in the latter [21].

Microarray Data and Biclustering algorithms

To test the power of our bicluster visualization method, now applied to biological information, the budding yeast Saccharomyces cerevisiae microarray data [7] has been used. This data set has been broadly studied and images of heatmap clustering are available. This organism's genome is fully sequenced, and the conditions of the microarray are understandable even by non-specialists, presenting clear groups such as sporulation time series, cell division or changes in temperature.

The yeast microarray data forms a 2467 × 79 matrix that has been analyzed with the BicAT toolbox [3] using three different biclustering methods: Bimax [16], the Iterative Search Algorithm (ISA) [22, 23] and the approach of Ben-Dor et al. [24] to find Order-Preserving SubMatrix biclusters (we will refer to this biclustering algorithm simply as OPSM).

These three algorithms have been chosen because they look for different concepts of biclusters using different strategies. Bimax searches for constant up-regulated biclusters (using Madeira and Oliveira notation [1]), ISA searches for biclusters that highly deviate from the mean (either above or below) and OPSM searches for biclusters which preserve a certain order (coherent evolution). Bimax uses a divide-and-conquer strategy while ISA uses Z-score statistics and OPSM performs a greedy iterative search. This way, we can present the results of the visualization under different biclustering conditions and discuss how those differences affect results by comparing their different layouts.

Bimax results analysis

Bimax is an exhaustive divide-and-conquer method that preprocesses the data matrix to convert it into a binary matrix by fixing a threshold, so transcription levels above this threshold become ones and transcription levels below become zeros (or vice versa). Then, it searches for all possible biclusters that contain only ones, so up or down-regulated constant biclusters are found.

Bimax was executed with a discretisation threshold of 1%, so only that percentage of transcription levels (the highest up-regulated) were considered. The minimum size of biclusters was set to 3 × 2, finding 421 biclusters, most of them of small size (groupings with under 30 transcription levels).
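The discretisation step can be illustrated with a short sketch. It assumes that "a threshold of 1%" means keeping the top 1% of all transcription levels in the matrix as ones; the exact rule applied by the toolbox may differ, and `binarize_top_fraction` is a hypothetical helper, not a BicAT function.

```python
import numpy as np


def binarize_top_fraction(a, fraction=0.01):
    """Return a 0/1 matrix marking the highest `fraction` of values in `a`.

    ASSUMPTION: the cut-off is the (1 - fraction) quantile over the whole matrix.
    """
    threshold = np.quantile(a, 1.0 - fraction)
    return (a > threshold).astype(np.uint8)


# Toy 10 x 10 matrix with values 0..99; only the largest 1% should survive.
a = np.arange(100, dtype=float).reshape(10, 10)
binary = binarize_top_fraction(a, fraction=0.01)
print(int(binary.sum()))
```

Bimax then searches this binary matrix for submatrices that contain only ones; that search is not reproduced here.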

Fig. 6 shows the 50 biggest biclusters found. With a simple glance at the representation, two clear groups of biclusters appear, one at the top of the graph and another one at the bottom. The display of a higher number of biclusters, up to 250, did not reveal additional information, other than making the two groups tighter.

High connectivity of the nodes demonstrates the exhaustiveness of Bimax, since some of the biclusters are very similar, although none is completely included in another bicluster. The top group mainly contains conditions related to sporulation (spo.mid, spo.7, spo.9, spo.11 are the most biclustered but not the only ones), revealing that this process provokes up-regulation of a high number of genes. We have compared Fig. 6 with the genes related to sporulation that have been identified by Eisen et al. [7] by means of clustering. The top group contains all the genes related to sporulation in that previous work, as expected. The bottom group, less connected, contains biclusters that group other conditions, especially heat shock conditions such as heat.20 and heat.40. Some genes highly active under heat shock and sporulation conditions, such as those with ORFs YGR088W, YNR034W or YKL096W, are present in biclusters of both groups and can be seen at the center of the representation. These hub genes are of special biological interest because they act as a bridge between sporulation conditions and heat shock conditions. For example, ORF YKL096W corresponds to gene CWP1, involved in cell wall organization [25] and known to be related to sporulation [26], but it has not been related to heat shock conditions, raising a new research question in order to clarify these findings.

OPSM results analysis

OPSM defines a bicluster as a group of rows whose values increase monotonically under a certain column ordering, enabling us to find coherent evolution biclusters, i.e. genes and conditions that significantly increase or decrease at the same time regardless of the amount of the change. This is the broadest bicluster definition, sometimes yielding very large groups of genes.
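The order-preserving property can be checked with a small sketch. It only verifies the definition on a given submatrix and does not implement the OPSM search itself; `is_order_preserving` is a hypothetical helper name.

```python
import numpy as np


def is_order_preserving(sub, column_order):
    """True if every row of `sub` strictly increases when columns follow `column_order`."""
    ordered = sub[:, column_order]
    return bool(np.all(np.diff(ordered, axis=1) > 0))


# Two genes whose values differ greatly in magnitude but rise in the same column order.
sub = np.array([[1.0, 5.0, 3.0],
                [10.0, 40.0, 20.0]])
order = np.argsort(sub[0])  # candidate ordering taken from the first row
print(is_order_preserving(sub, order))

# A second gene that breaks the ordering of the first one.
broken = np.array([[1.0, 5.0, 3.0],
                   [10.0, 20.0, 40.0]])
print(is_order_preserving(broken, order))
```

The first submatrix qualifies even though the second row changes by much larger amounts, which matches the "regardless of the amount of the change" part of the definition.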

OPSM was run using 10 models for each iteration, which yielded 13 biclusters. Four of the biclusters found were ignored due to their high number of genes (above 400).

The visualization (Fig. 7) reveals one of the characteristics of OPSM biclusters: when an OPSM bicluster contains few genes, it usually has more conditions, and vice versa (this is especially evident in biclusters 6 and 7, or 1 and 2). However, the most interesting result that the visualization helps to quickly detect for this dataset is that OPSM biclusters are mainly connected by sporulation conditions. These detected conditions are biologically interesting because they are able to maintain an order in transcription levels over a large number of genes.

This feature could also be discovered by means of the visualization of single biclusters, but it requires much more effort. Also, third-party relationships cannot be discovered unless all the elements in each bicluster are tracked one by one, while in this visualization they are quickly identified. For example, we can see that genes with locus tags YGL147C, YER102W and YGL076C are grouped together in two biclusters (3 and 4), and are not related to genes in bicluster 1, except for some nodes (mainly sporulation conditions) at the center. These three genes, along with several others, such as YHR203C or YLR075W (highlighted in the figure), are protein components of the ribosomal subunits 40S and 60S. This explains why they are grouped together in biclusters by OPSM. In this case, they serve as validation of the method because there is biological evidence of the relation among the genes (components of ribosomal subunits), but in other cases (for example, the Bimax hub nodes above) these identifications could lead to new knowledge. It is also remarkable that most of the genes grouped along with sporulation conditions by OPSM are not grouped by Bimax for the same conditions, suggesting that genes related to ribosomal subunits present order in transcription levels during sporulation, but are not highly expressed.

ISA results analysis

Iterative Search Algorithm (ISA) aims at finding genes and conditions that deviate from the mean, so only highly up- or down-regulated genes and conditions are biclustered. The method starts with two normalized copies of the data matrix, one for genes and another one for conditions. Then, different thresholds are imposed for genes and conditions, and biclusters are searched using Z-score statistics. In the end, biclusters with both up- and down-regulated transcription levels are obtained.

This algorithm found 45 biclusters with both gene and condition thresholds set to 2, and taking 100 starting points. ISA's bicluster structure is more entangled than those of Bimax or OPSM (see Fig. 8). While hull overlapping helps to draw conclusions regarding bicluster relationships (see Fig. 8a), when clusters grow in number and heterogeneity as in this case, abstraction to a higher level of grouping is also interesting. This way, highly intersected zones, such as nodes in biclusters 1, 2 and 3 (Fig. 8b), acquire relevance not through the individual biclusters they belong to, but through the frequency with which the biclustering method groups them together (forming a super-bicluster around conditions heat.40 and heat.160). When complexity increases, it is also interesting to know exactly which nodes are connected, which is achieved by highlighting all related nodes when hovering over one of them.

Since ISA searches for both up and down-regulated biclusters, relevant nodes differ from Bimax. For example, some conditions arise as important for this method, such as heat shock conditions heat.40, heat.80, heat.160 or diauxic shift conditions diau.e, diau.f (see Fig. 8b), while sporulation conditions, very relevant in Bimax, are secondary (Fig. 8d).

Conclusion

The present article analyzes and compares results from three prominent biclustering methods when applied to a real microarray experiment using a visual analytics framework that allows whole representation and interaction for all biclusters. The main conclusions are the following:

The proposed visualization can display a large number of biclusters in a single representation, enhancing the detection of overlap among biclusters.
As a consequence of conveying overlapping groups, actual biological features can be extracted by the detection of super-biclusters and hub nodes.
The combination of different representations (hulls, piecharts, labels) with interaction and navigation through the graph helps in the analysis, making it possible to simplify the visualization of complex results.
This visualization also helps to determine the characteristics of biclustering algorithms, and the differences and similarities between them.
The integration of the presented visualization into a visual framework that provides standard representations helps experts to follow the results more easily. Furthermore, the linkage of novel and traditional visualizations permits a deeper analysis of results, from overview to details, thus gaining insight into the problem at hand.

Following these promising results, our future line of work will be based on the research and optimization of the layout when the results of different biclustering algorithms are compared with each other, and on the integration of additional biological knowledge from gene and condition databases.

Methods

This section details the main characteristics of the presented visual analytics approach, focusing on the description of the novel graph-based bicluster visualization (we will refer to it as overlapper) and its use inside a framework (BicOverlapper) that implements other well known bicluster visualizations such as heatmaps or parallel coordinates (see Fig. 9).

We start with the definition of bicluster, then we explain the graph building, layout and complexity. Finally, we will see how the overlapper and the other views interact and are linked together to help discover new knowledge.

Bicluster Definition

The presented visualization technique relies on a graph where nodes represent genes or conditions, and edges join nodes that are grouped by one or more biclusters (Fig. 10). By using the same entity (graph nodes) for genes and conditions, the characteristic of the grouping of genes and conditions, natural in biclusters, can be easily visualized, a difficult task when both entities are separated (rows and columns in heatmaps, or lines and axes in parallel coordinates). Of course, gene nodes and condition nodes will be finally identifiable by using different shapes to represent them.

Let B_k be a bicluster that groups genes G_k = {g_{k1}, ..., g_{kn}} and conditions C_k = {c_{k1}, ..., c_{km}}. B_k is represented as an undirected complete subgraph with nodes N_k = {G_k ∪ C_k} = {g_{k1}, ..., g_{kn}, c_{k1}, ..., c_{km}}. As previously explained, for purposes of graph computation, genes and conditions are not distinguished and are simply considered as nodes N_k = {n₁, ..., n_{k(n+m)}}, and edges are defined as E_k = {(n_i, n_j) = (n_j, n_i), with n_i, n_j ∈ N_k}. The weight w_ij of each edge e_ij = (n_i, n_j) is given by the number of biclusters that contain both nodes n_i and n_j.
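A minimal sketch of this graph construction follows. It assumes each bicluster is given as a pair of sets (gene names, condition names) and that gene and condition names never collide; `build_overlap_graph` is a hypothetical name and not the BicOverlapper API.

```python
from collections import Counter
from itertools import combinations


def build_overlap_graph(biclusters):
    """Return edge weights w_ij: the number of biclusters containing both nodes.

    Each bicluster becomes a complete subgraph over its genes and conditions.
    Edges are undirected, so each pair is stored once with its nodes sorted.
    """
    weights = Counter()
    for genes, conditions in biclusters:
        nodes = sorted(set(genes) | set(conditions))
        for u, v in combinations(nodes, 2):
            weights[(u, v)] += 1
    return weights


# Two toy biclusters sharing gene g2 and condition c1.
biclusters = [({"g1", "g2"}, {"c1"}), ({"g2", "g3"}, {"c1"})]
weights = build_overlap_graph(biclusters)
print(weights[("c1", "g2")], weights[("g1", "g2")])
```

The pair (`c1`, `g2`) appears in both biclusters and gets weight 2, whereas every other edge has weight 1. Edges shared between biclusters are stored only once, so adding overlapping biclusters increases weights rather than the number of edges.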

Graph Layout

The nodes are displayed following a force-directed layout [27]. In our model, each pair of nodes n_i and n_j (positioned at p_i and p_j respectively), with a distance d_ij between them, can be affected by up to two forces. If the nodes are connected, a spring force acts to keep them at an optimal distance d_o, with stiffness depending on a constant s and the weight of the edge:

(1)

where |d_ij| = |p_j - p_i| is the magnitude of the distance and $\hat{d}_{ij}$ is the unit vector that indicates the direction from n_i to n_j. Between every pair of nodes, whether connected or not, an expansion force makes them repel each other:

$$X_{ij} = -\frac{G}{|d_{ij}|}\,\hat{d}_{ij} \qquad (2)$$

where G is a gravitational constant that controls the intensity of the repulsion. S_ij keeps nodes in the same biclusters close, while X_ij pushes apart nodes that belong to different biclusters.
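The two forces can be sketched as follows. The expansion force follows Eq. (2) directly. The spring force is only an assumption: a Hooke-like law with stiffness s·w_ij is used here as a stand-in, and the expression actually used in Eq. (1) may differ. Function names and default constants are hypothetical.

```python
import numpy as np


def expansion_force(p_i, p_j, G=1.0):
    """Repulsion on node i due to node j, as in Eq. (2): -(G / |d_ij|) * d_hat_ij."""
    d = np.asarray(p_j, dtype=float) - np.asarray(p_i, dtype=float)
    dist = np.linalg.norm(d)
    return -(G / dist) * (d / dist)


def spring_force(p_i, p_j, w_ij, s=1.0, d_o=1.0):
    """ASSUMED Hooke-like spring on node i: pulls towards j when farther than d_o.

    The stiffness s * w_ij is an illustrative choice, not the paper's Eq. (1).
    """
    d = np.asarray(p_j, dtype=float) - np.asarray(p_i, dtype=float)
    dist = np.linalg.norm(d)
    return s * w_ij * (dist - d_o) * (d / dist)


p_i, p_j = (0.0, 0.0), (2.0, 0.0)
repulsion = expansion_force(p_i, p_j, G=1.0)
attraction = spring_force(p_i, p_j, w_ij=2, s=1.0, d_o=1.0)
print(repulsion, attraction)
```

With these toy positions the repulsion on n_i points away from n_j, while the assumed spring pulls n_i towards n_j because the nodes are farther apart than d_o. A layout loop would sum both forces over all pairs and move each node a small step along the resultant.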

Edge cluttering is an issue when we deal with large graphs, making the resulting display confusing [28]. To solve this, edges are not drawn unless requested by the user. Instead, each bicluster (represented as a complete subgraph) is wrapped in a polygon or a rounded shape (hull). This hull is drawn by determining the outermost nodes of each bicluster, and using their positions as anchor points for a spline curve that draws the contour of the hull. The inside of the hull is filled with the same color used for the line, but with a degree of transparency. Unlike other zone graph visualizations [29, 30], a node can be in more than one zone, reflecting overlapping between biclusters, which can usually affect more than one node. Because hulls are drawn with a transparent color, their intersecting areas become more opaque, enhancing the detection of overlaps.
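One way to pick the outermost nodes of a bicluster is a convex hull over the node positions. This is an assumption for illustration only: the text above does not state which criterion is used, and the spline drawing through the anchor points is not reproduced. The sketch uses the standard monotone chain algorithm.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop the last vertex while it would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


# Node positions of one toy bicluster: four corners and one interior node.
positions = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
anchors = convex_hull(positions)
print(anchors)
```

The interior node at (1, 1) is not an anchor, so it lies inside the contour. The anchors could then be passed to a closed spline to obtain the rounded shape described above.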

Node Representation

Node positions are defined by the graph layout, but other information can be displayed through the node representation, by means of glyphs, at the user's request. A glyph is a graphical object designed to convey multiple data values [31]. The geometrical properties of the glyph represent different dimensions (Fig. 11). In our case:

The shape of each node distinguishes between genes (circles) and conditions (squares).
Pie charts have as many sectors as biclusters in which the node appears. The color of sectors could also be used to identify different biclusters which meet some predetermined criterion. Pie charts also serve to quantify the degree of overlapping of hull zones.
Labels with gene and condition names can be displayed for node identification. In this case, label color is determined by the node type (gene or condition) and text size by the degree of overlapping.
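The quantity behind both the number of pie sectors and the label size is the number of biclusters in which a node appears. A minimal sketch, assuming biclusters are given as pairs of sets of names (the function name `membership_counts` is hypothetical):

```python
from collections import Counter


def membership_counts(biclusters):
    """Count, for every gene or condition, the number of biclusters containing it."""
    counts = Counter()
    for genes, conditions in biclusters:
        counts.update(set(genes) | set(conditions))
    return counts


biclusters = [({"g1", "g2"}, {"c1"}), ({"g2", "g3"}, {"c1"})]
counts = membership_counts(biclusters)
print(counts["g2"], counts["g1"])
```

In this toy case `g2` and `c1` would be drawn with two sectors and a larger label, and `g1` and `g3` with a single sector.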

The final result of the graph display is a set of flexible, overlapping, colored areas representing biclusters, with glyph nodes inside representing genes or conditions. Drawing these areas instead of edges, along with their flexibility, allows a large number of biclusters to be represented without excessive cluttering of the display.
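The glyph attributes listed above can be derived directly from bicluster membership. The following sketch is only illustrative: the function name `node_glyphs`, the dictionary layout and the assumption that gene and condition names never collide are choices made for this example.

```python
def node_glyphs(biclusters):
    """Map each node to its glyph attributes.

    biclusters: dict of bicluster name -> (set of genes, set of conditions).
    Returns a dict of node name -> {"shape": ..., "sectors": [...]}, where
    "sectors" lists the biclusters in which the node appears (one pie sector each).
    """
    glyphs = {}
    for name, (genes, conditions) in biclusters.items():
        typed_nodes = [(g, "circle") for g in sorted(genes)]
        typed_nodes += [(c, "square") for c in sorted(conditions)]
        for node, shape in typed_nodes:
            glyph = glyphs.setdefault(node, {"shape": shape, "sectors": []})
            glyph["sectors"].append(name)
    return glyphs
```

The length of `sectors` gives the degree of overlapping of a node, which is the quantity used for the pie chart and for the label text size.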

Graph Complexity

An optimal implementation of force-directed layouts has a complexity of O(n³) [32], with n being the number of nodes. Microarray experiments tend to have high dimensionality, containing 10^3 to 10^5 genes and 10^1 to 10^2 conditions. Usually the number of genes n is much higher than the number of conditions m.

Regarding edge complexity in our representation, the worst-case scenario for an n × m microarray data matrix is that all genes behave similarly under all conditions, so that the only bicluster is the entire n × m matrix. The resulting graph then has n + m nodes and (n + m)(n + m - 1)/2 edges (about 6 × 10⁵ edges for a 1000 × 100 matrix).

Obviously, such a microarray experiment is useless, but it helps to show that complexity is very sensitive to the dimensionality of the biclusters. The number of biclusters found is another factor that increases the complexity, but it is much less important, since more edges are shared as the number of biclusters grows.
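The edge counts discussed above can be checked with a short calculation. In this sketch, `worst_case_edges` evaluates the formula (n + m)(n + m - 1)/2, and `graph_edges` builds the union of the complete subgraphs of several biclusters, so that shared edges are counted once. Both function names are hypothetical.

```python
from itertools import combinations

def worst_case_edges(n_genes, n_conditions):
    """Edges of the complete graph over all genes and conditions."""
    nodes = n_genes + n_conditions
    return nodes * (nodes - 1) // 2

def graph_edges(biclusters):
    """Union of the complete subgraphs of each bicluster.

    biclusters: iterable of node sets (genes and conditions together).
    """
    edges = set()
    for nodes in biclusters:
        for pair in combinations(sorted(nodes), 2):
            edges.add(frozenset(pair))   # undirected edge, stored once
    return edges
```

For a 1000 × 100 matrix, `worst_case_edges(1000, 100)` gives 604450. Two overlapping biclusters of three nodes each, such as {a, b, c} and {b, c, d}, contribute five edges rather than six because the edge between b and c is shared.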

In practice, the number and size of biclusters vary depending on the biclustering algorithm, the input parameters and the microarray data set. Typically, for a 1000 × 100 matrix, an exhaustive method like Bimax [16] gives hundreds of biclusters. Other methods such as Spectral Biclustering [33] or ISA [22] yield around 50 biclusters, while Turner's Plaid model [34] or OPSM [24] generate only a dozen biclusters. Other algorithms take the number of biclusters as an input parameter. Usual sizes for biclusters range from 2 × 2 to 100 × 10, though exceptionally larger ones may be generated.

Overlapper can deal with up to 100 biclusters with sizes ranging from 2 × 2 to 100 × 10, on an Intel Pentium D 2.8 GHz processor, without relevant loss of interactivity. Although performance is currently being optimized to deal with more than 100 biclusters, graph complexity and the limits of human perception when inspecting a graph restrict the number of biclusters that can be visualized. Therefore, prior statistical or biological filters linked to the graph visualization are of great importance when it comes to comparing larger biclustering results.

The visual analytics approach described here helps to reduce complexity through interaction with other linked visualizations, such as parallel coordinates or TRN networks, which filter the displayed biclusters using simple criteria such as "only biclusters that contain gene X" or "only biclusters with genes that have high transcription levels for condition Y".
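The two example criteria can be expressed as simple predicates over the bicluster set. The sketch below is a minimal illustration; the data layout, the function names and the reading of the second criterion as "at least one gene above a minimum level" are assumptions.

```python
def biclusters_with_gene(biclusters, gene):
    """Keep only the biclusters that contain the given gene."""
    return {name: b for name, b in biclusters.items() if gene in b["genes"]}

def biclusters_with_high_genes(biclusters, levels, condition, minimum):
    """Keep biclusters with at least one gene whose level for `condition`
    reaches `minimum` (assumed interpretation of "high transcription level").

    levels: dict of gene -> dict of condition -> transcription level.
    """
    high = {g for g, profile in levels.items() if profile[condition] >= minimum}
    return {name: b for name, b in biclusters.items() if b["genes"] & high}
```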

Multiple Linked Views

Four other visualizations are implemented along with the overlapper: heatmap, parallel coordinates, TRN network and bubble map. Heatmap and parallel coordinates are used to display transcription levels and also to represent gene or condition profiles and single biclusters.

The heatmap implementation is conventional, representing single biclusters by rearranging and distorting the corresponding rows and columns. The implementation of parallel coordinates allows transcription thresholds to be set for each condition, thus helping to perform user-driven filtering of gene profiles. Biclusters are represented by first placing all coordinate axes corresponding to conditions in the bicluster. The lines corresponding to genes in the bicluster are highlighted, with the segments corresponding to conditions in the bicluster being brighter (see Fig. 12). With this implementation of parallel coordinates, the context of genes and conditions not in the bicluster is preserved; losing that context is one of the main limitations of other implementations of this technique, as discussed in the Background section.
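The two parallel coordinates operations described above, placing the bicluster axes first and filtering gene profiles by per-condition thresholds, could be sketched as follows. Function names and the representation of thresholds as (low, high) bounds are assumptions made for this example.

```python
def order_axes(all_conditions, bicluster_conditions):
    """Place the axes of the bicluster conditions first, keeping the rest as context."""
    inside = [c for c in all_conditions if c in bicluster_conditions]
    outside = [c for c in all_conditions if c not in bicluster_conditions]
    return inside + outside

def filter_profiles(levels, bounds):
    """Return the genes whose profile lies within the bounds of every condition.

    levels: dict of gene -> dict of condition -> transcription level.
    bounds: dict of condition -> (low, high) thresholds set by the user.
    """
    return [
        gene
        for gene, profile in levels.items()
        if all(low <= profile[cond] <= high for cond, (low, high) in bounds.items())
    ]
```

Conditions without an entry in `bounds` are left unconstrained, which mirrors a user who only moves the thresholds of a few axes.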

The bubble map, in turn, is a 2D projection map similar to gCLUTO's mountain map, but we use both genes and conditions to compute projections (see Fig. 9c). Moreover, this implementation uses 2D properties such as diameter, color and transparency to represent the characteristics of biclusters (size, homogeneity, etc.), so biclusters are drawn as 2D 'bubbles' instead of 3D mountains. In this way, we avoid the occlusion of objects, which is an issue in 3D visualizations, and we simplify the display, which is more complex in the case of biclusters than in the case of clusters.

Finally, the TRN visualization is implemented as a force-directed graph in the fashion of tools such as Cytoscape [14] or Hawkeye [15], with genes as nodes, joined by directed edges if they are related by any transcription behavior (activation or inhibition).

Visual analytics focuses on the interaction with the representations, so that they can be adapted to the user's information needs. Thus, all the visualizations in the framework, including the overlapper, implement a large number of interaction options. Most of these interactions deal with navigation through the view and with selecting or searching for biclusters, genes or conditions. Other interactions are specific to each visualization, such as the setting of transcription thresholds in parallel coordinates described above. Please refer to the Implementation section for a list of additional material explaining these interactions in detail.

It must be added that many of the interactions within a representation will lead to different visual changes in the rest of the visualizations of the framework.

Filters and thresholds

A helpful way of using the proposed visual analytics methodology is an incremental exploration of the problem. The initial problem (the analysis of all the biclusters found for a given microarray data matrix) can be divided into simpler problems. For example, if a gene is considered interesting for our experiment, searching for all the biclusters that contain it is one way to simplify the analysis.

For this reason, as stated above, all the visualizations implemented in the framework are linked, so an interaction with one of them propagates to the others. By interacting with the ancillary visualizations (heatmap, parallel coordinates, TRN network and bubble map) we can reduce the number of biclusters displayed in the overlapper, making them easier to analyze.

For example, we can search for a gene named lacI in the TRN network and select it to display its profile (parallel coordinates, heatmap) and the biclusters to which it belongs, which will appear in the bubble map and the overlapper (see Fig. 13). Or we can set parallel coordinates thresholds to select the genes with low transcription levels for conditions E1A and E1B and high levels for conditions E9A, E9B, E10A and E10B, and see which biclusters contain them all and how they are related within the TRN network (see Fig. 14). In addition to this multiple-view filtering, the overlapper visualization alone allows the setting of internal thresholds that also help to simplify the visualization and to focus on bicluster subsets. Three kinds of thresholds are available to modify the display:

Overlap threshold: when this node-oriented threshold is set to n, only genes and conditions grouped in more than n biclusters are drawn.
Size threshold: when this bicluster-oriented threshold is set to n, only biclusters with at least n nodes (counting genes and conditions) are drawn.
Constance threshold: if this threshold is set to n, only biclusters with standard deviation less than n are drawn. Note that some biclustering algorithms do not use constance as a criterion to find biclusters, so this threshold will disfavor them.
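The three thresholds can be read as filters over nodes or biclusters. The sketch below follows the definitions in the list; the function names, the data layout, the use of the population standard deviation over all values of the bicluster submatrix, and the assumption that gene and condition names are distinct are choices made for this example only.

```python
from collections import Counter
from statistics import pstdev

def overlap_nodes(biclusters, n):
    """Overlap threshold: nodes grouped in more than n biclusters."""
    counts = Counter(
        node for b in biclusters.values() for node in b["genes"] | b["conditions"]
    )
    return {node for node, count in counts.items() if count > n}

def apply_size_threshold(biclusters, n):
    """Size threshold: biclusters with at least n nodes (genes plus conditions)."""
    return {
        name: b
        for name, b in biclusters.items()
        if len(b["genes"]) + len(b["conditions"]) >= n
    }

def apply_constance_threshold(biclusters, levels, n):
    """Constance threshold: biclusters whose values have standard deviation below n.

    levels: dict of gene -> dict of condition -> transcription level.
    """
    kept = {}
    for name, b in biclusters.items():
        values = [levels[g][c] for g in b["genes"] for c in b["conditions"]]
        if pstdev(values) < n:
            kept[name] = b
    return kept
```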

Both filters and thresholds are useful for focusing on specific subgroups of genes or conditions, but even without them the overlapper can visualize several biclusters and allows the user to draw conclusions about the behavior of biclustering algorithms and the biological data grouped by them, as discussed in the Results and Discussion section.

Implementation

The visual analysis approach described has been implemented as a Java framework called BicOverlapper. The overlapper technique was initially designed as a sketch in Processing [35], and was later translated to pure Java [36]. The heatmap, TRN network and bubble map implementations make use of the Prefuse library [37].

A public version of BicOverlapper is available, and can be executed on any operating system that supports Java 1.6. The framework makes use of three different sources of data:

The bicluster results, which contain all the biclusters to be visualized in the overlapper.
The microarray data matrix, necessary for the visualization of heatmaps and parallel coordinates.
The TRN network, with information about transcription regulations, necessary for the TRN visualization.

Although these data sources are fundamentally different, they all share genes and conditions as elementary entities, so the different visualizations in the framework can be linked through them. This is done by a session manager that separates the different data sources from the visualizations, filtering the relevant data entities (see Fig. 15). When user interaction changes one of the visualizations, the session manager detects the change and translates it into the associated changes in the rest of the linked visualizations. More technical information about the framework, its options and implementation details can be found in the following documents:

A user's guide with installation instructions, further information about usage of the software, interaction options and a complete case study.
A developer's guide explaining details about the architecture and design of the Java implementation.
A technical report with a review of the state of the art and details on the visualization technique and the framework.

All these files, along with the open source code and a ready-to-use distribution of the framework, are available at the BicOverlapper development site [38]. The framework and the user's guide are also available for download [39].
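The role of the session manager described in this section can be illustrated with a small observer-style sketch. It is written in Python for brevity, whereas the framework itself is implemented in Java, and the class and method names (`SessionManager`, `RecordingView`, `register`, `select`, `on_selection`) are hypothetical and do not come from the BicOverlapper code base.

```python
class SessionManager:
    """Propagates gene and condition selections between linked views."""

    def __init__(self):
        self._views = []

    def register(self, view):
        self._views.append(view)

    def select(self, source, genes=(), conditions=()):
        """Notify every view except the one where the interaction happened."""
        selected_genes, selected_conditions = set(genes), set(conditions)
        for view in self._views:
            if view is not source:
                view.on_selection(selected_genes, selected_conditions)


class RecordingView:
    """Stand-in for a visualization; it only records what it is told to show."""

    def __init__(self, name):
        self.name = name
        self.events = []

    def on_selection(self, genes, conditions):
        self.events.append((genes, conditions))


# Usage: selecting lacI in the TRN view updates the overlapper view.
session = SessionManager()
trn, overlapper = RecordingView("trn"), RecordingView("overlapper")
session.register(trn)
session.register(overlapper)
session.select(trn, genes=["lacI"])
```

Because views only exchange gene and condition identifiers, each one can keep its own data source, which matches the separation between data sources and visualizations described above.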

References

Madeira S, Oliveira A: Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform 2004, 1(1):24–45.
Seo J, Shneiderman B: Interactively Exploring Hierarchical Clustering Results by Interactive Exploration of Dendrograms, a Case Study with Genomic Microarray Data. IEEE Computer 2002, 35(7):80–86.
Barkow S, Bleuer S, Prelic A, Zimmermann P, Zitzler E: BicAT: a biclustering analysis toolbox. Bioinformatics 2006, 22(10):1282–1283.
Cheng KO, Law NF, Siu WC, Lau TH: BiVisu: Software Tool for Bicluster Detection and Visualization. Bioinformatics 2007, 23(17):2342–2344.
Grothaus GA, Mufti A, Murali T: Automatic layout and visualization of biclusters. Algorithms for Molecular Biology 2006, 1: 15.
Inselberg A: The plane with parallel coordinates. The Visual Computer 1985, 1: 69–91.
Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95: 14863–14868.
Saldanha AJ: Java Treeview-extensible visualization of microarray data. Bioinformatics 2004, 20(17):3246–3248.
Rasmussen M, Karypis G: gCLUTO: An Interactive Clustering, Visualization and Analysis System. In Tech Rep. 04–021. University of Minnesota; 2004.
Hibbs MA, Dirksen NC, Li K, Troyanskaya OG: Visualization methods for statistical analysis of microarray clusters. BMC Bioinformatics 2005, 6: 115.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics 2000, 25: 25–29.
Shen-Orr SS, Milo R, Mangan S, Alon U: Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genetics 2002, 31: 64–68.
Ma HW, Kumar B, Ditges U, Gunzer F, Buer J, Zeng AP: An extended transcriptional regulatory network of Escherichia coli and analysis of its hierarchical structure and network motifs. Nucleic Acids Research 2004, 32(22):6643–6649.
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Res 2003, 13(11):2498–2504.
Schatz MC, Phillippy AM, Shneiderman B, Salzberg SL: Hawkeye: an interactive visual analytics tool for genome assemblies. Genome Biology 2007, 8(3). [http://amos.sourceforge.net/hawkeye/]
Prelic A, Bleuer S, Zimmermann P, Wille A, Buhlmann P, Gruissem W, Hennig L, Thiele L, Zitzler E: A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 2006, 22(9):1122–1129.
Carmona-Saez P, Pascual-Marqui RD, Tirado F, Carazo JM, Pascual-Montano A: Biclustering of gene expression data by non-smooth non-negative matrix factorization. BMC Bioinformatics 2006, 7: 78. [http://www.biomedcentral.com/1471-2105/7/78]
National Visualization and Analytics Center: Illuminating the Path: The Research and Development Agenda for Visual Analytics. IEEE Press; 2005.
Internet Movie Database[http://www.imdb.com]
Theron R, Santamaria R, Garcia J, Gomez D, Paz-Madrid V: Overlapper: movie analyzer. Information Visualization Conference Compendium 2007, 140–141. [http://conferences.computer.org/infovis/files/compendium2007.pdf]
Duncan CA, Kobourov SG, Sander G: Graph Drawing Contest Report. Lecture Notes in Computer Science, GD'07 2007, 395–400.
Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N: Revealing modular organization in the yeast transcriptional network. Nat Genet 2002, 31: 370–377.
Ihmels J, Bergmann S, Barkai N: Defining transcription modules using large-scale gene expression data. Bioinformatics 2004, 20: 1993–2003.
Ben-Dor A, Chor B, Karp R, Yakhini Z: Discovering local structure in gene expression data: the order-preserving submatrix problem. Journal of Computational Biology 2003, 10: 373–384.
Vaart JM, Caro LH, Chapman JW, Klis FM, Verrips CT: Identification of three mannoproteins in the cell wall of Saccharomyces cerevisiae. Journal of Bacteriology 1995, 177(11):3104–3110.
Tevzadze GG, Swift H, Esposito RE: Spo1, a phospholipase B homolog, is required for spindle pole body duplication during meiosis in Saccharomyces cerevisiae. Chromosoma 2000, 109(1–2):72–85.
Fruchterman TMJ, Reingold EM: Graph Drawing by Force-directed Placement. Software – Practice and Experience 1991, 21: 1129–1164.
Gansner ER, North SC: Improved force-directed layouts. Proc of the 6th Symposium on Graph Drawing 1998, 1547: 364–374.
Perer A, Shneiderman B: Balancing Systematic and Flexible Exploration of Social Networks. IEEE Trans on Vis and Comp Graphics 2006, 12(5):693–700.
Kumar G, Garland M: Visual Exploration of Complex Time-Varying Graphs. IEEE Trans on Vis and Comp Graph 2006, 12(5):805–812.
Ware C: Perception for Design. San Diego, Calif.: Morgan Kaufmann; 1999.
Herman I, Melançon G, Marshall MS: Graph Visualization and Navigation in Information Visualization: A Survey. IEEE Trans on Vis and Comp Graph 2000, 6: 24–43.
Kluger Y, Basri R, Chang JT, Gerstein M: Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions. Genome Res 2003, 13: 703–716.
Turner HL, Bailey TC, Krzanowski WJ, Hemingway CA: Biclustering Models for Structured Microarray Data. IEEE/ACM Trans Comput Biol Bioinform 2005, 2(4):316–329.
Fry B, Reas C: Processing, a programming handbook for visual designers and artists. MIT Press; 2007.
Santamaría R, Therón R, Quintales L: BicOverlapper: A tool for bicluster visualization. Bioinformatics 2008, 24(9):1212–1213.
Heer J, Card SK, Landay JA: prefuse: a toolkit for interactive information visualization. In Proceedings of SIGCHI Human Factors in Computing Systems. New York, NY, USA: ACM Press; 2005:421–430.
Bicoverlapper Development Site[http://bicoverlapper.googlecode.com]
BicOverlapper[http://vis.usal.es/bicoverlapper]
den Bulcke TV, Leemput KV, Naudts B, van Remortel P, Ma H, Verschoren A, Moor BD, Marchal K: SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms. BMC Bioinformatics 2006, 7: 43. [http://www.biomedcentral.com/1471-2105/7/43]

Acknowledgements

This work was supported by the Ministerio de Educación y Ciencia of Spain under project TIN2006-06313 and by a grant from the Junta de Castilla y León autonomous government. The authors wish to thank Francisco Antequera for his biological advice.

Author information

Authors and Affiliations

Departamento de Informática y Automática, Universidad de Salamanca, Pz. de Los Caídos S/N, 37007, Salamanca, Spain
Rodrigo Santamaría, Roberto Therón & Luis Quintales

Corresponding author

Correspondence to Rodrigo Santamaría.

Additional information

Authors' contributions

RT and RS conceived the study and carried out the computational analysis, implementation and optimization. RT and LQ managed and coordinated the project. LQ studied biological data mining issues and RT and RS studied visual analysis issues. All authors participated in writing and revising the final manuscript.

Rodrigo Santamaría and Roberto Therón contributed equally to this work.

Electronic supplementary material

Additional file 1: Shockwave Flash (swf) video showing overlapper interaction. (SWF 14 MB)

Additional file 2: Shockwave Flash (swf) video showing Bimax biclusters in overlapper. (SWF 9 MB)

Additional file 3: Shockwave Flash (swf) video showing OPSM biclusters in overlapper. (SWF 13 MB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Santamaría, R., Therón, R. & Quintales, L. A visual analytics approach for understanding biclustering results from microarray data. BMC Bioinformatics 9, 247 (2008). https://doi.org/10.1186/1471-2105-9-247

Received: 21 December 2007
Accepted: 27 May 2008
Published: 27 May 2008
DOI: https://doi.org/10.1186/1471-2105-9-247
