Making sense of large, multi-dimensional data is challenging. Cluster analysis is widely used to discover patterns in such data. In general terms, clustering algorithms group similar objects into clusters such that each cluster is as homogeneous as possible and as dissimilar as possible to other clusters. Clustering is often applied to, for instance, gene expression matrices [1, 2], consisting of genes (rows) and samples (columns). We need to differentiate between one-way clustering, two-way clustering, and biclustering. In one-way clustering, the goal is to determine clusters in either the row or the column dimension. Examples of one-way clustering algorithms are k-means, hierarchical clustering, and affinity propagation [3]. In two-way clustering, the results of two sequentially performed one-way clustering runs, one in the row and one in the column dimension, are combined into one result.
Biclustering
Biclustering [4], also known as co-clustering or two-mode clustering, is an emerging field of machine learning. In contrast to one-way and two-way clustering, biclustering is a category of algorithms in which the rows and columns are clustered simultaneously. Biclustering also differs from standard clustering in that rows and columns may have multiple memberships or none at all. In this work, we focus on the visual analysis of biclustering results, as the characteristics of overlapping clusters pose an as yet unsolved challenge for visualization.
The array of available biclustering methods ranges from algorithms that try to find a single bicluster to algorithms that seek to find multiple overlapping biclusters. Madeira and Oliveira [5] surveyed different biclustering algorithms with respect to the structure of their output. Biclustering algorithms are often used to analyze gene expression data [6]. In the context of gene expression, a bicluster may correspond to a pathway that is activated in particular samples (the column members) and that contains certain genes (the row members). Each gene may belong to one bicluster, to more than one bicluster, or to no bicluster at all. The same holds for samples.
In general, clustering algorithms can additionally be differentiated by the kind of memberships they produce. In hard clustering, rows and columns are assigned to clusters in a binary way, i.e., they either belong to clusters or not. In soft clustering, the result consists of non-binary membership values that describe to what degree rows and columns belong to the clusters. As the assignment of rows and columns to clusters is fuzzy, this is also known as fuzzy clustering [7, 8].
Bicluster visualization
Let us consider the visualization of hard clustering results first. In order to understand and interpret hard clustering results, it is necessary to visualize the clusters together with the underlying data. Clustered heatmaps are the standard technique for visualizing both one-way and two-way clustering results. In clustered heatmaps, the rows or columns are reordered such that clusters can be recognized as contiguous blocks consisting of adjacent cells. Showing clusters as contiguous blocks is highly desirable, as it simplifies the detection and interpretation of patterns. However, for biclustering results, where clusters can overlap, rearranging the matrix this way is often impossible. Let us consider the example from Figure 1 that shows a 5x5 matrix with three clusters. In Figure 1(a), the columns are sorted such that the red and yellow clusters are represented as contiguous blocks, as indicated by a thick border. However, this sorting splits the blue cluster into two unconnected blocks. In Figure 1(b), columns B and E are swapped, which makes it possible to show the blue cluster as a contiguous block, but splits the red cluster. Consequently, even in small matrices there is often no order of rows and columns in which all clusters form contiguous blocks. The sorting problem can be solved by duplicating rows and/or columns, as demonstrated in Figure 1(c). However, the duplication approach does not scale, as it potentially produces large output matrices for comparatively small input matrices.
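The reordering problem can be illustrated with a minimal sketch that enumerates all column orders and keeps those in which every cluster occupies adjacent positions. The column labels and cluster memberships below are hypothetical and are not the data behind Figure 1; the function names are ours, and the brute-force search is only practical for very small matrices.

```python
from itertools import permutations


def is_contiguous(order, members):
    """Return True if all members occupy adjacent positions in the order."""
    positions = sorted(order.index(m) for m in members)
    return positions[-1] - positions[0] == len(positions) - 1


def contiguous_orders(columns, clusters):
    """Return every column order in which each cluster forms one block."""
    return [
        order
        for order in permutations(columns)
        if all(is_contiguous(order, members) for members in clusters.values())
    ]


# Hypothetical column memberships (assumption, not taken from Figure 1).
columns = ["A", "B", "C", "D", "E"]
# Column B is shared by three clusters, so it would need three neighbors.
overlapping = {"red": {"A", "B"}, "yellow": {"B", "C"}, "blue": {"B", "D"}}
# Without overlaps, many valid orders exist.
disjoint = {"red": {"A", "B"}, "yellow": {"C", "D"}}

print(len(contiguous_orders(columns, overlapping)))  # 0: no valid order
print(len(contiguous_orders(columns, disjoint)))     # 24 valid orders
```

In the overlapping case, column B would have to sit next to A, C, and D at the same time, which a linear order cannot provide. This is the situation that duplication of columns resolves at the cost of a larger matrix.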
Interpreting biclustering results is often time-consuming and tedious, as it is usually done statically by visually inspecting many separate plots. Adding fuzzy clustering to this equation makes the situation even more difficult.
Visualizing fuzzy biclustering results is a research problem that none of the existing tools address. We will first elaborate on how biclustering results can be represented and then introduce the FABIA fuzzy biclustering algorithm [9]. We use FABIA to demonstrate the proposed technique; however, note that any other biclustering algorithm that produces overlapping clusters can be used in the same way. We continue by introducing general requirements for bicluster visualization, which we use to review existing work in this field. We then present Furby, an interactive visualization technique for analyzing fuzzy biclustering results. After a brief description of the implementation, we present how the tool can be used effectively to analyze a real-world dataset. Before concluding the paper, we discuss the scalability of our tool to large datasets.
Representation of biclustering results
Biclustering data can generally be represented by three matrices: X, L, and Z. The X matrix represents the input data to be clustered. The biclustering results are represented by L and Z. The L matrix contains the relationship information between rows and biclusters, and the Z matrix contains the same information for columns. While for hard biclustering results, L and Z hold binary values (1 = row/column is part of a cluster, 0 = row/column is not part of a cluster), for fuzzy biclustering results they contain real values that denote the degree of membership, starting with 0, which means that the row or column does not belong to the considered bicluster.
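To make the three-matrix representation concrete, the following sketch stores a small hard biclustering result as nested lists and extracts the members and the data block of one bicluster. The values are invented for illustration, the helper names are ours, and the orientation (L as rows x biclusters, Z as biclusters x columns) is an assumption that the text above does not fix.

```python
# Hypothetical 4x5 input matrix X (rows = genes, columns = samples).
X = [
    [5.0, 4.8, 0.1, 0.2, 0.0],
    [5.2, 5.1, 0.0, 0.1, 0.3],
    [0.2, 3.9, 4.1, 4.0, 0.1],
    [0.1, 0.0, 0.2, 0.1, 0.2],
]

# Hard memberships. Assumed layout: L is rows x biclusters.
L = [
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 0],  # row 3 belongs to no bicluster
]

# Assumed layout: Z is biclusters x columns.
Z = [
    [1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],  # column 1 belongs to both biclusters
]


def bicluster_members(L, Z, k):
    """Return the row and column indices with nonzero membership in bicluster k."""
    rows = [i for i, row in enumerate(L) if row[k] != 0]
    cols = [j for j, value in enumerate(Z[k]) if value != 0]
    return rows, cols


def bicluster_submatrix(X, L, Z, k):
    """Return the block of X spanned by the members of bicluster k."""
    rows, cols = bicluster_members(L, Z, k)
    return [[X[i][j] for j in cols] for i in rows]


print(bicluster_members(L, Z, 1))       # ([2], [1, 2, 3])
print(bicluster_submatrix(X, L, Z, 0))  # [[5.0, 4.8], [5.2, 5.1]]
```

For a fuzzy result, the same structure applies with real-valued entries in L and Z; a membership of 0 still means that the row or column is not part of the bicluster.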
FABIA biclustering algorithm
FABIA [9] is an established biclustering algorithm that has been applied successfully not only to drug discovery and systems biology, but also to enhance recommender systems. FABIA is a generative model based on factor analysis, but it can also be considered a sparse matrix decomposition algorithm. As described above, the observed data matrix X is decomposed into two matrices: the L matrix describes memberships of rows (genes) to biclusters, and the Z matrix describes the memberships of columns (samples) to biclusters. Consequently, a bicluster is described by row and column memberships. The FABIA model assumes that biclusters have only a few row and column members. This is the typical situation for gene expression data, where pathways contain only a few genes (compared to all genes), which are activated in only a few samples. This situation is also typical for recommender systems, where a customer buys only a few products, and a certain product combination is chosen by only a few customers. Another example is word-document matrices, where a bicluster is a certain topic (a document contains few topics and a topic contains few indicative words). Thus, in all these applications the data matrix is sparse, as are the matrices that describe the row and column memberships. In FABIA models, this sparsity is reflected by sparse row and column decomposition matrices enforced by sparse priors in a Bayesian framework. FABIA describes row and column memberships by real numbers. Hence, the bicluster memberships are fuzzy, and in the "twilight zone" it is difficult to decide whether a column or a row indeed belongs to a bicluster.
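The decomposition can be read as a sum of outer products, one per bicluster, to which the generative model adds a noise term. The sketch below computes only this noise-free part for hand-picked sparse matrices; it does not fit a model and is not the FABIA implementation. The values, the function names, and the matrix orientation (L as rows x biclusters, Z as biclusters x columns) are assumptions made for illustration.

```python
def outer(u, v):
    """Outer product of two vectors as a nested list."""
    return [[a * b for b in v] for a in u]


def reconstruct(L, Z):
    """Sum over biclusters k of outer(column k of L, row k of Z)."""
    n, p, m = len(L), len(Z), len(Z[0])
    X_hat = [[0.0] * m for _ in range(n)]
    for k in range(p):
        column_k = [L[i][k] for i in range(n)]
        layer = outer(column_k, Z[k])
        for i in range(n):
            for j in range(m):
                X_hat[i][j] += layer[i][j]
    return X_hat


def sparsity(M, eps=1e-12):
    """Fraction of entries in M that are (numerically) zero."""
    total = sum(len(row) for row in M)
    zeros = sum(1 for row in M for value in row if abs(value) <= eps)
    return zeros / total


# Hypothetical sparse, real-valued memberships for two biclusters.
L = [[2.0, 0.0], [1.5, 0.0], [0.0, 1.0], [0.0, 0.0]]
Z = [[1.0, 0.8, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 3.0, 2.0]]

X_hat = reconstruct(L, Z)
for row in X_hat:
    print(row)
print(sparsity(L), sparsity(Z), sparsity(X_hat))
```

Because each bicluster contributes nonzero values only where both its row and its column membership are nonzero, sparse L and Z yield a sparse reconstruction, which matches the sparsity assumption described above.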
The memberships must often be inspected visually by an expert, who then decides how good the bicluster pattern is (gene pattern of a pathway) and how strong the signal is (gene expression). Assessing the relationship between biclusters is an even more complex task. Do two biclusters partially represent the same information? If yes, which columns and/or rows do they share? To comprehend the information in the biclusters and their mutual dependencies, a visual representation of this information is highly desired.
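The questions about shared information can be approximated numerically before any visual inspection. The following sketch lists the rows and columns shared by each pair of hard biclusters and reports a Jaccard index as one possible overlap measure. The identifiers and memberships are hypothetical, and the choice of the Jaccard index is ours rather than part of the method described here.

```python
from itertools import combinations


def shared_members(members_a, members_b):
    """Return the sorted members that two biclusters have in common."""
    return sorted(set(members_a) & set(members_b))


def jaccard(members_a, members_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(members_a), set(members_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_report(biclusters):
    """Compare every pair of biclusters by shared rows and columns."""
    report = {}
    for (name_a, a), (name_b, b) in combinations(biclusters.items(), 2):
        report[(name_a, name_b)] = {
            "shared_rows": shared_members(a["rows"], b["rows"]),
            "shared_cols": shared_members(a["cols"], b["cols"]),
            "row_jaccard": jaccard(a["rows"], b["rows"]),
            "col_jaccard": jaccard(a["cols"], b["cols"]),
        }
    return report


# Hypothetical hard biclusters (gene and sample identifiers are made up).
biclusters = {
    "bc1": {"rows": ["g1", "g2", "g3"], "cols": ["s1", "s2"]},
    "bc2": {"rows": ["g2", "g3", "g4"], "cols": ["s2", "s3"]},
    "bc3": {"rows": ["g7"], "cols": ["s5"]},
}

for pair, info in overlap_report(biclusters).items():
    print(pair, info)
```

Such a table answers which rows and columns two biclusters share, but it does not show the underlying data or the membership strengths, which is why a visual representation remains necessary.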
Requirements
Based on interviews with domain experts and surveying the body of previous work, we have elicited seven requirements that an effective fuzzy bicluster visualization needs to fulfill. We will assess existing work in this field against these requirements. Later sections will demonstrate how our technique addresses these requirements.
R I: Show individual biclusters
As the primary goal of data clustering is to find data subsets that are similar in some respect, the most basic requirement for a bicluster visualization technique is to present the individual biclusters to the analyst. The visualization needs to encode the data elements that form the cluster, together with the corresponding column and row identifiers. An effective visualization of a single cluster is essential for interpreting the data.
R II: Visualize shared rows and columns of multiple biclusters
In a biclustering result, columns and rows can be assigned to multiple biclusters. For interpreting the clustering result, it is important to communicate which rows and columns are shared between which clusters. This is also relevant to identifying similar clusters, i.e., a set of clusters with a large overlap.
R III: Visualize membership of rows and columns to biclusters
In contrast to requirement R I, where users are interested in a single bicluster in detail, analysts also want to investigate to which biclusters a single row or column is assigned.
R IV: Scalability
A well-designed bicluster visualization should scale to large datasets, to many biclusters, and to a large number of shared rows and columns between biclusters.
R V: Visualize bicluster strength
When visualizing fuzzy biclustering results, it is important to encode the membership values of rows and columns to biclusters. The membership value represents to what degree a row or column belongs to a particular cluster. By setting thresholds, fuzzy clusters can be transformed into hard clusters. Encoding the membership values of rows and columns in addition to the raw data allows analysts to judge the strength of clusters.
R VI: Interactive cluster refinement
Supporting analysts in the process of transforming fuzzy clusters into hard clusters by setting thresholds for the membership values is a central task of fuzzy bicluster visualization. Analysts need to be able to set the threshold interactively and immediately see the resulting hard biclusters. The combination of interactive refinement and encoding of shared rows and columns should help the analyst to determine optimal membership threshold values.
R VII: Visualize relationships to additional metadata
An effective bicluster visualization should allow analysts to relate rows and columns of biclusters to additional external data. For example, analysts want to investigate the correlation of biclusters defined on gene expression data with, for instance, patient groups, cancer subtypes, or tumor staging.
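The threshold-based transformation from fuzzy to hard clusters mentioned in R V and R VI, and the per-row lookup from R III, can be sketched as follows. The membership values and function names are hypothetical, and the sketch assumes non-negative memberships where 0 means no membership.

```python
def to_hard(memberships, threshold):
    """Binarize fuzzy memberships: 1 where value >= threshold, else 0."""
    return [[1 if value >= threshold else 0 for value in row] for row in memberships]


def biclusters_of_row(L_hard, i):
    """Return the indices of all biclusters that row i belongs to."""
    return [k for k, value in enumerate(L_hard[i]) if value == 1]


# Hypothetical fuzzy row memberships (rows x biclusters).
L_fuzzy = [
    [0.90, 0.10],
    [0.60, 0.50],
    [0.05, 0.80],
    [0.00, 0.00],
]

for threshold in (0.3, 0.7):
    L_hard = to_hard(L_fuzzy, threshold)
    assignments = [biclusters_of_row(L_hard, i) for i in range(len(L_hard))]
    print(threshold, assignments)
# 0.3 [[0], [0, 1], [1], []]
# 0.7 [[0], [], [1], []]
```

Raising the threshold from 0.3 to 0.7 removes row 1 from both biclusters and thereby eliminates their overlap, which shows why analysts need to see the effect of a threshold change immediately.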
Related work on cluster visualization
The key to letting analysts gain new insights into large and complex multi-dimensional datasets is to combine the strength of automated algorithmic techniques with the power of interactive visualization [10–13]. Numerous techniques for the interactive visual analysis of clustering results have been proposed in recent years.
In order to discuss interactive cluster visualization techniques, we split up the body of existing work according to types of clustering. The standard technique for one- and two-way clustering is the clustered heatmap, where rows and/or columns are reordered to reflect the similarities. Examples of visual analysis tools that provide interactive heatmaps are Mayday [14], Caleydo [15, 16], and the Dual Analysis framework [17]. For hierarchical clustering results, the clustered heatmap is commonly extended with a dendrogram that represents the similarities between the rows or columns [18]. The Hierarchical Cluster Explorer (HCE) [19] and MultiClusterTree [20] both allow interactive analysis of hierarchical clustering results.
However, as mentioned at the beginning of the Bicluster visualization section, for biclustering results it is often not possible to rearrange the matrix in order to represent all clusters as contiguous blocks (see Figure 1), which is essential for interpreting the clusters. A simple approach to visualizing biclustering results is to create a separate plot for each bicluster, as implemented, for instance, in the Biclustering Analysis Toolbox (BicAT) [21], the BiClust R toolbox [21] and the BiVisu tool [22]. Showing every cluster as a separate plot allows analysts to inspect the clusters individually, which addresses requirement R I. However, this makes it impossible to see which rows and columns they share, which violates R II. Jin et al. [23] formulated the reordering issue as an optimization problem and proposed a reordering approach by exploiting analogies to the hypergraph vertex ordering problem. Grothaus et al. [24] propose to duplicate rows and columns to resolve situations where reordering is not possible. The BiCluster viewer [25] follows the same approach, but additionally allows analysts to interactively decide which clusters to show contiguously in order to minimize the number of duplicates. As this can, however, still result in very large matrices, scalability is limited (see R IV).
The work that is probably most closely related to ours is the BicOverlapper tool [26], which presents the biclustering result in a multiple-coordinated view setup. A parallel coordinates view and a heatmap show the individual biclusters, realizing R I. The overlapper view visualizes the bicluster network as a force-directed graph where biclusters are represented as overlapping groups. Although the BicOverlapper tool encodes the overlaps between clusters (R II) and the cluster assignment (R III), it does not scale well to many biclusters (R IV): the resulting occlusion makes it impossible to obtain an overview of the biclustering results as a whole.
Only a small number of articles on visualizing fuzzy clustering results have been published. Most of them propose extensions to classical clustering visualizations, including parallel coordinate plots [27], heatmaps [28], and RadVis [29] - a radial visualization technique, in which membership values are projected to polar coordinates. A similar approach was developed by Rueda and Zhang [30], which maps membership values to a hyper-tetrahedron in the 2D or 3D space representing three or four fuzzy clusters. clusterMaker [31] takes a different approach by representing a one-way fuzzy clustering result as a force-directed graph where the clustered entities are shown as nodes and color is used to encode the cluster membership. However, all these methods focus on the membership or membership values of the rows and columns and ignore the underlying data of the clustering result (see R I).
In summary, none of the existing approaches are able to address the requirements in a satisfactory way. In particular, the visualization of fuzzy biclustering seems to be a blank area in the research map - a blank which Furby attempts to fill.