Coral: an integrated suite of visualizations for comparing clusterings
© Fillippova et al.; licensee BioMed Central Ltd. 2012
Received: 6 April 2012
Accepted: 9 October 2012
Published: 29 October 2012
Skip to main content
© Fillippova et al.; licensee BioMed Central Ltd. 2012
Received: 6 April 2012
Accepted: 9 October 2012
Published: 29 October 2012
Clustering has become a standard analysis for many types of biological data (e.g interaction networks, gene expression, metagenomic abundance). In practice, it is possible to obtain a large number of contradictory clusterings by varying which clustering algorithm is used, which data attributes are considered, how algorithmic parameters are set, and which near-optimal clusterings are chosen. It is a difficult task to sift though such a large collection of varied clusterings to determine which clustering features are affected by parameter settings or are artifacts of particular algorithms and which represent meaningful patterns. Knowing which items are often clustered together helps to improve our understanding of the underlying data and to increase our confidence about generated modules.
We present Coral, an application for interactive exploration of large ensembles of clusterings. Coral makes all-to-all clustering comparison easy, supports exploration of individual clusterings, allows tracking modules across clusterings, and supports identification of core and peripheral items in modules. We discuss how each visual component in Coral tackles a specific question related to clustering comparison and provide examples of their use. We also show how Coral could be used to visually and quantitatively compare clusterings with a ground truth clustering.
As a case study, we compare clusterings of a recently published protein interaction network of Arabidopsis thaliana. We use several popular algorithms to generate the network’s clusterings. We find that the clusterings vary significantly and that few proteins are consistently co-clustered in all clusterings. This is evidence that several clusterings should typically be considered when evaluating modules of genes, proteins, or sequences, and Coral can be used to perform a comprehensive analysis of these clustering ensembles.
Collections of protein interactions, gene expression vectors, metagenomic samples, and gene sequences containing thousands to hundreds-of-thousands of elements are now being analyzed routinely. Clustering is often used to condense such large datasets into an understandable form: it has been successfully used on protein-protein interaction (PPI) networks to discover protein complexes and predict protein function, e.g. ; on gene expression data to find patterns in gene regulation and essential cell processes, e.g. ; and on metagenomic samples to identify new species, compare them to existing clades, evaluate the diversity of a population, and suggest interdependencies between them [3, 4]. In other words, clustering has become a ubiquitous part of analysis for large biological datasets.
There are many clustering algorithms available for numerical and network data, e.g. [5–12]. Each algorithm, and choice of its parameters, results in different clusterings. Sometimes, clustering algorithms must resolve ties when generating modules or may be randomized. Consequently, a single clustering algorithm may produce diverse partitions on the same data . Clusterings may also change when the underlying data becomes increasingly noisy or displays variation under different conditions (such as varying gene expression levels). In addition, considering many optimal and near-optimal partitions has been shown to improve the understanding of module dynamics and the strength of relationships between individual items [14–17]. Such clusterings may offer drastically different perspectives on the data, where assessing the commonalities and differences is of great interest.
There are several ways in which the problem of diverse clusterings has been addressed. Some tools rely on a single clustering only and focus on module quality assessment, e.g. [18, 19]. Comparing two or more clusterings at a time is usually done by computing a single metric, such as the Jaccard or Rand index , to compare clusterings side-by-side  or in a dendrogram . These approaches can easily compare a pair of clusterings, but are not extendable to greater number of clusterings. Another approach is to aggregate multiple partitions into a consensus clustering [23, 24] without delving into the differences between individual clusterings and, thus, disregarding possibly important information about the clusterings. Finally, some approaches have made steps towards visual examination of multiple clusterings: King and Grimmer  compare clusterings pairwise and project the space of clusterings onto a plane to visualize a clustering landscape, and Langfelder et al.  investigate ways to compare individual modules across multiple conditions. However, none of these approaches offer a platform for a multi-level analysis of ensembles of diverse clusterings.
In the present study, we describe Coral—a tool that allows for interactive and comprehensive comparison of multiple clusterings at different levels of abstraction. Coral allows users to progressively move from an overview to analysis of relationships between individual data items. Users may start by examining statistics on the data and individual clusterings, or by assessing the overall homogeneity of a dataset. Users can then compare any two partitions in the ensemble in greater detail. Larger scale trends become pronounced when groups of items that co-cluster often are automatically identified and highlighted. We also extend parallel sets plot  to show how individual items switch modules across clusterings. Coral’s visualizations are interactive and are coordinated so that users can track the same group of data items across multiple views.
Coral works with modules — groups of items closely related to one another according to some metric or property. For example, modules can constitute a collection of genes that get co-expressed together or proteins forming a complex. A clustering is a collection of modules and usually is an output of a clustering algorithm. Users may also choose to group data according to attributes that come with the data such as cellular component or molecular function GO terms and use that partition as a clustering. Users may combine data from different experiemnts and across species so long as the data items that the user treats as homologous have the same IDs across the dataset.
Coral takes as an input the module files where each file represents a single clustering, and each line in the file contains a list of data items (proteins, genes, or sequence ids) from a single module, e.g. as produced by MCL, the clustering algorithm by van Dongen . Coral aggregates and visualizes these data through several connected displays, each of which can be used to answer specific questions about the clusterings. Below, we examine a few such questions and describe how Coral’s visualizations help to answer them.
To gain a quick overview of their collection of clusterings, Coral users may start the analysis by reviewing basic statistics about their data: number of modules per clustering, average module size, number of items that were clustered, clustering entropy , and percentage of data items that ended up in the overlapping modules. Questions such as “Do clusterings have the same number of modules?” and “Are module sizes evenly distributed?” can be easily answered through these statistics. Each statistic is shown as a bar chart, and each clustering is associated with a unique color hue that is used consistently to identify the clustering throughout the system. If a clustering contains overlapping modules, the corresponding bar in the chart is striped as opposed to a solid bar for the clusterings containing no overlapping modules (see Figure 1).
A natural follow-up to finding a highly similar pair of clusterings is to review the similarities between their individual modules. Is a group of interacting genes preserved between the two stages in the cell life cycle? Is there a match for a given protein complex in the PPI network of another species? Module-to-module comparison answers these questions and explains the origins of clustering similarity at a “zoomed-in” module level.
The parallel partitions plot allows users to track individual items and whole modules across all partitions. To see whether a module splits or merges in other clusterings, users can select a module with a mouse while holding a shift key to highlight its members in every clustering in the plot (see red traces in Figure 4). Similarly, users may select individual items and trace them through every clustering band. The selections made in the parallel partitions plot propagate to other views making it easy to track the same group of items throughout the application. The plot is zoomable — users may zoom in to focus on a few items at a time or zoom out to see global trends across the ensemble. When the zoom level permits it, the plot displays the item labels.
The order of items in the clustering bands matches the order of items in the co-cluster matrix (discussed below) as closely as possible, while at the same time attempting to minimize the amount of crossover between the parallelograms connecting items in the consecutive clusterings. However, the items in the bands must be placed inside their respective modules. We discuss an algorithm that finds a good ordering of items in the clustering bands in the Methods section.
A single clustering assigns a data item v to a module defining its cohort — a set of items in the same module as v. Knowing the item’s module helps in assigning function to unknown proteins  and novel genes ; knowing that the item’s cohort is consistent across many clusterings increases the confidence of such predictions.
where A K t is a co-cluster matrix for a clustering K t and k is the number of clusterings. Here, the entries equal k (the number of clusterings) for item pairs (v i ,v j ) that have co-clustered in all partitions suggesting a strong relationship between the items, and the low values correspond to pairs that co-clustered in only a few clusterings and are more likely to have been assigned to the same module by chance. The cells are colored according to their values and vary from white (low values) to red (high values). Users may zoom in and out on the matrix to focus on areas of interest.
Groups of items that end up in the same module across many clusterings are of a particular interest because they represent the robust subclusters in the data. Such commonly co-clustered sets are called cores. Items in cores form the most trustworthy modules and indicate particularly strong ties between data items, increasing, for example, the confidence in protein complex identification  and gene annotation .
In a co-cluster matrix, cores translate to contiguous blocks of high-value entries. Coral finds the cores using a fast dynamic programming algorithm and highlights them within the co-cluster matrix (inset, Figure 6a). When users provide clusterings derived from a network, Coral can augment cores with an overlay showing each core’s cohesion — the ratio E in/E out where E in is the number of edges within the core and E out is the number of edges that have one endpoint inside the core and another endpoint outside of it . When a core’s cohesion is low, the blue overlay is smaller indicating that the core shares many connections with the rest of the network (Figure 6b). Cores for which cohesion is high are more isolated from the rest of the network — these cores are distinguishable by the blue overlays that almost cover the core.
The co-cluster matrix displays the total number of times any two items were co-clustered, and the tooltips that appear after hovering over a matrix cell show a list of clusterings in which a given pair has been co-clustered. To facilitate sorting and search for particular item pairs, Coral provides a companion table where each row represents a pair of data items and displays the number of times the items co-clustered along with the pair’s signature. The signature is a k-long vector where the t th element is 1 when both data items, say, proteins, have been placed in the same module in clustering K t . If the pair’s items were not in the same module in K t , the t th element is set to 0.
The order of rows and columns in the co-cluster matrix is critical to extracting meaningful information from it. Finding an optimal matrix reordering is NP-complete for almost any interesting objective. Algorithms for the optimal linear arrangement  and bandwidth minimization  problems have been used to reorder matrices with considerable success; however, both approaches perform poorly for matrices that have many off-diagonal elements. After comparing several reordering algorithms using the bandwidth and envelope metrics, we have chosen the SPIN  approach that consistently produced better results on a wide range of matrices.
where w(i,ℓ) is the weight of assigning i th row to ℓ th position, and the values are the co-cluster matrix entries. After permuting A + ’s rows, the columns of A + must be permuted to match the row order, thus changing the weights w(i,ℓ) and making the row assignments found previously no longer optimal, so this process is repeated. In Coral, we use two different solvers for the LAP problem: a fast, greedy solver and the Hungarian algorithm. The greedy solver matches rows to indexes by iteratively selecting the best row-index pair; it quickly finds a starter reordering that can later be improved by the Hungarian algorithm. The Hungarian algorithm solves the linear assignment problem optimally, but because a permutation of rows invalidates the ordering of the columns, the algorithm has to be re-run for several iterations to improve the reordering. We continue iterating LAP until we get no improvement in assignment cost, observe a previously considered permutation, or exceed the maximum number of iterations.
where a region is a square block on the matrix diagonal between rows p and q, and its density is the sum s(p q) of all matrix entries within the area divided by the area’s width |q−p|. Alternatively, we can think of the co-cluster matrix A + as a weighted adjacency matrix of some graph G(A + ), then d(p q) is the density of a subgraph S induced by the vertices p,…,q: d(p q) = |E(S)|/|V(S)|, where |E(S)| is the sum of edge weights in S and V(S) is a set vertices in S.
where D opt(j) is the optimal area arrangement between 0 th and j th item, and D opt (n) gives the optimal partition of a matrix A + into cores.
making it possible to compute all s(p,q) in O(n 2) time. This reduces the total runtime to find cores to O(n 2 + n 2) = O(n 2).
The areas that have density higher than their d avg(p,q) represent groups of data items that have co-clustered together more often than is expected by chance. Hence, Coral displays only these cores.
When ordering clustering bands in the parallel partitions plot, we would like to put similar clusterings next to each other and avoid putting two dissimilar clusterings vertically adjacent. The intuition for such a constraint is that if the two clusterings K i and K i+1 share many similarities, the bands connecting individual items between the clusterings will only cross a few times making it easier to track module changes. We also need to order items within the bands in a way that puts items from the same module next to each other and does not allow items from other modules to interleave.
To find a vertical order for the clustering bands, we apply a greedy algorithm that uses clustering similarity scores. First, we compute the similarity for every pair of clusterings sim(K i ,K j ) using Jaccard. Next, we find the two most similar clusterings K 1,K 2, add them to a list, and look for a clustering most similar to either K 1 or K 2 (whichever is greater). We proceed by repeatedly picking the clustering that is most similar to the last clustering added to the list. The order in which clusterings were added to the list determines the order of the clustering bands.
where i(u) is the index of a column in A + corresponding to the data item u. Modules that should be placed first in the clustering band would have the lowest rank, so we sort the modules in order of increasing d, and the module’s position in the sorted array determines module’s position in the clustering band.
For our comparison, we have focused on the graph clustering algorithms for which the implementations were available (see Table 1). Louvain and Clauset are two algorithms that search for a network partition with highest modularity . Both tend to find large clusters and usually cover most of the nodes in a network. CFinder is a clique percolation method that identifies overlapping communities by continuously rolling cliques of an increasing size. Resulting clusterings usually contain many small modules with a high amount of overlap and cover only a part of the network ignoring graph structures like bridges and stars. MCL is a fast, flow-based clustering algorithm that uses random walks to separate high-flow and low-flow parts of the graph. Its modules tend to be small and usually cover most of the input network. MCODE algorithm finds modules in biological networks by expanding communities around vertices with high clustering coefficient. “Fluff” and “haircut” options for MCODE allow to add singleton nodes connected to the module by just one edge and to remove nodes weakly connected to the module correspondingly. MINE is closely related to MCODE, but uses a modified weighting scheme for vertices which results in denser, possibly overlapping modules. SPICi grows modules around vertices with high weighted degree by greedily adding vertices that increase module’s density. The partitions contain many dense modules, but usually cover only a part of the network.
To get an overview of the data, we review various statistics on the generated clusterings. For the majority of the clusterings, modules that remained after filtering covered only a portion of the network. The two clusterings produced by the modularity-based methods, Louvain and Clauset, were the only clusterings that included more than 95% of all proteins into their modules. The number of modules per clustering varied significantly from 20 to 666 (Table 1). The average module size was highest for Clauset (115.65), Louvaine (103.00), and the MCODE.F (82.05) clusterings significantly exceeding the average module size among all other clusterings (3.02-26.31 items per module). For the original 26 A. thaliana modules , 3% of the proteins were assigned to more than one module; in the CFinder clustering over half of the clustered proteins (59%) participated in multiple modules.
The nine A. thaliana clusterings are highly dissimilar: most cells in the ladder widget (Figure 2) are white or pale blue, and the majority of pairwise Jaccard similarity scores are below 0.07. MCL yielded the partition most similar to A. thaliana modules reported in  (A.Thal original) with Jaccard similarity of 0.60. Surprisingly, the 26 modules generated by link clustering  shared very little similarity with CFinder, the only other algorithm in the ensemble designed to produce overlapping modules.
Low pairwise similarity scores between so many pairs of clusterings is easily explained using the module-to-module table: clusterings with Jaccard similarity below 0.07 overlap by a few small modules or no modules at all. The similarity of 0.60 between MCL and A.Thal (Figure 4) may be attributed to the two big modules that are largely replicated between the two clusterings: the module m9 from MCL and the module m0 from A. thaliana (highlighted row) overlap by 288 proteins with Jaccard similarity 0.8. Several smaller modules (shown at the top of the table) are duplicated exactly between the two clusterings.
The co-clustering matrix for A. thaliana clusterings contains several large regions of co-clustered proteins along the diagonal (Figure 6), however, most cells are pale indicating that they were co-clustered by only a few clustering algorithms; very few matrix cells are close to the saturated red. Indeed, 65.25% of all co-clustered pairs of A. thaliana proteins have co-clustered just once across all of the nine clusterings used in the analysis and only 6.34% of protein pairs were co-clustered in 5 or more partitions. This low number of protein pairs that were assigned to the same cluster means that the clusterings in the ensemble mostly disagreed.
The dynamic program for identifying cores found 249 areas in the A. thaliana network in which proteins co-clustered more often than could be expected by chance, with the largest core containing 215 proteins and with the average number of proteins per core of 10.38 proteins. Most cores, including the largest core, had low cohesion values indicating that the proteins forming the cores had many connections to proteins outside of the cores (see Figure 6). This finding is correlated with the fact that the clusterings did not agree in general and only small sets of proteins were consistently clustered together across the ensemble.
Finally, setting A.thal original to be the base clustering shows that the modules found by  covered only a fraction of modules found by other methods, although they included the largest core. The majority of A.thal original modules were colored pale pink indicating that modules found by the link clustering were found by no more than 3 other clustering methods. We trace the largest core in the parallel partitions plot (Figure 4): the proteins in the core are co-clustered by A.thal original, Clauset, Louvaine, MCL, and MCODE.F while SPICi, MINE, and MCODE ignored the majority of core’s proteins completely. CFinder, with its many overlapping modules of size 3, 4, and 5, clusters some of the core’s proteins and puts a large part of the core in the grab bag group representing unclustered proteins.
Clustering algorithms may generate wildly varying clusterings for the same data: algorithms optimize for different objectives, may use different tie breaking techniques, or only cluster part of the data items. A popular technique for optimizing modularity has been shown to suffer from a resolution limit  and multiple decompositions may have the same modularity value . When a true decomposition is available, the clustering quality can be quantified using the similarity score and the true positive and true negative rates. However, when there is no true clustering, it is hard to decide which clustering is better than the others. We propose that researchers generate several clusterings by either using different clustering algorithms or varying algorithm parameters. Coral can help compare such clusterings and identify cores in which the data items co-cluster across multiple clusterings.
Most views and statistics in Coral work for both non-overlapping and overlapping clusterings. All overview statistics extrapolate well for overlapping modules except for entropy which assumes that no two modules overlap and therefore may overestimate the actual entropy. The co-cluster matrix naturally adapts to overlapping modules by allowing their corresponding blocks to overlap. Currently, if a pair of data items co-occur in more than one module within a single clustering, their co-cluster value is set to 1 and is not weighted higher relative to other pairs. The parallel partitions plot assumes that the modules in individual clusterings do not overlap. However, if there are overlapping modules, parallel partitions will still lay out the modules in a line by duplicating the co-occurring element in every module in which it occurs.
Although the examples we use in this paper are based on network clustering, Coral does not require its input data to be a network partition and can be used with equal success on gene expression or classification data. In particular, if users would like to compare several classification results, they can do so in the same manner as we have demonstrated for the A. thaliana clusterings. The similarity measures of purity, inverse purity, and the F-measure implemented in Coral are helpful in comparing classifications to the available truth. The module-to-module table is a more flexible alternative to the confusion matrix that is often used to evaluate classification results.
Coral has been used to analyze clusterings of up to 4115 items. The runtime varies with the number of clusterings, number of modules and data items per clustering, and the size of the modules. The startup operations — parsing the input clusterings, computing dataset statistics and all-to-all clustering similarities, as well as rendering the views — take from under a second to 11 seconds for clusterings from 33 to 4115 data items. Matrix reordering is the single biggest performance bottleneck for Coral. Reordering the co-cluster matrix for 2376 A. thaliana proteins took, on average, 29 seconds when to reorder using the greedy heuristic and 70 seconds when to reorder using the Hungarian algorithm. However, both the greedy heuristic and the Hungarian algorithm find good orderings after very few iterations and the reordering only needs to be computed once before analysis. Solutions for LAP computed with the Hungarian algorithm improve with every iteration and usually converge on a good reordering fast.
Coral offers a comprehensive array of visualizations that allow users to investigate modules from various viewpoints inlcuding several novel views. Coral guides users from overview statistics implemented as familiar bar charts to detailed cluster-to-cluster comparison in a table. The ladder widget, a lower triangle of the comparison matrix, helps users pick the most similar (or dissimilar) pair of clusterings and to judge how similar clusterings in the dataset are overall. A color-coded co-cluster matrix shows how often any pair of items in the dataset have been placed in a module together. A novel adaptation of parallel coordinates, parallel partitions plot, makes tracking a group of items across clusterings easy with intuitive selection techniques. These views combined create a powerful tool for a comprehensive exploration of an ensemble of clusterings. Coral can help users generate questions and hypotheses about the data that could be later definitively answered with the help of additional experiments.
Project name: Coral
Project home page: http://cbcb.umd.edu/kingsford-group/coral/
Operating systems: platform-independent
Programming language: Java
Other requirements: Java 1.6 and 1Gb of RAM
License: GNU GPL
This work was partially supported by NSF grants EF-0849899, IIS-0812111, and CCF-1053918 and NIH grant 1R21AI085376 to C.K. The authors thank Megan Riordan for helpful programming contributions, and Geet Duggal and Rob Patro for their helpful insights on matrix reordering.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.