CLAG: an unsupervised non hierarchical clustering algorithm handling biological data

Background Searching for similarities in a set of biological data is intrinsically difficult because some data points should not be clustered at all, while others should belong to several clusters. Under these conditions, hierarchical agglomerative clustering is not appropriate. Moreover, if the dataset is not well enough characterized, as is often the case, supervised classification is not appropriate either.
Results CLAG (for CLusters AGgregation) is an unsupervised non-hierarchical clustering algorithm designed to cluster a large variety of biological data and to provide a clustered matrix together with numerical values indicating cluster strength. CLAG clusters correlation matrices for residues in protein families, gene-expression and miRNA data related to various cancer types, sets of species described by multidimensional vectors of characters, and binary matrices. It does not require all data points to cluster, and it converges to the same result at each run. Its simplicity and speed allow it to run on reasonably large datasets.
Conclusions CLAG can be used to investigate the cluster structure present in biological datasets and to identify its underlying graph. It proved more informative and accurate than several known clustering methods, such as hierarchical agglomerative clustering, k-means, fuzzy c-means, model-based clustering and affinity propagation clustering, and it does not suffer from the convergence problem of the latter.


Instructions for running CLAG
The program CLAG has been designed to cluster sets of multidimensional data by looking for correlated data within the set.

CLAG has been designed to detect clusters in an M × N matrix, where N is the number of elements in a set (also denoted N) and M is the number of properties (forming the set E) describing the elements in N. We call E the environment. CLAG is split into two steps: (a) the detection of clusters of elements; (b) the aggregation of the clusters obtained in step (a) into an aggregation graph and the visualization of key aggregates.
The first part of the algorithm takes as input the data of an M × N matrix, loaded from a file called "input.txt". Each line of the file represents a pair of elements and the associated score, separated by spaces, that is: <element of N> <element of M> score. CLAG then provides a list of files describing the clusters identified with respect to the parameter ∆.
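The input format described above can be produced from a dense matrix with a few lines of Python. The following is a minimal sketch, not part of CLAG: the function names `matrix_to_triples` and `write_clag_input` are hypothetical, and the helper assumes that each row of the matrix corresponds to one element of N and each column to one property.

```python
def matrix_to_triples(elements, properties, matrix):
    """Return one 'element property score' string per matrix entry.

    Assumption: matrix[i][j] is the score of elements[i] for properties[j].
    """
    if len(matrix) != len(elements):
        raise ValueError("one row per element is expected")
    lines = []
    for element, row in zip(elements, matrix):
        if len(row) != len(properties):
            raise ValueError("one column per property is expected")
        for prop, score in zip(properties, row):
            # Fields are separated by a space, so names must not contain one.
            if " " in element or " " in prop:
                raise ValueError("names must not contain spaces")
            lines.append(f"{element} {prop} {score}")
    return lines


def write_clag_input(path, elements, properties, matrix):
    """Write the triples to a file such as 'input.txt', one per line."""
    with open(path, "w") as handle:
        handle.write("\n".join(matrix_to_triples(elements, properties, matrix)))
        handle.write("\n")
```

For example, `write_clag_input("input.txt", ["g1", "g2"], ["p1", "p2"], [[0.1, 0.2], [0.3, 0.4]])` would write four lines, the first being `g1 p1 0.1`.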
The second part computes the key aggregates, constructs the aggregation graph and provides a set of figures representing clusters and key aggregates for each allowed value of the parameter ∆, parameterizing the analysis by quantiles of the score distribution.
To run the program on a general matrix from the command line, type:
    ./exe-RCommand.pl -f=/folder/ -p=n -k=i -d=X
where n takes one of three values, depending on the input and on the clustering sought:
    1 a general matrix
    2 a binary matrix
    3 a matrix where N ⊂ E
Parameters k and d are optional. The value i is the lower bound of the environmental scores (and possibly symmetric scores) accepted in the aggregation analysis. If -k is missing, i defaults to 0. Note that i plays no role in the clustering step, where all affine clusters are identified (that is, i = 0). The value X can be any integer. If -d is not specified, CLAG computes by default all the values 5, 10, 20, 40 (standing for ∆ = 0.05, 0.1, 0.2, 0.4). For binary matrices, the parameter d has no effect.
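The rules above for assembling the command line can be summarized in a short Python sketch. The helper names `clag_command` and `delta_from_flag` are hypothetical; only the script name and the flags -f, -p, -k and -d come from the instructions above. The conversion X/100 is an assumption extrapolated from the documented defaults (5, 10, 20, 40 standing for 0.05, 0.1, 0.2, 0.4).

```python
def clag_command(folder, matrix_type, min_score=None, delta_percent=None):
    """Build the argument list for exe-RCommand.pl (the script is not run here)."""
    if matrix_type not in (1, 2, 3):
        raise ValueError("-p must be 1 (general), 2 (binary) or 3 (N subset of E)")
    args = ["./exe-RCommand.pl", f"-f={folder}", f"-p={matrix_type}"]
    if min_score is not None:
        # Optional lower bound on scores used in the aggregation step.
        args.append(f"-k={min_score}")
    if delta_percent is not None and matrix_type != 2:
        # -d has no effect for binary matrices, so it is omitted for them.
        args.append(f"-d={int(delta_percent)}")
    return args


def delta_from_flag(delta_percent):
    """Assumed mapping from the integer X to the parameter Delta."""
    return delta_percent / 100
```

The returned list could then be handed to `subprocess.run` from the directory containing the script.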

CLAG output files
CLAG works on input matrices whose real values lie in the interval [0, 1] (input.txt). If the input matrix does not satisfy this condition, CLAG renormalizes the values of the matrix and keeps the original matrix in the file inputOriginal.txt.
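The exact renormalization applied by CLAG is not specified here. As an illustration only, the sketch below checks the [0, 1] condition and applies min-max scaling, which is an assumed choice and may differ from what CLAG does internally; the function names are hypothetical.

```python
def needs_rescaling(matrix):
    """True if any value falls outside the interval [0, 1]."""
    return any(value < 0 or value > 1 for row in matrix for value in row)


def min_max_rescale(matrix):
    """Map values linearly onto [0, 1] (assumed scheme, not CLAG's documented one)."""
    values = [value for row in matrix for value in row]
    low, high = min(values), max(values)
    if high == low:
        # A constant matrix carries no contrast; map every value to 0.
        return [[0.0 for _ in row] for row in matrix]
    return [[(value - low) / (high - low) for value in row] for row in matrix]
```

Rescaling the data beforehand in this way makes the transformation explicit and keeps it under the user's control.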
In the first step, CLAG outputs several files that list all clusters and their associated scores. For general matrices, or for matrices where N ⊂ E, the files associated with the clustering of the matrix at a given value ∆, say ∆ = 0.M, are:
-CLUSTERFILE-COMPLETE-M.txt: each line describes a different cluster C and its scores. Namely, it reports: the ∆ value; the percentage of elements Y ∈ E such that A(X, Y) = A(Z, Y), where X, Z ∈ N and X is the generator; the list of elements in E belonging to the set Diff(X, Z) (if the list is empty, the value is -1); the list of elements in the cluster C; the environmental score S_env(C). Fields are separated by a ":".
-CLUSTERFILE-M.txt: each line describes a different affine cluster and its scores. The fields are the same as above.
In the second step, CLAG outputs files describing key aggregates:
-aggregation-M.txt: the list of elements of each key aggregate, the rank of the key aggregate (determined first by the highest symmetric score, if it exists, and then by the highest environmental score), and the scores of the first and last clusters merged by the algorithm. For a general matrix, there are two scores, both environmental; for a matrix where N ⊂ E, there are four scores, two symmetric and two environmental. Fields are separated by a ":".
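A cluster file with the five ":"-separated fields listed above can be read with a sketch such as the following. The record type and function names are hypothetical. Because the separator used inside the two lists is not stated, list fields are kept as raw strings; the sample line in the usage note is invented for illustration and is not real CLAG output.

```python
from collections import namedtuple

ClusterRecord = namedtuple(
    "ClusterRecord",
    ["delta", "percent_equal", "diff", "members", "env_score", "extra"],
)


def parse_cluster_line(line):
    """Split one line of a CLUSTERFILE into its ':'-separated fields."""
    fields = [field.strip() for field in line.rstrip("\n").split(":")]
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}")
    delta, percent_equal, diff, members, env_score = fields[:5]
    return ClusterRecord(
        delta=delta,
        percent_equal=percent_equal,
        # The value -1 marks an empty Diff(X, Z) list.
        diff=None if diff == "-1" else diff,
        members=members,
        env_score=env_score,
        # Any further fields are kept untouched.
        extra=fields[5:],
    )


def read_cluster_file(path):
    """Parse every non-empty line of a cluster file."""
    with open(path) as handle:
        return [parse_cluster_line(line) for line in handle if line.strip()]
```

For a hypothetical line `0.05:100:-1:a b c:1`, `parse_cluster_line` returns a record whose `diff` is `None` and whose `members` field is the raw string `a b c`.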
CLAG also generates several figures describing the unclustered matrix (Matrix.pdf), the aggregation graph generated by neato (a program of the Graphviz package), the cluster aggregation matrix (Matrix-aggregated-M.pdf or Matrix-aggregated.pdf; this matrix displays all key aggregates constructed out of clusters with scores ≥ i), the clustered matrix for all scores (Matrix-Clusterized-M.pdf or Matrix-Clusterized.pdf) and the clustered matrix for environmental scores = 1 (and symmetric score = 1 in the case of a special case matrix) (Matrix-Clusterized-M-Scores1.pdf or Matrix-Clusterized-Scores1.pdf). The generation of the last two matrices can be disabled by commenting out a line in the code. This option is provided because the generated files are at times too large and the user might want to avoid producing them. All files generated by R and neato are in pdf format. Note that the files beginning with "PR-SCORING-output" contain the matrix given to R for the generation of the corresponding pdf files. The commands given to R are contained in the files beginning with "R COMMAND".
In the case of binary clustering, the first and second steps of the algorithm are independent of the parameter ∆ and the only files generated by CLAG are: CLUSTERFILE-COMPLETE.txt, CLUSTERFILE.txt, aggregation.pdf, GRAPH-aggregation.dot, GRAPH-aggregation.pdf.
CLAG draws graphs with neato, found in Graphviz (Ellson J, Gansner ER, Koutsofios E, North SC, Woodhull G (2001) Graphviz - Open Source Graph Drawing Tools. Symposium on Graph Drawing - GD 483-484), downloadable at http://www.graphviz.org/Credits.php. Note that neato draws graphs only when they are not too large (about 100 nodes; this limit applies to N and not to M, which can be much larger, as in Figure 1). For graphs with a large number of nodes, no ps file is generated: only the .dot file and the matrix of key aggregates are output.
SI Figure 1. Comparison between different classification tools on brain cancer data. The brain cancer dataset consists of five sets: medulloblastoma (M), malignant glioma (G), atypical teratoid/rhabdoid tumors (A), normal cerebella (Norm) and primitive neuroectodermal tumors (PNET). Each graph is associated with a method and represents the mixing of the five sets after clustering. The five sets label the nodes of the graphs, and an edge between two sets indicates that elements of both sets coexist in at least one cluster. Clusters for k-means, c-means and MCLUST are reported in SI Table 2.
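The graph construction described in the caption of SI Figure 1 can be sketched as follows. This is an illustrative reading of the caption, not code from CLAG: the function name `coexistence_edges`, the sample identifiers and the mapping from samples to sets are all hypothetical.

```python
from itertools import combinations


def coexistence_edges(clusters, set_of):
    """Return the pairs of sets whose elements share at least one cluster.

    clusters: iterable of clusters, each an iterable of sample identifiers.
    set_of:   dict mapping a sample identifier to the label of its set.
    """
    edges = set()
    for cluster in clusters:
        # Distinct set labels present in this cluster, in a fixed order.
        labels = sorted({set_of[sample] for sample in cluster})
        # Every pair of labels found together yields one undirected edge.
        edges.update(combinations(labels, 2))
    return edges
```

Under this reading, a method that separates the five sets perfectly yields a graph with no edges, so fewer edges indicate less mixing between tumor types.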
SI Figure 2. CLAG applied to brain cancer data: error analysis. Curves reporting the number of errors associated with aggregations based on clusters whose scores satisfy a certain threshold. The curves describe how the errors decrease as ∆ varies. Errors are misclassified and unclustered data points.
Figure 5D. The table associated with the hierarchical clustering results is organised around the three clusters (red, black and blue) corresponding to the three subtrees in Figure 5D.
Table 5. k-means, c-means and MCLUST classification on the globin dataset. k-means and c-means were run for 4 clusters. MCLUST selected "VII" (diagonal, varying volume and shape) with 9 components as best model. The best model occurs at the maximum number of components considered, and the optimal number of clusters occurs at the maximum choice. SCAP clusters were obtained with p = 0.13. Residues belonging to the CLAG red cluster and to the CLAG green cluster are highlighted in red and green, respectively.
Table 6. CLAG globin analysis: rank of key aggregates. Key aggregates are ranked with respect to their scores, which represent the strength of the aggregation. Colors correspond to those used to identify clusters in Fig. 6. The second and third columns (first) report the scores of the first cluster entering the key aggregate, and the last two columns (last) report the scores of the last cluster completing the key aggregate.

Globin dataset analysis
6. Synthetic datasets analysis
SI Figure 6. Clustering of the synthetic dataset Dim32. A: the 32-dimensional dataset contains 1024 points and 16 clusters generated with a Gaussian distribution (http://cs.joensuu.fi/sipu/datasets/). CLAG perfectly distinguished the 16 clusters; it was run with ∆ = 0.05 and scores > 0.5. Besides CLAG, different clustering algorithms were run on this synthetic dataset: c-means (B), MCLUST (C) and k-means (D). k-means and c-means were run with 16 clusters, and MCLUST with "ellipsoidal, unconstrained with 6 components" as best model. For k-means, cluster 1 and cluster 5 are split into several k-means clusters. c-means recognizes the 16 clusters correctly.
SI Figure 10. Clustering of the synthetic dataset Dim512. A: the 512-dimensional dataset contains 1024 points and 16 clusters generated with a Gaussian distribution (http://cs.joensuu.fi/sipu/datasets/). CLAG perfectly distinguished the 16 clusters; it was run with ∆ = 0.05 and scores > 0.5. Besides CLAG, different clustering algorithms were run on this synthetic dataset: c-means (B), MCLUST (C) and k-means (D). k-means and c-means were run with 16 clusters, and MCLUST with "ellipsoidal, equal variance with 9 components" as best model. For k-means, clusters 5, 12, 13 and 14 are split into several k-means clusters. c-means groups the dataset into only 12 clusters. Panels A to D were produced by plotting the first two columns of the matrix describing the dataset.
SI Figure 11. Clustering of the synthetic dataset Dim1024. A: the 1024-dimensional dataset contains 1024 points and 16 clusters generated with a Gaussian distribution (http://cs.joensuu.fi/sipu/datasets/). CLAG perfectly distinguished the 16 clusters; it was run with ∆ = 0.05 and scores ≥ 0.5. Besides CLAG, different clustering algorithms were run on this synthetic dataset: c-means (B), MCLUST (C) and k-means (D). k-means and c-means were run with 16 clusters, and MCLUST with "ellipsoidal multivariate normal with 1 component" as best model. For k-means, clusters 3, 6 and 7 are split into several k-means clusters. c-means groups the dataset into only 14 clusters. Panels A to D were produced by plotting the first two columns of the matrix describing the dataset.
generated with difficulty level=2 and density level= 3, by using the software DataGenerator.jnlp, downloadable at http://webdocs.cs.ualberta.ca/ yaling/Cluster/Php/index.php. Different clustering algorithms were run on this synthetic dataset: CLAG (B), k-means (E), c-means (C) and MCLUST (D). CLAG was run with ∆ = 0.05 and scores = 1, k-means and c-means with 5 clusters, and MCLUST with "ellipsoidal, equal volume and shape" as best model and with 9 components.