Simcluster: clustering enumeration gene expression data on the simplex space
 Ricardo ZN Vêncio†^{1},
 Leonardo Varuzza†^{2},
 Carlos A de B Pereira^{2},
 Helena Brentani^{3} and
 Ilya Shmulevich^{1}
DOI: 10.1186/1471-2105-8-246
© Vêncio et al; licensee BioMed Central Ltd. 2007
Received: 02 March 2007
Accepted: 11 July 2007
Published: 11 July 2007
Abstract
Background
Transcript enumeration methods such as SAGE, MPSS, and sequencing-by-synthesis EST "digital northern" are important high-throughput techniques for digital gene expression measurement. Like other counting or voting processes, these measurements constitute compositional data exhibiting properties particular to the simplex space, where the summation of the components is constrained. These properties are not present in regular Euclidean spaces, on which hybridization-based microarray data is often modeled. Therefore, pattern recognition methods commonly used for microarray data analysis may be non-informative for the data generated by transcript enumeration techniques, since they ignore certain fundamental properties of this space.
Results
Here we present a software tool, Simcluster, designed to perform clustering analysis for data on the simplex space. We present Simcluster as a standalone command-line C package and as a user-friendly online tool. Both versions are available at: http://xerad.systemsbiology.net/simcluster.
Conclusion
Simcluster is designed in accordance with a well-established mathematical framework for compositional data analysis, which provides principled procedures for dealing with the simplex space, and is thus applicable in a number of contexts, including enumeration-based gene expression data.
Background
Technologies for high-throughput measurement of transcriptional gene expression are mainly divided into two categories: those based on hybridization, such as all microarray-related technologies [1, 2], and those based on transcript enumeration, which include SAGE [3], MPSS [4], and Digital Northern powered by traditional [5] or recently developed EST sequencing-by-synthesis (SBS) technologies [6].
Currently, transcript enumeration methods are relatively expensive and more time-consuming than methods based on hybridization. However, recent improvements in sequencing technology, powered by the "$1000 genome" effort [7], promise to transform the transcript enumeration approach into a fast and accessible alternative [8–10], paving the way for a systems-level absolute digital description of individualized samples [11].
Methods for finding differentially expressed genes have been developed specifically in the context of enumeration-based techniques of different sequencing scales such as EST [12], SAGE [13] and MPSS [14]. However, in spite of their differences, hybridization-based and enumeration-based data are typically analyzed using the same pattern recognition techniques, which are generally imported from the microarray analysis field.
In the case of clustering analysis of gene profiles, the simple appropriation of practices from the microarray analysis field has been shown to lead to suboptimal performance [15]. Cai and co-workers [15] provided an elegant computational clustering solution to group tag profiles (rows in a usual expression matrix representation) that takes into account the specificities of enumeration-based datasets. However, to the best of our knowledge, a solution for transcript enumeration libraries (columns in a usual expression matrix representation) is still needed. We report on a novel computational solution, called Simcluster, to support clustering analysis of transcript enumeration libraries.
Implementation
Theory
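The simplex space of dimension d − 1 is the set of vectors with positive components that sum to one, $\mathcal{S}_{d-1} = \{\mathbf{p} \in \mathbb{R}_{+}^{d} : \mathbf{p}\mathbf{1}^{T} = 1\}$, where 1 is a vector of ones. In the gene expression context, d is the number of unique tags observed. An example of a simplex vector is p = $\mathbb{E}[\pi \mid x]$, the expected abundance vector π given the observed counts x. Applying a standard Bayesian approach, one obtains from x|π, n, using a Dirichlet prior density π ~ Dir(α), the posterior density: π|x ~ Dir(x + α).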
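The mean of a Dirichlet density Dir(x + α) has the closed form (x_i + α_i) / Σ_j (x_j + α_j), so the expected abundance vector can be computed directly from the counts. The following is a minimal Python sketch of that step, not Simcluster's C code: the function name `posterior_mean` and the toy counts are hypothetical, and setting α = 1 for every tag (a uniform prior) is an assumption made for illustration.

```python
def posterior_mean(counts, alpha=1.0):
    """Expected abundances E[pi | x] under a symmetric Dirichlet(alpha) prior.

    counts: observed tag counts x for one library (hypothetical input).
    alpha:  prior pseudo-count per tag; 1.0 (uniform prior) is an assumption.
    """
    total = sum(counts) + alpha * len(counts)
    return [(x + alpha) / total for x in counts]


# Toy library with three unique tags, one of them unobserved.
counts = [0, 3, 7]
p = posterior_mean(counts)
print(p)       # every component is strictly positive
print(sum(p))  # components sum to one (up to floating-point rounding)
```

Note that a tag with zero observed counts still receives a strictly positive expected abundance, which keeps the logarithms used by log-ratio distances well defined.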
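A distance Δ between vectors a, b and c is a metric if it satisfies:
 (i) Δ(a, b) = Δ(b, a);
 (ii) Δ(a, b) = 0 ⇔ a = b;
 (iii) Δ(a, c) ≤ Δ(a, b) + Δ(b, c).
Two additional properties are commonly required:
 (iv) scale invariance, Δ(xa, yb) = Δ(a, b), x, y ∈ ℝ_{+}; and
 (v) translational invariance, Δ(a + t, b + t) = Δ(a, b).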
These commonly required additional properties guarantee that distance measurements are not affected by the definition of arbitrary scale or measurement units and that more importance is given to the actual difference between the objects being measured rather than commonalities (more details can be found in the appendix Additional File 1).
where · is the usual Hadamard product and the division is vector-evaluated.
where I is the identity matrix, × is the Kronecker product, the −d subscript is a notation for "excluding the d^{th} element", and elementary operations are vector-evaluated.
Clustering procedures coherent with this theoretical background are suitable for transcript enumeration data.
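To make these properties concrete, the sketch below computes Aitchison's distance between two compositions and checks scale and translational invariance numerically. It is a minimal Python illustration under stated assumptions: it uses the centred log-ratio form of the distance as given in the compositional data literature [18], it takes the simplex translation to be perturbation (component-wise product followed by renormalization), and all function names and toy vectors are hypothetical rather than taken from Simcluster.

```python
import math


def closure(v):
    """Rescale a positive vector so that its components sum to one."""
    total = sum(v)
    return [x / total for x in v]


def perturb(a, t):
    """Simplex translation (assumed): Hadamard product followed by closure."""
    return closure([x * y for x, y in zip(a, t)])


def clr(v):
    """Centred log-ratio transform: log components minus their mean."""
    logs = [math.log(x) for x in v]
    mean = sum(logs) / len(logs)
    return [value - mean for value in logs]


def aitchison_distance(a, b):
    """Euclidean distance between the centred log-ratio transforms."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(clr(a), clr(b))))


a = closure([10, 30, 60])
b = closure([20, 20, 60])
t = closure([5, 1, 2])

d_ab = aitchison_distance(a, b)
d_scaled = aitchison_distance([3 * x for x in a], [7 * x for x in b])
d_shifted = aitchison_distance(perturb(a, t), perturb(b, t))

print(d_ab, d_scaled, d_shifted)  # the three values agree up to rounding
```

Rescaling either vector only shifts every log component by the same constant, which the centring removes; perturbing both vectors by t adds the same term to both transforms, which cancels in the difference.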
Software design
In short, Simcluster's method can be described as the use of a Bayesian inference step (currently with a uniform prior) to obtain the expected abundance simplex vectors given the observed counts, $\mathbb{E}[\pi \mid x]$, and the use of the Aitchisonean distance in the following algorithms: k-means, k-medoids and self-organizing maps (SOM) for partition clustering, PCA for inferring the number of variability sources present, and common variants of agglomerative hierarchical clustering.
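The two steps described above can be sketched end to end for the hierarchical case. This is a naive Python illustration, not Simcluster's C implementation: the average-linkage rule is just one common agglomerative variant chosen here for brevity, the centred log-ratio form of the Aitchison distance is assumed, and the library names, counts and function names are hypothetical.

```python
import math
from itertools import combinations


def posterior_mean(counts, alpha=1.0):
    """E[pi | x] under a uniform Dirichlet prior (alpha = 1, as assumed here)."""
    total = sum(counts) + alpha * len(counts)
    return [(x + alpha) / total for x in counts]


def aitchison_distance(a, b):
    """Aitchison distance via centred log-ratios (assumed form)."""
    la = [math.log(x) for x in a]
    lb = [math.log(x) for x in b]
    ma, mb = sum(la) / len(la), sum(lb) / len(lb)
    return math.sqrt(sum(((x - ma) - (y - mb)) ** 2 for x, y in zip(la, lb)))


def average_linkage(libraries):
    """Agglomerative clustering of libraries; returns the merge history."""
    points = {name: posterior_mean(c) for name, c in libraries.items()}
    dist = {frozenset(pair): aitchison_distance(points[pair[0]], points[pair[1]])
            for pair in combinations(points, 2)}

    def linkage(pair):
        # Mean pairwise distance between the members of two clusters.
        ca, cb = pair
        return sum(dist[frozenset((i, j))] for i in ca for j in cb) / (len(ca) * len(cb))

    clusters = [(name,) for name in points]
    merges = []
    while len(clusters) > 1:
        ca, cb = min(combinations(clusters, 2), key=linkage)
        merges.append((ca, cb, linkage((ca, cb))))
        clusters = [c for c in clusters if c not in (ca, cb)] + [ca + cb]
    return merges


# Hypothetical tag counts (four tags) for four libraries of different depths.
libraries = {
    "A1": [120, 30, 5, 45],
    "A2": [240, 66, 8, 86],
    "B1": [10, 80, 90, 20],
    "B2": [6, 41, 44, 9],
}

merges = average_linkage(libraries)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

Because the distance is scale invariant, libraries sequenced to different depths but with similar composition (A1 and A2 in this toy input) are grouped without any prior normalization of library size.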
Currently, the Simcluster package comprises: Simtree, for hierarchical clustering; Simpart, for partition clustering; Simpca, for Principal Component Analysis (PCA); and several utilities such as TreeDraw, a program to draw hierarchical clustering dendrograms with user-defined colored leaves. Simcluster's modularity allows relatively simple extension and addition of new modules or algorithms. Broader coverage of algorithms and validity assessment methods [20] is envisioned for future updates. Simcluster can be used, modified and distributed under the terms of the GPL license [21]. The software was implemented in C for improved performance and memory usage, assuring that even large datasets can be processed on a regular desktop PC (Additional File 2).
To increase source code reuse, established libraries were used: Cluster 3 [22] for clustering, GNU Scientific Library [23] for PCA, Cairo [24] and a modification of TreeDraw X [25] for colored dendrogram drawing. The input data set can be a matrix of transcript counts or general simplex vectors. Some auxiliary shell and Perl scripts are available to automatically download data from the GEO database [26], convert GEO files to Simcluster input format, and filter out low-count tags.
The Linux-based installation and compilation is facilitated by a configuration script that detects all the prerequisites for Simcluster compilation. Missing libraries are automatically downloaded from the Simcluster website and compiled by the Simcluster compilation process.
Results and Discussion
We agree with Dougherty and Brun [27, 28] that "validation" of clustering results is a heuristic process, even though there are some interesting efforts to objectively incorporate biological knowledge in this process using Gene Ontology, especially when one is clustering gene expression profiles [29, 30]. However, to illustrate the usefulness of our software, we collected several examples in which the performance of Simcluster can be considered as qualitatively superior to some traditional approaches imported from the microarray analysis field. These examples include EST, SAGE and MPSS datasets, and are available on the project's webpage [31]. Among these, we describe here a simulated enumeration dataset built from real microarray data, for which we can define the ground truth and check results against it in a relatively objective way. Of course, a comprehensive study with simulated data, consisting of comparisons of clustering algorithms, distance metrics, and distributions generating the random point sets, would be necessary to properly evaluate any clustering algorithm. This should be the subject of future work. The objective of this example is to show that Simcluster is able to reconstruct the clustering result obtained for an Affymetrix microarray dataset when the input is a simulated transcript enumeration dataset, built to mimic the real microarray biological data.
The data used to create the virtual transcript enumeration data were obtained from the Innate Immunity Systems Biology project [32] and are provided as Additional File 3. This dataset is a set of Affymetrix experiments on mouse macrophages stimulated by different Toll-like receptor agonists (LPS, PIC, CPG, R848, PAM) during a time-course (0, 20, 40, 60, 80 and 120 minutes). A detailed description and biological significance of this dataset is presented elsewhere [32, 33].
Affymetrix expression levels
Probesets  Representative ID  Gene Symbol  Intensity (sorted) 

1457375_at  BG094499  Transcribed locus  1.94760 
1452109_at  BG973910  interleukin 17 receptor E  2.14522 
...  ...  ...  ... 
M12481_3_at  AFFXbActinMur  actin beta cytoplasmic  36191.41765 
1436996_x_at  AV066625  P lysozyme structural  43458.17590 
The virtual total number of available tags is defined as proportional to the measured intensity using 10,000 as a scaling constant, an arbitrary number large enough to assure that finite population issues are negligible. Actual examples are: 19,476 for BG094499; 21,452 for BG973910; and so on until 361,914,176 for actin; and 434,581,759 for AV066625. The total amount of available tags is T = 126,971,909,452, which is a number much greater than the typical number of sequenced tags and is in accordance with the "infinite urn" model.
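The construction of the virtual urn and of a virtually sequenced library can be sketched as follows. This is a hedged Python illustration, not the authors' script: it uses only the four table rows shown above (so its total is far smaller than T), truncation toward zero is an assumption that happens to reproduce the quoted tag numbers, and drawing with replacement in proportion to the available tags is an assumed reading of the "infinite urn" model. The names `available_tags` and `virtual_library` are hypothetical.

```python
import random
from collections import Counter
from decimal import Decimal

SCALE = 10_000  # scaling constant quoted in the text

# Intensities for the four table rows shown above, kept as strings so that
# the decimal arithmetic is exact.
intensities = {
    "BG094499": "1.94760",
    "BG973910": "2.14522",
    "AFFXbActinMur": "36191.41765",
    "AV066625": "43458.17590",
}


def available_tags(intensity, scale=SCALE):
    """Virtual number of available tags; truncation is an assumption."""
    return int(Decimal(intensity) * scale)


urn = {gene: available_tags(value) for gene, value in intensities.items()}


def virtual_library(urn, n_tags, seed=0):
    """Draw n_tags with probability proportional to the available tags."""
    rng = random.Random(seed)
    genes = list(urn)
    draws = rng.choices(genes, weights=[urn[g] for g in genes], k=n_tags)
    return Counter(draws)


print(urn)
print(virtual_library(urn, n_tags=1000))
```

Increasing `n_tags` corresponds to increasing the number of virtually sequenced tags in the simulated experiment.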
It is clear that cluster results obtained by Simcluster converge to the same structure obtained by analyzing the Affymetrix data, as the number of virtually sequenced tags increases. Moreover, Simcluster's results are not only compatible with the usual microarray analysis for Affymetrix data, but also are more biologically meaningful than the results obtained by the usual microarray analysis techniques applied to the virtual sequencing data. As in the original microarray analysis, the Simcluster result is able to cluster together the different stimuli, placing consecutive timepoints close to each other.
Although this kind of analysis certainly does not provide a proof, the above results indicate that the theoretical framework is adequate for enumeration-based data, as expected. Additional examples and discussions can be found on the project's website [31].
Conclusion
We developed a software tool, called Simcluster, for clustering libraries of enumeration-based data. It is important to note that Simcluster is built in accordance with a well-established mathematical framework for compositional data analysis, which provides principled procedures for dealing with the simplex space, and is thus applicable in contexts other than transcript enumeration.
Availability and requirements

Project Name: Simcluster
Project Home Page: http://xerad.systemsbiology.net/simcluster
Operating Systems: Linux for the standalone version and platform independent for the web-based tool.
Programming Languages: C for the standalone version and C, Perl and HTML for the web-based tool.
Other requirements: some GNU/GPL or GNU/LGPL libraries distributed together with the main package.
License: GNU General Public License 2.0
Notes
List of abbreviations
 EST: Expressed Sequence Tag
 SAGE: Serial Analysis of Gene Expression
 MPSS: Massively Parallel Signature Sequencing
 SBS: Sequencing-By-Synthesis
Declarations
Acknowledgements
We thank Dr. Jared Roach (ISB) and Dr. João C. Barata (USP) for constructive discussions and Dr. Alistair Rust (ISB) for help with the web server. LV is supported by CAPES. CABP is partially supported by CNPq. This work is partially supported by NIH/NIAID grants U19AI057266 and U54AI54253 and NIH/NIGMS P50GM076547.
Authors’ Affiliations
References
1. Schena M, Shalon D, Davis R, Brown P: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995, 270(5235):467–470. 10.1126/science.270.5235.467
2. Fodor S, Rava R, Huang X, Pease A, Holmes C, Adams C: Multiplexed biochemical assays with biological chips. Nature 1993, 364: 555–556. 10.1038/364555a0
3. Velculescu V, Zhang L, Vogelstein B, Kinzler K, et al.: Serial analysis of gene expression. Science 1995, 270(5235):484–487. 10.1126/science.270.5235.484
4. Brenner S, Johnson M, Bridgham J, Golda G, Lloyd D, Johnson D, Luo S, McCurdy S, Foy M, Ewan M, et al.: Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nature Biotechnology 2000, 18: 630–634. 10.1038/76469
5. Okubo K, Hori N, Matoba R, Niiyama T, Fukushima A, Kojima Y, Matsubara K: Large scale cDNA sequencing for analysis of quantitative and qualitative aspects of gene expression. Nature Genetics 1992, 2: 173–179. 10.1038/ng1192-173
6. Bainbridge M, Warren R, Hirst M, Romanuik T, Zeng T, Go A, Delaney A, Griffith M, Hickenbotham M, Magrini V, Mardis E, Sadar M, Siddiqui A, Marra M, Jones S: Analysis of the prostate cancer cell line LNCaP transcriptome using a sequencing-by-synthesis approach. BMC Genomics 2006, 7: 246. 10.1186/1471-2164-7-246
7. Service RF: Gene sequencing. The race for the $1000 genome. Science 2006, 311(5767):1544–1546. 10.1126/science.311.5767.1544
8. Margulies M, Egholm M, Altman W, Attiya S, Bader J, Bemben L, Berka J, Braverman M, Chen Y, Chen Z, et al.: Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005, 437: 376–380.
9. Seo T, Bai X, Kim D, Meng Q, Shi S, Ruparel H, Li Z, Turro N, Ju J: Four-color DNA sequencing by synthesis on a chip using photocleavable fluorescent nucleotides. Proceedings of the National Academy of Sciences 2005, 102(17):5926–5931. 10.1073/pnas.0501965102
10. Braslavsky I, Hebert B, Kartalov E, Quake S: Sequence information can be obtained from single DNA molecules. Proc Natl Acad Sci USA 2003, 100(7):3960–3964. 10.1073/pnas.0230489100
11. Hood L, Heath J, Phelps M, Lin B: Systems Biology and New Technologies Enable Predictive and Preventative Medicine. Science 2004, 306(5696):640–643. 10.1126/science.1104635
12. Audic S, Claverie J: The significance of digital gene expression profiles. Genome Res 1997, 7: 986–989.
13. Vencio R, Brentani H, Patrao D, Pereira C: Bayesian model accounting for within-class biological variability in Serial Analysis of Gene Expression (SAGE). BMC Bioinformatics 2004, 5: 119. 10.1186/1471-2105-5-119
14. Stolovitzky G, Kundaje A, Held G, Duggar K, Haudenschild C, Zhou D, Vasicek T, Smith K, Aderem A, Roach J: Statistical analysis of MPSS measurements: Application to the study of LPS-activated macrophage gene expression. Proceedings of the National Academy of Sciences 2005, 102(5):1402–1407. 10.1073/pnas.0406555102
15. Cai L, Huang H, Blackshaw S, Liu J, Cepko C, Wong W: Clustering analysis of SAGE data using a Poisson approach. Genome Biol 2004, 5(7):R51. 10.1186/gb-2004-5-7-r51
16. Vencio R, Brentani H: Statistical Methods in Serial Analysis of Gene Expression (SAGE). In Computational and Statistical Approaches to Genomics. 2nd edition. Edited by: Zhang W, Shmulevich I. New York City, New York: Springer; 2006:209–233.
17. Thygesen H, Zwinderman A: Modeling Sage data with a truncated gamma-Poisson model. BMC Bioinformatics 2006, 7: 157. 10.1186/1471-2105-7-157
18. Aitchison J: The Statistical Analysis of Compositional Data. Monographs on Statistics and Applied Probability. London: Chapman and Hall; 1986.
19. Aitchison J: Simplicial inference. In Algebraic Methods in Statistics and Probability: Contemporary Mathematics Series, no. 287 in Contemporary Mathematics Series. Edited by: Viana M, Richards D. Providence, Rhode Island: American Mathematical Society; 2001:1–22.
20. Bolshakova N, Azuaje F, Cunningham P: An integrated tool for microarray data clustering and cluster validity assessment. Bioinformatics 2005, 21(4):451–455. 10.1093/bioinformatics/bti190
21. GNU General Public License [http://www.gnu.org/licenses/gpl.txt]
22. de Hoon M, Imoto S, Nolan J, Miyano S: Open source clustering software. Bioinformatics 2004, 20: 1453–1454. 10.1093/bioinformatics/bth078
23. GNU Scientific Library [http://www.gnu.org/software/gsl]
24. Cairo Graphics [http://cairographics.org]
25. Page R: TreeView: an application to display phylogenetic trees on personal computers. Computer Applications in the Biosciences 1996, 12(4):357–358.
26. Gene Expression Omnibus database [http://www.ncbi.nlm.nih.gov/geo]
27. Dougherty E, Brun M: A probabilistic theory of clustering. Pattern Recognition 2004, 37(5):917–925. 10.1016/j.patcog.2003.10.003
28. Brun M, Sima C, Hua J, Lowey J, Carroll B, Suh E, Dougherty E: Model-based evaluation of clustering validation measures. Pattern Recognition 2007, 40(3):807–824. 10.1016/j.patcog.2006.06.026
29. Datta S, Datta S: Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes. BMC Bioinformatics 2006, 7: 397. 10.1186/1471-2105-7-397
30. Loganantharaj R, S C, Clifford J: Metric for measuring the effectiveness of clustering of DNA microarray expression. BMC Bioinformatics 2006, 7(Suppl 2):S5. 10.1186/1471-2105-7-S2-S5
31. Simcluster Home Page [http://xerad.systemsbiology.net/simcluster]
32. Innate Immunity Systems Biology [http://www.innateimmunitysystemsbiology.org]
33. Gilchrist M, Thorsson V, Li B, Rust A, Korb M, Kennedy K, Hai T, Bolouri H, Aderem A: Systems biology approaches identify ATF3 as a negative regulator of Toll-like receptor 4. Nature 2006, 441: 173–178. 10.1038/nature04768
34. The R Project for Statistical Computing [http://www.r-project.org]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.