Clustering of the SOM easily reveals distinct gene expression patterns: results of a reanalysis of lymphoma study
© Wang et al; licensee BioMed Central Ltd. 2002
Received: 14 June 2002
Accepted: 24 November 2002
Published: 24 November 2002
A method to evaluate and analyze the massive data generated by series of microarray experiments is of utmost importance to reveal the hidden patterns of gene expression. Because of the complexity and the high dimensionality of microarray gene expression profiles, the dimensional reduction of raw expression data and the feature selections necessary for, for example, classification of disease samples remains a challenge. To solve the problem we propose a two-level analysis. First self-organizing map (SOM) is used. SOM is a vector quantization method that simplifies and reduces the dimensionality of original measurements and visualizes individual tumor sample in a SOM component plane. Next, hierarchical clustering and K-means clustering is used to identify patterns of gene expression useful for classification of samples.
We tested the two-level analysis on public data from diffuse large B-cell lymphomas. The analysis easily distinguished major gene expression patterns without the need for supervision: a germinal center-related, a proliferation, an inflammatory and a plasma cell differentiation-related gene expression pattern. The first three patterns matched the patterns described in the original publication using supervised clustering analysis, whereas the fourth one was novel.
Our study shows that by using SOM as an intermediate step to analyze genome-wide gene expression data, the gene expression patterns can more easily be revealed. The "expression display" by the SOM component plane summarises the complicated data in a way that allows the clinician to evaluate the classification options rather than giving a fixed diagnosis.
The development and progression of cancer is accompanied by complex changes in the patterns of gene expression. That can be revealed by DNA microarrays analysis . However, to reliably identify expression patterns associated with tumor type, prognosis or therapy, hundreds of samples need to be studied, and powerful data mining tools are needed. Microarray experiments are generally performed without a priori hypothesis. Therefore, the data mining tools have to be developed that reveal a maximum of information to generate new hypotheses  with minimal supervision. Hierarchical clustering is a frequently used method [2–4], but has a number of shortcomings [5, 6]. Notably, the most important genes defining the branches of the clustering tree are not readily recognized, and important patterns can be lost due to the deterministic nature of clustering or the high dimensionality of data. To solve this problem, we propose a two-level analysis  for the study of complex gene expression data. This analysis summarizes the data by the SOM component plane, and then clusters the SOM to investigate the feature gene expression patterns. The SOM reduces the dimensionality of the data, and thereby allows to easy display the data and reveal the gene expression patterns. The visual inspection of the gene expression patterns in each single case, and comparison of those patterns between the different cases allows identifying common patterns in gene expression that may have been lost by directly applying hierarchal clustering to the data. In addition, by K-means clustering of the SOM, genes that have similar expression patterns, and might therefore be functionally related, may be identified.
To test the power of this two-level approach, we applied it to the analysis of a publicly available gene expression data set of non-Hodgkin's lymphomas, including mostly diffuse large B-cell lymphoma (DLBCL), follicular lymphoma (FL) and chronic lymphocytic leukaemia (CLL). K-means clustering of the SOM readily identifies four distinct gene expression profiles: germinal center related, proliferation, inflammatory and plasma cell differentiation related gene expression patterns. All identified gene expression patterns are correlated with clinical survival analysis.
Selected genes grouped to cluster 1,7,9,10,11 of K-means clustering of SOM. Full list can be found in the web supplement .
Ki67 (long type)
Survivin = apoptosis inhibitor = effector cell protease EPR-1
XRCC9 = DNA repair protein
Actin = cytoskeletal gamma-actin
Similar to MCM2 = DNA replication licensing factor
MPP1 = Putative M phase phosphoprotein 1
BUB1 = putative mitotic checkpoint protein ser/thr kinase
PCNA = proliferating cell nuclear antigen
BAK = BCL-2 family member
Bag-1 = Bcl-2 interacting anti-apoptotic protein = RAP46 = Glucocorticoid receptor-associated protein
C-myc binding protein
40S ribosomal protein S21
Ribosomal protein S29
Ribosomal protein S9
Ribosomal protein S16
Ribosomal protein L27
Similar to ribosomal protein L37a
Ribosomal protein L32
A1 = Bfl-1 = GRs = Bcl-2 related protein
IRF-4 = LSIRF = Mum1 = homologue of Pip = Lymphoid-specific interferon regulatory factor = Multiple myeloma oncogene 1
Similar to TREB and X box binding protein 1
MCL1 = myeloid cell differentiation protein
701606, 1286850, 200814
CD10 = CALLA = Neprilysin = enkepalinase
824476, 95093, 1350545
Spi-B transcription factor
1335782, 13194072, 1338245
Oct-2 = lymphoid-specific octamer binding transcription factor = POU
Spi-1 = PU.1 = ets family transcription factor
CD86 = B7-2 = CD28 and CTLA-4 counter-receptor 2
BLC = BCA-1 = B lymphocyte chemoattractant BLC = CXC chemokine
SDF-1 = Stromal cell-derived factor 1 = chemokine
CD14 = monocyte differentiation antigen
CD3E antigen, epsilon polypeptide
CD64 = high affinity immunogobulin gamma FC receptor I A form precursor = FC-gamma
CD3 delta = T cell surface glycoprotein
T cell receptor beta chain
CD11C = leukocyte adhesion protein p150,95 alpha subunit = integrin alpha-X
1219244, 57, 1071581
RANTES = chemokine
S100 calcium binding protein A4 = Placental calcium binding protein = Calvasculin
C-C chemokine receptor 5 == CC CK5
Major histocompatibility complex, class II, DN alpha
Figures 1b shows the mean SOM component planes of CLL and FL. Typically for CLL the genes in the whole lower part of the SOM are highly expressed while for FL the genes in the lower and middle left part of the SOM (cluster 10) are highly expressed. Therefore, the most prominent distinction between CLL and FL lies in the expression of genes that are characteristic of germinal center B cells (cluster 10), as has also been suggested by Alizadeh et al .
When microarray measurements are presented in random order, the patterns of gene expression are impossible to discern by eye, and methods like hierarchical clustering are frequently used to sort the measurements in such a way that many patterns can easily be visualized, such as in figure 3. However, this method suffers from several shortcomings , of which the most important is the loss of information of potentially important patterns in a high dimensional gene space. Although the number of measured genes is large there may only be a few underlying gene components that account for most of the response variation; for example, only a few linear combinations of a subset of genes can account for nearly all of the expression variation among various tumor types. In such a situation, dimension reduction is needed to reduce the high dimensional gene space to a low dimensional gene component space; for instance, principal component analysis  and partial least squares  had been applied to the dimension reduction of microarray data. Thus, we proposed a two-level analysis, first to summarize the gene expression data by a large set of prototypes; then the prototypes are further combined to form the actual clusters in the next step. SOM is a suitable method for data reduction since it creates a set of prototype vectors representing the gene expression data and carries out a topology preserving the projection of the prototypes from the high-dimensional gene space into a low-dimensional map. To preserve the cluster structure of original data in a low-dimensional map, we can select as many prototype vectors as needed, where the number of prototypes equals 5 N0.5 (N is the number of genes) . The map follows the probability density function of the data and is very robust with regard to missing data points . Furthermore, the component plane of SOM can be used as a visualization surface for showing different features of the SOM (and thus of the gene expression data), for example the cluster structure . By clustering the SOM, a good insight into the cluster structure (and thus of the feature gene expression patterns) can be obtained.
We applied this two-level approach to the analysis of a set DLBCL samples that have previously been published. The inspection of the maps obtained through our analysis clearly reveals four major gene expression patterns. One pattern concerns genes expressed by germinal center B cells (cluster 10), the second could be called an 'inflammatory' pattern and relates to genes expressed by T-cells and macrophages (cluster 11). The third pattern is an extensive collection of genes involved in cell proliferation (cluster 1), which seems to be closely linked to the fourth pattern, anti-apoptosis and plasma cell differentiation-related genes (cluster 7, 9). This last pattern has not previously been described whereas the others were also discovered by Alizadeh et. al, by using hierarchical clustering only.
The survival data based on the grouping of cases according to the different gene expression patterns show that all these expression patterns were significantly correlated with survival (figure 2a, 2b, 2c). When the germinal center B cell gene expression pattern (cluster 10) is combined with the proliferation/anti-apoptosis/plasma cell differentiation pattern (cluster 1,7,9), thus yielding four groups (figure 2d), significant differences in survival are still seen notwithstanding the low number of cases. It is of particular interest that all but one of the cases expressing high levels of germinal center (cell) genes but low levels of proliferation/anti-apoptosis/plasma cell genes, have a survival beyond 5 years (figure 2d). This contrasts sharply with the cases expressing low levels of germinal center B cell genes but high levels of proliferation/anti-apoptosis and plasma cell differentiation genes of which none survive beyond 5 years. Although these data need to be confirmed in larger series of cases, a division of DLBCL according to expression of a combination of genes relating to the germinal center, proliferation, anti-apoptosis and plasma cell differentiation seems to be very relevant in predicting prognosis. Why the expressions of genes related to cell proliferation, anti-apoptosis and plasma cell differentiation are frequently co-expressed in DLBCL is not known and needs to be further investigated. It is apparent from our further analysis (figure 3) that there are two mutually exclusive patterns of gene expression related to plasma cell differentiation. One pattern contains the transcription factors IRF4 and XBP-1, which have both been shown to be important for plasma cell differentiation, as well as STAT3, which is part of the IL-6 signaling pathway involved in plasma cell differentiation [15–17]. The other pattern shows many unknown genes in addition to genes coding for ribosomal proteins. The latter suggests an expression pattern related to a later stage of plasma cell differentiation. These patterns are intriguing but more studies on normal plasma cell differentiation are needed in order for these plasmas to be fully understood.
In conclusion, we propose a two-level approach for the analysis of gene expression patterns, where the clustering analysis is carried out in a set of summarized prototype vectors created by SOM. By applying the current two-level approach to the DLBCL data set , the discovered gene expression patterns were consistent with the ones originally published. In addition, a novel pattern of gene expression related to plasma cell differentiation was revealed. Our results underscore the value of the two-level analysis for discovering gene expression patterns, and the method should be useful as a part of routine classification of clinical samples, when the suggested subdivision have been confirmed in large studies.
Sources of experimental data
Preprocessing of data
The data were cleaned before doing any data mining. This includes flagging and removal of bad measurements, i.e. measurements where the fluorescent intensity in one channel was less than 1.4 times the local background were discarded , and replacement of values for identical probes (same IMAGE number and gene) with the mean ratio. After cleaning the original data, we were left with values for 3906 genes from 96 samples, and these ratios were log 2 transformed.
Hierarchical clustering  is an agglomerative clustering usually having the following steps: 1) Initialization: assign each vector (the series of values from a single sample) to its own cluster. 2) Computation of the distance between all clusters. 3) Merging the two clusters that are closest to each other. Step 2 and 3 are repeated until there is only one cluster left. In this work, log 2 transformed ratios were median-centered before clustering, Pearson correlation was used as distance matrixes and the centered average linkage method was used for merging. Hierarchical clustering was applied to both rows and columns using the Cluster and Tree View software from Stanford .
Self-organizing map (SOM) and K-means clustering
The basic SOM  consists of m neurons located on a regular low-dimensional grid, usually 1- or 2- dimensional. The lattice of the grid is hexagonal. The basic SOM algorithm is iterative. Each neuron i has a d-dimensional prototype vector mi= [mi1, ..., mid], d is the input vector dimension. Before the training phase, initial values are given to the prototype vectors and typically linear initialization was used. At each training step, a sample data vector x is randomly chosen from the training set. Distances between x and all the prototype vectors are computed. During training, the SOM behaves like a flexible net that folds onto the "cloud" formed by the training data. Because of the neighborhood relations, neighboring prototypes are pulled to the same direction, and thus prototype vectors of neighboring units resemble each other . To inspect the cluster structure of the map, the SOM component plane (figure 1) was used to show the gene expression features of various tumor samples, and also the common gene expression patterns of each tumor type. Each component plane can be thought of as a slice of the map: it consists of the values of a single vector component in all map units. It is visualized as 2-dimensional color images, where the color of a map unit corresponds to its value. By visualizing the spread of values of that component and comparing component planes with each other, correlations are revealed as similar patterns in identical positions of the component planes. Based on overall view, it is easy to select interesting component combinations and map units for further investigation. To be able to more effectively study interesting groups of map units, methods to give good candidates for map unit clusters or groups are required. Thus, the trained prototype vectors mi of SOM is further clustered by K-means clustering and combined to form the actual clusters, more detailed description of clustering of the SOM can be found in the early paper .
K-means clustering is a partition clustering, it classifies the data into k groups, which together satisfy the requirements of a partition: (1) Each group must contain at least one object. (2) Each object must belong to only one group. To select the best k among different partitions, each of these can be evaluated using some kind of validity index. In our calculations, we used the Davies-Bouldin index , which minimizes the ratio between within-cluster distance and between-cluster distance, indicating good clustering results for spherical clusters with low values. Because no unified theory for determining the number of clusters has been fully developed and accepted, the selection of optimal number of clusters remains as an active research field [19, 21]. Thus, the Davies-Bouldin index used here is only a guideline to estimate the best clustering among the partitionings with different number of clusters. Some problems need to be noted when clustering the SOM by the K-means clustering, due to the properties of the algorithm: it not only searches for spherical clusters but also clusters with roughly equal number of samples, the non-spherical cluster could not be properly recognized as one cluster; and as the number of clusters is increased, the number of samples in clusters decreases, which makes the algorithm more sensitive to outliers. Therefore, we have to carefully verify the results obtained by K-means clustering .
In this work, SOM and K-means clustering were carried out by the SOM toolbox in MATLAB . SOM was trained using batch version of the algorithm for raw expression data. All prototype vectors were linearly initialized in the subspace spanned by the two eigenvectors with greatest eigenvalues computed from the training data. The SOM was trained in two phases: a rough training with large initial neighborhood width and a fine-tuning phase with small initial neighborhood width. The neighborhood width decreased linearly to 1; neighborhood function was Gaussian. The training length of the two phases was 1 and 4 epochs and the initial neighborhood width 3 and 1, respectively.
The statistical treatment of survival times is known as survival analysis. From a set of observed survival times from a sample of individuals we can estimate the proportion of the population of such people who would survive a given length of time in the same circumstances. The method yields a graph, the Kaplan-Merier survival curve, is drawn as a "step function" that changes at every distinct survival time. The time of survival observations are indicated by ticks on the survival curve, which shows at a glance the survival times of the surviving subjects (figure 2). To compare the survival experience of two or more groups of subjects we calculate the logrank test. The logrank test is a hypothesis test for testing the null hypothesis that the groups being compared are samples from the same population as regards survival experience, it involves calculating the observed and expected numbers of failures in separate time intervals, and summing these, comparing the results to a χ2 distribution with k-1 degrees of freedom gives P value, where there are k groups of observations . The plotting of Kaplan-Merier survival curves and logrank test of significance level P value were implemented in MATLAB.
This work was supported by the Norwegian Cancer Society http://www.kreft.no.
- DeRisi J, Penland L, Brown PO, Bittner ML, Meltzer PS, Ray M, Chen Y, Su YA, Trent JM: Use of a cDNA microarray to analyze gene expression patterns in human cancer. Nat Genet 1996, 14: 457–460.View ArticlePubMedGoogle Scholar
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95: 14863–14868. 10.1073/pnas.95.25.14863PubMed CentralView ArticlePubMedGoogle Scholar
- Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JC, Trent JM, Staudt LM, Hudson J Jr, Boguski MS, Lashkari D, Shalon D, Botstein D, Brown PO: The transcriptional program in the response of human fibroblasts to serum. Science 1999, 283: 83–87. 10.1126/science.283.5398.83View ArticlePubMedGoogle Scholar
- Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A, Williams C, Zhu SX, Lonning PE, Borresen-Dale AL, Brown PO, Botstein D: Molecular portraits of human breast tumours. Nature 2000, 406: 747–752. 10.1038/35021093View ArticlePubMedGoogle Scholar
- Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 1999, 96: 2907–2912. 10.1073/pnas.96.6.2907PubMed CentralView ArticlePubMedGoogle Scholar
- Kufman L, Rousseeuw PJ: Finding groups in data, An introduction to cluster analysis. (Edited by: Kuman L. Brussels). John Wiley & Sons 1991.Google Scholar
- Kohonen T: Self-organizing maps. (Edited by: Lotsch HKV). Berlin, Springer 1997, 117.View ArticleGoogle Scholar
- Toronen P, Kolehmainen M, Wong G, Castren E: Analysis of gene expression data using self-organizing maps. FEBS Letters 1999, 451: 142–146. 10.1016/S0014-5793(99)00524-4View ArticlePubMedGoogle Scholar
- Altman DG: Practical statistics for medical research. (Edited by: Altman DG). London, Chapman and Hall 1991.Google Scholar
- Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J Jr, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Staudt LM: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. [see comments]. Nature 2000, 403: 503–511. 10.1038/35000501View ArticlePubMedGoogle Scholar
- Vesanto J: SOM-Based data visualization methods. Intelligent Data Analysis journal 1999.Google Scholar
- Everitt BS: Cluster Analysis. (Edited by: Edward Arnold). London, John Wiley & Sons 1987.Google Scholar
- Junbai wang, Jan Delabie, Hans Christian Aasheim, Erlend Smeland, Ola Myklebost: Supplementary informatioin for "Reanalysis of global gene expression patterns from Diffuse Large B-Cell Lymphoma by a two-level strategy reveals novel subtypes".2001. [http://matrise.uio.no/supDLBCL/Supview.html]Google Scholar
- Vesanto J, Alhoniemi E: Clustering of the self-organizing map. IEEE TNN 2000, 11(3):586–600. 10.1109/72.846731Google Scholar
- Reimold AM, Iwakoshi NN, Manis J, Vallabhajosyula P, Szomolanyi-Tsuda E, Gravallese EM, Friend D, Grusby MJ, Alt F, Glimcher LH: Plasma cell differentiation requires the transcription factor XBP-1. Nature 2001, 412: 300–307. 10.1038/35085509View ArticlePubMedGoogle Scholar
- Hirano T, Ishihara K, Hibi M: Roles of STAT3 in mediating the cell growth, differnetiation and survival signals relayed through the IL-6 family of cytokine receptors. Oncogene 2000, 19: 2548–2556. 10.1038/sj.onc.1203551View ArticlePubMedGoogle Scholar
- Mittrucker HW, Matsuyama T, Grossman A, Kundig TM, Potter J, Shahinian A, Wakeham A, Patterson B, Ohashi PS, Mak TW: Requirement for the transcription factor LSIRF/IRF4 for mature B and T lymphocyte function. Science 1997, 275: 540–543. 10.1126/science.275.5299.540View ArticlePubMedGoogle Scholar
- Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, Meltzer PS: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat med 2001, 7: 673–679. 10.1038/89044PubMed CentralView ArticlePubMedGoogle Scholar
- Horimoto K, Toh H: Statistical estimation of cluster boundaries in gene expression profile data. Bioinformatics 2001, 17(12):1143–1151. 10.1093/bioinformatics/17.12.1143View ArticlePubMedGoogle Scholar
- Nguyen DV, Rocke DM: Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 2002, 18(1):39–50. 10.1093/bioinformatics/18.1.39View ArticlePubMedGoogle Scholar
- Fukunaga K: Introduction to statistical pattern recognition. (Edited by: Rheinboldt W). Boston, Academic Press 1990.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.