Multiclass discovery in array data
© Liu and Ringnér; licensee BioMed Central Ltd. 2004
Received: 21 January 2004
Accepted: 04 June 2004
Published: 04 June 2004
A routine goal in the analysis of microarray data is to identify genes with expression levels that correlate with known classes of experiments. In a growing number of array data sets, it has been shown that there is an over-abundance of genes that discriminate between known classes as compared to expectations for random classes. Therefore, one can search for novel classes in array data by looking for partitions of experiments for which there are an over-abundance of discriminatory genes. We have previously used such an approach in a breast cancer study.
We describe the implementation of an unsupervised classification method for class discovery in microarray data. The method allows for discovery of more than two classes. We applied our method on two published microarray data sets: small round blue cell tumors and breast tumors. The method predicts relevant classes in the data sets with high success rates.
We conclude that the proposed method is accurate and efficient in finding biologically relevant classes in microarray data. Additionally, the method is useful for quality control of microarray experiments. We have made the method available as a computer program.
A common application in microarray data analysis is to identify genes that, based on their expression levels, discriminate between known classes of experiments. This identification is often achieved by using various statistical measures to, gene-by-gene, correlate the expression levels with the classes of interest. In this way a discriminatory weight is calculated for each gene. For example, Golub et al. used a signal-to-noise statistic to find genes with expression patterns that discriminate between samples obtained from patients with acute myeloid leukemia and patients with acute lymphoblastic leukemia . Other examples include using a standard t-test to discriminate between breast tumors from carriers of BRCA1 mutations and carriers of BRCA2 mutations . For an overview of applications see . In most studies, the number of genes is much larger than the number of experiments. For such a large number of genes, it is crucial to estimate how many genes would correlate with the classes of interest by chance. Often, a P value corresponding to the probability of obtaining a given weight by chance is calculated for each weight. One can then investigate if there is an over-abundance of discriminatory genes for classes of interest as compared to randomly selected classes. Indeed, such an over-abundance has been found for many microarray-based classification applications (see e.g. [1, 2, 4]).
Often clustering methods, such as hierarchical clustering , k-means clustering , or self-organizing maps (SOM)  are used for unsupervised classification of array data (see  for an overview). For example, hierarchical clustering has been used to discover two subtypes of diffuse B-cell lymphoma , three subtypes of breast tumors , and two subtypes of cutaneous melanoma . Dugas et al. have developed an iterative k-means clustering method for class discovery in array data . Examples of SOM-based methods for discovery of cancer subtypes include applying SOMs to automatically discover the distinction between acute myeloid leukemia and acute lymphoblastic leukemia , and to separate 14 different tumor subtypes . SOMs and k-means clustering require the user to predefine the number of clusters to be found. Hsu et al. proposed an unsupervised hierarchical self-organizing map approach that automatically identifies a suitable number of clusters, and applied it to a couple of publicly available array data sets .
In these clustering methods, experiments are clustered based on the distance between them in gene expression space. An alternative unsupervised classification approach to discover classes in gene expression data, which exploits the fact that there typically is an over-abundance of genes separating known classes was proposed by Ben-Dor et al. . In their method, classes are discovered by seeking partitions of experiments with an over-abundance of discriminatory genes. In contrast to many clustering methods, no metric to define distances between experiments is required. Furthermore, classes are discovered based only on the subset of genes that are differentially expressed between the classes, whereas in unsupervised clustering the distances are often based on all the genes. A similar classification method, which also searches for binary class distinctions in a set of samples that show separation in the expression of subsets of genes, has been developed by von Heydebreck et al. . These classification methods are well-suited to discover several significant partitions of experiments, each based on a different subset of genes.
Inspired by the method by Ben-Dor et al., we have previously used a similar approach to sub-classify familial breast cancer into two classes . Briefly, the approach was as follows. For a given partition of the experiments into two classes (with n1 and n2 experiments, respectively), a discriminative weight was calculated for each gene using the signal-to-noise statistic . To assign P values to the weights one has to perform random permutation tests. Such a test was used to generate a weight distribution that could be expected for two classes with n1 and n2 experiments under the assumption of random gene expression. Using this weight distribution, each weight was assigned a P value corresponding to the probability to obtain the weight or larger for a random partitioning into n1 and n2 experiments. Candidate partitions of the data were scored with the number of statistically significant weights, that is the number of genes that were significantly different in expression between the classes. A simulated annealing  scheme was used, in which partitions were updated by changing the class of a randomly selected experiment, to find the partition of experiments into the classes with the highest score. Our approach is different from the Ben-Dor et al. method in two respects. First, they use the total number of misclassification (TNoM) score to find discriminatory genes . Second, we use a fixed P value cut-off to find the number of discriminatory genes, whereas they instead use surprise scores .
In this work, we have extended our unsupervised classification method for discovery of more than two classes and to allow for missing values in gene expression data. Furthermore, we have made the method publicly available as a computer program. For the breast cancer study , we performed random permutation tests for all possible n1 and n2 that add up to the total number of experiments. Extending the method to find a preset but arbitrary number of classes and to allow for missing values would result in performing random permutation tests for many more combinations of class sizes. For nonparametric rank-based statistics, analytically calculated P values correspond to what would be obtained by random permutation tests. Therefore, we decided to use such statistics instead of a parametric test. Moreover, nonparametric methods have been shown to be robust conservative (low numbers of false positives) in its application to the identification of discriminatory genes in gene expression data .
Here, we describe the unsupervised classification method used in our class discovery program in detail, and results from applying it on two publicly available data sets.
Identification of differentially expressed genes
The Wilcoxon rank sum test (WT) is used to identify genes differentially expressed in two classes . The nonparametric WT tests for equality of medians of two samples of data, but unlike the t-test it makes no assumption that the data is normally distributed. It operates on rank-transformed data rather than the raw values. In our method, the expression values of each gene are ranked across experiments from low to high, disregarding to which class each experiment belongs. For a given partition of experiments into two classes, a discriminatory weight u g is calculated for each gene (g),
where e denotes an experiment with rank g (e) for g. Next, we want to calculate a P value of u g for the null hypothesis that expression values for all experiments are drawn from the same probability distribution. If the P value is near zero, it casts doubt on the null hypothesis and suggests that the medians of expression values are significantly different in the two classes. For n1 > 8 (and thus n2 > 8 also), the P value can be calculated by using a normal approximation ,
For partitions with n1 being at least 9, we use this normal approximation. For partitions with n1 smaller than 9, we assign P values to u g using a random permutation test. For each n1 and n2, the test is based on 50,000 random permutations of class labels. For the WT, we use two-sided tests.
For partitions of experiments into three or more classes, the Kruskal-Wallis test (KWT) is used to identify discriminatory genes . The KWT is a nonparametric version of the one-way analysis of variance (ANOVA), uses the ranks of the data, and is an extension of the WT to more than two groups. It tests for equality of medians of k samples of data. For a given partition of experiments into k classes, a discriminatory weight H g is calculated for each gene,
For partitions where all classes have at least 5 samples, the sampling distribution of H g can be approximated very well by a chi-square distribution with k - 1 degrees of freedom . Hence, the P value for the null hypothesis that expression values for all experiments are drawn from the same probability distribution can be calculated from χ2(k - 1). If the P value is close to zero it suggests that the median of expression levels for at least one class is significantly different from the others. This does not necessarily mean that every group differs from every other group. For partitions with all classes having at least 5 experiments, we use the chi-square approximation. P values are not calculated for partitions into more than 2 classes for which a class have less than 5 experiments.
Missing values is handled by using, for each gene, only those experiments for which there exists measurements in the calculation of the statistical test (WT or KWT).
Scoring partitions of experiments
Each partition of experiments is assigned a score corresponding to the number of genes with P values smaller than or equal to a user specified cut-off. Thus, the score is the number of genes with significantly different expression in the classes. Because the total number of genes for a data set is identical for all partitions, we do not correct P values for multiple testing. For random gene expressions, we would, for a typical partition, expect a score of P multiplied by the total number of genes. We denote this expected score by E.
Finding the partition with the highest score
Parameters in the class discovery procedure and the values used for the SRBCT and the BRCAx data.
150 (SRBCT) or 50 (BRCAx)
Peeling discriminatory genes
For a data set there may be several biologically relevant partitions. However, one of them is often supported by many more discriminatory genes than the others. In such a scenario, the discovered partitions with the highest scores will mostly be similar to the best one, because shifting one or two experiments will still result in a higher score as compared to other relevant partitions. Hence, it may be difficult to discover important but not dominating partitions. One way to address this issue is to remove genes that contribute to the score for the best partition and run the class discovery program again using this smaller set of genes . Using such a procedure, one can iteratively peel discriminatory genes from the data set to systematically investigate the presence of further partitions with significant scores.
To further evaluate discovered classes we used hierarchical clustering. Clustering was performed using EPCLUST http://ep.ebi.ac.uk/EP/EPCLUST/ with the distance measure parameter set to linear correlation based distance (Pearson).
To investigate discovery of two classes, we used the non-BRCA1/2 familial breast cancer (termed BRCAx) data set by Hedenfalk et al. . This data set consists of 16 BRCAx samples, for which intensity ratios of 4,795 clones considered to be well-measured are provided. Following Hedenfalk et al., we performed pre-processing of the data such that the log intensity ratios were mean-centered and these values were used as a measure of the expression levels.
Batches of production for the 88 SRBCT microarrays.
T1, T2, T3, T4, C1, C2, C3, C4
C1, C2, C3, C4
C1, C2, C3
T1, T2, T3, T4, C8, C11
T6, T7, T9, T11, T12, T13, T14, T15, T19
T5, T6, T7, T8, C3, C4
6, 9, 11, 20, 21
C6, C7, C8, C9, C1O, C11
C5, C6, C7, C8
C4, C5, C6, C7, C8, C9, C10, C11, C12
C2, C5, C6, C7, C9, C10
1, 2, 4, 7, 12, 17
8, 10, 13, 14, 15, 16, 18, 19, 22, 23, 25
Results and Discussion
Discovery of two classes in the breast cancer data set
To validate our class discovery program, we first applied it to the Hedenfalk et al. BRCAx data set and looked for two classes, as in our original analysis . In that work, a signal-to-noise statistic  and a P value cut-off of 0.001 was used to discover two classes supported by 60 discriminatory genes. For a fixed P value cut-off, we now expect less discriminatory genes, because of the change to a rank-based statistic. Therefore, we decided to use a somewhat larger cut-off to facilitate a comparison of our set of discriminatory genes with the original set. Importantly, we found the classes discovered to be rather insensitive to changes in P value cut-off. Our best scoring partition was supported by 133 discriminatory genes (P <= 0.005), whereas we would expect 24 discriminatory genes for a random partition (E = 24). Our partition was similar to the one in the original study, except that three samples had shifted class. Moreover, the Hedenfalk et al. partition was also highly significant (score 117), and the top scoring partitions were dominated by partitions very similar to it. Of our 117 discriminatory genes 57 overlapped with the 60 found in ref. . We conclude that the partition found in the original study is robust to changes in the details of the class discovery algorithm.
Setting the values of the parameters in the annealing schedule requires experimentation. For our data sets, we selected parameter values for which the program runs relatively fast (5–10 minutes on a standard personal computer), and for which running the program again, with a new random initialization of the classes, often resulted in again finding the best partition found previously. For more conservative parameter values (Tstart = 25, Nsuccess = 100, Ntotal = 500) for which the program takes 10–15 times longer to run, we found that the best partition was essentially found every time the program was run (9 out of 10 times for the breast cancer data).
Discovery of two classes in the SRBCT data set
Discovery of three classes in the SRBCT data set
Next, we investigated discovery of more than two classes. For three classes, our best scoring partition was supported by 934 discriminatory genes (P <= 0.001; E = 2.3). The three classes separated experiments according to batches of microarray production (Table 2). Experiments on print batches 104, 118 and 119 were in one class each. There was only one error: RMS-T4 (print batch 104) was in the class with experiments on print batch 118, as were the two experiments from batches 143 and 163. When only using the 61 training experiments belonging to batches 104, 118 and 119, the score for a perfect partitioning into these three batches was 923.
Discovery of four classes in the SRBCT data set
For four classes, our best scoring partition was supported by 1051 discriminatory genes (P <= 0.001; E = 2.3). The four classes were identical to our result for three classes (according to print batches), except that the BL experiments (on batches 104 and 119) were in a class of their own. Thus, apart from finding the BL class, the dominant separation of the experiments into four classes is according to print batches. In agreement with our result, BL was in the original supervised analysis found to be the category with the most distinct expression profile . The score for perfect partitioning into the four SRBCT categories was 544, which though not the best score is highly significant compared to random expectations.
Peeling genes discriminatory for batches of array production
The four classes of experiments identified by the class discovery program (score = 470; P < 0.001; E = 1.4) after removing 923 genes discriminatory for batches of array production.
EWS-T & RMS-T
NB-C & RMS-C
Peeling genes discriminatory between cell lines and tumors
The four classes of experiments identified by the class discovery program (score = 353; P < 0.001; E = 1.2) after removing 1076 genes discriminatory for batches of array production or cell lines versus tumors.
Results for additional SRBCT test samples
We have developed an unsupervised classification method for the discovery of two or more classes of experiments in microarray data. The method has been implemented as a publicly available computer program. We have tested the method on two published gene expression data sets and conclude that the proposed method is effective in finding relevant classes. We are planning to make the program available as a plugin for BASE , the open-source database for array data maintained by our group.
When applying our method on these data sets, we have found that the best partitions discovered are relatively insensitive to the cut-off in P value. Moreover, even though there is no guarantee that the simulated annealing algorithm finds the optimal score, we found that high-quality local minimas are discovered. Nevertheless, any user of the program will benefit from experimenting with the values of the parameters in the algorithm to explore the particular details of each data set.
The method was designed for unsupervised classification, but it can also be very useful for quality control when classes of experiments are known. It is common to look for an over-abundance of discriminatory genes separating known classes. In such a scenario, it may be useful to seek the partitions having the largest overabundance of discriminatory genes. Thereby, one can rule out potential problems, as highlighted by our example of the print batches for the SRBCT experiments. The SRBCT experiments were performed on arrays produced by the first generation of cDNA microarray printers. Using our class discovery program on more recent data sets our experience is that differences due to print batches are much smaller (data not shown). Nevertheless, we think that our results using the SRBCT data set illustrates how experimental artifacts can be found and corrected for when using our class discovery program. Here, one should keep in mind the crucial importance of random experimental design: if each SRBCT category had been investigated using a unique print batch, there would be no way to disentangle the relevant biology from artifacts. Moreover, these findings illustrate that it is important to know the procedural steps underlying an experiment to be able to interpret discovered classes.
Availability and requirements
Project name: MCD – Multiclass discoverer
Project homepage: http://bioinfo.thep.lu.se/classdiscoverer
Operating systems: Linux, Windows, Mac OS X
Programming language: Perl
Other requirements: The Perl modules: Algorithm::Numerical::Shuffle, POSIX, Statistics::Distributions, Storable, and Tie::RefHash
License: GNU general public license
Any restrictions to use by non-academics: none
We thank Amir Ben-Dor and Zohar Yakhini for valuable discussions. This work was in part supported by the National Research School in Genomics and Bioinformatics and the Knut and Alice Wallenberg Foundation through the Swegene consortium.
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ED: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286: 531–537. 10.1126/science.286.5439.531View ArticlePubMedGoogle Scholar
- Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Kallioniemi OP, Wilfond B, Borg Å, Trent J: Gene-expression profiles in hereditary breast cancer. N Engl J Med 2001, 344: 539–548. 10.1056/NEJM200102223440801View ArticlePubMedGoogle Scholar
- Ringnér M, Peterson C, Khan J: Analyzing array data using supervised methods. Pharmacogenomics 2002, 3: 403–415.View ArticlePubMedGoogle Scholar
- Bittner M, Meltzer P, Chen Y, Jiang Y, Seftor E, Hendrix M, Radmacher M, Simon R, Yakhini Z, Ben-Dor A, Sampas N, Dougherty E, Wang E, Marincola F, Gooden C, Lueders J, Glatfelter A, Pollock P, Carpten J, Gillanders E, Leja D, Dietrich K, Beaudry C, Berens M, Alberts D, Sondak V, Hayward N, Trent J: Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 2000, 406: 536–540. 10.1038/35020115View ArticlePubMedGoogle Scholar
- Sokal RR, Michener CD: A statistical method for evaluating systematic relationships. Univ Kans Sci Bull 1958, 38: 1409–1438.Google Scholar
- Hartigan JA, Wong MA: A K-means clustering algorithm. Applied Statistics 1979, 28: 100–108.View ArticleGoogle Scholar
- Kohonen T: Self-Organizing Maps 3 Edition Berlin: Springer 2001.View ArticleGoogle Scholar
- Quackenbush J: Computational analysis of microarray data. Nat Rev Genet 2001, 2: 418–427. 10.1038/35076576View ArticlePubMedGoogle Scholar
- Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J Jr, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000, 403: 503–511. 10.1038/35000501View ArticlePubMedGoogle Scholar
- Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A, Williams C, Zhu SX, Lonning PE, Borresen-Dale AL, Brown PO, Botstein D: Molecular portraits of human breast tumours. Nature 2000, 406: 747–752. 10.1038/35021093View ArticlePubMedGoogle Scholar
- Dugas M, Merk S, Breit S, Dirschedl P: Mdclust – Exploratory microarray analysis by multidimensional clustering. Bioinformatics 2004, 20: 931–936. 10.1093/bioinformatics/bth009View ArticlePubMedGoogle Scholar
- Covell DG, Wallqvist A, Rabow AA, Thanki N: Molecular classification of cancer: unsupervised self-organizing map analysis of gene expression microarray data. Mol Cancer Ther 2003, 2: 317–332.PubMedGoogle Scholar
- Hsu AL, Tang SL, Halgamuge SK: An unsupervised hierarchical dynamic self-organizing approach to cancer class discovery and marker gene identification in microarray data. Bioinformatics 2003, 19: 2131–2140. 10.1093/bioinformatics/btg296View ArticlePubMedGoogle Scholar
- Ben-Dor A, Friedman N, Yakhini Z: Class discovery in gene expression data. In Proceedings of the Fifth Annual Conference on Computational Biology (RECOMB): 2001; Montreal (Edited by: Lengauer T, Sankoff D, Istrail S, Pevzner P, Waterman M). ACM Press 2001, 31–38.View ArticleGoogle Scholar
- von Heydebreck A, Huber W, Poustka A, Vingron M: Identifying splits with clear separation: a new class discovery method for gene expression data. Bioinformatics 2001, 17(Suppl 1):S107–114.View ArticlePubMedGoogle Scholar
- Hedenfalk I, Ringnér M, Ben-Dor A, Yakhini Z, Chen Y, Chebil G, Ach R, Loman N, Olsson H, Meltzer P, Borg Å, Trent J: Molecular classification of familial non-BRCA1/BRCA2 breast cancer. Proc Natl Acad Sci USA 2003, 100: 2532–2537. 10.1073/pnas.0533805100PubMed CentralView ArticlePubMedGoogle Scholar
- Kirkpatrick S, Gelatt C, Vecchi M: Optimization by simulated annealing. Science 1983, 220: 671–680.View ArticlePubMedGoogle Scholar
- Ben-Dor A, Bruhn L, Friedman N, Nachman I, Schummer M, Yakhini Z: Tissue classification with gene expression profiles. J Comput Biol 2000, 7: 559–583. 10.1089/106652700750050943View ArticlePubMedGoogle Scholar
- Troyanskaya OG, Garber ME, Brown PO, Botstein D, Altman RB: Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics 2002, 18: 1454–1461. 10.1093/bioinformatics/18.11.1454View ArticlePubMedGoogle Scholar
- Khan J, Wei JS, Ringnér M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Atonescu CR, Peterson C, Meltzer PS: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 2001, 7: 673–679. 10.1038/89044PubMed CentralView ArticlePubMedGoogle Scholar
- Wilcoxon F: Individual comparisons by ranking methods. Biometrics 1945, 1: 80–83.View ArticleGoogle Scholar
- Walpole RE, Myers RH: Probability and Statistics for Engineers and Scientists 3 Edition New York: Macmillan 1985.Google Scholar
- Kruskal WH, Wallis WA: Use of ranks in one-criterion variance analysis. J Amer Statist Assoc 1952, 47: 583–621.View ArticleGoogle Scholar
- Press WH, Teukolsky SA, Vetterling WT, Flannery BP: Numerical Recipes in C 2 Edition Cambridge, UK: Cambridge University Press 1992.Google Scholar
- Ringnér M, Edén , Johansson P: Classification of expression patterns using artificial neural networks. In A Practical Approach to Microarray Data Analysis (Edited by: Berrar DP, Dubitzky W, Granzow M). Boston: Kluwer Academic Publishers 2002, 201–215.Google Scholar
- Saal LH, Troein C, Vallon-Christersson J, Gruvberger S, Borg Å, Peterson C: Bioarray software environment: a platform for comprehensive management and analysis of microarray data. Genome Biol 2002, 3: software0003.1–0003.6. 10.1186/gb-2002-3-8-software0003View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.