Measuring similarities between gene expression profiles through new data transformations
© Kim et al; licensee BioMed Central Ltd. 2007
Received: 01 September 2006
Accepted: 27 January 2007
Published: 27 January 2007
Clustering methods are widely used on gene expression data to categorize genes with similar expression profiles. Finding an appropriate (dis)similarity measure is critical to the analysis. In our study, we developed a new measure for clustering the genes when the key factor is the shape of the profile, and when the expression magnitude should also be accounted for in determining the gene relationship. This is achieved by modeling the shape and magnitude parameters separately in a gene expression profile, and then using the estimated shape and magnitude parameters to define a measure in a new feature space.
We explored several transformation schemes for constructing the feature space: a space whose features are the mutual differences of the original expression components, a space derived from a parametric covariance matrix, and the principal component space of traditional PCA. The first two are newly proposed; the last is included for comparison. The measures defined in these feature spaces were used in a K-means clustering procedure. We applied the resulting algorithms to a simulation dataset, a developing mouse retina SAGE dataset, a small yeast sporulation cDNA dataset, and a maize root Affymetrix microarray dataset. The algorithm associated with the first feature space, named TransChisq, showed clear advantages over the other methods.
The proposed TransChisq is very promising in capturing meaningful gene expression clusters. This study also demonstrates the importance of data transformations in defining an efficient distance measure. Our method should provide new insights in analyzing gene expression data. The clustering algorithms are available upon request.
With the explosion of various 'omic' data, a general question facing biologists and statisticians is how to summarize and organize the observed data into meaningful structures. Clustering is one of the methods that have been widely explored for this purpose [1–3]. In particular, clustering is generally applied to gene expression data to group genes with similar expression profiles into discrete functional clusters. Many clustering methods are available, including hierarchical clustering [4], K-means clustering [5], self-organizing maps [6], and various model-based methods [7–9].
Recent research in clustering analysis has been focused largely on two areas: estimating the number of clusters in data [10–12] and the optimization of the clustering algorithms [13, 14]. In this paper we studied a different yet fundamental issue in clustering analysis: to define an appropriate measure of similarity for gene expression patterns.
The most commonly used distance or similarity measures for gene expression data are the Pearson correlation coefficient and the Euclidean distance. In some situations, however, neither is suitable for revealing the true gene relationships: the Pearson correlation coefficient is overly sensitive to the shape of an expression curve, whereas the Euclidean distance mainly reflects the magnitude of the expression changes. The success of model-based methods [7–9, 15] depends heavily on how well the assumed probability model fits the data and the clustering purpose.
In recent literature, several advanced measures with emphasis on the expression profile shape have been developed in different contexts [16–18]. In particular, based on the Spearman Rank Correlation, CLARITY was defined for detecting local similarity or time-shifted patterns in expression profiles [18]. However, rank-based methods can misinterpret a pattern because ranking discards information. As an example, consider a profile Y = (104, 95, 88, 92, 88) with all components generated from the same Poisson distribution of mean 100. The differences among the components of Y are due only to the variance of the distribution, so ranking them is meaningless. In short, the Spearman Rank Correlation cannot distinguish real differences from random errors in some situations and thus may give a wrong estimate of the pattern.
By separately modeling the shape and the magnitude parameters in a gene expression profile, we developed a new measure for clustering genes when the profile shape is the key factor but the expression magnitude should also be accounted for in determining the gene relationship. The approach is to use the estimated shape and magnitude parameters to define a Chi-square-statistic based distance measure in a new feature space. An appropriate feature space helps summarize the data more effectively, hence improving the identification of gene relationships. We explored different transformation schemes to construct the feature spaces: a space with features determined by the mutual differences of the original expression components, a space derived from a parametric covariance matrix, and the principal component space in PCA [19]. The first two are newly proposed; the last is explored for comparison purposes.
The new measures associated with the different feature spaces were employed in a K-means clustering procedure. We designated the algorithm using the measure from the first transformed space as TransChisq, and the one associated with the principal component space as PCAChisq. The space derived from a parametric covariance matrix was not included in the comparison for computational reasons (see Methods). For evaluation purposes we also implemented a set of widely used measures in the K-means clustering procedure, including the Pearson correlation coefficient (PearsonC), Euclidean distance (Eucli), Spearman Rank Correlation (SRC), and a chi-square based measure for Poisson distributed data (PoissonC) [20]. All the measures were applied to a simulation dataset, a developing mouse retina SAGE dataset of 153 tags [21], a small yeast sporulation cDNA dataset [22], and a maize root Affymetrix microarray dataset [23]. The results showed that TransChisq outperforms the other methods. Using the gap statistic [24, 25], TransChisq was also found to be advantageous in estimating the number of clusters. The underlying probability model of our method was adopted from the model of PoissonC, a method for analyzing the expression patterns in Serial Analysis of Gene Expression (SAGE) data [20]. The MATLAB source codes for all these algorithms are available upon request.
First, we will illustrate the property of the proposed new transformations by applying them to a maize gene expression dataset. Then we will present the applications of TransChisq, PCAChisq and other methods to a simulation dataset, a yeast sporulation microarray dataset, and a mouse retinal SAGE dataset.
Experimental maize gene expression data
The maize dataset, consisting of nine Affymetrix microarrays, was generated to investigate the gene transcription activity in three maize root tissues, with three replicates for each tissue: the proximal meristem (PM), the quiescent center (QC) and the root cap (RC) [23]. In total, 2092 significantly differentially expressed genes were identified and categorized into 6 classes of expression patterns [23]. Here we used these genes to illustrate the properties of the proposed transformations in comparison with traditional PCA.
Table: The six expression patterns and their separating regions described by PC2 and PC3
Expression pattern: center of the separating region described by PC2 and PC3
PM > (QC ≈ RC): PC2 = √3·PC3 < 0
PM < (QC ≈ RC): PC2 = √3·PC3 > 0
QC > (PM ≈ RC): PC2 = -√3·PC3 > 0
QC < (PM ≈ RC): PC2 = -√3·PC3 < 0
RC > (PM ≈ QC): PC2 = 0; PC3 > 0
RC < (PM ≈ QC): PC2 = 0; PC3 < 0
For comparison, we performed a traditional PCA on the same data. Figures 1(g)–(i) plot the expression profiles of the genes in the principal component space. The direct application of PCA separates the two dominating expression patterns, but it fails to recognize the other patterns, even when all principal components are used. The poor performance of PCA can be attributed to the use of the empirical sample covariance matrix in determining the principal components. In the maize dataset, about 94% of the genes are RC up- or down-regulated, and these genes cause most of the variance. The principal components determined by this sample covariance matrix thus largely capture the two dominating clusters, yet miss the meaningful class information for the other four small groups.
This example demonstrates the advantage of the proposed new data transformations over the traditional PCA in keeping class information intact.
We applied TransChisq to a simulation dataset to evaluate its performance. For comparison purposes, other modified K-means algorithms, i.e. PCAChisq, PoissonC, PearsonC, and Eucli were also applied to the same dataset.
Table: Five-dimensional simulation dataset with Normal distributions (σ² = 3μ). Column: mean parameters of the Normal distributions (μ). Row groups: a1–a3; b1–b6; c1–c4; c5–c6; d1–d7; d8–d9; e1–e5; e6–e7; f1–f3; f4–f6; f7–f9; f10–f11; f12–f13; f14–f15.
We performed an additional 100 replications of the above simulation. TransChisq, PCAChisq and PoissonC correctly clustered 75, 37 and 43 of the 100 replicate simulation datasets, respectively, whereas PearsonC, Eucli and Eucli on rescaled data failed to generate the correct clusters. We also tried PCAChisq on different combinations of principal components to optimize the clustering results. These combinations, however, did not help to identify all six groups.
This study evaluates the performance of TransChisq on normally distributed data with a Poisson-like property: the variance increases with the mean. Its success suggests that TransChisq can be applied to microarray data in addition to SAGE data.
Experimental mouse retinal SAGE data
Table: Functional categorization of the 153 mouse retinal tags (125 developmental genes; 28 non-developmental genes). Column: number of tags.
Table: Comparison of the algorithms on the 153 SAGE tags. Columns: number of tags in incorrect clusters; % of tags in incorrect clusters; Adjusted Rand Index. The rows include Eucli on rescaled data.
Microarray yeast sporulation gene expression data
Table: Comparison of the algorithms on the 39 yeast sporulation genes. Columns: number of genes in incorrect clusters; % of genes in incorrect clusters; Adjusted Rand Index. The rows include Eucli on rescaled data.
For PCAChisq, we tried different combinations of principal components (PCs) to optimize the clustering results. The best result can be reached when the first 5 PCs were used: 3 out of the 39 genes were incorrectly grouped. This optimal result is the same as the one from TransChisq. However, in practice, it is not feasible to exhaust all possible combinations of PCs to search for the optimal clustering result.
Estimating the number of clusters using Gap Statistics
An unsolved issue in K-means clustering analysis is how to estimate K, the number of clusters. In the recent literature the Gap statistic was found useful [25, 26]. The Gap statistic uses the output of any clustering algorithm to compare the between-to-total variance ($R^2$) with that expected under an appropriate reference null distribution. A high $R^2$ value represents high variability between clusters and high coherence within clusters. The Gap statistic is calculated as follows. Let $D_k$ be the $R^2$ measure for the clustering output when the number of clusters is $k$. To derive the reference expected value of $D_k$, the elements within each row of the original data are permuted to produce new matrices with random profile patterns. Assume $B$ such matrices are obtained. For each matrix, a new $R^2$ is calculated based on the original clustering output and the pre-selected similarity measure. The average of these $R^2$ values, denoted by $\bar{D}_k^{*}$, serves as the expectation of $D_k$. With $D_k$ and $\bar{D}_k^{*}$, the Gap function is defined by
$$\mathrm{Gap}(k) = D_k - \bar{D}_k^{*}.$$
The value of $k$ with the largest Gap value is selected as the optimal number of clusters, because at this $k$ the observed between-to-total variance $R^2$ exceeds its expected value by the largest margin.
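The permutation procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the squared Euclidean distance stands in for the pre-selected similarity measure, and the function names `r_squared` and `gap`, the toy data and the two candidate partitions are all hypothetical.

```python
import random

def r_squared(data, labels):
    """Between-to-total variance of a fixed partition (squared Euclidean stand-in)."""
    n, dim = len(data), len(data[0])
    grand = [sum(row[t] for row in data) / n for t in range(dim)]
    total = sum((row[t] - grand[t]) ** 2 for row in data for t in range(dim))
    within = 0.0
    for k in set(labels):
        members = [row for row, lab in zip(data, labels) if lab == k]
        center = [sum(row[t] for row in members) / len(members) for t in range(dim)]
        within += sum((row[t] - center[t]) ** 2 for row in members for t in range(dim))
    return 1.0 - within / total

def gap(data, labels, n_perm=50, seed=0):
    """Observed R^2 minus the mean R^2 over row-permuted copies of the data."""
    rng = random.Random(seed)
    observed = r_squared(data, labels)
    null_values = []
    for _ in range(n_perm):
        # Permute the elements within each row; the clustering output stays fixed.
        permuted = [rng.sample(row, len(row)) for row in data]
        null_values.append(r_squared(permuted, labels))
    return observed - sum(null_values) / n_perm

# Toy data with three obvious shapes (hypothetical values).
data = [[10, 1, 1], [12, 2, 1], [1, 11, 2], [2, 10, 1], [1, 2, 12], [2, 1, 10]]
labels_k3 = [0, 0, 1, 1, 2, 2]   # partition matching the three shapes
labels_k2 = [0, 0, 1, 1, 1, 1]   # partition that merges two shapes
gap_k3 = gap(data, labels_k3)
gap_k2 = gap(data, labels_k2)
```

In this toy case the three-cluster partition should receive the larger Gap value, which is the selection rule described above.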
Discussions and conclusions
In this study, we proposed a method, TransChisq, to group genes with similar expression shapes. The expression magnitude was considered when measuring the shape similarity. Results from applications to a variety of datasets demonstrated TransChisq's clear advantages over other methods. Furthermore, with the gap statistics, TransChisq was also found to be effective in estimating the number of clusters. Regarding the computational efficiency, TransChisq, PCAChisq and PoissonC have similar costs but usually run a few times (2 to 5 times) slower than the PearsonC and Eucli.
We have embedded different measures in the K-means clustering procedure to reveal the important gene expression patterns. In addition to K-means, our new measure can also be implemented in other clustering methods, e.g., hierarchical clustering, to perform the analysis. In a hierarchical clustering procedure, the distance between any two gene expression profiles can be defined using measure (4) by assuming that the two genes form a cluster. A study on the performance of different measures in a hierarchical clustering procedure is in Additional file 2. Our new method also outperforms the others when implemented in the hierarchical clustering algorithm.
We view different measures as complementary rather than competing in that each has its advantages. In general, TransChisq would be effective when it is necessary to consider the magnitude information in measuring the shape similarity. In clustering analyses of SAGE and microarray data, very often the magnitude information should be taken into account, whereas the shape could be a more critical factor to determine the gene relationship.
Although the proposed method is very promising, it does require further study on possible data transformation schemes when the original data show a more complex structure, or when the clustering purpose is different. We suggest our method could provide new insights to the applications of different data transformations in clustering analysis of gene expression data.
The underlying probability model of our new measures was adopted from the work of Cai et al. [20], where two Poisson based measures were proposed for clustering analysis of SAGE data or, more generally, Poisson distributed data. A brief review of this work is presented below, followed by a detailed description of the newly proposed measures.
PoissonC and PoissonL for clustering analysis of SAGE data
SAGE is one of the effective techniques for comprehensive gene expression profiling. The result of a SAGE experiment, called a SAGE library, is a list of counts of sequenced tags isolated from mRNAs that are randomly sampled from a cell or tissue. As discussed in Man et al. [27], the sampling process for tag extraction is approximately equivalent to randomly taking a bag of colored balls from a big box. This randomness leads to an approximate multinomial distribution for the number of transcripts of different types. Moreover, because a cell or tissue contains a vast number of different types of transcripts, the selection probability of a particular type of transcript at each draw should be very small. This suggests that the tag counts of sampled transcripts of each type are approximately Poisson distributed. PoissonC and PoissonL were developed in this context [20]. The method is summarized below.
Let $Y_i(t)$ be the count of tag $i$ in library $t$, and $Y_i = (Y_i(1), \ldots, Y_i(T))$ be the vector of counts of tag $i$ over a total of $T$ libraries. $Y_i(t)$ is assumed to be Poisson distributed with mean $\gamma_{it}$. To model the magnitude and shape of the expression profile separately, Cai et al. [20] further parameterized the Poisson rate as $\gamma_{it} = \lambda_i(t)\theta_i$, where $\theta_i$ is the expected sum of counts of tag $i$ over all libraries, and $\lambda_i(t)$ is the fraction of that sum contributed by library $t$. The $\lambda_i(t)$ sum to 1 over all libraries, so $\lambda_i(t)\theta_i$ redistributes the tag counts according to the expression shape parameters $\lambda_i(t)$ while keeping the sum of counts over libraries constant. Genes with similar $\lambda_i(t)$ over $t$ are considered to be in the same cluster.
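For a cluster consisting of tags $1, 2, \ldots, m$ with the common shape parameter $\lambda = (\lambda(1), \ldots, \lambda(T))$, the joint likelihood function for $Y_1, Y_2, \ldots, Y_m$ is
$$f(Y_1, \ldots, Y_m \mid \lambda, \theta_1, \ldots, \theta_m) = \prod_{i=1}^{m} \prod_{t=1}^{T} \frac{\exp(-\lambda(t)\theta_i)\,(\lambda(t)\theta_i)^{Y_i(t)}}{Y_i(t)!}. \qquad (1)$$
The maximum likelihood estimates of $\lambda$ and $\theta_1, \ldots, \theta_m$ are
$$\hat{\lambda}(t) = \frac{\sum_{i=1}^{m} Y_i(t)}{\sum_{i=1}^{m} \sum_{s=1}^{T} Y_i(s)}, \qquad \hat{\theta}_i = \sum_{t=1}^{T} Y_i(t). \qquad (2)$$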
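The maximum likelihood estimates can be checked with a short derivation. The sketch below uses only the Poisson model stated above and the constraint that the shape parameters sum to 1; the Lagrange multiplier $\nu$ is notation introduced here for illustration.

```latex
\begin{aligned}
\log f &= \sum_{i=1}^{m}\sum_{t=1}^{T}\Big[-\lambda(t)\theta_i
          + Y_i(t)\log\big(\lambda(t)\theta_i\big) - \log Y_i(t)!\Big] \\
       &= \sum_{i=1}^{m}\Big[-\theta_i + \log\theta_i\sum_{t=1}^{T} Y_i(t)\Big]
          + \sum_{t=1}^{T}\log\lambda(t)\sum_{i=1}^{m} Y_i(t) + \text{const},
          \qquad \text{since } \sum_{t}\lambda(t) = 1. \\[6pt]
\frac{\partial \log f}{\partial\theta_i}
       &= -1 + \frac{1}{\theta_i}\sum_{t=1}^{T} Y_i(t) = 0
          \;\Longrightarrow\; \hat{\theta}_i = \sum_{t=1}^{T} Y_i(t). \\[6pt]
\frac{\partial}{\partial\lambda(t)}\Big[\log f - \nu\Big(\sum_{s}\lambda(s) - 1\Big)\Big]
       &= \frac{1}{\lambda(t)}\sum_{i=1}^{m} Y_i(t) - \nu = 0
          \;\Longrightarrow\; \lambda(t) = \frac{1}{\nu}\sum_{i=1}^{m} Y_i(t). \\[6pt]
\sum_{t}\lambda(t) = 1
       &\;\Longrightarrow\; \nu = \sum_{i=1}^{m}\sum_{s=1}^{T} Y_i(s),
          \qquad
          \hat{\lambda}(t) = \frac{\sum_{i=1}^{m} Y_i(t)}{\sum_{i=1}^{m}\sum_{s=1}^{T} Y_i(s)}.
\end{aligned}
```

In words, the magnitude estimate of a tag is its total count, and the shape estimate of a cluster is the share of the cluster's pooled counts that falls in each library.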
Formula (2) forms the basis of the following two measures for evaluating how well a particular tag fits in a cluster. One natural measure is to use the log-likelihood function $\log f(Y_i \mid \lambda, \theta_i)$. The larger the log-likelihood is, the more likely the observed counts are generated from the expected Poisson distributions. So for a cluster consisting of tags $1, 2, \ldots, m$, a likelihood based measure is defined as
$$L = -\sum_{i=1}^{m} \log f(Y_i \mid \hat{\lambda}, \hat{\theta}_i). \qquad (3)$$
The other measure is based on the Chi-square statistic, a well known statistic for evaluating the deviation of the observations from the expected values. It is defined as
$$D = \sum_{i=1}^{m} \sum_{t=1}^{T} \frac{\big(Y_i(t) - \hat{\lambda}(t)\hat{\theta}_i\big)^2}{\hat{\lambda}(t)\hat{\theta}_i}. \qquad (4)$$
When the Chi-square statistic is used as a similarity measure, the penalty for a deviation from a large expected count is smaller than that for the same deviation from a small expected count. This is consistent with the likelihood-based measure because the variance of a Poisson variable equals its mean. In general, the smaller the value of $L$ or $D$, the more likely the tags belong to the same cluster. Note that measures (3) and (4) consider both the shape and the magnitude information when measuring the cluster dispersion: the cluster is specified by the shape parameter $\lambda$, but the relationship of a tag to a cluster is determined by the deviation of the observed counts ($\hat{\lambda}_i\hat{\theta}_i$) from the expected values ($\hat{\theta}_i\lambda$). Here $\hat{\lambda}_i = (\hat{\lambda}_i(1), \ldots, \hat{\lambda}_i(T))$ is the estimated profile shape of tag $i$, with $\hat{\lambda}_i(t) = Y_i(t)/\hat{\theta}_i$. A measure that ignores magnitude would instead take the difference between $\hat{\lambda}_i$ and $\lambda$ directly.
1. Assign all SAGE tags randomly to K sets. Estimate the initial parameters $\hat{\theta}_i^{(0)}$ and $\hat{\lambda}_k^{(0)}$ for each tag $i$ and each cluster $k$ by formula (2).
2. In the (b+1)th iteration, assign each tag $i$ to the cluster with minimum deviation from the expected model. The deviation is measured by either $L_{i,k}^{(b)}$ or $D_{i,k}^{(b)}$.
3. Set new cluster centers by formula (2).
4. Repeat steps 2 and 3 until convergence.
Let $c(i)$ denote the index of the cluster that tag $i$ is assigned to. The above algorithm aims to minimize the within-cluster dispersion $\sum_i L_{i,c(i)}$ or $\sum_i D_{i,c(i)}$. The algorithm using measure $L$ is called PoissonL, and the algorithm using measure $D$ is called PoissonC. PoissonL and PoissonC perform similarly in applications, but PoissonC is more practical in terms of running time, so we use PoissonC for comparison in this paper.
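The iteration described above can be sketched as follows. This is a minimal pure-Python illustration rather than the authors' MATLAB implementation: the function names, the balanced random initialization, the handling of empty clusters and the toy counts are assumptions made here, and all profiles are assumed to have a positive total count.

```python
import random

def fit_shape(profiles):
    """Shared cluster shape: lambda(t) = column total / grand total."""
    n_lib = len(profiles[0])
    col = [sum(p[t] for p in profiles) for t in range(n_lib)]
    grand = sum(col)
    return [c / grand for c in col]

def chisq(profile, shape):
    """Chi-square deviation of one tag from a cluster shape; theta is the tag total."""
    theta = sum(profile)
    total = 0.0
    for y, lam in zip(profile, shape):
        expected = lam * theta
        if expected > 0:
            total += (y - expected) ** 2 / expected
        elif y > 0:
            return float("inf")  # observed counts where the cluster expects none
    return total

def poisson_c(profiles, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    labels = [i % k for i in range(len(profiles))]
    rng.shuffle(labels)  # random initial assignment, every cluster non-empty
    for _ in range(max_iter):
        shapes = []
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded from a random profile.
            shapes.append(fit_shape(members or [rng.choice(profiles)]))
        new_labels = [min(range(k), key=lambda c: chisq(p, shapes[c])) for p in profiles]
        if new_labels == labels:  # converged: no tag changed cluster
            break
        labels = new_labels
    return labels

# Hypothetical tag counts over three libraries: two clearly different shapes.
profiles = [[50, 5, 5], [40, 4, 6], [60, 7, 5], [5, 50, 5], [6, 40, 4], [4, 60, 8]]
labels = poisson_c(profiles, k=2)
```

Like any K-means variant, the result can depend on the random start, so in practice several restarts would be compared by their total within-cluster dispersion.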
PoissonC is designed to group the objects by their departure from the expected Poisson distributions. The success of PoissonC has been shown in applications [20, 21]. However, if the clustering purpose is slightly different, some modification of PoissonC may be necessary. For instance, if the shape difference should be emphasized more in determining the relationship, the direction of departure of the observed from the expected values should also be considered. As an example, consider an expression vector Y = (15, 30, 15) and its relationship with two clusters whose shapes are specified by $\lambda_1 = (1/12, 5/6, 1/12)$ and $\lambda_2 = (5/12, 1/6, 5/12)$ respectively. With $\hat{\theta} = 60$, the sum of the components of Y, the expectation of Y in cluster 1 is $\hat{\theta}\lambda_1 = (5, 50, 5)$, and in cluster 2 it is $\hat{\theta}\lambda_2 = (25, 10, 25)$. If more emphasis should be put on the shape change in determining the relationship, Y would be expected to be closer to the first cluster because of the large value observed on the middle component in both Y and $\hat{\theta}\lambda_1$. PoissonC, however, determines that Y has the same distance to $\hat{\theta}\lambda_1$ and $\hat{\theta}\lambda_2$ (by measure (4), the distance between Y and $\hat{\theta}\lambda_1$ is 48, and so is the distance between Y and $\hat{\theta}\lambda_2$). PoissonC ignores the direction of departure. To address this omission we propose to emphasize the profile shape through suitable data transformations, and to define a distance measure in the transformed space. The construction of a proper feature space for a given clustering purpose is essential to define an effective distance or similarity measure.
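For reference, the two distances quoted in the example can be written out term by term. The symbols $D_1$ and $D_2$ are introduced here only to label the Chi-square distance of Y to each cluster.

```latex
\begin{aligned}
\hat{\theta} &= 15 + 30 + 15 = 60, \\
D_1 &= \frac{(15-5)^2}{5} + \frac{(30-50)^2}{50} + \frac{(15-5)^2}{5}
     = 20 + 8 + 20 = 48, \\
D_2 &= \frac{(15-25)^2}{25} + \frac{(30-10)^2}{10} + \frac{(15-25)^2}{25}
     = 4 + 40 + 4 = 48.
\end{aligned}
```

The two totals coincide even though Y overshoots the expectation on the outer components in the first case and on the middle component in the second, which is the loss of direction noted in the text.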
Proposed distance measures (I): TransChisq
A simple yet natural data transformation to emphasize the expression shape is to consider the mutual differences of the original vector components. Given a gene with expression profile $Y_i = (Y_i(1), \ldots, Y_i(T))$, the transformed vector $Z_i$ is of dimension $T(T-1)/2$, with components of the form $Y_i(t_1) - Y_i(t_2)$ for $t_1 = 1, \ldots, T-1$ and $t_2 = t_1 + 1, \ldots, T$.
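As a small illustration of this transformation, the following Python sketch builds the vector of mutual differences. The function name `pairwise_differences` is hypothetical, and indices are 0-based in the code while the text uses 1-based library indices.

```python
from itertools import combinations

def pairwise_differences(profile):
    """Return Y(t1) - Y(t2) for every pair t1 < t2, in lexicographic order."""
    return [profile[t1] - profile[t2]
            for t1, t2 in combinations(range(len(profile)), 2)]

# A profile over T = 3 libraries gives T(T-1)/2 = 3 differences.
z = pairwise_differences([15, 30, 15])
```

Adding the same constant to every component leaves the transformed vector unchanged, so only the relative changes across libraries are retained.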
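According to the Poisson model in the previous section, $E(Y_i(t_1) - Y_i(t_2)) = (\lambda_i(t_1) - \lambda_i(t_2))\theta_i$ and $\mathrm{Var}(Y_i(t_1) - Y_i(t_2)) = (\lambda_i(t_1) + \lambda_i(t_2))\theta_i$. For a cluster consisting of tags $1, 2, \ldots, m$, we can define the following statistic to measure the cluster dispersion:
$$S_{trans} = \sum_{i=1}^{m} \sum_{t_1=1}^{T-1} \sum_{t_2=t_1+1}^{T} \frac{\big[(Y_i(t_1) - Y_i(t_2)) - (\hat{\lambda}(t_1) - \hat{\lambda}(t_2))\hat{\theta}_i\big]^2}{(\hat{\lambda}(t_1) + \hat{\lambda}(t_2))\hat{\theta}_i}, \qquad (5)$$
where $\hat{\lambda}(t)$ and $\hat{\theta}_i$ can be estimated by formula (2). We call the modified K-means algorithm with this measure TransChisq. Applying it to the toy example in the previous section, TransChisq determines that Y is closer to the first cluster, as we expected.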
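The toy example can be checked numerically. The sketch below evaluates the transformed Chi-square distance of a single profile to a cluster with a given shape, using the mean and variance of the pairwise differences stated above. The function name `trans_chisq` is hypothetical, the cluster shapes are taken as given rather than estimated, and each pair of libraries is assumed to have a positive combined shape parameter.

```python
from itertools import combinations

def trans_chisq(profile, shape):
    """Chi-square distance in the space of pairwise differences."""
    theta = sum(profile)  # magnitude estimate: total count of the profile
    total = 0.0
    for t1, t2 in combinations(range(len(profile)), 2):
        observed = profile[t1] - profile[t2]
        expected = (shape[t1] - shape[t2]) * theta
        variance = (shape[t1] + shape[t2]) * theta  # assumed > 0
        total += (observed - expected) ** 2 / variance
    return total

y = [15, 30, 15]
d1 = trans_chisq(y, [1 / 12, 5 / 6, 1 / 12])   # cluster 1, expectation (5, 50, 5)
d2 = trans_chisq(y, [5 / 12, 1 / 6, 5 / 12])   # cluster 2, expectation (25, 10, 25)
```

Here `d1` is 1800/55 (about 32.7) and `d2` is 1800/35 (about 51.4), so Y is closer to the first cluster, whereas the untransformed Chi-square distance is 48 for both.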
To better understand the effects of the proposed data transformation, we performed a simple simulation study and presented the results in Additional file 3.
Proposed distance measures (II): a parametric-covariance-matrix-based measure
Now we consider a data transformation determined by a parametric covariance matrix:
$R = \mathrm{cov}(X) = (\gamma_{ij})_{i,j = 1, \ldots, T}$, with $\gamma_{ij} = \alpha > 0$ if $i = j$ and $\gamma_{ij} = \beta$ if $i \ne j$,
where X is the data matrix with n observations on the rows and T variables on the columns, and R is the covariance matrix of the T variables. The matrix R in this form implies that the variables have identical variances and covariances with each other. These properties are biologically reasonable in that normalized arrays have identical distributions, hence equal variances. Also all pairs of variables would exhibit equal covariance (or un-correlated when β = 0) if each component had been equally important (or independent) to determine a class.
A data transformation can be defined through the eigenspace of $R$. One set of column orthonormal eigenvectors, denoted by $e_1, e_2, \ldots, e_T$, is presented in Additional file 4. Given a gene expression profile $Y_i = (Y_i(1), \ldots, Y_i(T))$, a transformation based on $R$ is
$$Z_i = (Z_{i1}, \ldots, Z_{iT}) = Y_i\,(e_1\ e_2\ \ldots\ e_T).$$
A convenient property of this transformation is that each component has a clear meaning. With $e_1 = [1/\sqrt{T}, \ldots, 1/\sqrt{T}]^T$, $e_2 = [1/\sqrt{2}, -1/\sqrt{2}, 0, \ldots, 0]^T$ and $e_3 = [1/\sqrt{6}, 1/\sqrt{6}, -2/\sqrt{6}, 0, \ldots, 0]^T$, for a profile $Y = (Y_1, \ldots, Y_T)$, the component associated with $e_1$ is $Y e_1 = (Y_1 + Y_2 + \cdots + Y_T)/\sqrt{T}$, which reflects the general expression level; the component associated with $e_2$ is $Y e_2 = (Y_1 - Y_2)/\sqrt{2}$, which reflects the difference between $Y_1$ and $Y_2$; and the component associated with $e_3$ is $Y e_3 = (Y_1 + Y_2 - 2Y_3)/\sqrt{6}$, which reflects the relationship among $Y_1$, $Y_2$ and $Y_3$.
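The eigenvector property can be verified numerically. The sketch below continues the pattern of the first three vectors to a full Helmert-type basis, which is an assumption made here since the set in Additional file 4 is not reproduced in this text, and checks that each vector is an eigenvector of a covariance matrix with constant diagonal α and constant off-diagonal β. All function names and numeric values are illustrative.

```python
import math

def helmert_basis(n):
    """e_1 = (1,...,1)/sqrt(n); e_j = (1,...,1,-(j-1),0,...,0)/sqrt(j(j-1)) for j >= 2."""
    basis = [[1 / math.sqrt(n)] * n]
    for j in range(2, n + 1):
        norm = math.sqrt(j * (j - 1))
        basis.append([1 / norm] * (j - 1) + [-(j - 1) / norm] + [0.0] * (n - j))
    return basis

def equal_covariance(n, alpha, beta):
    """Matrix with alpha on the diagonal and beta everywhere else."""
    return [[alpha if i == j else beta for j in range(n)] for i in range(n)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

n, alpha, beta = 4, 2.0, 0.5
R = equal_covariance(n, alpha, beta)
basis = helmert_basis(n)

# e_1 has eigenvalue alpha + (n-1)*beta; every contrast vector has alpha - beta.
eigenvalues = [alpha + (n - 1) * beta] + [alpha - beta] * (n - 1)
residuals = [max(abs(dot(row, e) - lam * x) for row, x in zip(R, e))
             for e, lam in zip(basis, eigenvalues)]

# Transform a profile: the first component is the scaled total expression.
profile = [104, 95, 88, 92]
z = [dot(profile, e) for e in basis]
```

Because the last n-1 vectors share one eigenvalue, any orthonormal basis of their span is an equally valid choice, which is the non-uniqueness discussed for the resulting measure.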
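According to the Poisson model, $E(Z_{it}) = E(Y_i)e_t = (\lambda_i(1)\theta_i, \ldots, \lambda_i(T)\theta_i)e_t$, $\mathrm{Var}(Z_{it}) = (\lambda_i(1)\theta_i, \ldots, \lambda_i(T)\theta_i)e_t^2$, where $e_t^2$ denotes the vector of squared components of $e_t$, and $\mathrm{Cov}(Z_{it}, Z_{ik}) = 0$ when $t \ne k$. Then for a cluster consisting of tags $1, 2, \ldots, m$, we can measure the cluster dispersion by:
$$S_{trans\_N} = \sum_{i=1}^{m} \sum_{t=1}^{T} \frac{\big(Z_{it} - \hat{\theta}_i\hat{\lambda}e_t\big)^2}{\hat{\theta}_i\hat{\lambda}e_t^2}, \qquad (6)$$
where $\hat{\lambda} = (\hat{\lambda}(1), \ldots, \hat{\lambda}(T))$ and $\hat{\theta}_i$ are estimated by formula (2).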
We should note the connection between this measure and $S_{trans}$ in formula (5). As discussed above, the component associated with $e_2$ is $(Y_1 - Y_2)/\sqrt{2}$. Thus the new space associated with $S_{trans}$ is equivalent to the space determined by $e_2$ and all its row-switching transformations. We can also define a measure similarly through $e_3$ or other eigenvectors. $S_{trans}$ seems to have the potential of losing the information carried by $e_3$ and other eigenvectors. However, applications of TransChisq to a variety of datasets suggested that this potential information loss is minor and can be ignored in most practical cases. In fact, the row-switching transformations of $e_2$ account for most of the information included in $e_3$ and other eigenvectors.
A potential shortcoming of $S_{trans\_N}$ comes from the fact that it is defined based on only one set of eigenvectors. The orthonormal eigenspace of a covariance matrix is not unique (e.g., the row switching operation can result in a different set of eigenvectors) and different eigenspaces may result in different values of $S_{trans\_N}$. Although one could consider all possible eigenspaces to overcome this limitation of $S_{trans\_N}$, doing so is not computationally feasible.
Applying $S_{trans\_N}$ to several different datasets, we observed that i) using the eigenvectors $e_1, e_2, \ldots, e_T$ in Additional file 4, $S_{trans\_N}$ performs very similarly to $S_{trans}$, and ii) when a different set of eigenvectors is used, the clustering results can be different, though the difference is not obvious. These results are not presented in this paper.
Proposed distance measures (III): PCAChisq
For comparison purposes, we applied PCA to transform the data [19]. PCA is useful for simplifying the analysis of a high dimensional dataset. Recently, PCA has been explored as a method for clustering gene expression data [28–33]. But a blind application of PCA in clustering analysis is dangerous in that PCA chooses principal component axes based on the empirical covariance matrix rather than the class information, and thus it does not necessarily give good clustering results [29, 34, 35].
In some theoretical and empirical studies, there have been observations that the first few principal components (PCs) in PCA are not always helpful for extracting meaningful signals from data. Thus, we considered all PCs in this study. By substituting the eigenvectors of the sample covariance matrix for $e_1, e_2, \ldots, e_T$ in measure (6), we defined a new measure and implemented it in PCAChisq. The Results section gives examples showing the positive and negative effects of the PCA transformation. In general, PCAChisq is difficult to use. First, it is unclear what types of variance the principal components are capturing (if it is the within-cluster variance, the principal components would lead to wrong clustering results). Second, it is unclear how many principal components should be used: the optimal number of PCs is unavailable before we compare the results to the ground truth. In brief, PCAChisq is effective only when the principal components happen to match the key features that determine a cluster.
Clustering analysis of microarray data
We explored the potential application of the proposed measures to a clustering analysis of microarray data, and propose the following restricted normal model for this purpose. The parameter notations of the Poisson model are adopted. Given a microarray dataset of expressions of $n$ genes in $T$ experiments, the expression of gene $i$ in experiment $t$, $X_i(t)$, is assumed to be normally distributed with mean $\mu_i(t) = \lambda_i(t)\theta_i$ and variance $\sigma_i^2(t) = k\lambda_i(t)\theta_i$, where $k$ is an unknown constant. The derivation of the maximum likelihood estimates (MLEs) of $\lambda_i(t)$ and $\theta_i$ under the normal model is rather involved, so we borrowed the estimators in formula (2). It can be shown that $\hat{\theta}_i$ in formula (2) is unbiased and $\hat{\lambda}(t)$ in formula (2) is consistent under the restricted normal model [see Additional file 5]. With $\hat{\theta}_i$ and $\hat{\lambda}(t)$ available under the normal model, TransChisq, PCAChisq and PoissonC can be applied.
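A quick simulation can make the restricted normal model concrete. The sketch below draws profiles whose variance is proportional to the mean and applies the borrowed estimators to them. It is only an illustration: the shape, magnitude, constant k, sample size and seed are arbitrary choices, and the function name `simulate_profile` is hypothetical.

```python
import random

def simulate_profile(shape, theta, k, rng):
    """Draw X(t) ~ Normal(mean = lambda(t)*theta, variance = k*lambda(t)*theta)."""
    return [rng.gauss(lam * theta, (k * lam * theta) ** 0.5) for lam in shape]

rng = random.Random(1)
shape, theta, k = [0.2, 0.3, 0.5], 300.0, 3.0

# One cluster: many genes sharing the same shape and, for simplicity, the same theta.
draws = [simulate_profile(shape, theta, k, rng) for _ in range(5000)]

# Borrowed estimators: theta_hat is the row total, lambda_hat the pooled column share.
theta_hats = [sum(x) for x in draws]
mean_theta_hat = sum(theta_hats) / len(theta_hats)
grand_total = sum(theta_hats)
lambda_hat = [sum(x[t] for x in draws) / grand_total for t in range(len(shape))]
```

With these settings the average of the row totals should lie close to 300 and the pooled column shares close to (0.2, 0.3, 0.5), in line with the unbiasedness and consistency statements above.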
For both oligonucleotide and cDNA microarray data, it is widely observed that there is strong dependence of the variance on the mean: variance increases with mean [36, 37]. So it is reasonable to expect that our restricted normal model is applicable to many microarray datasets. One example of this application on the yeast sporulation dataset has been presented to demonstrate the power of TransChisq in analyzing microarray data (see the Results section). We should also note that TransChisq would deliver less promising results if the assumption on the relationship between the variance and the mean is seriously violated.
The work of K. Kim was supported by Pohang University of Science and Technology (POSTECH), Korea and NIH R01GM075312. The work of H. Huang was supported by NIH R01GM075312.
- Brazma A, Vilo J: Gene expression data analysis. FEES Lett 2000, 480: 17–24. 10.1016/S0014-5793(00)01772-5View ArticleGoogle Scholar
- Quackenbush J: Computational analysis of microarray data. Nat Rev Genet 2001, 2: 418–427. 10.1038/35076576View ArticlePubMedGoogle Scholar
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95: 14863–14868. 10.1073/pnas.95.25.14863PubMed CentralView ArticlePubMedGoogle Scholar
- Johnson SC: Hierarchical Clustering Schemes. Psychometrika 1967, 2: 241–254. 10.1007/BF02289588View ArticleGoogle Scholar
- Hartigan JA: Clustering algorithms. New York: John Wiley & Sons, Inc; 1975.Google Scholar
- Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 1999, 96: 2907–2912. 10.1073/pnas.96.6.2907PubMed CentralView ArticlePubMedGoogle Scholar
- McLachlan GJ, Basford KE: Mixture models: inference and applications to clustering. New York: Dekker; 1988.Google Scholar
- Banfield JD, Raftery AE: Model-based Gaussian and non-Gaussian clustering. Biometrics 1993, 49: 803–821. 10.2307/2532201View ArticleGoogle Scholar
- Fraley C, Raftery AE: Model-based clustering, discriminant analysis and density estimation. Journal of the American Statistical Association 2002, 97: 611–631. 10.1198/016214502760047131View ArticleGoogle Scholar
- Tibshirani R, Walther Q, Hastie T: Estimating the number of clusters in a data set via the gap statistic. J R Statist Soc B 2001, 63: 411–423. 10.1111/1467-9868.00293View ArticleGoogle Scholar
- Feher M, Schmidt JM: Fuzzy clustering as a means of selecting representative conformers and molecular alignments. J Chem Inf Comput Sci 2003, 43: 810–818. 10.1021/ci0200671View ArticlePubMedGoogle Scholar
- Okada Y, Sahara T, Mitsubayashi H, Ohgiya S, Nagashima T: Knowledge-assisted recognition of cluster boundaries in gene expression data. Artif Intell Med 2005, 35: 171–183. 10.1016/j.artmed.2005.02.007View ArticlePubMedGoogle Scholar
- Baccelli F, Kofman D, Rougier JL: Self organizing hierarchical multicast trees and their optimization. Proceedings of IEEE Inforcom'99 1999, 3: 1081–1089.Google Scholar
- Jia L, Bagirov AM, Ouveysi I, Rubinov AM: Optimization based clustering algorithms in multicast group hierarchies. In Proceedings of the Australian Telecommunications, Networks and Applications Conference (ATNAC). Melbourne Australia; 2003. (published on CD, ISNB 0–646–42229–4). (published on CD, ISNB 0-646-42229-4).Google Scholar
- Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian networks to analyze expression data. J Comput Biol 2000, 7: 601–620. 10.1089/106652700750050961View ArticlePubMedGoogle Scholar
- Wen X, Fuhrman S, Michaels GS, Carr DB, Smith S, Barker JL, Somogyi R: Large-scale temporal gene expression mapping of central nervous system development. Proc Natl Acad Sci USA 1998, 95: 334–339. 10.1073/pnas.95.1.334PubMed CentralView ArticlePubMedGoogle Scholar
- Filkov V, Skiena S, Zhi J: Analysis techniques for microarray time-series data. J Comput Biol 2002, 9: 317–330. 10.1089/10665270252935485View ArticlePubMedGoogle Scholar
- Balasubramaniyan R, Hullermeier E, Weskamp N, Kamper J: Clustering of gene expression data using a local shape-based similarity measure. Bioinformatics 2005, 21: 1069–1077. 10.1093/bioinformatics/bti095View ArticlePubMedGoogle Scholar
- Jolliffe IT: Principal Component Analysis. New York: Springer-Verlag; 1986.View ArticleGoogle Scholar
- Cai L, Huang H, Blackshaw S, Liu JS, Cepko C, Wong WH: Cluster analysis of SAGE data using a Poisson approach. Genome Biology 2004, 5: R51. 10.1186/gb-2004-5-7-r51PubMed CentralView ArticlePubMedGoogle Scholar
- Blackshaw S, Harpavat S, Trimarchi J, Cai L, Huang H, Kuo WP, Weber G, Lee K, Fraioli RE, Cho S-H, Yung R, Asch E, Ohno-Machado L, Wong WH, Cepko CL: Genomic analysis of mouse retinal development. PLoS Biology 2004, 2: e247. 10.1371/journal.pbio.0020247
- Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown PO, Herskowitz I: The transcriptional program of sporulation in budding yeast. Science 1998, 282: 699–705. 10.1126/science.282.5389.699
- Jiang K, Zhang S, Lee S, Tsai G, Kim K, Huang H, Chilcott C, Zhu T, Feldman LJ: Transcription profile analyses identify genes and pathways central to root cap functions in maize. Plant Molecular Biology 2006, 60: 343–363. 10.1007/s11103-005-4209-4
- Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D, Brown P: 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology 2000, 1(2): research0003. 10.1186/gb-2000-1-2-research0003
- Tibshirani R, Walther G, Hastie T: Estimating the number of clusters in a data set via the gap statistic. J R Statist Soc B 2001, 63: 411–423. 10.1111/1467-9868.00293
- Hubert L, Arabie P: Comparing partitions. J Classif 1985, 2: 193–218.
- Man MZ, Wang X, Wang Y: POWER_SAGE: comparing statistical tests for SAGE experiments. Bioinformatics 2000, 16: 953–959. 10.1093/bioinformatics/16.11.953
- Raychaudhuri S, Stuart JM, Altman RB: Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac Symp Biocomput 2000, 5: 452–463.
- Yeung KY, Ruzzo WL: Principal component analysis for clustering gene expression data. Bioinformatics 2001, 17: 763–774. 10.1093/bioinformatics/17.9.763
- Alter O, Brown PO, Botstein D: Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 2000, 97: 10101–10106. 10.1073/pnas.97.18.10101
- Holter NS, Mitra M, Maritan A, Cieplak M, Banavar JR, Fedoroff NV: Fundamental patterns underlying gene expression profiles: simplicity from complexity. Proc Natl Acad Sci USA 2000, 97: 8409–8414. 10.1073/pnas.150242097
- Bicciato S, Luchini A, Di Bello C: PCA disjoint models for multiclass cancer analysis using gene expression data. Bioinformatics 2003, 19: 571–578. 10.1093/bioinformatics/btg051
- Misra J, Schmitt W, Hwang D, Hsiao L-L, Gullans S, Stephanopoulos G: Interactive exploration of microarray gene expression patterns in a reduced dimensional space. Genome Res 2002, 12: 1112–1120. 10.1101/gr.225302
- Komura D, Nakamura H, Tsutsumi S, Aburatani H, Ihara S: Multidimensional support vector machines for visualization of gene expression data. Bioinformatics 2005, 21: 439–444. 10.1093/bioinformatics/bti188
- Chang W-C: On using principal components before separating a mixture of two multivariate normal distributions. Appl Statist 1983, 32: 267–275. 10.2307/2347949
- Durbin BP, Hardin JS, Hawkins DM, Rocke DM: A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 2002, 18: S105–S110.
- Rocke DM: Heterogeneity of variance in gene expression microarray data. University of California at Davis, Department of Applied Science and Division of Biostatistics; 2003. [http://www.cipic.ucdavis.edu/~dmrocke/papers/empbayes2.pdf]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.