Partial mixture model for tight clustering of gene expression timecourse
 Yinyin Yuan^{1},
 ChangTsun Li^{1}Email author and
 Roland Wilson^{1}
https://doi.org/10.1186/147121059287
© Yuan et al; licensee BioMed Central Ltd. 2008
Received: 28 September 2007
Accepted: 18 June 2008
Published: 18 June 2008
Abstract
Background
Tight clustering arose recently from a desire to obtain tighter and potentially more informative clusters in gene expression studies. Scattered genes with relatively loose correlations should be excluded from the clusters. However, in the literature there is little work dedicated to this area of research. On the other hand, there has been extensive use of maximum likelihood techniques for model parameter estimation. By contrast, the minimum distance estimator has been largely ignored.
Results
In this paper we show the inherent robustness of the minimum distance estimator that makes it a powerful tool for parameter estimation in modelbased timecourse clustering. To apply minimum distance estimation, a partial mixture model that can naturally incorporate replicate information and allow scattered genes is formulated. We provide experimental results of simulated data fitting, where the minimum distance estimator demonstrates superior performance to the maximum likelihood estimator. Both biological and statistical validations are conducted on a simulated dataset and two real gene expression datasets. Our proposed partial regression clustering algorithm scores top in Gene Ontology driven evaluation, in comparison with four other popular clustering algorithms.
Conclusion
For the first time partial mixture model is successfully extended to timecourse data analysis. The robustness of our partial regression clustering algorithm proves the suitability of the combination of both partial mixture model and minimum distance estimator in this field. We show that tight clustering not only is capable to generate more profound understanding of the dataset under study well in accordance to established biological knowledge, but also presents interesting new hypotheses during interpretation of clustering results. In particular, we provide biological evidences that scattered genes can be relevant and are interesting subjects for study, in contrast to prevailing opinion.
Background
Based on the assumption that coexpression indicates coregulation, gene expression data clustering aims to reveal gene groups of similar functions in the biological pathways. This biological rationale is readily supported by both empirical observations and systematic analysis [1]. In particular, consider gene expression timecourse experiments, where the data are made up of tens of thousands of genes, each with measurements taken at either uniformly or unevenly distributed time points often with several replicates. Clustering algorithms provide a good initial investigation into such largescale datasets, which ultimately leads to biological inference. An excellent review of current techniques and all subsequent analysis can be found in [2].
Various modelbased methods have been proposed to accommodate the needs for data mining in such massive datasets. Among them are mixed effects models [3, 4] and auto regressive models [5]. The basic approach of these modelbased methods is to fit a finite mixture model to the observed data, assuming that there is an underlying true model/density, and then systemically find the optimal parameters so that the fitted model/density is as close to the true model/density as possible. It is observed that modelbased approaches generally achieve superior performance to many others [6–9]. However, current methods can be problematic, as they often fail to show how clustering can assist in mining gene expression data.
The maximum likelihood estimator (MLE) is one of the most extensively used statistical estimation techniques in the literature. For a variety of models, likelihood functions [4, 6, 10], especially maximum likelihood, have been used for making inferences about parameters of the underlying probability distribution for a given dataset. The solution often involves a nonlinear optimization such as quasiNewton methods or, more commonly, expectationmaximization (EM) methods [4, 11]. The problem with the former method is that the quantities are estimated only when they satisfy some constraints, while with the latter method all parameters have to be explicitly specified, so the number of clusters K has to be known a priori, which is not practical in microarray data analysis. There are many unique features of MLE, including its efficiency. However the practical deficiencies of MLE, besides those with its optimization, are the lack of robustness against outliers and its sensitivity to the correctness of model specification. We discuss in this paper the performance of an appealing alternative, the minimum distance estimator (MDE) [12], which is less explored in this field. Inspired by the work of [13], we propose to incorporate MDE in our algorithm for gene expression timecourse analysis. MDE provides robust estimation against noise and outliers, which is of particular importance in gene expression data analysis, where data are often noisy and there are few replicates.
Tight clustering has been proposed as a response to the needs for obtaining smaller clusters in genomic signal processing. It was motivated by the fact that the most informative clusters are very often the tight clusters, usually of size 20–60 genes [14]. Tight clustering refers to methods that can be built upon an existing partition to obtain core patterns that are more interpretable. The initial partition can be obtained empirically or by using generic algorithms such as Kmeans. As a result, more information can possibly be revealed. For example, if genes in the same functional category are allocated into different tight clusters, one may pursue the underlying explanation by looking into these clusters. One possible result of such investigation, for example, is new function discovery.
In this sense, to obtain tight clusters, some genes should be classified as scattered genes, if forcing them into clusters will only disturb biologically relevant patterns. Indeed, the issue of scattered genes has received more attention recently [2, 14]. However, in contrast to the prevailing concept that scattered genes should be treated as outliers and discarded from further study, we prove that some scattered genes can be of biological significance. Current methods for gene expression timecourse data rarely deal with scattered genes. To the best of our knowledge, [14] is the first to address this issue, but it results in heavy computation due to the nature of random resampling. It was proposed in [11] that outliers can be modelled by adding a Poisson process component in the mixture model. However, this method has not been verified in this field, and it relies on correct model specification.
There has been a lot of research focusing on modelling timecourse data by splines and autoregressive models, usually followed by EM [3, 4, 6, 15, 16]. In [15], the cubic Bspline, which is a linear combination of Bspline basis functions, is used for fitting gene expression timecourse data. To avoid overfitting, it is suggested not to fit a curve to every individual gene, but to constrain the spline coefficients of coexpressed genes to have the same covariance matrix. Alternatively, we propose in this work a novel approach to fit our spline model.
SplineCluster [6] is an efficient hierarchical clustering program based on a regression model with a marginal likelihood criterion. Starting from singleton clusters, the idea is to merge clusters based on marginal likelihood in each iteration. It is efficient and straightforward to visualize. The problem is that it overlooks microarray replication information by using only the mean of all replicates, which leads to loss of information. As microarry experiments are increasingly performed with replicates, the additional information provided by replicated measurements is a valuable source of variability in terms of effective clustering [17].
The outline of this paper is as follows. In the second section, we describe the MDE framework and demonstrate how its excellent properties inspire a partial regression model for fitting gene expression timecourse data. Simulated datasets are designed for fitting by both partial MDE and MLE, to reveal their inherent differences. Built upon the advantages of MDE and partial modelling, a robust partial regression clustering algorithm is proposed for tight clustering which naturally incorporates replication information and allows a set of scattered genes to be left out. The experimental section is made up of two parts. First, our proposed partial regression clustering algorithm is applied to a simulated dataset to demonstrate its effectiveness. Secondly, it is compared with some recent work by applying the methods to two well studied real datasets. The superior performance of our algorithm is found through a carefully organized clustering validation, based on both biological knowledge and statistical indices. In particular, a GeneOntology (GO) [18] driven validation measure is proposed, specifically designed for gene expression clustering. Subsequent analysis of the clustering outcome reveals new knowledge generated by incorporating different biological resources. This study not only explores the differences between the two estimators and the application of partial modelling, but also provide an excellent example of gene expression data mining through the combination of machine learning and biological knowledge. Owning to space restrictions, some discussions, results and elaborations have been relegated [see Additional file 1, Section 1].
Results and Discussion
Minimum Distance Estimation and Partial Modelling
Minimum Distance Estimator (MDE)
Given a density function f(·), its corresponding parameters θ and n samples x_{ i }, i = 1, 2, ..., n, we aim to find the optimal parameters $\widehat{\theta}$ to approximate the true parameters θ_{0} by minimizing the integrated squared differenced(f(θ), f(θ_{0})) = ∫ [f(xθ)  f(xθ_{0})]^{2} dx
which givesd(f(θ), f(θ_{0})) = ∫ f(xθ)^{2} dx  2 ∫ f(xθ) f(xθ) f(xθ_{0})dx + ∫ f(xθ_{0})^{2} dx
There are many interesting features of MDE. First of all, it comes with the same robustness as all other minimum distance techniques [20–23]. Secondly, MDE approximates data by making the residuals as close to normal in distribution as possible [20–22]. These features will be further explained and illustrated in the experiments. We will also illustrate derivation of the MDE criterion for parameter estimation for our partial regression algorithm.
Gaussian Mixture Model with MDE
g_{ K }(xθ) is a closedform expression, whose minimization can be performed by a standard nonlinear optimization method.
To further relieve the system from constraints by the weight parameters, while keeping its weightedcomponent structure, in the next section the idea of partial modelling is presented. It originated from the fact that incomplete densities are allowed [25], so the model will be fitted to the most relevant data.
Partial Mixture Model with MDE (PMDE)
The weight parameters are of particular importance in a partial mixture model. They allow the model to estimate the component/components, while their value indicates the proportions of fitted data, so the rest of the data can be treated as scattered genes. This approach is first described in [13] for outlier detection. It was suggested to accommodate scattered genes by forcing a large scaling parameter in one of the components in the mixture [2]. However, partial modelling provides a better alternative.
Although it is suggested in [13] that the unconstrained mixture model can be applied for clustering, through our experiments it is clear that if the data overlap to a certain degree, all components will converge to the biggest component as a result of model freedom. Moreover, it is not practical to formulate the criterion in the form of Eq.(9) when it comes to implementation. Instead, we solve the problem by taking advantage of the onecomponent model to formulate our clustering algorithm.
Partial Regression Model
To analyse such high dimensional data as gene expression timecourse measurements, a regression model with a cubic Bspline basis is set up in order to account for the inherent time dependence. The linear regression model is capable of handling either uniformly or unevenly distributed time points, while the nonlinear spline basis helps accommodate the underlying stochastic process in the data. The advantage of using cubic Bspline lies in that the degree of the polynomials is independent of the number of points and that curve shape is controlled locally. Let Y be the variables of interest, consisting of gene expression data replicate matrices modelled asY = α + X(t)β + ε
X(t) is the design matrix consisting of a linear combination of cubic spline basis functions. The error term ε represents the residuals taken as a weighted distribution w·N(0, ${\sigma}_{\epsilon}^{2}$). α, β = β_{1}, β_{2}, ..., β_{ m }, m depending on the choice of X(t), are the regression parameters. As stated before, the useful feature of MDE is that it fits data in such a way that the residuals are close to a normal distribution. Therefore our model isε = Y  α  X(t)β
where θ = {w, α, β_{1}, ...β_{ m }, σ_{ ε }} and φ is the density of a normal random variable. Altogether there are m + 3 parameters to be estimated.
Simulated Datasets for PMDE fitting
The main feature of our model is that it is able to identify the key component, if any, and a set of outliers, in order to find the core structure. Therefore, a feasible parameter estimator is of paramount importance. We empirically validate our points about the nature of partial modelling and MDE through fitting four simple simulated datasets. The performance of both PMDE and MLE with a onecomponent spline regression model (K = 1) is compared in terms of data fitting accuracy and robustness. Surprisingly, superior performance was achieved for the PMDE fits even on such simple datasets. All datasets are generated by sine functions, modelling cyclic behavior of genes, which are widely employed in the literature [3, 26]. Gaussian noise is added to all data. The number of knots for both spine models is chosen to be 15, to allow for flexibility in curves while avoiding overfitting.
More datasets shown in Figure 1(d)–(f) are used to compare the performances of PMDE and MLE in different scenarios. When there are two components of entirely opposite behaviors, we can see from Figure 1(d) that the MLE fit is almost flat, while PMDE fits the larger component (60% of the data). The situation where lots of outliers are present is simulated in Figure 1(e), where the major component has 60% of the data and the rest (40%) are generated from three different sine waves. PMDE demonstrates its robustness by capturing the major component, while MLE is seriously biased. However, in the case of two clusters of exactly equal size as shown in Figure 1(f), PMDE fails, as it is designed to capture only one component but now cannot decide which one to fit. This can be solved by using a multicomponent model.
From these examples, it is observed that PMDE has the ability to handle the relevant fraction of data and distinguish it from outliers, while MLE blurs the distinction by accounting for all data. This is of great value for massive datasets, when the data structure is unclear and lots of outliers are present. The smoother fits of the proposed PMDE than that of MLE manifest the fact that the former is more robust against noise. All these suggest PMDE a promising tool for microarray data analysis. Interested readers are referred to Additional file 1, Section 1, for comparison of the two estimators on theoretical ground.
Clustering Algorithm
When analyzing gene expression timecourse data, special attention needs to be paid to the following issues:

Replicates: It is desirable that the algorithm can naturally incorporate replicate information instead of simply using the mean of all replicates.

Number of clusters: The choice of K is always a problem. The categorization of supervised and unsupervised schemes are usually determined by how K is defined. In our unsupervised algorithm, new cluster generation automatically terminates when no new cluster can be found in the data.

Scattered genes: Recently, many have proposed allowing a noisy set of genes not being clustered [8, 14]. In microarray experiments, it is generally expected that, because of the nature of data and the existence of high noise levels, many genes could show uncorrelated variations and are unrelated to the biological process under investigation. Forcing these genes into clusters will only introduce more false positives, resulting in distorted clusters and difficulty in interpretation.
C_{ l }in Eq.(15) and (16) stands for the l th cluster. The idea behind the CH measure is to compute the pairwise sum of squared errors (distances) between clusters and compare that to the internal sum of squared errors for each cluster. In effect, it is a measure of betweencluster dissimilarity over withincluster dissimilarity. The optimum clustering outcome should be the one that maximizes the CH index in Eq.(14). The CH index was originally meant for squared Euclidean distance. Since the residuals are a natural product of our spline regression model, we use the their absolute value as distance measurement in BSS(K) and WSS(K) but without the square form.
Partial regression clustering algorithm
Tight clustering, by definition, builds compact clusters upon an existing partition. The initial partition, if not available, can be obtained by some empirical knowledge or heuristic clustering methods such as kmeans. Given an initial partition, the clustering procedure is formulated as in Algorithm 1.
In the initialization step of the algorithm, an existing partition of a dataset is provided as input. The tightness threshold, υ, which controls the tightness and the number of the refined clusters produced by the algorithm as output, is defined as the reciprocal of the weighted mean variance of the clusters of the initial partition. Therefore, the greater the threshold is (i.e., the smaller the variance is), the tighter the clusters become and the more clusters are formed. The weights are determined in proportional to the size of the clusters. In the main loop, after each new cluster is
Algorithm 1 Partial Regression Clustering
Require: Initialization
repeat
1. Fit partial regression model to each of the clusters;
2. Identify potential outliers according to a tightness threshold υ and discard them from the clusters;
3. For all outliers, fit partial regression model to form a new cluster;
repeat
4. For all genes reevaluate distances to all existing spline regression models, assign them to the closest one;
5. Fit partial regression models to all clusters;
6. Calculate CH value based on current partitions;
until the clustering quality measured by CH value fails to improve.
7. Take the partition with highest CH value.
until no partial regression model can be fitted to the outliers.
8. Label all outliers as scattered genes.
generated, all data points are reassigned in the gene redistribution loop, so resultant clusters should be of reasonable size. The rationale supporting our design is based on the features of partial modelling and robustness of the MDE estimator, which we believe is able to find the relevant components in the data, while not being distracted by outliers. The residuals, as a natural byproduct of model fitting, can be used as the distance between data points and spline regression models.
In this framework, we use deterministic class assignment during the clustering process. Stochastic relaxation or weighted assignment is regarded as more moderate than deterministic assignment. However, it is also commonly recognised that stochastic relaxation, such as simulated annealing, does not guarantee convergence. In fact, the selection of starting temperature or the setting of annealing schedule are often heuristic. An initial temperature, set too high, leads to high computational cost while an initial temperature, set too low, yields similar result as deterministic relaxation but incurs higher computational cost than deterministic relaxation. After intensive testing with stochastic and deterministic relaxation on the datasets we used, we observed that deterministic assignment strikes a better balance between computational cost and clustering accuracy.
Experiment on Simulated Dataset
Simulated datasets are necessary in evaluating the algorithm performance because the biological meanings of real datasets are very often not clear. Besides, simulated datasets provide more controllable conditions to test an algorithm. To obtain a meaningful result, the simulated data need to share statistical characteristic with biological data.
A simulated dataset is generated from a model x(i, j) = α_{ i }+ β_{ i }ψ (i, j) + ε (i, j), where ψ (i, j) = sin(γ_{ i }j + ω_{ i }). α, β, γ, ω are clusterspecific parameters and are chosen according to the normal distribution with mean equal to 2 and standard deviation 1. All pattern details are listed [see Additional file 1, Section 2]. ψ models the cyclic behavior of gene expression patterns. 30 time points are taken from 6 of these models, so i ∈ 1, 2, ..., 6, j ∈ 1, 2, ..., 30. The cluster sizes are 50, 60, 70, 80, 90, 80. To model the noisy environment of microarray experiments, Gaussian noise ε is added to all data, together with 10 outliers generated by adding large variance Gaussian noise to three sine waves. Altogether, the simulated dataset is of size 440. Finally, we made some perturbations to induce more ambiguity, such as reducing the amplitude of parts of the patterns.
The clustering results are depicted in Supplementary Figure 1 of Additional File 1. The correct partition is achieved, with all ten outliers detected as shown in the seventh plot and the whole dataset plotted in the last one.
Experiments on Yeast Cell Cycle (Y5) Dataset
A clustering method can be evaluated on theoretical grounds by internal or external validation, or both. For internal validation, a statistical measure is preferred. Our algorithm is first validated via the CH measure in a comparison with SplineCluster and MCLUST, two of the most popular clustering methods in the literature. On the other hand, a measure of agreement such as the adjusted Rand index (ARI) [28] between the resulting partition and the true partition, if known, is often used as an external validation criterion. Although a lot of evaluations for methods of the same kind are conducted in this way [8, 26, 29, 30], we note that there is currently no ground truth, given our knowledge of the biological structures [31]. Recognizing this, we set out to evaluate the performance of our algorithm through systematically finding biologically relevant evidence [32–34]. The key to interpret a clustering outcome is to recognize the functional relationships among genes within a cluster as well as between clusters. We first provide a quantitative measure based on the graph structure of Gene Ontology, then pursue biological validation and inference through GO enrichment analysis in an empirically way.
Clustering Y5 dataset
Cross tabulation of original partition and resulting partition for Y5 dataset.
C1  C2  C3  C4  C5  C6  C7  C8  SG  Total  

G1E  29  2  12  19  3  0  0  0  2  67 
G1L  5  52  0  10  63  4  0  0  1  135 
S  1  8  0  2  18  33  11  1  1  75 
G2  0  0  0  0  0  7  30  10  5  52 
M  1  0  23  0  0  0  1  29  1  55 
Total  36  62  35  31  84  44  42  40  10  384 
Then we use the 374 genes (excluding the 10 scattered genes), and again obtained 8 clusters for SplineCluster. As there is no biological knowledge input, comparison can first be conducted in a purely statistical manner, by the CH index. MCLUST [38] is a widely used mixture modelbased clustering method. It is unsupervised, not only in determining the number of clusters, but also in selecting the type of model that best fit the data. The R implementation of MCLUST is used in our experiment. For the 374gene dataset it decided on the EEE (Equal volume, shape and orientation) model and also found 8 components. Our algorithm achieves the highest CH value of 637.4, followed by 588.3 by MCLUST and 523.3 by SplineCluster.
Gene ontology enrichment analysis
The lower the pvalue is, the more unlikely the null hypothesis that the terms appear by chance is true. In this way, the overrepresented terms are found for each cluster.
Predictive accuracy
We compared five clustering methods: our partial regression algorithm, SplineCluster, MCLUST [38], hierarchical clustering, Kmeans, in terms of their predictive accuracy established in [8]. Since the underlying biological ground truth is unknown, evaluation of clustering algorithms for gene data cannot be carried out by similarity measures such as ARI. Instead, predictive accuracy was proposed to test functional prediction accuracy from clustering. The rationale is that since clustering is aimed at functional prediction of novel genes, if a cluster has exceptionally high occurrences of a certain gene annotation F (pvalue smaller than a certain threshold), all genes in this cluster can be predicted to be in the functional category F. The ratio of the verified predictions to all prediction made reflects the accuracy of a clustering algorithm. However, we have to bear in mind that this measure greatly depends on the annotation quality of the dataset under study.
Since our results involved a set of scattered genes, we propose as described below a slightly different criterion to the one in [8]. Suppose a functional category, F_{ i }, has v_{ i }genes in a dataset of size n. If there are in total V genes belonging to functional categories F_{1}, F_{2}, ..., F_{ M }, the remaining n  V genes are denoted as 'unannotated'. Such grouping and the resulting partition C_{1}, C_{2}, ..., C_{ K }of a clustering method can be crosstabulated to form a table. Let n_{ ij }, (i = 1, 2, ..., M and j = 1, 2, ..., K) be the (i, j) entry of the table denoting the number of annotated genes, p_{ ij }be the corresponding pvalue, and n_{·j}be the size of cluster C_{ j }. Given a threshold δ, for a Kcluster solution, its predictive accuracy A is defined asA(δ) = P_{ V }(δ)/P_{ C }(δ)
Scattered genes
Another important aspect in our investigation is to study the set of scattered genes. Multiple experiments are conducted with various tightness thresholds, υ, in our partial regression method. In Supplementary Table 1 of Additional File 2 the set of scattered genes found in eight runs of our program with various thresholds and their annotations are presented. Their frequencies of appearance in these experiments are shown in the column Feq. (out of 8). We noticed that although these thresholds result in different numbers of clusters, the set of scattered genes hardly changes (Supplementary Table 1, column Feq.). Such consistency leads one to think about the underlying biological meaning. As has already been pointed out [2], scattered genes can be those individuals that are not relevant to the biological process under study. However, we stress here that they can also be of significant interest, as each of them might be a key component of the cell cycle that may affect other components and indeed may be a transcription factor themselves. Therefore, its expression pattern can be uncorrelated to others in the set under study. Alternatively, a scattered gene can represent a gene whose expression is controlled by more transcription factors than the other coregulated genes within clusters. Moreover, because the set of genes under investigation is usually selected after performing gene ranking, there may be others in the complete list that would cluster with scattered genes. All these considerations drove us to further investigate this set of scattered genes.
Comparative evaluation on scattered gene detection
Experiments on Yeast Galactose dataset
Experiments are conducted on the Yeast Galactose dataset [42], which consists of gene expression measurements in galactose utilization in Saccharomyces cerevisiae. Gene expression was measured with 4 replicate assays across 20 experimental conditions (20 perturbations in the GAL pathway). A subset of measurements of 205 genes whose expression patterns reflect four functional categories in the GO listings was chosen and clustered previously [17, 29]. Compared with Y5 dataset, Yeast Galactose dataset show more distinguishable patterns, which is easier for clustering and leads to more agreeable correlation to its functional interpretation.
Crosstabulation of original partition (O1–O4) and resulting partition (C1–C4 and SG) for Yeast Galactose dataset.
Cluster  O1  O2  O3  O4  Total 

C1  83  0  0  0  83 
C2  0  12  0  0  12 
C3  0  1  90  1  92 
C4  0  1  0  13  14 
SG  0  1  3  0  4 
Total  83  15  93  14  205 
Conclusion
The aim of clustering gene profiles is to find possible functional relationships among tens of thousands of genes on a microarray. We propose that while the models for data fitting should be sensitive enough for discriminating individuals/genes, the estimators should be robust enough against noise and possible outliers. Therefore we focused on the differences between estimators by providing experimental comparisons. The robustness of the minimum distance estimator makes it stand out in our study. An immediate advantage is that when it is applied to gene expression clustering, it is capable of locating the key components in an unsupervised manner. As a result, a set of scattered genes that has low correlations is naturally obtained. Besides the GO enrichment analysis for the clusters from two real datasets, inference of the sets of scattered genes was also highlighted in this paper.
The partial mixture model (PMM) was known to solve problems for low dimensional data. In fact, one problem with classical PMM is that it cannot fit data of more than 7 data points [13]. This is the first time PMM is extended to use on high dimensional data, since current microarray experiments are having more time points and more replicates. Our contributions include introducing MDE and the idea of partial modelling to gene expression research, giving comparisons with the most common estimator in the literature – maximum likelihood, and proposing a novel partial regression clustering algorithm. Our spline regression model captures the inherent time dependencies among data. The error term is of particular importance as it can pick up the noise. The fact that PMDE estimates parameters so the residuals are as close to normal distribution as possible makes it a powerful tool for modelling the error term. The tightness of resulting clusters can be controlled by a threshold which in a sense decides the number of clusters. The effectiveness of the algorithm also depends on the model normality. When model normality holds approximately, clusters can be found. Often gene expression data are transformed during preprocessing so that normality holds approximately.
Although many interactions between genes are known, our knowledge of biological networks is far from complete. No conclusion can be drawn by merely comparing clustering inference with known measure from the biological literature. In this case, we aim to validate the algorithm and explain the clustering outcome with the help of various biological resources. As a highlight of this paper, Gene Ontology clustering validation was applied to the clustering outcomes of Yeast cell cycle dataset and Yeast Galactose dataset. From current knowledge, it is proved that these clusters can help separate groups of genes with similar functions, while new information can be learned from exploring the GO terms. First we proposed a novel measure based on graph theory and annotation knowledge as functional compactness indication for clusters. Further, predictive accuracy was utilized to compare the annotation prediction power across several common methods. Both measures confirmed that our proposed method has the best performance. Also, gene annotations reveal new knowledge that can be derived from scattered genes. A concern about GO analysis and annotation is that lots of genes and their functions are still unknown or poorly understood. It is our hope that through clustering, new understanding can be introduced to genome research.
In summary, the proposed system benefits from the robustness of MDE to detect scattered genes, the idea of partial modelling for tight clusters, the spline regression model for capturing the expression curves at either uniformly or unevenly distributed time points, and the use of the design matrix for incorporating replicate information. The proposed algorithm can be applied over an existing clustering to get tighter clusters. Although PMDE demonstrates its effectiveness through comparisons with maximum likelihood method, it also has its limits such as relative inefficiency. The aim of this paper is not to prove which one is better, but rather to provide analytical examples, discussions and insights.
Declarations
Acknowledgements
We thank three anonymous referees for suggestions and comments that significantly improved this manuscript.
Authors’ Affiliations
References
 Boutros PC, Okey AB: Unsupervised pattern recognition: An introduction to the whys and wherefores of clustering microarray data. Brief Bioinform 2005, 6(4):331–343.View ArticlePubMedGoogle Scholar
 Ji H, Wong WH: Computational Biology: Toward Deciphering Gene Regulatory Information in Mammalian Genomes. Biometrics 2006, 62(19):645–663.View ArticlePubMedGoogle Scholar
 Luan Y, Li H: Clustering of timecourse gene expression data using a mixedeffects model with Bsplines. Bioinformatics 2003, 19(4):474–482.View ArticlePubMedGoogle Scholar
 Ng SK, Mclachlan GJ, Wang K, Jones LBT, Ng SW: A Mixture model with randomeffects components for clustering correlated geneexpression profiles. Bioinformatics 2006, 22(14):1745–1752.View ArticlePubMedGoogle Scholar
 Wu FX, Zhang WJ, Kusalik AJ: Dynamic modelbased clustering for timecourse gene expression data. J Bioinform Comput Biol 2005, 3(4):821–836.View ArticlePubMedGoogle Scholar
 Heard NA, Holmes CC, Stephens DA: A quantitative study of gene regulation involved in the immune response of Anopheline mosquitoes: An application of Bayesian hierarchical clustering of curves. Journal of the American Statistical Association 2006, 101(473):18–29.View ArticleGoogle Scholar
 Yeung KY, Medvedovic M, Bumgarner RE: Clustering gene expression data with repeated measurements. Genome Biology 2003, 4(5):R34.PubMed CentralView ArticlePubMedGoogle Scholar
 Thalamuthu A, Mukhopadhyay I, Zheng X, Tseng GC: Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics 2006, 22(19):2405–2412.View ArticlePubMedGoogle Scholar
 Fraley C, Raftery AE: Enhanced ModelBased Clustering, Density Estimation, and Discriminant Analysis Software: MCLUST. Journal of Classification 2003, 20(2):263–286.View ArticleGoogle Scholar
 Wakefield J, Zhou C, Self G: Modelling gene expression data over time: Curve clustering with informative prior distributions. Bayesian Statistics 2003.Google Scholar
 Fraley C, Raftery AE: How Many Clusters? Which Clustering Method? Answers Via ModelBased Cluster Analysis. The Computer Journal 1998, 41(8):578–588.View ArticleGoogle Scholar
 Beran R: Minimum distance procedures. Handbook of Statistics 1984, 4: 741–754.View ArticleGoogle Scholar
 Scott DW: Parametric statistical modeling by minimum integrated square error. Technometrics 2001, 43(3):274–285.View ArticleGoogle Scholar
 Tseng GC, Wong WH: Tight Clustering: A ResamplingBased Approach for Identifying Stable and Tight Patterns in Data. Biometrics 2005, 61: 10–16.View ArticlePubMedGoogle Scholar
 BarJoseph Z, Gerber G, Gifford DK, Jaakkola TS, Simon I: A new approach to analyzing gene expression time series data. Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB 2002, 39–48.Google Scholar
 Ma P, CastilloDavis CI, Zhong W, Liu JS: A datadriven clustering method for time course gene expression data. Nucleic Acids Research 2006, 34(4):1261–1269.PubMed CentralView ArticlePubMedGoogle Scholar
 Tjaden B: An approach for clustering gene expression data with error information. BMC Bioinformatics 2006, 7: 17.PubMed CentralView ArticlePubMedGoogle Scholar
 Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, IsselTarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. Nat Genet 2000, 25: 25–29.PubMed CentralView ArticlePubMedGoogle Scholar
 Parzen E: On the estimation of a probability density function and mode. Annals of Mathematical Statistics 1962, 33: 1065–1076.View ArticleGoogle Scholar
 Zacks S: Parametric Statistical Inference. Pergamon Press; 1981.Google Scholar
 Mayoral L: Minimum distance estimation of stationary and nonstationary ARFIMA processes. The Econometrics Journal 2007, 10: 124–148.View ArticleGoogle Scholar
 GarciaDorado A, Gallego A: Comparing Analysis Methods for MutationAccumulation Data: A Simulation Study. Genetics 2003, 164(2):807–819.PubMed CentralPubMedGoogle Scholar
 Parr WC, Schucany WR: Minimum Distance and Robust Estimation. Journal of the American Statistical Association 1980, 75(371):616–624.View ArticleGoogle Scholar
 Wand MP, Jones MC: Kernel Smoothing. Monographs on Statistics and Applied Probability. London: Chapman and Hall; 1995.Google Scholar
 Basu A, Harris I, Hjort N, Jones M: Robust and efficient estimation by minimising a density power divergence. Biometrika 1998, 85: 549–559.View ArticleGoogle Scholar
 Yeung K, Fraley C, Murua A, Raftery A, Ruzzo W: Modelbased clustering and data transformations for gene expression data. Bioinformatics 2001, 17(10):977–987.View ArticlePubMedGoogle Scholar
 Calinski T, Harabasz J: A dendrite method for cluster analysis. Comm Statist 1974, 3: 1–27.View ArticleGoogle Scholar
 Hubert L, Arabie P: Comparing partitions. Journal of Classification 1985, 2: 193–218.View ArticleGoogle Scholar
 Medvedovic M, Yeung KY, Bumgarner RE: Bayesian mixture model based clustering of replicated microarray data. Bioinformatics 2004, 20(8):1222–1232.View ArticlePubMedGoogle Scholar
 Schliep A, Costa IG, Steinhoff C, Schonhuth A: Analyzing gene expression timecourses. IEEE/ACM Trans Comput Biol Bioinform 2005, 2(3):179–193.View ArticlePubMedGoogle Scholar
 Dojer N, Gambin A, Mizera A, Wilczynski B, Tiuryn J: Applying dynamic Bayesian networks to perturbed gene expression data. BMC Bioinformatics 2006, 7.Google Scholar
 Jiang D, Pei J, Ramanathan M, Tang C, Zhang A: Mining coherent gene clusters from genesampletime microarray data. In KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM Press; 2004:430–439.View ArticleGoogle Scholar
 Qin L, Self SG: The clustering of regression models method with applications in gene expression data. Biometrics 2006, 62(2):526–533.View ArticlePubMedGoogle Scholar
 Ernst J, Nau GJ, BarJoseph Z: Clustering short time series gene expression data. Bioinformatics 2005., 21(SUPPL. 1):Google Scholar
 Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, Wolfsberg TG, Gabrielian AE, Landsman D, Lockhart DJ, Davis RW: A genomewide transcriptional analysis of the mitotic cell cycle. Molecular Cell 1998, 2: 65–73.View ArticlePubMedGoogle Scholar
 Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown P, Botstein D, Futcher B: Comprehensive identification of cell cycleregulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9(12):3273–97.PubMed CentralView ArticlePubMedGoogle Scholar
 Yuan Y, Li CT: Unsupervised Clustering of Gene Expression Time Series with Conditional Random Fields. Proceedings of IEEE Workshop on Biomedical Applications for Digital Ecosystems 2007.Google Scholar
 Fraley C, Raftery A: ModelBased Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association 2002, 97(458):611–631.View ArticleGoogle Scholar
 Tavazoie S, Hughes J, Campbell M, Cho R, Church G: Systematic determination of genetic network architecture. Nat Genet 1999, 22(3):281–285.View ArticlePubMedGoogle Scholar
 Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society 1995, B(57):289–300.Google Scholar
 Fraley C, Raftery AE: MCLUST version 3: an R package for normal mixture modeling and modelbased clustering. Technical Report 504, Department of Statistics, University of Washington, Seattle 2006.Google Scholar
 Ideker T, Thorsson V, Ranish JA, Christmas R, Buhler J, Eng JK, Bumgarner R, Goodlett DR, Aebersold R, Hood L: Integrated Genomic and Proteomic Analyses of a Systematically Perturbed Metabolic Network. Science 2001, 292(5518):929–934.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.