Time-course microarray experiments produce vector gene expression profiles across a series of time points. Clustering genes based on these profiles is important for discovering functionally related and co-regulated genes. Early clustering algorithms do not take advantage of the ordering in a time-course study, explicit use of which should allow more sensitive detection of genes that display a consistent pattern over time. Peddada et al. [1] proposed a clustering algorithm that incorporates the temporal ordering using order-restricted statistical inference. This algorithm is, however, very time-consuming and hence inapplicable to most microarray experiments, which contain a large number of genes. Its computational burden also makes it difficult to assess clustering reliability, an important measure when clustering noisy microarray data.

Results

We propose a computationally efficient, information criterion-based clustering algorithm, called ORICC, that also takes the ordering in time-course microarray experiments into account by embedding order-restricted inference into a model selection framework. Genes are assigned to the profile they best match, as determined by a newly proposed information criterion for order-restricted inference. In addition, we developed a bootstrap procedure to assess ORICC's clustering reliability for every gene. Simulation studies show that the ORICC method is robust, always gives better clustering accuracy than Peddada's method, and is hundreds of times faster. Under some scenarios, its accuracy is also better than that of other existing clustering methods for short time-course microarray data, such as STEM [2] and Wang et al. [3]. It is also computationally much faster than Wang et al. [3].

Conclusion

Our ORICC algorithm, which takes advantage of the temporal ordering in time-course microarray experiments, provides good clustering accuracy while being much faster than Peddada's method. Moreover, the clustering reliability for each gene can also be assessed, which is unavailable in Peddada's method. In a real data example, the ORICC algorithm identifies new and interesting genes that previous analyses failed to reveal.

Background

The development of microarray technology provides a powerful analytical tool for large-scale genomic research. Its ability to simultaneously study thousands of genes under a multitude of conditions presents a huge challenge to comprehend and interpret the resulting mass of data. An important application of microarray technology is to study the dynamic patterns of gene expression across a series of time points and to find gene clusters within which genes share similar patterns. The premise is that genes sharing similar expression profiles might be functionally related or co-regulated. Therefore, microarray data may provide insights into gene-gene interactions, gene function and pathway identification. Examples of such studies include response to temperature changes and other stress conditions [4], immune response [5], developmental studies [6], and various systems in the cell [7]. Early clustering analyses of microarray data were mostly of static microarray experiments, using methods such as hierarchical clustering [8], the k-nearest neighbors method [9, 10], and other correlation-based methods [11, 12]. These methods are not designed for time-course microarray data and cannot effectively utilize the temporal information. Many clustering algorithms for time-course microarray data have since been developed. Most of them view observed temporal gene expression profiles as coming from underlying smooth curves and cluster genes based on estimated expression profiles obtained by nonparametric smoothing [2, 13–25]. While these algorithms work well for relatively long time series, they are not appropriate for short time-course microarray data, often taken at a small number of sparse time points; generally, they tend to overfit the data when the number of time points is small [2]. A few clustering methods have also been proposed specifically for short time-course microarray data, such as [1–3, 26, 27]. Among them, Peddada et al. [1] proposed an interesting idea of using order-restricted inference in clustering short time-course microarray data. A set of candidate expression profiles is first defined by inequality constraints among expression levels at different time points, i.e., ups and downs. This strategy is less restrictive than those that define profiles via pre-specified expression levels, because only the general shape of the profile is needed. Each candidate profile then represents a potential gene cluster. For short time-course data, the total number of candidate profiles is generally not large. Peddada's method then assigns each gene to the cluster represented by its best-matched candidate profile, determined by an order-restricted statistical inference procedure. Two genes fall into the same cluster if their best-matched candidate profiles are the same. This profile-matching clustering strategy differs from most unsupervised clustering, where a representation of a cluster is often calculated only after the cluster is formed. The method requires no smoothing of the expression profiles and was shown to discover more functionally related genes when applied to the breast cancer cell-line data in [28]. However, Peddada's method is computationally very costly when the number of genes is large. In our experience, on a workstation with a 2.30 GHz AMD Athlon(tm) 64 × 2 Dual Core 4400+ processor and 2.00 GB of memory, it took at least 72 hours to run Peddada's method (implemented in Matlab) on the breast cancer cell line data containing about 1900 genes.

In this article, approaching order-restricted inference from a different perspective, we propose a new order-restricted information criterion-based clustering (ORICC) algorithm, which is computationally much more efficient than Peddada's method. Our method selects and clusters genes using ideas from model selection for order-restricted inference, where estimation makes use of the inequalities that define the candidate profiles. The first step is to define candidate profiles and express them in terms of inequalities between the expected gene expression levels at various time points. For a given candidate profile, we estimate each gene's mean expression levels at the different time points using the order-restricted maximum likelihood [29]. The best-fitting profile for a given gene is then selected using an information criterion for order-restricted inference. Due to the simplicity of our algorithm, the analysis of the breast cancer cell-line data in [28] can now be done in a few minutes.

Results and discussion

Inequality profiles

Suppose that a time-course microarray experiment includes T time points, and at each time point there are M arrays, each with G genes. Denote by y_{gti} the expression measurement of gene g at time point t on the ith array. Suppose that the unknown true mean expression level of gene g at time t is μ_{gt}, i.e., μ_{gt} = E(y_{gti}) for all i. A candidate gene expression profile of gene g is then given by inequalities between the components of μ_{g} = (μ_{g1}, μ_{g2}, ⋯, μ_{gT})^{T}. In the following, we define some typical inequality profiles, dropping the subscript g for simplicity.

C_{0} profile :

μ_{1} = μ_{2} = ⋯ = μ_{T} (1)

C_{⊥} profile :

μ_{1} ⊥ μ_{2} ⊥ ⋯ ⊥ μ_{T} (2)

where μ_{i} ⊥ μ_{j} means that there is no defined inequality constraint between μ_{i} and μ_{j}.

Monotone increasing profile (simple order) :

μ_{1} ≤ μ_{2} ≤ ⋯ ≤ μ_{T} (3)

(with at least one strict inequality). Similarly, a monotone decreasing profile C_{↓} is given by replacing ≤ by ≥ in (3).

Up-down profile C_{∧} with maximum at i (umbrella order) :

μ_{1} ≤ μ_{2} ≤ ⋯ ≤ μ_{i} ≥ μ_{i+1} ≥ ⋯ ≥ μ_{T} (4)

(with at least one strict inequality among μ_{1} ≤ μ_{2} ≤ ⋯ ≤ μ_{i} and one among μ_{i} ≥ μ_{i+1} ≥ ⋯ ≥ μ_{T}). Genes satisfying this profile have mean expression values non-decreasing in time up to time point i and non-increasing thereafter. One may similarly define a down-up profile C_{∨}.

Cyclical profile with minima at 1, j, and T and maxima at i and k :

μ_{1} ≤ ⋯ ≤ μ_{i} ≥ ⋯ ≥ μ_{j} ≤ ⋯ ≤ μ_{k} ≥ ⋯ ≥ μ_{T}, 1 < i < j < k < T (5)

(with at least one strict inequality among each monotone sub-profile). Cyclical profiles may be important in relatively long time-course experiments where the mean expression value could oscillate.

Incomplete inequality profiles :

(6)

(with at least one strict inequality among each monotone sub-profile). Profiles (6) are useful when the investigator is unable to specify inequalities between certain means.
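As a concrete illustration, inequality profiles of the kinds above can be encoded compactly in code. The Python sketch below (the methods in this article were implemented in R and Matlab, so this representation is ours, purely for illustration) records the constraint between consecutive means as +1 (≤), −1 (≥), 0 (=) or None (⊥):

```python
# Sketch: encoding inequality profiles as sign patterns between
# consecutive time points. +1 means mu_t <= mu_{t+1}, -1 means
# mu_t >= mu_{t+1}, 0 means equality, None means no constraint (⊥).
# The names and representation are illustrative, not from the paper.

def satisfies(mu, pattern, tol=1e-9):
    """Check whether a mean vector mu is consistent with a profile pattern."""
    assert len(mu) == len(pattern) + 1
    strict = False
    for t, s in enumerate(pattern):
        d = mu[t + 1] - mu[t]
        if s is None:            # ⊥: no constraint between mu_t and mu_{t+1}
            continue
        if s == 0 and abs(d) > tol:
            return False
        if s == 1 and d < -tol:
            return False
        if s == -1 and d > tol:
            return False
        if s != 0 and abs(d) > tol:
            strict = True        # at least one strict inequality is required
    return strict or all(s == 0 or s is None for s in pattern)

# T = 6 time points -> 5 consecutive constraints
C0        = [0, 0, 0, 0, 0]       # flat profile (1)
C_up      = [1, 1, 1, 1, 1]       # monotone increasing (3)
C_updown3 = [1, 1, -1, -1, -1]    # up-down profile with maximum at t = 3 (4)

print(satisfies([0, 1, 2, 3, 4, 5], C_up))        # True
print(satisfies([0, 1, 2, 1, 0, -1], C_updown3))  # True
print(satisfies([0, 1, 2, 3, 4, 5], C0))          # False
```

A profile with ⊥ positions simply skips those comparisons, matching the definition of C_{⊥} and the incomplete profiles (6).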

Information criterion-based clustering using order-restricted maximum likelihood

Our procedure seeks to match a gene's true profile, estimated from the observed data, to one of a specified set of candidate profiles. Given that the relationships among a gene's mean expression levels at different time points are defined by a candidate profile, we first obtain the order-restricted maximum likelihood estimates (MLE) of the gene's mean expression levels at all time points. Details for simple order and umbrella order constraints are given in the Methods section; a general discussion of order-restricted MLE can be found in [29], and some specific examples in [1]. Peddada's method then carries out a bootstrap-based likelihood ratio test to decide a gene's best matched profile, which requires recomputing the order-restricted MLEs at least hundreds of times for each gene. We replace this very time-consuming bootstrap procedure with the computation of an information criterion function; the best matched profile is the one with the smallest criterion value. Information criteria, such as the Akaike information criterion (AIC) [30] and the Bayesian information criterion (BIC) [31], are widely used for selecting a model from a set of potential models. In our context, each candidate profile can be viewed as a potential model for a gene's expression pattern, and hence a properly defined information criterion can be used to decide the gene's best matched profile. However, the widely used AIC and BIC are designed for models with a fixed number of parameters, whereas the inequality constraints do not explicitly specify any parameters. Some information criteria for order-restricted inference have been proposed recently [32, 33], but they apply only to the simple order constraints (3).
We have recently proposed a new order-restricted information criterion (ORIC) for general inequality profiles, which we show always selects the correct profile when the sample size is large enough (unpublished manuscript). Suppose that there are d candidate profiles of interest. Let l(λ) denote the maximum log-likelihood under the λ-th candidate profile, and let ν_{1}(λ) and ν_{2}(λ) be the numbers of ⊥ and {≥, ≤} constraints specified in the profile, λ = 1, ..., d. Then, the ORIC function is

(7)

where M is the number of replicated arrays and

(8)

The ORIC function is similar in spirit to AIC and BIC, with p(λ) representing the model complexity: the more inequality constraints in a profile, the more complex it is as a model. A profile with a smaller ORIC value is regarded as a better match to the gene's expression pattern. In the following, we describe our ORIC-based clustering (ORICC) algorithm (one-stage ORICC) and its computationally faster variant (two-stage ORICC).
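To make the order-restricted fit concrete: for normal data with equal weights, the restricted MLE of the time-point means under the monotone increasing profile (3) is the isotonic regression of the sample means, computable with the classical pool-adjacent-violators algorithm (PAVA). The following is a textbook PAVA in Python, not the authors' R implementation:

```python
# Sketch: order-restricted MLE under the monotone increasing profile.
# For normal data, the restricted MLE of the time-point means is the
# isotonic regression of the sample means, computed here with the
# pool-adjacent-violators algorithm (PAVA). Textbook version, shown
# purely for illustration.

def pava(y, w=None):
    """Weighted isotonic regression of y under a non-decreasing constraint."""
    n = len(y)
    w = [1.0] * n if w is None else list(w)
    # Each block holds (weighted mean, total weight, block length).
    out = []
    for i in range(n):
        out.append((y[i], w[i], 1))
        # Pool adjacent blocks while they violate the ordering.
        while len(out) > 1 and out[-2][0] > out[-1][0]:
            m2, w2, n2 = out.pop()
            m1, w1, n1 = out.pop()
            wt = w1 + w2
            out.append(((m1 * w1 + m2 * w2) / wt, wt, n1 + n2))
    fit = []
    for m, _, k in out:
        fit.extend([m] * k)   # each pooled block gets its common mean
    return fit

# Sample means at T = 6 time points with two ordering violations.
ybar = [1.0, 2.0, 1.5, 3.0, 2.5, 4.0]
print(pava(ybar))  # [1.0, 1.75, 1.75, 2.75, 2.75, 4.0]
```

For an up-down (umbrella) profile with known peak i, the analogous fit runs PAVA forward up to i and backward from i; the maximum log-likelihood l(λ) then follows from the residual sum of squares of the fitted means.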

One-stage ORICC

Step 1. Pre-specify a collection of candidate profiles, {C_{1}, ..., C_{d}}. To prevent genes with very little change over time from being matched to these profiles, we also include C_{0}, defined in (1), in the collection.

Step 2. Compute p(λ) in (8) for all candidate profiles.

Step 3. For gene g, compute the order-restricted MLE (μ̂_{g1}(λ), ⋯, μ̂_{gT}(λ)) of (μ_{g1}, ⋯, μ_{gT}) and the maximum log-likelihood l(λ) under each candidate profile C_{λ}, λ = 0, 1, ⋯, d.

Step 4. For gene g, compute the information criterion function ORIC(λ) in (7) for all λ = 0, 1, ⋯, d. The best matched profile is then the one corresponding to λ̂ = arg min_{0 ≤ λ ≤ d} ORIC(λ), and gene g is assigned to the λ̂-th cluster if λ̂ ≠ 0.

Step 5. Repeat Steps 3 and 4 for every gene.
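The steps above can be sketched as a simple loop. Here `criterion(g, lam)` is a caller-supplied stand-in for ORIC(λ) in (7), so this Python fragment illustrates only the control flow of the algorithm, not the paper's R code:

```python
# Sketch of the one-stage ORICC loop. criterion(g, lam) stands in for
# ORIC(lam); it is a placeholder supplied by the caller, not the
# paper's implementation.

def one_stage_oricc(genes, d, criterion):
    """Assign each gene to the candidate profile minimizing the criterion.

    genes     : iterable of gene identifiers
    d         : number of candidate profiles (profile 0 is the flat C0)
    criterion : function (gene, lam) -> ORIC-like score, smaller is better
    Returns {gene: best_lambda}; genes whose best match is C0 are dropped.
    """
    clusters = {}
    for g in genes:
        scores = [criterion(g, lam) for lam in range(d + 1)]
        best = min(range(d + 1), key=scores.__getitem__)
        if best != 0:              # Step 4: keep gene only if best match != C0
            clusters[g] = best
    return clusters

# Toy usage with a precomputed score table (rows: genes, columns: lambda).
table = {"geneA": [5.0, 2.0, 3.0], "geneB": [1.0, 4.0, 6.0]}
print(one_stage_oricc(table, 2, lambda g, lam: table[g][lam]))
# geneA is assigned to profile 1; geneB best matches C0 and is excluded
```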

Although our one-stage ORICC algorithm is hundreds of times faster than Peddada's method, performing Step 3 for all genes can still cost considerable computational time even when only a moderate number of candidate profiles is considered, because the number of genes is generally huge. This issue is more pressing for relatively long time-course microarray studies, as more candidate profiles usually need to be considered. Next, we propose a computationally more efficient two-stage algorithm by adding a pre-screening stage.

Two-stage ORICC

Step 1. Pre-specify a collection of candidate profiles, {C_{1}, ..., C_{d}}. Here, we also add C_{0} and C_{⊥}, defined in (1) and (2), to the collection for screening purposes.

Step 2. Compute p(λ) in (8) for all candidate profiles.

Step 3. For gene g, compute the order-restricted MLE (μ̂_{g1}, ⋯, μ̂_{gT}) of (μ_{g1}, ⋯, μ_{gT}) and the maximum log-likelihood l(λ) under profiles C_{0} and C_{⊥}.

Step 4. For gene g, compute ORIC(λ) in (7) for λ = 0, ⊥. Exclude gene g from further consideration if ORIC(0) < ORIC(⊥).

Step 5. Repeat Steps 3 and 4 for every gene. Denote the set of remaining genes by S.

Step 6. Run Steps 3–5 of the one-stage ORICC algorithm for the genes in S, considering only {C_{1}, ..., C_{d}} as candidate profiles.

In the one-stage algorithm, the ORIC function is evaluated for every gene under every candidate profile, whereas the two-stage algorithm first screens out genes that show no significant changes over time by comparing the two profiles C_{0} and C_{⊥}, and then applies the one-stage algorithm to the much smaller set of remaining genes. As a result, the two-stage algorithm is usually much faster and reports tighter clusters with fewer genes in them.
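The screening stage amounts to a single comparison per gene. In this illustrative Python sketch, `criterion` again stands in for the ORIC computation (it is not part of the paper's code):

```python
# Sketch of the two-stage screening step: a gene passes to stage two
# only if it is NOT excluded by ORIC(0) < ORIC(perp), i.e. only if the
# unconstrained profile C_perp fits at least as well as the flat C0.
# criterion is a caller-supplied stand-in for the ORIC function.

def screen(genes, criterion):
    """Return the genes with ORIC(0) >= ORIC(perp): not clearly flat."""
    return [g for g in genes if criterion(g, "C0") >= criterion(g, "Cperp")]

# Toy scores: a flat gene (C0 fits best) and a gene that changes over time.
scores = {"flat_gene":   {"C0": 1.0, "Cperp": 3.0},
          "moving_gene": {"C0": 9.0, "Cperp": 4.0}}
print(screen(scores, lambda g, lam: scores[g][lam]))  # ['moving_gene']
```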

Filtering genes with small expression levels

Some genes selected by the ORICC algorithm may have small mean expression levels at every time point. Such genes may not be of interest to some investigators. Peddada et al. [1] suggested a simple step to remove them. We include it here for completeness.

Let

(9)

where . Large values of v_{g} indicate that the mean expression of gene g is high for at least one time point. Arrange the genes selected by the ORICC algorithm in descending order of v_{g} and retain the top R genes.
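As an illustration of this filtering step: equation (9) defining v_g does not survive in this copy, so the sketch below ASSUMES v_g = max_t |μ̂_gt|, a choice consistent with the description that large v_g means high mean expression at some time point, purely for illustration:

```python
# Sketch of the filtering step. Equation (9) for v_g is not reproduced
# here; we ASSUME v_g = max_t |mu_hat_gt| purely for illustration,
# consistent with the text's description of v_g.

def top_r_genes(mu_hat, R):
    """Rank genes by v_g (assumed max absolute fitted mean), keep top R."""
    v = {g: max(abs(m) for m in mus) for g, mus in mu_hat.items()}
    ranked = sorted(v, key=v.get, reverse=True)
    return ranked[:R]

# Hypothetical fitted mean profiles for three genes.
fits = {"g1": [0.1, 0.2, 0.1], "g2": [1.0, 2.5, 1.2], "g3": [0.5, 0.9, 0.4]}
print(top_r_genes(fits, 2))  # ['g2', 'g3']
```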

Assessing the reliability of the ORICC results

Microarray data are often noisy, and hence it is important to assess the reliability of the clustering results. Among the recently developed methods for assessing clustering reliability [34–38], we adopt a general bootstrap framework proposed by Kerr and Churchill [36], in which the clustering procedure is first applied to the original data and then to a large number of bootstrap samples obtained by perturbing the original data. Since Peddada's method is so computationally costly, it is impractical to embed it in this framework. By contrast, our ORICC method is computationally very efficient and can easily be embedded in it.

In time-course microarray studies, we can use the following analysis of variance (ANOVA) model to account for sources of variation in microarray data.

(10)

where y_{gti} is the relative expression measurement from array i, time point t, and gene g, on an appropriate scale (typically the log scale). The terms β_{t}, γ_{i} and ψ_{ti} account for all effects that are not gene-specific. We assume that the error terms ε_{gti} are independent with mean 0 and finite variance, but do not make any other distributional assumption. The bootstrap assessment is then done as follows.

Step 1. Estimate model (10), which can be done straightforwardly in any statistics software, such as SAS [39] and R [40].

Step 2. Generate B bootstrap samples by

where a ^ over a term denotes its estimate from the original model fit in Step 1, and the bootstrap errors are drawn with replacement from the studentized residuals of the original model fit.

Step 3. Repeat the ORICC algorithm for each bootstrap sample.
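Step 2's residual resampling can be sketched as follows; `fitted` and `residuals` are hypothetical placeholders for the Step 1 model fit, and the code is an illustrative residual-bootstrap scheme, not the paper's implementation:

```python
# Sketch: generating one bootstrap data set by adding residuals resampled
# with replacement (from the studentized residuals of the original fit)
# back onto the fitted values. `fitted` and `residuals` are hypothetical
# placeholders for the Step 1 model fit.
import random

def bootstrap_sample(fitted, residuals, rng=None):
    """Return one bootstrap data set: fitted values plus resampled residuals."""
    rng = rng or random.Random()
    return [f + rng.choice(residuals) for f in fitted]

fit = [1.0, 2.0, 3.0]      # hypothetical fitted values
res = [-0.1, 0.0, 0.1]     # hypothetical studentized residuals
print(bootstrap_sample(fit, res, random.Random(1)))
```

Repeating this B times yields the B bootstrap samples to which the ORICC algorithm is re-applied in Step 3.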

Now, the original clustering is accompanied by a collection of bootstrap clusterings, which can be regarded as a sample of clusterings close to the original one in the space of all possible clusterings. When the level of noise in the original data is low, the bootstrap clusterings tend to resemble the original clustering closely. We can then calculate a reliability measure for each gene as the proportion of bootstrap clusterings in which it is assigned to the same profile as in the original clustering. The larger the measure, the more reliable the gene's cluster membership.
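This per-gene reliability measure can be computed as in the small sketch below (gene names and cluster labels are illustrative):

```python
# Sketch: per-gene clustering reliability as the proportion of bootstrap
# clusterings agreeing with the original assignment. Labels are
# illustrative.

def reliability(original, bootstraps):
    """original: {gene: profile}; bootstraps: list of such dicts.
    Returns {gene: fraction of bootstrap runs matching the original}."""
    out = {}
    for g, lam in original.items():
        agree = sum(1 for b in bootstraps if b.get(g) == lam)
        out[g] = agree / len(bootstraps)
    return out

orig = {"g1": 3, "g2": 7}
boots = [{"g1": 3, "g2": 7}, {"g1": 3, "g2": 1},
         {"g1": 3, "g2": 7}, {"g1": 5, "g2": 7}]
print(reliability(orig, boots))  # {'g1': 0.75, 'g2': 0.75}
```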

Simulation studies

In this section, we use Monte Carlo simulation to examine the performance of the ORICC method and compare it with other clustering methods for short time-course microarray data, including Peddada's method, STEM [2], and Wang's method [3].

The STEM algorithm works by assigning genes to a pre-defined set of model profiles that capture the potentially distinct patterns expected from the microarray experiment. Each gene is assigned to the closest model profile under a chosen distance measure, e.g., correlation, and genes assigned to the same model profile constitute a cluster. Significant profiles/clusters are then determined by hypothesis tests; genes in insignificant clusters are usually not reported. Wang's method represents each gene's temporal profile by a polynomial model and estimates the model using a Bayesian approach. A heuristic search strategy [23] is then applied to obtain clusters by repeatedly merging models to improve the marginal likelihood.

All simulations were carried out on a workstation with a 2.30 GHz AMD Athlon(tm) 64 × 2 Dual Core 4400+ processor and 2.00 GB of memory. Peddada's method, Wang's method and the one-stage ORICC algorithm are implemented in R [40], whereas the STEM software is a Java program written by its authors.

Simulation 1

In the first simulation study, we consider ten inequality profiles (C_{1}–C_{10}) plus a flat pattern (C_{0}) to represent a total of eleven clusters. We set the number of time points as T = 6. These numbers are chosen to be similar to those in the real data set analyzed in the next section. The eleven profiles are specified as follows. For compactness, we drop μ ∈ R^{T} and the phrase 'with a strict inequality' when defining the profiles. True values of μ = (μ_{1}, μ_{2}, μ_{3}, μ_{4}, μ_{5}, μ_{6}) in each profile are also given.

Figure 1 shows the inequality profiles C_{1}–C_{10}. We generated a data set with 200 genes from each profile.

At each time point t, we generated M replicates of each gene's expression level from normal distributions with mean μ_{t} and variance σ^{2}. To assess the effect of data variability and replication on the clustering results, we varied the variance σ^{2} from 0.2 to 1.2 (in increments of 0.2) and the number of replicates M from 2 to 10.
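The data-generating step can be sketched as follows (illustrative Python; the simulations in the paper were run in R):

```python
# Sketch: generating Simulation 1 style data, M replicates per time point
# drawn from N(mu_t, sigma^2), as described above. Illustrative only.
import random

def simulate_gene(mu, M, sigma, rng):
    """Return a T x M matrix of expression values for one gene."""
    return [[rng.gauss(m, sigma) for _ in range(M)] for m in mu]

# One gene from an up-down profile, T = 6 time points, M = 5, sigma^2 = 1.
data = simulate_gene([0.0, 1.0, 2.0, 2.0, 1.0, 0.0], M=5, sigma=1.0,
                     rng=random.Random(42))
print(len(data), len(data[0]))  # 6 5
```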

Next, we clustered the simulated data using Peddada's method and the one-stage ORICC algorithm, considering the ten candidate inequality profiles C_{1}–C_{10}. For Peddada's method, we set the number of bootstrap replications to 200 and the significance level of the bootstrap-based test to 0.025. Peddada et al. originally proposed using significance level 0.0025, but we observed that with this choice a large number of "non-flat" genes are clustered to C_{0}. Meanwhile, using the common significance level of 0.05 tends to cluster many genes from C_{0} to other "non-flat" profiles. A significance level of 0.025 offers a good compromise between the two kinds of false clustering.

Let γ_{i} denote the number of genes with true profile C_{i} that are correctly clustered to profile C_{i}, i = 0, 1, ..., 10. The overall error rate and the false positive rate are then given by 1 − Σ_{i=0}^{10} γ_{i}/(11 × 200) and 1 − γ_{0}/200, respectively. Let δ_{i} denote the number of genes with true profile C_{i}, i = 1, ..., 10, that are clustered to profile C_{0}. The false negative rate is then given by Σ_{i=1}^{10} δ_{i}/(10 × 200).
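These three rates can be computed from a confusion matrix of true versus assigned profiles, as in this illustrative sketch (the toy matrix below is hypothetical, not a simulation result):

```python
# Sketch: computing the three accuracy measures of Simulation 1 from a
# confusion matrix n[i][j] = number of genes with true profile C_i
# clustered to profile C_j (profile 0 is the flat C0), with 200 genes
# per true cluster. Illustrative code, not the paper's.

def error_rates(n, size=200):
    k = len(n)                                      # number of profiles
    overall = 1 - sum(n[i][i] for i in range(k)) / (k * size)
    false_pos = 1 - n[0][0] / size                  # flat genes mis-clustered
    false_neg = sum(n[i][0] for i in range(1, k)) / ((k - 1) * size)
    return overall, false_pos, false_neg

# Hypothetical confusion matrix: 190 genes per cluster correct,
# 10 per cluster mis-assigned.
k = 11
n = [[0] * k for _ in range(k)]
for i in range(k):
    n[i][i] = 190
n[0][1] = 10                # 10 flat genes sent to C1 (false positives)
for i in range(1, k):
    n[i][0] = 10            # 10 genes per non-flat profile sent to C0

print([round(r, 4) for r in error_rates(n)])  # [0.05, 0.05, 0.05]
```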

We then use the overall error rate, the false positive rate and the false negative rate to evaluate the accuracy of the two algorithms. Simulation results are summarized in Figures 2, 3 and 4. Figures 2 and 4 show that the overall error rate and the false negative rate of the one-stage ORICC algorithm are always better than those of Peddada's method, and Figure 3 shows that its false positive rate is better in most cases. The one-stage ORICC algorithm not only provides good clustering accuracy but is also much faster than Peddada's method: for example, when σ^{2} = 1 and M = 5, the run times for Peddada's method and the one-stage ORICC algorithm are 2979.29 seconds and 25.55 seconds, respectively.

Simulation 2

In the second simulation study, we consider the same set of inequality profiles and simulate the data in the same way as in Simulation 1, but we fix the number of replicates M at 5. To study the effect of the true cluster size on the clustering accuracy, we consider different cluster sizes: 50, 100, 150 and 200. Meanwhile, we also vary the variance σ^{2} from 0.2 to 3.0 in increments of 0.4. We then cluster the simulated data using Peddada's method, Wang's method, STEM and the one-stage ORICC algorithm.

For Peddada's method and the one-stage ORICC algorithm, we consider the eleven candidate inequality profiles C_{0}–C_{10}. For Wang's method, we set the prior hyper-parameters (α_{1}, α_{2}) in the gamma prior distribution Gamma(α_{1}, α_{2}) to (2, 2). For STEM, we assume 50 possible profiles and use the recommended default settings of the package. To be consistent, we did not filter out any genes in any of these analyses. We then use Rand's C statistic [41] to evaluate the similarity between the true cluster assignment and the clustering results of the different methods. Rand's C statistic is defined as follows. Given a pair of clusterings C and C' of the same N objects, arbitrarily number the clusters in each clustering and let n_{ij} be the number of objects simultaneously in the ith cluster of C and the jth cluster of C'. Then, Rand's C statistic is given by

C = [ N(N − 1)/2 − ½ Σ_{i} n_{i·}^{2} − ½ Σ_{j} n_{·j}^{2} + Σ_{i,j} n_{ij}^{2} ] / [ N(N − 1)/2 ], where n_{i·} = Σ_{j} n_{ij} and n_{·j} = Σ_{i} n_{ij},

which denotes the proportion of pairs of objects that are assigned consistently in the two clusterings. Figure 5 gives Rand's C statistic for the different clustering methods across different σ^{2} and cluster sizes. It shows that the precision of all methods decreases as the variance increases, and that the cluster size has no obvious effect on the clustering precision of Peddada's method, STEM or the one-stage ORICC algorithm, while larger clusters improve the precision of Wang's method. The comparison shows an interesting pattern. For larger σ^{2}, STEM performs best, Wang's method worst, and Peddada's method and the one-stage ORICC are in between. For smaller σ^{2}, the order is reversed: STEM is the worst, Wang's method the best, and Peddada's method and the one-stage ORICC are still in between. When the cluster size is relatively small and σ^{2} is large, Wang's method can have quite low precision, under 70%. Overall, the one-stage ORICC algorithm is consistently more accurate than Peddada's method by a slight margin and provides good precision under all scenarios. The performance of STEM is also very stable, but it tends to underperform when the data are less noisy, i.e., when σ^{2} is small.
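Rand's C statistic is easy to compute from the contingency table of the two clusterings; the following sketch counts agreeing pairs directly (illustrative Python, not the paper's code):

```python
# Sketch: Rand's C statistic (Rand 1971) as the proportion of object
# pairs assigned consistently in two clusterings, computed from the
# contingency table of cluster labels.
from math import comb
from collections import Counter

def rand_c(labels1, labels2):
    n = len(labels1)
    pairs = comb(n, 2)
    nij = Counter(zip(labels1, labels2))  # joint cluster counts
    ni = Counter(labels1)                 # row sums n_i.
    nj = Counter(labels2)                 # column sums n_.j
    same_both = sum(comb(c, 2) for c in nij.values())
    same1 = sum(comb(c, 2) for c in ni.values())
    same2 = sum(comb(c, 2) for c in nj.values())
    # Agreements = pairs together in both clusterings + pairs apart in both.
    agree = same_both + (pairs - same1 - same2 + same_both)
    return agree / pairs

a = [0, 0, 1, 1, 2, 2]   # toy clustering of six objects
b = [0, 0, 1, 2, 2, 2]   # a second clustering of the same objects
print(round(rand_c(a, b), 4))  # 0.8
```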

Figure 6 shows the simulated eleven clusters when M = 5, σ^{2} = 1 and the cluster size is 100. Figures 7, 8, 9 and 10 show the resulting clusters from the four methods, respectively. ORICC and Peddada's method give similar clustering results that match the true patterns well. While the true number of clusters is eleven, STEM identifies six significant clusters and Wang's method keeps eight clusters. In particular, Wang's method does not cluster genes from the flat profile C_{0} into one cluster but assigns them to different clusters.

In this simulation, Peddada's method, Wang's method and the one-stage ORICC method are implemented in R, whereas the STEM software is a Java program, so we can only compare the computational efficiency of the first three methods. The one-stage ORICC method is much faster than the other two: for example, when σ^{2} = 3 and the cluster size is 200, the run times for Peddada's method, Wang's method and the one-stage ORICC algorithm are 3073.37 seconds, 10303.9 seconds and 24.72 seconds, respectively.

Simulation 3

In the third simulation, we examine the robustness of the ORICC algorithm. We consider eleven inequality profiles (C_{0}–C_{10}) plus a cyclical profile (C_{∧∧}) to represent a total of twelve clusters. We set the number of time points as T = 6. The eleven inequality profiles C_{0}–C_{10} and the true values of μ = (μ_{1}, μ_{2}, μ_{3}, μ_{4}, μ_{5}, μ_{6}) in each profile are the same as in Simulation 1. The cyclical profile C_{∧∧} and its true value of μ are given as follows:

We generate a data set with 200 genes from each of the profiles C_{0}–C_{10} and 200 × r genes from the cyclical profile C_{∧∧}. At each time point t, we generated 5 replicates of each gene's expression level from normal distributions with mean μ_{t} and variance σ^{2}. To study the robustness of the one-stage ORICC algorithm, we consider different cluster sizes, 200 × r, r = 1, 2, ⋯, 10, for the cyclical profile C_{∧∧}. Meanwhile, we also vary the variance σ^{2} from 0.2 to 3.0 in increments of 0.4. We then cluster the simulated data using the one-stage ORICC algorithm with the eleven candidate inequality profiles C_{0}–C_{10}. Note that the cyclical profile C_{∧∧} is not included in the candidate profiles.

We then use Rand's C statistic to evaluate the similarity between the true cluster assignment and the clustering results from ORICC. Figure 11 gives Rand's C statistic for the one-stage ORICC algorithm across different σ^{2} and cluster sizes of the cyclical profile C_{∧∧}. It shows that the precision of the one-stage ORICC algorithm, as measured by Rand's C statistic, increases as the variance decreases, and decreases as the cluster size of the cyclical profile C_{∧∧} increases. The clustering precision is, however, always greater than 80%, suggesting that the one-stage ORICC algorithm is very stable.

Simulation 4

In the fourth simulation, we further examine the robustness of the ORICC algorithm. We consider the scenario where a true profile is not explicitly included in the candidate profiles but may be viewed as a special case of a more flexible candidate profile. Let the true inequality profile be

We generate a data set containing 2000 genes from this profile. At each time point t, we generate M replicates of each gene's expression level from normal distributions with mean μ_{t} and variance 0.5. We then consider the candidate profiles C_{1}, C_{2}, C_{4} and C_{9} plus the profile C_{⊥} and cluster the simulated data using the one-stage ORICC algorithm. Note that the set of candidate profiles does not contain the true one, C_{∧∧}, but C_{∧∧} may be viewed as a special case of C_{⊥}. Let γ_{⊥} denote the proportion of genes clustered to the profile C_{⊥}, and define the detection error as 1 − γ_{⊥}. The simulation results are summarized in Figure 12. The detection error decreases rapidly as the number of replicates M increases: with 5 replicates it is below 30%, and with 10 replicates it drops below 5%. This indicates that it is quite safe to apply the ORICC algorithm even if some true profile is missing from the candidate profiles, provided a more comprehensive profile is included.

Application to breast cancer cell line data

Next, we apply the ORICC algorithm to log-transformed relative expression data from a breast cancer cell line microarray study in [28]. The same data set was also analyzed by Peddada et al. [1]. The experiment was conducted as follows. First, the MCF-7 breast cancer cell line was treated with 17β-estradiol or ethanol (vehicle control). Then, samples were harvested at 1, 4, 12, 24, 36 and 48 hours after treatment. At each time point, M = 8 replicate arrays were prepared, with each array consisting of G = 1901 genes. As in [1], we assumed for each gene that the variance of the log relative expression was homoscedastic over time, and considered the following 10 candidate profiles for clustering: monotone decreasing, C_{1}; monotone increasing, C_{2}; four up-down profiles with maxima at 4, 12, 24 and 36 hours, C_{3}–C_{6}, respectively; and four down-up profiles with minima at 4, 12, 24 and 36 hours, C_{7}–C_{10}, respectively. Genes matched to these profiles are regarded as estrogen responsive.

The original analysis in [28] used a simple confidence interval approach [42] and identified 105 genes that demonstrated estrogen-responsive expression. The one-stage ORICC analysis yielded 981 genes in the 10 clusters: 68 in C_{1}, 24 in C_{2}, 76 in C_{3}, 44 in C_{4}, 97 in C_{5}, 72 in C_{6}, 35 in C_{7}, 98 in C_{8}, 409 in C_{9}, and 58 in C_{10}. Due to space limitations, we present only the top 50 genes ranked by the filtering criterion in (9) (Additional file 1). The last column in Additional file 1 indicates whether the gene was previously identified in [28], and the clustering reliability for each gene is also given. Note that these 50 genes come from only nine of the 10 clusters, with none from C_{2}. Figure 13 presents the estimated profiles of these 50 genes from the order-restricted MLE. Of the 105 genes identified in [28], 44 are among our top 50, 82 among our top 100, 94 among our top 150, and 101 among our top 200. Most of the 44 genes in the top 50 selected in common are involved in cell cycle progression and DNA replication, reflecting the known sensitivity of MCF-7 cells to estrogen. Among the six genes identified by our ORICC algorithm but not in [28] (denoted by dashed lines in Figure 13), two have quite high clustering reliability. The methylmalonyl Coenzyme A mutase (Clone ID 35468 in C_{9}) has reliability 0.9967, and the deoxythymidylate kinase (Clone ID 489092 in C_{5}) has reliability 0.7800. Both genes are known to be specific to metabolic processes, and hence are very likely responsive to the metabolism of estrogen when excess estrogen is supplied to the cell. For example, methylmalonyl Coenzyme A mutase could be involved in the breakdown of estradiol into smaller metabolic fragments. This gene, however, was not reported in the top 50 list by Peddada et al. [1]. The estimated profile in Figure 13 suggests that this gene matches the candidate profile C_{9} very well.
An interesting aspect of the deoxythymidylate kinase is that this gene actually corresponds to two spots on the microarray chips (Clone IDs 489092 and 248008). The original analysis in [28] identified only one of them (Clone ID 248008), whereas our method identifies both in the same cluster with high clustering reliability.

In [1], Peddada et al. reported that 39 genes in their top 50 were in common with the 105 genes selected in [28]. Our ORICC result overlaps well with Peddada's result, with 32 genes in our top 50 also appearing in the top 50 list reported in [1]. Peddada et al. discussed in detail several newly identified genes in their top 50, such as replication factors C4 and C5. Our ORICC analysis also assigns these genes to the same clusters, but they are not in our top 50 list due to their relatively low clustering reliability, with replication factor C4 ranked 75th and replication factor C5 ranked 148th.

We further applied STEM and Wang's method to the breast cancer cell line data. Table 1 reports Rand's C statistics among the results of the four clustering methods. It shows that the three profile-matching algorithms produce results more similar to one another, while the unsupervised Wang's method is less similar to the rest. This observation can also be seen in the temporal profiles of the clusters given by the four methods (Figures 14, 15, 16 and 17). While ORICC and Peddada's method give ten clusters plus the 'flat' cluster, Wang's method identifies seven clusters, and STEM reports twelve significant clusters.

Table 1

Rand's C statistics among four clustering methods in the breast cancer cell line example.

          ORICC     Peddada   Wang      STEM
ORICC     1.0000    0.7767    0.6313    0.7142
Peddada   0.7767    1.0000    0.5948    0.7694
Wang      0.6313    0.5948    1.0000    0.6025
STEM      0.7142    0.7694    0.6025    1.0000

A larger value indicates more overlap between two clustering results.
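Rand's C statistic used in Table 1 measures agreement between two clusterings as the fraction of gene pairs on which they agree. A minimal sketch with toy labels (not the paper's data):

```python
from itertools import combinations

def rand_c(labels_a, labels_b):
    """Rand's C statistic: the fraction of gene pairs on which two
    clusterings agree (placed together in both, or apart in both)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# toy labels for 6 genes under two hypothetical clusterings
print(rand_c([1, 1, 2, 2, 3, 3], [1, 1, 2, 3, 3, 3]))  # 0.8
```

Identical clusterings give 1.0, which is why the diagonal of Table 1 is all ones.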

Discussion

In time-course microarray experiments, the ability to exploit the temporal ordering information may be especially valuable because genes whose expression levels change over time may be involved in the same cellular process or belong to the same regulatory pathway. Making use of this valuable ordering information can improve inference. Our proposed ORICC algorithm utilizes the temporal ordering information in clustering time-course microarray data using order-restricted maximum likelihood. In contrast, most existing clustering methods either cannot incorporate the temporal information or require long time series to perform reliable nonparametric smoothing, e.g. spline smoothing, whereas most time-course microarray data are short time series. In our method, the temporal ordering information is exploited through a set of pre-defined candidate expression profiles given by inequality constraints among the mean expression levels at different time points. By viewing each candidate profile as a potential model for the data and letting each profile represent a cluster, we transform the clustering problem into a model selection problem. Using an ORIC function, we decide the best-matched profile for each gene and hence determine the gene's cluster membership. Peddada et al. [1] instead performed a likelihood-based test to decide the best-matched profile. However, a bootstrap procedure is needed to determine the threshold value of the test statistic, which makes Peddada's method computationally very intensive.
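The clustering-as-model-selection idea can be sketched as follows: fit each candidate profile to a gene's mean expression vector by weighted isotonic regression, then assign the gene to the profile with the smallest penalized score. The reduced candidate set (flat/increasing/decreasing) and the simple penalty on the number of distinct fitted levels are illustrations we introduce here; they stand in for, but are not, the full profile set and ORIC complexity term p(λ) of the paper.

```python
def pava(ybar, w):
    """Weighted pool-adjacent-violators: least-squares fit of ybar
    under the increasing order mu_1 <= ... <= mu_T."""
    merged = []  # each block holds [sum of w*y, sum of w, size]
    for t in range(len(ybar)):
        merged.append([w[t] * ybar[t], w[t], 1])
        while len(merged) > 1 and merged[-2][0] / merged[-2][1] >= merged[-1][0] / merged[-1][1]:
            s, ww, k = merged.pop()  # pool adjacent violating blocks
            merged[-1][0] += s; merged[-1][1] += ww; merged[-1][2] += k
    fit = []
    for s, ww, k in merged:
        fit += [s / ww] * k
    return fit

def fit_profile(ybar, w, profile):
    if profile == "flat":
        m = sum(wt * yt for wt, yt in zip(w, ybar)) / sum(w)
        return [m] * len(ybar)
    if profile == "increasing":
        return pava(ybar, w)
    if profile == "decreasing":  # antitonic fit = isotonic fit of the negated data
        return [-m for m in pava([-yt for yt in ybar], w)]

def assign_cluster(ybar, w, penalty=2.0):
    """Pick the candidate profile minimizing a penalized weighted RSS.
    The level-count penalty is a stand-in for the ORIC term p(lambda)."""
    best_profile, best_score = None, None
    for profile in ("flat", "increasing", "decreasing"):
        mu = fit_profile(ybar, w, profile)
        rss = sum(wt * (yt - mt) ** 2 for wt, yt, mt in zip(w, ybar, mu))
        levels = len(set(round(mt, 10) for mt in mu))  # distinct fitted levels
        score = rss + penalty * levels
        if best_score is None or score < best_score:
            best_profile, best_score = profile, score
    return best_profile

print(assign_cluster([0.0, 1.0, 2.0, 3.0], [3, 3, 3, 3]))  # increasing
```

A rising profile beats the flat model only when the fit improvement outweighs the penalty, which mirrors how an information criterion filters out 'flat' genes.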

In many situations, field researchers have good ideas about how to define the inequality profiles. For example, when studying gene expression patterns around disease onset, it is easy to postulate that gene expression tends to go up before the disease onset and then go down after certain treatment is given. The inequality constraints thus allow easy adoption of prior knowledge into the analysis, whereas existing methods usually cannot take such information into consideration. In addition, when inequality constraints are given, the order-restricted MLE has some optimal properties and universally dominates the unrestricted MLE [43]. Moreover, the candidate profiles are defined only on the ranks, rather than the numerical values, of the mean expression levels; hence our model specification is robust to small perturbations in the data. This feature is especially valuable in microarray studies, since it is well known that microarray data are quite noisy. Furthermore, the rank-based specification is often closer to the real meaning of 'coexpression', which refers to two genes' expression levels changing in the same direction rather than with the same magnitude.
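For instance, the prior knowledge that expression rises until a hypothesized peak time h and falls afterwards is exactly an umbrella-order constraint, and checking whether a profile is consistent with it needs only the ranks of the means. A toy illustration (the peak position h and the example profiles are hypothetical):

```python
def satisfies_umbrella(mu, h):
    """Check the up-then-down pattern mu_1 <= ... <= mu_h >= ... >= mu_T,
    e.g. expression rising until a hypothesized onset at time point h,
    then falling. Only the ordering of the values matters, not magnitude."""
    up = all(mu[t] <= mu[t + 1] for t in range(h - 1))
    down = all(mu[t] >= mu[t + 1] for t in range(h - 1, len(mu) - 1))
    return up and down

print(satisfies_umbrella([0.1, 0.8, 1.5, 0.9, 0.4], h=3))  # True
```

Scaling all values by a constant leaves the check unchanged, which is the robustness-to-magnitude property described above.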

In this paper, we present our algorithm in the context of clustering time-course microarray data. In fact, it can be applied to data from any experiment with ordered treatments or conditions, such as dose-response microarray experiments, where the dose levels provide the ordering.

Our current ORICC algorithm is based on the order-restricted MLE for gene expressions with a constant variance through time. It can be generalized to handle situations where the variances change over time or are themselves subject to order restrictions. In such situations, the estimation of mean expression levels outlined in this paper can be modified according to the approach in [44]. However, modifying the definition of the model complexity p(λ) in (8) accordingly remains a subject for future investigation.

Conclusion

We developed a new clustering algorithm, ORICC, for short time-course microarray data, by taking a model selection approach in order-restricted statistical inference. Our method groups genes into clusters represented by candidate profiles defined through inequalities among mean expression levels at different time points. A newly proposed information criterion function is used to determine the cluster assignment. Compared with a previous clustering method by Peddada et al. [1] that also uses order-restricted inference, our method is computationally much more efficient and provides an assessment of clustering reliability. Simulation studies indicate that the ORICC algorithm possesses good clustering accuracy when a moderate number of replicate arrays are available, and competes well with other existing clustering methods, such as [2] and [3]. Real data applications also indicate that our method can identify interesting genes that some correlation-based methods have failed to identify.

Methods

Order-restricted maximum likelihood estimation

Here, we briefly present the order-restricted MLE under the simple order (3) and umbrella order constraints (4), which are needed for the ORICC analysis in our simulation and real data example. For more general results, we refer to [29, 45]. Suppose that the y_{ti} are independent observations from normal distributions with unknown means μ_{t} and variances v_{t} for t = 1, ⋯, T and i = 1, ⋯, n_{t}. Then the data log-likelihood is

l(μ, v) = -(1/2) ∑_{t=1}^{T} [n_{t} log(2πv_{t}) + ∑_{i=1}^{n_{t}} (y_{ti} - μ_{t})²/v_{t}],   (11)

where ȳ_{t} = (1/n_{t}) ∑_{i=1}^{n_{t}} y_{ti}, μ = (μ_{1}, ⋯, μ_{T})^{T} and v = (v_{1}, ⋯, v_{T})^{T}. Let w_{t} = n_{t}/v_{t} and ȳ = (ȳ_{1}, ⋯, ȳ_{T})^{T}. When the variances v are assumed known, the order-restricted MLE of μ under each candidate order is as follows:

(1) μ_{1} ≤ μ_{2} ≤ ⋯ ≤ μ_{T}: the order-restricted MLE is given by the min-max formula

μ̂_{t} = max_{a ≤ t} min_{b ≥ t} (∑_{s=a}^{b} w_{s} ȳ_{s}) / (∑_{s=a}^{b} w_{s});

(2) μ_{1} ≥ μ_{2} ≥ ⋯ ≥ μ_{T}: by symmetry,

μ̂_{t} = min_{a ≤ t} max_{b ≥ t} (∑_{s=a}^{b} w_{s} ȳ_{s}) / (∑_{s=a}^{b} w_{s});

(3) μ_{1} ≤ ⋯ ≤ μ_{h} ≥ ⋯ ≥ μ_{T} and (4) μ_{1} ≥ ⋯ ≥ μ_{h} ≤ ⋯ ≤ μ_{T}: the umbrella orders are partial orders, and the order-restricted MLE is given by the general min-max formula μ̂_{t} = max_{L ∋ t} min_{U ∋ t} Av(L ∩ U), where L ranges over lower sets and U over upper sets of the order, and Av denotes the w-weighted average of the ȳ_{s} over a set [29].

The maximum log-likelihood under the λ-th candidate profile is then obtained by plugging in the corresponding order-restricted MLE into (11).
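Under the simple increasing order (1), the min-max characterization of the order-restricted MLE can be evaluated directly. The sketch below (an illustration with made-up numbers, not the authors' code) computes μ̂_{t} as the max over a ≤ t of the min over b ≥ t of the w-weighted average of ȳ_{a}, ⋯, ȳ_{b}:

```python
def minmax_mle(ybar, w):
    """Order-restricted MLE under mu_1 <= ... <= mu_T via the min-max
    formula: mu_t = max over a<=t of min over b>=t of Av(a, b)."""
    T = len(ybar)

    def av(a, b):
        # w-weighted average of ybar[a..b] (inclusive)
        num = sum(w[s] * ybar[s] for s in range(a, b + 1))
        den = sum(w[s] for s in range(a, b + 1))
        return num / den

    return [max(min(av(a, b) for b in range(t, T)) for a in range(t + 1))
            for t in range(T)]

# made-up time-point means ybar with weights w_t = n_t / v_t
ybar = [1.0, 0.2, 0.8, 2.0, 1.5]
w = [2.0, 3.0, 2.0, 4.0, 1.0]
print([round(m, 4) for m in minmax_mle(ybar, w)])  # [0.52, 0.52, 0.8, 1.9, 1.9]
```

Violating neighbors are effectively pooled into weighted averages, so the output is always nondecreasing; this O(T³) form is transparent but in practice the pool-adjacent-violators algorithm computes the same fit in O(T).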

If the variances v are unknown, we need to impose the assumption v_{1} = v_{2} = ⋯ = v_{T} = v. In microarray data, this assumption is reasonable after the data are properly normalized. The data log-likelihood then becomes

l(μ, v) = -(N/2) log(2πv) - (1/2v) ∑_{t=1}^{T} ∑_{i=1}^{n_{t}} (y_{ti} - μ_{t})²,   (12)

where N = ∑_{t=1}^{T} n_{t}.

Under this situation, the order-restricted MLE of μ is obtained as in the known-variance case, with the weights taken as w_{t} = n_{t} instead of w_{t} = n_{t}/v_{t}. The MLE of v is then

v̂ = (1/N) ∑_{t=1}^{T} ∑_{i=1}^{n_{t}} (y_{ti} - μ̂_{t})².

Again, the maximum log-likelihood for the λ-th candidate profile is then obtained by plugging in the corresponding order-restricted MLE into (12).
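The unknown-variance case can be sketched end to end: fit μ̂ with weights w_{t} = n_{t}, compute v̂ from the residuals, and plug both into (12). At the MLE the quadratic term in (12) reduces to N/2. The replicate counts and toy values below are made up for illustration:

```python
import math

def pava(ybar, w):
    """Weighted pool-adjacent-violators fit under the increasing order."""
    merged = []  # each block holds [sum of w*y, sum of w, size]
    for t in range(len(ybar)):
        merged.append([w[t] * ybar[t], w[t], 1])
        while len(merged) > 1 and merged[-2][0] / merged[-2][1] >= merged[-1][0] / merged[-1][1]:
            s, ww, k = merged.pop()
            merged[-1][0] += s; merged[-1][1] += ww; merged[-1][2] += k
    fit = []
    for s, ww, k in merged:
        fit += [s / ww] * k
    return fit

# replicate observations y[t][i] at T = 3 time points (toy numbers)
y = [[0.3, 0.1], [0.9, 1.1], [0.7, 0.9]]
n = [len(row) for row in y]
N = sum(n)
ybar = [sum(row) / len(row) for row in y]

mu_hat = pava(ybar, n)  # weights w_t = n_t in the unknown-variance case
v_hat = sum((x - m) ** 2 for row, m in zip(y, mu_hat) for x in row) / N
# equation (12) evaluated at the MLE; the residual term collapses to N/2
loglik = -N / 2 * math.log(2 * math.pi * v_hat) - N / 2

print([round(m, 3) for m in mu_hat], round(v_hat, 4))  # [0.2, 0.9, 0.9] 0.0167
print(round(loglik, 2))
```

Repeating this for each candidate order and plugging into (12) yields the per-profile maximum log-likelihoods compared by the information criterion.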

Acknowledgements

The authors gratefully thank the editor and the referees for their valuable comments and suggestions. The authors were partly supported by the National Science Foundation of China (No. 10871037) and the Science Foundation for Young Teachers of Northeast Normal University (No. 20050107).

Authors’ Affiliations

(1)

Key Laboratory for Applied Statistics of MOE and School of Mathematics and Statistics, Northeast Normal University

(2)

Department of Mathematics, Washington University in St Louis

References

Peddada S, Lobenhofer E, Li L, Afshari C, Weinberg C, Umbach D: Gene selection and clustering for time-course and dose-response microarray experiments using order-restricted inference. Bioinformatics 2003, 19:834–841.

Ernst J, Nau GJ, Bar-Joseph Z: Clustering short time series gene expression data. Bioinformatics 2005, 21(Suppl 1):i159–i168.

Wang L, Ramoni M, Sebastiani P: Clustering short gene expression profiles. Lecture Notes in Computer Science 2006, 3909:60–68.

Gasch A, Spellman P, Kao C, Carmel-Harel O, Eisen M, Storz G, Botstein D, Brown P: Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 2000, 11:4241–4257.

Guillemin K, Salama N, Tompkins L, Falkow S: Cag pathogenicity island-specific responses of gastric epithelial cells to Helicobacter pylori infection. Proc Natl Acad Sci USA 2002, 99:15136–15141.

Arbeitman M, Furlong E, Imam F, Johnson E, Null B, Baker B, Krasnow W, Scott M, Davis R, White K: Gene expression during the life cycle of Drosophila melanogaster. Science 2002, 298:2270–2275.

Storch K, Lipan O, Leykin I, Viswanathan N, Davis F, Wong W, Weitz C: Extensive and divergent circadian gene expression in liver and heart. Nature 2002, 418:73–83.

Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95:14863–14868.

Li L, Weinberg C, Darden T, Pedersen L: Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics 2001, 17:1131–1142.

Li L, Darden T, Weinberg C, Pedersen L: Gene assessment and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method. Comb Chem High T Scr 2001, 4:727–739.

Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown P, Herskowitz I: The transcriptional program of sporulation in budding yeast. Science 1998, 282:699–705.

Heyer L, Kruglyak S, Yooseph S: Exploring expression data: identification and analysis of coexpressed genes. Genome Res 1999, 9:1106–1115.

Bar-Joseph Z, Gerber G, Jaakkola T, Gifford D, Simon I: Continuous representations of time series gene expression data. J Comput Biol 2003, 34:341–356.

De Hoon M, Imoto S, Miyano S: Statistical analysis of a small set of time-ordered gene expression data using linear splines. Bioinformatics 2002, 18:1477–1485.

Ji X, Li-Ling J, Sun Z: Mining gene expression data using a novel approach based on hidden Markov models. FEBS Lett 2003, 542:125–131.

Kim J, Kim JH: Difference-based clustering of short time-course microarray data with replicates. BMC Bioinformatics 2007, 8:253.

Luan Y, Li H: Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics 2003, 19:474–482.

Luan Y, Li H: Model-based methods for identifying periodically regulated genes based on the time course microarray gene expression data. Bioinformatics 2004, 20:332–339.

Ma P, Castillo-Davis CI, Zhong W, Liu J: A data-driven clustering method for time course gene expression data. Nucleic Acids Res 2006, 34(4):1261–1269.

Ma P, Zhong W, Feng Y, Liu JS: Bayesian functional data clustering for temporal microarray data. Int J Plant Genomics 2008, 2008:231897.

Moller-Levet CS, Cho KH, Wolkenhauer O: Microarray data clustering based on temporal variation: FCV with TSD preclustering. Appl Bioinformatics 2003, 2:35–45.

Phang T, Neville M, Rudolph M, Hunter L: Trajectory clustering: a non-parametric method for grouping gene expression time courses, with applications to mammary development. Pac Symp Biocomput 2003, 351–362.

Ramoni M, Sebastiani P, Kohane I: Cluster analysis of gene expression dynamics. Proc Natl Acad Sci USA 2002, 99:9121–9126.

Song JJ, Lee HJ, Morris JS, Kang S: Clustering of time-course gene expression data using functional data analysis. Comput Biol Chem 2007, 31(4):265–274.

Peddada S, Harris S, Zajd J, Harvey E: ORIOGEN: order restricted inference for ordered gene expression data. Bioinformatics 2005, 21:3933–3934.

Wang L, Montano M, Rarick M, Sebastiani P: Conditional clustering of temporal expression profiles. BMC Bioinformatics 2008, 9:147.

Lobenhofer E, Bennett L, Cable P, Li L, Bushel P, Afshari C: Regulation of DNA replication fork genes by 17β-estradiol. Molec Endocrin 2002, 16:1215–1229.

Robertson T, Wright FT, Dykstra RL: Order Restricted Statistical Inference. John Wiley & Sons; 1988.

Akaike H: A new look at the statistical model identification. IEEE T Automat Contr 1974, 19:716–723.

Schwarz G: Estimating the dimension of a model. Ann Statist 1978, 6:461–464.

Anraku K: An information criterion for parameters under a simple order restriction. Biometrika 1999, 86:141–152.

Zhao L, Peng L: Model selection under order restriction. Stat & Probabil Lett 2002, 57:301–306.

Laurent B, Olivier G, Olivier M: Using repeated measurements to validate hierarchical gene clusters. Bioinformatics 2008, 24:682–688.

Gibbons F, Roth F: Judging the quality of gene expression-based clustering methods using gene annotation. Genome Res 2002, 12:1574–1581.

Kerr M, Churchill G: Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc Natl Acad Sci USA 2001, 98:8961–8965.

Smolkin M, Ghosh D: Cluster stability scores for microarray data in cancer studies. BMC Bioinformatics 2003, 4:36.

SAS Institute Inc: SAS 9.1.3 Help and Documentation. Cary, NC: SAS Institute Inc; 2000.

R Development Core Team: R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2005. [http://www.R-project.org]

Rand WM: Objective criteria for the evaluation of clustering methods. J Amer Statist Assoc 1971, 66:846–850.

Chen Y, Dougherty E, Bittner M: Ratio-based decisions and the quantitative analysis of cDNA microarray images. J Biomed Optics 1997, 2:364–374.

Hwang J, Peddada S: Confidence interval estimation subject to order restrictions. Ann Statist 1994, 22:67–93.

Shi N: Maximum likelihood estimation of means and variances from normal populations under simultaneous order restrictions. J Mult Anal 1994, 50:282–293.

Shi N, Gao W, Zhang B: One sided estimation and testing problems for location models from grouped samples. Commun Statist-Simula 2001, 30:885–898.

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.