Quadratic regression analysis for gene discovery and pattern recognition for non-cyclic short time-course microarray experiments

Background Cluster analyses are used to analyze microarray time-course data for gene discovery and pattern recognition. However, in general, these methods do not take advantage of the fact that time is a continuous variable, and existing clustering methods often group biologically unrelated genes together. Results We propose a quadratic regression method for identification of differentially expressed genes and classification of genes based on their temporal expression profiles for non-cyclic short time-course microarray data. This method treats time as a continuous variable, therefore preserves actual time information. We applied this method to a microarray time-course study of gene expression at short time intervals following deafferentation of olfactory receptor neurons. Nine regression patterns have been identified and shown to fit gene expression profiles better than k-means clusters. EASE analysis identified over-represented functional groups in each regression pattern and each k-means cluster, which further demonstrated that the regression method provided more biologically meaningful classifications of gene expression profiles than the k-means clustering method. Comparison with Peddada et al.'s order-restricted inference method showed that our method provides a different perspective on the temporal gene profiles. Reliability study indicates that regression patterns have the highest reliabilities. Conclusion Our results demonstrate that the proposed quadratic regression method improves gene discovery and pattern recognition for non-cyclic short time-course microarray data. With a freely accessible Excel macro, investigators can readily apply this method to their microarray data.


Background
Microarray time-course experiments allow researchers to explore the temporal expression profiles for thousands of genes simultaneously. The premise for pattern analysis is that genes sharing similar expression profiles might be functionally related or co-regulated [1]. Due to the large number of genes involved and the complexity of gene regulatory networks, clustering analyses are popular for analyzing microarray time-course data. Heuristic-based cluster analyses group genes based on distance measures; the most commonly used methods include hierarchical clustering [2], k-means clustering [3], self-organizing maps [4], and support vector machines [5]. Due to the lack of statistical properties of these heuristic-based clustering methods, statistical models, especially analysis of variance (ANOVA) models and mixed models are often implemented as a precursor to clustering to ensure the genes used for clustering are statistically meaningful [6,7]. Only genes identified to be significantly regulated by statistical models are used for further clustering. Fitting statistical models prior to clustering usually dramatically reduces the number of genes used for clustering, which in general will improve the performance of the clustering method. An alternative way of clustering is statistical model-based clustering methods, which assume that the data is from a mixture of probability distributions such as multivariate normal distributions and describe each cluster using a probabilistic model [8,9].
In microarray time-course studies, time dependency of gene expression levels is usually of primary interest. Since time can affect the gene expression levels, it is important to preserve time information in time-course data analysis. However, most methods for analyzing microarray timecourse data treat time as a nominal variable rather than a continuous variable, and thus ignore the actual times at which these points were sampled. Peddada et al. (2003) proposed a method for gene selection and clustering using order-restricted inference, which preserves the ordering of time but treats time as nominal [1]. Recently, a number of algorithms treating time as a continuous variable have been introduced. Xu et al. (2002) applied a piecewise regression model to identify differentially expressed genes [10]. Both Luan and Li (2003) and Bar-Joseph et al (2003) proposed B-splines based approaches [11,12], which are appropriate for microarray data with relatively long timecourse, but their application to short time-course data is questionable. New methods for analyzing short timecourse microarray data are needed [13].
In this paper, we propose a model-based approach, step down quadratic regression, for gene identification and pattern recognition in non-cyclic short time-course microarray data. This approach takes into account time information because time is treated as a continuous variable. It is performed by initially fitting a quadratic regression model to each gene; a linear regression model will be fit to the gene if the quadratic term is determined to have no statistically significant relationship with time. Significance of gene differential expression and classification of gene expression patterns can be determined based on rel-evant F-statistics and least squares estimates. Major advantages of our approach are that it not only preserves the ordering of time but also utilizes the actual times at which they were sampled; it identifies differentially expressed genes and classifies these genes based on their temporal expression profiles; and the temporal expression patterns discovered are readily understandable and biologically meaningful. A free Excel macro for applying this method is available at http://www.mc.uky.edu/UKMicroArray/ bioinformatics.htm [14]. The proposed quadratic regression method is applied to a microarray time-course study of olfactory receptor neurons [15]. Biologically meaningful temporal expression patterns have been obtained and shown to be more effective classifications than ANOVAprotected k-means clusters. Comparison with Peddada et al.'s order-restricted inference method [1] showed that our method provides a different and interesting insight into the temporal gene profiles. Reliabilities of the results from all 3 methods were assessed using a bootstrap method [16] and regression patterns were shown to have the highest reliabilities.

Step-down quadratic regression
We propose a step-down quadratic regression method for gene discovery and pattern recognition for non-cyclic short time-course microarray experiment. The first step is to fit the following quadratic regression model to the j th gene: (1) where y ij denotes the expression of the j th gene at the i th replication, x denotes time, β 0j is the mean expression of the j th gene at x = 0, β 1j is the linear effect parameter of the j th gene, β 2j is the quadratic effect parameter of the j th gene, and, ε ij is the random error associated with the expression of the j th gene at the i th replication and is assumed to be independently distributed normal with mean 0 and variance . Two levels of significance, α 0 and α 1 , need to be pre-specified, where α 0 to is recommended to be small to reduce the false positive rate in the gene discovery and α 1 less stringent to control pattern classification. α 0 could be chosen using various multiple testing p-value adjustment procedures, for example, False Discovery Rate (FDR) [17]. The temporal gene expression patterns can be determined as follows: 1. If overall model (1) p-value >α 0 , the j th gene is considered to have no significant differential expression over time. The expression pattern of the gene is "flat".  Flow chart of the quadratic regression method Figure 1 Flow chart of the quadratic regression method. The gene selection and pattern classification procedure of our quadratic regression method. y ij is the expression level; x is time; β 0j , β 1j , and β 2j are the parameters of intercept, linear effect, and quadratic effect, respectively; ε ij is the random error. Among the 9 regression patterns, FLAT stands for no statistically significant differential expression over time; LU stands for linear up; LD stands for linear down; QC stands for quadratic concave; QV stands for quadratic convex; QLCU stands for quadratic linear concave up; QLCD stands for quadratic linear concave down; QLVU stands for quadratic linear convex up; QLVD stands for quadratic linear convex down.

SS linear df SS residual df
2. If overall model (1) p-value ≤ α 0 , the j th gene will be considered to have significant differential expression over time. The patterns are then determined based on the pvalues obtained from F tests (Table 1).
a. If both p-value of quadratic effect ≤ α 1 and p-value of linear effect ≤ α 1 , the j th gene is considered to be significant in both the quadratic and linear terms. The expression pattern of the gene is "quadratic-linear".
b. If p-value of quadratic effect ≤ α 1 and p-value of linear effect >α 1 , the j th gene is considered to be significant only in the quadratic term. The expression pattern of the gene is "quadratic".
c. If p-value of quadratic effect >α 1 , the j th gene is considered to be non-significant in the quadratic term. The quadratic term will be dropped and a linear regression model will be fitted to the gene: From fitting model (2), • If p-value of linear effect ≤ α 1 , the j th gene is considered to be significant in the linear term. The expression pattern of the gene is "linear".
• If p-value of linear effect >α 1 , the j th gene is considered to be non-significant in the linear term. The expression pattern of the gene is "flat". The four expression patterns described above can be furthered classified into 9 patterns according to the up/down regulation of the gene expression based on the leastsquares estimates and and the predicted signals (Table 2). A flow chart for the above procedure is shown in Figure 1. This procedure can be easily applied using the Excel macro available at http://www.mc.uky.edu/ UKMicroArray/bioinformatics.htm [14].

Application of the quadratic regression method
Normality test based on Shapiro-Wilk statistics [18] suggested that most of the 3834 present genes in the olfactory receptor neuron data do not have a significant departure from the normal distribution ( Figure 2). Therefore the quadratic regression method with normality assumption was applied to the data of 3834 present genes (Figure 3), where α 0 was chosen to be 0.01 and α 1 to be 0.05. 798 genes were determined to have significant differential expression over time at level 0.01. Examples of 9 regression patterns are shown in Figure 4.

Comparison with Peddada et al.'s method
Peddada et al.'s method [1] was applied to the expression data of 3834 present genes with 8 pre-specified profiles: monotone increasing (MI); monotone decreasing (MD);    Table 3. We observe that the majority of genes in MI are in LU, similarly for MD and LD, UD2 and QLCD, and DU2 and QLVU. However, each of the Peddada et al.'s profiles contains a mixture of regression patterns, and vice versa. This is reasonable because even though both methods perform gene selection and classification, they are aimed at different aspects of the temporal profiles. For example, Peddada et al.'s MI profile contains regression patterns LU, QLCU and QLVU. Although the gene expression level is increasing monotonically over time, the regression method gives more information on how it is increased: constantly (LU, Figure  5a, Gdp2), increases faster then slower (QLCU, Figure 5b

Comparison with ANOVA-protected k-means clustering
ANOVA-protected k-means clustering was applied to the expression signals of 3834 present genes. Out of 3834 present genes, 770 were identified to be differentially expressed over time by one way ANOVA (overall model pvalue ≤ 0.01). These 770 genes were used for classification Figure 2 Histogram of the Shapiro-Wilk p-values for normality test. The Shapiro-Wilk statistic was applied to the olfactory receptor neuron data for normality test. The horizontal axis is the Shapiro-Wilk p-values, and the vertical axis is the corresponding percentages. This histogram indicates that most of the 3834 present genes do not have a significant departure from the normality.

Histogram of the Shapiro-Wilk p-values for normality test
Flow chart of the filtering steps and quadratic regression analysis on the olfactory receptor neuron data Figure 3 Flow chart of the filtering steps and quadratic regression analysis on the olfactory receptor neuron data. Our regression method is applied to the olfactory receptor neuron data. At first, Affymetrix quality controls, expressed sequence tags, and genes which have "A" calls across all chips were removed from the analysis. Nine regression patterns were identified among 3834 remaining genes (colored in red). FLAT stands for no statistically significant differential expression over time detected by the regression method; LU stands for linear up regulated regression pattern; LD stands for linear down regulated regression pattern; QC stands for quadratic concave regulated regression pattern; QV stands for quadratic convex regulated regression pattern; QLCU stands for quadratic linear concave up regulated regression pattern; QLCD stands for quadratic linear concave down regulated regression pattern; QLVU stands for quadratic linear convex up regulated regression pattern; QLVD stands for quadratic linear convex down regulated regression pattern.
by k-means clustering with k = 9 and the distance measure being Pearson correlation coefficient (Table 4).
In order to make the regression patterns comparable with the k-means clusters, the quadratic regression method was applied to the 770 ANOVA significant genes. Table 4 shows the number of genes in common when comparing each regression pattern with each k-means cluster. An example of a good match between regression patterns and k-means clusters is the QLCD regression pattern and k-An illustration of the nine temporal expression patterns identified by the quadratic regression method Figure 4 An illustration of the nine temporal expression patterns identified by the quadratic regression method. The horizontal axis is the log transformation of time. The vertical axis is the hybridization signals obtained from the microarrays. The blue dots are the hybridization signals. The red line or curve is the fitted regression pattern. FLAT stands for no statistically significant differential expression over time detected by the regression method; LU stands for linear up regulated regression pattern; LD stands for linear down regulated regression pattern; QC stands for quadratic concave regulated regression pattern; QV stands for quadratic convex regulated regression pattern; QLCU stands for quadratic linear concave up regulated regression pattern; QLCD stands for quadratic linear concave down regulated regression pattern; QLVU stands for quadratic linear convex up regulated regression pattern; QLVD stands for quadratic linear convex down regulated regression pattern. The corresponding gene symbols are: FLAT. Cldn11; LU. Gba; LD. Col6a3; QC. Rab18; QV. unknown; QLCU. Psmb6; QLCD. Hnrpa2b1; QLVU. Tyrobp; QLVD. Acvr2b. means cluster K1. However, in most cases, k-means clusters contain a mixture of regression patterns and the regression patterns are separated into different k-means clusters. For example, genes that have the LU regression pattern are split into 4 k-means clusters (Figure 6a, Bzrp; 6b, Aqp1; 6c, Prg; 6d, Hnrpl). The similarity of the temporal expression profiles in Figure 6 indicates that it might be more appropriate to classify these genes into the same group, which occurs using the proposed regression method. Examples in Figure 7 show that some k-means clusters are also mixtures of expression profiles in terms of the mean signals (green lines). For example, a down-updown-up pattern (down-regulated at the second time point, up-regulated at the third time point, etc, in terms of mean signals) appeared in both k-means clusters K5 and K6 (green lines), but are identified to have QLVU regression pattern (Figure 7c, Clu; and 7d, D17H6S56E-5); similarly see Figure 7a and 7b (a, Sfpi1; and b, Anxa2). Once again, the regression method provides better classification. Figure 8 is an example of genes with similar expression patterns but different initial starting time of the differential expression (Figure 8a, Psmb6; 8b, Adora2b). Adora2b clearly starts differential expression later than Psmb6 (see the blue dots in Figure 8). After the initial starting point (first time point for Psmb6 and second time point for Adora2b), these two genes show similar upward regulation. These two genes were classified into the same regression group, but in different k-means clusters. Based on the above analysis, our regression method is demonstrated to be more appropriate for the classification of temporal gene expression profiles than k-means method.

EASE functional analysis on regression patterns and kmeans clusters
To further explore the effectiveness of the regression method on gene classification, EASE (Expression Analysis Systematic Explorer) software was used to examine the potential relationship between the biological functions of the genes and their expression patterns [19]. EASE calculates EASE scores (Jackknife one-sided Fisher exact p-values) to identify over-represented gene categories within lists of genes. EASE analysis was applied to each of the 9 regression patterns and 9 k-means clusters that were obtained from the classification of 770 ANOVA significant genes ( Table 4). The results are summarized (see Additional file 1), with part of the information shown in Tables 5 and 6. The EASE analysis demonstrates that the proposed regression method is more effective for gene classification than the k-means clustering method. Almost all of the regression patterns contain genes mainly from one biological process. For example, LU has 9 over-represented gene categories, 8 of which are involved in immune regulation ( Table 5). The majority of the LU and QLVU gene categories are in the immune regulation category. This suggests that there exist multiple regulatory mechanisms within the immune regulation. The immune regulation in QLVU appears to be a more complex regulatory mechanism for the initial up-regulation of these genes due to the slow upward regulation at early time points of this regression pattern (Figure 5c). The EASE results for the k-means clusters shows that the over-represented gene categories of most k-means clusters are involved in more than one biological process, for example, k-means cluster K5 contains 9 over-represented gene categories, 3 involved in immune regulation, 2 involved in cell death, etc. Notice that the immune regulation category is represented in 4 k-means clusters, which suggests that the immune regulation category is more consolidated in regression patterns than in k-means clusters (Table 6). Also, by comparing EASE scores in Tables 5 and 6, one can see that the over-represented gene categories in the regression patterns have, in general, smaller EASE scores than  those in the k-means clusters, which further indicates the greater effectiveness of the regression method in pattern classification. Kerr and Churchill (2001) introduced a bootstrap technique to assess the stability of clustering results [16]. We applied the same idea here to assess the reliability of regression patterns, Peddada et al.'s profiles, and k-means clusters. All 3 pattern classification methods were performed on the expression data of 770 ANOVA significant genes to make the results comparable. The reliability curves show that regression patterns have the highest reliability, and k-means clusters have the lowest reliability ( Figure 9). This suggests that the regression method provides relatively more stable pattern classifications.

Simulation study
We investigated the false positive rate (gene specific) of our method via a simulation study. The data were generated randomly from N(0,1), containing expression signals of 10000 "null" genes (no gene differentially expressed over time), with 5 time points and 3 replications per time point per gene. 50 of such data were generated. The regression approach was applied to each gene in each simulated data at α 0 = 0.01 and the numbers of significant genes in each of the 50 data were obtained. The average proportion of significance (average false positive rate) is 1.01% with standard deviation 0.01%. This demonstrates that the false positive rate of the regression method is accurate because 1% of 10000 genes would be expected to be significant at 0.01 level by chance. The false positive rates of the regression patterns LU, LD, QC, QV are all approximately equal to 1/6 of the average false positives, and those of QLCU, QLCD, QLVU, and QLVD are all approximately equal to 1/12 of the average false positives.

Discussion
The proposed step-down quadratic regression method is an effective statistical approach for gene discovery and pattern recognition. It utilizes the actual time information, and provides biologically meaningful classification of temporal gene expression profiles. Furthermore, it does not require replication at each time point, which ANOVAtype methods do require. Also, this method can identify genes with subtle changes over time and therefore discover genes that might be undetectable by other methods, eg, ANOVA-type methods. However, there are several limitations to this method. Firstly, it is designed to fit timecourse data with a small number of time points. We recommend this method when there are 4 to 10 time points in the data. For an experiment with more time points, spline-type methods [11,12] could be a possible choice; for an experiment with 2 or 3 time points, ANOVA-type method is recommended. Secondly, the 9 regression patterns are rather limited considering the complexity of gene regulatory networks. For example, certain proportion of genes show cubic, "M", and "W" shaped patterns in 211 regression FLAT genes which are ANOVA significant (Table 4). These patterns could be caused by random chance, but they could also be real patterns. Fitting a higher order polynomial regression model may discover these types of genes profiles. Theoretically, one could fit a 4 th -order polynomial regression model to this data (the highest order of the polynomial one can fit is the number of time points minus one). The model with 4 th -order polynomial will work similarly to connecting the mean at each time point, therefore will provide a good fit to the data with smallest R 2 and minimum Mean Squared Error, compared with lower-order polynomials. However, the purpose of pattern analysis is to cluster the data instead of fitting models, so the quadratic fit is useful even though the goodness of fit may not be great. Also, the use of high-order polynomials (higher than the secondorder) should be avoided if possible [20], particularly in cases such as this where the regression coefficients are used primarily for classification. Another issue is the transformation of the experimental time. Transformation should be considered when the sampling time is unequally spaced. The choice for the type of transformation (log-transformation, square-root transformation, etc) is not critical because the resulting pattern classification will in general not be impacted.
In the reliability curves, at 95% reliability, regression patterns, Peddada et al.'s profiles, and k-means clusters have 33%, 12%, and 0% of genes, respectively; and at 80% reliability, the percentage of genes are 55%, 32%, and 0%, respectively ( Figure 9). Even though the regression pat-  terns have the highest reliability, only 33% of genes have 95% reliabilities. We examined the overall model (1) pvalues of 770 genes by the regression method and found that genes that have the smallest overall model (1) p-values all have 95% reliabilities. This suggests that we could reduce the level of significance α 0 to increase the stability of regression patterns. α 0 could be reduced using various multiple testing p-value adjustment procedures, for example, Westfall and Young's step down method [21], and False Discovery Rate (FDR) [17]. Application of the FDR method can be done as follows (assuming FDR is control-  Figure 5a, b, and 5c by their method. Another fact is that Peddada et al.'s method provides exactly the location of the maximum/minimum, whereas our method provides the neighborhood of the location of the maximum/minimum. Furthermore, their method is based on bootstrap, which is computationally intensive. The result of their method, for example, the reliability curves, might be improved by applying more bootstrap, which is 4000 in this paper due to the computational difficulties and time constraints. Moreover, their method depends on the ordering of time but not the actual time at which the samples were taken, whereas the regression method accounts for both.
K-means is an iterative clustering algorithm [22]. The first step of this method is to randomly assign the data points Examples of genes with similar expression patterns in terms of mean signal and regression to the k clusters. Next, the distance to the center of each cluster is calculated for each data point, and the data point is moved into the closest cluster. This step will be repeated until no data point is moving from one cluster to another. In k-means, the number of clusters, k, needs to be prespecified. Researchers usually choose several different k and find the one which has the most biologically meaningful clusters. There are methods of finding the "optimal" k, for example, Bayesian Information Criterion [23]. In this paper, k was arbitrarily chosen to be 9. Since the k-means clustering does not perform well (Table 4; Figures 6, 7, and 8), we investigated different choice of k based on the Bayesian Information Criterion and identified that the "optimal" k is 15. However, as we examined these 15 k-means clusters, the pattern classification does not seem to be improved, the same problem exists as with k = 9. For example, Prom1, Clu, and D17H6S56E-5 ( Figure  5c, Figure 7c and 7d) all have similar temporal profiles and are all classified to be QLVU, but they were separated into 3 of the 15 k-means clusters. This could be related to the distance measure used (Pearson correlation coefficient). As we discovered, genes in the same cluster do not necessarily have higher correlation than genes in different clusters. For example, Sfpi1 and Anxa2 (Figure 7a and 7b) are highly correlated (Pearson correlation coefficient is 0.9934) and their expression patterns are similar, but they are in different k-means clusters. A possible reason might be that the time-course in olfactory receptor neuron data is too short for correlation to perform well. Even though there are a total of 15 observations for each gene, correlation calculations are based on the 5 mean signals, which could be too few to describe the relationship between temporal profiles. There is also concern about using correlation as the distance measure. A large correlation coefficient does not necessarily indicate two similarly shaped profiles, nor does a small correlation coefficient necessarily indicate differently shaped profiles [1].
A number of regression algorithms have been proposed recently, which treat time as a continuous variable. Several of them are based on cubic B-splines [11,12]. B-splines are defined as a linear combination of a set of basis polynomials. In order to fit cubic B-splines to time-course data, the entire duration of experimental time needs to be divided into several segments by "knots" (the point to separate segments), and each segment will be fit by cubic polynomial. The successful application of these methods to microarray time-course data depends heavily on having a relatively large numbers of time points. The B-spline based methods will not be effective when there are a small number of time points in the time-course experiment [13]. For a data with 5 time points, cubic B-spline type methods would not be appropriate because it is recommended that there should be at least 4 or 5 experimental time points in each segment [24]. Xu et al used a piecewise Examples of genes with the same regression pattern but different onset of differential expression Figure 8 Examples of genes with the same regression pattern but different onset of differential expression. a. Psmb6 is an example in k-means cluster K8; b. Adora2b is an example in k-means cluster K5. Adora2b clearly starts differential expression later than Psmb6. After the onset point (first time point for Psmb6 and second time point for Adora2b), these two genes show similar upward regulation. The regression method classifies these two genes into the same group (QLCU regression pattern), but k-means clustering method does not. The horizontal axis is the log transformation of time. The blue dots are the signals. The green line is the connection of the mean signal at each time point. The red curve is the QLCU regression pattern.
quadratic regression model to identify differentially expressed genes [10]. In their approach, expression levels at 0 hour and 2 hours after treatment are fit differently from the rest of time points after treatment. Although appropriate for their data, their method cannot be applied to the dataset used in this paper.
The quadratic regression method that we applied to the olfactory receptor neuron data relies on the normality assumption. This is supported by the result of the Shapiro-Wilk normality test, which indicates that most of the genes used for the analysis follow a normal distribution.
This might be due to the fact that we removed genes that are called "A" (absent) by Affymetrix across all chips. "A" calls are often assigned to low expression signals, which tend to be non-normal in general. Therefore removing genes with a high proportion of "A" calls may reduce the possibility of violation of the normality assumption, which will then make the test based on distributional assumption more likely to be valid, and thus avoid computational intensive resampling procedures, for example, bootstrap and permutation. If desired, experimenters could also try various types of data transformation to make their data closer to normal when the data are "Reg. Patterns" stands for the regression patterns identified by the proposed regression method; in the "Gene Category" column, the gene categories are further summarized to broader categories (in bold); "Pop. Total" stands for the number of total input genes (770) that are contained in EASE database, the remaining 71 genes do not have a biological function identified by EASE; "Pop. Hits" stands for the number of genes in "Pop. Total" that are classified into each gene category; "List Total" stands for the number of genes in "Pop. Total" that are classified into each regression pattern; "List Hits" stands for the number of genes in "List Total" that are classified into each gene category.
shown to have large departure from normality. However, the log transformation performed on the olfactory receptor neuron data was not to reduce the possible non-normality, but solely to make a fair comparison of our regression method and k-means method because it is the default transformation in Genespring. When the normality assumption (ε ij ~ N(0, )) does not hold, the bootstrap method [25] can be used to avoid the distributional assumption. For an experiment with m genes, T time points, and r replications per time point, the bootstrap procedure can be performed in the following way: form the data into a matrix of m × rT, each column in the matrix contains expressions of m genes in one chip and each row contains rT expressions of one gene; randomly draw rT columns with replacement to form a bootstrap sample; apply step-down quadratic regression procedure to the bootstrap sample to obtain F statistics from F tests; repeat the above steps 1000 times to form a bootstrap F distribution for each gene; claim a gene to be significance at level of α if its observed F statistics is greater than the upper (α / 2) th percentile or less than the lower (α / 2) th percentile of its bootstrap F distribution. One concern about using bootstrap here is that the bootstrap F distribution might be too discrete due to the small number of time points. However, the fact that we are bootstrapping both the explanatory and response variables mitigates this issue by using all the data points, not just the time points. Additionally, in a small simulation study, we observed that the bootstrap F distribution is rather smooth (result not shown).

Conclusion
The proposed step-down quadratic regression approach is shown to be effective for gene discovery and pattern recognition for non-cyclic short time-course microarray experiment. Major advantages of this method are that it preserves the actual time information, and provides a useful tool for gene identification and pattern recognition. The nine regression patterns, obtained when applied to the olfactory receptor neuron data, are shown to be more reasonable classifications compared to ANOVA-protected k-means clustering method. EASE analysis further showed that our regression patterns are more biologically meaningful than the k-means clusters. Comparison with Peddada et al.'s method showed that our method provides a different perspective on the temporal gene profiles. Reliability study indicates that regression patterns are most reliable. In conclusion, this method should improve gene discovery and pattern recognition for microarray time-course data. With the freely accessible Excel macro, investigators can readily apply this method to their research data.

ANOVA-protected k-means clustering
One-way ANOVA model y ijk = µ j + τ k + ε ijk was fitted to each gene in SAS v9, where y ijk denotes the gene expression level of the j th gene at the i th replication of the k th time point, µ j denotes the overall mean signal of the j th gene, τ k denotes the effect of k th time point, ε ijk denotes the random error associated with the i th replication at the k th time point of the j th gene and is assumed to be independently distributed normal with mean 0 and variance . Genes that have overall ANOVA model p-values ≤ α 0 will be used for k-means clustering. K-means clustering was performed in Genespring V6.1 (Silicon Genetics. Redwood City, CA) with k = 9. The similarity measure was chosen to be Pearson correlation coefficient, which was calculated from vectors of length 5 containing mean signals of 3 replications at each of the 5 time points. 500 additional random clusters were tested and the best clusters were selected by the software.

EASE functional analysis
EASE software was used to identify the over-represented categories of genes [19]. Gene Ontology Biological Process was chosen as the categorization system in EASE analysis. A functional gene category with an EASE score of less than 0.05 is considered to be over-represented. The EASE software is available at: http://david.niaid.nih.gov/david/ ease.htm.

Data description
The data used here are from a study of olfactory receptor neurons [15]. The goal is to investigate the induction of gene regulation at short time intervals following deafferentation of olfactory receptor neurons by target ablation at 2, 8, 16, and 48 hrs compared with the sham control. Total RNA was isolated from 3 male littermate mice per time point. Following hybridization with Affymetrix GeneChips MGU74Av2, 3 chips per time point, the signals were generated by GeneChip Analysis Suite v5.0. The data was filtered before statistical tests were performed. First, 66 Affymetrix quality control probesets and 6432 expressed sequence tags were removed. Next, the absent call (A) provided by Affymetrix was considered. 2156 genes that are called "A" across all 15 chips were removed from the data. The remaining 3834 present genes were used for the regression analysis ( Figure 3). The hybridization signals of these 3834 genes were log-transformed in Genespring. Because the time points in this