Genome-scale cluster analysis of replicated microarrays using shrinkage correlation coefficient

Background Currently, clustering with some form of correlation coefficient as the gene similarity metric has become a popular method for profiling genomic data. The Pearson correlation coefficient and the standard deviation (SD)-weighted correlation coefficient are the two most widely-used correlations as the similarity metrics in clustering microarray data. However, these two correlations are not optimal for analyzing replicated microarray data generated by most laboratories. An effective correlation coefficient is needed to provide statistically sufficient analysis of replicated microarray data. Results In this study, we describe a novel correlation coefficient, shrinkage correlation coefficient (SCC), that fully exploits the similarity between the replicated microarray experimental samples. The methodology considers both the number of replicates and the variance within each experimental group in clustering expression data, and provides a robust statistical estimation of the error of replicated microarray data. The value of SCC is revealed by its comparison with two other correlation coefficients that are currently the most widely-used (Pearson correlation coefficient and SD-weighted correlation coefficient) using statistical measures on both synthetic expression data as well as real gene expression data from Saccharomyces cerevisiae. Two leading clustering methods, hierarchical and k-means clustering were applied for the comparison. The comparison indicated that using SCC achieves better clustering performance. Applying SCC-based hierarchical clustering to the replicated microarray data obtained from germinating spores of the fern Ceratopteris richardii, we discovered two clusters of genes with shared expression patterns during spore germination. Functional analysis suggested that some of the genetic mechanisms that control germination in such diverse plant lineages as mosses and angiosperms are also conserved among ferns. Conclusion This study shows that SCC is an alternative to the Pearson correlation coefficient and the SD-weighted correlation coefficient, and is particularly useful for clustering replicated microarray data. This computational approach should be generally useful for proteomic data or other high-throughput analysis methodology.


Background
Advances in high-throughput technologies, such as DNA microarrays and genome sequencing, have enabled the large-scale exploration of the genome in a way that is systematic, comprehensive, and quantitative. Gene expression profiling has revealed valuable discoveries in basic biological research, pharmacology, and medicine. Currently, clustering has become a popular method for profiling genomic data by which clusters are formed based on the similarity between data points. The points in each specific cluster are similar from each other but different from points outside this cluster.
Clustering methods depend on the measure of pair-wise similarity, the similarity between two points. One commonly used similarity metric is the correlation coefficient between the profiles of the two points, and another commonly used similarity metric is the Euclidean distance. The measure of similarity based on correlation coefficients captures the similarity in shape or pattern of the profiles, and it does not account for the amplitude of the profiles. Scaled versions of any two profiles will have the same correlation coefficient since that of the pair of original profiles, i.e., the amplitude of the profiles, does not affect the correlation coefficient as long as the wave form (shape or pattern) of the profiles is maintained. If the similarity is measured by distance, the amplitude of the profiles does matter. Two profiles with the same pattern but very different amplitudes can have an ideal similarity (essentially the same) when measured by the correlation coefficient, but a very low similarity when measured by the Euclidean distance due to the large difference in amplitudes.
In this study, we focus on clustering based on similarity measured by the correlation coefficient where two genes with similar expression patterns will be considered to be similar regardless of the difference in their amplitudes. Using the Pearson correlation coefficient as the similarity metric, Eisen et al. [1] analyzed one of the first genomewide microarray data sets for the budding yeast Saccharomyces cerevisiae. When calculating the gene expression similarity with the Pearson correlation coefficient [1], many studies only averaged the replicates in each experiment [2][3][4] without taking into account the error in the replicates. Instead of averaging over the replicates, Hughes et al. [5] defined an error model which uses a standard deviation (SD)-weighted correlation coefficient (SDCC) to down-weight the gene expression values with high error estimates in their clustering analysis and classified the functions of previously uncharacterized genes by comparing the expression profiles of mutant cells from their S. cerevisiae compendium. Using the same correlation, van't Veer et al. [6] derived a breast cancer prognosis from the gene expression profile of a primary tumor. In addition, Yeung et al. [7] showed that the SD-weighted correlation coefficient improves cluster accuracy and stability to a greater extent than the Pearson correlation coefficient with averaging replicates.
However, the SD-weighted correlation coefficient [5][6][7] also has disadvantages. The error of measurement is estimated directly by the standard deviation of the replicates, and such an estimate of error can be very inaccurate when the number of replicates is small relative to the number of objects (in this study, genes) [8]. Unfortunately, most of the microarray experiments performed by an academic laboratory employ only small (usually less than 10) number of replicates due to the experimental cost and time concerns, and such a replicate number is much smaller than the amount of genes profiled (usually in the thousands or more). The "Stein phenomenon" [9] suggests that an effective statistical model is needed to deal with replicated microarray data. Here we provide a shrinkage correlation coefficient that considers both the number of replicates and the variance within each experimental group and fully exploits the similarity between the replicated microarray experimental samples. The shrinkage concept is widely accepted as a method to improve the estimation of correlation when the sample size is small [9][10][11], which is the primary inspiration for this work. We first describe our shrinkage correlation coefficient in generality, and then demonstrate the superiority of our correlation compared to the other two most widely-used correlation coefficients (Pearson correlation coefficient and SD-weighted correlation coefficient) using hierarchical clustering and k-means clustering. Finally we use a recently published analysis of the gene expression changes that occur during the germination of spores of the fern Ceratopteris richardii [12] as an example of this shrinkage correlation coefficient.
Other clustering techniques have been used for gene expression analysis. For example, using an analysis of variance (ANOVA) model, Kerr and Churchill [13] applied bootstrapping on a publicly available data set to assess the reliability of clustering results. Ng et al. [14] proposed a linear mixed-effects model (LMM) as an extension of the normal mixture model to incorporate covariate information into the clustering process. Tjaden [15] developed a clustering method that is similar to k-means clustering. Using a Bayesian infinite mixture model (IMM), Medvedovic and Sivaganesan [16,17] developed a clustering procedure to incorporate the information on experimental variability into gene expression profiling. IMM measures the similarity using a probabilistic model of the data, and the error between repeated measurements is inherently represented in the model. Instead of forming final clusters directly, a posterior distribution of the possible clusters is generated by Gibbs sampling first. Then the similarity between two data points is measured by the probability of the pair of points being in the same cluster inferred by the posterior distribution of the clustering result. With this measured similarity, final clusters are formed by applying the classical hierarchical clustering. Conceptually, IMM considers the magnitude (rather than pattern) of gene expression profiles and is similar to the Euclidean distance, as noted by Tjaden [15]. In contrast, as a correlation coefficient, our method is based on the pattern of gene expression profiles and is apparently different from IMM when applied to clustering methods.
We stress that our shrinkage correlation coefficient is a correlation instead of a clustering method, and it could be used as a similarity metric in many clustering methods or other circumstances. Therefore, we compare our correlation with two existing widely-used correlation coefficients (Pearson correlation coefficient and SD-weighted correlation coefficient) using hierarchical and k-means clustering. We present our method for better estimating the error in replicated microarrays that cannot be adequately estimated by other correlation coefficients when applying the existing popular clustering methods.
In this paper, we propose a novel correlation coefficient, shrinkage correlation coefficient (SCC). The comparison of SCC with other two most widely-used correlation coefficients using hierachical and k-means clustering shows that SCC is an alternative to the Pearson correlation coefficient and the SD-weighted correlation coefficient and achieves better clustering performance on both synthetic expression data as well as real gene expression data from Saccharomyces cerevisiae. We use SCC-based two-dimensional hierarchical clustering to analyze the replicated microarray data of Salmi et al. [12], revealing the novel finding that there are two distinct clusters of genes with shared expression patterns during the early stages of germination of C. richardii spores. Findings from this gene expression analysis suggest that some of the mechanisms that control germination in such diverse plant lineages as mosses and angiosperms are also conserved among ferns. We also present the use of singular value decomposition (SVD) [18][19][20] to uncover the gene-wise bias introduced by experimental artifacts due to comparison of different biological samples and multiple arrays.

Shrinkage correlation coefficient (SCC)
Correlation coefficients are computed to measure the similarity between each pair of genes in hierarchical clustering. Assuming that we have a total of N arrays consisting of F experimental (in this study, time point comparison) groups with N(k) replicates for the kth experimental group, then . Let G i,n (k) denote the expres-sion level of the ith gene for the nth replicate in the kth experimental group. The mean and variance of the expression of the ith gene over the replicates in the kth experimental group are defined as and , respectively.
If the standard deviation (SD) is used as an estimate of the measurement error, then the SD-weighted average expression of gene i over the experimental groups is given by and the SD-weighted correlation coefficient is defined as which has been used in the previous studies [5,6]. Since the SD-weighted correlation takes the measurement error into account, it is better than the Pearson correlation coefficient in estimating the correlation between a pair of genes in the case of repeated measurements [5][6][7]. The concept of applying larger weights on genes with smaller measurement error seems natural and has been demonstrated to be effective. However, the standard deviation may not be the best estimate of measurement error according to the Stein Phenomenon [10,21].
If F > 2, the Stein estimation, defined as with , is better than the standard variance in the sense that it is statistically closer to the real error [21]. The last statement is valid under the assumption that (k), k = 1,2, ..., F, are Gaussian distributed and independent of each other, which can be readily justified using the central limit theorem.
Since the Stein Phenomenon was discovered, several shrinkage estimation methods have been developed to find the optimal estimate of a group of measurement errors. In this work we propose a simple but effective methodology which is mainly inspired by the shrinkage concept [9][10][11]. N Nk Let the real squared measurement errors for the F experimental groups be ψ i (1), ψ i (2)...ψ i (F). We may estimate these F parameters using a high-dimensional model (of dimension F). According to statistics theory, the estimates in a high dimensional model will have larger variances compared to those in low-dimensional model (e.g., one dimension) when the same number of data points are available. Furthermore, if the number of data points are very limited (as is typically the case in real examples) the variances of a high-dimensional model may be unacceptably high for practical purposes. To reduce the estimation variance one may map the high-dimensional model for the F parameters onto a lower-dimensional restricted submodel. For example we may use the mean of the F parameters, as a one-dimensional submodel. Then, the estimation variance can be greatly reduced. However, the estimate is To summarize, the estimates in the original high-dimensional model have larger variances but are unbiased, and the estimate in the restricted one-dimensional submodel has a smaller variance but is biased. Since neither situation is satisfactory, we will propose a shrinkage error estimate that makes a balance between the above two kinds of estimates.
With the above restricted one-dimensional model (Eq. 3), we have an unbiased estimate of the squared measurement error Θ i as follows: Then, we can use a linear regularization model to define a balanced estimate: where λ i ∈ [0, 1] is an shrinkage factor that is to be determined according to a chosen optimization criterion. We propose to minimize the risk Applying the methodology of [11] and [8], the optimal shrinkage factor can be derived as From previous discussion, k, k = 1,2, ..., F, are assumed to be independent of each other, and from the above equation we have where var( (k)) is the variance of (k) estimated by To make sure that the shrinkage factor lies between 0 and 1, we define the final shrinkage factor to be and then we obtain the shrinkage estimate of (k), k = 1,2, ..., F, as Notice that in [8], a related mathematical problem is considered, where it aims to get a shrinkage estimate of a covariance matrix by using a restricted lower dimensional submodel in which the covariance matrix is assumed to be diagonal with common variance. In this work our problem is simpler. We only need to estimate the diagonal elements of the covariance matrix, not the whole matrix. Therefore, our result is different from what is obtained in [8], and is much simpler.
Using (k), the shrinkage estimate of (k), the error between the group mean (k) and the corresponding true expression value can be measured by means of the shrinkage error defined as . If we replace 1&2, the shrinkage error-weighted average expression of gene i is given by .
and we can define a new shrinkage correlation coefficient for any pair of ith and jth genes as If the number of replicates N(k) is the same for all the experimental (e.g., treatment/condition/time-point) groups, then Φ i (k) = α (k), where α = 1/N(k) is a constant, and hence the shrinkage correlation coefficient (Eq. 13) is effectively weighted by the shrinkage estimate (k) (i.e., N(k) has no effect on the shrinkage correlation coefficient). However, if the number of replicates is different for each experimental group, the use of in the shrinkage correlation coefficient provides additional benefits in our method through the weighting N(k) in a way that agrees with the common practice in statistics [22].
When using a biologically meaningful control sample, e.g., a matched and untreated sample, a zero time point, or the reference sample 24 hr presented in this study, we should use uncentered correlation to keep the impact of the biologically meaningful control sample on the gene expression changes [23]. Therefore, we replace and with zero in our analysis, and hence Eq. 13 is modified as follows:

Shrinkage correlation coefficient (SCC) is superior in clustering the synthetic and real gene expression data
As shown in Eq. 13, SCC is a type of correlation coefficient which is very useful in gene expression clustering. To assess the effectiveness of a new correlation coefficient in clustering analysis, it is important to compare it with other widely-used correlation coefficents using existing popular clustering methods. In this study, we compared SCC with the two most commonly used correlation coefficients: Pearson correlation coefficient [1] and SDweighted correlation coefficient [5][6][7]. We applied these three correlation coefficients on the two most popular clustering methods: hierarchical clustering [24] and kmeans clustering [25], and evaluated the performance by comparing the adjusted Rand index [26] generated for each correlation using these two clustering methods. Both synthetic expression data and real yeast expression data [27] were used in this study. The adjusted Rand index is a statistic that has been recently used for the comparison of clustering using different correlation coefficients [14,[28][29][30][31]. It measures the extent of concurrence between the clustering results and the underlying known cluster structure [32]. The comparison of the adjusted Rand indices generated by different correlation coefficients for the same data set indicates the performance of each correlation. The adjusted Rand index lies between 0 and 1, and a larger index indicates a higher level of agreement between the clustering results and the prior knowledge of functional categories, and further suggests better clustering performance [7,31].
Each synthetic data set includes 20 experiments, 2, 4, 6, 8, 10, 12, 14, 16, 18, and 20 replicates for each experiment, and two different levels of noises (low and high). The data sets were generated with predetermined patterns plus low or high level of random noise so that the underlying cluster structure is known. We evaluated the level of agreement between the resulting clusters from each of the three correlations and the known underlying cluster structure by computing the average adjusted Rand index over 1000 randomly generated synthetic data sets. Figure 1a and 1b show the correlation comparisons using hierarchical clustering and Figure 1c and 1d are the results from k-means clustering. As shown in Figure 1a, under low noise level (α = 0.5 in Eqs. 15&16), SCC has the same performance as the Pearson correlation coefficient but far better than the SD-weighted correlation coefficient when the replicate number is two. When the number of replicates is four, six and eight, SCC is superior to the other correlations. With the increasing of the number of replicates, the SDweighted correlation coefficient approaches SCC but both of them still outperform the Pearson correlation coefficient. Figure 1b shows that, when the noise level is high (α = 2.5 in Eqs. 15&16), SCC performs the best for all the numbers of replicates. These results suggest that, for syn- thetic microarray data, SCC is superior to the Pearson correlation coefficient and the SD-weighted correlation coefficient when using hierarchical clustering as it results in the most consistently high adjusted Rand index regardless of noise level and number of replicates.
We also compared the correlations using k-means clustering. Under low noise level (Figure 1c), SCC surpasses the other two correlations when the number of replicates is lower than 12. With the increasing of the number of replicates, the SD-weighted correlation coefficient approaches SCC but both of them still outperform the Pearson correlation coefficient. While the noise level is high (Figure 1d), SCC outperforms the Pearson correla-tion coefficients for almost all the numbers of replicates. When compared with the SD-weighted correlation coefficient, SCC is better when the number of replicates is smaller than four and close to the SD-weighted correlation coefficient when the number of replicates is four and six. With the increasing of the number of replicates, we noticed that SCC performs almost equally as the SDweighted correlation coefficient and has a slight advantage when the number of replicates is larger than 14. The k-means clustering results suggests that, SCC is a better choice compared to other correlations when the expression noise level is low, while the noise level is high, SCC is obviously superior to the Pearson correlation coefficient on almost all the numbers of replicates and has slight The performance of the three models indicated by the adjusted Rand index obtained from the synthetic data sets using hierar-chical clustering and k-means clustering advantages over the SD-weighted correlation coefficient for most of the numbers of replicates.
To further demonstrate the superiority of SCC, we applied the three correlations individually to hierarchical and kmeans clustering and computed the adjusted Rand index on the real yeast expression data [27]. This microarray data represent 20 systematic perturbations of the yeast galactose-utilization pathway, and four replicates were performed for each perturbation. Each gene has been annotated in one of four functional clusters in the Gene Ontology [33]. These four clusters are used as the external knowledge. As shown in Figure 2, when hierarchical clustering is used, the adjusted Rand indices for SCC is 0.8760 which is higher than those of the other two correlations (Pearson correlation coefficient: 0.8659; SD-weighted correlation coefficient: 0.8166). When applied to k-means clustering, SCC is also superior to other correlations with the highest adjusted Rand index 0.9132. Since the noise level of this real yeast expression data was not clearly stated and barely quantified in the previous study, we could not determine which is better between the Pearson correlation coefficient and the SD-weighted correlation coefficient. However, this real expression data compari-son suggests that SCC is superior regardless of noise level and clustering methods which corresponds to the results with the synthetic data.

Analysis of the tentative unique gene expression profiles during early development in germinating spores of C. richardii by SCC
Spores of C. richardii have proved to be an excellent model system to study the basic cellular processes that occur in early gametophyte development, such as gravity sensing and response, sex determination and differentiation, pattern formation, and photomorphogenesis [34]. Using DNA microarrays consisting of 3,840 spotted cDNA clones from an EST analysis, Salmi et al. [12] monitored the mRNA levels for 3,207 tentative unique genes (TUGs) of C. richardii over the first 48 hr of gametophyte development. TUG expression in the spores was evaluated at 0, 6, 12, 24 and 48 hr after spores were exposed to continuous white light. This developmental period includes initiation of germination at 0 hr, the production of a detectable polar calcium current that peaks at 6-12 hr, fixation of the polarity of more than half of cells by gravity at 12 hr, migration of the nucleus at 24 hr, and a polar cell division at 48 hr [35].
The performance of the three correlations indicated by the adjusted Rand index obtained from the real yeast expression data using hierarchical clustering and k-means clustering Figure 2 The performance of the three correlations indicated by the adjusted Rand index obtained from the real yeast expression data using hierarchical clustering and k-means clustering. Each correlation is represented by a bar: SCC (red), SD-weighted correlation (green), and Pearson correlation (blue). The y-axis is the adjusted Rand index.
The data analysis of Salmi et al. [12] focused on identification of differentially-expressed genes between time points and did not attempt to identify upward or downward trends in the time course data. This analysis did not discover and filter out the underlying biases associated with experimental artifacts due to comparison of different biological replicates and prints of arrays to facilitate further gene expression profiling. Moreover, this analysis did not organize expression patterns into biologically meaningful profiles through the whole time course experiments by assimilating the patterns of gene expression. A more thorough analysis of these microarray expression data that focuses on characterizing the entire set of transcripts temporally and displaying them graphically would promote a better understanding of cellular processes underlying early gametophyte development in ferns.
In total, we analyzed 34 arrays with biological replicates of four different developmental time point comparisons: nine replicates of 0:24 hr, eight of 6:24 hr, nine of 12:24 hr, and eight of 48:24 hr. The 34 arrays were used to generate a total of 34 data columns in which each column was treated independently rather than averaging the replicates.
The initial selection retained non-flagged spots for which the within-spot pixel-to-pixel correlation of intensities is > 0.5 and the sum of median (635/532) signal intensities is >150. These non-flagged spots were also well-measured in at least 80% of the array. In this selection, 39% of the TUGs were filtered out prior to further analysis. Selecting for TUGs with a known accession number in GenBank removed a further 4% of the TUGs. Selection for a fluorescence ratio of at least 1.5-fold greater than the geometric mean ratio for the TUGs examined in at least two arrays of any time point comparison removed 41%. The resulting data set (see Additional file 1) for the experiments analyzed included tabulation of the log 2 ratios of gene-expression levels for 572 TUGs. In the entire 34 arrays there were 137 TUGs for which there were no missing data. A total of 1,152 expression ratios were missed, accounting for only about 6% of the total data set we analyzed. After the imputation of the missing data with the K-nearest neighbors (KNN) method [36], the data set was normalized arraywise with the mean of 572 TUGs expression ratios of each array set as zero and the standard deviation as one.
We first uncovered ten samples that have poor correlation to other samples of their time point group by applying unsupervised agglomerative hierarchical array clustering and SVD (see Additional file 2). These ten arrays, which likely represent experimental artifacts, were removed from the data set, and the new data set (see Additional file 3) was used to tabulate the log 2 ratios of gene-expression levels for 572 TUGs. Of these TUGs, 151 have no missing data in the entire 24 arrays. A total of 1020 data were missed, accounting for only 7.4% of the adjusted data set. The imputation of the missing data was performed by the K-nearest neighbors (KNN) method [36] with the average value of the nearest 10 neighbors (K = 10). We further normalized this new data set by adjusting the mean of 572 TUGs expression ratios of each array to zero and the standard deviation to one.
In the new data set, the expression profile of the 12:24 hr time point comparison group was measured by two prints of arrays: four replicates hybridized on the Cri2 arrays, and two on the Cri3 arrays. Cri2 and Cri3 arrays were the same except printed on different days. In attempt to identify and correct the gene-wise bias introduced by the two prints of arrays, we carried out SVD. This technique has previously been used to detect and correct the artifact in the data set that was caused by different types of arrays (22 K vs. 42 K) [37] or sampling from cultures with slightly different periods [38]. We reduced the new "TUGs × Arrays" space to the "Eigenarrays" space that spans the space of the array expression profiles and the "Eigengenes" space that spans the space of the gene expression profiles. Eigengene 5 was discovered to be exactly correlated with the gene-wise bias (Figure 3). We sorted the abundance of this eigengene in each of the 24 arrays, and found a perfect correlation between the abundance and the print of array ( Figure 3). All of the four Cri2 arrays show negative abundances, whereas the Cri3 arrays all have non-negative abundances.
We filtered out the gene-wise bias from the data set (see Additional file 3) by substituting Eigenexpression 5 (the Gene-wise bias (Eigengene 5) associated with the two prints of arrays  diagonal entry of S in Eq. 1 of Additional file 2) with zero, and reconstructed the final data set with Eq. 1 of Additional file 2. Subsequently, this final data set (see Additional file 4) was analyzed by the unsupervised twodimensional agglomerative hierarchical clustering. We first calculated the optimal shrinkage factor with Eq.
10. Since the number of replicates is six (a moderate number) for each time point comparison after we filtered out ten low-quality arrays, is significantly different from 0 and 1, and has an average value of 0.69 (Figure 4). This indicates SCC is effective in our analysis, and neither the Pearson correlation coefficient nor the SD-weighted correlation coefficient should be used here. Therefore, we applied SCC to the gene clustering, and the Pearson correlation coefficient [1] to the array clustering. The TUG expression profiles during the early stages of gametophyte development of C. richardii are shown in Figure 5. The 572 TUGs are clearly clustered into two distinct groups: Cluster A with 292 TUGs, and Cluster B with 280 TUGs.
We performed functional analysis of clusters A and B using Gene Ontology annotations transferred from putative A. thaliana homologs of individual TUGs, as identified by blastx analysis described previously [12]. Using a standard over-representation analysis as implemented in the ErmineJ software [39], we examined the clusters to identify Gene Ontology terms that appeared unusually often among the TUGs in each cluster, relative to the full complement of TUGs represented on the array.
We found that Cluster A, which includes genes that are upregulated during the first 48 hours following germination, were significantly enriched with genes annotated with the GO term RNA-binding. The cluster contained TUGs that had high homology to A. thaliana proteins At4g32720.1 (CriU1545, CriU226) and At2g05120.1 (CriU2095). At4g32720.1 contains an RNA recognition motif, and At2g05120.1 contains a region matching a nucleoporin Pfam motif. Both are annotated with the term "RNA export from nucleus," suggesting that their C. richardii counterparts may also be involved in RNA processing and export.
Over-representation analysis of Cluster B, which includes genes that are down-regulated during the same period of development, revealed fourteen enriched terms. These included terms related to signal transduction (protein phosphatase type 2C activity, abscisic-acid mediated signaling, hormone-mediated signaling), transport, biosynthetic activities (water-soluble vitamin biosynthesis), oxidative phosphorylation, and response to oxidative stress. A full list of terms is given in Table 1.
In addition, we noticed that the final data set (see Additional file 4) analyzed here was generated by filtering out ten low-quality microarray samples and one gene-wise bias from the original 34 arrays. Therefore, it will be reasonable to obtain similar clusters based on this final data set by applying the three correlations (SCC, Pearson correlation coefficient and SD-weighted correlation coefficient) to hierarchical or k-means clustering (data are not presented here). Since SCC is superior in clustering synthetic and real expression data as demonstrated above, we presented here the application of SCC to analyze our own microarray data.

Discussion
In this work we have shown that a new correlation coefficient, shrinkage correlation coefficient (SCC), can be used as a similarity metric in clustering replicated microarray data generated by most academic laboratories and derive new information pertaining to gene expression patterns.
Using hierarchical and k-means clustering, we compared SCC with the two most widely-used correlation coefficients: Pearson correlation coefficient [1] and SDweighted correlation coefficient [5][6][7]. The adjusted Rand index comparison showed that SCC enables improved clustering performance for both synthetic and real expression data. Using the SCC-based hierarchical clustering algorithm, we discovered two distinct clusters of genes during the germination of C. richardii spores. Functional analysis suggests that some of the mechanisms that con- Histogram of optimal shrinkage factor . trol germination in such diverse plant lineages as mosses and angiosperms are also conserved among ferns.

SCC is a robust correlation coefficient
Correlation coefficient is crucial in cluster analysis to determine the similarity between two objects (in this study, genes) and further classify the objects into different groups. When the Pearson correlation coefficient was first applied for clustering gene expression [1], the replicates of each treatment group were simply averaged without considering the underlying error. As the importance of the error information was discovered [5], more and more studies use standard deviation as the error estimate when clustering gene expression with the help of correlation coefficient. Using the SD-weighted correlation coefficient, they down-weighted the gene expression values with high error estimates in microarray analysis [5][6][7]. However, the SD-weighted correlation coefficient is still not statistically efficient for analyzing replicated microarray data and the use of standard deviation as the error estimate exhibits serious defects when the number of replicates is small [8].
Commonly, the number of microarray replicates that are performed by most academic laboratories is usually less than 10 due to the experimental cost and time concern. The "Stein phenomenon" [9] states that when the number of data samples (in this study, biological replicates) in each experimental group is relatively small, a better estimate of the error of any individual experimental group could be obtained by shrinkage that considers all experimental groups. To avoid inaccuracy introduced by the small number of replicates, a better estimate of the error in the replicates can be obtained by shrinkage estimate. Our shrinkage correlation coefficient (Eq. 13) can be regarded as a generalized definition of correlation coefficient for the expression of a pair of genes with replicates. It takes into consideration both the number of replicates and the variance within each treatment comparison.
SCC can be reduced to other definitions of correlation coefficients when the shrinkage factor is set to some special values. For example, under the condition that the numbers of replicates are equal for all the F experimental groups, SCC is equivalent to the Pearson correlation coefficient when = 1, and to the SD-weighted correlation coefficient when = 0. Therefore, the SD-weighted correlation coefficient and the Pearson correlation coefficient are just two extreme cases of SCC. The correlation coefficient with optimal shrinkage is an alternative to these two extremes and is superior to them when 0 < < 1. By using the shrinkage factor , we obtain an optimal estimate of the error in the replicates and, accordingly, better estimates of the similarity between any pair of genes.
We would argue that in SCC, the use of the shrinkage error as a weighting provides a better correlation measure for two reasons. First, (k) is superior to the standard deviation as an estimate of the measurement error, and secondly, the inclusion of N(k) takes the size of an experimental group into account as an assessment of the reliability of the group mean. The benefit of N(k) disappears if all experimental groups are of the same size. For example, in our C. richardii microarray data analysis, the number of replicates happens to be equal for each time point comparison after filtering out ten low-quality arrays. In such an analysis, the use of and the use of are equivalent, since either choice leads to the same value of shrinkage correlation in Eq. 13. Figure 1a, b and 1c shows that SCC is superior to the other models regardless of the number of replicates. In Figure 1d, SCC also has advantages for most of the numbers of replicates. Therefore, SCC offers utility to most replicated microarray data sets.
Using the Stein shrinkage concept [9,10], a shrinkage estimator for gene-specific variance components was proposed to construct a F-like statistic that has been used in a linear mixed ANOVA (analysis of variance) model [40], and a shrinkage estimator of the mean used in the clustering similarity metric was developed for genome-wide expression data analysis [41]. This shrinkage ANOVA model and the shrinkage mean could be combined with our SCC for analyzing expression data in future studies.

Results of functional analysis
The stringent quality control filtering followed by novel cluster analysis methodology of microarray data on C. richardii early development produced two distinct clusters of TUGs. As shown in Figure 5, the TUGs represented in SCC Cluster A increased expression during the first 48 hours following germination, while TUGs in Cluster B decreased expression during the same period. An analysis of GO terms associated with TUGs in Cluster A reveals that only one term (RNA transport) is over-represented among GO annotations associated with Cluster A TUGs relative to the other TUGs assayed in the microarrays. This suggests that the TUGs in Cluster A represent a cross-section of many different types of genes, perhaps reflecting a general "ramp-up" in multiple biological functions, along with an increase in RNA processing as the spore activates embryonic transcription.
Previous results from other systems suggests that genes associated with the term RNA transport may be related to the process of establishing and maintaining cellular polarity in the early stages of germination of C. richardii spores.
In the filamentous fungus Aspergillus nidulans, mutation of swoK, a gene that encodes a protein with an N-terminal RNA recognition motif that causes cells to swell and lack the normal polarity maintained fungal hyphal cells [42].
The swoK protein appears to function in both mRNA maturation and nuclear export of mRNAs.
By contrast, over-representation analysis of the Cluster B TUGs, which are down-regulated during the 48 hr time course, reveals an enrichment of several GO terms, suggesting that the Cluster B TUGs represent a more specialized set of genes involved in functions and processes required during early development of C. richardii. Cluster B consists of genes that are down-regulated during the earliest stages of spore germination, starting with the initiation of germination by light (0 hr) through the first two days of development, when the first cell division occurs. These are likely to include transcripts that were present in the dormant spore but decline in abundance in the germinating spore. This population is likely to encode proteins involved in maintaining the dormant condition. Once germination begins, genes responsible for maintaining the dormant condition of the spore would need to be down-regulated to allow for the transition from dormant metabolism to active growth and development.
Careful examination of the Cluster B genes and their associated GO terms reveals some interesting patterns. One of the most notable findings from this study is that genes involved in abscisic acid mediated signaling are overrepresented among genes down-regulated in the first 48 hr of spore germination. Abscisic acid (ABA) is a plant hormone known to be involved in the process of establishing and maintaining dormancy in angiosperm seeds [43,44] and moss spores [45]. ABA has been previously shown to be involved in another aspect of C. richardii development.
Sex determination of C. richardii gametophytes is regulated in part by ABA [46].
The process of seed germination in Arabidopsis involves a decrease in the endogenous levels of abscisic acid [43], and inhibiting ABA biosynthesis by treatment with fluridone caused Nicotiana plumbaginifolia seeds that should be physiologically dormant (D) to germinate at the same rate as seeds that are in a physiologically non-dormant (ND) condition [44]. The more general term hormonemediated signaling (a sub-set of which would be abscisic acid-mediated signaling) is also over-represented among the TUGs in Cluster B. This term annotates TUGs predicted to be involved in ABA-related pathways as well as other hormone-related processes, notably signaling pathways mediated by gibberellic acid. Hormone-mediated regulation of the process of germination has been well studied in angiosperm seeds, including ABA involvement in the maintenance of dormancy, and the germination activating role of gibberellin. The process of germination in angiosperm seeds involves extensive hormone-mediated signaling [47] so it is not surprising to find this category of genes implicated in fern spore germination, as well.
The involvement of ABA in germination is not unique to angiosperm seeds. In moss cells that have differentiated into spores, removal of ABA causes these cells to germinate and develop into new filamentous cells [48]. In the classical view of hormone signaling in eukaryotic organisms hormones are thought of as the chemical substances produced in one part of the organism that serves as a signal to another part of the organism. In this paradigm it would seem unusual for a plant hormone to function within the single cell of the C. richardii spore. However, an ABA receptor involved in seed germination has been characterized that is not plasma membrane localized [49], the Arabidopsis gene CHLH (genomic locus At5g13630). This receptor functions inside of cells and it could, in principle, respond to intracellular changes in the level of ABA to regulate physiological changes within single cells. The CHLH "receptor" gene encodes a subunit of the Mg-chelatase complex that is an integral part of chlorophyll biosynthesis, and is involved in plastid-to-nucleus signaling. The process of producing a highly resistant, dormant stage of plant reproductive cycles (i.e. a spore or seed) is conserved among all major plant lineages. Given the intercellular localization and functioning of an ABA receptor that mediates germination in Arabidopsis seeds, and the documented role of ABA in the germination of various spores, it is plausible that this signaling pathway is similar in fern and moss germination.
One of the well-documented mechanisms that regulates ABA signaling in seed germination is dephosphorylation of regulatory proteins by protein phosphatases, particularly type 2C protein phosphatases (PP2C) [50,51]. In Fagus sylvatica the expression of a PP2C is directly regulated by ABA in dormant seeds [52]. Additionally, dephosphorylation of actin by protein phosphatases has been implicated in the germination of the plasmodium Physarum sclerotium [53] and Dictyostelium discoideum [54]. The biological process of dephosphorylation was included as an activity that is over-represented in the cluster of genes down regulated during germination, and this category includes genes likely to encode PP2C enzymes. The prediction from our clustering results that ABA-regulated signaling enzymes like protein phosphatases are involved in fern spore germination is plausible in light of the conservation of this pathway in other plant species.
Annotation of genes in the down-regulated cluster identified three cellular localization categories, including external encapsulating structure, nucleolus, and plasma membrane. Of these three categories, the down regulation of nucleolar associated genes is similar to a process observed in angiosperm seed germination. In Zea mays seeds, nucleolus-associated bodies are present in the cells of dry seeds, but after 24 h of imbibition these nuclear bodies have decreased significantly [55]. Although C. rich-ardii spores are a useful model system for the study of gravity perception in a single cell, we did not identify any genes in this analysis of early development that are obviously involved in the process of gravity perception or in early signaling steps of the gravity response.

Conclusion
In summary, we have developed a robust correlation coefficient, shrinkage correlation coefficient (SCC), which is an alternative to the Pearson correlation coefficient and the SD-weighted correlation coefficient, and particularly useful for clustering replicated microarray data generated by most academic laboratories. We have shown the superiority of SCC by the adjusted Rand index comparison on both synthetic and real expression data using hierarchical and k-means clustering. We apply SCC to successfully identify distinct clusters of genes during C. richardii early development. We also present the use of SVD to uncover the gene-wise biases introduced by experimental artifacts due to comparison of different biological replicates and prints of arrays. This computational approach is not only applicable to DNA microarray analysis but is also applicable to proteomics data or any other high-throughput analysis methodology.

Hierarchical clustering and k-means clustering
The agglomerative hierarchical clustering algorithm [24] was used in this study. It starts with individual objects. The most similar objects are clustered together, and then these initial clusters are merged according to their similarities. This hierarchical clustering algorithm is based on the average linkage method [56]. In our two-dimensional cluster analysis, gene clustering and array clustering are performed independently without interfering between the two dimensions. K-means [25] is a classic clustering algorithm which assigns each point to the cluster whose centroid is nearest. In our study, the clustering results of hierarchical clustering were used to compute the initial centroids to start k-means. All of the clustering analyses were implemented with MATLAB (MathWorks, Inc., Natick, MA).

Synthetic expression data
Following the convention of [7], we generated synthetic microarray data which include six clusters of genes and 20 experiments. Each cluster consists of 66 or 67 genes. We use the subscripts i, j, k and m to denote the gene number, the experiment number, the replicate number and the cluster number, respectively. The first four clusters of genes follow periodic functions plus some noises: