 Methodology article
 Open Access
 Published:
Microarray data mining: A novel optimizationbased approach to uncover biologically coherent structures
BMC Bioinformatics volume 9, Article number: 268 (2008)
Abstract
Background
DNA microarray technology allows for the measurement of genomewide expression patterns. Within the resultant mass of data lies the problem of analyzing and presenting information on this genomic scale, and a first step towards the rapid and comprehensive interpretation of this data is gene clustering with respect to the expression patterns. Classifying genes into clusters can lead to interesting biological insights. In this study, we describe an iterative clustering approach to uncover biologically coherent structures from DNA microarray data based on a novel clustering algorithm EP_GOS_Clust.
Results
We apply our proposed iterative algorithm to three sets of experimental DNA microarray data from experiments with the yeast Saccharomyces cerevisiae and show that the proposed iterative approach improves biological coherence. Comparison with other clustering techniques suggests that our iterative algorithm provides superior performance with regard to biological coherence. An important consequence of our approach is that an increasing proportion of genes find membership in clusters of high biological coherence and that the average cluster specificity improves.
Conclusion
The results from these clustering experiments provide a robust basis for extracting motifs and transacting factors that determine particular patterns of expression. In addition, the biological coherence of the clusters is iteratively assessed independently of the clustering. Thus, this method will not be severely impacted by functional annotations that are missing, inaccurate, or sparse.
Background
DNA microarray technology allows investigators to monitor simultaneously the expression behavior of essentially all the genes within an entire genome and can provide information on gene functions and transcriptional networks. However, the large number of genes and the complexity of the underlying biological networks make extracting this information a formidable task. A common first step to interpret DNA microarray data is to cluster the data on the basis of similarity of expression patterns. Since genes with similar function often show a common expression pattern, clustering genes of known functions with poorly characterized genes provides a means of gaining insights into the functions of the latter [1]. Furthermore, patterns seen in genomewide expression data can reveal potential gene regulation networks and relate cellular processes to changes in cellular conditions [2, 3]. However, the extent to which clustering reveals useful information about the system under study depends on the extent to which the clustering method successfully groups intrinsically related elements. Example classes of clustering algorithms include (a) single and complete link hierachical clustering [4], (b) Kfamily of clustering algorithms [5–7], (c) optimizationbased clustering approaches [8–10], (d) fuzzy clustering [11, 12], (e) quality threshold clustering (QTClust) [13], (f) artificial neural networks for clustering, such as the selforganizing map (SOM) [14] and a variant that combines the SOM with hierachical clustering, the selforganizing tree algorithm (SOTA) [15], (g) informationbased clustering [16, 17], and (h) stochastic approaches such as clustering by simulated annealing [18, 19]. Some of these algorithms, while novel in their own rights, suffer from certain shortcomings. For instance, the Kclustering methods require a userspecified cluster number. The QTClust approach is computationally expensive and assumes at each iteration that the largest cluster formed is necessarily the best grouping. And the simulated annealing approach requires a consistently good cooling schedule, subjective inputs for the upper bound cutoff distance and the allowable lower bound false negatives, and can become computationally more expensive than exhaustive search algorithms. Whichever the clustering algorithm used, we need an intuitive and relevant tool to first assess the quality and significance of the clusters formed.
One approach to assess cluster quality is based on the idea that if the clustering result reflects robust structures, existing clusters should accurately estimate the appropriate cluster labels for new data points [20]. [21] extend the prediction strength concept by proposing a figure of merit (FOM) measure, which describes the mean deviation of the gene expression levels with respect to the pertinent cluster centers. Other methods of measuring cluster validity have also been proposed, such as the DaviesBouldin Validity Index [22], which is a function of the ratio of the sum of withincluster scatter to betweencluster separation.
These approaches all rely solely on mathematical coherence to assess cluster quality. However, in uncovering biologically coherent structures from DNA microarray data, categorizing gene clusters on the basis of known functionally related groups is very relevant. This provides useful insights into key biological themes among the genes and enables progress in the studies of gene regulatory networks and signal transduction pathways [23]. The biological coherence of each cluster can be scored according to the percentage of its genes covered by annotations significantly enriched in the cluster in question, using functional classification schemes in Martinsried Institute of Protein Sciences (MIPS) or the Gene Ontology (GO) databases [24, 25] to generate a pvalue reflecting the likelihood that such enrichment would happen by chance. Auxiliary to this central issue is the question of rigorously identifying and isolating outlier genes since it is highly probable that only a subset of genes participate in any cellular process or biological studies of interest. On the other hand, genes with similar expression profiles may not have common functional characteristics or expression profiles of genes in the same functional category may still be dissimilar due to the existence of unknown functional subcategories and sparse or inaccurate functional annotations.
A remedy then is to integrate known biological knowledge into the clustering procedure itself. Most knowledgebased clustering methods directly incorporate GO knowledge into the algorithm. This assumes that the current GO knowledge is correct because genes known to have a similar functional annotation are 'pushed' more closely to one another in a biased fashion. However, this could handicap the clustering process if the organism is sparsely annotated. Methods that work solely on a modified distance measure [26] can thus overestimate the distances of the genes with unknown functions, thus depriving them of a chance of being 'discovered'. An alternative is to modify the similarity measure to be a linear combination of the expression profile similarity and functional similarity. This would however not work well with organisms that are not well annotated and is always highly subjective as to the contribution score of each similarity measure. Another approach allows genes sharing common functions to have a common prior probability as compared to genes with different functions [27]. In an earlier work [28], the clustering process is regulated by mustlink constraints, which apply to genes with known common functions, and cannotlink constraints, which apply to genes known not to be associated with one another. This class of knowledgebased clustering algorithms could distort the 'chance' of genes with unknown functions to be fairly clustered. It also assumes that the current GO annotations are accurate and comprehensive enough. A recent novel approach for knowledgebased clustering [29] involves the use of selectively snipping the edges of a typical hierarchical clustering tree to induce clusters that are maximally consistent with available background information such as functional annotations. This method is tested and reported to outperform another recent knowledgebased clustering method [30].
In addressing the various aforementioned concerns of many current knowledgebased clustering methods, we choose not to interfere with the measured distances of the genes or data points and instead cluster them based on the fundamental intuition that genes in the same cluster should behave similarly. Then, we introduce a secondary refinement process using the GO annotations to check the level of biological coherence of the clusters. We use the entire set of yeast GO annotations available throughout our study. The iterative nature of our procedure means that if the GO annotations are somehow wrong to begin with, subsequent iterations of our algorithm will still show a strong persistence in pushing seemingly 'unrelated' genes together, thus giving a hint that maybe these genes should have a common function after all. If the GO annotations are unknown, then the unknown genes would be still fairly clustered with their counterparts, as their measured experimental behavior has not been tampered with. In organisms that are sparsely annotated, the emphasis on these intuitive factors such as unaltered distances and response correlation provides opportunities for (a) genes with known functions and verified with similar experimental behavior to be clustered together and (b) genes with unknown functions to find cluster membership in clusters with known functions if they correlate well or in clusters with unknown functions containing counterparts that show similar expression behavior.
Our iterative clustering approach has as its backbone an optimizationbased clustering algorithm, EP_GOS_Clust, presented in an earlier paper [10]. The algorithm is based on a variant of the GBD algorithm, the Global Optimum Search (GOS) [31–33]. This is a robust algorithm that compares favorably with many commonlyused clustering algorithm in defining clusters that are as dissimilar from one another as possible (that is, a large intercluster error sum) while assigning members that are as similar to one another as possible into the same cluster (that is, a small intracluster error sum). It also incorporates a methodology to predict the optimal number of clusters for a given dataset. In measuring the biological coherence of the clusters, we assert that the appropriate performance indicators are the cluster pvalues and the proportion of genes that are in clusters of high coherence quality. The first reflects the functional richness of a cluster and is an intuitive measure of coherence, while the second reflects the potential to extract the maximum amount of useful information for subsequent studies of motif analysis and regulatory structure searches. We also look at cluster correlation and functional specificity as consistency checks. We then test our proposed method on three datasets of actual DNA microarray expression results.
Results and discussion
Outline of proposed approach for uncovering biological coherence
Our iterative approach extends a recentlyproposed algorithm, the EP_GOS_Clust, as a backbone for clustering the expression data. Specifically, we perform an initial clustering run on a given data as previously described to reach the optimal number of clusters [10] and then apply a GO analysis of the data to obtain a preliminary assessment of the level of biological coherence. Those clusters that exhibit better biological coherence than a prescribed benchmark value are retained as seeds for the next round of clustering. As a consistency check, we also look at cluster correlation and specificity. Those genes that fall into clusters lacking the benchmark level of coherence are subjected to a subsequent round of EP_GOS_Clust clustering, in which each gene is placed either in an existing cluster having a similar expression profile or aggregated with other unclustered genes having similar expression profiles to form a new cluster. This overall process is repeated until we observe an asymptotic saturation either in the optimal number of clusters or the proportion of genes that are placed into clusters with strong levels of biological coherence. The approach is described in greater detail in the Methods section.
Dataset I
Dataset I consists of DNA microarray data obtained from a study in the role of the Ras/protein kinase A pathway (PKA) on glucose signaling in yeast [34]. The Ras/PKA signal transduction pathways provide a major conduit to couple cellular responses to the availability of carbon sources such as glucose [35–37]. [34] measure the levels of RNA for each of the 6237 yeast genes over time in wild type and various mutant strains following glucose addition to cells grown on a nonfermentable carbon source. These experiments are designed to assess the extent to which the Ras/PKA pathway mediated transcriptional effects induced by glucose. [See additional file 5]
Each of the eight test and control experiments consist of four time points over a hour period, yielding 32 data points for each of the 6237 genes. Before clustering the array data, we filter the data to remove unreliable data. In particular, we retained all genes for which all the time points are deemed present by the Affymetrix software suite (4105 genes), all the genes for which greater than 50% of the time points are deemed present and all the genes for which the present/absent calls exhibit a consistent and biologically relevant pattern (e.g., PAAA for the four time points in the experiment, indicating repression of expression of that gene over the course of the experiment). In all, we retain 5652 genes.
Clustering dataset I using the EP_GOS_Clust algorithm described previously [4] results in 237 optimal clusters, out of which 46 were singleton clusters. We perform a GO analysis of the clusters and determined that 64% of the 5652 genes fall into clusters with significant functional coherence (pvalues of 10^{3} or less; 32% of the genes fall into clusters with pvalues of 10^{4} or less). Ignoring the singleton clusters, we also find that the average cluster size is 30.4, while the average size of clusters with pvalues of 10^{3} or less is 36.2, thus indicating that clusters formed using the EP_GOS_Clust do not show strong biological coherence just because of a size bias. In fact, the average size of clusters with pvalues of 10^{4} or less is only 36.9. We select as a cutoff standard a pvalue of 10^{3}, which encompass 64 clusters containing a total of 2318 genes. Reclustering the remaining 3334 genes by another round of EP_GOS_Clust decreases the optimal number of clusters to 112, out of which 81% (or 4570 of the 5652 genes) of the genes fall into clusters with pvalues of 10^{3} or less. After 6 iterations of this process, we find that the optimal number of clusters saturated at 62, with over 90% of the genes falling into clusters with pvalues of 10^{3} or less. We also find that the proportion of genes that fall into clusters with pvalues of 10^{4} or less increases from 32% to 69%. Figure 1 illustrates the results of the clustering process. As evident from this Figure, the proposed clustering method yields a monotonic increase in the average log_{10}(P) values and in the proportion of genes that found placement in quality coherent clusters, as well as a steady decrease in the optimal number of clusters, indicating the growing compactness and economy of the clustering. We further note that over 75% of the genes that were placed into singleton clusters for the first iteration remain in singleton clusters throughout suggesting that these genes are biological outliers. We also compare our clusters with that obtained using representative methods from the Kfamily of clustering approach [5–7], the SelfOrganizing Map (SOM) [14], a variant that combines the SOM with hierarchical clustering, the SelfOrganizing Tree Algorithm (SOTA) [15], and the QTClust method [13]. Table 1 shows that just the EP_GOS_Clust backbone already compares favorably against the other methods, and that the iterative approach further refines the cluster quality. In short, this iterative method clearly provides a means of enhancing the biological coherence of clusters obtained from microarray data.
Dataset II
We examine a second dataset derived from experiments designed to determine comprehensively the contribution of different signaling pathways to the glucose response in yeast. In addition to the Ras/PKA pathway, at least three other signaling pathways mediate transcriptional changes attendant on addition of glucose to cells [36, 37]. In one pathway, glucose addition leads to reduced activity of the AMPactivated protein kinase encoded by SNF1, thereby unfettering a transcriptional repressor encoded by Mig1/2 and suppressing several transcriptional activators [38, 39]. The second pathway, mediated by Rgt1, couples expression of the hexose transporter genes to the level of available glucose. Binding of glucose to plasma membrane glucose sensors promotes degradation of repressors of hexose transporter genes by the SCF^{Grr1} ubiquitin ligase complex, allowing Rgt1 activation of these genes. Finally, the AGC kinase Sch9 acts in parallel to PKA to induce expression of many genes normally regulated by PKA [40]. To test the roles of each of these pathways, we measure expression changes following glucose induction in wild type and mutant cells lacking specific components of the different pathways. Levels of RNA for each of the 6237 yeast genes in each of the RNA samples are assayed using Agilent microarray chips, out of which measurable signals are registered from 5657 of them. This dataset consists of results from 23 time course experiments, each with 2–5 time points. [See additional file 5]
Clustering of dataset II using EP_GOS_Clust results in 224 optimal clusters, out of which 130 are singleton clusters. As before, we find the singleton clusters to be poorly related to the other genes, with a low average correlation coefficient of only 0.08. We perform a GO analysis of the clusters and note that 79% of the 5657 genes fall into clusters with pvalues of 10^{4} or less and 66% of the genes fall into clusters with pvalues of 10^{5} or less. This in itself already represents a highly robust quality of clustering. As a separate analysis, we ignore the singleton clusters and find that the average cluster size is 59.8, while the average size of clusters with pvalues of 10^{4} or less is 78.2, thus indicating that clusters formed using the EP_GOS_Clust are unlikely to show strong biological coherence just because of a size bias. In fact, the average size of clusters with pvalues of 10^{5} or less is only 81.1. For this dataset, we select as a cutoff standard a pvalue of 10^{5}, which gives 44 clusters containing a total of 3737 genes. We apply the proposed methodology to improve the biological coherence of our clusters and found that after 6 iterations, the optimal number of clusters saturates at 189 (out of which 114 are singleton clusters), with 84% of the genes falling into clusters with pvalues of 10^{5} or less (Figure 2, Table 2). We highlight that this is less than the final gene placement of over 90% obtained for dataset I for two main reasons. First, dataset II is drawn from a larger number of DNA microarray experiments and has 75 feature points compared to 24 for dataset I, thus introducing a greater range of variability within the data. Second, we imposed a stricter cutoff pvalue of 10^{4} or less, as opposed to the previous choice of 10^{3}. Figure 2 summarizes the utility of our proposed methodology in improving cluster validity and placing the largest possible proportion of genes into biologically coherent clusters. We also find that the mean log_{10}(P) value of the clusters increases monotonically from 8.39 to 9.14.
At completion of the iterative process, we obtain 38 clusters with pvalues of 10^{5} or less, containing 4747 (or 84%) of the original 5657 genes. We note that these clusters exhibit a tight grouping, as evidenced by a visual inspection of the gene expression time course plots [See additional files 1 and 2], as well as the relatively high values of correlation coefficients for these clusters (an average value of 0.667 over all clusters, with a maximum of 0.925 and a minimum of 0.387).
Comparison of clustering methods for dataset II
We examine the coherence of the clusters obtained by our iterative version of EP_GOS_Clust versus the noniterative version as well as other clustering methods. These include the well established K family of partitionbased clustering algorithms [5–7], self organizing tree algorithms (SOTA) [15] and a recently proposed information theoreticbased method (IClust) [17]. As can be seen from the results in Table 3 the initial iteration of EP_GOS_Clust performs as well or better than any of these other methods, whether corrected for multiple testing or not, as measured both by the percent genes resident in clusters with high biological coherence and by the average expression correlation within individual clusters. Moreover, the clusters from the final iteration of the algorithm exhibit a higher level of expression correlation. Thus, the iterative EP_GOS_Clust compares favorably with other clustering methods.
Function prediction based on expression profiles
An application of any clustering approach is the ability to predict the functions of unknown genes by clustering them together with counterparts with known functions. This is a particularly important consideration when working with organisms that are not as wellannotated as yeast. We test the capability of our proposed iterative procedure in this respect through a simulated study. We do this by randomly deannotating 20% and 30% of the genes in dataset II. We then apply our iterative clustering approach to the entire dataset II and take into consideration the entire set of functional annotations reported on the SGD. Naturally at each iterative step, the clusters are scored as if the deannotated genes have no known function, thus affecting the pvalue. As a result, we find that our iterative procedure demonstrates a 61.5% level of prediction accuracy for the dataset with 20% deannotation and 50.4% for the dataset with 30% deannotation. We feel these results compare favorably to the other knowledgebased clustering methods reported in the literature [29, 30]. These methods report a prediction accuracy of between 30–70%. However, these accuracies are found using restricted datasets of only 2–3 function classes (each containing 200–300 genes) and clustering is done up to 10–14 clusters. Furthermore, we observe there is very little variation in the percentage of prediction accuracy as our algorithm steps through the iterations. As a case in point, the prediction accuracy for the 20% test case in the 4 iterations it took for the optimal number of clusters to stabilize is 59.1%, 60.9%, 60.8%, and finally 61.5%. This and the relative high level of prediction accuracy by our iterative algorithm is a result of the clustering not being driven by the extent of known functional annotations, which can handicap the clustering process if the data is sparsely or wrongly annotated, but rather by fundamental indicators of cluster goodness. Hence, we expect the clustered results from a deannotation simulated study or a GO permutation study to not severely affect the final results.
Motif identification
Another means of evaluating the effectiveness of our clustering regimen is to determine the extent to which it reveals information regarding the underlying transcriptional network responsible for the observed pattern of transcription. For instance, are genes in specific clusters enriched for motifs corresponding to transcription factors known to be involved in regulation under the conditions tested? Several methods are available to identify motifs enriched in groups of genes and we applied a recent method – FIRE, which uses the concept of mutual information to predict functional motifs from gene expression data [41]. We compare the results obtained on this dataset clustered as described above with that from an expanded version of the same dataset clustered by Kmeans with a cutoff correlation of 0.85 [Zaman & Broach, unpublished]. Both sets of clusters reveal the role of the PAC/RRPE motif, sites for the transcription factors Msn2/4, Mbp1, Rap1 and Rpn4 as well as the 3' RNA binding factor Puf3. The Kmeans clusters also reveal a role for Gcn4, Hap4, Reb1 and Cbf1 while our iterative algorithm identifies Bas1 and Mig1. All these factors have been implicated in glucose regulation and the overlapping but nonidentical results with the two cluster sets indicates the value of interrogating an expression dataset with multiple clustering methods.
Abridged iterative approach
We examine the capabilities of an abridged version of our iterative algorithm. In the standard EP_GOS_Clust algorithm, we allow the formation of singleton clusters as a strategy to identify and isolate genes that are clear outliers. Mathematically, these singleton clusters have pvalues of zero. Since it is reasonable to assume these genes play little or no part in the biological/cellular process of interest, having been put through a rigorous placement process, we drop these genes from future consideration instead of retaining them in the dataset for the iterative sorting process. As a result, the initial iteration on dataset II results in 224 optimal clusters. We then apply a pvalue cutoff of 10^{5} or less as above, confirm the poor correlation of the singleton genes with the other genes, and then remove these from further consideration. We perform the same set of iterations and arrive at 118 as the final optimal number of clusters after only 3 rounds, out of which 31 are singleton clusters. Of these, 44 clusters have pvalues of 10^{5} or less, as compared to 38 when using the full methodology. Nonetheless, Table 4 shows that the abridged method yields comparable results to the complete approach except in placing genes into clusters Bonferronicorrected pvalues of 10^{5} or less. Hence, as long as hypothesis testing errors are not a concern, the abridged proposal is valid and useful.
Minimal expression level for meaningful clustering
Gene expression measurements typically contain a certain amount of noise, which can come from hybridization errors due for instance to chip imperfections, as well as stochastic fluctuations in transcriptional processes. These are typically filtered by the appropriate microarray software, as described in the Methods section. As another area of interest, we then ask ourselves, noise aside, whether there can still be a minimal level of fold change for a particular gene to be considered relevant and can be meaningfully clustered.
We assume that for any particular gene's expression to be relevant, it should show a significant level of expression variation over at least 1–2 experiments. For dataset II, which contains 23 experiments, with 2–5 time points each, this corresponded to about 10% of the time points. Applying an arithmetic mean regression to this dataset yields a 1.7fold change cutoff as the condition in which genes would demonstrate significant variation for at least 10% of all time points [See additional file 3]. From this, we derive a feasible region for foldchange cutoff by combining the constraint brackets of a time point allowance of at least 5–20% and the number of genes showing a reasonable level of intensity variation [See additional file 4]. We then cluster dataset II using several of these foldchange criteria candidates. Specifically, we parametrically test for (a) 1.9fold change for at least 5% of time points – 4317 genes, (b) 1.8fold change for at least 10% of time points – 4045 genes, (c) 1.7fold change for at least 10% of time points – 4346 genes, and (d) 1.6fold change for at least 15% of time points – 4280 genes. The results from this analysis are presented in Tables 5 and 6 and Figure 2.
The first observation from Tables 5 and 6 is that the results of clustering improve significantly after undertaking some form of gene expression relevance threshold. The second observation is that an overstrict and overlenient foldchange threshold results in the respective deletion of useful information or the retention of noninformation. As can be seen from Table 5, even though a cutoff criterion of 1.8fold change for at least 10% of all time points results in the least number of genes and the best clustering economy in placing the genes into a lowest number optimal clusters, it does not result in the best overall correlation between members within the same clusters and clusters of the strongest biological coherence. On the other hand, the criterion of 1.7fold change for at least 10% of the time points appear to produce the tightest clusters with the highest level of biological coherence despite having the largest number of genes as compared to the other foldcriteria tests. This is particularly noticeable from Table 6, where it places the largest proportion of genes into coherent clusters, but also has the smallest proportion of genes leftover for the weaker clusters. In addition, while this criterion does not lead to the lowest number of clusters, the resultant data groupings exhibit the strongest correlation. This leads us to conclude that a 1.7fold cutoff for at least 10% is probably the appropriate screening criteria for dataset II.
We then assess the clusters using a newlyproposed functional genomics gold standard based on an expert curation of the Gene Ontology [42]. This allows us to assess our iterative process using only GO terms that are deemed specific enough to imply a meaningful biological relationship between any two annotated proteins. The iterative clustering of dataset II with a 1.7fold cutoff gives a final 72 optimal clusters, of which 42 are quality clusters with pvalues below the cutoff of 10^{5}. We performed a gold standard GO analysis of these 42 clusters and found that 41 of these returned meaningful annotation results based on this standard, and that the average precision and recall values of these clusters was 25.1% and 15.4% respectively (see Methods section). This significantly exceeds the average background of 1.8%. On the other hand, the average precision and recall values of the nonadjusted version of dataset II of 20.8% and 12.6% respectively, showing that the foldchange cutoff helps in improving the meaningfulness of the clustered results.
Dataset III
Natural genetic variation can cause significant differences in gene expression, and clustering and linkage analysis can yield meaningful insights on how natural polymorphisms affect gene regulation. We test our iterative clustering approach on dataset III, which is obtained by measuring the expression levels of all yeast genes in a laboratory and a wild strain, and in 113 segregants from a cross between them [43, 44]. The dataset consists of multiple replicates of parental strains and single arrays for each of the segregrants using spotted microarrays, yielding 131 experiments each with 7085 gene expression measurements. These genes correspond to 5740 known yeast genes, as well as 489 dubious ORFs and 856 control spots. [See additional file 5]
Unlike the previous two datasets, which were obtained under controlled experimental conditions, dataset III is a much more diverse set of expression conditions, rendering it more challenging to cluster. Clustering dataset III using EP_GOS_Clust results in 49 optimal clusters. We note that even at this level, the cluster correlation and biological coherence compares favorably against previous hierarchical clustering results [44], which for a minimum pairwise correlation of 0.725 yields 763 clusters of at least two genes containing a total of only 2522 genes out of the original 7085. As a test, we first adopt a multistage approach with progressive filtering to cluster it. The rationale for this approach is that successively extracting the higher correlated clusters allows subsequent clustering in the absence of these strong attractors. The scheme we follow and the results of that process are outlined in Figure 3.
The process of sequential clustering allows us to identify a total of 1864 genes in 62 quality clusters (indicated by A, C and E) that annotates for known biological functions, as well as 21 clusters (indicated by B, D and F) that contain 1057 genes of unknown biological processes. This compares favorably both in coverage and compactness with the 2544 genes sorted into 763 clusters by the hierarchical clustering method previously applied to this dataset. Furthermore, application of our iterative clustering approach at two instances (A and E) allowed us to improve cluster quality, in terms of member correlation, cluster precision, and pvalue. The iterative approach also improves the number of genes placed into good clusters. For instance, for group A, the number of genes falling within clusters with pvalues of 10^{6} and below improves from 986 to 1084, while for group E, the number goes from 85 to 164.
We evaluate the clustered results in two ways. First, as described for the previous dataset we apply FIRE to extract regulatory motifs associated with individual clusters. Second, we applied linkage analysis to the clustered expression data to determine whether the pattern of expression in a particular cluster could be associated with an unlinked site in the genome. This process identifies genes whose product would likely be acting in trans to modulate expression of genes in a particular cluster [43, 44]. The results of these analyses are presented in Figure 4 and summarized in Table 7. As evident from this summary, the serial clustering approach yields highly coherent clusters, both in expression correlation and in biological function, even in the third round of clustering (group E). Moreover, analysis of the genes clustered by this method identifies most of the cisacting motifs and the transactive sites that had been extracted from this dataset previously. In addition, the FIRE analysis of the clustered data identifies a number of enriched motifs in the 3' sequence of genes, which have not been extracted previously from this dataset but which correspond to a number of previously noted motifs associated with mRNA stability, including Puf3, Puf4 and PRSE, as well as a motif associated with the PAU gene family of unknown function [45]. As previously recognized, the transacting factor(s) that affect(s) expression of genes in a cluster generally does not correspond to the transcription factor(s) associated with the motifs enriched in genes in that cluster. Rather, these transacting factors predominantly define physiological processes – mating type (MAT) or prototrophy (leu2 versus LEU2), for example – or signaling networks – Ira1, Ira2, Ras1 or Gpa2 – that impinge on the expression patterns indirectly. In this context, though, it is intriguing to note that the same allele variation in a transacting can induce multiple, distinct expression patterns, witnessed by the fact that the same locus is linked to several different clusters. This would suggest that the factor may work in combination with other loci segregating in the cross to yield a variety of transcriptional patterns. Such polygenic effects are expected but have only begun to be rigorously explored.
Next, as a test of our iterative procedure, we subject dataset III to this algorithm without any intermediate filtering or multistage processing. From the initial clusters obtained from the first round of clustering, the iterative procedure results in 106 optimal clusters after 6 iterations, based on a pvalue cutoff of 10^{5}. A subsequent analysis of the clusters reveals results comparable to that obtained from the application of intermediate filtering as described previously. The large clusters that annotate for major functions such as translation, ribosome biogenesis, cellular component organization, and chromatin assembly appear in both instances, and have about 60% of their member genes in common. The smaller clusters that annotate for flocculation, cytokinesis, vitamin B6 metabolism, RNAmediated transposition, and sterol biosynthesis far even better, with 75% of common genes recovered in both cases. The average specificity of the clusters obtained by using just the iterated procedure is 33%, which compares well with the 25–75% specificity range (average of 52%) obtained earlier. We also note the uncovering of several clusters whose majority of genes have unknown biological processes. In both applications, the cluster specificity is high – an average of 33% compared to a range of 40–60% by using multistage clustering with intermediate filters. In addition, the number of clusters obtained (i.e, 106) is similar to the total of 89 quality clusters retained by the previous application. We note however that the optimal clusters obtained by using just the iterated approach without any filters is lower – an average of 0.497 versus 0.656. This and the comparatively lower cluster specificity is understandably the result of the two filters implemented in the former process, including one that involves correlation filtering. Nonetheless, the recovery of several commonfunction clusters with overlapping genes leads us to believe that our iterative approach would be effective in uncovering clusters with a strong level of biological coherence even without applying various levels of additional filtering to a dataset.
Conclusion
Uncovering biological insights from DNA microarray data is a promising but challenging task. Generally, the first step in organizing and analyzing microarray data is clustering genes into groups related by expression patterns, which often then reveals biological coherence. In our study, we observe that the task of placing genes into strongly coherent clusters may not necessarily be achieved with a single round of clustering, no matter how robust the clustering algorithm has proven to be. To address this limitation we formulated a methodology that filters out the genes placed in clusters of weak biological coherence and iteratively seeks the best placements for these genes. We show that on the whole our proposed algorithm unambiguously refines the biological quality of gene clusters. The extent of improvement over the iterations is significant, from a 20% increase in the number of genes in biologically coherent clusters for dataset III to a 40% increase for dataset I. We note also the increasing level of average cluster correlation as the iterative sorting progresses. The latter observation is significant, since cluster correlation is not a factor explicitly used in the algorithm to target clusters for recycling.
We apply this algorithm to a variety of datasets. Two of these datasets are strongly focused on the question of nutritional regulation in yeast. We select the third dataset as a test for this method since, due to experimental design, it is as diverse a collection of expression patterns as any in the literature (Myers C, personal communication). We evaluate our results from the second dataset against those obtained by a number of standard clustering algorithms and find that our method assigns a higher proportion of genes to biologically coherent clusters than any of the others. Moreover, the average expression correlation of genes in the clusters is higher by our method than by any of the others. In addition, subsequent analysis of the clusters reveals a strongly overlapping set of predictions for cisacting regulatory sequences to that obtained from clusters generated by other methods. Our results with the third dataset also compare favorably with other methods. Our algorithm, applied in a multistage fashion, assigns 2921 genes to 83 clusters compared with 2544 genes in 763 clusters previously compiled by hierarchical clustering with a correlation value cutoff. This reduced number of clusters facilitates subsequent analysis of the data while retaining essentially all of the information content, as assessed by analysis of linkage to transacting factors and identification of cisacting regulatory domains.
An issue with incorporating annotative knowledge is that the clustering process can be limited when applied to datasets from sparsely or inaccurately annotated organisms. In this respect, our iterative procedure still performs comparatively well under these situations. On its own, the clustering backbone EP_GOS_Clust compares favorably against several wellknown clustering algorithms based on not just the usual quantitative attributes but also the level of biological coherence based on GO resources. The procedure has as its preprocessing step a complete clique search (itself a rigorous search process) amongst the data points for strong preclusters. In our iterative procedure, GO knowledge is not considered during the clustering itself, but rather in postprocessing and cluster refinement steps. The clustering backbone thus considers only intuitive quantitative attributes, so a lack of functional annotation does not affect the result as adversely as other more involved knowledgebased clustering approaches. The iterations then smooth any distortions brought about by issues with existing annotation databases or measurement inconsistencies.
Based on the results of this study, we believe our work to be valuable in uncovering biologically meaningful data structures. Since the complex functions of a living cell are carried out through a concerted activity of many gene and gene products, it is important to be able to effectively cluster DNA microarray data to uncover functional relationships and regulatory modules to help understand the complex biological mechanisms involved in signaling.
Methods
Notation
We denote the distance measure of a gene i, for i = 1,....,n with k features (or dimensions), for k = 1,....., s as a_{ik}. Each gene expression pattern is transformed into a kdimensional vector, for which each element indicates the change in normalized expression level between time points for each gene, a_{ik}. Each gene is assigned to only one (hard clustering) of c possible clusters, each with center z_{jk}, for j = 1,....,c. The binary variables w_{ij} indicates whether gene i falls within cluster j (w_{ij} = 1, if yes; w_{ij} = 0, if no).
Microarray data processing
The experiments for datasets II and III are carried out using the Agilent microarray platform. All arrays are hybridized and processed using a SSPE wash according to the manufacturer's protocols, and the microarrays are imaged using a Agilent microarray scanner. The images are then extracted with Agilent Feature Extraction version A7.5.1 and the data analyzed with Rosetta Luminator 2.0 gene expression analysis system (Rosetta Informatics, Seattle, WA). Using a rank consistency filter, features are subjected to a combination linear and LOWESS normalization algorithm, the recommended algorithm for this microarray platform. The Agilent microarray scans feed directly into the Princeton University Microarray Database (PUMAdb) [46], where the noise filtering is performed in tandem. PUMAdb is based on the Stanford Microarray Database software package [47]. The filtering process involves identifying the Cy3 and Cy5 intensities above the background levels and picked out features that were either population outliers or nonuniformity outliers.
Handling missing data points
DNA microarray results often contain missing data points for a variety of reasons. In our study, dataset II contains about 0.4% of missing data points. To address this, we examine the row average of the unblemished data points and apply a variant of the KNearest Neighbor algorithm to estimate the missing points [48]. The procedure we use is as follows:

(i)
For a particular gene with missing data points, we calculate its level of similarity with all other genes that have values present for the particular data points. The similarity metric is consistently computed by excluding the data column containing the missing data point. In this respect, we test both the Euclidean distance and Pearson Correlation Coefficient as measures of data similarity and find that both work similarly well in estimating the 'removed' data points.

(ii)
For each gene with missing data points, we create a rankorder list of its similarity with its reference genes. Depending on the spread of similarity level, we estimate the missing data point by taking the sum average of its nearest 10–20 neighbors.
Normalization and preclustering
Preclustering the data helps minimize the computational resources required to solve the hard clustering problem by (i) identifying genes with similar experimental responses, and (ii) removing outliers deemed not to be significant to the clustering process. A straightforward preclustering approach to provide just the adequate amount of discriminatory characteristics so that the genes can be preclustered properly is to reduce the quantities represented in the kdimensional expression vectors into a set of representative variables [+, o, ]. The (+) variable represents an increase in expression level compared to the previous time point, the () variable represents a decrease in expression level from the previous time point, and the (o) variable represents an expression level that does not vary significantly across the time points (which we take for the purpose of this study as a range of ± 10% from the zero normalized value). We can also precluster the expression data by creating a rankordered list of gene proximities based on Euclidean distance or correlation. We then group genes that demonstrate an obvious level of proximity, such as a separation of only at most 1% of the maximum intergene distances. Here, we choose to precluster the proximity genes that form a complete clique. This means the preclustered genes that belong uniquely to only one cluster, or in other words, there is a link between every gene within the same cluster. With this choice, we can perform a maximal clique search by using various levels of preclustering criteria. In preclustering over a range of cutoff values, we are able to select the appropriate criterion as the point where the maximum number of complete cliques is formed [49]. In this study, we first znormalize the expression data across each gene and then compute the feature vector (that is, the change in expression level between time points). The normalization reduces expression data to a standard normal N(0,1) according to:
EP_GOS_Clust clustering algorithm
We perform EP_GOS_Clust clustering by the method described in detail in [4].
The global optimization approach seeks to minimize the sum of the Euclidean distances between the data and the assigned cluster center. There are two sets of variables in the problem, w_{ij} and z_{jk}. While the bounds of w_{ij} are clearly 0 and 1, that of z_{jk} is obtained by observing the range of a_{ik} values.
The basic clustering problem is:
w_{ ij }are binary variables, z_{jk} are continuous variables
The proximity study in preclustering will determine the spread of clusters each gene is suitable for. This can be described by an additional binary parameter suit_{ij}. A data point that has been determined to belong uniquely to just one cluster in the preclustering process will only have suit_{ij} = 1 for only one value of j and zero for the others, whereas a data point restricted to a few clusters will have suit_{ij} = 1 for only those clusters. The suit_{ij} parameters obviate the need for constraints that prevent the redundant reindexing of clusters and help reduce the computational demand required for the clustering process.
The objective function in Problem 1.1, when expanded is:
Together with the necessary firstorder optimality condition:
(i.e., the vector distance sum of all genes within a cluster to the cluster center should be intuitively zero), and the constraint that each gene is allowed to belong to only one cluster, (i.e. $\sum _{j=1}^{c}{w}_{ij}}=1$), the formulation becomes:
The first set of constraints are the necessary optimality conditions, the second demand that each gene can belong to only one cluster, and the third state that there is at least one and no more than (nc+1) data points in a cluster. Note also that the $\sum _{i=1}^{n}{\displaystyle \sum _{k=1}^{s}{a}_{ik}^{2}}$ term in the objective function of Problem 2 is a constant and can be dropped, though for the sake of completeness we will retain the term throughout.
Problem 1.2 is a Mixed Integer Nonlinear Programming (MINLP) problem with bilinear terms w_{ ij }.z_{ jk }in the objective function and the first constraint set. However, MINLP problems are difficult to solve and theoretical advances and prominent algorithms for approaching such problems have been extensively studied. We utilize the Global Optimum Search (GOS) algorithm [31–33] to handle the MINLP formulation. The algorithm decomposes the nonlinear problem into a primal problem and the master problem. The former optimizes the continuous variables while fixing the integer variables and provides an upper bound solution, while the latter optimizes the integer variables while fixing the continuous variables and provides a lower bound solution. The two sequences of upper and lower bounds are iteratively updated until they converge in a finite number of iterations. Even though the algorithm does not have a theoretical guarantee of finding the global optima, its application in a robust clustering procedure has been shown to perform favorably against existing clustering methods when applied to DNA microarray data [31].
The primal problem results from fixing the binary variables to a particular 0–1 combination. Here, w_{ij} is fixed and z_{jk} is solved from the resultant linear programming (LP) problem. The primal problem is given by:
In Problem 2.1, all the other constraints drop out since they do not involve z_{jk}, the variables to be solved in the primal problem. Besides z_{jk}, the Lagrange multipliers λ_{jk}^{m} for each of the constraints above is obtained. The objective function is the upper bound solution. These go into the master problem. The master problem is essentially the problem projected onto the yspace (i.e., that of the binary variables). To expedite the solution of this projection, the dual representation of the master is used. This dual representation is in terms of the supporting Lagrange functions of the projected problem. Also, the optimal solution of the primal problem as well as its Lagrange multipliers can be used for the determination of the support function, which are gradually built up over each successive iteration. The master problem is:
The master problem, which is a Mixed Integer Linear Programming (MILP) problem solves for w_{ij} and μ_{B}, and gives a lower bound solution in the objective function. The w_{ij} solutions then go back into the primal and the process is repeated. A new support function is added to the list of constraints for the master problem with each iteration. Thus in a sense, the support functions for the master problem build up with each iteration, forming a progressively tighter envelope and gradually pushing up the lower bound solution until it converges with the upper bound solution. In addition, after every run of the master problem, in which a solution set for w_{ij} is generated, an integer cut is added for subsequent iterations to prevent redundantly considering that particular solution set again. The cut is expressed as:
In implementing the algorithm we let the initial clusters be defined by a robust set of preclustered genes and compute the centers of these clusters. As a good initialization point we place the remaining genes into the nearest cluster. A proximity study assesses each gene's suitability in an appropriate number of clusters. After the initial GOS [31–33] run, the worst placed gene is removed and used as a seed for a new cluster. This gene has already been subjected to the initial search for parameters so there is no reason for it to belong to any one of the older clusters. Based on the updated centers, the iterative steps are repeated. With each GOS [31–33] cycle, the number of clusters builds up progressively. At each cluster number, we compute the clustering balance, which is the αweighted sum of the intracluster and intercluster error sums. The clustering balance parameter reaches a minimum at the optimum cluster number, when intracluster similarity is maximized and intercluster similarity is minimized. Hence, the incrementing of the cluster number can be stopped once this turning point is reached.
Biological coherence assessment and refinement
We determine the biological coherence of each cluster by assessing whether it contained genes assigned to a common biological function, as specified by Gene Ontology (GO) annotations available through the Saccharomyces Genome Database [50]. This is quantified as the likelihood that the subset of genes in that cluster with a common biological function could have been assigned to that cluster by chance (pvalue). Since a single full run of the EP_GOS_Clust does not necessarily assign a number of genes to clusters of high biological coherence, we employ an iterative approach to clustering by imposing a coherence floor based on pvalues to demarcate genes that have already been well clustered. As described by the original EP_GOS_Clust algorithm, we take these quality clusters as the initial set of clusters. We then compute the Euclidean distance between each of the remaining genes (coming from clusters with low biological coherence) and these initial centers and as a good initialization point placed these genes into the nearest cluster based on $Min\left\{{\displaystyle \sum _{k=1}^{s}{\left({a}_{ik}{z}_{jk}^{initial}\right)}^{2}},\forall j\right\}$. We create a rankorder list for each of these remaining genes for its distance to each of the initial clusters, and for each gene allow its suitability in the nearest clusters via suit_{ij} parameters. The initialization point and suit_{ij} parameter assignments are then utilized in the GOS algorithm all over again. Through these iterations, we offer the poorlyplaced genes an opportunity either to find relevant membership in one of the strongly coherent clusters or to regroup among themselves to form quality clusters. We have found that reiteration of this process eventually yields a saturation point whereby the optimal number of clusters becomes constant of the proportion of genes distributed within clusters of high biological coherence levels off. This algorithm can be summarized by the schematic in Figure 5.
Calculating cluster pvalue
The pvalue of each cluster is a measure of the statistical significance for functional category enrichment. The biological coherence of each cluster is scored according to the percentage of its genes covered by annotations significantly enriched in the cluster in question, and is computed using a hypergeometric distribution. If G is the number of genes annotated to a term and N is the total number of genes in the genome with GO annotations, then the probability p of a randomly selected gene being annotated to a particular GO term is $\raisebox{1ex}{$G$}\!\left/ \!\raisebox{1ex}{$N$}\right.$. Given a cluster of n genes, in which × of them have been annotated to a given GO term, the probability of having × out of n annotations assigned to the same GO term by chance is p^{x}(1  p)^{nx}. Within a list of n genes, there are multiple permutations by which × of them have this annotation. The number of permutations is $\raisebox{1ex}{$n!$}\!\left/ \!\raisebox{1ex}{$x!(nx)!$}\right.$. However, annotations to a particular term are low probability events. Thus instead of calculating the probability of having × of n genes annotated to a term, a more conservative approach is to calculate the probability of × or more of n genes being annotated to a particular term. This is given as $\sum _{j=x}^{n}\left(\raisebox{1ex}{$n!$}\!\left/ \!\raisebox{1ex}{$j!(nj)!$}\right.\right){p}^{j}{(1p)}^{nj}}$. In analyzing the gene clusters, we consider the pvalue of the most significant term associated with each cluster.
Sometimes, a cutoff for the pvalue, known as alpha, is chosen, such that pvalues below the alpha are deemed significant. For instance, an alpha level of 0.05 means that in no more than one in 20 statistical tests will the test show 'something' while there is in fact nothing (a type I error). However, when more than one statistical test is carried out, there is an increasing chance of committing a type I error due to fluctuations. This chance is given by 1(1  α)^{k}, where k is the number of tests (i.e. the number of elements in a cluster). A possible solution known as the Bonferroni correction is to divide the testwise significance level by the number of tests. For instance, given an initial alpha of 0.05, the chance of a type I error if 10 tests are made is 0.4. The Bonferroni correction adjusts the alpha so the overall risk for the tests remains at 0.05, by applying a significance level of 0.005 instead of 0.05. We include both the pvalues provided by the SGD and our Bonferronicorrected values in our manuscript for the sake of completeness.
Filtering dataset III
The first filter applied to dataset III eliminates those genes for which 75% of the values were zero, after background subtraction, or that show a correlation of less than 0.1 with any other gene. The second filter involves computing the expression mean and standard deviation over all segregants (feature points) for all of the 3,760 genes that remain after an initial clustering of Set B data (see Figure 3), which yields a normal distribution. We then determine for each gene what proportion of feature points had values falling outside the data mean ± 0.5 (standard deviation). Plotting the cumulative number of genes with a given proportion of feature points that lie outside this limit versus proportion of feature points, we find that as the proportion of feature points decreased from 100%, the number of genes increases almost linearly until the 77% mark, at which point the number of genes increases essentially exponentially. We take this to signify an increasing bulk of spurious genes and set the cutoff at 77% retained those 206 genes with values between 77% and 100% for further clustering.
Linkage analysis
Linkage analysis is performed on the average phenotype of each cluster. After mean centering each transcript, the average value of all transcript levels within a cluster is calculated and treated as a quantitative phenotype as previously described [44]. Linkage analysis is then performed using the nonparametric test in R/qtl [51] with 2894 previously described markers [43]. Empirical permutation tests [52] are used to determine a false discovery rate (FDR), with a LOD score of 3.4 estimated as a cutoff for a 5% false discovery rate.
'Gold standard' evaluation of functional genomic data
Rigorous analysis of gene clusters for functional annotations requires a quality collection of genomic functional categories. [42] propose a unified standard based on expert curation so that only functional terms that are specific enough to be confirmed or refuted experimentally and yet general enough to provide relevant information for highthroughput assays are considered. The evaluation framework and gold standard can be found at [53]. As performance measures, we use the recall, which is the ratio of the number of relevant records retrieved to the total number of relevant records in the database, and the precision, which is the ratio of the number of relevant records retrieved to the total number of irrelevant and relevant records retrieved. Given the number of true positives (TP) – protein pairs associated by data and annotated as positives in the gold standard, false positives (FP) – protein pairs associated by data and annotated as negatives in the gold standard, and false negatives (FN) – protein pairs not associated by data and annotated as positives in the gold standard, for each cluster, the metrics are given by:
In addition, we introduce as a minimal benchmark of gold standard validation the background, given by:
This is the minimal benchmark as it signifies the probability of drawing a gene annotated for a particular function from the entire genome of interest without even considering the clustering results based on differential DNA expression levels.
Computational resources
All optimization formulations are written in GAMS (General Algebraic Modeling System) [54] and solved using the commercial solver CPLEX 8.0. GAMS is a high level modeling system specifically designed for mathematical optimization. It consists of a language compiler and an integrated high performance solver such as CPLEX, DICOPT, or XPRESS.
References
 1.
Allison DB, Cui X, Page GP, Sabripour M: Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet 2006, 7: 55–65. 10.1038/nrg1749
 2.
Gasch AP, Spellman PT, Kao CM, CarmelHarel O, Eisen MB, Storz G, Botstein D, Brown PO: Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 2000, 11: 4241–4257.
 3.
Lin X, Floudas CA, Wang Y, Broach JR: Theoretical and computational studies of the glucose signaling pathways in yeast using global gene expression data. Biotechnol Bioeng 2003, 84: 864–886. 10.1002/bit.10844
 4.
Sokal RR, Michener CD: A statistical method for evaluating systematic relationships. Univ Kans Sci Bull 1958, 38: 1409–1438.
 5.
Hartigan JA, Wong MA: Algorithm AS 136: A Kmeans clustering algorithm. Appl Stat J Roy St C 1979, 28: 100–108.
 6.
Zhang B, Hsu M, Dayal U: Kharmonic means – A data clustering algorithm. Hewlett Packard Research Laboratory Technical Report 1999.
 7.
Likas A, Vlassis N, Vebeek JL: The global Kmeans clustering algorithm. Pattern Recogn 2003, 36: 451–461. 10.1016/S00313203(02)000602
 8.
Adams WP, Sherali HD: Linearization strategies for a class of zeroone mixed integer programming problems. Operations Research 1990, 38: 217–226.
 9.
Sherali HD: A global optimization RLTbased approach for solving the hard clustering problem. Journal of Global Optimization 2005, 32: 281–306. 10.1007/s1089800427067
 10.
Tan MP, Broach JR, Floudas CA: A novel clustering approach and prediction of optimum number of clusters: Global optimum search with enhanced positioning. Journal of Global Optimization 2007, 39: 323–346. 10.1007/s1089800791406
 11.
Ruspini EH: A new approach to clustering. Inf Control 1969, 15: 22–32. 10.1016/S00199958(69)905919
 12.
Sherali HD, Desai J: A global optimization RLTbased approach for solving the fuzzy clustering problem. Journal of Global Optimization 2005, 33: 597–615. 10.1007/s1089800473900
 13.
Heyer LJ, Kruglyak S, Yooseph S: Exploring expression data: identification and analysis of coexpressed genes. Genome Res 1999, 9: 1106–1115. 10.1101/gr.9.11.1106
 14.
Kohonen T: SelfOrganizing Maps. Berlin: Springer Verlag; 1997.
 15.
Herrero J, Valencia A, Dopazo J: A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics 2001, 17: 126–136. 10.1093/bioinformatics/17.2.126
 16.
Tishby N, Pereira F, Bialek W: The information bottleneck method. Proceedings of the 37th Annual Allerton Conference on Communication, Control Comput 1999, 368–377.
 17.
Slonim N, Atwal GS, Tkacik G, Bialek W: Informationbased clustering. Proc Natl Acad Sci USA 2005, 102: 18297–18302. 10.1073/pnas.0507432102
 18.
Kirkpatrick S, Gelatt CD, Vecchi MP: Optimization by simulated annealing. Science 1983, 220: 671–680. 10.1126/science.220.4598.671
 19.
Lukashin AV, Fuchs R: Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics 2001, 17: 405–414. 10.1093/bioinformatics/17.5.405
 20.
Jiang D, Tang C, Zhang A: Cluster analysis for gene expression data: A survey. IEEE Transactions on Knowledge and Data Engineering 2004, 16: 1370–1386. 10.1109/TKDE.2004.68
 21.
Yeung KY, Haynor DR, Ruzzo WL: Validating clustering for gene expression data. Bioinformatics 2001, 17: 309–318. 10.1093/bioinformatics/17.4.309
 22.
Davies DL, Bouldin DW: A cluster separation measure. IEEE Trans Pattern Anal Machine Intell 1979, 1: 224–227.
 23.
Pavlidis P, Qin J, Arango V, Mann JJ, Sibille E: Using the gene ontology for microarray data mining: a comparison of methods and application to age effects in human prefrontal cortex. Neurochem Res 2004, 29: 1213–1222. 10.1023/B:NERE.0000023608.29741.45
 24.
Guldener U, Munsterkotter M, Kastenmuller G, Strack N, van Helden J, Lemer C, Richelles J, Wodak SJ, GarciaMartinez J, PerezOrtin JE: CYGD: the Comprehensive Yeast Genome Database. Nucleic Acids Res 2005, 33: D364–368. 10.1093/nar/gki053
 25.
The Gene Ontology Consortium: Gene ontology: tool for unification of biology. Nat Genet 2000, 25: 25–29. 10.1038/75556
 26.
Cheng J, Cline M, Martin J, Finkelstein D, Awad T, Kulp D, SianiRose MA: A knowledgebased clustering algorithm driven by gene ontology. J Biopharm Stat 2004, 14: 687–700. 10.1081/BIP200025659
 27.
Pan W: Incorporating gene functions as priors in modelbased clustering of microarray gene expression data. Bioinformatics 2006, 22: 795–801. 10.1093/bioinformatics/btl011
 28.
Komura D, Nakamura H, Tsutsumi S, Aburatani H, Ihara S: Incorporating prior knowledge into clustering of gene expression profiles. 15th International Conference on Genome Informatics 2004.
 29.
DotanCohen D, Melkman AA, Kasif S: Hierarchical tree snipping: clustering guided by prior knowledge. Bioinformatics 2007, 23: 3335–3342. 10.1093/bioinformatics/btm526
 30.
Huang D, Pan W: Incorporating biological knowledge into distancebased clustering analysis of microarray gene expression data. Bioinformatics 2006, 22: 1259–1268. 10.1093/bioinformatics/btl065
 31.
Floudas CA, Aggarwal A, Ciric AR: Global optimum search for non convex NLP and MINLP problems. Comp Chem Eng 1989, 13: 1117–1132. 10.1016/00981354(89)870164
 32.
Paules GE, Floudas CA: APROS – Algorithmic development for discretecontinuous optimization problems. Operations Research 1989, 37: 902–915.
 33.
Floudas CA: Nonlinear and MixedInteger Optimization: Fundamentals and Applications. Oxford: Oxford University Press; 1995.
 34.
Wang Y, Pierce M, Schneper L, Guldal CG, Zhang X, Tavazoie S, Broach JR: Ras and Gpa2 mediate one branch of a redundant glucose signaling pathway in yeast. PLoS Biol 2004, 2: E128. 10.1371/journal.pbio.0020128
 35.
Broach JR, Deschenes RJ: The function of ras genes in Saccharomyces cerevisiae. Adv Cancer Res 1990, 54: 79–139.
 36.
Schneper L, Duvel K, Broach JR: Sense and sensibility: nutritional response and signal integration in yeast. Curr Opin Microbiol 2004, 7: 624–630. 10.1016/j.mib.2004.10.002
 37.
Santangelo GM: Glucose signaling in Saccharomyces cerevisiae. Microbiol Mol Biol Rev 2006, 70: 253–282. 10.1128/MMBR.70.1.253282.2006
 38.
Carlson M: Glucose repression in yeast. Curr Opin Microbiol 1999, 2: 202–207. 10.1016/S13695274(99)800356
 39.
Johnston M, Kim JH: Glucose as a hormone: receptormediated glucose sensing in the yeast Saccharomyces cerevisiae. Biochem Soc Trans 2005, 33: 247–252. 10.1042/BST0330247
 40.
Crauwels M, Donaton MC, Pernambuco MB, Winderickx J, de Winde JH, Thevelein JM: The Sch9 protein kinase in the yeast Saccharomyces cerevisiae controls cAPK activity and is required for nitrogen activation of the fermentablegrowthmediuminduced (FGM) pathway. Microbiology 1997, 143(Pt 8):2627–2637.
 41.
Elemento O, Slonim N, Tavazoie S: Uncovering regulatory elements from expression data using mutual information. Mol Cell, in press.
 42.
Myers CL, Barrett DR, Hibbs MA, Huttenhower C, Troyanskaya OG: Finding function: evaluation methods for functional genomic data. BMC Genomics 2006, 7: 187. 10.1186/147121647187
 43.
Brem RB, Yvert G, Clinton R, Kruglyak L: Genetic dissection of transcriptional regulation in budding yeast. Science 2002, 296: 752–755. 10.1126/science.1069516
 44.
Yvert G, Brem RB, Whittle J, Akey JM, Foss E, Smith EN, Mackelprang R, Kruglyak L: Transacting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors. Nat Genet 2003, 35: 57–64. 10.1038/ng1222
 45.
Foat BC, Houshmandi SS, Olivas WM, Bussemaker HJ: Profiling conditionspecific, genomewide regulation of mRNA stability in yeast. Proc Natl Acad Sci USA 2005, 102: 17675–17680. 10.1073/pnas.0503803102
 46.
The Princeton University Microarray Database[http://puma.princeton.edu]
 47.
Gollub J, Ball Ca, Binkley G, Demeter J, Finkelstein DB, Hebert JM, HernandezBoussard T, Jin H, Kaloper M, Matese JC, Schroeder M, Brown PO, Botstein D, Sherlock G: The Stanford microarray database: data access and quality assessment tools. Nuclei Acids Res 2003, 31: 94–96. 10.1093/nar/gkg078
 48.
Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein S, Altman RB: Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17: 520–525. 10.1093/bioinformatics/17.6.520
 49.
Tan MP, Broach JR, Floudas CA: Evaluation of normalization and preclustering issues in a novel clustering approach: global optimum search with enhanced positioning. J Bioinform Comput Biol 2007, 5: 895–913. 10.1142/S0219720007002941
 50.
The Saccharomyces Genome Database[http://www.yeastgenome.org]
 51.
Broman KW, Wu H, Sen S, Churchill GA: R/qtl: QTL mapping in experimental crosses. Bioinformatics 2003, 19: 889–890. 10.1093/bioinformatics/btg112
 52.
Churchill GA, Doerge RW: Empirical threshold values for quantitative trait mapping. Genetics 1994, 138: 963–971.
 53.
 54.
Brooke A, Kendrick D, Meeraus A: GAMS: A User's Guide. San Francisco: The Scientific Press; 1988.
Acknowledgements
The authors gratefully acknowledge assistance by Drs. Noam Slonim and Olivier Elemento in applying FIRE to our clustered datasets. This work was supported by grants from the National Science Foundation to CAF and JRB, a National Institutes of Health Center of Excellence grant (P50 GM71508), an NIH grant (GM76562) to JRB, and a grant from the U.S. Environmental Protection Agency's STAR Program (R 832721010) to CAF. Although the research described in the article has been funded in part by the U.S. Environmental Protection Agency's STAR program, it has not been subjected to any EPA review and therefore does not necessarily reflect the views of the Agency, and no official endorsement should be inferred.
Author information
Additional information
Authors' contributions
MPT conceived of the study, implemented the computational coding and quantitative analysis, and drafted the manuscript. ENS performed the FIRE analysis and linkage studies of datasets II and III, and refined the manuscript. JRB carried out the microarray experiments for datasets I and II, and refined the manuscript. CAF conceived of the study, analyzed the quantitative results, and refined the manuscript. All authors read and approved the final manuscript.
Meng P Tan, Erin N Smith, James R Broach and Christodoulos A Floudas contributed equally to this work.
Electronic supplementary material
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
About this article
Cite this article
Tan, M.P., Smith, E.N., Broach, J.R. et al. Microarray data mining: A novel optimizationbased approach to uncover biologically coherent structures. BMC Bioinformatics 9, 268 (2008). https://doi.org/10.1186/147121059268
Received:
Accepted:
Published:
Keywords
 Gene Ontology
 Master Problem
 Quality Cluster
 Miss Data Point
 Cluster Correlation