Microarray data mining: A novel optimization-based approach to uncover biologically coherent structures
© Tan et al; licensee BioMed Central Ltd. 2008
Received: 15 November 2007
Accepted: 06 June 2008
Published: 06 June 2008
DNA microarray technology allows for the measurement of genome-wide expression patterns. Within the resultant mass of data lies the problem of analyzing and presenting information on this genomic scale, and a first step towards the rapid and comprehensive interpretation of this data is gene clustering with respect to the expression patterns. Classifying genes into clusters can lead to interesting biological insights. In this study, we describe an iterative clustering approach to uncover biologically coherent structures from DNA microarray data based on a novel clustering algorithm EP_GOS_Clust.
We apply our proposed iterative algorithm to three sets of experimental DNA microarray data from experiments with the yeast Saccharomyces cerevisiae and show that the proposed iterative approach improves biological coherence. Comparison with other clustering techniques suggests that our iterative algorithm provides superior performance with regard to biological coherence. An important consequence of our approach is that an increasing proportion of genes find membership in clusters of high biological coherence and that the average cluster specificity improves.
The results from these clustering experiments provide a robust basis for extracting motifs and trans-acting factors that determine particular patterns of expression. In addition, the biological coherence of the clusters is iteratively assessed independently of the clustering. Thus, this method will not be severely impacted by functional annotations that are missing, inaccurate, or sparse.
DNA microarray technology allows investigators to monitor simultaneously the expression behavior of essentially all the genes within an entire genome and can provide information on gene functions and transcriptional networks. However, the large number of genes and the complexity of the underlying biological networks make extracting this information a formidable task. A common first step to interpret DNA microarray data is to cluster the data on the basis of similarity of expression patterns. Since genes with similar function often show a common expression pattern, clustering genes of known functions with poorly characterized genes provides a means of gaining insights into the functions of the latter . Furthermore, patterns seen in genome-wide expression data can reveal potential gene regulation networks and relate cellular processes to changes in cellular conditions [2, 3]. However, the extent to which clustering reveals useful information about the system under study depends on the extent to which the clustering method successfully groups intrinsically related elements. Example classes of clustering algorithms include (a) single and complete link hierachical clustering , (b) K-family of clustering algorithms [5–7], (c) optimization-based clustering approaches [8–10], (d) fuzzy clustering [11, 12], (e) quality threshold clustering (QTClust) , (f) artificial neural networks for clustering, such as the self-organizing map (SOM)  and a variant that combines the SOM with hierachical clustering, the self-organizing tree algorithm (SOTA) , (g) information-based clustering [16, 17], and (h) stochastic approaches such as clustering by simulated annealing [18, 19]. Some of these algorithms, while novel in their own rights, suffer from certain shortcomings. For instance, the K-clustering methods require a user-specified cluster number. The QTClust approach is computationally expensive and assumes at each iteration that the largest cluster formed is necessarily the best grouping. And the simulated annealing approach requires a consistently good cooling schedule, subjective inputs for the upper bound cutoff distance and the allowable lower bound false negatives, and can become computationally more expensive than exhaustive search algorithms. Whichever the clustering algorithm used, we need an intuitive and relevant tool to first assess the quality and significance of the clusters formed.
One approach to assess cluster quality is based on the idea that if the clustering result reflects robust structures, existing clusters should accurately estimate the appropriate cluster labels for new data points .  extend the prediction strength concept by proposing a figure of merit (FOM) measure, which describes the mean deviation of the gene expression levels with respect to the pertinent cluster centers. Other methods of measuring cluster validity have also been proposed, such as the Davies-Bouldin Validity Index , which is a function of the ratio of the sum of within-cluster scatter to between-cluster separation.
These approaches all rely solely on mathematical coherence to assess cluster quality. However, in uncovering biologically coherent structures from DNA microarray data, categorizing gene clusters on the basis of known functionally related groups is very relevant. This provides useful insights into key biological themes among the genes and enables progress in the studies of gene regulatory networks and signal transduction pathways . The biological coherence of each cluster can be scored according to the percentage of its genes covered by annotations significantly enriched in the cluster in question, using functional classification schemes in Martinsried Institute of Protein Sciences (MIPS) or the Gene Ontology (GO) databases [24, 25] to generate a p-value reflecting the likelihood that such enrichment would happen by chance. Auxiliary to this central issue is the question of rigorously identifying and isolating outlier genes since it is highly probable that only a subset of genes participate in any cellular process or biological studies of interest. On the other hand, genes with similar expression profiles may not have common functional characteristics or expression profiles of genes in the same functional category may still be dissimilar due to the existence of unknown functional sub-categories and sparse or inaccurate functional annotations.
A remedy then is to integrate known biological knowledge into the clustering procedure itself. Most knowledge-based clustering methods directly incorporate GO knowledge into the algorithm. This assumes that the current GO knowledge is correct because genes known to have a similar functional annotation are 'pushed' more closely to one another in a biased fashion. However, this could handicap the clustering process if the organism is sparsely annotated. Methods that work solely on a modified distance measure  can thus over-estimate the distances of the genes with unknown functions, thus depriving them of a chance of being 'discovered'. An alternative is to modify the similarity measure to be a linear combination of the expression profile similarity and functional similarity. This would however not work well with organisms that are not well annotated and is always highly subjective as to the contribution score of each similarity measure. Another approach allows genes sharing common functions to have a common prior probability as compared to genes with different functions . In an earlier work , the clustering process is regulated by must-link constraints, which apply to genes with known common functions, and cannot-link constraints, which apply to genes known not to be associated with one another. This class of knowledge-based clustering algorithms could distort the 'chance' of genes with unknown functions to be fairly clustered. It also assumes that the current GO annotations are accurate and comprehensive enough. A recent novel approach for knowledge-based clustering  involves the use of selectively snipping the edges of a typical hierarchical clustering tree to induce clusters that are maximally consistent with available background information such as functional annotations. This method is tested and reported to outperform another recent knowledge-based clustering method .
In addressing the various aforementioned concerns of many current knowledge-based clustering methods, we choose not to interfere with the measured distances of the genes or data points and instead cluster them based on the fundamental intuition that genes in the same cluster should behave similarly. Then, we introduce a secondary refinement process using the GO annotations to check the level of biological coherence of the clusters. We use the entire set of yeast GO annotations available throughout our study. The iterative nature of our procedure means that if the GO annotations are somehow wrong to begin with, subsequent iterations of our algorithm will still show a strong persistence in pushing seemingly 'unrelated' genes together, thus giving a hint that maybe these genes should have a common function after all. If the GO annotations are unknown, then the unknown genes would be still fairly clustered with their counterparts, as their measured experimental behavior has not been tampered with. In organisms that are sparsely annotated, the emphasis on these intuitive factors such as unaltered distances and response correlation provides opportunities for (a) genes with known functions and verified with similar experimental behavior to be clustered together and (b) genes with unknown functions to find cluster membership in clusters with known functions if they correlate well or in clusters with unknown functions containing counterparts that show similar expression behavior.
Our iterative clustering approach has as its backbone an optimization-based clustering algorithm, EP_GOS_Clust, presented in an earlier paper . The algorithm is based on a variant of the GBD algorithm, the Global Optimum Search (GOS) [31–33]. This is a robust algorithm that compares favorably with many commonly-used clustering algorithm in defining clusters that are as dissimilar from one another as possible (that is, a large inter-cluster error sum) while assigning members that are as similar to one another as possible into the same cluster (that is, a small intra-cluster error sum). It also incorporates a methodology to predict the optimal number of clusters for a given dataset. In measuring the biological coherence of the clusters, we assert that the appropriate performance indicators are the cluster p-values and the proportion of genes that are in clusters of high coherence quality. The first reflects the functional richness of a cluster and is an intuitive measure of coherence, while the second reflects the potential to extract the maximum amount of useful information for subsequent studies of motif analysis and regulatory structure searches. We also look at cluster correlation and functional specificity as consistency checks. We then test our proposed method on three datasets of actual DNA microarray expression results.
Results and discussion
Outline of proposed approach for uncovering biological coherence
Our iterative approach extends a recently-proposed algorithm, the EP_GOS_Clust, as a backbone for clustering the expression data. Specifically, we perform an initial clustering run on a given data as previously described to reach the optimal number of clusters  and then apply a GO analysis of the data to obtain a preliminary assessment of the level of biological coherence. Those clusters that exhibit better biological coherence than a prescribed benchmark value are retained as seeds for the next round of clustering. As a consistency check, we also look at cluster correlation and specificity. Those genes that fall into clusters lacking the benchmark level of coherence are subjected to a subsequent round of EP_GOS_Clust clustering, in which each gene is placed either in an existing cluster having a similar expression profile or aggregated with other unclustered genes having similar expression profiles to form a new cluster. This overall process is repeated until we observe an asymptotic saturation either in the optimal number of clusters or the proportion of genes that are placed into clusters with strong levels of biological coherence. The approach is described in greater detail in the Methods section.
Dataset I consists of DNA microarray data obtained from a study in the role of the Ras/protein kinase A pathway (PKA) on glucose signaling in yeast . The Ras/PKA signal transduction pathways provide a major conduit to couple cellular responses to the availability of carbon sources such as glucose [35–37].  measure the levels of RNA for each of the 6237 yeast genes over time in wild type and various mutant strains following glucose addition to cells grown on a non-fermentable carbon source. These experiments are designed to assess the extent to which the Ras/PKA pathway mediated transcriptional effects induced by glucose. [See additional file 5]
Each of the eight test and control experiments consist of four time points over a hour period, yielding 32 data points for each of the 6237 genes. Before clustering the array data, we filter the data to remove unreliable data. In particular, we retained all genes for which all the time points are deemed present by the Affymetrix software suite (4105 genes), all the genes for which greater than 50% of the time points are deemed present and all the genes for which the present/absent calls exhibit a consistent and biologically relevant pattern (e.g., PAAA for the four time points in the experiment, indicating repression of expression of that gene over the course of the experiment). In all, we retain 5652 genes.
Comparison of biological coherence of clusters obtained for dataset I by different clustering algorithms
Proportion of Genes (%) in Clusters of p-values
We examine a second dataset derived from experiments designed to determine comprehensively the contribution of different signaling pathways to the glucose response in yeast. In addition to the Ras/PKA pathway, at least three other signaling pathways mediate transcriptional changes attendant on addition of glucose to cells [36, 37]. In one pathway, glucose addition leads to reduced activity of the AMP-activated protein kinase encoded by SNF1, thereby unfettering a transcriptional repressor encoded by Mig1/2 and suppressing several transcriptional activators [38, 39]. The second pathway, mediated by Rgt1, couples expression of the hexose transporter genes to the level of available glucose. Binding of glucose to plasma membrane glucose sensors promotes degradation of repressors of hexose transporter genes by the SCFGrr1 ubiquitin ligase complex, allowing Rgt1 activation of these genes. Finally, the AGC kinase Sch9 acts in parallel to PKA to induce expression of many genes normally regulated by PKA . To test the roles of each of these pathways, we measure expression changes following glucose induction in wild type and mutant cells lacking specific components of the different pathways. Levels of RNA for each of the 6237 yeast genes in each of the RNA samples are assayed using Agilent microarray chips, out of which measurable signals are registered from 5657 of them. This dataset consists of results from 23 time course experiments, each with 2–5 time points. [See additional file 5]
Improvement in biological coherence after iterative clustering of dataset II
Proportion of Genes (%)
p-value <= 10-4
p-value <= 10-5
p-value <= 10-4
p-value <= 10-5
At completion of the iterative process, we obtain 38 clusters with p-values of 10-5 or less, containing 4747 (or 84%) of the original 5657 genes. We note that these clusters exhibit a tight grouping, as evidenced by a visual inspection of the gene expression time course plots [See additional files 1 and 2], as well as the relatively high values of correlation coefficients for these clusters (an average value of 0.667 over all clusters, with a maximum of 0.925 and a minimum of 0.387).
Comparison of clustering methods for dataset II
Comparison of biological coherence of clusters obtained for dataset II by different clustering algorithms
Proportion of Genes (%) in Clusters of p-values
Function prediction based on expression profiles
An application of any clustering approach is the ability to predict the functions of unknown genes by clustering them together with counterparts with known functions. This is a particularly important consideration when working with organisms that are not as well-annotated as yeast. We test the capability of our proposed iterative procedure in this respect through a simulated study. We do this by randomly de-annotating 20% and 30% of the genes in dataset II. We then apply our iterative clustering approach to the entire dataset II and take into consideration the entire set of functional annotations reported on the SGD. Naturally at each iterative step, the clusters are scored as if the de-annotated genes have no known function, thus affecting the p-value. As a result, we find that our iterative procedure demonstrates a 61.5% level of prediction accuracy for the dataset with 20% de-annotation and 50.4% for the dataset with 30% de-annotation. We feel these results compare favorably to the other knowledge-based clustering methods reported in the literature [29, 30]. These methods report a prediction accuracy of between 30–70%. However, these accuracies are found using restricted datasets of only 2–3 function classes (each containing 200–300 genes) and clustering is done up to 10–14 clusters. Furthermore, we observe there is very little variation in the percentage of prediction accuracy as our algorithm steps through the iterations. As a case in point, the prediction accuracy for the 20% test case in the 4 iterations it took for the optimal number of clusters to stabilize is 59.1%, 60.9%, 60.8%, and finally 61.5%. This and the relative high level of prediction accuracy by our iterative algorithm is a result of the clustering not being driven by the extent of known functional annotations, which can handicap the clustering process if the data is sparsely or wrongly annotated, but rather by fundamental indicators of cluster goodness. Hence, we expect the clustered results from a de-annotation simulated study or a GO permutation study to not severely affect the final results.
Another means of evaluating the effectiveness of our clustering regimen is to determine the extent to which it reveals information regarding the underlying transcriptional network responsible for the observed pattern of transcription. For instance, are genes in specific clusters enriched for motifs corresponding to transcription factors known to be involved in regulation under the conditions tested? Several methods are available to identify motifs enriched in groups of genes and we applied a recent method – FIRE, which uses the concept of mutual information to predict functional motifs from gene expression data . We compare the results obtained on this dataset clustered as described above with that from an expanded version of the same dataset clustered by K-means with a cutoff correlation of 0.85 [Zaman & Broach, unpublished]. Both sets of clusters reveal the role of the PAC/RRPE motif, sites for the transcription factors Msn2/4, Mbp1, Rap1 and Rpn4 as well as the 3' RNA binding factor Puf3. The K-means clusters also reveal a role for Gcn4, Hap4, Reb1 and Cbf1 while our iterative algorithm identifies Bas1 and Mig1. All these factors have been implicated in glucose regulation and the overlapping but non-identical results with the two cluster sets indicates the value of interrogating an expression dataset with multiple clustering methods.
Abridged iterative approach
Comparison of cluster coherence between the full and abridged versions of the iterative clustering algorithm
Proportion of Genes (%) – Final Iteration
p-value <= 10-4
p-value <= 10-5
p-value <= 10-4
p-value <= 10-5
Minimal expression level for meaningful clustering
Gene expression measurements typically contain a certain amount of noise, which can come from hybridization errors due for instance to chip imperfections, as well as stochastic fluctuations in transcriptional processes. These are typically filtered by the appropriate microarray software, as described in the Methods section. As another area of interest, we then ask ourselves, noise aside, whether there can still be a minimal level of fold change for a particular gene to be considered relevant and can be meaningfully clustered.
Effect of imposing minimal gene expression levels on cluster correlation
1.9-fold, at least 5% time points
1.8-fold, at least 10% time points
1.7-fold, at least 10% time points
1.6-fold, at least 20% time points
Number of Genes
Effect of imposing minimal gene expression levels on biological coherence
1.9-fold, at least 5% time points
1.8-fold, at least 10% time points
1.7-fold, at least 10% time points
1.6-fold, at least 20% time points
% of Genes in Clusters with -log10(P) Values Ranges
> 0 and < 2
Comparison of -log10(P) Values
The first observation from Tables 5 and 6 is that the results of clustering improve significantly after undertaking some form of gene expression relevance threshold. The second observation is that an over-strict and over-lenient fold-change threshold results in the respective deletion of useful information or the retention of non-information. As can be seen from Table 5, even though a cut-off criterion of 1.8-fold change for at least 10% of all time points results in the least number of genes and the best clustering economy in placing the genes into a lowest number optimal clusters, it does not result in the best overall correlation between members within the same clusters and clusters of the strongest biological coherence. On the other hand, the criterion of 1.7-fold change for at least 10% of the time points appear to produce the tightest clusters with the highest level of biological coherence despite having the largest number of genes as compared to the other fold-criteria tests. This is particularly noticeable from Table 6, where it places the largest proportion of genes into coherent clusters, but also has the smallest proportion of genes leftover for the weaker clusters. In addition, while this criterion does not lead to the lowest number of clusters, the resultant data groupings exhibit the strongest correlation. This leads us to conclude that a 1.7-fold cut-off for at least 10% is probably the appropriate screening criteria for dataset II.
We then assess the clusters using a newly-proposed functional genomics gold standard based on an expert curation of the Gene Ontology . This allows us to assess our iterative process using only GO terms that are deemed specific enough to imply a meaningful biological relationship between any two annotated proteins. The iterative clustering of dataset II with a 1.7-fold cutoff gives a final 72 optimal clusters, of which 42 are quality clusters with p-values below the cut-off of 10-5. We performed a gold standard GO analysis of these 42 clusters and found that 41 of these returned meaningful annotation results based on this standard, and that the average precision and recall values of these clusters was 25.1% and 15.4% respectively (see Methods section). This significantly exceeds the average background of 1.8%. On the other hand, the average precision and recall values of the non-adjusted version of dataset II of 20.8% and 12.6% respectively, showing that the fold-change cutoff helps in improving the meaningfulness of the clustered results.
Natural genetic variation can cause significant differences in gene expression, and clustering and linkage analysis can yield meaningful insights on how natural polymorphisms affect gene regulation. We test our iterative clustering approach on dataset III, which is obtained by measuring the expression levels of all yeast genes in a laboratory and a wild strain, and in 113 segregants from a cross between them [43, 44]. The dataset consists of multiple replicates of parental strains and single arrays for each of the segregrants using spotted microarrays, yielding 131 experiments each with 7085 gene expression measurements. These genes correspond to 5740 known yeast genes, as well as 489 dubious ORFs and 856 control spots. [See additional file 5]
The process of sequential clustering allows us to identify a total of 1864 genes in 62 quality clusters (indicated by A, C and E) that annotates for known biological functions, as well as 21 clusters (indicated by B, D and F) that contain 1057 genes of unknown biological processes. This compares favorably both in coverage and compactness with the 2544 genes sorted into 763 clusters by the hierarchical clustering method previously applied to this dataset. Furthermore, application of our iterative clustering approach at two instances (A and E) allowed us to improve cluster quality, in terms of member correlation, cluster precision, and p-value. The iterative approach also improves the number of genes placed into good clusters. For instance, for group A, the number of genes falling within clusters with p-values of 10-6 and below improves from 986 to 1084, while for group E, the number goes from 85 to 164.
Summary of clustering results with dataset III
Next, as a test of our iterative procedure, we subject dataset III to this algorithm without any intermediate filtering or multi-stage processing. From the initial clusters obtained from the first round of clustering, the iterative procedure results in 106 optimal clusters after 6 iterations, based on a p-value cutoff of 10-5. A subsequent analysis of the clusters reveals results comparable to that obtained from the application of intermediate filtering as described previously. The large clusters that annotate for major functions such as translation, ribosome biogenesis, cellular component organization, and chromatin assembly appear in both instances, and have about 60% of their member genes in common. The smaller clusters that annotate for flocculation, cytokinesis, vitamin B6 metabolism, RNA-mediated transposition, and sterol biosynthesis far even better, with 75% of common genes recovered in both cases. The average specificity of the clusters obtained by using just the iterated procedure is 33%, which compares well with the 25–75% specificity range (average of 52%) obtained earlier. We also note the uncovering of several clusters whose majority of genes have unknown biological processes. In both applications, the cluster specificity is high – an average of 33% compared to a range of 40–60% by using multi-stage clustering with intermediate filters. In addition, the number of clusters obtained (i.e, 106) is similar to the total of 89 quality clusters retained by the previous application. We note however that the optimal clusters obtained by using just the iterated approach without any filters is lower – an average of 0.497 versus 0.656. This and the comparatively lower cluster specificity is understandably the result of the two filters implemented in the former process, including one that involves correlation filtering. Nonetheless, the recovery of several common-function clusters with overlapping genes leads us to believe that our iterative approach would be effective in uncovering clusters with a strong level of biological coherence even without applying various levels of additional filtering to a dataset.
Uncovering biological insights from DNA microarray data is a promising but challenging task. Generally, the first step in organizing and analyzing microarray data is clustering genes into groups related by expression patterns, which often then reveals biological coherence. In our study, we observe that the task of placing genes into strongly coherent clusters may not necessarily be achieved with a single round of clustering, no matter how robust the clustering algorithm has proven to be. To address this limitation we formulated a methodology that filters out the genes placed in clusters of weak biological coherence and iteratively seeks the best placements for these genes. We show that on the whole our proposed algorithm unambiguously refines the biological quality of gene clusters. The extent of improvement over the iterations is significant, from a 20% increase in the number of genes in biologically coherent clusters for dataset III to a 40% increase for dataset I. We note also the increasing level of average cluster correlation as the iterative sorting progresses. The latter observation is significant, since cluster correlation is not a factor explicitly used in the algorithm to target clusters for recycling.
We apply this algorithm to a variety of datasets. Two of these datasets are strongly focused on the question of nutritional regulation in yeast. We select the third dataset as a test for this method since, due to experimental design, it is as diverse a collection of expression patterns as any in the literature (Myers C, personal communication). We evaluate our results from the second dataset against those obtained by a number of standard clustering algorithms and find that our method assigns a higher proportion of genes to biologically coherent clusters than any of the others. Moreover, the average expression correlation of genes in the clusters is higher by our method than by any of the others. In addition, subsequent analysis of the clusters reveals a strongly overlapping set of predictions for cis-acting regulatory sequences to that obtained from clusters generated by other methods. Our results with the third dataset also compare favorably with other methods. Our algorithm, applied in a multi-stage fashion, assigns 2921 genes to 83 clusters compared with 2544 genes in 763 clusters previously compiled by hierarchical clustering with a correlation value cutoff. This reduced number of clusters facilitates subsequent analysis of the data while retaining essentially all of the information content, as assessed by analysis of linkage to trans-acting factors and identification of cis-acting regulatory domains.
An issue with incorporating annotative knowledge is that the clustering process can be limited when applied to datasets from sparsely or inaccurately annotated organisms. In this respect, our iterative procedure still performs comparatively well under these situations. On its own, the clustering backbone EP_GOS_Clust compares favorably against several well-known clustering algorithms based on not just the usual quantitative attributes but also the level of biological coherence based on GO resources. The procedure has as its pre-processing step a complete clique search (itself a rigorous search process) amongst the data points for strong pre-clusters. In our iterative procedure, GO knowledge is not considered during the clustering itself, but rather in post-processing and cluster refinement steps. The clustering backbone thus considers only intuitive quantitative attributes, so a lack of functional annotation does not affect the result as adversely as other more involved knowledge-based clustering approaches. The iterations then smooth any distortions brought about by issues with existing annotation databases or measurement inconsistencies.
Based on the results of this study, we believe our work to be valuable in uncovering biologically meaningful data structures. Since the complex functions of a living cell are carried out through a concerted activity of many gene and gene products, it is important to be able to effectively cluster DNA microarray data to uncover functional relationships and regulatory modules to help understand the complex biological mechanisms involved in signaling.
We denote the distance measure of a gene i, for i = 1,....,n with k features (or dimensions), for k = 1,....., s as aik. Each gene expression pattern is transformed into a k-dimensional vector, for which each element indicates the change in normalized expression level between time points for each gene, aik. Each gene is assigned to only one (hard clustering) of c possible clusters, each with center zjk, for j = 1,....,c. The binary variables wij indicates whether gene i falls within cluster j (wij = 1, if yes; wij = 0, if no).
Microarray data processing
The experiments for datasets II and III are carried out using the Agilent microarray platform. All arrays are hybridized and processed using a SSPE wash according to the manufacturer's protocols, and the microarrays are imaged using a Agilent microarray scanner. The images are then extracted with Agilent Feature Extraction version A7.5.1 and the data analyzed with Rosetta Luminator 2.0 gene expression analysis system (Rosetta Informatics, Seattle, WA). Using a rank consistency filter, features are subjected to a combination linear and LOWESS normalization algorithm, the recommended algorithm for this microarray platform. The Agilent microarray scans feed directly into the Princeton University Microarray Database (PUMAdb) , where the noise filtering is performed in tandem. PUMAdb is based on the Stanford Microarray Database software package . The filtering process involves identifying the Cy3 and Cy5 intensities above the background levels and picked out features that were either population outliers or non-uniformity outliers.
Handling missing data points
For a particular gene with missing data points, we calculate its level of similarity with all other genes that have values present for the particular data points. The similarity metric is consistently computed by excluding the data column containing the missing data point. In this respect, we test both the Euclidean distance and Pearson Correlation Coefficient as measures of data similarity and find that both work similarly well in estimating the 'removed' data points.
For each gene with missing data points, we create a rank-order list of its similarity with its reference genes. Depending on the spread of similarity level, we estimate the missing data point by taking the sum average of its nearest 10–20 neighbors.
Normalization and pre-clustering
EP_GOS_Clust clustering algorithm
We perform EP_GOS_Clust clustering by the method described in detail in .
w ij are binary variables, zjk are continuous variables
The proximity study in pre-clustering will determine the spread of clusters each gene is suitable for. This can be described by an additional binary parameter suitij. A data point that has been determined to belong uniquely to just one cluster in the pre-clustering process will only have suitij = 1 for only one value of j and zero for the others, whereas a data point restricted to a few clusters will have suitij = 1 for only those clusters. The suitij parameters obviate the need for constraints that prevent the redundant re-indexing of clusters and help reduce the computational demand required for the clustering process.
The first set of constraints are the necessary optimality conditions, the second demand that each gene can belong to only one cluster, and the third state that there is at least one and no more than (n-c+1) data points in a cluster. Note also that the term in the objective function of Problem 2 is a constant and can be dropped, though for the sake of completeness we will retain the term throughout.
Problem 1.2 is a Mixed Integer Nonlinear Programming (MINLP) problem with bilinear terms w ij .z jk in the objective function and the first constraint set. However, MINLP problems are difficult to solve and theoretical advances and prominent algorithms for approaching such problems have been extensively studied. We utilize the Global Optimum Search (GOS) algorithm [31–33] to handle the MINLP formulation. The algorithm decomposes the nonlinear problem into a primal problem and the master problem. The former optimizes the continuous variables while fixing the integer variables and provides an upper bound solution, while the latter optimizes the integer variables while fixing the continuous variables and provides a lower bound solution. The two sequences of upper and lower bounds are iteratively updated until they converge in a finite number of iterations. Even though the algorithm does not have a theoretical guarantee of finding the global optima, its application in a robust clustering procedure has been shown to perform favorably against existing clustering methods when applied to DNA microarray data .
In implementing the algorithm we let the initial clusters be defined by a robust set of pre-clustered genes and compute the centers of these clusters. As a good initialization point we place the remaining genes into the nearest cluster. A proximity study assesses each gene's suitability in an appropriate number of clusters. After the initial GOS [31–33] run, the worst placed gene is removed and used as a seed for a new cluster. This gene has already been subjected to the initial search for parameters so there is no reason for it to belong to any one of the older clusters. Based on the updated centers, the iterative steps are repeated. With each GOS [31–33] cycle, the number of clusters builds up progressively. At each cluster number, we compute the clustering balance, which is the α-weighted sum of the intra-cluster and inter-cluster error sums. The clustering balance parameter reaches a minimum at the optimum cluster number, when intra-cluster similarity is maximized and inter-cluster similarity is minimized. Hence, the incrementing of the cluster number can be stopped once this turning point is reached.
Biological coherence assessment and refinement
Calculating cluster p-value
The p-value of each cluster is a measure of the statistical significance for functional category enrichment. The biological coherence of each cluster is scored according to the percentage of its genes covered by annotations significantly enriched in the cluster in question, and is computed using a hypergeometric distribution. If G is the number of genes annotated to a term and N is the total number of genes in the genome with GO annotations, then the probability p of a randomly selected gene being annotated to a particular GO term is . Given a cluster of n genes, in which × of them have been annotated to a given GO term, the probability of having × out of n annotations assigned to the same GO term by chance is p x (1 - p)n-x. Within a list of n genes, there are multiple permutations by which × of them have this annotation. The number of permutations is . However, annotations to a particular term are low probability events. Thus instead of calculating the probability of having × of n genes annotated to a term, a more conservative approach is to calculate the probability of × or more of n genes being annotated to a particular term. This is given as . In analyzing the gene clusters, we consider the p-value of the most significant term associated with each cluster.
Sometimes, a cut-off for the p-value, known as alpha, is chosen, such that p-values below the alpha are deemed significant. For instance, an alpha level of 0.05 means that in no more than one in 20 statistical tests will the test show 'something' while there is in fact nothing (a type I error). However, when more than one statistical test is carried out, there is an increasing chance of committing a type I error due to fluctuations. This chance is given by 1-(1 - α) k , where k is the number of tests (i.e. the number of elements in a cluster). A possible solution known as the Bonferroni correction is to divide the test-wise significance level by the number of tests. For instance, given an initial alpha of 0.05, the chance of a type I error if 10 tests are made is 0.4. The Bonferroni correction adjusts the alpha so the overall risk for the tests remains at 0.05, by applying a significance level of 0.005 instead of 0.05. We include both the p-values provided by the SGD and our Bonferroni-corrected values in our manuscript for the sake of completeness.
Filtering dataset III
The first filter applied to dataset III eliminates those genes for which 75% of the values were zero, after background subtraction, or that show a correlation of less than 0.1 with any other gene. The second filter involves computing the expression mean and standard deviation over all segregants (feature points) for all of the 3,760 genes that remain after an initial clustering of Set B data (see Figure 3), which yields a normal distribution. We then determine for each gene what proportion of feature points had values falling outside the data mean ± 0.5 (standard deviation). Plotting the cumulative number of genes with a given proportion of feature points that lie outside this limit versus proportion of feature points, we find that as the proportion of feature points decreased from 100%, the number of genes increases almost linearly until the 77% mark, at which point the number of genes increases essentially exponentially. We take this to signify an increasing bulk of spurious genes and set the cut-off at 77% retained those 206 genes with values between 77% and 100% for further clustering.
Linkage analysis is performed on the average phenotype of each cluster. After mean centering each transcript, the average value of all transcript levels within a cluster is calculated and treated as a quantitative phenotype as previously described . Linkage analysis is then performed using the nonparametric test in R/qtl  with 2894 previously described markers . Empirical permutation tests  are used to determine a false discovery rate (FDR), with a LOD score of 3.4 estimated as a cutoff for a 5% false discovery rate.
'Gold standard' evaluation of functional genomic data
This is the minimal benchmark as it signifies the probability of drawing a gene annotated for a particular function from the entire genome of interest without even considering the clustering results based on differential DNA expression levels.
All optimization formulations are written in GAMS (General Algebraic Modeling System)  and solved using the commercial solver CPLEX 8.0. GAMS is a high level modeling system specifically designed for mathematical optimization. It consists of a language compiler and an integrated high performance solver such as CPLEX, DICOPT, or XPRESS.
The authors gratefully acknowledge assistance by Drs. Noam Slonim and Olivier Elemento in applying FIRE to our clustered datasets. This work was supported by grants from the National Science Foundation to CAF and JRB, a National Institutes of Health Center of Excellence grant (P50 GM71508), an NIH grant (GM76562) to JRB, and a grant from the U.S. Environmental Protection Agency's STAR Program (R 832721-010) to CAF. Although the research described in the article has been funded in part by the U.S. Environmental Protection Agency's STAR program, it has not been subjected to any EPA review and therefore does not necessarily reflect the views of the Agency, and no official endorsement should be inferred.
- Allison DB, Cui X, Page GP, Sabripour M: Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet 2006, 7: 55–65. 10.1038/nrg1749View ArticlePubMedGoogle Scholar
- Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO: Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 2000, 11: 4241–4257.PubMed CentralView ArticlePubMedGoogle Scholar
- Lin X, Floudas CA, Wang Y, Broach JR: Theoretical and computational studies of the glucose signaling pathways in yeast using global gene expression data. Biotechnol Bioeng 2003, 84: 864–886. 10.1002/bit.10844View ArticlePubMedGoogle Scholar
- Sokal RR, Michener CD: A statistical method for evaluating systematic relationships. Univ Kans Sci Bull 1958, 38: 1409–1438.Google Scholar
- Hartigan JA, Wong MA: Algorithm AS 136: A K-means clustering algorithm. Appl Stat J Roy St C 1979, 28: 100–108.Google Scholar
- Zhang B, Hsu M, Dayal U: K-harmonic means – A data clustering algorithm. Hewlett Packard Research Laboratory Technical Report 1999.Google Scholar
- Likas A, Vlassis N, Vebeek JL: The global K-means clustering algorithm. Pattern Recogn 2003, 36: 451–461. 10.1016/S0031-3203(02)00060-2View ArticleGoogle Scholar
- Adams WP, Sherali HD: Linearization strategies for a class of zero-one mixed integer programming problems. Operations Research 1990, 38: 217–226.View ArticleGoogle Scholar
- Sherali HD: A global optimization RLT-based approach for solving the hard clustering problem. Journal of Global Optimization 2005, 32: 281–306. 10.1007/s10898-004-2706-7View ArticleGoogle Scholar
- Tan MP, Broach JR, Floudas CA: A novel clustering approach and prediction of optimum number of clusters: Global optimum search with enhanced positioning. Journal of Global Optimization 2007, 39: 323–346. 10.1007/s10898-007-9140-6View ArticleGoogle Scholar
- Ruspini EH: A new approach to clustering. Inf Control 1969, 15: 22–32. 10.1016/S0019-9958(69)90591-9View ArticleGoogle Scholar
- Sherali HD, Desai J: A global optimization RLT-based approach for solving the fuzzy clustering problem. Journal of Global Optimization 2005, 33: 597–615. 10.1007/s10898-004-7390-0View ArticleGoogle Scholar
- Heyer LJ, Kruglyak S, Yooseph S: Exploring expression data: identification and analysis of co-expressed genes. Genome Res 1999, 9: 1106–1115. 10.1101/gr.9.11.1106PubMed CentralView ArticlePubMedGoogle Scholar
- Kohonen T: Self-Organizing Maps. Berlin: Springer Verlag; 1997.View ArticleGoogle Scholar
- Herrero J, Valencia A, Dopazo J: A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics 2001, 17: 126–136. 10.1093/bioinformatics/17.2.126View ArticlePubMedGoogle Scholar
- Tishby N, Pereira F, Bialek W: The information bottleneck method. Proceedings of the 37th Annual Allerton Conference on Communication, Control Comput 1999, 368–377.Google Scholar
- Slonim N, Atwal GS, Tkacik G, Bialek W: Information-based clustering. Proc Natl Acad Sci USA 2005, 102: 18297–18302. 10.1073/pnas.0507432102PubMed CentralView ArticlePubMedGoogle Scholar
- Kirkpatrick S, Gelatt CD, Vecchi MP: Optimization by simulated annealing. Science 1983, 220: 671–680. 10.1126/science.220.4598.671View ArticlePubMedGoogle Scholar
- Lukashin AV, Fuchs R: Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics 2001, 17: 405–414. 10.1093/bioinformatics/17.5.405View ArticlePubMedGoogle Scholar
- Jiang D, Tang C, Zhang A: Cluster analysis for gene expression data: A survey. IEEE Transactions on Knowledge and Data Engineering 2004, 16: 1370–1386. 10.1109/TKDE.2004.68View ArticleGoogle Scholar
- Yeung KY, Haynor DR, Ruzzo WL: Validating clustering for gene expression data. Bioinformatics 2001, 17: 309–318. 10.1093/bioinformatics/17.4.309View ArticlePubMedGoogle Scholar
- Davies DL, Bouldin DW: A cluster separation measure. IEEE Trans Pattern Anal Machine Intell 1979, 1: 224–227.View ArticleGoogle Scholar
- Pavlidis P, Qin J, Arango V, Mann JJ, Sibille E: Using the gene ontology for microarray data mining: a comparison of methods and application to age effects in human prefrontal cortex. Neurochem Res 2004, 29: 1213–1222. 10.1023/B:NERE.0000023608.29741.45View ArticlePubMedGoogle Scholar
- Guldener U, Munsterkotter M, Kastenmuller G, Strack N, van Helden J, Lemer C, Richelles J, Wodak SJ, Garcia-Martinez J, Perez-Ortin JE: CYGD: the Comprehensive Yeast Genome Database. Nucleic Acids Res 2005, 33: D364–368. 10.1093/nar/gki053PubMed CentralView ArticlePubMedGoogle Scholar
- The Gene Ontology Consortium: Gene ontology: tool for unification of biology. Nat Genet 2000, 25: 25–29. 10.1038/75556PubMed CentralView ArticleGoogle Scholar
- Cheng J, Cline M, Martin J, Finkelstein D, Awad T, Kulp D, Siani-Rose MA: A knowledge-based clustering algorithm driven by gene ontology. J Biopharm Stat 2004, 14: 687–700. 10.1081/BIP-200025659View ArticlePubMedGoogle Scholar
- Pan W: Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics 2006, 22: 795–801. 10.1093/bioinformatics/btl011View ArticlePubMedGoogle Scholar
- Komura D, Nakamura H, Tsutsumi S, Aburatani H, Ihara S: Incorporating prior knowledge into clustering of gene expression profiles. 15th International Conference on Genome Informatics 2004.Google Scholar
- Dotan-Cohen D, Melkman AA, Kasif S: Hierarchical tree snipping: clustering guided by prior knowledge. Bioinformatics 2007, 23: 3335–3342. 10.1093/bioinformatics/btm526View ArticlePubMedGoogle Scholar
- Huang D, Pan W: Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data. Bioinformatics 2006, 22: 1259–1268. 10.1093/bioinformatics/btl065View ArticlePubMedGoogle Scholar
- Floudas CA, Aggarwal A, Ciric AR: Global optimum search for non convex NLP and MINLP problems. Comp Chem Eng 1989, 13: 1117–1132. 10.1016/0098-1354(89)87016-4View ArticleGoogle Scholar
- Paules GE, Floudas CA: APROS – Algorithmic development for discrete-continuous optimization problems. Operations Research 1989, 37: 902–915.View ArticleGoogle Scholar
- Floudas CA: Nonlinear and Mixed-Integer Optimization: Fundamentals and Applications. Oxford: Oxford University Press; 1995.Google Scholar
- Wang Y, Pierce M, Schneper L, Guldal CG, Zhang X, Tavazoie S, Broach JR: Ras and Gpa2 mediate one branch of a redundant glucose signaling pathway in yeast. PLoS Biol 2004, 2: E128. 10.1371/journal.pbio.0020128PubMed CentralView ArticlePubMedGoogle Scholar
- Broach JR, Deschenes RJ: The function of ras genes in Saccharomyces cerevisiae. Adv Cancer Res 1990, 54: 79–139.View ArticlePubMedGoogle Scholar
- Schneper L, Duvel K, Broach JR: Sense and sensibility: nutritional response and signal integration in yeast. Curr Opin Microbiol 2004, 7: 624–630. 10.1016/j.mib.2004.10.002View ArticlePubMedGoogle Scholar
- Santangelo GM: Glucose signaling in Saccharomyces cerevisiae. Microbiol Mol Biol Rev 2006, 70: 253–282. 10.1128/MMBR.70.1.253-282.2006PubMed CentralView ArticlePubMedGoogle Scholar
- Carlson M: Glucose repression in yeast. Curr Opin Microbiol 1999, 2: 202–207. 10.1016/S1369-5274(99)80035-6View ArticlePubMedGoogle Scholar
- Johnston M, Kim JH: Glucose as a hormone: receptor-mediated glucose sensing in the yeast Saccharomyces cerevisiae. Biochem Soc Trans 2005, 33: 247–252. 10.1042/BST0330247View ArticlePubMedGoogle Scholar
- Crauwels M, Donaton MC, Pernambuco MB, Winderickx J, de Winde JH, Thevelein JM: The Sch9 protein kinase in the yeast Saccharomyces cerevisiae controls cAPK activity and is required for nitrogen activation of the fermentable-growth-medium-induced (FGM) pathway. Microbiology 1997, 143(Pt 8):2627–2637.View ArticlePubMedGoogle Scholar
- Elemento O, Slonim N, Tavazoie S: Uncovering regulatory elements from expression data using mutual information. Mol Cell, in press.Google Scholar
- Myers CL, Barrett DR, Hibbs MA, Huttenhower C, Troyanskaya OG: Finding function: evaluation methods for functional genomic data. BMC Genomics 2006, 7: 187. 10.1186/1471-2164-7-187PubMed CentralView ArticlePubMedGoogle Scholar
- Brem RB, Yvert G, Clinton R, Kruglyak L: Genetic dissection of transcriptional regulation in budding yeast. Science 2002, 296: 752–755. 10.1126/science.1069516View ArticlePubMedGoogle Scholar
- Yvert G, Brem RB, Whittle J, Akey JM, Foss E, Smith EN, Mackelprang R, Kruglyak L: Trans-acting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors. Nat Genet 2003, 35: 57–64. 10.1038/ng1222View ArticlePubMedGoogle Scholar
- Foat BC, Houshmandi SS, Olivas WM, Bussemaker HJ: Profiling condition-specific, genome-wide regulation of mRNA stability in yeast. Proc Natl Acad Sci USA 2005, 102: 17675–17680. 10.1073/pnas.0503803102PubMed CentralView ArticlePubMedGoogle Scholar
- The Princeton University Microarray Database[http://puma.princeton.edu]
- Gollub J, Ball Ca, Binkley G, Demeter J, Finkelstein DB, Hebert JM, Hernandez-Boussard T, Jin H, Kaloper M, Matese JC, Schroeder M, Brown PO, Botstein D, Sherlock G: The Stanford microarray database: data access and quality assessment tools. Nuclei Acids Res 2003, 31: 94–96. 10.1093/nar/gkg078View ArticleGoogle Scholar
- Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein S, Altman RB: Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17: 520–525. 10.1093/bioinformatics/17.6.520View ArticlePubMedGoogle Scholar
- Tan MP, Broach JR, Floudas CA: Evaluation of normalization and pre-clustering issues in a novel clustering approach: global optimum search with enhanced positioning. J Bioinform Comput Biol 2007, 5: 895–913. 10.1142/S0219720007002941View ArticlePubMedGoogle Scholar
- The Saccharomyces Genome Database[http://www.yeastgenome.org]
- Broman KW, Wu H, Sen S, Churchill GA: R/qtl: QTL mapping in experimental crosses. Bioinformatics 2003, 19: 889–890. 10.1093/bioinformatics/btg112View ArticlePubMedGoogle Scholar
- Churchill GA, Doerge RW: Empirical threshold values for quantitative trait mapping. Genetics 1994, 138: 963–971.PubMed CentralPubMedGoogle Scholar
- Brooke A, Kendrick D, Meeraus A: GAMS: A User's Guide. San Francisco: The Scientific Press; 1988.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.