Validation and functional annotation of expression-based clusters based on gene ontology

Background The biological interpretation of large-scale gene expression data is one of the paramount challenges in current bioinformatics. In particular, placing the results in the context of other available functional genomics data, such as existing bio-ontologies, has already provided substantial improvement for detecting and categorizing genes of interest. One common approach is to look for functional annotations that are significantly enriched within a group or cluster of genes, as compared to a reference group. Results In this work, we suggest the information-theoretic concept of mutual information to investigate the relationship between groups of genes, as given by data-driven clustering, and their respective functional categories. Drawing upon related approaches (Gibbons and Roth, Genome Research 12:1574-1581, 2002), we seek to quantify to what extent individual attributes are sufficient to characterize a given group or cluster of genes. Conclusion We show that the mutual information provides a systematic framework to assess the relationship between groups or clusters of genes and their functional annotations in a quantitative way. Within this framework, the mutual information allows us to address and incorporate several important issues, such as the interdependence of functional annotations and combinatorial combinations of attributes. It thus supplements and extends the conventional search for overrepresented attributes within a group or cluster of genes. In particular taking combinations of attributes into account, the mutual information opens the way to uncover specific functional descriptions of a group of genes or clustering result. All datasets and functional annotations used in this study are publicly available. All scripts used in the analysis are provided as additional files.


Background
One of the common assertions in expression analysis is that genes sharing a similar pattern of expression are more likely to be involved in the same regulatory processes [1]. This proposition, commonly referred to as 'guilt-by-asso-ciation', has been exploited by a large number of clustering algorithms, grouping genes into a (small) number of classes, based on the similarity of their expression profiles. While there are still many open problems associated with choosing a particular algorithm, clustering has already proven successful in a multitude of applications, such as the inference of putative functional annotations [2,3], as well as the extraction of regulatory motifs in the upstream regions of genes [4,5].
More recently, the data-centric view, i.e. based on measured expression levels alone, has been advanced to integrate additional information, such as existing functional annotations [6][7][8][9] or protein-protein interaction [10][11][12]. In doing so, the paramount task is to enhance the biological interpretation of the data, e.g. by identifying physiologically relevant categories, based on existing bioontologies, associated with a particular grouping of genes.
However, prior to such a step, it is necessary to obtain a better understanding about the specific relationship between the data generated clustering and the information contained in the functional annotation of genes. That is, to what extent does a grouping of genes reflect their functional annotation, as e.g. given in terms of the structured vocabulary provided by the gene ontology (GO) consortium? In this work, we thus investigate the relationship between groupings of genes and their respective functional categories. It will be shown that the informationtheoretic concept of mutual information provides a suitable theoretical basis to address this question in a systematic way. Importantly, the mutual information holds several favorable properties and i) allows to give a quantitative figure of merit between a clustering result and functional annotations. ii) allows to identify functional trends that characterize a given clustering or grouping of genes, iii) allows to address and incorporate the interdependence of functional annotation terms, and iv) can be straightforwardly applied to the whole set of clusters, or likewise, only to a single individual cluster of group or genes.
Within this framework, we aim to extend the earlier work of Gibbons and Roth [13] and seek to investigate to what extend individual attributes are sufficient to characterize or summarize a given cluster of genes. This question is also ultimately related to the problem of detecting significantly enriched attributes within a group of genes, well covered in the literature [14][15][16][17]. We will demonstrate that in certain situations a simple search for overrepresented attributes fails to uncover the specific functional description of clustering results.
The paper is organized as follows: In the first section, a brief synopsis of the mutual information as a measure of dependency between cluster membership and annotated gene attributes is given. In the second section, we address the capabilities of individual attributes to characterize or summarize a given cluster of genes. In the following section, two major shortcomings of this approach will be pointed out: The interrelatedness of gene attributes (redundancy) and the failure of individual attributes to adequately describe a given clustering or grouping of genes. To overcome these problems, a heuristic strategy is devised that allows to detect combinatorial combinations of attributes, providing a more specific functional description of clustering results. The results are summarized and discussed in the last section.

Results
The mutual information Following Gibbons and Roth [13], the mutual information I(C, A) provides a figure of merit between cluster membership C and known gene attributes A, where H(C), H(A) and H(C, A) denote the entropies of the distributions of C and A and the joint entropy of C and A respectively.
The mutual information is a general measure of dependency between two (or more) variables [18][19][20] and can be interpreted as a 'distance' between the hypothesis of statistical independence and the actual joint probability distribution [18]. A more detailed review concerning its mathematical properties and the estimation from finite data was given elsewhere [19,20]. Importantly, the mutual information I(C, A) is zero if and only if the two variables, here the gene attributes A and the cluster membership C, are statistically independent. As will be demonstrated below, this property incloses and extends conventional approaches, such as finding significantly enriched annotations associated with a group of genes.
In the most simplest setting, each gene is uniquely assigned to one particular functional category A i and is grouped into a cluster C j by a given clustering algorithm.
In this case the estimation of the mutual information is straightforward: One constructs a contingency table and estimates the respective probabilities from the relative frequencies of occurrence, as schematically visualized in Fig.  1.
The mutual information thus provides a systematic quantitative measure of the relationship between cluster membership and given gene attributes [13]. In particular, it opens the possibility to judge the quality of a clustering result, not based on internal measures of consistency, such as within cluster distances to inter cluster distances, but based on existing additional information.
Validating clustering results by the mutual information: A schematic example Figure 1 Validating clustering results by the mutual information: A schematic example. Each gene is uniquely assigned to one functional category A i and grouped into cluster C j by a given clustering algorithm. The joint probabilities can be straightforwardly estimated from the associated contingency table and the mutual information is calculated according to Eq. (1). To assess how related the clustering is to the annotation, the value of the mutual information is compared to random assignments of genes to cluster number, i.e. each gene is randomly assigned to a cluster, preserving the total number of genes within each cluster, but destroying all possible relationship between the clustering and the functional annotation. The lower right plot shows the mutual information, compared to an ensemble of 500 randomized assignments, In this example, the z-score, estimated according to Eq. (8), is S ≈ 3.8. For a z-score to be deemed significant, we further require that no random assignment results in a mutual information equal or larger that the tested annotation. Note that, though we expect the mutual information to be zero for the randomized assignments, the average estimated mutual information for randomized data has a bias towards positive values due to finite-size effects [19,20]. As a rule of thumb, to obtain reliable estimate of the mutual information the number of genes should be at least three times larger than the number of clusters or functional categories [20].  However, in our case, the multi-functions of genes, as reflected in current annotation databases, defy the straightforward approach outlined above. In the following, we will make use of the curator-controlled annotation of S. cerevisiae genes, as provided by the gene ontology consortium (GO) [21].
The GO database is organized in a rooted directed acyclic graph (DAG), with three branches corresponding to the three categories 'cellular component', 'molecular function' and 'biological process'. Each gene (or rather gene product) is annotated by one or multiple GO terms along the graph. The hierarchical nature of GO implies that genes annotated with a specific node are also annotated with every ancestor of that node. Nodes closer to the root of the graph usually correspond to more abstract functional descriptions and cover more genes, while nodes farther away from the root correspond to more detailed functional descriptions. Note that the structure of GO is not necessarily a tree since each node may have multiple parents and may have multiple paths to the root of the graph [22].
The GO database, as downloaded in June 2004. already includes more than N A > 16000 nodes within all three branches [23]. For further numerical processing, each gene is assigned to a vector consisting of binary attributes cates that the gene has been annotated with the GO term A i or one of its descendants.
Clearly, in such a situation, a straightforward estimation of the mutual information must inevitably fail: We would have to take into account all possible combinations of attributes A i , resulting in a contingency table with up to columns, as illustrated in Table 1. Even though the vast majority of combinations does not occur for the genes under consideration, a direct evaluation of Eq. (1), even for just a few hundreds of different attributes, is beyond all computational and statistical means.
To overcome this problem, Gibbons and Roth [13] suggested to approximate the total mutual information as a sum of the mutual information between clusters and each individual attribute. With Note that this approximation assumes both, independence and conditional independence between all attributes.
To illustrate this, we make use of a simple example involving just two attributes A 1 and A 2 . One has to evaluate Since for statistically independent attributes H(A 1 , A 2 ) = H(A 1 ) + H(A 2 ), we only need to consider the last term Thus Eq. (5) indeed reduces to Given the structure of the GO database, as described above, the assumption of statistical independence is, of course, not fulfilled: The attributes are not statistically annotation Cluster independent but strongly dependent on each other. In particular, any annotated attribute A i implies that all of its parents are also annotated. This interdependence will be considered in more detail below.
Unfortunately, Eq. (3) does likewise not allow to give an upper or lower bound on the true mutual information. It is well established that this approximation does not result in a systematic bias, i.e. in general one may not tell whether the violation of the assumptions under-or overestimates the true mutual information [18].
However, for the moment we accept Eq. (3) as a reasonable approximation of the mutual information. Based on this assumption, it was already demonstrated that clustering results and the GO annotations indeed possess a mutual information significantly different from zero [13]. Interestingly, the widely used hierarchical clustering gave results not significantly different from random assignments and was found to be far worse than other common algorithms, such as k-means.
In the following, we will draw upon these results, using the same datasets and preprocessing of the GO database as described by Gibbons and Roth [13]. but addressing slightly different questions instead.

The case of individual attributes
Given that clustering results and the known functional annotation indeed yield a mutual information significantly different from zero, the question arises how this mutual information is distributed among the individual attributes. Are there only few attributes which correspond to and summarize the cluster? Or, on the other hand, is the observed overlap tightly embedded within the structure of the GO annotations -a combined effect of a multitude of attributes, where neither of them is sufficient to characterize a given cluster on its own?
To evaluate this, we must have a closer look on the distribution of the individual terms contributing to Eq. (3). Restricting ourselves to a k-means algorithm (see appendix), the individual mutual information I(C, A i ) between the clustering C and all attributes was estimated.
To assess how related each each attribute is with the clustering, we evaluate randomized assignments of genes to clusters, i.e. each gene is randomly assigned to a cluster, preserving the total number of genes within each cluster. This results in the z-score where σ random denotes the standard deviation of the estimated mutual information for the randomized data. For the z-score to he considered significant, we further require that the number of random assigments is larger that the total number of tested attributes and that tor each attribute all random assigments result in a lower mutual information. Figure 2 shows a histogram of the obtained scores, estimated according to Eq. (8). for the cell cycle dataset of Spellman et al. [24] and k = 25 cluster. As can be observed the vast majority of annotated attributes shows no, or only little, significant overlap with the data-driven clustering. However, the overall distribution is highly inhomogeneous: few attributes are singled out and possess a remarkably high z-score with respect to their shuffled counterparts. The ten highest scoring attributes are indicated in Fig. 2. Likewise, we must also expect the aggregated mutual information of Eq. (3) to be dominated by only few addends of rather high value.
Interestingly, the highest scoring attributes do not change substantially when different datasets are considered (see Table 2 for all datasets under consideration). This indicates that different experimental conditions, corresponding here to different datasets, do not significantly influence which attributes are selected as the most descriptive for the clusters.

Combinatorial combinations of attributes
Two essential shortcoming of our analysis have to be pointed out: First of all. Fig. 2 strongly suggests that the top scoring attributes are largely redundant, i.e. that the individual terms contributing to the aggregated mutual information of Eq. (3) are not independent.
In fact, a gene product annotated with the cellular component 'cytosolic ribosome' (GO:0005830) is necessarily also annotated to the cellular component 'ribosome' (GO:0005840). While this is a trivial consequence of the tree structure of the database (the former node being a child of the latter), other relationships between attributes are less straightforward. For example, and even without any computational assistance, gene products annotated to the cellular component 'ribosome' (GO:0005840) can mostly be expected to be also annotated to the biological process 'protein biosynthesis' (GO:0006412). However, it is worth pointing out that this, in contrast to the former example, is not an inherent consequence of the tree structure of the database, as both nodes appear within distinct, branches of the tree. More systematically, we can assess the redundancy between two selected attributes again by means of the (pair-wise) mutual information I (A i , A j ) between two attributes A i arid A j . Table 3 gives the contingency tables for the five top scoring attributes of Fig. 2. As Histogram of the z-score Eq. (8) for all individual attributes. The attributes with the highest score are marked. The figure is for the cell cycle dataset of Spellman et al. [24] using the same preprocessing as described in [13] (sec appendix]. After filtering ≈ 2500 GO attributes remained for evaluation. Repeating the analysis for all datasets given in Table 2 yields similar results. The clustering was obtained using a k-means algorithm with Euclidean distance and k = 25, the results do not change significantly for different choices of k (tested between k = 5 -30, corresponding to the region where the z-score of the mutual information is largest [13]). Note that the top scoring attributes appear to be largely redundant, i.e. a gene that is annotated to the cellular component 'cytosolic ribosome' can be intuitively expected to be also annotated to the biological process 'protein biosynthesis'. See next section for details. terms. This, of course, affects the interpretation of the results, as these attributes do not actually contribute to a characterization of the given clustering. In the following, after pointing out a second drawback of our analysis, we will thus devise a strategy that incorporates the interdependence of attributes.

Detecting combinations of attributes
Apart from above described shortcoming, restricting ourselves to the mutual information between a given clustering and single individual attributes, entails yet another problem. While these (possibly redundant) attributes can indeed indicate a functional association between the genes gathered in a particular cluster, other, more specific, functional descriptions might be easily missed.
In most cases, it will not only be one attribute that defines or characterizes a group or cluster of genes, but rather a specific combination of attributes. Table 4 gives an illustrative example of such a situation. Here, neither of the attributes share any mutual information with the grouping of the genes into two clusters, nor is any attribute overrepresented within the two groups. However, looking at the combination of both attributes does immediately reveal that these attributes are nonetheless highly descriptive for the given cluster: Their combination does uniquely determine to which cluster a particular gene belongs, or, vice versa, which annotation coincides with a particular cluster.
Obviously, the tremendous computational demands prevent to conduct an exhaustive GO-wide search for all pos-    [26] sible combinations of attributes. To still detect relevant combinations for large-scale data, we thus devise a simple heuristic strategy: Starting with a seed attribute A 0 , the one that gives the highest mutual information I(C, A 0 ), we iteratively look for attributes that result in the largest information gain, i.e. the largest increase in mutual information, when included in the list of attributes. Schematically:  H(A 0 , A 1 ) - there is no gain in information and the attribute will not be included into A.
The iteration stops after a predefined maximal number of steps l max or when no new attribute leads to a significant increase in mutual information. The latter is tested at each iteration step by comparing the result to randomized counterparts of the attribute to be included. As in the previous section, we require that the number of randomized assignments is larger that the total number of tested attributes and that no randomized assignment yields a mutual information equal or higher than the attribute that is to be included. Otherwise, the increase in mutual information is not considered significant and the iteration stops. Note that this also avoids statistical problems due to finite size effects [20]. If the vector of attributes becomes too large, a reliable estimation of the mutual information is no longer possible. In this case, the resulting values upon inclusion of a new attribute will not significantly deviate from those of randomized attributes.
Thus, instead of conducting a comprehensive search for all possible combinations, we consider only those attributes which further contribute to a characterization of a clustering result, given the already selected attributes. In this way, we avoid the inclusion of a large number of redundant attributes. Note that this procedure is reminiscent of a decision tree [29], aiming to predict the cluster assignment based on the GO annotation. Table 5 shows the result for the previously considered dataset of Spellman at al. [24]. Starting with the highest scoring attribute 'cytosolic ribosome' (GO:0005830), already depicted in Fig. 2, new attributes were iteratively included until l max = 31, the first 20 are given in Table 5.
Constructing the associated contingency table of the 5 top scoring attributes, analog to Table 3, indeed reveals that the pair wise mutual information between the selected attributes is significantly lower. The respective z-scores are given in Table 6.
Again the results were not specific for the particular dataset. Repeating the analysis for all datasets given in Table 2 resulted in similar attributes. Those attributes that were selected among the top 32 for all datasets under consideration are indicated in bold in Table 5.
This again indicates that the specific experimental condition, under which the dataset was obtained (two cell cycle experiments and one alternative experiment, involving several conditions.), has no, or only little, influence over the prevailing functional annotations that characterizes the clustering of this respective dataset. This is noteworthy, as in each case different genes were selected for the A i random Table 4: Combinations of GO attributes: Shown is a schematic example of 8 genes, separated into two distinct clusters (Table  4a). As can be observed neither of the two given attributes is significantly enriched within any of the cluster, resulting in a vanishing mutual information between the clustering and the annotation (see the respective contingency tables in Table 4b and 4c). However, clearly the combination of both attributes does uniquely determine both cluster. In particular, genes with the combination A = (A 1 , A 2 ) = (0, 0) or (1, 1)  analysis (see description in Table 2). Moreover, the clustering results itself were sufficiently different, i.e. this mutuality in descriptive annotations is not straightforwardly apparent on the level of clusters itself.
Most importantly, the selected attributes indeed provide a functional categorization of the obtained clustering, improving the search for significantly enriched annotations. This is visualized in Fig. 3. Shown is a graphical representation of the contingency table between the clustering result (see again Fig. 2 for details) and the combined annotations. As can be observed, the combined annotations provide a more specific functional descriptions of the clustering result. For example, using solely the highest scoring attributes of Fig. 2. genes included in the clusters 13, 16 and 19 are assigned almost uniformly to all selected attributes. However taking only the top five selected attributes of Table 5 into account, several cluster are dominated by specific combinations of attributes, e.g. for cluster 6 genes annotated to 'cytoplasm' (GO:0005737), but to none of the other four attributes, are clearly overrepresented.

Conclusion
In this work, we have investigated the use of the mutual information as a measure to detect and quantify the interrelation between data generated clusterings and the known functional annotations of genes. Starting with the contribution of individual attributes, we found that the mutual information between a given clustering result and the attributes is highly inhomogeneously distributed. Few attributes show a remarkable overlap with the clustering, while the vast majority of attributes show no, or only little, overlap with the data-driven clustering. These results were robust with respect to parameters in the clustering algorithm, as well as to different choices of datasets.
One of the primary advantages of the mutual information is that it is not restricted to consider only individual attributes contained the GO database. Focusing on combinations of attributes that resulted in a maximal mutual information between the (selected) annotations and the clustering, we demonstrated that this approach extends and enhances the more conventional search for overrepresented attributes in a group of genes of interest. In particular, including only those attributes that further contribute to a characterization of the clustering, in addition to the already selected ones, circumvents the problem of redundant attributes. Within a group of highly dependent attributes, only the one which results in the largest information gain will be selected.
Interestingly, our results indicate that the experimental conditions under which a particular dataset was obtained has no major influence on the top-ranking attributes. For all considered datasets a nearly identical list of highly descriptive attributes was found. Also, these attributes mostly referred to rather abstract functional descriptions, such as 'cell growth', 'catalytic activity', or 'protein metabolism'. This, of course, questions the use of clustering results to gain insight into specific phenomena, such as the transcriptional response to a particular experimental perturbation or knockout experiment. Rather, one usually observes cluster of genes that are known to be tightly coregulated, such as protein synthesis genes. Only with massive experimental interventions, we must expect the resulting pattern of gene expression to be fundamentally changed and to be directly related to the respective experimental condition.
Finally, it should be emphasized that the application of the mutual information holds a vast potential for further improvements of the method. As yet, we have not focused at predicting a putative functional classification of a specitic cluster. Rather, the mutual information, as used here, represents an average quantity, quantifying the relationship between functional annotation and clusterings as a whole. In this sense, the validity of clustering results can be judged and compared to existing functional annotations. However, the approach tan be straightforwardly extended to detect the prevailing functional annotations of individual clusters, based on the information contained in the GO database.
Along similar veins, the mutual information may also be utilized to further improve the annotation of gene products. Reversing our approach, unknown functional annotations can be predicted based on the available annotations, as well as on membership in a specific cluster. Another advantage of the mutual information in this respect is that, in incorporating additional biological information complementing the GO annotations, it is not restricted to categorical data, but can be extended to include continuous data as well. In particular, the additional information to which a clustering or grouping is compared, is not necessarily restricted to functional annotations. Table 1 may hold any attributes or quantities related to a particular gene or gene product. In this sense, the mutual information constitutes a systematic theoretical basis to investigate the relationship between groups of genes and additional biological information.
ing the values N min = 10, U max = 0.8, and N max = max (i.e. no restriction on the maximal number of genes an attribute is assigned to) were used. The clustering of the data was performed using the open source clustering library described in [27]. All results were compared with the algorithms implemented in the software packages matlab and R (http://www.r-project.org/, [28]). K-means clustering was chosen in accordance with the results obtained previously in [13]. All scripts used in the analysis are provided as additional files [see additional file 1].
Combinations of GO attributes: Shown is a graphical representation of the contingency tables between the clustering result and the GO annotations