Multiconstrained gene clustering based on generalized projections

Zeng, Jia; Zhu, Shanfeng; Liew, Alan Wee-Chung; Yan, Hong

doi:10.1186/1471-2105-11-164

Methodology article
Open access
Published: 31 March 2010

Multiconstrained gene clustering based on generalized projections

Jia Zeng^1,2,
Shanfeng Zhu^3,4,
Alan Wee-Chung Liew⁵ &
…
Hong Yan^6,7

BMC Bioinformatics volume 11, Article number: 164 (2010) Cite this article

4402 Accesses
2 Citations
1 Altmetric
Metrics details

Abstract

Background

Gene clustering for annotating gene functions is one of the fundamental issues in bioinformatics. The best clustering solution is often regularized by multiple constraints such as gene expressions, Gene Ontology (GO) annotations and gene network structures. How to integrate multiple pieces of constraints for an optimal clustering solution still remains an unsolved problem.

Results

We propose a novel multiconstrained gene clustering (MGC) method within the generalized projection onto convex sets (POCS) framework used widely in image reconstruction. Each constraint is formulated as a corresponding set. The generalized projector iteratively projects the clustering solution onto these sets in order to find a consistent solution included in the intersection set that satisfies all constraints. Compared with previous MGC methods, POCS can integrate multiple constraints from different nature without distorting the original constraints. To evaluate the clustering solution, we also propose a new performance measure referred to as Gene Log Likelihood (GLL) that considers genes having more than one function and hence in more than one cluster. Comparative experimental results show that our POCS-based gene clustering method outperforms current state-of-the-art MGC methods.

Conclusions

The POCS-based MGC method can successfully combine multiple constraints from different nature for gene clustering. Also, the proposed GLL is an effective performance measure for the soft clustering solutions.

Background

Computational annotating gene functions is a fundamental issue in bioinformatics. Microarray gene expression data have been used widely to study the cell cycle system, genetic regulatory interactions, development at the molecular level, and genes that act in response to a certain infectious disease. To determine gene functions, a basic approach is gene clustering using gene expression data based on the assumption that genes with similar expression patterns should share similar functions in the process. Typical gene clustering methods include hierarchical clustering [1], the k-means algorithm [2], self-organizing maps [3], the fuzzy c-means algorithm [4], and hidden Markov models [5]. However, gene clustering regularized by only single constraint of gene expression is not enough to obtain biologically reliable clusters, because microarray data are often noisy, contain missing values, and have uncertain temporal dependencies in time-series data [6, 7]. Therefore, other constraints besides gene expression data should be incorporated for the robust and reliable gene clustering.

Recent multiconstrained gene clustering (MGC) methods have attracted much more interests [8–13]. The basic idea is that multiple constraints such as Gene Ontology (GO) and metabolic network structures can prevent gene clustering from falling into the locally optimal solution space constrained by noisy gene expression data alone. One key problem is how to combine multiple pieces of constraints to find a consistent clustering solution. Current MGC methods adopt a linear combination strategy to integrate multiple constraints of the same nature into a single new constraint, so that standard clustering algorithms for single-constrained gene clustering problems can be used, e.g., hierarchical clustering [8], Gaussian mixture models [9], k-medoids [10], and iterative conditional modes (ICM) for Markov random fields [12]. More specifically, they build a distance matrix of gene expression data as the first constraint, and then build another distance matrix based on either metabolic pathway [8, 12, 14] or GO annotations [9, 10] as the second constraint. These two constraints of distance matrices are added linearly to form the new distance matrix for gene clustering. This linear combination strategy has also been used to incorporate different constraints in document clustering [15, 16]. Despite good clustering performance, there are two major problems yet to be solved. The first is that these MGC methods can only combine constraints of the same nature, i.e., all constraints have to be represented as distance matrices. If one constraint is a similarity matrix, we need to transform it into a distance matrix so that we can add it up to other distance matrices. Such transformation may distort the original constraint with information loss. Even if we have two distance matrices, the distance values may be in different scales and cannot be added directly. The second problem lies in the linear combination of the constraint matrices. In most cases, the desired combined constraint does not necessarily have a simple linear relationship with all other original constraints. In addition, the weights for the linear combination often need a reasonable justification in practice. Another MGC strategy is the GO-guided fuzzy c-means (FCM) algorithm [13], which uses GO annotations to initialize and update the cluster probability of each gene.

To overcome above problems, we propose a novel MGC method within the generalized projection framework, which is a generalization of the projection onto convex sets (POCS) technique, which has found many applications in image reconstruction [17] and microarray missing value imputation [18]. Theoretically, POCS provides a flexible framework to integrate multiple pieces of constraints for an optimal solution. It first transforms each constraint into a corresponding convex set, and then uses an iteratively convergent procedure to find a solution in the intersection of all sets. POCS can integrate constraints from different nature such as different similarity matrices. Indeed, it often handles different constraints in frequency and spatial domains in image reconstruction problems. Another advantage is that the original constraints remain intact. The clustering result is projected onto the solution set that satisfies each constraint iteratively and the final result may lie in the intersection set that satisfies a nonlinear combination of the original constraints. Without loss of generality, in this paper we consider two major types of constraints: the gene expression similarity [8] and the GO-based semantic similarity [19]. POCS produces a regularized clustering result that may be more reliable than those solely dependent on either the gene expression similarity or the GO semantic similarity due to the fact that expression data are often short and noisy, while GO terms may be inaccurate and mis-annotated. Because in most cases the solution set is nonconvex, we adopt the generalized projections similar to the POCS procedure. To minimizes the distance between the candidate solution and the constraint set, we design the generalized projector based on a method similar to the relaxation labeling (RL) algorithm [20, 21], which has been used for the approximate inference for Markov random fields [22, 23]

Usually genes have multiple functions and can be assigned into more than one group. Traditional gene clustering algorithms often use a hard clustering strategy that assigns genes into only one group. Recent MGC methods relax this limitation and allows genes to be assigned into several groups [9, 10, 13]. To take this situation into account, we use a soft clustering strategy in which genes are assigned to all clusters with different probabilities. Based on soft clustering results, we propose a new performance measure "gene log likelihood" (GLL) to measure the distance between the predicted clustering result and the reference clusters. This measure has also been widely applied to evaluating word clustering performance in topic modeling problems [24]. To confirm the effectiveness, we evaluate the POCS-based MGC method on the yeast gene expression dataset, and compare the clustering results with recent MGC methods such as k-medoids [10], ICM [12] and FCM [13]. Experimental results demonstrate that the POCS-based MGC can enhance the overall clustering performance by a large margin.

This paper is organized as follows. In the next section we propose the POCS-based MGC method and the RL-based generalized projector to minimize the distance between clustering solution to the corresponding constrained solution set. To account for genes in multiple clusters, we also propose GLL for calculating the distance between the predicted soft clustering results and the reference gene clusters. The result section shows comparative experimental results on different yeast expression datasets. The POCS-based MGC algorithm always converges to the optimal solution in practice. Finally, we draw conclusions and envision future work.

Methods

Gene clustering is a labeling problem, in which a set of cluster labels are assigned to genes for annotating gene functions. Given I genes and K clusters, the soft clustering solution is a matrix X = (x_ik), 1 ≤ i ≤ I, 1 ≤ k ≤ K, where x_ik∈ [0, 1] and Σ_kx_ik= 1. The element x_ikis the probability that the i th gene is associated with the k th cluster label. For each gene we use a probability vector x_i= (x_{i 1}, . . ., x_ik, . . ., x_iK) to represent its cluster labeling configuration. From this perspective, the clustering solution X is the cluster labeling configuration of I genes over K clusters. We may also use the winner-take-all strategy to figure out the hard clustering solution X*, in which the i th gene belongs to only one cluster with the highest probability, i.e., k* = arg max_kx_ikand x_ik*= 1.

Gene expression constraint

Based on microarray gene expression profiles, we can build the first constraint using the similarity matrix for gene clustering. The metric can be the Pearson's correlation coefficient and Euclidean distance [8–10], or the more complex type-2 fuzzy hidden Markov model-based sequence similarity [25]. Because the Pearson's correlation coefficient is suitable for time-series gene expression data [26], we adopt it for calculating the similarity between two genes' log-ratio transformed profiles [8], i.e., the logarithm of the ratio between each sample point in the profile and a control measurement. More specifically, given two genes' transformed profiles g_i(m) and g_i'(m) in length M, the correlation coefficient v_ii'is

where μ_iand σ_idenote mean and standard deviation of the transformed profile of the i th gene respectively. The correlation coefficient value ∈ [-1, 1], where the higher value corresponds to the higher similarity between two genes' profiles. Here we consider the anti-correlated gens as most dissimilar because the correlated genes often involve in similar reaction steps and share similar functions. Therefore, the Pearson's correlation coefficient matrix constrains the first clustering solution set C₁ = {X_e}, which contains many locally optimal clustering solutions satisfying .

GO constraint

As an important source of biological knowledge, the Gene Ontology (GO) provides a consistent description of genes and gene products by a controlled and structured vocabulary, which includes three major categories: biological process (BP), molecular function (MF), and cellular component (CC). The GO terms are organized in the form of a directed acyclic graph (DAG) with two major semantic relations such as "is-a" and "part-of", where "A is-a B" means A is a subclass of B, and "C part-of D" means C is always part of D. Generally, simply identifying the shared GO annotations of gene products for their functional relationship has the following limitations. First, two quite different GO annotations can be closely related through their common ancestors in the DAG so as to have a higher semantic similarity. Second, the shared GO terms may be too general to describe the functional association of annotated gene products. Recently, the GO-based semantic similarity measures have been applied to searching semantically similar proteins [27], clustering gene expression data and assessing cluster validity [19, 28, 29], developing new human regulatory pathway modeling tools [30], validating protein interaction data [31], validating functional annotation of expression-based clusters [32], and enabling the identification of functionally related gene products independent of homology [33].

The GO-based semantic similarity measures assume that the more information two GO terms share, the more similar they are. In this paper we adopt a recent GO-based semantic measure proposed by Wang et al. [19], in which the similarity between two GO terms S_GO(c_m, c_n) is calculated according to the graph structural information encoded in the GO. This semantic measure between annotated GO terms for genes has been demonstrated to be better than the classic Resnik's measure in clustering gene products. If c is a GO term, is the set of GO terms including term c and all its ancestors, and E_cis the set of edges connecting all terms in , the S-value of any term t in the graph DAG_c= (c, , E_c) related to the term c, S_c(t), is defined as,

where w_eis the semantic contribution factor for edge e ∈ E_clinking the term t with its child term t'. Here we use w_e= 0.8 for "is-a" relation and w_e= 0.6 for "part-of" relation as suggested in [19]. After obtaining all S-values for all terms in the DAG_c, the semantic value of the term c, SV (c), is

Given two GO terms c₁ and c₂ as well as their graphs and , the semantic similarity S_GO(c₁, c₂) is

where (t) is the S-value of GO term t related to term c₁, and (t) is the S-value of GO term t related to term c₂. One gene may be annotated by many GO terms. Given two genes annotated by several GO terms, GO_i= {c_{i 1}, . . ., c_im, . . ., c_iM} and GO_i'= {c_i'1, . . ., c_i'n, . . ., c_i'N}, the functional similarity between genes,

Note that the functional similarity between two GO term sets GO_iand G_Oi'considers the hierarchical structure of GO terms c based on the S-value. Because the GO contains three main vocabularies, BP, MF and CC, the GO similarity value between genes can be calculated in a joint manner as

where BPsim, MFsim and CCsim denote the similarity values of the corresponding GO terms within the same type. The similarity value ∈ [0, 1], where the higher value corresponds to the higher similarity. As a result, the GO-based semantic similarity constrains the second clustering solution set C₂ = {X_g}, which contains many locally optimal clustering solutions satisfying .

Generalized projections

Although the gene expression and GO-based semantic similarity may achieve a clustering solution with a high correlation, there is still a large amount of complementary information between their final clustering results [34]. Both gene expression and GO constrained solution sets C₁ = {X_e} and C₂ = {X_g} may not contain a single globally optimal solution, and even they contain such a solution, we are unlikely able to find it since the optimization procedures are highly nonlinear. So, we consider C₁ and C₂ as sets of all locally optimal solutions under different constraints. When both constraints are satisfied, we eliminate many unreasonable locally optimal solutions and obtain an improved clustering performance. Our objective is to find the biologically consistent clustering solution X^†∈ C₁ ∩ C₂ using the POCS procedure [17]. Note that direct adding two constraints and based on the weight w ∈ [0, 1], i.e., , to produce the new constraint for gene clustering is not suitable because the constraints are from different nature. In contrast, the POCS framework decomposes the optimization procedure into different projections and solves the problem efficiently.

input: X₀, P_n, w_n, 1 ≤ n ≤ N, M.

output: X_M.

begin

for m ← 1 to M do

;

// P_nX_m-1is described in Algorithm 2.

end

Algorithm 1: The simultaneous projection.

Within the POCS framework [17], each constraint on the solution is formulated as a corresponding closed convex set, C_n, 1 ≤ n ≤ N, in the Hilbert space H. The optimal solution X^† is included in the intersection set C₀ of all convex sets C_n,

(1)

If C₀ is nonempty in Figure 1A, the successive projections onto the convex sets,

(2)

will converge to a consistent solution in C₀ for any random initial value X₀, where X_m, 1 ≤ m ≤ M is the solution at the mth iteration. Eq. (2) shows that the current solution X_m-1is projected to each set or constraint C_n, 1 ≤ n ≤ N through the projector P_nsuccessively in order to find the next better solution X_muntil it converges to the consistent solution X^† in the intersection of all sets. Figure 1A shows the projection process for the consistent problem in Eq. (2), where the thick black dot represents a consistent solution in the intersection of two sets C₁ and C₂ for the gene expression and GO constraints, respectively. The generalized projector P_ntransforms X_m-1into a solution within the set C_nthat minimizes the distance between X_m-1and ,

(3)

where ||· || denotes the norm in the Hilbert space H. Indeed, Eq. (3) indicates that we need to transforms the current clustering solution X_m-1into a more suitable clustering solution based on the similarity or distance matrix for the set C_n. If C₀ is empty in Figure 1B, the POCS algorithm uses simultaneous projections,

(4)

where w_nis the weight on the projections satisfying and w_n≥ 0 for all n. The simultaneous projections converge weakly to a solution such that a weighted set distance function is minimized. Note that the simultaneous projections only linearly combine the solutions projected onto all constraint sets, which is more reasonable than the strategy that linearly combines constraints and then finds a solution under the new constraint. Figure 1B shows the simultaneous projections for the inconsistent problem in Eq. (4), where the thick black dot is an approximately best solution minimizing the weighted set distance from gene expression constraint C₁ and GO constraint C₂, respectively.

In practice, both C₁ and C₂ are often nonconvex. A set is convex if and only if λ X_a+ (1 - λ)X_bis in the set when X_aand X_bare in the set for 0 ≤ λ ≤ 1. The constraint sets contain many locally optimal clustering "solutions" and the interpolation of the solutions, i.e., the weighted sum λ X_a+ (1 -λ)X_b, has no mathematical meaning. Thus, we cannot use the classic POCS procedure (2). Nevertheless, we can still use the generalized projections (3) to solve the problem within the POCS framework [[17], Chapter 5], which do not require the sets be convex. In practice it is difficult to minimize the distance functions (3) under both constraints at the same time, so we do it iteratively based on generalized projections. The generalized projector iteratively minimizes the distance function (3), and will terminate if the distance in the next step cannot decrease. From the regularization point of view, the solution is regularized under different constraints simultaneously, and the final solution is a linear combination of each regularized solution in Eq. (4). The simultaneous projection weights w_ncan be fixed empirically according to prior knowledge. To summarize, Algorithm 1 shows the simultaneous projection algorithm.

input: , 1 ≤ i, i' ≤ I, 1 ≤ k ≤ K, J.

output: , 1 ≤ i ≤ I, 1 ≤ k ≤ K.

begin

for j ← 1 to J do

for i ← 1 to I do

for k ← 1 to K do

;

end

Algorithm 2: The relaxation labeling projector.

Now we design the generalized projector based on the iterative RL algorithm [20, 21, 23], which can find the soft cluster label for each gene under a certain constraint. Given the clustering solution X and the constraint , minimizing (3) is equivalent to maximizing the corresponding gain function,

(5)

where i' ∈ ∂_iis a set of neighbors of the i th gene, and the term exp() increases with the similarity between two genes according to the constraint . The neighborhood system ∂_iis defined as the ten nearest genes i' with top similarity values . The term encourages that if the genes have a high similarity value they also have a high similarity value in soft cluster labeling configurations. The RL algorithm iteratively updates the initial X¹ by the gradient of the gain function (5) until j reaches the fixed maximum number J as shown in Algorithm 2. The value of J is determined experimentally to ensure that the gain function is maximized. That is, after J iterations, the RL algorithm converges to the local maximum of the gain function in terms of X^J. In the meanwhile, the distance function (3) is also minimized by X^J, where X^Jis equivalent to in (3). Algorithm 2 shows the projection of X¹ satisfying one constraint . Note that J is the number of iterations of the RL-based projector in Algorithm 2, while M is the number of iterations in the simultaneous projection in Algorithm 1. The RL-based projector is a fast algorithm and practically J = 5 is enough.

Gene log likelihood

If we have a reference gene clustering solution Y, we can calculate the distance between the predicted clustering solution X and the standard reference Y for the performance evaluation. The reference clustering solution is a matrix, Y = (y_iw), 1 ≤ i ≤ I, 1 ≤ w ≤ W, where y_iw= 1 denotes that the i th gene belongs to the w th cluster. The number of reference clusters W may not equal to the predicted number of clusters K in most cases. Because a gene may belong to multiple clusters due to multiple functions, the vector y_i= (y_{i 1}, . . ., y_iw, . . ., y_iW) may contain multiple ones for the i th gene.

Based on the hard clustering solution X*, we may quantify the distance between X* and Y by normalized mutual information (NMI), which has been widely used in a lot of applications to measure the performance of clustering methods [12, 19]. In information theory, the mutual information is defined as a quantity to measure the amount of information shared between two random variables. If one set of clusters is more consistent with the other set of clusters, the mutual information between two sets of cluster labels becomes larger. Generally, the mutual information is normalized because the range of the mutual information measures depends on the size of given sets of clusters. NMI is calculated as

where I is the number of genes, n_wis the number of genes in the w th reference cluster, n_kis the number of genes in the k th reference cluster, and n_wkis the number of genes in both w th reference cluster and k th predicted cluster. If two sets of clusters are identical, NMI between them reaches the maximum value of one.

However, NMI cannot be used if one gene may be in multiple clusters. So, we propose a new performance measure referred to as gene log likelihood (GLL) log P(Y|X) for gene clustering, which measures the likelihood in predicting a single gene in the reference cluster Y based on X. GLL has a simple meaning that the i th gene in the w th reference cluster Y is predicted with a likelihood proportional to the product of the likelihood that the w th cluster is generated by the k th cluster and the likelihood that the i th gene is generated by the k th cluster in X. Higher values are better, indicating the obtained clustering solution X has a higher likelihood to generate the reference gene clusters Y. Specifically we calculate GLL as follows,

(6)

where x_i= (x_{i 1}, . . ., x_ik, . . ., x_iK) is the probability distribution over K clusters of the i th gene, i ∈ w denotes the set of all genes in the w th reference cluster with y_iw= 1, and p_w= (p_{w 1}, . . ., p_wk, . . ., p_wK) is the probability distribution of the wth reference cluster over K predicted clusters. Empirically, this probability p_wkcan be estimated by

(7)

(8)

where we assume that the genes are conditionally independent in the generative process. Indeed, this is a standard performance measure for word clustering in the text mining [24], which indicates the empirical likelihood in predicting a single word in a document.

Results and Discussion

Datasets

To calculate the gene expression constraint, we select four microarray time-series datasets [35], monitoring genome-wide mRNA levels for 6178 yeast Saccharomyces cerevisiae open reading frames simultaneously using several different methods of synchronization including four datasets: alpha, cdc15, cdc28 and elu datasets. Also we add the Hughes dataset [36] widely used in gene clustering [9, 10], because it contains 300 time points while a small number of missing values. The missing values in the microarray data are interpolated by the POCS-based reconstruction method [18], which uses multiple constraints such as synchronization loss. To calculate the GO constraint, the GO (version 20080225) and annotation (version 1.1384) databases of yeast are downloaded from the GO official website. The yeast annotation file includes 6345 gene products annotated with 77152 GO terms.

To evaluate MGC methods for gene clustering, we generate two different sets of reference gene clusters with true cluster labels from KEGG [37] and SGD (Saccharomyces Genome Database) http://www.yeastgenome.org/ referred to as KEGG clusters [12] and SGD clusters [19], respectively. The KEGG pathway maps are generally classified into six major categories including metabolism. We use ten subcategories under the metabolism category as KEGG clusters, which includes a total of 531 genes. Note that a gene can be in more than one cluster. Table 1 lists the KEGG clusters and the number of genes in the corresponding cluster. We also use the gene annotation and classification information in yeast biochemical pathways as SGD clusters. There are 142 pathways involved with 835 genes, among which only 26 pathways contain more than 10 genes, where a gene can be in more than one pathway. Table 2 summarizes the list of pathway clusters and the number of genes in the corresponding cluster. The reason why we use two different sets of reference clusters lie in the fact that gene clusters are variable depending on the different partitioning criteria. If the predicted clusters by the POCS-based method are close to both reference clusters, we may make a safe conclusion that this method is robust to annotate gene functions under different conditions.

Table 1 10 reference gene clusters from KEGG

Full size table

Table 2 26 reference gene clusters from yeast biochemical pathways

Full size table

Comparative results

The POCS-based MGC method requires two key parameters, the number of simultaneous projections M and the weight on projections w_n, in Algorithm 1. Because we have two constraints, the weight for the GO-based constraint is w, and thus the weight for the gene expression constraint is 1 - w. Through experiments on the alpha dataset, we can determine proper M and w for desirable gene clustering performance. The parameters M and w are adjusted so that we can obtain the desirable result within the POCS framework. It is possible that another iterative method can estimate the parameters better. However, in many cases, such a better-performing method is a supervised learning procedure using reference gene clusters, and can be incorporated into the POCS procedure to achieve an even better performance or robustness. That is, POCS is useful for combining information from different sources if we can formulate corresponding constraint sets and projections.

To determine M, we randomly initialize the clustering solution, and the weight w = 0.5. Figure 2 shows the GLL values on the KEGG and SGD reference clusters when 10 projections are used. From different number of clusters K = 10, 15, 20, 25, we see that all GLL values do not increase significantly after two or three projections. So, we believe that M = 3 is enough to produce desirable clustering results in this task. From this experiment, we also see that Algorithm 1 converges quickly after a few projections. Then, we fix M = 3 and tune the weight w ∈ [0, 1]. By using M = 3 projections in practice, POCS does not increase the computational cost very much, which makes this algorithm very attractive in combining more constraints for gene clustering.

Figure 3 shows the GLL values on the KEGG and SGD reference clusters by increasing the weight at the step 0.1. We observe that the performance highly depends on different projection weights. If we use KEGG reference clusters, we find that weight w = 0.7 can produce higher GLL value on average. The gene expression constraint alone w = 0 does not ensure the best clustering result, while the GO constraint alone w = 1 does not ensure the best clustering result either. We see that the GO constraint can produce more reliable clustering result than the gene expression constraint, because the GO annotation is based on prior knowledge of biologists more reliable than gene expression data. Furthermore, we often assume that anti-correlated genes are not within the same cluster, but in some cases this assumption is not true. However, when the weight w increases, the final performance does not always increase and w = 0.5 produces a local minimum of the GLL value. After that, the GLL value continue to increase to the next local maximum of the GLL value. The SGD reference cluster reconfirms that the GO-based constraint is more reliable. The best clustering performance occurs often when w = 0.9 on average. Therefore, we adopt the weight w = 0.8 for the simultaneous projection in all our experiments.

As far as Figure 3 is concerned, one major reason why GO information is more reliable for clustering is that the reference gene clusters from KEGG and SGD (Tables 1 and 2) are partly correlated with GO annotations. Therefore, we need to delete a certain fraction of GO annotations when perform clustering, and use only the gene expression constraint to predict the new gene functions compared with reference gene clusters. In this paper, we adopt the cross-validation procedure [10] to validate the POCS-based MGC method. More specifically, we perform a five-fold cross-validation by deleting 20% GO constraints from the datasets in turn. We shall examine whether the POCS-based MGC clustering method can predict the functions for those 20% genes without GO constraints as compared to reference KEGG and SGD gene clusters. We repot the average prediction performance for the five-fold cross-validation.

After we fix M = 3 and w = 0.8, we compare our POCS-based MGC method with three state-of-the-art MGC methods: k-medoids [10], ICM [12] and FCM [13]. Both k-medoids and ICM first linearly combine two constraints and , and then use the ICM and k-medoids algorithms to partition the genes into different clusters. We empirically determine the linear combination weight of the GO constraint w = 0.9 for k-medoids, which can produce the desirable clustering results in terms of GLL on average. For the ICM algorithm [12], we choose the best recommended parameter w = 0.2, which is biased toward the gene expression constraint. On the other hand, FCM uses GO annotations to initialize X₀, and uses both initial X₀ and gene expression values to update X₀ until it converges to a new clustering solution X_M. We use the best suggested weight w = 0.8 for FCM [13], which is biased toward the GO constraint for soft clustering.

Tables 3, 4, 5 and 6 show the average clustering performance and standard deviation in terms of GLL and NMI based on soft clustering solution X and the hard clustering solution X*, respectively. We see that the POCS produces the highest GLL value among all MGC methods, which means that its soft clustering solution is the most likely to generate both KEGG and SGD reference clusters. The k-medoids algorithm performs the worst, partly because it is easy to fall into the local optimal clustering solution. ICM uses an iterative procedure to find a better clustering solution by the combined constraint, but it is biased to the unreliable gene expression constraint. FCM performs slightly better than ICM partly because it is biased to the more reliable GO constraint. Compared with FCM, POCS significantly increases the GLL value around 15% on both KEGG and SGD reference clusters. Another observation is that the Hughes dataset has the highest GLL value, partly because it contains much longer gene expression profiles than alpha, cdc15, cdc28 and elu datasets. The longer gene expression profiles are more reliable for gene clustering. The NMI values are consistent with the GLL values, where if the soft clustering solution has a higher GLL value the corresponding hard clustering solution by the winner-take-all strategy also has a higher NMI value. Thus, the performance measure GLL can best account for this soft clustering solution, where the higher GLL value corresponds to better soft clustering solution. However, we observe that the GLL value varies much more than the NMI value, mainly because the soft clustering solution space is larger than that of the hard clustering. In some cases, the difference of NMI values between POCS and FCM is not significant. Thus, we need to examine the statistical significance in the difference of NMI values between POCS and FCM. Table 7 shows the p-values of pairwise t-test [38] over all five microarray datasets, which indicates that the NMI value of POCS is higher than the corresponding FCM results with a statistical significance of more than 99% for all datasets.

Table 3 Five-fold cross-validation of the GLL values on KEGG clusters

Full size table

Table 4 Five-fold cross-validation of the NMI values on KEGG clusters

Full size table

Table 5 Five-fold cross-validation of the GLL values on SGD clusters

Full size table

Table 6 Five-fold cross-validation of the NMI values on SGD clusters

Full size table

Table 7 P-values of pairwise t-test of POCS and FCM

Full size table

To further confirm the effectiveness of POCS-based MGC method, we show two clustering examples. First, the gene YPR145W involves two KEGG pathways "Amino acid metabolism" and "Energy metabolism" in Table 1. All other MGC algorithms misclassify this gene into a single cluster, but our POCS algorithm successfully classify it into two clusters with probabilities 0.7 and 0.3. This example confirms the effectiveness of our method for identifying genes in multiple functions. Second, we examine the gene YJL052W involving two SGD pathways "glycolysis" and "gluconeogenesis" in Table 2. We compute the p-values between each gene function in GO and the cluster (alpha dataset when K = 10) containing the gene YJL052W using Gene Ontology Term Finder http://db.yeastgenome.org/cgi-bin/GO/goTermFinder.pl. We then rank the gene functions according to their p-values, and the top function is assigned to the gene cluster. We find that the top function is "glycolysis" with the p-value 3.12e - 41, which is consistent with one of SGD pathways in which YJL052W involves. This example further confirms that the discovered clusters indeed reflect the true biological functions in terms of pathways.

Conclusion

This paper presents a novel MGC method within the generalized POCS framework, which successfully combines two constraints from different nature for gene clustering. In addition, we also propose the GLL to measure the soft clustering performance. Experimental results of five-fold cross-validation on different microarray datasets show that the POCS-based MGC method is competitive or superior to other state-of-the-art MGC methods based on KEGG and SGD reference gene clusters. In the future, we aim to incorporate more constraints such as DNA sequence features and gene network structures to improve gene clustering performance further. For example, the structural profiles of DNA sequences play important roles in key genetic processes such as transcription [39], replication [40], protein-DNA recognition [41], and tissue specificity [42]. We may use the similarity between structural profiles of DNA sequences as a new constraint for gene clustering. On the other hand, we may also develop more efficient supervised learning strategies to automatically determine the weights of simultaneous projections in Algorithm 1. For example, we may choose decision trees [43] or ensemble learning methods [44] to learn the weights of different constraints from training data, and apply these weights to clustering unknown genes for function prediction.

References

Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci 1998, 95(25):14863–8.
Article CAS PubMed PubMed Central Google Scholar
Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat Genet 1999, 22(3):281–5.
Article CAS PubMed Google Scholar
Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci 1999, 96(6):2907–12.
Article CAS PubMed PubMed Central Google Scholar
Dembélé D, Kastner P: Fuzzy C-means method for clustering microarray data. Bioinformatics 2003, 19: 973–980.
Article PubMed Google Scholar
Schliep A, Schönhuth A, Steinhoff C: Using hidden Markov models to analyze gene expression time course data. Bioinformatics 2003, 19: i255-i263.
Article PubMed Google Scholar
Kerr MK, Churchill GA: Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc Natl Acad Sci 2001, 98(16):8961–5.
Article CAS PubMed PubMed Central Google Scholar
Bar-Joseph Z: Analyzing time series gene expression data. Bioinformatics 2004, 20: 2493–2503.
Article CAS PubMed Google Scholar
Hanisch D, Zien A, Zimmer R, Lengauer T: Co-clustering of biological networks and gene expression data. Bioinformatics 2002, 18(Suppl 1):S145–54.
Article PubMed Google Scholar
Pan W: Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics 2006, 22(7):795–801.
Article CAS PubMed Google Scholar
Huang D, Pan W: Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data. Bioinformatics 2006, 22(10):1259–1268.
Article CAS PubMed Google Scholar
Aubry M, Monnier A, Chicault C, de Tayrac M, Galibert MD, Burgun A, Mosser J: Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets. BMC Bioinformatics 2006, 7: 241.
Article PubMed PubMed Central Google Scholar
Shiga M, Takigawa I, Mamitsuka H: Annotating gene function by combining expression data with a modular gene network. Bioinformatics 2007, 23(13):i468-i478.
Article CAS PubMed Google Scholar
Tari L, Baral C, Kim S: Fuzzy c-means clustering with prior biological knowledge. J Biomed Inform 2009, 42: 74–81.
Article PubMed PubMed Central Google Scholar
Tritchler D, Parkhomenko E, Beyene J: Filtering genes for cluster and network analysis. BMC Bioinformatics 2009, 10: 193.
Article PubMed PubMed Central Google Scholar
Zhu S, Zeng J, Mamitsuka H: Enhancing MEDLINE document clustering by incorporating MeSH semantic similarity. Bioinformatics 2009, 25(15):1944–1951.
Article CAS PubMed Google Scholar
Zhu S, Takigawa I, Zeng J, Mamitsuka H: Field independent probabilistic model for clustering multi-field documents. Information Processing & Management 2009, 45: 555–570.
Article Google Scholar
Stark H, Yang Y: Vector space projections: a numerical approach to signal and image processing, neural nets, and optics. New York: Wiley; 1998.
Google Scholar
Gan X, Liew AWC, Yan H: Microarray missing data imputation based on a set theoretic framework and biological knowledge. Nucleic Acids Res 2006, 34: 1608–1619.
Article CAS PubMed PubMed Central Google Scholar
Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF: A new method to measure the semantic similarity of GO terms. Bioinformatics 2007, 23: 1274–1281.
Article CAS PubMed Google Scholar
Zeng J, Liu ZQ: Markov Random Field-based Statistical Character Structure Modeling for Handwritten Chinese Character Recognition. IEEE Trans Pattern Anal Mach Intell 2008, 30(5):767–780.
Article PubMed Google Scholar
Zeng J, Liu ZQ: Type-2 fuzzy Markov random fields and their application to handwritten Chinese character recognition. IEEE Trans Fuzzy Syst 2008, 16(3):747–760.
Article Google Scholar
Feng W, Liu ZQ: Region-Level Image Authentication Using Bayesian Structural Content Abstraction. IEEE Trans Image Process 2008, 17(12):2413–2424.
Article PubMed Google Scholar
Zeng J, Feng W, Xie L, Liu ZQ: Cascade Markov random fields for stroke extraction of Chinese characters. Inf Sci 2010, 180: 301–311.
Article Google Scholar
Blei DM, Ng AY, Jordan MI: Latent Dirichlet allocation. J Mach Learn Res 2003, 3(4–5):993–1022.
Google Scholar
Zeng J, Liu ZQ: Type-2 Fuzzy Hidden Markov Models and Their Application to Speech Recognition. IEEE Trans Fuzzy Syst 2006, 14(3):454–467.
Article Google Scholar
Ramoni MF, Sebastianidagger P, Kohane IS: Cluster analysis of gene expression dynamics. Proc Natl Acad Sci 2002, 99: 9121–9126.
Article CAS PubMed PubMed Central Google Scholar
Lord PW, Stevens RD, Brass A, Goble CA: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 2003, 19: 1275–1283.
Article CAS PubMed Google Scholar
Adryan B, Schuh R: Gene-Ontology-based clustering of gene expression data. Bioinformatics 2004, 20: 2851–2852.
Article CAS PubMed Google Scholar
Bolshakova N, Azuaje F, Cunningham P: A knowledge-driven approach to cluster validity assessment. Bioinformatics 2005, 21: 2546–2547.
Article CAS PubMed Google Scholar
Guo X, Liu R, Shriver CD, Hu H, Liebman MN: Assessing semantic similarity measures for the characterization of human regulatory pathways. Bioinformatics 2006, 22: 967–973.
Article CAS PubMed Google Scholar
Wolting C, McGlade CJ, Tritchler D: Cluster analysis of protein array results via similarity of Gene Ontology annotation. BMC Bioinformatics 2006, 7: 338.
Article PubMed PubMed Central Google Scholar
Steuer R, Humburg P, Selbig J: Validation and functional annotation of expression-based clusters based on gene ontology. BMC Bioinformatics 2006, 7: 380.
Article PubMed PubMed Central Google Scholar
Schlicker A, Domingues FS, Rahnenführer J, Lengauer T: A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics 2006, 7: 302.
Article PubMed PubMed Central Google Scholar
Sevilla JL, Segura V, Podhorski A, Guruceaga E, Mato JM, Martinez-Cruz LA, Corrales FJ, Rubio A: Correlation between gene expression and GO semantic similarity. IEEE/ACM Trans Comput Biol Bioinform 2005, 2: 330–338.
Article CAS PubMed Google Scholar
Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9: 3273–3297.
Article CAS PubMed PubMed Central Google Scholar
Hughes TR, et al.: Functional discovery via a compendium of expression profiles. Cell 2000, 102: 109–26.
Article CAS PubMed Google Scholar
Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 2006, 34: D354-D357.
Article CAS PubMed PubMed Central Google Scholar
Kreyszig E: Introductory Mathematical Statistics. New York: John Wiley & Sons; 1970.
Google Scholar
Cao XQ, Zeng J, Yan H: Structural property of regulatory elements in human promoters. Phys Rev E 2008, 77: 041908.
Article Google Scholar
Cao XQ, Zeng J, Yan H: Structural properties of replication origins in yeast DNA sequences. Phys Biol 2008, 5: 036012.
Article PubMed Google Scholar
Cao XQ, Zeng J, Yan H: Physical signals for protein-DNA recognition. Phys. Biol 2009, 6: 036012.
Google Scholar
Zeng J, Cao XQ, Zhao H, Yan H: Finding human promoter groups based on DNA physical properties. Phys Rev E 2009, 80: 041917.
Article Google Scholar
Zeng J, Zhao XY, Cao XQ, Yan H: SCS: Signal, context and structure features for genome-wide human promoter recognition. IEEE/ACM Trans Comput Biol Bioinform 2010, in press.
Google Scholar
Zeng J, Zhu S, Yan H: Towards accurate human promoter recognition: a review of currently used sequence features and classification methods. Briefings in Bioinformatics 2009, 10(5):498–508.
Article CAS PubMed Google Scholar

Download references

Acknowledgements

Great thanks are due to Xiao-Qin Cao and Xiao-Yu Zhao for their assistance in code implementation. This work is supported by the Hong Kong Research Grant Council (Project CityU 122607). This work is also supported by the National Nature Science Foundation of China (No. 60903076) and the Shanghai Committee of Science and Technology, China (No. 08DZ2271800 and 09DZ2272800).

Author information

Authors and Affiliations

School of Computer Science and Technology, Soochow University, Suzhou, 215006, China
Jia Zeng
Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong
Jia Zeng
School of Computer Science and Technology, Fudan University, Shanghai, 200433, China
Shanfeng Zhu
Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, 200433, China
Shanfeng Zhu
School of Information and Communication Technology, Griffith University, Gold Coast Campus, QLD, 4222, Queensland, Australia
Alan Wee-Chung Liew
Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong
Hong Yan
School of Electronic and Information Engineering, University of Sydney, NSW 2006, Australia
Hong Yan

Authors

Jia Zeng
View author publications
You can also search for this author in PubMed Google Scholar
Shanfeng Zhu
View author publications
You can also search for this author in PubMed Google Scholar
Alan Wee-Chung Liew
View author publications
You can also search for this author in PubMed Google Scholar
Hong Yan
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jia Zeng.

Additional information

Authors' contributions

JZ developed this methodology, carried out experiments and drafted the manuscript. ZSF and AWL provided useful comments on methodology and helped revise this manuscript. HY initiated the project and participated in project design and helped revise the manuscript. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Zeng, J., Zhu, S., Liew, A.WC. et al. Multiconstrained gene clustering based on generalized projections. BMC Bioinformatics 11, 164 (2010). https://doi.org/10.1186/1471-2105-11-164

Download citation

Received: 08 July 2009
Accepted: 31 March 2010
Published: 31 March 2010
DOI: https://doi.org/10.1186/1471-2105-11-164

Multiconstrained gene clustering based on generalized projections