Identification of coherent patterns in gene expression data using an efficient biclustering algorithm and parallel coordinate visualization
 Kin-On Cheng^{1},
 Ngai-Fong Law^{1},
 Wan-Chi Siu^{1} and
 Alan Wee-Chung Liew^{2}
DOI: 10.1186/1471-2105-9-210
© Cheng et al; licensee BioMed Central Ltd. 2008
Received: 12 September 2007
Accepted: 23 April 2008
Published: 23 April 2008
Abstract
Background
DNA microarray technology allows the expression levels of thousands of genes to be measured under tens or hundreds of different conditions. In microarray data, genes with similar functions usually co-express under certain conditions only [1]. Thus, biclustering, which clusters genes and conditions simultaneously, is preferred over traditional clustering for discovering these coherent genes. Various biclustering algorithms have been developed using different bicluster formulations. Unfortunately, many useful formulations result in NP-complete problems. In this article, we investigate an efficient method for identifying a popular type of bicluster that follows the additive model. Furthermore, parallel coordinate (PC) plots are used for bicluster visualization and analysis.
Results
We develop a novel and efficient biclustering algorithm which can be regarded as a greedy version of an existing algorithm known as the pCluster algorithm. By relaxing the homogeneity constraint, the proposed algorithm has polynomial-time complexity in the worst case instead of the exponential-time complexity of the pCluster algorithm. Experiments on artificial datasets verify that our algorithm can identify both additive-related and multiplicative-related biclusters in the presence of overlap and noise. Biologically significant biclusters have been validated on the yeast cell-cycle expression dataset using Gene Ontology annotations. A comparative study shows that the proposed approach outperforms several existing biclustering algorithms. We also provide an interactive exploratory tool based on PC plot visualization for determining the parameters of our biclustering algorithm.
Conclusion
We have proposed a novel biclustering algorithm which works with PC plots for an interactive exploratory analysis of gene expression data. Experiments show that the biclustering algorithm is efficient and is capable of detecting co-regulated genes. The interactive analysis allows the parameters of the biclustering algorithm to be tuned so as to achieve the best result. In the future, we will modify the proposed algorithm for other bicluster models such as the coherent evolution model.
Background
Gene expression matrix
Parallel coordinate plots
The parallel coordinate (PC) technique is a powerful method for visualizing and analyzing high-dimensional data in a two-dimensional setting [17, 18]. In this technique, each dimension is represented as a vertical axis, and the N axes are arranged in parallel to each other. By giving up the orthogonal representation, the number of dimensions that can be visualized is no longer restricted to two [19–21]. Studies have found that geometric structure is still preserved by the PC plot even though orthogonality is lost [17–21]. In a gene expression matrix, each gene is represented by a vector of conditions (i.e., a row) and each condition by a vector of genes (i.e., a column). Since gene expression data always involves a large number of genes as well as a certain number of experimental conditions, the PC technique is well suited to its analysis. Moreover, visualization of gene expression data is an important problem for biological knowledge discovery [22]. Thus, PC plots have been studied for gene expression data visualization [23, 24]. Further details about visualization of biclusters using PC plots are provided in Additional file 1. In the section "Methods", a new greedy algorithm for bicluster identification is presented, and an interactive approach to determining the parameters of the proposed biclustering algorithm based on PC visualization is discussed.
Methods
Identification of biclusters from difference matrix

the first bicluster is for rows R1, R3, R5, R9 and R11 in which the difference between "C5" and "C3" is zero, i.e., a constant bicluster;

the second bicluster is for rows R2, R4, R6, R8 and R10 in which the difference between "C5" and "C3" is two, i.e., an additive bicluster; and

the third bicluster involves row R7 only, thus it is not considered to be a valid bicluster.
Proposed algorithm for additive models
Additive-related biclusters can be found by progressively merging columns through studying the data distribution along each column in the difference matrix. If there is just one bicluster between two columns in the gene expression matrix, the distribution will have a single peak in one of the columns of the difference matrix. Related rows for this bicluster can then be identified. If there are multiple biclusters formed between two columns in the gene expression matrix, we can separate the rows into different groups by examining the distribution in the corresponding columns of the difference matrix. Therefore, by analyzing the distributions of difference values along columns of the difference matrix, peaks that correspond to different biclusters can be identified.
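The idea of reading biclusters off the peaks of a difference column can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy matrix is hypothetical, the function name `first_level_difference` is ours, and the sign convention (column j minus column i) is an arbitrary choice not fixed by the text.

```python
import itertools
import numpy as np

def first_level_difference(expr):
    """Return (D1, pairs): one difference column per distinct column pair (i, j), i < j."""
    n_cols = expr.shape[1]
    pairs = list(itertools.combinations(range(n_cols), 2))
    # Assumed convention: column j minus column i.
    d1 = np.column_stack([expr[:, j] - expr[:, i] for i, j in pairs])
    return d1, pairs

# Hypothetical toy data: rows 0-2 share the offset +2 between columns 0 and 1,
# rows 3-4 share the offset 0, so the (0, 1) difference column has two peaks.
expr = np.array([
    [1.0, 3.0, 7.0],
    [4.0, 6.0, 1.0],
    [2.0, 4.0, 9.0],
    [5.0, 5.0, 2.0],
    [8.0, 8.0, 6.0],
])
d1, pairs = first_level_difference(expr)
col01 = d1[:, pairs.index((0, 1))]
# Distinct difference values and how many rows fall on each one.
values, counts = np.unique(col01, return_counts=True)
```

With noise-free data the peaks are exact repeated values; with noisy data they become narrow clusters, which is why a noise threshold is needed when grouping the differences.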
$S_{i} = \{a \in X : |x_{i} - a| < \epsilon\}$ (2)
where i = 1, 2, ..., N. Also, the set of indices Q_{ i }associated with the values in S_{ i }can be obtained. Q_{ i }can be expressed by
Q_{ i }= {p ∈ {1, 2, ..., N}: x_{ p }∈ S_{ i }} (3)
(1) $|x_{i} - \overline{S}_{j}| \ge \epsilon$, where $\overline{\bullet}$ denotes the average of a set.
(2) $|S_{i}| \ge N_{r}$, where $|\bullet|$ denotes the cardinality of a set.
(3) $|\overline{S}_{i} - \overline{S}_{j}| \ge \epsilon$.
The corresponding set of indices Q'_{ j }is given by
Q'_{ j }= {p ∈ {1, 2, ..., N}:x_{ p }∈ S'_{ j }} (4)
S'_{j} and Q'_{j} are added to S' and Q' respectively if |S'_{j}| ≥ N_{r}. For the first-level difference matrix D_{1}, each Q'_{j} contains the row indices of the cluster S'_{j}. Each column of D_{1} consists of the difference values between columns i and j of the original expression matrix. Define the collection of row-index sets of the clusters to be U_{ij}. After finding U_{ij} for all distinct column pairs (i, j), the row-index sets of the clusters and their associated column pairs are collected to form a list of possible biclusters L_{1}, which can be expressed as
L_{1} = {(R_{ij}, (i, j)): U_{ij} ≠ φ, R_{ij} ∈ U_{ij}, i = 1, 2, ..., n − 1 and j = i + 1, i + 2, ..., n} (5)
As one always tries to find the biggest bicluster, the possible biclusters in L_{1} are sorted by their number of rows (line 13) so that the bicluster with the largest number of rows is processed first.
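The construction of the sorted candidate list can be sketched as follows. This is a simplified illustration, not the authors' implementation: the cluster selection rule (repeatedly keeping the largest remaining ε-neighbourhood) is our assumption standing in for the paper's clustering step, and the names `greedy_clusters` and `candidate_biclusters` and the toy data are hypothetical.

```python
import itertools
import numpy as np

def greedy_clusters(x, eps, n_r):
    """Simplified selection (assumption): repeatedly keep the largest remaining
    neighbourhood {p : |x_i - x_p| < eps} that has at least n_r members."""
    remaining = np.ones(len(x), dtype=bool)
    clusters = []
    while True:
        close = np.abs(x[:, None] - x[None, :]) < eps
        close &= remaining[:, None] & remaining[None, :]
        sizes = close.sum(axis=1)
        best = int(sizes.argmax())
        if sizes[best] < n_r:
            break
        members = np.flatnonzero(close[best])   # row indices of the cluster
        clusters.append(members)
        remaining[members] = False
    return clusters

def candidate_biclusters(expr, eps, n_r):
    """List of (row indices, (i, j)) over all column pairs, largest row set first."""
    l1 = []
    for i, j in itertools.combinations(range(expr.shape[1]), 2):
        for rows in greedy_clusters(expr[:, j] - expr[:, i], eps, n_r):
            l1.append((rows, (i, j)))
    l1.sort(key=lambda item: len(item[0]), reverse=True)
    return l1

# Hypothetical toy data with two column-pair groups between columns 0 and 1.
expr = np.array([
    [1.0, 3.0, 7.0],
    [4.0, 6.0, 1.0],
    [2.0, 4.0, 9.0],
    [5.0, 5.0, 2.0],
    [8.0, 8.0, 6.0],
])
l1 = candidate_biclusters(expr, eps=0.5, n_r=2)
```

Each entry of `l1` pairs a row-index set with the column pair it was found in, so the first entry is the seed with the most rows.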
Starting from the biggest bicluster l_{1} in the sorted list of possible biclusters L_{2}, the second-level difference matrix D_{2} is formed as in line 24, in which one of the bicluster columns (column c_{k1} or c_{k2}) is compared with all the remaining columns on the chosen rows (e.g. the difference matrices illustrated in Figure 3). Note that the second-level difference matrix D_{2} can be obtained directly from the first-level difference matrix D_{1}. Before D_{2} is calculated, early termination can be introduced as an optional step, as presented in lines 17–23. In the early termination, the biclusters in L_{2} which significantly overlap with already identified biclusters are skipped, as they are unlikely to yield a well-distinguishable bicluster according to the given parameter P_{o}. As for D_{1}, clustering and sorting are performed for D_{2}, as described in lines 26–33. As a result, a list of possible column segments H_{2} for growing the current bicluster is obtained. In lines 34–41, a possible bicluster (R, C) is constructed based on the row intersection with each column segment in H_{2}. Initially, (R, C) is set to be the current bicluster l_{k} in L_{2}. If the size of the row set R does not fall below the user-defined threshold N_{r} after the rows are intersected with a column segment, the column is included in C and R is updated. Otherwise, the process moves on to the next column segment until the last one has been examined. Finally, the bicluster is validated with respect to the given requirements on bicluster size and degree of overlap, as depicted in lines 42–50. Only a valid bicluster is output (lines 51–53).
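The growing step described above (row intersection with each column segment, accepting a column only if at least N_{r} rows survive) can be sketched as below. The function name `grow_bicluster`, the data layout of the column segments, and the example values are hypothetical; overlap checking and final validation are deliberately left out.

```python
def grow_bicluster(seed_rows, seed_cols, column_segments, n_r):
    """Greedily add columns while the shared row set keeps at least n_r rows.

    column_segments: iterable of (row_indices, column) pairs, assumed to be
    already sorted with the largest row set first.
    """
    rows = set(seed_rows)
    cols = list(seed_cols)
    for seg_rows, col in column_segments:
        if col in cols:
            continue
        shared = rows & set(seg_rows)
        if len(shared) >= n_r:      # row set does not fall below the threshold
            rows = shared           # update R
            cols.append(col)        # include the column in C
    return sorted(rows), cols

# Hypothetical seed on columns (0, 1) and three candidate column segments.
segments = [([0, 1, 2, 3], 2), ([3, 4], 3), ([0, 1, 2], 4)]
grown_rows, grown_cols = grow_bicluster([0, 1, 2, 3, 4], (0, 1), segments, n_r=3)
```

Because each accepted column permanently shrinks the row set, the result depends on the order of the segments, which is the greedy aspect of the procedure.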
Relation to existing δ-pCluster approaches
The proposed algorithm identifies biclusters which are homogeneous in each column pair. In this section, we show that the biclusters can be expressed as δ-pClusters [25]. Hence, any submatrix in an identified bicluster has similar homogeneity to that bicluster, and the problem of outliers found in the Cheng and Church algorithm [12] can be avoided. Denote a bicluster with a subset of rows U and a subset of columns V by B = (U, V). The bicluster B is a δ-pCluster if for each 2 × 2 submatrix M, the following condition holds
$|a_{ij} - a_{in} - (a_{mj} - a_{mn})| \le \delta$ (6)
where $M=\left[\begin{array}{cc}{a}_{ij}& {a}_{in}\\ {a}_{mj}& {a}_{mn}\end{array}\right]$, a_{ij} denotes a value of the expression matrix at position (i, j), i, m ∈ U and j, n ∈ V. In our algorithm, the clustering (the second step) performed in the second-level difference matrix ensures that there exists a column k ∈ V such that
$|a_{ij} - a_{ik} - l_{jk}| < \epsilon$ for all j ∈ V and some constant l_{jk} (7)
where ε is the noise threshold parameter of the proposed algorithm. Hence, for any i, m ∈ U, we have
$|a_{ij} - a_{ik} - (a_{mj} - a_{mk})| = |a_{ij} - a_{ik} - l_{jk} - (a_{mj} - a_{mk} - l_{jk})| \le |a_{ij} - a_{ik} - l_{jk}| + |a_{mj} - a_{mk} - l_{jk}| < 2\epsilon$ (8)
where the last inequality follows from inequality (7). For a column n ∈ V with n ≠ k, using inequality (8), it is shown that
$|a_{ij} - a_{in} - (a_{mj} - a_{mn})| = |a_{ij} - a_{ik} - (a_{mj} - a_{mk}) + a_{ik} - a_{in} - (a_{mk} - a_{mn})| \le |a_{ij} - a_{ik} - (a_{mj} - a_{mk})| + |(a_{ik} - a_{in}) - (a_{mk} - a_{mn})| < 4\epsilon$ (9)
This means that the bicluster B is a δ-pCluster with δ = 4ε. Although the biclusters identified by our algorithm are δ-pClusters, it should be emphasized that our algorithm is not designed specifically for detecting δ-pClusters but rather is based on the clustering results in the difference matrix. Hence, there are some differences between our biclustering strategy and other δ-pCluster algorithms such as the pCluster algorithm [25] and the approach of Yoon et al. [13]. Specifically, our algorithm takes into account the cluster density, in which cluster centroids are considered. In contrast, the other two δ-pCluster-based algorithms rely only on the inter-distances between elements in the difference matrix as defined by inequality (6), which results in exponential-time complexity in the worst case. Our proposed algorithm can be regarded as a greedy version of the other two algorithms. In particular, for each column-pair bicluster, our proposed algorithm derives a possible bicluster by greedily finding a larger column set through sequential intersection with other column-pair biclusters. The large column-pair biclusters usually contain the whole or a large part of the true gene set. At the same time, these simplifications reduce the worst-case complexity from exponential time to polynomial time.
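The δ-pCluster condition of inequality (6) can be checked directly on a candidate submatrix. The sketch below is an illustration only: `max_pscore` and `is_delta_pcluster` are our own helper names, the test data are hypothetical, and the check uses the fact that, for a fixed column pair, the largest 2 × 2 score equals the spread (maximum minus minimum) of the column difference over the rows.

```python
import itertools
import numpy as np

def max_pscore(sub):
    """Largest |a_ij - a_in - (a_mj - a_mn)| over all 2x2 submatrices of `sub`."""
    worst = 0.0
    for j, n in itertools.combinations(range(sub.shape[1]), 2):
        diff = sub[:, j] - sub[:, n]            # a_ij - a_in for every row i
        worst = max(worst, float(diff.max() - diff.min()))
    return worst

def is_delta_pcluster(sub, delta):
    """True if every 2x2 submatrix satisfies inequality (6) for this delta."""
    return max_pscore(sub) <= delta

# Hypothetical additive bicluster: a_ij = row offset + column offset.
row_offsets = np.array([0.0, 1.0, 5.0, 2.0])
col_offsets = np.array([1.0, 4.0, 2.0])
additive = row_offsets[:, None] + col_offsets[None, :]

# Perturb every entry by less than 0.1; four entries enter each score,
# so the score cannot exceed 0.4.
rng = np.random.default_rng(0)
noisy = additive + rng.uniform(-0.1, 0.1, size=additive.shape)
```

A perfect additive bicluster has a maximum score of zero, while bounded perturbations of the entries raise the score by at most four times the bound.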
Complexity estimation
In general, a biclustering problem is NP-complete [11]. However, we have adopted a simple clustering algorithm and bicluster growing strategy to reduce the complexity. Given a matrix of size m × n, the complexity of obtaining the difference matrix is O(mn^{2}). The simple clustering algorithm applied to each column requires on the order of O(m^{2}) operations because it involves comparing the value of each element with the others and with the centroids of the found clusters. In addition, the total number of clusters found in a column cannot exceed m. Therefore, the complexity of obtaining clusters in the difference matrix is O(m^{2}n^{2}) and the number of clusters is at most mn(n − 1)/2. Sorting the clusters has a complexity of O(mn^{2} log(mn^{2})). After that, each identified cluster is used as a seed to construct a bicluster. In the bicluster growing process, a seed is first checked for significant overlap with other identified biclusters for early termination. The overlap in rows can be checked by sorting followed by element-wise comparison, so the complexity is O(m log m). For columns, as a seed has only two columns, the complexity is O(n). Note that the number of identified biclusters is bounded by the number of seeds. Thus, the complexity of checking overlaps against all identified biclusters is O(mn^{2} (n + m log m)). If the seed is valid, a submatrix of the difference matrix is extracted as the second-level difference matrix. This step requires no arithmetic operations because the data are reused. Clustering and sorting are then performed on this second-level difference matrix. As the matrix has only n − 1 columns, the clustering and sorting processes need on the order of O(m^{2}n) and O(mn log(mn)) operations, respectively. Note that at most (n − 1)m clusters are detected in the second-level difference matrix. In the bicluster construction, row intersection is performed. In total, the complexity is O(m^{2}n log m).
Finally, the newly identified bicluster is validated (i.e. filtered) with respect to the number of columns and the degree of overlap with other biclusters. The validation requires an additional complexity of O(mn^{2} (m log m + n log n)). Among the operations for obtaining each bicluster from the first-level difference matrix, the validation step dominates. So the entire processing for bicluster formation from seeds is O(m^{2}n^{4} (m log m + n log n)). Since this cost dominates all other costs in the previous steps, our algorithm has a polynomial-time complexity of O(m^{2}n^{4} (m log m + n log n)). This estimate is the worst-case complexity, in which the validation process dominates. In practice, the number of biclusters is far less than mn(n − 1)/2. Moreover, some of the validation steps can be avoided through early termination of invalid biclusters. Eliminating invalid biclusters reduces the number of potential biclusters, which in turn reduces the cost of the validation step.
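As a rough worked instance of the seed bound, take a matrix with the dimensions of the yeast dataset used later in the paper, m = 2884 genes and n = 17 conditions. The arithmetic below is purely illustrative and is not part of the original analysis:

$$\frac{n(n-1)}{2} = \frac{17 \times 16}{2} = 136, \qquad \frac{mn(n-1)}{2} = 2884 \times 136 = 392\,224.$$

So at most 136 column pairs, and at most 392 224 seeds, would have to be examined in the worst case; as noted above, the number actually encountered is expected to be far smaller.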
Modification for multiplicative models
As seen in Figure 1(E), a multiplicative-related bicluster is a bicluster in which any two rows are related by the same ratio in all the related columns, or any two columns are related by the same ratio in all the related rows. To adapt the proposed framework to multiplicative models, the difference matrix is replaced by a ratio matrix of the form c_{i}/c_{j} or c_{j}/c_{i} for all the n(n − 1)/2 distinct combinations of columns i and j, where c_{k} represents the values in the k-th column. In practice, we select the column with the largest average magnitude as the denominator because a quotient is sensitive to noise when the divisor is small. Thus, the major change for detecting multiplicative-related biclusters is to replace the difference matrix by a ratio matrix. Note that the complexity for multiplicative models is essentially the same as that for additive models.
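A ratio matrix with the denominator rule described above might be built as in the following sketch. The function name `ratio_matrix` and the toy data are hypothetical, "average magnitude" is interpreted here as the mean absolute value of a column, and the code assumes that no denominator entry is zero.

```python
import itertools
import numpy as np

def ratio_matrix(expr):
    """One ratio column per distinct column pair; the column with the larger
    mean absolute value is used as the denominator."""
    mean_mag = np.abs(expr).mean(axis=0)
    ratios, pairs = [], []
    for i, j in itertools.combinations(range(expr.shape[1]), 2):
        num, den = (i, j) if mean_mag[j] >= mean_mag[i] else (j, i)
        ratios.append(expr[:, num] / expr[:, den])
        pairs.append((num, den))               # (numerator, denominator)
    return np.column_stack(ratios), pairs

# Hypothetical multiplicative bicluster: column 1 is twice column 0.
expr = np.array([
    [1.0, 2.0],
    [3.0, 6.0],
    [5.0, 10.0],
])
r, pairs = ratio_matrix(expr)
```

All rows of a multiplicative-related bicluster share the same ratio for a given column pair, so the same peak-finding step used for differences applies to the ratio columns.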
Interactive adjustment of noise threshold using PC plots
Results and Discussion
Evaluation methods
where a bicluster with a subset of genes U_{ i }and a subset of conditions V_{ i }is denoted by (U_{ i }, V_{ i }). ${S}_{V}^{\ast}({M}_{1},{M}_{2})$ is defined similarly with U replaced by V. Let M be the set of detected biclusters and M_{ t }be the set of true biclusters embedded in the artificial expression dataset. The overall match score S*(M, M_{ t }) quantifies the average relevance of the detected biclusters to the true biclusters. Conversely, S*(M_{ t }, M) measures the average recovery of the true biclusters in the detected biclusters. To unify the two measures into a single quantity for evaluation, their average is computed as the biclustering accuracy.
The performance of the proposed algorithm on artificial datasets has been compared with two existing algorithms that assume the additive model, namely the Cheng and Church (C&C) algorithm [12] and the pCluster algorithm [25]. We considered the biclustering accuracy together with other measures such as the number of biclusters, bicluster size and processing time. The programs for both algorithms are publicly available [27, 28]. The proposed algorithm was implemented as a C MEX-file and run in Matlab 6.5. All the experiments were conducted on the Windows XP platform on a computer with a 2.4 GHz Intel Pentium 4 CPU and 512 MB RAM. For the identification of multiplicative-related biclusters, since the C&C algorithm and the pCluster algorithm are designed for additive models, a logarithm was applied to the expression data so that the multiplicative models become additive models. For comparison, we also applied the proposed algorithm for additive models to the logarithm values. Henceforth, the proposed algorithm for additive models and multiplicative models will be referred to as PA and PM respectively, while the proposed algorithm for additive models with the logarithm operation as preprocessing will be referred to as PAL.
In other words, the p-value is the probability of including genes of a given category in a cluster by chance. Thus, an over-represented bicluster is a cluster of genes which is very unlikely to be obtained randomly. The annotations consist of three ontologies, namely biological process, cellular component and molecular function.
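One common way to compute such a chance probability is a hypergeometric tail, sketched below. This is an assumption on our part: the excerpt does not state which distribution or tool was used, and the function name `enrichment_p_value` and the example numbers are hypothetical.

```python
from math import comb

def enrichment_p_value(n_total, n_category, n_cluster, n_hits):
    """P(X >= n_hits) when n_cluster genes are drawn without replacement from
    n_total genes, of which n_category carry the annotation (hypergeometric tail)."""
    upper = min(n_category, n_cluster)
    total = comb(n_total, n_cluster)
    tail = sum(
        comb(n_category, k) * comb(n_total - n_category, n_cluster - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / total

# Hypothetical example: 10 genes in total, 5 annotated, a cluster of 2 genes
# in which both are annotated.
p = enrichment_p_value(n_total=10, n_category=5, n_cluster=2, n_hits=2)
```

A small value means that seeing at least this many annotated genes in a random gene set of the same size would be unlikely.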
where c_row_{ij} is the correlation coefficient between rows i and j and c_col_{pq} is the correlation coefficient between columns p and q. ACV is applicable to additive models as well as multiplicative models, but the MSRS is valid only for additive models. To measure the homogeneity of multiplicative-related biclusters, a logarithm was applied to the expression values before calculating MSRS values so that a multiplicative-related bicluster can be formulated using an additive model. To avoid confusion, the MSRS for the logarithm of expression values is denoted by MSRS_{l}. A bicluster with high homogeneity in expression levels should have a low MSRS/MSRS_{l} value but a high ACV value. The minimum value of MSRS/MSRS_{l} is zero while ACV has a maximum value of one.
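The two homogeneity measures can be sketched as below. The defining equations are not reproduced in this excerpt, so the code assumes the usual forms: the mean squared residue of Cheng and Church for MSRS, and for ACV the larger of the average absolute off-diagonal row correlation and the average absolute off-diagonal column correlation. The function names and toy data are hypothetical.

```python
import numpy as np

def msrs(sub):
    """Mean squared residue score (assumed Cheng and Church form)."""
    row_mean = sub.mean(axis=1, keepdims=True)
    col_mean = sub.mean(axis=0, keepdims=True)
    residue = sub - row_mean - col_mean + sub.mean()
    return float((residue ** 2).mean())

def acv(sub):
    """Average correlation value (assumed form): the larger of the mean absolute
    off-diagonal row correlation and the mean absolute off-diagonal column
    correlation. Rows or columns with zero variance give NaN correlations."""
    def mean_abs_offdiag(corr):
        k = corr.shape[0]
        return (np.abs(corr).sum() - k) / (k * k - k)
    return float(max(mean_abs_offdiag(np.corrcoef(sub)),
                     mean_abs_offdiag(np.corrcoef(sub.T))))

pattern = np.array([1.0, 4.0, 2.0, 8.0])
additive = np.array([0.0, 1.0, 5.0])[:, None] + pattern[None, :]
multiplicative = np.array([1.0, 2.0, 3.0])[:, None] * pattern[None, :]

scores = {
    "additive": (msrs(additive), acv(additive)),
    "multiplicative": (msrs(multiplicative), acv(multiplicative)),
    "multiplicative_log": (msrs(np.log(multiplicative)), acv(multiplicative)),
}
```

On this toy data the multiplicative bicluster has a non-zero MSRS until the logarithm is applied, which mirrors the reason MSRS_{l} is used for multiplicative models, while ACV is close to one in both cases.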
The statistical properties of the biclustering results refer to quantities including the number of discovered biclusters and the bicluster size. Comparative studies were performed in these three aspects with several existing biclustering algorithms, namely C&C, the iterative signature algorithm (ISA) [32, 33], the order-preserving submatrix (OPSM) approach [1] and xMotifs [34], which are available in [27]. In addition, the computational complexity of the proposed algorithm and the other approaches is estimated using processing time, as done for the artificial datasets. Although processing time depends on factors such as programming language and parameter settings, a rough comparison of complexity can still be made.
Datasets
Two types of artificial datasets were considered, one for the additive models and the other for the multiplicative models. The first type of dataset TD1 had a size of 200 rows by 40 columns. Uniformly distributed random values were first generated. Then four biclusters were embedded. Their details are as follows:

bicluster A is a constant row bicluster of size 40 × 7;

bicluster B is a constant row bicluster of size 25 × 10;

bicluster C is a constant column bicluster of size 35 × 8; and

bicluster D, of size 40 × 8, has coherent values related by additions.
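A dataset with the same shape and bicluster types as TD1 could be generated along the following lines. Everything beyond the sizes and bicluster types listed above is an assumption made for illustration: the value range, the random seed, the non-overlapping placement of the biclusters, the absence of added noise, and the function name `make_td1_like`.

```python
import numpy as np

def make_td1_like(seed=0, low=0.0, high=10.0):
    """Return (data, truth) for a 200 x 40 matrix with four embedded biclusters."""
    rng = np.random.default_rng(seed)
    data = rng.uniform(low, high, size=(200, 40))   # uniform background
    truth = {}

    def place(name, r0, n_rows, c0, n_cols, block):
        rows = np.arange(r0, r0 + n_rows)
        cols = np.arange(c0, c0 + n_cols)
        data[np.ix_(rows, cols)] = block
        truth[name] = (rows, cols)

    # A and B: constant rows (one value per row, repeated across the columns).
    place("A", 0, 40, 0, 7, np.repeat(rng.uniform(low, high, (40, 1)), 7, axis=1))
    place("B", 40, 25, 7, 10, np.repeat(rng.uniform(low, high, (25, 1)), 10, axis=1))
    # C: constant columns (one value per column, repeated down the rows).
    place("C", 65, 35, 17, 8, np.repeat(rng.uniform(low, high, (1, 8)), 35, axis=0))
    # D: additive model (row offset plus column offset).
    place("D", 100, 40, 25, 8,
          rng.uniform(low, high, (40, 1)) + rng.uniform(low, high, (1, 8)))
    return data, truth

data, truth = make_td1_like()
```

The `truth` dictionary records the rows and columns of each embedded bicluster, which is the information needed to score detected biclusters against the true ones.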
The real dataset used was the yeast Saccharomyces cerevisiae cell cycle dataset as used in [12], which contains 2884 genes and 17 conditions. The non-missing values were all non-negative. As multiplicative models were also investigated, non-missing values equal to zero were set to small positive values. The missing values were filled with positive uniformly distributed random values to minimize their influence on our analysis.
Performance on artificial datasets
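Parameter settings for biclustering algorithms and post-filtering in the experiments on artificial datasets

| Experiment | Algorithm/post-filtering | Parameter settings* |
| --- | --- | --- |
| Artificial datasets for additive models | PA | ε = 0.5 – 2.0, N_{r} = 21, N_{c} = 5, P_{o} = 50 |
| Artificial datasets for additive models | C&C | δ = 0.04 – 0.5, α = 1.2, M = 40 |
| Artificial datasets for additive models | pCluster | δ = 0.5 – 1.0, N_{r} = 21, N_{c} = 5 |
| Artificial datasets for additive models | Post-filtering | N_{r} = 21, N_{c} = 5, P_{o} = 50 and M = 10 |
| Artificial datasets for multiplicative models | PM | ε = 0.2 – 0.6, N_{r} = 18, N_{c} = 4, P_{o} = 25 |
| Artificial datasets for multiplicative models | PAL | ε = 0.4 – 1.0, N_{r} = 18, N_{c} = 4, P_{o} = 25 |
| Artificial datasets for multiplicative models | C&C | δ = 0.04 – 0.5, α = 1.2, M = 20 |
| Artificial datasets for multiplicative models | pCluster | δ = 0.5 – 1.0, N_{r} = 18, N_{c} = 4 |
| Artificial datasets for multiplicative models | Post-filtering | N_{r} = 18, N_{c} = 4, P_{o} = 25 and M = 5 |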
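Statistical properties of biclustering results for the artificial datasets embedded with additive-related biclusters before post-filtering

| Property | Algorithm | Noise s.d. 0 | Noise s.d. 0.1 | Noise s.d. 0.2 | Noise s.d. 0.3 | Noise s.d. 0.4 | Noise s.d. 0.5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Average number of biclusters | PA | 4 | 4 | 4 | 4.4 | 5.8 | 6.6 |
| Average number of biclusters | C&C | 40 | 40 | 40 | 40 | 40 | 40 |
| Average number of biclusters | pCluster | 23 | 366.6 | 378.2 | 255.6 | 124.2 | 21.40 |
| Average number of rows | PA | 35 | 35 | 34.85 | 33.67 | 31.87 | 32.25 |
| Average number of rows | C&C | 7.625 | 7.640 | 8.025 | 7.980 | 8.265 | 7.895 |
| Average number of rows | pCluster | 25.57 | 23.14 | 23.28 | 22.98 | 22.69 | 21.74 |
| Average number of columns | PA | 8.250 | 8.250 | 8.250 | 7.480 | 7.262 | 7.228 |
| Average number of columns | C&C | 3.725 | 4.500 | 4.915 | 4.765 | 5.070 | 5.280 |
| Average number of columns | pCluster | 5.217 | 5.224 | 5.115 | 4.873 | 4.468 | 4.237 |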
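Statistical properties of biclustering results for the artificial datasets embedded with multiplicative-related biclusters before post-filtering

| Property | Algorithm | Noise s.d. 0 | Noise s.d. 0.1 | Noise s.d. 0.2 | Noise s.d. 0.3 | Noise s.d. 0.4 | Noise s.d. 0.5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Average number of biclusters | PM | 2 | 2 | 2.2 | 3 | 2.4 | 2.6 |
| Average number of biclusters | PAL | 2 | 2 | 2.2 | 3.2 | 3 | 3.6 |
| Average number of biclusters | C&C | 20 | 20 | 20 | 20 | 20 | 20 |
| Average number of biclusters | pCluster | 1109 | 956.4 | 855.8 | 753.4 | 658.8 | 729.2 |
| Average number of rows | PM | 25 | 24.80 | 24.17 | 23.72 | 20.20 | 23.40 |
| Average number of rows | PAL | 25 | 23.70 | 21 | 20.10 | 19.52 | 19.72 |
| Average number of rows | C&C | 5.850 | 5.560 | 6.900 | 8.820 | 8.270 | 6.940 |
| Average number of rows | pCluster | 19.94 | 19.93 | 19.87 | 19.75 | 19.71 | 19.66 |
| Average number of columns | PM | 7 | 7 | 6.367 | 6.400 | 5.900 | 5.933 |
| Average number of columns | PAL | 7 | 7 | 6.433 | 5.517 | 5.567 | 5.417 |
| Average number of columns | C&C | 4.300 | 3.870 | 5.020 | 5.660 | 5.570 | 5.110 |
| Average number of columns | pCluster | 4.346 | 4.345 | 4.297 | 4.224 | 4.178 | 4.158 |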
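Average processing time for the artificial datasets

| Dataset | Algorithm | Average time (sec) |
| --- | --- | --- |
| Artificial datasets for additive models with noise s.d. of 0.3 | PA | 3.776 |
| Artificial datasets for additive models with noise s.d. of 0.3 | C&C | 17 |
| Artificial datasets for additive models with noise s.d. of 0.3 | pCluster | 2716 |
| Artificial datasets for multiplicative models with noise s.d. of 0.3 | PM | 0.0232 |
| Artificial datasets for multiplicative models with noise s.d. of 0.3 | C&C (log) | 1 |
| Artificial datasets for multiplicative models with noise s.d. of 0.3 | pCluster (log) | 105.2 |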
Performance on a real dataset
Parameter settings of algorithms and post-processing investigated in experiments based on the yeast dataset

| Algorithm/post-filtering | Parameter settings* |
| --- | --- |
| PA | ε = 60, N_{r} = 10, N_{c} = 5, P_{o} = 20 |
| PM | ε = 0.2, N_{r} = 10, N_{c} = 5, P_{o} = 20 |
| C&C | δ = 100, α = 1.2, M = 100 |
| C&C (log) | δ = 0.25, α = 1.2, M = 100 |
| ISA | t_{g} = 2, t_{c} = 1.0, number of initial sets = 500 |
| OPSM | l = 100 |
| xMotifs | n_{s} = 10, n_{d} = 1000, s_{d} = 4, p-value = 10^{-10}, α = 0.29, max. number of expression values = 50 |
| Filtering | N_{r} = 10, N_{c} = 5, P_{o} = 20 |
Homogeneity comparison of biclusters identified in the yeast cell-cycle dataset using various algorithms

| Algorithm | MSRS/MSRS_{l}* min | MSRS/MSRS_{l}* mean | MSRS/MSRS_{l}* max | ACV min | ACV mean | ACV max |
| --- | --- | --- | --- | --- | --- | --- |
| PA | 326.0 | 412.1 | 552.7 | 0.8960 | 0.9416 | 0.9755 |
| PM | 3.694 × 10^{-4} | 9.573 × 10^{-3} | 3.809 × 10^{-2} | 0.7493 | 0.9219 | 1 |
| C&C | 439.6 | 554.5 | 593.3 | 0.6481 | 0.9217 | 0.9768 |
| C&C (log) | 2.784 × 10^{-2} | 6.262 × 10^{-2} | 8.451 × 10^{-2} | 0.3489 | 0.5740 | 0.9000 |
| ISA | 108.9 | 489.6 | 794.6 | 0.8420 | 0.9247 | 0.9588 |
| OPSM | 480.4 | 497.1 | 513.8 | 0.8866 | 0.8904 | 0.8941 |
| xMotifs | 1.910 × 10^{-12} | 4.820 | 12.04 | 0.9982 | 0.9992 | 1 |
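Statistical comparison of biclusters identified in the yeast cell-cycle dataset using various algorithms

| Algorithm | no. of biclusters | size* min | size* max | no. of genes min | no. of genes mean | no. of genes max | no. of conditions min | no. of conditions mean | no. of conditions max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PA | 25 | 10 × 5 | 597 × 17 | 10 | 97.64 | 597 | 5 | 13.16 | 17 |
| PM | 59 | 10 × 6 | 518 × 17 | 10 | 46.09 | 518 | 5 | 9.085 | 17 |
| C&C | 31 | 10 × 5 | 1391 × 17 | 10 | 91.74 | 1391 | 5 | 11.6 | 17 |
| C&C (log) | 5 | 12 × 13 | 2270 × 17 | 12 | 486.2 | 2270 | 13 | 14.80 | 17 |
| ISA | 18 | 28 × 5 | 149 × 6 | 28 | 74.56 | 149 | 5 | 5.667 | 7 |
| OPSM | 2 | 132 × 7 | 469 × 5 | 132 | 300.5 | 469 | 5 | 6 | 7 |
| xMotifs | 13 | 11 × 5 | 115 × 5 | 11 | 40.08 | 115 | 5 | 5 | 5 |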
When the multiplicative model is considered, i.e. the proposed algorithm for the multiplicative model (PM) and C&C applied to log values (C&C (log)), the functional enrichment drops in general. At first glance, PM gives poorer performance in terms of functional enrichment. Nonetheless, if the number of identified biclusters is also considered, PM actually outperforms C&C (log) by identifying more significant biclusters. PM identified 59 biclusters in total, whereas C&C (log) found only 5. In addition, the biclusters identified by PM exhibit higher homogeneity. The average values of MSRS_{l} and ACV are 9.573 × 10^{-3} and 0.9219 respectively for PM. In comparison, the average values of MSRS_{l} and ACV are 6.262 × 10^{-2} and 0.5740 respectively for C&C (log).
In addition to the C&C-based algorithms, Figure 14 shows the comparative results of ISA, OPSM and xMotifs for different values of p_{0}. Although OPSM shows a high percentage of functionally enriched biclusters at large values of p_{0}, it finds only two biclusters, which is far fewer than expected. Thus, the proposed algorithms actually identify more functionally enriched biclusters. Also, the percentage of functionally enriched biclusters for OPSM drops to zero at low values of p_{0}. At low values of p_{0}, the results of ISA are the best in most cases. For p_{0} ≥ 5 × 10^{-4}, however, the performance of the proposed algorithm PA is close to or even better than that of ISA. For both OPSM and ISA, the identified biclusters are less homogeneous in terms of average MSRS and ACV because their bicluster models are different from those studied in this paper. PA and PM show better performance than xMotifs in the percentage of functionally enriched biclusters even though our algorithms have a lower average ACV value. The reason is that xMotifs is designed to find biclusters with a coherent state in each gene, which is only a subclass of additive models. The homogeneity analysis suggests that the difference in biological relevance of the biclusters identified by various algorithms, such as the proposed algorithm PA and ISA, is due not merely to implementation architecture but also to the model assumption.
Processing time for the yeast cell-cycle expression dataset using various biclustering algorithms

| Algorithm | PA | PM | C&C | C&C (log) | ISA | OPSM | xMotifs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Processing time (sec) | 0.72 | 1.35 | 32 | 1217 | 1590 | 38 | 937 |
Annotations of biological process ontology for biclusters identified by the proposed algorithm for additive models at p-value < 0.001.

| Bicluster index | Annotation | P-value | Corrected P-value | Genes |
| --- | --- | --- | --- | --- |
| 1 | chromatin modification | 3.84E-04 | 1.39E-01 | YBR081C, YBR198C, YDR392W, YDR448W, YGL112C, YMR236W, YNL097C |
| 1 | histone acetylation | 4.84E-04 | 1.76E-01 | YBR081C, YBR198C, YDR392W, YDR448W, YFL039C, YGL112C, YJL081C, YMR236W, YNL136W |
| 1 | endocytosis | 7.24E-04 | 2.63E-01 | YBR109C, YCL034W, YDR388W, YDR490C, YER166W, YFL039C, YGL106W, YHR001W, YHR073W, YNL084C, YNL227C, YNL243W, YOR089C, YOR109W, YOR327C, YPL145C |
| 2 | aerobic respiration | 2.47E-04 | 4.34E-02 | YBR026C, YDL174C, YDR231C, YHR001W, YMR030W, YMR081C, YPL132W, YPL159C |
| 3 | ribosome biogenesis and assembly | 2.75E-13 | 4.73E-11 | YBL024W, YBR142W, YDL153C, YDL167C, YDR060W, YDR120C, YDR312W, YDR365C, YEL026W, YER082C, YGL099W, YGR162W, YGR245C, YJL033W, YJR002W, YJR066W, YLL008W, YLR175W, YLR401C, YNL110C, YNL132W, YNL175C, YOL077C, YOR145C, YOR272W, YPL126W |
| 3 | 35S primary transcript processing | 2.23E-06 | 3.84E-04 | YCL031C, YDR339C, YER082C, YGR090W, YJL033W, YJR002W, YLL008W, YLR175W, YOR145C, YPR137W |
| 4 | ergosterol biosynthetic process | 1.08E-04 | 1.33E-02 | YLR450W, YML008C, YMR202W, YMR208W |
| 4 | translational elongation | 1.87E-06 | 2.30E-04 | YAL003W, YDL081C, YDR382W, YDR385W, YLR249W, YLR340W, YOL039W |
| 4 | regulation of translational fidelity | 7.04E-07 | 8.67E-05 | YBR048W, YDL229W, YDR025W, YGR118W, YNL209W, YPL081W |
| 4 | ribosomal small subunit assembly and maintenance | 6.90E-05 | 8.49E-03 | YBR048W, YDR025W, YDR447C, YGR214W, YLR048W, YLR167W |
| 4 | ribosomal large subunit assembly and maintenance | 6.83E-06 | 8.40E-04 | YBR142W, YDR312W, YLR075W, YLR340W, YLR448W, YML073C, YOL127W, YPR102C |
| 4 | translation | 6.92E-44 | 8.51E-42 | YBL072C, YBL092W, YBR048W, YBR268W, YDL061C, YDL075W, YDL081C, YDL082W, YDL083C, YDL136W, YDL191W, YDL229W, YDR012W, YDR025W, YDR064W, YDR382W, YDR447C, YDR450W, YDR471W, YDR500C, YER074W, YER117W, YER131W, YGR118W, YGR214W, YHL001W, YHR141C, YIL069C, YJL136C, YJL190C, YJR123W, YKL056C, YKL156W, YKR057W, YKR094C, YLR048W, YLR075W, YLR167W, YLR185W, YLR325C, YLR340W, YLR388W, YLR441C, YLR448W, YML026C, YML063W, YML073C, YMR143W, YMR242C, YNL067W, YNL096C, YNL162W, YNL209W, YNL301C, YNL302C, YNL306W, YOL039W, YOL040C, YOL127W, YOR167C, YOR234C, YOR293W, YOR312C, YOR369C, YPL081W, YPL090C, YPL143W, YPL198W, YPR102C |
| 6 | protein folding | 3.15E-04 | 1.51E-02 | YDR214W, YFL016C, YML130C, YNL007C, YOR027W |
| 6 | copper ion import | 3.98E-04 | 1.91E-02 | YLR411W, YPR124W |
| 8 | tricarboxylic acid cycle | 8.65E-04 | 1.03E-01 | YDR178W, YIL125W, YLL041C, YNL037C |
| 8 | mitochondrial electron transport, ubiquinol to cytochrome c | 4.23E-04 | 5.04E-02 | YEL024W, YHR001W, YOR065W |
| 8 | ubiquitin-dependent protein catabolic process | 3.41E-04 | 4.05E-02 | YBR173C, YDR394W, YJL001W, YMR119W, YOL038W |
| 9 | cytokinesis, contractile ring contraction | 1.51E-04 | 3.17E-03 | YBR038W, YHR023W |
| 10 | cell morphogenesis checkpoint | 8.19E-04 | 7.37E-02 | YJL187C, YKL101W |
| 10 | chitin biosynthetic process | 8.19E-04 | 7.37E-02 | YER096W, YNL233W |
| 10 | mitotic sister chromatid cohesion | 6.67E-06 | 6.01E-04 | YFL008W, YIL026C, YJL019W, YMR076C, YMR078C, YNL273W |
| 14 | glycolysis | 4.18E-04 | 3.81E-02 | YCL040W, YDR050C, YJR009C, YKL152C |
| 14 | ribosomal small subunit assembly and maintenance | 9.27E-04 | 8.43E-02 | YDR337W, YGR214W, YLR167W, YML024W |
| 15 | protein folding | 4.54E-04 | 1.22E-02 | YAL005C, YBR169C, YDR214W, YLR216C |
| 19 | tricarboxylic acid cycle | 7.49E-05 | 2.55E-03 | YDR148C, YLL041C, YNR001C |
| 24 | response to stress | 3.45E-04 | 3.45E-03 | YBR072W, YDR258C, YPL240C |
| 25 | SRP-dependent cotranslational protein targeting to membrane, translocation | 4.27E-04 | 8.12E-03 | YAL005C, YER103W |
| 25 | response to stress | 6.85E-05 | 1.30E-03 | YAL005C, YDR258C, YER103W, YPL240C |
| 25 | protein folding | 5.73E-04 | 1.09E-02 | YAL005C, YDR258C, YER103W |
| 25 | protein refolding | 2.86E-04 | 5.43E-03 | YAL005C, YPL240C |
Annotations of cellular component ontology for biclusters identified by the proposed algorithm for additive models at p-value < 0.001.

| Bicluster index | Annotation | P-value | Corrected P-value | Genes |
| --- | --- | --- | --- | --- |
| 1 | SLIK (SAGA-like) complex | 4.45E-04 | 7.56E-02 | YBR081C, YBR198C, YDR392W, YDR448W, YGL112C, YMR236W |
| 1 | transcription factor TFIID complex | 4.45E-04 | 7.56E-02 | YBR198C, YER148W, YGL112C, YML114C, YMR236W, YPL129W |
| 1 | INO80 complex | 3.75E-04 | 6.38E-02 | YDL002C, YFL039C, YJL081C, YLR052W, YPL129W |
| 3 | nucleolus | 4.60E-14 | 4.28E-12 | YAL059W, YBR142W, YCL031C, YCL054W, YDR312W, YDR339C, YDR365C, YDR378C, YEL026W, YEL055C, YGR090W, YGR159C, YJL033W, YJR002W, YLL008W, YLL034C, YLR175W, YML074C, YNL110C, YNL132W, YNL175C, YOL077C, YOL080C, YOR145C, YOR272W |
| 3 | small nucleolar ribonucleoprotein complex | 2.46E-04 | 2.29E-02 | YDL153C, YDR378C, YEL026W, YER082C, YGR090W, YJR002W, YPL126W, YPR137W |
| 4 | cytosolic large ribosomal subunit (sensu Eukaryota) | 1.76E-23 | 1.25E-21 | YBL092W, YDL075W, YDL081C, YDL082W, YDL136W, YDL191W, YDR012W, YDR382W, YDR471W, YDR500C, YER117W, YHL001W, YHR141C, YKR094C, YLR075W, YLR185W, YLR325C, YLR340W, YLR448W, YML073C, YMR242C, YNL067W, YNL162W, YNL301C, YOL039W, YOL127W, YOR234C, YOR312C, YPL143W, YPL198W, YPR102C |
| 4 | cytosolic small ribosomal subunit (sensu Eukaryota) | 7.45E-29 | 5.29E-27 | YBL072C, YBR048W, YDL061C, YDL083C, YDR025W, YDR064W, YDR447C, YDR450W, YER074W, YER131W, YGR118W, YGR214W, YIL069C, YJL136C, YJL190C, YJR123W, YKL156W, YKR057W, YLR048W, YLR167W, YLR388W, YLR441C, YML026C, YML063W, YMR143W, YNL096C, YNL302C, YOL040C, YOR167C, YOR293W, YOR369C, YPL081W, YPL090C |
| 4 | ribosome | 2.87E-05 | 2.04E-03 | YAL003W, YDR385W, YEL034W, YKL056C, YLR249W, YOL139C, YPR163C |
| 5 | chromatin remodeling complex | 5.68E-04 | 3.12E-02 | YJL176C, YOR290C, YPL016W |
| 8 | mitochondrion | 1.66E-08 | 1.11E-06 | YBL015W, YBL090W, YBR003W, YBR037C, YBR120C, YBR122C, YBR147W, YCR028C, YDL027C, YDR141C, YDR178W, YDR305C, YDR316W, YDR494W, YDR513W, YEL006W, YEL024W, YER141W, YGL229C, YGR207C, YGR243W, YHR001W, YHR147C, YIL111W, YIL125W, YJL131C, YJL171C, YKL087C, YLL041C, YLR168C, YLR395C, YML120C, YMR145C, YMR167W, YNL037C, YNL073W, YOL038W, YOL059W, YOL096C, YOR065W, YOR317W, YOR356W, YOR386W, YPL005W, YPL029W, YPL103C |
| 8 | respiratory chain complex III (sensu Eukaryota) | 4.23E-04 | 2.84E-02 | YEL024W, YHR001W, YOR065W |
| 8 | endosome | 9.02E-05 | 6.04E-03 | YAL030W, YDL113C, YJL053W, YLR119W, YLR408C, YNR006W, YOR036W |
| 9 | bud neck | 1.42E-06 | 2.27E-05 | YBR038W, YGR092W, YHR023W, YIL106W, YLR190W, YOL070C |
| 10 | bud neck | 7.51E-04 | 3.68E-02 | YDR507C, YGR152C, YGR238C, YIL140W, YJL187C, YKL101W, YNL233W |
| 10 | septin ring | 9.01E-05 | 4.42E-03 | YIL140W, YKL101W, YNL233W |
| 14 | lipid particle | 5.85E-05 | 2.93E-03 | YIL124W, YJR009C, YMR110C, YNL231C, YOR317W |
| 17 | mitochondrion | 8.16E-06 | 2.86E-04 | YCL057W, YDL027C, YDL164C, YDR116C, YDR194C, YDR375C, YDR513W, YGL104C, YHR002W, YHR067W, YHR147C, YIL087C, YIL111W, YJL063C, YLL040C, YLR270W, YLR346C, YMR098C, YMR152W, YMR188C, YNL063W, YNL073W, YNL200C, YNL274C, YOL059W, YOL071W, YOR136W, YPR011C |
| 18 | bud neck contractile ring | 6.08E-05 | 1.58E-03 | YHR023W, YJR092W, YMR032W |
| 18 | pre-autophagosomal structure | 6.62E-04 | 1.72E-02 | YBL078C, YDL113C |
| 20 | nuclear cohesin complex | 3.74E-04 | 5.24E-03 | YDL003W, YIL026C |
Annotations of molecular function ontology for biclusters identified by the proposed algorithm for additive models at p-value < 0.001.
Bicluster index  Annotation  P-value  Corrected P-value  Genes

1  endopeptidase activity  7.71E-05  1.26E-02  YBL041W, YDR394W, YER012W, YJL001W, YOL038W, YOR362C
3  ATP-dependent RNA helicase activity  5.10E-04  3.57E-02  YBR142W, YER172C, YJL033W, YLL008W, YMR080C
snoRNA binding  7.57E-04  5.30E-02  YDL153C, YER082C, YGR090W, YPL126W, YPR137W
4  structural constituent of ribosome  7.61E-47  3.96E-45  YBL072C, YBL092W, YBR048W, YBR268W, YDL061C, YDL075W, YDL081C, YDL082W, YDL083C, YDL136W, YDL191W, YDR012W, YDR025W, YDR064W, YDR382W, YDR447C, YDR450W, YDR471W, YDR500C, YER074W, YER117W, YER131W, YGR118W, YGR214W, YHL001W, YHR141C, YIL069C, YJL136C, YJL190C, YJR123W, YKL156W, YKR057W, YKR094C, YLR048W, YLR075W, YLR167W, YLR185W, YLR325C, YLR340W, YLR388W, YLR441C, YLR448W, YML026C, YML063W, YML073C, YMR143W, YMR242C, YNL067W, YNL096C, YNL162W, YNL301C, YNL302C, YNL306W, YOL039W, YOL040C, YOL127W, YOR167C, YOR234C, YOR293W, YOR312C, YOR369C, YPL081W, YPL090C, YPL143W, YPL198W, YPR102C
RNA-directed DNA polymerase activity  9.72E-04  5.05E-02  YAR009C, YJR027W, YML039W, YML045W, YMR045C, YMR050C
DNA helicase activity  5.11E-04  2.66E-02  YDR545W, YLR467W, YNL339C, YPL283C, YPR204W
ribonuclease activity  9.72E-04  5.05E-02  YAR009C, YJR027W, YML039W, YML045W, YMR045C, YMR050C
RNA binding  3.10E-05  1.61E-03  YAR009C, YDL208W, YDR378C, YDR381W, YEL026W, YHL001W, YJR027W, YLR277C, YLR448W, YML039W, YML045W, YML073C, YMR045C, YMR050C, YNL175C, YOL123W, YOL127W
helicase activity  2.93E-05  1.53E-03  YEL077C, YJL225C, YLL066C, YLL067C, YML133C
6  copper uptake transporter activity  3.98E-04  1.11E-02  YLR411W, YPR124W
17  glycerol-3-phosphate dehydrogenase (NAD+) activity  8.19E-04  2.95E-02  YDL022W, YOL059W
18  spermidine transporter activity  1.12E-04  2.24E-03  YLL028W, YOR273C
spermine transporter activity  6.62E-04  1.32E-02  YLL028W, YOR273C
24  unfolded protein binding  1.05E-04  4.19E-04  YBR072W, YDR258C, YPL240C
25  unfolded protein binding  1.38E-05  8.29E-05  YAL005C, YDR258C, YER103W, YPL240C
ATPase activity  9.71E-04  5.82E-03  YAL005C, YDR258C, YER103W
Annotations of biological process ontology for biclusters identified by the proposed algorithm for multiplicative models at p-value < 0.001.
Bicluster index  Annotation  P-value  Corrected P-value  Genes

1  ribosomal large subunit biogenesis and assembly  5.83E-06  1.97E-03  YAL025C, YBR267W, YDR091C, YNL110C, YNL163C, YOR272W, YPL211W
tRNA methylation  1.94E-04  6.55E-02  YBL024W, YBR061C, YDR165W, YNR046W, YOL093W, YOL124C
processing of 20S pre-rRNA  1.80E-05  6.05E-03  YDL153C, YDL166C, YDR449C, YEL026W, YJL191W, YJR002W, YLR068W, YLR192C, YLR222C, YML093W, YMR093W, YOR056C, YPR137W
transcription from RNA polymerase III promoter  1.02E-04  3.45E-02  YBR154C, YDL150W, YDR045C, YER148W, YLR223C, YNL113W, YNR003C, YOR224C
35S primary transcript processing  6.37E-06  2.15E-03  YBL004W, YCL031C, YCL059C, YDR339C, YGR090W, YJR002W, YKR060W, YLL008W, YLR051C, YLR186W, YLR430W, YNR038W, YOL021C, YOR145C, YPR112C, YPR137W
transcription from RNA polymerase I promoter  9.82E-04  3.31E-01  YBL014C, YBR154C, YDR156W, YER148W, YNL113W, YOR224C, YOR341W
rRNA processing  5.33E-04  1.80E-01  YBR142W, YBR257W, YCL059C, YDR365C, YDR478W, YGR159C, YLR223C, YMR049C, YMR290C, YOL144W, YOR145C, YPL211W
ribosome biogenesis and assembly  1.09E-15  3.66E-13  YAL025C, YBL024W, YBL054W, YBR034C, YBR084W, YBR142W, YBR267W, YCL059C, YDL153C, YDL167C, YDR060W, YDR165W, YDR300C, YDR312W, YDR365C, YDR449C, YDR465C, YEL026W, YHL039W, YJR002W, YJR066W, YKL143W, YKL191W, YKR056W, YKR060W, YLL008W, YLR186W, YML093W, YMR093W, YMR131C, YMR290C, YNL110C, YNL113W, YNL132W, YNL175C, YNR003C, YNR038W, YNR053C, YOL077C, YOL124C, YOL144W, YOR056C, YOR145C, YOR206W, YOR272W, YPL211W, YPL212C, YPL226W
mRNA export from nucleus  3.51E-04  1.18E-01  YBR034C, YDL116W, YDR432W, YER107C, YJL140W, YKL057C, YKL068W, YKR002W, YKR095W, YMR308C, YOR098C
5  nucleotide-excision repair  1.88E-04  1.97E-02  YBR088C, YDL164C, YJL173C, YNL312W
DNA recombination  1.47E-05  1.54E-03  YDL164C, YJL173C, YML061C, YNL312W
DNA replication, synthesis of RNA primer  6.91E-05  7.25E-03  YJL173C, YKL045W, YNL312W
double-strand break repair via homologous recombination  8.96E-04  9.41E-02  YER147C, YJL173C, YNL312W
7  arabinose catabolic process  3.55E-04  1.03E-02  YHR104W, YOR120W
D-xylose catabolic process  3.55E-04  1.03E-02  YHR104W, YOR120W
protein refolding  1.22E-05  3.55E-04  YBR169C, YLL026W, YPL240C
15  protein refolding  2.41E-06  6.74E-05  YAL005C, YBR169C, YPL240C
21  ribosome biogenesis and assembly  2.26E-04  1.26E-02  YAL025C, YBR238C, YBR267W, YDL031W, YDR083W, YDR184C, YIR026C, YKL078W
24  Glycolysis  3.40E-05  4.15E-03  YAL038W, YCR012W, YDR050C, YJR009C, YKL060C, YKL152C
translational elongation  4.41E-09  5.38E-07  YAL003W, YBR118W, YDL081C, YDL130W, YDR382W, YDR385W, YLR249W, YLR340W, YOL039W
regulation of translational fidelity  1.61E-08  1.97E-06  YBR048W, YBR189W, YDL229W, YDR025W, YGR118W, YNL209W, YPL081W
ribosomal small subunit assembly and maintenance  4.58E-07  5.59E-05  YBR048W, YCR031C, YDR025W, YDR447C, YGR214W, YLR048W, YLR167W, YML024W
ribosomal large subunit assembly and maintenance  1.58E-04  1.93E-02  YDR418W, YLR075W, YLR340W, YLR448W, YML073C, YOL127W, YPR102C
translation  6.16E-69  7.51E-67  YBL027W, YBL038W, YBL072C, YBL092W, YBR031W, YBR048W, YBR181C, YBR189W, YBR191W, YCR031C, YDL061C, YDL075W, YDL081C, YDL082W, YDL083C, YDL130W, YDL136W, YDL191W, YDL229W, YDR012W, YDR025W, YDR064W, YDR382W, YDR418W, YDR447C, YDR450W, YDR471W, YDR500C, YER074W, YER102W, YER117W, YER131W, YGR118W, YGR214W, YHR141C, YIL069C, YJL136C, YJL177W, YJL189W, YJL190C, YJR123W, YJR145C, YKL006W, YKL056C, YKL156W, YKL180W, YKR057W, YKR094C, YLL045C, YLR029C, YLR048W, YLR075W, YLR167W, YLR185W, YLR325C, YLR333C, YLR340W, YLR344W, YLR388W, YLR406C, YLR441C, YLR448W, YML024W, YML026C, YML063W, YML073C, YMR121C, YMR143W, YMR194W, YMR230W, YMR242C, YNL067W, YNL096C, YNL162W, YNL209W, YNL301C, YNL302C, YOL039W, YOL040C, YOL127W, YOR167C, YOR234C, YOR293W, YOR312C, YOR369C, YPL081W, YPL090C, YPL143W, YPL198W, YPR043W, YPR102C
telomere maintenance via recombination  5.04E-04  6.14E-02  YDR545W, YER190W, YLR467W, YNL339C, YPL283C
31  nucleotide-excision repair  3.95E-04  2.05E-02  YAR007C, YDL164C, YNL312W
DNA recombination  6.57E-05  3.42E-03  YAR007C, YDL164C, YNL312W
DNA replication, synthesis of RNA primer  9.46E-04  4.92E-02  YAR007C, YNL312W
mitotic sister chromatid cohesion  6.18E-05  3.21E-03  YDL003W, YFL008W, YIL026C, YMR078C
DNA strand elongation during DNA replication  7.82E-07  4.06E-05  YAR007C, YKL108W, YLR103C, YNL312W
34  NADH oxidation  1.25E-05  9.00E-04  YBR145W, YML120C, YMR145C, YOL059W
35  transposition, RNA-mediated  3.41E-06  1.36E-04  YCL020W, YER160C, YJR026W, YJR028W, YML040W, YOR142W
36  glycine catabolic process  1.94E-05  9.31E-04  YAL044C, YDR019C, YMR189W
one-carbon compound metabolic process  4.79E-05  2.30E-03  YAL044C, YDR019C, YMR189W
41  karyogamy during conjugation with cellular fusion  4.10E-04  3.86E-02  YCL055W, YNL313C, YPL192C
ribosome biogenesis and assembly  2.32E-05  2.18E-03  YBR267W, YCR072C, YDL031W, YDR184C, YDR465C, YGL099W, YGR187C, YMR128W, YOL010W, YOR001W
35S primary transcript processing  4.81E-04  4.52E-02  YDL031W, YGR090W, YOL010W, YOL021C, YOR001W
50  pseudohyphal growth  9.72E-04  4.57E-02  YBR083W, YJL164C, YKL185W, YOR127W
N-terminal protein myristoylation  5.58E-04  2.62E-02  YIL009W, YOR317W
54  DNA unwinding during replication  6.79E-04  4.21E-02  YBR202W, YGL201C, YLR274W
DNA replication initiation  1.69E-04  1.05E-02  YBL035C, YBR202W, YGL201C, YLR274W
pheromone-dependent signal transduction during conjugation with cellular fusion  2.18E-04  1.35E-02  YHR005C, YJL157C, YNL173C, YOR127W
55  spore wall assembly (sensu Fungi)  7.33E-05  3.66E-03  YDR126W, YDR523C, YOR177C, YOR242C
56  meiotic mismatch repair  3.03E-05  1.48E-03  YDR097C, YNL082W, YOL090W
mismatch repair  6.19E-04  3.03E-02  YDR097C, YNL082W, YOL090W
microtubule nucleation  1.13E-04  5.52E-03  YDR356W, YKL042W, YOR373W, YPL124W
57  sulfate assimilation  1.76E-04  9.50E-03  YFR030W, YJR010W, YKR069W
microtubule nucleation  1.24E-04  6.67E-03  YBL063W, YMR117C, YOR373W, YPL124W
59  DNA replication checkpoint  5.86E-04  3.23E-02  YCL061C, YMR048W
Annotations of cellular component ontology for biclusters identified by the proposed algorithm for multiplicative models at p-value < 0.001.
Bicluster index  Annotation  P-value  Corrected P-value  Genes

1  DNA-directed RNA polymerase I complex  9.40E-04  1.50E-01  YBR154C, YDR156W, YNL113W, YOR224C, YOR341W
nucleoplasm  6.60E-04  1.06E-01  YAL059W, YDL051W, YKR002W, YKR095W, YNL175C, YNR053C
nucleolus  9.69E-17  1.55E-14  YAL025C, YAL059W, YBL004W, YBL026W, YBR142W, YCL031C, YCL054W, YCL059C, YDL051W, YDR299W, YDR312W, YDR339C, YDR365C, YDR378C, YEL026W, YGR090W, YGR159C, YJR002W, YKR060W, YLL008W, YLL034C, YLR051C, YLR068W, YLR186W, YLR223C, YMR049C, YMR131C, YMR233W, YMR269W, YMR290C, YNL110C, YNL132W, YNL147W, YNL175C, YNL299W, YNR038W, YNR046W, YNR053C, YOL041C, YOL077C, YOL144W, YOR145C, YOR272W, YPL211W, YPR112C
nucleus  1.60E-05  2.56E-03  YAL059W, YAR015W, YBL016W, YBL024W, YBL054W, YBL093C, YBR034C, YBR066C, YBR090C, YBR112C, YBR160W, YBR173C, YCL011C, YCL031C, YCL054W, YCR036W, YCR051W, YCR059C, YCR060W, YCR090C, YDL002C, YDL006W, YDL047W, YDL051W, YDL070W, YDL076C, YDL153C, YDL166C, YDR006C, YDR091C, YDR098C, YDR143C, YDR155C, YDR162C, YDR165W, YDR260C, YDR296W, YDR305C, YDR361C, YDR365C, YDR390C, YDR432W, YDR465C, YDR477W, YEL007W, YER012W, YER042W, YER148W, YGL130W, YGR090W, YGR159C, YGR200C, YJL140W, YJR002W, YJR017C, YJR105W, YKL143W, YKR060W, YKR072C, YKR079C, YKR096W, YLL034C, YLR007W, YLR039C, YLR051C, YLR052W, YLR068W, YLR107W, YLR186W, YLR223C, YLR262C, YLR265C, YLR327C, YLR384C, YLR420W, YLR430W, YML032C, YML053C, YML080W, YML081W, YML114C, YMR009W, YMR021C, YMR049C, YMR070W, YMR074C, YMR092C, YMR176W, YMR178W, YMR226C, YMR233W, YMR235C, YMR308C, YNL004W, YNL016W, YNL110C, YNL136W, YNL164C, YNL186W, YNL199C, YNL215W, YNL299W, YNR003C, YNR046W, YNR053C, YOL093W, YOL108C, YOL143C, YOR006C, YOR056C, YOR123C, YOR145C, YOR189W, YOR206W, YOR252W, YOR272W, YOR283W, YOR304W, YPL047W, YPL086C, YPL204W, YPL212C, YPL268W, YPR069C, YPR073C
small nucleolar ribonucleoprotein complex  1.97E-05  3.16E-03  YBL004W, YBL026W, YCL059C, YDL153C, YDR378C, YDR449C, YEL026W, YGR090W, YJL191W, YJR002W, YLR186W, YLR222C, YML093W, YMR093W, YNL147W, YPR137W
5  incipient bud site  4.75E-04  2.18E-02  YGR189C, YKR090W, YLL021W, YNL233W, YNL304W
nucleus  1.26E-05  5.80E-04  YBL046W, YBR073W, YBR088C, YCR065W, YDL006W, YDL103C, YDL164C, YDL197C, YER003C, YER152C, YGR042W, YKL045W, YKL089W, YKL113C, YLL022C, YLR233C, YLR376C, YML021C, YML061C, YML109W, YOR074C, YOR279C, YOR342C, YPL008W, YPL127C, YPL208W, YPL256C, YPR120C, YPR135W
bud neck  4.38E-04  2.01E-02  YDR507C, YGR152C, YKL101W, YKR090W, YLL021W, YNL233W, YNL304W
11  MCM complex  8.91E-04  5.52E-02  YEL032W, YLR274W, YPR019W
12  bud neck  1.07E-05  3.33E-04  YBR038W, YBR200W, YGR092W, YHR023W, YIL106W, YLR190W, YMR001C, YPR119W
14  spindle microtubule  6.05E-05  2.12E-03  YBL063W, YBR156C, YGL061C
16  endoplasmic reticulum  1.91E-04  4.21E-03  YBR229C, YCR011C, YCR044C, YDL204W, YIL124W, YML128C, YMR134W
mitochondrion  7.89E-04  1.74E-02  YBR003W, YBR026C, YBR037C, YBR147W, YBR229C, YCR005C, YHL021C, YHR067W, YIL087C, YIL124W, YLR142W, YLR253W, YML128C, YNL073W
24  cytosolic large ribosomal subunit (sensu Eukaryota)  1.11E-45  7.07E-44  YBL027W, YBL092W, YBR031W, YBR191W, YDL075W, YDL081C, YDL082W, YDL130W, YDL136W, YDL191W, YDR012W, YDR382W, YDR418W, YDR471W, YDR500C, YER117W, YHR141C, YJL177W, YJL189W, YKL006W, YKL180W, YKR094C, YLL045C, YLR029C, YLR075W, YLR185W, YLR325C, YLR340W, YLR344W, YLR406C, YLR448W, YML073C, YMR121C, YMR194W, YMR242C, YNL067W, YNL162W, YNL301C, YOL039W, YOL127W, YOR234C, YOR312C, YPL143W, YPL198W, YPR043W, YPR102C
cytosolic small ribosomal subunit (sensu Eukaryota)  7.45E-43  4.77E-41  YBL072C, YBR048W, YBR181C, YBR189W, YCR031C, YDL061C, YDL083C, YDR025W, YDR064W, YDR447C, YDR450W, YER074W, YER102W, YER131W, YGR118W, YGR214W, YIL069C, YJL136C, YJL190C, YJR123W, YJR145C, YKL156W, YKR057W, YLR048W, YLR167W, YLR333C, YLR388W, YLR441C, YML024W, YML026C, YML063W, YMR116C, YMR143W, YMR230W, YNL096C, YNL302C, YOL040C, YOR167C, YOR293W, YOR369C, YPL081W, YPL090C
ribosome  4.84E-06  3.10E-04  YAL003W, YBR118W, YDR385W, YEL034W, YKL056C, YLR249W, YOL139C, YPR163C
28  condensed nuclear chromosome  2.75E-04  2.20E-03  YHR157W, YPL194W
31  chromosome, telomeric region  4.77E-04  1.43E-02  YAR007C, YNL312W
nuclear cohesin complex  3.79E-05  1.14E-03  YDL003W, YFL008W, YIL026C
DNA replication factor A complex  4.77E-04  1.43E-02  YAR007C, YNL312W
replication fork  2.99E-04  8.97E-03  YDL164C, YKL108W, YLR103C
34  mitochondrial inner membrane  4.76E-04  2.14E-02  YDL198C, YDR197W, YER058W, YER141W, YOL027C, YPR011C
mitochondrion  5.60E-06  2.52E-04  YDL198C, YDR194C, YDR197W, YDR301W, YDR322W, YDR505C, YER058W, YER141W, YGL187C, YHR147C, YJR048W, YKL150W, YLR168C, YML030W, YML052W, YML120C, YMR098C, YMR145C, YMR188C, YNL306W, YNR036C, YOL009C, YOL027C, YOL038W, YOL059W, YPR011C
35  retrotransposon nucleocapsid  3.41E-06  9.89E-05  YCL020W, YER160C, YJR026W, YJR028W, YML040W, YOR142W
36  glycine cleavage complex  1.94E-05  4.46E-04  YAL044C, YDR019C, YMR189W
41  nuclear exosome (RNase complex)  4.93E-05  2.02E-03  YNL251C, YOL021C, YOR001W
54  MCM complex  1.19E-04  4.65E-03  YBR202W, YGL201C, YLR274W
pre-replicative complex  9.21E-04  3.59E-02  YBR202W, YGL201C, YLR274W
56  central plaque of spindle pole body  3.03E-05  8.78E-04  YDR356W, YKL042W, YPL124W
Annotations of molecular function ontology for biclusters identified by the proposed algorithm for multiplicative models at p-value < 0.001.
Bicluster index  Annotation  P-value  Corrected P-value  Genes

1  DNA-directed RNA polymerase activity  1.62E-05  2.62E-03  YBR154C, YDL140C, YDL150W, YDR045C, YDR156W, YJL140W, YNL113W, YNR003C, YOR224C, YOR341W
snoRNA binding  1.55E-04  2.51E-02  YBL004W, YDL153C, YDR449C, YGR090W, YLR222C, YML093W, YMR093W, YPR112C, YPR137W
3  MAP kinase kinase activity  9.00E-04  4.50E-02  YJL128C, YPL140C
7  ATPase activity, coupled  3.55E-04  6.04E-03  YLL026W, YPL240C
aldo-keto reductase activity  3.55E-04  6.04E-03  YHR104W, YOR120W
11  chromatin binding  6.21E-05  3.66E-03  YEL032W, YJL081C, YLR002C, YLR274W, YPR019W
14  protein phosphatase type 2C activity  4.51E-04  6.77E-03  YCR079W, YDL006W
24  structural constituent of ribosome  3.69E-75  1.88E-73  YBL027W, YBL038W, YBL072C, YBL092W, YBR031W, YBR048W, YBR181C, YBR189W, YBR191W, YCR031C, YDL061C, YDL075W, YDL081C, YDL082W, YDL083C, YDL130W, YDL136W, YDL191W, YDR012W, YDR025W, YDR064W, YDR382W, YDR418W, YDR447C, YDR450W, YDR471W, YDR500C, YER074W, YER102W, YER117W, YER131W, YGR118W, YGR214W, YHR141C, YIL069C, YJL136C, YJL177W, YJL189W, YJL190C, YJR123W, YJR145C, YKL006W, YKL156W, YKL180W, YKR057W, YKR094C, YLL045C, YLR029C, YLR048W, YLR075W, YLR167W, YLR185W, YLR325C, YLR333C, YLR340W, YLR344W, YLR388W, YLR406C, YLR441C, YLR448W, YML024W, YML026C, YML063W, YML073C, YMR121C, YMR143W, YMR194W, YMR230W, YMR242C, YNL067W, YNL096C, YNL162W, YNL301C, YNL302C, YOL039W, YOL040C, YOL127W, YOR167C, YOR234C, YOR293W, YOR312C, YOR369C, YPL081W, YPL090C, YPL143W, YPL198W, YPR043W, YPR102C
translation elongation factor activity  4.78E-04  2.44E-02  YAL003W, YBR118W, YDR385W, YLR249W
DNA helicase activity  6.99E-05  3.57E-03  YDR545W, YER190W, YLR467W, YNL339C, YPL283C, YPR204W
RNA binding  4.35E-04  2.22E-02  YAR009C, YCR031C, YDL208W, YJR027W, YKL006W, YLR029C, YLR344W, YLR448W, YML039W, YML045W, YML073C, YMR045C, YMR050C, YMR121C, YMR194W, YOL127W
helicase activity  1.61E-08  8.22E-07  YBL113C, YEL077C, YIL177C, YJL225C, YLL066C, YLL067C, YML133C
35  RNA binding  3.28E-05  1.05E-03  YCL020W, YER160C, YJR026W, YJR028W, YML040W, YMR290C, YOR142W, YPR107C
36  glycine dehydrogenase (decarboxylating) activity  1.94E-05  5.62E-04  YAL044C, YDR019C, YMR189W
50  long-chain-fatty-acid-CoA ligase activity  5.58E-04  1.06E-02  YIL009W, YOR317W
glycerol-3-phosphate dehydrogenase (NAD+) activity  1.88E-04  3.56E-03  YDL022W, YOL059W
citrate (Si)-synthase activity  5.58E-04  1.06E-02  YCR005C, YNR001C
52  structural constituent of cytoskeleton  5.63E-04  9.57E-03  YDR016C, YGR113W, YHR129C, YNL126W
54  ATP-dependent DNA helicase activity  3.25E-04  1.11E-02  YBR202W, YGL201C, YLR274W
ATP binding  3.25E-04  1.11E-02  YBR202W, YDR097C, YNL082W
56  ATP binding  1.64E-04  3.78E-03  YDR097C, YNL082W, YOL090W
structural constituent of cytoskeleton  4.43E-05  1.02E-03  YDR356W, YGR113W, YKL042W, YOR373W, YPL124W
guanine/thymine mispair binding  2.17E-04  5.00E-03  YDR097C, YOL090W
single base insertion or deletion binding  2.17E-04  5.00E-03  YDR097C, YOL090W
four-way junction DNA binding  2.17E-04  5.00E-03  YDR097C, YOL090W
57  structural constituent of cytoskeleton  7.39E-04  1.85E-02  YBL063W, YMR117C, YOR373W, YPL124W
58  copper ion binding  6.45E-04  1.42E-02  YBR037C, YBR295W
59  guanine/thymine mispair binding  1.97E-04  4.73E-03  YDR097C, YOL090W
four-way junction DNA binding  1.97E-04  4.73E-03  YDR097C, YOL090W
single base insertion or deletion binding  1.97E-04  4.73E-03  YDR097C, YOL090W
The experiments on the real dataset show that our proposed algorithms PA and PM can efficiently identify biclusters with high biological relevance. In addition, PA can always give a reasonable number of biclusters with a good degree of homogeneity. Although GO annotations only cover what is currently known in the biological community, the results still give a reasonable indication of performance. The biclusters to which no GO terms are assigned should be investigated further for possible new biological discoveries.
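To make the enrichment figures in the annotation tables easier to interpret, the following minimal Python sketch shows one common way such values are obtained. It assumes a hypergeometric test and a Bonferroni-style correction. The corrected values in the tables are consistent with multiplying each raw p-value by a per-bicluster constant, but the exact procedure of the annotation tool is not stated here. The function names and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_tail(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry a given GO term."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total


def bonferroni(p_value, n_tests):
    """Bonferroni-style correction: scale by the number of tested terms,
    capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical numbers: 10 genes in total, 3 annotated with the term,
# a bicluster of 3 genes, all 3 of which carry the term.
p_raw = hypergeom_tail(population=10, annotated=3, sample=3, hits=3)

# Assumption: 19 terms were tested for the bicluster in question.
p_corrected = bonferroni(6.85e-5, 19)
```

With a raw p-value of 6.85e-5 and an assumed 19 tested terms, the corrected value is about 1.30e-3, which matches the pattern of the "response to stress" row of bicluster 25 in the additive-model table.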
Determination of biclusters homogeneity
Conclusion
In this paper, a novel biclustering algorithm for additive models is proposed. First, we analyzed the difference matrix computed from a gene expression matrix. It was shown that the column-wise differences of an additive-related bicluster appear as clusters in each corresponding column of the difference matrix. Similarly, clusters can be found in the column-wise ratios calculated from multiplicative-related biclusters. In the proposed algorithms, these observations are used to construct biclusters greedily from the clustering results on the column-wise differences or ratios.
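The column-wise difference idea summarized above can be illustrated with a minimal Python sketch. This is not the authors' Matlab implementation: the function names, the tolerance `tol`, the `min_rows` cut-off and the toy matrix are all assumptions made for illustration, and a simple neighbourhood count stands in for the clustering step.

```python
def column_differences(matrix, ref_col):
    """For every column j other than ref_col, return the row-wise
    differences matrix[i][j] - matrix[i][ref_col]."""
    n_cols = len(matrix[0])
    return {
        j: [row[j] - row[ref_col] for row in matrix]
        for j in range(n_cols)
        if j != ref_col
    }


def dominant_rows(diffs, tol):
    """Stand-in for 1-D clustering: return the rows in the most populated
    tol-neighbourhood of any observed difference value."""
    best = []
    for centre in diffs:
        members = [i for i, d in enumerate(diffs) if abs(d - centre) <= tol]
        if len(members) > len(best):
            best = members
    return best


def greedy_bicluster(matrix, ref_col, tol, min_rows):
    """Greedily add columns whose dominant difference cluster still shares
    at least min_rows rows with the bicluster built so far."""
    rows = set(range(len(matrix)))
    cols = [ref_col]
    for j, diffs in column_differences(matrix, ref_col).items():
        members = set(dominant_rows(diffs, tol)) & rows
        if len(members) >= min_rows:
            rows = members
            cols.append(j)
    return sorted(rows), sorted(cols)


# Toy data: rows 0-2 follow an additive model on columns 0, 1 and 3
# (row offset plus column offset); row 3 and column 2 are unrelated.
data = [
    [1.0, 3.0, 9.0, 6.0],
    [2.0, 4.0, 0.5, 7.0],
    [5.0, 7.0, 2.0, 10.0],
    [4.0, 1.0, 7.0, 3.5],
]

rows, cols = greedy_bicluster(data, ref_col=0, tol=0.1, min_rows=2)
```

On this toy matrix the differences of columns 1 and 3 against column 0 are constant over rows 0 to 2, so those rows and columns are grouped together. For a multiplicative model, the same routine could be applied to log-transformed values, since the logarithm turns ratios into differences; this substitution is a simplification and may differ from how the proposed algorithm handles ratios.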
The proposed algorithms have been analyzed by comparison with the pCluster algorithm. The results suggest that the proposed algorithms can be regarded as a greedy version of the pCluster algorithm. The biclusters found by the proposed algorithms can be expressed as δ-pClusters, but clustering density is utilized in pattern discovery. Although the identified δ-pClusters are not guaranteed to be maximal, the proposed algorithms are much more efficient. Experiments showed that the computational time of the proposed algorithms is lower than that of the pCluster algorithm by a factor of hundreds or more. Moreover, we have verified that the worst-case complexity of the proposed algorithms is polynomial-time, instead of exponential-time as in the case of the pCluster algorithm or other δ-pCluster-based approaches.
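As a hedged illustration of what it means for a bicluster to be expressible as a δ-pCluster, the sketch below checks the usual pScore condition on every 2 × 2 submatrix. The pScore form used here, |(d_xa − d_xb) − (d_ya − d_yb)|, follows the definition commonly given for the pCluster model. The function names and the toy data are assumptions, and the exhaustive check is only practical for small biclusters.

```python
from itertools import combinations


def p_score(matrix, r1, r2, c1, c2):
    """pScore of the 2x2 submatrix on rows (r1, r2) and columns (c1, c2)."""
    return abs(
        (matrix[r1][c1] - matrix[r1][c2]) - (matrix[r2][c1] - matrix[r2][c2])
    )


def is_delta_pcluster(matrix, rows, cols, delta):
    """True if every 2x2 submatrix of the bicluster has pScore <= delta."""
    return all(
        p_score(matrix, r1, r2, c1, c2) <= delta
        for r1, r2 in combinations(rows, 2)
        for c1, c2 in combinations(cols, 2)
    )


# Toy data: rows 0-2 are additive-related up to noise of about 0.1;
# row 3 does not follow the pattern.
data = [
    [1.0, 3.0, 6.0],
    [2.0, 4.1, 7.0],
    [5.0, 7.0, 9.9],
    [4.0, 1.0, 3.5],
]

noisy_ok = is_delta_pcluster(data, rows=[0, 1, 2], cols=[0, 1, 2], delta=0.2)
too_strict = is_delta_pcluster(data, rows=[0, 1, 2], cols=[0, 1, 2], delta=0.05)
with_outlier = is_delta_pcluster(data, rows=[0, 1, 2, 3], cols=[0, 1, 2], delta=0.2)
```

The check only verifies that a given bicluster satisfies the δ constraint. It says nothing about maximality, which is the property the greedy approach gives up in exchange for speed.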
The robustness of our algorithms to noise and regulatory complexity has been verified empirically using artificial datasets. It was found that our algorithm is capable of discovering overlapping biclusters under noisy conditions. The biological significance of the biclustering results has been verified on the yeast cell-cycle dataset using Gene Ontology annotations. A comparative study shows that the proposed algorithm is the best, or close to the best, among several existing algorithms in terms of the percentage and the number of functionally-enriched biclusters for p-value thresholds ranging from 5 × 10^{-3} to 5 × 10^{-2}. In particular, there are 96.0%, 88.0% and 80.0% of the biclusters annotated with p-value below 0.01. The proposed algorithm can identify biclusters with less deviation from the additive models. The identified biclusters also have reasonable sizes, ranging from 10 to 597 genes and 5 to 17 conditions. A comparison of processing time suggests that the proposed algorithm has the highest efficiency.
In the proposed algorithm, the noise threshold is a crucial parameter, as it balances the homogeneity requirement against the noise tolerance in the identified biclusters. To determine an appropriate value for the noise threshold, an exploratory approach based on PC plots is adopted. We believe that the proposed biclustering algorithm and the interactive PC plots offer an effective data analysis tool for gene expression data. In the future, our research will focus on detecting bicluster types other than additive or multiplicative models, e.g. biclusters of coherent evolution.
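The trade-off controlled by the noise threshold can be seen numerically in a small Python sketch. It is only a stand-in for the interactive PC-plot exploration: the median-centred selection rule, the function names and the difference values are all hypothetical.

```python
from statistics import median


def rows_within_threshold(diffs, threshold):
    """Rows whose column-wise difference lies within `threshold` of the
    median difference (an assumed, simplified selection rule)."""
    centre = median(diffs)
    return [i for i, d in enumerate(diffs) if abs(d - centre) <= threshold]


def spread(diffs, rows):
    """Range of the selected differences; smaller means more homogeneous."""
    selected = [diffs[i] for i in rows]
    return max(selected) - min(selected)


# Hypothetical differences between two conditions for seven genes:
# three tightly grouped, two mildly noisy, two unrelated.
diffs = [2.00, 2.05, 1.95, 2.30, 1.60, 5.00, -1.00]

summary = {}
for threshold in (0.1, 0.5, 4.0):
    rows = rows_within_threshold(diffs, threshold)
    summary[threshold] = (rows, spread(diffs, rows))
```

A small threshold keeps only the tightly grouped rows, while a large one admits every row at the cost of a much wider spread. This is the balance between homogeneity and noise tolerance described above.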
Availability and requirements
Project home page: http://www.eie.polyu.edu.hk/~nflaw/Biclustering/index.html.
Operating system: Windows XP
Programming language: Matlab 6.5 or above
License: Free for academic use. For non-academic use, please contact the author.
Declarations
Acknowledgements
The authors thank the anonymous reviewers for their constructive comments. This work is supported by the Centre for Multimedia Signal Processing, Department of Electronic and Information Engineering and the Hong Kong Polytechnic University (APA2P). K.O. Cheng acknowledges the research studentships provided by the University.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.