Volume 12 Supplement 12
Selected articles from the 9th International Workshop on Data Mining in Bioinformatics (BIOKDD)
Discovery of error-tolerant biclusters from noisy gene expression data
 Rohit Gupta^{1},
 Navneet Rao^{1} and
 Vipin Kumar^{1}
https://doi.org/10.1186/1471-2105-12-S12-S1
© Gupta et al; licensee BioMed Central Ltd. 2011
Published: 24 November 2011
Abstract
Background
An important analysis performed on microarray gene-expression data is the discovery of biclusters, which denote groups of genes that are coherently expressed for a subset of conditions. Various biclustering algorithms have been proposed to find different types of biclusters from these real-valued gene-expression data sets. However, these algorithms suffer from several limitations: an inability to explicitly handle errors/noise in the data; difficulty in discovering small biclusters due to their top-down approach; and, for some approaches, an inability to find overlapping biclusters, which is crucial because many genes participate in multiple biological processes. Association pattern mining also produces biclusters as its result and can naturally address some of these limitations. However, traditional association mining only finds exact biclusters, which limits its applicability to real-life data sets where the biclusters may be fragmented due to random noise/errors. Moreover, as it only works with binary or boolean attributes, its application to gene-expression data requires transforming real-valued attributes to binary attributes, which often results in loss of information. Many past approaches have tried to address the issues of noise and of handling real-valued attributes independently, but there is no systematic approach that addresses both of these issues together.
Results
In this paper, we first propose a novel error-tolerant biclustering model, ‘ET-bicluster’, and then propose a bottom-up heuristic-based mining algorithm to sequentially discover error-tolerant biclusters directly from real-valued gene-expression data. The efficacy of our proposed approach is illustrated by comparing it with a recent approach, RAP, in the context of two biological problems: discovery of functional modules and discovery of biomarkers. For the first problem, two real-valued S. cerevisiae microarray gene-expression data sets are used to demonstrate that the biclusters obtained from the ET-bicluster approach not only recover a larger set of genes than those obtained from the RAP approach but also have higher functional coherence as evaluated using GO-based functional enrichment analysis. The statistical significance of the discovered error-tolerant biclusters, as estimated using two randomization tests, reveals that they are indeed biologically meaningful and statistically significant. For the second problem of biomarker discovery, we used four real-valued Breast Cancer microarray gene-expression data sets and evaluated the biomarkers obtained using MSigDB gene sets.
Conclusions
The results obtained for both problems, functional module discovery and biomarker discovery, clearly demonstrate the usefulness of the proposed ET-bicluster approach and illustrate the importance of explicitly incorporating noise/errors in discovering coherent groups of genes from gene-expression data.
Background
Recent technical advancements in DNA microarray technologies have led to the availability of large-scale gene expression data. These data sets can be represented as a matrix G with genes as rows and different experimental conditions as columns, where G_{ij} denotes the expression value of gene i for an experimental condition j. An important research problem in gene-expression analysis is to discover submatrix patterns, or biclusters, in G. These biclusters are essentially subsets of genes that show coherent values across a subset of experimental conditions. However, coherence among the data values can be defined in various ways. For instance, Madeira et al. [1] classify biclusters into the following four categories based on the definition of coherence: (i) biclusters with constant values, (ii) biclusters with constant rows or columns, (iii) biclusters with coherent values, and (iv) biclusters with coherent evolutions. Many approaches [1–7] have been proposed to discover biclusters from gene-expression data, and different biclustering algorithms have been designed to discover different types of biclusters. For instance, co-clustering [4] and SAMBA [5] find constant-value biclusters, Cheng and Church (CC) [3] find constant-row biclusters, and OPSM [6] finds coherent-evolution biclusters. Although biclustering algorithms differ in the type of bicluster they discover, they share some common issues. The first critical issue with all of these biclustering algorithms is that they are oblivious to noise/errors in the data and require all values in the discovered bicluster to be coherent. This limits the discovery of valid biclusters that are fragmented due to random noise in the data. The second issue, with at least some of the biclustering algorithms, is their inability to find overlapping biclusters. For instance, co-clustering is designed to only look for disjoint biclusters, and Cheng and Church’s approach, which masks the identified bicluster with random values in each iteration, also finds it hard to discover overlapping biclusters. Third, most of the algorithms are top-down greedy schemes that start with all rows and columns and then iteratively eliminate them to optimize the objective function. This generally results in large biclusters which, although useful, do not provide information about small biological functional classes. Finally, all the biclustering algorithms employ heuristics and are unable to search the space of all possible biclusters exhaustively.
Association pattern mining can naturally address some of the issues faced by biclustering algorithms, i.e., finding overlapping biclusters and performing an exhaustive search. However, there are two major drawbacks of traditional association mining algorithms. First, these algorithms use a strict definition of support that requires every item (gene) in a pattern (bicluster) to occur in each supporting transaction (experimental condition). This limits the recovery of patterns from noisy real-life data sets, as patterns are fragmented due to random noise and other errors in the data. Second, since traditional association mining was originally developed for market basket data, it only works with binary or boolean attributes. Hence its application to data sets with continuous or categorical attributes requires transforming them into binary attributes, which can be done by discretization [8–10], binarization [11–14] or rank-based transformation [15]. In each case, there is a loss of information, and the associations obtained do not reflect relationships among the original real-valued attributes but rather relationships among the binned independent values [16].
Efforts have been made to address the two issues mentioned above independently but, to the best of our knowledge, no prior work has addressed both issues together. For example, various methods [17–26] have been proposed in the last decade to discover approximate frequent patterns (often called error-tolerant itemsets (ETIs)). These algorithms allow patterns in which a specified fraction of the items can be missing; see [27] for a comprehensive review of many of these algorithms. As the conventional support (i.e., the number of transactions supporting the pattern) is not anti-monotonic for error-tolerant patterns, most of these algorithms resort to heuristics to discover these patterns. Moreover, all of these algorithms are developed only for binary data.
Another recent approach [28] addressed the second issue and extended association pattern mining to real-valued data. The extended framework is referred to as RAP (Range Support Pattern). Novel range and range support measures were proposed, which ensure that the values of the items constituting a meaningful pattern are coherent and occur in a substantial fraction of transactions. This approach reduces the loss of information incurred by discretization- and binarization-based approaches, and it enables the exhaustive discovery of patterns. One of the major advantages of using an approach such as RAP, which adopts a very different pattern discovery algorithm from more traditional biclustering algorithms such as CC or ISA, is the ability to find smaller or completely novel biclusters. Several examples shown in [28] illustrated that RAP can discover some biologically relevant smaller biclusters, which are either completely missed by biclustering approaches such as CC or ISA, or are found embedded in larger biclusters. In either case, those approaches are not able to enrich the smaller functional classes as RAP biclusters do. Despite these advantages, the RAP framework does not directly address the issue of noise and errors in the data.
As it has been independently shown that both issues, handling real-valued attributes and noise, are critical and affect the results of the mining process, it is important to address them together. In this paper, we propose a novel extension of association pattern mining to discover error-tolerant biclusters (or patterns) directly from real-valued gene-expression data. We refer to this approach as ‘ET-bicluster’, for error-tolerant bicluster. This is a challenging task because the conventional support measure is not anti-monotonic for error-tolerant patterns and therefore limits the exhaustive search of all possible patterns. Moreover, the set of values constituting a pattern in real-valued data is different from the binary data case. Therefore, instead of using the traditional support measure, we used the range and RangeSupport measures proposed in [28] to ensure the coherence of values and to compute the contribution from supporting transactions. RangeSupport is anti-monotonic for both dense and error-tolerant patterns; however, range is not anti-monotonic for error-tolerant patterns. Because of this, exhaustive search is not guaranteed. It is important to note, however, that the proposed ET-bicluster framework still, by design, finds more patterns (biclusters) than its counterpart RAP. Therefore, using range as a heuristic measure, we describe a bottom-up pattern mining algorithm that sequentially generates error-tolerant biclusters satisfying the user-defined constraints, directly from the real-valued data.
To demonstrate the efficacy of our proposed ET-bicluster approach, we compare its performance with RAP in the context of two biological problems: (a) functional module discovery, and (b) biomarker discovery. Since both ET-bicluster and RAP use the same pattern mining framework, comparing them helps to quantify the impact of noise and errors in the data on the discovery of coherent groups of genes in an unbiased way.
For the first problem of functional module discovery, we used real-valued S. cerevisiae microarray gene-expression data sets and discovered biclusters using both the ET-bicluster and RAP algorithms. To illustrate the importance of directly incorporating data noise/errors in biclusters, we compared the error-tolerant biclusters and RAP biclusters using the gene ontology (GO) based biological processes annotation hierarchy [29] as the base biological knowledge. Specifically, for each {bicluster, GO term} pair, we computed a p-value using a hypergeometric distribution, which denotes the random probability of annotating this bicluster with the given GO term. For the second problem of biomarker discovery, we combined four real-valued case-control Breast Cancer gene-expression data sets, and discovered discriminative biclusters (or biomarkers) from the combined data set using both ET-bicluster and RAP. Again, to illustrate the importance of explicitly incorporating noise/errors in the data, we compared the biomarkers based on their enrichment scores computed using MSigDB gene sets [30]. MSigDB gene sets are chosen as the base biological knowledge in this case because they include several manually annotated cancer gene sets. To further compare the ET-bicluster and RAP algorithms, we also performed network/pathway analysis using IPA for an example biomarker obtained from each of the two algorithms. The results obtained for both the functional module discovery and biomarker discovery problems clearly demonstrate that error-tolerant biclusters are not only bigger than RAP biclusters but are also biologically meaningful. Using randomization tests, we further demonstrated that error-tolerant biclusters are indeed statistically significant and are neither obtained by random chance nor capture random structures in the data. Overall, the results presented for both biological problems strongly suggest that our proposed ET-bicluster approach is a promising method for the analysis of real-valued gene-expression data sets.
Contributions

We proposed a novel association pattern mining based approach to discover error-tolerant biclusters from noisy real-valued gene-expression data.

Our work highlights the importance of tolerating error(s) in the biclusters in order to capture the true underlying structure in the data. This is demonstrated using two case studies: functional module discovery and biomarker discovery. Using various real-valued gene expression data sets, we illustrated that our proposed algorithm ET-bicluster can discover additional and bigger biologically relevant biclusters as compared to RAP.

We used two randomization techniques to compute the empirical p-value of all the discovered error-tolerant biclusters and demonstrated that they are statistically significant and that it is highly unlikely to have obtained them by random chance.
Organization: The rest of the paper is organized as follows. In Section 2, we discuss our proposed algorithm ET-bicluster. Section 3 details the experimental methodology for evaluating the error-tolerant biclusters and their comparison with RAP biclusters, and the results obtained. We present a summary of the findings in Section 4, followed by a discussion of limitations and future work in Section 5.
Experimental results and discussion
We implemented our proposed association pattern mining approach ‘ET-bicluster’ in C++. In this paper, we only compare our proposed approach with RAP, as RAP has already been shown to outperform biclustering approaches such as ISA and Cheng and Church, especially for finding small biclusters. Also, as mentioned in [28], transformation of data from real-valued attributes to binary attributes leads to loss of distinction between various types of biclusters (or patterns). Therefore, as the focus of this study is to discover constant-row biclusters, binarization of real-valued gene-expression data is not meaningful. For this reason, we only show results on real-valued data sets. Further, in order to compare the performance of ‘ET-bicluster’ and RAP in discovering coherent groups of genes, we considered two biological problems: discovery of functional modules (finding coherent gene groups) and discovery of biomarkers (finding coherent gene groups that are discriminative of the two classes of patients: cases and controls).
Selecting top biclusters: Because association-mining-based approaches generally produce a large number of biclusters that often overlap substantially, this redundancy may bias the evaluation. Hence, we used a commonly adopted selection methodology, similar to the one proposed in [7], to select up to 500 top biclusters. However, because error-tolerant biclusters generally have a large set of supporting experimental conditions, even biclusters with high overlap in the gene dimension may get selected among the top 500. To avoid this situation, we computed the size of a bicluster as the number of genes in it (|genes|), not as |genes| × |conditions|. Therefore, starting with the largest bicluster (in terms of the number of genes only), we greedily select up to 500 biclusters such that the overlap between any two selected biclusters is not more than 25%. In case of a tie in size, the bicluster with the lower Mean Square Error (MSE) value [3] is selected. Please note that the MSE of a bicluster is computed by discarding the error values in it, since ET-bicluster is meant to look for error-tolerant patterns.
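The greedy selection described above can be sketched as follows. This is a minimal illustration, not the authors' C++ implementation: the function name, the (genes, mse) input layout, and in particular the overlap measure (shared genes divided by the size of the smaller gene set) are assumptions, since the text does not spell out how overlap is computed.

```python
def select_top_biclusters(biclusters, max_count=500, max_overlap=0.25):
    """Greedily pick large, mutually dissimilar biclusters.

    `biclusters` is a list of (genes, mse) pairs, where `genes` is a
    non-empty set of gene identifiers and `mse` is a precomputed mean
    square error (hypothetical input layout).
    """
    # Largest gene set first; ties are broken by the lower MSE.
    ranked = sorted(biclusters, key=lambda b: (-len(b[0]), b[1]))
    selected = []
    for genes, mse in ranked:
        if len(selected) == max_count:
            break
        # Assumed overlap measure: shared genes / size of the smaller set.
        if all(len(genes & kept) / min(len(genes), len(kept)) <= max_overlap
               for kept, _ in selected):
            selected.append((genes, mse))
    return selected
```

Sorting once and then scanning keeps the procedure deterministic for a given input, which matters when the selected set is later used for enrichment comparisons.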
Case study 1 - discovery of functional modules
We used the following two real-valued S. cerevisiae microarray gene-expression data sets for the discovery of functional modules:

Hughes et al’s data set [31]: This data set contains a compendium of expression profiles corresponding to 300 diverse mutations and chemical treatments in S. cerevisiae and was compiled to study the functions of yeast genes on a large scale. The overall dimensions of this data set are 6316 genes × 300 conditions, with values (log_{10} ratio of expression values observed for the experimental condition and the control condition) in the range [-2, 2].

Mega Yeast data set [32]: This data set contains 501 yeast microarray experiments, including stress responses, cell cycle, sporulation, etc. The overall dimensions of this data set are 6447 genes × 501 conditions, with values in the range [-12, 12].
Functional enrichment analysis: Since the discovered biclusters represent groups of genes that are expected to co-express with each other, we evaluated all the discovered biclusters in terms of their functional coherence using the biological processes annotation hierarchy of Gene Ontology [29]. A p-value based on a hypergeometric probability distribution is computed for each combination of bicluster and biological process GO term to determine whether the discovered biclusters are statistically significant. The p-value computed for a pair of bicluster (denoted by b) and GO term (denoted by t) denotes the random probability of annotating a bicluster of the same size as b with the same GO term t.
To compare error-tolerant biclusters and RAP biclusters in an unbiased fashion, we used the same 2652 biological processes GO terms (or classes), all of which contain at least 1 and at most 500 genes from S. cerevisiae. Furthermore, as only 4684 genes are annotated with one or more of these 2652 classes, we restricted our analysis to subsets of the data sets comprising 4684 genes × 501 conditions and 4684 genes × 300 conditions for the Mega Yeast and Hughes et al’s gene-expression data sets respectively.
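As a rough illustration of the hypergeometric p-value described above, the sketch below computes the probability that a randomly drawn gene set of the same size as the bicluster shares at least as many genes with a GO term as the bicluster does. The use of the upper tail and all argument names are assumptions made for illustration; the text only states that a hypergeometric distribution is used.

```python
from math import comb


def hypergeom_pvalue(population, annotated, bicluster_size, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    population     -- number of genes in the analysis (e.g. 4684)
    annotated      -- number of those genes annotated with the GO term
    bicluster_size -- number of genes in the bicluster
    overlap        -- genes in the bicluster annotated with the term
    """
    total = comb(population, bicluster_size)
    upper = min(annotated, bicluster_size)
    # math.comb returns 0 when k > n, so impossible draws add nothing.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, bicluster_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy case: 10 genes, 4 annotated, bicluster of 3 genes sharing 2 of them.
p = hypergeom_pvalue(10, 4, 3, 2)  # (36 + 4) / 120, i.e. one third
```

Exact integer arithmetic via `math.comb` avoids the rounding problems that floating-point factorials would cause for a population of several thousand genes.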
Quantitative analysis of biclusters
Table 1. Statistics of all the biclusters obtained using RAP and our proposed ET-bicluster algorithm from the Mega Yeast and Hughes et al's microarray gene-expression data sets
Run ID | Parameter settings | # total biclusters | # genes covered^{1} | # top biclusters | # genes covered^{2} | Size distribution^{2} (# of genes:# of biclusters) | Time taken (seconds)

Error-tolerant biclusters on Mega Yeast data set
ET-bicluster_{M1} | α = 0.5, ε = 0 for RS ∈ [120, 150), ε = 0.25 for RS ≥ 150 | 153,960 | 361 | 500 | 295 | 2:128, 3:235, 4:8, 5:76, 6:39, 7:7, 8:2, 9:1, 10:2, 11:1, 13:1 | 10,560
ET-bicluster_{M2} | α = 0.3, ε = 0 for RS ∈ [60, 90), ε = 0.25 for RS ≥ 90 | 271,101 | 792 | 500 | 233 | 3:203, 4:28, 5:177, 6:80, 7:5, 8:3, 9:3, 10:1 | 33,000
RAP biclusters on Mega Yeast data set
RAP_{M1} | α = 0.5, RS ≥ 120 | 33,330 | 361 | 500 | 247 | 2:68, 3:379, 4:33, 5:16, 6:4 | 642
RAP_{M2} | α = 0.3, RS ≥ 60 | 94,806 | 792 | 500 | 241 | 3:384, 4:68, 5:43, 6:5 | 7,580
Error-tolerant biclusters on Hughes et al's data set
ET-bicluster_{H1} | α = 0.8, ε = 0 for RS ∈ [10, 15), ε = 0.25 for RS ≥ 15 | 150,372 | 506 | 496 | 437 | 2:210, 3:187, 4:12, 5:66, 6:14, 7:3, 8:1, 10:1, 11:1, 13:1 | 8,360
ET-bicluster_{H2} | α = 0.5, ε = 0 for RS ∈ [6, 10), ε = 0.25 for RS > 10 | 234,761 | 1135 | 500 | 443 | 2:115, 3:258, 4:22, 5:69, 6:24, 7:6, 8:1, 9:2, 11:1, 13:1, 14:1 | 21,745
RAP biclusters on Hughes et al's data set
RAP_{H1} | α = 0.8, RS ≥ 10 | 56,009 | 506 | 495 | 438 | 2:212, 3:207, 4:25, 5:40, 6:5, 7:3, 8:2, 11:1 | 2,835
RAP_{H2} | α = 0.5, RS ≥ 6 | 80,335 | 1135 | 500 | 405 | 2:96, 3:303, 4:18, 5:75, 6:2, 7:2, 8:3, 12:1 | 1,505
The parameter controlling error-tolerance (ε) was set to 0.25 in all the runs of ET-bicluster. It is important to note that the number of error-tolerant biclusters is substantially larger than the number of RAP biclusters. Therefore, for a specific range (α) value and user-defined RangeSupport threshold, if the ET-bicluster algorithm was not able to finish in a reasonable amount of time and memory with ε = 0.25, we first obtained exact biclusters (no error-tolerance) by setting ε to 0 and then increased the RangeSupport threshold to obtain error-tolerant biclusters by setting ε to 0.25. The final set of biclusters is obtained by merging these exact and error-tolerant biclusters. Following are some general observations:
Number of biclusters: It can be clearly seen from Table 1 that introducing an error-tolerance of 25% substantially increased the total number of biclusters. For example, the total number of error-tolerant biclusters obtained on the Mega Yeast data is approximately 5 times (for α = 0.5) and 3 times (for α = 0.3) the number of RAP biclusters for the corresponding α values. Similarly, for Hughes et al’s data set, the number of error-tolerant biclusters is approximately 3 times the number of RAP biclusters for both α values considered (α = 0.8 and α = 0.5).
Size of biclusters: Another important observation from the results shown in Table 1 is that error-tolerant biclusters are larger than RAP biclusters. This is expected, as RAP can only find exact biclusters (with no error-tolerance), and hence valid biclusters that are fragmented due to random noise and errors in the data are either found as separate biclusters or completely missed. On the other hand, because the ET-bicluster algorithm explicitly handles noise and errors in the data, it can potentially find larger biclusters by stitching together the fragmented parts, or can even find new biclusters that were missed by RAP. This might have a significant impact on the functional enrichment analysis, as the ET-bicluster algorithm can potentially discover biclusters that have higher overlap with the considered GO biological process classes. We discuss this further in the next section.
Coverage of genes and relationships among them: As can be noted from Table 1, the number of genes covered by the ET-bicluster and RAP algorithms is the same, at least if we consider all biclusters. This is because the starting set of genes (‘singletons’) is the same for both algorithms. In fact, if the error-tolerance ε is 0.25, for example, then singletons, pairs (level-2 biclusters) and even triplets (level-3 biclusters) will be identical for ET-bicluster and RAP. Note, however, that the number of level-4 biclusters generated by ET-bicluster is greater than the number generated by RAP. This is because the ET-bicluster algorithm, owing to its relaxed error-tolerance criterion, can generate more combinations of genes than RAP. In other words, even if the total number of genes covered by both algorithms is the same, the ET-bicluster algorithm can find more relationships among them.
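The observation that biclusters up to level 3 coincide for ε = 0.25 can be checked with a small worked example. The sketch below assumes that the number of error values tolerated per supporting condition is the integer part of ε times the number of genes in the bicluster; this rule is an assumption made for illustration and the model's exact definition should be taken from the authors' methods description.

```python
from math import floor


def allowed_errors(level, epsilon):
    """Errors tolerated for a bicluster with `level` genes (assumed rule)."""
    return floor(epsilon * level)


# Levels 1 to 5 with an error-tolerance of 25%.
print([allowed_errors(level, 0.25) for level in range(1, 6)])  # [0, 0, 0, 1, 1]
```

Under this assumed rule no error is permitted until a bicluster has four genes, which is consistent with the statement that ET-bicluster and RAP first diverge at level 4.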
As mentioned above and shown in Table 1, the ET-bicluster algorithm, as compared to RAP, can potentially find new and larger biclusters and hence more relationships among genes. An important question to address is therefore whether these larger and new biclusters are biologically meaningful. One promising way to answer this question is through functional enrichment analysis, and we discuss these results below.
Functional enrichment using GO biological processes
Consider the Mega Yeast data, for example: while the ET-bicluster algorithm can discover biclusters of sizes as big as 13 (for α = 0.5) and 10 (for α = 0.3), the RAP algorithm can only discover biclusters of maximum size 6. Moreover, the enrichment scores of these larger error-tolerant biclusters (computed using the minimum p-value estimated for these biclusters over the 2652 classes) are reasonably high. Therefore, even if the number of unique genes covered and the number of enriched GO terms are comparable for the ET-bicluster and RAP algorithms, the degree to which error-tolerant biclusters enrich the GO terms is higher. In other words, the ET-bicluster algorithm can find more relationships among the genes covered and, as shown by functional enrichment analysis, these relationships indeed seem to be biologically relevant and not spurious.
Further, considering various p-value thresholds (from loose, 5 × 10^{-2}, to strict, 1 × 10^{-5}), we collected two more statistics: first, the fraction of biclusters that are enriched by at least one GO term, and second, the fraction of GO terms that enriched at least one bicluster. To illustrate the efficacy of ET-bicluster in capturing the functional coherence among genes, and to compare it with RAP, these two statistics are collected for all the runs shown in Table 1. For instance, if we compare these statistics for the Mega Yeast data, while 83% of the top 500 error-tolerant biclusters (corresponding to Run ID ET-bicluster_{M2}) were enriched, only 76% of the top 500 RAP biclusters (corresponding to Run ID RAP_{M2}) were enriched by at least one GO term at a reasonable p-value threshold of 1 × 10^{-3}, a gain of 7 percentage points. At the even stricter p-value threshold of 1 × 10^{-5}, the gain is 11 percentage points. Similarly, for Hughes et al’s data set, though the gain is not significant, biclusters obtained from ET-bicluster still outperform those obtained by RAP in terms of the fraction of biclusters enriched. As far as the second statistic is concerned, i.e., the number of GO terms that enriched at least one bicluster, the performance of ET-bicluster and RAP is comparable; however, as shown in the -log_{10}(p-value) vs. size distribution plots, enrichment scores for error-tolerant biclusters are generally higher than those for RAP biclusters.
Statistical significance of errortolerant biclusters using randomization tests
Motivated by the discussion of randomization tests and their importance in validating the results from any data mining approach [33], we further estimate the statistical significance of the error-tolerant biclusters using a data-centric randomization approach. More specifically, an empirical p-value is computed for all the error-tolerant biclusters using two randomization tests.
In the first randomization test, conserving the sizes of the top 500 error-tolerant biclusters, we generated 1000 random sets of 500 biclusters each and evaluated them by the same functional enrichment analysis using GO biological processes. So effectively, for each actual error-tolerant bicluster, we generated 1000 random biclusters of the same size (in terms of number of genes). The empirical p-value for each actual error-tolerant bicluster is then computed as the fraction of random biclusters (out of a total of 1000) whose enrichment score (-log_{10}(p-value)) exceeds the enrichment score of the actual error-tolerant bicluster. For instance, if for an error-tolerant bicluster only 1 out of 1000 random biclusters has a higher enrichment score than its actual value, the empirical p-value of this error-tolerant bicluster is given as ‘1 in 1000’ or 10^{-3}.
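The empirical p-value of the first randomization test can be sketched as below. The helper names are hypothetical, random gene sets are assumed to be drawn uniformly without replacement from the annotated gene list, and the enrichment scoring of each random bicluster is left out.

```python
import random


def random_bicluster(all_genes, size, rng):
    """Draw a random gene set of the given size from a list of genes."""
    return set(rng.sample(all_genes, size))


def empirical_pvalue(actual_score, random_scores):
    """Fraction of random scores strictly above the actual score."""
    exceeding = sum(1 for score in random_scores if score > actual_score)
    return exceeding / len(random_scores)


# One of 1000 random enrichment scores beats the actual score of 5.0.
scores = [1.0] * 999 + [7.5]
print(empirical_pvalue(5.0, scores))  # 0.001
```

With 1000 random draws the smallest non-zero empirical p-value that can be reported is 10^{-3}, which matches the ‘1 in 1000’ example in the text.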
This table shows the statistical significance of biclusters obtained from our proposed ET-bicluster algorithm. Each entry is the number of random runs, out of 1000, in which the fraction of biclusters enriched exceeds the fraction for the true run.
Run ID | pval ≤ 0.05 | pval ≤ 0.01 | pval ≤ 0.005 | pval ≤ 0.001 | pval ≤ 0.00001

ET-bicluster_{M1} | 660 | 33 | 0 | 0 | 0
ET-bicluster_{M2} | 660 | 76 | 4 | 0 | 0
ET-bicluster_{H1} | 797 | 0 | 0 | 0 | 0
ET-bicluster_{H2} | 886 | 0 | 0 | 0 | 0
In the second randomization test, we randomized the data itself by shuffling the data values among the conditions for each gene. By doing this, we conserved the distribution of each gene profile but broke the correlation among them. We ran our proposed ET-bicluster algorithm on the randomized Mega Yeast data set, for example, and obtained only 42 biclusters, all of which were pairs. In contrast, application of the ET-bicluster algorithm to the actual non-randomized Mega Yeast data generated many more biclusters, of sizes as big as 10.
Both of the above randomization tests suggest that the error-tolerant biclusters obtained from the real-valued gene-expression data sets are indeed biologically meaningful and are neither obtained by random chance nor capture random structures in the data.
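A minimal sketch of this per-gene shuffling is given below, assuming the expression matrix is held as a list of rows with one row per gene; the function name and the fixed seed are illustrative choices, not part of the original study.

```python
import random


def shuffle_within_genes(matrix, seed=0):
    """Permute each gene's values across conditions independently.

    Every row keeps exactly the same multiset of values, so each gene's
    value distribution is preserved while the alignment between genes
    (and hence their correlation) is destroyed.
    """
    rng = random.Random(seed)
    shuffled = []
    for row in matrix:
        permuted = list(row)   # copy so the input matrix is not modified
        rng.shuffle(permuted)
        shuffled.append(permuted)
    return shuffled
```

Shuffling each row separately, rather than permuting whole columns, is what breaks the co-expression structure: a column permutation would leave every gene-gene correlation unchanged.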
Case study 2 - discovery of biomarkers
We used four real-valued Breast Cancer gene-expression data sets, all of which were taken from the Affymetrix platform HG-U133A and normalized using the RMA normalization approach. Please note that these gene-expression data sets are different from those considered for the functional module discovery problem, in the sense that experimental conditions are replaced by two groups of patients. All four breast cancer data sets were downloaded from the GEO website: Desmedt (GSE7390), Loi (GSE6532), Miller (GSE3494) and Pawitan (GSE1456). The patients in the four data sets are classified as cases and controls based on their metastasis state. The patients who developed metastasis within 5 years of prognosis were considered as metastasis cases. The patients who were free of metastasis for longer than 8 years of survival and follow-up time were considered as controls. The case-control ratio for the Desmedt, Loi, Miller and Pawitan data sets was 35:136, 51:112, 37:150 and 35:35 respectively. To increase the sample size, we combined these four data sets and used the combined data set for the discovery of biomarkers. This combined data set comprises 8,920 genes and has a case-control ratio of 158:433.
We discovered biclusters on the combined Breast Cancer gene-expression data set using ET-bicluster with parameters α = 0.5, RS = 80, and ε = 0.25.
Selecting discriminative biclusters: First we select top biclusters using the approach defined earlier and then, amongst the top biclusters, only those that are discriminative of the two groups of patients, cases and controls, are selected as biomarkers. To measure the discriminative power, we used two measures, odds ratio and p-value. While the odds ratio quantifies how different cases and controls are for a specific bicluster, the p-value quantifies the significance of the difference reflected by the odds ratio. Only those biclusters are selected that have a p-value of less than 0.05 and an odds ratio of more than 2.0 (biclusters more represented in cases) or less than 0.5 (biclusters more represented in controls).
Functional enrichment analysis: We evaluated all the identified biomarkers in terms of their enrichment scores using the MSigDB gene sets [30]. A p-value based on a hypergeometric probability distribution, which denotes the random probability of annotating a biomarker with the gene set considered, is computed for all pairwise combinations of biomarkers and the 5452 gene sets from the MSigDB database. The enrichment score of each biomarker is then computed as -log_{10}(p-value_{min}) and used as a metric to compare the biomarkers obtained using ET-bicluster and RAP.
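The selection thresholds above can be expressed as a short filter. The sketch assumes the odds ratio is computed from a standard 2 × 2 table that counts cases and controls with and without the bicluster pattern; how a patient is judged to show the pattern, and how the p-value is obtained, are not specified here, so both are treated as inputs. All names are hypothetical.

```python
def odds_ratio(case_with, case_without, control_with, control_without):
    """Odds ratio of a 2 x 2 table; all four counts must be positive."""
    counts = (case_with, case_without, control_with, control_without)
    if min(counts) <= 0:
        raise ValueError("odds ratio is undefined for empty cells")
    return (case_with * control_without) / (case_without * control_with)


def is_biomarker(ratio, p_value):
    """Apply the thresholds stated in the text."""
    # Odds ratio above 2.0: more represented in cases.
    # Odds ratio below 0.5: more represented in controls.
    return p_value < 0.05 and (ratio > 2.0 or ratio < 0.5)
```

For example, a table with 30 of 40 cases and 10 of 40 controls showing the pattern gives an odds ratio of 9, which passes the filter whenever the accompanying p-value is below 0.05.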
Enrichment analysis using MSigDB gene sets
As shown in Figure 3(b), the gene sets covered by ET-bicluster biomarkers outnumber those covered by RAP biomarkers. The fraction of gene sets covered by biomarkers obtained from both algorithms seems very low, but this is expected because, first, a large number of gene sets are considered for the analysis and, second, these biomarkers are only reflective of breast cancer metastasis. An important point to note, however, is that even a small change in the fraction of gene sets covered means covering a substantially larger number of gene sets. For instance, at a p-value threshold of 10^{-6} (corresponding to a -log_{10}(p-value) of 6 as shown on the x-axis), ET-bicluster and RAP biomarkers cover 3.03% (165 gene sets) and 1.96% (107 gene sets) respectively. These numbers for an even stricter p-value threshold of 10^{-8} are 1.01% (55 gene sets) and 0.26% (14 gene sets) respectively.
It is clear from the above analysis that the biomarkers obtained from the ET-bicluster algorithm are indeed biologically meaningful. Since the RAP algorithm does not explicitly handle noise in the data, it either completely misses some of these biologically relevant biomarkers or finds only fragmented parts of them, which eventually affects their enrichment score.
Biological relevance: example
Thus, the network obtained from the bigger ET-bicluster biomarker is better connected and therefore has a higher network score, as computed using IPA, than the network obtained from the RAP biomarker. In fact, all 4 additional genes in the ET-bicluster biomarker, i.e. MMP2, CDH11, THBS2 and FBN1, have previously been shown to be well-characterized cancer biomarkers (as identified in IPA), increasing our confidence that the bigger ET-bicluster biomarker is indeed a true biomarker.
Conclusions
We proposed a novel error-tolerant biclustering model and presented a heuristic-based algorithm, ‘ET-bicluster’, to sequentially generate error-tolerant biclusters from real-valued gene-expression data in a bottom-up fashion.
We presented two biological case studies, functional module discovery and biomarker discovery, to demonstrate the importance of accounting for noise and errors in the data when discovering coherent groups of genes. In both case studies, we found that the biclusters discovered using our proposed ET-bicluster algorithm are not only bigger than those obtained by the RAP algorithm but are also assigned a higher functional enrichment score, using the biological process GO terms (functional module discovery case study) and the MSigDB gene sets (biomarker discovery case study). These results suggest that the discovered error-tolerant biclusters not only capture the functional coherence among the genes but are also unlikely to have been obtained by random chance. We further demonstrated using two randomization tests that the statistical significance of the error-tolerant biclusters is high. The results from both randomization tests (one randomly selects the biclusters and the other randomizes the input data itself) suggest that our proposed approach is robust and illustrate that the discovered biclusters are biologically and statistically meaningful, and were neither obtained by random chance nor capture random structure in the data.
The work presented in this study can be extended in various ways. Below we discuss some of the limitations of the ET-bicluster algorithm and possible ideas to address them.

Since the range criterion that is used to check the coherence of expression values is not anti-monotonic for error-tolerant biclusters, the proposed ET-bicluster approach does not exhaustively search for all error-tolerant biclusters. A promising idea, therefore, is to define a new anti-monotonic measure of coherence among the expression values that enables an exhaustive search for error-tolerant biclusters.

The current implementation of the ET-bicluster algorithm imposes error-tolerance constraints only on the bicluster rows. This means that it is possible for a gene in a discovered bicluster to have all error values. To avoid this situation, one can use an additional column constraint and find a subset of supporting transactions for which each column in the pattern has no more than some user-defined fraction of errors. For the binary data case, this kind of additional column constraint has been used in [20]; however, a heuristic-based approach is used there to check the column constraint. One promising direction is to develop a pattern mining algorithm that imposes both the row and column error-tolerance constraints and exhaustively searches for all the error-tolerant biclusters.
We presented a comparison of ET-bicluster only with RAP, since comparison with other biclustering approaches such as CC and ISA is not well suited for quantifying the effect of noise/errors. Moreover, the CC and ISA approaches generally find larger biclusters and follow a different approach based on optimizing an objective function. Nevertheless, it would be interesting in the future to compare ET-bicluster with CC and ISA to explore potential complementarity among them.
It is also important to note that gene-expression data provide a useful but limited view of the genome, and therefore biclusters obtained from gene-expression data alone may not elucidate the complete underlying biological mechanism. To further illustrate the utility of the ET-bicluster algorithm, another promising research direction is therefore to integrate multiple biological data sources. For example, protein-protein interaction data can be used as prior knowledge to guide the discovery of biclusters from the gene-expression data. The biclusters identified in this way are potentially more reliable and biologically plausible than those obtained from individual data sources. We are currently developing error-tolerant pattern mining based approaches for the integrated analysis of gene-expression and protein-protein interaction data. Our initial efforts to combine these two types of data for discovering subnetwork-based biomarkers have been shown in [35]; however, these approaches are primitive at this stage and further work is needed in this area.
Methods
Error-tolerant bicluster model for real-valued data
 (a)
Bicluster composition: Unlike the case of binary data, where a collection of 1s was defined as a bicluster, in the case of real-valued data, similar values across a set of rows constitute a bicluster. These values can be any values in the set ℝ and, although similar within a row, they can be different for different rows. The errors in biclusters defined on real-valued attributes are introduced in a way similar to the binary case. However, unlike the binary case, in which all non-error entries are the same (1s), imposing such a requirement in the real-valued case would be too strict. Therefore, a measure is needed to check the coherence among the gene-expression values. For this purpose, we use the range measure, which checks for each transaction whether the relative range of the gene-expression values in a bicluster, given as (max_{val} − min_{val})/min_{val}, is within a pre-specified threshold α. Furthermore, the contribution of each supporting transaction is measured as the minimum of the values taken by any of the genes in the bicluster in that transaction. Overall, to measure the strength of the bicluster, we use the RangeSupport (RS) measure [28], which sums up the contribution of each supporting transaction. This is similar to the support measure that is generally used in association pattern mining for binary data; however, unlike the binary case, the supporting transactions may not contribute equally to the RangeSupport of a bicluster in real-valued data. In combination, the range and RangeSupport measures capture the requirement that the expression values of the genes in a bicluster are coherent for several transactions, and hence can be used to mine interesting biclusters from real-valued data. Note that although both the range and RangeSupport measures are anti-monotonic for exact biclusters, range is not anti-monotonic for error-tolerant biclusters.
For this reason, ET-bicluster does not exhaustively find all error-tolerant biclusters, but it is noteworthy that it still subsumes all biclusters found by RAP and can even find biclusters that are fragmented due to noise/errors in the data. On the other hand, as RAP is oblivious to errors/noise in the data, it either completely misses these fragmented but valid biclusters or finds them as separate parts.
 (b)
Positive/negative values: Unlike binary data, real-valued microarray data have both positive and negative values. In this case, it is important to consider the sign of the value to discover meaningful biclusters. Similar to [28], we address this problem by requiring that a transaction can only be termed a supporting transaction of a bicluster if, for this transaction, the expression values of all the genes in the bicluster have the same sign. This also makes biological interpretation easier, as the sign enforcement entails finding only those biclusters in which all the genes are either up-regulated or down-regulated for a given experimental condition. Note, however, that the same genes can be up-regulated for one experimental condition and down-regulated for another.
 (c)
Error/non-error values: In the binary case, 1 is always a non-error value and 0 an error value. This notion is no longer valid for real-valued data. For example, consider the error-tolerant bicluster shown in Figure 7 with 5 genes (a, b, c, d, e) and 8 experimental conditions (1 … 8). For the 1st condition, 8 is an error value; for the 3rd condition, 9 is an error value; and for the 5th condition, 20 is an error value. Similarly, non-error values can change for each transaction. Thus, it becomes important to keep track of error and non-error values while mining for biclusters in real-valued data.
With an understanding of these specific challenges and potential ways to address them, we now give the formal definition of error-tolerant biclusters for real-valued data.
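The range criterion, the same-sign requirement and the RangeSupport measure described in (a) and (b) can be sketched for the exact (error-free) case as follows. This is an illustrative sketch only: the function names are hypothetical, the comparison against α is taken as non-strict, and using absolute values for negative entries is an assumption made so that down-regulated transactions are handled symmetrically.

```python
def supports(values, alpha):
    """True if all values share a sign and their relative range is within alpha.

    values: non-empty list of expression values of the bicluster's genes
            in one transaction (experimental condition).
    """
    if all(v > 0 for v in values) or all(v < 0 for v in values):
        mags = [abs(v) for v in values]          # assumption: compare magnitudes
        lo, hi = min(mags), max(mags)
        return (hi - lo) / lo <= alpha           # (max - min) / min within alpha
    return False                                  # mixed signs or a zero value

def range_support(rows, alpha):
    """Sum, over supporting transactions, of the minimum magnitude in that transaction.

    rows: list of per-transaction value lists restricted to the bicluster's genes.
    """
    return sum(min(abs(v) for v in row) for row in rows if supports(row, alpha))
```

Note that each supporting transaction contributes its own minimum value, so transactions with larger expression magnitudes weigh more than they would under a binary support count.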
Definition of error-tolerant biclusters
Intuitively, a bicluster B is said to be an error-tolerant bicluster if the following two general conditions are satisfied:

The RangeSupport of bicluster B should be no less than the user-defined threshold, RS.

All supporting transactions of bicluster B should have mostly non-error values, i.e. the values should be generally coherent (governed by a user-defined parameter ε that bounds the permissible fraction of errors).
Definition 1. Let D be a real-valued gene-expression data set, RS be the RangeSupport threshold, E be a function that takes a set of real values as input and returns the number of errors in them according to the range criterion, and let the error threshold be ε ∈ (0,1]. A bicluster B (with set of genes G) is an error-tolerant bicluster ET-bicluster(ε) in the real-valued attribute domain if there exists a set of transactions T ⊆ D such that the following two conditions hold:
RangeSupport(B) ≥ RS (1)
∀t ∈ T, E(D_{t,G}) ≤ ε · |G| (2)
Thus, according to the definition, the fraction of errors in each supporting transaction of the bicluster should not exceed ε.
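Definition 1 can be read as a simple predicate over per-transaction summaries. The sketch below assumes that, for each transaction, the contribution to RangeSupport and the error count E(D_{t,G}) have already been computed; the function name and input layout are hypothetical. Since contributions are non-negative, taking T to be every transaction that satisfies condition (2) gives the largest possible RangeSupport.

```python
def is_et_bicluster(transactions, n_genes, rs_threshold, eps):
    """Check conditions (1) and (2) of Definition 1 for one candidate bicluster.

    transactions: iterable of (contribution, n_errors) pairs, one per transaction,
                  where n_errors plays the role of E(D_{t,G}).
    n_genes:      |G|, the number of genes in the bicluster.
    """
    max_errors = eps * n_genes
    # Condition (2): keep only transactions with at most eps * |G| errors.
    supporting = [c for c, e in transactions if e <= max_errors]
    # Condition (1): RangeSupport over those transactions must reach the threshold.
    return sum(supporting) >= rs_threshold
```

For a 4-gene bicluster with ε = 0.25, a transaction with two error values is excluded from T, so it cannot contribute to the RangeSupport.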
Algorithm to discover error-tolerant biclusters from real-valued data
Starting with singletons, the ET-bicluster algorithm sequentially generates (k+1)-level biclusters from k-level biclusters. At k = 1, genes that satisfy the RangeSupport criterion (computed as the summation of absolute values over all transactions) are valid singletons. Generally speaking, any (k+1)-level bicluster is a valid bicluster if it satisfies the RangeSupport criterion and each supporting transaction of the bicluster has at most an ε fraction of errors.
The ET-bicluster algorithm generates (k+1)-level biclusters from k-level biclusters by one of two steps: error extension or non-error extension. Specifically, if ⌊(k + 1) · ε⌋ = ⌊k · ε⌋, it is a non-error extension step (no additional error values are permitted); otherwise it is an error extension step (one additional error value is permitted). We used two lemmas proved in [20] to perform these extension steps efficiently. In the non-error extension step, for each (k+1)-level bicluster, the range criterion is checked only for the intersection of the supporting transactions of all its k-level biclusters. In the error extension step, on the other hand, the range criterion is checked for the union of the supporting transactions of all its k-level biclusters.
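The choice between the two extension steps, and the set of transactions on which the range criterion is then checked, can be sketched as below. The helper names are hypothetical, and `Fraction` is used only to avoid floating-point surprises when computing ⌊k · ε⌋ for values of ε that are not exactly representable.

```python
from fractions import Fraction
from math import floor

def permitted_errors(k, eps):
    """Number of error values permitted in a k-level bicluster: floor(k * eps)."""
    return floor(k * Fraction(str(eps)))

def extension_step(k, eps):
    """Step type when growing k-level biclusters into (k+1)-level biclusters."""
    if permitted_errors(k + 1, eps) == permitted_errors(k, eps):
        return "non-error"
    return "error"

def candidate_transactions(step, supports_of_subsets):
    """Transactions on which the range criterion is checked for a (k+1)-level bicluster.

    supports_of_subsets: supporting-transaction sets of all its k-level sub-biclusters.
    """
    sets = [set(s) for s in supports_of_subsets]
    if step == "non-error":
        return set.intersection(*sets)
    return set.union(*sets)
```

With ε = 0.25, extending from level 3 to level 4 is the first error extension step, because ⌊4 · 0.25⌋ = 1 while ⌊3 · 0.25⌋ = 0.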
Checking the range criterion to ensure the coherence of values depends on the number of permissible errors at a particular bicluster level (⌊k · ε⌋). For instance, if the permissible number of errors is 1, the range criterion for a given transaction is computed as follows. First, all the expression values of the bicluster in that transaction are sorted, and then the range criterion is checked in the usual manner after discarding either the minimum value or the maximum value. If the range criterion is satisfied in either of the two cases, the transaction is classified as a supporting transaction for that bicluster. If, for instance, the number of permissible errors is 2 at some bicluster level, we check the range criterion for three cases: discarding the 2 minimum values; discarding the 2 maximum values; or discarding 1 minimum value and 1 maximum value.
Again, if any of these cases satisfies the range criterion, the transaction is classified as a supporting transaction. Similarly, we exhaustively enumerate all cases when the number of permissible errors is more than 2. Note, however, that with ε = 0.25 (the value considered in this paper) and a bicluster size, in terms of number of genes, even as big as 12, we need to enumerate these cases for at most 3 permissible errors.
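The case enumeration described above amounts to sliding a window over the sorted values: with m permissible errors, discard i of the smallest and m − i of the largest values for i = 0 … m. A minimal sketch follows; the function name is hypothetical, the comparison against α is taken as non-strict, and the handling of negative values through magnitudes is an assumption.

```python
def satisfies_range(values, alpha, max_errors):
    """True if the range criterion holds after discarding max_errors extreme values.

    values:     expression values of the bicluster's genes in one transaction
    max_errors: number of permissible errors at this level, floor(k * eps)
    """
    ordered = sorted(values)
    keep = len(ordered) - max_errors
    if keep < 1:
        return False
    for drop_low in range(max_errors + 1):
        # Discard drop_low smallest and (max_errors - drop_low) largest values.
        window = ordered[drop_low:drop_low + keep]
        if window[0] > 0 or window[-1] < 0:      # retained values share a sign
            mags = [abs(v) for v in window]
            if (max(mags) - min(mags)) / min(mags) <= alpha:
                return True
    return False
```

Discarding exactly `max_errors` values is sufficient, because removing an extreme value from a same-sign window can never increase its relative range. For m permissible errors this gives m + 1 cases per transaction, which matches the two cases for one error and three cases for two errors described in the text.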
An example
Step 1: k = 1. As the RangeSupport of each gene is greater than the threshold of 5, all the genes are returned as valid singletons.
Step 2: k = 2. Since ⌊k · ε⌋ = ⌊(k − 1) · ε⌋, this is a non-error extension step. Consider, for example, bicluster ab: for α = 0.5, its supporting transactions are {1,2,3,4,7,8}. To illustrate, transaction 1 satisfies the range criterion (i.e. 2.1 − 2 < 0.5 × 2) and hence is valid, whereas transaction 5 is not valid since 20 − 8 > 0.5 × 8. The RangeSupport of bicluster ab is the sum of the contributions from each supporting transaction, i.e. RS(ab) = 2 + 2.1 + 4 + 6.5 + 3 + 2 = 19.6. Since RS(ab) > 5, ab is a valid bicluster. Similarly, biclusters ac, ad, ae, bc, bd, be, cd, ce and de are all valid biclusters.
Step 3: k = 3. Again, since ⌊k · ε⌋ = ⌊(k − 1) · ε⌋, this is a non-error extension step. Consider, for example, bicluster abc: the range criterion is checked for the intersection of the supporting transactions of biclusters ab, bc and ac, and the supporting transactions are identified as {2,4,8}. Since RS(abc) = 10.6, which is greater than the threshold of 5, abc is a valid bicluster. Similarly, abd, abe, bce, bde and cde are all valid biclusters.
Step 4: k = 4. In this case, since ⌊k · ε⌋ ≠ ⌊(k − 1) · ε⌋, this is an error extension step. The number of permissible errors at this level is ⌊k · ε⌋ = ⌊4 × 0.25⌋ = 1. Consider, for example, bicluster abcd: the range criterion is checked for the union of the supporting transactions of all its level-3 sub-biclusters. Hence, we get {1,2,3,4,5,6,8} as the set of supporting transactions. For illustration, take transaction 1. As only one error value is permitted, the range criterion is checked after discarding the maximum value: (2nd max − min)/min = (2.1 − 2)/2 = 0.05 < α (= 0.5). Therefore, this is a supporting transaction. On the other hand, transaction 7 does not satisfy the range criterion for bicluster abcd even after discarding one error value. Also, RS(abcd) = 33.6, hence abcd is a valid bicluster. Similarly, abce is also a valid bicluster.
Step 5: k = 5. Since ⌊k · ε⌋ = ⌊(k − 1) · ε⌋, this is a non-error extension step. A bicluster abcde is generated with {1,2,3,4,5,6,8} as its set of supporting transactions. Since RS(abcde) = 33.6, abcde is a valid bicluster.
It is important to note that since RAP does not explicitly handle errors/noise in the data, it cannot discover the bicluster abcde, which is fragmented due to errors.
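The arithmetic in the steps above can be checked with a few lines. The snippet uses only numbers quoted in the example (α = 0.5, ε = 0.25, RangeSupport threshold 5, and the per-transaction values stated for biclusters ab and abcd); it does not reproduce the full data matrix of Figure 7, and the variable names are hypothetical.

```python
from math import floor, isclose

ALPHA, EPS, RS_MIN = 0.5, 0.25, 5

# Step type when generating k-level biclusters from (k-1)-level ones, for k = 2..5.
steps = {
    k: "error" if floor(k * EPS) != floor((k - 1) * EPS) else "non-error"
    for k in range(2, 6)
}

# Bicluster ab: transaction 1 has values 2 and 2.1; transaction 5 has values 8 and 20.
t1_ok = (2.1 - 2) / 2 < ALPHA        # relative range 0.05
t5_ok = (20 - 8) / 8 < ALPHA         # relative range 1.5

# RangeSupport of ab: contributions of supporting transactions {1,2,3,4,7,8}.
rs_ab = sum([2, 2.1, 4, 6.5, 3, 2])
ab_valid = rs_ab > RS_MIN

# Bicluster abcd, transaction 1: one error permitted, so the maximum is discarded.
abcd_t1_ok = (2.1 - 2) / 2 < ALPHA

print(steps, t1_ok, t5_ok, round(rs_ab, 1), ab_valid, abcd_t1_ok)
```

Only the step at k = 4 is an error extension, which is why abcd is the first bicluster in the example whose supporting transactions are taken from a union rather than an intersection.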
Declarations
Acknowledgements
This work was supported by NSF grants IIS-0916439, CRI-0551551 and a University of Minnesota Rochester Biomedical Informatics and Computational Biology (BICB) Program Traineeship Award (Rohit Gupta). Access to computing facilities was provided by the Minnesota Supercomputing Institute.
This article has been published as part of BMC Public Health Volume 11 Supplement 5, 2011: Navigating complexity in public health. The full contents of the supplement are available online at http://www.biomedcentral.com/14712458/11/S5.
References
 Madeira SC, Oliveira AL: Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform 2004, 1: 24–45. 10.1109/TCBB.2004.2
 Bergmann S, Ihmels J, Barkai N: Iterative signature algorithm for the analysis of large-scale gene expression data. Phys Rev E Stat Nonlin Soft Matter Phys 2003, 67: 031902.
 Cheng Y, Church GM: Biclustering of gene expression data. Proc Int Conf Intell Syst Mol Biol 2000, 8: 93–103.
 Dhillon I, Mallela S, Modha D: Information-theoretic co-clustering. In ACM SIGKDD. ACM New York, NY, USA; 2003:89–98.
 Tanay A, Sharan R, Shamir R: Discovering statistically significant biclusters in gene expression data. Bioinformatics 2002, 18(Suppl 1):S136–S144. 10.1093/bioinformatics/18.suppl_1.S136
 Ben-Dor A, Chor B, Karp R, Yakhini Z: Discovering local structure in gene expression data: the order-preserving submatrix problem. Journal of Computational Biology 2003, 10(3–4):373–384. 10.1089/10665270360688075
 Prelic A, Bleuler S, Zimmermann P, Wille A, Bühlmann P, Gruissem W, Hennig L, Thiele L, Zitzler E: A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 2006, 22(9):1122–1129. 10.1093/bioinformatics/btl060
 Srikant R, Agrawal R: Mining quantitative association rules in large relational tables. ACM SIGMOD Record 1996, 25(2):12.
 Rastogi R, Shim K: Mining optimized association rules with categorical and numeric attributes. IEEE TKDE 2002, 29–50.
 Fukuda T, Morimoto Y, Morishita S, Tokuyama T: Mining optimized association rules for numeric attributes. Journal of Computer and System Sciences 1999, 58: 1–12. 10.1006/jcss.1998.1595
 Becquet C, Blachon S, Jeudy B, Boulicaut JF, Gandrillon O: Strong-association-rule mining for large-scale gene-expression data analysis: a case study on human SAGE data. Genome Biol 2002, 3(12):RESEARCH0067.
 Creighton C, Hanash S: Mining gene expression databases for association rules. Bioinformatics 2003, 19: 79–86. 10.1093/bioinformatics/19.1.79
 Cong G, Tan K, Tung A, Pan F: Mining frequent closed patterns in microarray data. ace 125: 123.
 Mcintosh T, Chawla S: High-Confidence Rule Mining for Microarray Analysis. IEEE/ACM Trans Comput Biol Bioinform 2007, 4: 611–623.
 Calders T, Goethals B, Jaroszewicz S: Mining rank-correlated sets of numerical attributes. In ACM SIGKDD. ACM; 2006:105.
 Gyenesei A, Schlapbach R, Stolte E, Wagner U: Frequent pattern discovery without binarization: Mining attribute profiles. LNCS 2006, 4213: 528.
 Zhang M, Wang W, Liu J: Mining approximate order preserving clusters in the presence of noise. Proc Int Conf Data Eng 2008, 2008: 160–168.
 Yang C, Fayyad U, Bradley P: Efficient discovery of error-tolerant frequent itemsets in high dimensions. In ACM SIGKDD. ACM New York, NY, USA; 2001:194–203.
 Liu J, Paulsen S, Wang W, Nobel A, Prins J: Mining approximate frequent itemsets from noisy data. IEEE ICDM 2005, 4.
 Liu J, Paulsen S, Sun X, Wang W, Nobel A, Prins J: Mining approximate frequent itemsets in the presence of noise: Algorithm and analysis. SDM 2006, 405–416.
 Seppänen J, Mannila H: Dense itemsets. In ACM SIGKDD. ACM New York, NY, USA; 2004:683–688.
 Cheng H, Yu P, Han J: AC-Close: Efficiently mining approximate closed itemsets by core pattern recovery. ICDM 2006, 839–844.
 Besson J, Robardet C, Boulicaut J: Mining a new fault-tolerant pattern type as an alternative to formal concept discovery. LNCS 2006, 4068: 144.
 Cheng H, Yu P, Han J: Approximate frequent itemset mining in the presence of random noise. Soft Computing for Knowledge Discovery and Data Mining 2007, 363.
 Poernomo A, Gopalkrishnan V: Mining statistical information of frequent fault-tolerant patterns in transactional databases. In ICDM. IEEE Computer Society Washington, DC, USA; 2007:272–281.
 Poernomo A, Gopalkrishnan V: Towards efficient mining of proportional fault-tolerant frequent itemsets. In ACM SIGKDD. ACM New York, NY, USA; 2009:697–706.
 Gupta R, Fang G, Field B, Steinbach M, Kumar V: Quantitative evaluation of approximate frequent pattern mining algorithms. In ACM SIGKDD. ACM; 2008:301–309.
 Pandey G, Atluri G, Steinbach M, Myers C, Kumar V: An association analysis approach to biclustering. In ACM SIGKDD. ACM New York, NY, USA; 2009:677–686.
 Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K, Dwight S, Eppig J, et al.: Gene Ontology: tool for the unification of biology. Nature genetics 2000, 25: 25–29. 10.1038/75556
 Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005, 102(43):15545–50. 10.1073/pnas.0506580102
 Hughes T, Marton M, Jones A, Roberts C, Stoughton R, Armour C, Bennett H, Coffey E, Dai H, He Y, et al.: Functional discovery via a compendium of expression profiles. Cell 2000, 102: 109–126. 10.1016/S0092-8674(00)00015-5
 Lab TG: http://gasch.genetics.wisc.edu/datasets.html. Cell
 Hanhijärvi S, Ojala M, Vuokko N, Puolamäki K, Tatti N, Mannila H: Tell me something I don't know: randomization strategies for iterative data mining. In ACM SIGKDD. ACM New York, NY, USA; 2009:379–388.
 Han J, Kim H, Lee S, Park W, Lee J, Yoo N: Immunohistochemical expression of integrins and extracellular matrix proteins in non-small cell lung cancer: correlation with lymph node metastasis. Lung Cancer 2003, 41: 65–70. 10.1016/S0169-5002(03)00146-6
 Gupta R, Agrawal S, Rao N, Tian Z, Kuang R, Kumar V: Integrative Biomarker Discovery for Breast Cancer Metastasis from Gene Expression and Protein Interaction Data Using Error-tolerant Pattern Mining. Proc of the International Conference on Bioinformatics and Computational Biology (BICoB) 2010.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.