Query-based biclustering of gene expression data using Probabilistic Relational Models
- Hui Zhao†1,
- Lore Cloots†1,
- Tim Van den Bulcke2,
- Yan Wu1,
- Riet De Smet1,
- Valerie Storms1,
- Pieter Meysman1,
- Kristof Engelen1 and
- Kathleen Marchal1Email author
© Zhao et al; licensee BioMed Central Ltd. 2011
Published: 15 February 2011
With the availability of large scale expression compendia it is now possible to view own findings in the light of what is already available and retrieve genes with an expression profile similar to a set of genes of interest (i.e., a query or seed set) for a subset of conditions. To that end, a query-based strategy is needed that maximally exploits the coexpression behaviour of the seed genes to guide the biclustering, but that at the same time is robust against the presence of noisy genes in the seed set as seed genes are often assumed, but not guaranteed to be coexpressed in the queried compendium. Therefore, we developed Pro Bic, a query-based biclustering strategy based on Probabilistic Relational Models (PRMs) that exploits the use of prior distributions to extract the information contained within the seed set.
We applied Pro Bic on a large scale Escherichia coli compendium to extend partially described regulons with potentially novel members. We compared Pro Bic's performance with previously published query-based biclustering algorithms, namely ISA and QDB, from the perspective of bicluster expression quality, robustness of the outcome against noisy seed sets and biological relevance.
This comparison learns that Pro Bic is able to retrieve biologically relevant, high quality biclusters that retain their seed genes and that it is particularly strong in handling noisy seeds.
Pro Bic is a query-based biclustering algorithm developed in a flexible framework, designed to detect biologically relevant, high quality biclusters that retain relevant seed genes even in the presence of noise or when dealing with low quality seed sets.
With the large body of publicly available gene expression data, compendia are being compiled that assess gene expression in a plethora of conditions and perturbations . Comparing own experimental data with these large scale gene expression compendia allows viewing own findings in a more global cellular context. To this end query-based biclustering techniques [2–6] can be used that combine both gene and condition selection to identify genes that are coexpressed with genes of interest (i.e., a query or seed set, containing one or more genes) for a subset of conditions. These biclustering algorithms do not only differ from each other in their search strategy, but also in the way they exploit the expression signal of the seed genes to identify the query-based biclusters. Some algorithms use the mean expression profile of the seeds to initialize the biclustering , while others use the ‘similarity’ between the mean profiles of the seed set and the bicluster to constrain the search at each iteration .
For a query-based biclustering algorithm, it is naturally important to keep a bicluster centered around the seed genes as it should not converge to a bicluster that no longer contains the seed genes. However, it can not be guaranteed that all genes within the seed set will be tightly coexpressed. In such cases, adhering too strictly to the query (e.g., by relying heavily on the mean query profile to steer the biclustering), will deteriorate the results as the algorithm will not be able to compensate for incoherent query profiles. An efficient query-based biclustering algorithm should therefore always retain part of the seed genes, but simultaneously allow sufficient freedom to adjust for non-perfect or noisy sets of seed genes. In order to accommodate for these contrasting requirements in a flexible way, we developed a query-based biclustering method called Pro Bic. The model is formulated in the framework of Probabilistic Relational Models (PRMs) [7–9]. Query information is exploited via a Bayesian prior. We compared our algorithm with two of the best state-of-the art query-based biclustering algorithms, namely Iterative Signature Algorithm (ISA)  and Query-Driven Biclustering (QDB) , for a number of different bicluster comparison criteria on a large compendium of Escherichia coli microarray experiments.
The general goal is the identification of sets of genes with similar expression profiles (coordinated changes) that differ significantly from the background profile in a subset of experimental conditions (i.e., constant column biclusters using the terminology of Madeira and Oliveira ). By exploiting information contained in a given set of seed genes, we constrain the search space to biclusters that represent patterns similar to the seed gene pattern.
To this end we developed a framework based on Probabilistic Relational Models (PRMs) for query-driven biclustering of microarray data. PRMs were developed as an extension of Bayesian networks to the relational domain.
The PRM imposes that each expression value that belongs to the same bicluster and array combination is modeled by a distribution with the same parameters.
with θ being the collection of model parameters. In the following sections, we will discuss the likelihood and the prior in detail.
As shown in Equation (2), this likelihood consists of a conditional probability distribution (CPD) for the expression values and several marginal distributions modeling the assignment of genes and arrays to biclusters (see also following sections).
Modeling the expression values
A bicluster is modeled by a set of independent Normal distributions, one for each array that was assigned to the bicluster. For each array, an individual expression level is either part of (one or more) bicluster(s) or part of a background distribution. The distribution of the background expression values is modeled per array as a Normal distribution with parameters (µ a , bgr , σ a , bgr ). The parameters of the background distributions are fixed and derived a priori from the dataset using a robust estimation .
The parameter that regulates this probability can be defined as a penalty factor log(πbicl/ πbgr) to control model complexity. The factor indicates how many times more likely it must be that an expression value is part of the bicluster distribution compared to being part of the background distribution before it is actually assigned to that bicluster. Detailed explanations of each of these factors can be found in the Additional File 1: Detailed explanation of the expression level CPD.
Marginal distributions modeling the assignment of genes and arrays to biclusters
The likelihood function (Equation (2)) contains a number of factors which can be used to introduce expert knowledge.
The first factor P 1 (g.B) reflects general expert knowledge on gene to bicluster assignments. It expresses the prior probability or expectation that a gene will belong to a bicluster, irrespective of the bicluster identity. It expresses our belief in the degree of modularity of the dataset and indirectly affects the average number of genes in a bicluster. By penalizing the addition of genes to a bicluster, one can control the tightness of coexpression in the bicluster. The second factor P 2 (g.B b ) reflects prior knowledge on specific gene to bicluster assignments, namely, the probability for a specific gene g to belong to a particular bicluster b. This prior can be used to introduce detailed biological prior knowledge in the model on genes that should belong together in a bicluster.
Similarly P(a.B b ) describes the prior probability for a specific array a to belong to a specific bicluster b. This prior can be used in a similar way as P 2 (g.B b ), namely to introduce prior knowledge, by specifying the conditions that are more likely to belong together.
Each array is also given a unique identifier a.ID and in principle a probability distribution can be defined for every array. This distribution was chosen uniform for the analyses described in this study.
Prior for the model parameters P(θ)
Any member of the exponential family can be used as a conjugate to the distribution of Equation (7). We use Normal-Inverse-χ2 priors on the column-wise Gaussian probability distributions. This distribution is parameterized by the hyperparameters (µ 0 ; κ 0 ; ν 0 ; σ 2 0 ). µ 0 reflects the prior mean and σ 2 / κ 0 reflects the a priori variance on this mean. The parameters ν 0 and σ 2 0 determine the a priori variance of the distribution and its associated variance.
The query-based aspect of our biclustering approach exploits the possibility to use strong priors on the bicluster distribution. By choosing the average expression value per array a of the set of query genes (µaquery ) as the prior mean µa,b0 , the algorithm will identify a bicluster b that remains centered around the original expression profile of the query genes. The prior standard deviation σa,b0 is by default chosen to be smaller than the background standard deviation σabgr by a fraction f bcl in order to identify tight bicluster profiles.
To prevent the bicluster from drifting too far away from the seed profile, the parameter κ0 should have a high value relative to the other hyperparameters (i.e., to force the variance on the prior mean to be small). The parameter ν0 determines the relative weight of the prior versus the data.
A more detailed explanation of the hyperparameters can be found in Additional File 2: Conjugate prior distributions.
Learning the model
Maximization step: maximize over the model parameters, keeping the hidden variables (i.e., the gene and array bicluster assignments g.B and a.B) fixed.
Expectation step: find the expected values for the hidden variables g.B and a.B, keeping the current model parameters fixed.
For benchmarking we used a real dataset. This allows for a more unbiased comparison than e.g., the often used Prelic dataset  of which the assumptions underlying the data simulation favours some algorithms over others. Real data consisted of an E. coli cross-platform expression compendium  and sets of seed genes were derived from RegulonDB (version 6.2) . RegulonDB enlists for the documented E. coli transcription factors (TFs) both single (all targets regulated by one specific TF) and complex regulons (targets regulated by a specific combination of TFs). We obtained in total 225 different sets of seeds ranging in size from 1 gene to 98 genes, corresponding to 89 simple regulons and 136 complex regulons (see Additional File 3: Biological dataset).
Running Pro Bic and benchmarking with other algorithms
For Pro Bic, two parameters that are influential in a query-based setting, i.e., f bcl and log(πbicl/ πbgr) were tuned. These parameters determine the bicluster size and quality as is illustrated in Additional File 4: Influence of parameter settings. ISA was obtained from ; QDB was obtained from . For details on the used running parameters for all algorithms, see Additional File 5: Running parameters of query-based biclustering tools.
Calculation of bicluster comparison criteria
In order to assess the quality of the obtained biclusters and compare the results of different algorithms, we define several performance criteria related to both the expression quality of the biclusters as well as the retrieved biclusters' biological relevance.
Bicluster expression quality
In Equation (10) and (11), G is the number of genes in the seed set or bicluster, C is the number of conditions in the compendium in case of a seed set and the number of conditions selected after biclustering in case of a bicluster. x i , j is the expression value measured for gene i and condition j, and is the mean expression value of the genes in the seed set or bicluster for a certain array j.
Functional and motif enrichment
For the calculation of the functional enrichment, GO categories were derived from Ecocyc . For the overrepresentation of regulatory motifs in biclusters genome wide motif hits were obtained by motif screening. Screening was performed using Clover  and the PWMs representing the motifs of interest. For all first genes of transcription units their upstream region (400bp) was screened. In case of multiple TUs per gene, the TU for which the corresponding first gene had the longest intergenic region was selected. All sequence information was retrieved from NCBI (NC_000913) . We estimated from all motif screening scores a robust noise distribution and used this distribution to select for each motif a threshold on the score. Hits with a score higher than this threshold were considered true motif hits. Seed genes were excluded from the biclusters for all enrichment calculations. Enrichments were calculated using the hypergeometric distribution (0.05 significance level). Due to the discreteness of the distributions one-sided mid-P-values were used .
We used Pro Bic to search for genes that are tightly coexpressed with known regulon members in E. coli. Genes found to be closely expressed with the regulon seed genes are assumed to be potential undocumented targets for the regulon’s associated TF(s). For all tests described below, we benchmarked our method with other query-based biclustering algorithms for which a high performance on real datasets was shown previously, i.e., QDB  and ISA . The results section is divided into two main parts: first we evaluate the performance of the algorithms for retrieving good quality biclusters, next we see whether these perfornance differences actually lead to more biologically relevant results.
Performance of the algorithms
To evaluate the query-based biclustering performance of the different approaches we assess to what extent the biclusters remain centered around the seed genes, how their expression quality relates to the expression quality of the seed genes, and the ability of the methods to handle noisy seed genes.
Behavior towards seed genes
A first illustration of how the used algorithms cope differently with the seed genes is given in Additional File 6: Behavior of the different algorithms towards seed genes. This table shows to what extent the final biclusters still contained the original seed genes. It suggests that when a seed set is informative for the queried expression dataset (meaning that the dataset indeed contains additional genes being coexpressed with the seed set), all algorithms are able to find biclusters that contain all or part of the original seed genes. In contrast seed sets that are non-informative for the queried dataset give rise to either ‘empty’ biclusters, to biclusters with only the seed genes or to biclusters that drift away and do not longer contain the seed genes. Here ISA mainly results in biclusters that loose their seeds genes. The reason is that ISA is query-based only in its initialization. Contrary to ISA, Pro Bic and QDB both use prior distributions to keep the bicluster profile centered around the seed gene profile.
Expression quality of the biclusters
Difference in handling noisy seed genes
Results for the seed sets of intermediate quality can be found in the Additional File 7: Robustness of ProBic, QDB and ISA to noisy seed genes of intermediate quality. Pro Bic is most robust against the presence of noisy seed genes: in the high quality seed set it tends to keep all true seed genes and finds extra genes irrespective of the percentage of noisy seed genes that was added to the true seed genes, while QDB and ISA fail to identify biclusters containing the true seed genes.
In the case of a low quality seed set, it is harder to distinguish the true seed genes from the randomly added ones. While all algorithms perform worse under these conditions, Pro Bic still retrieves part of the seed genes for up to 20% of noise genes. Both ISA and QDB mostly fail to keep the true seed genes in all conditions.
In the previous section we showed how Pro Bic outperforms QDB and ISA with respect to a set of objective query-based bicluster criteria. In this section we will evaluate whether Pro Bic also leads to more relevant biclusters from a biological perspective. The biological relevance of retrieved biclusters is assessed through functional and motif enrichment calculations and a cross-validation experiment.
Functional and motif enrichment
Cross-validation for identification of known targets of TF(s)
In this experimental set up, we used part of the known regulon members as seeds and tested to what extent the different query-based bicluster approaches could retrieve the left out known targets (i.e., validation set) as additional bicluster members. Of the original 225 regulon sets, we retained the ones with five or more genes. Each of these resulting 49 sets was divided into a seed set and a validation set containing respectively four fifth and one fifth of the number of original seed genes. This procedure was repeated in a five-fold cross-validation set up.
For each of the methods, a query-based biclustering was performed with the seed sets. Results were validated by checking if the left out genes of the original regulon set (i.e., the validation set) were retrieved in the obtained biclusters. To this end we calculated the percentage of genes of the validation set that were retrieved in the biclusters (i.e., recall). To compensate for the fact that the recall is likely to increase with the size of the biclusters, we also calculated the percentage of validation genes found in the bicluster to the total number of genes in the bicluster (i.e., ‘enrichment’ of validation genes in the bicluster). Results presented in Additional File 8: Recall and ‘enrichment’ of the biclusters in a cross-validation experiment, are the averages of the five cross-validations and show that for most seed sets, biclusters of Pro Bic show a higher recall and enrichment of the genes in the validation set than biclusters obtained by QDB and ISA. In most cases none of the algorithms is able to retrieve all of the validation genes (i.e., a recall of 1). This is to be expected as a regulon membership not necessarily implies coexpression under the conditions of the expression compendium.
In this work we developed Pro Bic, a query-based biclustering model formulated in the framework of Probabilistic Relational Models. We compared Pro Bic to related query-based biclustering techniques with respect to obtaining high quality and biologically relevant biclusters, centered around the seed genes and with respect to their ability of dealing with noisy seed genes. Although query-based biclustering by ISA results in biologically relevant biclusters of high expression quality, ISA does not constrain the bicluster to remain centered around the seed genes and therefore cannot handle noisy genes in the seed set. Pro Bic on the other hand does so in a soft way by using prior distributions to constrain the bicluster distributions. In that sense Pro Bic is more similar to QDB, another model based biclustering algorithm. Despite their model similarities, QDB and Pro Bic differ in the implementation. Firstly, QDB estimates the parameters for the background distribution for each array during the iterations of the optimization procedure from the expression values on an array that are not yet assigned to a bicluster. However, as genes on the array can be over- or underexpressed without belonging to the current bicluster, the background variance will tend to get overestimated, rendering it more difficult to determine whether for a certain selection of genes a condition belongs to the background or not. Pro Bic on the other hand estimates the parameters of the background distributions upfront and in a robust way, i.e., by 'filtering out' per condition the most over- and underexpressed genes.
Another major difference is the way the model is learned: whereas Pro Bic is run with one parameter setting until convergence to a local optimum, QDB uses a resolution sweep: i.e., the prior variance on the bicluster distributions is increased during several consecutive runs of the algorithm, allowing biclusters to become gradually more coarse grained. The starting point for a new run with an increased prior variance is the solution (i.e., a mode in the posterior landscape) that was found in a previous run performed with a slightly smaller prior variance. Applying strong priors on the model parameters give rise to rather simple posterior distributions and a fast convergence for each prior variance. This can come at the cost of the bicluster quality: relying too heavily on the seed genes explains the bad performance in case of noisy seed genes and the dependence of the quality of the bicluster on the quality of the seed genes.
Because of the ability to use a less informative prior and the characteristics of the estimation of the background distributions, Pro bic tends to find biclusters with tightly coexpressed genes and conditions in which the bicluster genes are markedly differentially expressed. This is reflected in the high scores for the quality parameters. Biclusters retrieved by QDB on the other hand are often coarse grained, noisy and large in both genes and conditions.
Pro Bic is a query-based biclustering algorithm, designed to detect biologically relevant, high quality biclusters that retain their seed genes even in the presence of noise or when dealing with low quality seeds. It outperforms QDB and ISA with respect to a set of objective query-based bicluster criteria, namely to what extent the biclusters remain focused around the seed genes, how their expression quality relates to the expression quality of the seed genes, and their ability in handling noisy seed genes. This increased performance also resulted in more relevant biclusters from a biological perspective. In addition, the underlying PRM-based framework is extendable towards integrating additional data sources such as motif information and ChIP-chip information that can further help refining the obtained biclusters.
Funding: This work is supported by: 1) KUL: GOA AMBioRICS, GOA/08/011, CoE NATAR IOK-C1895-PF/10/010, SymBioSys; CREA/08/023 2) IWT: SBO-BioFrame; 3) IUAP P6/25 (BioMaGNet); 4) FWO IOK-B9725-G.0329.09; 5) ZKB8933/CREA/08/023/ BOF, 6) HFSP-RGY0079/2007C.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S1.
- Fierro AC, Vandenbussche F, Engelen K, Van de Peer Y, Marchal K: Meta Analysis of Gene Expression Data within and Across Species. Curr Genomics 2008, 9: 525–534. 10.2174/138920208786847935PubMed CentralView ArticlePubMedGoogle Scholar
- Owen AB, Stuart J, Mach K, Villeneuve AM, Kim S: A gene recommender algorithm to identify coexpressed genes in C. elegans. Genome Res 2003, 13: 1828–1837.PubMed CentralPubMedGoogle Scholar
- Bergmann S, Ihmels J, Barkai N: Iterative signature algorithm for the analysis of large-scale gene expression data. Phys Res E Stat Nonlin Soft Matter Phys 2003, 67(3 Pt 1):031902. 10.1103/PhysRevE.67.031902View ArticleGoogle Scholar
- Wu CJ, Kasif S: GEMS: a web server for biclustering analysis of biclustering data. Nucleic Acids Res 2005, 33: W596-W599. 10.1093/nar/gki469PubMed CentralView ArticlePubMedGoogle Scholar
- Dhollander T, Sheng Q, Lemmens K, De Moor B, Marchal K, Moreau Y: Query-driven module discovery in microarray data. Bioinformatics 2007, 23: 2573–2580. 10.1093/bioinformatics/btm387View ArticlePubMedGoogle Scholar
- Hibbs MA, Hess DC, Myers CL, Huttenhower C, Li K, Troyanskaya OG: Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics 2007, 23: 2692–2699. 10.1093/bioinformatics/btm403View ArticlePubMedGoogle Scholar
- Koller D, Pfeffer A: Probabilistic frame-based systems. Proceedings of the Fifteenth National Conference on Artificial Intelligence: 26–30 July 1998; Madison 1998, 580–587.Google Scholar
- Friedman N, Getoor L, Koller D, Pfeffer A: Learning probabilistic relational models. International Joint Conference on Artificial Intelligence: 31 July – 6 August 1999; Stockholm 1999, 1300–1309.Google Scholar
- Getoor L, Friedman N, Koller D, Taskar B: Learning probabilistic models of relational structure. Proceedings of the 18th International Conference on Machine Learning: 2001; San Francisco 2001, 170–177.Google Scholar
- Madeira SC, Oliveira AL: Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform 2004, 1: 24–45. 10.1109/TCBB.2004.2View ArticlePubMedGoogle Scholar
- Van den Bulcke T: Robust algorithms for inferring regulatory networks based on gene expression measurements and biological prior information. PhD thesis. Katholieke Universiteit Leuven, Faculty of Engineering; 2009.Google Scholar
- Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society series B 1977, 39: 1–38.Google Scholar
- Prelić A, Bleuler S, Zimmermann P, Wille A, Bühlmann P, Gruissem W, Hennig L, Thiele L, Zitzler E: A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 2006, 22: 1122–1129. 10.1093/bioinformatics/btl060View ArticlePubMedGoogle Scholar
- Lemmens K, De Bie T, Dhollander T, De Keersmaecker SC, Thijs IM, Schoofs G, De Weerdt A, De Moor B, Vanderleyden J, Collado-Vides J, Engelen K, Marchal K: DISTILLER: a data integration framework to reveal condition dependency of complex regulons in Escherichia coli. Genome Biol 2009, 10: R27. 10.1186/gb-2009-10-3-r27PubMed CentralView ArticlePubMedGoogle Scholar
- Gama-Castro S, Jimenez-Jacinto V, Peralta-Gil M, Santos-Zavaleta A, Penaloza-Spinola MI, Contreras-Moreira B, Segura-Salazar J, Muniz-Rascado L, Martinez-Flores I, Salgado H, Bonavides-Martinez C, Abreu-Goodger C, Rodriguez-Penagos C, Miranda-Rios J, Morett E, Merino E, Huerta AM, Trevino-Quintanilla L, Collado-Vides J: RegulonDB: gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Res 2008, 36: D120–124. 10.1093/nar/gkm994PubMed CentralView ArticlePubMedGoogle Scholar
- ISA matlab package[http://www2.unil.ch/cbg/index.php?title=ISA]
- QDB source code[http://homes.esat.kuleuven.be/_tdhollan/Supplementary_Information_Dhollander_2007/index.html]
- Keseler IM, Bonavides-Martínez C, Collado-Vides J, Gama-Castro S, Gunsalus RP, Johnson DA, Krummenacker M, Nolan LM, Paley S, Paulsen IT, Peralta-Gil M, Santos-Zavaleta A, Shearer AG, Karp PD: EcoCyc: a comprehensive view of Escherichia coli biology. Nucleic Acids Res 2009, 37: D464-D470. 10.1093/nar/gkn751PubMed CentralView ArticlePubMedGoogle Scholar
- Frith MC, Fu Y, Yu L, Chen J-F, Hansen U, Weng Z: Detection of functional DNA motifs via statistical over-representation. Nucleic Acids Res 2004, 32(4):1372–81. 10.1093/nar/gkh299PubMed CentralView ArticlePubMedGoogle Scholar
- NCBI (NC_000913) Escherichia coli str. K-12 substr. MG1655 chromosome, complete genome[http://www.ncbi.nlm.nih.gov/nuccore/49175990]
- Rivals I, Personnaz L, Taing L, Potier MC: Enrichment or depletion of a GO category within a class of genes: which test? Bioinformatics 2007, 23: 401–407. 10.1093/bioinformatics/btl633View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.