Model framework
The general goal is the identification of sets of genes with similar expression profiles (coordinated changes) that differ significantly from the background profile in a subset of experimental conditions (i.e., constant column biclusters using the terminology of Madeira and Oliveira [10]). By exploiting information contained in a given set of seed genes, we constrain the search space to biclusters that represent patterns similar to the seed gene pattern.
To this end we developed a framework based on Probabilistic Relational Models (PRMs) for query-driven biclustering of microarray data. PRMs were developed as an extension of Bayesian networks to the relational domain.
An overview of the Pro Bic Probabilistic Relational Model is shown in Figure 1: it contains the classes Gene, Array and Expression. For each class, gene, array and expression objects exist that are specific instantiations of the class (denoted by the lowercase letters g, a and e respectively). The complete set of genes, array and expression objects that belong to a certain class are indicated by uppercase letters G, A and E. Each object g and a of respectively the Gene and Array class has a number of binary attributes. The Boolean attributes B
b
indicate for each gene (array) object whether that gene (array) belongs to a bicluster b or not. The gene-bicluster labels g.B
b
(over all biclusters b) and the array-bicluster labels a.B
b
are the hidden variables of the model. The Array class has an additional attribute ID that uniquely identifies each individual array object a. This is needed because Pro Bic searches for constant column biclusters. Finally, each object e of the class Expression has a single numeric attribute e.level that contains the expression level for each specific gene and array combination.
The PRM imposes that each expression value that belongs to the same bicluster and array combination is modeled by a distribution with the same parameters.
Posterior distribution
The posterior distribution for the Pro Bic model is shown in Equation (1).
with θ being the collection of model parameters. In the following sections, we will discuss the likelihood and the prior in detail.
Likelihood
The likelihood of the Pro Bic model is shown in Equation (2). It expresses how probable the observed dataset is for different settings of the parameter vector θ. For notational convenience, the dependency on the model parameters is not written explicitly.
As shown in Equation (2), this likelihood consists of a conditional probability distribution (CPD) for the expression values and several marginal distributions modeling the assignment of genes and arrays to biclusters (see also following sections).
Modeling the expression values
The CPD for modeling the expression values consists of two factors:
The first factor describes the conditional probability of all expression levels given their gene and array bicluster assignment attributes:
A bicluster is modeled by a set of independent Normal distributions, one for each array that was assigned to the bicluster. For each array, an individual expression level is either part of (one or more) bicluster(s) or part of a background distribution. The distribution of the background expression values is modeled per array as a Normal distribution with parameters (µ
a
,
bgr
, σ
a
,
bgr
). The parameters of the background distributions are fixed and derived a priori from the dataset using a robust estimation [11].
A second factor regulates the model complexity by adjusting the probability that an expression value is belonging to a bicluster distribution compared to the background distribution.
The parameter that regulates this probability can be defined as a penalty factor log(πbicl/ πbgr) to control model complexity. The factor indicates how many times more likely it must be that an expression value is part of the bicluster distribution compared to being part of the background distribution before it is actually assigned to that bicluster. Detailed explanations of each of these factors can be found in the Additional File 1: Detailed explanation of the expression level CPD.
Marginal distributions modeling the assignment of genes and arrays to biclusters
The likelihood function (Equation (2)) contains a number of factors which can be used to introduce expert knowledge.
The probability for the gene to bicluster assignments P(g.B) is defined as a combination of two factors where each one defines a separate aspect of the prior [11]:
The first factor P
1
(g.B) reflects general expert knowledge on gene to bicluster assignments. It expresses the prior probability or expectation that a gene will belong to a bicluster, irrespective of the bicluster identity. It expresses our belief in the degree of modularity of the dataset and indirectly affects the average number of genes in a bicluster. By penalizing the addition of genes to a bicluster, one can control the tightness of coexpression in the bicluster. The second factor P
2
(g.B
b
) reflects prior knowledge on specific gene to bicluster assignments, namely, the probability for a specific gene g to belong to a particular bicluster b. This prior can be used to introduce detailed biological prior knowledge in the model on genes that should belong together in a bicluster.
Similarly P(a.B
b
) describes the prior probability for a specific array a to belong to a specific bicluster b. This prior can be used in a similar way as P
2
(g.B
b
), namely to introduce prior knowledge, by specifying the conditions that are more likely to belong together.
Each array is also given a unique identifier a.ID and in principle a probability distribution can be defined for every array. This distribution was chosen uniform for the analyses described in this study.
Prior for the model parameters P(θ)
A bicluster is modeled by a set of independent Normal distributions, one for each array that was assigned to the bicluster. Each bicluster-array combination is thus modeled as a Normal distribution with its own set of parameters (µ
a
,
b
, σ
a
,
b
) (Figure 1). Prior knowledge on the model parameters is introduced in the model through the appropriate prior distributions. We choose for conjugate prior distributions as these result in a simple decomposition of the total probability distribution.
Any member of the exponential family can be used as a conjugate to the distribution of Equation (7). We use Normal-Inverse-χ2 priors on the column-wise Gaussian probability distributions. This distribution is parameterized by the hyperparameters (µ
0
; κ
0
; ν
0
; σ2
0
). µ
0
reflects the prior mean and σ2/ κ
0
reflects the a priori variance on this mean. The parameters ν
0
and σ2
0
determine the a priori variance of the distribution and its associated variance.
The query-based aspect of our biclustering approach exploits the possibility to use strong priors on the bicluster distribution. By choosing the average expression value per array a of the set of query genes (µaquery ) as the prior mean µa,b0 , the algorithm will identify a bicluster b that remains centered around the original expression profile of the query genes. The prior standard deviation σa,b0 is by default chosen to be smaller than the background standard deviation σabgr by a fraction f
bcl
in order to identify tight bicluster profiles.
To prevent the bicluster from drifting too far away from the seed profile, the parameter κ0 should have a high value relative to the other hyperparameters (i.e., to force the variance on the prior mean to be small). The parameter ν0 determines the relative weight of the prior versus the data.
A more detailed explanation of the hyperparameters can be found in Additional File 2: Conjugate prior distributions.
Learning the model
To learn the model, we applied a hard-assignment EM approach [12] consisting of the following steps until convergence (for a detailed explanation, see [11]):
-
Maximization step: maximize over the model parameters, keeping the hidden variables (i.e., the gene and array bicluster assignments g.B and a.B) fixed.
-
Expectation step: find the expected values for the hidden variables g.B and a.B, keeping the current model parameters fixed.
As an initialization of the hidden variables (the gene to bicluster and array to bicluster assignments) a set of seed genes is used:
Datasets
For benchmarking we used a real dataset. This allows for a more unbiased comparison than e.g., the often used Prelic dataset [13] of which the assumptions underlying the data simulation favours some algorithms over others. Real data consisted of an E. coli cross-platform expression compendium [14] and sets of seed genes were derived from RegulonDB (version 6.2) [15]. RegulonDB enlists for the documented E. coli transcription factors (TFs) both single (all targets regulated by one specific TF) and complex regulons (targets regulated by a specific combination of TFs). We obtained in total 225 different sets of seeds ranging in size from 1 gene to 98 genes, corresponding to 89 simple regulons and 136 complex regulons (see Additional File 3: Biological dataset).
Running Pro Bic and benchmarking with other algorithms
For Pro Bic, two parameters that are influential in a query-based setting, i.e., f
bcl
and log(πbicl/ πbgr) were tuned. These parameters determine the bicluster size and quality as is illustrated in Additional File 4: Influence of parameter settings. ISA was obtained from [16]; QDB was obtained from [17]. For details on the used running parameters for all algorithms, see Additional File 5: Running parameters of query-based biclustering tools.
Calculation of bicluster comparison criteria
In order to assess the quality of the obtained biclusters and compare the results of different algorithms, we define several performance criteria related to both the expression quality of the biclusters as well as the retrieved biclusters' biological relevance.
Bicluster expression quality
High quality biclusters are identified as those that contain genes that are mutually tightly coexpressed and that preferentially vary largely over the selected conditions (i.e., with a profile different from the background profile). This behaviour is reflected in two measures: the standard deviation within conditions (STD-within, Equation (10)), and the uncentered standard deviation of the mean profile across conditions (STD-across, Equation (11)) respectively. These two measures can also be summarized in their ratio (i.e., STD-across/STD-within) to objectively assess the expression quality of a bicluster.
In Equation (10) and (11), G is the number of genes in the seed set or bicluster, C is the number of conditions in the compendium in case of a seed set and the number of conditions selected after biclustering in case of a bicluster. x
i
,
j
is the expression value measured for gene i and condition j, and
is the mean expression value of the genes in the seed set or bicluster for a certain array j.
Functional and motif enrichment
For the calculation of the functional enrichment, GO categories were derived from Ecocyc [18]. For the overrepresentation of regulatory motifs in biclusters genome wide motif hits were obtained by motif screening. Screening was performed using Clover [19] and the PWMs representing the motifs of interest. For all first genes of transcription units their upstream region (400bp) was screened. In case of multiple TUs per gene, the TU for which the corresponding first gene had the longest intergenic region was selected. All sequence information was retrieved from NCBI (NC_000913) [20]. We estimated from all motif screening scores a robust noise distribution and used this distribution to select for each motif a threshold on the score. Hits with a score higher than this threshold were considered true motif hits. Seed genes were excluded from the biclusters for all enrichment calculations. Enrichments were calculated using the hypergeometric distribution (0.05 significance level). Due to the discreteness of the distributions one-sided mid-P-values were used [21].