Materials and data
Expression data
Expression data for Halobacterium were collected by members of the Baliga lab, containing genomewide measurements of mRNA expression in 292 conditions, as described in full in [22] and references therein. Expression data for H. pylori and S. cerevisiae were collected from the Stanford Microarray Database [3]. Certain experiments such as strain comparisons, genomic DNA, and RNA decay experiments which are unlikely to relate to gene regulation were removed from the sets prior to analysis. This filtering resulted in 58 of an original 250 conditions for H. pylori and 667 of 1051 conditions for S. cerevisiae. Data for E. coli was compiled from publiclyavailable data provided to us by E. A1m, including 86 conditions. As a preprocessing step, genes were removed from the expression data for which there was not significant (1.5fold) expression in any of the experiments. The data were then rownormalized (each gene's expression levels normalized to mean = 0, SD = 1). No further preprocessing or filtering of the microarray data was performed.
Association and metabolic networks
Genetic associations derived from comparative genomics, such as phylogenetic profile, Rosetta Stone, gene neighbor and gene cluster, were compiled from Prolinks [23] and Predictome [62] for all organisms. These networks include predicted operon "associations," which were also used to identify "unique" regulatory sequences that are to be used in the motif detection. Metabolic network reconstructions from the Kyoto Encyclopedia of Genes and Genomes [48] were represented as associations between two genes if they participate in a reaction sharing one or more ligands, after removing the most highlyconnected ligands, such as water and ATP [16].
Interaction networks
H. pylori proteinprotein interactions were collected from the global experiments of [70]. S. Cerevisiae proteinprotein, proteinDNA, and genetic interactions were collected from DIP [73] and BIND [9]. The proteinDNA interactions were converted into a network of associations between all pairs of genes whose upstream sequences were found to bind to the same regulator(s).
Upstream sequences
Upstream sequences for all organisms were obtained from GenBank using the Regulatory Sequence Analysis Tools (RSAT [92]). Using these tools, we extracted 1000bp cisregulatory sequences. For bacteria and archaea, these sequences were shortened to 500 bp, and then "operonshifted" using the gene cluster (operon association) networks from Prolinks [23] and Predictome [62]. Upstream sequences for genes in the same operon were converted to "operonshifted" sequences by using the (same) upstream sequence of the first gene in the operon Similar "operonshifted" upstream sequences were identified using BLASTN [8] using a 50 bp nongapped alignment window, to avoid using multiple copies of the same sequence in the motif detection.
Functional annotations for comparison tests
Gene ontology (GO) [40] annotations for each organism were obtained from the European Bioinformatics Institute [1] and matched to annotation names obtained from the GO web site. KEGG annotations were downloaded from their web site [2]. Predicted and experimentallyderived DNA binding motifs were obtained for S. cerevisiae from [39, 60], and for E. coli from [71, 72]. When these binding motifs were provided as position weight matrices (PWMs), they were converted into regular expressions, in order to enable rapid scanning of upstream sequences.
The bicluster model
Model overview
Each bicluster is modeled via a Markov chain process, in which the bicluster is iteratively optimized, and its state is updated based upon conditional probability distributions computed using the cluster's previous state. This enables us to define probabilities that each gene or condition belongs in the bicluster, conditioned upon the current state of the bicluster, as opposed to requiring us to build a complete (joint) model for the bicluster, a priori. The components of this conditional probability are modeled independently (one for each of the different types of information which we are integrating) as pvalues based upon individual data likelihoods, which are then combined into a regression model to derive the full conditional probability. In this work, three major distinct data types are used (gene expression, upstream sequences, and association networks), and accordingly pvalues for three such model components are computed: the expression component, the sequence component, and the network component.
Each bicluster begins as a seed, or starting cluster, that is iteratively optimized by adding/removing genes and conditions to/from the cluster by sampling from the conditional probability distribution using a Monte Carlo procedure, to prevent premature convergence. Such an iterative machine learning technique is akin to a Markov chain Monte Carlo (MCMC) process. Additional clusters are seeded and optimized until a given number (k_{max}) of clusters have been generated, or significant optimization is no longer possible. The complete process is shown schematically in Fig. 8, and described in detail below.
In the following discussion, let i be an arbitrary gene and j an arbitrary experimental condition. A bicluster k ∈ K is fully defined by its set of genes I_{
k
}and experimental conditions J_{
k
}. The membership y_{
lk
}∈ {0, 1} of an arbitrary gene or condition l in bicluster k is an independent Bernoulli indicator variable with conditional probability p(y_{
lk
}= 1).
The expression component
The expression data is a set of measurements of genes i ∈ I over experiments j ∈ J, comprising a I × J matrix x_{
ij
}∈ X. Each bicluster k defines a I_{
k
} × J_{
k
} submatrix x_{
i'j'
}∈ X_{
k
}: i' ∈ I_{
k
}⊂ I; j' ∈ J_{
k
}⊂ J. The variance in the measured levels of condition j is {\sigma}_{j}^{2}=I{}^{1}{\displaystyle {\sum}_{i\in I}{({x}_{ij}{\overline{x}}_{j})}^{2}}, where {\overline{x}}_{j}={\displaystyle {\sum}_{i\in I}{x}_{ij}/\leftI\right}. We compute the mean expression level of condition j over the cluster's genes I_{
k
}, {\overline{x}}_{jk}={\displaystyle {\sum}_{i\in {I}_{k}}{x}_{ij}/\left{I}_{k}\right}. Then, the likelihood of an arbitrary measurement x_{
ij
}relative to this mean expression level is
p({x}_{ij})=\frac{1}{\sqrt{2\pi \left({\sigma}_{j}^{2}+{\epsilon}^{2}\right)}}\mathrm{exp}\phantom{\rule{0.5em}{0ex}}\left[\frac{1}{2}\frac{{({x}_{ij}{\overline{x}}_{jk})}^{2}+{\epsilon}^{2}}{{\sigma}_{j}^{2}+{\epsilon}^{2}}\right],\phantom{\rule{0.5em}{0ex}}\left(1\right)
which includes the term ε for an unknown systematic error in condition j, here assumed to be the same for all j. Note that the use of σ_{
j
}over all genes I rather than a σ_{
jk
}computed over I_{
k
}results in a lower likelihood p(x_{
ij
}) for those conditions j that have a small overall variance, and are therefore more likely to be correlated by random chance. Also, such lowvariance conditions could be the result of poor labeling, or other systematic problems.
The likelihood of the measurements of an arbitrary gene i among the conditions in bicluster k are p({x}_{i})={\displaystyle {\prod}_{j\in {J}_{k}}p({x}_{ij})}, and similarly the likelihood of a condition j's measurements are p({x}_{j})={\displaystyle {\prod}_{i\in {I}_{k}}p({x}_{ij})}. We integrate the two tails of the Normal distribution in Eq. 1 to derive coexpression pvalues for each gene i, r_{
ik
}, and for each condition j, r_{
jk
}, relative to bicluster k.
Sequence component (motif cooccurrence)
Each gene i has an upstream cisregulatory sequence S_{
i
}(a string of DNA nucleotides of length l_{S}), and bicluster k defines a set of sequences S_{
k
}for all S_{
i'
}; i' ∈ I_{
k
}. The decision whether an arbitrary gene's upstream sequence, S_{
i
}, shares common motif(s) with sequences S_{
k
}, is determined via a twostep process: (1) identify one or more motif(s) M_{
k
}that is (are) significantly overrepresented in many (if not all) bicluster sequences S_{
k
}, and then (2) scan Si to see if it also contains M_{
k
}.
In this work, we are not advancing the basic methodology for motif detection (step 1), as relatively mature methods exist for finding motifs given a fixed set of sequences [91]. Instead, we are describing an overall strategy that incorporates previously existing motif finding algorithm(s) into a clustering procedure. As such, the procedure is motifdetectionalgorithm agnostic, and the search may be performed using one of many existing methods [91]. Our only requirements are that (a) significantly overrepresented motifs do not have to exist in all sequences S_{
k
}, and (b) it can produce a score (preferentially a pvalue) that an arbitrary sequence contains the detected motif(s). The MEME algorithm [10], which identifies significant sequence motifs using expectation maximization of one or more probabilistic motif models given a fixed set of sequences and a background residue model, is used to perform step (1), as it meets the first criterion (a). MEME's companion algorithm MAST [10], which computes the pvalue that an arbitrary sequence matches the set of motifs detected with MEME, is used to perform step (2), as it meets the second criterion (b). During the motif detection step, for any genes in bicluster k which are in an operon, we make sure to use only one copy of the upstream sequence for that operon (i.e. upstream of the first gene in that operon), as described above ("Upstream sequences"). Additional details on the specific parameters passed to these procedures are provided in [Additional File 1, Table Three].
Thus, using these two algorithms, we can detect a set of motifs M_{
k
}in sequences S_{
k
}, and compute a pvalue that a sequence S_{
i
}contains those motifs. Note that this pvalue is computed for each upstream sequence in the genome, including those for the genes within cluster k, to derive the motif pvalues, s_{
ik
}, for each gene i relative to bicluster k, at each iteration of the MCMC procedure.
Association network component
To build up a highlyconnected subnetwork among genes that are in a bicluster (given a full set of associations), we aim to add genes preferentially that have a greater number of connections to those currently in the bicluster than one would expect (at random) based upon the overall connectivity in the network. Thus, we compute pvalues for observing the associations between a gene or experimental condition and the genes or conditions currently in bicluster k, given an association network N ∈ N. In the following discussion, genes are the primary consideration, but networks of associations between experimental conditions are conceivable (e.g., we might wish to preferentially group conditions that are part of the same time series). The network association pvalue, {q}_{ik}^{N}, is computed based upon the number of edges in network N connecting gene i to genes I_{
k
}in bicluster k, relative to the total number of edges connected between i and the genes {{I}^{\prime}}_{k} (that are not in cluster k), as well as the connections within and between the gene sets I_{
k
}and {{I}^{\prime}}_{k}. The hypergeometric distribution is used to compute the probability of observing such an arrangement of connections by chance:
p({n}_{i\to {I}_{k}}{n}_{i\to {{I}^{\prime}}_{k}},{n}_{{I}_{k}\to {I}_{k}},{n}_{{I}_{k}\to {{I}^{\prime}}_{k}})=\frac{\left(\begin{array}{c}{n}_{i\to {I}_{k}}+{n}_{{I}_{k}\to {I}_{k}}\\ {n}_{i\to {I}_{k}}\end{array}\right)\left(\begin{array}{c}{n}_{i\to {{I}^{\prime}}_{k}}+{n}_{{I}_{k}\to {{I}^{\prime}}_{k}}\\ {n}_{i\to {{I}^{\prime}}_{k}}\end{array}\right)}{\left(\begin{array}{c}{n}_{i\to {I}_{k}}+{n}_{{I}_{k}\to {I}_{k}}+{n}_{i\to {{I}^{\prime}}_{k}}+{n}_{{I}_{k}\to {{I}^{\prime}}_{k}}\\ {n}_{i\to {I}_{k}}+{n}_{i\to {{I}^{\prime}}_{k}}\end{array}\right)},\phantom{\rule{0.5em}{0ex}}\left(2\right)
where A → B represents the set of associations between the elements in gene set A with those in set B, and n_{A→B}is the number of these associations. Expression (2) is analogous to the hypergeometric measure of mutual clustering coefficient described by [37]. However, it does not account for the global structure of the network; it is only concerned with the local associations, i.e. those directly connected to gene i and the bicluster's genes, I_{
k
}. This choice of connectivity measure allows a single value to be directly computed for each gene, relative to each cluster, and gives greater preference to an individual gene i being added to cluster k if a large fraction of i's associations are with the other genes in the cluster (and vice versa), independent of the global distribution of associations in the network. Individual pvalues, {q}_{ik}^{N}, for each gene i and each network N are computed for bicluster k by integrating the lower tail of the distribution in Eq. 2.
The joint cluster membership probability
The ultimate goal is to decide gene or condition bicluster membership jointly on the basis of the three individual sets of pvalues r_{
ik
}, s_{
ik
}, and q_{
ik
}computed above (for the remainder of this discussion, we now use i to denote a gene or experimental condition). A common procedure for combining "scores" such as these into a single joint likelihood is to perform a multiparametric logistic regression [41] that treats each pvalue measure as a random variable and estimates the joint membership likelihood p(y_{
ik
}= 1) using the logistic function,
{\pi}_{ik}\equiv p({y}_{ik}=1{X}_{k},{S}_{i},{M}_{k},N)\propto \mathrm{exp}\phantom{\rule{0.5em}{0ex}}\left[{\beta}_{0}+{r}_{0}\mathrm{log}\phantom{\rule{0.5em}{0ex}}({r}_{ik})+{s}_{0}\mathrm{log}\phantom{\rule{0.5em}{0ex}}({s}_{ik})+{\displaystyle \sum _{N\in N}{q}_{0}^{N}}\mathrm{log}\phantom{\rule{0.5em}{0ex}}({q}_{ik}^{N})\right].\phantom{\rule{0.5em}{0ex}}\left(3\right)
This model approximates a (probabilistic) discriminating hyperplane in the space defined by r_{
ik
}, s_{
ik
}, and q_{
ik
}, parameterized by the four independent variables β_{0} (the intercept), and r_{0}, s_{0}, and q_{0} (the slope) that maximally discriminates the genes or conditions within the bicluster (I_{
k
}) from those outside ({{I}^{\prime}}_{k}). Conceptually, the model implies that a gene or condition that poorly matches the bicluster based on one data type can still be added to the bicluster if it matches well to the other data types, analogous to, for example, the explicit "softening" of cluster boundaries performed by [15]. Note that when element i is an experimental condition, s_{0} (the motif parameter) is zero.
In practice, during early iterations when the bicluster is not welldiscriminated from the background, such an unconstrained regression leads to unstable situations such as unwarranted overweighting or inversion of one or more variables (r_{0}, s_{0}, or q_{0}). Additionally, depending upon the quality of the data set(s) being used and the predisposition (or prior knowledge) of the researcher, different runs of the algorithm stressing different data types may be desired. Finally, there is good reason to expect that certain data types (e.g. sequence motifs) will be less informative early in the procedure when the biclusters are poorlydefined, and only later will it make sense to incorporate them into the bicluster model.
Therefore, we perform a constrained logistic regression by transforming the regression space defined by r_{
ik
}, s_{
ik
}, and q_{
ik
}into one dimension, projecting the logpvalues onto a single vector, g_{
ik
},
{g}_{ik}={r}_{0}\mathrm{log}\phantom{\rule{0.5em}{0ex}}({\tilde{r}}_{ik})+{s}_{0}\mathrm{log}\phantom{\rule{0.5em}{0ex}}({\tilde{s}}_{ik})+{\displaystyle \sum _{N\in N}{q}_{0}^{N}\mathrm{log}\phantom{\rule{0.5em}{0ex}}({\tilde{q}}_{ik}^{N}),\phantom{\rule{0.5em}{0ex}}\left(4\right)}
where r_{0}, s_{0}, and q_{0} are specified for each iteration according to an "annealing schedule," described below. Here, each of the dimensions have been standardized to place the logpvalues for each data set on the same scale, with log(\tilde{\chi}_{
ik
}) = [log(χ_{
ik
})  μ_{
k
}]/σ_{
k
}, where μ_{
k
}and σ_{
k
}are the mean and standard deviation of the logpvalues log(χ_{
ik
}) (χ_{
ik
}denotes either r_{
ik
}, s_{
ik
}, or q_{
ik
}), only over the genes or conditions in the bicluster (i ∈ I_{
k
}). This is necessary because the pvalues for each component of our model were derived for different types of data, each with widely differing sizes. For example, the pvalues are likely to be smaller (on average) for the component with the most data (here, the expression data), or for motifs with larger lengths. The projection described in Eq. 4 constrains the regression by fixing the slope of the discriminating hyperplane (via parameters r_{0}, s_{0}, and q_{0}), with only the intercept β_{0} permitted to vary from cluster to cluster. These parameters may also be interpreted as mixing parameters that control the fractional contribution of each model component to the cluster likelihood, g_{
ik
}. They may be defined by the user, and/or may be modified throughout the course of cluster optimization. For example, early in the procedure when the bicluster is a poorlydefined seed, coexpression and certain association networks (e.g. operon associations, for bacteria and archaea) are extremely informative, whereas a common cisregulatory motif is less likely exist among the genes in the bicluster. Only later (when the bicluster has been optimized on the basis of expression data and operon associations) does it make sense to incorporate sequence motifs into the bicluster model. Therefore we employ a strategy for slowly varying the relative contribution of each of the regression parameters, as the cluster is optimized, as part of an annealing schedule (described further, below). The constrained binomial regression is now given by
π_{
ik
}≡ p(y_{
ik
}= 1X_{
k
}, S_{
i
}, M_{
k
}, N) ∝ exp(β_{0} + β_{1}g_{
ik
}), (5)
where parameters β= [β_{0}, β_{1}] fully determine the conditional probability of membership p(y_{
ik
}= 1) of a gene or condition i in bicluster k.
One additional complication arises near the end of a bicluster optimization, that a bicluster may be perfectly discriminated from the background (resulting in an infinite negative loglikelihood and undefined regression). This may be addressed in two ways: the first is to constrain or fix the slope β_{1} of the regression, allowing only the intercept β_{0} to vary. We chose a second option, to perform a penalized maximum likelihood estimation described by [42] and originally proposed by [35]. This penalized estimate of β provides bias reduction in the case of small sample sizes (small biclusters), and solves the separation problem in the context of perfect discrimination and infinite likelihood. β can be determined with this penalized likelihood measure using an efficient iterative process [35].
We now have a set of probabilities, π_{
ik
}, that each gene or condition i is associated with bicluster k given the bicluster's current state. We would now like to perform moves (i.e. add or remove genes and conditions) that are most likely to improve the likelihood of the bicluster based upon the model. We do this by sampling moves from π_{
ik
}. These probabilities may be further adjusted via additional (prior) constraints on the model, as described below.
The cMonkey iterative procedure
Seeding the clusters
The Markov chain process by which a bicluster is optimized requires "seeding" of the bicluster to start the procedure. We experimented with many datadriven methods for generating seed biclusters, including (a) singlegene seeds, (b) random or semicorrelated seeds using a prespecified distribution of cluster sizes, and (c) seeding on the basis of coexpressed edges in association networks (for example, operon associations). In principle, any seeding method may be used, including the clusters produced by other clustering or biclustering methods. Many different seeding methods are used in order to broaden the parameter space which is searched, and depending upon the annealing schedule used, the algorithm can be made more or lesssensitive to the selected starting points. As biclusters are optimized sequentially, in order to maximize our coverage of the overall (gene) search space, they are seeded only with genes that have not previously been placed into any other biclusters. It should be noted, however, that during subsequent iterations, genes that are already in other biclusters can still be added to new biclusters, with additional constraints that are described later.
Each bicluster is seeded using a random choice from one of a variety of methods, each of which utilizes one or more different types of input data. For each newlyseeded bicluster, I' be the set of genes that are currently not in any other biclusters, i is a randomlychosen gene from I' and J_{
i
}is the set of conditions in which i has the highest amount of variance. The seeding methods available are:

1.
A single random gene: The cluster is seeded with i and J_{
i
}. For the first few iterations of this bicluster's optimization, only gene additions are allowed (forcing the bicluster to grow in size, early on).

3.
n coexpressed genes from another clustering method: Clusters are generated using an other clustering or biclustering method, and these are used as seeds for further optimization.

2.
n semicoexpressed genes: Up to n  1 additional genes from I' are randomly chosen from those with a high Pearson correlation (P_{
c
}> 0.8) with i in conditions J_{
i
}. n is chosen randomly from a set of predefined cluster seeding sizes, currently 2, 5, 10, 20, μ, where μ, is described Methods.

4.
n highlyconnected genes: Up to n  1 random genes from I' are added from those with P_{
c
}> 0 with i, and are first neighbors with i in a given association network, n is chosen as in (2); the association network is chosen randomly from the following (if available for that organism): operons, metabolic assoc., proteinDNA interactions, proteinprotein interactions.

5.
n genes with a common motif: Up to n  1 genes from I' are randomly chosen from those with P_{
c
}> 0 with i, and also have a common dmer with i in their upstream sequences, allowing for up to l residue differences, n is chosen as in (2); defaults of d = 9 and l = 1 were used.
Annealing the clusters
A newlyseeded bicluster k is iteratively improved with respect to the joint likelihood derived above. At each iteration, significant motifs are detected (using MEME), and the joint membership probabilities π_{
ik
}for each gene or condition i are computed. We then perform moves using Simulated Annealing (SA) [51], to preferentially add genes or conditions i to bicluster k if they have a high probability of membership (y_{
ik
}= 0 and π_{
ik
}≈ 1), and to drop genes or conditions from that bicluster if they have a low probability of membership (y_{
ik
}= 1 and π_{
ik
}≈ 0). Moves which may decrease the likelihood of the cluster model are permitted, with a frequency that decreases during the course of the procedure, as parameterized by an annealing temperature T:
\begin{array}{cc}p(add{\pi}_{ik})={e}^{{\pi}_{ik}/T};& p(drop{\pi}_{ik})={e}^{(1{\pi}_{ik})/T}\end{array}.\phantom{\rule{0.5em}{0ex}}\left(6\right)
All moves are performed by sampling them from the probability in Eq. 6. This Simulated Annealing procedure is dampened by restricting the total number of gene/condition moves at each iteration to n_{max} = 5, in order to reduce the chance that a bicluster will change drastically before its model is reevaluated. We find that Simulated Annealing, while not the most efficient search strategy available, improves upon greedy search strategies such as Expectation Maximization, by being able to escape local minima and therefore being able to more completely assign genes and conditions to clusters as appropriate [24]. Other stochastic or greedy search strategies may be applicable to solving this model, for example if speed is deemed to be a more important consideration than completeness of the solution.
Additional model constraints: bicluster size and overlap
The search space for this problem is often dominated by very strong attractors and if we do not restrict the gene/condition move set, biclusters are likely to repeatedly descend into the same set of deep local minima (thereby increasing the bicluster overlap, or redundancy). This is an issue seen in many biclustering algorithms, and a commonlypracticed ad hoc remedy is to postfilter the bicluster set to remove redundant ones. We choose a more straightforward, easilyparameterized solution: to constrain the total number of biclusters z_{
i
}into which each gene i may fall (and in effect to reduce the amount of "gene overlap" of the final bicluster set), z_{
i
}is modeled as a Poisson process with cumulative distribution F_{
v
}(z_{
i
}) (where v is the expected number of biclusters per gene). Then the probability of adding or dropping i to/from bicluster k, p(addπ_{
ik
}) and p(dropπ_{
ik
}) (Eq. 6), is multiplied with this prior probability of observing the gene in that many biclusters (relative to the expected number):
\begin{array}{ccc}{p}^{\prime}(add{\pi}_{ik})& =& {F}_{v}({z}_{i})/{F}_{v}(v){e}^{{\pi}_{ik}/T};\\ {p}^{\prime}(drop{\pi}_{ik})& =& [1{F}_{v}({z}_{i})]/[1{F}_{v}(v)]{e}^{(1{\pi}_{ik})/T}.\end{array}\phantom{\rule{0.5em}{0ex}}\begin{array}{c}\left(7\right)\end{array}
Thus the solution is constrained to what seems to be a more biologically intuitive model: include each gene in an average of v = 2 (the default) clusters. This constraint results in an increased tendency to drop a gene from a bicluster if it is already in more than two biclusters, and a decreased tendency to drop the gene if it is in less than two biclusters.
Bicluster sizes can also vary widely between biclustering methods; some generate biclusters with only three genes on average [6], to single biclusters with nearly 3/4 of all genes in the data [18]. We constrain bicluster k's final size (number of genes, I_{
k
}), using a cumulative Normal distribution N(I_{
k
}, μ, σ) as a prior constraint on I_{
k
}. This conditional distribution is applied by further adjusting the relative ratios of the distributions (Eq. 6) from which the gene moves are sampled:
\frac{{\displaystyle {\sum}_{i\in {I}_{k}^{\text{'}}}{p}^{\prime}(drop{\pi}_{ik})}}{{\displaystyle {\sum}_{i\in {I}_{k}}{p}^{\prime}(add{\pi}_{ik})}}\equiv \frac{N({I}_{k},\mu ,\sigma )}{[1N(\left{I}_{k}\right,\mu ,\sigma )]}.\phantom{\rule{0.5em}{0ex}}\left(8\right)
The result is that if I_{
k
} <μ, the number of genes sampled to add to the bicluster will tend to be greater than the number sampled to drop, and vice versa if I_{
k
} > μ. We parameterize our prior expectation of bicluster sizes using μ = σ = 30, to match previous estimates of regulon sizes for wellstudied organisms (e.g. Alkema, Lenhard, and Wasserman, 2004). This amounts to a soft constraint, which still allows for considerable variation in final bicluster sizes. A similar constraint may be applied to the biclusters' experiment sizes, which enables the generation of biclusters with a larger number of experiments (on average) than are typically included in biclusters derived by other methods (e.g. [19, 86]).
The annealing schedule
To enforce convergence we schedule the annealing temperature T to slowly decrease during the procedure, as in standard simulated annealing procedures. We find that varying T linearly from 0.15 to 0.05 over 100–150 optimization steps is generally effective for all organisms for which biclustering was performed. As was described previously, there are reasons to vary (in addition to T) the three model mixture constraint parameters, r_{0}, s_{0}, and q_{0} with each iteration. We have found that the most effective schedule upweights the expression (r_{0}) and certain association networks (q_{0}; e.g. operons and metabolic networks) early in a run to build up coexpressed biclusters, and then slowly increases the influence of the sequence motifs (s_{0}) as the biclusters become betterdefined (Fig. 9). For similar reasons, additional parameters, such as the number of motifs searched for, can also vary (i.e. increase or decrease monotonically) with iteration. Details on the default cMonkey parameters used for this work are listed in [Additional File 1, Table Three].
Implementation
cMonkey is implemented in the R statistical programming language [5], a highlyflexible crossplatform language widely used in the statistical community. It has been parallelized, using PVM [84] as implemented in the S implified N etwork O f W orkstations (snow) R library, and runs efficiently on a multinode Linux cluster; it can be run on a singleprocessor desktop computer as well. On a typical single2 GHz processor, the algorithm can generate ~100 biclusters in between 12 and 48 hours, depending on the organism, data size, and motif detection parameters chosen. All parameters relevant to the biclustering procedure that have not been described in the main text are listed in [Additional File 1, Table Three].
Comparison with other biclustering and clustering methods
The different bi/clustering algorithms used for the comparative analysis included: ChengChurch [25], Order Preserving Submatrix (OPSM [18]), Iterative Signature (ISA [19]), xMOTIF [55], and BIMAX [6] (all of these algorithms were run using the BICAT implementation [17]), SAMBA [86] (as implemented by the authors as part of EXPANDER [77]), and both hierarchical clustering and kmeans clustering [30] with cluster number (k) ranging from 10 to 300 (implemented in R). None of these methods utilize data integration, and all were run on the same data sets as cMonkey. All biclustering procedures were run using their default parameters and data normalization/discretization schemes (while the effects of varying the parameters for each of these methods would be a worthwhile study, it is beyond the scope of this work). The analysis was performed on the Gasch [36] subset of the S. cerevisiae data containing measurements of 2993 genes over 173 stressrelated conditions. S. cerevisiae was chosen for these comparisons because of the highquality data and varied external "references" available for this organism, against which clusters could be compared. In all cases where pvalues for judging annotation overrepresentation are listed, they were computed following a procedure similar to [21]; namely, cumulative hypergeometric pvalues were corrected for multiple hypothesis testing in an experimentwise manner for each cluster, by computing the fraction of uncorrected pvalues derived for 1000 randomized instances of the cluster (the null model) that were less than or equal to the best pvalue obtained for that cluster. To assess the effect of various biases caused by inclusion of different parts of the cMonkey model, we performed these same analyses on cMonkey runs with various model parameters up and downweighted, as described in the Results section. In all cases where we compared the motifdetection results of specific biclusters (in the Results section), we used MEME and MAST [10] (with the same parameters as for cMonkey) to search for motifs de novo in the upstream sequences of the clusters' genes.
Each different biclustering algorithm returned bicluster sets with wide differences in cluster count, cluster size (genes and experiments), amount of overlap/redundancy, expression coherence, and other general characteristics only related to their treatment of the expression data. We therefore computed "bulk" measurements for each bicluster set, such as those listed in [Additional File 1, Table Two]. One of these, f, is defined as the total fraction of elements in the expression data matrix X which fall in at least one bicluster. A measurement that quantifies the degree to which each complete bicluster set recapitulates the variance in X is defined as follows:
RMSD\equiv \sqrt{{\displaystyle \sum _{k}{\displaystyle \sum _{j\in {J}_{k}}{\displaystyle \sum _{i\in {I}_{k}}\frac{{({x}_{ij}{\overline{x}}_{jk})}^{2}}{{n}_{ij}{\sigma}^{2}}}}}}\phantom{\rule{0.5em}{0ex}}\left(9\right)
where, as above, {\overline{x}}_{jk}={\displaystyle {\sum}_{i\in {I}_{k}}{x}_{ij}}/\left{I}_{k}\right is the mean expression profile of bicluster k, and n_{
ij
}= ∑_{
k
}i ∈ I_{
k
}∧ j ∈ J_{
k
}is the number of biclusters containing element x_{
ij
}. This measure is dependent upon the fractional coverage f of the expression data matrix by the bicluster set (better coverage will generally lead to better RMSD) as well as the average bicluster residual (better residual leads to better RMSD), but is nearly independent of bicluster redundancy. It therefore is a good measure of the tradeoff that each bi/clustering method chooses between data coverage and bicluster coexpression. Because the expression data set has been variancenormalized (see Methods), RMSD ranges between 0–1, where a smaller RMSD implies that the mean expression profiles of the biclusters more accurately "generate" the original data matrix X.
In an attempt to remove some overlap and size bias related to these quality measurements (see Discussion), we also performed tests on a "filtered" set of biclusters, in which we greedily identified the largest 100 clusters with a volumeoverlap (genes × conditions) of < 0.5 [6], excluding any with > 200 genes. For methods such as cMonkey, this filtering step removes a large number of nonredundant (but smaller) clusters, while for other methods (e.g. OPSM), it removes a large fraction of derived clusters and for others (such as SAMBA) it has little effect. Finally, in a further attempt to disentangle the cluster size bias inherent in these comparisons, we performed the same analyses on a set of evenly divided cluster sets ("big" and "small" halves; results shown in [Additional File 1]).