Materials and data
Expression data
Expression data for Halobacterium were collected by members of the Baliga lab, containing genome-wide measurements of mRNA expression in 292 conditions, as described in full in [22] and references therein. Expression data for H. pylori and S. cerevisiae were collected from the Stanford Microarray Database [3]. Certain experiments, such as strain comparisons, genomic DNA, and RNA decay experiments, which are unlikely to relate to gene regulation, were removed from the sets prior to analysis. This filtering resulted in 58 of an original 250 conditions for H. pylori and 667 of 1051 conditions for S. cerevisiae. Data for E. coli were compiled from publicly-available data provided to us by E. Alm, including 86 conditions. As a pre-processing step, genes that did not show significant (1.5-fold) expression in any of the experiments were removed from the expression data. The data were then row-normalized (each gene's expression levels normalized to mean = 0, SD = 1). No further pre-processing or filtering of the microarray data was performed.
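As an illustration of this pre-processing, the following sketch filters and row-normalizes an expression matrix. It assumes the matrix holds log2 expression ratios (not stated explicitly above), so "1.5-fold expression" is interpreted as an absolute log2 ratio of at least log2(1.5); the function name and the guard against constant rows are ours, not part of cMonkey.

```python
import numpy as np

def preprocess_expression(X, min_fold=1.5):
    """Drop genes without a significant (min_fold) change in any experiment, then
    row-normalize each remaining gene to mean = 0, SD = 1.
    Assumes X is a genes x conditions matrix of log2 expression ratios."""
    keep = (np.abs(X) >= np.log2(min_fold)).any(axis=1)
    Xf = X[keep]
    sd = Xf.std(axis=1, ddof=1, keepdims=True)
    sd[sd == 0] = 1.0                                   # guard against constant rows
    return (Xf - Xf.mean(axis=1, keepdims=True)) / sd
```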
Association and metabolic networks
Genetic associations derived from comparative genomics, such as phylogenetic profile, Rosetta Stone, gene neighbor, and gene cluster methods, were compiled from Prolinks [23] and Predictome [62] for all organisms. These networks include predicted operon "associations," which were also used to identify "unique" regulatory sequences for use in the motif detection. Metabolic network reconstructions from the Kyoto Encyclopedia of Genes and Genomes [48] were represented as associations between two genes if they participate in a reaction sharing one or more ligands, after removing the most highly-connected ligands, such as water and ATP [16].
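The conversion of the KEGG reconstruction into pairwise gene associations can be sketched as follows. The input mappings (reaction_ligands, gene_reactions) and the degree cutoff used to drop highly-connected "currency" ligands are illustrative assumptions; the text above only states that such ligands (e.g. water, ATP) were removed.

```python
from collections import defaultdict
from itertools import combinations

def metabolic_associations(reaction_ligands, gene_reactions, max_ligand_degree=50):
    """Associate two genes if their reactions share a ligand, after dropping the
    most highly-connected ligands (assumed cutoff: > max_ligand_degree reactions)."""
    # count how many reactions each ligand participates in
    ligand_count = defaultdict(int)
    for ligands in reaction_ligands.values():
        for lig in ligands:
            ligand_count[lig] += 1
    common = {l for l, c in ligand_count.items() if c > max_ligand_degree}

    # map each retained ligand to the genes whose reactions use it
    ligand_genes = defaultdict(set)
    for gene, rxns in gene_reactions.items():
        for rxn in rxns:
            for lig in reaction_ligands.get(rxn, ()):
                if lig not in common:
                    ligand_genes[lig].add(gene)

    # associate every pair of genes sharing at least one retained ligand
    edges = set()
    for genes in ligand_genes.values():
        edges.update(frozenset(p) for p in combinations(sorted(genes), 2))
    return edges
```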
Interaction networks
H. pylori protein-protein interactions were collected from the global experiments of [70]. S. cerevisiae protein-protein, protein-DNA, and genetic interactions were collected from DIP [73] and BIND [9]. The protein-DNA interactions were converted into a network of associations between all pairs of genes whose upstream sequences were found to bind to the same regulator(s).
Upstream sequences
Upstream sequences for all organisms were obtained from GenBank using the Regulatory Sequence Analysis Tools (RSAT [92]). Using these tools, we extracted 1000-bp cis-regulatory sequences. For bacteria and archaea, these sequences were shortened to 500 bp and then "operon-shifted" using the gene cluster (operon association) networks from Prolinks [23] and Predictome [62]. Upstream sequences for genes in the same operon were converted to "operon-shifted" sequences by using the (same) upstream sequence of the first gene in the operon. Similar "operon-shifted" upstream sequences were identified using BLASTN [8] with a 50 bp non-gapped alignment window, to avoid using multiple copies of the same sequence in the motif detection.
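A minimal sketch of the operon-shifting step is shown below. It assumes an operon map giving each operon's genes in genomic order; the data layout and function name are ours.

```python
def operon_shift(upstream, operons):
    """Give every gene in an operon the upstream sequence of that operon's first
    gene, so each operon contributes a single regulatory region to motif detection.

    upstream : dict mapping gene -> upstream sequence (e.g. 500 bp, from RSAT)
    operons  : dict mapping operon id -> list of member genes in genomic order
    """
    shifted = dict(upstream)
    for genes in operons.values():
        first = genes[0]
        if first in upstream:
            for g in genes:
                shifted[g] = upstream[first]
    return shifted
```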
Functional annotations for comparison tests
Gene ontology (GO) [40] annotations for each organism were obtained from the European Bioinformatics Institute [1] and matched to annotation names obtained from the GO web site. KEGG annotations were downloaded from their web site [2]. Predicted and experimentally-derived DNA binding motifs were obtained for S. cerevisiae from [39, 60], and for E. coli from [71, 72]. When these binding motifs were provided as position weight matrices (PWMs), they were converted into regular expressions, in order to enable rapid scanning of upstream sequences.
The bicluster model
Model overview
Each bicluster is modeled via a Markov chain process, in which the bicluster is iteratively optimized, and its state is updated based upon conditional probability distributions computed using the cluster's previous state. This enables us to define probabilities that each gene or condition belongs in the bicluster, conditioned upon the current state of the bicluster, as opposed to requiring us to build a complete (joint) model for the bicluster, a priori. The components of this conditional probability are modeled independently (one for each of the different types of information which we are integrating) as p-values based upon individual data likelihoods, which are then combined into a regression model to derive the full conditional probability. In this work, three major distinct data types are used (gene expression, upstream sequences, and association networks), and accordingly p-values for three such model components are computed: the expression component, the sequence component, and the network component.
Each bicluster begins as a seed, or starting cluster, that is iteratively optimized by adding/removing genes and conditions to/from the cluster by sampling from the conditional probability distribution using a Monte Carlo procedure, to prevent premature convergence. Such an iterative machine learning technique is akin to a Markov chain Monte Carlo (MCMC) process. Additional clusters are seeded and optimized until a given number (kmax) of clusters have been generated, or significant optimization is no longer possible. The complete process is shown schematically in Fig. 8, and described in detail below.
In the following discussion, let i be an arbitrary gene and j an arbitrary experimental condition. A bicluster k ∈ K is fully defined by its set of genes I_k and experimental conditions J_k. The membership y_lk ∈ {0, 1} of an arbitrary gene or condition l in bicluster k is an independent Bernoulli indicator variable with conditional probability p(y_lk = 1).
The expression component
The expression data is a set of measurements of genes i ∈ I over experiments j ∈ J, comprising a |I| × |J| matrix x_ij ∈ X. Each bicluster k defines a |I_k| × |J_k| submatrix x_i'j' ∈ X_k: i' ∈ I_k ⊂ I; j' ∈ J_k ⊂ J. The variance in the measured levels of condition j is σ_j² = ∑_{i ∈ I} (x_ij − x̄_j)² / (|I| − 1), where x̄_j = ∑_{i ∈ I} x_ij / |I| is the mean over all genes. We compute the mean expression level of condition j over the cluster's genes I_k
, x̄_jk = ∑_{i' ∈ I_k} x_i'j / |I_k|. Then, the likelihood of an arbitrary measurement x_ij relative to this mean expression level is

p(x_ij) = [√(2π) (σ_j + ε)]⁻¹ exp[−(x_ij − x̄_jk)² / (2(σ_j + ε)²)],     (1)
which includes the term ε for an unknown systematic error in condition j, here assumed to be the same for all j. Note that the use of σ_j over all genes I, rather than a σ_jk computed over I_k, results in a lower likelihood p(x_ij) for those conditions j that have a small overall variance, and are therefore more likely to be correlated by random chance. Also, such low-variance conditions could be the result of poor labeling, or other systematic problems.
The likelihood of the measurements of an arbitrary gene i among the conditions in bicluster k is ∏_{j' ∈ J_k} p(x_ij'), and similarly the likelihood of a condition j's measurements is ∏_{i' ∈ I_k} p(x_i'j). We integrate the two tails of the Normal distribution in Eq. 1 to derive co-expression p-values for each gene i, r_ik, and for each condition j, r_jk, relative to bicluster k.
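A sketch of how such co-expression p-values might be computed for every gene is given below. The default ε and the use of Fisher's method to combine a gene's per-measurement two-tailed p-values are our assumptions; the text above only states that the tails of the Normal in Eq. 1 are integrated.

```python
import numpy as np
from scipy import stats

def expression_pvalues(X, gene_idx, cond_idx, eps=0.1):
    """Co-expression p-values (one per gene) relative to a single bicluster.

    X        : genes x conditions matrix (row-normalized, as described above)
    gene_idx : indices of genes currently in the bicluster (I_k)
    cond_idx : indices of conditions currently in the bicluster (J_k)
    eps      : assumed value for the systematic error term added to sigma_j
    """
    sigma_j = X[:, cond_idx].std(axis=0, ddof=1)           # SD of each condition over ALL genes
    xbar_jk = X[np.ix_(gene_idx, cond_idx)].mean(axis=0)   # cluster mean of each condition

    # two-tailed Normal p-value of each measurement relative to the cluster mean (Eq. 1)
    z = np.abs(X[:, cond_idx] - xbar_jk) / (sigma_j + eps)
    p_xij = 2.0 * stats.norm.sf(z)

    # combine each gene's per-condition p-values; Fisher's method is a stand-in here
    chi2 = -2.0 * np.log(np.clip(p_xij, 1e-300, 1.0)).sum(axis=1)
    return stats.chi2.sf(chi2, df=2 * len(cond_idx))       # r_ik for every gene i
```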
Sequence component (motif co-occurrence)
Each gene i has an upstream cis-regulatory sequence S_i (a string of DNA nucleotides of length l_S), and bicluster k defines a set of sequences S_k for all S_i'; i' ∈ I_k. The decision whether an arbitrary gene's upstream sequence, S_i, shares common motif(s) with the sequences S_k is determined via a two-step process: (1) identify one or more motif(s) M_k that is (are) significantly overrepresented in many (if not all) bicluster sequences S_k, and then (2) scan S_i to see if it also contains M_k.
In this work, we are not advancing the basic methodology for motif detection (step 1), as relatively mature methods exist for finding motifs given a fixed set of sequences [91]. Instead, we are describing an overall strategy that incorporates previously existing motif finding algorithm(s) into a clustering procedure. As such, the procedure is motif-detection-algorithm agnostic, and the search may be performed using one of many existing methods [91]. Our only requirements are that (a) significantly overrepresented motifs do not have to exist in all sequences S_k, and (b) the method can produce a score (preferentially a p-value) that an arbitrary sequence contains the detected motif(s). The MEME algorithm [10], which identifies significant sequence motifs using expectation maximization of one or more probabilistic motif models given a fixed set of sequences and a background residue model, is used to perform step (1), as it meets the first criterion (a). MEME's companion algorithm MAST [10], which computes the p-value that an arbitrary sequence matches the set of motifs detected with MEME, is used to perform step (2), as it meets the second criterion (b). During the motif detection step, for any genes in bicluster k which are in an operon, we make sure to use only one copy of the upstream sequence for that operon (i.e. upstream of the first gene in that operon), as described above ("Upstream sequences"). Additional details on the specific parameters passed to these procedures are provided in [Additional File 1, Table Three].
Thus, using these two algorithms, we can detect a set of motifs M_k in sequences S_k, and compute a p-value that a sequence S_i contains those motifs. Note that this p-value is computed for each upstream sequence in the genome, including those for the genes within cluster k, to derive the motif p-values, s_ik, for each gene i relative to bicluster k, at each iteration of the MCMC procedure.
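The two-step flow can be summarized by the following sketch, in which detect_motifs and scan_sequence are hypothetical wrappers standing in for the MEME and MAST calls (the actual parameters are listed in Additional File 1, Table Three), and operon_first_gene is the operon mapping described under "Upstream sequences".

```python
from typing import Callable, Dict, List

def motif_pvalues(bicluster_genes: List[str],
                  upstream: Dict[str, str],
                  operon_first_gene: Dict[str, str],
                  detect_motifs: Callable,   # hypothetical MEME wrapper: sequences -> motif model(s)
                  scan_sequence: Callable):  # hypothetical MAST wrapper: (motifs, sequence) -> p-value
    """Step (1): find motifs over-represented in the bicluster's upstream sequences.
    Step (2): score every upstream sequence in the genome against those motifs."""
    # use only one copy of each operon's upstream sequence within the bicluster
    unique = {operon_first_gene.get(g, g) for g in bicluster_genes}
    cluster_seqs = [upstream[g] for g in unique if g in upstream]

    motifs = detect_motifs(cluster_seqs)                   # step (1), e.g. MEME
    return {g: scan_sequence(motifs, seq)                  # step (2), e.g. MAST -> s_ik
            for g, seq in upstream.items()}
```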
Association network component
To build up a highly-connected subnetwork among genes that are in a bicluster (given a full set of associations), we aim to preferentially add genes that have a greater number of connections to those currently in the bicluster than one would expect (at random) based upon the overall connectivity in the network. Thus, we compute p-values for observing the associations between a gene or experimental condition and the genes or conditions currently in bicluster k, given an association network N (one of the set of available networks). In the following discussion, genes are the primary consideration, but networks of associations between experimental conditions are conceivable (e.g., we might wish to preferentially group conditions that are part of the same time series). The network association p-value, q^N_ik, is computed based upon the number of edges in network N connecting gene i to the genes I_k in bicluster k, relative to the total number of edges connecting i to the genes Ī_k that are not in cluster k, as well as the connections within and between the gene sets I_k and Ī_k. The hypergeometric distribution is used to compute the probability of observing such an arrangement of connections by chance:
where A → B represents the set of associations between the elements in gene set A and those in set B, and n_{A→B} is the number of these associations. Expression (2) is analogous to the hypergeometric measure of mutual clustering coefficient described by [37]. However, it does not account for the global structure of the network; it is only concerned with the local associations, i.e. those directly connected to gene i and the bicluster's genes, I_k. This choice of connectivity measure allows a single value to be computed directly for each gene, relative to each cluster, and gives greater preference to an individual gene i being added to cluster k if a large fraction of i's associations are with the other genes in the cluster (and vice versa), independent of the global distribution of associations in the network. Individual p-values, q^N_ik, for each gene i and each network N are computed for bicluster k by integrating the lower tail of the distribution in Eq. 2.
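A hedged sketch of this local connectivity p-value is given below. It treats gene i's association partners as draws from the gene population and asks how surprising it is that so many fall inside I_k; this simplifies the full edge-count formulation described above (Eq. 2 is not reproduced here), so it should be read as one plausible reading rather than the exact cMonkey expression.

```python
from scipy.stats import hypergeom

def network_pvalue(edges, gene, cluster_genes, all_genes):
    """One reading of the local connectivity p-value q_ik for a single network.

    edges         : set of frozenset({g1, g2}) associations
    gene          : the candidate gene i
    cluster_genes : genes currently in bicluster k (I_k)
    all_genes     : every gene covered by the network
    """
    in_k = set(cluster_genes) - {gene}
    out_k = set(all_genes) - set(cluster_genes) - {gene}

    # edges from gene i into / out of the bicluster
    k_i = sum(1 for g in in_k if frozenset((gene, g)) in edges)
    n_i = k_i + sum(1 for g in out_k if frozenset((gene, g)) in edges)

    # urn composition: how many possible partners lie inside I_k
    K, Npop = len(in_k), len(in_k) + len(out_k)

    # P(observing >= k_i of gene i's n_i associations inside I_k by chance)
    return hypergeom.sf(k_i - 1, Npop, K, n_i)
```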
The joint cluster membership probability
The ultimate goal is to decide gene or condition bicluster membership jointly on the basis of the three individual sets of p-values r_ik, s_ik, and q_ik computed above (for the remainder of this discussion, we now use i to denote a gene or experimental condition). A common procedure for combining "scores" such as these into a single joint likelihood is to perform a multi-parametric logistic regression [41] that treats each p-value measure as a random variable and estimates the joint membership likelihood p(y_ik = 1) using the logistic function,

p(y_ik = 1) = 1 / {1 + exp[−(β0 + r0 log(r_ik) + s0 log(s_ik) + q0 log(q_ik))]},     (3)
This model approximates a (probabilistic) discriminating hyperplane in the space defined by r_ik, s_ik, and q_ik, parameterized by the four independent variables β0 (the intercept) and r0, s0, and q0 (the slopes), that maximally discriminates the genes or conditions within the bicluster (I_k) from those outside (Ī_k). Conceptually, the model implies that a gene or condition that poorly matches the bicluster based on one data type can still be added to the bicluster if it matches the other data types well, analogous to, for example, the explicit "softening" of cluster boundaries performed by [15]. Note that when element i is an experimental condition, s0 (the motif parameter) is zero.
In practice, during early iterations when the bicluster is not well-discriminated from the background, such an unconstrained regression leads to unstable situations such as unwarranted over-weighting or inversion of one or more variables (r0, s0, or q0). Additionally, depending upon the quality of the data set(s) being used and the predisposition (or prior knowledge) of the researcher, different runs of the algorithm stressing different data types may be desired. Finally, there is good reason to expect that certain data types (e.g. sequence motifs) will be less informative early in the procedure when the biclusters are poorly-defined, and only later will it make sense to incorporate them into the bicluster model.
Therefore, we perform a constrained logistic regression by transforming the regression space defined by r_ik, s_ik, and q_ik into one dimension, projecting the log-p-values onto a single vector, g_ik,

g_ik = r0 log(r̃_ik) + s0 log(s̃_ik) + q0 log(q̃_ik),     (4)

where r0, s0, and q0 are specified for each iteration according to an "annealing schedule," described below. Here, each of the dimensions has been standardized to place the log-p-values for each data set on the same scale, with log(χ̃_ik) = [log(χ_ik) − μ_k]/σ_k, where μ_k and σ_k are the mean and standard deviation of the log-p-values log(χ_ik) (χ_ik denotes either r_ik, s_ik, or q_ik), computed only over the genes or conditions in the bicluster (i ∈ I_k). This is necessary because the p-values for each component of our model were derived for different types of data, each with widely differing sizes. For example, the p-values are likely to be smaller (on average) for the component with the most data (here, the expression data), or for longer motifs. The projection described in Eq. 4 constrains the regression by fixing the slope of the discriminating hyperplane (via the parameters r0, s0, and q0), with only the intercept β0 permitted to vary from cluster to cluster. These parameters may also be interpreted as mixing parameters that control the fractional contribution of each model component to the cluster likelihood, g_ik. They may be defined by the user, and/or may be modified throughout the course of cluster optimization. For example, early in the procedure when the bicluster is a poorly-defined seed, co-expression and certain association networks (e.g. operon associations, for bacteria and archaea) are extremely informative, whereas a common cis-regulatory motif is less likely to exist among the genes in the bicluster. Only later (when the bicluster has been optimized on the basis of expression data and operon associations) does it make sense to incorporate sequence motifs into the bicluster model. Therefore, we employ a strategy of slowly varying the relative contribution of each of the regression parameters as the cluster is optimized, as part of an annealing schedule (described further below). The constrained binomial regression is now given by
π_ik ≡ p(y_ik = 1 | X_k, S_i, M_k, N) ∝ exp(β0 + β1 g_ik),     (5)
where the parameters β = [β0, β1] fully determine the conditional probability of membership p(y_ik = 1) of a gene or condition i in bicluster k.
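The projection and membership calculation can be sketched as follows. Estimation of β0 and β1 (e.g. by the Firth-penalized likelihood discussed in the next paragraph) is omitted and the values are passed in directly; the logistic link and the zero-variance guard are our assumptions.

```python
import numpy as np

def membership_probabilities(log_r, log_s, log_q, in_cluster, weights, beta0=0.0, beta1=1.0):
    """Project the three log-p-value components onto g_ik and map to pi_ik.

    log_r, log_s, log_q : arrays of log p-values for every gene/condition
    in_cluster          : boolean mask of current bicluster members (I_k)
    weights             : (r0, s0, q0) mixing parameters from the annealing schedule
    beta0, beta1        : regression parameters (their estimation is not shown here)
    """
    def standardize(logp):
        # standardize using the mean/SD over current cluster members only
        mu, sd = logp[in_cluster].mean(), logp[in_cluster].std(ddof=1)
        return (logp - mu) / (sd if sd > 0 else 1.0)

    r0, s0, q0 = weights
    g_ik = r0 * standardize(log_r) + s0 * standardize(log_s) + q0 * standardize(log_q)

    # logistic link from the 1-D projection to a membership probability
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * g_ik)))
```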
One additional complication arises near the end of a bicluster optimization: a bicluster may be perfectly discriminated from the background (resulting in an infinite negative log-likelihood and an undefined regression). This may be addressed in two ways: the first is to constrain or fix the slope β1 of the regression, allowing only the intercept β0 to vary. We chose a second option, to perform the penalized maximum likelihood estimation described by [42] and originally proposed by [35]. This penalized estimate of β provides bias reduction in the case of small sample sizes (small biclusters), and solves the separation problem in the context of perfect discrimination and infinite likelihood. β can be determined with this penalized likelihood measure using an efficient iterative process [35].
We now have a set of probabilities, π_ik, that each gene or condition i is associated with bicluster k given the bicluster's current state. We would now like to perform moves (i.e. add or remove genes and conditions) that are most likely to improve the likelihood of the bicluster based upon the model. We do this by sampling moves from π_ik. These probabilities may be further adjusted via additional (prior) constraints on the model, as described below.
The cMonkey iterative procedure
Seeding the clusters
The Markov chain process by which a bicluster is optimized requires "seeding" of the bicluster to start the procedure. We experimented with many data-driven methods for generating seed biclusters, including (a) single-gene seeds, (b) random or semi-correlated seeds using a pre-specified distribution of cluster sizes, and (c) seeding on the basis of co-expressed edges in association networks (for example, operon associations). In principle, any seeding method may be used, including the clusters produced by other clustering or biclustering methods. Many different seeding methods are used in order to broaden the parameter space which is searched, and depending upon the annealing schedule used, the algorithm can be made more- or less-sensitive to the selected starting points. As biclusters are optimized sequentially, in order to maximize our coverage of the overall (gene) search space, they are seeded only with genes that have not previously been placed into any other biclusters. It should be noted, however, that during subsequent iterations, genes that are already in other biclusters can still be added to new biclusters, with additional constraints that are described later.
Each bicluster is seeded using a random choice from one of a variety of methods, each of which utilizes one or more different types of input data. For each newly-seeded bicluster, let I' be the set of genes that are currently not in any other biclusters, i a randomly-chosen gene from I', and J_i the set of conditions in which i has the highest amount of variance. The seeding methods available are:
1. A single random gene: The cluster is seeded with i and J_i. For the first few iterations of this bicluster's optimization, only gene additions are allowed (forcing the bicluster to grow in size early on).

2. n semi-co-expressed genes: Up to n − 1 additional genes from I' are randomly chosen from those with a high Pearson correlation (P_c > 0.8) with i in conditions J_i. n is chosen randomly from a set of pre-defined cluster seeding sizes, currently 2, 5, 10, 20, and μ, where μ is described below (a sketch of this seeding method follows the list).

3. n co-expressed genes from another clustering method: Clusters are generated using another clustering or biclustering method, and these are used as seeds for further optimization.

4. n highly-connected genes: Up to n − 1 random genes from I' are added from those with P_c > 0 with i that are first neighbors of i in a given association network. n is chosen as in (2); the association network is chosen randomly from the following (if available for that organism): operons, metabolic associations, protein-DNA interactions, protein-protein interactions.

5. n genes with a common motif: Up to n − 1 genes from I' are randomly chosen from those with P_c > 0 with i that also have a common d-mer with i in their upstream sequences, allowing for up to l residue differences. n is chosen as in (2); defaults of d = 9 and l = 1 were used.
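The following sketch illustrates seeding method (2). The number of high-variance conditions retained (n_conds) and the use of the largest absolute row-normalized values as a proxy for a single gene's per-condition variance are assumptions made for illustration.

```python
import numpy as np

def seed_semi_coexpressed(X, available, n, n_conds=10, pc_min=0.8, rng=None):
    """Seeding method (2): one random unused gene plus up to n-1 correlated genes.

    X         : genes x conditions expression matrix (row-normalized)
    available : indices of genes not yet placed in any bicluster (I')
    n         : target seed size (e.g. drawn from {2, 5, 10, 20, mu})
    """
    rng = rng or np.random.default_rng()
    i = rng.choice(available)

    # J_i: conditions where gene i varies most, approximated by largest |values|
    J_i = np.argsort(-np.abs(X[i]))[:n_conds]

    # Pearson correlation of every available gene with gene i over J_i
    pc = np.array([np.corrcoef(X[i, J_i], X[g, J_i])[0, 1] for g in available])
    candidates = [g for g, c in zip(available, pc) if g != i and c > pc_min]

    extra = (rng.choice(candidates, size=min(n - 1, len(candidates)), replace=False)
             if candidates else [])
    return sorted({int(i), *map(int, extra)}), list(J_i)
```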
Annealing the clusters
A newly-seeded bicluster k is iteratively improved with respect to the joint likelihood derived above. At each iteration, significant motifs are detected (using MEME), and the joint membership probabilities π_ik for each gene or condition i are computed. We then perform moves using Simulated Annealing (SA) [51], to preferentially add genes or conditions i to bicluster k if they have a high probability of membership (y_ik = 0 and π_ik ≈ 1), and to drop genes or conditions from that bicluster if they have a low probability of membership (y_ik = 1 and π_ik ≈ 0). Moves which may decrease the likelihood of the cluster model are permitted, with a frequency that decreases during the course of the procedure, as parameterized by an annealing temperature T:
All moves are performed by sampling them from the probability in Eq. 6. This Simulated Annealing procedure is dampened by restricting the total number of gene/condition moves at each iteration to nmax = 5, in order to reduce the chance that a bicluster will change drastically before its model is reevaluated. We find that Simulated Annealing, while not the most efficient search strategy available, improves upon greedy search strategies such as Expectation Maximization by being able to escape local minima, and can therefore more completely assign genes and conditions to clusters as appropriate [24]. Other stochastic or greedy search strategies may be applicable to solving this model, for example if speed is deemed to be a more important consideration than completeness of the solution.
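Since Eq. 6 is not reproduced above, the following sketch shows only one plausible way an annealing temperature T could bias which of the nmax moves are sampled; the power-of-1/T weighting is an assumption, not the cMonkey formula.

```python
import numpy as np

def sample_moves(pi_ik, member_mask, T, n_max=5, rng=None):
    """Sample up to n_max add/drop moves; lower T makes the sampling greedier,
    so unfavourable moves are accepted less often late in the run."""
    rng = rng or np.random.default_rng()

    # favour adding non-members with high pi, dropping members with low pi
    add_score = np.where(~member_mask, pi_ik, 0.0)
    drop_score = np.where(member_mask, 1.0 - pi_ik, 0.0)
    weights = (add_score + drop_score) ** (1.0 / max(T, 1e-6))

    if weights.sum() == 0:
        return []
    probs = weights / weights.sum()
    chosen = rng.choice(len(pi_ik), size=min(n_max, np.count_nonzero(probs)),
                        replace=False, p=probs)
    return [("drop" if member_mask[idx] else "add", int(idx)) for idx in chosen]
```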
Additional model constraints: bicluster size and overlap
The search space for this problem is often dominated by very strong attractors, and if we do not restrict the gene/condition move set, biclusters are likely to repeatedly descend into the same set of deep local minima (thereby increasing the bicluster overlap, or redundancy). This is an issue seen in many biclustering algorithms, and a commonly-practiced ad hoc remedy is to post-filter the bicluster set to remove redundant ones. We choose a more straightforward, easily-parameterized solution: to constrain the total number of biclusters z_i into which each gene i may fall (and in effect to reduce the amount of "gene overlap" of the final bicluster set). z_i is modeled as a Poisson process with cumulative distribution F_v(z_i) (where v is the expected number of biclusters per gene). Then the probability of adding or dropping i to/from bicluster k, p(add|π_ik) and p(drop|π_ik) (Eq. 6), is multiplied with this prior probability of observing the gene in that many biclusters (relative to the expected number):
Thus the solution is constrained to what seems to be a more biologically intuitive model: include each gene in an average of v = 2 (the default) clusters. This constraint results in an increased tendency to drop a gene from a bicluster if it is already in more than two biclusters, and a decreased tendency to drop the gene if it is in less than two biclusters.
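A sketch of how such a Poisson prior might rescale the add/drop probabilities is shown below. Since Eq. 7 is not reproduced above, the particular tail probabilities used here are an assumption chosen only to reproduce the qualitative behaviour just described (v defaults to 2 biclusters per gene).

```python
from scipy.stats import poisson

def overlap_adjusted(p_add, p_drop, z_i, v=2.0):
    """Adjust a gene's add/drop probabilities by a Poisson prior on the number
    of biclusters z_i it already belongs to (expected v per gene)."""
    # genes already in many biclusters become harder to add and easier to drop
    prior_add = poisson.sf(z_i - 1, v)    # P(Z >= z_i): shrinks as z_i grows
    prior_drop = poisson.cdf(z_i, v)      # P(Z <= z_i): grows as z_i grows
    return p_add * prior_add, p_drop * prior_drop
```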
Bicluster sizes can also vary widely between biclustering methods, ranging from biclusters with only three genes on average [6] to single biclusters containing nearly 3/4 of all genes in the data [18]. We constrain bicluster k's final size (number of genes, |I_k|) using a cumulative Normal distribution N(|I_k|, μ, σ) as a prior constraint on |I_k|. This conditional distribution is applied by further adjusting the relative ratios of the distributions (Eq. 6) from which the gene moves are sampled:

The result is that if |I_k| < μ, the number of genes sampled to add to the bicluster will tend to be greater than the number sampled to drop, and vice versa if |I_k| > μ. We parameterize our prior expectation of bicluster sizes using μ = σ = 30, to match previous estimates of regulon sizes for well-studied organisms (e.g. Alkema, Lenhard, and Wasserman, 2004). This amounts to a soft constraint, which still allows for considerable variation in final bicluster sizes. A similar constraint may be applied to the biclusters' experiment sizes, which enables the generation of biclusters with a larger number of experiments (on average) than are typically included in biclusters derived by other methods (e.g. [19, 86]).
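Analogously, the soft size prior can be sketched as follows. The exact adjustment (Eq. 8) is not reproduced above, so the Normal tail weighting is an assumption that merely reproduces the described tendency to grow small biclusters and shrink large ones.

```python
from scipy.stats import norm

def size_adjusted(p_add, p_drop, cluster_size, mu=30.0, sigma=30.0):
    """Soft prior on bicluster size |I_k| ~ Normal(mu, sigma): below-average
    clusters favour additions, above-average clusters favour drops."""
    grow = norm.sf(cluster_size, loc=mu, scale=sigma)     # high when the cluster is small
    shrink = norm.cdf(cluster_size, loc=mu, scale=sigma)  # high when the cluster is large
    return p_add * grow, p_drop * shrink
```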
The annealing schedule
To enforce convergence we schedule the annealing temperature T to slowly decrease during the procedure, as in standard simulated annealing procedures. We find that varying T linearly from 0.15 to 0.05 over 100–150 optimization steps is generally effective for all organisms for which biclustering was performed. As was described previously, there are reasons to vary (in addition to T) the three model mixture constraint parameters, r0, s0, and q0 with each iteration. We have found that the most effective schedule up-weights the expression (r0) and certain association networks (q0; e.g. operons and metabolic networks) early in a run to build up co-expressed biclusters, and then slowly increases the influence of the sequence motifs (s0) as the biclusters become better-defined (Fig. 9). For similar reasons, additional parameters, such as the number of motifs searched for, can also vary (i.e. increase or decrease monotonically) with iteration. Details on the default cMonkey parameters used for this work are listed in [Additional File 1, Table Three].
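For concreteness, a schedule of this general shape is sketched below. The linear temperature range is taken from the text, while the specific trajectories of r0, s0, and q0 are illustrative assumptions (Fig. 9 shows the schedule actually used).

```python
def annealing_schedule(n_iter=150, T_range=(0.15, 0.05)):
    """Yield (iteration, temperature, (r0, s0, q0)) tuples: T falls linearly while
    the motif weight s0 ramps up and the network weight q0 is eased off slightly."""
    for it in range(n_iter):
        frac = it / max(n_iter - 1, 1)
        T = T_range[0] + frac * (T_range[1] - T_range[0])
        r0 = 1.0                    # expression stays informative throughout
        q0 = 1.0 - 0.5 * frac       # network weight reduced as the run proceeds (assumed)
        s0 = frac                   # motif weight grows as biclusters firm up (assumed)
        yield it, T, (r0, s0, q0)
```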
Implementation
cMonkey is implemented in the R statistical programming language [5], a highly-flexible cross-platform language widely used in the statistical community. It has been parallelized using PVM [84], as implemented in the Simple Network of Workstations (snow) R library, and runs efficiently on a multi-node Linux cluster; it can also be run on a single-processor desktop computer. On a typical single 2 GHz processor, the algorithm can generate ~100 biclusters in between 12 and 48 hours, depending on the organism, data size, and motif detection parameters chosen. All parameters relevant to the biclustering procedure that have not been described in the main text are listed in [Additional File 1, Table Three].
Comparison with other biclustering and clustering methods
The different bi/clustering algorithms used for the comparative analysis included: Cheng-Church [25], Order Preserving Submatrix (OPSM [18]), Iterative Signature (ISA [19]), xMOTIF [55], and BIMAX [6] (all of these algorithms were run using the BICAT implementation [17]), SAMBA [86] (as implemented by the authors as part of EXPANDER [77]), and both hierarchical clustering and k-means clustering [30] with cluster number (k) ranging from 10 to 300 (implemented in R). None of these methods utilize data integration, and all were run on the same data sets as cMonkey. All biclustering procedures were run using their default parameters and data normalization/discretization schemes (while the effects of varying the parameters for each of these methods would be a worthwhile study, it is beyond the scope of this work). The analysis was performed on the Gasch [36] subset of the S. cerevisiae data containing measurements of 2993 genes over 173 stress-related conditions. S. cerevisiae was chosen for these comparisons because of the high-quality data and varied external "references" available for this organism, against which clusters could be compared. In all cases where p-values for judging annotation over-representation are listed, they were computed following a procedure similar to [21]; namely, cumulative hypergeometric p-values were corrected for multiple hypothesis testing in an experiment-wise manner for each cluster, by computing the fraction of uncorrected p-values derived for 1000 randomized instances of the cluster (the null model) that were less than or equal to the best p-value obtained for that cluster. To assess the effect of various biases caused by inclusion of different parts of the cMonkey model, we performed these same analyses on cMonkey runs with various model parameters up- and down-weighted, as described in the Results section. In all cases where we compared the motif-detection results of specific biclusters (in the Results section), we used MEME and MAST [10] (with the same parameters as for cMonkey) to search for motifs de novo in the upstream sequences of the clusters' genes.
Each biclustering algorithm returned bicluster sets with wide differences in cluster count, cluster size (genes and experiments), amount of overlap/redundancy, expression coherence, and other general characteristics related only to their treatment of the expression data. We therefore computed "bulk" measurements for each bicluster set, such as those listed in [Additional File 1, Table Two]. One of these, f, is defined as the total fraction of elements in the expression data matrix X which fall in at least one bicluster. A measurement (RMSD) that quantifies the degree to which each complete bicluster set recapitulates the variance in X is defined as follows:

where, as above, x̄_jk is the mean expression profile of bicluster k, and n_ij = ∑_k 1[i ∈ I_k ∧ j ∈ J_k] is the number of biclusters containing element x_ij. This measure is dependent upon the fractional coverage f of the expression data matrix by the bicluster set (better coverage will generally lead to a better RMSD), as well as the average bicluster residual (a better residual leads to a better RMSD), but is nearly independent of bicluster redundancy. It is therefore a good measure of the tradeoff that each bi/clustering method chooses between data coverage and bicluster co-expression. Because the expression data set has been variance-normalized (see above), RMSD ranges between 0 and 1, where a smaller RMSD implies that the mean expression profiles of the biclusters more accurately "generate" the original data matrix X.
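Since the RMSD formula itself is not reproduced above, the following sketch shows one reading consistent with its description: squared residuals of every covered matrix element against the mean profile of each bicluster containing it, averaged and square-rooted. The per-bicluster weighting of shared elements is an assumption.

```python
import numpy as np

def bicluster_rmsd(X, biclusters):
    """One reading of the bulk RMSD measure over a full bicluster set.

    X          : genes x conditions matrix (row-normalized)
    biclusters : list of (gene_indices, condition_indices) pairs
    """
    sq_err, n_terms = 0.0, 0
    for genes, conds in biclusters:
        sub = X[np.ix_(genes, conds)]
        xbar_jk = sub.mean(axis=0)                 # mean expression profile of bicluster k
        sq_err += ((sub - xbar_jk) ** 2).sum()
        n_terms += sub.size
    return np.sqrt(sq_err / n_terms) if n_terms else np.nan
```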
In an attempt to remove some overlap and size bias related to these quality measurements (see Discussion), we also performed tests on a "filtered" set of biclusters, in which we greedily identified the largest 100 clusters with a volume-overlap (genes × conditions) of < 0.5 [6], excluding any with > 200 genes. For methods such as cMonkey, this filtering step removes a large number of non-redundant (but smaller) clusters; for other methods (e.g. OPSM), it removes a large fraction of the derived clusters; and for others (such as SAMBA) it has little effect. Finally, in a further attempt to disentangle the cluster size bias inherent in these comparisons, we performed the same analyses on a set of evenly divided cluster sets ("big" and "small" halves; results shown in [Additional File 1]).