 Methodology article
 Open Access
 Published:
Identifying pathogenic processes by integrating microarray data with prior knowledge
BMC Bioinformatics volume 15, Article number: 115 (2014)
Abstract
Background
It is of great importance to identify molecular processes and pathways that are involved in disease etiology. Although there has been an extensive use of various highthroughput methods for this task, pathogenic pathways are still not completely understood. Often the set of genes or proteins identified as altered in genomewide screens show a poor overlap with canonical disease pathways. These findings are difficult to interpret, yet crucial in order to improve the understanding of the molecular processes underlying the disease progression. We present a novel method for identifying groups of connected molecules from a set of differentially expressed genes. These groups represent functional modules sharing common cellular function and involve signaling and regulatory events. Specifically, our method makes use of Bayesian statistics to identify groups of coregulated genes based on the microarray data, where external information about molecular interactions and connections are used as priors in the group assignments. Markov chain Monte Carlo sampling is used to search for the most reliable grouping.
Results
Simulation results showed that the method improved the ability of identifying correct groups compared to traditional clustering, especially for small sample sizes. Applied to a microarray heart failure dataset the method found one large cluster with several genes important for the structure of the extracellular matrix and a smaller group with many genes involved in carbohydrate metabolism. The method was also applied to a microarray dataset on melanoma cancer patients with or without metastasis, where the main cluster was dominated by genes related to keratinocyte differentiation.
Conclusion
Our method found clusters overlapping with known pathogenic processes, but also pointed to new connections extending beyond the classical pathways.
Background
Highthroughput experimental techniques provide objective views of the molecular changes that occur in cells during disease progression. A widely used tool when trying to understand the biological meaning of the observed changes is pathway analysis. In pathway analysis, the list of significantly altered genes or molecules is mapped onto usually precompiled pathways. A wide range of methods for finding predefined pathways overrepresented by significant genes have been developed and are steadily used, e.g. [1–3]. However, this strategy for identifying the important disease pathways is limited, as pathways are not fixed and change with context [4]. As a consequence, many of the significantly altered genes or molecules fall outside of the expected pathways [5].
An alternative approach, which is fundamentally different from the pathway analysis just described, is a purely datadriven approach, in which the experimental data is used without any guidance from prior knowledge about molecular interactions. The goal is then to discover hidden patterns of correlation between genes reflecting the complex processes and pathways that underlie cellular metabolism and physiology [6]. The most common approach for this task is unsupervised clustering. This class of methods was of the first to be applied to microarray data [7], and several approaches have been proposed, e.g hierarchical and kmeans clustering, Prediction Around Medoids (PAM) [8], modelbased clustering via Markov chain Monte Carlo (MCMC), e.g [9, 10], tight clustering [11] and clustering via networks [12]. Another way of learning molecular relationships from microarray data is to use Bayesian network (BN) methodology. Instead of finding groups of correlated molecules, BN methodology infers a network of (direct) interacting molecules. BNs were first proposed for microarray data in 2000 [13], and have also been much applied.
Along with the development of such unsupervised methods, there has been a steady development of prior information about molecular interactions. In addition to the mentioned pathways, there are many databases containing pairwise connections, such as proteinprotein interactions, transcription factors binding to genes and protein sequence similarities. Such information is maybe a more reliable source of information than pathways, as the latter lack proper definitions, and are as already mentioned heavily context dependent.
Bayes’ formula is well suited for taking prior information into account during inference, and says
The goal of the analysis is the posterior probability p (MD), i.e. the probability that the model (M) is correct given the data (D). In our setting, the model M represents a clustering or a network configuration, whereas D represents the microarray data. Many methods using BN methodology to incorporate informative priors have been proposed quite recently, e.g [14–17]. However, even with the help of prior information, inferring Bayesian networks for large numbers of genes remains computationally challenging, and the number of genes that can be included is usually smaller than one hundred. Inference about groups or clusters of genes is a less daunting task, as there are fewer group/cluster configurations than network configurations. For cluster analysis, there has been some, but not many, suggestions for how to make use of prior knowledge, e.g [18]. Many Bayesian modelbased clustering methods have been proposed, e.g. [9, 10, 19, 20] but common to all of these is that they use noninformative priors. Bayesian clustering with truly informative priors has to our knowledge not previously been proposed.
In the present paper, we develop a novel method for clustering using Bayesian statistics and informative priors making use of Markov chain Monte Carlo (MCMC) sampling. We name our method MCIP (MCMC Clustering with Informative Priors). In our method, coregulation patterns in the microarray data are used together with prior knowledge to find groups or modules of interacting genes. Specifically, we propose a method which searches for an optimal partitioning of the genes into functional modules, where module assignment is based on both microarray data and prior knowledge. The prior value for assigning genes to the same module is retrieved from databases containing gene pairs with a previously reported interaction or connection. Markov chain Monte Carlo (MCMC) sampling is used to search for the most reliable modules. Having found the modules, we generate subnetworks consisting of the prior pairs within each module, hypothesizing that these pairs represent the direct interactions. In the presented work we use prior information in forms of proteinprotein interaction data, transcription factor binding predictions and protein sequence similarities. We apply our method to simulated data as well as two realworld microarray data sets, one in heart failure and one in melanoma cancer.
Methods
Assume we have a genomic dataset D, e.g. a microarray dataset, with measurements of n genes. We want to use D together with prior knowledge about molecular interactions to find the most likely partitioning into groups or modules of functionally related genes. Now assume that the n genes represent K different modules, where K is not known in advance. Let M = {(i_{ m },j_{ m },p_{ m }),m ∈ {1,…,q}} be the collection of q gene pairs for which prior knowledge exist, where ${p}_{m}={p}_{{i}_{m},{j}_{m}}$ is the prior probability that gene i_{ m }and gene j_{ m }belong to the same module.
Now let g= (g_{1},…,g_{ n }) be a representation of group/module membership for each of the n genes, where g_{ j }∈ {1,…,K}. We want to find the most likely division ginto modules given the data as well as the prior knowledge, or formally, find the gmaximizing P(g∣ D,M). Using Bayes’ formula, we have
Marginal likelihood given grouping
We start by finding the marginal likelihood given the subdivision into different groups (modules/clusters) and the prior information, i.e. P (D∣g,M). Assume the data in group k follow a multivariate normal distribution with covariance matrix Σ_{ k }, whose prior is inverse Wishart f (Ψ_{ k },m_{ k }) where Ψ_{ k }describes the prior covariance matrix and m_{ k }describes the sharpness of the prior. Assume also that all prior information about the data is contained in M, and that genes belonging to different groups are independent of each other. Consequently, we can let each group k have prior covariance matrix Ψ_{ k }= ρ_{ k }I_{ k }, where I_{ k }is a n_{ k }× n_{ k }identity matrix and n_{ k }is the number of genes in group k. We thus have that
where N is the number of rows in D (number of individuals), X_{ k }is the data matrix for genes belonging to group k and Γ_{ i }() is the multivariate gamma function. If the gene expressions are standardized to have mean values equal to zero and variances equal to one, it is natural to choose ρ_{ k }= 1 and m_{ k }= n_{ k }+ 2, as this yields expected values of 1 for the variances, i.e. for the diagonal elements. The expectancy of the inverse Wishart distribution is defined only for m_{ k }> n_{ k }+ 1, so this is a minimalistic way of coding for the knowledge that the data has been standardized.
Probability of grouping given prior information
Next, we want to find P (g∣ M), i.e. the probability of grouping ggiven the prior information M. Assume we have n genes and K ∈ {1,…,n} groups. Let N (n,K) denote the number of possible subdivisions of n genes into K groups. A new gene can be inserted into one of the K existing groups, or as its own, singlemembered, group, so N (n,K) can be defined recursively by
where N (K,K) = 1. Let N_{ ij }(n,K) = K · N (n  2,K) + N (n  2,K  1) = N (n  1,K) be the number of subdivisions of n genes into K groups where i and j are in the same group. Let $N\left(n\right)=\sum _{K=1}^{n}N(n,K)$ be the number of subdivisions of n genes in total and ${N}_{\mathit{\text{ij}}}\left(n\right)=\sum _{K=1}^{n}{N}_{\mathit{\text{ij}}}(n,K)=\sum _{K=1}^{n}N(n1,K)=N(n1)$ be the number of subdivisions of n genes where i and j are in the same group.
Now let M = {(i_{ m },j_{ m },p_{ m }),m ∈ {1,…,q}} be the set of q pairs for which prior knowledge exist, where we define p_{ m }as the prior probability of forcing gene i_{ m }and j_{ m }to belong to the same group. Consider first the situation where we have no prior knowledge, i.e. M_{0} = ∅. Let (K ,n) denote that we have K groups of n objects, and let P (K ∣ M_{0}) = 1/n, i.e equal prior probability for the number of groups. For a particular grouping g= (g_{1},…,g_{ n }) that has a total of K_{ g }groups, let P (g∣ M_{0},K_{ g })) = I (K = K_{ g })/N (n,K_{ g }), i.e. equal probability for all groupings for a given number of groups K_{ g }. This implies that the probability of grouping ggiven no prior knowledge is
We denote this probably measure ${P}_{{M}_{0}}$ and refer to it as the baseline prior.
We will now specify a probability measure which has the same structure as the baseline prior, except that i and j are forced to be in the same group and will thus be handled as a unit. To do this, consider the situation where we have one pair of genes that belong to the same group with probability one. The prior knowledge is then defined as M_{1,1} = {(i,j),p_{1} = 1}. We then have in total N_{ ij }(n) = N(n  1) possible subdivisions and n  1 possible number of groups. The number of subdivisions into K groups is N_{ ij }(n,K) = N (n  1,K). This means that
where I is the indicator function, so
Call this probability measure ${P}_{{M}_{1,1}}$.
Then consider the situation where we have one pair of genes that belong to the same group with probability p_{1} ≠ 1. We now define the prior knowledge as M_{1} = {(i,j),p_{1} (≠ 1)}. The idea is that ${P}_{{M}_{1}}={p}_{1}{P}_{{M}_{1,1}}+(1{p}_{1}){P}_{{M}_{0}}$, that is a mixture between a probability measure which forces i and j to be in the same group and a probability measure that treats all genes equally (the baseline prior). Following this idea, we have
Now let’s generalize to the situation where we have q pairs of genes with existing prior knowledge, i.e. we have M = {(i_{ m },j_{ m },p_{ m }),m ∈ {1,…,q}} with ${p}_{m}={p}_{{i}_{m},{j}_{m}}$. Since we have a probability for each pair, we need to introduce some notation specifying what pairs in the prior that are forced to be in the same group. This can be achieved by introducing X = {X_{ m }},m ∈ 1,…,q, where X_{ m }∈ {0,1} indicates whether the pair (i_{ m },j_{ m }) is forced to be in the same group or not, and M_{ X }= {(i_{ m },j_{ m },p_{ m })X_{ m }= 1} are the pairs that are forced together. We also define the total number of forced pairs for such a combination X as $x=\sum _{m=1}^{q}{X}_{m}.$ We have that
and thus
Since we have
and $P(\mathit{g}\mid M)=\sum _{X}P(\mathit{g}\mid {M}_{X})P\left({M}_{X}\right)$, we arrive at
Note that this expression increases exponentially with the number of prior pairs q. In order to avoid computational cost increasing exponentially with the number of prior pairs, we developed a Monte Carlo estimation of P(g∣ M), described in Additional file 1.
For a given grouping, the baseline prior contributes to the overall probability of a grouping with an additive factor (1  p_{1})(1  p_{2}) ⋯ (1p_{ q }) P (gM_{0}) (see Eq. 2 and the expression for the baseline prior P(gM_{0})). This implies that the baseline prior will contribute to the probability of two genes being clustered. Define P_{ b } as the probability for two genes being in the same group given the baseline prior M_{0}. It is important that this probability is nonzero, in order to allow for the possibility of two arbitrary genes being in the same group whether they are in the set of prior pairs or not. If not, it would be impossible to group gene pairs that are not in the set of specified prior pairs. However, this implies that the probability of grouping two genes (i_{ m }, j_{ m }) in a prior pair will be be the mixture p_{ m }+ (1  p_{ m })P_{ b }, as there is a nonzero probability that the genes are connected, even if they should not be connected according to the prior information. For instance, if the baseline prior for two arbitrary genes to be connected is 0.1 and pair number m has prior enforcement probability p_{ m }= 0.8, the total prior probability for the pair (i_{ m },j_{ m }) to be connected will be 0.82.
There will be a baseline probability for two arbitrary genes to be connected. We have become aware that with equal probability for all number of groups and equal probability for each grouping given the number of groups, the baseline for any two genes to be connected will be dependent on the total number of genes, n. In general, we get that the prior for groupings gives that the baseline probability, P_{ b }for two genes to be connected when there are K groups will be P(i and j in the same group ∣ K groups) = N (n  1,K)/N(n,K). The total probability will then be ${P}_{b}=(1/n)\sum _{K}N(n1,K)/N(n,K)$, which depends on the number of genes in total, n. For n = 100, P_{ b }= 0.048 while for the extreme case n = 10, P_{ b }= 0.25. It might be seen as a weakness that there is any dependency on the number of genes in the dataset on the baseline probability. An alternative approach could be to specify the baseline prior according to a given baseline probability for any arbitrary pair of genes. This would make for a more consistent baseline prior, but has the disadvantage that the baseline prior has to be set manually. An alternative would be to make the baseline prior into a parameter to be estimated in a hierarchical Bayesian model.
Removing cycles
The set of priors might contain cycles, which can easily occur, e.g if both direct and indirect connections are included. We have developed an algorithm for detecting cycles, and if such are found, the prior pair with the smallest prior probability (smallest extra connection probability) is removed, as this connection is interpreted as a result of the rest of the cycle. The reason for excluding cycles in the prior is that with cycles, the pairwise specification of prior probabilities of pairs could be misleading. As an example, let us assume the prior is specified so that there is a 0.8 prior probability of a pairing between gene A and B (prior pair 1), between B and C (prior pair 2) and between A and C (prior pair 3). Then the probability for a forced connection between A and B (thus disregarding the baseline probability) will be P(A and B forced to be in the same group) = P(prior pair 1 enforced) + P(prior pair 1 not enforced)P(prior pair 2 enforced)P(prior pair 3 enforced) = 0.8 + 0.2*(0.8*0.8) = 0.928. Thus in order for the pairwise prior probabilities to be interpreted as the probability for two pairs to be forced to be connected, cycles should be avoided.
Markov chain Monte Carlo procedure for integrative networks
We use Markov chain Monte Carlo (MCMC) to sample from the posterior distribution P (D ∣ g,M), as the analytical solution is not known. We will use the Metropolis Hastings algorithm, which is the most general version of MCMC. Specifically, we propose the following algorithm:

Start with a random grouping, called g.

Sample N groupings g based on the following scheme:

Propose either (each with probability 1/3)

∗ A partitioning of one of the groups (except when K = n)

∗ A merging of two of the groups (except when K = 1)

∗ A transition of one of the genes to another group (except when K = 1)


Call the new grouping g_{new}.

Accept g_{new} with the following probability:
$$\phantom{\rule{14.0pt}{0ex}}P\left(\text{accept}{\mathit{g}}_{\text{new}}\right)\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}\text{min}\phantom{\rule{0.3em}{0ex}}\left(\phantom{\rule{0.3em}{0ex}}1,\frac{P(D\phantom{\rule{0.3em}{0ex}}\mid \phantom{\rule{0.3em}{0ex}}{\mathit{g}}_{\text{new}})P({\mathit{g}}_{\text{new}}\mid M)Q({\mathit{g}}_{\text{old}}\phantom{\rule{0.3em}{0ex}}\mid \phantom{\rule{0.3em}{0ex}}{\mathit{g}}_{\text{new}})}{P(D\phantom{\rule{0.3em}{0ex}}\mid \phantom{\rule{0.3em}{0ex}}{\mathit{g}}_{\text{old}})P({\mathit{g}}_{\text{old}}\phantom{\rule{0.3em}{0ex}}\mid \phantom{\rule{0.3em}{0ex}}M)Q({\mathit{g}}_{\text{new}}\phantom{\rule{0.3em}{0ex}}\mid \phantom{\rule{0.3em}{0ex}}{\mathit{g}}_{\text{old}})}\phantom{\rule{0.3em}{0ex}}\right)\phantom{\rule{0.3em}{0ex}},$$

where P (D ∣ g) is given by (1), P (g∣ M) is given by (2) and Q(g_{x}g_{y}) is the probability of proposing the group configuration g_{x} from the configuration g_{y}, and is given by (35) in the next section.
In order to avoid convergence to local maxima, we implemented parallel tempering [21], as described in Additional file 1.
Inferring clusters from MCMC samples
The above MCMC procedure results in a series of samples g. These samples can be used to calculate the posterior similarity matrix (PSM), in which each entry (i,j) gives the proportion p_{ ij }of samples gene i and gene j occur together in the same cluster. We infer clusters from the PSM using the minbinder function in the mcclust R library [22]. Specifically, minbinder makes use of hierarchical clustering with the PSM as distance matrix together with the cuttree function to produce cluster configurations with the number of clusters ranging from 1 to L, where L is a userspecified maximum. Now letting I_{ K }(i,j) be the indicator for whether gene i and j are in the same cluster for the configuration using K clusters (based on hierarchical clustering and the cuttree function), and using absolute difference as loss function, the posterior expected loss e_{ K }for K clusters is calculated in minbinder as ${e}_{K}=\sum _{i<j}\left{I}_{K}\right(i,j){p}_{\mathit{\text{ij}}}$. The inferred cluster configuration is the result of hierarchical clustering and cuttree for $\widehat{K}$ clusters, where $\widehat{K}$ is the K minimizing the posterior expected loss, i.e. $\widehat{K}=\underset{K}{\text{min}}{e}_{K}$.
Proposal distribution
Let n, as before, be the total number of genes, and let n_{ k }be the number of genes in group k and n_{ s }be the number of singlemembered groups. If the number of groups K equals 1, the only allowed option is splitting into two new groups. If K = n, we can only have a merging of two genes into a new group. If 1<K < n, all three types of moves are allowed (splitting and merging groups, and moving a gene into another group). Each type of move then has probability 1/3.
First, consider the situation where g→ g^{′} is due to a splitting of group k. For the configuration g, we then either have 1 < K < n, and a splitting is occurring with probability 1/3, or K = 1, and this type of move is occurring with probability 1. The probability of getting the new grouping g^{′} is then the product of the probability of having a split within group k, which is n_{ k }/(n  n_{ s }), the probability of the actual splitting of j into two specific new subgroups, which is 1/N (n_{ k },2), and the probability of having a split (rather than a merge or a transition of one of the genes), which is 1/3 · I (1 < K < n) + I (K = 1). That is, we have,
Next, consider the situation where g→ g^{′} is due to a merging of group k and l. A merging happens with probability one if the number of groups is equal to n and with probability 1/3 if 1 < K < n. Two groups, l and k, are chosen by first sampling a random gene and finding which group the gene belongs to, then picking another random gene and resampling as long as that other gene belongs to the same group. The probability of merging group l and group k has thus a proposal probability P(merge l and k) = P(choose l and k ∣ merge)P(merge) = [P(choose k first)P(choose l second ∣ k chosen first) + P(choose l first)P(choose k second ∣ l chosen first)] P(merge). That is, we have
Finally, consider the situation where g→ g^{′} is the result of moving a gene in group l to group k. A move of one gene from group l to group k is proposed by sampling a random gene and resampling if the gene itself constitutes a singlemembered group. Another random gene not belonging to group l is then sampled, defining another group k. A random element from group l is then assigned group identity k. The proposal probability thus becomes P(move the kth gene of group l to group k) = P(choose group l to move from and group k to move to and choose ith element of group l ∣ move)P(move) = P(choose group l)P(choose group k ∣ group l)P(choose ith element from group l)P(move) = (n_{ l }/(n  n_{ s }))(n_{ k }/(n  n_{ l }))(1/n_{ l })(1/3) I (1 < K < n). Thus, we have
Computational aspects
The likelihood estimation involves the determinant function, which is O (n^{3}) in the number n of genes in each module. In our method, the most computationally expensive calculation is the exact calculation of the prior probability P(gM). This calculation is exponential in the number of prior pairs. Often, as is the case for the data used in this paper, the number of pairs for which there exist prior knowledge, can be substantial. To deal with such situations, we developed an approximate estimate of the prior probability based on Monte Carlo simulations. Comparisons on moderately large number of priors suggested that although less accurate, a Monte Carlo estimate of the prior gives results close to the results obtained when calculating the prior exactly (Additional file 2). For larger number of priors, this can not be tested, but experience shows stable results, suggesting that the stochastic nature of the algorithm do not seriously affect the results. With 100 genes and 100 samples and 200 priors, and using 800 for the Monte Carlo prior calculation, it took our method approximately 10 minutes to run 10000 samples on an 64 GB RAM, 16 physical CPU cores compute node.
Prior information
We will use three sources of prior information: proteinprotein interactions, transcription factor binding predictions as well as protein sequence similarity calculations. Gene pairs in these databases have evidence of being functionally related. The strength of the evidence is represented by some type of score for each database. In our MCMC algorithm the prior information is formulated in terms of the prior probability of forcing two genes to belong to the same group. We will use the logit transformation to transform the scores into the unit interval, thus enabling them to be used as a probability measure. Further details are given below for each prior type. There is very little overlap between the prior information databases. In cases where a gene pair is found in more than one database, the highest prior probability will be used.
Protein interaction data
Proteinprotein interactions (PPI) are important for the majority of biological functions. PPI databases contain lists of proteins pairs for which there exist some evidence for interaction. The evidence comes from various types of studies, ranging from highthroughput methods, more traditional lowthroughput proteomics studies, as well as in silico predictions based on known interactions [23]. iRefIndex [23] is a consolidated protein interaction database comprising information from nine wellknown interaction databases. One of the features of the database is the socalled lpr (lowest PMID reuse) score, which is the lowest number of unique interactions that are supported by one of the interaction’s PubMed identifiers (PMID). A low lpr indicates that the interaction is more likely to rely on lowthroughput methods than on highthroughput methods. As the former is recognized as more reliable than the latter, the lpr score can be used as a measure of the reliability of the interaction. We will use a logit transformation taking x to 1/(1 + e xp (  x)) of the inverse of the lpr score to obtain a probability measure (between zero and one). For example, if the lpr score is one, p_{ m }becomes 0.73, if the lpr score is zero, p_{ m }becomes 0.5.
Transcription factor bindings
Transcription factors (TF) are proteins that bind to specific DNA sequences, thereby controlling the expression levels of the corresponding genes. Binding preferences of many transcription factors are known and characterized by a sequence binding motif. However, binding affinity does not depend entirely on the match of the sequence to the motif, but will rely also on sequence specific features. Ernst et al.[24] developed TF binding predictions, where they first found a general binding preference of the sequences based on 28 local genomic features, including histone modifications, conservation and melting temperature. Then, they combined this general binding preference score with motif information for specific transcription factors to improve prediction of genes bound by the factor. We will make use of these scores, and again, we will apply the logit transformation to transform the scores to values between 0 and 1.
Protein sequence similarity
Proteins with similar sequences are likely to be functionally related as the proteins may be expressed by paralogous genes (having common ancestry) or by genes that are selected to have the same function. For instance, two homologous proteins can be phosphorylated by the same kinase, thus playing roles in the same signaling pathway. One additional feature of using protein homology data in this setting is that the function of proteins for which the function is unknown can be learned by borrowing information from their protein homologs. We will calculate percent similarity between each of the human proteins in RefSeq using BLAST [25].
Results
Simulated data
To evaluate the performance of our method we used simulated gene expression data generated according to [26]. In our study, we used a total of five clusters of genes C^{∗} = (C_{1},…,C_{5}) with dimension N (denoted d in [26]) samples. Cluster sizes n_{ c }(c = 1,…,5) were generated from n_{ c }∼ 2 × Poisson (λ). Expression values in cluster C_{ c }were generated using a hierarchical lognormal model as in [26]:

(1)
A vector of cluster template for cluster C_{ c }was created with four periods of constant expression of size m_{1},m_{2},m_{3} and m_{4}. The sizes m_{ k }, k = 1,…,4, was from a uniform distribution such that $\sum _{k}{m}_{k}=N$ and m_{ k }> 2. An initial template with constant pattern in four periods was simulated from $\text{log}\left(\underset{k}{\overset{\left(c\right)}{\mu}}\right)\sim N\left(\mu ,{\sigma}^{2}\right)$.

(2)
Sample variability and gene variability: Sample variability ${\sigma}_{s}^{2}$ was introduced and the cluster template ${T}_{j}^{\left(c\right)}$, j = 1,…,N, was generated from $\text{log}\phantom{\rule{0.3em}{0ex}}\left(\underset{j}{\overset{\left(c\right)}{T}}\right)\sim N\left(\text{log}\phantom{\rule{0.3em}{0ex}}\left(\underset{k}{\overset{\left(c\right)}{\mu}}\right),\underset{s}{\overset{2}{\sigma}}\right)$. Then for each gene vector i in sample j, gene variability was added and gene expression values generated as $\text{log}\phantom{\rule{0.3em}{0ex}}\left({x}_{\mathit{\text{ij}}}\right)\sim N\left(\text{log}\left(\underset{j}{\overset{\left(c\right)}{T}}\right),\underset{0}{\overset{2}{\sigma}}\right)$.

(3)
We repeated steps 1 and 2 to generate five clusters. We used parameter values of μ = 6, σ = 1, σ_{ s }= 1, σ_{0} = 1, and λ = 10.
As in [26], extra variation was introduced to evaluate robustness of clustering methods against potential random errors introduced from experimental procedures, such as sample acquisition, labeling hybridization and scanning. To each element of the logtransformed expression matrix we added a a random error from a normal distribution with mean zero and standard deviation (SD) equal to 0, 1 and 2. In addition, the sample size was varied, using N = 10, 100 and 1000. For each of these nine scenarios, 50 datasets were generated.
For each dataset, three scenarios for the available prior information were used. In (A), we assumed that no prior information was used. In (B), we assumed priors pairs were available, where 20% where misspecified, i.e. 20% of the gene pairs had members belonging to different groups. In the last scenario (C), all pairs were assumed to be correctly specified (the members belonged to the same group). Prior values were generated from a uniform U(0.5,1) distribution.
We compared our method with five wellknown clustering methods for which a software already exist, namely hierarchical clustering, kmeans clustering, Partitioning Around Medoids (PAM) [8], Modelbased clustering (Mclust) [9] and tight clustering [11]. For our method, after a burnin period generating 10K samples, we generated 10K samples from which each 100th sample was chosen. For all methods except ours, the number of clusters were estimated using the Gap index [27] (our method implicitly searches for the optimal number of clusters). For our method, clusters were inferred by minimizing the posterior expected loss based on the MCMC samples as described in the Methods section. The number of clusters estimated by the GAP index as well as our method is shown by boxplots in Additional file 3: Figure S2. The Rand index, defined as the proportion of concordant gene pairs in two partitions among all possible gene pairs, was used as evaluation measure. Specifically, we used the adjusted Rand index [28], which is standardized to have expected value zero when the partitions are randomly generated and takes maximum value one if two partitions are perfectly identical. Unlike the other methods, tight clustering produces clusters where some genes are not allocated to any cluster. In the calculation of the Rand index, only the allocated genes are considered.
The results are shown in Figure 1. We see that when all pairs are correctly specified (scenario C), our method (MCIP) was at least as good as all other methods, and superior to the other methods for the smallest sample size (N = 10). When 20% of the priors were misspecified (scenario B), the performance was better than our method without using priors (scenario A), as well as hierarchical clustering, which was overall the second best method. We note that Mclust had a very variable performance, and that tight clustering was performing very poorly for large sample sizes. In order to further investigate the effect of misspecifications of the priors on model performance, we calculated the adjusted Rand index for increasing proportion of misspecifications. Additional file 4: Figure S1 shows that about 40% misspecifications were allowed, in the sense that this corresponded to the use of no prior information. We also note that there was a correspondence between number of estimated clusters (Additional file 3: Figure S2) and performance (Figure 1). Especially for small sample sizes (N = 10), the number of clusters found by maximizing the GAP index, as well as with our method without the use of priors, quite often yielded many more clusters than the true number of clusters (equal to five). This bias was much less evident for our method with the use of priors (MCIPB and MCIPC). Additional file 5: Figure S3 shows the performance after fixing the number of clusters to the true number of clusters (five) for all methods except our method, which inherently finds the number of clusters. The figure shows that poor performance, especially seen for tight clustering (N = 100 and N = 1000) and Mclust (N = 100), was not only due to bias in the estimation of number of clusters, as these methods also performed poorly after fixing the number of clusters.
Heart failure data
We used the data described in [29], consisting of microarray gene expression measurements from fourteen mice subjected to aortic banding and five sham operated mice. Aortic banding leads to increased left ventricular pressure. To compensate for the increased load, gene expression changes occur resulting in myocardial remodeling, involving hypertrophy of cardiomyocytes. Ultimately, the cardiac hypertrophy might lead to development of heart failure. We based our network analysis on the most differentially expressed genes between aortic banding and sham. To find differentially expressed genes we performed ttests between the two groups, using log2 expression values, before multiple testing correction was performed using the method of [30]. We used a false discovery rate (FDR) cutoff of 5%, and among these genes we picked the 400 with largest fold change (up or down). We looked up connections between these genes and assigned prior probabilities for the pairs based on the prior databases described in the previous section. For each of the three prior databases, there were several hundred pairs where both genes were represented in our inputlist. As the use of so many priors pairs was too computationally demanding for our method, we picked the 50 top scoring pairs for each of the prior types. We applied our MCMC algorithm using altogether 150000 Monte Carlo samples, with the first 50000 samples used for burnin (i.e these first samples were discarded and only used to ascertain a stable starting point in the MCMC sampling). We applied parallel tempering as described in Additional file 1. As we here had more than hundred prior pairs, we approximated P(g∣ M) with the Monte Carlo estimator (1) defined in Additional file 1, using K = 800, as this value gave stable results within a reasonable computation time. Clusters were inferred by minimizing the posterior expected loss based on the posterior similarity matrix, which was calculated from the collection of each of the 100th MCMC sample after the burnin period.
Table 1 summarizes the clusters and Additional file 6: Figure S4 displays the clusters as (fully connected) networks (all nodes in the same cluster are connected) using Cytoscape [31]. There is one one large module of mainly upregulated genes (aorta banding vs sham), and one smaller module of both up and downregulated genes. In order to investigate these modules more thoroughly we applied Gene Ontology [32] analysis using the R/Bioconductor package GOstats [33]. Additional file 7 shows the most significantly altered GO categories in each of the modules. The top GO term of the larger module was extracellular region (p = 3e34, FDR qvalue = 5e31), and many of the other modules were related to this term (proteinaceous extracellular matrix, extracellular space, extracellular matrix structural constituent, collagen). In the smaller module many GO terms were related to carbohydrate metabolism (polysaccharide metabolic process, carbohydrate biosynthetic process). Figure 2 contains a subset of the larger network, showing prior pairs occurring within the main module. We also applied our method without the use of priors, as well as kmeans clustering. For the latter, the Gap index was used to find the number of clusters K. Both these methods gave one dominant cluster and two smaller clusters. GO analyses of the main clusters are given in Additional files 8 and 9.
Melanoma cancer data
Metastatic melanoma is a deadly disease while nonmetastatic melanoma and other cutaneous tumor types are usually cured with surgical removal of the primary tumors. To find network of genes differentially expressed between metastatic and nonmetastatic tumors we used data from [34], which included microarray gene expressions from 47 metastatic and 40 nonmetastatic tumor samples of patients with various cutaneous tumors. As for the heart failure data, we used the 400 most differentially expressed genes (based on fold change) for which Benjamini Hochberg FDR <0.05, and found gene pairs in the PPI, the TF and the sequence similarity database where both genes were represented in our input list. We also here approximated P(g∣ M) with the Monte Carlo estimator (1) of the Additional file 1, K = 800 samples. Again, result clusters were inferred by minimizing the posterior expected loss based on the posterior similarity matrix calculated from each 100th of 150K MCMC samples generated after a burnin period of 50K samples.
Table 2 summarizes the clusters/modules and Additional file 10: Figure S5 shows the modules as (fully connected) networks. Figure 3 shows prior pairs within each module. The top ten GO categories from GOstats analysis on each module are shown in Additional file 11. We note that the top three GO categories of the largest module were epidermis development (p =6e32, FDR q value =8e29), cornified envelope (p =2e18, FDR qvalue =3e15) and keratinization (p =1e14, FDR qvalue =3e15). We also applied our method without the use of priors and kmeans clustering to the melanoma data. GO analyses of the main clusters are given in Additional files 12 and 13.
Method evaluation
In order to evaluate our method we made use of literature reported interactions that occurred in abstracts of articles labeled with the Medical Subject Headings (MeSH, http://www.nlm.nih.gov/mesh/) term left ventricular hypertrophy for the heart failure clusters (summarized in Table 1 and displayed in Additional file 6: Figure S4) and the MeSH term melanoma for the melanoma clusters (summarized in Table 2 and displayed in Additional file 10: Figure S5). We applied gene pairs with pvalues smaller than 5% only, using the method described in [35]. We will refer to these interactions as "true" interactions. The application of our method together with minimization of the posterior expected loss led to an inferred clustering. By considering whether the genes of each possible gene pair occurred in the same group or not in the inferred clustering, we were able to calculate the sensitivity (proportion of gene pairs in the literature network also occurring together in the same group in the inferred clustering), the specificity (proportion of gene pairs not in the literature network that were also not in the same clustering group), the positive predictive value (proportion of genes occurring in the same group also occurring in the literature network). From the sensitivities and specificities the Area Under Curve (AUC) was also calculated. Table 3 shows performance measures for the heart failure and the melanoma data, respectively, using our method both with and without priors, and compared to the results of kmeans clustering. For kmeans clustering the optimal number of clusters was found using the Gap index.
Discussion
Using simulated data we compared our method to established clustering methods as kmeans and hierarchical clustering. When no prior information is provided our method did not offer any gains over other standard approaches, except for the N = 10, SD = 0 regime. When most priors were correctly specified (80% or 100%), our method was as least as good as the bestperforming established methods for large samples, and superior to the same methods for small sample sizes. We believe that the majority of the priors will be correctly specified in most cases. The reason for this is that there will usually exist so many prior pairs that one would restrict oneself to only those with the strongest evidence. Of course, a previously shown connection might still not be real in the current situation, e.g. if the connection has been discovered in a different tissue type than the one under study. However, this problem will be limited if one is analyzing a set of genes differentially expressed between two conditions, as many of these will be coexpressed, and thus also correlated.
The Gene Ontology analysis of the heart failure network showed that many of the top ranked GO categories for the larger module were related to the extracellular matrix. Awareness has emerged regarding the extracellular matrix changes taking place in myocardial remodeling [36, 37]. When investigating Figure 2, we see that the larger module contains many collagens (type I alpha 1, type III alpha 1, type IV alpha 1, type V alpha 1, type VI alpha 1, type XII alpha 1, type XIV alpha 1 and type XVI a), which are important structural components of the extracellular matrix. The figure shows that these molecules were primarily connected by protein sequence homology (blue lines). Moreover, other molecules thought to be parts of the extracellular matrix were identified in the same module (thrombospondin 3 and 4, fibromodulin, versican, biglycan, fibronectin, elastin). For example, fibronectin has been demonstrated to play a role in the organizing of collagen type I [38]. The module also contained enzymes and hormonal factors known to process the extracellular matrix structural proteins (matrix metallopeptidase 2, transforming growth factor beta 2 and beta 3, latent transforming growth factor beta binding protein 2, lysyl oxidase, TIMP metallopeptidase inhibitor 1). For instance, matrix metallopeptidase 2 is known to be important for the breakdown of collagen type I [39]. Finally, the extracellular matrix related transcription factor early growth response 3 was a hub in this module, indicating a particular importance in this pathological process. Altered metabolism has been described in remodeled and failing hearts. E.g. reduced fatty acid oxidation and an increase in glucose utilization have been reported [40–42]. The smaller module contained several genes involved in carbohydrate metabolic processes (6phosphofructo2kinase, phosphorylase kinase gamma 1, aldolase B fructosebisphosphate, fructose bisphosphatase 2). Taken together, the two modules confirm regulatory and signaling events and structural alterations known to be involved in myocardial remodeling. We believe that the complete result network, shown in Additional file 6: Figure S4, contains many other important, not yet discovered, interactions. However, a detailed investigation of all these connections is beyond the scope of this paper.
The melanoma network was derived from the data of Riker et al.[34]. As the present paper represents a reanalysis of the data, the outcome of an analysis with similar input data should be expected to generate somewhat similar output data. The main difference in approach is that while the Riker paper primarily performs a direct Gene Ontology analysis, the current approach attempts to combine multiple sources of information prior to performing any functional analysis. In this way, the resultant network modules will represent subnets of relevance with respect to the combined prior information, thus presenting local information clusters with higher likelihood of coherence in function for each cluster. Riker et al. observed differentially expressed genes involved in keratinocyte differentiation and epidermal development, including loricrin (LOR), involucrin (IVL), keratin 5 (KRT5) and plakophilin 1 (PKP1), suggesting a loss of epidermal characteristics, in highlighting the desquamation process. In our largest module (module 1) we found many genes related to keratinocyte differentiation and epidermal development (see Additional file 11, Table 1). When looking at Figure 3, where gene pairs with prior information within each module were connected, we note that keratins and kallikreins were connected through a set of interacting genes that included desmocollin 1 and 3 (DSC1 and DSC3), desmoglein 1 and 3 (DSG1 and DSG3), desmoplakin (DSP) and corneodesmosin (CDSN). Many of the genes in this path are related to corneodesmosomes (KLK5 and KLK7: kallikrein 5 and 7, DSG1, DSG3, DSC1, DSC3, DSP, IVL, LOR and keratin 10). Corneodesmosomes are located at the junctional structures that mediate corneocyte cohesion [43], underlining keratinocyte dedifferentiation. The individual connections in this path are supported by previous findings; Caubet et al. [43] showed that KLK5 and KLK7 degrade the desmosomal proteins DSC1 and DSG1. Smith et al.[44] showed that when epidermal cells differentiate, DSC1 and DSG1 are bound by PKP1 and DSP, thereby enhancing anchorage to keratin intermediate filaments. Interactions between plakophilins, desmoplakins and keratins have also been shown by other authors, e.g. [45–47]. Interestingly, the transcription factor RARrelated orphan receptor A (RORA) was downregulated, and placed as a hub in this module. RORA has recently been indicated as a breast tumor suppressor [48], as well as having a role in keratinocyte differentiation [49]. Taken together, Figure 3 suggests that the loss of epidermal characteristics seen in the metastatic individuals is mediated by a downregulation of the RORA transcription factor.
When comparing the GO analysis of the main clusters from our method with the use of priors (Additional file 7 and 11) to the GO analysis of the main clusters using our method without priors (Additional files 8 and 9) and kmeans clustering (Additional files 9 and 13), we note that many of the same GO categories were showing up. But the pvalues were overall much smaller for our method with the use of priors, than the other two, indicating that our method has better sensitivity of finding functionally related gene groups. We note that the path found in the melanoma data linking the RORA transcription factor to the keratins via the kallikreins and the desmosomal proteins, would not have been discovered by our method without priors or kmeans clustering, as none of the these methods gave clusterings where all these gene subgroups were contained in one cluster.
Method evaluation using the literaturederived network showed that adding prior knowledge yielded improved AUC. The AUC was improved from 0.62 to 0.65 for the heart failure data and from 0.6 to 0.64 for the melanoma data. The positive predicted values (PPV) were also improved (0.02 to 0.023 for the heart failure data and 0.024 to 0.03 for the melanoma data). We note that these values are very small, which is because there are very few gene pairs with cocitation (only around one in a hundred of all possible pairs) compared to the number of gene pairs clustered together (more than half of the pairs for the heart failure data, for instance). It is important to be aware that the possibility of improvement is limited by the degree of overlap between the list of pairs in the prior interaction databases and list of pairs in the literature reference network. Many of the pairs in the literature databases were not found in the interaction databases. In fact this was the case for about 90% for these data. Such interactions are, when evaluated on the literature network, treated as false positives, but might of course be true positives. Of the 90% of the pairs in the literature network that were not in the prior databases, the evidence for grouping the genes together was based solely on the correlation in the microarray data. Another limitation of the literature network is that it itself might contain false positives. Genes that are mentioned in the same PubMed abstract are not necessarily connected.
We have here focused on finding groups of connected genes, and not on discovering direct interactions. Direct interactions are often inferred using Bayesian networks (BN). As mentioned in the Background section, the BN formalism allows for incorporation of prior knowledge, and BN methods for genomic data has indeed been proposed. However, constructing largescale networks using BN methodology is very difficult as the number of possible configurations is about exponential in the number of genes. Formally this has been shown to be an NPComplete problem. By instead focusing on groups, we heavily reduce the number of possible configurations. Thus, our method can handle many more genes than BN methods. This is relevant, as microarray data may involve several hundred regulated genes. One way of using our method is to apply it as an initial step prior to a BN analysis; If the number of genes is too large to apply BN methodology to the full set, our method can be used to first find smallersized, independent sets of genes, followed by separate BN analysis on each of the subsets. The result of such an analysis will be similar to the networks shown in Figures 2 and 3, where prior pairs within each cluster are shown. We believe that the connections shown in Figures 2 and 3 are reliable and robust, as they display connections between genes that are coregulated and that have a previously shown connection. Using BN on each cluster could lead to an improvement, as novel interactions can be detected based on strong correlations in the data, and is one possibility for future methodological development.
Along with the development of the primarily datadriven methods, various methods focusing more on the prior information have been proposed, e.g. [5, 50, 51]. In these approaches, the authors made use of the observed changes in the experimental data, e.g. the most differentially regulated genes, while the interactions among these genes were derived from prior information only. For example, Huang et al.[5] applied the socalled Prizecollecting Steiner tree to select nodes and interactions. As we speak, the prior knowledge about molecular interactions is rapidly increasing. For instance, the data generated by the largescale consortium project Encode [52] represent huge possibilities for extending the source of prior knowledge about transcriptional regulation. Obviously, methods enabling the use of prior knowledge when constructing genetic networks will increase their relevance in line with this development.
Conclusions
We have presented a novel method for finding modules of interacting molecules, representing biological pathways or processes, based on microarrays or other transcriptomic data. The method is a clustering approach, which is a commonly used technique for analyzing transcriptomic data. Unlike previous approaches using Bayesian statistics to infer clusters, e.g. [9, 10], we used informative (as opposed to noninformative) priors. The prior information was assumed to be in form of pairs of connected genes, along with a connection score, as this enabled us to capture a lot of relevant prior information, like proteinprotein interactions, transcription factor binding information and protein sequence similarity measurements. Using simulated data our method showed improved ability of identifying correct groups compared to traditional clustering, especially for small sample sizes, and for situations where most priors were correctly specified. When applying the method to realworld microarray data in heart failure and melanoma cancer we found clusters overlapping with known pathogenic processes, but where subnetworks of prior pairs within each cluster pointed to new connections extending beyond the classical disease pathways.
Availability
The method is implemented in C++ and R and is available from: http://folk.uio.no/trondr/gene_corr, as well as a tool in the Galaxy [53] based Genomic HyperBrowser [54], see http://hyperbrowser.uio.no/dev2 under assorted tools →MCMC gene corr. The data sets supporting the results of this article are available from the Gene Expression Omnibus repository: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36074 (heart failure data) and http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7553 (melanoma cancer data).
References
 1.
Alexa A, Rahnenfuhrer J, Lengauer T: Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics. 2006, 22 (13): 16001607. 10.1093/bioinformatics/btl140.
 2.
Vaske CJ, Benz SC, Sanborn JZ, Earl D, Szeto C, Zhu J, Haussler D, Stuart JM: Inference of patientspecific pathway activities from multidimensional cancer genomics data using PARADIGM. Bioinformatics. 2010, 26 (12): i237i245. 10.1093/bioinformatics/btq182.
 3.
Huang DW, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009, 4 (1): 4457.
 4.
Werner: Bioinformatics applications for pathway analysis of microarray data. Curr Opin Biotechnol. 2008, 19: 5054. 10.1016/j.copbio.2007.11.005.
 5.
Huang SsC, Fraenkel E: Integrating proteomic, transcriptional, and interactome data reveals hidden components of signaling and regulatory networks. Sci Signal. 2009, 2 (81): ra40
 6.
Djebbari A, Quackenbush J: Seeded Bayesian networks: constructing genetic networks from microarray data. BMC Syst Biol. 2008, 2: 5710.1186/17520509257.
 7.
Allison DB, Gadbury GL, Heo MS, Fernandez JR, Lee CK, Prolla TA, Weindruch R: A mixture model approach for the analysis of microarray gene expression data. Comput Stat Data Anal. 2002, 39 (1): 120. 10.1016/S01679473(01)000469.
 8.
Kaufman L, Rousseeouw P: Statistical Data Analysis Based on the L1norm and Related Methods. 1987, North Holland: Amsterdam
 9.
Fraley C, Raftery A: MCLUST: software for modelbased cluster analysis. J Classif. 1999, 16 (2): 297306. 10.1007/s003579900058.
 10.
Booth JG, Casella G, Hobert JP: Clustering using objective functions and stochastic search. J Roy Stat Soc B Stat Meth. 2008, 70 (1): 119139. 10.1111/j.14679868.2007.00629.x.
 11.
Tseng G, Wong W: Tight clustering: a resamplingbased approach for identifying stable and tight patterns in data. Biometrics. 2005, 61 (1): 1016. 10.1111/j.0006341X.2005.031032.x.
 12.
Lee H, Hsu A, Sajdak J, Qin J, Pavlidis P: Coexpression analysis of human genes across many microarray data sets. Genome Res. 2004, 14 (6): 10851094. 10.1101/gr.1910904.
 13.
Friedman N, Linial M, Nachman I, Pe’er D: Using Bayesian networks to analyze expression data. J Comput Biol. 2000, 7: 601620. 10.1089/106652700750050961.
 14.
Werhli A, Husmeier D: Reconstructing gene regulatory networks with bayesian networks by combining expression data with multiple sources of prior knowledge. Stat Appl Genet Mol Biol. 2007, 6: 81
 15.
Geier F, Timmer J, Fleck C: Reconstructing generegulatory networks from time series, knockout data, and prior knowledge. BMC Syst Biol. 2007, 1: 1110.1186/17520509111.
 16.
Mukherjee S, Speed TP: Network inference using informative priors. Proc Natl Acad Sci U S A. 2008, 105 (38): 1431314318. 10.1073/pnas.0802272105.
 17.
Gao S, Wang X: Quantitative utilization of prior biological knowledge in the Bayesian network modeling of gene expression data. BMC Bioinformatics. 2011, 12: 35910.1186/1471210512359.
 18.
Segal E, Shapira M, Regev A, Pe’er D, Botstein D, Koller D, Friedman N: Module networks: identifying regulatory modules and their conditionspecific regulators from gene expression data. Nat Genet. 2003, 34 (2): 166176. 10.1038/ng1165.
 19.
Yeung K, Fraley C, Murua A, Raftery A, Ruzzo W: Modelbased clustering and data transformations for gene expression data. Bioinformatics. 2001, 17 (10): 977987. 10.1093/bioinformatics/17.10.977. [3rd Georgia TechEmory International Conference on Bioinformaticsin Silico Biology: Bioinformatics after the Human Genome, ATLANTA, GEORGIA, NOV 15–18, 2001],
 20.
Medvedovic M, Sivaganesan S: Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics. 2002, 18 (9): 11941206. 10.1093/bioinformatics/18.9.1194.
 21.
Geyer C: MarkovChain MonteCarlo maximumlikelihood. Computing Science and Statistics. Edited by: Keramidas EM, Fairfax Station VA. 1991, Interface Foundation of North America Inc, 156163.
 22.
Fritsch A: mcclust: Process an MCMC Sample of Clusterings. 2012, [http://CRAN.Rproject.org/package=mcclust]. [R package version 1.0]
 23.
Razick S, Magklaras G, Donaldson IM: iRefIndex: a consolidated protein interaction database with provenance. BMC Bioinformatics. 2008, 9: 40510.1186/147121059405.
 24.
Ernst J, Plasterer HL, Simon I, BarJoseph Z: Integrating multiple evidence sources to predict transcription factor binding in the human genome. Genome Res. 2010, 20 (4): 526536. 10.1101/gr.096305.109.
 25.
Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSIBLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 33893402. 10.1093/nar/25.17.3389.
 26.
Thalamuthu A, Mukhopadhyay I, Zheng X, Tseng GC: Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics. 2006, 22 (19): 24052412. 10.1093/bioinformatics/btl406.
 27.
Tibshirani R, Walther G, Hastie T: Estimating the number of clusters in a data set via the gap statistic. J Roy Stat Soc B Stat Meth. 2001, 63 (2): 411423. 10.1111/14679868.00293.
 28.
Hubert L, Arabie P: Comparing partitions. J Classif. 1985, 2: 193218. 10.1007/BF01908075.
 29.
Skrbic B, Bjørnstad JL, Marstein HS, Carlson CR, Sjaastad I, Nygård S, Bjørnstad S, Christensen G, Tønnessen T: Differential regulation of extracellular matrix constituents in myocardial remodeling with and without heart failure following pressure overload. Matrix Biol. 2013, 32 (2): 133142. 10.1016/j.matbio.2012.11.011.
 30.
Benjamini Y, Hochberg Y: Controlling the false discovery rate  a practical and powerful approach to multiple testing. J Roy Stat Soc B Stat Meth. 1995, 57 (1): 289300.
 31.
Shannon P, Markiel A, Ozier O, Baliga N, Wang J, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 2003, 13 (11): 24982504. 10.1101/gr.1239303.
 32.
The Gene Ontology Consortium: Gene ontology: tool for the unification of biology. Nat Genet. 2000, 25: 2529. 10.1038/75556.
 33.
Falcon S, Gentleman R: Using GOstats to test gene lists for GO term association. Bioinformatics. 2007, 23 (2): 257258. 10.1093/bioinformatics/btl567.
 34.
Riker AI, Enkemann SA, Fodstad O, Liu S, Ren S, Morris C, Xi Y, Howell P, Metge B, Samant RS, Shevde LA, Li W, Eschrich S, Daud A, Ju J, Matta J: The gene expression profiles of primary and metastatic melanoma yields a transition point of tumor progression and metastasis. BMC Med Genomics. 2008, 1: 1310.1186/17558794113.
 35.
Adamic L, Wilkinson D, Huberman B, Adar E: A literature based method for identifying genedisease connections. CSB2002: IEEE Computer Society Bioinformatics Conference. 2002, IEEE Computer Society, 109117. [1st International IEEEComputerSociety Bioinformatics Conference (CSB2002), Stanford univ, Stanford, CA, Aug 14–16, 2000],
 36.
Bowers SLK, Banerjee I, Baudino TA: The extracellular matrix: at the center of it all. J Mol Cell Cardiol. 2010, 48 (3, SI): 474482. 10.1016/j.yjmcc.2009.08.024.
 37.
Fomovsky GM, Thomopoulos S, Holmes JW: Contribution of extracellular matrix to the mechanical properties of the heart. J Mol Cell Cardiol. 2010, 48 (3, SI): 490496. 10.1016/j.yjmcc.2009.08.003.
 38.
Sottile J, Shi F, Rublyevska I, Chiang HY, Lust J, Chandler J: Fibronectindependent collagen I deposition modulates the cell response to fibronectin. Am J Physiol Cell Physiol. 2007, 293 (6): C1934C1946. 10.1152/ajpcell.00130.2007.
 39.
Stearns M: Alendronate blocks TGFbeta 1 stimulated collagen 1 degradation by human prostate PC3 ML cells. Clin Exp Metastasis. 1998, 16 (4): 332339.
 40.
Ingwall JS: Energy metabolism in heart failure and remodelling. Cardiovasc Res. 2009, 81 (3): 412419.
 41.
Lehman JJ, Kelly DP: Gene regulatory mechanisms governing energy metabolism during cardiac hypertrophic growth. Heart Fail Rev. 2002, 7: 175185. 10.1023/A:1015332726303. [10.1023/A:1015332726303],
 42.
Lehman J, Kelly D: Transcriptional activation of energy metabolic switches in the developing and hypertrophied heart. Clin Exp Pharmacol Physiol. 2002, 29 (4): 339345. 10.1046/j.14401681.2002.03655.x.
 43.
Caubet C, Jonca N, Brattsand M, Guerrin M, Bernard D, Schmidt R, Egelrud T, Simon M, Serre G: Degradation of corneodesmosome proteins by two serine proteases of the kallikrein family, SCTE/KLK5/hK5 and SCCE/KLK7/hK7. J Invest Dermatol. 2004, 122 (5): 12351244. 10.1111/j.0022202X.2004.22512.x.
 44.
Smith E, Fuchs E: Defining the interactions between intermediate filaments and desmosomes. J Cell Biol. 1998, 141 (5): 12291241. 10.1083/jcb.141.5.1229.
 45.
Kouklis P, Hutton E, Fuchs E: Making a connection: direct binding between keratin intermediate filaments and desmosomal proteins. J Cell Biol. 1994, 127 (4): 10491060. 10.1083/jcb.127.4.1049.
 46.
Meng J, Bornslaeger E, Green K, Steinert P, Ip W: Twohybrid analysis reveals fundamental differences in direct interactions between desmoplakin and cell typespecific intermediate filaments. J Biol Chem. 1997, 272 (34): 2149521503. 10.1074/jbc.272.34.21495.
 47.
Hofmann I, Mertens C, Brettel M, Nimmrich V, Schnolzer M, Herrmann H: Interaction of plakophilins with desmoplakin and intermediate filament proteins: an in vitro analysis. J Cell Sci. 2000, 113 (13): 24712483.
 48.
Xiong G, Wang C, Evers BM, Zhou BP, Xu R: ROR alpha suppresses breast tumor invasion by inducing SEMA3F expression. Cancer Res. 2012, 72 (7): 17281739. 10.1158/00085472.CAN112762.
 49.
Taylor JM, Street TL, Hao L, Copley R, Taylor MS, Hayden PJ, Stolper G, Mott R, Hein J, Moffatt MF, Cookson WOCM: Dynamic and physical clustering of gene expression during epidermal barrier formation in differentiating keratinocytes. PLOS ONE. 2009, 4 (10): e765110.1371/journal.pone.0007651.
 50.
Dittrich MT, Klau GW, Rosenwald A, Dandekar T, Mueller T: Identifying functional modules in proteinprotein interaction networks: an integrated exact approach. Bioinformatics. 2008, 24 (13): I223I231. 10.1093/bioinformatics/btn161.
 51.
HaibeKains B, Olsen C, Djebbari A, Bontempi G, Correll M, Bouton C, Quackenbush J: Predictive networks: a flexible, open source, web application for integration and analysis of human gene networks. Nucleic Acid Res. 2012, 40 (D1): D866D875. 10.1093/nar/gkr1050.
 52.
The ENCODE Project Consortium: An integrated encyclopedia of DNA elements in the human genome. Nature. 2012, 489 (article number 3): 543572.
 53.
Giardine B, Riemer C, Hardison R, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent W, Nekrutenko A: Galaxy: a platform for interactive largescale genome analysis. Genome Res. 2005, 15 (10): 14511455. 10.1101/gr.4086505.
 54.
Sandve GK, Gundersen S, Rydbeck H, Glad IK, Holden L, Holden M, Liestol K, Clancy T, Ferkingstad E, Johansen M, Nygaard V, Tostesen E, Frigessi A, Hovig E: The Genomic HyperBrowser: inferential genomics at the sequence level. Genome Biol. 2010, 11: R12110.1186/gb20101112r121.
Acknowledgements
We thank Ian Donaldson for help using the iRefIndex database, Torbjørn Rognes for providing sequence homology scores using BLAST, Jason Ernst for giving us TF binding predictions, Kjetil Klepper for providing mappings of PWM ids to gene symbols, Kai Trengereid and Morten Johansen for incorporating the software into the Genomic HyperBrowser, and John Quackenbush and his lab for hosting internship of SN during which the idea of this project was conceived. We also would like to thank the anonymous reviewers for their detailed and constructive comments. The work of SN was supported by the SouthEastern Norway Regional Health Authority.
Author information
Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
SN and TR developed the methodology. SN, TR and VN implemented the software. SN, TR, TC, VN, JB, BS and TT generated and collected data used in this paper. SN, TR, TC, JB, GC and EH analyzed the results. SN, TR, JB, GC and EH wrote the article. All authors participated in discussions, read and approved the final manuscript.
Electronic supplementary material
12859_2013_6368_MOESM1_ESM.PDF
Additional file 1: Calculations and parallel tempering description. Calculations of expressions used in the MCMC algorithm and description of the use of parallel tempering. (PDF 79 KB)
12859_2013_6368_MOESM2_ESM.PDF
Additional file 2: Monte Carlo estimation of prior. Evaluation of the precision of the Monte Carlo estimates of the prior compared to the exact calculation. (PDF 419 KB)
12859_2013_6368_MOESM3_ESM.PDF
Additional file 3: Figure S2: Number of estimated clusters. GAP index is used for all methods except ours, which inherently finds the number of clusters. hclust = hierarchical clustering, kmeans = k  means clustering, PAM = Prediction Around Medoids, Mclust = modelbased clustering, tight = tight clustering, MCIPA, is our method (MCMC Clustering using Informative Priors), but with no priors used, MCIPB is our method using priors with 20% of the priors misspecified, and MCIPC is our method with all prior pairs correctly specified. (PDF 11 KB)
12859_2013_6368_MOESM4_ESM.PDF
Additional file 4: Figure S1: Investigation of the effect of misspecified priors on model performance. The figure shows boxplots of adjusted Rand index values for 50 simulated datasets using the simulation setup described in the Results section, using extra variation SD of 0, 1 and 2, and number of individuals N of 10, and 0. (PDF 12 KB)
12859_2013_6368_MOESM5_ESM.PDF
Additional file 5: Figure S3: Evaluation of performance in terms of adjusted Rand index following the simulation scheme of [26]. The simulation setup is identical to the one generating Figure 1, except that for this figure the number of clusters where fixed to the true number of clusters (five). hclust = hierarchical clustering, kmeans = kmeans clustering, PAM = Prediction Around Medoids, Mclust = modelbased clustering, tight = tight clustering, MCIPA, is our method (MCMC Clustering using Informative Priors), but with no priors used, MCIPB is our method using priors with 20% of the priors misspecified, and MCIPC is our method with all prior pairs correctly specified. (PDF 12 KB)
12859_2013_6368_MOESM6_ESM.PDF
Additional file 6: Figure S4: The results of applying our method to the most differentially expressed genes between aorta banding and sham in the microarray heart failure data. Red node color means upregulated in aorta banding vs sham, green color downregulated. (PDF 483 KB)
GO results MCIP with priors, heart failure data.
Additional file 7: Results of Gene ontology analysis using GOstats [33] of heart failure clusters found using our method with priors. (PDF 40 KB)
12859_2013_6368_MOESM8_ESM.PDF
Additional file 8: GO results MCIP without priors, main cluster, heart failure data. Results of Gene ontology analysis of main heart failure cluster found using our method without priors. (PDF 39 KB)
12859_2013_6368_MOESM9_ESM.PDF
Additional file 9: GO results Kmeans clustering, main cluster, heart failure data. Results of Gene ontology analysis of main heart failure cluster found using Kmeans clustering. (PDF 40 KB)
12859_2013_6368_MOESM10_ESM.PDF
Additional file 10: Figure S5. The results of applying our method to the most differentially expressed genes between metastatic and nonmetastatic melanoma cancer patients. Red node color means upregulated in metastatic melanoma, green color downregulated. (PDF 204 KB)
12859_2013_6368_MOESM11_ESM.PDF
Additional file 11: GO results MCIP with priors, melanoma data. Results of Gene ontology analysis of melanoma clusters found using our method with priors. (PDF 54 KB)
12859_2013_6368_MOESM12_ESM.PDF
Additional file 12: GO results MCIP without priors, main cluster, melanoma cancer data. Results of Gene ontology analysis of main melanoma cluster found using our method without priors. (PDF 40 KB)
12859_2013_6368_MOESM13_ESM.PDF
Additional file 13: GO results Kmeans clustering, main cluster, melanoma cancer data. Results of Gene ontology analysis of main melanoma cluster found using Kmeans clustering. (PDF 24 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Nygård, S., Reitan, T., Clancy, T. et al. Identifying pathogenic processes by integrating microarray data with prior knowledge. BMC Bioinformatics 15, 115 (2014). https://doi.org/10.1186/1471210515115
Received:
Accepted:
Published:
Keywords
 Markov Chain Monte Carlo
 Bayesian Network
 Gene Pair
 Prior Information
 Area Under Curve