Identifying pathogenic processes by integrating microarray data with prior knowledge
 Ståle Nygård^{1, 2, 3}Email author,
 Trond Reitan^{4},
 Trevor Clancy^{5},
 Vegard Nygaard^{5},
 Johannes Bjørnstad^{3, 6},
 Biljana Skrbic^{3, 6},
 Theis Tønnessen^{3, 6},
 Geir Christensen^{2, 3} and
 Eivind Hovig^{5, 7, 8}
DOI: 10.1186/1471210515115
© Nygård et al.; licensee BioMed Central Ltd. 2014
Received: 14 November 2013
Accepted: 9 April 2014
Published: 24 April 2014
Abstract
Background
It is of great importance to identify molecular processes and pathways that are involved in disease etiology. Although there has been an extensive use of various highthroughput methods for this task, pathogenic pathways are still not completely understood. Often the set of genes or proteins identified as altered in genomewide screens show a poor overlap with canonical disease pathways. These findings are difficult to interpret, yet crucial in order to improve the understanding of the molecular processes underlying the disease progression. We present a novel method for identifying groups of connected molecules from a set of differentially expressed genes. These groups represent functional modules sharing common cellular function and involve signaling and regulatory events. Specifically, our method makes use of Bayesian statistics to identify groups of coregulated genes based on the microarray data, where external information about molecular interactions and connections are used as priors in the group assignments. Markov chain Monte Carlo sampling is used to search for the most reliable grouping.
Results
Simulation results showed that the method improved the ability of identifying correct groups compared to traditional clustering, especially for small sample sizes. Applied to a microarray heart failure dataset the method found one large cluster with several genes important for the structure of the extracellular matrix and a smaller group with many genes involved in carbohydrate metabolism. The method was also applied to a microarray dataset on melanoma cancer patients with or without metastasis, where the main cluster was dominated by genes related to keratinocyte differentiation.
Conclusion
Our method found clusters overlapping with known pathogenic processes, but also pointed to new connections extending beyond the classical pathways.
Background
Highthroughput experimental techniques provide objective views of the molecular changes that occur in cells during disease progression. A widely used tool when trying to understand the biological meaning of the observed changes is pathway analysis. In pathway analysis, the list of significantly altered genes or molecules is mapped onto usually precompiled pathways. A wide range of methods for finding predefined pathways overrepresented by significant genes have been developed and are steadily used, e.g. [1–3]. However, this strategy for identifying the important disease pathways is limited, as pathways are not fixed and change with context [4]. As a consequence, many of the significantly altered genes or molecules fall outside of the expected pathways [5].
An alternative approach, which is fundamentally different from the pathway analysis just described, is a purely datadriven approach, in which the experimental data is used without any guidance from prior knowledge about molecular interactions. The goal is then to discover hidden patterns of correlation between genes reflecting the complex processes and pathways that underlie cellular metabolism and physiology [6]. The most common approach for this task is unsupervised clustering. This class of methods was of the first to be applied to microarray data [7], and several approaches have been proposed, e.g hierarchical and kmeans clustering, Prediction Around Medoids (PAM) [8], modelbased clustering via Markov chain Monte Carlo (MCMC), e.g [9, 10], tight clustering [11] and clustering via networks [12]. Another way of learning molecular relationships from microarray data is to use Bayesian network (BN) methodology. Instead of finding groups of correlated molecules, BN methodology infers a network of (direct) interacting molecules. BNs were first proposed for microarray data in 2000 [13], and have also been much applied.
Along with the development of such unsupervised methods, there has been a steady development of prior information about molecular interactions. In addition to the mentioned pathways, there are many databases containing pairwise connections, such as proteinprotein interactions, transcription factors binding to genes and protein sequence similarities. Such information is maybe a more reliable source of information than pathways, as the latter lack proper definitions, and are as already mentioned heavily context dependent.
The goal of the analysis is the posterior probability p (MD), i.e. the probability that the model (M) is correct given the data (D). In our setting, the model M represents a clustering or a network configuration, whereas D represents the microarray data. Many methods using BN methodology to incorporate informative priors have been proposed quite recently, e.g [14–17]. However, even with the help of prior information, inferring Bayesian networks for large numbers of genes remains computationally challenging, and the number of genes that can be included is usually smaller than one hundred. Inference about groups or clusters of genes is a less daunting task, as there are fewer group/cluster configurations than network configurations. For cluster analysis, there has been some, but not many, suggestions for how to make use of prior knowledge, e.g [18]. Many Bayesian modelbased clustering methods have been proposed, e.g. [9, 10, 19, 20] but common to all of these is that they use noninformative priors. Bayesian clustering with truly informative priors has to our knowledge not previously been proposed.
In the present paper, we develop a novel method for clustering using Bayesian statistics and informative priors making use of Markov chain Monte Carlo (MCMC) sampling. We name our method MCIP (MCMC Clustering with Informative Priors). In our method, coregulation patterns in the microarray data are used together with prior knowledge to find groups or modules of interacting genes. Specifically, we propose a method which searches for an optimal partitioning of the genes into functional modules, where module assignment is based on both microarray data and prior knowledge. The prior value for assigning genes to the same module is retrieved from databases containing gene pairs with a previously reported interaction or connection. Markov chain Monte Carlo (MCMC) sampling is used to search for the most reliable modules. Having found the modules, we generate subnetworks consisting of the prior pairs within each module, hypothesizing that these pairs represent the direct interactions. In the presented work we use prior information in forms of proteinprotein interaction data, transcription factor binding predictions and protein sequence similarities. We apply our method to simulated data as well as two realworld microarray data sets, one in heart failure and one in melanoma cancer.
Methods
Assume we have a genomic dataset D, e.g. a microarray dataset, with measurements of n genes. We want to use D together with prior knowledge about molecular interactions to find the most likely partitioning into groups or modules of functionally related genes. Now assume that the n genes represent K different modules, where K is not known in advance. Let M = {(i_{ m },j_{ m },p_{ m }),m ∈ {1,…,q}} be the collection of q gene pairs for which prior knowledge exist, where ${p}_{m}={p}_{{i}_{m},{j}_{m}}$ is the prior probability that gene i_{ m }and gene j_{ m }belong to the same module.
Marginal likelihood given grouping
where N is the number of rows in D (number of individuals), X_{ k }is the data matrix for genes belonging to group k and Γ_{ i }() is the multivariate gamma function. If the gene expressions are standardized to have mean values equal to zero and variances equal to one, it is natural to choose ρ_{ k }= 1 and m_{ k }= n_{ k }+ 2, as this yields expected values of 1 for the variances, i.e. for the diagonal elements. The expectancy of the inverse Wishart distribution is defined only for m_{ k }> n_{ k }+ 1, so this is a minimalistic way of coding for the knowledge that the data has been standardized.
Probability of grouping given prior information
where N (K,K) = 1. Let N_{ ij }(n,K) = K · N (n  2,K) + N (n  2,K  1) = N (n  1,K) be the number of subdivisions of n genes into K groups where i and j are in the same group. Let $N\left(n\right)=\sum _{K=1}^{n}N(n,K)$ be the number of subdivisions of n genes in total and ${N}_{\mathit{\text{ij}}}\left(n\right)=\sum _{K=1}^{n}{N}_{\mathit{\text{ij}}}(n,K)=\sum _{K=1}^{n}N(n1,K)=N(n1)$ be the number of subdivisions of n genes where i and j are in the same group.
We denote this probably measure ${P}_{{M}_{0}}$ and refer to it as the baseline prior.
Call this probability measure ${P}_{{M}_{1,1}}$.
Note that this expression increases exponentially with the number of prior pairs q. In order to avoid computational cost increasing exponentially with the number of prior pairs, we developed a Monte Carlo estimation of P(g∣ M), described in Additional file 1.
For a given grouping, the baseline prior contributes to the overall probability of a grouping with an additive factor (1  p_{1})(1  p_{2}) ⋯ (1p_{ q }) P (gM_{0}) (see Eq. 2 and the expression for the baseline prior P(gM_{0})). This implies that the baseline prior will contribute to the probability of two genes being clustered. Define P_{ b } as the probability for two genes being in the same group given the baseline prior M_{0}. It is important that this probability is nonzero, in order to allow for the possibility of two arbitrary genes being in the same group whether they are in the set of prior pairs or not. If not, it would be impossible to group gene pairs that are not in the set of specified prior pairs. However, this implies that the probability of grouping two genes (i_{ m }, j_{ m }) in a prior pair will be be the mixture p_{ m }+ (1  p_{ m })P_{ b }, as there is a nonzero probability that the genes are connected, even if they should not be connected according to the prior information. For instance, if the baseline prior for two arbitrary genes to be connected is 0.1 and pair number m has prior enforcement probability p_{ m }= 0.8, the total prior probability for the pair (i_{ m },j_{ m }) to be connected will be 0.82.
There will be a baseline probability for two arbitrary genes to be connected. We have become aware that with equal probability for all number of groups and equal probability for each grouping given the number of groups, the baseline for any two genes to be connected will be dependent on the total number of genes, n. In general, we get that the prior for groupings gives that the baseline probability, P_{ b }for two genes to be connected when there are K groups will be P(i and j in the same group ∣ K groups) = N (n  1,K)/N(n,K). The total probability will then be ${P}_{b}=(1/n)\sum _{K}N(n1,K)/N(n,K)$, which depends on the number of genes in total, n. For n = 100, P_{ b }= 0.048 while for the extreme case n = 10, P_{ b }= 0.25. It might be seen as a weakness that there is any dependency on the number of genes in the dataset on the baseline probability. An alternative approach could be to specify the baseline prior according to a given baseline probability for any arbitrary pair of genes. This would make for a more consistent baseline prior, but has the disadvantage that the baseline prior has to be set manually. An alternative would be to make the baseline prior into a parameter to be estimated in a hierarchical Bayesian model.
Removing cycles
The set of priors might contain cycles, which can easily occur, e.g if both direct and indirect connections are included. We have developed an algorithm for detecting cycles, and if such are found, the prior pair with the smallest prior probability (smallest extra connection probability) is removed, as this connection is interpreted as a result of the rest of the cycle. The reason for excluding cycles in the prior is that with cycles, the pairwise specification of prior probabilities of pairs could be misleading. As an example, let us assume the prior is specified so that there is a 0.8 prior probability of a pairing between gene A and B (prior pair 1), between B and C (prior pair 2) and between A and C (prior pair 3). Then the probability for a forced connection between A and B (thus disregarding the baseline probability) will be P(A and B forced to be in the same group) = P(prior pair 1 enforced) + P(prior pair 1 not enforced)P(prior pair 2 enforced)P(prior pair 3 enforced) = 0.8 + 0.2*(0.8*0.8) = 0.928. Thus in order for the pairwise prior probabilities to be interpreted as the probability for two pairs to be forced to be connected, cycles should be avoided.
Markov chain Monte Carlo procedure for integrative networks
We use Markov chain Monte Carlo (MCMC) to sample from the posterior distribution P (D ∣ g,M), as the analytical solution is not known. We will use the Metropolis Hastings algorithm, which is the most general version of MCMC. Specifically, we propose the following algorithm:

Start with a random grouping, called g.

Sample N groupings g based on the following scheme:

Propose either (each with probability 1/3)

∗ A partitioning of one of the groups (except when K = n)

∗ A merging of two of the groups (except when K = 1)

∗ A transition of one of the genes to another group (except when K = 1)


Call the new grouping g_{new}.

Accept g_{new} with the following probability:$\phantom{\rule{14.0pt}{0ex}}P\left(\text{accept}{\mathit{g}}_{\text{new}}\right)\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}\text{min}\phantom{\rule{0.3em}{0ex}}\left(\phantom{\rule{0.3em}{0ex}}1,\frac{P(D\phantom{\rule{0.3em}{0ex}}\mid \phantom{\rule{0.3em}{0ex}}{\mathit{g}}_{\text{new}})P({\mathit{g}}_{\text{new}}\mid M)Q({\mathit{g}}_{\text{old}}\phantom{\rule{0.3em}{0ex}}\mid \phantom{\rule{0.3em}{0ex}}{\mathit{g}}_{\text{new}})}{P(D\phantom{\rule{0.3em}{0ex}}\mid \phantom{\rule{0.3em}{0ex}}{\mathit{g}}_{\text{old}})P({\mathit{g}}_{\text{old}}\phantom{\rule{0.3em}{0ex}}\mid \phantom{\rule{0.3em}{0ex}}M)Q({\mathit{g}}_{\text{new}}\phantom{\rule{0.3em}{0ex}}\mid \phantom{\rule{0.3em}{0ex}}{\mathit{g}}_{\text{old}})}\phantom{\rule{0.3em}{0ex}}\right)\phantom{\rule{0.3em}{0ex}},$

where P (D ∣ g) is given by (1), P (g∣ M) is given by (2) and Q(g_{x}g_{y}) is the probability of proposing the group configuration g_{x} from the configuration g_{y}, and is given by (35) in the next section.
In order to avoid convergence to local maxima, we implemented parallel tempering [21], as described in Additional file 1.
Inferring clusters from MCMC samples
The above MCMC procedure results in a series of samples g. These samples can be used to calculate the posterior similarity matrix (PSM), in which each entry (i,j) gives the proportion p_{ ij }of samples gene i and gene j occur together in the same cluster. We infer clusters from the PSM using the minbinder function in the mcclust R library [22]. Specifically, minbinder makes use of hierarchical clustering with the PSM as distance matrix together with the cuttree function to produce cluster configurations with the number of clusters ranging from 1 to L, where L is a userspecified maximum. Now letting I_{ K }(i,j) be the indicator for whether gene i and j are in the same cluster for the configuration using K clusters (based on hierarchical clustering and the cuttree function), and using absolute difference as loss function, the posterior expected loss e_{ K }for K clusters is calculated in minbinder as ${e}_{K}=\sum _{i<j}\left{I}_{K}\right(i,j){p}_{\mathit{\text{ij}}}$. The inferred cluster configuration is the result of hierarchical clustering and cuttree for $\widehat{K}$ clusters, where $\widehat{K}$ is the K minimizing the posterior expected loss, i.e. $\widehat{K}=\underset{K}{\text{min}}{e}_{K}$.
Proposal distribution
Let n, as before, be the total number of genes, and let n_{ k }be the number of genes in group k and n_{ s }be the number of singlemembered groups. If the number of groups K equals 1, the only allowed option is splitting into two new groups. If K = n, we can only have a merging of two genes into a new group. If 1<K < n, all three types of moves are allowed (splitting and merging groups, and moving a gene into another group). Each type of move then has probability 1/3.
Computational aspects
The likelihood estimation involves the determinant function, which is O (n^{3}) in the number n of genes in each module. In our method, the most computationally expensive calculation is the exact calculation of the prior probability P(gM). This calculation is exponential in the number of prior pairs. Often, as is the case for the data used in this paper, the number of pairs for which there exist prior knowledge, can be substantial. To deal with such situations, we developed an approximate estimate of the prior probability based on Monte Carlo simulations. Comparisons on moderately large number of priors suggested that although less accurate, a Monte Carlo estimate of the prior gives results close to the results obtained when calculating the prior exactly (Additional file 2). For larger number of priors, this can not be tested, but experience shows stable results, suggesting that the stochastic nature of the algorithm do not seriously affect the results. With 100 genes and 100 samples and 200 priors, and using 800 for the Monte Carlo prior calculation, it took our method approximately 10 minutes to run 10000 samples on an 64 GB RAM, 16 physical CPU cores compute node.
Prior information
We will use three sources of prior information: proteinprotein interactions, transcription factor binding predictions as well as protein sequence similarity calculations. Gene pairs in these databases have evidence of being functionally related. The strength of the evidence is represented by some type of score for each database. In our MCMC algorithm the prior information is formulated in terms of the prior probability of forcing two genes to belong to the same group. We will use the logit transformation to transform the scores into the unit interval, thus enabling them to be used as a probability measure. Further details are given below for each prior type. There is very little overlap between the prior information databases. In cases where a gene pair is found in more than one database, the highest prior probability will be used.
Protein interaction data
Proteinprotein interactions (PPI) are important for the majority of biological functions. PPI databases contain lists of proteins pairs for which there exist some evidence for interaction. The evidence comes from various types of studies, ranging from highthroughput methods, more traditional lowthroughput proteomics studies, as well as in silico predictions based on known interactions [23]. iRefIndex [23] is a consolidated protein interaction database comprising information from nine wellknown interaction databases. One of the features of the database is the socalled lpr (lowest PMID reuse) score, which is the lowest number of unique interactions that are supported by one of the interaction’s PubMed identifiers (PMID). A low lpr indicates that the interaction is more likely to rely on lowthroughput methods than on highthroughput methods. As the former is recognized as more reliable than the latter, the lpr score can be used as a measure of the reliability of the interaction. We will use a logit transformation taking x to 1/(1 + e xp (  x)) of the inverse of the lpr score to obtain a probability measure (between zero and one). For example, if the lpr score is one, p_{ m }becomes 0.73, if the lpr score is zero, p_{ m }becomes 0.5.
Transcription factor bindings
Transcription factors (TF) are proteins that bind to specific DNA sequences, thereby controlling the expression levels of the corresponding genes. Binding preferences of many transcription factors are known and characterized by a sequence binding motif. However, binding affinity does not depend entirely on the match of the sequence to the motif, but will rely also on sequence specific features. Ernst et al.[24] developed TF binding predictions, where they first found a general binding preference of the sequences based on 28 local genomic features, including histone modifications, conservation and melting temperature. Then, they combined this general binding preference score with motif information for specific transcription factors to improve prediction of genes bound by the factor. We will make use of these scores, and again, we will apply the logit transformation to transform the scores to values between 0 and 1.
Protein sequence similarity
Proteins with similar sequences are likely to be functionally related as the proteins may be expressed by paralogous genes (having common ancestry) or by genes that are selected to have the same function. For instance, two homologous proteins can be phosphorylated by the same kinase, thus playing roles in the same signaling pathway. One additional feature of using protein homology data in this setting is that the function of proteins for which the function is unknown can be learned by borrowing information from their protein homologs. We will calculate percent similarity between each of the human proteins in RefSeq using BLAST [25].
Results
Simulated data
To evaluate the performance of our method we used simulated gene expression data generated according to [26]. In our study, we used a total of five clusters of genes C^{∗} = (C_{1},…,C_{5}) with dimension N (denoted d in [26]) samples. Cluster sizes n_{ c }(c = 1,…,5) were generated from n_{ c }∼ 2 × Poisson (λ). Expression values in cluster C_{ c }were generated using a hierarchical lognormal model as in [26]:
 (1)
A vector of cluster template for cluster C_{ c }was created with four periods of constant expression of size m_{1},m_{2},m_{3} and m_{4}. The sizes m_{ k }, k = 1,…,4, was from a uniform distribution such that $\sum _{k}{m}_{k}=N$ and m_{ k }> 2. An initial template with constant pattern in four periods was simulated from $\text{log}\left(\underset{k}{\overset{\left(c\right)}{\mu}}\right)\sim N\left(\mu ,{\sigma}^{2}\right)$.
 (2)
Sample variability and gene variability: Sample variability ${\sigma}_{s}^{2}$ was introduced and the cluster template ${T}_{j}^{\left(c\right)}$, j = 1,…,N, was generated from $\text{log}\phantom{\rule{0.3em}{0ex}}\left(\underset{j}{\overset{\left(c\right)}{T}}\right)\sim N\left(\text{log}\phantom{\rule{0.3em}{0ex}}\left(\underset{k}{\overset{\left(c\right)}{\mu}}\right),\underset{s}{\overset{2}{\sigma}}\right)$. Then for each gene vector i in sample j, gene variability was added and gene expression values generated as $\text{log}\phantom{\rule{0.3em}{0ex}}\left({x}_{\mathit{\text{ij}}}\right)\sim N\left(\text{log}\left(\underset{j}{\overset{\left(c\right)}{T}}\right),\underset{0}{\overset{2}{\sigma}}\right)$.
 (3)
We repeated steps 1 and 2 to generate five clusters. We used parameter values of μ = 6, σ = 1, σ_{ s }= 1, σ_{0} = 1, and λ = 10.
As in [26], extra variation was introduced to evaluate robustness of clustering methods against potential random errors introduced from experimental procedures, such as sample acquisition, labeling hybridization and scanning. To each element of the logtransformed expression matrix we added a a random error from a normal distribution with mean zero and standard deviation (SD) equal to 0, 1 and 2. In addition, the sample size was varied, using N = 10, 100 and 1000. For each of these nine scenarios, 50 datasets were generated.
For each dataset, three scenarios for the available prior information were used. In (A), we assumed that no prior information was used. In (B), we assumed priors pairs were available, where 20% where misspecified, i.e. 20% of the gene pairs had members belonging to different groups. In the last scenario (C), all pairs were assumed to be correctly specified (the members belonged to the same group). Prior values were generated from a uniform U(0.5,1) distribution.
We compared our method with five wellknown clustering methods for which a software already exist, namely hierarchical clustering, kmeans clustering, Partitioning Around Medoids (PAM) [8], Modelbased clustering (Mclust) [9] and tight clustering [11]. For our method, after a burnin period generating 10K samples, we generated 10K samples from which each 100th sample was chosen. For all methods except ours, the number of clusters were estimated using the Gap index [27] (our method implicitly searches for the optimal number of clusters). For our method, clusters were inferred by minimizing the posterior expected loss based on the MCMC samples as described in the Methods section. The number of clusters estimated by the GAP index as well as our method is shown by boxplots in Additional file 3: Figure S2. The Rand index, defined as the proportion of concordant gene pairs in two partitions among all possible gene pairs, was used as evaluation measure. Specifically, we used the adjusted Rand index [28], which is standardized to have expected value zero when the partitions are randomly generated and takes maximum value one if two partitions are perfectly identical. Unlike the other methods, tight clustering produces clusters where some genes are not allocated to any cluster. In the calculation of the Rand index, only the allocated genes are considered.
Heart failure data
We used the data described in [29], consisting of microarray gene expression measurements from fourteen mice subjected to aortic banding and five sham operated mice. Aortic banding leads to increased left ventricular pressure. To compensate for the increased load, gene expression changes occur resulting in myocardial remodeling, involving hypertrophy of cardiomyocytes. Ultimately, the cardiac hypertrophy might lead to development of heart failure. We based our network analysis on the most differentially expressed genes between aortic banding and sham. To find differentially expressed genes we performed ttests between the two groups, using log2 expression values, before multiple testing correction was performed using the method of [30]. We used a false discovery rate (FDR) cutoff of 5%, and among these genes we picked the 400 with largest fold change (up or down). We looked up connections between these genes and assigned prior probabilities for the pairs based on the prior databases described in the previous section. For each of the three prior databases, there were several hundred pairs where both genes were represented in our inputlist. As the use of so many priors pairs was too computationally demanding for our method, we picked the 50 top scoring pairs for each of the prior types. We applied our MCMC algorithm using altogether 150000 Monte Carlo samples, with the first 50000 samples used for burnin (i.e these first samples were discarded and only used to ascertain a stable starting point in the MCMC sampling). We applied parallel tempering as described in Additional file 1. As we here had more than hundred prior pairs, we approximated P(g∣ M) with the Monte Carlo estimator (1) defined in Additional file 1, using K = 800, as this value gave stable results within a reasonable computation time. Clusters were inferred by minimizing the posterior expected loss based on the posterior similarity matrix, which was calculated from the collection of each of the 100th MCMC sample after the burnin period.
Summarizing the inferred heart failure clusters
Cluster  Size  Upreg. (%)  Edges  Priors (%)  Top three GO terms 

1  315  89.3  98910  0.4  Extracellular region, basement membrane, proteinaceous extracellular matrix 
2  85  28.4  7140  0.1  Positive regulation of cell cycle, polysaccharide metabolic process, carbohydrate 
biosynthetic process 
Melanoma cancer data
Metastatic melanoma is a deadly disease while nonmetastatic melanoma and other cutaneous tumor types are usually cured with surgical removal of the primary tumors. To find network of genes differentially expressed between metastatic and nonmetastatic tumors we used data from [34], which included microarray gene expressions from 47 metastatic and 40 nonmetastatic tumor samples of patients with various cutaneous tumors. As for the heart failure data, we used the 400 most differentially expressed genes (based on fold change) for which Benjamini Hochberg FDR <0.05, and found gene pairs in the PPI, the TF and the sequence similarity database where both genes were represented in our input list. We also here approximated P(g∣ M) with the Monte Carlo estimator (1) of the Additional file 1, K = 800 samples. Again, result clusters were inferred by minimizing the posterior expected loss based on the posterior similarity matrix calculated from each 100th of 150K MCMC samples generated after a burnin period of 50K samples.
Summarizing the inferred melanoma cancer clusters
Cluster  Size  Upreg. (%)  Edges  Priors (%)  Top three GO terms 

1  130  2.3  16770  1.3  Epidermis development, cornified envelope, keratinization 
2  25  4.0  600  0.0  Androgen biosynthetic process, extracellular region, desmosome 
3  28  3.6  756  1.7  Forebrain morphogenesis, response to vitamin A, negative regulation of neuron maturation 
4  25  0.0  600  1.2  Keratinization, peptide crosslinking, desmosome 
5  17  52.9  272  1.8  Testosterone 16alphahydroxylase activity, negative regulation of Rho GTPase activity, 
negative regulation of epidermal growth factoractivated receptor activity  
6  13  61.5  156  1.9  Alcohol sulfotransferase activity, CDPdiacylglycerol biosynthetic process, embryonic 
hindgut morphogenesis  
7  13  53.8  156  1.3  Regulation of mesonephros development, negative regulation of cell proliferation involved 
in mesonephros development, negative regulation of fibroblast growth factor receptor  
signaling pathway involved in ureteric bud formation  
8  12  0.0  132  0.0  Glutamate dehydrogenase (NAD+) activity, glutamate dehydrogenase [NAD(P)+] activity, 
negative regulation of myelination  
9  12  16.7  132  1.5  Serinetype endopeptidase activity, serine hydrolase activity, peptidase activity, acting on 
Lamino acid peptides  
10  10  50.0  90  0.0  Adherens junction organization,interleukin1 Type I receptor binding, dihydrotestosterone 
17betadehydrogenase activity  
11  10  0  90  1.1  Nacylglucosamine 2epimerase activity, osteoclast proliferation, alkaloid catabolic process 
12  10  30.0  90  0.0  Middle ear morphogenesis, vagus nerve morphogenesis, activation of phospholipase 
D activity by Gprotein coupled receptor protein signaling pathway  
13  11  9.1  110  0.9  Scavenger receptor activity, negative regulation of collateral sprouting of intact axon 
in response to injury, regulation of cell adhesion  
14  11  0.0  110  0.9  Positive regulation of cell development, regulation of cell morphogenesis involved in 
differentiation, argininosuccinate synthase activity  
15  12  16.7  132  0.0  Pharynx development, cellular response to external biotic stimulus, nucleus 
16  9  0.0  72  1.4 Alphadystroglycan binding, isopeptide crosslinking via N6(Lisoglutamyl)L lysine,  
positive regulation of arachidonic acid secretion  
17  7  100.0  42  14.3  Vascular transport, interleukin1 Type II blocking receptor activity, Golgi cis cisterna 
18  6  66.7  30  3.3  NLRP1 inammasome complex, 17alpha,20alphadihydroxypregn4en3one 
dehydrogenase activity, androsterone dehydrogenase (Bspecific) activity 
Method evaluation
Comparing performance of our method (MCIP) with and without the use of priors to k means clustering
Heart failure data  Melanoma data  

Sens.  Spec.  PPV  AUC  K  Sens.  Spec.  PPV  AUC  K  
MCIP with priors  0.62  0.68  0.023  0.65  2  0.67  0.6  0.030  0.64  18 
MCIP without priors  0.61  0.64  0.020  0.62  3  0.68  0.53  0.024  0.60  20 
Kmeans  0.52  0.65  0.019  0.59  3  0.70  0.48  0.021  0.58  10 
Discussion
Using simulated data we compared our method to established clustering methods as kmeans and hierarchical clustering. When no prior information is provided our method did not offer any gains over other standard approaches, except for the N = 10, SD = 0 regime. When most priors were correctly specified (80% or 100%), our method was as least as good as the bestperforming established methods for large samples, and superior to the same methods for small sample sizes. We believe that the majority of the priors will be correctly specified in most cases. The reason for this is that there will usually exist so many prior pairs that one would restrict oneself to only those with the strongest evidence. Of course, a previously shown connection might still not be real in the current situation, e.g. if the connection has been discovered in a different tissue type than the one under study. However, this problem will be limited if one is analyzing a set of genes differentially expressed between two conditions, as many of these will be coexpressed, and thus also correlated.
The Gene Ontology analysis of the heart failure network showed that many of the top ranked GO categories for the larger module were related to the extracellular matrix. Awareness has emerged regarding the extracellular matrix changes taking place in myocardial remodeling [36, 37]. When investigating Figure 2, we see that the larger module contains many collagens (type I alpha 1, type III alpha 1, type IV alpha 1, type V alpha 1, type VI alpha 1, type XII alpha 1, type XIV alpha 1 and type XVI a), which are important structural components of the extracellular matrix. The figure shows that these molecules were primarily connected by protein sequence homology (blue lines). Moreover, other molecules thought to be parts of the extracellular matrix were identified in the same module (thrombospondin 3 and 4, fibromodulin, versican, biglycan, fibronectin, elastin). For example, fibronectin has been demonstrated to play a role in the organizing of collagen type I [38]. The module also contained enzymes and hormonal factors known to process the extracellular matrix structural proteins (matrix metallopeptidase 2, transforming growth factor beta 2 and beta 3, latent transforming growth factor beta binding protein 2, lysyl oxidase, TIMP metallopeptidase inhibitor 1). For instance, matrix metallopeptidase 2 is known to be important for the breakdown of collagen type I [39]. Finally, the extracellular matrix related transcription factor early growth response 3 was a hub in this module, indicating a particular importance in this pathological process. Altered metabolism has been described in remodeled and failing hearts. E.g. reduced fatty acid oxidation and an increase in glucose utilization have been reported [40–42]. The smaller module contained several genes involved in carbohydrate metabolic processes (6phosphofructo2kinase, phosphorylase kinase gamma 1, aldolase B fructosebisphosphate, fructose bisphosphatase 2). Taken together, the two modules confirm regulatory and signaling events and structural alterations known to be involved in myocardial remodeling. We believe that the complete result network, shown in Additional file 6: Figure S4, contains many other important, not yet discovered, interactions. However, a detailed investigation of all these connections is beyond the scope of this paper.
The melanoma network was derived from the data of Riker et al.[34]. As the present paper represents a reanalysis of the data, the outcome of an analysis with similar input data should be expected to generate somewhat similar output data. The main difference in approach is that while the Riker paper primarily performs a direct Gene Ontology analysis, the current approach attempts to combine multiple sources of information prior to performing any functional analysis. In this way, the resultant network modules will represent subnets of relevance with respect to the combined prior information, thus presenting local information clusters with higher likelihood of coherence in function for each cluster. Riker et al. observed differentially expressed genes involved in keratinocyte differentiation and epidermal development, including loricrin (LOR), involucrin (IVL), keratin 5 (KRT5) and plakophilin 1 (PKP1), suggesting a loss of epidermal characteristics, in highlighting the desquamation process. In our largest module (module 1) we found many genes related to keratinocyte differentiation and epidermal development (see Additional file 11, Table 1). When looking at Figure 3, where gene pairs with prior information within each module were connected, we note that keratins and kallikreins were connected through a set of interacting genes that included desmocollin 1 and 3 (DSC1 and DSC3), desmoglein 1 and 3 (DSG1 and DSG3), desmoplakin (DSP) and corneodesmosin (CDSN). Many of the genes in this path are related to corneodesmosomes (KLK5 and KLK7: kallikrein 5 and 7, DSG1, DSG3, DSC1, DSC3, DSP, IVL, LOR and keratin 10). Corneodesmosomes are located at the junctional structures that mediate corneocyte cohesion [43], underlining keratinocyte dedifferentiation. The individual connections in this path are supported by previous findings; Caubet et al. [43] showed that KLK5 and KLK7 degrade the desmosomal proteins DSC1 and DSG1. Smith et al.[44] showed that when epidermal cells differentiate, DSC1 and DSG1 are bound by PKP1 and DSP, thereby enhancing anchorage to keratin intermediate filaments. Interactions between plakophilins, desmoplakins and keratins have also been shown by other authors, e.g. [45–47]. Interestingly, the transcription factor RARrelated orphan receptor A (RORA) was downregulated, and placed as a hub in this module. RORA has recently been indicated as a breast tumor suppressor [48], as well as having a role in keratinocyte differentiation [49]. Taken together, Figure 3 suggests that the loss of epidermal characteristics seen in the metastatic individuals is mediated by a downregulation of the RORA transcription factor.
When comparing the GO analysis of the main clusters from our method with the use of priors (Additional file 7 and 11) to the GO analysis of the main clusters using our method without priors (Additional files 8 and 9) and kmeans clustering (Additional files 9 and 13), we note that many of the same GO categories were showing up. But the pvalues were overall much smaller for our method with the use of priors, than the other two, indicating that our method has better sensitivity of finding functionally related gene groups. We note that the path found in the melanoma data linking the RORA transcription factor to the keratins via the kallikreins and the desmosomal proteins, would not have been discovered by our method without priors or kmeans clustering, as none of the these methods gave clusterings where all these gene subgroups were contained in one cluster.
Method evaluation using the literaturederived network showed that adding prior knowledge yielded improved AUC. The AUC was improved from 0.62 to 0.65 for the heart failure data and from 0.6 to 0.64 for the melanoma data. The positive predicted values (PPV) were also improved (0.02 to 0.023 for the heart failure data and 0.024 to 0.03 for the melanoma data). We note that these values are very small, which is because there are very few gene pairs with cocitation (only around one in a hundred of all possible pairs) compared to the number of gene pairs clustered together (more than half of the pairs for the heart failure data, for instance). It is important to be aware that the possibility of improvement is limited by the degree of overlap between the list of pairs in the prior interaction databases and list of pairs in the literature reference network. Many of the pairs in the literature databases were not found in the interaction databases. In fact this was the case for about 90% for these data. Such interactions are, when evaluated on the literature network, treated as false positives, but might of course be true positives. Of the 90% of the pairs in the literature network that were not in the prior databases, the evidence for grouping the genes together was based solely on the correlation in the microarray data. Another limitation of the literature network is that it itself might contain false positives. Genes that are mentioned in the same PubMed abstract are not necessarily connected.
We have here focused on finding groups of connected genes, and not on discovering direct interactions. Direct interactions are often inferred using Bayesian networks (BN). As mentioned in the Background section, the BN formalism allows for incorporation of prior knowledge, and BN methods for genomic data has indeed been proposed. However, constructing largescale networks using BN methodology is very difficult as the number of possible configurations is about exponential in the number of genes. Formally this has been shown to be an NPComplete problem. By instead focusing on groups, we heavily reduce the number of possible configurations. Thus, our method can handle many more genes than BN methods. This is relevant, as microarray data may involve several hundred regulated genes. One way of using our method is to apply it as an initial step prior to a BN analysis; If the number of genes is too large to apply BN methodology to the full set, our method can be used to first find smallersized, independent sets of genes, followed by separate BN analysis on each of the subsets. The result of such an analysis will be similar to the networks shown in Figures 2 and 3, where prior pairs within each cluster are shown. We believe that the connections shown in Figures 2 and 3 are reliable and robust, as they display connections between genes that are coregulated and that have a previously shown connection. Using BN on each cluster could lead to an improvement, as novel interactions can be detected based on strong correlations in the data, and is one possibility for future methodological development.
Along with the development of the primarily datadriven methods, various methods focusing more on the prior information have been proposed, e.g. [5, 50, 51]. In these approaches, the authors made use of the observed changes in the experimental data, e.g. the most differentially regulated genes, while the interactions among these genes were derived from prior information only. For example, Huang et al.[5] applied the socalled Prizecollecting Steiner tree to select nodes and interactions. As we speak, the prior knowledge about molecular interactions is rapidly increasing. For instance, the data generated by the largescale consortium project Encode [52] represent huge possibilities for extending the source of prior knowledge about transcriptional regulation. Obviously, methods enabling the use of prior knowledge when constructing genetic networks will increase their relevance in line with this development.
Conclusions
We have presented a novel method for finding modules of interacting molecules, representing biological pathways or processes, based on microarrays or other transcriptomic data. The method is a clustering approach, which is a commonly used technique for analyzing transcriptomic data. Unlike previous approaches using Bayesian statistics to infer clusters, e.g. [9, 10], we used informative (as opposed to noninformative) priors. The prior information was assumed to be in form of pairs of connected genes, along with a connection score, as this enabled us to capture a lot of relevant prior information, like proteinprotein interactions, transcription factor binding information and protein sequence similarity measurements. Using simulated data our method showed improved ability of identifying correct groups compared to traditional clustering, especially for small sample sizes, and for situations where most priors were correctly specified. When applying the method to realworld microarray data in heart failure and melanoma cancer we found clusters overlapping with known pathogenic processes, but where subnetworks of prior pairs within each cluster pointed to new connections extending beyond the classical disease pathways.
Availability
The method is implemented in C++ and R and is available from: http://folk.uio.no/trondr/gene_corr, as well as a tool in the Galaxy [53] based Genomic HyperBrowser [54], see http://hyperbrowser.uio.no/dev2 under assorted tools →MCMC gene corr. The data sets supporting the results of this article are available from the Gene Expression Omnibus repository: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36074 (heart failure data) and http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7553 (melanoma cancer data).
Declarations
Acknowledgements
We thank Ian Donaldson for help using the iRefIndex database, Torbjørn Rognes for providing sequence homology scores using BLAST, Jason Ernst for giving us TF binding predictions, Kjetil Klepper for providing mappings of PWM ids to gene symbols, Kai Trengereid and Morten Johansen for incorporating the software into the Genomic HyperBrowser, and John Quackenbush and his lab for hosting internship of SN during which the idea of this project was conceived. We also would like to thank the anonymous reviewers for their detailed and constructive comments. The work of SN was supported by the SouthEastern Norway Regional Health Authority.
Authors’ Affiliations
References
 Alexa A, Rahnenfuhrer J, Lengauer T: Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics. 2006, 22 (13): 16001607. 10.1093/bioinformatics/btl140.View ArticlePubMedGoogle Scholar
 Vaske CJ, Benz SC, Sanborn JZ, Earl D, Szeto C, Zhu J, Haussler D, Stuart JM: Inference of patientspecific pathway activities from multidimensional cancer genomics data using PARADIGM. Bioinformatics. 2010, 26 (12): i237i245. 10.1093/bioinformatics/btq182.View ArticlePubMed CentralPubMedGoogle Scholar
 Huang DW, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009, 4 (1): 4457.View ArticleGoogle Scholar
 Werner: Bioinformatics applications for pathway analysis of microarray data. Curr Opin Biotechnol. 2008, 19: 5054. 10.1016/j.copbio.2007.11.005.View ArticlePubMedGoogle Scholar
 Huang SsC, Fraenkel E: Integrating proteomic, transcriptional, and interactome data reveals hidden components of signaling and regulatory networks. Sci Signal. 2009, 2 (81): ra40View ArticlePubMed CentralPubMedGoogle Scholar
 Djebbari A, Quackenbush J: Seeded Bayesian networks: constructing genetic networks from microarray data. BMC Syst Biol. 2008, 2: 5710.1186/17520509257.View ArticlePubMed CentralPubMedGoogle Scholar
 Allison DB, Gadbury GL, Heo MS, Fernandez JR, Lee CK, Prolla TA, Weindruch R: A mixture model approach for the analysis of microarray gene expression data. Comput Stat Data Anal. 2002, 39 (1): 120. 10.1016/S01679473(01)000469.View ArticleGoogle Scholar
 Kaufman L, Rousseeouw P: Statistical Data Analysis Based on the L1norm and Related Methods. 1987, North Holland: AmsterdamGoogle Scholar
 Fraley C, Raftery A: MCLUST: software for modelbased cluster analysis. J Classif. 1999, 16 (2): 297306. 10.1007/s003579900058.View ArticleGoogle Scholar
 Booth JG, Casella G, Hobert JP: Clustering using objective functions and stochastic search. J Roy Stat Soc B Stat Meth. 2008, 70 (1): 119139. 10.1111/j.14679868.2007.00629.x.View ArticleGoogle Scholar
 Tseng G, Wong W: Tight clustering: a resamplingbased approach for identifying stable and tight patterns in data. Biometrics. 2005, 61 (1): 1016. 10.1111/j.0006341X.2005.031032.x.View ArticlePubMedGoogle Scholar
 Lee H, Hsu A, Sajdak J, Qin J, Pavlidis P: Coexpression analysis of human genes across many microarray data sets. Genome Res. 2004, 14 (6): 10851094. 10.1101/gr.1910904.View ArticlePubMed CentralPubMedGoogle Scholar
 Friedman N, Linial M, Nachman I, Pe’er D: Using Bayesian networks to analyze expression data. J Comput Biol. 2000, 7: 601620. 10.1089/106652700750050961.View ArticlePubMedGoogle Scholar
 Werhli A, Husmeier D: Reconstructing gene regulatory networks with bayesian networks by combining expression data with multiple sources of prior knowledge. Stat Appl Genet Mol Biol. 2007, 6: 81Google Scholar
 Geier F, Timmer J, Fleck C: Reconstructing generegulatory networks from time series, knockout data, and prior knowledge. BMC Syst Biol. 2007, 1: 1110.1186/17520509111.View ArticlePubMed CentralPubMedGoogle Scholar
 Mukherjee S, Speed TP: Network inference using informative priors. Proc Natl Acad Sci U S A. 2008, 105 (38): 1431314318. 10.1073/pnas.0802272105.View ArticlePubMed CentralPubMedGoogle Scholar
 Gao S, Wang X: Quantitative utilization of prior biological knowledge in the Bayesian network modeling of gene expression data. BMC Bioinformatics. 2011, 12: 35910.1186/1471210512359.View ArticlePubMed CentralPubMedGoogle Scholar
 Segal E, Shapira M, Regev A, Pe’er D, Botstein D, Koller D, Friedman N: Module networks: identifying regulatory modules and their conditionspecific regulators from gene expression data. Nat Genet. 2003, 34 (2): 166176. 10.1038/ng1165.View ArticlePubMedGoogle Scholar
 Yeung K, Fraley C, Murua A, Raftery A, Ruzzo W: Modelbased clustering and data transformations for gene expression data. Bioinformatics. 2001, 17 (10): 977987. 10.1093/bioinformatics/17.10.977. [3rd Georgia TechEmory International Conference on Bioinformaticsin Silico Biology: Bioinformatics after the Human Genome, ATLANTA, GEORGIA, NOV 15–18, 2001],View ArticlePubMedGoogle Scholar
 Medvedovic M, Sivaganesan S: Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics. 2002, 18 (9): 11941206. 10.1093/bioinformatics/18.9.1194.View ArticlePubMedGoogle Scholar
 Geyer C: MarkovChain MonteCarlo maximumlikelihood. Computing Science and Statistics. Edited by: Keramidas EM, Fairfax Station VA. 1991, Interface Foundation of North America Inc, 156163.Google Scholar
 Fritsch A: mcclust: Process an MCMC Sample of Clusterings. 2012, [http://CRAN.Rproject.org/package=mcclust]. [R package version 1.0]Google Scholar
 Razick S, Magklaras G, Donaldson IM: iRefIndex: a consolidated protein interaction database with provenance. BMC Bioinformatics. 2008, 9: 40510.1186/147121059405.View ArticlePubMed CentralPubMedGoogle Scholar
 Ernst J, Plasterer HL, Simon I, BarJoseph Z: Integrating multiple evidence sources to predict transcription factor binding in the human genome. Genome Res. 2010, 20 (4): 526536. 10.1101/gr.096305.109.View ArticlePubMed CentralPubMedGoogle Scholar
 Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSIBLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 33893402. 10.1093/nar/25.17.3389.View ArticlePubMed CentralPubMedGoogle Scholar
 Thalamuthu A, Mukhopadhyay I, Zheng X, Tseng GC: Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics. 2006, 22 (19): 24052412. 10.1093/bioinformatics/btl406.View ArticlePubMedGoogle Scholar
 Tibshirani R, Walther G, Hastie T: Estimating the number of clusters in a data set via the gap statistic. J Roy Stat Soc B Stat Meth. 2001, 63 (2): 411423. 10.1111/14679868.00293.View ArticleGoogle Scholar
 Hubert L, Arabie P: Comparing partitions. J Classif. 1985, 2: 193218. 10.1007/BF01908075.View ArticleGoogle Scholar
 Skrbic B, Bjørnstad JL, Marstein HS, Carlson CR, Sjaastad I, Nygård S, Bjørnstad S, Christensen G, Tønnessen T: Differential regulation of extracellular matrix constituents in myocardial remodeling with and without heart failure following pressure overload. Matrix Biol. 2013, 32 (2): 133142. 10.1016/j.matbio.2012.11.011.View ArticlePubMedGoogle Scholar
 Benjamini Y, Hochberg Y: Controlling the false discovery rate  a practical and powerful approach to multiple testing. J Roy Stat Soc B Stat Meth. 1995, 57 (1): 289300.Google Scholar
 Shannon P, Markiel A, Ozier O, Baliga N, Wang J, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 2003, 13 (11): 24982504. 10.1101/gr.1239303.View ArticlePubMed CentralPubMedGoogle Scholar
 The Gene Ontology Consortium: Gene ontology: tool for the unification of biology. Nat Genet. 2000, 25: 2529. 10.1038/75556.View ArticlePubMed CentralGoogle Scholar
 Falcon S, Gentleman R: Using GOstats to test gene lists for GO term association. Bioinformatics. 2007, 23 (2): 257258. 10.1093/bioinformatics/btl567.View ArticlePubMedGoogle Scholar
 Riker AI, Enkemann SA, Fodstad O, Liu S, Ren S, Morris C, Xi Y, Howell P, Metge B, Samant RS, Shevde LA, Li W, Eschrich S, Daud A, Ju J, Matta J: The gene expression profiles of primary and metastatic melanoma yields a transition point of tumor progression and metastasis. BMC Med Genomics. 2008, 1: 1310.1186/17558794113.View ArticlePubMed CentralPubMedGoogle Scholar
 Adamic L, Wilkinson D, Huberman B, Adar E: A literature based method for identifying genedisease connections. CSB2002: IEEE Computer Society Bioinformatics Conference. 2002, IEEE Computer Society, 109117. [1st International IEEEComputerSociety Bioinformatics Conference (CSB2002), Stanford univ, Stanford, CA, Aug 14–16, 2000],View ArticleGoogle Scholar
 Bowers SLK, Banerjee I, Baudino TA: The extracellular matrix: at the center of it all. J Mol Cell Cardiol. 2010, 48 (3, SI): 474482. 10.1016/j.yjmcc.2009.08.024.View ArticlePubMed CentralPubMedGoogle Scholar
 Fomovsky GM, Thomopoulos S, Holmes JW: Contribution of extracellular matrix to the mechanical properties of the heart. J Mol Cell Cardiol. 2010, 48 (3, SI): 490496. 10.1016/j.yjmcc.2009.08.003.View ArticlePubMed CentralPubMedGoogle Scholar
 Sottile J, Shi F, Rublyevska I, Chiang HY, Lust J, Chandler J: Fibronectindependent collagen I deposition modulates the cell response to fibronectin. Am J Physiol Cell Physiol. 2007, 293 (6): C1934C1946. 10.1152/ajpcell.00130.2007.View ArticlePubMedGoogle Scholar
 Stearns M: Alendronate blocks TGFbeta 1 stimulated collagen 1 degradation by human prostate PC3 ML cells. Clin Exp Metastasis. 1998, 16 (4): 332339.View ArticlePubMedGoogle Scholar
 Ingwall JS: Energy metabolism in heart failure and remodelling. Cardiovasc Res. 2009, 81 (3): 412419.View ArticlePubMed CentralPubMedGoogle Scholar
 Lehman JJ, Kelly DP: Gene regulatory mechanisms governing energy metabolism during cardiac hypertrophic growth. Heart Fail Rev. 2002, 7: 175185. 10.1023/A:1015332726303. [10.1023/A:1015332726303],View ArticlePubMedGoogle Scholar
 Lehman J, Kelly D: Transcriptional activation of energy metabolic switches in the developing and hypertrophied heart. Clin Exp Pharmacol Physiol. 2002, 29 (4): 339345. 10.1046/j.14401681.2002.03655.x.View ArticlePubMedGoogle Scholar
 Caubet C, Jonca N, Brattsand M, Guerrin M, Bernard D, Schmidt R, Egelrud T, Simon M, Serre G: Degradation of corneodesmosome proteins by two serine proteases of the kallikrein family, SCTE/KLK5/hK5 and SCCE/KLK7/hK7. J Invest Dermatol. 2004, 122 (5): 12351244. 10.1111/j.0022202X.2004.22512.x.View ArticlePubMedGoogle Scholar
 Smith E, Fuchs E: Defining the interactions between intermediate filaments and desmosomes. J Cell Biol. 1998, 141 (5): 12291241. 10.1083/jcb.141.5.1229.View ArticlePubMed CentralPubMedGoogle Scholar
 Kouklis P, Hutton E, Fuchs E: Making a connection: direct binding between keratin intermediate filaments and desmosomal proteins. J Cell Biol. 1994, 127 (4): 10491060. 10.1083/jcb.127.4.1049.View ArticlePubMedGoogle Scholar
 Meng J, Bornslaeger E, Green K, Steinert P, Ip W: Twohybrid analysis reveals fundamental differences in direct interactions between desmoplakin and cell typespecific intermediate filaments. J Biol Chem. 1997, 272 (34): 2149521503. 10.1074/jbc.272.34.21495.View ArticlePubMedGoogle Scholar
 Hofmann I, Mertens C, Brettel M, Nimmrich V, Schnolzer M, Herrmann H: Interaction of plakophilins with desmoplakin and intermediate filament proteins: an in vitro analysis. J Cell Sci. 2000, 113 (13): 24712483.PubMedGoogle Scholar
 Xiong G, Wang C, Evers BM, Zhou BP, Xu R: ROR alpha suppresses breast tumor invasion by inducing SEMA3F expression. Cancer Res. 2012, 72 (7): 17281739. 10.1158/00085472.CAN112762.View ArticlePubMed CentralPubMedGoogle Scholar
 Taylor JM, Street TL, Hao L, Copley R, Taylor MS, Hayden PJ, Stolper G, Mott R, Hein J, Moffatt MF, Cookson WOCM: Dynamic and physical clustering of gene expression during epidermal barrier formation in differentiating keratinocytes. PLOS ONE. 2009, 4 (10): e765110.1371/journal.pone.0007651.View ArticlePubMed CentralPubMedGoogle Scholar
 Dittrich MT, Klau GW, Rosenwald A, Dandekar T, Mueller T: Identifying functional modules in proteinprotein interaction networks: an integrated exact approach. Bioinformatics. 2008, 24 (13): I223I231. 10.1093/bioinformatics/btn161.View ArticlePubMed CentralPubMedGoogle Scholar
 HaibeKains B, Olsen C, Djebbari A, Bontempi G, Correll M, Bouton C, Quackenbush J: Predictive networks: a flexible, open source, web application for integration and analysis of human gene networks. Nucleic Acid Res. 2012, 40 (D1): D866D875. 10.1093/nar/gkr1050.View ArticlePubMed CentralPubMedGoogle Scholar
 The ENCODE Project Consortium: An integrated encyclopedia of DNA elements in the human genome. Nature. 2012, 489 (article number 3): 543572.Google Scholar
 Giardine B, Riemer C, Hardison R, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent W, Nekrutenko A: Galaxy: a platform for interactive largescale genome analysis. Genome Res. 2005, 15 (10): 14511455. 10.1101/gr.4086505.View ArticlePubMed CentralPubMedGoogle Scholar
 Sandve GK, Gundersen S, Rydbeck H, Glad IK, Holden L, Holden M, Liestol K, Clancy T, Ferkingstad E, Johansen M, Nygaard V, Tostesen E, Frigessi A, Hovig E: The Genomic HyperBrowser: inferential genomics at the sequence level. Genome Biol. 2010, 11: R12110.1186/gb20101112r121.View ArticlePubMed CentralPubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.