DiffCoEx: a simple and sensitive method to find differentially coexpressed gene modules
© Tesson et al; licensee BioMed Central Ltd. 2010
Received: 2 June 2010
Accepted: 6 October 2010
Published: 6 October 2010
Large microarray datasets have enabled gene regulation to be studied through coexpression analysis. While numerous methods have been developed for identifying differentially expressed genes between two conditions, the field of differential coexpression analysis is still relatively new. More specifically, there is so far no sensitive and untargeted method to identify gene modules (also known as gene sets or clusters) that are differentially coexpressed between two conditions. Here, sensitive and untargeted means that the method should be able to construct de novo modules by grouping genes based on shared, but subtle, differential correlation patterns.
We present DiffCoEx, a novel method for identifying correlation pattern changes, which builds on the commonly used Weighted Gene Coexpression Network Analysis (WGCNA) framework for coexpression analysis. We demonstrate its usefulness by identifying biologically relevant, differentially coexpressed modules in a rat cancer dataset.
DiffCoEx is a simple and sensitive method to identify gene coexpression differences between multiple conditions.
There are two major classes of approach to the analysis of gene expression data collected in microarray studies: either one can identify genes that are differentially expressed in different conditions, or the patterns of correlated gene expression (coexpression). Coexpression analysis identifies sets of genes that are expressed in a coordinated fashion, i.e. respond in a similar fashion to the controlled or uncontrolled perturbation present in the experiment. Such coexpression is considered as evidence for possible co-regulation and for membership to common biological processes under the principle of guilt-by-association . When comparing the transcriptome between two conditions, it is a natural step to identify differential coexpression to get an even more informative picture of the dynamic changes in the gene regulatory networks. Changes in the differential coexpression structure of the genes are, for example, a group of genes strongly correlated in one condition but not in the other, or one module correlating to another module in one condition, whereas they are no longer correlated in the other condition. Differential coexpression may indicate rewiring of transcriptional networks in response to disease or adaptation to different environments.
Differential coexpression has been reported in diverse organisms and across various conditions. For example, Fuller et al.  reported a differentially coexpressed module in obese mice compared to lean mice; Van Nas et al.  found gender-specific coexpression modules; Oldham et al.  identified gene modules that were differentially coexpressed between humans and chimpanzees; and Southworth et al.  found that aging in mice was associated with a general decrease in coexpression. Differential coexpression patterns associated with diseases have been an important focus of research, see review by De la Fuente et al. .
Differential coexpression methods can be divided into two categories that serve distinct purposes: on the one hand, targeted approaches study gene modules that are defined a priori, while, on the other hand, untargeted approaches aim at grouping genes into modules on the basis of their differential coexpression status.
Sensitively detect groups of genes in which the correlation of gene pairs within the group is significantly different between conditions.
Sensitively detect changes in correlations between two groups of genes even when the within-group correlation is conserved across conditions.
Allow for simple comparison of more than two conditions.
Multiple methods have been proposed to identify such large-scale correlation patterns [5, 7–12]. However, this early work provided only partial solutions to the problem of differential coexpression since, with one recent exception , none of the proposed methods were entirely untargeted. Instead, existing methods can be divided into two categories: targeted and "semi-targeted" approaches. In targeted approaches, pre-defined modules are surveyed for correlation changes between two conditions. For example, Choi et al.  proposed a method that focuses on the analysis of modules based on known gene annotations, such as GO categories, and tests the significance of the coexpression changes using a statistical measure known as dispersion. This has the advantage of not requiring the gene sets to be highly correlated in one of the two conditions. However, this method is targeted in that it relies on the study of known functional gene sets and is not able to identify novel, non-annotated modules or modules that would only partially match annotated categories. "Semi-targeted" approaches use classical coexpression methods in one of the conditions to define modules and study whether these modules are also coexpressed in the second condition. DCA (differential clustering analysis)  is an example of a method using one of the two conditions as reference, meaning the clusters under consideration are obtained from one condition and then studied in the other condition. In order to avoid bias towards one of the conditions, Ihmels et al. suggested doing a reciprocal analysis, switching the reference and target conditions, while Southworth et al. used a third dataset as reference . A drawback of such "semi-targeted" methods is that the analysis will only focus on groups of genes that emerge as clusters in at least one of the conditions, and will therefore potentially miss more subtle cases. As an example, a weak but significant condition-dependent correlation structure between a group of genes that otherwise belong to distinct, strongly coexpressed and conserved clusters would not be detected by this approach. A first attempt at an untargeted approach was introduced by Southworth et al. , who proposed applying hierarchical clustering using the difference in pairwise correlations between both conditions as a similarity metric for two genes. This approach is therefore suited to identifying groups in which the within-group correlation changes (first criterion), but it cannot be applied to the detection of module-to-module correlation differences (second criterion). The field of differential coexpression analysis would therefore benefit from a new, truly untargeted and sensitive method for identifying differentially correlated modules that would satisfy all three criteria.
Here we present a solution to this problem in the form of the DiffCoEx approach for untargeted differential coexpression analysis: a method which applies the powerful tools of Weighted Gene Coexpression Network Analysis (WGCNA) to differential network analysis. We first describe the five steps involved in DiffCoEx and then, to illustrate the method's effectiveness, we present the results of an analysis performed on a publicly available dataset generated by Stemmer et al. .
Our method builds on WGCNA [14, 15], which is a framework for coexpression analysis. Identification of coexpression modules with WGCNA follows three steps: first an adjacency matrix is defined between all the genes under consideration based on pair-wise correlations. Then the generalized topological overlap measure  is computed from the adjacency matrix and converted into a dissimilarity measure. Finally, using this dissimilarity measure, hierarchical clustering is applied, followed by tree cutting using either a static or a dynamic height cut. The resulting clusters form modules of genes in which all members are strongly inter-correlated.
The principle of DiffCoEx is to apply WGCNA to an adjacency matrix representing the correlation changes between conditions. DiffCoEx clusters genes using a novel dissimilarity measure computed from the topological overlap  of the correlation changes between conditions. Intuitively, the method groups two genes together when their correlations to the same sets of genes change between the different conditions. The complete process of our differential coexpression analysis comprises five steps, described below. The notation X designates a square matrix with the dimension of the number of genes considered and x ij is used to define the element of X at row i and column j.
In this step, different correlation measures can be used, such as the Pearson or Spearman coefficient.
In this matrix, high values of d ij indicate that the coexpression status of gene i and gene j changes significantly between the two conditions. The correlation change is quantified as the difference between signed squared correlation coefficients so that changes in correlation which are identical in terms of explained variance (r 2 ) are given the same weight. This adjacency matrix is defined such that it only takes values between 0 and 1. The soft threshold parameter β is taken as a positive integer and is used to transform the correlation values so that the weight of large correlation differences is emphasized compared to lower, less meaningful, differences. β should be regarded as a tuning parameter, and in practice it is advisable to try different values of β. In WGCNA, it is recommended to choose β so that the resulting coexpression network follows an approximate scale-free topology . However the "scale-free" topology nature of biological networks has been disputed , and another way is to consider the soft threshold parameter as a stringency parameter: using high values of β means putting less emphasis on smaller changes in correlation, and therefore being more statistically stringent. Accordingly, since larger sample sizes come with higher statistical significance of small correlation changes, smaller values of the soft threshold can be used as the sample size increases. In practice, we view the soft threshold parameter as a tuning parameter, and we always check the significance of the result afterwards, both statistically and using biological criteria relevant in each specific study.
The use of the topological overlap measure to construct a dissimilarity metrics allows the identification of genes that share the same neighbors in the graph formed by the differential correlation network as defined by the adjacency matrix created in Step 2. Intuitively, a low value of t ij (high similarity) means that gene i and gene j both have significant correlation changes with the same large group of genes. This group of genes constitutes their "topological overlap" in the differential correlation network and may, or may not, include gene i and gene j . This property allows DiffCoEx to satisfy both criteria (i) and (ii) as stated earlier. On the one hand, if gene i and gene j are part of a module of genes coexpressed in only one condition (criterion (i), illustrated in Figure 1A), then the topological overlap between gene i and gene j in the difference network consists of all the genes within that module. On the other hand, if gene i and gene j are equally inter-correlated in both conditions but correlate with the genes in a distinct module in only one condition (criterion (ii), illustrated in Figure 1B), then the topological overlap between gene i and gene j in the difference network consists of the genes in that other module. In both cases gene i and gene j will therefore be grouped together: in the first case forming a differentially correlated module, and in the second case forming a module with differential module-to-module correlation with another group of genes.
We note that since the adjacency matrix takes values between 0 and 1, the dissimilarity matrix computed here also takes values between 0 and 1, as shown in .
The dissimilarity matrix T is used as input for clustering and modules are identified.
The clustering can be done using standard hierarchical clustering with average linkage, followed by module extraction from the resulting dendrogram, either using a fixed cut height or with more elaborate algorithms such as the dynamicTreeCut . Alternative clustering techniques, such as Partitioning Around Medoids (PAM) , may be used in this step.
Assess the statistical significance of coexpression changes.
This is necessary because DiffCoEx uses user-defined parameters: the soft threshold β used to transform the adjacency matrix in Step 2 and the clustering parameters in Step 4 (tree cutting settings, for example). Unsuitable settings may lead to the detection of clusters with non-significant differential coexpression.
The statistical significance of differential coexpression can be assessed using a measure of the module-wise correlation changes such as the dispersion statistic , the t-statistic , or the average absolute correlation. Permutations or simulations of the data can be used to generate a null distribution of those statistics by providing estimates of the extent of differential correlation that can be expected to occur by chance. An example of implementing a permutation procedure to assess the significance of differential coexpression using the dispersion statistics is presented in Additional File 1.
Extending the DiffCoEx method to multiple conditions
For two conditions, one can verify that this formulation is equivalent to that proposed earlier in Step 2.
A less sensitive variant to detect more striking patterns
This will make the method less sensitive to subtle coexpression changes, but may help in extracting more strikingly differentially coexpressed modules.
Variant without the topological overlap
This will make DiffCoEx focus only on within-module differential coexpression (criteria (i)) and not on module-to-module differential coexpression (criteria (ii)). This variant is computationally more efficient since the topological overlap computation is omitted.
We present here the results of our method as used on a previously published dataset. We identify modules of genes that are differentially coexpressed and, by using gene set enrichment analysis, we provide evidence for their biological relevance.
Our dataset (Gene Expression Omnibus GEO GSE5923) contains Affymetrix gene expression profiles of renal cortex outer medulla in wild-type- and Eker rats treated with carcinogens. The dataset is a time course as the rats were treated with Aristolochic Acid (AA) or Ochratoxin A (OTA), respectively, for 1, 3, 7 or 14 days. In total, the dataset consists of 84 arrays measuring 15,923 probe sets. Details about the experimental settings are available in the original paper .
Eker rats are predisposed to renal tumor because they are heterozygous for a loss-of-function mutation in the tuberous sclerosis 2 (Tsc2) tumor suppressor gene. Stemmer et al.  compared the transcriptional responses of the rats to the carcinogens and found that the expression levels of genes belonging to a number of cancer-related pathways were affected differently in the mutant compared to the wild-type rats. In our re-analysis of the data, we switched the focus from differential expression to differential coexpression in an attempt to identify functional modules responding to carcinogen treatment with a different coexpression signature in mutant Eker rats compared to wild type rats.
We applied the DiffCoEx method to the quantile normalized data . A duplicate set of 12 controls present only for Eker rats was discarded in order to have a symmetric experimental setting among wild-type- and Eker rats. We used the Spearman rank correlation in order to reduce sensitivity to outliers, and the hierarchical clustering and module assignment was performed using dynamicTreeCut . The detailed algorithm and R code used in this analysis are given in Additional File 1.
The cases of the black, orange and green modules illustrate an interesting characteristic of DiffCoEx: the method is able to identify module-to-module correlation changes. Interestingly, the black module is not differentially correlated in the wild-type rats compared to the Eker rats. Instead, what qualifies the black module as a differentially coexpressed module is its very significant drop in correlation with the genes in the blue and purple modules in the wild-type rats compared to the Eker mutants (see Figure 2A). Similar patterns can be observed for the orange and green modules. This property makes DiffCoEx a sensitive approach for detecting any type of large-scale correlation change.
Following Choi et al. , significance of the coexpression differences was assessed by comparing the dispersion index values of each module in the data with the null distribution obtained from permuted (scaled) data (see Additional File 1 for details and Additional File 2: Figure S1 for an overview of the permutation results). In 1000 permutations, none of the blue, brown, purple, red or yellow modules obtained as high a dispersion value as that obtained from the non-permuted data, indicating a significance p-value < 0.001. Module-to-module coexpression changes were tested by assessing the significance of the correlation changes between the genes from each possible module pair, using a similar "module-to-module" dispersion measure and generating null distributions from the same permutation approach. Additional File 2: Figure S1 shows that the coexpression change between the black and blue modules, for example, is highly significant since no permutation yielded as high a dispersion value.
Annotations enriched in differentially coexpressed modules
Metabolism of xenobiotics by cytochrome P450
Glutathione transferase activity
Xenobiotic metabolic process
No significant enrichment
Renal cell carcinoma
Pathways in cancer
Small GTPase mediated signal transduction
In Figure 2B, the expression data for the 13 genes of the yellow module, which were associated with the "pancreatic cancer" KEGG annotation, illustrate what differential coexpression is: a difference in the coordination of the variation of a group of genes between two conditions. In the Eker rats, these cancer genes show coordinated variation, whereas in the wild-type rats this coordination is absent.
This analysis was carried out using the R statistical package with the WGCNA  library, on a Linux computer with 128 GB physical memory. Large memory (around 10 GB) is required to compute correlation matrices for over 10,000 genes. For module definition, hierarchical clustering was combined with dynamicTreeCut  using a minimum size of 20 genes. Details of the process and code can be found in Additional File 1.
Discussion and conclusions
The method we present here has the advantage of comparing two (or more) datasets in a global, unbiased and unsupervised manner. It represents a major improvement over earlier two-way comparisons, in which clustering was first performed in one condition and the coexpression of the genes in the resulting clusters was then assessed in the other condition. Moreover, DiffCoEx is very sensitive because (i) it does not require differentially coexpressed modules to be detected as coherent, coexpressed modules in one of the two conditions; instead, only the difference in coexpression is considered to define the module; and (ii) it can identify all types of large-scale correlation changes, including module-to-module correlation changes. Using a simulation study (see Additional File 4), we demonstrate examples of differential coexpression patterns that can be uncovered using DiffCoEx but that were missed by existing approaches.
Differential coexpression provides information that would be missed using classical methods focusing on the identification of differentially expressed genes. For example, as Figure 2A shows, many of the differentially coexpressed clusters display few differences between the two conditions in terms of mean overall expression. This indicates that the changes in correlation that we observed cannot be explained by the genes being not expressed, and therefore not correlated in one of the two conditions.
Differential coexpression may be caused by different biological mechanisms. For example, a group of genes may be under the control of a common regulator (e.g. a transcription factor or epigenetic modification) that is active in one condition, but absent in the other condition. In such a case, the correlation structure induced by variation in the common regulator would only be present in the first condition. Another possible interpretation relates to the presence or absence of variation in some factors driving a gene module. To observe correlation of a group of genes responding to a common factor, this factor needs to vary. In the absence of variation of the driving factor, no correlation can be observed, even though the actual biological links that form the network are not altered. It is therefore important to ensure that the perturbations which give rise to variation within each condition are: (i) biologically relevant (as opposed to batch effects, for example) and (ii) comparable in nature and amplitude.
DiffCoEx provides a simple and efficient approach to study how different sample groups respond to the same perturbations. These perturbations can be either well characterized and controlled, or stochastic and unknown. In our example analysis, on top of random physiological fluctuations present in any dataset, there was a controlled perturbation induced by the time-course treatment with different carcinogens present. Since the carcinogen treatment is a controlled experimental factor, it is possible to use classical methods to study the transcriptomic changes it induces rather than using DiffCoEx. However, a fundamental advantage of using DiffCoEx in such a case is that it requires no model assumptions and is a quick and efficient approach. Differential coexpression approaches are even more useful when the variation among the samples in one condition is caused by uncontrolled factors, whose effects cannot easily be dissected. A typical example would be genetic variation present in a natural population or an experimental cross. DiffCoEx constitutes a valuable tool of broad applicability now that such genetic studies are becoming increasingly important for studying gene regulatory networks [22–24].
This work was supported by a BioRange grant SP1.2.3 from the Netherlands Bioinformatics Centre (NBIC), which is supported by a BSIK grant through the Netherlands Genomics Initiative (NGI). We thank Jackie Senior for editing this article.
- Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown PO, Herskowitz I: The Transcriptional Program of Sporulation in Budding Yeast. Science 1998, 282(5389):699–705. 10.1126/science.282.5389.699View ArticlePubMedGoogle Scholar
- Fuller TF, Ghazalpour A, Aten JE, Drake TA, Lusis AJ, Horvath S: Weighted gene coexpression network analysis strategies applied to mouse weight. Mamm Genome 2007, 18(6–7):463–472. 10.1007/s00335-007-9043-3View ArticlePubMedPubMed CentralGoogle Scholar
- van Nas A, Guhathakurta D, Wang SS, Yehya N, Horvath S, Zhang B, Ingram-Drake L, Chaudhuri G, Schadt EE, Drake TA, et al.: Elucidating the role of gonadal hormones in sexually dimorphic gene coexpression networks. Endocrinology 2009, 150(3):1235–1249. 10.1210/en.2008-0563View ArticlePubMedPubMed CentralGoogle Scholar
- Oldham MC, Horvath S, Geschwind DH: Conservation and evolution of gene coexpression networks in human and chimpanzee brains. Proceedings of the National Academy of Sciences of the United States of America 2006, 103(47):17973–17978. 10.1073/pnas.0605938103View ArticlePubMedPubMed CentralGoogle Scholar
- Southworth LK, Owen AB, Kim SK: Aging mice show a decreasing correlation of gene expression within genetic modules. PLoS Genet 2009, 5(12):e1000776. 10.1371/journal.pgen.1000776View ArticlePubMedPubMed CentralGoogle Scholar
- de la Fuente A: From 'differential expression' to 'differential networking' - identification of dysfunctional regulatory networks in diseases. Trends Genet 2010, 26(7):326–333. 10.1016/j.tig.2010.05.001View ArticlePubMedGoogle Scholar
- Cho SB, Kim J, Kim JH: Identifying set-wise differential co-expression in gene expression microarray data. BMC Bioinformatics 2009, 10: 109–109. 10.1186/1471-2105-10-109View ArticlePubMedPubMed CentralGoogle Scholar
- Choi JK, Yu U, Yoo OJ, Kim S: Differential coexpression analysis using microarray data and its application to human cancer. Bioinformatics (Oxford, England) 2005, 21(24):4348–4355. 10.1093/bioinformatics/bti722View ArticleGoogle Scholar
- Choi Y, Kendziorski C: Statistical methods for gene set co-expression analysis. Bioinformatics 2009, 25(21):2780–2786. 10.1093/bioinformatics/btp502View ArticlePubMedPubMed CentralGoogle Scholar
- Ihmels J, Bergmann S, Berman J, Barkai N: Comparative gene expression analysis by differential clustering approach: application to the Candida albicans transcription program. PLoS Genetics 2005, 1(3):e39-e39. 10.1371/journal.pgen.0010039View ArticlePubMedPubMed CentralGoogle Scholar
- Lai Y, Wu B, Chen L, Zhao H: A statistical method for identifying differential gene-gene co-expression patterns. Bioinformatics (Oxford, England) 2004, 20(17):3146–3155. 10.1093/bioinformatics/bth379View ArticleGoogle Scholar
- Watson M: CoXpress: differential co-expression in gene expression data. BMC Bioinformatics 2006, 7: 509–509. 10.1186/1471-2105-7-509View ArticlePubMedPubMed CentralGoogle Scholar
- Stemmer K, Ellinger-Ziegelbauer H, Ahr HJ, Dietrich DR: Carcinogen-specific gene expression profiles in short-term treated Eker and wild-type rats indicative of pathways involved in renal tumorigenesis. Cancer Research 2007, 67(9):4052–4068. 10.1158/0008-5472.CAN-06-3587View ArticlePubMedGoogle Scholar
- Zhang B, Horvath S: A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology 2005, 4(1):1128–1128. 10.2202/1544-6115.1128View ArticleGoogle Scholar
- Langfelder P, Horvath S: WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008, 9(1):559–559. 10.1186/1471-2105-9-559View ArticlePubMedPubMed CentralGoogle Scholar
- Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL: Hierarchical Organization of Modularity in Metabolic Networks. Science 2002, 297(5586):1551–1555. 10.1126/science.1073374View ArticlePubMedGoogle Scholar
- Khanin R, Wit E: How scale-free are biological networks. J Comput Biol 2006, 13(3):810–818. 10.1089/cmb.2006.13.810View ArticlePubMedGoogle Scholar
- Langfelder P, Zhang B, Horvath S: Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics (Oxford, England) 2008, 24(5):719–720. 10.1093/bioinformatics/btm563View ArticleGoogle Scholar
- Kaufman L, Rousseeuw PJ: Finding groups in data. an introduction to cluster analysis. 1990.Google Scholar
- Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19(2):185–193. 10.1093/bioinformatics/19.2.185View ArticlePubMedGoogle Scholar
- Backes C, Keller A, Kuentzer J, Kneissl B, Comtesse N, Elnakady YA, Müller R, Meese E, Lenhof HP: GeneTrail--advanced gene set enrichment analysis. Nucleic Acids Research 2007, (35 Web Server):W186–192-W186–192.Google Scholar
- Schadt EE: Molecular networks as sensors and drivers of common human diseases. Nature 2009, 461(7261):218–223. 10.1038/nature08454View ArticlePubMedGoogle Scholar
- Li Y, Breitling R, Jansen RC: Generalizing genetical genomics: getting added value from environmental perturbation. Trends in Genetics: TIG 2008, 24(10):518–524. 10.1016/j.tig.2008.08.001View ArticlePubMedGoogle Scholar
- Jansen RC, Tesson BM, Fu J, Yang Y, McIntyre LM: Defining gene and QTL networks. Current Opinion in Plant Biology 2009, 12(2):241–246. 10.1016/j.pbi.2009.01.003View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.