Co-expressed gene module selection and benchmark
We utilized two publicly available microarray studies of Dengue infection (GSE18090 and GSE43777) to compare CEMiTool to two R packages: WGCNA and Petal [10]. CEMiTool was run using its default parameters and all optional files. After filtering, the analyses were performed on 2129 genes for study GSE18090, and 1765 genes for study GSE43777. Our assumption is that greater gene set enrichment in pathways relevant to the diseases are good proxies for the quality of a co-expression network analysis. For study GSE18090, CEMiTool selected a soft-threshold value of 6 and identified 12 different co-expression modules, out of which 9 had at least one significantly enriched pathway in the Over Representation Analysis. Notably, modules M4 and M6 were significantly enriched with interferon and cytokine signaling pathways, along with antiviral mechanisms, as expected from an infectious disease such as dengue. Furthermore, module M2 was significantly enriched for toll-like receptor cascades, which have been shown to lead to and induce the release of proinflammatory cytokines and chemokines in Dengue infections. These findings mirror what was found in the 7 significantly enriched (of a total of 11) co-expression modules observed for study GSE43777 (beta = 5). Running CEMiTool analyses with all possible optional files for both studies in an average computer took around 3 min (Table 1).
In order to compare WGCNA to CEMiTool, WGCNA was run on the Dengue studies using the top 4000 most variant genes of each dataset. Since WGCNA does not specify the optimal number of input genes, we utilized the same number of genes suggested in their tutorial. The analysis identified 18 modules for study GSE18090 using a soft-threshold of 9. Interestingly, however, over half of them (10) presented no significantly enriched pathways after Over Representation Analysis (p-value < 0.01). In contrast to the CEMiTool results, WGCNA did not report pathways related to toll-like receptor cascades. As for study GSE43777 (beta = 6), WGCNA returned 10 significantly enriched modules out of a total of 16. These results suggest that, despite running on a smaller number of genes, CEMiTool is able to successfully filter irrelevant genes and construct modules using the most important genes. Our custom WGCNA script was able to run the analysis in a similar time as CEMiTool (around 3 min, Table 1). However, this did not take into account the considerable time required to manually insert all steps needed to perform WGCNA analyses, select the user-specified parameters, and the steep learning curve necessary in order to understand the whole procedure.
To account for the difference in the number of input genes, we also ran WGCNA using the filtered datasets returned by CEMiTool’s filter. For study GSE18090, WGCNA identified 16 modules, with a soft-threshold of 7. Out of these, 10 modules had at least one significantly enriched pathway in the Over Representation Analysis. As expected, results became more similar to CEMiTool’s, with the inclusion of a module related to Toll-like receptor activity (M2), and different modules for interferon types gamma (M4) and alpha/beta (M5). As for study GSE43777, WGCNA (beta value of 6) was able to identify 6 significantly enriched modules out of a total of 12, giving it 2 more non-significantly enriched modules than CEMiTool. These subtle differences are likely to be derived from the difference in the selected beta values and showcase CEMiTool’s ability to produce results comparable to established tools such as WGCNA with greater ease and convenience.
We ran Petal using the same input genes utilized by WGCNA analysis (4000 most variant genes). Petal is a software which attempts to define a co-expression network using an automatically defined threshold to indicate similar expression between genes [10]. However, after 20 min for study GSE18090 and 40 min for study GSE43777, the program was unable to select any threshold for either study. This happened again when the filtered datasets from CEMiTool were attempted, albeit with lower runtimes (9 min for study GSE18090 and 4 min for study GSE43777). We encountered several other problems, such as confusing command line output; no output plots or complementary analyses; massive cluttering of user’s workspace with no option to redirect the several output files; lack of user tutorial or vignette; and inconsistent naming schemes, resulting in an unpleasant user experience.
Other packages, such as CoXpress, DINGO and DiffCoEx were not considered for benchmarking since they analyze more than 2 groups of samples (Table 1). Given these results, we chose to focus the remainder of our benchmarking on the differences between CEMiTool and WGCNA.
The WGCNA method [5] receives an input “m x n” gene expression matrix, containing n samples under specific conditions and m genes, where each element in the matrix gives the expression of one gene in a particular sample. The correlation between each pair of genes is then transformed into an m x m adjacency matrix through an adjacency function. The adjacency matrix may be signed or unsigned. In the former, correlations in the [− 1, 1] interval are scaled into the [0, 1] interval, while in the latter, negative correlations are made positive. During the process, these values are then raised to a power of β, called the soft-threshold, which effectively adjusts how smoothly the connection strengths transition from their lowest to their highest values. The selection of β directly impacts on how adherent to the scale-free model the network will be. In general, the WGCNA authors recommend to use the “scale-free topology criterion” [5], in which the chosen β value is the one that leads the network’s topology to be, at least approximately, scale-free. Adherence to a scale-free topology is measured by a linear regression fit (R2) that quantifies the extent to which the degree distribution of the genes in the network follows a power law. Thus, for WGCNA, the chosen β value is the lowest one with which an R2 > 0.85 (or R2 > 0.8 in the original paper [5]).
However, the selection of the best soft-threshold is relatively arbitrary and can differ from study to study. By looking at a plot showing R2 values for each β ranging from 1 to 20, WGCNA users are required to manually define the value of β by considering the trade-off between R2 and connectivity - a higher β may make the network more scale-free, but also lowers the mean connectivity.
Despite the WGCNA authors have demonstrated that networks are relatively robust to the selection of the soft-thresholding parameter [5], a more rigorous framework for the selection of beta is still lacking, being usually defined visually by the user, hindering reproducibility and workflow automation. Although WGCNA provides a function named pickSoftThreshold that can automatically select the β value, we have created an alternative algorithm, which is based on the concept of Cauchy sequences [17], that improves the automatic selection of the β value, allowing for more reliable and consistent results (See Methods).
Briefly, our method investigates if all possible pairs of β values (in a certain range) possess a difference between their R2 values within a pre-defined range ϵ, and selects the first beta value in this sequence to satisfy this property. Moreover, our algorithm allows for a lower threshold for R2 (R2 > 0.8) when compared to WGCNA default threshold (R2 > 0.85) - which, in turn, allows for lower values of β. Once the β value is defined, the remaining steps for creating the modules follow the standard WGCNA procedure.
To benchmark the selection of β, we compared the method implemented in WGCNA (pickSoftThreshold function) with our algorithm (Additional file 1: Text) on 15 publicly available microarray studies. Using the same genes as input, we utilized three different methods for module identification: WGCNA’s pickSoftThreshold function with R2 values > 0.8 and > 0.85 (WGCNA’s default), as well as CEMiTool’s cemitool function with R2 > 0.8. Figure 2a shows the value of β for each implementation. With the exception of study GSE53441, the value of β returned by CEMiTool was always equal to or lower than the one returned by WGCNA.
It is worth mentioning that the soft-thresholding impacts not only the network’s topology, but also its information content: the higher the β value, the lower its mean connectivity - since connection strengths in the adjacency matrix are bounded by [0,1] [18]. Consequently, a trade-off between the network’s connectivity and its adherence to a scale-free topology must be considered. Therefore, in the context of this work we consider lower β values to be of more interest than higher values, as long as their R2 values are similar.
The difference between WGCNA and CEMiTool in selecting the β parameter can be largely explained by the lower R2 threshold implemented in our tool (0.8 in CEMiTool versus 0.85 in WGCNA). We picked this lower R2 threshold observing the WGCNA authors’ original recommendation [5]. Also, CEMiTool utilizes a stringent algorithm, based on Cauchy sequences, to select the lowest β parameter that stabilizes the sequence (i.e. keeps its R2 values within a pre-defined range), while keeping the R2 above the threshold. When the same R2 threshold (0.8) is applied, CEMiTool usually returns the same β parameter value as WGCNA’s pickSoftThreshold function. In several cases, however, WGCNA returned an inappropriate β value of 1 (Fig. 2).
Input gene selection
Prior to identifying co-expression modules, it is recommended to filter input genes by either mean expression or variance, rather than by differential expression since this would invalidate the scale-free topology assumption [19]. Nevertheless, the number of genes to be chosen is left undetermined, leading to arbitrary choices that might affect downstream analyses. We thus opted for a flexible, yet objective method of gene selection (Additional file 1: Text). Briefly, by modeling the variance of genes as an inverse gamma distribution, as suggested in [20], we can select genes based on a p-value (in our analyses, we set p = 0.1 as cutoff). For certain types of RNA-seq data normalizations, our method allows for a correction of the mean-variance dependency [21, 22] by modeling the expression data as a negative binomial distribution [22], and then performing the adequate Variance Stabilizing Transformation (VST) [23] (Additional file 1: Text). To remove potential noise, our package also removes by default the 25% genes with lowest mean expression across all samples prior to filtering.
In order to determine the most suitable default filtering parameters, we applied CEMiTool to 300 microarray studies obtained from the GEO (Gene Expression Omnibus [24, 25]) database using differing filter p-value thresholds, and assessed the biological significance of the resulting modules (Additional file 2: Table S1). This was determined by calculating the Combined Enrichment Score (CES) of the output modules with respect to the Reactome pathways (Fig. 3). Briefly, the CES allows us to condense the overall enrichment results into a single number - the lower this number is, the more enriched the modules are (Additional file 1: Text). As the filtering p-value increases from 0.05 to 0.3, the CES reaches a global minimum at p ≈ 0.1, suggesting that the noise introduced by non-correlated genes outweighs the gain in information (Fig. 3). The filtering p-value is therefore set to 0.1 as a default, but is also easily adjustable by the user via the filter_pval argument to the cemitool function to allow the analysis to be more or less stringent, as needed.
Influence of the number of samples on the scale-free topology model fit
To assess the minimum optimal number of samples for analyses, we devised a quality control parameter for the β x R2 curve, ɸ . We define ɸ as the ratio of the area under the curve relative to the area of the rectangle made by β × 1, which is the highest possible value for R2. Higher values of ɸ mean that the topology of the network converges sharply to a scale-free degree distribution. To estimate the minimum number of samples that returns the highest ɸ value before reaching a plateau, we bootstrapped the number of samples for 3 microarray studies, selecting at first 5 random samples, and then incrementing the sample number by 5 at each step. CEMiTool was run 10 times at each step using default parameters. As shown in Fig. 4, the parameter ɸ tends to stabilize at around 20 samples (which is in accordance to previous findings [26]), indicating that the β x R2 curve, and thus network topology, should not vary so much in behavior starting at that sample number.
Application to RNA-seq datasets
We also ran CEMiTool on 8 RNAseq studies, 4 of which had been previously normalized by log2 CPM (GSE69015, GSE77926, GSE92754, GSE94855), 2 normalized by RPKM (GSE44183 and GSE54456), 1 by FPKM (GSE77564) and 1 only adjusted for fragment and length biases (GSE65540).
The study GSE54456 [27] has measured 174 transcriptomes of lesional psoriatic and normal skin samples. Among the 8 modules identified by CEMiTool (Fig. 5), the module M1 was enriched for immune system pathways, including interferon alpha signaling, which is known to be related to the disease [28, 29]. One notable hub gene for module M1 was S100A7A. Although this gene was not mentioned in the original publication [27], others have shown that the expression of S100A7A is upregulated in lesioned-skin psoriasis patients [30]. CEMiTool analyses also revealed a module related to extracellular matrix organization and collagen formation (Fig. 5), suggesting that the expression of genes responsible for maintaining the structure of the skin may be coordinately altered by the disease.
CEMiTool applied to over 1000 publicly available microarray studies
To demonstrate that CEMiTool can be easily automated, we ran the package on 1094 microarray studies obtained from the GEO database. For each study, we downloaded the authors’ normalized data and ran the cemitool function using the default parameters. Figure 6 shows the distribution of β values, and the number of modules and filtered genes selected for the analyses.
Almost 12,000 gene modules were identified by CEMiTool, containing in total over 2 million genes. The studies span hundreds of different biological conditions, including cancer, drug treatments, infectious diseases, and inflammatory and neurological pathologies. The list of all studies can be found in Additional file 3: Table S2.
Applying CEMiTool to study dengue
To gain novel insights about immunity to infectious diseases, we ran the package on two publicly available microarray studies containing the blood transcriptome of patients infected or not with the Dengue virus (GEO accession numbers GSE18090 and GSE43777). We then annotated samples using the phenotypes provided by the original authors: control (non-infected patients), DF (patients diagnosed as Dengue fever) and DHF (patients diagnosed as dengue hemorrhagic fever). Protein-protein interaction data from GeneMania [31] and gene sets from Reactome Pathways Database [32] were also used in the CEMiTool analyses. The results obtained by such analyses are partially displayed in Fig. 7. The over representation analysis shows that the package identified modules (M4 in GSE18090 and M3 in GSE43777) related to anti-viral immunity, such as interferon signaling and the ISG15 antiviral mechanism. Moreover, Gene Set Enrichment Analyses show that the activity of these modules is higher in DF or DHF compared to Control samples (Fig. 7). Also, the module network graph of study GSE43777 highlights important genes as network hubs, such as CCL2, (coding for chemokine C-C motif ligand 2), which is known to be associated with severe dengue and dengue shock syndrome [33]. However, CCL2 was not highlighted as a key gene for Dengue infection in the original article associated with the study GSE43777 [34].
Modular analyses of visceral leishmaniasis
Finally, we used CEMiTool to investigate the blood transcriptome of patients infected with visceral leishmaniasis (VL), a major public health problem in Brazil and worldwide. For this, we performed 17 RNA-seq experiments using the whole-blood obtained from 6 patients infected with Leishmania infantum, before and after treatment, as well as 5 uninfected healthy individuals (Additional file 1: Text). CEMiTool has generated 14 modules containing 1700 genes (Fig. 8 and Additional file 4: html report). Of those, modules M7 and M10 refer to interferon-mediated immune responses (IFNgamma and type I IFN, respectively), which are well known to be involved in experimental models of leishmaniasis [35]. However, although IFNgamma response (M7) has been reported in clinical studies as well, little is known about the role of type I IFN (M10) in VL patients. In fact, type I IFN genes are typically elicited in viral infections, and not by protozoan parasites such as Leishmania infantum. Further experiments must be conducted to assess how the type I IFN genes may drive the functions of innate and adaptive immune cells during VL infection. Also, CEMiTool was able to unravel the dynamics of genes involved in B cell-mediated immunity during VL treatment, as shown by module M3 (Fig. 8). Integration with protein-protein interaction data revealed CD79A and CD79B as potential hubs in module M3 (Fig. 8). Both proteins form a dimer associated with the B-cell antigen receptor (BCR), and are critical for B cell immunity. The finding that CD79A and CD79B genes, as well as other members of the modules related to B cell development, are co-expressed and that the module activity is increased on VL treatment demonstrate that CEMiTool can provide new insights about the host response to treatment and to the disease.
Gene set enrichment analysis revealed novel insights about the molecular disturbances caused by the infection (Fig. 8). For instance, the transcriptional activity of module M4, which is enriched by genes associated with “platelet degranulation” and “hemostasis”, is significantly high in VL patients before receiving treatment (Additional file 4: html report). Indeed, VL is associated with several haematological manifestations, including anaemia, leucopenia, and disseminated intravascular coagulation [36]. In addition, the pattern of activity of module M9, which is associated with cell cycle (Additional file 4: html report and Fig. 8), suggests that an intense proliferation of cells expressing CCR5 (such as macrophages, dendritic cells and memory T cells) may be occurring during VL infection. Taken together, our findings may define which genes are driving these haematological manifestations, and thus suggesting effective drug treatments to VL.