 Research article
 Open Access
 Published:
Distance correlation application to gene coexpression network analysis
BMC Bioinformatics volume 23, Article number: 81 (2022)
Abstract
Background
To construct gene coexpression networks, it is necessary to evaluate the correlation between different gene expression profiles. However, commonly used correlation metrics, including both linear (such as Pearson’s correlation) and monotonic (such as Spearman’s correlation) dependence metrics, are not enough to observe the nature of real biological systems. Hence, introducing a more informative correlation metric when constructing gene coexpression networks is still an interesting topic.
Results
In this paper, we test distance correlation, a correlation metric integrating both linear and nonlinear dependence, with other three typical metrics (Pearson’s correlation, Spearman’s correlation, and maximal information coefficient) on four different arrays (macrophage and liver) and RNAseq (cervical cancer and pancreatic cancer) datasets. Among all the metrics, distance correlation is distribution free and can provide better performance on complex relationships and antioutlier. Furthermore, distance correlation is applied to Weighted Gene Coexpression Network Analysis (WGCNA) for constructing a gene coexpression network analysis method which we named Distance Correlationbased Weighted Gene Coexpression Network Analysis (DCWGCNA). Compared with traditional WGCNA, DCWGCNA can enhance the result of enrichment analysis and improve the module stability.
Conclusions
Distance correlation is better at revealing complex biological relationships between gene profiles compared with other correlation metrics, which contribute to more meaningful modules when analyzing gene coexpression networks. However, due to the high time complexity of distance correlation, the implementation requires more computer memory.
Background
The general insight in systems biology is that genes and proteins act together in intricate networks, rather than in isolation [1]. Networks are widespread in biomedicine, such as the gene coexpression network, regulatory network metabolism network and proteinprotein interaction network [2]. A key step in the analysis of large genomewide gene expression datasets is the grouping of genes into coexpression modules using module detection methods [3]. Genes are represented by nodes, and connecting the node genes which are significantly coexpressed across appropriately chosen samples [4].
To construct and analyse a gene coexpression network, it is necessary to assess the interactions between two genes. Such interactions are measured by calculating the correlation coefficients of different gene expression profiles. Detecting and evaluating dependencies between variables usually requires the definition of ‘ distance’ or ‘similarity’ [5]. In the gene coexpression network, the overall shapes of gene expression patterns (or profiles) are of greater interest than the individual magnitudes of each feature [6]. The gene modules obtained by gene coexpression network analysis should represent highly correlated genes which are also called coexpressed genes [7]. Therefore, we should apply a metric between the gene based on ’similarity’ measure rather than using ’distance’ depending on the gene coexpression network. There are many methods to calculate the correlation coefficients between genes, with the most popular being Pearson correlation and Spearman’s rank correlation. Pearson correlation measures the strength of the linear relationship between two random variables, whereas Spearman’s rank correlation evaluates how well the correlation between two variables can be formulated by a monotonic function [8, 9]. However, linear and monotonic dependence are not the only ones observed in a real biological system, there are many other complex relationships observed in biological systems. Therefore, if the method of correlation measure is limited to linear dependence measures in the construction of gene coexpression networks, the ability of gene coexpression networks to recreate the accurate network and identify the appropriate gene modules will also be limited. To overcome this barrier, additional appropriate methods are needed to measure the complex relationships between genes.
The Pearson correlation coefficient is the most common default measure among gene coexpression network analysis methods [10, 11]. Several assumptions made with respect to Pearson correlation limit its effectiveness. (i) The Pearson correlation coefficient is usually not suggested for nonnormally distributed data [12]. (ii) Pearson correlation, which only captures linear relationships between two given components, measures the degree of a linear relationship between two realvalued variables [13]. That is, the linear relationships between two variables should be known, after which Pearson correlation can be used to assess how well the relationship between the two variables can be described by a linear equation. However, the relationships between genes are complex, and only partly explained by linear relationships. (iii) The Pearson coefficient is sensitive to outliers. For example, if one sample has a very high expression for two genes, the Pearson correlation coefficient can approach +1 due to just one outlier, falsely indicating a strong correlation, when in fact it does not exist [14]. (iv) A zero value for the Pearson coefficient reveals there isn’t a linear correlation between the variables, however, it does not show that the variables are independent. A classic example is to define \(Y=X^2\) where X is a random variable on \([1, 1]\). X and Y are not independent, but the value for the Pearson correlation coefficient between X and Y is zero.
A robust alternative to the Pearson correlation coefficient is the Spearman correlation coefficient which can be applied to nonnormally distributed data since it is more robust to outliers. As the Spearman correlation coefficient has less statistical power than the Pearson coefficient, the Pearson coefficient is recommended if the data are normally distributed [15]. Moreover, since gene data are continuous, Spearman correlation must convert continuous into rank data, leading to a loss of the original information. For continuous data, Spearman correlation has lower accuracy than Pearson correlation.
There are many other metrics in addition to the Pearson correlation coefficient and Spearman correlation coefficient, such as the maximal information coefficient (MIC) [16] and distance correlation [17]. Maximal information component analysis (MICA), which combines MIC with an Interaction Component Model, has been proposed in [18]. This algorithm can obtain improved modules when the networks contain confounding factors. However, it has been indicated that there are no mathematical arguments for MIC, and its results are based solely on the analysis of simulated data in the literature [19,20,21]. In practice, MIC may provide a high score for two variables even if they are not correlated, thus creating a false correlation between two variables.
Distance correlation is a statistical measure of the correlation proposed by Szkely et al. [17]. In contrast to Pearson correlation, which can only detect linear relationships, distance correlation measures the degree of all types of possible relationships between two genes. Generally speaking, compared with classical Pearson correlation, distance correlation has some distinct advantages: it does not assume normality, it can measure a nonlinear relationship between two variables, the presence of outliers has a reduced influence on distance correlation, and the distance correlation is zero only if the random vectors are independent. Distance correlation is a valuable, practical, and natural tool in data analysis [17]. Inspired by MICA, we incorporated distance correlation into WGCNA to construct a distance correlationbased WGCNA (DCWGCNA) algorithm for gene coexpression analysis. In DCWGCNA, the correlation coefficients between the gene expression profiling data are calculated by distance correlation, and the other process of DCWGCNA is identical to the traditional WGCNA except for the different correlation coefficients. To illustrate the performance of distance correlation, the three most popular used and representative correlation coefficients were selected for comparison: Pearson productmoment correlation, Spearman’s rankorder correlation, and maximal information coefficient (MIC) based on information entropy.
In this study, we mainly used four datasets, macrophage [22,23,24] and liver [25, 26] datasets were from microarray datasets. The data were normalized by the robust multiarray average (RMA) method. Here, we used the macrophage dataset comprising 329 samples, for which 3611 genes were retained. Simultaneously, the liver dataset comprised 288 samples and 3089 genes. Cervical cancer and pancreatic cancer datasets were RNAseq datasets derived from The Cancer Genome Atlas (TCGA). These datasets are transformed logarithmic and RSEM normalized. We selected 3808 genes in 308 samples for the cervical cancer dataset. For the pancreatic cancer dataset, there are 3029 genes in 183 samples.
Results
In this section, we first illustrate that distance correlation is suitable for measuring the correlation between two genes. Simultaneously, distance correlation is compared with the other three typical correlation analysis methods named Pearson correlation, Spearman correlation, and MIC. Then, we construct a distance correlationbased WGCNA (DCWGCNA) algorithm for gene coexpression analysis by replacing the Pearson correlation in WGCNA with distance correlation. Finally, we validate the performance of the new algorithm based on scalefree topology (SFT) fit, clustering results, enrichment analysis, and module stability by analysing gene expression profiles using four datasets from microarray data and RNAseq data. In Fig. 1, we provide a flowchart of this study.
Distance correlation is distribution free
A normal distribution is not a requirement for using the Pearson correlation coefficient, but the testing for statistical significance of the correlation may be reduced, so the Pearson correlation coefficient is usually not suggested for nonnormally distributed data [12, 15, 27, 28]. The gene data we used were usually logtransformed to achieve a normal distribution. However, most genes are in fact not normally distributed even after log transformation. To verify that the data were normally distributed, we show the normal quantilequantile (QQ) plots of the gene expression profile obtained from the macrophage and cervical cancer datasets in Fig. 2. Each subfigure of Fig. 2 is a normal QQ plot of one gene expression profile. A normal QQ plot comparing a gene expression profile on the vertical axis to a standard normal population on the horizontal axis is shown. In the QQ plots, normally distributed data appear as an approximately straight line. The GPHN gene shown in Fig. 2a is approximately normally distributed. However, in Fig. 2b, c, the points are not clustered on the \(45^\circ\) line, but rather follow a curve, suggesting that the genes are not normally distributed. The data for these two microarrays show skewness and a bimodal distribution, respectively. These distributions were also observed in the RNAseq data for cervical cancer, where the results shown in Fig. 2d are approximately normally distributed, in Fig. 2e show a skewness distribution, and in Fig. 2f show a bimodal distribution.
Using the ShapiroWilk test [29], we calculated the percentage of genes with a pvalue greater than 0.01, which were considered to be normally distributed. In the macrophage dataset, approximately 34.26% of genes were normally distributed, while this value was 30.88% in the liver dataset, 37.65% in the cervical cancer dataset, and 23.21% in the pancreatic cancer dataset. Thus, most of the genes in these four datasets were not normally distributed. Since distance correlation does not require any distributional assumption [17], testing for the statistical significance of distance correlation will not be less even if the data are nonnormally distributed, which differs from the Pearson coefficient. Thus, a reasonable measurement value is given by the distance correlation, making distance correlation more suitable for measuring the dependence between two genes.
Distance correlation better fits complex relationships
Complex relationships between genes are pervasively observed in biological systems. In Fig. 3, we provide several examples of relationships observed between pairs of genes from microarray data and RNAseq data. These relationships of gene pairs were selected based on the Pearson correlation values, the distance correlation values, and the difference between them. Specifically, a small distance correlation value indicates an almost independent relationship. A small difference between the Pearson and distance correlation values might associate with linearity. A greater Pearson coefficient than distance correlation value indicates the potential for outliers in the direction of the regression line of the normal data. A smaller Pearson coefficient than the distance correlation value indicates a nonlinear relationship or outliers in the normal direction of the regression line.
s shown in Fig. 3a, the relationship between the two genes was approximately independent. Figure 3b shows a linear and monotonic relationship between the two genes. In Fig. 3c, a significant power function relationship can be observed between the two genes, suggestive of a nonlinear and monotonic relationship. Figure 3d, e show the threshold and quadripartite relations, respectively, which are nonlinear and nonmonotonic. For RNAseq data cervical cancer, independent, linear and power function relationships can also be observed in Fig. 3f–h. Figure 3i shows the parabolic relation, which is nonlinear and nonmonotonic. To evaluate the identification ability of different correlation coefficients, we correspondingly show the receiver operating characteristic (ROC) curves of the four methods (sample size of 30) in Fig. 3. Table 1 provides the areas under the ROC curves obtained by applying each method for every gene relationship corresponding to Fig. 3. For each varied sample size n (\(n=15, 30, 60\)), we applied the statistical tests (null hypothesis: two variables independent, alternative hypothesis: dependent). The ROC curves were constructed by repeating the above procedure 1000 times. In the ROC curves, the xaxis denotes the significance level (Pvalue), and the yaxis denotes the proportion of the rejected null hypothesis. The ROC curves show the proportion of the rejected null hypothesis on different significance levels. When the two variables are dependent, the higher the proportion of rejected null hypothesis, the stronger is the correlation coefficient identification ability at the same significance level. Then, if the power curve of the measure increases rapidly, the responding area close to 1 denotes high power. A ROC curve close to diagonal \(45^\circ\) or an area under the ROC curve close to 0.50 is equivalent to random decisions [28].
Next, we will discuss the ability of the four correlation methods to identify the relationship of various variables by using the ROC curves in Fig. 3 and the areas under the ROC curves in Table 1. The results shown in Fig. 3a is the independent relationship in the macrophage: the ROC curve of each algorithm is close to the diagonal \(45^\circ\) and the area under the curve is close to 0.5. All four methods failed to identify any association, which correctly reflected the independent relationship between the two genes. The results are shown in Fig. 3b is the linearity in the liver: the Pearson, Spearman, and distance correlation could quickly identify a linear correlation, but the MIC was relatively weaker in terms of identifying a linear relationship. Figure 3c shows the power function relationship in the macrophage: Pearson and distance correlation were the most powerful methods to identify this relationship, while Spearman was the second most powerful and MIC the weakest. It should be noted that the Pearson coefficient, which is generally considered to have a strong ability to recognize linear relationships, in this case, Fig. 3c can identify the nonlinear monotonic relationships with high power. Figure 3d, e respectively show thresholds relationship in macrophage and quadripartite relationship in the liver, both of which were nonlinear and nonmonotonic. For these kinds of relationships, distance correlation has the strongest identification ability, followed by MIC and Pearson and Spearman as the weakest (lowest power). For RNAseq cervical cancer data, independent, linear and power function relationships are also observed in Fig. 3f–h. The ability to identify the relationship is consistent with the microarray data. Figure 3i reflects the parabolic relationship, which is nonlinear and nonmonotonic. For this kind of relationship, distance correlation has the strongest identification ability, followed by MIC, Spearman, and Pearson as the weakest (lowest power).
Distance correlation is robust to outliers
In statistics, an outlier is a data point that differs significantly from other observations [30]. We selected four pairs of genes from the liver and cervical cancer dataset to observe the effect of outlier points on correlation coefficients. In Fig. 4a, b, d, e, the values of four correlation coefficients with and without outliers are shown, respectively. In Fig. 4a, a point in the bottom left corner was determined as the outlier using the local outlier factor (LOF) algorithm [31]. The LOF algorithm is implemented using the ‘lofactor’ Function in the R package ‘DMwR’ [32]. All the correlation values decreased when the outlier points were removed, and the correlation coefficients became larger when the outliers were in the direction of the regression line of the normal data. In Fig. 4b, five points in the bottom right corner were determined as outliers using the LOF algorithm. Moreover, these outliers were in the normal direction of the regression line and maintained a good distance from it. The correlation coefficients with outliers were smaller than those without outliers. By comparing the two pairs of data, the Pearson coefficient changed the most, while the other three correlation coefficients did not vary as sharply. These results indicate that Pearson correlation is too sensitive to outliers even if there are few, but distance correlation and other correlation coefficients are not as sensitive. Researchers may incorrectly identify the correlation of two genes if the analysis is performed only according to Pearson coefficient values in the presence of outliers. From the ROC curve shown in Fig. 4c corresponding to Fig. 4b, the correlation identification ability of distance correlation is strongest, while the Pearson coefficient is weakest in the presence of gene correlations and outliers.
The phenomenon in which outliers lie in the direction of the regression line and in the normal direction can also be found in the RNAseq data for cervical cancer (see Fig. 4d, e). This conclusion was consistent with the microarray data. Pearson correlation is too sensitive to outliers, and the correlation identification ability of distance correlation is strongest.
For most cases shown in Table 1, if two genes were dependent, the probability of type I error decreased as the sample size increased, the probability of rejecting \(H_0\) increased, and the area under the curve increased. As a result, relationships between genes become easier to recognize when the samples size is increased. However, in the case of outliers in Table 2, the area under the curve obtained by the Pearson coefficient decreased with increasing sample size. Thus, the ability of the Pearson coefficient to identify the relationship between genes decreased because the probability of sampling outliers will be higher with increasing sample size. The Pearson coefficient was too sensitive to outliers; thus, its ability to identify the correlation between two genes was weakened. For the four datasets used in this paper, when outliers existed, the distance correlation was robust to the outliers, and the ability to identify correlations between genes was strongest.
Based on the above findings, distance correlation seemed more appropriate than other correlation coefficients for the identification of relationships between genes. In this paper, distance correlation was applied to construct the gene coexpression network, so we will discuss the difference between distance correlation and the classical Pearson coefficient, which is used in WGCNA. Additionally, the difference between DCWGCNA and WGCNA is also compared and analysed.
Effects of distance correlation on different datasets
Szekely et al. [33] verified that the value of the distance correlation is always less than the absolute value of the Pearson correlation for bivariate normal data. Nevertheless, the difference between the two values is small. From the discussion in “Distance correlation better fits complex relationships” section, the identification capabilities of the linear and monotonous relationships between genes in the Pearson correlation and distance correlation were similar. Therefore, we could estimate that a complex relationship existed between two random variables if the distance correlation value was larger than the Pearson correlation. Here the complex relations refers to the nonbivariate normal data and nonlinear and nonmonotonic relations. Generally, the correlation value greater than 0.8 is described as strong correlation while the value less than 0.5 is described as weak correlation [34]. To measure the proportion of complex relationships in the microarray and RNAseq datasets, we selected pairs of genes with distance correlation coefficients greater than 0.5 from the four datasets. Next, we analysed the distribution of Pearson correlation coefficients of the reserved pairs of genes. Of the total gene pairs, 9.98% had Pearson correlation coefficients less than 0.5 in the macrophage dataset (Fig. 5a). Moreover, the ratios in the liver dataset, (Fig. 5b), the cervical cancer dataset (Fig. 5c) and pancreatic cancer dataset (Fig. 5d) were 10.61%, 6.09% and 12.35%, respectively.
The cutoff threshold was used in the gene coexpression network [35, 36]. If 0.5 was used as the correlation coefficient cutoff threshold, approximately 10% of the complex correlated data information would be lost using the criterion that the Pearson correlation coefficient must be greater than 0.5. In WGCNA, softthresholding power was used, which amplifies disparity between strong and weak correlations. When the Pearson coefficient was used, these complex relationships achieved a smaller value, which would be further weakened by the soft threshold, resulting in a small weight of the two genes and inaccurate clustering results.
Distance correlation shows evidence of SFT
Studies reported by Barabasi and Albert have suggested that the underlying topology of biological networks is approximately ’scalefree’ [37, 38]. It is important to identify hubs which dominate the SFT networks, such as genes, proteins, and metabolites, since they usually have great biological significance [39,40,41,42]. Therefore, we investigated the SFT of the two correlation coefficients for the four datasets.
Softthresholding powers are key parameters in the construction of a gene coexpression network. In Fig. 6, we plotted the scalefree fit index (yaxis) of the Pearsonbased WGCNA (see Fig. 6a, c, e, g) and DCWGCNA (see Fig. 6b, d, f and h) corresponding to the softthresholding power (xaxis) for the four datasets. The closer the scalefree fit index is to 1, the better is the scalefree network. It was previously recommended to choose the softthresholding power when the scalefree fit index first reaches 0.9 [37], which is represented by the red horizontal line.
For macrophage and liver datasets, SFT was achieved when we raised the similar matrix to the 4 power for both Pearson and distance correlation. Moreover, for the liver dataset, the scalefree fit index of the Pearson correlation was less than 0.9, and the scalefree fit index of the distance correlation was greater than 0.9 when the recommended softthresholding power was 4. For the cervical cancer dataset, For both Pearson and distance correlation, after raising the correlation matrix to the power of 4, the SFT was achieved. The scalefree fit index of the Pearson correlation was greater than the distance correlation when the recommended softthresholding power was 4. Likewise, the pancreatic cancer dataset requires the power of 7 and 6. The scalefree fit index of the Pearson correlation was less than that of the distance correlation for the same softthresholding power.
Generally for the datasets, Pearson correlation and distance correlation could both achieve SFT with no significant difference. However, distance correlation showed slight advantages in the scalefree fit index compared with Pearson correlation for the liver and pancreatic cancer datasets when using the same softthresholding power.
Cluster trees and modules
To illustrate the impact of the different correlation measures on the gene coexpression network, we assigned the same parameter for Pearsonbased WGCNA and DCWGCNA with different correlation coefficients only. The cluster trees (dendrograms) of the two algorithms for the four datasets are shown in Fig. 7. The numbers of genes in each module are given in Additional file 1: Table S1.
In Fig. 7, modules are shown in different colours. For the macrophage dataset using Pearsonbased WGCNA, 11 coexpression modules were constructed with 3,611 genes from 316 samples; however, this number was 9 using DCWGCNA. For the liver dataset using Pearsonbased WGCNA, 13 coexpression modules were constructed with 3,089 genes from 288 samples; however, this number was 10 using DCWGCNA. For the cervical cancer dataset, using Pearsonbased WGCNA, 7 coexpression modules were constructed with 3,808 genes from 308 samples; this number was 6 using DCWGCNA. For the pancreatic cancer dataset, 12 coexpression modules were constructed with 3,029 genes from 183 samples both using Pearsonbased WGCNA and DCWGCNA. Thereinto, the grey module composed of genes that were not assigned to any gene coexpression module. The only obvious difference between these two algorithms was the correlation matrix when the input data and the other parameters were kept the same. As shown in Fig. 7, the number of modules identified by DCWGCNA was usually less than that identified by Pearsonbased WGCNA.
It can be intuitively seen that the cluster trees, module size, and cluster results obtained by the gene coexpression network analysis were completely different when the correlation coefficients were different and all other parameters identical. These findings indicate that the selection of correlation coefficients has a significant influence on the results of the gene coexpression network. Therefore, choosing an appropriate correlation coefficient is very important for the construction of a gene coexpression network.
Gene enrichment comparison
Coexpressed genes are often involved in the same biological processes [43]. Therefore, the modules highly enriched for specific gene categories are more reasonable [18]. To compare the average enrichment score and stability of the algorithms, we assigned the appropriate values to the deepSplit and minClusterSize parameters in cutreeDynamic functions of the WGCNA package to make the module numbers obtained by WGCNA based on different correlation coefficients equal.
In the present analysis, we took the Top 3 enrichment scores in the Functional Annotation Clustering of DAVID [44,45,46]. The higher the enrichment score, the lower the Pvalue, thus the greater was the enrichment. The enrichment score of modules is an important index to appraise the rationality of modules. We will discuss the average enrichment scores of the modules derived from the gene coexpression network constructed with four different correlation coefficients to measure the enrichment degree of the coexpression network.
From Table 3, the modules from DCWGCNA had a higher average DAVID enrichment score for the three datasets. The DAVID enrichment score of each module can is given in Additional file 2: Table S2. The higher the DAVID enrichment score, the more reasonable is the module. Modules from DCWGCNA had the highest average DAVID enrichment scores among the three datasets, second only to Pearson correlation in the liver datasets. Modules from MIC almost had the lowest average DAVID enrichment scores, which was consistent with the above analysis of statistical power.
For the liver dataset, the modules from WGCNA had a higher average DAVID enrichment score. We considered this phenomenon to be due to the following two reasons. First, we observed many outliers in the liver dataset, similar to the result shown in Fig. 4a. In some cases, such an ‘outlier’ might actually be biologically meaningful [13]. Second, the enrichment score was influenced not only by the correlation coefficient but also by the randomness of input data, parameter selection, clustering algorithm, module size and the number of genes in modules.
To test whether the identified modules obtained by DCWGCNA are biologically meaningful, take the cervical cancer samples as an example, the highly enriched (Top3) biological process (BP) terms in GO for network modules were summarized. The turquoise module is significantly enriched in categories “epidermis development” (p = 3.4E−14) which is correlated with EpithelialMesenchymal Transition in cervical cancer [47], “angiogenesis” (p = 2.3E−7) which is important early in cervical pathogenesis [48], “keratinocyte differentiation” (p = 1.8E−6) which is dependent by Human Papillomaviruses (HPV) lifecycle, and the cervical cancers are driven by HPV of the highrisk variety [49]. Overall, the enrichment terms show the biologically meaningful of the modules obtained by DCWGCNA.
To compare the biological functions of the modules between the Pearsonbased WGCNA and DCWGCNA. For the cervical cancer dataset, we selected the highly enriched BP terms (\(p<0.01\)) in GO for network modules obtained by the two methods. For the same colour, we calculated the numbers of overlapped BP terms between two modules obtained by the two methods and the proportion of the overlapped BP terms in the BP terms obtained by Pearsonbased WGCNA. Thereinto, the proportion of the turquoise module is 71.62% which shows that the potential biological functions of the modules obtained by different clustering methods are similar sometimes. Simultaneously, the proportion of the brown module is 7.14% which shows that the potential biological functions of the modules obtained by different clustering methods are distinct obviously.
Stability analysis
In this section, we will compare the stability for highly correlated gene pairs and modules. First, we compare the stability based on highly correlated gene pairs. The microarray and RNAseq datasets were divided into two halves with the same number of samples randomly, and calculated the correlation coefficients between different genes for each half independently. We calculated the Pearson coefficient of one half of the data and sorted these gene pairs according to the absolute Pearson value. The most correlated n (\(n = 5000, 10{,}000, 25{,}000, 50{,}000, 100{,}000\)) gene pairs were chosen. Then, we calculated the Pearson coefficient of these n gene pairs for the other half of the data. The stability of highly correlated gene pairs was demonstrated by comparing the mean absolute difference between the two halves of data. The smaller the mean absolute difference, the more stable it is. The same process was conducted for distance correlation. The results are shown in Fig. 8. For the macrophage dataset, Fig. 8a shows that the Pearson coefficient had good stability when the top number was small. With an increase in the top number, the stability of the distance correlation was stronger than the Pearson coefficient. For the liver data set, Fig. 8b shows that the stability of the distance correlation was significantly better than Pearson’s coefficient. Similarly, the advantage became more obvious as the top number increased. For the cervical cancer and pancreatic cancer datasets, the stability of the distance correlation was significantly better than the Pearson coefficient, which can be seen in Fig. 8c, d.
To determine the module stability, the liver data were randomly divided into two groups with the same number of samples, and each group of data was processed independently to obtain gene modules. To discuss the module preservation between the two parts of data, we examine the preservation significance (Fig. 9). The shade of the red colour here represents \(\log (p)\), where p is the Fisher’s exact test pvalue for the overlap between the two modules. The shade of the red colour indicates the significant of the pvalue; The darker the colour, the smaller the Pvalue, the more significant the overlap, and the more stable the module [37]. The numbers in Fig. 9 represent the gene counts in the corresponding module interaction. Since the gray module consists of genes that are not assigned to any module, it is not surprising that the grey module does not overlap with other modules.
We assigned appropriate parameters to make the module numbers obtained by WGCNA and DCWGCNA equal to 9. The recommended softthresholding powers of WGCNA were 5 and 4, the number of the modules whose preservation significances larger than 50 was four (Fig. 9a). The red and green modules of part 1 of WGCNA showed relatively weak preservation. Moreover, the recommended softthresholding powers of DCWGCNA were both 4. In Fig. 9b, five modules whose preservation significance is larger than 50 are shown. The black module of part 1 of DCWGCNA showed relatively weak preservation, and there were fewer modules with weak overlap. For the liver dataset, DCWGCNA was more stable than WGCNA in terms of preservation significance and suggested soft threshold. Other results from the other three datasets can be found in Additional file 3: Fig. S1. The results are similar to those obtained for the liver dataset,showing that DCWGCNA was more stable than WGCNA. Since the division of the data was random, the gene modules were different. However, the stability did not change dramatically in either test.
Discussion
In this paper, distance correlation was applied to WGCNA to construct a new gene coexpression network analysis method and the algorithm was verified to be valid using microarray and RNAseq datasets.
We showed the complexity of gene relationships using gene expression microarray datasets and calculated the correlation coefficients of pairs of genes using different measures. The results showed that the distance correlation ability to detect various complex relationships between genes was strong. Moreover, Pearson correlation was too sensitive to outliers, while distance correlation was robust. Therefore, distance correlation is a good choice to measure the relationship between genes and discuss comparatively reasonable results.
To illustrate the efficiency of using distance correlation, we calculated the proportion of complex relationships in gene relationships when the distance correlation coefficient was greater than 0.5. Approximately 10% of relationships in the both microarray and RNAseq datasets were complex.
To compare the performance of Pearsonbased WGCNA and DCWGCNA, we kept the other parameters constant and constructed a gene coexpression network using these two algorithms. DCWGCNA showed slight advantages in the speed of achieving SFT and the scalefree fit index compared with Pearson correlation when the same softthresholding power was used. For further research, we calculated the DAVID enrichment score of the modules obtained using WGCNA based on four correlation measures. DCWGCNA had a higher average DAVID enrichment score in most of the four datasets compared with the others. Moreover, considering the stability analysis, the modules from DCWGCNA had better module preservation than those from Pearsonbased WGCNA.
DCWGCNA is significant, especially for gene datasets with complex relationships. However, the time complexity of distance correlation is \(O(n^2)\) which is still a relatively expensive computation. The time complexity of distance correlation is greater than that of Pearson correlation of O(n). For example, the calculation of the distance correlation coefficients between approximately 3,000 genes in 300 samples by a PC (CPU: core i7) required approximately 9 hours. Moreover, according to the formula of distance correlation, an \(n\times n\) distance matrix must be calculated. Thus, the memory required is large with a large number of samples. Fortunately, the number of samples analysed in this paper was small; therefore, the required memory was not vast. For the convenience of application, the R code used in this study has been provided in Additional file 4.
Conclusions
Compared with the Pearson correlation coefficient, distance correlation is distribution free and better fits complex relationships. Its limitation is the large computation time and inability to determine positive and negative correlations. When outliers exist and the significant impact on the correlation coefficient value is undesirable, distance correlation is a better alternative choice. However, if these outliers are not errors and highlight their biological significance, we can choose the higher value between Pearson and distance correlations.
In this paper, we only discuss the applicability of distance correlation in gene coexpression network analysis. In fact, distance correlation is also suitable for measuring other complex relationships (such as proteins and metabolites), and we will attempt to apply distance correlation to other network analyses in our future work.
Methods
Distance correlation
Distance correlation was proposed in 2007 by Szekely, Rizzo, and Bakirov [17, 33]. For two random variables X and Y, the distance correlation coefficient is denoted as R(X, Y). And, \(V_{n}(X,Y)\) denotes the empirical distance covariance which is a nonnegative number defined by:
Here, \(A_{kl}\) and \(B_{kl}\) are defined as \(A_{kl}=a_{kl}{\bar{a}}_{k.}{\bar{a}}_{.l}+{\bar{a}}_{..}\) and \(B_{kl}=b_{kl}{\bar{b}}_{k.}{\bar{b}}_{.l}+{\bar{b}}_{..}\), where \(a_{kl}=\Vert X_kX_l\Vert _p,\ b_{kl}=\Vert Y_kY_l\Vert _q\) with \(k,l=1,\ldots , n\).
Likewise, for a random variable X, \(V_n (X)\) is nonnegative and defined as
Finally, the empirical distance correlation \(R_n (X,Y)\) is the square root of the following scheme
In this paper, we use the energy package in R to calculate the distance correlation (see the references in the manual for more details; https://CRAN.Rproject.org/package=energy).
Hypothesis testing and ROC curves
To evaluate the ability of the correlation coefficients for identifying the correlation between gene expression data, the concept of the receiver operating characteristic (ROC) curve was introduced based on the correlation coefficient hypothesis test. The test of dependence between X and Y is described as a hypothesis test as follows: \(H_0\): X and Y are independent. \(H_1\): X and Y are dependent. The definition of the ROC curves, the hypothesis testing, and realization methods used for the four correlation coefficients are referred to [28].
Scalefree topology
Scalefree networks are complex networks which have a few highly connected nodes and most poorly connected nodes [50, 51]. The relationship between the topological properties of network nodes (genes and proteins) and functional essentiality in interaction networks is wellknown [42, 52, 53].
SFT [4] has determined that the frequency distribution p(k) of connectivity number follows the power law: \(p(k)\sim k^{\gamma }\), where k is a nonnegative real number. The weighted gene coexpression network conforms to SFT. Whether the network satisfies approximate SFT can be intuitively determined by drawing log(p(k)) versus log(k). The model fitting index \(R^2\) is the squared of the correlation coefficient between log(p(k)) and log(k). There is a straight line between log(p(k)) and log(k) when \(R^2\) of the model approaches 1.
The WGCNA R package [37] provides functions that help to choose the parameters (pickSoftThreshold), and the function scaleFreePlot help to evaluate whether the network exhibits SFT.
WGCNA
WGCNA is a popular tool for identifying modules of highly correlated genes [4, 37, 54, 55]. The function of WGCNA is plentiful; only part has been used in this paper. We will introduce the process of module division of WGCNA. For other functions, we refer to the tutorials for the WGCNA package at https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/. The method used for network construction proceeded as follows. First, a similarity coexpression matrix was calculated with Pearson or another correlation coefficient cor(i, j) for all gene expression profiles. In this paper, we focused on the unsigned network in which the correlation coefficients are changed into absolute values. Next, an adjacency matrix is obtained from the similarity coexpression matrix by using the softthresholding power. The power is selected according to the criteria of approximately fitting the SFT network. Then, from the adjacency matrix, a topological overlap matrix is obtained. Based on the dissimilarity topological overlap matrix, a dendrogram was generated by using the hierarchical clustering method. Finally, different numbers of modules were obtained by dynamic tree cutting.
DAVID enrichment analysis
DAVID [45, 46] is a database of bioinformatics resource which is available at http://david.abcc.ncifcrf.gov/. In this paper, the enrichment score of the DAVID Functional Annotation Clustering report is used. The report combines annotation terms from 14 public databases including KEGG pathways and Gene Ontology. The report makes the similar annotations class together and makes the biology clearer. The Group Enrichment Score is the geometric mean of the log pvalues in the annotation term cluster, which is used to evaluate the enrichment significance of a gene module.
Module stability
To evaluate the stability of the modules obtained by Pearsonbased WGCNA and DCWGCNA, one dataset was divided into two groups with the same number of samples randomly, and each group of data was processed independently to obtain gene modules to discuss the module preservation between the two parts. We calculated the overlaps of the two modules for each pair and used Fisher’s exact test to obtain a pvalue for each of the pairwise overlaps. Module stability was processed by the R code (see [37]).
Datasets
We evaluated the performance of this approach primarily on four different arrays (macrophage and liver) and RNAseq (Cervical cancer and pancreatic cancer) datasets.
The macrophage and liver datasets were expression profiling by the array. We used 329 samples for macrophage dataset [22,23,24]. The liver dataset was gene expression in mice tissues consisting of 288 samples [25, 26]. Moreover, the data of both datasets were normalized using the RMA method. The macrophage and liver data can be downloaded from http://www.ncbi.nlm.nih.gov/geo/ under series GSE38705 and GSE16780, respectively.
The computation complexity of distance correlation is large \(O(n^2)\). To overcome this computational difficulty and obtain the distance correlation coefficient, it is crucial to screen genes and limit the number of genes screened to a small number. From a biological point of view, a gene must be expressed to some extent before it can be translated into protein or considered biologically significant [56], and lowexpression genes may be difficult to discern from noise [57]. Thus, we removed the genes with low expression by filtering. Simultaneously, we selected the genes whose expression varied significantly, since the genes without variation are uninformative for network analysis [18]. We calculated the coefficient of variation for each probe set. Then, the dataset will be reduced by selecting probes beyond the average intensity and the coefficient of variation (CV) is greater than 5%. In this way, we selected 3,611 genes for the macrophage dataset and 3,089 genes from the liver dataset for analysis.
Cervical cancer and pancreatic cancer datasets from Gene expression RNAseq were performed using TCGA: https://www.cancer.gov/tcga. About the technique of obtaining cervical cancer and pancreatic cancer datasets, we refer to [58, 59]. The gene expression data of the datasets are transformed \(log_2(x+1)\) and RSEM normalized count. Then, we calculated the average and CV of the gene expression profiles. And, the dataset was reduced by selecting probes beyond the average intensity and CV is greater than 10%. We selected 3808 genes in 308 samples for the cervical cancer dataset. For the pancreatic cancer dataset, there are 3029 genes in 183 samples.
Availability of data and materials
The macrophage and liver datasets can be download at http://www.ncbi.nlm.nih.gov/geo/ under accession numbers GSE38705 and GSE16780, respectively. Cervical cancer and pancreatic cancer datasets from Gene expression RNAseq were performed using TCGA: https://www.cancer.gov/tcga. The R code associated with this study is published in one of the supplementary materials.
Abbreviations
 WGCNA:

Weighted gene coexpression network analysis
 DCWGCNA:

Distance correlationbased weighted gene coexpression network analysis
 MIC:

Maximal information coefficient
 MICA:

Maximal information component analysis
 HMDP:

Hybrid mouse diversity panel
 RMA:

Robust multiarray average
 TCGA:

The Cancer Genome Atlas
 SFT:

Scalefree topology
 QQ:

Quantile quantile
 ROC:

Receiver operating characteristic
 LOF:

Local outlier factor
 EASE:

Expression analysis systematic explorer
 DAVID:

Database for annotation, visualization and integrated discovery
 GO:

Gene Ontology
 BP:

Biological process
 HPV:

Human Papillomaviruses
 KEGG:

Kyoto encyclopedia of genes and genomes
 OxPAPC:

Oxidized 1palmitoyl2arachidonoylsnglycero3phosphorylcholine
 TCGA:

The cancer genome atlas
 RSEM:

RNAseq by expectationmaximization
 CV:

Variation coefficient
 CRAN:

Comprehensive R archive network
References
De Smet R, Marchal K. Advantages and limitations of current network inference methods. Nat Rev Microbiol. 2010;8(10):717.
Zhang J, Huang K. Normalized imqcm: an algorithm for detecting weak quasicliques in weighted graph with applications in gene coexpression module discovery in cancers. Cancer Inform. 2014;13:14021.
Saelens W, Cannoodt R, Saeys Y. A comprehensive evaluation of module detection methods for gene expression data. Nat Commun. 2018;9(1):1090.
Zhang B, Horvath S. A general framework for weighted gene coexpression network analysis. Stat Appl Genet Mol Biol. 2005;4(1):17.
Steuer R, Kurths J, Daub CO, Weise J, Selbig J. The mutual information: detecting and evaluating dependencies between variables. Bioinformatics. 2002;18(suppl2):231–40.
Jiang D, Tang C, Zhang A. Cluster analysis for gene expression data: a survey. IEEE Trans Knowl Data Eng. 2004;16(11):1370–86.
Botía JA, Vandrovcova J, Forabosco P, Guelfi S, D’Sa K, Hardy J, Lewis CM, Ryten M, Weale ME, Consortium UKBE, et al. An additional kmeans clustering step improves the biological features of WGCNA gene coexpression networks. BMC Syst Biol. 2017;11(1):47.
Spearman C. The proof and measurement of association between two things. Am J Psychol. 1987;100(3/4):441–71.
Song L, Langfelder P, Horvath S. Comparison of coexpression measures: mutual information, correlation, and model based indices. BMC Bioinform. 2012;13(1):328.
Hou J, Ye X, Li C, Wang Y. Kmodule algorithm: an additional step to improve the clustering results of WGCNA coexpression networks. Genes. 2021;12(1):87.
Giorgi FM, Del Fabbro C, Licausi F. Comparative study of RNAseqand microarrayderived coexpression networks in Arabidopsis thaliana. Bioinformatics. 2013;29(6):717–24.
Chok NS. Pearson’s versus Spearman’s and Kendall’s correlation coefficients for continuous data. Ph.D. thesis, University of Pittsburgh; 2010.
Mutwil M. Integrative transcriptomic approaches to analyzing plant coexpression networks. PhD thesis, University of Potsdam;2010.
Mutwil M, Usadel B, Schütte M, Loraine A, Ebenhöh O, Persson S. Assembly of an interactive correlation network for the Arabidopsis genome using a novel heuristic clustering algorithm. Plant Physiol. 2010;152(1):29–43.
Petereit J, Smith S, Harris FC, Schlauch KA. petal: coexpression network modelling in r. BMC Syst Biol. 2016;10(2):51.
Reshef DN, Reshef YA, Finucane HK, Grossman SR, Mcvean G, Turnbaugh PJ, et al. Detecting novel associations in large data sets. Science. 2011;334(6062):1518–24.
Székely GJ, Rizzo ML, Bakirov NK, et al. Measuring and testing dependence by correlation of distances. Ann Stat. 2007;35(6):2769–94.
Rau C, Wisniewski N, Orozco LD, Bennett BJ, Weiss JN, Lusis AJ. Maximal information component analysis: a novel nonlinear network analysis method. Front Genet. 2013;4:28.
Simon N, Tibshirani R. Comment on “detecting novel associations in large data sets” by Reshef et al, science Dec 16, 2011; 2014. arXiv:1401.7645.
Kinney JB, Atwal GS. Equitability, mutual information, and the maximal information coefficient. Proc Natl Acad Sci USA. 2014;111(9):3354–9.
Gorfine M, Heller R, Heller Y. Comment on “detecting novel associations in large data sets”. Eprint Arxiv; 2015.
Orozco LD, Bennett BJ, Farber CR, Ghazalpour A, Pan C, Che N, et al. Unraveling inflammatory responses using systems genetics and geneenvironment interactions in macrophages. Cell. 2012;151(3):658–70.
Codoni V, Blum Y, Civelek M, Proust C, Franzén O, Consortium C, CADGenomics ILC, Björkegren JL, LeGoff W, Cambien F, et al. Preservation analysis of macrophage gene coexpression between human and mouse identifies park2 as a genetically controlled master regulator of oxidative phosphorylation in humans. G3 Genes Genomes Genet. 2016;6(10):3361–71.
Buscher K, Ehinger E, Gupta P, Pramod AB, Wolf D, Tweet G, Pan C, Mills CD, Lusis AJ, Ley K. Natural variation of macrophage activation as diseaserelevant phenotype predictive of inflammation and cancer survival. Nat Commun. 2017;8(1):1–10.
Bennett BJ, Farber CR, Orozco L, Kang HM, Ghazalpour A, Siemers N, et al. A highresolution association mapping panel for the dissection of complex traits in mice. Genome Res. 2010;20:281–90.
Ghazalpour A, Bennett BJ, Shih D, Che N, Orozco L, Pan C, et al. Genetic regulation of mouse liver metabolite levels. Mol Syst Biol. 2014;10(5):730.
Mukaka MM. A guide to appropriate use of correlation coefficient in medical research. Malawi Med J. 2012;24(3):69–71.
de Siqueira Santos S, Takahashi DY, Nakata A, Fujita A. A comparative study of statistical methods used to identify dependencies between gene expression signals. Brief Bioinform. 2013;15(6):906–18.
Shapiro SS, Wilk MB. An analysis of variance test for normality (complete samples). Biometrika. 1965;52(3/4):591–611.
Grubbs FE. Procedures for detecting outlying observations in samples. Technometrics. 1969;11(1):1–21.
Breunig MM, Kriegel HP, Ng RT, Sander J Lof: identifying densitybased local outliers. In: ACM Sigmod Record, vol. 29. ACM; 2000. p. 93–104.
Torgo L. Data mining with R: learning with case studies. New York: Chapman and Hall; 2011.
Székely GJ, Rizzo ML. Brownian distance covariance. Ann Appl Stat. 2009;3(4):1236–65.
Castro Sotos AE, Vanhoof S, Van Den Noortgate W, Onghena P. The transitivity misconception of Pearson’s correlation coefficient. Stat Educ Res J. 2009;8(2):33–55.
Sanoudou D, Haslett JN, Kho AT, Guo S, Gazda HT, Greenberg SA, Lidov HG, Kohane IS, Kunkel LM, Beggs AH. Expression profiling reveals altered satellite cell numbers and glycolytic enzyme transcription in nemaline myopathy muscle. Proc Natl Acad Sci. 2003;100(8):4666–71.
Rahmani B, Zimmermann MT, Grill DE, Kennedy RB, Oberg AL, White BC, Poland GA, McKinney BA. Recursive indirectpaths modularity (ripm) for detecting community structure in RNAseq coexpression networks. Front Genet. 2016;7:80.
Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinform. 2008;9(1):559.
Barabási AL, Gulbahce N, Loscalzo J. Network medicine: a networkbased approach to human disease. Nat Rev Genet. 2011;12(1):56.
Atiia AA, Hopper C, Inoue K, Vidal S, Waldispuhl J. Computational intractability law molds the topology of biological networks. Appl Netw Sci. 2020;5:34.
Albert R. Scalefree networks in cell biology. J Cell Sci. 2005;118(21):4947–57.
Nafis S, Kalaiarasan P, Brojen Singh R, Husain M, Bamezai RN. Apoptosis regulatory protein–protein interaction demonstrates hierarchical scalefree fractal network. Brief Bioinform. 2014;16(4):675–99.
Andrecut M, Kauffman S, Madni A. Evidence of scalefree topology in gene regulatory network of human tissues. Int J Mod Phys C. 2008;19(02):283–90.
van Dam S, Cordeiro R, Craig T, van Dam J, Wood SH, de Magalhães JP. GeneFriends: an online coexpression analysis tool to identify novel gene targets for aging and complex diseases. BMC Genomics. 2012;13(1):535.
Huang DW, Sherman BT, Tan Q, Collins JR, Alvord WG, Roayaei J, et al. The DAVID gene functional classification tool: a novel biological modulecentric algorithm to functionally analyze large gene lists. Genome Biol. 2007;8(9):183.
Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2008;4(1):44.
Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2008;37(1):1–13.
Lee M, Chou C, Tang M, Shen M. Epithelialmesenchymal transition in cervical cancer: correlation with tumor progression, epidermal growth factor receptor overexpression, and snail upregulation. Clin Cancer Res. 2008;14(15):4743–50.
Minion LE, Tewari KS. Cervical cancerstate of the science: from angiogenesis blockade to checkpoint inhibition. Gynecol Oncol. 2018;148(3):609–21.
Rodrigues C, Joy LR, Sachithanandan SP, Krishna S. Notch signalling in cervical cancer. Exp Cell Res. 2019;385(2):111682.
Barabási AL, Bonabeau E. Scalefree networks. Sci Am. 2003;288(5):60–9.
Barabási AL, Albert R. Emergence of scaling in random networks. Science. 1999;286(5439):509–12.
Albert R, Jeong H, Barabási AL. Error and attack tolerance of complex networks. Nature. 2000;406(6794):378.
Jeong H, Mason SP, Barabási AL, Oltvai ZN. Lethality and centrality in protein networks. Nature. 2001;411(6833):41.
Tai Y, Liu C, Yu S, Yang H, Sun J, Guo C, et al. Gene coexpression network analysis reveals coordinated regulation of three characteristic secondary biosynthetic pathways in tea plant (camellia sinensis). BMC Genomics. 2018;19(1):616.
Luo M, Zhang Q, Xia M, Hu F, Ma Z, Chen Z, et al. Differential coexpression and regulatory network analysis uncover the relapse factor and mechanism of T cell acute leukemia. Mol Ther Nucleic Acids. 2018;12:184–94.
Chen Y, Lun AT, Smyth GK. From reads to genes to pathways: differential expression analysis of RNAseq experiments using Rsubread and the edgeR quasilikelihood pipeline. F1000Research. 2016;5:1438.
Sha Y, Phan JH, Wang MD. Effect of lowexpression gene filtering on detection of differentially expressed genes in RNAseq data. In: 2015 37th annual international conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE; 2015. p. 6461–6464.
Network CGAR, et al. Integrated genomic and molecular characterization of cervical cancer. Nature. 2017;543(7645):378–84.
Raphael BJ, Hruban RH, Aguirre AJ, Moffitt RA, Yeh JJ, Stewart C, Robertson AG, Cherniack AD, Gupta M, Getz G, et al. Integrated genomic characterization of pancreatic ductal adenocarcinoma. Cancer Cell. 2017;32(2):185–203.
Acknowledgements
Not applicable.
Funding
This work was supported by the National Natural Science Foundation of China (Grant No. 41876100, 61471139), the Development Project of Applied Technology in Harbin (Grant No.2016RAXXJ071) and the HEU Fundamental Research Funds for the Central University (Grant No. 3072019CFG0401). The funding body had no role of the funding body in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Author information
Authors and Affiliations
Contributions
JH and XY designed the study. JH, WF, QZ and YH analyzed the data. JH, YSL and QZ contributed to the interpretation of the results. JH, XY, YL, YSL and YW drafted the work and made critical revisions. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1: Tabls S1.
The number of genes in each module.
Additional file 2: Tabls S2.
The gene number and DAVID enrichment score in each module.
Additional file 3: Fig. S1.
Module preservation between even partitionings of datasets.
Additional file 4. Code S1.
The R code of the study.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Hou, J., Ye, X., Feng, W. et al. Distance correlation application to gene coexpression network analysis. BMC Bioinformatics 23, 81 (2022). https://doi.org/10.1186/s1285902204609x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285902204609x
Keywords
 Gene expression
 Distance correlation
 WGCNA
 Enrichment analysis