 Research
 Open Access
 Published:
Gene coexpression network based on part mutual information for genetogene relationship and genecancer correlation analysis
BMC Bioinformatics volume 23, Article number: 194 (2022)
Abstract
Background
Finding correlation patterns is an important goal of analyzing biological data. Currently available methods for correlation analysis mainly use nondirect associations, such as the Pearson correlation coefficient, and focus on the interpretation of networks at the level of modules. For biological objects such as genes, their collective function depends on pairwise genetogene interactions. However, a large amount of redundant results from module level methods often necessitate further detailed analysis of gene interactions. New approaches of measuring direct associations among variables, such as the part mutual information (PMI), may help us better interpret the correlation pattern of biological data at the level of variable pairs.
Results
We use PMI to calculate gene coexpression networks of cancer mRNA transcriptome data. Our results show that the PMIbased networks with fewer edges could represent the correlation pattern and are robust across biological conditions. The PMIbased networks recall significantly more important parts of omics defined genepair relationships than the Pearson Correlation Coefficient (PCC)based networks. Based on the scores derived from PMIrecalled copy number variation or DNA methylation genepairs, the patients with cancer can be divided into groups with significant differences on disease specific survival.
Conclusions
PMI, measuring direct associations between variables, extracts more important biological relationships at the level of gene pairs than conventional indirect association measures do. It can be used to refine module level results from other correlation methods. Particularly, PMI is beneficial to analysis of biological data of the complicated systems, for example, cancer transcriptome data.
Introduction
In the era of big data, life science research has produced vast amounts of data from different aspects of the biological systems. These aspects are usually referred to as omics such as genomics and transcriptomics. The complexity of a biological issue, such as cancer progression, requires integration and systematic analysis of data from multiple omics. Current studies mainly center on gene expression because of their direct contributions to the phenotype of the research object. A study often aims to find which gene can be used as a target to perturb a biological system of interest. In this case, the correlation pattern of gene expression will be the key for the multiomics integration as it is either the results of the regulations from other omics or the reasons contributing to the states of other omics. The correlation pattern of gene expression, also referred to as a correlation network, is usually represented as an indirect graph where the genes are nodes and the intergene correlations are edges. The edges in a correlation network are defined by correlation measures, such as the Pearson Correlation Coefficient (PCC), and Spearman Correlation Coefficient (SCC), etc. To obtain important genes from the correlation network, a frequently used method is to extract network modules. A network module on the gene correlation network is a set of genes that have high internal correlation but are relatively isolated from the other parts of the whole network. A set of genes forming one network module indicates that they may be regulated by similar mechanisms and contribute to common biological functions. This module level correlation analysis has been remarkably successful in interpreting gene expression data. The most popular algorithm for network module extraction is the Weight correlation network analysis (WGCNA). WGCNA uses correlation measures such as PCC to initialize the whole network, and uses the topological overlay measures for module extraction. This method can give robust results at the module level [1].
However, the module level interpretation of gene expression data lacks detailed analysis at the genepair level. The implementation of biological function relies on the cooperation of genes. The basic functional unit of a biological system is the relationship between a pair of genes defined by omics. For example, a transcriptional regulation relationship is a transcription factor regulates a target gene. The protein–protein interactome consists of thousands of relationships of two proteins that interact with each other. Obviously, the expression pattern of genes is highly related to those omics defined relationships. Hence, analyzing the correlation pattern of gene expression at the genepair level is necessary to better understand biological functions and to enable multiple omics integration. In the human genome, the real expression relationships among genes contains cascading structures, and interweaves into a network. When using correlation measures like PCC, there will be many dense cliquelike structures in the resulting correlation network, while the real active biological relationship may not be that densely organized. This redundancy of edges (genepairs) prevents us from finding the real important gene or intergene relationships. In the meantime, PCClike correlation measures, such as the SCC or biweight midcorrelation are restricted for linear or monotonic correlations. But the real biological relationship may not be satisfied by this mathematical assumption.
Partial or semipartial(part) statistics may be helpful to eliminate redundancy. To detect nonmonotonic association, Mutual Information (MI) is a theoretical solution. To overcome these two restrictions at one time, we should use a MIlike multivariate statistic for correlation analysis. Part Mutual Information (PMI) is a recently invented algorithm that satisfies our requirement. PMI has shown better performance on simulated data based on yeast gene regulatory network than many other algorithms including direct association like Conditional Mutual Information (CMI) [2]. PMI calculates the correlation network in a very unique way. PMI starts with a fully connected network and proceed with checking every PMI value among all combinations of variables then retains only the top one [2]. Compared to methods like the partial/part correlation matrix calculation implemented by R package ppcor, the PMI network calculation procedure may intrinsically consider the heterogeneity of distribution of correlation strength. The resulting networks are also sparser [3]. All these mathematical features are combined. The theoretical advantage of PMI for biological correlation analysis is promising. But as a new algorithm, PMI has not been widely used in real data. The high calculation complexity makes it hard to apply PMI to data with the scale of the human genome. In the meantime, switching from PCC to PMI in the analysis of correlation patterns of biological data raises questions in the multiomics viewpoint, e.g., what is direct biological association, and how a direct biological association is different from an indirect association, when connecting them to biological mechanisms stemming from multiple omics. The mathematical concept of direct association has indicated that the variablepairs found by PMI are more important than those found by PCC, because the later one is indirect and redundant. We need to understand this difference in a biological context.
In this paper, we apply PMI network calculation procedure to analyze the correlation of mRNA transcriptome data of human cancer and confirm that PMI direct associations are more biologically significant than PCC indirect associations. We also apply the PMIbased correlations for cancer survival analysis.
First, we construct PMI networks for the differential coexpression status of chosen gene sets using a multiplestage breast cancer dataset. The results suggest PMI is more efficient in capturing important generelationships. We calculate PMI networks from several subtype cancer datasets, and compare the recall of gene correlation networks using PMI or PCC to the omics defined genepair relationships. We find that PMI is more efficient in recalling these relationships, and that the recalled relationships in the PMI network have significantly higher omics defined weights than those in the PCC network. Finally, we extracted genepairs in PMI networks that match with genepair relationships defined by copy number variation or DNA methylation in cancer. We then test their individual influence on patient survival, and build a simple scoring system to classify patient survival. Our results suggest PMI can extract more biologically important results at the genepair level than PCC. The PMI network calculation procedure can provide networks sparse enough and are biological meaningful for further detailed analysis at the level of variable pairs. PMIbased correlation analysis will be more helpful in the study of complex biological systems with multiple omics.
Results
PMI correlation change network for cancer progression
In the first part, we want to test the performance of PMI extracting biological meaningful genepairs in an intuitive way. We want to obtain an optimal PMI network for each input data, the PMI network calculation procedure will be repeated with gradually increased thresholds until the resulting networks are stable. We collect a breast cancer mRNA transcriptome dataset, which contains 2 cancer progression with 4 stages. If we directly use the whole genome data to calculate an optimal PMI network, the calculation will be too much to finish. To obtain the optimal PMI networks, we divide the whole genome into small gene sets by WGCNA module extraction and GO enrichment analysis provided by WebGestalt [1, 4]. Then, we calculate the optimal PMI networks of one gene set on each stage of one cancer progression and combine the four networks into one integrated network, which hereafter is referred to as the PMI correlation change network.
To find the important parts involved in cancer development, we exam a variety of detailed structures in a correlation change network and review literatures that report their functions in cancer development. In this part, we use the correlation change networks of GO:0032609 production of interferongamma as an example. This GO term is enriched in all stages of ER negative and positive processes (Fig. 1) [5]. When we place the nodes by the mean stage values (the average of stage numbers of one node’s edges), the wellknown immunosuppressive genes can be observed in the late phase, such as EBI3 in ER negative network, and CD274 (PDL1) and PDCD1LG2 (PDL2) in ER positive network [6,7,8]. There will be several hub nodes in one correlation change network, in which some nodes have edges that exist in different stages. These hub node genes, in combination with their edge stage patterns, reflect the changes during cancer development. For example, XCL1, a chemokine ligand for T cell attraction [7], has 9 neighbors in the ER positive network. XCL1 forms a connection with tumor suppressors: IRF8 and ITK in early cancer stages [9, 10]. In later stages, CD274 (PDL1), an immunosuppressor, correlates with XCL1. The correlation change network of GO term interferongamma production gives a reasonable representation of the biological mechanisms of cancer progression.
In the meantime, we address the performance and reliability of an optimal PMI network. Specifically, we would like to confirm that the optimal PMI network possess the essential part of the correlation in the PCC network for the same input data, even if the optimal PMI network is much sparser than the PCC network. Similar to the expression of housekeeping genes in a genome, a set of genes, if they are functionally involved in broad and a variety of different biologic conditions, will have some constitutive correlations. In the two PMI correlation change networks of GO term interferongamma production, we observe similar network modules that represent the core function of this GO term. These modules consist of 5 genes (CD2, ITK, SASH3, EOMES and IRF8) with similar stage composition and topological structure (Fig. 2). CD2 is an antigen expressed in all peripheral T cells [7]. ITK is an essential kinase for T cell differentiation and proliferation [7]. So, a constitutive correlation between CD2 and ITK is reasonable and represents the existence of T cells. SASH3 is an adaptor protein, probably downstream of the TCR [11]. EOMES is an essential transcriptional factor (TF) for the differentiation of effective CD8+ T cell [7]. IRF8 is an interferon regulatory factor with tumor suppressive function [9]. As these 3 genes all relate to (or affect) T cell activity in disease situation, it’s reasonable that they connect with the CD2ITK center in cancer stages but not in normal situations. Similar network modules can be found in other GO term networks. Their existence suggests optimal PMI networks can capture biological mechanisms and are effective in giving reproducible results.
PMI correlation is closer to omics defined relationship
We match edges from PMI networks and comparable PCC networks with multiple types of omics defined genepair relationship, and compare the recalls between PMI and PCC. The results of the comparison are summarized in Table 1, the details are listed in supplementary tables. In normalstatus omics, including TR, CoTR, PPI and SCGD, both PMI and PCC networks nonrandomly recall these relationships in almost all cancer subtypes. PMIrecalls have higher nonrandomness and higher weights than comparable PCCrecalls, suggesting PMI networks captured more important parts than PCC networks. TR and CoTR are directly associated with gene expression. But there are too few PMIrecalls of TR in the whole network (Additional file 2: Table S1). Although the PMIrecalls of CoTR cover a fair share of each network, the unrecalled parts are significantly larger than the recalled parts, suggesting the CoTR relationships may lack specificity (Additional file 3: Table S2, Additional file 4: Table S3). On the other hand, PMIrecalls of SCGD have more edges and significantly lower distance values than the PCCrecalls (Additional file 5: Table S4). This contrast suggests that the proximity of a pair of genes is one of the differences between PMI direct associations and PCC indirect associations. For PPI, the PMIrecalls have more edges and higher nonrandomness than the PCCrecalls in all cancer subtypes (Additional file 6: Table S5).
Similar contrast between PMI and PCC is also observed in diseasespecific omics. We look the 4 types of CoCNV (GG, LL, LG and GL). For one pair of genes in one sample, PMIrecalls of GG and LL types have higher nonrandomness and higher weight than comparable PCCrecalls in almost all cancer subtypes (Additional file 7: Table S6, Additional file 8: Table S7, Additional file 9: Table S8, Additional file 10: Table S9). While the GL and LG types seem to be nonrandomly excluded by PMI in some cancer subtypes, this tendency of exclusion is less significant in PCCrecalls (Additional file 7: Table S6, Additional file 8: Table S7). For the 4 types of the CoDM, only the HH type is nonrandomly recalled by PMI networks from each sample in most cancer subtypes, while the nonrandomness of recall of the other 3 CoDM types is not very consistent across cancer subtypes. PMIrecalls of the HH type of CoDM have higher nonrandomness and higher weight than comparable PCCrecalls in most cancer subtypes (Additional file 11: Table S10). The LL, HL and LH types of CoDM also have higher weights in PMIrecalls than PCCrecalls (Additional file 12: Table S11, Additional file 13: Table S12, Additional file 14: Table S13, Additional file 15: Table S14).
In summary, PMI networks recall relationships with higher nonrandomness and higher weights than comparable PCC networks, in both normalstatus omics and cancerspecific omics. These observations prove that the PMI network is generally closer to omics relationships, although it is not determinable for one specific PMI edge that is dominated by one specific omics defined relationship. We demonstrate that the mathematical concept of direct association is more biologically significant.
Cancer survival analysis based on PMI network
We perform univariate survival analysis with the genepairs of CoCNV or CoDM from PMI networks. Patients are divided into groups by the CoCNV or CoDM types of one genepair, two groups of patients with significant different survival give one significant survival factor. Such as a CoCNV survival factor: Gene1Gene2 GG > LL, a CoDM survival factor: Gene3Gene4 HHhigh > HHlow. Many significant CoCNV or CoDM survival factors exist in most cancer subtypes (Additional file 16: Table S15, Additional file 17: Table S16, Fig. 3). The CNV and DM status of one gene directly influences the gene transcription. So, the CoCNV and CoDM survival factors should impact patient survival through the changes of gene expression. However, the gene expression levels between groups divided by these factors are not significantly different. When we use the median value of the expression of these genes to divide patients into low and high expression groups, the survival statuses of the two groups are usually not significantly different from each other. The CNV and DM provide unique information about patient survival that cannot be superseded by expression level. The genepairs of significant CoCNV or CoDM survival factors can be a set of isolated edges scattered on the network or some small modules on the network (Fig. 4). Both isolated edges and small modules of factor sets can be used to calculate the scores for patients and divide the patients into groups with significant differences of survival. The score calculated with more survival factors gives better divided patient groups with survival probability (Fig. 5). Cox multivariate analysis with the scores and clinical covariates shows both the CoCNV scores and CoDM scores are stronger classification variables than clinical covariates, and the two of them can be used together to achieve better classification (Additional file 18: Table S17, Additional file 19: Table S18). Only the LUSC00 has sufficient samples for subsampling remodel test, the distribution of pvalues of the covariates confirmed the consistency of results (Additional file 20: Table S19, Additional file 21: Table S20). So, the PMIrecalled genepairs with CoCNV or CoDM relationships are systematically related to the probability of cancer survival, and they may influence the mechanisms for cancer progression.
In the meantime, we find a notable difference between CoCNV and CoDM. For CoCNV, some genes in the PMI network gene set can classify patients into groups with significantly different survival probability by their single gene CNV, such as CTLA4, ORM1, and RASAL3 in triple negative breast cancer. When switching from single gene CNV to genepair CoCNV, many genes have improved significance of survival classification (Additional file 22: Table S21). For example, in triple negative breast cancer, PTPN7 cannot separate patients into groups with significant survival difference. The pvalue of survival test between PTPN7 gain group and PTPN7 unchanged group is 0.25. When cooperating with RASAL3, the PTPN7 and RASAL3 GG group has a significantly better survival probability than the NC group with pvalue 0.02. This GG group is also more significant than the division by single gene RASAL3 (pvalue = 0.03). This result suggests that the overall influence of CNV in one cancer subtype is an accumulation of the single CNV of each gene.
In term of CoDM, the genes that can classify patient survival by their single gene DM never show up in the significant CoDM survival factors in all cancer subtypes. For example, there are 13 singlesignificant genes and 47 pairsignificant genes in triple negative breast cancer without overlap. This result suggests that the impact of DM in one cancer subtype consists of two independent parts, a singlegene part and a pairofgene part.
Discussion
Correlation analysis plays an important role in the analysis of biological data. Most frequently used algorithms focus on obtaining module level interpretation from only one type of data. In case of multiple omics integration, there is a need to find important pairs of variables (edges) from the correlation networks. Current methods in frequent use for correlation analysis give too dense correlation networks. The edges in a dense correlation network contain too much redundancy to highlight important pairs of variables. The direct association and causal interaction are two ways to deal with redundancy. Among direct associations, PMI has a very unique calculation procedure that checks all combinations of the input variables, which may be helpful to get more meaningful results for biological data. Causal interaction usually refers to the directed relationship from TF to target gene. This kind of algorithms often rely on special restriction for the structure of coexpression network or take prior knowledge of whether a gene is TF. The results of causal interaction detection are supposed to be restricted in the field of transcriptional regulation [12]. Considering the abovementioned two points, the PMI network calculation procedure seems to be an ideal method to detect generally important parts of the biological system. Due to the high calculation complexity and unique calculation procedure, PMI has not been widely used in the analysis of biological data. The performance of PMI networks on real experiment data needs to be tested. Also, we are interested in learning the differences between PCC indirect associations and PMI direct associations as well as their connections to biological relationships. In this paper, we first trialed PMI on expression data of WGCNAGO enriched small gene sets. We show that optimal PMI networks can display the whole correlation pattern with fewer edges while retaining important core biological relationships. This will be very helpful to uncover important correlations of a pair of genes. So, using PMI to refine the results from traditional algorithms is promising.
Based on the assumption that the correlation pattern of gene expression data partly matches the relationships defined by omics, we match edge lists from PMI/PCC networks and compare their recall status. Although a single specific edge in the PMI network may not fully and precisely match one omics defined genepair relationship, the whole PMI network is statistically closer to many types of omics defined genepair relationships, as compared to the comparable PCC network. So, PMI direct association is indeed more biologically significant than PCC indirect association. A PMIbased correlation network can better represent the correlation pattern of the biological data and is more helpful for multiomics integration.
In the last part, we use genepairs with CoCNV or CoDM relationships from the PMI/PCC networks of cancer subtypes to classify patient survival. In the case of the log rank test statistic values of single survival factors, PMI does not show significantly better classification than PCC. When we put the genepairs of both PMI and PCC significant survival factors on a network, we see that PMI covers more isolated edges than PCC. Excluding the PCConly (Green in Fig. 6, Additional file 1: Fig. S1) edges, the network does not lose too much module structure. So, when selecting the same number of edges, using PMIbased network can cover more groups of genes with different functions, which will represent the mechanisms of the entire biological system better.
The ideal multiple omics integration should be to fully model the correspondence from the biological relationships to the mathematical relationships of the biological variables in one biological condition. Although we have proved that PMIbased gene correlations are significantly closer to biological relationships than PCCbased results, we still cannot precisely map gene correlations and biological relationships. This ideal modeling may be approached by using data with less heterogeneity and higher specificity in a biological condition. On the other hand, finding the more important parts in the biological system is the major task of correlation analysis. For complex human diseases like cancer, the changes from normal status to disease status are often driven by the combination of a vast number of possible mutations scattered on the whole genome, instead of one single dominating factor. In this situation, PMI will be a better choice for the correlation analysis.
Our current study focuses on the performance and reliability of a whole calculation procedure for correlation analysis of real biological data. We do not include more comparison with other types of statistics for direct association, because the advantage of PMI network calculation procedure may be the result of combining PMI core algorithm definition and the cutting edges by comparing combination numbers of PMI calculation and the threshold scanning strategy. The design and modification of these calculation procedure involve too much mathematical and computational work, which is beyond our expertise. On the other hand, we guess the edgecutting method is the key for refining edgelevel correlation results, but it also causes the huge calculation cost and makes the computation cannot be done parallelly. If new computation methods can solve these two problems while provide similar results as the current PMI network calculation procedure, it will be great help for correlation analysis of biological data.
Conclusions
Our findings confirm that PMI is a precise and robust correlation measure for transcriptome data. The theoretical advantage of PMI direct association over PCC indirect association does stand for a higher biological importance of the edges in the two types of correlation networks. We believe this advantage is applicable to other types of biological data. PMI can be easily used to refine results from the level of modules to obtain further interpretation at the level of variable pairs. An optimal PMI network is useful in highlighting the essential parts of the correlation in one module. Although the calculation of the optimal PMI network for a genomescale data is incomplete, our results suggest it is worth of using PMI to focus on parts with high strength correlation patterns. The advantage of PMI direct association makes it especially suitable for cancer data analysis. PMIbased correlation analysis can provide more valuable targets from the detailed correlation pattern of biological systems.
Materials and methods
Data selection and preprocess
We use two parts of cancer transcriptome dataset for two different purposes.
In the first part, we try to intuitively represent the characters of the PMI network. We want to find the differences between the PMI network and the PCC network and determine whether the PMI network more robustly captures the real correlation pattern. So, we need a dataset with the same set of variables but from different biological conditions. We selected 10 mRNA expression datasets of breast cancer from Gene Expression Omnibus (GEO) and ArrayExpress database (Table 2). The corresponding sample information is downloaded from ArrayExpress [13, 14]. All the 10 datasets are based on Affymetrix Hgu133plus2 platform [15,16,17,18,19,20,21,22,23,24]. The raw data from each dataset is normalized by RMA algorithm and batchcorrected by the removeBatchEffect function from R package Limma [25, 26]. Preprocessed data is separated by tumor grades and estrogen receptor (ER) status into 7 groups. Two cancer progressions are defined based on ER status in which the normal group is used twice (Table 3).
In the second part, we use 6 TCGA cancer datasets from XenaBrowser [27]. To reduce the data heterogeneity, we split each dataset by cancer subtype. The criterion for cancer subtype definition come from the clinical information. We use the main cancer type code plus several numbers to refer to each cancer subtype; the digits of numbers equal to the attributes used to divide cancer subtypes. All cancer subtypes have zero as the first number, which means they are limited to "primary tumor" in the "sample_type" attribute of TCGA clinical information. BRCA uses 3 more attributes; they are “breast_carcinoma_estrogen_receptor_status” (0 = negative, 1 = positive), “breast_carcinoma_progesterone_receptor_status” (0 = negative, 1 = positive), and “lab_proc_her2_neu_immunohistochemistry_receptor_status” (0 = negative, 1 = mid type, 2 = positive). COAD uses 2 more attributes; the first is "CDE_ID_3226963" attribute (microsatellite instability, 0 = stable, 1 = low instable, 2 = high instable), and the second is "histological_type" attribute which limits samples to normal COAD. Mucinous COAD is filtered. ESCA uses "histological_type" attribute (0 = NOS, 1 = ESCC). LUAD uses "Expression_Subtype" attribute (0 = Bronchioid, 1 = Squamoid, 2 = Magnoid). LUSC uses "histological_type" attribute and is limited to "NOS". STAD uses "CDE_ID_3226963" attribute (microsatellite instability, 0 = stable, 1 = low instable, 2 = high instable). For example, BRCA0000 means the value of “sample_type” is “primary tumor”, the values of “breast_carcinoma_estrogen_receptor_status”, “breast_carcinoma_progesterone_receptor_status” and “lab_proc_her2_neu_immunohistochemistry_receptor_status” are all “negative”. The sample sizes of each cancer subtype and each type of data are listed in Table 4. Sample groups with more than 20 samples are used in the following analysis.
PMI network calculation
The mathematical definition of PMI is as follow. X and Y are two scalar variables and Z is an (n − 2)dimensional vector (n > 2). Then, the PMI between variables X and Y given Z is defined as.
\(PMI\left( {X;YZ} \right) = D\left( {p\left( {x,y,z} \right)p^{*} \left( {xz} \right)p^{*} \left( {yz} \right)p\left( z \right)} \right),\)where p(x,y,z) is the joint probability distribution of X, Y and Z, and D(p(x, y, z)p^{*}(xz)p^{*}(yz)p(z)) represents the extended KL divergence from p(x,y,z) to p^{*}(xz)p^{*}(yz)p(z). The p^{*}(xz) and p^{*}(yz) are defined as.
\(p^{*} \left( {xz} \right) = \mathop \sum \limits_{y} p\left( {xz,y} \right)p\left( y \right), p^{*} \left( {yz} \right) = \mathop \sum \limits_{x} p\left( {yz,x} \right)p\left( x \right),\)and the extended KL divergence is defined as
\(D\left( {p\left( x \right),q\left( x \right)} \right) = \mathop \sum \limits_{x} p\left( x \right)\log \frac{p\left( x \right)}{{q\left( x \right)}}.\)
The PMI network calculation starts with a fully connected network, and gradually removes edges by a preset threshold. To obtain the optimal PMI network for a given data, the calculation scans a range of the thresholds from low to high, until the resulting networks tend to be stable [2]. When the input data is large, the calculation complexity for an optimal calculation will be too high to finish. To overcome this mathematical unsolvable problem, we make a compromise on the size of data or the completeness of optimization. In the first part, we use PMI to show the change of correlation pattern in a small gene set along the progress of cancer, the workflow is shown in Fig. 7. We use a WGCNA module extraction coupled with WebGestalt GO enrichment analysis method to separate the whole genome into small gene sets [1, 4]. These small gene sets are internally correlated with GObased functional annotation. We select some of them and calculate the optimal PMI networks. Then, by combining single stage networks along one cancer progression, we got the PMI correlation change network for the corresponding GO term. In a correlation change network, one edge is a combination of 1–4 singlestage edges from the 4 stages of one cancer process, the stage number of 4 stages are defined as 1–4. The stage combination of an edge is marked as four 0/1 connected by hyphens, which is referred as the existence pattern. For nodes in the correlation change network, the mean stage value is defined as the average value of all the stage numbers of the singlestage edges of one node. In the second part, we focus on the connection between network edges and omics defined genepair relationships. The workflow is shown in Additional file 1: Fig. S2. Separating the whole genome into gene sets will lead to some genes missed, but the whole genome contains too many variables to fulfill optimal PMI calculation. So, we use the whole genome data as input, starting with a high threshold, and try to get as many as edges in the limited time for each cancer subtype.
Comparable PCC network
To understand the difference between PMI and PCC in the context of analysis of biological data, we need to construct a series of PCC networks based on each PMI network. The PCC networks use similar standards to the corresponding PMI network, including using the strength of correlation to select the top several correlations among a set of genes. For each PMI network, we construct 3 types of comparable PCC networks. Type 1 PCC network selects edges from all gene pairs within the PMI gene set by descending absolute PCC value until it has the same number of edges with the corresponding PMI network. Type 2 PCC network uses the same method until it covers all the genes in the PMI network. Type 3 PCC network considers each PMI edge separately. For one PMI edge Gene1Gene2 from one PMI network, if another gene, Gene3, from the PMI network has higher absolute PCC values with both Gene1 and Gene2 than the absolute PCC value of Gene1 with Gene2, then edge Gene3Gene1 and edge Gene3Gene2 are both included into the edges of comparable PCC network. Edges in Type 3 PCC network may not exist in the PMI network; these edges are called PCConly edges. PCConly edges have higher PCC values than those of the reference PMI edges when they are selected. However, they are removed in the PMI network. The biological characters of PCConly edges may be the key for explaining the difference between PMI correlation and PCC correlation in the gene expression pattern along with the cancer progression. Type 1 and type 2 PCC networks are only used to compare the network sparsity between PMI and PCC. Type 3 PCC networks are used to compare the difference between PMI and PCC connecting to omics defined relationships.
Omics defined genepair relationships
We need to compare the edges from gene correlation networks with the relationships between a pair of genes defined by biological data or knowledge. We use transcriptional regulation (TR) data from the RegNetwork database [28]. Besides the transcriptional factor (TF) targeting one gene relationship, we check each genepair for common upper regulators (TF or microRNA) and transform the original TR data into CoTR data. TF and microRNA are checked separately. The CoTR relationship of one genepair is weighted by the number of common upper regulators. The protein–protein interaction (PPI) data is from the Highquality protein interactomes (HINT) database, only literature curated data is used [29]. The human genome coordinate data is downloaded from UCSC genome browser. GRCh37 version of NCBI RefSeq data is used. We calculate the distance between two transcription starting points of a pair of genes (if they are on the same chromosome) for the omics comparison (Same Chromosome Gene Distance, SCGD) [30]. The cancer dataset from XenaBrowser contains copy number variation (CNV) and DNA methylation (DM) data for each sample [27]. In consideration of the status of two genes, we convert them into CoCNV and CoDM data for the omics comparison. The CoCNV status of a pair of genes in one sample can be one of the following 5 types: gain–gain (GG), loss–loss (LL), gain–loss (GL), loss–gain (LG), or one or two of the genes are not mutated (NC). We define the DM data of one gene in one sample as the 2 numbers of low or high (L or H) methylation CpG islands. The CoDM status of a genepair in one sample is described by 4 values (LL, HH, LH, HL). Which are the products of 2 values of 2 genes cross timed. The CoCNV and CoDM of one genepairs are weighted and normalized by the number of samples with the corresponding mutation.
Comparison of PMI with PCC
We assume that the correlation pattern of gene expression stems from the omics defined genepair relationships. We therefore match genepairs from PMI/PCC networks in previous steps with each type of genepair relationships and compare the recalling statistics between PMI network and comparable PCC network. We use the word “recall” to refer the genepair of a PMI edge is also a record in each omics data. For example, Gene1 and Gene2 have PPI interaction record from database, a PMI network contains the Gene1Gene2 edge, then the PMI network recalls the Gene1Gene2 PPI; Gene3 and Gene4 have one or more samples with both gene amplification mutation, a PMI network contains the Gene3Gene4 edge, then the PMI network recalls the Gene3Gene4 CoCNVGG. Each network in each omics has a Zscore about the number of recalled genepairs. The null distribution was calculated by randomly select the same number of genepairs from all genepairs and check the numbers of recall for 100 times. For CoTR, CoCNV and CoDM, their records can be weighted by one or more record attributes, so they have more than one types of Zscores. A CoTR record Gene1Gene2 is weighted by the number of regulators (TF or miRNA) targeting both Gene1 and Gene2. A CoCNV or CoDM record is weighted by the number of samples with the type of CoCNV or CoDM. Because a CNV mutation record can be ± 1 (low level loss or amplification) or ± 2 (high level loss or amplification), so a CoCNV record can be further weighted by the absolute value of the product of the two mutation levels of two genes.
Cancer survival analysis
We collect genepairs with CoCNV or CoDM relationships from the PMI networks and use them to divide patients. We use the disease specific survival (DSS) data and perform univariate survival analysis with Python lifelines package [31]. The survival curves are drawn by Python matplotlib package [32]. For one genepair, the CoCNV data divide the patients into 5 groups: GG, LL, GL, LG and NC. Every 2 groups with more than 20 samples are compared. A CoCNV survival factor is marked, for example as “Gene1Gene2 GG > LL”, while the CoDM data of one genepair consists of 4 values: HH, LL, HL, LH. CoDM values with normalized mean average deviation higher than 0.25 are chosen. We use the median of one chosen CoDM value as threshold to divide patients into two groups, then perform the survival analysis. So, a CoDM survival factor is marked as “Gene1Gene2 HHlow > HHhigh”. Survival factors with log rank test pvalue lower than 0.05 are considered as significant. We also perform similar survival analysis with single gene CNV/DM or the expression of one or two genes and compare their efficiency with the genepairbased factors. Then, we build a simple scoring system to classify patients with genepairs in significant CoCNV/CoDM factors. For a given set of CoCNV/CoDM survival factors, the survival score of a patient is calculated by summing up the influence of all genepairs,
where I_{gp} is depended on the relative survival status in the gp survival factor. For CoCNV survival factors, if one type is NC, only the other type is considered has impact on survival. For example, a genepair Gene1Gene2 has only one CoCNV survival factor GG > NC,
If another genepair Gene3Gene4 has CoCNV survival factors LL > NC, NC > GG, LL > GG and GL > GG,
For CoDM survival factor, each CoDM type is analyzed separately. For example, a genepair has CoDM HHhigh > HHlow survival factor,
Patients with higher score have better survival. Patients are divided into 3 groups by the CoCNV/CoDM scores then perform survival analysis. To assess if the distribution of the genepairs of a set of survival factors on the PMI network has influence on the survival classification, we choose 3 sets survival factors for CoCNV or CoDM, the first set has survival factors from isolated edges, the survival factors in the second set construct small network modules, the third set is the first two sets combined.
To confirm that the CoCNV/CoDM survival scores provide additional information about the survival classification as compared to other clinical covariates, we performed Cox multivariate analysis with the CoCNV/CoDM survival scores and clinical covariates. For the 3 cancer subtypes we used for survival analysis, clinical covariates gender, age_at_initial_pathologic_diagnosis, pathologic_M, pathologic_N, pathologic_T and pathologic_stage are chosen. The raw values of pathologic T/N/M/stage are simplified and converted to integers 0/1/2/3/4. Gender is used as covariate only for LUSC, and BRCA only use samples from female patients. We perform 100 times of 80% subsample, remodel on two models of LUSC00 and calculate the mean and standard deviation of the pvalues of each covariate to confirm the consistency of results.
Network visualization
All networks are visualized by Cytoscape with default preferred layout and manual adjustment [33].
Availability of data and materials
The datasets analyzed in the current study are available in the Gene Expression Omnibus database https://www.ncbi.nlm.nih.gov/geo/, accession numbers GSE7904, GSE22544, GSE36774, GSE42568, GSE45827, GSE47109, GSE54002 and GSE65194, and ArrayExpress database https://www.ebi.ac.uk/arrayexpress/, accession numbers EMTAB5724 and ETABM276, and UCSC XenaBrowser https://xenabrowser.net/datapages/ for the TCGA cancer datasets of BRCA, COAD, ESCA, LUAD, LUSC and STAD.
Abbreviations
 PMI:

Part mutual information
 PCC:

Pearson correlation coefficient
 WGCNA:

Weight correlation network analysis
 GO:

Gene ontology
 TR:

Transcriptional regulation
 CoTR:

Cotranscriptional regulation
 PPI:

Protein–protein interaction
 SCGD:

Same chromosome gene distance
 CNV:

Copy number variation
 DM:

DNA methylation
References
Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinform. 2008;9:559.
Zhao J, Zhou Y, Zhang X, Chen L. Part mutual information for quantifying direct associations in networks. Proc Natl Acad Sci USA. 2016;113(18):5130–5.
Kim S. ppcor: an R package for a fast calculation to semipartial correlation coefficients. Commun Stat Appl Methods. 2015;22(6):665–74.
Liao Y, Wang J, Jaehnig EJ, Shi Z, Zhang B. WebGestalt 2019: gene set analysis toolkit with revamped UIs and APIs. Nucleic Acids Res. 2019;47(W1):W199w205.
Resource TGO. 20 Years and still GOing strong. Nucleic Acids Res. 2019;47(D1):D330–8.
Hao S, Chen X, Wang F, Shao Q, Liu J, Zhao H, et al. Breast cancer cellderived IL35 promotes tumor progression via induction of IL35producing induced regulatory T cells. Carcinogenesis. 2018;39(12):1488–96.
O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44(D1):D733–45.
Yearley JH, Gibson C, Yu N, Moon C, Murphy E, Juco J, et al. PDL2 expression in human tumors: relevance to antiPD1 therapy in cancer. Clin Cancer Res. 2017;23(12):3158–67.
Luo X, Xiong X, Shao Q, Xiang T, Li L, Yin X, et al. The tumor suppressor interferon regulatory factor 8 inhibits βcatenin signaling in breast cancers, but is frequently silenced by promoter methylation. Oncotarget. 2017;8(30):48875–88.
SagivBarfi I, Kohrt HE, Czerwinski DK, Ng PP, Chang BY, Levy R. Therapeutic antitumor immunity by checkpoint blockade is enhanced by ibrutinib, an inhibitor of both BTK and ITK. Proc Natl Acad Sci USA. 2015;112(9):E966–72.
Astoul E, Laurence AD, Totty N, Beer S, Alexander DR, Cantrell DA. Approaches to define antigen receptorinduced serine kinase signal transduction pathways. J Biol Chem. 2003;278(11):9267–75.
Mercatelli D, Scalambra L, Triboli L, Ray F, Giorgi FM. Gene regulatory network inference resources: a practical overview. Biochim Biophys Acta. 2020;1863(6):194430.
Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Res. 2013;41(Database issue):D9915.
Athar A, Füllgrabe A, George N, Iqbal H, Huerta L, Ali A, et al. ArrayExpress update—from bulk to singlecell expression data. Nucleic Acids Res. 2019;47(D1):D711–5.
Kinalis S, Rossing C, Molecular subtyping of breast cancer by transcription profiling. ArrayExpress. 2017; EMTAB5724
Cris Print, Expression data from primary breast tumor. Gene Expression Omnibus 2012; GSE36774.
Clarke C, Madden SF, Doolan P, Aherne ST, Joyce H, O’Driscoll L, et al. Correlating transcriptional networks to breast cancer survival: a largescale coexpression analysis. Carcinogenesis. 2013;34(10):2300–8.
D’Alfonso TM, van Laar RK, Vahdat LT, Hussain W, Flinchum R, Brown N, et al. BreastPRS is a gene expression assay that stratifies intermediaterisk Oncotype DX patients into high or lowrisk for disease recurrence. Breast Cancer Res Treat. 2013;139(3):705–15.
Li R, Ackerman WE, Mihai C, Volakis LI, Ghadiali S, Kniss DA. Myoferlin depletion in breast cancer cells promotes mesenchymal to epithelial shape change and stalls invasion. PLOS ONE. 2012;7(6):e39766.
Richardson AL, Wang ZC, De Nicolo A, Lu X, Brown M, Miron A, et al. X chromosomal abnormalities in basallike human breast cancer. Cancer Cell. 2006;9(2):121–32.
Hawthorn L, Luce J, Stein L, Rothschild J. Integration of transcript expression, copy number and LOH analysis of infiltrating ductal carcinoma of the breast. BMC Cancer. 2010;10:460.
Gruosso T, Mieulet V, Cardon M, Bourachot B, Kieffer Y, Devun F, et al. Chronic oxidative stress promotes H2AX protein degradation and enhances chemosensitivity in breast cancer patients. EMBO Mol Med. 2016;8(5):527–49.
Tan TZ, Miow QH, Miki Y, Noda T, Mori S, Huang RY, et al. Epithelialmesenchymal transition spectrum quantification and its efficacy in deciphering survival and drug responses of cancer patients. EMBO Mol Med. 2014;6(10):1279–93.
Maubant S, Tesson B, Maire V, Ye M, Rigaill G, Gentien D, et al. Transcriptome analysis of Wnt3atreated triplenegative breast cancer cells. PLOS ONE. 2015;10(4):e0122333.
Team RC. R: A language and environment for statistical computing. 2013.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNAsequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47.
Goldman MJ, Craft B, Hastie M, Repečka K, McDade F, Kamath A, et al. Visualizing and interpreting cancer genomics data via the Xena platform. Nat Biotechnol. 2020;38(6):675–8.
Liu ZP, Wu C, Miao H, Wu H. RegNetwork: an integrated database of transcriptional and posttranscriptional regulatory networks in human and mouse. Database. 2015;215:bav095.
Das J, Yu H. HINT: Highquality protein interactomes and their applications in understanding human disease. BMC Syst Biol. 2012;6:92.
Haeussler M, Zweig AS, Tyner C, Speir ML, Rosenbloom KR, Raney BJ, et al. The UCSC genome browser database: 2019 update. Nucleic Acids Res. 2019;47(D1):D853–8.
DavidsonPilon JKC, Noah J, Sean R, Ben K, Paul Z, Mike W, Abdeali JK, Deepyaman D, Andrew FG, Alex P, Daniel W, Ilson G, Luis M, Arturo MT, Kyle S, Harsh G, Jona JL, … Skipper S. CamDavidsonPilon/lifelines. Zenodo 2021.
Hunter JD. Matplotlib: a 2D graphics environment. Comput Sci Eng. 2007;9(3):90–5.
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–504.
Acknowledgements
We thank ZeXian Liu for insightful discussion.
Funding
This work was supported by Strategic Priority CAS Project XDB38000000. The funding body played no role in the design of the study nor in the collection, analysis, and interpretation of data, nor in writing the manuscript.
Author information
Authors and Affiliations
Contributions
YHJ, ZL, and JRW designed the study. YHJ performed data analysis and wrote the manuscript. JL, ZBZ, LL and ZXL participated in data analysis and manuscript writing. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declared that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Fig S1 and Fig S2. Fig S1 is the networks of CoDM PMI and PCC survival factors of three cancer subtypes. Fig S2 is the flow chart of omics comparison and survival analysis.
Additional file 2.
Supplementary Table S1. PMI/PCC networks recall TR relationships.
Additional file 3.
Supplementary Table S2. PMI/PCC networks recall TFCoTR relationships.
Additional file 4.
Supplementary Table S3. PMI/PCC networks recall miRCoTR relationships.
Additional file 5.
Supplementary Table S4. PMI/PCC networks recall SCGD relationships.
Additional file 6.
Supplementary Table S5. PMI/PCC networks recall PPI relationships.
Additional file 7.
Supplementary Table S6. Zscores of PMI networks recall CoCNV relationship.
Additional file 8.
Supplementary Table S7. Zscores of PCC networks recall CoCNV relationships.
Additional file 9.
Supplementary Table S8. PMI/PCC networks recall CoCNV GG relationships.
Additional file 10.
Supplementary Table S9. PMI/PCC networks recall CoCNV LL relationships.
Additional file 11.
Supplementary Table S10. Zscores of PMI/PCC networks recall CoDM relationships.
Additional file 12.
Supplementary Table S11. PMI/PCC networks recall CoDM HH relationships.
Additional file 13.
Supplementary Table S12. PMI/PCC networks recall CoDM HL relationships.
Additional file 14.
Supplementary Table S13. PMI/PCC networks recall CoDM LH relationships.
Additional file 15.
Supplementary Table S14.PMI/PCC networks recall CoDM LL relationships.
Additional file 16.
Supplementary Table S15. PMI recalled CoCNV survival analysis.
Additional file 17.
Supplementary Table S16. PMI recalled CoDM survival analysis.
Additional file 18.
Supplementary Table S17. Cox multivariate analysis for BRCA0000 and BRCA0112.
Additional file 19.
Supplementary Table S18. Cox multivariate analysis for LUSC00.
Additional file 20
Supplementary Table S19. The pvalue distribution of randomly subsampled LUSC00 Cox multivariate models using only clinical covariates.
Additional file 21
Supplementary Table S20. The pvalue distribution of randomly subsampled LUSC00 Cox multivariate modules using clinical covariates and both CoCNV and CoDM scores.
Additional file 22
Supplementary Table S21. Comparison of survival classification using single gene CNV and two gene CoCNV.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Jiang, YH., Long, J., Zhao, ZB. et al. Gene coexpression network based on part mutual information for genetogene relationship and genecancer correlation analysis. BMC Bioinformatics 23, 194 (2022). https://doi.org/10.1186/s12859022047329
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859022047329
Keywords
 Correlation analysis
 Direct association
 Part mutual information
 Multiple omics integration
 Cancer survival analysis