CoMI: consensus mutual information for tissue-specific gene signatures

Background The gene signatures have been considered as a promising early diagnosis and prognostic analysis to identify disease subtypes and to determine subsequent treatments. Tissue-specific gene signatures of a specific disease are an emergency requirement for precision medicine to improve the accuracy and reduce the side effects. Currently, many approaches have been proposed for identifying gene signatures for diagnosis and prognostic. However, they often lack of tissue-specific gene signatures. Results Here, we propose a new method, consensus mutual information (CoMI) for analyzing omics data and discovering gene signatures. CoMI can identify differentially expressed genes in multiple cancer omics data for reflecting both cancer-related and tissue-specific signatures, such as Cell growth and death in multiple cancers, Xenobiotics biodegradation and metabolism in LIHC, and Nervous system in GBM. Our method identified 50-gene signatures effectively distinguishing the GBM patients into high- and low-risk groups (log-rank p = 0.006) for diagnosis and prognosis. Conclusions Our results demonstrate that CoMI can identify significant and consistent gene signatures with tissue-specific properties and can predict clinical outcomes for interested diseases. We believe that CoMI is useful for analyzing omics data and discovering gene signatures of diseases. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04682-2.

identified from specific tissues is a promising avenue to maximize efficacy in target tissues while minimizing the safety risks of affecting unrelated tissues [4].
Many methods have been proposed to identify the gene signatures between normal and disease states [5]. The simple and common way is to use the T-test, which is a statistics-based method. According to the distributions between two states, the T-test evaluates the probability (p-value) in each gene. The significant difference of gene expression is often defined if the probability is less than 0.05 or 0.01. Another method is calculating the fold change (FC) of gene expression, which is not a statistical test and has no associated values for indicating the level of confidence in the genes as differentially expressed or not [6]. Therefore, some methods were developed, such as Significance Analysis of Microarrays (SAM) and Cancer Outlier Profile Analysis (COPA), to choose the biomarkers in high confidence [7][8][9]. The SAM approach proposed false discovery rate (FDR) to reduce genes showing significantly different expression by chance (i.e., false positive) [7,10]. The COPA, based on the hypothesis of tumor heterogeneity, identified candidate genes overexpressing in subsets of samples by using median and median absolute deviation of gene expression profiles [8]. In another study, Parker et al. determined 50 genes (i.e., PAM50) for classified four breast subtypes with clinical means by using Predictive Analysis of Microarray (PAM) algorithm [2,11], but this approach relied on sample labels that was difficult to propose new subtype of diseases. Recently, Gentles et al. applied CIBERSORT computational methods and PRECOG tools to identify cancer prognostic biomarkers and therapeutic targets [12], however, they only focused on the genes that related to clinical outcomes, but the ones might not be suitable for diagnosis due to the non-significant gene expression changes between the sample subgroups [13].
In this study, we hypothesize that the biological processes of tissues/organs are dysregulated during tumorigenesis, and the perturbed genes (i.e., tissue-specific genes) are often involved in corresponding functions of tissues [14][15][16]. To address these issues, we propose consensus mutual information (CoMI) to analyze omics data and to identify gene signatures. We utilized mutual information (MI) and gene expression distance as the basis to find the significantly and consistently expressed genes and gene signatures between normal and disease states. For multiple cancer omics data, our identified gene signatures could reflect cancer-related signatures and have tissue-specific properties (mean odds ratio = 2.89). Based on our previously developed global omics data analysis method [17], our CoMI identified gene signatures not only involved in common cancerrelated progress, such as Cell growth and death, but also reflect tissue unique functions of Xenobiotics biodegradation and metabolism in LIHC and Nervous system in GBM. For clinical prognosis, a 50-gene signature identified by CoMI could distinguish the GBM patients into high-and low-risk groups based on gene expression patterns, and could predict clinical outcomes at 12-month survival (log-rank p = 0.006). We believe that our method and results are useful for analyzing omics data, discovering gene signatures with tissue-specific properties, and predicting clinical outcomes of diseases.

CoMI for identifying gene signatures in multiple cancers
To identify consistent patterns of gene expression for gene signatures development in multiple cancers, we utilized CoMI to analyze genome-wide gene expression profiles in LIHC, GBM, BLCA, BRCA, and COAD. To evaluate the cancer associations of selected genes from CoMI in multiple cancer datasets, we collected 1675 cancer-related genes derived from HPA database [18]. The results show that the gene signatures selected from the different ranking cut-off of CoMI have a high probability to be cancer-related genes compared with T-test, FC, and Significance Analysis of Microarrays (SAM) (Fig. 1). For example, the top-ranked 200 significant genes were 190, 182, 181, and 126 genes for CoMI, T-test, FC, and SAM, respectively.
To investigate the statistical meanings of CoMI, we calculated Spearman's correlation coefficient (ρ) between CoMI scores and FC values, SAM scores and p-values of T-test on all of 20,531 genes in NGS profiles of TCGA, respectively (Additional file 1: Fig. S1). For these methods, the average ρ values of these five cancer types were 0.81 (FC), 0.95 (T-test), and 0.94 (SAM). These results suggest that CoMI has statistical meanings and is able to identify significantly and consistently expressed genes and disease-related gene signatures from omics data.

Tissue-specific properties of gene signatures
To investigate the biological meanings of gene signatures identified by CoMI in different cancers, we collected tissue-specific genes from HPA with protein annotation of tissue specificity in liver, brain, urinary bladder, breast, or colon. Here, we used the odds ratio to assess whether gene signatures have tissue-specific properties or not in five cancers ( Fig. 2 and Additional file 1: Fig. S2). In the comparison of CoMI and T-test, the average odds ratios of gene signatures selected from eight kinds of top-ranked thresholds were 2.89, 5.40, 1.90, 1.74, and 2.99 in LIHC, GBM, BLCA, BRCA, and COAD, respectively. For example, the top-ranked 200 genes selected by CoMI and T-test in GBM were 89 and 23 with annotated brain specificity, respectively. The odds were 0.8 (89/111) and 0.13 ( Fig. S2). Besides, we observed that CoMI has a lower ranking on average of descending order than the ones of T-test and SAM in all of the tissuespecific genes in each cancer (Additional file 1: Figs. S3 and S4). These results demonstrate that gene signatures identified by CoMI are more like to have tissue-specific properties. Furthermore, we used our previously developed global omics data analysis method, Hierarchical System Biology Model (HiSBiM), to investigate the involved pathways as well as biological subsystems and systems of gene signatures [17]. We observed that biological subsystems of the 200-gene signatures identified by CoMI and T-test were involved in common cancer-related pathways, such as Cell growth and death and Replication and repair in five cancer types (Fig. 3). In particular, the gene signature of LIHC identified by CoMI was enriched in Xenobiotics biodegradation and metabolism (meta-z score = 3.72) and Lipid metabolism (meta-z score = 2.74), which could reflect unique functions to liver tissue. In GBM, we observed that neurotransmission-related functions were highly enriched, such as Nervous system (meta-z score = 7.49) and Signaling molecules and interaction (meta-z score = 2.46). In addition, the Digestive system (meta-z score = 7.03) was highly regulated in COAD.
We also evaluated the tissue-specific properties of CoMI and COPA [9] based on HPA database and HiSBiM analysis. The results show that the genes identified by CoMI are related to both common cancer-related pathways and tissue-specific properties, such as Endocrine system (meta-z score = 5.40) in BRCA (Additional file 1: Fig. S5). Additionally, CoMI outperformed COPA and our results indicated that CoMI not only can identify tissue-specific gene signatures in different cancers, but also can reflect corresponding biological pathways and functions unique to those tissues.

Clinical prognosis of gene signatures
The grade IV astrocytomas (i.e., GBM) are an aggressive class of brain cancer, it is difficult to treat and has a poor median 12-month overall survival, and the urgent need to develop a prognostic gene signature [19]. We selected a 50-gene signature with significant and consistent expressions by using our CoMI from the gene expression profile in GBM, as well as, five of those genes were brain tissue specificity (Table 1). According to the gene expression patterns, we found that these 50 genes could cluster 156 tumor samples into four groups (Fig. 4A). According to the 12-month overall survival analysis, the patients in GBM-C1 group (n = 15) have a significantly lower survival probability (30%) than GBM-C3 group (62%; log-rank p = 0.006; Fig. 4B). Moreover, we found that 70% of CoMI top-ranked 10 genes can distinguish patients into high-and low-risk groups (log-rank p < 0.01), such as PBK (log-rank p = 0.0058) and CCNB2 (log-rank p = 0.009), however, only five of 10 genes with log-rank p < 0.01 in T-test (Additional file 1: Fig. S6). These results indicated that our identified 50-gene signature could predict clinical outcomes and provide the available clues for developing the new therapeutic strategies in GBM.

Discussion
Genome-wide gene expression profiling has been used to identify genetic signatures that could be associated with the outcome of cancer patients [1]. Some studies using genome-wide gene expression profiling for developing gene signatures, and many of       these approaches have shown to better define the prognosis of cancer patients, such as the PAM50 gene signature can use to classify breast tumors into four subtypes and to predict clinical prognosis [2,3].  In this paper, we propose consensus mutual information (CoMI) to analyze omics data and to identify gene signatures. For multiple cancer omics data, we used CoMI to identify gene signatures, and those genes could reflect cancer-related signatures and have tissue-specific properties. In general, normal cells perform the function they are meant to perform, whereas cancer cells may not execute these functions. For example, cancerous thyroid cells may not produce thyroid hormone [20], and cancerous white blood cells are not functioning as they should [21]. On the other hand, the genes recorded in cancer hallmarks are often activated in cancer cells [22].
Here, we hypothesize that tissue-specific genes are often related to lost-of-tissue functions which are often down-regulated in the disease states. Our CoMI can identify these tissue-specific genes which are significant change (S Dist score) and have consensus expressed values (S MI ) within cancer and normal samples. The changes between cancer and normal samples were quantified by considering global gene expression values for all genes in omics data. Based on our scoring function, CoMI identified the liver-specific genes, such as CYP1A2 (CoMI score = 0.99 and FC = − 7.76), CYP2C8 (CoMI score = 0.71 and FC = − 4.14), and CYP3A4 (CoMI score = 0.55 and FC = − 5.85), are the members of the cytochrome P450 family involving in Lipid metabolism (meta-z score = 2.74) as well as Xenobiotics biodegradation and metabolism (meta-z score = 3.72) in LIHC. Additionally, CoMI also identified gene SLC22A1 (CoMI score = 0.53 and FC = − 4.81) involving in liver unique functions, that is, bile secretion of Digestive system (meta-z score = 4.54). Conversely, MAP2K1, an essential component of the MAP kinase signal transduction pathway related to cancer hallmark, was not a tissue-specific gene. We found that MAP2K1 significantly changed in most cancers, but its expression values are not consistent (CoMI score = 0.20). These results show that our CoMI identified gene signatures not only involved in cancer-related progress, such as Cell growth and death, but also reflect tissue unique functions of Nervous system (meta-z score = 7.49) in GBM. For clinical prognosis, our identified 50-gene signature could stratify GBM patients into high-and low-risk groups, and could predict clinical outcomes with 12-month survival.
There were some limitations to the current study. Firstly, the omics data (i.e., NGS) is only collected from TCGA, and different sources and platforms, such as microarray, are needed to use to validate our method and identified gene signatures. Secondly, the results of this study are mostly based on bioinformatics analysis and predictions, and further experiments are needed to prove the tissue-specific properties of our identified gene signatures. Thirdly, although we identified prognostic gene signatures, those genes remain to be further explored in our future work.

Conclusion
In summary, we have proposed CoMI for analyzing omics data and discovering gene signatures. Our method accomplished the identification of genes and gene signatures, which have consistently and significantly changed between normal and disease states. Our results indicated that CoMI could identify gene signatures with tissue-specific properties for interested diseases, and is able to be applied to predict clinical prognosis.

Methods
To identify significantly and consistently expressed genes and gene signatures, we proposed a method, consensus mutual information (called CoMI), for analyzing omics data between normal and disease samples (Fig. 5A). We first calculated gene expression variations for each gene in a given omics data (Fig. 5B). For the evaluating of the consensus gene expression in normal and tissue states, we transferred the continuous gene expression to discrete integer symbols based on expression variations and intensities (Fig. 5C). Finally, we computed CoMI score for all genes, which can be used as a measure for significantly and consistently expressed genes and gene signatures between two states (Fig. 5D).

Omics and validation data
Here, we collected omics data (i.e. NGS) in LIHC, GBM, BLCA, BRCA, and COAD from The Cancer Genome Atlas (TCGA) databases [23]. These datasets contain 228 normal and 2315 tumor samples. To evaluate our method and compare to T-test and FC method, we collected the 1675 cancer-related genes derived from the Human Protein Atlas (HPA) database [18]. The tissue-specific genes were collected from HPA with protein annotation of tissue specificity in liver (409 genes), brain (1313 genes), urinary bladder (56 genes), breast (82 genes), or colon (147 genes). Finally, Fig. 5 Overview of consensus mutual information. A Flowchart describing the main procedure. B We firstly evaluated the standard deviation ( σ i ) of gene expression intensity for each gene in a given omics data. C We then computing the average standard deviation ( σ ) from σ i of all genes, as well as expression means of normal and disease samples (i.e., μ N and μ D ), respectively, in each gene. For a given gene i in sample j, we assigned expression intensity into 7 integral symbols by considering the σ and its μ N and μ D. D The gene expression values were converted to discretized integer symbol ranging from 0 to 6. The highly expressed genes were assigned to the highest symbol 6 and lowly expressed genes were assigned to the lowest symbol 0. (D) The calculation of consensus mutual information (CoMI) values of all genes we calculated the p-values (i.e., T-test), fold changes, and SAM scores of all genes by using limma and samr R package [24,25].

Discretization
To consider the readily quantifiable and significant expressions in the disease state of genes and gene signatures, we firstly evaluated the standard deviation ( σ i ) of expression intensity of the gene i (Fig. 5B). Then, we calculated the average standard deviation ( σ ) from σ i of all genes (Fig. 5C). For each gene, computing expression means of normal (μ N ) and disease (μ D ) samples, respectively. For a given gene i in sample j, we utilize the average standard deviation ( σ ), average expression (μ N,i and μ D,i ) to assign expression intensity (EI i,j ) into an integral symbol (ES i,j ) by using the following equations: The gene expression values were converted to discretized integer symbols ranging from 0 to 6. The highly expressed gene was assigned to the symbol 6 and lowly expressed gene was assigned to the symbol 0. For instance, the σ is 1.04 in LIHC, and the μ N and μ D of the gene NAT2 are 10.4 and 5.7, respectively. The expression intensity (EI) of NAT2 is 10.8 in sample TCGA-DD-AAE3-01A-11R-A41C-07, which satisfies the EI i,j > µ D,i +µ N ,i 2 + 2.5σ (i.e., 10.7), therefore, we assign the ES NAT2,TCGA-DD-AAE3-01A-11R-A41C-07 value to 6.

CoMI: Consensus mutual information
For each gene i in a given omics data, we evaluated its consensus mutual information (CoMI) and to identify significantly and consistently expressed genes and gene signatures between normal and disease states (Fig. 5D). The CoMI of a gene is defined as: where S MI is gene expression difference between two states using mutual information; S Dist is the gene expression distance between two states by using mean distance. For the gene i, the S MI is given as: where p(x,y) is the probability of gene i in symbol x and state y; p(x) is the fraction of gene i in symbol x, and p(y) is the fraction of samples in state y. Y is the number of states (here, Y = 2 for normal and disease states), and X is the number of symbols (here, X = 7). The S Dist is given as: where N and D are numbers of samples in the normal (y 1 ) and disease (y 2 ) states, respectively; ES i,n and ES i,d are the integral symbols of gene i at the sample for normal and disease states, respectively. According to the equation of mutual information, the S MI is related to the ratio between normal and cancer samples. S MI is also related to the overlap of integer symbols between normal and cancer samples. For example, S MI = 1, while p(normal) = p(cancer) and there is no overlap of integer symbols between normal and cancer samples; S MI is from 0.3 to 0.5, while p(normal) = p(cancer) and half of the integer symbols in normal and cancer samples are the same; S MI = 1, while p(normal) = p(cancer) and there is no overlap of integer symbols between normal and cancer samples; S MI = 0.44, while 10 × p(normal) = p(cancer) and there is no overlap of integer symbols between normal and cancer samples; S MI is from 0.3 to 0.4, while 10 × p(normal) = p(cancer) and half of the integer symbols in normal samples are the same as integer symbols in cancer samples; S MI = 0.28, while 20 × p(normal) = p(cancer) and there is no overlap of integer symbols between normal and cancer samples. In our collected NGS data in LIHC, GBM, BLCA, BRCA, and COAD from TCGA databases, there are 228 normal and 2315 tumor samples and the expected maxima of S MI might be 0.44. Here, we assumed that CoMI > 0.6 might be a suitable cut-off which could ensure only half of the integer symbols in normal samples are the same as integer symbols in cancer samples and the S Dist is greater than 2 (Additional file 1: Fig. S7).
Moreover, we found that both S MI and S Dist could be used as good indexes to identify the genes with tissue-specific properties (Additional file 1: Fig. S8). S Dist focuses on the distance between two states and is more related to the tissue-specific genes in our five data sets. In this study, we provide a scoring system that has a reliable discretizing method and consider both distances between two states and mutual information to identify gene signatures. Using S MI and S Dist could evaluate the distance, significantly and consistently expressed value of the gene between normal and disease states under the same scale.

Supplementary Information
The online version contains supplementary material available at https:// doi. org/ 10. 1186/ s12859-022-04682-2. 1: Fig. S1. The correlations of the ranking of all genes were compared between CoMI, T-test, FC, and SAM in five cancer types. Fig. S2. The odds ratios of tissue-specific genes selected by CoMI and SAM in five cancer types, including LIHC, GBM, BRCA, and COAD. Fig. S3. Boxplot for the results of the ranking of tissue-specific genes