Colon adenocarcinoma (COAD) is one of the most common gastrointestinal tumors worldwide, and immunotherapy is one of the most promising treatments for it. Identifying immune genes involved in the development and maintenance of cancer is key to the use of tumor immunotherapy. This study aimed to determine the prognostic value of immune genes in patients with COAD and to establish an immune-related gene signature. Differentially expressed genes, immune-related genes (DEIGs), and transcription factors (DETFs) were screened using the following databases: Cistrome, The Cancer Genome Atlas (TCGA), the Immunology Database and Analysis Portal, and InnateDB. We constructed a network showing the regulation of DEIGs by DETFs. Using weighted gene co-expression network analysis, we prepared 5 co-expressed gene modules; 6 hub genes (CD1A, CD1B, FGF9, GRP, SERPINE1, and F2RL2) obtained using univariate and multivariate regression analysis were used to construct a risk model. Patients from TCGA database were divided into high- and low-risk groups based on whether their risk score was greater or less than the mean; the public dataset GSE40967, which contains gene expression profiles of 566 colon cancer patients, was used for validation.
Survival analysis, somatic gene mutations, and tumor-infiltrating immune cells differed significantly between the high- and low-risk groups.
This immune-related gene signature could play an important role in guiding treatment, making prognoses, and potentially developing future clinical applications.
Colorectal cancer (CRC) is one of the most common digestive system malignancies, accounting for approximately 1.2 million new cases and 600,000 deaths every year worldwide . More than 2.2 million new CRC cases and 1.1 million deaths are projected in 2030 . Colon adenocarcinoma (COAD) is the most common histopathological type of CRC. It often evolves into invasive cancer due to gene mutations and continuous accumulation of colonic adenomatous lesions . Current treatments for patients with CRC include surgery, radiation, chemotherapy, targeted therapy, and immunotherapy. The 5 years survival rate exceeds 90% who undergo curative surgery for patients with localized tumor . However, most patients are in the middle to late stages of the disease when diagnosed, in which case the 5 years survival rate decreases to approximately 10% . The resistance of cancer cells to the immune response has been recognized as a new sign of cancer, and in recent years specific immune checkpoint therapeutic techniques have been extensively investigated . Immune checkpoint inhibitors (ICIs), which manipulate the immune system to reactivate the antitumor immune response by blocking immune checkpoint proteins (PD-1 and CTLA-4) or their ligands (PD-L1), have been shown to have significant therapeutic effects in several cancers. However, immunotherapy does not achieve better efficacy in all CRC patients. Thus, it is equally important to understand the tumor immune microenvironment to explore tumor-associated immune signature biomarkers . Moreover, characterizing immune function in different responding populations could help improve the efficacy of immunotherapy on CRC.
It has been reported that polygenic prediction models possess better predictive ability than single gene models for cancer prognosis. There have been a growing number of recent articles about immune-related genes predicting the risk of colon cancer. Ma et al. reported a prognostic model based on 13 immune-related genes , Wang et al.  also identified a novel prognostic signature of immune-related genes for CRC patients. A different prognostic model based on another 13 immune-related genes was recognized by Wang’s group . However, the complexity and diversity of the data and the way in which this information can be used effectively is a formidable challenge. Methods for online public datasets analysis have not been standardized to date, so here we also provide our results for readers to compare. The combination with other similar predictive models will contribute to explain the predictive role of immune-related genes in COAD prognosis more comprehensively.
Accordingly, exploring the underlying connection between immune-related genes involved in the development and progress of COAD may conduce to build a prognostic prediction system for COAD. First, we used cancer patient information from the TCGA , ImmPORT, InnateDB, and Cistrome online databases to identify 6 immune-related genes that are associated with the prognosis of COAD patients. Next, we incorporated these six genes into the study accordingly to establish a risk model that predicts the survival prognosis of patients with COAD. Based on risk scores, we successfully divided COAD patients into low-risk and high-risk groups, which not only had different COAD prognosis, but also exhibited different gene expression profiles and different tumor-infiltrating immune cell characteristics. Targeting immune-related genes can deepen our understanding of the role that tumor immunity plays in COAD. This risk assessment model could provide more guiding significance for predicting the prognosis and even precision treatment of COAD patients.
Data processing and identification of differentially expressed immune-related genes (DEIGs), genes (DEGs), and transcription factors (DETFs)
The transcriptome RNA sequencing data and clinical materials of 514 patients with COAD were obtained from TCGA. The screening criteria for DEGs between tumors and normal samples was set as a false discovery rate (FDR) < 0.05 and |log fold change (FC)|> 1, a total of 7782 differentially expressed genes were identified (Fig. 1A), B). A list of 2660 immune genes was obtained from the Immunome database, which were downloaded from InnateDB and ImmPORT databases; a total of 649 DEIGs were obtained from the screening (|log FC|> 1, FDR < 0.05) (Fig. 1C, D). We also obtained 318 transcription factors (TFs) from the Cistrome program and 67 DETFs from screening (Fig. 1E, F).
Functional analysis of DEIGs
Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses of the DEIGs were conducted. The GO-biological process (BP) analysis indicated that the DEIGs were mainly involved in B cell immunoglobulin-mediated immune responses, complement activation, and regulation of immune system processes (Fig. 2A, B). The KEGG analysis indicated that the DEIGs were mainly involved in cytokine–cytokine receptor interaction, viral protein interaction with cytokines and cytokine receptors, chemokine signaling pathways, the IL-17 signaling pathway, neuroactive ligand–receptor interactions, and the nuclear factor (NF)-κ B signaling pathway (Fig. 2C, D).
Building the DEIG and DETF interaction network
The correlation between DEIGs and DETFs was obtained using these screening criteria: cor > 0.5 and p < 0.001. The correlation network diagram was drawn using Cytoscape (Fig. 3). The specific correlation results are shown in Additional file1: Table S1.
Weighted gene co-expression network analysis (WGCNA) of DEIGs
WGCNA [12, 13] identifies gene modules with similar expression patterns by calculating gene expression relationships, analyzing relationships between gene modules and phenotypes, and mapping the regulatory network between genes in the gene module and central genes (hubs). The WGCNA package in the R software was used to divide DEIGs into five modules (“MEgreen,” “MEblue,” “MEbrown,” “MEyellow,” and “MEgrey”) (Fig. 4). The optimal power value was 4. Prognostic models were built based on the minimal p-value (< 0.05). Genes in the green module were selected for subsequent analysis.
Obtaining immune genes related to the prognosis of COAD patients
After obtaining the “MEgreen” gene module, univariate regression analysis was performed on the clinical data of COAD patients in the TCGA and Gene Expression Omnibus (GEO) databases. Nine genes were screened as prognosis-related genes (CD1A, CD1B, FGF9, GRP, OXTR, SPHK1, BGN, SERPINE1, and F2RL2) using p < 0.05 as the selection criterion. The results of univariate analysis are shown in Fig. 5A. The Kaplan–Meier (K–M) curves of nine genes in COAD patients were plotted using the Survival package in R software (Fig. 5B–J).
Building of a survival prognosis model and performing a survival analysis
The nine prognosis-related genes were included in a multivariate Cox regression analysis. The inclusion criteria were p < 0.05 and HR ≠ 1. Six genes (CD1A, CD1B, FGF9, GRP, SERPINE1, and F2RL2) were finally incorporated into the core model for risk calculation. The relationship between the genes and risk score is shown in Additional file2: Table S2. The composite risk scores of patients in TCGA datasets were calculated. The COAD patients were separated into low- and high-risk groups using the median risk score as the cutoff. The prognostic model was built using TCGA data and verified in the GEO datasets GSE40967. Further, we used the R software Survival and timeROC packages to draw two groups of K–M and receiver operating characteristic (ROC) curves, respectively (Fig. 6A–D), The R software ComplexHeatmap package was used to conduct a chi-square test between the demographics of the low- and high-risk groups and draw the clinical correlation heat map (p < 0.05, Fig. 6E). There were significant differences between the two groups in the tumor (T), node (N), metastasis (M) and stages. Univariate and multivariate Cox regression analyses revealed that age, T, stage, and risk score were independent prognostic factors for patients with COAD (Fig. 7A, B).
Single-sample gene set enrichment analysis (ssGSEA) for low- and high-risk groups
The GO and KEGG enrichment analysis files (c5.go.v7.4.smbols and c2.cp.kegg.v7.4.symbols) were downloaded from the GSEA database (http://www.gsea-msigdb.org/gsea/index.jsp). GO and KEGG enrichment analyses of the high- and low-risk groups were performed using the clusterProfiler and the org.Hs.eg.db packages in R. GO enrichment analysis of the high-risk group indicated the main enrichment was in keratinocyte differentiation, skin development, collagen-containing extracellular matrix (ECM) external encapsulating structure development, and structural molecule activity. The most significantly enriched KEGG pathways were axon guidance, ECM receptor interaction, focal adhesion, the peroxisome proliferator-activated receptor (PPAR) signaling pathway, and systemic lupus erythematosus. GO enrichment analysis of the low-risk group indicated the main enrichment was in the activation of immune responses, adaptive immune responses, immune response regulating signaling pathways, immunoglobulin production, and immunoglobulin complexes. The KEGG pathway indicated enrichment in allograft rejection, asthma, autoimmune thyroid disease, the intestinal immune network for immunoglobulin A (IgA) production, and primary immunodeficiency (Fig. 7C–F).
Comparison somatic mutation in high- and low-risk groups
Somatic mutation profiles of patients with COAD downloaded from TCGA were analyzed and visualized using the R maftools package . A total of 388 patients had mutations; after removing samples with no amino acid mutation, 202 were high-risk and 184 were low-risk. The five genes with the highest somatic mutation rate in the high-risk group were APC, TP53, TTN, KRAS, and SYNE1. Missense mutations were the most common category. The high‐risk group had a higher mutation frequency than the low‐risk group (Fig. 8A, B).
Analyzing the tumor-infiltrating immune cells of high- and low-risk groups
Next, we used CIBERSORT to estimate the proportions of 22 distinct immune cell types (p < 0.05) (Fig. 8E), The Wilcoxon signed-rank test was used to determine the differences between tumor-infiltrating immune cells cells in the high-risk group. Resting dendritic cells and follicular helper T cells were present in significantly higher fractions in low-risk patients than in high-risk patients (p < 0.05; Fig. 8C). The immune cell function of the high-risk group was lower than that of the low-risk group with respect to adenomatous polyposis coli (APC) co-inhibition, APC co-expression, T-cell function, and macrophage function (Fig. 8D).
Comparison with other models
We compared four corresponding prognostic models: a seven-gene signature (Sun), six-gene signature (Liang), twelve-gene signature (Mia), and seven-gene signature (Chen) [15,16,17,18]. We took the median risk value of all samples as the division standard, divided them into high- and low-risk groups, and used the four models to calculate the risk of the patients in TCGA. The ROC and K–M curves for the four models are shown in Fig. 9A–J. The values of area under the curve (AUC) for the four models at 5 years were 0.581, 0.521, 0.616, and 0.555, respectively, all significantly lower than our model (0.639). We calculated the concordance indexes (C-indexes), which were used to evaluate the prediction capability of the mixed-effect Cox model , of all models. The results showed that our model exhibited the highest C-index value (0.704; Fig. 9K).
Verification of six immune-related genes in external databases
The Tumor Immune Estimation Resource (TIMER) online database was used to analyze the differential expression of six genes in this model in 17 types of tumors and adjacent tissues. The CD1A, CD1B, GRP, SERPINE1, and F2RL2 genes were highly expressed in tumor tissue. In contrast, FGF9 was highly expressed in normal colon tissue (Fig. 10). A Human Protein Atlas (HPA) database search was performed to verify the protein expression levels of CD1A, CD1B, SERPINE1, and FGF9 (Fig. 11).
Growing evidence suggests that the immune system plays a decisive role in the development and progression of CRC. FDA approval of anti-PD-1 monoclonal antibodies, as well as an anti-CTLA-4 monoclonal antibody, has improved the lives of some CRC patients, suggesting that immunotherapy had significant anticancer potential. However, immune checkpoint block could not achieve the desired response rate in all patients with CRC. This suggests that we still face many challenges in early diagnosis and treatment strategies for COAD, including a lack of awareness of high-risk patients, a lack of clinically applicable biomarkers, and the precise treatment for high-risk populations. Thus, we need to investigate the immune system intensively to understand how it interacts with cancer cells and describe the molecular characteristics of the COAD tumor immune microenvironment. More importantly, the research of predictive biomarkers may bring new expectations for the immunotherapy of CRC. Most studies focus only on a single molecule, but tumor development and metastasis are often the result of a synergy between multiple molecules. A single biomarker is difficult to achieve the specificity and sensitivity required in detecting cancer, current evidence suggests predictive models are more helpful than single biomarkers.
Therefore, in this current study, we developed a prognostic risk prediction model based on 6 immune-related genes. First, 7782 DEGs and 649 DEIGs were screened utilizing the TCGA database and InnateDB and ImmPORT databases. Next, WGCNA identified a set of DEIGs with similar expression patterns, and univariate and multivariate COX regression analyses confirmed that 6 of these DEIGs had independent prognostic value for patients with COAD. Then, these six hub genes (CD1A, CD1B, FGF9, GRP, SERPINE1, and F2RL2) were used to construct a risk score model to predict OS of COAD patients. Patients were divided into high- and low-risk groups, and the two groups exihibited completely different T, N, M, and stages. Moreover, GSEA analysis was used to derive differences in gene expression patterns between the high risk and low risk groups, meanwhile CIBERSORT assessment yielded different characteristics of tumor immune cell infiltration in the high- and low-risk groups. In the present risk prediction model, the higher the risk score, the worse the prognosis of the high-risk group and vice versa. The validation in the GEO database further confirmed the stability of this model. Thus, the successfully established immune-related gene model provides a new and effective method to predict the prognosis of COAD patients.
Currently, many studies have demonstrated that immune-related genes can coordinate and coordinate the onset and progression of cancer. CD1A is a lipid antigen-presenting molecule whose expression can be induced in monocytes and dendritic cells. CD1B is recognized as a marker by γδ T cells and plays an important role as an effector of tissue injury, infection, and cancer development ; it also regulates the differentiation and maturation of dendritic cells . Fibroblast growth factor 9 (FGF9) plays a critical role in patients with colon cancer with resistance to epidermal growth factor receptor (EGFR)-targeted therapy, and combination therapy with anti-EGFR inhibitor may reverse drug resistance . Gastrin-releasing peptide (GRP) may serve as an independent predictor of survival in patients with colon cancer . SERPINE1 plays an essential role in remodeling the tumor microenvironment and the infiltration of immune cells ; some noncoding RNAs influence the epithelial–mesenchymal transition of colon cancer by regulating SERPINE1 [24,25,26].
GO enrichment analysis of the high-risk group indicated the main enrichment was in keratinocyte differentiation, skin development, collagen-containing extracellular matrix (ECM) external encapsulating structure development, and structural molecule activity. Pradella et al.  revealed that Unc-5 netrin receptor B (UNC5B) is an axon guidance regulator whose expression is associated with tumor angiogenesis and poor prognosis. Mitra et al. demonstrated that cell migration involves integrating ECM with the actin cytoskeleton through transmembrane receptors . Focal adhesion kinase (FAK) is activated by the ECM, promoting the formation of focal adhesion complexes and focal adhesion maturation [28,29,30]. Thus, the FAK signaling pathway may be activated in the high-risk group, leading to colorectal cancer invasion in vitro and metastasis in vivo. The most significantly enriched KEGG pathways were axon guidance, ECM receptor interaction, focal adhesion, the peroxisome proliferator-activated receptor (PPAR) signaling pathway, and systemic lupus erythematosus. In colon cancer, the absence of PPARα and PPARβ/δ expression promotes cancer growth, and PPARγ suppresses tumorigenesis through the regulation of and interaction with β-catenin [31,32,33]. Systemic lupus erythematosus is associated with abnormal autoimmune reactions; however, the mechanisms have not been identified. Colon cancer is one of the most common complications in gastrointestinal diseases . Down-regulated molecular pathways in the low-risk group included allograft rejection, asthma, autoimmune thyroid disease, intestinal immune network for IgA production, and primary immunodeficiency. Asthma symptoms are alleviated by reducing eosinophil production; previous studies have proposed that the depletion of eosinophils due to asthma severely compromises antitumor immunity in syngeneic and genetic models of colorectal cancer. This association is possibly due to defective Th1 and CD8+ T-cell responses . Bacteria are linked to cancer , and intestinal bacteria may make the tumor microenvironment more favorable for IgA production. Consequently, IgAs are widely used as biomarkers for early cancer screening [37,38,39]. Recent studies have indicated that patients with primary immunodeficiency tend to have a higher cancer incidence because of genomic instability due to defective DNA repair mechanisms [40, 41]. However, the association between autoimmune thyroid disease, allograft rejection, and colorectal cancer has not been found.
Overall, the current study introduces an immune-related gene module as a novel prognostic tool for COAD patients. The present study features a model based on a comprehensive population database and high-throughput sequencing data, which was successfully validated by subsequent testing in an external clinical cohort. Ultimately, identifying the immune function profile of high-risk populations can help improve the efficacy of immunotherapy for the precise treatment of CRC. Nevertheless, there are still some limitations that need to be mentioned. First, as different bioinformatics algorithms may lead to different results, the combination with other similar predictive models will contribute to explain the predictive role of immune-related genes in COAD prognosis more comprehensively. Second, transcriptome analysis does not reflect the molecular mechanisms of COAD immunobiology, which may be better elucidated by proteomics and/or metabolomics. Third, since this study is a retrospective study with statistics from online databases, the validation of predictive effectiveness of the model in clinical practice was required. Fourth, our predictive model can add predictive value to existing patient risk groupings. Combining risk scores, TNM systems and age synergistically or complementarily is essential for clinical work. In conclusion, although we initially explored the expression characteristics and immune associations of immune-related genes, these genes have not been fully elucidated and deserve further in-depth study. In the next work, we will continue to validate the prognostic accuracy of this model on a large scale with more samples and more external experiments.
In summary, we created a six-gene prognostic model with good predictive capability in both the training and validation sets. This model could help clinicians predict individual risks of patients with COAD in the development of personalized COAD treatment.
Data extraction and integration were conducted using Perl (v5.32.1). DEIGs, DEGs, and DETFs were analyzed using R version 4.1.1 and the relevant Bioconductor packages (e.g., limma v3.48.3 and edgeR v3.34.1)  according to the screening criteria of |log2FC|> 1 and FDR < 0.05. Plots were generated using the R package ggplot2 (v3.3.5), and heatmaps were drawn using the R package pheatmap (v1.0.12).
DEIGs enriched and analyzed using GO and KEGG
GO and KEGG enrichment analyses of screened genes were performed using the R package clusterProfiler. An FDR < 0.05 was set as the cutoff criterion to identify the outstanding GO terms and KEGG pathways visualized using bubble and circle diagrams.
Correlation analysis between DEIGs and DETFs
Correlation analysis between DEIGs and DETFs was performed using the cor.test in R software (cor > 0.5, p < 0.001). Protein–protein interaction networks were generated using Cytoscape (version 3.8.2) .
WGCNA of DEIGs
WGCNA was performed by applying the R package WGCNA to DEIGs to obtain a different module. It was used to generate the module network plots using the R igraph package. Subsequently, we determined the intersection of the module genes obtained from the GEO and TCGA databases.
Immune-related immune genes obtained from intersection genes
Clinical information and gene expression for the univariate analysis were analyzed using Cox regression, and the corresponding DEGs were screened out as prognostic immune‐related genes for further study at p < 0.05 and |hazard ratio (HR)|≠ 1.
Prognostic model construction
Based on the expression of screened genes from the previous step, a risk model was built using multivariate Cox regression model analysis, calculated as follows: Risk score = Expgene1 × coefgene1 + Expgene2 × coefgene2 + … + Expgenen × coefgenen. Exp represents the expression level of the gene and coef is the estimated regression coefficient of the gene derived from the multivariate Cox analysis.
Evaluation and analysis of risk model
The RNA sequencing data of patients with COAD were obtained from TCGA database and set as the training set, whereas the external validation cohorts were obtained from the GEO dataset (GSE40967). The patients were then separated into high- and low-risk groups using the mean risk score in the training set as the cutoff value. Using the files downloaded from the MsigDB database, GO and KEGG enrichment analyses were performed for high-risk and low-risk groups using R software. K–M analysis and univariate and multivariate independent prognosis analyses were performed for the two groups using the survminer and survival R packages and the forest plot was drawn. The ROC curve and time-dependent ROC-based AUC were plotted using the R package timeROC. We found a similar prediction model in the literature and compared the predictive accuracy of the prognostic models using a C-index and plotting ROC curves to compare the AUC values.
Analyzing somatic mutation in high- and low-risk groups
The tumor mutation burden (TMB) was calculated as mutations per megabase (mut/Mb). The tumor mutation of patients in the high and low-risk groups was analyzed using TMB data. The R package maftools was used to analyze and visualize the somatic mutation data.
Correlation analysis of tumor-infiltrating immune cells
The proportion of infiltrating immune cells was calculated using the CIBERSORT algorithm  and the results were considered significant at p < 0.05. The Wilcoxon signed-rank test was used to analyze the differential abundance of infiltrating immune cells and immune cell function between the low- and high-risk groups, and box plots were created using the ggpubr package in R.
Validation of reliability of risk models in other databases
Mizukami T, et al. Significance of FGF9 gene in resistance to anti-EGFR therapies targeting colorectal cancer: a subset of colorectal cancer patients with FGF9 upregulation may be resistant to anti-EGFR therapies. Mol Carcinog. 2017;56(1):106–17.
Andersen CB, et al. Falcarindiol purified from carrots leads to elevated levels of lipid droplets and upregulation of peroxisome proliferator-activated receptor-γ gene expression in cellular models. Front Pharmacol. 2020;11:565524.
Bridges KM, et al. Relating stool microbial metabolite levels, inflammatory markers and dietary behaviors to screening colonoscopy findings in a racially/ethnically diverse patient population. Genes. 2018;9(3):119.
ZR is responsible for collecting data and conducting preliminary analyses. GZ and JY are responsible for guiding the use of statistical methods and the revision of the discussion section. All authors read and approved the final manuscript.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.