Data processing and identification of differentially expressed immune-related genes (DEIGs), genes (DEGs), and transcription factors (DETFs)
The transcriptome RNA sequencing data and clinical materials of 514 patients with COAD were obtained from TCGA. The screening criteria for DEGs between tumors and normal samples was set as a false discovery rate (FDR) < 0.05 and |log fold change (FC)|> 1, a total of 7782 differentially expressed genes were identified (Fig. 1A), B). A list of 2660 immune genes was obtained from the Immunome database, which were downloaded from InnateDB and ImmPORT databases; a total of 649 DEIGs were obtained from the screening (|log FC|> 1, FDR < 0.05) (Fig. 1C, D). We also obtained 318 transcription factors (TFs) from the Cistrome program and 67 DETFs from screening (Fig. 1E, F).
Functional analysis of DEIGs
Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses of the DEIGs were conducted. The GO-biological process (BP) analysis indicated that the DEIGs were mainly involved in B cell immunoglobulin-mediated immune responses, complement activation, and regulation of immune system processes (Fig. 2A, B). The KEGG analysis indicated that the DEIGs were mainly involved in cytokine–cytokine receptor interaction, viral protein interaction with cytokines and cytokine receptors, chemokine signaling pathways, the IL-17 signaling pathway, neuroactive ligand–receptor interactions, and the nuclear factor (NF)-κ B signaling pathway (Fig. 2C, D).
Building the DEIG and DETF interaction network
The correlation between DEIGs and DETFs was obtained using these screening criteria: cor > 0.5 and p < 0.001. The correlation network diagram was drawn using Cytoscape (Fig. 3). The specific correlation results are shown in Additional file1: Table S1.
Weighted gene co-expression network analysis (WGCNA) of DEIGs
WGCNA [12, 13] identifies gene modules with similar expression patterns by calculating gene expression relationships, analyzing relationships between gene modules and phenotypes, and mapping the regulatory network between genes in the gene module and central genes (hubs). The WGCNA package in the R software was used to divide DEIGs into five modules (“MEgreen,” “MEblue,” “MEbrown,” “MEyellow,” and “MEgrey”) (Fig. 4). The optimal power value was 4. Prognostic models were built based on the minimal p-value (< 0.05). Genes in the green module were selected for subsequent analysis.
Obtaining immune genes related to the prognosis of COAD patients
After obtaining the “MEgreen” gene module, univariate regression analysis was performed on the clinical data of COAD patients in the TCGA and Gene Expression Omnibus (GEO) databases. Nine genes were screened as prognosis-related genes (CD1A, CD1B, FGF9, GRP, OXTR, SPHK1, BGN, SERPINE1, and F2RL2) using p < 0.05 as the selection criterion. The results of univariate analysis are shown in Fig. 5A. The Kaplan–Meier (K–M) curves of nine genes in COAD patients were plotted using the Survival package in R software (Fig. 5B–J).
Building of a survival prognosis model and performing a survival analysis
The nine prognosis-related genes were included in a multivariate Cox regression analysis. The inclusion criteria were p < 0.05 and HR ≠ 1. Six genes (CD1A, CD1B, FGF9, GRP, SERPINE1, and F2RL2) were finally incorporated into the core model for risk calculation. The relationship between the genes and risk score is shown in Additional file2: Table S2. The composite risk scores of patients in TCGA datasets were calculated. The COAD patients were separated into low- and high-risk groups using the median risk score as the cutoff. The prognostic model was built using TCGA data and verified in the GEO datasets GSE40967. Further, we used the R software Survival and timeROC packages to draw two groups of K–M and receiver operating characteristic (ROC) curves, respectively (Fig. 6A–D), The R software ComplexHeatmap package was used to conduct a chi-square test between the demographics of the low- and high-risk groups and draw the clinical correlation heat map (p < 0.05, Fig. 6E). There were significant differences between the two groups in the tumor (T), node (N), metastasis (M) and stages. Univariate and multivariate Cox regression analyses revealed that age, T, stage, and risk score were independent prognostic factors for patients with COAD (Fig. 7A, B).
Single-sample gene set enrichment analysis (ssGSEA) for low- and high-risk groups
The GO and KEGG enrichment analysis files (c5.go.v7.4.smbols and c2.cp.kegg.v7.4.symbols) were downloaded from the GSEA database (http://www.gsea-msigdb.org/gsea/index.jsp). GO and KEGG enrichment analyses of the high- and low-risk groups were performed using the clusterProfiler and the org.Hs.eg.db packages in R. GO enrichment analysis of the high-risk group indicated the main enrichment was in keratinocyte differentiation, skin development, collagen-containing extracellular matrix (ECM) external encapsulating structure development, and structural molecule activity. The most significantly enriched KEGG pathways were axon guidance, ECM receptor interaction, focal adhesion, the peroxisome proliferator-activated receptor (PPAR) signaling pathway, and systemic lupus erythematosus. GO enrichment analysis of the low-risk group indicated the main enrichment was in the activation of immune responses, adaptive immune responses, immune response regulating signaling pathways, immunoglobulin production, and immunoglobulin complexes. The KEGG pathway indicated enrichment in allograft rejection, asthma, autoimmune thyroid disease, the intestinal immune network for immunoglobulin A (IgA) production, and primary immunodeficiency (Fig. 7C–F).
Comparison somatic mutation in high- and low-risk groups
Somatic mutation profiles of patients with COAD downloaded from TCGA were analyzed and visualized using the R maftools package [14]. A total of 388 patients had mutations; after removing samples with no amino acid mutation, 202 were high-risk and 184 were low-risk. The five genes with the highest somatic mutation rate in the high-risk group were APC, TP53, TTN, KRAS, and SYNE1. Missense mutations were the most common category. The high‐risk group had a higher mutation frequency than the low‐risk group (Fig. 8A, B).
Analyzing the tumor-infiltrating immune cells of high- and low-risk groups
Next, we used CIBERSORT to estimate the proportions of 22 distinct immune cell types (p < 0.05) (Fig. 8E), The Wilcoxon signed-rank test was used to determine the differences between tumor-infiltrating immune cells cells in the high-risk group. Resting dendritic cells and follicular helper T cells were present in significantly higher fractions in low-risk patients than in high-risk patients (p < 0.05; Fig. 8C). The immune cell function of the high-risk group was lower than that of the low-risk group with respect to adenomatous polyposis coli (APC) co-inhibition, APC co-expression, T-cell function, and macrophage function (Fig. 8D).
Comparison with other models
We compared four corresponding prognostic models: a seven-gene signature (Sun), six-gene signature (Liang), twelve-gene signature (Mia), and seven-gene signature (Chen) [15,16,17,18]. We took the median risk value of all samples as the division standard, divided them into high- and low-risk groups, and used the four models to calculate the risk of the patients in TCGA. The ROC and K–M curves for the four models are shown in Fig. 9A–J. The values of area under the curve (AUC) for the four models at 5 years were 0.581, 0.521, 0.616, and 0.555, respectively, all significantly lower than our model (0.639). We calculated the concordance indexes (C-indexes), which were used to evaluate the prediction capability of the mixed-effect Cox model [20], of all models. The results showed that our model exhibited the highest C-index value (0.704; Fig. 9K).
Verification of six immune-related genes in external databases
The Tumor Immune Estimation Resource (TIMER) online database was used to analyze the differential expression of six genes in this model in 17 types of tumors and adjacent tissues. The CD1A, CD1B, GRP, SERPINE1, and F2RL2 genes were highly expressed in tumor tissue. In contrast, FGF9 was highly expressed in normal colon tissue (Fig. 10). A Human Protein Atlas (HPA) database search was performed to verify the protein expression levels of CD1A, CD1B, SERPINE1, and FGF9 (Fig. 11).