neoDL: a novel neoantigen intrinsic feature-based deep learning model identifies IDH wild-type glioblastomas with the longest survival

Background Neoantigen based personalized immune therapies achieve promising results in melanoma and lung cancer, but few neoantigen based models perform well in IDH wild-type GBM, and the association between neoantigen intrinsic features and prognosis remain unclear in IDH wild-type GBM. We presented a novel neoantigen intrinsic feature-based deep learning model (neoDL) to stratify IDH wild-type GBMs into subgroups with different survivals. Results We first derived intrinsic features for each neoantigen associated with survival, followed by applying neoDL in TCGA data cohort(AUC = 0.988, p value < 0.0001). Leave one out cross validation (LOOCV) in TCGA demonstrated that neoDL successfully classified IDH wild-type GBMs into different prognostic subgroups, which was further validated in an independent data cohort from Asian population. Long-term survival IDH wild-type GBMs identified by neoDL were found characterized by 12 protective neoantigen intrinsic features and enriched in development and cell cycle. Conclusions The model can be therapeutically exploited to identify IDH wild-type GBM with good prognosis who will most likely benefit from neoantigen based personalized immunetherapy. Furthermore, the prognostic intrinsic features of the neoantigens inferred from this study can be used for identifying neoantigens with high potentials of immunogenicity. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04301-6.


Data Description
Mutations and clinical information were downloaded from the ATLAS-TCGA pan-glioma study (Ceccarelli, et al., 2016). Gene expression microarray data with Agilent chip (G4502A) at level 3 were downloaded from TCGA Data portal. We termed the data from TCGA as TCGA cohort. Mutations, RNAseq gene expression data, and clinical information in Asian population were collected from a recently published cohort (Wang, et al., 2016), designated as Pri cohort.
The samples that were not diagnosed as IDH wild-type GBM or did not have clinical information were removed, resulting in 268 and 46 samples in the two cohorts, respectively.
A neoepitope with strong affinity for MHC ( 50 equal or less than 500 nM ) may be a more robust neoantigen candidate if the paired wild-type epitope has a poor affinity for MHC ( 50 greater than 500 nM) (Wood, et al., 2018). The neoantigens for each sample in both TCGA cohort and Pri cohort were from our previous study (Zhang, et al., 2019), which used missense mutations to generate all possible 9-mer peptides and defined the mutant 9-mer peptides as neoantigens when the 50 of mutant-type 9-mer peptides was less than 500 nM and the corresponding wild-type binder more than 500 nM. All the downstream analyses were based on the inferred neoantigens (the mutant peptides) and their corresponding wild-type peptides.

Feature calculation for neoantigens
For the purpose of extracting features from neoantigens, the samples with detected mutant peptides remained in the downstream analysis, including 262 samples in the TCGA cohort and 42 samples in Pri cohort. A total of 2928 features were extracted from 2263 neoantigens (2081for TCGA cohort; 182 for Pri cohort) in the downstream analysis. Specifically, features used in the calculation were derived using the R package "Peptides"(v2.4.2) including 66 amino acid descriptors and physical-chemical properties (aliphatic, auto-correlation, auto-covariance, Boman index, theoretical net charge, cross-covariance, hydrophobic moment, hydrophobicity, instability, molecular weight).
Additionally, the "aaComp" command was also used to describe amino acid features including Tiny, Small, Aliphatic, Aromatic, Non-polar, Polar, Basic, Acidic. Variables were derived by the presence (1) or absence (0) of each feature. Characteristic variables were performed in four conditions respectively, including the complete sequence, the site of mutation along with each antigen and the dipeptides/tripeptides related to the mutation site, each absolute position along each antigen and related dipeptide/tripeptide composition, and the difference of each feature in the mutated versus reference antigen.
The features, described overall content of a protein, for example, amino acid composition, were significant. Variables demonstrating presence(1) or absence(0) of each amino acid type following, including the first or last 3 amino acid residues or middle residues of each antigen, the first or last amino acid residues of each antigen, the first or last 2 amino acid residues or middle residues of each antigen.
To measure the complexity at the protein and residue level, we computed Shannon entropy of a protein and entropy of each type of residues using the following equations： where HS is Shannon entropy of a protein sequence and is the entropy of a residue type i. p is the probability of the existence of a given amino acid in the sequence. We calculated the Shannon entropy of the mutant peptides and the difference of Shannon entropy in the mutant antigen versus reference antigen. Cancer is characterized by the accumulation of mutations, so the analysis of mutant positions is valid. Therefore, the Shannon entropy of the dipeptides/tripeptides related to the mutation site and the entropy difference of mutations process were performed. The entropy of a residue type was also calculated for the mutant peptides and reference peptides.

Prognostic feature selection
The features were calculated for all detected 9-mer mutated peptides and wild peptides. There were multiple types of mutations in each patient, resulting in a large number of mutant peptides in total. Thus, each feature value calculated in all the peptides detected in a patient was averaged as the final value. Univariate Cox regression analysis was performed here to predict the impact of each feature on prognosis. The threshold of P-value was set as 0.05, which means all the features with Pvalue lower than or equal to 0.05 were deemed as statistically significant (termed as valid features).
A correlation matrix of the valid features was conducted, and visualized through heatmaps using the package 'pheatmap' in R language.

Hierarchical k-means clustering
Hierarchical k-means clustering was applied upon the Z-Score-transformed valid features to stratify patients into two clusters. Hierarchical k-means was performed using the "hkmeans" command of the R package 'factoextra' (version 1.0.7). The overall survival differences between two clusters of patients were compared through Kaplan-Meier survival analysis.

Deep-learning model construction
The grouping results derived from hierarchical k-means clustering were used as labels, marking 0 and The maximum number of iterations was set as 1000. The initial connection weights and biases of each layer were randomly generated, and end up reaching stable parameters through training iterations.

Leave one out cross validation (LOOCV)
After determining the framework of the model, cross validation was a necessary step. Specifically, the training data was separated into two sections randomly with proportion of training and testing sets as 6 to 4. The training set was used to train the model to determine the unknown parameters, while the test set was used to validate the effect of the predicted parameters. To obtain the optimal model, the above process was carried out 300 times. Kaplan-Meier survival analysis was operated each time to see if the model can divide the samples into two groups with a statistically significant survival difference. Only groups with P-value lower than or equal to the threshold of 0.05 were regarded as statistically significant. Among 300 times trial, the more significant stratifications, the more stable our model is.

Independent validation
A model with fixed parameters corresponding to the lowest P-value was selected as the optimal model.
To test the performance of the optimal model, Pri cohort was used as an external test data. The optimal model divided patients in the Pri cohort into long-and short-term survival clusters. Kaplan Meier analysis was conducted between the long-and short-term survival clusters in Pri cohort to test the predictive performance of the optimal model for IDH wild-type GBMs. Besides, other glioma subtypes from TCGA data were also tested by the model, including Astrocytoma, Classical-like, OligoAstrocytoma.

Tumor purity estimation
Tumor purities were estimated by ESTIMATE (Yoshihara, et al., 2013) from gene expression profiles.
There were a total of 242 and 29 IDH wild-type GBMs in the TCGA cohort and Pri cohort with gene expression profiles available, respectively. The purity score was performed for each sample, using the R package 'estimate'(version 1.6.7). Meanwhile, the immune score and the stromal score were also estimated.

GO enrichment analysis
To identify differential genes expression between different groups, GO enrichment analysis was conducted using Gene Set Enrichment Analysis (GSEA 4.0.3) (Subramanian, et al., 2005)  Supplementary Figure S1. Flow chart of neoDL.The flow chart contains three parts: extracting neoantigen-features, training process(that was done with 300 iterations) and applying in an independent data set.
Supplementary Figure S2. Survival of glioma patients stratified according to missense mutational load. a, Glioma IDH status sub-groups; b, Glioma histology sub-groups; c, Glioma transcriptomic subgroups; d,Glioma DNA methylation sub-groups. red line, high mean mutational load ; blue line, low mean mutational load; n, number of patients; p-value was determined using the log-rank test.
Supplementary Figure S3. Survival of glioma patients stratified according to absolute number of neoantigens. a, Glioma IDH status sub-groups; b, Glioma histology sub-groups; c, Glioma transcriptomic sub-groups; d,Glioma DNA methylation sub-groups. red line, high mean mutational load ; blue line, low mean mutational load; n, number of patients; p-value was determined using the log-rank test.
Supplementary Figure S4. Survival of glioma patients stratified according to differential agretopicity index (DAI). a, Glioma IDH status sub-groups; b, Glioma histology sub-groups; c, Glioma transcriptomic sub-groups; d,Glioma DNA methylation sub-groups. red line, high mean DAI; blue line, low mean DAI; n, number of patients; p-value was determined using the log-rank test.
Supplementary Figure  Supplementary Figure S9. Comparison of the similarity of valid feature values between long-term survival and short-term survival groups of IDH wild-type GBM in two cohorts. a, molecular weight related features. b, molecular size/volume related features. c, molecular electrostatic potential/polarity related features. P-value was calculated by unpaired T test. ). The left and the right col contain the TCGA cohort and Pri cohort, respectively.
Supplementary Figure S10. Survival of glioma patients stratified according to 2 feature values，and analysis of the correlations between these features in two cohorts. a, absolute position 3 and 4 composed-dipeptide protFP2 value ; b, absolute position 3 and 4 composed-dipeptide VHSE-scale 2. red line, high mean value, blue , low mean value; n, number of patients; p-value was determined using the log-rank test. c, the Pearson correlation coefficient of the two features. The left and the right col contain the TCGA cohort and Pri cohort, respectively.
Supplementary Figure S11. Comparison of the similarity of immune score and stromal score between two groups and the correlation analysis between purity and mutation load. a, stromal score. b, immune score. Immune score and stromal score were derived by ESTIMATE. P value was calculated by unpaired T test. c, pearson correlation between purity and log10(mutation load). The left and the right col contain the TCGA cohort and Pri cohort, respectively.