
StackTTCA: a stacking ensemble learning-based framework for accurate and high-throughput identification of tumor T cell antigens

Abstract

Background

The identification of tumor T cell antigens (TTCAs) is crucial for providing insights into their functional mechanisms and for utilizing their potential in anticancer vaccine development, where TTCAs are highly promising candidates. Meanwhile, experimental technologies for discovering and characterizing new TTCAs are expensive and time-consuming. Although many machine learning (ML)-based models have been proposed for identifying new TTCAs, there is still a need for a robust model that achieves higher accuracy and precision.

Results

In this study, we propose a new stacking ensemble learning-based framework, termed StackTTCA, for accurate and large-scale identification of TTCAs. First, we constructed 156 different baseline models by using 12 different feature encoding schemes and 13 popular ML algorithms. Second, these baseline models were trained and employed to create a new probabilistic feature vector. Finally, the optimal probabilistic feature vector was determined based on a feature selection strategy and then used for the construction of our stacked model. Comparative benchmarking experiments indicated that StackTTCA clearly outperformed several ML classifiers and the existing methods on the independent test, with an accuracy of 0.932 and a Matthews correlation coefficient of 0.866.

Conclusions

In summary, the proposed stacking ensemble learning-based framework of StackTTCA could help to precisely and rapidly identify true TTCAs for follow-up experimental verification. In addition, we developed an online web server (http://2pmlab.camt.cmu.ac.th/StackTTCA) to maximize user convenience for high-throughput screening of novel TTCAs.


Introduction

Tumor cells generate molecules called tumor antigens (TAs). TAs are classified into two types: tumor-associated antigens (TAAs) and tumor-specific antigens (TSAs). TAAs are self-proteins that are highly expressed in tumor cells relative to normal cells, while TSAs are found solely in tumor cells [1, 2]. The human body is capable of recognizing TAs and initiating innate and adaptive immune responses to eliminate cancerous growths. Innate immune cells (i.e., neutrophils, macrophages, NK cells, dendritic cells, and others) respond quickly and offer nonspecific defense mechanisms. The adaptive immune system, comprising T-cells and B-cells, targets antigens through a more intricate and slower process; however, it has the potential to mount a robust and targeted immune response against tumors or cancers [3]. Antigen-presenting dendritic cells (DCs) break down TAs and display short peptides through major histocompatibility complex class I (MHC-I) to activate cytotoxic CD8+ T-cells, or through MHC class II to stimulate helper CD4+ T-cells. Among these, CD8+ T-cells are crucial for eradicating tumors and for immune surveillance against cancer cells [4, 5]. Hence, T-cell epitopes linked with TAs are among the most important targets for developing cancer immunotherapies that can help eliminate disease and prevent its recurrence. In recent years, peptides originating from TAs have been identified as epitopes and used as immunotherapeutic agents against various types of tumors and cancers [3, 4, 6, 7]. For a T-cell antigen to be an ideal target in cancer immunotherapy, it must fulfil several criteria. It should exhibit specificity to the tumor, meaning that it should be highly expressed in cancerous tissues without triggering autoimmunity or immune tolerance.
Additionally, the antigen should be prevalent and abundant in tumor cells, especially if it plays a crucial role in oncogenesis and can prevent the tumor from evading the immune system. Furthermore, the antigen should be immunogenic, i.e., capable of generating an immune response, as assessed by cytokine release, tumor cytolysis and, most importantly, T-cell recognition. Finally, epitopes with favorable properties, such as optimal length, hydrophobicity, and aromaticity, can be highly effective [1, 8,9,10]. To design successful experiments for personalized and precise immunotherapy, it is crucial to have a comprehensive understanding of the immunogenic epitopes found on tumor antigens.

The existence of large peptide databases, such as the immune epitope database (IEDB) [11], TANTIGEN [12], and TANTIGEN 2.0 [13], is expected to aid the identification of tumor T-cell antigens (TTCAs) that bind to MHC-I molecules. By using sequence information alone, computational methods have the potential to rapidly and precisely identify TTCAs, offering a more time-efficient and cost-effective alternative to experimental approaches. This is especially important given the laborious and expensive nature of experiment-based discovery, making it imperative to develop efficient computational methods for TTCA identification [14,15,16]. To date, a variety of computational approaches have been created for TTCA identification based on sequence information, including TTAgP1.0 [17], iTTCA-Hybrid [18], TAP1.0 [19], iTTCA-RF [20], iTTCA-MFF [21], and PSRTTCA [22]. Table 1 summarizes the existing computational approaches according to their benchmark datasets, machine learning (ML) methods, and web server availability. According to the applied ML methods, the six existing computational approaches can be categorized into two groups: single ML-based methods (TTAgP1.0 [17], iTTCA-Hybrid [18], TAP1.0 [19], iTTCA-RF [20], and iTTCA-MFF [21]) and ensemble learning-based methods (PSRTTCA [22]). Among the existing approaches, PSRTTCA [22] was recently developed based on an RF-based meta-approach. In PSRTTCA, a pool of propensities of amino acids and dipeptides was estimated by using the scoring card method (SCM) and then treated as the input feature vector for the construction of a meta-predictor. More details on the existing computational approaches are summarized in two previous studies [21, 22]. Although the existing computational approaches attained reasonably good performances, their performance on the independent test dataset is still not satisfactory. For example, PSRTTCA, which performed best among the various TTCA predictors, provided an accuracy (ACC) of 0.827 and a Matthews correlation coefficient (MCC) of 0.654.

Table 1 Summary of existing computational methods for the prediction of TTCAs

The objective of this study is to present a stacking ensemble learning-based framework, called StackTTCA, for the precise and comprehensive detection of TTCAs. The procedure of the StackTTCA development is described in Fig. 1. First, we employed 12 different feature encoding schemes covering various aspects to extract the information of TTCAs, including composition information, reduced amino acid sequence information, pseudo amino acid composition information, and physicochemical properties. Second, we trained 13 individual ML methods using each feature encoding. As a result, 156 baseline models were obtained and used to create a 156-D probabilistic feature vector. Finally, a feature selection strategy was utilized to optimize this probabilistic feature vector, which was then used as the optimal feature vector for the construction of the stacked model. After extensive comparative analysis on an independent test, StackTTCA demonstrated superior performance in identifying TTCAs compared to several ML classifiers and existing methods. To better understand the remarkable performance of StackTTCA, we utilized the Shapley Additive exPlanations (SHAP) algorithm to enhance model interpretation and identify the most significant features of StackTTCA. Finally, an online web server (http://2pmlab.camt.cmu.ac.th/StackTTCA) was created to facilitate high-throughput screening of novel TTCAs, maximizing user convenience.

Fig. 1

The overall workflow of our proposed approach StackTTCA, which includes five major steps: (i) datasets collection, (ii) baseline model construction, (iii) meta-classifier development, (iv) performance evaluation, and (v) web server deployment

Materials and methods

Overall framework of StackTTCA

A comprehensive illustration of the steps involved in the development and performance evaluation of StackTTCA is provided in Fig. 1. First, robust training and independent test datasets were gathered. Subsequently, a collection of baseline models was established by employing 13 machine learning methods in combination with 12 feature encoding techniques. The resulting baseline models were then used to generate a feature vector comprising 156 probabilistic features ranging from 0 to 1. This feature vector was further optimized through a feature selection scheme to construct the meta-classifier, i.e., StackTTCA. The efficacy of StackTTCA was evaluated using tenfold cross-validation, independent testing, and case studies. Finally, an online web server for StackTTCA was developed to enhance its accessibility and usability.

Benchmark dataset

There are two popular benchmark datasets, originally collected by Charoenkwan et al. [18] and Herrera-Bravo et al. [19]. However, the dataset in [18] involved incorrect negative samples [19, 22]. Thus, in this study, we employed the remaining benchmark dataset for assessing the performance of our proposed approach. This dataset was used for training several existing methods (i.e., iTTCA-Hybrid [18], TAP1.0 [19], iTTCA-RF [20], and PSRTTCA [22]). Specifically, it contains 592 unique TTCAs and 593 unique non-TTCAs, which are considered positive and negative samples, respectively. The training dataset of Herrera-Bravo et al. [19] was constructed by randomly selecting 474 TTCAs and 474 non-TTCAs, while the remaining TTCAs and non-TTCAs were employed as the independent test dataset.
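The split described above can be sketched as follows. The sequence identifiers and the random seed are placeholders; the original study specifies only that 474 samples per class were drawn at random for training:

```python
import random

def split_benchmark(ttcas, non_ttcas, n_train=474, seed=42):
    """Randomly draw n_train positives and n_train negatives for training;
    the remaining peptides form the independent test set."""
    rng = random.Random(seed)
    pos, neg = list(ttcas), list(non_ttcas)
    rng.shuffle(pos)
    rng.shuffle(neg)
    train = [(p, 1) for p in pos[:n_train]] + [(n, 0) for n in neg[:n_train]]
    test = [(p, 1) for p in pos[n_train:]] + [(n, 0) for n in neg[n_train:]]
    return train, test

# 592 TTCAs and 593 non-TTCAs, as in the benchmark dataset.
train, test = split_benchmark([f"P{i}" for i in range(592)],
                              [f"N{i}" for i in range(593)])
print(len(train), len(test))  # 948 and 237
```

This yields 948 training samples and 237 independent test samples (118 TTCAs and 119 non-TTCAs).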

Stacking ensemble learning-based framework

Instead of simply selecting an optimal single ML model, this study aims to build a stacking ensemble learning-based framework [14,15,16, 23, 24] that takes advantage of several ML models to improve the prediction performance for TTCAs. Figure 1 illustrates the overall framework of StackTTCA. It comprises three main steps: baseline model construction, probabilistic feature optimization, and meta-classifier development. In brief, we first applied state-of-the-art ML methods and feature encoding schemes to create a pool of baseline models. Second, the outputs of these baseline models were generated and optimized using a feature selection scheme. Finally, the optimal feature set was employed to develop the meta-classifier.

At the first step, TTCAs and non-TTCAs were encoded using 12 different feature encoding schemes (see Table 2). After that, we trained 13 individual ML methods on each feature encoding scheme. As a result, we obtained 156 baseline models (13 ML methods × 12 encodings). In addition, we employed a grid search in conjunction with the tenfold cross-validation procedure to determine the optimal parameters of the ADA, ET, LGBM, LR, MLP, RF, SVMLN, SVMRBF, and XGB classifiers and maximize their performances (see Additional file 1: Table S1). All the baseline models were created using the Scikit-learn v0.24.1 package [25].
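As an illustration of this tuning step, the following is a minimal sketch of one of the 156 baseline models: an ET classifier tuned by grid search under tenfold cross-validation with MCC as the selection criterion. The feature matrix is random stand-in data rather than a real peptide encoding, and the parameter grid is abbreviated:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import matthews_corrcoef, make_scorer
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((200, 40))        # stand-in for one feature encoding (e.g., CTD)
y = rng.integers(0, 2, 200)      # stand-in TTCA / non-TTCA labels

# One baseline model: ET tuned by grid search under tenfold cross-validation,
# scored with MCC as in the paper's optimization protocol.
grid = GridSearchCV(
    ExtraTreesClassifier(random_state=1),
    param_grid={"n_estimators": [20, 50, 100]},  # abbreviated grid
    cv=10,
    scoring=make_scorer(matthews_corrcoef),
)
grid.fit(X, y)
print(grid.best_params_)
```

The same pattern is repeated for each of the 13 classifiers on each of the 12 encodings.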

Table 2 Summary of 12 different feature encodings along with their corresponding description and dimension

At the second step, we conducted the tenfold cross-validation procedure for each baseline model to generate a new probabilistic feature for extracting the crucial information of TTCAs. After performing this process, we obtained a 156-D probabilistic feature vector (APF). The APF can be represented by

$${\text{APF}} = \left\{ {{\text{PF}}_{1,1} , {\text{PF}}_{1,2} ,{\text{PF}}_{1,3} , \ldots ,{\text{PF}}_{{{\text{i}},{\text{j}}}} , \ldots ,{\text{PF}}_{13,12} } \right\}$$
(1)

where \({\text{PF}}_{{{\text{i}},{\text{j}}}}\) is the probabilistic feature (PF) generated from the ith ML method in conjunction with the jth feature encoding. Although the dimension of the APF is 156, some of its features are not effective and provide noisy information. Therefore, we conducted a feature optimization process based on our previously developed genetic algorithm, termed GA-SAR [27, 30,31,32], to determine m important PFs (m < 156). The resulting m-D probabilistic feature vector is referred to as BPF. The chromosomes of GA-SAR consist of two parts: binary genes and parametric genes. Herein, the GA-SAR parameters were set to rbegin = 5, mstop = 20, Pm = 0.05, and Pop = 20 [27, 28, 32]. The feature importance selection procedure based on the GA-SAR method is as follows. First, we randomly constructed a population of Pop individuals and comprehensively evaluated the performance of each individual using the fitness function under the tenfold cross-validation scheme. Second, we used tournament selection to obtain the best individuals for the construction of a mating pool. Third, we performed the self-assessment-report (SAR) operation between the best individual and every other individual to obtain new children. In this study, 20 generations were used as the stopping condition. Further information regarding this algorithm is provided in our previous studies [32,33,34,35].
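The construction of the probabilistic feature vector in Eq. (1) can be sketched with scikit-learn's `cross_val_predict`, which yields out-of-fold class probabilities. For brevity, only 2 classifiers and 2 stand-in random encodings are used here (4 PFs instead of the full 13 × 12 = 156):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
# Stand-in encodings: random matrices in place of real CTD / DPC features.
encodings = {"CTD": rng.random((150, 30)), "DPC": rng.random((150, 400))}
y = rng.integers(0, 2, 150)
classifiers = {"RF": RandomForestClassifier(n_estimators=50, random_state=1),
               "LR": LogisticRegression(max_iter=1000)}

# Each (classifier, encoding) pair contributes one probabilistic feature:
# the out-of-fold P(TTCA) estimated under tenfold cross-validation.
columns = []
for clf in classifiers.values():
    for X in encodings.values():
        p = cross_val_predict(clf, X, y, cv=10, method="predict_proba")[:, 1]
        columns.append(p)
apf = np.column_stack(columns)  # 156 columns in the full framework
print(apf.shape)                # (150, 4) in this 2 x 2 toy example
```

Using out-of-fold probabilities (rather than refitting on the full training set) keeps the meta-classifier's inputs free of training-set leakage.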

In the last step, we used the ET method as the meta-classifier (termed mET) for the development of the stacked model. We individually trained stacking ensemble models using the two probabilistic feature vectors, APF and BPF. The binary and parametric genes of the mET predictor consisted of n = 156 PFs and n_estimators \(\in\) {20, 50, 100, 200, 500} (see Additional file 1: Table S1). Finally, we selected the best-performing feature vector in terms of MCC to construct StackTTCA.
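The meta-classifier step can be sketched as follows. The BPF here is random stand-in data for the m = 10 selected probabilistic features, and n_estimators = 200 is one arbitrary value from the search range above:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
# Stand-in BPF: the m = 10 probabilistic features retained by GA-SAR.
bpf = rng.random((150, 10))
y = rng.integers(0, 2, 150)

# The meta-classifier (mET): an ET model trained on the probabilistic
# feature vector; n_estimators is drawn from {20, 50, 100, 200, 500}.
met = ExtraTreesClassifier(n_estimators=200, random_state=1).fit(bpf, y)

# Prediction on a new peptide's probabilistic features.
print(met.predict_proba(bpf[:1]).shape)  # (1, 2): P(non-TTCA), P(TTCA)
```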

Evaluation metrics

In order to show the effectiveness of our proposed approach, its prediction performance was assessed by using four standard evaluation metrics, including ACC, MCC, specificity (Sp) and sensitivity (Sn) [36]. These evaluation metrics are computed as follows:

$${\text{ACC}} = \frac{{{\text{TP}} + {\text{TN}}}}{{\left( {{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}} \right)}}$$
(2)
$${\text{MCC}} = \frac{{{\text{TP}} \times {\text{TN}} - {\text{FP}} \times {\text{FN}}}}{{\sqrt[{}]{{\left( {{\text{TP}} + {\text{FP}}} \right)\left( {{\text{TP}} + {\text{FN}}} \right)\left( {{\text{TN}} + {\text{FP}}} \right)\left( {{\text{TN}} + {\text{FN}}} \right)}}}}$$
(3)
$${\text{Sp}} = \frac{{{\text{TN}}}}{{\left( {{\text{TN}} + {\text{FP}}} \right)}}$$
(4)
$${\text{Sn}} = \frac{{{\text{TP}}}}{{\left( {{\text{TP}} + {\text{FN}}} \right)}}$$
(5)

where TN and TP are the numbers of negative and positive samples predicted to be negative and positive, respectively, while FN and FP are the numbers of positive and negative samples predicted to be negative and positive, respectively [37,38,39,40]. Furthermore, we utilized the area under the receiver operating characteristic (ROC) curve (AUC) to assess the robustness of the model [41, 42].
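Eqs. (2)-(5) can be computed directly from the four confusion-matrix counts. The counts below are illustrative only, not taken from the paper's confusion matrices:

```python
import math

def metrics(tp, tn, fp, fn):
    """Eqs. (2)-(5): accuracy, MCC, specificity, and sensitivity."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    sp = tn / (tn + fp)
    sn = tp / (tp + fn)
    return acc, mcc, sp, sn

# Illustrative counts for a 237-sample test set.
acc, mcc, sp, sn = metrics(tp=110, tn=111, fp=8, fn=8)
print(round(acc, 3), round(mcc, 3))  # 0.932 0.865
```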

Results

Optimization of stacked models

In our stacking framework, two new probabilistic feature vectors (i.e., APF and BPF) were generated based on a pool of ML classifiers and then used to construct stacked models. Here, we assessed and compared the impact of these two vectors on TTCA identification using mET classifiers. As mentioned above, the APF is the 156-D probabilistic feature vector, while the BPF was obtained by using the GA-SAR method to select m important PFs. After performing the feature selection, the optimal value of m was 10. Specifically, the 10 important PFs were generated from 10 different ML classifiers: ET-RSAcid, LR-RSAcid, ET-DPC, SVMLN-CTD, XGB-CTD, ET-APAAC, ADA-APAAC, RF-PCP, SVMLN-AAI, and PLS-AAI. The performance comparison between the APF and BPF is shown in Table 3. In the case of the tenfold cross-validation results, both APF and BPF exhibit impressive overall performance in terms of ACC, MCC, and AUC, with ranges of 0.867–0.879, 0.737–0.760, and 0.933–0.935, respectively. Meanwhile, we observed that the BPF outperformed the APF in terms of all five metrics used. Furthermore, on the independent test dataset, the BPF's ACC, MCC, and Sn were 3.38%, 6.85%, and 5.08% higher, respectively, than those of the APF. As a result, the BPF was selected to construct our final stacked model.

Table 3 Cross-validation and independent test results for ET classifiers trained with three different features

Performance comparison with other ensemble strategies

To verify the necessity of the stacking strategy, we compared its performance with that of related ensemble strategies [16, 33, 34, 43], namely average scoring and majority voting. In brief, the average scoring and majority voting strategies use the prediction outputs of the 156 baseline models to create the corresponding ensemble models by averaging the probabilistic scores and by voting on the predicted classes, respectively. Table 4 summarizes the performance comparison of models trained based on the different ensemble strategies. In Table 4, both the cross-validation and independent test results demonstrate that the stacking strategy exhibits impressive overall performance across all five evaluation metrics. For example, in terms of the independent test results, the stacking strategy outperforms the two compared ensemble strategies by 10.92–11.77%, 19.53–20.38%, and 21.47–23.42% in ACC, Sn, and MCC, respectively. These results indicate that the stacking strategy is an effective approach for improving the prediction of TTCAs.
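The two baseline ensemble strategies can be sketched as follows, using a random stand-in matrix of probabilistic scores in place of the 156 baseline models' real outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in: probabilistic outputs of 156 baseline models for 5 peptides.
scores = rng.random((5, 156))

# Average scoring: mean probability across baseline models, thresholded at 0.5.
avg_pred = (scores.mean(axis=1) >= 0.5).astype(int)

# Majority voting: each baseline model casts a hard 0/1 vote at threshold 0.5,
# and the class with more than half the votes wins.
votes = (scores >= 0.5).astype(int)
maj_pred = (votes.sum(axis=1) > scores.shape[1] / 2).astype(int)

print(avg_pred, maj_pred)
```

In contrast, the stacking strategy learns how to weight the 156 outputs rather than treating them all equally, which is what the comparison in Table 4 credits for its advantage.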

Table 4 Performance comparison of different models trained based on different ensemble strategies

Performance comparison with conventional ML classifiers

In this section, the performance of all 156 constituent baseline models is assessed and presented in Fig. 2 and Additional file 1: Tables S2 and S3. As shown in Fig. 2, the top ten baseline models with the highest MCC are XGB-CTD, LGBM-CTD, ET-CTD, RF-CTD, ADA-CTD, MLP-CTD, SVMRBF-CTD, ET-RSPolar, SVMLN-CTD, and LR-RSAcid. Notably, eight of the top ten baseline models were developed based on the CTD descriptor, highlighting that the CTD descriptor is crucial for TTCA identification. For XGB-CTD's cross-validation results, Table 5 shows that this classifier achieves the highest ACC and MCC of 0.848 and 0.698, respectively. XGB-CTD also outperformed the other compared ML classifiers in terms of ACC and MCC on the independent test dataset. This evidence implies that XGB-CTD was the best among all the compared ML classifiers. Therefore, we further compared the performance of StackTTCA against the top five baseline models (i.e., XGB-CTD, LGBM-CTD, ET-CTD, RF-CTD, and ADA-CTD) to elucidate the advantages of the stacking strategy (Table 5). StackTTCA demonstrated superior performance on both the training and independent test datasets, outperforming all other methods across all five evaluation metrics. Impressively, on the independent test dataset, the ACC, Sn, and MCC of StackTTCA were 2.95%, 4.24%, and 5.99% higher, respectively, than those of XGB-CTD. Additionally, among the top five baseline models, StackTTCA exhibited the highest number of true positives and the lowest number of false negatives (Fig. 3). Furthermore, to understand the reason behind the better performance of StackTTCA, we utilized t-SNE to generate six boundary plots for our model and the top five baseline models [44, 45]. These plots depict TTCAs and non-TTCAs as red and blue dots, respectively. The visualization in Fig. 4 reveals that StackTTCA accurately classified the majority of dots, whereas several dots from the top five baseline models were misclassified. Taking into account both the cross-validation and independent test outcomes, StackTTCA exhibits improved and consistent prediction performance compared to several conventional ML classifiers.
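The t-SNE projection behind such visualizations can be sketched as follows, with synthetic two-cluster features standing in for the real probabilistic features of TTCAs and non-TTCAs:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in probabilistic features for 100 peptides:
# 50 "TTCAs" centered at 0.7 and 50 "non-TTCAs" centered at 0.3.
X = np.vstack([rng.normal(0.7, 0.1, (50, 10)),
               rng.normal(0.3, 0.1, (50, 10))])

# Project the 10-D feature space to 2-D for visualization,
# as in the boundary plots of Fig. 4.
emb = TSNE(n_components=2, perplexity=30, random_state=1).fit_transform(X)
print(emb.shape)  # (100, 2)
```

Each row of `emb` would then be plotted as a red (TTCA) or blue (non-TTCA) dot.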

Fig. 2

MCC values of 156 baseline models in terms of tenfold cross-validation (A) and independent (B) tests

Table 5 Performance comparison of StackTTCA and top five ML classifiers
Fig. 3
figure 3

Confusion matrices of StackTTCA and top five ML classifiers in terms of the independent test dataset. ADA-CTD (A), RF-CTD (B), ET-CTD (C), LGBM-CTD (D), XGB-CTD (E), StackTTCA (F)

Fig. 4

t-distributed stochastic neighbor embedding (t-SNE) distribution of positive and negative samples on the training dataset, where TTCAs and non-TTCAs are represented with red and blue dots, respectively. ADA-CTD (A), RF-CTD (B), ET-CTD (C), LGBM-CTD (D), XGB-CTD (E), StackTTCA (F)

Performance comparison with state-of-the-art methods

In this section, we compared the performance of StackTTCA against the state-of-the-art methods by conducting an independent test. To ensure a fair comparison, the state-of-the-art methods iTTCA-Hybrid [18], TAP1.0 [19], iTTCA-RF [20], and PSRTTCA [22] were selected for the comparative analysis. All prediction performances of these four methods were directly obtained from the PSRTTCA study [22]. Figure 5 and Table 6 show the performance comparison between StackTTCA and the four state-of-the-art methods. Among the four compared methods, the most effective one was PSRTTCA, which clearly outperformed the other related methods. Compared with PSRTTCA, StackTTCA achieved better performance in terms of ACC, Sn, Sp, and MCC. To be specific, the ACC, Sn, Sp, and MCC of StackTTCA were 10.55%, 13.56%, 7.56%, and 21.21% higher, respectively, than those of PSRTTCA. In addition, we performed case studies to verify the predictive reliability in realistic scenarios. All 73 experimentally verified TTCAs were retrieved from the PSRTTCA study [22]. Additional file 1: Table S4 lists the prediction results of StackTTCA and the four compared methods. As can be seen from Additional file 1: Table S4, StackTTCA secured the best performance in the case studies: 60 out of 73 TTCAs (ACC of 0.822) were correctly predicted by StackTTCA, while the four compared methods correctly predicted only 45–55 peptide sequences as TTCAs (ACC of 0.616–0.753). These results highlight the effectiveness and generalization ability of the proposed model, indicating that StackTTCA can help to precisely and rapidly identify true TTCAs for follow-up experimental verification.

Fig. 5

Heat-map of the prediction performance of StackTTCA and the state-of-the-art methods in terms of the independent test dataset

Table 6 Performance comparison of StackTTCA and the state-of-the-art methods on the independent test dataset

Feature importance analysis

In this section, we explore the impact of the 10 essential PFs used to create StackTTCA. We used the SHAP method to interpret StackTTCA's identification of TTCAs. These PFs were generated from the 10 ML classifiers selected by the GA-SAR method: ET-RSAcid, LR-RSAcid, ET-DPC, SVMLN-CTD, XGB-CTD, ET-APAAC, ADA-APAAC, RF-PCP, SVMLN-AAI, and PLS-AAI. Figure 6 illustrates the feature ranking of the 10 essential PFs based on their Shapley values. A positive SHAP value indicates a high likelihood of the prediction output being TTCA, while a negative value suggests a low probability. The top five crucial PFs were those based on XGB-CTD, ET-DPC, SVMLN-CTD, ET-APAAC, and LR-RSAcid, all of which exhibited positive SHAP values. Consequently, XGB-CTD produced a relatively high probabilistic score for most TTCAs and a relatively low score for most non-TTCAs. In contrast, PLS-AAI produced a relatively high score for most non-TTCAs and a relatively low score for most TTCAs.
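The paper performs this analysis with SHAP; as a lightweight stand-in (the shap package may not be available in every environment), the sketch below produces an analogous global feature ranking with scikit-learn's permutation importance. The data are synthetic, and the hypothetical XGB-CTD feature is deliberately constructed to be informative so that it tops the ranking:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
names = ["ET-RSAcid", "LR-RSAcid", "ET-DPC", "SVMLN-CTD", "XGB-CTD",
         "ET-APAAC", "ADA-APAAC", "RF-PCP", "SVMLN-AAI", "PLS-AAI"]
X = rng.random((200, 10))
# Make the stand-in XGB-CTD feature (column 4) determine the label.
y = (X[:, 4] > 0.5).astype(int)

model = ExtraTreesClassifier(n_estimators=100, random_state=1).fit(X, y)
# Permutation importance: accuracy drop when each PF column is shuffled.
imp = permutation_importance(model, X, y, n_repeats=10, random_state=1)
ranked = sorted(zip(names, imp.importances_mean), key=lambda t: -t[1])
print(ranked[0][0])  # "XGB-CTD" dominates in this synthetic setup
```

Unlike SHAP, permutation importance yields only a global magnitude per feature, not signed per-sample contributions, so it cannot reproduce the directionality shown in Fig. 6.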

Fig. 6

Feature importance from StackTTCA, where positive and negative SHAP values indicate the high probability that the prediction outputs are TTCA and non-TTCA, respectively

Discussion

Discovery and characterization of new TTCAs via experimental technologies are expensive and time-consuming. Therefore, computational approaches that can identify TTCAs using sequence information alone are highly desirable to facilitate community-wide efforts in analyzing and characterizing TTCAs. Although a variety of computational approaches have been proposed for TTCA identification, their performance is still not satisfactory. To overcome this shortcoming, this study presents StackTTCA, a stacking ensemble learning-based framework, for accurately identifying TTCAs and facilitating their large-scale characterization. In the present study, we conducted three comparative experiments to compare the performance of StackTTCA against conventional ML classifiers, related ensemble strategies, and existing state-of-the-art methods. These experiments aimed to reveal the effectiveness and robustness of our proposed approach. The comparative experiments on the independent test dataset and the case studies indicate that StackTTCA provides more accurate and stable prediction performance. Although the developed StackTTCA approach achieves an improvement in TTCA identification, this study still has some shortcomings that can be addressed in future work. First, the limited number of available TTCAs might restrict the prediction performance [46, 47]. Thus, we are motivated to collect additional TTCAs and combine them to construct an up-to-date dataset. Second, the discriminative power of the feature representation directly influences the model's performance. In the future, we plan to combine our probabilistic features with other informative and powerful features, such as fastText, GloVe, and Word2Vec embeddings [48, 49].

Conclusion

In this research, we have introduced a novel stacking ensemble learning framework, called StackTTCA, for accurately identifying TTCAs and facilitating their large-scale characterization. The major contributions of StackTTCA are as follows: (i) StackTTCA utilizes various feature encoding methods from different perspectives to extract information related to TTCAs, including composition information, reduced amino acid sequence information, pseudo amino acid composition information, and physicochemical properties. Thirteen individual ML methods were used to establish 156 different baseline models, which generated a 156-D probabilistic feature vector. This feature vector was optimized and used to construct the optimal stacked model; (ii) through a series of benchmarking experiments, we demonstrated that StackTTCA outperformed several conventional ML classifiers and existing methods on the independent test, achieving an accuracy of 0.932 and a Matthews correlation coefficient of 0.866; (iii) we employed the interpretable SHAP method to analyze and elucidate StackTTCA's identification of TTCAs; and (iv) to facilitate high-throughput screening of new TTCAs, we developed an online web server (http://2pmlab.camt.cmu.ac.th/StackTTCA) for user convenience.

Availability of data and materials

All the data used in this study are available at http://2pmlab.camt.cmu.ac.th/StackTTCA.

Abbreviations

TAAs:

Tumor associated antigens

TAs:

Tumor antigens

TSAs:

Tumor specific antigens

DCs:

Dendritic cells

MHC-I:

Major histocompatibility complex class I

IEDB:

Immune epitope database

TTCAs:

Tumor T-cell antigens

ML:

Machine learning

ET:

Extremely randomized trees

QDA:

Quadratic discriminant analysis

RF:

Random forest

SVM:

Support vector machine

SCM:

Scoring card method

MCC:

Matthews correlation coefficient

ACC:

Accuracy

ADA:

AdaBoost

DT:

Decision tree

KNN:

K-nearest neighbor

LGBM:

Light gradient boosting machine

LR:

Logistic regression

MLP:

Multilayer perceptron

NB:

Naive Bayes

PLS:

Partial least squares

SVMRBF:

Support vector machine with radial basis function

SVMLN:

Support vector machine with linear kernels

XGB:

Extreme gradient boosting

PF:

Probabilistic feature

GA:

Genetic algorithm

SAR:

Self-assessment-report operation

Sp:

Specificity

Sn:

Sensitivity

TP:

True positive

FP:

False positive

TN:

True negative

FN:

False negative

ROC:

Receiver operating characteristic

AUC:

Area under the ROC curve

References

  1. Ilyas S, Yang JC. Landscape of tumor antigens in T cell immunotherapy. J Immunol. 2015;195(11):5117–22.

    Article  CAS  PubMed  Google Scholar 

  2. Zamora AE, Crawford JC, Thomas PG. Hitting the target: how T cells detect and eliminate tumors. J Immunol. 2018;200(2):392–9.

    Article  CAS  PubMed  Google Scholar 

  3. Zhang L, Huang Y, Lindstrom AR, Lin T-Y, Lam KS, Li Y. Peptide-based materials for cancer immunotherapy. Theranostics. 2019;9(25):7807.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Vermaelen K. Vaccine strategies to improve anti-cancer cellular immune responses. Front Immunol. 2019;10:8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Alspach E, et al. MHC-II neoantigens shape tumour immunity and response to immunotherapy. Nature. 2019;574(7780):696–701.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Breckpot K, Escors D. Dendritic cells for active anti-cancer immunotherapy: targeting activation pathways through genetic modification. Endocr Metab Immune Disord Drug Targets (Former Curr Drug Targets Immune Endocr Metab Disord). 2009;9(4):328–43.

    Article  CAS  Google Scholar 

  7. Miliotou AN, Papadopoulou LC. CAR T-cell therapy: a new era in cancer immunotherapy. Curr Pharm Biotechnol. 2018;19(1):5–18.

    Article  PubMed  Google Scholar 

  8. Calis JJ, et al. Properties of MHC class I presented peptides that enhance immunogenicity. PLoS Comput Biol. 2013;9(10): e1003266.

    Article  PubMed  PubMed Central  Google Scholar 

  9. Chowell D, et al. TCR contact residue hydrophobicity is a hallmark of immunogenic CD8+ T cell epitopes. Proc Natl Acad Sci. 2015;112(14):E1754–62.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Nishimura Y, Tomita Y, Yuno A, Yoshitake Y, Shinohara M. Cancer immunotherapy using novel tumor-associated antigenic peptides identified by genome-wide cDNA microarray analyses. Cancer Sci. 2015;106(5):505–11.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. Vita R, et al. The immune epitope database (IEDB): 2018 update. Nucleic Acids Res. 2019;47(D1):D339–43.

    Article  CAS  PubMed  Google Scholar 

  12. Olsen LR, Tongchusak S, Lin H, Reinherz EL, Brusic V, Zhang GL. TANTIGEN: a comprehensive database of tumor T cell antigens. Cancer Immunol Immunother. 2017;66(6):731–5.

    Article  CAS  PubMed  Google Scholar 

  13. Zhang G, Chitkushev L, Olsen LR, Keskin DB, Brusic V. TANTIGEN 2.0: a knowledge base of tumor T cell antigens and epitopes. BMC Bioinform. 2021;22(8):1–8.

    Google Scholar 

  14. Wei L, Zhou C, Chen H, Song J, Su R. ACPred-FL: a sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics. 2018;34(23):4007–16.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  15. Rao B, Zhou C, Zhang G, Su R, Wei L. ACPred-Fuse: fusing multi-view information improves the prediction of anticancer peptides. Brief Bioinform. 2020;21(5):1846–55.

    Article  PubMed  Google Scholar 

  16. Qiang X, Zhou C, Ye X, Du P-F, Su R, Wei L. CPPred-FL: a sequence-based predictor for large-scale identification of cell-penetrating peptides by feature representation learning. Brief Bioinform. 2020;21(1):11–23.

    PubMed  Google Scholar 

  17. Lissabet JFB, Belén LH, Farias JG. TTAgP 1.0: a computational tool for the specific prediction of tumor T cell antigens. Comput Biol Chem. 2019;83: 107103.

  18. Charoenkwan P, Nantasenamat C, Hasan MM, Shoombuatong W. iTTCA-Hybrid: improved and robust identification of tumor T cell antigens by utilizing hybrid feature representation. Anal Biochem. 2020;599:113747.

  19. Herrera-Bravo J, Belén LH, Farias JG, Beltrán JF. TAP 1.0: a robust immunoinformatic tool for the prediction of tumor T-cell antigens based on AAindex properties. Comput Biol Chem. 2021;91:107452.

  20. Jiao S, Zou Q, Guo H, Shi L. iTTCA-RF: a random forest predictor for tumor T cell antigens. J Transl Med. 2021;19(1):1–11.

  21. Zou H, Yang F, Yin Z. iTTCA-MFF: identifying tumor T cell antigens based on multiple feature fusion. Immunogenetics. 2022;74(5):447–54.

  22. Charoenkwan P, Pipattanaboon C, Nantasenamat C, Hasan MM, Moni MA, Shoombuatong W. PSRTTCA: a new approach for improving the prediction and characterization of tumor T cell antigens using propensity score representation learning. Comput Biol Med. 2023;152:106368.

  23. Zhang T, Jia Y, Li H, Xu D, Zhou J, Wang G. CRISPRCasStack: a stacking strategy-based ensemble learning framework for accurate identification of Cas proteins. Brief Bioinform. 2022;23(5):bbac335.

  24. Wu H, et al. scHiCStackL: a stacking ensemble learning-based method for single-cell Hi-C classification using cell embedding. Brief Bioinform. 2022;23(1):bbab396.

  25. Pedregosa F, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.

  26. Ahmad S, et al. SCORPION is a stacking-based ensemble learning framework for accurate prediction of phage virion proteins. Sci Rep. 2022;12(1):4106.

  27. Charoenkwan P, Schaduangrat N, Moni MA, Manavalan B, Shoombuatong W. SAPPHIRE: a stacking-based ensemble learning framework for accurate prediction of thermophilic proteins. Comput Biol Med. 2022;146:105704.

  28. Charoenkwan P, Schaduangrat N, Moni MA, Manavalan B, Shoombuatong W. NEPTUNE: a novel computational approach for accurate and large-scale identification of tumor homing peptides. Comput Biol Med. 2022;148:105700.

  29. Xu C, Ge L, Zhang Y, Dehmer M, Gutman I. Computational prediction of therapeutic peptides based on graph index. J Biomed Inform. 2017;75:63–9.

  30. Charoenkwan P, et al. AMYPred-FRL is a novel approach for accurate prediction of amyloid proteins by using feature representation learning. Sci Rep. 2022;12(1):1–14.

  31. Charoenkwan P, Schaduangrat N, Moni MA, Shoombuatong W, Manavalan B. Computational prediction and interpretation of druggable proteins using a stacked ensemble-learning framework. iScience. 2022;25(9):104883.

  32. Charoenkwan P, Schaduangrat N, Nantasenamat C, Piacham T, Shoombuatong W. iQSP: a sequence-based tool for the prediction and analysis of quorum sensing peptides using informative physicochemical properties. Int J Mol Sci. 2019;21(1):75.

  33. Charoenkwan P, Nantasenamat C, Hasan MM, Moni MA, Manavalan B, Shoombuatong W. UMPred-FRL: a new approach for accurate prediction of umami peptides using feature representation learning. Int J Mol Sci. 2021;22(23):13124.

  34. Charoenkwan P, Nantasenamat C, Hasan MM, Moni MA, Manavalan B, Shoombuatong W. StackDPPIV: a novel computational approach for accurate prediction of dipeptidyl peptidase IV (DPP-IV) inhibitory peptides. Methods. 2022;204:189–98.

  35. Charoenkwan P, Schaduangrat N, Lio P, Moni MA, Manavalan B, Shoombuatong W. NEPTUNE: a novel computational approach for accurate and large-scale identification of tumor homing peptides. Comput Biol Med. 2022;148:105700.

  36. Azadpour M, McKay CM, Smith RL. Estimating confidence intervals for information transfer analysis of confusion matrices. J Acoust Soc Am. 2014;135(3):EL140–6.

  37. Lai H-Y, et al. iProEP: a computational predictor for predicting promoter. Mol Ther Nucl Acids. 2019;17:337–46.

  38. Lv H, Dao F-Y, Guan Z-X, Yang H, Li Y-W, Lin H. Deep-Kcr: accurate detection of lysine crotonylation sites using deep learning method. Brief Bioinform. 2021;22(4):bbaa255.

  39. Lv H, Zhang Z-M, Li S-H, Tan J-X, Chen W, Lin H. Evaluation of different computational methods on 5-methylcytosine sites identification. Brief Bioinform. 2019;21:982–95.

  40. Su Z-D, et al. iLoc-lncRNA: predict the subcellular location of lncRNAs by incorporating octamer composition into general PseKNC. Bioinformatics. 2018;34(24):4196–204.

  41. Ullah M, Han K, Hadi F, Xu J, Song J, Yu D-J. PScL-HDeep: image-based prediction of protein subcellular location in human tissue using ensemble learning of handcrafted and deep learned features with two-layer feature selection. Brief Bioinform. 2021;22(6):bbab278.

  42. Mandrekar JN. Receiver operating characteristic curve in diagnostic test assessment. J Thorac Oncol. 2010;5(9):1315–6.

  43. Xie R, et al. DeepVF: a deep learning-based hybrid framework for identifying virulence factors using the stacking strategy. Brief Bioinform. 2021;22(3):bbaa125.

  44. Van Der Maaten L. Accelerating t-SNE using tree-based algorithms. J Mach Learn Res. 2014;15(1):3221–45.

  45. Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9(11):2579–605.

  46. Su R, Hu J, Zou Q, Manavalan B, Wei L. Empirical comparison and analysis of web-based cell-penetrating peptide prediction tools. Brief Bioinform. 2020;21(2):408–20.

  47. Basith S, Manavalan B, Hwan Shin T, Lee G. Machine intelligence in peptide therapeutics: a next-generation tool for rapid disease screening. Med Res Rev. 2020;40(4):1276–314.

  48. Lv H, Dao F-Y, Zulfiqar H, Lin H. DeepIPs: comprehensive assessment and computational identification of phosphorylation sites of SARS-CoV-2 infection using a deep learning-based approach. Brief Bioinform. 2021;22(6):bbab244.

  49. Charoenkwan P, Nantasenamat C, Hasan MM, Manavalan B, Shoombuatong W. BERT4Bitter: a bidirectional encoder representations from transformers (BERT)-based model for improving the prediction of bitter peptides. Bioinformatics. 2021;37(17):2556–62.

Acknowledgements

This work was supported by the College of Arts, Media and Technology, Chiang Mai University, and partially supported by Chiang Mai University and Mahidol University. It was also supported by the Information Technology Service Center (ITSC) of Chiang Mai University.

Funding

This work was financially supported by the National Research Council of Thailand and Mahidol University (N42A660380) and Specific League Funds from Mahidol University.

Author information

Authors and Affiliations

Authors

Contributions

PC: Design of this study, methodology, formal analysis, software, investigation, webserver development. NS: Drafting the article and substantively revising it. WS: Project administration, supervision, design of this study, methodology, data collection, data analysis and interpretation, drafting the article, and critical revision of the article. All authors have reviewed and approved the manuscript.

Corresponding author

Correspondence to Watshara Shoombuatong.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Table S1.

Hyperparameter search details used for the construction of nine ML-based classifiers. Table S2. Cross-validation results of 156 baseline models as developed with 13 ML algorithms and 12 feature encoding schemes. Table S3. Independent test results of 156 baseline models as developed with 13 ML algorithms and 12 feature encoding schemes. Table S4. Detailed prediction results of TAP 1.0, iTTCA-Hybrid, iTTCA-RF, PSRTTCA, and StackTTCA on case studies.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Charoenkwan, P., Schaduangrat, N. & Shoombuatong, W. StackTTCA: a stacking ensemble learning-based framework for accurate and high-throughput identification of tumor T cell antigens. BMC Bioinformatics 24, 301 (2023). https://doi.org/10.1186/s12859-023-05421-x

Keywords