CarSite-II: an integrated classification algorithm for identifying carbonylated sites based on K-means similarity-based undersampling and synthetic minority oversampling techniques

Background Carbonylation is a non-enzymatic, irreversible protein post-translational modification in which the side chains of amino acid residues are attacked by reactive oxygen species and ultimately converted into carbonyl products. Studies have shown that protein carbonylation caused by reactive oxygen species is involved in the etiology and pathophysiological processes of aging, neurodegenerative diseases, inflammation, diabetes, amyotrophic lateral sclerosis, Huntington’s disease, and tumors. Current experimental approaches used to identify carbonylation sites are expensive, time-consuming, and limited in protein processing abilities. Computational prediction of the location of carbonylated residues enhances the functional characterization of proteins.
Results In this study, an integrated classifier algorithm, CarSite-II, was developed to identify K, P, R, and T carbonylated sites. A resampling method combining K-means similarity-based undersampling and the synthetic minority oversampling technique (SMOTE-KSU) was incorporated to balance the proportions of K, P, R, and T carbonylated training samples. Next, the integrated classifier system Rotation Forest used “support vector machine” subclassifiers to divide three types of feature spaces into several subsets. CarSite-II achieved Matthews correlation coefficient (MCC) values of 0.2287/0.3125/0.2787/0.2814, false positive rates of 0.2628/0.1084/0.1383/0.1313, and false negative rates of 0.2252/0.0205/0.0976/0.0608 for K/P/R/T carbonylation sites, respectively, by tenfold cross-validation. On our independent test dataset, CarSite-II yielded MCC values of 0.6358/0.2910/0.4629/0.3685, false positive rates of 0.0165/0.0203/0.0188/0.0094, and false negative rates of 0.1026/0.1875/0.2037/0.3333 for K/P/R/T carbonylation sites. The results show that CarSite-II achieves remarkably better performance than currently available prediction tools.
Conclusion The results revealed that CarSite-II achieved better performance than the five currently available programs, and demonstrated the usefulness of the SMOTE-KSU resampling approach and the integration algorithm. For the convenience of experimental scientists, the CarSite-II web tool is available at http://47.100.136.41:8081/
Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04134-3.

Keywords: Carbonylation, Protein post-translational modification, K-means similarity-based undersampling, The integrated classifier, Rotation forest

Background
Protein carbonylation is an irreversible chemical modification that occurs under oxidative stress, in which the side chains of amino acid residues are attacked by reactive oxygen species and ultimately converted into carbonyl products [1]. Modification of a protein by carbonylation changes its structure and causes it to lose its original biological function, eventually leading to cell and tissue dysfunction and pathophysiological changes in the body. For a long time, the level of protein carbonylation was used only as an indicator of oxidative protein damage, to evaluate the degree of oxidation in biological organisms. However, studies have shown that protein carbonylation caused by reactive oxygen species is involved in the etiology and pathophysiological processes of aging, apoptosis and various neurodegenerative diseases. Under oxidative stress induced by different diseases, carbonylation is selective: some proteins are easily carbonylated, while others are not [1]. Taking the cytoskeleton as an example, glial fibrillary acidic protein (GFAP) is the protein most vulnerable to oxidative damage in multiple sclerosis [2], Pick's disease [3], and aging [4]. Its carbonylation level increased in these conditions but decreased in patients with Alzheimer's disease [4]. In addition, the carbonylation level of β-actin, another cytoskeletal molecule, increased in Alzheimer's disease [4] and multiple sclerosis [2], but decreased in aging.
Once a carbonylated protein is produced, it cannot be repaired by the body's antioxidant defense mechanisms, so it slowly accumulates over time. This changes or abolishes the functions of key enzymes in various signaling pathways and can trigger a series of diseases related to protein carbonylation: aging, neurodegenerative diseases (such as Alzheimer's disease, Parkinson's disease, and multiple sclerosis), inflammation, diabetes, and tumors (such as uterine fibroids, malignant prostate cancer, and breast cancer). Together, these findings indicate that protein carbonylation is not only a sign of the degree of cell oxidation, but is also involved in the pathophysiological process of disease.
It is necessary to develop computational methods for the prediction of carbonylation sites for the following reasons. (1) Since the carbonylation site is the decisive factor for the functional change or loss of the carbonylated protein, identifying the carbonylation site and its role in the protein is crucial for understanding the protein carbonylation process and the related pathogenesis. Current experimental approaches used to identify carbonylation sites are expensive, time-consuming, and limited in protein processing abilities, whereas computational prediction of the location of carbonylated residues enhances the functional characterization of proteins. (2) Prediction and analysis of protein carbonylation sites can give experimental researchers a pre-experimental evaluation of the occurrence probability and number of carbonylation sites on the target protein, allowing for more targeted experiments. (3) To reveal the pathophysiological processes of diseases (aging, neurodegenerative diseases, inflammation, diabetes, tumors and so on), the prediction of protein carbonylation sites is significant for understanding the underlying biological functions in depth and for developing effective drugs. Therefore, it is very important to establish an online prediction platform with a clear interface that makes carbonylation sites easy to identify.
It is worth noting that only four types of residues are particularly sensitive to carbonylation: lysine (K), proline (P), arginine (R), and threonine (T) [5]. In the past several years, a series of computational methods and tools have been proposed for identifying carbonylated proteins and sites [5][6][7][8][9][10][11][12][13]. However, the predictive performance for protein carbonylation sites is still unsatisfactory compared with that for other post-translational modification (PTM) sites in proteins. Therefore, to meet the current need for efficient high-throughput computing tools, further effort is still required to improve the predictive performance for carbonylation sites.
In the current study, K-means similarity-based undersampling (KSU) and the synthetic minority oversampling technique (SMOTE) were combined to construct balanced training datasets for K, P, R, and T carbonylation sites, respectively. SMOTE [14] was used to synthesize K, P, R, and T carbonylation sites (positive training samples) from experimentally validated positive training samples, while KSU was applied to eliminate redundant samples and samples that carry little information and have little impact on classification. For convenience, the resampling method combining KSU and SMOTE was named SMOTE-KSU. Using the positive and negative training samples constructed by SMOTE-KSU resampling, a novel computational prediction tool was developed. This tool, named CarSite-II, distinguishes carbonylation sites from non-carbonylation sites through the distance-based residue (DR) feature extraction strategy and the Rotation Forest integrated algorithm with "support vector machine" (SVM) subclassifiers. According to the results obtained by tenfold cross-validation and independent tests, CarSite-II achieves remarkably better predictive performance than existing prediction tools. Figure 1 shows the flow chart for constructing the four optimal models of CarSite-II for K/P/R/T carbonylation sites. It consists of the following four parts: (1) constructing the protein carbonylation training and testing datasets; (2) using the distance-based residue feature extraction strategy to encode K/P/R/T carbonylation samples; (3) combining the KSU undersampling method and the SMOTE oversampling technique to balance the training dataset; (4) using tenfold cross-validation to select the optimal model.

Amino acid composition of carbonylation sites
To explore the position-specific differences in amino acid residue distributions between carbonylation and non-carbonylation sites, training samples were submitted to the pLogo web server [15] (https://plogo.uconn.edu/), and the sequence logos of the four carbonylated residues are shown in Fig. 2. As the sequence logos show, arginine (R) at positions −5, −4, −3, −2, and −1 was significantly overrepresented in the R carbonylation site sequence logo, proline (P) was not significantly overrepresented in the P carbonylation site sequence logo, and threonine (T) at positions −3 and −2 was significantly overrepresented in the T carbonylation site sequence logo.

Balance the training dataset and select optimal parameters of DR and rotation forest
As described in Materials and methods, each sequence in the training dataset was encoded by DR, and SMOTE oversampling and KSU undersampling were used to resample the training dataset so that the positive and negative training samples had the same size. We calculated the number of samples ($N$) removed from the negative samples or added to the positive samples during resampling according to the formula in [16], where $k_0 = 0.5$, $k_1 = 0.5$, and $n_0$ and $n_1$ represent the number of sequences in the negative and positive training samples, respectively. Accordingly, $N$ was 13189/11128/11323/12040 for K/P/R/T carbonylation sites, respectively.
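The formula itself did not survive in this text. As a rough consistency check, the sketch below assumes the form $N = k_0 n_0 - k_1 n_1$, rounded half up; this form is an assumption that happens to reproduce the reported values, not the confirmed expression from [16]. The training set sizes are those reported in the data collection section, and the function name `resample_count` is hypothetical.

```python
import math

def resample_count(n_neg, n_pos, k0=0.5, k1=0.5):
    """Assumed form N = k0 * n_neg - k1 * n_pos, rounded half up."""
    return math.floor(k0 * n_neg - k1 * n_pos + 0.5)

# (negative, positive) training sample counts reported for each residue type
TRAINING_SIZES = {
    "K": (26995, 618),
    "P": (22418, 162),
    "R": (22849, 204),
    "T": (24271, 191),
}

counts = {res: resample_count(n0, n1) for res, (n0, n1) in TRAINING_SIZES.items()}
print(counts)
```

Under this assumption, removing $N$ negatives and adding $N$ positives leaves both classes at roughly the midpoint of the two original class sizes.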
SVM was used as the subclassifier of the Rotation Forest algorithm, and the parameters of the Rotation Forest algorithm were set as follows: the parameter $K$ ranged from 300 to 400 in steps of 10, and the number of subclassifiers was set to five. The detailed tenfold cross-validation results for the K/P/R/T carbonylation sites are listed in Additional File 1: SupTable (SubTable1.1-SubTable1.4. The predictive performance of K/P/R/T carbonylation sites by 10-fold cross validation). These tables show that the K carbonylation dataset gives the best prediction results with $d_{MAX} = 3$ and $K = 400$, while the P, R, and T carbonylation datasets give their best prediction results with $d_{MAX} = 2$, $K = 400$; $d_{MAX} = 1$, $K = 400$; and $d_{MAX} = 3$, $K = 400$, respectively. To improve the predictive performance for carbonylation sites, these parameters were used to construct the final integrated prediction models for K/P/R/T carbonylation sites. The prediction performance for K/P/R/T carbonylation sites based on the Rotation Forest integrated algorithm by tenfold cross-validation is shown in Fig. 3.

The effectiveness of resampling approach
The predictive results of the independent tests were used to assess the effectiveness of the combined SMOTE-KSU resampling method. Table 1 compares four settings: no resampling, SMOTE applied only to positive sequences, KSU applied only to negative sequences, and SMOTE-KSU resampling of the training dataset. We found that CarSite-II based on the SMOTE-KSU resampling approach reached the best performance, with MCC of 0.6358/0.2910/0.4629/0.3685 for K/P/R/T carbonylation sites, respectively. KSU undersampling achieved the second-best prediction performance, with an Sn value of 70.94% for K carbonylation sites. The Sn values obtained without resampling and with SMOTE oversampling for K/P/R/T carbonylation sites, and with KSU undersampling for P/R/T carbonylation sites, were less than 50%. The major reason for this may be the imbalance of the training dataset. The ratios between positive and negative training samples for K carbonylation sites were over 1:22 (618:13807), 1:43 (618:26995), and 1:1.9 (13807:26995) for KSU undersampling, no resampling, and SMOTE oversampling, respectively. The ratios between positive and negative training samples for P/R/T carbonylation sites were similarly skewed (i.e. the training dataset is extremely unbalanced) for KSU undersampling, no resampling, and SMOTE oversampling. Thus, we did not consider these settings further.
Fig. 3 The prediction performance for K/P/R/T carbonylation sites by tenfold cross validation. a The prediction performance for K carbonylation sites. b The prediction performance for R carbonylation sites. c The prediction performance for P carbonylation sites. d The prediction performance for T carbonylation sites
To compare performance further, the ROC curves of the different resampling methods for K/P/R/T carbonylation sites on our independent test dataset are given in Fig. 4.

Comparison with other prediction methods and discussion
To better test and verify the performance of CarSite-II, we compared it with three currently available programs on our independent test. The first predictive tool, CarSPred, is based on four types of features and the mRMR feature selection algorithm with a weighted support vector machine [7]. In 2016, Lv et al. combined three types of features and the IFS feature selection algorithm with a weighted support vector machine [7] to construct the predictive tool CarSPred.Y [9]. In our previous work, the one-sided selection undersampling algorithm was used to balance the training dataset, and a hybrid combination of four feature extraction strategies with a support vector machine was used to build the tool CarSite [13].
Fig. 4 The ROC curves comparison of different resampling methods for K/P/R/T carbonylation sites on our independent test dataset. a Comparison of different resampling methods for K carbonylation dataset. b Comparison of different resampling methods for R carbonylation dataset. c Comparison of different resampling methods for P carbonylation dataset. d Comparison of different resampling methods for T carbonylation dataset
Regarding the datasets used to build these three programs and the prediction threshold used by each method: CarSPred used 266K/119R/116T/114P human carbonylation sites and 1802K/754R/702T/716P human non-carbonylation sites to construct the tool, and used 34K/17R/5T/12P carbonylation sites and 147K/93R/30T/76P non-carbonylation sites from human and other mammals to construct the test dataset. Its decision threshold can be assigned any value from 0 to 1 and is set to 0.5 by default. CarSPred.Y used 86K/56R/44T/59P carbonylation sites and 536K/363R/271T/358P non-carbonylation sites from yeast proteins to construct the training model, and its decision threshold was the same as that of CarSPred. CarSite used the same carbonylated proteins as CarSPred, with the threshold set to 0.5. In this study, we used a threshold of 0.5 for all comparisons.
CarSite-II was compared with CarSPred.Y, CarSPred, and CarSite. The results for identifying carbonylation sites are shown in Table 2. Table 2 shows that although the Sp of CarSite-II was about 0.45% lower than that of CarSPred for K carbonylation sites, its Sn was about 85.47% higher. CarSite-II achieved the best Sn of 89.74%, 81.25%, 79.63% and 66.67% for K/P/R/T carbonylation sites, respectively, which led to improvements of 18.8%, 12.5%, 24.07% and 8.34% over the second-best method and of 58.97%, 25%, 53.7% and 33.34% over the third-best method. CarSite-II was also compared with PTMPred [6], CarSpred [7], iCar-PseCp [8] and CarSite [13] using tenfold cross-validation, according to the results listed in their works. As shown in Table 3, CarSite-II was significantly better than PTMPred, CarSpred, iCar-PseCp and CarSite.
In addition, we used the Wilcoxon signed-rank test to verify the significance of the differences between the methods in Table 1 and Table 2. The results are listed in Additional File 2: SubTable 2. The Wilcoxon signed rank of the K/P/R/T carbonylation sites. A two-sided test was used for the null hypothesis that x − y comes from a distribution with zero median, at the 5% significance level. In Additional File 2: SubTable 2, the values of H are all 1, which indicates rejection of the null hypothesis at the 5% significance level.
These results indicate that CarSite-II is a significant improvement over the currently available tools compared here.

Discussion
Protein carbonylation is a type of oxidative protein damage: an irreversible chemical modification under oxidative stress in which the side chains of amino acid residues are attacked by reactive oxygen species and ultimately converted into carbonyl products [1]. Modification of a protein by carbonylation changes its structure and causes it to lose its original biological function, eventually leading to cell and tissue dysfunction and pathophysiological changes in the body. The study by Nabeshi and colleagues showed that carbonyl modification of purified Cu, Zn-SOD was increased by reaction with H₂O₂. Therefore, progressive accumulation of oxidative damage to Cu, Zn-SOD may cause dysfunction of the defense systems against oxidative stress in SAMP8, which has a higher oxidative state, leading to accelerated aging. Furthermore, carbonyl modification of HCNP-pp may be involved in the pathophysiological alterations associated with the deterioration of learning and memory seen in the brain of SAMP8 [17].

Conclusions
In the current study, a novel resampling approach, SMOTE-KSU, was proposed to balance the sizes of the minority and majority classes. The balanced dataset based on SMOTE-KSU resampling and the optimal parameters of DR and Rotation Forest for K, P, R, and T carbonylation sites were selected according to the results of tenfold cross-validation. We then applied a majority voting strategy to develop the integrated predictor CarSite-II based on the Rotation Forest integrated algorithm. The results revealed that CarSite-II achieved better performance than the five currently available programs, and demonstrated the usefulness of the SMOTE-KSU resampling approach and the integration algorithm. Since deep learning plays an important supplementary role in sequence analysis, we may construct a deep learning prediction model to better identify carbonylation sites in future work. Our future work also aims to extend this approach to other sequence recognition problems in bioinformatics. For the convenience of experimental scientists, we have provided a web-server guide on how to use the CarSite-II web tool to obtain the desired results without needing to follow the mathematical equations, which are presented only for completeness in describing the development of CarSite-II. The detailed steps are shown in Additional file 3: SubTable 3. Web-Server Guide.

Data collection and pre-processing
The dataset used in the current study was gathered from CarbonylDB [18], which was the only existing database or resource for carbonylated proteins and sites. From CarbonylDB, we collected 685, 178, 211, and 208 experimentally verified K, P, R, and T carbonylated sites on 468 human proteins as positive samples, while the remaining 42523K, 35302P, 33050R, and 34774T non-carbonylated sites on the same 468 human proteins were regarded as negative samples to construct the training dataset. CD-HIT [19] was used to remove redundant samples. At a cut-off of 40% identity, 445 carbonylated human proteins were retained. Subsequently, at a cut-off of 70% identity, highly similar carbonylated sites among the 445 carbonylated proteins were removed. Finally, a total of 618K, 162P, 204R, and 191T carbonylated sites (the positive training samples) and 26995K, 22418P, 22849R, and 24271T non-carbonylated sites (the negative training samples) were collected. Furthermore, to avoid overestimating the predictive performance as a result of overfitting to the training dataset, and to evaluate the real predictive performance of the proposed model, an independent testing set was constructed. It was built by collecting the proteins of rats, yeast, and mice from CarbonylDB [18] (298 rat proteins, 239 yeast proteins, and 90 mouse proteins), and CD-HIT [19] was used to remove redundant proteins and samples. At a cut-off of 40% identity, 277 rat proteins, 222 yeast proteins, and 76 mouse proteins were retained. Subsequently, cd-hit-2d [19] was used to control for homology between the training and test datasets and within the test dataset. At a cut-off of 40% identity, 223 rat proteins, 209 yeast proteins, and 42 mouse proteins were retained. Then, at a cut-off of 70% identity, highly similar carbonylated sites among the retained carbonylated proteins of the three species were removed, and a total of 117K, 16P, 54R, and 24T carbonylated sites were collected. For the negative test samples, after fragments with 30% identity were filtered out, the final negative test dataset comprised 7439K, 5318P, 5966R, and 6507T non-carbonylated sites. Thus, the independent test set contained 117K, 16P, 54R, and 24T carbonylated sites and 7439K, 5318P, 5966R, and 6507T non-carbonylated sites. Table 4 shows the detailed statistics of the training dataset and the independent test dataset.

Distance-based residue features extraction strategy
DR, proposed by Liu et al. [20], was used to convert carbonylation and non-carbonylation protein sequences into valid numerical vectors in this study. Given a protein sequence $R$ with $L$ amino acid residues, i.e.
$$R = R_1 R_2 R_3 \cdots R_L,$$
where $R_i$ represents the amino acid residue at the $i$th position along the given protein sequence, the DR measure of $R$ can be defined as:
$$DR(R) = \left[T_1^0(R), \ldots, T_{20}^0(R), T_{1,1}^1(R), \ldots, T_{20,20}^1(R), \ldots, T_{1,1}^{d_{MAX}}(R), \ldots, T_{20,20}^{d_{MAX}}(R)\right],$$
where 20 indicates the 20 kinds of native amino acid residues, $i, j \in \{A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y\}$, $T_i^0(R)$ is the number of occurrences of amino acid residue $i$, and $T_{ij}^d(R)$ is the number of occurrences of the amino acid residue pair $(i, j)$ at distance $d$. $d_{MAX}$ represents the maximum distance between the amino acid residue pair $(i, j)$, and in this study, we set it to 1, 2, and 3, respectively.
To illustrate how a carbonylation or non-carbonylation protein sequence is converted into a valid numerical vector, the process of generating DR feature vectors is shown in Fig. 5.
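The DR encoding can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: "distance $d$" is taken to mean that the second residue lies exactly $d$ positions after the first, pairs are ordered, the feature layout is 20 single-residue counts followed by 400 pair counts per distance, and characters outside the 20 standard residues are skipped. The function name `dr_features` is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}

def dr_features(seq, d_max):
    """Return a count vector of length 20 + 400 * d_max for a peptide."""
    vec = [0] * (20 + 400 * d_max)
    # Single-residue occurrences (the T^0_i terms)
    for aa in seq:
        if aa in INDEX:
            vec[INDEX[aa]] += 1
    # Ordered pair occurrences at each distance d (the T^d_ij terms)
    for d in range(1, d_max + 1):
        offset = 20 + 400 * (d - 1)
        for a, b in zip(seq, seq[d:]):
            if a in INDEX and b in INDEX:
                vec[offset + 20 * INDEX[a] + INDEX[b]] += 1
    return vec

example = dr_features("AKRKA", d_max=2)
print(len(example))  # 20 + 400 * 2 features
```

With this layout the vector has 420, 820, or 1220 entries for $d_{MAX}$ = 1, 2, or 3, which shows how quickly the feature space grows with the maximum distance.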

The synthetic minority oversampling technique
The SMOTE algorithm is the most frequently used oversampling method [21][22][23]. Its primary idea is to place synthetic examples along the line segments connecting existing rare examples [14]. Briefly, given a positive training sample $X$, its $k$ nearest neighbor examples are found ($k$ is usually set to 5). If the oversampling ratio is $N$, then $N$ samples $Y_j$ $(j = 1, 2, \ldots, N)$ are selected from these $k$ nearest neighbors. A random linear interpolation between $X$ and each $Y_j$ creates a new rare sample $P_j$ according to formula (5), where rand(0, 1) represents a random number generated in the interval (0, 1). For a detailed explanation of the SMOTE algorithm, please refer to Reference [14].
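The interpolation step of formula (5) can be sketched as follows. This is an illustrative sketch, not the implementation used for CarSite-II: it assumes Euclidean distance for the nearest-neighbor search and draws neighbors with replacement, and the helper names `k_nearest` and `smote_samples` are hypothetical.

```python
import math
import random

def k_nearest(x, minority, k=5):
    """Return the k minority samples closest to x (Euclidean distance, assumed)."""
    def dist(m):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, m)))
    others = [m for m in minority if m is not x]
    return sorted(others, key=dist)[:k]

def smote_samples(x, neighbors, n_new, rng=None):
    """Create n_new synthetic points on the segments from x to its neighbors."""
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_new):
        y = rng.choice(neighbors)  # one of the k nearest neighbors, chosen at random
        gap = rng.random()         # plays the role of rand(0, 1)
        synthetic.append([xi + gap * (yi - xi) for xi, yi in zip(x, y)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [9.0, 9.0]]
neighbors = k_nearest(minority[0], minority, k=2)
new_points = smote_samples(minority[0], neighbors, n_new=3, rng=random.Random(0))
print(new_points)
```

Each synthetic point lies between an existing positive sample and one of its neighbors, so the new samples stay inside the region already occupied by the minority class.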
$$P_j = X + \mathrm{rand}(0, 1) \cdot (Y_j - X), \quad j = 1, 2, \ldots, N. \tag{5}$$
In the second step of the Rotation Forest algorithm, $F_{ij}$ is the $j$-th feature subset used to train the subclassifier $D_i$. Corresponding to each feature subset $F_{ij}$, $X_{ij}$ is the subset of samples in $X$ restricted to the features in $F_{ij}$. Bootstrap resampling is applied to $X_{ij}$: 75% of the samples are randomly and repeatedly drawn to form a new bootstrap sample set $X'_{ij}$. Principal component analysis is then performed on $X'_{ij}$, and the resulting coefficient matrix $C_{ij}$ is recorded. Note that some eigenvalues may be zero, so that $M_j \le M$. The linear transformation is applied to feature subsets rather than to the full dataset in order to avoid constructing subclassifiers with the same coefficient matrix.
In the third step, a sparse "rotation" matrix $R_i$ is constructed from the obtained coefficient matrices $C_{ij}$. Because the bootstrap process disturbs the order of the data, each column of the matrix $R_i$ needs to be reordered according to the original feature set in order to calculate the training set of the subclassifier $D_i$. The rotation matrix obtained after reordering is denoted as $R_i^{\alpha} \in \mathbb{R}^{N \times n}$. For subclassifier $D_i$, the training set after the rotation transformation is $X' = X R_i^{\alpha}$. In the fourth step, the classification phase, a new sample $x$ also undergoes the rotation transformation, giving $x' = x R_i^{\alpha}$. Let $d_{ij}(x R_i^{\alpha})$ be the probability, determined by the subclassifier $D_i$, that the sample $x$ belongs to class $j$ (1 or 2). The credibility of assigning the sample to a certain class is:
$$\mu_j(x) = \frac{1}{L} \sum_{i=1}^{L} d_{ij}(x R_i^{\alpha}), \quad j = 1, 2.$$
Sample $x$ is assigned to the category with the maximum credibility, where $L$ represents the number of subclassifiers, and 1 and 2 indicate that the sample is positive or negative, respectively.
In this study, we used SVM as the subclassifier for the Rotation Forest integrated algorithm.
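The classification phase described above, in which the class probabilities from the subclassifiers are averaged and the class with the highest credibility is chosen, can be sketched as follows. The probability values are made-up numbers standing in for the outputs of five SVM subclassifiers on one rotated sample, and the function name `ensemble_decision` is hypothetical.

```python
def ensemble_decision(probabilities):
    """probabilities[i][j]: probability from subclassifier i that x is in class j + 1."""
    n_classifiers = len(probabilities)
    n_classes = len(probabilities[0])
    # Credibility of each class: mean probability over all subclassifiers
    credibility = [
        sum(p[j] for p in probabilities) / n_classifiers for j in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda j: credibility[j])
    return best + 1, credibility  # classes are numbered 1 (positive) and 2 (negative)

# Hypothetical outputs of five subclassifiers for a single sample
outputs = [[0.70, 0.30], [0.60, 0.40], [0.40, 0.60], [0.80, 0.20], [0.55, 0.45]]
label, credibility = ensemble_decision(outputs)
print(label, credibility)
```

Averaging probabilities lets a confident subclassifier outweigh a hesitant one, which a plain count of votes would not capture.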

Construct and evaluate model
To further improve the performance of predicting carbonylation and non-carbonylation sites, the Rotation Forest integrated algorithm was used with a majority voting strategy to integrate the predictive results of the subclassifiers. The performance of CarSite-II was evaluated using the following six measurements: Sensitivity (Sn), Specificity (Sp), Accuracy (Acc), Matthews correlation coefficient (MCC), geometric mean (G-mean) and the area under the receiver operating characteristic curve (AUC), which were defined as follows:
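The defining equations are not reproduced in this text. The sketch below uses the standard confusion-matrix definitions of the first five measurements, which is an assumption about the exact forms used here; AUC is omitted because it requires ranked prediction scores. The example counts are hypothetical values chosen to be consistent with the reported independent-test rates for K sites (117 positives and 7439 negatives); they are not taken from the paper's tables.

```python
import math

def evaluate(tp, fn, tn, fp):
    """Standard Sn, Sp, Acc, MCC and G-mean from confusion-matrix counts."""
    sn = tp / (tp + fn)                        # sensitivity (recall on positives)
    sp = tn / (tn + fp)                        # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fn + tn + fp)      # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    g_mean = math.sqrt(sn * sp)                # geometric mean of Sn and Sp
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc, "G-mean": g_mean}

# Hypothetical counts consistent with the reported K-site rates (assumption)
k_site = evaluate(tp=105, fn=12, tn=7316, fp=123)
print({name: round(value, 4) for name, value in k_site.items()})
```

On a test set this imbalanced, Acc is dominated by the negatives, which is why MCC and G-mean are the more informative summaries.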