Skip to main content

m5CPred-SVM: a novel method for predicting m5C sites of RNA

Abstract

Background

As one of the most common post-transcriptional modifications (PTCM) in RNA, 5-cytosine-methylation plays important roles in many biological functions such as RNA metabolism and cell fate decision. Through accurate identification of 5-methylcytosine (m5C) sites on RNA, researchers can better understand the exact role of 5-cytosine-methylation in these biological functions. In recent years, computational methods of predicting m5C sites have attracted lots of interests because of its efficiency and low-cost. However, both the accuracy and efficiency of these methods are not satisfactory yet and need further improvement.

Results

In this work, we have developed a new computational method, m5CPred-SVM, to identify m5C sites in three species, H. sapiens, M. musculus and A. thaliana. To build this model, we first collected benchmark datasets following three recently published methods. Then, six types of sequence-based features were generated based on RNA segments and the sequential forward feature selection strategy was used to obtain the optimal feature subset. After that, the performance of models based on different learning algorithms were compared, and the model based on the support vector machine provided the highest prediction accuracy. Finally, our proposed method, m5CPred-SVM was compared with several existing methods, and the result showed that m5CPred-SVM offered substantially higher prediction accuracy than previously published methods. It is expected that our method, m5CPred-SVM, can become a useful tool for accurate identification of m5C sites.

Conclusion

In this study, by introducing position-specific propensity related features, we built a new model, m5CPred-SVM, to predict RNA m5C sites of three different species. The result shows that our model outperformed the existing state-of-art models. Our model is available for users through a web server at https://zhulab.ahu.edu.cn/m5CPred-SVM.

Background

Over 170 chemical modifications have been discovered in both coding and non-coding RNAs to date [1,2,3]. 5-cytosine-methylation is one of the most common post-transcriptional modifications (PTCM) and has been found in almost all types of RNA [4, 5]. This modification can regulate nuclear mRNA output and RNA variable splicing, increase RNA stability, regulate protein translation and RNA–protein interaction, and maintain the normal structure of RNA [6,7,8,9,10,11,12,13,14]. Under the catalysis of RNA methyltransferase, methylation occurs on the carbon atom in the fifth position of a cytosine to generate 5-methylcytosine (m5C). Therefore, accurate identification of m5C sites in RNA is of great importance for understanding the mechanism and function of this modification.

Both experimental and computational methods have been developed to determine and predict m5C sites in RNA. Experimental methods such as bisulfite sequencing [5, 12], m5C-RIP [15], Aza-IP [16], mi-CLIP [17] and RBS-seq [18] have been somewhat successful in identification of m5C sites in RNAs of different species. However, these experimental methods are time-consuming and expensive, and they are not able to keep pace with the explosive increase of RNA sequences revealed by the rapid development of sequencing technology. Instead, computational methods can be able to provide a faster and more cost-effective way for m5C site identification.

So far, eight computational methods for predicting m5C site have been reported, which were summarized in Table 1 according to the datasets, algorithms, webservers, evaluation strategies and features employed. Feng et al. [19] built their model using a support vector machine based on PseKNC features extracted from RNA segments, and a balanced dataset with 120 m5C sites and 120 non-m5C sites was used to build this model. In addition, nine other datasets with 120 non-m5C sites were randomly selected to demonstrate their model is not sensitive to the selection of non-m5C sites. Later, Qiu et al. [20] have developed a model called iRNAm5C-PseDNC to predict m5C sites by using random forests. Differently, this model was built on an imbalanced and redundant dataset with 475 m5C sites and 1425 non-m5C sites. Then, by using ensemble learning methods, Zhang et al. [21] have developed a model called m5C-HPCR. A new heuristic algorithm was introduced to reduce the number of physical and chemical properties of nucleotides. The m5C-HPCR was benchmarked on both Feng et al.’s dataset and Qiu et al.’s dataset. Sabooh et al. [22] have developed a model by fusing composite encoding features including Di-Nucleotide Composition (DNC), Tri-Nucleotide Composition (TNC) and Tetra- Nucleotide Composition (TetraNC). The same dataset as that of Feng et al. [19] and Zhang et al. [21] was again used to build this model by using SVM. Recently, Fang et al. [23] compared the balanced dataset used in Feng et al.’s work and the imbalanced dataset used in Qiu et al.’s work, and developed a model named RNAm5CPred to predict m5C sites of H. sapiens. The model was built by SVM and an independent test set was used to evaluate different methods. A new predictor (PEA-m5C) developed by Song et al. [24] mainly focused on predicting m5C sites in A. thaliana. The model was built by using random forests on an imbalanced dataset but was tested on three balanced independent datasets. Li et al. [25] had collected data from GEO database and developed a web server RNAm5Cfinder based on random forest algorithm, which can be used to predict m5C sites in eight kinds of cells or tissues of mouse and human. All the m5C sites recorded in three GEO records and all other non-m5C sites in the genomes were collected to train their models, however, the redundancy of the datasets was not well dealt with. More recently, Lv et al. [26] developed a server called iRNA-m5C to predict m5C sites of four types of species. Their models are built with random forests with features of PseKNC, MNBE (mono-nucleotide binary encoding), KNFC(K-tuple nucleotide frequency component) and NV (natural vector).

Table 1 Summarization of the existing methods for predicting m5C sites of RNA

Although these reported methods performed well in the recognition of m5C sites in animal and plant RNA sequences, it is possible that the performance can be improved by introducing position specific related features such as position specific nucleotide propensity (PSNP), position specific dinucleotide propensity (PSDP). The effectiveness of these features has been proved in previous works [27, 28] for predicting m6A of RNA, however, the use of these features to predict m5C sites has not been explored in these methods mentioned above. It is expected that the performance of computational methods can be further improved by mining position specific related features and composition related features.

In this study, we have developed a new method, m5CPred-SVM, to predict m5C sites in RNA sequences of three different species, H. sapiens, M. musculus and A. thaliana. First, we generated six kinds of features based on RNA sequences, namely k-nucleotide frequency (KNF), pseudo dinucleotide composition (pseDNC), K-spaced nucleotide pair frequency (KSNPF), position-specific nucleotide propensity (PSNP), K-spaced position-specific dinucleotide propensity (KSPSDP) and Chemical Property with Density (CPD). Then, the sequential forward feature selection strategy was used to select a compact feature subset from these six kinds of features. Based on this selected feature subset, our method was built using a support vector machine (SVM). At last, the performance of our method was compared with several existing methods. The results showed that our method can offer substantially better performance than these existing methods on the independent test sets.

Results

Performance of each type of feature

By using SVM over the ten folds cross-validation, we have evaluated the performances of the six types of extracted features for the three species, namely H. sapiens, M. musculus and A. thaliana. As shown in Table 2, PSNP, KSPSDP, CPD are the three features showing the best performances among the six types of features for H. sapiens. The cross-validation AUROCs for these three features are 0.879, 0.862 and 0.850, respectively. Table 3 shows that CPD, KSPSDP and PSNP are the three features providing the best performances for M. musculus. The cross-validation AUROCs are 0.812, 0.803 and 0.794 for these three features, respectively. As for A. thaliana, Table 4 shows that the top three models with the best performances were based on PseDNC, 4NF and CPD, and the corresponding cross-validation AUROCs are 0.760, 0.759 and 0.753. The ROC curves of the six types of features for H. sapiens, M. musculus and A. thaliana are shown in Fig. 1.

Table 2 The results of feature selection for H. sapiens
Table 3 The results of feature selection for M. musculus
Table 4 The results of feature selection for A. thaliana
Fig. 1
figure1

The ROC curves that show the performances of the six type of features for H. sapiens, M. musculus and A. thaliana, respectively. a H. sapiens; b M. musculus; c A. thaliana

Feature subsets selected by SFS

Considering the fact that different features may be complementary, combination of the six generated features may improve the predictive performance. However, there are also redundancy between these features, and a high dimensional input feature can make the model training very time-consuming and easily over-fitting. In order to solve the problem, we have used the sequential forward feature selection (SFS) strategy to select a compact feature subset from these features to build our final models.

As shown in Table 2, the cross validation accuracy was convergent at the fourth round in the SFS process for training the model of H. sapiens. The highest AUROC is 0.899, and the corresponding feature subset includes PSNP, 4NF, 5SNPF and PseDNC.

For M. musculus, the cross validation accuracy was convergent at the third round in the SFS process (Table 3). The highest AUROC is 0.822 and the corresponding feature subset includes CPD, 4NF and 1SPSDP.

As for A. thaliana, Table 4 shows that the cross validation accuracy is convergent at the fifth round in the SFS process. The highest AUROC is 0.782 and the corresponding feature subset includes PseDNC, 3SPSDP, 4NF, 1SNPF and PSNP.

Model sensitivity to the selection of negative samples

To evaluate if the selection of the negative samples affects the predictive performances of the models, we built other nine models based on positive samples and other nine negative subsets for both H. sapiens and M. musculus with the optimal feature subsets. Additional file 1: Tables S1 and S2 show cross-validation performances of the 10 models built on the positive samples and the 10 negative subsets for H. sapiens and M. musculus, respectively. The means and the standard errors of the ROC AUCs of the ten models are 0.843 and 0.029, and 0.822 and 0.003, for H. sapiens and M. musculus, respectively. Additional file 1: Tables S3 and S4 show the performances of the ten models on the independent test sets of H. sapiens and M. musculus, respectively. The means and the standard errors of the ROC AUCs of the ten models are 0.834 and 0.024, and 0.776 and 0.007, for H. sapiens and M. musculus, respectively. The results indicate the performance of the models is affected a little for H. sapiens by the selection of negative samples, however, the performance is barely affected for M. musculus by the selection of negative samples. The main possible reason is that the dataset for H. sapiens is smaller compared with that of M. musculus. The distribution is easily fluctuated for small samples.

Comparison with other classifiers

Studies above has showed that support vector machine performed well in predicting m5C sites for different species. In order to further investigate and compare the performance of other classifiers, we used other five classifiers, namely KNN [29], Adaboost [30], random forests [31], decision tree [32], logistic regression [33] and XGBoost [34] to build models based on the selected feature subsets for all the three species. The hyper parameters for KNN, Adaboost, random forests and XGboost were also optimized with grid search. The \(k\) of KNN is set from 1 to 10 with a step 1. The ntree of 10 to 1000 with a step 20 is set for both Adaboost and random forests. The learning rate, max depth and nrounds of XGboost are set between 2−4 and 2−1, 2 and 10, and 23 and 210, respectively. Table 5 shows the cross validation results of the six classifiers. For H. sapiens, the AUCROC value of SVM is 0.899, which is higher than those of XGBoost, RF, KNN, AdaBoost, decision tree and LR at 0.020, 0.050, 0.049, 0.039, 0.221 and 0.282, respectively. Moreover, the SVM model achieved the highest values in all other metrics. For M. musculus, the AUCROC value of SVM is 0.822, which is again higher than those of RF, KNN, AdaBoost, decision tree and LR at 0.008, 0.093, 0.010, 0.207 and 0.011, respectively, but a little bit less than XGBoost (0.823). For A. thaliana, SVM again gave the highest AUC value at 0.782, which is higher than those of XGBoost, RF, KNN, AdaBoost, decision tree and LR for 0.012, 0.004, 0.048, 0.026, 0.195 and 0.052, respectively. For other metrics, KNN has the highest Sp value and RF has the highest Pre value, while the Sn, Acc, MCC and F1 score value of SVM offered the highest values for all the remaining 4 metrics. The fact that SVM has outperformed all other six classifiers for both H. sapiens and A. thaliana and is comparable to XGBoost for M. musculus further confirms that SVM is a stable and robust classifier. As a result, SVM was selected as the final classifier in this study.

Table 5 Comparison of different classifiers based on the cross-validation results on the training datasets for the three species

Comparison with other existing methods

In this study, we have also compared our methods with some other existing m5C site prediction methods [19,20,21,22,23,24,25,26]. Because different benchmark datasets have been used for building different methods, independent test sets were used to ensure the objectiveness of the comparison. These independent test sets were only used for comparison and not for building our models. At present, four methods are available to identify m5C sites of H. sapiens, namely RNAm5Cfinder [25], iRNA-m5C [26], iRNAm5C-PseDNC [20] and RNAm5CPred [23]. Two methods are available for predicting m5C sites of M. musculus, namely RNAm5Cfinder [25] and iRNA-m5C [26]. Two methods are available for detecting m5C sites of A. thaliana, namely PEA-m5C [24] and iRNA-m5C. Table 6 shows the predictive results of these methods on the independent test sets for the three species, and Fig. 2 shows the relevant ROC curves and PRC curves. For H. sapiens, iRNAm5C-PseDNC has the highest Sp value (0.971), while our method gives significantly higher values for Sn, Pre, Acc, MCC, F1 score and AUROC when compared with other methods. For M. musculus, other than Sp and Pre, again our method has the highest values for all the remaining metrics (Sn, Pre, Acc, MCC, F1 score and AUROC). For A. thaliana, our method gives the highest value for all the metrics. All these results have indicated that our methods performed better than other existing methods in predicting m5C sites.

Table 6 Comparison with existing methods on the independent test sets
Fig. 2
figure2

ROC curves and PRC curves for our model and other models on the independent test sets. Upper panel: H. sapiens; Lower panel: M. musculus

Cross-species verification

In this study, models were built for H. sapiens, M. musculus and A. thaliana individually. It will be of great interest to evaluate the species-specificity and transferability of these models using the cross-species verification. To achieve that, the three models built on the three species-specific m5C training data sets were further tested on the three independent test datasets. Figure 3 shows the test results. Firstly, all the three models performed well on its own independent test sets (see the diagonal of Fig. 3). Secondly, the models of H. sapiens and M. musculus both performed poorly on the independent test set of A. thaliana, and vice versa. One possible reason to explain this result is because H. sapiens and M. musculus are both mammals while A. thaliana is plant. This reason is supported by Fig. 4, which shows that the nucleotide distribution in the sequence of A. thaliana is different from that of H. sapiens and M. musculus. Thirdly, the M. musculus-specific model performed well on the H. sapiens-specific independent test set, however, the H. sapiens-specific model performed poorly on the M. musculus-specific independent test set. It might be because the H. sapiens-specific datasets have smaller sizes than the M. musculus-specific datasets, and the smaller datasets sizes limit the variety of sequences which confines the transferability of the H. sapiens-specific model.

Fig. 3
figure3

The heat map showing the cross species prediction accuracies. Once a species-specific model was established on its own training dataset in rows, it was validated on the data from the same species as well as the independent data from the other two species in columns

Fig. 4
figure4

The nucleotide distribution around m5C and non-m5C sites. The top panel of the x-axis is for m5C site containing sequence, while the down panel of the x-axis is for non-m5C site containing sequences. a H. sapiens; b M. musculus; c A. thaliana

Web implementation

For the convenience of researchers, a user-friendly and publicly accessible web server was built to implement our method, which is available at https://zhulab.ahu.edu.cn/m5CPred-SVM. Users can predict the m5C sites on this server without complicated calculation. The detailed procedure to use the web server is as below:

To start with, users need to choose one from the three species, H. sapiens, M. musculus and A. thaliana. After that, users can type the query RNA sequences into the input box or upload a FASTA format file (Note that the input sequence should be in FASTA format and the length of each query sequence should be longer than 41 bp). Then, by clicking the 'submit' button, the system will do the calculation and give the final result. In the backend, the server would find the cytosine in the query sequence. All cytosine-centric RNA fragment would be extracted with flank size equals to 20 and the missing nucleotides would be filled by ‘N’. There might be lots of cytosines in a sequence, and our predictive model will reconstruct the sequence separately for each of them. The server home page also contains our contact information for users to contact us in case they have problems with the server or have suggestions.

Discussion

Our study shows that the position specific related features can be effective features for discriminating m5C sites from non-m5C sites. Theoretically speaking, the difference of nucleotide distribution between RNA sequences containing m5C sites and those without m5C sites determines how well we can discriminate them. In other words, the nucleotide distribution around the m5C site may have a certain preference. In order to investigate the nucleotide distribution preference for each sequence position, we adopted Two Sample Logo tool [35] to conduct visualization of the nucleotide site preference around m5C and non-m5C sites in the three species. Figure 4 clearly shows that significant difference does exist in nucleotide distribution around the m5C sites and the non-m5C sites for these three species, and the difference was found to descend in the sequence of H. sapiens, M. musculus and A. thaliana according to the depleted ratio (see Y axis of Fig. 4). It is shown that the depleted ratio of H. sapiens is from −28.6 to 28.6% and the depleted ratio of M. musculus is from −24.2 to 24.2%, which means the differences of nucleotide position preferences between positive and negative samples of the two species are significantly different. However, the corresponding depleted ratio of A. thaliana is from −11.9 to 11.9%. This is in line with our results that the six types of features for H. sapiens performed better than those for M. musculus, and the features for A. thaliana performed worst. The sequence differences observed here may account for the performance difference of the six types features observed before for the three species.

In addition, this figure can also explain why PSNP, KSPSDP performed well among the six types of features for both H. sapiens and M. musculus, while PseDNC and 4NF achieved best accuracy for A. thaliana. Among these six types of features, PSNP and KSPSDP are the two features that consider position preference information. As mentioned before, both H. sapiens and M. musculus have high position preferences of nucleotide in RNA sequences, thus it is not surprising that PSNP and KSPSDP performed best for these two species. On the contrary, the position preferences of RNA sequences of A. thaliana are not as significant as those of H. sapiens and M. musculus, so the two features, PSNP and KSPSDP, did not performed as well as they did for H. sapiens and M. musculus.

KNF, KSNPF and pseDNC are three features related to nucleotide composition of RNA segments. KNF can describe the local sequence-order information of nucleotide sequences. The idea of KSNPF is to calculate the frequency of sixteen pairs of nucleotides spaced by K-length polynucleotides. With increasing of K, KSNPF feature takes the position correlation information into account within the nucleotide sequence. PseDNC feature contains both local and global sequence-order information. The performances of these three features are determined by the composition difference between positive and negative samples.

The CPD feature contains nucleotide information at each position of the RNA segments and it also contains the nucleotide composition information along the RNA sequences, so that it performs well for all these three species.

According to our results, models based on these selected feature subsets selected by SFS had made improvements of about two percents in performance when compared with models based on single feature. As all of these selected feature subsets are combinations of the position specific features and the composition features, the improvements observed here can further confirm the complementarity between these two groups of features. It should be noted that we tried a large number of other types of features generated by iLearn [36] or Pse-in-One [37] toolkits when we designed the input features (data not shown). The sequence-based features generated by these two toolkits have been used widely for predicting both RNA post-transcriptional modification sites [38,39,40] and post-translational modification sites [41, 42]. Our experimental results demonstrated that our proposed feature combination in this study yielded satisfactory performance, which cannot be significantly improved when they were combined with other features.

We summarized the possible reasons for our method to outperform other existing methods. For the benchmark datasets, we used larger training sets for H. sapiens and M. musculus than iRNA-m5C which is the latest model for multiple species. Large datasets are helpful for improvement of the generalization of models. In addition, we added two types of position specific propensity features, PSNP and KSPSDP. Our results (Tables 2, 3 and 4) demonstrate PSNP and KSPSDP have played key roles in improving method performance.

Conclusion

In this study, a new computational method, m5CPred-SVM, was developed for predicting m5C sites in RNA sequences. Non-redundant large benchmark datasets were collected for three species, namely H. sapiens, M. musculus and A. thaliana. A total of six types of features, including features related to composition, features related to position specific and features related to physicochemical properties were used in building our models. Results have showed that the features related to position specific are effective in differentiating m5C sites from non-m5C sites for H. sapiens and M. musculus. Nucleotide distribution analysis reveals that nucleotide position preferences are significant for both H. sapiens and M. musculus, which account for the effectiveness of the features related to position specific propensity. For the same reason, the features related to position specific propensity are not that effective for A. thaliana because the nucleotide position preferences are less significant compared with that for the other two species. Optimal feature subsets were selected from these six types of features using the sequential forward feature selection strategy. All the three subsets consisted of feature related position specific propensity and feature related to nucleotide composition which indicate the complementarity between the features. The performance of our method m5CPred-SVM was objectively compared with other existing methods by using independent test sets. The results showed that our method can offer significantly better performances than all the other existing methods. Finally, a web server was built at https://zhulab.ahu.edu.cn/m5CPred-SVM to facilitate the access to our method by academic users to predict the m5C sites in RNA sequences.

Methods

Benchmark datasets

High quality benchmark datasets are extremely important for training and evaluating machine learning models. In this study, m5C data of three species have been collected from recently published literature. For A. thaliana, same datasets constructed by Lv et al. [26] were used for fair comparison. The positive RNA segments which contain m5C site in the center were collected from NCBI Gene Expression Omnibus (GEO) database with the accession number GSE94065 [43]. This dataset contains 6289 positive samples and 6289 negative samples.

The positive samples in the datasets of M. musculus and H. sapiens were obtained from the works of Yang et al. [44] and Vahid Khoddami et al. [18], respectively. For H. sapiens, we collected the data from the work of Vahid Khoddami et al. [18]. The file “GSE90963_Table_S1-m5C_candidate_sites.xlsx” was downloaded from GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE90963), which recorded both m5C sites information by their RBS-seq work and the m5C sites information in other public datasets. Firstly, we collected the sites with high-threshold in their RBS-seq work. Secondly, we collected the sites both reported in their RBS-seq work and the public datasets. Totally, 408 m5C sites were collected for H. sapiens. For M. musculus, we collected the data from Additional file 1: Table S3 of Yang et al.’s work [44]. The m5C sites detected in six different tissues are all considered as positive examples, thus we obtained 13042 RNA segments centered with m5C. In order to avoid bias of the datasets, similar sequences in the datasets were removed using the CD-HIT program [45] with the sequence identity threshold set at 70%, through which we have obtained 5563 and 269 positive samples for M. musculus and H. sapiens, respectively. In machine learning, the model performance may be degraded and the prediction results may be out of balance due to the inconsistency of the amount of data between the positive sample and the negative sample [46, 47]. Therefore, we have randomly selected the same number of negative samples as that of positive samples for the establishment of the benchmark dataset. It is worth noting that the redundancy of the negative examples was also removed using CD-HIT with the sequence identity threshold set at 70%. To verify if the model is sensitive to the selection of negative samples, we conducted the same procedure to generate other nine negative subsets for both H. sapiens and M. musculus. We have not done the same thing for A. thaliana because we did not know the details about the generation of negative samples of A. thaliana which were obtained from Lv et al.’s work.

The benchmark dataset is usually divided into two parts. One is the training dataset and the other is the independent test set. The training dataset is used for model construction, cross-validation and determination of hyper-parameters of the learning algorithms. The independent test dataset is used to test the performance and generalization ability of the model. In this study, 69 positive samples and 69 negative samples were randomly selected as the independent test dataset and the remaining 200 positive samples and 200 negative samples were used as the training dataset for H. sapiens. For M. musculus and A. thaliana, 1000 positive samples and 1000 negative samples were randomly selected as the independent test datasets, and the remaining samples (4563 positive samples and 4563 negative samples for M. musculus, 5298 positive samples and 5298 negative samples for A. thaliana) were used as the training datasets.

The fragment of each RNA in the datasets is represented as:

$$R_{\lambda } \left( C \right) = N_{ - \lambda } N_{{ - \left( {\lambda - 1} \right)}} \ldots N_{ - 1} CN_{1} \ldots N_{{ + \left( {\lambda - 1} \right)}} N_{\lambda }$$
(1)

where \(N_{ - \lambda }\) represents the upstream nucleotide of central cytosine and \(N_{\lambda }\). rresents the downstream nucleotide of central cytosine. In most previous works [19,20,21,22, 26], the length of the input RNA segments was set to 41 and the m5C site is located in the central position 21. In this study, we have also extracted features from the 41 bp long RNA segments.

The details of the training dataset and the testing dataset are shown in Table 7.

Table 7 The information of the datasets

Feature extraction

K-nucleotide frequency (KNF)

As a classic sequence coding feature, K-nucleotide frequency (KNF, also called NC (Nucleotide composition)) has been widely used to build bioinformatics models [48,49,50]. Suppose we have an RNA segment R of length L:

$$R = n_{1} n_{2} n_{3} \ldots n_{i} \ldots n_{L - 1} n_{L}$$
(2)

\(n_{i}\) indicates the ith nucleotide of R, and it can be any one of the four nucleotide bases in RNA, i.e.\(n_{i}\) \(\in\){A, C, G, U}. For a given K value, KNF represents the frequency of occurrence of each K-mer nucleotide component in the nucleotide sequence. It can be calculated by the formula (3).

$$f\left( {n_{1} n_{2} \ldots n_{k} } \right) = \frac{{N\left( {n_{1} n_{2} \ldots n_{k} } \right)}}{L - K + 1}$$
(3)

where \(n_{1} n_{2} \ldots n_{K}\) indicates a K-mer nucleotide component. It is not difficult to find that the K-mer nucleotide composition of an RNA sequence is a \(4^{K}\)-dimensional vector consisting of frequency of each K-mer type. As the value of K increases, the dimension of the feature vector increases exponentially. For example, when K = 1, four types of single nucleotide frequencies can be obtained. We chose the K value of 4 to calculate the frequency at which 4 nucleotides appears (4NF) according to a previous work [23]. The RNA fragment can be encoded as:

$$R\left( {4NF} \right) = \left[ {f_{AAAA} f_{AAAC} \ldots f_{GCUU} \ldots f_{UUUG} f_{UUUU} } \right]$$
(4)

K-spaced nucleotide pair frequency (KSNPF)

K-spaced nucleotide pair frequency is another method for encoding RNA sequences [51]. This method mainly calculates the frequency of 16 pairs of nucleotides separated by k-length polynucleotides. We use \(n_{1}\) × {K}\(n_{2}\) to represent K-spaced nucleotide pairs. Since \(n_{1}\) and \({ }n_{2} { }\) have four possible values, so there are sixteen (\(4^{2}\) = 16) possible combinations. For example: AxxC is a two spacer nucleotide pair. The calculation formula of KSNPF is

$$f\left( {n_{1} {\text{x}}\left\{ {\text{K}} \right\}n_{2} } \right) = \frac{{N\left( {n_{1} {\text{x}}\left\{ {\text{K}} \right\}n_{2} } \right)}}{L - K + 1}$$
(5)

In this work, we tried different K values in order to determine the best KSNPF features for different species. The selection of K for different species can be found in Additional file 1: Table S5.

Position-specific nucleotide propensity (PSNP)

In several previous works [27, 28, 51], position-specific nucleotide propensity has been used to predict the post-transcriptional modification of RNA. This feature is obtained by calculating the difference in nucleotide frequencies at specific positions between positive and negative RNA fragments. It was first introduced in Li et al.’s work [28]. According to Eq. (1), the RNA fragment can be re-expressed as:

$$R_{\lambda } = N_{1} N_{2} \ldots N_{i} \ldots N_{2\lambda + 1}$$
(6)

First, we calculated the frequency of the four nucleotides at the i-th position in the positive sample and the negative sample, respectively. After that, the 4-dimensional positive vectors and the 4-dimensional negative vectors were combined individually to obtain two \(4 \times \left( {2\lambda + 1} \right)\) position-specific occurrence frequency matrices for positive and negative samples, respectively. The two matrices were named as \(M^{ + }\) and \(M^{ - }\),\(M^{ + }\) is for positive samples and \(M^{ - }\) is for negative samples. Through \(M^{ + }\) and \(M^{ - }\),we defined the position-specific nucleotide propensity matrix, denoted as \(X_{PSNP}\), as below:

$$X_{PSNP} = M_{ + } - M_{ - }$$
(7)

K-spaced position-specific dinucleotide propensity (KSPSDP)

Position-specific dinucleotide propensity is defined using the similar procedure to define PSNP. To calculate this feature, we rewrite Eq. (6) as a dinucleotide:

$$R_{\lambda } = D_{1} D_{2} \ldots D_{i} \ldots D_{2\lambda }$$
(8)

where Di represents the dinucleotide at the i-th position of RNA and has 16 types of values. By using the similar way for calculating the PSNP feature, we can get the (\(16 \times 2\lambda\)) position-specific dinucleotide propensity (PSDP) matrix.

To calculate K-spaced position-specific dinucleotide propensity, \(n_{1}\) × {K}\(n_{2}\) was used to represent K-spaced nucleotide pairs. PSDP is a specific case for KSPSDP when K equals 0. In this work, we tried different K values to determine the best KSPSDP features for different species. The selection of K values for different species can be found in Additional file 1: Table S6.

Pseudo dinucleotide composition (PseDNC)

The pseudo K-tuple nucleotide composition (PseKNC) has been used to represent an RNA sequence with a discrete model or vector which can keep considerable sequence order information, especially the global or long-range sequence order information [20, 26, 52, 53]. In this study, we used PseDNC (K = 2 for PseKNC) to encode the RNA segments. Three physicochemical properties, free energy, hydrophilicity and stacking energy were used to generate features of PseDNC. The values of these three physicochemical properties of 16 dinucleotides are shown in Table 8.

Table 8 Three types of physicochemical properties of dinucleotides in RNA

Chemical property with density (CPD)

The four types of nucleotides in RNA (A (adenine), U (uracil), G(guanine) and C(cytosine)) can be divided into three categories according to their chemical structures and internal binding characteristics [54]. Considering the ring structure of the nucleotide, C and U are pyrimidines with one ring, while A and G are purines with two rings. As for the secondary structure, the hydrogen bonds of A and U are weak, while the hydrogen bonds of G and C are strong. In terms of chemical functionality, U and G are classified as keto groups, while A and C are in amino groups. These three aspects of chemical properties can be represented as a three-dimensional vector (x, y, z), where x, y, z represent the ring structure, the hydrogen bond, and the chemical functionality of the nucleotides respectively. In this way, each nucleotide \(n_{i} = \left( {x_{i} ,y_{i} ,z_{i} } \right)\) in an RNA sequence can be encoded as:

$$x_{i} = \left\{ {\begin{array}{*{20}c} {1 if n_{i} \varepsilon \left\{ {A,} \right.\left. G \right\}} \\ {0 if n_{i} \varepsilon \left\{ {U,} \right.\left. C \right\}} \\ \end{array} } \right., y_{i} = \left\{ {\begin{array}{*{20}c} {1 if n_{i} \varepsilon \left\{ {A,} \right.\left. C \right\}} \\ {0 if n_{i} \varepsilon \left\{ {U,} \right.\left. G \right\}} \\ \end{array} } \right., z_{i} = \left\{ {\begin{array}{*{20}c} {1 if n_{i} \varepsilon \left\{ {A,} \right.\left. U \right\}} \\ {0 if n_{i} \varepsilon \left\{ {C,} \right.\left. G \right\}} \\ \end{array} } \right.$$
(9)

Thus, the four types of nucleotide, A, U, G and C, can be encoded as (1,1,1), (0,0,1), (0,1,0), (1,0,0), respectively.

In order to better represent the distribution of each nucleotide in the RNA sequence, the density of a nucleotide, which describes the frequency of the nucleotide occurring before current position, is denoted as:

$$d_{i} = \frac{1}{{\left| {N_{i} } \right|}}\mathop \sum \limits_{j = 1}^{i} f\left( {n_{j} } \right), f\left( {n_{j} } \right) = \left\{ {\begin{array}{*{20}c} {1 \;if\; n_{j} = p} \\ {0 \;if\; n_{i} \ne p} \\ \end{array} } \right.$$
(10)

where di is the density of nucleotide, i is the current position of RNA sequence, \(\left| {N_{i} } \right|\) is the length of the ith prefix string \(\left\{ {n_{1} ,n_{2} , \ldots ,n_{i} } \right\}\) in the sequence, and \(p\) is the symbol of {A, U, G, C}.

By integrating the nucleotide chemical property and the distribution of each nucleotide in the RNA sequence, a (\(4 \times \xi\))-dimensional CPD feature vector can be generated, where \(\xi\) is the length of the RNA segment.

Support vector machine

Support vector machine (SVM) is a popular statistical learning method and has been extensively used to build bioinformatics models [23, 50, 55,56,57,58] because of its high efficiency and robust output. In this study, we used the MATLAB function FITCSVM to build our models. SVM uses kernel functions to project low-dimensional data into high-dimensional space. A few different kernel functions can be used in training. In this work, the radial basis kernel function was selected with two hyper parameters (box constraint and kernel scale) to be used with FITCSVM function. The two parameters were optimized by a grid search with box constraint from \(2^{ - 5}\) to \(2^{15}\) and kernel scale from \(2^{ - 10}\) to \(2^{6}\).

Evaluation criteria

Ten-fold cross-validation was used to evaluate the generalization performance based on the training dataset. For the ten-fold cross-validation, the training dataset was divided into ten roughly equal-sized subsets with a stratified sampling, and then one subset was used as a validation set whereas the remaining nine subsets were combined for training. This process was repeated ten times with ten models built and validated. Finally, the average performance was obtained. In this study, the ten-fold cross-validation was used for feature selection, parameter optimization and classifier comparison.

Different metrics were used to assess the model performance, namely accuracy (Acc), sensitivity (Sen), specificity (Spe), precision (Pre), Matthews correlation coefficient (Mcc) and F1-score. The specific formulas are as below:

$$\left\{ {\begin{array}{*{20}c} {{\text{Sen}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}} \\ {{\text{Spe}} = \frac{{{\text{TN}}}}{{{\text{TN}} + {\text{FP}}}}} \\ {{\text{Pre}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}}} \\ {{\text{Acc}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FN}} + {\text{FP}} + {\text{FN}}}}} \\ {{\text{Mcc}} = \frac{{{\text{TP }} \times {\text{*TN}} - {\text{FP* FN}}}}{{\sqrt {\left( {{\text{TP}} + {\text{FP}}} \right)\left( {{\text{TP}} + {\text{FN}}} \right)\left( {{\text{TN}} + {\text{FP}}} \right)\left( {{\text{TN}} + {\text{FN}}} \right)} }}} \\ {{\text{F}}1 = \frac{2*TP}{{\left( {2*TP + FP + FN} \right)}}} \\ \end{array} } \right.$$
(11)

where, TP, TN, FP and FN represent the number of true-positive (m5C sites that were predicted as m5C sites), true-negative (non-m5C sites that were predicted as non-m5C sites), false-positive (non-m5C sites that were predicted as m5C sites) and false-negative (m5C sites that were predicted as non-m5C sites) samples, respectively.

In addition, we draw the receiver operating characteristic curve (ROC curve) [59] and precision recall curve (PRC curve) [60], to evaluate the performances of different models. ROC curve demonstrates the relationship between sensitivity and 1-specificity at different thresholds, and PRC curve reflects the trend of precision changing with recall. These two curves can be used to evaluate the predictive capability of the proposed method across entire range of decision values. The areas under these two curves (AUROC and AUPRC) were also calculated to quantify the model performance. AUROC and AUPRC have value ranging from 0 to 1. The closer the value approximate 1, the better the model performance is.

Feature selection

There are three major methods for feature selection: Filter, Wrapper and Embedded. We have chosen the sequence forward selection algorithm (SFS) under Wrapper as the feature selection algorithm in this study. Six types of features are generated and constitute the high-dimensional feature vector of each sample. The following specific operations of SFS were used to achieve a compact and efficient feature subset: in the first round, the ten-fold cross-validation results were obtained for models built on each of the six types of features. The best performing feature type was selected according to the AUROC value and then proceeded to the next round of calculation. In the second round, the remaining five types of features were added to the best performing feature type selected in the first round. Similarly, the best performing feature combination was again selected according to the AUROC value and proceeded to the next round of calculation. This process continued until AUROC converged. The subset with the highest AUROC value was considered as the optimal feature subset.

The entire procedure of m5CPred-SVM is illustrated in Fig. 5.

Fig. 5
figure5

The flowchart of m5CPred-SVM

Availability of data and materials

The webserver is at https://zhulab.ahu.edu.cn/m5CPred-SVM/. The data sets used in this study are also available on the website. All other data generated or analyzed during this study are included in this published article or the Additional files.

Abbreviations

SVM:

Support vector machine

Sn:

Sensitivity

Sp:

Specificity

Acc:

Accuracy

Pre:

Precision

Mcc:

Matthew correlation coefficient

AUC:

Area under the curve

ROC:

Receiver operating characteristic

SFS:

Sequential forward feature selection

References

  1. 1.

    Boccaletto P, Machnicka MA, Purta E, Piatkowski P, Baginski B, Wirecki TK, de Crecy-Lagard V, Ross R, Limbach PA, Kotter A et al. MODOMICS: a database of RNA modification pathways. 2017 update. Nucleic Acids Res 2018, 46(D1):D303–7.

  2. 2.

    Xuan JJ, Sun WJ, Lin PH, Zhou KR, Liu S, Zheng LL, Qu LH, Yang JH. RMBase v2.0: deciphering the map of RNA modifications from epitranscriptome sequencing data. Nucleic Acids Res. 2018, 46(D1):D327–34.

  3. 3.

    Frye M, Harada BT, Behm M, He C. RNA modifications modulate gene expression during development. Science. 2018;361(6409):1346–9.

    CAS  PubMed  Google Scholar 

  4. 4.

    Dubin DT, Taylor RH. The methylation state of poly A-containing messenger RNA from cultured hamster cells. Nucleic Acids Res. 1975;2(10):1653–68.

    CAS  PubMed  Google Scholar 

  5. 5.

    Squires JE, Patel HR, Nousch M, Sibbritt T, Humphreys DT, Parker BJ, Suter CM, Preiss T. Widespread occurrence of 5-methylcytosine in human coding and non-coding RNA. Nucleic Acids Res. 2012;40(11):5023–33.

    CAS  PubMed  Google Scholar 

  6. 6.

    Agris PF. Bringing order to translation: the contributions of transfer RNA anticodon-domain modifications. EMBO Rep. 2008;9(7):629.

    CAS  PubMed  Google Scholar 

  7. 7.

    Alexandrov A, Chernyakov I, Gu W, Hiley SL, Hughes TR, Grayhack EJ, Phizicky M. E: rapid tRNA decay can result from lack of nonessential modifications. Mol Cell. 2006;21(1):87–96.

    CAS  Google Scholar 

  8. 8.

    Chen Y, Sierzputowska-Gracz H, Guenther R, Everett K, Agris PF. 5-Methylcytidine is required for cooperative binding of Mg2+ and a conformational transition at the anticodon stem-loop of yeast phenylalanine tRNA. Biochemistry. 1993;32(38):10249–53.

    CAS  Google Scholar 

  9. 9.

    David R, Burgess A, Parker B, Li J, Pulsford K, Sibbritt T, Preiss T, Searle IR. Transcriptome-wide mapping of RNA 5-methylcytosine in arabidopsis mRNAs and non-coding RNAs. Plant Cell. 2017;29(3):445.

    CAS  PubMed  Google Scholar 

  10. 10.

    Hong B, Brockenbrough JS, Wu P, Aris JP. Nop2p is required for pre-rRNA processing and 60S ribosome subunit synthesis in yeast. Mol Cell Biol. 1997;17(1):378–88.

    CAS  PubMed  Google Scholar 

  11. 11.

    Motorin Y, Helm M. tRNA stabilization by modified nucleotides. Biochemistry. 2010;49(24):4934–44.

    CAS  PubMed  Google Scholar 

  12. 12.

    Schaefer M, Pollex T, Hanna K, Tuorto F, Meusburger M, Helm M, Lyko F. RNA methylation by Dnmt2 protects transfer RNAs against stress-induced cleavage. Genes Dev. 2010;24(15):1590–5.

    CAS  PubMed  Google Scholar 

  13. 13.

    Motorin Y, Lyko F, Helm M. 5-methylcytosine in RNA: detection, enzymatic formation and biological functions. Nucleic Acids Res. 2010;38(5):1415–30.

    CAS  PubMed  Google Scholar 

  14. 14.

    Zhang X, Liu Z, Yi J, Tang H, Xing J, Yu M, Tong T, Shang Y, Gorospe M, Wang W. The tRNA methyltransferase NSun2 stabilizes p16INK(4) mRNA by methylating the 3’-untranslated region of p16. Nat Commun. 2012;3:712.

    PubMed  Google Scholar 

  15. 15.

    Edelheit S, Schwartz S, Mumbach MR, Wurtzel O, Sorek R. Transcriptome-wide mapping of 5-methylcytidine RNA modifications in bacteria, archaea, and yeast reveals m5C within archaeal mRNAs. PLoS Genet. 2013;9(6):e1003602.

    CAS  PubMed  Google Scholar 

  16. 16.

    Khoddami V, Cairns BR. Identification of direct targets and modified bases of RNA cytosine methyltransferases. Nat Biotechnol. 2013;31(5):458–64.

    CAS  PubMed  Google Scholar 

  17. 17.

    Hussain S, Blanco S, Dietmann S, Lombard P, Sugimoto Y, Paramor M, Ule J, Frye M. NSun2-mediated cytosine-5 methylation of vault noncoding RNA determines its processing into regulatory small RNAs. Cell Rep. 2013;4(2):255–61.

    CAS  PubMed  Google Scholar 

  18. 18.

    Khoddami V, Yerra A, Mosbruger TL, Fleming AM, Burrows CJ, Cairns BR. Transcriptome-wide profiling of multiple RNA modifications simultaneously at single-base resolution. Proc Natl Acad Sci USA. 2019;116(14):6784–9.

    CAS  PubMed  Google Scholar 

  19. 19.

    Feng P, Ding H, Chen W, Lin H. Identifying RNA 5-methylcytosine sites via pseudo nucleotide compositions. Mol BioSyst. 2016;12(11):3307.

    CAS  PubMed  Google Scholar 

  20. 20.

    Qiu WR, Jiang SY, Xu ZC, Xiao X, Chou KC. iRNAm 5C-PseDNC: identifying RNA 5-methylcytosine sites by incorporating physical-chemical properties into pseudo dinucleotide composition. Oncotarget. 2017;8(25):41178–88.

    PubMed  Google Scholar 

  21. 21.

    Zhang M, Xu Y, Li L, Liu Z, Yang X, Yu DJ. Accurate RNA 5-methylcytosine site prediction based on heuristic physical-chemical properties reduction and classifier ensemble. Anal Biochem. 2018;550:41–8.

    CAS  PubMed  Google Scholar 

  22. 22.

    Sabooh MF, Iqbal N, Khan M, Khan M, Maqbool HF. Identifying 5-methylcytosine sites in RNA sequence using composite encoding feature into Chou’s PseKNC. J Theor Biol. 2018;452:1–9.

    CAS  PubMed  Google Scholar 

  23. 23.

    Fang T, Zhang Z, Sun R, Zhu L, He J, Huang B, Xiong Y, Zhu X. RNAm 5CPred: prediction of RNA 5-methylcytosine sites based on three different kinds of nucleotide composition. Mol Ther Nucleic Acids. 2019;18:739–47.

    CAS  PubMed  Google Scholar 

  24. 24.

    Song J, Zhai J, Bian E, Song Y, Yu J, Ma C. Transcriptome-wide annotation of m(5)C RNA modifications using machine learning. Front Plant Sci. 2018;9:519.

    PubMed  Google Scholar 

  25. 25.

    Li J, Huang Y, Yang X, Zhou Y, Zhou Y. RNAm 5Cfinder: a web-server for predicting RNA 5-methylcytosine (m5C) sites based on random forest. Sci Rep. 2018;8(1):17299.

    PubMed  Google Scholar 

  26. 26.

    Lv H, Zhang ZM, Li SH, Tan JX, Chen W, Lin H. Evaluation of different computational methods on 5-methylcytosine sites identification. Brief Bioinform 2019.

  27. 27.

    He J, Fang T, Zhang Z, Huang B, Zhu X, Xiong Y. PseUI: Pseudouridine sites identification based on RNA sequence information. BMC Bioinform. 2018;19(1):306.

    Google Scholar 

  28. 28.

    Li GQ, Liu Z, Shen HB, Yu DJ. TargetM6A: Identifying N6-methyladenosine Sites from RNA Sequences via Position-Specific Nucleotide Propensities and a Support Vector Machine. IEEE Trans Nanobiosci. 2016, PP(99):1–1.

  29. 29.

    Cover TM, Hart PE. Nearest neighbor pattern classification. IEEE Transinftheory. 1967;13(1):21–7.

    Google Scholar 

  30. 30.

    Schapire RE. A brief introduction to boosting. In: Proceedings of the sixteenth international joint conference on artificial intelligence, IJCAI 99, Stockholm, Sweden, July 31–August 6, 1999 2 Volumes, 1450 pages: 1999.

  31. 31.

    Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

    Article  Google Scholar 

  32. 32.

    Quinlan JR. Introduction of decision trees. Mach Learn. 1986;1:81–106.

    Google Scholar 

  33. 33.

    Cox DR. The regression analysis of binary sequences. J R Stat Soc. 21(1):238–238.

  34. 34.

    Chen T, Guestrin C: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 2016:785–794.

  35. 35.

    Vacic V, Iakoucheva LM, Radivojac P. Two sample logo: a graphical representation of the differences between two sets of sequence alignments. Bioinformatics. 2006;22(12):1536–7.

    CAS  Article  Google Scholar 

  36. 36.

    Chen Z, Zhao P, Li F, Marquez-Lago TT, Leier A, Revote J, Zhu Y, Powell DR, Akutsu T, Webb GI, et al. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform. 2020;21(3):1047–57.

    Article  Google Scholar 

  37. 37.

    Liu B, Liu F, Wang X, Chen J, Fang L, Chou KC. Pse-in-one: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 2015;43(W1):W65-71.

    CAS  Article  Google Scholar 

  38. 38.

    Chen Z, Zhao P, Li F, Wang Y, Smith AI, Webb GI, Akutsu T, Baggag A, Bensmail H, Song J. Comprehensive review and assessment of computational methods for predicting RNA post-transcriptional modification sites from RNA sequences. Brief Bioinform. 2020;21(5):1676–96.

    Article  Google Scholar 

  39. 39.

    Liu Q, Chen J, Wang Y, Li S, Jia C, Song J, Li F. DeepTorrent: a deep learning-based approach for predicting DNA N4-methylcytosine sites. Brief Bioinform. 2020.

  40. 40.

    Bi Y, Xiang D, Ge Z, Li F, Song J. An interpretable prediction model for identifying N7-methylguanosine sites based on XGBoost and SHAP. Mol Ther Nucleic Acids. 2020;22:362–72.

    Google Scholar 

  41. 41.

    Li F, Fan C, Marquez-Lago TT, Leier A, Revote J, Jia C, Zhu Y, Smith AI, Webb GI, Liu Q, et al. PRISMOID: a comprehensive 3D structure database for post-translational modifications and mutations with functional impact. Brief Bioinform. 2020;21(3):1069–79.

    PubMed  Google Scholar 

  42. 42.

    Chen Z, Zhao P, Li F, Leier A, Marquez-Lago TT, Webb GI, Baggag A, Bensmail H, Song J. PROSPECT: a web server for predicting protein histidine phosphorylation sites. J Bioinform Comput Biol. 2020;18(4):2050018.

    CAS  PubMed  Google Scholar 

  43. 43.

    Cui X, Liang Z, Shen L, Zhang Q, Bao S, Geng Y, Zhang B, Leo V, Vardy LA, Lu T, et al. 5-Methylcytosine RNA Methylation in Arabidopsis Thaliana. Mol Plant. 2017;10(11):1387–99.

    CAS  PubMed  Google Scholar 

  44. 44.

    Yang X, Yang Y, Sun BF, Chen YS, Xu JW, Lai WY, Li A, Wang X, Bhattarai DP, Xiao W, et al. 5-methylcytosine promotes mRNA export - NSUN2 as the methyltransferase and ALYREF as an m(5)C reader. Cell Res. 2017;27(5):606–25.

    CAS  PubMed  Google Scholar 

  45. 45.

    Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–2.

    CAS  PubMed  Google Scholar 

  46. 46.

    Liu Z, Xiao X, Qiu WR, Chou KC. iDNA-Methyl: identifying DNA methylation sites via pseudo trinucleotide composition. Anal Biochem. 2015;474:69–77.

    CAS  Google Scholar 

  47. 47.

    Xiao X, Min JL, Lin WZ, Liu Z, Cheng X, Chou KC. iDrug-Target: predicting the interactions between drug compounds and target proteins in cellular networking via benchmark dataset optimization approach. J Biomol Struct Dyn. 2015;33(10):2221–33.

    CAS  Google Scholar 

  48. 48.

    Brayet J, Zehraoui F, Jeanson-Leh L, Israeli D, Tahi F. Towards a piRNA prediction using multiple kernel fusion and support vector machine. Bioinformatics. 2014;30(17):i364-370.

    CAS  PubMed  Google Scholar 

  49. 49.

    Vinje H, Liland KH, Almoy T, Snipen L. Comparing K-mer based methods for improved classification of 16S sequences. BMC Bioinform. 2015;16:205.

    Google Scholar 

  50. 50.

    Zhu X, He J, Zhao S, Tao W, Xiong Y, Bi S. A comprehensive comparison and analysis of computational predictors for RNA N6-methyladenosine sites of Saccharomyces cerevisiae. Brief Funct Genomics. 2019;18(6):367–76.

    CAS  PubMed  Google Scholar 

  51. 51.

    Wang X, Yan R. RFAthM6A: a new tool for predicting m 6 A sites in Arabidopsis thaliana. Plant Mol Biol. 2018;96(3):327–37.

    CAS  PubMed  Google Scholar 

  52. 52.

    Chen W, Zhang X, Brooker J, Lin H, Zhang L, Chou KC. PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions. Bioinformatics. 2015;31(1):119–20.

    CAS  PubMed  Google Scholar 

  53. 53.

    Liu B, Fang L, Long R, Lan X, Chou KC. iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition. Bioinformatics. 2016;32(3):362–9.

    CAS  PubMed  Google Scholar 

  54. 54.

    Bari ATMG, Reaz MR, Choi HJ, Jeong BS. DNA Encoding for splice site prediction in large DNA sequence. Springer Berlin, 2013.

  55. 55.

    Cortes C, Vapnik V. Support-vector networks. In: Machine learning: 1995. 273–297.

  56. 56.

    Qiao Y, Xiong Y, Gao H, Zhu X, Chen P. Protein-protein interface hot spots prediction based on a hybrid feature selection strategy. BMC Bioinform. 2018;19(1):14.

    Google Scholar 

  57. 57.

    Wang X, Pardalos PM. A survey of support vector machines with uncertainties. Ann Data Sci. 2014;1(3–4):293–309.

    Google Scholar 

  58. 58.

    Zhu X, Ericksen SS, Mitchell JC. DBSI: DNA-binding site identifier. Nucleic Acids Res. 2013;41(16):e160.

    PubMed  Google Scholar 

  59. 59.

    Fawcett T: An introduction to ROC analysis. Pattern Recognit Lett. 27(8):861–74.

  60. 60.

    Buckland MK, Gey FC. The relationship between recall and precision. J Assoc Inf Sci Technol . 1994;45(1):12–9.

    Google Scholar 

Download references

Acknowledgements

The authors thank Jerry Chen for assistance with manuscript preparation and revision.

Funding

This work was supported by National Natural Science Foundation of China under grants No. 21403002, 31601074 and 61872094. The fundings had no role in the design of the study and collection, analysis, and interpretation of data and writing the manuscript.

Author information

Affiliations

Authors

Contributions

Conceived the study: XZ, SB. Designed the study: XZ, SB. Participate designed the study: XC, YX. Analyzed the data: XC, YX, YL, YC. Wrote the paper: XC, YX, XZ, SB. All authors read and approved the manuscript.

Corresponding authors

Correspondence to Shoudong Bi or Xiaolei Zhu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1

. This file provides the performances of models built on positive samples and other nine negative subsets for H. sapiens and M. musculus, and more detailed data for selecting the Ks for KSNPF and KSPSDP. Table S1: The cross validation performances using different negative subsets of H. sapiens. Table S2: The cross validation performances using different negative subsets of M. musculus. Table S3: The performances on the independent test sets for models built on different negative subsets of H. sapiens. Table S4: The performances on the independent test sets for models built on different negative subsets of M. musculus. Table S5: The cross validation results of KSNPF for the three species with different Ks. Table S6: The cross validation results of KSPSDP for the three species with different Ks.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Chen, X., Xiong, Y., Liu, Y. et al. m5CPred-SVM: a novel method for predicting m5C sites of RNA. BMC Bioinformatics 21, 489 (2020). https://doi.org/10.1186/s12859-020-03828-4

Download citation

Keywords

  • 5-Methylcytosine sites
  • Support vector machine
  • Computational predictor
  • Position specific propensity
  • Web server