
Deep-m5U: a deep learning-based approach for RNA 5-methyluridine modification prediction using optimized feature integration

Abstract

Background

RNA 5-methyluridine (m5U) modifications play a crucial role in biological processes, making their accurate identification a key focus in computational biology. This paper introduces Deep-m5U, a robust predictor designed to enhance the prediction of m5U modifications. Deep-m5U combines a hybrid pseudo K-tuple nucleotide composition (PseKNC) sequence formulation, SHapley Additive exPlanations (SHAP)-based discriminant feature selection, and a deep neural network (DNN) classifier.

Results

The model was evaluated on two benchmark datasets, Full Transcript and Mature mRNA. Under 10-fold cross-validation, Deep-m5U achieved overall accuracies of 91.47% and 95.86% on the Full Transcript and Mature mRNA datasets, respectively, and on independent samples it attained 92.94% and 95.17% accuracy.

Conclusion

Compared to existing models, Deep-m5U showed approximately 5.23% and 3.73% higher accuracy on the training data and 3.95% and 3.26% higher accuracy on independent samples for the Full Transcript and Mature mRNA datasets, respectively. The reliability and effectiveness of Deep-m5U make it a valuable tool for scientists and a potential asset in pharmaceutical design and research.


Introduction

In recent years, advances in technologies for mapping RNA modifications have significantly propelled research into RNA epigenetics. Over 170 chemical modifications have now been identified in cellular RNA, with notable examples including 5-methylcytosine (m5C), 5-hydroxymethylcytosine (5hmC), N6-methyladenosine (m6A), N1-methyladenosine (m1A), 2’-O-methylation of ribose (2’-O-Me), and pseudouridine (Ψ) [1]. Lesser-known modifications such as 7-methylguanosine (m7G), adenosine-to-inosine editing (A-to-I), dihydrouridine (D), N2-methylguanosine (m2G), and N4-acetylcytidine have also been identified [2]. These modifications can affect all four RNA bases, adenine (A), cytosine (C), guanine (G), and uracil (U), as well as the ribose sugar. While nearly all RNA species undergo modification, transfer RNA (tRNA) and ribosomal RNA (rRNA) are the most heavily modified [3]. More than 100 mutations in RNA-modifying enzymes have been linked to human illnesses, underscoring these modifications' critical role in regulating gene expression and protein production [4]. Moreover, RNA modifications, like post-translational protein modifications, serve dynamic cellular functions by modulating cell-specific activities. Detection of RNA modifications currently relies on two principal methods: next-generation sequencing (NGS) and liquid chromatography coupled with mass spectrometry (LC–MS). LC–MS is highly sensitive and specific but cannot provide sequence context [5]. In contrast, NGS offers comprehensive sequence information but struggles to detect modifications directly, because RNA modifications often interfere with reverse transcription and introduce errors or blockages during sequencing [6]. These technological advances are crucial for further understanding the complexity and biological significance of RNA modifications [7, 8].

The 5-methyluridine (m5U) modification is a significant epigenetic mark that has drawn global attention from researchers. It is commonly found in cytosolic tRNAs and also occurs in other RNA species such as rRNA and mRNA [9]. Various enzymes, such as TrmA in E. coli, Trm2 in S. cerevisiae, and TRMT2A and TRMT2B, catalyze m5U modification [10]. The modification contributes to RNA structure; within the conserved T-loop motif in particular, it helps stabilize RNA secondary structure [10, 11]. Despite being among the most prevalent RNA changes, there is still limited research on identifying m5U sites and understanding their functions. The enzymes that methylate uridine at its fifth carbon may have been among the first pyrimidine methyltransferases to evolve. m5U is linked to diseases like breast cancer and lupus, as well as to plant development and stress response. Accurately pinpointing m5U sites is therefore crucial for understanding its biological role, but experimental methods like miCLIP-Seq and iCLIP are costly and time-consuming, often yielding limited data due to antibody specificity issues [12].

To address the challenges of detecting RNA modifications, researchers have proposed machine learning (ML) algorithms [13,14,15]. Experimentally identifying all RNA modifications is expensive and complicated, so computational methods are now widely used. Various ML-based tools and resources have been developed for RNA modifications, such as WHISTLE [16], SRAMP [17], iRNA-Methyl [18], RNA modification databases [19], m7GHub V2.0 [20], DirectRMDB [21], MODOMICS [22], and ConsRM [23]; m5U predictors such as iRNA-m5U [24] and RNADSN [25]; pseudouridine predictors such as iRNA-PseU and PPUS [26]; and the m5C predictor RNAm5Cfinder [27]. A few computational and experimental techniques have also been created specifically to improve the identification of m5U-modified sites and to handle the complexity of RNA modifications. One prominent tool is m5UPred, introduced by Jiang et al. [28], which uses an SVM algorithm with sequence-derived features such as nucleotide content and chemical properties to predict m5U sites in human RNA. While m5UPred achieved a respectable accuracy of 83.60% for Full Transcripts and 89.91% for Mature mRNA in cross-validation tests, it faced challenges such as overfitting and was limited by the quality of its dataset, which had not been filtered for redundancy.

Ao et al. [29] proposed the m5U-SVM model to address these limitations, merging distributed-representation features with traditional physicochemical features. This model was also based on SVM and used a multi-view feature approach, improving its ability to differentiate between m5U and non-m5U sites. Under tenfold cross-validation, m5U-SVM showed improved performance, with average accuracies of 88.876% for Full Transcripts and 94.358% for Mature mRNA. Despite these promising results, both models still rely on conventional learning methods that struggle to accurately predict m5U-modified sites, primarily because of the resemblance between m5U and non-m5U sites.

This paper presents a robust computational predictor, Deep-m5U, designed to accurately identify 5-methyluridine (m5U) modifications by leveraging effective feature extraction techniques. The model uses pseudo K-tuple nucleotide compositions, grouped into single nucleotide composition (SNC), dinucleotide composition (DNC), trinucleotide composition (TNC), quad nucleotide composition (QNC), and penta nucleotide composition (PNC), to capture robust and intricate patterns in the RNA sequence. These features also incorporate structural and global sequence-order information, converting each RNA sequence into a numerical feature vector. The individual vectors are then integrated into a single hybrid feature set, allowing the model to represent the biological data more effectively. SHAP analysis refines this feature set by removing noisy and irrelevant features so that only the most discriminative ones are retained for classification. A DNN classifier is then trained on the selected features. The model was validated through tenfold cross-validation on the benchmark Full Transcript and Mature mRNA datasets, achieving accuracies of 91.47% and 95.86%, respectively. Deep-m5U also outperformed existing predictors across all evaluation criteria and can therefore be used to predict RNA modification sites effectively. Figure 1 illustrates the structure of the proposed model. The main outcomes of this study are as follows:

  1. A novel deep computational model for predicting 5-methyluridine (m5U) modifications.

  2. Integrating an optimized feature extraction process using nonlinear activation functions and multi-layer architectures to address dataset complexities.

  3. A thorough assessment of model efficacy using both validation and testing datasets.

  4. Superior performance in contrast to existing m5U predictors, as measured by various assessment parameters.

Fig. 1 Proposed model framework

Materials and methods

Benchmark dataset

Datasets are essential for developing accurate and effective machine-learning models for predicting RNA 5-methyluridine (m5U) modification sites. Such benchmark datasets are needed to test the efficiency and capability of ML algorithms. In this research, we used the benchmark datasets of [28, 30, 31]. For the positive samples, we constructed 41-nucleotide sequences around experimentally verified modified uridine sites. The negative instances were randomly sourced from unmodified uridine sites of the same transcripts used for the positive samples. The Full Transcript dataset comprised 3,696 positive and 3,696 negative samples, and the Mature mRNA dataset comprised 1,232 positive and 1,232 negative samples. Initially, homologous sequences from the corresponding genomic regions were not excluded, which can bias evaluation when sequence identity exceeds a certain threshold. To address this issue, we applied the CD-HIT method to the Full Transcript m5U modification site data with an 80% sequence-identity cutoff, removing homologous sequences and improving data quality. Table 1 shows the details of the final m5U modification site dataset.

Table 1 Benchmark dataset samples of RNA 5-methyluridine
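As an illustration of this sampling procedure, the sketch below assembles 41-nt windows centered on uridine sites; the transcript string, the set of verified m5U positions, and the function names are hypothetical assumptions for demonstration only, and redundancy filtering (e.g., CD-HIT at 80% identity) would follow as a separate step.

```python
# Hypothetical sketch (not the authors' code): building 41-nt positive/negative
# samples centered on uridine sites, as described for the benchmark datasets.
import random
from typing import List, Optional, Set, Tuple

WINDOW = 41
FLANK = WINDOW // 2  # 20 nt on either side of the central uridine


def extract_window(transcript: str, pos: int) -> Optional[str]:
    """Return the 41-nt window centered on 0-based position `pos`, or None if
    the site is too close to a transcript end or is not a uridine."""
    if pos - FLANK < 0 or pos + FLANK + 1 > len(transcript):
        return None
    window = transcript[pos - FLANK: pos + FLANK + 1]
    return window if window[FLANK] == "U" else None


def build_samples(transcript: str, m5u_positions: Set[int],
                  n_negatives: int, seed: int = 1) -> Tuple[List[str], List[str]]:
    """Positives from verified m5U sites; negatives drawn at random from
    unmodified uridines of the same transcript."""
    positives = [w for p in sorted(m5u_positions)
                 if (w := extract_window(transcript, p)) is not None]
    unmodified_u = [i for i, base in enumerate(transcript)
                    if base == "U" and i not in m5u_positions]
    random.seed(seed)
    picked = random.sample(unmodified_u, k=min(n_negatives, len(unmodified_u)))
    negatives = [w for i in picked if (w := extract_window(transcript, i)) is not None]
    return positives, negatives

# Usage (with a hypothetical transcript and site list):
# pos, neg = build_samples(transcript_seq, {120, 384, 902}, n_negatives=3)
```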

Feature formulation

Several approaches have been designed to encode DNA, protein, and RNA samples into mathematical structures while preserving the structural features of nucleotides [6, 32,33,34,35]. These approaches enable bioinformaticians to transform RNA sequences into statistical features that retain the uniqueness and patterns within each sequence [36]. The Pseudo Amino Acid Composition (PseAAC) technique represents protein samples in a discrete form. Because of its efficiency, the PseAAC method was later extended to DNA and RNA molecules as the Pseudo K-tuple Nucleotide Composition (PseKNC) approach [37]. Through the series of transformations it supports, PseKNC has emerged as a prominent method for modeling RNA and DNA sequences in computational biology [38, 39]. Using the PseKNC method, RNA sequences are converted into feature vectors that represent the sequence while preserving as much of the sequence-order information as possible. The transformed representations have a uniform dimensionality across RNA samples and are therefore suitable for use in different machine learning procedures [40]. To this end, all RNA sequences were encoded using the PseKNC method, transforming RNA primary sequences into numerical feature vectors in which the order of the nucleotides is reflected. For example, consider an RNA sequence N with L nucleotides, represented as:

$$N = N_{1} N_{2} N_{3} ......N_{i} ....N_{L}$$
(1)
$$N_{i} \in \left\{ {A,U,C,G} \right\}\;\quad (i = 1,2,3,....,L)$$
(2)

where L is the total number of nucleotides in the RNA sequence, i.e., its length. In Eq. 2, A denotes the adenine nucleotide base, U uracil, G guanine, and C cytosine, and \({N}_{i}\) is the nucleotide occurring at the ith position of the sequence. Following [41], the general PseKNC representation of the sample in Eq. 1 can be written as:

$$N = \left[ {\varphi_{1} \;\varphi_{2} \;\varphi_{3} \; \cdots \;\varphi_{u} \; \cdots \;\varphi_{z} } \right]^{T}$$
(3)

where T denotes the vector transpose, z is the dimension of the feature vector, and \(\varphi_{u}\) is the uth component of the RNA sequence's feature vector, computed using Eq. 4.

$$\varphi_{u} = \begin{cases} \dfrac{f_{u}^{K\text{-}tuple} }{\sum\limits_{i = 1}^{4^{K}} f_{i}^{K\text{-}tuple} + w\sum\limits_{j = 1}^{\lambda } \theta_{j} }, & 1 \le u \le 4^{K} \\[2ex] \dfrac{w\,\theta_{u - 4^{K}} }{\sum\limits_{i = 1}^{4^{K}} f_{i}^{K\text{-}tuple} + w\sum\limits_{j = 1}^{\lambda } \theta_{j} }, & 4^{K} + 1 \le u \le 4^{K} + \lambda \end{cases}$$
(4)

where \(\theta_{j}\) is the jth-tier correlation factor, which reflects the sequence-order correlation between K-tuple nucleotides separated by j positions along the RNA sequence, and \(f_{u}^{K\text{-}tuple}\) is the normalized occurrence frequency of the uth K-tuple nucleotide. The parameter \(\lambda\) is the number of correlation tiers taken into account, and \(w\) is a weight factor that balances the influence of the correlation terms. Preliminary experiments indicated that \(\lambda = 1\) and \(w = 0.1\) give the best performance. The correlation factor \(\theta_{j}\) is calculated as follows:

$$\theta_{j} = \frac{1}{L - K - (\lambda - 1)}\sum\limits_{i = 1}^{L - K - (\lambda - 1)} {C_{i,i + j} } ,\qquad j = 1,2,\ldots ,\lambda ;\;\; \lambda < L - K$$
(5)

where \(C_{i,i + j}\) is the correlation function, which can be computed using Eq. 6.

$$C_{i,i + j} = \frac{1}{\mu }\sum\limits_{\xi = 1}^{\mu } {\left[ {H_{\xi } (N_{i} N_{i + 1} ...N_{i + K - 1} ) - H_{\xi } (N_{i + j} N_{i + j + 1} ...N_{i + j + K - 1} )} \right]}^{2}$$
(6)

where \(\mu\) is the number of physicochemical properties considered, \(N_{i}\) denotes a nucleotide (cf. Eq. 1), \(H_{\xi } (N_{i} N_{i + 1} ...N_{i + K - 1} )\) is the numerical value of the \(\xi\)th physicochemical property for the K-tuple nucleotide \(N_{i} N_{i + 1} ...N_{i + K - 1}\), and \(H_{\xi } (N_{i + j} N_{i + j + 1} ...N_{i + j + K - 1} )\) is the corresponding value for the K-tuple nucleotide \(N_{i + j} N_{i + j + 1} ...N_{i + j + K - 1}\).
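For concreteness, the sketch below computes only the composition part of this encoding (the normalized K-tuple frequencies \(f_{u}^{K\text{-}tuple}\) in Eq. 4); the correlation terms \(\theta_{j}\) would additionally require a table of physicochemical property values \(H_{\xi}\) for each K-tuple, which is omitted here, so this is a simplified illustration rather than the authors' implementation.

```python
# Minimal sketch of the composition part of PseKNC (the f_u terms in Eq. 4).
from itertools import product


def ktuple_frequencies(seq: str, k: int) -> list:
    """Normalized occurrence frequencies of all 4**k K-tuples (A/C/G/U alphabet)."""
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:          # skips tuples containing ambiguous bases
            counts[kmer] += 1
    total = max(sum(counts.values()), 1)
    return [counts[m] / total for m in kmers]


# Example: dinucleotide composition (16 values) of a toy sequence
print(ktuple_frequencies("AUGGCUUCAUG", k=2))
```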

Hybrid features

In this study, every RNA sequence was encoded as an individual feature vector using the PseKNC method. By varying K from 1 to 5, PseKNC provides five distinct modes of feature extraction, corresponding to different compositions: PseSNC (single nucleotide composition), PseDNC (dinucleotide composition), PseTNC (trinucleotide composition), PseQNC (quad nucleotide composition), and PsePNC (penta nucleotide composition). The number of features generated for each mode is given in Table 2 below. To construct a comprehensive hybrid feature vector, we combined all five feature vectors as follows:

$$N_{{m5U}} = N_{{PseSNC}} \cup N_{{PseDNC}} \cup N_{{PseTNC}} \cup N_{{PseQNC}} \cup N_{{PsePNC}}$$
(7)

where \(N_{m5U}\) denotes the hybrid feature vector, \(\cup\) denotes the union (concatenation) operation, and \(N_{PseSNC}\), \(N_{PseDNC}\), \(N_{PseTNC}\), \(N_{PseQNC}\) and \(N_{PsePNC}\) are the individual feature vectors, defined as follows:

$$N_{PseSNC} = \left[ {f_{j}^{1\text{-}Tuple} } \right]_{j = 1, \ldots ,4D} \quad \mathop{\longrightarrow}\limits^{f}(A,C, \ldots ,U)$$
(8)
$$N_{PseDNC} = \left[ {f_{j}^{2\text{-}Tuple} } \right]_{j = 1, \ldots ,16D} \quad \mathop{\longrightarrow}\limits^{f}(AA,AC, \ldots ,UU)$$
(9)
$$N_{PseTNC} = \left[ {f_{j}^{3\text{-}Tuple} } \right]_{j = 1, \ldots ,64D} \quad \mathop{\longrightarrow}\limits^{f}(AAA,AAC, \ldots ,UUU)$$
(10)
$$N_{PseQNC} = \left[ {f_{j}^{4\text{-}Tuple} } \right]_{j = 1, \ldots ,256D} \quad \mathop{\longrightarrow}\limits^{f}(AAAA,AAAC, \ldots ,UUUU)$$
(11)
$$N_{PsePNC} = \left[ {f_{j}^{5\text{-}Tuple} } \right]_{j = 1, \ldots ,1024D} \quad \mathop{\longrightarrow}\limits^{f}(AAAAA,AAAAC, \ldots ,UUUUU)$$
(12)
Table 2 The total number of PseKNC features from k = 1 to k = 5
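As a sketch of Eq. 7, the hybrid vector can be formed by concatenating the five composition vectors (4 + 16 + 64 + 256 + 1024 = 1,364 dimensions). The snippet below reuses the hypothetical ktuple_frequencies helper from the previous sketch and a toy sequence, and again omits the PseKNC correlation terms.

```python
# Sketch of the hybrid vector in Eq. 7: concatenation of the k = 1..5
# composition vectors, reusing ktuple_frequencies from the sketch above.
def hybrid_features(seq: str) -> list:
    vec = []
    for k in range(1, 6):
        vec.extend(ktuple_frequencies(seq, k))
    return vec  # 4 + 16 + 64 + 256 + 1024 = 1364 values


toy = "AUGGCUUCAU" * 4 + "U"          # a toy 41-nt sequence
print(len(hybrid_features(toy)))      # 1364
```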

SHAP feature selection

Interpreting the biological meaning of features selected by machine learning models is often difficult because these algorithms act as black boxes whose internal workings are hard to inspect [42]. How the data are organized and prepared, including partitioning into training and test sets, normalization, and feature selection, also strongly influences model behavior, so reducing the feature space to the most informative descriptors is an important step before classification. Grounded in cooperative game theory, SHAP provides a way to explain the contribution of each input feature to a model's predictions [43]. SHAP assigns every feature a score that encodes how informative that feature is for the resulting decisions: it computes how the prediction changes when a particular feature is included or excluded and thereby quantifies the feature's effect on the model. This incremental effect is formalized in Eq. 13, which expresses how feature i impacts the result when interacting with different subsets of features.

$$SHAP_{i} (x) = \phi_{i} = \sum\nolimits_{{S \subseteq N\backslash \{ i\} }} {\frac{{\left| S \right|!\,\left( {\left| N \right| - \left| S \right| - 1} \right)!}}{{\left| N \right|!}}} \left[ {f(S \cup \left\{ i \right\}) - f(S)} \right]$$
(13)

where:

  • \(\phi_{i}\) denotes the SHAP value of feature \(i\).

  • \(N\) represents the set of all features.

  • \(S\) is a subset of features excluding feature \(i\).

  • \(f\left( S \right)\) is the model's prediction given the features in \(S\).

  • \(f(S \cup \left\{ i \right\})\) is the model's prediction given the features in \(S\) and feature \(i\).

In this study, we use BorutaSHAP-based wrapper feature selection to identify the most influential features from the extracted vector, as it evaluates the contribution of each feature to model performance. BorutaSHAP enhances the training process by highlighting the global importance of features and facilitating the selection of the optimal feature set. From the hybrid feature vector of 1,364 dimensions, we selected the top 125 features for the Full Transcript dataset and the top 80 features for the Mature mRNA dataset. Figure 2 (a-b) presents the summarized BorutaSHAP plots for the top features, where each row represents a chosen feature. Red points indicate high-contributing features, while blue points signify those with lower contributions. The horizontal axis shows the SHAP values, where positive values push the prediction towards the m5U class and negative values towards the non-m5U class.

Fig. 2 Feature selection via SHAP analysis
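The snippet below is a simplified stand-in for this step, assuming a feature matrix X and labels y: instead of the full BorutaSHAP wrapper (which adds shadow features and statistical testing), it ranks features by mean absolute SHAP value from a tree-based surrogate model and keeps the top-k, which illustrates the same idea of SHAP-driven feature selection.

```python
# Simplified SHAP-based feature ranking (assumes X: samples x features, y: 0/1 labels).
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier


def select_top_features(X, y, top_k=125):
    """Rank features by mean |SHAP| value and return the indices of the top_k."""
    surrogate = GradientBoostingClassifier(random_state=0).fit(X, y)
    explainer = shap.TreeExplainer(surrogate)
    shap_values = explainer.shap_values(X)
    if isinstance(shap_values, list):      # some explainers return one array per class
        shap_values = shap_values[-1]
    importance = np.abs(np.asarray(shap_values)).mean(axis=0)   # global importance
    return np.argsort(importance)[::-1][:top_k]                 # indices of top-k features

# selected = select_top_features(X_train, y_train, top_k=125)
# X_train_sel = X_train[:, selected]
```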

Deep neural network architecture

Deep Neural Networks (DNNs) are a class of machine learning models inspired by the structure and functionality of the human brain. A DNN architecture consists of an input layer, several hidden layers in between, and an output layer, as shown in Fig. 3.

Fig. 3 Neuron representation in a deep neural network configuration

The hidden layers allow the network to learn features and patterns that are not apparent in the raw data. Increasing the number of hidden layers raises the capacity to map complex patterns, but it also increases model complexity, computational cost, and the risk of over-fitting. Automatic feature learning is one of the most prominent advantages of DNNs: they do not require manual feature engineering, since they can learn representations from the data themselves, even when the data are unlabeled or unstructured [44]. In this work, we employ standard supervised learning. DNNs have proved more effective at addressing complex classification problems than earlier machine learning techniques because of their depth and flexibility, and they have been used extensively in several domains, including bioengineering [45], speech recognition, image recognition [35], and natural language processing [33], pointing to their potential to solve numerous complex problems.

Model training

Based on the benchmark datasets, this study used a DNN model to identify m5U sites. From the complete set of sequences, 80% of the samples were used for training and the remaining 20% as an independent test set. Accordingly, the m5U modification site data consisted of 1,534/2,862 samples for training and 500/731 samples for independent testing in the Full Transcript mode, and 983/985 training samples and 245/247 independent test samples in the Mature mRNA mode. The proposed multi-layer DNN model comprises an input layer, four hidden layers, and an output layer, as shown in Fig. 4. Each layer contains multiple neurons, and a neuron's output is computed from its inputs as in Eq. 14. The weights of each neuron are initialized with the Xavier method [46], which preserves the variance of activations across layers and promotes effective learning. The backpropagation algorithm is used to update the weights iteratively, reducing the error between the predicted output and the target classes. The hyperbolic tangent (Tanh) activation function is used in the input and hidden layers to introduce nonlinearity, enabling the network to capture intricate patterns and relationships within the data when deciding whether a neuron should be activated. The output layer uses the softmax activation function, which converts the network outputs into class probabilities between 0 and 1. This ensures that the DNN produces probabilistic predictions, improving the accuracy with which m5U sites are identified.

$$y_{a} = f(B_{a} + \sum\limits_{b = 1}^{m} {x_{b} } w_{b}^{a} )$$
(14)

where \(y_{a}\) denotes the output of neuron \(a\), \(B_{a}\) denotes its bias value, \(w_{b}^{a}\) represents the weight connecting input \(b\) to neuron \(a\), \(x_{b}\) denotes the input feature, and \(f\) denotes the nonlinear Tanh activation function, which can be calculated using Eq. 15.

$$f(i) = \frac{{e^{i} - e^{ - i} }}{{e^{i} + e^{ - i} }}$$
(15)
Fig. 4 Proposed DNN model configuration for RNA m5U sites
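A minimal Keras sketch of this architecture is given below, assuming the SHAP-selected feature matrix X and integer labels y. The hidden-layer widths, optimizer choice, and batch size are illustrative assumptions rather than reported settings; the four Tanh hidden layers, Glorot (Xavier) initialization, softmax output, and 80/20 split follow the description above, and the 0.1 learning rate anticipates the optimal value reported in the hyperparameter analysis.

```python
# Sketch of the described DNN with TensorFlow/Keras (layer widths are assumptions).
import tensorflow as tf
from sklearn.model_selection import train_test_split


def build_dnn(input_dim):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(256, activation="tanh",
                                    kernel_initializer="glorot_uniform",
                                    input_shape=(input_dim,)))
    for units in (128, 64, 32):                       # three more hidden layers
        model.add(tf.keras.layers.Dense(units, activation="tanh",
                                        kernel_initializer="glorot_uniform"))
    model.add(tf.keras.layers.Dense(2, activation="softmax"))
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# X: SHAP-selected hybrid features (numpy array), y: 0/1 labels
# X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)
# model = build_dnn(X_tr.shape[1])
# model.fit(X_tr, y_tr, epochs=50, batch_size=32, validation_split=0.1)
```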

Evaluation of performance

Several performance indicators are usually employed to assess the output of a machine learning algorithm and determine the performance and reliability of the model. (1) Accuracy (ACC) estimates the overall correctness of the classifier and is calculated as the number of correctly classified instances, both true positives and true negatives, divided by the total number of instances [47,48,49]; it offers an overall picture of the model’s effectiveness. (2) Sensitivity (SN), also called the true positive rate or recall, measures the classifier's ability to identify positive instances correctly [50,51,52,53,54]. It is computed as the number of correctly identified positive instances divided by the total number of actual positive instances, and it is especially important when missing a positive instance is costly, as in medical diagnosis or fraud detection [55]. (3) Specificity (SP), the true negative rate, measures the proportion of negative instances that the classifier recognizes correctly. (4) The Matthews correlation coefficient (MCC) measures the overall consistency of a binary classification [13,14,15]. Together, these metrics provide a holistic picture of the model's performance. They can be expressed mathematically as follows:

$$ACC = 1 - \frac{{m^{5} u_{ - }^{ + } + m^{5} u_{ + }^{ - } }}{{m^{5} u^{ + } + m^{5} u^{ - } }}$$
(16)
$$SN = 1 - \frac{{m^{5} u_{ - }^{ + } }}{{m^{5} u^{ + } }}$$
(17)
$$SP = 1 - \frac{{m^{5} u_{ + }^{ - } }}{{m^{5} u^{ - } }}$$
(18)
$$MCC = \frac{{1 - \left( {\frac{{m^{5} u_{ - }^{ + } }}{{m^{5} u^{ + } }} + \frac{{m^{5} u_{ + }^{ - } }}{{m^{5} u^{ - } }}} \right)}}{{\sqrt {\left( {1 + \frac{{m^{5} u_{ + }^{ - } - m^{5} u_{ - }^{ + } }}{{m^{5} u^{ + } }}} \right)\left( {1 + \frac{{m^{5} u_{ - }^{ + } - m^{5} u_{ + }^{ - } }}{{m^{5} u^{ - } }}} \right)} }}$$
(19)

In these equations, \(m^{5} u^{ + }\) denotes the total number of m5U (positive) samples, \(m^{5} u^{ - }\) the total number of non-m5U (negative) samples, \(m^{5} u_{ - }^{ + }\) the number of m5U samples incorrectly predicted as non-m5U (false negatives), and \(m^{5} u_{ + }^{ - }\) the number of non-m5U samples incorrectly predicted as m5U (false positives).
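For reference, these formulas are equivalent to the standard confusion-matrix expressions; the short sketch below computes them directly from TP, TN, FP, and FN counts (the example counts are arbitrary).

```python
# Evaluation metrics of Eqs. 16-19 expressed via confusion-matrix counts.
import math


def evaluate(tp: int, tn: int, fp: int, fn: int) -> dict:
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)                     # sensitivity / recall
    sp = tn / (tn + fp)                     # specificity
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


print(evaluate(tp=460, tn=455, fp=45, fn=40))
```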

Discussion and experimental analysis

This section evaluates and discusses the proposed model's effectiveness in depth. Several validation schemes, including K-fold cross-validation and independent testing, can be used to assess the overall performance of a machine learning algorithm in bioinformatics. K-fold cross-validation is a standard technique in which the data are split into K evenly sized folds, each fold serving once as the validation set [56]. Accordingly, a tenfold cross-validation test on the benchmark datasets was used to examine the overall accuracy of the proposed predictor in this work.
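A sketch of this protocol is shown below, assuming the feature matrix X, labels y, and the build_dnn helper from the earlier sketch; stratified folds are an assumption used to keep the class balance in every fold.

```python
# Tenfold cross-validation sketch (X, y are numpy arrays).
import numpy as np
from sklearn.model_selection import StratifiedKFold


def cross_validate(X, y, build_model, epochs=50, batch_size=32):
    accs = []
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    for train_idx, val_idx in skf.split(X, y):
        model = build_model(X.shape[1])                      # fresh model per fold
        model.fit(X[train_idx], y[train_idx], epochs=epochs,
                  batch_size=batch_size, verbose=0)
        _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
        accs.append(acc)
    return float(np.mean(accs))

# mean_acc = cross_validate(X_train, y_train, build_dnn)
```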

System configuration

Experiments were run on a desktop machine with a sixth-generation Intel Core i5 processor, which is adequate for data processing and basic computing tasks. A 256-GB SSD provides considerably faster booting, reading, and writing than a conventional HDD, and 8 GB of RAM allows a reasonable level of multitasking. Standard Python 3 libraries common in data science workflows, such as NumPy and SciPy, were installed for training and testing the ML models, along with TensorFlow and Keras [57] for building deep neural networks and Pandas and Matplotlib for data analysis, cleaning, and preparation. This setup is well suited for an individual researcher or a small team; larger datasets or more resource-intensive tasks may require a stronger CPU and more RAM. The overall system configuration, consisting of an HP Core i5 6th-generation machine with a 256-GB SSD and 8 GB of RAM, is presented in Table 3.

Table 3 The system configuration, including software and hardware

Analysis of nucleotide composition

In this section, we utilized the Two-Sample Logo software [44] to compare sequences with and without m5U modifications and determine if nucleotides with m5U modification sites differ in composition. This approach generated two-sample logos highlighting areas where residues are significantly enriched or reduced in m5U-modified sequences. The statistical analysis, performed using a t-test with a significance threshold of p < 0.05, revealed notable differences, as illustrated in Fig. 5.

Fig. 5 m5U sites nucleotide composition analysis

Figure 5-A and Fig. 5-B illustrate the differences between sequences with and without m5U modifications for the Full Transcript and Mature mRNA modes, respectively. Regarding the consensus motifs, both modes show a conserved UUC at positions 0–2. We also observed specific nucleotide enrichments around the m5U modification sites. For instance, in both modes, G was enriched with a ratio of 35 at position −8 and 50 at position −3, C was concentrated mainly at positions 7 and 8, and U at position 1. Homopolymeric stretches were also observed for G (positions −10 to −8) and C (positions 6 to 9). The analysis further revealed compositional differences between the two modes: C was enriched at position 19 in the Full Transcript mode, whereas G was enriched there in the Mature mRNA mode, and at position −1 the Full Transcript mode showed enrichment of A while the Mature mRNA mode showed a higher concentration of G. Thus, differences in nucleotide composition around m5U sites carry signal that can support the prediction of RNA m5U modification sites.

Hyperparameters and optimization

In this section, we determine the best values for the hyperparameters of the DNN model. We used a grid search algorithm [58] to assess DNN performance under different configurations [59, 60]. Because the effect of some parameters on the DNN's performance is stochastic [61, 62], the grid search covered the activation function, the learning rate, and the number of training iterations. Based on the results, Table 4 lists the best hyperparameter values obtained.

Table 4 List of optimal hyperparameter values for the proposed DNN model

We ran experiments to evaluate how different activation functions and learning rates impact performance. The results, shown in Table 5, include tests using ReLU, Sigmoid, and Tanh as activation functions, with learning rates ranging from 0.1 to 0.3. According to the table, the DNN classifier achieved the highest accuracy, 91.47% for the Full Transcript dataset and 95.86% for the Mature mRNA dataset when using Tanh as the activation function and a learning rate of 0.1.

Table 5 Impact of different learning rates and activation functions on the accuracy of the DNN model under tenfold cross-validation

Table 5 shows that reducing the learning rate generally improves the accuracy of the DNN model. However, lowering the learning rate below 0.1 did not produce much further gain in accuracy. We can therefore state that the DNN model reached its maximum accuracy at a learning rate of 0.1 when using the Tanh activation function, which is expected. The best values of the main hyperparameters are summarized in Table 4.
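The sketch below illustrates such a grid search, assuming the cross_validate helper from the earlier sketch; the candidate activation functions and learning rates follow Table 5, while the layer widths remain illustrative assumptions.

```python
# Grid search over activation function and learning rate, scored by
# cross-validated accuracy (reuses cross_validate from the earlier sketch).
import tensorflow as tf


def build_configurable_dnn(input_dim, activation, learning_rate):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(256, activation=activation,
                                    kernel_initializer="glorot_uniform",
                                    input_shape=(input_dim,)))
    for units in (128, 64, 32):
        model.add(tf.keras.layers.Dense(units, activation=activation,
                                        kernel_initializer="glorot_uniform"))
    model.add(tf.keras.layers.Dense(2, activation="softmax"))
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model


def grid_search(X, y):
    best_config, best_acc = None, 0.0
    for activation in ("relu", "sigmoid", "tanh"):
        for lr in (0.1, 0.2, 0.3):
            acc = cross_validate(X, y, lambda d: build_configurable_dnn(d, activation, lr))
            if acc > best_acc:
                best_config, best_acc = (activation, lr), acc
    return best_config, best_acc
```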

Next, we conducted numerous experiments to evaluate the DNN model's performance by varying the number of training epochs. The findings are illustrated in Figs. 6 and 7. The data shows that the error rate consistently decreases as training epochs increase. For instance, in Fig. 6, which represents the Full Transcript dataset, the DNN model started with an error loss of 1.287 at the initial epoch, steadily dropping to 0.004 by the 50th epoch. Similarly, Fig. 7 shows the results for the Mature mRNA dataset, where the initial error loss was 1.241 and reduced to 0.003 after the 30th epoch. From these results, we can conclude that 50 epochs for the Full Transcript dataset and 30 epochs for the Mature mRNA dataset are optimal, as the error rates stabilize at these points. The optimal configuration derived from this analysis is summarized in Table 4.

Fig. 6 Error loss on the Full Transcript dataset using the Tanh activation function

Fig. 7 Error loss on the Mature mRNA dataset using the Tanh activation function

Performance analysis using sequence formulation techniques

In this section, we evaluate the proposed model using the different sequence formulation methods summarized in Table 6 on the Full Transcript dataset. As shown in Table 6, the best performance was obtained when the hybrid (combined) features were used rather than the individual feature methods. With the hybrid features alone, the model achieved an accuracy of 89.03%, a sensitivity of 89.12%, a specificity of 88.92%, and an MCC of 0.798; these metrics were obtained before any feature selection. To improve the model's overall performance, we then applied feature selection to reduce the dimensionality of the hybrid feature vector. This adjustment led to significant gains: overall accuracy rose to 91.47%, sensitivity to 92.94%, specificity to 90.01%, and MCC to 0.830.

Table 6 Performance comparison using sequence formulation techniques and hybrid feature vector using full transcript dataset

Similarly, we evaluated the proposed model's performance with the different sequence formulation methods on the Mature mRNA dataset, as presented in Table 7. The results show that the model again performed best with the hybrid features, outperforming the individual formulation methods: it achieved an accuracy of 94.57%, a sensitivity of 95.04%, a specificity of 94.12%, and an MCC of 0.891. To further enhance performance, we applied the feature selection method to reduce the dimensionality of the hybrid features, which led to a noticeable improvement, with accuracy increasing to 95.86%, sensitivity to 95.15%, specificity to 96.58%, and MCC to 0.917.

Table 7 Performance comparison using sequence formulation techniques and hybrid feature vector using mature mRNA

Performance comparison of different classifiers

In this section, the performance of the proposed model is examined against several well-known supervised machine learning algorithms using the hybrid features. The algorithms considered for this comparison are Random Forest (RF) [63], Support Vector Machine (SVM), Logistic Regression (LR), Naive Bayes (NB), and K-Nearest Neighbor (KNN) [64]. Random Forest is an ensemble learning technique that builds several decision trees on bootstrap samples of the data and combines their decisions by voting to improve classification performance; it can be used for almost all classification and regression problems. K-Nearest Neighbor is a non-parametric learning algorithm widely used in image processing; it assigns instances to classes according to the distance to their neighbors, and its straightforward approach fits many problems. Support Vector Machines are especially effective with both linearly and nonlinearly separable data, searching for the best hyperplane to separate the classes; the method is widely used in bioinformatics because of its effectiveness on large datasets. Naive Bayes, derived from Bayes' theorem, is a probabilistic classifier that treats features as independent; it is particularly effective for text categorization, small datasets, and high-dimensional spaces. Additional information about the performance of each algorithm is provided in Table 8.
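For illustration, these baselines could be evaluated on the same hybrid features with scikit-learn as sketched below; the specific hyperparameters (tree count, neighbor count, kernel) are assumptions, not the settings used in the paper.

```python
# Baseline classifier comparison sketch on the hybrid features.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

baselines = {
    "LR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf", probability=True, random_state=0),
}

# for name, clf in baselines.items():
#     scores = cross_val_score(clf, X_train, y_train, cv=10, scoring="accuracy")
#     print(f"{name}: {scores.mean():.4f}")
```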

Table 8 Performance comparison of different classifiers using Full transcript dataset

Table 8 compares the different models on the Full Transcript dataset using various evaluation criteria. In terms of accuracy, the Logistic Regression (LR) model reached 88.45% with an MCC of 0.769. The NB, RF, and KNN models performed better, with accuracies of 89.09%, 89.73%, and 90.63% and MCC values of 0.782, 0.795, and 0.812, respectively. The SVM model recorded the highest accuracy among the traditional models at 91.19% with an MCC of 0.824. The proposed Deep-m5U model yielded the best overall performance, with an accuracy of 91.47% and an MCC of 0.830, demonstrating its higher efficiency compared to the other methods considered in the analysis.

In addition to the above performance measures, the proposed model was evaluated with the AUC (area under the ROC curve) metric, which is widely used to assess binary classifiers. Figure 8 shows the AUC results for the proposed model on the Full Transcript dataset. The AUC ranges between 0 and 1, and a higher value indicates better classifier performance [65]. The ROC curve plots the false positive rate (FPR) on the x-axis against the true positive rate (TPR) on the y-axis. The results were encouraging: the proposed model obtained an AUC of about 0.972 on the Full Transcript dataset, higher than the other machine learning algorithms, including NB, KNN, and SVM. This high AUC value confirms the model's ability to distinguish positive cases from negative ones.

Fig. 8 The efficiency of AUC using the Full Transcript dataset

Moreover, Table 9 presents the performance metrics of the various models on the Mature mRNA dataset. Among the traditional machine learning techniques, the SVM model had the highest accuracy of 94.44% with an MCC of 0.889. The proposed Deep-m5U outperformed all others, with the highest accuracy of 95.86% and an MCC of 0.917; for the other parameters, the F1 score was 95.52%, sensitivity 96.15%, and specificity 94.58%. The DNN model outperformed the SVM algorithm and the other traditional ML methods because those methods rely on a single processing layer and struggle with complex datasets that have high nonlinearity.

Table 9 Performance comparison of different classifiers using a mature mRNA dataset

In contrast, the DNN model utilizes multiple processing layers, allowing it to handle complex and nonlinear data more effectively. The AUC values of the proposed method on the Mature mRNA dataset are illustrated in Fig. 9. The AUC graph visually represents the model's performance, with a larger area under the curve indicating better performance and a smaller area indicating reduced effectiveness. The proposed model achieved an AUC of 0.981 on the Mature mRNA dataset, higher than the other machine-learning algorithms.

Fig. 9 The efficiency of AUC using the Mature mRNA dataset

Comparison of proposed predictor with existing predictors

We evaluated the performance of the proposed Deep-m5U model against the existing models on both the Full Transcript and Mature mRNA datasets; the comparison is summarized in Table 10. On the Full Transcript dataset, m5UPred [28] achieved an accuracy of around 83.60%, a sensitivity of 72.82%, a specificity of 89.38%, and an MCC of 0.634. The m5U-SVM model [29] produced better outcomes, with an accuracy of 88.88%, a sensitivity of 81.23%, a specificity of 92.95%, and an MCC of 0.753. The proposed Deep-m5U model is more effective than both, with an accuracy of 91.47%, a sensitivity of 92.94%, a specificity of 90.01%, and an MCC of 0.830. For the Mature mRNA dataset, the m5UPred model [28] reached an accuracy of 89.91%, with a sensitivity of 88.64%, a specificity of 91.18%, and an MCC of 0.798. Among the existing predictors, the m5U-SVM model [29] performed best, achieving an accuracy of 94.36%, a sensitivity of 92.98%, a specificity of 95.74%, and an MCC of 0.888. The highest results, however, were registered by the Deep-m5U model, with an accuracy of 95.86%, a sensitivity of 96.15%, a specificity of 94.58%, and an MCC of 0.917, indicating better recognition capability than the other models on both datasets.

Table 10 The proposed predictor is compared to existing predictors on the training dataset

To further examine the proposed model's generalization, its performance on the independent datasets is compared with existing models in Table 11. From Table 11, the m5U-SVM predictor achieved an accuracy of 90.82% and an MCC of 0.809 on the Full Transcript dataset, while the proposed Deep-m5U model significantly outperformed it with an accuracy of 92.94% and an MCC of 0.831. Similarly, for the Mature mRNA dataset, m5U-SVM achieved a lower accuracy of 94.11%, whereas the proposed Deep-m5U model again showed superior results with an accuracy of 95.17%.

Table 11 The performance of the proposed model compared to the existing models on the independent datasets

Performance evaluation by cross-technique and cross-cell-type validation

This section assesses cross-technique validation (tenfold) and cross-cell-type validation on the benchmark dataset used in [31]. We first categorized the experimentally verified m5U sites according to their profiling methodologies, namely miCLIP and FICC-seq, and the cell line, namely HEK293. The outcomes are shown in Table 12, which provides a comparative performance study against the existing models m5UPred [28] and m5U-GEPred [31] in two assessment scenarios: cross-validation and independent testing. In cross-validation, the proposed Deep-m5U model consistently attains the highest average accuracies, i.e., 97.28% and 94.17%, surpassing m5UPred and m5U-GEPred; m5U-GEPred demonstrates robust performance, but its accuracy is somewhat lower than Deep-m5U's. Similarly, Deep-m5U again exhibits the higher average accuracies in the independent testing results, i.e., 90.10% and 77.30%. m5U-GEPred is more accurate than m5UPred, although both models show a marginal decrease in performance compared to the cross-validation scenario. Overall, the proposed Deep-m5U model shows strong accuracy and generalization capability under both testing methodologies.

Table 12 Performance comparison on the cross-cell-type and cross-technique validation

Conclusions

5-Methyluridine (m5U) is a prominent RNA modification critical to various biological functions and disease pathogenesis. Because it is a post-transcriptional modification involving methylation at the C5 position of uridine, precise computational tools are needed to understand its biological significance. This study introduced a novel deep learning-based model, Deep-m5U, designed to accurately predict RNA m5U sites by leveraging hybrid features and evaluated with tenfold cross-validation and independent datasets. The model effectively addressed over-fitting by optimizing its hyperparameters and demonstrated robust performance, achieving accuracies of 91.47% and 95.86% on the Full Transcript and Mature mRNA datasets, respectively. Additionally, the model attained 92.94% and 95.17% accuracy on the independent test samples, outperforming traditional machine learning methods and existing state-of-the-art approaches. These promising results highlight Deep-m5U's potential to contribute to further research on RNA modifications and their implications in disease, particularly in areas such as stress response and breast cancer, as well as its utility in developing therapeutic strategies.

In the future, our research will be extended in three key directions: (a) developing a publicly accessible web server for large-scale RNA 5-methyluridine (m5U) prediction, enabling broader community access and application [66, 67]; (b) implementing a multi-view feature encoding strategy to better capture the properties of biological sequences [68,69,70]; and (c) exploring additional deep learning models to improve the identification of m5U modification sites in RNA sequences [71].

Availability of data and materials

The datasets used and analyzed during the current study are available on the GitHub link: https://github.com/salman-khan-mrd/Deep-m5U.

References

  1. Khanal J, Tayara H, Zou Q, Chong KT. Identifying DNA N4-methylcytosine sites in the rosaceae genome with a deep learning model relying on distributed feature representation. Comput Struct Biotechnol J. 2021;19:1612–9.


  2. El Allali A, Elhamraoui Z, Daoud R. Machine learning applications in RNA modification sites prediction. Comput Struct Biotechnol J. 2021;19:5510–24.


  3. Chou K-C. Progresses in predicting post-translational modification. Int J Pept Res Ther. 2020;26(2):873–88.


  4. Jonkhout N, Tran J, Smith MA, Schonrock N, Mattick JS, Novoa EM. The RNA modification landscape in human disease. RNA. 2017;23(12):1754–69.


  5. da Silva Oliveira JP, de Oliveira RT, Guedes AL, da Costa OM, Macedo AF. Metabolomic studies of anthocyanins in fruits by means of a liquid chromatography coupled to mass spectrometry workflow. Current Plant Biology. 2022;32:100260.


  6. Khan S, Khan MA, Khan M, Iqbal N, AlQahtani SA, Al-Rakhami MS, Khan DM. Optimized feature learning for anti-inflammatory peptide prediction using parallel distributed computing. Appl Sci. 2023;13(12):7059.


  7. Su D, Chan CT, Gu C, Lim KS, Chionh YH, McBee ME, Russell BS, Babu IR, Begley TJ, Dedon PC. Quantitative analysis of ribonucleoside modifications in tRNA by HPLC-coupled mass spectrometry. Nat Protoc. 2014;9(4):828–41.


  8. Sarkar A, Gasperi W, Begley U, Nevins S, Huber SM, Dedon PC, Begley TJ. Detecting the epitranscriptome. Wiley Interdiscip Rev RNA. 2021;12(6):e1663.


  9. Xiao S, Cao S, Huang Q, Xia L, Deng M, Yang M, Jia G, Liu X, Shi J, Wang W. The RNA N 6-methyladenosine modification landscape of human fetal tissues. Nat Cell Biol. 2019;21(5):651–61.


  10. Laptev I, Shvetsova E, Levitskii S, Serebryakova M, Rubtsova M, Bogdanov A, Kamenski P, Sergiev P, Dontsova O. Mouse Trmt2B protein is a dual specific mitochondrial metyltransferase responsible for m5U formation in both tRNA and rRNA. RNA Biol. 2020;17(4):441–50.


  11. Powell CA, Minczuk M. TRMT2B is responsible for both tRNA and rRNA m5U-methylation in human mitochondria. RNA Biol. 2020;17(4):451–62.


  12. Carter J-M, Emmett W, Mozos IR, Kotter A, Helm M, Ule J, Hussain S. FICC-Seq: a method for enzyme-specified profiling of methyl-5-uridine in cellular RNA. Nucleic Acids Res. 2019;47(19):e113–e113.


  13. Khan F, Khan M, Iqbal N, Khan S, Muhammad Khan D, Khan A, Wei D-Q. Prediction of recombination spots using novel hybrid feature extraction method via deep learning approach. Front Genet. 2020;11:539227.


  14. Inayat N, Khan M, Iqbal N, Khan S, Raza M, Khan DM, Khan A, Wei DQ. iEnhancer-DHF: identification of enhancers and their strengths using optimize deep neural network with multiple features extraction methods. Ieee Access. 2021;9:40783–96.


  15. Ahmad W, Ahmad A, Iqbal A, Hamayun M, Hussain A, Rehman G, Khan S, Khan UU, Khan D, Huang L. Intelligent hepatitis diagnosis using adaptive neuro-fuzzy inference system and information gain method. Soft Comput. 2019;23:10931–8.


  16. Chen K, Wei Z, Zhang Q, Wu X, Rong R, Lu Z, Su J, De Magalhães JP, Rigden DJ, Meng J. Whistle: a high-accuracy map of the human n 6-methyladenosine (m6a) epitranscriptome predicted using a machine learning approach. Nucleic Acids Res. 2019;47(7):e41–e41.


  17. Zhou Y, Zeng P, Li Y-H, Zhang Z, Cui Q. SRAMP: prediction of mammalian N6-methyladenosine (m6A) sites based on sequence-derived features. Nucleic Acids Res. 2016;44(10):e91–e91.


  18. Chen W, Feng P, Ding H, Lin H, Chou K-C. iRNA-Methyl: Identifying N6-methyladenosine sites using pseudo nucleotide composition. Anal Biochem. 2015;490:26–33.


  19. Ma J, Zhang L, Chen S, Liu H. A brief review of RNA modification related database resources. Methods. 2022;203:342–53.


  20. Wang X, Zhang Y, Chen K, Liang Z, Ma J, Xia R, de Magalhães JP, Rigden DJ, Meng J, Song B. m7GHub V2. 0: an updated database for decoding the N7-methylguanosine (m7G) epitranscriptome. Nucleic Acids Res. 2024;52:D203–12.


  21. Zhang Y, Jiang J, Ma J, Wei Z, Wang Y, Song B, Meng J, Jia G, De Magalhães JP, Rigden DJ. DirectRMDB: a database of post-transcriptional RNA modifications unveiled from direct RNA sequencing technology. Nucleic Acids Res. 2023;51(D1):D106–16.


  22. Cappannini A, Ray A, Purta E, Mukherjee S, Boccaletto P, Moafinejad SN, Lechner A, Barchet C, Klaholz BP, Stefaniak F. MODOMICS: a database of RNA modifications and related information. 2023 update. Nucleic Acids Res. 2024;52:D239–44.


  23. Song B, Chen K, Tang Y, Wei Z, Su J, De Magalhães JP, Rigden DJ, Meng J. ConsRM: collection and large-scale prediction of the evolutionarily conserved RNA methylation sites, with implications for the functional epitranscriptome. Brief Bioinf. 2021;22:bbab088.


  24. Feng P, Chen W. iRNA-m5U: a sequence based predictor for identifying 5-methyluridine modification sites in saccharomyces cerevisiae. Methods. 2022;203:28–31.


  25. Li Z, Mao J, Huang D, Song B, Meng J. RNADSN: transfer-learning 5-Methyluridine (m5U) modification on mRNAs from common features of tRNA. Int J Mol Sci. 2022;23(21):13493.


  26. Li Y-H, Zhang G, Cui Q. PPUS: a web server to predict PUS-specific pseudouridine sites. Bioinformatics. 2015;31(20):3362–4.


  27. Li J, Huang Y, Yang X, Zhou Y, Zhou Y. RNAm 5Cfinder: a web-server for predicting RNA 5-methylcytosine (m5C) sites based on random forest. Sci Rep. 2018;8(1):17299.


  28. Jiang J, Song B, Tang Y, Chen K, Wei Z, Meng J. m5UPred: a web server for the prediction of RNA 5-methyluridine sites from sequences. Mol Therapy-Nucleic Acids. 2020;22:742–7.


  29. Ao C, Ye X, Sakurai T, Zou Q, Yu L. m5U-SVM: identification of RNA 5-methyluridine modification sites based on multi-view features of physicochemical features and distributed representation. BMC Biol. 2023;21(1):93.


  30. Yu L, Zhang Y, Xue L, Liu F, Jing R, Luo J. Evaluation and development of deep neural networks for RNA 5-Methyluridine classifications using autoBioSeqpy. Front Microbiol. 2023;14:1175925.


  31. Xu Z, Wang X, Meng J, Zhang L, Song B. m5U-GEPred: prediction of RNA 5-methyluridine sites based on sequence-derived and graph embedding features. Front Microbiol. 2023;14:1277099.


  32. Qiyas M, Naeem M, Khan N, Khan S, Khan F. Confidence levels bipolar complex fuzzy aggregation operators and their application in decision making problem. IEEE Access. 2024.

  33. Khan S, Khan M, Iqbal N, Dilshad N, Almufareh MF, Alsubaie N. Enhancing sumoylation site prediction: a deep neural network with discriminative features. Life. 2023;13(11):2153.


  34. Khan S, Uddin I, Khan M, Iqbal N, Alshanbari HM, Ahmad B, Khan DM. Sequence based model using deep neural network and hybrid features for identification of 5-hydroxymethylcytosine modification. Sci Rep. 2024;14(1):9116.


  35. Khan S, AlQahtani SA, Noor S, Ahmad N. PSSM-Sumo: deep learning based intelligent model for prediction of sumoylation sites using discriminative features. BMC Bioinf. 2024;25(1):284.


  36. Shen H-B, Chou K-C. PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition. Anal Biochem. 2008;373(2):386–8.


  37. Liu B, Liu F, Fang L, Wang X, Chou K-C. repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects. Bioinformatics. 2015;31(8):1307–9.


  38. Lin H, Deng E-Z, Ding H, Chen W, Chou K-C. iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res. 2014;42(21):12961–72.


  39. Chen W, Feng P-M, Lin H, Chou K-C. ISS-PseDNC: identifying splicing sites using pseudo dinucleotide composition. Biomed Res Int. 2014;2014(1):623149.


  40. Liu B, Wu H, Chou K-C. Pse-in-One 2.0: an improved package of web servers for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nat Sci. 2017;9(04):67.


  41. Chen W, Lin H, Chou K-C. Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences. Mol BioSyst. 2015;11(10):2620–34.


  42. Ahmad A, Akbar S, Tahir M, Hayat M, Ali F. iAFPs-EnC-GA: identifying antifungal peptides using sequential and evolutionary descriptors based multi-information fusion and ensemble learning approach. Chemom Intell Lab Syst. 2022;222:104516.


  43. Akbar S, Ali F, Hayat M, Ahmad A, Khan S, Gul S. Prediction of antiviral peptides using transform evolutionary and SHAP analysis based descriptors by incorporation with ensemble learning strategy. Chemom Intell Lab Syst. 2022;230:104682.


  44. Zhu Z, Albadawy E, Saha A, Zhang J, Harowicz MR, Mazurowski MA. Deep learning for identifying radiogenomic associations in breast cancer. Comput Biol Med. 2019;109:85–90.


  45. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Commun ACM. 2017;60(6):84–90.


  46. Khan S, Khan M, Iqbal N, Hussain T, Khan SA, Chou K-C. A two-level computation model based on deep learning algorithm for identification of piRNA and their functions via Chou’s 5-steps rule. Int J Pept Res Ther. 2020;26:795–809.


  47. Raza A, Uddin J, Almuhaimeed A, Akbar S, Zou Q, Ahmad A. AIPs-SnTCN: predicting anti-inflammatory peptides using fastText and transformer encoder-based hybrid word embedding with self-normalized temporal convolutional networks. J Chem Inf Model. 2023;63(21):6537–54.


  48. Akbar S, Raza A, Al Shloul T, Ahmad A, Saeed A, Ghadi YY, Mamyrbayev O, Tag-Eldin E. PAtbP-EnC: identifying anti-tubercular peptides using multi-feature representation and genetic algorithm-based deep ensemble model. IEEE Access. 2023;11:137099–114.


  49. Akbar S, Raza A, Zou Q. Deepstacked-AVPs: predicting antiviral peptides using tri-segment evolutionary profile and word embedding based multi-perspective features with deep stacking model. BMC Bioinf. 2024;25(1):102.


  50. Akbar S, Zou Q, Raza A, Alarfaj FK. iAFPs-Mv-BiTCN: predicting antifungal peptides using self-attention transformer embedding and transform evolutionary based multi-view features with bidirectional temporal convolutional networks. Artif Intell Med. 2024;151:102860.


  51. Ullah M, Akbar S, Raza A, Zou Q. DeepAVP-TPPred: identification of antiviral peptides using transformed image-based localized descriptors and binary tree growth algorithm. Bioinformatics. 2024;40(5):btae305.

  52. Raza A, Alam W, Khan S, Tahir M, Chong KT. iPro-TCN: prediction of DNA promoters recognition and their strength using temporal convolutional network. IEEE Access. 2023;11:66113–21.

  53. Raza A, Uddin J, Akbar S, Alarfaj FK, Zou Q, Ahmad A. Comprehensive analysis of computational methods for predicting anti-inflammatory peptides. Arch Comput Methods Eng. 2024. https://doi.org/10.1007/s11831-024-10078-7.

  54. Raza A, Uddin J, Zou Q, Akbar S, Alghamdi W, Liu R. AIPs-DeepEnC-GA: predicting anti-inflammatory peptides using embedded evolutionary and sequential feature integration with genetic algorithm based deep ensemble model. Chemom Intell Lab Syst. 2024;254:105239.

  55. Akbar S, Ahmad A, Hayat M, Rehman AU, Khan S, Ali F. iAtbP-Hyb-EnC: prediction of antitubercular peptides via heterogeneous feature representation and genetic algorithm based ensemble learning model. Comput Biol Med. 2021;137:104778.

  56. Ahmad A, Akbar S, Hayat M, Ali F, Khan S, Sohail M. Identification of antioxidant proteins using a discriminative intelligent model of k-space amino acid pairs based descriptors incorporating with ensemble feature selection. Biocybern Biomed Eng. 2022;42(2):727–35.

  57. Obadi A, AlHarbi A, Abdel-Razzak H, Al-Omran A. Biochar and compost as soil amendments: effect on sweet pepper (Capsicum annuum L.) growth under partial root zone drying irrigation. Arab J Geosci. 2020;13(13):508.

  58. Elsisi M, Mahmoud K, Lehtonen M, Darwish MM. Reliable Industry 4.0 based on machine learning and IoT for analyzing, monitoring, and securing smart meters. Sensors. 2021;21(2):487.

  59. Uddin I, Awan HH, Khalid M, Khan S, Akbar S, Sarker MR, Abdolrasol MG, Alghamdi TA. A hybrid residue based sequential encoding mechanism with XGBoost improved ensemble model for identifying 5-hydroxymethylcytosine modifications. Sci Rep. 2024;14(1):20819.

  60. Akbar S, Hayat M, Tahir M, Khan S, Alarfaj FK. cACP-DeepGram: classification of anticancer peptides via deep neural network and skip-gram-based word embedding model. Artif Intell Med. 2022;131:102349.

  61. Akbar S, Khan S, Ali F, Hayat M, Qasim M, Gul S. iHBP-DeepPSSM: identifying hormone binding proteins using PsePSSM based evolutionary features and deep learning approach. Chemom Intell Lab Syst. 2020;204:104103.

  62. Ahmad A, Akbar S, Khan S, Hayat M, Ali F, Ahmed A, Tahir M. Deep-AntiFP: prediction of antifungal peptides using distanct multi-informative features incorporating with deep neural networks. Chemom Intell Lab Syst. 2021;208:104214.

  63. Fawagreh K, Gaber MM, Elyan E. Random forests: from early developments to recent advancements. Syst Sci Control Eng. 2014;2(1):602–9.

  64. Cheng D, Zhang S, Deng Z, Zhu Y, Zong M. kNN algorithm with data-driven k value. In: Advanced Data Mining and Applications: 10th International Conference, ADMA 2014, Guilin, China, December 19–21, 2014, Proceedings. Springer; 2014. p. 499–512.

  65. Zhou G-P, Chen D, Liao S, Huang R-B. Recent progresses in studying helix-helix interactions in proteins by incorporating the Wenxiang diagram into the NMR spectroscopy. Curr Top Med Chem. 2016;16(6):581–90.

  66. Arif M, Fang G, Fida H, Musleh S, Yu D-J, Alam T. iMRSAPred: improved prediction of Anti-MRSA peptides using physicochemical and pairwise contact-energy properties of amino acids. ACS Omega. 2024;9(2):2874–83.

  67. Arif M, Fang G, Ghulam A, Musleh S, Alam T. DPI_CDF: druggable protein identifier using cascade deep forest. BMC Bioinf. 2024;25(1):145.

  68. Ge F, Arif M, Yan Z, Alahmadi H, Worachartcheewan A, Yu D-J, Shoombuatong W. MMPatho: leveraging multilevel consensus and evolutionary information for enhanced missense mutation pathogenic prediction. J Chem Inf Model. 2023;63(22):7239–57.

  69. Hu J, Zeng W-W, Jia N-X, Arif M, Yu D-J, Zhang G-J. Improving DNA-binding protein prediction using three-part sequence-order feature extraction and a deep neural network algorithm. J Chem Inf Model. 2023;63(3):1044–57.

  70. Hu J, Chen K-X, Rao B, Ni J-Y, Thafar MA, Albaradei S, Arif M. Protein-peptide binding residue prediction based on protein language models and cross-attention mechanism. Anal Biochem. 2024;694:115637.

  71. Sikander R, Arif M, Ghulam A, Worachartcheewan A, Thafar MA, Habib S. Identification of the ubiquitin–proteasome pathway domain by hyperparameter optimization based on a 2D convolutional neural network. Front Genet. 2022;13:851688.

Acknowledgements

This work was supported by the Research Supporting Project Number (RSPD2024R585), King Saud University, Riyadh, Saudi Arabia.

Funding

This research received no external funding.

Author information

Authors and Affiliations

Authors

Contributions

All authors contributed equally to this work. SK and SN wrote the main manuscript. SAQ and NA were responsible for debugging the code and providing the datasets. AN, HHA, and WA conducted the independent testing and reviewed the manuscript for grammar.

Corresponding author

Correspondence to Nijad Ahmad.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that there is no conflict of interest regarding the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Noor, S., Naseem, A., Awan, H.H. et al. Deep-m5U: a deep learning-based approach for RNA 5-methyluridine modification prediction using optimized feature integration. BMC Bioinformatics 25, 360 (2024). https://doi.org/10.1186/s12859-024-05978-1

Keywords