MSCAN: multi-scale self- and cross-attention network for RNA methylation site prediction

Background: Epi-transcriptome regulation through post-transcriptional RNA modifications is essential for all RNA types. Precise recognition of RNA modifications is critical for understanding their functions and regulatory mechanisms. However, wet-lab experimental methods are often costly and time-consuming, limiting their wide application. Recent research has therefore focused on developing computational methods, particularly deep learning (DL). Bidirectional long short-term memory (BiLSTM), convolutional neural networks (CNNs), and the Transformer have all achieved success in modification site prediction. However, BiLSTM cannot be parallelized, leading to long training times; CNNs cannot learn long-distance dependencies within a sequence; and the Transformer lacks information interaction between sequences at different scales. These limitations motivate continued research in natural language processing (NLP) and DL to devise an enhanced prediction framework that effectively addresses these challenges. Results: This study presents a multi-scale self- and cross-attention network (MSCAN) that identifies RNA methylation sites using NLP and DL. Experiments on twelve RNA modification types (m6A, m1A, m5C, m5U, m6Am, m7G, Ψ, I, Am, Cm, Gm, and Um) show that MSCAN attains areas under the receiver operating characteristic curve of 98.34%, 85.41%, 97.29%, 96.74%, 99.04%, 79.94%, 76.22%, 65.69%, 92.92%, 92.03%, 95.77%, and 89.66%, respectively, outperforming state-of-the-art prediction models and indicating strong generalization capability. Furthermore, MSCAN reveals a strong association among different types of RNA modifications from an experimental perspective. A user-friendly web server for predicting the twelve widely occurring human RNA modification sites is available at http://47.242.23.141/MSCAN/index.php.
Conclusions: A predictor framework has been developed through binary classification to predict RNA methylation sites.

In the last decade, dozens of experimental methods have been developed to identify the precise location of methylation sites on RNA, such as miCLIP [6], m1A-seq [7], PA-m6A-seq [8], m1A-ID-seq [9], m5C-RIP [10], m1A-MAP [11], and m1A-IP-seq [12]. Despite their effectiveness, these experimental techniques are usually both time-consuming and costly, limiting their use in different biological contexts [4] and making them inadequate for large-scale genomic data [13]. Consequently, there is strong motivation to explore computational methods that can accurately and efficiently identify methylation sites from sequence information alone.
As more base-resolution datasets have become available, researchers have designed computational methods for RNA modification site prediction. These approaches formulate RNA methylation identification as a binary prediction task, training machine learning models to distinguish truly methylated from non-methylated sites. Such computational methods have been powerful additions to RNA methylation site prediction.
Traditional sequence-based prediction methods usually first extract features using human-designed feature schemes and then apply a classifier to decide whether a site is methylated based on those features. Specifically, RAMPred [14] adopts the support vector machine (SVM) to predict m1A modification sites, extracting features based on nucleotide composition (NC) and nucleotide chemical properties (NCP). iRNA-3typeA [15] adopts SVM to predict m1A, A-to-I, and m6A modification sites, extracting features based on accumulated nucleotide frequency (ANF) and NCP. iMRM [16] extracts features based on NCP, NC, nucleotide density (ND), dinucleotide physicochemical properties (DPCP), and dinucleotide binary encoding (DBE), and employs XGBoost to predict m1A, m6A, m5C, Ψ, and A-to-I modification sites. Because these sequence features are hand-crafted, important features of the sequences are inevitably missed due to human cognitive limitations.
Analyzing biological sequences and interpreting biological information are key challenges in achieving biological discovery. The application of natural language processing (NLP) to sequence analysis has attracted considerable attention in processing biological sequences [17]. Since biological sequences can be considered sentences and k-mer subsequences regarded as words [18, 19], NLP can be used to understand the structure and function encoded in these sequences [17]. Unlike traditional machine learning, deep learning (DL) methods follow an end-to-end design: features are extracted directly from the input sequence for the final labeling/prediction task. For example, EDLm6Apred [20] employs bidirectional long short-term memory (BiLSTM) to predict m6A sites, extracting features based on Word2vec, RNA word embedding [21], and one-hot encoding [22, 23]. However, LSTM, BiLSTM, and RNN cannot be parallelized, leading to long training times.
CNN can achieve parallel computation and learn local dependencies. For instance, m6A-word2vec [24] adopts CNN to identify m6A sites, extracting features based on Word2vec. Deeppromise [25] employs CNN to identify m1A and m6A sites, extracting features based on integrated enhanced nucleic acid composition (ENAC) [26], one-hot encoding, and RNA word embedding. However, these CNN structures only consider the contextual relationships of neighboring bases, ignoring long-distance dependencies in the sequence. DeepM6ASeq [27] combines the advantages of CNN and BiLSTM by using two CNN layers and one BiLSTM layer to predict m6A sites; this approach may extract redundant features that interfere with prediction performance [28]. The attention mechanism can quantify the degree of code-to-code dependency [29] and can therefore capture the focused codes that affect the classification results. Plant6mA [30] utilizes a Transformer encoder to determine whether an input sequence contains an m6A site. However, owing to the feature representation of Transformers, these networks are typically employed at a single scale. Although a single-scale self-attention mechanism can focus on essential features of the sequence context, it lacks information interaction between sequences at different scales, making it difficult to learn complex contextual relationships between words.
At present, most prediction studies focus on a single methylation modification, and few share the same binary classification framework across different modifications. Even fewer cross-modification validation studies have been performed, in which trained models are evaluated against test sets of other methylation types. Given potential interactions between various RNA modifications, it is of interest to use the same model to conduct cross-modification validation across different methylation test sets.
We present the multi-scale self- and cross-attention network (MSCAN), a novel approach designed to identify RNA methylation sites that addresses the challenges of current methods. Our model supports twelve RNA modification types: m6A, Ψ, m1A, m6Am, Am, Cm, m7G, Gm, Um, I, m5U, and m5C.
MSCAN employs a multi-scale approach for analyzing RNA sequences. First, from each input 41-nucleotide (nt) sample sequence, we extract smaller subsequences centered on the sequence midpoint at two additional scales, 21-nt and 31-nt; this multi-scale view provides a more comprehensive understanding of the RNA sequence context and ultimately improves prediction performance. Second, Word2vec is used to encode the three sets of sequences. Third, positional information is added to the three sets of sequences, since nucleotide positions in the sequence are correlated. Fourth, the three sets of sequences are fed into the encoding module, constructed from a multi-scale self- and cross-attention network and a feed-forward network (FFN), to extract features that contribute to methylation site prediction. Finally, predicted methylation probabilities are obtained through a linear layer and the sigmoid function. The findings demonstrate that MSCAN surpasses state-of-the-art methods, including m6A-word2vec, DeepM6ASeq, and Plant6mA, in independent tests. A user-friendly web server for MSCAN is available at http://47.242.23.141/MSCAN/index.php.

Evaluation metrics
In this study, we used eight common classification indicators to evaluate the model's predictions: Accuracy (Acc), Sensitivity (Sen), Precision (Pre), Matthews correlation coefficient (MCC), Specificity (Sp), and F1 score (F1), together with two curve-based measures. With true positives, true negatives, false positives, and false negatives denoted TP, TN, FP, and FN, these metrics take their standard forms: Acc = (TP + TN)/(TP + TN + FP + FN), Sen = TP/(TP + FN), Sp = TN/(TN + FP), Pre = TP/(TP + FP), F1 = 2 × Pre × Sen/(Pre + Sen), and MCC = (TP × TN − FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN)). Moreover, the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) are used to evaluate the model's overall performance visually.
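The threshold-based metrics above can be sketched directly from the confusion-matrix counts; the function name and example counts below are illustrative, not from the paper.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute the standard binary-classification metrics from TP/TN/FP/FN."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)                      # Sensitivity (recall)
    sp = tn / (tn + fp)                       # Specificity
    pre = tp / (tp + fp)                      # Precision
    f1 = 2 * pre * sen / (pre + sen)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Acc": acc, "Sen": sen, "Sp": sp, "Pre": pre, "F1": f1, "MCC": mcc}

# Toy confusion matrix: 80 TP, 70 TN, 30 FP, 20 FN
print(classification_metrics(tp=80, tn=70, fp=30, fn=20))
```

AUROC and AUPRC, by contrast, are computed from the full ranking of predicted probabilities rather than a single threshold.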

Results analysis
MSCAN model training and experimental parameter optimization were carried out on the dataset of Chen et al. [25], and the model's generalization ability was subsequently evaluated on the dataset of Song et al. [5]. As shown in Table 1, the 11-nt input sequence yielded the worst performance, possibly because too few bases in an 11-nt sequence hamper feature extraction. The 41-nt input sequence obtained the best average performance across all modifications; notably, a 41-nt input is also optimal for the XGBoost and SVM methods [14, 16]. We therefore chose 21-nt, 31-nt, and 41-nt RNA sequences as inputs and evaluated different combinations of them. The order in which input sequences at different scales are combined is an important parameter affecting the performance of the trained model. Table 2 shows the performance of the MSCAN model with different input-sequence combinations on the training data; MSCAN shows the best prediction performance with the "21-nt + 41-nt + 31-nt" combination. Under the MSCAN design, these input sequences drive three attention computations: a self-attention computation on the 21-nt sequence and cross-attention computations on the "21-nt + 41-nt" and "21-nt + 31-nt" combined sequences.

Comparison analysis of different feature encoding methods
In this section, we evaluate the performance of three distinct feature encoding methods (Word2vec, One-hot, and ENAC) using the same MSCAN model to predict m1A sites on the test data of Chen et al. The outcomes of this comparison, presented in Fig. 1 and Table 3, demonstrate that Word2vec consistently surpasses the other two encoding methods across all performance indices.
The superior performance of Word2vec can be attributed to the limitations of the One-hot and ENAC encoding methods. While One-hot encoding focuses on the local information of individual bases, and ENAC encoding considers both nucleic acid composition and position information, both methods neglect the semantic information inherent in the sequence context. In contrast, Word2vec prioritizes the contextual relationships between bases, resulting in a more effective representation of the sequence. Our findings highlight the importance of selecting appropriate feature encoding methods for improved prediction accuracy, with Word2vec emerging as a particularly advantageous choice for the MSCAN model in the context of RNA methylation site prediction.

Comparison with different variants of the MSCAN model
SAN serves as the baseline model in this comparison. Upon integration of cross-attention modules, the area under the precision-recall curve (AUPRC) of the SCAN and MSCAN models increased by 0.09% and 2.86%, respectively. These results highlight the importance of incorporating cross-attention mechanisms within MSCAN for improved performance in predicting RNA methylation sites. Consequently, our findings emphasize the value of the multi-scale self- and cross-attention approach employed by MSCAN in advancing the understanding of RNA modifications and their functional implications.

Comparison with state-of-the-art approaches
We compared MSCAN with several state-of-the-art models, including m6A-word2vec, DeepM6ASeq, and Plant6mA. To ensure robust evaluation, we employed fivefold cross-validation on the training data of Chen et al. As shown in Fig. 3 and Table 5, MSCAN outperforms the other models, substantially improving prediction accuracy.
In particular, MSCAN achieves a 4.84% improvement in AUPRC over the second-best model, Plant6mA. This superior performance can be attributed to the multi-scale self- and cross-attention mechanisms in MSCAN, as opposed to the single-scale self-attention mechanism employed by Plant6mA. The results underscore the effectiveness of MSCAN in identifying RNA methylation sites.
Next, we compare MSCAN with other state-of-the-art models on the test data of Chen et al. The results, illustrated in Fig. 4 and summarized in Table 6, demonstrate the superior performance of MSCAN in predicting RNA methylation sites. MSCAN outperforms DeepM6ASeq and m6A-word2vec by 1.47% and 2.4% in AUPRC, respectively. This enhanced performance can be attributed to the ability of the multi-scale self- and cross-attention network to capture meaningful sequence encodings for more accurate classification. Furthermore, MSCAN surpasses Plant6mA by 2.14% in AUPRC, which further suggests the limitations of the single-scale self-attention mechanism in learning complex contextual relationships between sequence elements. Integrating the cross-attention mechanism enables the model to discern deeper sequence meanings, improving its performance.

Assessing model reliability
To evaluate the reliability of our proposed model, we performed one hundred replications of the experiments using the test data from Chen et al., evaluating the m6A-word2vec, DeepM6ASeq, Plant6mA, and MSCAN models. In each replication, we used the same test data and ran each model under identical conditions to ensure experimental consistency.
To evaluate the statistical significance of AUPRC differences between methods, we employed Student's t-test [31], which determines whether performance differences between methods are significant. Theoretically, the self- and cross-attention mechanism employed by the MSCAN model enables it to capture long-range dependencies and complex interactions between input features more effectively than models such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs). This characteristic is particularly advantageous in discerning biologically relevant patterns in methylation site prediction, which may contribute to the model's enhanced generalizability.
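The paper uses Student's t-test for this comparison. As a dependency-free illustration of the same idea, the sketch below runs a paired permutation test on the mean AUPRC difference over replications; all values and names are invented for illustration and are not the paper's results.

```python
import random

def permutation_pvalue(a, b, n_perm=10000, seed=0):
    """Two-sided p-value for the mean paired difference under random sign flips."""
    rng = random.Random(seed)
    observed = abs(sum(x - y for x, y in zip(a, b)) / len(a))
    count = 0
    for _ in range(n_perm):
        # Randomly flip the sign of each paired difference
        diff = sum((x - y) if rng.random() < 0.5 else (y - x)
                   for x, y in zip(a, b))
        if abs(diff / len(a)) >= observed:
            count += 1
    return count / n_perm

# Hypothetical per-replication AUPRC values for two models
mscan_auprc = [0.91, 0.92, 0.90, 0.93, 0.91, 0.92]
plant6ma_auprc = [0.88, 0.89, 0.87, 0.90, 0.88, 0.89]
print(permutation_pvalue(mscan_auprc, plant6ma_auprc))
```

With real replication results, `scipy.stats.ttest_rel` would give the paired t-test the paper describes; the permutation test is an assumption-light alternative.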

Comparison with cross-modification validation approaches
Thus far, our results have demonstrated the model's robust classification performance. Notably, a significant advantage of the proposed MSCAN model is its ability to learn the underlying associations among different RNA modifications. Previous studies have revealed clear evolutionary and functional cross-talk among various post-translational modifications of proteins [32] and histone and chromatin modifications [33]. Such associations might also exist at the epi-transcriptome level among different RNA modifications.
To better understand the inherent shared structures among different RNA modifications, we performed cross-modification validation on the second dataset. The resulting AUROC values are displayed in Fig. 5. As the figure shows, cross-modification validation yielded poorer prediction results than those obtained using modification-consistent data and models, indicating the specificity of our method for a particular modification.
Interestingly, in experiments where the test dataset and model were inconsistent, some groups achieved high AUROC values greater than 0.85, suggesting strong and significant positive associations among certain RNA modifications, even those originating from different nucleotides. This observation implies the existence of regions intensively modified by multiple RNA modifications, which likely serve as key regulatory components for the epi-transcriptome layer of gene regulation. Notably, the sequence signatures of these key regulatory regions are largely shared among different RNA modifications (including those that modify different nucleotides) and were successfully captured by our model. As presented in Table 9, the most strongly associated modifications originated from the same type of base, with A and G being purine bases and C and U being pyrimidine bases.
To further verify this finding, we compared Am, Gm, Cm, and Um correlations using local BLAST [34] software. First, Am, Gm, and Cm alignment libraries were established from the Am, Gm, and Cm datasets, respectively. Second, the Am, Gm, Cm, and Um datasets were aligned pairwise against the libraries of the other methylation types, and the BLAST output tables were obtained. Finally, the average "bit-score" of each comparison was computed. As shown in Table 10, the average bit-score of Gm sequences aligned against the Am library is high, indicating that the Am and Gm sequences are highly similar. Our model thus provides experimental verification of an inherent shared structure between different RNA modifications. These findings underscore the potential of the MSCAN model in advancing our understanding of the complex interplay between various RNA modifications and their functional implications.

Web server
We have developed a user-friendly web server for predicting twelve widely occurring human RNA modification sites (m6A, m1A, m5C, m5U, m6Am, m7G, Ψ, I, Am, Cm, Gm, and Um), accessible at http://47.242.23.141/MSCAN/index.php, to facilitate the use of the MSCAN model for RNA methylation site prediction. Take predicting an m1A methylation site as an example. First, click the "Prediction" button and select "m1A". Next, type or paste the RNA sequence, as shown in Fig. 6a. Third, enter your email address in the input box and click the "Submit" button. After a calculation period, the prediction results will be displayed in a table, as shown in Fig. 6b. This intuitive web server offers researchers an efficient and convenient platform for employing the MSCAN model in their investigations of RNA modifications and their functional implications.

Discussion
First, based on the test data of Chen et al., we compared the performance of various feature encodings under the MSCAN model, including One-hot encoding, ENAC, and Word2vec. The results reveal that Word2vec outperforms One-hot and ENAC in AUROC and AUPRC; specifically, the AUPRC of MSCAN with Word2vec is 5.96% and 5.68% higher than that of MSCAN with One-hot and with ENAC, respectively. These findings are in line with Zhang et al.'s study [20], which highlights that One-hot focuses on local information of individual bases while ENAC considers only the sequence's nucleic acid composition and position, neglecting deeper semantic information; conversely, Word2vec captures the contextual semantic information of the sequence, significantly enhancing the model's predictive capability. Second, based on the test data of Chen et al., we assessed the impact of individual MSCAN components by comparing the performance of MSCAN variants such as SCAN, SAN, MCAN, and CAN. Experimental results show that deleting a self-attention module or a cross-attention module reduces MSCAN's AUPRC by 3.04% and 2.77%, respectively. This finding is consistent with Sun et al.'s study [35], which posits that removing self- or cross-attention modules diminishes model performance. When both the multi-scale and cross-attention modules are removed, the AUPRC of MSCAN decreases by 2.86%. This result aligns with Chen et al.'s study [36], which emphasizes that cross-attention effectively learns multi-scale transformer features for data recognition.
Third, we compared the performance of m6A-word2vec, DeepM6ASeq, Plant6mA, and MSCAN on the test data of Chen et al. MSCAN's AUROC and AUPRC outperformed the other three state-of-the-art models; in particular, MSCAN surpassed Plant6mA by 2.14% in AUPRC. This substantiates that multi-scale input and cross-attention allow the model to extract diverse features and deep semantics through information fusion from multiple scales, which Plant6mA cannot achieve. This conclusion is supported by Guo et al.'s study [37], which demonstrated that multi-scale transformers can extract rich and robust features from inputs at different scales.
Fourth, based on the dataset of Song et al., we designed a cross-modification validation experiment in which twelve different methylation models were each tested on the twelve methylation test sets. We discovered that the most strongly associated modifications originated from the same base class, such as A and G, which are both purines. The AUROC and AUPRC of the Am test set on the Gm prediction model are second only to those of the Am test set on the matching Am prediction model. This finding is consistent with Song et al.'s study [5], which proposed the existence of an inherent shared structure between different RNA modifications.
Lastly, we compared Am, Gm, Cm, and Um correlations using local BLAST software. The average bit-score of Gm sequences aligned against the Am library is high, indicating that the Am and Gm sequences are highly similar; likewise, the average bit-score of Um sequences aligned against the Cm library is high, indicating that the Um and Cm sequences are highly similar. This supports the idea that the most closely related modifications originate from the same type of base. These findings underscore the potential of the MSCAN model in advancing our understanding of the complex interplay between RNA modifications and their functional implications.

Conclusions
This study presents a novel multi-scale self- and cross-attention network (MSCAN) for predicting RNA methylation sites. By combining multi-scale, self-, and cross-attention mechanisms, MSCAN effectively extracts in-depth features from 41-nt sequences at various scales. The model outperforms state-of-the-art predictors on all twelve modification sites, demonstrating strong generalization ability.
Crucially, through the cross-modification validation experiments, our model unveils significant associations among different types of RNA modifications in terms of their related sequence contexts.This finding offers valuable insights into the complex relationships between RNA modifications and their respective sequence environments.
It is worth noting that the MSCAN datasets satisfy the following conditions: (1) each sample is a 41-nt fixed-length sequence; (2) the candidate methylation site lies at the center of the sequence; and (3) every sample sequence is labeled. MSCAN can therefore currently be evaluated only under these conditions. In the future, we hope to adjust the model structure to better capture and exploit the characteristics of RNA sequences of different lengths, focusing particularly on studies that investigate the biological functions and regulatory mechanisms associated with different RNA sequence lengths.

Datasets
In the present study, the benchmark datasets employed to train and test the proposed methods were gathered from previous works [5, 25]. These datasets encompass twelve distinct types of RNA modification from H. sapiens, namely m6A, m1A, m5C, m5U, m6Am, m7G, Ψ, I, Am, Cm, Gm, and Um. They can be downloaded from http://47.242.23.141/MSCAN/index.php, and detailed information is provided in Table 11. To maintain consistency, all sequence samples were adjusted to a length of 41-nt, with the modified or unmodified site positioned at the center. Where the original sequence fell short of 41-nt, "−" characters were appended to the head or tail of the sequence to pad all samples to a uniform length. The raw RNA datasets are represented as R_0 = {x_n}, n = 1, …, N, where N is the number of sequences and each x_n is a 41-nt sample sequence. Sequence-logo representations of the aligned sequences were rendered using the WebLogo program [38, 39], as shown in Fig. 7.
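The centering-and-padding rule above can be sketched as follows; the function name, toy sequence, and flank parameter are illustrative only.

```python
def pad_to_41nt(seq, site_index, flank=20):
    """Return the 41-nt window centred on `site_index`, padding short ends with '-'."""
    left = seq[max(0, site_index - flank): site_index]
    right = seq[site_index + 1: site_index + 1 + flank]
    left = "-" * (flank - len(left)) + left      # pad the head if too short
    right = right + "-" * (flank - len(right))   # pad the tail if too short
    return left + seq[site_index] + right

# Toy 10-nt sequence with the candidate site at index 4
window = pad_to_41nt("AUGGCAUCGA", site_index=4)
print(window, len(window))
```

The candidate site always lands at index 20, the midpoint of the 41-nt window, matching the dataset convention described above.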

Feature encoding representation
Achieving an effective feature encoding representation of the sequence is crucial for improving the evaluation metrics of a model. This study uses Word2vec to transform the sequence into embedded vector representations. Since its introduction in 2013, Word2vec has significantly advanced the performance of a wide array of natural language processing (NLP) tasks.
The Word2vec methodology offers two different frameworks for encoding: Skip-gram and Continuous Bag of Words (CBOW).The Skip-gram approach predicts contextual information surrounding a given word, whereas the CBOW model generates an embedding for the target word based on its contextual associations.These embeddings are derived through a neural network application, adeptly capturing the inherent relationships within the data.
We developed an RNA embedding approach by treating RNA sequences as sentences and k consecutive RNA nucleotides (k-mers) as words within these sentences. Mathematically, we define a mapping f from a length-L nucleotide sequence to its sequence of L − k + 1 overlapping k-mers, which is subsequently fed into the neural network for training. This process yields d-dimensional embedded vectors X_m ∈ R^(m×d), where m = L − k + 1 and d is the embedding dimension. Gene2vec [21] demonstrated that 3-mers provide the optimal prediction performance, so we adopted a 3-mer encoding strategy for the input data. Specifically, we slid a window of size 3-nt over each 41-nt sample sequence with stride one, generating a sequence of 39 words, each corresponding to an index in the vocabulary of all possible 3-mers.
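The 3-mer tokenization described above can be sketched as a sliding window of size 3 with stride 1, mapping each word to an index in the 4³ = 64-entry 3-mer vocabulary (names here are illustrative, not the paper's code):

```python
from itertools import product

K = 3
# Vocabulary of all 4^3 = 64 possible 3-mers over the RNA alphabet
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGU", repeat=K))}

def to_kmer_indices(seq, k=K):
    """Slide a size-k window with stride 1 and map each k-mer to its vocab index."""
    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return [VOCAB[w] for w in words]

seq = "AUGC" * 10 + "A"            # a toy 41-nt sequence
tokens = to_kmer_indices(seq)
print(len(tokens))                  # 41 - 3 + 1 = 39 words
```

Each index would then look up a d-dimensional Word2vec embedding, giving the 39 × d input matrix used downstream (padding characters would need an extra vocabulary entry, omitted here for brevity).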

Model
As shown in Fig. 8, MSCAN is an innovative DL architecture whose encoder combines multi-scale self- and cross-attention mechanisms with point-wise, fully connected layers. This approach models both intra- and inter-sequence interactions across a wide range of scales in RNA sequence data by transforming local RNA sequences into high-dimensional vector representations through its multi-scale self- and cross-attention networks. MSCAN efficiently extracts crucial RNA sequence features, thereby facilitating the accurate prediction of m1A modifications.
Previous results indicate that the nucleotide bases neighboring a methylation site are instrumental in determining the specific type of methylation site and its potential functional consequences [40-42]. Therefore, two subsequences, both centered on the sequence midpoint, were extracted from the original sample sequence: one 21-nt long and the other 31-nt long, as shown in Fig. 9. We represent the dataset as a collection of samples, each consisting of a main sequence and two subsequences: {(x^1_s0, x^1_s1, x^1_s2, y^1), (x^2_s0, x^2_s1, x^2_s2, y^2), …, (x^n_s0, x^n_s1, x^n_s2, y^n)}, where y^i ∈ {0, 1} and x^i_s0, x^i_s1, x^i_s2 are the three sequences of the i-th sample; x^i_s0 is the main sequence with s0 = 41, and x^i_s1, x^i_s2 are the subsequences with s1 = 21 and s2 = 31. Experiments show that the performance of trained models varies when the order of input sample sequences is altered, as shown in Table 1. MSCAN employs the Word2vec encoder to encode word vectors for these sequences; for example, sequences of length 21-nt, 41-nt, and 31-nt are transformed into three matrices of dimensions 19 × 100, 39 × 100, and 29 × 100, respectively.
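The multi-scale extraction step above can be sketched as taking the central 21-nt and 31-nt windows of each 41-nt sample; the helper name and toy sequence are illustrative.

```python
def central_subsequence(seq41, length):
    """Return the odd-`length` window sharing the midpoint of a 41-nt sample."""
    assert len(seq41) == 41 and length % 2 == 1
    mid = len(seq41) // 2                      # index 20, the candidate site
    flank = length // 2
    return seq41[mid - flank: mid + flank + 1]

sample = "A" * 20 + "G" + "C" * 20             # toy 41-nt sequence, site = "G"
x_s1 = central_subsequence(sample, 21)
x_s2 = central_subsequence(sample, 31)
print(len(x_s1), len(x_s2))                    # 21 31
```

Both subsequences keep the candidate site at their own midpoint, so all three scales stay aligned on the position being classified.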
Because the model contains no recurrence or convolution, information about the relative positions of tokens within sequences must be incorporated so that the model can exploit sequence order. To achieve this, a "positional encoding" is added to the Word2vec embedding output, forming the input for the encoder. The positional encoding method employed in this work was introduced by Vaswani et al. [43] for a machine translation task.
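A sketch of the sinusoidal positional encoding of Vaswani et al. [43], with even dimensions using sine and odd dimensions cosine; the sequence length 39 and d_model = 64 follow the dimensions used in this paper.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: PE[pos, 2i] = sin, PE[pos, 2i+1] = cos."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model)[None, :]                    # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
    return pe                                          # (seq_len, d_model)

pe = positional_encoding(seq_len=39, d_model=64)
print(pe.shape)                                        # (39, 64)
```

The encoding matrix is simply added element-wise to the embedding matrix of the same shape before it enters the encoder.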
The encoder is composed of a stack of N = 3 identical layers, each with two sub-layers. The first sub-layer is a multi-scale self- and cross-attention network; the second is a position-wise, fully connected feed-forward network. To facilitate effective information flow, each sub-layer incorporates a residual connection followed by layer normalization.
The output of each sub-layer can be expressed as LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer. Both the embedding layer and all model sub-layers produce outputs of dimension d_model = 64, allowing seamless residual connections.
Finally, a linear transformation followed by a sigmoid function is employed to convert the encoder output into predicted probabilities.

Multi-scale self-and cross-attention network
The multi-scale self- and cross-attention network constitutes the first sub-layer of the encoder and is designed to handle input at multiple scales. Using the Word2vec embeddings, matrices at three distinct scales (take X_s0, X_s1, X_s2 as an example) are fed into the self-attention and cross-attention modules for simultaneous computation. Specifically, X_s0 is passed to the self-attention module, while the two combinations (X_s0 with X_s1, and X_s0 with X_s2) are passed to the cross-attention module. The outputs of these modules are summed element-wise and relayed to the next layer, as shown in Fig. 10.

Cross-attention network
The cross-attention network is designed to extract and learn relationships between words in sequences of different scales, effectively capturing associations across sequences. Using sequences X_s0 and X_s1 as examples, we first transform each sequence into three terms, query, key, and value, via linear projections:
Q_m = X_m W_m^Q, K_m = X_m W_m^K, V_m = X_m W_m^V, where X_m ∈ R^(m×d_model) is the output of the sequence embedding module and m ∈ {s0, s1} denotes the length of the input sequence. X_m is transformed into the query matrix Q_m ∈ R^(m×d_k), the key matrix K_m ∈ R^(m×d_k), and the value matrix V_m ∈ R^(m×d_v), where d_k is the dimension of Q_m and K_m, and d_v is the dimension of V_m.
Second, we compute the cross-modal dot product between the query vectors of X_s0 and the key vectors of X_s1 and divide the result by √d_k to estimate the association between X_s0 and X_s1. These scores are then normalized with the softmax function, yielding attention weight coefficients. Lastly, we use these coefficients to aggregate the corresponding value vectors from each feature sequence, so that the information associating the two sequences is obtained. The cross-attention function can be described as CrossAttention(Q_s0, K_s1, V_s1) = softmax(Q_s0 K_s1^T / √d_k) V_s1.
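A minimal numpy sketch of this cross-attention step, with queries from the 41-nt sequence (39 tokens after 3-mer encoding) and keys/values from the 21-nt subsequence (19 tokens); the random projection weights stand in for learned parameters, and all names are illustrative.

```python
import numpy as np

def cross_attention(x_q, x_kv, wq, wk, wv):
    """Scaled dot-product cross-attention: queries from x_q, keys/values from x_kv."""
    q, k, v = x_q @ wq, x_kv @ wk, x_kv @ wv
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                    # (len_q, len_kv)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ v                                 # (len_q, d_v)

rng = np.random.default_rng(0)
d_model = 64
x_s0 = rng.standard_normal((39, d_model))              # 41-nt -> 39 tokens
x_s1 = rng.standard_normal((19, d_model))              # 21-nt -> 19 tokens
wq, wk, wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
out = cross_attention(x_s0, x_s1, wq, wk, wv)
print(out.shape)                                       # (39, 64)
```

Because the queries come from the shared sequence, the output keeps that sequence's length, which is what lets the self- and cross-attention outputs be summed element-wise as described in the multi-scale network.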

Self-attention network
In contrast to the cross-attention module, which primarily focuses on inter-sequence interactions, the self-attention module identifies and elucidates intra-sequence associations. The self-attention function is described as

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

Fig. 10 The internal structure of the multi-scale self- and cross-attention network

Multi-head multi-scale self- and cross-attention
The above elucidation pertains to single-headed attention, a fundamental mechanism in attention-based models. In practice, however, multi-headed attention is commonly employed to augment model efficacy and expedite training. This technique entails conducting single-headed attention in parallel across multiple instances, known as "heads", and subsequently integrating the outcomes derived from each head. By incorporating multi-headed attention, the model can effectively capture diverse contextual information and the intricate relationships inherent in the input data. The multi-head attention function is described as:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

In this task, we employ h = 8 parallel attention layers. For each layer, we use d_k = d_v = d_model / h = 8.
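A minimal NumPy sketch of the multi-head mechanism follows. The random matrices stand in for the learned per-head projections; the dimensions h = 8 and d_model = 64 match the text, while the per-head size d_k = d_model / h is the standard Transformer convention assumed here:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x_q, x_kv, n_heads=8, d_model=64, seed=0):
    """h parallel scaled dot-product attentions, concatenated and re-projected.
    Random weights stand in for learned projections (illustration only)."""
    rng = np.random.default_rng(seed)
    d_k = d_model // n_heads                     # 64 / 8 = 8 per head
    heads = []
    for _ in range(n_heads):
        w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        q, k, v = x_q @ w_q, x_kv @ w_k, x_kv @ w_v
        a = softmax(q @ k.T / np.sqrt(d_k))      # per-head attention weights
        heads.append(a @ v)                      # per-head output (m, d_k)
    w_o = rng.normal(size=(n_heads * d_k, d_model))
    return np.concatenate(heads, axis=-1) @ w_o  # back to (m, d_model)

x = np.random.default_rng(1).normal(size=(41, 64))
out = multi_head_attention(x, x)  # self-attention when x_q is x_kv
print(out.shape)                  # (41, 64)
```

Passing two different sequences as `x_q` and `x_kv` turns the same routine into multi-head cross-attention, which is how the two module types share one mechanism.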

Position-wise feed-forward networks
After the multi-headed, multi-scale self- and cross-attention layer, a second sub-layer is incorporated to further augment the representational capacity of the model. This additional component comprises a position-wise, fully connected feed-forward network, enhancing the overall model performance. The architecture of this network entails two successive linear transformations with an intervening rectified linear unit (ReLU) activation function, ensuring a non-linear and expressive representation of the input data. It is defined as

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2

The input and output have dimensionality d_model = 64, while the inner layer's dimensionality is d_ff = 256.

Classification module
To accomplish the classification task, the initial step involves computing the average of the encoder output. Subsequently, a linear transformation is applied, followed by a sigmoid activation function. The optimization of the model is facilitated by employing the cross-entropy loss as the primary objective. Finally, the methylation-site probabilities are acquired, providing a robust and comprehensive representation of the underlying biological processes.
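The classification head described above (mean-pool, linear map, sigmoid, cross-entropy objective) can be sketched as follows; the weight vector is a random stand-in and the encoder output shape is hypothetical:

```python
import numpy as np

def classify(encoder_out, w, b):
    """Mean-pool the encoder output over positions, project to a
    logit, and apply the sigmoid to get a site probability."""
    pooled = encoder_out.mean(axis=0)    # (d_model,) sequence summary
    logit = pooled @ w + b               # linear transformation
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid activation

def bce_loss(p, y, eps=1e-12):
    # Binary cross-entropy, the training objective for site / non-site labels
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

rng = np.random.default_rng(0)
encoder_out = rng.normal(size=(41, 64))  # hypothetical encoder output
p = classify(encoder_out, rng.normal(size=64), 0.0)
print(0.0 < p < 1.0, bce_loss(p, 1.0) >= 0.0)  # True True
```

The sigmoid keeps the output strictly in (0, 1), so it can be read directly as the probability that the central nucleotide is a methylation site.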

Fig. 1 Performance of the MSCAN model based on the different feature encodings

We conducted ablation experiments to assess the contribution of key components of the proposed MSCAN model, based on the test data of Chen et al. Utilizing Word2vec for RNA sequence encoding, we constructed four sub-networks: the self- and cross-attention network (SCAN), the self-attention network (SAN), the multi-scale cross-attention network (MCAN), and the cross-attention network (CAN). SCAN represents MSCAN with one cross-attention module removed, SAN is SCAN devoid of cross-attention, MCAN is MSCAN without self-attention, and CAN is MCAN with one cross-attention module removed. The outcomes of these experiments are depicted in Fig. 2 and summarized in Table 4.

Fig. 2
Fig. 2 Performance of MSCAN and variant model on the test data

Fig. 3
Fig. 3 Performance of the different models on the training data

Fig. 4
Fig. 4 The ROC and PRC of MSCAN and other state-of-the-art models on the test data

Fig. 5
Fig. 5 Heat map of different AUROC values in cross-methylation validation. The horizontal axis is the model type, and the vertical axis is the test data type

1, 2, 3, …, L, where L is the fixed sequence length. The model training and experimental parameter optimization of MSCAN are based on the dataset of Chen et al., and the evaluation of MSCAN's generalization capability is based on the dataset of Song et al. The positive-to-negative sample ratios of Chen's and Song's datasets were 1:10 and 1:1, respectively, as shown in Table 11.

Fig. 7
Fig. 7 The motifs of methylation sites. a m1A in the dataset of Chen et al. b m6A. c Ψ. d m1A. e m6Am. f Am. g Cm. h Gm. i Um. j m5C. k m5U. l m7G. m I in the dataset of Song et al.

Fig. 9
Fig. 9 Schematic diagram of the obtained subsequences

This paper first compared the performance of MSCAN with different combinations of input sequences on the training data. Second, the performance of MSCAN with different feature encodings was compared. Third, we compared the performance of different MSCAN model variants. Fourth, MSCAN was compared with other state-of-the-art models.

Table 1
Evaluation results of five-fold cross-validation of the transformer based on the training data of Chen et al.

Table 2
Evaluation results of MSCAN on five-fold cross-validation with different input sequences based on the training data of Chen et al.

Table 3
MSCAN model evaluation results with different feature encodings based on the test data of Chen et al.

Table 4
Comparing MSCAN and variant model evaluation results based on the test data of Chen et al.
Bold indicates the best performance. SAN contains only self-attention; SCAN and MSCAN are combinations of self- and cross-attention; CAN and MCAN are combinations of only cross-attention

Table 5
Evaluation results of MSCAN and other state-of-the-art models based on five-fold cross-validation using the training data of Chen et al.

Table 7 below shows the p values for the difference in the performance of the four classifiers.

Table 6
Evaluation results of MSCAN and other state-of-the-art models based on the test data of Chen et al.

Based on the dataset of Song et al., the generalization ability of MSCAN was evaluated by training the model individually for each methylation type. As presented in Table 8, the MSCAN model consistently outperforms state-of-the-art models, including m6A-word2vec, DeepM6ASeq, and Plant6mA. This result provides empirical evidence of the model's generalizability across diverse methylation-site prediction tasks.

Table 7
A statistically significant correlation matrix for the difference in the performance of the four classifiers

Table 8
Comparison of MSCAN with other methods in terms of AUC

Table 9
Association of RNA modifications revealed by MSCAN

Table 10
Comparison of the average bit-scores of various methylated sequences

the Um sequences are highly similar. Similarly, the average bit-score of the Um sequences compared against the Cm library is high, indicating that the Um and Cm sequences are highly similar, which may validate the idea that the most closely related modifications originate from the same type of base.

Table 11
Statistics of the training and test datasets