
CircSSNN: circRNA-binding site prediction via sequence self-attention neural networks with pre-normalization



Circular RNAs (circRNAs) play a significant role in some diseases by acting as transcription templates. Therefore, analyzing the interaction mechanism between circRNAs and RNA-binding proteins (RBPs) has far-reaching implications for the prevention and treatment of diseases. Existing models for circRNA-RBP identification usually adopt convolutional neural networks (CNNs), recurrent neural networks (RNNs), or their variants as feature extractors. Most of them have drawbacks such as poor parallelism, insufficient stability, and an inability to capture long-term dependencies.


In this paper, we propose a new method that relies entirely on the self-attention mechanism to capture deep semantic features of RNA sequences. On this basis, we construct the CircSSNN model for circRNA-RBP identification. The proposed model constructs a feature scheme by fusing circRNA sequence representations with statistical distributions, static local contexts, and dynamic global contexts. With a stable and efficient network architecture, the distance between any two positions in a sequence is reduced to a constant, so CircSSNN can quickly capture long-term dependencies and extract deep semantic features.


Experiments on 37 circRNA datasets show that the proposed model has overall advantages in stability, parallelism, and prediction performance. Keeping the network structure and hyperparameters unchanged, we directly apply the CircSSNN to linRNA datasets. The favorable results show that CircSSNN can be transformed simply and efficiently without task-oriented tuning.


In conclusion, CircSSNN can serve as an appealing circRNA-RBP identification tool with good identification performance, excellent scalability, and wide application scope without the need for task-oriented fine-tuning of parameters, which is expected to reduce the professional threshold required for hyperparameter tuning in bioinformatics analysis.



Circular RNA (or circRNA) is a single-stranded RNA with a closed-loop structure [1, 2]. It is resistant to exonuclease-mediated degradation and is more stable than most linear RNA. Recent studies have shown that circRNA molecules are rich in microRNA (miRNA) binding sites and act as miRNA sponges in cells [3,4,5], thus relieving the repressive effect of miRNA on its target genes and increasing the expression level of those genes. This mechanism of action is known as the competitive endogenous RNA (ceRNA) mechanism. By interacting with disease-associated miRNAs, circRNA plays a significant role in disease [6,7,8]. It has been shown that circRNA is conducive to the suppression of cancer by binding to some RBPs [9]. Therefore, an in-depth analysis of the interaction between circRNAs and RBPs is of remarkable significance for understanding the development of tumor biology.

Benefiting from the high-throughput sequencing of RNA isolated by crosslinking immunoprecipitation (HITS-CLIP, also known as CLIP-Seq), researchers have found that there are several RBP binding sites in circRNAs in eukaryotes [10, 11]. Therefore, many bioinformatic methods have been proposed to predict circRNA-RBP interactions. For example, inspired by the extraction of image features, Wang et al. proposed a circRNA-RBP classification model based on CNN, which uses the RBP binding sites on CS-circRNAs to predict their relevance to cancer [12]. Based on the capsule network, the CircRB [13] model also utilized convolutional operations to extract the features of circRNAs, and leveraged the dynamic routing algorithm to classify the binding sites. To introduce temporal information in circRNA-protein binding sites, Ju et al. first used CNN to extract features, then combined LSTM with conditional random fields and proposed a sequence-tagged deep learning model to identify circRNA-protein binding sites [14]. Similarly, Zhang et al. combined CNN and BiLSTM into a hybrid neural network in the CRIP model [15]. They also used CNN to extract features and used BiLSTM to capture the temporal information and obtain long-term association information. Unlike the methods mentioned above, CRIP used a codon-based scheme to encode RNA sequences [15]. Also based on a hybrid deep network composed of CNN and BiLSTM networks, Jia et al. applied XGBoost with incremental feature selection to conduct feature encoding and proposed the PASSION [16] algorithm for circRNA-protein binding site prediction. Drawing on the ideas of NLP, Yang et al. proposed a KNFP (K-tuple Nucleotide Frequency Pattern) encoding scheme to describe local information, and applied word2vec to obtain global statistical information. The network architecture in Yang’s model is a hybrid model consisting of a multi-scale residual CNN, a BiGRU network and the attention mechanism [17]. On this basis, Circ2CBA [18] uses a one-hot method to encode circRNA sequences and replaces the BiGRU network with BiLSTM. DeCban [19] combines CNNs with attention networks directly for feature extraction. Li et al. and Niu et al. introduced multi-view subspace learning and ensemble neural networks into Yang’s model, and proposed two models named DMSK [20] and CRBPDL [21], respectively. The models mentioned above have made impressive improvements in the performance of circRNA-RBP prediction, but there are still limitations in the description of global relations. This is because these methods fail to make full use of the contextual information of circRNA sequences.

To overcome this issue, inspired by the newly proposed BERT (Bidirectional Encoder Representations from Transformers) model, Yang first pre-trained a DNABERT model [22], then fine-tuned DNABERT to capture the semantic and syntactic information of the initial RNA sequence, and finally used a deep temporal convolutional network (DTCN) to predict the circRNA-protein binding sites [23]. Although the existing models have made many attempts, from single-view to multi-view, to enrich the diversity of features, they mainly resort to CNN, RNN, or a hybrid of the two to extract the deep features of circRNA. There is still large room for improvement regarding issues such as the poor parallelism of the network architecture, the inability to flexibly capture long-term dependencies of features, and insufficient algorithm stability.

In this study, we developed a novel end-to-end circRNA-binding site prediction model called CircSSNN (CircRNA-binding site prediction via Sequence Self-attention Neural Network). To capture the hierarchical relationship between nucleotide sequences, we extract the initial features of circRNA sequence by a scheme of aggregating multiple gene encoding, including static local context and dynamic global context information. We then use the Transformer to design a network architecture i.e., Seq_Transformer, to extract the latent nucleotide dependencies to complete the task of CircRNA-RBP site prediction.

In the proposed model, the ResNet and LayerNorm modules are incorporated into the deep network to improve the robustness and reduce the sensitivity to hyperparameters, which also allows the algorithm to generalize well to different RNA-RBP combination recognition tasks. We compared CircSSNN with several state-of-the-art baselines on 37 popular circRNA benchmark datasets to verify its effectiveness and generalizability. Moreover, while keeping the network structure and hyperparameters unchanged, we directly applied CircSSNN to 31 linear RNA datasets, and also obtained better performance than existing methods. The experimental results show that CircSSNN is superior to existing methods in terms of recognition performance and generalizability to different types of RNA-RBP. As such, it can serve as a competitive candidate for the task of RNA-RBP prediction with a wide range of applications.

Materials and methods


To verify the effectiveness of the CircSSNN, we adopted 37 circRNA datasets as benchmark datasets, following the baselines we compared [15, 16]. We first downloaded the datasets from the circRNA interactome database. Subsequently, we obtained 335,976 positive samples and 335,976 negative samples following the process of iCircRBP-DHN [17].

To demonstrate the generalizability of CircSSNN regarding different types of RNA-RBP, we also tested the algorithm on 31 linear RNA datasets [24, 25] coming from CLIP-Seq data. Each linear RNA dataset has 5000 training samples and 1000 test samples [16].

Feature multi-descriptors

In CircSSNN, all CircRNA fragments were encoded into three types of quantified features: KNFP for expressing different levels of local contextual features, CircRNA2Vec for capturing contextual features representing long-term dependencies, and DNABERT for describing the global embedding features with learnable position encoding.

K-tuple nucleotide frequency pattern

To describe the local dependencies of circRNA sequences, KNFP is used to count the word frequency of substrings of circRNA with different lengths, thus the local context with varying lengths can be effectively captured [26].

Figure 1 shows the KNFP used in this paper, which consists of three parts [17]: mononucleotide composition, dinucleotide composition and trinucleotide composition, i.e., k = 1,2,3. Considering a circRNA sequence with length n, i.e., \(S = \left[ {S_{1} , S_{2} , \ldots S_{n} } \right]\), in which \(S_{i} \in \left\{ {A,G,C,U} \right\}\), the K-tuple nt composition can be employed to encode the raw sequence into a vector formed by \(P_{1}\), \(P_{2}\) and \(P_{3}\), in which each vector represents an individual k-tuple nt composition pattern and contains \(4^{k}\) components as follows:

$${\text{P}}_{k} = \left[ {p_{1} ,p_{2} ,p_{3} ,p_{4} , \ldots ,p_{{4^{k} }} } \right] \tag{1}$$

Fig. 1 Encoding scheme of KNFP
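As an illustration of the vectors \(P_{k}\) above, the following minimal sketch counts k-tuples for k = 1, 2, 3 and concatenates the normalized frequencies. It assumes that frequencies are computed over the whole sequence and normalized per k; the cited works may instead compute the pattern per window or position, so the function name `knfp` and the normalization are illustrative choices, not the authors' implementation.

```python
from itertools import product

ALPHABET = "ACGU"

def knfp(seq, ks=(1, 2, 3)):
    """Concatenate normalized k-tuple frequency vectors P_k for each k in ks."""
    features = []
    for k in ks:
        # Enumerate all 4**k possible k-tuples in a fixed order.
        kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        for i in range(max(len(seq) - k + 1, 0)):
            sub = seq[i:i + k]
            if sub in counts:  # skip windows with symbols outside A, C, G, U
                counts[sub] += 1
        total = sum(counts.values()) or 1  # avoid division by zero
        features.extend(counts[m] / total for m in kmers)
    return features

vec = knfp("AUGGCUAUGC")
print(len(vec))  # 4 + 16 + 64 = 84
```

Each block of the result sums to one, so sequences of different lengths yield comparable vectors under this assumed normalization.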


We adopted the Doc2Vec model [27] to learn the global expression of circRNAs. The circRNA substrings are first obtained by moving a sliding window ten letters wide, step by step, over the circRNA sequence, and the obtained substrings are then tokenized into circRNA words by using the circRNA corpus from circBase [28].

We used Doc2Vec to learn the distributed expression of circRNA after tokenization. Specifically, for a central word \(w_{t}\) obtained by tokenization and its context words \(w_{t - k} , \ldots ,w_{t + k}\), the model maximizes the average log-probability of the central word given its context, as follows,

$$\frac{1}{T}\sum\limits_{t = k}^{T - k} {\log } p\left( {w_{t} |w_{t - k} , \ldots ,w_{t + k} ,d} \right) \tag{2}$$

where T is the number of words and d is the matrix of the document containing the substring considered. This is the difference between Doc2Vec and word2vec [29]: the former also considers the information of the document [27].

Global embedding features based on CircRNA sequences

BERT is a language model that has achieved great success recently. Based on Transformer, BERT trains its network by using unsupervised learning. Different from word2vec and Doc2Vec, BERT contains learnable positional parameters and thus can express relative position in the context. Pre-training with BERT can obtain well-generalized base parameters, which can be applied to a specific task just with corresponding fine-tuning.

Similar to HCRNet [23], we first tokenized a circRNA sequence by k-mer in which k is set as 3. Next, we performed fine-tuning on a large amount of circRNA data. Similar to the original BERT, this pre-training and fine-tuning strategy will save a lot of training time and facilitate the following learning tasks remarkably.
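The k-mer tokenization step can be sketched as follows. This is a minimal example that assumes overlapping k-mers with a stride of one, which is the convention commonly used with DNABERT-style models; the stride and the helper name `kmer_tokens` are assumptions rather than details stated here.

```python
def kmer_tokens(seq, k=3):
    """Split a sequence into overlapping k-mers with stride 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

tokens = kmer_tokens("AUGGCU")
print(tokens)  # ['AUG', 'UGG', 'GGC', 'GCU']
```

A sequence of length n yields n - k + 1 tokens, so neighbouring tokens share k - 1 nucleotides and local context is preserved across token boundaries.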

Deep neural network architecture

In this section, we propose the CircSSNN framework to fully exploit the latent representation of features and facilitate the subsequent classification tasks. The overall framework of network is shown in Fig. 2. The CircSSNN consists of two parts in total, i.e., the feature encoding module and the Sequence Self-Attention Mechanism module. As stated above, multiple initial features are extracted from the raw sequence by KNFP, CircRNA2vec and DNABERT, and these initial features are first input into the feature encoding module to obtain the unified feature sequences, which are subsequently input into the next module to extract features with self-attention. The final step of classification is carried out by SoftMax. The experimental flowchart of the CircSSNN is illustrated in Fig. 3.

Fig. 2 The network framework of the CircSSNN

Fig. 3 Experimental flowchart of the CircSSNN

Feature encoding module

The multiple initial features obtained from different feature descriptors have inconsistent channel numbers, magnitudes, magnitude units, etc. Such issues will hinder the later analysis. To overcome these issues, data unifying is needed to ensure that the initial features share the same form to facilitate the subsequent feature fusion.

We construct the feature encoder layer by CNN to unify the channels of multiple initial features and conduct data normalization. The feature encoder layer consists of three sublayers, i.e., the one-dimensional CNN layer, the one-dimensional BatchNorm layer, and the ReLU activation function.
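The three sublayers described above can be sketched in plain Python to make the data flow explicit. This is only an illustration under stated assumptions: the kernel size, channel counts, and random weights are hypothetical, the batch normalization uses batch statistics without learnable affine parameters, and a practical implementation would rely on a deep learning framework (for example `torch.nn.Conv1d`, `torch.nn.BatchNorm1d`, and `torch.nn.ReLU` in PyTorch) rather than nested lists.

```python
import math
import random

def conv1d(x, weight, bias):
    """x: [batch][c_in][length]; weight: [c_out][c_in][k]; valid convolution, stride 1."""
    k = len(weight[0][0])
    out = []
    for sample in x:
        length = len(sample[0])
        rows = []
        for w_o, b_o in zip(weight, bias):
            row = []
            for t in range(length - k + 1):
                s = b_o
                for c, w_c in enumerate(w_o):
                    for j in range(k):
                        s += w_c[j] * sample[c][t + j]
                row.append(s)
            rows.append(row)
        out.append(rows)
    return out

def batch_norm1d(x, eps=1e-5):
    """Normalize each channel over the batch and length axes (no affine parameters)."""
    n_ch = len(x[0])
    out = [[None] * n_ch for _ in x]
    for c in range(n_ch):
        vals = [v for sample in x for v in sample[c]]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        for b, sample in enumerate(x):
            out[b][c] = [(v - mean) / math.sqrt(var + eps) for v in sample[c]]
    return out

def relu(x):
    return [[[max(0.0, v) for v in ch] for ch in sample] for sample in x]

def feature_encoder(x, weight, bias):
    """Conv1d -> BatchNorm1d -> ReLU, mapping c_in channels to c_out channels."""
    return relu(batch_norm1d(conv1d(x, weight, bias)))

random.seed(0)
c_in, c_out, k = 84, 8, 1  # hypothetical sizes, chosen only for the demo
x = [[[random.random() for _ in range(5)] for _ in range(c_in)] for _ in range(2)]
w = [[[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(c_in)] for _ in range(c_out)]
b = [0.0] * c_out
y = feature_encoder(x, w, b)
print(len(y), len(y[0]), len(y[0][0]))  # 2 8 5
```

Applying one such encoder per descriptor, each with the same number of output channels, gives features of a common shape and scale that can then be fused.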

Sequence self-attention mechanism module

Transformer [30] is a network architecture that is based on attention mechanisms and abandons the traditional CNN and RNN. More precisely, a Transformer module consists only of Self-Attention and a Feedforward Neural Network (FFN). This simple architecture brings better performance, higher parallelism, and lower time complexity. It has been successfully applied to various fields such as NLP and CV, and many researchers [31,32,33] have incorporated the Transformer as a sub-model and achieved impressive success.

We partially adopt the architecture of the Transformer with slight modification as the extractor of deep structure, i.e., the Seq_Transformer as shown in Fig. 4.

Fig. 4 The structure of Seq_Transformer

When a neural network is built by stacking multiple Transformer sub-layers, either in the encoder or in the decoder, information propagates poorly through the network, which makes training very difficult [34, 35]. To overcome this issue, we leverage the residual module to improve the efficiency of information propagation and apply layer normalization to reduce the variance of the sub-layers. There are two ways to incorporate layer normalization into the residual network. Let F be a sub-layer (either in the encoder or decoder) in the Transformer architecture, and denote its parameter set by \(\theta_{l}\).


In the pioneering works on the Transformer [30], it is common practice to perform the residual addition first and then apply Layer Normalization (LN), i.e., Post-norm, as follows,

$$y_{l} = x_{l} + {\text{F}}(x_{l} ;\theta_{l} ) \tag{3}$$
$$x_{l + 1} = {\text{LN}}(y_{l} ) \tag{4}$$


In recent years, many researchers [36] have preferred to apply Layer Normalization (LN) to the inputs of the sub-layers rather than to their outputs, i.e., Pre-norm, as follows,

$$x_{l + 1} = x_{l} + {\text{F}}({\text{LN}}(x_{l} );\theta_{l} ) \tag{5}$$

The effects of Post-norm and Pre-norm are comparable for shallow networks. Both methods can effectively improve the distribution of parameters, which facilitates smooth training. However, for a deeper network, it has been pointed out that Pre-norm is better than Post-norm [34, 35]. Specifically, for CircSSNN, since DNABERT is used in the initial feature extraction and the Seq_Transformer is applied next, the network is rather deep overall. Therefore, for circRNA-RBP prediction, which is the task of the proposed model, we argue that Pre-norm is more effective than Post-norm. We have empirically demonstrated this point in the ablation experiments in the Results section.
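The difference between the two orderings can be made concrete with a small sketch. It is a toy illustration only: `sublayer` is a hypothetical stand-in for F, and the layer normalization omits the learnable scale and shift parameters.

```python
import math

def layer_norm(x, eps=1e-5):
    """Standardize a vector to zero mean and unit variance (no affine parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    """Toy stand-in for F(x; theta): a fixed elementwise map."""
    return [0.5 * v + 1.0 for v in x]

def post_norm_step(x, f=sublayer):
    y = [a + b for a, b in zip(x, f(x))]   # y_l = x_l + F(x_l)
    return layer_norm(y)                   # x_{l+1} = LN(y_l)

def pre_norm_step(x, f=sublayer):
    fx = f(layer_norm(x))                  # F(LN(x_l))
    return [a + b for a, b in zip(x, fx)]  # x_{l+1} = x_l + F(LN(x_l))

x = [1.0, 2.0, 4.0]
print(post_norm_step(x))
print(pre_norm_step(x))
```

In the Post-norm step the input only reaches the output through LN, whereas in the Pre-norm step the input `x` is added to the output unchanged, which is the identity path that the gradient analysis relies on.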

Theoretically, this phenomenon can be explained by carefully examining the nature of network training. Training a network is essentially the backward propagation of the error computed by the loss function and the corresponding adjustment of the weight parameters according to the propagated error. Take a submodule containing L layers as an example: the error back-propagated from the next layer is denoted by \(\mathcal{E}\), and \(x_{L}\) denotes the output of the last layer. If the Transformer adopts the Post-norm strategy, then according to the chain rule, the partial derivative of \(\mathcal{E}\) with respect to \(x_{l}\) at a particular sub-layer l can be calculated as follows [35],

$$\frac{{\partial {\mathcal{E}}}}{{\partial x_{l} }} = \frac{{\partial {\mathcal{E}}}}{{\partial x_{L} }} \times \prod\limits_{k = l}^{L - 1} {\frac{{\partial {\text{LN}}\left( {y_{k} } \right)}}{{\partial y_{k} }}} \times \prod\limits_{k = l}^{L - 1} {\left( {1 + \frac{{\partial {\text{F}}\left( {x_{k} ;\theta_{k} } \right)}}{{\partial x_{k} }}} \right)} \tag{6}$$

where \(\prod\nolimits_{k = l}^{L - 1} {\frac{{\partial {\text{LN}}\left( {y_{k} } \right)}}{{\partial y_{k} }}}\) denotes the normalized information which is propagated backward, and \(\prod\nolimits_{k = l}^{L - 1} {\left( {1 + \frac{{\partial {\text{F}}\left( {x_{k} ;\theta_{k} } \right)}}{{\partial x_{k} }}} \right)}\) indicates the information which is back-propagated through the residual module. Similarly, for the case of the Pre-norm, we can obtain the gradient as follows [35],

$$\frac{{\partial {\mathcal{E}}}}{{\partial x_{l} }} = \frac{{\partial {\mathcal{E}}}}{{\partial x_{L} }} \times \left( {1 + \sum\limits_{k = l}^{L - 1} {\frac{{\partial {\text{F}}\left( {{\text{LN}}\left( {x_{k} } \right);\theta_{k} } \right)}}{{\partial x_{l} }}} } \right) \tag{7}$$

From Eq. (7), it is easy to find out that the term “1” in the parenthesis enables the direct backward propagation of \(\frac{{\partial {\mathcal{E}}}}{{\partial x_{L} }}\) from the last layer to the lth layer, i.e., the propagation through the residual module no longer depends on the number of layers.
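The origin of the term "1" can be made explicit with a short derivation, given here as a sketch that follows directly from the Pre-norm recursion \(x_{k + 1} = x_{k} + {\text{F}}({\text{LN}}(x_{k} );\theta_{k} )\). Unrolling the recursion from layer l to layer L gives

$$x_{L} = x_{l} + \sum\limits_{k = l}^{L - 1} {{\text{F}}\left( {{\text{LN}}\left( {x_{k} } \right);\theta_{k} } \right)}$$

and differentiating both sides with respect to \(x_{l}\) yields

$$\frac{{\partial x_{L} }}{{\partial x_{l} }} = 1 + \sum\limits_{k = l}^{L - 1} {\frac{{\partial {\text{F}}\left( {{\text{LN}}\left( {x_{k} } \right);\theta_{k} } \right)}}{{\partial x_{l} }}}$$

Multiplying by \(\frac{{\partial {\mathcal{E}}}}{{\partial x_{L} }}\) according to the chain rule recovers the Pre-norm gradient above: the sum is additive rather than multiplicative, so no product of per-layer LN Jacobians appears.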

Comparing the propagation through the residual module in Eq. (6) and Eq. (7), one finds that in Eq. (6) the information passing through the residual module does not propagate directly from layer L to layer l. This is because in Post-norm the residual connection is not a true bypass of the layer-normalization layer, which produces a chained multiplicative term in the gradient of Eq. (6), i.e., \(\prod\nolimits_{k = l}^{L - 1} {\frac{{\partial {\text{LN}}\left( {y_{k} } \right)}}{{\partial y_{k} }}}\). As the network grows deeper, this product becomes prone to gradient vanishing or exploding.

Therefore, our model is connected by Pre-norm residual blocks [34, 35], and features are normalized before passing through the multi-headed self-attention network, thus producing a more stable gradient.

The overall process of CircSSNN is as follows. We first extract multiple initial features using KNFP, CircRNA2vec, and DNABERT respectively. These initial features are then integrated into the multi-view fused feature \(z_{l}\), which is split into two paths by the residual connection module as follows,

$$p = z_{l} + {\text{MultiHeadAttention}}({\text{LN}} (z_{l} )) \tag{8}$$

In Eq. (8), one path keeps the information unchanged and propagates it directly, while the other path is first normalized by the Pre-norm LN before passing through the multi-head attention (MHA) module. The Pre-norm LN is defined as,

$$\mu = \frac{1}{M}\sum\limits_{i = 1}^{M} {z_{i} } \tag{9}$$
$$\sigma^{2} = \frac{1}{M}\sum\limits_{i = 1}^{M} {\left( {z_{i} - \mu } \right)^{2} } \tag{10}$$
$$\hat{z} = \frac{z - \mu}{\sqrt{\sigma^{2} + \epsilon}} \odot \gamma + \beta \triangleq \operatorname{LN}_{\gamma,\beta}(z) \tag{11}$$

In Eqs. (9)-(11), M is the number of neurons, \(\gamma\) and \(\beta\) are the learnable scale and shift parameters, and \(\epsilon\) is a small constant for numerical stability. Features are extracted using scaled dot-product multi-head attention to capture contextual features as follows,

$$Q = {\text{Concat}} \left( {q_{1} , \ldots ,q_{{\text{h}}} } \right) \tag{12}$$
$$K = {\text{Concat}} \left( {k_{1} , \ldots ,k_{{\text{h}}} } \right) \tag{13}$$
$$V = {\text{Concat}} \left( {v_{1} , \ldots ,v_{{\text{h}}} } \right) \tag{14}$$
$${\text{MultiHeadAttention}}(Q,K,V) = {\text{softmax}} \left( {\frac{{QK^{T} }}{\sqrt d }} \right)V \tag{15}$$

In Eqs. (12)-(14), h is the number of heads, and \(q_{i}\), \(k_{i}\) and \(v_{i}\), \(i \in \left\{ {1,2, \ldots ,h} \right\}\), denote the query, key, and value of the i-th head, respectively. Q, K and V indicate the aggregation of the multiple \(q_{i}\), \(k_{i}\), and \(v_{i}\), respectively. In Eq. (15), d is the dimension of the input vector. Then, the information passing through the MHA module and the information bypassing it are added together to get p, as described in Eq. (8). Similarly, before the information passes through the FFN module, it is also processed by the Pre-norm LN. In this way, the input information is finally turned into a unified structured deep feature for the subsequent classification.

$$z_{s} = p + {\text{FFN}}({\text{LN}} (p)) \tag{16}$$
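The scaled dot-product attention used in the MHA module can be sketched in plain Python as follows. This is a simplified illustration under several assumptions: the learned projection matrices that normally produce the queries, keys, and values are omitted, heads are formed by splitting the feature columns, and each head is scaled by its own dimension as in the standard Transformer.

```python
import math

def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for lists of row vectors."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

def multi_head_attention(X, n_heads):
    """Split feature columns into heads, attend per head, then concatenate."""
    d = len(X[0])
    assert d % n_heads == 0
    step = d // n_heads
    heads = []
    for h in range(n_heads):
        part = [row[h * step:(h + 1) * step] for row in X]
        heads.append(attention(part, part, part))  # self-attention: Q = K = V
    return [sum((head[i] for head in heads), []) for i in range(len(X))]

X = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(X, n_heads=2)
print(len(out), len(out[0]))  # 3 4
```

Every output row is a weighted combination of all input rows computed in a single step, which is why the distance between any two positions is constant regardless of how far apart they are in the sequence.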

From the network architecture of CircSSNN, one can find it differs from the existing models in two aspects.

First, to the best of our knowledge, this is the first attempt to introduce the residual module with Pre-norm LN into circRNA recognition. As stated in [34, 35], the residual module with Post-norm LN brings a higher risk of gradient vanishing or exploding as the network goes deeper. Therefore, we adopt the Pre-norm LN scheme to avoid this problem while using the residual connection to improve the efficiency of information transmission.

Second, we propose the Seq_Transformer module based on self-attention to extract temporal contextual features. Most existing works on circRNA-RBP prediction, such as DMSK [20], CRBPDL [21], iCircRBP-DHN [17], and CRIP, mainly use RNNs such as LSTM or GRU to capture temporal dependence. However, the computation of an RNN or its variants is sequential, i.e., the result at time step t depends on that at time step t-1, which dramatically limits parallelism. In addition, long-term dependencies are prone to being lost during propagation along the sequential RNN. LSTM and GRU adopt gating mechanisms that mitigate this problem to a certain extent, but these mechanisms remain unsatisfactory for long-term dependencies. Therefore, compared with models based on the self-attention mechanism, these models suffer from insufficient parallelism and a poor ability to capture long-term dependencies. However, the attention mechanism has seldom been employed to extract features directly in this field. Up to now, only Yang et al. have used the attention mechanism, in the iCircRBP-DHN model they proposed in 2020. But in iCircRBP-DHN, the attention mechanism is not employed as a direct feature extractor but as a supplement to the GRU, i.e., iCircRBP-DHN uses the attention modules to capture features after GRU processing, which to some extent destroys the dependency relationships of the original data and leaves the attention mechanism with little effect. In their subsequent work, i.e., HCRNet proposed in 2022, they omitted the attention mechanism. In HCRNet, Yang et al. used DTCN to extract discriminative information from hybrid features and combined the parallelism of CNN with residual connections, thus making various receptive field sizes available and keeping gradients stable. DTCN alleviates the limitations of RNN regarding parallelism to some extent. However, it is still limited by the fixed receptive field size of CNN, and the two issues of existing models, i.e., insufficient parallelism and inefficiency in capturing long-term dependencies, still exist.

In contrast, in CircSSNN, after the initial multiple features are integrated into a unified one, feature extraction is performed directly using the Seq_Transformer without intermediate processing by an RNN or its variants. As a result, we address the above two issues by adopting the Seq_Transformer. Its advantages can be analyzed as follows. First, it is constructed based on the attention mechanism rather than a sequential structure, so its calculation can be performed as matrix multiplication, which can be easily parallelized and accelerated by modern GPU-based deep learning frameworks. Second, by using the Seq_Transformer, the distance between any two positions in the sequence is reduced to a constant, so long-term dependencies can be effectively captured. In addition, due to the excellent parallelism of the Seq_Transformer, we can make full use of multi-head attention to focus on contextual information from different locations simultaneously. Therefore, the deep structural features extracted by the Seq_Transformer have good classification performance.


Experimental setting

For both circRNA and linRNA datasets, 80% of the samples were randomly selected as training data, and the remaining 20% were used as test data. To show the generalizability of CircSSNN rather than the performance improvement brought by hyperparameter tuning, we did not set aside validation sets for hyperparameter tuning in the experiments. The hyperparameters of CircSSNN were set to be the same across all datasets, which eliminates the trouble of hyperparameter tuning.

We used Adam as the optimizer, and set the parameters weight_decay and batch_size to 3e-4 and 64, respectively. The learning rate of Adam was controlled by the built-in learning rate scheduler of PyTorch, in which the parameter initial_rate was set to 3e-3. As the Seq_Transformer can capture deep features effectively and quickly, we let the learning rate decay to one-tenth every two rounds to accelerate the convergence.
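The decay rule described above can be written out as a small function. This sketch assumes that a "round" corresponds to one training epoch and that the schedule is a plain step decay; in PyTorch, a comparable behaviour is provided by `torch.optim.lr_scheduler.StepLR` with `step_size=2` and `gamma=0.1`, although the text does not name the scheduler that was actually used.

```python
def learning_rate(epoch, initial_rate=3e-3, step_size=2, gamma=0.1):
    """Step decay: multiply the rate by gamma once every step_size epochs."""
    return initial_rate * gamma ** (epoch // step_size)

for epoch in range(6):
    print(epoch, learning_rate(epoch))
```

Under these assumptions the rate is 3e-3 for the first two epochs, roughly 3e-4 for the next two, and roughly 3e-5 after that.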

Experimental results on circRNA datasets

We compared the CircSSNN with seven baselines on 37 circRNA-RBP datasets. To be fair, all the parameters were set as reported in the corresponding papers.

Four metrics, namely AUC, ACC, precision, and recall, were used to compare the performance of the competing methods. The performance of all methods, averaged over the 37 circRNA datasets, is shown in Fig. 5. In Fig. 5, the colors of the solid circles correspond to the performance of each algorithm with respect to a certain metric, and the values can be read from the color bar on the right side of the figure. For example, the green solid circle in the fourth row (from top to bottom) of the last column represents the performance of the PASSION model with respect to recall, which is about 80% (the third block in the color bar). The size of the circles indicates the ranking of the performances, i.e., the largest circle of size 5 corresponds to the best algorithm for each metric, while the smallest circle of size 1 corresponds to the worst one. Take the last column as an example again: the recall values of CircSSNN, HCRNet and iCircRBP are all around 85% (the same color), but the sizes of the solid circles give their ranking, i.e., in terms of recall, CircSSNN has the best performance among the three algorithms and iCircRBP has the worst.
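For reference, the four metrics can be computed from true labels and predicted scores as in the following sketch. The decision threshold of 0.5 and the toy data are assumptions made for illustration; AUC is computed as the fraction of positive-negative pairs in which the positive sample receives the higher score, with ties counted as one half.

```python
def classification_metrics(y_true, y_score, threshold=0.5):
    """Return ACC, AUC, precision, and recall for binary labels and scores."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # AUC: probability that a random positive outranks a random negative.
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg)) if pos and neg else float("nan")
    return {"ACC": acc, "AUC": auc, "precision": precision, "recall": recall}

y_true = [1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
print(classification_metrics(y_true, y_score))
```

Unlike ACC, precision, and recall, AUC does not depend on the threshold, which makes it a natural headline metric when comparing models whose score calibration differs.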

Fig. 5 The average performance of competing methods on 37 circRNA datasets

As can be seen from Fig. 5, CircSSNN is superior to all competing methods in terms of AUC and recall. It is slightly inferior to HCRNet in terms of ACC and precision, but higher than the other six methods by a large margin. The average values of ACC, AUC, precision and recall are 85.71%, 93.07%, 85.14%, 86.69% for CircSSNN and 85.81%, 93.04%, 85.68%, 86.35% for HCRNet. As the other baselines perform clearly worse than these two methods, we do not list their values here for simplicity. The detailed AUC values are summarized in Table 1.

Table 1 The AUC of competing methods on 37 circRNA datasets

CircSSNN outperformed the other competing baselines on 18 out of the 37 circRNA datasets, and produced the highest average AUC of 93.1%. The number of samples in each of the 37 benchmark datasets ranges from 892 to 40,000, which indicates that CircSSNN is applicable to datasets with an extensive range of scales. Even for small-scale datasets, CircSSNN still achieved competitive performance.

To demonstrate the stability of the CircSSNN, we selected a moderate-scale dataset, TIAL1, with 10,912 samples, and repeated the test of the top two models, i.e., CircSSNN and HCRNet, ten times on TIAL1. The fluctuation of performance is illustrated in Fig. 6, where the curve of CircSSNN fluctuates more mildly than that of HCRNet. This further illustrates that the Seq_Transformer used in CircSSNN is more flexible and less affected by sample randomness, and that the features it extracts are more stable.

Fig. 6 Comparison of the stability of HCRNet

To compare the efficiency and parallelism of CircSSNN and HCRNet, we trained the two models on the 37 circRNA datasets ten times with the same hardware and software configuration. The average training times of the two models were 10 h and 13 h, respectively, which shows that CircSSNN is more efficient and parallelizable. The reason is that the Seq_Transformer used in CircSSNN is entirely based on the attention mechanism, which converts data into Query, Key, and Value at the same time, and thus facilitates the parallel retrieval of feature information.

To demonstrate the advantage of Pre-norm over Post-norm, we kept the other modules of the CircSSNN unchanged, and compared the effect of Pre-norm and Post-norm on the 37 circRNA datasets. In Fig. 7, the blue bar represents the performance of the CircSSNN with the Post-norm strategy, while the red bar represents the performance with the Pre-norm strategy. As shown in Fig. 7, the Pre-norm strategy brings performance gains on 36 out of 37 datasets, with an increase of more than two percent on about half of the datasets.

Fig. 7 Comparison of the effect of pre-norm and post-norm on 37 circRNA datasets

Finally, to demonstrate that the proposed feature fusion scheme is more effective than a single feature descriptor, ablation experiments were conducted while keeping all modules other than the feature descriptors unchanged, and the results are plotted as violin plots in Fig. 8. In terms of the AUC values of the proposed algorithm on the 37 circRNA datasets, the distribution of the results obtained by the feature fusion scheme is more concentrated than that of any single feature descriptor, and the mean AUC obtained by the feature fusion scheme is also the largest. The performance of the two descriptors KNFP and CircRNA2Vec varies noticeably across datasets. The results of the DNABERT descriptor are more evenly distributed than those of the other two, but its performance is still slightly inferior to that of the feature fusion scheme. Fig. 8 thus shows that the feature fusion scheme makes full use of the consistent and complementary information of each view and obtains excellent overall performance.

Fig. 8 Comparison of the effect of different feature descriptors

The prediction performance of CircSSNN on linear RNA datasets

The CircSSNN is highly transformable and can be applied to other types of RNA-RBP prediction tasks without hyperparameter tuning. To verify this, we tested the CircSSNN and the baselines on 31 linear RNA datasets, and the results are shown in Fig. 9. As shown in Fig. 9, without hyperparameter tuning, the CircSSNN achieved favorable performance over other state-of-the-art baselines, which demonstrates that the CircSSNN is stable and transformable. The detailed AUC values are listed in Table 2. Because the models designed for the circRNA datasets, such as HCRNet and iCircRBP-DHN, do not specify the operational details and parameter settings needed to migrate them from the circRNA datasets to the linRNA datasets, we could not reproduce the results of these models in our experiments, and Table 2 simply lists the AUC values published in their original papers for comparison. However, as can be observed in Fig. 9 and Table 2, even compared with their results, which were produced after fine-tuning hyperparameters on validation sets, the results of the CircSSNN, which were obtained without hyperparameter tuning, still outperform these models in most cases. In detail, the proposed CircSSNN achieved the best AUC on 21 out of the 31 linear RNA datasets, and its average AUC is 0.931, which is 0.7 percentage points higher than that of the HCRNet. On some datasets, the CircSSNN outperforms the HCRNet by a clear margin; for example, the AUC of the CircSSNN is 4.5 and 3.6 percent higher than that of the HCRNet on the hnRNPL-1 dataset and the hnRNPL-2 dataset, respectively. Therefore, even when the network architecture and parameters designed for circRNA datasets are kept unchanged, the CircSSNN can still produce competitive results on linear RNA datasets.

Fig. 9 Boxplot comparison of the AUC of different models on 31 linear RNA datasets

Table 2 Average value of AUC obtained by different methods on 31 linear RNA datasets

In addition, to investigate the transformability of different methods, we compared the CircSSNN with the HCRNet, the most recent and most representative algorithm, on linear RNA datasets, with both models using the hyperparameter settings chosen for the circRNA datasets. The experimental results on the 31 linear RNA benchmark datasets are shown in Fig. 10.

Fig. 10 Comparison of CircSSNN and HCRNet on 31 linear RNA datasets

As shown in Fig. 10, when both the CircSSNN and the HCRNet were tested on the linRNA datasets with the hyperparameter settings chosen for the circRNA datasets, the CircSSNN outperformed the HCRNet by about two, two, and six percentage points in ACC, AUC, and Precision, respectively, and was only slightly inferior to the HCRNet in Recall, by 0.7 percentage points. These results verify that the CircSSNN is more transformable than the HCRNet and can obtain favorable results even without hyperparameter tuning. The AUC of the HCRNet was reported as 0.924 in its original paper, a result obtained by fine-tuning the hyperparameters on validation sets, but it dropped to 0.91 when no task-oriented fine-tuning of hyperparameters was conducted. Therefore, although the HCRNet also achieves good performance on the linRNA datasets, tuning its hyperparameters requires expertise and a lot of trial and error, which hinders its generalization to new tasks. In contrast, the CircSSNN can be transformed simply and efficiently to other RNA-RBP identification tasks and has a wide range of applications.


The above experimental results verify that the Seq_Transformer adopted in the CircSSNN can effectively capture the semantic and global context of sequences and produce discriminative features, and that the CircSSNN is more parallelizable, stable, and transformable than the other baselines.

Compared with existing methods, the CircSSNN network architecture proposed in this paper achieves excellent performance for two reasons. First, after integrating data from multiple views, the model directly applies the Seq_Transformer and uses multi-head attention to attend simultaneously to contextual information from different positions and extract deep features, without intermediate processing by RNNs or their variants. The distance between any two positions in the sequence is thereby reduced to a constant, so long-term dependencies are captured effectively. Second, the Pre-norm based attention mechanism, applied here to the circRNA recognition task for the first time, avoids the risk of vanishing or exploding gradients brought by deep networks, so that network training obtains more stable gradient updates.
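The two points above, the constant path length between positions and the pre-norm ordering, can be illustrated with a minimal pure-Python sketch of one encoder block. This is not the authors' Seq_Transformer implementation: it assumes a single attention head, random toy weights, and a normalization without learned scale or bias, and all function and parameter names are hypothetical.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize each position (row) to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def matmul(a, b):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(x, w_q, w_k):
    """Scaled dot-product weights: each position attends to all positions in one step."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    scale = math.sqrt(len(q[0]))
    return [softmax([s / scale for s in row]) for row in matmul(q, transpose(k))]


def self_attention(x, w_q, w_k, w_v, w_o):
    context = matmul(attention_weights(x, w_q, w_k), matmul(x, w_v))
    return matmul(context, w_o)


def feed_forward(x, w1, w2):
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]  # ReLU
    return matmul(hidden, w2)


def add(a, b):
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def pre_norm_block(x, params):
    """Pre-norm ordering: normalize, apply the sublayer, then add the residual.

    A post-norm block would instead normalize after the residual addition.
    """
    x = add(x, self_attention(layer_norm(x), *params["attn"]))
    x = add(x, feed_forward(layer_norm(x), *params["ffn"]))
    return x


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


# Toy sizes and random weights, chosen only for illustration.
rng = random.Random(0)
seq_len, d_model, d_ff = 6, 8, 16
x = random_matrix(seq_len, d_model, rng, scale=1.0)
params = {
    "attn": [random_matrix(d_model, d_model, rng) for _ in range(4)],  # W_q, W_k, W_v, W_o
    "ffn": [random_matrix(d_model, d_ff, rng), random_matrix(d_ff, d_model, rng)],
}
y = pre_norm_block(x, params)
```

In this sketch the attention weight matrix has one entry for every pair of positions, so information from any position reaches any other within a single layer. Because normalization is applied inside each residual branch, the input also passes through the block on an unnormalized identity path, which is the property usually credited for the more stable gradients of pre-norm networks.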

Although the improvement of the CircSSNN over the HCRNet is not large, the HCRNet needs validation sets to tune its hyperparameters, which is time-consuming and laborious. In contrast, the CircSSNN uses the same set of hyperparameters for all datasets, i.e., it does not need validation sets for fine-tuning, which demonstrates that it is more flexible and less sensitive to hyperparameters. This characteristic makes it easier to use, especially for users without a computing background.


At present, most existing models for circRNA-RBP identification adopt CNNs, RNNs, or their variants as feature extractors and have drawbacks such as poor parallelism, insufficient stability, and an inability to capture long-term dependencies. We propose the CircSSNN model, which is based on the sequence self-attention mechanism. The CircSSNN extracts deep features entirely through self-attention, which gives it good parallelism, and it captures long-term dependencies by reducing the distance between any two positions in a sequence to a constant. Multiple experiments on 37 circRNA datasets and 31 linRNA datasets using the same hyperparameters show that the CircSSNN achieves excellent performance, has good stability and scalability, and removes the need for hyperparameter tuning that existing models have. In conclusion, the CircSSNN can serve as an appealing option for the task of circRNA-RBP identification.

Availability of data and materials

The datasets and codes are available at



circRNAs: Circular RNAs

RBPs: RNA-binding proteins

CNN: Convolution neural network

RNN: Recurrent neural network

DTCN: Deep temporal convolutional network

KNFP: K-tuple Nucleotide Frequency Pattern

BERT: Bidirectional Encoder Representations from Transformers

CircSSNN: CircRNA-binding site prediction via Sequence Self-attention Neural Network


  1. Memczak S, Jens M, Elefsinioti A, Torti F, Krueger J, Rybak A, et al. Circular RNAs are a large class of animal RNAs with regulatory potency. Nature. 2013;495(7441):333–8.

  2. Hao S, Lv J, Yang Q, Wang A, Li Z, Guo Y, et al. Identification of key genes and circular RNAs in human gastric cancer. Med Sci Monit. 2019;25:2488–504.

  3. Chen L-L. The biogenesis and emerging roles of circular RNAs. Nat Rev Mol Cell Biol. 2016;17(4):205–11.

  4. Zang J, Lu D, Xu A. The interaction of circRNAs and RNA binding proteins: an important part of circRNA maintenance and function. J Neurosci Res. 2020;98(1):87–97.

  5. Qu S, Yang X, Li X, Wang J, Gao Y, Shang R, et al. Circular RNA: a new star of noncoding RNAs. Cancer Lett. 2015;365(2):141–8.

  6. Zhang H-d, Jiang L-h, Sun D-w, Hou J-c, Ji Z-l. CircRNA: a novel type of biomarker for cancer. Breast Cancer. 2018;25(1):1–7.

  7. Xie F, Huang C, Liu F, Zhang H, Xiao X, Sun J, et al. CircPTPRA blocks the recognition of RNA N6-methyladenosine through interacting with IGF2BP1 to suppress bladder cancer progression. Mol Cancer. 2021;20(1):1–17.

  8. You X, Vlatkovic I, Babic A, Will T, Epstein I, Tushev G, et al. Neural circular RNAs are derived from synaptic genes and regulated by development and plasticity. Nat Neurosci. 2015;18(4):603–10.

  9. Zhang M, Huang N, Yang X, Luo J, Yan S, Xiao F, et al. A novel protein encoded by the circular form of the SHPRH gene suppresses glioma tumorigenesis. Oncogene. 2018;37(13):1805–14.

  10. Dudekula DB, Panda AC, Grammatikakis I, De S, Abdelmohsen K, Gorospe M. CircInteractome: a web tool for exploring circular RNAs and their interacting proteins and microRNAs. RNA Biol. 2016;13(1):34–42.

  11. Ruan H, Xiang Y, Ko J, Li S, Jing Y, Zhu X, et al. Comprehensive characterization of circular RNAs in ~ 1000 human cancer cell lines. Genome medicine. 2019;11(1):1–14.

  12. Wang Z, Lei X, Wu F-X. Identifying cancer-specific circRNA–RBP binding sites based on deep learning. Molecules. 2019;24(22):4035.

  13. Wang Z, Lei X. Identifying the sequence specificities of circRNA-binding proteins based on a capsule network architecture. BMC Bioinform. 2021;22(1):1–16.

  14. Ju Y, Yuan L, Yang Y, Zhao H. CircSLNN: identifying RBP-binding sites on circRNAs via sequence labeling neural networks. Front Genet. 2019;66:1184.

  15. Zhang K, Pan X, Yang Y, Shen HB. CRIP: predicting circRNA-RBP-binding sites using a codon-based encoding and hybrid deep neural networks. RNA. 2019;25(12):1604–15.

  16. Jia C, Bi Y, Chen J, Leier A, Li F, Song J. PASSION: an ensemble neural network approach for identifying the binding sites of RBPs on circRNAs. Bioinform. 2020;36(15):4276–82.

  17. Yang Y, Hou Z, Ma Z, Li X, Wong KC. iCircRBP-DHN: identification of circRNA-RBP interaction sites using deep hierarchical network. Brief Bioinform. 2021;22(4):66.

  18. Guo Y, Lei X, Liu L, Pan Y. circ2CBA: prediction of circRNA-RBP binding sites combining deep learning and attention mechanism. Front Comput Sci. 2022;17(5):175–904.

  19. Yuan L, Yang Y. DeCban: prediction of circRNA-RBP interaction sites by using double embeddings and cross-branch attention networks. Front Genet. 2020;11:632861.

  20. Li H, Deng Z, Yang H, Pan X, Wei Z, Shen HB, et al. circRNA-binding protein site prediction based on multi-view deep learning, subspace learning and multi-view classifier. Brief Bioinform. 2022.

  21. Niu M, Zou Q, Lin C. CRBPDL: identification of circRNA–RBP interaction sites using an ensemble neural network approach. PLoS Comput Biol. 2022;18(1):e1009798.

  22. Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics. 2021;37(15):2112–20.

  23. Yang Y, Hou Z, Wang Y, Ma H, Sun P, Ma Z, et al. HCRNet: high-throughput circRNA-binding event identification from CLIP-seq data using deep temporal convolutional network. Brief Bioinform. 2022.

  24. Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015;33(8):831–8.

  25. Pan X, Rijnbeek P, Yan J, Shen H-B. Prediction of RNA-protein sequence and structure binding preferences using deep convolutional and recurrent neural networks; 2017.

  26. Orenstein Y, Wang Y, Berger B. RCK: accurate and efficient inference of sequence-and structure-based protein–RNA binding models from RNAcompete data. Bioinformatics. 2016;32(12):i351–9.

  27. Le Q, Mikolov T, editors. Distributed representations of sentences and documents. In: International conference on machine learning. PMLR; 2014.

  28. Glažar P, Papavasileiou P, Rajewsky N. circBase: a database for circular RNAs. RNA. 2014;20(11):1666–70.

  29. Mikolov T, Chen K, Corrado GS, Dean J, eds. Efficient estimation of word representations in vector space. In: International conference on learning representations; 2013.

  30. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017;66:30.

  31. Eldele E, Ragab M, Chen Z, Wu M, Kwoh CK, Li X, et al. Time-series representation learning via temporal and contextual contrasting; 2021. p. 2352–9.

  32. Yuan L, Chen Y, Wang T, Yu W, Shi Y, Jiang Z, et al., eds. Tokens-to-token ViT: training vision transformers from scratch on ImageNet. In: 2021 IEEE/CVF international conference on computer vision (ICCV); 2021.

  33. Han K, Wang Y, Chen H, Chen X, Guo J, Liu Z, et al. A survey on vision transformer. IEEE Trans Pattern Anal Mach Intell. 2023;45(1):87–110.

  34. Xiong R, Yang Y, He D, Zheng K, Zheng S, Xing C, et al. On layer normalization in the transformer architecture; 2020. p. 10524–33.

  35. Wang Q, Li B, Xiao T, Zhu J, Li C, Wong DF, et al. Learning deep transformer models for machine translation. arXiv preprint arXiv:1906.01787. 2019.

  36. Klein G, Kim Y, Deng Y, Senellart J, Rush AM. OpenNMT: open-source toolkit for neural machine translation; 2017. p. 67–72.


The authors would like to acknowledge the support of the High Performance Computing (HPC) Platform of The Huizhou University, whose computing resources were used to perform some of the computations.


This work was supported by the National Natural Science Foundation of China under Grants (62062010, 62061003); and the Basic Ability Promotion Project of Guangxi Middle and Young University Teachers (2023KY1638).

Author information

Authors and Affiliations



CC: Conceptualization, Methodology, Software, Writing - Original draft preparation. ML: Visualization, Software Validation. SY: Writing - Reviewing and Editing, Supervision. CL: Funding acquisition, Project administration. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Shuhong Yang or Chungui Li.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Cao, C., Yang, S., Li, M. et al. CircSSNN: circRNA-binding site prediction via sequence self-attention neural networks with pre-normalization. BMC Bioinformatics 24, 220 (2023).
