Data sets and measurements
In this paper, we choose ASE dataset of RNase P type, RFA dataset of Hammerhead Ribozyme type, SPR dataset of Transfer RNA type, TMR dataset of tmRNA type and SPR dataset of Signal Recognition Particle RNA type to predict RNA secondary structure. The reasons are as follows: first, these five RNA data sets are five typical RNA secondary structure prediction data sets; Secondly, these five kinds of RNA data sets all have pseudoknots, which is in line with our research problems; Finally, these five kinds of RNA data sets include the cases that the maximum length of sequences is far greater than the minimum length, the maximum length is close to the minimum length, and the paired bases are larger than, approximately equal to, and smaller than the unpaired bases. These five RNA data sets ensure the persuasiveness of our experimental results [19,20,21]. The statistics of each subset are shown in Fig. 5, where ‘Pseudoknots’ represent the number of pseudoknots in the dataset, ‘Average length’ represents the average length of the dataset sequence, ‘Max length’ represents the maximum sequence length of the dataset, and ‘Min length’ represents the minimum sequence length of the dataset. The situation of paired bases and unpaired bases in each subset is shown in Fig. 1, where ‘paired’ represents the number of paired bases, ‘unpaired’ represents the number of unpaired bases, and ‘difference’ represents the difference between paired bases and unpaired bases.
In this paper, four indicators are used to evaluate the models: sensitivity (SEN), positive predictive value (PPV), Matthews correlation coefficient (MCC) and accuracy (ACC). MCC is an evaluation metric that combines SEN and PPV [22, 23]. Their calculation methods are shown in Eq. (1).
$$\left\{ \begin{aligned} SEN &= \frac{TP}{TP + FN} \\ PPV &= \frac{TP}{TP + FP} \\ MCC &= \frac{TP \times TN - FP \times FN}{\sqrt{\left( TP + FP \right)\left( TP + FN \right)\left( TN + FP \right)\left( TN + FN \right)}} \\ ACC &= \frac{TP + TN}{TP + TN + FP + FN} \end{aligned} \right.$$
(1)
where TP represents the number of correctly predicted paired bases; TN represents the number of correctly predicted unpaired bases; FP is the number of bases predicted to be paired but actually unpaired; and FN is the number of bases predicted to be unpaired but actually paired. MCC ranges from -1 to 1, while the other three indicators range from 0 to 1. For all four indicators, larger values indicate better prediction performance.
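As a concrete illustration, the following Python sketch computes the four indicators of Eq. (1) from the confusion-matrix counts; the function name and interface are our own, not from the paper.

```python
import math

def evaluate(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute SEN, PPV, MCC and ACC from confusion-matrix counts (Eq. 1)."""
    sen = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    acc = (tp + tn) / (tp + tn + fp + fn)
    return {"SEN": sen, "PPV": ppv, "MCC": mcc, "ACC": acc}

# Example: 80 paired bases predicted correctly, 90 unpaired bases predicted
# correctly, 10 false pairings and 20 missed pairings.
print(evaluate(tp=80, tn=90, fp=10, fn=20))
```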
The prediction of RNA secondary structure
Bases are the basic units of RNA structure, and stable base pairs are formed by hydrogen-bond interactions between bases [24]. Correct prediction of the secondary structure is a strong guarantee for predicting the RNA tertiary structure. Pseudoknots are complex and stable structures found in many biological cells. A pseudoknot refers to a crossing between paired base pairs: if base i is paired with base j, base m is paired with base n, and their positions in the RNA sequence satisfy i < m < j < n, the RNA sequence is said to contain a pseudoknot [25, 26]. Although not all RNA secondary structures contain pseudoknots, pseudoknots have an important influence on the function of RNA. Therefore, to analyze the real structure of RNA, the pseudoknot problem must be solved. At present, the prediction of RNA secondary structure with pseudoknots has received extensive attention, and it remains a major difficulty in RNA secondary structure prediction.
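The crossing condition i < m < j < n translates directly into code. The following Python sketch (our own illustration, not the paper's implementation) checks whether a list of base pairs contains a pseudoknot:

```python
def has_pseudoknot(pairs: list[tuple[int, int]]) -> bool:
    """Return True if any two base pairs (i, j) and (m, n) cross,
    i.e. satisfy i < m < j < n (positions are 1-based sequence indices)."""
    # Normalize each pair so the smaller index comes first.
    norm = [(min(i, j), max(i, j)) for i, j in pairs]
    for a, (i, j) in enumerate(norm):
        for m, n in norm[a + 1:]:
            if i < m < j < n or m < i < n < j:
                return True
    return False

# The pairs (2, 8) and (5, 12) cross (2 < 5 < 8 < 12), so this is a pseudoknot.
print(has_pseudoknot([(2, 8), (5, 12)]))   # True
print(has_pseudoknot([(2, 12), (5, 8)]))   # False: (5, 8) is nested inside (2, 12)
```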
From the perspective of machine learning, predicting RNA secondary structure means extracting the relevant primary-structure information of the RNA sequence, preprocessing it into structured data that serves as the input of a machine learning model, and training the model so that the pairing of each base in the RNA sequence is predicted correctly to the greatest possible extent. This process can be viewed as a supervised multi-class classification task.
Feature selection and generation
According to Mathews et al. [27], the base-pairing probabilities predicted by the partition function are positively correlated with the probabilities that the corresponding bases are actually paired. Therefore, this paper takes the output of the partition function as part of the input features to improve the prediction accuracy of RNA secondary structure. In addition, the more frequently a base type appears in the sequence, the more likely bases of that type are to be paired. Accordingly, the input features in this experiment are as follows:
1. The output of the partition function, calculated with the RNAstructure software: for an RNA sequence of length n, the calculation yields an n*n matrix of base-pairing probabilities.

2. The frequency of each base type in the sequence, i.e., the proportion of the sequence occupied by that base type, recorded as a one-dimensional vector.

3. Base type information. The RNA primary structure is represented by the four bases A, G, C and U, each encoded as a four-dimensional one-hot vector. The rules are: A-0001, G-0010, C-0100, U-1000, others-0000 (a minimal encoding sketch follows this list).
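As an illustration of item 3, a minimal Python sketch of the base-type encoding (the helper name encode_base is our own):

```python
# One-hot encoding of base types: A-0001, G-0010, C-0100, U-1000, others-0000.
ONE_HOT = {"A": [0, 0, 0, 1], "G": [0, 0, 1, 0], "C": [0, 1, 0, 0], "U": [1, 0, 0, 0]}

def encode_base(base: str) -> list[int]:
    return ONE_HOT.get(base.upper(), [0, 0, 0, 0])  # unknown symbols map to 0000

print(encode_base("G"))  # [0, 0, 1, 0]
```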
Therefore, for each base, after selecting, transforming and expanding the data features, its input can be regarded as an (N + 5)-dimensional vector, where N is the maximum number of recursion steps of the recurrent neural network, that is, the maximum sequence length in the data set to which the sequence belongs.
The input of the model is a three-dimensional array X[i, j, k]: the first dimension i indexes the i-th sequence in the batch, ranging from 1 to batch_size; the second dimension j indexes the j-th base of a sequence, ranging from 1 to N; and the third dimension k indexes the k-th feature of a base, ranging from 1 to (N + 5). Here batch_size is the number of training samples fed to the model at a time, which is set to 200 in this experiment.
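Putting the three feature groups together, the following sketch assembles the (N + 5)-dimensional feature vector of each base, assuming the layout N pairing probabilities + 1 frequency + 4 one-hot bits (our reading of the text); the partition-function matrix is taken as precomputed input, and all helper names are our own:

```python
import numpy as np

def build_features(seq: str, pair_prob: np.ndarray, N: int) -> np.ndarray:
    """Build the (N, N + 5) feature matrix of one sequence.

    seq       : RNA sequence of length n <= N
    pair_prob : n x n partition-function pairing-probability matrix (feature 1)
    N         : maximum sequence length in the data set
    """
    one_hot = {"A": [0, 0, 0, 1], "G": [0, 0, 1, 0], "C": [0, 1, 0, 0], "U": [1, 0, 0, 0]}
    n = len(seq)
    X = np.zeros((N, N + 5), dtype=np.float32)
    for i, base in enumerate(seq):
        X[i, :n] = pair_prob[i]                          # feature 1: pairing probabilities
        X[i, N] = seq.count(base) / n                    # feature 2: base-type frequency
        X[i, N + 1:] = one_hot.get(base, [0, 0, 0, 0])   # feature 3: base type
    return X  # rows n..N-1 stay zero: padding later filtered out by the flag vector
```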
The output of the model is a two-dimensional array Y. \(Y[i, j] = 0\) means that the (j + 1)-th base in the (i + 1)-th sequence is not paired with any base; otherwise, the (j + 1)-th base of the (i + 1)-th sequence is paired with the \(Y[i, j]\)-th base.
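For example, under our reading of this encoding, a 1-based list of base pairs can be converted into one row of Y as follows (a hypothetical helper, not from the paper):

```python
def pairs_to_labels(pairs: list[tuple[int, int]], n: int) -> list[int]:
    """Turn 1-based base pairs (i, j) into one row of Y:
    y[k] = 0 if base k+1 is unpaired, else the 1-based index of its partner."""
    y = [0] * n
    for i, j in pairs:
        y[i - 1] = j
        y[j - 1] = i
    return y

# Bases 2 and 8, and bases 5 and 12, are paired in a sequence of length 12.
print(pairs_to_labels([(2, 8), (5, 12)], n=12))
# [0, 8, 0, 0, 12, 0, 0, 2, 0, 0, 0, 5]
```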
Bidirectional recursive GRU
GRU is an improved algorithm proposed to overcome the inability of the traditional recurrent neural network to handle long-distance dependencies well [28, 29]. The algorithm adds two gates, namely an update gate and a reset gate, whose expressions are shown in Eq. (2).
$$\left\{ \begin{aligned} r_{t} &= \sigma \left( W_{xr} x_{t} + W_{hr} h_{t-1} + b_{r} \right) \\ z_{t} &= \sigma \left( W_{xz} x_{t} + W_{hz} h_{t-1} + b_{z} \right) \\ \tilde{h}_{t} &= \tanh \left( W_{xh} x_{t} + W_{hh} \left( r_{t} \times h_{t-1} \right) + b_{h} \right) \\ h_{t} &= z_{t} \times h_{t-1} + \left( 1 - z_{t} \right) \times \tilde{h}_{t} \end{aligned} \right.$$
(2)
Among them, x is the input of the model, which in this paper is the three-dimensional array X[i, j, k] described earlier. The input features of each base correspond to one recursion step, and the total number of recursion steps is the maximum sequence length N. r denotes the reset gate, which decides which information to discard and which new information to add; z denotes the update gate, which determines the extent to which previous information is discarded; and \(\tilde{h}\) and h denote the new hidden state and the current hidden state, respectively. σ and tanh denote the sigmoid and tanh activation functions, and the operator × denotes element-wise multiplication; the weight matrices W and bias vectors b are the parameters that the model needs to train. The graphical abstraction is shown in Fig. 6.
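A direct NumPy transcription of Eq. (2), as an illustrative sketch (the toy parameter shapes are our own choice; in practice these weights are learned):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x_t, h_prev, W, b):
    """One GRU step per Eq. (2). W holds the six weight matrices, b the biases."""
    r = sigmoid(W["xr"] @ x_t + W["hr"] @ h_prev + b["r"])               # reset gate
    z = sigmoid(W["xz"] @ x_t + W["hz"] @ h_prev + b["z"])               # update gate
    h_tilde = np.tanh(W["xh"] @ x_t + W["hh"] @ (r * h_prev) + b["h"])   # candidate state
    return z * h_prev + (1.0 - z) * h_tilde                              # new hidden state

# Toy dimensions: 5 input features, hidden size 3.
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(3, 5)) for k in ("xr", "xz", "xh")}
W.update({k: rng.normal(size=(3, 3)) for k in ("hr", "hz", "hh")})
b = {k: np.zeros(3) for k in ("r", "z", "h")}
h = gru_cell(rng.normal(size=5), np.zeros(3), W, b)
```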
Since the secondary structure of RNA is formed by hydrogen-bond interactions between bases, the sequence information both before and after a given base influences the secondary structure. Therefore, a bidirectional recurrent neural network is selected for training in this experiment. A bidirectional recurrent neural network is a composite model that combines a forward-learning recurrent neural network with a backward-learning one. The calculation process is shown in Eq. (3).
$$h_{t} = \overrightarrow{h_{t}} + \overleftarrow{h_{t}}$$
(3)
The graphical abstraction of the bidirectional recurrent neural network is shown in Fig. 7. The weights w1 to w6 in the figure represent, respectively, the connections from the input to the forward and backward hidden layers, from the forward and backward hidden layers to themselves, and from the forward and backward hidden layers to the output layer.
Variable length bidirectional recursive GRU
Because RNA sequences differ in length, the traditional approaches of truncating long sequences or simply padding short ones either lose sequence information, wasting part of the data, or add redundant information to the sequence; both have a negative impact on RNA secondary structure prediction [30,31,32]. Therefore, this paper introduces a flag vector: short sequences are zero-padded in the data preprocessing stage, but in the training stage the padded positions are filtered out when the loss function is calculated for each base. Thus all effective information in the sequences is utilized, while the padded redundant information is not allowed to interfere with the model [33]. The calculation of the cross entropy of a sequence m in the variable-length bidirectional recursive GRU model is shown in Eqs. (4)-(6).
$$cross\_loss_{m} = \frac{\sum\nolimits_{i=1}^{N} flag\left[ i \right] \cdot loss_{m}\left[ i \right]}{\sum\nolimits_{i=1}^{N} flag\left[ i \right]}$$
(4)
$$flag\left[ i \right] = \begin{cases} 1, & i \le n \\ 0, & n < i \le N \end{cases}$$
(5)
$$loss_{m}\left[ i \right] = - \sum\limits_{j=1}^{n+1} y\_\left[ i,j \right] \cdot \log \left( y\left[ i,j \right] \right)$$
(6)
where the operator “\(\cdot\)” denotes element-wise (position-wise) multiplication, and \(loss_{m}[i]\) denotes the cross entropy of the i-th base of sequence m.
From Eq. (5), it can be seen that when n < i ≤ N, flag[i] takes the value 0, which makes \(flag[i] \cdot loss_{m}[i]\) equal to 0 as well; that is, the padded bases do not affect the value of the loss function \(cross\_loss_{m}\).
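A NumPy sketch of the masked cross entropy of Eqs. (4)-(6), under our reading that loss_m[i] is the per-base cross entropy (all helper names are our own):

```python
import numpy as np

def masked_cross_entropy(y_pred, y_true, n, eps=1e-9):
    """Eqs. (4)-(6): y_pred and y_true have shape (N, n_classes); only the
    first n rows correspond to real bases, the rest is zero padding."""
    N = y_pred.shape[0]
    flag = (np.arange(N) < n).astype(np.float32)                # Eq. (5)
    per_base = -np.sum(y_true * np.log(y_pred + eps), axis=1)   # Eq. (6), per base
    return np.sum(flag * per_base) / np.sum(flag)               # Eq. (4)

# Toy example: N = 4 positions, 3 classes, true sequence length n = 2.
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 0], [0, 0, 0]], dtype=np.float32)
y_pred = np.full((4, 3), 1 / 3, dtype=np.float32)
print(masked_cross_entropy(y_pred, y_true, n=2))  # padded rows do not contribute
```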
VLDB GRU
Since the ratio of paired to unpaired bases in the RNA STRAND data set is 2:3 [34, 35], and in the multi-class formulation of this experiment the pairing partner of each base is predicted separately, the ratio of the number of bases belonging to each category is 2/n:2/n:…:2/n:3 [36,37,38]. This is a severely imbalanced sample distribution, and the model tends to classify more bases into the “unpaired” category to obtain higher accuracy [39, 40]. To prevent the model from falling into this kind of local optimum, this paper assigns a weight to each base: if the base is a single base with no pairing partner, it is assigned 1; otherwise, it is assigned the number of unpaired bases in the sequence to which it belongs (Eq. (8)). The cross entropy of a sequence m is then calculated as shown in Eqs. (7) and (8). The training algorithm of the VLDB GRU model is shown in Table 1.
$$cross\_loss_{m} = \frac{\sum\nolimits_{i=1}^{N} flag\left[ i \right] \cdot weight\left[ i \right] \cdot loss_{m}\left[ i \right]}{\sum\nolimits_{i=1}^{N} flag\left[ i \right] \cdot weight\left[ i \right]}$$
(7)
$$weight\left[ i \right] = \begin{cases} 1, & y\_\left[ i,0 \right] \ne 0 \\ \sum\nolimits_{k=1}^{n} y\_\left[ k,0 \right], & y\_\left[ i,0 \right] = 0 \end{cases}$$
(8)
where y and y_ represent the predicted class probabilities of the model and the sample labels, respectively. The flag vector indicates whether a base exists at a position: 1 indicates existence and 0 indicates nonexistence. n is the length of the sequence. y_ is an array of size n*(n + 1). When y_[i, j] (j ≥ 1) equals 1, the (i + 1)-th base is paired with the j-th base; otherwise they are not paired. When y_[i, 0] equals 1, the (i + 1)-th base is not paired with any base.
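Extending the masked loss sketched earlier with the class weights of Eq. (8), again as an illustration with our own helper names:

```python
import numpy as np

def weighted_masked_cross_entropy(y_pred, y_true, n, eps=1e-9):
    """Eqs. (7)-(8): y_true has shape (N, n_classes); column 0 marks 'unpaired'."""
    N = y_pred.shape[0]
    flag = (np.arange(N) < n).astype(np.float32)               # Eq. (5)
    unpaired = np.sum(y_true[:n, 0])                           # number of unpaired bases
    # Eq. (8): unpaired bases weigh 1, paired bases weigh the unpaired count.
    weight = np.where(y_true[:, 0] != 0, 1.0, unpaired)
    per_base = -np.sum(y_true * np.log(y_pred + eps), axis=1)  # Eq. (6)
    return np.sum(flag * weight * per_base) / np.sum(flag * weight)  # Eq. (7)
```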
x and y_ in Table 1 represent the input features of the bases and their real labels, respectively, and L[i] corresponds to the loss function of Eq. (7).
Model parameter setting
The model maps the feature vector of each base through the input layer to a group of n-dimensional vectors, which serve as the input of the bidirectional GRU model. The output of the bidirectional GRU is followed by two fully connected layers and one output layer that finally classifies the data. Since the neural network may over-fit, a dropout layer is added to mitigate this problem. The overall design framework of the model is shown in Fig. 8.
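The paper does not specify an implementation framework; as one possible reading of Fig. 8, the following Keras sketch stacks the described layers (the layer sizes follow the next paragraph, the summation of forward and backward states follows Eq. (3), and the dropout rate and example value of N are our own placeholders):

```python
import tensorflow as tf

N = 120          # maximum sequence length of the data set (example value)
N_FEATURES = N + 5

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(N, N_FEATURES)),
    # Bidirectional GRU with 50 hidden units; merge_mode="sum" implements Eq. (3).
    tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(50, return_sequences=True), merge_mode="sum"),
    # Two fully connected layers with 150 neurons each.
    tf.keras.layers.Dense(150, activation="relu"),
    tf.keras.layers.Dense(150, activation="relu"),
    tf.keras.layers.Dropout(0.5),  # rate is a placeholder, not from the paper
    # N + 1 classes per base: unpaired, or paired with one of the N positions.
    tf.keras.layers.Dense(N + 1, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
              loss="categorical_crossentropy")
model.summary()
```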
In the VLDB GRU model designed in this paper, “variable length” means that the lengths of the RNA sequences to be predicted are inconsistent; therefore, we select the maximum sequence length in each data set as the number of GRU recursions. Extensive experiments show that the results are best when the number of GRU hidden-layer neurons is 50, the number of fully connected layer neurons is 150, the learning rate is 0.1, and the maximum number of iterations is 2000. The experimental results are shown in Figs. 9, 10, and 11, in which the y axis represents the prediction accuracy and the x axis represents the tested values of the number of fully connected layers, the number of hidden layers, and the learning rate, respectively.