Statistical analysis of k-mers of lncRNA and mRNA sequences
The human lncRNA and mRNA data were downloaded from the GENCODE database (GENCODE v26), and the mouse lncRNA and mRNA data were downloaded from the GENCODE database (GENCODE vM21). The chicken lncRNA and mRNA data were downloaded from the Ensembl database (release 5.0). The human data were used to build the convolutional neural network model, while the mouse and chicken data were used to evaluate how well the built model performs. The computing environment used in this paper was as follows: (1) operating system: Windows 10, with an Intel Core i3-2365M processor and 6 GB of memory; (2) Python 3.5 to run the CNN code.
Studies have shown that the k-mer frequency information of lncRNA and mRNA sequences can reveal the distribution of various subsequences in biological sequences and thus be used to measure the similarities and differences between sequences [29].
A k-mer refers to any subsequence of length k in a DNA sequence, RNA sequence, or amino acid sequence. Figure 3 shows the process of extracting k-mers from a sequence with a sliding window when k is three; the example contains 21 3-mers, namely, GCC, CCA, CAA, AAC, ACG, CGC, GCC, CCA, CAG, AGG, GGC, GCC, CCG, CGA, GAC, ACC, CCA, CAG, AGT, GTT, and TTC. Among them, GCC and CCA appear three times, CAG appears twice, and the others appear once. In the same way, we can count the k-mer frequency information of lncRNA and mRNA sequences.
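The sliding-window count described above can be reproduced in a few lines. As a sketch, the sequence below is reconstructed from the 21 3-mers listed for Fig. 3:

```python
from collections import Counter

def count_kmers(seq, k):
    """Count every length-k subsequence of seq with a sliding window."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# 23-nt example sequence reconstructed from the 3-mers listed for Fig. 3
seq = "GCCAACGCCAGGCCGACCAGTTC"
kmers = count_kmers(seq, 3)

print(sum(kmers.values()))                       # 21 3-mers in total
print(kmers["GCC"], kmers["CCA"], kmers["CAG"])  # 3 3 2
```

A sequence of length m yields m − k + 1 windows, so the 23-nt example gives 21 3-mers, matching the figure.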
For a sequence of length m, the number of k-mer subsequences of length k is m − k + 1. A nucleotide sequence generally consists of the four bases A, T, C, and G, so there are 4^k possible k-mers of length k.
We randomly selected 5000 lncRNA sequences and 5000 mRNA sequences from the human data and counted the k-mer frequency information for k = 1 and k = 3, as shown in Figs. 4 and 5. When k = 1, the base composition itself is informative: the higher the content of bases G and C, the higher the thermal stability of the DNA molecule. When k = 2, the dinucleotide preferences can be analysed; the frequency statistics of dinucleotides may reflect certain characteristics of different species in different environments. For example, CG may mark a methyl-CpG island, and TA may be part of a TATA box. When k = 3, the coding and non-coding regions of a sequence can be distinguished by counting the codon usage preferences of base triplets. Therefore, we analysed the k-mer frequencies of the lncRNA and mRNA sequence samples separately to examine their differences.
From Fig. 4, the average counts of the A, C, G, and T bases in the lncRNA sequences were approximately 254 nt, 217 nt, 216 nt, and 240 nt, respectively, while the mRNA sequences had average A, C, G, and T counts of approximately 364 nt, 420 nt, 422 nt, and 343 nt, respectively. Therefore, the counts of all four 1-mers were higher in the mRNA than in the lncRNA. Furthermore, in both the lncRNA and mRNA sequences, the counts of the C and G bases were very similar, as were the counts of the A and T bases. However, in the lncRNA sequences, the C and G counts were lower than the A and T counts, whereas in the mRNA sequences the opposite was true.
When k is three, the 3-mer fragments appearing in a sequence are {AAA, AAT, AAC, AAG, ATA, ATT, ATC, ATG, ..., GGA, GGT, GGC, GGG}. The frequency of each 3-mer fragment in each sample sequence is counted in turn, and the mean values are calculated to estimate the frequency of occurrence of each 3-mer in the lncRNA and mRNA sequences. Histograms of the 3-mer frequency distributions in the mRNA and lncRNA are plotted in Fig. 5a and b. As seen there, the frequency distribution of most 3-mer subsequences in the mRNA sequences fluctuated sharply: the frequencies of the 3-mers GCG, CGG, CGC, CGT, CGA, TCG, and ACG were small, only approximately 4 per sequence, while the TGG, CAG, CTG, CCA, CCT, GCC, and GGC fragments were enriched, all with frequencies of approximately 40. In the lncRNA sequences, the frequency distribution of the 3-mer subsequences was relatively stable, with most distributed around 15 with little fluctuation. A few 3-mers, such as AAG, AGA, AGG, TGG, CAG, CTG, CCA, CCT, GAA, and GGA, had higher frequencies, exceeding 20, and only four 3-mers, ACG, TCG, CGA, and CGT, had frequencies lower than 5.
Analysis of the 1-mer and 3-mer frequency distributions of the lncRNA and mRNA sequences showed that the k-mer distributions of the two have their own preferences. Therefore, the k-mer frequency distribution of a sequence can be used as difference information for distinguishing lncRNA from mRNA sequences.
The lncRNA and mRNA classification model based on the convolutional neural network
In this paper, a convolutional neural network algorithm was used to construct a model suitable for classifying gene sequences based on the transformation of sequences into the k-mer frequency matrix. The model framework is shown in Fig. 6.
As shown in Fig. 6, the lncRNA and mRNA classification model includes an input part and a convolutional neural network part. The input section extracts the k-mers from the sequence and assembles them into a k-mer frequency matrix. The convolutional neural network consists of two kinds of layers. The first is the k-mer feature extraction layer, in which the input of each neuron is connected to the local receptive field of the previous layer and local features are extracted; once a local feature is extracted, its positional relationship with the other k-mer features is also determined. The second is the k-mer feature mapping layer. Each computing layer of the network consists of multiple feature maps; each feature map is a plane, and all neurons on the plane share equal weights. The k-mer feature mapping structure uses the sigmoid function as the activation function of the convolutional network so that the feature maps have displacement invariance. In addition, since all neurons on a mapped surface share weights, the number of network parameters is reduced. Each convolutional layer in the convolutional neural network is followed by a computational layer for local averaging and secondary extraction of the k-mer features. This two-stage feature extraction structure reduces the feature resolution.
Construction of the k-mer frequency matrix
The k-mer counts of each sequence are first normalized and converted into frequencies. Then, following the practice of convolutional neural networks in image recognition, the k-mer frequencies of each sequence are arranged into a matrix of the size expected by the model input. Finally, the convolutional neural network autonomously learns the differences in the k-mer frequency information between the two classes of sequences so as to classify and identify lncRNAs and mRNAs. The specific process is as follows:
- Step 1: The k-mer frequency information of each sequence is counted. In this paper, the k-mers are enumerated in the base order A, T, C, G, and the frequency of each 4-mer is counted;
- Step 2: Normalize the k-mer counts within each sequence. That is, the counts of all k-mers in a sequence are converted to frequencies pi (i = 1, ..., n, where n = 4^k) such that the sum of the k-mer frequencies in each sequence is 1;
- Step 3: The frequencies of all k-mers in each sequence are arranged into an A × B matrix (A × B = 4^k), with the elements placed row by row in k-mer order. For example, when k is 4, the constructed matrix is 16 × 16.
K-mer screening based on relative entropy
As the value of k increases, the number of distinct k-mers of length k grows exponentially. If k is large, the average frequency of each k-mer becomes low, and numerous k-mers have a frequency of zero. To reduce the computational complexity, we used relative entropy to screen the k-mers.
Let plnc be the frequency distribution of the k-mer in the lncRNA sequence and pm be the frequency distribution of the k-mer in the mRNA sequence. The relative entropy of plnc and pm is then
$$ D\left({p}_{\ln c},{p}_m\right)={\sum}_{i=1}^{4^k}{p}_{\ln c}(i)\ln \frac{p_{\ln c}(i)}{p_m(i)},\kern1em i\in \left[1,{4}^k\right], $$
(1)
If plnc = pm, then D(plnc, pm) = 0, which indicates that the k-mer frequency distribution of the lncRNA sequences does not differ from that of the mRNA sequences. If there is a difference in the k-mer frequency distributions of lncRNA and mRNA, then D(plnc, pm) will be greater than zero. Moreover, the smaller the value of D(plnc, pm), the smaller the difference in the k-mer frequency distributions between lncRNA and mRNA; conversely, the larger the value of D(plnc, pm), the greater the difference. To screen out the k-mers that carry the most difference information, set
$$ {d}_{\lambda }={p}_{\ln c}\left(\lambda \right)\ln \frac{p_{\ln c}\left(\lambda \right)}{p_m\left(\lambda \right)},\kern1em \lambda \in \left[1,{4}^k\right], $$
(2)
Sorting dλ in descending order and accumulating the first n terms yields
$$ R=\frac{\sum_{\lambda =1}^n{d}_{\lambda }}{D\left({p}_{\ln c},{p}_m\right)},n\in \left[1,{4}^k\right]. $$
(3)
R reflects the proportion of the k-mer difference information captured by the first n sorted k-mers. R is calculated successively for n from 1 to 4^k. If the threshold is set at R ≥ 98%, the smallest n satisfying it is denoted ϖ, and the first ϖ k-mers are the filtered k-mers.
Specific steps are as follows:
- Step 1: The occurrences of each k-mer are counted separately over the lncRNA sample sequences and over the mRNA sample sequences, and the counts are then converted into frequencies, yielding the frequency values of the 4^k kinds of k-mers in each sample. For example, when k = 4, the total counts of the 256 4-mers in the lncRNA and mRNA sample sequences are obtained respectively, and each count is converted into a frequency, so that the frequency values of the 256 4-mers sum to 1 within each of the two samples. The results of this step are plnc and pm in formula (1).
- Step 2: Based on the k-mer frequency values of the two samples obtained in Step 1, the relative entropy D(plnc, pm) is calculated according to formula (1). The dλ values are then sorted in descending order, R is calculated according to formula (3), and λ is taken as ϖ when R first reaches 98%; the first ϖ k-mers in the sorted order are the filtered k-mers.
- Step 3: The frequencies of the k-mers screened in Step 2 are counted separately for the lncRNA and mRNA sequences, and these frequencies are then arranged into a matrix following the data input processing steps described above.
Convolution calculation of the k-mer frequency matrix
In this paper, the convolution calculation is used to strengthen the important features in the k-mer frequency matrix and to weaken the influence of irrelevant k-mer features.
Taking k = 3 as an example, 64 k-mers can be extracted from an lncRNA sequence. According to the k-mer frequency, the lncRNA sequence can be constructed into an 8 × 8 k-mer frequency matrix M,
$$ M=\left[\begin{array}{cccccccc}0.0059& 0.0102& 0.0117& 0.0190& 0.0059& 0.0029& 0.0059& 0.0102\\ {}0.0088& 0.0073& 0.0220& 0.0059& 0.0220& 0.0146& 0.0234& 0.0146\\ {}0.0044& 0.0044& 0.0088& 0.0088& 0.0073& 0.0102& 0.0176& 0.0161\\ {}0.0132& 0.0176& 0.0322& 0.0117& 0.0117& 0.0220& 0.0249& 0.0146\\ {}0.0161& 0.0044& 0.0102& 0.0337& 0.0088& 0.0264& 0.0293& 0.0264\\ {}0.0278& 0.0439& 0.0366& 0.0190& 0.0044& 0.0132& 0.0190& 0.0102\\ {}0.0205& 0.0073& 0.0117& 0.0176& 0.0044& 0.0117& 0.0220& 0.0190\\ {}0.0161& 0.0220& 0.0366& 0.0102& 0.0146& 0.0073& 0.0176& 0.0161\end{array}\right]. $$
(4)
The convolution calculation is performed by taking Eq. (4) as input and randomly setting a 3 × 3 convolution kernel,
$$ \mathrm{Kernel}=\left[\begin{array}{ccc}1& 0& 1\\ {}0& 1& 0\\ {}1& 0& 0\end{array}\right]. $$
(5)
The convolution calculation is actually a process of weighted summation. The calculation format is usually
$$ {X}_j^l=f\left(\sum \limits_{i\in {M}_j}{X}_i^{l-1}\times {Kernel}_{ij}^l+{B}^l\right), $$
(6)
where f is the activation function of the neurons in the layer, l indicates the layer number in the network, Kernel is the convolution kernel, Mj is a local region of the input object, and B represents the bias of each layer.
In Eq. (5), the convolution kernel has nine parameters. The k-mer frequency matrix in Eq. (4) and the convolution kernel are convolved by the calculation method of Eq. (6), with Bl set to 0. The specific calculation process is shown in Eq. (7).
$$ {\displaystyle \begin{array}{c}\mathrm{feature}\ \mathrm{map}=\left[\begin{array}{cccccccc}{0.0059}_{\times 1}& {0.0102}_{\times 0}& {0.0117}_{\times 1}& 0.0190& 0.0059& 0.0029& 0.0059& 0.0102\\ {}{0.0088}_{\times 0}& {0.0073}_{\times 1}& {0.0220}_{\times 0}& 0.0059& 0.0220& 0.0146& 0.0234& 0.0146\\ {}{0.0044}_{\times 1}& {0.0044}_{\times 0}& {0.0088}_{\times 0}& 0.0088& 0.0073& 0.0102& 0.0176& 0.0161\\ {}0.0132& 0.0176& 0.0322& 0.0117& 0.0117& 0.0220& 0.0249& 0.0146\\ {}0.0161& 0.0044& 0.0102& 0.0337& 0.0088& 0.0264& 0.0293& 0.0264\\ {}0.0278& 0.0439& 0.0366& 0.0190& 0.0044& 0.0132& 0.0190& 0.0102\\ {}0.0205& 0.0073& 0.0117& 0.0176& 0.0044& 0.0117& 0.0220& 0.0190\\ {}0.0161& 0.0220& 0.0366& 0.0102& 0.0146& 0.0073& 0.0176& 0.0161\end{array}\right]\\ {}=\left[\begin{array}{cccccc}0.0293& 0.0556& 0.0323& 0.0527& 0.0337& 0.0467\\ {}0.0484& 0.0396& 0.0850& 0.0495& 0.0673& 0.0688\\ {}0.0469& 0.0498& 0.0480& 0.0644& 0.0657& 0.0776\\ {}0.0776& 0.0834& 0.1142& 0.0615& 0.0674& 0.0791\\ {}0.0907& 0.0820& 0.0497& 0.0821& 0.0557& 0.0835\\ {}0.0878& 0.0966& 0.0952& 0.0468& 0.0497& 0.0527\end{array}\right]\\ {}\kern2.75em \end{array}} $$
(7)
The first element of the feature map in Eq. (7) is the weighted sum of the first 3 × 3 local block of the input matrix M with the corresponding elements of the convolution kernel. Similarly, the second element is the weighted sum of the second 3 × 3 local block with the corresponding kernel elements. Finally, an output feature map of size 6 × 6 is obtained.
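The convolution of Eqs. (4)–(7) can be checked numerically; the sketch below reproduces the 6 × 6 feature map from the matrix M of Eq. (4) and the kernel of Eq. (5):

```python
# 8 x 8 k-mer frequency matrix M from Eq. (4)
M = [
    [0.0059, 0.0102, 0.0117, 0.0190, 0.0059, 0.0029, 0.0059, 0.0102],
    [0.0088, 0.0073, 0.0220, 0.0059, 0.0220, 0.0146, 0.0234, 0.0146],
    [0.0044, 0.0044, 0.0088, 0.0088, 0.0073, 0.0102, 0.0176, 0.0161],
    [0.0132, 0.0176, 0.0322, 0.0117, 0.0117, 0.0220, 0.0249, 0.0146],
    [0.0161, 0.0044, 0.0102, 0.0337, 0.0088, 0.0264, 0.0293, 0.0264],
    [0.0278, 0.0439, 0.0366, 0.0190, 0.0044, 0.0132, 0.0190, 0.0102],
    [0.0205, 0.0073, 0.0117, 0.0176, 0.0044, 0.0117, 0.0220, 0.0190],
    [0.0161, 0.0220, 0.0366, 0.0102, 0.0146, 0.0073, 0.0176, 0.0161],
]
KERNEL = [[1, 0, 1], [0, 1, 0], [1, 0, 0]]  # convolution kernel from Eq. (5)

def conv2d_valid(matrix, kernel):
    """Slide the kernel over the matrix (stride 1, no padding) and
    return the weighted sums, i.e. the feature map of Eq. (7)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(matrix) - kh + 1):
        row = []
        for c in range(len(matrix[0]) - kw + 1):
            row.append(sum(matrix[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

fm = conv2d_valid(M, KERNEL)
print(len(fm), len(fm[0]))  # 6 6
print(round(fm[0][0], 4))   # 0.0293
```

The first output element 0.0293 matches the top-left entry of the feature map in Eq. (7).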
The convolutional neural network of this model has two convolutional layers. The first convolutional layer uses 32 convolution kernels, and the second uses 64. Each convolution kernel is 3 × 3, and the horizontal and vertical strides are 1. The border is filled with zeros ("same" padding) to ensure that the matrix keeps the same size as before convolution; for example, the 6 × 6 feature map of Eq. (7) surrounded by a zero border is
$$ {O}_j=\left[\begin{array}{cccccccc}0& 0& 0& 0& 0& 0& 0& 0\\ {}0& 0.0293& 0.0556& 0.0323& 0.0527& 0.0337& 0.0467& 0\\ {}0& 0.0484& 0.0396& 0.0850& 0.0495& 0.0673& 0.0688& 0\\ {}0& 0.0469& 0.0498& 0.0480& 0.0644& 0.0657& 0.0776& 0\\ {}0& 0.0776& 0.0834& 0.1142& 0.0615& 0.0674& 0.0791& 0\\ {}0& 0.0907& 0.0820& 0.0497& 0.0821& 0.0557& 0.0835& 0\\ {}0& 0.0878& 0.0966& 0.0952& 0.0468& 0.0497& 0.0527& 0\\ {}0& 0& 0& 0& 0& 0& 0& 0\end{array}\right]. $$
(8)
Pooling calculation of the convolution kernel output matrix
The pooling method adopted in this paper is max pooling, which aims to compress the k-mer features and reduce the amount of calculation. The model has only one pooling layer, placed after the second convolutional layer, where the pooling window slides with a stride of 2.
For Eq. (8), the maximum value of the first 2 × 2 block of matrix Oj is 0.0293. The final result of the pooled calculation of matrix Oj is
$$ {S}_j=\left[\begin{array}{cccc}0.0293& 0.0556& 0.0527& 0.0467\\ {}0.0484& 0.0850& 0.0673& 0.0776\\ {}0.0907& 0.1142& 0.0821& 0.0835\\ {}0.0878& 0.0966& 0.0497& 0.0527\end{array}\right]. $$
(9)
Since 64 feature matrices of size 6 × 6 were obtained after the second convolutional layer and the pooling window was 2 × 2, the number of rows and columns of each matrix is halved, yielding 64 matrices of size 3 × 3.
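The pooling of Eqs. (8)–(9) can likewise be checked; the sketch below applies a 2 × 2, stride-2 max pool to the padded matrix Oj:

```python
def max_pool(matrix, size=2, stride=2):
    """Max pooling: keep the largest value in each size x size window."""
    out = []
    for r in range(0, len(matrix) - size + 1, stride):
        row = []
        for c in range(0, len(matrix[0]) - size + 1, stride):
            row.append(max(matrix[r + i][c + j]
                           for i in range(size) for j in range(size)))
        out.append(row)
    return out

# Zero-padded feature map O_j from Eq. (8)
O = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0.0293, 0.0556, 0.0323, 0.0527, 0.0337, 0.0467, 0],
    [0, 0.0484, 0.0396, 0.0850, 0.0495, 0.0673, 0.0688, 0],
    [0, 0.0469, 0.0498, 0.0480, 0.0644, 0.0657, 0.0776, 0],
    [0, 0.0776, 0.0834, 0.1142, 0.0615, 0.0674, 0.0791, 0],
    [0, 0.0907, 0.0820, 0.0497, 0.0821, 0.0557, 0.0835, 0],
    [0, 0.0878, 0.0966, 0.0952, 0.0468, 0.0497, 0.0527, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
S = max_pool(O)
print(S[0])  # [0.0293, 0.0556, 0.0527, 0.0467] -- first row of S_j in Eq. (9)
```

The full 4 × 4 output reproduces the matrix Sj of Eq. (9).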
Fully connected neural network based on SoftMax function
In order to prevent over-fitting and improve the generalization ability of the model, we drop the connections of some neurons with a certain probability, so that some neurons are not activated in each training pass. After the pooling layer, the connections between the output neurons of the pooling layer and the neurons of the fully connected layer were dropped with a probability of 0.25. The output matrices of the pooling layer are flattened and connected to 128 neurons in the fully connected layer. The activation function is still the ReLU function.
We used the SoftMax function to activate the output of the fully connected network in the model. The formula of the SoftMax function is
$$ f\left({z}_j\right)=\frac{e^{z_j}}{\sum_{i=1}^n{e}^{z_i}}. $$
(10)
From Eq. (10), if zj is greater than the other z values, the value of the function f(zj) approaches 1; otherwise it approaches 0. Therefore, when the value of f(zj) is 1, the input sequence is judged to be an lncRNA sequence, and when the value of f(zj) is 0, the input sequence is judged to be an mRNA sequence.
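A small numerical check of Eq. (10) follows; the assignment of output index 0 to the lncRNA class is an assumption for illustration:

```python
import math

def softmax(z):
    """Normalized exponentials of Eq. (10); shifting by max(z) improves numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

# Two output neurons: index 0 = lncRNA, index 1 = mRNA (hypothetical labelling)
probs = softmax([4.2, -1.3])
print(probs[0] > 0.99)  # the larger logit dominates, so f(z_0) approaches 1
```

Because the outputs sum to 1, the larger logit drives its probability toward 1 and the other toward 0, which is exactly the decision rule described above.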
The Adadelta optimizer is used for gradient descent during training, and the cross-entropy loss function is used as the loss function.
Setting of the evaluation index in the classification model
The indicator for evaluating the performance of a classification model is generally the classification accuracy, also known as the model accuracy. Commonly used evaluation indicators for the binary classification problem are precision and recall. In this paper, the lncRNA sequences were the positive class, and the mRNA sequences were the negative class. Each prediction of the classifier on the test data set is either correct or incorrect, and the totals of the four resulting cases were recorded as follows:
TP: the number of positive instances predicted as positive;
FN: the number of positive instances predicted as negative;
FP: the number of negative instances predicted as positive;
TN: the number of negative instances predicted as negative.
The precision rate is defined as
$$ P=\frac{TP}{TP+ FP}, $$
(11)
The recall rate is defined as
$$ R=\frac{TP}{TP+ FN}. $$
(12)
In addition, the F1 score is the harmonic mean of the precision rate and the recall rate, i.e.,
$$ \frac{2}{F_1}=\frac{1}{P}+\frac{1}{R}, $$
(13)
$$ {F}_1=\frac{2 TP}{2 TP+ FP+ FN}, $$
(14)
If both the precision and recall rates are high, then F1 will be high [30].
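As a quick check of Eqs. (11), (12), and (14), the three indicators can be computed directly from the confusion-matrix counts (the numbers below are hypothetical):

```python
def precision_recall_f1(tp, fp, fn):
    """Eqs. (11), (12), and (14) from the confusion-matrix counts."""
    p = tp / (tp + fp)                 # precision, Eq. (11)
    r = tp / (tp + fn)                 # recall, Eq. (12)
    f1 = 2 * tp / (2 * tp + fp + fn)   # F1, the harmonic mean of p and r, Eq. (14)
    return p, r, f1

# Hypothetical test-set counts: 90 lncRNAs correctly found, 10 false alarms, 10 misses
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=10)
print(p, r, f1)  # 0.9 0.9 0.9
```

Substituting Eqs. (11) and (12) into Eq. (13) gives 2/F1 = (2TP + FP + FN)/TP, which rearranges to Eq. (14); the function above uses that closed form directly.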