Statistical analysis of kmers of lncRNA and mRNA sequences
The human lncRNAs and mRNAs data were downloaded from the GENCODE database (Gencode.v26). The mouse lncRNAs and mRNAs data were downloaded from the GENCODE database (Genecode. VM21). The chicken lncRNAs and mRNAs data were downloaded from the Ensembl website database (5.0). Human data were used to build a convolutional neural network model. Mouse and chicken data were used to compare the superiority of the convolutional neural network model using the built model. The computed information used in this paper were as follows: (1) Operating system, Windows 10, with an InterCore I3–2365 M processor, memory size, 6G; (2) Python 3.5 to run the CNN code.
Studies have shown that the kmer frequency information in lncRNA and mRNA sequences can reveal the distribution of various subsequences in biological sequences, to measure the similarities and variances of sequences [29].
A kmer refers to all possible subsequences of length k in a DNA sequence, RNA sequence or amino acid sequence. Figure 3 shows the process of examining a kmer in a sliding window mode in a sequence when k is three, in which there are 21 3mers, namely, GCC, CCA, CAA, AAC, ACG, CGC, GCC, CCA, CAG, AGG, GGC, GCC, CCG, CGA, GAC, ACC, CCA, CAG, AGT, GTT, and TTC. Among them, GCC and CCA appear three times, CAG appears twice, and the others appear once. Similarly, we can count the kmer frequency information of lncRNA and mRNA sequences.
For a sequence, if the sequence length is m, then the number of kmer subsequence of length k has mk + 1. The sequence generally consists of four bases, A, T, C, and G, and thus kmers of length k have 4^{k} possible structures.
We randomly selected 5000 lncRNA sequence data and 5000 mRNA sequence data from the human data, and we counted the kmer frequency information when k = 1 and k = 3. As shown in Figs.4 and 5, when k = 1, the higher the content of bases G and C, the higher is the thermal stability of DNA molecules. When k = 2, after analysing the preference of dinucleotides, the frequency statistics of dinucleotides may represent certain characteristics of different species in different environments. For example, CG may be a methylCpG island and TA may be part of the TATA box. When k = 3, the coding and noncoding region in the sequence can be distinguished by counting the codon usage preference consisting of three bases. Therefore, considering the statistics, we analysed the kmer frequency of lncRNA and mRNA sequence samples separately to discuss their differences.
From Fig. 4, the average contents of the A, C, G, and T bases in the lncRNA sequence were approximately 254 nt, 217 nt, 216 nt, and 240 nt, respectively, while the mRNA sequence had A, C, G, and T base contents of approximately 364 nt, 420 nt, 422 nt, and 343 nt, respectively. Therefore, the contents of the four 1mers in the mRNA were higher than in the lncRNA. Furthermore, in the lncRNA and mRNA sequences, the contents of C and G bases were very similar, and the contents of A and T base were equivalent. However, in the lncRNA sequence, the contents of C and G base were lower than the contents of A and T base. While in the mRNA sequence, the opposite was true.
When k is taken as three, the 3mer fragments appearing in the sequence are {AAA, AAT, AAC, AAG, ATA, ATT, ATC, ATG,..., GGA, GGT, GGC, GGG}. Similarly, the frequency information for each 3mer segment of each sequence in the sample sequence is counted in turn, and the respective mean values are calculated to estimate the frequency of occurrence of each 3mer in the lncRNA and mRNA sequences. The histogram of the 3mer frequency distribution in the mRNA and lncRNA is plotted, as shown in Fig. 5a and b. As seen in Fig. 5a and b, the frequency distribution of most 3mer subsequences in the mRNA sequence fluctuated sharply, but the frequency of 3mers of GCG, CGG, CGC, CGT, CGA, TCG, and ACG was small, with only approximately 4 in each sequence. Moreover, the TGG, CAG, CTG, CCA, CCT, GCC, and GGC segments were enriched in the mRNA sequence, all with frequencies of approximately 40. In the lncRNA sequence, the frequency distribution of each 3mer subsequence was relatively stable, and most of them were distributed approximately 15 with little fluctuation. A few 3mers had a higher frequency, such as AAG, AGA, AGG, TGG, CAG, CTG, CCA, CCT, GAA, and GGA, exceeding 20. Moreover, only four 3mers, ACG, TCG, CGA, and CGT, had frequencies lower than 5.
By analysing the frequency distributions of 1mers and 3mers of lncRNA and mRNA sequences, the kmer distributions of the two were found to have their own preferences. Therefore, we could use the kmer frequency distribution information for the sequence as the difference information between lncRNA and mRNA sequences.
The lncRNA and mRNA classification model based on the convolutional neural network
In this paper, a convolutional neural network algorithm was used to construct a model suitable for classifying gene sequences based on the transformation of sequences into the kmer frequency matrix. The model framework is shown in Fig. 6.
As shown in Fig. 6, the lncRNA and mRNA classification model includes the input part and the convolutional neural network part. The input section contains the kmers extracted from the sequence and their construction into a kmer frequency matrix. The convolutional neural network consists of two layers. First, in the kmer feature extraction layer, the input of each neuron is connected to the local acceptance domain of the previous layer, and the local features are extracted. Once the local feature is extracted, its positional relationship with other kmer features is also determined. The second is the kmer feature mapping layer. Each computing layer of the network consists of multiple feature maps. Each feature map is a plane, and the weights of all neurons on the plane are equal. The kmer feature mapping structure uses the sigmoid function as the activation function of the convolutional network so that the feature map has displacement invariance. In addition, since all neurons on a mapped surface share weights, the number of network parameters is reduced. Each convolutional layer in the convolutional neural network is followed by a computational layer for local averaging and quadratic extraction of kmer features. This unique twofeature extraction structure reduces feature resolution.
Construction of the kmer frequency matrix
The kmer frequency of each sequence is first normalized and converted to a frequency. Then, according to the application of the convolutional neural network in image recognition, the kmer frequency of each sequence is constructed into a matrix form of the same size as the input of the model. Finally, the convolutional neural network is used to autonomously learn the difference between the two sequences of kmer frequency information to achieve the purpose of classifying and identifying lncRNAs and mRNAs. The specific process is as follows:

Step 1: The kmer frequency information for each sequence is counted. In this paper, the sequence is traversed in the order of A, T, C, and G, and finally the frequency of each 4mer is counted;

Step 2: Normalize the frequency of each kmer in the sequence in each sequence. That is, the frequency p_{i}(i = n, n = 4^{k}) of all kmers in each sequence is obtained, and the sum of the frequencies of kmers in each sequence is 1;

Step 3: The frequencies of all kmers in each sequence are constructed into a matrix form A × B(A × B = 4^{k}), and the elements in the matrix are arranged horizontally in the order of the kmer. For example, when k is 4, the constructed matrix is 16 × 16.
Kmer screening based on relative entropy
When the value of k increases, the kmer types with a length k in the sequence increase exponentially. If k is large, the average frequency of each kmer will be less. Numerous kmers have a frequency of zero. To reduce the complexity of the data calculation, we used relative entropy to screen kmers.
Let p_{lnc} be the frequency distribution of the kmer in the lncRNA sequence and p_{m} be the frequency distribution of the kmer in the mRNA sequence. The relative entropy of p_{lnc} and p_{m} is then
$$ D\left({p}_{\ln c},{p}_m\right)={\sum}_{i=1}^{4^k}{p}_{\ln c}(i)\ln \frac{p_{\ln c}(i)}{p_m(i)},k\in \left[1,n\right],i\in \left[i,{4}^k\right], $$
(1)
If p_{lnc} = p_{m}, then D(p_{lnc}, p_{m}) = 0, which indicates that the kmer frequency distribution of the lncRNA sequence does not differ from the frequency distribution of the mRNA. If there is a difference in the kmer frequency distribution between lncRNA and mRNA, then the value of D(p_{lnc}, p_{m}) will be greater than zero. Concurrently, the smaller the value of D(p_{lnc}, p_{m}) is, the smaller will be the difference in the kmer frequency distribution between lncRNA and mRNA. Otherwise, the larger the value of D(p_{lnc}, p_{m}) is, the greater will be the difference in the kmer frequency distribution between lncRNA and mRNA. To screen out kmers that increase the difference information, set
$$ {d}_{\lambda }={p}_{\ln c}(i)\ln \frac{p_{\ln c}(i)}{p_m(i)},\lambda \in \left[i,{4}^k\right], $$
(2)
Sorting d_{λ} in descending order obtains
$$ R=\frac{\sum_{\lambda =1}^n{d}_{\lambda }}{D\left({p}_{\ln c},{p}_m\right)},n\in \left[1,{4}^k\right]. $$
(3)
R reflects the target ratio of the extracted information of kmers. The λ corresponding to R from 1 to 4^{k} is sequentially calculated. If set R ≥ 98%, then calculate λ = ϖ, and the first ϖ kmers are the filtered kmers.
Specific steps are as follows:

Step 1: The sum of each kmer in the lncRNA sample sequence and the mRNA sample sequence is separately determined. Then, the frequency of each kmer in it is counted. Finally, the frequency values of the 4^{k} kinds of kmers in the sample are obtained. For example, when k = 4, the total frequency of 256 4mers in the lncRNA and mRNA sample sequences are respectively counted, then the frequency value of each 4mer is counted, and finally, 256 kinds of 4mers are obtained. The sum of the frequency values of the 256 4mers is 1 for the frequency values in the two sample sequences, respectively. The results of this step calculation are p_{lnc} and p_{m} in the formula (1).

Step 2: According to the frequency value of each kmer in the two sequences obtained in step 1, the relative entropy, that is, the value of D(p_{lnc}, p_{m}), is calculated according to the formula (1). Then, R is calculated according to the value of D(p_{lnc}, p_{m}) in formula (3), and finally, λ is taken as the value of ϖ when R ≥ 98%. Now according to the descending order of d_{λ}, the first ϖ kmers are the filtered kmers.

Step 3: The lncRNA and mRNA are separately counted based on the frequency of the kmers screened in step 2, and then the kmers frequency is constructed in the form of a matrix with reference to the steps of data input processing.
Convolution calculation of the kmer frequency matrix
The convolution calculation is used to strengthen the important features in the kmer frequency matrix and weaken the influence of irrelevant kmer features in this paper.
Taking k = 3 as an example, 64 kmers can be extracted from an lncRNA sequence. According to the kmer frequency, the lncRNA sequence can be constructed into an 8 × 8 kmer frequency matrix M,
$$ M=\left[\begin{array}{cccccccc}0.0059& 0.0102& 0.0117& 0.0190& 0.0059& 0.0029& 0.0059& 0.0102\\ {}0.0088& 0.0073& 0.0220& 0.0059& 0.0220& 0.0146& 0.0234& 0.0146\\ {}0.0044& 0.0044& 0.0088& 0.0088& 0.0073& 0.0102& 0.0176& 0.0161\\ {}0.0132& 0.0176& 0.0322& 0.0117& 0.0117& 0.0220& 0.0249& 0.0146\\ {}0.0161& 0.0044& 0.0102& 0.0337& 0.0088& 0.0264& 0.0293& 0.0264\\ {}0.0278& 0.0439& 0.0366& 0.0190& 0.0044& 0.0132& 0.0190& 0.0102\\ {}0.0205& 0.0073& 0.0117& 0.0176& 0.0044& 0.0117& 0.0220& 0.0190\\ {}0.0161& 0.0220& 0.0366& 0.0102& 0.0146& 0.0073& 0.0176& 0.0161\end{array}\right]. $$
(4)
The convolution calculation is performed by taking Eq. (4) as input and randomly setting a convolution kernel of 3 × 3 ,
$$ \mathrm{Kernel}=\left[\begin{array}{ccc}1& 0& 1\\ {}0& 1& 0\\ {}1& 0& 0\end{array}\right]. $$
(5)
The convolution calculation is actually a process of weighted summation. The calculation format is usually
$$ {X}_j^l=f\left(\sum \limits_{i={M}_j}{X}_i^{l1}\times {Kernel}_{ij}^l+{B}^l\right), $$
(6)
where f is the activation function of the neurons in the layer, and l indicates the number of layers in the network. The kernel is the convolution kernel, M_{j} is a local area of the input object, and B represents the offset of each layer.
In Eq. (5), the convolution kernel has nine parameters. The kmer frequency matrix in Eq. (4) and the convolution kernel are convoluted by the calculation method of Eq. (6). B^{l} is set as 0. The specific calculation process is as shown in Eq. (7).
$$ {\displaystyle \begin{array}{c}\mathrm{feature}\ \mathrm{map}=\left[\begin{array}{cccccccc}{0.0059}_{\times 1}& {0.0102}_{\times 0}& {0.0117}_{\times 1}& 0.0190& 0.0059& 0.0029& 0.0059& 0.0102\\ {}{0.0088}_{\times 0}& {0.0073}_{\times 1}& {0.0220}_{\times 0}& 0.0059& 0.0220& 0.0146& 0.0234& 0.0146\\ {}{0.0044}_{\times 1}& {0.0044}_{\times 0}& {0.0088}_{\times 0}& 0.0088& 0.0073& 0.0102& 0.0176& 0.0161\\ {}0.0132& 0.0176& 0.0322& 0.0117& 0.0117& 0.0220& 0.0249& 0.0146\\ {}0.0161& 0.0044& 0.0102& 0.0337& 0.0088& 0.0264& 0.0293& 0.0264\\ {}0.0278& 0.0439& 0.0366& 0.0190& 0.0044& 0.0132& 0.0190& 0.0102\\ {}0.0205& 0.0073& 0.0117& 0.0176& 0.0044& 0.0117& 0.0220& 0.0190\\ {}0.0161& 0.0220& 0.0366& 0.0102& 0.0146& 0.0073& 0.0176& 0.0161\end{array}\right]\\ {}=\left[\begin{array}{cccccc}0.0293& 0.0556& 0.0323& 0.0527& 0.0337& 0.0467\\ {}0.0484& 0.0396& 0.0850& 0.0495& 0.0673& 0.0688\\ {}0.0469& 0.0498& 0.0480& 0.0644& 0.0657& 0.0776\\ {}0.0776& 0.0834& 0.1142& 0.0615& 0.0674& 0.0791\\ {}0.0907& 0.0820& 0.0497& 0.0821& 0.0557& 0.0835\\ {}0.0878& 0.0966& 0.0952& 0.0468& 0.0497& 0.0527\end{array}\right]\\ {}\kern2.75em \end{array}} $$
(7)
The first element of the feature map in Eq. (7) is the weighted sum of the first 3 × 3 local element value of the input matrix M and the corresponding element of the convolution kernel. Similarly, the second element is the weighted sum of the second 3 × 3 local element value and the corresponding element of the convolution kernel. Finally, a 6 × 6 size of the output matrix feature map is obtained.
The convolutional neural network of this model has two convolutional layers. The first convolutional layer uses 32 convolution kernels, and the second convolutional layer uses 64 convolution kernels. The size of each convolution kernel is 3 × 3, and the horizontal and vertical steps are 1. The border is filled with 0 in the samepadding to ensure that the size of the matrix remains the same as before convolution, i.e., the kmer frequency matrix M after samepadding is
$$ {O}_j=\left[\begin{array}{cccccccc}0& 0& 0& 0& 0& 0& 0& 0\\ {}0& 0.0293& 0.0556& 0.0323& 0.0527& 0.0337& 0.0467& 0\\ {}0& 0.0484& 0.0396& 0.0850& 0.0495& 0.0673& 0.0688& 0\\ {}0& 0.0469& 0.0498& 0.0480& 0.0644& 0.0657& 0.0776& 0\\ {}0& 0.0776& 0.0834& 0.1142& 0.0615& 0.0674& 0.0791& 0\\ {}0& 0.0907& 0.0820& 0.0497& 0.0821& 0.0557& 0.0835& 0\\ {}0& 0.0878& 0.0966& 0.0952& 0.0468& 0.0497& 0.0527& 0\\ {}0& 0& 0& 0& 0& 0& 0& 0\end{array}\right]. $$
(8)
Pooling calculation of the convolution kernel output matrix
The pooling method adopted in this paper is the max pooling, which aims to reduce and compress the kmer characteristics, as well as the calculation amount. The model has only one pooling layer. After the second convolutional layer, the window sliding calculation is performed in a step size of 2.
For Eq. (8), the maximum value of the first 2 × 2 block of matrix O_{j} is 0.0293. The final result of the pooled calculation of matrix O_{j} is
$$ {S}_j=\left[\begin{array}{cccc}0.0293& 0.0556& 0.0527& 0.0467\\ {}0.0484& 0.0850& 0.0673& 0.0776\\ {}0.0907& 0.1142& 0.0821& 0.0835\\ {}0.0878& 0.0966& 0.0497& 0.0527\end{array}\right]. $$
(9)
Since 64 kmer matrices of 6 × 6 were obtained after the second convolutional layer and the size of the pooling window was 2 × 2, the number of rows and columns of the matrix became half the original, representing 64 kmer 3 × 3 matrices.
Fully connected neural network based on SoftMax function
In order to prevent overfitting and improve the generalization ability of the model, we reduce the connection of some neurons with a certain probability, so that some neurons in each training are not activated. After the pooling layer, the connection between the output neurons of the pooling layer and the neurons of the full connective layer was reduced with a probability of 0.25. The output matrix of the pooling layer is flattened and expanded to connect 128 neurons in the full connection layer. The activation function is still Relu function.
We used the SoftMax function to activate the output of the fully connected network in the model. The formula of the SoftMax function is
$$ f\left({z}_j\right)=\frac{e^{z_j}}{\sum_{i=1}^n{e}^{z_i}}. $$
(10)
From Eq. (10), if z_{j} is greater than the other z, the value of the function f(z_{j}) approaches 1, and otherwise it approaches 0. Therefore, when the value of f(z_{j}) is 1, the input sequence of this model is judged to be an lncRNA sequence, and when the value of f(z_{j}) is 0, the input sequence is a mRNA sequence.
The Adadelta optimizer is used to train the gradient descent in the training process, and the crossentropy loss function is used as the loss function.
Setting of the evaluation index in the classification model
The indicator for evaluating the performance of a classification model is generally the classification accuracy, which is also known as the model accuracy. Commonly used evaluation indicators for the twocategory problem are precision and recall. In this paper, the positive class was the lncRNA sequence, and the mRNA sequence was the negative class. The predictions of the classifier on the test data set were either correct or incorrect. The total number of occurrences in the four cases was recorded as follows:
TP——The positive class is predicted as the positive class number;
FN——The positive class is predicted as the negative class number;
FP——The negative class is predicted as the positive class number;
TN——The negative class is predicted as the negative class number.
The precision rate is defined as
$$ P=\frac{TP}{TP+ FP}, $$
(11)
The recall rate is defined as
$$ R=\frac{TP}{TP+ FN}. $$
(12)
In addition, the F_{1} score is the harmonic mean of the precision rate and the recall rate, i.e.,
$$ \frac{2}{F_1}=\frac{1}{P}+\frac{1}{R}, $$
(13)
$$ {F}_1=\frac{2 TP}{2 TP+ FP+ FN}, $$
(14)
If both the precision and recall rates are high, then F_{1} will be high [30].