PlncRNAHDeep: plant long noncoding RNA prediction using hybrid deep learning based on two encoding styles
BMC Bioinformatics volume 22, Article number: 242 (2021)
Abstract
Background
Long noncoding RNAs (lncRNAs) play an important role in regulating biological activities, and predicting them is important for exploring biological processes. Long short-term memory (LSTM) and convolutional neural network (CNN) models can automatically extract and learn abstract information from encoded RNA sequences, which avoids complex feature engineering. An ensemble model learns information from multiple perspectives and performs better than a single model. It is therefore feasible and appealing to treat an RNA sequence as a sentence and as an image in order to train an LSTM and a CNN respectively, and then to hybridize the trained models to predict lncRNAs. To date, various lncRNA predictors have been published, but few of them target plants. A reliable and powerful predictor for plant lncRNAs is needed.
Results
To boost the performance of lncRNA prediction, this paper proposes a hybrid deep learning model based on two encoding styles (PlncRNAHDeep), which requires no prior knowledge and uses only RNA sequences to train the models for predicting plant lncRNAs. It not only learns diversified information from RNA sequences encoded by p-nucleotide and one-hot encodings, but also takes advantage of both CNN and lncRNALSTM, the latter proposed in our previous study. The parameters are tuned and three hybrid strategies are tested to maximize performance. Experimental results show that PlncRNAHDeep is more effective than lncRNALSTM and CNN alone and obtains 97.9% sensitivity, 95.1% precision, 96.5% accuracy and 96.5% F1 score on the Zea mays dataset, which are better than those of several shallow machine learning methods (support vector machine, random forest, k-nearest neighbor, decision tree, naive Bayes and logistic regression) and some existing tools (CNCI, PLEK, CPC2, LncADeep and lncRNAnet).
Conclusions
PlncRNAHDeep is feasible and obtains credible predictive results. It may also provide a valuable reference for other related research.
Background
Noncoding RNAs (ncRNAs) are transcripts that do not code for proteins [1]. Long ncRNAs (lncRNAs) usually refer to ncRNAs longer than 200 nucleotides [2], and they play an important role in regulating biological activities [3]. For example, lncRNAs are involved in cardiovascular diseases and atherosclerosis and have attracted attention in cancer research [4, 5]. They take part in the vernalization-mediated repression of FLOWERING LOCUS C, which affects flowering in Arabidopsis [6, 7]. lncRNAs are pivotal regulators of a range of developmental processes in plants [3, 8]. A growing number of plant lncRNAs have been discovered, but their diverse functions are not yet well understood. Predicting plant lncRNAs is important for exploring the functional lncRNAs expressed in genomes and understanding their mechanisms.
Bioinformatics technology has been widely used in biological prediction. Traditional methods often use physicochemical, sequential and structural features (codon frequency [9], open reading frame (ORF) [10] and similarity to known proteins [11]) as inputs to train a shallow machine learning model (support vector machine (SVM) [12], random forest (RF) [13], k-nearest neighbor (kNN) [14], etc.) for prediction. CNCI is a powerful tool that uses adjoining nucleotide triplets to train an SVM for classifying protein-coding and noncoding sequences [15]. PLEK, an alignment-free tool, uses a computational pipeline based on an improved k-mer scheme and SVM to distinguish lncRNAs from messenger RNAs (mRNAs) [16]. CPC is an SVM-based classification tool that uses sequence features to classify coding and noncoding RNAs [17], and its new version, CPC2, is faster and more accurate [18]. With the development of computer technology, deep learning has shown better performance and adaptability than shallow machine learning in many fields [19]. It is an end-to-end approach that extracts the latent features of the data and learns rules by optimizing a loss function, which avoids designing rules manually. LncADeep integrates intrinsic and homologous features into a deep belief network to construct models targeting full-length and partial-length transcripts for classifying lncRNAs [20]. lncRNAnet incorporates a recurrent neural network (RNN) for RNA sequence modeling and a convolutional neural network (CNN) for detecting stop codons to obtain an ORF indicator for lncRNA classification [21]. However, none of these studies avoids complex feature engineering, which is not only time-consuming but also requires prior knowledge, such as a deep understanding of the physicochemical, sequential and structural features of RNA and the proper use of bioinformatics tools.
It is therefore important to develop an efficient method that uses only RNA sequences to train the models and obtains credible predictive results.
In natural language processing and image classification, deep learning is used to automatically extract and learn abstract information from the data to train the model, which gives superior performance and strong adaptability and avoids complex feature engineering [19]. Inspired by this, the prediction of lncRNAs can be treated as a natural language processing problem and as an image classification problem. Long short-term memory (LSTM) is a model that has been successfully applied to natural language processing [22]. Sentences in natural language can be converted into vectors as the input of an LSTM for training. CNN is well suited to image classification [23]. An image can be converted into two-dimensional matrices as the input of a CNN for training. Furthermore, RNA sequences can be encoded into different forms as inputs to train a variety of base models. An ensemble of these models not only learns information from multiple encoding forms but also ensures the diversity of the base models, and thus performs better than a single model [24, 25]. Therefore, raw RNA sequences can be encoded as vectors and as matrices to train an LSTM and a CNN respectively, and the trained models can then be hybridized to predict lncRNAs comprehensively.
To date, various methods and tools for predicting animal lncRNAs have been published, but few exist for plants. ncRNAs are mainly transcribed by RNA polymerase II in animals but by RNA polymerases II, IV and V in plants [26], and plant lncRNAs have low expression levels and low cross-species conservation [27], so predictors built for animals are not guaranteed to be reliable for plants. Given these challenges, a reliable and powerful predictor for plant lncRNAs is urgently needed.
In this paper, plant lncRNAs are predicted using hybrid deep learning based on two encoding styles (PlncRNAHDeep). K-means clustering [28] is used to undersample the negative samples in the dataset. The raw RNA sequences are first encoded as vectors and matrices by p-nucleotide [29] and one-hot [30] encodings respectively. The encoded sequences are then input into lncRNALSTM, proposed in our previous study [29], and into a CNN for training. Finally, the trained models are hybridized at the decision level to obtain the final predictive results. PlncRNAHDeep uses only RNA sequences to train the models for predicting plant lncRNAs. It learns diversified information from the two encoding styles and takes advantage of both lncRNALSTM and CNN. The value of p in p-nucleotide encoding is tuned and three hybrid strategies are tested to maximize performance. PlncRNAHDeep is more effective than lncRNALSTM and CNN alone. It also obtains the best results on the Zea mays dataset compared with shallow machine learning methods, such as SVM, RF, kNN, decision tree (DT), naive Bayes (NB) and logistic regression (LR), and with existing tools, such as CNCI, PLEK, CPC2, LncADeep and lncRNAnet.
Results
Effects of value of p and hybrid strategy variations
The value of p in p-nucleotide encoding is an important parameter that affects the performance of lncRNALSTM and thus the performance of PlncRNAHDeep. 5-fold cross validation is used to evaluate the effects of different values of p in lncRNALSTM (Fig. 1).
When p is 3, lncRNALSTM obtains the best sensitivity, accuracy and F1 score, and its precision is the second best among all settings. Thus, the value of p is set to 3 in the following experiments.
The effects of different hybrid strategies in PlncRNAHDeep are evaluated using 5-fold cross validation (Fig. 2). The least significant difference (LSD) test is used to compare their accuracies statistically, and significance is judged from the obtained p value (Table 1).
PlncRNAHDeep with any of the three hybrid strategies always obtains better results than CNN and lncRNALSTM. According to the LSD test, its accuracy is also significantly higher than that of CNN and lncRNALSTM at the 0.05 significance level. This means that all three hybrid strategies improve on the performance of a single CNN or lncRNALSTM. The three hybrid strategies are then compared with each other. PlncRNAHDeep_G obtains the best sensitivity, and PlncRNAHDeep_L obtains the best precision. The two obtain similar accuracy and F1 score. PlncRNAHDeep_C does not obtain the best result on any criterion. According to the LSD test, the accuracy of PlncRNAHDeep_L is significantly higher than that of PlncRNAHDeep_C at the 0.05 level. Although PlncRNAHDeep_G also obtains better accuracy than PlncRNAHDeep_C, the difference between their results is not significant. Accordingly, PlncRNAHDeep with the predominance of LSTM hybrid strategy (PlncRNAHDeep_L) is selected for the following experiments.
Impacts of balanced and imbalanced sample datasets
The number of negative samples may affect the performance of PlncRNAHDeep [31]. Datasets with different ratios of positive to negative samples are used to verify the performance (Table 2).
On the imbalanced sample datasets, the performance of PlncRNAHDeep is significantly degraded. Specifically, on the imbalanced sample dataset with a positive-to-negative ratio of 1:3, the F1 score, AUC and GM decrease by 26.1%, 8.2% and 15.8% respectively compared with those on the balanced sample dataset. To ensure good performance of PlncRNAHDeep, the balanced sample dataset is finally adopted.
Performance comparison with shallow machine learning methods
To verify the performance of the proposed model, PlncRNAHDeep is compared with six shallow machine learning methods: SVM, RF, kNN, DT, NB and LR (Table 3). Their ROC curves are also plotted and the AUC values are obtained (Fig. 3).
PlncRNAHDeep obtains 97.9% sensitivity, 95.1% precision, 96.5% accuracy and 96.5% F1 score. Its sensitivity, accuracy and F1 score are the best and its precision is the second best among all methods. Its AUC reaches 0.9934, which is also better than those of the other methods. RF obtains the second best sensitivity, precision, accuracy, F1 score and AUC, with the same precision as PlncRNAHDeep. DT obtains the third best sensitivity, accuracy and F1 score, but its precision and AUC are not in the top three. Although LR obtains the best precision, none of its other results is in the top three. SVM obtains the third best AUC, but its other results are unsatisfactory. None of the results of kNN and NB is in the top three, and NB's results are the worst among all methods.
Performance comparison with existing tools
To further verify the performance of PlncRNAHDeep, it is compared with five existing tools (CNCI, PLEK, CPC2, LncADeep and lncRNAnet), which are described in the Background section (Table 4).
All values obtained by PlncRNAHDeep are better than those of the other tools. Its accuracy is 17.6%, 21.4%, 6.2%, 16.5% and 23.6% higher than that of CNCI, PLEK, CPC2, LncADeep and lncRNAnet respectively. The sensitivity and precision of PlncRNAHDeep are 97.9% and 95.1% respectively, a difference of 2.8%, which shows the good robustness of PlncRNAHDeep. CPC2 obtains the second best accuracy, and the difference between its sensitivity and precision is 3.5%. The accuracies of CNCI, PLEK and LncADeep reach 75% but do not exceed 80%. The sensitivities of CNCI and LncADeep are about 25% lower than their precisions, which indicates that they tend to predict lncRNAs as negative samples. The sensitivity of PLEK is clearly higher than its precision, which indicates that it tends to predict mRNAs as lncRNAs. The difference between the sensitivity and precision of lncRNAnet is 1.3%, which is the smallest gap among the tools. However, its accuracy does not reach 75%.
Discussion
lncRNALSTM with p = 3 in p-nucleotide encoding obtains the best results, which means that a sample is better characterized when every three consecutive nucleotides in the RNA sequence are regarded as a word. For the negative samples (mRNAs), this may be because every three consecutive nucleotides form a codon, which in turn determines the amino acid [32]. For the positive samples (lncRNAs), this may be because conserved triplet characteristics are needed for them to perform their functions, such as matching the sequence of an interacting protein [9]. From another perspective, when p is 1 or 2, each sample is encoded with only 5 or 17 distinct integers (4^{p} codes plus the zero used for padding), which is not enough to characterize the sample, especially for lncRNAs longer than 200 nucleotides. When p is greater than 3, the sample is greatly shortened after encoding and the information that can be extracted is limited, which is not conducive to model training.
PlncRNAHDeep with the predominance of LSTM hybrid strategy obtains the best results, which means that lncRNALSTM serves as the main model and CNN assists in prediction. On the one hand, lncRNALSTM is an improved model that is more suitable as the main model than the basic CNN [29]. On the other hand, p-nucleotide encoding characterizes the sample with a variety of integers, while one-hot encoding characterizes the sample with a 0–1 matrix, so lncRNALSTM learns more information from the sample than CNN and performs better.
In view of the successful application of LSTM and CNN in natural language processing and image processing respectively, the RNA sequences are encoded into vectors and matrices to train lncRNALSTM and CNN respectively [22, 23]. This approach takes advantage of the two deep learning models and further improves performance through hybridization [24, 25]. Therefore, PlncRNAHDeep performs better than a single deep learning model or a shallow machine learning model. Since lncRNAs differ between animals and plants, predictors built for animals are not guaranteed to be reliable for plants [26]. It is therefore understandable that the plant predictor PlncRNAHDeep obtains better results than other tools on plant lncRNA prediction. In addition, PlncRNAHDeep needs only RNA sequences as input to complete training and prediction, which is simple and user-friendly. As a representative species, Zea mays is widely cultivated around the world. PlncRNAHDeep performs well on the Zea mays dataset, which indicates that it has the potential to be applied to many other plant species.
Conclusions
In this paper, a hybrid deep learning method using two encoding styles, PlncRNAHDeep, is presented to predict plant lncRNAs. It encodes the sample sequences using p-nucleotide and one-hot encodings for training lncRNALSTM and CNN respectively, and hybridizes the two models at the decision level. It uses only the RNA sequences as inputs to learn diversified information and takes advantage of both lncRNALSTM and CNN. The performance of PlncRNAHDeep is verified by comparison with shallow machine learning methods, including SVM, RF, kNN, DT, NB and LR, and with existing tools, including CNCI, PLEK, CPC2, LncADeep and lncRNAnet. The experimental results show that PlncRNAHDeep is an efficient method. It may also provide a valuable reference for other related studies.
Future work will try to make PlncRNAHDeep available for online use or free download. As research progresses, the public plant databases will become richer and more lncRNAs will be published. Wider application of PlncRNAHDeep can therefore be expected.
Methods
Datasets
Zea mays is a model plant that is widely used as a research subject and is of considerable research significance. To train the deep learning model adequately and avoid underfitting, a large amount of published Zea mays lncRNA data with abundant genetic annotation information was selected. 18,110 validated lncRNA sequences were downloaded from the Green noncoding database (GreeNC) v1.12 [33] as the positive samples. 18,000 of them were selected randomly to generate a positive dataset.
From the RefSeq database (https://www.ncbi.nlm.nih.gov/refseq/), 57,776 mRNA sequences were downloaded, the repeated sequences were filtered out, and 54,282 sequences were obtained as the negative samples. To generate a balanced sample dataset, the negative samples were undersampled. The k-mer frequency of each negative sample sequence was extracted [9]. K-means, an unsupervised clustering method [28], was used to cluster these negative samples based on their k-mer frequencies. k was set to 1 and 2, and the number of cluster centers was set to 200 to save time and reduce the computational complexity. The number of samples in each cluster was recorded as x_{i} (i = 1, 2, …, 200). O_{i} (i = 1, 2, …, 200) samples were selected randomly from the ith cluster as follows:
where round() is the rounding function. The 18,000 selected samples were used to generate a negative dataset. Two other imbalanced sample datasets were also generated using the above method, where the positive dataset kept 18,000 samples and the ratios of positive to negative samples were 1:2 and 1:3 respectively [31].
80% of the samples from the positive and negative datasets were selected randomly for training and validation, and the other 20% were used for testing.
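The display equation for O_{i} is not reproduced in this text. As a minimal sketch, assuming that the quota of each cluster is proportional to its size (target × x_{i} / total, then rounded), the undersampling step could look like the following. The function names, the seed and the proportional rule itself are illustrative assumptions, not the authors' code.

```python
import random


def allocate_per_cluster(cluster_sizes, target):
    """Quota O_i for each cluster, assumed proportional to cluster size x_i."""
    total = sum(cluster_sizes)
    return [round(target * x / total) for x in cluster_sizes]


def undersample(clusters, target, seed=0):
    """Randomly pick O_i members from each cluster (clusters: list of lists)."""
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    quotas = allocate_per_cluster([len(c) for c in clusters], target)
    picked = []
    for members, quota in zip(clusters, quotas):
        # Never request more samples than the cluster contains.
        picked.extend(rng.sample(members, min(quota, len(members))))
    return picked
```

Because each quota is rounded independently, the quotas may not sum exactly to the target, so a small adjustment step may be needed in practice.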
Two encoding styles
Word segmentation is an important step in natural language processing, and it encodes a sentence into a number vector [34]. Each RNA sequence is a permutation of nucleotides and can be considered as a sentence. Thus, it can be encoded by “word segmentation” according to its biological characteristics. In the datasets, each sample was a chain-like molecule composed of four bases (A, T, C and G) [35]. Every p consecutive nucleotides (p-nucleotide) in the RNA sequence were regarded as a “word”. The value of p could be 2, 3, 4, ..., which corresponded to 16, 64, 256, … p-nucleotide formats respectively. Each p-nucleotide format was represented by a unique positive integer from 1 to 4^{p}. A window with both length and step size of p slid along the RNA sequence to encode each p-nucleotide into its corresponding positive integer. To ensure that all samples had the same length after encoding, the samples shorter than the longest one were zero-padded. Each sample was thus encoded into a number vector (Fig. 4a).
One-hot is a common encoding style [30]. Here the one-hot encoding rule is that A is encoded as (1, 0, 0, 0)^{T}, T is encoded as (0, 1, 0, 0)^{T}, C is encoded as (0, 0, 1, 0)^{T} and G is encoded as (0, 0, 0, 1)^{T}. Each sample sequence is then encoded into a 0–1 matrix (similar to a two-dimensional grayscale image) of four rows and N columns, where N is the sequence length of the longest sample. For samples whose sequence length is less than N, the empty columns are zero-padded (Fig. 4b).
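The p-nucleotide encoding described above can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the base order used to number the words, the handling of an incomplete trailing window (dropped here) and the function names are assumptions.

```python
from itertools import product

BASES = "ATCG"  # base order is an assumption; any fixed order works


def build_vocab(p):
    """Map each of the 4**p possible p-nucleotides to an integer in 1..4**p."""
    return {"".join(word): idx + 1
            for idx, word in enumerate(product(BASES, repeat=p))}


def encode_p_nucleotide(seq, p, length):
    """Encode seq with a window of length p and step p, zero-padded to length."""
    vocab = build_vocab(p)
    # Non-overlapping windows; an incomplete trailing window is dropped (assumption).
    words = [seq[i:i + p] for i in range(0, len(seq) - p + 1, p)]
    codes = [vocab[w] for w in words][:length]
    return codes + [0] * (length - len(codes))


print(encode_p_nucleotide("ATGCCA", p=3, length=4))  # [8, 41, 0, 0]
```

With p = 3 the vocabulary has 64 words, so together with the padding value 0 the encoded vectors draw on 65 distinct integers.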
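The one-hot rule above maps directly onto a small helper. The sketch below builds the 4 × N matrix as a list of rows; the function name is hypothetical, and it assumes sequences contain only A, T, C and G.

```python
ROWS = "ATCG"  # row order follows the rule stated in the text


def one_hot(seq, n):
    """Return a 4 x n 0-1 matrix; columns beyond len(seq) stay zero (padding)."""
    matrix = [[0] * n for _ in ROWS]
    for col, base in enumerate(seq[:n]):
        matrix[ROWS.index(base)][col] = 1
    return matrix


for row in one_hot("ATCG", 5):
    print(row)
```

Each column holds a single 1 in the row of its base, and the fifth column in this example is all zeros because the sequence is shorter than N.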
Feature extraction of RNAs
k-mer frequency is a common sequence feature of RNAs [9]. For a sample consisting of A, T, C and G, a k-mer contains k consecutive bases and has 4^{k} different forms. If the value of k is too large, it increases the training and test time and leads to many zeros in the feature vector, which adversely affects model training. A large proportion of k-mer frequencies in the feature vector also weakens the role of the other feature types in model training. Therefore, 1-mer, 2-mer and 3-mer frequencies were extracted. A sliding window of length k was used to match k-mers along the sequence, the sliding step size was set to 1, and the frequency f_{j} was recorded as follows:
where s_{k} is the total number of matches, L is the length of the RNA sequence, a_{k} is a parameter that gives each k-mer frequency the same effect, and c_{j} is the number of matches of the jth form.
An ORF is a segment of the RNA sequence that can potentially be translated. The ORF coverage rate of mRNA is significantly higher than that of lncRNA [10]. The ORF information of each sample was obtained with TransDecoder v3.0.1 (https://github.com/TransDecoder/TransDecoder), and the integrity (int), coverage (cov) and normalized ORF (nORF) were extracted as follows:
where n is the number of ORFs and l_{m} is the length of the mth ORF.
The structure of RNA forms an important intermediate level of description of nucleic acids. The stability of the structure is related to the number of base pairs in the sequence and the GC content. The more stable the structure, the more free energy it releases. The structure information of each sample was obtained with RNAfold in ViennaRNA Package v2.4.11 [11], and the number of base pairs, GC content (GCcont) and normalized minimum free energy (nMFE) were extracted as follows:
where NA, NT, NC and NG are the numbers of A, T, C and G in a sample respectively, and MFE is the minimum free energy.
All extracted features were combined into a 90-dimensional vector (84 k-mer frequencies for k = 1, 2 and 3, three ORF features and three structure features) as the input for the shallow machine learning methods in the comparison experiment. The extracted 1-mer and 2-mer frequencies were also used for clustering the negative samples when creating the datasets.
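To make the sliding-window count concrete, the sketch below computes k-mer frequencies with step size 1. It assumes f_{j} = a_{k} · c_{j} / s_{k} with s_{k} = L − k + 1; the weight a_{k} is left as a parameter with a placeholder default of 1.0 because its definition is not given in this text, and the function name is hypothetical.

```python
from itertools import product


def kmer_frequencies(seq, k, a_k=1.0):
    """Frequencies of all 4**k k-mers, counted with a sliding window of step 1."""
    forms = ["".join(t) for t in product("ATCG", repeat=k)]
    counts = dict.fromkeys(forms, 0)
    s_k = len(seq) - k + 1  # number of window positions, L - k + 1
    if s_k <= 0:
        return [0.0] * len(forms)  # sequence shorter than k
    for i in range(s_k):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows with other symbols (e.g. N) are skipped
            counts[kmer] += 1
    return [a_k * counts[form] / s_k for form in forms]


features = (kmer_frequencies("ATGCCATG", 1)
            + kmer_frequencies("ATGCCATG", 2)
            + kmer_frequencies("ATGCCATG", 3))
print(len(features))  # 84
```

Concatenating k = 1, 2 and 3 gives 4 + 16 + 64 = 84 values per sequence.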
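Of the structure features, GC content follows directly from the base counts NA, NT, NC and NG: it is the share of C and G among the counted bases. A minimal sketch with a hypothetical function name is shown below; the number of base pairs and the MFE come from RNAfold output and are not reproduced here.

```python
def gc_content(seq):
    """Fraction of C and G among the A, T, C and G bases of seq."""
    seq = seq.upper()
    counts = {base: seq.count(base) for base in "ATCG"}  # NA, NT, NC, NG
    total = sum(counts.values())
    return (counts["C"] + counts["G"]) / total if total else 0.0


print(gc_content("ATCG"))  # 0.5
```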
Architectures of lncRNALSTM and CNN
LSTM is a kind of RNN with a gated structure [36]. Bidirectional LSTM is a further extension that addresses the limitation that an LSTM processes information in only one direction. It extracts information to update the network from both the positive and negative directions as follows:
where σ() is the sigmoid function, h is the vector in the hidden layer, “→” and “←” are the positive and negative directions respectively, t is the time, W is the weight, x is the input, b is the bias. The output of the two networks is superimposed as follows:
where y is the output.
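For reference, a bidirectional recurrent layer is commonly written in the following simplified form, using the symbols defined above. This is the generic textbook formulation, given here as an assumption about the intended equations; the subscripts on W and b are illustrative, and a full LSTM cell additionally contains input, forget and output gates that are omitted.

```latex
\begin{aligned}
\overrightarrow{h}_t &= \sigma\!\left(W_{x\overrightarrow{h}}\, x_t + W_{\overrightarrow{h}\overrightarrow{h}}\, \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}}\right) \\
\overleftarrow{h}_t &= \sigma\!\left(W_{x\overleftarrow{h}}\, x_t + W_{\overleftarrow{h}\overleftarrow{h}}\, \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}}\right) \\
y_t &= W_{\overrightarrow{h}y}\, \overrightarrow{h}_t + W_{\overleftarrow{h}y}\, \overleftarrow{h}_t + b_y
\end{aligned}
```

The forward state depends on the previous time step and the backward state on the next one, and the output combines both.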
lncRNALSTM is an LSTM-based model constructed in our previous study [29]. Its architecture contains a word embedding layer, a bidirectional LSTM layer and a fully connected layer. In the bidirectional LSTM layer, the number of units was set to 64 and the dropout rate was set to 0.4. In the fully connected layer, “sigmoid” was selected as the activation function. The binary cross entropy loss function was selected to calculate the loss, which was optimized using the “Adam” optimizer. The parameters of each layer were updated by backpropagation. Each p-nucleotide encoded sample sequence was input as a 4^{p}-dimensional vector into lncRNALSTM. Unlike the overview of lncRNALSTM in [29], here the output was mapped to the [0, 1] interval to obtain a confidence probability instead of a label. Its value indicated the confidence that the corresponding sample was predicted as a lncRNA (Fig. 5).
CNN is a popular deep learning model. A basic CNN structure usually includes a convolutional layer, a pooling layer and a fully connected layer [19]. The convolutional layer outputs feature maps by convolving the feature maps of the previous layer with a set of filters as follows:
where Fm_{out} is the output feature maps, Fm_{in} is the input feature maps, Ft_{j} is the jth filter, Nf is the number of filters and b is the bias. The pooling layer combines the outputs of one layer of neuron clusters into a single neuron in the next layer, and the commonly used schemes are max-pooling and average-pooling. The fully connected layer connects every neuron in one layer to every neuron in another layer.
The CNN architecture in this paper mainly consisted of two convolutional layers, two pooling layers and a fully connected layer. In the convolutional layers, the numbers of filters were set to 32 and 64 respectively. In the pooling layers, max-pooling was used. In the fully connected layer, the dropout rate was set to 0.4 and “softmax” was selected as the activation function. The categorical cross entropy loss function was selected to calculate the loss, which was optimized using the “SGD” optimizer. The parameters of each layer were updated by backpropagation. All parameter selections followed related studies [37] and our previous experience [38]. Each one-hot encoded sample sequence was input as a 4 * N matrix into the above CNN. The output was mapped to the [0, 1] interval to obtain a 2-dimensional confidence probability vector. The values of this vector indicated the confidence that the corresponding sample was predicted as mRNA and lncRNA respectively (Fig. 6).
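The layer stack described above can be sketched with the Keras API. This is an illustrative sketch, not the released model: it uses tf.keras rather than Keras 2.2.4, and the embedding size, the maximum sequence length and the way dropout is applied (the `dropout` argument of the LSTM layer) are assumptions that the text does not specify.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_lncrna_lstm(p=3, max_len=1000, embed_dim=64):
    """Embedding -> bidirectional LSTM (64 units) -> sigmoid output."""
    vocab_size = 4 ** p + 1  # 4**p p-nucleotide codes plus 0 for padding
    model = keras.Sequential([
        keras.Input(shape=(max_len,)),           # max_len is hypothetical
        layers.Embedding(input_dim=vocab_size,
                         output_dim=embed_dim),  # embed_dim is hypothetical
        layers.Bidirectional(layers.LSTM(64, dropout=0.4)),
        layers.Dense(1, activation="sigmoid"),   # confidence of being a lncRNA
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model
```

The single sigmoid unit yields the confidence probability in [0, 1] that the hybrid strategies later consume.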
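As a small illustration of what one filter does on a one-hot encoded sequence, the sketch below slides a single kernel over a feature map without padding. Like most deep learning frameworks, it computes cross-correlation (the kernel is not flipped). The function name and the example kernel are hypothetical and serve only to show the arithmetic.

```python
def conv2d_valid(feature_map, kernel, bias=0.0):
    """Slide one kernel over a 2-D feature map (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(feature_map) - kh + 1
    out_w = len(feature_map[0]) - kw + 1
    return [[bias + sum(feature_map[r + i][c + j] * kernel[i][j]
                        for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


# One-hot matrix of "ATG" (rows A, T, C, G) and a 4 x 2 kernel matching "AT".
atg = [[1, 0, 0], [0, 1, 0], [0, 0, 0], [0, 0, 1]]
at_kernel = [[1, 0], [0, 1], [0, 0], [0, 0]]
print(conv2d_valid(atg, at_kernel))  # [[2.0, 0.0]]
```

The response is largest at the position where the dinucleotide AT occurs, which is how a learned filter can act as a motif detector.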
Hybrid deep learning
lncRNALSTM and CNN were trained separately and used to predict the input sample sequence and output the confidence probabilities. They were then hybridized at the decision level based on three hybrid strategies.
The first was the greedy hybrid strategy (the method was denoted as PlncRNAHDeep_G), which was inspired by greedy selection [39]. It always selected the higher one of the two confidence probabilities obtained by the two models, as follows:
where abs() is the absolute value function, Cp is the confidence probability that the sample is predicted as a lncRNA, Cp_{L} and Cp_{C} are Cp obtained by lncRNALSTM and CNN respectively.
The second was the predominance of CNN hybrid strategy (the method was denoted as PlncRNAHDeep_C). It selected the confidence probability obtained by CNN. However, when this confidence probability was not high enough, it selected the confidence probability obtained by lncRNALSTM as follows:
The third was the predominance of LSTM hybrid strategy (the method was denoted as PlncRNAHDeep_L). It was similar to the predominance of CNN hybrid strategy except that CNN and lncRNALSTM were exchanged as follows:
The final confidence probability Cp lies in the [0, 1] interval. The label output by the hybrid deep learning is 1 when Cp ≥ 0.5 and 0 when Cp < 0.5, indicating that the corresponding sample is predicted to be a lncRNA or not, respectively.
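The three decision-level strategies can be sketched as follows. The selection formulas are not reproduced in this text, so this is one plausible reading: "higher confidence" is taken to mean a probability further from 0.5 (consistent with the use of abs()), and the `margin` that defines "not high enough" is a hypothetical parameter whose value is not taken from the paper.

```python
def greedy(cp_l, cp_c):
    """PlncRNAHDeep_G: keep the probability further from 0.5 (assumed reading)."""
    return cp_l if abs(cp_l - 0.5) >= abs(cp_c - 0.5) else cp_c


def predominance(cp_main, cp_aux, margin=0.2):
    """Trust the main model unless its probability is within margin of 0.5.

    margin is a hypothetical threshold for 'not high enough'.
    """
    return cp_main if abs(cp_main - 0.5) >= margin else cp_aux


def label(cp):
    """Final label: 1 (lncRNA) when Cp >= 0.5, otherwise 0."""
    return 1 if cp >= 0.5 else 0


cp_l, cp_c = 0.55, 0.90  # example outputs of lncRNALSTM and CNN
print(label(greedy(cp_l, cp_c)))        # greedy strategy
print(label(predominance(cp_l, cp_c)))  # predominance of LSTM
print(label(predominance(cp_c, cp_l)))  # predominance of CNN
```

Under this reading, the predominance strategies differ only in which model is passed as the main one.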
Implementation of PlncRNAHDeep
PlncRNAHDeep was implemented with Keras 2.2.4, and all parameters not specified above used the default values from the Keras documentation (https://keras.io/). All scripts were written in Python 3.6.5. The whole project was run on a PC with a 2.81 GHz CPU, a 6 GB GPU and 8 GB of RAM under the Microsoft Windows 10 operating system.
Evaluation criteria
The performance evaluation criteria in the experiments are as follows:
where true positive (TP) refers to the number of true lncRNAs that are correctly predicted, false negative (FN) refers to the number of true lncRNAs that are incorrectly predicted as mRNAs, false positive (FP) refers to the number of true mRNAs that are incorrectly predicted as lncRNAs, and true negative (TN) refers to the number of true mRNAs that are correctly predicted. Sensitivity is the percentage of correctly predicted lncRNAs among all true lncRNAs. Precision is the percentage of correctly predicted lncRNAs among all samples predicted as lncRNAs. Accuracy is the percentage of correctly predicted samples among all samples. F1 score (F1-score) is the harmonic mean of sensitivity and precision. Geometric mean (GM) is a common criterion that gives a more accurate evaluation on imbalanced sample datasets. In addition, the area under the curve (AUC) of the receiver operating characteristic (ROC) curve is also used for evaluation. The value of AUC ranges from 0 to 1, where AUC = 1 stands for perfect prediction.
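The criteria described above can be computed from the four confusion-matrix counts as sketched below. Sensitivity, precision, accuracy and F1 score follow the verbal definitions in the text; GM is taken here as the square root of sensitivity times specificity, which is the usual definition and an assumption with respect to this paper. The function name is hypothetical.

```python
from math import sqrt


def evaluate(tp, fn, fp, tn):
    """Return the evaluation criteria as a dict of fractions in [0, 1]."""
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f1 = 2 * sensitivity * precision / (sensitivity + precision)
    gm = sqrt(sensitivity * specificity)  # assumed definition of GM
    return {"sensitivity": sensitivity, "precision": precision,
            "accuracy": accuracy, "f1": f1, "gm": gm}


print(evaluate(tp=90, fn=10, fp=30, tn=70))
```

As a consistency check on the reported results, a sensitivity of 97.9% and a precision of 95.1% give an F1 score of about 96.5%, which matches the value stated for PlncRNAHDeep.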
Availability of data and materials
The source code of PlncRNAHDeep and the used dataset are available at https://github.com/kangzhai/PlncRNAHDeep.
Abbreviations
CNN: Convolutional neural network
DT: Decision tree
GreeNC: Green noncoding database
kNN: k-nearest neighbor
lncRNA: Long noncoding RNA
LR: Logistic regression
LSD: Least significant difference
LSTM: Long short-term memory
mRNA: Messenger RNA
NB: Naive Bayes
ORF: Open reading frame
RF: Random forest
RNN: Recurrent neural network
SVM: Support vector machine
References
Zhou QZ, Zhang B, Yu QY, Zhang Z. BmncRNAdb: a comprehensive database of noncoding RNAs in the silkworm, Bombyx mori. BMC Bioinformatics. 2016;17:370.
Palazzo AF, Lee ES. Noncoding RNA: what is functional and what is junk? Front Genet. 2015;6:2.
Kung JTY, Colognori D, Lee JT. Long noncoding RNAs: past, present, and future. Genetics. 2013;193(3):651–69.
Aryal B, Rotllan N, FernándezHernando C. Noncoding RNAs and atherosclerosis. Curr Atherosclerosis Rep. 2014;16:407.
Schmitz SU, Grote P, Herrmann BG. Mechanisms of long noncoding RNA function in development and disease. Cell Mol Life Sci. 2016;73(13):2491–509.
Zhou X, Cui J, Meng J, Luan Y. Interactions and links among the noncoding RNAs in plants under stresses. Theor Appl Genet. 2020;133:3235–48.
Swiezewski S, Liu F, Magusin A, Dean C. Coldinduced silencing by long antisense transcripts of an Arabidopsis Polycomb target. Nature. 2009;462:799–802.
Wang J, Meng X, Dobrovolskaya OB, Orlov YL, Chen M. Noncoding RNAs and their roles in stress response in plants. Genom Proteom Bioinf. 2017;15:301–12.
Wekesa JS, Luan Y, Chen M, Meng J. A hybrid prediction method for plant lncRNAprotein interaction. Cells. 2019;8:521.
Dinger ME, Pang KC, Mercer TR, Mattick JS. Differentiating proteincoding and noncoding RNA: challenges and ambiguities. PLoS Comput Biol. 2008;4(11):e1000176.
Lorenz R, Bernhart SH, Siederdissen CHZ, Tafer H, Flamm C, Stadler PF, et al. ViennaRNA package 2.0. Algorithms Mol Biol. 2011;6:26.
Zou C, Gong J, Li H. An improved sequence based prediction protocol for DNAbinding proteins using SVM and comprehensive feature analysis. BMC Bioinformatics. 2013;14:90.
Zhao Q, Mao Q, Zhao Z, Dou T, Wang Z, Cui X, et al. Prediction of plantderived xenomiRs from plant miRNA sequences using random forest and onedimensional convolutional neural network models. BMC Genomics. 2018;19:839.
Bindewald E, Shapiro BA. RNA secondary structure prediction from sequence alignments using a network of knearest neighbor classifiers. RNA. 2006;12:342–52.
Sun L, Luo H, Bu D, Zhao G, Yu K, Zhang C, et al. Utilizing sequence intrinsic composition to classify proteincoding and long noncoding transcripts. Nucleic Acids Res. 2013;41(17):e166.
Li A, Zhang J, Zhou Z. PLEK: a tool for predicting long noncoding RNAs and messenger RNAs based on an improved kmer scheme. BMC Bioinformatics. 2014;15:311.
Kong L, Zhang Y, Ye ZQ, Liu XQ, Zhao SQ, Wei L, et al. CPC: assess the proteincoding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res. 2007;35:W345–9.
Kang YJ, Yang DC, Kong L, Hou M, Meng YQ, Wei L, et al. CPC2: a fast and accurate coding potential calculator based on sequence intrinsic features. Nucleic Acids Res. 2017;45:W12–6.
LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44.
Yang C, Yang L, Zhou M, Xie H, Zhang C, Wang MD, et al. LncADeep: an ab initio lncRNA identification and functional annotation tool based on deep learning. Bioinformatics. 2018;34(22):3825–34.
Baek J, Lee B, Kwon S, Yoon S. LncRNAnet: long noncoding RNA identification using deep learning. Bioinformatics. 2018;34(22):3889–97.
Sundermeyer M, Ney H, Schlüter R. From feedforward to recurrent LSTM neural networks for language modeling. IEEE/ACM Trans Audio Speech Lang Process. 2015;23(3):517–29.
Wei Y, Xia W, Lin M, Huang J, Ni B, Dong J, et al. HCP: a flexible CNN framework for multi-label image classification. IEEE Trans Pattern Anal Mach Intell. 2016;38(9):1901–7.
Zhang L, Yu G, Xia D, Wang J. Protein–protein interactions prediction based on ensemble deep neural networks. Neurocomputing. 2019;324:10–9.
Moyano JM, Gibaja EL, Cios KJ, Ventura S. Review of ensembles of multi-label classifiers: models, experimental study and prospects. Inform Fusion. 2018;44:33–45.
Zhang H, He X, Zhu JK. RNA-directed DNA methylation in plants. RNA Biol. 2013;10(10):1593–6.
Schneider HW, Raiol T, Brigido MM, Walter MEMT, Stadler PF. A support vector machine based method to distinguish long noncoding RNAs from protein coding transcripts. BMC Genomics. 2017;18:804.
Kuo RJ, Wang HS, Hu TL, Chou SH. Application of ant K-means on clustering analysis. Comput Math Appl. 2005;50(10–12):1709–24.
Meng J, Chang Z, Zhang P, Shi W, Luan Y. lncRNA-LSTM: prediction of plant long noncoding RNAs using long short-term memory based on p-nts encoding. In: Proceedings of the 15th international conference on intelligent computing; 2019. p. 347–57.
Rodríguez P, Bautista MA, Gonzàlez J, Escalera S. Beyond one-hot encoding: lower dimensional target embedding. Image Vision Comput. 2018;75:21–31.
Zhang L, Yu G, Guo M, Wang J. Predicting protein-protein interactions using high-quality noninteracting pairs. BMC Bioinformatics. 2018;19(Suppl 19):525.
Harigaya Y, Parker R. The link between adjacent codon pairs and mRNA stability. BMC Genomics. 2017;18:364.
Gallart AP, Pulido AH, Lagrán IAMD, Sanseverino W, Cigliano RA. GREENC: a wiki-based database of plant lncRNAs. Nucleic Acids Res. 2016;44:D1161–6.
Ryu J, Koo HI, Cho NI. Word segmentation method for handwritten documents based on structured learning. IEEE Signal Proc Let. 2015;22(8):1161–5.
Li X, Yang L, Chen LL. The biogenesis, functions, challenges of circular RNAs. Mol Cell. 2018;71(3):428–42.
Yu Y, Si X, Hu C, Zhang J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019;31:1235–70.
Wen J, Liu Y, Shi Y, Huang H, Deng B, Xiao X. A classification model for lncRNA and mRNA based on k-mers and a convolutional neural network. BMC Bioinformatics. 2019;20:469.
Zhang P, Meng J, Luan Y, Liu C. Plant miRNA-lncRNA interaction prediction with the ensemble of CNN and IndRNN. Interdiscip Sci. 2020;12:82–9.
Farahat AK, Ghodsi A, Kamel MS. Efficient greedy feature selection for unsupervised learning. Knowl Inf Syst. 2013;35:285–310.
Acknowledgements
Not applicable.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 22 Supplement 3, 2021: Proceedings of the 2019 International Conference on Intelligent Computing (ICIC 2019): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-22-supplement-3.
Funding
Publication costs were funded by the National Natural Science Foundation of China (No. 61872055). This work was supported by the National Natural Science Foundation of China (Nos. 61872055, 31872116). The funding bodies played no role in the design of the study; in the collection, analysis, and interpretation of data; or in writing the manuscript.
Author information
Contributions
JM, QK and YL conceived and designed the experiments and analyzed the results. JM, ZC and QK conceived and designed the method and wrote the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Meng, J., Kang, Q., Chang, Z. et al. PlncRNAHDeep: plant long noncoding RNA prediction using hybrid deep learning based on two encoding styles. BMC Bioinformatics 22 (Suppl 3), 242 (2021). https://doi.org/10.1186/s12859-020-03870-2
DOI: https://doi.org/10.1186/s12859-020-03870-2