A learning-based method to predict LncRNA-disease associations by combining CNN and ELM
BMC Bioinformatics volume 22, Article number: 622 (2021)
Abstract
Background
lncRNAs play a critical role in numerous biological processes and life activities, especially in diseases. Because traditional wet experiments for identifying unknown lncRNA-disease associations are limited by time consumption and labor cost, it is imperative to construct reliable and efficient computational models to complement them in practice. Deep learning technologies have made impressive contributions in many areas, but their feasibility in bioinformatics has not been adequately verified.
Results
In this paper, a machine learning-based model called LDACE is proposed to predict potential lncRNA-disease associations by combining Extreme Learning Machine (ELM) and Convolutional Neural Network (CNN). Specifically, the representation vectors are constructed by integrating multiple types of biological information, including functional similarity and semantic similarity. Then, CNN is applied to mine both local and global features. Finally, ELM is chosen to carry out the prediction task and detect potential lncRNA-disease associations. The proposed method achieved an Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.9086 in leave-one-out cross-validation and 0.8994 in five-fold cross-validation. In addition, 2 kinds of case studies based on lung cancer and endometrial cancer indicate the robustness and efficiency of LDACE even in a real environment.
Conclusions
The results demonstrate that the proposed model can serve as an auxiliary tool to guide and assist biomedical research, and that the close integration of deep learning and biological big data will provide life sciences with novel insights.
Background
For decades, it was believed that only protein-coding genes contain genetic information [1]. As research deepened, it was found that noncoding RNAs (ncRNAs) account for over 98% of the whole transcriptome [2], which suggests that ncRNAs may be a class of biomolecules with abundant functions [3,4,5].
Long noncoding RNA (lncRNA) is a kind of ncRNA longer than 200 nucleotides [6]. At first, the low expression level and highly tissue-specific pattern of lncRNA misled many researchers into treating it as “transcriptional noise”. Accumulated studies have since shown that lncRNA is involved in many life activities, such as immune system activity, genome regulation, and cell-fate programming and reprogramming [7]. A great number of studies also confirm that numerous human diseases, such as cancers, blood diseases and neurodegeneration, are associated with various kinds of lncRNAs [8]. Therefore, it is critical and urgent to identify unknown human lncRNA-disease associations to facilitate understanding of the underlying mechanisms [9,10,11].
It is unrealistic to confirm unknown lncRNA-disease associations by large-scale wet experiments because of their time consumption, high cost and high error rate [12]. Significant advances in Artificial Intelligence (AI) and computational methods have had a huge impact in a wide range of fields [13,14,15,16]. Based on the assumption that similar lncRNAs are associated with similar diseases and vice versa [17], computational methods for detecting unknown relationships have become a hot topic in bioinformatics [18, 19], supported by related databases such as MNDR [20], Lnc2Cancer [21], NONCODE [22] and DrugBank [23].
To date, there are roughly 3 categories of methods for predicting potential associations or interactions between different bio-entities. The first category is based on matrix decomposition. Lu et al. [24] proposed a method called SIMCLDA to predict potential lncRNA-disease associations based on inductive matrix completion by combining ontology associations and functional similarity. Chen et al. [25] presented a novel framework called IMCMDA to infer potential miRNA-disease associations. Secondly, a large number of computational models borrow ideas from network analysis. Chen et al. [26] proposed a computational method to discover unknown drug-target interactions by network-based random walk with restart. Zhou et al. [27] proposed a rank-based method called RWRHLD to predict lncRNA-disease associations by prioritizing candidate lncRNAs on an integrated lncRNA-disease network. Thirdly, machine-learning-based methods for detecting disease-related miRNAs have been extensively studied. Guo et al. [28] proposed a supervised machine learning method based on various biological information. Computational methods can obtain new lncRNA-disease associations in a short time, which offers a broad prospect for low-risk and faster medical development [29]. The combination of control theory, machine learning and big data will provide relevant researchers with novel insights [30,31,32,33].
From the collection of data to the construction of computational models, lncRNA has attracted a lot of attention in the field of computational biology [34,35,36]. Chen et al. [37] developed the ncRNA Drug Targets Database (NRDTD), which collects clinically or experimentally supported ncRNAs as drug targets. Sun et al. [38] constructed the Disease Related LncRNA-EF Interaction Database (DLREFD), which contains experimentally verified interactions among lncRNAs. Liu et al. [39] proposed a computational model to infer lncRNA-disease associations by combining human lncRNA expression profiles, gene expression profiles and human disease-associated gene data.
In this paper, we propose a novel learning-based prediction model called LDACE that combines CNN and ELM. The framework of the proposed method is shown in Fig. 1. Firstly, we downloaded known lncRNA-disease associations from the LncRNADisease database [40] in October 2018. After removing redundant and invalid items, 1765 independent associations involving 328 different diseases and 881 different lncRNAs were obtained, and an adjacency matrix was constructed from these data to store the whole information. Secondly, the semantic similarity matrix of diseases and the Gaussian interaction profile kernel similarity matrices of diseases and lncRNAs are calculated, so that each lncRNA or disease is represented by abundant biological information. Finally, after feature selection and dimension transformation by CNN, the low-dimensional vectors are fed into the ELM classifier for training, validation and testing. As a result, LDACE obtained an Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.9086 under leave-one-out cross-validation (LOOCV) and 0.8994 under five-fold cross validation. Moreover, classifier and method comparison experiments are applied to assess the ability of the proposed model from different aspects. In addition, we carried out 2 kinds of case studies to simulate the predictive behavior of LDACE in a real environment. Given its competitive performance under numerous evaluation criteria, the proposed method can serve as guidance for practice. This work can also be viewed as an attempt to combine machine learning methods with biological big data. It is anticipated to provide related biomedical researchers with novel insight into mechanisms and cell activity at the molecular level.
Results and discussion
Evaluation criteria
Cross validation was chosen to assess the performance fairly and comprehensively. In k-fold cross-validation, the whole data set is divided into k mutually exclusive subsets of equal size; each subset is used in turn as the test set to evaluate the model, while the others are used as the training set to construct it. In each round of cross-validation, the Receiver Operating Characteristic (ROC) curve is drawn and its AUROC is calculated. The ROC curve shows the ability of the model at different decision thresholds, and the area under it is the AUROC. An AUROC of 1 corresponds to a perfect prediction, whereas an AUROC of 0.5 corresponds to random guessing. A range of further criteria is used to assess our method from different angles, including accuracy (Acc.), sensitivity (Sen.), specificity (Spec.), precision (Prec.) and Matthews correlation coefficient (MCC). They are defined as:
where TP denotes the number of true positives; FP represents the number of false positives; TN indicates the number of true negatives; FN stands for the number of false negatives.
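In standard notation, and as a sketch based on the conventional definitions of these five criteria, they can be written from the confusion-matrix counts as:

```latex
\begin{align}
\mathrm{Acc.}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\mathrm{Sen.}  &= \frac{TP}{TP + FN} \\
\mathrm{Spec.} &= \frac{TN}{TN + FP} \\
\mathrm{Prec.} &= \frac{TP}{TP + FP} \\
\mathrm{MCC}   &= \frac{TP \times TN - FP \times FN}
                       {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{align}
```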
Leave-one-out cross validation (LOOCV)
In leave-one-out cross validation (LOOCV), only one sample is held out as the test set each time, and the others are used as the training set to build the model. The whole v2018 dataset contains 3530 samples, so training and testing are repeated 3530 times. Under LOOCV, LDACE obtained a competitive AUROC of 0.9086. The ROC curve and AUROC achieved by the proposed method are shown in Fig. 2.
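As an illustration of how an AUROC can be obtained from held-out predictions, the sketch below computes it as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. It assumes that the scores of all held-out samples are pooled before the AUROC is computed; the function name `auroc` is hypothetical and is not taken from the LDACE code.

```python
def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly.

    A pair counts as 1 when the positive scores higher than the negative,
    and as 0.5 when the two scores are tied.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a few thousand samples; a rank-based implementation would be preferable for larger data sets.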
Five-fold cross validation
Because LOOCV is labor-intensive, time-consuming and of limited practicality in real-world experiments, five-fold cross validation was chosen to evaluate the proposed model from another perspective. As described above for k-fold cross validation, training and testing are repeated 5 times under this strategy to obtain the final predictive performance. Specifically, LDACE achieved a mean AUROC of 89.94% under five-fold cross validation, with a standard deviation of 0.84%. The evaluation metrics, including Acc., Sen., Spec., Prec. and MCC, were 82.52%, 85.04%, 80, 81%, 65.19% and 89.95%, respectively. Their standard deviations were 0.61, 2.76, 2.12, 1.19 and 1.33. The high AUROC obtained by LDACE implies that the proposed model, which uses various types of biological information, is reliable and effective for discovering potential lncRNA-disease associations. The low standard deviation demonstrates that LDACE is stable and robust. The results of the proposed method are shown in Table 1 and Fig. 3.
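The fold construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names, the fixed random seed and the use of the sample standard deviation are all assumptions.

```python
import random
from statistics import mean, stdev


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices once and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_samples, k, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = k_fold_indices(n_samples, k, seed)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


def summarize(per_fold_values):
    """Mean and sample standard deviation of a metric across the folds."""
    return mean(per_fold_values), stdev(per_fold_values)
```

With 3530 samples and k = 5, every test fold holds 706 samples and every training set holds the remaining 2824.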
Classifiers comparison
To evaluate the performance of ELM on this dataset, we compared it with several commonly used classifiers. The ROC curves and AUROCs under five-fold cross validation are shown in Fig. 4. For fairness, all parameters were set to their default values, and ELM achieved the most competitive results. The advantage of ELM over the other classifiers can be attributed to the following factors: (1) For Naïve Bayes, the features of the representation vector may not be independent, which degrades classification. (2) For SVM, training and test samples may be linearly inseparable, and the kernel function chosen under default parameters is not optimal. (3) For decision tree, the model overfits easily and ignores the correlation between attributes. ELM, which has fewer training parameters, trains faster and applies to a wide range of problems, was therefore chosen to perform the final classification task.
Comparison with previous methods
To further compare our method with existing methods, LDACE was compared with 3 other network-based models: LRLSLDA [41], LRLSLDA-LNCSIM1 [42] and LRLSLDA-LNCSIM2 [42]. These models were implemented on an earlier dataset collected from LncRNADisease in October 2012, so for the sake of fairness we also trained, validated and tested the proposed framework on the same 2012 version of the dataset. The AUROCs obtained are shown in Table 2. Compared with the three models, the proposed method increases the AUROC by 0.08, 0.043 and 0.0362, respectively. In addition, machine learning-based models have significant advantages over network-based models when dealing with new samples.
Case study
To evaluate the proposed model more comprehensively in a realistic setting, we applied LDACE to lung adenocarcinoma and endometrial cancer as 2 kinds of case studies. The associations in LncRNADisease were used as the training set to construct the computational model, and 4 other databases, LncRNADisease 2.0 [43], Lnc2Cancer [21], MNDR [20] and CRlncRNA [44], were used to verify the prediction results.
In the first kind of case study, lung adenocarcinoma was selected as the research object. The positive samples are all 1765 associations in the LncRNADisease database. An equal number of negative samples was randomly selected from the unlabeled pairs, as described in the Methods. The training set, consisting of both positive and negative samples, was used to construct the ELM prediction model. We paired lung adenocarcinoma with all 881 lncRNAs in LncRNADisease as the test set and sorted the prediction scores so that they could be validated conveniently in the other databases. In the end, H19 ranked 4th among the 881 candidates. It has been associated with lung adenocarcinoma by recent research [45], although this association is not included in the LncRNADisease database.
In the second kind of case study, endometrial cancer was selected as the subject in order to test the ability of the proposed model on new samples, that is, new lncRNA prediction. The positive samples consist of the associations in LncRNADisease that remain after all endometrial cancer-related pairs are removed. Given that there are 48 endometrial cancer-associated pairs, the number of positive samples is 1717 (1765 − 48). As in case study 1, the same number of negative samples was randomly extracted and the test samples were built in a similar way. After the classifier was constructed, the test set was fed into the model for prediction and the results were verified in the other databases. The validated top 10 lncRNAs are listed in Table 3.
We carefully analyzed the model construction process and the predicted ranks, and attribute the result to the following factors. From the view of the model, which rests on the assumption that similar lncRNAs are associated with similar diseases and vice versa, lncRNAs and diseases are mainly represented by known associations. Therefore, nodes with large degrees are more likely to be predicted. On the other hand, a lncRNA with numerous associations may simply be a research hot spot. Several isolated nodes such as SNHG4 may actually be associated with the disease but have not yet been verified by wet experiments.
Discussion
As a kind of regulatory factor in human cells, lncRNA has been shown to be closely related to many complex diseases. However, because manual experiments are tedious and inefficient, numerous computational methods have been developed to assist in the identification of lncRNA-disease associations. In this paper, we proposed an efficient method to discover potential lncRNA-disease associations. We constructed and integrated multi-type features, including disease semantic features and the functional features of diseases and lncRNAs. CNN was applied to extract low-dimensional abstract information from the integrated features, and ELM was applied to carry out the prediction task. The proposed method achieved competitive performance in cross-validation, method comparison and case study experiments.
More and more similar methods have been proposed to accelerate experiments and expose the internal connections between lncRNAs and diseases. Most of these methods make use of the inherent properties of biological entities, such as semantic similarity, and of known relationships, such as functional similarity. Some methods also take additional biological entities, such as genes or other ncRNAs, as bridges to assist prediction. The method proposed in this paper has these characteristics to a certain extent, but not completely. In the future, provided that sufficient and reliable data are available, we will build a richer heterogeneous attribute network centered on lncRNAs and diseases to accelerate reasoning and discovery. We hope that the proposed method can not only provide novel insights for similar methods, but also accelerate the research of related experimenters.
Conclusions
In this paper, a computational model called LDACE was proposed based on CNN and ELM to infer potential associations between lncRNAs and diseases. Specifically, the representation vectors of both lncRNAs and diseases are constructed from various kinds of biological information, including functional and semantic similarity. After feature extraction and dimension transformation from the original space by CNN, the low-dimensional dense vectors are fed into the ELM for prediction. LDACE obtained an AUROC of 0.9086 in LOOCV and 0.8994 in five-fold cross validation. Moreover, we carried out classifier and method comparison experiments. The results achieved by LDACE show that combining CNN with ELM is a promising approach and that deep learning technology can significantly improve the ability of the model to identify unknown associations. In addition, 2 kinds of case studies based on lung adenocarcinoma and endometrial cancer demonstrated the effectiveness of LDACE in a practical environment. These competitive results indicate that our method has a prominent ability to mine hidden associations between lncRNAs and diseases. We believe that the tight integration of deep learning with biological data will promote developments in both computer and life sciences. We hope that our work will not only provide assistance and guidance for manual experiments, but also open up a novel way to mine potential information and promote deep understanding of biological data with machine learning methods.
Methods
lncRNA-disease associations
Known lncRNA-disease associations were collected from the LncRNADisease database in August 2018. The initial downloaded file contained 2947 lncRNA-disease association pairs. After routine preprocessing such as identifier unification and redundancy removal, we obtained the v2018 dataset, which contains 1765 independent lncRNA-disease associations involving 881 lncRNAs and 328 diseases. We then constructed an adjacency matrix A with 328 rows and 881 columns to store all associations of the v2018 dataset. The element A(i, j) is set to 1 if and only if the ith disease and the jth lncRNA have been experimentally validated to be associated.
Randomly selecting negative samples from unlabeled samples is a down-sampling technique commonly used for dataset construction in bioinformatics [46]. Therefore, the same number of negative samples as positive samples is randomly selected, and together they form the whole data set. The resulting data set contains 3530 samples: 1765 experimentally validated positive samples and 1765 negative samples.
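The construction of the adjacency matrix and the random selection of negative samples can be sketched as below. The function names, the index-based encoding of diseases and lncRNAs and the fixed seed are illustrative assumptions.

```python
import random


def build_adjacency(n_diseases, n_lncrnas, known_pairs):
    """Return A with A[i][j] = 1 for every known (disease i, lncRNA j) pair."""
    A = [[0] * n_lncrnas for _ in range(n_diseases)]
    for disease, lncrna in known_pairs:
        A[disease][lncrna] = 1
    return A


def sample_negatives(A, n_negatives, seed=0):
    """Draw distinct unlabeled (disease, lncRNA) pairs, i.e. zeros of A."""
    unlabeled = [
        (i, j)
        for i, row in enumerate(A)
        for j, value in enumerate(row)
        if value == 0
    ]
    return random.Random(seed).sample(unlabeled, n_negatives)
```

For the v2018 dataset this would correspond to `build_adjacency(328, 881, pairs)` followed by `sample_negatives(A, 1765)`, giving 3530 labeled samples in total.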
To compare with existing methods, we also downloaded the earlier lncRNA-disease associations, called the v2012 dataset, used by the first published lncRNA-disease association prediction models [41, 42]. After the same operations as described above, we obtained 293 independent associations between 118 lncRNAs and 167 diseases, the same as described in the original paper.
Disease MeSH descriptors
Medical Subject Headings (MeSH) is a standard controlled vocabulary for indexing life science and medical books and journals. Its descriptors are organized into 16 categories, including Health Care [N], Publication Characteristics [V] and Geographicals [Z]. We downloaded all MeSH descriptors (headings) from the National Library of Medicine (NLM) in August 2018 to construct and measure the semantic similarity between diseases.
Disease semantic similarity matrix 1
A disease is an abnormal life process that arises when the body, under certain conditions, is damaged by pathogenic factors. How to represent diseases effectively as vectors has long been a difficult task in bioinformatics research. Previous work has shown that MeSH descriptors offer a high-quality way to characterize diseases [47]. The specific calculation steps are shown in Fig. 5. Each disease can be represented as a Directed Acyclic Graph (DAG). For example, the DAG of disease D can be represented as DAG(D) = (D, \({N}_{D}\), \({E}_{D}\)), where \({N}_{D}\) is the node set that contains disease D and its ancestor diseases in DAG(D), and \({E}_{D}\) is the edge set that contains all links between nodes in DAG(D).
Inspired by the Jaccard formula, the similarity can be calculated by dividing the intersection of two sets by their union. Each disease D is represented by a DAG, and the semantic contribution of a node t in DAG(D) to disease D, on which the semantic similarity between 2 diseases is built, is calculated as follows:
where \(\Delta\) is the semantic contribution factor and \(t\) is a node in the DAG. \(\Delta\) ranges from 0 to 1 and is set to 0.5 according to previous literature [42]. In DAG(D), disease D contributes the most to itself, and the farther an ancestor is from D, the smaller its contribution to D. We can therefore define the semantic value of disease D as the sum of the contributions of all nodes in DAG(D) to D. DV1(D) can be calculated as:
The semantic similarity of disease i and disease j can be defined as follows:
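Assuming the standard DAG-based formulation that this subsection follows [42], the contribution of a node, the semantic value of a disease and the resulting similarity can be written as below. The symbol \(D1_{D}(t)\) for the contribution of node \(t\) to disease \(D\), and \(N_{i}\), \(N_{j}\) for the node sets of diseases \(i\) and \(j\), are notational assumptions chosen to match the definitions above.

```latex
\begin{align}
D1_{D}(t) &=
\begin{cases}
1, & t = D \\
\max\left\{ \Delta \cdot D1_{D}(t') \mid t' \in \text{children of } t \right\}, & t \neq D
\end{cases} \\
DV1(D) &= \sum_{t \in N_{D}} D1_{D}(t) \\
DS1(i, j) &= \frac{\sum_{t \in N_{i} \cap N_{j}} \left( D1_{i}(t) + D1_{j}(t) \right)}
                  {DV1(i) + DV1(j)}
\end{align}
```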
Disease semantic similarity matrix 2
In disease semantic similarity matrix 1, the algorithm only focuses on a single disease from a local view and does not consider the differences between diseases from a global perspective. Some scholars argue that the contribution of a disease should also depend on how frequently it appears across the whole MeSH. Drawing on information theory, they proposed a new contribution measure that achieved a certain degree of improvement [42]. The new contribution of disease t to disease D can be calculated as follows:
Then the semantic value of disease D can be obtained, DV2(D) as:
The semantic similarity of disease i and disease j can be defined as follows:
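Both contribution models can be illustrated on a toy DAG. The sketch below assumes that model 1 multiplies the contribution by \(\Delta\) for each step towards an ancestor, and that model 2 uses \(-\log\) of the fraction of disease DAGs that contain the node, as in the information-theoretic variant of [42]. The toy hierarchy, the `parents` mapping and all function names are hypothetical and are not real MeSH entries.

```python
import math

# Toy hierarchy (child -> list of parents); not real MeSH descriptors.
TOY_PARENTS = {"root": [], "a": ["root"], "b": ["root"], "d1": ["a"], "d2": ["a", "b"]}


def node_set(disease, parents):
    """N_D: the disease itself plus all of its ancestors."""
    seen, stack = {disease}, [disease]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def contributions_model1(disease, parents, delta=0.5):
    """Model 1: contribution 1 for D, decayed by delta per step to an ancestor."""
    contrib, stack = {disease: 1.0}, [disease]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, []):
            candidate = delta * contrib[node]
            if candidate > contrib.get(parent, 0.0):  # keep the largest contribution
                contrib[parent] = candidate
                stack.append(parent)
    return contrib


def contributions_model2(disease, parents, all_diseases):
    """Model 2: -log(share of disease DAGs containing t); rarer nodes weigh more."""
    node_sets = [node_set(d, parents) for d in all_diseases]
    total = len(node_sets)
    return {
        t: -math.log(sum(1 for nodes in node_sets if t in nodes) / total)
        for t in node_set(disease, parents)
    }


def semantic_similarity(contrib_i, contrib_j):
    """Shared contributions divided by the sum of both semantic values."""
    shared = set(contrib_i) & set(contrib_j)
    numerator = sum(contrib_i[t] + contrib_j[t] for t in shared)
    denominator = sum(contrib_i.values()) + sum(contrib_j.values())
    return numerator / denominator if denominator else 0.0
```

On the toy DAG, diseases `d1` and `d2` share the ancestors `a` and `root`, so model 1 gives a similarity of (0.5 + 0.5 + 0.25 + 0.25) / (1.75 + 2.25) = 0.375.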
Disease Gaussian interaction profile kernel similarity matrix
The matrix A contains all the association information of the v2018 dataset. Disease i can be represented as a function vector \({\text{d}}_{i}\) of 881 dimensions, which is the ith row of matrix A. The value of each dimension of \({\text{d}}_{i}\) indicates whether the disease has been associated with the corresponding lncRNA: the jth dimension is 1 if and only if the ith disease has been experimentally validated to be associated with the jth lncRNA, and 0 otherwise.
This vector can be treated as a functional representation of the disease, and we transform it with the Gaussian interaction profile kernel function to make it more suitable for downstream classification tasks. The similarity between disease i and disease j can then be defined as follows:
The hyperparameter \(\alpha_{d}\) is defined as follows:
Here we set \(\alpha_{d}^{\prime}\) = 0.5, and nd is set to 328, the number of diseases. Finally, the disease Gaussian interaction profile kernel similarity matrix DG is a square matrix with 328 rows and 328 columns.
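Assuming the usual form of the Gaussian interaction profile kernel, the disease similarity and its bandwidth parameter described in this subsection can be written as:

```latex
\begin{align}
DG(i, j) &= \exp\left( -\alpha_{d} \left\lVert d_{i} - d_{j} \right\rVert^{2} \right) \\
\alpha_{d} &= \frac{\alpha_{d}^{\prime}}
                  {\frac{1}{nd} \sum_{i=1}^{nd} \left\lVert d_{i} \right\rVert^{2}}
\end{align}
```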
Disease integrated similarity matrix
To integrate all biological information, the element of the final disease similarity matrix DS (i, j) can be defined as follows:
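One integration rule used in related work is given here purely as an assumption: it averages the two semantic similarities where MeSH-based similarity is available and falls back on the Gaussian kernel similarity otherwise.

```latex
\[
DS(i, j) =
\begin{cases}
\dfrac{DS1(i, j) + DS2(i, j)}{2}, & \text{if diseases } i \text{ and } j \text{ have semantic similarity} \\[2ex]
DG(i, j), & \text{otherwise}
\end{cases}
\]
```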
LncRNA Gaussian interaction profile kernel similarity matrix
Similarly to diseases, the function of each lncRNA can be represented by the corresponding column of the matrix A. The Gaussian interaction profile kernel similarity between lncRNA i and lncRNA j can be calculated as follows:
Given that there is no other information about lncRNAs, we directly regard RG as the lncRNA similarity matrix (RS). Parameter \(\alpha_{r}\) can be adjusted as follows:
Here, we set \(\alpha_{r}^{\prime}\) = 0.5, and nl is set to 881, the number of lncRNAs. Finally, the lncRNA similarity matrix RS with 881 rows and 881 columns can be constructed.
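The Gaussian interaction profile kernel for both diseases and lncRNAs can be sketched with one function that takes a list of interaction profiles. The sketch assumes the usual kernel \(\exp(-\alpha \lVert v_{i} - v_{j} \rVert^{2})\), with \(\alpha\) equal to \(\alpha^{\prime}\) divided by the mean squared norm of the profiles, and it assumes that diseases are the rows of A and lncRNAs its columns. The function names are hypothetical.

```python
import math


def gip_kernel(profiles, alpha_prime=0.5):
    """Gaussian interaction profile kernel between all pairs of profiles."""
    n = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    alpha = alpha_prime / mean_sq_norm  # bandwidth normalized by the mean squared norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = kernel[j][i] = math.exp(-alpha * sq_dist)
    return kernel


def columns(matrix):
    """Transpose a row-major matrix so that its columns become profiles."""
    return [list(col) for col in zip(*matrix)]
```

Under these assumptions, `gip_kernel(A)` yields the disease matrix DG and `gip_kernel(columns(A))` yields the lncRNA matrix RG.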
The representation of the association pair
Through the above operations, each lncRNA and each disease can be represented as a vector that integrates various kinds of biological information. In summary, the ith disease is represented by the ith row of the matrix DS, as shown below:
The jth lncRNA can be represented as the jth row of the matrix RS as shown below:
The association pair between the ith disease and the jth lncRNA is then represented by combining the two vectors as follows:
We thus obtain 3530 vectors of 1209 dimensions (328 + 881). Each positive sample is given the label 1 and each negative sample the label 0.
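The pair representation can be sketched as a simple concatenation of one row of DS and one row of RS. The order of concatenation (disease first, then lncRNA) and the function names are assumptions made for illustration.

```python
def pair_vector(DS, RS, disease_index, lncrna_index):
    """Concatenate the disease row of DS with the lncRNA row of RS."""
    return list(DS[disease_index]) + list(RS[lncrna_index])


def build_dataset(DS, RS, positives, negatives):
    """Return feature vectors and labels (1 for positive, 0 for negative pairs)."""
    pairs = list(positives) + list(negatives)
    X = [pair_vector(DS, RS, i, j) for i, j in pairs]
    y = [1] * len(positives) + [0] * len(negatives)
    return X, y
```

With rows of 328 and 881 entries, each pair vector has 328 + 881 = 1209 entries, which matches the dimension stated above.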
Convolutional neural networks (CNN)
Because the constructed representation vector is high-dimensional and sparse, we extract effective features with a Convolutional Neural Network (CNN) [48,49,50]. Compared with other machine learning methods, CNN has unique advantages in feature capture and model capacity [51]. In this paper, we therefore choose CNN to carry out the feature extraction task [52, 53].
A convolutional neural network is a multi-layer neural network that consists of an input layer, convolutional layers, pooling layers, a fully connected layer and an output layer [54, 55]. The key components of a CNN are the convolutional layer and the pooling layer, which extract features and pass them to the fully connected layer for classification [56]. The weights of the convolution window are adjusted according to the feedback of the result [57]. The convolutional layer is applied to extract both local and global features with different filters, as shown in Fig. 6.
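The roles of the convolutional and pooling layers can be illustrated with a one-dimensional toy example. The sketch uses fixed, hand-written filters instead of learned ones and does not reflect the architecture or filter sizes of LDACE, which are not stated here; all names are hypothetical.

```python
def conv1d(signal, kernel, bias=0.0):
    """Slide the kernel over the signal (cross-correlation, no padding)."""
    width = len(kernel)
    return [
        sum(signal[start + k] * kernel[k] for k in range(width)) + bias
        for start in range(len(signal) - width + 1)
    ]


def relu(values):
    """Keep positive activations and set negative ones to zero."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size):
    """Keep the largest activation in each non-overlapping window."""
    return [max(values[s:s + size]) for s in range(0, len(values) - size + 1, size)]


def extract_features(signal, kernels, pool_size=2):
    """Apply every filter, then ReLU and pooling, and concatenate the results."""
    features = []
    for kernel in kernels:
        features.extend(max_pool1d(relu(conv1d(signal, kernel)), pool_size))
    return features
```

Each filter responds to a different local pattern, and pooling shortens the output, which is how a long sparse input vector is turned into a shorter dense one.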
ELM
Huang et al. [58] proposed the Extreme Learning Machine (ELM), a learning algorithm for single-hidden-layer feedforward neural networks. For traditional artificial neural networks, determining the parameters with the back-propagation algorithm consumes considerable resources and time [59]. ELM avoids these iterative steps: it has only one hidden layer, and when the classifier is trained, the number of hidden-layer neurons is the only hyperparameter that has to be set. The main steps of ELM are shown in Fig. 7.
ELM is a single-hidden-layer feedforward network with random hidden nodes and activation function f(x). Consider N arbitrary distinct samples \(\left( {x_{i} ,l_{i} } \right)\), where \(x_{i} = \left[ {x_{i1} ,x_{i2} , \ldots ,x_{in} } \right]^{T} \in R^{n}\) and \(l_{i} = \left[ {l_{i1} ,l_{i2} , \ldots ,l_{im} } \right]^{T} \in R^{m}\). The output of ELM is then represented as follows:
where \(N^{\prime}\) is the number of hidden nodes, \(p_{i} = \left[ {p_{i1} ,p_{i2} , \ldots ,p_{in} } \right]^{T}\) is the weight vector from the input layer nodes to the ith hidden node, \(q_{i} = \left[ {q_{i1} ,q_{i2} , \ldots ,q_{im} } \right]^{T}\) is the weight vector from the ith hidden node to the output layer, and \(t_{i}\) is the threshold of the ith hidden node. \(p_{i} \cdot x_{j}\) is the inner product of \(p_{i}\) and \(x_{j}\).
The loss function is defined as follows:
In order to minimize the error between the network output and the target labels, we need to determine the three parameters \(p_{i}\), \(q_{i}\) and \(t_{i}\) such that:
Eq. (22) can be written compactly as \(Hq = l\), where
Therefore, in order to train the ELM, we need to find the appropriate parameters \(\hat{p}_{i}\), \(\hat{q}_{i}\) and \(\hat{t}_{i}\) such that
This is equivalent to minimizing the loss function as follows:
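Assuming the standard ELM formulation [58], the chain of equations referred to in this subsection, including Eq. (22), can be sketched as below. The symbol \(o_{j}\) for the network output on sample \(j\) and the symbol \(E\) for the loss are introduced here as assumptions; all other symbols are those defined above.

```latex
\begin{align}
& \sum_{i=1}^{N'} q_{i} \, f\left( p_{i} \cdot x_{j} + t_{i} \right) = o_{j},
  \quad j = 1, \ldots, N
  && \text{(network output)} \\
& \sum_{j=1}^{N} \left\lVert o_{j} - l_{j} \right\rVert
  && \text{(loss function)} \\
& \sum_{i=1}^{N'} q_{i} \, f\left( p_{i} \cdot x_{j} + t_{i} \right) = l_{j},
  \quad j = 1, \ldots, N
  && \text{(zero-error condition, Eq. (22))} \\
& H = \begin{bmatrix}
    f(p_{1} \cdot x_{1} + t_{1}) & \cdots & f(p_{N'} \cdot x_{1} + t_{N'}) \\
    \vdots & \ddots & \vdots \\
    f(p_{1} \cdot x_{N} + t_{1}) & \cdots & f(p_{N'} \cdot x_{N} + t_{N'})
  \end{bmatrix}_{N \times N'},
  \quad
  q = \begin{bmatrix} q_{1}^{T} \\ \vdots \\ q_{N'}^{T} \end{bmatrix}_{N' \times m},
  \quad
  l = \begin{bmatrix} l_{1}^{T} \\ \vdots \\ l_{N}^{T} \end{bmatrix}_{N \times m} \\
& \left\lVert H(\hat{p}_{i}, \hat{t}_{i}) \, \hat{q} - l \right\rVert
  = \min_{p_{i}, t_{i}, q} \left\lVert H(p_{i}, t_{i}) \, q - l \right\rVert \\
& E = \sum_{j=1}^{N} \left( \sum_{i=1}^{N'} q_{i} \, f\left( p_{i} \cdot x_{j} + t_{i} \right) - l_{j} \right)^{2}
\end{align}
```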
Because ELM combines high learning efficiency with strong generalization ability, it is widely used to solve both academic and industrial problems. Here, all hyperparameters are set to their default values.
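A minimal ELM for a single output can be sketched with the Python standard library alone. The hidden weights and thresholds are drawn at random and never trained; only the output weights q are fitted. Note the substitution: the original ELM formulation solves \(Hq = l\) with the Moore-Penrose pseudo-inverse, whereas this sketch solves the ridge-regularized normal equations \((H^{T}H + \lambda I) q = H^{T} l\) by Gaussian elimination to avoid an external linear algebra library. The sigmoid activation, the weight range, the value of `ridge` and all function names are assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden_layer(X, weights, biases):
    """Row j holds f(p_i . x_j + t_i) for every hidden node i."""
    return [
        [sigmoid(sum(w * x for w, x in zip(p, row)) + t) for p, t in zip(weights, biases)]
        for row in X
    ]


def solve_linear_system(M, b):
    """Gaussian elimination with partial pivoting for a square system M z = b."""
    n = len(M)
    aug = [list(M[r]) + [b[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (aug[r][n] - sum(aug[r][c] * z[c] for c in range(r + 1, n))) / aug[r][r]
    return z


def train_elm(X, labels, n_hidden=20, ridge=1e-6, seed=0):
    """Draw random hidden parameters, then fit only the output weights q."""
    rng = random.Random(seed)
    n_features = len(X[0])
    weights = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(n_hidden)]
    biases = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    H = hidden_layer(X, weights, biases)
    # Ridge-regularized normal equations: (H^T H + ridge * I) q = H^T l
    gram = [
        [sum(row[a] * row[b] for row in H) + (ridge if a == b else 0.0) for b in range(n_hidden)]
        for a in range(n_hidden)
    ]
    rhs = [sum(row[a] * y for row, y in zip(H, labels)) for a in range(n_hidden)]
    q = solve_linear_system(gram, rhs)
    return weights, biases, q


def predict_elm(model, X):
    """Return one real-valued score per sample; threshold at 0.5 for a class."""
    weights, biases, q = model
    return [sum(h * qi for h, qi in zip(row, q)) for row in hidden_layer(X, weights, biases)]
```

Because no iterative weight updates are needed, training reduces to one linear solve, which is the source of the speed advantage described above.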
Availability of data and materials
The lncRNA-disease association dataset can be downloaded from http://www.cuilab.cn/. Disease MeSH descriptors can be downloaded from ftp://nlmpubs.nlm.nih.gov/online/mesh/. The code used during this study is available from the corresponding author on reasonable request.
Abbreviations
LncRNA: Long noncoding RNA
ELM: Extreme Learning Machine
CNN: Convolutional Neural Network
AUROC: Area Under Receiver Operating Characteristic Curve
LOOCV: Leave-one-out cross-validation
DAG: Directed Acyclic Graph
References
Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, Tanzer A, Lagarde J, Lin W, Schlesinger F. Landscape of transcription in human cells. Nature. 2012;489(7414):101.
Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M, Weissman S. Global identification of human transcribed sequences with genome tiling arrays. Science. 2004;306(5705):2242–6.
You ZH, Lei YK, Gui J, Huang DS, Zhou X. Using manifold embedding for assessing and predicting protein interactions from high-throughput experimental data. Bioinformatics. 2010;26(21):2744–51.
Yi HC, You ZH, Huang DS, Kwoh CK. Graph representation learning in bioinformatics: trends, methods and applications. Brief Bioinform. 2021;23(1):bbab340.
Zhang Q, Wang S, Chen Z, He Y, Liu Q, Huang DS. Locating transcription factor binding sites by fully convolutional neural network. Brief Bioinform. 2021;22(5):bbaa435.
Wang L, You ZH, Huang DS, Li JQ. MGRCDA: metagraph recommendation method for predicting CircRNA-disease association. IEEE Trans Cybern. 2021.
Flynn RA, Chang HY. Long noncoding RNAs in cell-fate programming and reprogramming. Cell Stem Cell. 2014;14(6):752–61.
Johnson R. Long noncoding RNAs in Huntington’s disease neurodegeneration. Neurobiol Dis. 2012;46(2):245–54.
Qiu MT, Hu JW, Yin R, Xu L. Long noncoding RNA: an emerging paradigm of cancer research. Tumor Biol. 2013;34(2):613–20.
Chen X, Sun YZ, Guan NN, Qu J, Huang ZA, Zhu ZX, Li JQ. Computational models for lncRNA function prediction and functional similarity calculation. Brief Funct Genom. 2019;18(1):58–82.
Chen X, Yan CC, Zhang X, You ZH. Long noncoding RNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2017;18(4):558–76.
He Y, Shen Z, Zhang Q, Wang S, Huang DS. A survey on deep learning in DNA/RNA motif mining. Brief Bioinform. 2021;22(4):bbaa229.
Gao S, Zhou M, Wang Y, Cheng J, Yachi H, Wang J. Dendritic neuron model with effective learning algorithms for classification, approximation, and prediction. IEEE Trans Neural Netw Learn Syst. 2018;30(2):601–14.
Liu T, Tian B, Ai Y, Zou Y, Wang FY. Parallel reinforcement learning-based energy efficiency improvement for a cyber-physical system. IEEE/CAA J Autom Sin. 2019;7(2):617–26.
Huang DS, Du JX. A constructive hybrid structure optimization methodology for radial basis probabilistic neural networks. IEEE Trans Neural Netw. 2008;19(12):2099–115.
Wang XF, Huang DS. A novel density-based clustering framework by using level set method. IEEE Trans Knowl Data Eng. 2009;21(11):1515–31.
Wang XF, Huang DS, Du JX, Xu H, Heutte L. Classification of plant leaf images with complicated background. Appl Math Comput. 2008;205(2):916–26.
Chen X, Xie D, Zhao Q, You ZH. MicroRNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2019;20(2):515–39.
Chen X, Yan CC, Zhang X, Zhang X, Dai F, Yin J, Zhang Y. Drug–target interaction prediction: databases, web servers and computational models. Brief Bioinform. 2016;17(4):696–712.
Cui T, Zhang L, Huang Y, Yi Y, Tan P, Zhao Y, Hu Y, Xu L, Li E, Wang D. MNDR v2.0: an updated resource of ncRNA–disease associations in mammals. Nucleic Acids Res. 2017;46(D1):D371–4.
Ning S, Zhang J, Wang P, Zhi H, Wang J, Liu Y, Gao Y, Guo M, Yue M, Wang L. Lnc2Cancer: a manually curated database of experimentally supported lncRNAs associated with various human cancers. Nucleic Acids Res. 2015;44(D1):D980–5.
Liu C, Bai B, Skogerbø G, Cai L, Deng W, Zhang Y, Bu D, Zhao Y, Chen R. NONCODE: an integrated knowledge database of noncoding RNAs. Nucleic Acids Res. 2005;33(suppl_1):D112–5.
Wishart DS, Feunang YD, Guo AC, Lo EJ, Marcu A, Grant JR, Sajed T, Johnson D, Li C, Sayeeda Z. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2017;46(D1):D1074–82.
Lu C, Yang M, Luo F, Wu FX, Li M, Pan Y, Li Y, Wang J. Prediction of lncRNA-disease associations based on inductive matrix completion. Bioinformatics. 2018;1:8.
Chen X, Wang L, Qu J, Guan NN, Li JQ. Predicting miRNA–disease association based on inductive matrix completion. Bioinformatics. 2018;34(24):4256–65.
Chen X, Liu MX, Yan GY. Drug–target interaction prediction by random walk on the heterogeneous network. Mol BioSyst. 2012;8(7):1970–8.
Zhou M, Wang X, Li J, Hao D, Wang Z, Shi H, Han L, Zhou H, Sun J. Prioritizing candidate diseaserelated long noncoding RNAs by walking on the heterogeneous lncRNA and disease network. Mol BioSyst. 2015;11(3):760–9.
Guo ZH, You ZH, Wang YB, Yi HC, Chen ZH. A learningbased method for LncRNAdisease association identification combing similarity information and rotation forest. iScience. 2019;19:786–95.
Zhao Y, Huang DS, Jia W. Completed local binary count for rotation invariant texture classification. IEEE Trans Image Process. 2012;21(10):4492–7.
Luo X, Wu H, Yuan H, Zhou M. Temporal patternaware QoS prediction via biased nonnegative latent factorization of tensors. IEEE Trans Cybern. 2019;50(5):1798–809.
Luo X, Zhou M, Li S, Hu L, Shang M. Nonnegativity constrained missing data estimation for highdimensional and sparse matrices from industrial applications. IEEE Trans Cybern. 2019;50(5):1844–55.
Luo X, Zhou M, Li S, Shang M. An inherently nonnegative latent factor model for highdimensional and sparse matrices from industrial applications. IEEE Trans Ind Inf. 2017;14(5):2011–22.
Huang DS, Jia W, Zhang D. Palmprint verification based on principal lines. Pattern Recognit. 2008;41(4):1316–28.
Wang XF, Huang DS, Xu H. An efficient local ChanVese model for image segmentation. Pattern Recognit. 2010;43(3):603–18.
Lu CY, Min H, Zhao ZQ, Zhu L, Huang DS, Yan S. Robust and efficient subspace segmentation via least squares regression. In: European conference on computer vision: 2012. Springer. p. 347–360.
Jia W, Huang DS, Zhang D. Palmprint verification based on robust line orientation code. Pattern Recognit. 2008;41(5):1504–13.
Chen X, Sun YZ, Zhang DH, Li JQ, Yan GY, An JY, You ZH: NRDTD: a database for clinically or experimentally supported noncoding RNAs and drug targets associations. Database 2017, 2017.
Sun YZ, Zhang DH, Ming Z, Li JQ, Chen X. DLREFD: a database providing associations of long noncoding RNAs, environmental factors and phenotypes. Database 2017, 2017.
Liu MX, Chen X, Chen G, Cui QH, Yan GY. A computational framework to infer human diseaseassociated long noncoding RNAs. PLoS ONE. 2014;9(1):e84408.
Chen G, Wang Z, Wang D, Qiu C, Liu M, Chen X, Zhang Q, Yan G, Cui Q. LncRNADisease: a database for longnoncoding RNAassociated diseases. Nucleic Acids Res. 2012;41(D1):D983–6.
Chen X, Yan GY. Novel human lncRNAdisease association inference based on lncRNA expression profiles. Bioinformatics. 2013;29(20):2617–24.
Chen X, Yan CC, Luo C, Ji W, Zhang Y, Dai Q. Constructing lncRNA functional similarity network based on lncRNAdisease associations and disease semantic similarity. Sci Rep. 2015;5:11338.
Bao Z, Yang Z, Huang Z, Zhou Y, Cui Q, Dong D. LncRNADisease 2.0: an updated database of long noncoding RNAassociated diseases. Nucleic Acids Res. 2018;47(D1):D1034–7.
Wang J, Zhang X, Chen W, Li J, Liu C. CRlncRNA: a manually curated database of cancerrelated long noncoding RNAs with experimental proof of functions on clinicopathological and molecular features. BMC Med Genom. 2018;11(6):114.
Wang P, Lu S, Mao H, Bai Y, Ma T, Cheng Z, Zhang H, Jin Q, Zhao J, Mao H. Identification of biomarkers for the detection of early stage lung adenocarcinoma by microarray profiling of long noncoding RNAs. Lung Cancer. 2015;88(2):147–53.
BenHur A, Noble WS. Kernel methods for predicting protein–protein interactions. Bioinformatics. 2005;21(suppl_1):i38–46.
Wang D, Wang J, Lu M, Song F, Cui Q. Inferring the human microRNA functional similarity and functional network based on microRNAassociated diseases. Bioinformatics. 2010;26(13):1644–50.
Li B, Zheng CH, Huang DS. Locally linear discriminant embedding: an efficient method for face recognition. Pattern Recognit. 2008;41(12):3813–21.
Zheng CH, Huang DS, Zhang L, Kong XZ. Tumor clustering using nonnegative matrix factorization with gene selection. IEEE Trans Inf Technol Biomed. 2009;13(4):599–607.
Wu Y, Zhang K, Wu D, Wang C, Yuan CA, Qin X, Zhu T, Du YC, Wang HL, Huang DS. Person reidentification by multiscale feature representation learning with random batch feature mask. IEEE Trans Cogn Dev Syst. 2020;13(4):865–74.
Wu D, Wang C, Wu Y, Wang QC, Huang DS. Attention deep model with multiscale deep supervision for person reidentification. IEEE Trans Emerg Top Comput Intell. 2021;5(1):70–8.
Hu R, Jia W, Ling H, Huang D. Multiscale distance matrix for fast plant leaf recognition. IEEE Trans Image Process. 2012;21(11):4667–72.
Zhang Q, Wang D, Han K, Huang DS. Predicting TFDNA binding motifs from ChIPseq datasets using the bagbased classifier combined with a multifold learning scheme. IEEE/ACM Trans Comput Biol Bioinform. 2020;18(5):1743–51.
Zhang Q, Yu W, Han K, Nandi AK, Huang DS. Multiscale capsule network for predicting DNAprotein binding sites. IEEE/ACM Trans Comput Biol Bioinform. 2020;18(5):1793–800.
Peng C, Zheng Y, Huang DS. Capsule network based modeling of multiomics data for discovery of breast cancerrelated genes. IEEE/ACM Trans Comput Biol Bioinf. 2019;17(5):1605–12.
Liu B, Yang F, Huang DS, Chou KC. iPromoter2L: a twolayer predictor for identifying promoters and their types by multiwindowbased PseKNC. Bioinformatics. 2018;34(1):33–40.
Shen Z, Zhang Q, Han K, Huang DS. A deep learning model for RNAprotein binding preference prediction based on hierarchical LSTM and attention network. IEEE/ACM Trans Comput Biol Bioinform. 2020.
Huang GB, Zhu QY, Siew CK. Extreme learning machine: theory and applications. Neurocomputing. 2006;70(1–3):489–501.
Li B, Fan ZT, Zhang XL, Huang DS. Robust dimensionality reduction via feature space to feature space distance metric learning. Neural Netw. 2019;112:1–14.
Acknowledgements
The authors would like to thank the editors and anonymous reviewers for their constructive advice.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 22 Supplement 5 2021: Proceedings of the International Conference on Biomedical Engineering Innovation (ICBEI) 2019-2020. The full contents of the supplement are available at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-22-supplement-5
Funding
This work was supported by grants from the National Key R&D Program of China (Nos. 2018AAA0100100 and 2018YFA0902600), partly supported by the National Natural Science Foundation of China (Grant Nos. 61732012, 61772370, 61932008, 61772357, 62002297, 62002266, and 62073231), and supported by the “BAGUI Scholar” Program and the Scientific and Technological Base and Talent Special Program, GuiKe AD18126015, of the Guangxi Zhuang Autonomous Region of China. The funding was used to develop, implement, and evaluate the proposed algorithms. The funding body did not play any role in the design and implementation of the algorithms or in writing the manuscript.
Author information
Contributions
Z.H. G. and Z.H. C. considered the algorithm, arranged the datasets, and performed the analyses. Z.H. Y., Y.B. W., H.C. Y. and M.N. W. wrote the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Guo, ZH., Chen, ZH., You, ZH. et al. A learning-based method to predict LncRNA-disease associations by combining CNN and ELM. BMC Bioinformatics 22 (Suppl 5), 622 (2021). https://doi.org/10.1186/s12859-022-04611-3
DOI: https://doi.org/10.1186/s12859-022-04611-3
Keywords
 CNN
 ELM
 lncRNA
 Disease
 Association prediction