A learning-based method to predict LncRNA-disease associations by combining CNN and ELM

Abstract

Background

lncRNAs play a critical role in numerous biological processes and life activities, especially in diseases. Traditional wet-lab experiments for identifying unknown lncRNA-disease associations are limited by their time consumption and labor cost, so it is imperative to construct reliable and efficient computational models to complement them. Deep learning technologies have made impressive contributions in many areas, but their feasibility in bioinformatics has not been adequately verified.

Results

In this paper, a machine learning-based model called LDACE is proposed to predict potential lncRNA-disease associations by combining an Extreme Learning Machine (ELM) and a Convolutional Neural Network (CNN). Specifically, the representation vectors are constructed by integrating multiple types of biological information, including functional similarity and semantic similarity. Then, the CNN is applied to mine both local and global features. Finally, the ELM carries out the prediction task to detect potential lncRNA-disease associations. The proposed method achieved an Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.9086 in leave-one-out cross-validation and 0.8994 in fivefold cross-validation. In addition, 2 kinds of case studies based on lung cancer and endometrial cancer indicate the robustness and efficiency of LDACE in a real environment.

Conclusions

The results demonstrate that the proposed model can serve as an auxiliary tool to guide and assist biomedical research, and that the close integration of deep learning and biological big data can provide the life sciences with novel insights.

Background

For decades, it was believed that only protein-coding genes contain genetic information [1]. As research deepened, researchers found that noncoding RNAs (ncRNAs) account for over 98% of the whole transcriptome [2], which suggests that ncRNAs may be a class of biomolecules with abundant functions [3,4,5].

Long non-coding RNA (lncRNA) is a kind of ncRNA longer than 200 nucleotides [6]. At first, the low expression level and highly tissue-specific pattern of lncRNAs misled many researchers into treating them as “transcriptional noise”. Accumulated studies have since shown that lncRNAs are involved in many life activities, such as the immune system, genome regulation, and cell-fate programming and reprogramming [7]. A great deal of research also confirms that numerous human diseases, such as cancers, blood diseases and neurodegeneration, are associated with various kinds of lncRNAs [8]. Therefore, it is critical and urgent to identify unknown human lncRNA-disease associations to facilitate understanding of the underlying mechanisms [9,10,11].

It is unrealistic to confirm unknown lncRNA-disease associations by large-scale wet experiments because of their time consumption, high cost and high error rate [12]. Significant advances in Artificial Intelligence (AI) and computational methods have had a huge impact across a wide range of fields [13,14,15,16]. Based on the assumption that similar lncRNAs are associated with similar diseases and vice versa [17], computational methods for detecting unknown relationships have become a hot topic in bioinformatics [18, 19], drawing on related databases such as MNDR [20], Lnc2Cancer [21], NONCODE [22] and DrugBank [23].

To date, there are roughly 3 categories of methods for predicting potential associations or interactions between different bioentities. The first category is based on matrix decomposition. Lu et al. [24] proposed a method called SIMCLDA to predict potential lncRNA-disease associations by inductive matrix completion, combining ontology associations and functional similarity. Chen et al. [25] presented a framework called IMCMDA to infer potential miRNA-disease associations. Secondly, a large number of computational models borrow ideas from network analysis. Chen et al. [26] proposed a computational method to discover unknown drug-target interactions by network-based random walk with restart. Zhou et al. [27] proposed a rank-based method called RWRHLD to predict lncRNA-disease associations by prioritizing candidate associations in integrated networks. Thirdly, machine-learning-based methods for detecting disease-related miRNAs have been extensively studied. Guo et al. [28] proposed a supervised machine learning method based on various biological information. Computational methods can yield new lncRNA-disease associations in a short time, which offers a broad prospect for lower-risk and faster medical development [29]. The combination of control theory, machine learning and big data will provide relevant researchers with novel insights [30,31,32,33].

From the collection of data to the construction of computational models, lncRNA has attracted a lot of attention in the field of computational biology [34,35,36]. Chen et al. [37] developed a database called ncRNA Drug Targets Database (NRDTD) that collected clinically or experimentally supported ncRNAs as drug targets. Sun et al. [38] constructed a database called Disease Related LncRNA-EF Interaction Database (DLREFD), which contains experimentally verified interactions among lncRNAs. Liu et al. [39] proposed a computational model to infer lncRNA-disease associations by combining human lncRNA expression profiles, gene expression profiles, and human disease-associated gene data.

In this paper, we propose a novel learning-based prediction model called LDACE that combines CNN and ELM. The framework of the proposed method is shown in Fig. 1. Firstly, we downloaded known lncRNA-disease associations from the LncRNADisease database [40] in October, 2018. After removing redundant and invalid items, 1765 independent associations involving 328 different diseases and 881 different lncRNAs were obtained, and an adjacency matrix was constructed from these data to store the whole information. Secondly, the semantic similarity matrix of diseases and the Gaussian interaction profile kernel similarity matrices of diseases and lncRNAs are calculated so that each lncRNA and disease is represented by abundant biological information. Finally, after feature selection and dimension transformation by the CNN, the low-dimensional vectors are fed into the ELM classifier for training, validation and testing. As a result, LDACE achieved an Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.9086 under Leave-one-out cross-validation (LOOCV) and 0.8994 under fivefold cross validation. Moreover, classifier and method comparison experiments are used to assess the proposed model from different aspects. In addition, we carried out 2 kinds of case studies to simulate the prediction performance of LDACE in a real environment. Given the competitive results under the numerous evaluation criteria, the proposed method can serve as guidance for practice. Meanwhile, this work can be viewed as an attempt to combine machine learning methods with biological big data. It is anticipated to provide biomedical researchers with novel insight into mechanisms and cell activity at the molecular level.

Fig. 1

Flowchart of the proposed model LDACE

Results and discussion

Evaluation criteria

Cross validation was chosen to assess the performance fairly and comprehensively. For k-fold cross-validation, the whole data set is divided into k mutually exclusive subsets of equal size; each subset is treated in turn as the test set to evaluate the model, and the others are used as the training set to construct the model. For each cross-validation run, the Receiver Operating Characteristic (ROC) curve is drawn and its AUROC is calculated. The ROC curve evaluates the ability of the model across different decision thresholds, and the AUROC is the area under that curve. An AUROC of 1 corresponds to a perfect classifier, while an AUROC of 0.5 corresponds to random guessing. We also report accuracy (Acc.), sensitivity (Sen.), specificity (Spec.), precision (Prec.) and the Matthews correlation coefficient (MCC), which are defined as:

$$Acc. = \frac{TN + TP}{{TN + TP + FN + FP}}$$
(1)
$$Sen. = \frac{TP}{{TP + FN}}$$
(2)
$$Spec. = \frac{TN}{{TN + FP}}$$
(3)
$$Prec. = \frac{TP}{{TP + FP}}$$
(4)
$$MCC = \frac{TP \times TN - FP \times FN}{{\sqrt {\left( {TP + FP} \right)\left( {TP + FN} \right)\left( {TN + FP} \right)\left( {TN + FN} \right)} }}$$
(5)

where TP denotes the number of true positives; FP represents the number of false positives; TN indicates the number of true negatives; FN stands for the number of false negatives.
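As a minimal sketch of Eqs. (1) to (5), the five criteria can be computed directly from the four confusion counts. The function name and the counts below are illustrative assumptions, not values reported in this paper.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return Acc., Sen., Spec., Prec. and MCC (Eqs. 1-5) from confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a marginal count is zero; 0.0 is a common convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"acc": acc, "sen": sen, "spec": spec, "prec": prec, "mcc": mcc}


# Hypothetical counts for one balanced test fold (not taken from the paper).
metrics = classification_metrics(tp=85, fp=20, tn=80, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The sketch assumes every denominator in Eqs. (2) to (4) is non-zero, which holds whenever the test fold contains both classes and the classifier predicts both.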

Leave-one-out cross validation (LOOCV)

In Leave-One-Out Cross Validation (LOOCV), a single sample is held out as the test set each time, and the others are used as the training set to build the model. The v2018 dataset contains 3530 samples in total, so training and testing are repeated 3530 times. Under LOOCV, LDACE obtained a competitive AUROC of 0.9086. The ROC curve and AUROC achieved by the proposed method are shown in Fig. 2.

Fig. 2

The ROC and AUROC achieved by LDACE in LOOCV on v2018 dataset (3530 lncRNA-disease associations)
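The splitting scheme used for k-fold cross validation and LOOCV can be sketched as follows. This is an illustrative sketch only: the function name, the shuffling and the random seed are assumptions, since the paper does not describe how the folds were drawn.

```python
import numpy as np


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) for k mutually exclusive folds."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)      # shuffle once (an assumption)
    folds = np.array_split(order, k)        # k disjoint, near-equal subsets
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx


# Fivefold split of a data set with 3530 samples.
splits = list(k_fold_indices(n_samples=3530, k=5))

# LOOCV is the special case k == n_samples (shown on 10 samples for brevity).
loocv_splits = list(k_fold_indices(n_samples=10, k=10))
```

Each sample appears in exactly one test fold, so every prediction used for the ROC curve comes from a model that did not see that sample during training.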

Fivefold cross validation

Because LOOCV is labor-intensive and time-consuming, which limits its use in real-world experiments, fivefold cross validation was also used to evaluate the proposed model from another perspective. As described above for k-fold cross validation, training and testing are repeated 5 times under this strategy to obtain the final predictive performance. Specifically, LDACE achieved a mean AUROC of 89.94% under fivefold cross validation with a standard deviation of 0.84%. The evaluation metrics Acc., Sen., Spec., Prec. and MCC were 82.52%, 85.04%, 80, 81%, 65.19% and 89.95%, respectively, with standard deviations of 0.61, 2.76, 2.12, 1.19 and 1.33. The high AUROC obtained by LDACE implies that the proposed model, with its various types of biological information, is reliable and effective for discovering potential lncRNA-disease associations. The low standard deviations demonstrate that LDACE is stable and robust. The results of the proposed method are shown in Table 1 and Fig. 3.

Table 1 Various evaluation criteria under fivefold cross validation achieved by LDACE on v2018 dataset
Fig. 3

ROCs and AUROCs achieved by LDACE under fivefold cross validation on the v2018 dataset

Classifiers comparison

To evaluate the performance of ELM on this dataset, we compared it with several commonly used classifiers. The ROC curves and AUROCs under fivefold cross validation are shown in Fig. 4. For fairness, all parameters were set to their default values, and ELM achieved the most competitive results. The weaker performance of the other classifiers can be attributed to the following factors: (1) For Naïve Bayes, the features of the representation vector may not be independent, which degrades classification. (2) For SVM, the training and test samples may not be linearly separable, and the kernel function chosen under default parameters is not optimal. (3) A decision tree easily overfits and ignores the correlation between attributes. ELM, with fewer training parameters, faster speed and a wide range of applications, was therefore chosen to perform the final classification task.

Fig. 4

Classifiers comparison under fivefold cross validation on the v2018 dataset

Compared with previous methods

To further compare our method with existing methods, LDACE was compared with 3 other network-based models: LRLSLDA [41], LRLSLDA-LNCSIM1 [42] and LRLSLDA-LNCSIM2 [42]. These models were implemented on an earlier dataset collected from LncRNADisease in October, 2012, so for the sake of fairness we also trained, validated and tested the proposed framework on the same v2012 dataset. The AUROCs obtained are listed in Table 2. The proposed method improves the AUROC by 0.08, 0.043 and 0.0362 over the three methods, respectively. In addition, machine learning-based models have significant advantages over network-based models when dealing with new samples.

Table 2 The comparison of AUROC between the proposed model and several previous network-based methods in LOOCV on the v2012 dataset

Case study

To evaluate the proposed model more comprehensively in a real-world setting, we applied LDACE to lung adenocarcinoma and endometrial cancer as 2 kinds of case studies. The associations in LncRNADisease were used as the training set to construct the computational model, and 4 other databases, LncRNADisease 2.0 [43], Lnc2Cancer [21], MNDR [20] and CRlncRNA [44], were used to verify the prediction results.

In the first kind of case study, lung adenocarcinoma was selected as the research object. The positive samples are all 1765 associations in the LncRNADisease database. An equal number of negative samples was randomly selected from the unlabeled pairs, as described in the Methods section. The training set, consisting of both positive and negative samples, was fed to the ELM to construct the prediction model. We paired lung adenocarcinoma with all 881 lncRNAs in LncRNADisease to form the test set and sorted the prediction results so that they could be validated conveniently in the other databases. In the end, H19 ranked 4th among the 881 candidates. It has been associated with lung adenocarcinoma by recent research [45], and this association is not included in the LncRNADisease database.

In the second kind of case study, endometrial cancer was selected as the subject in order to test the ability of the proposed model on new-sample problems, that is, prediction for a disease without known associations in the training set. The positive samples are the associations that remain in LncRNADisease after removing the pairs related to endometrial cancer. Given that there are 48 endometrial cancer associated pairs, the number of positive samples is 1717 (1765 − 48). As in case study 1, we randomly drew the same number of negative samples and built the test set in a similar way. After constructing the classifier, we fed the test set to the model for prediction and verified the predictions in the other databases. The validated top 10 lncRNAs are listed in Table 3.

Table 3 Top 10 lncRNAs associated with endometrial cancer which were predicted by LDACE

We carefully analyzed the model construction process and the predicted ranks, and we attribute the result to the following factors. From the model's point of view, because of the assumption that similar lncRNAs are associated with similar diseases and vice versa, lncRNAs and diseases are mainly represented by their known associations. Therefore, nodes with large degrees are more likely to be predicted. On the other hand, lncRNAs with numerous associations may be hot spots. Several isolated nodes such as snhg4 may actually be associated with diseases but have not yet been verified by wet experiments.

Discussion

As a kind of regulatory factor in human cells, lncRNA has been shown to be closely related to many complex diseases. However, because manual experiments are tedious and inefficient, numerous computational methods have been developed to assist in the identification of lncRNA-disease associations. In this paper, we proposed an efficient method to discover potential lncRNA-disease associations. We constructed and integrated multiple types of features, including disease semantic features and disease and lncRNA functional features. CNN was applied to extract low-dimensional abstract information from the integrated features, and ELM was applied to carry out the prediction task. The proposed method achieved competitive performance in the cross-validation, method comparison and case study experiments.

More and more similar methods have been proposed to accelerate experiments and expose the internal connections between lncRNAs and diseases. Most of these methods make use of the inherent properties of biological entities, such as semantic similarity, and of known relationships, such as functional similarity. Some methods also take additional biological entities, such as genes or other ncRNAs, as bridges to assist prediction. The method proposed in this paper has these characteristics to a certain extent but is not complete. In the future, provided that sufficient and reliable data are available, we will build a richer heterogeneous attribute network centered on lncRNAs and diseases to accelerate reasoning and discovery. We hope that the proposed method can not only provide novel insights for similar methods, but also accelerate the research of related experimenters.

Conclusions

In this paper, a computational model called LDACE was proposed based on CNN and ELM to infer potential associations between lncRNAs and diseases. Specifically, the representation vectors of both lncRNAs and diseases are constructed from various biological information, including functional and semantic similarity. After feature extraction and dimension transformation from the original space by the CNN, the low-dimensional dense vectors are fed into the ELM for the prediction task. LDACE obtained an AUROC of 0.9086 in LOOCV and 0.8994 in fivefold cross validation. Moreover, we carried out classifier and method comparison experiments. The results highlight that combining CNN with ELM is an interesting attempt, and that deep learning can significantly improve the ability of the model to identify unknown associations. In addition, 2 kinds of case studies based on lung adenocarcinoma and endometrial cancer demonstrated the effectiveness of LDACE in a practical environment. The competitive results indicate that our method has a prominent ability to mine hidden associations between lncRNAs and diseases. We believe that the tight integration of deep learning with biological data will promote development in both computer science and the life sciences. We hope that our work will not only provide assistance and guidance for manual experiments, but also open up a new way to mine potential information and promote deep understanding of biological data through machine learning.

Methods

lncRNA-disease associations

Known lncRNA-disease associations were collected from the LncRNADisease database in August 2018. The initially downloaded file contained 2947 lncRNA-disease association pairs. After routine preprocessing such as identifier unification and redundancy removal, we obtained the v2018 dataset, containing 1765 independent lncRNA-disease associations between 881 lncRNAs and 328 diseases. We then constructed an adjacency matrix A with 328 rows and 881 columns to store all associations of the v2018 dataset. The element A(i, j) was set to 1 if and only if the ith disease and the jth lncRNA were experimentally validated to be associated, and to 0 otherwise.
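A minimal sketch of how such an adjacency matrix could be assembled from a list of (disease, lncRNA) pairs is given below. The function name and the placeholder identifiers are hypothetical; the real input would be the preprocessed LncRNADisease records.

```python
import numpy as np


def build_adjacency(pairs, diseases, lncrnas):
    """Build the binary disease-by-lncRNA matrix A from association pairs."""
    d_index = {d: i for i, d in enumerate(diseases)}
    r_index = {r: j for j, r in enumerate(lncrnas)}
    A = np.zeros((len(diseases), len(lncrnas)), dtype=np.int8)
    for disease, lncrna in pairs:
        # Duplicate records collapse onto the same cell (redundancy removal).
        A[d_index[disease], r_index[lncrna]] = 1
    return A


# Placeholder identifiers, not real associations.
diseases = ["disease_a", "disease_b"]
lncrnas = ["lnc_1", "lnc_2", "lnc_3"]
pairs = [("disease_a", "lnc_2"), ("disease_b", "lnc_1"), ("disease_b", "lnc_1")]

A = build_adjacency(pairs, diseases, lncrnas)
print(A)
```

With the v2018 dataset the same routine would produce a 328 × 881 matrix containing 1765 ones.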

Randomly selecting negative samples from unlabeled samples is a down-sampling technique commonly used for dataset construction in bioinformatics [46]. Therefore, the same number of negative samples as positive samples was randomly selected, and together they form the whole data set. The data set thus contains 3530 samples: 1765 experimentally validated positive samples and 1765 negative samples.
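The balanced sampling step can be sketched as below, assuming that negatives are drawn uniformly without replacement from the zero entries of A. The function name, the seed and the toy matrix are illustrative assumptions.

```python
import numpy as np


def sample_negatives(A, seed=0):
    """Pair every known association with one randomly chosen unlabeled pair."""
    rng = np.random.default_rng(seed)
    positives = np.argwhere(A == 1)          # (disease, lncRNA) index pairs
    unlabeled = np.argwhere(A == 0)
    chosen = rng.choice(len(unlabeled), size=len(positives), replace=False)
    negatives = unlabeled[chosen]
    pairs = np.vstack([positives, negatives])
    labels = np.concatenate([
        np.ones(len(positives), dtype=int),   # label 1: known association
        np.zeros(len(negatives), dtype=int),  # label 0: sampled unlabeled pair
    ])
    return pairs, labels


# Toy matrix with 3 known associations among 24 possible pairs.
A = np.zeros((4, 6), dtype=int)
A[0, 1] = A[2, 3] = A[3, 5] = 1
pairs, labels = sample_negatives(A)
```

Note that sampled negatives are only unlabeled pairs, so some of them may be true associations that have not yet been verified.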

To compare with existing methods, we also downloaded the earlier lncRNA-disease association set, called the v2012 dataset, used by the first published lncRNA-disease association prediction models [41, 42]. After the same preprocessing as above, we obtained 293 independent associations between 118 lncRNAs and 167 diseases, the same as described in the original paper.

Disease MeSH descriptors

Medical Subject Headings (MeSH) is a standard controlled vocabulary for indexing life-science and medical books and journals. It is roughly divided into 16 categories, including Health Care [N], Publication Characteristics [V] and Geographicals [Z]. We downloaded all MeSH descriptors (headings) from the National Library of Medicine (NLM) in August 2018 to construct and measure the semantic similarity between diseases.

Disease semantic similarity matrix 1

A disease is an abnormal life process that occurs when the body is damaged under certain conditions. How to represent diseases effectively as vectors has long been a difficult task in bioinformatics research. Previous work has shown that MeSH descriptors provide a high-quality way to characterize diseases [47]. The specific calculation steps are shown in Fig. 5. Each disease can be represented as a Directed Acyclic Graph (DAG). For example, the DAG of disease D can be represented as DAG(D) = (D, \({N}_{D}\), \({E}_{D}\)), where \({N}_{D}\) is the node set containing disease D and its ancestor diseases in DAG(D), and \({E}_{D}\) is the edge set containing all links between nodes in DAG(D).

Fig. 5

The process of calculating the semantic similarity between the diseases “lupus erythematosus, systemic” and “acne vulgaris”. (1) Construct the directed acyclic graph of each disease according to the rules; (2) Calculate the contribution of each disease (node) in the directed acyclic graph to “lupus erythematosus, systemic” and “acne vulgaris”: at the lowest level, “lupus erythematosus, systemic” and “acne vulgaris” each contribute 1 to themselves, and the contribution of parent nodes decays layer by layer, so “autoimmune diseases”, “connective tissue diseases”, etc. contribute 0.5, and so on; (3) Calculate the total contribution of the directed acyclic graphs of “lupus erythematosus, systemic” and “acne vulgaris”: \(DV\left( {{\text{lupus erythematosus}},{\text{ systemic}}} \right) = 1 + 2 \times 0.5 + 2 \times 0.25 = 2.5\), \(DV\left( {\text{acne vulgaris}} \right) = 1 + 2 \times 0.5 + 1 \times 0.25 + 1 \times 0.125 = 2.375\). \(Similarity\left( {{\text{lupus erythematosus}},{\text{systemic}},{\text{acne vulgaris}}} \right) = \frac{0.25 + 0.125}{{2.5 + 2.375}} = 0.0769\)

Inspired by the Jaccard formula, the similarity is calculated by dividing a measure of the intersection of two node sets by a measure of their union. The contribution of a node t in DAG(D) to the semantics of disease D is defined as follows:

$$D1_{D} \left( t \right) = \left\{ {\begin{array}{*{20}l} 1 \hfill & {{\text{if }} t = D} \hfill \\ {\max \left\{ {\Delta \cdot D1_{D} \left( {t^{\prime}} \right) \mid t^{\prime} \in {\text{children of }} t} \right\}} \hfill & {{\text{if }} t \ne D} \hfill \\ \end{array} } \right.$$
(6)

where \(\Delta\) is the semantic contribution decay factor and \(t\) is a node in DAG(D). \(\Delta\) can range from 0 to 1, and it is set to 0.5 according to previous literature [42]. In DAG(D), disease D contributes the most to itself, and the further an ancestor disease is from D, the smaller its contribution to D. The semantic value of disease D is then defined as the sum of the contributions of all nodes in DAG(D). DV1(D) can be calculated as:

$$DV1\left( D \right) = \Sigma_{{t \in N_{D} }} D1_{D} \left( t \right)$$
(7)

The semantic similarity of disease i and disease j can be defined as follows:

$$DS1\left( {i,j} \right) = \frac{{\mathop \sum \nolimits_{{t \in N_{i} \cap N_{j} }} \left( {D1_{i} \left( t \right) + D1_{j} \left( t \right)} \right)}}{DV1\left( i \right) + DV1\left( j \right)}$$
(8)
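Eqs. (6) to (8) can be sketched in a few lines, assuming the MeSH hierarchy is available as a mapping from each term to its direct parents. The function names and the toy hierarchy below are hypothetical and are not real MeSH terms.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """D1_D(t) for every node t in DAG(D), following Eq. (6)."""
    contrib = {disease: 1.0}
    stack = [disease]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            value = delta * contrib[node]
            # Keep the largest contribution over all children of the parent.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                stack.append(parent)
    return contrib


def ds1(d_i, d_j, parents, delta=0.5):
    """Semantic similarity DS1(i, j), following Eqs. (7) and (8)."""
    c_i = semantic_contributions(d_i, parents, delta)
    c_j = semantic_contributions(d_j, parents, delta)
    shared = c_i.keys() & c_j.keys()
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    return numerator / (sum(c_i.values()) + sum(c_j.values()))


# Toy hierarchy: x -> a -> root and y -> b -> root.
parents = {"a": ["root"], "b": ["root"], "x": ["a"], "y": ["b"]}
print(semantic_contributions("x", parents))
print(ds1("x", "y", parents))
```

In this toy hierarchy the two diseases share only the root, so the similarity is (0.25 + 0.25) / (1.75 + 1.75), and a disease compared with itself scores 1.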

Disease semantic similarity matrix 2

In disease semantic similarity matrix 1, the algorithm focuses only on a single disease from a local view and does not consider the differences between diseases from a global perspective. Some scholars argue that the contribution of a disease term should depend on how frequently it appears across the whole of MeSH. Drawing on information theory, they proposed a new contribution measure that achieved a certain degree of improvement [42]. The new contribution of disease t to disease D is calculated as follows:

$$D2_{D} \left( t \right) = - \log \left( {\frac{{{\text{the number of DAGs including }} t}}{{\text{the number of diseases}}}} \right)$$
(9)

Then the semantic value of disease D, DV2(D), is obtained as:

$$DV2\left( {\text{D}} \right) = \Sigma_{{t \in N_{D} }} D2_{D} \left( t \right)$$
(10)

The semantic similarity of disease i and disease j can be defined as follows:

$$DS2\left( {i,j} \right) = \frac{{\mathop \sum \nolimits_{{t \in N_{i} \cap N_{j} }} \left( {D2_{i} \left( t \right) + D2_{j} \left( t \right)} \right)}}{DV2\left( i \right) + DV2\left( j \right)}$$
(11)
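A sketch of Eqs. (9) to (11) follows, again assuming a term-to-parents mapping. The natural logarithm is an assumption (the base is not stated, and it cancels in the ratio of Eq. 11); all names and the toy hierarchy are hypothetical.

```python
import math


def dag_nodes(disease, parents):
    """Return the disease together with all of its ancestors."""
    seen = {disease}
    stack = [disease]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def term_contributions(diseases, parents):
    """D2(t) = -log(number of DAGs including t / number of diseases), Eq. (9)."""
    dags = {d: dag_nodes(d, parents) for d in diseases}
    counts = {}
    for nodes in dags.values():
        for t in nodes:
            counts[t] = counts.get(t, 0) + 1
    d2 = {t: -math.log(c / len(diseases)) for t, c in counts.items()}
    return dags, d2


def ds2(d_i, d_j, dags, d2):
    """Semantic similarity DS2(i, j), following Eqs. (10) and (11)."""
    shared = dags[d_i] & dags[d_j]
    # D2 does not depend on the query disease, so D2_i(t) + D2_j(t) = 2 * D2(t).
    numerator = sum(2.0 * d2[t] for t in shared)
    denominator = sum(d2[t] for t in dags[d_i]) + sum(d2[t] for t in dags[d_j])
    return numerator / denominator if denominator else 0.0


# Toy hierarchy: x -> a -> root and y -> b -> root.
parents = {"a": ["root"], "b": ["root"], "x": ["a"], "y": ["b"]}
diseases = ["root", "a", "b", "x", "y"]
dags, d2 = term_contributions(diseases, parents)
print(ds2("x", "a", dags, d2))
```

A term that appears in every DAG, such as the root here, contributes zero, so two diseases that share only very general ancestors receive a similarity close to 0 under this measure.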

Disease Gaussian interaction profile kernel similarity matrix

The matrix A contains all associations of the v2018 dataset. Disease i can be represented as a functional vector \({\text{d}}_{i}\) of 881 dimensions, namely the ith row of matrix A. The value of each dimension of \({\text{d}}_{i}\) is determined by whether the disease has been associated with the corresponding lncRNA: if and only if the ith disease has been shown by wet experiment to be associated with the jth lncRNA, the jth dimension of the vector is 1; otherwise it is 0.

This vector can be treated as a functional representation of the disease, and we transform it with the Gaussian interaction profile kernel function to make it more suitable for downstream classification tasks. The similarity between disease i and disease j is defined as follows:

$$DG\left( {i,j} \right) = \exp \left( { - \alpha_{d} \left\| {{\text{d}}_{i} - {\text{d}}_{j} } \right\|^{2} } \right)$$
(12)

The hyperparameter \(\alpha_{d}\), which controls the kernel bandwidth, is defined as follows:

$$\alpha_{d} = \alpha_{d}^{\prime} \Big/ \left( {\frac{1}{nd}\mathop \sum \limits_{i = 1}^{nd} \left\| {{\text{d}}_{i} } \right\|^{2} } \right)$$
(13)

Here we set \(\alpha_{d}^{\prime}\) = 0.5, and nd is set to 328, which equals the number of diseases. Finally, the disease Gaussian interaction profile kernel similarity matrix DG is a square matrix with 328 rows and columns.
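The kernel computation can be sketched with NumPy as below. The sketch assumes the usual convention for this kernel, in which the bandwidth is normalised by dividing the primed parameter by the mean squared norm of the interaction profiles; the function name and the toy matrix are illustrative.

```python
import numpy as np


def gip_kernel(profiles, alpha_prime=0.5):
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = np.sum(profiles ** 2, axis=1)
    # Bandwidth normalised by the mean squared profile norm (assumed convention).
    alpha = alpha_prime / sq_norms.mean()
    # ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u.v, computed for all pairs at once.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-alpha * np.maximum(sq_dists, 0.0))


# Toy adjacency matrix: 3 diseases (rows) by 3 lncRNAs (columns).
A = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])

DG = gip_kernel(A)     # disease-by-disease similarity (rows of A)
RG = gip_kernel(A.T)   # lncRNA-by-lncRNA similarity (columns of A)
print(DG)
```

Diseases with identical profiles (rows 0 and 2 in the toy matrix) receive a similarity of exactly 1, and the value decays as the profiles diverge. The sketch assumes at least one association is present so that the mean squared norm is non-zero.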

Disease integrated similarity matrix

To integrate all biological information, the element of the final disease similarity matrix DS (i, j) can be defined as follows:

$$DS\left( {i,j} \right) = \left\{ {\begin{array}{*{20}l} {\frac{{DS1\left( {i,j} \right) + DS2\left( {i,j} \right)}}{2}} \hfill & {{\text{if }} i {\text{ and }} j {\text{ have semantic similarity}}} \hfill \\ {DG\left( {i,j} \right)} \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right.$$
(14)
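Eq. (14) amounts to an element-wise selection between two matrices. In the sketch below, `has_semantic` is a hypothetical boolean mask marking the disease pairs for which a semantic similarity is available; how that mask is derived is not specified in the text, and the toy values are made up.

```python
import numpy as np


def integrate_disease_similarity(DS1, DS2, DG, has_semantic):
    """Eq. (14): mean of DS1 and DS2 where available, DG elsewhere."""
    semantic_mean = (DS1 + DS2) / 2.0
    return np.where(has_semantic, semantic_mean, DG)


# Toy 2 x 2 matrices (illustrative values only).
DS1 = np.array([[1.0, 0.4], [0.4, 1.0]])
DS2 = np.array([[1.0, 0.2], [0.2, 1.0]])
DG = np.array([[1.0, 0.9], [0.9, 1.0]])

no_offdiag_semantic = np.array([[True, False], [False, True]])
DS_fallback = integrate_disease_similarity(DS1, DS2, DG, no_offdiag_semantic)

all_semantic = np.ones((2, 2), dtype=bool)
DS_semantic = integrate_disease_similarity(DS1, DS2, DG, all_semantic)
```

In the first call the off-diagonal entries fall back to the Gaussian kernel value 0.9; in the second they are the semantic mean (0.4 + 0.2) / 2 = 0.3.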

LncRNA Gaussian interaction profile kernel similarity matrix

Similarly to diseases, the function of each lncRNA can be represented by the corresponding column of matrix A. The Gaussian interaction profile kernel similarity between lncRNA i and lncRNA j is calculated as follows:

$$RS\left( {i,j} \right) = RG\left( {i,j} \right) = \exp \left( { - \alpha_{r} \left\| {r_{i} - r_{j} } \right\|^{2} } \right)$$
(15)

Given that there is no other information about lncRNAs, we directly regard RG as the lncRNA similarity matrix (RS). The parameter \(\alpha_{r}\) is defined as follows:

$$\alpha_{r} = \alpha_{r}^{\prime} \Big/ \left( {\frac{1}{nr}\mathop \sum \limits_{i = 1}^{nr} \left\| {r_{i} } \right\|^{2} } \right)$$
(16)

Here, we set \(\alpha_{r}^{\prime}\) = 0.5, and nr is set to 881, which equals the number of lncRNAs. Finally, the lncRNA similarity matrix RS with 881 rows and 881 columns is constructed.

The representation of the association pair

Through the above operations, each lncRNA and each disease can be represented as a vector that integrates various biological information. The ith disease is represented by the ith row of the matrix DS:

$$DS_{i,*} = \left( {DS_{i,1} ,DS_{i,2} , \ldots ,DS_{i,328} } \right)$$
(17)

The jth lncRNA is represented by the jth row of the matrix RS:

$$RS_{j,*} = \left( {RS_{j,1} ,RS_{j,2} , \ldots ,RS_{j,881} } \right)$$
(18)

The association pair of the ith disease and the jth lncRNA is represented by concatenating the two vectors:

$$AssociationPair_{i,j} = \left( {DS_{i,*} ,RS_{j,*} } \right) = \left( {DS_{i,1} ,DS_{i,2} , \ldots ,DS_{i,328} ,RS_{j,1} ,RS_{j,2} , \ldots ,RS_{j,881} } \right)$$
(19)

This yields 3530 vectors of 1209 (328 + 881) dimensions. Each positive sample is given the label 1 and each negative sample is given the label 0.
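The concatenation of a disease row and an lncRNA row can be sketched as follows. Random matrices stand in for the real similarity matrices, and the function name and index pairs are hypothetical.

```python
import numpy as np


def pair_features(pairs, DS, RS):
    """Concatenate the disease row of DS and the lncRNA row of RS for each pair."""
    pairs = np.asarray(pairs)
    return np.hstack([DS[pairs[:, 0]], RS[pairs[:, 1]]])


rng = np.random.default_rng(0)
DS = rng.random((328, 328))   # stand-in for the disease similarity matrix
RS = rng.random((881, 881))   # stand-in for the lncRNA similarity matrix

# (disease index, lncRNA index) pairs; illustrative only.
pairs = [(0, 5), (10, 880), (327, 0)]
X = pair_features(pairs, DS, RS)
print(X.shape)
```

Each row of `X` has 328 disease dimensions followed by 881 lncRNA dimensions, which gives the 1209-dimensional representation.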

Convolutional neural networks (CNN)

Because the constructed representation vectors are high-dimensional and sparse, we extract effective features with a Convolutional Neural Network (CNN) [48,49,50]. Compared with other machine learning methods, CNN has unique advantages in feature capture and model capacity [51]. In this paper, we therefore use a CNN to carry out the feature extraction task [52, 53].

A convolutional neural network is a multi-layer neural network consisting of an input layer, convolution layers, pooling layers, a fully connected layer and an output layer [54, 55]. The key components of a CNN are the convolution and pooling layers, which extract features and pass them to the fully connected layer for classification [56]. The weights of the convolution windows are adjusted according to the feedback from the output [57]. The convolution layer extracts both local and global features with different filters, as illustrated in Fig. 6.

Fig. 6

The convolution and pooling of CNN
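To illustrate the two operations, the sketch below applies a one-dimensional convolution with a ReLU activation followed by max pooling to a single 1209-dimensional vector. It is not the architecture used in this paper, which is not specified here: the number of filters, the window widths, the activation and the random filter weights are all assumptions, and in practice the weights would be learned by back-propagation.

```python
import numpy as np


def conv1d_relu(x, kernels, bias):
    """Valid 1-D convolution (cross-correlation) of x with each filter, then ReLU."""
    width = kernels.shape[1]
    # windows has shape (len(x) - width + 1, width): one row per window position.
    windows = np.lib.stride_tricks.sliding_window_view(x, width)
    return np.maximum(kernels @ windows.T + bias[:, None], 0.0)


def max_pool1d(feature_maps, size):
    """Non-overlapping max pooling along the position axis."""
    n_filters, length = feature_maps.shape
    usable = (length // size) * size          # drop the incomplete last window
    return feature_maps[:, :usable].reshape(n_filters, -1, size).max(axis=2)


rng = np.random.default_rng(0)
x = rng.random(1209)                          # one association-pair vector
kernels = rng.normal(size=(4, 5))             # 4 hypothetical filters of width 5
bias = np.zeros(4)

feature_maps = conv1d_relu(x, kernels, bias)  # shape (4, 1205)
pooled = max_pool1d(feature_maps, size=4)     # shape (4, 301)
features = pooled.ravel()                     # flattened, lower-dimensional vector
```

Each filter responds to a local pattern in the vector, and pooling keeps the strongest response in each neighbourhood, which shortens the representation while retaining the most salient features.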

ELM

GB Huang et al. [58] proposed the Extreme Learning Machine (ELM), a learning algorithm for single-hidden-layer feedforward neural networks. In traditional artificial neural networks, determining the parameters with the back-propagation algorithm consumes a lot of resources and time [59]. ELM avoids these iterative steps: it has only one hidden layer, and when the classifier is trained, the number of hidden-layer neurons is the only hyperparameter that has to be set. The main steps of ELM are shown in Fig. 7.

Fig. 7

Structure of Extreme Learning Machine (ELM). The connection weight of the input layer and the hidden layer, and the threshold of the hidden layer can be randomly set. The connection weights between the hidden layer and the output layer do not need to be adjusted iteratively, but are determined at once by solving equations

ELM is a single-hidden-layer feedforward network with random hidden nodes and activation function f(x). Consider N arbitrary distinct samples \(\left( {x_{i} ,l_{i} } \right)\), where \(x_{i} = \left[ {x_{i1} ,x_{i2} , \ldots ,x_{in} } \right]^{T} \in R^{n}\) and \(l_{i} = \left[ {l_{i1} ,l_{i2} , \ldots ,l_{im} } \right]^{T} \in R^{m}\). The output of ELM is then represented as follows:

$$\mathop \sum \limits_{i = 1}^{{N^{\prime}}} q_{i} f\left( {p_{i} \cdot x_{j} + t_{i} } \right) = O_{j} , j = 1, \ldots ,N$$
(20)

where \(N^{\prime}\) is the number of hidden nodes, \(p_{i} = \left[ {p_{i1} ,p_{i2} , \ldots ,p_{in} } \right]^{T}\) is the weight vector from the input layer nodes to the ith hidden layer node, \(q_{i} = \left[ {q_{i1} ,q_{i2} , \ldots ,q_{im} } \right]^{T}\) is the weight vector from the ith hidden node to the output layer, \(t_{i}\) is the threshold of the ith hidden node, and \(p_{i} \cdot x_{j}\) is the inner product of \(p_{i}\) and \(x_{j}\).

The loss function is defined as follows:

$$\mathop \sum \limits_{j = 1}^{N} \left\| {O_{j} - l_{j} } \right\|$$
(21)

To minimize the error between the network output and the labels, we need to determine the three parameters \(p_{i}\), \(q_{i}\) and \(t_{i}\) such that:

$$\mathop \sum \limits_{i = 1}^{{N^{\prime}}} q_{i} f\left( {p_{i} \cdot x_{j} + t_{i} } \right) = l_{j} , j = 1, \ldots ,N$$
(22)

Eq. (22) can be written compactly as \(Hq = l\), where

$$\begin{aligned} & H\left( {p_{1} , \ldots ,p_{{N^{\prime}}} ;t_{1} , \ldots ,t_{{N^{\prime}}} ;x_{1} , \ldots ,x_{N} } \right) \\ & \quad = \left[ {\begin{array}{*{20}c} {f\left( {p_{1} \cdot x_{1} + t_{1} } \right)} & \cdots & {f\left( {p_{{N^{\prime}}} \cdot x_{1} + t_{{N^{\prime}}} } \right)} \\ \vdots & \ddots & \vdots \\ {f\left( {p_{1} \cdot x_{N} + t_{1} } \right)} & \cdots & {f\left( {p_{{N^{\prime}}} \cdot x_{N} + t_{{N^{\prime}}} } \right)} \\ \end{array} } \right]_{{N \times N^{\prime}}} , q = \left[ {\begin{array}{*{20}c} {q_{1}^{T} } \\ \vdots \\ {q_{{N^{\prime}}}^{T} } \\ \end{array} } \right]_{{N^{\prime} \times m}} , l = \left[ {\begin{array}{*{20}c} {l_{1}^{T} } \\ \vdots \\ {l_{N}^{T} } \\ \end{array} } \right]_{N \times m} \\ \end{aligned}$$
(23)

Therefore, in order to train the ELM, we need to find the appropriate parameters \(\hat{p}_{i}\), \(\hat{t}_{i}\) and \(\hat{q}\) such that

$$\left\| {H\left( {\hat{p}_{i} ,\hat{t}_{i} } \right)\hat{q} - l} \right\| = \mathop {\min }\limits_{p,q,t} \left\| {H\left( {p_{i} ,t_{i} } \right)q - l} \right\|, i = 1,2, \ldots ,N^{\prime}$$
(24)

This is equivalent to minimizing the following loss function:

$$E = \mathop \sum \limits_{j = 1}^{N} \left\| {\mathop \sum \limits_{i = 1}^{{N^{\prime}}} q_{i} f\left( {p_{i} \cdot x_{j} + t_{i} } \right) - l_{j} } \right\|^{2}$$
(25)
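
As a brief worked step, and assuming the standard ELM formulation of Huang et al. [58] rather than anything specific to this study: because \(p_{i}\) and \(t_{i}\) are fixed at random, \(H\) is a constant matrix, so minimizing \(E\) in Eq. (25) is a linear least-squares problem in \(q\) alone. The symbol \(H^{\dagger}\) below is introduced here for illustration and denotes the Moore-Penrose generalized inverse of \(H\); the minimum-norm least-squares solution is then

$$\hat{q} = H^{\dagger} l$$

This is the sense in which the output weights are obtained in a single step instead of by iterative tuning.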

ELM combines high learning efficiency with strong generalization ability and is widely used to solve both academic and industrial problems. Here, all hyperparameters are set to default values.
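
The training procedure described in this section can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the implementation used in this study: the sigmoid activation, the uniform range of the random weights, the synthetic data, and the function names `elm_train` and `elm_predict` are all hypothetical choices, and the output weights are obtained with the pseudoinverse as in the standard ELM formulation [58].

```python
import numpy as np


def sigmoid(z):
    """Logistic activation, an assumed choice for f(x)."""
    return 1.0 / (1.0 + np.exp(-z))


def elm_train(X, L, n_hidden, seed=0):
    """Fit a single-hidden-layer ELM (illustrative sketch).

    X: (N, n) inputs, L: (N, m) targets, n_hidden: number of hidden nodes N'.
    Returns the random input weights P, random thresholds t, and the
    least-squares output weights q.
    """
    rng = np.random.default_rng(seed)
    # Input weights and thresholds are drawn once and never updated.
    P = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # column i is p_i
    t = rng.uniform(-1.0, 1.0, size=n_hidden)                # thresholds t_i
    H = sigmoid(X @ P + t)                                   # (N, N') hidden-layer output matrix
    # Output weights solve H q = l in the least-squares sense, in one step.
    q = np.linalg.pinv(H) @ L
    return P, t, q


def elm_predict(X, P, t, q):
    """Compute network outputs O_j for each row of X."""
    return sigmoid(X @ P + t) @ q


# Toy usage on synthetic data (not the lncRNA-disease dataset).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))
L_demo = (X_demo[:, :1] * X_demo[:, 1:2] > 0).astype(float)
P, t, q = elm_train(X_demo, L_demo, n_hidden=50)
scores = elm_predict(X_demo, P, t, q)
print("training squared error:", float(np.sum((scores - L_demo) ** 2)))
```

The only value the caller has to choose is `n_hidden`, which mirrors the statement above that the number of hidden-layer nodes is the single hyperparameter to set.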

Availability of data and materials

The LncRNA-disease association dataset can be downloaded from http://www.cuilab.cn/. Disease MeSH descriptors can be downloaded from ftp://nlmpubs.nlm.nih.gov/online/mesh/. The code used or analyzed during this study is available from the corresponding author on reasonable request.

Abbreviations

LncRNA:

Long non-coding RNA

ELM:

Extreme Learning Machine

CNN:

Convolutional Neural Network

AUROC:

Area Under Receiver Operating Characteristic Curve

LOOCV:

Leave-one-out cross-validation

DAG:

Directed Acyclic Graph

References

  1. Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, Tanzer A, Lagarde J, Lin W, Schlesinger F. Landscape of transcription in human cells. Nature. 2012;489(7414):101.

  2. Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M, Weissman S. Global identification of human transcribed sequences with genome tiling arrays. Science. 2004;306(5705):2242–6.

  3. You Z-H, Lei Y-K, Gui J, Huang D-S, Zhou X. Using manifold embedding for assessing and predicting protein interactions from high-throughput experimental data. Bioinformatics. 2010;26(21):2744–51.

  4. Yi H-C, You Z-H, Huang D-S, Kwoh CK. Graph representation learning in bioinformatics: trends, methods and applications. Brief Bioinform. 2021;23(1):bbab340.

  5. Zhang Q, Wang S, Chen Z, He Y, Liu Q, Huang D-S. Locating transcription factor binding sites by fully convolutional neural network. Brief Bioinform. 2021;22(5):bbaa435.

  6. Wang L, You Z-H, Huang D-S, Li J-Q. MGRCDA: metagraph recommendation method for predicting CircRNA-disease association. IEEE Trans Cybern. 2021.

  7. Flynn RA, Chang HY. Long noncoding RNAs in cell-fate programming and reprogramming. Cell Stem Cell. 2014;14(6):752–61.

  8. Johnson R. Long non-coding RNAs in Huntington’s disease neurodegeneration. Neurobiol Dis. 2012;46(2):245–54.

  9. Qiu M-T, Hu J-W, Yin R, Xu L. Long noncoding RNA: an emerging paradigm of cancer research. Tumor Biol. 2013;34(2):613–20.

  10. Chen X, Sun Y-Z, Guan N-N, Qu J, Huang Z-A, Zhu Z-X, Li J-Q. Computational models for lncRNA function prediction and functional similarity calculation. Brief Funct Genom. 2019;18(1):58–82.

  11. Chen X, Yan CC, Zhang X, You Z-H. Long non-coding RNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2017;18(4):558–76.

  12. He Y, Shen Z, Zhang Q, Wang S, Huang D-S. A survey on deep learning in DNA/RNA motif mining. Brief Bioinform. 2021;22(4):bbaa229.

  13. Gao S, Zhou M, Wang Y, Cheng J, Yachi H, Wang J. Dendritic neuron model with effective learning algorithms for classification, approximation, and prediction. IEEE Trans Neural Netw Learn Syst. 2018;30(2):601–14.

  14. Liu T, Tian B, Ai Y, Zou Y, Wang F-Y. Parallel reinforcement learning-based energy efficiency improvement for a cyber-physical system. IEEE/CAA J Autom Sin. 2019;7(2):617–26.

  15. Huang D-S, Du J-X. A constructive hybrid structure optimization methodology for radial basis probabilistic neural networks. IEEE Trans Neural Netw. 2008;19(12):2099–115.

  16. Wang X-F, Huang D-S. A novel density-based clustering framework by using level set method. IEEE Trans Knowl Data Eng. 2009;21(11):1515–31.

  17. Wang X-F, Huang D-S, Du J-X, Xu H, Heutte L. Classification of plant leaf images with complicated background. Appl Math Comput. 2008;205(2):916–26.

  18. Chen X, Xie D, Zhao Q, You Z-H. MicroRNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2019;20(2):515–39.

  19. Chen X, Yan CC, Zhang X, Zhang X, Dai F, Yin J, Zhang Y. Drug–target interaction prediction: databases, web servers and computational models. Brief Bioinform. 2016;17(4):696–712.

  20. Cui T, Zhang L, Huang Y, Yi Y, Tan P, Zhao Y, Hu Y, Xu L, Li E, Wang D. MNDR v2.0: an updated resource of ncRNA–disease associations in mammals. Nucleic Acids Res. 2017;46(D1):D371–4.

  21. Ning S, Zhang J, Wang P, Zhi H, Wang J, Liu Y, Gao Y, Guo M, Yue M, Wang L. Lnc2Cancer: a manually curated database of experimentally supported lncRNAs associated with various human cancers. Nucleic Acids Res. 2015;44(D1):D980–5.

  22. Liu C, Bai B, Skogerbø G, Cai L, Deng W, Zhang Y, Bu D, Zhao Y, Chen R. NONCODE: an integrated knowledge database of non-coding RNAs. Nucleic Acids Res. 2005;33(suppl_1):D112–5.

  23. Wishart DS, Feunang YD, Guo AC, Lo EJ, Marcu A, Grant JR, Sajed T, Johnson D, Li C, Sayeeda Z. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2017;46(D1):D1074–82.

  24. Lu C, Yang M, Luo F, Wu F-X, Li M, Pan Y, Li Y, Wang J. Prediction of lncRNA-disease associations based on inductive matrix completion. Bioinformatics. 2018;1:8.

  25. Chen X, Wang L, Qu J, Guan N-N, Li J-Q. Predicting miRNA–disease association based on inductive matrix completion. Bioinformatics. 2018;34(24):4256–65.

  26. Chen X, Liu M-X, Yan G-Y. Drug–target interaction prediction by random walk on the heterogeneous network. Mol BioSyst. 2012;8(7):1970–8.

  27. Zhou M, Wang X, Li J, Hao D, Wang Z, Shi H, Han L, Zhou H, Sun J. Prioritizing candidate disease-related long non-coding RNAs by walking on the heterogeneous lncRNA and disease network. Mol BioSyst. 2015;11(3):760–9.

  28. Guo Z-H, You Z-H, Wang Y-B, Yi H-C, Chen Z-H. A learning-based method for LncRNA-disease association identification combing similarity information and rotation forest. iScience. 2019;19:786–95.

  29. Zhao Y, Huang D-S, Jia W. Completed local binary count for rotation invariant texture classification. IEEE Trans Image Process. 2012;21(10):4492–7.

  30. Luo X, Wu H, Yuan H, Zhou M. Temporal pattern-aware QoS prediction via biased non-negative latent factorization of tensors. IEEE Trans Cybern. 2019;50(5):1798–809.

  31. Luo X, Zhou M, Li S, Hu L, Shang M. Non-negativity constrained missing data estimation for high-dimensional and sparse matrices from industrial applications. IEEE Trans Cybern. 2019;50(5):1844–55.

  32. Luo X, Zhou M, Li S, Shang M. An inherently nonnegative latent factor model for high-dimensional and sparse matrices from industrial applications. IEEE Trans Ind Inf. 2017;14(5):2011–22.

  33. Huang D-S, Jia W, Zhang D. Palmprint verification based on principal lines. Pattern Recognit. 2008;41(4):1316–28.

  34. Wang X-F, Huang D-S, Xu H. An efficient local Chan-Vese model for image segmentation. Pattern Recognit. 2010;43(3):603–18.

  35. Lu C-Y, Min H, Zhao Z-Q, Zhu L, Huang D-S, Yan S. Robust and efficient subspace segmentation via least squares regression. In: European conference on computer vision: 2012. Springer. p. 347–360.

  36. Jia W, Huang D-S, Zhang D. Palmprint verification based on robust line orientation code. Pattern Recognit. 2008;41(5):1504–13.

  37. Chen X, Sun Y-Z, Zhang D-H, Li J-Q, Yan G-Y, An J-Y, You Z-H. NRDTD: a database for clinically or experimentally supported non-coding RNAs and drug targets associations. Database. 2017;2017.

  38. Sun Y-Z, Zhang D-H, Ming Z, Li J-Q, Chen X. DLREFD: a database providing associations of long non-coding RNAs, environmental factors and phenotypes. Database. 2017;2017.

  39. Liu M-X, Chen X, Chen G, Cui Q-H, Yan G-Y. A computational framework to infer human disease-associated long noncoding RNAs. PLoS ONE. 2014;9(1):e84408.

  40. Chen G, Wang Z, Wang D, Qiu C, Liu M, Chen X, Zhang Q, Yan G, Cui Q. LncRNADisease: a database for long-non-coding RNA-associated diseases. Nucleic Acids Res. 2012;41(D1):D983–6.

  41. Chen X, Yan G-Y. Novel human lncRNA-disease association inference based on lncRNA expression profiles. Bioinformatics. 2013;29(20):2617–24.

  42. Chen X, Yan CC, Luo C, Ji W, Zhang Y, Dai Q. Constructing lncRNA functional similarity network based on lncRNA-disease associations and disease semantic similarity. Sci Rep. 2015;5:11338.

  43. Bao Z, Yang Z, Huang Z, Zhou Y, Cui Q, Dong D. LncRNADisease 2.0: an updated database of long non-coding RNA-associated diseases. Nucleic Acids Res. 2018;47(D1):D1034–7.

  44. Wang J, Zhang X, Chen W, Li J, Liu C. CRlncRNA: a manually curated database of cancer-related long non-coding RNAs with experimental proof of functions on clinicopathological and molecular features. BMC Med Genom. 2018;11(6):114.

  45. Wang P, Lu S, Mao H, Bai Y, Ma T, Cheng Z, Zhang H, Jin Q, Zhao J, Mao H. Identification of biomarkers for the detection of early stage lung adenocarcinoma by microarray profiling of long noncoding RNAs. Lung Cancer. 2015;88(2):147–53.

  46. Ben-Hur A, Noble WS. Kernel methods for predicting protein–protein interactions. Bioinformatics. 2005;21(suppl_1):i38–46.

  47. Wang D, Wang J, Lu M, Song F, Cui Q. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics. 2010;26(13):1644–50.

  48. Li B, Zheng C-H, Huang D-S. Locally linear discriminant embedding: an efficient method for face recognition. Pattern Recognit. 2008;41(12):3813–21.

  49. Zheng C-H, Huang D-S, Zhang L, Kong X-Z. Tumor clustering using nonnegative matrix factorization with gene selection. IEEE Trans Inf Technol Biomed. 2009;13(4):599–607.

  50. Wu Y, Zhang K, Wu D, Wang C, Yuan C-A, Qin X, Zhu T, Du Y-C, Wang H-L, Huang D-S. Person re-identification by multi-scale feature representation learning with random batch feature mask. IEEE Trans Cogn Dev Syst. 2020;13(4):865–74.

  51. Wu D, Wang C, Wu Y, Wang Q-C, Huang D-S. Attention deep model with multi-scale deep supervision for person re-identification. IEEE Trans Emerg Top Comput Intell. 2021;5(1):70–8.

  52. Hu R, Jia W, Ling H, Huang D. Multiscale distance matrix for fast plant leaf recognition. IEEE Trans Image Process. 2012;21(11):4667–72.

  53. Zhang Q, Wang D, Han K, Huang D-S. Predicting TF-DNA binding motifs from ChIP-seq datasets using the bag-based classifier combined with a multi-fold learning scheme. IEEE/ACM Trans Comput Biol Bioinform. 2020;18(5):1743–51.

  54. Zhang Q, Yu W, Han K, Nandi AK, Huang D-S. Multi-scale capsule network for predicting DNA-protein binding sites. IEEE/ACM Trans Comput Biol Bioinform. 2020;18(5):1793–800.

  55. Peng C, Zheng Y, Huang D-S. Capsule network based modeling of multi-omics data for discovery of breast cancer-related genes. IEEE/ACM Trans Comput Biol Bioinf. 2019;17(5):1605–12.

  56. Liu B, Yang F, Huang D-S, Chou K-C. iPromoter-2L: a two-layer predictor for identifying promoters and their types by multi-window-based PseKNC. Bioinformatics. 2018;34(1):33–40.

  57. Shen Z, Zhang Q, Han K, Huang D-S. A deep learning model for RNA-protein binding preference prediction based on hierarchical LSTM and attention network. IEEE/ACM Trans Comput Biol Bioinform. 2020.

  58. Huang G-B, Zhu Q-Y, Siew C-K. Extreme learning machine: theory and applications. Neurocomputing. 2006;70(1–3):489–501.

  59. Li B, Fan Z-T, Zhang X-L, Huang D-S. Robust dimensionality reduction via feature space to feature space distance metric learning. Neural Netw. 2019;112:1–14.

Acknowledgements

The authors would like to thank all the editors and anonymous reviewers for their constructive advice.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 22 Supplement 5 2021: Proceedings of the International Conference on Biomedical Engineering Innovation (ICBEI) 2019-2020. The full contents of the supplement are available at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-22-supplement-5

Funding

This work was supported by the National Key R&D Program of China (Nos. 2018AAA0100100 and 2018YFA0902600), partly supported by the National Natural Science Foundation of China (Grant Nos. 61732012, 61772370, 61932008, 61772357, 62002297, 62002266, and 62073231), and supported by the “BAGUI Scholar” Program and the Scientific and Technological Base and Talent Special Program, GuiKe AD18126015, of the Guangxi Zhuang Autonomous Region of China. The funding was used to develop, implement, and evaluate the proposed algorithms. The funding body did not play any role in the design and implementation of the algorithms or in writing the manuscript.

Author information

Contributions

Z.-H. G. and Z.-H. C. considered the algorithm, arranged the datasets, and performed the analyses. Z.-H. Y., Y.-B. W., H.-C. Y. and M.-N. W. wrote the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Zhan-Heng Chen or Yan-Bin Wang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Guo, ZH., Chen, ZH., You, ZH. et al. A learning-based method to predict LncRNA-disease associations by combining CNN and ELM. BMC Bioinformatics 22, 622 (2021). https://doi.org/10.1186/s12859-022-04611-3

  • DOI: https://doi.org/10.1186/s12859-022-04611-3

Keywords

  • CNN
  • ELM
  • lncRNA
  • Disease
  • Association prediction