 Methodology article
 Open Access
A novel computational model for predicting potential LncRNA-disease associations based on both direct and indirect features of LncRNA-disease pairs
BMC Bioinformatics volume 21, Article number: 555 (2020)
Abstract
Background
Accumulating evidence has demonstrated that long noncoding RNAs (lncRNAs) are closely associated with human diseases, and identifying the relationships between lncRNAs and diseases is useful for disease diagnosis and treatment. Because traditional biological experiments are costly and time-consuming, a growing number of computational methods have been proposed in recent years to infer potential lncRNA-disease associations. However, these state-of-the-art prediction methods still have various limitations.
Results
In this manuscript, a novel computational model named FVTLDA is proposed to infer potential lncRNA-disease associations. Its major novelty lies in the integration of direct and indirect features related to lncRNA-disease associations, namely the feature vectors of lncRNA-disease pairs and their corresponding association probability fractions, which allows FVTLDA to be applied to diseases without known related lncRNAs and to lncRNAs without known related diseases. Moreover, FVTLDA neither relies solely on known lncRNA-disease associations nor requires any negative samples, which allows it to infer potential lncRNA-disease associations more equitably and effectively than traditional state-of-the-art prediction methods. Additionally, to avoid the limitations of single-model prediction techniques, we combine FVTLDA with Multiple Linear Regression (MLR) and with an Artificial Neural Network (ANN) for data analysis, respectively. Simulation results show that FVTLDA with MLR achieves AUCs of 0.8909, 0.8936 and 0.8970 in 5-fold cross validation (fivefold CV), 10-fold cross validation (tenfold CV) and leave-one-out cross validation (LOOCV), respectively, while FVTLDA with ANN achieves AUCs of 0.8766, 0.8830 and 0.8807 in fivefold CV, tenfold CV and LOOCV, respectively. Furthermore, in case studies of gastric cancer, leukemia and lung cancer, 8, 8 and 8 of the top 10 candidate lncRNAs predicted by FVTLDA with MLR, and 8, 7 and 8 of the top 10 candidate lncRNAs predicted by FVTLDA with ANN, have been verified by recent literature. Compared with the representative prediction model KATZLDA, FVTLDA with MLR and FVTLDA with ANN achieve average case study contrast scores of 0.8429 and 0.8515, respectively, both notably higher than the average case study contrast score of 0.6375 achieved by KATZLDA.
Conclusion
The simulation results show that FVTLDA has good prediction performance, making it a useful complement to future bioinformatics research.
Background
LncRNAs were long considered to be transcriptional noise [1, 2]. However, in recent years, a growing body of research has shown that lncRNAs play key roles in numerous important biological processes in humans, including chromatin modification, epigenetic regulation, cell cycle control and cell differentiation [3,4,5,6]. In particular, accumulating biological experiments have confirmed that mutations and dysregulation of lncRNAs are associated with the development of diseases such as leukemia [7], neurological disorders [8], coronary artery diseases [9] and several cancers [10]. Hence, effectively inferring potential associations between lncRNAs and diseases can not only help to elucidate the pathogenesis of complex diseases at the molecular level, but also provide biomarkers for disease diagnosis, therapy and prognosis. With the rapid increase in newly identified lncRNAs, several publicly available lncRNA-related databases, including lncRNADisease [11], NONCODE [12], lncRNAdb [13] and NRED [14], have been established. However, the number of known lncRNA-disease associations is still very limited, since traditional biological experiments are costly and time-consuming. Therefore, it is important and necessary to construct effective, high-throughput computational models to explore potential lncRNA-disease associations.
So far, researchers have developed numerous powerful computational models to predict potential lncRNA-disease associations, which can be roughly classified into three major categories according to their main implementation strategies [15]. The first category adopts machine learning methods to predict potential lncRNA-disease associations. For example, Yu and Wang et al. proposed a prediction model based on the Naïve Bayes classifier [16] in 2018 and a prediction model based on the collaborative filtering algorithm [17] in 2019 to infer potential lncRNA-disease associations. Xuan and Wang et al. developed a probabilistic matrix factorization model based on a semi-supervised learning method to identify potential associations between lncRNAs and diseases [18]. The major drawback of models in this first category is that they require negative samples in the training set, which notably affects their prediction performance, since negative samples are usually difficult to obtain. Some models overcome this limitation. LRLSLDA, the first large-scale prediction model [19], does not need negative sample information, but how to choose its best parameters remains an open problem.
Different from the first category, the second category focuses on implementing propagation algorithms such as random walk on a heterogeneous network constructed by integrating the lncRNA-disease association network, the disease similarity network, the lncRNA similarity network and so on. For instance, in 2014, Sun et al. [20] established a global network-based computational model, which adopted the random walk with restart (RWR) algorithm to predict potential lncRNA-disease associations. In 2015, Zhou et al. [21] proposed a prediction model by implementing RWR on a heterogeneous network comprising a known lncRNA-disease association network, a miRNA-associated lncRNA crosstalk network and a disease similarity network. However, these two models can only be applied to lncRNAs with known related diseases or known miRNA-disease associations. To overcome this limitation, in 2015, Chen et al. [22] developed a computational model called KATZLDA, which can infer potential associations for lncRNAs in the absence of known associated diseases. However, owing to the way its network is constructed, its predictions may be biased towards lncRNAs with more known related diseases and diseases with more known related lncRNAs.
As described above, the prediction performance of models in both categories is influenced by the number of known lncRNA-disease associations. However, the number of known lncRNA-disease associations confirmed by biological experiments is still very limited. To avoid this drawback, the third category adopts indirect biological information to predict potential lncRNA-disease associations. For instance, in 2014, Liu et al. [23] proposed a novel prediction model by combining human lncRNA expression profiles, human disease-associated gene data and gene expression profiles, which can achieve promising prediction performance even when there are no known lncRNA-disease associations. However, it cannot be applied to lncRNAs without gene-related records.
Different from the above existing methods, in this manuscript, we propose a novel computational model named FVTLDA to reveal potential lncRNA-disease associations. To avoid the limitations of the methods mentioned previously, we first introduce direct and indirect biological information on lncRNAs and diseases, including known lncRNA-miRNA-disease associations. Then, known lncRNA-disease associations are utilized to extract direct features for lncRNA-disease pairs based on the concept of Disease Clique. Meanwhile, indirect biological information, including known miRNA-disease associations and known miRNA-lncRNA associations, is utilized to extract indirect features for lncRNA-disease pairs by adopting the random walk with restart. Furthermore, to avoid the limitations of single-model prediction techniques, Multiple Linear Regression (MLR) and an Artificial Neural Network (ANN) are each combined with FVTLDA, based on the direct and indirect features obtained for lncRNA-disease pairs, to reveal potential lncRNA-disease associations. To estimate the prediction performance of FVTLDA, different frameworks including LOOCV, fivefold CV and tenfold CV are implemented to compare FVTLDA with existing competing models. Simulation results show that FVTLDA with MLR achieves AUCs of 0.8909, 0.8936 and 0.8970 in fivefold CV, tenfold CV and LOOCV, respectively, while FVTLDA with ANN achieves AUCs of 0.8766, 0.8830 and 0.8807 in fivefold CV, tenfold CV and LOOCV, respectively, both outperforming existing state-of-the-art models. Meanwhile, in case studies of gastric cancer, leukemia and lung cancer, 8, 8 and 8 of the top 10 candidate lncRNAs predicted by FVTLDA with MLR, and 8, 7 and 8 of the top 10 candidate lncRNAs predicted by FVTLDA with ANN, have been verified in biological experimental studies or other independent studies.
Finally, to further illustrate the practical predictive ability of FVTLDA, we compare it with the representative prediction model KATZLDA using a new concept, the case study contrast score, which aims to quantify the prediction ability of a model in a case study. Simulation results show that the average case study contrast scores of FVTLDA with MLR and FVTLDA with ANN are 0.8429 and 0.8515, respectively, both notably higher than the average case study contrast score of 0.6375 obtained by KATZLDA.
Result
Performance evaluation
To evaluate the prediction performance of FVTLDA, we implement LOOCV as follows: each known lncRNA-disease pair is selected in turn as the test sample, and the remaining known lncRNA-disease pairs are retained as training samples for model learning. The test sample and all lncRNA-disease pairs without known associations are considered as candidates. After running FVTLDA, the rank of the test sample among the candidates is obtained according to the association probability fractions. If the test sample ranks above a given threshold, it is counted as a successful prediction (a positive sample); otherwise, it is counted as an unsuccessful prediction (a negative sample). For different thresholds, the corresponding true positive rate (TPR, sensitivity) and false positive rate (FPR, 1 − specificity) can be calculated as follows:
\[TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}\]
Here, TP and TN denote the numbers of correctly identified positive and negative samples, respectively, while FP and FN denote the numbers of incorrectly identified positive and negative samples, respectively.
Based on the above equations, the Receiver Operating Characteristic (ROC) curve can be drawn from the TPRs and FPRs obtained at different thresholds, and the area under the ROC curve (AUC) is then calculated to evaluate the performance of FVTLDA. An AUC of 1 indicates perfect prediction, while an AUC of 0.5 corresponds to random guessing.
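The threshold sweep described above can be illustrated with a minimal sketch. It assumes each candidate has a score and a binary label (1 for a held-out known association, 0 for an unlabelled candidate); the function name `roc_auc` and the toy scores are illustrative only and are not taken from the FVTLDA implementation. Ties between scores are not treated specially.

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC points and AUC obtained by sweeping a cut-off down the ranked list."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores, kind="stable")   # rank candidates by descending score
    labels = labels[order]
    tp = np.cumsum(labels)                       # true positives above each cut-off
    fp = np.cumsum(1 - labels)                   # false positives above each cut-off
    tpr = np.concatenate(([0.0], tp / labels.sum()))        # TP / (TP + FN)
    fpr = np.concatenate(([0.0], fp / (1 - labels).sum()))  # FP / (FP + TN)
    # Trapezoidal area under the (FPR, TPR) curve.
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc

# Toy example with made-up scores (not results from the paper).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 0, 0, 1]
fpr, tpr, auc = roc_auc(scores, labels)
```

In this toy example, five of the nine positive-negative pairs are ordered correctly, so the AUC equals 5/9.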
During simulation, we first compared FVTLDA_MLR (i.e., FVTLDA with MLR) with six state-of-the-art prediction models, namely NBCLDA [16], CFNBC [17], PMFILDA [18], KATZLDA [22], SIMCLDA [24] and IIRWR [25], in the framework of LOOCV; the comparison results are shown in Fig. 1. FVTLDA_MLR achieves an AUC of 0.8970, outperforming all six state-of-the-art prediction models by at least 0.0311 in AUC.
Moreover, to eliminate the random error caused by the random initialization of weights and biases in FVTLDA_ANN (i.e., FVTLDA with ANN), we repeated the execution of LOOCV on FVTLDA_ANN 20 times and took the mean and standard deviation of the AUC values as the result. As illustrated in Additional file 1, FVTLDA_ANN achieves a mean AUC of 0.8807 with a standard deviation (std) of 0.0047 in LOOCV, which outperforms these six state-of-the-art prediction models.
To further verify the prediction performance of FVTLDA when there are few known lncRNA-disease associations, K-fold CV frameworks, including fivefold CV and tenfold CV, were implemented to compare FVTLDA_MLR with other representative prediction models. In K-fold CV, all known lncRNA-disease associations are divided equally into K parts; each part is left out in turn as the test sample, and the remaining lncRNA-disease pairs are used as training samples. As shown in Figs. 2 and 3, FVTLDA_MLR achieves better predictive performance than the other six competing models, which demonstrates that FVTLDA also performs well on sparse datasets.
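The K-fold partition described above can be sketched as follows. This is only an illustration of the splitting step: the helper name `kfold_parts` and the fixed seed are assumptions, and the count of 407 known associations is the figure reported in the Method section.

```python
import numpy as np

def kfold_parts(n_known, k, seed=0):
    """Randomly split the indices of known associations into k near-equal parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_known)
    return np.array_split(shuffled, k)

n_known = 407                      # known lncRNA-disease associations
parts = kfold_parts(n_known, k=5)

splits = []
for test_idx in parts:
    # All other known associations serve as training samples for this fold.
    train_idx = np.setdiff1d(np.arange(n_known), test_idx)
    splits.append((train_idx, test_idx))
```

Repeating the procedure with different seeds gives the repeated random partitions over which a mean and standard deviation of the AUC can be reported.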
Furthermore, to eliminate the effects of the random partition of training samples, we repeated fivefold CV and tenfold CV 20 times each and took the mean and standard deviation of the AUC values as the results. As shown in Additional files 2 and 3, FVTLDA_MLR achieves a mean AUC of 0.8903 with a std of 0.0022 in fivefold CV, and a mean AUC of 0.8940 with a std of 0.0014 in tenfold CV. Meanwhile, as shown in Additional files 4 and 5, FVTLDA_ANN achieves a mean AUC of 0.8766 with a std of 0.0043 in fivefold CV, and a mean AUC of 0.8830 with a std of 0.0022 in tenfold CV.
Finally, to demonstrate that FVTLDA performs well on different datasets, we further compared it with other state-of-the-art models, including HGLDA [26] and the method proposed by Yang et al. [27], in the framework of LOOCV. When comparing FVTLDA with HGLDA, we adopted the dataset provided by HGLDA, which consists of 183 experimentally validated lncRNA-disease associations. When comparing FVTLDA with the method proposed by Yang et al., we used the dataset provided by Yang et al., which consists of 319 known lncRNA-disease associations. FVTLDA outperforms both models on their respective datasets (Figs. 4 and 5).
Parameter analysis
In this section, the influence of the parameters in FVTLDA is estimated. The parameters r_{1} and r_{2} in Eq. (11) (see the Methods section) and Eq. (14) represent the restart probabilities of the random walk, the parameter rate in Eq. (19) is the adjustment factor, and the parameters k_{1} and k_{2} in Eqs. (20) and (21) denote the attenuation factors.
To determine the optimal values of these five parameters efficiently, we first traverse a coarse range of each parameter (0, 0.0001, 0.001, 0.01, 0.1) using FVTLDA with MLR in the framework of LOOCV. For parameters whose precision can be further improved, we take the approximate solution from the previous step as the default value and then obtain a more precise optimum by a finer traversal. As illustrated in Table 1 (bold indicates the best parameter value), the optimal values of rate, r_{1}, r_{2}, k_{1} and k_{2} are 0.3, 0.001, 0.001, 0.008 and 0.007, respectively.
Case study
To further demonstrate the predictive ability of FVTLDA, in this section, we select gastric cancer, leukemia and lung cancer as case studies. For any given disease d_{i} \(\in\) {gastric cancer, leukemia, lung cancer}, only those lncRNAs that have no known association with d_{i} are considered as candidates to be validated for d_{i}. Next, all candidate lncRNAs are ranked according to their association probability fractions calculated by FVTLDA. Finally, the top 10 candidate d_{i}-related lncRNAs are verified against recent articles and experiments published in the NCBI database (https://www.ncbi.nlm.nih.gov/). Additionally, to compare the prediction performance of FVTLDA_MLR and FVTLDA_ANN, as well as that of FVTLDA and another representative prediction model, KATZLDA, we list all lncRNAs in the top 10 candidate d_{i}-related lncRNAs predicted by FVTLDA_MLR, FVTLDA_ANN and KATZLDA, together with their corresponding rankings and relevant evidence. Moreover, to quantify the predictive ability of these three prediction models in the above case studies, we propose a novel concept, the case study contrast score, which can be calculated as follows:
Here, m denotes the number of verified lncRNAs in the top 10 predicted candidate lncRNAs, and R_{i} represents the ranking corresponding to the ith confirmed lncRNA. The better the practical predictive ability of a model, the closer its score is to 1. For example, in Table 2, the case study contrast score of FVTLDA_MLR = \(e^{{\left( {1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{5} + \cdots + \frac{1}{7} + \frac{1}{9} + \frac{1}{10} + \frac{1}{29} + \frac{1}{11}} \right) - \left( {1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{9} + \frac{1}{10}} \right)}} = 0.7168\).
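The worked example above can be checked numerically. The sketch below follows only the form visible in that example: the exponential of the sum of reciprocal rankings minus the sum 1 + 1/2 + ... + 1/10. The function name `contrast_score` is illustrative, and the ranking hidden behind the ellipsis between 1/5 and 1/7 is assumed to be 6.

```python
import math

def contrast_score(ranks, top_n=10):
    """exp(sum of 1/R_i minus the ideal sum 1 + 1/2 + ... + 1/top_n)."""
    ideal = sum(1.0 / j for j in range(1, top_n + 1))
    achieved = sum(1.0 / r for r in ranks)
    return math.exp(achieved - ideal)

# Rankings read off the worked example for FVTLDA_MLR (gastric cancer);
# the value 6 is an assumption for the term elided by the ellipsis.
ranks_from_example = [1, 2, 4, 5, 6, 7, 9, 10, 29, 11]
score = contrast_score(ranks_from_example)
print(round(score, 4))  # 0.7168
```

A model whose ten confirmed lncRNAs occupy rankings 1 to 10 exactly would obtain a score of 1 under this form.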
Gastric cancer is the second leading cause of cancer-related deaths and the fourth most common cancer in the world [28, 29]. A large number of lncRNAs have been shown to be related to gastric cancer [30, 31]. FVTLDA_MLR, FVTLDA_ANN and KATZLDA successfully predict 8, 8 and 8 confirmed lncRNAs among the top 10 candidate lncRNAs, respectively (Table 2), and their corresponding case study contrast scores are 0.7168, 0.8377 and 0.8439, respectively.
As for leukemia, its association with some lncRNAs has been widely reported [32, 33]. FVTLDA_MLR, FVTLDA_ANN and KATZLDA successfully predict 8, 8 and 8 confirmed lncRNAs among the top 10 candidate lncRNAs, respectively (Table 3), and their corresponding case study contrast scores are 0.9448, 0.9753 and 0.9688, respectively.
Moreover, lung cancer is also a leading cause of cancer death all over the world, regardless of gender [34]. FVTLDA_MLR and FVTLDA_ANN successfully predict 8 and 7 confirmed lncRNAs among the top 10 candidate lncRNAs, respectively (Table 4), whereas KATZLDA predicts only 1 confirmed lncRNA among its top 10 candidate lncRNAs. The case study contrast scores of FVTLDA_MLR, FVTLDA_ANN and KATZLDA are 0.8670, 0.7414 and 0.0998, respectively.
In conclusion, FVTLDA achieves excellent prediction performance, and the average case study contrast scores of FVTLDA_MLR (0.8429) and FVTLDA_ANN (0.8515) are both higher than that of KATZLDA (0.6375).
Discussion
Considerable evidence has demonstrated that lncRNAs play an important role in the pathological changes of human diseases, and identifying disease-related lncRNAs can help us better understand disease mechanisms at the molecular level. However, it is costly and time-consuming to verify lncRNA-disease associations with biological experiments. Thus, it is important and necessary to develop efficient computational models to predict potential lncRNA-disease associations.
Different from state-of-the-art prediction models, in this paper, a novel computational model called FVTLDA is proposed to predict potential lncRNA-disease associations based on direct and indirect biological information. To avoid the limitations of single-model prediction techniques, we further combine FVTLDA with multiple linear regression and with artificial neural networks, respectively. Moreover, to evaluate the prediction performance of FVTLDA, we conducted extensive experiments. Simulation results demonstrate that FVTLDA achieves better performance than six other state-of-the-art prediction models. Additionally, in case studies of gastric cancer, leukemia and lung cancer, simulation results show that the prediction ability and stability of both FVTLDA with MLR and FVTLDA with ANN are better than those of the competing method.
Despite its good prediction performance, the current version of FVTLDA can still be improved. For example, the complexity of the neural network in FVTLDA can be increased. In addition, more useful information sources, including gene-disease associations, can be integrated into the feature vectors of lncRNA-disease pairs to further improve the prediction performance of FVTLDA. In the future, we can also study association prediction in other fields of computational biology, such as miRNA-disease association prediction [35,36,37] and drug-target interaction prediction [38, 39], which may bring valuable insights to lncRNA-disease association prediction.
Conclusion
In this manuscript, a novel computational model named FVTLDA is proposed. FVTLDA addresses three problems of other models: (1) some models cannot be applied to isolated nodes; (2) some methods require negative samples that are difficult to obtain; (3) some approaches may be biased towards known nodes. In addition, we combine FVTLDA with Multiple Linear Regression and with an Artificial Neural Network for data analysis, respectively. The evaluation results and case studies show that our model outperforms other state-of-the-art models, which indicates that FVTLDA can be a useful tool for future research.
Method
To introduce direct and indirect biological information on lncRNA-disease associations into FVTLDA, in this section, we first collected three kinds of known associations, namely miRNA-disease associations, miRNA-lncRNA associations and lncRNA-disease associations, from various databases. Then, based on these three datasets, we constructed three incidence matrices as follows:
Step 1 First, we downloaded the datasets of known miRNA-disease associations and miRNA-lncRNA associations from the HMDD [40] and starBase v2.0 [41] databases, respectively. After removing duplicate associations supported by multiple pieces of evidence and normalizing the names of the miRNAs in these two datasets, we finally obtained 4704 unique miRNA-disease associations between 246 miRNAs and 373 diseases (see Additional file 6), and 9086 different miRNA-lncRNA associations between 246 miRNAs and 1089 lncRNAs (see Additional file 7). Based on these two datasets, we constructed a 246 × 373 dimensional miRNA-disease association incidence matrix MD and a 246 × 1089 dimensional miRNA-lncRNA association incidence matrix ML. In MD, MD(i,j) = 1 if and only if there exists a known association between the miRNA m_{i} and the disease d_{j}; otherwise MD(i,j) = 0. Similarly, in ML, ML(i,j) = 1 if and only if there exists a known association between the miRNA m_{i} and the lncRNA l_{j}; otherwise ML(i,j) = 0. For convenience, we define the numbers of miRNAs, diseases and lncRNAs obtained above as N_{m}, N_{d_MD} and N_{l_ML}, respectively, so that N_{m} = 246, N_{d_MD} = 373 and N_{l_ML} = 1089.
Step 2 Next, we downloaded the dataset of known lncRNA-disease associations from the MNDR v2.0 database [42]. After removing duplicate associations supported by multiple pieces of evidence, as illustrated in Fig. 6, we further discarded associations involving either lncRNAs not among the N_{l_ML} lncRNAs in ML or diseases not among the N_{d_MD} diseases in MD. Finally, we obtained 407 lncRNA-disease associations between 77 different lncRNAs and 95 different diseases (see Additional file 8). Similarly, based on the newly downloaded dataset, we constructed a 77 × 95 dimensional lncRNA-disease association incidence matrix LD, in which LD(i,j) = 1 if and only if there exists a known association between the lncRNA l_{i} and the disease d_{j}; otherwise LD(i,j) = 0. For convenience, we define the numbers of lncRNAs and diseases obtained above as N_{l_LD} and N_{d_LD}, respectively, so that N_{l_LD} = 77 and N_{d_LD} = 95.
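The construction of a binary incidence matrix from a list of association records can be sketched as follows. The helper `incidence_matrix` and the placeholder names (`l1`, `d1`, and so on) are purely illustrative; they are not records from HMDD, starBase or MNDR.

```python
import numpy as np

def incidence_matrix(pairs, row_names, col_names):
    """Binary matrix with entry 1 where a (row, column) association is known."""
    row_index = {name: i for i, name in enumerate(row_names)}
    col_index = {name: j for j, name in enumerate(col_names)}
    mat = np.zeros((len(row_names), len(col_names)), dtype=int)
    for r, c in set(pairs):                       # set() drops duplicate records
        if r in row_index and c in col_index:     # discard entities outside the index
            mat[row_index[r], col_index[c]] = 1
    return mat

# Placeholder records: one duplicate and one lncRNA ("l9") outside the index.
lncrnas = ["l1", "l2", "l3"]
diseases = ["d1", "d2"]
records = [("l1", "d1"), ("l1", "d1"), ("l2", "d2"), ("l9", "d2")]
LD = incidence_matrix(records, lncrnas, diseases)
```

The same helper would apply to MD and ML by passing miRNAs as rows and diseases or lncRNAs as columns.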
Construction of the Gaussian interaction profile kernel similarity for miRNAs based on miRNA-lncRNA associated information
According to the assumption that similar miRNAs tend to interact with similar lncRNAs [43], the Gaussian interaction profile kernel similarity between the miRNA m_{i} and the miRNA m_{j} can be calculated as follows:
Here, IP(m_{i}) denotes the ith row in the miRNA-lncRNA association incidence matrix ML, γ_{m} denotes the normalized bandwidth based on the new bandwidth parameter γ_{m}′, and in this paper γ_{m}′ will be set to 1 according to previous experiments [44]. In this way, an N_{m} × N_{m} dimensional Gaussian interaction profile kernel similarity matrix KM for miRNAs can be established.
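A minimal sketch of this kernel is given below. It assumes the commonly used form \(KM(m_i, m_j) = \exp(-\gamma_m \lVert IP(m_i) - IP(m_j)\rVert^2)\), with \(\gamma_m\) equal to \(\gamma_m'\) divided by the mean squared norm of the interaction profiles; this form is an assumption consistent with the symbols defined above, and the function name `gip_kernel` and the toy matrix are illustrative.

```python
import numpy as np

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`.

    Assumes at least one profile is non-zero, so the mean squared norm is positive.
    """
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = np.sum(profiles ** 2, axis=1)
    gamma = gamma_prime / sq_norms.mean()          # normalised bandwidth
    # Squared Euclidean distance between every pair of rows.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dist = np.maximum(sq_dist, 0.0)             # guard against round-off
    return np.exp(-gamma * sq_dist)

# Toy miRNA-lncRNA incidence matrix (3 miRNAs x 4 lncRNAs), made up for illustration.
ML_toy = np.array([[1, 1, 0, 0],
                   [1, 0, 0, 0],
                   [0, 0, 1, 1]])
KM = gip_kernel(ML_toy)
```

The resulting matrix is symmetric with ones on the diagonal, and miRNAs with more similar interaction profiles receive values closer to 1.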
Construction of the functional similarity for miRNAs based on miRNA-disease associated information
In recent years, disease semantic similarity has been widely utilized to identify potential miRNA-disease associations, and many previous studies have shown the validity of this similarity [45,46,47,48,49,50]. In this study, we calculated the disease semantic similarity in the same way as in a previous study [49]. For each disease, we first downloaded its corresponding Medical Subject Headings (MeSH) descriptors from the National Library of Medicine (http://www.nlm.nih.gov/) [49], and then represented a disease d_{A} as a directed acyclic graph (DAG), DAG(d_{A}) = (D(d_{A}), E(d_{A})). Here, D(d_{A}) consists of the disease node d_{A} itself and all ancestor nodes of d_{A}, while E(d_{A}) is composed of all the directed edges from parent nodes to child nodes. For example, the codes for breast neoplasms are C04.588.180 and C17.800.090.500, and the corresponding parent nodes are C04.588 (neoplasms by site) and C17.800.090 (breast diseases) [49]. Following the previous study [18], for any two disease nodes d and t, we calculate the contribution of t to the semantic value of d as follows:
where ∆ denotes the semantic contribution decay factor, and according to the previous study [49], in this paper, ∆ will be set to 0.5. Thereafter, we can calculate the semantic value of the disease d through combining all these diseases in its DAG(d) as follows:
According to the assumption that two diseases with a larger number of shared nodes in their DAGs may have higher similarity, we can calculate the disease semantic similarity score between a pair of diseases d_{i} and d_{j} as follows:
According to the above formula, an N_{d_MD} × N_{d_MD} dimensional matrix DS_{MD} can be established. Meanwhile, by extracting from DS_{MD} the semantic similarity information of the diseases that appear in the lncRNA-disease associations, we can further build an N_{d_LD} × N_{d_LD} dimensional matrix DS_{LD} as well.
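The DAG-based computation described in this subsection can be sketched as follows. The sketch assumes the widely used formulation in which a disease contributes 1 to its own semantic value, each ancestor contributes the decay factor ∆ times the largest contribution among its children in the DAG, and the similarity of two diseases is the summed contribution of their shared nodes divided by the sum of their semantic values. The function names and the toy hierarchy are hypothetical and are not MeSH data.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Contribution of every node in DAG(disease) to the semantic value of `disease`."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        next_frontier = []
        for node in frontier:
            for parent in parents.get(node, ()):
                value = delta * contrib[node]
                if value > contrib.get(parent, 0.0):   # keep the largest contribution
                    contrib[parent] = value
                    next_frontier.append(parent)
        frontier = next_frontier
    return contrib

def semantic_similarity(d1, d2, parents, delta=0.5):
    """Shared-node contributions divided by the sum of both semantic values."""
    c1 = semantic_contributions(d1, parents, delta)
    c2 = semantic_contributions(d2, parents, delta)
    shared = set(c1) & set(c2)
    numerator = sum(c1[t] + c2[t] for t in shared)
    return numerator / (sum(c1.values()) + sum(c2.values()))

# Hypothetical hierarchy: diseases A and B share the parent P, whose parent is root.
parents = {"A": ["P"], "B": ["P"], "P": ["root"]}
sim_ab = semantic_similarity("A", "B", parents)
sim_aa = semantic_similarity("A", "A", parents)
```

With ∆ = 0.5, each of A and B has a semantic value of 1.75, and the shared nodes P and root give a similarity of 1.5 / 3.5, about 0.43.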
After obtaining the semantic similarity scores of diseases, we can obtain the functional similarity between miRNAs based on the assumption that miRNAs with similar functions are often implicated in similar diseases [49], as follows: for any two given miRNAs m_{i} and m_{j}, let the sets of diseases known to be related to m_{i} and m_{j} be GDM(m_{i}) = {d_{i1},d_{i2},d_{i3}…,d_{ip}} and GDM(m_{j}) = {d_{j1},d_{j2},d_{j3},…,d_{jq}} respectively; then the functional similarity score between m_{i} and m_{j} can be calculated according to the following:
According to the above equation, an N_{m} × N_{m} dimensional functional similarity matrix FM for miRNAs can be established. In the same way, let the sets of diseases known to be associated with lncRNAs l_{i} and l_{j} be GDL(l_{i}) = {d_{i1},d_{i2},d_{i3}…,d_{ip}} and GDL(l_{j}) = {d_{j1},d_{j2},d_{j3},…,d_{jq}} respectively; then the functional similarity score between l_{i} and l_{j} can likewise be calculated according to the following equation:
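One widely used form of such a functional similarity is the best-match average over the two disease sets: each disease in one set is matched to its most similar disease in the other set, and the matched similarities are summed and divided by p + q. The sketch below assumes that form; it is not confirmed to be the exact equation used in FVTLDA, and the function name and toy similarity matrix are illustrative.

```python
import numpy as np

def functional_similarity(set_a, set_b, DS):
    """Best-match average similarity between two sets of disease indices."""
    a, b = list(set_a), list(set_b)
    if not a or not b:
        return 0.0                      # no known diseases: similarity left at 0
    block = DS[np.ix_(a, b)]            # pairwise semantic similarities
    best_for_a = block.max(axis=1).sum()   # best match in b for each disease in a
    best_for_b = block.max(axis=0).sum()   # best match in a for each disease in b
    return float((best_for_a + best_for_b) / (len(a) + len(b)))

# Toy disease semantic similarity matrix (made-up values).
DS_toy = np.array([[1.0, 0.5, 0.2],
                   [0.5, 1.0, 0.4],
                   [0.2, 0.4, 1.0]])
fs = functional_similarity([0, 1], [1, 2], DS_toy)
```

Here the best matches sum to 1.5 and 1.4, giving a score of 2.9 / 4 = 0.725; two identical disease sets give a score of 1.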
Construction of FVTLDA
As illustrated in Fig. 7, FVTLDA consists of the following three major steps:
Step a According to the indirect biological information downloaded above, including known miRNA-lncRNA associations and known miRNA-disease associations, a unique feature vector is first constructed for each lncRNA-disease pair by adopting the random walk with restart based on the Gaussian interaction profile kernel similarity for miRNAs and the functional similarity for miRNAs.
Step b Next, according to the known lncRNA-disease associations downloaded above, a unique association probability fraction is calculated for each lncRNA-disease pair based on the concept of Disease Clique [25].
Step c Finally, based on the feature vectors and association probability fractions obtained above, the Multiple Linear Regression (MLR) and the Artificial Neural Network (ANN) will be integrated to infer relationships between feature vectors and corresponding association probability fractions. And then, based on these predicted relationships, for each pair of lncRNA and disease, the potential association between them will be mapped into a probability score. Thereafter, based on these probability scores, we can rank the associations between lncRNAs and diseases conveniently.
Construction of feature vectors for lncRNA-disease pairs
As shown in Fig. 7a, for each lncRNA-disease pair, the construction of its feature vector consists of three major steps:
Step 1 Based on formula (11), construct the miRNA-lncRNA association probability fraction matrix PL according to known miRNA-lncRNA associations and the Gaussian interaction profile kernel similarity for miRNAs. Then, for each lncRNA l_{i}, the column corresponding to l_{i} in the matrix PL is taken as the feature vector of l_{i}.
Step 2 Based on formula (14), construct the miRNA-disease association probability fraction matrix PD according to known miRNA-disease associations and the miRNA functional similarity. Then, for each disease d_{j}, the column corresponding to d_{j} in the matrix PD is taken as the feature vector of d_{j}.
Step 3 For each lncRNA-disease pair (l_{i},d_{j}), obtain its feature vector by integrating the feature vector of l_{i} with the feature vector of d_{j} according to formula (17).
Random walk is commonly adopted to rank the association probabilities of nodes in a network [50]. Therefore, we implement the random walk with restart on the miRNA-lncRNA association network to obtain the feature vectors of lncRNAs as follows: taking any given lncRNA node l_{i} as the walker, the random walk starts from all known miRNA nodes related to it and moves from the current node to the next node according to the Gaussian interaction profile kernel similarity for miRNA nodes. Supposing that the random walk restarts with probability r_{1} (0 < r_{1} < 1), the random walk process can be described by the following formulas:
The random walk is an iterative process, which stops when it reaches a stable state. Here, considering the requirements of time efficiency and accuracy, the random walk is considered stable if the difference between PL_{s+1} and PL_{s} is less than 10^{–10}. In this way, for each lncRNA l_{i}, the feature vector of l_{i} is expressed by the association probability fractions of all miRNAs with respect to l_{i}, i.e., by the ith column in the matrix PL.
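The iteration described above can be sketched as follows. The sketch assumes the standard random-walk-with-restart update, in which the next state is (1 − r) times the transition matrix applied to the current state plus r times the initial state; the column normalisation of the similarity matrix, the uniform initial distribution over known miRNAs and the L1 measure of change are assumptions, and the function name `rwr` and the toy matrices are illustrative.

```python
import numpy as np

def rwr(similarity, association, restart, tol=1e-10, max_iter=100_000):
    """Random walk with restart over miRNA nodes, one walker per column of `association`."""
    S = np.asarray(similarity, dtype=float)
    A = np.asarray(association, dtype=float)
    col_sums = S.sum(axis=0, keepdims=True)
    W = S / np.where(col_sums == 0, 1.0, col_sums)   # column-normalised transition matrix
    start = A.sum(axis=0, keepdims=True)
    P0 = A / np.where(start == 0, 1.0, start)        # start uniformly on known miRNAs
    P = P0.copy()
    for _ in range(max_iter):
        P_next = (1.0 - restart) * W @ P + restart * P0
        if np.abs(P_next - P).sum() < tol:           # stable state reached
            return P_next
        P = P_next
    return P

# Toy inputs: 3 miRNAs, 2 lncRNAs (made-up values).
KM_toy = np.array([[1.0, 0.5, 0.1],
                   [0.5, 1.0, 0.2],
                   [0.1, 0.2, 1.0]])
ML_toy = np.array([[1, 0],
                   [1, 0],
                   [0, 1]])
PL = rwr(KM_toy, ML_toy, restart=0.3)   # column i is the feature vector of lncRNA i
```

A small restart probability makes each step contract more slowly, so many more iterations are needed before the 10^{–10} criterion is met; the toy call uses 0.3 only to keep the example quick.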
Similarly, for each disease d_{j}, let the random walk be restarted with the probability of r_{2} (0 < r_{2} < 1), and its feature vector can as well be obtained according to the following equations:
Finally, for each lncRNA-disease pair (l_{i},d_{j}), its feature vector can be calculated by combining the feature vectors of both l_{i} and d_{j} as follows:
Here, PL(i) and PD(j) represent the ith column of the matrix PL and jth column of the matrix PD respectively. Moreover, for two column vectors A = (a_{1}, a_{2},…,a_{n})^{T} and B = (b_{1},b_{2},…,b_{n})^{T}, A \(\otimes\) B = (a_{1} × b_{1},a_{2} × b_{2},…,a_{n} × b_{n})^{T}.
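As a small illustration of the element-wise product defined above, the following sketch combines one column of PL with one column of PD. It assumes that the pair feature vector is exactly this product of the two columns, which is how the surrounding text reads; the helper names and the row-major list-of-lists layout are hypothetical.

```python
def column(matrix, j):
    """Extract column j from a matrix stored as a list of rows."""
    return [row[j] for row in matrix]


def pair_feature_vector(pl_column, pd_column):
    """Element-wise product of an lncRNA feature vector and a disease feature vector."""
    if len(pl_column) != len(pd_column):
        raise ValueError("both feature vectors must have one entry per miRNA")
    return [a * b for a, b in zip(pl_column, pd_column)]


# Toy matrices with 3 miRNAs (rows); PL has 2 lncRNAs and PD has 2 diseases (columns).
PL = [[0.5, 0.1], [0.25, 0.6], [0.0, 0.3]]
PD = [[0.2, 0.7], [0.4, 0.1], [0.9, 0.2]]
fv = pair_feature_vector(column(PL, 0), column(PD, 0))
```

Each entry of the result is large only when the corresponding miRNA scores highly for both the lncRNA and the disease.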
In this way, all the feature vectors obtained are independent of one another, and there is no collinearity among them.
Construction of association probability fractions for lncRNA-disease pairs
The incidence matrix LD obtained from known lncRNA-disease associations can only reflect whether or not lncRNAs have known associations with diseases; it cannot accurately express the degrees of their relationships. Moreover, if an element of LD equals 0, it only means that there is currently no known association between the corresponding lncRNA and disease nodes, not that no association exists between them. Thus, the values in the matrix LD need to be further processed. Here, we turn this classification problem into a regression problem. Referring to the definition of the Disease Clique proposed in a previous study [25], for each given disease d_{i} and lncRNA l_{j}, we define the set consisting of all nonzero elements in the ith row of the matrix DS_{LD} as the Disease Clique of d_{i}. Then, as shown in Fig. 8, the lncRNA-disease association incidence matrix LD can be revised as follows:
The probability fraction matrix OUTPUT obtained from the above formula (18) can not only solve the problem of sparsity existing in the original association incidence matrix LD, but also reflect the degree of relationship between lncRNAs and diseases to some extent.
Construction of FVTLDA with MLR and FVTLDA with ANN
To avoid the limitations of a single-model prediction scheme, in this section we present two different methods, namely multiple linear regression (MLR) analysis and the artificial neural network (ANN), to reveal the potential relationship between the feature vector of any given lncRNA-disease pair and its association probability fraction.
Construction of FVTLDA with MLR
MLR analysis is often used in statistical analysis [51,52,53] to determine the quantitative relationship between a dependent variable and a set of independent variables. The general form of MLR can be expressed as follows:
Here, Y represents the dependent variable, {X_{1}, X_{2},…, X_{k}} denote the independent variables of Y, β_{0} is the constant term, {β_{1}, β_{2},…, β_{k}} are the partial regression coefficients of {X_{1}, X_{2},…, X_{k}} respectively, and e denotes the error value. Based on formula (22), for each lncRNA-disease pair (l_{i},d_{j}), we can represent the relationship between its association probability fraction OUTPUT(i,j) and its feature vector as follows:
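For orientation, the standard textbook form of MLR written with the symbols defined above, together with its specialization to a lncRNA-disease pair, can be sketched as follows. The second line assumes that the feature vector FV_{ij} has N_{m} components, as the notation of this section suggests; the summation index t is introduced only for illustration, and the layout of the article's own formulas may differ.

```latex
\begin{align}
Y &= \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \cdots + \beta_{k}X_{k} + e \\
\mathrm{OUTPUT}(i,j) &= \beta_{0} + \sum_{t=1}^{N_{m}} \beta_{t}\,FV_{ij}(t) + e
\end{align}
```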
Moreover, for convenience, we define the regression coefficients as W = [β_{0},β_{1},β_{2},…,β_{N_{m}}], the feature vector of (l_{i},d_{j}) as x_{n} = [1,FV_{ij}(1),FV_{ij}(2),…,FV_{ij}(N_{m})], and the association probability fraction corresponding to (l_{i},d_{j}) as y_{n} = OUTPUT(i,j). Then, for a given training set T = {(x_{1},y_{1}),(x_{2},y_{2}),…,(x_{N},y_{N})}, let X = (x_{1},x_{2},…,x_{N})^{T} and Y = (y_{1},y_{2},…,y_{N})^{T}. The regression coefficients W can be estimated by the least squares method, and the optimal solution W^{*} can be calculated as follows:
Finally, based on the above formulas, our prediction model FVTLDA with MLR is described in Algorithm 1 (see Additional file 9).
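The least squares step can be illustrated with a short sketch. It assumes the standard closed-form solution W^{*} = (X^{T}X)^{-1}X^{T}Y and solves the normal equations by Gauss-Jordan elimination instead of forming the inverse explicitly. This is not the authors' Matlab implementation; the function names are hypothetical and the toy data are made up for the example.

```python
def fit_mlr(features, targets):
    """Least-squares fit of y = b0 + b1*x1 + ... + bk*xk via the normal equations."""
    rows = [[1.0] + list(f) for f in features]  # leading 1 carries the constant term
    k = len(rows[0])
    # Augmented system [X^T X | X^T y].
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * y for r, y in zip(rows, targets))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; the features are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def predict_mlr(weights, feature):
    """Predicted association probability fraction for one feature vector."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feature))


# Toy data generated from y = 0.1 + 2*x1 - 0.5*x2 (illustrative only).
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
Y = [0.1 + 2 * a - 0.5 * b for a, b in X]
W = fit_mlr(X, Y)
```

The singularity check shows why collinear features matter for this approach: if X^{T}X cannot be inverted, the coefficients are not uniquely determined.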
Artificial neural network (ANN)
ANN is a simple model often used to simulate the biological structure of the human brain. It is a densely connected network composed of simple elements, which can reflect the essential relationships between dependent and independent variables. One of the most important characteristics of ANN is that it can learn from training samples, which allows it to overcome the limitations of traditional methods. Therefore, in this section, we further adopt ANN to estimate the relationships between the feature vectors of lncRNA-disease pairs and their association probability fractions. As illustrated in Fig. 9, ANN is a parallel distributed processing system composed of many processing components (neurons), which can be divided into three layers: the Input layer, the Hidden layer and the Output layer. In ANN, each neuron in every layer receives one or more input signals and generates, through the activation function, an output signal that serves as an input signal of the next layer. The most important part of building an ANN is determining the weights and biases. Each link between neurons carries a weight that reflects the influence of the previous neuron on the current neuron, and the bias increases the flexibility of the neuron [54]. In this section, in a way similar to the previous study [55], we determine the weights and biases of ANN through the following four major steps:
Step 1 Take the training samples as the input values, and randomly set the initial values of weights and biases in each layer of ANN.
Step 2 Calculate the output of ANN and compare the output with the target value to obtain the value of error.
Step 3 Readjust the weights and biases in each layer of ANN according to the value of error obtained above from Step 2.
Step 4 Repeat the above procedure until ANN reaches the stop condition.
In this paper, all feature vectors of lncRNA-disease pairs were randomly divided into a training set, a validation set and a test set in a ratio of 3:1:1. The training set was taken as the input of the Input layer. Thereafter, the input of the Hidden layer can be obtained by combining the weights, the output of the Input layer and the biases. Additionally, let \(I_{m}^{n}\) and \(O_{m}^{n}\) denote the input value and the output value of node m in the nth layer of ANN, respectively; then the output of the Hidden layer can be calculated according to the following activation function:
Similarly, the input of the Output layer can be acquired by integrating the weights and the output of the Hidden layer, and the output of the Output layer can be figured out through the following activation function:
After obtaining the output value of the Output layer of ANN, the mean square error (MSE) can be obtained by comparing it with the target (the corresponding association probability fraction) as follows:
Here, N represents the number of test sets.
Finally, the weight and bias of each connection between neurons are adjusted repeatedly according to the MSE value until one of the following stop conditions is satisfied:
(1) Maximum number of training iterations (set to 100 in this paper).
(2) Minimum MSE (set to 0.001 in this paper).
(3) Maximum number of consecutive iterations without improvement: training stops if the MSE of the validation set does not decrease in t consecutive iterations (t was set to 15 in this paper).
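The training procedure and the three stop conditions above can be sketched as follows. Several points are assumptions made only for this illustration: a single hidden layer, a sigmoid activation in both the Hidden and the Output layer, plain per-sample gradient descent, and checking the minimum MSE on the training set. The constants 100, 0.001 and 15 and the 3:1:1 split come from the text; all function names, the hidden layer size and the learning rate are hypothetical.

```python
import math
import random


def split_3_1_1(samples, seed=0):
    """Randomly split samples into training, validation and test sets (3:1:1)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_train, n_val = len(shuffled) * 3 // 5, len(shuffled) // 5
    return (shuffled[:n_train], shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(net, x):
    """Return the hidden-layer outputs and the scalar network output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(net["w1"], net["b1"])]
    out = sigmoid(sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"])
    return hidden, out


def mse(net, data):
    return sum((forward(net, x)[1] - y) ** 2 for x, y in data) / len(data)


def train(train_set, val_set, n_hidden=4, lr=0.5,
          max_epochs=100, min_mse=0.001, patience=15, seed=0):
    rng = random.Random(seed)
    n_in = len(train_set[0][0])
    # Step 1: random initial weights and biases.
    net = {"w1": [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)],
           "b1": [rng.uniform(-1, 1) for _ in range(n_hidden)],
           "w2": [rng.uniform(-1, 1) for _ in range(n_hidden)],
           "b2": rng.uniform(-1, 1)}
    best_val, stale, reason = float("inf"), 0, "max_epochs"
    for epoch in range(1, max_epochs + 1):           # stop condition (1)
        for x, y in train_set:
            hidden, out = forward(net, x)            # Step 2: output and error
            d_out = (out - y) * out * (1 - out)
            d_hid = [d_out * net["w2"][j] * hidden[j] * (1 - hidden[j])
                     for j in range(n_hidden)]
            for j in range(n_hidden):                # Step 3: adjust weights and biases
                net["w2"][j] -= lr * d_out * hidden[j]
                net["b1"][j] -= lr * d_hid[j]
                for i in range(n_in):
                    net["w1"][j][i] -= lr * d_hid[j] * x[i]
            net["b2"] -= lr * d_out
        if mse(net, train_set) <= min_mse:           # stop condition (2)
            reason = "min_mse"
            break
        val = mse(net, val_set)
        if val < best_val:
            best_val, stale = val, 0
        else:
            stale += 1
            if stale >= patience:                    # stop condition (3)
                reason = "patience"
                break
    return net, epoch, reason
```

In this sketch the test set returned by `split_3_1_1` is never seen during training, so it can be reserved for the final evaluation.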
Finally, based on the above formulas, our prediction model FVTLDA with ANN is described in Algorithm 2 (see Additional file 9).
Availability and requirements
Project name: My bioinformatics project FVTLDA.
Project home page: https://github.com/xiaoyubin123/FVTLDA.git
Operating system: Platform independent
Programming language: Matlab
Other requirements: Matlab_R2017b or higher
Any restrictions to use by nonacademics: No license required
Abbreviations
FVTLDA: Feature vectors is developed to predict LncRNA-Disease Associations
LOOCV: Leave-one-out cross validation
MLR: Multiple linear regression
ANN: Artificial neural network
CV: Cross validation
RWR: Random walk with restart
TPR: True positive rate
FPR: False positive rate
ROC: Receiver operating characteristic
AUC: Area under the ROC curve
References
1. Esteller M. Noncoding RNAs in human disease. Nat Rev Genet. 2011;12(12):861–74.
2. Wang KC, Chang HY. Molecular mechanisms of long noncoding RNAs. Mol Cell. 2011;43(6):904–14.
3. Wapinski O, Chang HY. Long noncoding RNAs and human disease. Trends Cell Biol. 2011;21(6):354–61.
4. Mercer TR, Dinger ME, Mattick JS. Long noncoding RNAs: insights into functions. Nat Rev Genet. 2009;10(3):155–9.
5. Lander ES, Linton LM, Birren B, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409(6822):860.
6. Chen X, Sun YZ, Guan NN, et al. Computational models for lncRNA function prediction and functional similarity calculation. Brief Funct Genomics. 2019;18(1):58–82.
7. Calin GA, Liu C, Ferracin M, et al. Ultraconserved regions encoding ncRNAs are altered in human leukemias and carcinomas. Cancer Cell. 2007;12(3):215–29.
8. Johnson R. Long noncoding RNAs in Huntington's disease neurodegeneration. Neurobiol Dis. 2012;46(2):245–54.
9. Cai Y, Yang Y, Chen X, et al. Circulating ‘lncRNA OTTHUMT00000387022’ from monocytes as a novel biomarker for coronary artery disease. Cardiovasc Res. 2016;112:714–24.
10. Li J, Xuan Z, Liu C. Long noncoding RNAs and complex human diseases. Int J Mol Sci. 2013;14(9):18790–808.
11. Chen G, Wang Z, Wang D, et al. LncRNADisease: a database for long-non-coding RNA-associated diseases. Nucleic Acids Res. 2013;41(D1):D983–6.
12. Bu D, Yu K, Sun S, et al. NONCODE v3.0: integrative annotation of long noncoding RNAs. Nucleic Acids Res. 2012;40(D1):D210–5.
13. Amaral PP, Clark MB, Gascoigne DK, Dinger ME, Mattick JS. lncRNAdb: a reference database for long noncoding RNAs. Nucleic Acids Res. 2011;39(Database issue):D146–51.
14. Dinger ME, Pang KC, Mercer TR, Crowe ML, Grimmond SM, Mattick JS. NRED: a database of long noncoding RNA expression. Nucleic Acids Res. 2009;37(Database issue):D122–6.
15. Chen X, Yan CC, Zhang X, You ZH. Long noncoding RNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2016. https://doi.org/10.1093/bib/bbw060.
16. Jingwen Y, Pengyao P, Lei W, et al. A novel probability model for LncRNA–disease association prediction based on the Naïve Bayesian classifier. Genes. 2018;9(7):345.
17. Yu J, Xuan Z, Feng X, et al. A novel collaborative filtering model for LncRNA-disease association prediction based on the Naïve Bayesian classifier. BMC Bioinform. 2019;20(1):396.
18. Xuan Z, Li J, Yu J, et al. A probabilistic matrix factorization method for identifying lncRNA-disease associations. Genes. 2019;10(2):126.
19. Chen X, Yan GY. Novel human lncRNA-disease association inference based on lncRNA expression profiles. Bioinformatics. 2013a;29(20):2617–24.
20. Sun J, Shi H, Wang Z, et al. Inferring novel lncRNA–disease associations based on a random walk model of a lncRNA functional similarity network. Mol BioSyst. 2014;10(8):2074–81.
21. Zhou M, Wang X, Li J, et al. Prioritizing candidate disease-related long noncoding RNAs by walking on the heterogeneous lncRNA and disease network. Mol BioSyst. 2015;11(3):760–9.
22. Chen X. KATZLDA: KATZ measure for the lncRNA-disease association prediction. Sci Rep. 2015a;5:16840.
23. Liu MX, Chen X, Chen G, et al. A computational framework to infer human disease-associated long noncoding RNAs. PLoS ONE. 2014;9(1):e84408.
24. Lu C, Yang M, Luo F, et al. Prediction of lncRNA–disease associations based on inductive matrix completion. Bioinformatics. 2018;34(19):3357–64.
25. Wang L, Xiao Y, Li J, et al. IIRWR: internal inclined random walk with restart for LncRNA-disease association prediction. IEEE Access. 2019;7:54034–41.
26. Chen X. Predicting lncRNA-disease associations and constructing lncRNA functional similarity network based on the information of miRNA. Sci Rep. 2015b;5:13186.
27. Yang X, Gao L, Guo X, et al. A network based method for analysis of lncRNA-disease associations and prediction of lncRNAs implicated in diseases. PLoS ONE. 2014;9(1):e87797.
28. Hartgrink HH, Jansen EPM, Grieken NCTV, et al. Gastric cancer. Lancet. 2009;374(9688):477–90.
29. Guo X, Xia J, Deng K, et al. Long noncoding RNAs: emerging players in gastric cancer. Tumor Biol. 2014;35(11):10591–600.
30. Chen D, Ju H, Lu Y, et al. Long noncoding RNA XIST regulates gastric cancer progression by acting as a molecular sponge of miR-101 to modulate EZH2 expression. J Exp Clin Cancer Res. 2016;35(1):142.
31. Xia H, Chen Q, Chen Y, et al. The lncRNA MALAT1 is a novel biomarker for gastric cancer metastasis. Oncotarget. 2016;7(35):56209.
32. Fernando TR, Rodriguez-Malave NI, Waters EV, et al. LncRNA expression discriminates karyotype and predicts survival in B-lymphoblastic leukemia. Mol Cancer Res. 2015;13(5):839–51.
33. Wang Y, Wu P, Lin R, et al. LncRNA NALT interaction with NOTCH1 promoted cell proliferation in pediatric T cell acute lymphoblastic leukemia. Sci Rep. 2015;5:13749.
34. Hoffman PC, Mauer AM, Vokes EE. Lung cancer. Lancet. 2000;355(9202):479–85.
35. Chen X, Guan NN, Sun YZ, Li JQ, Qu J. MicroRNA-small molecule association identification: from experimental results to computational models. Brief Bioinform. 2018;21(1):47–61.
36. Chen X, Li TH, Zhao Y, Wang CC, Zhu CC. Deep-belief network for predicting potential miRNA-disease associations. Brief Bioinform. 2020. https://doi.org/10.1093/bib/bbaa186.
37. Chen X, Zhu CC, Yin J. Ensemble of decision tree reveals potential miRNA-disease associations. PLoS Comput Biol. 2019;15(7):e1007209.
38. Wang CC, Zhao Y, Chen X. Drug-pathway association prediction: from experimental results to computational models. Brief Bioinform. 2020. https://doi.org/10.1093/bib/bbaa061.
39. Chen X, Yan CC, Zhang X, Zhang X, Dai F, Yin J, Zhang Y. Drug-target interaction prediction: databases, web servers and computational models. Brief Bioinform. 2016;17(4):696–712. https://doi.org/10.1093/bib/bbv066.
40. Li Y, Qiu C, Tu J, et al. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res. 2013;42(D1):D1070–4.
41. Li JH, Liu S, Zhou H, et al. starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein–RNA interaction networks from large-scale CLIP-Seq data. Nucleic Acids Res. 2013;42(D1):D92–7.
42. Cui T, Zhang L, Huang Y, et al. MNDR v2.0: an updated resource of ncRNA-disease associations in mammals. Nucleic Acids Res. 2018;46(D1):D371–4.
43. van Laarhoven T, Nabuurs SB, Marchiori E. Gaussian interaction profile kernels for predicting drug–target interaction. Bioinformatics. 2011;27(21):3036–43.
44. Chen X, Yan GY. Novel human lncRNA–disease association inference based on lncRNA expression profiles. Bioinformatics. 2013b;29(20):2617–24.
45. Chen X, Xie D, Wang L, et al. BNPMDA: bipartite network projection for MiRNA–disease association prediction. Bioinformatics. 2018;34(18):3178–86.
46. Chen X, Wang L, Qu J, et al. Predicting miRNA–disease association based on inductive matrix completion. Bioinformatics. 2018;34(24):4256–65.
47. Chen X, Xie D, Zhao Q, You ZH. MicroRNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2019;20(2):515–39.
48. Chen X, Yin J, Qu J, et al. MDHGI: matrix decomposition and heterogeneous graph inference for miRNA-disease association prediction. PLoS Comput Biol. 2018;14(8):e1006418.
49. Wang D, Wang J, Lu M, et al. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics. 2010;26(13):1644–50.
50. Niu YW, Wang GH, Yan GY, et al. Integrating random walk and binary regression to identify novel miRNA-disease association. BMC Bioinform. 2019;20(1):59.
51. Kaytez F, Taplamacioglu MC, Cam E, et al. Forecasting electricity consumption: a comparison of regression analysis, neural networks and least squares support vector machines. Int J Electr Power Energy Syst. 2015;67:431–8.
52. Atici U. Prediction of the strength of mineral admixture concrete using multivariable regression analysis and an artificial neural network. Expert Syst Appl. 2011;38(8):9609–18.
53. Bahadir E. Using neural network and logistic regression analysis to predict prospective mathematics teachers’ academic success upon entering graduate education. Educ Sci Theory Pract. 2016;16(3):943–64.
54. Lee Y. Neural network based approach for predicting learning effect in design students. Int J Organ Innov. 2010;2(3):250.
55. Wang L, Zeng Y, Chen T. Back propagation neural network with adaptive differential evolution algorithm for time series forecasting. Expert Syst Appl. 2015;42(2):855–63.
Acknowledgements
The authors thank all those who have made suggestions for this article.
Funding
This research was partly sponsored by the National Natural Science Foundation of China (No. 61873221, No. 61672447) and the Natural Science Foundation of Hunan Province (No. 2018JJ4058, No. 2019JJ70010, No. 2017JJ5036). Publication costs were funded by the National Natural Science Foundation of China (No. 61873221, No. 61672447). The funding recipient is Lei Wang (L.W.), whose contributions are stated in the Contributions section. The funding bodies played no role in the design of the study; the collection, analysis and interpretation of data; or the writing of the manuscript.
Author information
Contributions
YBX conceived the study. YBX, ZX, and LW developed the method. YBX and ZPC implemented the algorithms. LAK and YBX collected the data. XF performed the data analyses. YBX and LW wrote the manuscript. All authors have read and approved the manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1: The ROC curves achieved by FVTLDA_ANN in the framework of LOOCV.
Additional file 2: The ROC curves achieved by FVTLDA_MLR in the framework of 5-fold CV.
Additional file 3: The ROC curves achieved by FVTLDA_MLR in the framework of 10-fold CV.
Additional file 4: The ROC curves achieved by FVTLDA_ANN in the framework of 5-fold CV.
Additional file 5: The ROC curves achieved by FVTLDA_ANN in the framework of 10-fold CV.
Additional file 6: Known miRNA-disease associations obtained from HMDD.
Additional file 7: Known miRNA-lncRNA associations obtained from starBase v2.0.
Additional file 8: Known lncRNA-disease associations obtained from MNDR v2.0.
Additional file 9: Algorithms 1 and 2.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Xiao, Y., Xiao, Z., Feng, X. et al. A novel computational model for predicting potential LncRNA-disease associations based on both direct and indirect features of LncRNA-disease pairs. BMC Bioinformatics 21, 555 (2020). https://doi.org/10.1186/s12859-020-03906-7
Keywords
 LncRNA-disease association prediction
 Features
 Random walk
 Multiple linear regression
 Artificial neural network