A novel computational model for predicting potential LncRNA-disease associations based on both direct and indirect features of LncRNA-disease pairs

Xiao, Yubin; Xiao, Zheng; Feng, Xiang; Chen, Zhiping; Kuang, Linai; Wang, Lei

doi:10.1186/s12859-020-03906-7

Methodology article
Open access
Published: 02 December 2020

Yubin Xiao^1,3,
Zheng Xiao²,
Xiang Feng¹,
Zhiping Chen¹,
Linai Kuang³ &
…
Lei Wang ORCID: orcid.org/0000-0002-5065-8447^1,3

BMC Bioinformatics volume 21, Article number: 555 (2020)

Abstract

Background

Accumulating evidence has demonstrated that long non-coding RNAs (lncRNAs) are closely associated with human diseases, and identifying the relationships between lncRNAs and diseases is useful for disease diagnosis and treatment. Because traditional biological experiments are costly and time-consuming, a growing number of computational methods have been proposed in recent years to infer potential lncRNA-disease associations. However, these state-of-the-art prediction methods still have various limitations.

Results

In this manuscript, a novel computational model named FVTLDA is proposed to infer potential lncRNA-disease associations. Its major novelty lies in the integration of direct and indirect features related to lncRNA-disease associations, such as the feature vectors of lncRNA-disease pairs and their corresponding association probability fractions, which allows FVTLDA to be applied to diseases without known related lncRNAs and to lncRNAs without known related diseases. Moreover, FVTLDA neither relies solely on known lncRNA-disease associations nor requires any negative samples, which allows it to infer potential lncRNA-disease associations more equitably and effectively than traditional state-of-the-art prediction methods. Additionally, to avoid the limitations of single-model prediction techniques, we combine FVTLDA with Multiple Linear Regression (MLR) and with an Artificial Neural Network (ANN) for data analysis, respectively. Simulation results show that FVTLDA with MLR achieves AUCs of 0.8909, 0.8936 and 0.8970 in 5-Fold Cross Validation (fivefold CV), 10-Fold Cross Validation (tenfold CV) and Leave-One-Out Cross Validation (LOOCV), respectively, while FVTLDA with ANN achieves AUCs of 0.8766, 0.8830 and 0.8807 in fivefold CV, tenfold CV and LOOCV, respectively. Furthermore, in case studies of gastric cancer, leukemia and lung cancer, 8, 8 and 8 of the top 10 candidate lncRNAs predicted by FVTLDA with MLR, and 8, 7 and 8 of the top 10 candidate lncRNAs predicted by FVTLDA with ANN, have been verified by recent literature. Compared with the representative prediction model KATZLDA, FVTLDA with MLR and FVTLDA with ANN achieve average case study contrast scores of 0.8429 and 0.8515 respectively, both notably higher than the average case study contrast score of 0.6375 achieved by KATZLDA.

Conclusion

The simulation results show that FVTLDA has good prediction performance and can serve as a useful supplement to future bioinformatics research.

Background

LncRNAs were long considered to be transcriptional noise [1, 2]. However, in recent years, a growing number of studies have shown that lncRNAs play key roles in numerous important biological processes in humans, including chromatin modification, epigenetic regulation, cell cycle control, cell differentiation and so on [3,4,5,6]. In particular, accumulating biological experiments have confirmed that mutations and dysregulation of lncRNAs are associated with the development of diseases such as leukemia [7], neurological disorders [8], coronary artery diseases [9] and several cancers [10]. Hence, effectively inferring potential associations between lncRNAs and diseases can not only help to understand the pathogenesis of some complex diseases at the molecular level, but also provide biomarkers for disease diagnosis, therapy and prognosis. To date, along with the rapid increase in newly identified lncRNAs, several publicly available lncRNA-related databases, including lncRNADisease [11], NONCODE [12], lncRNAdb [13] and NRED [14], have been established. However, the number of known lncRNA-disease associations is still very limited, since traditional biological experiments are costly and time-consuming. Therefore, it is important and necessary to construct effective and high-throughput computational models to explore potential lncRNA-disease associations.

So far, researchers have developed numerous powerful computational models to predict potential lncRNA-disease associations, which can be roughly classified into three major categories according to their main implementation strategies [15]. Among them, the first category aims to adopt machine learning methods to predict potential lncRNA-disease associations. For example, Yu and Wang et al. proposed a prediction model based on the Naïve Bayes classifier [16] in 2018 and a prediction model based on the collaborative filtering algorithm [17] in 2019 to infer potential lncRNA-disease associations, respectively. Xuan and Wang et al. developed a probabilistic matrix factorization model based on the semi-supervised learning method to identify potential associations between lncRNAs and diseases [18]. In these prediction models of the first category, the major drawback lies in the requirement of negative samples as the training set, which will affect their prediction performances notably, since the negative samples are usually difficult to obtain. Of course, some models overcome this limitation. LRLSLDA is the first large-scale prediction model [19], which does not need the negative samples information, but how to choose the best parameters remains to be solved.

Different from the first category, the second category focuses on implementing propagation algorithms such as Random Walk on a heterogeneous network constructed by integrating the lncRNA-disease association network, the disease similarity network, the lncRNA similarity network, etc. For instance, in 2014, Sun et al. [20] established a global network-based computational model, which adopted the random walk with restart (RWR) algorithm to predict potential lncRNA-disease associations. In 2015, Zhou et al. [21] proposed a prediction model by implementing RWR on a heterogeneous network comprising the known lncRNA-disease association network, the miRNA-associated lncRNA crosstalk network and the disease similarity network. However, these two models can only be applied to lncRNAs with known related diseases or known miRNA-disease associations. To overcome this limitation, in 2015, Chen et al. [22] developed a computational model called KATZLDA for the prediction of potential lncRNA-disease associations, which can infer potential lncRNAs in the absence of known associated diseases. However, owing to the way its network is constructed, its predictions may be biased in favor of lncRNAs with more known related diseases and diseases with more known related lncRNAs.

According to the above descriptions, the prediction performance of the models in both categories is influenced by the number of known lncRNA-disease associations. However, the number of known lncRNA-disease associations confirmed by biological experiments is still very limited. Therefore, to avoid this drawback, the third category adopts indirect biological information to predict potential lncRNA-disease associations. For instance, in 2014, Liu et al. [23] proposed a novel prediction model by combining human lncRNA expression profiles, human disease-associated gene data and gene expression profiles, which can achieve promising prediction performance even when there are no known lncRNA-disease associations. However, it cannot be applied to lncRNAs without gene-related records.

Different from the above existing methods, in this manuscript, we proposed a novel computational model named FVTLDA to reveal potential lncRNA-disease associations. In FVTLDA, to avoid the limitation of various methods mentioned previously, we first introduce direct and indirect biological information on lncRNAs and diseases, including known lncRNA-miRNA-disease associations. Then, known lncRNA-disease associations will be utilized to extract direct features for lncRNA-disease pairs based on the concept of Disease Clique. Meanwhile, indirect biological information including known miRNA-disease associations and known miRNA-lncRNA associations will be utilized to extract indirect features for lncRNA-disease pairs by adopting the random walk with restart. What's more, to avoid the limitation of single model prediction techniques, based on the direct and indirect features obtained for lncRNA-disease pairs, the Multiple Linear Regression (MLR) and Artificial Neural Network (ANN) will be combined with FVTLDA to reveal potential lncRNA-disease associations, respectively. To estimate the prediction performance of FVTLDA, different frameworks including the LOOCV, fivefold CV and tenfold CV are implemented to compare FVTLDA with existing competing models. Simulation experiment results show that FVTLDA with MLR can achieve AUCs of 0.8909, 0.8936 and 0.8970 in fivefold CV, tenfold CV and LOOCV respectively, while FVTLDA with ANN can achieve AUCs of 0.8766, 0.8830 and 0.8807 in fivefold CV, tenfold CV and LOOCV separately, which both outperform existing state-of-the-art models. Meanwhile, in case studies of gastric cancer, leukemia and lung cancer, simulation experiment results show that there are 8, 8 and 8 out of top 10 candidate lncRNAs predicted by FVTLDA with MLR, and 8, 7 and 8 out of top 10 candidate lncRNAs predicted by FVTLDA with ANN, having been verified respectively in biological experimental studies or other independent studies. 
Finally, to further illustrate the actual predictive ability of FVTLDA, we compared it with the representative prediction model KATZLDA using a new measure, the case study contrast score, which aims to quantify the prediction ability of a model in case studies. The results show that the average case study contrast scores of FVTLDA with MLR and FVTLDA with ANN are 0.8429 and 0.8515 respectively, both notably higher than the average case study contrast score of 0.6375 obtained by KATZLDA.

Result

Performance evaluation

In order to evaluate the prediction performance of FVTLDA, in this section, we implement the LOOCV on FVTLDA as follows: For all known lncRNA-disease pairs, each pair with known correlations was selected in turn for testing, and other lncRNA-disease pairs were retained as training samples for model learning. Particularly, testing samples and lncRNA-disease pairs without known correlations were considered as candidates. After the implementation of FVTLDA, the ranking positions of test samples in candidates can be obtained according to the association probability fractions. If the ranking of a test sample is above the given threshold, it will be seen as a successful prediction or a positive sample. Otherwise, it is seen as an unsuccessful prediction or a negative sample. Besides, upon different thresholds, the corresponding true positive rate (TPR, sensitivity) and false positive rate (FPR, 1 − specificity) can be calculated as follows:

$$TPR = \frac{TP}{{TP + FN}}$$

(1)

$$FPR = \frac{FP}{{TN + FP}}$$

(2)

Here, TP and TN represent the correctly identified positive and negative samples separately, while FP and FN denote the incorrectly identified positive and negative samples, respectively.

Based on the above equations, the Receiver Operating Characteristic (ROC) curve can be drawn according to the TPRs and FPRs of different thresholds, and the area under ROC curve (AUC) will further be calculated to evaluate the performance of FVTLDA. The AUC value of 1 indicates the perfect prediction performance while the AUC value of 0.5 means a random guess.
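The threshold sweep behind Eqs. (1) and (2) can be illustrated with a short sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names `roc_points` and `auc`, the toy scores, and the use of the trapezoidal rule for the area are all choices made here for illustration.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by sweeping a threshold over the scores.

    scores: predicted association scores of the candidates (higher = more likely).
    labels: 1 for a held-out known association (test sample), 0 otherwise.
    Assumes at least one positive and one negative label.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Rank candidates from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Candidates with tied scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        # TPR = TP / (TP + FN) and FPR = FP / (TN + FP), as in Eqs. (1) and (2).
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example (hypothetical scores): one of the two positives is outranked by a negative.
toy_auc = auc(roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

In this toy case three of the four positive-negative pairs are ordered correctly, so the area is 0.75; a perfect ranking gives 1 and an uninformative one gives 0.5, as stated above.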

During simulation, we first compared FVTLDA_MLR (i.e., FVTLDA with MLR) with six state-of-the-art prediction models, namely NBCLDA [16], CFNBC [17], PMFILDA [18], KATZLDA [22], SIMCLDA [24] and IIRWR [25], in the framework of LOOCV; the comparison results are shown in Fig. 1. FVTLDA_MLR achieves an AUC of 0.8970, outperforming all six models by at least 0.0311 in AUC.

Moreover, to reduce the random error caused by the random initialization of weights and biases in FVTLDA_ANN (i.e., FVTLDA with ANN), we repeated the LOOCV on FVTLDA_ANN 20 times and took the mean and standard deviation of the AUC values as the result. As illustrated in Additional file 1, FVTLDA_ANN achieves a mean AUC of 0.8807 with a standard deviation (std) of 0.0047 in LOOCV, which outperforms these six state-of-the-art prediction models.

To further verify the prediction performance of FVTLDA when few lncRNA-disease associations are known, the frameworks of K-fold CV, including fivefold CV and tenfold CV, were implemented to compare FVTLDA_MLR with other representative prediction models. In K-fold CV, all known lncRNA-disease associations are divided equally into K parts; each part is left out in turn as the test sample, and the remaining lncRNA-disease pairs are used as training samples. As shown in Figs. 2 and 3, FVTLDA_MLR achieves better predictive performance than the other six competing models, which demonstrates that FVTLDA also performs well on sparse data sets.
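The partitioning step of K-fold CV can be sketched as follows. This is only an illustration of the splitting of known associations. The function name `k_fold_splits`, the `seed` parameter and the toy pairs are assumptions introduced here, and the sketch does not cover how unknown pairs are treated as candidates.

```python
import random


def k_fold_splits(known_pairs, k, seed=0):
    """Yield (train, test) lists of known lncRNA-disease pairs for K-fold CV.

    The known pairs are shuffled once and dealt into k folds of nearly equal
    size; each fold is held out in turn as the test set.
    """
    shuffled = list(known_pairs)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        yield train, test


# Toy example with hypothetical (lncRNA, disease) identifiers.
toy_pairs = [("l%d" % n, "d%d" % (n % 3)) for n in range(10)]
splits = list(k_fold_splits(toy_pairs, k=5))
```

Repeating the procedure with different values of `seed` corresponds to repeating the random partition, which is how a mean and standard deviation of the AUC could be obtained.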

Furthermore, to reduce the effect of the random partition of training samples, we repeated fivefold CV and tenfold CV 20 times each and took the mean and standard deviation of the AUC values as the results. As shown in Additional files 2 and 3, FVTLDA_MLR achieves a mean AUC of 0.8903 (std 0.0022) in fivefold CV and a mean AUC of 0.8940 (std 0.0014) in tenfold CV. Meanwhile, as shown in Additional files 4 and 5, FVTLDA_ANN achieves a mean AUC of 0.8766 (std 0.0043) in fivefold CV and a mean AUC of 0.8830 (std 0.0022) in tenfold CV.

Finally, to demonstrate that FVTLDA can perform well on different data sets, we further compared it with other state-of-the-art models, including HGLDA [26] and the method proposed by Yang et al. [27], in the framework of LOOCV. When comparing FVTLDA with HGLDA, we adopted the data set given by HGLDA, which consists of 183 experimentally validated lncRNA-disease associations. When comparing FVTLDA with the method proposed by Yang et al., we used the data set put forward by Yang et al., which consists of 319 known lncRNA-disease associations. FVTLDA outperforms both models on their respective data sets (Figs. 4 and 5).

Parameter analysis

In this section, influences of parameters in FVTLDA are estimated. The parameters r₁ and r₂ in Eq. (11) (See the section of Methods) and Eq. (14) represent the restart probabilities of the random walk, the parameter rate in Eq. (19) stands for the adjustment factor, and the parameters k₁ and k₂ in Eqs. (20) and (21) denote the attenuation factors, respectively.

To determine the optimal values of the above five parameters efficiently, we first traverse a coarse range of candidate values (0, 0.0001, 0.001, 0.01, 0.1) for each parameter, using FVTLDA with MLR in the framework of LOOCV. For parameters whose precision can be further improved, we take the approximate solution from the previous step as the default value and then obtain a more precise optimum by a further traversal. As illustrated in Table 1 (bold represents the best parameter), the optimal values of rate, r₁, r₂, k₁ and k₂ are 0.3, 0.001, 0.001, 0.008 and 0.007, respectively.

Table 1 Effects of the parameter to the performance of FVTLDA_MLR in LOOCV

Case study

To further demonstrate the predictive ability of FVTLDA, in this section we select gastric cancer, leukemia and lung cancer as case studies. During the simulation, for any given disease d_i ∈ {gastric cancer, leukemia, lung cancer}, only those lncRNAs that have no known association with d_i are considered as valid candidates for d_i. Next, all candidate lncRNAs are ranked according to their association probability fractions calculated by FVTLDA. Finally, the top 10 candidate d_i-related lncRNAs are verified against recent articles and experiments published in the NCBI database (https://www.ncbi.nlm.nih.gov/). Additionally, to compare the prediction performance of FVTLDA_MLR and FVTLDA_ANN with each other and with another representative prediction model, KATZLDA, we list the top 10 candidate d_i-related lncRNAs predicted by FVTLDA_MLR, FVTLDA_ANN and KATZLDA separately, together with their rankings and relevant evidence. Moreover, to quantify the predictive ability of these three prediction models in the above case studies, we propose a novel concept of case study contrast score, which can be calculated as follows:

$$score = \exp \left( {\mathop \sum \limits_{i = 1}^{m} \left( {\frac{1}{{R_{i} }} - \frac{1}{i}} \right)} \right)$$

(3)

Here, m denotes the number of verified lncRNAs in the top 10 predicted candidate lncRNAs, and R_i represents the ranking corresponding to the ith confirmed lncRNA. The better the practical predictive ability of a model, the closer its score is to 1. For example, in Table 2, the case study contrast score of FVTLDA_MLR = $e^{{\left( {1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{5} + \cdots + \frac{1}{7} + \frac{1}{9} + \frac{1}{10} + \frac{1}{29} + \frac{1}{11}} \right) - \left( {1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{9} + \frac{1}{10}} \right)}} = 0.7168$.
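The worked example for Eq. (3) can be checked numerically with a short sketch. The function name `case_study_contrast_score` is hypothetical. The list of ranks follows the worked example above, which sums ten terms, and the rank 6 hidden behind the ellipsis between 1/5 and 1/7 is an assumption made here.

```python
import math


def case_study_contrast_score(ranks):
    """Case study contrast score: exp(sum_i (1/R_i - 1/i)) for i = 1..len(ranks).

    ranks: the rankings R_i of the confirmed lncRNAs. The score equals 1 when
    the confirmed lncRNAs occupy ranks 1, 2, ..., len(ranks) exactly.
    """
    achieved = sum(1.0 / r for r in ranks)
    ideal = sum(1.0 / i for i in range(1, len(ranks) + 1))
    return math.exp(achieved - ideal)


# Ranks read from the worked example for FVTLDA_MLR on gastric cancer.
# Assumption: the term elided between 1/5 and 1/7 is 1/6.
ranks_mlr_gastric = [1, 2, 4, 5, 6, 7, 9, 10, 29, 11]
score_mlr_gastric = case_study_contrast_score(ranks_mlr_gastric)
print(round(score_mlr_gastric, 4))  # 0.7168, the value quoted in the text
```

Because each confirmed lncRNA contributes 1/R_i, a miss near the top of the list lowers the score far more than a miss near the bottom.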

Table 2 Top 10 potential gastric cancer-related lncRNAs and their PubMed unique identifiers predicted by FVTLDA_MLR, FVTLDA_ANN and KATZLDA

Gastric cancer is the second leading cause of cancer-related deaths and the fourth most common cancer in the world [28, 29]. To date, a large number of lncRNAs have been shown to be related to gastric cancer [30, 31]. FVTLDA_MLR, FVTLDA_ANN and KATZLDA successfully predict 8, 8 and 8 confirmed lncRNAs among their top 10 candidate lncRNAs, respectively (Table 2), and their corresponding case study contrast scores are 0.7168, 0.8377 and 0.8439, respectively.

As for leukemia, its association with some lncRNAs has been widely reported [32, 33]. FVTLDA_MLR, FVTLDA_ANN and KATZLDA successfully predict 8, 8 and 8 confirmed lncRNAs among their top 10 candidate lncRNAs, respectively (Table 3), and their corresponding case study contrast scores are 0.9448, 0.9753 and 0.9688, respectively.

Table 3 Top 10 potential leukemia-related lncRNAs and their PubMed unique identifiers predicted by FVTLDA_MLR, FVTLDA_ANN and KATZLDA

Moreover, lung cancer is also a leading cause of cancer death all over the world, regardless of gender [34]. FVTLDA_MLR and FVTLDA_ANN successfully predict 8 and 7 confirmed lncRNAs among their top 10 candidate lncRNAs, respectively (Table 4), whereas KATZLDA predicts only 1 confirmed lncRNA among its top 10 candidates. Additionally, the case study contrast scores of FVTLDA_MLR, FVTLDA_ANN and KATZLDA are 0.8670, 0.7414 and 0.0998, respectively.

Table 4 Top 10 potential lung cancer-related lncRNAs and their PubMed unique identifiers predicted by FVTLDA_MLR, FVTLDA_ANN and KATZLDA

In conclusion, FVTLDA achieves excellent prediction performance in these case studies, and the average case study contrast scores of FVTLDA_MLR (0.8429) and FVTLDA_ANN (0.8515) are both higher than that of KATZLDA (0.6375).

Discussion

A lot of evidence has demonstrated that lncRNAs play an important role in the pathological changes of human diseases, and identification of disease-related lncRNAs can help us better understand the disease mechanisms at the molecular level. However, it is costly and time-consuming to verify lncRNA-disease associations with biological experiments. Thus, it is important and necessary to develop efficient computational models to predict potential lncRNA-disease associations.

Different from state-of-the-art prediction models, in this paper, a novel computational model called FVTLDA is proposed to predict potential lncRNA-disease associations based on direct and indirect biological information. To avoid the limitation of single-model prediction techniques, we further combine FVTLDA with multiple linear regression and artificial neural networks respectively. Moreover, to evaluate the prediction performance of FVTLDA, we conducted intensive experiments. Simulation results demonstrate that FVTLDA achieves better performance than six other available state-of-the-art prediction models. Additionally, in case studies of gastric cancer, leukemia and lung cancer, simulation results show that the prediction ability and stability of both FVTLDA with MLR and FVTLDA with ANN are better than those of competing methods.

Certainly, despite its good prediction performance, the current version of FVTLDA can still be improved. For example, we can increase the complexity of the neural networks in FVTLDA. In addition, more useful information sources, including gene-disease associations, can be integrated into the feature vectors of lncRNA-disease pairs to further improve the prediction performance of FVTLDA. In the future, we can also study association prediction in other fields of computational biology, such as miRNA-disease association prediction [35,36,37] and drug-target interaction prediction [38, 39], which may bring valuable insights to the development of lncRNA-disease association prediction.

Conclusion

In this manuscript, a novel computational model named FVTLDA is proposed. FVTLDA addresses three problems of other models: (1) some models cannot be applied to isolated nodes; (2) some methods require negative samples, which are difficult to obtain; (3) some approaches may be biased towards known nodes. Besides, we combine FVTLDA with Multiple Linear Regression and Artificial Neural Network for data analysis respectively, and the results and case studies show that our model outperforms other state-of-the-art models, which indicates that FVTLDA can be an excellent tool for future research.

Method

To introduce direct and indirect biological information on lncRNA-disease associations into FVTLDA, we first collected three kinds of known associations, namely miRNA-disease associations, miRNA-lncRNA associations and lncRNA-disease associations, from various databases. Then, based on these three datasets, we constructed three incidence matrices as follows:

Step 1 First, we downloaded the datasets of known miRNA-disease associations and miRNA-lncRNA associations from the HMDD [40] and starBase v2.0 [41] databases, respectively. After removing duplicate associations supported by multiple pieces of evidence and normalizing the names of the miRNAs in these two datasets, we finally obtained 4704 unique miRNA-disease associations between 246 miRNAs and 373 diseases (see Additional file 6), and 9086 different miRNA-lncRNA associations between 246 miRNAs and 1089 lncRNAs (see Additional file 7). Thereafter, based on these two datasets, we constructed a 246 × 373 dimensional miRNA-disease association incidence matrix MD and a 246 × 1089 dimensional miRNA-lncRNA association incidence matrix ML. In MD, MD(i,j) = 1 if and only if there exists a known association between the miRNA m_i and the disease d_j, and MD(i,j) = 0 otherwise. Similarly, in ML, ML(i,j) = 1 if and only if there exists a known association between the miRNA m_i and the lncRNA l_j, and ML(i,j) = 0 otherwise. For convenience, we define the numbers of miRNAs, diseases and lncRNAs obtained above as N_m, N_{d_MD} and N_{l_ML} respectively, so that N_m = 246, N_{d_MD} = 373 and N_{l_ML} = 1089.

Step 2 Next, we downloaded the dataset of known lncRNA-disease associations from the MNDR v2.0 database [42]. After removing duplicate associations supported by multiple pieces of evidence, as illustrated in Fig. 6, we further discarded associations whose lncRNAs are not among the N_{l_ML} lncRNAs or whose diseases are not among the N_{d_MD} diseases obtained in Step 1. Finally, we obtained 407 lncRNA-disease associations between 77 different lncRNAs and 95 different diseases (see Additional file 8). Similarly, based on this dataset, we constructed a 77 × 95 dimensional lncRNA-disease association incidence matrix LD, in which LD(i,j) = 1 if and only if there exists a known association between the lncRNA l_i and the disease d_j, and LD(i,j) = 0 otherwise. For convenience, we define the numbers of lncRNAs and diseases obtained above as N_{l_LD} and N_{d_LD} respectively, so that N_{l_LD} = 77 and N_{d_LD} = 95.
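The construction of the incidence matrices MD, ML and LD from lists of association pairs can be sketched as follows. This is a minimal illustration only: the function name `incidence_matrix` and the toy identifiers are hypothetical, and the real inputs would be the association lists in Additional files 6 to 8.

```python
def incidence_matrix(associations, row_names, col_names):
    """Build a 0/1 incidence matrix from (row_name, col_name) association pairs.

    Entry (i, j) is 1 if and only if row_names[i] and col_names[j] are known to
    be associated. Repeated pairs collapse to a single 1, and pairs that mention
    a name outside row_names or col_names are skipped, which mirrors the
    filtering described in Step 2.
    """
    row_index = {name: i for i, name in enumerate(row_names)}
    col_index = {name: j for j, name in enumerate(col_names)}
    matrix = [[0] * len(col_names) for _ in row_names]
    for row, col in associations:
        if row in row_index and col in col_index:
            matrix[row_index[row]][col_index[col]] = 1
    return matrix


# Toy example with hypothetical identifiers.
toy_lncrnas = ["lnc-a", "lnc-b"]
toy_diseases = ["dis-x", "dis-y", "dis-z"]
toy_associations = [
    ("lnc-a", "dis-x"),
    ("lnc-a", "dis-x"),   # duplicate evidence for the same pair
    ("lnc-b", "dis-z"),
    ("lnc-c", "dis-y"),   # lncRNA outside the retained set, skipped
]
toy_ld = incidence_matrix(toy_associations, toy_lncrnas, toy_diseases)
# toy_ld == [[1, 0, 0], [0, 0, 1]]
```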

Construction of the Gaussian interaction profile kernel similarity for miRNAs based on miRNA-lncRNA associated information

According to the assumption that similar miRNAs tend to interact with similar lncRNAs [43], the Gaussian interaction profile kernel similarity between the miRNA m_i and the miRNA m_j can be calculated as follows:

$$KM\left( {m_{i} ,m_{j} } \right) = \exp \left( { - \gamma_{m} \left| {\left| {IP\left( {m_{i} } \right) - IP\left( {m_{j} } \right)} \right|} \right|^{2} } \right)$$

(4)

$$\gamma_{m} = \frac{{\gamma_{m}^{\prime } }}{{\mathop \sum \nolimits_{k = 1}^{{N_{m} }} \left| {\left| {IP\left( {m_{k} } \right)} \right|} \right|^{2} }}$$

(5)

Here, IP(m_i) denotes the ith row in the miRNA-lncRNA association incidence matrix ML, γ_m denotes the normalized bandwidth based on the new bandwidth parameter γ_m′, and in this paper γ_m′ will be set to 1 according to previous experiments [44]. In this way, an N_m × N_m dimensional Gaussian interaction profile kernel similarity matrix KM for miRNAs can be established.
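Equations (4) and (5) can be illustrated with a small pure-Python sketch. The function name `gaussian_kernel_similarity` and the toy profiles are assumptions made here. The bandwidth normalization follows Eq. (5) exactly as written above, that is, γ_m′ divided by the sum of squared profile norms.

```python
import math


def gaussian_kernel_similarity(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel similarity between the rows of a matrix.

    profiles: list of interaction profiles, one row per miRNA (rows of ML).
    Assumes at least one profile is non-zero, so the bandwidth is well defined.
    """
    # Eq. (5): normalized bandwidth.
    norm_sum = sum(sum(x * x for x in row) for row in profiles)
    gamma = gamma_prime / norm_sum
    n = len(profiles)
    km = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Squared Euclidean distance between the two interaction profiles.
            dist_sq = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            km[i][j] = math.exp(-gamma * dist_sq)  # Eq. (4)
    return km


# Toy example: three hypothetical miRNAs over two lncRNAs.
toy_ml = [[1, 0], [0, 1], [1, 0]]
toy_km = gaussian_kernel_similarity(toy_ml)
```

In the toy example the first and third miRNAs share an identical profile and therefore have similarity 1, while the first and second differ in both positions and receive a lower similarity.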

Construction of the functional similarity for miRNAs based on miRNA-disease associated information

In recent years, disease semantic similarity has been widely utilized to identify potential miRNA-disease associations, and many previous studies have shown the validity of this similarity [45,46,47,48,49,50]. In this study, we calculated the disease semantic similarity in the same way as in previous studies [49]. For each disease, we first downloaded its corresponding Medical Subject Headings (MeSH) descriptors from the National Library of Medicine (http://www.nlm.nih.gov/) [49], and then represented a disease d_A as a directed acyclic graph (DAG), DAG(d_A) = (D(d_A), E(d_A)). Here, D(d_A) consists of the disease node d_A itself and all ancestor nodes of d_A, while E(d_A) is composed of all the directed edges from parent nodes to child nodes. For example, the codes for breast neoplasm are C04.588.180 and C17.800.090.500, and the corresponding parent nodes are C04.588 (neoplasms by site) and C17.800.090 (breast diseases) [49]. As in the previous study [18], for any two disease nodes d and t, we calculate the contribution of t to the semantic value of d as follows:

$$D_{d} \left( t \right) = \left\{ {\begin{array}{*{20}l} 0 \hfill & {\text{if}\;t \notin DAG\left( d \right)} \hfill \\ 1 \hfill & {\text{if}\;t \in DAG\left( d \right)\;\text{and}\;t = d} \hfill \\ {\max \left\{ {\Delta *D_{d} \left( {t^{\prime}} \right) \mid t^{\prime} \in \text{children of}\;t} \right\}} \hfill & {\text{if}\;t \in DAG\left( d \right)\;\text{and}\;t \ne d} \hfill \\ \end{array} } \right.$$

(6)

where ∆ denotes the semantic contribution decay factor, and according to the previous study [49], in this paper, ∆ will be set to 0.5. Thereafter, we can calculate the semantic value of the disease d through combining all these diseases in its DAG(d) as follows:

$$D\left( d \right) = \mathop \sum \limits_{{t_{i} \in DAG\left( d \right)}} D_{d} \left( {t_{i} } \right)$$

(7)

According to the assumption that two diseases with a larger number of shared nodes in their DAGs may have higher similarity, we can calculate the disease semantic similarity score between a pair of diseases d_i and d_j as follows:

$$DS_{MD} \left( {i,j} \right) = \frac{{\mathop \sum \nolimits_{{t \in \left( {DAG\left( {d_{i} } \right) \cap DAG\left( {d_{j} } \right)} \right)}} \left( {D_{{d_{i} }} \left( t \right) + D_{{d_{j} }} \left( t \right)} \right)}}{{D\left( {d_{i} } \right) + D\left( {d_{j} } \right)}}$$

(8)

According to the above formula, it is obvious that an N_{d_MD} × N_{d_MD} dimensional matrix DS_MD can be established. Meanwhile, after extracting the semantic similarity information of disease in the lncRNA-disease association from the matrix DS_MD, we can further build an N_{d_LD} × N_{d_LD} dimensional matrix DS_LD as well.
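Equations (6) to (8) can be sketched on a toy DAG as follows. The function names, the child-to-parents dictionary representation and the node labels are assumptions made for illustration and are not real MeSH entries. The decay factor defaults to the value 0.5 stated above.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Eq. (6): contribution D_d(t) of every node t in DAG(d) to disease d.

    parents: dict mapping each node to the list of its parent nodes.
    The disease itself contributes 1; each ancestor receives delta times the
    largest contribution among its children that lie in DAG(d).
    """
    contrib = {disease: 1.0}
    stack = [disease]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, []):
            value = delta * contrib[node]
            # Keep the maximum over all children; revisit the parent if it improved.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                stack.append(parent)
    return contrib


def semantic_similarity(d_i, d_j, parents, delta=0.5):
    """Eq. (8): shared contributions divided by the two semantic values of Eq. (7)."""
    c_i = semantic_contributions(d_i, parents, delta)
    c_j = semantic_contributions(d_j, parents, delta)
    shared = set(c_i) & set(c_j)
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    return numerator / (sum(c_i.values()) + sum(c_j.values()))


# Toy DAG (hypothetical labels): A and B share parent P, whose parent is R.
toy_parents = {"A": ["P"], "B": ["P"], "P": ["R"]}
toy_sim = semantic_similarity("A", "B", toy_parents)
# Shared nodes P and R give (0.5 + 0.5) + (0.25 + 0.25) = 1.5; each semantic value is 1.75.
```

Here the similarity is 1.5 / 3.5, roughly 0.43, and the similarity of a disease with itself is 1.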

After obtaining the semantic similarity scores of diseases, we can obtain the functional similarity between miRNAs based on the assumption that miRNAs with similar functions are often implicated in similar diseases [49], as follows: for any two given miRNAs m_i and m_j, let the sets of diseases known to be related to m_i and m_j be GDM(m_i) = {d_i1,d_i2,d_i3,…,d_ip} and GDM(m_j) = {d_j1,d_j2,d_j3,…,d_jq} respectively; then the functional similarity score between m_i and m_j can be calculated as follows:

$$FM\left( {m_{i} ,m_{j} } \right) = \frac{{\mathop \sum \nolimits_{t = 1}^{p} \max \left( {DS_{MD} \left( {d_{it} ,GDM\left( {m_{j} } \right)} \right)} \right) + \mathop \sum \nolimits_{t = 1}^{q} {\max}\left( {DS_{MD} \left( {d_{jt} ,GDM\left( {m_{i} } \right)} \right)} \right)}}{p + q}$$

(9)

According to the above equation, an N_m × N_m dimensional functional similarity matrix FM for miRNAs can be established. In the same way, let the sets of diseases known to be associated with lncRNAs l_i and l_j be GDL(l_i) = {d_i1,d_i2,d_i3,…,d_ip} and GDL(l_j) = {d_j1,d_j2,d_j3,…,d_jq} respectively; then the functional similarity score between l_i and l_j can be calculated according to the following equation:

$$FL\left( {l_{i} ,l_{j} } \right) = \frac{{\mathop \sum \nolimits_{t = 1}^{p} \max \left( {DS_{LD} \left( {d_{it} ,GDL\left( {l_{j} } \right)} \right)} \right) + \mathop \sum \nolimits_{t = 1}^{q} {\max}\left( {DS_{LD} \left( {d_{jt} ,GDL\left( {l_{i} } \right)} \right)} \right)}}{p + q}$$

(10)
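The best-match averaging used in Eqs. (9) and (10) can be sketched as follows. The function name `functional_similarity`, the nested-dictionary form of the disease similarity matrix and the toy values are assumptions made for illustration. Returning 0 when both disease sets are empty is also an assumption, since that case is not specified above.

```python
def functional_similarity(diseases_a, diseases_b, ds):
    """Functional similarity of two entities from their associated disease sets.

    diseases_a, diseases_b: lists of diseases associated with each entity
    (GDM for miRNAs, GDL for lncRNAs).
    ds: nested dict with ds[d1][d2] holding the disease semantic similarity.
    """
    if not diseases_a or not diseases_b:
        # Assumption: no similarity can be derived without disease information.
        return 0.0

    def best_match(disease, group):
        # Similarity between one disease and a disease set: the best match in the set.
        return max(ds[disease][other] for other in group)

    total = sum(best_match(d, diseases_b) for d in diseases_a)
    total += sum(best_match(d, diseases_a) for d in diseases_b)
    return total / (len(diseases_a) + len(diseases_b))


# Toy semantic similarity matrix over two hypothetical diseases.
toy_ds = {"d1": {"d1": 1.0, "d2": 0.5}, "d2": {"d1": 0.5, "d2": 1.0}}
toy_fs = functional_similarity(["d1", "d2"], ["d1"], toy_ds)
# (1.0 + 0.5) from the first set plus 1.0 from the second, divided by 3.
```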

Construction of FVTLDA

As illustrated in Fig. 7, FVTLDA consists of the following three major steps:

Step a According to indirect biological information including known miRNA-lncRNA associations and known miRNA-disease associations downloaded above, for each pair of lncRNA and disease, a unique feature vector will be constructed first by adopting the random walk with restart based on the Gaussian interaction profile kernel similarity for miRNAs and functional similarity for miRNAs.

Step b Next, according to known lncRNA-disease associations downloaded above, for each pair of lncRNA and disease, a unique association probability fraction will be calculated based on the concept of Disease Clique [25].

Step c Finally, based on the feature vectors and association probability fractions obtained above, the Multiple Linear Regression (MLR) and the Artificial Neural Network (ANN) will be integrated to infer relationships between feature vectors and corresponding association probability fractions. And then, based on these predicted relationships, for each pair of lncRNA and disease, the potential association between them will be mapped into a probability score. Thereafter, based on these probability scores, we can rank the associations between lncRNAs and diseases conveniently.

Construction of feature vectors for lncRNA-disease pairs

As shown in Fig. 7a, for each lncRNA-disease pair, the construction of its feature vector consists of the following three major steps:

Step 1 Based on formula (11), construct the miRNA-lncRNA association probability fraction matrix PL from the known miRNA-lncRNA associations and the Gaussian interaction profile kernel similarity for miRNAs. Then, for each lncRNA l_i, the column corresponding to l_i in the matrix PL is taken as the feature vector of l_i.

Step 2 Based on formula (14), construct the miRNA-disease association probability fraction matrix PD from the known miRNA-disease associations and the miRNA functional similarity. Then, for each disease d_j, the column corresponding to d_j in the matrix PD is taken as the feature vector of d_j.

Step 3 For each lncRNA-disease pair (l_i,d_j), obtain its feature vector by integrating the feature vector of l_i with the feature vector of d_j according to formula (17).

Random walk is commonly used to rank the association probabilities of nodes in a network [50]; we therefore implement the random walk with restart on the miRNA-lncRNA association network to obtain the feature vectors of lncRNAs as follows. For any given lncRNA node l_i, the walker starts from all miRNA nodes known to be related to l_i and moves from the current node to the next according to the Gaussian interaction profile kernel similarity between miRNA nodes. Supposing that the walk is restarted with probability r₁ (0 < r₁ < 1) at each step, the random walk process can be described by the following formulas:

$$PL_{s + 1} = \left( {1 - r_{1} } \right)*NKM^{T} *PL_{s} + r_{1} *PL_{0}$$

(11)

$$NKM\left( {i,j} \right) = \frac{{KM\left( {i,j} \right)}}{{\mathop \sum \nolimits_{k = 1}^{{N_{m} }} KM\left( {i,k} \right)}}$$

(12)

$$PL_{0} \left( {i,j} \right) = \frac{{ML\left( {i,j} \right)}}{{\mathop \sum \nolimits_{k = 1}^{{N_{m} }} ML\left( {k,j} \right)}}$$

(13)

The random walk is an iterative process that stops once it reaches a stable state. Here, balancing time efficiency against accuracy, the walk is considered stable when the difference between PL_{s+1} and PL_s is less than 10^−10. In this way, for each lncRNA l_i, the feature vector of l_i is expressed by the association probability fractions between all miRNAs and l_i, i.e., by the ith column of the matrix PL.
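The iteration in formulas (11) to (13) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' Matlab implementation: the text does not state which norm measures the difference between successive matrices, so the largest absolute entry-wise change is assumed here, and all function names, the `max_iter` safeguard, the handling of all-zero rows or columns and the toy matrices are assumptions.

```python
def row_normalize(km):
    """Formula (12): divide every row of the similarity matrix by its row sum."""
    return [[v / sum(row) if sum(row) else 0.0 for v in row] for row in km]


def column_normalize(ml):
    """Formula (13): divide every column of the association matrix by its column sum."""
    col_sums = [sum(col) for col in zip(*ml)]
    return [[v / s if s else 0.0 for v, s in zip(row, col_sums)] for row in ml]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def random_walk_with_restart(km, ml, r, tol=1e-10, max_iter=10000):
    """Iterate formula (11) until successive matrices differ by less than tol."""
    nkm_t = [list(col) for col in zip(*row_normalize(km))]  # NKM^T
    p0 = column_normalize(ml)                               # PL_0
    p = p0
    for _ in range(max_iter):
        walked = matmul(nkm_t, p)
        nxt = [[(1 - r) * w + r * q for w, q in zip(w_row, q_row)]
               for w_row, q_row in zip(walked, p0)]
        # Assumed convergence measure: largest absolute entry-wise change.
        diff = max(abs(a - b)
                   for a_row, b_row in zip(nxt, p)
                   for a, b in zip(a_row, b_row))
        p = nxt
        if diff < tol:
            break
    return p


# Toy data: 3 miRNAs, 2 lncRNAs (values invented for illustration).
KM = [[1.0, 0.5, 0.2],
      [0.5, 1.0, 0.3],
      [0.2, 0.3, 1.0]]
ML = [[1, 0],
      [1, 1],
      [0, 1]]
PL = random_walk_with_restart(KM, ML, r=0.3)
```

In this toy example the third miRNA has no known association with the first lncRNA, yet it receives a positive entry in the first column of `PL` because it is similar to miRNAs that do; each column of `PL` still sums to 1, since the row-normalised similarity matrix preserves column sums.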

Similarly, for each disease d_j, let the random walk be restarted with probability r₂ (0 < r₂ < 1); its feature vector can then be obtained according to the following equations:

$$PD_{s + 1} = \left( {1 - r_{2} } \right) * NFM^{T} * PD_{s} + r_{2} * PD_{0}$$

(14)

$$NFM\left( {i,j} \right) = \frac{{FM\left( {i,j} \right)}}{{\mathop \sum \nolimits_{k = 1}^{{N_{m} }} FM\left( {i,k} \right)}}$$

(15)

$$PD_{0} \left( {i,j} \right) = \frac{{MD\left( {i,j} \right)}}{{\mathop \sum \nolimits_{k = 1}^{{N_{m} }} MD\left( {k,j} \right)}}$$

(16)

Finally, for each lncRNA-disease pair (l_i,d_j), its feature vector can be calculated by combining the feature vectors of both l_i and d_j as follows:

$$FV_{ij} = PL\left( i \right) \otimes PD\left( j \right)$$

(17)

Here, PL(i) and PD(j) represent the ith column of the matrix PL and jth column of the matrix PD respectively. Moreover, for two column vectors A = (a₁, a₂,…,a_n)^T and B = (b₁,b₂,…,b_n)^T, A $\otimes$ B = (a₁ × b₁,a₂ × b₂,…,a_n × b_n)^T.
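As a small illustration of formula (17), the sketch below builds the feature vector of one lncRNA-disease pair as the element-wise product of a column of PL and a column of PD. The matrices are invented toy values with three miRNAs, and the function name is hypothetical.

```python
def pair_feature_vector(pl, pd, i, j):
    """Formula (17): element-wise product of column i of PL and column j of PD."""
    return [pl_row[i] * pd_row[j] for pl_row, pd_row in zip(pl, pd)]


# Rows are miRNAs; columns are lncRNAs (PL) or diseases (PD).
PL = [[0.6, 0.1],
      [0.3, 0.4],
      [0.1, 0.5]]
PD = [[0.2, 0.7, 0.0],
      [0.5, 0.2, 0.5],
      [0.3, 0.1, 0.5]]
fv = pair_feature_vector(PL, PD, 0, 2)
```

Each entry of the resulting vector corresponds to one miRNA, and an entry is nonzero only if that miRNA has a nonzero association probability fraction with both the lncRNA and the disease.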

In this way, all the feature vectors obtained will be independent of one another, with no collinearity among them.

Construction of association probability fractions for LncRNA-disease pair

The incidence matrix LD obtained from known lncRNA-disease associations can only reflect whether or not lncRNAs have known associations with diseases; it cannot express the degree of these relationships. Moreover, an element of LD equal to 0 only means that there is currently no known association between the corresponding lncRNA and disease, not that no association exists between them. Thus, the values in the matrix LD need to be further processed. Here, we turn this classification problem into a regression problem. Following the definition of the Disease Clique proposed in a previous study [25], for each given disease d_i, we define the set consisting of all nonzero elements in the ith row of the matrix DS_LD as the Disease Clique of d_i. Then, as shown in Fig. 8, the lncRNA-disease association incidence matrix LD can be revised as follows:

$$OUTPUT\left( {i,j} \right) = \frac{{OUT\left( {i,j} \right) - {\min}\left( {OUT} \right)}}{{\max \left( {OUT} \right) - {\min}\left( {OUT} \right)}}$$

(18)

$$OUT = rate * FOUT + \left( {1 - rate} \right) * DOUT$$

(19)

$$FOUT\left( {i,j} \right) = \left\{ {\begin{array}{*{20}l} {\mathop \sum \limits_{n = 1}^{{N_{l\_LD} }} k_{1} * LD\left( {n,j} \right) * FL\left( {i,n} \right)} \hfill & {if\;LD\left( {i,j} \right) \ne 1} \hfill \\ {\left( {\mathop \sum \limits_{n = 1}^{{N_{l\_LD} }} k_{1} * LD\left( {n,j} \right) * FL\left( {i,n} \right)} \right) + 1 - k_{1} } \hfill & {if\;LD\left( {i,j} \right) = 1} \hfill \\ \end{array} } \right.$$

(20)

$$DOUT\left( {i,j} \right) = \left\{ {\begin{array}{*{20}l} {\mathop \sum \limits_{n = 1}^{{N_{d\_LD} }} k_{2} * LD\left( {i,n} \right) * DS_{LD} \left( {n,j} \right)} \hfill & {if\;LD\left( {i,j} \right) \ne 1} \hfill \\ {\left( {\mathop \sum \limits_{n = 1}^{{N_{d\_LD} }} k_{2} * LD\left( {i,n} \right) * DS_{LD} \left( {n,j} \right)} \right) + 1 - k_{2} } \hfill & { if\;LD\left( {i,j} \right) = 1} \hfill \\ \end{array} } \right.$$

(21)

The probability fraction matrix OUTPUT obtained from formula (18) not only alleviates the sparsity of the original association incidence matrix LD, but also reflects, to some extent, the degree of relationship between lncRNAs and diseases.
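Formulas (18) to (21) can be sketched as follows. The sketch assumes that LD has one row per lncRNA and one column per disease, which is what the index pattern LD(n,j) * FL(i,n) and LD(i,n) * DS_LD(n,j) implies; the parameter values k1 = k2 = rate = 0.5, the toy matrices and the treatment of a constant matrix in the final normalisation are assumptions made only for illustration.

```python
def association_fractions(ld, fl, ds, k1, k2, rate):
    """Build the OUTPUT matrix of formula (18) from LD, FL and DS_LD."""
    n_l, n_d = len(ld), len(ld[0])
    out = [[0.0] * n_d for _ in range(n_l)]
    for i in range(n_l):
        for j in range(n_d):
            # Formula (20): evidence from lncRNAs similar to l_i that are linked to d_j.
            fout = k1 * sum(ld[n][j] * fl[i][n] for n in range(n_l))
            # Formula (21): evidence from diseases similar to d_j that are linked to l_i.
            dout = k2 * sum(ld[i][n] * ds[n][j] for n in range(n_d))
            if ld[i][j] == 1:  # known associations receive the extra constant terms
                fout += 1 - k1
                dout += 1 - k2
            out[i][j] = rate * fout + (1 - rate) * dout  # formula (19)
    lo = min(min(row) for row in out)
    hi = max(max(row) for row in out)
    if hi == lo:
        return [[0.0] * n_d for _ in range(n_l)]  # assumption: constant matrix maps to zeros
    # Formula (18): min-max normalisation to the range [0, 1].
    return [[(v - lo) / (hi - lo) for v in row] for row in out]


# Toy data: 3 lncRNAs, 2 diseases (values invented for illustration).
LD = [[1, 0],
      [0, 1],
      [0, 0]]
FL = [[1.0, 0.4, 0.6],
      [0.4, 1.0, 0.2],
      [0.6, 0.2, 1.0]]
DS = [[1.0, 0.3],
      [0.3, 1.0]]
OUTPUT = association_fractions(LD, FL, DS, k1=0.5, k2=0.5, rate=0.5)
```

In this toy example the third lncRNA has no known disease at all, yet its entry for the first disease is larger than its entry for the second, because it is more similar to the lncRNA linked to the first disease. This is the sense in which the revised matrix is less sparse than LD.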

Construction of FVTLDA with MLR and FVTLDA with ANN

To avoid the limitations of a single-model prediction scheme, in this section we present two different methods, namely the Multiple linear regression (MLR) analysis and the Artificial neural network (ANN), to reveal the potential relationship between the feature vector of any given lncRNA-disease pair and its association probability fraction.

Construction of FVTLDA with MLR

MLR analysis is widely used in statistics [51,52,53] to determine the quantitative relationship between a dependent variable and several independent variables; its general form can be expressed as follows:

$$Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \cdots + \beta_{k} X_{k} \pm e$$

(22)

Here, Y represents the dependent variable, {X₁, X₂,…, X_k} denote the independent variables of Y, β₀ is the constant term, {β₁, β₂,…, β_k} are the partial regression coefficients of {X₁, X₂,…, X_k} respectively, and e denotes the error value. Based on formula (22), for each lncRNA-disease pair (l_i,d_j), we can represent the relationship between its association probability fraction OUTPUT(i,j) and its feature vector as follows:

$$OUTPUT\left( {i,j} \right) = \beta_{0} * 1 + \beta_{1} * FV_{ij} \left( 1 \right) + \beta_{2} * FV_{ij} \left( 2 \right) + \cdots + \beta_{{N_{m} }} * FV_{ij} \left( {N_{m} } \right)$$

(23)

Moreover, for convenience, we define the regression coefficients as W = [β₀, β₁, β₂,…, β_{N_m}], the feature vector of (l_i,d_j) as x_n = [1,FV_ij(1),FV_ij(2),…,FV_ij(N_m)], and the association probability fraction corresponding to (l_i,d_j) as y_n = OUTPUT(i,j). Then, for a given training set T = {(x₁,y₁),(x₂,y₂),…,(x_N,y_N)}, let X = (x₁,x₂,…,x_N)^T and Y = (y₁,y₂,…,y_N)^T; the regression coefficients W can then be estimated by the least squares method, and the optimal solution W^* is given by:

$$W^{*} = \left( {X^{T} X} \right)^{ - 1} X^{T} Y$$

(24)

Finally, based on the above formulas, our prediction model FVTLDA with MLR can be described by Algorithm 1 (in Additional file 9).
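Independently of Algorithm 1, the least squares step of formulas (23) and (24) can be sketched in plain Python. Instead of forming the inverse of X^T X explicitly, the sketch solves the equivalent linear system (X^T X) W = X^T Y by Gaussian elimination, which gives the same W^* whenever X^T X is invertible. The function names, the singularity threshold and the toy data are assumptions for illustration only.

```python
def solve_linear_system(a, b):
    """Solve a * w = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; formula (24) has no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (aug[r][n] - tail) / aug[r][r]
    return w


def fit_mlr(features, targets):
    """Least squares coefficients W* of formula (24), with an intercept column."""
    x = [[1.0] + list(row) for row in features]  # prepend the constant term
    k = len(x[0])
    xtx = [[sum(r[a] * r[b] for r in x) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * y for r, y in zip(x, targets)) for a in range(k)]
    return solve_linear_system(xtx, xty)


def predict_mlr(w, feature_vector):
    """Formula (23): intercept plus weighted sum of the feature vector entries."""
    return w[0] + sum(b * v for b, v in zip(w[1:], feature_vector))


# Toy training set generated from y = 0.1 + 2*x1 + 3*x2 (invented for illustration).
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]]
targets = [0.1 + 2 * a + 3 * b for a, b in features]
W = fit_mlr(features, targets)
```

Since the toy targets are an exact linear function of the features, the fitted coefficients recover the intercept and slopes up to rounding error; with real feature vectors the fit would only approximate the association probability fractions.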

Artificial neural network (ANN)

ANN is a simple model inspired by the biological structure of the human brain. It is a densely connected network composed of simple elements, which can capture the essential relationships between dependent and independent variables. One of the most important characteristics of ANN is that it can learn from training samples, which overcomes some limitations of traditional methods. Therefore, in this section, we further adopt ANN to estimate the relationships between the feature vectors of lncRNA-disease pairs and their association probability fractions. As illustrated in Fig. 9, ANN is a parallel distributed processing system composed of many processing components (neurons), which are organized into three layers: the Input layer, the Hidden layer and the Output layer. Each neuron in every layer receives one or more input signals and generates, through the activation function, an output signal that serves as an input signal of the next layer. The most important part of building an ANN is determining the weights and biases. Each link between neurons carries a weight that reflects the influence of the previous neuron on the current neuron, and the bias increases the flexibility of the neuron [54]. In this section, in a way similar to the previous study [55], we determine the weights and biases of ANN through the following four major steps:

Step 1 Take the training samples as the input values, and randomly set the initial values of weights and biases in each layer of ANN.

Step 2 Calculate the output of ANN and compare the output with the target value to obtain the value of error.

Step 3 Readjust the weights and biases in each layer of ANN according to the value of error obtained above from Step 2.

Step 4 Repeat the above procedure until ANN reaches the stop condition.

In this paper, all feature vectors of lncRNA-disease pairs were randomly divided into a training set, a validation set and a test set in a ratio of 3:1:1. The training set was taken as the input of the Input layer. The input of the Hidden layer is then obtained by combining the weights, the output of the Input layer and the biases. Let $I_{m}^{n}$ and $O_{m}^{n}$ denote the input value and the output value of node m in the nth layer of ANN respectively; then the output of the Hidden layer can be calculated according to the following activation function:

$$O_{X}^{2} = \frac{2}{{1 + e^{{ - 2*I_{X}^{2} }} }} - 1$$

(25)
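The activation function in formula (25) is the hyperbolic tangent written in a different form, so the output of every Hidden layer node lies strictly between −1 and 1. Writing I for $I_{X}^{2}$ and multiplying numerator and denominator by $e^{I}$ in the last step gives:

$$\frac{2}{{1 + e^{ - 2I} }} - 1 = \frac{{1 - e^{ - 2I} }}{{1 + e^{ - 2I} }} = \frac{{e^{I} - e^{ - I} }}{{e^{I} + e^{ - I} }} = \tanh \left( I \right)$$

This is also the form in which Matlab defines its tansig transfer function; whether that function was the one used here is not stated in the text.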

Similarly, the input of the Output layer can be acquired by integrating the weights and the output of the Hidden layer, and the output of the Output layer can be calculated through the following activation function:

$$O_{1}^{3} = I_{1}^{3}$$

(26)

After obtaining the output value of the Output layer of ANN, the mean square error (MSE) can be obtained by comparing it with the target (the corresponding association probability fraction) as follows:

$$E_{total} = \frac{1}{N}\mathop \sum \limits_{k = 1}^{N} \left( {O_{1}^{3} \left( k \right) - target\left( k \right)} \right)^{2}$$

(27)

Here, N represents the number of samples in the test set.
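A forward pass through the three-layer network of formulas (25) to (27) can be sketched as follows. The weights below are arbitrary illustrative numbers rather than trained values, the function names are hypothetical, and the optional output bias `b_out` is an assumption, since the text mentions biases explicitly only for the Hidden layer.

```python
import math


def tansig(x):
    """Hidden layer activation of formula (25)."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0


def forward(x, w_hidden, b_hidden, w_out, b_out=0.0):
    """One forward pass: tansig Hidden layer, identity Output layer (formula (26))."""
    hidden = [tansig(sum(w * v for w, v in zip(weights, x)) + b)
              for weights, b in zip(w_hidden, b_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


def mean_squared_error(outputs, targets):
    """Formula (27): mean of the squared differences between outputs and targets."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)


# One feature vector with three entries, two Hidden layer nodes (invented numbers).
x = [0.0, 0.15, 0.05]
w_hidden = [[0.5, -0.2, 0.1],
            [0.3, 0.8, -0.5]]
b_hidden = [0.0, 0.1]
w_out = [1.0, -1.0]
prediction = forward(x, w_hidden, b_hidden, w_out)
```

During training, the prediction for each sample would be compared with its association probability fraction through `mean_squared_error`, and the weights and biases adjusted to reduce that error.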

Finally, the weight and bias between each pair of neuron connections can be modified repeatedly according to the MSE value until one of the following stop conditions has been satisfied:

(1) Maximum number of training iterations (set to 100 in this paper)
(2) Minimum MSE (set to 0.001 in this paper)
(3) Maximum number of consecutive validation failures: training stops once the MSE of the validation set has not decreased for t consecutive iterations (t was set to 15 in this paper)
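The three stop conditions can be sketched as a small training loop. The two callables are hypothetical placeholders for one training iteration and for the evaluation on the validation set; the text does not say on which data set the minimum MSE is checked, so the sketch assumes the training MSE.

```python
def train_with_stop_conditions(train_step, validation_mse,
                               max_epochs=100, goal_mse=0.001, max_fail=15):
    """Run training until one of the three stop conditions is met.

    train_step     -- hypothetical callable: runs one iteration, returns the training MSE
    validation_mse -- hypothetical callable: returns the current validation MSE
    """
    best_val = float("inf")
    fails = 0
    for epoch in range(1, max_epochs + 1):
        train_mse = train_step()
        if train_mse <= goal_mse:          # condition (2), assumed on the training MSE
            return epoch, "goal MSE reached"
        val = validation_mse()
        if val < best_val:
            best_val, fails = val, 0       # validation error improved: reset the counter
        else:
            fails += 1
            if fails >= max_fail:          # condition (3)
                return epoch, "validation MSE stopped improving"
    return max_epochs, "maximum number of epochs reached"  # condition (1)
```

For example, if the training MSE never reaches the goal and the validation MSE stays constant, the loop stops at iteration 16: the first iteration sets the best validation value and the following 15 iterations fail to improve on it.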

Finally, based on the above formulas, our prediction model FVTLDA with ANN can be described as the following Algorithm 2 (in Additional file 9).

Availability and requirements

Project name: My bioinformatics project FVTLDA.

Project home page: https://github.com/xiaoyubin123/FVTLDA.git

Operating system: Platform independent

Programming language: Matlab

Other requirements: Matlab_R2017b or higher

Any restrictions to use by non-academics: No license required

Availability of data and materials

All data generated or analyzed during this study are included in this published article [Additional files 6, 7 and 8].

Abbreviations

FVTLDA: Feature vectors is developed to predict LncRNA-Disease Associations
LOOCV: Leave-one-out cross validation
MLR: Multiple linear regression
ANN: Artificial neural network
CV: Cross validation
RWR: Random walk with restart
TPR: True positive rate
FPR: False positive rate
ROC: Receiver operating characteristic
AUC: Area under the ROC curve

References

1. Esteller M. Non-coding RNAs in human disease. Nat Rev Genet. 2011;12(12):861–74.
2. Wang KC, Chang HY. Molecular mechanisms of long noncoding RNAs. Mol Cell. 2011;43(6):904–14.
3. Wapinski O, Chang HY. Long noncoding RNAs and human disease. Trends Cell Biol. 2011;21(6):354–61.
4. Mercer TR, Dinger ME, Mattick JS. Long non-coding RNAs: insights into functions. Nat Rev Genet. 2009;10(3):155–9.
5. Lander ES, Linton LM, Birren B, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409(6822):860.
6. Chen X, Sun YZ, Guan NN, et al. Computational models for lncRNA function prediction and functional similarity calculation. Brief Funct Genomics. 2019;18(1):58–82.
7. Calin GA, Liu C, Ferracin M, et al. Ultraconserved regions encoding ncRNAs are altered in human leukemias and carcinomas. Cancer Cell. 2007;12(3):215–29.
8. Johnson R. Long non-coding RNAs in Huntington's disease neurodegeneration. Neurobiol Dis. 2012;46(2):245–54.
9. Cai Y, Yang Y, Chen X, et al. Circulating ‘lncRNA OTTHUMT00000387022’ from monocytes as a novel biomarker for coronary artery disease. Cardiovasc Res. 2016;112:714–24.
10. Li J, Xuan Z, Liu C. Long non-coding RNAs and complex human diseases. Int J Mol Sci. 2013;14(9):18790–808.
11. Chen G, Wang Z, Wang D, et al. LncRNADisease: a database for long-non-coding RNA-associated diseases. Nucleic Acids Res. 2013;41(D1):D983–6.
12. Bu D, Yu K, Sun S, et al. NONCODE v3.0: integrative annotation of long noncoding RNAs. Nucleic Acids Res. 2012;40(D1):D210–5.
13. Amaral PP, Clark MB, Gascoigne DK, Dinger ME, Mattick JS. lncRNAdb: a reference database for long noncoding RNAs. Nucleic Acids Res. 2011;39(Database issue):D146–51.
14. Dinger ME, Pang KC, Mercer TR, Crowe ML, Grimmond SM, Mattick JS. NRED: a database of long noncoding RNA expression. Nucleic Acids Res. 2009;37(Database issue):D122–6.
15. Chen X, Yan CC, Zhang X, You Z-H. Long non-coding RNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2016. https://doi.org/10.1093/bib/bbw060.
16. Jingwen Y, Pengyao P, Lei W, et al. A novel probability model for LncRNA–disease association prediction based on the Naïve Bayesian classifier. Genes. 2018;9(7):345.
17. Yu J, Xuan Z, Feng X, et al. A novel collaborative filtering model for LncRNA-disease association prediction based on the Naïve Bayesian classifier. BMC Bioinform. 2019;20(1):396.
18. Xuan Z, Li J, Yu J, et al. A probabilistic matrix factorization method for identifying lncRNA-disease associations. Genes. 2019;10(2):126.
19. Chen X, Yan GY. Novel human lncRNA-disease association inference based on lncRNA expression profiles. Bioinformatics. 2013;29(20):2617–24.
20. Sun J, Shi H, Wang Z, et al. Inferring novel lncRNA–disease associations based on a random walk model of a lncRNA functional similarity network. Mol BioSyst. 2014;10(8):2074–81.
21. Zhou M, Wang X, Li J, et al. Prioritizing candidate disease-related long non-coding RNAs by walking on the heterogeneous lncRNA and disease network. Mol BioSyst. 2015;11(3):760–9.
22. Chen X. KATZLDA: KATZ measure for the lncRNA-disease association prediction. Sci Rep. 2015;5:16840.
23. Liu MX, Chen X, Chen G, et al. A computational framework to infer human disease-associated long noncoding RNAs. PLoS ONE. 2014;9(1):e84408.
24. Lu C, Yang M, Luo F, et al. Prediction of lncRNA–disease associations based on inductive matrix completion. Bioinformatics. 2018;34(19):3357–64.
25. Wang L, Xiao Y, Li J, et al. IIRWR: internal inclined random walk with restart for LncRNA-disease association prediction. IEEE Access. 2019;7:54034–41.
26. Chen X. Predicting lncRNA-disease associations and constructing lncRNA functional similarity network based on the information of miRNA. Sci Rep. 2015;5:13186.
27. Yang X, Gao L, Guo X, et al. A network based method for analysis of lncRNA-disease associations and prediction of lncRNAs implicated in diseases. PLoS ONE. 2014;9(1):e87797.
28. Hartgrink HH, Jansen EPM, Grieken NCTV, et al. Gastric cancer. Lancet. 2009;374(9688):477–90.
29. Guo X, Xia J, Deng K, et al. Long non-coding RNAs: emerging players in gastric cancer. Tumor Biol. 2014;35(11):10591–600.
30. Chen D, Ju H, Lu Y, et al. Long non-coding RNA XIST regulates gastric cancer progression by acting as a molecular sponge of miR-101 to modulate EZH2 expression. J Exp Clin Cancer Res. 2016;35(1):142.
31. Xia H, Chen Q, Chen Y, et al. The lncRNA MALAT1 is a novel biomarker for gastric cancer metastasis. Oncotarget. 2016;7(35):56209.
32. Fernando TR, Rodriguez-Malave NI, Waters EV, et al. LncRNA expression discriminates karyotype and predicts survival in B-lymphoblastic leukemia. Mol Cancer Res. 2015;13(5):839–51.
33. Wang Y, Wu P, Lin R, et al. LncRNA NALT interaction with NOTCH1 promoted cell proliferation in pediatric T cell acute lymphoblastic leukemia. Sci Rep. 2015;5:13749.
34. Hoffman PC, Mauer AM, Vokes EE. Lung cancer. Lancet. 2000;355(9202):479–85.
35. Chen X, Guan NN, Sun YZ, Li JQ, Qu J. MicroRNA-small molecule association identification: from experimental results to computational models. Brief Bioinform. 2018;21(1):47–61.
36. Chen X, Li T-H, Zhao Y, Wang C-C, Zhu C-C. Deep-belief network for predicting potential miRNA-disease associations. Brief Bioinform. 2020. https://doi.org/10.1093/bib/bbaa186.
37. Chen X, Zhu CC, Yin J. Ensemble of decision tree reveals potential miRNA-disease associations. PLoS Comput Biol. 2019;15(7):e1007209.
38. Wang C-C, Zhao Y, Chen X. Drug-pathway association prediction: from experimental results to computational models. Brief Bioinform. 2020. https://doi.org/10.1093/bib/bbaa061.
39. Chen X, Yan CC, Zhang X, Zhang X, Dai F, Yin J, Zhang Y. Drug-target interaction prediction: databases, web servers and computational models. Brief Bioinform. 2016;17(4):696–712. https://doi.org/10.1093/bib/bbv066.
40. Li Y, Qiu C, Tu J, et al. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res. 2013;42(D1):D1070–4.
41. Li JH, Liu S, Zhou H, et al. starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein–RNA interaction networks from large-scale CLIP-Seq data. Nucleic Acids Res. 2013;42(D1):D92–7.
42. Cui T, Zhang L, Huang Y, et al. MNDR v2.0: an updated resource of ncRNA-disease associations in mammals. Nucleic Acids Res. 2018;46(D1):D371–4.
43. van Laarhoven T, Nabuurs SB, Marchiori E. Gaussian interaction profile kernels for predicting drug–target interaction. Bioinformatics. 2011;27(21):3036–43.
44. Chen X, Yan GY. Novel human lncRNA–disease association inference based on lncRNA expression profiles. Bioinformatics. 2013;29(20):2617–24.
45. Chen X, Xie D, Wang L, et al. BNPMDA: bipartite network projection for MiRNA–disease association prediction. Bioinformatics. 2018;34(18):3178–86.
46. Chen X, Wang L, Qu J, et al. Predicting miRNA–disease association based on inductive matrix completion. Bioinformatics. 2018;34(24):4256–65.
47. Chen X, Xie D, Zhao Q, You ZH. MicroRNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2019;20(2):515–39.
48. Chen X, Yin J, Qu J, et al. MDHGI: matrix decomposition and heterogeneous graph inference for miRNA-disease association prediction. PLoS Comput Biol. 2018;14(8):e1006418.
49. Wang D, Wang J, Lu M, et al. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics. 2010;26(13):1644–50.
50. Niu YW, Wang GH, Yan GY, et al. Integrating random walk and binary regression to identify novel miRNA-disease association. BMC Bioinform. 2019;20(1):59.
51. Kaytez F, Taplamacioglu MC, Cam E, et al. Forecasting electricity consumption: a comparison of regression analysis, neural networks and least squares support vector machines. Int J Electr Power Energy Syst. 2015;67:431–8.
52. Atici U. Prediction of the strength of mineral admixture concrete using multivariable regression analysis and an artificial neural network. Expert Syst Appl. 2011;38(8):9609–18.
53. Bahadir E. Using neural network and logistic regression analysis to predict prospective mathematics teachers’ academic success upon entering graduate education. Educ Sci Theory Pract. 2016;16(3):943–64.
54. Lee Y. Neural network based approach for predicting Learning effect in design students. Int J Organ Innov. 2010;2(3):250.
55. Wang L, Zeng Y, Chen T. Back propagation neural network with adaptive differential evolution algorithm for time series forecasting. Expert Syst Appl. 2015;42(2):855–63.

Acknowledgements

The authors thank all those who have made suggestions for this article.

Funding

This research was partly sponsored by the National Natural Science Foundation of China (No. 61873221, No. 61672447) and the Natural Science Foundation of Hunan Province (No. 2018JJ4058, No. 2019JJ70010, No. 2017JJ5036). Publication costs were funded by the National Natural Science Foundation of China (No. 61873221, No. 61672447). The funder of the manuscript is Lei Wang (L.W.), whose contributions are stated in the Contributions section. The funding bodies played no role in the design of the study, in the collection, analysis and interpretation of data, or in writing the manuscript.

Author information

Authors and Affiliations

College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410001, People’s Republic of China
Yubin Xiao, Xiang Feng, Zhiping Chen & Lei Wang
Hunan Province Key Laboratory of Tumor Cellular and Molecular Pathology, Cancer Research Institute, University of South China, Hengyang, 421001, Hunan, People’s Republic of China
Zheng Xiao
Key Laboratory of Hunan Province for Internet of Things and Information Security, Xiangtan University, Xiangtan, 411105, People’s Republic of China
Yubin Xiao, Linai Kuang & Lei Wang

Contributions

YBX conceived the study. YBX, ZX, and LW developed the method. YBX and ZPC implemented the algorithms. LAK and YBX collected the data. XF performed the data analyses. YBX and LW wrote the manuscript. All authors have read and approved the manuscript.

Corresponding author

Correspondence to Lei Wang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1

The ROC curves achieved by FVTLDA_ANN in framework of LOOCV.

Additional file 2

The ROC curves achieved by FVTLDA_MLR in framework of 5-fold CV.

Additional file 3

The ROC curves achieved by FVTLDA_MLR in framework of 10-fold CV.

Additional file 4

The ROC curves achieved by FVTLDA_ANN in framework of 5-fold CV.

Additional file 5

The ROC curves achieved by FVTLDA_ANN in framework of 10-fold CV.

Additional file 6

Known miRNA-disease associations obtained from HMDD.

Additional file 7

Known miRNA-lncRNA associations obtained from starBase v2.0.

Additional file 8

Known lncRNA-disease associations obtained from MNDR v2.0.

Additional file 9

Algorithm 1 and 2.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Xiao, Y., Xiao, Z., Feng, X. et al. A novel computational model for predicting potential LncRNA-disease associations based on both direct and indirect features of LncRNA-disease pairs. BMC Bioinformatics 21, 555 (2020). https://doi.org/10.1186/s12859-020-03906-7

Received: 22 November 2019
Accepted: 25 November 2020
Published: 02 December 2020
DOI: https://doi.org/10.1186/s12859-020-03906-7
