Increasing evidence shows that long non-coding RNAs (lncRNAs) play important roles in the development and progression of complex human diseases. Predicting human lncRNA-disease associations is therefore a challenging and urgent task in bioinformatics research on complex human diseases.

Results

In this work, we propose LRWRHLDA, a global network-based computational framework. First, four isomorphic networks were constructed: an lncRNA similarity network, a disease similarity network, a gene similarity network and a miRNA similarity network. Then, six heterogeneous networks, comprising the known lncRNA-disease, lncRNA-gene, lncRNA-miRNA, disease-gene, disease-miRNA and gene-miRNA association networks, were used to build a multi-layer network. Finally, the Laplace normalized random walk with restart algorithm was applied on this global network to predict the relationships between lncRNAs and diseases.

Conclusions

Ten-fold cross validation is used to evaluate the performance of LRWRHLDA. LRWRHLDA achieves an AUC of 0.98402, which is higher than that of the compared methods. Furthermore, LRWRHLDA can predict lncRNAs related to isolated diseases (and diseases related to isolated lncRNAs). The results for colorectal cancer, lung adenocarcinoma, stomach cancer and breast cancer have been verified by other studies. The case studies indicate that our method is effective.

A disease is an abnormal life process that arises when homeostasis is disrupted after the body is damaged by a pathogenic factor under certain conditions. Many studies have confirmed that diseases, genes, lncRNAs and miRNAs are linked by complex cross-regulatory relationships [1,2,3,4].

Many studies have shown that although protein-coding sequences account for less than 2% of the human genome, most nucleotides are detectably transcribed under certain conditions [5]. Among the various types of non-protein-coding transcripts, long non-coding RNAs (lncRNAs) and microRNAs (miRNAs) have attracted increasing attention. lncRNAs are defined as non-coding RNAs longer than 200 nucleotides [6]; miRNAs are RNA molecules of about 19–25 nucleotides that exist widely in eukaryotes [7].

lncRNAs play an important role in a variety of biological mechanisms, such as epigenetic regulation, chromatin remodeling, gene transcription, protein transport and cellular transport [8]. The functions of lncRNAs can be divided into the following categories: transcriptional interference; induction of chromatin remodeling and nucleosome modification; regulation of alternative splicing; generation of endogenous siRNAs; regulation of protein activity; structural or organizational roles; alteration of protein localization; and serving as precursors of small RNAs [5, 9, 10].

Many researchers have found that abnormal expression or function of lncRNAs is closely related to the occurrence of human diseases, including cancers and degenerative neurological diseases, which seriously endanger human health. For example, overexpression of the lncRNA HOTAIR increases breast cancer cell proliferation [11, 12]. The lncRNA AFAP1-AS1 is abnormally expressed in cholangiocarcinoma, gallbladder cancer, hepatocellular carcinoma, gastric cancer, colorectal cancer and esophageal cancer [13]. The lncRNA HOXA-AS2 may be a biomarker for the treatment of gastric cancer [14]. The lncRNA PCGEM1 is closely correlated with osteoarthritis [15]. Therefore, lncRNAs can serve as important biomarkers for the diagnosis of diseases.

lncRNA-disease associations can be identified either by biological experiments or by computational prediction. For example, based on biological experiments, Faghihi et al. [16] found that the expression of BACE1-AS drives rapid feed-forward regulation of β-secretase in Alzheimer’s disease. Using RT-PCR and Northern blot analysis, Hu et al. [17] verified that H19 may become a new target for anti-tumor therapy in colon cancer. The results of biological experiments are reliable, but the experiments are time-consuming and costly.

Recently, computational models, which can integrate various data resources, have attracted increasing attention for identifying lncRNA-disease associations. For instance, based on a semi-supervised learning framework, the Laplacian regularized least squares for lncRNA-disease association model (LRLSLDA) was proposed to predict potential disease-related lncRNAs [18]. Integrating genome, regulome and transcriptome data, a naive Bayesian classifier was proposed to identify cancer-related lncRNAs [19]. Similarly, a machine learning method based on disease-gene cluster association scores was proposed to predict potential lncRNA-disease associations [20]. Combining incremental principal component analysis (IPCA) with the random forest (RF) algorithm, a machine learning model called IPCARF was applied to predict lncRNA-disease associations [21].

Matrix factorization has also been widely used to find lncRNA-disease associations. For instance, the dual-network integrated logistic matrix factorization and Bayesian optimization model (DNILMF-LDA) has been used to predict lncRNA-disease associations [22]. In addition, weighted graph regularized collaborative matrix factorization (WGRCMF), dual sparse collaborative matrix factorization (DSCMF) and multi-label fusion collaborative matrix factorization (MLFCMF) were applied to construct models for predicting lncRNA-disease associations [23,24,25].

Based on the hypothesis that lncRNAs with similar functions may be related to diseases with similar phenotypes, some researchers have proposed several calculation methods based on biological networks to predict disease-related lncRNAs.

For example, by integrating the lncRNA similarity network, the disease similarity network and the lncRNA-disease association network, the BPLLDA model, which is based on simple paths with limited lengths in a heterogeneous lncRNA-disease association network, was proposed to predict lncRNA-disease associations [26]. Furthermore, several random walk models on such heterogeneous networks were proposed to predict the relationships between lncRNAs and diseases [27,28,29]. Sun et al. [27] proposed a random walk with restart method on an lncRNA functional similarity network (RWRlncD). Gu et al. [28] proposed a global network-based random walk with restart algorithm on lncRNA seed nodes and disease seed nodes (GrwLDA). Based on a heterogeneous network built from the lncRNA, disease and gene similarity networks, the MHRWR model applies the random walk with restart algorithm on the global network [29].

Following the random walk with restart model, this paper proposes a new computational model based on the Laplacian normalized random walk with restart algorithm in a heterogeneous network to predict associations between lncRNAs and diseases. First, the disease semantic similarity and the lncRNA, gene and miRNA functional similarities are calculated. Then, based on the known associations, the Gaussian interaction profile kernel similarities of lncRNAs, diseases, miRNAs and genes are calculated. For each node type, the functional (or semantic) similarity is integrated with the corresponding Gaussian interaction profile kernel similarity to construct the four isomorphic networks. Finally, the Laplace normalized random walk with restart algorithm on the heterogeneous network is used to predict potential lncRNA-disease associations. Our method obtains an AUC of 0.98402 in ten-fold cross validation, which is higher than that of the compared methods. Moreover, case studies on colorectal cancer, lung adenocarcinoma, stomach cancer and breast cancer also demonstrate the reliability of our model.

Methods

Experimental data sources

In this paper, the lncRNA-disease associations come mainly from the LncRNADisease database [30, 31], the EVLncRNAs database [32], the Lnc2Cancer database [33] and the MNDR v3.1 database [34], among others. Similarly, the lncRNA-miRNA associations come from the integrated data of the DIANA-LncBase database [35], the LncAcTdb 2.0 database [36], the MiRcode database [37] and the StarBase database [38]. The lncRNA-gene associations come from the integrated data of the LncRNADisease database [30, 31], the LncAcTdb 2.0 database [36] and the LncRNA2Target v2.0 database [39]. The miRNA-disease associations come from the integrated data of the MNDR v3.1 database [34], the HMDD database [40] and the MiR2Disease database [41]. The miRNA-gene associations come from the MiRTarBase database [42]. The gene-disease associations come from the integrated data of the DisGeNET database [43], the CREEDS database [44] and the DISEASES database [45].

Because different databases may use different names for the same biomolecule, we performed error correction and data cleaning on the data sets obtained from the databases (mainly deleting duplicate, erroneous and empty records). In addition, the names of biomolecules of the same type from different databases were unified. To make the data more comprehensive and to improve the accuracy and scope of the prediction, we took the union of the related data from the above databases.

For lncRNAs, the intersection of the three association sets (lncRNA-disease, lncRNA-gene and lncRNA-miRNA) obtained from all databases was used to construct the lncRNA similarity network, giving 814 lncRNAs in this work (Fig. 1). Finally, 2476 miRNAs, 7986 genes and 217 diseases were retained for the study. Table 1 summarizes some basic characteristics (e.g., the average degree) of each X–Y association dataset, where X and Y each stand for lncRNA, disease, gene or miRNA.

Calculate the similarity matrix

LncRNA functional similarity matrix

Following the method of Sun et al. [27], the functional similarity of two lncRNAs was computed as follows:

Supposing lncRNA l_{1} is associated with the disease group D_{1} (\(D_{1} = \{ d_{1i} |1 \le i \le a\}\)), and lncRNA l_{2} is associated with the disease group D_{2} (\(D_{2} = \{ d_{2j} |1 \le j \le b\}\)), the similarity between disease d_{11} and a disease group D_{2} is defined as follows:

where \(Sim(d_{11} ,d_{2} )\) is the disease semantic similarity of diseases d_{11} and d_{2}. Then, the functional similarity between lncRNA l_{1} and l_{2} is defined as:
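The displayed equations for this step are not reproduced in this text. As a minimal sketch, assuming the best-match-average form commonly used for this kind of functional similarity (the similarity between a disease and a disease group is the maximum pairwise semantic similarity, and the sums over both directions are divided by a + b), the computation could look like this in Python. The function names and the `sim` callback are hypothetical and serve only as an illustration.

```python
from typing import Callable, Sequence


def disease_to_group(d: str, group: Sequence[str],
                     sim: Callable[[str, str], float]) -> float:
    """Assumed form: best match between disease d and any disease in the group."""
    return max((sim(d, other) for other in group), default=0.0)


def functional_similarity(group1: Sequence[str], group2: Sequence[str],
                          sim: Callable[[str, str], float]) -> float:
    """Assumed best-match-average similarity of two lncRNAs.

    group1 and group2 are the disease groups D1 and D2 associated with
    lncRNAs l1 and l2; sim(x, y) is the disease semantic similarity.
    """
    if not group1 or not group2:
        return 0.0
    total = sum(disease_to_group(d, group2, sim) for d in group1)
    total += sum(disease_to_group(d, group1, sim) for d in group2)
    return total / (len(group1) + len(group2))
```

The same routine would apply to miRNAs by passing the disease groups of two miRNAs instead of two lncRNAs.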

The Disease Ontology (DO) provides an open-source ontology for the integration of biomedical data associated with human disease [46]. The terms in DO are diseases or disease-related concepts organized in a directed acyclic graph (DAG). Applying the method of Wang et al. [47, 48], the semantic similarity of diseases is calculated as follows:

Given a disease d, its DAG can be expressed as DAG(d) = (Ans(d), E(d)), where Ans(d) is the node set containing d and its ancestor nodes, and E(d) is the set of direct edges from parent nodes to child nodes; that is, E(d) denotes the relationships between different diseases. Based on the DAG, the contribution of a disease term d to the semantic value of disease T and the semantic value of disease T itself can be computed in the following two steps:

where \(\Delta\) is the semantic contribution attenuation factor, with a value between 0 and 1. As the distance between disease d and its ancestor diseases increases, the contribution of these ancestors to the semantic value of disease d gradually decreases. The semantic similarity between diseases d_{1} and d_{2} is calculated by Eq. (5):
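A minimal Python sketch of this DAG-based semantic similarity is given below. It assumes the usual formulation in which a term contributes 1 to itself, each ancestor receives the largest \(\Delta\)-attenuated contribution among its children, and the similarity is the shared contribution divided by the sum of the two semantic values. The `parents` dictionary, the function names and the default value 0.5 for \(\Delta\) are illustrative assumptions, not values taken from this paper.

```python
from typing import Dict, List


def semantic_contributions(term: str, parents: Dict[str, List[str]],
                           delta: float = 0.5) -> Dict[str, float]:
    """Contribution of `term` and each of its ancestors to the semantic value of `term`."""
    contrib = {term: 1.0}
    frontier = [term]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, []):
            value = delta * contrib[node]
            # Keep the largest attenuated contribution over all paths.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def semantic_similarity(d1: str, d2: str, parents: Dict[str, List[str]],
                        delta: float = 0.5) -> float:
    """Shared ancestor contributions divided by the sum of both semantic values."""
    c1 = semantic_contributions(d1, parents, delta)
    c2 = semantic_contributions(d2, parents, delta)
    shared = set(c1) & set(c2)
    numerator = sum(c1[t] + c2[t] for t in shared)
    return numerator / (sum(c1.values()) + sum(c2.values()))
```

Here `parents` maps each term to its direct parent terms, so two sibling diseases under one parent share only that parent's contribution.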

Following the method of Wang et al. [47], the functional similarity of two miRNAs can be defined as follows:

Assume that miRNA m_{1} is associated with the disease group D_{3} (\(D_{3} = \{ d_{3k} |1 \le k \le c\}\)) and miRNA m_{2} is associated with the disease group D_{4} (\(D_{4} = \{ d_{4z} |1 \le z \le e\}\)). The similarity between a disease d_{31} and the disease group D_{4} is defined as follows:

The Gene Ontology (GO) database is the world’s largest informatics resource on the functions of genes [49]. For a GO node A, DAG = (Ans(A), E(A)) is its directed acyclic graph, where Ans(A) is the set of all ancestors of node A (including A itself) and E(A) is the set of edges connecting the nodes in the DAG. For any GO node t that is an ancestor of A, or t = A, the contribution \(S_{A} (t)\) of t to A is defined by Eq. (8):

where \(\Delta\) is the semantic contribution attenuation factor, with a value between 0 and 1. As the distance between GO node A and its ancestor nodes increases, the contribution of these ancestors to the semantic value of A gradually decreases. The semantic value \(S_{V} (A)\) of node A is defined as follows:

Assuming that the GO term set annotations of genes G_{1} and G_{2} are \(G{O_1} = \left\{ {{{\text{go}}_{11}},g{o_{12}}, \ldots ,g{o_{1m}}} \right\}\) and \(G{O_2} = \left\{ {{{\text{go}}_{21}},g{o_{22}}, \ldots ,g{o_{2n}}} \right\}\), respectively, the similarity of the two genes G_{1} and G_{2} is calculated by Eq. (12) [50]:

Gaussian interaction profile kernel similarity for lncRNAs and diseases

Because the matrices LS, DS, MS and GS contain many zeros, they are sparse, which may make the prediction results inaccurate. To avoid this, we introduce the Gaussian interaction profile kernel similarity [51, 52].

First, the m × n matrix LD denotes the lncRNA-disease association matrix, whose elements are either 0 or 1: if lncRNA l_{i} is related to disease d_{j}, LD (i, j) = 1; otherwise LD (i, j) = 0.

In the same way, we define the lncRNA-miRNA association matrix LM, the lncRNA-gene association matrix LG, the disease-gene association matrix DG, the miRNA-gene association matrix MG and the miRNA-disease association matrix MD.

The Gaussian interaction profile kernel similarity of lncRNAs l_{i} and l_{j} is defined as follows:

where IP (l_{i}) is a binary vector representing the ith row of the lncRNA-disease association matrix LD, and m is the number of lncRNAs. \(r^{\prime}_{l}\) is the regulation parameter of the kernel bandwidth \(r_{l}\); following previous research, it is set to 1.

Similarly, the Gaussian interaction profile kernel similarity of disease d_{i} and d_{j} is defined as:

where IP (d_{i}) is a binary vector representing the ith column of the lncRNA-disease association matrix LD, and n is the number of diseases. \(r^{\prime}_{d} = 1\) is the regulation parameter of the kernel bandwidth \(r_{d}\).
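The kernel described here can be sketched in a few lines of Python. The sketch assumes the standard Gaussian interaction profile kernel, exp(−r‖IP(i) − IP(j)‖²), where the bandwidth r is the regulation parameter divided by the mean squared norm of all profiles. The function name `gip_kernel` and the list-of-lists matrix layout are illustrative choices.

```python
import math
from typing import List


def gip_kernel(profiles: List[List[int]], r_prime: float = 1.0) -> List[List[float]]:
    """Gaussian interaction profile kernel over the rows of a binary matrix.

    Pass the rows of LD for lncRNAs, or the columns of LD (its transpose)
    for diseases.
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(x * x for x in row) for row in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty; bandwidth is undefined")
    bandwidth = r_prime / mean_sq_norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-bandwidth * sq_dist)
    return kernel
```

For diseases, the transpose can be obtained with `[list(col) for col in zip(*LD)]` before calling the function.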

Gaussian interaction profile kernel similarity for MiRNAs and genes

The Gaussian interaction profile kernel similarities of miRNAs and genes are calculated in the same way as those of lncRNAs and diseases, except that the association matrix MG is used. Here IP (m_{i}) is a binary vector representing the ith row of the matrix MG, and h is the number of miRNAs; \(r^{\prime}_{m}\) = 1 is the regulation parameter of the kernel bandwidth \(r_{m}\). IP (g_{i}) is a binary vector representing the ith column of the matrix MG, and k is the number of genes; \(r^{\prime}_{g}\) = 1 is the regulation parameter of the kernel bandwidth \(r_{g}\).

Integration of similarities between lncRNAs, miRNAs, genes, and diseases

We integrate the lncRNA functional similarity (disease semantic similarity, miRNA functional similarity, gene functional similarity) with the Gaussian interaction profile kernel similarity for lncRNAs (diseases, miRNAs, genes) as follows:

where NL is the set of lncRNAs with no functional similarity with any other lncRNA, ND is the set of diseases with no semantic similarity with any other disease, NM is the set of miRNAs with no functional similarity with any other miRNA, and NG is the set of genes with no functional similarity with any other gene. By definition, LL, DD, MM and GG are symmetric.

The heterogeneous network

Based on the integrated lncRNA similarity matrix LL, disease similarity matrix DD, miRNA similarity matrix MM and gene similarity matrix GG, four isomorphic networks (the lncRNA, disease, gene and miRNA similarity networks) were constructed, as shown in Fig. 2. In addition, a heterogeneous network connecting these four similarity networks was built from the six association matrices LD, LM, LG, MD, MG and DG, as shown in Fig. 3.

The random walk with restart

The random walk with restart (RWR) on the heterogeneous network, used to predict lncRNA-disease associations, is defined as follows [53]:

where P^{0} is the initial probability vector and P^{t} is the probability vector whose ith element is the probability of finding the random walker at node i at step t. λ is the restart probability, with a value between 0 and 1. W is the probability transition matrix, and W_{ij} denotes the transition probability from node i to node j. When the L_{1} norm of the difference between P^{t+1} and P^{t} is less than 10^{−6}, the walk is considered to have reached a stable state, which gives the stable probability \(P^{\infty }\).
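The iteration and its stopping rule can be sketched as follows. The update equation itself is not reproduced in this text, so the sketch assumes the standard form P^{t+1} = (1 − λ) W^{T} P^{t} + λ P^{0}. The function name, the list-based matrix layout and the `max_iter` safeguard are illustrative assumptions.

```python
from typing import List


def random_walk_with_restart(W: List[List[float]], p0: List[float],
                             restart: float = 0.7, tol: float = 1e-6,
                             max_iter: int = 1000) -> List[float]:
    """Iterate the assumed update p <- (1 - restart) * W^T p + restart * p0.

    W[i][j] is the transition probability from node i to node j. The loop
    stops when the L1 norm of the change between two steps drops below tol.
    """
    n = len(p0)
    p = list(p0)
    for _ in range(max_iter):
        nxt = [
            (1.0 - restart) * sum(W[i][j] * p[i] for i in range(n)) + restart * p0[j]
            for j in range(n)
        ]
        change = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if change < tol:
            break
    return p
```

The returned vector plays the role of \(P^{\infty }\); candidate nodes can then be ranked by their entries.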

The probability transition matrix W is constructed in this paper as follows:

The matrix W consists of four intra-transition matrices and twelve inter-transition matrices. W_{LL} is the intra-transition matrix of the lncRNA similarity network; W_{DD}, W_{MM} and W_{GG} are defined analogously for the disease, miRNA and gene similarity networks, respectively. W_{LM} is the transition matrix from the lncRNA network to the miRNA network; W_{LG}, W_{LD}, W_{ML}, W_{MG}, W_{MD}, W_{GL}, W_{GM}, W_{GD}, W_{DL}, W_{DM} and W_{DG} are defined similarly.

Laplacian normalization

Given the matrix A = A (i, j), the diagonal matrix D is defined as follows: D (i, j) equals the sum of the ith row of A if i = j, and D (i, j) = 0 otherwise. The Laplace normalization of matrix A is then defined as [54, 55]:
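The normalization formula is not reproduced in this text. A minimal sketch, assuming the common symmetric form in which entry (i, j) of A is divided by the square root of D (i, i) · D (j, j), is shown below for a square, non-negative similarity matrix. The function name and the handling of zero row sums (the entry is left at 0) are assumptions made for illustration.

```python
import math
from typing import List


def laplacian_normalize(A: List[List[float]]) -> List[List[float]]:
    """Assumed symmetric normalization: A[i][j] / sqrt(D[i][i] * D[j][j])."""
    n = len(A)
    degrees = [sum(row) for row in A]  # diagonal of D: row sums of A
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = math.sqrt(degrees[i] * degrees[j])
            if denom > 0:
                out[i][j] = A[i][j] / denom
    return out
```

Unlike row normalization, this form keeps a symmetric input matrix symmetric.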

Here P_{LM} (P_{LG}, P_{LD}) is the parameter that represents the transition probability from the lncRNA similarity network to the miRNA (gene, disease) similarity network, with a value between 0 and 1. In addition, P_{LM} = P_{ML}, P_{LG} = P_{GL}, P_{LD} = P_{DL}, P_{MG} = P_{GM}, P_{MD} = P_{DM} and P_{GD} = P_{DG}. The other intra-transition and inter-transition matrices are defined similarly. Applying the Laplacian normalization yields all elements of the probability transition matrix W. The initial probability vector P^{0} is calculated as follows:

Here the parameters P_{L}, P_{M}, P_{G} and 1 − P_{L} − P_{M} − P_{G} represent the importance of the lncRNA, miRNA, gene and disease similarity networks, respectively; each takes a value between 0 and 1. U_{L0} is the initial probability vector of the lncRNA similarity network: equal probabilities are assigned to all seed nodes in the lncRNA similarity network so that the entries of U_{L0} sum to 1. The initial probabilities U_{M0} and U_{G0} are defined in the same way. U_{D0} is the initial probability vector of the disease similarity network: for the disease d under study, the initial probability is 1, and the initial probability of every other disease is 0.
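As an illustration of how such an initial vector could be assembled, the sketch below weights the four per-network vectors and concatenates them. The block order (lncRNA, miRNA, gene, disease), the function names and the index-based seed lists are assumptions; the default weights are the values P_{L} = 0.4, P_{M} = 0.1 and P_{G} = 0.1 reported in the performance evaluation.

```python
from typing import Iterable, List


def uniform_over_seeds(size: int, seeds: Iterable[int]) -> List[float]:
    """Equal probability on every seed node, zero elsewhere."""
    seed_set = set(seeds)
    vec = [0.0] * size
    for s in seed_set:
        vec[s] = 1.0 / len(seed_set)
    return vec


def initial_probability(n_l: int, n_m: int, n_g: int, n_d: int,
                        seed_lncrnas: Iterable[int], seed_mirnas: Iterable[int],
                        seed_genes: Iterable[int], disease_index: int,
                        p_l: float = 0.4, p_m: float = 0.1,
                        p_g: float = 0.1) -> List[float]:
    """Weighted concatenation of the four initial vectors (assumed block order)."""
    u_l = uniform_over_seeds(n_l, seed_lncrnas)
    u_m = uniform_over_seeds(n_m, seed_mirnas)
    u_g = uniform_over_seeds(n_g, seed_genes)
    u_d = uniform_over_seeds(n_d, [disease_index])  # 1 for disease d, 0 otherwise
    p_d = 1.0 - p_l - p_m - p_g
    return ([p_l * x for x in u_l] + [p_m * x for x in u_m]
            + [p_g * x for x in u_g] + [p_d * x for x in u_d])
```

When every network has at least one seed, the entries of the resulting vector sum to 1.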

Finally, the Laplace normalized random walk with restart algorithm is used to compute the association scores of candidate lncRNAs (see Fig. 3). We call the method LRWRHLDA (Laplace normalized random walk with restart algorithm in heterogeneous networks for predicting lncRNA-disease associations).

Results

Performance evaluation

In this paper, ten-fold cross validation is used to evaluate the performance of our model. All known lncRNA-disease associations are randomly divided into ten folds. In each experiment, nine folds are used as training samples and the remaining fold is used as test samples. After each test, predicted scores are generated, and the test samples are ranked together with the unknown lncRNA-disease pairs. A test sample is counted as a true positive (TP) when its predicted relevance score is greater than the threshold and as a false negative (FN) otherwise. Similarly, an unknown lncRNA-disease pair is counted as a false positive (FP) when its predicted relevance score is greater than the threshold and as a true negative (TN) otherwise. The true positive rate (TPR), the false positive rate (FPR), recall and precision are then calculated as follows:

$$TPR = recall = \frac{TP}{TP + FN}, \tag{29}$$

$$FPR = \frac{FP}{FP + TN}, \tag{30}$$

$$precision = \frac{TP}{TP + FP}. \tag{31}$$

Finally, the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve are drawn, as shown in Fig. 4. The area under the ROC curve (AUC) and the area under the PR curve (AUPR) are used to evaluate the performance of our method; both range from 0 to 1. With the parameters set to P_{LM} = P_{LG} = P_{LD} = P_{MG} = P_{MD} = P_{GD} = 0.2, P_{L} = 0.4, P_{M} = 0.1, P_{G} = 0.1 and λ = 0.7, the results of the ten experiments are shown in Table 2.
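The threshold-based counts and the AUC can be illustrated with a short Python sketch. It follows Eqs. (29) to (31) for a single threshold and then sweeps the threshold over all distinct scores, integrating the ROC curve with the trapezoidal rule. The function names, and the convention that precision is 1 when nothing is predicted positive, are assumptions made for illustration; label 1 marks a held-out known association and label 0 an unknown pair.

```python
from typing import List, Sequence, Tuple


def confusion_rates(scores: Sequence[float], labels: Sequence[int],
                    threshold: float) -> Tuple[float, float, float]:
    """Return (TPR, FPR, precision) when scores above the threshold count as positive."""
    tp = fn = fp = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score > threshold
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 1.0
    return tpr, fpr, precision


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve by the trapezoidal rule over all thresholds."""
    thresholds = [float("-inf")] + sorted(set(scores))
    points: List[Tuple[float, float]] = []
    for t in thresholds:
        tpr, fpr, _ = confusion_rates(scores, labels, t)
        points.append((fpr, tpr))
    points.sort()
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

In a ten-fold setting, the same two functions would be applied to the scores of each held-out fold, and the ten AUC values would then be reported or averaged.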

Comparison with different predicted methods using ten-fold cross validation

To compare LRWRHLDA with other models, the data in this paper were also applied to the BPLLDA model [26], the RWRlncD model [27], the GrwLDA model [28] and the MHRWR model [29].

The ROC curves of LRWRHLDA, RWRlncD, GrwLDA, BPLLDA and MHRWR under ten-fold cross validation are plotted in Fig. 5.

As can be seen, LRWRHLDA has an AUC of 0.98402 and outperforms RWRlncD (0.53625), GrwLDA (0.83276), BPLLDA (0.87148) and MHRWR (0.97169). In summary, LRWRHLDA performs better than the other models in lncRNA-disease association prediction.

To avoid overestimating the performance of these methods, the area under the PR curve (AUPR) is also used to evaluate the LRWRHLDA model, the BPLLDA model [26], the RWRlncD model [27], the GrwLDA model [28] and the MHRWR model [29] (see Fig. 6).

It can be seen from Fig. 6 that the AUPR value of LRWRHLDA is also higher than those of the other models.

Effects of parameters

Our model has ten parameters: the transition probabilities between networks P_{LM}, P_{LG}, P_{LD}, P_{MG}, P_{MD} and P_{GD}; the subnetwork weights P_{L}, P_{M} and P_{G}; and the restart probability λ. Because of the large number of parameters and our limited computing resources, we fixed nine of these parameters arbitrarily and examined only the effect of the restart probability λ under ten-fold cross validation. The results are shown in Table 3. In terms of AUC, λ has relatively little influence on the performance of LRWRHLDA when λ = 0.7. In terms of AUPR, the maximum is reached when λ = 0.9. Overall, Table 3 indicates that the restart probability λ has a clear effect on our model.

Case study

Case studies on predicted lncRNA-disease associations

It is known that lncRNAs play critical roles in the development of many diseases. To evaluate the ability of LRWRHLDA in inferring potential lncRNA-disease associations, we use all known lncRNA-disease associations in LD as training data to assess the potential of predicted associations by our model.

The stable probability \(P^{\infty }\) can be used as a measure of proximity to the seed lncRNAs. If \(P^{\infty }\) (lncRNA i) > \(P^{\infty }\) (lncRNA j), then lncRNA i is closer to the seed lncRNAs than lncRNA j in the lncRNA similarity network. All candidate lncRNAs can therefore be ranked according to \(P^{\infty }\), and the top-ranked lncRNAs are expected to have a high probability of being associated with the disease of interest. The novel lncRNA-disease associations are ranked according to the stable probability of LRWRHLDA. To validate the predictions, we use the literature and the following databases: LncRNADisease [30], LncRNADisease v2.0 [31], MNDR v3.1 [34] and lnCAR [56]. Specifically, we list the top 10 lncRNAs predicted for each of four diseases: colorectal cancer, lung adenocarcinoma, stomach cancer and breast cancer. The top 10 results according to \(P^{\infty }\) are shown in Table 4 (for detailed results, see Additional file 1: Table-S1).

Colorectal cancer is the third most common cancer diagnosed in the US. Although the incidence and mortality of colorectal cancer have decreased owing to effective screening, the number of young patients diagnosed with colon cancer has increased for reasons that remain unclear [57]. Lung adenocarcinoma is one of the main types of lung cancer and belongs to non-small cell carcinoma. It occurs mainly in women and non-smokers [58]. Stomach cancer is the fifth most common cancer and the third most common cause of cancer death globally [59]. The great majority of stomach cancers are adenocarcinomas, which show no obvious symptoms in the early stage. Their symptoms often resemble those of chronic gastric diseases such as gastritis and gastric ulcers and are easily overlooked. Moreover, the early diagnosis rate of stomach cancer is still low. Breast cancer is a malignant tumor that arises in the epithelial tissue of the breast. It has become a major public health problem, and its cause is not yet fully understood. Worldwide, breast cancer is an important cause of suffering and premature mortality among women [60].

In Table 4, apart from the lncRNA-disease associations already recorded in the databases, six potential lncRNA-disease associations were confirmed in the literature: ENST00000535511-colorectal cancer, RP4-colorectal cancer, CTNNAP1-colorectal cancer, LINC01021-colorectal cancer, GMDS-AS1-lung adenocarcinoma and LINC01207-lung adenocarcinoma. These results demonstrate the predictive performance of the proposed method.

Case studies on predicted novel diseases and novel lncRNAs

Each disease in turn is treated as a novel disease: all of its known related lncRNAs are removed, and the model is used to predict lncRNAs potentially related to the disease. All candidate lncRNAs are ranked according to \(P^{\infty }\), and lncRNAs with high scores are expected to be related to the investigated disease d. The top 10 results according to \(P^{\infty }\) are listed in Table 5 (for detailed results, see Additional file 2: Table-S2).

Analogously, the stable probability \(P^{\infty }\) can also be used as a measure of proximity to the seed diseases. All candidate diseases are ranked according to \(P^{\infty }\), and diseases with high scores are expected to be related to the investigated lncRNA. To evaluate the ability of our model to make predictions for novel lncRNAs, we analyzed two lncRNAs, H19 and HOTAIR. For each lncRNA, all of its known related diseases are removed before predicting potential diseases. The top 10 results according to \(P^{\infty }\) are shown in Table 6 (for detailed results, see Additional file 3: Table-S3).

Table 5 shows that thirty-five of the forty top-ranked lncRNA associations (the top ten for each of the four cancers) were validated by the databases or the literature. The other five cancer-lncRNA associations, colorectal cancer-CARL, stomach cancer-AF117829.1, breast cancer-AP003486.1, lung adenocarcinoma-AC018413.1 and lung adenocarcinoma-TUBB2A, have not yet been confirmed by the databases or the literature. This suggests that our method can predict additional candidate lncRNA-disease associations.

From Table 6, in both cases, all top ten associated diseases were validated by the database. In summary, LRWRHLDA achieves favorable performances in predicting novel disease-associated lncRNAs and novel lncRNA-associated diseases.

Discussion

At present, many studies have shown that lncRNA has an important influence on the physiological process of diseases. Because traditional biological experiments are time-consuming and costly, it is necessary to develop a computational model to predict the association between lncRNA and disease.

In this paper, a new model, LRWRHLDA, based on the Laplace normalized random walk with restart algorithm in a heterogeneous network, was constructed to predict potential lncRNA-disease associations. Ten-fold cross validation is applied to evaluate the prediction performance of our method. Compared with the state-of-the-art prediction methods considered, our method achieves better performance in terms of AUC. Moreover, case studies of colorectal cancer, lung adenocarcinoma, stomach cancer and breast cancer further demonstrate that it could be a useful method for predicting potential relationships between lncRNAs and diseases.

However, our method has some limitations. First, with ten parameters, parameter selection and tuning remain difficult. Second, because our model is based on four networks, the heterogeneous network contains many nodes, and the more nodes there are, the longer the random walk takes. In the future, we will continue to improve the model.

Conclusion

In this study, we proposed LRWRHLDA, a method based on the Laplace normalized random walk with restart algorithm in a heterogeneous network, to predict potential lncRNA-disease associations. First, a heterogeneous network was constructed from the lncRNA, disease, miRNA and gene similarity networks and their association networks. Then, the probability transition matrix was calculated by Laplace normalization. Finally, potential lncRNA-disease associations were predicted by the random walk with restart over the heterogeneous network. Furthermore, LRWRHLDA can predict lncRNAs related to isolated diseases (and diseases related to isolated lncRNAs). Our method was evaluated by ten-fold cross validation and case studies in comparison with other methods. The results show that our method has higher prediction accuracy.

Availability of data and materials

The datasets supporting the conclusions of this article are included within the article and its additional files. The code (executable code and source code) and data for this study are available at https://github.com/wang-124/LRWRHLDA.git.

Abbreviations

lncRNAs:

Long non-coding RNAs

miRNA:

MicroRNA

LRWRHLDA:

Prediction of potential lncRNA-disease associations based on the Laplace normalized random walk with restart algorithm in heterogeneous networks

ROC:

Receiver operating characteristic

TPR:

True positive rates

FPR:

False positive rates

AUC:

Areas under ROC curve

PR:

Precision-recall

AUPR:

The area under the precision-recall curve

This work was supported by the National Natural Science Foundation of China (Nos. 61772027 and 61772028) and the Key Research and Development Plan of Zhejiang Province (No. 2021C02039).

Funding

This research was partly sponsored by the National Natural Science Foundation of China (Nos. 61772027 and 61772028) and the Key Research and Development Plan of Zhejiang Province (No. 2021C02039). The funding bodies played no role in the design of the study, in the collection, analysis, or interpretation of data, or in writing the manuscript.

Author information

Authors and Affiliations

School of Science, Zhejiang Sci-Tech University, Hangzhou, 310018, China

Liugen Wang, Min Shang & Ping-an He

College of Life Science, Zhejiang Sci-Tech University, Hangzhou, 310018, China

LW, MS, QD and PH designed the study. LW and MS carried out analyses and wrote the program. LW and PH wrote the paper. All authors read and approved the final manuscript.

In this file, we provide the stable probabilities of the lncRNAs obtained when LRWRHLDA is run after the known cancer associations of those lncRNAs have been deleted.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Wang, L., Shang, M., Dai, Q. et al. Prediction of lncRNA-disease association based on a Laplace normalized random walk with restart algorithm on heterogeneous networks. BMC Bioinformatics 23, 5 (2022). https://doi.org/10.1186/s12859-021-04538-1