Prediction of lncRNA-disease association based on a Laplace normalized random walk with restart algorithm on heterogeneous networks

Background Increasing evidence shows that long non-coding RNAs (lncRNAs) play important roles in the development and progression of complex human diseases. Predicting human lncRNA-disease associations is therefore a challenging and urgent task in bioinformatics for the study of these diseases. Results In this work, a global network-based computational framework called LRWRHLDA is proposed. First, four isomorphic networks, namely the lncRNA, disease, gene and miRNA similarity networks, were constructed. Then, six heterogeneous networks, built from the known lncRNA-disease, lncRNA-gene, lncRNA-miRNA, disease-gene, disease-miRNA and gene-miRNA associations, were used to design a multi-layer network. Finally, the Laplace normalized random walk with restart algorithm is applied to this global network to predict the relationships between lncRNAs and diseases. Conclusions Ten-fold cross validation is used to evaluate the performance of LRWRHLDA. LRWRHLDA achieves an AUC of 0.98402, which is higher than that of the compared methods. Furthermore, LRWRHLDA can predict isolated disease-related lncRNAs (and isolated lncRNA-related diseases). The results for colorectal cancer, lung adenocarcinoma, stomach cancer and breast cancer have been verified by other studies. The case studies indicate that our method is effective. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04538-1.

Many studies have shown that although protein-coding sequences account for less than 2% of the human genome, under certain conditions most nucleotides are detectably transcribed [5]. Among the various types of non-protein-coding transcripts, long non-coding RNAs (lncRNAs) and microRNAs (miRNAs) have attracted increasing attention. LncRNAs are defined as non-coding RNAs longer than 200 nucleotides [6]; miRNAs are RNA molecules of about 19-25 nucleotides that exist widely in eukaryotes [7].
LncRNAs play an important role in a variety of biological mechanisms, such as epigenetic regulation, chromatin remodeling, gene transcription, protein transport and cell transport [8]. The functions of lncRNAs can be divided into the following categories: transcription interference; induction of chromatin remodeling and nucleosome modification; regulation of alternative splicing; generation of endogenous siRNAs; regulation of protein activity; structural or organizational functions; alteration of protein localization; and acting as precursors of small RNAs [5,9,10].
Many researchers have found that abnormal expression or function of lncRNAs is closely related to the occurrence of human diseases, including cancers and degenerative neurological diseases, which seriously endanger human health. For example, overexpression of the lncRNA HOTAIR increases breast cancer cell proliferation [11,12]. The lncRNA AFAP1-AS1 is abnormally expressed in cholangiocarcinoma, gallbladder cancer, hepatocellular carcinoma, gastric cancer, colorectal cancer and esophageal cancer [13]. The lncRNA HOXA-AS2 may be a biomarker for the treatment of gastric cancer [14]. There is a close correlation between the lncRNA PCGEM1 and osteoarthritis [15]. Therefore, lncRNAs can be used as important biomarkers for the diagnosis of diseases.
LncRNA-disease associations are identified either by biological experiments or by computational prediction. For example, based on biological experiments, Faghihi et al. [16] found that the expression of BACE1-AS can promote the rapid feed-forward regulation of β-secretase in Alzheimer's disease. Applying RT-PCR and Northern blot analysis, Hu et al. [17] confirmed that H19 may become a new target for anti-tumor therapy in colon cancer. The results of biological experiments are reliable; however, the experiments are time-consuming and costly.
Recently, computational models, which can integrate various data resources, have attracted increasing attention for identifying lncRNA-disease associations. For instance, based on a semi-supervised learning framework, the Laplacian regularized least squares model for lncRNA-disease association (LRLSLDA) was proposed to predict potential disease-related lncRNAs [18]. By integrating genome, regulome and transcriptome data, a naive Bayesian classifier was proposed to identify cancer-related lncRNAs [19]. Similarly, based on disease-gene cluster association scores, a machine learning method was proposed to predict potential lncRNA-disease associations [20]. Combining incremental principal component analysis (IPCA) and the random forest (RF) algorithm, a machine learning model called IPCARF was applied to predict lncRNA-disease associations [21].
Matrix factorization has also been widely used to find lncRNA-disease associations. For instance, a dual-network integrated logistic matrix factorization and Bayesian optimization model (DNILMF-LDA) has been used to predict lncRNA-disease associations [22]. In addition, weighted graph regularized collaborative matrix factorization (WGRCMF), dual sparse collaborative matrix factorization (DSCMF) and multi-label fusion collaborative matrix factorization (MLFCMF) were applied to construct models for predicting lncRNA-disease associations [23,24,25].
Based on the hypothesis that lncRNAs with similar functions may be related to diseases with similar phenotypes, researchers have proposed several computational methods based on biological networks to predict disease-related lncRNAs.
For example, by integrating the lncRNA similarity network, the disease similarity network and the lncRNA-disease association network, the BPLLDA model, which is based on paths of fixed lengths in a heterogeneous lncRNA-disease association network, was proposed to predict lncRNA-disease associations [26]. Furthermore, several random walk models on such heterogeneous networks were proposed to predict the relationship between lncRNAs and diseases [27,28,29]. Sun et al. [27] proposed the random walk with restart method on a lncRNA functional similarity network (RWRlncD). Gu et al. [28] proposed a global network-based random walk with restart algorithm on lncRNA seed nodes and disease seed nodes (GrWLDA). Based on a heterogeneous network built from the lncRNA, disease and gene similarity networks, the MHRWR model applied the random walk with restart algorithm on the global network [29].
Following the random walk with restart model, this paper proposes a new computational model based on the Laplace normalized random walk with restart algorithm in a heterogeneous network to predict the association between lncRNAs and diseases. First, the disease semantic similarity and the lncRNA, gene and miRNA functional similarities are calculated. Then, based on the known associations, the Gaussian interaction profile kernel similarities of lncRNAs and diseases, and of miRNAs and genes, are calculated. For each of the four entity types, the functional (or semantic) similarity is integrated with the corresponding Gaussian interaction profile kernel similarity to construct the isomorphic networks. Finally, the Laplace normalized random walk with restart algorithm on the heterogeneous network is developed to predict potential lncRNA-disease associations. Our method obtains an AUC of 0.98402 in ten-fold cross validation, and its performance is superior to that of other similar methods. Moreover, case studies on colorectal cancer, lung adenocarcinoma, stomach cancer and breast cancer also demonstrate the reliability of our model.
Because different databases may use different names for the same biomolecule, we perform error correction and data cleaning on the data sets obtained from the databases (mainly deleting duplicated, erroneous and empty records). In addition, the names of biomolecules of the same type from different databases are unified. To make the data more comprehensive and to further improve the accuracy and scope of the prediction, the union of the related data from the above databases was used.
For lncRNAs, the intersection of the three association sets (lncRNA-disease, lncRNA-gene and lncRNA-miRNA) obtained from all databases was used to construct the lncRNA similarity network, which gives 814 lncRNAs in this work (Fig. 1). Finally, 2476 miRNAs, 7986 genes and 217 diseases were retained for the study. We also summarize some basic characteristics of each X-Y association dataset (e.g., the average degree) in Table 1, where X and Y each stand for lncRNA, disease, gene or miRNA.

LncRNA functional similarity matrix
Similar to the method of Sun et al. [27], the functional similarity of two lncRNAs is computed as follows. Suppose lncRNA $l_1$ is associated with the disease group $D_1 = \{d_{1i} \mid 1 \le i \le a\}$ and lncRNA $l_2$ is associated with the disease group $D_2 = \{d_{2j} \mid 1 \le j \le b\}$. The similarity between a disease $d_{1i}$ and the disease group $D_2$ is defined as:
$$S(d_{1i}, D_2) = \max_{1 \le j \le b} Sim(d_{1i}, d_{2j}) \tag{1}$$
where $Sim(d_{1i}, d_{2j})$ is the disease semantic similarity of diseases $d_{1i}$ and $d_{2j}$. Then, the functional similarity between lncRNAs $l_1$ and $l_2$ is defined as:
$$LS(l_1, l_2) = \frac{\sum_{i=1}^{a} S(d_{1i}, D_2) + \sum_{j=1}^{b} S(d_{2j}, D_1)}{a + b} \tag{2}$$
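The lncRNA functional similarity can be illustrated with a minimal Python sketch. It assumes the best-match-average form usually associated with the approach of Sun et al.: each disease is scored against the other group by its maximum semantic similarity, and the scores are averaged over both groups. The function names, the toy disease labels and the similarity values are hypothetical and serve only as an illustration.

```python
def disease_to_group_similarity(disease, group, sim):
    """Best-match similarity between one disease and a group of diseases."""
    return max(sim(disease, other) for other in group)


def functional_similarity(group1, group2, sim):
    """Average of best-match similarities taken in both directions."""
    total = sum(disease_to_group_similarity(d, group2, sim) for d in group1)
    total += sum(disease_to_group_similarity(d, group1, sim) for d in group2)
    return total / (len(group1) + len(group2))


# Hypothetical semantic similarities (symmetric, self-similarity = 1).
TOY_SIM = {("dA", "dB"): 0.5, ("dA", "dC"): 0.2, ("dB", "dC"): 0.8}


def toy_sim(x, y):
    if x == y:
        return 1.0
    return TOY_SIM.get((x, y), TOY_SIM.get((y, x), 0.0))


# lncRNA 1 is linked to {dA, dB}, lncRNA 2 is linked to {dC}.
fs = functional_similarity(["dA", "dB"], ["dC"], toy_sim)
print(round(fs, 3))
```

The same routine would apply to miRNAs by passing the disease groups associated with two miRNAs instead.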

Disease semantic similarity matrix
The Disease Ontology (DO) provides an open-source ontology for the integration of biomedical data associated with human disease [46]. The terms in DO are diseases or disease-related concepts organized in a directed acyclic graph (DAG). Applying the method of Wang et al. [47,48], the semantic similarity of diseases is calculated as follows. Given a disease $T$, its DAG can be expressed as $DAG(T) = (Ans(T), E(T))$, where $Ans(T)$ is the node set containing $T$ and its ancestor nodes, and $E(T)$ is the set of edges that link each parent node directly to its child node. That is, $E(T)$ denotes the relationships between different diseases. Based on the DAG, the contribution of a disease term $d$ to the semantic value of disease $T$, and the semantic value of disease $T$ itself, are computed in two steps:
$$D_T(d) = \begin{cases} 1, & d = T \\ \max\{\Delta \cdot D_T(d') \mid d' \in \text{children of } d\}, & d \ne T \end{cases} \tag{3}$$
$$DV(T) = \sum_{d \in Ans(T)} D_T(d) \tag{4}$$
where $\Delta$ is the semantic contribution attenuation factor, whose value ranges from 0 to 1. As the distance between disease $T$ and an ancestor disease increases, the contribution of that ancestor to the semantic value of $T$ gradually decreases. The semantic similarity between diseases $d_1$ and $d_2$ is calculated by Eq. (5):
$$Sim(d_1, d_2) = \frac{\sum_{t \in Ans(d_1) \cap Ans(d_2)} \left(D_{d_1}(t) + D_{d_2}(t)\right)}{DV(d_1) + DV(d_2)} \tag{5}$$
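A small Python sketch may help to make the DAG-based semantic similarity concrete. It assumes the usual form of the Wang et al. measure (a term contributes 1 to itself, and each step towards an ancestor multiplies the contribution by the attenuation factor, keeping the maximum over paths). The toy ontology, the term names and the value 0.5 for the attenuation factor are assumptions made only for this example.

```python
DELTA = 0.5  # attenuation factor; 0.5 is an assumed value for this example

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "cancer": ["root"],
    "gi_disease": ["root"],
    "colon_cancer": ["cancer", "gi_disease"],
    "lung_cancer": ["cancer"],
}


def semantic_contributions(term, parents=PARENTS, delta=DELTA):
    """Map every ancestor of `term` (and `term` itself) to its contribution."""
    contrib = {term: 1.0}
    frontier = [term]
    while frontier:
        node = frontier.pop()
        for parent in parents[node]:
            value = delta * contrib[node]
            # Keep the largest contribution over all paths to this ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def semantic_similarity(d1, d2):
    """Shared contributions divided by the sum of both semantic values."""
    c1 = semantic_contributions(d1)
    c2 = semantic_contributions(d2)
    shared = set(c1) & set(c2)
    numerator = sum(c1[t] + c2[t] for t in shared)
    return numerator / (sum(c1.values()) + sum(c2.values()))


sim = semantic_similarity("colon_cancer", "lung_cancer")
print(sim)
```

In this toy ontology the two cancers share the ancestors `cancer` and `root`, so their similarity lies strictly between 0 and 1, while any term compared with itself scores 1.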

MiRNA functional similarity matrix
Similar to the method of Wang et al. [47], the functional similarity of two miRNAs is defined as follows. Assume that miRNA $m_1$ is associated with the disease group $D_3 = \{d_{3k} \mid 1 \le k \le c\}$ and miRNA $m_2$ is associated with the disease group $D_4 = \{d_{4z} \mid 1 \le z \le e\}$. The similarity between a disease $d_{3k}$ and the disease group $D_4$ is defined as:
$$S(d_{3k}, D_4) = \max_{1 \le z \le e} Sim(d_{3k}, d_{4z}) \tag{6}$$
and the functional similarity between miRNAs $m_1$ and $m_2$ is computed by Eq. (7):
$$MS(m_1, m_2) = \frac{\sum_{k=1}^{c} S(d_{3k}, D_4) + \sum_{z=1}^{e} S(d_{4z}, D_3)}{c + e} \tag{7}$$

Gene function similarity matrix
The Gene Ontology (GO) database is the world's largest source of information on the functions of genes [49]. For a GO node $A$, $DAG(A) = (Ans(A), E(A))$ is its directed acyclic graph, where $Ans(A)$ is the set of all ancestors of node $A$ (including $A$ itself) and $E(A)$ is the set of edges connecting the nodes in the DAG. For any GO node $t$ that is an ancestor of $A$ or $A$ itself, the contribution $S_A(t)$ of $t$ to $A$ is defined by Eq. (8):
$$S_A(t) = \begin{cases} 1, & t = A \\ \max\{\Delta \cdot S_A(t') \mid t' \in \text{children of } t\}, & t \ne A \end{cases} \tag{8}$$
where $\Delta$ is the semantic contribution attenuation factor, whose value ranges from 0 to 1. As the distance between node $A$ and an ancestor node increases, the contribution of that ancestor to the semantic value of $A$ gradually decreases. The semantic value $SV(A)$ of node $A$ is defined as:
$$SV(A) = \sum_{t \in Ans(A)} S_A(t) \tag{9}$$
Then the semantic similarity of nodes $A$ and $B$ is calculated by Eq. (10):
$$Sim(A, B) = \frac{\sum_{t \in Ans(A) \cap Ans(B)} \left(S_A(t) + S_B(t)\right)}{SV(A) + SV(B)} \tag{10}$$
The similarity between a GO node $go$ and a GO node set $G = \{go_1, go_2, \ldots, go_f\}$ is defined as:
$$Sim(go, G) = \max_{1 \le i \le f} Sim(go, go_i) \tag{11}$$
Assuming that the GO term sets annotating genes $G_1$ and $G_2$ are $GO_1 = \{go_{11}, go_{12}, \ldots, go_{1m}\}$ and $GO_2 = \{go_{21}, go_{22}, \ldots, go_{2n}\}$, respectively, the similarity of the two genes $G_1$ and $G_2$ is calculated by Eq. (12) [50]:
$$GS(G_1, G_2) = \frac{\sum_{i=1}^{m} Sim(go_{1i}, GO_2) + \sum_{j=1}^{n} Sim(go_{2j}, GO_1)}{m + n} \tag{12}$$

Gaussian interaction profile kernel similarity for lncRNAs and diseases
The matrices LS, DS, MS and GS contain many zeros. This sparsity may make the prediction results inaccurate. To avoid this, we introduce the Gaussian interaction profile kernel similarity [51,52].
First, the $m \times n$ matrix LD represents the associations between lncRNAs and diseases. Its elements are only 0 and 1: $LD(i, j) = 1$ if lncRNA $l_i$ is known to be associated with disease $d_j$, and $LD(i, j) = 0$ otherwise.
In the same way, we define the lncRNA-miRNA association matrix LM, the lncRNA-gene association matrix LG, the disease-gene association matrix DG, the miRNA-gene association matrix MG and the miRNA-disease association matrix MD.
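The Gaussian interaction profile kernel similarity of lncRNAs $l_i$ and $l_j$, denoted here by $K_L$, is defined as:
$$K_L(l_i, l_j) = \exp\left(-r_l \lVert IP(l_i) - IP(l_j) \rVert^2\right) \tag{14}$$
$$r_l = r'_l \Big/ \left(\frac{1}{m} \sum_{i=1}^{m} \lVert IP(l_i) \rVert^2\right) \tag{15}$$
where $IP(l_i)$ is a binary vector given by the ith row of the lncRNA-disease association matrix LD, and $m$ is the number of lncRNAs. $r_l$ is the kernel bandwidth, and $r'_l$ is the parameter that regulates it. Following previous research, $r'_l$ is set to 1.
Similarly, the Gaussian interaction profile kernel similarity of diseases $d_i$ and $d_j$, denoted here by $K_D$, is defined as:
$$K_D(d_i, d_j) = \exp\left(-r_d \lVert IP(d_i) - IP(d_j) \rVert^2\right) \tag{16}$$
$$r_d = r'_d \Big/ \left(\frac{1}{n} \sum_{i=1}^{n} \lVert IP(d_i) \rVert^2\right) \tag{17}$$
where $IP(d_i)$ is a binary vector given by the ith column of the lncRNA-disease association matrix LD, and $n$ is the number of diseases. $r_d$ is the kernel bandwidth, and its regulation parameter $r'_d$ is set to 1.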
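The kernel computation can be sketched in a few lines of Python. The sketch assumes the standard form of the Gaussian interaction profile kernel, in which the bandwidth is the regulation parameter divided by the mean squared norm of the interaction profiles. The function name and the 3 × 4 toy association matrix are hypothetical.

```python
import math


def gaussian_kernel_similarity(profiles, r_prime=1.0):
    """Pairwise Gaussian interaction profile kernel.

    profiles: list of equal-length 0/1 lists, one interaction profile per entity.
    Assumes at least one profile is non-zero, otherwise the bandwidth is undefined.
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(x * x for x in p) for p in profiles) / n
    bandwidth = r_prime / mean_sq_norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-bandwidth * sq_dist)
    return kernel


# Hypothetical association matrix: 3 lncRNAs (rows) x 4 diseases (columns).
LD = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
]

K_L = gaussian_kernel_similarity(LD)                            # row profiles
K_D = gaussian_kernel_similarity([list(c) for c in zip(*LD)])  # column profiles
```

Row profiles give the lncRNA kernel and column profiles give the disease kernel; applying the same function to the rows and columns of MG would give the miRNA and gene kernels.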

Gaussian interaction profile kernel similarity for MiRNAs and genes
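The Gaussian interaction profile kernel similarity of miRNAs and genes is calculated in the same way as that of lncRNAs and diseases, but using the association matrix MG. Denoting the two kernels here by $K_M$ and $K_G$, we define:
$$K_M(m_i, m_j) = \exp\left(-r_m \lVert IP(m_i) - IP(m_j) \rVert^2\right) \tag{18}$$
$$r_m = r'_m \Big/ \left(\frac{1}{h} \sum_{i=1}^{h} \lVert IP(m_i) \rVert^2\right) \tag{19}$$
$$K_G(g_i, g_j) = \exp\left(-r_g \lVert IP(g_i) - IP(g_j) \rVert^2\right) \tag{20}$$
$$r_g = r'_g \Big/ \left(\frac{1}{k} \sum_{i=1}^{k} \lVert IP(g_i) \rVert^2\right) \tag{21}$$
where $IP(m_i)$ is a binary vector given by the ith row of the matrix MG, and $h$ is the number of miRNAs; $IP(g_i)$ is a binary vector given by the ith column of the matrix MG, and $k$ is the number of genes. $r_m$ and $r_g$ are the kernel bandwidths, and their regulation parameters $r'_m$ and $r'_g$ are both set to 1.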

Integration of similarities between lncRNAs, miRNAs, genes, and diseases
For each entity type, the functional or semantic similarity (LS for lncRNAs, DS for diseases, MS for miRNAs, GS for genes) is integrated with the corresponding Gaussian interaction profile kernel similarity to obtain the final similarity matrices LL, DD, MM and GG. In this integration, NL is the set of lncRNAs with no functional similarity to any other lncRNA, ND is the set of diseases with no semantic similarity to any other disease, NM is the set of miRNAs with no functional similarity to any other miRNA, and NG is the set of genes with no functional similarity to any other gene. By definition, LL, DD, MM and GG are symmetric.
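The integration step can be sketched as follows, under explicit assumptions. The role of the sets NL, ND, NM and NG suggests that entities without any functional similarity fall back to the Gaussian kernel value; that fallback, and the rule used for all other pairs, are assumptions of this sketch rather than the exact definition. The combination rule is therefore passed in as a parameter, and the averaging used in the example is only one possible choice. All names and values are hypothetical.

```python
def integrate_similarity(functional, gaussian, combine):
    """Merge a functional similarity matrix with a Gaussian kernel matrix.

    Pairs involving an entity with no functional similarity to any other
    entity use the Gaussian kernel value (assumed fallback); all other pairs
    use combine(functional_value, gaussian_value).
    """
    n = len(functional)
    isolated = {
        i for i in range(n)
        if all(functional[i][j] == 0 for j in range(n) if j != i)
    }
    integrated = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i in isolated or j in isolated:
                integrated[i][j] = gaussian[i][j]
            else:
                integrated[i][j] = combine(functional[i][j], gaussian[i][j])
    return integrated


# Hypothetical 3 x 3 matrices; entity 2 has no functional similarity to others.
LS = [[1.0, 0.6, 0.0], [0.6, 1.0, 0.0], [0.0, 0.0, 1.0]]
KL = [[1.0, 0.4, 0.2], [0.4, 1.0, 0.3], [0.2, 0.3, 1.0]]

# Averaging is an illustrative choice of combination rule, not a given.
LL = integrate_similarity(LS, KL, combine=lambda f, g: (f + g) / 2)
```

Because both inputs are symmetric and the rule treats i and j alike, the integrated matrix stays symmetric, which matches the property stated for LL, DD, MM and GG.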

The heterogeneous network
Based on the integrated lncRNA similarity matrix LL, disease similarity matrix DD, miRNA similarity matrix MM and gene similarity matrix GG, four isomorphic networks, namely the lncRNA, disease, gene and miRNA similarity networks, were constructed, as shown in Fig. 2. In addition, a heterogeneous network linking these four similarity networks was built from the six association matrices LD, LM, LG, MD, MG and DG, as shown in Fig. 3.

The random walk with restart
The random walk with restart (RWR) on the heterogeneous network, used to predict lncRNA-disease associations, is defined as follows [53]:
$$P_{t+1} = (1 - \lambda) W^{T} P_t + \lambda P_0 \tag{26}$$
where $P_0$ is the initial probability vector and $P_t$ is the probability vector whose ith element is the probability of finding the random walker at node $i$ at step $t$. $\lambda$ is the restart probability, whose value ranges from 0 to 1. $W$ is the probability transition matrix, and $W_{ij}$ denotes the transition probability from node $i$ to node $j$. When the $L_1$ norm of the difference between $P_{t+1}$ and $P_t$ is less than $10^{-6}$, the walk is considered to have reached a stable state, and the stable probability $P_\infty$ is obtained. In this paper, the probability transition matrix $W$ is constructed as a $4 \times 4$ block matrix that contains four intra-transition matrices and twelve inter-transition matrices. $W_{LL}$ is the intra-transition matrix of the lncRNA similarity network. $W_{DD}$, $W_{MM}$ and $W_{GG}$ are defined analogously and are the intra-transition matrices of the disease, miRNA and gene similarity networks, respectively.

Laplacian normalization
Given a matrix $A = (A(i, j))$, the diagonal matrix $D$ is defined as follows: if $i = j$, $D(i, j)$ equals the sum of the ith row of $A$; otherwise $D(i, j) = 0$. The Laplace normalization of $A$ is then defined from $A$ and $D$ following [54,55]. On this basis, $W_{LM}$ and $W_{LL}$ are obtained in two steps: first the probability of transition from lncRNA $l_i$ to miRNA $m_j$, and then the probability of transition from lncRNA $l_i$ to lncRNA $l_j$. Here $P_{LM}$ ($P_{LG}$, $P_{LD}$) is the parameter representing the transition probability from the lncRNA similarity network to the miRNA (gene, disease) similarity network, and its value ranges from 0 to 1. In addition, $P_{LM} = P_{ML}$, $P_{LG} = P_{GL}$, $P_{LD} = P_{DL}$, $P_{MG} = P_{GM}$, $P_{MD} = P_{DM}$ and $P_{GD} = P_{DG}$. The other intra-transition and inter-transition matrices are defined similarly. Applying the Laplace normalization, all elements of the probability transition matrix $W$ can be obtained.
The initial probability vector $P_0$ is built from the four sub-network vectors $U_{L0}$, $U_{M0}$, $U_{G0}$ and $U_{D0}$, weighted by $P_L$, $P_M$, $P_G$ and $1 - P_L - P_M - P_G$, respectively. These weights represent the importance of the lncRNA, miRNA, gene and disease similarity networks, and their values range from 0 to 1. $U_{L0}$ is the initial probability of the lncRNA similarity network: equal probabilities are assigned to all seed nodes in the lncRNA similarity network, and the sum of $U_{L0}$ is 1. The initial probabilities $U_{M0}$ and $U_{G0}$ are defined in the same way. $U_{D0}$ is the initial probability of the disease similarity network: for the disease $d$ of interest, the initial probability of $d$ is 1 and that of all other diseases is 0.
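As an illustration of the normalization step, the following Python sketch applies the symmetric form $D^{-1/2} A D^{-1/2}$ to a square similarity matrix, with $D$ the diagonal matrix of row sums. That this symmetric form is the one meant in [54,55] is an assumption of the sketch, and the handling of rectangular association matrices is deliberately left out. The function name and the toy matrices are hypothetical.

```python
import math


def laplacian_normalize(matrix):
    """Return D^{-1/2} A D^{-1/2} for a square matrix A.

    D is the diagonal matrix of row sums of A. Rows whose sum is zero are
    left as zero rows (and columns) instead of dividing by zero.
    """
    row_sums = [sum(row) for row in matrix]
    inv_sqrt = [1.0 / math.sqrt(s) if s > 0 else 0.0 for s in row_sums]
    return [
        [inv_sqrt[i] * value * inv_sqrt[j] for j, value in enumerate(row)]
        for i, row in enumerate(matrix)
    ]


# Hypothetical 2 x 2 lncRNA similarity matrix.
LL = [[1.0, 0.5], [0.5, 1.0]]
LL_norm = laplacian_normalize(LL)

# A matrix with an isolated node (zero row) stays well defined.
sparse_norm = laplacian_normalize([[0.0, 0.0], [0.0, 1.0]])
```

Unlike plain row normalization, this form keeps a symmetric input symmetric, since entry (i, j) is scaled by the degrees of both node i and node j.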
Finally, the Laplace normalized random walk with restart algorithm is used to compute the scores of related lncRNAs (see Fig. 3). The method is called LRWRHLDA (the Laplace normalized random walk with restart algorithm in heterogeneous networks to predict lncRNA-disease associations).
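The scoring step can be sketched in Python as a plain fixed-point iteration. The sketch assumes the update p ← (1 − λ)·Wᵀ·p + λ·p0, with W[i][j] read as the transition probability from node i to node j, and stops when the L1 change drops below 10⁻⁶. For brevity the toy network has only two layers (two lncRNAs and two diseases) instead of four, and its transition matrix, layer weights and function names are all hypothetical.

```python
def initial_vector(layer_sizes, layer_seeds, layer_weights):
    """Stack per-layer seed distributions, each scaled by its layer weight."""
    p0 = []
    for size, seeds, weight in zip(layer_sizes, layer_seeds, layer_weights):
        block = [0.0] * size
        for s in seeds:
            block[s] = weight / len(seeds)  # equal probability per seed node
        p0.extend(block)
    return p0


def random_walk_with_restart(W, p0, restart=0.7, tol=1e-6, max_iter=10000):
    """Iterate p <- (1 - restart) * W^T p + restart * p0 until the L1 change < tol."""
    n = len(p0)
    p = list(p0)
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(W[i][j] * p[i] for i in range(n)) + restart * p0[j]
            for j in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical transition matrix; nodes 0-1 are lncRNAs, nodes 2-3 are diseases.
W = [
    [0.0, 0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.5, 0.5, 0.0],
]

# Seed lncRNA 0 and disease 0, with equal (assumed) layer weights.
p0 = initial_vector([2, 2], [[0], [0]], [0.5, 0.5])
p_inf = random_walk_with_restart(W, p0, restart=0.7)
```

In the full model the lncRNA entries of the stable vector would be sorted in descending order to rank candidate lncRNAs for the seed disease.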

Performance evaluation
In this paper, ten-fold cross validation is used to evaluate the performance of our model. All known lncRNA-disease interactions are randomly divided into ten folds. In each experiment, nine folds are used as training samples and the remaining fold is used as test samples. After the test is completed, predicted scores are generated, and the test samples are ranked together with the unknown lncRNA-disease interactions. A test sample is counted as a true positive (TP) when its predicted relevance score is greater than the threshold, and as a false negative (FN) otherwise. Similarly, an unknown lncRNA-disease interaction is counted as a false positive (FP) when its predicted relevance score is greater than the threshold, and as a true negative (TN) otherwise. The true positive rate (TPR), the false positive rate (FPR), recall and precision are then calculated as follows:
$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$
$$Recall = \frac{TP}{TP + FN}, \qquad Precision = \frac{TP}{TP + FP}$$
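The counting rule for a single threshold can be written as a short Python sketch. The function names and the toy scores are hypothetical; returning a precision of 1.0 when nothing is predicted positive is a convention chosen for this example only.

```python
def confusion_counts(test_scores, unknown_scores, threshold):
    """Count TP/FN among held-out known pairs and FP/TN among unknown pairs."""
    tp = sum(1 for s in test_scores if s > threshold)
    fn = len(test_scores) - tp
    fp = sum(1 for s in unknown_scores if s > threshold)
    tn = len(unknown_scores) - fp
    return tp, fn, fp, tn


def rates(tp, fn, fp, tn):
    """TPR (= recall), FPR and precision from the four counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # Convention for this sketch: precision is 1.0 if nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    return {"TPR": tpr, "FPR": fpr, "recall": tpr, "precision": precision}


# Hypothetical predicted scores for one fold.
test_scores = [0.9, 0.8, 0.3]            # held-out known associations
unknown_scores = [0.7, 0.2, 0.1, 0.05]   # pairs with no known association

counts = confusion_counts(test_scores, unknown_scores, threshold=0.5)
metrics = rates(*counts)
```

Sweeping the threshold over all observed scores and plotting TPR against FPR, or precision against recall, yields the ROC and PR curves from which AUC and AUPR are computed.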

Comparison with different predicted methods using ten-fold cross validation
To compare LRWRHLDA with other models, the data in this paper are also applied to the BPLLDA model [26], the RWRlncD model [27], the GrWLDA model [28] and the MHRWR model [29].
The area under the PR curve (AUPR) is also used to evaluate the LRWRHLDA, BPLLDA [26], RWRlncD [27], GrWLDA [28] and MHRWR [29] models, in order to avoid overestimating the performance of these methods (see Fig. 6). It can be seen from Fig. 6 that the AUPR value of LRWRHLDA is also higher than that of the other models.
[Figure caption] The ROC curve and AUC of LRWRHLDA, RWRlncD, GrWLDA, BPLLDA and MHRWR in predicting lncRNA-disease associations by ten-fold cross validation
Fig. 6 The PR curve and AUPR of LRWRHLDA, RWRlncD, GrWLDA, BPLLDA and MHRWR in predicting lncRNA-disease associations by ten-fold cross validation
Table 3 The AUC and AUPR values when λ takes different values from 0.1 to 0.9, with the other parameters fixed

Effects of parameters
There are ten parameters in our model: the transition probabilities between networks, $P_{LM}$, $P_{LG}$, $P_{LD}$, $P_{MG}$, $P_{MD}$ and $P_{GD}$; the sub-network weights $P_L$, $P_M$ and $P_G$; and the restart probability $\lambda$. Because of the large number of parameters and our limited computing resources, we arbitrarily fixed nine of these parameters and only examined the impact of the restart probability $\lambda$ under ten-fold cross validation. The results are shown in Table 3. Based on the AUC index, the parameter $\lambda$ has less influence on the performance of LRWRHLDA when $\lambda = 0.7$. Based on the AUPR index, the AUPR value reaches its maximum when $\lambda = 0.9$. Overall, the results in Table 3 show that the restart probability $\lambda$ has a marked effect on our model.

Case studies on predicted lncRNA-disease associations
It is known that lncRNAs play critical roles in the development of many diseases. To evaluate the ability of LRWRHLDA to infer potential lncRNA-disease associations, we use all known lncRNA-disease associations in LD as training data and assess the associations predicted by our model.
The stable probability $P_\infty$ can be used as a measure of proximity to the seed lncRNAs. If $P_\infty$(lncRNA i) > $P_\infty$(lncRNA j), then lncRNA i is closer to the seed lncRNAs than lncRNA j in the lncRNA similarity network. All candidate lncRNAs can therefore be ranked according to $P_\infty$, and the top-ranked lncRNAs are expected to have a high probability of being associated with the disease of interest. The novel lncRNA-disease associations are ranked according to the stable probability of LRWRHLDA. To validate the predictions, we use the literature or the following databases: LncRNADisease [30], LncRNADisease v2.0 [31], MNDR v3.1 [34] and lnCAR [56]. Specifically, we list the top 10 lncRNAs associated with four diseases: colorectal cancer, lung adenocarcinoma, stomach cancer and breast cancer. The top 10 results according to $P_\infty$ are shown in Table 4 (for detailed results see Additional file 1: Table-S1).
Colorectal cancer is the third most common cancer diagnosed in the US. While the incidence and mortality of colorectal cancer have decreased owing to effective screening, the number of young patients diagnosed with colon cancer has increased for reasons that are still unclear [57]. Lung adenocarcinoma is one of the main types of lung cancer and belongs to non-small cell carcinoma. It occurs mainly in women and non-smokers [58]. Stomach cancer is the fifth most common cancer and the third most common cause of cancer death globally [59]. The vast majority of stomach cancers are adenocarcinomas, which show no obvious symptoms in the early stage. When symptoms appear, they often resemble those of chronic gastric diseases such as gastritis and gastric ulcers and are easily overlooked. Moreover, the early diagnosis rate of stomach cancer is still low. Breast cancer is a malignant tumor that occurs in the epithelial tissue of the breast. It has become a major public health problem, and its cause is not yet fully understood. Worldwide, breast cancer is an important cause of suffering and premature mortality among women [60].
In Table 4, in addition to the lncRNA-disease associations already recorded in the databases, six potential lncRNA-disease associations were confirmed in the literature: ENST00000535511-colorectal cancer, RP4-colorectal cancer, CTNNAP1-colorectal cancer, LINC01021-colorectal cancer, GMDS-AS1-lung adenocarcinoma and LINC01207-lung adenocarcinoma. These results demonstrate the predictive performance of the proposed method.

Case studies on predicted novel diseases and novel lncRNAs
Each disease in turn is treated as a novel disease: all of its known related lncRNAs are removed, and the model is then used to predict potential lncRNAs related to the disease. All candidate lncRNAs are ranked according to $P_\infty$, and lncRNAs with high scores are expected to be related to the investigated disease d. The top 10 results according to $P_\infty$ are listed in Table 5 (for detailed results see Additional file 2: Table-S2).
Analogously, the stable probability $P_\infty$ can also be used as a measure of proximity to the seed diseases. All candidate diseases are ranked according to $P_\infty$, and diseases with high scores are expected to be related to the investigated lncRNA. To evaluate the ability of our model to handle new lncRNAs, we analyzed two lncRNAs, H19 and HOTAIR. For each lncRNA, all of its known related diseases are removed before predicting potential diseases. The top 10 results according to $P_\infty$ are shown in Table 6 (for detailed results see Additional file 3: Table-S3).
From Table 5, thirty-five of the forty top-ten lncRNA associations for the four cancers were validated by the databases or the literature. The other five cancer-lncRNA associations, colorectal cancer-CARL, stomach cancer-AF117829.1, breast cancer-AP003486.1, lung adenocarcinoma-AC018413.1 and lung adenocarcinoma-TUBB2A, have not been confirmed by the databases or the literature. This suggests that our method can predict additional lncRNA-disease associations.
From Table 6, in both cases all top ten associated diseases were validated by the databases. In summary, LRWRHLDA performs well in predicting lncRNAs associated with novel diseases and diseases associated with novel lncRNAs.

Discussion
Many studies have shown that lncRNAs have an important influence on the physiological processes of diseases. Because traditional biological experiments are time-consuming and costly, it is necessary to develop computational models to predict the associations between lncRNAs and diseases.
In this paper, a new model, LRWRHLDA, based on the Laplace normalized random walk with restart algorithm in a heterogeneous network, was constructed to predict potential lncRNA-disease associations. Ten-fold cross validation was applied to evaluate the prediction performance of our method. In comparison with state-of-the-art prediction methods, our method achieves better performance in terms of AUC values. Moreover, case studies of colorectal cancer, lung adenocarcinoma, stomach cancer and breast cancer further demonstrate that it could be a useful method for predicting potential relationships between lncRNAs and diseases.
However, our method has some limitations. First, since the model has ten parameters, selecting and tuning them remains difficult. Second, because our model is based on four networks, the global network contains a large number of nodes, and the more nodes there are, the longer the random walk takes. In the future, we will continue to improve the model.

Conclusion
In this study, we proposed an effective method, LRWRHLDA, which is based on the Laplace normalized random walk with restart algorithm in a heterogeneous network, to predict potential lncRNA-disease associations. First, a heterogeneous network was constructed from the lncRNA, disease, miRNA and gene similarity networks and the association networks between them. Then, the probability transition matrix was calculated by Laplace normalization. Finally, the potential lncRNA-disease associations were predicted by the random walk with restart over the heterogeneous network. Furthermore, LRWRHLDA can predict isolated disease-related lncRNAs (and isolated lncRNA-related diseases). Our method was evaluated by ten-fold cross validation and case studies in comparison with other methods. The results show that our method has higher prediction accuracy.