A machine learning framework that integrates multi-omics data predicts cancer-related LncRNAs

Background Long non-coding RNAs (lncRNAs) are non-coding RNA molecules with transcripts longer than 200 nucleotides. LncRNAs have become novel candidate biomarkers for cancer diagnosis and prognosis. However, it is difficult to discover the true association mechanisms between lncRNAs and complex diseases. The unprecedented enrichment of multi-omics data and the rapid development of machine learning technology provide an opportunity to design a machine learning framework for studying the relationship between lncRNAs and complex diseases. Results In this article, we propose a new machine learning approach, namely LGDLDA (LncRNA-Gene-Disease association networks based LncRNA-Disease Association prediction), for predicting disease-related lncRNAs based on multi-omics data, machine learning methods and neural network neighborhood information aggregation. Firstly, LGDLDA calculates the similarity matrices of lncRNAs, genes and diseases. The similarity between lncRNAs is calculated from the lncRNA expression profile matrix, the lncRNA-miRNA interaction matrix and the lncRNA-protein interaction matrix. The gene similarity matrix is obtained from the lncRNA-gene association matrix and the gene-disease association matrix, and the disease similarity matrix is obtained from the disease ontology, the disease-miRNA association matrix, and Gaussian interaction profile kernel similarity. Secondly, LGDLDA integrates the neighborhood information in the similarity matrices by using the nonlinear feature learning of a neural network. Thirdly, LGDLDA uses embedded node representations to approximate the observed matrices. Finally, LGDLDA ranks candidate lncRNA-disease pairs and then selects potential disease-related lncRNAs. Conclusions Compared with existing lncRNA-disease prediction methods, our proposed method takes more critical information into account and improves the performance of cancer-related lncRNA prediction. Experiments on randomly split data show that LGDLDA is more stable than IDHI-MIRW, NCPLDA, LncDisAP and NCPHLDA. The results on different simulation data sets show that LGDLDA can accurately and effectively predict disease-related lncRNAs. Furthermore, we applied the method to three real cancer data sets, including gastric cancer, colorectal cancer and breast cancer, to predict potential cancer-related lncRNAs. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04256-8.


Background
Long non-coding RNAs (lncRNAs) are a type of non-coding RNA molecule with transcript length longer than 200 nucleotides [1,2]. Many studies have confirmed that the human genome contains massive amounts of lncRNA [3]. Much evidence indicates that lncRNAs, acting in the form of RNA, regulate gene expression at multiple levels (e.g., epigenetic regulation, genomic splicing, genomic imprinting, chromatin modification, transcriptional activation, transcriptional and post-transcriptional regulation) [4][5][6][7]. The aberrant expression of lncRNAs is involved in the proliferation, apoptosis, angiogenesis, and metastasis of tumors [8,9]. LncRNAs are closely related to the diagnosis, prognosis, prevention and treatment of complex diseases [10], and have become new candidate biomarkers for cancer diagnosis and prognosis [11].
The experimentally verified information about disease-related lncRNA is gradually increasing. A large number of databases have been published. The database LncRNA-Disease contains 3000 lncRNA-disease associations [12]. The database Lnc2Cancer has collected 1500 lncRNA-cancer entries [13]. Moreover, researchers have constructed lncRNA-related databases including NONCODE [14], lncRNAdb [15], LNCipedia [16], lncACTdb [17]. Although the research on lncRNA has progressed rapidly in recent years, the functions of most lncRNAs are still unclear. Bioinformatics calculation methods have been developed to predict the potential lncRNA-disease associations for biological experiment verifications. The calculation methods can greatly reduce the experimental cost and time for finding new disease-related lncRNAs [18,19].
Methods for predicting disease-related lncRNAs can be categorized into network-based approaches and machine learning-based approaches. A biological system is a highly complex heterogeneous network involving different molecules. Network-based approaches use multiple features including (but not limited to) lncRNA functional similarity, lncRNA-gene association, gene-gene interaction, gene-disease association, and molecular similarity to construct lncRNA similarity networks or lncRNA-disease heterogeneous networks, and then use network analysis methods (e.g., propagation algorithms and random walk theory) to predict potential lncRNA-disease associations [20]. RWRlncD constructed a unified network including a disease similarity network, a lncRNA functional similarity network, and a disease-lncRNA association network, and used Random Walk with Restart (RWR) to predict potential lncRNA-disease associations [21]. RWRHLD added information on miRNAs that interact with lncRNAs, further improving the accuracy of lncRNA-disease prediction [22]. LncRDNetFlow used a streaming algorithm to predict lncRNA-disease associations based on multi-omics networks [23]. However, the known lncRNA-disease association data are still insufficient, and these methods cannot be applied to diseases that have no known associated lncRNAs. To avoid these problems, researchers have attempted to combine known pathogenic gene-miRNA association data, miRNA-lncRNA association data and other data to predict lncRNA-disease associations. LncPriCNet used multiple features, including phenotype-gene relations and gene-gene interactions, to construct a multi-level composite network and then used similarity scores to predict lncRNA-disease associations [24]. Ganegoda et al. proposed a model for predicting potential disease-associated lncRNAs by integrating known cancer-associated lncRNAs information and multi-omics data including genomic, regulatory, and transcriptional bios data [25].
Recently, many computational models based on machine learning algorithms have been proposed to find potential lncRNA-disease associations. Lu et al. used inductive matrix completion and principal component analysis to predict potential lncRNA-disease associations [26]. Based on a review of existing research, Chen et al. hypothesized that functionally similar lncRNAs tend to be abnormally expressed in similar diseases, and developed a semi-supervised machine learning framework based on the Laplacian regularized least squares method (named LRLSLDA). Unfortunately, the method requires several parameters that are difficult to select effectively [27]. Wang et al. used lncRNA similarity data and disease similarity data to train a bagging support vector machine (SVM) classifier, and the trained SVM is implemented as a web server to predict potential disease-related lncRNAs [28]. You et al. proposed a method called LDASR to predict latent lncRNA-disease associations by using collaborative filtering and rotating forest [29]. These methods have achieved good results. However, they often use unmodified traditional machine learning methods, and the omics data they use are limited to two or three types. The recent accumulation of omics data associating lncRNAs with diseases, together with the development of machine learning and deep learning technologies, provides researchers with better opportunities to use supervised learning models to predict disease-related lncRNAs.
Meanwhile, modern medical research has shown that alterations of biological factors (e.g., miRNAs, proteins and genes) may directly or indirectly affect diseases. Earlier studies have shown that RNA-protein interactions regulate gene expression by controlling various post-transcriptional processes. LncRNAs regulate RNA-protein interactions by recruiting regulatory complexes [30,31], and the literature indicates that many lncRNAs also act as regulators of gene expression [32]. Wang et al. reported that a lncRNA-miRNA-disease interactive network could be a great addition to the biomedical research field [33]. Liu et al. reported that lncRNA-binding proteins play a key role in the development of many diseases [34]. The accumulated miRNA-disease associations can be used for disease treatment [35]. Considering the mechanisms by which lncRNAs regulate genes and by which biological factors influence diseases provides a better opportunity to obtain more information about lncRNA-disease associations.
Inspired by currently well-performing neural network technologies [36,37], we used multiple omics similarity matrices, neural network neighborhood information aggregation and a trained supervised learning model to extract association features from the lncRNA-gene-disease association network and predict disease-related lncRNAs. In this article, we propose a new machine learning framework named LGDLDA (LncRNA-Gene-Disease association networks based LncRNA-Disease Association prediction) for predicting disease-related lncRNAs based on multi-omics functional similarity data, machine learning methods and neural network neighborhood information aggregation. We collected data separately from three databases, LncRNADisease v2.0 [38], Lnc2Cancer [13] and MNDR v2.0 [39], and then combined them into a single data set. The diseases in this combined data set do not include gastric cancer, breast cancer, and prostate cancer. Additional file 1: Fig. S1 shows the data processing procedure for disease-lncRNA association instances. The combined data set contains 6000 disease-lncRNA association instances, of which 4000 were used for training and 2000 were used for validation. Firstly, LGDLDA calculates the similarity between lncRNAs from the lncRNA expression profile matrix, the lncRNA-miRNA interaction matrix and the lncRNA-protein interaction matrix. The gene similarity matrix is obtained from the lncRNA-gene association matrix and the gene-disease association matrix. The disease similarity matrix is obtained from the disease ontology, the disease-miRNA association matrix, and Gaussian interaction profile kernel similarity. Secondly, LGDLDA integrates neighborhood information by using the nonlinear feature learning of a neural network. Thirdly, LGDLDA uses embedded node representations to approximate the observed matrices. Finally, LGDLDA ranks candidate lncRNA-disease pairs and then selects potential disease-related lncRNAs. The stability test results show that LGDLDA is more robust, and the simulation data experiments show that LGDLDA performs better than four state-of-the-art methods in predicting lncRNA-disease associations.
LGDLDA can effectively predict potential cancer-related lncRNAs and provide more candidates for biological experimental verification. Most of the predicted cancer-related lncRNAs are supported by recent literature.

Results
The results are organized as follows. Firstly, we used randomly split samples to assess the robustness of each method. Secondly, we compared LGDLDA with four well-known state-of-the-art lncRNA-disease association prediction methods on a small lncRNA-disease association simulation network: NCPLDA [40], IDHI-MIRW [41], LncDisAP [42] and NCPHLDA [43]. Finally, LGDLDA was applied to three real cancer data sets to predict potential disease-related lncRNAs.

Comparison of method stability
Before comparing the performance of LGDLDA with the four well-known lncRNA-disease association prediction methods on small data, we need to evaluate the stability of these methods. We randomly divide the data set into two parts, Ω1 and Ω2. In the first step, based on the training set Ω1, we try different parameters and determine a parameter configuration with good performance. In the second step, we expect the selected parameter configuration to give accurate predictions on Ω2. We performed this experiment on a small lncRNA-disease association simulation network which contains 356 lncRNAs, 354 diseases, 132 genes, 736 known lncRNA-gene associations, 462 gene-disease associations and 2169 known lncRNA-disease association instances [41]. Ω1 contains 1446 lncRNA-disease association instances and Ω2 contains 723 lncRNA-disease association instances, which together make up the 2169 known instances. Two issues need to be considered: (i) Does the randomness of the split affect the stability of the method? (ii) Is the stability of LGDLDA better than that of NCPLDA [40], IDHI-MIRW [41], LncDisAP [42] and NCPHLDA [43]?
To address the two issues, we observed the performance of the methods in two experiments. In the first experiment, we performed 10 random splits on a comprehensive data set. For each randomly divided data set, we ran LGDLDA and calculated the AUC value. The AUC values for the 10 realizations are shown in Fig. 1. The results in Fig. 1 show that the random partition strategy has little effect on the performance of the method. In the second experiment, we performed 50 random splits on a comprehensive data set. For each randomly divided data set, we ran each method and calculated the AUC value. Based on these AUC values, we calculated the minimum, first quartile, median, third quartile and maximum, and drew box plots. The box plots in Fig. 2 show that the stability of LGDLDA is better than that of IDHI-MIRW, NCPLDA, LncDisAP and NCPHLDA. We also repeated the 10 random splits experiment and the 50 random splits experiment on a data set with 10% incorrect data. The AUC values for the 10 realizations on this data set are shown in Additional file 1: Fig. S2. The box plots from the 50 random splits experiment on this data set are shown in Additional file 1: Fig. S3.
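The five summary statistics behind each box plot can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `five_number_summary` and the AUC values are hypothetical and are not taken from the experiments, and the "inclusive" quartile convention is an assumption (plotting tools differ slightly in how they interpolate quartiles).

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of AUC values."""
    ordered = sorted(values)
    # method="inclusive" treats the values as the full population, so the
    # quartiles always lie inside the observed range.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


# Hypothetical AUC values from five random splits (illustrative only).
aucs = [0.90, 0.92, 0.91, 0.93, 0.94]
summary = five_number_summary(aucs)
print(summary)
```

A narrower gap between the first and third quartiles across splits indicates a more stable method, which is the comparison the box plots are meant to support.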

Comparison with four state-of-art methods on a small simulation data set
In this section, we compared LGDLDA with four well-known methods (i.e., NCPLDA, IDHI-MIRW, LncDisAP and NCPHLDA) on a small lncRNA-disease association simulation network which contains 356 lncRNAs, 354 diseases, 132 genes, 736 known lncRNA-gene associations, 462 gene-disease associations and 2169 known lncRNA-disease associations from breast cancer [41]. LncDisAP [42] and IDHI-MIRW [41] are prediction methods based on multiple biological datasets and the RWR algorithm. NCPHLDA [43] and NCPLDA [40] are network-based methods. We performed these experiments on a computer with an Intel i9-10900X CPU and 512 GB of RAM.
To avoid the small lncRNA-disease association simulation network favoring our own model, we ran each method on data that do not contain gene-related information (i.e., data without genes, lncRNA-gene associations, and gene-disease associations). Figure 3 shows the ROC curves and corresponding AUC values of LGDLDA and the four competing methods. As shown in Fig. 3, LGDLDA outperformed the other four methods in terms of AUC. The AUC of LGDLDA is 0.926, which is 0.035, 0.096, 0.163 and 0.116 higher than that of IDHI-MIRW, NCPLDA, LncDisAP and NCPHLDA, respectively. We also ran each method on data containing gene information. Figure 4 shows the ROC curves and AUC values of LGDLDA and the four competing methods. As shown in Fig. 4, LGDLDA again outperformed the other four methods in terms of AUC. The AUC of LGDLDA is 0.935, which is 0.067, 0.134, 0.205 and 0.131 higher than that of IDHI-MIRW, NCPLDA, LncDisAP and NCPHLDA, respectively. Because methods are often applied to incomplete data sets, we randomly removed 20% of the data and ran each method again. The ROC curves and AUC values of LGDLDA and the other four methods are shown in Fig. 5. LGDLDA achieved better performance than the other four methods in terms of AUC. The AUC of LGDLDA is 0.880, which is 0.034, 0.088, 0.053 and 0.208 higher than that of IDHI-MIRW, NCPLDA, LncDisAP and NCPHLDA, respectively. Although LGDLDA is affected by incomplete data, it still performs better than the other four methods. Compared with the four state-of-the-art methods, the results on different simulation data sets show that LGDLDA can accurately and effectively predict disease-related lncRNAs.
In order to observe whether it is necessary to include each omics data, we performed the experiment on the dataset with missing part of the omics data and recorded the AUC values, and compared with the experimental results on the complete multi-omics dataset. The experimental results are shown in Additional file 1: Table S1.

Application to cancer data and potential lncRNA-disease associations analysis
In this section, we applied LGDLDA to real cancer data including gastric cancer, colorectal cancer, and breast cancer. For a given disease, all known related lncRNAs are true labels, and the other lncRNAs are candidates for the disease. Inspired by the work of Guo et al. [29], we used the related information in the LncRNADisease database v2.0, DisGeNet, and LncACTdb to train LGDLDA, and other databases, including CRlncRNA [44], MNDR v2.0, LncRNAwiki [45], and Lnc2Cancer, were used to verify the results. We applied LGDLDA to the real cancer data, ranked the lncRNA-disease association scores in descending order, and then identified the top 15 potentially relevant lncRNAs for each cancer.
Gastric cancer is the second most common cancer in the world [46,47]. Accumulating evidence has demonstrated that many lncRNAs are dysregulated in gastric cancer [48,49]. It is therefore necessary to use computational methods to predict cancer-related lncRNAs. In the gastric cancer study, we used 1352 associations and gene-related associations from databases as positive samples. We randomly selected the same number of samples from the database as negative samples. We constructed the test data set by extracting gastric cancer-related lncRNAs from other databases. Recent literature supports 12 out of 15 potential gastric cancer-related lncRNAs. The confirming databases and supporting literature of these 15 cancer-related lncRNAs are shown in Table 1 and Additional file 1: Table S2, respectively. For example, Xu et al. [50] found that overexpression of ZFAS1 is significantly related to lymphatic metastasis and TNM staging. The overexpression of ZFAS1 leads to loss of control of the cell cycle, which in turn promotes the proliferation and migration of gastric cancer cells. Liu et al. reported that lncRNA H19 is aberrantly highly expressed in gastric cancer cell lines. Zai et al. reported that activated DANCR promotes the proliferation and invasion of gastric cancer cells [51]. LncRNA HOXA11-AS promotes the invasion and proliferation of gastric cancer by regulating chromatin modifiers [52]. A large number of studies have shown that lncRNAs can be used as biomarkers for the treatment of gastric cancer [53].
Breast cancer is the most common malignant tumor in women and the second leading cause of cancer death [54,55]. If cancer-related lncRNAs can be detected as early as possible and early intervention applied, the incidence of breast cancer could be greatly reduced. Recent literature supports 12 out of 15 potential breast cancer-related lncRNAs. The confirming databases and supporting literature of these 15 cancer-related lncRNAs are shown in Additional file 1: Table S3 and Additional file 1: Table S4, respectively. For example, Yang et al. found that overexpression of lncRNA BCRT1 can promote the M2 polarization of macrophages, thereby accelerating the development of breast cancer [56]. Schiemann reported that lncRNA BORG regulates the transcriptional repressive activity of TRIM28 to trigger the migration and invasion of potential breast cancer cells [57]. Spector et al. reported that lncRNA MaTAR25 affects the proliferation and metastasis of breast cancer cells by regulating the expression of the Tensin1 gene [58].
Prostate cancer is the second most common cancer in men and the fifth leading cause of death worldwide [59,60]. Recent literature supports 12 out of 15 potential prostate cancer-related lncRNAs. The confirming databases and supporting literature of these 15 cancer-related lncRNAs are shown in Additional file 1: Table S5 and Additional file 1: Table S6, respectively. For example, Zhao et al. [61] reported that overexpression of ANRIL promoted the proliferation and migration of prostate cancer cells. Li et al. reported that lncRNA SNHG1 enhanced the expression of CDK7 and promoted cell proliferation in prostate cancer by negatively regulating miR-199a-3p [62]. Zhang et al. reported that the androgen-reduced transcript of lncRNA GAS5 can promote the proliferation of prostate cancer [63].

Discussion
In the case studies, we found many potential cancer-related lncRNAs, most of which are supported by recent literature. In future biological experiments, it would be interesting to investigate the association mechanisms between the newly predicted lncRNAs and diseases.
Figure 6 shows a sub-network discovered by our proposed method LGDLDA. The sub-network contains some confirmed lncRNAs; PSORS1C3, PTCSC2 and UCC are predicted lncRNAs that have not yet been reported. We hypothesize that the rapidly increasing biological data (e.g., Lnc2Cancer and LncACTdb) bring more information, while LGDLDA combined with nonlinear mapping can capture the complex features in multi-omics data more accurately.
It should be noted that LGDLDA performs worst among the compared methods when only the top-ranked predictions are considered (FPR < 0.05 or, to a lesser extent, FPR < 0.1), so it may not be the best method when focusing on "top prediction". We believe that this is because the data set is too small, which affects the performance of the method. We propose two ideas to improve the performance of LGDLDA. The first idea is a warm start strategy: we apply LGDLDA to similar training data sets to obtain a well-performing parameter set β, and then further optimize β on the training set. The second idea is a stability selection strategy: we run LGDLDA multiple times to obtain multiple results, and then average these results to reduce the risk of overfitting caused by small data sets.
Finally, the real association mechanism between lncRNAs and disease is much more complicated than what we assumed. For example, the relationship between lncRNAs and complex diseases will change over time. We will try to design a new machine learning framework to analyze association data and time dynamic data simultaneously.

Conclusions
In this article, we proposed a novel machine learning framework, namely LGDLDA, to find cancer-related lncRNAs through integrative analysis of multi-omics data. Firstly, LGDLDA calculates the similarity matrices of lncRNAs, genes and diseases. LGDLDA calculates the similarity between lncRNAs from the lncRNA expression profile matrix, the lncRNA-miRNA interaction matrix and the lncRNA-protein interaction matrix. It obtains the gene similarity matrix from the lncRNA-gene association matrix and the gene-disease association matrix, and the disease similarity matrix from the disease ontology, the disease-miRNA association matrix, and Gaussian interaction profile kernel similarity. Secondly, LGDLDA integrates the neighborhood information in the similarity matrices by using the nonlinear feature learning of a neural network. Thirdly, LGDLDA uses embedded node representations to approximate the observed matrices. Finally, LGDLDA ranks candidate lncRNA-disease pairs and then selects potential disease-related lncRNAs.
LGDLDA incorporates prior knowledge of biological network topology, including lncRNA similarity networks, the lncRNA-gene association network, the gene-disease association network, disease semantic similarity networks, and the lncRNA-disease association network. In this framework, a deep learning model was used to generate feature matrices. In model optimization, the final optimization problem is a popular matrix completion problem, which can be solved using convex optimization methods. In summary, the method considers more critical information and improves the performance of cancer-related lncRNA prediction.

Overview of LGDLDA
In this section, we introduce the main steps of the LGDLDA method. (1) LGDLDA uses multiple association similarity matrices (including lncRNA functional similarities, gene-disease associations, disease similarities, lncRNA-disease associations, and lncRNA-gene associations) to build the lncRNA-gene-disease association network. (2) Based on the matrices generated in the first step, LGDLDA combines the association similarity matrices with a neural network to calculate the neighborhood information of lncRNAs and diseases, and embeds it into low-dimensional node representations. (3) Inspired by the reconstruction matrix algorithm in NNHLDA [36], LGDLDA uses the low-dimensional node representations to generate projection matrices that approximate the observed matrices, and learns as much information from the original matrix as possible while optimizing the loss function. (4) LGDLDA sorts the elements in the learned association matrix and selects the top values to predict disease-related lncRNAs. Figure 7 shows the flowchart of the LGDLDA method.
Fig. 7 Flowchart of the LGDLDA method. (1) Construction of the lncRNA-gene-disease association network from multiple association similarity matrices. (2) Neighborhood information aggregation and embedding into low-dimensional node representations. (3) Reconstruction of the original matrix from the embedded representations by optimizing the loss function. (4) Ranking of the learned association matrix to predict cancer-related lncRNAs.
Yuan et al. BMC Bioinformatics (2021) 22:332

Datasets
In this section, we introduce the notation used below. $S \in \mathbb{R}^{m \times m}$ represents the lncRNA functional similarity matrix and $D \in \mathbb{R}^{n \times n}$ represents the disease similarity matrix, where $m$ and $n$ denote the number of lncRNAs and diseases, respectively. $A \in \mathbb{R}^{m \times n}$ represents the lncRNA-disease association matrix, in which rows represent lncRNAs and columns represent diseases. For each entry $a_{ij}$ in $A$, $a_{ij} = 1$ if disease $j$ is related to lncRNA $i$; otherwise, $a_{ij} = 0$. Let $A_{lg} \in \mathbb{R}^{m \times k}$ be the lncRNA-gene association matrix and $A_{gd} \in \mathbb{R}^{k \times n}$ be the gene-disease association matrix, where $k$ represents the number of genes. To calculate the functional similarity networks of lncRNAs, LGDLDA uses the lncRNA expression profile matrix, the lncRNA-protein function association matrix and the lncRNA-miRNA association matrix. To calculate the disease similarity network, LGDLDA uses disease information, the protein-disease association matrix and the miRNA-disease association matrix. All lncRNAs and diseases are annotated with standard corresponding IDs.
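As a small illustration of this notation, the binary association matrix A can be assembled from a list of known lncRNA-disease pairs. The sketch below is only an example: the helper name `build_association_matrix` is hypothetical, and the three pairs are drawn from associations mentioned in the case studies rather than from the full training data.

```python
def build_association_matrix(pairs, lncrnas, diseases):
    """Return an m x n 0/1 matrix A with A[i][j] = 1 if lncRNA i is linked to disease j."""
    row = {name: i for i, name in enumerate(lncrnas)}
    col = {name: j for j, name in enumerate(diseases)}
    A = [[0] * len(diseases) for _ in lncrnas]
    for lnc, dis in pairs:
        A[row[lnc]][col[dis]] = 1
    return A


# Illustrative subset of associations (rows: lncRNAs, columns: diseases).
lncrnas = ["ANRIL", "H19", "ZFAS1"]
diseases = ["gastric cancer", "prostate cancer"]
pairs = [
    ("ZFAS1", "gastric cancer"),
    ("H19", "gastric cancer"),
    ("ANRIL", "prostate cancer"),
]
A = build_association_matrix(pairs, lncrnas, diseases)
print(A)
```

The lncRNA-gene and gene-disease matrices can be built in the same way by swapping the row and column identifiers.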

Constructing lncRNA/disease similarity network
Since the Pearson correlation coefficient is easily affected by outliers, and outliers are inevitably included in the data, we used the biweight midcorrelation (BM) coefficient [70,71], which is more robust to outliers than the Pearson correlation coefficient. We computed BM coefficients between lncRNAs and constructed the lncRNA similarity weighting network LncSm1. The BM value ranges from -1 to 1. The stronger the correlation, the larger the absolute value of BM.
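For readers unfamiliar with the BM coefficient, a minimal sketch of its standard definition is given below: deviations from the median are down-weighted by a biweight function of their distance in units of 9 times the median absolute deviation (MAD), and points beyond that distance receive zero weight. The function names are illustrative, and this is not the implementation used in LGDLDA; the sketch also assumes the MAD of each profile is non-zero.

```python
import math
import statistics


def _weighted_deviations(x):
    """Return unit-norm, biweight-weighted deviations of x from its median."""
    med = statistics.median(x)
    mad = statistics.median([abs(v - med) for v in x])
    if mad == 0:
        raise ValueError("MAD is zero; the biweight midcorrelation is undefined")
    terms = []
    for v in x:
        u = (v - med) / (9 * mad)
        # Points with |u| >= 1 are treated as outliers and get zero weight.
        w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0
        terms.append((v - med) * w)
    norm = math.sqrt(sum(t * t for t in terms))
    return [t / norm for t in terms]


def bicor(x, y):
    """Biweight midcorrelation between two equally long expression profiles."""
    return sum(a * b for a, b in zip(_weighted_deviations(x), _weighted_deviations(y)))


# Toy expression profiles (illustrative values only).
profile_a = [1.0, 2.0, 3.0, 4.0, 5.0]
profile_b = [2.0, 4.0, 6.0, 8.0, 10.0]
print(bicor(profile_a, profile_b))
```

Applying `bicor` to every pair of lncRNA expression profiles gives a symmetric matrix that plays the role of LncSm1.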
The radial basis function (RBF) Gaussian kernel was applied to lncRNA-miRNA interactions to obtain the Gaussian interaction profile kernel similarity [72], from which we constructed the lncRNA similarity weighting network LncSm2. The similarity network can be defined as follows:

$$\mathrm{LncSm2}(l_i, l_j) = \exp\left(-\alpha_l \left\lVert \mathrm{GIP}_{lm}(l_i) - \mathrm{GIP}_{lm}(l_j) \right\rVert^2\right), \qquad \alpha_l = \alpha'_l \Big/ \left(\frac{1}{N_l} \sum_{i=1}^{N_l} \left\lVert \mathrm{GIP}_{lm}(l_i) \right\rVert^2\right)$$

where $\mathrm{GIP}_{lm}(l_i)$ represents the lncRNA-miRNA interaction profile of lncRNA $l_i$, a binary vector in which 1 represents the presence of an interaction between $l_i$ and a miRNA and 0 represents its absence; $\alpha_l$ is the weight factor used to regulate the kernel bandwidth; the parameter $\alpha'_l$ is set to 0.5 empirically; and $N_l$ denotes the total number of lncRNAs. The lncRNA-protein interaction-based Gaussian similarity of lncRNA pairs is calculated in the same way, with $\mathrm{GIP}_{lp}(l_i)$, a binary vector, representing the lncRNA-protein interaction profile. Using this method, we constructed the similarity network LncSm3.
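A minimal sketch of this kernel is given below. It assumes the commonly used Gaussian interaction profile form, in which the similarity is the exponential of minus the bandwidth times the squared Euclidean distance between two binary profiles, and the bandwidth is the empirical factor (0.5 here) divided by the mean squared norm of all profiles. The function name `gip_kernel` and the toy profiles are illustrative and not part of LGDLDA.

```python
import math


def gip_kernel(profiles, alpha_prime=0.5):
    """Gaussian interaction profile kernel similarity between binary profiles.

    profiles[i] is the 0/1 interaction vector of lncRNA i (e.g. against miRNAs).
    Assumes at least one profile contains an interaction.
    """
    n = len(profiles)
    # Bandwidth: alpha' normalised by the average squared norm of the profiles.
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n
    alpha = alpha_prime / mean_sq_norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-alpha * sq_dist)
    return kernel


# Three toy lncRNAs profiled against three miRNAs (illustrative only).
toy_profiles = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
K = gip_kernel(toy_profiles)
print(K)
```

Identical profiles receive a similarity of 1, and the similarity decays toward 0 as two lncRNAs share fewer interaction partners. The same function can be reused for lncRNA-protein or disease-miRNA profiles by changing the input vectors.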
We first used the R package "DOSE" to compute the correlation coefficients between diseases [73,74], and then built a weighted disease similarity network DisSm1. We used disease-miRNA associations to calculate the Gaussian interaction profile kernel similarity between diseases $d_i$ and $d_j$, and then constructed a weighted disease similarity network DisSm2:

$$\mathrm{DisSm2}(d_i, d_j) = \exp\left(-\alpha_d \left\lVert \mathrm{GIP}_{dm}(d_i) - \mathrm{GIP}_{dm}(d_j) \right\rVert^2\right)$$

where $\mathrm{GIP}_{dm}(d_i)$ denotes the disease-miRNA interaction profile of disease $d_i$, which is a binary vector, and $\alpha_d$ is the kernel bandwidth factor, defined analogously to $\alpha_l$.

Constructing lncRNA/disease topological similarity networks
In order to overcome the loss of information caused by the fusion of similarity networks (i.e., LncSm1, LncSm2, and LncSm3 or DisSm1 and DisSm2), the idea of network diffusion is employed to generate the topological similarity networks. Motivated by the work of Zhang et al. [41], the RWR was applied to each similarity network to construct topological similarity network. RWR algorithm is a widely used complex biological network analysis method [41,75,76]. The details of constructing lncRNA/disease topological similarity networks were shown in Additional file 1. LTS represents the lncRNA similarity network LncTSN, and DTS represents the disease similarity network DisTSN.
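The diffusion step can be illustrated with a generic random walk with restart. The sketch below is written under stated assumptions: it uses the standard update p(t+1) = (1 - r) P p(t) + r p(0) on a column-normalized similarity matrix, and the restart probability r = 0.7, the tolerance and the function names are illustrative choices, since the actual settings are given only in Additional file 1.

```python
def column_normalize(W):
    """Scale each column of a square matrix so that it sums to 1 (zero columns stay zero)."""
    n = len(W)
    col_sums = [sum(W[i][j] for i in range(n)) for j in range(n)]
    return [[W[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def random_walk_with_restart(W, seed, restart=0.7, tol=1e-10, max_iter=1000):
    """Return the stationary visiting probabilities of a walker restarting at `seed`."""
    P = column_normalize(W)
    n = len(W)
    p0 = [1.0 if i == seed else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(P[i][j] * p[j] for j in range(n)) + restart * p0[i]
               for i in range(n)]
        # Stop once the L1 change between iterations is negligible.
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy similarity network: node 0 - node 1 - node 2 (a path graph).
W = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
stationary = random_walk_with_restart(W, seed=0)
print(stationary)
```

Running the walk once from every node and stacking the stationary vectors gives a matrix whose rows describe each node's position in the network topology, which is the kind of information a topological similarity network is built from.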

Node embedding
For a node representing a lncRNA or a disease in the heterogeneous network, its characteristic information can be summarized from the information of its neighbors. For example, the features of a lncRNA can be aggregated from related lncRNAs, genes and diseases. Thus, we can use sufficient relevant information (related lncRNA, gene and disease information) to accurately represent the features of a lncRNA. The aggregation yields $\mathit{lnce}'_i \in \mathbb{R}^{2d}$, $\mathit{dise}'_i \in \mathbb{R}^{2d}$ and $\mathit{gee}'_i \in \mathbb{R}^{2d}$, the embeddings of lncRNA $i$, disease $i$ and gene $i$, respectively. The initial representations of lncRNA, disease and gene nodes ($\mathit{lnce}_i \in \mathbb{R}^{d}$, $\mathit{dise}_i \in \mathbb{R}^{d}$ and $\mathit{gee}_i \in \mathbb{R}^{d}$) are randomly set. By considering both a node's neighbor information and its own features, we obtain the network topology feature information of each node and then calculate the feature vector of this node.
A neural network obtains more powerful feature expression capability by using nonlinear activation functions. Motivated by the work of Zeng et al. [36], we use the ReLU activation function $\sigma(x) = \mathrm{ReLU}(x) = \max(x, 0)$, applied to a linear transformation of the aggregated embedding in which $W$ and $b$ denote the parameters of the neural network. The nodes are then embedded in low-dimensional vectors $e''_i$ and normalized, where $e''_i$ stands for either $\mathit{lnce}''_i$, $\mathit{dise}''_i$ or $\mathit{gee}''_i$. Thus, we used a single-layer neural network to non-linearly transform the nodes' representations and obtained a new embedding representation.
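The transformation described here can be sketched as a single dense layer followed by normalization. Two points in the sketch are assumptions rather than details confirmed by the text: the layer is taken to be ReLU(W e' + b) acting on the 2d-dimensional aggregated embedding, and the normalization is taken to be division by the Euclidean (L2) norm. The function name `embed_node` and the toy weights are hypothetical.

```python
import math


def relu(x):
    """ReLU(x) = max(x, 0)."""
    return max(x, 0.0)


def embed_node(e_prime, W, b):
    """Map an aggregated embedding e' to a normalized low-dimensional embedding.

    W has one row per output dimension; b has one bias per output dimension.
    """
    hidden = [relu(sum(w * v for w, v in zip(row, e_prime)) + bias)
              for row, bias in zip(W, b)]
    norm = math.sqrt(sum(h * h for h in hidden))  # assumed L2 normalization
    return [h / norm for h in hidden] if norm > 0 else hidden


# Toy example: a 4-dimensional aggregated embedding mapped to 2 dimensions.
e_prime = [1.0, -2.0, 0.5, 0.0]
W = [[3.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 8.0, 0.0]]
b = [0.0, 0.0]
embedding = embed_node(e_prime, W, b)
print(embedding)
```

In a trained model, W and b would be learned by gradient descent together with the other parameters rather than fixed by hand as in this toy example.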

Training and evaluation
In machine learning, a model contains many parameters, and training data are used to determine their optimal values through optimization. The optimization goal is to make the difference between the predicted value and the target value (i.e., the loss function) as small as possible. Here, the information loss function measures the difference between the reconstructed matrix and the original information matrix. $E \in \mathbb{R}^{p \times q}$ denotes the information mapping matrices, which extract the main features of the nodes from the embedded node representations, and the matrix $EE^{T}$ is used to enforce symmetry of the recovery. Since all functions in the method are differentiable, we can use gradient descent to iteratively minimize the loss function and obtain the model parameter values.
LGDLDA uses the gradient descent method to train the model parameters. After training, the elements of the reconstructed matrix give the predicted score of each association. The higher the score, the more likely it is that the potential association exists. In this sense, the final optimization problem is a popular matrix completion problem, which can be solved using convex optimization methods.
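The final ranking step can be illustrated as follows. Given a matrix of predicted scores and the matrix of known associations, candidates for a disease are the lncRNAs without a known link, sorted by decreasing score. The function name `top_candidates`, the lncRNA labels and the scores are all hypothetical; the default k = 15 mirrors the number of candidates reported per cancer in the case studies.

```python
def top_candidates(scores, known, lncrnas, disease_idx, k=15):
    """Return the k highest-scoring lncRNAs with no known link to the disease."""
    candidates = [
        (scores[i][disease_idx], lncrnas[i])
        for i in range(len(lncrnas))
        if known[i][disease_idx] == 0  # skip associations that are already known
    ]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in candidates[:k]]


# Toy example with one disease (column 0) and four lncRNAs (illustrative only).
lncrnas = ["lnc_a", "lnc_b", "lnc_c", "lnc_d"]
scores = [[0.9], [0.2], [0.7], [0.4]]
known = [[1], [0], [0], [0]]
ranked = top_candidates(scores, known, lncrnas, disease_idx=0, k=2)
print(ranked)
```

Excluding known associations before ranking keeps the output focused on new predictions that still require experimental verification.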

Evaluation method and metrics
To fairly evaluate the performance of the methods, we performed leave-one-out cross-validation (LOOCV) on the verified lncRNA-disease association data. Given a disease $d_i$, each known disease-related lncRNA is left out in turn as the test sample, while the other disease-related lncRNAs are used as training samples. All irrelevant lncRNAs constitute the candidate samples. The test samples are positive samples, and the other samples are negative samples. In the predicted association matrix, LGDLDA regards elements larger than the threshold as effective associations between lncRNAs and diseases. We used the true positive rate (TPR) and false positive rate (FPR) to calculate the area under the curve (AUC).
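To make the metric concrete, the sketch below computes the AUC by sweeping the threshold over the predicted scores, recording the (FPR, TPR) point at each step, and integrating with the trapezoidal rule. This is a generic illustration rather than the evaluation code used for LGDLDA; the function name `roc_auc` and the toy labels and scores are hypothetical, and the input is assumed to contain at least one positive and one negative sample.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = positive) and real-valued scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    prev_fpr = prev_tpr = 0.0
    auc = 0.0
    i = 0
    while i < len(ranked):
        # Lower the threshold past every sample that shares the current score.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        tpr, fpr = tp / positives, fp / negatives
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2  # trapezoid between ROC points
        prev_fpr, prev_tpr = fpr, tpr
        i = j
    return auc


# Toy example: two left-out positives ranked among two negative candidates.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(roc_auc(labels, scores))
```

In a LOOCV setting, the scores of all left-out positives and candidate negatives would be pooled before calling such a function, so that one ROC curve summarizes the whole validation.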
Additional file 1: Figure S1. The data processing procedure for disease-lncRNA association instances. Figure S2. The AUC values for 10 realizations on the dataset with 10% incorrect data. Figure S3. The box plots from the 50 random splits experiment on a dataset with 10% incorrect data. Table S1. The experimental results on a dataset lacking some omics data.