Predicting lncRNA-disease associations using multiple metapaths in hierarchical graph attention networks

Background: Many biological studies have shown that lncRNAs regulate the expression of epigenetically related genes. The study of lncRNAs has helped to deepen our understanding of the pathogenesis of complex diseases at the molecular level. Because of the large number of lncRNAs and the complex, time-consuming nature of biological experiments, computational prediction of potential lncRNA-disease associations is very effective. To explore information in complex network structures, existing methods rely mainly on lncRNA and disease information. Metapaths have been applied to network models as an effective means of exploring information in heterogeneous graphs. However, existing methods are dominated by lncRNA or disease nodes and tend to ignore the paths provided by intermediate nodes.
Methods: We propose a deep learning model based on hierarchical graph attention networks, named MMHGAN, which predicts unknown lncRNA-disease associations using multiple types of metapaths to extract features. First, the model constructs a lncRNA-disease-miRNA heterogeneous graph based on known associations and two homogeneous graphs of lncRNAs and diseases. Second, for the homogeneous graphs, the features of neighboring nodes are aggregated using a multihead attention mechanism. Third, for the heterogeneous graph, metapaths through different intermediate nodes are selected to construct subgraphs, and the importance of the different types of metapaths is calculated and aggregated to obtain the final embedded features. Finally, the features are reconstructed using a fully connected layer to obtain the prediction results.
Results: With fivefold cross-validation, the model obtained an average AUC of 96.07% and an average AUPR of 93.23%. Ablation experiments demonstrated the role of the homogeneous graphs and of the weights of paths through different intermediate nodes. In addition, we conducted case studies on lung cancer, esophageal carcinoma, and breast cancer. Of the top 15 lncRNAs predicted for each of these diseases, 15, 12, and 14, respectively, were validated by the LncRNADisease and Lnc2Cancer databases.
Conclusion: Compared with six existing models, MMHGAN achieved better performance, and the case studies demonstrated that the model is effective in predicting potential lncRNA-disease associations.


Introduction
LncRNAs can regulate the expression of target genes through different cellular mechanisms, acting as signals, decoys, guides, and scaffolds, and play a variety of roles in all life processes [1]. Aberrant expression of lncRNAs is usually associated with human diseases. Therefore, mining the correlation between lncRNAs and diseases helps to elucidate the pathogenic mechanisms of complex diseases and provides a basis for disease diagnosis and prevention.
Although some lncRNA-disease associations have been experimentally validated, the vast majority of these associations remain unknown [1]. Traditional biological experimental approaches to validate potential lncRNA-disease associations are often resource intensive and costly. To alleviate this problem, computational approaches have received much attention from scholars. Recent methods can be broadly classified into three categories: network-based methods, random walk-based methods, and machine learning-based methods.
Network-based approaches focus on predicting potential associations between lncRNAs and diseases using various propagation algorithms. The first network-based method, LRLSLDA [2], combines the lncRNA-disease association network and the lncRNA expression similarity network and applies Laplacian regularized least squares in a semisupervised learning framework to identify potential lncRNA-disease associations. Notably, this approach does not require negative samples. Yang et al. [3] used a propagation algorithm to identify existing diseases and detected disease-causing gene associations; based on this information, they constructed a new disease gene-related network and identified lncRNA-disease associations in that network. Li [4] calculated multiple similarities between lncRNAs and diseases, acquired probability matrices of lncRNAs and diseases, and subsequently assessed their network consistency before predicting unknown lncRNA-disease associations. Zhang et al. [5] combined lncRNA, protein, and disease information to construct a network and applied a flow propagation algorithm.
Random walk-based methods can pay more attention to the information that contributes more to the network. Xie et al. [6] proposed the LDA-LLNSUBRW model to predict lncRNA-disease associations (LDAs). This model is based mainly on linear neighborhood similarity and an unbalanced bi-random walk. Sun et al. [7] proposed the RWRlncD method, which is based on a global network that contains the lncRNA functional similarity network, the disease similarity network, and known lncRNA-disease associations. However, this approach cannot be applied to lncRNAs without a known associated disease. Li et al. [8] designed an improved local random walk method for a newly established heterogeneous network. In 2019, Hu et al. [9] introduced a matrix completion method (LMNLMI).
The third category includes machine learning-based methods. Yao et al. [10] utilized random forests to select features in their proposed methodology. Wang et al. [11] proposed a weighted matrix decomposition (WMD) method for LDA prediction by presetting the weights of different correlation matrices and converting them into low-dimensional matrices. Lan et al. [12] trained a support vector machine (SVM) model to predict potential associations between lncRNAs and diseases by combining multiple biological data. Yu et al. [13] created a predictive model (CFNBC) that unifies the associations among lncRNAs, diseases, and miRNAs and combines collaborative filtering with a naive Bayesian classifier for LDA prediction. The GBDT-LR [14] model combines two different machine learning methods, gradient boosting decision trees and logistic regression. With the growth of scientific research, there has been an increasing emphasis on neural networks. Neural networks can achieve superior training results by continuously modifying parameters through numerous operations. Recently, graph neural networks, such as graph convolutional networks (GCNs) and graph attention networks (GATs), have been used in bioinformatics research because of their ability to integrate graph topology and node features. Some of these models also use a bi-interaction aggregator to combine the representations of similar neighbors, prioritizing more relevant neighbors and reducing noise. Wu et al. [15] developed the GAMCLDA model, which applies graph convolutional networks to reconstruct graph structures and lncRNA and disease node feature vectors.
These existing methods have achieved satisfactory performance and effectively contributed to the advancement of computational methods for LDA prediction, but their ability to mine the rich semantic information in heterogeneous graphs composed of lncRNAs and diseases is still far from optimal. Metapaths show strong potential for exploring complex structural and semantic information in heterogeneous networks. Xuan et al. [16] considered that nodes with similar attributes are located not only in the neighborhood of the target node but also in regions far from it. Therefore, they integrated the associations between the nodes, increased global dependencies, and added multiview features of the node pairs. Zhao et al. [17] developed a new framework based on heterogeneous graph attention networks and metapath graph attention networks. They constructed a bipartite topological graph of lncRNAs and diseases and used the KNN algorithm to remove noise effects. Inspired by existing studies, we designed a multiple metapath-based hierarchical graph attention network model for lncRNA-disease association prediction. Subgraph aggregation features are constructed under multiple types of metapaths to capture the various relationships between lncRNAs and diseases in both heterogeneous and homogeneous graphs simultaneously, which improves performance. Our contributions are as follows:
1. We propose a dual-path feature extraction strategy based on a homogeneous graph and a heterogeneous graph. Subgraph aggregation features of the homogeneous and heterogeneous graphs are used to enrich the model input. The KNN algorithm is used to construct homogeneous subgraphs, which reduces computation and noise. In addition, miRNA nodes are introduced to construct a ternary heterogeneous network with richer information.
2. Different types of metapaths are constructed. For the heterogeneous graph, existing metapaths pass only through lncRNA or disease nodes, i.e., the connecting pathways through other nodes, such as miRNA nodes, are ignored. Using the GAT network, we learn each homogeneous graph or metapath-specific heterogeneous subgraph by extracting the paths along which lncRNA or disease nodes reach one another through different types of intermediate nodes. Moreover, in the heterogeneous subgraphs, we adaptively assign weights to the different metapath subgraphs using the attention mechanism to obtain additional semantic information.

Datasets
In this study, datasets collected from three studies were used to evaluate the model performance.
Dataset 3: We used the dataset screened by Li et al. [27]. The authors screened relevant records with causal relationships from the HMDD v3.2 database and converted all disease names into standardized names based on the MeSH nomenclature. Finally, 861 lncRNAs, 437 miRNAs, and 432 diseases were obtained.
Model parameter tuning, ablation experiments, and comparisons with the baseline models were performed on dataset 1. All three datasets were used for the robustness experiments. The detailed data are shown in Table 1. In this table, LDA represents the association of lncRNAs with diseases, LMA represents the association of lncRNAs with miRNAs, and MDA represents the association of miRNAs with diseases.

Flowchart of the MMHGAN model
As shown in Fig. 1, we propose the MMHGAN model for predicting lncRNA candidates associated with a given disease. The MMHGAN model consists of data sources, the construction of heterogeneous and homogeneous graphs, the acquisition of subgraph features via multihead attention, and prediction.

LncRNA sequence similarity
We retrieved the sequence of each lncRNA by name from NONCODE (http://www.noncode.org/), GenBank (https://www.ncbi.nlm.nih.gov/), and Ensembl (http://asia.ensembl.org/index.html). After obtaining all the lncRNA sequences, following the previous studies by Yang et al. [28] and Li et al. [29], we computed the pairwise lncRNA sequence similarity (LSS) using the Levenshtein distance, i.e., the edit distance used to measure the difference between two strings [30]. In previous studies, the substitution cost was set to 2, while the insertion cost and deletion cost were set to 1. We followed the same criteria in our study. The formula for the LSS is shown below:
$$LSS(l_i, l_j) = 1 - \frac{dist(l_i, l_j)}{len(l_i) + len(l_j)}$$
where $dist(l_i, l_j)$ denotes the minimum cost of converting the sequence of lncRNA $l_i$ into the sequence of lncRNA $l_j$ and $len(\cdot)$ denotes the length of a lncRNA sequence.
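As a minimal sketch of the sequence similarity described above, the following Python code computes a weighted Levenshtein distance (substitution cost 2, insertion and deletion cost 1) and normalizes it by the sum of the two sequence lengths. The function names `weighted_levenshtein` and `lss` are illustrative and not taken from the original implementation, and the normalization is an assumption based on the definitions of dist and len given in the text.

```python
def weighted_levenshtein(a: str, b: str, sub_cost: int = 2, indel_cost: int = 1) -> int:
    """Minimum edit cost of turning sequence a into sequence b."""
    # prev[j] holds the cost of converting the processed prefix of a into b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,                         # delete ca
                curr[j - 1] + indel_cost,                     # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def lss(a: str, b: str) -> float:
    """Sequence similarity in [0, 1]; 1 means identical sequences."""
    if not a and not b:
        return 1.0  # assumption: two empty sequences are treated as identical
    return 1.0 - weighted_levenshtein(a, b) / (len(a) + len(b))
```

Because a substitution costs as much as one deletion plus one insertion, the distance never exceeds the sum of the two lengths, so the similarity stays between 0 and 1.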

Disease semantic similarity
The semantic similarity of diseases is computed from Medical Subject Headings (MeSH) descriptors [31], available from https://www.ncbi.nlm.nih.gov/. MeSH provides the hierarchical relationships between diseases, which can be described with a directed acyclic graph (DAG). Given the DAG, we calculated the disease semantic similarity (DSS) using the method proposed by Wang et al. [32].
Let $T_v$ denote the DAG of disease $v$, i.e., $v$ together with all of its ancestor nodes. Assuming that $d$ is an ancestor node in the DAG and $d'$ is a child node of $d$, the semantic contribution of each node in the DAG to disease $v$ is calculated as follows:
$$D_v(d) = \begin{cases} 1, & d = v \\ \max\{\Delta \cdot D_v(d') \mid d' \in \text{children of } d\}, & d \neq v \end{cases}$$
where $\Delta$ is the semantic contribution decay factor. After the contribution scores were obtained, the semantic score $DV(v)$ was calculated for each disease:
$$DV(v) = \sum_{d \in T_v} D_v(d)$$
Finally, the semantic similarity of two diseases $v_i$ and $v_j$ was calculated with the following formula:
$$DSS(v_i, v_j) = \frac{\sum_{d \in T_{v_i} \cap T_{v_j}} \left(D_{v_i}(d) + D_{v_j}(d)\right)}{DV(v_i) + DV(v_j)}$$
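The DAG-based similarity above can be illustrated with a short Python sketch. The `parents` mapping (child term to its parent terms), the function names, and the decay factor `delta=0.5` are assumptions made for the example; the text does not state the decay value used by the authors.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Contribution of every node in the DAG of `disease` (itself plus ancestors).

    parents: dict mapping a term to an iterable of its parent terms.
    delta:   decay factor per DAG level (assumed value, not from the paper).
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = delta * contrib[node]
            # Keep the largest contribution over all children of the parent.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def dss(d1, d2, parents, delta=0.5):
    """Semantic similarity of two diseases from their shared DAG nodes."""
    c1 = semantic_contributions(d1, parents, delta)
    c2 = semantic_contributions(d2, parents, delta)
    shared = sum(c1[t] + c2[t] for t in c1.keys() & c2.keys())
    return shared / (sum(c1.values()) + sum(c2.values()))
```

For two sibling diseases "A" and "C" under a parent "B" that itself has a parent "R", each DAG has a semantic score of 1.75 and the shared nodes contribute 1.5, giving a similarity of 1.5 / 3.5.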

LncRNA/disease GIP kernel similarity
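According to previous studies, the lncRNA Gaussian interaction profile (GIP) kernel similarity (LGS) and the disease GIP kernel similarity (DGS) were calculated from the lncRNA-disease adjacency matrix LD. The formula for the LGS is as follows:
$$LGS(l_i, l_j) = \exp\left(-\xi_l \left\| LD(l_i) - LD(l_j) \right\|^2\right), \qquad \xi_l = 1 \Big/ \left(\frac{1}{N_l} \sum_{i=1}^{N_l} \left\| LD(l_i) \right\|^2\right)$$
Here, $LD(l_i)$ is the row of LD corresponding to lncRNA $l_i$, $N_l$ denotes the number of lncRNAs, and $\xi_l$ is the regularization factor. Similarly, the DGS was calculated as follows:
$$DGS(d_i, d_j) = \exp\left(-\xi_d \left\| LD(d_i) - LD(d_j) \right\|^2\right), \qquad \xi_d = 1 \Big/ \left(\frac{1}{N_d} \sum_{i=1}^{N_d} \left\| LD(d_i) \right\|^2\right)$$
Here, $LD(d_i)$ is the column of LD corresponding to disease $d_i$, $N_d$ denotes the number of diseases, and $\xi_d$ is the regularization factor. Because the similarity matrices obtained above contain many sparse values, and because a single type of similarity used as a feature may yield inaccurate predictions, we linearly fused the two similarities for lncRNAs and for diseases. LSM and DSM denote the combined similarity matrices of lncRNAs and diseases after linear fusion.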
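A minimal sketch of the Gaussian interaction profile kernel follows, assuming each row of `profiles` is the binary association profile of one lncRNA (or one disease, if the transposed matrix is passed). The function name `gip_kernel` is illustrative.

```python
import math


def gip_kernel(profiles):
    """GIP kernel similarity between the rows of a binary association matrix."""
    n = len(profiles)
    mean_sq_norm = sum(sum(x * x for x in row) for row in profiles) / n
    # Bandwidth: inverse of the mean squared norm of the profiles.
    # Guard against an all-zero matrix (assumption: fall back to 1.0).
    xi = 1.0 / mean_sq_norm if mean_sq_norm > 0 else 1.0
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-xi * sq_dist)
    return kernel
```

Rows with identical association profiles get a similarity of 1, and the similarity decays exponentially with the number of differing associations.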

Subgraph construction based on metapaths
A metapath is a composite relation connecting two objects and is a widely used structure for capturing semantics. Metapaths can be used to explore structural information in heterogeneous graphs and capture rich semantic information, fully and intuitively exploiting network structures.
To explore more diverse information embedded in the metapaths, we constructed a ternary heterogeneous graph $G_{lmd} = (V, E)$ containing three types of nodes: lncRNA, miRNA, and disease nodes. The set of nodes is $V = v_{lnc} \cup v_{dis} \cup v_{mir}$, where $v_{lnc}$ is the set of 240 lncRNA nodes, $v_{dis}$ is the set of 412 disease nodes, and $v_{mir}$ is the set of 495 miRNA nodes. The edge set $E$ in the heterogeneous graph can be defined as follows:
$$E = \left\{E^{lnc-dis}, E^{lnc-mir}, E^{mir-dis}\right\}, \quad E^{lnc-dis} \in R^{N_{lnc} \times N_{dis}}, \; E^{lnc-mir} \in R^{N_{lnc} \times N_{mir}}, \; E^{mir-dis} \in R^{N_{mir} \times N_{dis}}$$
where $N_{lnc}$, $N_{dis}$, and $N_{mir}$ represent the numbers of lncRNAs, diseases, and miRNAs in the dataset, respectively. $E^{lnc-dis}$, $E^{lnc-mir}$, and $E^{mir-dis}$ represent the association matrix of lncRNAs and diseases, the association matrix of lncRNAs and miRNAs, and the association matrix of miRNAs and diseases, respectively. Given lncRNA node $l_i$ ($l_i \in v_{lnc}$) and disease node $d_j$ ($d_j \in v_{dis}$), there is an association between $l_i$ and $d_j$ if $E^{lnc-dis}_{ij} = 1$. The adjacency matrix $G$ of the heterogeneous graph $G_{lmd}$ is assembled from these three association matrices.
Taking dataset 1 as an example, 2697 lncRNA-disease associations have been experimentally verified. We treated these 2697 experimentally verified associations as positive samples, labeled 1. However, the number of unknown lncRNA-disease pairs is much greater than the number of known lncRNA-disease associations. An imbalance of positive and negative samples reduces the generalizability of the model. To address this issue, we randomly selected an equal number of unknown lncRNA-disease pairs, labeled 0, to be added to the heterogeneous graph. In addition, we used the combined similarities of lncRNAs, miRNAs, and diseases as the node features. Therefore, the lncRNA node feature has 240 dimensions and the disease node feature has 412 dimensions. The feature vectors are represented as follows:
$$F_{l_i} = (x_1, x_2, x_3, \ldots, x_{239}, x_{240})$$
$$F_{d_i} = (y_1, y_2, y_3, \ldots, y_{411}, y_{412})$$
where $F_{l_i}$ represents the features of the ith lncRNA in the lncRNA similarity matrix and $x_j$ represents the combined similarity value of the ith lncRNA and the jth lncRNA. Similarly, $F_{d_i}$ represents the feature vector of the ith disease in the disease similarity matrix.
Metapaths essentially describe the associations between lncRNAs $L_1$ and $L_2$ or between diseases $D_1$ and $D_2$. Different metapaths usually have different semantics. In the ternary heterogeneous graph $G_{lmd}$ obtained above, consider a metapath type $P_D$ of the form $L_1 \to D_1 \to L_2$, where $L_1$ is a lncRNA node, $D_1$ is a disease node associated with it, and $L_2$ is another lncRNA associated with that disease node. The lncRNA nodes connected by instances of this metapath type form the node set $v^{P_D}_l$. Thus, we can obtain the lncRNA subgraph $G^{P_D}_l = (v^{P_D}_l, E_{ld})$, where $E_{ld}$ represents the edges formed between lncRNA nodes connected by metapaths of this type. In our proposed model, in addition to the metapath type $L \to D \to L$, we define three other types of metapaths: $L \to M \to L$, $D \to L \to D$, and $D \to M \to D$. With these three types of metapaths, we can construct the following three kinds of homogeneous subgraphs:
$$G^{P_M}_l = (v^{P_M}_l, E_{lm}), \quad G^{P_L}_d = (v^{P_L}_d, E_{dl}), \quad G^{P_M}_d = (v^{P_M}_d, E_{dm})$$
where $v^{P_M}_l$ represents the set of lncRNA nodes for which a metapath of type $P_M$ exists and $E_{lm}$ represents the edges formed by connecting lncRNA nodes through miRNA nodes; the two disease subgraphs are defined analogously.

Feature extraction
After the above homogeneous subgraphs were obtained, the different types of nodes in the lncRNA-disease-miRNA heterogeneous graph were still in different feature spaces because of node heterogeneity. To place the node features in the same space, we performed a linear transformation on the three types of nodes so that they are mapped into the same feature space. The calculations are as follows:
$$H_{l(i)} = F_{l_i} W_{l(i)}, \qquad H_{d(i)} = F_{d_i} W_{d(i)}$$
$H_{l(i)}$ and $H_{d(i)}$ are the projected features of lncRNA node $l_{(i)}$ and disease node $d_{(i)}$, respectively. The node features are ultimately projected into a 64-dimensional feature space. $W_{l(i)}$ and $W_{d(i)}$ are the parameter weight matrices of the lncRNA and disease nodes, respectively, with dimensions of 240 × 64 and 412 × 64.
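The metapath-based subgraphs that feed this feature extraction step can be sketched as follows. Given a binary association matrix between source nodes (for example, lncRNAs) and intermediate nodes (for example, diseases or miRNAs), two source nodes are connected when they share at least one intermediate node. The function name `metapath_subgraph` and the choice to omit self-loops are assumptions made for the illustration.

```python
def metapath_subgraph(assoc):
    """Homogeneous subgraph induced by a two-hop metapath such as L -> D -> L.

    assoc[i][j] == 1 means source node i is linked to intermediate node j.
    Returns a dict mapping each source node to the set of other source
    nodes that share at least one intermediate node with it.
    """
    n_src = len(assoc)
    neighbors = {i: set() for i in range(n_src)}
    for i in range(n_src):
        for k in range(i + 1, n_src):
            # An instance i -> j -> k exists if both are linked to some j.
            if any(a and b for a, b in zip(assoc[i], assoc[k])):
                neighbors[i].add(k)
                neighbors[k].add(i)
    return neighbors
```

Passing the lncRNA-disease matrix gives the L → D → L subgraph, the lncRNA-miRNA matrix gives L → M → L, and the transposed matrices give the two disease subgraphs.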
In homogeneous graphs, neighboring nodes exhibit different levels of importance in the task of learning node embeddings. The GAT is an effective tool for learning graph representations because it assigns different weights to the neighboring nodes of the central node. In our model, the GAT is used to learn node representations. Feature weights are learned adaptively in subgraphs composed of different metapaths. This approach can fully exploit the information in the heterogeneous network. Specifically, for a given subgraph, the GAT uses an attention mechanism to learn the importance of different neighboring nodes to the target node, and then, for the central node, the features of the neighboring nodes are aggregated based on the calculated scores. For different homogeneous subgraphs, the degree of contribution $e^G_{uv}$ of a neighbor node $v$ to a target node $u$ can be calculated, following the standard GAT formulation, as follows:
$$e^G_{uv} = \mathrm{LeakyReLU}\left(\mathbf{a}_G^{T} \left[H_u \,\|\, H_v\right]\right), \quad u, v \in v_G$$
where $G$ is the type of subgraph, $u$ is the target node, $v$ is the neighbor node in the homogeneous subgraph $G$, $\mathbf{a}_G$ is the trainable attention vector of subgraph $G$, and $\|$ denotes concatenation. LeakyReLU is a nonlinear activation function with a negative slope set to 0.2. $v_G$ denotes the set of nodes contained in subgraph $G$. Finally, the obtained attention values are normalized with the softmax function over the neighbors $N^G_u \subseteq v_G$ of $u$ to obtain the final weight coefficients $\alpha^G_{uv}$:
$$\alpha^G_{uv} = \frac{\exp\left(e^G_{uv}\right)}{\sum_{k \in N^G_u} \exp\left(e^G_{uk}\right)}$$
Subsequently, the features of all neighboring nodes $v$ are aggregated with the attention coefficients to update the features $Z^G_u$ of the target node $u$:
$$Z^G_u = \sigma\left(\sum_{v \in N^G_u} \alpha^G_{uv} H_v\right)$$
where $\sigma$ represents the ELU activation function.
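The node-level attention step can be illustrated with a small, dependency-free sketch of a single attention head. It assumes the standard GAT scoring (a learned vector applied to the concatenated features of the target and the neighbor); the names `gat_layer`, `features`, `neighbors`, and `attn_vec` are hypothetical, and a real model would learn `attn_vec` by gradient descent rather than receive it as an argument.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def elu(x):
    return x if x > 0 else math.exp(x) - 1.0


def gat_layer(features, neighbors, attn_vec):
    """One attention head over a homogeneous subgraph.

    features:  dict node -> list of floats (already projected features)
    neighbors: dict node -> list of neighbor nodes in the subgraph
    attn_vec:  list of floats of length 2 * feature dimension
    Returns (updated features, attention weights per node).
    """
    updated, weights = {}, {}
    for u, nbrs in neighbors.items():
        if not nbrs:  # isolated node: keep its own features (assumption)
            updated[u], weights[u] = list(features[u]), []
            continue
        # Raw score for each neighbor from the concatenated feature pair.
        scores = [
            leaky_relu(sum(a * x for a, x in zip(attn_vec, features[u] + features[v])))
            for v in nbrs
        ]
        # Numerically stable softmax over the neighbors of u.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        dim = len(features[u])
        updated[u] = [
            elu(sum(al * features[v][d] for al, v in zip(alphas, nbrs)))
            for d in range(dim)
        ]
        weights[u] = alphas
    return updated, weights
```

With an all-zero attention vector every neighbor receives the same weight, so the layer reduces to a plain mean over neighbors followed by ELU.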
To enhance the model's ability to capture different levels of information, we introduced a multihead attention mechanism to extend the attention scores between nodes. The multihead attention mechanism is an improved attention mechanism that calculates the attention scores between nodes $K$ times and uses the average value as the final score. The embedded feature $Z^G_u$ obtained after the internode attention mechanism is:
$$Z^G_u = \sigma\left(\frac{1}{K} \sum_{k=1}^{K} \sum_{v \in N^G_u} \alpha^{G,k}_{uv} H_v\right)$$
Because the embedding obtained from a single subgraph reflects only one aspect of a node's semantics, we introduced an attention mechanism at the metapath semantic level to obtain a more comprehensive node embedding; it calculates the weights that the nodes receive under different subgraphs. The node embeddings of each subgraph are nonlinearly transformed, and the average value of the node features after the nonlinear transformation is used as the contribution value of each metapath. Thus, the weight $W^G_u$ of a certain type of subgraph is calculated as follows:
$$w^G = \frac{1}{V} \sum_{u} q^{T} \tanh\left(\mathbf{W}_s Z^G_u + b\right), \qquad W^G_u = \frac{\exp\left(w^G\right)}{\sum_{g=1}^{GN} \exp\left(w^g\right)}$$
where $V$ is the total number of nodes under the subgraph adjacent to target node $u$, tanh is the activation function, $\mathbf{W}_s$ is a trainable weight matrix, $q^{T}$ is the trainable semantic layer attention vector with dimensions set to 128, and $b$ is the bias vector. $GN$ is the number of subgraphs of different nodes, and $W^G_u$ is the contribution of each subgraph to the target node $u$. After semantic embedding, the final embedding obtained is defined as follows:
$$Z_u = \sum_{G=1}^{GN} W^G_u Z^G_u$$
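The semantic-level attention can be sketched in plain Python as below. This follows the common heterogeneous graph attention formulation (score each subgraph by the mean of a transformed node embedding, then apply softmax across subgraphs), which is an assumption about the exact form used by the authors. The names `semantic_attention`, `weight`, `bias`, and `query` are hypothetical stand-ins for the trainable parameters.

```python
import math


def semantic_attention(embeddings, weight, bias, query):
    """Fuse node embeddings from several metapath subgraphs.

    embeddings: list over subgraphs; each item is a list of node vectors
                (same nodes, same order, same dimension in every subgraph)
    weight:     matrix (list of rows) of the nonlinear transformation
    bias:       bias vector, one entry per row of `weight`
    query:      semantic-level attention vector
    Returns (subgraph weights, fused node embeddings).
    """
    importance = []
    for subgraph in embeddings:
        total = 0.0
        for z in subgraph:
            hidden = [
                math.tanh(sum(w * x for w, x in zip(row, z)) + b)
                for row, b in zip(weight, bias)
            ]
            total += sum(q * h for q, h in zip(query, hidden))
        importance.append(total / len(subgraph))  # mean over nodes
    # Softmax across subgraphs gives each metapath's contribution.
    top = max(importance)
    exps = [math.exp(s - top) for s in importance]
    betas = [e / sum(exps) for e in exps]
    n_nodes, dim = len(embeddings[0]), len(embeddings[0][0])
    fused = [
        [sum(betas[g] * embeddings[g][u][d] for g in range(len(embeddings)))
         for d in range(dim)]
        for u in range(n_nodes)
    ]
    return betas, fused
```

Setting the query vector to zero gives every metapath the same weight, which corresponds to the fixed-coefficient variant without learned metapath weights.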

Feature extraction based on homogeneous graphs
A heterogeneous graph constructed from the associations between nodes lacks information about nodes of the same type. To further capture the latent characteristics shared by same-type nodes, we defined metapaths L → L and D → D between nodes of the same type to construct both lncRNA and disease homogeneous graphs. Constructing a homogeneous graph still requires an adjacency matrix between the nodes. We chose to use the KNN algorithm to construct the respective adjacency matrices of lncRNAs and diseases. Because the KNN algorithm relies on neighboring samples, choosing an appropriate number of neighbors can effectively reduce the influence of noise.
Based on the combined similarity obtained, the KNN algorithm was used to find the top k lncRNAs or diseases that were most similar to the ith lncRNA or disease; these pairs were assigned a value of 1, and all other pairs were assigned a value of 0. Subsequently, we obtained the association matrices of lncRNAs or diseases with themselves, i.e., $E^{lnc-lnc}$ and $E^{dis-dis}$. Their assignment formulas are as follows:
$$E^{lnc-lnc}_{ij} = \begin{cases} 1, & l_j \in Nei_{l_i}(k) \\ 0, & \text{otherwise} \end{cases} \qquad E^{dis-dis}_{ij} = \begin{cases} 1, & d_j \in Nei_{d_i}(k) \\ 0, & \text{otherwise} \end{cases}$$
where $Nei_{l_i}(k)$ ($Nei_{d_i}(k)$) contains the top k lncRNAs (diseases) most similar to lncRNA $l_i$ (disease $d_i$), together with $l_i$ ($d_i$) itself. We empirically set k to 20.
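A minimal sketch of this KNN-based construction is given below, assuming `sim` is a square combined similarity matrix (LSM or DSM) stored as a list of rows. The function name `knn_adjacency` is illustrative.

```python
def knn_adjacency(sim, k):
    """Binary adjacency matrix linking each node to its k most similar nodes.

    Each node is also linked to itself, matching the description that the
    neighborhood contains the node itself. The result is not necessarily
    symmetric, because the top-k relation is not symmetric.
    """
    n = len(sim)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Rank the other nodes by similarity to node i, highest first.
        ranked = sorted(others, key=lambda j: sim[i][j], reverse=True)
        for j in ranked[:k]:
            adj[i][j] = 1
        adj[i][i] = 1
    return adj
```

In the setting described above, the call would be `knn_adjacency(LSM, 20)` for lncRNAs and `knn_adjacency(DSM, 20)` for diseases.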
We defined the lncRNA homogeneous graph $G_l = (V, E)$ as containing the set of nodes $v_{lnc}$. The edge set $E$ in the graph can be defined as $E^{lnc-lnc} \in R^{N_{lnc} \times N_{lnc}}$, where $N_{lnc}$ denotes the number of lncRNAs in the dataset. Given lncRNA nodes $l_i$ ($l_i \in v_{lnc}$) and $l_j$ ($l_j \in v_{lnc}$), $l_i$ and $l_j$ are associated with each other if $E^{lnc-lnc}_{ij} = 1$. Additionally, we defined the disease homogeneous graph $G_d = (V, E)$ containing the set of nodes $v_{dis}$. The edge set $E$ in the graph can be defined as $E^{dis-dis} \in R^{N_{dis} \times N_{dis}}$, where $N_{dis}$ denotes the number of disease nodes in the dataset. Given disease nodes $d_i$ ($d_i \in v_{dis}$) and $d_j$ ($d_j \in v_{dis}$), there is an association between $d_i$ and $d_j$ if $E^{dis-dis}_{ij} = 1$. Otherwise, no association is observed between the nodes.
Subsequently, we used the combined similarity of lncRNAs and diseases as the feature vector of the nodes. For the constructed homogeneous graphs, we similarly used the multihead attention mechanism to aggregate the node features and finally obtained the embedded features $Z_O$.

LDA prediction
We performed feature enhancement for the initial lncRNA and disease similarity using heterogeneous graph extraction of metapaths and homogeneous graph aggregation, respectively. We concatenated the resulting final embeddings and used a fully connected layer to reconstruct the lncRNA and disease features for the final prediction.
The association probability $\hat{y}_{ij}$ between lncRNA $l_i$ and disease $d_j$ is predicted from the reconstructed lncRNA and disease features. During model training, a loss function quantifies the discrepancy between the model's predicted value $\hat{y}$ and the actual value $y$, where $y$ represents the true association of the lncRNA with the disease. We combined this loss function with gradient descent to optimize the model's parameters efficiently and boost its predictive capability, using the Adam optimizer [33]. Finally, the model was trained by the backpropagation algorithm to obtain the final prediction probabilities.
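The loss computation can be illustrated as follows. The text does not name the loss function, so this sketch assumes binary cross-entropy, a common choice for binary association prediction; the function name `binary_cross_entropy` and the clipping constant `eps` are illustrative.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy between labels (0/1) and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)
```

A prediction of 0.5 for every pair gives a loss of log 2, and the loss approaches 0 as the predicted probabilities approach the true labels.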

Comparison with other methods
To further validate the performance of the model, based on dataset 1, we compared the proposed method with six benchmark models. BiGAN [28] is a generative adversarial model that consists of an encoder, a generator, and a discriminator for predicting the associations of novel lncRNAs with diseases. HOPEXGB [34] is a prediction method based on machine learning techniques that uses higher-order proximity preserving embedding (HOPE) and extreme gradient boosting (XGB) to identify miRNAs and lncRNAs associated with diseases. VGAELDA [35] is an end-to-end model that integrates variational inference and a graph autoencoder for lncRNA-disease association prediction. GCRFLDA [36] is a prediction method based on graph convolutional matrix completion. The method of Lu et al. [37] predicts potential lncRNA-disease associations based on inductive matrix completion. GAMCLDA [15] is a method based on a graph autoencoder and matrix completion.

Experimental setup
We used a fivefold cross-validation approach to evaluate the models. Our method is based on the PyTorch framework and executed with the dgl package. The computing environment included the Windows 10 operating system with an Intel(R) Core(TM) i5 and 16 GB of RAM. The maximum number of epochs in our model was 500, and all the trainable parameters were learned using the Adam optimizer with a learning rate of 0.001 and a weight decay rate of 0.005.
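As a rough illustration of how a balanced sample set could be split for fivefold cross-validation, the sketch below samples as many negative pairs as there are positive pairs and distributes the shuffled samples over five folds. The function name `balanced_folds`, the fixed seed, and the round-robin split are assumptions for the example and are not taken from the original code.

```python
import random


def balanced_folds(positives, unknown_pairs, n_folds=5, seed=0):
    """Split balanced positive/negative samples into cross-validation folds.

    positives:     list of known (lncRNA, disease) pairs, labeled 1
    unknown_pairs: list of pairs without a known association
    Returns a list of folds; each fold is a list of (pair, label) tuples.
    """
    rng = random.Random(seed)
    # Sample as many negatives as positives to avoid class imbalance.
    negatives = rng.sample(unknown_pairs, len(positives))
    samples = [(p, 1) for p in positives] + [(n, 0) for n in negatives]
    rng.shuffle(samples)
    return [samples[i::n_folds] for i in range(n_folds)]
```

In each round, one fold serves as the test set and the remaining four folds form the training set.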

Evaluation metrics
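Following the evaluation metrics used in previous studies, we used the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve, summarized by the area under the ROC curve (AUC) and the area under the PR curve (AUPR). Additionally, we used three other evaluation metrics, namely, accuracy (ACC), recall (sensitivity), and the F1 score. These metrics were calculated as follows:
$$ACC = \frac{TP + TN}{TP + TN + FP + FN}, \qquad Recall = \frac{TP}{TP + FN}$$
$$Precision = \frac{TP}{TP + FP}, \qquad F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.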
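The threshold-based metrics can be computed from the confusion counts as in the following sketch, which assumes binary labels and already-thresholded binary predictions. The function name `classification_metrics` is illustrative, and ratios with a zero denominator are returned as 0.0 by convention.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels and predictions."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }
```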

Comparison with other advanced methods
As shown in Table 2, MMHGAN's overall performance metrics are all higher than 88%. These results are better than those of GCRFLDA (86%), which is the best overall performing model among the benchmark models. MMHGAN has four evaluation metrics that are better than those of GCRFLDA. However, the AUPR achieved by MMHGAN is lower than that of GCRFLDA. While the other models achieved good AUC/ACC performance, their AUPR and recall were less than 80%.

Model performance with different datasets
To better evaluate our model, we tested it on three datasets with multiple evaluation metrics, and the results are shown in Table 3. On these three datasets, all the metrics of the model were greater than 88%. The ROC and PR curves of our model on the three datasets are shown in Figs.

Comparison with different feature combinations
To further test the effect of different features on the classification results, we performed the following comparisons:
MMHGAN-NHO: This model aggregates node features only in heterogeneous graphs, in the module identified as (iii) in Fig. 1.
MMHGAN-NA: For subgraphs obtained from different metapaths, in the module labeled (iii) in Fig. 1, we fixed the coefficient of the aggregated features of the subgraphs obtained through different intermediate nodes to 0.5 instead of assigning weights, i.e., the computation node labeled attention in module (iii) was not used.
We compared these two models with the original model, and the comparison results are shown in Table 4. The results show that the model with richer feature information and more diverse attention mechanisms achieved better performance.

Analysis of parameters
The performance of the model can be improved by tuning some of its parameters. First, we assessed the number of attention heads k in the multihead attention mechanism. We used k = 1, 2, 4, 8, and 16, and the resulting AUC values are displayed in Fig. 8. As demonstrated, the model performs best when k = 4. When k = 1, the model is equivalent to one without the multihead attention mechanism, and it was outperformed by the models with the other k values. This result demonstrates that the multihead attention mechanism can assign the weights of metapath instances more fairly. Second, we tested different dimensions for the attention layer and the output features, and Fig. 9 shows the AUC values of the MMHGAN model prediction results for different dimensions n of the output features. As the number of dimensions increases, the AUC value of the MMHGAN model increases. The model produces the best prediction results when the number of dimensions is 256. When there are more than 512 dimensions, the model's performance decreases, perhaps because of the model's increased propensity for overfitting. We therefore chose 128 as the number of dimensions.

Case study
We studied three cases, lung cancer, esophageal carcinoma, and breast cancer, to further evaluate the performance of the model in predicting the associations between lncRNAs and diseases. For each studied disease, we removed its known associations with lncRNAs, used the remaining lncRNA-disease associations as positive samples, and constructed the same number of negative samples for training. The disease to be studied was subsequently entered into the trained model as a test sample to obtain the prediction scores. We ranked the scores and selected the 15 lncRNAs with the highest scores as the candidates possibly associated with the disease. We verified the predictions against the LncRNADisease database, the Lnc2Cancer database, and the published literature. The final predictions for these three diseases are shown in Tables 5, 6, and 7.
Lung cancer is a malignant tumor originating from lung tissue cells that usually spreads through the respiratory tract and is associated with extremely high morbidity and mortality. All 15 predicted lncRNAs were confirmed. The results suggest that the lncRNAs predicted by the model are indeed associated with lung cancer.
Esophageal carcinoma is one of the most common tumors of the digestive tract. Therefore, we chose it as the second case to test the model. Table 6 shows that the predicted associations of 12 of these lncRNAs with the disease can be retrieved from the LncRNADisease and Lnc2Cancer databases.
Breast cancer was studied as the third case. Breast cancer is one of the most common malignant tumors in women and originates from breast epithelial or ductal cells. Its incidence increases with age. As shown in Table 7, 14 of the 15 predicted lncRNAs were confirmed by databases such as LncRNADisease. The above three case studies demonstrated the ability of the MMHGAN model to predict potential lncRNA-disease associations.

KM curve
A Kaplan-Meier curve is a statistical tool used in survival analysis, usually to describe the probability of an event occurring within a certain period. Survival analyses are primarily used to study the time to the occurrence of an event, which can be the onset of a disease, death, or another specific outcome. Survival time $t_i$ is the horizontal coordinate, and survival rate $S(t_i)$ at each time point is the vertical coordinate; the curve formed by connecting the survival rates at each time point is referred to as the survival curve.
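For illustration, the Kaplan-Meier estimate of the survival rate at each event time can be computed as in the sketch below. It uses the standard product-limit estimator; the function name `kaplan_meier` and the input format (follow-up times plus event indicators, with 0 marking censored patients) are assumptions for the example and do not describe the tool used by the authors.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times:  follow-up time of each patient
    events: 1 if the event (e.g. death) was observed, 0 if censored
    Returns a list of (event time, survival rate) pairs in time order.
    """
    event_times = sorted({t for t, e in zip(times, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for x in times if x >= t)  # still under observation
        deaths = sum(1 for x, e in zip(times, events) if x == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve
```

Comparing the curves of patients with high and low expression of a given lncRNA then amounts to calling the function once per group.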
Based on the results of the case study, we selected breast cancer for survival analysis based on TCGA [38] data.As shown in Fig. 10 and Fig. 11, for PVT1 and HOTAIR, the survival rates of patients with low lncRNA expression are higher over time.

Discussion
To make full use of lncRNA and disease intermediate information to enhance LDA prediction, we proposed the MMHGAN model to learn each homogeneous graph or heterogeneous subgraph of a specific metapath using a GAT network. In addition, we used the KNN algorithm to construct homogeneous graphs and used an attention mechanism to adaptively assign weights to different heterogeneous metapath subgraphs to achieve denoising and to obtain additional semantic information. The cross-validation results show that the overall performance of the model exceeds that of the baseline comparison methods.
Several studies have used the k-nearest neighbors (KNN) algorithm to introduce both primary and deeper information into disease association prediction, and model performance improved as a result. These studies validated the effectiveness of combining the KNN algorithm with a GCN in disease association prediction. Consistent with these studies, we constructed homogeneous subgraphs using the KNN algorithm and acquired features using the GAT. The difference is that the inputs to our KNN algorithm are the LSM and DSM, the similarity matrices of lncRNAs and diseases, respectively, each obtained by linear fusion.
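As a rough illustration of this construction, the sketch below fuses two similarity matrices linearly and then keeps, for every node, edges to its k most similar neighbors. The fusion weight `alpha`, the symmetrization of the resulting adjacency matrix, and the toy matrices are assumptions; the exact similarity measures and fusion coefficients are those defined in the Methods, not the ones shown here.

```python
def linear_fusion(sim_a, sim_b, alpha=0.5):
    """Element-wise linear combination of two similarity matrices."""
    return [
        [alpha * a + (1.0 - alpha) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(sim_a, sim_b)
    ]


def knn_graph(sim, k):
    """Build a 0/1 adjacency matrix linking each node to its top-k neighbors."""
    n = len(sim)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        # Rank all other nodes by similarity to node i, highest first.
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sim[i][j],
            reverse=True,
        )
        for j in ranked[:k]:
            adj[i][j] = 1
            adj[j][i] = 1  # symmetrize (an assumption of this sketch)
    return adj


# Two hypothetical 4 x 4 lncRNA similarity matrices (illustrative values).
sim_functional = [
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.9],
    [0.1, 0.2, 0.9, 1.0],
]
sim_gaussian = [
    [1.0, 0.6, 0.4, 0.1],
    [0.6, 1.0, 0.1, 0.2],
    [0.4, 0.1, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
]
lsm = linear_fusion(sim_functional, sim_gaussian)
adj = knn_graph(lsm, k=1)
```

Keeping only the top-k neighbors sparsifies the dense similarity matrix, which is how the KNN step limits the influence of weak, potentially noisy similarities.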
To explore better disease association prediction models, different approaches have been used to fully exploit disease association information. Yang et al. [28] introduced the generative adversarial network approach to lncRNA-disease association prediction. Shi et al. [35] proposed VGAELDA, which integrates variational inference and graph autoencoders through graph representation learning and alternating training, enhancing its ability to capture efficient low-dimensional representations from high-dimensional features. Fan et al. [36] proposed GCRFLDA, a prediction method based on graph convolutional matrix completion that uses conditional random fields and attention mechanisms to form encoders and decoders, learn efficient node embeddings, and score lncRNA-disease associations. As shown in Table 2, although these methods use different techniques and obtain good performance (AUC > 89%), they do not account for the rich semantic information in heterogeneous graphs. He et al. [34] proposed a machine learning method that identifies disease-related miRNAs and lncRNAs by applying higher-order proximity preserved embedding (HOPE) and extreme gradient boosting (XGBoost) to a heterogeneous disease-miRNA-lncRNA (DML) information network. Lu et al. [37] proposed a prediction method that uses disease-gene and gene-gene correlations, computes the Gaussian interaction profile kernel of lncRNAs, and predicts potential lncRNA-disease associations by inductive matrix completion. Wu [15] introduced graph autoencoders to learn lncRNA and disease representations through their ability to encode and decode graph structures and features. While these methods have advanced the field by considering the rich information in heterogeneous graphs, they have not fully exploited its potential, as shown in Table 2, where their overall performance was 75%. In addition, these methods do not consider the information carried by the intermediate nodes of the metapath subgraphs. Inspired by Xuan et al. [16] and Zhao et al. [17], we used subgraphs constructed from homogeneous and heterogeneous graphs as inputs and combined multipath subgraphs with a multihead attention mechanism to acquire features, fully considering the information of the intermediate nodes of the metapath subgraphs. As shown in Table 2, the AUC, ACC, recall, and F1 score of our method are 0.59%, 0.48%, 2.05%, and 0.85% higher, respectively, than those of the best baseline model, GCRFLDA.
Our study is inspired by GSMV, an association prediction model proposed by Xuan et al., and HGATLDA, a metapath-based heterogeneous graph attention network framework developed by Zhao et al. Unlike HGATLDA, which does not consider homogeneous subgraph information, we obtained the features of homogeneous subgraphs through a multihead attention mechanism. In addition, unlike GSMV, which uses metapath instances to obtain semantic information, we used metapath subgraphs. Subgraphs capture local structural information better and are more interpretable. Moreover, when dealing with sparse matrices, extracting subgraphs along metapaths can reduce computational complexity and noise interference, and extracting subgraphs along different paths makes it easier to adapt to different requirements and data characteristics.
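To make the notion of a metapath subgraph concrete, the sketch below derives disease-disease edges from a binary association matrix: two diseases are connected whenever they share at least one intermediate node (a miRNA for one metapath type, a lncRNA for the other). The function name `metapath_edges` and the toy matrices are assumptions for illustration; a real implementation would typically use sparse matrix products on the full association data.

```python
def metapath_edges(assoc):
    """Return disease pairs (a, b), a < b, that share an intermediate node.

    assoc[d][x] is 1 if disease d is associated with intermediate node x,
    so each returned pair corresponds to a disease-x-disease metapath.
    """
    edges = set()
    n = len(assoc)
    for a in range(n):
        for b in range(a + 1, n):
            # Connect a and b if any intermediate node is linked to both.
            if any(x and y for x, y in zip(assoc[a], assoc[b])):
                edges.add((a, b))
    return edges


# Hypothetical associations for 3 diseases (illustrative values only).
disease_mirna = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]
disease_lncrna = [
    [1, 0],
    [0, 1],
    [1, 0],
]
edges_dm = metapath_edges(disease_mirna)   # disease-miRNA-disease subgraph
edges_dl = metapath_edges(disease_lncrna)  # disease-lncRNA-disease subgraph
print(edges_dm, edges_dl)
```

The two edge sets differ even though the diseases are the same, which is why each intermediate node type yields its own subgraph and its own attention weight.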
As shown in Table 3, our model performs better on dataset 2 and dataset 3 than on dataset 1, which may be due to the different data sample sizes.
Despite the good results of our model, several limitations remain. First, the positive and negative samples in the datasets are imbalanced; for example, in the first dataset, only 2697 associations are known between 240 lncRNA nodes and 412 disease nodes (about 2.7% of the 98,880 possible pairs), which provides limited information for prediction. Second, the model generates subgraphs to aggregate features, so its complexity increases as the amount of data grows. In addition, we did not validate the predicted results through biological experiments; in the future, we will add wet-lab experiments to further evaluate the model's performance.

Conclusion
In this paper, we proposed MMHGAN, a hierarchical network model based on multiple metapaths, to extract features from a multiview perspective and to mine the semantic information contained in different graphs for predicting potential lncRNA-disease associations. By constructing both homogeneous and heterogeneous graphs, the information provided by the neighboring nodes of lncRNA or disease nodes can be mined more comprehensively. In addition, the KNN algorithm and the construction of subgraphs through metapaths effectively reduce the noise generated by sparse matrices, which leads to better model performance. Moreover, we introduced miRNA nodes to construct a ternary heterogeneous graph. To better explore the structural information provided by the heterogeneous graph, we generated corresponding subgraphs through different intermediate nodes and used the GAT network to enhance the features. We assigned different weights to the subgraphs constructed through different nodes to obtain more semantic information. Finally, the MMHGAN outperforms the other methods, and the case studies further confirm its capability.

Fig. 1
Fig. 1 Flowchart of the MMHGAN model. The MMHGAN model consists of four stages. (i) Calculate the combined similarity of lncRNAs and of diseases, and collate the associations among lncRNAs, diseases, and miRNAs. (ii) Construct the homogeneous graphs $G_L$ and $G_D$ from the top k most similar entries in the combined similarity matrices of lncRNAs and diseases, selected with the KNN algorithm, and aggregate the neighbor node features through the multihead attention mechanism. (iii) Construct a heterogeneous graph $G_{lmd}$ based on the association matrix, extract different types of metapaths from the graph, construct subgraphs, and update node embeddings through a graph attention network (GAT). Subsequently, calculate the weights under different metapaths and update the target node embeddings. (iv) Use the fully connected layer to recombine the input features to predict potential lncRNA-disease associations.

$G_d^{P_M} = (v_d^{P_M}, E_{dm})$. $v_d^{P_M}$ represents the set of disease nodes for which a metapath of type $P_M$ exists, and $E_{dm}$ represents the edges formed by connecting disease nodes through miRNA nodes. $G_d^{P_L} = (v_d^{P_L}, E_{dl})$. $v_d^{P_L}$ represents the set of disease nodes for which a metapath of type $P_L$ exists, and $E_{dl}$ represents the edges formed by connecting disease nodes through lncRNA nodes.

Fig. 2
Fig. 2 ROC curves generated by the MMHGAN model under fivefold cross-validation on dataset 1

Fig. 3 Fig. 4
Fig. 3 PR curves generated by the MMHGAN model under fivefold cross-validation on dataset 1

Fig. 8
Fig. 8 Model performance for different values of k

Fig. 9
Fig. 9 Dimensions of the output vector

Fig. 10 Fig. 11
Fig. 10 Survival analysis of breast cancer patients with PVT1

Table 1
Dataset information

Table 2
Comparison of different models

Table 3
Results for different datasets

Table 4
Results for different features of the MMHGAN model

Table 5
The top 15 lung cancer-related lncRNA candidates

Table 6
The top 15 esophageal carcinoma-related lncRNA candidates

Table 7
The top 15 breast cancer-related lncRNA candidates