Predicting lncRNA-disease associations using multiple metapaths in hierarchical graph attention networks

Yao, Dengju; Deng, Yuexiao; Zhan, Xiaojuan; Zhan, Xiaorong

doi:10.1186/s12859-024-05672-2

Research
Open access
Published: 29 January 2024

Predicting lncRNA-disease associations using multiple metapaths in hierarchical graph attention networks

Dengju Yao¹,
Yuexiao Deng¹,
Xiaojuan Zhan^1,2 &
…
Xiaorong Zhan³

BMC Bioinformatics volume 25, Article number: 46 (2024) Cite this article

1311 Accesses
Metrics details

Abstract

Background

Many biological studies have shown that lncRNAs regulate the expression of epigenetically related genes. The study of lncRNAs has helped to deepen our understanding of the pathogenesis of complex diseases at the molecular level. Due to the large number of lncRNAs and the complex and time-consuming nature of biological experiments, applying computer techniques to predict potential lncRNA-disease associations is very effective. To explore information between complex network structures, existing methods rely mainly on lncRNA and disease information. Metapaths have been applied to network models as an effective method for exploring information in heterogeneous graphs. However, existing methods are dominated by lncRNAs or disease nodes and tend to ignore the paths provided by intermediate nodes.

Methods

We propose a deep learning model based on hierarchical graphical attention networks to predict unknown lncRNA-disease associations using multiple types of metapaths to extract features. We have named this model the MMHGAN. First, the model constructs a lncRNA-disease–miRNA heterogeneous graph based on known associations and two homogeneous graphs of lncRNAs and diseases. Second, for homogeneous graphs, the features of neighboring nodes are aggregated using a multihead attention mechanism. Third, for the heterogeneous graph, metapaths of different intermediate nodes are selected to construct subgraphs, and the importance of different types of metapaths is calculated and aggregated to obtain the final embedded features. Finally, the features are reconstructed using a fully connected layer to obtain the prediction results.

Results

We used a fivefold cross-validation method and obtained an average AUC value of 96.07% and an average AUPR value of 93.23%. Additionally, ablation experiments demonstrated the role of homogeneous graphs and different intermediate node path weights. In addition, we studied lung cancer, esophageal carcinoma, and breast cancer. Among the 15 lncRNAs associated with these diseases, 15, 12, and 14 lncRNAs were validated by the lncRNA Disease Database and the Lnc2Cancer Database, respectively.

Conclusion

We compared the MMHGAN model with six existing models with better performance, and the case study demonstrated that the model was effective in predicting the correlation between potential lncRNAs and diseases.

Peer Review reports

Introduction

LncRNAs can regulate the expression of target genes through different cellular mechanisms, such as signal transduction, induction, guidance, and scaffolding, and play a variety of roles in all life processes [1]. Aberrant expression of lncRNAs is usually associated with human diseases. Therefore, mining the correlation between lncRNAs and diseases is conducive to elucidating the pathogenic mechanisms of complex diseases, providing a basis for disease diagnosis and prevention.

Although some lncRNA-disease associations have been experimentally validated, the vast majority of these associations remain unknown [1]. Traditional biological experimental approaches to validate potential lncRNA-disease associations are often resource intensive and costly. To alleviate this problem, computational approaches have received much attention from scholars. Recent methods can be broadly classified into three categories: network-based methods, random walk-based methods, and machine learning-based methods.

Network-based approaches focus on predicting potential associations between lncRNAs and diseases using various propagation algorithms. The first network-based method, LRLSLDA [2], combines the lncRNA-disease association network and the lncRNA expression similarity network and incorporates Laplace's regular least squares in a semisupervised learning framework to identify potential lncRNA-disease associations. Notably, this approach does not require negative samples. Yang et al. [3] used a propagation algorithm to identify existing diseases and detected disease-causing gene associations; based on this information, they constructed a new disease gene-related network and identified lncRNA-disease associations in that network. Li [4] calculated multiple similarities between lncRNAs and diseases, acquired probability matrices of lncRNAs and diseases, and subsequently assessed their network consistency before predicting unknown lncRNA-disease associations. Zhang et al. [5] combined lncRNA, protein, and disease information to construct a network and applied the stream propagation algorithm.

Random walk-based methods can pay more attention to the information that contributes more to the network. Xie et al. [6] proposed the LDA-LLNSUBRW model to predict LDA. This model is based mainly on linear neighborhood similarity and an unbalanced bi-random walk. Sun et al. [7] proposed the RWRlncD method, which is based on a global network that contains the lncRNA functional similarity network, the disease similarity network, and known lncRNA-disease associations. For lncRNAs without a known associated disease, however, this approach cannot be applied. Li et al. [8] designed an improved local random walk method for a newly established heterogeneous network. In 2019, Hu et al. [9] introduced a matrix completion method (LMNLMI).

The third category includes machine learning-based methods. Yao et al. [10] utilized random forests to select features in their proposed methodology. Wang et al. [11] proposed a weighted matrix decomposition (WMD) method for LDA prediction by presetting the weights of different correlation matrices and converting them into low-dimensional matrices. Lan et al. [12] trained a support vector machine (SVM) model to predict potential associations between lncRNAs and diseases by combining multiple biological data. Yu et al. [13] created a predictive model (CFNBC) based on Bayesian classification by unifying the associations among lncRNAs, diseases, and miRNAs. Bayesian classification was used for linear discriminant analysis (LDA) prediction models of collaborative filtering (CFNBC). With the growth of scientific research, there has been an increasing emphasis on neural networks. Neural networks can achieve superior training results by continuously modifying parameters through numerous operations. Recently, graphical neural networks, such as graphical convolutional networks (GCNs) and graphical attention networks (GATs), have been used in bioinformatics research because of their ability to integrate graph topology and node features. To prioritize more relevant neighbors and eliminate noise, they have also developed a bi-interaction aggregator to aggregate representations of similar neighbors. The GBDT-LR [14] model uses two different machine learning methods, gradient boosting decision trees and logistic regression, and combines them. Wu et al. [15] developed the GAMCLDA model, which applies graph convolutional networks to reconstruct graph structures and lncRNA and disease node feature vectors.

These existing methods have achieved satisfactory performance and effectively contributed to the advancement of computational methods for LDA prediction, but the ability of these methods to mine the rich semantic information in heterogeneous graphs composed of lncRNAs and diseases is far from optimal or even satisfactory. Metapaths show strong potential for exploring complex structural and semantic information in heterogeneous networks. Xuan [16] et al. considered that nodes with similar attributes are not only located near the neighborhood of the target node but also located in the region far from the target node. Therefore, they integrated the associations between the nodes, increased global dependencies, and added multiview features of the node pairs. Zhao [17] et al. developed a new framework based on heterogeneous graph attention networks and metapath graph attention networks. They constructed a two-part topological graph of lncRNAs and diseases and used the KNN algorithm to remove noise effects. Inspired by existing studies, we designed a multiple metapath-based hierarchical graph attention network model for lncRNA-disease association prediction. The approach of constructing subgraph aggregation features under multiple types of metapaths is used to obtain information about various relationships between lncRNAs and diseases in both heterogeneous and homogeneous graphs simultaneously for better performance. Our contributions are as follows:

1.
We propose a dual-path feature extraction strategy based on a homogeneous graph and a heterogeneous graph. Subgraph aggregation features of homomorphic and heteromorphic graphs are used to enrich the model input information. The KNN algorithm is used to construct homogeneous subgraphs to reduce computation and denoising. In addition, miRNA information nodes are introduced to construct a ternary heterogeneous network with richer information.
2.
Different types of metapaths are constructed. For the heterogeneous graph, the existing metapaths are only paths for lncRNA or disease nodes, i.e., the connecting pathways of other nodes, such as miRNA nodes, are ignored. We learn each homogeneous graph or heterogeneous subgraph of a specific metapath by extracting the paths that lncRNAs or disease nodes reach through different types of nodes using the GAT network. Moreover, in the heterogeneous subgraphs, we adaptively assign weights to the different metapath subgraphs using the attention mechanism to obtain additional semantic information.

Materials and methods

Datasets

In this study, datasets collected from three studies were used to evaluate the model performance.

Dataset 1: The dataset used in the study by Fu [18] is widely used as a reliable dataset. The main sources are Lnc2Cancer [19], LncRNADisease [20], GeneRIF [21], and starBase v2.0 [22] HMDD v2.0 [23].

Dataset 2: We referred to the dataset screened in Zhou's [24] study. The lncRNAs were integrated from the lncr2cancer v3.0 [25], LncRNADisease v2.0 [26], starBase v2.0 [22] and HMDDv3.2 databases.

Dataset 3: We used the dataset screened by Li et al. [27]. The authors screened relevant records with causal relationships from the HMDDV3.2 database and converted all disease names into standardized names based on the MeSH nomenclature. Finally, 861 lncRNAs, 437 miRNAs, and 432 diseases were obtained.

Model parameter tuning, ablation experiments, and comparisons with the baseline model were performed on dataset 1. Three datasets were used for robustness experiments. The detailed data are shown in Table 1. In this table, LDA represents the association of lncRNAs with diseases, LMA represents the association of lncRNAs with miRNAs, and MDA represents the association of miRNAs with diseases.

Table 1 Dataset information

Full size table

Flowchart of the MMHGAN model

As shown in Fig. 1, we propose the MMHGAN model for predicting lncRNA candidates associated with a given disease. The MMHGAN model consists of data sources, the construction of heterogeneous and homogeneous graphs, the acquisition of subgraph features via multihead attention, and prediction.

LncRNA sequence similarity

We obtained the sequences by lncRNA name from NONCODE (http://www.noncode.org/), GenBank (https://www.ncbi.nlm.nih.gov/) and Ensembl (http://asia.ensembl.org/index.html) to obtain information to find the corresponding sequence of each lncRNA. After obtaining all the lncRNA sequences, based on previous studies by Yang [28] and Li [29] et al., we performed a two-by-two calculation of the lncRNA sequences using the Levenshtein distance, which is the editing distance between strings used to measure the differences between two strings [30]. In previous studies, the editing cost was set to 2, while the insertion cost and deletion cost were set to 1. We followed the same criteria in our study. The formula for the LSS is shown below:

$$LSS\left({l}_{i},{l}_{j}\right)=1-\frac{dist}{len({l}_{i}+{l}_{j})}$$

(1)

where dist denotes the minimum cost of converting the ${l}_{i}$ sequence of a lncRNA to the ${l}_{j}$ sequence and len denotes the length of the lncRNA sequence.

Disease semantic similarity

The computation of the semantic similarity of diseases is based on the medical subject term descriptor [31], available from https://www.ncbi.nlm.nih.gov/. [32] The tool provides topological relationships between diseases and describes them with a directed acyclic graph (DAG). With the known directed acyclic graph, we calculated the semantic similarity DSS between diseases using the method proposed by Wang et al. [32]. Assuming that d is an ancestor node of the DAG and ${d}{\prime}$ is a child node of d, the semantic contribution of each node in the DAG is calculated as follows:

$$\left\{\begin{array}{l}{D}_{D1}\left(d\right)=1 \qquad if\; d=D\\ {D}_{D1}\left(d\right)={\text{max}}\left\{0.5\times {D}_{D1}\left({d}{^{\prime}}\right)|{d}{^{\prime}}\epsilon children\; of\; d\right\} \qquad if \,\,d\ne D\end{array}\right.$$

(2)

After the contribution scores were obtained, the semantic score ${D}_{v1}$ was calculated for each disease:

$${D}_{V1}\left(D\right)=\sum_{d\in T(D)}{D}_{D1}(d)$$

(3)

T represents the DAG topology of the disease.

Finally, the semantic similarity of the two diseases was calculated with the following formula:

$$DSS(d(i),d(j))=\frac{{\sum }_{t\in T(d(i))\cap T(d(j)}({D}_{d(i)}(t)+{D}_{d(j)}(t))}{DV(d(i))+DV(d(j))}$$

(4)

LncRNA/disease GIP kernel similarity

According to previous studies, the lncRNA Gaussian kernel similarity (LGS) and disease Gaussian kernel similarity (DGS) were calculated based on the neighbor-joining matrix LD. The formula for the LGS is as follows:

$$LGS\left({l}_{i},{l}_{j}\right)={\text{exp}}(-{\xi }_{l}{\Vert LD\left(i,:\right)-LD(j,:)\Vert }^{2})$$

(5)

$${\xi }_{l}=1/\left(\frac{1}{{N}_{l}}\sum_{i=1}^{{N}_{l}}{\Vert LD(i,:)\Vert }^{2}\right)$$

(6)

Here, ${N}_{l}$ denotes the number of lncRNAs, and ${\xi }_{l}$ is the regularization factor.

Similarly, the DGS was calculated as follows:

$$DGS\left({d}_{i},{d}_{j}\right)={\text{exp}}(-{\xi }_{d}{\Vert LD\left(i,:\right)-LD(j,:)\Vert }^{2})$$

(7)

$${\xi }_{d}=1/\left(\frac{1}{{N}_{d}}\sum_{i=1}^{{N}_{d}}{\Vert LD(i,:)\Vert }^{2}\right)$$

(8)

Here, ${N}_{d}$ denotes the number of diseases, and ${\xi }_{d}$ is the regularization factor.

Considering that there are many sparse values in the similarity matrix obtained above and that there is a problem with inaccurate prediction of individual semantic information as features, we linearly fused the two similarities in the following equation:

$$LSM=\frac{\alpha LSS\left({l}_{i},{l}_{j}\right)+\left(1-\alpha \right)LGS\left({l}_{i},{l}_{j}\right)}{2}$$

(9)

$$DSM=\frac{\alpha DSS\left({l}_{i},{l}_{j}\right)+(1-\alpha )DGS({l}_{i},{l}_{j})}{2}$$

(10)

LSM and DSM are the combined similarity matrices of lncRNAs and disease after linear fusion.

Feature extraction based on heterogeneous graphs

Subgraph construction based on metapaths

A metapath is a composite relation connecting two objects and is a widely used structure for capturing semantics. Metapaths can be used to explore structural information in heterogeneous graphs and capture rich semantic information, fully and intuitively exploiting network structures.

To explore more diverse information embedded in the metapaths, we constructed a ternary heterogeneous graph ${G}_{lmd}=(V,E)$ containing three types of nodes, lncRNA, miRNA, and disease nodes. The set of nodes is $v=\left\{{v}^{lnc}\cup {v}^{dis}\cup {v}^{mir}\right\}$. ${v}^{lnc}$ represents the set of 240 lncRNA nodes, ${v}^{dis}$ is the set of 412 disease nodes, and ${v}^{mir}$ is the set containing 495 miRNA nodes. The edge E in the heterogeneous graph can be defined as follows:

$$E=\left\{\begin{array}{c}{E}^{lnc-dis}\in {R}^{{N}_{lnc}\times {N}_{dis}}\\ {E}^{lnc-mir}\in {R}^{{N}_{lnc}\times {N}_{mir}}\\ { E}^{mir-dis}\in {R}^{{N}_{mir}\times {N}_{dis}}\end{array}\right.$$

(11)

where ${N}_{lnc}$, ${N}_{dis}$ and ${N}_{mir}$ represent the numbers of lncRNAs, diseases and miRNAs in the dataset, respectively. ${E}^{lnc-dis}$, ${E}^{lnc-mir}$ and ${E}^{mir-dis}$ represent the association matrix of lncRNAs and diseases, the association matrix of lncRNAs and miRNAs and the association matrix of miRNAs and diseases, respectively. Given lncRNA node ${l}_{i}$ $({l}_{i}\in {N}_{lnc})$ and disease node ${d}_{j} ({d}_{j}\in {N}_{dis})$, there is an association between ${l}_{i}$ and ${d}_{j}$ if the association matrix ${E}_{ij}^{lnc-dis}=1$. If ${E}_{ij}^{lnc-dis}=0$, then an association between ${l}_{i}$ $({l}_{i}\in {N}_{lnc})$ and ${d}_{j} ({d}_{j}\in {N}_{dis})$ has not yet been observed. Similarly, if ${E}_{ij}^{lnc-mir}=1$ or ${E}_{ij}^{mir-dis}=1$, nodes ${l}_{i}$ $({l}_{i}\in {N}_{lnc})$ and ${m}_{j} ({m}_{j}\in {N}_{mir})$ or ${m}_{i}$ $({l}_{i}\in {N}_{mir})$ and disease node ${d}_{j} ({d}_{j}\in {N}_{dis})$ are associated; otherwise, ${E}_{ij}^{lnc-mir}=0$ or ${E}_{ij}^{mir-dis}=0$.

The correlation matrix G between the heterogeneous maps ${G}_{lmd}$ can be defined as:

$$G=\left[\begin{array}{ccc}0& {E}^{lnc-dis}& {E}^{lnc-mir}\\ {{E}^{lnc-dis}}^{T}& 0& { E}^{mir-dis}\\ {{E}^{lnc-mir}}^{T}& {{ E}^{mir-dis}}^{T}& 0\end{array}\right]$$

(12)

Dataset 1 was chosen as an example, and 2697 lncRNA-disease associations were experimentally verified. We treated these 2697 experimentally verified associations as positive samples, labeled 1. However, the number of known lncRNA-disease associations is much greater than the number of known lncRNA-disease associations. An imbalance of positive and negative samples reduces the generalizability of the model. To address this issue, we randomly selected an equal number of unknown lncRNA-disease associations, labeled 0, to be added to the heterogeneous map. In addition, we used the combined similarity of lncRNAs, miRNAs, and diseases as lncRNA and disease node features, respectively. Therefore, the lncRNA node feature has 240 dimensions, the disease node feature has 412 dimensions, and the feature vector is represented as a lncRNA, for example:

$${F}_{li}=\left({x}_{1};{x}_{2};{x}_{3};\dots \dots ;{x}_{239},{x}_{240}\right)$$

(13)

$${F}_{di}=({y}_{1};{y}_{2};{y}_{3}\dots \dots ;{y}_{241},{y}_{412})$$

(14)

where ${F}_{li}$ represents the features of the ith lncRNA in the lncRNA similarity matrix and ${x}_{j}$ represents the combined similarity value of the ith lncRNA and the jth lncRNA. Similarly, ${F}_{di}$ represents the feature vector of the ith disease in the disease similarity matrix.

Pathways essentially describe the associations between lncRNAs ${L}_{1}$ and ${L}_{2}$ or between diseases ${D}_{1}$ and ${D}_{2}$. Different metapaths usually have different semantics. In the ternary heterogeneous graph ${G}_{lmd}$ obtained above, it is assumed that there is a metapath type P of $L1\to D1\to L2$, ${L}_{1}$ is a certain lncRNA node, ${D}_{1}$ is a certain disease node with which it is associated, and ${L}_{2}$ is another lncRNA associated with the above disease node. Through the metapath p, if there exists a node v that conforms to the metapath type P, then the set of nodes ${v}_{l}^{pD}$ can be obtained. Thus, we can obtain the subgraph ${G}_{l}^{pD}=({v}_{l}^{pD},{E}_{ld})$ of the LncRNA. ${E}_{ld}$ represents the edges formed by lncRNA nodes conforming to the metapath connections of a given type. In our proposed model, in addition to the metapaths of type $L\to D\to L$, we define three other types of metapaths $L\to M\to L$, $D\to L\to D$, and、$D\to M\to D$. With these three types of metapaths, we can construct the following three kinds of homogeneous subgraphs:

${G}_{l}^{pM}=({v}_{l}^{pM},{E}_{lm})$. ${v}_{l}^{pM}$ represents the set of lncRNA nodes for which a metapath type PM exists for lncRNA nodes, and ${E}_{lm}$ represents the edges formed by connecting lncRNA nodes through miRNA nodes.

${G}_{d}^{pM}=({v}_{d}^{pM},{E}_{dm})$. ${v}_{d}^{pM}$ represents the set of disease nodes for which a metapath type PM exists for disease nodes, and ${E}_{dm}$ represents the edges formed by connecting disease nodes through miRNA nodes.

${G}_{d}^{pL}=({v}_{d}^{pL},{E}_{dl})$. ${v}_{d}^{pL}$ represents the set of disease nodes for which a metapath type PL exists for disease nodes, and ${E}_{dl}$ represents the edges formed by connecting disease nodes through lncRNA nodes.

Feature extraction

After obtaining the above homogeneous subgraph, different nodes were found to be in different feature spaces due to the heterogeneity of nodes in the lncRNA-disease–miRNA heterogeneity graph. To address feature nodes in the same space, we performed a linear transformation on the three types of nodes so that they are mapped into the same feature space. The calculations are as follows:

$${H}_{l\left(i\right)}={W}_{l\left(i\right)}\cdot {F}_{l\left(i\right)}$$

(15)

$${H}_{d(i)}={W}_{d(i)}\cdot {F}_{d(i)}$$

(16)

${H}_{l(i)}$ and ${H}_{d(i)}$ are the projected features of lncRNA node ${l}_{(i)}$ and disease node ${d}_{(i)}$, respectively. The three node feature dimensions are ultimately projected into a 64-dimensional feature space. ${w}_{l(i)}$ and ${w}_{d(i)}$ are the parameter weight matrices of the lncRNA and disease nodes, respectively, with dimensions of 240 × 64 and 412 × 64.

In homogeneous graphs, neighboring nodes exhibit different levels of importance in the task of learning node embeddings. The GAT is an effective tool for learning graph representations because it assigns different weights to neighboring nodes of the central node. In our model, the GAT is used to learn node representations. Feature weights are learned adaptively in subgraphs composed of different metapaths. This approach can fully exploit the information in the heterogeneous network. Specifically, for a given subgraph, the GAT uses an attention mechanism to learn the importance of different neighboring nodes to the target node, and then, for the central node, the features of the neighboring nodes are aggregated based on the calculated scores. For different homogeneous subgraphs, the degree of contribution ${a}_{uv}^{P}$ of a neighbor node v to a node can be calculated as follows:

$${\varphi }_{uv }^{G}=LeakyRelu\left({\left({\left({H}_{u}\right)}^{T}\cdot {H}_{v}\right)}_{G}\right)$$

(17)

$${a}_{uv}^{G}=softmax\left({\varphi }_{uv}^{G}\right)=\frac{{\text{exp}}\left({\varphi }_{uv}^{G}\right)}{{\sum }_{k\epsilon {v}^{G}}{\text{exp}}\left({\varphi }_{uv}^{G}\right)}$$

(18)

where G is the type of subgraph, u is the target node, and v is the neighbor node in the homogeneous subgraph G. LeakyReLU is a nonlinear activation function with a negative slope set to 0.2. ${v}^{G}$ denotes the set of nodes contained in subgraph G according to the subgraph. Finally, the obtained ownership values are normalized with the softmax function to obtain the final weight coefficients ${a}_{uv}^{G}$.

Subsequently, the features of all neighboring nodes v are computed and aggregated with the attention coefficients to update the features of the target node u ${Z}_{u}^{G}$:

$${Z}_{u}^{G}=\sigma \left(\sum_{v\epsilon {v}^{G}}{a}_{uv}^{G}\cdot {H}_{v}\right)$$

(19)

$\sigma$ represents the ELU activation function.

To enhance the model's ability to capture different levels of information, we introduced a multihead attention mechanism to extend the attention scores between nodes. The multihead attention mechanism is an improved attention mechanism that calculates the attention scores between nodes k times and uses the average value as the final score. The embedded feature ${Z}_{u}^{G}$ obtained after the internode attention mechanism is:

$${Z}_{u}^{G}=\frac{{\sum }_{1}^{K}\sigma \left(\sum_{v\epsilon {v}^{G}}{a}_{uv}^{G}\cdot {H}_{v}\right)}{k}$$

(20)

Considering that the embedding of a particular node can only reflect the semantic information of that node one-sidedly, to obtain a more comprehensive and adequate node embedding, we introduced an attention mechanism at the metapath semantic level to calculate the weights that the nodes receive under different subgraphs. Subsequently, the weights are aggregated with the corresponding neighboring nodes and then nonlinearly transformed. The average value of the node features after the nonlinear transformation was used as the contribution value of each metapath. Thus, the weights of nodes under a certain type of subgraph ${W}_{u}^{G}$ are calculated

$${W}_{u}^{G}=\frac{1}{\left|V\right|}\sum_{u\in V}{q}^{T}\cdot tanh\left({W}^{G}\cdot {Z}_{u}^{G}+b\right)$$

(21)

$${\omega }_{u}^{G}=\frac{{\text{exp}}\left({W}_{u}^{Gj}\right)}{{\sum }_{j=1}^{GN}{\text{exp}}\left({W}_{u}^{Gj}\right)}$$

(22)

where V is the total number of nodes under the subgraph adjacent to target node u, tanh is the activation function, ${q}^{T}$ is the trainable semantic layer attention vector with dimensions set to 128, and b is the bias vector. GN is the number of subgraphs of different nodes, and ${W}_{u}^{G}$ is the contribution of different subgraphs to the target node u. After semantic embedding, the final embedding obtained is defined as follows:

$${Z}_{u}=\sum_{i=1}^{GN}{\omega }_{u}^{Gi}\cdot {Z}_{u}^{Gi}$$

(23)

Feature extraction based on homogeneous graphs

A heterogeneous graph constructed based on the correlation between nodes lacks information about nodes of the same type. To further capture the potential characteristics of the presence of same-type nodes, we defined metapaths $L\to L$ and $D\to D$ of the same type of node to construct both lncRNA and disease homogeneous graphs. The construction of the homology graph still requires the establishment of a neighborhood matrix between the nodes. We chose to use the KNN algorithm to construct the respective association matrices of lncRNAs and diseases. Moreover, the KNN algorithm makes predictions based on neighboring samples, and choosing the right number of samples can effectively eliminate the influence of noise.

Based on the comprehensive similarity obtained, the KNN algorithm was used to find the top k lncRNAs or diseases that were most similar to the ith lncRNA or disease, respectively, and assigned values of 1 and 0, respectively. Subsequently, we obtained the association matrices of lncRNAs or diseases with themselves, i.e., ${E}^{lnc-lnc}$ and ${E}^{dis-dis}$. Their assignment formulas are as follows:

$${E}_{ij}^{lnc-lnc}=\left\{\begin{array}{l}1\quad if\; j \in {Nei}_{li}\left(k\right)\\ 0\; otherwise\end{array}\right.$$

(24)

$$E_{ij}^{dis - dis} = \left\{ {\begin{array}{*{20}l} 1 & {if\; j \in Nei_{di} \left( k \right)} \\ 0 & {otherwise} \\ \end{array} } \right.$$

(25)

where ${Nei}_{li}(k)$, (${Nei}_{di}(k)$) contains the top k most similar lncRNA sequences (diseases) and lncRNA li (disease di) contains itself. We empirically set k to 20.

We defined the lncRNA homogeneous graph ${G}_{l}=(V,E)$ as containing the set of nodes ${v}^{lnc}$. The edge E in the graph can be defined as ${E}^{lnc-lnc}\in {R}^{{N}_{lnc}\times {N}_{lnc}}$, where ${N}_{lnc}$ denotes the number of lncRNAs in the dataset. Given lncRNA nodes ${l}_{i}$ $({l}_{i}\in {N}_{lnc})$ and ${l}_{j} ({l}_{j}\in {N}_{lnc})$, ${l}_{i}$ and ${l}_{j}$ are associated with each other if the association matrix ${E}_{ij}^{lnc-lnc}=1$. Additionally, we defined the disease homogeneous graph ${G}_{d}=(V,E)$ containing the set of nodes ${v}^{dis}$. The edge E in the graph can be defined as ${E}^{dis-dis}\in {R}^{{N}_{dis}\times {N}_{dis}}$, where ${N}_{dis}$ denotes the number of disease nodes in the dataset. Given disease nodes ${d}_{i}$ $({d}_{i}\in {N}_{dis})$ and ${d}_{j} ({d}_{j}\in {N}_{dis})$, if the association matrix ${E}_{ij}^{dis-dis}=1$, then there is an association between ${d}_{i}$ and ${d}_{j}$. Conversely, this means that no association is observed between the nodes.

Subsequently, we used the combined similarity of lncRNAs and diseases as the feature vector of the nodes. For the constructed homogeneous graphs, we similarly used the multihead attention mechanism to aggregate the node features and finally obtained the embedded features Z_O.

LDA prediction

We performed feature enhancement for the initial lncRNA and disease similarity using heterogeneous graph extraction of metapaths and homogeneous graph aggregation, respectively. We concatenated the resulting final embeddings and used a fully connected layer to reconstruct the lncRNA and disease features for the final prediction.

The predicted probabilities of lncRNA node i and disease node j are calculated as follows:

$${y}_{ij}=sigmoid\left(W\left({Z}_{li}+{Z}_{dj}\right)+b\right)$$

(26)

${y}_{ij}$ represents the association probability between the final predicted lncRNA li and the disease dj. Additionally, we created a loss function during the model training to quantify the discrepancy between the model's predicted value and the actual value. We then combined this function with the gradient descent approach to efficiently optimize the model's parameters and boost its predictive capability. The model uses an Adam optimizer for the gradient descent algorithm [33]. The following is the formula for calculating the loss function:

$$LOSS=-\left(y{\text{log}}{y}_{ij}+\left(1-y\right){\text{log}}\left(1-{y}_{ij}\right)\right)$$

(27)

y represents the true association of lncRNA with the disease. Finally, the model was trained by a backpropagation algorithm to obtain the final prediction probability.

Comparison with other methods

To further validate the performance of the model, based on dataset 1, we compared the proposed method with five benchmark models. The BiGAN [28] is a generative adversarial model that consists of an encoder, a generator and a discriminator for predicting the associations of novel lncRNAs with diseases. HOPEXGB [34] is a prediction method based on machine learning techniques that uses higher order proximity preserving embedding (HOPE) and extreme gradient boosting (XGB) to identify miRNAs and lncRNAs associated with diseases. VGAELDA [35] is an end-to-end model that integrates variational inference and a graph autoencoder for lncRNA-disease association prediction. GCRFLDA [36] is a prediction method based on graph convolution matrix complementation. SIMCLDA [37] is a method for predicting potential lncRNA-disease associations based on inductive matrix complementation. GAMCLDA [15] is a method based on a graph self-encoder and matrix completion.

Experimental setup

We used a fivefold cross-validation approach to evaluate the models. Our method is based on the PyTorch framework and executed with the dgl package. The computing environment included the Windows 10 operating system with an Intel(R) Core(TM) i5 and 16 GB of RAM. The maximum number of epochs in our model was 500, and all the trainable parameters were learned using the Adam optimizer with a learning rate of 0.001 and a weight decay rate of 0.005.

Evaluation metrics

Referring to the evaluation metrics based on previous studies, we used the receiver operating characteristic (ROC) curve, precision, recall, and F1 score. Additionally, we used three other evaluation metrics, namely, accuracy, sensitivity, and the F1-score. These metrics were calculated as follows:

$$Accuracy=\frac{TN+TP}{TN+TP+FN+FP}$$

(28)

$$Sensitivity (Recall)=\frac{TP}{TP+FN}$$

(29)

$$F1-score=\frac{2\times \mathit{Pr}ecision\times Recall}{\mathit{Pr}ecision+\mathit{Re}call}$$

(30)

Results

Comparison with other advanced methods

As shown in Table 2, compared to the performance metrics of the benchmark model, MMHGAN's overall performance metrics are all higher than 88%. These results are better than those of GCRFLDA (86%), which is the best overall performing model among the benchmark models. MMHGAN has four evaluation metrics that are better than those GCRFLDA. However, the AUPR achieved by MMHGAN is lower than that of GCRFLDA. While the other models achieved good AUC/ACC performance, the performance in terms of the AUPR and recall was less than 80%.

Table 2 Comparison of different models

Full size table

Model performance with different datasets

To better evaluate our model, we tested it on three datasets with multiple evaluation metrics, and the results are shown in Table 3. On these three datasets, all the metrics of the model were greater than 88%. The ROC and PR curves of our model on the three datasets are shown in Figs. 2, 3, 4, 5, 6, 7.

Table 3 Results for different datasets

Full size table

Ablation experiment

Comparison with different feature combinations

To further test the effect of different features on the classification results, we performed the following comparisons:

MMHGAN-NHO: This model aggregates node features only in heterogeneous graphs in the module identified as (iii) in Fig. 1.

MMHGAN-NA: For subgraphs obtained from different metapaths, in the module labeled (iii) in Fig. 1, we set the coefficient of the aggregated features of the subgraphs obtained through different nodes to 0.5 without weight assignment, i.e., the computation node of module (iii) labeled attention.

We compared these two models with the original model, and the comparison results are shown in Table 4. The results show that the model with richer feature information and more diverse attention mechanisms achieved better performance.

Table 4 Results for different features of the MMHGAN model

Full size table

Analysis of parameters

By altering some of the parameters in this model, we can increase its performance. We assessed the value of k in the multiple attention mechanism first. We used k = 1, 2, 4, 8, and 16, and the resulting AUC findings are displayed in Fig. 8. As demonstrated, the model functions best when k = 4. The model is equivalent to that without the multiple attention mechanism when k = 1. The model effect was outperformed by the effects of other k values. This result demonstrates how the multihead attention method can be used to more fairly assign the weights of metapath instances. Second, we tested the different dimensional features of the attention layer and the output features, and Fig. 9 shows the AUC values of the MMHGAN model prediction results when the dimension n of the output features is different. It is clear that as the number of dimensions increases, the AUC value for the MMHGAN model increases. The model produces the best prediction results when the number of dimensions is 256. When there are more than 512 dimensions, the model's performance decreases, perhaps as a result of the model's increased propensity for overfitting, which yields subpar results. We therefore chose 128 as the number of dimensions.

Case study

We studied three cases, lung cancer, esophageal cancer, and breast cancer cases, to further evaluate the performance of the model in predicting the associations between lncRNAs and diseases. For the studied diseases, we filtered out the associations between diseases and lncRNAs and constructed the same number of negative samples for training using the remaining associations between diseases and lncRNAs as positive samples. The diseases to be studied were subsequently entered into the trained model as test samples to obtain the prediction scores. We ranked the scores and selected the 15 lncRNAs with the highest scores as diseases with possible associations for the final predictions. For the prediction results, we compared the results by reviewing the LncRNADisease database, the Lnc2Cancer database, and the published literature. The final predictions for these three diseases are shown in Table 5, 6, and 7.

Table 5 The top 15 lung cancer-related lncRNA candidates

Full size table

Table 6 The top 15 esophageal carcinoma cancer-related lncRNA candidates

Full size table

Table 7 The top 15 breast cancer-related lncRNA candidates

Full size table

Lung cancer is a malignant tumor originating from lung tissue cells that usually spreads through the respiratory tract and is associated with extremely high morbidity and mortality. The prediction results confirmed the presence of all the predicted lncRNAs. The results suggest that the lncRNAs predicted by the model are indeed associated with lung cancer.

Esophageal carcinoma is one of the most common tumors of the digestive tract. Therefore, we chose it as the second case to test the model. Table 6 shows that the predicted associations of 12 of these lncRNAs with diseases can be retrieved from the LncRNADisease and Lnc2Cancer databases.

Breast cancer was studied as the third case. Breast cancer is one of the most common malignant tumors in women and originates from breast epithelial or ductal cells. Its incidence increases with age. As shown in Table 7, 14 of the 15 predicted lncRNAs were confirmed by databases such as lncRNADisease. The above three case studies demonstrated the ability of the MMHGAN model to predict potential lncRNA-disease associations.

KM curve

A Kaplan–Meier curve is a statistical tool used in survival analysis, usually to describe the probability of an event occurring within a certain period. Survival analyses are primarily used to study the time to the occurrence of an event, which can be the onset of a disease, death, or other specific outcome.

Survival time ${t}_{i}$ is the horizontal coordinate, and survival rate ${S}_{{t}_{i}}$ at each time point is the vertical coordinate; the continuous curve formed by connecting the survival rates at each time point is referred to as the survival curve.

Based on the results of the case study, we selected breast cancer for survival analysis based on TCGA [38] data. As shown in Fig. 10 and Fig. 11, for PVT1 and HOTAIR, the survival rates of patients with low lncRNA expression are higher over time.

Discussion

To make full use of lncRNA and disease intermediate information to enhance LDA prediction, we proposed the MMHGAN model to learn each homogeneous graph or heterogeneous subgraph of a specific metapath using a GAT network. In addition, we used the KNN algorithm to construct homogeneous graphs and used an attention mechanism to adaptively assign weights to different heterogeneous metapath subgraphs to achieve denoising and to obtain additional semantic information. The cross-validation results show that the overall performance of the model outperforms that of the baseline comparison method.

Several studies have been conducted to introduce primary and deeper information for disease association prediction through the k-nearest neighbors (KNN) algorithm, and the model performance has further improved. These studies have validated the effectiveness of combining the KNN algorithm and GCN in disease association prediction. Consistent with these studies, we also constructed homogeneous subgraphs using the KNN algorithm and acquired features using the GAT. The difference is that our homogeneous graphs in the input KNN algorithm are the LSM and DSM, which are the merged similarity matrices of lncRNAs and diseases after linear fusion.

To explore better disease association prediction models, different approaches have been used to fully exploit disease association information. Yang [28] et al. introduced the generative anti-network approach to lncRNA disease association prediction. Shi [35] et al. proposed VGAELDA, which integrates variational inference and a graph autoencoder through the integration of graph representation learning and alternating training involving variational inference, which enhances the ability of VGAELDA to capture efficient low-dimensional representations from high-dimensional features. Fan [36] et al. proposed GCRFLDA, a prediction method based on graph convolutional matrix complementation. utilizing conditional random fields and attention mechanisms to form encoders and decoders, learn efficient embedding of nodes, and score lncRNA-disease associations. As shown in Table 2, although these methods use different techniques and obtain good performance (AUC > 89%), they do account for the rich semantic information in heterogeneous graphs. He [34] et al. proposed a prediction method based on machine learning techniques to identify disease-related miRNAs and lncRNAs by higher-order proximity-preserving embedding (HOPE) and extreme gradient lifting (XGB) using a heterogeneous disease–miRNA‒lncRNA (DML) information network. Lu [37] et al. proposed a prediction method based on disease–gene and gene–gene correlations, computed the Gaussian interaction spectrum kernel of lncRNAs, and proposed a method to predict potential lncRNA-disease associations on the basis of inductive matrix complementation. Wu [15] introduced graph self-encoders to learn lncRNAs and characterize diseases through their ability to encode and decode graph structures and features. While these methods have advanced the field by considering heterogeneous graph-rich information, they have not fully exploited the potential of heterogeneous graph-rich information, as shown in Table 2, where the overall performance of the methods was 75%. In addition, these methods do not further consider the information of the intermediate nodes of the metapath subgraph. Inspired by Xuan [16] and Zhao [17] et al., we utilized subgraphs constructed from homogeneous graphs and heterogeneous graphs as inputs and adopted multipath subgraphs combined with a multihead attention mechanism to acquire features, fully considering the information of the intermediate nodes of the metapath subgraphs. As shown in Table 2, our method's AUC, ACC, recall, and F1 score are 0.59%, 0.48%, 2.05%, and 0.85% greater than those of the best baseline model, GCRFLDA.

Our study is inspired by GSMV, a new association prediction model proposed by Xuan et al., and HGATLDA, a novel metapath-based heterogeneous graph attention network framework developed by Zhao et al. Unlike the HGATLDA approach, these methods do not consider homogeneous subgraph information. We obtained the features of homogeneous subgraphs through a multihead attention mechanism; in addition, unlike GSMV, which uses metapath instances to obtain semantic information, we used metapath subgraphs to obtain semantic information. Subgraphs can better capture local structural information and are more interpretable; additionally, when dealing with sparse matrices, metapath extraction of subgraphs can reduce the computational complexity and noise interference, and it is easier to adapt to different requirements and data characteristics by extracting subgraphs according to different paths.

As shown in Table 3, our model performs better on dataset 2 and dataset 3 than on dataset 1, which may be due to the different data sample sizes.

Despite the good results of our model, there are still several limitations. First, there was an imbalance of positive and negative samples in the datasets; for example, in the first dataset, only 2697 associations existed between 240 lncRNA nodes and 412 disease nodes, which was insufficient for predicting the results. Second, generating subgraphs was used in the model to aggregate the features, and the complexity of the model increased when the amount of data increased. In addition, we did not validate the results predicted by the model through biological experiments; in the future, we will add biological wet experiments to further evaluate the model's performance.

Conclusion

In this paper, we proposed a hierarchical network model of multiple metapaths, MMHGAN, to extract features from a multiview perspective and to mine the semantic information contained in different graphs for predicting potential lncRNA-disease associations. By constructing both homogeneous and heterogeneous graphs, the information provided by the neighboring nodes of lncRNAs or disease nodes can be mined more comprehensively. In addition to the KNN algorithm and the method of constructing subgraphs through metapaths, the noise generated by sparse matrices can be effectively reduced, which can lead to better performance of our model. Moreover, we introduced miRNA nodes to construct a ternary heterogeneous graph. To better explore the structural information provided by the heterogeneous graph, we generated corresponding subgraphs with the help of different nodes and used the GAT network to enhance the features. We assigned different weights to the subgraphs constructed by different nodes to obtain more semantic information. Finally, the MMHGAN also outperforms the other methods. In the case study, the capability of the MMHGAN model is further confirmed.

Availability of data and materials

The data and code can be downloaded from the following website: https://github.com/ydkvictory/MMHGAN.

References

Yang Y, Yujiao W, Fang W, Linhui Y, Ziqi G, Zhichen W, et al. The roles of miRNA, lncRNA and circRNA in the development of osteoporosis. Biol Res. 2020;53:40.
Article CAS PubMed PubMed Central Google Scholar
Chen X, Yan G-Y. Novel human lncRNA-disease association inference based on lncRNA expression profiles. Bioinformatics. 2013;29:2617–24.
Article CAS PubMed Google Scholar
Yang X, Gao L, Guo X, Shi X, Wu H, Song F, et al. A network based method for analysis of lncRNA-disease associations and prediction of lncRNAs implicated in diseases. PLoS ONE. 2014;9: e87797.
Article PubMed PubMed Central Google Scholar
Li G, Luo J, Liang C, Xiao Q, Ding P, Zhang Y. Prediction of LncRNA-disease associations based on network consistency projection. IEEE Access. 2019;7:58849–56.
Article Google Scholar
Zhang J, Zhang Z, Chen Z, Deng L. Integrating multiple heterogeneous networks for novel LncRNA-disease association inference. IEEE/ACM Trans Comput Biol and Bioinf. 2019;16:396–406.
Article CAS Google Scholar
Xie G, Jiang J, Sun Y. LDA-LNSUBRW: lncRNA-disease association prediction based on linear neighborhood similarity and unbalanced bi-random walk. IEEE/ACM Trans Comput Biol and Bioinf. 2020;1:1–1.
Article Google Scholar
Sun J, Shi H, Wang Z, Zhang C, Liu L, He W, et al. Inferring novel lncRNA-disease associations based on random walk on lncRNA functional similarity network. Mol BioSyst. 2014;10:2074.
Article CAS PubMed Google Scholar
Li J, Zhao H, Xuan Z, Yu J, Feng X, Liao B, et al. A novel approach for potential human LncRNA-disease association prediction based on local random walk. IEEE/ACM Trans Comput Biol Bioinf. 2021;18:1049–59.
Article CAS Google Scholar
Hu P, Huang Y-A, Chan KCC, You Z-H. Learning multimodal networks from heterogeneous data for prediction of lncRNA–miRNA interactions. IEEE/ACM Trans Comput Biol Bioinf. 2020;17:1516–24.
Article CAS Google Scholar
Yao D, Zhan X, Zhan X, Kwoh CK, Li P, Wang J. A random forest based computational model for predicting novel lncRNA-disease associations. BMC Bioinf. 2020;21:126.
Article CAS Google Scholar
Wang Y, Yu G, Wang J, Fu G, Guo M, Domeniconi C. Weighted matrix factorization on multi-relational data for LncRNA-disease association prediction. Methods. 2020;173:32–43.
Article CAS PubMed Google Scholar
Lan W, Li M, Zhao K, Liu J, Wu F-X, Pan Y, et al. LDAP: a web server for lncRNA-disease association prediction. Bioinf. 2017;33:458–60.
CAS Google Scholar
Yu J, Xuan Z, Feng X, Zou Q, Wang L. A novel collaborative filtering model for LncRNA-disease association prediction based on the Naïve Bayesian classifier. BMC Bioinf. 2019;20:396.
Article Google Scholar
Zhou S, Wang S, Wu Q, Azim R, Li W. Predicting potential miRNA-disease associations by combining gradient boosting decision tree with logistic regression. Comput Biol Chem. 2020;85: 107200.
Article CAS PubMed Google Scholar
Wu X, Lan W, Chen Q, Dong Y, Liu J, Peng W. Inferring LncRNA-disease associations based on graph autoencoder matrix completion. Comput Biol Chem. 2020;87: 107282.
Article CAS PubMed Google Scholar
Xuan P, Wang S, Cui H, Zhao Y, Zhang T, Wu P. Learning global dependencies and multi-semantics within heterogeneous graph for predicting disease-related lncRNAs. Briefings Bioinf. 2022;23:bbac361.
Article Google Scholar
Zhao X, Zhao X, Yin M. Heterogeneous graph attention network based on meta-paths for lncRNA-disease association prediction. Briefings Bioinf. 2022;23:bbab407.
Article Google Scholar
Fu G, Wang J, Domeniconi C, Yu G. Matrix factorization-based data fusion for the prediction of lncRNA-disease associations. Bioinformatics. 2018;34:1529–37.
Article CAS PubMed Google Scholar
Ning S, Zhang J, Wang P, Zhi H, Wang J, Liu Y, et al. Lnc2Cancer: a manually curated database of experimentally supported lncRNAs associated with various human cancers. Nucl Acids Res. 2016;44:D980–5.
Article CAS PubMed Google Scholar
Chen G, Wang Z, Wang D, Qiu C, Liu M, Chen X, et al. LncRNADisease: a database for long-non-coding RNA-associated diseases. Nucl Acids Res. 2012;41:D983–6.
Article PubMed PubMed Central Google Scholar
Lu Z, Bretonnel CK, Hunter L. GeneRIF quality assurance as summary revision. In: Biocomputing 2007. World Scientific, Maui, Hawaii, USA;2006, 269–80.
Li J-H, Liu S, Zhou H, Qu L-H, Yang J-H. starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein–RNA interaction networks from large-scale CLIP-Seq data. Nucl Acids Res. 2014;42:D92–7.
Article CAS PubMed Google Scholar
Li Y, Qiu C, Tu J, Geng B, Yang J, Jiang T, et al. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucl Acids Res. 2014;42:D1070–4.
Article CAS PubMed Google Scholar
Zhou Y, Wang X, Yao L, Zhu M. LDAformer: predicting lncRNA-disease associations based on topological feature extraction and Transformer encoder. Briefings Bioinf. 2022;23:bbac370.
Article Google Scholar
Gao Y, Shang S, Guo S, Li X, Zhou H, Liu H, et al. Lnc2Cancer 3.0: an updated resource for experimentally supported lncRNA/circRNA cancer associations and web tools based on RNA-seq and scRNA-seq data. Nucl Acids Res. 2021;49:D1251–8.
Article CAS PubMed Google Scholar
Bao Z, Yang Z, Huang Z, Zhou Y, Cui Q, Dong D. LncRNADisease 2.0: an updated database of long non-coding RNA-associated diseases. Nucl Acids Res. 2019;47:D1034–7.
Article CAS PubMed Google Scholar
Li J, Wang D, Yang Z, Liu M. HEGANLDA: a computational model for predicting potential lncRNA-disease associations based on multiple heterogeneous networks. IEEE/ACM Trans Comput Biol and Bioinf. 2021;1:1.
CAS Google Scholar
Yang Q, Li X. BiGAN: LncRNA-disease association prediction based on bidirectional generative adversarial network. BMC Bioinf. 2021;22:357.
Article CAS Google Scholar
Li M, Liu M, Bin Y, Xia J. Prediction of circRNA-disease associations based on inductive matrix completion. BMC Med Genom. 2020;13:42.
Article Google Scholar
Wang W, Zhang L, Sun J, Zhao Q, Shuai J. Predicting the potential human lncRNA–miRNA interactions based on graph convolution network with conditional random field. Briefings Bioinf. 2022;23:463.
Article Google Scholar
Xuan P, Han K, Guo M, Guo Y, Li J, Ding J, et al. Prediction of microRNAs associated with human diseases based on weighted k most similar neighbors. PLoS ONE. 2013;8: e70204.
Article CAS PubMed PubMed Central Google Scholar
Wang JZ, Du Z, Payattakool R, Yu PS, Chen C-F. A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007;23:1274–81.
Article CAS PubMed Google Scholar
Liang Q, Zhang W, Wu H, Liu B. LncRNA-disease association identification using graph auto-encoder and learning to rank. Briefings Bioinf. 2023;24:539.
Article Google Scholar
He J, Li M, Qiu J, Pu X, Guo Y. HOPEXGB: A Consensual Model for Predicting miRNA/lncRNA-Disease Associations Using a Heterogeneous Disease-miRNA-lncRNA Information Network. J Chem Inf Model. 2023;acs.jcim.3c00856.
Shi Z, Zhang H, Jin C, Quan X, Yin Y. VGAE : A representation learning model based on variational inference and graph autoencoder for predicting lncRNA-disease associations. BMC Bioinf. 2021;22:136.
Article CAS Google Scholar
Fan Y, Chen M, Pan X. GCRFLDA: scoring lncRNA-disease associations using graph convolution matrix completion with conditional random field. Briefings Bioinf. 2022;23:361.
Article Google Scholar
Lu C, Yang M, Luo F, Wu F-X, Li M, Pan Y, et al. Prediction of lncRNA-disease associations based on inductive matrix completion. Bioinformatics. 2018;34:3357–64.
Article CAS PubMed Google Scholar
Tomczak K, Czerwińska P, Wiznerowicz M. Review The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. w. 2015;1:68–77.
Google Scholar

Download references

Funding

This work is supported by the National Natural Science Foundation of China (Grant No. 62172128). The funding body did not play any role in the design of the study; the collection, analysis, or interpretation of the data; or the writing of the manuscript.

Author information

Authors and Affiliations

School of Computer Science and Technology, Harbin University of Science and Technology, Harbin, 150080, China
Dengju Yao, Yuexiao Deng & Xiaojuan Zhan
College of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, 150050, China
Xiaojuan Zhan
Department of Endocrinology and Metabolism, Hospital of South, University of Science and Technology, Shenzhen, 518055, China
Xiaorong Zhan

Authors

Dengju Yao
View author publications
You can also search for this author in PubMed Google Scholar
Yuexiao Deng
View author publications
You can also search for this author in PubMed Google Scholar
Xiaojuan Zhan
View author publications
You can also search for this author in PubMed Google Scholar
Xiaorong Zhan
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

D.J.Y. directed the research and revised the paper. Y.X.D. conceived and implemented the model, performed the experiments, and wrote the paper. X.J.Z. and X.R.Z. analyzed the experimental results and revised the paper. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Dengju Yao.

Ethics declarations

Ethical approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Yao, D., Deng, Y., Zhan, X. et al. Predicting lncRNA-disease associations using multiple metapaths in hierarchical graph attention networks. BMC Bioinformatics 25, 46 (2024). https://doi.org/10.1186/s12859-024-05672-2

Download citation

Received: 09 November 2023
Accepted: 23 January 2024
Published: 29 January 2024
DOI: https://doi.org/10.1186/s12859-024-05672-2

Predicting lncRNA-disease associations using multiple metapaths in hierarchical graph attention networks

Abstract

Background

Methods

Results

Conclusion

Introduction

Materials and methods

Datasets

Flowchart of the MMHGAN model

LncRNA sequence similarity

Disease semantic similarity

LncRNA/disease GIP kernel similarity

Feature extraction based on heterogeneous graphs

Subgraph construction based on metapaths

Feature extraction

Feature extraction based on homogeneous graphs

LDA prediction

Comparison with other methods

Experimental setup

Evaluation metrics

Results

Comparison with other advanced methods

Model performance with different datasets

Ablation experiment

Comparison with different feature combinations

Analysis of parameters

Case study

KM curve

Discussion

Conclusion

Availability of data and materials

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethical approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us