A representation learning model based on variational inference and graph autoencoder for predicting lncRNA-disease associations
BMC Bioinformatics volume 22, Article number: 136 (2021)
Abstract
Background
Numerous studies have demonstrated that long noncoding RNAs (lncRNAs) are related to many human diseases. Therefore, it is crucial to predict potential lncRNA-disease associations for disease prognosis, diagnosis and therapy. Dozens of machine learning and deep learning algorithms have been applied to this problem, yet it is still challenging to learn efficient low-dimensional representations from high-dimensional features of lncRNAs and diseases to predict unknown lncRNA-disease associations accurately.
Results
We proposed an end-to-end model, VGAELDA, which integrates variational inference and graph autoencoders for lncRNA-disease association prediction. VGAELDA contains two kinds of graph autoencoders. Variational graph autoencoders (VGAE) infer representations from features of lncRNAs and diseases respectively, while graph autoencoders propagate labels via known lncRNA-disease associations. These two kinds of autoencoders are trained alternately using the variational expectation maximization algorithm. The integration of the VGAE for graph representation learning and the alternate training via variational inference strengthens the capability of VGAELDA to capture efficient low-dimensional representations from high-dimensional features, and hence improves the robustness and precision of predicting unknown lncRNA-disease associations. Further analysis shows that the co-training framework of lncRNA and disease designed for VGAELDA solves a geometric matrix completion problem for capturing efficient low-dimensional representations via a deep learning approach.
Conclusion
Cross validations and numerical experiments illustrate that VGAELDA outperforms the current state-of-the-art methods in lncRNA-disease association prediction. Case studies indicate that VGAELDA is capable of detecting potential lncRNA-disease associations. The source code and data are available at https://github.com/zhanglabNKU/VGAELDA.
Introduction
LncRNAs are RNAs longer than 200 nucleotides that lack protein-coding function, yet they still influence a series of biological processes, such as gene transcription, cell apoptosis, hormonal regulation, and immune response. Hence, lncRNAs are closely linked to many human diseases [1,2,3]. For instance, lncRNA PANDAR is a novel biomarker of breast cancer, which upregulates proliferation of breast cancer cells [4]. Sun et al. [5] found that the downregulation of lncRNA MEG3 promotes proliferation of gastric cancer cells. Faghihi et al. [6] reported that lncRNA BACE1-AS can regulate mRNA BACE1, while BACE1 is associated with the generation of beta-amyloid, which can cause Alzheimer's disease. Therefore, it is essential to predict potential lncRNA-disease associations for disease prevention, detection, diagnosis and treatment. However, only a small number of lncRNA-disease associations have been discovered so far, and it would be ideal to predict more potential lncRNA-disease associations using computational approaches. Generally, computational methods, especially machine learning algorithms, are more time-efficient and cost-effective for detecting potential lncRNA-disease associations than experimental methods.
Previous machine learning approaches for predicting lncRNA-disease associations can be categorized into three types. The first type of methods is based on matrix analysis. Two commonly used matrix analysis methods for predicting lncRNA-disease associations are manifold regularization [7] and matrix completion [8], which suggest that the lncRNA-disease association matrix follows a manifold constraint or a low-rank constraint, respectively. Manifold regularization based methods have been widely adopted for link prediction of biological entities [9,10,11]. The Laplacian regularized least square (LRLS) method [7] integrates manifold regularization and the basic least square method. Chen and Yan [12] proposed LRLSLDA, which applied LRLS to lncRNA-disease association prediction after constructing an lncRNA graph and a disease graph by computing feature similarity respectively. Based on LRLSLDA, several methods were proposed to improve the performance of LRLS by integrating different types of feature similarities [13, 14]. In addition, lncRNA-disease associations can be viewed as links on an lncRNA-disease bipartite graph. The matrix completion algorithm [8] can solve the link prediction problem by applying a low-rank constraint to the association matrix, and has been commonly applied to forecast associations among biological entities [15,16,17]. Lu et al. [18] proposed a matrix completion based method for predicting lncRNA-disease associations. Geometric matrix completion [19, 20] incorporates manifold regularization into the matrix completion problem, and Lu et al. [21] proposed a geometric matrix completion based framework for predicting lncRNA-disease associations.
The second type of methods focuses on the integration of heterogeneous features. Applying multi-source features to learn better representations is an efficient technique for predicting associations among biological entities [22, 23]. Lan et al. [24] developed a web server for lncRNA-disease association prediction by integrating multiple features of lncRNAs and diseases to construct an lncRNA similarity network and a disease similarity network. Fu et al. [25] integrated heterogeneous data for lncRNA-disease association prediction by matrix factorization with a low-rank constraint. Ding et al. [26] inferred links on the lncRNA-disease bipartite graph via an lncRNA-disease-gene tripartite graph. Yao et al. [27] adopted random forest for feature selection in lncRNA-disease association prediction.
The third type is deep learning approaches. Neural networks are able to capture efficient low-dimensional representations from high-dimensional features of biological entities, and deep learning based methods were proposed for detecting potential associations among biological entities [17, 22, 28]. Thus, several deep learning models applying autoencoders for representation learning of lncRNA features and disease features were proposed [29, 30]. Graph neural networks (GNN) [31] were proposed for deep learning on graphs, and several recent approaches for lncRNA-disease association prediction are based on GNN. Xuan et al. [32] integrated graph convolutional networks (GCN) [33] and CNN to learn representations from features of lncRNAs and diseases. GCN is applicable to link prediction on bipartite graphs [34], and Wu et al. [35] adopted a graph autoencoder to predict lncRNA-disease associations on the lncRNA-disease bipartite graph.
In this paper, we proposed a method, VGAELDA, that integrates variational inference and graph autoencoders to improve the performance of lncRNA-disease association prediction. In previous works, feature inference and label propagation are two separate stages, and hence the label propagation procedure may fail to make full use of the low-dimensional representations learned from high-dimensional features. Using deep learning approaches, our method provides an end-to-end framework, which fuses feature inference and label propagation under the variational inference algorithm of Graph Markov Neural Networks (GMNN) [36]. Specifically, the feature inference network in VGAELDA is designed as a variational graph autoencoder (VGAE) [37] that learns representations from feature matrices of lncRNAs and diseases respectively. Furthermore, the label propagation network in our model is a graph autoencoder (GAE) [37] that estimates the score of unknown lncRNA-disease pairs from known ones. These two graph autoencoders learn from features and propagate labels alternately; they are trained by the variational EM algorithm and are implemented as a representation learning framework. This framework minimizes the difference between the representations learned by the two autoencoders. Therefore, VGAELDA has the following advantages. (i) VGAE is well suited to inferring low-dimensional representations from high-dimensional features in a graph, and these representations can better depict similarities and dependencies among nodes. This would significantly enhance the robustness and precision of prediction without handcrafted feature similarities. (ii) VGAELDA implements the variational EM algorithm as a representation learning framework, by training the feature inference autoencoder and the label propagation autoencoder alternately. (iii) VGAELDA provides a useful solution to the geometric matrix completion problem via deep learning, because autoencoders tend to minimize the rank of outputs, and we suggest that manifold regularization can be obtained via the alternate training of two graph autoencoders. (iv) VGAELDA implements an efficient way to integrate information from lncRNA space and disease space. Experiments illustrate that VGAELDA is superior to the current state-of-the-art methods, and case studies on several diseases illustrate the capability of VGAELDA to detect new lncRNA-disease associations.
Results
Datasets
In this paper, we adopted two datasets for evaluation. Dataset1 is an lncRNA-disease association dataset from [26], including 540 associations among 115 lncRNAs and 178 diseases. Dataset2 is an lncRNA-disease association dataset from [25], including 2697 associations among 240 lncRNAs and 412 diseases. Both of them were collected from the LncRNADisease database [38].
For each lncRNA, we adopted Word2Vec to compute the feature vector. Word2Vec [39] is an efficient method to learn the embedding vectors of natural language, and BioVec [40] (https://pypi.org/project/biovec/) applied Word2Vec for representation learning of biological sequences, including protein sequences or nucleotide sequences. In VGAELDA, the length of each vector was set at 300. We downloaded lncRNA sequences from the Nucleotide Database of NCBI.
For each disease, we adopted its associations with 1415 genes as the feature vector on Dataset1. Dataset2 includes diseases associated with 15527 genes. After removing genes that are not associated with any diseases, 10146 genes remain and are used as the feature vector on Dataset2. Information with respect to diseases was collected from DisGeNet [41] and Disease Ontology [42].
Comparison with other methods
Cross validation
We compared our proposed method, VGAELDA, with five other state-of-the-art methods:

LRLSLDA: Chen and Yan [12] proposed a framework based on the Laplacian regularized least square (LRLS) method [7] to predict lncRNA-disease associations.

SIMCLDA: Lu et al. [18] proposed a computational method for predicting lncRNA-disease associations based on speedup inductive matrix completion (SIMC) [43].

TPGLDA: Ding et al. [26] integrated heterogeneous features by constructing an lncRNA-disease-gene tripartite graph for lncRNA-disease association prediction.

SKFLDA: Xie et al. [14] proposed SKFLDA, which applied a kernel fusion trick to different types of similarities to improve the precision of lncRNA-disease association prediction.

GAMCLDA: Wu et al. [35] implemented GAMCLDA, adopting graph autoencoders to predict lncRNA-disease associations on the lncRNA-disease bipartite graph.
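We adopted 5-fold cross validation to obtain the results, and the metrics are listed below.
$$\begin{aligned} \mathrm {Sensitivity}=\mathrm {TPR}=\frac{TP}{TP+FN}, \end{aligned}$$(1)
$$\begin{aligned} \mathrm {Specificity}=1-\mathrm {FPR}=\frac{TN}{TN+FP}, \end{aligned}$$(2)
$$\begin{aligned} \mathrm {Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}, \end{aligned}$$(3)
$$\begin{aligned} \mathrm {Precision}=\frac{TP}{TP+FP}, \end{aligned}$$(4)
$$\begin{aligned} \mathrm {F1}=\frac{2\times \mathrm {Precision}\times \mathrm {Sensitivity}}{\mathrm {Precision}+\mathrm {Sensitivity}}, \end{aligned}$$(5)
$$\begin{aligned} \mathrm {Mcc}=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}, \end{aligned}$$(6)
where TP denotes true positive, FN denotes false negative, TN denotes true negative, FP denotes false positive, TPR denotes true positive rate, FPR denotes false positive rate, and Mcc denotes Matthews correlation coefficient. The receiver operating characteristic (ROC) curve can be plotted by TPR and FPR, while the area under the ROC curve (AUROC) and the area under the precision-recall curve (AUPR) are important metrics to measure the performance of a binary classification model.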
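As a small worked example, the confusion-matrix metrics used in this evaluation can be computed from the four counts as sketched below. The sketch uses the standard textbook definitions; the function name and the toy counts are illustrative and not taken from the VGAELDA code.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Standard confusion-matrix metrics (assumes every denominator is non-zero)."""
    sn = tp / (tp + fn)                        # sensitivity = recall = TPR
    sp = tn / (tn + fp)                        # specificity = 1 - FPR
    acc = (tp + tn) / (tp + fn + tn + fp)      # accuracy
    pre = tp / (tp + fp)                       # precision
    f1 = 2 * pre * sn / (pre + sn)             # harmonic mean of precision and recall
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Sn": sn, "Sp": sp, "Acc": acc, "Pre": pre, "F1": f1, "Mcc": mcc}

# Toy, made-up counts for an imbalanced problem (10 positives, 90 negatives).
m = binary_metrics(tp=8, fn=2, tn=85, fp=5)
```

On imbalanced data such as this toy case, accuracy stays high even when precision is modest, which is why the text also reports F1 and Mcc.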
We plotted the ROC curves and PR curves of Dataset1 and Dataset2 in Figs. 1 and 2, respectively. We ran our experiments 5 times, and the mean values and standard deviations of AUROC and AUPR are listed in Table 1. The AUROC and AUPR values of VGAELDA in the 5 runs are listed in Additional file 1.
The results show that VGAELDA outperforms the other five state-of-the-art methods in both AUROC and AUPR, on both datasets. Specifically, among the AUPR values obtained by the other five state-of-the-art methods, GAMCLDA performs best in 5-fold CV on both Dataset1 and Dataset2, with AUPR values of 0.5794 and 0.3798 respectively. Compared with these AUPR values, VGAELDA significantly outperforms these previous methods, increasing the AUPR value by 45% in 5-fold CV on Dataset1, and by 116% in 5-fold CV on Dataset2.
Evaluation on imbalanced data
As the datasets are imbalanced, i.e., the number of negative samples far exceeds the number of positive samples, it is essential to evaluate the capability to retrieve true positive samples from predicted positive ones. In our experiments, the evaluation was implemented in the following two ways. In summary, VGAELDA performs the best under both evaluations.
Firstly, we evaluated the performance of our model at high stringency levels of specificity according to Eqs. (2)-(6). We fixed specificity at 0.95 and 0.99, and then computed sensitivity, accuracy, precision, F1-score and Mcc. The results of Dataset1 and Dataset2 are listed in Additional file 2 and Table 2, respectively, which illustrate that VGAELDA outperforms the other five methods on all five metrics and on both datasets. Matthews correlation coefficient (Mcc) is a comprehensive metric in binary classification on imbalanced data [44]. Among the Mcc values obtained by the other five state-of-the-art methods, SKFLDA performs the best at \(Sp=0.95\) on Dataset1, obtaining 0.4637; GAMCLDA performs the best at \(Sp=0.99\) on Dataset1 and at both \(Sp=0.95\) and 0.99 on Dataset2, obtaining 0.5804, 0.3855 and 0.4860 respectively. VGAELDA outperforms these methods by improving the Mcc values by 13% and 28% at \(Sp=0.95\) and 0.99 on Dataset1, and by 42% and 49% at \(Sp=0.95\) and 0.99 on Dataset2.
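Fixing specificity amounts to choosing a score threshold from the negative samples and then reading off the remaining metrics at that threshold. The following is a minimal sketch of one way to do this; the thresholding rule (predict positive when the score is strictly above the threshold) and the function name are assumptions made for illustration, not the authors' procedure.

```python
import math

def sensitivity_at_specificity(pos_scores, neg_scores, sp):
    """Return (sensitivity, achieved specificity, threshold) for a target specificity."""
    negs = sorted(neg_scores)
    # Smallest number of negatives that must score at or below the threshold.
    need = math.ceil(sp * len(negs) - 1e-9)    # small slack against float round-off
    threshold = negs[need - 1] if need > 0 else float("-inf")
    tn = sum(s <= threshold for s in negs)     # negatives correctly rejected
    tp = sum(s > threshold for s in pos_scores)  # positives scoring above the threshold
    return tp / len(pos_scores), tn / len(negs), threshold

# Toy scores: 20 negatives spread over [0, 0.95], three positives.
neg = [i / 20 for i in range(20)]
pos = [0.99, 0.97, 0.5]
sn, achieved_sp, thr = sensitivity_at_specificity(pos, neg, sp=0.95)
```

With ties among negative scores the achieved specificity can exceed the target, which is why the sketch returns it explicitly.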
Secondly, we evaluated the recall score (i.e. sensitivity) by counting the number of true positive samples at different top-k cutoffs, according to Eq. (1), where \(k\in \{20,40,60,80,100\}\). The bar charts depicting the number of true positive samples at different top-k cutoffs on Dataset1 and Dataset2 are shown in Additional file 3 and Fig. 3, respectively. VGAELDA retrieves the most true positive samples at all 5 cutoffs on both Dataset1 and Dataset2.
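Counting true positives at top-k cutoffs can be sketched as follows: rank the candidate pairs by predicted score and count how many held-out positives fall in the first k. The function name and the toy data are hypothetical.

```python
def true_positives_at_k(scores, labels, ks=(20, 40, 60, 80, 100)):
    """Count positives (label 1) among the k highest-scoring candidates, for each k."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]        # labels in descending score order
    return {k: sum(ranked[:k]) for k in ks}

# Toy example with five candidate pairs and small cutoffs.
hits = true_positives_at_k(
    scores=[0.9, 0.1, 0.8, 0.7, 0.2],
    labels=[1, 0, 0, 1, 1],
    ks=(1, 3, 5),
)
```

Dividing each count by the total number of positives gives recall at k.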
Case studies
To further evaluate the capability of VGAELDA to detect unknown lncRNA-disease associations, we conducted case studies. We predicted the unknown disease-related lncRNAs of some specific diseases on the datasets, which can be validated by PubMed literature. The unknown disease-related lncRNAs of a disease are ranked by VGAELDA-predicted score. In this paper, we conducted case studies on lncRNAs associated with breast cancer and colon cancer.
On Dataset 1, the top 10 VGAELDA-predicted lncRNAs associated with breast cancer and colon cancer are listed in Tables 3 and 4, respectively. PMID denotes the PubMed ID of the supporting literature for the corresponding disease-related lncRNAs detected by VGAELDA. Table 3 indicates that all the top 10 VGAELDA-predicted lncRNAs associated with breast cancer have been confirmed by previous literature. Table 4 indicates that 8 of the top 10 VGAELDA-predicted lncRNAs associated with colon cancer have been confirmed as well.
On Dataset 2, the top 10 VGAELDA-predicted lncRNAs associated with breast cancer and colon cancer are listed in Additional files 4 and 5. Additional file 4 demonstrates that 8 of the top 10 VGAELDA-predicted lncRNAs associated with breast cancer have been confirmed by previous literature. Additional file 5 demonstrates that 9 of the top 10 VGAELDA-predicted lncRNAs associated with colon cancer have been confirmed.
Breast cancer is the most commonly diagnosed cancer and the main threat to health among females worldwide [45]. VGAELDA has been applied to predict potential lncRNAs related to breast cancer. For instance, DNM3OS downregulates Vitamin D receptor (VDR), and VDR is capable of upregulating Suppressor of fused gene (SuFu), while SuFu is an inhibitor of progression of breast cancer [46]. CCAT1 promotes proliferation and migration of triple-negative breast cancer cells via downregulating miRNA miR-218 and activating the expression of protein ZFX [47]. BANCR is significantly correlated to the growth of breast cancer cells [48].
Colon cancer is a major malignant cancer of the digestive system [45]. Among the top 10 lncRNAs predicted by VGAELDA, UCA1 facilitates the progression of colon cancer through upregulating miRNA miR-28-5p and HOXB3 [49]. It is found that GAS5 is positively correlated to colon cancer as well [50]. Also, previous research suggests that PVT1 can sponge miRNA miR-26b and promote proliferation and metastasis of colon cancer [51].
Besides, we listed the predictions of potential lncRNA-disease associations with respect to all diseases of Dataset1 and Dataset2 in Additional files 6 and 7, respectively.
Discussion
Previous methods for predicting lncRNA-disease associations modeled dependency relationships from features based on handcrafted measurements of similarity, then propagated labels of samples on the graph constructed via feature similarities. However, it is difficult for those measurements to capture similarities among high-dimensional features directly. Hence, the hyperparameters in these measurements would significantly affect the performance of prediction, which decreases the precision of label propagation.
To address this issue, VGAELDA designed a representation learning framework that fuses the feature inference network and the label propagation network, to solve the graph semi-supervised learning Problem 1 (see Methods). Our Assumption 1 (see Methods) clarifies the capability of an autoencoder to obtain a low-rank solution. Based on Assumption 1, an autoencoder with manifold loss, as defined in Definition 1 (see Methods), is able to obtain the optimal solution of the geometric matrix completion problem. Considering the manifold constraint and low-rank constraint that the lncRNA-disease association matrix should satisfy, we adopted VGAE to implement the feature inference network GNNq, and GAE to implement the label propagation network GNNp. With the alternate training via the variational EM algorithm, two GAEs with a manifold loss that measures the smoothness of the manifold would significantly strengthen the robustness and precision of label propagation through the representations learned by VGAE. Hence the feature similarities, i.e. the topological relationship of the graph, only need to be estimated roughly. The experiments demonstrate that VGAELDA outperforms various kinds of matrix completion based or manifold regularization based methods.
Furthermore, VGAELDA provides an efficient way to integrate information from lncRNA space and disease space. By applying the co-training loss defined in Definition 2 (see Methods), information from lncRNA space and disease space is captured collaboratively. Finally, the association matrix \(F_l\) computed from lncRNA space and \(F_d\) computed from disease space can be integrated simply, since Assumption 1 suggests that both \(F_l\) and \(F_d\) follow the low-rank property.
Conclusion
The prediction of potential lncRNA-disease associations is of great importance to disease prognosis, diagnosis and treatment. In this paper, we proposed a deep learning model, VGAELDA, which integrates variational inference and graph autoencoders to detect potential lncRNA-disease associations. VGAELDA designed a representation learning framework to fuse the feature inference network and the label propagation network. Specifically, VGAELDA adopts variational graph autoencoder GNNq for feature inference, and graph autoencoder GNNp for label propagation. These two graph autoencoders are trained alternately in an end-to-end manner via the variational EM algorithm. This has significantly improved the efficiency of feature representation learning and label propagation. Further discussion demonstrates the validity of VGAELDA to find an optimal solution to the geometric matrix completion problem, and to integrate information from both lncRNA space and disease space. Experiments illustrate that VGAELDA is superior to the current state-of-the-art prediction methods, and case studies indicate that VGAELDA is competent in detecting potential lncRNA-disease associations. The results of evaluation demonstrate that VGAELDA is able to capture efficient low-dimensional representations from high-dimensional features of both lncRNAs and diseases, and predict unknown lncRNA-disease associations robustly and precisely.
Compared to previous lncRNA-disease association prediction methods, VGAELDA adopts an end-to-end framework based on variational inference in graph neural networks. VGAELDA is a data-driven end-to-end deep learning approach with high flexibility. Therefore, VGAELDA can serve as a general model for graph semi-supervised learning and association prediction tasks for other biological entities.
Methods
Problem formulation
Suppose the numbers of lncRNAs and diseases are m and n respectively, and \(Y_{m\times n}\) denotes the association matrix. \(Y_{ij}=1\) if the association between lncRNA i and disease j is known, otherwise \(Y_{ij}=0\). An algorithm predicting lncRNA-disease associations requires Y and the corresponding feature matrix X as input, then outputs a score for each pair of lncRNA and disease. F denotes the score matrix, \(F_{ij}\in [0,1]\), i.e. the prediction result.
From the viewpoint of machine learning, an lncRNA-disease pair is labeled if it has been proved to be associated. Usually, only a few samples are labeled in an lncRNA-disease dataset, and the large remaining number of associations needs to be detected. Therefore, the prediction of lncRNA-disease associations can be viewed as propagating labels from a few labeled pairs to many unlabeled ones, which is classified as semi-supervised learning.
Variational inference for graph semi-supervised learning
Graph semi-supervised learning
Semi-supervised learning is based on the manifold assumption [52]. The manifold assumption states that samples are distributed on a manifold, and that samples with higher feature similarities are closer on the manifold and tend to share the same labels. The manifold of data can be depicted by a graph structure constructed from the feature matrix, which leads to graph semi-supervised learning. This type of method first computes an adjacency matrix from features to construct a graph, then propagates labels from labeled samples to unlabeled ones on this graph iteratively [53, 54].
Suppose L denotes the normalized Laplacian matrix of the graph; minimizing \(\mathrm {trace}(F^TLF)\) yields a label matrix F that follows the manifold assumption [52, 55]. Belkin et al. [7] added this manifold constraint to the least square problem, and derived the Laplacian regularized least square (LRLS) method
$$\begin{aligned} \min _F\Vert F-Y\Vert _F^2+\eta \,\mathrm {trace}(F^TLF), \end{aligned}$$(7)
where \(\Vert \cdot \Vert _F\) denotes the Frobenius norm of a matrix, and \(\eta\) is a hyperparameter. Eq. (7) is a trade-off between the accuracy based on labeled data and the smoothness of the manifold. This is classified as manifold regularization [7]. Label propagation follows the framework of manifold regularization as in Eq. (7) [53, 54]. Xia et al. [9] derived that the association matrix F follows the manifold assumption, and can be obtained via solving Eq. (7).
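To make the trade-off concrete, assume the LRLS objective takes the standard form \(\Vert F-Y\Vert _F^2+\eta \,\mathrm {trace}(F^TLF)\) with a symmetric L. Setting its gradient \(2(F-Y)+2\eta LF\) to zero gives the closed form \(F=(I+\eta L)^{-1}Y\). The sketch below evaluates this on a toy path graph; the helper names and the toy data are illustrative.

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix A."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1e-12)), 0.0)
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def lrls(Y, L, eta):
    """Minimizer of ||F - Y||_F^2 + eta * trace(F^T L F): solve (I + eta L) F = Y."""
    return np.linalg.solve(np.eye(L.shape[0]) + eta * L, Y)

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)   # toy path graph 0-1-2
Y = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])             # node 1 is unlabeled
L = normalized_laplacian(A)
F = lrls(Y, L, eta=1.0)
```

In this toy case the unlabeled middle node receives positive scores in both columns, which is the label propagation effect; with \(\eta =0\) the solution reduces to F = Y.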
Graph Markov neural networks
The motivation of VGAELDA begins with graph semi-supervised learning from a probabilistic perspective. From this perspective, label propagation can be viewed as maximizing \(p(y_u|y_l,x_v)\) [56], where \(y_u\) and \(y_l\) denote labels of unlabeled and labeled nodes respectively, and \(x_v\) denotes attributes of objects on the graph. As the number of \(y_u\) is often much larger than that of \(y_l\), it is difficult to maximize \(p(y_u|y_l,x_v)\). Qu et al. [36] proposed Graph Markov Neural Networks (GMNN), suggesting that variational inference for graph semi-supervised learning leads to Problem 1.
Problem 1
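Variational inference for graph semi-supervised learning adopts the variational distribution \(q(y_u|x_v)\) to approximate \(p(y_u|y_l,x_v)\), which leads to optimizing the evidence lower bound (ELBO)
$$\begin{aligned} \log p(y_l|x_v)\ge {\mathbb {E}}_{q(y_u|x_v)}\left[ \log p(y_l,y_u|x_v)-\log q(y_u|x_v)\right] . \end{aligned}$$(8)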
The remark on Problem 1 is in Additional file 8. Labeled and unlabeled samples are observations and latent variables in a conditional random field (CRF), and according to the Markov property of the CRF, the label of an unlabeled node is only related to its neighborhood. Hence, the label propagation procedure aggregates messages from the neighborhood, which is intrinsically related to graph neural networks [33].
GMNN adopted two GNNs, GNNq and GNNp, to depict \(q(y_u|x_v)\) and \(p(y_l,y_u|x_v)\) respectively, since GNNs have been successfully adopted in graph semi-supervised learning [33]. Problem 1 can be solved by the variational EM (expectation maximization) algorithm [57] (see Additional file 8). GNNq and GNNp are trained by the variational EM algorithm, which executes the following two steps alternately until convergence.

E-step: fix GNNp, and train GNNq by attributes of objects, to obtain the pseudo-labels,

M-step: fix GNNq, and input pseudo-labels into GNNp for training.
Geometric matrix completion
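Besides the manifold assumption, the association matrix also follows the low-rank assumption that it lies in a smaller subspace. This leads to the matrix completion problem [8]
$$\begin{aligned} \min _F\ \mathrm {rank}(F)\quad \mathrm {s.t.}\ {\mathcal {P}}_\Omega (F)={\mathcal {P}}_\Omega (Y), \end{aligned}$$(9)
where \(\Omega\) is the set of all known lncRNA-disease associations. The projection operator \({\mathcal {P}}_\Omega (\cdot ):{\mathbb {R}}^{m\times n}\rightarrow {\mathbb {R}}^{m\times n}\) of matrix M is defined as
$$\begin{aligned} \left[ {\mathcal {P}}_\Omega (M)\right] _{ij}={\left\{ \begin{array}{ll} M_{ij}, &{} (i,j)\in \Omega \\ 0, &{} \mathrm {otherwise}. \end{array}\right. } \end{aligned}$$(10)
Eq. (9) is an NP-hard and non-convex problem, thus it is usually relaxed as the following convex surrogate
$$\begin{aligned} \min _F\ \Vert F\Vert _*\quad \mathrm {s.t.}\ {\mathcal {P}}_\Omega (F)={\mathcal {P}}_\Omega (Y), \end{aligned}$$(11)
where \(\Vert \cdot \Vert _*\) denotes the nuclear norm, i.e. the sum of singular values of a matrix.
Geometric matrix completion [19, 20] incorporates the manifold constraint \(\mathrm {trace}(F^TLF)\) into the low-rank constraint, that is to solve
$$\begin{aligned} \min _F\ \Vert F\Vert _*+\eta \,\mathrm {trace}(F^TLF)\quad \mathrm {s.t.}\ {\mathcal {P}}_\Omega (F)={\mathcal {P}}_\Omega (Y). \end{aligned}$$(12)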
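The two ingredients of matrix completion named in this section, the projection operator \({\mathcal {P}}_\Omega\) and the nuclear norm, can be illustrated numerically. This is a minimal NumPy sketch with hypothetical helper names; it only evaluates the quantities and does not solve the completion problem.

```python
import numpy as np

def project(M, omega_mask):
    """P_Omega(M): keep entries at observed positions, set all others to zero."""
    return np.where(omega_mask, M, 0.0)

def nuclear_norm(M):
    """Sum of singular values, the convex surrogate for rank."""
    return np.linalg.svd(M, compute_uv=False).sum()

M = np.outer([1.0, 2.0, 3.0], [1.0, 1.0])        # toy rank-1 matrix
mask = np.array([[True, False],                   # toy set Omega of observed entries
                 [True, True],
                 [False, True]])
observed = project(M, mask)
```

For a rank-1 matrix the nuclear norm equals its single non-zero singular value, here \(\sqrt{14}\cdot \sqrt{2}\).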
VGAELDA
Method overview
We proposed our model, VGAELDA, which designed a representation learning framework to fuse the feature inference network and the label propagation network, and which is trained through the variational EM algorithm of GMNN [36], which integrates variational inference and GNN. VGAELDA executes the following two steps alternately until convergence.

E-step (feature inference): fix GNNp, and train GNNq by high-dimensional features, to obtain low-dimensional representations,

M-step (label propagation): fix GNNq, and input the lncRNA-disease association matrix into GNNp for training.
In VGAELDA, the feature inference network GNNq is a variational graph autoencoder (VGAE) [37], and the label propagation network GNNp is a graph autoencoder (GAE) [37]. Assumption 1 and Definition 1 suggest that the application of these two autoencoders solves the geometric matrix completion problem Eq. (12), for capturing efficient low-dimensional representations via VGAELDA. Furthermore, VGAELDA adopts co-training [58], which integrates information from lncRNA space and disease space. The framework of our model is shown in Fig. 4.
Implementing graph autoencoders
Each layer of a graph autoencoder is a graph convolutional layer. The formula of the l-th \((l>0)\) graph convolutional layer [33] is
$$\begin{aligned} H^{(l)}=\rho \left( {\tilde{D}}^{-\frac{1}{2}}{\tilde{A}}{\tilde{D}}^{-\frac{1}{2}}H^{(l-1)}\Theta ^{(l)}\right) , \end{aligned}$$(13)
where \({\tilde{A}}\) is the adjacency matrix with self-loops, i.e. \(\tilde{A}=A+I\). \({\tilde{D}}\) is a diagonal matrix called the degree matrix, \({\tilde{D}}_{ii}=\sum _j{\tilde{A}}_{ij}\), \(\rho (\cdot )\) denotes a nonlinear activation function, \(\Theta ^{(l)}\) denotes the weight of the l-th layer of the network, and \(H^{(0)}\) is the initial input feature matrix.
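A single graph convolutional layer in the form of Kipf and Welling [33], \(\rho ({\tilde{D}}^{-1/2}{\tilde{A}}{\tilde{D}}^{-1/2}H\Theta )\), can be sketched in a few lines of NumPy. The function name and the choice of ReLU as \(\rho\) are assumptions for illustration; the source does not state which activation is used in each layer.

```python
import numpy as np

def gcn_layer(A, H, Theta, activation=lambda x: np.maximum(x, 0.0)):
    """One graph convolution: rho(D~^{-1/2} A~ D~^{-1/2} H Theta)."""
    A_tilde = A + np.eye(A.shape[0])           # adjacency with self-loops
    d = A_tilde.sum(axis=1)                    # degrees, always >= 1 for A >= 0
    d_inv_sqrt = 1.0 / np.sqrt(d)
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]   # symmetric normalization
    return activation(A_hat @ H @ Theta)

A = np.array([[0.0, 1.0], [1.0, 0.0]])         # two connected nodes
H0 = np.array([[1.0], [3.0]])                  # one input feature per node
Theta = np.array([[1.0]])                      # identity weight, for readability
H1 = gcn_layer(A, H0, Theta)
```

With two connected nodes and an identity weight, each node ends up with the average of the two input features, which shows the neighborhood aggregation performed by the layer.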
Assumption 1
Autoencoder GNNp with Y as input and F as output can obtain the optimal solution of Eq. (11).
Definition 1
(manifold loss) Suppose Z and \(Z'\) are representations of autoencoder GNNq and GNNp, respectively, then, to optimize manifold constraint \(\mathrm {trace}(F^TLF)\) can be viewed as optimizing the following manifold loss
Remarks on Assumption 1 and Definition 1 are in Additional file 8. From the viewpoint of the alternating direction method of multipliers (ADMM) [59], solving the geometric matrix completion problem Eq. (12) can be viewed as optimizing Eq. (7) and Eq. (11) alternately. Therefore, autoencoder GNNp with the addition of the manifold loss defined in Definition 1 obtains the solution of Eq. (12).
However, to enhance the efficiency of adding the manifold loss Eq. (14), we implemented a variational graph autoencoder as GNNq to capture representation Z. Suppose the feature matrix of the graph is X; the encoder learns mean \(\mu\) and standard deviation \(\sigma\). The representation Z can be computed by applying the reparameterization trick [60], which means
$$\begin{aligned} Z=\mu +\sigma \epsilon , \end{aligned}$$(15)
where \(\epsilon\) is sampled from the standard Gaussian distribution and the product is taken element-wise. Then, the decoder reconstructs a feature matrix \(X'\).
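The reparameterization trick [60] moves the randomness into an auxiliary noise variable so that Z remains a differentiable function of \(\mu\) and \(\sigma\). A minimal NumPy sketch, with an illustrative function name and toy shapes:

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Z = mu + sigma * eps with eps ~ N(0, I), applied element-wise."""
    eps = rng.standard_normal(mu.shape)        # noise independent of mu and sigma
    return mu + sigma * eps

rng = np.random.default_rng(0)                 # fixed seed so the sketch is repeatable
mu = np.zeros((4, 3))                          # toy means for 4 nodes, 3 latent dims
sigma = np.ones((4, 3))                        # toy standard deviations
Z = reparameterize(mu, sigma, rng)
```

Setting sigma to zero makes the sample collapse onto the mean, which is a quick sanity check of the formula.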
The adjacency matrix of graph G can be constructed simply as follows. Firstly, sort the Euclidean distances among the feature vectors of nodes. Secondly, for each node i, select the 10 nearest nodes except itself. Thirdly, suppose the set of these nodes for node i is \({\mathcal {N}}(i)\); matrix C satisfies \(C_{ij}=1\) if \(j\in {\mathcal {N}}(i)\), otherwise \(C_{ij}=0\). The adjacency matrix with self-loops of the constructed graph G is
where \(\odot\) denotes Hadamard product.
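The three construction steps can be sketched in NumPy. How C is combined into the final adjacency matrix is an assumption here: the sketch takes the Hadamard product of C with its transpose, so that an edge is kept only when two nodes are among each other's nearest neighbors, and then adds self-loops, i.e. \({\tilde{A}}=C\odot C^T+I\). The function name is illustrative.

```python
import numpy as np

def knn_adjacency(X, k=10):
    """Mutual k-nearest-neighbor graph with self-loops (assumes k < number of nodes)."""
    n = X.shape[0]
    # Pairwise Euclidean distances between feature vectors.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)             # a node is never its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest nodes per row
    C = np.zeros((n, n))
    C[np.arange(n)[:, None], nearest] = 1.0    # C[i, j] = 1 if j is in N(i)
    return C * C.T + np.eye(n)                 # element-wise product, then self-loops

X = np.array([[0.0], [1.0], [3.0], [10.0]])    # four toy nodes on a line
A_tilde = knn_adjacency(X, k=1)
```

In the toy example only nodes 0 and 1 select each other, so they form the single off-diagonal edge. The result is symmetric by construction, which the symmetric normalization of a graph convolutional layer expects.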
Network structures of GNNq and GNNp are shown in Additional file 9. GNNp is a basic GAE that takes the initial label matrix Y as input; the dimension of its hidden vector is 256, the output of its hidden layer is \(Z'\), and the output of its decoder is the prediction F. GNNq is a VGAE in which each layer of the variational autoencoder [60] is a graph convolutional layer, and the dimension of the output vectors of each hidden layer in GNNq is 256.
Variational EM algorithm
The variational EM algorithm is implemented through minimizing the losses of GNNq and GNNp alternately. Similar to other variational graph autoencoders, the loss function of GNNq is the sum of the reconstruction error \(L_{qr}\) and the KL divergence \(L_{KL}\),
$$\begin{aligned} L_q=L_{qr}+L_{KL}. \end{aligned}$$(17)
Kingma and Welling [60] derived that in a variational autoencoder:

If the features follow Gaussian distribution, the reconstruction error is mean square error.
$$\begin{aligned} L_{qr}=\frac{1}{2}\Vert X-X'\Vert _F^2, \end{aligned}$$(18) 
If the features follow Bernoulli distribution, the reconstruction error is cross entropy loss.
$$\begin{aligned} L_{qr}=-\sum _{i,j}X_{ij}\log X'_{ij}. \end{aligned}$$(19) 
KL divergence loss can be computed through
$$\begin{aligned} L_{KL}=-\sum _{i,j}\frac{1}{2}(1+2\log \sigma _{ij}-\mu _{ij}^2-\sigma _{ij}^2). \end{aligned}$$(20)
In VGAELDA, the features of lncRNAs are computed from sequences by Word2Vec [39], and features of diseases are computed through associations with disease-related genes. Thus, lncRNA features follow Gaussian distribution, and disease features follow Bernoulli distribution. Therefore, \(L_{qr}\) in GNNql and GNNqd are computed by Eq. (18) and Eq. (19), respectively.
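The three loss terms of Eqs. (18)-(20) can be written out with their minus signs made explicit, as in the NumPy sketch below. The function names and the small constant guarding the logarithm are illustrative additions, not part of the source.

```python
import numpy as np

def gaussian_reconstruction(X, X_rec):
    """Eq. (18): 0.5 * ||X - X'||_F^2, used for real-valued (Gaussian) features."""
    return 0.5 * np.sum((X - X_rec) ** 2)

def bernoulli_reconstruction(X, X_rec, tiny=1e-12):
    """Eq. (19): -sum X * log X', used for binary features; tiny avoids log(0)."""
    return -np.sum(X * np.log(X_rec + tiny))

def kl_loss(mu, sigma):
    """Eq. (20): KL(N(mu, sigma^2) || N(0, 1)) summed over all entries."""
    return -0.5 * np.sum(1 + 2 * np.log(sigma) - mu ** 2 - sigma ** 2)

mse = gaussian_reconstruction(np.array([[1.0, 2.0]]), np.array([[0.0, 0.0]]))
ce = bernoulli_reconstruction(np.array([[1.0, 0.0]]), np.array([[0.5, 0.5]]))
kl_zero = kl_loss(np.zeros((2, 2)), np.ones((2, 2)))
```

The KL term vanishes exactly when the encoder outputs \(\mu =0\) and \(\sigma =1\), i.e. when the approximate posterior equals the standard Gaussian prior.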
The outputs of encoder and decoder are scaled into (0,1) through applying sigmoid activation function. Meanwhile, following Eq. (7) , the loss function of GNNp is the sum of reconstruction error and manifold loss.
The reconstruction error of GNNp is the cross entropy between prediction and true label
Then, F is obtained after adopting the variational EM algorithm to train GNNq and GNNp alternately until convergence, and is finally scaled into the interval [0, 1] by
$$\begin{aligned} F\leftarrow \frac{F-F_{min}}{F_{max}-F_{min}}, \end{aligned}$$(23)
where \(F_{min}\) and \(F_{max}\) denote the minimum and maximum elements in matrix F.
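The final rescaling is an ordinary min-max normalization over all entries of the score matrix. A minimal sketch, with an illustrative function name:

```python
import numpy as np

def min_max_scale(F):
    """Map the entries of F linearly onto [0, 1] (assumes F is not constant)."""
    f_min, f_max = F.min(), F.max()
    return (F - f_min) / (f_max - f_min)

F_raw = np.array([[2.0, 4.0], [6.0, 10.0]])    # toy score matrix
F_scaled = min_max_scale(F_raw)
```

Because the map is monotone, it changes the score values but not the ranking of lncRNA-disease pairs.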
Integrating information from lncRNA space and disease space
As shown in Fig. 4, the constructed lncRNA graph \(G_l\) and disease graph \(G_d\) are different. Eq. (17) and Eq. (21) can compute losses from \(G_l\) and \(G_d\) respectively, but it is important to integrate the information captured from lncRNA space and disease space. Therefore, we adopt co-training [58] to train GNNql and GNNqd collaboratively.
Definition 2
(co-training loss) Suppose \(Z_l\) and \(Z_d\) are the representations learned from the lncRNA space and the disease space, respectively; then the co-training loss
can measure the performance of co-training.
The remark on Definition 2 is in Additional file 8. Then GNNql and GNNqd are trained simultaneously by optimizing the total loss of GNNq
where \(L_{ql}\) and \(L_{qd}\) denote the losses of GNNql and GNNqd computed through Eq. (17), respectively, and \(\alpha \in (0,1)\) is the weight parameter that balances the information captured from the lncRNA space and the disease space. Similarly, the total loss of GNNp is
where \(L_{pl}\) and \(L_{pd}\) denote the losses of GNNpl and GNNpd computed through Eq. (21), respectively. Then, the variational EM algorithm is implemented by optimizing \({\mathcal {L}}_q\) and \({\mathcal {L}}_p\) alternately. After the training procedure, GNNpl outputs \(F_l\) while GNNpd outputs \(F_d\). Since both \(F_l\in {\mathbb {R}}^{m\times n}\) and \(F_d\in {\mathbb {R}}^{n\times m}\) are low-rank, as provided by the autoencoders, and through the rank-sum inequality \(\mathrm {rank}(A+B)\le \mathrm {rank}(A)+\mathrm {rank}(B)\) for matrices A and B of equal size,
the final result
is low-rank.
The procedure of VGAELDA is summarized in Algorithm 1, where \(X',Z\leftarrow \mathrm {GNN}(G,X)\) summarizes the computing procedure of a GAE.
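The alternation at the core of the procedure can be sketched as a plain training loop. This is a structural sketch only, under stated assumptions: `step_q` and `step_p` are hypothetical callables that each perform one optimization step on the total loss of GNNq or GNNp and return its value, the other network is assumed to be held fixed during each step, and a fixed epoch budget stands in for a convergence test.

```python
def train_alternately(step_q, step_p, num_epochs=500):
    """Alternate one GNNq update and one GNNp update per epoch.

    step_q, step_p: hypothetical callables taking the 1-based epoch index
    and returning the loss value after one optimization step.
    """
    history = []
    for epoch in range(1, num_epochs + 1):
        loss_q = step_q(epoch)  # update GNNq (GNNp assumed fixed)
        loss_p = step_p(epoch)  # update GNNp (GNNq assumed fixed)
        history.append((loss_q, loss_p))
    return history
```

Passing the epoch index to both steps lets each callable apply epoch-dependent loss weights without the loop needing to know about them.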
Hyperparameters tuning
In VGAELDA, there are three hyperparameters, \(\alpha ,\beta\) and \(\gamma\), that need to be tuned. Hyperparameter \(\alpha\) controls the balance between the lncRNA space and the disease space. However, after evaluating our model at each \(\alpha \in \{0.1,0.3,0.5,0.7,0.9\}\), we found that VGAELDA is robust to the choice of \(\alpha\); the results are shown in Additional file 10. Hence we simply set \(\alpha =0.5\).
Since the manifold loss \(L_m\) and the co-training loss \(L_c\) depend on the representations computed by GNNql and GNNqd, the effectiveness of the manifold constraint and the co-training constraint depends on how well GNNq captures representations. Hence, we let the hyperparameters \(\beta\) in Eq. (25) and \(\gamma\) in Eq. (21) increase as training proceeds, to enhance the robustness of representation learning and the convergence of the EM algorithm. We set \(\beta =\gamma =e/e_n\) at the \(e\)-th epoch, where \(e_n=500\) denotes the number of epochs.
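The linear schedule for \(\beta\) and \(\gamma\) is simple to state in code. The sketch below assumes epochs are counted from 1, so the weight rises from \(1/e_n\) to 1; the text does not say whether counting starts at 0 or 1, and the function name `constraint_weight` is hypothetical.

```python
def constraint_weight(epoch, num_epochs=500):
    """Return beta = gamma = e / e_n for the given epoch.

    Assumption: epoch is 1-based, so the final epoch gives a weight of 1.
    """
    if not 1 <= epoch <= num_epochs:
        raise ValueError("epoch must lie in [1, num_epochs]")
    return epoch / num_epochs
```

Early in training the constraints therefore contribute little, and their influence grows as the representations learned by GNNq become more reliable.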
We adopted PyTorch [61] (https://pytorch.org/) to construct VGAELDA and applied the Adam optimizer [62] with a learning rate of 0.01 and a weight decay of \(10^{-5}\); dropout was set to 0.5 [63]. Our model was trained on a single NVIDIA GeForce GTX 2070 GPU with 8GB memory. We evaluated the performance of VGAELDA with learning rates in {0.001, 0.01, 0.1, 1}, and the results are shown in Additional file 11. The figure shows that the best learning rate is 0.01.
Moreover, we evaluated our model at different dimensions of the hidden vectors, and the results are shown in Additional file 12. The figure shows that the performance of our model improves as the hidden vector dimension increases. However, when the dimension is more than 256, there is little further gain and the performance remains stable. Hence, we set the hidden vector dimension to 256 to save the time and space cost of our model.
Besides, we also evaluated our model at different dimensions of the lncRNA embedding vectors produced by Word2Vec, and the results are shown in Additional file 13. The figure shows that a larger dimension of lncRNA embedding vectors tends to perform better. However, when the dimension is more than 150, there is little further gain and the performance remains stable. Hence, we simply set the dimension of lncRNA embedding vectors to 300.
Availability of data and materials
All the data used in our paper were collected from the following public datasets. Dataset1 can be downloaded from https://github.com/USTCHIlab/TPGLDA. Dataset2 can be downloaded from http://mlda.swu.edu.cn/codes.php?name=MFLDA. Both of them were collected from the LncRNADisease Database (http://www.cuilab.cn/lncrnadisease). In VGAELDA, the information on lncRNA sequences was downloaded from the Nucleotide Database of NCBI (https://www.ncbi.nlm.nih.gov/nuccore), and the information on diseases was downloaded from DisGeNET (https://www.disgenet.org/home/) and Disease Ontology (https://disease-ontology.org/). The source code is available at https://github.com/zhanglabNKU/VGAELDA.
Abbreviations
Acc: Accuracy
AUPR: Area under precision-recall curve
AUROC: Area under ROC curve
EM: Expectation maximization
FN: False negative
FP: False positive
GAE: Graph autoencoder
GCN: Graph convolutional networks
GNN: Graph neural networks
Mcc: Matthews correlation coefficient
LDA: lncRNA-disease association
LncRNA: Long non-coding RNA
LOOCV: Leave-one-out cross validation
LRLS: Laplacian regularized least square method
Pre: Precision
ROC curve: Receiver operating characteristic (ROC) curve
Sn: Sensitivity
Sp: Specificity
TN: True negative
TP: True positive
VGAE: Variational graph autoencoder
References
1. Wapinski O, Chang HY. Long noncoding RNAs and human disease. Trends Cell Biol. 2011;21(6):354–61.
2. Jalali S, Kapoor S, Sivadas A, Bhartiya D, Scaria V. Computational approaches towards understanding human long noncoding RNA biology. Bioinformatics. 2015;31(14):2241–51.
3. Chen X, Yan CC, Zhang X, You ZH. Long noncoding RNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2016;18(4):558–76.
4. Sang Y, Tang J, Li S, Li L, Tang XF, Cheng C, Luo Y, Qian X, Deng LM, Liu L, Lv XB. LncRNA PANDAR regulates the g1/s transition of breast cancer cells by suppressing p16(INK4A) expression. Sci Rep. 2016;6:22366.
5. Sun M, Xia R, Jin F, Xu T, Liu Z, De W, Liu X. Downregulated long noncoding RNA meg3 is associated with poor prognosis and promotes cell proliferation in gastric cancer. Tumor Biol. 2014;35:1065–73.
6. Faghihi MA, Modarresi F, Khalil AM, Wood DE, Sahagan BG, Morgan TE, Finch CE, St. Laurent III G, Kenny PJ, Wahlestedt C. Expression of a noncoding RNA is elevated in Alzheimer's disease and drives rapid feed-forward regulation of beta-secretase. Nat Med. 2008;14(7):723–30.
7. Belkin M, Niyogi P, Sindhwani V. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res. 2006;7(1):2399–434.
8. Candès E, Recht B. Exact matrix completion via convex optimization. Found Comput Math. 2009;9(6):717.
9. Xia Z, Wu LY, Zhou X, Wong STC. Semi-supervised drug-protein interaction prediction from heterogeneous biological spaces. BMC Syst Biol. 2010;4(Suppl 2):6.
10. You ZH, Lei YK, Gui J, Huang DS, Zhou X. Using manifold embedding for assessing and predicting protein interactions from high-throughput experimental data. Bioinformatics. 2010;26(21):2744–51.
11. Xiao Q, Luo J, Liang C, Cai J, Ding P. A graph regularized non-negative matrix factorization method for identifying microRNA-disease associations. Bioinformatics. 2018;34(2):239–48.
12. Chen X, Yan GY. Novel human lncRNA-disease association inference based on lncRNA expression profiles. Bioinformatics. 2013;29(20):2617–24.
13. Chen X, Yan CC, Luo C, Ji W, Zhang Y, Dai Q. Constructing lncRNA functional similarity network based on lncRNA-disease associations and disease semantic similarity. Sci Rep. 2015;5(1):11338.
14. Xie G, Meng T, Luo Y, Liu Z. SKFLDA: similarity kernel fusion for predicting lncRNA-disease association. Mol Ther Nucl Acids. 2019;18(6):45–55.
15. Natarajan N, Dhillon IS. Inductive matrix completion for predicting gene-disease associations. Bioinformatics. 2014;30(12):60–8.
16. Chen X, Wang L, Qu J, Guan NN, Li JQ. Predicting miRNA-disease association based on inductive matrix completion. Bioinformatics. 2018;34(24):4256–65.
17. Li J, Zhang S, Liu T, Ning C, Zhang Z, Zhou W. Neural inductive matrix completion with graph convolutional networks for miRNA-disease association prediction. Bioinformatics. 2020;36(8):2538–46.
18. Lu C, Yang M, Luo F, Wu FX, Li M, Pan Y, Li Y, Wang J. Prediction of lncRNA-disease associations based on inductive matrix completion. Bioinformatics. 2018;34(19):3357–64.
19. Kalofolias V, Bresson X, Bronstein MM, Vandergheynst P. Matrix completion on graphs. arXiv preprint. 2014. arXiv:1408.1717
20. Monti F, Bronstein M, Bresson X. Geometric matrix completion with recurrent multigraph neural networks. Adv Neural Inf Process Syst. 2017;30:3697–707.
21. Lu C, Yang M, Li M, Li Y, Wu F, Wang J. Predicting human lncRNA-disease associations based on geometric matrix completion. IEEE J Biomed Health. 2018;24(8):2420–9.
22. Wang L, You ZH, Huang YA, Huang DS, Chan KCC. An efficient approach based on multi-sources information to predict circRNA-disease associations using deep convolutional neural network. Bioinformatics. 2019;36(13):4038–46.
23. Xiao Q, Zhang N, Luo J, Dai J, Tang X. Adaptive multi-source multi-view latent feature learning for inferring potential disease-associated miRNAs. Brief Bioinform. 2020.
24. Lan W, Li M, Zhao K, Liu J, Wu FX, Pan Y, Wang J. LDAP: a web server for lncRNA-disease association prediction. Bioinformatics. 2016;33(3):458–60.
25. Fu G, Wang J, Domeniconi C, Yu G. Matrix factorization-based data fusion for the prediction of lncRNA-disease associations. Bioinformatics. 2017;34(9):1529–37.
26. Ding L, Wang M, Sun D, Li A. TPGLDA: novel prediction of associations between lncRNAs and diseases via lncRNA-disease-gene tripartite graph. Sci Rep. 2018;8(1):1065.
27. Yao D, Zhan X, Zhan X, Kwoh CK, Li P, Wang J. A random forest based computational model for predicting novel lncRNA-disease associations. BMC Bioinform. 2020;21:126.
28. Chen X, Li TH, Zhao Y, Wang CC, Zhu CC. Deep-belief network for predicting potential miRNA-disease associations. Brief Bioinform. 2020.
29. Xuan P, Cao Y, Zhang T, Kong R, Zhang Z. Dual convolutional neural networks with attention mechanisms based method for predicting disease-related lncRNA genes. Front Genet. 2019;10:416.
30. Sheng N, Cui H, Zhang T, Xuan P. Attentional multilevel representation encoding based on convolutional and variance autoencoders for lncRNA-disease association prediction. Brief Bioinform. 2020;1–14.
31. Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G. The graph neural network model. IEEE Trans Neural Netw. 2009;20(1):61–80.
32. Xuan P, Pan S, Zhang T, Liu Y, Sun H. Graph convolutional network and convolutional neural network based method for predicting lncRNA-disease associations. Cells. 2019;8(9):1012.
33. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. In: Proceedings of the international conference on learning representations (ICLR); 2017.
34. Berg R, Kipf T, Welling M. Graph convolutional matrix completion. In: Proceedings of KDD; 2018.
35. Wu X, Lan W, Chen Q, Dong Y, Liu J, Peng W. Inferring lncRNA-disease associations based on graph autoencoder matrix completion. Comput Biol Chem. 2020;87:107282.
36. Qu M, Bengio Y, Tang J. GMNN: graph Markov neural networks. Proc Mach Learn Res. 2019;97:5241–50.
37. Kipf TN, Welling M. Variational graph autoencoders. In: NeurIPS Workshop on Bayesian Deep Learning; 2016.
38. Chen G, Wang Z, Wang D, Qiu C, Liu M, Chen X, Zhang Q, Yan G, Cui Q. LncRNADisease: a database for long-non-coding RNA-associated diseases. Nucleic Acids Res. 2012;41(D1):983–6.
39. Le Q, Mikolov T. Distributed representations of sentences and documents. Proc Mach Learn Res. 2014;32:1188–96.
40. Asgari E, Mofrad MRK. Protvec: a continuous distributed representation of biological sequences. PLoS ONE. 2015;10(11):0141287.
41. Piñero J, Bravo A, Queralt-Rosinach N, Gutiérrez-Sacristán A, Deu-Pons J, Centeno E, García-García J, Sanz F, Furlong LI. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. 2016;45(D1):833–9.
42. Schriml LM, Mitraka E, Munro J, Tauber B, Schor M, Nickle L, Felix V, Jeng L, Bearer C, Lichenstein R, Bisordi K, Campion N, Hyman B, Kurland D, Oates CP, Kibbey S, Sreekumar P, Le C, Giglio M, Greene C. Human disease ontology 2018 update: classification, content and workflow expansion. Nucleic Acids Res. 2018;47(D1):955–62.
43. Xu M, Jin R, Zhou ZH. Speedup matrix completion with side information: application to multi-label learning. In: Advances in neural information processing systems; 2013. p. 2301–2309.
44. Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation. BMC Genom. 2020;21:6.
45. Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2018;68(6):394–424.
46. Alimirah F, Peng X, Gupta A, Yuan L, Welsh J, Cleary M, Mehta RG. Crosstalk between the vitamin d receptor (VDR) and miR-214 in regulating SuFu, a hedgehog pathway inhibitor in breast cancer cells. Exp Cell Res. 2016;349(1):15–22.
47. Han C, Li X, Fan Q, Liu G, Yin J. Ccat1 promotes triple-negative breast cancer progression by suppressing mir-218/zfx signaling. Aging (Albany NY). 2019;11(14):4858–75.
48. Lou KX, Li ZH, Wang P, Liu Z, Chen Y, Wang XL, Cui HX. Long noncoding RNA BANCR indicates poor prognosis for breast cancer and promotes cell proliferation and invasion. Eur Rev Med Pharmacol Sci. 2018;22(5):1358–65.
49. Cui M, Chen M, Shen Z, Wang R, Fang X, Song B. LncRNA-uca1 modulates progression of colon cancer through regulating the mir-28-5p/hoxb3 axis. J Cell Biochem. 2019;120(5):6926–36.
50. Poursheikhani A, Abbaszadegan MR, Nokhandani N, Kerachian MA. Integration analysis of long noncoding RNA (lncRNA) role in tumorigenesis of colon adenocarcinoma. BMC Med Genomics. 2020;13:108.
51. Zhang R, Li J, Yan X, Jin K, Li W, Liu X, Zhao J, Shang W, Liu Y. Long noncoding RNA plasmacytoma variant translocation 1 (pvt1) promotes colon cancer progression via endogenous sponging mir-26b. Med Sci Monitor. 2018;24:8685–92.
52. Belkin M, Niyogi P. Laplacian eigenmaps and spectral techniques for embedding and clustering. Adv Neural Inf Process Syst. 2002;15:585–91.
53. Zhou D, Bousquet O, Lal TN, Weston J, Schölkopf B. Learning with local and global consistency. Adv Neural Inf Process Syst. 2004;16:321–8.
54. Wang F, Zhang C. Label propagation through linear neighborhoods. IEEE Trans Knowl Data Eng. 2008;20(1):55–67.
55. Johnson R, Zhang T. On the effectiveness of Laplacian normalization for graph semi-supervised learning. J Mach Learn Res. 2007;8(53):1489–517.
56. Wang J, Shen HC, Wang F, Quan L, Zhang C. Linear neighborhood propagation and its applications. IEEE Trans Pattern Anal Mach Intell. 2009;31(9):1600–15.
57. Neal R, Hinton G. A view of the EM algorithm that justifies incremental, sparse, and other variants. Springer, Dordrecht; 1998. p. 355–368.
58. Blum A, Mitchell T. Combining labeled and unlabeled data with co-training. In: Proceedings of the annual conference on computational learning theory, vol. 11; 1998. p. 92–100.
59. Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn. 2011;3(1):1–122.
60. Kingma DP, Welling M. Auto-encoding variational Bayes. In: Proceedings of the international conference on learning representations (ICLR); 2014.
61. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S. Pytorch: an imperative style, high-performance deep learning library. In: Advances in neural information processing systems; 2019. p. 8026–8037.
62. Kingma DP, Ba JA. A method for stochastic optimization. In: Proceedings of the international conference on learning representations (ICLR); 2015.
63. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15:1929–58.
Acknowledgements
Not applicable.
Funding
This research was funded by the National Natural Science Foundation of China Grant No. 61973174.
Author information
Contributions
Han Zhang conceived the research. Zhuangwei Shi, Han Zhang, Chen Jin, Xiongwen Quan and Yanbin Yin designed the research. Zhuangwei Shi and Chen Jin implemented the research. Zhuangwei Shi, Han Zhang, Chen Jin and Yanbin Yin wrote the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Supplementary Information
Additional file 1. AUROC and AUPR values of VGAELDA in 5 times
Additional file 2. Binary classification metrics of different methods on Dataset1
Additional file 3. True positive samples at different cutoffs on Dataset1
Additional file 4. Case study for breast cancer on Dataset2
Additional file 5. Case study for colon cancer on Dataset2
Additional file 6. Predictions of potential lncRNA-disease association on Dataset1
Additional file 7. Predictions of potential lncRNA-disease association on Dataset2
Additional file 8. Remarks
Additional file 9. Network structures
Additional file 10. AUPR at different α
Additional file 11. AUPR at different learning rate
Additional file 12. AUPR at different dimension of hidden vectors
Additional file 13. AUPR at different dimension of embedding vectors of lncRNA
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Shi, Z., Zhang, H., Jin, C. et al. A representation learning model based on variational inference and graph autoencoder for predicting lncRNA-disease associations. BMC Bioinformatics 22, 136 (2021). https://doi.org/10.1186/s12859-021-04073-z