### Overview of SVDNVLDA

SVDNVLDA integrates a matrix decomposition method, SVD, with a network embedding method, node2vec, to obtain linear and nonlinear representations of lncRNA and disease entities, respectively. For each lncRNA and each disease, the two kinds of features were combined into an integrated feature vector that fuses the linear features from interaction information with the nonlinear features from network topology. These feature vectors served as the inputs of a machine learning classifier, which produced the final predictions (Fig. 7).

Step 1: Process the data and construct the lncRNA-disease association matrix and the lncRNA-miRNA-disease association network (LMDN). Step 2: Apply SVD to the association matrix to obtain linear features. Step 3: Apply *node2vec* to the LMDN to obtain nonlinear features. Step 4: Integrate the features. Step 5: Use an XGBoost classifier to predict associations.

### Data preprocessing

The study mainly used lncRNA-disease association data, lncRNA-miRNA association data and miRNA-disease association data. The experimentally confirmed lncRNA-disease associations were downloaded from LncRNADisease v2.0 [42] and Lnc2Cancer v3.0 [43]. All disease names were converted into standard MeSH disease terms, and duplicate records were collapsed so that only one copy of each was retained. To reduce the impact of errors in the downloaded data, lncRNAs with at most one association were removed. In the end, a total of 4518 associations between 861 lncRNAs and 253 diseases were obtained.

The known lncRNA-miRNA association data were downloaded from the ENCORI [44] and NPInter v4.0 [45] databases. After eliminating redundancy, only records involving lncRNAs shared with the lncRNA-disease data and miRNAs shared with the miRNA-disease data were retained. Finally, a total of 8172 lncRNA-miRNA associations were obtained, involving 338 lncRNAs and 285 miRNAs.

The miRNA-disease association data were obtained from the HMDD v3.2 database [46]. The original data include two types of records: miRNAs with a causal role in disease, and miRNAs that change passively during the course of disease. Studies of miRNAs causally related to diseases are more valuable for exploring pathogenesis and searching for new biomarkers, so only the records annotated as causal in HMDD were kept. All disease names were standardized against the MeSH glossary, and lncRNAs associated with only one disease were removed from the original data. Ultimately, 861 lncRNAs, 437 miRNAs and 431 diseases were involved in our experiment. A statistical overview of the resulting data, which also describes the LMDN, is given in Additional file 8.
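The filtering described above (deduplication, then removal of sparsely connected entities) can be sketched as follows; the toy records and the threshold of at least two associations are illustrative stand-ins for the downloaded data:

```python
from collections import Counter

def filter_associations(pairs, min_degree=2):
    """Deduplicate (lncRNA, disease) pairs and drop lncRNAs
    with fewer than `min_degree` remaining associations."""
    unique = sorted(set(pairs))                 # keep one copy of each record
    degree = Counter(lnc for lnc, _ in unique)  # associations per lncRNA
    return [(l, d) for l, d in unique if degree[l] >= min_degree]

# Toy records: the duplicate is collapsed, and 'lncA' (one association) is removed.
records = [("lncA", "D1"), ("lncB", "D1"), ("lncB", "D2"),
           ("lncB", "D2"), ("lncC", "D3"), ("lncC", "D1")]
kept = filter_associations(records)
# → [('lncB', 'D1'), ('lncB', 'D2'), ('lncC', 'D1'), ('lncC', 'D3')]
```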

### Construction of the lncRNA-disease association matrix and LMDN

Firstly, the lncRNA-disease association matrix was constructed. For lncRNA \(i\) and disease \(j\), if a known association exists in our collected data, the corresponding element of the association matrix \(R_{M \times N}\) is \(1\); otherwise, it is \(0\). Formally:

$$R_{M \times N}(i,j) = \left\{ \begin{array}{ll} 1, & \text{if } i \text{ and } j \text{ have an association} \\ 0, & \text{otherwise} \end{array} \right.$$

(1)

In our experiment, the real matrix \(R_{M \times N}\) had dimensions 861 × 437.
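Equation (1) amounts to filling a binary matrix from the association list; a minimal sketch, where the toy entity names and index maps are assumptions for illustration:

```python
def build_association_matrix(lncs, diseases, pairs):
    """Build the binary M x N matrix of Eq. (1): entry (i, j) is 1
    iff lncs[i] and diseases[j] have a recorded association."""
    li = {name: i for i, name in enumerate(lncs)}      # lncRNA -> row index
    dj = {name: j for j, name in enumerate(diseases)}  # disease -> column index
    R = [[0] * len(diseases) for _ in lncs]
    for lnc, dis in pairs:
        R[li[lnc]][dj[dis]] = 1
    return R

R = build_association_matrix(["lncB", "lncC"], ["D1", "D2", "D3"],
                             [("lncB", "D1"), ("lncB", "D2"), ("lncC", "D3")])
# → [[1, 1, 0], [0, 0, 1]]
```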

After constructing the association matrix, the lncRNA-disease association data were combined with the lncRNA-miRNA and miRNA-disease association data to build the lncRNA-miRNA-disease heterogeneous association network (LMDN). The LMDN contains three types of vertices, namely lncRNA, miRNA and disease; an edge connects two vertices if an association record exists between them, and vertices without such a record are unconnected. The resulting heterogeneous network is sparse, with 1769 nodes and 16,878 edges, as detailed in Additional file 8.
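Assembling the LMDN from the three association lists can be sketched as an undirected adjacency map; the type prefixes (`l:`, `m:`, `d:`) are a naming convention assumed here to keep the three vertex sets distinct:

```python
from collections import defaultdict

def build_lmdn(ld_pairs, lm_pairs, md_pairs):
    """Assemble the heterogeneous LMDN as an undirected adjacency map
    from lncRNA-disease, lncRNA-miRNA and miRNA-disease associations."""
    adj = defaultdict(set)
    edges = ([("l:" + l, "d:" + d) for l, d in ld_pairs]
             + [("l:" + l, "m:" + m) for l, m in lm_pairs]
             + [("m:" + m, "d:" + d) for m, d in md_pairs])
    for u, v in edges:          # each association record becomes one edge
        adj[u].add(v)
        adj[v].add(u)
    return adj

g = build_lmdn([("lncB", "D1")], [("lncB", "miR-21")], [("miR-21", "D1")])
# miR-21 is connected to both its lncRNA and its disease:
sorted(g["m:miR-21"])  # → ['d:D1', 'l:lncB']
```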

### Linear feature extraction based on singular value decomposition

SVD is a matrix decomposition method that has been widely used in recommender systems [47, 48]. In SVD, a matrix is decomposed into the product of three matrices:

$$R_{M \times N} = U_{M \times C} \cdot \Sigma_{C \times C} \cdot V_{C \times N}^{T}$$

(2)

In a typical SVD-based collaborative filtering recommender system, the initial matrix \(R\) is a rating matrix recording \(M\) users' ratings of \(N\) items. Among the resulting matrices, \(U\) represents the interest levels of the \(M\) users in \(C\) latent item features, i.e., the users' characteristics or item affinity; \(\Sigma\) is a non-negative diagonal matrix whose diagonal elements, arranged in descending order, represent the importance of each feature; and \(V^{T}\) represents the distribution of the \(C\) features over the \(N\) items [49].

Analogously, applying SVD to the lncRNA-disease association matrix \(R_{M \times N}\), the obtained matrices \(U\), \(\Sigma\) and \(V^{T}\) can be regarded as the lncRNA feature matrix, the feature weight matrix and the disease feature matrix, respectively. For dimensionality reduction, only the top \(k\) features with the largest values in \(\Sigma\) were retained, so that \(R\) can be approximated as:

$$R_{M \times N} \approx U_{M \times k} \cdot {\Sigma }_{k \times k} \cdot V_{k \times N}^{T}$$

(3)

In fact, the binary matrix \(R\) is not an ideal initial matrix. In a recommender system, a \(0\) (or blank) element of the rating matrix does not actually mean a zero rating; more likely, the user's evaluation is simply missing. Analogously, in the lncRNA-disease association matrix \(R\), a value of \(0\) usually means only that the corresponding association has not yet been confirmed. Therefore, for computational convenience and in view of this biological meaning, all \(0\) elements in the original binary matrix \(R\) were replaced with \(10^{-6}\) in our experiment. Based on the theory of SVD, each row of \(U_{M \times k}\) is a \(k\)-dimensional linear feature vector of one lncRNA; similarly, each column of \(V_{k \times N}^{T}\) is a \(k\)-dimensional linear feature vector of one disease (Fig. 8).
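A sketch of the rank-\(k\) truncation of Eqs. (2)-(3), including the \(10^{-6}\) substitution for zeros; numpy's `svd` is used here, and the toy matrix and \(k=2\) are illustrative choices:

```python
import numpy as np

def linear_features(R, k):
    """Truncated SVD of Eq. (3): returns U_k (rows = lncRNA features),
    the top-k singular values, and Vt_k (columns = disease features)."""
    R = np.asarray(R, dtype=float)
    R = np.where(R == 0, 1e-6, R)                     # replace 0 entries with 1e-6
    U, s, Vt = np.linalg.svd(R, full_matrices=False)  # s is already descending
    return U[:, :k], s[:k], Vt[:k, :]

R = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
Uk, sk, Vtk = linear_features(R, k=2)
# Each row of Uk is a 2-dim lncRNA vector; each column of Vtk a 2-dim disease vector.
print(Uk.shape, Vtk.shape)  # (4, 2) (2, 3)
```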

### Nonlinear feature extraction based on *Node2vec*

Network representation learning (NRL), also known as network embedding, refers to mapping nodes into a continuous low-dimensional space while preserving the characteristics of the nodes in the original network. Given a network \(G = \left( {V,E} \right)\), where \(V = \left\{ {v_{i} } \right\}\) is the set of nodes and \(E \subseteq V \times V\) is the set of edges, NRL seeks, for every \(v_{i}\), a map \(f:V \to R^{d}\) with \(d \ll \left| V \right|\). Ideal learned node representations quantify the characteristics of nodes in the network: intuitively, topologically neighboring nodes should have small vector distances, and the representations of nodes within the same community should be more similar to each other than to those of nodes outside it. To date, NRL methods have been widely used for problems such as node classification, community discovery, link prediction and data visualization [50].

As a semi-supervised network feature learning method, node2vec [51] innovatively proposed a biased random walk, building on a word representation method [52] and DeepWalk [53], and defined a more flexible way to select the next node of a random walk. More specifically, node2vec trades off two random walk strategies, breadth-first search (BFS) and depth-first search (DFS), shown in Fig. 9. Unlike the original random walk, node2vec can control the balance between BFS and DFS by adjusting parameters according to the practical scenario. A detailed description of the simple random walk and the modified biased random walk in node2vec follows (Fig. 10).

For a given start node \(u\), simulate a simple unbiased random walk of length \(l\). Let \(c_{i}\) denote the \(i\)th node of the walk, with \(c_{0} = u\); the transition probability at step \(i\) is:

$$P(c_{i} = x \mid c_{i-1} = v) = \left\{ \begin{array}{ll} \dfrac{\pi_{vx}}{Z}, & \text{if } (v,x) \in E \\ 0, & \text{otherwise} \end{array} \right.$$

(4)

where \(\pi_{vx}\) is the unnormalized transition probability between nodes \(v\) and \(x\), and \(Z\) is a normalizing constant.
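For an unweighted graph, Eq. (4) reduces to choosing a uniformly random neighbor at each step (\(\pi_{vx} = 1\) and \(Z = \deg(v)\)); a minimal sketch on a toy adjacency map:

```python
import random

def simple_walk(adj, u, length, seed=0):
    """Unbiased random walk of `length` steps from start node u:
    the next node is drawn uniformly from the current node's neighbors."""
    rng = random.Random(seed)
    walk = [u]
    for _ in range(length):
        walk.append(rng.choice(sorted(adj[walk[-1]])))  # pi_vx / Z, uniform
    return walk

adj = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
w = simple_walk(adj, "a", length=4)
# Every consecutive pair in the walk is an edge of the graph.
assert len(w) == 5 and all(w[i + 1] in adj[w[i]] for i in range(4))
```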

As for the biased random walk in node2vec, as shown in Fig. 10, suppose a walk rooted at node \(t\) has just traversed the edge \(\left( {t,v} \right)\) and currently resides at node \(v\). The bias applied to each candidate next node \(x\) is set as follows:

$$\alpha_{pq}(t,x) = \left\{ \begin{array}{ll} \frac{1}{p}, & \text{if } d_{tx} = 0 \\ 1, & \text{if } d_{tx} = 1 \\ \frac{1}{q}, & \text{if } d_{tx} = 2 \end{array} \right.$$

(5)

where \(d_{tx}\) is the shortest-path distance between nodes \(t\) and \(x\), taking values 0, 1 or 2. As shown in Fig. 10, the parameter \(p\) controls the probability that the next step returns to the previous node: if \(p\) is greater than \(1\), the walk is less likely to turn back. The parameter \(q\) controls the preference between BFS and DFS: if \(q\) is greater than \(1\), the walk is biased toward BFS, i.e., toward the neighborhood of the starting node; if \(q\) is less than \(1\), the walk is biased toward DFS, i.e., toward moving away from the starting node. When \(p\) and \(q\) both equal \(1\), node2vec reduces to DeepWalk.
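The bias of Eq. (5) can be sketched as one second-order walk step: given the previous node t and current node v, each neighbor x of v is weighted by α_pq(t, x). The toy graph and parameter values below are illustrative assumptions:

```python
import random

def alpha(p, q, t, x, adj):
    """Search bias of Eq. (5): d_tx = 0 (x is t itself),
    d_tx = 1 (x neighbors t), otherwise d_tx = 2."""
    if x == t:
        return 1.0 / p
    if x in adj[t]:
        return 1.0
    return 1.0 / q

def biased_step(adj, t, v, p, q, rng):
    """Choose the node after edge (t, v), weighting v's neighbors by alpha."""
    nbrs = sorted(adj[v])
    weights = [alpha(p, q, t, x, adj) for x in nbrs]
    return rng.choices(nbrs, weights=weights, k=1)[0]

# Triangle plus a tail node: with q > 1 the walk prefers staying near t (BFS-like).
adj = {"t": {"v", "x1"}, "v": {"t", "x1", "x2"}, "x1": {"t", "v"}, "x2": {"v"}}
rng = random.Random(1)
steps = [biased_step(adj, "t", "v", p=1.0, q=4.0, rng=rng) for _ in range(200)]
# 'x2' has d_tx = 2, so under q = 4 it should be visited least often.
assert steps.count("x2") < steps.count("x1")
```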

node2vec was applied to the constructed LMDN to obtain representations of its vertices. The representations of the lncRNA and disease nodes generated by node2vec retain the nodes' topological information in the LMDN. The experimental results demonstrate that these nonlinear features effectively complement the SVD-based linear features and enrich the information contained in the integrated features.

### Feature integration

Based on the decomposition of \(R_{M \times N}\) and the NRL method node2vec, we obtained the linear feature matrices \(U\) and \(V^{T}\), as well as the nonlinear feature representations of the lncRNA and disease nodes in the LMDN. For each lncRNA \(i\) and disease \(j\), the features are integrated as follows:

The linear features of lncRNA \(i\) are the *i*th row of \(U\), denoted \(LL_{i}\) after being transposed into a column vector. Similarly, the linear features of disease \(j\) are the *j*th column of \(V^{T}\), denoted \(LD_{j}\). The nonlinear features of \(i\) and \(j\) are denoted \(NL_{i}\) and \(ND_{j}\), respectively. The final integrated features of \(i\) and \(j\) are expressed as:

$$FL_{i} = \left[ {\begin{array}{*{20}c} {LL_{i} } \\ {NL_{i} } \\ \end{array} } \right]$$

(6)

$$FD_{j} = \left[ {\begin{array}{*{20}c} {LD_{j} } \\ {ND_{j} } \\ \end{array} } \right]$$

(7)

where \([\,]\) denotes vector concatenation.
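Equations (6)-(7) are plain vector concatenation; a sketch with toy feature values, where the further concatenation of \(FL_i\) and \(FD_j\) into one classifier input is an assumption for illustration:

```python
def integrate(linear, nonlinear):
    """Stack the linear and nonlinear features of one entity (Eqs. 6-7)."""
    return list(linear) + list(nonlinear)

# Toy k = 2 linear features and d = 3 node2vec features.
FL_i = integrate([0.12, -0.40], [0.8, 0.1, -0.3])  # lncRNA i, Eq. (6)
FD_j = integrate([0.55, 0.02], [-0.2, 0.9, 0.4])   # disease j, Eq. (7)
sample = FL_i + FD_j  # feature vector of the pair (i, j) for the classifier
len(sample)  # → 10
```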