Known disease-lncRNA associations
Because the number of lncRNA-disease associations is limited and many heterogeneous biological datasets have been constructed, we collected 8842 known disease-lncRNA associations from the MNDR dataset (http://www.bioinformatics.ac.cn/mndr/index.html) and 2934 known disease-lncRNA associations from the LncRNADisease dataset (http://www.cuilab.cn/lncrnadisease). Since the disease names in the LncRNADisease database differ from those in the MNDR dataset, we mapped the diseases in these two disease-lncRNA association datasets to their MeSH descriptors. After eliminating diseases without any MeSH descriptors, merging the diseases with the same MeSH descriptors and removing the lncRNAs that were not present in the lncRNA-miRNA dataset (DS4) used in this paper, 583 known lncRNA-disease associations (DS1) were obtained from the LncRNADisease dataset (see Additional file 1), and 702 known lncRNA-disease associations (DS2) were obtained from the MNDR dataset (see Additional file 2). Furthermore, after integrating the DS1 and DS2 datasets and removing the duplicate associations, we obtained the DS3 dataset, which included 1073 disease-lncRNA associations (see Additional file 3).
Known lncRNA-miRNA associations
To construct the lncRNA-miRNA network, the lncRNA-miRNA association dataset DS4 was obtained on February 2, 2017 from the starBase v2.0 database (http://starbase.sysu.edu.cn/), which provides the most comprehensive experimentally confirmed lncRNA-miRNA interactions based on large-scale CLIP-Seq data. After data pre-processing (eliminating duplicate values, erroneous data, and disorganized data), removing the lncRNAs that did not exist in the DS3 dataset, and merging the miRNA copies that produced the same mature miRNA, we finally obtained 1883 lncRNA-miRNA associations (DS4) (see Additional file 4).
Known disease-miRNA associations
To validate the performance of DCSMDA, the known human miRNA-disease associations were downloaded from the latest version of the HMDD database, which is considered the gold-standard dataset. After eliminating the duplicate associations and the miRNA-disease associations involving diseases or miRNAs not contained in DS3 or DS4, we finally obtained 3252 high-quality miRNA-disease associations (DS5) (see Additional file 5).
Construction of the disease-lncRNA-miRNA interaction network
To clearly demonstrate the process of constructing the disease-lncRNA-miRNA interaction network, we use the disease-lncRNA dataset DS3 and the lncRNA-miRNA dataset DS4 as examples. We defined L to represent all the different lncRNA terms in DS3 and DS4 and then constructed the disease-lncRNA-miRNA interactive network based on DS3 and DS4 according to the following 3 steps:
Step 1 (Construction of the disease-lncRNA network): Let D and L be the number of different diseases and lncRNAs obtained from DS3, respectively. \( S_D = \{d_1, d_2, \ldots, d_D\} \) represents the set of all D different diseases in DS3, and \( S_L = \{l_1, l_2, \ldots, l_L\} \) represents the set of all L different lncRNAs in DS3. For any given \( d_i \in S_D \) and \( l_j \in S_L \), we can construct the D × L dimensional matrix KAM1 as follows:
$$ KAM1\left(i,j\right)=\Big\{{\displaystyle \begin{array}{c}1\kern0.5em if\kern0.2em {d}_i\kern0.2em is\kern0.34em related\kern0.34em to\kern0.2em {l}_j\kern0.2em in\kern0.2em {DS}_3\\ {}0\kern7.8em otherwise\end{array}} $$
(1)
Step 2 (Construction of the lncRNA-miRNA network): Let M be the number of different miRNAs obtained from DS4. \( S_M = \{m_1, m_2, \ldots, m_M\} \) represents the set of all M different miRNAs in DS4. For any given \( m_i \in S_M \) and \( l_j \in S_L \), we can construct the M × L dimensional matrix KAM2 as follows:
$$ KAM2\left(i,j\right)=\left\{\begin{array}{c}1\kern0.5em if\ {m}_i\ is\ related\ to\ {l}_j\ in\ {DS}_4\\ {}0\kern5.25em otherwise\end{array}\right. $$
(2)
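The two association matrices can be sketched in a few lines of Python. This is only an illustration of formulas (1) and (2): the helper name `build_association_matrix` and the toy disease, lncRNA and miRNA identifiers are hypothetical and are not taken from DS3 or DS4.

```python
def build_association_matrix(row_items, col_items, pairs):
    """Return a 0/1 matrix with one row per row item and one column per column item."""
    row_index = {name: i for i, name in enumerate(row_items)}
    col_index = {name: j for j, name in enumerate(col_items)}
    matrix = [[0] * len(col_items) for _ in row_items]
    for row_name, col_name in pairs:
        matrix[row_index[row_name]][col_index[col_name]] = 1
    return matrix


# Toy data (hypothetical identifiers, not real associations).
diseases = ["d1", "d2"]
lncrnas = ["l1", "l2", "l3"]
mirnas = ["m1", "m2"]
ds3_pairs = [("d1", "l1"), ("d1", "l2"), ("d2", "l2")]  # disease-lncRNA
ds4_pairs = [("m1", "l2"), ("m2", "l3")]                # miRNA-lncRNA

KAM1 = build_association_matrix(diseases, lncrnas, ds3_pairs)  # D x L
KAM2 = build_association_matrix(mirnas, lncrnas, ds4_pairs)    # M x L
```

Both matrices share the lncRNA axis as their columns, which is what later allows diseases and miRNAs to be connected through lncRNAs.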
Step 3 (Construction of the disease-lncRNA-miRNA interactive network): Based on the disease-lncRNA network and the lncRNA-miRNA network, we can obtain the undirected graph \( G_3 = (V_3, E_3) \), where \( V_3 = S_D \cup S_L \cup S_M = \{d_1, d_2, \ldots, d_D, l_{D+1}, l_{D+2}, \ldots, l_{D+L}, m_{D+L+1}, m_{D+L+2}, \ldots, m_{D+L+M}\} \) is the set of vertices and \( E_3 \) is the edge set of \( G_3 \). For \( d_i \in S_D \), \( l_j \in S_L \) and \( m_k \in S_M \), an edge exists between \( d_i \) and \( l_j \) in \( E_3 \) if \( KAM1(d_i, l_j) = 1 \), and an edge exists between \( l_j \) and \( m_k \) in \( E_3 \) if \( KAM2(m_k, l_j) = 1 \). Then, for any given \( a, b \in V_3 \), we can define the Strong Correlation (SC) between a and b as follows:
$$ SC\left(a,b\right)=\left\{\begin{array}{c}1\kern0.5em if\kern0.34em there\kern0.34em is\kern0.34em an\kern0.34em edge\kern0.34em between\kern0.2em a\kern0.2em and\kern0.2em b\\ {}0\kern11em otherwise\end{array}\right. $$
(3)
Notably, although we did not use any known disease-miRNA associations, the diseases and miRNAs can still be indirectly linked in \( G_3 \) by combining the edges between the disease nodes and the lncRNA nodes with the edges between the miRNA nodes and the lncRNA nodes.
Disease semantic similarity
We downloaded the MeSH descriptors of the diseases from the National Library of Medicine (http://www.nlm.nih.gov/), which introduced the concept of Categories and Subcategories and provided a strict system for disease classification. The topology of each disease was visualized as a Directed Acyclic Graph (DAG) in which the nodes represented the disease MeSH descriptors, and all MeSH descriptors in the DAG were linked from more general terms (parent nodes) to more specific terms (child nodes) by a direct edge (see Fig. 4). Let DAG(A) = (A, T(A), E(A)), where A represents disease A, T(A) represents the node set, including node A and its ancestor nodes, and E(A) represents the corresponding edge set. Then, we defined the contribution of disease term d in DAG(A) to the semantic value of disease A as follows:
$$ \left\{\begin{array}{c}{D}_A(d)=1\kern16.8em if\kern0.3em d=A\\ {}{D}_A(d)=\max \left\{0.5\ast {D}_A\left({d}^{\ast}\right)|{d}^{\ast}\in children\kern0.3em of\kern0.3em d\right\}\kern0.3em if\kern0.3em d\ne A\end{array}\right. $$
(4)
For example, the semantic value of the disease ‘Gastrointestinal Neoplasms’ shown in Fig. 4 is calculated by summing the weighted contributions of ‘Neoplasms’ (0.125), ‘Neoplasms by Site’ (0.25), ‘Digestive System Diseases’ (0.25), ‘Digestive System Neoplasms’ (0.5) and ‘Gastrointestinal Diseases’ (0.5) to ‘Gastrointestinal Neoplasms’, together with the contribution of ‘Gastrointestinal Neoplasms’ to itself (1).
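The worked example can be reproduced with a short sketch of formula (4). The parent links below are an assumption: Fig. 4 is not reproduced here, so the edges were chosen to match the contributions quoted in the text. The function name `semantic_contributions` is likewise hypothetical.

```python
def semantic_contributions(target, parents):
    """Contribution D_A(d) of every term d in T(A), following formula (4).

    The target term contributes 1; every ancestor contributes 0.5 times the
    largest contribution among its children inside the DAG.
    """
    contrib = {target: 1.0}
    frontier = [target]
    while frontier:
        next_frontier = []
        for node in frontier:
            for parent in parents.get(node, []):
                value = 0.5 * contrib[node]
                if value > contrib.get(parent, 0.0):
                    contrib[parent] = value
                    next_frontier.append(parent)
        frontier = next_frontier
    return contrib


# Assumed child -> parents edges for the 'Gastrointestinal Neoplasms' DAG.
parents = {
    "Gastrointestinal Neoplasms": ["Digestive System Neoplasms", "Gastrointestinal Diseases"],
    "Digestive System Neoplasms": ["Neoplasms by Site", "Digestive System Diseases"],
    "Gastrointestinal Diseases": ["Digestive System Diseases"],
    "Neoplasms by Site": ["Neoplasms"],
}

contrib = semantic_contributions("Gastrointestinal Neoplasms", parents)
semantic_value = sum(contrib.values())  # 1 + 0.5 + 0.5 + 0.25 + 0.25 + 0.125 = 2.625
```

Under these assumed edges, the six terms contribute 2.625 in total, which is the semantic value of ‘Gastrointestinal Neoplasms’.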
Then, the semantic value of disease A can be obtained by summing the contributions from all disease terms in DAG(A), and the semantic similarity between two diseases \( d_i \) and \( d_j \) can be calculated as follows:
$$ SSD\left({d}_i,{d}_j\right)=\frac{\sum \limits_{d\in \left(T\left({d}_i\right)\cap T\left({d}_j\right)\right)}\left({D}_{d_i}(d)+{D}_{d_j}(d)\right)}{\sum \limits_{d\in T\left({d}_i\right)}{D}_{d_i}(d)+{\sum}_{d\in T\left({d}_j\right)}{D}_{d_j}(d)} $$
(5)
where SSD is the disease semantic similarity matrix.
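Formula (5) can be illustrated as follows. The sketch assumes that the per-term contributions of each disease have already been computed with formula (4) and are stored as dictionaries; the term names and values are toy data, and `semantic_similarity` is a hypothetical helper name.

```python
def semantic_similarity(contrib_i, contrib_j):
    """Semantic similarity of two diseases from their contribution tables (formula 5)."""
    shared_terms = set(contrib_i) & set(contrib_j)
    numerator = sum(contrib_i[t] + contrib_j[t] for t in shared_terms)
    denominator = sum(contrib_i.values()) + sum(contrib_j.values())
    return numerator / denominator


# Toy DAGs: diseases A and B share the ancestors P and R.
contrib_a = {"A": 1.0, "P": 0.5, "R": 0.25}
contrib_b = {"B": 1.0, "P": 0.5, "R": 0.25}

ssd_ab = semantic_similarity(contrib_a, contrib_b)  # 1.5 / 3.5
ssd_aa = semantic_similarity(contrib_a, contrib_a)  # identical DAGs give 1.0
```

Two diseases with no shared ancestor obtain a similarity of 0, and a disease compared with itself obtains 1.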
MiRNA Gaussian interaction profile kernel similarity
Based on the assumption that similar miRNAs tend to show similar interaction and non-interaction patterns with lncRNAs, in this section, we introduce the Gaussian interaction profile kernel to calculate the network topological similarity between miRNAs, and we use the vector \( MLP(m_i) \) to denote the ith row of the adjacency matrix KAM2. Then, the Gaussian interaction profile kernel similarity for all investigated miRNAs can be calculated as follows:
$$ MGS\left({m}_i,{m}_j\right)=\exp \left(-\frac{M\ast {\left\Vert MLP\left({m}_i\right)- MLP\left({m}_j\right)\right\Vert}^2}{\sum \limits_{i=1}^M{\left\Vert MLP\left({m}_i\right)\right\Vert}^2}\right) $$
(6)
where parameter M is the number of miRNAs in DS4.
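A minimal sketch of formula (6) is given below. It takes the rows of an association matrix as interaction profiles, so the same function could be applied to the rows of KAM1 for diseases. The function name and the toy matrix are hypothetical, and the sketch assumes that at least one profile is non-zero.

```python
import math


def gaussian_kernel_similarity(profiles):
    """Gaussian interaction profile kernel similarity between the rows of a 0/1 matrix."""
    n = len(profiles)
    # Sum of squared norms of all profiles (assumed to be non-zero).
    norm_sum = sum(sum(x * x for x in row) for row in profiles)
    gamma = n / norm_sum
    similarity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist_sq = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            similarity[i][j] = math.exp(-gamma * dist_sq)
    return similarity


# Toy KAM2 with 3 miRNAs (rows) and 3 lncRNAs (columns).
toy_kam2 = [[1, 1, 0],
            [0, 1, 0],
            [0, 0, 1]]
MGS = gaussian_kernel_similarity(toy_kam2)
```

In this toy case the squared norms sum to 4, so the exponent is scaled by 3/4, and the first two miRNAs, whose profiles differ in one position, obtain a similarity of exp(-0.75).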
Disease Gaussian interaction profile kernel similarity
Based on the assumption that similar diseases tend to show similar interaction and non-interaction patterns with lncRNAs, the Gaussian interaction profile kernel similarity for all investigated diseases can be calculated as follows:
$$ DGS\left({d}_i,{d}_j\right)=\exp \left(-\frac{D\ast {\left\Vert DLP\left({d}_i\right)- DLP\left({d}_j\right)\right\Vert}^2}{\sum \limits_{i=1}^D{\left\Vert DLP\left({d}_i\right)\right\Vert}^2}\right) $$
(7)
where parameter D is the number of diseases in DS3, and \( DLP(d_i) \) represents the ith row of the matrix KAM1. Then, based on previous work [46], we can improve the predictive accuracy by applying a logistic function transformation as follows:
$$ FDGS\left({d}_i,{d}_j\right)=\frac{1}{1+{e}^{-15\ast DGS\left({d}_i,{d}_j\right)+\log (9999)}} $$
(8)
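The effect of the transformation in formula (8) is easiest to see numerically. The sketch below uses the constants -15 and log(9999) from the formula; the function name `logistic_transform` is hypothetical.

```python
import math


def logistic_transform(dgs_value, slope=-15.0, offset=math.log(9999)):
    """Map a Gaussian kernel similarity in [0, 1] through the logistic function of formula (8)."""
    return 1.0 / (1.0 + math.exp(slope * dgs_value + offset))


fdgs_low = logistic_transform(0.0)                    # about 1e-4
fdgs_mid = logistic_transform(math.log(9999) / 15.0)  # about 0.5
fdgs_high = logistic_transform(1.0)                   # close to 1
```

The curve is monotonically increasing, maps a similarity of 0 to roughly 0.0001, and crosses 0.5 near a similarity of 0.61, so small kernel similarities are suppressed while large ones are pushed towards 1.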
lncRNA Gaussian interaction profile kernel similarity
Based on the assumption that similar lncRNAs tend to show similar interaction and non-interaction patterns with miRNAs and similar lncRNAs tend to show similar interaction and non-interaction patterns with diseases, the Gaussian interaction profile kernel similarity matrix for all investigated lncRNAs in DS3 can be computed in a similar way as that for disease, as follows:
$$ LGS1\left({l}_i,{l}_j\right)=\exp \left(-\frac{L\ast {\left\Vert LDP\left({l}_i\right)- LDP\left({l}_j\right)\right\Vert}^2}{\sum \limits_{i=1}^L{\left\Vert LDP\left({l}_i\right)\right\Vert}^2}\right) $$
(9)
where parameter L is the number of lncRNAs in DS3, and \( LDP(l_i) \) represents the ith column of the matrix KAM1.
Similarly, the Gaussian interaction profile kernel similarity for all investigated lncRNAs in DS4 can be computed as follows:
$$ LGS2\left({l}_i,{l}_j\right)=\exp \left(-\frac{L\ast {\left\Vert LMP\left({l}_i\right)- LMP\left({l}_j\right)\right\Vert}^2}{\sum \limits_{i=1}^L{\left\Vert LMP\left({l}_i\right)\right\Vert}^2}\right) $$
(10)
where \( LMP(l_i) \) represents the ith column of the matrix KAM2.
Disease functional similarity based on the lncRNAs
To calculate the functional similarity of the diseases, we first constructed the undirected graph \( G_1 = (V_1, E_1) \) based on KAM1, where \( V_1 = S_D \cup S_L = \{d_1, d_2, \ldots, d_D, l_{D+1}, l_{D+2}, \ldots, l_{D+L}\} \) is the set of vertices, \( E_1 \) is the set of edges, and for any two nodes \( a, b \in V_1 \), an edge exists between a and b in \( E_1 \) if KAM1(a, b) = 1. Therefore, based on the assumption that similar diseases tend to show similar interaction and non-interaction patterns with lncRNAs, we can calculate the similarity between two disease nodes by comparing and integrating the lncRNA nodes associated with these two disease nodes. The procedure used to calculate the disease functional similarity is shown in Fig. 5.
Because different lncRNA terms in DS3 may relate to several diseases, assigning the same contribution value to all lncRNAs is not suitable. Therefore, we defined the contribution value of each lncRNA as follows:
$$ C\left({l}_i\right)=\frac{\mathrm{The}\kern0.34em \mathrm{number}\kern0.34em \mathrm{of}\kern0.2em {l}_i-\mathrm{related}\kern0.34em \mathrm{edges}\ \mathrm{in}\ {E}_1}{\mathrm{The}\ \mathrm{number}\ \mathrm{of}\ \mathrm{all}\ \mathrm{edges}\ \mathrm{in}\ {E}_1} $$
(11)
Based on the definition of \( C(l_i) \), we can define the contribution value of each lncRNA to the functional similarity of each disease pair as follows:
$$ {CD}_{ij}\left({l}_k\right)=\left\{\begin{array}{ll}1, & \mathrm{if}\ {l}_k\ \mathrm{is\ related\ to\ both}\ {d}_i\ \mathrm{and}\ {d}_j\\ {}C\left({l}_k\right), & \mathrm{if}\ {l}_k\ \mathrm{is\ related\ to\ only\ one\ of}\ {d}_i\ \mathrm{and}\ {d}_j\end{array}\right. $$
(12)
Finally, we can define the functional similarity between diseases \( d_i \) and \( d_j \) by integrating the lncRNAs related to \( d_i \), \( d_j \) or both as follows:
$$ FSD\left({d}_i,{d}_j\right)=\frac{\sum \limits_{l_k\in \left(D\left({d}_i\right)\cup D\left({d}_j\right)\right)}C{D}_{ij}\left({l}_k\right)}{\mid D\left({d}_i\right)\mid +\mid D\left({d}_j\right)\mid -\mid D\left({d}_i\right)\cap D\left({d}_j\right)\mid } $$
(13)
where \( D(d_i) \) and \( D(d_j) \) represent all lncRNAs related to \( d_i \) and \( d_j \) in \( E_1 \), respectively.
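Formulas (11) to (13) can be combined into one short routine. The sketch below is an illustration under stated assumptions: the function name `functional_similarity` and the toy matrix are hypothetical, and returning 0 for a pair of diseases without any related lncRNA is a choice made here, since that case is not covered by formula (13).

```python
def functional_similarity(kam, i, j):
    """Functional similarity of rows i and j of a 0/1 association matrix (formulas 11-13)."""
    total_edges = sum(sum(row) for row in kam)
    n_cols = len(kam[0])
    related_i = {k for k in range(n_cols) if kam[i][k] == 1}
    related_j = {k for k in range(n_cols) if kam[j][k] == 1}
    union = related_i | related_j
    if not union:
        return 0.0  # assumption: undefined case mapped to 0
    score = 0.0
    for k in union:
        if k in related_i and k in related_j:
            score += 1.0  # lncRNA shared by both diseases
        else:
            degree_k = sum(row[k] for row in kam)
            score += degree_k / total_edges  # C(l_k) from formula (11)
    return score / len(union)


# Toy KAM1 with 3 diseases (rows) and 3 lncRNAs (columns); 5 edges in total.
toy_kam1 = [[1, 1, 0],
            [0, 1, 1],
            [0, 0, 1]]
fsd_01 = functional_similarity(toy_kam1, 0, 1)  # (1 + 1/5 + 2/5) / 3
```

The denominator of formula (13) equals the size of the union of the two lncRNA sets, which is why the code divides by `len(union)`.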
MiRNA functional similarity based on lncRNAs
Based on the assumption that similar miRNAs tend to show similar interaction and non-interaction patterns with lncRNAs, we can also calculate the miRNA functional similarity in the lncRNA-miRNA interactive network. Similar to the procedure used to calculate the disease functional similarity, we first constructed the undirected graph \( G_2 = (V_2, E_2) \), where \( V_2 = S_M \cup S_L = \{m_1, m_2, \ldots, m_M, l_{M+1}, l_{M+2}, \ldots, l_{M+L}\} \) is the set of vertices, \( E_2 \) is the set of edges, and for any two nodes \( a, b \in V_2 \), an edge exists between a and b in \( E_2 \) if KAM2(a, b) = 1. Then, we defined the contribution of each lncRNA to the functional similarity of each miRNA pair as follows:
$$ {CM}_{ij}\left({l}_k\right)=\left\{\begin{array}{ll}1, & \mathrm{if}\ {l}_k\ \mathrm{is\ related\ to\ both}\ {m}_i\ \mathrm{and}\ {m}_j\\ {}C\left({l}_k\right), & \mathrm{if}\ {l}_k\ \mathrm{is\ related\ to\ only\ one\ of}\ {m}_i\ \mathrm{and}\ {m}_j\end{array}\right. $$
(14)
Additionally, we can define the functional similarity between \( m_i \) and \( m_j \) as follows:
$$ FSM\left({m}_i,{m}_j\right)=\frac{\sum \limits_{l_k\in \left(D\left({m}_i\right)\cup D\left({m}_j\right)\right)}C{M}_{ij}\left({l}_k\right)}{\mid D\left({m}_i\right)\mid +\mid D\left({m}_j\right)\mid -\mid D\left({m}_i\right)\cap D\left({m}_j\right)\mid } $$
(15)
where \( D(m_i) \) and \( D(m_j) \) represent all lncRNAs related to \( m_i \) and \( m_j \) in \( E_2 \), respectively.
Integrated similarity
The processes used to calculate the integrated similarities of the diseases, lncRNAs and miRNAs are illustrated in Fig. 6. Combining the disease semantic similarity, the disease Gaussian interaction profile kernel similarity and the disease functional similarity mentioned above, we can construct the disease integrated similarity matrix FDD as follows:
$$ FDD=\frac{SSD+ FDGS+ FSD}{3} $$
(16)
Additionally, based on the miRNA Gaussian interaction profile kernel similarity and the miRNA functional similarity, we can construct the miRNA integrated similarity matrix FMM as follows:
$$ FMM=\frac{MGS+ FSM}{2} $$
(17)
Furthermore, based on the Gaussian interaction profile kernel similarity matrices LGS1 and LGS2, we can construct the lncRNA integrated similarity matrix FLL as follows:
$$ FLL=\frac{LGS1+ LGS2}{2} $$
(18)
Prediction of disease-miRNA associations based on a distance correlation set
In this section, we developed a novel computational method, i.e., DCSMDA, to predict potential disease-miRNA associations by introducing a distance correlation set based on the following assumptions: similar diseases tend to show similar interaction and non-interaction patterns with lncRNAs, and similar lncRNAs tend to show similar interaction and non-interaction patterns with miRNAs. As illustrated in Fig. 7, the DCSMDA procedure consists of the following 5 major steps:
Step 1 (Construction of the adjacency matrix based on \( G_3 \)): First, we construct a (D + L + M) × (D + L + M) Adjacency Matrix (AM) based on the undirected graph \( G_3 \) and SC. For any two nodes \( v_i, v_j \in V_3 \), we can define AM(i, j) as follows:
$$ AM\left(i,j\right)=\left\{\begin{array}{ll} SC\left({d}_i,{d}_j\right), & \mathrm{if}\ i\in \left[1,D\right]\ \mathrm{and}\ j\in \left[1,D\right]\\ {} SC\left({d}_i,{l}_j\right), & \mathrm{if}\ i\in \left[1,D\right]\ \mathrm{and}\ j\in \left[D+1,D+L\right]\\ {} SC\left({d}_i,{m}_j\right), & \mathrm{if}\ i\in \left[1,D\right]\ \mathrm{and}\ j\in \left[D+L+1,D+L+M\right]\\ {} SC\left({l}_i,{d}_j\right), & \mathrm{if}\ i\in \left[D+1,D+L\right]\ \mathrm{and}\ j\in \left[1,D\right]\\ {} SC\left({l}_i,{l}_j\right), & \mathrm{if}\ i\in \left[D+1,D+L\right]\ \mathrm{and}\ j\in \left[D+1,D+L\right]\\ {} SC\left({l}_i,{m}_j\right), & \mathrm{if}\ i\in \left[D+1,D+L\right]\ \mathrm{and}\ j\in \left[D+L+1,D+L+M\right]\\ {} SC\left({m}_i,{d}_j\right), & \mathrm{if}\ i\in \left[D+L+1,D+L+M\right]\ \mathrm{and}\ j\in \left[1,D\right]\\ {} SC\left({m}_i,{l}_j\right), & \mathrm{if}\ i\in \left[D+L+1,D+L+M\right]\ \mathrm{and}\ j\in \left[D+1,D+L\right]\\ {} SC\left({m}_i,{m}_j\right), & \mathrm{if}\ i\in \left[D+L+1,D+L+M\right]\ \mathrm{and}\ j\in \left[D+L+1,D+L+M\right]\end{array}\right. $$
(19)
where i∈[1, D + L + M] and j∈[1, D + L + M], and to calculate the shortest distance matrix in step 2, we define AM (i, j) = 1 if i = j.
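In practice, the adjacency matrix can be assembled block by block from KAM1 and KAM2. The sketch below is one way to do this under the vertex ordering of \( V_3 \) (diseases, then lncRNAs, then miRNAs); it uses 0-based indices, and the function name `build_adjacency` and the toy matrices are hypothetical.

```python
def build_adjacency(kam1, kam2):
    """Assemble the (D+L+M) x (D+L+M) adjacency matrix of G3 from KAM1 (D x L) and KAM2 (M x L)."""
    D, L, M = len(kam1), len(kam1[0]), len(kam2)
    n = D + L + M
    am = [[0] * n for _ in range(n)]
    for i in range(n):
        am[i][i] = 1  # AM(i, j) = 1 if i = j
    for d in range(D):
        for l in range(L):
            if kam1[d][l] == 1:  # disease-lncRNA edge
                am[d][D + l] = am[D + l][d] = 1
    for m in range(M):
        for l in range(L):
            if kam2[m][l] == 1:  # miRNA-lncRNA edge
                am[D + L + m][D + l] = am[D + l][D + L + m] = 1
    return am


# Toy network: 1 disease, 2 lncRNAs, 1 miRNA.
toy_kam1 = [[1, 0]]  # d0 - l0
toy_kam2 = [[1, 1]]  # m0 - l0, m0 - l1
AM = build_adjacency(toy_kam1, toy_kam2)
```

Because no disease-disease, lncRNA-lncRNA, miRNA-miRNA or disease-miRNA edges exist in \( G_3 \), the corresponding blocks contain only the diagonal ones.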
Step 2 (Construction of the shortest distance matrix based on the adjacency matrix AM): First, we set a parameter b, a pre-determined positive integer, to control the bandwidth of the distance correlation set. Then, we can obtain b matrices \( AM^1, AM^2, \ldots, AM^b \) based on formula (19), and the Shortest Path Matrix (SPM) is calculated as follows:
$$ SPM\left(i,j\right)=\left\{\ \begin{array}{c}1,\kern2.5em if\ AM\left(i,j\right)=1\\ {}k,\kern2.25em otherwise\kern1.25em \end{array}\right. $$
(20)
where i∈[1, D + L + M], j∈[1, D + L + M], k∈[2, b], and k satisfies the following: \( AM^k(i,j) \ne 0 \), while \( AM^1(i,j) = AM^2(i,j) = \ldots = AM^{k-1}(i,j) = 0 \).
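A possible implementation of formula (20) is sketched below. It rests on two assumptions that are not spelled out in the text: the b matrices are read as successive matrix powers of AM, and node pairs that are not connected within b steps receive the value 0. The function names and the toy matrix are hypothetical.

```python
def matmul(left, right):
    """Multiply two square matrices given as lists of lists."""
    n = len(left)
    return [[sum(left[i][k] * right[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def shortest_path_matrix(am, b):
    """SPM(i, j) = smallest k <= b with a non-zero (i, j) entry in the kth power of AM."""
    n = len(am)
    spm = [[1 if am[i][j] != 0 else 0 for j in range(n)] for i in range(n)]
    power = am
    for k in range(2, b + 1):
        power = matmul(power, am)
        for i in range(n):
            for j in range(n):
                if spm[i][j] == 0 and power[i][j] != 0:
                    spm[i][j] = k
    return spm  # 0 marks pairs farther apart than b (assumption)


# Toy adjacency matrix (with self-loops) for the path d0 - l0 - m0 - l1,
# stored in the order d0, l0, l1, m0.
toy_am = [[1, 1, 0, 0],
          [1, 1, 0, 1],
          [0, 0, 1, 1],
          [0, 1, 1, 1]]
SPM_b2 = shortest_path_matrix(toy_am, 2)
SPM_b3 = shortest_path_matrix(toy_am, 3)
```

The self-loops on the diagonal make the non-zero pattern of each successive power a superset of the previous one, so the first power with a non-zero entry gives the shortest distance. Here d0 and l1 are three steps apart, so their entry is 0 for b = 2 and 3 for b = 3.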
Step 3 (Calculation of the distance correlation sets and the distance coefficient of each node pair in \( G_3 \)):
For each node \( v_i \in V_3 \), we can obtain the distance correlation set DCS(i) according to the shortest distance matrix as follows:
$$ DCS(i)=\left\{{v}_j|b\ge SPM\left(i,j\right)>0\right\} $$
(21)
where DCS(i) contains node \( v_i \) itself and all nodes whose shortest distance from \( v_i \) is at most b.
For instance, in the disease-miRNA-lncRNA interaction network illustrated in Fig. 7, DCS(seed node) contains all candidate nodes when b is set to 2.
Then, we can calculate the distance coefficient (DC), denoted P(i, j), of the node pair \( (v_i, v_j) \) as follows:
$$ P\left(i,j\right)=\left\{\begin{array}{c} SPM{\left(i,j\right)}^{b+1}, if\ i\in DCS(j)\ or\ j\in DCS(i)\\ {}0,\kern3.5em otherwise\end{array}\right. $$
(22)
Furthermore, we can construct a Distance Correlation Matrix (DCM) based on the disease integrated similarity, the lncRNA integrated similarity, and the miRNA integrated similarity as follows:
$$ DCM\left(i,j\right)=\left\{\begin{array}{ll}P\left(i,j\right)\ast \exp \left( FDD\left(i,j\right)\right), & \mathrm{if}\ i\in \left[1,D\right]\ \mathrm{and}\ j\in \left[1,D\right]\\ {}P\left(i,j\right)\ast \exp \left( FLL\left(i,j\right)\right), & \mathrm{if}\ i\in \left[D+1,D+L\right]\ \mathrm{and}\ j\in \left[D+1,D+L\right]\\ {}P\left(i,j\right)\ast \exp \left( FMM\left(i,j\right)\right), & \mathrm{if}\ i\in \left[D+L+1,D+L+M\right]\ \mathrm{and}\ j\in \left[D+L+1,D+L+M\right]\\ {}P\left(i,j\right)\ast \frac{SPM\left(i,j\right)}{b}, & \mathrm{otherwise}\end{array}\right. $$
(23)
where i∈[1, D + L + M] and j∈[1, D + L + M].
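Formulas (21) to (23) can be evaluated in a single pass over the shortest path matrix. The sketch below is illustrative only: it uses 0-based indices, assumes that SPM is symmetric and stores 0 for pairs farther apart than b, and assumes that the similarity matrices are indexed relative to the start of their own block. The function name and all toy values are hypothetical.

```python
import math


def distance_correlation_matrix(spm, b, fdd, fll, fmm):
    """Combine the distance coefficient P with the integrated similarities (formulas 21-23)."""
    D, L, M = len(fdd), len(fll), len(fmm)
    n = D + L + M
    dcm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            in_dcs = 0 < spm[i][j] <= b  # v_j belongs to DCS(i)
            p = float(spm[i][j] ** (b + 1)) if in_dcs else 0.0
            if i < D and j < D:
                dcm[i][j] = p * math.exp(fdd[i][j])
            elif D <= i < D + L and D <= j < D + L:
                dcm[i][j] = p * math.exp(fll[i - D][j - D])
            elif i >= D + L and j >= D + L:
                dcm[i][j] = p * math.exp(fmm[i - D - L][j - D - L])
            else:
                dcm[i][j] = p * spm[i][j] / b
    return dcm


# Toy input: 1 disease, 2 lncRNAs, 1 miRNA, bandwidth b = 2.
toy_spm = [[1, 1, 0, 2],
           [1, 1, 2, 1],
           [0, 2, 1, 1],
           [2, 1, 1, 1]]
toy_fdd = [[1.0]]
toy_fll = [[1.0, 0.3], [0.3, 1.0]]
toy_fmm = [[1.0]]
DCM = distance_correlation_matrix(toy_spm, 2, toy_fdd, toy_fll, toy_fmm)
```

In this toy case the disease and the miRNA are two steps apart, so P = 2^3 = 8 and the cross-type entry becomes 8 × 2 / 2 = 8, while the disease and the second lncRNA lie outside each other's distance correlation set and obtain 0.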
Step 4 (Estimation of the association degree between a pair of nodes): Based on formula (23), we can estimate the association degree between \( v_i \) and \( v_j \) as follows:
$$ PM\left(i,j\right)=\frac{\sum \limits_{k=1}^{D+L+M} DCM\left(i,k\right)+{\sum}_{k=1}^{D+L+M} DCM\left(k,j\right)}{D+L+M} $$
(24)
Thus, we can obtain the prediction matrix PM, where the entry PM(i, j) in row i and column j represents the predicted association between nodes \( v_i \) and \( v_j \).
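Formula (24) adds the ith row sum and the jth column sum of DCM and divides by the number of nodes. A minimal sketch follows, with a hypothetical function name and a toy 2 × 2 matrix chosen only to make the arithmetic easy to follow:

```python
def prediction_matrix(dcm):
    """PM(i, j) = (sum of row i of DCM + sum of column j of DCM) / number of nodes."""
    n = len(dcm)
    row_sums = [sum(row) for row in dcm]
    col_sums = [sum(dcm[k][j] for k in range(n)) for j in range(n)]
    return [[(row_sums[i] + col_sums[j]) / n for j in range(n)] for i in range(n)]


toy_dcm = [[1.0, 2.0],
           [3.0, 4.0]]
PM = prediction_matrix(toy_dcm)  # [[3.5, 4.5], [5.5, 6.5]]
```

Precomputing the row and column sums avoids recomputing them for every entry of PM.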
Step 5 (Calculation of the final prediction result matrix between the miRNAs and diseases): Let \( PM=\left[\begin{array}{ccc}{C}_{11} & {C}_{12} & {C}_{13}\\ {}{C}_{21} & {C}_{22} & {C}_{23}\\ {}{C}_{31} & {C}_{32} & {C}_{33}\end{array}\right] \), where \( C_{11} \) is a D×D matrix, \( C_{12} \) is a D×L matrix, \( C_{13} \) is a D×M matrix, \( C_{21} \) is an L×D matrix, \( C_{22} \) is an L×L matrix, \( C_{23} \) is an L×M matrix, \( C_{31} \) is an M×D matrix, \( C_{32} \) is an M×L matrix and \( C_{33} \) is an M×M matrix. \( C_{13} \) is therefore our predicted result, which provides the association probability between each disease and miRNA. A previous study [27] demonstrated that the Gaussian interaction profile kernel similarity is a high-efficiency tool for optimizing the result of prediction, and therefore, we used the miRNA Gaussian interaction profile kernel similarity and the disease Gaussian interaction profile kernel similarity to optimize the result of DCSMDA as follows:
$$ FAD= FDD\ast {C}_{13}\ast FMM $$
(25)
where the matrix FAD denotes the relationship between the miRNA-disease pairs.
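The last step can be sketched as a block extraction followed by two matrix products. The sketch reads the asterisks in formula (25) as ordinary matrix multiplication, which is an assumption suggested by the matrix dimensions (D×D, D×M and M×M); the function names and toy values are hypothetical.

```python
def matmul(left, right):
    """Multiply two (possibly rectangular) matrices given as lists of lists."""
    inner, cols = len(right), len(right[0])
    return [[sum(row[k] * right[k][c] for k in range(inner)) for c in range(cols)]
            for row in left]


def final_scores(pm, fdd, fmm, n_lncrnas):
    """Extract the disease-miRNA block C13 from PM and compute FDD * C13 * FMM."""
    D, M = len(fdd), len(fmm)
    start = D + n_lncrnas
    c13 = [row[start:start + M] for row in pm[:D]]
    return matmul(matmul(fdd, c13), fmm)


# Toy input: 1 disease, 1 lncRNA, 2 miRNAs.
toy_pm = [[0.9, 0.1, 2.0, 4.0],
          [0.1, 0.9, 1.0, 1.0],
          [2.0, 1.0, 0.9, 0.2],
          [4.0, 1.0, 0.2, 0.9]]
toy_fdd = [[1.0]]
toy_fmm = [[1.0, 0.5],
           [0.5, 1.0]]
FAD = final_scores(toy_pm, toy_fdd, toy_fmm, n_lncrnas=1)  # [[4.0, 5.0]]
```

Each entry of FAD is a score for one disease-miRNA pair, with rows indexed by diseases and columns by miRNAs; candidate miRNAs for a disease can then be ranked by their scores in that row.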