Datasets and preprocessing
The manually curated piRNA-disease datasets were collected from the publicly available piRBase 2.0 and MNDR v3.1 databases as the main dataset for our model. We retained only experimentally verified PDA pairs; the sequence of each piRNA is available in the two databases. Non-human PDA pairs were filtered out of MNDR v3.1. It is also worth noting that many nodes (i.e., piRNAs) with degree = 1 exist in the datasets. Considering the influence of degree on graph-based methods, as shown in Additional file 1: Figs. S1 and S2, we filtered out the nodes with degree = 1. For a graph containing N nodes, the degree distribution is defined as follows:
$$P_{k} \, = \frac{{N_{k} }}{N}$$
(1)
where Pk is the fraction of nodes with degree k, Nk denotes the number of nodes with degree = k, and Pk satisfies \(\sum\limits_{k = 0}^{\infty } {P_{k} } = 1\).
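As a minimal sketch of this preprocessing step, assuming the known associations are given as a list of (piRNA, disease) pairs, the degree distribution of Eq. (1) and the removal of degree-1 piRNA nodes might look like:

```python
from collections import Counter

def degree_distribution(pairs):
    """P_k = N_k / N over all nodes of the bipartite PDA graph (Eq. 1)."""
    deg = Counter()
    for p, d in pairs:
        deg[("p", p)] += 1
        deg[("d", d)] += 1
    n = len(deg)
    counts = Counter(deg.values())          # N_k: number of nodes with degree k
    return {k: nk / n for k, nk in counts.items()}

def filter_degree_one(pairs):
    """Drop associations whose piRNA occurs only once (degree = 1)."""
    p_deg = Counter(p for p, _ in pairs)
    return [(p, d) for p, d in pairs if p_deg[p] > 1]

# Toy example (names are illustrative, not from the datasets)
pairs = [("p1", "cancer"), ("p1", "gastric"), ("p2", "cancer"), ("p3", "gastric")]
kept = filter_degree_one(pairs)   # p2 and p3 have degree 1 and are removed
```

By construction the returned distribution sums to 1, matching the constraint stated above.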
Finally, after identifier unification, redundancy removal and deletion of irrelevant entries from the two databases (i.e., piRBase 2.0 and MNDR v3.1), we obtained the main dataset comprising 3446 known associations between 1478 piRNAs and 24 diseases.
In addition, the independent piRDisease dataset, generally used as a benchmark by the baseline methods, consists of 4350 piRNAs, 21 diseases and 4993 known PDAs. We employed it for comparison with the five baseline methods.
Medical Subject Headings (MeSH) descriptor data of diseases were downloaded from https://meshb.nlm.nih.gov/.
Method overview
Nodes and edges form the backbone of a graph; effectively mining structural information for edges while augmenting feature information for nodes is a key point for graph-based methods. Accordingly, the rationale of our method is to construct appropriate topological associations as the adjacency matrix (AM) and to enhance node features as the node initial representation (NIR) on the heterogeneous graph of PDAs. We then integrate the two components into a GCN and convert PDA prediction into a graph link-prediction task. The flowchart is shown in Fig. 5.
Composition of the PDA heterogeneous graph
Based on graph theory, we can treat the detection of potential PDAs as a link-prediction task in a graph. Thus, a heterogeneous graph consisting of a piRNA-piRNA subgraph, a disease-disease subgraph and the known piRNA-disease subgraph is established. Denoting these by App, Add, and Apd, we integrate the three subgraphs into a heterogeneous graph G. After aligning the nodes of the different subgraphs according to the node map, the final adjacency matrix A ∈ R(N+M)×(N+M) of G is defined as follows:
$$A = \left[ {\begin{array}{cc} {A^{pp} } & {A^{pd} } \\ {\left( {A^{pd} } \right)^{T} } & {A^{dd} } \\ \end{array} } \right]$$
(2)
where N is the number of piRNAs and M is the number of diseases. App denotes the piRNA-piRNA projection matrix and Add the corresponding disease-disease matrix, while Apd denotes the known PDA matrix and (Apd)T its transpose.
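The block structure of Eq. (2) can be sketched with NumPy on toy sizes; the subgraph matrices here are placeholders standing in for the constructions described below:

```python
import numpy as np

N, M = 4, 2                                  # piRNAs, diseases (toy sizes)
A_pp = np.eye(N)                             # piRNA-piRNA subgraph (placeholder)
A_dd = np.eye(M)                             # disease-disease subgraph (placeholder)
A_pd = np.zeros((N, M))
A_pd[0, 1] = 1                               # one known association

# Block layout of Eq. (2): piRNA rows/cols first, then disease rows/cols.
A = np.block([[A_pp, A_pd],
              [A_pd.T, A_dd]])               # shape (N + M, N + M)
```

Because the off-diagonal blocks are each other's transpose, the assembled matrix is symmetric, i.e., the heterogeneous graph is undirected.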
Within the full graph, effectively constructing the App and Add subgraphs is key to inferring potential PDAs. Different from shallow embedding methods such as node2vec [25], LINE [26], and SDNE [27], we more thoroughly consider node feature information together with the PDA heterogeneous graph to jointly learn the node representation in each convolution layer. The detailed construction procedure is described in the following sections.
piRNA/disease subgraph construction
In this study, we adopted bipartite graph projection [28] to construct the piRNA subgraph and the disease subgraph individually. We assume that P = {P1, P2, …, Pn} (n = 1478) is the set of piRNA nodes and D = {D1, D2, …, Dm} (m = 24) is the set of disease nodes, while PDAs = {Pa Db} (1 ≤ a ≤ n, 1 ≤ b ≤ m) is the set of known piRNA-disease associations. Given any Pp Di ∈ PDAs and Pq Di ∈ PDAs (i.e., piRNAs p and q are both related to disease i), we infer an edge Pp Pq between Pp and Pq and thereby construct the piRNA-piRNA subgraph. The same procedure applies to the disease-disease subgraph. In this way, the piRNA-piRNA and disease-disease subgraphs are built from known PDAs.
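The projection rule can be sketched in a few lines of Python, assuming the known PDAs are a list of (piRNA, disease) pairs; edges are stored as frozensets since the subgraphs are undirected:

```python
from collections import defaultdict
from itertools import combinations

def project(pairs):
    """Bipartite projection: return (piRNA-piRNA edges, disease-disease edges)."""
    by_disease, by_pirna = defaultdict(set), defaultdict(set)
    for p, d in pairs:
        by_disease[d].add(p)
        by_pirna[p].add(d)
    # Two piRNAs sharing a disease get an edge; symmetrically for diseases.
    pp = {frozenset(e) for ps in by_disease.values()
          for e in combinations(sorted(ps), 2)}
    dd = {frozenset(e) for ds in by_pirna.values()
          for e in combinations(sorted(ds), 2)}
    return pp, dd

# Toy example (illustrative names)
pairs = [("p1", "cancer"), ("p2", "cancer"), ("p2", "gastric"), ("p3", "gastric")]
pp, dd = project(pairs)   # p1-p2 share "cancer"; p2-p3 share "gastric"
```

Note that a disease of degree d generates d(d-1)/2 piRNA-piRNA edges, which is exactly the inflation the sampling procedure below is designed to control.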
It should be noted that most piRNAs point to only a few diseases. The default bipartite graph projection on a graph with such a highly skewed degree distribution may violate the assumption that piRNAs with similar functions are likely to be related to similar diseases. Specifically, for hub disease nodes in the piRNA-disease graph, a huge number of piRNA-piRNA connections, on the order of the square of the disease degree, will be generated by the projection rule. To control this inflation, we propose a subsampling strategy based on median node centrality. In general, for data with a long-tailed distribution, the median and the region around it (e.g., the box in a boxplot) represent the data well because the median is insensitive to extremes. Instead of applying the subgraph projection to all nodes, it is limited to a certain number of nodes with centralities around the median. For node centrality, we adopted the PageRank algorithm [29], which measures the importance of a node in relation to its linked nodes as follows:
$${\text{PR}}(i) = \alpha \sum\limits_{j \in M_{i} } {\frac{{{\text{PR}}(j)}}{L(j)}} + \frac{1 - \alpha }{N}$$
(3)
where Mi is the set of all nodes that link to node i, L(j) is the number of outgoing links of node j, N is the total number of nodes, and α is set to 0.85 by default.
In the piRNA-piRNA/disease-disease subgraph, we selected k nodes with PR values around the median PR. The value of k can be decided as follows:
$$k = \frac{{C_{pd} }}{{\left( {d_{\max } - d_{\min } } \right) * \mu }}$$
(4)
where, for the piRNA projection subgraph, Cpd is the number of known PDAs and dmax/dmin denote the maximum/minimum disease degree; for the disease projection subgraph, dmax/dmin correspondingly denote the maximum/minimum piRNA degree. μ is a dilution-factor hyperparameter chosen from {1.0, 10.0, 100.0, 1000.0, …}. To keep k an integer in a reasonable range, μ was tested at various values and set to 1.0 for the disease subgraph and 100.0 for the piRNA subgraph, respectively. The procedure for constructing the piRNA/disease subgraphs is presented as the following pseudocode:
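Independently of the pseudocode, a runnable sketch of Eqs. (3)-(4) might look as follows; the power-iteration PageRank is a simplified stand-in for a library implementation, and the adjacency is a plain dict of neighbor lists:

```python
from collections import Counter

def pagerank(adj, alpha=0.85, iters=100):
    """Power-iteration PageRank (Eq. 3) on {node: [out-neighbors]}."""
    nodes = list(adj)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {}
        for v in nodes:
            # Sum PR(j)/L(j) over all j linking to v.
            s = sum(pr[u] / len(adj[u]) for u in nodes if v in adj[u])
            new[v] = alpha * s + (1 - alpha) / n
        pr = new
    return pr

def sample_k(pairs, mu):
    """Eq. (4) for the piRNA projection: k = C_pd / ((d_max - d_min) * mu)."""
    d_deg = Counter(d for _, d in pairs)          # disease degrees
    c_pd = len(pairs)
    return int(c_pd / ((max(d_deg.values()) - min(d_deg.values())) * mu))
```

The projection would then be applied only to the k nodes whose PR values are closest to the median PR, as described above.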
Node feature construction
Similarity features between entities are crucial for characterizing the primary node representation and are complementary to topological structure information. To apply them appropriately, we treated them as similarity-profile node features and proposed a residual scaling-based feature augmentation algorithm to capture and enhance node diversity.
piRNA similarity profile
We adopted the Jaccard similarity coefficient [30] to calculate the k-mer similarity profile [31] for piRNAs. We first obtained the k-mer features from the original piRNA sequences, where k was empirically set to 3 (i.e., 3-mers). The Jaccard similarity coefficient between xi and xj is calculated as follows:
$${\text{S}}_{J} (i,j) = \frac{{|x_{i} \cap x_{j} |}}{{|x_{i} \cup x_{j} |}}$$
(5)
where xi and xj denote the binary feature vectors of entities i and j. \(|x_{i} \cap x_{j} |\) denotes the number of positions where the elements of both xi and xj equal 1, and \(|x_{i} \cup x_{j} |\) denotes the number of positions where the element of either xi or xj equals 1.
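A minimal sketch of the 3-mer profile and Eq. (5), representing the binary vectors as sets of present k-mers (the sequences are illustrative only):

```python
def kmer_set(seq, k=3):
    """All k-mers present in a piRNA sequence (binary profile as a set)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(x, y):
    """Eq. (5): |x ∩ y| / |x ∪ y| on binary profiles represented as sets."""
    return len(x & y) / len(x | y) if x | y else 0.0

s1, s2 = kmer_set("ACGTACG"), kmer_set("ACGTTTT")
sim = jaccard(s1, s2)   # shared 3-mers: ACG, CGT
```

Representing the binary vectors as sets makes the intersection/union counts in Eq. (5) direct set operations.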
Considering the sparsity of the piRNA similarity profile matrix, sparse entries can be estimated by the mathematical expectation of the Jaccard similarity coefficient and the Gaussian interaction profile (GIP) kernel similarity [32]. The GIP similarity between piRNAs i and j is calculated as follows:
$${\text{S}}_{PG} (i,j) = \exp \left( { - n_{d} \left\| {A(i) - A(j)} \right\|^{2} } \right)$$
(6)
where nd is a factor used to control the kernel bandwidth. We calculate nd by normalizing the original kernel bandwidth \(n^{{\prime }}_{d}\):
$$n_{d} = \frac{{n^{{\prime }}_{d} }}{{\frac{1}{k}\sum\limits_{i = 1}^{k} {\left\| {A(i)} \right\|^{2} } }}$$
(7)
where k denotes the number of piRNAs (the rows A(i) of the association matrix) and \(n^{{\prime }}_{d}\) is usually set to 1. Thus, the integrated piRNA similarity between i and j is obtained as follows:
$$S_{P} (i,j) = \left\{ {\begin{array}{ll} {\frac{1}{2}{\text{S}}_{PG} (i,j)} & {{\text{if}}\;{\text{S}}_{J} (i,j) = 0} \\ {S_{J} (i,j)} & {{\text{otherwise}}} \\ \end{array} } \right.$$
(8)
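The GIP kernel (Eqs. 6-7) and the integration rule (Eq. 8) can be sketched with NumPy; A is the binary piRNA-disease association matrix, and the toy inputs are illustrative:

```python
import numpy as np

def gip_kernel(A, bandwidth=1.0):
    """GIP similarity over the rows of A (Eqs. 6-7)."""
    nd = bandwidth / np.mean(np.sum(A ** 2, axis=1))     # Eq. (7)
    diff = A[:, None, :] - A[None, :, :]                 # pairwise row differences
    return np.exp(-nd * np.sum(diff ** 2, axis=2))       # Eq. (6)

def integrate(S_J, S_G):
    """Eq. (8): fall back to (1/2)*GIP wherever the Jaccard entry is zero."""
    return np.where(S_J == 0, 0.5 * S_G, S_J)

A = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 0]], dtype=float)
S_G = gip_kernel(A)   # rows with closer interaction profiles score higher
```

The diagonal of the GIP matrix is always 1, and more similar interaction profiles yield larger off-diagonal entries, as expected from a Gaussian kernel.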
Disease similarity profile
In PDA-PRGCN, each disease, together with all related annotation terms obtained from its MeSH descriptors, can be represented by a hierarchical directed acyclic graph (DAG). In general, for a given disease d, the DAG can be expressed as DAG = (T(d), E(d)), where T(d) denotes d itself together with all its ancestor nodes and E(d) denotes all relationships between nodes in DAG(d). The semantic contribution Dd(t) of a disease term t in the DAG to disease d is defined as follows:
$$D_{d} (t) = \left\{ {\begin{array}{ll} 1 & {{\text{if}}\;t = d} \\ {\max \left\{ {\Delta \cdot D_{d} (t^{{\prime }} )\left| {t^{{\prime }} \in {\text{children of}}\;t} \right.} \right\}} & {{\text{if}}\;t \ne d} \\ \end{array} } \right.$$
(9)
where \(\Delta\) is usually set to 0.5 according to previous studies. For disease d itself, the semantic contribution value is 1, and the contribution decreases as the distance between terms increases. Thus, the semantic value of disease d is defined as follows:
$${\text{DV}}(d) = \sum\nolimits_{t \in T(d)} {D_{d} (t)}$$
(10)
Following the method previously proposed [33], the semantic similarity score between disease i and j can be calculated by:
$$S_{D} (i,j) = \frac{{\sum\nolimits_{t \in T(i) \cap T(j)} {\left( {D_{i} (t) + D_{j} (t)} \right)} }}{{\sum\nolimits_{t \in T(i)} {D_{i} (t)} + \sum\nolimits_{t \in T(j)} {D_{j} (t)} }}$$
(11)
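A sketch of Eqs. (9)-(11), assuming the DAG is given as a `{term: [parents]}` map extracted from the MeSH hierarchy (the terms below are illustrative):

```python
def contributions(d, parents, delta=0.5):
    """D_d(t) for every term t in T(d) (Eq. 9), walking from d up its ancestors."""
    D = {d: 1.0}
    frontier = [d]
    while frontier:
        nxt = []
        for t in frontier:
            for p in parents.get(t, []):
                val = delta * D[t]
                if val > D.get(p, 0.0):   # keep the max over all child paths
                    D[p] = val
                    nxt.append(p)
        frontier = nxt
    return D

def semantic_sim(di, dj, parents, delta=0.5):
    """Eq. (11): shared contributions over the sum of semantic values (Eq. 10)."""
    Di = contributions(di, parents, delta)
    Dj = contributions(dj, parents, delta)
    shared = set(Di) & set(Dj)
    return sum(Di[t] + Dj[t] for t in shared) / (sum(Di.values()) + sum(Dj.values()))

parents = {"gastric": ["cancer"], "lung": ["cancer"]}   # toy two-level MeSH DAG
sim = semantic_sim("gastric", "lung", parents)
```

Here the two toy diseases share only the ancestor term, each contributing 0.5, so the similarity is (0.5 + 0.5) / (1.5 + 1.5).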
Similarly, applying the same criteria as for piRNAs (with the disease GIP similarity SDG computed analogously to Eq. 6), we finally obtain the integrated disease similarity between i and j, using the semantic similarity of Eq. (11) whenever it is nonzero:
$$S_{D} (i,j) = \left\{ {\begin{array}{ll} {\frac{1}{2}{\text{S}}_{DG} (i,j)} & {{\text{if}}\;{\text{S}}_{D} (i,j) = 0} \\ {S_{D} (i,j)} & {{\text{otherwise}}} \\ \end{array} } \right.$$
(12)
Residual scaling-based node feature augmentation
In general, each individual residual profile contributes differently to the similarity matrix. As a kind of node-specific residual, it can efficiently represent the proportions of each row and thus characterize the importance of the similarity profile for each node. Therefore, for each row of the piRNA/disease similarity profile matrix, we have:
$$S^{{\prime }}_{i,j} = {\text{diag}} \left( {R^{i} } \right)S^{pp}_{i,j}$$
(13)
where Spp (Sdd for diseases) is the original similarity profile matrix of piRNAs, diag denotes the diagonalization operator, and Ri denotes the residual profile, defined as follows:
$$R^{i} = \left\| {\ln \left( {\delta^{i} + 1} \right) - \mu } \right\|*\Delta$$
(14)
$$\delta^{i} = \frac{{\sum\nolimits_{j} {A^{pd}_{i,j} } }}{{\sum\nolimits_{j} {S^{pp}_{i,j} } }}$$
(15)
where Apd is the known PDA matrix. Here, we only consider the similarity distribution in the ith row of Spp. δi describes the ratio of the row degree distribution, while \(\Delta\) ∈ {0.1, 10, 100} is a scaling factor used to shrink or enlarge the differences in distribution among piRNAs/diseases. The non-negative residual Ri is derived from δi and the mean degree µ. Here, ln(·) and ||·|| denote the natural logarithm and the non-negative absolute value operator, respectively. The procedure of residual scaling-based feature augmentation is presented as the following pseudocode (a Stacked Autoencoder [34] was adopted to implement noise reduction):
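Independently of the pseudocode, the core of Eqs. (13)-(15) can be sketched with NumPy; here the mean degree µ is taken as the mean row sum of Apd, an assumption where the text leaves it unspecified:

```python
import numpy as np

def residual_scale(S, A_pd, Delta=10.0):
    """Rescale each row of the similarity profile S by its residual R^i."""
    delta = A_pd.sum(axis=1) / S.sum(axis=1)        # Eq. (15): row-degree ratio
    mu = A_pd.sum(axis=1).mean()                    # mean degree (assumed)
    R = np.abs(np.log(delta + 1.0) - mu) * Delta    # Eq. (14)
    return np.diag(R) @ S                           # Eq. (13)
```

Rows of the association matrix with fewer known PDAs receive a different residual, so the scaling differentiates otherwise similar similarity rows.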
Graph convolution network
We adopted a GCN to learn the final embeddings and performed the link-prediction task in an end-to-end manner. Specifically, we used a GCN with two convolution layers. Given the graph adjacency matrix A and feature matrix H, with trainable weight matrix W and non-linear activation function σ, the neural network f(⋅) is defined as follows:
$$H^{(l + 1)} = f\left( {H^{(l)} ,A} \right) = \sigma \left( {G\,H^{(l)} W^{(l)} } \right)$$
(16)
where \(G = D^{ - 1/2} A^{{\prime }} D^{ - 1/2}\) with \(A^{{\prime }} = A + I\), D is the diagonal degree matrix of \(A^{{\prime }}\), and ReLU is adopted as σ.
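One propagation step of Eq. (16) can be sketched with NumPy; the random weight matrix stands in for the trained \(W^{(l)}\), and the toy graph is illustrative:

```python
import numpy as np

def gcn_layer(A, H, W):
    """Single graph-convolution layer with symmetric normalization (Eq. 16)."""
    A_hat = A + np.eye(A.shape[0])                  # A' = A + I (self-loops)
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    G = D_inv_sqrt @ A_hat @ D_inv_sqrt             # G = D^{-1/2} A' D^{-1/2}
    return np.maximum(G @ H @ W, 0.0)               # sigma = ReLU

rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0], [1.0, 0.0]])              # two connected nodes
H = rng.standard_normal((2, 4))                     # node features
W = rng.standard_normal((4, 3))                     # layer weights (untrained)
H1 = gcn_layer(A, H, W)
```

Stacking two such layers, as in the model, mixes information from two-hop neighborhoods before the features are passed to the predictor.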
To improve the efficacy of PDA prediction, we designed the predictor of PDA-PRGCN by applying a three-layer fully-connected (FC) neural network, on top of the feature extractor (i.e., the GCN), to output the probability of potential links in the PDA graph. Different from the conventional dot-product method [24, 35], our predictor works end-to-end, allowing better integration of the embeddings and joint optimization of the proposed model with the downstream task.
Dual-loss mechanism
Together with the cross-entropy loss function, applied to obtain the optimal classifications, the sensitivity-specificity loss function [36] was jointly adopted to train PDA-PRGCN. The dual loss is defined as follows:
$$Loss = L_{CE} + L_{SS}$$
(17)
$$L_{CE} = - \sum\limits_{i = 1}^{N} {\left[ {y^{(i)} \log \hat{y}^{(i)} + \left( {1 - y^{(i)} } \right)\log \left( {1 - \hat{y}^{(i)} } \right)} \right]}$$
(18)
$$L_{SS} = w \cdot sp + (1 - w) \cdot se$$
(19)
where \(y^{(i)}\) is the true label and \(\hat{y}^{(i)}\) is the predicted probability. w ∈ {0.1, 0.01} is the sensitivity ratio, which controls the balance between sensitivity and specificity. Here, sp denotes \({\text{Specificity}} = \frac{{{\text{TN}}}}{{{\text{TN}} + {\text{FP}}}}\) and se denotes \({\text{Sensitivity}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}\), with TP, TN, FP and FN denoting true positives, true negatives, false positives and false negatives, respectively.
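A sketch of Eqs. (17)-(19) with NumPy; note that the sp/se terms here use hard 0.5-thresholded predictions for illustration, whereas a training implementation would need differentiable surrogates:

```python
import numpy as np

def dual_loss(y, y_hat, w=0.1, eps=1e-12):
    """Eq. (17): L = L_CE + L_SS on binary labels y and probabilities y_hat."""
    # Eq. (18): cross-entropy over all samples
    ce = -np.sum(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    # Confusion counts from thresholded predictions (illustrative only)
    pred = (y_hat >= 0.5).astype(float)
    tp = np.sum((pred == 1) & (y == 1))
    fn = np.sum((pred == 0) & (y == 1))
    tn = np.sum((pred == 0) & (y == 0))
    fp = np.sum((pred == 1) & (y == 0))
    sp = tn / (tn + fp + eps)                       # specificity
    se = tp / (tp + fn + eps)                       # sensitivity
    return ce + (w * sp + (1 - w) * se)             # Eq. (19)
```

For well-separated predictions, sp = se = 1 and the total reduces to the cross-entropy term plus one.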