Datasets
To better train our BiGAN model, we collected three experimentally validated datasets from MNDR v3.0, Lnc2Cancer, and LncRNADisease. Below is a brief description of the datasets used.
The first dataset is from the Mammalian ncRNA-Disease Repository (MNDR), whose coverage and annotation were described by Lin et al. on 24 August 2020 [39]. We extracted association information about human lncRNA-disease pairs from MNDR, which consists of two subsets. One subset contains experimentally verified associations, covering 742 human diseases, 25,494 human lncRNAs, and 39,783 lncRNA-disease associations; it can be used as a training set. The other subset contains predicted lncRNA-disease associations, covering 231 human diseases, 17,713 human lncRNAs, and 52,144 associations; it can be used as a validation set.
The second dataset was released on 8 January 2019 and contains experimentally validated lncRNA-disease correlations downloaded from LncRNADisease v2.0 [40]. From this dataset, we also collected a special class of ncRNAs, circRNAs, whose sequences are sufficiently long (>200 nt). After removing lncRNA-disease pairs that were not labelled with IDs or that lacked features, we deleted duplicate samples describing the same lncRNA-disease relationship according to known experimental evidence. From this, we obtained 205,959 associations between 529 human diseases and 19,166 lncRNAs. In addition, 1004 associations between 823 circRNAs and 529 human diseases were included. This dataset contains more comprehensive information than the other two datasets.
The third dataset was published on 30 June 2020 and contains experimentally proven lncRNA-disease correlations based on the Lnc2Cancer v3.0 dataset [41]. After removing lncRNA-disease pairs that were not labelled with IDs or that lacked features, we deleted duplicate samples describing the same lncRNA-disease relationship according to known experimental evidence. As a result, 216 human diseases, 2659 human lncRNAs, and 9254 human lncRNA-disease associations were obtained. Compared with Lnc2Cancer v2.0, published in 2018, the number of diseases increased by 51, the number of lncRNAs increased by more than 1000, and the number of lncRNA-disease associations nearly doubled. This allowed us to collect enough data to learn the features of the latent space between lncRNAs and diseases when training the BiGAN.
LncRNA-disease association
According to the sorted dataset, the interaction information between diseases and lncRNAs was constructed into a matrix \(A \in R^{nd \times nl}\), where the rows represent diseases and the columns represent lncRNAs. If there was an experimentally verified lncRNA-disease association, the corresponding value in A was set to 1; otherwise, it was set to 0, as shown in Fig. 5A.
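As a minimal sketch of this construction (the disease and lncRNA names and the `disease_index`/`lncrna_index` maps are hypothetical placeholders), the association matrix can be built with NumPy as follows:

```python
import numpy as np

# Hypothetical index maps; in the study, nd and nl are the numbers of
# diseases and lncRNAs in the sorted dataset.
disease_index = {"gastric cancer": 0, "glioma": 1}        # nd = 2
lncrna_index = {"HOTAIR": 0, "MALAT1": 1, "H19": 2}       # nl = 3

# Illustrative experimentally verified lncRNA-disease pairs.
known_pairs = [("gastric cancer", "HOTAIR"), ("glioma", "H19")]

# Rows of A are diseases, columns are lncRNAs (A in R^{nd x nl}).
A = np.zeros((len(disease_index), len(lncrna_index)), dtype=np.int8)
for disease, lncrna in known_pairs:
    A[disease_index[disease], lncrna_index[lncrna]] = 1
```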
LncRNA sequence similarity
An increasing number of studies have shown that two diseases with similar pathologies may be linked to similar lncRNAs. Therefore, the similarity between different lncRNAs is one of the important characteristics for lncRNA-disease association prediction. Between any two strings, the Levenshtein distance is the minimum cost required to convert one string into the other through insertions, deletions, or substitutions. To investigate the deeper similarity between lncRNAs, we used the Levenshtein distance to calculate the similarity between two lncRNA sequences, setting the substitution cost to 2 and the insertion and deletion costs to 1. The similarities form a matrix \(L_{sim} \in R^{nl \times nl}\), where the entry for the ith and jth lncRNAs, \(L_{sim}(l_i,l_j)\), is calculated as follows:
$$\begin{aligned} L_{sim}(l_i,l_j)=1- \frac{x}{len(l_i)+len(l_j)} \end{aligned}$$
(3)
where x represents the minimum cost required to convert one lncRNA sequence into the other and len represents the sequence length of an lncRNA.
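The sketch below illustrates this computation in plain Python, assuming the lncRNA sequences are given as strings; the helper names `weighted_levenshtein` and `lnc_seq_similarity` are illustrative, and the costs match those stated above (substitution 2, insertion/deletion 1):

```python
def weighted_levenshtein(s, t, sub_cost=2, indel_cost=1):
    """Minimum edit cost to turn s into t, with substitution cost 2
    and insertion/deletion cost 1, via standard dynamic programming."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * indel_cost
    for j in range(1, n + 1):
        dp[0][j] = j * indel_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else sub_cost)
            dp[i][j] = min(sub,
                           dp[i - 1][j] + indel_cost,
                           dp[i][j - 1] + indel_cost)
    return dp[m][n]

def lnc_seq_similarity(seq_i, seq_j):
    """Eq. (3): L_sim = 1 - x / (len(l_i) + len(l_j))."""
    x = weighted_levenshtein(seq_i, seq_j)
    return 1.0 - x / (len(seq_i) + len(seq_j))
```

With these costs, the edit cost never exceeds \(len(l_i)+len(l_j)\) (deleting one sequence entirely and inserting the other), so the similarity stays within [0, 1].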
Disease semantic similarity
In 2010, Schlicker et al. found that the more similar two disease phenotypes are, the more similar the underlying gene dysfunction [42]. Gene Ontology annotations provide a way to obtain the semantic similarity of genes [43]. Thus, some researchers employ directed acyclic graphs (DAGs) to represent diseases, and the Jaccard correlation coefficient has been used to calculate the functional similarity of diseases. In this study, we applied DAGs to calculate semantic similarity scores for diseases. Let \(D_{sim} \in R^{nd \times nd}\) be the disease similarity matrix; the similarity between the ith and jth diseases, \(D_{sim}(d_i,d_j)\), is calculated as follows:
$$\begin{aligned} D_{sim}(d_i,d_j)= \frac{\sum _{x\in G_{d_i} \cap G_{d_j}} \left( SVD_i(x)+SVD_j(x) \right) }{ \sum _{x \in G_{d_i}} SVD_i(x) + \sum _{x \in G_{d_j}} SVD_j(x) } \end{aligned}$$
(4)
where \(G_{d_i}\) and \(G_{d_j}\) denote the DAGs of diseases \(d_i\) and \(d_j\), respectively. When comparing diseases \(d_i\) and \(d_j\), \(SVD_i(x)\) denotes the disease semantic value of a term \(x \in G_{d_i}\), and \(SVD_j(x)\) denotes the disease semantic value of \(x \in G_{d_j}\). We can calculate the semantic value of a term in the DAG of a disease d by using the following equation:
$$\begin{aligned} SVD(x)= {\left\{ \begin{array}{ll} \max \limits _{d^\prime \in children(x)} \{\mu \cdot SVD(d^\prime )\}, & \text{ if } x \ne d \\ 1, & \text{ if } x = d \end{array}\right. } \end{aligned}$$
(5)
where \(children(x)\) denotes the set of children of term x within the DAG of disease d, and \(\mu\) represents the semantic contribution factor. Following previous research, we set it to 0.5 [44].
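The sketch below illustrates Eqs. (4) and (5) on a toy DAG; the `parents` mapping and the disease names are hypothetical, and `semantic_values` propagates the maximum \(\mu\)-decayed value from the disease up through its ancestors:

```python
def semantic_values(parents, d, mu=0.5):
    """SVD(x) for every term x in the DAG of disease d (Eq. 5): the
    disease itself scores 1, and each ancestor keeps the maximum
    mu-decayed value propagated up from its children."""
    sv = {d: 1.0}
    frontier = [d]
    while frontier:
        nxt = []
        for term in frontier:
            for p in parents.get(term, []):
                val = mu * sv[term]
                if val > sv.get(p, 0.0):
                    sv[p] = val
                    nxt.append(p)
        frontier = nxt
    return sv

def disease_similarity(sv_i, sv_j):
    """Eq. (4): shared semantic values over total semantic values."""
    shared = set(sv_i) & set(sv_j)
    num = sum(sv_i[x] + sv_j[x] for x in shared)
    den = sum(sv_i.values()) + sum(sv_j.values())
    return num / den

# Toy DAG: both cancers descend from "disease" via "cancer".
parents = {"gastric cancer": ["cancer"], "lung cancer": ["cancer"],
           "cancer": ["disease"]}
sv_g = semantic_values(parents, "gastric cancer")
sv_l = semantic_values(parents, "lung cancer")
print(disease_similarity(sv_g, sv_l))  # shared terms: cancer, disease
```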
Gaussian interaction profile kernel similarity
Similar lncRNAs tend to be associated with diseases that have similar pathological characteristics, and vice versa. Based on this assumption, kernel similarities for lncRNAs and for diseases can be calculated with the Gaussian interaction profile (GIP) kernel. The GIP kernel similarities were computed from the lncRNA-disease interaction matrix obtained from the LncRNADisease dataset. The GIP similarity \(GKL(l_i,l_j)\) of lncRNAs can be computed as follows:
$$\begin{aligned} GKL(l_i,l_j) = \text{ exp }(-\lambda ||A(l_i)-A(l_j)||^2) \end{aligned}$$
(6)
where \(A(l_i)\) and \(A(l_j)\) represent the ith and jth columns of the association matrix A. The parameter \(\lambda\) controls the kernel bandwidth and is normalized by the average number of diseases associated with each lncRNA; it is defined as follows:
$$\begin{aligned} \lambda = \frac{1}{\frac{1}{nl} \sum _{i=1}^{nl} ||A(l_i)||^2} \end{aligned}$$
(7)
where nl denotes the number of lncRNAs.
Similarly, we can obtain the GIP kernel similarity of disease \(d_i\) and disease \(d_j\) as follows:
$$\begin{aligned} GKD(d_i,d_j) = \text{ exp }(-\lambda ||A(d_i)-A(d_j)||^2) \end{aligned}$$
(8)
where \(A(d_i)\) and \(A(d_j)\) denote the ith and jth rows of the lncRNA-disease association matrix A. Here, \(\lambda\) again controls the kernel bandwidth and is normalized by the average number of lncRNAs associated with each disease; it can be calculated as follows:
$$\begin{aligned} \lambda = \frac{1}{\frac{1}{nd} \sum _{i=1}^{nd} ||A(d_i)||^2} \end{aligned}$$
(9)
where nd denotes the number of diseases.
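A vectorized NumPy sketch of Eqs. (6)-(9), assuming the binary association matrix A from above (rows are diseases, columns are lncRNAs), might look like this:

```python
import numpy as np

def gip_kernel(profiles):
    """GIP kernel similarity (Eqs. 6-9) between the row vectors of
    `profiles`. Pass A for disease profiles (GKD) and A.T for lncRNA
    profiles (GKL)."""
    profiles = profiles.astype(float)
    sq_norms = np.sum(profiles ** 2, axis=1)
    lam = 1.0 / sq_norms.mean()        # bandwidth, Eqs. (7)/(9)
    # Squared Euclidean distances between all pairs of profiles.
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-lam * np.maximum(d2, 0.0))

# GKL = gip_kernel(A.T); GKD = gip_kernel(A)
```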
Integrated similarity
From the above, we gathered the lncRNA sequence similarity, the disease semantic similarity, and the GIP kernel similarities of lncRNAs and diseases. We then obtained the integrated similarity of lncRNAs (Ls) and the integrated similarity of diseases (Ds) (Fig. 5B), calculated as follows:
$$\begin{aligned} Ls(l_i,l_j)&= \frac{L_{sim}(l_i,l_j)+GKL(l_i,l_j)}{2} \end{aligned}$$
(10)
$$\begin{aligned} Ds(d_i,d_j)&= \frac{D_{sim}(d_i,d_j)+GKD(d_i,d_j)}{2} \end{aligned}$$
(11)
The disease similarity vector for disease \(d_i\) contains the similarity values of all other diseases to \(d_i\). Likewise, the lncRNA similarity vector for lncRNA \(l_i\) contains the similarity values of all other lncRNAs to \(l_i\). Therefore, for each lncRNA-disease pair we concatenated these similarity vectors to generate a feature vector of size \(nd+nl\), where nd and nl are the numbers of diseases and lncRNAs, as shown in Fig. 5C. There were \(nd \times nl\) samples altogether, each corresponding to one lncRNA-disease pair.
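Assuming the similarity matrices from the previous steps are available as NumPy arrays (random placeholders are used here so the sketch runs standalone), Eqs. (10) and (11) and the pairing in Fig. 5C reduce to:

```python
import numpy as np

nd, nl = 4, 6                                       # illustrative sizes
rng = np.random.default_rng(0)
L_sim, GKL = rng.random((nl, nl)), rng.random((nl, nl))  # placeholders
D_sim, GKD = rng.random((nd, nd)), rng.random((nd, nd))  # placeholders
A = rng.integers(0, 2, (nd, nl))                    # placeholder matrix

Ls = (L_sim + GKL) / 2.0    # integrated lncRNA similarity, Eq. (10)
Ds = (D_sim + GKD) / 2.0    # integrated disease similarity, Eq. (11)

# One feature vector of size nd + nl per lncRNA-disease pair (Fig. 5C):
# the disease's similarities to all diseases, then the lncRNA's
# similarities to all lncRNAs; nd * nl samples in total.
features = np.array([np.concatenate([Ds[i], Ls[j]])
                     for i in range(nd) for j in range(nl)])
labels = A.reshape(-1)      # 1 for known associations, else 0
```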
BiGAN
In 2018, Chen et al. proposed using principal component analysis (PCA) to extract features from the GIP kernel similarity [45]. However, because PCA is linear, latent lncRNA-disease correlation features are difficult to mine with it. As a nonlinear generalization of PCA, an auto-encoder is an unsupervised neural network model consisting mainly of an encoder and a decoder. This type of network has two advantages in dealing with the features of lncRNA-disease associations [46]. First, auto-encoders are good at learning biological patterns without requiring annotations. Second, they can automatically recognize the comprehensive similarity characteristics of lncRNAs and diseases, eliminate noise, and reduce dimensionality, which addresses the considerable noise that features extracted from large datasets may contain. To further study unsupervised learning models, we developed a novel generative adversarial network model inspired by the auto-encoder.
The main framework of the BiGAN
In this study, we propose the bidirectional generative adversarial network (BiGAN) model for predicting lncRNA-disease associations. The BiGAN consists of an encoder, a generator, and a discriminator; its main framework is shown in Fig. 6. The encoder maps an original data point x to a latent representation z. The generator captures features in the latent space to generate new lncRNA-disease associations. The discriminator not only discriminates in the traditional data space (x versus G(z)), but also in the joint data and latent space ((x, E(x)) versus (G(z), z)). The latent component is both the encoder output E(x) and the generator input z.
The encoder and the generator cannot “communicate” with each other directly. However, they learn to invert each other through the joint probability distribution; in other words, E(G(z)) and G(E(x)) can be computed to fool the BiGAN discriminator. In our model, an encoder \(E:\Omega _X \rightarrow \Omega _Z\) and a generator \(G: \Omega _Z \rightarrow \Omega _X\) are trained at the same time. The BiGAN encoder induces a distribution \(P_E(Z|X) = \delta (Z-E(X))\) mapping data points x into the latent feature space of the generator. The BiGAN generator induces a distribution \(Q_G(X|Z) = \delta (X-G(Z))\) that decodes randomly sampled noise z into new lncRNA-disease associations. The discriminator takes joint (data, latent) pairs as input to predict the distribution \(P_D(Y|X,Z)\), where Y equals 0 if X is the output of the generator (\(G(z),z\sim p_z\)) and Y equals 1 if X is sampled from the encoder data distribution \(p_x\). Thus, we can define the BiGAN training objective as a minimax game:
$$\begin{aligned} \min \limits _{G,E} \max \limits _{D} V(D,E,G) \end{aligned}$$
(12)
where V(D, E, G) can be computed based on the following formulas:
$$\begin{aligned} V(D,E,G)&= {E}_{X \sim p_X} [logD(X,E(X))] + {E}_{Z \sim p_Z}[log(1-D(G(Z),Z))] \end{aligned}$$
(13)
$$\begin{aligned} logD(X,E(X))&= {E}_{Z \sim p_E(\cdot |X)} [logD(X,Z)] \end{aligned}$$
(14)
$$\begin{aligned} log(1-D(G(Z),Z))&= {E}_{X \sim p_G(\cdot |Z)} [log(1-D(X,Z))] \end{aligned}$$
(15)
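As a minimal PyTorch sketch, assuming D, E, and G are callables with the signatures used below (not the authors' exact implementation), the value function of Eqs. (12)-(15) can be written as:

```python
import torch

def bigan_value(D, E, G, x, z, eps=1e-8):
    """V(D, E, G) from Eq. (13): the discriminator scores real pairs
    (x, E(x)) against generated pairs (G(z), z). D maximizes this
    value while E and G minimize it (Eq. 12)."""
    real_score = D(x, E(x))        # should approach 1
    fake_score = D(G(z), z)        # should approach 0
    return (torch.log(real_score + eps).mean()
            + torch.log(1.0 - fake_score + eps).mean())
```

In practice, D ascends this value while E and G descend it in alternating steps; the non-saturating variant (maximizing \(\log D(G(z),z)\) for the generator) is a common stabilization.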
In contrast to other advanced unsupervised computing models, the BiGAN propagates gradient information through both the encoder and the generator during adversarial training, which helps ensure appropriate weight updates.
More details of the encoder, generator, and discriminator
Encoder In the similarity feature vectors, each lncRNA carries the similarity and position information of all other lncRNAs; likewise, each disease carries the similarity and position information of all other diseases. As mentioned above, the BiGAN encoder is one of the two parts of an auto-encoder. Its main functions are to compress data, eliminate noise, and learn the features of the latent space. We take the similarity feature vectors of the samples as input so that the encoder can fully learn the parameters of the similarity vectors. In this way, the encoder can effectively map the data points into the latent feature space. The structure of the BiGAN encoder is shown in Fig. 7A. The encoder is composed of three fully connected layers, and the output of each layer can be computed with the following formula:
$$\begin{aligned} E(x) = W^Ex+b^E \end{aligned}$$
(16)
where x denotes the similarity features of lncRNA-disease pairs. \(W^E\) and \(b^E\) represent the encoder weights and bias, respectively.
The dimension of the lncRNA-disease similarity feature vectors is compressed into a lower-dimensional vector after passing through each layer of the encoder. A trained encoder can predict the feature representations of data by capturing semantic attributes. The dense information in the compressed low-dimensional vectors is more conducive to learning the mapping relationship of the latent space. To mine the representation of the latent space more effectively, we set the number of neurons in the final layer to 100. We employed ReLU as the activation function in the BiGAN model, defined as follows:
$$\begin{aligned} ReLU(y) = {\left\{ \begin{array}{ll} y\quad y\ge 0 \\ 0\quad y<0 \end{array}\right. } \end{aligned}$$
(17)
In addition, during training the encoder maps each input according to the distribution \(P_E(Z|X) = \delta (Z-E(X))\) and outputs the latent features E(x). Ultimately, we can obtain many data pairs (x, E(x)).
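A compact PyTorch sketch of such an encoder is given below; the hidden-layer widths are illustrative assumptions, while the 100-unit latent layer follows the text:

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Three fully connected layers (Eq. 16) with ReLU activations
    (Eq. 17). Hidden widths are illustrative; only the 100-unit
    latent layer is stated in the text. in_dim = nd + nl."""
    def __init__(self, in_dim, latent_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, x):
        return self.net(x)
```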
Generator In most generative adversarial network (GAN) models, the role of the generator is to learn the features of the original data and generate new data based on the learned characteristics. In the BiGAN model, however, the generator takes randomly sampled noise as input. As shown in Fig. 7B, the generator mirrors the encoder's network structure. The output of the generator is calculated as follows:
$$\begin{aligned} G(z) = W^Gz+b^G \end{aligned}$$
(18)
where z is the feature of the latent space. \(W^G\) and \(b^G\) denote the weights and bias of the generator, respectively.
In contrast to the encoder, each layer in the generator increases the dimension of the latent representation, and the final output dimension matches that of the original similarity feature vector. The representation with noise is decoded by the generator into new lncRNA-disease associations, yielding a series of data pairs (G(z), z).
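A matching generator sketch, again with illustrative hidden widths, could look as follows:

```python
import torch.nn as nn

class Generator(nn.Module):
    """Mirror of the encoder (Eq. 18): each layer widens the latent
    code until the output matches the original feature dimension
    (out_dim = nd + nl). Hidden widths are illustrative."""
    def __init__(self, latent_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, out_dim),
        )

    def forward(self, z):
        return self.net(z)
```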
Discriminator The two kinds of data pairs mentioned above are taken as inputs to fool the discriminator. The discriminator judges whether the input data are real: if it decides a data pair comes from the encoder, its output is set to 1; if it decides the pair comes from the generator, the output is set to 0. The structure of the discriminator is shown in Fig. 7C, where the sigmoid function is defined as follows:
$$\begin{aligned} sigmoid(\theta ) = \frac{1}{1+e^{-\theta }} \end{aligned}$$
(19)
where \(\theta\) is the input of the sigmoid function.
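Finally, a discriminator sketch over joint (data, latent) pairs, with an illustrative hidden width:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Scores a joint (data, latent) pair; the final sigmoid (Eq. 19)
    pushes the output toward 1 for encoder pairs (x, E(x)) and toward
    0 for generator pairs (G(z), z). Hidden width is illustrative."""
    def __init__(self, in_dim, latent_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=1))
```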
The BiGAN encoder has a strong representation learning ability for the latent associations between lncRNAs and diseases. The BiGAN generator extracts features from the joint data and latent space to generate new lncRNA-disease associations. Finally, \(z=E(G(z))\) and \(x=G(E(x))\) hold under the joint probability distribution, yielding a bidirectional structure; a formal argument can be found in the study of Donahue et al. According to our experiments, the BiGAN is an unsupervised feature learning model with strong robustness and representation learning ability. Compared with other computing models, the BiGAN performs remarkably well.