CRPGCN: predicting circRNA-disease associations using graph convolutional network based on heterogeneous network
BMC Bioinformatics volume 22, Article number: 551 (2021)
Abstract
Background
Existing studies show that circRNAs can serve as disease biomarkers and play a prominent role in the diagnosis and treatment of diseases. However, the relationships between the vast majority of circRNAs and diseases remain unclear, and more experiments are needed to study the mechanisms of circRNAs. Some researchers use the attributes of circRNAs and diseases to predict their associations computationally. Nonetheless, most existing methods of this kind use only limited information about the attributes of circRNAs, which affects the accuracy of the final predictions. Other researchers identify the associations between circRNAs and diseases experimentally, but such methods are usually expensive and time-consuming. Given these shortcomings, follow-up research is needed to propose a more efficient computational method for predicting the associations between circRNAs and diseases.
Results
In this study, a novel method, CRPGCN, is proposed to predict the associations between circRNAs and diseases. It is based on a Graph Convolutional Network (GCN) combined with Random Walk with Restart (RWR) and Principal Component Analysis (PCA). In CRPGCN, the RWR algorithm is first used to strengthen the similarity associations between each computed node and its neighbours. PCA is then used to reduce dimensionality and extract features, which brings circRNAs with higher similarity closer to their associated diseases. Finally, the GCN is used to learn the features of circRNAs and diseases and to calculate the final similarity scores; the training data are built from the adjacency matrix, the similarity matrices and the feature matrices as a heterogeneous adjacency matrix and a heterogeneous feature matrix.
Conclusions
Under 2-fold, 5-fold and 10-fold cross-validation, the area under the ROC curve of CRPGCN is 0.9490, 0.9720 and 0.9722, respectively. The CRPGCN method is effective at predicting the associations between circRNAs and diseases.
Background
With the advancement of science and technology, bioinformatics is increasingly at the forefront of scientific research. The relationships between diseases and drugs [1] and between RNAs and diseases [2,3,4] play an increasingly important role in understanding and treating human diseases. Therefore, more and more scholars are investing in bioinformatics research [5, 6]. In particular, circRNAs, a class of non-coding RNAs (ncRNAs), have higher stability and integrity than linear ncRNAs. Therefore, circRNAs can be used as disease biomarkers and have great potential in the treatment and diagnosis of diseases.
Although the formation and characteristics of circRNAs have largely been established through extensive research, many of their biological functions remain unclear. A large number of biologists have verified associations between circRNAs and diseases experimentally. Recently, some researchers pointed out that certain functions of ciRS-7 are related to human pathology and the development of cancer [7], and further studies have uncovered its role in regulating diseases and its mechanism in the development of related diseases. The functions of various other circRNAs are also being investigated. However, laboratory consumables are usually disposable, and even reusable laboratory equipment needs manual maintenance. As the number of experiments increases, such experimental methods therefore require a great deal of time and resources, resulting in high costs. Consequently, it is necessary to study the relationships between circRNAs and diseases using computational methods.
Recently, an increasing number of researchers have studied the relationships between circRNAs and diseases using computational methods. Lu et al. propose a method for predicting circRNA-disease associations based on sequence and ontology representations, in which k-mers are used to reduce dimensionality, Convolutional Neural Networks (CNN) are applied to extract features, and a Long Short-Term Memory (LSTM) network is then used for feature learning [8]. Zhang et al. propose the PDCPGWNNM method [9], which builds circRNA-disease graph-structured data from circRNA-miRNA interactions and the regulatory relationships of miRNAs in diseases, and uses the Weighted Nuclear Norm Minimization (WNNM) model for prediction. Lei et al. propose the CDWBMS method [10], which uses a heterogeneous network to integrate the relationships between circRNAs and diseases and predicts them with an improved Weighted Biased Meta-Structure (WBMS) search algorithm. Wang et al. propose an algorithm based on Generative Adversarial Networks (GAN), which adopts the Extreme Learning Machine (ELM) classifier for prediction [11]. Wei et al. propose a method called iCircDALTR [12], which utilizes a Learning to Rank (LTR) algorithm to rank the associations based on various predictive variables and characteristics in a supervised manner.
In addition to the above studies, the Graph Convolutional Network (GCN) [13], Random Walk with Restart (RWR) [14] and Principal Component Analysis (PCA) [15] have also played an important role. Jin et al. propose the NIMCGCN method, which predicts miRNA-disease associations based on Neural Inductive Matrix Completion (NIMC) with GCN [16]. Wang et al. propose a computational method referred to as GCNCDA [17], which is based on Fast learning with Graph Convolutional Networks (FastGCN) combined with the Forest by Penalizing Attributes (Forest PA) classifier to predict potential circRNA-disease associations. Pan et al. propose an updated predictor, DimiG 2.0 [18], which uses a semi-supervised multi-label GCN to infer the relationships between miRNAs and diseases on the interaction network between protein-coding genes (PCGs) and miRNAs.
RWR can capture the multifaceted relationships between circRNAs or between diseases: it treats the circRNA matrix or the disease matrix as a graph and captures information about the overall structure of that graph. Examples of RWR-based methods include RWRKNN [19], IIRWR [20], TRWRMB [21] and MRWMDA [22]. In this paper, the RWR algorithm is used to calculate the similarity between circRNAs and the similarity between diseases in preparation for the subsequent PCA feature extraction.
PCA has played an important role in many different research directions [23, 24]. The circRNAs and diseases in this paper have many different attributes. If these data are analysed separately, their information may not be fully utilised and some data will be isolated, which biases the results to varying degrees. Therefore, the PCA algorithm is used to analyse the original data comprehensively while also reducing its dimensionality.
Based on a review of the above methods, a novel and reliable method called CRPGCN is proposed in this paper, which uses Graph Convolutional Networks (GCN) to predict the associations between circRNAs and diseases. Among existing algorithms, GCNCDA, for example, uses the GCN algorithm for feature extraction and the Forest PA classifier to classify the features, but it does not consider the associations between neighbouring nodes. In contrast, CRPGCN maximises the performance of GCN by first extracting features and reducing noise in the associations between circRNAs and diseases, and then performing feature learning that takes the associations between neighbouring nodes into account. Furthermore, the comparative experiments in this paper show that the CRPGCN method outperforms some advanced GCN-based methods.
The main contributions of this work are summarized as follows:

The CRPGCN method incorporates the RWR similarity calculation method and the PCA feature extraction method, allowing each computed node to better combine the similarity of its neighbouring nodes while greatly reducing the impact of similarity noise on the prediction results.

The CRPGCN algorithm improves prediction accuracy and has the highest AUC values and AUPR values when compared to advanced algorithms.

The CRPGCN algorithm is more stable than some of the advanced algorithms, and its AUC values remain stable across different datasets and in comparisons with a variety of methods.

By comparing various evaluation metrics, the CRPGCN algorithm outperforms other advanced algorithms in terms of overall performance.
Benchmark datasets
The selection of the dataset is one of the keys to studying and predicting circRNA-disease associations. The gene-based circRNA similarity is the basis of the comprehensive similarity matrix in this paper, and it makes an important contribution in the study by Ding et al. [25]. Meanwhile, circR2Disease [26] can be used to construct the gene-based circRNA similarity following the study by Hang et al. [27]. Accordingly, the circR2Disease dataset is used as the benchmark to calculate the circRNA-disease association matrix A and the gene-based circRNA similarity CGS. The circRNA Gaussian interaction profile (GIP) kernel similarity CIS and the disease GIP kernel similarity DIS are then calculated from A. circBase [28] is used as the benchmark sequence database, and the CGR algorithm is applied to it to calculate the sequence-based similarity of each circRNA pair. In addition, the directed acyclic graph (DAG) information from the MeSH database provides the basis for calculating the semantic similarity between diseases.
Methods
In this paper, a novel algorithm called CRPGCN is proposed, as shown in Fig. 1. The dataset is first preprocessed to construct the adjacency matrices and feature matrices connecting circRNAs to diseases, as follows:
The adjacency matrix A is obtained from the known circRNA-disease associations in the dataset. The circRNA comprehensive similarity matrix CS consists of the circRNA GIP kernel similarity matrix CIS, the circRNA gene-based similarity matrix CGS and the circRNA sequence-based similarity matrix CES. The disease comprehensive similarity matrix DS is composed of the disease GIP kernel similarity matrix DIS and the disease semantic similarity matrix DSS. CRPGCN is then trained on the heterogeneous adjacency matrix and the heterogeneous feature matrix constructed from A, CS and DS. The CRPGCN algorithm flow is as follows:
Step 1: The matrices A, CS and DS produced by data preprocessing are fed into CRPGCN.
Step 2: The RWR algorithm is used to aggregate neighbour-node information in the CS and DS matrices respectively, yielding CRS and DRS.
Step 3: The CRS and DRS matrices are each combined with the adjacency matrix A, and PCA is used to reduce the dimension and extract features, yielding the feature matrices CF and DF.
Step 4: CS, DS and A are used to form the heterogeneous adjacency matrix \({\mathrm{A_{cd}}}\), CF and DF are used to compose the heterogeneous feature matrix CD, and finally the GCN algorithm is used for feature learning and score calculation between circRNAs and diseases.
CRPGCN treats the relationships between circRNAs and diseases as graph-structured data, which makes full use of the associations between each node and its neighbours to learn information about similar nodes, while isolated nodes can also be handled well. The accuracy and stability of the CRPGCN algorithm are demonstrated by comparative experiments. The above steps are described in detail in the following sections.
Construct circRNA-disease adjacency matrix
The adjacency matrix A (see Additional file 1) is built from the known associations between circRNAs and diseases in the circR2Disease dataset. A(i,j) is set to 1 when circRNA \(c_i\) is known to be associated with disease \(d_j\), and to 0 otherwise:
\[A(i,j)=\begin{cases}1, & \text{if } c_i \text{ is associated with } d_j\\ 0, & \text{otherwise}\end{cases}\]
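The construction of A can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function name `build_adjacency` and the toy identifiers are hypothetical, and the input is assumed to be a list of known (circRNA, disease) pairs.

```python
import numpy as np

def build_adjacency(circrnas, diseases, known_pairs):
    """Return the binary matrix A (rows: circRNAs, columns: diseases)."""
    row = {name: i for i, name in enumerate(circrnas)}
    col = {name: j for j, name in enumerate(diseases)}
    A = np.zeros((len(circrnas), len(diseases)), dtype=int)
    for c, d in known_pairs:
        A[row[c], col[d]] = 1  # known association -> 1, everything else stays 0
    return A

# Hypothetical toy identifiers, for illustration only.
circrnas = ["circ_1", "circ_2", "circ_3"]
diseases = ["disease_a", "disease_b"]
pairs = [("circ_1", "disease_a"), ("circ_3", "disease_b")]
A = build_adjacency(circrnas, diseases, pairs)
```

Here `circ_2` has no known association, so its row is all zeros; such isolated nodes are the reason the later similarity fusion matters.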
Construct circRNA GIP kernel similarity
For a circRNA \(c_i\), the interaction profile IP\(_1 (c_i)\) is defined as the ith row of the circRNA-disease association matrix A. The GIP kernel similarity between each pair \(c_i\) and \(c_j\) is calculated as:
where CIS represents the GIP kernel similarity of \(c_i\) and \(c_j\). \(\gamma _c\) controls the kernel bandwidth; it is the normalised bandwidth of the Gaussian interaction profile kernel, derived from the bandwidth parameter \(\gamma _{m}^{\prime }\), which is set to 1. n represents the number of circRNAs. The disease GIP kernel similarity DIS is calculated in the same way.
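The GIP kernel described above can be sketched as follows. The sketch assumes the commonly used form of the kernel, \(\exp(-\gamma_c \Vert IP(c_i)-IP(c_j)\Vert^2)\) with \(\gamma_c=\gamma_m^{\prime}/\big(\tfrac{1}{n}\sum_i \Vert IP(c_i)\Vert^2\big)\), which matches the symbols defined in the text but is not copied from it; the function name `gip_kernel` and the guard for empty profiles are assumptions.

```python
import numpy as np

def gip_kernel(A, gamma_prime=1.0):
    """GIP kernel similarity between the rows (interaction profiles) of A."""
    A = np.asarray(A, dtype=float)
    sq_norms = (A ** 2).sum(axis=1)          # ||IP(c_i)||^2 for every row
    mean_sq = sq_norms.mean()
    # Assumed guard: if every profile is empty, fall back to gamma_prime.
    gamma = gamma_prime / mean_sq if mean_sq > 0 else gamma_prime
    # Pairwise squared Euclidean distances between profiles.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (A @ A.T)
    sq_dist = np.maximum(sq_dist, 0.0)       # clip tiny negative round-off
    return np.exp(-gamma * sq_dist)

A = np.array([[1, 0], [1, 0], [0, 1]])
CIS = gip_kernel(A)      # circRNA similarity from the rows of A
DIS = gip_kernel(A.T)    # disease similarity from the columns of A
```

Two circRNAs with identical interaction profiles get similarity 1, and the similarity decays as their known disease associations diverge.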
Construct gene-based circRNA similarity
Because similar RNAs tend to regulate similar genes, genes have been widely used to infer RNA similarity. In this study, to construct the gene-based circRNA similarity, the circRNA-gene association adjacency matrix \({\mathrm{A}}_{\mathrm{cg}}\) must be constructed first, where \({\mathrm{A}}_{\mathrm{cg}}(i,j)\) is set to 1 if circRNA \(c_i\) and gene \(g_j\) are related, and to 0 otherwise. Similarly to the circRNA GIP kernel similarity, the GIP kernel similarity matrix GIS of the genes is then constructed. The gene-based circRNA similarity matrix CGS is constructed [27] from \({\mathrm{A}}_{\mathrm{cg}}\) and GIS as follows:
where \({\mathrm{A}}_{\mathrm{cg}}^{\mathrm{T}}\) is the transpose of \({\mathrm{A}}_{\mathrm{cg}}\).
Construct sequence-based circRNA similarity
The method based on Chaos Game Representation (CGR) [29] transforms circRNA sequences into a corresponding spectral format. It exploits CGR coordinates to convert circRNA sequences into CGR radian sequences.
The method uses the Pearson correlation coefficient over the position information and the nonlinear information to calculate the sequence-based circRNA similarity matrix CES. Following the method of Zheng et al., the CGR space [30] is first divided into \(8\times 8\) grids, and the ith grid can be expressed as:
Furthermore, the quantified position information \(X_i\) and \(Y_i\) of \(grid_i\) is obtained by accumulating the horizontal coordinate values \(x_j\) and the vertical coordinate values \(y_j\) in each grid respectively, which can be presented as follows:
where \(Num_i\) denotes the number of points in \(grid_i\), \(X_i\) denotes the sum of the horizontal coordinate values \(x_j\) of all points in \(grid_i\), and \(Y_i\) denotes the sum of the vertical coordinate values \(y_j\) of all points in \(grid_i\). \(Z_i\) represents the z-score of each grid and quantifies the nonlinear information; it is calculated as:
where \(N_g=64\) is the total number of grids.
Finally, the \(X_i\), \(Y_i\) and \(Z_i\) attributes of each \(grid_i\) calculated above are fused into a description array \(des(c_i)\) covering all grids of the circRNA sequence:
Then, the Pearson correlation coefficient is used to determine the sequence similarity CES, which can be presented as follows:
where Cov(\(*\)) represents the covariance, D(\(*\)) represents the variance, \(c_i\) represents the ith circRNA.
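The grid-based descriptors and the Pearson step can be sketched as below. Several details are assumptions rather than the paper's definitions: the corner assigned to each base, treating U like T, reading the z-score as the standardised point count of each grid, and concatenating \(X\), \(Y\) and \(Z\) into the description array. The conversion to CGR radian sequences is not reproduced, and the function names are hypothetical.

```python
import numpy as np

# Assumed corner layout of the CGR unit square; U is treated like T.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0),
           "T": (1.0, 0.0), "U": (1.0, 0.0)}

def cgr_points(seq):
    """Chaos game: move halfway from the current point to each base's corner."""
    x, y = 0.5, 0.5
    pts = []
    for base in seq.upper():          # raises KeyError on bases outside ACGTU
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        pts.append((x, y))
    return np.array(pts)

def describe(seq, side=8):
    """Description array: per-grid sums X_i, Y_i and z-scored counts Z_i."""
    pts = cgr_points(seq)
    col = np.minimum((pts[:, 0] * side).astype(int), side - 1)
    row = np.minimum((pts[:, 1] * side).astype(int), side - 1)
    cell = row * side + col
    n_cells = side * side             # N_g = 64 for an 8 x 8 grid
    num = np.bincount(cell, minlength=n_cells).astype(float)
    X = np.bincount(cell, weights=pts[:, 0], minlength=n_cells)
    Y = np.bincount(cell, weights=pts[:, 1], minlength=n_cells)
    std = num.std()
    Z = (num - num.mean()) / std if std > 0 else np.zeros(n_cells)
    return np.concatenate([X, Y, Z])

def sequence_similarity(seq_a, seq_b):
    """Pearson correlation between the two description arrays."""
    return float(np.corrcoef(describe(seq_a), describe(seq_b))[0, 1])

# Hypothetical toy sequences, for illustration only.
seq_a = "ACGUACGGAUCCGAUU"
seq_b = "GGGAUCCCAUAUGCGC"
ces_ab = sequence_similarity(seq_a, seq_b)
```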
Constructing disease semantic similarity
The DAG associations between diseases can help to calculate the similarity between each pair of diseases. The more DAG correlations two diseases share, the greater their similarity. The contribution value of a disease quantifies the DAG correlation between two diseases. The contribution values are calculated from the MeSH dataset as follows:
Using these contribution values, the semantic similarity DSS between diseases is calculated as follows:
where T\((d_i)\cap\)T\((d_j)\) represents the set of common ancestor nodes of the two diseases \(d_i\) and \(d_j\).
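A small sketch of this kind of DAG-based semantic similarity is given below. It assumes the widely used scheme in which a disease contributes 1 to itself and each ancestor receives the largest child contribution multiplied by a decay factor; the decay value 0.5, the function names and the toy DAG are assumptions, not values taken from the text.

```python
DELTA = 0.5  # assumed semantic decay factor; the text does not state its value

def contributions(disease, parents, delta=DELTA):
    """Contribution of every node in T(disease) to the semantics of disease."""
    contrib = {disease: 1.0}
    stack = [disease]
    while stack:
        node = stack.pop()
        for p in parents.get(node, ()):
            value = delta * contrib[node]
            if value > contrib.get(p, 0.0):   # keep the largest contribution
                contrib[p] = value
                stack.append(p)
    return contrib

def semantic_similarity(d_i, d_j, parents, delta=DELTA):
    """Shared contributions divided by the total contribution of both diseases."""
    ci = contributions(d_i, parents, delta)
    cj = contributions(d_j, parents, delta)
    shared = set(ci) & set(cj)                # T(d_i) intersected with T(d_j)
    numerator = sum(ci[t] + cj[t] for t in shared)
    return numerator / (sum(ci.values()) + sum(cj.values()))

# Hypothetical toy DAG: each key maps a node to its parent nodes.
parents = {
    "disease_x": ["group_1"],
    "disease_y": ["group_1"],
    "group_1": ["root"],
}
dss_xy = semantic_similarity("disease_x", "disease_y", parents)
```

In this toy DAG the two diseases share `group_1` and `root`, so their similarity is (0.5 + 0.5 + 0.25 + 0.25) / (1.75 + 1.75) = 3/7.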
Data fusion
The circRNA comprehensive similarity matrix CS is obtained by fusing the matrices CIS, CGS and CES. If the gene-based circRNA similarity is not 0, the average value of CIS, CGS and CES is taken as the circRNA comprehensive similarity CS (see Additional file 2). Otherwise, the average value of CIS and CGS is used as the CS of the circRNA. The comprehensive similarity CS is given by:
If a disease has no DAG associations, its semantic similarities cannot be calculated. To measure the similarity between diseases more comprehensively, DIS and DSS are therefore fused. The disease comprehensive similarity DS (see Additional file 3) between diseases \(d_i\) and \(d_j\) is defined as follows:
CRPGCN algorithm
In this section, the implementation of the CRPGCN algorithm is described in detail. The adjacency matrix A, the circRNA comprehensive similarity matrix CS and the disease comprehensive similarity matrix DS are used as the input data for CRPGCN, and the output is the score matrix. The specific process is shown in Algorithm 1.
In lines 1–7 of the CRPGCN algorithm, the RWR algorithm takes CS and DS and fuses the similarity information of neighbouring nodes to obtain CRS and DRS. Because the similarity relationships between each node and its neighbours have an important influence on the prediction result, RWR is well suited to capturing the relationships between nodes and their neighbours. RWR combines the similarity [31] between neighbouring nodes by random walk and adjusts the degree to which neighbouring nodes are integrated through edge weights. The RWR calculation [19] is defined as:
where \(W=[w_{i,j}]\) is the transfer probability matrix and \({\widetilde{W}}\) is W after normalisation. \(\vec {e}_{l}\) is the \(k \times 1\) initial vector and is the row vector of the CRS or DRS. c is the restart probability, which is set to 0.4 based on the subsequent experiments. \(\vec {r}_{l}\) is the similarity vector obtained after the RWR calculation.
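A minimal RWR sketch is shown below. It assumes the standard update \(\vec{r}^{\,t+1}=(1-c)\,{\widetilde{W}}\,\vec{r}^{\,t}+c\,\vec{e}_{l}\) with a column-normalised similarity matrix and, as a simplifying assumption, one-hot seed vectors (one walk per node); the function name `rwr`, the tolerance and the toy matrix are hypothetical.

```python
import numpy as np

def rwr(S, c=0.4, tol=1e-10, max_iter=1000):
    """Random walk with restart on similarity matrix S, one walk per node.

    Column l of the result is the stationary vector r_l for seed node l.
    """
    S = np.asarray(S, dtype=float)
    col_sums = S.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0            # leave all-zero columns untouched
    W = S / col_sums                         # column-normalised transfer matrix
    E = np.eye(S.shape[0])                   # assumed one-hot seed vectors
    R = E.copy()
    for _ in range(max_iter):
        R_next = (1.0 - c) * (W @ R) + c * E
        if np.abs(R_next - R).max() < tol:   # stop once the walk has converged
            return R_next
        R = R_next
    return R

# Hypothetical toy similarity matrix standing in for CS.
CS_toy = np.array([[1.0, 0.5, 0.1],
                   [0.5, 1.0, 0.3],
                   [0.1, 0.3, 1.0]])
CRS = rwr(CS_toy, c=0.4)
```

Because the restart term keeps pulling the walk back to its seed, a larger c keeps the result closer to the original node, while a smaller c spreads more weight to its neighbours.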
With the RWR algorithm, there is a certain probability that the walk from a computed node will also pick up similarity from weakly associated neighbouring nodes, so some similarity noise is inevitable. To reduce the impact of this noise on the results, the PCA algorithm is invoked. In lines 9–21, PCA extracts features from the similarity matrix while reducing its noise, so that the resulting feature matrices CF and DF can be learned better by the GCN. The calculation [32] of the feature matrix is shown below:
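One common way to realise this PCA step is through a singular value decomposition of the centred matrix, as sketched below. How the RWR similarity and A are combined (horizontal concatenation here) and how the proportion k maps to a number of retained components are assumptions; `pca_features` is a hypothetical name and the random toy matrices only stand in for CRS, DRS and A.

```python
import numpy as np

def pca_features(M, k=0.3):
    """Project the rows of M onto the leading principal components.

    k is read here as the proportion of columns to retain (an assumption).
    """
    M = np.asarray(M, dtype=float)
    centred = M - M.mean(axis=0, keepdims=True)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
    n_keep = max(1, int(round(k * M.shape[1])))
    n_keep = min(n_keep, Vt.shape[0])
    return centred @ Vt[:n_keep].T

rng = np.random.default_rng(0)
CRS = rng.random((6, 6))                         # toy circRNA RWR similarity
DRS = rng.random((4, 4))                         # toy disease RWR similarity
A = (rng.random((6, 4)) > 0.7).astype(float)     # toy association matrix
CF = pca_features(np.hstack([CRS, A]), k=0.3)    # circRNA features
DF = pca_features(np.hstack([DRS, A.T]), k=0.3)  # disease features
```

Dropping the trailing components discards the directions with the least variance, which is where the noise-reduction effect described above comes from.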
In line 22, the noise-reduced similarity matrices are taken as the feature matrices CF (see Additional file 4) and DF (see Additional file 5) for circRNAs and diseases. However, noise reduction alone is not enough for the GCN method to find the associations between nodes easily [33], so the concepts of a heterogeneous adjacency matrix and a heterogeneous feature matrix are introduced for better feature embedding. They are constructed as follows:
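A plausible block layout for the two heterogeneous matrices is sketched below. The arrangement (similarities on the diagonal blocks and A off the diagonal for \({\mathrm{A_{cd}}}\), and a block-diagonal CD) is an assumption based on how such networks are usually assembled, and the function name and toy inputs are hypothetical.

```python
import numpy as np

def heterogeneous_matrices(A, CS, DS, CF, DF):
    """Assemble A_cd and CD from their blocks (assumed layout)."""
    # Similarities on the diagonal, associations off the diagonal.
    A_cd = np.block([[CS, A],
                     [A.T, DS]])
    # circRNA features in the upper-left block, disease features lower-right.
    CD = np.block([
        [CF, np.zeros((CF.shape[0], DF.shape[1]))],
        [np.zeros((DF.shape[0], CF.shape[1])), DF],
    ])
    return A_cd, CD

# Toy inputs: 3 circRNAs, 2 diseases, 2 features per node type.
A = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
CS = np.eye(3)
DS = np.eye(2)
CF = np.ones((3, 2))
DF = np.ones((2, 2))
A_cd, CD = heterogeneous_matrices(A, CS, DS, CF, DF)
```

With this layout every circRNA and disease becomes one node of a single graph, so one GCN pass can propagate information across both node types.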
The learning procedure of the GCN is defined in lines 23 to 32. According to the definition of GCN, the convolution of the adjacency matrix \({\mathrm{A_{cd}}}\) with the feature matrix CD is given by:
where \(W_e\), the filter parameter matrix in the Fourier domain, is the trainable weight matrix, and \({\mathrm{C D}}\left( {\mathrm{I}}+D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\right)\) represents the hidden associations between circRNA or disease nodes and the latent factors. It is converted into a hidden matrix H through \(W_e\). I is the identity matrix. The bias matrix B is introduced into the hidden matrix H through the activation function. The initialisation [34] of the trainable matrices \(W_e\), \(W_d\) and B follows Glorot et al.:
where \(\Phi _{p}\) and \(\Phi _{n}\) are randomly selected positive and negative samples for this experiment. W is used to minimise the prediction error during the iterative process, and it is calculated as shown in Eq. (19). The constraints on the weight matrices in the encoder and decoder are defined by the remaining three terms separately. Because the ratio of positive and negative samples affects the experimental training results, this experiment validates the optimal ratio of positive and negative samples, and the validation results and discussion will be given in the next section.
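The forward pass of one graph-convolution layer with Glorot initialisation can be sketched as follows. This is a hedged illustration only: it uses the propagation matrix \({\mathrm{I}}+D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\) applied to the node features, a ReLU activation and a row-vector bias, all of which are assumptions about details the text does not spell out, and it omits the decoder, the loss and the training loop. All function names and the toy inputs are hypothetical.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Glorot (Xavier) uniform initialisation."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def propagation_matrix(A_cd):
    """I + D^{-1/2} A D^{-1/2}, with D the diagonal degree matrix of A_cd."""
    A_cd = np.asarray(A_cd, dtype=float)
    deg = A_cd.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    np.divide(1.0, np.sqrt(deg), out=inv_sqrt, where=deg > 0)
    return np.eye(A_cd.shape[0]) + inv_sqrt[:, None] * A_cd * inv_sqrt[None, :]

def gcn_layer(A_cd, CD, W_e, B):
    """One encoder layer: ReLU(P @ CD @ W_e + B)."""
    return np.maximum(propagation_matrix(A_cd) @ CD @ W_e + B, 0.0)

rng = np.random.default_rng(0)
A_cd = np.array([[1.0, 0.5, 1.0],
                 [0.5, 1.0, 0.0],
                 [1.0, 0.0, 1.0]])        # toy heterogeneous adjacency
CD = rng.normal(size=(3, 4))              # toy heterogeneous features
latent = 2                                # toy latent size, not the paper's LFN
W_e = glorot_uniform(4, latent, rng)
B = glorot_uniform(1, latent, rng)        # bias broadcast over nodes (assumed)
H = gcn_layer(A_cd, CD, W_e, B)
```

Each row of H is a latent embedding of one node that mixes its own features with those of its neighbours, weighted by the normalised edge weights.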
Results
Evaluation method and metrics
The ROC curve is drawn from the true positive rate (TPR) and the false positive rate (FPR). TPR is calculated as follows:
\[TPR=\frac{TP}{TP+FN}\]
where TPR is the proportion of all actually positive samples that are correctly judged as positive. FPR is calculated as follows:
\[FPR=\frac{FP}{FP+TN}\]
where FPR is the proportion of all actually negative samples that are incorrectly judged as positive.
The experiment used a variety of metrics to assess performance, including recall (Recall), F1 score (F1), accuracy (ACC), Matthews correlation coefficient (MCC), area under the receiver operating characteristic curve (AUC) and area under the precision-recall curve (AUPR). They are defined as:
where TP is true positive, the number of positive samples that are correctly classified, and FN is false negative, the number of positive samples that are incorrectly classified as negative. FP is false positive, the number of negative samples that are incorrectly classified as positive; TN is true negative, the number of negative samples that are correctly classified.
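The threshold-based metrics can be computed directly from the four confusion counts, as in the sketch below, which uses their standard textbook definitions. The function name and the example counts are hypothetical, and the sketch does not guard against empty classes (a zero denominator raises an error).

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion counts (no zero-division guards)."""
    recall = tp / (tp + fn)                    # identical to TPR
    precision = tp / (tp + fp)
    fpr = fp / (fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    acc = (tp + tn) / (tp + fp + tn + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Recall": recall, "Precision": precision, "FPR": fpr,
            "F1": f1, "ACC": acc, "MCC": mcc}

# Hypothetical counts, for illustration only.
m = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```

AUC and AUPR are not computed from a single set of counts: they summarise how TPR against FPR, and precision against recall, change as the decision threshold sweeps over the prediction scores.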
k-fold cross-validation
In this section, k-fold cross-validation (CV) is used to assess the performance of CRPGCN. The dataset used for this experiment is a combined dataset of 533 circRNAs associated with 89 diseases, obtained by screening the circBase, circR2Disease and MeSH databases. To assess the performance of CRPGCN more accurately, the dataset is randomly sampled. Entries of the adjacency matrix A equal to 1 are positive samples and all other entries are negative samples. The positive samples are randomly shuffled and divided into 5 equal parts, negative samples are then drawn at 5 times the number of positive samples, and finally the positive and negative samples are combined as training samples. In addition to the known associations between circRNAs and diseases in the dataset itself, the potential associations between circRNAs and diseases also have a significant impact on the final results, so the latent factor number (LFN) parameter is tuned to its optimal value, as presented in the next section. The ratio of positive to negative samples also plays a crucial role in the outcome of the experiment. The ROC curves are shown in Fig. 2, with final AUC values of 0.9490, 0.9720 and 0.9722 for 2-fold, 5-fold and 10-fold CV, respectively.
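The sampling scheme described above can be sketched as follows. The text does not say how negatives are assigned to folds, so splitting them across folds in the same way as the positives is an assumption, as are the function name `kfold_samples` and the toy matrix.

```python
import numpy as np

def kfold_samples(A, k=5, neg_ratio=5, seed=0):
    """Yield (train, test) arrays of (circRNA index, disease index) pairs."""
    rng = np.random.default_rng(seed)
    pos = np.argwhere(A == 1)
    neg = np.argwhere(A == 0)
    pos = pos[rng.permutation(len(pos))]              # shuffle the positives
    n_neg = min(len(neg), neg_ratio * len(pos))       # negatives = ratio x positives
    neg = neg[rng.choice(len(neg), size=n_neg, replace=False)]
    pos_folds = np.array_split(pos, k)
    neg_folds = np.array_split(neg, k)                # assumed: negatives split too
    for i in range(k):
        test = np.vstack([pos_folds[i], neg_folds[i]])
        train = np.vstack(pos_folds[:i] + pos_folds[i + 1:]
                          + neg_folds[:i] + neg_folds[i + 1:])
        yield train, test

# Toy matrix with 10 positives and 50 negatives.
A = np.zeros((10, 6), dtype=int)
for i in range(10):
    A[i, i % 6] = 1
folds = list(kfold_samples(A, k=5, neg_ratio=5))
```

Fixing the seed makes the folds reproducible, which matters when AUC values are compared across parameter settings.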
Analysis of parameters
The key parameters of the CRPGCN algorithm have a huge impact on the results [35], thus in this section, the three primary parameters will be analysed.
In CRPGCN, the latent factor number (LFN) is one of the foundations of the model and plays an integral role in this experiment. This subsection therefore evaluates how the CRPGCN algorithm is affected by varying the LFN, which ranges from 5 to 100, using AUC values for validation. Fivefold CV is performed on the dataset with the remaining parameters fixed at their optimal values. As shown in the histogram in Fig. 3a, the AUC value increases monotonically as the LFN goes from 5 to 65 and decreases monotonically from 65 to 100. The best AUC value of 0.9720 is obtained at an LFN of 65. Adjusting the LFN to a reasonable value strengthens the associations between circRNAs and diseases and thus makes the prediction more accurate.
In addition, the restart probability c of the RWR and the proportion k of vectors retained by the PCA also have a large impact on the AUC of CRPGCN. c is the probability that the walk returns to the original node in the next step, and \(1-c\) is the probability that it moves to a neighbouring node. k is the proportion of columns of the PCA-processed matrix that are selected as the columns of the feature matrix. Because both c and k lie between 0 and 1, the experiments in this section use a step size of 0.1 for each. The results show that when c is 0.3, 0.4 and 0.5 and k ranges over [0.1,0.9], the average AUC values are 0.9643, 0.9658 and 0.9645, respectively. When c is 0.5 and k is in [0.4,0.8], the AUC values reach one of the peaks, with a range-average AUC of 0.9676, but not the highest value. The best AUC value, 0.9720, is obtained when c=0.4 and k=0.3. The experimental validation shows that although there are some outstanding AUC values in different ranges, the highest AUC values can only be obtained by setting c and k appropriately; the results are shown in Fig. 3b.
To further validate the parameter choices, the results of twofold, fivefold and tenfold CV for different LFN values are presented in Fig. 4 and Table 1 (Tables 2, 3); they support the conclusions above.
Comparison with existing methods
To verify the reliability of the algorithm, CRPGCN is compared with other strong prediction methods in this experiment, as shown in Fig. 5. GCMDR [36] was developed by Huang et al. to predict the relationships between miRNAs and drugs; it uses GCN for feature extraction and final score calculation. AERF [37] was developed by K. Deepthi et al. to predict the associations between circRNAs and diseases; it uses the Deep Autoencoder (DAEN) algorithm to extract features and then the Random Forest (RF) classifier to classify and predict the score matrix. GCNMDA [38] was developed by Long et al. to predict the associations between human microorganisms and drugs, with a Conditional Random Fields (CRF) layer added to the GCN process for feature extraction and final score calculation. SIMCCDA [39] was developed by Li et al. to predict the associations between circRNAs and diseases; it uses the PCA algorithm for feature extraction and dimensionality reduction, and then the Speedup Inductive Matrix Completion (SIMC) algorithm to compute the prediction score matrix. VGAELDA [40] integrates variational inference and graph autoencoders for lncRNA-disease association prediction. GATMDA [41] uses graph attention networks with inductive matrix completion for human microbe-disease association prediction. Under fivefold CV, the AUC values of GCMDR, AERF, GCNMDA, SIMCCDA, VGAELDA, GATMDA and CRPGCN are 0.6882, 0.8653, 0.7714, 0.8291, 0.5114, 0.9254 and 0.9720, respectively. The AUPR values are 0.1203, 0.8062, 0.1465, 0.1756, 0.6367, 0.9067 and 0.9418, respectively. In addition, the results for performance indicators such as F1, MCC, ACC and Recall are shown in Fig. 6 and Table 4. By fusing multiple datasets, this study effectively combines circRNA sequence information, circRNA gene information and disease DAG data.
Thereafter, CRPGCN uses the RWR algorithm for the comprehensive similarity calculation, which allows each computed node to better fuse information from neighbouring nodes with higher weights. PCA is then used for feature extraction and dimensionality reduction, which further enhances the similarity information of the nodes. This gives each pair of circRNA-disease nodes with high similarity more prominent features while also reducing noise in the data, so that the GCN can learn features from the preprocessed data faster and produce a more accurate prediction score matrix.
In summary, the CRPGCN algorithm has a higher accuracy and greater advantage in predicting the associations between circRNAs and diseases than many other excellent comparative algorithms.
Comparison with different datasets
To verify the reliability of the CRPGCN algorithm on different datasets, this experiment compares four datasets, as shown in Table 2. DataSet1 has 330 types of circRNAs and 354 types of associations with 48 diseases; DataSet2 has 661 types of circRNAs and 736 types of associations with 100 diseases; DataSet3 has 512 types of circRNAs and 609 types of associations with 71 diseases; DataSet4 has 533 types of circRNAs and 612 types of associations with 89 diseases. DataSet4 is the benchmark dataset for this study.
The histogram of AUC values in Fig. 7a and Table 3 show that the AUC values of the CRPGCN method under fivefold CV remain stable at around 0.95, with little fluctuation. CRPGCNI also performs well on DataSet1 and DataSet2, but its AUC values drop significantly on DataSet3 and DataSet4, indicating that the effectiveness of CRPGCNI fluctuates widely across datasets and that the algorithm is not stable. CRPGCNII performs relatively poorly on all four datasets, which implies that it basically fails to make accurate predictions. The AUC values of the CRPGCN algorithm under twofold, fivefold and tenfold CV on the four datasets are shown in Table 5; their averages are 0.9375, 0.9562 and 0.9621, respectively. In summary, the CRPGCN algorithm gives stable, efficient and accurate predictions both across different datasets and in comparison with other computational methods.
The four ROC curves in Fig. 8 show that the ROC curves of the CRPGCN algorithm all rise rapidly, with the TPR exceeding 0.9 before the FPR reaches 0.1, which indicates that the CRPGCN algorithm is extremely efficient. For the CRPGCNI method, the ROC curves on DataSet1 and DataSet2 also rise quickly, with TPR values reaching around 0.9 before the FPR reaches 0.1. However, the curves of CRPGCNI on DataSet3 and DataSet4 are significantly flatter, with TPR values basically reaching 0.9 only after the FPR value of 0.9. This indicates that the prediction accuracy of CRPGCNI fluctuates across datasets. For the CRPGCNII method, the curves are remarkably flat on all four datasets and the AUC values are low, which indicates that CRPGCNII basically cannot predict the associations between circRNAs and diseases accurately. Furthermore, because they include the PCA algorithm for feature extraction, the CRPGCN and CRPGCNI algorithms have higher AUC values than CRPGCNII, which suggests that PCA feature extraction is essential for this experiment. Meanwhile, although DataSet4 is not the dataset with the most circRNA-disease associations, CRPGCN obtains its highest AUC value on it because the algorithm incorporates gene-based circRNA similarity into the circRNA comprehensive similarity calculation, which shows that gene-based circRNA similarity is crucial for this algorithm.
Comparison with different comprehensive similarity calculation method
In order to study the influence of different similarity calculation methods on CRPGCN algorithm, in addition to RWR, DeepWalk [42], Line [43], Node2vec [44] and Struct2vec [45] algorithms are selected for comparison.As shown in the Fig. 9, the AUC values of CRPGCN using RWR, DeepWalk, Line, Node2vec and Struct2vec similarity calculation methods reached 0.9720, 0.8480, 0.8586, 0.9522 and 0.7724 under fivefold CV respectively. This means that the RWR algorithm has better performance compared to other similarity calculation methods in this study.
In the data preprocessing stage, the comprehensive similarity between circRNAs and the comprehensive similarity between diseases are calculated for feature learning. However, simply calculating the comprehensive similarity is not sufficient to fuse the data of similar nodes for feature learning, so it is necessary to fuse neighbouring nodes based on the comprehensive similarity to help the subsequent feature extraction. Compared to the other similarity calculation algorithms in this study, the RWR algorithm focuses more on the influence of the weights of neighbouring nodes on the similarity calculation, and it uses the comprehensive similarity as the similarity weights of neighbouring nodes for data fusion. In contrast, Struct2vec focuses more on the calculation of structural similarity, which has little influence in this experiment, so the AUC value of Struct2vec is the lowest. On the other hand, Node2vec is closer to the RWR algorithm in terms of computational results because it is also more concerned with the weights of neighbouring nodes. However, Node2vec uses either a depth-first search (DFS) strategy or a breadth-first search (BFS) strategy to calculate similarity, which incorporates more information from low-similarity nodes, whereas the RWR algorithm may return to the original node during the similarity calculation, which allows neighbouring nodes with high similarity to be combined more closely. Overall, the RWR algorithm is the best choice for the computation of similarity in this study.
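The restart behaviour described above can be sketched as follows. This is a minimal illustration of a generic random walk with restart on a similarity matrix, not the authors' implementation: the restart probability of 0.5, the column normalization, the convergence tolerance and the toy similarity values are all assumptions, and plain Python lists are used only to keep the sketch dependency-free.

```python
def column_normalize(sim):
    """Scale each column of a square similarity matrix so that it sums to 1."""
    n = len(sim)
    col_sums = [sum(sim[i][j] for i in range(n)) for j in range(n)]
    return [[sim[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def rwr(sim, seed, restart=0.5, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) * W p + r * p0 until the L1 change is below tol."""
    n = len(sim)
    w = column_normalize(sim)
    p0 = [1.0 if i == seed else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(w[i][j] * p[j] for j in range(n))
               + restart * p0[i]
               for i in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy comprehensive similarity between three nodes (assumed values).
sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
p = rwr(sim, seed=0)
print(p)  # steady-state visiting probabilities when restarting at node 0
```

Because the walker repeatedly returns to the seed node, the steady-state probability concentrates on the seed and on its highly similar neighbour (node 1 here), while the weakly similar node 2 receives little weight.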
Case study
To further validate the predictive performance of CRPGCN for individual diseases, a case study is conducted on breast cancer. Breast cancer is a common disease and one of the more lethal diseases, especially for women. This case study may help researchers to better study breast cancer and to develop drugs or methods for effective treatment. The circR2Disease database and the circFunBase [46] database are selected for validation. The known associations between circRNAs and breast cancer are removed, the model is trained with CRPGCN, and the unassociated circRNA-breast cancer pairs are then predicted. The top 40 circRNAs in descending order of CRPGCN prediction score are checked against these databases, as shown in Table 6 (see Additional file 6). Some of the predicted associations between circRNAs and breast cancer are not yet confirmed and may be validated in future studies. The experimental results demonstrate the excellent predictive performance of the CRPGCN algorithm.
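The ranking and validation step of such a case study can be sketched as below. The circRNA names, scores and the contents of the validation set are hypothetical placeholders, and `top_candidates` is an illustrative helper rather than part of the published code; the sketch only shows how candidates without a known association are ranked and then checked against an independent database.

```python
def top_candidates(scores, known, k):
    """Rank circRNAs without a known association by descending score."""
    candidates = [(name, s) for name, s in scores.items() if name not in known]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:k]


# Hypothetical prediction scores for one disease (not data from the paper).
scores = {"circ_A": 0.91, "circ_B": 0.15, "circ_C": 0.78,
          "circ_D": 0.66, "circ_E": 0.05}
known = {"circ_A"}                 # associations already in the training data
validated = {"circ_C", "circ_E"}   # entries found in an external database

top = top_candidates(scores, known, k=3)
confirmed = [name for name, _ in top if name in validated]
print(top)
print(confirmed)
```

In the study the cut-off is the top 40 candidates; `k=3` is used here only because the toy example is small.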
Conclusions and discussion
In this paper, CRPGCN is proposed for predicting the relationships between circRNAs and diseases using a GCN constructed with RWR and PCA on a heterogeneous network. In CRPGCN, data from multiple datasets are used for similarity fusion, including information on circRNA sequences, genes, the DAGs of diseases, and circRNA-disease associations. By filtering the dataset, 533 circRNAs and 89 diseases are obtained.
With the information provided by the datasets, the circRNA GIP kernel similarity matrix CIS, the sequence-based circRNA similarity matrix CES, the gene-based circRNA similarity matrix CGS, the disease GIP kernel similarity matrix DIS, and the disease semantic similarity matrix DSS are calculated. After that, the circRNA comprehensive similarity matrix CS is obtained by fusing CIS, CGS and CES, and the disease comprehensive similarity matrix DS is obtained by fusing DIS and DSS. Thereafter, the RWR algorithm is used to allow each node to learn the information of neighbouring nodes with higher correlation. However, simply splicing the matrices inevitably introduces noise; the PCA method both extracts features from the spliced matrix and reduces its noise. The data processed by these methods are fused into a heterogeneous adjacency matrix and a heterogeneous feature matrix, which are used by the GCN algorithm for feature learning and for calculating the association scores between circRNAs and diseases. The results and comparative experiments show that the CRPGCN algorithm proposed in this paper performs well and can accurately predict the associations between circRNAs and diseases. It can provide useful help to biologists and save them time in experiments.
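As a rough illustration of the last two steps, the sketch below assembles a heterogeneous adjacency matrix and applies one graph-convolution step. Several things here are assumptions rather than details confirmed by this section: the block layout [[CS, A], [Aᵀ, DS]], the symmetric normalization with added self-loops (the propagation rule of Kipf and Welling), the toy matrices, and the weight values. The RWR and PCA steps are omitted, and plain Python lists are used only to avoid external dependencies.

```python
import math


def block_adjacency(cs, ds, a):
    """Assemble [[CS, A], [A^T, DS]] for n_c circRNAs and n_d diseases."""
    n_c, n_d = len(cs), len(ds)
    top = [cs[i] + a[i] for i in range(n_c)]
    bottom = [[a[i][j] for i in range(n_c)] + ds[j] for j in range(n_d)]
    return top + bottom


def sym_normalize(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(x, y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*y)]
            for row in x]


def gcn_layer(a_norm, h, w):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    z = matmul(matmul(a_norm, h), w)
    return [[max(0.0, v) for v in row] for row in z]


# Toy inputs (assumed values): 2 circRNAs, 1 disease.
cs = [[1.0, 0.5], [0.5, 1.0]]      # circRNA comprehensive similarity
ds = [[1.0]]                       # disease comprehensive similarity
a = [[1], [0]]                     # known circRNA-disease associations

adj = block_adjacency(cs, ds, a)
a_norm = sym_normalize(adj)
h = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # one-hot features
w = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]                # illustrative weights
out = gcn_layer(a_norm, h, w)
print(out)  # one 2-dimensional embedding per node
```

In a full model the weights would be learned and the node features would come from the PCA-reduced matrices; the association score for a circRNA-disease pair would then be derived from the resulting embeddings.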
In the comparison experiments of this paper, the CRPGCN method also performs outstandingly against the best published algorithms and achieves the best results among the compared methods. To demonstrate the stability of the CRPGCN method, different datasets are used for the comparison. In conclusion, the comparison experiments show that the CRPGCN algorithm delivers stable and accurate predictions of the associations between circRNAs and diseases.
Availability of data and materials
The dataset and source code can be obtained from https://github.com/KajiMaCN/CRPGCN/. The circR2Disease database can be downloaded from http://bioinfo.snnu.edu.cn/CircR2Disease/. The circBase database can be obtained from http://circrna.org/. The MeSH database can be obtained from https://www.nlm.nih.gov/.
Abbreviations
circRNA: Circular RNA
GCN: Graph convolutional network
RWR: Random walk with restart
PCA: Principal component analysis
CGR: Chaos game representation
DAGs: Directed acyclic graphs
TPR: True positive rate
FPR: False positive rate
ROC: Receiver operating characteristic
AUC: Areas under ROC curve
References
Jarada TN, Rokne JG, Alhajj R. SNFNN: computational method to predict drug-disease interactions using similarity network fusion and neural networks. BMC Bioinform. 2021;22(1):28. https://doi.org/10.1186/s12859020039503.
Wang L, Zhong X, Wang S, Zhang H, Liu Y. A novel end-to-end method to predict RNA secondary structure profile based on bidirectional LSTM and residual neural network. BMC Bioinform. 2021;22(1):169. https://doi.org/10.1186/s1285902104102x.
Zhu R, Wang Y, Liu JX, Dai LY. IPCARF: improving lncRNA-disease association prediction using incremental principal component analysis feature selection and a random forest classifier. BMC Bioinform. 2021;22(1):175. https://doi.org/10.1186/s12859021041049.
Han G, Kuang Z, Deng L. MSCNE: predict miRNA-disease associations using neural network based on multi-source biological information. IEEE/ACM Trans Comput Biol Bioinform. 2021;1. https://doi.org/10.1109/TCBB.2021.3106006.
Tang M, Liu C, Liu D, Liu J, Liu J, Deng L. PMDFI: predicting miRNA-disease associations based on high-order feature interaction. Front Genet. 2021;12:318. https://doi.org/10.3389/fgene.2021.656107.
Cai Y, Wang J, Deng L. SDN2GO: an integrated deep learning model for protein function prediction. Front Bioeng Biotechnol. 2020;8:391. https://doi.org/10.3389/fbioe.2020.00391.
Azari H, Mousavi P, Karimi E, Sadri F, Zarei M, Rafat M, Shekari M. The expanding role of CDR1AS in the regulation and development of cancer and human diseases, 2021. https://doi.org/10.1002/jcp.29950.
Lu C, Zeng M, Wu FX, Li M, Wang J. Improving circRNA-disease association prediction by sequence and ontology representations with convolutional and recurrent neural networks. Bioinformatics. 2021;36(24):5656–64. https://doi.org/10.1093/bioinformatics/btaa1077.
Zhang Y, Lei X, Pan Y, Pedrycz W. Prediction of disease-associated circRNAs via circRNA-disease pair graph and weighted nuclear norm minimization. Knowl Based Syst. 2021;214:106694. https://doi.org/10.1016/j.knosys.2020.106694.
Lei XJ, Bian C, Pan Y. Predicting CircRNA-disease associations based on improved weighted biased meta-structure. J Comput Sci Technol. 2021;36(2):288–98. https://doi.org/10.1007/s113900210798x.
Wang L, Yan X, You ZH, Zhou X, Li HY, Huang YA. SGANRDA: semi-supervised generative adversarial networks for predicting circRNA-disease associations. Brief Bioinform. 2021. https://doi.org/10.1093/bib/bbab028.
Wei H, Xu Y, Liu B. iCircDALTR: identification of circRNA-disease associations based on Learning to Rank. Bioinformatics. 2021. https://doi.org/10.1093/bioinformatics/btab334.
Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks, 2016. arXiv:1609.02907.
Tong H, Faloutsos C, Pan JY. Fast random walk with restart and its applications. Technical report, 2006.
Hotelling H. Analysis of a complex of statistical variables into principal components. J Educ Psychol. 1933;24(6):417. https://doi.org/10.1037/h0071325.
Li J, Zhang S, Liu T, Ning C, Zhang Z, Zhou W. Neural inductive matrix completion with graph convolutional networks for miRNA-disease association prediction. Bioinformatics. 2020;36(8):2538–46. https://doi.org/10.1093/bioinformatics/btz965.
Wang L, You ZH, Li YM, Zheng K, Huang YA. GCNCDA: a new method for predicting circRNA-disease associations based on graph convolutional network algorithm. PLoS Comput Biol. 2020;16(5):1–19. https://doi.org/10.1371/journal.pcbi.1007568.
Pan X, Shen HB. Scoring disease-microRNA associations by integrating disease hierarchy into graph convolutional networks. Pattern Recognit. 2020;105:107385. https://doi.org/10.1016/j.patcog.2020.107385.
Lei X, Bian C. Integrating random walk with restart and k-Nearest Neighbor to identify novel circRNA-disease association. Sci Rep. 2020;10(1):1–9. https://doi.org/10.1038/s41598020590400.
Wang L, Xiao Y, Li J, Feng X, Li Q, Yang J. IIRWR: internal inclined random walk with restart for lncRNA-disease association prediction. IEEE Access. 2019;7(1):54034–41. https://doi.org/10.1109/ACCESS.2019.2912945.
Zhang W, Lei X, Bian C. Identifying cancer genes by combining two-rounds RWR based on multiple biological data. BMC Bioinform. 2019;20:518–151812. https://doi.org/10.1186/s1285901931238.
Wang M, Zhu P. MRWMDA: a novel framework to infer miRNA-disease associations. BioSystems. 2021;199:104292. https://doi.org/10.1016/j.biosystems.2020.104292.
Arowolo MO, Adebiyi M, Adebiyi A, Okesola O. PCA model for RNA-Seq malaria vector data classification using KNN and decision tree algorithm. In: 2020 International conference in mathematics, computer engineering and computer science, ICMCECS 2020. 2020. https://doi.org/10.1109/ICMCECS47690.2020.240881.
Sell SL, Widen SG, Prough DS, Hellmich HL. Principal component analysis of blood microRNA datasets facilitates diagnosis of diverse diseases. PLoS ONE. 2020;15(6):1–26. https://doi.org/10.1371/journal.pone.0234185.
Ding Y, Chen B, Lei X, Liao B, Wu FX. Predicting novel CircRNA-disease associations based on random walk and logistic regression model. Comput Biol Chem. 2020;87:107287. https://doi.org/10.1016/j.compbiolchem.2020.107287.
Fan C, Lei X, Fang Z, Jiang Q, Wu FX. CircR2Disease: a manually curated database for experimentally supported circular RNAs associated with various diseases. Database. 2018;2018. https://doi.org/10.1093/database/bay044.
Wei H, Liu B. iCircDAMF: identification of circRNA-disease associations based on matrix factorization. Brief Bioinform. 2019;21(4):1356–67. https://doi.org/10.1093/bib/bbz057.
Glažar P, Papavasileiou P, Rajewsky N. CircBase: a database for circular RNAs. RNA. 2014;20(11):1666–70. https://doi.org/10.1261/rna.043687.113.
Jeffrey HJ. Chaos game representation of gene structure. Technical Report 8, 1990. http://nar.oxfordjournals.org/.
Zheng K, You ZH, Li JQ, Wang L, Guo ZH, Huang YA. ICDACGR: identification of circRNA-disease associations based on chaos game representation. PLoS Comput Biol. 2020;16(5):1007872. https://doi.org/10.1371/journal.pcbi.1007872.
Wang J, Kuang Z, Ma Z, Han G. GBDTL2E: predicting lncRNA-EF associations using diffusion and hetesim features based on a heterogeneous network. Front Genet. 2020;11:272. https://doi.org/10.3389/fgene.2020.00272.
Buratin A, Gaffo E, Molin AD, Bortoluzzi S. CircIMPACT: an R package to explore circular RNA impact on gene expression and pathways. Genes. 2021;12(7):1044. https://doi.org/10.3390/genes12071044.
Zhang Y, Lei X, Fang Z, Pan Y. CircRNA-disease associations prediction based on metapath2vec++ and matrix factorization. Big Data Min Anal. 2020;3(4):280–91. https://doi.org/10.26599/BDMA.2020.9020025.
Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. J Mach Learn Res. 2010;9:249–56.
Ji C, Gao Z, Ma X, Wu Q, Ni J, Zheng C. AEMDA: inferring miRNA-disease associations based on deep autoencoder. Bioinformatics (Oxford, England). 2021;37(1):66–72. https://doi.org/10.1093/bioinformatics/btaa670.
Huang YA, Hu P, Chan KCC, You ZH. Graph convolution for predicting associations between miRNA and drug resistance. Bioinformatics. 2020;36(3):851–8. https://doi.org/10.1093/bioinformatics/btz621.
Deepthi K, Jereesh AS. Inferring potential CircRNA-disease associations via deep autoencoder-based classification. Mol Diagn Therapy. 2021;25(1):87–97. https://doi.org/10.1007/s4029102000499y.
Long Y, Wu M, Kwoh CK, Luo J, Li X. Predicting human microbe-drug associations via graph convolutional network with conditional random field. Bioinformatics. 2020;36(19):4918–27. https://doi.org/10.1093/bioinformatics/btaa598.
Li M, Liu M, Bin Y, Xia J. Prediction of circRNA-disease associations based on inductive matrix completion. BMC Med Genom. 2020;13:044. https://doi.org/10.1186/s1292002006790.
Shi Z, Zhang H, Jin C, Quan X, Yin Y. A representation learning model based on variational inference and graph autoencoder for predicting lncRNA-disease associations. BMC Bioinform. 2021;22(1):136. https://doi.org/10.1186/s1285902104073z.
Long Y, Luo J, Zhang Y, Xia Y. Predicting human microbe-disease associations via graph attention networks with inductive matrix completion. Brief Bioinform. 2021;22(3):146. https://doi.org/10.1093/bib/bbaa146.
Perozzi B, Al-Rfou R, Skiena S. DeepWalk: Online learning of social representations. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, 2014, p. 701–710. https://doi.org/10.1145/2623330.2623732.
Tang J, Qu M, Wang M, Zhang M, Yan J, Mei Q. LINE: large-scale information network embedding. In: WWW 2015: Proceedings of the 24th international conference on world wide web, 2015, p. 1067–1077. https://doi.org/10.1145/2736277.2741093.
Grover A, Leskovec J. Node2vec: scalable feature learning for networks. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, vol. 13–17-August-2016, 2016, p. 855–864. https://doi.org/10.1145/2939672.2939754.
Wang L, Lu Y, Huang C, Vosoughi S. Embedding node structural role identity into hyperbolic space. In: International conference on information and knowledge management, proceedings, 2020, p. 2253–2256. https://doi.org/10.1145/3340531.3412102.
Meng X, Hu D, Zhang P, Chen Q, Chen M. CircFunBase: a database for functional circular RNAs. Database. 2019;2019:003. https://doi.org/10.1093/database/baz003.
Acknowledgements
We would like to thank the Experimental Center of School of Computer and Information Engineering, Central South University of Forestry and Technology, for providing computing resources.
Funding
This work is supported in part by the National Natural Science Foundation of China under Grants 62072477, 61309027, 61702562 and 61702561, the Hunan Provincial Natural Science Foundation of China under Grant 2018JJ3888, the Scientific Research Fund of Hunan Provincial Education Department under Grant 18B197, the National Key R&D Program of China under Grant 2018YFB1700200, and the Hunan Key Laboratory of Intelligent Logistics Technology under Grant 2019TP1015.
Author information
Contributions
ZHM designed this study. ZHM collected the data, conceived and implemented the model. ZHM and ZFK performed and analysed the experiments. ZHM wrote the paper. ZFK and LD revised the manuscript. All authors have read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1: Adjacency matrix A. The adjacency matrix A constructed from circR2Disease.
Additional file 2: circRNA comprehensive similarity matrix.
Additional file 3: Disease comprehensive similarity.
Additional file 4: circRNA feature matrix.
Additional file 5: Disease feature matrix.
Additional file 6: Prediction of the top 40 predicted circRNAs associated with Breast cancer.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Ma, Z., Kuang, Z. & Deng, L. CRPGCN: predicting circRNA-disease associations using graph convolutional network based on heterogeneous network. BMC Bioinformatics 22, 551 (2021). https://doi.org/10.1186/s1285902104467z
DOI: https://doi.org/10.1186/s1285902104467z
Keywords
 CircRNA-disease
 Graph convolutional network
 Heterogeneous network
 Principal component analysis
 Deep learning