 Research
 Open access
 Published:
MSPCD: predicting circRNAdisease associations via integrating multisource data and hierarchical neural network
BMC Bioinformatics volumeÂ 23, ArticleÂ number:Â 427 (2022)
Abstract
Background
Increasing evidence shows that circRNA plays an essential regulatory role in diseases through interactions with diseaserelated miRNAs. Identifying circRNAdisease associations is of great significance to precise diagnosis and treatment of diseases. However, the traditional biological experiment is usually timeconsuming and expensive. Hence, it is necessary to develop a computational framework to infer unknown associations between circRNA and disease.
Results
In this work, we propose an efficient framework called MSPCD to infer unknown circRNAdisease associations. To obtain circRNA similarity and disease similarity accurately, MSPCD first integrates more biological information such as circRNAmiRNA associations, circRNAgene ontology associations, then extracts circRNA and disease highorder features by the neural network. Finally, MSPCD employs DNN to predict unknown circRNAdisease associations.
Conclusions
Experiment results show that MSPCD achieves a significantly more accurate performance compared with previous stateoftheart methods on the circFunBase dataset. The case study also demonstrates that MSPCD is a promising tool that can effectively infer unknown circRNAdisease associations.
Background
Circular RNA is a special class of singlestranded, noncoding RNA, which forms a closed covalent ring structure by connecting the 3â€² and 5â€² ends through exon or intron circularization. In the 1970s, circRNA was discovered for the first time in Viroids and Sendai virus particles of infected plants by electron microscopy and other technologies [1]. Later, circRNA was also found in both Animal cells [2] and fungal cells [3]. Due to the limitations of biotechnology and the particularity of structure, circRNA was considered an artifact or misssplicing product. Therefore, in the following decades, it was not taken seriously by scientists. The past decades have seen a growth in highthroughput sequencing and other related technologies. Biologists discovered that circRNA is widely present in archaea and may have biological functions [4]. Simultaneously, Salzman et al. [5] also proved that circRNA is a general feature in the gene expression program of human cells and may play an essential role in various biological functions. Then Nature published two articles [6, 7] about the biological function of circRNA, which unveiled the mystery of circRNA for the first time. Since then, circRNA has raised the interest of many biologists. PubMed, a wellknown biomedical database, has collected more than 7000 articles about circRNA from 2012 to December 2020, and it is still on the rise. Many of the articles are focused on the research of associations between circRNA and diseases due to its resistance to exonucleasemediated degradation and higher stability than most linear RNA in cells.
Existing experimental results reveal that circRNA plays a crucial role in diseases by interacting with diseaserelated miRNAs and has excellent potential to become a new clinical diagnostic marker. Hense et al. [8] have confirmed that CDR1as and miR7 are coexpressed in the mouse brain and affect midbrain development. The expression level of \(hsa\_circRNA\_100855\) is higher in patients with cervical lymph node metastasis or late clinical treatment at stage T3 [9]. For the treatment of depression, \(hsa\_circRNA\_103636\) is a potential new biomarker [10]. Liu et al. [11] used circRNA chips to screen the differential expression of circRNA and coexpression analysis of ceRNA between arthritis patients and normal people, concluding that circRNACER may be a potential target for arthritis treatment. In addition, circRNA is also closely related to atherosclerosis [12], diabetes [13], Ruan virus disease [14], viral hepatitis [15], and neurological diseases [16]. There is a strong correlation between circRNA and diseases, so recognizing their associations is essential to disease treatment and diagnosis. However, these experimental methods are expensive, difficult, and slowprogressing. An effective computational method is necessary for identifying circRNAdisease associations.
In recent years, many computational methods were proposed and mostly divided into two types: methods based on network and methods based on machine learning. As for networkbased methods, Lei et al. [17] proposed a new computational path weighted method for predicting circRNAdisease associations on the circR2Disease dataset. Fan et al. [18] presented a heterogeneous networkbased model, named KATZHCDA, by integrating disease similarity matrix, circRNA expression profiles, and known circRNAdisease associations and using the KATZ model to measure circRNAdisease associations. Deng et al. [19] introduced the KATZCPDA for identifying circRNAdisease associations with multiple heterogeneous networks constructed by the integrations among circRNAs, proteins, and diseases. Zou et al. [20] constructed multiple similarity networks and association networks and used the double matrix factorization method to infer circRNAdisease associations. Lei et al. [21] used the random walk with restart algorithm to weight the features and then used the knearest neighbor to predict unknown circRNAs and disease associations.
As for methods based on machine learning, Zheng et al. [22] provided iCDACGR based on nonlinear information and quantify location to predict the circRNAdisease associations. The method first uses Chaos Game Representation to quantify the nonlinear sequence relationship of circRNA with biological sequence position information. Fan et al. [23] presented a novel approach MSFCNN using CNN. The similarity kernels of circRNA or disease are integrated with the similarity kernel fusion method. Then, they constructed the feature matrix using interaction features and multiple similarity kernels among diseases, miRNAs and circRNAs. Finally, the model predicts potential circRNAdisease association by trained CNN with the features matrix input. Wang et al. [24] proposed a new model implemented by extreme learning machine and CNN. The CNN model is constructed for effective hiddenfeature extraction, and the ELM classifier is designed to identify potential circRNAdisease association on the circR2Disease dataset. Xiao et al. [25] presented a method by graphbased multilabel learning to predict potential associations. The model contains the graph regularization and mixednorm constraint terms to make a better prediction. Wei et al. [26] provided iCircDAMF using matrix factorization to predict the circRNAdisease associations. Zhao et al. [27] developed a novel method IBNPKATZ with KATZ measure and the bipartite network projection. Lei et al. [28] designed a model using gradient boosting decision tree to predict circRNAdisease associations on the circR2Disease dataset. Wang et al. [29] developed a new model based on Fast learning with graph convolutional networks (FastGCN). Ding et al. [30] presented an approach by logistic regression model and the random walk. Chen et al. [?] first constructed multiple association networks, then integrated multiple similarities to generate circRNA and disease features, and then used graph attention network to predict circRNAdisease associations. The association between disease and circRNA is a complex process involving plentiful biological information. Due to the lack of full use of relevant biological information, the performance of the above models has much room to improve.
In this work, we develop a method called MSPCD, which calculates circRNA similarity and disease similarity more accurately by integrating more biological information such as circRNAdisease associations, circRNAgene ontology (GO) associations. MSPCD also utilizes the neural network to extract circRNA and disease highorder features and adopts the DNN to infer unknown circRNAdisease associations. To verify the performance of MSPCD, fivefold crossvalidation is performed on the circFunBase dataset. The AUC value of MSPCD is 0.9904 on the circFunBase dataset. Moreover, we compare MSPCD with several stateoftheart computational frameworks on the circFunBase dataset and perform a case study. The experimental results demonstrate that MSPCD is efficient in inferring unknown circRNAdisease associations.
Results and discussion
The performance of MSPCD based on fivefold crossvalidation
To evaluate MSPCDâ€™s predictive ability in circRNAdisease associations, we implement fivefold crossvalidation experiments on the circFunBase dataset. The experimental results are summarized in TableÂ 1. In addition, we also plot the ROC curve, as shown in Fig.Â 1. From the table, we can see that the five experimentsâ€™ AUC values of the MSPCD model reach 0.9903, 0.9924, 0.9908, 0.9879, 0.9907, respectively, and the average value is 0.9904. In terms of accuracy index, the five experimentsâ€™ accuracy values are 0.9505, 0.9631, 0.9296, 0.9371, and 0.9463, respectively, with an average value of 0.9453. The \(F1\_score\) reflects a harmonic average of the accuracy and recall rate of the model. The \(F1\_score\) values of the five experiments are 0.9501, 0.9652, 0.9314, 0.9387, 0.9473, respectively, and the average value is 0.9463. Besides, the average precision and recall are 0.9246 and 0.9702, respectively. The above experimental results demonstrate that MSPCD has good performance in inferring unknown circRNAdisease associations.
Comparison with existing stateoftheart methods
In this section, we compare the model with several stateoftheart methods (DMFCDA [31], KATZCPDA [19], AE_RF [32], GBDTCDA [28], IMSCDA [33], AE_DNN [34]). DMFCDA regards circRNAdisease associations prediction as a kind of recommendation problem. Firstly, DMFCDA extracts circRNA and disease latent features from the original circRNAdisease association matrix, respectively, then cascades circRNA and disease latent features to represent circRNA and disease pair. Finally, DMFCDA utilizes a DNN to realize the prediction of circRNA and disease associations. KATZCPDA first integrates multiple heterogeneous networks including circRNA, protein, and disease, and then uses the KATZ method to predict the relationship between circRNA and disease. AE_RF firstly obtains circRNA similarity and disease similarity by integrating circRNA functional similarity, circRNA Gaussian interaction profile kernel similarity, disease semantic similarity and disease Gaussian interaction profile kernel similarity. Then, AE_RF uses autoencoder for feature selection, and employs Random Forest to give the final circRNAdiseases association predictions. GBDTCDA uses circRNArelated expression profiles, circRNA sequences, and gene ontology (GO) terms data to construct a circRNA similarity network, and then uses the GBDT algorithm to identify circRNAdisease associations. IMSCDA first combines the semantic similarity of diseases, Jaccard similarity and Gaussian interaction profile kernel similarity of disease and circRNA. Then IMSCDA uses stacked autoencoder to extract latent features. Finally, the random forest classifier is used to predict the association between circRNA and disease. AE_DNN resembles the AE_RF and have several vital improvements. AE_DNN replaces circRNA functional similarity with circRNA sequence similarity when integrating circRNA similarity, and replaces the RF classifier with the DNN classifier.
We perform fivefold crossvalidation experiments. TableÂ 2 shows experimental results for the seven methods. Furthermore, we also plot ROC curves, as shown in Fig.Â 2. From the table, we can see that MSPCD AUC value achieves the best result, which is much higher than that of the secondbest method by 4.12%. And compared with other evaluation indicators, it has also achieved the best results on accuracy, recall and \(F1\_score\). Our model can achieve such good results because it not only integrates additional biological information to obtain circRNA similarity and disease similarity, but also utilizes hierarchical neural networks to predict circRNAdisease associations.
In order to further evaluate the performance of MPSCD, we conduct experimental comparisons on an independent testing dataset. We randomly select 20% of the samples from the circFunBase dataset as the independent testing dataset, which is, 2984*20% \(\approx\) 596 samples. The remaining \(2984596\) = 2388 samples are divided into five parts of roughly the same size, which are used in the training dataset and the validation dataset to crossvalidate the model. After this division, we ensure that the independent testing dataset does not overlap with other datasets. We draw the ROC curve of these methods on the independent testing dataset. From Fig.Â 3, we can see that the AUC value of MSPCD is 0.9853 on independent testing dataset.
Comparison of different datasets
To further verify the performance of MSPCD, we perform a fivefold crossvalidation experiment on the circR2Disease dataset. The experimental results are illustrated in TableÂ 3, and the ROC curve is shown in Fig.Â 4. From the table, we can see that the average AUC value, accuracy, precision, recall, \(F1\_score\) of MSPCD on circR2Disease are 0.9526, 0.9157, 0.9119, 0.9219, and 0.9163, respectively.
In addition, in order to further verify the application capabilities of MSPCD in different datasets, we also compared MSPCD with the above stateoftheart methods in the circR2Disease database.The ROC curves of these methods are shown in Fig.Â 5. From the figure, we can see that MSPCD has achieved the best AUC value.The experimental results confirm that MSPCD can be applied to different datasets.
Analysis effects of the length of highorder feature
In the model, we use neural networks to extract circRNA and disease highorder features, respectively. If the highorder featureâ€™s length is too short, the model will be unable to thoroughly learn the complicated relationship between circRNA and disease. If the highorder featureâ€™s length is too long, the risk of overfitting will be increased. In this section, to study the effect of the length of highorder feature on circRNA and disease associations prediction, we set the length of highorder feature to 8, 16, 32, 64, 128 for experimental comparison.
The experimental results are shown in Fig.Â 6. From the figure, we can see that the AUC value does not change much when the highorder featureâ€™s length is in the range of 16 to 128. This is because we use regularization to alleviate overfitting.
Comparison with different classifiers
In predicting circRNAdisease associations, after obtaining circRNA similarity and disease similarity, many previous models directly cascade them to represent each circRNAdisease pair. However, MSPCD firstly uses neural networks to extract highorder features from circRNA similarity and disease similarity. Then, MSPCD employs neural networks to predict circRNAdisease associations. In this section, to verify the effectiveness of hierarchical neural network of our model, we cascade circRNA similarity and disease similarity to represent each circRNA and disease pair, and then directly use several classical classifiers (DNN, RF, SVM) to infer unknown circRNAdisease associations.
We have carried out fivefold crossvalid experiments for these classifiers. The experiments are shown in TableÂ 4 and Fig.Â 7. The AUC values of MSPCD, DNN, SVM and RF are 0.9904, 0.9763, 0.9679 and 0.8983 respectively. The experimental results illustrate that hierarchical neural networks used in MSPCD can improve the modelâ€™s ability in predicting circRNAdisease association. At the same time, it is worth noting that the AUC value of directly using DNN and SVM are also higher than other previous computational methods, which reverifies that it is useful to obtain circRNA similarity and disease similarity by fusing various biological information.
Case study
To verify the practical capacity of MSPCD in predicting circRNAdisease associations, we perform a case study on the circFunbase dataset. CircFunBase contains 2597 circRNAs ,67 types of diseases and 2984 comfirmed circRNAdisease associations. We used the known circRNAdisease associations in circFunBase to train the MSPCD model, and then predict the remaining 196,944 unknown circRNAs and disease associations. Next, we ranked the unknown circRNAs and disease associations according to the predicted scores. We took out the top 15 circRNAs and disease associations and performed literature validation in the PubMed database. As shown in TableÂ 5, the results show that five of the top 15 circRNAdisease candidate associations are confirmed in PubMed. Itâ€™s worth noting that the remaining ones, which have not been confirmed, they are potential circRNAdisease associations to be confirmed.
Conclusion
Understanding the relationship between circRNA and disease will help us recognize the disease mechanism, which is significant for accurate staging and remedy of disease. Additionally, with the formation of many databases about circRNA, it is possible to explore the associations between circRNA and disease by computational methods, which complements for the high cost of biological methods. The previous computational methods have limits because they do not fully consider relevant biological information, resulting in low accuracy of prediction. We developed a method called MSPCD to infer unknown circRNAdisease associations. To obtain similarity of circRNA and disease more accurately MSPCD firstly integrates various biological information. Then, MSPCD extracts highorder features of circRNA and disease by the neural network. Finally, MSPCD utilized DNN to obtain the prediction result. We implemented fivefold crossvalidation experiments. the AUC value of MSPCD reached 0.9904 on circFunBase dataset, which outperformed other previous models. The comprehensive experimental results illustrate that MSPCD has a good performance in inferring unknown circRNAdisease associations. In MSPCD, circRNAdisease association prediction is modeled as supervised learning. The sparse supervisory signal leads to limited performance. Selfsupervised learning mitigates the effects of sparse supervision signals by pretraining on largescale dataset without manual annotations. In the future, we will consider applying selfsupervised learning to circRNAdisease association prediction.
Materials and methods
Problem description
Limited by time and cost, exploring the correlations between circRNAs and human diseases based on biological experiments has encountered many difficulties and bottlenecks. Instead of traditional experiments, computational methods are proved to be an efficient and accurate way to discover the potential connections between circRNAs and diseases.
Data set
We obtained the circRNAdisease association data from CircFunBase [35]. As a biological information database for circRNAs, CircFunBase provides highquality functional circRNA resources. We finally extracted 2984 verified circRNAdisease associations, including 2597 circRNAs and 67 diseases, the same as Zheng et al. [36]. Based on this data set, we established the circRNAdisease association matrix \(S_d\). If circRNA \(c_m\) is related to disease \(d_n\), the value of \(S_{d}(m, n)\) is 1; otherwise, it is 0.
We also built another circRNAdisease association dataset from circR2Disease [28]. The circR2Disease dataset includes 612 circRNAdisease associations involving 533 circRNAs and 89 diseases. The details of the database are shown in TableÂ 6.
CircRNA sequence similarity
We acquire the circRNA sequence information from CircFunBase and CircBase (http://www.circbase.org/) and establish the circRNA sequence similarity model based on the Levenshtein distance [37]. Levenshtein distance is a kind of edit distance widely used to measure the similarity between two strings. It is defined as the minimum number of edits to convert a source string into a target string. It only allows three singlecharacter operations: insertion, deletion, and replacement. According to previous relevant researches, the cost of insertion and deletion are set to 1, and the cost of replacement is set to 2. Thus, we can calculate the sequence similarity of circRNA \(c_m\) and circRNA \(c_n\) using the following formula:
where \(cost_{min}\) indicates the Levenshtein distance between the sequence of \(c_m\) and the sequence of \(c_n\), and \(l(c_m)\) indicates the sequence length of circRNA \(c_m\).
CircRNA functional similarity
Based on the assumption that circRNAs with similar functions are associated with similar diseases, gene ontology (GO) terms, and miRNAs, we also utilize circRNA association information about GO terms and miRNA. These abundant biological materials support us to explore the potential relationship between circRNA and disease entirely. The association data of circRNAGO and circRNAmiRNA were obtained from CircFunBase through web crawler technology and applied to construct the twoâ€™s association matrices, \(S_g\) and \(S_m\).
The circRNA functional similarity model includes three aspects: disease, Go terms, and miRNA. We use the Jaccard similarity coefficient to measure the similarity score. The diseasebased score of circRNA \(c_m\) and circRNA \(c_n\) can be calculated by the following formula:
where \(TD(c_m)\) denotes the binary vector formed by the mth row in matrix \(S_d\). For \(c_m\) and \(c_n\), their functional similarity score based on GO terms can be calculated by:
where the binary vector \(TG(c_m)\) is the mth row of the correlation matrix \(S_g\). The similarity score of \(c_m\) and \(c_n\) based on miRNA can be calculated as follows:
where \(TM(c_m)\) represents the mth row of the association matrix \(S_m\). To utilize the above three functional similarities at a comprehensive level, we obtain the final circRNA functional similarity by taking the average value of them, which is computed by:
CircRNA GIP kernel similarity
Gaussian interaction profile (GIP) kernel similarity has been widely applied to extract the network topology information to predict the interaction between biomolecules. According to the biological hypothesis that functionally comparable circRNAs are inclined to be associated with semblable diseases, we calculate the GIP kernel similarity between circRNAs through the circRNAdisease adjacent matrix \(S_d\). The calculation formula for circRNA \(c_m\) and circRNA \(c_n\) is as follows:
where \(\delta _c\) is the parameter of kernel bandwidth, and nc denotes the number of circRNAs.
Disease GIP kernel similarity
Many studies have utilized GIP similarity to measure the similarity between diseases because the more similar diseases are, the more similar their correlations with circRNAs. The GIP similarity for disease \(d_m\) and disease \(d_n\) can be calculated by:
where \(\delta _d\) is the width parameter, \(TC(d_m)\) denotes the binary vector formed by the mth column in the association matrix \(S_d\), and nd denotes the number of diseases.
Disease semantic similarity
To construct the semantic similarity model, we use MeSH, a database that supplies a meticulous classification scheme and can be got from (https://www.ncbi.nlm.nih.gov/mesh/). The relationships between diseases can be expressed as a directed acyclic graph (DAG), where nodes represent diseases and edges represent their associations. If a disease k is in the DAG of disease d, its contribution \(G_d(k)\) to d is as follows:
where \(\mu\) is the contribution element. According to the previous paper by Wang et al. [38], we set its value to 0.5. For disease \(d_m\) and disease \(d_n\), their first semantic similarity model \(DS_1(dm, dn)\) can be calculated by:
where \(N_{d_m}\) is defined as the set of diseases in the DAG of disease \(d_m\). However, model \(DS_1\) only considers the correlation between the layers in disease DAG and ignores the fact that different diseases appear in DAGs at various times. Thus, we construct the second disease semantic model, which calculates the semantic contribution value by the following formula:
where n(DAGs(k)) indicates the number of DAGs including disease k, and n(dis) indicates the total number of all the diseases. We can calculate the semantic similarity score of disease \(d_m\) and disease \(d_n\) according to the second model as follows:
MSPCD model
The MSPCD model employs hierarchical neural networks to reveal the latent assoications between circRNAs and diseases. In this part we will introduce the implementation process of MSPCD in detail.
Multisource information fusion
In the aforementioned sections, we have acquired circRNA sequence similarity, circRNA functional similarity, circRNA GIP similarity, disease GIP similarity, and disease semantic similarity. To fully take advantage of data from different sources, we need to fuse the complex similarity information. A better descriptor of the relationship between circRNAs and diseases can help us dig deeper into circRNAdisease associations.The flow chart is showed in Fig.Â 8.
The integrated similarity of circRNA can be gained by combining circRNA sequence similarity CS, circRNA functional similarity CF, and circRNA GIP similarity CG. In view of the fact that some circRNAs in the circRNAdisease matrix \(S_d\) lack the sequence information required for the experiment, we define the binary flag value FQ to represent the two opposite situations. If the value of \(FQ_{mn}\) is 1, it means that circRNA \(c_m\) and circRNA \(c_n\) have sequence similarity. Otherwise, \(FQ_{mn}\) is 0. The fusional similarity matrix of circRNA is defined by the following formula:
For diseases, we adopt GIP similarity GS, semantic similarity \(DS_1\) and \(DS_2\) to measure the integrated similarity. Since some disease pairs have no matching semantically similar association, the binary flag value FS is defined to distinguish different situations. If there is a semantic similarity of disease \(d_m\) and \(d_n\), the value of \(FS_{mn}\) is 1; otherwise, it is 0. The fusional similarity matrix of disease is defined as follows:
Extract highorder features of circRNA and diseases
There are noise and redundancy in the fundamental features. Designing more efficient features to characterize circRNAs and diseases excellently benefits the performance of the model. Therefore, we take the original characteristics as input and pass them through three layers of fully connected neural networks to extract the dense latent features of circRNAs and diseases, respectively. The activation function rectified linear unit (ReLU) is adopted in the layers mentioned above.
We use m to represent the indexes of fully connected layers. The value of m in this section is 1, 2, or 3. e represents the initial input vectors of circRNAs. \(w_{em}\) and \(b_{em}\) respectively represent the weight coefficient and bias of the corresponding layer. For circRNAs, the output of each connected layer can be expressed as follows:
If m is 1, then \(O_{e0}\) is equal to e. Likewise, f represents the primary input vectors of the diseases. \(w_{fm}\) and \(b_{fm}\) respectively represent the weight coefficient and bias of the corresponding layer. The output of each connected layer for diseases can be expressed by:
\(O_{f0}\) is equal to f when the value of m is 1. The outputs of the third layers, \(O_{e3}\) and \(O_{f3}\), are the complicated highlevel features of circRNAs and diseases.
Feature interaction by the dot product
So far, for any circRNA i and any disease j in the data set, we respectively project their initial feature vectors into N*1dimensional highorder feature vectors, CH and DH. The values of CH and DHâ€™s corresponding positions are multiplied to learn the interaction feature vector CD between circRNA i and disease j, and its dimension is also N*1. We do not directly use interactive characteristics to predict the correlation between circRNA i and disease j but concatenate them with the highorder feature CH of i and the highorder feature DH of j. The generated vector is adopted to represent the circRNAdisease pair \(ij\), which is defined as follows:
Predict circRNAdisease associations by DNN
We send the feature vectors acquired above into three fully connected layers. ReLU is utilized as the activation function of the first two fully connected layers, and the last activation function is sigmoid to get the ultimate binary results. VG denotes the matrix composed of all the feature vectors generated in the previous step. It is also the input of the fourth fully connected layer. \({\hat{y}}\) denotes the output of the sixth layer, that is, the predicted label values.
where \(w_4\), \(w_5\), \(w_6\) are the weights of the corresponding connection layer. \(b_4\), \(b_5\), \(b_6\) are the biases of the corresponding connection layer. Before training the model, we notice that the size of the constructed cirRNAdisease association matrix Sd is 2597*67, but there are only 2984 associations in Sd. To avoid the impact of unbalanced samples, we randomly pick 2984 negative samples from the remaining unverified associations to reach the identical quantity as the positive samples. The selected negative set is not strictly credible, and there may be unproven positive associations. But the influence is negligible because they only occupy a tiny proportion in the whole negative sample set. The flow chart of MSPCD is shown in Fig.Â 9.
Evaluation metrics
To evaluate the performance of MSPCD, we choose fivefold crossvalidation. First, the dataset is divided into five subsets. Then, four subsets are used for training set and one subset for testing. We repeat that process until all subsets have used for test set. We choose AUC, accuracy, precision, recall, and \(F1\_score\) as the evaluation indicators and take the average values of five experimental results as the final result. AUC is the area under the ROC curve and could be regarded as the probability that the predicted score of positive samples is greater than that of negative examples. The remaining indicators are as follows:
where TP and TN are the numbers of circRNAdisease association pairs and nonassociation pairs which are correctly identified, respectively; FP and FN are the numbers of circRNAdisease association pairs and nonassociation pairs which are incorrectly identified, respectively.
We implement MSPCD in Keras 2.2.5. The batch size and learning rate are tuned by grid search in \(\left\{ 32, 64,128,256,512\right\}\) and \(\left\{ 0.0005,0.001,0.002,0.0025\right\}\), respectively. The dimension of highorder features we search for is \(\left\{ 8, 16, 32, 64, 128\right\}\). The number of training epochs is set to 200.
Availability of data and materials
The data and code used in the current study is available at:https://github.com/dayunliu/MSPCD.
Abbreviations
 ROC:

Rceiver operating characteristic
 TPR:

True positive rate
 FPR:

False positive rate
 AUC:

Area under ROC curve
 SVM:

Support vector machine
 RF:

Random forest
 DAG:

Directed acyclic graph
References
Sanger HL, Klotz G, Riesner D, Gross HJ, Kleinschmidt AK. Viroids are singlestranded covalently closed circular RNA molecules existing as highly basepaired rodlike structures. Proc Natl Acad Sci. 1976;73(11):3852â€“6. https://doi.org/10.1073/pnas.73.11.3852.
Hsu MT, CocaPrados M. Electron microscopic evidence for the circular form of RNA in the cytoplasm of eukaryotic cells. Nature. 1979;280(5720):339â€“40. https://doi.org/10.1038/280339a0.
Arnberg AC, Ommen GJBV, Grivell LA, Bruggen EFJV, Borst P. Some yeast mitochondrial RNAs are circular. Cell. 1980;19(2):313â€“9. https://doi.org/10.1016/00928674(80)90505x.
Danan M, Schwartz S, Edelheit S, Sorek R. Transcriptomewide discovery of circular RNAs in archaea. Nucleic Acids Res. 2011;40(7):3131â€“42. https://doi.org/10.1093/nar/gkr1009.
Salzman J, Gawad C, Wang PL, Lacayo N, Brown PO. Circular RNAs are the predominant transcript isoform from hundreds of human genes in diverse cell types. PLoS ONE. 2012;7(2):30733.
Hansen TB, Jensen TI, Clausen BH, Bramsen JB, Finsen B, Damgaard CK, Kjems J. Natural RNA circles function as efficient microRNA sponges. Nature. 2013;495(7441):384â€“8. https://doi.org/10.1038/nature11993.
Memczak S, Jens M, Elefsinioti A, Torti F, Krueger J, Rybak A, Maier L, Mackowiak SD, Gregersen LH, Munschauer M, Loewer A, Ziebold U, Landthaler M, Kocks C, le Noble F, Rajewsky N. Circular RNAs are a large class of animal RNAs with regulatory potency. Nature. 2013;495(7441):333â€“8. https://doi.org/10.1038/nature11928.
Hansen TB, Wiklund ED, Bramsen JB, Villadsen SB, Statham AL, Clark SJ, Kjems J. miRNAdependent gene silencing involving ago2mediated cleavage of a circular antisense RNA. EMBO J. 2011;30(21):4414â€“22. https://doi.org/10.1038/emboj.2011.359.
Wang P, Wu T, Zhou H, Jin Q, He G, Yu H, Xuan L, Wang X, Tian L, Sun Y, Liu M, Qu L. Long noncoding RNA NEAT1 promotes laryngeal squamous cell cancer through regulating miR107/CDK6 pathway. J Exp Clin Cancer Res. 2016. https://doi.org/10.1186/s130460160297z.
Cui X, Niu W, Kong L, He M, Jiang K, Chen S, Zhong A, Li W, Lu J, Zhang L. hsa_circRNA_103636: potential novel diagnostic and therapeutic biomarker in major depressive disorder. Biomark Med. 2016;10(9):943â€“52. https://doi.org/10.2217/bmm20160130.
Liu Q, Zhang X, Hu X, Dai L, Fu X, Zhang J, Ao Y. Circular RNA related to the chondrocyte ECM regulates MMP13 expression by functioning as a MiR136 â€˜spongeâ€™ in human cartilage degradation. Sci Rep. 2016. https://doi.org/10.1038/srep22572.
Burd CE, Jeck WR, Liu Y, Sanoff HK, Wang Z, Sharpless NE. Expression of linear and novel circular forms of an INK4/ARFassociated noncoding RNA correlates with atherosclerosis risk. PLoS Genet. 2010;6(12):1001233. https://doi.org/10.1371/journal.pgen.1001233.
Wang Y, Liu J, Liu C, Naji A, Stoffers DA. MicroRNA7 regulates the mTOR pathway and proliferation in adult pancreaticcells. Diabetes. 2012;62(3):887â€“95. https://doi.org/10.2337/db120451.
Huntzinger E, Izaurralde E. Gene silencing by microRNAs: contributions of translational repression and mRNA decay. Nat Rev Genet. 2011;12(2):99â€“110. https://doi.org/10.1038/nrg2936.
Taylor JM. Hepatitis delta virus. Virology. 2006;344(1):71â€“6. https://doi.org/10.1016/j.virol.2005.09.033.
Lukiw WJ. Circular RNA (CircRNA) in Alzheimerâ€™s disease (AD). Front Genet. 2013. https://doi.org/10.3389/fgene.2013.00307.
Lei X, Fang Z, Chen L, Wu FX. PWCDA: path weighted method for predicting CircRNAdisease associations. Int J Mol Sci. 2018;19(11):3410. https://doi.org/10.3390/ijms19113410.
Fan C, Lei X, Wu FX. Prediction of CircRNAdisease associations using KATZ model based on heterogeneous networks. Int J Biol Sci. 2018;14(14):1950â€“9. https://doi.org/10.7150/ijbs.28260.
Deng L, Zhang W, Shi Y, Tang Y. Fusion of multiple heterogeneous networks for predicting CircRNAdisease associations. Sci Rep. 2019. https://doi.org/10.1038/s4159801945954x.
Zuo ZL, Cao RF, Wei PJ, Xia JF, Zheng CH. Double matrix completion for circRNAdisease association prediction. BMC Bioinform. 2021;22(1):307.
Lei X, Bian C. Integrating random walk with restart and kNearest Neighbor to identify novel circRNA disease association. Sci Rep. 2020;10(1):1943.
Zheng K, You ZH, Li JQ, Wang L, Guo ZH, Huang YA. iCDACGR: identification of CircRNAdisease associations based on chaos game representation. PLoS Comput Biol. 2020;16(5):1007872. https://doi.org/10.1371/journal.pcbi.1007872.
Fan C, Lei X, Pan Y. Prioritizing CircRNAdisease associations with convolutional neural network based on multiple similarity feature fusion. Front Genet. 2020. https://doi.org/10.3389/fgene.2020.540751.
Wang L, You ZH, Huang YA, Huang DS, Chan KCC. An efficient approach based on multisources information to predict CircRNAâ€”disease associations using deep convolutional neural network. Bioinformatics. 2019;36(13):4038â€“46. https://doi.org/10.1093/bioinformatics/btz825.
Xiao Q, Yu H, Zhong J, Liang C, Li G, Ding P, Luo J. An insilico method with graphbased multilabel learning for largescale prediction of CircRNAdisease associations. Genomics. 2020;112(5):3407â€“15. https://doi.org/10.1016/j.ygeno.2020.06.017.
Wei H, Liu B. iCircDAMF: identification of CircRNAdisease associations based on matrix factorization. Brief Bioinform. 2019;21(4):1356â€“67. https://doi.org/10.1093/bib/bbz057.
Zhao Q, Yang Y, Ren G, Ge E, Fan C. Integrating bipartite network projection and KATZ measure to identify novel CircRNAdisease associations. IEEE Trans Nanobiosci. 2019;18(4):578â€“84. https://doi.org/10.1109/tnb.2019.2922214.
Lei X, Fang Z. GBDTCDA: predicting CircRNAdisease associations based on gradient boosting decision tree with multiple biological data fusion. Int J Biol Sci. 2019;15(13):2911â€“24. https://doi.org/10.7150/ijbs.33806.
Wang L, You ZH, Li YM, Zheng K, Huang YA. GCNCDA: a new method for predicting CircRNAdisease associations based on graph convolutional network algorithm. PLoS Comput Biol. 2020;16(5):1007568. https://doi.org/10.1371/journal.pcbi.1007568.
Ding Y, Chen B, Lei X, Liao B, Wu FX. Predicting novel CircRNAdisease associations based on random walk and logistic regression model. Comput Biol Chem. 2020;87:107287. https://doi.org/10.1016/j.compbiolchem.2020.107287.
Lu C, Zeng M, Zhang F, Wu F, Li M, Wang J. Deep matrix factorization improves prediction of human CircRNAdisease associations. IEEE J Biomed Health Inform. 2020. https://doi.org/10.1109/jbhi.2020.2999638.
Deepthi K, Jereesh AS. Inferring potential CircRNAâ€”disease associations via deep autoencoderbased classification. Mol Diagn Ther. 2020. https://doi.org/10.1007/s4029102000499y.
Wang L, You ZH, Li JQ, Huang YA. IMSCDA: prediction of CircRNAdisease associations from the integration of multisource similarity information with deep stacked autoencoder model. IEEE Trans Cybern. 2020;51:5522â€“31.
Deepthi K, Jereesh AS. An ensemble approach for CircRNAdisease association prediction based on autoencoder and deep neural network. Gene. 2020;762:145040. https://doi.org/10.1016/j.gene.2020.145040.
Meng X, Hu D, Zhang P, Chen Q, Chen M. CircFunBase: a database for functional circular RNAs. Database. 2019;2019.
Zheng K, You ZH, Li JQ, Wang L, Guo ZH, Huang YA. ICDACGR: Identification of CircRNAdisease associations based on chaos game representation. PLoS Comput Biol. 2020;16(5):1007872.
Levenshtein VI. Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl. 1966;10:707â€“10.
Wang D, Wang J, Lu M, Song F, Cui Q. Inferring the human microRNA functional similarity and functional network based on microRNAassociated diseases. Bioinformatics. 2010;26(13):1644â€“50.
Acknowledgements
Not applicable.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 23 Supplement 3, 2022: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM 2021): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume23supplement3.
Funding
This work and publication costs are funded by the National Natural Science Foundation of China under Grants Nos.Â 61972422, 62072058 and 82073339. The funders did not play any role in the design of the study, the collection, analysis, and interpretation of data, or in writing of the manuscript.
Author information
Authors and Affiliations
Contributions
LD and HL conceived the prediction method.DYL,YZL and RQW,JYL wrote the paper.LD,HL and DYL developed the computer programs. LD,HL,JYL and JXZ analyzed the results and revised the paper. All the authors have read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Deng, L., Liu, D., Li, Y. et al. MSPCD: predicting circRNAdisease associations via integrating multisource data and hierarchical neural network. BMC Bioinformatics 23 (Suppl 3), 427 (2022). https://doi.org/10.1186/s12859022049765
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859022049765