DWNN-RLS: regularized least squares method for predicting circRNA-disease associations

Background Many evidences have demonstrated that circRNAs (circular RNA) play important roles in controlling gene expression of human, mouse and nematode. More importantly, circRNAs are also involved in many diseases through fine tuning of post-transcriptional gene expression by sequestering the miRNAs which associate with diseases. Therefore, identifying the circRNA-disease associations is very appealing to comprehensively understand the mechanism, treatment and diagnose of diseases, yet challenging. As the complex mechanism between circRNAs and diseases, wet-lab experiments are expensive and time-consuming to discover novel circRNA-disease associations. Therefore, it is of dire need to employ the computational methods to discover novel circRNA-disease associations. Result In this study, we develop a method (DWNN-RLS) to predict circRNA-disease associations based on Regularized Least Squares of Kronecker product kernel. The similarity of circRNAs is computed from the Gaussian Interaction Profile(GIP) based on known circRNA-disease associations. In addition, the similarity of diseases is integrated by the mean of GIP similarity and sematic similarity which is computed by the direct acyclic graph (DAG) representation of diseases. The kernels of circRNA-disease pairs are constructed from the Kronecker product of the kernels of circRNAs and diseases. DWNN (decreasing weight k-nearest neighbor) method is adopted to calculate the initial relational score for new circRNAs and diseases. The Kronecker product kernel based regularised least squares approach is used to predict new circRNA-disease associations. We adopt 5-fold cross validation (5CV), 10-fold cross validation (10CV) and leave one out cross validation (LOOCV) to assess the prediction performance of our method, and compare it with other six competing methods (RLS-avg, RLS-Kron, NetLapRLS, KATZ, NBI, WP). Conlusion The experiment results show that DWNN-RLS reaches the AUC values of 0.8854, 0.9205 and 0.9701 in 5CV, 10CV and LOOCV, respectively, which illustrates that DWNN-RLS is superior to the competing methods RLS-avg, RLS-Kron, NetLapRLS, KATZ, NBI, WP. In addition, case studies also show that DWNN-RLS is an effective method to predict new circRNA-disease associations.


Background
Circular RNAs (circRNAs) are a class of endogenous noncoding RNAs with distinct properties and diverse cellular functions, unlike the linear RNAs with 5' and 3' termini which reflect start and stop of the RNA polymerase on the DNA template, and are generated by back splicing (3'-5') or lariat introns [1][2][3][4]. The circRNAs are not easy to be degraded by exoribonucleases because they lack free ends [5,6]. As forming a circRNA is usually considered a rare event in cells, it was suggested that they may be considered errors of normal splicing process [4,7]. Therefore, despite their existence in both unicellular and multicellular organisms, they have been previously even disregarded as transcriptional noise or artifacts [8]. Nevertheless, with the advances of high-throughput deep sequencing and functional genomics, the knowledge of circRNAs has recently been learned substantially [9,10].
To date, circRNAs have been found in various tissues and cell lines of plants, animals and so on [4,11,12]. Some circRNAs can be translated in some tissues or translated into a protein under splicing-dependent, cap-independent manner or other certain conditions [13]. Furthermore, circRNAs are expected to have other functions independent of their host genes because they have much longer half-life than other linear RNA transcripts [10]. Many cir-cRNAs can regulate gene expression because they have strong potential to act as miRNA sponges or decoys [14]. In addition, some circRNAs can also function as protein sponges or decoys, and the best example is that protein MBL is prevented to bind to other targets when being tethered to a circRNA [15]. CircRNA circFoxo3 can also act as a protein scaffold, which binds to sites for mouse MDM2 and p53 [16]. Unlike the above functions of cir-cRNAs are based on the fact that they are located to the cytoplasm, some circRNAs such as exon-intron circRNAs are retained in the nucleus and they may promote with transcription [17].
Through the understanding of functions of circRNAs, many evidences have shown that circRNAs play an important role in occurrence of human complex diseases, such as cancer [18]. CircRNA ciRS-7 has significant implications for diseases through efficiently regulating the activity of miRNA miR-7 [19]. Likewise, by sponging the miR-7, miR-17 and miR-214, cir-ITCH can increase the level of ITCH which further inhibits the Wnt pathway that is frequently aberrant in cancers [20,21]. SRY can affect the proliferation, migration and invasion of cholangiocarcinoma cells, which is the sponge of miR-138 and can strongly suppress its level [22,23]. CircRNA-MYLK level is elevated and correlated with BC (bladder carcinoma) progression and plays an oncogenic role in BC in vitro and vivo [24]. Circ-Foxo3 was minimally expressed in patient tumor samples and in a panel of cancer cells and its expression was found to be significantly increased during the cancer cell apoptosis [16,25]. Circular RNA MTO1 can suppress hepatocellular carcinoma progression by acting as the sponge of miR-9 [26]. In addition, the aberrant expression of circCCDC66 also is associated with a late-stage diagnosis and metastases [27].
In recent years, some databases about circRNAs have been developed to further study the function mechanism of circRNAs. CircBase is the first database about circR-NAs, which merges and unifies data sets of circRNAs and provides the interface to access, download, and browse the evidence supporting their expression within the genomic context [28]. CircRNADb is a comprehensively annotated human circular RNAs database, which containes 32,914 human exonic circRNAs from diversified sources and provides the genomic information, exon splicing, genome sequence, internal ribosome entry site (IRES), open reading frame (ORF) and references of these circRNAs [29]. PlantcircBase is a database of plant circRNAs, which also provided other functions such as visualization of the structures of circRNA based on their genomic position [12]. Likewise, PlantCircNet also is a database of plant cir-cRNAs, which has the main feature of plantCircNet to provide visualized plant circRNA-miRNA-mRNA regulatory networks and can identify metabolic effects of circR-NAs [30]. ExoRBase is a web-accessible database, which provides the circRNA, lncRNA and mRNA information by RNA-seq data analyses of human blood exosomes [31]. CircNet provides tissue-specific circRNA expression profiles and circRNA-miRNA-gene regulatory networks by utilizing sequencing datasets to systematically identify the expression of circRNAs in RNA-seq samples [32]. TSTD also provides the tissue-specific circRNAs and further characterizes the functions of these circRNAs [33]. The cancer somatic mutations that alter miRNA targeting and functioning are provided by SomamiR 2.0 database which also collects the associations between miRNA and other competing endogenous RNAs such as mRNAs, circRNAs and lncRNAs [34]. The CSCD is also a cancer-specific circRNAs database which identifies the cancer-specific circRNAs by analyzed the RNA-seq samples and further predicts the miRNA response element sites and RNA binding protein sites of each circRNA [35]. Circ2Traits is the circRNA-disease associations database, which is constructed by circRNA-miRNA associations, miRNAdisease associations and disease-SNPS associations [18]. To our knowledge, CircR2Disease is the first manually curated database about circRNA-disease associations by reviewing existing literatures and provides the important foundation to study the associations of circRNAs and diseases [36].
In general, we have obtained some significant progresses in understanding features and functions of circRNAs. In addition, some databases about circRNAs have also been constructed. However, current studies of circRNA-disease associations mainly focus on biomedical experimentations that are notoriously expensive and time-consuming. Therefore, there is a very urgent need to predict circRNAdisease associations by computational methods. To our knowledge, the development of computational approach is very limited because the databases of circRNA-disease associations are incomplete. However, circR2Disease provides the chance to effectively predict novel circRNAdisease associations through developing computational methods.
In this study, we develop a novel method (call DWNN-RLS) to predict new circRNA-disease associations. Firstly, DWNN-RLS computes the Gaussian interaction profile (GIP) kernel similarities of circRNAs and diseases based on the known circRNA-disease associations. By considering their direct acyclic graph(DAG) representation, the sematic similarity of diseases is also calculated. We further obtain the final similarity of diseases with the mean of GIP similarity and sematic similarity. Then the association possibility scores of circRNA-disease pairs are predicted by Kronecker product kernel based Regularized Least Squares approach. The kernels of circRNA-disease pairs are calculated by the Kronecker product of kernels of cir-cRNAs and diseases. Furthermore, the decreasing weight k-nearest neighbor (DWNN) method is used to calculate the initial relational scores of new circRNAs and new diseases. In order to assess the prediction performance of DWNN-RLS and compare with other competing methods, we conduct 5-fold cross validation (5CV), 10-fold cross validation (10CV) and leave-one-out cross validation (LOOCV). The experiment results demonstrate that DWNN-RLS outperforms other six competing methods (RLS-avg, RLS-Kron, NetLapRLS, KATZ, NBI, WP) in terms of AUC (area under the ROC curve) values. Specifically, the AUC values of DWNN-RLS in 5CV, 10CV and LOOCV reach 0.8854, 0.9205 and 0.9701, respectively, which are superior to the second best results (KATZ: 0.8224 and 0.8343, RLS-avg: 0.9169). Furthermore, the prediction ability of DWNN-RLS also is illustrated by the case studies.

Similarity of circRNAs
As the successful application of GIP kernel similarity in other relative areas [38][39][40][41][42], we also use it to calculate the similarities of circRNAs. The GIP kernel was computed from the known circRNA-disease associations. Let C = c 1 , c 2 , ..., c N c be the set of N c circRNAs and D = d 1 , d 2 , ..., d N d be the set of N d diseases. Let matrix Y ∈ R N c ×N d represents known circRNA-disease associations, in which the value of y ij is 1 if circRNA i and disease j exists a known association, otherwise 0. Then the GIP similarity of circRNA c i and circRNA c j can be computed as follows: where y c i = y i1 , y i2 , ..., y iN d and y c j = y j1 , y j2 , ..., y jN d are the association profiles of circRNA c i and circRNA c j , respectively. Since the GIP kernel is computed by a decaying function of the distance between the vectors, this function is of the form of a bell-shaped curve. In addition, since a larger value of γ c yields a narrower bell while a smaller value of γ c yields a wider bell, the parameter γ c can be used to regulate the bandwidth of kernel. In this study, parameter γ c is computed as the reciprocal of average number of associations per circRNA.

Similarity of diseases
Firstly, we also compute the GIP similarity of disease d i and disease d j as follows: where In addition, the parameter γ d is used to regulate the bandwidth of kernel. Secondly, we use the Mesh descriptions of diseases to compute the sematic similarity. Specifically, for disease A which can be represented by a DAG (DAG A , DAG A = T A , E A ) in mesh database. Set T A includes the parent diseases nodes of A and itself while set E A includes the direct edges between disease nodes within T A . The similarity of diseases A and B can be calculated as follows: where SV A (t)(SV B (t)) is the sematic value between disease A(B) and t which is the all common ancestors of diseases A and B. In addition, Sem(A) and Sem(B) are the sematic values of diseases A and B, respectively. For disease A, the Sem(A) and SV A (t) can be calculated as follows: where is the layer contribution factor between disease node and its direct ancestor disease nodes in DAG. The value of is set to 0.5 in this study [37]. After computing the GIP similarity and sematic similarity of diseases, we integrate the final similarity of diseases with their mean as follows:

DWNN for new circRNAs and diseases
The good performance of prediction method largely depends on the quality of known circRNA-disease associations. In fact, new circRNAs (or new diseases) have no any association with diseases (or circRNAs). In this study, we use the DMNN to compute the initial association score based on similarities of circRNAs and diseases. Specifically, the initial association score between new circRNA c i and disease d j can be calculated as follows: where N(c i ) is the set of k c i nearest neighbors of new circRNA c i . The parameter k c i is calculated as follows: where simset(c i ) l is the l-th similarity value of the ranked vector based on similarity between circRNA c i and other circRNAs from high to low. Furthermore, the parameter is used to control the range of l that is used to select k nearest neighbors for each new circRNA and disease. In this study, the value of is set to 1, so the value of l is 1 and all neighbors are used to calculate initial association score.
Similarly, we also compute the initial association scores of new disease d j and circRNA c i as follows: where N(d j ) is the set of k d j nearest neighbors of new disease d j . The parameter k d j is also calculated as follows: where simset(d j ) l is the l-th similarity value of the ranked vector based on similarity between disease d j and other diseases from high to low. Parameter is also used to control the range for selecting neighbors.

Kronecker product kernel based regularized least squares(RLS-Kron)
In this study, we use RLS-Kron method to predict new circRNA-disease associations [38,39,43]. Based on the kernel K , the predicted circRNA-disease associations matrix has a simple closed-form solution as follows: in which the parameter σ is a regularizations parameter and is set to 0.2 in this study. Kron-RLS has no any prediction ability when σ is set to 0. The kernel K is calculated from the Kronecker product K c ⊗K d of the circRNA kernel and disease kernel, which is defined as follows: where matrices K c and K d are the similarity matrices of circRNAs and diseases, respectively. In addition, in order to calculate the predicted matrix, Kron-RLS needs to compute the inverse of an N c N d ×N c N d matrix. Therefore, we also use an effective method based on matrix eigenvalue decomposition. According to the matrix theory, the eigenvalues (vectors) of a kronecker product are the Kronecker product of eigenvalues (vectors). Specifically, the kernal can be calculated as follows: where ∧ = ∧ c ⊗ ∧ d and ∨ = ∨ c ⊗ ∨ d are all derived from the eigenvalues decompositions of the two kernel matrices K c and K d . As K c and K d are real symmetric matrices, their specific eigenvalues decompositions process are defined as follows: where ∨ c and ∨ d are orthogonal matrices whose columns are the eigenvectors of K c and K d , respectively. ∧ c and ∧ d are diagonal matrices whose diagonal entries are the eigenvalues of K c and K d , respectively. Therefore, the final predicted circRNA-disease associations matrixŶ can be calculated as follows:Ŷ

Performance evaluation
In this study, we conduct 5CV, 10CV and LOOCV to evaluate the performance of DWNN-RLS for predicting new circRNA-disease associations. AUC (area under the ROC curve) value is used as the evaluation metric. We perform 10 repetitions of 10CV and 5CV. That is, under 10CV, the known circRNA-disease associations data are divided into 10 folds, and each fold takes in turn as the test set and the rest as the train set at each time. Similarly, the data set are randomly divided into 5 folds and each fold takes in turn as the test data and the rest as the train set on each time. In LOOCV, each known circRNA-disease association is in turn chosen as the test set while the rest known circRNA -disease associations as the train set. The larger AUC values show the better prediction ability of the method, while if AUC value is less than or equal to 0.5, the prediction method has no prediction ability.

Comparison with other methods
As there is no competing computational method for predicting circRNA-disease associations in the literature, to assess the performance of our method, we also compare DWNN-RLS against other six effective methods in other relevant prediction issues. These methods include RLSavg [38], RLS-Kron [38], NetLapRLS [44], KATZ [45,46], NBI [47] and WP [47,48]. We briefly review them here. RLS-avg use the average of the output values which are computed from two kernels, respectively. RLS-Kron compute the prediction scores by Kronecker product kernel based on regularised least squares approach. NetLapRLS is used to predict circRNA-disease associations by exploiting information on similarities of links and nodes. KATZ is a network-based method which considers the number of walks between network nodes and lengths in a heterogeneous network to predict associations. NBI is also a network-based method to infer new associations, which only uses cricRNA-disease bipartite network topology similarity. WP and DBSI are recommendation models which directly use the similarities of circRNAs and diseases. Figure 1 shows the AUC curves of seven prediction methods on CircR2Disease data set in terms of 5CV. The AUC value of DWWN-RLS is the highest among the seven methods, indicating that the prediction performance of DWWN-RLS is better than other methods. Figure 2 shows the AUC curves of seven prediction methods in terms of 10CV on CircR2Disease dataset. The AUC value of DWWN-RLS reaches 0.9205, which is  Note that the advantage of prediction performance is more obvious in 10CV and LOOCV than 5CV, indicating that DWWN-RLS can achieve good result based on more known circRNA-disease associations. In addition, the sematic similarity of diseases can improve the prediction performance of DWWN-RLS. When only the GIP Fig. 2 The AUC curves of seven methods in the 10CV Fig. 3 The AUC curves of seven methods in the LOOCV similarity is used, the AUCs of DWNN-RLS are 0.8368, 0.8819 and 0.9423 in 5CV, 10CV and LOOCV, respectively. When the GIP similarity combined with the disease sematic similarity, DWNN-RLS obtains the increased AUCs of 0.8854, 0.9205 and 0.9701 in 5CV, 10CV and LOOCV. By comparing with RLS-Kron method, the DMNN method also can improve the prediction performance. Comparing with KATZ, NBI and WP methods, we think that DWNN-RLS is a machining learning model and has the objective function and solution process that is beneficial to obtain better prediction performance.

Parameter analysis for and σ
To further understand the robustness of DWWN-RLS method, we analyze the influence of parameters and σ on the prediction performance in 10CV. The parameter is used to control the range for selecting k nearest neighbors of cicrRNAs and diseases. The parameter σ is the regularization parameter of DWWN-RLS method. The value of is set to be 1.0 when analyzing parameter σ . Furthermore, we also set the default value of σ to be 0.2 when analyzing parameter . With parameter σ of 0.2, Table 1 demonstrates the prediction performance of DWWN-RLS method in 10CV when ranges from 0.1 to 1.0 with 0.1 increments. The prediction performance of DWWN-RLS method is best when is set to be 1.0, indicating that all neighbors of circRNAs and diseases are involved in calculating their initial associations scores.
Furthermore, Table 2 describes the prediction performances of DWWN-RLS with different values of σ when is set to be 1.0. We can see from Table 2 that DWWN-RLS obtains the best prediction performance when σ is set to be 0.2. Therefore, in this study, we set the default value of σ to be 0.2.

Case studies
After confirming the prediction performance and robustness of DWWN-RLS method in 10CV, 5CV and LOOCV, we further analyze the prediction ability of DWWN-RLS in discovering new circRNA-disease associations. In predicting new circRNA-disease associations, all known circRNA-disease associations on CircR2Disease dataset are chosen as the train set and all other circRNA-disease pairs are the candidate circRNA-disease associations. We adapt DWWN-RLS to compute the prediction scores for these candidate circRNA-disease pairs. Here, we analyze the prediction results of Atherosclerotic vascular disease and Breast cancer.
Atherosclerotic vascular disease is responsible for the majority of cases of CVD (Cardiovascular disease) in both developing and developed countries, which encompasses coronary heart disease, cerebrovascular disease, and peripheral arterial disease, and which also result the CVD, the leading cause of death and disability all over the world [49,50]. Table 3 shows that 2 of top 10 predicted associations are confirmed in the previous literature. Elevated cANRIL expression could lead to worse EC (endothelial cells) inflammation, exacerbating AS (atherosclerosis) [51]. CANRIL is transcribed at a locus of atherosclerotic cardiovascular disease on chromosome 9p21, and induces nucleolar stress and apoptosis, and inhibits the proliferation in smooth muscle cells and macrophages [52]. The cZNF292 also associates with atherosclerotic cardiovascular disease by stimulating angiogenesis through vascular sprouting and cell proliferation [53].
There is approximately 1 in 12 women developing breast cancer in Western Europe and the United States, and which is characterized by a distinct metastatic pattern  involving the regional lymph nodes, bone marrow, lung and liver [54,55]. Table 4 shows the validation results of top 10 new circRNA-disease associations predicted by DWNN-RLS. There is 3 out of top 10 predicted associations that can be validated in previous studies. CircRNAs circGFRA1 and GFRA1 act as ceRNAs in triple negative breast cancer by regulating miR-34a [56]. The human breast cancer cell line MDA-MB-231 are stably transfected with circ-Foxo3, the ectopic expression of the Foxo3 circular RNA could suppress tumor growth, cancer cell proliferation and survival [25]. CDR1as contains more than 70 selectively conserved target sites of miR-7 which can directly downregulate oncogenes in cancers such as breast cancer [57]. Above case studies demonstrate that there are a number of prediction results that have not been confirmed by previous literature. To our knowledge, a possible reason is that the database Circ2Disease are still limited and the new studies have not been published yet. In summary, these predicted circRNA-disease associations deserve being studied and considered in the future.

Discussion
With the advances of RNA-Seq, high-throughput sequencing and other techniques, we have achieved some important progresses in understanding characteristics and functions of cricRNAs. CricRNAs may play key roles in diseases as miRNA sponges or decoys, protein sponges or decoys and regulation gene transcription. Therefore, systematically understanding association between cir-cRNAs and diseases has become an important issue of bioinformatics research, which is beneficial to disease diagnose and treatment. Although some databases about circRNA have been established in recent years, these databases rarely focused on the associations between circRNAs and diseases. The computation methods for predicting circRNA-disease associations are also lacking because of these limitations. To our knowledge, CircR2Disease is the first database about circRNA-disease associations, which provides the chance to develop effective methods to identify novel associations between circRNAs and diseases.

Conclusion
DWNN-RLS method is developed to predict new associations between circRNAs and diseases on CircR2Disease dataset. Firstly, DWNN-RLS computes the GIP similarities of circRNAs and diseases based on the known circRNA-disease associations. Secondly, we further compute the sematic similarity of disease and compute the final similarity of diseases with the mean of GIP similarity and sematic similarity. Finally, the Kron-RLS is used to predict novel circRNA-disease associations based on their similarities. 10CV, 5CV and LOOCV are used to evaluate the prediction performance of DWNN-RLS. In addition, we use the DWNN to calculate the initial associations scores for new circRNAs and diseases. We also compare our method with other six methods. In terms of 10CV, 5CV and LOOCV, DWNN-RLS all achieves the best prediction performance. In addition, we also show that DWNN-RLS method may achieves better prediction performance with the more known circRNA-disease associations. Case studies further illustrate the prediction performance of DWNN-RLS. However, there still exist some limitations in DWNN-RLS. We all know that cricRNAs can function as miRNA sponges or decoys, protein sponges or decoys. In this study, we only use the GIP similarity of circRNAs. In the future, the similarity computation of circRNAs could consider more relevant biological network information, such as cricRNA-miRNA associations and sequence information. Similarly, the disease functional information also should be considered [58][59][60]. Other latest matrix factorization methods such as NRLMF [61], SRMF [62], DRRS [63] should be considered to predict cricRNA-disease association when we integrate more biological network information such as circRNA-miRNA associations, cir-cRNA sequence information and disease functional information. Therefore, to further improve the prediction performance, we would develop a more effective approach to discover new circRNA-disease associations by reasonably integrating more biological network information.