 Research article
 Open Access
A novel collaborative filtering model for LncRNA-disease association prediction based on the Naïve Bayesian classifier
BMC Bioinformatics volume 20, Article number: 396 (2019)
Abstract
Background
Since the number of known lncRNA-disease associations verified by biological experiments is quite limited, uncovering human disease-related lncRNAs has been a challenging task in recent years. Moreover, because biological experiments are very expensive and time-consuming, it is important to develop efficient computational models to discover potential lncRNA-disease associations.
Results
In this manuscript, a novel collaborative filtering model called CFNBC is proposed for inferring potential lncRNA-disease associations based on the Naïve Bayesian Classifier. In CFNBC, an original lncRNA-miRNA-disease tripartite network is first constructed by integrating known miRNA-lncRNA, miRNA-disease and lncRNA-disease associations, and an updated lncRNA-miRNA-disease tripartite network is then constructed by applying the item-based collaborative filtering algorithm to the original tripartite network. Finally, based on the updated tripartite network, a novel approach based on the Naïve Bayesian Classifier is proposed to predict potential associations between lncRNAs and diseases. The novelty of CFNBC lies in the construction of the updated lncRNA-miRNA-disease tripartite network and the introduction of the item-based collaborative filtering algorithm and the Naïve Bayesian Classifier, which allow CFNBC to predict potential lncRNA-disease associations efficiently without entirely relying on known miRNA-disease associations. Simulation results show that CFNBC achieves a reliable AUC of 0.8576 in Leave-One-Out Cross Validation (LOOCV), which is considerably better than previous state-of-the-art results. Moreover, case studies of glioma, colorectal cancer and gastric cancer demonstrate the excellent prediction performance of CFNBC as well.
Conclusions
According to the simulation results, CFNBC achieves satisfactory prediction performance and may be an excellent addition to biomedical research in the future.
Background
Recently, accumulating evidence has indicated that lncRNAs (long non-coding RNAs) are involved in almost the entire cell life cycle through various mechanisms [1, 2] and are closely related to the development of some complex human diseases [3, 4], such as Alzheimer’s disease [5] and many types of cancers [6]. Hence, identifying disease-related lncRNAs is critical to understanding the pathogenesis of complex diseases systematically and may further facilitate the discovery of potential drug targets. However, since biological experiments are very expensive and time-consuming, developing effective computational models to uncover potential disease-related lncRNAs has become a hot topic. Up to now, existing computational models for predicting potential associations between lncRNAs and diseases can be roughly classified into two major categories. In the first category, biological information on miRNAs, lncRNAs or diseases is adopted to identify potential lncRNA-disease associations. For example, Chen et al. proposed a prediction model called HGLDA based on the information of miRNAs, in which a hypergeometric distribution test was adopted to infer potential disease-related lncRNAs [7]. Chen et al. proposed a KATZ measure to predict potential lncRNA-disease associations by utilizing the information of lncRNAs and diseases [8]. Ping and Wang et al. proposed a method for identifying potential disease-related lncRNAs based on the topological information of the known lncRNA-disease association network [9]. In the second category, multiple data sources are integrated to construct heterogeneous networks for inferring potential associations between diseases and lncRNAs. For example, Yu and Wang et al. proposed a naïve Bayesian classifier based probability model to uncover potential disease-related lncRNAs by integrating known miRNA-disease associations, miRNA-lncRNA associations, lncRNA-disease associations, gene-lncRNA associations, gene-miRNA associations and gene-disease associations [10]. Zhang et al. developed a computational model to discover possible lncRNA-disease associations by combining lncRNA similarity, protein-protein interactions and disease similarity [11]. Fu et al. presented a prediction model that considers the quality and relevance of different heterogeneous data sources to identify potential lncRNA-disease associations [12]. Chen et al. proposed a novel prediction model called LRLSLDA by adopting Laplacian Regularized Least Squares to integrate the known phenome-lncRNAome network, the disease similarity network and the lncRNA similarity network [13].
In recent years, in order to address the scarcity of known associations between different objects, an increasing number of recommender systems have been developed to increase the reliability of association prediction based on collaborative filtering methods [14], which rely on prior interactions to predict user-item relationships. Up to now, several prediction models have been proposed in which recommender algorithms are used to identify different potential disease-related objects. For example, Lu et al. proposed a model called SIMCLDA to predict potential lncRNA-disease associations based on inductive matrix completion by computing the Gaussian interaction profile kernel of known lncRNA-disease associations, disease-gene associations and gene-gene ontology associations [15]. Luo et al. modeled the drug repositioning problem as a recommendation system to predict novel drug indications based on known drug-disease associations by utilizing matrix completion [16]. Zeng et al. developed a novel prediction model called PCFM by adopting the probability-based collaborative filtering algorithm to infer gene-associated human diseases [17]. Luo et al. proposed a prediction model named CPTL to uncover potential disease-associated miRNAs via transduction learning by integrating disease similarity, miRNA similarity and known miRNA-disease associations [18].
In this study, a novel collaborative filtering model called CFNBC is proposed for predicting potential lncRNA-disease associations on the basis of the Naïve Bayesian Classifier. First, an original lncRNA-miRNA-disease tripartite network is constructed by integrating the miRNA-disease association network, the miRNA-lncRNA association network and the lncRNA-disease association network. Then, considering that the number of known associations among the three kinds of objects (lncRNAs, miRNAs and diseases) is very limited, an updated tripartite network is constructed by applying a collaborative filtering algorithm to the original tripartite network. Thereafter, based on the updated tripartite network, potential lncRNA-disease associations are predicted by adopting the Naïve Bayesian Classifier. Finally, in order to evaluate the prediction performance of our newly proposed model, LOOCV is implemented for CFNBC based on known experimentally verified lncRNA-disease associations. As a result, CFNBC achieves a reliable AUC of 0.8576, which is much better than that of previous classical prediction models. Moreover, case studies of glioma, colorectal cancer and gastric cancer demonstrate the excellent prediction performance of CFNBC as well.
Results
Leave-one-out cross validation
In this section, in order to estimate the prediction performance of CFNBC, LOOCV is implemented based on known experimentally verified lncRNA-disease associations. During simulation, for a given disease d_{j}, each known lncRNA related to d_{j} is left out in turn as the test sample, whereas all the remaining associations between lncRNAs and d_{j} are taken as training cases for model learning. Thus, the similarity scores between candidate lncRNAs and d_{j} can be calculated, and all candidate lncRNAs can be ranked by their predicted scores. The higher the left-out lncRNA is ranked among the candidates, the better the performance of our prediction model. Moreover, the area under the receiver operating characteristic (ROC) curve (AUC) can be used to measure the performance of CFNBC: the closer the AUC value is to 1, the better the prediction performance of CFNBC. Hence, by setting different classification thresholds, we can calculate the true positive rate (TPR or sensitivity) and the false positive rate (FPR or 1 − specificity) as follows:

\( TPR=\frac{TP}{TP+FN} \)

\( FPR=\frac{FP}{FP+TN} \)

Here, TP, FN, FP and TN denote the numbers of true positives, false negatives, false positives and true negatives respectively. Specifically, TPR indicates the percentage of positive (left-out) samples ranked above a given rank cutoff, and FPR denotes the percentage of negative samples ranked above the same cutoff.
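As a minimal sketch of how the ROC points and the AUC could be obtained from one ranked list, the following Python snippet moves the cutoff down the ranking and applies the usual definitions TPR = TP/(TP + FN) and FPR = FP/(FP + TN). The function names and the toy scores are illustrative assumptions and are not taken from CFNBC; tied scores are not treated specially.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points obtained by lowering the cutoff one candidate at a time."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    # Rank candidates from the highest to the lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        # FPR = FP / (FP + TN), TPR = TP / (TP + FN)
        points.append((fp / neg, tp / pos))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores; label 1 marks a left-out known association.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
print(auc(roc_points(scores, labels)))  # 0.75
```

In this toy ranking three of the four positive-negative pairs are ordered correctly, which is why the area equals 0.75.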
The effects of α
Based on the assumption that original common neighboring miRNA nodes deserve more credibility than recommended common neighboring miRNA nodes, a decay factor α is used to make our prediction model CFNBC work more effectively. In this section, in order to evaluate the effect of α on the prediction performance of CFNBC, we carry out a series of experiments with α set to different values ranging from 0.05 to 0.8. As shown in Table 1, CFNBC achieves the best prediction performance when α is set to 0.05.
Comparison with other state-of-the-art methods
In order to further assess the performance of CFNBC, in this section, we compare it with four state-of-the-art prediction models, namely HGLDA [7], SIMCLDA [15], NBCLDA [10] and the method proposed by Yang et al. [19], in the framework of LOOCV with α set to 0.05. Among these four methods, since HGLDA utilizes a hypergeometric distribution test to infer lncRNA-disease associations by integrating miRNA-disease associations with lncRNA-miRNA associations, we adopt a data set consisting of 183 experimentally validated lncRNA-disease associations, as used with the hypergeometric distribution test, to compare CFNBC with HGLDA. As illustrated in Table 2 and Fig. 1, the simulation results demonstrate that CFNBC outperforms HGLDA significantly. As for SIMCLDA, since it applies inductive matrix completion to identify lncRNA-disease associations by integrating lncRNA-disease associations, gene-disease associations and gene-gene ontology associations, we collect from the data set adopted by SIMCLDA a sub data set that belongs to DS_{ld} in CFNBC and consists of 101 known associations between 30 different lncRNAs and 79 different diseases. As shown in Table 2 and Fig. 2, CFNBC achieves a reliable AUC of 0.8579 on this sub data set, which is better than the AUC of 0.8526 achieved by SIMCLDA. As for NBCLDA, since it fuses multiple heterogeneous biological data sources and adopts the naïve Bayesian classifier to uncover potential lncRNA-disease associations, we compare CFNBC with it on the data set DS_{ld} directly. As illustrated in Table 2 and Fig. 3, CFNBC obtains a reliable AUC of 0.8576, which is higher than the AUC of 0.8519 achieved by NBCLDA. Finally, to keep the comparison with the method proposed by Yang et al. fair, we collect a data set consisting of 319 lncRNA-disease associations between 37 lncRNAs and 52 diseases by deleting the nodes with degree equal to 1 from the data set DS_{ld}. As shown in Table 2 and Fig. 4, CFNBC achieves a reliable AUC of 0.8915, which considerably outperforms the AUC of 0.8568 achieved by the method proposed by Yang et al. Hence, we conclude that CFNBC achieves better performance than these classical prediction models.
Additionally, in order to further evaluate the prediction performance of CFNBC, we compare it with the above four models based on the predicted top-k associations by using the F1-score measure. During simulation, we randomly choose 80% of the known lncRNA-disease associations as the training set, whereas all remaining known and unknown lncRNA-disease associations are taken as the testing set. Since the sets of known lncRNA-disease associations in these models are different, we set different thresholds k to compare them with CFNBC. As shown in Table 3, CFNBC outperforms these four state-of-the-art models in terms of the F1-score measure as well. Moreover, the paired t-test also demonstrates that the performance of CFNBC is significantly better than that of the other methods in terms of F1-scores (p-value < 0.05, as illustrated in Table 4).
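The top-k comparison can be illustrated with a short sketch. The text does not spell out how precision and recall are counted, so the snippet below assumes the usual definitions: precision is the share of the top-k predictions that are held-out known associations, and recall is the share of held-out known associations found in the top k. The function name and the toy pairs are hypothetical.

```python
from typing import Hashable, Sequence, Set


def f1_at_k(ranked: Sequence[Hashable], positives: Set[Hashable], k: int) -> float:
    """F1-score of the top-k entries of a ranked prediction list."""
    top = list(ranked[:k])
    if not top or not positives:
        return 0.0
    hits = sum(1 for pair in top if pair in positives)
    if hits == 0:
        return 0.0
    precision = hits / len(top)      # correct predictions among the top k
    recall = hits / len(positives)   # held-out associations recovered in the top k
    return 2 * precision * recall / (precision + recall)


# Hypothetical ranked candidate pairs and held-out known associations.
ranked = [("l1", "d1"), ("l2", "d1"), ("l3", "d2"), ("l4", "d2")]
held_out = {("l1", "d1"), ("l3", "d2"), ("l5", "d3")}
print(round(f1_at_k(ranked, held_out, k=2), 3))  # 0.4
```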
Case studies
In order to further demonstrate the capability of CFNBC in inferring new lncRNAs related to a given disease, in this section, we carry out case studies of glioma, colorectal cancer and gastric cancer for CFNBC based on the data set DS_{ld}. The top 20 disease-related lncRNAs predicted by CFNBC were checked by manually mining the relevant literature, and the corresponding evidence is listed in Table 5. Among the three kinds of cancers chosen for case studies, glioma is one of the most lethal primary brain tumors, with a median survival of less than 12 months, and 6 out of 100,000 people may have gliomas [20]; hence it is important to find potential associations between glioma and the dysregulation of lncRNAs. As illustrated in Table 5, when CFNBC is applied to predict candidate lncRNAs related to glioma, six out of the top 20 predicted glioma-related lncRNAs have been validated by recent literature on biological experiments. For instance, the lncRNA XIST has been demonstrated to be an important regulator in tumor progression and may be a potential therapeutic target in the treatment of glioma [21]. Ma et al. found that the lncRNA MALAT1 plays an important role in glioma progression and prognosis and may be considered a convincing prognostic biomarker for glioma patients [22]. Xue et al. provided a comprehensive analysis of the KCNQ1OT1-miR-370-CCNE2 axis in human glioma cells and a novel strategy for glioma treatment [23].
Colorectal cancer (CRC) is the third most common cancer and the third leading cause of cancer death in men and women in the United States [24]. In recent years, many CRC-related lncRNAs have been reported based on biological experiments. For example, Song et al. demonstrated that higher expression of XIST was correlated with worse disease-free survival of CRC patients [25]. Zheng et al. showed that a higher expression level of MALAT1 may serve as a negative prognostic marker in stage II/III CRC patients [26]. Nakano et al. found that the loss of imprinting of the lncRNA KCNQ1OT1 may play an important role in the occurrence of CRC [27]. As illustrated in Table 5, when CFNBC is applied to uncover candidate lncRNAs related to CRC, 6 out of the top 20 predicted CRC-related lncRNAs have been verified in the Lnc2Cancer database.
Moreover, gastric cancer is the second most frequent cause of cancer death [28]. Up to now, many lncRNAs have been reported to be associated with gastric cancer. For instance, XIST, MALAT1, SNHG16, NEAT1, H19 and TUG1 were reported to be upregulated in gastric cancer [29,30,31,32,33,34]. As illustrated in Table 5, when CFNBC is applied to uncover candidate lncRNAs related to gastric cancer, 6 out of the top 20 newly identified lncRNAs related to gastric cancer have been validated by the lncRNADisease and Lnc2Cancer databases.
Discussion
Accumulating evidence has shown that the prediction of potential lncRNA-disease associations is helpful in understanding the crucial roles of lncRNAs in biological processes and in the diagnosis, prognosis and treatment of complex diseases. In this manuscript, we first constructed an original lncRNA-miRNA-disease tripartite network by combining miRNA-lncRNA, miRNA-disease and lncRNA-disease associations. Then, we formulated the prediction of potential lncRNA-disease associations as a recommender system problem and obtained an updated tripartite network by applying a novel item-based collaborative filtering algorithm to the original tripartite network. Finally, we proposed a prediction model called CFNBC to infer potential associations between lncRNAs and diseases by applying the naïve Bayesian Classifier to the updated tripartite network. Compared with state-of-the-art prediction models, CFNBC achieves better performance in terms of AUC values without entirely relying on known lncRNA-disease associations, which means that CFNBC can predict potential associations between lncRNAs and diseases even when these lncRNAs and diseases are not in the known data sets. Additionally, we implemented LOOCV to evaluate the prediction performance of CFNBC, and the simulation results showed that the problem of limited positive samples that exists in state-of-the-art models has been substantially alleviated in CFNBC by the addition of the collaborative filtering algorithm, and that the predictive accuracy has been improved by adopting the disease semantic similarity to infer potential associations between lncRNAs and diseases. Moreover, case studies of glioma, colorectal cancer and gastric cancer were carried out to further estimate the performance of CFNBC, and the results demonstrated that CFNBC could be a useful tool for predicting potential relationships between lncRNAs and diseases. Despite the reliable experimental results achieved by CFNBC, our model still has some limitations. For example, many other types of data can be utilized to uncover potential lncRNA-disease associations; therefore, the prediction performance of CFNBC could be improved by the addition of more types of data. In addition, the results of CFNBC may be affected by the quality of the data sets and the number of known lncRNA-disease relationships. Furthermore, successfully established models in other computational fields, such as microRNA-disease association prediction [35,36,37], drug-target interaction prediction [38] and synergistic drug combination prediction [39], may inspire the development of lncRNA-disease association prediction.
Conclusion
Finding lncRNA-disease relationships is essential for understanding human disease mechanisms. In this manuscript, our main contributions are as follows: (1) An original tripartite network is constructed by integrating a variety of biological information including miRNA-lncRNA, miRNA-disease and lncRNA-disease associations. (2) An updated tripartite network is constructed by applying a novel item-based collaborative filtering algorithm to the original tripartite network. (3) A novel prediction model called CFNBC is developed based on the naïve Bayesian Classifier and applied to the updated tripartite network to infer potential associations between lncRNAs and diseases. (4) CFNBC can be adopted to predict a potential disease-related lncRNA or a potential lncRNA-related disease without relying on any known lncRNA-disease associations. (5) A recommendation system is applied in CFNBC, which allows CFNBC to achieve effective prediction results when known lncRNA-disease associations are scarce.
Data collection and preprocessing
In order to construct our novel prediction model CFNBC, we combined three kinds of heterogeneous data sets, namely the miRNA-disease association set, the miRNA-lncRNA association set and the lncRNA-disease association set, to infer potential associations between lncRNAs and diseases. These data sets were collected from different public databases including HMDD [40], starBase v2.0 [41] and MNDR v2.0 [42].
Construction of the miRNA-disease and miRNA-lncRNA association sets
Firstly, we downloaded two data sets of known miRNA-disease associations and miRNA-lncRNA associations from HMDD [40] in August 2018 and from starBase v2.0 [41] in January 2015 respectively. Then, we removed duplicate associations with conflicting evidence from these two data sets separately, manually picked out the miRNAs common to both the data set of miRNA-disease associations and the data set of miRNA-lncRNA associations, and retained only the associations related to these selected miRNAs in the two data sets. As a result, we finally obtained a data set DS_{md} including 4704 different miRNA-disease interactions between 246 different miRNAs and 373 different diseases, and a data set DS_{ml} including 9086 different miRNA-lncRNA interactions between 246 different miRNAs and 1089 different lncRNAs (see Supplementary Materials Table 1 and Table 2).
Construction of the lncRNA-disease association set
Firstly, we downloaded a data set of known lncRNA-disease associations from the MNDR v2.0 database [42] in 2017. Then, in order to keep disease names uniform, we transformed some disease names included in the set of lncRNA-disease associations into their aliases in the data set of miRNA-disease associations, and unified the names of lncRNAs in the data sets of miRNA-lncRNA associations and lncRNA-disease associations. By this means, we selected the lncRNA-disease interactions associated with both lncRNAs belonging to DS_{ml} and diseases belonging to DS_{md}. As a result, we finally obtained a data set DS_{ld} including 407 different lncRNA-disease interactions between 77 different lncRNAs and 95 different diseases (see Supplementary Materials Table 3).
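The filtering step described above (deduplicate, keep only miRNAs present in both sources) can be sketched as follows. This is only an illustration under simple assumptions: associations are given as plain (miRNA, partner) tuples, duplicates are exact repeats, and the function name and toy identifiers are hypothetical. The manual curation of conflicting evidence is not modeled.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def keep_common_mirnas(md_pairs: Iterable[Pair],
                       ml_pairs: Iterable[Pair]) -> Tuple[List[Pair], List[Pair], Set[str]]:
    """Restrict both association lists to miRNAs that occur in both of them."""
    md = set(md_pairs)  # converting to a set drops exact duplicate records
    ml = set(ml_pairs)
    common = {mirna for mirna, _ in md} & {mirna for mirna, _ in ml}
    ds_md = sorted(pair for pair in md if pair[0] in common)
    ds_ml = sorted(pair for pair in ml if pair[0] in common)
    return ds_md, ds_ml, common


# Hypothetical records: (miRNA, disease) and (miRNA, lncRNA).
md_raw = [("miR-a", "d1"), ("miR-a", "d1"), ("miR-b", "d2")]
ml_raw = [("miR-a", "l1"), ("miR-c", "l2")]
ds_md, ds_ml, common = keep_common_mirnas(md_raw, ml_raw)
print(ds_md, ds_ml, common)
```

After this step both sets share the same miRNA index, which is what later allows the two adjacency matrices to be placed side by side.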
Analysis of relational data sources
In CFNBC, the newly constructed lncRNA-miRNA-disease tripartite network (abbreviated as LMDN) consists of three kinds of objects: lncRNAs, miRNAs and diseases. Therefore, we collected three kinds of relational data sources from different databases based on these three kinds of objects. As illustrated in Fig. 5, the number of diseases is 373 in the data set of miRNA-disease associations (abbreviated as md) and 95 in the data set of lncRNA-disease associations (abbreviated as ld). The number of lncRNAs is 1089 in the data set of miRNA-lncRNA associations (abbreviated as ml) and 77 in ld. The number of miRNAs is 246 in both ml and md. Moreover, the set of 95 diseases in ld is a subset of the set of 373 diseases in md, and the set of 77 lncRNAs in ld is a subset of the set of 1089 lncRNAs in ml.
Method
As illustrated in Fig. 6, our newly proposed prediction model CFNBC consists of the following four main stages:

Step 1: As illustrated in Fig. 6(a), we construct a miRNA-disease association network MDN, a miRNA-lncRNA association network MLN, and an lncRNA-disease association network LDN based on the data sets DS_{md}, DS_{ml} and DS_{ld} respectively.

Step 2: As illustrated in Fig. 6(b), by integrating the three newly constructed association networks MDN, MLN and LDN, we construct an original lncRNA-miRNA-disease association tripartite network LMDN.

Step 3: As illustrated in Fig. 6(c), after applying the collaborative filtering algorithm to LMDN, we obtain an updated lncRNA-miRNA-disease association tripartite network LMDN^{′}.

Step 4: As illustrated in Fig. 6(d), after applying the naïve Bayesian classifier to LMDN^{′}, we obtain our final prediction model CFNBC.
In the original tripartite network LMDN, owing to the sparsity of known associations between lncRNAs and diseases, for any given lncRNA node a and disease node b, the number of miRNA nodes associated with both a and b is very limited. Hence, in CFNBC, we designed a collaborative filtering algorithm for recommending suitable miRNA nodes to the corresponding lncRNA nodes and disease nodes respectively. Then, based on these known and recommended common neighboring nodes, we apply the Naïve Bayesian Classifier to LMDN^{′} to uncover potential lncRNA-disease associations.
Construction of LMDN
Let the matrix \( {R}_{MD}^0 \) be the original adjacency matrix of known miRNA-disease associations and let the entry \( {R}_{MD}^0\left({m}_k,{d}_j\right) \) denote the element in the k^{th} row and j^{th} column of \( {R}_{MD}^0 \). Then \( {R}_{MD}^0\left({m}_k,{d}_j\right)=1 \) if and only if the miRNA node m_{k} is associated with the disease node d_{j}; otherwise \( {R}_{MD}^0\left({m}_k,{d}_j\right)=0 \). In the same way, we can obtain the original adjacency matrix \( {R}_{ML}^0 \) of known miRNA-lncRNA associations, in which \( {R}_{ML}^0\left({m}_k,{l}_i\right)=1 \) if and only if the miRNA node m_{k} is associated with the lncRNA node l_{i}; otherwise \( {R}_{ML}^0\left({m}_k,{l}_i\right)=0 \). Additionally, since a recommender system takes users and items as input data, in CFNBC we take lncRNAs and diseases as users and miRNAs as items. As the rows of the two original adjacency matrices \( {R}_{MD}^0 \) and \( {R}_{ML}^0 \) are indexed by the same miRNAs, we can construct another adjacency matrix \( {R}_{MLD}^0=\left[{R}_{ML}^0,{R}_{MD}^0\right] \) by splicing \( {R}_{ML}^0 \) and \( {R}_{MD}^0 \) together. The rows of \( {R}_{MLD}^0 \) are thus indexed exactly as the rows of \( {R}_{MD}^0 \) or \( {R}_{ML}^0 \), while the columns of \( {R}_{MLD}^0 \) consist of the columns of \( {R}_{ML}^0 \) followed by the columns of \( {R}_{MD}^0 \).
Applying the item-based collaborative filtering algorithm on LMDN
Since CFNBC is based on the collaborative filtering algorithm, the relevance scores between lncRNAs and diseases predicted by CFNBC depend on the common neighbors between these lncRNAs and diseases. However, owing to the scarcity of known lncRNA-miRNA, lncRNA-disease and miRNA-disease associations, the number of common neighbors between lncRNAs and diseases in LMDN is very limited as well. Hence, in order to increase the number of common neighbors between lncRNAs and diseases in LMDN, we apply the collaborative filtering algorithm to LMDN in this section.
First, on the basis of \( {R}_{MLD}^0 \) and LMDN, we can obtain a co-occurrence matrix R^{m × m}, in which the entry R(m_{k}, m_{r}) denotes the element in the k^{th} row and r^{th} column of R^{m × m}. Here R(m_{k}, m_{r}) =1 if and only if the miRNA node m_{k} and the miRNA node m_{r} share at least one common neighboring node (a lncRNA node or a disease node) in LMDN; otherwise R(m_{k}, m_{r}) =0. Hence, a similarity matrix R^{′} can be calculated after normalizing R^{m × m} as follows:
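The splicing of the two adjacency matrices can be sketched in a few lines of Python. The helper names and the three-miRNA toy data are hypothetical; the only property taken from the text is that both matrices share the same miRNA row order, so that the lncRNA columns and the disease columns can simply be placed side by side.

```python
from typing import Iterable, List, Sequence, Tuple

Matrix = List[List[int]]


def adjacency(pairs: Iterable[Tuple[str, str]],
              mirnas: Sequence[str], columns: Sequence[str]) -> Matrix:
    """Binary matrix with one row per miRNA and one column per lncRNA or disease."""
    row = {name: i for i, name in enumerate(mirnas)}
    col = {name: j for j, name in enumerate(columns)}
    matrix = [[0] * len(columns) for _ in mirnas]
    for mirna, other in pairs:
        matrix[row[mirna]][col[other]] = 1
    return matrix


def splice(r_ml: Matrix, r_md: Matrix) -> Matrix:
    """Column-wise concatenation: R_MLD = [R_ML, R_MD]."""
    return [ml_row + md_row for ml_row, md_row in zip(r_ml, r_md)]


# Toy data (illustrative only).
mirnas = ["m1", "m2", "m3"]
r_ml = adjacency([("m1", "l1"), ("m2", "l1"), ("m3", "l2")], mirnas, ["l1", "l2"])
r_md = adjacency([("m1", "d1"), ("m3", "d2")], mirnas, ["d1", "d2"])
r_mld = splice(r_ml, r_md)
print(r_mld)  # [[1, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 1]]
```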
Where ∣N(m_{k})∣ represents the number of known lncRNAs and diseases associated with m_{k} in LMDN, that is, the number of elements equal to 1 in the k^{th} row of \( {R}_{MLD}^0 \), ∣N(m_{r})∣ represents the number of elements equal to 1 in the r^{th} row of \( {R}_{MLD}^0 \), and ∣N(m_{k}) ∩ N(m_{r})∣ denotes the number of known lncRNAs and diseases associated with both m_{k} and m_{r} in LMDN.
Next, for any given lncRNA node l_{i} and miRNA node m_{h} in LMDN, if the association between l_{i} and m_{h} is already known, then, for a miRNA node m_{t} other than m_{h} in LMDN, the higher the relevance score between m_{t} and m_{h}, the more likely it is that a potential association exists between l_{i} and m_{t}. Hence, we can obtain the relevance score between l_{i} and m_{t} based on the similarities between miRNAs as follows:
Here, N(l_{i}) represents the set of neighboring miRNA nodes that are directly connected to l_{i} in LMDN, and S(K, m_{t} − top) denotes the set of top-K miRNAs that are most similar to m_{t} in LMDN. \( {R}_t^{\prime } \) is a vector consisting of the t^{th} row of R^{′}. In addition, u_{it} = 1 if and only if l_{i} interacts with m_{t} in ML; otherwise u_{it} =0.
Similarly, for any given disease node d_{j} and miRNA node m_{h} in LMDN, if the association between d_{j} and m_{h} is already known, then, for a miRNA node m_{t} other than m_{h} in LMDN, we can obtain the relevance score between d_{j} and m_{t} based on the similarities between miRNAs as follows:
Where N(d_{j}) denotes the set of neighboring miRNA nodes that are directly connected to d_{j} in LMDN. In addition, u_{jt} =1 if and only if d_{j} interacts with m_{t} in MD; otherwise u_{jt} =0.
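To make the normalization concrete, the sketch below computes the co-occurrence matrix and a similarity matrix from a binary miRNA-by-(lncRNA, disease) matrix. It assumes the cosine-style normalization ∣N(m_{k}) ∩ N(m_{r})∣ / sqrt(∣N(m_{k})∣ · ∣N(m_{r})∣), which is the common choice in item-based collaborative filtering and uses exactly the three quantities defined above; the function name is hypothetical and self-similarity is left at 0 by assumption. The input is the 5 × 4 example matrix of this section, for which the (m_{1}, m_{2}) entry comes out as 2/√6 ≈ 0.816, in line with the 0.81 quoted in the worked example.

```python
from math import sqrt
from typing import List, Tuple


def mirna_similarity(r0: List[List[int]]) -> Tuple[List[List[int]], List[List[float]]]:
    """Return (co-occurrence matrix, similarity matrix) for the miRNA rows of r0."""
    # N(m_k): indices of the lncRNA/disease columns linked to miRNA k.
    neighbours = [{j for j, v in enumerate(row) if v == 1} for row in r0]
    n = len(r0)
    cooc = [[0] * n for _ in range(n)]
    sim = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for r in range(n):
            if k == r:
                continue  # assumption: a miRNA is not compared with itself
            shared = len(neighbours[k] & neighbours[r])
            if shared > 0:
                cooc[k][r] = 1
                sim[k][r] = shared / sqrt(len(neighbours[k]) * len(neighbours[r]))
    return cooc, sim


# Example matrix of this section: rows m1..m5, columns l1, l2, d1, d2.
example = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
cooc, sim = mirna_similarity(example)
print(round(sim[0][1], 3))  # 0.816
```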
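The relevance scores for lncRNA and disease nodes can be sketched with one function, because both kinds of nodes are columns of the spliced matrix. The snippet assumes that the score between a column node and m_{t} is the sum of the similarities between m_{t} and those of its top-K most similar miRNAs that are already linked to the column node; this reading follows the description of N(·) and S(K, ·) above but is an assumption, as are the function name and the toy similarity values.

```python
from typing import List


def relevance_scores(r0: List[List[int]], sim: List[List[float]], k: int) -> List[List[float]]:
    """Score every (miRNA, column) pair from miRNA-miRNA similarities."""
    n_mirna, n_col = len(r0), len(r0[0])
    r1 = [[0.0] * n_col for _ in range(n_mirna)]
    for t in range(n_mirna):
        # S(K, m_t): the K miRNAs most similar to m_t, excluding m_t itself.
        others = [h for h in range(n_mirna) if h != t]
        top_k = sorted(others, key=lambda h: sim[t][h], reverse=True)[:k]
        for c in range(n_col):
            # Only neighbours already linked to column c contribute.
            r1[t][c] = sum(sim[t][h] for h in top_k if r0[h][c] == 1)
    return r1


# Toy input: three miRNAs, one lncRNA column linked to the first two miRNAs.
toy_r0 = [[1], [1], [0]]
toy_sim = [
    [0.0, 0.8, 0.4],
    [0.8, 0.0, 0.5],
    [0.4, 0.5, 0.0],
]
print(relevance_scores(toy_r0, toy_sim, k=2))
```

With K = 2 the unlinked third miRNA collects contributions from both linked miRNAs and ends up with the highest score, which is the situation that triggers a recommendation in the next step.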
Obviously, based on the similarity matrix R^{′} and the adjacency matrix \( {R}_{MLD}^0 \), we can construct a new recommender matrix \( {R}_{MLD}^1 \) as follows:
In particular, for a certain lncRNA node l_{i} or disease node d_{j} in LMDN, if there is a miRNA m_{k} satisfying \( {R}_{MLD}^0\left({m}_k,{l}_i\right)=1 \) or \( {R}_{MLD}^0\left({m}_k,{d}_j\right)=1 \) in \( {R}_{MLD}^0 \), then we first sum up the values of the corresponding elements in the i^{th} or j^{th} column of \( {R}_{MLD}^1 \) respectively and obtain their average value \( \overline{p} \). Finally, if there is a miRNA node m_{θ} in the i^{th} or j^{th} column of \( {R}_{MLD}^1 \) satisfying \( {R}_{MLD}^1\left({m}_{\theta },{l}_i\right)>\overline{p} \) or \( {R}_{MLD}^1\left({m}_{\theta },{d}_j\right)>\overline{p} \), then we recommend the miRNA m_{θ} to l_{i} or d_{j} respectively. At the same time, we add a new edge between m_{θ} and l_{i} or between m_{θ} and d_{j} in LMDN.
For instance, according to Fig. 6 and the given matrix \( {R}_{MLD}^0=\left[\begin{array}{cccc}1& 1& 1& 0\\ {}1& 0& 1& 0\\ {}0& 1& 0& 1\\ {}0& 0& 0& 1\\ {}0& 0& 1& 1\end{array}\right] \), we can obtain its corresponding matrices R^{m × m}, R^{′} and \( {R}_{MLD}^1 \) as follows:
To be specific, as illustrated in Fig. 6, taking the lncRNA node l_{1} as an example, it is easy to see from the matrix \( {R}_{MLD}^0 \) that two miRNA nodes, m_{1} and m_{2}, are associated with l_{1}. In addition, according to formula (9), we have \( {R}_{MLD}^1\left({m}_5,{l}_1\right)=0.905>\overline{p}=\frac{R_{MLD}^1\left({m}_1,{l}_1\right)+{R}_{MLD}^1\left({m}_2,{l}_1\right)}{2}=\frac{0.81+0.81}{2}=0.81 \). Hence, we recommend the miRNA node m_{5} to l_{1}. In the same way, the miRNA nodes m_{2}, m_{4} and m_{5} are recommended to l_{2} as well. Moreover, according to the previous description, new edges between m_{5} and l_{1}, m_{2} and l_{2}, m_{4} and l_{2}, and m_{5} and l_{2} are added to the original tripartite network LMDN at the same time. Thereafter, we obtain an updated lncRNA-miRNA-disease association tripartite network LMDN^{′} on the basis of the original tripartite network LMDN.
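The recommendation rule can be sketched as a small function that takes the original adjacency matrix and the recommender matrix as inputs. Following the worked example, the threshold for a column is taken to be the mean score of the miRNAs already linked to that column; the function name is hypothetical. In the toy column below, 0.81, 0.81 and 0.905 are the values quoted for l_{1}, while the scores 0.40 and 0.0 for m_{3} and m_{4} are placeholders chosen only for illustration.

```python
from typing import List, Tuple


def recommend(r0: List[List[int]], r1: List[List[float]]) -> List[Tuple[int, int]]:
    """Return (miRNA index, column index) pairs for the edges to be added."""
    n_mirna, n_col = len(r0), len(r0[0])
    new_edges = []
    for c in range(n_col):
        known = [t for t in range(n_mirna) if r0[t][c] == 1]
        if not known:
            continue  # no linked miRNA, so no threshold can be formed
        # Threshold: mean score of the miRNAs already linked to this column.
        p_bar = sum(r1[t][c] for t in known) / len(known)
        for t in range(n_mirna):
            if r0[t][c] == 0 and r1[t][c] > p_bar:
                new_edges.append((t, c))
    return new_edges


# Single column for l1: rows m1..m5.
r0_l1 = [[1], [1], [0], [0], [0]]
r1_l1 = [[0.81], [0.81], [0.40], [0.0], [0.905]]  # 0.40 and 0.0 are placeholders
print(recommend(r0_l1, r1_l1))  # [(4, 0)]
```

The single returned pair corresponds to recommending m_{5} (row index 4) to l_{1}.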
Construction of the prediction model CFNBC
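The naïve Bayesian classifier is a simple probabilistic classifier built on a conditional independence assumption. Based on this probability model, the posterior probability can be described as follows:

\( p\left(C|{F}_1,{F}_2,\dots, {F}_n\right)=\frac{p(C)\,p\left({F}_1,{F}_2,\dots, {F}_n|C\right)}{p\left({F}_1,{F}_2,\dots, {F}_n\right)} \) (10)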
Where C is a dependent class variable and F_{1}, F_{2}, …, F_{n} are the feature variables of class C.
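Moreover, since each feature F_{i} is conditionally independent of any other feature F_{j} (i ≠ j) given the class C, the above formula (10) can also be expressed as follows:

\( p\left(C|{F}_1,{F}_2,\dots, {F}_n\right)=\frac{p(C)\prod_{i=1}^n p\left({F}_i|C\right)}{p\left({F}_1,{F}_2,\dots, {F}_n\right)} \) (11)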
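As a generic illustration of the factorized posterior (not of the CFNBC scoring function itself), the sketch below multiplies a class prior by independent per-feature likelihoods and normalizes over the classes, which plays the role of dividing by p(F_{1}, …, F_{n}). The function name and the toy probabilities are assumptions; all probabilities are assumed to be strictly positive so that logarithms can be used for numerical stability.

```python
from math import exp, log
from typing import Dict, Hashable, List


def naive_bayes_posterior(priors: Dict[Hashable, float],
                          likelihoods: Dict[Hashable, List[float]]) -> Dict[Hashable, float]:
    """priors[c] = p(C=c); likelihoods[c] = [p(F_1|c), ..., p(F_n|c)]."""
    # log p(c) + sum_i log p(F_i | c), using the independence assumption.
    log_joint = {
        c: log(priors[c]) + sum(log(p) for p in likelihoods[c])
        for c in priors
    }
    top = max(log_joint.values())  # subtract the maximum before exponentiating
    z = sum(exp(v - top) for v in log_joint.values())
    return {c: exp(v - top) / z for c, v in log_joint.items()}


# Two classes (1 = associated, 0 = not associated) and two features.
posterior = naive_bayes_posterior(
    priors={1: 0.5, 0: 0.5},
    likelihoods={1: [0.8, 0.5], 0: [0.2, 0.5]},
)
print(posterior)
```

In this toy case the joint terms are 0.5 · 0.8 · 0.5 = 0.2 and 0.5 · 0.2 · 0.5 = 0.05, so the normalized posterior of class 1 is 0.2 / 0.25 = 0.8.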
In our previous work, we proposed a probability model called NBCLDA based on the Naïve Bayesian classifier to predict potential lncRNAdisease associations [10]. However, in NBCLDA, there exist some circumstances where it happens to be no relevance scores between a certain pair of lncRNA and disease nodes, and the reason is that there are no common neighbors between them owing to the scarce known associations between the pair of lncRNA and disease. Hence, in order to overcome this kind of drawback existing in our previous work, in this section, we will design a novel prediction model called CFNBC to infer potential associations between lncRNAs and diseases through adopting the itembased collaborative filtering algorithm on LMDN and applying the Naïve Bayesian classifier on LMDN^{′}. In CFNBC, for a given pair of lncRNA and disease nodes, it is obvious that they will have two kinds of common neighboring miRNA nodes such as the original common miRNA nodes and the recommended common miRNA nodes. In order to illustrate this case more intuitively, an example is given in Figure 7, in which, the node m_{3} is an original common neighboring miRNA node since it has known associations with both l_{2} and d_{2}, while the nodes m_{4} and m_{5} belong to recommended common neighboring miRNA nodes since they do not have known associations with both l_{2} and d_{2}. And in particular, while applying the Naïve Bayesian classifier on LMDN^{′}, for a given pair of lncRNA and disease nodes, we will consider that their common neighboring miRNA nodes, including both the original and recommended common neighboring miRNA nodes, are all conditionally independent of each other, since they are different nodes in LMDN^{′}. That is, for a given pair of lncRNA and disease nodes, it is assumed that all their common neighboring nodes will not interfere with each other in CFNBC.
Method for applying the Naïve Bayesian theory on LMDN^{′}
For any given lncRNA node l_{i} and disease node d_{j} in LMDN^{′}, let CN_{1}(l_{i}, d_{j}) = {m_{1 − 1}, m_{2 − 1}, ⋯, m_{h − 1}} denote the set of all original common neighboring nodes between them, and let CN_{2}(l_{i}, d_{j}) = {m_{1 − 2}, m_{2 − 2}, ⋯, m_{h − 2}} denote the set of all recommended common neighboring nodes between them in LMDN^{′}. Then the prior probabilities \( p\left({e}_{l_i{d}_j}=1\right) \) and \( p\left({e}_{l_i{d}_j}=0\right) \) can be calculated as follows:
Where M^{c} denotes the number of known lncRNA-disease associations in LDN and M = nl × nd. Here, nl and nd represent the numbers of different lncRNAs and diseases in LDN, respectively.
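A plausible reading of these definitions, assuming the prior is simply the fraction of all lncRNA-disease pairs that are known to be associated, is:

```latex
p\left(e_{l_i d_j} = 1\right) = \frac{M^{c}}{M},
\qquad
p\left(e_{l_i d_j} = 0\right) = 1 - \frac{M^{c}}{M},
\qquad
M = nl \times nd
```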
Furthermore, based on these two kinds of common neighboring nodes, the posterior probabilities between l_{i} and d_{j} can be calculated as follows:
Obviously, by comparing formula (14) with formula (15), one can determine whether an lncRNA node is related to a disease node in LMDN^{′}. However, since it is difficult to obtain the values of p(CN_{1}(l_{i}, d_{j})) and p(CN_{2}(l_{i}, d_{j})) directly, the probability of a potential association existing between l_{i} and d_{j} in LMDN^{′} can be defined as follows:
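One way to read this step, assuming that S(l_{i}, d_{j}) is the ratio of the two posteriors and that the evidence term is written as the product p(CN_{1}) p(CN_{2}) (both are assumptions made for this sketch), is that the hard-to-obtain terms cancel:

```latex
\begin{aligned}
p\left(e_{l_i d_j}=1 \mid CN_1, CN_2\right)
  &= \frac{p\left(e_{l_i d_j}=1\right)}{p(CN_1)\, p(CN_2)}
     \prod_{m_{\delta-1} \in CN_1} p\left(m_{\delta-1} \mid e_{l_i d_j}=1\right)
     \prod_{m_{\delta-2} \in CN_2} p\left(m_{\delta-2} \mid e_{l_i d_j}=1\right) \\
S(l_i, d_j)
  &= \frac{p\left(e_{l_i d_j}=1 \mid CN_1, CN_2\right)}{p\left(e_{l_i d_j}=0 \mid CN_1, CN_2\right)}
   = \frac{p\left(e_{l_i d_j}=1\right)}{p\left(e_{l_i d_j}=0\right)}
     \prod_{m_{\delta-1} \in CN_1}
       \frac{p\left(m_{\delta-1} \mid e_{l_i d_j}=1\right)}{p\left(m_{\delta-1} \mid e_{l_i d_j}=0\right)}
     \prod_{m_{\delta-2} \in CN_2}
       \frac{p\left(m_{\delta-2} \mid e_{l_i d_j}=1\right)}{p\left(m_{\delta-2} \mid e_{l_i d_j}=0\right)}
\end{aligned}
```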
Here, \( p\left({m}_{\delta -1}\mid {e}_{l_i{d}_j}=1\right) \) and \( p\left({m}_{\delta -1}\mid {e}_{l_i{d}_j}=0\right) \) denote the conditional probabilities that the node m_{δ − 1} is a common neighboring node between l_{i} and d_{j} in LMDN^{′}, given that l_{i} and d_{j} are associated or not associated, respectively; \( p\left({m}_{\delta -2}\mid {e}_{l_i{d}_j}=1\right) \) and \( p\left({m}_{\delta -2}\mid {e}_{l_i{d}_j}=0\right) \) are defined in the same way for the node m_{δ − 2}. Moreover, according to Bayesian theory, these four conditional probabilities can be defined as follows:
Where \( p\left({e}_{l_i{d}_j}=1\mid {m}_{\delta -1}\right) \) and \( p\left({e}_{l_i{d}_j}=0\mid {m}_{\delta -1}\right) \) are the probabilities that the lncRNA node l_{i} is, or is not, connected to the disease node d_{j}, respectively, given that m_{δ − 1} is a common neighboring miRNA node between l_{i} and d_{j} in LMDN^{′}. Similarly, \( p\left({e}_{l_i{d}_j}=1\mid {m}_{\delta -2}\right) \) and \( p\left({e}_{l_i{d}_j}=0\mid {m}_{\delta -2}\right) \) are the probabilities that l_{i} is, or is not, connected to d_{j}, respectively, given that m_{δ − 2} is a common neighboring miRNA node between l_{i} and d_{j} in LMDN^{′}. Moreover, supposing that m_{δ − 1} and m_{δ − 2} are two common neighboring miRNA nodes between l_{i} and d_{j} in LMDN^{′}, let \( {N}_{m_{\delta -1}}^{+} \) and \( {N}_{m_{\delta -1}}^{-} \) represent the numbers of known and unknown associations, respectively, between the disease nodes and lncRNA nodes in LMDN^{′} that have m_{δ − 1} as a common neighboring miRNA node, and let \( {N}_{m_{\delta -2}}^{+} \) and \( {N}_{m_{\delta -2}}^{-} \) represent the corresponding numbers for m_{δ − 2}. Then \( p\left({e}_{l_i{d}_j}=1\mid {m}_{\delta -1}\right) \) and \( p\left({e}_{l_i{d}_j}=1\mid {m}_{\delta -2}\right) \) can be calculated as follows:
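The counting behind N^{+} and N^{−} can be illustrated with a short Python sketch. It assumes that LMDN^{′} is stored as two dictionaries mapping each miRNA to its associated lncRNAs and diseases, that the known lncRNA-disease associations form a set of pairs, and that the conditional probability is estimated as N^{+} / (N^{+} + N^{−}). The function name count_pairs and all toy identifiers are hypothetical and are not taken from the released code.

```python
from itertools import product


def count_pairs(mirna, lnc_of, dis_of, known_ld):
    """Count known (N+) and unknown (N-) lncRNA-disease pairs that share `mirna`."""
    # Every (lncRNA, disease) pair that has `mirna` as a common neighbor.
    pairs = list(product(lnc_of[mirna], dis_of[mirna]))
    n_plus = sum(1 for pair in pairs if pair in known_ld)
    n_minus = len(pairs) - n_plus
    return n_plus, n_minus


# Hypothetical toy data: miRNA m1 touches 2 lncRNAs and 3 diseases.
lnc_of = {"m1": {"l1", "l2"}}
dis_of = {"m1": {"d1", "d2", "d3"}}
known_ld = {("l1", "d1"), ("l2", "d3")}

n_plus, n_minus = count_pairs("m1", lnc_of, dis_of, known_ld)
# Assumed estimate of p(e = 1 | m1): share of known pairs among all pairs.
p_connected = n_plus / (n_plus + n_minus)
```

Note that n_plus + n_minus always equals the number of lncRNAs times the number of diseases attached to the miRNA, because every such pair is counted exactly once.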
Obviously, according to formulas (17), (18), (19) and (20), formula (16) can be rewritten as follows:
Furthermore, for any given lncRNA node l_{i} and disease node d_{j}, the value of \( \frac{p\left({e}_{l_i{d}_j}=1\right)}{p\left({e}_{l_i{d}_j}=0\right)} \) is a constant, so for convenience we denote it as ϕ_{m}. In addition, for each common neighboring node m_{δ − 1} between l_{i} and d_{j}, let N_{l − 1} and N_{d − 1} denote the numbers of lncRNAs and diseases associated with m_{δ − 1} in LMDN^{′}, respectively; then \( {N}_{m_{\delta -1}}^{+}+{N}_{m_{\delta -1}}^{-}={N}_{l-1}\times {N}_{d-1} \). Similarly, for each common neighboring miRNA node m_{δ − 2} between l_{i} and d_{j}, let N_{l − 2} and N_{d − 2} represent the numbers of lncRNAs and diseases associated with m_{δ − 2} in LMDN^{′}, respectively; then \( {N}_{m_{\delta -2}}^{+}+{N}_{m_{\delta -2}}^{-}={N}_{l-2}\times {N}_{d-2} \). Thereafter, formula (16) can be further modified as follows:
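The simplification can be traced for a single common neighbor m (either m_{δ − 1} or m_{δ − 2}). The following is a sketch that assumes p(e = 1 | m) is estimated as N_m^{+} / (N_m^{+} + N_m^{−}), and hence p(e = 0 | m) as N_m^{−} / (N_m^{+} + N_m^{−}):

```latex
\begin{aligned}
\frac{p\left(m \mid e_{l_i d_j}=1\right)}{p\left(m \mid e_{l_i d_j}=0\right)}
  &= \frac{p\left(e_{l_i d_j}=1 \mid m\right) p(m) \,/\, p\left(e_{l_i d_j}=1\right)}
          {p\left(e_{l_i d_j}=0 \mid m\right) p(m) \,/\, p\left(e_{l_i d_j}=0\right)}
   = \frac{p\left(e_{l_i d_j}=0\right)}{p\left(e_{l_i d_j}=1\right)} \cdot
     \frac{p\left(e_{l_i d_j}=1 \mid m\right)}{p\left(e_{l_i d_j}=0 \mid m\right)}
   = \phi_m^{-1}\, \frac{N_m^{+}}{N_m^{-}} \\
S(l_i, d_j)
  &= \phi_m
     \prod_{m_{\delta-1} \in CN_1} \phi_m^{-1}\, \frac{N_{m_{\delta-1}}^{+}}{N_{m_{\delta-1}}^{-}}
     \prod_{m_{\delta-2} \in CN_2} \phi_m^{-1}\, \frac{N_{m_{\delta-2}}^{+}}{N_{m_{\delta-2}}^{-}}
\end{aligned}
```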
Besides, since \( {N}_{m_{\delta -1}}^{+} \) and \( {N}_{m_{\delta -2}}^{+} \) may be zero, we introduce the Laplace calibration to guarantee that the value of S(l_{i}, d_{j}) will not be zero. Hence, formula (16) can once again be modified as follows:
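The effect of the Laplace calibration can be sketched in Python. This sketch assumes that the calibration adds one to both counts, so that each neighbor contributes a factor ϕ_{m}^{−1} (N^{+} + 1) / (N^{−} + 1); the function name and the toy counts are hypothetical.

```python
def smoothed_score(phi, counts_original, counts_recommended):
    """Odds-style score with add-one (Laplace) smoothing on each neighbor's counts.

    phi                : prior odds p(e=1) / p(e=0)
    counts_original    : list of (n_plus, n_minus) for original common neighbors
    counts_recommended : list of (n_plus, n_minus) for recommended common neighbors
    """
    score = phi
    for n_plus, n_minus in list(counts_original) + list(counts_recommended):
        # ASSUMPTION: smoothing adds 1 to both the known and the unknown count.
        score *= (1.0 / phi) * (n_plus + 1) / (n_minus + 1)
    return score


# One neighbor with no known pairs (N+ = 0) would zero out an unsmoothed product.
score = smoothed_score(0.25, [(0, 3)], [(3, 1)])
```

With these toy counts the unsmoothed product would be zero because of the first neighbor, whereas the smoothed score stays positive.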
Next, for any given lncRNA node and disease node, the original common neighboring miRNA nodes between them are obtained from known associations, while the recommended common neighboring miRNA nodes are obtained by our item-based collaborative filtering algorithm, so it is reasonable to consider the original common neighboring miRNA nodes more credible than the recommended ones. Hence, in order to make our prediction model work more effectively, we add a decay factor α in the range (0, 1) to formula (25). Thereafter, formula (25) can be rewritten as follows:
Additionally, it has been reported that the degree of common neighboring nodes plays a significant role in link prediction, and that common neighboring nodes with high degrees can improve the prediction accuracy [43]. Hence, we further add the Resource Allocation (RA) index [44] and a logarithmic function for standardization to formula (26). Thereafter, for any given lncRNA node l_{i} and disease node d_{j} in LMDN^{′}, we can obtain the probability that there may exist a potential association between them as follows:
Here, \( {k}_{m_{\delta -1}} \) and \( {k}_{m_{\delta -2}} \) represent the degrees of m_{δ − 1} and m_{δ − 2} in LMDN^{′}, respectively.
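How the decay factor, the degree weighting and the logarithm might fit together can be sketched as follows. The exact form used here is an assumption made for illustration only: each neighbor contributes the logarithm of its smoothed factor divided by its degree, the decay factor α multiplies the factor of every recommended neighbor, and the constant prior term is dropped because it does not change the ranking. The function name and toy values are hypothetical.

```python
import math


def weighted_log_score(phi, alpha, original, recommended):
    """Degree-weighted, log-scaled score (assumed form, for illustration only).

    original / recommended: iterables of (n_plus, n_minus, degree) per miRNA.
    """
    total = 0.0
    for n_plus, n_minus, degree in original:
        ratio = (n_plus + 1) / (n_minus + 1) / phi
        # Resource-allocation style weight: divide by the neighbor's degree.
        total += math.log(ratio) / degree
    for n_plus, n_minus, degree in recommended:
        # ASSUMPTION: alpha in (0, 1) scales each recommended-neighbor factor.
        ratio = alpha * (n_plus + 1) / (n_minus + 1) / phi
        total += math.log(ratio) / degree
    return total


score = weighted_log_score(0.5, 0.5, [(1, 0, 2)], [(1, 0, 4)])
no_decay = weighted_log_score(0.5, 1.0, [(1, 0, 2)], [(1, 0, 4)])
```

Under these assumptions, lowering α reduces the contribution of recommended neighbors while leaving the contribution of original neighbors unchanged.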
Method for appending the disease semantic similarity into CFNBC
Each disease can be described as a Directed Acyclic Graph (DAG), in which the nodes represent disease MeSH descriptors and all MeSH descriptors are linked from parent nodes to child nodes by directed edges. In this way, a disease d_{j} can be denoted as DAG(d_{j}) = (d_{j}, T(d_{j}), E(d_{j})), where T(d_{j}) is the set consisting of node d_{j} and its ancestor nodes, and E(d_{j}) represents the set of edges between parent nodes and child nodes [45]. Thereafter, by adopting the DAG scheme, we can define the semantic value of d_{j} as follows:
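In the DAG-based scheme of [45], the semantic value is usually written as a sum of per-term contributions. The sketch below follows that scheme; the symbols D(d_{j}) for the semantic value and D_{d_j}(t) for the contribution of term t to d_{j} are names assumed here for illustration.

```latex
D(d_j) = \sum_{t \in T(d_j)} D_{d_j}(t),
\qquad
D_{d_j}(t) =
\begin{cases}
1, & t = d_j \\
\max \left\{ \delta \cdot D_{d_j}(t') \;\middle|\; t' \in \text{children of } t \right\}, & t \neq d_j
\end{cases}
```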
Where,
Here, δ is the semantic contribution factor, with a value between 0 and 1; following previous work, δ is set to 0.5 in this paper. Thus, based on formula (28) and formula (29), the semantic similarity between diseases d_{j} and d_{i} can be calculated as follows:
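A minimal Python sketch of DAG-based semantic similarity in the style of [45] is given below. It assumes that the similarity is the summed contributions of the shared terms divided by the sum of the two semantic values, and that the DAG is stored as a dictionary mapping each term to its parents; the function names and the toy DAG are hypothetical.

```python
def contributions(disease, parents, delta=0.5):
    """Semantic contribution of `disease` and each of its ancestors to `disease`."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            # A parent receives delta times the contribution of its child;
            # keep the maximum over all children that lead to it.
            value = delta * contrib[node]
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def semantic_similarity(d1, d2, parents, delta=0.5):
    """Shared-ancestor contributions divided by the sum of both semantic values."""
    c1 = contributions(d1, parents, delta)
    c2 = contributions(d2, parents, delta)
    shared = c1.keys() & c2.keys()
    numerator = sum(c1[t] + c2[t] for t in shared)
    return numerator / (sum(c1.values()) + sum(c2.values()))


# Hypothetical toy DAG: two diseases that share a single parent term.
parents = {"A": ["root"], "B": ["root"]}
sim_ab = semantic_similarity("A", "B", parents)
sim_aa = semantic_similarity("A", "A", parents)
```

In this toy DAG each disease has semantic value 1.5, the only shared term is the common parent, and a disease compared with itself has similarity 1.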
Based on formula (25) and formula (30), for any given lncRNA node l_{i} and disease node d_{j} in LMDN^{′}, we can finally obtain the probability that there may exist a potential association between them as follows:
Availability of data and materials
The MATLAB code can be downloaded from https://github.com/jingwenyu18/CFNBC.
The datasets generated and/or analysed during the current study are available in the HMDD repository, http://www.cuilab.cn/; the MNDR repository, http://www.rnasociety.org/mndr/; and the starBase repository, http://starbase.sysu.edu.cn/starbase2/index.php.
Abbreviations
AUC: area under the ROC curve
CFNBC: a novel Collaborative Filtering algorithm for sparse known lncRNA-disease associations, based on the Naïve Bayesian Classifier
CRC: colorectal cancer
FPR: false positive rate
ld: the data set of lncRNA-disease associations
LMDN: the lncRNA-miRNA-disease tripartite network
LMDN′: the updated lncRNA-miRNA-disease tripartite network
lncRNA: long non-coding RNA
lncRNAs: long non-coding RNAs
LOOCV: Leave-One-Out Cross Validation
md: the data set of miRNA-disease associations
ml: the data set of miRNA-lncRNA associations
TPR: true positive rate
References
 1.
Guttman MR, et al. Ribosome profiling provides evidence that large noncoding RNAs do not encode proteins. Cell. 2013;154(1):240–51.
 2.
Guttman M, Rinn JL. Modular regulatory principles of large non–coding RNAs. Nature. 2012;482(7385):339–46.
 3.
Chen X, Yan CC, Zhang X, et al. Long noncoding RNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2016;18(4):558–76.
 4.
Chen X, Sun Y, Guan N, et al. Computational models for lncRNA function prediction and functional similarity calculation. Brief Funct Genomics. 2019;18(1):58–82.
 5.
Faghihi MA, Modarresi F, Khalil AM, et al. Expression of a noncoding RNA is elevated in Alzheimer's disease and drives rapid feed-forward regulation of β-secretase. Nat Med. 2008;14(7):723–30.
 6.
Li D, Liu X, Zhou J, et al. LncRNA HULC modulates the phosphorylation of YB-1 through serving as a scaffold of ERK and YB-1 to enhance hepatocarcinogenesis. Hepatology. 2016;65(5):1612.
 7.
Chen X. Predicting lncRNA-disease associations and constructing lncRNA functional similarity network based on the information of miRNA. Sci Rep. 2015;5(1):13186.
 8.
Chen X. KATZLDA: KATZ measure for the lncRNA-disease association prediction. Sci Rep. 2015;5(1):16840.
 9.
Ping P, Wang L, Kuang L, et al. A novel method for LncRNA-disease association prediction based on an lncRNA-disease association network. IEEE/ACM Trans Comput Biol Bioinform. 2019;16(2):688–93.
 10.
Yu J, Ping P, Wang L, et al. A novel probability model for LncRNA-disease association prediction based on the Naïve Bayesian classifier. Genes. 2018;9(7):345.
 11.
Zhang J, Zhang Z, Chen Z, et al. Integrating multiple heterogeneous networks for novel LncRNA-disease association inference. IEEE/ACM Trans Comput Biol Bioinform. 2019;16(2):396–406.
 12.
Fu G, Wang J, Domeniconi C, et al. Matrix factorization-based data fusion for the prediction of lncRNA–disease associations. Bioinformatics. 2018;34(9):1529–37.
 13.
Chen X, Yan GY. Novel human lncRNA-disease association inference based on lncRNA expression profiles. Bioinformatics. 2013;29(20):2617–24.
 14.
Liu NN, He L, Zhao M. Social temporal collaborative ranking for context aware movie recommendation. ACM Trans Intell Syst Technol. 2013;4(1):1–26.
 15.
Lu C, Yang M, Luo F, et al. Prediction of lncRNA-disease associations based on inductive matrix completion. Bioinformatics. 2018;34(19):3357–64.
 16.
Luo H, Li M, Wang S, et al. Computational drug repositioning using low-rank matrix approximation and randomized algorithms. Bioinformatics. 2018;34(11):1904–12.
 17.
Zeng X, Ding N, Rodríguez-Patón A, et al. Probability-based collaborative filtering model for predicting gene–disease associations. BMC Med Genet. 2017;10(Suppl 5):76.
 18.
Luo J, Ding P, Liang C, et al. Collective prediction of disease-associated miRNAs based on transduction learning. IEEE/ACM Trans Comput Biol Bioinform. 2017;14(6):1468–75.
 19.
Yang X, Gao L, Guo X, et al. A network based method for analysis of lncRNA-disease associations and prediction of lncRNAs implicated in diseases. PLoS One. 2014;9(1):e87797.
 20.
Furnari FB, Fenton T, Bachoo RM, et al. Malignant astrocytic glioma: genetics, biology, and paths to treatment. Genes Dev. 2007;21(21):2683–710.
 21.
Wang Z, Yuan J, Li L, et al. Long noncoding RNA XIST exerts oncogenic functions in human glioma by targeting miR-137. Am J Transl Res. 2017;9(4):1845–55.
 22.
Ma KX, Wang HJ, Li XR, et al. Long noncoding RNA MALAT1 associates with the malignant status and poor prognosis in glioma. Tumor Biol. 2015;36(5):3355–9.
 23.
Gong W, Zheng J, Liu X, et al. Knockdown of long noncoding RNA KCNQ1OT1 restrained glioma cells’ malignancy by activating miR-370/CCNE2 axis. Front Cell Neurosci. 2017;11:84.
 24.
Siegel R, Desantis C, Jemal A. Colorectal cancer statistics, 2014. CA Cancer J Clin. 2014;64(2):104–17.
 25.
Song H, He P, Shao T, et al. Long noncoding RNA XIST functions as an oncogene in human colorectal cancer by targeting miR-132-3p. J BUON. 2017;22(3):696–703.
 26.
Zheng HT, Shi DB, Wang YW, et al. High expression of lncRNA MALAT1 suggests a biomarker of poor prognosis in colorectal cancer. Int J Clin Exp Pathol. 2014;7(6):3174–81.
 27.
Dong H, Xu G, Meng W, et al. Long noncoding RNA H19 indicates a poor prognosis of colorectal cancer and promotes tumor growth by recruiting and binding to eIF4A3. Oncotarget. 2016;7(16):22159–73.
 28.
Hartgrink HH, Jansen EP, Grieken NCV, et al. Gastric cancer. Lancet. 2009;374(9688):477–90.
 29.
Chen D, Ju H, Lu Y, et al. Long noncoding RNA XIST regulates gastric cancer progression by acting as a molecular sponge of miR-101 to modulate EZH2 expression. J Exp Clin Cancer Res. 2016;35(1):142.
 30.
Xia H, Chen Q, Chen Y, et al. The lncRNA MALAT1 is a novel biomarker for gastric cancer metastasis. Oncotarget. 2016;7(35):56209–18.
 31.
Lian D, Amin B, Du D, et al. Enhanced expression of the long noncoding RNA SNHG16 contributes to gastric cancer progression and metastasis. Cancer Biomark. 2017;21(1):151–60.
 32.
Fu JW, Kong Y, Sun X. Long noncoding RNA NEAT1 is an unfavorable prognostic factor and regulates migration and invasion in gastric cancer. J Cancer Res Clin Oncol. 2016;142(7):1571–9.
 33.
Yang F, Bi J, Xue X, et al. Upregulated long noncoding RNA H19 contributes to proliferation of gastric cancer cells. FEBS J. 2012;279(17):3159–65.
 34.
Zhang E, He X, Yin D, et al. Increased expression of long noncoding RNA TUG1 predicts a poor prognosis of gastric cancer and regulates cell proliferation by epigenetically silencing of p57. Cell Death Dis. 2016;7(2):e2109.
 35.
Chen X, Xie D, Wang L, et al. BNPMDA: bipartite network projection for MiRNA-disease association prediction. Bioinformatics. 2018;34(18):3178–86.
 36.
Chen X, Huang L. LRSSLMDA: Laplacian regularized sparse subspace learning for MiRNA-disease association prediction. PLoS Comput Biol. 2017;13(12):e1005912.
 37.
Chen X, Huang L, Xie D, et al. EGBMMDA: extreme gradient boosting machine for MiRNA-disease association prediction. Cell Death Dis. 2018;9:3.
 38.
Chen X, Yan CC, Zhang X, et al. Drug-target interaction prediction: databases, web servers and computational models. Brief Bioinform. 2016;17(4):696–712.
 39.
Chen X, Ren B, Chen M, et al. NLLSS: predicting synergistic drug combinations based on semi-supervised learning. PLoS Comput Biol. 2016;12(7):e1004975.
 40.
Li Y, Qiu C, Tu J, et al. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res. 2014;42(D1):D1070–4.
 41.
Li JH, Liu S, Zhou H, et al. starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein–RNA interaction networks from large-scale CLIP-Seq data. Nucleic Acids Res. 2014;42(D1):D92–7.
 42.
Cui T, Zhang L, Huang Y, et al. MNDR v2.0: an updated resource of ncRNA–disease associations in mammals. Nucleic Acids Res. 2017;46(D1):D371–4.
 43.
Zhou T, Lü L, Zhang Y, et al. Predicting missing links via local information. Eur Phys J B. 2009;71(4):623–30.
 44.
Liu W, Lü L. Link prediction based on local random walk. EPL (Europhysics Letters). 2010;89(5):58007.
 45.
Wang D, Wang J, Lu M, et al. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics. 2010;26(13):1644–50.
Acknowledgments
The authors thank the anonymous referees for suggestions that helped improve the paper substantially.
Funding
This research is partly sponsored by the National Natural Science Foundation of China (No. 61873221, No. 61672447) and the Natural Science Foundation of Hunan Province (No. 2018JJ4058, No. 2019JJ70010, No. 2017JJ5036). Publication costs were funded by the National Natural Science Foundation of China (61873221, 61672447). The funder is Lei Wang (L.W.), whose contributions are stated in the Contributions section.
Author information
Contributions
Conceptualization, J.Y. and L.W.; Methodology, J.Y., Q.Z. and L.W.; Validation, Z.X., X.F. and Q.Z.; Formal Analysis, J.Y. and L.W.; Investigation, X.F. and Z.X.; Resources, Z.X. and Q.Z.; Data Curation, J.Y. and X.F.; Writing-Original Draft Preparation, J.Y. and Z.X.; Writing-Review and Editing, L.W. and Q.Z.; Supervision, L.W.; Project Administration, L.W. and Q.Z.; Funding Acquisition, L.W. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Lei Wang.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that there are no competing interests regarding the publication of this paper.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Keywords
 lncRNA-disease associations
 Original tripartite network
 Item-based collaborative filtering
 Updated tripartite network
 Naïve Bayesian classifier