A novel collaborative filtering model for LncRNA-disease association prediction based on the Naïve Bayesian classifier

Background Since the number of known lncRNA-disease associations verified by biological experiments is quite limited, it has been a challenging task to uncover human disease-related lncRNAs in recent years. Moreover, considering the fact that biological experiments are very expensive and time-consuming, it is important to develop efficient computational models to discover potential lncRNA-disease associations. Results In this manuscript, a novel Collaborative Filtering model called CFNBC for inferring potential lncRNA-disease associations is proposed based on Naïve Bayesian Classifier. In CFNBC, an original lncRNA-miRNA-disease tripartite network is constructed first by integrating known miRNA-lncRNA associations, miRNA-disease associations and lncRNA-disease associations, and then, an updated lncRNA-miRNA-disease tripartite network is further constructed through applying the item-based collaborative filtering algorithm on the original tripartite network. Finally, based on the updated tripartite network, a novel approach based on the Naïve Bayesian Classifier is proposed to predict potential associations between lncRNAs and diseases. The novelty of CFNBC lies in the construction of the updated lncRNA-miRNA-disease tripartite network and the introduction of the item-based collaborative filtering algorithm and Naïve Bayesian Classifier, which guarantee that CFNBC can be applied to predict potential lncRNA-disease associations efficiently without entirely relying on known miRNA-disease associations. Simulation results show that CFNBC can achieve a reliable AUC of 0.8576 in the Leave-One-Out Cross Validation (LOOCV), which is considerably better than previous state-of-the-art results. Moreover, case studies of glioma, colorectal cancer and gastric cancer demonstrate the excellent prediction performance of CFNBC as well. 
Conclusions According to the simulation results, owing to its satisfactory prediction performance, CFNBC may be a useful addition to biomedical research in the future.


Background
Recently, accumulating evidence has indicated that lncRNAs (long non-coding RNAs) are involved in almost the entire cell life cycle through various mechanisms [1,2] and are closely related to the development of some complex human diseases [3,4], such as Alzheimer's disease [5] and many types of cancer [6]. Hence, identifying disease-related lncRNAs is critical to a systematic understanding of the pathogenesis of complex diseases and may further facilitate the discovery of potential drug targets. However, since biological experiments are very expensive and time-consuming, developing effective computational models to uncover potential disease-related lncRNAs has become a hot topic. Up to now, existing computational models for predicting potential associations between lncRNAs and diseases can be roughly classified into two major categories. Generally, in the first category of models, biological information on miRNAs, lncRNAs or diseases is adopted to identify potential lncRNA-disease associations.
For example, Chen et al. proposed a prediction model called HGLDA based on the information of miRNAs, in which a hypergeometric distribution test was adopted to infer potential disease-related lncRNAs [7]. Chen et al. proposed a KATZ measure to predict potential lncRNA-disease associations by utilizing the information of lncRNAs and diseases [8]. Ping and Wang et al. proposed a method for identifying potential disease-related lncRNAs based on the topological information of the known lncRNA-disease association network [9]. In the second category of models, multiple data sources are integrated to construct various heterogeneous networks for inferring potential associations between diseases and lncRNAs. For example, Yu and Wang et al. proposed a naïve Bayesian classifier based probability model to uncover potential disease-related lncRNAs by integrating known miRNA-disease associations, miRNA-lncRNA associations, lncRNA-disease associations, gene-lncRNA associations, gene-miRNA associations and gene-disease associations [10]. Zhang et al. developed a computational model to discover possible lncRNA-disease associations by combining lncRNA similarity, protein-protein interactions and disease similarity [11]. Fu et al. presented a prediction model that considers the quality and relevance of different heterogeneous data sources to identify potential lncRNA-disease associations [12]. Chen et al. proposed a novel prediction model called LRLSLDA by adopting Laplacian Regularized Least Squares to integrate the known phenome-lncRNAome network, the disease similarity network and the lncRNA similarity network [13].
In recent years, in order to address the scarcity of known associations between different objects, an increasing number of recommender systems have been developed to increase the reliability of association prediction based on collaborative filtering methods [14], which rely on previously observed interactions to predict user-item relationships. Up to now, several prediction models have been proposed in which recommender algorithms are used to identify different potential disease-related objects. For example, Lu et al. proposed a model called SIMCLDA to predict potential lncRNA-disease associations based on inductive matrix completion by computing the Gaussian interaction profile kernel of known lncRNA-disease associations, disease-gene associations and gene-gene ontology associations [15]. Luo et al. modeled the drug repositioning problem as a recommendation system to predict novel drug indications based on known drug-disease associations by utilizing matrix completion [16]. Zeng et al. developed a novel prediction model called PCFM by adopting a probability-based collaborative filtering algorithm to infer gene-associated human diseases [17]. Luo et al. proposed a prediction model named CPTL to uncover potential disease-associated miRNAs via transduction learning by integrating disease similarity, miRNA similarity and known miRNA-disease associations [18].
In this study, a novel Collaborative Filtering model called CFNBC for predicting potential lncRNA-disease associations is proposed on the basis of the Naïve Bayesian Classifier. In CFNBC, an original lncRNA-miRNA-disease tripartite network is constructed first by integrating the miRNA-disease association network, the miRNA-lncRNA association network and the lncRNA-disease association network. Then, considering that the number of known associations among lncRNAs, miRNAs and diseases is very limited, an updated tripartite network is further constructed by applying a collaborative filtering algorithm on the original tripartite network. Thereafter, based on the updated tripartite network, potential lncRNA-disease associations are predicted by adopting the Naïve Bayesian Classifier. Finally, in order to evaluate the prediction performance of our newly proposed model, LOOCV is implemented for CFNBC based on known experimentally verified lncRNA-disease associations. As a result, CFNBC achieves an AUC of 0.8576, which is better than that of previous classical prediction models. Moreover, case studies of glioma, colorectal cancer and gastric cancer demonstrate the prediction performance of CFNBC as well.

Leave-one-out cross validation
In this section, in order to estimate the prediction performance of CFNBC, LOOCV is implemented based on known experimentally verified lncRNA-disease associations. During simulation, for a given disease $d_j$, each known lncRNA related to $d_j$ is left out in turn as the test sample, whereas all the remaining associations between lncRNAs and $d_j$ are taken as training cases for model learning. Thus, the similarity scores between candidate lncRNAs and $d_j$ can be calculated, and all candidate lncRNAs can be ranked by their predicted scores. The higher the left-out lncRNA is ranked, the better the performance of our prediction model. Moreover, the area under the receiver operating characteristic (ROC) curve (AUC) can be further used to measure the performance of CFNBC: the closer the AUC value is to 1, the better the prediction performance of CFNBC. By setting different classification thresholds, we can calculate the true positive rate (TPR or sensitivity) and the false positive rate (FPR or 1-specificity) as follows:

$$\mathrm{TPR} = \frac{TP}{TP + FN}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN}$$

Here, TP, FN, FP and TN denote the numbers of true positives, false negatives, false positives and true negatives respectively. Specifically, TPR indicates the percentage of positive test samples ranked above a given rank cutoff, and FPR denotes the percentage of negative samples ranked above the same cutoff.
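The ROC construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores and labels are made-up toy values, and the function names `roc_points` and `auc` are hypothetical, not part of CFNBC.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold over every score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Sweep thresholds from the highest score down; tied scores share one threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example (assumed values): 1 marks a known association, 0 an unknown pair.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 0, 0, 0]
print(auc(roc_points(scores, labels)))  # 0.875
```

In this toy ranking, one of the two positives is outranked by a single negative, so 7 of the 8 positive-negative pairs are ordered correctly and the AUC is 0.875.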

The effects of α
Based on the assumption that original common neighboring miRNA nodes deserve more credibility than recommended common neighboring miRNA nodes, a decay factor α is used to make our prediction model CFNBC work more effectively. In this section, in order to evaluate the effect of α on the prediction performance of CFNBC, we carry out a series of experiments in which α is set to different values ranging from 0.05 to 0.8. As shown in Table 1, CFNBC achieves the best prediction performance when α is set to 0.05.

Comparison with other state-of-the-art methods
In order to further assess the performance of CFNBC, in this section we compare it with four state-of-the-art prediction models, namely HGLDA [7], SIMCLDA [15], NBCLDA [10] and the method proposed by Yang et al. [19], in the framework of LOOCV with α set to 0.05. Since HGLDA utilized a hypergeometric distribution test to infer lncRNA-disease associations by integrating miRNA-disease associations with lncRNA-miRNA associations, we adopt a data set consisting of 183 experimentally validated lncRNA-disease associations, as used for the hypergeometric distribution test, to compare CFNBC with HGLDA. As illustrated in Table 2 and Fig. 1, the simulation results demonstrate that CFNBC outperforms HGLDA significantly. As for SIMCLDA, since it applied inductive matrix completion to identify lncRNA-disease associations by integrating lncRNA-disease associations, gene-disease associations and gene-gene ontology associations, we collect from the data set adopted by SIMCLDA a sub data set that belongs to $DS_{ld}$ in CFNBC and consists of 101 known associations between 30 different lncRNAs and 79 different diseases, and use it to compare CFNBC with SIMCLDA. As shown in Table 2 and Fig. 2, CFNBC achieves an AUC of 0.8579, which is better than the AUC of 0.8526 achieved by SIMCLDA. As for NBCLDA, since it fused multiple heterogeneous biological data sources and adopted the naïve Bayesian classifier to uncover potential lncRNA-disease associations, we compare CFNBC with it on the data set $DS_{ld}$ directly. As illustrated in Table 2 and Fig. 3, CFNBC obtains an AUC of 0.8576, which is higher than the AUC of 0.8519 achieved by NBCLDA as well.
Finally, when comparing CFNBC with the method proposed by Yang et al., in order to keep the comparison fair, we collect a data set consisting of 319 lncRNA-disease associations between 37 lncRNAs and 52 diseases by deleting the nodes with degree equal to 1 from the data set $DS_{ld}$. As shown in Table 2 and Fig. 4, CFNBC achieves an AUC of 0.8915, which considerably outperforms the AUC of 0.8568 achieved by the method proposed by Yang et al. Hence, we can conclude that CFNBC achieves better performance than these classical prediction models.
Additionally, in order to further evaluate the prediction performance of CFNBC, we compare it with the above four models on the predicted top-k associations by using the F1-score measure. During simulation, we randomly choose 80% of the known lncRNA-disease associations as the training set, whereas all remaining known and unknown lncRNA-disease associations are taken as the testing set. Since the sets of known lncRNA-disease associations in these models are different, we set different thresholds k to compare them with CFNBC. As shown in Table 3, CFNBC outperforms these four state-of-the-art models in terms of the F1-score measure as well. Moreover, the paired t-test also demonstrates that the performance of CFNBC is significantly better than that of the other methods in terms of F1-scores (p-value < 0.05, as illustrated in Table 4).
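The top-k F1-score used in this comparison can be illustrated with a short Python sketch. The function name `f1_at_k`, the node labels and the toy ranking below are assumptions made for illustration only; precision and recall are taken over the top-k predicted pairs against the held-out known associations.

```python
def f1_at_k(ranked_pairs, true_pairs, k):
    """F1-score of the top-k predicted (lncRNA, disease) pairs.

    ranked_pairs: predicted pairs sorted by decreasing score.
    true_pairs:   set of held-out known associations.
    """
    top_k = ranked_pairs[:k]
    hits = sum(1 for pair in top_k if pair in true_pairs)
    precision = hits / k if k else 0.0
    recall = hits / len(true_pairs) if true_pairs else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example (assumed values).
ranked = [("l1", "d1"), ("l2", "d1"), ("l3", "d2"), ("l1", "d2")]
held_out = {("l1", "d1"), ("l3", "d2"), ("l4", "d3")}
print(f1_at_k(ranked, held_out, 2))
print(f1_at_k(ranked, held_out, 4))
```

With k = 2 the sketch finds one hit (precision 1/2, recall 1/3), and with k = 4 it finds two hits (precision 1/2, recall 2/3), which shows how the chosen k trades precision against recall.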

Case studies
In order to further demonstrate the capability of CFNBC in inferring new lncRNAs related to a given disease, in this section we carry out case studies of glioma, colorectal cancer and gastric cancer for CFNBC based on the data set $DS_{ld}$. The top 20 disease-related lncRNAs predicted by CFNBC have been checked by manually mining the relevant literature, and the corresponding evidence is listed in Table 5. Among the three kinds of cancers chosen for the case studies, glioma is one of the most lethal primary brain tumors, with a median survival of less than 12 months, and 6 out of 100000 people may have gliomas [20]; hence it is important to find potential associations between glioma and the dysregulation of lncRNAs. As illustrated in Table 5, when CFNBC is applied to predict candidate lncRNAs related to glioma, six out of the top 20 predicted glioma-related lncRNAs have been validated by recent literature on biological experiments. For instance, the lncRNA XIST has been demonstrated to be an important regulator in tumor progression and may be a potential therapeutic target in the treatment of glioma [21]. Ma et al. found that the lncRNA MALAT1 plays an important role in glioma progression and prognosis and may be considered a convincing prognostic biomarker for glioma patients [22]. Xue et al. provided a comprehensive analysis of the KCNQ1OT1-miR-370-CCNE2 axis in human glioma cells and a novel strategy for glioma treatment [23].
As for colorectal cancer (CRC), it is the third most common cancer and the third leading cause of cancer death in men and women in the United States [24]. In recent years, many CRC-related lncRNAs have been reported on the basis of biological experiments. For example, Song et al. demonstrated that higher expression of XIST was correlated with worse disease-free survival of CRC patients [25]. Zheng et al. showed that a higher expression level of MALAT1 may serve as a negative prognostic marker in stage II/III CRC patients [26]. Nakano et al. found that the loss of imprinting of the lncRNA KCNQ1OT1 may play an important role in the occurrence of CRC [27]. As illustrated in Table 5, when CFNBC is applied to uncover candidate lncRNAs related to CRC, 6 out of the top 20 predicted CRC-related lncRNAs have been verified in the Lnc2Cancer database.
Moreover, gastric cancer is the second most frequent cause of cancer death [28]. Up to now, many lncRNAs have been reported to be associated with gastric cancer. For instance, XIST, MALAT1, SNHG16, NEAT1, H19 and TUG1 were reported to be upregulated in gastric cancer [29][30][31][32][33][34]. As illustrated in Table 5, when CFNBC is applied to uncover candidate lncRNAs related to gastric cancer, 6 out of the top 20 newly identified lncRNAs related to gastric cancer have been validated by the lncRNADisease and Lnc2Cancer databases respectively.

Discussion
Accumulating evidence has shown that predicting potential lncRNA-disease associations is helpful in understanding the crucial roles of lncRNAs in biological processes and in the diagnosis, prognosis and treatment of complex diseases. In this manuscript, we first constructed an original lncRNA-miRNA-disease tripartite network by combining miRNA-lncRNA, miRNA-disease and lncRNA-disease associations. Then, we formulated the prediction of potential lncRNA-disease associations as a recommender system problem and obtained an updated tripartite network by applying a novel item-based collaborative filtering algorithm to the original tripartite network. Finally, we proposed a prediction model called CFNBC to infer potential associations between lncRNAs and diseases by applying the naïve Bayesian Classifier on the updated tripartite network. Compared with state-of-the-art prediction models, CFNBC achieves better performance in terms of AUC values without entirely relying on known lncRNA-disease associations, which means that CFNBC can predict potential associations between lncRNAs and diseases even when these lncRNAs and diseases are not in known data sets. Additionally, we implemented LOOCV to evaluate the prediction performance of CFNBC, and the simulation results showed that the problem of limited positive samples in state-of-the-art models has been considerably alleviated in CFNBC by the addition of the collaborative filtering algorithm, and that the predictive accuracy has been improved by adopting the disease semantic similarity to infer potential associations between lncRNAs and diseases. Moreover, case studies of glioma, colorectal cancer and gastric cancer were carried out to further estimate the performance of CFNBC, and the results demonstrated that CFNBC could be a useful tool for predicting potential relationships between lncRNAs and diseases as well.
Of course, despite the reliable experimental results achieved by CFNBC, our model still has some limitations. For example, many other types of data can be utilized to uncover potential lncRNA-disease associations; therefore, the prediction performance of CFNBC could be improved by the addition of more types of data. In addition, the results of CFNBC may be affected by the quality of the data sets and the number of known lncRNA-disease relationships. Furthermore, successfully established models in other computational fields could inspire the development of lncRNA-disease association prediction, such as microRNA-disease association prediction [35][36][37], drug-target interaction prediction [38] and synergistic drug combination prediction [39].

Conclusion
Finding lncRNA-disease relationships is essential for understanding human disease mechanisms. In this manuscript, our main contributions are as follows: (1) An original tripartite network is constructed by integrating a variety of biological information including miRNA-lncRNA, miRNA-disease and lncRNA-disease associations. (2) An updated tripartite network is constructed by applying a novel item-based collaborative filtering algorithm on the original tripartite network. (3) A novel prediction model called CFNBC is developed based on the naïve Bayesian Classifier and applied on the updated tripartite network to infer potential associations between lncRNAs and diseases. (4) CFNBC can be adopted to predict a potential disease-related lncRNA or a potential lncRNA-related disease without relying on any known lncRNA-disease associations. (5) A recommendation system is applied in CFNBC, which allows CFNBC to achieve effective prediction results when known lncRNA-disease associations are scarce.

Data collection and preprocessing
In order to construct our novel prediction model CFNBC, we combined three kinds of heterogeneous data sets, namely the miRNA-disease association set, the miRNA-lncRNA association set and the lncRNA-disease association set, to infer potential associations between lncRNAs and diseases. These data sets were collected from different public databases including HMDD [40], starBase v2.0 [41] and MNDR v2.0 [42].

Construction of the miRNA-disease and miRNA-lncRNA association sets
Firstly, we downloaded two data sets of known miRNA-disease associations and miRNA-lncRNA associations from HMDD [40] in August 2018 and starBase v2.0 [41] in January 2015 respectively. Then, we removed duplicated associations with conflicting evidence from these two data sets separately, manually picked out the common miRNAs existing in both the data set of miRNA-disease associations and the data set of miRNA-lncRNA associations, and retained only the associations related to these selected miRNAs in the two data sets. As a result, we finally obtained a data set $DS_{md}$ including 4704 different miRNA-disease interactions between 246 different miRNAs and 373 different diseases, and a data set $DS_{ml}$ including 9086 different miRNA-lncRNA interactions between 246 different miRNAs and 1089 different lncRNAs (see Supplementary Materials Table 1 and Table 2).

Construction of the lncRNA-disease association set
Firstly, we downloaded a data set of known lncRNA-disease associations from the MNDR v2.0 database [42] in 2017. Then, in order to keep disease names uniform, we transformed some disease names included in the set of lncRNA-disease associations into their aliases in the data set of miRNA-disease associations, and unified the names of lncRNAs in the data sets of miRNA-lncRNA associations and lncRNA-disease associations. By this means, we selected the lncRNA-disease interactions associated with both lncRNAs belonging to $DS_{ml}$ and diseases belonging to $DS_{md}$. As a result, we finally obtained a data set $DS_{ld}$ including 407 different lncRNA-disease interactions between 77 different lncRNAs and 95 different diseases (see Supplementary Materials Table 3).
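The filtering steps described for the three data sets can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes that names have already been unified, that each association is a plain pair of identifiers, and the function name `build_datasets` and the toy identifiers are hypothetical.

```python
def build_datasets(mirna_disease, mirna_lncrna, lncrna_disease):
    """Return (DS_md, DS_ml, DS_ld) as sets of pairs, following the described filters."""
    md = set(mirna_disease)  # converting to a set drops duplicated associations
    ml = set(mirna_lncrna)
    # Keep only miRNAs that appear in both the miRNA-disease and miRNA-lncRNA sets.
    common = {m for m, _ in md} & {m for m, _ in ml}
    ds_md = {(m, d) for m, d in md if m in common}
    ds_ml = {(m, l) for m, l in ml if m in common}
    # Keep lncRNA-disease pairs whose lncRNA is in DS_ml and whose disease is in DS_md.
    lncrnas = {l for _, l in ds_ml}
    diseases = {d for _, d in ds_md}
    ds_ld = {(l, d) for l, d in set(lncrna_disease) if l in lncrnas and d in diseases}
    return ds_md, ds_ml, ds_ld


# Toy input (assumed identifiers).
raw_md = [("m1", "d1"), ("m1", "d1"), ("m2", "d2"), ("m3", "d1")]
raw_ml = [("m1", "l1"), ("m2", "l2"), ("m4", "l3")]
raw_ld = [("l1", "d1"), ("l3", "d1"), ("l2", "d9")]
ds_md, ds_ml, ds_ld = build_datasets(raw_md, raw_ml, raw_ld)
print(sorted(ds_md), sorted(ds_ml), sorted(ds_ld))
```

In the toy input, m3 and m4 are dropped because each occurs in only one source, and only the pair (l1, d1) survives in the lncRNA-disease set.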

Analysis of relational data sources
In CFNBC, the newly constructed lncRNA-miRNA-disease tripartite network (LMDN for short) consists of three kinds of objects: lncRNAs, miRNAs and diseases. Therefore, we collected three kinds of relational data sources from different databases based on these three kinds of objects. As illustrated in Fig. 5

Method
As illustrated in Fig. 6, our newly proposed prediction model CFNBC consists of the following four main stages:
Step1: As illustrated in Fig. 6(a), we can construct a miRNA-disease association network MDN, a miRNA-lncRNA association network MLN, and an lncRNA-disease association network LDN based on the data sets $DS_{md}$, $DS_{ml}$ and $DS_{ld}$ respectively.
Step2: As illustrated in Fig. 6(b), by integrating these three newly constructed association networks MDN, MLN and LDN, we can further construct an original lncRNA-miRNA-disease association tripartite network LMDN.
Step3: As illustrated in Fig. 6(c), after applying the collaborative filtering algorithm on LMDN, we can obtain an updated lncRNA-miRNA-disease association tripartite network LMDN′.
Step4: As illustrated in Fig. 6(d), after applying the naïve Bayesian classifier to LMDN′, we can obtain our final prediction model CFNBC.
In the original tripartite network LMDN, owing to the sparse known associations between lncRNAs and diseases, for any given lncRNA node a and disease node b, the number of miRNA nodes associated with both a and b will be very limited. Hence, in CFNBC, we designed a collaborative filtering algorithm for recommending suitable miRNA nodes to the corresponding lncRNA nodes and disease nodes respectively. Then, based on these known and recommended common neighboring nodes, we can finally apply the Naïve Bayesian Classifier on LMDN′ to uncover potential lncRNA-disease associations.

Construction of LMDN
Let the matrix $R^0_{MD}$ be the original adjacency matrix of known miRNA-disease associations, and let the entry $R^0_{MD}(m_k, d_j)$ denote the element in the $k$-th row and $j$-th column of $R^0_{MD}$. Then $R^0_{MD}(m_k, d_j) = 1$ if and only if the miRNA node $m_k$ is associated with the disease node $d_j$; otherwise, $R^0_{MD}(m_k, d_j) = 0$. In the same way, we can obtain the original adjacency matrix $R^0_{ML}$ of known miRNA-lncRNA associations, in which $R^0_{ML}(m_k, l_i) = 1$ if and only if the miRNA node $m_k$ is associated with the lncRNA node $l_i$; otherwise, $R^0_{ML}(m_k, l_i) = 0$. Additionally, considering that a recommender system involves input data on users and items, in CFNBC we take lncRNAs and diseases as users and miRNAs as items. Since the rows of the two original adjacency matrices $R^0_{MD}$ and $R^0_{ML}$ are indexed by the same miRNAs, we can construct another adjacency matrix $R^0_{MLD} = [R^0_{ML}, R^0_{MD}]$ by splicing $R^0_{ML}$ and $R^0_{MD}$ together. Accordingly, the rows of $R^0_{MLD}$ are indexed in exactly the same way as the rows of $R^0_{MD}$ or $R^0_{ML}$, while the columns of $R^0_{MLD}$ consist of the columns of $R^0_{ML}$ followed by the columns of $R^0_{MD}$.
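As a minimal sketch of this construction, the two adjacency matrices can be built as nested lists and spliced row by row. The helper names `adjacency` and `hconcat` and the toy associations are assumptions for illustration; rows are miRNAs (items) and columns are lncRNAs followed by diseases (users).

```python
def adjacency(rows, cols, pairs):
    """Binary matrix with entry 1 where (row, col) is a known association."""
    known = set(pairs)
    return [[1 if (r, c) in known else 0 for c in cols] for r in rows]


def hconcat(left, right):
    """Splice two matrices with identical row order side by side."""
    return [l_row + r_row for l_row, r_row in zip(left, right)]


# Toy data (assumed identifiers).
mirnas = ["m1", "m2", "m3"]
lncrnas = ["l1", "l2"]
diseases = ["d1", "d2"]
ds_ml = [("m1", "l1"), ("m2", "l1"), ("m3", "l2")]
ds_md = [("m1", "d1"), ("m3", "d2")]

r0_ml = adjacency(mirnas, lncrnas, ds_ml)
r0_md = adjacency(mirnas, diseases, ds_md)
r0_mld = hconcat(r0_ml, r0_md)  # columns: l1, l2, d1, d2
for row in r0_mld:
    print(row)
```

Each row of the spliced matrix then lists, for one miRNA, its known lncRNA neighbors followed by its known disease neighbors.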

Applying the item-based collaborative filtering algorithm on LMDN
Since CFNBC is based on the collaborative filtering algorithm, the relevance scores between lncRNAs and diseases predicted by CFNBC depend on the common neighbors between these lncRNAs and diseases. However, owing to the scarce known lncRNA-miRNA, lncRNA-disease and miRNA-disease associations, the number of common neighbors between these lncRNAs and diseases in LMDN is very limited as well. Hence, in order to increase the number of common neighbors between lncRNAs and diseases in LMDN, we apply the collaborative filtering algorithm on LMDN in this section. First, on the basis of $R^0_{MLD}$ and LMDN, we can obtain a co-occurrence matrix $R^{m \times m}$, in which the entry $R(m_k, m_r)$ denotes the element in the $k$-th row and $r$-th column of $R^{m \times m}$; there is $R(m_k, m_r) = 1$ if and only if the miRNA node $m_k$ and the miRNA node $m_r$ share at least one common neighboring node (a lncRNA node or a disease node) in LMDN, otherwise $R(m_k, m_r) = 0$. Hence, a similarity matrix $R'$ can be calculated after normalizing $R^{m \times m}$ as follows: Here, $|N(m_k)|$ represents the number of known lncRNAs and diseases associated with $m_k$ in LMDN, that is, the number of elements equal to 1 in the $k$-th row of $R^0_{MLD}$; $|N(m_r)|$ represents the number of elements equal to 1 in the $r$-th row of $R^0_{MLD}$; and $|N(m_k) \cap N(m_r)|$ denotes the number of known lncRNAs and diseases associated with both $m_k$ and $m_r$ in LMDN.
Fig. 5 The relationships among three kinds of different data sources
Next, for any given lncRNA node $l_i$ and miRNA node $m_h$ in LMDN, if the association between $l_i$ and $m_h$ is already known, then for a miRNA node $m_t$ other than $m_h$ in LMDN, the higher the relevance score between $m_t$ and $m_h$, the greater the possibility that there exists a potential association between $l_i$ and $m_t$. Hence, we can obtain the relevance score between $l_i$ and $m_t$ based on the similarities between miRNAs as follows: Here, $N(l_i)$ represents the set of neighboring miRNA nodes that are directly connected to $l_i$ in LMDN, and $S(K, m_t\text{-top})$ denotes the set of top-K miRNAs that are most similar to $m_t$ in LMDN. $R'_t$ is a vector consisting of the $t$-th row of $R'$. In addition, there is $u_{it} = 1$ if and only if $l_i$ interacts with $m_t$ in ML, otherwise $u_{it} = 0$.
Similarly, for any given disease node $d_j$ and miRNA node $m_h$ in LMDN, if the association between $d_j$ and $m_h$ is already known, then for a miRNA node $m_t$ other than $m_h$ in LMDN, we can obtain the relevance score between $d_j$ and $m_t$ based on the similarities between miRNAs as follows: Here, $N(d_j)$ denotes the set of neighboring miRNA nodes that are directly connected to $d_j$ in LMDN. In addition, there is $u_{jt} = 1$ if and only if $d_j$ interacts with $m_t$ in MD, otherwise $u_{jt} = 0$.
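The co-occurrence and similarity matrices can be illustrated with the sketch below. Two points are assumptions rather than statements from the text: the normalization is taken to be the cosine-style form $|N(m_k) \cap N(m_r)| / \sqrt{|N(m_k)|\,|N(m_r)|}$ that is common in item-based collaborative filtering, and diagonal entries are set to 0. The function names and the toy matrix are hypothetical.

```python
import math


def neighbours(r0_mld):
    """For each miRNA row, the set of column indices (lncRNAs/diseases) equal to 1."""
    return [{j for j, v in enumerate(row) if v == 1} for row in r0_mld]


def cooccurrence(r0_mld):
    """R(m_k, m_r) = 1 if the two miRNAs share at least one neighbor (diagonal assumed 0)."""
    nb = neighbours(r0_mld)
    n = len(nb)
    return [[1 if k != r and nb[k] & nb[r] else 0 for r in range(n)] for k in range(n)]


def similarity(r0_mld):
    """Assumed cosine-style normalization of the shared-neighbor counts."""
    nb = neighbours(r0_mld)
    n = len(nb)
    sim = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for r in range(n):
            if k == r or not nb[k] or not nb[r]:
                continue
            sim[k][r] = len(nb[k] & nb[r]) / math.sqrt(len(nb[k]) * len(nb[r]))
    return sim


# Toy matrix (assumed): rows m1..m3, columns l1, l2, d1, d2.
r0_mld = [[1, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 1]]
print(cooccurrence(r0_mld))
print(similarity(r0_mld))
```

In this toy case, m1 and m2 share the neighbor l1, so their similarity is 1/sqrt(2 × 1), while m3 shares no neighbor with either and receives similarity 0.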
Obviously, based on the similarity matrix $R'$ and the adjacency matrix $R^0_{MLD}$, we can construct a new recommender matrix $R^1_{MLD}$ as follows: In particular, for a certain lncRNA node $l_i$ or a disease node $d_j$ in LMDN, if there is a miRNA $m_k$ satisfying MLD , then we will first sum up the values of all elements in the $i$-th or $j$-th column of $R^1_{MLD}$ respectively. Thereafter, we will obtain its average value $p$. Finally, if there is a miRNA node $m_\theta$ in the $i$-th or $j$-th column of $R^1_{MLD}$ satisfying $R^1_{MLD}(m_\theta, l_i) > p$ or $R^1_{MLD}(m_\theta, d_j) > p$, then we will recommend the miRNA $m_\theta$ to $l_i$ or $d_j$ respectively. At the same time, we will add a new edge between $m_\theta$ and $l_i$, or between $m_\theta$ and $d_j$, in LMDN.
For instance, according to Fig. 6 and the given matrix , we can obtain its corresponding matrices $R^{m \times m}$, $R'$ and $R^1_{MLD}$ as follows: To be specific, as illustrated in Fig. 6, taking the lncRNA node $l_1$ as an example, from the matrix $R^0_{MLD}$ it is easy to see that there are two miRNA nodes, $m_1$ and $m_2$, associated with $l_1$. In addition, according to formula (9), we can know as well that there = 0.81. Hence, we will recommend the miRNA node $m_5$ to $l_1$. In the same way, the miRNA nodes $m_2$, $m_4$ and $m_5$ will be recommended to $l_2$ as well. Moreover, according to the previous description, the new edges between $m_5$ and $l_1$, $m_2$ and $l_2$, $m_4$ and $l_2$, and $m_5$ and $l_2$ will be added to the original tripartite network LMDN at the same time. Thereafter, we can obtain an updated lncRNA-miRNA-disease association tripartite network LMDN′ on the basis of the original tripartite network LMDN.
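The thresholding rule described here (recommend a miRNA to a user column when its score exceeds the column average p) can be sketched as follows. The score matrix is a made-up input, since this sketch does not reproduce how the recommender matrix itself is computed; the function name `recommend`, the labels, and the choice to average over all rows of a column are assumptions.

```python
def recommend(r1_mld, mirnas, users):
    """Return new (miRNA, user) edges where the score exceeds the column average.

    r1_mld: recommender scores, rows = miRNAs, columns = users (lncRNAs and diseases).
    """
    new_edges = set()
    for col, user in enumerate(users):
        column = [row[col] for row in r1_mld]
        p = sum(column) / len(column)  # average score of this column
        for row_idx, score in enumerate(column):
            if score > p:
                new_edges.add((mirnas[row_idx], user))
    return new_edges


# Toy scores (assumed values), rows m1..m3, columns l1 and d1.
mirnas = ["m1", "m2", "m3"]
users = ["l1", "d1"]
r1_mld = [[0.0, 0.6],
          [0.2, 0.0],
          [0.7, 0.1]]
known_edges = {("m2", "l1")}
updated_edges = known_edges | recommend(r1_mld, mirnas, users)
print(sorted(updated_edges))
```

Because the updated network is formed as a set union, recommending a pair that is already known leaves the network unchanged, and only genuinely new edges are added.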

Construction of the prediction model CFNBC
The naïve Bayesian classifier is a simple probabilistic classifier based on a conditional independence assumption. Based on this probability model, the posterior probability can be described as follows:

$$p(C \mid F_1, F_2, \ldots, F_n) = \frac{p(C)\, p(F_1, F_2, \ldots, F_n \mid C)}{p(F_1, F_2, \ldots, F_n)} \qquad (10)$$

Here, C is a dependent class variable and $F_1, F_2, \ldots, F_n$ are the feature variables of class C.
Moreover, since each feature $F_i$ is conditionally independent of any other feature $F_j$ ($i \neq j$) given class C, the above formula (10) can also be expressed as follows:

$$p(C \mid F_1, F_2, \ldots, F_n) = \frac{p(C) \prod_{i=1}^{n} p(F_i \mid C)}{p(F_1, F_2, \ldots, F_n)} \qquad (11)$$

In our previous work, we proposed a probability model called NBCLDA based on the Naïve Bayesian classifier to predict potential lncRNA-disease associations [10]. However, in NBCLDA, there are circumstances in which no relevance score can be obtained for a certain pair of lncRNA and disease nodes, because the pair has no common neighbors owing to the scarce known associations. Hence, in order to overcome this drawback of our previous work, in this section we design a novel prediction model called CFNBC to infer potential associations between lncRNAs and diseases by adopting the item-based collaborative filtering algorithm on LMDN and applying the Naïve Bayesian classifier on LMDN′. In CFNBC, a given pair of lncRNA and disease nodes has two kinds of common neighboring miRNA nodes: the original common miRNA nodes and the recommended common miRNA nodes. In order to illustrate this more intuitively, an example is given in Fig. 7, in which the node $m_3$ is an original common neighboring miRNA node since it has known associations with both $l_2$ and $d_2$, while the nodes $m_4$ and $m_5$ are recommended common neighboring miRNA nodes since they do not have known associations with both $l_2$ and $d_2$. In particular, when applying the Naïve Bayesian classifier on LMDN′, for a given pair of lncRNA and disease nodes, we consider that their common neighboring miRNA nodes, including both the original and the recommended ones, are all conditionally independent of each other, since they are different nodes in LMDN′.
That is, for a given pair of lncRNA and disease nodes, it is assumed that all their common neighboring nodes will not interfere with each other in CFNBC.
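The distinction between original and recommended common neighbors can be sketched as a pair of set operations. The edge lists below are illustrative values chosen to mirror the roles of $m_3$, $m_4$ and $m_5$ in the example; they are not taken from the actual figure, and the function name `common_neighbours` is hypothetical.

```python
def common_neighbours(lncrna, disease, original_edges, updated_edges):
    """Split the common miRNA neighbors of a pair into (original, recommended).

    Edges are (miRNA, user) pairs, where a user is an lncRNA or a disease.
    """
    def nbrs(node, edges):
        return {m for m, u in edges if u == node}

    original = nbrs(lncrna, original_edges) & nbrs(disease, original_edges)
    all_common = nbrs(lncrna, updated_edges) & nbrs(disease, updated_edges)
    return original, all_common - original


# Illustrative edges (assumed): the original network, then the recommended additions.
original_edges = {("m3", "l2"), ("m3", "d2"), ("m4", "d2"), ("m5", "l2")}
updated_edges = original_edges | {("m4", "l2"), ("m5", "d2")}
orig_common, rec_common = common_neighbours("l2", "d2", original_edges, updated_edges)
print(sorted(orig_common), sorted(rec_common))
```

Here $m_3$ is linked to both nodes in the original network, whereas $m_4$ and $m_5$ become common neighbors only after the recommended edges are added.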
For each common neighboring miRNA node $m_{\delta-1}$ between $l_i$ and $d_j$, let $N_{l-1}$ and $N_{d-1}$ denote the numbers of lncRNAs and diseases associated with $m_{\delta-1}$ in LMDN′ respectively; then it is obvious that $N^+_{m_{\delta-1}} + N^-_{m_{\delta-1}} = N_{l-1} \times N_{d-1}$. Similarly, for each common neighboring miRNA node $m_{\delta-2}$ between $l_i$ and $d_j$, let $N_{l-2}$ and $N_{d-2}$ represent the numbers of lncRNAs and diseases associated with $m_{\delta-2}$ in LMDN′ respectively; then it is obvious that $N^+_{m_{\delta-2}} + N^-_{m_{\delta-2}} = N_{l-2} \times N_{d-2}$. Thereafter, the above formula (16) can be further modified as follows: Besides, since $N^+_{m_{\delta-1}}$ and $N^+_{m_{\delta-2}}$ may be zero, we introduce the Laplace calibration to guarantee that the value of $S(l_i, d_j)$ will not be zero. Hence, the above formula (16) can once again be modified as follows: Next, for any given lncRNA node and disease node, since the original common neighboring miRNA nodes between them are obtained from the known associations, while the recommended common neighboring miRNA nodes between them are obtained by our item-based collaborative filtering algorithm, it is reasonable to consider that the original common neighboring miRNA nodes deserve more credibility than the recommended common neighboring miRNA nodes. Hence, in order to make our prediction model work more effectively, we add a decay factor α in the range (0, 1) to the above formula (25). Thereafter, the formula (25) can be rewritten as follows: Additionally, it has been reported that the degree of common neighboring nodes plays a significant role in link prediction, and that common neighboring nodes with high degrees can improve the prediction accuracy [43]. Hence, we further add the Resource Allocation (RA) index [44] and a logarithmic function for standardization to the above formula (26).
Thereafter, for any given lncRNA node $l_i$ and disease node $d_j$ in LMDN′, we can obtain the probability that there may exist a potential association between them as follows: Here, $k_{m_{\delta-1}}$ and $k_{m_{\delta-2}}$ represent the degrees of $m_{\delta-1}$ and $m_{\delta-2}$ in LMDN′ respectively.
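The counting identity used above, in which the connected and disconnected lncRNA-disease pairs around a miRNA add up to the product of its lncRNA and disease neighbor counts, can be checked with a small sketch. It rests on an assumption about the meaning of the counts: $N^+$ is taken to be the number of (lncRNA, disease) pairs among the neighbors of the miRNA that have a known association, and $N^-$ the number that do not. The function name `pair_counts` and the toy edges are hypothetical, and the final scoring formula is not reproduced here.

```python
def pair_counts(mirna, mirna_lncrna_edges, mirna_disease_edges, known_ld):
    """Return (N_plus, N_minus, N_l, N_d) for one miRNA node."""
    lncs = {l for m, l in mirna_lncrna_edges if m == mirna}
    dis = {d for m, d in mirna_disease_edges if m == mirna}
    # Connected pairs among the neighbors of this miRNA (assumed meaning of N+).
    n_plus = sum(1 for l in lncs for d in dis if (l, d) in known_ld)
    # Every remaining neighbor pair is counted as disconnected (assumed meaning of N-).
    n_minus = len(lncs) * len(dis) - n_plus
    return n_plus, n_minus, len(lncs), len(dis)


# Toy network (assumed edges).
ml_edges = {("m1", "l1"), ("m1", "l2"), ("m2", "l2")}
md_edges = {("m1", "d1"), ("m1", "d2"), ("m1", "d3")}
ld_edges = {("l1", "d1"), ("l2", "d3"), ("l2", "d4")}
n_plus, n_minus, n_l, n_d = pair_counts("m1", ml_edges, md_edges, ld_edges)
print(n_plus, n_minus, n_l, n_d)
```

For the toy miRNA m1, two of the six neighbor pairs are connected, so the connected count can be small or even zero in sparse networks, which is the situation the Laplace calibration is introduced to handle.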