DSCMF: prediction of LncRNA-disease associations based on dual sparse collaborative matrix factorization

Background With the development of science and technology, there is increasing evidence of associations between lncRNAs and human diseases. Finding these associations will therefore have a large impact on the treatment and prevention of some diseases. However, identifying them experimentally is difficult and requires a great deal of time and effort, so effective computational methods for predicting lncRNA-disease associations (LDAs) are particularly important. Results In this paper, we propose a method based on dual sparse collaborative matrix factorization (DSCMF) to predict LDAs. DSCMF improves on the traditional collaborative matrix factorization method. To increase sparsity, the L2,1-norm is added to our method. At the same time, the Gaussian interaction profile kernel is added, which incorporates the network similarity of lncRNAs and diseases. Finally, the AUC value obtained by ten-fold cross-validation is used to evaluate the quality of our method. Conclusions The AUC value obtained by the DSCMF method is 0.8523. At the end of the paper, a simulation experiment is carried out, and the experimental results for prostate cancer, breast cancer, ovarian cancer and colorectal cancer are analyzed in detail. The DSCMF method is expected to be helpful for research on lncRNA-disease associations. The code is available at https://github.com/Ming-0113/DSCMF.

protein functions [1]. Many experiments have demonstrated that lncRNAs play an important role in processes such as epigenetic regulation, cell cycle control and cell differentiation regulation [2][3][4]. However, the current understanding of lncRNAs is still far from sufficient, and many aspects remain to be explored. Further research on lncRNAs is therefore needed, which will also contribute to the development of human biology.
There is increasing evidence that lncRNAs are closely linked to many human diseases, such as common cardiovascular diseases [5,6], diabetes [7], Alzheimer's disease [8] and some cancers. For example, the lncRNA MALAT1 is a transcript that is overexpressed in many cancers [9]. It is closely related to diseases such as lung cancer [10], renal cancer [11] and esophageal cancer [12]. Another example is GAS5, which is related to head and neck cancer [13], colon cancer [14], thyroid cancer [15], etc. Although some LDA databases have been established for research, the number of known LDAs in these databases is far from sufficient, and many unknown associations remain to be discovered. Therefore, an efficient and accurate method for LDA prediction is needed.
At present, many methods have been proposed for LDA prediction [16]. For example, Sun et al. proposed a computational model that uses random walk with restart on the lncRNA functional similarity network [17]. A lncRNA-lncRNA functional similarity network was constructed, and the relationship between phenotypically similar diseases and functionally similar lncRNAs was used to predict novel associations. Experiments showed that this method is feasible. Chen et al. improved on the random walk with restart model by combining the disease semantic similarity matrix with the lncRNA expression similarity matrix and by setting the initial probability vector of the random walk [18]. As a result, this model can be applied to diseases without known related lncRNAs. Chen et al. proposed a Laplacian regularized least squares method to predict novel associations based on the assumption that similar diseases may be related to functionally similar lncRNAs [19]. This method was developed under the framework of semi-supervised learning and can be used to rank the candidate disease-lncRNA pairs for all diseases. Chen proposed a KATZ measure model to predict novel LDAs by combining lncRNA expression similarity and functional similarity, as well as disease semantic similarity and GIP kernel similarity [20]. This method can make predictions for diseases with no known associated lncRNAs and for lncRNAs with no known associated diseases. Ding et al. proposed combining the gene-disease association network with the lncRNA-disease association network into a lncRNA-disease-gene tripartite graph for prediction [21]. The advantage of this method is that it describes the heterogeneity of coding-non-coding genes-disease associations better than other methods. Ping et al. proposed a method of constructing a bipartite network to predict novel LDAs [22]. This method uses the known topology of the lncRNA-disease network to identify potential LDAs, and leave-one-out cross-validation was used to evaluate its performance. Zhao et al. proposed a method for predicting novel LDAs without relying on any known lncRNA-disease association [23]. This method is based on a distance correlation set that combines known lncRNA-miRNA associations and miRNA-disease associations to predict novel associations, and the results show that it is effective. Ou-Yang et al. proposed a method for predicting LDAs called the two-side sparse self-representation method [24]. The advantage of this approach is that it can adaptively learn the self-representations of lncRNAs and of diseases from the known LDAs, and that it also exploits the internal associations among diseases and among lncRNAs. Fu et al. proposed a matrix factorization model, which mainly decomposes the data matrices of heterogeneous data sources into low-rank matrices [25].
In this paper, an improved matrix factorization model is proposed to predict LDAs. This method is built on collaborative matrix factorization and incorporates the Gaussian interaction profile kernel. At the same time, the L2,1-norm is added to prevent over-fitting [26][27][28]. Since some associations may be missing from the data, which would reduce the accuracy of our predictions, we also add the weighted K nearest known neighbors (WKNKN) pre-processing step. Cross-validation is used to obtain the AUC value of this method. At the end of the paper, a simulation experiment is carried out. The results show that our method achieves a higher AUC value than the compared methods. The specific improvements of our approach are as follows:
• In the DSCMF method, the L2,1-norm is introduced to make A and B sparse, which reduces redundant data, improves the computational power of the model, improves the robustness of the algorithm, and reduces the influence of noise on the A and B matrices.
• Network similarity is added to the DSCMF method, that is, the lncRNA network similarity matrix and the disease network similarity matrix are added to our method.
In the second part of this paper, we show the experimental results of the DSCMF method. The third and fourth parts discuss and summarize the DSCMF method respectively, and put forward the next work plan. The specific algorithm and detailed formula of the DSCMF method can be seen in the fifth part of this article.

Human LncRNA-disease associations
The LncRNADisease database is a common database for studying lncRNA-disease associations [29]. This database contains 247 diseases, 369 lncRNAs and their associations. These associations were previously verified by 687 experiments [21]. After removing diseases without Disease Ontology entries (https://diseaseontology.org/) and lncRNAs without expression profiles in ArrayExpress (https://www.ebi.ac.uk/arrayexpress/), the data used in this paper comprise 178 diseases and 115 lncRNAs [30]. Finally, we get a dataset with 540 lncRNA-disease associations, as listed in Table 1. Y is the adjacency matrix of these associations. If the element of Y for lncRNA l(i) and disease d(j) is 1, the lncRNA is related to the disease; otherwise the element is 0, meaning that no association between them is known. Ten-fold cross-validation is applied in this paper, and the above dataset is used as the gold standard dataset for the experiments to predict novel LDAs.

Cross validation
Cross-validation is used as the evaluation method in our experiments, and DSCMF is compared with the previously proposed LRLSLDA [19], ncPred [31], TPGLDA [21] and NTSHMDA [32] methods. The experiments mainly use ten-fold cross-validation. To ensure the stability and reliability of the experimental results, each method is repeated 30 times. It should be noted that some unknown associations may be lost. To avoid this, the WKNKN pre-processing step is applied in our method. Each experiment produces a corresponding AUC value [33], which is the indicator used to evaluate the quality of our method. The AUC value is the area under the receiver operating characteristic (ROC) curve. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), which are calculated as follows:
$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$
where TP and TN are the numbers of positive and negative samples that are identified correctly, and FP and FN are the numbers of samples that are incorrectly identified as positive and as negative, respectively.
The area under the ROC curve is not greater than 1, so the AUC value lies between 0 and 1. In practice, the AUC value of a useful method lies between 0.5 and 1. A value below 0.5 means that the method performs worse than random guessing and is not feasible.
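As an illustration of how TPR, FPR and the AUC value relate, the following is a minimal sketch rather than the evaluation code used in the paper. The function names `roc_curve_points` and `auc` and the toy labels and scores are hypothetical; the sketch assumes that both classes are present and that no two prediction scores are tied.

```python
import numpy as np

def roc_curve_points(labels, scores):
    """Return (fpr, tpr) obtained by lowering the decision threshold step by step.

    Assumes both classes are present and that no two scores are tied.
    """
    labels = np.asarray(labels, dtype=float)
    scores = np.asarray(scores, dtype=float)
    ranked = labels[np.argsort(-scores, kind="stable")]  # best score first
    tp = np.cumsum(ranked)        # true positives among the top-n pairs
    fp = np.cumsum(1.0 - ranked)  # false positives among the top-n pairs
    tpr = np.concatenate(([0.0], tp / tp[-1]))  # TP / (TP + FN)
    fpr = np.concatenate(([0.0], fp / fp[-1]))  # FP / (FP + TN)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy example: 1 = known association, 0 = no known association.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
fpr, tpr = roc_curve_points(labels, scores)
print(auc(fpr, tpr))  # 0.75: three of the four positive/negative pairs are ranked correctly
```

In a ten-fold setting, such a computation would be applied to the scores of the held-out pairs of each fold.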

Comparison with other methods
The experimental results of the LRLSLDA, ncPred, TPGLDA, NTSHMDA and DSCMF methods are listed in Table 2, where the highest AUC value and the corresponding method are shown in italics. The experimental results show that the DSCMF method has the highest AUC value, followed by the NTSHMDA method, and that our method is 5.85% higher than NTSHMDA. The LRLSLDA method has the lowest AUC value, which is 18.98% lower than that of our method. A more intuitive comparison of the AUC values of the various methods can be found in Fig. 1.
The above results show that the DSCMF method outperforms the previous methods on this dataset, which makes it more suitable for the prediction of LDAs. First, the DSCMF method adds a GIP kernel to the original CMF method, thereby bringing the lncRNA network similarity matrix and the disease network similarity matrix into the original method. Second, it adds the L2,1-norm, which increases the sparsity. These two changes account for the advantage of DSCMF over the other methods.

Sensitivity analysis from WKNKN
In the course of the experiment, some unknown associations that have an important influence on our prediction may be lost. To avoid this negative effect on the experimental results, the WKNKN pre-processing step is introduced in the DSCMF method. The parameter settings of this step also affect the experimental results: different parameters may change the AUC value, so the choice of parameters is particularly important. Two parameters have to be chosen. One is K, the number of nearest known neighbors, and the other is the decay parameter P. When K is set to 5 and P is set to 0.7, the AUC value tends to be stable. Figures 2 and 3 show the effect of the two parameters K and P on the AUC value, respectively.

Robust analysis of DSCMF
In this paper, we add the L2,1-norm, which can improve the robustness of our algorithm. To demonstrate the ability of the DSCMF method to learn the subspace, that is, its strong resistance to interference when recovering data, the DSCMF method is applied to a synthetic dataset composed of 200 two-dimensional data points, all of which are distributed in a one-dimensional subspace, i.e. y = x, where x and y are the coordinates of a data point. In addition, we also apply the original CMF method to this synthetic dataset for comparison with our method. Specifically, different numbers of noise points are added to the synthetic dataset to compare the robustness of the CMF and DSCMF methods. Figure 4 shows the data distribution after adding one noise point. It can be seen that both the CMF and DSCMF methods remain relatively stable. Figures 5, 6, and 7 show the data distribution with 30, 60, and 90 noise points, respectively. As the number of noise points increases, the DSCMF method remains essentially stable and is largely unaffected by the noise points, whereas the CMF method is more strongly affected. This indicates that the DSCMF method increases the robustness.
Fig. 1 The performance of the LRLSLDA, ncPred, TPGLDA, NTSHMDA and DSCMF methods is compared using the ROC curves and AUC values obtained by ten-fold cross-validation. It can be seen that the DSCMF method has the best performance

Case study
In this section, a simulation experiment is performed to predict some novel LDAs. For the predicted results, four common diseases are selected for study: prostate cancer, breast cancer, ovarian cancer, and colorectal cancer. The experimental procedure is as follows. For each disease, the scores in the predicted score matrix are sorted from high to low. Then the lncRNAs with the highest scores are selected for analysis and verified against the databases LncRNADisease and Lnc2cancer.
The first disease is prostate cancer. Prostate cancer is an epithelial malignancy of the prostate that is closely related to genetic factors. For more detailed information on prostate cancer, please visit https://www.omim.org/entry/176807. In the original gold standard dataset, 13 lncRNAs have been shown to be associated with prostate cancer. The top 20 lncRNAs in the prediction matrix are extracted and analyzed. It is found that 12 of the original 13 lncRNAs that have been shown to be associated with prostate cancer are predicted. These 12 lncRNAs are indicated in italics in Table 3. Among the remaining 8 lncRNAs, three lncRNAs, TUG1, IGF2-AS and CDKN2B-AS1, are found in the database LncRNADisease, and they are all associated with prostate cancer. Their PMIDs are 26975529 [34], 19767753 [35] and 23660942 [36], respectively. XIST is confirmed to be associated with prostate cancer in the database Lnc2cancer, and its PMID is 29212233 [37]. The lncRNA PTENP1 is found to be associated with prostate cancer in both LncRNADisease and Lnc2cancer. The corresponding PMIDs are 24373479 [38] and 20577206 [39], respectively. The specific information is shown in Table 3.
The second disease is breast cancer. Breast cancer has become a common disease that threatens women's physical and mental health. For more detailed information about breast cancer, please visit https://www.omim.org/entry/114480. In the gold standard dataset of the experiment, there are 20 lncRNAs related to breast cancer. Examining the top 30 lncRNAs predicted in the simulation experiment, we find that 17 of them are confirmed in the gold standard dataset. These 17 lncRNAs are indicated in italics in Table 4. Two of the remaining 13, CCAT1 and TUG1, are confirmed in the LncRNADisease database. Their PMIDs are 26464701 [40] and 27791993 [41]. Three lncRNAs, PTENP1, SNHG16 and TUSC7, are confirmed to be associated with breast cancer in the Lnc2cancer database. The PMIDs of these three lncRNAs are 29085464 [42], 28232182 [43], and 23558749 [44], respectively. The lncRNA KCNQ1OT1 is confirmed to be associated with breast cancer in both the LncRNADisease and Lnc2cancer databases. The remaining seven lncRNAs are not confirmed by the databases to be associated with breast cancer. The specific experimental results are listed in Table 4. For example, in the case of lncRNA CCAT1, previous studies have demonstrated that CCAT1 is overexpressed relative to normal tissue.
The third disease is ovarian cancer. Ovarian cancer is a common disease of the female genital organs. Its incidence is second only to cervical cancer and endometrial cancer, posing a serious threat to women's health. For more detailed information on ovarian cancer, please visit https://www.omim.org/entry/167000. In the gold standard dataset, 12 lncRNAs are known to be associated with ovarian cancer, so the top 22 lncRNAs in the prediction matrix are selected for analysis and the results are listed in Table 5. We successfully predict 11 lncRNAs that are confirmed in the gold standard dataset. These 11 lncRNAs are shown in italics in Table 5. Three lncRNAs, GAS5, NEAT1, and CCAT2, are confirmed in the LncRNADisease database, and their PMIDs are 27779700 [45], 27608895 [46], and 27558961 [47], respectively. MEG3, SNHG16, MNX1-AS1, and ZFAS1 are confirmed to be associated with ovarian cancer in the Lnc2cancer database, and their PMIDs are 28175963 [48], 29461589 [49], 29271994 [50], and 28154416 [51], respectively. The remaining four lncRNAs are not confirmed to have any association with ovarian cancer in either LncRNADisease or Lnc2cancer.
The last disease is colorectal cancer. Colorectal cancer is a common malignant tumor in humans. China is a low-incidence area for colorectal cancer, but in recent years the incidence of colorectal cancer has increased in different regions. The original gold standard dataset contains 21 lncRNAs that are associated with colorectal cancer. Of these, 20 association pairs are successfully predicted by the DSCMF algorithm, and they are shown in italics in Table 6. The remaining 10 predicted lncRNAs are checked against the two databases LncRNADisease and Lnc2cancer to determine whether they are associated with colorectal cancer. Among them, 4 lncRNAs are confirmed to be associated with colorectal cancer in the LncRNADisease database. These 4 lncRNAs are SPRY4-IT1, CDKN2B-AS1, TUG1 and ZFAS1. Their PMIDs are 27621655 [52], 27286457 [53], 27421138 [54] and 27461828 [55], respectively. The other six lncRNAs are not confirmed to be associated with colorectal cancer, and further research is needed. Specific information on lncRNAs and colorectal cancer is shown in Table 6.

Discussion
Numerous studies have shown that lncRNAs are associated with certain human diseases, so effective methods for predicting these associations are an important contribution. However, finding LDAs experimentally takes a long time and consumes a lot of effort, and new ways of predicting LDAs are therefore of great help to this research. The DSCMF method introduced in this paper adds the L2,1-norm to the traditional collaborative matrix factorization method to increase the sparsity and, at the same time, uses the GIP kernel to add network similarity. The cross-validation results also indicate that our method is suitable for LDA prediction. Our method is not without disadvantages, however. The DSCMF method requires a long running time, so shortening its running time is an important problem that we still need to solve.

Conclusion
Ten-fold cross-validation is used in the experimental part of this paper. The WKNKN pre-processing method is also used to handle unknown interactions, which improves the accuracy of prediction.
In future work, we will continue this line of research, address the shortcomings of the present study, and look for new prediction methods. At the same time, we will try to apply our method to more datasets, such as miRNA-disease association datasets, to evaluate the performance of our method more fully. We hope that the DSCMF method will be helpful for predicting lncRNA-disease associations, and we will remain committed to this research.

LncRNA expression similarity
ArrayExpress contains more than 60,000 expression profiles of 16 human tissues, and these expression profiles are generated by RNA-Seq technology. The lncRNA expression profiles used in this paper are obtained from ArrayExpress. The correlation between each pair of lncRNA expression profiles is measured by the Spearman correlation coefficient, which is taken as the lncRNA expression similarity. The matrix $S_l$ denotes the lncRNA expression similarity matrix, and the expression similarity between lncRNA $l_i$ and lncRNA $l_j$ is written as $S_l(l_i, l_j)$.
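The Spearman-based expression similarity can be sketched as follows. This is an illustrative example and not the authors' code: the helper names and the toy profiles are hypothetical, and the sketch assumes that each row of `profiles` holds the expression values of one lncRNA across the same set of tissues and that no profile is constant (a constant row would give an undefined correlation).

```python
import numpy as np

def average_ranks(values):
    """Rank a 1-D array from 1 to n, giving tied values their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]

def expression_similarity(profiles):
    """Spearman correlation between every pair of rows (Pearson on the ranks)."""
    ranked = np.vstack([average_ranks(row) for row in profiles])
    return np.corrcoef(ranked)

# Three hypothetical lncRNAs measured in four tissues.
profiles = np.array([[1.0, 2.0, 3.0, 4.0],
                     [2.0, 4.0, 6.0, 9.0],
                     [4.0, 3.0, 2.0, 1.0]])
S_l = expression_similarity(profiles)
print(np.round(S_l, 3))
```

The first two profiles rise together and therefore obtain a similarity of 1, while the third is ordered in reverse and obtains a negative coefficient; how negative coefficients are treated is not stated in the text.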

Disease semantic similarity
Disease semantic similarity was first used in ncRNA-disease association studies, where its effectiveness was demonstrated [56]. In this paper, a directed acyclic graph (DAG) is used to describe the semantic relationships between diseases. For disease $D_d$, its directed acyclic graph can be expressed as $DAG(D_d) = (D_d, T(D_d), E(D_d))$, where $T(D_d)$ is the set of nodes and $E(D_d)$ is the set of edges between nodes. The semantic contribution of a node $t$ in $DAG(D_d)$ to disease $D_d$ is calculated as follows:
$$D_{D_d}(t) = \begin{cases} 1, & t = D_d \\ \max\{\Delta \cdot D_{D_d}(t') \mid t' \in \text{children of } t\}, & t \neq D_d \end{cases}$$
where $\Delta$ represents a semantic contribution factor. The disease semantic similarity matrix is denoted by $S_d$. To determine the semantic similarity between two diseases $d_i$ and $d_j$, it is necessary to look at the parts of their DAGs that they have in common: the larger the common part, the greater the semantic similarity. The specific calculation formula is as follows:
$$S_d(d_i, d_j) = \frac{\sum_{t \in T(d_i) \cap T(d_j)} \left( D_{d_i}(t) + D_{d_j}(t) \right)}{\sum_{t \in T(d_i)} D_{d_i}(t) + \sum_{t \in T(d_j)} D_{d_j}(t)}$$
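The DAG-based similarity can be illustrated on a toy hierarchy. The sketch below is an assumption-laden example, not the authors' implementation: it follows the commonly used formulation in which a disease contributes 1 to itself, each ancestor receives the largest contribution of its children multiplied by a semantic contribution factor, and the similarity is the shared contribution divided by the total contribution. The factor value 0.5, the function names and the disease labels are all hypothetical.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Contribution of the disease itself and of each of its ancestors.

    `parents` maps a term to the list of its direct parent terms.
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = delta * contrib[node]
            if value > contrib.get(parent, 0.0):  # keep the maximum over children
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def semantic_similarity(d_i, d_j, parents, delta=0.5):
    """Shared semantic contribution divided by the total semantic value."""
    c_i = semantic_contributions(d_i, parents, delta)
    c_j = semantic_contributions(d_j, parents, delta)
    shared = c_i.keys() & c_j.keys()
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    return numerator / (sum(c_i.values()) + sum(c_j.values()))

# Toy hierarchy: two sibling diseases under one parent, which sits under a root.
parents = {"disease_a": ["parent"], "disease_b": ["parent"], "parent": ["root"]}
print(semantic_similarity("disease_a", "disease_b", parents))  # 3/7, about 0.4286
print(semantic_similarity("disease_a", "disease_a", parents))  # 1.0
```

The two siblings share the parent and root terms but not each other, which is why their similarity is below 1.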

Weighted K nearest known neighbors
To prevent the loss of some unknown associations and to make our predictions more accurate, the WKNKN pre-processing step is added to the DSCMF method. In the lncRNA-disease association matrix Y, if a lncRNA is associated with a disease, the corresponding value in the matrix is 1; otherwise it is 0. The role of the pre-processing is to replace these binary values with values between 0 and 1, forming a new matrix that increases the accuracy of the prediction.
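The pre-processing idea can be sketched as follows. The text does not give the WKNKN formulas, so this example follows a commonly cited formulation and should be read as an assumption: each row (lncRNA) and each column (disease) borrows association evidence from its K most similar neighbors, the i-th neighbor is down-weighted by P raised to the power i - 1, the two estimates are averaged, and known associations are kept at 1. The function names and the toy matrices are hypothetical.

```python
import numpy as np

def _neighbor_profiles(Y, S, K, p):
    """For each row of Y, a decayed, similarity-weighted mean of its K nearest rows."""
    n = Y.shape[0]
    k = min(K, n - 1)
    decay = p ** np.arange(k)              # weights 1, p, p^2, ...
    out = np.zeros_like(Y)
    for i in range(n):
        sims = S[i].copy()
        sims[i] = -np.inf                  # a row is not its own neighbor
        nearest = np.argsort(-sims, kind="stable")[:k]
        total = S[i, nearest].sum()
        if total > 0:
            out[i] = (decay * S[i, nearest]) @ Y[nearest] / total
    return out

def wknkn(Y, S_l, S_d, K=5, p=0.7):
    """Replace zeros in Y by neighbor-based likelihood values in [0, 1]."""
    Y = np.asarray(Y, dtype=float)
    S_l = np.asarray(S_l, dtype=float)
    S_d = np.asarray(S_d, dtype=float)
    Y_l = _neighbor_profiles(Y, S_l, K, p)        # evidence from similar lncRNAs
    Y_d = _neighbor_profiles(Y.T, S_d, K, p).T    # evidence from similar diseases
    return np.maximum(Y, (Y_l + Y_d) / 2.0)       # known associations stay at 1

Y = np.array([[1, 0], [0, 0], [0, 1]])
S_l = np.array([[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]])
S_d = np.array([[1.0, 0.5], [0.5, 1.0]])
print(wknkn(Y, S_l, S_d, K=1, p=0.7))
```

With the defaults K = 5 and P = 0.7, the call would use the parameter values reported as stable in the sensitivity analysis.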

Gaussian interaction profile kernel similarity
The Gaussian interaction profile (GIP) kernel similarity used in this method is based on the assumption that lncRNAs with similar patterns of association and non-association with diseases in the lncRNA-disease network are likely to show similar associations with a new disease [57]. The GIP kernel similarity is used in this method to represent the network topological structure of LDAs. The topological similarity of lncRNAs $l_i$, $l_j$ and of diseases $d_i$, $d_j$ is represented by the following formulas:
$$GIP_l(l_i, l_j) = \exp\left(-\gamma \|Y(l_i) - Y(l_j)\|^2\right)$$
$$GIP_d(d_i, d_j) = \exp\left(-\gamma \|Y(d_i) - Y(d_j)\|^2\right)$$
In the above two formulas, $\gamma$ is the parameter that adjusts the kernel bandwidth. $Y(l_i)$ is a binary vector, the i-th row of Y, which represents the interaction profile of the associations between $l_i$ and each disease; $Y(d_i)$ is the corresponding column of Y for disease $d_i$. Next, the lncRNA expression similarity matrix and the lncRNA network similarity matrix are combined by using formula (8). Similarly, the disease semantic similarity matrix and the disease network similarity matrix are combined by using formula (9).
$$K_l = \alpha S_l + (1 - \alpha) GIP_l \qquad (8)$$
$$K_d = \alpha S_d + (1 - \alpha) GIP_d \qquad (9)$$
In the above two formulas, $\alpha \in [0, 1]$ is an adjustable parameter, $K_l$ is the final matrix that combines lncRNA expression similarity with network similarity, and $K_d$ is the final matrix that combines disease semantic similarity with network similarity.
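A minimal sketch of the GIP kernel and of the convex combination with a second similarity matrix is given below. It assumes the standard Gaussian form, exp(-gamma * squared distance between interaction profiles), and it assumes that alpha weights the expression or semantic similarity while 1 - alpha weights the network similarity; the function names, the value gamma = 1.0 and the toy matrix are hypothetical.

```python
import numpy as np

def gip_kernel(profiles, gamma=1.0):
    """Gaussian kernel on the rows of a binary interaction-profile matrix."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))  # clip tiny negative round-off

def combine(similarity, network, alpha=0.5):
    """Convex combination of a biological similarity and a network similarity."""
    return alpha * similarity + (1.0 - alpha) * network

# Rows are lncRNAs, columns are diseases.
Y = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 0]])
gip_l = gip_kernel(Y)      # lncRNA network similarity, from the rows of Y
gip_d = gip_kernel(Y.T)    # disease network similarity, from the columns of Y
K_l = combine(np.eye(3), gip_l, alpha=0.5)  # identity used as a stand-in similarity
print(np.round(gip_l, 4))
```

The first two lncRNAs have identical interaction profiles and therefore a network similarity of 1, while the third differs from them in all three diseases and obtains exp(-3).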

DSCMF
Collaborative filtering is introduced in the traditional CMF method [58], which can accurately predict some novel LDAs. The objective function of the traditional CMF is as follows:
$$\min_{A,B} \|Y - AB^T\|_F^2 + \lambda_h \left( \|A\|_F^2 + \|B\|_F^2 \right) + \lambda_l \|S_l - AA^T\|_F^2 + \lambda_d \|S_d - BB^T\|_F^2$$
where $\|\cdot\|_F$ is the Frobenius norm, and $\lambda_h$, $\lambda_l$ and $\lambda_d$ are positive parameters.
Then, $S_l$ in the traditional collaborative matrix factorization method is replaced by $K_l$. Similarly, $S_d$ is replaced by $K_d$, thereby adding the network similarity of lncRNAs and diseases. The improved formula is as follows:
$$\min_{A,B} \|Y - AB^T\|_F^2 + \lambda_h \left( \|A\|_F^2 + \|B\|_F^2 \right) + \lambda_l \|K_l - AA^T\|_F^2 + \lambda_d \|K_d - BB^T\|_F^2$$
At the same time, to increase the sparsity, the method in this paper adds the L2,1-norm to the matrices A and B, respectively. The final objective function can be written as:
$$\min_{A,B} \|Y - AB^T\|_F^2 + \lambda_h \left( \|A\|_F^2 + \|B\|_F^2 \right) + \lambda_h \|A\|_{2,1} + \lambda_h \|B\|_{2,1} + \lambda_l \|K_l - AA^T\|_F^2 + \lambda_d \|K_d - BB^T\|_F^2$$
The matrices A and B in this formula are the two latent feature matrices produced by the decomposition of the matrix Y. Here $\|A\|_F^2 = \mathrm{Tr}(A^T A) = \mathrm{Tr}(A A^T)$, $\|A\|_{2,1} = \mathrm{Tr}(A^T D_1 A)$ and $\|B\|_{2,1} = \mathrm{Tr}(B^T D_2 B)$. $D_1$ and $D_2$ are two diagonal matrices, whose j-th diagonal elements are denoted as $d^1_{jj}$ and $d^2_{jj}$, respectively. The first term constructs an approximate model, whose purpose is to find the matrices A and B. The second part adds the Tikhonov regularization terms to prevent overfitting. The third part adds the L2,1-norm to matrix A. The fourth part adds the L2,1-norm to matrix B. The last two parts are the collaborative regularization terms for the lncRNA similarity matrix and the disease similarity matrix. A detailed flow chart of the DSCMF method is shown in Fig. 8.
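To make the individual terms concrete, the following sketch evaluates an objective of this form for given factor matrices. It is illustrative only: the function names are hypothetical, and because the weight on the L2,1 terms is not recoverable from the text, it is exposed as a separate argument `lam_s` that defaults to the Tikhonov weight, which is an assumption. The last lines check numerically that the trace form of the L2,1-norm holds when the diagonal entries are the reciprocals of the row norms, which is one choice consistent with the stated identity.

```python
import numpy as np

def l21_norm(M):
    """Sum of the Euclidean norms of the rows of M."""
    return float(np.linalg.norm(M, axis=1).sum())

def dscmf_objective(Y, A, B, K_l, K_d, lam_h, lam_l, lam_d, lam_s=None):
    """Value of a DSCMF-style objective; lam_s weights the L2,1 terms (assumed = lam_h)."""
    if lam_s is None:
        lam_s = lam_h
    fit = np.linalg.norm(Y - A @ B.T, "fro") ** 2                      # approximation term
    tikhonov = lam_h * (np.linalg.norm(A, "fro") ** 2 + np.linalg.norm(B, "fro") ** 2)
    sparsity = lam_s * (l21_norm(A) + l21_norm(B))                     # row sparsity of A and B
    collaborative = (lam_l * np.linalg.norm(K_l - A @ A.T, "fro") ** 2
                     + lam_d * np.linalg.norm(K_d - B @ B.T, "fro") ** 2)
    return float(fit + tikhonov + sparsity + collaborative)

A = np.array([[3.0, 4.0], [1.0, 0.0]])
B = np.array([[1.0, 0.0], [0.0, 2.0]])
# With Y, K_l and K_d built from A and B, only the regularization terms remain.
value = dscmf_objective(A @ B.T, A, B, A @ A.T, B @ B.T, lam_h=0.5, lam_l=1e-2, lam_d=1e-2)
print(value)

# Trace form of the L2,1-norm with diagonal entries 1 / (row norm).
D1 = np.diag(1.0 / np.linalg.norm(A, axis=1))
print(np.trace(A.T @ D1 @ A), l21_norm(A))
```

Rows with zero norm would need a small constant added before taking the reciprocal; the sketch leaves this out.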

Optimization and algorithm of DSCMF method
In this paper, we use the least squares method to update A and B in order to optimize the proposed objective function. In the first step, the values of A and B need to be initialized, for which the singular value decomposition (SVD) is used. The initial formula is:
$$[U, S, V] = \mathrm{SVD}(Y, k), \qquad A = U S_k^{1/2}, \qquad B = V S_k^{1/2}$$
where $S_k$ represents a diagonal matrix that contains the k largest singular values. Next, the partial derivatives of the objective function with respect to A and B are computed and set to zero, which yields the iteration formulas for A and B. Updating is stopped once A and B converge. The parameters $\lambda_h$, $\lambda_l$ and $\lambda_d$ are the best combination selected automatically from $\lambda_h \in \{2^{-2}, 2^{-1}, 2^{0}, 2^{1}\}$ and $\lambda_l, \lambda_d \in \{0, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$. The DSCMF method consists of two parts. First, the matrix Y is decomposed into A and B, and the L2,1-norm is added to A and B, respectively. Second, the GIP kernel is incorporated into the objective function of the CMF method. The convergence curve is shown in Fig. 9. The algorithm tends to converge in about 10 iterations, which shows that our algorithm converges quickly.
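The SVD-based initialization and the parameter grid can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the usual CMF-style initialization in which both factors are scaled by the square root of the k largest singular values, so that their product is the best rank-k approximation of Y. The function name `svd_init`, the rank and the toy matrix are hypothetical, and the update loop itself is not reproduced because its formulas are not given in the text.

```python
import itertools
import numpy as np

def svd_init(Y, k):
    """Initial latent factors A (lncRNAs x k) and B (diseases x k) from a truncated SVD."""
    U, s, Vt = np.linalg.svd(np.asarray(Y, dtype=float), full_matrices=False)
    root = np.sqrt(s[:k])          # square roots of the k largest singular values
    A = U[:, :k] * root            # scales column j of U by root[j]
    B = Vt[:k].T * root
    return A, B

# Rank-1 toy association matrix: the rank-1 initialization reproduces it exactly.
Y = np.outer([1.0, 2.0], [1.0, 0.0, 1.0])
A, B = svd_init(Y, k=1)
print(np.allclose(A @ B.T, Y))  # True

# Candidate values listed in the text; each combination would be scored by cross-validation.
lam_h_grid = [2.0 ** e for e in (-2, -1, 0, 1)]
lam_ld_grid = [0.0, 1e-4, 1e-3, 1e-2, 1e-1]
grid = list(itertools.product(lam_h_grid, lam_ld_grid, lam_ld_grid))
print(len(grid))  # 100 combinations
```

Starting from the truncated SVD rather than from random factors gives the alternating updates a starting point that already approximates Y.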