A novel target convergence set based random walk with restart for prediction of potential LncRNA-disease associations
BMC Bioinformatics volume 20, Article number: 626 (2019)
Abstract
Background
In recent years, lncRNAs (long non-coding RNAs) have been shown to be closely related to the occurrence and development of many serious diseases that threaten human health. However, most lncRNA-disease associations have not yet been found because traditional biological experiments are costly and time-consuming. Hence, there is an urgent need for efficient and reasonable computational models that predict potential associations between lncRNAs and diseases.
Results
In this manuscript, a novel prediction model called TCSRWRLD is proposed to predict potential lncRNA-disease associations based on an improved random walk with restart. In TCSRWRLD, a heterogeneous lncRNA-disease network is first constructed by combining the integrated similarity of lncRNAs and the integrated similarity of diseases. Then, for each lncRNA/disease node in the newly constructed network, a node set called the TCS (Target Convergence Set) is established, consisting of the top 100 disease/lncRNA nodes with minimum average network distance to the disease/lncRNA nodes that have known associations with that node. Finally, an improved random walk with restart is run on the heterogeneous lncRNA-disease network to infer potential lncRNA-disease associations. The major contribution of this manuscript is the concept of the TCS, which speeds up the convergence of TCSRWRLD: the walker can stop its random walk once the walking probability vectors it obtains at the nodes in the TCS, rather than at all nodes in the whole network, have reached a stable state. Simulation results show that TCSRWRLD achieves an AUC of 0.8712 in Leave-One-Out Cross Validation (LOOCV), which outperforms previous state-of-the-art results. Moreover, case studies of lung cancer and leukemia also demonstrate the satisfactory prediction performance of TCSRWRLD.
Conclusions
Both the comparative results and the case studies demonstrate that TCSRWRLD achieves excellent performance in predicting potential lncRNA-disease associations, which suggests that TCSRWRLD may be a useful addition to bioinformatics research in the future.
Background
For many years, the genetic information of organisms was considered to be stored only in protein-coding genes, and RNAs were thought to be mere intermediaries in the process by which DNA encodes proteins [1, 2]. However, recent studies have shown that protein-coding genes account for only a small part (less than 2%) of the human genome; more than 98% of the human genome does not encode proteins and yields a large amount of ncRNAs (non-coding RNAs) [3, 4]. In addition, as the complexity of biological organisms increases, so does the importance of ncRNAs in biological processes [5, 6]. Generally, ncRNAs can be divided into two major categories, small ncRNAs and long ncRNAs (lncRNAs), according to the length of their transcripts: small ncRNAs consist of fewer than 200 nucleotides and include microRNAs, transfer RNAs and others, whereas lncRNAs consist of more than 200 nucleotides [7,8,9]. In 1990, the first two lncRNAs, H19 and Xist, were discovered through gene mapping. Since the gene mapping approach is extremely time-consuming and labor-intensive, research on lncRNAs progressed relatively slowly for a long time [10, 11]. In recent years, with the rapid development of high-throughput gene sequencing technologies, more and more lncRNAs have been found in eukaryotes and other species [12, 13]. Moreover, studies have shown that lncRNAs play important roles in various physiological processes such as cell differentiation and death, regulation of epigenetic shape and so on [8, 14, 15]. At the same time, growing evidence has further illustrated that lncRNAs are closely linked to diseases that pose a serious threat to human health [16,17,18], which means that lncRNAs can be used as potential biomarkers in disease treatment in the future [19].
With the discovery of a large number of new lncRNAs, many lncRNA-related databases such as lncRNAdisease [20], lncRNAdb [21], NONCODE [22] and Lnc2Cancer [23] have been established. However, the number of known lncRNA-disease associations in these databases is still very limited because traditional biological experiments are costly and time-consuming. Thus, it is meaningful to develop mathematical models that predict potential lncRNA-disease associations quickly and at scale. Based on the assumption that similar diseases tend to be associated with similar lncRNAs [24, 25], a good number of computational models for inferring potential lncRNA-disease associations have been proposed. For instance, Chen et al. proposed a computational model called LRLSLDA [26] that predicts potential lncRNA-disease associations using Laplacian regularized least squares. Ping and Wang et al. constructed a prediction model that extracts feature information from bipartite interactive networks [27]. Zhao and Wang et al. developed a computational model based on the Distance Correlation Set to uncover potential lncRNA-disease associations by integrating known associations between three kinds of nodes (disease nodes, miRNA nodes and lncRNA nodes) into a complex network [28]. Chen et al. proposed a lncRNA-disease association prediction model based on a heterogeneous network by considering the influence of path length between nodes on node similarity [29,30,31]. More recently, a network traversal method called RWR (Random Walk with Restart) has emerged in computational biology, including the prediction of potential miRNA-disease associations [32, 33], drug-target associations [34] and lncRNA-disease associations [35,36,37].
Inspired by the ideas in the state-of-the-art literature above, in this paper a computational model called TCSRWRLD is proposed to discover potential lncRNA-disease associations. In TCSRWRLD, a heterogeneous network is first constructed by combining known lncRNA-disease associations with the lncRNA integrated similarity and the disease integrated similarity, which overcomes a drawback of traditional RWR-based approaches: they cannot start the walking process when there are no known lncRNA-disease associations. Then, each node in the heterogeneous network establishes its own TCS according to network distance, which reflects the specificity of different nodes in the walking process and makes the prediction more accurate and less time-consuming. Moreover, for a given walker, once its TCS has reached the final convergence state there may still be nodes that are not included in its TCS but are actually associated with it; to ensure that nothing is omitted from the prediction results, each node in the heterogeneous network also establishes its own GS. Finally, to evaluate the prediction performance of TCSRWRLD, cross validation is implemented based on known lncRNA-disease associations downloaded from the lncRNAdisease database (2017 version). As a result, TCSRWRLD achieves AUCs of 0.8323, 0.8597, 0.8665 and 0.8712 under the frameworks of 2-fold CV, 5-fold CV, 10-fold CV and LOOCV respectively. In addition, case studies show that 5 and 7 of the top 10 predicted lncRNAs have been confirmed by recent evidence to be associated with leukemia and lung cancer respectively, which also demonstrates that TCSRWRLD has excellent prediction performance.
Results
To verify the performance of TCSRWRLD in predicting potential lncRNA-disease associations, LOOCV, 2-fold CV, 5-fold CV and 10-fold CV were implemented on TCSRWRLD. Then, based on the 2017 version of the dataset downloaded from the lncRNADisease database, we obtained the Precision-Recall curve (PR curve) of TCSRWRLD. In addition, based on the 2017 version of the dataset downloaded from the lncRNADisease database and the 2016 version of the dataset downloaded from the lnc2Cancer database, we compared TCSRWRLD with state-of-the-art prediction models, namely KATZLDA, PMFILDA [38] and Ping’s model. After that, we analyzed the influence of key parameters on the prediction performance of TCSRWRLD. Finally, case studies of leukemia and lung cancer were performed to validate the feasibility of TCSRWRLD.
Cross validation
In this section, the ROC curve (Receiver Operating Characteristic) and the AUC score (Area Under the ROC Curve) are adopted to measure the performance of TCSRWRLD in different cross validations. Here, TPR (True Positive Rate, or Sensitivity) is the percentage of known (positive) lncRNA-disease associations with scores higher than a given score cutoff, and FPR (False Positive Rate, or 1-Specificity) is the percentage of negative candidate lncRNA-disease associations with scores higher than the same cutoff. ROC curves are obtained by connecting the corresponding pairs of TPR and FPR as the cutoff varies. As illustrated in Fig. 1, simulation results show that TCSRWRLD achieves AUCs of 0.8323, 0.8597, 0.8665 and 0.8712 in the frameworks of 2-fold CV, 5-fold CV, 10-fold CV and LOOCV respectively, which implies that TCSRWRLD performs well in predicting potential lncRNA-disease associations.
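As a minimal sketch of how a ROC curve and its AUC can be computed from a ranked list of candidate associations, the following may help. The function names `roc_points` and `auc` are hypothetical, ties between scores are ignored for simplicity, and this is a generic illustration rather than the evaluation code used for TCSRWRLD.

```python
import numpy as np

def roc_points(scores, labels):
    """Return (fpr, tpr) arrays obtained by sweeping the cutoff over all scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = labels[order]
    tp = np.cumsum(hits)    # positives ranked at or above each cutoff
    fp = np.cumsum(~hits)   # negatives ranked at or above each cutoff
    tpr = np.concatenate(([0.0], tp / labels.sum()))
    fpr = np.concatenate(([0.0], fp / (~labels).sum()))
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

In a leave-one-out setting, one known association at a time would be hidden, scored together with the unverified candidates, and its rank would contribute one positive to `labels`.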
Moreover, to further estimate the prediction performance of TCSRWRLD, we also obtain its PR curve. Unlike the ROC curve, the PR curve plots precision, the ratio of true positives to all positive predictions, at every given recall rate, and the AUPR (Area Under the Precision-Recall curve) summarizes it. As illustrated in Fig. 2, simulation results show that TCSRWRLD achieves an AUPR of 0.5007.
Comparison with other related methods
The results above show that TCSRWRLD achieves satisfactory prediction performance. In this section, we compare TCSRWRLD with some classical prediction models to further demonstrate its performance. First, based on the 2017 version of the dataset downloaded from the lncRNAdisease database, we compare TCSRWRLD with the state-of-the-art models KATZLDA, PMFILDA and Ping’s model. As shown in Fig. 3, TCSRWRLD achieves an AUC of 0.8712 in LOOCV, which is superior to the AUCs of 0.8257, 0.8702 and 0.8346 achieved by KATZLDA, Ping’s model and PMFILDA in LOOCV respectively.
Moreover, to show that TCSRWRLD performs well on different data, we also adopt the 2016 version of the dataset downloaded from the lnc2Cancer database, which consists of 98 human cancers, 668 lncRNAs and 1103 confirmed associations between them, to compare TCSRWRLD with KATZLDA, PMFILDA and Ping’s model. As illustrated in Fig. 4, TCSRWRLD achieves an AUC of 0.8475 in LOOCV, which is superior to the AUCs of 0.8204 and 0.8374 achieved by KATZLDA and PMFILDA respectively, but inferior to the AUC of 0.8663 achieved by Ping’s model.
Analysis on effects of parameters
In TCSRWRLD, there are some key parameters such as \( {\gamma}_l^{\prime } \), \( {\gamma}_d^{\prime } \) and ∂. As for \( {\gamma}_l^{\prime } \) and \( {\gamma}_d^{\prime } \) in Equation (5) and Equation (11), it is already known that the model achieves its best performance when both are set to 1 [39]. Hence, to estimate the effect of the key parameter ∂ on the prediction performance of TCSRWRLD, we vary ∂ from 0.1 to 0.9 and use the AUC in LOOCV as the basis for parameter selection. As illustrated in Table 1, TCSRWRLD achieves its highest AUC in LOOCV when ∂ is set to 0.4. Moreover, the AUC remains stable across the different values of ∂, which means that TCSRWRLD is robust to the choice of ∂.
Case studies
Cancer is considered one of the most dangerous diseases to human health because it is hard to treat [40]. At present, the incidence of various cancers is high not only in developing countries, where medical development is relatively backward, but also in developed countries, where the medical level is already very high. Hence, to further evaluate the performance of TCSRWRLD, case studies of two dangerous cancers, lung cancer and leukemia, are carried out in this section. The incidence of lung cancer has remained high in recent years, and the number of lung cancer deaths per year is about 1.8 million, which is the highest of any cancer type. Moreover, the five-year survival rate after the diagnosis of lung cancer is only about 15%, which is much lower than that of other cancers [41]. Recently, growing evidence has shown that lncRNAs play crucial roles in the development and occurrence of lung cancer [42]. As illustrated in Table 2, when TCSRWRLD is used to predict lung cancer related lncRNAs, 7 of the top 10 predicted candidates have been confirmed by the latest experimental evidence. Additionally, leukemia, a blood-related cancer [43], has also been found to be closely related to a variety of lncRNAs in recent years. As illustrated in Table 2, when TCSRWRLD is used to predict leukemia related lncRNAs, 5 of the top 10 predicted candidates have been confirmed by recent experimental results as well. Thus, the case studies suggest that TCSRWRLD may have great value in predicting potential lncRNA-disease associations.
Discussion
Since verifying associations between lncRNAs and diseases through traditional biological experiments is very time-consuming and labor-intensive, establishing computational models to infer potential lncRNA-disease associations has become a hot topic in bioinformatics, as such models can help researchers gain a deeper understanding of diseases at the lncRNA level. In this manuscript, a novel prediction model called TCSRWRLD is proposed. A heterogeneous network is first constructed by combining the disease integrated similarity, the lncRNA integrated similarity and known lncRNA-disease associations, which allows TCSRWRLD to overcome a shortcoming of traditional RWR-based prediction models: the random walk process cannot be started when there are no known lncRNA-disease associations. Then, on the newly constructed heterogeneous network, a random walk based prediction model is designed based on the concepts of TCS and GS. In addition, based on the 2017 version of the dataset downloaded from the lncRNAdisease database, a variety of simulations have been implemented, and the results show that TCSRWRLD achieves AUCs of 0.8323, 0.8597, 0.8665 and 0.8712 under the frameworks of 2-fold CV, 5-fold CV, 10-fold CV and LOOCV respectively. The case studies of lung cancer and leukemia also show that TCSRWRLD is reliable in predicting potential lncRNA-disease associations. The current version of TCSRWRLD still has some shortcomings. For example, its prediction performance could be further improved if more known lncRNA-disease associations were added to the experimental datasets. In addition, a more accurate MeSH database would help us obtain more accurate disease semantic similarity scores, which are also very important for the calculation of lncRNA functional similarity. These problems will be the focus of our future research.
Conclusion
In this paper, the main contributions are as follows: (1) A heterogeneous lncRNA-disease network is constructed by integrating three kinds of networks: the known lncRNA-disease association network, the disease-disease similarity network and the lncRNA-lncRNA similarity network. (2) Based on the newly constructed heterogeneous lncRNA-disease network, the concept of network distance is introduced to establish the TCS (Target Convergence Set) and GS (Global Set) for each node in the network. (3) Based on the concepts of TCS and GS, a novel random walk model is proposed to infer potential lncRNA-disease associations. (4) Through comparison with traditional state-of-the-art prediction models and the results of case studies, TCSRWRLD is shown to have excellent prediction performance in uncovering potential lncRNA-disease associations.
Methods and materials
Known disease-lncRNA associations
First, we download the 2017 version of known lncRNA-disease associations from the lncRNAdisease database (http://www.cuilab.cn/lncrnadisease). After removing duplicated associations and picking out the lncRNA-disease associations from the raw data, we finally obtain 1695 known lncRNA-disease associations (see Additional file 1) including 828 different lncRNAs (see Additional file 2) and 314 different diseases (see Additional file 3). Hence, we can construct a 314 × 828 dimensional lncRNA-disease association adjacency matrix A, in which A(i, j) = 1 if and only if there is a known association between the disease d_{i} and the lncRNA l_{j} in the LncRNADisease database, and A(i, j) = 0 otherwise. In addition, for convenience of description, let N_{L} = 828 and N_{D} = 314, so that the dimension of the lncRNA-disease association adjacency matrix A can be written as N_{D} × N_{L}. In the same way, we obtain a lncRNA-cancer association adjacency matrix of dimension 98 × 668 from the 2016 version of known lncRNA-disease associations in the Lnc2Cancer database (see Additional file 4).
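The construction of the adjacency matrix A can be sketched as follows. This is a minimal illustration that assumes the associations are already available as (disease, lncRNA) name pairs; the function name `build_association_matrix` and the input layout are hypothetical and not part of the original pipeline.

```python
import numpy as np

def build_association_matrix(pairs, diseases, lncrnas):
    """Build the N_D x N_L binary matrix A from (disease, lncRNA) pairs."""
    d_index = {d: i for i, d in enumerate(diseases)}
    l_index = {l: j for j, l in enumerate(lncrnas)}
    A = np.zeros((len(diseases), len(lncrnas)), dtype=int)
    for disease, lncrna in set(pairs):  # set() drops duplicated records
        A[d_index[disease], l_index[lncrna]] = 1
    return A
```

With the dataset described above, `A` would have shape (314, 828) and contain 1695 ones.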
Similarity of diseases
Semantic similarity of diseases
In order to estimate the semantic similarity between different diseases, based on the concept of DAGs (Directed Acyclic Graph) of different diseases proposed by Wang et al. [44, 45], we can calculate the disease semantic similarity through calculating the similarity between compositions of DAGs of different diseases as follows:
Step 1
For all these 314 diseases newly obtained from the lncRNAdisease database, their corresponding MeSH descriptors can be downloaded from the MeSH database in the National Library of Medicine (http://www.nlm.nih.gov/). As illustrated in Fig. 5, based on the information of the MeSH descriptors, a DAG can be established for each disease.
Step 2
For any given disease d, let its DAG be DAG(d) = (d, D(d), E(d)), where D(d) represents a set of nodes consisting of the disease d itself and its ancestral disease nodes, and E(d) denotes a set of directed edges pointing from ancestral nodes to descendant nodes.
Step 3
For any given disease d and one of its ancestor nodes t in DAG(d), the semantic contributions of the ancestor node t to the disease d can be defined as follows:
Where Δ is the attenuation factor, with a value between 0 and 1, used to calculate the disease semantic contribution; according to state-of-the-art experimental results, the most appropriate value for Δ is 0.5.
Step 4
For any given disease d, let its DAG be DAG(d), then based on the concept of DAG, the semantic value of d can be defined as follows:
Taking the disease DSN (Digestive System Neoplasms) illustrated in Fig. 5 as an example, according to Equation (1), the semantic contribution of digestive system neoplasms to itself is 1. Since neoplasms by site and digestive system diseases are located in the second layer of the DAG of DSN, the semantic contribution of each of these two diseases to DSN is 0.5*1 = 0.5. Moreover, since neoplasms is located in the third layer of the DAG of DSN, its semantic contribution to DSN is 0.5*0.5 = 0.25. Hence, according to Equation (2), the semantic value of DSN is 2.25 (= 1 + 0.5 + 0.5 + 0.25).
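The DSN example can be reproduced with a short sketch. It assumes the usual form of the contribution rule from Wang et al.: a disease contributes 1 to itself, and each ancestor contributes Δ times the largest contribution among its children in the DAG. The `parents` dictionary (child to list of parents) and the function names are hypothetical.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Contribution of `disease` and each of its ancestors to `disease`."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = delta * contrib[node]
            # keep the largest contribution over all paths to this ancestor
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def semantic_value(disease, parents, delta=0.5):
    """Sum of the contributions of all nodes in the DAG of `disease`."""
    return sum(semantic_contributions(disease, parents, delta).values())

# DAG of DSN as described in the text (child -> parents)
dsn_parents = {
    "digestive system neoplasms": ["neoplasms by site", "digestive system diseases"],
    "neoplasms by site": ["neoplasms"],
}
print(semantic_value("digestive system neoplasms", dsn_parents))  # 2.25
```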
Step 5
For any two given diseases d_{i} and d_{j}, based on the assumption that the more similar the structures of their DAGs, the higher the semantic similarity between them will be, the semantic similarity between d_{i} and d_{j} can be defined as follows:
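A hedged sketch of this step, assuming the standard form from Wang et al.: the contributions of the ancestors shared by the two DAGs are summed and divided by the sum of the two semantic values. The inputs are contribution dictionaries (node to contribution) for each disease; the function name and the small example values are illustrative assumptions.

```python
def semantic_similarity(contrib_i, contrib_j):
    """Shared-ancestor contributions divided by the two semantic values."""
    shared = contrib_i.keys() & contrib_j.keys()
    numerator = sum(contrib_i[t] + contrib_j[t] for t in shared)
    return numerator / (sum(contrib_i.values()) + sum(contrib_j.values()))

# Contributions for DSN (from the worked example) and for a hypothetical
# disease whose DAG is "neoplasms by site" -> "neoplasms"
dsn = {"DSN": 1.0, "neoplasms by site": 0.5,
       "digestive system diseases": 0.5, "neoplasms": 0.25}
nbs = {"neoplasms by site": 1.0, "neoplasms": 0.5}
similarity = semantic_similarity(dsn, nbs)  # 2.25 / 3.75
```

Two diseases with identical DAGs get similarity 1, and two diseases with no common ancestor get similarity 0.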
Gaussian interaction profile kernel similarity of diseases
Based on the assumption that similar diseases tend to be associated with similar lncRNAs, and according to the newly constructed lncRNA-disease association adjacency matrix A, for any two given diseases d_{i} and d_{j}, the Gaussian interaction profile kernel similarity between them can be obtained as follows:
Here, IP(d_{t}) denotes the vector consisting of the elements in the t-th row of the lncRNA-disease adjacency matrix A. γ_{d} is the parameter that controls the kernel bandwidth; it is obtained from the new bandwidth parameter \( {\gamma}_d^{\prime } \) by normalizing with the average number of lncRNA-disease associations over all the diseases. In addition, inspired by the method proposed by O. Vanunu et al. [46], we adopt a logistic function to optimize the Gaussian interaction profile kernel similarity between diseases, and based on Equation (4) above, we can further obtain an N_{D} × N_{D} dimensional adjacency matrix FKD as follows:
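The kernel computation can be sketched as below, assuming the commonly used form exp(-γ‖IP(d_i) − IP(d_j)‖²) with γ equal to γ′ divided by the mean squared norm of the interaction profiles. The logistic parameters `c = -15` and `d = log(9999)` are values often used with this transform and are assumptions here, as are the function names.

```python
import numpy as np

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = np.sum(profiles ** 2, axis=1)
    gamma = gamma_prime / sq_norms.mean()  # bandwidth normalization
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def logistic_rescale(K, c=-15.0, d=np.log(9999.0)):
    """Logistic transform of kernel values (parameter values are assumed)."""
    return 1.0 / (1.0 + np.exp(c * np.asarray(K, dtype=float) + d))

A = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 0]])
KD_gip = gip_kernel(A)          # disease kernel uses the rows of A
FKD = logistic_rescale(KD_gip)
```

The same `gip_kernel` applied to `A.T` would give the corresponding kernel for lncRNAs, whose profiles are the columns of A.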
Integrated similarity of diseases
Based on the disease semantic similarity and the disease Gaussian interaction profile kernel similarity obtained above, an N_{D} × N_{D} dimensional integrated disease similarity adjacency matrix KD can be obtained as follows:
Similarity of LncRNAs
Functional similarity of LncRNAs
We can obtain the disease groups corresponding to two given lncRNAs l_{i} and l_{j} from the known lncRNA-disease associations. Based on the assumption that similar diseases tend to be associated with similar lncRNAs, we define the functional similarity of two given lncRNAs l_{i} and l_{j} as the semantic similarity between their corresponding disease groups. The specific calculation process is as follows:
For any two given lncRNAs l_{i} and l_{j}, let DS(i) = {d_{k} | A(k, i) = 1, k∈[1, N_{D}]} and DS(j) = {d_{k} | A(k, j) = 1, k∈[1, N_{D}]}, then the functional similarity between l_{i} and l_{j} can be calculated according to the following steps [31]:
Step 1
For any given disease group DS(k) and disease d_{t}∉DS(k), we first calculate the similarity between d_{t} and DS(k) as follows:
Step 2
Therefore, based on above Equation (8), we define the functional similarity between l_{i} and l_{j} as FuncKL(i, j), which can be calculated as follows:
Here, D(i) and D(j) represent the number of diseases in DS(i) and DS(j) respectively. According to Equation (9) above, an N_{L} × N_{L} dimensional lncRNA functional similarity matrix FuncKL can finally be obtained.
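The two steps can be sketched as follows, under the assumption that the similarity between a disease and a disease group is the maximum semantic similarity to any member of the group, and that FuncKL sums these values in both directions and divides by D(i) + D(j). The function name is hypothetical, and `dis_sim` stands for a precomputed N_D × N_D disease semantic similarity matrix.

```python
import numpy as np

def functional_similarity(A, dis_sim):
    """lncRNA functional similarity from disease groups (columns of A)."""
    A = np.asarray(A)
    dis_sim = np.asarray(dis_sim, dtype=float)
    n_l = A.shape[1]
    groups = [np.flatnonzero(A[:, j]) for j in range(n_l)]  # DS(j)
    func = np.zeros((n_l, n_l))
    for i in range(n_l):
        for j in range(n_l):
            gi, gj = groups[i], groups[j]
            if len(gi) == 0 or len(gj) == 0:
                continue  # no known diseases: similarity left at 0
            block = dis_sim[np.ix_(gi, gj)]
            # best match for each disease of DS(i) in DS(j), and vice versa
            total = block.max(axis=1).sum() + block.max(axis=0).sum()
            func[i, j] = total / (len(gi) + len(gj))
    return func
```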
Gaussian interaction profile kernel similarity of lncRNAs
Based on the assumption that similar lncRNAs tend to be associated with similar diseases, and according to the newly constructed lncRNA-disease association adjacency matrix A, for any two given lncRNAs l_{i} and l_{j}, the Gaussian interaction profile kernel similarity between them can be obtained as follows:
Here, IP(l_{t}) denotes the vector consisting of the elements in the t-th column of the lncRNA-disease adjacency matrix A. γ_{l} is the parameter that controls the kernel bandwidth; it is obtained from the new bandwidth parameter \( {\gamma}_l^{\prime } \) by normalizing with the average number of lncRNA-disease associations over all the lncRNAs. Based on Equation (10) above, we can obtain an N_{L} × N_{L} dimensional lncRNA Gaussian interaction profile kernel similarity matrix FKL as well.
Integrated similarity of lncRNAs
Based on the lncRNA functional similarity and the lncRNA Gaussian interaction profile kernel similarity obtained above, an N_{L} × N_{L} dimensional integrated lncRNA similarity adjacency matrix KL can be obtained as follows:
Construction of computational model TCSRWRLD
The establishment of heterogeneous network
By combining the N_{D} × N_{D} dimensional integrated disease similarity adjacency matrix KD and the N_{L} × N_{L} dimensional integrated lncRNA similarity adjacency matrix KL with the N_{D} × N_{L} dimensional lncRNA-disease association adjacency matrix A, we can construct a new (N_{L} + N_{D}) × (N_{L} + N_{D}) dimensional integrated matrix AA as follows:
According to Equation (13) above, we can construct a corresponding heterogeneous lncRNA-disease network consisting of N_{D} different disease nodes and N_{L} different lncRNA nodes, in which, for any given pair of nodes i and j, there is an edge between them if and only if AA(i, j) > 0.
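One way to assemble AA is sketched below. The block order (lncRNA nodes first, then disease nodes) is an assumption inferred from the dimensions of the result blocks S_{1} to S_{4} described later in the text, and the function name is hypothetical.

```python
import numpy as np

def build_heterogeneous_matrix(KL, KD, A):
    """Stack KL (N_L x N_L), KD (N_D x N_D) and A (N_D x N_L) into AA."""
    KL, KD, A = (np.asarray(m, dtype=float) for m in (KL, KD, A))
    return np.block([[KL, A.T],
                     [A,  KD]])

KL = np.eye(3)
KD = np.eye(2)
A = np.array([[1, 0, 1],
              [0, 1, 0]])
AA = build_heterogeneous_matrix(KL, KD, A)
edges = AA > 0  # an edge exists between nodes i and j iff AA(i, j) > 0
```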
Establishment of TCS (target convergence set)
Before the random walk is run, each node in the newly constructed heterogeneous lncRNA-disease network first establishes its own TCS, as illustrated in Fig. 6, according to the following steps:
Step 1
For any given lncRNA node l_{j}, we define its original TCS as the set of all disease nodes that have known associations with it, i.e., the original TCS of l_{j} is TCS_{0}(l_{j}) = {d_{k} | A(k, j) = 1, k∈[1, N_{D}]}. Similarly, for a given disease node d_{i}, we can define its original TCS as TCS_{0}(d_{i}) = {l_{k} | A(i, k) = 1, k∈[1, N_{L}]}.
Step 2
After the original TCS has been established, for any given lncRNA node l_{j}, ∀d_{k}∈TCS_{0}(l_{j}) and ∀t∈[1, N_{D}], we can define the network distance ND(k, t) between d_{k} and d_{t} as follows:
According to Equation (14) above, for any disease node d_{k}∈TCS_{0}(l_{j}) and ∀t∈[1, N_{D}], it is reasonable to deduce that the smaller the value of ND(k, t), the higher the similarity between d_{t} and d_{k}, i.e., the higher the possibility that there is a potential association between d_{t} and l_{j}.
Similarly, for any given disease node d_{i}, ∀l_{k}∈TCS_{0}(d_{i}) and ∀t∈[1, N_{L}], we can define the network distance ND(k, t) between l_{k} and l_{t} as follows:
According to Equation (15) above, for any lncRNA node l_{k}∈TCS_{0}(d_{i}) and ∀t∈[1, N_{L}], it is reasonable to deduce that the smaller the value of ND(k, t), the higher the similarity between l_{t} and l_{k}, i.e., the higher the possibility that there is a potential association between l_{t} and d_{i}.
Step 3
According to Equation (14) and Equation (15) above, for any given disease node d_{i} or any given lncRNA node l_{j}, we define the TCS of d_{i} as the set of the top 100 lncRNA nodes in the heterogeneous lncRNA-disease network that have minimum average network distance to the lncRNA nodes in TCS_{0}(d_{i}), and the TCS of l_{j} as the set of the top 100 disease nodes in the heterogeneous lncRNA-disease network that have minimum average network distance to the disease nodes in TCS_{0}(l_{j}). Note that the 100 lncRNA nodes in TCS(d_{i}) may or may not belong to TCS_{0}(d_{i}), and the 100 disease nodes in TCS(l_{j}) may or may not belong to TCS_{0}(l_{j}).
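The selection in Step 3 can be sketched as follows. The sketch takes a precomputed distance matrix as input and does not reproduce Equations (14) and (15); in the usage lines, `1 - similarity` is only an illustrative stand-in for the network distance, and the function name is hypothetical.

```python
import numpy as np

def target_convergence_set(seed_idx, distance, size=100):
    """Indices of the `size` nodes with minimum average distance to the seeds."""
    seed_idx = np.asarray(seed_idx, dtype=int)
    if seed_idx.size == 0:
        return np.array([], dtype=int)  # no known associations, empty TCS_0
    avg = np.asarray(distance, dtype=float)[seed_idx].mean(axis=0)
    size = min(size, avg.size)
    return np.argsort(avg, kind="stable")[:size]

# Toy example with 4 disease nodes; the distance used here is an assumption
similarity = np.array([[1.0, 0.9, 0.1, 0.2],
                       [0.9, 1.0, 0.3, 0.4],
                       [0.1, 0.3, 1.0, 0.8],
                       [0.2, 0.4, 0.8, 1.0]])
distance = 1.0 - similarity
tcs = target_convergence_set([0], distance, size=2)
```

For an lncRNA node l_{j}, `seed_idx` would hold the disease indices in TCS_{0}(l_{j}); for a disease node d_{i}, the same function would be applied to the lncRNA distance matrix with the lncRNA indices in TCS_{0}(d_{i}).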
Random walk in the heterogeneous LncRNA-disease network
The method of random walk simulates the process of a random walker’s transition from one starting node to other neighboring nodes in the network with given probability. Based on the assumption that similar diseases tend to be associated with similar lncRNAs, as illustrated in Fig. 7, the process of our prediction model TCSRWRLD can be divided into the following major steps:
Step 1
For a walker, before it starts its random walk across the heterogeneous lncRNA-disease network, it will first construct a transition probability matrix W as follows:
Step 2
In addition, for any node £_{i} in the heterogeneous lncRNA-disease network, whether £_{i} is a lncRNA node l_{i} or a disease node d_{i}, it can obtain an initial probability vector P_{i}(0) for itself as follows:
Step 3
Next, the walker will randomly select a node §_{i} in the heterogeneous lncRNA-disease network as the starting node to initiate its random walk, where §_{i} may be a lncRNA node l_{i} or a disease node d_{i}. After the random walk has started, suppose that the walker has currently arrived at the node Γ_{i} from the previous hop node Γ_{j} after t−1 hops across the heterogeneous lncRNA-disease network. Then, whether Γ_{i} is a lncRNA node l_{i} or a disease node d_{i}, and whether Γ_{j} is a lncRNA node l_{j} or a disease node d_{j}, the walker can further obtain a walking probability vector P_{i}(t) as follows:
Where ∂ (0 < ∂ < 1) is a parameter for the walker to adjust the value of the walking probability vector at each hop. Moreover, based on the newly obtained walking probability vector P_{i}(t), let P_{i}(t) = \( {\left({p}_{i,1}(t),{p}_{i,2}(t),\dots, {p}_{i,j}(t),\dots, {p}_{i,{N}_D+{N}_L}(t)\right)}^T \), and for convenience, suppose that p_{i, k}(t) = maximum{\( {p}_{i,1}(t),{p}_{i,2}(t),\dots, {p}_{i,k}(t),\dots, {p}_{i,{N}_D+{N}_L}(t) \)}; then the walker will choose the node ψ_{k} as its next hop node, where ψ_{k} may be a lncRNA node l_{k} or a disease node d_{k}. In particular, for the starting node §_{i}, since the walker can be regarded as having arrived at §_{i} from §_{i} after 0 hops, at the starting node §_{i} the walker obtains two kinds of probability vectors: the initial probability vector P_{i}(0) and the walking probability vector P_{i}(1). At each intermediate node Γ_{i}, the walker likewise obtains two kinds of probability vectors: the initial probability vector P_{i}(0) and the walking probability vector P_{i}(t).
Step 4
Based on Equation (19) above, suppose that the walker has currently arrived at the node Γ_{i} from the previous hop node Γ_{j} after t−1 hops across the heterogeneous lncRNA-disease network, and let the walking probability vectors obtained by the walker at the nodes Γ_{i} and Γ_{j} be P_{i}(t) and P_{j}(t−1) respectively. If the L1 norm of the difference between P_{i}(t) and P_{j}(t−1) satisfies ‖P_{i}(t) − P_{j}(t − 1)‖_{1} ≤ 10^{−6}, then we regard the walking probability vector P_{i}(t) as having reached a stable state at the node Γ_{i}. After the walking probability vectors obtained by the walker at every disease node and lncRNA node in the heterogeneous lncRNA-disease network have reached a stable state, let these stable walking probability vectors be \( {P}_1\left(\infty \right),{P}_2\left(\infty \right),\dots, {P}_{N_D+{N}_L}\left(\infty \right) \); based on them, we can obtain a stable walking probability matrix S(∞) as follows:
Where S_{1} is an N_{L}×N_{L} dimensional matrix, S_{2} is an N_{L}×N_{D} dimensional matrix, S_{3} is an N_{D}×N_{L} dimensional matrix, and S_{4} is an N_{D}×N_{D} dimensional matrix. From the descriptions above, the matrices S_{2} and S_{3} are the final result matrices we need, and we can predict potential lncRNA-disease associations based on the scores given in these two matrices.
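For orientation, a generic random walk with restart can be sketched as below. It uses the textbook update p(t+1) = (1 − ∂)·Wᵀ·p(t) + ∂·p(0) with a row-normalized transition matrix and the 10^{−6} L1 stopping rule from the text; this standard form and the function names are assumptions, and the sketch is not a reproduction of Equations (16) to (20).

```python
import numpy as np

def row_normalize(M):
    """Divide each row by its sum; all-zero rows stay zero."""
    M = np.asarray(M, dtype=float)
    sums = M.sum(axis=1, keepdims=True)
    return np.divide(M, sums, out=np.zeros_like(M), where=sums > 0)

def random_walk_with_restart(W, p0, restart=0.4, tol=1e-6, max_iter=1000):
    """Iterate the RWR update until the L1 change is at most `tol`."""
    p0 = np.asarray(p0, dtype=float)
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart) * W.T @ p + restart * p0
        if np.abs(p_next - p).sum() <= tol:
            return p_next
        p = p_next
    return p

# Toy network with three fully connected nodes, walker started at node 0
W = row_normalize(np.array([[0, 1, 1],
                            [1, 0, 1],
                            [1, 1, 0]]))
p_stable = random_walk_with_restart(W, [1.0, 0.0, 0.0])
```

The default `restart=0.4` mirrors the value of ∂ reported as best in Table 1. Running the walk once per starting node and stacking the stable vectors would give a matrix with the same block layout as S(∞).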
According to the steps of the random walk process described above, for each node Γ_{i} in the heterogeneous lncRNA-disease network, the stable walking probability vector obtained by the walker at Γ_{i} is P_{i}(∞) = \( {\left({p}_{i,1}\left(\infty \right),{p}_{i,2}\left(\infty \right),\dots, {p}_{i,j}\left(\infty \right),\dots, {p}_{i,{N}_D+{N}_L}\left(\infty \right)\right)}^T \). For convenience, we denote the node set consisting of all the N_{D}+N_{L} nodes in the heterogeneous lncRNA-disease network as the Global Set (GS), so that the stable walking probability vector P_{i}(∞) can be rewritten as \( {P}_i^{GS}\left(\infty \right) \). By construction, the walker will not stop its random walk until the (N_{D}+N_{L})-dimensional walking probability vector at each node in the heterogeneous lncRNA-disease network has reached a stable state, which becomes very time-consuming when N_{D}+N_{L} is large. Hence, in order to decrease the execution time and speed up the convergence of TCSRWRLD, based on the concept of TCS proposed in the section above, while constructing the walking probability vector P_{i}(t) = \( {\left({p}_{i,1}(t),{p}_{i,2}(t),\dots, {p}_{i,j}(t),\dots, {p}_{i,{N}_D+{N}_L}(t)\right)}^T \) at the node Γ_{i}, we keep p_{i,j}(t) unchanged if the j-th of these N_{D}+N_{L} nodes belongs to the TCS of Γ_{i}, and otherwise set p_{i,j}(t) = 0. Thus, the walking probability vector obtained by the walker at Γ_{i} becomes \( {P}_i^{TCS}(t) \), and the stable walking probability vector obtained by the walker at Γ_{i} becomes \( {P}_i^{TCS}\left(\infty \right) \).
Compared with \( {P}_i^{GS}\left(\infty \right) \), the stable state of \( {P}_i^{TCS}\left(\infty \right) \) can be reached by the walker much more quickly. However, there may be nodes that are not in the TCS of Γ_{i} but are actually associated with the target node. Therefore, in order to avoid omissions, during simulation we construct a novel stable walking probability vector \( {P}_i^{ANS}\left(\infty \right) \) by combining \( {P}_i^{GS}\left(\infty \right) \) with \( {P}_i^{TCS}\left(\infty \right) \) to predict potential lncRNA-disease associations as follows:
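The TCS restriction described above can be sketched in Python as follows. This is only an illustration under stated assumptions: nodes are indexed from 0, the TCS of a node is given as a set of node indices, and the helper names `restrict_to_tcs` and `tcs_is_stable` are hypothetical.

```python
def restrict_to_tcs(p, tcs):
    """Keep p[j] if node j is in the TCS, otherwise set the entry to 0."""
    return [value if j in tcs else 0.0 for j, value in enumerate(p)]


def l1_distance(p, q):
    """L1 norm of the difference between two equally long vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def tcs_is_stable(p_curr, p_prev, tcs, tol=1e-6):
    """Stability test applied only to the entries that belong to the TCS."""
    return l1_distance(restrict_to_tcs(p_curr, tcs),
                       restrict_to_tcs(p_prev, tcs)) <= tol


# Toy vectors over 4 nodes (made-up numbers); nodes 1 and 3 form the TCS.
tcs = {1, 3}
p_prev = [0.10, 0.40, 0.30, 0.20]
p_curr = [0.15, 0.40, 0.25, 0.20]

p_tcs = restrict_to_tcs(p_curr, tcs)                       # entries outside the TCS become 0
stable_on_tcs = tcs_is_stable(p_curr, p_prev, tcs)         # TCS entries did not change
stable_on_gs = l1_distance(p_curr, p_prev) <= 1e-6         # the full vector still changes
```

In this toy case the entries inside the TCS have already settled while entries outside it are still moving, so the restricted test reports stability earlier than the test over the whole Global Set.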
Availability of data and materials
The datasets generated and/or analysed during the current study are available in the LncRNADisease repository, http://www.cuilab.cn/lncrnadisease.
Abbreviations
10-Fold CV: 10-fold cross-validation
2-Fold CV: 2-fold cross-validation
5-Fold CV: 5-fold cross-validation
AUC: Area under the ROC curve
AUPR: Area under the precision-recall curve
FPR: False positive rate
GS: Global Set
H19: Long non-coding RNA H19
lncRNAs: Long non-coding RNAs
LOOCV: Leave-One-Out Cross Validation
ncRNAs: Non-coding RNAs
PR curve: Precision-recall curve
ROC: Receiver operating characteristic
RWR: Random walk with restart
TCS: Target Convergence Set
TCSRWRLD: A novel computational model based on improved random walk with restart, proposed to infer potential lncRNA-disease associations
TPR: True positive rate
Xist: Long non-coding RNA Xist
Acknowledgements
The authors thank the anonymous referees for suggestions that helped improve the paper substantially.
Funding
This research was partly sponsored by the National Natural Science Foundation of China (No. 61873221, No. 61672447) and the Natural Science Foundation of Hunan Province (No. 2018JJ4058, No. 2019JJ70010, No. 2017JJ5036). Publication costs were funded by the National Natural Science Foundation of China (No. 61873221, No. 61672447). The funder of the manuscript is Lei Wang (L.W.), whose contributions are stated in the Contributions section. The funding bodies played no role in the design of the study, in the collection, analysis and interpretation of data, or in writing the manuscript.
Author information
Contributions
JCL conceived the study. JCL, XF, LW improved the study based on the original model. XYL, BW and BHZ implemented the algorithms corresponding to the study. LW supervised the study. JCL and LW wrote the manuscript of the study. All authors reviewed and improved the manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1.
The known lncRNA-disease associations used for constructing the known lncRNA-disease network. We list 1695 known lncRNA-disease associations collected from the LncRNADisease dataset (the latest version in the database).
Additional file 2.
The names of the 828 known lncRNAs included in the 1695 known lncRNA-disease associations collected from the LncRNADisease dataset (the latest version in the database).
Additional file 3.
The names of the 314 known diseases included in the 1695 known lncRNA-disease associations collected from the LncRNADisease dataset (the latest version in the database).
Additional file 4.
The 98 known human cancers, 668 lncRNAs and 1103 confirmed associations between them from the Lnc2Cancer database.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Li, J., Li, X., Feng, X. et al. A novel target convergence set based random walk with restart for prediction of potential LncRNA-disease associations. BMC Bioinformatics 20, 626 (2019). https://doi.org/10.1186/s12859-019-3216-4
DOI: https://doi.org/10.1186/s12859-019-3216-4
Keywords
 Potential lncRNA-disease association prediction
 Heterogeneous network
 Random walk with restart
 Target convergence set
 Global set