 Methodology article
 Open Access
A random forest based computational model for predicting novel lncRNA-disease associations
BMC Bioinformatics volume 21, Article number: 126 (2020)
Abstract
Background
Accumulated evidence shows that the abnormal regulation of long non-coding RNA (lncRNA) is associated with various human diseases. Accurately identifying disease-associated lncRNAs helps to study the mechanisms of lncRNAs in diseases and to explore new therapies. Many lncRNA-disease association (LDA) prediction models have been implemented by integrating multiple kinds of data resources. However, most existing models ignore the interference of noisy and redundant information among these data resources.
Results
To improve the ability of LDA prediction models, we implemented a random forest and feature selection based LDA prediction model (RFLDA for short). First, the RFLDA integrates the experiment-supported miRNA-disease associations (MDAs) and LDAs, the disease semantic similarity (DSS), the lncRNA functional similarity (LFS) and the lncRNA-miRNA interactions (LMI) as input features. Then, the RFLDA selects the most useful features for training the prediction model, using the random forest variable importance score, which takes into account not only the effect of each individual feature on the prediction results but also the joint effects of multiple features. Finally, a random forest regression model is trained to score potential lncRNA-disease associations. With an area under the receiver operating characteristic curve (AUC) of 0.976 and an area under the precision-recall curve (AUPR) of 0.779 under 5-fold cross-validation, the RFLDA performs better than several state-of-the-art LDA prediction models. Moreover, case studies on three cancers demonstrate that 43 of the 45 lncRNAs predicted by the RFLDA are validated by experimental data, and the other two predicted lncRNAs are supported by other LDA prediction models.
Conclusions
Cross-validation and case studies indicate that the RFLDA has excellent ability to identify potential disease-associated lncRNAs.
Background
LncRNAs are non-coding RNAs with transcripts longer than 200 nucleotides [1]. Accumulated evidence demonstrates that lncRNAs are involved in almost all important biological processes, including gene transcription, cell differentiation, and epigenetic regulation [2,3,4]. The abnormal regulation of lncRNAs is associated with many complex human diseases, such as various cancers, Alzheimer's disease, cardiovascular disease and neurodegenerative diseases [5,6,7,8,9]. Therefore, accurately identifying disease-associated lncRNAs helps to study the mechanisms of lncRNAs in diseases and to explore new therapies.
To reduce the cost of discovering disease-associated lncRNAs by biological experiments, dozens of computational models have been developed to identify disease-associated lncRNAs based on a variety of biological data. At present, LDA prediction models can be classified into three categories. The first category consists of complex network based models, which predict disease-associated lncRNAs by integrating various biological networks [5]. Under the supposition that lncRNAs with analogous functions tend to be related to diseases with analogous phenotypes, Sun et al. proposed an LDA prediction model named RWRlncD by implementing random walk with restart (RWR) on an LFS network [10]. Under the supposition that the more miRNAs two lncRNAs both interact with, the more likely they are to be related to analogous diseases, Zhou et al. proposed an LDA prediction model by implementing random walk on a heterogeneous network that integrated the disease similarity network, the miRNA-mediated lncRNA crosstalk network and the experiment-supported LDA network [11]. However, neither of these methods can be applied to new diseases that have no experiment-supported associated lncRNAs. Chen et al. implemented an LDA prediction model called KATZLDA by integrating the Gaussian interaction profile kernel similarity (GIPKS) and semantic similarity of diseases, the expression profiles and functional similarity of lncRNAs, and the experiment-supported LDAs [12]. In addition, Chen et al. developed an improved RWR based LDA prediction model (IRWRLDA), which set the initial probability vector of RWR by combining the lncRNA expression similarity with the DSS [13]. Both of these methods can be applied to new diseases that have no experiment-supported associated lncRNAs. Moreover, Yu et al. implemented a bi-random walks based LDA prediction model (BRWLDA) [14]. Gu et al. developed a random walk based LDA prediction model on a global network (GrwLDA) [15]. Zhang et al. constructed a flow propagation algorithm based LDA prediction model (LncRDNetFlow) [16]. Xiao et al. proposed an LDA prediction model based on paths of fixed lengths (BPLLDA) [17]. Ping et al. inferred potential LDAs from an experiment-supported LDA network [18]. Fan et al. implemented an RWR based LDA prediction model (IDHIMIRW) by combining the positive pointwise mutual information with multiple heterogeneous information [19]. Liu et al. constructed an LDA prediction model based on a label propagation algorithm on a weighted network (NBLDA) [20]. Li et al. developed a local random walk based LDA prediction model (LRWHLDA) [21]. Sumathipala et al. developed a network diffusion based LDA prediction model by integrating the protein-disease, protein-lncRNA and protein-protein associations [22]. Zhang et al. developed a DeepWalk based LDA prediction model by integrating the miRNA-disease, lncRNA-disease, and miRNA-lncRNA associations [23]. Xie et al. implemented a similarity kernel fusion based LDA prediction model (SFKLDA) by fusing the semantic and cosine similarities of diseases with the expression and cosine similarities of lncRNAs [24].
The second category of LDA prediction models predicts disease-associated lncRNAs based on the expression levels of, and regulatory relationships between, disease-associated genes/miRNAs and lncRNAs [5]. Liu et al. implemented the first LDA prediction model that does not depend on known LDAs, by combining experiment-validated disease-associated genes with gene/lncRNA expression profiles [25]. However, this model cannot be used for diseases that have no experiment-validated associated genes. Li et al. proposed a genome location based model for screening human vascular disease-associated lncRNAs [26]. However, this model is not applicable to lncRNAs that have no neighbouring genes. Chen developed a hypergeometric distribution based LDA prediction model (HGLDA) by combining the LMI with MDAs [27]. HGLDA performs reliably for LDA prediction, but it cannot be used for lncRNAs that have no experiment-supported interacting miRNAs [5]. Wang et al. developed a sequence based LDA prediction model (LncDisease) using the known lncRNA-miRNA crosstalk [28]. However, because the predicted miRNA-lncRNA interactions have high false negative and false positive rates, the performance of LncDisease is limited. Moreover, Cheng et al. developed an information flow modelling based LDA prediction model (IntNetLncSim) by combining lncRNA-associated transcriptional information with post-transcriptional information [29]. Wang et al. developed a competing endogenous RNA (ceRNA) based LDA prediction model (DisLncPri) by mapping lncRNAs to their functional genomics context [30]. Fu et al. proposed a matrix factorization based LDA prediction model (MFLDA) that decomposes multiple data matrices into low-rank matrices to identify their intrinsic structure [31]. Ding et al. developed an lncRNA-disease-gene tripartite graph based LDA prediction model (TPGLDA) by combining the gene-disease associations with the LDAs [32]. Lu et al. developed an inductive matrix completion based LDA prediction model (SIMCLDA) by integrating the gene-disease, lncRNA-disease and gene-gene associations [33]. Wang et al. implemented a weighted matrix factorization based LDA prediction model (WMFLDA) by presetting weights for the various association matrices among genes, lncRNAs and diseases and decomposing these matrices into low-rank matrices [34].
The third category of LDA prediction models predicts disease-associated lncRNAs using various machine learning algorithms [5]. Under the supposition that analogous diseases tend to be related to analogous lncRNAs, Chen et al. proposed a Laplacian regularized least squares based LDA prediction model (LRLSLDA) by combining the experiment-supported LDAs with the lncRNA expression profiles [35]; this was the first computational model in this field. LRLSLDA is a semi-supervised machine learning model that does not need negative samples, but how to optimize its parameters remains a problem. Later, Chen et al. implemented a new LDA prediction model named LRLSLDALNCSIM by combining the functional similarity, expression similarity and GIPKS of lncRNAs with the semantic similarity and GIPKS of diseases [36]. LRLSLDALNCSIM improves on the performance of LRLSLDA. Furthermore, Huang et al. proposed an improved LFS model (ILNCSIM) that uses the topological characteristics of disease directed acyclic graphs (DAGs) [37]. In addition, Zhao et al. implemented a naïve Bayesian classifier based lncRNA-cancer association prediction model by integrating genome, transcriptome and regulome data [38]. Lan et al. implemented a bagging SVM classifier based LDA prediction model (LDAP) by combining the disease similarity with the lncRNA similarity [39]. Yu et al. proposed a naïve Bayesian classifier based collaborative filtering LDA prediction model (CFNBC) by integrating the lncRNA-disease, miRNA-disease and lncRNA-miRNA associations [40]. Guo et al. developed a rotation forest and neural network based LDA prediction model (LDASR) by combining the lncRNA GIPKS with the DSS and the disease GIPKS [41]. Chen et al. implemented a support vector machine based LDA prediction model (ILDMSF) by integrating the lncRNA-gene interactions, the lncRNA-disease associations and the DSS [42]. Guo et al. implemented a random forest classifier based model for inferring novel associations among various biomolecules by constructing a molecular association network from the known associations among diseases, proteins, miRNAs, lncRNAs and drugs [43]. More recently, Xuan et al. proposed a series of convolutional neural network based LDA prediction models, including CNNLDA [44], GCNLDA [45], CNNDLP [46] and LDAPred [47]. CNNLDA learned the global and attention characteristics of lncRNA-disease pairs using convolutional neural networks by integrating the DSS, the LFS, and the lncRNA-disease, miRNA-disease and lncRNA-miRNA associations [44]. GCNLDA learned the local and network characteristics of lncRNA-disease pairs using a convolutional neural network, a graph convolutional network and a convolutional autoencoder by combining multiple associations among diseases, miRNAs and lncRNAs [45]. CNNDLP learned the network and attention characteristics of lncRNA-disease pairs using a convolutional neural network and a convolutional autoencoder by integrating various associations, interactions and similarities among miRNAs, lncRNAs and diseases [46]. LDAPred implemented LDA prediction using a convolutional neural network and information flow propagation by integrating various associations, interactions, similarities and topology among miRNAs, lncRNAs and diseases [47]. These four methods perform well for LDA prediction, but they all require many model parameters to be tuned.
Inspired by previous works [44, 48], we implemented a random forest and feature selection based LDA prediction model (RFLDA). First, the RFLDA represents each lncRNA-disease pair by a high-dimensional feature vector that integrates the DSS, the LFS, the experiment-supported LDAs and MDAs, and the miRNA-lncRNA interactions. Then, the RFLDA selects the more useful features, based on the variable importance score of random forest, to represent lncRNA-disease samples. Finally, the RFLDA employs a random forest regression model trained on the resulting low-dimensional feature space to score potential LDAs. The AUC and AUPR under 5-fold cross-validation demonstrate that the RFLDA performs better than several outstanding LDA prediction models. Moreover, case studies on three cancers indicate that the RFLDA has excellent ability to identify disease-associated lncRNAs.
Results
Feature selection
To determine how many features should be used to train the random forest regression model, we studied the prediction accuracy of models on different training sample sets by 10-fold cross-validation. Specifically, we chose the top 50, 100, …, 1900 and 1950 most important features (those with the largest variable importance scores) to train random forest models in turn and calculated their prediction accuracy. The prediction accuracy under 10-fold cross-validation of random forest models trained with different numbers of features is shown in Fig. 1. As one can see from Fig. 1, the prediction accuracy gradually increases as more features are added to the training sample set, and reaches its largest value, 0.947, on the training sample set consisting of the top 300 most important features. Therefore, in this work, we used the top 300 most important features to train the RFLDA model and evaluate its performance. The variable importance scores of all 1952 features are listed in Table S1, and the prediction accuracy of all random forest models on different training samples is listed in Table S2.
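The sweep over candidate feature counts can be sketched as follows. This is a minimal illustration rather than the authors' code: `choose_feature_count`, `evaluate` and the synthetic accuracy curve are hypothetical stand-ins, and in practice `evaluate` would run the cross-validation on a forest trained with the given feature subset.

```python
def choose_feature_count(ranked_features, evaluate, sizes):
    """Return the (size, accuracy) pair with the highest accuracy.

    ranked_features: feature ids sorted by descending importance score.
    evaluate: callable mapping a feature subset to an accuracy (hypothetical).
    """
    best_size, best_accuracy = None, float("-inf")
    for size in sizes:
        accuracy = evaluate(ranked_features[:size])
        if accuracy > best_accuracy:  # strict '>' keeps the smaller set on ties
            best_size, best_accuracy = size, accuracy
    return best_size, best_accuracy


# Candidate sizes 50, 100, ..., 1950, as in the text.
sizes = list(range(50, 1951, 50))


def toy_accuracy(features):
    # Synthetic curve peaking at 300 features; made up purely for illustration.
    return 1.0 - abs(len(features) - 300) / 2000


ranked = list(range(1952))  # placeholder ids for the 1952 ranked features
best_size, best_accuracy = choose_feature_count(ranked, toy_accuracy, sizes)
```

With a real evaluator, the returned size is the one at which the cross-validated accuracy peaks; the strict comparison means that a smaller feature set is preferred when two sizes tie.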
Performance measures
The AUC and AUPR under 5-fold cross-validation are calculated to evaluate the ability of different LDA prediction models. The 2697 experiment-supported LDAs are considered positive samples, and 2697 randomly selected lncRNA-disease pairs not validated by experiments are considered negative samples. All lncRNA-disease pairs not validated by experiments are taken as unlabelled samples. For 5-fold cross-validation, all positive and negative samples are evenly divided into 5 parts. In each round of cross-validation, four parts of the positive and negative samples are used in turn to train the random forest model, and the left-over positive samples and all unlabelled samples are used as test samples. Then, a random forest regression model is trained to score the test samples. As a result, each test sample (lncRNA-disease pair) is given a score that represents the likelihood that the lncRNA and the disease of this sample are associated. Next, all test samples are sorted in descending order of their prediction scores. On this basis, we calculated the false positive rate (FPR) and the true positive rate (TPR) at different thresholds. The FPR is the proportion of all negative samples that are predicted as positive (that is, ranked above the given threshold). The TPR is the proportion of all positive samples that are predicted as positive (ranked above the given threshold). The TPR and the FPR can be calculated by eq. 1 and eq. 2, respectively.
$$TPR = \frac{TP}{TP + FN} \tag{1}$$
$$FPR = \frac{FP}{FP + TN} \tag{2}$$
where TP (true positive) means that a positive sample is correctly predicted as a positive sample; FN (false negative) means that a positive sample is incorrectly predicted as a negative sample; FP (false positive) means that a negative sample is incorrectly predicted as a positive sample; TN (true negative) means that a negative sample is correctly predicted as a negative sample. Using the TPR as the vertical axis and the FPR as the horizontal axis, the receiver operating characteristic (ROC) curve is drawn, and the AUC is calculated to evaluate the prediction ability of different LDA prediction models [49]. The larger the AUC, the better the model.
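As a concrete illustration of the ranking procedure, the sketch below sweeps a threshold down a ranked list of test samples, records (FPR, TPR) points and integrates them with the trapezoid rule. The function names and the toy scores are hypothetical, ties between scores are ignored for simplicity, and the integration rule is an assumption, since the text does not state how the AUC was computed.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the threshold one sample at a time."""
    # Sort test samples by descending prediction score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1  # one more positive ranked above the threshold
        else:
            fp += 1  # one more negative ranked above the threshold
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy scores and labels (1 = positive, 0 = negative), made up for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
roc = roc_points(scores, labels)
auc = auc_trapezoid(roc)
```

For this toy list, eight of the nine positive-negative pairs are ranked in the correct order, so the AUC is 8/9.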
Because the number of negative samples (unconfirmed LDAs) is much larger than the number of positive samples (experiment-supported LDAs), the two classes are seriously imbalanced. Therefore, we also draw the precision-recall (PR) curve and calculate the AUPR to evaluate the prediction ability of different LDA prediction models [50]. Precision is the percentage of correctly predicted positive samples among all predicted positive samples, and Recall is the percentage of correctly predicted positive samples among all real positive samples. The Precision and the Recall can be calculated by eq. 3 and eq. 4, respectively.
$$Precision = \frac{TP}{TP + FP} \tag{3}$$
$$Recall = \frac{TP}{TP + FN} \tag{4}$$
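The PR curve can be traced from the same ranked list. The sketch below is illustrative only: the function names and toy data are hypothetical, and the step-wise summation (often called average precision) is an assumed choice, since the text does not specify how the area under the PR curve was integrated.

```python
def pr_points(scores, labels):
    """Return (recall, precision) points, one per position in the ranked list."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    tp = fp = 0
    points = []
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / n_pos, tp / (tp + fp)))
    return points


def step_area(points):
    """Sum precision times the recall increment at each ranked position."""
    area, previous_recall = 0.0, 0.0
    for recall, precision in points:
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


# Toy scores and labels (1 = positive, 0 = negative), made up for illustration.
pr_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
pr_labels = [1, 1, 0, 1, 0, 0]
pr_curve = pr_points(pr_scores, pr_labels)
aupr = step_area(pr_curve)
```

Unlike the FPR, precision depends on how many negatives are ranked near the top relative to the positives, which is why the AUPR is the more demanding measure when negatives greatly outnumber positives.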
Given the 5-fold cross-validation, we adopt the average AUC/AUPR over the five folds to evaluate the performance of different LDA prediction models. Moreover, to get reliable results, we repeated each experiment 10 times and took the average over the 10 repetitions as the final evaluation result.
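The fold-and-repeat averaging can be sketched as below. This is a simplified illustration under stated assumptions: `evaluate_fold` is a hypothetical callback that would train a model on the training part and return an AUC or AUPR, the placeholder evaluator only reports the held-out fraction, and the unlabelled pairs that the text adds to each test set are omitted.

```python
import random
from statistics import mean


def split_into_folds(sample_ids, n_folds, rng):
    """Shuffle the samples and deal them into n_folds nearly equal parts."""
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    return [shuffled[k::n_folds] for k in range(n_folds)]


def repeated_cv(sample_ids, evaluate_fold, n_folds=5, n_repeats=10, seed=0):
    """Average a per-fold metric over the folds, then over the repetitions."""
    rng = random.Random(seed)
    repeat_means = []
    for _ in range(n_repeats):
        folds = split_into_folds(sample_ids, n_folds, rng)
        fold_scores = []
        for k, held_out in enumerate(folds):
            train = [s for j, fold in enumerate(folds) if j != k for s in fold]
            fold_scores.append(evaluate_fold(train, held_out))
        repeat_means.append(mean(fold_scores))
    return mean(repeat_means)


def held_out_fraction(train, held_out):
    # Placeholder metric: a real evaluator would train and score a model here.
    return len(held_out) / (len(train) + len(held_out))


# 2697 positive plus 2697 negative samples give 5394 labelled samples.
result = repeated_cv(range(5394), held_out_fraction)
```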
Performance comparison with other prediction models
To show the prediction ability of the RFLDA, we compare it with several excellent LDA prediction models, such as SIMCLDA [33], Ping's method [18], MFLDA [31], LDAP [39], CNNLDA [44], and GCNLDA [45]. The AUCs and AUPRs of all LDA prediction models are shown in Table 1. The ROC curves of different LDA prediction models are shown in Fig. 2. The AUCs and AUPRs of the RFLDA in each cross-validation are listed in Table S3.
As one can see, the RFLDA achieves an AUC of 0.976 (±0.0002) on all 412 tested diseases, which is higher than that of every other method in the comparison. It outperforms SIMCLDA by 31%, Ping's method by 12%, MFLDA by 56%, LDAP by 13%, CNNLDA by 3% and GCNLDA by 2%. Moreover, the RFLDA achieves an AUPR of 0.779 (±0.0297) on all 412 tested diseases, which is also higher than that of every other method in the comparison. Specifically, it outperforms SIMCLDA by 720%, Ping's method by 256%, MFLDA by 1080%, LDAP by 369%, CNNLDA by 210% and GCNLDA by 249%. The comparison results indicate that the RFLDA has excellent ability for LDA prediction. It should be noted that the AUCs and AUPRs in Table 1 of the six models other than the RFLDA are taken from Xuan et al.'s work [44, 45].
Case studies
To further show the ability of the RFLDA to identify new disease-associated lncRNAs, case studies on stomach cancer, lung cancer and colon cancer were conducted. First, we trained the RFLDA on a sample set that did not contain any validated associations between lncRNAs and the investigated diseases. Here, all known lncRNA-disease associations from Fu et al.'s previous work [31], except those involving the investigated diseases, were taken as positive samples to train the random forest prediction model, and all unconfirmed lncRNA-disease pairs were used as test samples. Then, we scored and sorted all unconfirmed lncRNA-stomach/lung/colon cancer samples. Finally, we validated the predicted lncRNAs associated with stomach/lung/colon cancer against the records in Lnc2Cancer (v2.0) [51] and LncRNADisease (v2.0) [52], and against published literature [53,54,55,56]. Lnc2Cancer is a manually curated lncRNA-cancer association database, which stores 4986 experiment-validated associations between 165 cancers and 1614 lncRNAs. LncRNADisease is a manually curated lncRNA-disease association database, which stores 10,564 experiment-validated LDAs and 195,395 LDAs predicted by excellent LDA prediction methods. The top 15 lncRNAs predicted by the RFLDA to be associated with each of the three cancers are shown in Table 2.
As one can see from Table 2, 14 of the top 15 stomach cancer-associated lncRNAs predicted by the RFLDA are supported by the experimental data or the published literature, and the remaining one (MIR155HG) is supported by other LDA prediction models. Specifically, MIR17HG has been shown to be abnormally regulated in stomach cancer in published literature [53]. In addition, 14 of the top 15 lung cancer-associated lncRNAs predicted by the RFLDA are supported by the experimental data or the published literature, and the remaining one (MIR100HG) is supported by other LDA prediction models. Specifically, HULC has been found to be dysregulated in lung cancer in published literature [54]; Cheng et al. discovered that PRNCR1 could upregulate HEY2 by competitively binding miR-448 to promote tumor progression in non-small cell lung cancer [55]. Moreover, all of the top 15 colon cancer-associated lncRNAs predicted by the RFLDA are supported by the experimental data or the published literature. Specifically, Xu et al. discovered that MIR17HG was upregulated in colorectal cancer tissue and could promote metastasis and tumorigenesis of colorectal cancer cells [56]. In summary, 43 of the top 45 lncRNAs predicted by the RFLDA to be associated with the three cancers are supported by the experimental data in the Lnc2Cancer database, the LncRNADisease database or the published literature, and the remaining 2 lncRNAs are supported by other LDA prediction models. Therefore, the case studies show that the RFLDA has excellent ability for LDA prediction.
Besides the three diseases analyzed in the case studies, the RFLDA is also used to predict potentially associated lncRNAs for the other 409 diseases in this research. The top 50 lncRNAs predicted by the RFLDA for each of the 412 diseases are listed in Table S4, which contains three columns: name of disease, name of lncRNA, and the association score predicted by the RFLDA.
Discussion
Increasing evidence suggests that the dysregulation of some lncRNAs is involved in many complex human diseases. Accurately discovering lncRNAs associated with diseases helps to explore the pathogenesis of diseases and appropriate treatment options. Because of the high cost of experimental methods for identifying disease-associated lncRNAs, researchers have proposed a series of computational models for LDA prediction. However, most of the existing models ignore the interference of noisy and redundant information among multiple data resources. To improve the performance of LDA prediction models, we developed a random forest and feature selection based LDA prediction model (RFLDA). The AUC and AUPR under 5-fold cross-validation show that the RFLDA is better than several excellent LDA prediction models, including SIMCLDA, Ping's method, MFLDA, LDAP, CNNLDA and GCNLDA. Moreover, case studies on three cancers show that the RFLDA has excellent ability to identify potential disease-associated lncRNAs.
We identify the following reasons why the RFLDA can achieve better performance. First, the RFLDA integrates multiple types of biological data, including the experiment-supported LDAs, the functional similarity of lncRNAs, the semantic similarity of diseases, the experiment-supported MDAs, and the interactions between lncRNAs and miRNAs. Second, as an excellent machine learning algorithm, random forest has high accuracy and robustness. By combining random resampling with an ensemble of weak classifiers, random forest provides an unbiased estimate of the generalization error and generalizes well. Third, the variable importance score of random forest takes into account not only the effect of an individual feature on the sample prediction but also the joint effect of multiple features. Therefore, the feature selection method based on the random forest variable importance score can effectively identify the most important features for sample prediction.
The RFLDA model has some limitations. First, the RFLDA predicts LDAs using the supervised random forest algorithm, which requires both positive and negative samples. However, it is almost unrealistic to obtain reliable negative samples for LDA prediction, and randomly selecting negative samples may affect the prediction performance of the RFLDA. In addition, limited knowledge about diseases, lncRNAs and miRNAs constrains the prediction performance of the RFLDA. Finally, there are many excellent computational models for association prediction in various fields of computational biology, such as miRNA/lncRNA-disease association prediction [57,58,59,60,61,62], drug-target interaction prediction [63], and synergistic drug combination prediction [64]. These association prediction models could provide valuable insights for the development of new lncRNA-disease association prediction models. Therefore, we will further improve the performance of the LDA prediction model in the future by integrating more biological data and the most advanced algorithmic ideas from other association prediction tasks.
Conclusion
Accurately identifying disease-associated lncRNAs helps to explore the functional mechanisms of lncRNAs in diseases. Predicting disease-associated lncRNAs by computational methods is an efficient means of doing so. In this work, we developed a random forest and feature selection based LDA prediction model by integrating the LFS, the DSS, the experiment-supported LDAs, the experiment-supported MDAs, and the miRNA-lncRNA interactions. Feature selection based on the variable importance score of random forest was used to choose the more useful features for training the LDA prediction model. A random forest regression model was then trained to predict potential LDAs. Cross-validation and case studies show that the RFLDA outperforms several excellent LDA prediction models. Therefore, we anticipate that the RFLDA can support future studies of the mechanisms of lncRNAs in diseases.
Methods
Datasets for LDA prediction
The datasets used for constructing the RFLDA model include the experiment-supported LDAs and MDAs, and the LMI. All of these datasets come from Fu et al.'s previous study on LDA prediction [31]. Specifically, the 2697 experiment-supported LDAs were originally collected from the Lnc2Cancer [51], LncRNADisease [52] and GeneRIF [65] databases. In addition, the 13,562 experiment-supported MDAs originally come from the HMDD (v2.0) [66] database. Moreover, the 1002 LMI originally come from the starBase [67] database. In summary, these datasets cover 240 lncRNAs, 495 miRNAs and 412 diseases.
Representation of LDA and MDA
The LDAs are represented by a 240 × 412 adjacency matrix LD (Fig. 1a). According to the 2697 experiment-supported LDAs, the element LD(l(i), d(j)) is set to 1 if lncRNA l(i) has been confirmed to be related to disease d(j), and to 0 otherwise. Similarly, the MDAs are represented by a 495 × 412 adjacency matrix MD (Fig. 1c). According to the 13,562 experiment-supported MDAs, the element MD(m(i), d(j)) is set to 1 if miRNA m(i) has been validated to be related to disease d(j), and to 0 otherwise.
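A minimal sketch of building such a 0/1 adjacency matrix from a list of known pairs is given below. The helper name `build_adjacency` and the toy lncRNA and disease names are hypothetical placeholders, not entries from the datasets.

```python
def build_adjacency(row_names, col_names, pairs):
    """Return a 0/1 matrix with 1 where (row, col) is a known association."""
    row_index = {name: i for i, name in enumerate(row_names)}
    col_index = {name: j for j, name in enumerate(col_names)}
    matrix = [[0] * len(col_names) for _ in row_names]
    for row_name, col_name in pairs:
        matrix[row_index[row_name]][col_index[col_name]] = 1
    return matrix


# Placeholder names; the real LD matrix has 240 rows and 412 columns.
lncrnas = ["lnc1", "lnc2", "lnc3"]
diseases = ["d1", "d2"]
known_ldas = [("lnc1", "d1"), ("lnc2", "d2"), ("lnc3", "d2")]
LD = build_adjacency(lncrnas, diseases, known_ldas)
```

The same helper would serve for the MD matrix by passing miRNA names as rows and the known MDAs as pairs.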
Representation of DSS
Under the supposition that two analogous diseases tend to be related to analogous lncRNAs, disease similarities are integrated into the RFLDA for LDA prediction. Disease Ontology (DO) [68] adopts a type of semantic association (the 'IS_A' relationship) to represent the association between disease terms. According to the 'IS_A' relationships between disease terms, we can use a DAG to represent a disease D. In DAG(D), the vertices represent disease D and all of its ancestral disease terms, and each directed edge represents an 'IS_A' relationship linking two diseases. Under the supposition that the more disease terms two diseases share, the more similar they are, the DSS can be calculated from their DAGs. Here, we calculate disease semantic similarities by Wang et al.'s method [69]. Specifically, the semantic value of a disease D, DV(D), is calculated by eq. 5.
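$$DV(D) = \sum_{d \in S(D)} DC_{D}(d) \tag{5}$$
where S(D) represents the node set of DAG(D), and DC_{D}(d) represents the contribution of a disease d in DAG(D) to the semantic value of disease D, which is calculated by eq. 6.
$$DC_{D}(d) = \begin{cases} 1, & d = D \\ \max\{\Delta \cdot DC_{D}(d^{\prime}) \mid d^{\prime} \in \text{children of } d\}, & d \neq D \end{cases} \tag{6}$$
where ∆ is the attenuation coefficient of semantic contribution and is equal to 0.5 by default. As can be seen from eq. 6, the contribution of disease D to itself is equal to 1, while the contribution of other diseases to disease D decreases as the path length between them increases. Then, the DSS between d(i) and d(j), DS(d(i), d(j)), is calculated by eq. 7.
$$DS(d(i), d(j)) = \frac{\sum_{t \in S(d(i)) \cap S(d(j))} \left( DC_{d(i)}(t) + DC_{d(j)}(t) \right)}{DV(d(i)) + DV(d(j))} \tag{7}$$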
In this work, we calculate the disease semantic similarities between the 412 diseases using the DincRNA online toolkit [70], and represent them by a 412 × 412 similarity matrix DD (Fig. 1b), where the element DD(d(i), d(j)) represents the DSS of d(i) and d(j).
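To make the semantic similarity calculation concrete, the sketch below evaluates it on a tiny, made-up 'IS_A' hierarchy. It is an illustration only and not the DincRNA implementation: the term names, the `parents` mapping (child term to its parent terms) and all function names are hypothetical.

```python
def contributions(disease, parents, delta=0.5):
    """Contribution of each term in DAG(disease) to the disease's semantic value.

    The disease contributes 1 to itself; every ancestor receives delta times
    the largest contribution among its children within the DAG.
    """
    contrib = {disease: 1.0}
    stack = [disease]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, []):
            candidate = delta * contrib[node]
            if candidate > contrib.get(parent, 0.0):
                contrib[parent] = candidate
                stack.append(parent)  # revisit so the update propagates upward
    return contrib


def semantic_similarity(d_i, d_j, parents, delta=0.5):
    """Shared contributions divided by the sum of the two semantic values."""
    c_i = contributions(d_i, parents, delta)
    c_j = contributions(d_j, parents, delta)
    shared = set(c_i) & set(c_j)
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    return numerator / (sum(c_i.values()) + sum(c_j.values()))


# Hypothetical hierarchy: B IS_A A, C IS_A A, A IS_A root.
toy_parents = {"A": ["root"], "B": ["A"], "C": ["A"]}
ds_bc = semantic_similarity("B", "C", toy_parents)
ds_bb = semantic_similarity("B", "B", toy_parents)
```

In this toy hierarchy, B and C each have a semantic value of 1.75 and share the ancestors A and root, which gives a similarity of 1.5 / 3.5 = 3/7; a disease compared with itself scores 1.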
Representation of LFS
Based on the supposition that two lncRNAs associated with analogous diseases may have analogous functions, the LFS can be computed from the diseases associated with them. Here, we calculate lncRNA functional similarities by Chen et al.'s method [36]. We assume that lncRNA l(a) is related to a group of diseases DG(a) = {d(a1), d(a2), …, d(am)}, and that lncRNA l(b) is related to a group of diseases DG(b) = {d(b1), d(b2), …, d(bn)}. The LFS between l(a) and l(b), denoted as LS(l(a), l(b)), can then be obtained by calculating the similarity between DG(a) and DG(b) by eq. 8.
$$LS(l(a), l(b)) = \frac{\sum_{i=1}^{m} \max_{1 \le j \le n} DS(d(ai), d(bj)) + \sum_{j=1}^{n} \max_{1 \le i \le m} DS(d(bj), d(ai))}{m + n} \tag{8}$$
where DS(d(ai), d(bj)) is the semantic similarity between the disease d(ai) in DG(a) and the disease d(bj) in DG(b), and m and n are the numbers of diseases in DG(a) and DG(b), respectively. In this work, the LFS is represented by a 240 × 240 similarity matrix LL (Fig. 1a), where the element LL(l(i), l(j)) represents the LFS of l(i) and l(j).
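The group-to-group similarity can be sketched as follows: each disease in one group is matched to its most similar disease in the other group, and the best-match scores are averaged over both groups. The function name, the toy similarity table, and the convention of returning 0 for an lncRNA without known diseases are assumptions made for illustration.

```python
def lncrna_functional_similarity(group_a, group_b, ds):
    """Average best-match semantic similarity between two disease groups."""
    if not group_a or not group_b:
        return 0.0  # assumed convention for an lncRNA with no known diseases
    best_a = sum(max(ds(a, b) for b in group_b) for a in group_a)
    best_b = sum(max(ds(b, a) for a in group_a) for b in group_b)
    return (best_a + best_b) / (len(group_a) + len(group_b))


# Made-up symmetric similarities between three placeholder diseases.
toy_ds = {
    frozenset(["d1", "d2"]): 0.5,
    frozenset(["d1", "d3"]): 0.2,
    frozenset(["d2", "d3"]): 0.4,
}


def ds(x, y):
    return 1.0 if x == y else toy_ds[frozenset([x, y])]


ls_ab = lncrna_functional_similarity(["d1", "d2"], ["d2", "d3"], ds)
```

Here the best matches are 0.5 and 1.0 for the first group and 1.0 and 0.4 for the second, so the similarity is (1.5 + 1.4) / 4 = 0.725.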
Representation of LMI
Cumulative evidence indicates that lncRNAs can interact with their corresponding miRNAs and perform biological functions together with these miRNAs [71]. Therefore, the LMI are integrated into the RFLDA model for lncRNA-disease association prediction, and are represented by a 240 × 495 adjacency matrix LM (Fig. 1c). According to the 1002 LMI extracted from the starBase database, the element LM(l(i), m(j)) is set to 1 if there is an interaction between miRNA m(j) and lncRNA l(i), and to 0 otherwise.
Construction of the RFLDA model
The RFLDA model is constructed in four steps (see Fig. 3): (1) sample representation; (2) training sample set construction; (3) feature selection; (4) random forest construction and LDA prediction. Next, we describe each step in detail.
Sample representation
In our RFLDA model, we take an lncRNA-disease pair as a sample. By integrating the functional similarity of lncRNAs (Fig. 2a), the experiment-supported associations between lncRNAs and diseases (Fig. 2a), the semantic similarity of diseases (Fig. 2b), the interactions between lncRNAs and miRNAs (Fig. 2c), and the experiment-supported associations between miRNAs and diseases (Fig. 2c), we use a 1147-dimensional feature vector to represent an lncRNA and a disease respectively. Therefore, a sample can be represented by a 2294-dimensional feature vector (Fig. 2d), denoted as F, which is given in detail by eq. 9.
$$F = (f_{1}, f_{2}, \cdots, f_{2294}) \tag{9}$$
where (f_{1}, f_{2}, ⋯, f_{240}) represents the 240 lncRNA-lncRNA similarities, (f_{241}, ⋯, f_{652}) represents the 412 lncRNA-disease associations, (f_{653}, ⋯, f_{1147}) represents the 495 lncRNA-miRNA interactions, (f_{1148}, ⋯, f_{1387}) represents the 240 disease-lncRNA associations, (f_{1388}, ⋯, f_{1799}) represents the 412 disease-disease similarities, and (f_{1800}, ⋯, f_{2294}) represents the 495 disease-miRNA associations. Finally, we normalized f_{i} to f_{i}^{′} by eq. 10.
$$f_{i}^{\prime} = \frac{f_{i} - f_{min}}{f_{max} - f_{min}} \tag{10}$$
where f_{max} and f_{min} are the maximum and the minimum of f_{i} (i = 1, 2, …, 2294) over all samples.
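The assembly of a sample vector and the min-max normalization can be sketched on toy matrices as below. The matrix names follow the text (LL, LD, LM, DD, MD), but the values, the function names, and the choice of mapping a constant feature to 0 are illustrative assumptions.

```python
def pair_features(i, j, LL, LD, LM, DD, MD):
    """Concatenate the lncRNA-side and disease-side descriptors of pair (i, j)."""
    # lncRNA side: similarities, disease associations, miRNA interactions.
    lnc_part = LL[i] + LD[i] + LM[i]
    # Disease side: lncRNA associations, similarities, miRNA associations.
    dis_part = [row[j] for row in LD] + DD[j] + [row[j] for row in MD]
    return lnc_part + dis_part


def min_max_normalise(samples):
    """Rescale every feature to [0, 1] using its min and max over all samples."""
    lows = [min(column) for column in zip(*samples)]
    highs = [max(column) for column in zip(*samples)]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0  # constant feature: assumed 0
         for v, lo, hi in zip(row, lows, highs)]
        for row in samples
    ]


# Toy data: 2 lncRNAs, 2 diseases, 1 miRNA (values made up for illustration).
LL = [[1.0, 0.3], [0.3, 1.0]]   # lncRNA x lncRNA similarity
LD = [[1, 0], [0, 1]]           # lncRNA x disease association
LM = [[1], [0]]                 # lncRNA x miRNA interaction
DD = [[1.0, 0.6], [0.6, 1.0]]   # disease x disease similarity
MD = [[0, 1]]                   # miRNA x disease association

samples = [pair_features(i, j, LL, LD, LM, DD, MD)
           for i in range(2) for j in range(2)]
normalised = min_max_normalise(samples)
```

With 2 lncRNAs, 2 diseases and 1 miRNA, each side has 2 + 2 + 1 = 5 entries and each sample has 10; the same concatenation with 240 lncRNAs, 412 diseases and 495 miRNAs gives 1147 entries per side and 2294 per sample.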
Training sample set construction
First, the 2697 experiment-supported LDAs were used as positive samples, and all lncRNA-disease pairs not validated by experiments were taken as unlabelled samples. In addition, 2697 randomly selected unlabelled samples were taken as negative samples. Finally, all negative samples and positive samples were combined to form the training sample set.
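The construction of a balanced training set can be sketched as follows. The helper name, the fixed seed and the toy pairs are assumptions made for illustration; the text does not state how the random selection was seeded.

```python
import random


def build_training_set(all_pairs, positive_pairs, seed=0):
    """Label known pairs 1 and an equal number of random unlabelled pairs 0."""
    positives = set(positive_pairs)
    unlabelled = [pair for pair in all_pairs if pair not in positives]
    # Draw as many negatives as there are positives, without replacement.
    negatives = random.Random(seed).sample(unlabelled, len(positives))
    training = [(pair, 1) for pair in sorted(positives)]
    training += [(pair, 0) for pair in negatives]
    return training, unlabelled


# Toy universe: 4 lncRNAs x 3 diseases, with 3 known associations.
all_pairs = [(l, d) for l in range(4) for d in range(3)]
positive_pairs = [(0, 0), (1, 2), (3, 1)]
training, unlabelled = build_training_set(all_pairs, positive_pairs)

# At full scale: 240 x 412 = 98,880 pairs, of which 98,880 - 2697 are unlabelled.
n_unlabelled_full = 240 * 412 - 2697
```

Because the negatives are drawn from unlabelled pairs, some of them may be true associations that have not yet been confirmed.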
Feature selection based on variable importance score of random forest
Random forest (RF) [72] is an integrated machine learning algorithm proposed by Breiman in 2001, which combines Bagging technology and random subspace method to realize randomness and diversity between base classifiers. First, RF randomly selects multiple samples from the original sample set with replacement using the Bootstrap technology. Then, it constructs a decision tree on each Bootstrap sample set. In the process of training the decision tree, it randomly selects a feature from a feature set for node splitting at each node by random subspace method. Finally, it combines multiple decision trees and determines the classification or prediction results by majority vote. Compared with other machine learning algorithms, RF has many advantages: (1) it can process a variety of data types, including qualitative data or quantitative data; (2) it provides a measure of the variable importance, which provides an easy way to understand the relative importance of features for classification or prediction model; (3) it has high classification accuracy; (4) it has good robustness for noise data and data with missing values; (5) it has ability to analyse complex interactions between features; (6) it has a fast learning speed with the increase of the number of input variables [73]. In recent years, RF has been widely used in a variety of classification and prediction problems, such as DNA binding protein recognition [74], genetic polymorphism recognition [75], prediction of medium and longterm chaotic regions of protein sequences [76], differential expression analysis of microarray data [77], miRNAdisease association prediction [78], etc.
In this work, each sample has 2294 features, many of which carry noisy or redundant information. To improve prediction performance while reducing computational cost, we performed feature selection before training the LDA prediction model. First, we removed the 342 features whose values were 0 in all samples, leaving 1952 features. Then, we ranked the remaining features by the random forest variable importance score, which is calculated as the average decrease in the classification accuracy of the random forest model on the out-of-bag (OOB) samples after the values of the variable are slightly perturbed [77]. Because this score takes into account not only the impact of each individual variable on the response variable but also the joint effect of multiple variables on the response variable, it is often used to rank features and select the more important ones [78]. In the RFLDA, we first trained a random forest model on the original training sample set described by the 1952 features and computed the variable importance scores of all features; then, we ranked the 1952 features in descending order of their scores; finally, we selected the top 300 features to represent the training samples. To obtain reliable results, we computed the variable importance scores of all features 10 times and selected features according to the average score of each feature.
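The selection procedure in this paragraph (drop all-zero features, average the importance scores over repeated runs, keep the top-ranked features) can be sketched as follows. This is a minimal outline under stated assumptions: `importance_fn` is a hypothetical callable standing in for the random forest importance computation, and `toy_importance` is a deliberately simple substitute (class-mean gap plus small noise), not the OOB-based score used in the paper.

```python
import random

def drop_all_zero_features(X):
    """Return the indices of columns that are non-zero in at least one sample."""
    return [j for j in range(len(X[0])) if any(row[j] != 0 for row in X)]

def select_top_features(X, y, importance_fn, n_repeats=10, top_k=300):
    """Average importance over n_repeats runs and keep the top_k original indices."""
    kept = drop_all_zero_features(X)
    reduced = [[row[j] for j in kept] for row in X]
    totals = [0.0] * len(kept)
    for run in range(n_repeats):
        scores = importance_fn(reduced, y, seed=run)
        totals = [t + s for t, s in zip(totals, scores)]
    mean_scores = [t / n_repeats for t in totals]
    ranked = sorted(range(len(kept)), key=lambda i: mean_scores[i], reverse=True)
    return [kept[i] for i in ranked[:top_k]]

def toy_importance(X, y, seed):
    """Stand-in score (NOT RF importance): class-mean gap plus small random noise."""
    rng = random.Random(seed)
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        gap = abs(sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg))
        scores.append(gap + rng.uniform(0, 0.01))
    return scores

# Toy data: column 1 is zero everywhere and is removed before ranking.
X = [[1.0, 0.0, 0.2, 0.5],
     [0.9, 0.0, 0.1, 0.4],
     [0.1, 0.0, 0.3, 0.5],
     [0.0, 0.0, 0.2, 0.9]]
y = [1, 1, 0, 0]
selected = select_top_features(X, y, toy_importance, n_repeats=10, top_k=2)
```

Averaging over repeated runs matters because each forest is grown from different bootstrap samples and feature subsets, so a single run can rank features of similar importance in a different order.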
Random forest construction and LDA prediction
In the last step of the RFLDA, we first constructed a random forest regression model on the training sample set represented by the top 300 most important features, using the randomForest package in R. In the training sample set, each positive sample was labelled 1 and each negative sample was labelled 0. Then, we used the trained random forest model to score unconfirmed lncRNA-disease pairs. The larger the score of an lncRNA-disease pair, the more likely the lncRNA and the disease are associated. Note that the two main parameters of the random forest algorithm were set according to the recommended values: mtry was set to one third of the number of features and ntree was set to 500.
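As a rough illustration of the scoring step, the sketch below computes the mtry value implied by the "number of features / 3" rule and ranks candidate pairs by the mean output of a set of trees. It is not the authors' R code: the three lambda "trees", the pair names and the feature vectors are toy stand-ins invented for the example, and a real forest would contain ntree = 500 fitted regression trees.

```python
def regression_mtry(n_features):
    """One third of the feature count, rounded down, with a floor of 1."""
    return max(n_features // 3, 1)

def forest_score(trees, features):
    """A regression forest outputs the mean of its trees' predictions."""
    return sum(tree(features) for tree in trees) / len(trees)

def rank_candidates(trees, candidates):
    """Score every unconfirmed pair and sort from most to least likely."""
    scored = [(pair, forest_score(trees, x)) for pair, x in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy stand-ins for fitted trees: each maps a feature vector to 0.0 or 1.0.
trees = [
    lambda x: 1.0 if x[0] > 0.5 else 0.0,
    lambda x: 1.0 if x[1] > 0.3 else 0.0,
    lambda x: 1.0 if x[0] + x[1] > 1.0 else 0.0,
]

# Hypothetical unconfirmed pairs with two-dimensional feature vectors.
candidates = {
    ("lncRNA_A", "disease_X"): [0.9, 0.8],
    ("lncRNA_B", "disease_X"): [0.6, 0.1],
    ("lncRNA_C", "disease_Y"): [0.2, 0.2],
}

mtry = regression_mtry(300)
ranking = rank_candidates(trees, candidates)
```

With 300 selected features the rule gives mtry = 100. Because the training labels are 0 and 1 and the forest averages tree outputs, each score lies between 0 and 1 and can be read as a relative confidence rather than a calibrated probability.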
Availability of data and materials
The following are available online: Table S1: The variable importance scores of all 1952 features. Table S2: The prediction accuracy of all random forest models on different training samples. Table S3: AUCs and AUPRs of the RFLDA in each cross-validation. Table S4: The predicted top 50 lncRNAs associated with all 412 diseases by the RFLDA. The original data and code of the RFLDA are available at: https://github.com/ydkvictory/RFLDA
Abbreviations
lncRNA: Long noncoding RNA
LDA: lncRNA-disease association
RF: Random forest
MDA: miRNA-disease association
DSS: Disease semantic similarity
LFS: lncRNA functional similarity
LMI: lncRNA-miRNA interactions
AUC: The area under the receiver operating characteristic curve
AUPR: The area under the precision-recall curve
RWR: Random walk with restart
DAG: Directed acyclic graph
GIPKS: Gaussian interaction profile kernel similarity
FPR: False positive rate
TPR: True positive rate
TP: True positive
FN: False negative
FP: False positive
TN: True negative
ROC: Receiver operating characteristic
DO: Disease ontology
OOB: Out-of-bag
References
Ponting CP, Oliver PL, Reik W. Evolution and functions of long noncoding RNAs. Cell. 2009;136(4):629–41.
Lu Q, Ren S, Lu M, Zhang Y, Zhu D, Zhang X, Li T. Computational prediction of associations between long noncoding RNAs and proteins. BMC Genomics. 2013;14:651.
Li J, Xuan Z, Liu C. Long noncoding RNAs and complex human diseases. Int J Mol Sci. 2013;14(9):18790–808.
Chen X, Sun YZ, Guan NN, Qu J, Huang ZA, Zhu ZX, Li JQ. Computational models for lncRNA function prediction and functional similarity calculation. Brief Funct Genomics. 2019;18(1):58–82.
Chen X, Yan CC, Zhang X, You ZH. Long noncoding RNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2017;18(4):558–76.
Zhang X, Zhou Y, Mehta KR, Danila DC, Scolavino S, Johnson SR, Klibanski A. A pituitary-derived MEG3 isoform functions as a growth suppressor in tumor cells. J Clin Endocrinol Metab. 2003;88(11):5119–26.
Faghihi MA, Modarresi F, Khalil AM, Wood DE, Sahagan BG, Morgan TE, Finch CE, Laurent GS III, Kenny PJ, Wahlestedt C. Expression of a noncoding RNA is elevated in Alzheimer's disease and drives rapid feedforward regulation of β-secretase. Nat Med. 2008;14:723–30.
Congrains A, Kamide K, Oguro R, Yasuda O, Miyata K, Yamamoto E, Kawai T, Kusunokif H, Yamamoto H, Takeya Y, Yamamoto K, Onishia M, Sugimoto K, Katsuya T, Awata N, Ikebe K, Gondo Y, Oike Y, Ohishi M, Rakugi H. Genetic variants at the 9p21 locus contribute to atherosclerosis through modulation of ANRIL and CDKN2A/B. Atherosclerosis. 2012;220(2):449–55.
Johnson R. Long noncoding RNAs in Huntington's disease neurodegeneration. Neurobiol Dis. 2012;46(2):245–54.
Sun J, Shi HB, Wang ZZ, Zhang CJ, Liu L, Wang LT, He WW, Hao DP, Liu SL, Zhou M. Inferring novel lncRNA–disease associations based on a random walk model of a lncRNA functional similarity network. Mol BioSyst. 2014;10:2074–81.
Zhou M, Wang XJ, Li JW, Hao DP, Wang ZZ, Shi HB, Han L, Zhou H, Sun J. Prioritizing candidate disease-related long noncoding RNAs by walking on the heterogeneous lncRNA and disease network. Mol BioSyst. 2015;11:760–9.
Chen X. KATZLDA: KATZ measure for the lncRNA-disease association prediction. Sci Rep. 2015;5:16840.
Chen X, You ZH, Yan GY, Gong DW. IRWRLDA: improved random walk with restart for lncRNA-disease association prediction. Oncotarget. 2016;7(36):57919–31.
Yu GX, Fu GY, Lu C, Ren Y, Wang J. BRWLDA: bi-random walks for predicting lncRNA-disease associations. Oncotarget. 2017;8(36):60429–46.
Gu CL, Liao B, Li XY, Cai LJ, Li ZJ, Li KQ, Yang JL. Global network random walk for predicting potential human lncRNA-disease associations. Sci Rep. 2017;7:12442.
Zhang J, Zhang Z, Chen Z, Deng L. Integrating multiple heterogeneous networks for novel lncRNA-disease association inference. IEEE/ACM Trans Comput Biol Bioinform. 2017;16(2):396–406.
Xiao XF, Zhu W, Liao B, Xu JL, Gu CL, Ji BB, Yao YH, Peng LH, Yang JL. BPLLDA: predicting lncRNA-disease associations based on simple paths with limited lengths on a heterogeneous network. Front Genet. 2018;9:411.
Ping PY, Wang L, Kuang LN, Ye ST, Iqbal MFB, Pei TR. A novel method for lncRNA-disease association prediction based on an lncRNA-disease association network. IEEE/ACM Trans Comput Biol Bioinform. 2019;16(2):688–93.
Fan XN, Zhang SW, Zhang SY, Zhu K, Lu S. Prediction of lncRNA-disease associations by integrating diverse heterogeneous information sources with RWR algorithm and positive pointwise mutual information. BMC Bioinformatics. 2019;20:87.
Liu Y, Feng X, Zhao HC, Xuan ZW, Wang L. A novel network-based computational model for prediction of potential LncRNA-disease association. Int J Mol Sci. 2019;20(7):1549.
Li JC, Zhao HC, Xuan ZW, Yu JW, Feng X, Liao B, Wang L. A novel approach for potential human LncRNA-disease association prediction based on local random walk. IEEE/ACM Trans Comput Biol Bioinform. 2019.
Sumathipala M, Maiorino E, Weiss ST, Sharma A. Network diffusion approach to predict lncRNA disease associations using multi-type biological networks: LION. Front Physiol. 2019;10:888.
Zhang H, Liang YC, Peng C, Han SY, Du W, Li Y. Predicting lncRNA-disease associations using network topological similarity based on deep mining heterogeneous networks. Math Biosci. 2019;315:108229.
Xie GB, Meng TF, Luo Y, Liu ZG. SKFLDA: similarity kernel fusion for predicting lncRNA-disease association. Mol Ther-Nucl Acids. 2019;18:45–55.
Liu MX, Chen X, Chen G, Cui QH, Yan GY. A computational framework to infer human disease-associated long noncoding RNAs. PLoS One. 2014;9(1):e84408.
Li JW, Gao C, Wang YC, Ma W, Tu J, Wang JP, Chen ZZ, Kong W, Cui QH. A bioinformatics method for predicting long noncoding RNAs associated with vascular disease. Sci China Life Sci. 2014;57:852–7.
Chen X. Predicting lncRNA-disease associations and constructing lncRNA functional similarity network based on the information of miRNA. Sci Rep. 2015;5:13186.
Wang JY, Ma RX, Ma W, Chen J, Yang JC, Xi YG, Cui QH. LncDisease: a sequence based bioinformatics tool for predicting lncRNA-disease associations. Nucleic Acids Res. 2016;44(9):e90.
Cheng L, Shi HB, Wang ZZ, Hu Y, Yang HX, Zhou C, Sun J, Zhou M. IntNetLncSim: an integrative network analysis method to infer human lncRNA functional similarity. Oncotarget. 2016;7(30):47864–74.
Wang P, Guo QY, Gao Y, Zhi H, Zhang Y, Liu Y, Zhang JZ, Yue M, Guo MN, Ning SW, Zhang GM, Li X. Improved method for prioritization of disease associated lncRNAs based on ceRNA theory and functional genomics data. Oncotarget. 2017;8(3):4642–55.
Fu GY, Wang J, Domeniconi C, Yu GX. Matrix factorization-based data fusion for the prediction of lncRNA–disease associations. Bioinformatics. 2018;34(9):1529–37.
Ding L, Wang MH, Sun DD, Li A. TPGLDA: novel prediction of associations between lncRNAs and diseases via lncRNA-disease-gene tripartite graph. Sci Rep. 2018;8:1065.
Lu CQ, Yang MY, Luo F, Wu FX, Li M, Pan Y, Li YH, Wang JX. Prediction of lncRNA–disease associations based on inductive matrix completion. Bioinformatics. 2018;34(19):3357–64.
Wang YH, Yu GX, Wang J, Fu GY, Guo MZ, Domeniconi C. Weighted matrix factorization on multi-relational data for LncRNA-disease association prediction. Methods. 2020;173:32–43.
Chen X, Yan GY. Novel human lncRNA-disease association inference based on lncRNA expression profiles. Bioinformatics. 2013;29(20):2617–24.
Chen X, Yan CC, Luo C, Ji W, Zhang Y, Dai Q. Constructing lncRNA functional similarity network based on lncRNA-disease associations and disease semantic similarity. Sci Rep. 2015;5:11338.
Huang YA, Chen X, You ZH, Huang DS, Chan KC. ILNCSIM: improved lncRNA functional similarity calculation model. Oncotarget. 2016;7(18):25902–14.
Zhao TT, Xu JY, Liu L, Bai J, Xu CH, Xiao Y, Li X, Zhang LM. Identification of cancer-related lncRNAs through integrating genome, regulome and transcriptome features. Mol BioSyst. 2015;11:126–36.
Lan W, Li M, Zhao KJ, Liu J, Wu FX, Pan Y, Wang JX. LDAP: a web server for lncRNA-disease association prediction. Bioinformatics. 2017;33(3):458–60.
Yu JW, Xuan ZW, Feng X, Zou Q, Wang L. A novel collaborative filtering model for LncRNA-disease association prediction based on the Naïve Bayesian classifier. BMC Bioinformatics. 2019;20:396.
Guo ZH, You ZH, Wang YB, Yi HC, Chen ZH. A Learning-Based Method for LncRNA-Disease Association Identification Combing Similarity Information and Rotation Forest. iScience. 2019;19:786–95.
Chen QF, Lai DH, Lan W, Wu XM, Chen BS, Chen YPP, Wang JX. ILDMSF: Inferring Associations between Long noncoding RNA and Disease Based on Multi-similarity Fusion. IEEE/ACM Trans Comput Biol Bioinform. 2019.
Guo ZH, Yi HC, You ZH. Construction and comprehensive analysis of a molecular association network via lncRNA-miRNA-disease-drug-protein graph. Cells. 2019;8(8):866.
Xuan P, Cao YK, Zhang TG, Kong R, Zhang ZG. Dual convolutional neural networks with attention mechanisms based method for predicting disease-related lncRNA genes. Front Genet. 2019;10:416.
Xuan P, Pan SX, Zhang TG, Liu Y, Sun H. Graph convolutional network and convolutional neural network based method for predicting lncRNA-disease associations. Cells. 2019;8(9):1012.
Xuan P, Sheng N, Zhang TG, Liu Y, Guo YH. CNNDLP: a method based on convolutional autoencoder and convolutional neural network with adjacent edge attention for predicting lncRNA–disease associations. Int J Mol Sci. 2019;20(17):4260.
Xuan P, Jia L, Zhang TG, Sheng N, Li XK, Li JB. LDAPred: a method based on information flow propagation and a convolutional neural network for the prediction of disease-associated lncRNAs. Int J Mol Sci. 2019;20(18):4458.
Chen X, Wang CC, Yin J, You ZH. Novel human miRNA-disease association inference based on random forest. Mol Ther-Nucl Acids. 2018;13:568–79.
Hajian-Tilaki K. Receiver operating characteristic (ROC) curve analysis for medical diagnostic test evaluation. Caspian J Intern Med. 2013;4(2):627–35.
Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3):e0118432.
Ning SW, Zhang JZ, Wang P, Zhi H, Wang JJ, Liu Y, Gao Y, Guo MN, Yue M, Wang LH, Li X. Lnc2Cancer: a manually curated database of experimentally supported lncRNAs associated with various human cancers. Nucleic Acids Res. 2016;44(D1):D980–5.
Chen G, Wang ZY, Wang DQ, Qiu CX, Liu MX, Chen X, Zhang QP, Yan GY, Cui QH. LncRNADisease: a database for long-non-coding RNA-associated diseases. Nucleic Acids Res. 2013;41(D1):D983–6.
Bahari F, Emadi-Baygi M, Nikpour P. miR-17-92 host gene, uderexpressed in gastric cancer and its expression was negatively correlated with the metastasis. Ind J Cancer. 2015;52(1):22–5.
Zhang J, Lu S, Zhu JF, Yang KP. Upregulation of lncRNA HULC predicts a poor prognosis and promotes growth and metastasis in non-small cell lung cancer. Int J Clin Exp Pathol. 2016;9(12):12415–22.
Cheng DZ, Bao CC, Zhang XX, Lin XS, Huang HO, Zhao L. LncRNA PRNCR1 interacts with HEY2 to abolish miR-448-mediated growth inhibition in non-small cell lung cancer. Biomed Pharmacother. 2018;107:1540–7.
Xu J, Meng QT, Li XB, Yang HB, Xu J, Gao N, Sun H, Wu SS, Familiari G, Relucenti M, Zhu HT, Wu J, Chen R. Long noncoding RNA MIR17HG promotes colorectal Cancer progression via miR-17-5p. Cancer Res. 2019;79(19):4882–95.
Chen X, Wang L, Qu J, Guan NN, Li JQ. Predicting miRNA–disease association based on inductive matrix completion. Bioinformatics. 2018;34(24):4256–65.
Chen X, Xie D, Zhao Q, You ZH. MicroRNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2019;20(2):515–39.
Gao YL, Cui Z, Liu JX, Wang J, Zheng CH. NPCMF: nearest profile-based collaborative matrix factorization method for predicting miRNA-disease associations. BMC Bioinformatics. 2019;20:353.
Chen X, Yin J, Qu J, Huang L. MDHGI: matrix decomposition and heterogeneous graph inference for miRNA-disease association prediction. PLoS Comput Biol. 2018;14(8):e1006418.
Yin MM, Cui Z, Gao MM, Liu JX, Gao YL. LWPCMF: logistic weighted profile-based collaborative matrix factorization for predicting MiRNA-disease associations. IEEE/ACM Trans Comput Biol Bioinform. 2019.
Cui Z, Liu JX, Gao YL, Zhu R, Yuan SS. LncRNA-disease associations prediction using bipartite local model with nearest profile-based association inferring. IEEE J Biomed Health Inform. 2019.
Chen X, Yan CC, Zhang X, Zhang X, Dai F, Yin J, Zhang Y. Drug–target interaction prediction: databases, web servers and computational models. Brief Bioinform. 2016;17(4):696–712.
Chen X, Ren B, Chen M, Wang Q, Zhang L, Yan G. NLLSS: predicting synergistic drug combinations based on semi-supervised learning. PLoS Comput Biol. 2016;12(7):e1004975.
Lu ZY, Coben KB, Hunter L. GeneRIF quality assurance as summary revision. Biocomputing. 2007;2007:269–80.
Li Y, Qiu CX, Tu J, Geng B, Yang JC, Jiang TZ, Cui QH. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res. 2014;42(D1):D1070–4.
Li JH, Liu S, Zhou H, Qu LH, Yang JH. starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data. Nucleic Acids Res. 2014;42(D1):D92–7.
Kibbe WA, Arze C, Felix V, Mitraka E, Bolton E, Fu G, Mungall CJ, Binder JX, Malone J, Vasant D, Parkinson H, Schriml LM. Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic Acids Res. 2015;43(D1):D1071–8.
Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF. A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007;23(10):1274–81.
Cheng L, Hu Y, Sun J, Zhou M, Jiang QH. DincRNA: a comprehensive web-based bioinformatics toolkit for exploring disease associations and ncRNA function. Bioinformatics. 2018;34(11):1953–6.
Yang GD, Lu XZ, Yuan LJ. LncRNA: a link between RNA and cancer. Biochim et Biophys Acta. 2014;1839(11):1097–109.
Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
Verikas A, Gelzinis A, Bacauskiene M. Mining data with random forests: a survey and results of new tests. Pattern Recogn. 2011;44(2):330–49.
Nimrod G, Szilágyi A, Leslie C, Ben-Tal N. Identification of DNA-binding proteins using structural, electrostatic and evolutionary features. J Mol Biol. 2009;387(4):1040–53.
Heidema AG, Boer JM, Nagelkerke N, Mariman EC, Feskens EJ. The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC Genet. 2006;7:23.
Han P, Zhang X, Norton RS, Feng ZP. Large-scale prediction of long disordered regions in proteins using random forests. BMC Bioinformatics. 2009;10:8.
Yao DJ, Yang J, Zhan XJ, Zhan XR, Xie ZQ. A novel random forests-based feature selection method for microarray expression data analysis. Int J Data Min Bioin. 2015;13(1):84–101.
Yao DJ, Zhan XJ, Kwoh CK. An improved random forest-based computational model for predicting novel miRNA-disease associations. BMC Bioinformatics. 2019;20:624.
Acknowledgements
Not applicable.
Funding
This work was supported by Innovation Talents Project of Harbin Science and Technology Bureau (2017RAQXJ027), the Fundamental Research Foundation for Universities of Heilongjiang Province (LGYC2018JQ003), the Natural Science Foundation of Heilongjiang Province (LH2019F023), and China Scholarship Council. The funding body did not play any roles in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Author information
Contributions
DJY and XJZ conceived and implemented the model, performed and analysed the experiments, and wrote the paper. XRZ directed the research and revised the paper. CKK, PL and JKW analysed the experimental results and revised the paper. All authors have read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1.
Table S1. The variable importance scores of all 1952 features.
Additional file 2.
Table S2. The prediction accuracy of all random forest models on different training samples.
Additional file 3.
Table S3. AUCs and AUPRs of the RFLDA in each cross-validation.
Additional file 4.
Table S4. The predicted top 50 lncRNAs associated with all 412 diseases by the RFLDA.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Yao, D., Zhan, X., Zhan, X. et al. A random forest based computational model for predicting novel lncRNA-disease associations. BMC Bioinformatics 21, 126 (2020). https://doi.org/10.1186/s12859-020-3458-1
DOI: https://doi.org/10.1186/s12859-020-3458-1
Keywords
 Random forest
 Variable importance
 Feature selection
 lncRNA-disease association prediction
 Bioinformatics algorithm