Accurate prediction of protein-lncRNA interactions by diffusion and HeteSim features across heterogeneous network

Background Identifying the interactions between proteins and long non-coding RNAs (lncRNAs) is of great importance to decipher the functional mechanisms of lncRNAs. However, current experimental techniques for detection of lncRNA-protein interactions are limited and inefficient. Many methods have been proposed to predict protein-lncRNA interactions, but few studies make use of the topological information of heterogenous biological networks associated with the lncRNAs. Results In this work, we propose a novel approach, PLIPCOM, using two groups of network features to detect protein-lncRNA interactions. In particular, diffusion features and HeteSim features are extracted from protein-lncRNA heterogenous network, and then combined to build the prediction model using the Gradient Tree Boosting (GTB) algorithm. Our study highlights that the topological features of the heterogeneous network are crucial for predicting protein-lncRNA interactions. The cross-validation experiments on the benchmark dataset show that PLIPCOM method substantially outperformed previous state-of-the-art approaches in predicting protein-lncRNA interactions. We also prove the robustness of the proposed method on three unbalanced data sets. Moreover, our case studies demonstrate that our method is effective and reliable in predicting the interactions between lncRNAs and proteins. Availability The source code and supporting files are publicly available at: http://denglab.org/PLIPCOM/.


Background
Long non-coding RNAs (lncRNAs) have been intensively investigated in recent years [1,2], and show close connection to transcriptional regulation, RNA splicing, cell cycle and disease. At present, a great majority of lncR-NAs have been identified, but their functional annotations verified by experiment remains very limited [3,4]. Recent studies have proved that the function of lncRNAs strikes a chord with the corresponding binding-proteins [5][6][7]. Therefore, the binding proteins of lncRNAs are urgent to be uncovered for better understand of the biological functions of lncRNAs.
Although high-throughput methods for characterization of protein-RNA interactions have been developed [8,9], in silico methods are appealing for characterization *Correspondence: hliu@cczu.edu.cn 2 Lab of Information Management, Changzhou University, 213164 Jiangsu, China Full list of author information is available at the end of the article of the lncRNAs that are less experimentally covered due to technical challenge [10]. One common way for computationally predicting lncRNA-binding proteins is based on protein sequence and structural information. For example, Muppirala et al. [11] developed a computational approach to predict lncRNA-protein interactions by using the 3-mer and 4-mer conjoint triad features from amino acid and nucleotide sequences to train a prediction models. Wang et al. [12] used the same data set by Muppirala et al. [11] to develop another predictor based on Naive Bayes (NB) and Extended Naive Bayes (ENB). Recently, Lu et al. [13] presented lncPro, a prediction method for Protein-lncRNA associations using Fisher linear discriminant approach. The features used in lncPro consist of RNA/protein secondary structures, hydrogen-bonding propensities and Van der Waals' propensities.
In recent years, network-based methods have widely been used to predict lncRNA functions [14,15]. Many studies have paid attention to integration of heterogeneous data into a single network via data fusion or network-based inference [16][17][18][19][20][21]. The network propagation algorithms, such as the Katz measure [22], random walk with restart (RWR) [23], LPIHN [24] and PRINCE [25,26], have been used to investigate the topological features of biomolecular networks in a variety of issues, such as disease-associated gene prioritization, drug repositioning and drug-target interaction prediction. Random Walk with Restart (RWR) [23] is widely used for prioritization of candidate nodes in a weighted network. LPIHN [24] extends the random walk with restart to the heterogeneous network. PRINCE [25,26] formulates the constraints on prioritization function that relate to its smoothness over the network and usage of prior information. Recently, we developed PLPIHS [27], which uses the HeteSim measure to predict protein-lncRNA interactions in the heterogeneous network.
In this paper, we introduced an computational approach for protein-lncRNA interaction prediction, referred to as PLIPCOM, based on protein-lncRNA heterogeneous network. The heterogeneous network is constructed from three subnetworks, namely protein-protein interaction network, protein-lncRNA association network and lncRNA co-expression network. PLIPCOM incorporates (i) low dimensional diffusion features calculated using random walks with restart (RWR) and a dimension reduction approach (SVD), and (ii) HeteSim features obtained by computing the numbers of different paths from protein to lncRNA in the heterogeneous network. The final prediction model is based on the Gradient Tree Boosting (GTB) algorithm using the two groups of network features. We compared our method to both traditional classifiers and existing prediction methods on multiple datasets, the performance comparison results have shown that our method obtained state-of-the-art performance in predicting protein-lncRNA interactions.
It is worth noting that we have substantially extended and improved our preliminary work published on the BIBM2017 conference proceeding [28]. The improvements include: 1) We presented more detail of the methodology of PLIPCOM, such as the construction of protein-lncRNA heterogenous work, feature extraction and gradient tree boosting algorithm; 2) We have conducted extensive evaluation experiments to demonstrate the performance of the proposed method on multiple data sets with different positive and negative sample ratios, i.e. P:N=1:1,1:2,1:5,1:10, respectively. Particularly, we compared PLIPCOM with our previous method PLPIHS [27] on four independent test datasets, and the experimental results show that PLIPCOM significantly outperform our previous method; 3) To verify the effectiveness of the diffusion and HeteSim features in predicting protein-lncRNA interactions, we evaluated the predictive performance of the two types of features alone and combination of them, on the benchmark dataset; 4) Case studies have been described to show that our method is effective and reliable in predicting the interactions between lncRNAs and proteins; 5) Last but not the least, we have conducted the time complexity analysis of PLIPCOM.

Overview of PLIPCOM
As shown in Fig. 1, the PLIPCOM framework consists of five steps. (A) Collection of three types of data sources, including protein-protein interaction network, protein-lncRNA associations and lncRNA co-expression network. (B) Construction of the global heterogenous network by merging the three networks. (C) Running random walks with restart (RWR) in the heterogeneous network to obtain a diffusion state for each node, which captures its topological relevance to all other nodes (proteins and lncRNAs) in the network. We further apply the singular value decomposition (SVD) to conduct dimension reduction and obtained a 500-dimensional feature vector for each node in the network. (D) The HeteSim score is a measure to estimate the correlation of a pair of nodes relying on the paths that connects the two nodes through a string of nodes. We computed 14 types of HeteSim features from protein-lncRNA heterogenous network. (E) We integrate the 1000-dimension (500-dimensional for the protein and 500-dimensional for the lncRNA) diffusion features and 14-dimension HeteSim scores to train the protein-lncRNA interaction prediction model using gradient tree boosting (GTB) algorithm.

Protein-protein interaction
All human lncRNA genes and protein-coding genes were obtained from GENCODE database [29] (Release 24), which includes 15,941 lncRNA genes and 20,284 proteincoding genes. We obtained the human protein-protein interactions (PPIs) from STRING database [30] (V10.0), which collected PPIs from high-throughput experiments, as well as computational predictions and text mining results. A total of 7,866,428 human PPIs are obtained.

LncRNA-lncRNA co-expression
We downloaded the expression profiles of lncRNA genes from NONCONDE 2016 database [31], and calculated the lncRNA co-expression similarity between each two lncRNAs using Pearson's correlation coefficient.

Protein-lncRNA association
We obtained the protein-lncRNA interactions from NPinter v3.0 [32], which contains 491,416 experimentally verified interactions. In addition to the known protein- Flowchart of PLIPCOM consists of five steps. a Protein-protein interaction, protein-lncRNA association, and lncRNA co-expression data are extracted from multiple public databases. b Global heterogeneous network is built by integrating three subnetworks. c The diffusion scores are calculated using random walks with restart (RWR) on the heterogeneous network, and then dimensionality reduction is conducted to obtain low-dimensional topological features using singular value decomposition (SVD). d For each lncRNA-protein pair, the HeteSim scores are calculate by counting the numbers of different paths linking them on the heterogeneous network. e The diffusion features and HeteSim features are combined to train the Gradient tree boosting (GTB) classifier for predicting protein-lncRNA interactions lncRNA interactions, we also employed the co-expression profiles to build the protein-lncRNA association network. In particular, three co-expression datasets (Hsa.c4-1, Hsa2.c2-0 and Hsa3.c1-0) with pre-computed pairwise Pearson correlation coefficients from COXPRESdb database [33] were downloaded. The three correlations are then integrated as below: where C(l, p) is the integrative correlation coefficient between lncRNA l and protein-coding gene p, C d (l, p) represents the correlation coefficient between l and p in dataset d, and D is the number of data sets. In particular, we take into account the gene pairs whose correlation coefficient are positive, and discard those with negative correlation coefficients, as the mutual exclusion relationship indicates that protein is unlikely to interacting with the lncRNA. An additional paired-end RNA-seq datasest including 19 human normal tissues are obtained from the Human Body Map 2 project (ArrayExpress accession E-MTAB-513) and another study (GEO accession no.GSE30554). Expression levels are calculated using Tophat and cufflinks, and the co-expressions of protein-lncRNA pairs are evaluated using Pearson's correlation coefficients.
Finally, we built a global heterogenous network by merging the three types of subnetworks (protein-protein interaction network, lncRNA-lncRNA co-expression network, and protein-lncRNA association network). The resulting network has 36,225 nodes (15,941 lncRNAs and 20,284 proteins) and 2,339,152 edges after removal of edges wit similarity scores <0.5.

Low-dimensional network diffusion features
The diffusion feature is a high-dimensional vector describing the topological properties of each node, which captures its relevance to all other nodes in the network. The network diffusion features can be calculated using random walk with restart (RWR) algorithm [34,35] on the global heterogenous network. RWR is able to identify relevant or similar nodes by taking the local and global topological structure within the network into account. Let G denote the adjacency matrix for the global network, and T represent the transition probability matrix. Each entry T ij holding the transition probability from node i to node j is computed as below in which G ij is equal to 1 if node i is connected to node j in the network, and 0 otherwise. The RWR process can be written as follows: where α is the restart probability leveraging the importance of local and global topological information; P t is a probability distribution whose i-th element represents the probability of node i being visited at step t. After enough number of iterations, RWR will converge so that P t holds the stable diffusion distribution. In view of excessively high-dimensional features are prone to noise interference and time-consuming in model training, we apply singular value decomposition (SVD) [36][37][38] to reduce the dimensionality of the diffusion features derived by RWR . Formally, the probability transition matrix P is factorized into the form as below: where the diagonal entries of are the singular values of P, and the columns of U and V are the left-singular vectors and right-singular vectors of P, respectively. For a given number n of output dimensions, we assign the top n columns of 1/2 V to x i , namely, where X is the derived low-dimensional feature matrix from the high-dimensional diffusion features. In this work we set n = 500 according to previous study [38].

HeteSim score-based features
The HeteSim score is a measure to estimate the correlation of a pair of nodes, and its value depends on the paths that connects the two nodes through a string of nodes in a graph [39]. HeteSim score can be easily extended to calculate the relevance of nodes in a heterogenous network. Denote by L and P two kinds of nodes in a heterogenous network, (A LP ) n * m is an adjacent matrix, the normalization matrix of A LP with respect to the row vector is defined as The reachable probability matrix R P can be defined as: where P = (P 1 P 2 · · · P n+1 ) represents the set of paths of length n, and P i belongs to any nodes in the heterogenous network.
The detailed calculation procedure can be found in our previous work [27]. Here we calculate the paths from a protein to a lncRNA in the heterogenous network with . As listed in Table 1, there are in total 14 different paths from a protein to a lncRNA under the constraint of length <6. So, we obtain a 14-dimensional HeteSim feature for each node in the heterogenous network.

The gradient tree boosting classifier
Based on the derived diffusion and HeteSim features, we build a classifier using the gradient tree boosting (GTB) [40] algorithm to predict protein-lncRNA interactions. Gradient tree boosting algorithm is an effective machine learning-based method that has been successfully applied for both classification and regression problems [41][42][43].
In GTB algorithm, the decision function is initialized as: where N is the number of protein-lncRNA pairs in the training dataset. The gradient tree boosting algorithm in which β m and α m are the weight and parameter vector of the m-th classification tree h(χ, α m ). The loss function L(y, m (χ)) is defined as: where y is the real class label and (χ) is the decision function. Both β m and α m are iteratively optimized by grid search so that the loss function L(y, m (χ)) is minimized. Accordingly, we obtain the gradient tree boosting model (χ) as follows: We use grid search strategy to select the optimal parameters of GTB with 10-fold cross-validation on the benchmark dataset. The optimal number of trees of the GTB is 600, and the selected depth of the trees is 13. The rest parameters are set to default values.

Training data sets
We randomly select 2,000 protein-lncRNA interactions from the experimentally validated protein-lncRNA associations as positive examples, and randomly generated 2,000, 4,000, 10,000, 20,000 negative samples that are not included in all known associations. As a result, we build a standard training set with 2,000 positive and 2,000 negative samples, and other three unbalanced data sets with more negative samples than positive ones. The ratios of positive and negative samples are 1:1, 1:2, 1:5 and 1:10 in the four training sets, respectively.

Test data sets
For objective performance evaluation, an independent test set is built by randomly selecting 2,000 protein-lncRNA associations from the experimentally validated ones, plus 2,000 randomly generated negative samples. To be more realistic, we accordingly construct other three unbalanced test data sets with positive vs negative ratio 1:2, 1:5 and 1:10, respectively. Note that all the positive and negative samples in these test sets are independently chosen and excluded from the training set.

Performance measures
We firstly evaluate the performance of our method using 10-fold cross-validation. The training set are randomly divided into ten set of roughly equal size subsets. Each subset is in turn used as the validation test data, and the remaining nine subsets are used as training data. The cross-validation process is repeated ten times, and the average performance measure over the ten folds are used for performance evaluation. We use multiple measures to evaluate the performance, including precision (PRE), recall (REC), F-score (FSC), accuracy (ACC) and the area under the receiver operating characteristic curve (AUC). They are defined as below: in which TP and FP represent the numbers of correctly predicted positive and negative samples, FP and FN represent the numbers of wrong predicted positive and negative samples, respectively. The AUC score is computed by varying the cutoff of the predicted scores from the smallest to the greatest value.

Predictive power of topological features
To verify the effectiveness of the diffusion and HeteSim features in predicting protein-lncRNA interactions, we evaluate the predictive performance of the two feature groups alone and combination of them (combined features), on the standard training set. As shown in Fig. 2

Benefit from gradient tree boosting algorithm
Since our method is based on the gradient tree boosting algorithm, we compared our method to several widely used classifiers, including k-nearest neighbors algorithm (kNN) [44], random forest (RF) [45] and support vector machine (SVM) [46], on our build standard training set using 10-fold cross validation. The counterpart classifiers are obtained from the python toolkits scikit-learn [47], and trained using the 1,014-dimensional combined features. For kNN classifier, we use 15 nearest neighbors and leaf size of 30 points. RF builds a number of decision tree classifiers trained on a set of randomly selected samples of the benchmark to improve the performance. A total number of 600 tree classifiers are built in this study. For SVM, we use radial basis function (RBF) as the kernel, and the penalty c and gamma g parameters are optimized to 512 and 0.00195, respectively. The number of trees used in the gradient tree boosting of PLIPCOM is set to 600, and the maximum tree depth is set to 13. Table 2 show the prediction performance of PLIPCOM together with other methods. It can be found that PLIP-COM achieved the best performance with AUC, ACC, SEN, SPE, F1-Score and MCC of 0.982, 0.947, 0.931, 0.963, 0.946 and 0.895, respectively. The results indicate that the GTB algorithm substantially improves the overall performance.

Performance comparison with existing methods
We compare PLIPCOM with four existing network-based prediction methods, including RWR [23], LPIHN [24], PRINCE [26] and PLPIHS [27], on the standard and three unbalanced data sets using 10-fold cross-validation. The parameter setting of PRINCE is that α=0.9, c=- 15, d=log(9999) and the iteration number is set to 10. The parameters of LPIHN are set to their default values, i.e. γ =0.5, β=0.5 and δ=0.3. For RWR, the restart probability r is set to 0.5. The ROC curves are drawn using the true positive rate (TPR) vs. false positive rate (FPR) upon different thresholds of these prediction results. As shown in Fig. 3, PLIPCOM obtain the best performance among these protein-lncRNA interaction prediction methods, its AUC values achieved on four data sets are both more than 0.98. Particularly, the performance of PLIPCOM keeps stable on severely unbalanced data sets, while the performance of other methods is significantly influenced. For instance, on the ratio of 1:10 dataset, PLIPCOM achieved an AUC score of 0.990, and remarkably outperform PLPIHS (0.929), PRINCE (0.854), LPIHN (0.849) and RWR (0.556).

Evaluation on independent test sets
We further compare PLIPCOM with the most recent method, PLPIHS, on four independent test sets. As other three existing methods (PRINCE, LPIHN and RWR) are network-based and can only predict interactions between the nodes included in the prebuilt network, they can not work on independent test set and thus excluded out. In fact, PLPIHS has been shown to outperform other three existing methods in our previous study [27] and the aforementioned 10-fold cross validation. PLIPCOM and PLPIHS are trained on the standard training set, and then used to predict the protein-lncRNA interactions included in four independent test sets. We observed that PLIPCOM approach shows significant improvement compared with PLPIHS, as shown in Fig. 4. PLIPCOM achieved 0.977, 0.981, 0.982, 0.979 AUC score, which is much higher than 0.879, 0.901, 0.889, 0.882 by PLPIHS, on the independent test sets, respectively. It is worth noting that PLPIHS performs worse than PLIPCOM, mainly due to the fact that PLPIHS uses only the HeteSim features a b c d Fig. 3 The ROC curves of PLIPCOM in comparison with other approaches on the train data sets with different positive and negative sample ratios. The four subfigures a b c and d represent the ROC curves on the datasets with positive vs negative sample ratio 1:1, 1:2, 1:5 and 1:10, respectively and a SVM classifier to predict protein-lncRNA interactions. The above results suggest that the two groups of topological features derived from the heterogeneous network are predictive of protein-lncRNA interactions, and their combination further improve the prediction performance.

Case studies
To further illustrate the effectiveness of the proposed method, We present three lncRNAs for case studies, including HOTAIRM1 (ensemble ID: ENSG00000233429), XIST (ensemble ID:ENSG00000229807) and HOTAIR (ensemble ID:ENSG00000228630). The HOTAIRM1 is a long non-coding RNA that plays a critical role in regulating alternative splicing of endogenous target genes, and is also a myeloid lineage-specific ncRNA in myelopoiesis [48]. HOTAIRM1 locates between the human HOXA1 and HOXA2 genes. A multitude of evidence indicates that HOTAIRM1 play vital role in neural differentiation and is a potential diagnostic biomarkers of colorectal cancer [49]. The XIST encodes an RNA molecule that plays key roles in the choice of which X chromosome remains active, and in the initial spread and establishment of silencing on the inactive X chromosome [50]. HOTAIR is a long intervening non-coding RNA (lincRNA) whose expression is increased in pancreatic tumors compared to non-tumor tissue. Knockdown of HOTAIR (siHOTAIR) by RNA interference shows that HOTAIR plays an important role in pancreatic cancer cell invasion [51]. In NPInter V3.0 [32], HOTAIRM1 is associated with 71 protein-coding genes, XIST is associated with 38 protein-coding genes and HOTAIR is associated with 29 protein-coding genes. We apply PLIPCOM to predict the interacting proteins of HOTAIRM1, XIST, HOTAIR and the results are shown in Fig. 5. Our method correctly predicted 69 interactions of HOTAIRM1, 36 interactions of HOTAIRM1, 28 interactions of HOTAIRM1. We further inspected top 10 predicted proteins of HOTAIRM1, XIST, HOTAIR as listed in Table 3. For example, GNAS protein is an imprinted region that gives rise to noncoding RNAs, HOTAIRM1, and other several transcripts, antisense transcripts that includes transcription of RNA encoding the α-subunit of the stimulatory G protein [52]. Indeed, GNAS has been shown to underlie some important quantitative traits in muscle mass and domestic mammals [53]. In addition, HOTAIRM1 can interact with SFPQ in colorectal cancer (CRC) tissues that release PTBP2 a b c d Fig. 4 The ROC curves of PLIPCOM in comparison to PLPIHS on four test data sets with different positive and negative sample ratios. The four subfigures a b c and d represent the ROC curves on the datasets with positive vs negative sample ratio 1:1, 1:2, 1:5 and 1:10, respectively from the SFPQ or PTBP2 complex. The interaction between HOTAIRM1 and SFPQ is a promising diagnostic biomarker of colorectal cancer [54]. NFKB1 is a transcriptional factor that plays crucial role in the regulation of viral and cellular gene expressions [55], and its association with HOTAIRM1 is helpful to uncover the function of HOTAIRM1. Take HOTAIR for another example, EZH2 is the catalytic subunit of the polycomb repressive complex 2 (PRC2) and is involved in repressing gene expression through methylation of histone H3 on lysine 27 (H3K27) [56], EZH2 (predominant PRC2 complex component) inhibition blocked cell cycle progression in glioma cells, which is consistent with the effects elicited by HOTAIR siRNA. Through the study of EZH2, we can understand the biological function of HOTAIR more deeply [57]. These cases demonstrate that PLIPCOM is effective and reliable in predicting the interactions between lncRNAs and proteins.

Discussion and conclusion
Identification of the associations between long noncoding RNAs (lncRNAs) and protein-coding genes is essential for understanding the functional mechanism of lncRNAs. In this work, we introduced a machine a b c  From our perspective, the superior performance of PLIPCOM benefits from at least three aspects: (i) diffusion features calculated using random walks with restart (RWR) on the protein-lncRNA heterogenous network, and the feature dimension is further reduced by applying singular value decomposition (SVD); (ii) HeteSim features obtained by computing the numbers of different paths from protein to lncRNA in the heterogenous network; and (iii) effective prediction model built by using the gradient tree boosting (GTB) algorithm. As far as our knowledge, we are the first to apply both diffusion and HeteSim features to predict protein-lncRNA interactions, although these two types features are regularly used in characterizing biological networks in previous works. As shown in our experimental results, diffusion and HeteSim features are complementary and their combination can further improve the predictive power. Moreover, compared to other classifiers, such as SVM and kNN, GTB used by PLIPCOM can not only achieve high prediction accuracy, but also select the feature of importance for identifying lncRNA-protein interactions.
The time complexity of our method depends mainly on the feature extraction procedure and GTB algorithm. The diffusion feature is calculated using RWR and its time complexity can be inferred from the equation P = (E − (1 − α)T) −1 (αE) = αQ −1 E, in which E is unit matrix, T is the transition probability matrix, α is the restart probability and Q is an n * n sparse matrix (n is number of nodes in the network). The time complexity of calculating inverse matrix Q −1 is O(n 3 ), and can be optimized by using Cholesky algorithm. From our previous work, we know that the time complexity of calculating HeteSim feature is O(kn), where k is the number of samples and n is the number of nodes. Note that these two network features can be calculated in parallel. Moreover, we use the truncated SVD to reduce the diffusion feature dimension so that the time of GTB training process is greatly reduced. As a result, the time complexity of the methodology of PLIPCOM is moderate, and can be scaled to large networks.
Although PLIPCOM show effectiveness and promising predictive power, we think its performance can be further improved by adding protein sequence and structural information. In the near future, we will integrate sequence and structural features to promote the prediction of potential lncRNA-protein interactions.