Netpredictor: R and Shiny package to perform drug-target network analysis and prediction of missing links

Background Netpredictor is an R package for prediction of missing links in any given unipartite or bipartite network. The package provides utilities to compute missing links in a bipartite and well as unipartite networks using Random Walk with Restart and Network inference algorithm and a combination of both. The package also allows computation of Bipartite network properties, visualization of communities for two different sets of nodes, and calculation of significant interactions between two sets of nodes using permutation based testing. The application can also be used to search for top-K shortest paths between interactome and use enrichment analysis for disease, pathway and ontology. The R standalone package (including detailed introductory vignettes) and associated R Shiny web application is available under the GPL-2 Open Source license and is freely available to download. Results We compared different algorithms performance in different small datasets and found random walk supersedes rest of the algorithms. The package is developed to perform network based prediction of unipartite and bipartite networks and use the results to understand the functionality of proteins in an interactome using enrichment analysis. Conclusion The rapid application development envrionment like shiny, helps non programmers to develop fast rich visualization apps and we beleieve it would continue to grow in future with further enhancements. We plan to update our algorithms in the package in near future and help scientist to analyse data in a much streamlined fashion.


Background
Identifying missing associations between drugs and targets provides insights into polypharmacology and off-target mediated effects of chemical compounds in biological systems. Traditional machine learning algorithms like Naive Bayes, SVM and Random Forest have been successfully applied to predict drug target relations [1][2][3][4]. However, using supervised machine learning methods requires training sets, and they can suffer from accuracy problems through insufficient sampling or scope of training sets. During the last years, the field of semisupervised learning has been applied to methods based on graphs or networks. The data points are represented as vertices of a network, while the links between the vertices *Correspondence: djwild@indiana.edu School of Informatics and Computing, Indiana University Bloomington, Informatics West, Bloomington, 47408 Indiana, USA depend upon the labeled information. Thus, it is desirable to develop a predictive model based on using both labeled and unlabeled information. Recently several machine learning techniques provides effective and efficient ways to predict DTIs. One way to formulate the problem of DTI prediction as a binary classification problem, where the drug-target pairs are treated as instances, and the chemical structures of drugs and the amino acid subsequences of targets are treated as features. Then, classical classification methods can be used, e.g., support vector machines (SVM) and regularized least square (RLS). Liu et al. [33] have developed PyDTI package which mainly focuses on neighborhood regularized logistic matrix factorization (NRLMF). NRLMF uses logistic matrix factorization and neighbouhood regularization to prediction drug target pairs. The PyDTI package provides access to other algorithms for drug target prediction such as NetLapRLS,BLM-NII,KBMF-2k,CMF implemented in a single package. Bajic [17] have developed DDR package which combines multiple different similarity measures in the drug space and protein target space and optimizes using average entropy measures. Peska [12] developed bayesian ranking approach for drug target prediction. The novelty of the approach comes from "per-drug ranking" optimization criteria, while projecting drugs and targets to a shared latent space. Most of these methods are command line based and they need to have prior programming expertise to start the analysis. Netpredictor solves this problem by building an intuitive UI and giving users an easy way to interaction and peform prediction based on their data. The main advantages of network-based methods are: • They use label information and as well as unlabeled data as input in the form of vectors. • Once can use multiple classes inside the network structure. • It uses multitude of paths to compute associations.
• Network based methods mostly use transductive learning strategy,in which the test set is unlabelled but while computation it uses the information from neighbourhood.
With the advent of the R open source statistical programming language [13] and the gaining popularity of the RShiny package for interface development around R [14] it has become straightforward for programmers to create and deploy web applications on windows and Linux servers. R and RShiny have already used in several biomedical applications. Table 1 shows some of these. We used R and R shiny to create Netpredictor standalone and web application respectively, which is freely available and open source. The web application framework in R allows creation of a simple intuitive user interface with dynamic filters and real-time exploratory analysis. Shiny also allows integration of additional R packages, Javascript libraries and CSS for customization. Web applications are accessible via browser or can be run locally on the user's computer. The R package described in this paper provides utilities to compute recommendations in a bipartite network and well as unipartite network based on HeatS [15], Random walk with Restart (RWR) [23,24], Network based inference (NBI) [16,21,22] and combination of RWR and NBI(netcombo). In order to understand the topology of the network, the package also provides ways to compute bipartite network properties such as degree centrality, density of the network, betweenness centrality, number of sets of nodes and total number of interactions for given bipartite network. The package also performs graph partitioning such as bipartite community detection using the lpbrim algorithm [18,20] and Table 1 Table shows some lifescience related applications  developed in R and shiny Shiny Web Applications Description rcellminer [5] Analysis of molecular profiling and drug response data.
PACMEN [6] Analysis of gene expression profiles and network topology of cancer .
SynRio [7] Analysis of cyanobacterial genome and interactive genome visualization.
GOPlot [9] Functional analysis of gene expression data.
PEAX [10] Exploration of clinical phenotype and gene expression association Methylation Plotter [11] Exploration of DNA methylation sites over genome.
visualization of communities, network permutations to compute the significance of predictions and performance of the algorithms based on user given data. The ranked list of proteins can be also be used to understand any protein proteins interactions exist among them using subgraph extraction or it can be used to understand neighbouring PPIs in the interactome. Such kind of networks helps in understanding of pathogenic and physiologic mechanisms that trigger the onset and progression of diseases. To dig deeper into such cases, the list of proteins can be used to perform Gene Ontology,disease and pathway enrichment to understand the mechanism of action of proteins and whether if that target is a suitable target or not.

Implementation
The Netpredictor package can be used in two wayseither in standalone form and compelling web application running locally or on an Amazon cloud server [25]. The web applications accessible through the Internet and standalone package are functionally identical. More details regarding the package accessibility and the instructions on how to use it via the web application and run locally are given in the "Availability and requirements" section. The interface consists of two parts -a web interface and a web server. Both of these components are controlled by code that is written within the framework of Shiny application in R. RShiny uses "reactive programming" which ensures that changes in inputs are immediately reflected in outputs, making it possible to build a highly interactive tool. Within the RShiny package, ordinary controllers or widgets are provided for ease of use for application programmers. Many of the procedures like uploading files, refreshing the page, drawing new plots and tables are provided automatically. The communication between the client and server is done over the normal TCP connection. The data traffic that is needed for many of web applications between the browser and the server is facilitated over the websockets protocol. This protocol operates separately using handshake mechanism between the client and server is done over the HTTP protocol. The duplex connection is open all the time and therefore authentication is not needed when exchange is done. In order for an RShiny app to execute, we have to create an RShiny server. RShiny follows a pre-defined way to write R scripts. It consists of server.R and ui.R, which need to be in same directory location. If a developer wants to customize the user interface shiny can also integrate additional CSS and Javascript libraries within the web application. The GUI consists of introduction page with tab panels shown in Fig. 1. The first tab, start prediction, consists of sidebar panels and a main output panel Fig. 2. The sidebar is used to upload the data and select the algorithms and its parameters. The start prediction tab consists of data upload, compute recommendations, compute network properties and visualization of user given data. The advanced analysis tab has two sections the statistical analysis section and permutation testing tab. We computed the recommendations of the Drugbank database using NBI and included the predictions results in the Drugbank search tab.
In the PPI Network tab consist of three functionalities namely one can search for protein interaction from a list of proteins, search for top-k PPI shortest paths using Yen's algorithm [19] using both weighted and un-weighted graphs. The algorithm executes O(n) times Dijkstra algorithm to search paths for each of the k shortest paths, so its time complexity is O(kn(m+nlogn)), where n is the number of nodes and m is the number of edges. Shortest path graph algorithm has been widely adopted to identify genes with important functions in a network [26][27][28][29][30].
We also provide sub-graph extraction from the PPI datasets using a large list of proteins using Consensus-PathDB [31] and string [32] databases.

Main features of netpredictor standalone and web tool
The standalone R package application can perform prediction on unipartite networks using a set of different similarity measures between vertices of a graph in order to predict unknown edges (links) [34][35][36]. The prediction methods are classified into two categories: • Neighborhood based metrics and • Path based metrics For neighbourhood based metrics the methods which are implemented are (i) common neighbours (ii) jaccard coefficient [37] (iii) cosine similarity (iv) hub promoted index [38] (v) hub depressed index (vi) Adamic Adar index [39] (vii) Preferential attachment [40] (viii) Resource allocation [41] (ix) Leicht-Holme-Nerman Index [42]. Similarly using path-based metrics one can compute paths between two nodes as similarity between node pairs. The methods are: (i) The local path based metric [43] uses the path of length 2 and length 3. The metric uses the information of the nearest neighbours and it also uses   [44] is based on similarity of all the paths in a graph.This method counts all the paths between given pair of nodes with shorter paths counting more heavily. Parameters are exponential. (iii) Geodesic similarity metric calculates similarity score for vertices based on the shortest paths between two given vertices. (iv) Hitting time [45] is calculated based on a random walk starts at a node x and iteratively moves to a neighbor of x chosen uniformly at random. The Hitting time H x,y from x to y is the expected number of steps required for a random walk starting at x to reach y. (iv) Random walk with restart [16,45,46] is based on pagerank algorithm [47]. To compute proximity score between two vertexes we start a random walker at each time step with the probability 1 -c, the walker walks to one of the neighbors and with probability c, the walker goes back to start node. After many time steps the probability of finding the random walker at a node converges to the steady-state probability.
The significance of interaction of links is based on random permutation testing. A random permutation test compares the value of the test statistic predicted data value to the distribution of test statistics when the data are permuted. Supporting Information S1_NetpredictorVignette provides tutorial for this netpredictor standalone R package. In the web application app one can load their own data or can use the given sample datasets used in the software. For the custom dataset option one needs to upload bipartite adjacency matrix along with the drug similarity matrix and protein sequence matrix. From the given datasets Enzyme, GPCR, Ion Channel and Nuclear Receptor in the application one can load the data and set the parameters for the given algorithms and start computations. The data structure the web application accepts matrix format files for computation. A summary of the contents of each of the tabs shiny netpredictor application is reported in Table 2.

Start prediction tab
The start prediction tab is designed to upload a network in matrix format and compute it properties, searching for modules, fast prediction of missing interactions , visualization of bipartite modules and predicted network. For the custom dataset, in the input drug-target binary matrix, target nodes should be in rows and drug nodes in the columns. The drug similarity matrix and the target similarity should have the exact number of drugs and targets from the binary matrix. For HeatS, only the bipartite network is used to compute the recommendation of links.  For RWR, NBI, and Netcombo all of these require three matrices. The default parameters are already being set for the algorithms. The main panel of the start prediction tab has four tabs that compute network properties, network modules, the prediction results and predicted network plot. Bipartite network properties are calculated by transforming the network in to one-mode networks (contain one set of nodes) called projection of the network in which a bipartite network of drugs and proteins two drugs are connected if they share a single protein similarly two proteins are connected if they share a single drug molecule. Using the two-projected network of drugs and proteins we compute degree centrality, betweenness, total number of interactions, total number of each of the nodes and distribution of the drug and target nodes shown in Fig. 2. WE have implemented the visualization of cousts and betweenness histograms using the rCharts R package [48]. Bipartite network modules are computed using the lpbrim algorithm [49] for which lpbrim R package is used [20]. The algorithm consists of two stages. First, during the LP phase, neighboring nodes (i.e. those which share links) exchange their labels representing the community they belong to, with each node receiving the most common label amongst its neighbors. The process is iterated until densely connected groups of nodes reach a consensus of what is the most representative label, as indicated by the fact that the modularity is not increased by additional exchanges. Second, the BRIM algorithm (2) refines the partitions found with label propagation. HeatS and network based inference compute (NBI) recommendations using a bipartite graph , where a two phase resource transfer Information from set of nodes in A gets distributed to B set of nodes and then again goes back to resource A . This process allows us to define a technique for the calculation of the weight matrix W. HeatS uses only the drug target bipartite data matrix and NBI uses similarity matrices of drug chemical similarity matrix and protein similarity matrix. The random walk with restart (RWR) algorithm uses all the three different matrices to compute the recommendations. Netcombo computes both NBI and RWR and then averages the scores. The prediction results tab shows the computed results using the javascript library DataTables [52]. The data table provides columns filters and search options. The network plot tab represent the network using the visNetwork R package [53] The Network visualization is made using vis.js javascript library. Javascript libraries can be integrated using a binding between R and javascript data visualization libraries Fig. 3. The htmlwidgets library [54] can generate a web based plot by just calling a function that looks like any other R plotting function.
One can also perform advance analysis using two tabs namely -statistical analysis tab and permutation testing. The statistical analysis tab computes the performance of the algorithms. Three algorithms are network based inference , random walk with restart and netcombo can be used. One can randomly remove the true links from the network using frequency of the drug target interactions in the network. The performance of the algorithm is checked when the removed links are repredicted. The statistics used to evaluate the performance is AUAC, AUC, AUC-TOP(10%), Boltzmann-enhanced discrimination of ROC (BEDROC) [55] and enrichment factor(EF). The data table gets automatically updated for each of the computations. The results are reported in main panel using data tables.
The significance of interactions using random permutations can be computed for the given network using network based inference and random walk with restart. The networks are randomized and significance of the interactions are calculated based on standard normal distribution. The user needs to give total number of permutations to compute and the significant interactions to keep.

PPI network
In the current application we used human proteinprotein interaction (PPI) data from both consensus-pathDB(CPDB) and string DB. The data sources are converted to igraph objects for faster loading and computation. We have implemented top-K shortest paths search using Yen's algorithm ( [19]), with PPI in both the datasets. The multiple shortest path proteins can be enriched for reactome pathways using over-representation analysis. We also provide sub-graph extraction from the PPI datasets using a large list of proteins. can be useful for connecting sources to targets in protein networks, a problem that has been the focus of many studies in the past which include discovering genomic mutations that are responsible for changes in downstream gene expression [50] studying interactions between different cellular processes [51] and linking environmental stresses through receptors The drugbank tab helps to search predicted interactions computed using NBI method using the drugbank database. One can search for targets given a specific drugbank ID and search for drugs given a specific hugo gene name. The Enrichment Analysis tab helps to search the relevant gene ontology terms,pathways and diseases for a given list of genes. A search can be made based on predicted proteins and in order to understand its function , location and pathway this tab can help to understand it. The level of ontology can also be given to the user input. We used biomart services using the biomaRT R package to convert genes names to entrez ids and then the clusterProfiler R package ( [60]) to retrieve the gene ontology lists. The pathway enrichment is based on the ReactomePA R package ( [61]).

Search drugbank tab
The drugbank tab Fig. 4 helps to search predicted interactions computed using NBI method using the drugbank database [56]. One can search for targets given a specific drugbank ID and search for drugs given a specific hugo gene name [57]. In Fig. 3 the data table shows the drug target significant scores whether it is a true or predicted interaction, Mesh categories of drugs, ATC codes and groups (approved, illicit,withdrawn, investigational, experimental). Currently the drugbank search tab only supports data computed using Network based inference.
The computed results and the associated meta-data are stored in a sqllite database [58] for access through shiny data tables interface.

Ontology and pathway search tab
The Ontology and pathway search tab Fig. 5 helps to search the relevant gene ontology terms and pathways for a given set of genes. A search can be made based on predicted proteins and in order to understand its function , location and pathway this tab can help to understand it. The level of ontology can also be given to the user input. We used biomart services using the biomaRT R package [59] to convert genes names to entrez ids and then the clusterProfiler R package [60] to retrieve the gene ontology lists. The pathway enrichment is based on the ReactomePA R package [61].

Results and discussion
In this section we illustrate the use of Netpredictor package in prediction of drug target interactions and analysis of networks. The information about the interactions between drugs and target proteins was obtained from Yamanishi et al. [62] where the number of drugs 212, 99, 105 and 27, interacting with enzymes, ion channels, GPCRs and nuclear receptors respectively. The numbers of the corresponding target proteins in these classes are 478, 146, 84 and 22 respectively. The numbers of the corresponding interactions are 1515, 776, 314 and 44. We performed both network based inference and Random  walk with restart on all of these datasets. To check the performance we randomly removed 20% of the interactions from each of the dataset and computed the performance 50 times and calculated the mean performance of each of these methods . The results are given in Table 3. Clearly, RWR supersedes its performance compared to network based inference in Enzyme and the GPCR dataset. However, computation of NBI algorithm takes less amount of time than RWR. For the drugbank tab we download the latest drugbank set version 4.3 and created a drug target interaction list of 5970 drugs and 3797 proteins We computed similarities of drugs using RDkit [64] ECFP6 fingerprint and local sequence similarity of proteins using smith waterman algorithm and normalized using the procedure proposed by Bleakley and Yamanishi [65] and integrated the matrices for network based inference computation. We ran the computations 50 times and kept the significant drug target relations (p 0.05) where a total of 316645 predicted interactions and 14167 true interactions present in the system.

Conclusions
In this paper we presented netpredictor, a standalone and web application for drug target interaction prediction. Netpredictor uses a shiny framework to develop web pages and the application can be accessed from web browsers. To set up the Netpredictor application locally there are some additional requirements other than shiny which are given below, • Firstly, the user has to have the R statistical environment installed, for which instructions can be found in R software home page. • Secondly, the devtools R package [63] has to be installed. The package can be installed using devtools R package. • Also for fast computation Microsoft R Open package needs to be installed which can be obtained from https://mran.revolutionanalytics.com/documents/ This will load all the libraries need to run netpredictor in browser. The application can be accessed in any of the default web browsers. The netpredictor R package (https://github.com/abhik1368/netpredictor) and the Shiny Web application(https://github.com/abhik1368/ Shiny_NetPredictor) is freely available. Users can follow the "Issues" link on the GitHub site to report bugs or suggest enhancements. In future the intention is to include Open Biomedical Ontologies for proteins to perform enrichment analysis. The package is scalable for further development integrating more algorithms.