Effects of protein interaction data integration, representation and reliability on the use of network properties for drug target prediction
© Mora and Donaldson; licensee BioMed Central Ltd. 2012
Received: 20 June 2012
Accepted: 2 November 2012
Published: 12 November 2012
Previous studies have noted that drug targets appear to be associated with higher-degree or higher-centrality proteins in interaction networks. These studies explicitly or tacitly make choices of different source databases, data integration strategies, representation of proteins and complexes, and data reliability assumptions. Here we examined how the use of different data integration and representation techniques, or different notions of reliability, may affect the efficacy of degree and centrality as features in drug target prediction.
Fifty percent of drug targets have a degree of less than nine, and ninety-five percent have a degree of less than ninety. We found that drug targets are over-represented in higher degree bins – this relationship is only seen for the consolidated interactome and it is not dependent on n-ary interaction data or its representation. Degree acts as a weak predictive feature for drug-target status and using more reliable subsets of the data does not increase this performance. However, performance does increase if only cancer-related drug targets are considered. We also note that a protein’s membership in pathway records can act as a predictive feature that is better than degree and that high-centrality may be an indicator of a drug that is more likely to be withdrawn.
These results show that protein interaction data integration and cleaning is an important consideration when incorporating network properties as predictive features for drug-target status. The provided scripts and data sets offer a starting point for further studies and cross-comparison of methods.
Drug targets (DTs) are defined here as proteins targeted by drugs. These proteins are not necessarily the products of disease-linked genes (which we will call Disease Proteins, DPs) but can be any protein whose binding might lead to a positive effect in the treatment of a disease. Yildirim et al. have presented a distinction between etiological and palliative drugs (the first targeting the DP or its neighbourhood, and the second attacking a different part of the network, probably to counteract symptoms of the disease-related proteins), and state that most known drugs are palliative. This diversity of ways of treating a disease raises important questions: What are drug targets and why do they work? Can we predict them to help drug discovery?
Several studies have attempted to characterize drug targets from a theoretical point of view, as such knowledge could be a tool to speed up the drug discovery process. Bioinformatics methods to characterize and predict drug targets have included: pathway and tissue enrichment, domain enrichment, number of exons and protein degree in an interaction network, GO enrichment, sequence similarity to known targets, side-effect similarity, physicochemical properties of the sequence of known drug targets, entropies of tissue expression and ratios of non-synonymous to synonymous SNPs, methods based on drug similarity, target similarity and network similarity [8, 9], in addition to traditional text and data mining approaches. These studies include network-based and non-network-based prediction methods, supervised and unsupervised, from those using the protein interaction space to those including chemical and pharmacological spaces, and from single metrics to elaborate predictors with multiple features. Their predictive power has been evaluated by metrics such as sensitivity, specificity or accuracy and, especially, the Receiver Operating Characteristic (ROC), which has been widely used in recent years [6, 11–13].
Drug targets can also be characterized in terms of protein network attributes such as degree and centrality. The degree of a protein in a protein interaction network is the number of interactions that protein is involved in, while centrality measures quantify the relative importance of a protein. Types of centrality measures include Betweenness Centrality (based on the number of shortest paths that go through the protein) and Closeness Centrality (based on the shortest distances between that protein and all others). A number of studies have investigated drug targets in terms of such network-based metrics, including degree, betweenness centrality, bridging centrality and pathway closeness centrality. These studies reported significant differences between drug targets and non-drug targets, suggesting that these network-based properties might be useful in predicting drug targets. For example, Zhu et al. had some success using an assembly of network metrics (including degree) to train a support-vector machine to rank potential drug targets from the human proteome. This study used only those interactions contained in BioGrid to generate network metrics for proteins and reported that 94 of their 200 top-ranked proteins were drug targets known to DrugBank.
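The three metrics defined above can be made concrete on a toy network. The following is a minimal sketch in plain Python, not the R/igraph code used in this study: the function names, the unnormalised betweenness (Brandes' algorithm) and the closeness normalisation (reachable nodes divided by summed distances) are assumptions chosen for illustration.

```python
from collections import deque

def build_adjacency(edges):
    """Undirected adjacency sets from (protein_a, protein_b) pairs; self-loops ignored."""
    adj = {}
    for a, b in edges:
        if a == b:
            continue
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def degree(adj):
    """Number of distinct interaction partners per protein."""
    return {node: len(partners) for node, partners in adj.items()}

def betweenness(adj):
    """Unnormalised betweenness centrality (Brandes' algorithm, unweighted graph)."""
    bc = {n: 0.0 for n in adj}
    for s in adj:
        stack = []
        preds = {n: [] for n in adj}
        sigma = {n: 0 for n in adj}   # number of shortest paths from s
        dist = {n: -1 for n in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {n: 0.0 for n in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair was visited from both endpoints.
    return {n: value / 2 for n, value in bc.items()}

def closeness(adj, node):
    """Reachable nodes divided by the sum of shortest-path distances from `node`."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Toy chain A-B-C-D: the two inner proteins carry all shortest paths.
toy = build_adjacency([("A", "B"), ("B", "C"), ("C", "D")])
print(degree(toy), betweenness(toy), closeness(toy, "B"))
```

Library implementations may normalise these values differently, so absolute numbers are only comparable within one implementation.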
The initial goal of this paper was to evaluate the predictive value of two simple graph-theoretical metrics, degree and centrality, that have previously been observed to correlate with drug targets [2, 7, 16–18]; the analysis could be extended to other network-based prediction metrics. A number of observations have been made in these studies: targets of FDA-approved drugs are more likely than targets of non-approved drugs to interact with more than 3 partners, drug targets have high degree and centralities, drug targets have higher degree but far from the highest, drug targets have higher Betweenness Centrality, and more than 40% of drug targets are involved in 1 pathway. In contrast, Hase et al. claim that middle to low degree nodes are advantageous targets.
These studies suggested that network-based metrics might be useful for drug target prediction; however, the disparate conclusions (drug targets are high-degree, middling-degree or low-degree) were confusing. In trying to reproduce some studies, we commonly had difficulty determining exactly which data sets were used, and found that studies often reported average drug target degree instead of entire degree distributions, making it difficult to compare results between studies. We hypothesized that the distribution of graph-based metrics might be very dependent upon the choice of data. The second goal of this paper was therefore to use an exploratory data analysis approach to ask how network-based metric distributions change when using various subsets of a well-defined, consolidated data set called iRefIndex. The iRefIndex is a consolidated non-redundant dataset of 13 protein interaction databases (BIND, BioGrid, CORUM, DIP, HPRD, InnateDB, IntAct, MatrixDB, MINT, MPact, MPIDB, MPPI and OPHID) that examines the sequence of each protein in order to detect redundancies.
The studies above that have investigated network-based metrics of drug targets rely upon protein interaction (PI) data, and explicitly or tacitly make choices of different source databases, data integration strategies, representation of proteins and complexes, and data reliability assumptions. Previous work from our group has shown the susceptibility of the graphical properties of a protein interaction network (PIN) to variables such as the number of included databases, redundant information between databases, canonical representation of proteins, complex representation, and reliability of included information, which makes this an important issue when comparing results from different drug target prediction studies.
Here, we examined the effect of data integration on the distribution of drug targets across degree and centrality measures (and the ability of these measures to predict drug targets). The above-mentioned studies work with limited data sets: Yildirim et al. use two high-throughput papers [34, 35], which correspond to 8.2% of all known human interactions present in the consolidated interaction database iRefIndex, while both Sakharkar et al. and Zhu et al. use the BioGrid database, which corresponds to 15.7% of human interactions in iRefIndex, and Hase et al. use results from one study, which correspond to 3.8%. We hypothesized that different conclusions might be reached just by using the complete iRefIndex data set.
Next, we examined the effect of sub-setting interaction data upon the drug target distribution over proteins of varying degree and centrality. We hypothesized that using subsets of interaction data deemed to be more reliable might alter this distribution and be useful for the purpose of predicting drug targets. There are several methods used to rank protein interactions according to some specific notion of reliability. Early attempts include the Expression Profile Reliability (EPR index), which compares protein interaction and RNA expression profiles, and the Paralogous Verification Method (PVM), which searches for paralogs of interactors that also interact (Deane et al.). In this paper, we examined five methods that have been argued to change the reliability profile of data. The first method is a bibliometric-based measure called “lpr” [19, 33, 37, 38], which is able to distinguish high-throughput and low-throughput experiments. It has been suggested that low-throughput studies contain a higher rate of reliable interactions than high-throughput studies, although this conclusion has been contested. The second and third methods are two annotation-based scores generated by IntAct and by multiple PSICQUIC services, which take into account the number of publications supporting an interaction, the number and type of experimental methods, and interaction types. As a fourth method, we considered the effect of removing all n-ary derived interactions from our data set. N-ary (also known as complex) interactions are created by a family of interaction-detection methods that show that a set of proteins are somehow interacting without specifying the exact binary interactions involved. N-ary derived interactions include binary interaction records that are actually spoke or matrix representations of n-ary data, and we have shown that the inclusion of such data can alter graphical properties of a network.
Finally, we considered the effect of removing all predicted interactions (by orthologous transfer) from our consolidated data set; iRefIndex includes the predicted interaction database OPHID. Each of these five “more reliable” datasets was examined in terms of its effect on the distribution of drug targets across bins of proteins of varying degree and centrality. Further, each distribution was assessed in terms of its effect on degree and centrality as drug target predictors.
Further, we addressed the effect of representing all n-ary data using a spoke-model representation (where only interactions between each member of the group and one chosen protein are included) versus a matrix representation (where all possible pair-wise interactions between the group of proteins are included [33, 38]). The representation of n-ary data is not always apparent in a study, but we know that this choice has consequences for network properties.
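The difference between the two representations is easiest to see by expanding a single n-ary record both ways. The sketch below is illustrative only: the function names are hypothetical, and choosing the alphabetically first member as the spoke hub is an assumption made here, not a description of how iRefIndex or iRefR selects the hub.

```python
from itertools import combinations

def spoke_edges(members, bait=None):
    """Spoke model: connect one chosen protein (the hub) to every other member.

    If no bait is given, the alphabetically first member is used; this choice
    is an assumption made for the example.
    """
    members = sorted(set(members))
    if bait is None:
        bait = members[0]
    return {tuple(sorted((bait, m))) for m in members if m != bait}

def matrix_edges(members):
    """Matrix model: connect every pair of members of the complex."""
    return set(combinations(sorted(set(members)), 2))

# A hypothetical four-protein complex.
complex_members = ["P1", "P2", "P3", "P4"]
print(len(spoke_edges(complex_members)))   # n - 1 edges
print(len(matrix_edges(complex_members)))  # n * (n - 1) / 2 edges
```

For a complex of n proteins the spoke model adds n - 1 edges while the matrix model adds n(n - 1)/2, so large complexes inflate degree far more under the matrix model even though the node set is unchanged.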
Finally, we consider the drug target predictive ability of pathway data – a data source that is overlapping but complementary to interaction data. This partial overlap drew our attention to the usefulness of pathway data to drug target prediction, and motivated us to consider a pathway-degree metric for proteins.
In summary, we have chosen degree and centrality as simple drug target predictor features, in order to study the validity of the conclusions about them found in the literature when we work with consolidated protein interaction data from iRefIndex and various decisions regarding data integration, representation and reliability. We have previously shown that network properties can be altered by these choices and we will show the potential effect of these factors on drug target prediction.
Our results section is divided into five parts which examine: 3.1) integration, 3.2) selection, 3.3) representation, 3.4) pathway data and 3.5) relationship to diseases. In order to compare the effect of the source of data on the results, a series of human PINs were generated from the iRefIndex database using the iRefR package, as specified in the Methods section. R code used to perform each analysis and to create each table and figure in the paper is provided in Additional file 1.
Here we test two hypotheses: First, that the high degree observed in drug targets might be related to the fact that specific databases or papers were chosen instead of a consolidated database and, therefore, this correlation might disappear after data integration, i.e., when using the iRefIndex. Secondly, that the high degree of some drug targets could be related to the inclusion of n-ary data.
Drug targets are correlated to high-degree only in the full data set
Table 1 Degree of all proteins, drug targets only and non-drug targets only for the full PIN and various subsets
Columns include: Protein interaction network; Degree standard deviation
Rows include: full PIN - spoke; drug targets in full PIN; non-drug targets in full PIN; drug target subnetwork - spoke; non-drug target subnetwork - spoke; BioGRID only - spoke; drug targets in BioGRID only; Rual + Stelzl papers only - spoke; drug targets in Rual + Stelzl only; Rual paper only - spoke; drug targets in Rual only
Given this observation, we examined the sub-graph consisting only of interactions between drug targets versus the non-drug target sub-graph. The average degree of the drug target sub-network is only 1.7 (versus 12.7 for the non-drug target sub-network), indicating that drug targets are, on average, high degree proteins more connected to other sites of the full network than among themselves.
For comparison purposes, the last six rows of Table 1 include the data sets of BioGRID, Rual and Stelzl, and only Rual, employed in other drug target studies [1, 2, 18]. It is evident that mean values are much higher for the full data-set than for any of these specific database or study subsets. Moreover, in the comparatively small Rual and Stelzl dataset, drug targets actually have an average degree that is lower than non-drug targets. In addition, skewness and kurtosis values indicate these smaller datasets are even more skewed than the full network case.
These initial results were consistent with drug targets having a higher degree on average in the consolidated dataset; however, the large standard deviation in these values led us to examine the relationship in greater detail. Half of all DTs have degrees between 1 and 8 (50th percentile) and 95% of all DTs have a degree less than 89. The number of DTs decreases linearly with degree between 1 and 20, followed by a long tail out to degree 789 (Additional file 2: Figure S1). A frequency plot shows that DTs appear to be shifted to higher degrees compared to non-DTs and that this difference is significant (Wilcoxon p-value 6.5e-41) (Additional file 2: Figure S2).
However, this trend is not seen in either the BioGrid or Rual and Stelzl sub-sets. In fact, drug targets were not over-represented at all in these two subsets with the exception of the highest degree bin in BioGrid. These observations argue that using degree as a feature for drug target prediction is significantly affected by choice of data-set.
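One simple way to look at over-representation across degree bins is to compare the drug target fraction within each bin with the fraction in the whole network. The sketch below illustrates that idea under stated assumptions: the bin boundaries, the function name and the use of a plain enrichment ratio are illustrative choices and are not taken from the scripts in Additional file 1.

```python
def bin_enrichment(degrees, drug_targets, bin_lower_bounds):
    """Drug target enrichment ratio per degree bin.

    degrees: dict mapping protein -> degree.
    drug_targets: set of proteins flagged as drug targets.
    bin_lower_bounds: sorted lower bounds; each bin runs up to the next bound
                      (exclusive), and the last bin is open-ended.
    Returns a list of (low, high, n_proteins, ratio) tuples, where ratio is the
    bin's drug target fraction divided by the overall fraction (None if undefined).
    """
    overall = sum(1 for p in degrees if p in drug_targets) / len(degrees)
    rows = []
    for i, low in enumerate(bin_lower_bounds):
        high = bin_lower_bounds[i + 1] if i + 1 < len(bin_lower_bounds) else float("inf")
        members = [p for p, d in degrees.items() if low <= d < high]
        if not members or overall == 0:
            rows.append((low, high, len(members), None))
            continue
        fraction = sum(1 for p in members if p in drug_targets) / len(members)
        rows.append((low, high, len(members), fraction / overall))
    return rows

# Hypothetical toy data: ratios above 1 indicate over-representation in that bin.
toy_degrees = {"a": 1, "b": 2, "c": 10, "d": 20}
toy_targets = {"c", "d"}
for row in bin_enrichment(toy_degrees, toy_targets, [1, 10]):
    print(row)
```

A ratio alone says nothing about significance; a per-bin test such as the hypergeometric test would be needed for that.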
The process of sub-setting the network will fragment it into smaller components containing drug targets that are disconnected from the main giant component. The full spoke human PIN contains 140 connected components, distributed as shown in Additional file 3: Table S1, with one giant component including 15754 proteins. The giant connected component contains 1220 drug targets while all the others contain 7 drug targets altogether. A GO term analysis, using GO, revealed that proteins in these separate, small connected components are mainly located in the extracellular region and in the membrane, with few in the cytoplasm or nucleus. Curiously, the drug targets in these smaller connected components are mainly cytoplasmic proteins. Consistent with this, the proteins in these disconnected components are mainly involved in cell adhesion, while the drug targets here are mainly involved in signal transduction. This suggests that they are not really independent functional modules but data with missing connections to the main connected component.
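The component counts above can be obtained with a standard graph traversal. The following sketch (hypothetical function names, toy data) finds connected components in an adjacency dictionary and counts how many drug targets fall outside the largest one; it illustrates the idea rather than reproducing the igraph-based analysis.

```python
def connected_components(adj):
    """Connected components of an undirected graph, largest first.

    adj: dict mapping each node to the set of its neighbours.
    """
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        component = {start}
        seen.add(start)
        stack = [start]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    component.add(w)
                    stack.append(w)
        components.append(component)
    components.sort(key=len, reverse=True)
    return components

def targets_outside_giant(adj, drug_targets):
    """Drug targets that are not in the largest connected component."""
    components = connected_components(adj)
    giant = components[0] if components else set()
    return {p for p in adj if p in drug_targets and p not in giant}

# Hypothetical network: a three-protein component and an isolated pair.
toy = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "X": {"Y"}, "Y": {"X"}}
print(len(connected_components(toy)), targets_outside_giant(toy, {"A", "Y"}))
```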
Table 2 Number of drug targets present in isolated components in the full and reliable subsets
Columns include: # Connected components; # Proteins in disconnected components; # Drug targets in disconnected components
Rows include: MI score - IntAct; MI score - PSICQUIC
Drug target degree is not overly influenced by n-ary data
We considered the possibility that the higher degree of drug targets might be influenced by the presence of n-ary data in the full data set. In a previous work, we distinguished between true binary data (B), n-ary, also known as complex, data (N) and spoke-represented n-ary data (S). The S type of data was defined as data records that are binary (only two interactors in the record), but that are in fact a spoke representation of n-ary data. Both N and S-type data could artificially inflate the degree of some nodes. Therefore, we separated these three data types in the full network into three networks called B, N and S (see Methods) and re-examined the degree distribution of drug targets in each in order to rule out the possibility that the high degree of drug targets is only due to the presence of n-ary type data.
Table 3 Drug target content and degree properties of the full network versus interaction type subsets
Columns include: % All drug targets (#drug targets in data set / Total #drug targets); % Drug targets in data set (#drug targets in data set / #proteins in data set); Average degree of data set
Drug target association with higher centrality is dependent on n-ary data
Table 4 Betweenness Centrality properties for different human PINs
Columns include: Protein interaction network; Average BC (per protein)
Rows include: full PIN - spoke; Drug targets only; Non-drug targets only; B nodes only; N nodes only; S nodes only
We examined the distribution of drug target centralities (Additional file 2: Figure S3) and found that drug targets were indeed over-represented in higher centrality bins. However, and in contrast to the degree analysis, this trend was diminished in the absence of the N and S subsets. In contrast, these trends were largely absent from the N or S subsets themselves (i.e., binary data is required to see the drug target centrality trend) and from the two smaller subsets.
In summary, DTs appear to be over-represented in higher-degree and higher-centrality bins. However, this is most apparent using a consolidated data set and is somewhat dependent on the presence of n-ary data in the case of centrality. Most drug targets seem to be located in true binary interaction data and their degree distributions are therefore not likely to be affected by complex representation artefacts.
Data selection analysis
We wished to quantify the predictive power of high degree and centrality for drug targets and assessed this using the Receiver Operating Characteristic (ROC) on the full-network. We then compared this performance over five different subsets of the data that could reasonably have an effect on reliability and on network properties with respect to the full network. Our rationale here was that removing unreliable data might decrease the degree for some non-drug targets that had been artificially inflated and thereby increase performance by removing false-positives.
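The area under the ROC curve (AUC) used throughout this section has a convenient rank-based interpretation: it equals the probability that a randomly chosen drug target has a higher score (for example, degree) than a randomly chosen non-drug target, with ties counted as one half. The sketch below computes the AUC directly from that definition; it is a plain-Python illustration with a hypothetical function name and toy data, not the R code used to produce the tables.

```python
def auc_from_scores(scores, labels):
    """Rank-based AUC: P(score of a positive > score of a negative), ties count 0.5.

    scores: numeric feature values (e.g. degree), one per protein.
    labels: truthy for drug targets, falsy for non-drug targets.
    The pairwise loop is quadratic, which is fine for a small illustration.
    """
    positives = [s for s, is_target in zip(scores, labels) if is_target]
    negatives = [s for s, is_target in zip(scores, labels) if not is_target]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical degrees for six proteins; 1 marks a drug target.
toy_degree = [12, 3, 40, 7, 7, 1]
toy_is_target = [1, 0, 1, 0, 1, 0]
print(auc_from_scores(toy_degree, toy_is_target))
```

An AUC of 0.5 corresponds to a feature with no discriminating power, which is the sense in which some subsets below are described as performing close to random.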
Degree as a drug target predictor
Table 5 Drug target predictive power of degree and centralities for different reliable subsets
Columns include: Number of proteins in network; AUC - Degree; AUC - BC; AUC - CC
Rows include: Full PIN, spoke; MI score, IntAct > 0.6; MI score, PSICQUIC > 0.7
Over-representation of drug targets along a centrality rank for the full PIN and each of the subsets behaves similarly to degree. We assessed Betweenness Centrality performance using AUC as described above and found results similar to the degree performance (Table 5). The full data set gave the best performance (AUC 0.63). A second measure of centrality (Closeness Centrality: CC) yielded only slightly poorer performance in the same tests. None of the subsets gave better performance than the consolidated data set with either centrality measure; in fact, the MI IntAct reliable data set performed close to random, as did the BioGrid and Rual and Stelzl subsets.
Analysis of reliable subsets of the full PIN
The fact that the full network has proven to be the best data source for drug target prediction over all other subsets (except the small MI-IntAct > 0.6 for degree) seemed counter-intuitive since we expected that some of these would contain more reliable data. We had reasoned that removing “unreliable” interactions might decrease the degree (connectivity) for some non-drug targets that had been artificially inflated and therefore reduce noise in the predictor due to false positives.
Table 6 Average change in degree for drug targets and non-drug targets after removing lower-confidence interactions
Columns include: Avg degree change (drug targets); Avg degree change (non-drug targets)
Rows include: Full to non-predicted; Full to B; Full to LTP
Data Representation Analysis
Table 7 Drug target predictive power of degree and centralities for spoke and matrix representation of protein complexes
Columns include: AUC - Degree; AUC - BC; AUC - CC
Rows include: Full PIN, spoke; Full PIN, matrix
Observations on the integration of interaction and pathway data
Pathways have been traditionally used in drug discovery in the context of studying proteins upstream and downstream of a target in a pathway. Several studies have emphasized the importance of enriching pathway data with interaction data due to the small overlap between these two data sources: there are, on average, 10 proteins with no interaction data per pathway in the KEGG database and 15% of the proteins in pathways have no interaction data, including remarkable cases such as “olfactory transduction”, which, to this date, contains 349 proteins without interaction information. Besides that, drug target counts suggest that pathway data might be a good predictive feature alternative to interaction data. For example, there are only 225 drug targets that have no corresponding pathway in KEGG. Only 18 KEGG pathways contain no drug targets, and 81 out of 229 KEGG pathways are significantly enriched in drug targets (hypergeometric score < 0.05). For example, the TCA cycle contains 23 drug targets out of 30 proteins, and the average percentage of drug targets in a KEGG pathway for the human PIN is 23.8%.
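A hypergeometric enrichment score of this kind can be sketched as the upper-tail probability of seeing at least the observed number of drug targets in a pathway, given the pathway size and the overall number of drug targets in the background. The exact background and tail used in the study are not stated here, so both are assumptions in the sketch below, as are the function name and the background counts in the example.

```python
from math import comb

def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) for a hypergeometric variable.

    population: number of background proteins.
    successes:  number of drug targets in the background.
    draws:      number of proteins in the pathway.
    k:          number of drug targets observed in the pathway.
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total

# Pathway size and drug target count follow the TCA cycle example in the text;
# the background of 20000 proteins with 1500 drug targets is hypothetical.
p_value = hypergeom_upper_tail(k=23, population=20000, successes=1500, draws=30)
print(p_value < 0.05)
```

Integer arithmetic with `math.comb` keeps the tail sum exact until the final division, which avoids underflow for very small probabilities.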
One could imagine employing a simple network analysis using pathways: the number of pathways that a protein is involved in could be counted as a “pathway-centrality” and assessed for its relationship with drug target status. However, pathways from multiple databases are not easily consolidated, making it difficult to determine how many distinct pathways a protein is involved in. Pathway databases are highly inconsistent in terms of both the biological entities and the reactions [46–48] described for the same pathway. The boundaries of a pathway can be subjective, such that different start and end points may be chosen and reactions may be divided into separate pathways. Further, pathway databases may differ in the number of intermediate steps, and some databases combine pathway variants in one pathway while others generate separate pathway records for each variant. Finally, pathway definitions or ontologies may differ or be completely absent. The BioCyc database defines a metabolic pathway as part of a single biological process in a single organism, regulated as a unit and evolutionarily conserved, with boundaries defined as stable substrates (not intermediates) with high degree, typically branching points. It has been reported that, as a consequence of a different ontology, KEGG pathways may be on average 4.2 times larger than BioCyc pathways. It has also been reported that the reasons for this inconsistency include comparison or data integration problems, such as different identifiers for the same entity, which should be resolved before an integration effort.
As a consequence we are unable to perform our analysis on a consolidated data set (analogous to the above analysis on a consolidated interaction data set). Instead, we had to resort to three separate pathway-centrality analyses on each of three different databases keeping in mind that results might not be comparable between databases. Pathway records between databases may be redundant and overlapping making results difficult to interpret.
Table 8 Drug target distribution in different pathway databases
Columns include: #drug target in database; #Non-drug target in database; % of proteins in database that are drug targets; % of all drug targets with pathway info; % of all non-drug targets with pathway info
Rows include: UniProt (all prots)
Table 9 Differences in number of pathways and AUC of pathway centrality in three different pathway databases
Columns include: Avg # pathways per drug target; Avg # pathways per non-drug target; Max # pathways per drug target; Max # pathways per non-drug target; AUC - Number of pathways for proteins in one pathway or more; AUC - Number of pathways for proteins in zero pathways or more
Drug targets are over-represented in all pathway centrality bins for all three databases under analysis (see Additional file 2: Figure S7). In order to compute the predictive power of the number of pathways per drug target, ROC curves were generated. Table 9 summarizes the AUC of pathways per protein for the three databases using two different UniProt data subsets. The fifth column shows the AUC when only the proteins reported in that database are used (i.e., proteins involved in at least one pathway). We observe that KEGG is the best dataset while Reactome performs close to random. The sixth column shows the result of including all UniProt proteins in the analysis, i.e., all UniProt proteins with no pathway are assigned a value of zero. In this case, two databases are good drug target predictors, especially KEGG with an AUC of 0.83. This simple pathway metric outperforms degree and centrality of interaction networks under any studied reliability and representation condition. However, this increase in performance is due to the fact that the majority of proteins in UniProt do not have pathway information.
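The pathway-count metric, and the two protein universes compared in Table 9, can be sketched as follows. The function name, the dictionary layout and the toy identifiers are hypothetical; the point is only to show how proteins without any pathway record enter the analysis with a count of zero when a wider universe such as UniProt is supplied.

```python
def pathway_counts(pathway_members, universe=None):
    """Number of pathway records each protein appears in.

    pathway_members: dict mapping pathway name -> iterable of protein IDs.
    universe: optional iterable of all proteins to score; proteins absent
              from every pathway receive a count of zero.
    """
    counts = {}
    for proteins in pathway_members.values():
        for protein in set(proteins):  # count each pathway once per protein
            counts[protein] = counts.get(protein, 0) + 1
    if universe is not None:
        for protein in universe:
            counts.setdefault(protein, 0)
    return counts

# Hypothetical records from a single pathway database.
toy_pathways = {"glycolysis": ["P1", "P2"], "TCA cycle": ["P2", "P3"]}
print(pathway_counts(toy_pathways))                          # proteins with >= 1 pathway
print(pathway_counts(toy_pathways, universe=["P1", "P4"]))   # zero-count proteins added
```

Because most of the added zero-count proteins are non-drug targets, ranking by this count separates the classes more easily in the wider universe, which is the effect described for the sixth column.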
The previous results motivated us to perform three additional analyses examining the relationship between drug targets and disease.
Table 10 Distribution of shortest path lengths from drug targets to the nearest disease protein
Columns include: Drug targets = DPs; Drug targets interact with DPs; Drug targets and DPs have a common interactor; Drug targets disconnected from DPs
Smaller subsets in Table 10 show that, in general, drug targets do not get further from disease proteins after data sub-setting and, as a rule of thumb, there will always be a disease protein at most 4 steps away from a drug target. However, the proportion of drug targets disconnected from disease proteins is higher for subsets than it is for the full PIN.
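The distance from each drug target to its nearest disease protein can be computed in a single pass with a breadth-first search started from all disease proteins at once. The sketch below uses hypothetical function names and toy data; a distance of 0 means the drug target is itself a disease protein, 1 means a direct interaction, 2 means a common interactor, and `None` marks drug targets with no path to any disease protein.

```python
from collections import deque

def distance_to_nearest(adj, sources):
    """Shortest-path distance from every reachable node to its nearest source."""
    dist = {s: 0 for s in sources if s in adj}
    queue = deque(dist)
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def target_distances(adj, drug_targets, disease_proteins):
    """Distance to the nearest disease protein for each drug target in the network."""
    dist = distance_to_nearest(adj, disease_proteins)
    return {t: dist.get(t) for t in drug_targets if t in adj}

# Hypothetical network: DP1 - X - T1, plus an isolated pair T2 - Y.
toy = {
    "DP1": {"X"}, "X": {"DP1", "T1"}, "T1": {"X"},
    "T2": {"Y"}, "Y": {"T2"},
}
print(target_distances(toy, {"T1", "T2", "DP1"}, {"DP1"}))
```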
Table 11 Predictive power of degree and centralities for cancer and non-cancer drug targets
Columns include: AUC - Degree; AUC - BC; AUC - CC
Rows include: Cancer drug targets; Non-cancer drug targets
Third and finally, we hypothesized that highly central proteins could lead to more side-effects and, therefore, their drugs would be withdrawn from the market. Indeed, we found that the average BC of the subset of drug targets for withdrawn drugs is 54084.4 with a maximum of 1501217, which indicates that withdrawn drug targets have, on average, higher centralities than all drug targets and, of course, than the average of centralities in the full PIN (Wilcoxon p-value = 9.5e-6). In contrast, non-withdrawn drug targets have an average BC of 21411.7 and a maximum of 6930614, which is similar to the average and maximum values of the full PIN (Wilcoxon p-value = 0.8). These observations argue that high centrality should not be used as a predictor and may, in fact, be indicative of drugs that are more likely to be withdrawn.
Using the full PIN (iRefIndex consolidated data set) gives better prediction results than using presumably more reliable subsets such as the true binary interactions, low lpr score, non-predicted interactions, high IntAct MI score and high PSICQUIC MI score, and significantly better than using arbitrary subsets such as one given database or study. This could be taken as an argument in favour of the importance of interaction data integration in drug target prediction studies.
The poor performance of more reliable data sets compared to the full PIN might be due to one of two reasons. Either the subsets we are calling “reliable” are not as reliable as we think they are (and better definitions of reliability are needed) or, if we assume that our data is truly reliable, it is possible that the correlation of drug targets with degree and centralities is partially due to the inclusion of unreliable interactions. Both hypotheses demand further study. We would argue that our results also point out the need for more reliable interaction data and/or methods to filter for such data.
Representation issues seem to be less important for drug target prediction. Spoke models perform slightly better than matrix models, although the difference is not high. This might be due to the fact that most drug targets are present in binary interactions and not affected by complex representation.
Pathways are enriched in drug targets and only partially overlap with interaction data, and the number of pathways that cross a given protein seems to be a good drug target predictor. This could be interpreted as a need to integrate pathway data into the drug target prediction analysis, but it may also reflect the fact that the drug discovery process has been mainly pathway-oriented. However, as a consequence of the high inconsistency between pathway databases, an integration effort is required for pathways, similar to the iRefIndex for interaction data. There are integration efforts such as ConsensusPathDB, which highlights similar reactions and leaves to the user the decision of whether they are identical or not. We believe that distributing pathways into pairwise interactions (as pioneered by Reactome) and consolidating these interactions using a methodology such as iRefIndex's SHA-1-based keys (ROGs) might be a better procedure to allow pathway integration and integration with PINs.
Our analysis can be improved in several ways. First, we are aware that degree and centralities might not be the best drug target prediction metrics and the analysis could be enriched by using better metrics and using an ensemble of features [14, 15]. However, for the three tested metrics, all the conclusions regarding importance of data integration, negative effect of selecting reliable subsets and neutral effect of data representation, were consistent among the three metrics, making us expect a similar behaviour from more sophisticated prediction metrics. Second, as stated above, degree and centralities seem to be better predictors for cancer, therefore studies related to each type of disease would be recommended. And third, the fact that centralities are better predictors of withdrawn drugs also deserves a deeper analysis.
Even though our purpose was not to examine the predictive power of degrees and centralities compared to other metrics, but only their variation due to a different data source, our analysis has given us an important insight on how these metrics work and their limitations. Data type distinction, over-representation analysis and ROC curves have given us a deeper understanding of the reasons for and against using degree and centralities as drug target features and can be a methodology to use in the assessment of new prediction metrics.
These initial results suggest that data integration is an important consideration when examining potential features for drug target prediction. Using more reliable data sets as defined here has little effect although other measures of confidence may have different results. The representation issues under analysis (n-ary data, matrix representation) do not have a significant effect on the predictive power of degree and centralities. This work will be of use to future studies that incorporate network data as a feature of drug target predictors.
All analyses were performed using R and some of its packages: “iRefR” for manipulation of the protein interaction database iRefIndex; “igraph” for network analysis; “moments” for computation of statistical moments; “limma” for generating Venn diagrams; “plotrix” for multiple histograms; and “org.Hs.eg.db” for conversion between gene IDs and GO and pathway information. R code to generate all networks, tables and plots is provided as Additional file 1.
Construction of Networks
Networks were constructed and analyzed using the iRefR package.
Construction of the full PIN
The iRefIndex human MITAB file v.8.0 contains 355104 unique records, of which 309726 correspond to human-human interactions. Using a canonical representation of the proteins and including data with all levels of confidence, two protein interaction networks can be obtained. Using a spoke model to represent complexes, the PIN (full PIN, spoke) contains 16078 nodes and 113834 edges. Using a matrix model to represent complexes, the PIN (full PIN, matrix) contains 16078 nodes and 344576 edges. Even though drug targets may be dependent on post-translational modifications and cellular micro-environments, we have focused on the canonical representations of proteins, as described in the iRefIndex [33, 38].
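The difference between the two edge counts comes from how an n-ary record (a complex) is expanded into pairs. The following is a minimal Python sketch, assuming a hypothetical four-protein complex with protein A acting as the bait; the function names are illustrative and are not part of the iRefR API.

```python
from itertools import combinations


def spoke_edges(bait, members):
    """Spoke model: connect the bait to every other member of the complex."""
    return {frozenset((bait, m)) for m in members if m != bait}


def matrix_edges(members):
    """Matrix model: connect every pair of members of the complex."""
    return {frozenset(pair) for pair in combinations(sorted(set(members)), 2)}


complex_members = ["A", "B", "C", "D"]  # hypothetical complex
spoke = spoke_edges("A", complex_members)
matrix = matrix_edges(complex_members)
print(len(spoke), len(matrix))
```

For a complex with n members the spoke model yields n - 1 edges and the matrix model n(n - 1)/2 edges, so the node set is unchanged while the number of edges grows much faster under the matrix model.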
Construction of the Drug target List
There are several drug target databases, such as DrugBank, SuperTarget, TTD, PharmGKB and others. For the purposes of this paper, we have chosen DrugBank, but the reader can use the included R code (Additional file 1) in order to reproduce these analyses with any other drug target database.
A MITAB representation of the DrugBank database was retrieved, where the drug is described in the first field of the interaction and the drug target in the second field. The DrugBank MITAB table from September 2011 contained 40274 records, 19500 of which correspond to proteins. 14851 of those protein records were found in iRefIndex and only 12632 of these are human proteins.
DrugBank includes an “experimental” category of drugs, defined as “Drug has been shown experimentally to bind specific proteins in mammals, bacteria, viruses, fungi, or parasites. An experimental drug is not necessarily being formally investigated”. Some studies remove this type of drug from the analysis because such drugs have not proven efficacy against diseases. We follow the same line of thought and found 7032 records containing experimental drugs, of which 5011 correspond to human drug targets, and 7819 records containing non-experimental drugs, of which 7621 correspond to human proteins. As a result, 7621 records out of 40274 are useful for the purposes of this study.
These 7621 DrugBank records contain 1266 distinct protein drug targets. 1227 out of these 1266 drug targets belong to human-human protein interactions; therefore, this is the final number of drug targets that was studied.
It is important to highlight that the subset of non-iRefIndex drug targets contains 1592 proteins, which means that interaction data is missing (drug targets don't have a single known protein interaction in iRefIndex's databases) for more than half of the DrugBank human drug targets.
Construction of drug target and non-drug target subnetworks
Drug target and non-drug target subnetworks were constructed using the “igraph” R package and the spoke version of the full PIN. The drug target subnetwork contains 1227 nodes and 1038 edges (drug target-drug target interactions). The non-drug target subnetwork contains 14851 nodes and 94026 edges (interactions between non-drug-targets).
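The construction of the two subnetworks can be sketched as taking the edges induced by each node set. The Python example below uses a hypothetical toy network (proteins T1 and T2 are drug targets, N1 to N3 are not) and an illustrative helper, `induced_edges`, rather than the igraph functions used in the actual analysis.

```python
def induced_edges(edges, nodes):
    """Keep only the edges whose two endpoints both belong to `nodes`."""
    nodes = set(nodes)
    return [(a, b) for a, b in edges if a in nodes and b in nodes]


# Hypothetical toy PIN.
edges = [("T1", "T2"), ("T1", "N1"), ("N1", "N2"), ("N2", "N3")]
targets = {"T1", "T2"}
all_nodes = {protein for edge in edges for protein in edge}

dt_edges = induced_edges(edges, targets)
non_dt_edges = induced_edges(edges, all_nodes - targets)
mixed = len(edges) - len(dt_edges) - len(non_dt_edges)
print(len(dt_edges), len(non_dt_edges), mixed)
```

Edges that join a drug target to a non-drug target belong to neither subnetwork, which is why the two edge counts reported above do not add up to the 113834 edges of the full spoke PIN.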
Generating interaction-type sub-networks
The iRefIndex classifies interaction data according to three interaction types: binary interaction records, n-ary interaction records (N) and polymers (not studied here). The S subset (spoke-represented n-ary data) corresponds to data that is represented as binary but is possibly just a representation of n-ary data. The S subset was detected using a simple algorithm: binary interaction records that were annotated by the same database from the same paper, and that were generated by an experimental method known to produce n-ary data, were grouped together into one S-type record. Graphs containing just binary, n-ary or S-type data were generated using the iRefR package; their sizes are summarized in Additional file 3: Table S4.
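The grouping rule for S-type data can be sketched as follows. This is an illustration under stated assumptions, not the iRefR implementation: the record fields (`a`, `b`, `source_db`, `pmid`, `method`), the method list `NARY_METHODS` and the function `split_s_type` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical list of methods assumed to produce n-ary (complex) data.
NARY_METHODS = {"tandem affinity purification", "coimmunoprecipitation"}


def split_s_type(records):
    """Separate plain binary records from grouped S-type candidates.

    Records from an n-ary-prone method are pooled by (database, paper);
    each pool holds the set of proteins of one candidate S-type record.
    """
    s_groups = defaultdict(set)
    binary = []
    for rec in records:
        if rec["method"] in NARY_METHODS:
            s_groups[(rec["source_db"], rec["pmid"])].update((rec["a"], rec["b"]))
        else:
            binary.append(rec)
    return binary, dict(s_groups)


records = [
    {"a": "P1", "b": "P2", "source_db": "DB1", "pmid": "111",
     "method": "tandem affinity purification"},
    {"a": "P1", "b": "P3", "source_db": "DB1", "pmid": "111",
     "method": "tandem affinity purification"},
    {"a": "P4", "b": "P5", "source_db": "DB1", "pmid": "222",
     "method": "two hybrid"},
]
binary, s_groups = split_s_type(records)
print(len(binary), s_groups)
```

In this toy input the two purification records from the same paper collapse into one S-type candidate containing P1, P2 and P3, while the two-hybrid record stays binary.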
Generating high-confidence subnetworks
Using the iRefR package, four main reliability criteria were considered: excluding predicted interactions from the interaction network; excluding interactions from high-throughput studies by requiring an lpr score smaller than 22; including only interactions with a high MI score – IntAct (> 0.6); or including only interactions with a high MI score – PSICQUIC (> 0.7).
The MI score tables were generated using a Python script that submits iRefIndex interaction records, one at a time, to the scoring servers and then receives and consolidates the scores in an iRefIndex MITAB format. The algorithm used to compute the scores is explained in the MI scores documentation listed in the references. The difference between the two methods is that the first one includes information from IntAct only, while the PSICQUIC version includes interaction data from all PSICQUIC servers (APID, ChEMBL, BioGrid, IntAct, DIP, InnateDB, MPIDB, iRefIndex, MatrixDB, MINT, Interoporc, Reactome, Reactome-FIs, STRING, BIND, DrugBank, I2D, I2D-IMEx, InnateDB-IMEx, and MolCon).
In order to select the cut-off values for each score type, 9 networks were generated for each score and the ROC test was applied to each of them. Values of 0.6 (for the MI score - IntAct) and 0.7 (for the MI score - PSICQUIC) had the highest AUC values and were chosen as cut-offs in this study. Additional file 3: Tables S5 and S6 show the sizes of all these networks.
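The four criteria can be read as four independent filters over the interaction records. The sketch below is hypothetical throughout: the field names (`predicted`, `lpr`, `mi_intact`, `mi_psicquic`) and the function `is_reliable` are chosen for illustration and do not correspond to iRefR column names; only the thresholds are taken from the text.

```python
def is_reliable(rec, criterion):
    """Return True if the record passes the named reliability criterion."""
    if criterion == "no_predicted":
        return not rec["predicted"]
    if criterion == "low_throughput":
        return rec["lpr"] < 22           # lpr score smaller than 22
    if criterion == "mi_intact":
        return rec["mi_intact"] > 0.6    # MI score, IntAct version
    if criterion == "mi_psicquic":
        return rec["mi_psicquic"] > 0.7  # MI score, PSICQUIC version
    raise ValueError(f"unknown criterion: {criterion}")


# Hypothetical toy records.
records = [
    {"id": 1, "predicted": False, "lpr": 3, "mi_intact": 0.75, "mi_psicquic": 0.80},
    {"id": 2, "predicted": True, "lpr": 1, "mi_intact": 0.30, "mi_psicquic": 0.72},
    {"id": 3, "predicted": False, "lpr": 500, "mi_intact": 0.61, "mi_psicquic": 0.40},
]

subsets = {
    name: [r["id"] for r in records if is_reliable(r, name)]
    for name in ("no_predicted", "low_throughput", "mi_intact", "mi_psicquic")
}
print(subsets)
```

Each criterion yields its own high-confidence subnetwork, so the effect of every notion of reliability can be evaluated separately.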
Degree: Number of edges for a node or number of interactions for a protein. For computations, the igraph R package was used.
Centrality: Node centrality is a measure of the relative importance of a node within a graph; in our case, the relative importance of a protein inside a PIN. There are various ways to calculate centrality; in this study we used two of the most common measures, “betweenness” and “closeness” centrality. The Betweenness Centrality is a measure of the number of shortest paths that cross a given node. A node that is found in many shortest paths will have a higher betweenness centrality than a node that is not. The Closeness Centrality is based on the mean shortest distance between one node (protein) and all the others that it can reach, which indicates how long it will take information to spread from that node to the rest of the network. For computations, the “igraph” R package was used. igraph includes functions to calculate both centrality measures plus other less common types of centrality.
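The three metrics can be illustrated on a small unweighted, undirected graph. The following Python sketch is not the igraph implementation used in the analysis: it computes degree directly, closeness as the reciprocal of the mean shortest distance to the reachable nodes (one common convention, assumed here), and betweenness with Brandes' algorithm. The four-protein path A-B-C-D is a hypothetical example.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of protein pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def degree(adj):
    return {v: len(nbrs) for v, nbrs in adj.items()}


def closeness(adj):
    """Reciprocal of the mean BFS distance to all reachable nodes."""
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


def betweenness(adj):
    """Brandes' algorithm: shortest paths passing through each node."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:  # accumulate dependencies, farthest nodes first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair was counted from both of its endpoints.
    return {v: b / 2 for v, b in bc.items()}


adj = build_adjacency([("A", "B"), ("B", "C"), ("C", "D")])
deg, clo, btw = degree(adj), closeness(adj), betweenness(adj)
print(deg, clo, btw)
```

On this path the inner proteins B and C have both the highest degree and the highest betweenness, while the end proteins A and D lie on no shortest path between other nodes.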
GO enrichment: When examining disconnected components, we considered “enriched” as the most common GO terms associated with a given subset of proteins. The “org.Hs.eg.db” R package was used to convert gene IDs to GO terms. A routine to count the number of GO terms is included in the supplementary R code.
Pathway Centrality: We defined pathway centrality of a protein as the number of known biological pathways that cross that protein. For computations, the “org.Hs.eg.db” R package was used to map gene IDs to pathways.
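Pathway centrality as defined here reduces to counting, for each protein, the pathways in which it appears. A minimal Python sketch, assuming a hypothetical mapping from pathway names to member proteins (in the actual analysis this mapping came from the “org.Hs.eg.db” annotation):

```python
from collections import defaultdict


def pathway_centrality(pathway_members):
    """Count the number of pathways that include each protein."""
    counts = defaultdict(int)
    for proteins in pathway_members.values():
        for protein in set(proteins):  # count each pathway once per protein
            counts[protein] += 1
    return dict(counts)


# Hypothetical pathway annotation.
pathway_members = {
    "pathway_1": ["P1", "P2", "P3"],
    "pathway_2": ["P1", "P3"],
    "pathway_3": ["P1", "P1", "P4"],
}
centrality = pathway_centrality(pathway_members)
print(centrality)
```

Proteins that do not appear in any pathway record are absent from the result and can be treated as having a pathway centrality of zero.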
Estimation of predictive power
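Predictive power was estimated using receiver operating characteristic (ROC) curves, which plot the true positive rate, $\mathrm{TPR} = \mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$, against the false positive rate, $\mathrm{FPR} = \mathrm{FP}/(\mathrm{FP}+\mathrm{TN})$, where FP = False Positives, TN = True Negatives, TP = True Positives, and FN = False Negatives.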
The area under this curve (AUC) is interpreted as the probability that the classifier ranks a randomly chosen positive example above a randomly chosen negative one; here it is calculated using a simple trapezoidal rule. We note that alternatives to the ROC method could be considered as measures of performance [62, 63].
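The trapezoidal AUC calculation can be sketched in a few lines of Python. The example assumes that a feature such as degree is used directly as the score and that drug-target status is the label; the toy values and the function names `roc_points` and `auc_trapezoid` are illustrative, not taken from the supplementary R code.

```python
def roc_points(scores, labels):
    """Sweep a threshold over the scores and return (FPR, TPR) points."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical degrees and drug-target labels (1 = drug target).
degrees = [50, 30, 20, 10, 5, 1]
is_target = [1, 0, 1, 1, 0, 0]

auc = auc_trapezoid(roc_points(degrees, is_target))
print(round(auc, 4))
```

In this toy example 7 of the 9 possible (drug target, non-drug target) pairs are ranked correctly by degree, and the trapezoidal area equals 7/9, in line with the ranking interpretation of the AUC.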
DAVID disease over-representation analysis
Proteins were sorted from higher to lower degree and grouped into bins of 700 proteins, so that bin 1 contained the proteins with the highest degree. Each bin was submitted to DAVID [51, 52] and the over-represented GADB disease categories are summarized in Additional file 3: Table S3.
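The binning step can be sketched as follows in Python. The function `degree_bins` and the toy degrees are hypothetical, and ties in degree are broken by protein name here, which is an assumption since the text does not state how ties were handled.

```python
def degree_bins(degree_by_protein, bin_size=700):
    """Rank proteins by decreasing degree and cut the ranking into bins."""
    ranked = sorted(degree_by_protein, key=lambda p: (-degree_by_protein[p], p))
    return [ranked[i:i + bin_size] for i in range(0, len(ranked), bin_size)]


# Hypothetical degrees; a bin size of 2 keeps the example readable.
toy_degrees = {"P1": 40, "P2": 3, "P3": 12, "P4": 1, "P5": 12}
toy_bins = degree_bins(toy_degrees, bin_size=2)
print(toy_bins)
```

With the default bin size of 700, a network of 16078 proteins (the size of the full PIN, assuming every protein is binned) would give 23 bins, the last one holding the 678 proteins of lowest degree.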
The authors would like to thank Paul Boddie for producing the MITAB files for DrugBank and for the MI scores, and Katerina Michalickova for providing a version of OMIM’s Morbid Map that included gene IDs.
- Yildirim MA, Goh KI, Cusick ME, Barabasi AL, Vidal M: Drug-target network. Nat Biotechnol. 2007, 25 (10): 1119-1126.
- Sakharkar MK, Li P, Zhong Z, Sakharkar KR: Quantitative analysis on the characteristics of targets with FDA approved drugs. Int J Biol Sci. 2008, 4 (1): 15-22.
- Ma'ayan A, Jenkins SL, Goldfarb J, Iyengar R: Network analysis of FDA approved drugs and their targets. Mt Sinai J Med. 2007, 74 (1): 27-32.
- Hopkins AL, Groom CR: The druggable genome. Nat Rev Drug Discov. 2002, 1 (9): 727-730.
- Campillos M, Kuhn M, Gavin AC, Jensen LJ, Bork P: Drug target identification using side-effect similarity. Science. 2008, 321 (5886): 263-266.
- Li Q, Lai L: Prediction of potential drug targets based on simple sequence properties. BMC Bioinforma. 2007, 8: 353.
- Yao L, Rzhetsky A: Quantitative systems-level determinants of human genes targeted by successful drugs. Genome Res. 2008, 18 (2): 206-213.
- Chen X, Liu MX, Yan GY: Drug-target interaction prediction by random walk on the heterogeneous network. Mol Biosyst. 2012, 8 (7): 1970-1978.
- Cheng F, Liu C, Jiang J, Lu W, Li W, Liu G, Zhou W, Huang J, Tang Y: Prediction of drug-target interactions and drug repositioning via network-based inference. PLoS Comput Biol. 2012, 8 (5): e1002503.
- Yang Y, Adelstein SJ, Kassis AI: Target discovery from data mining approaches. Drug Discov Today. 2009, 14 (3–4): 147-154.
- Chen B, Ding Y, Wild DJ: Assessing drug target association using semantic linked data. PLoS Comput Biol. 2012, 8 (7): e1002574.
- Yamanishi Y, Araki M, Gutteridge A, Honda W, Kanehisa M: Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics. 2008, 24 (13): i232-240.
- Zhao S, Li S: Network-based relating pharmacological and genomic spaces for drug target identification. PLoS One. 2010, 5 (7): e11764.
- Hwang WC, Zhang A, Ramanathan M: Identification of information flow-modulating drug targets: a novel bridging paradigm for drug discovery. Clin Pharmacol Ther. 2008, 84 (5): 563-572.
- Chen L, Wang Q, Zhang L, Tai J, Wang H, Li W, Li X, He W: A novel paradigm for potential drug-targets discovery: quantifying relationships of enzymes and cascade interactions of neighboring biological processes to identify drug-targets. Mol Biosyst. 2011, 7 (4): 1033-1041.
- Zhu M, Gao L, Li X, Liu Z, Xu C, Yan Y, Walker E, Jiang W, Su B, Chen X, et al: The analysis of the drug-targets based on the topological properties in the human protein-protein interaction network. J Drug Target. 2009, 17 (7): 524-532.
- Korcsmáros T, Szalay MS, Böde C, Kovács IA, Csermely P: How to design multi-target drugs. Expert Opin Drug Discov. 2007, 2 (6): 10.
- Hase T, Tanaka H, Suzuki Y, Nakagawa S, Kitano H: Structure of protein interaction networks and their implications on drug design. PLoS Comput Biol. 2009, 5 (10): e1000550.
- Razick S, Magklaras G, Donaldson IM: iRefIndex: a consolidated protein interaction database with provenance. BMC Bioinforma. 2008, 9: 405.
- Bader GD, Donaldson I, Wolting C, Ouellette BF, Pawson T, Hogue CW: BIND–The Biomolecular Interaction Network Database. Nucleic Acids Res. 2001, 29 (1): 242-245.
- Stark C, Breitkreutz BJ, Chatr-Aryamontri A, Boucher L, Oughtred R, Livstone MS, Nixon J, Van Auken K, Wang X, Shi X, et al: The BioGRID Interaction Database: 2011 update. Nucleic Acids Res. 2011, 39 (Database issue): D698-D704.
- Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, Frishman G, Montrone C, Mewes HW: CORUM: the comprehensive resource of mammalian protein complexes--2009. Nucleic Acids Res. 2010, 38 (Database issue): D497-D501.
- Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002, 30 (1): 303-305.
- Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, et al: Human Protein Reference Database--2009 update. Nucleic Acids Res. 2009, 37 (Database issue): D767-D772.
- Lynn DJ, Winsor GL, Chan C, Richard N, Laird MR, Barsky A, Gardy JL, Roche FM, Chan TH, Shah N, et al: InnateDB: facilitating systems-level analyses of the mammalian innate immune response. Mol Syst Biol. 2008, 4: 218.
- Kerrien S, Aranda B, Breuza L, Bridge A, Broackes-Carter F, Chen C, Duesbury M, Dumousseau M, Feuermann M, Hinz U, et al: The IntAct molecular interaction database in 2012. Nucleic Acids Res. 2012, 40 (Database issue): D841-D846.
- Chautard E, Fatoux-Ardore M, Ballut L, Thierry-Mieg N, Ricard-Blum S: MatrixDB, the extracellular matrix interaction database. Nucleic Acids Res. 2011, 39 (Database issue): D235-D240.
- Licata L, Briganti L, Peluso D, Perfetto L, Iannuccelli M, Galeota E, Sacco F, Palma A, Nardozza AP, Santonico E, et al: MINT, the molecular interaction database: 2012 update. Nucleic Acids Res. 2012, 40 (Database issue): D857-861.
- Guldener U, Munsterkotter M, Oesterheld M, Pagel P, Ruepp A, Mewes HW, Stumpflen V: MPact: the MIPS protein interaction resource on yeast. Nucleic Acids Res. 2006, 34 (Database issue): D436-D441.
- Goll J, Rajagopala SV, Shiau SC, Wu H, Lamb BT, Uetz P: MPIDB: the microbial protein interaction database. Bioinformatics. 2008, 24 (15): 1743-1744.
- Pagel P, Kovac S, Oesterheld M, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C, Mark P, Stumpflen V, Mewes HW, et al: The MIPS mammalian protein-protein interaction database. Bioinformatics. 2005, 21 (6): 832-834.
- Brown KR, Jurisica I: Online predicted human interaction database. Bioinformatics. 2005, 21 (9): 2076-2082.
- Mora A, Donaldson IM: iRefR: an R package to manipulate the iRefIndex consolidated protein interaction database. BMC Bioinforma. 2011, 12 (1): 455.
- Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N, et al: Towards a proteome-scale map of the human protein-protein interaction network. Nature. 2005, 437 (7062): 1173-1178.
- Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Koeppen S, et al: A human protein-protein interaction network: a resource for annotating the proteome. Cell. 2005, 122 (6): 957-968.
- Deane CM, Salwinski L, Xenarios I, Eisenberg D: Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol Cell Proteomics. 2002, 1 (5): 349-356.
- MITAB for iRefIndex 8.0. http://irefindex.uio.no/wiki/README_MITAB2.6_for_iRefIndex_8.0
- Razick S, Mora A, Michalickova K, Boddie P, Donaldson IM: iRefScape. A Cytoscape plug-in for visualization and data mining of protein interaction data from iRefIndex. BMC Bioinforma. 2011, 12: 388.
- von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale data sets of protein-protein interactions. Nature. 2002, 417 (6887): 399-403.
- Yu H, Braun P, Yildirim MA, Lemmens I, Venkatesan K, Sahalie J, Hirozane-Kishikawa T, Gebreab F, Li N, Simonis N, et al: High-quality binary protein interaction map of the yeast interactome network. Science. 2008, 322 (5898): 104-110.
- PSISCORE Registry. http://psiscore.bioinf.mpi-inf.mpg.de/registry.php
- MI scores. http://docs.google.com/Doc?docid=0AQ_p-HKWUjHoZGQ5cGNtcmhfMjJ2ZDdwcDhmag&hl=en
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25 (1): 25-29.
- Apic G, Ignjatovic T, Boyer S, Russell RB: Illuminating drug discovery with biological pathways. FEBS Lett. 2005, 579 (8): 1872-1877.
- Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000, 28 (1): 27-30.
- Green ML, Karp PD: The outcomes of pathway database computations depend on pathway ontology. Nucleic Acids Res. 2006, 34 (13): 3687-3697.
- Soh D, Dong D, Guo Y, Wong L: Consistency, comprehensiveness, and compatibility of pathway databases. BMC Bioinforma. 2010, 11: 449.
- Stobbe MD, Houten SM, Jansen GA, van Kampen AH, Moerland PD: Critical assessment of human metabolic pathway databases: a stepping stone for future integration. BMC Syst Biol. 2011, 5: 165.
- Karp PD, Riley M, Paley SM, Pellegrini-Toole A: The MetaCyc Database. Nucleic Acids Res. 2002, 30 (1): 59-61.
- Caspi R, Altman T, Dale JM, Dreher K, Fulcher CA, Gilham F, Kaipa P, Karthikeyan AS, Kothari A, Krummenacker M, et al: The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res. 2010, 38 (Database issue): D473-D479.
- da Huang W, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009, 4 (1): 44-57.
- da Huang W, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009, 37 (1): 1-13.
- Kamburov A, Wierling C, Lehrach H, Herwig R: ConsensusPathDB--a database for integrating human functional interaction networks. Nucleic Acids Res. 2009, 37 (Database issue): D623-D628.
- Schrattenholz A, Groebe K, Soskic V: Systems Biology Approaches and Tools for Analysis of Interactomes and Multi-target Drugs. Systems Biology in Drug Discovery and Development: Methods and Protocols. Edited by: Yan Q. 2010, Springer Science, New York, 29-58. vol. 662
- Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J: DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 2006, 34 (Database issue): D668-D672.
- Gunther S, Kuhn M, Dunkel M, Campillos M, Senger C, Petsalaki E, Ahmed J, Urdiales EG, Gewiess A, Jensen LJ, et al: SuperTarget and Matador: resources for exploring drug-target relationships. Nucleic Acids Res. 2008, 36 (Database issue): D919-D922.
- Zhu F, Shi Z, Qin C, Tao L, Liu X, Xu F, Zhang L, Song Y, Zhang J, Han B, et al: Therapeutic target database update 2012: a resource for facilitating target-oriented drug discovery. Nucleic Acids Res. 2012, 40 (Database issue): D1128-1136.
- McDonagh EM, Whirl-Carrillo M, Garten Y, Altman RB, Klein TE: From pharmacogenomic knowledge acquisition to clinical applications: the PharmGKB as a clinical pharmacogenomic biomarker resource. Biomark Med. 2011, 5 (6): 795-806.
- Drug Field Documentation and Sources. http://drugbank.ca/documentation
- igraph: Network analysis and visualization. http://cran.r-project.org/web/packages/igraph/index.html
- org.Hs.eg.db -Genome-wide annotation for Human. http://www.bioconductor.org/packages/2.2/data/annotation/html/org.Hs.eg.db.html
- Hanczar B, Hua J, Sima C, Weinstein J, Bittner M, Dougherty ER: Small-sample precision of ROC-related estimates. Bioinformatics. 2010, 26 (6): 822-830.
- Hand DJ: Measuring classifier performance: a coherent alternative to the area under the ROC curve. Mach Learn. 2009, 77: 21.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.