Drug targets (DTs) are defined here as proteins targeted by drugs. These proteins are not necessarily the products of disease-linked genes (which we will call Disease Proteins, DPs) but can be any protein whose binding by a drug might produce a positive effect in the treatment of a disease. Yildirim et al. have drawn a distinction between etiological and palliative drugs (the former targeting the DP or its network neighbourhood, the latter attacking a different part of the network, probably to counteract symptoms caused by the disease-related proteins), and state that most known drugs are palliative. This diversity of ways of treating a disease raises important questions: what are drug targets, why do they work, and can we predict them to aid drug discovery?
Several studies have attempted to characterize drug targets from a theoretical point of view, as such knowledge could speed up the drug discovery process. Bioinformatics methods to characterize and predict drug targets have included: pathway and tissue enrichment, domain enrichment, number of exons and protein degree in an interaction network, GO enrichment, sequence similarity to known targets, side-effect similarity, physicochemical properties of the sequences of known drug targets, entropies of tissue expression and ratios of non-synonymous to synonymous SNPs, and methods based on drug similarity, target similarity and network similarity [8, 9], in addition to traditional text and data mining approaches. These studies span network-based and non-network-based prediction methods, supervised and unsupervised approaches, methods using only the protein interaction space as well as those including chemical and pharmacological spaces, and single metrics as well as elaborate predictors with multiple features. Their predictive power has been evaluated with metrics such as sensitivity, specificity and accuracy, and, especially, the Receiver Operating Characteristic (ROC), which has been widely used in recent years [6, 11–13].
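As an aside, the area under the ROC curve used to compare such predictors can be computed without plotting the curve at all, via its rank-based (Mann–Whitney) formulation: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal sketch, with invented toy scores:

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney U statistic:
    the probability that a random positive outranks a random negative
    (ties count as half)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("need both positive and negative examples")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (invented): degree as the score, drug-target status as the label.
degrees = [12, 7, 7, 3, 1]
is_target = [1, 1, 0, 0, 0]
print(roc_auc(degrees, is_target))  # ~0.917
```

An AUC of 0.5 corresponds to a random predictor, 1.0 to a perfect ranking.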
Drug targets can also be characterized in terms of protein network attributes such as degree and centrality. The degree of a protein in a protein interaction network is the number of interactions that protein is involved in, while centrality measures quantify its relative importance in the network. Examples of centrality measures include betweenness centrality (based on the number of shortest paths that pass through a protein) and closeness centrality (based on the shortest distances between that protein and all others). A number of studies have investigated drug targets in terms of such network-based metrics, including degree, betweenness centrality, bridging centrality and pathway closeness centrality. These studies reported significant differences between drug targets and non-drug targets, suggesting that these network-based properties might be useful in predicting drug targets. For example, Zhu et al. had some success using an assembly of network metrics (including degree) to train a support vector machine to rank potential drug targets from the human proteome. That study used only the interactions contained in BioGrid to generate network metrics for proteins, and reported that 94 of its 200 top-ranked proteins were drug targets known to DrugBank.
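To make the two definitions concrete, degree and closeness centrality can be computed with plain breadth-first search on a toy network (betweenness centrality would additionally require enumerating shortest paths, e.g. with Brandes' algorithm). The edge list here is invented:

```python
from collections import defaultdict, deque

def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)

def degree(adj, v):
    """Number of interaction partners of protein v."""
    return len(adj[v])

def closeness(adj, v):
    """(n - 1) / (sum of shortest-path distances from v to all others),
    with distances from breadth-first search; assumes a connected network."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(adj) - 1) / total if total else 0.0

# Toy interaction network (invented): B is a small hub.
adj = build_adjacency([("A", "B"), ("B", "C"), ("B", "D"), ("D", "E")])
print(degree(adj, "B"), closeness(adj, "B"))  # 3 0.8
```

The hub B has both the highest degree and the highest closeness, which is exactly the kind of association with drug-target status that the studies above probe.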
The initial goal of this paper was to evaluate the predictive value of two simple graph-theoretical metrics, degree and centrality, that have previously been observed to correlate with drug targets [2, 7, 16–18]; the analysis could be extended to other network-based prediction metrics. A number of observations have been made in these studies: targets of FDA-approved drugs are more likely than those of non-approved drugs to interact with more than 3 partners; drug targets have high degree and centralities; drug targets have higher degree, but far from the highest; drug targets have higher betweenness centrality; and more than 40% of drug targets are involved in a single pathway. In contrast, Hase et al. claim that middle- to low-degree nodes can be advantageous targets.
These studies suggested that network-based metrics might be useful for drug target prediction; however, the disparate conclusions (drug targets are high-degree, middling-degree or low-degree) were confusing. In trying to reproduce some studies, we commonly had difficulty determining exactly which data sets were used, and found that studies often reported the average drug target degree instead of entire degree distributions, making it difficult to compare results between studies. We hypothesized that the distribution of graph-based metrics might depend strongly on the choice of data. The second goal of this paper was therefore to use an exploratory data analysis approach to ask how network-based metric distributions change when using various subsets of a well-defined, consolidated data set called iRefIndex. The iRefIndex is a consolidated, non-redundant dataset of 13 protein interaction databases (BIND, BioGrid, CORUM, DIP, HPRD, InnateDB, IntAct, MatrixDB, MINT, MPact, MPIDB, MPPI and OPHID) that examines the sequence of each protein in order to detect redundancies.
The studies above that have investigated network-based metrics of drug targets rely upon protein interaction (PI) data, and explicitly or tacitly make choices about source databases, data integration strategies, representation of proteins and complexes, and data reliability assumptions. Previous work from our group has shown that the graphical properties of a protein interaction network (PIN) are susceptible to variables such as the number of included databases, redundant information between databases, canonical representation of proteins, complex representation, and the reliability of included information, which makes this an important issue when comparing results from different drug target prediction studies.
Here, we examined the effect of data integration on the distribution of drug targets across degree and centrality measures (and the ability of these measures to predict drug targets). The above-mentioned studies work with limited data sets: Yildirim et al. use two high-throughput papers [34, 35], which correspond to 8.2% of all known human interactions present in the consolidated interaction database iRefIndex, while both Sakharkar et al. and Zhu et al. use the BioGrid database, which corresponds to 15.7% of human interactions in iRefIndex, and Hase et al. use results from a single study, which correspond to 3.8%. We hypothesized that different conclusions might be reached simply by using the complete iRefIndex data set.
Next, we examined the effect of sub-setting interaction data upon the distribution of drug targets over proteins of varying degree and centrality. We hypothesized that using subsets of interaction data deemed to be more reliable might alter this distribution and be useful for the purpose of predicting drug targets. Several methods exist to rank protein interactions according to some specific notion of reliability. Early attempts include the Expression Profile Reliability (EPR) index, which compares protein interaction and RNA expression profiles, and the Paralogous Verification Method (PVM), which searches for paralogs of the interactors that also interact (Deane et al.). In this paper, we examined five methods that have been argued to change the reliability profile of the data. The first is a bibliometric measure called “lpr” [19, 33, 37, 38], which distinguishes high-throughput from low-throughput experiments. It has been suggested that low-throughput studies contain a higher proportion of reliable interactions than high-throughput studies, although this conclusion has been contested. The second and third methods are two annotation-based scores, generated by IntAct and by multiple PSICQUIC services, which take into account the number of publications supporting an interaction, the number and type of experimental methods, and interaction types. As a fourth method, we considered the effect of removing all n-ary derived interactions from our data set. N-ary (aka complex) interactions are created by a family of interaction-detection methods showing that a set of proteins somehow interact without specifying the exact binary interactions involved. N-ary derived interactions include binary interaction records that are actually spoke or matrix representations of n-ary data, and we have shown that the inclusion of such data can alter the graphical properties of a network.
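A minimal sketch of an lpr-style bibliometric filter, assuming each interaction record carries the set of PMIDs that support it: lpr is taken here as the smallest number of interactions supported by any one of those publications, so a low threshold keeps interactions backed by at least one low-throughput study. The record structure and threshold are illustrative, not the exact iRefIndex implementation:

```python
from collections import Counter

def lpr_scores(interactions):
    """interactions: dict of interaction id -> set of supporting PMIDs.
    lpr(i) = min over supporting PMIDs of the number of interactions
    that PMID supports (low lpr ~ backed by a low-throughput study)."""
    pmid_load = Counter()
    for pmids in interactions.values():
        pmid_load.update(pmids)
    return {iid: min(pmid_load[p] for p in pmids)
            for iid, pmids in interactions.items()}

def keep_low_throughput(interactions, max_lpr=20):
    """Retain interactions whose lpr does not exceed max_lpr."""
    scores = lpr_scores(interactions)
    return {iid: pmids for iid, pmids in interactions.items()
            if scores[iid] <= max_lpr}

# Invented example: PMID 100 is a small-scale study, PMID 999 a large screen.
data = {
    "i1": {100, 999},
    "i2": {999},
    "i3": {999},
}
print(keep_low_throughput(data, max_lpr=1))  # only i1 survives
```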
As a fifth method, we considered the effect of removing all interactions predicted by orthologous transfer from our consolidated data set (iRefIndex includes the predicted interaction database OPHID). Each of these five “more reliable” datasets was examined in terms of its effect on the distribution of drug targets across bins of proteins of varying degree and centrality. Further, each distribution was assessed in terms of its effect on degree and centrality as drug target predictors.
We also addressed the effect of representing n-ary data using a spoke-model representation (where only interactions between each member of the group and one chosen protein are included) versus a matrix representation (where all possible pair-wise interactions between the group of proteins are included [33, 38]). The representation of n-ary data is not always apparent in a study, but we know that this choice has consequences for network properties.
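The two expansions can be sketched directly; the complex membership and bait protein below are invented:

```python
from itertools import combinations

def spoke_edges(bait, members):
    """Spoke model: the bait is linked to each other member only."""
    return [(bait, m) for m in members if m != bait]

def matrix_edges(members):
    """Matrix model: every pair of members is linked."""
    return list(combinations(sorted(set(members)), 2))

# A hypothetical 4-protein complex pulled down with bait "P1".
members = ["P1", "P2", "P3", "P4"]
print(len(spoke_edges("P1", members)))  # 3 edges (n - 1)
print(len(matrix_edges(members)))       # 6 edges (n * (n - 1) / 2)
```

For a complex of n proteins the spoke model yields n - 1 binary records while the matrix model yields n(n - 1)/2, so the choice directly inflates or deflates the degree of every member.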
Finally, we considered the drug target predictive ability of pathway data, a data source that overlaps with, but is complementary to, interaction data. This partial overlap drew our attention to the usefulness of pathway data for drug target prediction, and motivated us to consider a pathway-degree metric for proteins.
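A pathway-degree metric of this kind can be sketched as the number of annotated pathways each protein belongs to; the pathway names and memberships below are invented toy data:

```python
from collections import defaultdict

def pathway_degree(pathways):
    """pathways: dict of pathway id -> set of member proteins.
    Returns, for each protein, the number of pathways it appears in."""
    deg = defaultdict(int)
    for members in pathways.values():
        for protein in members:
            deg[protein] += 1
    return dict(deg)

# Invented toy pathway annotations.
pathways = {
    "apoptosis": {"TP53", "CASP3"},
    "cell_cycle": {"TP53", "CDK1"},
}
print(pathway_degree(pathways))  # TP53 -> 2, the others -> 1
```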
In summary, we have chosen degree and centrality as simple drug target predictor features in order to test the validity of conclusions about them found in the literature when working with consolidated protein interaction data from iRefIndex under various decisions regarding data integration, representation and reliability. We have previously shown that network properties can be altered by these choices, and here we show the potential effect of these factors on drug target prediction.