Discovering patterns in drug-protein interactions based on their fingerprints

Background The discovery of interesting patterns in drug-protein interaction data at the molecular level can reveal hidden relationships among drugs and proteins and can therefore be of great importance for applications such as drug design. To discover such patterns, we propose a computational approach to analyzing the molecular data of drugs and proteins that are known to interact with each other. Specifically, we propose a data mining technique called Drug-Protein Interaction Analysis (D-PIA) to determine whether there are any commonalities in the fingerprints of the substructures of interacting drug and protein molecules and, if so, whether any patterns can be generalized from them.
Method Given a database of drug-protein interactions, D-PIA performs its tasks in five steps. First, for each drug in the database, the fingerprints of its molecular substructures are obtained. Second, for each protein in the database, the fingerprints of its protein domains are obtained. Third, based on known interactions between drugs and proteins, an interdependency measure between the fingerprint of each drug substructure and protein domain is computed. Fourth, based on the interdependency measure, drug substructures and protein domains that are significantly interdependent are identified. Fifth, the existence of an interaction between previously unknown drug-protein pairs is predicted based on their constituent substructures that are significantly interdependent.
Results D-PIA has been tested with real drug-protein interaction data including enzymes, ion channels, and protein-coupled receptors. Experimental results show that there are indeed patterns to be discovered in the interdependency relationships between the drug substructures and protein domains of interacting drugs and proteins. Based on these relationships, a testing set of drug-protein data was used to see whether D-PIA can correctly predict the existence of interactions between drug-protein pairs. The results show that D-PIA predicts such interactions well: the area under the ROC curve (AUC) reaches as high as 75%.
Conclusions D-PIA performs its tasks based on the fingerprints of drug and protein molecules without requiring any 3D information about their structures, and it is therefore very fast to compute. It has been tested with real drug-protein interaction data, and experimental results show that it can be very useful for predicting previously unknown drug-protein as well as protein-ligand interactions. It can also be used to tackle problems such as ligand specificity, which are related directly and indirectly to drug design and discovery.


Background
The chemical pathways in our bodies are affected by various diseases in many different and extremely complex ways. When one is sick, a mistake in one reaction in a pathway may stop an important protein from being produced or cause too much of it to be produced. To correct such mistakes, drug molecules can be developed to interact with target protein molecules and activate or inhibit some of their functions, thereby causing more or less of a protein to be produced. To facilitate drug design and discovery, it would therefore be very useful if we could predict whether or not a particular drug candidate may interact with a particular target protein based on their structures at the molecular or submolecular level.
Over the past decade, much effort has been made to investigate how drugs and proteins interact, and the most notable work is that related to protein-ligand docking [1]. A ligand is a molecule that binds to another chemical entity to form a larger complex, and protein-ligand docking is concerned with predicting the position and orientation of a ligand when it binds to a protein receptor. If a ligand candidate that binds to a certain target can be found, drug molecules can then be designed to contain this ligand. However, finding such a ligand candidate is difficult, as protein-ligand docking requires knowledge about the 3D structures of the proteins and obtaining such knowledge can be very difficult [2].
Apart from protein-ligand docking, there has also been some effort to analyze molecular substructures [3] and biological activities [4]. In [3], for example, the concept of "privileged" substructures is introduced to refer to chemical substructures that are commonly present in many drugs. In other words, in predicting whether a drug may interact with a protein, one can search for the presence of such privileged substructures in the drug molecule as an indicator of the likelihood of an interaction with a protein. While such an approach may sound reasonable, it is considered controversial, as the abundance of these substructures in drugs may be a trivial consequence of their abundance in biochemical molecules in general.
Other than finding privileged substructures, a variety of statistical methods have recently been proposed to predict drug-target or, more generally, protein-ligand interactions [5,6]. There have also been some attempts to mine structural patterns from biological or biochemical data based on molecular fingerprints. The concept of molecular fingerprints, first introduced in [7], refers to a representation of chemical structures originally designed to assist in chemical database search.
Molecular fingerprints later became widely used for data analysis tasks such as similarity search [8], clustering [9], and classification [10]. In such tasks, they have been used to encode a wide range of 2D and 3D structural or conformational features of molecules. A method for representing and analyzing 3D protein-ligand binding interactions, for example, is proposed in [11]. The key to this method is to analyze the fingerprints obtained by translating the 3D structural binding information of a protein-ligand complex into a one-dimensional binary string.
Most of the work mentioned above has been performed independently from the viewpoint of either ligands or proteins. Not much work has been done to investigate how the chemical and biological spaces interact with each other. An attempt to connect the two spaces is reported in [2], which proposes an approach to extracting drug substructures and protein domains from a drug-protein interaction dataset by encoding the chemical substructures of the drugs and the protein domains of the proteins as molecular fingerprints and then performing sparse canonical correspondence analysis (SCCA) on the data. As pointed out in [2], the effectiveness of this approach depends very much on the correct setting of a number of predefined parameters, and the method may not work well when sparsity is not a relevant characteristic of the data.
To identify ligand candidates efficiently for such applications as drug design and discovery, we need to be able to predict if a drug may interact with a protein without having to obtain full information of the 3D structures of protein molecules at an early stage. To do so, we propose to use a data mining algorithm called D-PIA (Drug-Protein Interaction Analysis). Instead of relying on the availability of the 3D structural information of a target protein to predict if it may have any interaction with a certain drug candidate, D-PIA only makes use of the 2D molecular fingerprints of the protein in the prediction process.
Proteins are molecules consisting of a long chain of amino acids with unique structures and substructures. A protein domain is a part of a protein chain that can evolve, function, and exist independently of the rest of the chain [12]. D-PIA performs its tasks by first breaking down drug molecules into substructures and proteins into their protein domains. By so doing, D-PIA attempts to determine whether the drug substructures may interact or bind with the protein domains, and whether the strength of such interactions or bindings may determine if drugs can be designed for optimal compatibility with the human body and with other drugs [13].
Once the drug substructures and protein domains are identified, D-PIA makes use of a probabilistic measure to determine whether a drug substructure and a protein domain are interdependent, and it does so in several steps: (i) for each drug in the database, the fingerprints of its molecular substructures are obtained; (ii) for each protein in the database, the fingerprints of its protein domains are obtained; (iii) based on known interactions between drugs and proteins, an interdependency measure between the fingerprint of each drug substructure and protein domain is computed; (iv) based on the interdependency measure, drug substructures and protein domains that are significantly interdependent are identified; and (v) the existence of an interaction between previously unknown drug-protein pairs is predicted based on their constituent substructures that are significantly interdependent.
D-PIA has been tested with real data involving two thousand drugs and the proteins that they interact with. Our experimental results show that it can be very helpful for predicting drug-protein and protein-ligand interactions. It can also be used to address problems such as ligand specificity.

Discovering interesting association patterns
To determine whether or not the ith substructure of a drug has a sufficiently strong interdependency relationship with the jth protein domain of proteins, we construct a contingency table (Table 1) of P rows and Q columns.
In this table, $occ_{ij}$ denotes the number of cases in which $sub_i$ and $dom_j$ both take on the value 1 in $I$. Let
$$exp_{ij} = \frac{occ_{i+}\, occ_{+j}}{T}$$
be the expected value of $occ_{ij}$, where $occ_{i+} = \sum_{k=1}^{Q} occ_{ik}$, $occ_{+j} = \sum_{k=1}^{P} occ_{kj}$, and $T = \sum_{l,k} occ_{lk}$. An interdependency relationship between $sub_i$ and $dom_j$ is considered to exist if $occ_{ij}$ is significantly different from $exp_{ij}$. To decide whether this is the case, the approach taken in [14] is used to calculate an adjusted residual test statistic:
$$ad_{ij} = \frac{z_{ij}}{\sqrt{v_{ij}}} \qquad (1)$$
where
$$z_{ij} = \frac{occ_{ij} - exp_{ij}}{\sqrt{exp_{ij}}} \qquad (2)$$
and
$$v_{ij} = \left(1 - \frac{occ_{i+}}{T}\right)\left(1 - \frac{occ_{+j}}{T}\right)$$
is the maximum likelihood estimate of the variance of $z_{ij}$ defined in [15]. $ad_{ij}$ has an approximately normal distribution with a mean of approximately zero and a variance of approximately one. Therefore, if its absolute value exceeds 1.96, it is considered significant at $\alpha = 0.05$ by conventional criteria. Based on (1), we can determine whether a drug substructure $sub_i$ has an interdependency relationship with the protein domain $dom_j$ at the 95% confidence level.
It should be noted that the value of $ad_{ij}$ can be positive or negative. When $ad_{ij}$ is positive, $sub_i$ and $dom_j$ are interdependent on each other; when it is negative, they are not.
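The adjusted residual test described above can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' implementation: the function name `adjusted_residuals`, the toy 2 x 2 table, and the choice to report degenerate cells (zero expected count or zero variance) as 0.0 are all assumptions made for the example.

```python
import math

def adjusted_residuals(occ):
    """Return the adjusted residuals ad[i][j] for a P x Q table of counts."""
    row_tot = [sum(row) for row in occ]          # occ_{i+}
    col_tot = [sum(col) for col in zip(*occ)]    # occ_{+j}
    total = sum(row_tot)                         # T
    ad = []
    for i, row in enumerate(occ):
        out = []
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total
            variance = (1 - row_tot[i] / total) * (1 - col_tot[j] / total)
            if expected == 0 or variance == 0:
                # Assumption: a degenerate cell is reported as non-significant.
                out.append(0.0)
            else:
                z = (observed - expected) / math.sqrt(expected)
                out.append(z / math.sqrt(variance))
        ad.append(out)
    return ad

# Hypothetical toy table: rows are substructures, columns are domains.
occ = [[30, 10],
       [10, 30]]
ad = adjusted_residuals(occ)

# A cell is flagged when |ad| exceeds 1.96 (alpha = 0.05).
significant = [[abs(x) > 1.96 for x in row] for row in ad]
```

For a 2 x 2 table the squared adjusted residual equals the Pearson chi-square statistic of the table, which offers a quick sanity check on the toy values.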

Determining the weight of evidence for the discovered patterns
Since the presence of a substructure in a drug is important for determining the drug's interaction with protein domains, it is necessary to ensure that substructures are utilized in the prediction of an interaction relationship between a drug and a protein. The interdependency relationships discovered by (2) establish only that drug substructures and protein domains are interdependent; they do not measure how strong the interdependency is. For this reason, we introduce a weight of evidence (WoE) measure for the patterns discovered above. Suppose that $dom_j = 1$ is found to be interdependent with $sub_i = 1$. Then the weight of evidence provided by $sub_i = 1$ in favor of $dom_j = 1$ as opposed to $dom_j = 0$ is defined as in [16]. The WoE can be positive or negative, thereby supporting or refuting the existence of an interaction relationship between a drug containing $sub_i$ and a protein containing $dom_j$. Hence, for a drug to be predicted to interact with a target protein, it should have sufficient support from its substructures in the sense that they should have a large enough degree of interdependency with the protein domains of the target protein.
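As an illustration, a commonly used form of the weight of evidence is $WoE = \log\frac{P(dom_j = 1 \mid sub_i = 1)}{P(dom_j = 1)} - \log\frac{P(dom_j = 0 \mid sub_i = 1)}{P(dom_j = 0)}$. Whether this matches the definition in [16] term for term is an assumption, as is the way the probabilities are estimated below: they are taken directly from the contingency table counts, with $dom_j = 0$ read as "any other column". The function name `weight_of_evidence` is hypothetical.

```python
import math

def weight_of_evidence(occ, i, j):
    """WoE of sub_i = 1 in favor of dom_j = 1 versus dom_j = 0.

    Assumes every probability involved lies strictly between 0 and 1;
    zero counts are not smoothed and will raise an error.
    """
    row_tot = sum(occ[i])                        # occ_{i+}
    col_tot = sum(row[j] for row in occ)         # occ_{+j}
    total = sum(sum(row) for row in occ)         # T
    p_dom_given_sub = occ[i][j] / row_tot        # P(dom_j = 1 | sub_i = 1)
    p_dom = col_tot / total                      # P(dom_j = 1)
    p_not_dom_given_sub = 1 - p_dom_given_sub    # P(dom_j = 0 | sub_i = 1)
    p_not_dom = 1 - p_dom                        # P(dom_j = 0)
    return (math.log(p_dom_given_sub / p_dom)
            - math.log(p_not_dom_given_sub / p_not_dom))

# Hypothetical toy table: rows are substructures, columns are domains.
occ = [[30, 10],
       [10, 30]]
woe_support = weight_of_evidence(occ, 0, 0)   # positive: evidence for
woe_refute = weight_of_evidence(occ, 0, 1)    # negative: evidence against
```

A positive value indicates that observing the substructure makes the domain more likely than its base rate, and a negative value indicates the opposite, which is the sense in which the measure supports or refutes an interaction.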

Evaluation of D-PIA
One way to evaluate the effectiveness of D-PIA is to see whether it can correctly predict drug-protein interactions that it has no previous knowledge of. Here we evaluate D-PIA by withholding known drug-target interactions from it and testing whether it can predict them correctly.
Given a drug $D_i$ and a protein $P_j$, the potential interaction between them can be estimated by determining whether there is any significant interdependency between the substructures in $D_i$ and the protein domains in $P_j$. To do so, let us denote the set of substructures in $D_i$ as $DS_i = \{s_1, s_2, \ldots, s_a\}$ and the set of domains in $P_j$ as $PD_j = \{d_1, d_2, \ldots, d_b\}$, where $a$ is the total number of substructures in $D_i$ and $b$ is the total number of protein domains in $P_j$. For all $s' \in DS_i$ and $d' \in PD_j$, we consider the interaction between $s'$ and $d'$ to be significant when (6) is satisfied:
$$|ad_{s'd'}| > 1.96 \qquad (6)$$
For a pair $D_i$ and $P_j$, there are $a \times b$ possible interdependency relationships between substructures and protein domains in total. The potential interaction between $D_i$ and $P_j$ can be estimated based on how many of them are significant. If only one of the $a \times b$ possible relationships is significant, we may consider the potential interaction between $D_i$ and $P_j$ to be very weak. On the other hand, if more than half of them are significant, we may consider the potential interaction between $D_i$ and $P_j$ to be high. We therefore quantify the potential interaction between $D_i$ and $P_j$ as in (7):
$$w(D_i, P_j) = \frac{1}{a \times b} \sum_{s' \in DS_i} \sum_{d' \in PD_j} val(ad_{s'd'}) \qquad (7)$$
where $val(x) = 1$ if $|x| > 1.96$ and $val(x) = 0$ otherwise. The interaction between the drug $D_i$ and the protein $P_j$ is considered significant if $w(D_i, P_j)$ exceeds a user-supplied threshold $R$, i.e., if $w(D_i, P_j) > R$. If, at the same time, $WoE(D_i, P_j)$ is also high, then the interaction between $D_i$ and $P_j$ is not only strong but also supported by strong evidence.
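The scoring rule can be illustrated with a short sketch. It reads $w(D_i, P_j)$ as the fraction of substructure-domain pairs whose adjusted residual exceeds 1.96 in absolute value, which is an interpretation of the text rather than a confirmed formula. The function name `interaction_score`, the toy matrix of adjusted residuals, and the index lists are hypothetical.

```python
def interaction_score(ad, drug_subs, protein_doms, z_crit=1.96):
    """Fraction of (substructure, domain) pairs with |ad| > z_crit.

    ad           -- matrix of adjusted residuals, ad[sub][dom]
    drug_subs    -- indices of the substructures present in the drug
    protein_doms -- indices of the domains present in the protein
    """
    pairs = [(s, d) for s in drug_subs for d in protein_doms]
    if not pairs:
        return 0.0  # assumption: no substructures or domains means no support
    hits = sum(1 for s, d in pairs if abs(ad[s][d]) > z_crit)
    return hits / len(pairs)

# Hypothetical adjusted residuals for 3 substructures x 3 domains.
ad = [[4.5, -0.3, 1.0],
      [2.1, 0.2, -2.5],
      [0.0, 1.9, 0.4]]

# A drug with substructures 0 and 1, a protein with domains 0 and 2.
w = interaction_score(ad, drug_subs=[0, 1], protein_doms=[0, 2])

R = 0.10                 # user-supplied threshold
predicted = w > R        # True means an interaction is predicted
```

In this toy case three of the four pairs are significant, so the score is 0.75 and the pair is predicted to interact for any threshold below that value.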

Results
To evaluate the effectiveness of D-PIA, we used the dataset from [2], which contains information about 1862 drugs. Each drug in the dataset is represented by a fingerprint covering the 881 substructures defined in the PubChem database [17], i.e., each drug is encoded as a binary vector whose elements indicate the presence or absence of a chemical substructure using 1 and 0, respectively. An example of such a fingerprint is given in Figure 1.
In addition to the drugs, the dataset contains information about 1554 proteins. According to the UniProt [18] and Pfam [19] databases, these proteins are associated with a total of 876 protein domains, and thus each protein can be encoded as a binary vector whose elements indicate the presence or absence of a protein domain using 1 and 0, respectively. An example of a protein sequence and its protein domains is given in Figure 2.
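To make the encoding concrete, the following sketch builds binary fingerprints and counts, over a list of known interactions, how often each substructure and each domain are both present. This count is one reading of how the contingency table of substructures against domains is filled. The helper names `to_fingerprint` and `cooccurrence_table` and all identifiers in the toy data are hypothetical, and the vector lengths are shortened from 881 and 876 for readability.

```python
def to_fingerprint(present, length):
    """Binary vector with a 1 at every index listed in `present`."""
    bits = [0] * length
    for idx in present:
        bits[idx] = 1
    return bits

def cooccurrence_table(drug_fps, protein_fps, interactions):
    """Count, per (substructure, domain), the interacting pairs having both."""
    n_sub = len(next(iter(drug_fps.values())))
    n_dom = len(next(iter(protein_fps.values())))
    occ = [[0] * n_dom for _ in range(n_sub)]
    for drug, protein in interactions:
        for i, has_sub in enumerate(drug_fps[drug]):
            if not has_sub:
                continue
            for j, has_dom in enumerate(protein_fps[protein]):
                if has_dom:
                    occ[i][j] += 1
    return occ

# Toy data: 4 possible substructures, 3 possible domains.
drug_fps = {
    "drugA": to_fingerprint({0, 2}, 4),
    "drugB": to_fingerprint({1}, 4),
}
protein_fps = {
    "protX": to_fingerprint({0}, 3),
    "protY": to_fingerprint({1, 2}, 3),
}
interactions = [("drugA", "protX"), ("drugA", "protY"), ("drugB", "protY")]

occ = cooccurrence_table(drug_fps, protein_fps, interactions)
```

With the full dataset the same loop would yield an 881 x 876 table; a sparse or matrix-product formulation would be preferable at that scale, but the nested loop keeps the counting rule explicit.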
Given the drugs and proteins described above, D-PIA determines the adjusted residuals for the drug substructures and protein domains based on Equation (2) above. In Table 2, we list some of the adjusted residuals that D-PIA computes to determine whether there is a significant interdependency relationship between a drug substructure and a protein domain. As shown in the table, for example, the drug substructures SUB840, SUB841, and SUB861 are interdependent with the protein domains PF00104 and PF00105.
To evaluate the effectiveness of D-PIA, we determine whether there is a strong enough interaction between each drug $D_i$ and protein $P_j$ in our dataset based on the adjusted residuals obtained between the substructures of the drugs and the protein domains of the proteins, as illustrated in Table 2. We set $R$ to 10% in our experiments and found that D-PIA predicts the existence of drug-protein interactions with an accuracy of 85.4%. A 5-fold cross-validation approach is used to evaluate the ability of D-PIA to determine whether a drug interacts with a protein. This approach is as follows:
1) We split the drug-protein interaction dataset into five subsets of equal size and take each subset in turn as a test set.
2) We perform D-PIA on the remaining four subsets.
3) Based on the significant interdependency relationships determined between drug substructures and protein domains, D-PIA attempts to predict the existence of interactions between the drugs and proteins in the test set, and the accuracy over the five folds is computed.
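The splitting step of this procedure can be sketched as follows. The function name `k_fold_splits`, the fixed random seed, and the toy list of pairs are assumptions for illustration. The text does not say how the pairs were shuffled or how non-interacting pairs were chosen for testing, so neither is modeled here.

```python
import random

def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) lists for k-fold cross-validation.

    Fold sizes differ by at most one when len(items) is not divisible by k.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for m, fold in enumerate(folds) if m != i for x in fold]
        yield train, test

# Hypothetical list of known (drug, protein) interaction pairs.
pairs = [(f"drug{n}", f"prot{n % 3}") for n in range(10)]
splits = list(k_fold_splits(pairs, k=5))

for train, test in splits:
    # Here one would fit on `train` (build the contingency table and
    # adjusted residuals) and then score the pairs in `test`.
    pass
```

Each pair appears in exactly one test fold, so every known interaction is predicted once by a model that did not see it during training.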
A ROC (receiver operating characteristic) curve [21] based on the experimental results can be obtained as shown in Figure 3.
While $w(D_i, P_j)$ reflects the existence of significant interdependency relationships between drug substructures $sub_m$ and protein domains $dom_n$, it does not tell us how strong these relationships are. To find out, we compute, as discussed above, the $WoE(sub_m, dom_n)$ measure for the interaction between $sub_m$ and $dom_n$. Some of the results are presented in Table 3.

Discussion
The ROC curve in Figure 3 plots the true-positive rate against the false-positive rate for the prediction results of the experiments. The true-positive rate is the proportion of actual drug-protein interactions that are correctly predicted, whereas the false-positive rate is the proportion of non-interacting pairs that are incorrectly predicted as interacting. We can see from the chart that D-PIA predicts drug-protein interactions well, as most of the ROC curve lies well above the reference line (random prediction). The AUC (area under the ROC curve), which is 1 for perfect prediction and 0.5 for random prediction, is 0.7497 for D-PIA, which shows that it is much better than random prediction.
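For readers unfamiliar with the AUC, the sketch below computes it through its rank interpretation: the probability that a randomly chosen interacting pair receives a higher score than a randomly chosen non-interacting pair, with ties counted as one half. The function name `auc` and the toy scores and labels are hypothetical, and this is not necessarily how the value 0.7497 was obtained.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels -- 1 for an interacting pair, 0 for a non-interacting pair
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical prediction scores and true labels for six pairs.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
toy_auc = auc(scores, labels)
```

A scorer that ranks every interacting pair above every non-interacting pair yields 1, and one that assigns the same score to all pairs yields 0.5, which matches the two reference values quoted above.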
These results show that D-PIA can be used to predict how likely a drug candidate is to interact with a particular protein. Based on the WoE values shown in Table 3, we also know that candidate drugs containing the substructure SUB695 are significantly interdependent with proteins containing the domain PF04960, among others. We believe that interdependency relationships and WoE measures such as those shown in Table 3 could be very useful for drug discovery, pharmacological analysis, the study of ligand specificity, etc.

Conclusions
One common approach to drug discovery is to tackle the protein-ligand docking problem. To do so effectively, information about the 3D structures of the molecules involved is needed. As such information is difficult and expensive to obtain, D-PIA is proposed here to discover patterns in known drug-protein interactions and to use them to predict unknown ones, so that the problem can be tackled more easily without relying on any 3D information. D-PIA makes use of the fingerprints of known drug substructures and protein domains to infer the existence of interactions between the corresponding drugs and proteins. Experimental results show that D-PIA works effectively, can infer drug-protein interactions with high accuracy, and can be a promising tool for computer-aided drug discovery.