Discovering patterns in drug-protein interactions based on their fingerprints

Abstract

Background

Discovering interesting patterns in drug-protein interaction data at the molecular level can reveal hidden relationships among drugs and proteins and can therefore be of great importance for applications such as drug design. To discover such patterns, we propose a computational approach for analyzing the molecular data of drugs and proteins that are known to interact with each other. Specifically, we propose a data mining technique called Drug-Protein Interaction Analysis (D-PIA) to determine whether there are any commonalities in the fingerprints of the substructures of interacting drug and protein molecules and, if so, whether any patterns can be generalized from them.

Method

Given a database of drug-protein interactions, D-PIA performs its tasks in five steps. First, for each drug in the database, the fingerprints of its molecular substructures are obtained. Second, for each protein in the database, the fingerprints of its protein domains are obtained. Third, based on known interactions between drugs and proteins, an interdependency measure between the fingerprint of each drug substructure and each protein domain is computed. Fourth, based on the interdependency measure, drug substructures and protein domains that are significantly interdependent are identified. Fifth, the existence of an interaction relationship between a previously unknown drug-protein pair is predicted based on their constituent substructures and domains that are significantly interdependent.

Results

To evaluate the effectiveness of D-PIA, we tested it with real drug-protein interaction data involving enzymes, ion channels, and G protein-coupled receptors. Experimental results show that there are indeed patterns to be discovered in the interdependency relationships between the drug substructures and protein domains of interacting drugs and proteins. Based on these relationships, a testing set of drug-protein data is used to see whether D-PIA can correctly predict the existence of interactions between drug-protein pairs. The results show that the prediction accuracy can be high: the AUC score of the ROC plot reaches about 0.75, which is well above the 0.5 expected from random prediction.

Conclusions

D-PIA has the advantage that it performs its tasks based on the fingerprints of drug and protein molecules without requiring any 3D information about their structures, and it is therefore very fast to compute. D-PIA has been tested with real drug-protein interaction data, and experimental results show that it can be very useful for predicting previously unknown drug-protein as well as protein-ligand interactions. It can also be used to tackle problems such as ligand specificity, which are related directly and indirectly to drug design and discovery.

Background

The chemical pathways in our bodies are affected by diseases in many different and extremely complex ways. When one is sick, the cause may be a fault in a single reaction in a pathway that stops an important protein from being produced or causes too much of it to be produced. To correct such faults, drug molecules can be developed to interact with target protein molecules so as to activate or inhibit some of their functions, thereby causing more or less of a protein to be produced. To facilitate drug design and discovery, it would therefore be very useful to be able to predict whether a particular drug candidate may interact with a particular target protein based on their structures at the molecular or sub-molecular level.

Over the past decade, much effort has been devoted to investigating how drugs and proteins interact, and the most notable work is that on protein-ligand docking [1]. A ligand is a molecule that binds to another chemical entity to form a larger complex, and protein-ligand docking is concerned with predicting the position and orientation of a ligand when it binds to a protein receptor. If a ligand candidate that binds to a certain target can be found, drug molecules can then be designed to contain this ligand. However, finding such a ligand candidate is difficult, as protein-ligand docking requires knowledge of the 3D structures of the proteins, and such knowledge can be very difficult to obtain [2].

Besides protein-ligand docking, there has also been some effort to analyze molecular substructures [3] and biological activities [4]. In [3], for example, "privileged" substructures are introduced as chemical substructures that are commonly present in many drugs. In other words, when predicting whether a drug may interact with a protein, one can search for the presence of such privileged substructures in the drug molecule as an indicator of the likelihood of an interaction. While this approach may sound reasonable, it is considered controversial, as the abundance of such substructures in drugs may be a trivial consequence of their abundance in biochemical molecules.

Other than finding privileged substructures, a variety of statistical methods have recently been proposed to predict drug-target or, more generally, protein-ligand interactions [5, 6]. There have also been some attempts to mine structural patterns from biological or biochemical data based on molecular fingerprints. Molecular fingerprints, first introduced in [7], are representations of chemical structures originally designed to assist in chemical database search. They later became widely used for data analysis tasks such as similarity search [8], clustering [9], and classification [10]. In such tasks, molecular fingerprints have been used to encode a wide range of 2D and 3D structural or conformational features of molecules. A novel method for representing and analyzing 3D protein-ligand binding interactions, for example, is proposed in [11]. The key to that method is to analyze the fingerprints obtained by translating the 3D structural binding information of a protein-ligand complex into a one-dimensional binary string.

Most of the work mentioned above has been performed from the viewpoint of either ligands or proteins alone. Not much work has been done to investigate how the chemical and biological spaces may interact with each other. The work in [2] reports an attempt to connect the two spaces. It proposes an approach to extract drug substructures and protein domains from a drug-protein interaction dataset by encoding the chemical substructures of the drugs and the protein domains of the proteins as molecular fingerprints, and it explains how sparse canonical correspondence analysis (SCCA) can be performed on the data. As pointed out in that paper, the effectiveness of the approach depends very much on the correct setting of a number of predefined parameters, and the method may not work well when sparsity is not a relevant characteristic of the data.

To identify ligand candidates efficiently for applications such as drug design and discovery, we need to be able to predict whether a drug may interact with a protein without having to obtain full information about the 3D structure of the protein at an early stage. To do so, we propose a data mining algorithm called D-PIA (Drug-Protein Interaction Analysis). Instead of relying on the availability of the 3D structural information of a target protein to predict whether it may interact with a certain drug candidate, D-PIA makes use only of the 2D molecular fingerprints of the protein in the prediction process.

Proteins are molecules consisting of long chains of amino acids with unique structures and substructures. A protein domain is a part of a protein chain that can evolve, function, and exist independently of the rest of the chain [12]. D-PIA performs its tasks by first breaking down drug molecules into substructures and proteins into their protein domains. By so doing, D-PIA attempts to determine whether the drug substructures may interact or bind with the protein domains, and whether the strength of such interactions or bindings may determine if drugs can be designed for optimal compatibility with the human body and with other drugs [13].

Once the drug substructures and protein domains are identified, D-PIA makes use of a probabilistic measure to determine whether a drug substructure and a protein domain are interdependent. It does so in five steps: (i) for each drug in the database, the fingerprints of its molecular substructures are obtained; (ii) for each protein in the database, the fingerprints of its protein domains are obtained; (iii) based on known interactions between drugs and proteins, an interdependency measure between the fingerprint of each drug substructure and each protein domain is computed; (iv) based on the interdependency measure, drug substructures and protein domains that are significantly interdependent are identified; and (v) the existence of an interaction relationship between a previously unknown drug-protein pair is predicted based on their constituent substructures and domains that are significantly interdependent.

D-PIA has been tested with real data involving nearly two thousand drugs and the proteins that they interact with. Our experimental results show that it can be very helpful for predicting drug-protein and protein-ligand interactions. It can also be used to address problems such as ligand specificity.

Methods

Suppose that we have a set of $M$ drugs $\{D_1, D_2, \ldots, D_i, \ldots, D_M\}$, each characterized by $p$ substructure descriptors. Suppose also that we have a set of $N$ proteins $\{P_1, P_2, \ldots, P_j, \ldots, P_N\}$, each characterized by $q$ protein domain descriptors.

Each of the $M$ drugs can therefore be represented as $D_i = (sub_{i1}, sub_{i2}, \ldots, sub_{ix}, \ldots, sub_{ip})$, where $sub_{ix}$ indicates the $x$th substructure of the $i$th drug, $i \in \{1, 2, \ldots, M\}$, $x \in \{1, 2, \ldots, p\}$, and $sub_{ix} = 1$ when the $x$th substructure exists in the drug; otherwise $sub_{ix} = 0$. Similarly, each protein can be represented as $P_j = (dom_{j1}, dom_{j2}, \ldots, dom_{jy}, \ldots, dom_{jq})$, where $dom_{jy}$ indicates the $y$th protein domain of the $j$th protein, $j \in \{1, 2, \ldots, N\}$, $y \in \{1, 2, \ldots, q\}$, and $dom_{jy} = 1$ when the $y$th protein domain exists in the protein; otherwise $dom_{jy} = 0$. The existence of one or more interaction relationships between the given drugs and proteins is represented by a matrix $I = (\alpha_1, \alpha_2, \ldots, \alpha_M)^T$, where $\alpha_i = (\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{ik}, \ldots, \alpha_{iN})$, $i \in \{1, 2, \ldots, M\}$, $k \in \{1, 2, \ldots, N\}$, and $\alpha_{ik} = 1$ when there is a known interaction between the $i$th drug and the $k$th protein; otherwise $\alpha_{ik} = 0$.
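As a concrete illustration of this representation, the following minimal Python sketch stores the drugs, the proteins, and the interaction matrix as nested lists of 0/1 values. The toy values and the helper name `interacting_pairs` are hypothetical and are not taken from the dataset used in this paper.

```python
# Toy data; every value below is invented purely for illustration.
# Each drug is a binary vector of length p (here p = 3).
drugs = [
    [1, 0, 1],  # D_1 contains substructures 1 and 3
    [0, 1, 1],  # D_2 contains substructures 2 and 3
]

# Each protein is a binary vector of length q (here q = 2).
proteins = [
    [1, 0],  # P_1 contains domain 1
    [0, 1],  # P_2 contains domain 2
    [1, 1],  # P_3 contains both domains
]

# Interaction matrix I with M rows (drugs) and N columns (proteins).
interactions = [
    [1, 0, 1],  # D_1 interacts with P_1 and P_3
    [0, 1, 0],  # D_2 interacts with P_2
]


def interacting_pairs(matrix):
    """Return the zero-based (drug, protein) index pairs whose entry equals 1."""
    return [
        (i, k)
        for i, row in enumerate(matrix)
        for k, alpha in enumerate(row)
        if alpha == 1
    ]


pairs = interacting_pairs(interactions)
```

Note that the sketch uses zero-based indices, whereas the text numbers drugs, proteins, substructures, and domains from 1.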

Discovering interesting association patterns

To determine whether the $i$th drug substructure has a sufficiently strong interdependency relationship with the $j$th protein domain, we construct a contingency table (Table 1) of $P$ rows and $Q$ columns, with one row per drug substructure and one column per protein domain.

Table 1 Observed drug substructures and protein domains occurrence

In this table, $occ_{ij}$ denotes the number of occurrences in $I$ of the case in which $sub_i$ and $dom_j$ both take the value 1. Let $exp_{ij} = \frac{occ_{i+} \, occ_{+j}}{T}$ be the expected value of $occ_{ij}$, where $occ_{i+} = \sum_{k=1}^{Q} occ_{ik}$, $occ_{+j} = \sum_{k=1}^{P} occ_{kj}$, and $T = \sum_{l,k} occ_{lk}$. An interdependency relationship between $sub_i$ and $dom_j$ is considered to exist if $occ_{ij}$ is significantly different from $exp_{ij}$. To decide whether this is the case, the approach taken in [14] is used to calculate an adjusted residual test statistic:

$$ad_{ij} = \frac{z_{ij}}{\sqrt{\left(1 - \frac{occ_{i+}}{T}\right)\left(1 - \frac{occ_{+j}}{T}\right)}} \tag{1}$$

where

$$z_{ij} = \frac{occ_{ij} - exp_{ij}}{\sqrt{exp_{ij}}} \tag{2}$$

and $\left(1 - \frac{occ_{i+}}{T}\right)\left(1 - \frac{occ_{+j}}{T}\right)$ is the maximum likelihood estimate of the variance of $z_{ij}$ defined in [15].

$ad_{ij}$ has an approximately normal distribution with a mean of approximately zero and a variance of approximately one. Therefore, if its absolute value exceeds 1.96, it is considered significant at the 0.05 level by conventional criteria. Based on (1), we can thus determine whether a drug substructure $sub_i$ has an interdependency relationship with the protein domain $dom_j$ at the 95% confidence level.
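The test described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes that occ_ij is counted over the known interacting drug-protein pairs in I, that z_ij = (occ_ij - exp_ij) / sqrt(exp_ij), and that ad_ij = z_ij / sqrt(v_ij) with v_ij = (1 - occ_i+/T)(1 - occ_+j/T). The function names are hypothetical, and cells whose expected count or variance estimate is zero are simply left at 0.0.

```python
import math


def contingency_table(drugs, proteins, interactions):
    """Count, over known interacting pairs, how often sub_x = 1 and dom_y = 1 co-occur."""
    p, q = len(drugs[0]), len(proteins[0])
    occ = [[0] * q for _ in range(p)]
    for i, row in enumerate(interactions):
        for k, alpha in enumerate(row):
            if alpha != 1:
                continue  # assumption: only known interactions contribute
            for x in range(p):
                if drugs[i][x] != 1:
                    continue
                for y in range(q):
                    if proteins[k][y] == 1:
                        occ[x][y] += 1
    return occ


def adjusted_residuals(occ):
    """Return the adjusted residual for every cell of the contingency table."""
    row_tot = [sum(r) for r in occ]
    col_tot = [sum(c) for c in zip(*occ)]
    total = sum(row_tot)
    ad = [[0.0] * len(col_tot) for _ in row_tot]
    if total == 0:
        return ad
    for i, row in enumerate(occ):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total
            variance = (1 - row_tot[i] / total) * (1 - col_tot[j] / total)
            if expected == 0 or variance == 0:
                continue  # residual undefined for this cell; keep 0.0
            z = (observed - expected) / math.sqrt(expected)
            ad[i][j] = z / math.sqrt(variance)
    return ad


# Hypothetical 2 x 2 table of co-occurrence counts.
occ = [[30, 10], [10, 30]]
ad = adjusted_residuals(occ)
significant = [[abs(v) > 1.96 for v in row] for row in ad]
```

In this toy table every cell exceeds the 1.96 cut-off in absolute value, with positive residuals on the diagonal and negative residuals off it.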

It should be noted that the value of $ad_{ij}$ can be positive or negative. When $ad_{ij}$ is positive, $sub_i$ and $dom_j$ are interdependent on each other; when it is negative, they are not.

Determining the weight of evidence for the discovered patterns

Since the presence of substructures in a drug is important for determining its interaction with protein domains, it is necessary to ensure that they are utilized in the prediction of an interaction relationship between a drug and a protein. The test based on (1) and (2) determines only whether an interdependency exists between a drug substructure and a protein domain; it does not measure how strong the interdependency is. For this reason, we introduce a weight of evidence measure for the patterns discovered above.

Suppose that $dom_j = 1$ is found to be interdependent with $sub_i = 1$. Then the weight of evidence provided by $sub_i = 1$ in favor of $dom_j = 1$ as opposed to $dom_j = 0$ can be defined as [16]:

$$WoE(dom_j = 1 / dom_j = 0 \mid sub_i = 1) = I(dom_j = 1 : sub_i = 1) - I(dom_j = 0 : sub_i = 1) \tag{3}$$

where

$$I(dom_j = 1 : sub_i = 1) = \log \frac{\Pr(dom_j = 1 \mid sub_i = 1)}{\Pr(dom_j = 1)} \tag{4}$$

$$I(dom_j = 0 : sub_i = 1) = \log \frac{\Pr(dom_j = 0 \mid sub_i = 1)}{\Pr(dom_j = 0)} \tag{5}$$

WoE can thus serve as a positive or negative measure that supports or refutes the existence of an interaction relationship between a drug containing $sub_i$ and a protein containing $dom_j$. Hence, for a drug to be predicted to interact with a target protein, it should have sufficient support from its substructures, in the sense that they should have a large enough degree of interdependency with the protein domains of the target protein.
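The weight of evidence in (3) to (5) can be illustrated with a short Python sketch. The text does not state how the probabilities are estimated or which logarithm base is used, so this sketch takes the two probabilities as inputs and assumes the natural logarithm. The function name `weight_of_evidence` is hypothetical.

```python
import math


def weight_of_evidence(p_dom_given_sub, p_dom):
    """WoE of sub_i = 1 in favor of dom_j = 1 as opposed to dom_j = 0.

    p_dom_given_sub: Pr(dom_j = 1 | sub_i = 1)
    p_dom:           Pr(dom_j = 1)
    Both must lie strictly between 0 and 1 so that every logarithm is defined.
    """
    if not (0 < p_dom_given_sub < 1 and 0 < p_dom < 1):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    # Equation (4): information in favor of dom_j = 1.
    info_for = math.log(p_dom_given_sub / p_dom)
    # Equation (5): information in favor of dom_j = 0.
    info_against = math.log((1 - p_dom_given_sub) / (1 - p_dom))
    # Equation (3): the difference of the two.
    return info_for - info_against


# Hypothetical probabilities: the domain is four times as likely when the substructure is present.
woe_positive = weight_of_evidence(0.8, 0.2)
woe_neutral = weight_of_evidence(0.5, 0.5)
woe_negative = weight_of_evidence(0.2, 0.8)
```

A positive value supports the interaction, a value of zero is neutral, and a negative value counts against it.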

Evaluation of D-PIA

One way to evaluate the effectiveness of D-PIA is to see if it can correctly predict drug-protein interactions that it has no previous knowledge of. Here we propose to evaluate D-PIA by testing it to see if it can predict known drug-target interactions correctly.

Given a pair of drug $D_i$ and protein $P_j$, the potential interaction between them can be estimated by determining whether there is any significant interdependency between the substructures in $D_i$ and the protein domains in $P_j$. To do so, let us denote the set of substructures in $D_i$ as $DS_i = \{s_1, s_2, \ldots, s_a\}$ and the set of domains in $P_j$ as $PD_j = \{d_1, d_2, \ldots, d_b\}$, where $a$ is the total number of substructures in $D_i$ and $b$ is the total number of protein domains in $P_j$. For all $s' \in DS_i$ and $d' \in PD_j$, we consider the interdependency between $s'$ and $d'$ significant when (6) below is satisfied.

$$|ad_{s'd'}| > 1.96 \tag{6}$$

For a pair $D_i$ and $P_j$, there are $a \times b$ possible interdependency relationships between substructures and protein domains in total. The potential interaction between $D_i$ and $P_j$ can be estimated from how many of these relationships are significant. If only one of the $a \times b$ possible relationships is significant, we may consider the potential interaction between $D_i$ and $P_j$ to be very weak. On the other hand, if more than half of them are significant, we may consider the potential interaction to be strong. We therefore quantify the potential interaction between $D_i$ and $P_j$ as in (7).

$$w(D_i, P_j) = \frac{\sum_{s' \in DS_i} \sum_{d' \in PD_j} val(ad_{s'd'})}{a \times b} \tag{7}$$

where $val(x) = 1$ if $|x| > 1.96$; otherwise $val(x) = 0$.

The interaction between drug $D_i$ and protein $P_j$ is considered significant if $w(D_i, P_j)$ exceeds a user-supplied threshold $R$, i.e., if $w(D_i, P_j) > R$. If, at the same time, $WoE(D_i, P_j)$ is also high, the interaction between $D_i$ and $P_j$ is not only strong but also supported by strong evidence.
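The score in (7) and the threshold test can be sketched in Python as follows. This is an illustrative sketch only. It assumes that the adjusted residuals are available as a nested list indexed by substructure and domain, and the names `interaction_score` and `predict_interaction` are hypothetical.

```python
def val(x, critical=1.96):
    """Return 1 if the adjusted residual is significant, otherwise 0."""
    return 1 if abs(x) > critical else 0


def interaction_score(drug, protein, ad, critical=1.96):
    """Fraction of (substructure, domain) pairs of a drug-protein pair that are significant."""
    subs = [x for x, bit in enumerate(drug) if bit == 1]     # DS_i
    doms = [y for y, bit in enumerate(protein) if bit == 1]  # PD_j
    if not subs or not doms:
        return 0.0  # assumption: no pairs to test means no support
    hits = sum(val(ad[x][y], critical) for x in subs for y in doms)
    return hits / (len(subs) * len(doms))


def predict_interaction(drug, protein, ad, threshold):
    """Predict an interaction when w(D_i, P_j) exceeds the threshold R."""
    return interaction_score(drug, protein, ad) > threshold


# Hypothetical adjusted residuals for 3 substructures and 2 domains.
ad = [[2.5, 0.3], [-3.0, 1.0], [0.1, 0.2]]
score = interaction_score([1, 1, 0], [1, 1], ad)
predicted = predict_interaction([1, 1, 0], [1, 1], ad, threshold=0.10)
```

Here two of the four tested pairs are significant, so the score is 0.5 and the pair is predicted to interact at a threshold of 10%.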

Results

To evaluate the effectiveness of D-PIA, we used the dataset from [2], which contains information about 1862 drugs. Each drug in the dataset is represented by a fingerprint covering the 881 substructures defined in the PubChem database [17], i.e., each drug is encoded as a binary vector whose elements encode the presence or absence of a chemical substructure using 1 and 0, respectively. An example of such a fingerprint is given in Figure 1.
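The encoding can be illustrated with a minimal Python sketch that turns a list of substructure positions into an 881-element binary vector. The zero-based positions used here are hypothetical and do not correspond to any particular PubChem substructure.

```python
N_SUBSTRUCTURES = 881  # fingerprint length stated in the text


def encode_fingerprint(present, length=N_SUBSTRUCTURES):
    """Build a binary vector with 1 at each listed zero-based position and 0 elsewhere."""
    bits = [0] * length
    for idx in present:
        if not 0 <= idx < length:
            raise ValueError(f"position {idx} is outside the fingerprint")
        bits[idx] = 1
    return bits


# Hypothetical drug that contains three of the 881 substructures.
fingerprint = encode_fingerprint([0, 840, 880])
```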

Figure 1 Drug structure and PubChem molecular fingerprint.

Besides the drugs, the dataset also contains information about 1554 proteins. According to the UniProt [18] and Pfam [19] databases, these proteins are associated with a total of 876 protein domains, and thus each protein can be encoded as a binary vector whose elements encode the presence or absence of a protein domain using 1 and 0, respectively. An example of a protein sequence and its protein domains is given in Figure 2.

Figure 2 An example of a protein sequence and its protein domains [20].

Given the drugs and proteins described above, D-PIA determines the adjusted residuals for the drug substructures and protein domains based on Equations (1) and (2) above. In Table 2, we list some of the adjusted residuals that D-PIA computes to determine whether there is a significant interdependency relationship between a drug substructure and a protein domain. As shown in the table, for example, the drug substructures SUB840, SUB841, and SUB861 are interdependent with the protein domains PF00104 and PF00105.

Table 2 Some examples of adjusted residuals

To evaluate the effectiveness of D-PIA, we then determine whether there is a strong enough interaction between each drug $D_i$ and protein $P_j$ in our dataset based on the adjusted residuals obtained between the substructures of the drug and the domains of the protein, as illustrated in Table 2. We set $R$ to 10% in our experiments and found that D-PIA predicts the existence of drug-protein interactions with an accuracy of 85.4%. A 5-fold cross-validation approach is used to evaluate the ability of D-PIA to determine whether a drug interacts with a protein, as follows:

  1. We split the drug-protein interaction dataset into five subsets of equal size and take each subset in turn as a test set.

  2. We perform D-PIA on the remaining four subsets.

  3. Based on the significant interdependency relationships determined between drug substructures and protein domains, D-PIA attempts to predict the existence of interactions between the drugs and proteins in the test set, and the accuracy over the five folds is computed.
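The splitting procedure above can be sketched in Python as follows. The text does not say how the records are assigned to subsets, so this sketch assumes a random shuffle with a fixed seed. The function name `five_fold_splits` is hypothetical, and when the number of records is not divisible by five the fold sizes differ by at most one.

```python
import random


def five_fold_splits(items, n_folds=5, seed=0):
    """Yield (train, test) lists, using each fold once as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumption: random assignment to folds
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test


# Hypothetical dataset of ten interaction records identified by integers.
splits = list(five_fold_splits(range(10)))
```

In each iteration, the interdependency relationships would be learned from `train` only and then used to score the pairs in `test`.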

A ROC (receiver operating characteristic) curve [21] based on the experimental results can be obtained as shown in Figure 3.

Figure 3 ROC curve for the experiments.

While $w(D_i, P_j)$ reflects the existence of significant interdependency relationships between the drug substructures $sub_m$ and protein domains $dom_n$ of a drug-protein pair, it does not tell us how strong each interdependency relationship is. To find out, we compute, as discussed above, the WoE measure for the interdependency between $sub_m$ and $dom_n$. Some of the results are presented in Table 3.

Table 3 High values of adjusted residual and WoE for drug substructure and protein domain interactions

Discussion

The ROC curve in Figure 3 plots the true-positive rate against the false-positive rate for the prediction results of the experiments. The true-positive rate is the rate of correctly predicted drug-protein interactions, whereas the false-positive rate is the rate of incorrectly predicted drug-protein interactions.

We can see from the chart that D-PIA predicts drug-protein interactions well: most of the ROC curve lies well above the reference line (random prediction). The AUC (area under the ROC curve) score, which is 1 for perfect accuracy and 0.5 for random prediction, is 0.7497 for D-PIA, which shows that it performs much better than prediction at random.
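For readers who wish to compute such a score, the following Python sketch evaluates the AUC through its rank interpretation: the probability that a randomly chosen interacting pair receives a higher score than a randomly chosen non-interacting pair, with ties counted as one half. This is a generic illustration and not necessarily the procedure used in the experiments. The function name `auc` and the example scores are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve from prediction scores and 0/1 labels."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0   # positive ranked above negative
            elif sp == sn:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical scores for two interacting (label 1) and two non-interacting (label 0) pairs.
example_auc = auc([0.9, 0.3, 0.8, 0.1], [1, 1, 0, 0])
```

In the example, three of the four positive-negative comparisons are ranked correctly, which gives an AUC of 0.75.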

These results show that D-PIA can be used to predict how likely a drug candidate is to interact with a particular protein. Based on the WoE computed as shown in Table 3, we also know, for example, that the substructure SUB695 in candidate drugs is significantly interdependent with the protein domain PF04960. We believe that interdependency relationships and WoE measures such as those shown in Table 3 could be very useful for drug discovery, pharmacological analysis, the study of ligand specificity, and so on.

Conclusions

One common approach to drug discovery is to tackle the protein-ligand docking problem. Doing so effectively requires information about 3D structures. As such information is difficult and expensive to obtain, D-PIA is proposed here to discover patterns in known drug-protein interactions and to predict unknown ones, so that the problem can be approached without relying on any 3D information. D-PIA makes use of the fingerprints of known drug substructures and protein domains to infer the existence of interactions between the corresponding drugs and proteins. Experimental results show that D-PIA works effectively, can infer drug-protein interactions with high accuracy, and can be a promising tool for computer-aided drug discovery.

References

  1. Sousa SF, Fernandes PA, Ramos MJ: Protein-ligand docking: current status and future challenges. Proteins. 2006, 65: 15-26. 10.1002/prot.21082.

  2. Yamanishi Y, Pauwels E, Saigo H, Stoven V: Extracting sets of chemical substructures and protein domains governing drug-target interactions. J Chem Inf Model. 2011

  3. DeSimone RW, Currie KS, Mitchell SA, Darrow JW, Pippin DA: Privileged structures: applications in drug discovery. Comb Chem High Throughput Screen. 2004, 7: 473-494.

  4. Klekota J, Roth FP: Chemical substructures that enrich for biological activity. Bioinformatics. 2008, 24: 2518-2525. 10.1093/bioinformatics/btn479.

  5. Keiser MJ: Predicting new molecular targets for known drugs. Nature. 2009, 462: 175-181. 10.1038/nature08506.

  6. Keiser MJ, Roth BL, Armbruster BN, Ernsberger P, Irwin JJ, Shoichet BK: Relating protein pharmacology by ligand chemistry. Nat Biotechnol. 2007, 25: 197-206. 10.1038/nbt1284.

  7. Kauvar LM, Higgins DL, Villar HO, Sportsman JR, Engqvist-Goldstein A, Bukar R, Bauer KE, Dilley H, Rocke DM: Predicting ligand binding to proteins by affinity fingerprinting. Chem Biol. 1995, 2: 107-118. 10.1016/1074-5521(95)90283-X.

  8. Johnson MA, Maggiora GM: Concepts and Applications of Molecular Similarity. 1990, New York: Wiley

  9. McGregor MJ, Pallai PV: Clustering of large databases of compounds: using the MDL 'keys' as structural descriptors. J Chem Inf Comput Sci. 1997, 37: 443-448. 10.1021/ci960151e.

  10. Xue L, Godden JW, Stahura FL, Bajorath J: Design and evaluation of a molecular fingerprint involving the transformation of property descriptor values into a binary classification scheme. J Chem Inf Comput Sci. 2003, 43: 1151-1157. 10.1021/ci030285+.

  11. Deng Z, Chuaqui C, Singh J: Structural interaction fingerprint (SIFt): a novel method for analyzing three-dimensional protein-ligand binding interactions. J Med Chem. 2004, 47: 337-344. 10.1021/jm030331x.

  12. Protein Domain. [http://en.wikipedia.org/wiki/Protein_domain]

  13. Harris DC: Quantitative Chemical Analysis. 1998, Freeman and Co, 2

  14. De Jong KA, Spears W: An analysis of the interacting roles of population size and crossover in genetic algorithms. Proceedings of the First International Conference on Parallel Problem Solving from Nature. 1990, October; Dortmund

  15. De Jong KA, Spears W: On the virtues of parameterized uniform crossover. Proceedings of the 4th International Conference on Genetic Algorithms. 1991, San Mateo: Morgan Kaufmann Publishers, 230-236.

  16. Osteyee DB, Good IJ: Information, Weight of Evidence, the Singularity between Probability Measures and Signal Detection. 1974, Berlin: Springer-Verlag

  17. Chen B, Wild D, Guha R: PubChem as a source of polypharmacology. J Chem Inf Model. 2009, 49: 2044-2055. 10.1021/ci9001876.

  18. Uniprot Consortium: The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 2010, 38: D142-D148.

  19. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz H, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A: The Pfam protein families database. Nucleic Acids Res. 2008, 36: D281-D288. 10.1093/nar/gkn226.

  20. Trust Sanger Institute: PFAM. [http://pfam.sanger.ac.uk/search/sequence]

  21. Gribskov M, Robinson N: Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput Chem. 1996, 20: 25-33. 10.1016/S0097-8485(96)80004-0.

Acknowledgements

This article has been published as part of BMC Bioinformatics Volume 13 Supplement 9, 2012: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2011: Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S9.

Author information

Corresponding author

Correspondence to Weimin Luo.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

Weimin Luo carried out this study and drafted the manuscript. Keith CC Chan conceived this study and participated in its design and helped to draft the manuscript. All the authors read and approved the final manuscript.

Weimin Luo and Keith CC Chan contributed equally to this work.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Luo, W., Chan, K.C. Discovering patterns in drug-protein interactions based on their fingerprints. BMC Bioinformatics 13 (Suppl 9), S4 (2012). https://doi.org/10.1186/1471-2105-13-S9-S4

  • DOI: https://doi.org/10.1186/1471-2105-13-S9-S4