Mapping small molecule binding data to structural domains
© Kruger et al.; licensee BioMed Central Ltd. 2012
Published: 13 December 2012
Large-scale bioactivity/SAR Open Data has recently become available, and this has allowed new analyses and approaches to be developed to help address the productivity and translational gaps of current drug discovery. One of the current limitations of these data is the relative sparsity of reported interactions per protein target, and complexities in establishing clear relationships between bioactivity and targets using bioinformatics tools. We detail in this paper the indexing of targets by the structural domains that bind (or are likely to bind) the ligand within a full-length protein. Specifically, we present a simple heuristic to map small molecule binding to Pfam domains. This profiling can be applied to all proteins within a genome to give some indications of the potential pharmacological modulation and regulation of all proteins.
In this implementation of our heuristic, ligand binding to protein targets from the ChEMBL database was mapped to structural domains as defined by profiles contained within the Pfam-A database. Our mapping suggests that the majority of assay targets within the current version of the ChEMBL database bind ligands through a small number of highly prevalent domains, and conversely the majority of Pfam domains sampled by our data play no currently established role in ligand binding. Validation studies, carried out firstly against Uniprot entries with expert binding-site annotation and secondly against entries in the wwPDB repository of crystallographic protein structures, demonstrate that our simple heuristic maps ligand binding to the correct domain in about 90 percent of all assessed cases. Using the mappings obtained with our heuristic, we have assembled ligand sets associated with each Pfam domain.
Small molecule binding has been mapped to Pfam-A domains of protein targets in the ChEMBL bioactivity database. The result of this mapping is an enriched annotation of small molecule bioactivity data and a grouping of activity classes following the Pfam-A specifications of protein domains. This is valuable for data-focused approaches in drug discovery, for example when extrapolating potential targets of a small molecule with known activity against one or few targets, or in the assessment of a potential target for drug discovery or screening studies.
Research in the field of drug discovery is increasingly driven by the data mining of large-scale pharmacological, screening, patent, literature and other bioactivity data. Such approaches have led to interesting concepts that challenge historical dogma, for example the view that many small molecules, and indeed drugs, exert their effect through interactions with multiple targets rather than a single target. New targets have been predicted for FDA-approved drugs through analysis of large-scale bioactivity databases and side-effect data mined from package inserts.
The discipline of combining small molecule bioactivity, the 'ligand space', with bioinformatics analyses of the 'target space' is also known as chemogenomics [4, 5]. Chemogenomic approaches can be used to systematically explore the binding of small molecules to large target families such as kinases [6, 7] or G-protein coupled receptors (GPCRs) [8, 9], or to design compounds targeting multiple proteins. One of the current limitations of these approaches is the biased distribution of data available for individual targets. While there are a few prominent target classes, such as certain GPCR families, protein kinases and various protease families, for which the bioactivity of many thousands of ligands has been measured, most targets have measured bioactivities for only a few compounds or no annotation at all. To partially address this limitation, we propose an indexing of target space at the structural domain level, which allows ligands known to bind targets containing a given structural domain to be aggregated into a larger bioactivity class. In practice, analysing large-scale bioactivity data in this way requires automatic and reliable annotation of large numbers of protein targets with the domain that contains the site of small molecule binding. We therefore propose to map small molecule binding to structural domains and present an initial implementation for targets in the ChEMBL database (version chembl_13). Previous studies have statistically associated small molecule binding with protein domains, and direct mapping has been applied to ligands in crystallographic structures. Here we extrapolate these mappings to pharmacologically relevant interactions described in the ChEMBL database.
Structural domains are independent folding units that form the basic evolutionary and architectural 'building blocks' of proteins. While there can be large sequence differences between members of a domain family, the fold of the peptide backbone is generally conserved, even though exceptional cases of homologous proteins with differing folds have been identified and discussed. A small protein typically consists of one domain, while longer proteins are often an assembly of more than one domain. In some eukaryotic proteins, the underlying intron-exon structure of the gene reflects this structural domain segmentation. For the mapping of small molecule binding, targets consisting of combinations of domains pose a challenge because the binding site for the ligand might lie in any of the domains and, in addition, more than one domain in a protein might interact with the same or different ligands. Domain assignment information is available from a number of public resources. SCOP and CATH are databases that define protein architecture based on hierarchical definitions of 3D structural domains. Pfam-A is a database of profile hidden Markov models built from non-overlapping full domain sequence alignments. Pfam-A domain definitions are also manually annotated and curated. Interpro is a database that integrates different domain models into a comprehensive set of protein domains. For our purposes, the Pfam-A database, with its non-overlapping, non-hierarchical architecture and extensive coverage of protein families, is ideal for mapping ligand binding to a given protein domain. In this study, we propose a simple heuristic to map the site of small molecule binding to Pfam-A domains and compare our results with binding site information from the protein sequence database Uniprot and from PDBe, a repository of crystallographic protein structures.
In order to assess the impact of incomplete annotation for our set of ChEMBL targets, we determined for each target the number of residues belonging to a Pfam domain as a fraction of the number of residues in the overall protein sequence. We found that for the entire set of human proteins, the median of this fraction is 0.50 and about a quarter of all proteins have less than 20 percent of all residues assigned to a Pfam domain. The low ratio of residues within Pfam domains is likely due to incomplete coverage of the human proteome by Pfam-A models. For human protein targets in the ChEMBL database, the ratio of residues within Pfam domains is significantly higher (p < 2.2*10^-16, Bonferroni adjusted for multiple testing): the median proportion of Pfam residues relative to sequence length is 0.72. In comparison, this ratio is 0.69 for all protein targets in the ChEMBL database, including non-human protein targets. Previous work suggests that proteins consist mainly of highly structured regions [20, 21]. Therefore, we propose that coverage of Pfam-A domain annotation is almost complete for most ChEMBL targets but not for the entire set of human proteins. This is most likely due to the preference of drug discovery programs for well-characterized targets and the priority given to disease-related proteins in functional and structural studies.
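The coverage fraction described above can be illustrated for a single protein as follows. This is a minimal sketch and not the code used in the study: the function name pfam_coverage and the example boundaries are hypothetical, and domain boundaries are assumed to be 1-based and inclusive.

```python
def pfam_coverage(sequence_length, domains):
    """Fraction of residues that fall inside at least one Pfam domain.

    domains: iterable of (start, end) pairs, assumed 1-based and inclusive.
    """
    covered = set()
    for start, end in domains:
        # Clip to the sequence so out-of-range boundaries cannot inflate the ratio.
        covered.update(range(max(1, start), min(sequence_length, end) + 1))
    return len(covered) / sequence_length


# Hypothetical 100-residue protein with one domain spanning residues 11-60.
single = pfam_coverage(100, [(11, 60)])
# Overlapping domain annotations are counted once per residue.
overlapping = pfam_coverage(100, [(1, 10), (5, 20)])
```

Applying such a function to every target and taking the median of the resulting fractions gives the kind of summary statistic reported in the paragraph above.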
Our mapping of ligand binding to discrete Pfam domains is based on the assumption that small molecule binding takes place within the structurally conserved region of a protein domain rather than in the surrounding non-Pfam domain regions. Following this premise, and assuming that the annotation with Pfam domains for our set of ChEMBL targets is complete, the mapping of small molecule binding is immediately achieved for proteins with a single domain. Thus, with our initial assumption, the heuristic covers 50% of all protein targets in the ChEMBL target dictionary. To estimate the accuracy of the outlined assumptions, we carried out systematic queries against UniprotKB/Swiss-Prot and PDBe and evaluated the overlap of binding site annotations and Pfam domain predictions. The Methods section 'Code and queries' describes the queries in detail.
Given that about half of all proteins in the ChEMBL target dictionary have more than one domain, we investigated ways to expand our mapping of small molecule binding from targets with only a single domain to targets with multiple domains. We had observed that, in single-domain proteins, small molecule binding takes place within the boundaries of the domain with high probability. We prepared a set of single-domain protein targets from the ChEMBL database by selecting each protein that had at least one ligand tested against it in a binding assay with a reported activity value less than or equal to 50 μM (see also the Methods sections 'Mapping' and 'Manual curation of input data'). The occurrence of a domain in this set is thus a validation of the domain's potential to mediate a small molecule binding interaction. In the following, we consider all domains from this set as 'seed' domains with the potential to mediate small molecule binding. If such a 'seed' domain co-occurs with one or more 'non-seed' domains, our mapping defaults to the previously established seed domain. Hence, the mapping follows a heuristic based on the assumption that domains with known ligands take precedence over domains that do not occur in single-domain proteins with known ligands. For example, in protein kinase Akt-3 (Q9Y243), which also contains a Pkinase_C and a PH domain, the target of small molecule binding is the Pkinase domain. In total, our mapping covers 197,642 activities. A table with all mappings is provided in Additional file 2.
Table columns: % correct Uniprot (N = 511); % correct PDBe (N = 217); # total predictions.
Table: Combinations of co-occurring validated domains. Columns: # ChEMBL targets; # PDB accessions.
Table: Small molecule binding at the interface of two or more Pfam-A domains. Column: # ChEMBL targets. Row: Topoisom_I, Topo_C_assoc, Topoisom_I_N (0.35, 0.31, 0.35).
Table: Pfam domains with most ligands tested in binding assays.
Table: Statistical analysis of power-law parameters. Columns: Frequency of Pfam domains; Ligands per Pfam domain family; Ligands per target. Row labels: Goodness of fit; support for power-law. Entries as extracted: yes (p = 5.1*10^-9); yes/no (p = 0.48); yes/no (p = 0.57); yes (p = 3.9*10^-3); yes (p = 0.10); yes (p = 8.5*10^-8); yes (p = 2.1*10^-4); yes/no (p = 0.16); no (p = 1.0*10^-3).
Table: Loadings of the principal components.
In this study, we show that small molecule binding sites are associated with the regions in a protein that map to a Pfam domain, and hence typically have a discrete structure defined by a conserved sequence profile. We exploit this knowledge to map small molecule binding to Pfam domains in single- and multi-domain proteins. The integration of small molecule bioactivity data from the ChEMBL database and (predicted) structural data from Pfam will drive cross-linking across databases and deeper semantic annotation for chemical biology. In addition, our mapping allowed for an analysis of the distribution of known small molecule ligands per Pfam domain. The power-law behavior of this distribution mirrors the genomic distribution of protein folds and the incremental progression of drug discovery.
The heuristic presented here is simple and efficient. However, the mapping does not address two naturally occurring edge cases. Firstly, a number of Pfam domains occur only in combination with other domains and hence are not picked up in the initial seeding step. We address this partially by manually including such domains if they occur in more than one hundred ChEMBL targets. The second case is the relatively rare occurrence of ligand binding at the interface of domains, as discussed in the section on mapping small molecule binding to multidomain proteins.
The mapping described in this study further provides ligand sets for the development of methods to predict bioactivity for new compounds and gives an estimate of the chemical space of ligands associated with each domain. We also used these sets as a starting point to explore the selectivity of small molecules within and across protein families following the Pfam domain definitions. Mappings and ligand sets resulting from this study will be kept up-to-date with new ChEMBL releases and are available at http://www.ebi.ac.uk/~fkrueger/mapChEMBLPfam, along with documentation.
In practice, the mapping was carried out as follows. For all targets in the ChEMBL target dictionary, we collected activities measured in binding assays that are linked directly and unambiguously to a single target (assay type = B, multi- and complex-flags = 0). The activity type was required to be one of the following: Ki, Kd, IC50, EC50, -Log Ki, pKd, pA2, pI, pKa. We further filtered out all activities weaker than 50 μM. The remaining mappings were kept and a dictionary of validated domains was created. Multi-domain proteins were scanned for the presence of validated domains and categorized as one of the following: i) no validated domain, ii) only one validated domain (or multiple copies thereof), iii) more than one validated domain. Case i) results in no mapping, and case ii) assigns all ligands to the validated domain. In case iii), where more than one validated domain occurs in a protein, we did not assign any mapping. A summary of all co-occurrences of validated Pfam-A domains is provided in Additional file 6.
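The seeding and categorization steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the deposited implementation: the function names, the data layout and the example activities are hypothetical, all activity values are assumed to have been converted to μM beforehand, and a protein is treated as single-domain only if its architecture lists exactly one domain instance.

```python
from collections import defaultdict

ACTIVITY_CUTOFF_UM = 50.0  # activities weaker than 50 uM are discarded


def seed_domains(architectures, activities):
    """Validated ('seed') domains: domains of single-domain targets that
    have at least one qualifying ligand.

    architectures: dict mapping target accession -> list of Pfam-A domain names
    activities: iterable of (target, ligand, value_um) tuples (assumed layout)
    """
    validated = set()
    for target, _ligand, value_um in activities:
        domains = architectures.get(target, [])
        # Assumption: single-domain means exactly one domain instance.
        if value_um <= ACTIVITY_CUTOFF_UM and len(domains) == 1:
            validated.add(domains[0])
    return validated


def map_ligands(architectures, activities, validated):
    """Assign ligands to a domain when exactly one validated domain is present."""
    ligands_by_domain = defaultdict(set)
    for target, ligand, value_um in activities:
        if value_um > ACTIVITY_CUTOFF_UM:
            continue
        hits = set(architectures.get(target, [])) & validated
        if len(hits) == 1:  # case ii), which also covers single-domain targets
            ligands_by_domain[next(iter(hits))].add(ligand)
        # case i) (no validated domain) and case iii) (several) give no mapping
    return dict(ligands_by_domain)


# Hypothetical input: one single-domain kinase, Akt-3 (Q9Y243) and a weak binder.
architectures = {
    "T_SINGLE": ["Pkinase"],
    "Q9Y243": ["PH", "Pkinase", "Pkinase_C"],
    "T_WEAK": ["GAF"],
}
activities = [
    ("T_SINGLE", "L1", 0.1),
    ("Q9Y243", "L2", 1.0),
    ("T_WEAK", "L3", 100.0),
]
validated = seed_domains(architectures, activities)
mapping = map_ligands(architectures, activities, validated)
```

In this toy input, the Pkinase domain is validated by the single-domain target, so the ligand tested against Akt-3 is assigned to Pkinase rather than to the co-occurring PH or Pkinase_C domains.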
Validation was carried out against data from Uniprot as well as PDBe. Uniprot lists manually curated positions of residues that participate in ligand binding, while information about residues in close proximity to the bound ligand can be extracted from PDBe using the algorithm PDBeMotif. Binding site annotations from PDBeMotif contain explicit information about the ligand, in the form of a three-letter code, and the residue numbers of interacting residues in the target protein. We can thus assess binding within Pfam domain boundaries by comparing the position of each binding site residue with the start and end positions of a given domain. Predictions on multi-domain proteins were benchmarked by calculating the fraction of residues within a predicted domain over all residues involved in the binding of the corresponding ligand. The resulting ratio can be considered a measure of association between a predicted Pfam domain and ligand binding, with high values indicating strong association and low values indicating weak association. We argue that a value of 0.5 or greater is a robust measure of association between a Pfam-A domain and ligand binding. Accordingly, predictions benchmarked against Uniprot or PDBe were classified as correct if this ratio was equal to or greater than 0.5 and as false if this ratio was less than 0.5.
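The benchmark ratio and the 0.5 threshold can be written down in a few lines. The following is a minimal sketch with hypothetical function names and example residue numbers; it assumes that binding-site residues and domain boundaries share the same residue numbering, and that boundaries are inclusive.

```python
def binding_fraction(binding_residues, domain_start, domain_end):
    """Fraction of binding-site residues that lie within the domain boundaries."""
    residues = set(binding_residues)
    if not residues:
        raise ValueError("no binding-site residues annotated")
    inside = sum(1 for r in residues if domain_start <= r <= domain_end)
    return inside / len(residues)


def is_correct(fraction, threshold=0.5):
    """A prediction counts as correct if the ratio is >= the threshold."""
    return fraction >= threshold


# Hypothetical case: three of four binding residues fall inside a domain
# predicted to span residues 5-50.
fraction = binding_fraction([10, 20, 30, 200], 5, 50)
verdict = is_correct(fraction)
```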
In a few cases, small molecule bioactivities reported in ChEMBL are mapped to Uniprot identifiers that represent fragments of a protein. This might be due to annotation errors or to the lack of a Uniprot entry representing the full-length protein. These cases can be problematic for our mapping. As an example, some activities extracted from an article on phosphodiesterase inhibitors (PubMed 8027992) map to the Uniprot identifier Q864F1. This identifier represents an N-terminal fragment of pig phosphodiesterase 5 that contains only the GAF domain and, crucially, lacks the PDEase_I domain. Thus, small molecule binding is incorrectly mapped to the GAF domain. We identified five critical protein fragments in the ChEMBL target dictionary and removed these manually before applying our mapping algorithm. A list of these targets and the justification for their removal is provided in Additional file 7.
Statistical analysis was carried out in R unless otherwise stated. The protocol we followed to test the distributions of Pfam domain occurrences and numbers of known ligands for power-law behavior comprises four steps. In the first step, we use the R package plfit.R to determine the scaling parameter α and xmin. We then use the package powerlaw.R (http://www.rickwash.com/papers/cscw08-appendix/powerlaw.R) to calculate the goodness-of-fit and the corresponding p-value. For the maximum-likelihood calculations we use the functions pareto.lnorm.llr, pareto.exp.llr and pareto.weibull.llr. Visualizations were created using the script plplot.py. All functions except powerlaw.R were provided by Aaron Clauset and Cosma Shalizi (http://tuvalu.santafe.edu/~aaronc/powerlaws/).
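To illustrate the first step, the maximum-likelihood estimate of α for a continuous power law with a given xmin is α = 1 + n / Σ ln(x_i / xmin), taken over the n observations with x_i ≥ xmin. The sketch below implements only this closed-form estimate in Python. It is not a port of plfit.R: the function name is hypothetical, xmin is taken as given rather than estimated, and the continuous formula is used as an approximation even though the counts analysed here are discrete.

```python
import math


def powerlaw_alpha(values, xmin):
    """Continuous maximum-likelihood estimate of the power-law exponent.

    Returns (alpha, n_tail), where n_tail is the number of observations
    with value >= xmin that entered the estimate.
    """
    tail = [x for x in values if x >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum == 0:
        raise ValueError("all tail observations equal xmin; alpha is undefined")
    return 1.0 + len(tail) / log_sum, len(tail)


# Toy data chosen so that the logarithms sum to 2 over three tail observations.
alpha, n_tail = powerlaw_alpha([0.5, 1.0, math.e, math.e], xmin=1.0)
```

For this toy input, three observations lie at or above xmin and their log-ratios sum to 2, so the estimate is 1 + 3/2 = 2.5.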
We selected ligands from mappings for 6 Pfam domains and retrieved pre-calculated descriptor values from the compound_properties table within the ChEMBL database. To prepare the data for scaling to unit variance, we excluded as outliers the first and hundredth percentile of each descriptor value distribution (see Additional file 5). Scaling to unit variance and principal component analysis was carried out using the R function prcomp.
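The preprocessing applied to each descriptor before the principal component analysis can be sketched as follows. This is an illustration under assumptions rather than the procedure used in the study: the function name is hypothetical, the percentile cut-offs are taken by rank, each descriptor is handled independently, and the sample standard deviation is used for scaling. The principal component analysis itself is not shown.

```python
import statistics


def trim_and_scale(values, tail_fraction=0.01):
    """Drop the lowest and highest tail_fraction of values, then center the
    remainder and scale it to unit variance."""
    ordered = sorted(values)
    k = int(len(ordered) * tail_fraction)  # number of values cut at each end
    low, high = ordered[k], ordered[-k - 1]
    kept = [v for v in values if low <= v <= high]  # original order is preserved
    mean = statistics.mean(kept)
    sd = statistics.stdev(kept)  # sample standard deviation (n - 1 denominator)
    return [(v - mean) / sd for v in kept]


# Hypothetical descriptor with 200 values: two values are cut at each end.
scaled = trim_and_scale(list(range(1, 201)))
```

In a full workflow, a compound that is excluded as an outlier for one descriptor would presumably have to be removed from all descriptors before the analysis; the sketch does not handle that bookkeeping.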
The workflow for this study was implemented in Python and R. The code is deposited at https://github.com/fak/mapChEMBLPfam. Pfam domain annotations and estimated domain boundaries for all protein entries were retrieved from http://pfam.sanger.ac.uk/protein/X?output=xml, where X is the Uniprot accession of a query protein. The corresponding function can be found as getPfamDomains.py in the code repository. Binding site annotations from Uniprot were retrieved from http://www.uniprot.org/uniprot/X.xml, where X is the Uniprot accession of a ChEMBL target. Residues in close proximity to the bound ligand were retrieved from PDBeMotif using a query submitted to http://www.ebi.ac.uk/pdbe-site/pdbemotif/hitlist.xml. The corresponding deposited functions are called queryUniprot.py and queryPDB.py, respectively. We used SIFTS to translate between PDBe and Uniprot residue coordinates. Protein-coding genes in the human genome were extracted from Ensembl using Ensembl BioMart with the deposited function queryBioMaRt.R.
We thank Samuel Croset (EMBL-EBI) for exploratory work on validation of the mapping described in this article. We thank Saqib Mir (EMBL-EBI) for help with designing the PDBeMotif XML queries. The work was supported by funding from the EMBL Member Nations; FAK is a member of Fitzwilliam College, University of Cambridge.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 17, 2012: Eleventh International Conference on Bioinformatics (InCoB2012): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S17.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.